TIMEFRAME:
08/07/19 7:48 AM PST - 08/07/19 9:42 AM PST
SUMMARY:
On Aug 7th 7:48 AM PST, authorized and approved cleanup work on legacy monitoring applications was initiated in a non customer facing production environment. During the cleanup, an unexpected complication arose with a package manager that caused impairment of command and control functions. This propagated to customer facing environments and affected both the primary and backup ecosystems. Normal recovery procedures were severely impeded due to items deleted that were needed for recovery.
ROOT CAUSE
Loss of items in command and control caused the environments to become unavailable in an unexpected and unintended way.
RESOLUTION
Service was restored by first rebuilding online internal components that were needed to recover production systems. Once this was done, both clusters were restarted by recreating each of the connected nodes and recreating each cluster’s quorum. As soon as these clusters were restarted and the docker registry was online, the application servers immediately became available.
FUTURE PREVENTATIVE MEASURES: