Time-frames:
3/23/17 10:09 PM PST to 3/23/17 11:29:03 PM PST
3/23/17 11:36 PM PST to 3/23/17 11:52 PM PST
Affected Systems: Production and Sandbox environments
Symptoms: Failed connections to Production and Sandbox UI and API
Root Cause:
Incident 1: Malfunction and misconfiguration of the Spanning Tree protocol (RST version) on one of the network devices during routine maintenance to restore redundancy.
Incident 2: Thread exhaustion at the front end API/UI tier due to unavailability of async queueing service.
Details: At 6:00 PM PST on 3/23/17 we began routine maintenance to restore critical redundancy in our core switching infrastructure by replacing a failed device. The maintenance was expected to have no impact.
Between 6:00 PM PST and 10:00 PM PST we conducted investigatory and preparatory work as part of our maintenance.
At 10:00 PM PST our datacenter vendor started an unrelated no-impact maintenance to update route policies. Similar maintenance had caused the Feb 2/Feb 3 incidents.
At 10:02 PM PST as part of our maintenance we made a configuration change on the replacement hardware with the goal of moving traffic away from it. Prior to committing the configuration, our team had confirmed that both devices participating in the change had loop prevention configured (Spanning Tree-RST).
At 10:10 PM PST we observed alerts indicating connection errors from external monitoring. The alert pattern from our global reachability monitoring solution matched the carrier routing loop issue exhibited during the Feb 2/Feb 3 incidents.
Between 10:12 PM PST and 10:21 PM PST we established a call with our datacenter vendor to discuss their maintenance. The vendor agreed to begin rolling back their change. In parallel, we attempted to make routing changes to divert traffic away from the routing paths we believed were impacted. Switching to a backup peering did not provide the desired outcome.
At 10:30 PM PST the rollback of the vendor's changes was underway and we opened additional paths of investigation.
Between 10:30 PM PST and 11:10 PM PST our investigation identified a Spanning Tree loop within our network. At 11:13 PM PST we isolated the source of the loop and normal network operation was restored.
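For intuition, the damage a Layer 2 loop causes can be sketched with a toy flooding simulation (the three-switch topology below is invented, not our actual network): switches flood broadcast frames out every port except the one the frame arrived on, and Ethernet frames carry no TTL, so with a loop present copies circulate indefinitely instead of dying out, which is exactly what Spanning Tree exists to prevent.

```python
# Toy illustration only: a triangle of switches (has a loop) versus the
# loop-free spanning tree of the same switches.
loop = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
tree = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

def flood(topology, rounds):
    """Count in-flight broadcast frame copies after `rounds` of flooding."""
    frames = [("A", None)]  # (current switch, port the frame came in on)
    for _ in range(rounds):
        nxt = []
        for switch, came_from in frames:
            for peer in topology[switch]:
                if peer != came_from:  # flood all ports except ingress
                    nxt.append((peer, switch))
        frames = nxt
    return len(frames)

print(flood(tree, 5))  # -> 0: the broadcast dies out at the leaves
print(flood(loop, 5))  # -> 2: copies circulate forever
```

With switches that have more than two ports participating in the loop, the copy count grows every round rather than merely persisting, saturating links and CPUs, which is consistent with the total loss of reachability we observed.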
Once network services were restored and the application began receiving traffic, our teams observed that some of the internal messaging services were hanging in a non-deterministic state. The inability of the UI/API front ends to successfully queue async jobs led to thread exhaustion on the front end tier, rendering the application unavailable and causing the secondary impact.
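The thread-exhaustion mechanism above can be sketched in a few lines (a hypothetical model, not our actual stack: the queueing service is stood in for by a full bounded queue that nothing drains, and all names and sizes are invented):

```python
import queue
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Model the unavailable queueing service as a full queue with no consumer.
broker = queue.Queue(maxsize=1)
broker.put("stuck")

pool = ThreadPoolExecutor(max_workers=2)  # fixed front-end thread pool

def handle_request(job):
    broker.put(job)  # no timeout: the thread blocks indefinitely
    return "ok"

# Two requests tie up every worker thread in a blocked put().
f1 = pool.submit(handle_request, "job-1")
f2 = pool.submit(handle_request, "job-2")

# A third, otherwise healthy request now cannot be served at all.
f3 = pool.submit(lambda: "healthy")
try:
    f3.result(timeout=0.5)
    exhausted = False
except FutureTimeout:
    exhausted = True  # pool exhausted: this is the secondary impact

print("front end exhausted:", exhausted)

# Cleanup so the sketch exits: drain the broker to unblock the threads
# (in the incident, restoration required restarting both tiers).
broker.get()
broker.get()
pool.shutdown(wait=True)
```

Every front-end thread that blocks on an enqueue is lost to request serving, so even requests with no async work fail once the pool is drained.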
At 11:52 PM PST we completed sequential restarts of the backend queueing and front end components, which restored services completely.
Preventative Measures: In our post-incident review process we identified several key areas for immediate focus:
Network infrastructure audit and changes to engineer out all the devices with the same potential RST defect
Process improvements to ensure more rapid consideration of alternative causes and contributors to an issue
Monitoring improvements to eliminate or reduce dependency on core infrastructure (maintain better visibility during infrastructure events)
Application enhancements to allow front end components to continue to take synchronous traffic when asynchronous queueing is unavailable