Time-frames:
3/23/17 10:09 PM PST to 3/23/17 11:29:03 PM PST
3/23/17 11:36 PM PST to 3/23/17 11:52 PM PST
Affected Systems: Production and Sandbox environments
Symptoms: Failed connections to Production and Sandbox UI and API
Root Cause:
Incident 1: Malfunction and misconfiguration of the Spanning Tree protocol (RST version) on one of the network devices during routine maintenance to restore redundancy.
Incident 2: Thread exhaustion at the front end API/UI tier due to unavailability of async queueing service.
Details: At 6:00 PM PST on 3/23/17 we began routine maintenance to restore critical redundancy in our core switching infrastructure by replacing a failed device. The maintenance was expected to have no impact.
Between 6:00 PM PST and 10:00 PM PST we conducted investigatory and preparatory work as part of our maintenance.
At 10:00 PM PST our datacenter vendor started an unrelated no-impact maintenance to update route policies. Similar maintenance had caused the Feb 2/Feb 3 incidents.
At 10:02 PM PST as part of our maintenance we made a configuration change on the replacement hardware with the goal of moving traffic away from it. Prior to committing the configuration, our team had confirmed that both devices participating in the change had loop prevention configured (Spanning Tree-RST).
At 10:10 PM PST we observed alerts indicating connection errors from external monitoring. The alert pattern from our global reachability monitoring solution matched the carrier routing loop issue exhibited during the Feb 2/Feb 3 incidents.
Between 10:12 PM PST and 10:21 PM PST we established a call with our datacenter vendor to discuss their maintenance. The vendor agreed to begin rolling back their change. In parallel, we attempted to make routing changes to divert traffic away from the routing paths we believed were impacted. Switching to a backup peering did not provide the desired outcome.
At 10:30 PM PST the rollback of the vendor's changes was underway and we opened additional paths of investigation.
Between 10:30 PM PST and 11:10 PM PST our investigation identified a Spanning Tree loop within our network. At 11:13 PM PST we isolated the source of the loop and normal network operation was restored.
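For intuition, the damage a Layer 2 loop causes can be sketched with a toy flooding simulation (the three-switch topology below is invented, not our actual network): switches flood broadcast frames out every port except the one the frame arrived on, and Ethernet frames carry no TTL, so with a loop present copies circulate indefinitely instead of dying out, which is exactly what Spanning Tree exists to prevent.

```python
# Toy illustration only: a triangle of switches (has a loop) versus the
# loop-free spanning tree of the same switches.
loop = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
tree = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

def flood(topology, rounds):
    """Count in-flight broadcast frame copies after `rounds` of flooding."""
    frames = [("A", None)]  # (current switch, port the frame came in on)
    for _ in range(rounds):
        nxt = []
        for switch, came_from in frames:
            for peer in topology[switch]:
                if peer != came_from:  # flood all ports except ingress
                    nxt.append((peer, switch))
        frames = nxt
    return len(frames)

print(flood(tree, 5))  # -> 0: the broadcast dies out at the leaves
print(flood(loop, 5))  # -> 2: copies circulate forever
```

With switches that have more than two ports participating in the loop, the copy count grows every round rather than merely persisting, saturating links and CPUs, which is consistent with the total loss of reachability we observed.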
Once network services were restored and the application began receiving traffic, our teams observed that some of the internal messaging services were hanging in a non-deterministic state. The inability of the UI/API front ends to successfully queue async jobs led to thread exhaustion on the front end tier, rendering the application unavailable and causing the secondary impact.
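The thread-exhaustion mechanism above can be sketched in a few lines (a hypothetical model, not our actual stack: the queueing service is stood in for by a full bounded queue that nothing drains, and all names and sizes are invented):

```python
import queue
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Model the unavailable queueing service as a full queue with no consumer.
broker = queue.Queue(maxsize=1)
broker.put("stuck")

pool = ThreadPoolExecutor(max_workers=2)  # fixed front-end thread pool

def handle_request(job):
    broker.put(job)  # no timeout: the thread blocks indefinitely
    return "ok"

# Two requests tie up every worker thread in a blocked put().
f1 = pool.submit(handle_request, "job-1")
f2 = pool.submit(handle_request, "job-2")

# A third, otherwise healthy request now cannot be served at all.
f3 = pool.submit(lambda: "healthy")
try:
    f3.result(timeout=0.5)
    exhausted = False
except FutureTimeout:
    exhausted = True  # pool exhausted: this is the secondary impact

print("front end exhausted:", exhausted)

# Cleanup so the sketch exits: drain the broker to unblock the threads
# (in the incident, restoration required restarting both tiers).
broker.get()
broker.get()
pool.shutdown(wait=True)
```

Every front-end thread that blocks on an enqueue is lost to request serving, so even requests with no async work fail once the pool is drained.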
At 11:52 PM PST we completed sequential restarts of the backend queueing and front end components, which restored services completely.
Preventative Measures: In our post-incident review process we identified several key areas for immediate focus:
Network infrastructure audit and changes to engineer out all the devices with the same potential RST defect
Process improvements to ensure more rapid consideration of alternative causes and contributors to an issue
Monitoring improvements to eliminate or reduce dependency on core infrastructure (maintain better visibility during infrastructure events)
Application enhancements to allow front end components to continue to take synchronous traffic when asynchronous queueing is unavailable