Zuora production service disruption
Incident Report for Zuora
Postmortem

Timeframe: 9:18am - 9:38am PST, May 19, 2017

Affected Systems: Approximately 80% of Production Traffic (SOAP and UI Operations)

Symptoms: Zuora API and UI failures

Root Cause: Rapid thread exhaustion on the front-end application tier due to a metadata lock on one of the underlying databases.

Contributing Factors: The lock was caused by a database backup operation. Our DB backups run on standby database replicas, and the backup system uses a discovery service to identify which node holds the standby role in the cluster. Approximately 12 hours before the service event, a DB node failover occurred. The failover did not correctly register the new standby in the discovery service, so the backup started on the active master node.
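
As an illustration of the failure mode above, the following is a minimal sketch of a pre-backup guard that asks the target node for its role directly instead of trusting the discovery record alone. It is not Zuora's implementation: `Node`, `BackupRefused`, `choose_backup_target`, and the `probe_role` callback are hypothetical names introduced for this example, and the role strings "master" and "standby" are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Node:
    """A database node as reported by the (hypothetical) discovery service."""
    name: str


class BackupRefused(Exception):
    """Raised when the chosen node is not actually a standby."""


def choose_backup_target(discovered: Node, probe_role: Callable[[Node], str]) -> Node:
    """Return `discovered` only if the node itself confirms it is a standby.

    `probe_role` is a hypothetical callback that asks the node directly for
    its current role and returns "master" or "standby".
    """
    actual_role = probe_role(discovered)
    if actual_role != "standby":
        # Discovery data is stale (e.g. after a failover): fail closed
        # instead of starting a backup on the active master.
        raise BackupRefused(
            f"{discovered.name} reports role {actual_role!r}; refusing to back up"
        )
    return discovered
```

With a guard of this kind, a stale discovery entry would cause the backup job to fail with an error rather than run against the node serving production traffic.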

Resolution: The database backup was terminated and the front-end tier was recycled to release the blocking resources.

Future Preventative Measures:
- Improvements to database monitoring and alerting
- Improvements to the DB backup and discovery service subsystems
- Architectural modifications to contain the blast radius of thread exhaustion to a specific functionality
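
One common way to contain thread exhaustion to a single functionality is the bulkhead pattern: each functional area gets its own bounded pool of concurrency slots, so requests stuck behind a database lock in one area cannot consume the threads of another. The sketch below is only an illustration under that assumption and does not describe the actual architectural change. `Bulkhead`, `BulkheadFull`, and the slot counts are hypothetical.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a functional area has no free concurrency slots."""


class Bulkhead:
    """Caps the number of requests one functional area may run at once."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so callers fail fast rather than piling up on a stuck backend.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name}: all slots in use")
        try:
            yield
        finally:
            self._slots.release()
```

A request handler would wrap its work in `with some_bulkhead.slot():`. When one area is saturated, its callers receive `BulkheadFull` right away while other areas keep their own capacity.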

Posted May 23, 2017 - 12:40 PDT

Resolved
Production services remain stable as of 9:38am PST and the teams have put mitigations in place. We will provide an RCA once it's available.
Posted May 19, 2017 - 13:49 PDT
Update
We've engaged our Engineering Teams and are working on identifying the root cause. We will keep you posted on any progress.
Posted May 19, 2017 - 12:46 PDT
Update
Service was restored around 9:38am PST, and we are observing stable operations. We are still investigating root cause.
Posted May 19, 2017 - 10:01 PDT
Investigating
We are investigating a production service disruption that began at 9:18 am PST and have all hands on deck to restore service.
Posted May 19, 2017 - 09:26 PDT
This incident affected: AMERICAS - CLOUD 2 (NA2) - www|rest.zuora.com (Production UI, Production API, Sandbox Batch Operations).