Timeframe: 9:18am - 9:38am PST, May 19, 2017
Affected Systems:
Approximately 80% of Production Traffic (SOAP and UI Operations)
Symptoms:
Zuora API and UI failures
Root Cause:
Rapid thread exhaustion on the front-end application tier due to a metadata lock on one of the underlying databases.
Contributing Factors:
The lock was caused by a database backup operation. Our DB backups are run on standby database replicas, and the backup system uses a discovery service to identify which node holds the standby role in the cluster. Approximately 12 hours prior to the service event, a DB node failover occurred. The failover did not register the new standby position correctly in the discovery service. As a result, the backup was started on the active master node instead of a standby.
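
The failure mode above, where a stale discovery entry points the backup at the active master, can be guarded against by asking the node itself for its role before a backup starts. The following is a minimal sketch of that idea, not a description of the actual backup system: `confirm_standby`, `BackupTargetError`, and the `query_live_role` callable are hypothetical names, and how the live role is actually read from a node depends on the database in use.

```python
from typing import Callable


class BackupTargetError(RuntimeError):
    """Raised when a backup target cannot be confirmed as a standby."""


def confirm_standby(
    node: str,
    discovered_role: str,
    query_live_role: Callable[[str], str],
) -> str:
    """Return `node` only if discovery and the node itself agree it is a standby.

    `query_live_role` is a hypothetical callable that asks the node directly
    for its current role ("master" or "standby") instead of trusting the
    discovery service alone.
    """
    live_role = query_live_role(node)
    if discovered_role != "standby" or live_role != "standby":
        # Refuse to start the backup rather than risk locking the master.
        raise BackupTargetError(
            f"{node}: discovery says {discovered_role!r}, "
            f"node reports {live_role!r}; backup not started"
        )
    return node
```

With a check like this, a discovery record left stale by a failover would cause the backup to abort with an error that can be alerted on, rather than run against the master.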
Resolution:
The database backup was terminated and the front-end tier was recycled to release the blocked resources.
Future Preventative Measures:
- Improvements to database monitoring and alerting
- Improvements to the DB backup and discovery service subsystems
- Architectural modifications to contain the blast radius of thread exhaustion to a specific functionality
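
One common way to contain thread exhaustion to a single functionality is the bulkhead pattern: each functionality gets its own bounded pool of concurrent slots, so a stalled dependency can exhaust only its own pool. The sketch below illustrates the pattern in general terms and is an assumption about what such a modification could look like, not the design that was adopted; `Bulkhead`, `BulkheadFullError`, and the pool names are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class BulkheadFullError(RuntimeError):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    """Caps the number of concurrent requests for one functionality."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Fail fast instead of queueing, so blocked requests do not pile up
        # and consume threads needed by other functionality.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical usage: separate limits for separate kinds of traffic.
soap_bulkhead = Bulkhead("soap", max_concurrent=50)
ui_bulkhead = Bulkhead("ui", max_concurrent=50)
```

A request handler would wrap its work in `with soap_bulkhead.slot():`. If requests for one functionality block on a locked database, only that bulkhead fills up and starts rejecting requests, while the others keep serving.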