TIMEFRAME:
Primary impact timeline:
2019-10-01 09:30 AM PT to 02:54 PM PT
Higher intermittent error rates were observed.
Also noted: our investigation also uncovered prior lower order-of-magnitude levels of error rates were found following Zuora Billing release on 9/26
SUMMARY:
During the primary impact timeline above, Zuora’s Billing APIs experienced increased authentication errors for a subset of our customers in our North America - Data Center - Production and Sandbox environments.
IMPACT:
Impacted customers observed elevated HTTP 401 response (Authentication) errors for a fraction of overall API calls made to Zuora Billing service
ROOT CAUSE:
As part of an earlier scheduled Zuora Billing release, a sub-optimal caching approach was introduced that resulted in a performance issue against the systems running our Authentication service. This resulted in higher than expected system resource utilization which, in turn, triggered intermittent timeouts to our authentication calls.
RESOLUTION
We identified the reason for the underlying issue as a missing database index which was added thereby restoring complete service availability. Simultaneously, we also fixed an incorrect configuration in our caching layer and pushed out an emergency patch fix release that restored full functionality as well.
FUTURE PREVENTATIVE MEASURES:
We will be implementing the following: