Increased authentication (401) errors returned from Billing API services
Incident Report for Zuora
Postmortem

TIMEFRAME:
Primary impact timeline:
2019-10-01 09:30 AM PT to 02:54 PM PT
Higher intermittent error rates were observed.

Also noted: our investigation found that errors had also been occurring, at rates a lower order of magnitude, since the Zuora Billing release on 9/26.

SUMMARY:
During the primary impact timeline above, Zuora’s Billing APIs experienced increased authentication errors for a subset of our customers in our North America - Data Center - Production and Sandbox environments.

IMPACT:
Impacted customers observed elevated HTTP 401 (Authentication) error responses for a fraction of the API calls made to the Zuora Billing service.

ROOT CAUSE:
An earlier scheduled Zuora Billing release introduced a sub-optimal caching approach that caused a performance issue on the systems running our Authentication service. This led to higher-than-expected system resource utilization which, in turn, caused intermittent timeouts in our authentication calls.

RESOLUTION:
We traced the underlying issue to a missing database index; adding the index restored complete service availability. At the same time, we fixed an incorrect configuration in our caching layer and released an emergency patch, which also restored full functionality.
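
For readers unfamiliar with how a missing index affects lookups, the following is a minimal, generic sketch using Python's built-in sqlite3 module. It is not Zuora's schema or database: the table auth_tokens, its columns, and the index name are hypothetical and chosen only for illustration. It shows the query planner switching from a full table scan to an index search once the index exists.

```python
import sqlite3

# Hypothetical schema for illustration only; not the actual Zuora database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_tokens (id INTEGER PRIMARY KEY, token TEXT, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO auth_tokens (token, user_id) VALUES (?, ?)",
    [(f"tok-{i}", i) for i in range(1000)],
)


def plan_for_lookup(connection, token):
    """Return the planner's description of how a token lookup would run."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT user_id FROM auth_tokens WHERE token = ?",
        (token,),
    ).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " ".join(row[3] for row in rows)


# Without an index, every lookup scans the whole table.
plan_before = plan_for_lookup(conn, "tok-500")

# Adding the index lets the same lookup use an index search instead.
conn.execute("CREATE INDEX idx_auth_tokens_token ON auth_tokens (token)")
plan_after = plan_for_lookup(conn, "tok-500")

print("before:", plan_before)
print("after: ", plan_after)
```

Under load, the per-request cost of a full scan grows with table size, which is consistent with the resource utilization and timeouts described in the root cause above.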

FUTURE PREVENTATIVE MEASURES:

We will be implementing the following:

  • Improved testing and validation of Zuora Billing releases in pre-production environments.
  • Enhanced detection and alerting for caching and related database errors, based on logs and application-level metrics.
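
As an illustration of the second measure, here is a minimal sketch of alerting on an application-level metric: the share of HTTP 401 responses in a sliding window of recent requests. The class name ErrorRateMonitor, the window size, and the threshold are assumptions made for this example and do not describe Zuora's actual monitoring.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the fraction of HTTP 401 responses over the most recent requests.

    Hypothetical helper for illustration; window size and threshold are
    arbitrary example values, not production settings.
    """

    def __init__(self, window_size=100, threshold=0.05):
        # deque with maxlen discards the oldest status once the window is full.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        """Record the HTTP status code of one completed request."""
        self.statuses.append(status_code)

    def rate_401(self):
        """Return the fraction of requests in the window that returned 401."""
        if not self.statuses:
            return 0.0
        return sum(1 for s in self.statuses if s == 401) / len(self.statuses)

    def should_alert(self):
        """Return True when the 401 rate in the window exceeds the threshold."""
        return self.rate_401() > self.threshold


monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for status in [200] * 9 + [401]:
    monitor.record(status)
quiet_rate = monitor.rate_401()
quiet_alert = monitor.should_alert()

# Three more 401s push the windowed rate above the threshold.
for status in [401] * 3:
    monitor.record(status)
noisy_rate = monitor.rate_401()
noisy_alert = monitor.should_alert()
```

A windowed rate of this kind could also surface lower-level error increases, such as those noted after the 9/26 release, provided the threshold is set low enough.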
Posted Oct 05, 2019 - 10:06 PDT

Resolved
This incident has been resolved.
Posted Oct 02, 2019 - 04:59 PDT
Monitoring
We are continuing to monitor to ensure that the fix has fully addressed this issue.
So far we have not seen the issue recur since the fix was applied.
Posted Oct 01, 2019 - 23:52 PDT
Update
The current fix appears to have addressed the issue. We are continuing to monitor to ensure that it has been fully resolved.
Posted Oct 01, 2019 - 15:44 PDT
Update
Engineering is testing a fix, and we will provide another update once we have completed our validation tests.
Posted Oct 01, 2019 - 14:21 PDT
Update
We are still investigating this issue.
Posted Oct 01, 2019 - 13:24 PDT
Investigating
We are seeing an increase in authentication (401) errors returned from Billing APIs. We are currently investigating the cause.
Posted Oct 01, 2019 - 11:29 PDT
This incident affected: AMERICAS - CLOUD 2 (NA2) - www|rest.zuora.com (Production API, Sandbox API).