Performance degradation on NA2 production tenants
Incident Report for Zuora
Postmortem

DATE(S):

2022/08/04 09:35 AM PDT - 2022/08/04 11:10 AM PDT
2022/08/04 12:00 PM PDT - 2022/08/04 03:30 PM PDT

SUMMARY AND IMPACT:

A subset of Zuora Billing customers in the NA2 production environment experienced intermittent performance degradation. The degradation manifested as slower-than-normal response times and/or 504 timeouts on Billing API, UI, and integration calls. Timeouts predominantly occurred on Billing SOAP API calls. Billing batch operations such as bill runs, journal runs, and payment runs were not impacted.

ROOT CAUSE: 

Zuora detected that an underlying caching data store used by certain Billing services had reached its maximum resource usage. This slowed response times and caused the intermittent performance degradation.

The root cause for the resource exhaustion was a recent change introduced to the Billing application.

During the incident, neither our normal auto-scaling methods nor rolling restarts remediated the issue. The issue was fixed by optimizing the lookup calls to the cache.

RESOLUTION:

The impact was mitigated through the following actions:

  1. Implementing a fix to remove the suboptimal data access pattern.
  2. Disabling access to certain query types so that they can never be used in production systems.
  3. No changes were made to our alerting, since the alerts fired as designed; we are adjusting our trust update cadence so that updates are more frequent during such incidents.

Additional system level checks were completed to ensure that performance returned to optimal baseline levels.

FUTURE PREVENTATIVE MEASURES:

  • Zuora to perform a thorough audit to identify potential suboptimal data access patterns against all similar caching data stores and evaluate them for optimization.
Posted Aug 15, 2022 - 15:15 PDT

Resolved
This incident has been resolved.
Posted Aug 04, 2022 - 21:32 PDT
Monitoring
Our Engineering teams have applied a fix to address this issue and will continue to monitor.
Posted Aug 04, 2022 - 17:51 PDT
Update
Our Engineering teams continue to observe improving API performance and stability.
Posted Aug 04, 2022 - 17:13 PDT
Identified
We have identified the source of this issue and are observing some improvement in performance following some internal changes. We will continue to work on resolving this issue fully.
Posted Aug 04, 2022 - 16:06 PDT
Update
Our Engineering teams continue to investigate this issue with the highest priority.
Posted Aug 04, 2022 - 14:33 PDT
Investigating
We are reviewing a recurrence of this issue since 12:08 PM PT and are continuing our investigation.
Posted Aug 04, 2022 - 12:22 PDT
Monitoring
Our Engineering team has applied a solution to address this issue. We have observed a return to normal performance since 11:07 AM PT and will continue to monitor moving forward.
Posted Aug 04, 2022 - 11:42 PDT
Investigating
Our Engineers are investigating a performance issue in our Americas Cloud 2 (NA2) Production center.
Posted Aug 04, 2022 - 10:47 PDT
This incident affected: AMERICAS - CLOUD 2 (NA2) - www|rest.zuora.com (Production UI, Production API).