Timeouts impacting Hosted Production environment via UI and API
Incident Report for Zuora
Postmortem

TIMEFRAME:
24 Aug 2020 04:56 AM PDT to 24 Aug 2020 07:10 AM PDT

SUMMARY AND IMPACT:
During the timeframe above, some customers in our US hosted data center production environment experienced intermittent HTTP 500 errors, increasing in frequency, on Zuora Billing API and UI transactions.

ROOT CAUSE:
A code change to a customer’s integration platform caused API traffic to exceed expected rates by a factor of 100. This caused significant congestion between our Load Balancing connection pool and our Billing Application processing, resulting in some traffic failing to obtain a connection to a processing resource.
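
To illustrate why a 100x traffic increase can exhaust a connection pool, here is a rough sketch using Little's Law (average in-flight requests = arrival rate x average time per request). The pool size, baseline request rate, and latency are hypothetical values chosen for illustration only; they are not figures from this incident.

```python
def required_connections(requests_per_second: float, mean_latency_seconds: float) -> float:
    """Little's Law: average in-flight requests = arrival rate * time in system."""
    return requests_per_second * mean_latency_seconds


# Hypothetical values for illustration; none of these come from the incident.
POOL_SIZE = 200          # assumed number of connections in the pool
BASELINE_RPS = 20.0      # assumed normal request rate for one integration
MEAN_LATENCY_S = 0.2     # assumed average time a request holds a connection

normal = required_connections(BASELINE_RPS, MEAN_LATENCY_S)
surge = required_connections(BASELINE_RPS * 100, MEAN_LATENCY_S)

print(f"normal load needs about {normal:.0f} of {POOL_SIZE} connections")
print(f"100x load needs about {surge:.0f} of {POOL_SIZE} connections")
print("pool exhausted" if surge > POOL_SIZE else "pool has headroom")
```

Under these assumptions the surge alone would need more connections than the pool holds, so other requests would wait or fail to obtain a connection. In practice latency usually rises under congestion, which would make the shortfall larger.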

RESOLUTION:
Source of increased Customer traffic was blocked
Customer rolled back code causing this issue

FUTURE PREVENTATIVE MEASURES:

  • Optimize and increase capacity in Load Balancer TCP connection pool
  • Review improvements in our concurrent request handling
  • Improve monitoring and alerting for similar scenarios
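
As one possible shape for the monitoring item above, the following is a minimal sketch of a check that flags API clients whose request count in the current window far exceeds their baseline. It is not Zuora's implementation. The function name, the client identifiers, the 10x threshold, and the minimum-volume floor are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def clients_over_baseline(
    recent_client_ids: Iterable[str],
    baseline_counts: Dict[str, int],
    factor: float = 10.0,      # assumed alert threshold (multiple of baseline)
    min_requests: int = 100,   # assumed floor so low-volume clients are ignored
) -> List[str]:
    """Return clients whose request count in the window exceeds factor * baseline."""
    current = Counter(recent_client_ids)
    flagged = []
    for client_id, count in current.items():
        # Treat unknown clients as having a baseline of 1 to avoid multiplying by 0.
        baseline = max(baseline_counts.get(client_id, 0), 1)
        if count >= min_requests and count > factor * baseline:
            flagged.append(client_id)
    return sorted(flagged)


# Hypothetical window: one request per entry, keyed by client identifier.
window = ["tenant-a"] * 50 + ["tenant-b"] * 5000
baseline = {"tenant-a": 45, "tenant-b": 50}

print(clients_over_baseline(window, baseline))
```

A check like this could feed an alert or an automatic throttle. The right threshold and window length would depend on real traffic patterns.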
Posted Aug 25, 2020 - 16:43 PDT

Resolved
This incident and its impact have been resolved.
Posted Aug 24, 2020 - 11:00 PDT
Monitoring
We are no longer seeing errors. We continue to closely monitor the state of the service following the mitigation measures that we implemented earlier.
Posted Aug 24, 2020 - 09:44 PDT
Update
We are continuing to monitor following the mitigation measures which have been implemented.
Posted Aug 24, 2020 - 09:03 PDT
Update
We are continuing to monitor following the mitigation measures which have been implemented.
Posted Aug 24, 2020 - 08:17 PDT
Update
We are continuing to monitor following the mitigation measures which have been implemented.
Posted Aug 24, 2020 - 07:58 PDT
Update
We are continuing to monitor following the mitigation measures which have been implemented.
Posted Aug 24, 2020 - 07:43 PDT
Identified
The issue has been identified and we have implemented mitigation measures. We are continuing to monitor.
Posted Aug 24, 2020 - 07:25 PDT
Update
We are rolling out mitigation measures. We continue to investigate for now.
Posted Aug 24, 2020 - 07:12 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2020 - 07:05 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2020 - 06:48 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2020 - 06:32 PDT
Update
Our internal teams continue to investigate this issue and are working on mitigation measures with the highest priority.
Posted Aug 24, 2020 - 06:15 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2020 - 06:02 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2020 - 05:46 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2020 - 05:32 PDT
Investigating
We are investigating, with the highest priority, an issue with logging in to Hosted Production environments.
Posted Aug 24, 2020 - 05:09 PDT
This incident affected: AMERICAS - CLOUD 2 (NA2) - www|rest.zuora.com (Production UI, Production API, Production Integrations, Production Batch Operations, Production Analytics).