Intermittent connectivity issue on EU prod and SBX environments for both Zuora billing and Revenue application
Incident Report for Zuora
Postmortem

TIMEFRAME:
2021/06/10 01:18 PM PDT - 2021/06/10 04:26 PM PDT

SUMMARY AND IMPACT:
During the above timeframe, customers in our Europe Cloud environment experienced intermittent connectivity issues and/or slower-than-normal performance when accessing the Zuora Billing UI and APIs.  Within this region, we use three availability zones for our load balancing and traffic routing and distribution systems.  Because the issue affected only one of the three availability zones used by Zuora, approximately one third of traffic was exposed to it, and only a small fraction of that third experienced any direct impact.

This event triggered multiple alerts from our infrastructure monitoring in near real-time as the event started, and continued to alert intermittently until the issue mitigation was deployed.

ROOT CAUSE:
Our cloud service provider experienced a temperature management issue within one of the three availability zones used by Zuora Billing services in that region.  This in turn caused a small subset of cloud instances in the affected availability zone (instances that were part of the top-level traffic distribution layer) to become unavailable or less performant.

RESOLUTION:
Mitigation:
Zuora engineers reconfigured the top-level distribution layer to stop routing traffic to the impacted availability zone, which restored normal operations.  Timeline for removal: 3:39 PM PDT for EU Sandbox, and 3:43 PM PDT for EU Production.  Following this action, we observed improved and normalized operations for all impacted environments.
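
As a rough illustration of this kind of mitigation (not Zuora's actual tooling), the sketch below models a traffic distribution layer as a set of active availability zones and shows how taking one zone out of rotation shifts its share of traffic onto the remaining zones. The ZoneRouter class, its methods, the zone names, and the assumption of evenly balanced traffic are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ZoneRouter:
    """Hypothetical model of a top-level traffic distribution layer."""

    zones: set[str] = field(default_factory=set)

    def remove_zone(self, zone: str) -> None:
        """Stop routing traffic to the given zone."""
        if zone not in self.zones:
            raise KeyError(f"unknown zone: {zone}")
        # Refuse to remove the last zone: that would drop all traffic.
        if len(self.zones) == 1:
            raise RuntimeError("cannot remove the only remaining zone")
        self.zones.remove(zone)

    def restore_zone(self, zone: str) -> None:
        """Return a previously removed zone to rotation."""
        self.zones.add(zone)

    def traffic_share(self) -> dict[str, float]:
        """Share of traffic per active zone, assuming an even split."""
        share = 1.0 / len(self.zones)
        return {zone: share for zone in sorted(self.zones)}


# Illustrative zone names only; these are not real identifiers.
router = ZoneRouter({"zone-a", "zone-b", "zone-c"})
router.remove_zone("zone-b")     # take the impacted zone out of rotation
shares = router.traffic_share()  # remaining zones now split traffic evenly
```

In this simplified model, each of the two remaining zones carries half of the traffic until the removed zone is restored.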

Resolution:
Our cloud service provider addressed the issue, normalized performance, and confirmed restoration of their systems by 4:33 PM PDT.  Following confirmation, Zuora restored the affected availability zone to EU Sandbox systems for continued observation.  After observing continued stability of the affected systems for another 12 hours, we restored it to the EU Production environments as well.

FUTURE PREVENTATIVE MEASURES:

  • Refine the procedures and automation for removing an availability zone from Zuora services when that zone is affected by performance issues.
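
One possible shape for such automation, offered only as a sketch under stated assumptions: a function that inspects per-zone error rates and proposes which zones to take out of rotation, while always keeping a minimum number of zones in service. The function name, the 5% threshold, the minimum of two healthy zones, and the sample figures are hypothetical and are not taken from this incident.

```python
def zones_to_evict(
    error_rates: dict[str, float],
    threshold: float = 0.05,  # assumed error-rate cutoff, not a Zuora value
    min_healthy: int = 2,     # assumed minimum number of zones kept in service
) -> list[str]:
    """Return zones whose error rate exceeds the threshold, worst first,
    without ever leaving fewer than min_healthy zones in rotation."""
    unhealthy = sorted(
        (zone for zone, rate in error_rates.items() if rate > threshold),
        key=lambda zone: error_rates[zone],
        reverse=True,
    )
    max_evictions = max(0, len(error_rates) - min_healthy)
    return unhealthy[:max_evictions]


# Illustrative figures only: one degraded zone out of three.
sample = {"zone-a": 0.01, "zone-b": 0.20, "zone-c": 0.02}
evict = zones_to_evict(sample)
```

Capping the number of evictions guards against a monitoring fault or a region-wide problem causing the automation to remove every zone at once.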

NOTE: For this incident, the impacted components were mistakenly marked against the AMERICAS CLOUD datacenter. This was an error: the impacted datacenter was EUROPE CLOUD, which was correctly referenced in the incident title. We apologize for any confusion.

Posted Jun 25, 2021 - 12:15 PDT

Resolved
This incident has been resolved.
Posted Jun 11, 2021 - 14:28 PDT
Update
Service has been restored. We will continue to monitor for the next 12-24 hours.
Posted Jun 10, 2021 - 18:43 PDT
Monitoring
We have implemented some mitigation measures as recommended by our cloud provider, and are seeing improvement and recovery of Zuora Billing services in the EU region across API Sandbox and Production.
Posted Jun 10, 2021 - 16:35 PDT
Investigating
We are experiencing an intermittent connectivity issue (starting at 1:22 PM PT) on EU prod and SBX environments for both the Zuora Billing and Revenue applications, due to an underlying issue with our cloud provider.

We are actively working with our cloud service provider on the issue.
Posted Jun 10, 2021 - 14:22 PDT
This incident affected: AMERICAS - CLOUD 1 (NA1) - *.na.zuora.com (Production UI, Production API, Production Integrations, Production Batch Operations, Production Analytics, Sandbox UI, Sandbox API, Sandbox Integrations, Sandbox Batch Operations, Sandbox Analytics).