TIMEFRAME:
2021/06/10 01:18 PM PDT - 2021/06/10 04:26 PM PDT
SUMMARY AND IMPACT:
During the above timeframe, customers in our Europe Cloud environment experienced intermittent connectivity issues and/or slower-than-normal performance when accessing the Zuora Billing UI and APIs. Within this region, we use three availability zones for our load balancing and traffic routing and distribution systems. Because the issue affected only one of the three availability zones, approximately one third of traffic was exposed to it, and only a small fraction of that traffic experienced any direct impact.
This event triggered multiple alerts from our infrastructure monitoring in near real time as the event started, and the alerts continued intermittently until the mitigation was deployed.
ROOT CAUSE:
Our cloud service provider experienced a temperature management issue within one of the three availability zones used by Zuora Billing services in that region. This in turn caused a small subset of cloud instances in the affected availability zone (part of the top-level traffic distribution layer) to become unavailable or less performant.
RESOLUTION:
Mitigation:
Zuora engineers reconfigured the top-level distribution layer to stop routing traffic to the affected availability zone, which restored normal operations. The zone was removed from routing at 3:39 PM PDT for EU Sandbox and 3:43 PM PDT for EU Production. Following this action, we observed improved and normalized operations for all impacted environments.
Resolution:
Our cloud service provider addressed the issue, normalized performance, and confirmed restoration of their systems by 4:33 PM PDT. Following confirmation, Zuora restored the affected availability zone to EU Sandbox systems for continued observation. After observing continued stability of the affected systems for another 12 hours, we restored the availability zone to the EU Production environments as well.
FUTURE PREVENTATIVE MEASURES:
NOTE: For this incident, the impacted components were mistakenly marked against the AMERICAS CLOUD data center. This was an error: the impacted data center was EUROPE CLOUD, which was correctly referenced in the incident title. We apologize for any confusion.