TIMEFRAME:
2021/11/20 5:35 PM PST - 2021/11/20 8:05 PM PST
2021/11/21 1:35 AM UTC - 2021/11/21 4:05 AM UTC
SUMMARY AND IMPACT:
During the above timeframe, customers of the following environments experienced elevated 500 error rates for API calls sent to Zuora Billing:
Production and Sandbox, Americas Cloud 1 (NA1)
Production and Sandbox, Americas Cloud 2 (NA2) formerly known as “Americas Hosted”
ROOT CAUSE:
A planned change to DNS configuration was deployed on 2021/11/19. The change added Cloud Service Provider native DNS services to our infrastructure. Directly following this change, no issues were encountered and all post-deployment tests passed successfully.
Twenty-four hours after this deployment, our monitoring systems detected an elevated rate of 500 errors from multiple services and alerted the Incident and Engineering teams. Following a prompt investigation, Engineering teams discovered that the DNS Zone Transfers necessary to propagate DNS resolution across our internal network were not correctly enabled for the new configuration. As a result, one of three servers failed to provide proper DNS resolution, which contributed to an elevated error rate affecting a subset of API transactions that traverse availability zones.
RESOLUTION:
Once the root cause was confirmed, Engineering enabled Zone Transfers in the new DNS configuration, which resolved the issue.
FUTURE PREVENTATIVE MEASURES:
Additional change validation around DNS updates
Additional monitoring & alerting around DNS health
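
For illustration only, a DNS health check of the kind listed above could be sketched as follows. This is a minimal sketch under stated assumptions and does not describe the monitoring actually deployed. The function name find_unhealthy_servers, the Lookup type, and the server and record names are all hypothetical. The lookup function is injected by the caller because the Python standard library cannot query a specific DNS server directly; in practice it would wrap whatever resolver client is available.

```python
from collections import Counter
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence

# Hypothetical signature: lookup(server, name) asks ONE DNS server for ONE
# name and returns the set of addresses it answered with, or None if the
# query failed. The caller supplies the real implementation.
Lookup = Callable[[str, str], Optional[FrozenSet[str]]]


def find_unhealthy_servers(
    servers: Sequence[str], names: Sequence[str], lookup: Lookup
) -> Dict[str, List[str]]:
    """Map each suspect server to the names it failed to resolve correctly.

    A server is flagged for a name when its answer is missing or empty, or
    when it differs from the most common non-empty answer across all servers.
    """
    unhealthy: Dict[str, List[str]] = {}
    for name in names:
        # Ask every server the same question.
        answers = {server: lookup(server, name) for server in servers}
        # Ignore failed or empty answers when deciding what "correct" is.
        usable = [answer for answer in answers.values() if answer]
        majority = Counter(usable).most_common(1)[0][0] if usable else None
        for server, answer in answers.items():
            if not answer or answer != majority:
                unhealthy.setdefault(server, []).append(name)
    return unhealthy
```

Run on a schedule against every internal DNS server, a check like this would flag a single server that stops answering for internal records, even while the remaining servers continue to resolve them and most requests still succeed.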