Elevated error rates with our Connect Apps

Incident Report for Zuora

Postmortem

Service impact:

Connect services unavailable or degraded

Time-frames:

(1) 12/4/2018 2:00PM PST - 12/5/2018 12:00AM PST

(2) 12/11/2018 8:30AM PST - 12/11/2018 9:11AM PST

(3) 12/12/2018 11:55AM PST - 12/13/2018 12:00AM PST

Affected systems:

Production Connect Marketplace and Apps

Symptoms:

Service timeouts; services responding with either 502 or 404 response codes

Root Cause:

(1) A system patch to our underlying Connect platform failed to deploy properly, which resulted in the service disruption. Before reaching production, the patch had been deployed on an identical staging cluster; however, its results were non-deterministic, as it succeeded on the staging cluster but not in production. Further investigation showed that the patching process inadvertently put the cluster into a state where it could not properly determine the authoritative master.

(2) The Connect services configuration master state was lost again due to the dynamic nature of the underlying infrastructure that we use. Our team was still finalizing the plans required to bring the masters back into sync.

(3) The Connect services configuration master state was lost again due to the dynamic nature of the underlying infrastructure that we use. At that point our team was preparing to deploy a fix during an emergency maintenance window that evening.

Resolution:

(1) Our engineers were able to force the cluster to use a single master. This brought the cluster back online, and services started to respond normally again. This short-term fix kept the services online while the team worked out a strategy to resolve the underlying root cause of the cluster split-brain scenario.

(2) Our engineers again forced the cluster to use a single master. Our team was still working on the long-term fix for the cluster at the time of this failure.

(3) The third attempt to force a single master did not work completely: services were mostly restored, but the cluster was left in a degraded state and required the scheduled emergency maintenance. During this maintenance we addressed the underlying issue by upgrading part of the cluster (etcd) to a new version. The maintenance was successful and the service was restored.

Future Preventative measures:

Our team is working on a number of follow-up actions to add resiliency to our Connect infrastructure, which are expected to prevent service disruptions like this in the future.

Posted Dec 18, 2018 - 13:51 PST

Resolved

This incident has been resolved.

Posted Dec 04, 2018 - 23:52 PST

Update

We are continuing to fix this issue.

Posted Dec 04, 2018 - 21:24 PST

Update

Connect services are intermittently unavailable for all Connect customers. Engineering teams are currently working to implement a fix.

Posted Dec 04, 2018 - 17:02 PST

Investigating

We are experiencing elevated error rates with our Connect Apps and their platform. We are actively working to identify the root cause and mitigation steps.

Posted Dec 04, 2018 - 15:14 PST

This incident affected: MARKETPLACE (CONNECT) APPLICATIONS (Marketplace (Connect) - Production).