Elevated error rates with Connect Apps and its platform

Incident Report for Zuora

Postmortem

Service impact:

Connect services unavailable or degraded

Time-frames:

(1) 12/4/2018 2:00PM PST - 12/5/2018 12:00AM PST

(2) 12/11/2018 8:30AM PST - 12/11/2018 9:11AM PST

(3) 12/12/2018 11:55AM PST - 12/13/2018 12:00AM PST

Affected systems:

Production Connect Marketplace and Apps

Symptoms:

Service timeouts; services responding with either 502 or 404 response codes

Root Cause:

(1) A system patch to our underlying Connect platform failed to deploy properly, which resulted in the service disruption. The patch had been deployed to an identical staging cluster before production; however, its results were non-deterministic: it succeeded on the staging cluster but not on production. Further investigation showed that the patching process inadvertently put the cluster into a state where it could not properly determine the authoritative master.

(2) The Connect services configuration master state was lost again due to the dynamic nature of the underlying infrastructure that we use. Our team was still finalizing the plans required to bring the masters back into sync.

(3) The Connect services configuration master state was lost again due to the dynamic nature of the underlying infrastructure that we use. At that time, our team was preparing to deploy a fix during an emergency maintenance window that evening.

Resolution:

(1) Our engineers were able to force the cluster to use a single master. This brought the cluster back online, and services started to respond normally again. This was a short-term fix while the team worked out a strategy to resolve the underlying root cause of the cluster split-brain scenario.

(2) Our engineers again forced the cluster to use a single master. Our team was still working on the long-term fix for the cluster at the time of this failure.

(3) Attempts to force a single master a third time were only partly successful: services were mostly restored, but the cluster was left in a degraded state and required the scheduled emergency maintenance. In this maintenance we addressed the underlying issue by upgrading part of the cluster (etcd) to a new version. The maintenance was successful and the service was restored.
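
For illustration, the split-brain scenario mentioned above can be sketched with the majority (quorum) rule that consensus-based stores such as etcd rely on: a group of members can only agree on an authoritative master if it holds a strict majority of the cluster. This is a minimal sketch and not the actual Connect configuration; the cluster size of 3 and the function names are assumptions, since this report does not state them.

```python
def quorum_size(member_count: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if member_count < 1:
        raise ValueError("a cluster needs at least one member")
    return member_count // 2 + 1


def can_elect_master(reachable_members: int, member_count: int) -> bool:
    """True if a group of mutually reachable members holds a majority."""
    return reachable_members >= quorum_size(member_count)


if __name__ == "__main__":
    # Hypothetical 3-member cluster; the real size is not given in the report.
    members = 3
    for reachable in range(members + 1):
        status = "can elect a master" if can_elect_master(reachable, members) else "no quorum"
        print(f"{reachable} of {members} members reachable: {status}")
```

Under this rule, two disjoint groups can never both hold a majority, which is why losing the members' shared view of the cluster leaves it without an authoritative master rather than with two. Forcing a single master, as in resolutions (1) and (2), sidesteps the rule temporarily instead of restoring it.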

Future Preventative measures:

Our team is working on a number of follow-up actions to add resiliency to our Connect infrastructure; these are expected to prevent service disruptions like this in the future.

Posted Dec 18, 2018 - 13:52 PST

Resolved

This incident has been resolved.

Posted Dec 11, 2018 - 11:14 PST

Investigating

We observed elevated error rates with our Connect Apps and its platform from approximately 8:45am to 9:00am Pacific time this morning. The issue appears to be resolved, and we are continuing to investigate and identify the root cause.

Posted Dec 11, 2018 - 10:09 PST

This incident affected: MARKETPLACE (CONNECT) APPLICATIONS (Marketplace (Connect) - Production).