Elevated error rates with our Connect Apps
Incident Report for Zuora
Postmortem

Service impact:

Connect services unavailable or degraded

Timeframes:

(1) 12/4/2018 2:00PM PST - 12/5/2018 12:00AM PST

(2) 12/11/2018 8:30AM PST - 12/11/2018 9:11AM PST

(3) 12/12/2018 11:55AM PST - 12/13/2018 12:00AM PST

Affected systems:

Production Connect Marketplace and Apps

Symptoms:

Service timeouts; services responding with 502 or 404 response codes

Root Cause:

(1) A system patch to our underlying Connect platform failed to deploy properly, which resulted in the service disruption. The patch had been deployed to an identical staging cluster prior to production; however, the results were non-deterministic, succeeding on the staging cluster but not in production. Upon further investigation, we found that the patching process had inadvertently put the cluster into a state where it could not determine the authoritative master.

(2) The Connect services' configuration master state was lost again due to the dynamic nature of the underlying infrastructure we use. At the time, our team was still finalizing the plan to bring the masters back into sync.

(3) The Connect services' configuration master state was lost again due to the dynamic nature of the underlying infrastructure we use. Our team was preparing to deploy a fix during an emergency maintenance window that evening.

Resolution:

(1) Our engineers forced the cluster to recognize a single authoritative master, which brought the cluster back online and restored normal service responses. This short-term fix restored service while the team worked out a strategy to resolve the underlying root cause of the split-brain scenario.

(2) Our engineers again forced the cluster to use a single master. The long-term fix for the cluster was still in progress at the time of this failure.

(3) The third attempt to force a single master was only partially successful: most services were restored, but the cluster remained in a degraded state and required the scheduled emergency maintenance. During that maintenance we addressed the underlying issue by upgrading part of the cluster (etcd) to a new version. The maintenance was successful and full service was restored.
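The report does not publish the exact recovery commands used. As a minimal sketch, assuming an etcd v3 cluster managed with etcdctl (the endpoint addresses, member names, and data directory below are hypothetical), forcing a split-brain or quorum-less etcd cluster back to a single authoritative member typically looks like this:

```shell
# Inspect cluster state (hypothetical endpoints):
etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  endpoint status --write-out=table
etcdctl --endpoints=https://etcd-1:2379 member list

# Disaster recovery: restart ONE surviving member with --force-new-cluster.
# This discards the old membership configuration and makes that member
# the sole, authoritative master of a new single-node cluster.
etcd --name etcd-1 \
  --data-dir /var/lib/etcd \
  --force-new-cluster

# Once the single-member cluster is healthy, re-add the remaining members
# one at a time so they rejoin with a clean membership record:
etcdctl --endpoints=https://etcd-1:2379 member add etcd-2 \
  --peer-urls=https://etcd-2:2380
```

The key design point of this procedure is that it trades availability of the old membership for a single unambiguous source of truth, which is why it restores service but leaves the cluster degraded until the other members are rebuilt and re-added.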

Future preventative measures:

Our team is working on a number of follow-up actions to add resiliency to our Connect infrastructure, which are expected to prevent service disruptions like this in the future.

Posted Dec 18, 2018 - 13:53 PST

Resolved
The emergency maintenance has completed and all Connect services have been restored to normal operation.
Posted Dec 13, 2018 - 01:02 PST
Update
We will continue to monitor Connect services. We will be performing emergency maintenance on the Connect platform from 9:00pm 12/12/2018 Pacific to 12:00am 12/13/2018 Pacific. During this time Connect applications may be unavailable. This maintenance is to address recent stability issues on the platform.
Posted Dec 12, 2018 - 16:11 PST
Update
We are continuing to monitor for further issues.
Posted Dec 12, 2018 - 14:41 PST
Monitoring
Services are currently back up. We are monitoring the services and working on next steps.
Posted Dec 12, 2018 - 12:54 PST
Update
Our team is still working on a fix to restore Connect Apps and hopes to restore services as soon as possible.
Posted Dec 12, 2018 - 12:38 PST
Identified
We have identified the problem with the Connect services and are working on a fix to restore the services as soon as possible.
Posted Dec 12, 2018 - 12:22 PST
Update
Our team is still working on restoring Connect Apps services with the highest priority.
Posted Dec 12, 2018 - 12:06 PST
Investigating
We are currently experiencing elevated error rates with our Connect Apps and are investigating.
Posted Dec 12, 2018 - 11:48 PST
This incident affected: CONNECT APPLICATIONS (Connect - Production).