Service degradation in US Production Environment
Incident Report for Zuora
Postmortem

Summary

A memory unit failure caused a kernel panic on a primary message queue server, resulting in hung threads and timeouts/failures for a subset of Zuora customers. Failover did not occur in a timely manner because the failed server still held orphaned locks on the DB, which had to be cleared manually before failover could be forced.

Further analysis revealed that this behavior (orphaned locks on the DB upon primary message queue failure) is resolved in the next version of the message queue software, which was scheduled for immediate upgrade.

Immediate Actions:

  • Cleared the DB locks that were preventing failover of the primary message queue server, and forced failover to the standby

  • Replaced the primary message queue server with a new host, configured as standby

  • Cycled the front-end Tomcat servers to immediately reset connection pools and restore service
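
The Tomcat restart was needed because front-end request threads were blocked indefinitely on connections to the hung broker. The following is an illustration only, not Zuora's client code or pool configuration (which the report does not describe): a Java sketch of how explicit connect and read timeouts bound how long a request thread can hang on a peer that has kernel-panicked while its TCP connections remain half-open.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;
    import java.util.Arrays;

    // Hedged illustration only: host, port, and timeout values are placeholders,
    // not Zuora's configuration. Without explicit timeouts, connect() and read()
    // can block for many minutes against a dead peer, tying up Tomcat request
    // threads and their pooled connections.
    public class BoundedBrokerCall {

        public static byte[] readWithTimeouts(String host, int port) throws IOException {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5_000); // 5s connect timeout
                socket.setSoTimeout(10_000);                              // 10s read timeout

                try (InputStream in = socket.getInputStream()) {
                    byte[] buf = new byte[4096];
                    int n = in.read(buf); // throws SocketTimeoutException after 10s of silence
                    return n > 0 ? Arrays.copyOf(buf, n) : new byte[0];
                }
            } catch (SocketTimeoutException e) {
                // Fail fast so the request thread and its connection are released
                // instead of hanging until an operator restarts the server.
                throw new IOException("Broker did not respond in time: " + host + ":" + port, e);
            }
        }
    }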

Root Cause Analysis:

A memory unit in a physical server acting as a primary in the Zuora message queueing infrastructure failed.

57  2017/12/13 09:16:15  OEM Memory    Uncorrectable ECC @ DIMMH1(CPU2)
58  2017/12/13 09:17:01  Memory Error  BIOS OEM (runtime) Failing DIMM: DIMM location. (P2-DIMMH1)

Failover to the standby did not proceed in a timely manner because the DB locks held by the failed primary could not be released automatically. The locks had to be cleared manually to force failover.
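
The report does not name the database or the broker's lock schema, so the following is a hypothetical sketch only (assuming a MySQL-backed lock reached over JDBC) of the shape of that manual step: identify the DB sessions still registered to the dead primary and kill them so their locks are released and the standby can take over.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Hypothetical sketch: assumes a MySQL-backed lock accessed over JDBC. The
    // actual database, schema, and tooling used during the incident are not
    // described in the report.
    public class ClearOrphanedLocks {

        public static void main(String[] args) throws SQLException {
            String jdbcUrl = args[0];         // e.g. jdbc:mysql://db-host/broker (placeholder)
            String deadPrimaryHost = args[1]; // hostname of the kernel-panicked primary

            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SHOW PROCESSLIST")) {

                while (rs.next()) {
                    long sessionId = rs.getLong("Id");
                    String clientHost = rs.getString("Host"); // "host:port" of the client

                    // Sessions opened by the dead primary are orphaned: the process is
                    // gone, but the DB still treats the session (and the locks it holds)
                    // as alive until the connection times out or is killed.
                    if (clientHost != null && clientHost.startsWith(deadPrimaryHost)) {
                        try (Statement kill = conn.createStatement()) {
                            kill.execute("KILL " + sessionId); // releases that session's locks
                            System.out.println("Killed orphaned session " + sessionId);
                        }
                    }
                }
            }
        }
    }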

Corrective Actions:

  • This issue is resolved by upgrading the message queue software to a version that moves from DB locks to leases (see the sketch after this list) - DONE

  • In addition, an Improvement Plan is in place to further scale the message queuing infrastructure and narrow the fault domain
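
The following is a minimal sketch of the difference the upgrade makes, not the broker's actual implementation: an exclusive DB lock is held until the owning session goes away (which a kernel-panicked host cannot do cleanly), whereas a lease row carries an expiry that the primary must keep renewing, so a standby can take over on its own once the lease lapses. The broker_lease table, column names, and timings below are invented for the example.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Minimal sketch, not the broker's implementation: a lease stored in one DB
    // row. The table broker_lease(name, owner, expires_at) and the timings are
    // invented for the example.
    public class DbLease {
        private static final long LEASE_MS = 15_000;

        private final Connection conn;
        private final String owner; // e.g. this broker's hostname

        public DbLease(Connection conn, String owner) {
            this.conn = conn;
            this.owner = owner;
        }

        // Acquire the lease if it is free, expired, or already ours. The atomic
        // UPDATE serializes competing brokers: a loser's predicate re-evaluates
        // to false once the winner commits, so only one broker holds the lease.
        public boolean tryAcquire() throws SQLException {
            String sql =
                "UPDATE broker_lease SET owner = ?, expires_at = ? " +
                "WHERE name = 'primary' AND (owner = ? OR owner IS NULL OR expires_at < ?)";
            long now = System.currentTimeMillis();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, owner);
                ps.setLong(2, now + LEASE_MS);
                ps.setString(3, owner);
                ps.setLong(4, now);
                return ps.executeUpdate() == 1; // one row updated => we hold the lease
            }
        }

        // The active primary calls this well inside LEASE_MS; if it dies, the row
        // simply expires and a standby's tryAcquire() succeeds without operator action.
        public boolean renew() throws SQLException {
            return tryAcquire();
        }
    }

A standby simply retries tryAcquire() on a timer; the failed primary's lease lapses on its own, which is what removes the manual lock-clearing step from the failover path.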

Posted Jan 16, 2018 - 13:19 PST

Resolved
Our monitoring operations are complete. The issue was resolved at around 9:40am PST today. A mitigation is currently in place, and we are actively working on a long-term solution to prevent a recurrence of this event.
Posted Dec 13, 2017 - 15:35 PST
Monitoring
We have restored service and are monitoring.
Posted Dec 13, 2017 - 09:43 PST
Investigating
We are currently experiencing a service degradation in our US Production environment, and are actively working to restore service as soon as possible.
Posted Dec 13, 2017 - 09:30 PST