Service degradation in US Production Environment
Incident Report for Zuora
Postmortem

Summary

A memory unit failure caused a kernel panic on a primary message queue server, resulting in hung threads and timeouts/failures for a subset of Zuora customers. Failover did not occur in a timely manner because the failed server still held orphaned locks on the DB, which had to be cleared manually before failover could be forced.

Further analysis revealed that this behavior (orphaned locks on the DB upon primary message queue failure) is resolved in the next version of the message queue software, which was scheduled for immediate upgrade.

Immediate Actions:

  • Cleared the DB locks that were preventing failover of the primary message queue server, and forced failover to the standby

  • Replaced the primary message queue server with a new host, configured as standby

  • Cycled the front-end Tomcat servers to immediately reset connection pools and restore service
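
The Tomcat restart was needed because front-end request threads were blocked indefinitely on connections to the hung broker. The following is an illustration only, not Zuora's client code or pool configuration (which the report does not describe): a Java sketch of how explicit connect and read timeouts bound how long a request thread can hang on a peer that has kernel-panicked while its TCP connections remain half-open.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;
    import java.util.Arrays;

    // Hedged illustration only: host, port, and timeout values are placeholders,
    // not Zuora's configuration. Without explicit timeouts, connect() and read()
    // can block for many minutes against a dead peer, tying up Tomcat request
    // threads and their pooled connections.
    public class BoundedBrokerCall {

        public static byte[] readWithTimeouts(String host, int port) throws IOException {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5_000); // 5s connect timeout
                socket.setSoTimeout(10_000);                              // 10s read timeout

                try (InputStream in = socket.getInputStream()) {
                    byte[] buf = new byte[4096];
                    int n = in.read(buf); // throws SocketTimeoutException after 10s of silence
                    return n > 0 ? Arrays.copyOf(buf, n) : new byte[0];
                }
            } catch (SocketTimeoutException e) {
                // Fail fast so the request thread and its connection are released
                // instead of hanging until an operator restarts the server.
                throw new IOException("Broker did not respond in time: " + host + ":" + port, e);
            }
        }
    }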

Root Cause Analysis:

A memory unit in a physical server acting as a primary in the Zuora message queueing infrastructure failed.

57  2017/12/13 09:16:15  OEM Memory    Uncorrectable ECC @ DIMMH1(CPU2)
58  2017/12/13 09:17:01  Memory Error  BIOS OEM (runtime) Failing DIMM: DIMM location. (P2-DIMMH1)

Failover to the standby did not proceed in a timely manner because the DB locks held by the failed primary could not be released automatically. The locks had to be cleared manually to force failover.
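
The report does not name the database or the broker's lock schema, so the following is a hypothetical sketch only (assuming a MySQL-backed lock reached over JDBC) of the shape of that manual step: identify the DB sessions still registered to the dead primary and kill them so their locks are released and the standby can take over.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Hypothetical sketch: assumes a MySQL-backed lock accessed over JDBC. The
    // actual database, schema, and tooling used during the incident are not
    // described in the report.
    public class ClearOrphanedLocks {

        public static void main(String[] args) throws SQLException {
            String jdbcUrl = args[0];         // e.g. jdbc:mysql://db-host/broker (placeholder)
            String deadPrimaryHost = args[1]; // hostname of the kernel-panicked primary

            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SHOW PROCESSLIST")) {

                while (rs.next()) {
                    long sessionId = rs.getLong("Id");
                    String clientHost = rs.getString("Host"); // "host:port" of the client

                    // Sessions opened by the dead primary are orphaned: the process is
                    // gone, but the DB still treats the session (and the locks it holds)
                    // as alive until the connection times out or is killed.
                    if (clientHost != null && clientHost.startsWith(deadPrimaryHost)) {
                        try (Statement kill = conn.createStatement()) {
                            kill.execute("KILL " + sessionId); // releases that session's locks
                            System.out.println("Killed orphaned session " + sessionId);
                        }
                    }
                }
            }
        }
    }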

Corrective Actions:

  • This issue is resolved by upgrading the message queue software to a version that moves from DB locks to leases (see the sketch after this list) - DONE

  • In addition, an Improvement Plan is in place to further scale the message queuing infrastructure and narrow the fault domain
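
The following is a minimal sketch of the difference the upgrade makes, not the broker's actual implementation: an exclusive DB lock is held until the owning session goes away (which a kernel-panicked host cannot do cleanly), whereas a lease row carries an expiry that the primary must keep renewing, so a standby can take over on its own once the lease lapses. The broker_lease table, column names, and timings below are invented for the example.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Minimal sketch, not the broker's implementation: a lease stored in one DB
    // row. The table broker_lease(name, owner, expires_at) and the timings are
    // invented for the example.
    public class DbLease {
        private static final long LEASE_MS = 15_000;

        private final Connection conn;
        private final String owner; // e.g. this broker's hostname

        public DbLease(Connection conn, String owner) {
            this.conn = conn;
            this.owner = owner;
        }

        // Acquire the lease if it is free, expired, or already ours. The atomic
        // UPDATE serializes competing brokers: a loser's predicate re-evaluates
        // to false once the winner commits, so only one broker holds the lease.
        public boolean tryAcquire() throws SQLException {
            String sql =
                "UPDATE broker_lease SET owner = ?, expires_at = ? " +
                "WHERE name = 'primary' AND (owner = ? OR owner IS NULL OR expires_at < ?)";
            long now = System.currentTimeMillis();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, owner);
                ps.setLong(2, now + LEASE_MS);
                ps.setString(3, owner);
                ps.setLong(4, now);
                return ps.executeUpdate() == 1; // one row updated => we hold the lease
            }
        }

        // The active primary calls this well inside LEASE_MS; if it dies, the row
        // simply expires and a standby's tryAcquire() succeeds without operator action.
        public boolean renew() throws SQLException {
            return tryAcquire();
        }
    }

A standby simply retries tryAcquire() on a timer; the failed primary's lease lapses on its own, which is what removes the manual lock-clearing step from the failover path.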

Posted Jan 16, 2018 - 13:19 PST

Resolved
Our monitoring operations are complete. The issue was resolved at around 9:40am PST today. A mitigation is currently in place, and we are actively working on a long-term solution to prevent a recurrence of this event.
Posted Dec 13, 2017 - 15:35 PST
Monitoring
We have restored service and are monitoring.
Posted Dec 13, 2017 - 09:43 PST
Investigating
We are currently experiencing a service degradation in our US Production environment, and are actively working to restore service as soon as possible.
Posted Dec 13, 2017 - 09:30 PST