Summary
A memory unit failure caused a kernel panic on a primary message queue server, resulting in hung threads and in timeouts and failures for a subset of Zuora customers. Failover did not occur in a timely manner because the failed message queue server left orphaned locks on the DB, which had to be cleared manually before failover could be forced.
Further analysis revealed that this behavior (orphaned locks on the DB after a primary message queue failure) is resolved in the next version of the message queue software. An upgrade to that version was scheduled immediately.
Immediate Actions:
Cleared the DB locks that were preventing failover of the primary message queue server, and forced failover to the standby
Replaced primary message queue with a new host, configured as standby
Cycled Front End Tomcat servers to immediately reset connection pool and restore service
Root Cause Analysis:
A memory unit in a physical server acting as a primary in the Zuora message queueing infrastructure failed.
57 2017/12/13 09:16:15 OEM Memory Uncorrectable ECC @ DIMMH1(CPU2)
58 2017/12/13 09:17:01 Memory Error BIOS OEM (runtime) Failing DIMM: DIMM location. (P2-DIMMH1)
Failover to standby did not proceed in a timely manner due to an inability to clear DB locks held by the primary. Locks needed to be cleared manually to force failover.
Corrective Actions:
This issue is resolved by upgrading the message queue software to a version that uses leases instead of DB locks - DONE
In addition, an Improvement Plan is in place to scale the message queuing infrastructure and further narrow the fault domain
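To illustrate why moving from DB locks to leases removes the need for manual cleanup, the following is a minimal sketch in Python. It is not the message queue software's implementation. The names LeaseTable, acquire, primary, simulate, and the 10-second TTL are hypothetical, chosen only for illustration. The idea is that a lease carries an expiry time, so a primary that crashes simply stops renewing and the standby can take over once the lease lapses, whereas a plain lock stays held until someone releases it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str        # node that currently owns the primary role
    expires_at: float  # time after which the lease is no longer valid


class LeaseTable:
    """Toy stand-in for the DB record that says which node is primary (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.lease: Optional[Lease] = None

    def acquire(self, node: str, now: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already held by node."""
        current = self.lease
        if current is None or current.expires_at <= now or current.holder == node:
            self.lease = Lease(holder=node, expires_at=now + self.ttl)
            return True
        return False

    def primary(self, now: float) -> Optional[str]:
        """Return the node holding a still-valid lease, or None if it has lapsed."""
        if self.lease is not None and self.lease.expires_at > now:
            return self.lease.holder
        return None


def simulate() -> list:
    """Walk through a primary crash; time is passed in explicitly to stay deterministic."""
    table = LeaseTable(ttl_seconds=10.0)  # assumed TTL, for illustration only
    events = []
    events.append(table.acquire("primary", now=0.0))   # primary takes the lease
    events.append(table.acquire("standby", now=5.0))   # refused: lease still valid
    events.append(table.acquire("primary", now=8.0))   # primary renews until t=18
    # The primary crashes at t=9 and never renews again.
    events.append(table.acquire("standby", now=15.0))  # refused: lease valid until t=18
    events.append(table.acquire("standby", now=18.0))  # granted: lease has expired
    events.append(table.primary(now=19.0))             # standby is now primary
    return events


if __name__ == "__main__":
    print(simulate())
```

In this sketch the worst-case failover delay is bounded by the lease TTL, with no operator step required. A real deployment would also need to account for clock skew between nodes and for a slow primary that resumes after its lease has lapsed.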