Symptom
A Nexus 9K switch running NX-OS 9.2(x) may experience a HAP reset caused by a segfault (signal 11) crash in the ELTM process while it is handling an MTS (Messaging and Transaction System) message. The crash results from software-level memory corruption. Because the corruption was caused by something earlier, the feature that generated the MTS event ELTM was handling at the time of the crash can vary from one occurrence to the next.
Example:
%SYSMGR-SLOTX-2-SERVICE_CRASHED: Service "eltm" (PID XXXX) hasn't caught signal 11 (core will be saved)
SWITCH# show version
Last reset at XXXXXX usecs after Wed Jan 1 00:00:00 2020
Reason: Reset triggered due to HA policy of Reset
System version: 9.2(3)
Service: eltm hap reset
Conditions
- Switch is a Nexus 9K running NX-OS 9.2(2) or 9.2(3).
- ELTM process was handling an MTS (Messaging and Transaction System) event at the time of the crash.
The second condition must be confirmed by Cisco TAC through analysis of the ELTM core file, which is listed in the output of "show cores".
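As a rough triage aid, the first condition and the reset signature from the Symptom section can be checked against saved "show version" output. The following is a minimal Python sketch, not a Cisco tool: the function name matches_eltm_hap_reset is hypothetical, and it assumes the "System version:" and "Service:" lines appear as in the example above. A match does not confirm the second condition, which still requires core file analysis by Cisco TAC.

```python
import re

# Releases named in the Conditions section of this bug.
AFFECTED_VERSIONS = {"9.2(2)", "9.2(3)"}


def matches_eltm_hap_reset(show_version_text: str) -> bool:
    """Return True if the text shows an affected release and an eltm HAP reset.

    Hypothetical helper: it only inspects the last-reset lines of
    "show version" output, formatted as in the example in this note.
    """
    version = re.search(
        r"^\s*System version:\s*(\S+)", show_version_text, re.MULTILINE
    )
    service = re.search(
        r"^\s*Service:\s*eltm hap reset",
        show_version_text,
        re.MULTILINE | re.IGNORECASE,
    )
    return bool(
        version and version.group(1) in AFFECTED_VERSIONS and service
    )


# Sample text copied from the example output in the Symptom section.
sample = """\
Last reset at XXXXXX usecs after Wed Jan 1 00:00:00 2020
Reason: Reset triggered due to HA policy of Reset
System version: 9.2(3)
Service: eltm hap reset
"""

print(matches_eltm_hap_reset(sample))
```

The check is deliberately narrow: it reports a match only when both the release and the "eltm hap reset" service line are present, so a reset caused by a different service on 9.2(3) is not flagged.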
Further Problem Description
Changes introduced in NX-OS 9.3(1) were found to make that release and later releases not susceptible to this bug.