Symptom
Multiple issues can occur, including a CPU soft lockup, blackholing of all control-plane traffic, and a dual-active vPC pair.
Conditions
A DIMM failure occurs on the primary Supervisor and failover is not triggered.
Example error messages for a DIMM failure:
2019 Feb 10 08:29:43.040 N7K-1 %DAEMON-3-SYSTEM_MSG: corrected DIMM memory error count exceeded threshold: 235 in 24h - mcelog
2019 Feb 10 08:29:43.040 N7K-1 %DAEMON-3-SYSTEM_MSG: Location: SOCKET:0 CHANNEL:1 DIMM:0 [] - mcelog
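To spot this condition in collected logs, the two messages above can be matched with a short script. The following is a minimal sketch, not a Cisco tool: the function name find_dimm_alerts and the dictionary keys are hypothetical, and the patterns assume the messages appear exactly in the format of the example above.

```python
import re

# Patterns derived from the two example messages above (assumed format).
THRESHOLD_RE = re.compile(
    r"corrected DIMM memory error count exceeded threshold: "
    r"(?P<count>\d+) in (?P<window>\d+h)"
)
LOCATION_RE = re.compile(
    r"Location: SOCKET:(?P<socket>\d+) CHANNEL:(?P<channel>\d+) DIMM:(?P<dimm>\d+)"
)


def find_dimm_alerts(lines):
    """Return one dict per threshold message.

    A Location line is attached to the most recent threshold message
    that does not yet have a location (hypothetical pairing rule).
    """
    alerts = []
    for line in lines:
        hit = THRESHOLD_RE.search(line)
        if hit:
            alerts.append(
                {
                    "count": int(hit.group("count")),
                    "window": hit.group("window"),
                    "location": None,
                }
            )
            continue
        loc = LOCATION_RE.search(line)
        if loc and alerts and alerts[-1]["location"] is None:
            alerts[-1]["location"] = {
                key: int(value) for key, value in loc.groupdict().items()
            }
    return alerts


# The two example messages from this note.
SAMPLE = [
    "2019 Feb 10 08:29:43.040 N7K-1 %DAEMON-3-SYSTEM_MSG: corrected DIMM "
    "memory error count exceeded threshold: 235 in 24h - mcelog",
    "2019 Feb 10 08:29:43.040 N7K-1 %DAEMON-3-SYSTEM_MSG: Location: "
    "SOCKET:0 CHANNEL:1 DIMM:0 [] - mcelog",
]

if __name__ == "__main__":
    for alert in find_dimm_alerts(SAMPLE):
        print(alert)
```

Any alert returned for the primary Supervisor indicates the condition described here, and the workaround below applies.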
Workaround
Physically remove the Supervisor with the failed DIMM.
Further Problem Description
A process needs to be developed to detect a DIMM failure, trigger failover, and, if possible, gracefully recover the failed module.
If recovery is not possible, the failed module needs to remain in an inactive state.
Fix-Remarks
The bug fix currently only prints a syslog message warning that the Supervisor should be reloaded, or a switchover performed, at the earliest opportunity to avoid a potential Supervisor hang.
Auto recovery is enabled by the fix in CSCvq56953.