Symptom
A Cisco ACI switch reloads due to a Machine Check Exception (MCE). The kernel log shows output similar to the following:
[603029.390562] sbridge: HANDLING MCE MEMORY ERROR
[603029.390563] CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010091
[603029.390564] TSC 0 ADDR 2e3d3f40 MISC 140545486 PROCESSOR 0:50663 TIME 1569464793 SOCKET 0 APIC 0
[603029.390710] sbridge: HANDLING MCE MEMORY ERROR
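To compare an observed log line against the symptom, the following minimal sketch extracts the CPU, bank, and status word from a "Machine Check Exception" line and lists which architectural flag bits are set. The flag positions follow the Intel SDM layout for IA32_MCi_STATUS; that this layout applies to the status word printed on this platform is an assumption. The names parse_mce_line, MCE_LINE, and STATUS_FLAGS are hypothetical helpers for illustration and are not part of any Cisco tool.

```python
import re

# Flag bit positions in IA32_MCi_STATUS per the Intel SDM (assumed to apply here).
STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}

# Matches lines such as:
# "CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010091"
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception: [0-9a-fA-F]+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)

def parse_mce_line(line):
    """Return the decoded fields of an MCE log line, or None if it does not match."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    status = int(m.group("status"), 16)
    return {
        "cpu": int(m.group("cpu")),
        "bank": int(m.group("bank")),
        "status": status,
        # Names of the flag bits that are set, in alphabetical order.
        "flags": sorted(n for n, bit in STATUS_FLAGS.items() if (status >> bit) & 1),
        # Low 16 bits hold the MCA error code in the architectural layout.
        "mca_error_code": status & 0xFFFF,
    }

sample = "[603029.390563] CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010091"
decoded = parse_mce_line(sample)
print(decoded["bank"], decoded["flags"], hex(decoded["mca_error_code"]))
```

Under the stated assumption, the sample line decodes to bank 7 with VAL, MISCV, and ADDRV set and an MCA error code of 0x91. Lines that do not match the pattern return None, so the helper can be run over a whole log capture.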
Conditions
This issue occurs following an upgrade to release 4.1(2m).
Further Problem Description
You may also see the following trace in the output of "show logging onboard stack-trace":
[3211318.075177] MACHINE CHECK ERROR
[3211318.075179] MACHINE CHECK ERROR
[3211318.075181] MACHINE CHECK ERROR
[3211318.075182] MACHINE CHECK ERROR
[3211318.075183] MACHINE CHECK ERROR
[3211318.075185] MACHINE CHECK ERROR
[3211318.075186] MACHINE CHECK ERROR
[3211318.075187] MACHINE CHECK ERROR
[3211318.075189] MACHINE CHECK ERROR
[3211318.075194] MACHINE CHECK ERROR
[3211318.075195] MACHINE CHECK ERROR
[3211318.075196] MACHINE CHECK ERROR
[3211318.075481] cctrli: SUP/TOR NMI handler called. cmd: 1
[3211318.075482] cctrli: SUP/TOR NMI handler called
[3211318.075488] @@@cctrli: wrote 2 to scratch RR
[3211318.076299] nvram_klm wrote rr=2 rr_str=(null) to nvram
[3211318.076299] Sending SIGUSR1 signal to port_client process
[3211318.076355] Sending SIGUSR2 signal to port_client process
[3211318.076360] (1583133410.356212) (03-02-2020 07:16:50 UTC)cctrl2 card_index=21135, link flap done.
[3211318.076361] Kernel panic - not syncing: FPGA watchdog
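To check a saved stack-trace capture against the trace above, the following minimal sketch counts the characteristic lines. It assumes the command output has already been saved as text. The names SIGNATURES, count_signatures, and matches_trace are hypothetical, and treating "at least one MACHINE CHECK ERROR line plus the FPGA watchdog panic" as a match is an illustrative choice, not an official criterion.

```python
import re

# Signature lines taken from the trace in this note.
SIGNATURES = {
    "machine_check_error": re.compile(r"MACHINE CHECK ERROR"),
    "nmi_handler": re.compile(r"cctrli: SUP/TOR NMI handler called"),
    "fpga_watchdog_panic": re.compile(r"Kernel panic - not syncing: FPGA watchdog"),
}

def count_signatures(text):
    """Count how many lines of the capture contain each signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

def matches_trace(text):
    """Illustrative criterion: machine check errors followed by the watchdog panic."""
    counts = count_signatures(text)
    return counts["machine_check_error"] > 0 and counts["fpga_watchdog_panic"] > 0

sample = """\
[3211318.075177] MACHINE CHECK ERROR
[3211318.075179] MACHINE CHECK ERROR
[3211318.075481] cctrli: SUP/TOR NMI handler called. cmd: 1
[3211318.075482] cctrli: SUP/TOR NMI handler called
[3211318.076361] Kernel panic - not syncing: FPGA watchdog
"""
print(count_signatures(sample), matches_trace(sample))
```

A capture with a kernel panic but no MACHINE CHECK ERROR lines does not match under this criterion, which helps separate this symptom from unrelated watchdog reloads.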