...
Phase 1

The first symptom is that the Nexus 9000 switch /bootflash goes into read-only mode. Because the switch can no longer write to the SSD, it will most likely crash shortly afterward with the following message: 'sysmgr failed to re-register with heartbeat klm'

Leaf# vsh -c "show logging onboard internal reset-reason"
----------------------------
Module: 1
----------------------------
...SNIP..
Reset Reason for this card:
Image Version : 14.2(4i)
Reset Reason (LCM): Unknown (0) at time Sat Oct 10 04:43:19 2020
Reset Reason (SW): Reset Requested due to Fatal System Error (3) at time Sat Oct 10 04:38:34 2020
Service (Additional Info): sysmgr failed to re-register with heartbeat klm
Reset Reason (HW): Reset Requested due to Fatal System Error (3) at time Sat Oct 10 04:43:19 2020
Reset Cause (HW): 0x01 at time Sat Oct 10 04:43:19 2020
Reset internal (HW): 0x00 at time Sat Oct 10 04:43:19 2020
...SNIP...

Phase 2

Phase 2 begins once the threshold is crossed. After the switch crashes because of the threshold crossing, it comes back up with the SSD in read-write (r/w) mode, and the read-only (ro) symptoms are no longer observed. However, the switch has now entered the second phase of the FN behavior, in which the SSD goes into read-only mode again (followed by a crash) every 1008 hours (~42 days).
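The phase 2 timing can be illustrated with a little arithmetic. The sketch below is not from the FN itself: it assumes that the recurring events fall exactly on the initial 28224-hour threshold plus whole multiples of 1008 hours, which the notice does not confirm. The function name hours_until_next_event is hypothetical.

```python
INITIAL_THRESHOLD_HOURS = 28224  # initial Power_On_Hours threshold cited in this notice
REPEAT_INTERVAL_HOURS = 1008     # phase 2 recurrence interval cited in this notice


def hours_until_next_event(power_on_hours: int) -> int:
    """Estimate the power-on hours left before the next read-only event.

    Assumption (not confirmed by the notice): phase 2 events occur at
    INITIAL_THRESHOLD_HOURS + k * REPEAT_INTERVAL_HOURS for k = 1, 2, ...
    """
    if power_on_hours < INITIAL_THRESHOLD_HOURS:
        # Still before the first threshold crossing.
        return INITIAL_THRESHOLD_HOURS - power_on_hours
    elapsed = (power_on_hours - INITIAL_THRESHOLD_HOURS) % REPEAT_INTERVAL_HOURS
    return REPEAT_INTERVAL_HOURS - elapsed


if __name__ == "__main__":
    print("Interval in days:", REPEAT_INTERVAL_HOURS / 24)
    print("Hours until next event at 28228:", hours_until_next_event(28228))
```

Treat the result as a rough planning aid only; the firmware upgrade remains the actual fix.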
1. The switch has a Micron_M500IT_MTFDDAT064SBD SSD.
2. The switch is on an affected version, specifically SSD firmware MU01.00 or MC02.00.
3. The RAW_VALUE of the SSD's Power_On_Hours (attribute 9) has crossed the initial threshold of 28224.

Results:
If conditions 1 and 2 are true, the switch is affected and the SSD firmware should be upgraded as soon as possible to avoid symptom phase 1.
If conditions 1, 2 and 3 are all true, the switch is in symptom phase 2 and will crash every ~42 days until the SSD firmware is upgraded.
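The three conditions and their results can be expressed as a small decision function. This is a minimal sketch for offline triage, not a Cisco tool: the function name classify and the returned strings are hypothetical, and treating a value exactly equal to 28224 as "crossed" is an assumption.

```python
AFFECTED_MODEL = "Micron_M500IT_MTFDDAT064SBD"
AFFECTED_FIRMWARE = {"MU01.00", "MC02.00"}
INITIAL_THRESHOLD_HOURS = 28224


def classify(model: str, firmware: str, power_on_hours: int) -> str:
    """Map the three conditions listed above to a result string."""
    # Conditions 1 and 2: affected SSD model on an affected firmware.
    if model != AFFECTED_MODEL or firmware not in AFFECTED_FIRMWARE:
        return "not affected"
    # Condition 3: Power_On_Hours RAW_VALUE has crossed the threshold.
    # Assumption: a value equal to the threshold counts as crossed.
    if power_on_hours >= INITIAL_THRESHOLD_HOURS:
        return "affected, symptom phase 2"
    return "affected, upgrade before phase 1"


if __name__ == "__main__":
    print(classify("Micron_M500IT_MTFDDAT064SBD", "MC02.00", 28228))
```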
apic# moquery -c eqptFlash -f 'eqpt.Flash.model*"Micron_M500IT"'
Total Objects shown: 2

# eqpt.Flash
dn    : topology/pod-1/node-101/sys/ch/supslot-1/sup/flash
model : Micron_M500IT_MTFDDAT064SBD
rev   : MC02.00
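When the query returns many objects, the output can be saved to a file and summarized offline. The sketch below assumes the "key : value" layout shown in the sample, with each object introduced by a line starting with "#". The helper names parse_moquery and affected_nodes are hypothetical, and SAMPLE only reproduces the single object visible in the output above.

```python
import re

AFFECTED_FIRMWARE = {"MU01.00", "MC02.00"}

SAMPLE = """\
Total Objects shown: 2

# eqpt.Flash
dn    : topology/pod-1/node-101/sys/ch/supslot-1/sup/flash
model : Micron_M500IT_MTFDDAT064SBD
rev   : MC02.00
"""


def parse_moquery(text: str) -> list:
    """Split saved moquery output into one dict per managed object."""
    objects = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # A "# <class>" line starts a new object.
            current = {}
            objects.append(current)
        elif ":" in line and current is not None:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return objects


def affected_nodes(text: str) -> list:
    """Return (node_id, firmware) pairs for SSDs on an affected firmware."""
    result = []
    for obj in parse_moquery(text):
        match = re.search(r"/node-(\d+)/", obj.get("dn", ""))
        if match and obj.get("rev") in AFFECTED_FIRMWARE:
            result.append((match.group(1), obj["rev"]))
    return result


if __name__ == "__main__":
    print(affected_nodes(SAMPLE))
```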
This method can be used if the switch is running ACI version 13.2(5) or later (excluding the 14.0(x) train):

leaf# tail -n 103 /mnt/pss/smartctl_full_dump.log | egrep "Device Model|Firmware Version|ATTRIBUTE_NAME|Power_On_Hours"
Device Model: Micron_M500IT_MTFDDAT064SBD
Firmware Version: MC02.00
ID# ATTRIBUTE_NAME   FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
  9 Power_On_Hours   0x0032 100   100   000    Old_age Always  -           28228

In this example output, the RAW_VALUE of Power_On_Hours is 28228. It has crossed the initial threshold of 28224, so the switch is now in symptom phase 2. If the SSD firmware is not upgraded, the switch will crash again.
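The filtered smartctl lines can also be parsed programmatically, for example when checking logs collected from many leaves. This is a rough sketch that assumes the output layout shown in the example, with the RAW_VALUE as the last whitespace-separated field and a plain integer. The function name parse_smartctl is hypothetical.

```python
import re

SAMPLE = """\
Device Model: Micron_M500IT_MTFDDAT064SBD
Firmware Version: MC02.00
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 28228
"""

INITIAL_THRESHOLD_HOURS = 28224


def parse_smartctl(text: str) -> dict:
    """Extract model, firmware and Power_On_Hours from filtered smartctl output."""
    model = re.search(r"Device Model:\s*(\S+)", text)
    firmware = re.search(r"Firmware Version:\s*(\S+)", text)
    hours = None
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; attribute 9 is Power_On_Hours.
        if len(fields) >= 10 and fields[0] == "9" and fields[1] == "Power_On_Hours":
            hours = int(fields[-1])  # assumes a plain integer RAW_VALUE
    return {
        "model": model.group(1) if model else None,
        "firmware": firmware.group(1) if firmware else None,
        "power_on_hours": hours,
    }


if __name__ == "__main__":
    info = parse_smartctl(SAMPLE)
    print(info)
    print("Threshold crossed:", info["power_on_hours"] > INITIAL_THRESHOLD_HOURS)
```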
For a permanent fix:

1. Upgrade to an ACI version noted in the "Known fixed releases" details of this bug. All of these releases include the updated SSD firmware with the fix.

OR

2. Upgrade the SSD firmware directly using the SSDUpgrader APIC app from the DC App Center: https://dcappcenter.cisco.com/ssdupgrader.html

NOTE: Using the SSDUpgrader app version 1.1+ does not require a switch reload.

Specific Steps to Address this FN: https://www.cisco.com/c/en/us/support/docs/cloud-systems-management/application-policy-infrastructure-controller-apic/217677-addressing-aci-fn72145-nexus-aci-9000-w.html