Symptom
Certain DIMMs from a specific manufacturing lot (specific date codes only) fail at a higher rate than expected. The most common failure symptom is a large number of single-bit (correctable) errors. If left untreated, the DIMM may be at higher risk of multibit (uncorrectable) errors during runtime.
On NXOS devices, single-bit correctable errors are reported with one of the following log messages:
%DEVICE_TEST-3-MCE_24HR_FAIL: Module 1 has exceeded MCE 24 hour correctable threshold of 100 with ##### correctable errors within 24 hours.
or
%DAEMON-3-SYSTEM_MSG: corrected Socket memory error count exceeded threshold: ####### in 24h - mcelog
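The two message formats above can be searched for in a saved log capture. The following Python snippet is a minimal sketch of that idea, not a Cisco tool. The function name find_ce_alerts is hypothetical, and the error counts in the sample lines (2500 and 1200) are made-up values standing in for the "#####" placeholders.

```python
import re

# Patterns built from the two NXOS messages quoted above.
MCE_24HR = re.compile(
    r"%DEVICE_TEST-3-MCE_24HR_FAIL: Module (?P<module>\d+) has exceeded MCE 24 hour "
    r"correctable threshold of (?P<threshold>\d+) with (?P<count>\d+) correctable errors"
)
MCELOG = re.compile(
    r"%DAEMON-3-SYSTEM_MSG: corrected Socket memory error count exceeded threshold: "
    r"(?P<count>\d+) in 24h"
)


def find_ce_alerts(lines):
    """Return (message_type, error_count) for every matching log line."""
    alerts = []
    for line in lines:
        m = MCE_24HR.search(line)
        if m:
            alerts.append(("MCE_24HR_FAIL", int(m.group("count"))))
            continue
        m = MCELOG.search(line)
        if m:
            alerts.append(("mcelog", int(m.group("count"))))
    return alerts


# Hypothetical sample lines: the counts are illustrative, not real data.
sample = [
    "%DEVICE_TEST-3-MCE_24HR_FAIL: Module 1 has exceeded MCE 24 hour correctable "
    "threshold of 100 with 2500 correctable errors within 24 hours.",
    "%DAEMON-3-SYSTEM_MSG: corrected Socket memory error count exceeded threshold: "
    "1200 in 24h - mcelog",
    "%SOME-5-OTHER: unrelated message",
]

for kind, count in find_ce_alerts(sample):
    print(kind, count)
```

Any line that matches indicates that the correctable-error threshold was crossed on that device. The count alone does not identify which DIMM is affected.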
On ACI devices, the impacted DIMM can be found in /mnt/pss/bootlogs/current/dmesg or in the output of the "dmesg" command. For example, the logs below confirm a faulty DIMM and identify DIMM 0 on channel 1 (CPU_SrcID#0_Channel#1_DIMM#0) as the one affected.
[ 167.751610] sbridge: HANDLING MCE MEMORY ERROR
[ 167.751614] CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010091
[ 168.415928] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 (channel:1 slot:0 page:0x53232 offset:0xfc0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 channel_mask:2 rank:0)
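When the dmesg output is long, the EDAC lines can be tallied per DIMM label to see which module accumulates correctable errors. The following Python snippet is a minimal sketch under the assumption that the EDAC lines follow the format of the sample above. The function name count_ce_per_dimm is hypothetical.

```python
import re
from collections import Counter

# Matches EDAC correctable-error (CE) lines shaped like the sample above.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def count_ce_per_dimm(lines):
    """Return a Counter mapping DIMM label to total correctable errors."""
    totals = Counter()
    for line in lines:
        m = EDAC_CE.search(line)
        if m:
            totals[m.group("label")] += int(m.group("count"))
    return totals


# Lines copied from the dmesg sample above.
sample = [
    "[ 167.751610] sbridge: HANDLING MCE MEMORY ERROR",
    "[ 167.751614] CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010091",
    "[ 168.415928] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 "
    "(channel:1 slot:0 page:0x53232 offset:0xfc0 grain:32 syndrome:0x0 - area:DRAM "
    "err_code:0001:0091 socket:0 channel_mask:2 rank:0)",
]

for label, total in count_ce_per_dimm(sample).most_common():
    print(label, total)
```

To run it against a saved copy of the file, pass the open file object instead of the sample list, for example count_ce_per_dimm(open("dmesg")).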
Conditions
This issue impacts a subset of DIMMs within a certain date range. Even inside this date range, not all DIMMs are impacted.
Workaround
This is a hardware fault. No software workaround is available for this issue.
Further Problem Description
Impacted devices:
N9K family of switches running NXOS or ACI
APIC family:
APIC-SERVER-L3
APIC-SERVER-M3
Please see the following document for additional information:
https://www.cisco.com/c/en/us/support/docs/field-notices/724/fn72464.html