...
A user has a 13G or 14G server reporting MEM errors in the iDRAC event log.
ECC memory errors are usually caused by random alpha particle bombardment. Alpha particles are part of the normal radiation that occurs every day. On occasion, an alpha particle knocks a single electron off a memory module, corrupting the data. Modern memory modules are designed to recognize these events and repair them. Each module keeps an internal counter of how many times it has repaired a memory error. A threshold is set in the BIOS; when the counter reaches it, the server is alerted that the number of memory events has exceeded that threshold.

Note: If message ID MEM8000 (Correctable memory error logging disabled for a memory device at location DIMM_XX) appears in isolation (that is, not in a similar timeframe to any corresponding MEM0005/MEM0701/MEM0702 messages), it does not result in a Post Package Repair (PPR) being scheduled for the next reboot.

Message ID MEM8000 in isolation or with a corresponding MCE (machine check exception) indicates a general failure of the DIMM module; it is not a situation where the correctable or uncorrectable buckets initially overflow. This type of memory event should be treated as a DIMM failure, and the listed DIMM module should be replaced at the customer's earliest convenience.
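The "in isolation" check in the note above can be sketched in a few lines of Python. This is only an illustration, not a Dell tool: the event tuple layout, the function name isolated_mem8000_dimms, the one-hour window used for "similar timeframe", and matching "corresponding" messages by DIMM location are all assumptions, since the article does not define them.

```python
from datetime import datetime, timedelta

# Message IDs that the article lists as "corresponding" correctable-error events.
CORRECTABLE_IDS = {"MEM0005", "MEM0701", "MEM0702"}


def isolated_mem8000_dimms(events, window=timedelta(hours=1)):
    """Return DIMM locations whose MEM8000 event has no nearby correctable event.

    events: iterable of (timestamp, message_id, dimm_location) tuples
            (hypothetical layout, not an iDRAC export format).
    window: assumed meaning of "similar timeframe"; not defined by the article.
    """
    events = list(events)
    isolated = []
    for ts, msg_id, dimm in events:
        if msg_id != "MEM8000":
            continue
        # Look for a correctable-error message on the same DIMM within the window.
        has_neighbor = any(
            other_id in CORRECTABLE_IDS
            and other_dimm == dimm
            and abs(other_ts - ts) <= window
            for other_ts, other_id, other_dimm in events
        )
        if not has_neighbor:
            isolated.append(dimm)
    return isolated


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    log = [
        (t0, "MEM8000", "DIMM_A1"),
        (t0 + timedelta(minutes=5), "MEM0701", "DIMM_A1"),
        (t0, "MEM8000", "DIMM_B2"),
    ]
    # DIMM_B2 has MEM8000 with no nearby correctable event.
    print(isolated_mem8000_dimms(log))
```

A DIMM returned by this sketch matches the case the article says to treat as a DIMM failure; the window should be tuned to whatever "similar timeframe" means in a given environment.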
What is DDR4 "self-healing"? How do these DDR4 "self-healing" capabilities (BIOS enhancements) change the recommended customer and Technical Support actions when encountering memory errors on a server?

There are two main memory-related "self-healing" BIOS enhancements that were implemented for PowerEdge servers with DDR4 running BIOS version 2.1.x and newer. These enhancements change the recommended actions to take if memory errors occur and are logged in vCenter, VxFM, dial home, or the Lifecycle log.

Note: If you are getting memory errors with DDR4 and you are running a BIOS version older than 2.1.x, update your BIOS to the latest revision to include the memory self-healing enhancements. Then reboot your node so that Post Package Repair (PPR) can run. See the Resolution section for more details.

Note: Current memory troubleshooting steps incorporate moving failing DIMMs to a different slot to confirm whether the errors follow the DIMM or remain with the DIMM slot.

If the 13G node is running BIOS 2.8.x or higher, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.

If the 14G node is running BIOS version 2.4.8 or higher, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.

1. Memory retraining - Upgrade the BIOS to 2.8.x or higher (13G) or 2.1.x or higher (14G) to enable the memory retraining enhancements for servers with DDR4 RAM installed. Memory retraining, which happens during boot, optimizes the signal timing/margining for each DIMM/slot for best access. The timing characteristics of a DIMM may change for several different reasons. Examples include but are not limited to:
   1. Changes in server memory configuration
   2. BIOS changes
   3. Different operating temperatures of the server or DIMM
   4. The general age of the DIMM

Previously, a BIOS update or a detected memory configuration change would have resulted in memory retraining during the subsequent boot. Starting with BIOS 2.1.x (14G) and 2.8.x (13G), additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at locations XX."

If any of the above errors is logged in the VC events, dial home, SEL, or Lifecycle log, memory retraining is scheduled for the next reboot (warm or cold). The BIOS automatically forces a cold reboot regardless of what is initiated.

Critical - MEM0001 - "Multi-bit memory errors detected on memory device at locations DIMM_XX."
MEM0001 results in the server rebooting due to the fatal error. Memory retraining automatically occurs during that boot.

With either type of error, correctable or uncorrectable (multi-bit), the resulting memory retraining on reboot/restart may "self-heal" the failing DIMM by optimizing the signal timing for each DIMM/slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or the same errors continue to occur.

2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size. Previously, this functionality was limited to the manufacturing process.
As with the memory retraining enhancements mentioned earlier, certain correctable memory errors result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). The BIOS automatically forces a cold reboot regardless of what is initiated. Because the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has run. Examples of the errors are:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at locations XX."

If any of the above errors is logged in the VC events, dial home, SEL, or Lifecycle log, Post Package Repair is scheduled for the next reboot (warm or cold).

After the reboot, verify that the PPR operation was successfully performed. A successful PPR operation logs a message similar to:
Message ID MEM9060 - "The Post-Package Repair operation is successfully completed on the Dual In-line memory module (DIMM) device that was failing earlier."

A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails after the reboot. An example of a failing PPR message is:
Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
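
The post-reboot decision described above can be summarized as a small lookup. The following Python sketch is only an illustration under stated assumptions: the function name post_reboot_action, its parameters, and the returned action strings are hypothetical, and the "monitor" fallback (no success or failure message found) is not something the article specifies. Only the message IDs come from the article.

```python
# Message IDs taken from the article; the action strings are paraphrases.
SCHEDULES_SELF_HEAL = {"MEM0005", "MEM0701", "MEM0702"}  # schedule retraining and PPR
FATAL_MULTIBIT = {"MEM0001"}                             # forces reboot and retraining
SELF_HEAL_FAILED = {"UEFI0106", "UEFI0278"}              # retraining failed / PPR failed
PPR_SUCCEEDED = "MEM9060"


def post_reboot_action(before_reboot, after_reboot):
    """Suggest a next step for one DIMM slot.

    before_reboot / after_reboot: message IDs logged for that slot before and
    after the reboot (hypothetical inputs; how they are collected is up to you).
    """
    before, after = set(before_reboot), set(after_reboot)
    triggers = SCHEDULES_SELF_HEAL | FATAL_MULTIBIT

    if not before & triggers:
        return "no self-healing trigger logged"
    if after & SELF_HEAL_FAILED:
        return "replace DIMM"  # retraining or PPR reported a failure
    if after & triggers:
        return "replace DIMM"  # the same errors continue to occur
    if PPR_SUCCEEDED in after:
        return "no replacement needed"
    # Assumption: nothing conclusive was logged, so keep watching the slot.
    return "monitor"


if __name__ == "__main__":
    print(post_reboot_action({"MEM0701"}, {"MEM9060"}))
    print(post_reboot_action({"MEM0005"}, {"UEFI0278"}))
```

The sketch treats a recurring trigger after the reboot as grounds for replacement even when MEM9060 is also present, which follows the article's "these same errors continue to occur" condition; adjust this if your support process allows a second repair attempt.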