
Gen4T M2400 Storage Node becomes unresponsive due to a spontaneous expander reset. The controller monitor log shows an "insane" event and Reset Cause 0:

    less 'Controller 1 Monitor_log.txt'

    10/09/19 03:53:28.202137: (LowIO=0; RawBLK=0; RawIO=0; FastIO=0; BCSyncCtr=5337)
    01/01/70 00:00:07.740135: ################# History Buffer Initialized - History Retained #############
    01/01/70 00:00:07.740376: Reset Cause 0
    01/01/70 00:00:07.740440: WARNING!!! NV FW Update data structure is insane.Not relying on the data !!!
    01/01/70 00:00:08.106073: *** THIS IS AN OFFICIAL BUILD ***
    01/01/70 00:00:08.106113: OsTimeSet: UpdatedTime = 10/9/2019 3:53:28 Time Zone 0:0
    10/09/19 03:53:28.052874: ARC Build Number : 33083
    10/09/19 03:53:29.571529: Controller Serial Number: 0XX23R2D (Cookie=XSADADED2)
    10/09/19 03:53:30.332901: INFO1: SubID=0000; DevID=555; vendID=9005; Plat=49; CpuClass=49; CacheMem=9 MB
    10/09/19 03:53:30.333005: INFO2: Base=49; CpuArch=9; CpuVar=14; Clk=1100 MHz; ExeMem=92 MB; BufMem=931 MB
    10/09/19 03:53:30.333106: INFO3: TotMem=1024 MB; KernelRev=0X70A0400,0000813B; MonRev=0X70A0400,0000813B; HWrev=0X80600100,00000000

Note: See the Additional Info section for information about how to generate an up-to-date "Controller 1 Monitor_log.txt" file.

The log contains no "BAD mirror" events before the spontaneous expander reset:

    cat 'Controller 1 Monitor_log.txt' | grep -i "BAD mirror"

There should be no output. If "BAD mirror" events are found, stop using this article.
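The symptom checks above can be combined into a single pass over the log. The following is a minimal sketch and not part of the original article: the function name check_monitor_log and its output labels are hypothetical, and the match strings are copied from the log excerpt above. Like the grep command in the article, it checks the whole file for "BAD mirror" rather than only the portion before the reset.

```shell
# Hypothetical helper (not shipped with Avamar): classify a Monitor log
# against the symptoms described in this article.
# Usage: check_monitor_log 'Controller 1 Monitor_log.txt'
check_monitor_log() {
    log="$1"
    # Any "BAD mirror" event means this article does not apply.
    if grep -qi "BAD mirror" "$log"; then
        echo "stop: BAD mirror events found"
        return 1
    fi
    # Both signatures of the spontaneous reset must be present.
    # The pattern rejects other causes that merely start with 0 (e.g. "05").
    if grep -qE 'Reset Cause 0([^0-9]|$)' "$log" &&
       grep -q "NV FW Update data structure is insane" "$log"; then
        echo "match: Reset Cause 0 with insane event"
        return 0
    fi
    echo "no match"
    return 2
}
```

A "match" result only means the log is consistent with the symptoms listed here; it does not confirm the root cause.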
The Gen4T firmware installed in the environment is old, as reported by showfwvers:

    showfwvers

    showfwvers 3.0
    ===
    BIOS | 41.91
    SERDES | 2.8
    POST | 24.50
    UEFI | 14.30
    Local SP CMD | 01.11.30.01
    BMC Main SP1 Partition | 17.30
    BMC Boot Block Partition | 01.40
    BMC SSP Partition | 02.06
    BMC Adaptive Cooling Table Partition | 01.00
    CPU0 VRD (ST Micro) | 01.07
    CPU0, Memory Channels 0/1 VRD (ST Micro) | 01.07
    CPU0, Memory Channels 2/3 VRD (ST Micro) | 01.07
    CPU1 VRD (ST Micro) | 01.05
    CPU1, Memory Channels 0/1 VRD (ST Micro) | 01.03
    CPU1, Memory Channels 2/3 VRD (ST Micro) | 01.03
    Power Supply 0 MCU | 04.27.00.01
    Power Supply 1 MCU | 04.27.00.01
    SLIC 0 CMD (303-242-100C-01) | 02.01.32.01
    SLIC 1 CMD (303-254-100C-00) | 01.08.29.01
    Drive I/O Card CMD | 03.02.93.01
    Expander SXP | 2.9.0
    Expander Boot | 0.5.0
    Expander InitStr | 0.10.0
    Expander FPGA | 21.00
    ROC DIB Controller 0 | 7.10-0 (33083)
    Physical Disk 0 (HUS72602CLAR2000) | NAM4
    ...
    Physical Disk 12 (HUSMM112 CLAR200) | C29C
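To make "old" concrete, a component's installed version can be compared with the version delivered by Hotfix 310344, as listed in the resolution steps of this article. The sketch below is illustrative only: fw_version and version_lt are hypothetical helper names, the "name | version" row layout is assumed from the listing above, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
# Hypothetical helper: print the version of one component from showfwvers
# output read on stdin. Assumes rows of the form "name | version".
# Usage: showfwvers | fw_version "BIOS"
fw_version() {
    awk -F'|' -v label="$1" '
        {
            name = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the component name
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the version field
            if (name == label) {
                split(value, parts, " ")         # drop any "(build)" suffix
                print parts[1]
                exit
            }
        }'
}

# Succeeds (exit 0) when version $1 is strictly older than version $2.
# Requires GNU sort for the -V option.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_lt "$(showfwvers | fw_version "BIOS")" 41.98 && echo "BIOS older than hotfix level"` would flag the BIOS row shown above, assuming showfwvers prints one component per line.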
The exact cause could not be determined. The expander is thought to fail to read the serial number of the node midplane, which may cause the "Unrecoverable Error" bit to be set. This can indicate a hardware issue, specifically with the TWI bus.
1. Determine the version of hwfaultd installed:

       rpm -qa | grep -i hwfault

2. If hwfaultd v1.0-10 or v1.0-11 is running, install Hotfix 304772 on the grid to gather debug hwfaultd logs, then wait for the next occurrence of this event.

3. If hwfaultd v1.0-12 or greater is running, install Hotfix 310344 on the grid. This hotfix updates the BIOS, BMC Main, and ROC firmware to the following versions, which provide a fix for the spontaneous expander resets:

       BIOS 41.98
       BMC Main 24.10
       ROC Controller Firmware 7.14-0 (33303)

If the issue recurs after both 304772 and 310344 are installed, Avamar Support must open a swarm with the Avamar SER L2 team for further investigation. The following must be included:

- Avamar: How to run "getlogs" to gather Avamar server logs
- Avamar: Gen4T Hardware: How to collect hardware logs (get-platform-logs)
- The .zip file output from the get_cdes_buffers.pl script (attached to this article)

The Avamar SER L2 team should engage Engineering to contact the CDES team to investigate further.
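The version-based choice between the two hotfixes can be sketched as a small function. This is an illustration, not an official tool: pick_hotfix is a hypothetical name, version strings are assumed to use the version-release form (for example 1.0-11), and the comparison relies on the -V option of GNU sort. Versions older than 1.0-10 are not addressed by the article, so the sketch reports them as not covered.

```shell
# Hypothetical helper: map an hwfaultd version-release string to the hotfix
# named in the resolution steps.
# Usage: pick_hotfix 1.0-11
pick_hotfix() {
    case "$1" in
        1.0-10|1.0-11)
            # Debug-logging hotfix; then wait for the next occurrence.
            echo 304772
            return 0
            ;;
    esac
    # 1.0-12 or greater: 1.0-12 must sort first (or be equal).
    if [ "$(printf '%s\n%s\n' "1.0-12" "$1" | sort -V | head -n 1)" = "1.0-12" ]; then
        echo 310344
        return 0
    fi
    echo "not covered"
    return 1
}
```

On a node, the input could come from something like `rpm -q --qf '%{VERSION}-%{RELEASE}' hwfaultd`, assuming the package is named hwfaultd; confirm the actual package name with the `rpm -qa | grep -i hwfault` command from step 1.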