The issue happens when the Self-Service Platform (SSP) firmware on the node is outdated. You can verify the firmware version by running the following commands on the logs.

SSP firmware:

    $ grep ssp cluster-{1..4}/etcifs.tar/ifs/firmware_versions
    cluster-2/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 2
    cluster-3/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 3
    cluster-4/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 4
    $ grep ssp firmware
    DEssp_infinity ePOST 02.50 2,4

The SSP firmware should be 2.80 or greater.

Nodes may reboot unexpectedly, with the following messages recorded in /var/log/isi_hwmon.log:

    2019-07-02T06:45:28-04:00 <3.6> cluster-1 isi_hwmon[5304]: Failed to get device ID from BMC: failed 1 times
    2019-07-02T06:45:46-04:00 <3.6> cluster-1 isi_hwmon[5304]: BMC back to healthy after 1 failed get device IDs
    2019-07-02T06:45:46-04:00 <3.6> cluster-1 isi_hwmon[5304]: Setting node to R/O (system-overtemp)
    2019-07-02T06:45:47-04:00 <3.5> cluster-1 isi_hwmon[5304]: HWMON EVENT: {'specifier': {'sensor_state': 1, 'sensor_data': 49154, 'sensor_name': 'Shutdown_In_Prog'}, 'severity': 3, 'event_id': 'HW_INFINITY_DELAYED_REBOOT', 'force_celog': 1, 'assert': 1, 'message': 'The node located in chassis {chassis} slot {slot} is in a delayed reboot because of the following reason: SSP Hang. As a result, the node may reset itself. Setting the node to read-only to protect the journal.', 'send_celog': True}
    2019-07-02T06:45:49-04:00 <3.5> cluster-1 isi_hwmon[5304]: HWMON EVENT: {'specifier': {'eventdata': 'Delayed Reboot In Progress', 'sensor': 'Shutdown In Progress', 'generator': 'Shutdown_In_Prog', 'index': 50}, 'severity': 3, 'event_id': None, 'assert': True, 'message': 'Shutdown In Progress: Delayed Reboot In Progress (Reason Code: 0e)', 'send_celog': None}
    2019-07-02T06:57:12-04:00 <3.5> cluster-1 isi_hwmon[2650]: Starting isi_hwmon daemon

Events recorded as:

    6.8194 07/02 06:57 C 1 163645 The node located in chassis XXXNN888888111 slot 1 is in a delayed reboot. As a result, the node may reset itself. Setting the node to read-only to protect the journal.
    6.7492 07/02 06:45 C 1 163645 The node located in chassis XXXNN888888111 slot 1 is in a delayed reboot because of the following reason: SSP Hang. As a result, the node may reset itself. Setting the node to read-only to protect the journal.
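The firmware check above can also be scripted when many nodes need reviewing. The following is a minimal Python sketch, not a Dell-provided tool. It assumes each line of grep output ends with four fields (device name, firmware type, version, node list), as in the sample output, and that versions are two dot-separated numbers. The function name `outdated_ssp` is hypothetical.

```python
import re

# Minimum SSP firmware version stated in this article (2.80).
MIN_SSP_VERSION = (2, 80)

# Assumed layout, based only on the sample grep output in this article:
#   [<path>: ]<device containing "ssp"> <type> <major>.<minor> <node list>
SSP_LINE_RE = re.compile(
    r"(?P<device>\S*ssp\S*)\s+(?P<type>\S+)\s+"
    r"(?P<version>\d+\.\d+)\s+(?P<nodes>[\d,]+)\s*$"
)


def outdated_ssp(lines):
    """Return (device, version, nodes) for SSP entries older than 2.80."""
    outdated = []
    for line in lines:
        match = SSP_LINE_RE.search(line.strip())
        if match is None:
            continue  # not an SSP firmware line in the assumed layout
        major, minor = match.group("version").split(".")
        # Compare numerically so that "02.50" is treated as (2, 50).
        if (int(major), int(minor)) < MIN_SSP_VERSION:
            outdated.append(
                (match.group("device"), match.group("version"), match.group("nodes"))
            )
    return outdated


if __name__ == "__main__":
    sample = [
        "cluster-2/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 2",
        "DEssp_infinity ePOST 02.50 2,4",
    ]
    for device, version, nodes in outdated_ssp(sample):
        print(f"{device} {version} on node(s) {nodes} is below 2.80")
```

The version is compared as a pair of integers rather than as a string or float, so a value such as 02.50 is not misread because of its leading zero.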
When OneFS detects that an SSP (Secondary Service Processor) is hung, it tries to reboot the system proactively. The secondary core on the Baseboard Management Controller (BMC) is responsible for servicing all real-time operations. It is this real-time operation service of the BMC that hangs, and the system is rebooted proactively to avoid further issues such as data unavailability or data loss (DU/DL).
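To confirm that a reboot was caused by this SSP hang, the delayed-reboot events can be picked out of /var/log/isi_hwmon.log. The following is a minimal Python sketch under stated assumptions: each HWMON EVENT entry sits on a single line, and its payload is a Python-style dictionary literal, as in the log excerpt in this article. The function name `find_ssp_hang_events` is hypothetical and not part of OneFS.

```python
import ast
import re

# Assumed line layout, based on the isi_hwmon.log excerpt in this article:
#   <timestamp> <facility.level> <node> isi_hwmon[<pid>]: HWMON EVENT: {...}
HWMON_EVENT_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+<[\d.]+>\s+(?P<node>\S+)\s+"
    r"isi_hwmon\[\d+\]:\s+HWMON EVENT:\s+(?P<payload>\{.*\})\s*$"
)


def find_ssp_hang_events(lines):
    """Return (timestamp, node) for delayed-reboot events caused by an SSP hang."""
    hits = []
    for line in lines:
        match = HWMON_EVENT_RE.match(line.strip())
        if match is None:
            continue  # not an HWMON EVENT line
        try:
            # The payload looks like a Python dict literal (True/None included).
            event = ast.literal_eval(match.group("payload"))
        except (ValueError, SyntaxError):
            continue  # truncated or unexpected payload; skip rather than guess
        if not isinstance(event, dict):
            continue
        is_delayed_reboot = event.get("event_id") == "HW_INFINITY_DELAYED_REBOOT"
        mentions_ssp_hang = "SSP Hang" in str(event.get("message", ""))
        if is_delayed_reboot and mentions_ssp_hang:
            hits.append((match.group("timestamp"), match.group("node")))
    return hits


if __name__ == "__main__":
    # Path taken from this article; run on a node or against an extracted log.
    with open("/var/log/isi_hwmon.log", encoding="utf-8", errors="replace") as log:
        for timestamp, node in find_ssp_hang_events(log):
            print(f"{timestamp} {node}: delayed reboot due to SSP Hang")
```

`ast.literal_eval` is used instead of `eval` so that log content is parsed only as a literal and never executed.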
Upgrade the Node Firmware Package to 10.3.0.