Symptoms
The issue occurs when the Secondary Service Processor (SSP) firmware on the node is outdated. You can verify the SSP firmware version by running the following commands against the collected logs:
$ grep ssp cluster-{1..4}/etcifs.tar/ifs/firmware_versions
cluster-2/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 2
cluster-3/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 3
cluster-4/etcifs.tar/ifs/firmware_versions: DEssp_infinity ePOST 02.50 4
$ grep ssp firmware
DEssp_infinity ePOST 02.50 2,4
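The version check can also be scripted. The following is a minimal sketch, not a supported tool: `check_ssp_firmware` is a hypothetical helper name, and it assumes the version is the field immediately after `ePOST` (as in the output above) and that versions compare correctly as decimal numbers (for example, 02.50 is lower than 2.80).

```shell
# Hypothetical helper (not part of OneFS): reads firmware_versions-style
# lines on stdin and prints every SSP entry below the minimum version.
# Assumption: the version is the field right after "ePOST" and can be
# compared as a decimal number.
check_ssp_firmware() {
    min="${1:-2.80}"
    awk -v min="$min" '
        /ssp/ {
            for (i = 1; i < NF; i++) {
                if ($i == "ePOST") {
                    ver = $(i + 1)
                    # Adding 0 forces a numeric comparison ("02.50" -> 2.5)
                    if (ver + 0 < min + 0)
                        print "OUTDATED: " $0
                }
            }
        }
    '
}
```

Example usage: `grep ssp firmware | check_ssp_firmware`. No output means no SSP entry below the minimum was found in the input.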
The SSP firmware should be version 2.80 or later.
Nodes may reboot unexpectedly, with the following messages recorded in /var/log/isi_hwmon.log:
2019-07-02T06:45:28-04:00 cluster-1 isi_hwmon[5304]: Failed to get device ID from BMC: failed 1 times
2019-07-02T06:45:46-04:00 cluster-1 isi_hwmon[5304]: BMC back to healthy after 1 failed get device IDs
2019-07-02T06:45:46-04:00 cluster-1 isi_hwmon[5304]: Setting node to R/O (system-overtemp)
2019-07-02T06:45:47-04:00 cluster-1 isi_hwmon[5304]: HWMON EVENT: {'specifier': {'sensor_state': 1, 'sensor_data': 49154, 'sensor_name': 'Shutdown_In_Prog'}, 'severity': 3, 'event_id': 'HW_INFINITY_DELAYED_REBOOT', 'force_celog': 1, 'assert': 1, 'message': 'The node located in chassis {chassis} slot {slot} is in a delayed reboot because of the following reason: SSP Hang. As a result, the node may reset itself. Setting the node to read-only to protect the journal.', 'send_celog': True}
2019-07-02T06:45:49-04:00 cluster-1 isi_hwmon[5304]: HWMON EVENT: {'specifier': {'eventdata': 'Delayed Reboot In Progress', 'sensor': 'Shutdown In Progress', 'generator': 'Shutdown_In_Prog', 'index': 50}, 'severity': 3, 'event_id': None, 'assert': True, 'message': 'Shutdown In Progress: Delayed Reboot In Progress (Reason Code: 0e)', 'send_celog': None}
2019-07-02T06:57:12-04:00 cluster-1 isi_hwmon[2650]: Starting isi_hwmon daemon
The events are recorded as:
6.8194 07/02 06:57 C 1 163645 The node located in chassis XXXNN888888111 slot 1 is in a delayed reboot. As a result, the node may reset itself. Setting the node to read-only to protect the journal.
6.7492 07/02 06:45 C 1 163645 The node located in chassis XXXNN888888111 slot 1 is in a delayed reboot because of the following reason: SSP Hang. As a result, the node may reset itself. Setting the node to read-only to protect the journal.
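To check whether a node has logged this condition, the hardware monitor log can be searched for the "SSP Hang" reason string shown in the messages above. This is a minimal sketch: `find_ssp_hang` is a hypothetical helper name, and the default path is the log location given in this article.

```shell
# Hypothetical helper (not part of OneFS): prints log lines that report
# an SSP hang. The default path is the log file named in this article.
find_ssp_hang() {
    log="${1:-/var/log/isi_hwmon.log}"
    # -F matches the reason string literally rather than as a regex
    grep -F 'SSP Hang' "$log"
}
```

Example usage: `find_ssp_hang` on the affected node, or `find_ssp_hang path/to/isi_hwmon.log` against an extracted copy of the log.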
Cause
When OneFS detects that the SSP (Secondary Service Processor) is hung, it proactively reboots the system. The secondary core on the Baseboard Management Controller (BMC) is responsible for servicing all real-time operations. When this real-time service on the BMC hangs, the node is rebooted proactively to avoid further issues such as data unavailability or data loss (DU/DL).
Resolution
Upgrade the Node Firmware Package to 10.3.0.