Symptoms
A VxRail node in a vSAN cluster that is equipped with an HBA355 may report a Purple Screen with the message "... disk name: naa.5000xxxxxxxxx detected suspended I/Os...", or may report that all drives attached to the HBA are missing, because of an issue with the lsi-msgpt35 driver. This can result in any or all of the following:
- Purple Screen or system crash
- Drives connected to the HBA are no longer detected
- System sluggishness
The issue can occur only within a small window that recurs approximately every 49 days of uptime when using lsi-msgpt35 driver version 18.00.01.00 (or earlier).
The vobd.log may also contain entries similar to the ones below:
PDL offline errors:
2022-01-01T01:23:45.678Z: [vSANCorrelator] 4295047470123us: [esx.problem.vob.vsan.pdl.offline] vSAN device 72e18922-407d-4a94-b423-a6537d851241 has gone offline.
2022-01-01T01:24:45.678Z: [vSANCorrelator] 4295024201345us: [esx.problem.vob.vsan.pdl.offline] vSAN device 8a83db83-fd02-465b-b584-760f98b7c0ea has gone offline.
2022-01-01T01:25:32.138Z: [vSANCorrelator] 4295036216589us: [esx.problem.vob.vsan.pdl.offline] vSAN device 4180c5c8-34f7-4fab-9c1c-aaf699350f9b has gone offline.
Storage Connectivity Lost:
2022-01-01T01:23:51.678Z: [scsiCorrelator] 4295047470123us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.58ce38ee21a26b79. Path vmhba3:C0:T6:L0 is down. Affected datastores: Unknown.
2022-01-01T01:24:51.678Z: [scsiCorrelator] 4295024201345us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.58ce38ee21a26b79. Path vmhba3:C0:T6:L0 is down. Affected datastores: Unknown.
2022-01-01T01:25:38.138Z: [scsiCorrelator] 4295036216589us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.58ce38ee21a26b79. Path vmhba3:C0:T6:L0 is down. Affected datastores: Unknown.
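To check a copy of vobd.log for the two event types shown above, a short script along the following lines may help. This is a minimal sketch, not a Dell or VMware tool: the function names `find_hba_events` and `summarize` are hypothetical, and the regular expression assumes every entry follows the same layout as the samples in this article.

```python
import re
from collections import Counter

# Event IDs copied from the sample vobd.log entries in this article.
EVENT_IDS = (
    "esx.problem.vob.vsan.pdl.offline",
    "esx.problem.storage.connectivity.lost",
)

# Assumed layout: "<timestamp>: [<correlator>] <uptime>us: [<event id>] <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+): \[(?P<correlator>\w+)\] (?P<uptime_us>\d+)us: "
    r"\[(?P<event>[\w.]+)\] (?P<message>.*)$"
)


def find_hba_events(lines):
    """Return parsed entries whose event ID is one of EVENT_IDS (hypothetical helper)."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("event") in EVENT_IDS:
            entry = match.groupdict()
            # Uptime is logged in microseconds; keep it as an integer.
            entry["uptime_us"] = int(entry["uptime_us"])
            entries.append(entry)
    return entries


def summarize(entries):
    """Count matching entries per event ID."""
    return Counter(entry["event"] for entry in entries)


# Two of the sample lines from this article, used as test input.
SAMPLE_LINES = [
    "2022-01-01T01:23:45.678Z: [vSANCorrelator] 4295047470123us: "
    "[esx.problem.vob.vsan.pdl.offline] vSAN device "
    "72e18922-407d-4a94-b423-a6537d851241 has gone offline.",
    "2022-01-01T01:23:51.678Z: [scsiCorrelator] 4295047470123us: "
    "[esx.problem.storage.connectivity.lost] Lost connectivity to storage device "
    "naa.58ce38ee21a26b79. Path vmhba3:C0:T6:L0 is down. Affected datastores: Unknown.",
]

if __name__ == "__main__":
    for event, count in summarize(find_hba_events(SAMPLE_LINES)).items():
        print(event, count)
```

The parsed `uptime_us` field is the host uptime in microseconds at the time of the event. In the samples above it is roughly 4,295,000 seconds, which is about 49 days 17 hours.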
Cause
This issue occurs because the lsi-msgpt35 ESXi driver does not wait for completion of some commands that are issued when the system uptime reaches a certain time window (every 49 days 17 hours 2 min 47.295 seconds). When system uptime reaches this window or any multiple of it, any driver command is treated as a timeout. Controller reset commands from the driver also fail if issued within this window, and the HBA loses communication with all drives. The window lasts only a few milliseconds, and it recurs after every 49 days 17 hours 2 min 47.295 seconds of uninterrupted uptime.
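The interval arithmetic can be sketched as follows, for example to estimate how long a host has before its uptime next reaches the window. This is an illustrative calculation only: the names `WINDOW_INTERVAL` and `time_until_next_window` are hypothetical, the uptime value is assumed to be supplied by the caller, and the sketch does not model the few-millisecond width of the window itself.

```python
from datetime import timedelta

# Interval stated in this article: 49 days 17 hours 2 min 47.295 seconds,
# which is 4,294,967.295 seconds in total.
WINDOW_INTERVAL = timedelta(days=49, hours=17, minutes=2, seconds=47.295)


def time_until_next_window(uptime: timedelta) -> timedelta:
    """Return the uptime remaining before the next multiple of WINDOW_INTERVAL.

    If uptime is exactly on a multiple, the result is one full interval,
    that is, the time until the following occurrence.
    """
    elapsed_in_cycle = uptime % WINDOW_INTERVAL
    return WINDOW_INTERVAL - elapsed_in_cycle


if __name__ == "__main__":
    # Example input: a host that has been up for 40 days (assumed value).
    print(time_until_next_window(timedelta(days=40)))
```

Because the window is tied to uninterrupted uptime, a reboot restarts the count from zero.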
Resolution
For VxRail and VCF on VxRail solutions, the releases listed below include an updated LSI driver that resolves this issue.
- VxRail 7.0.x: VxRail 7.0.320 or later
- VCF 4.x: VCF 4.4 or later
The LCM update workflow is the only supported method to correct this issue. LCM update packages are available online through the VxRail plug-in in vCenter or by direct download from the Dell Technologies support site.