...
Note: This is an extremely rare condition that requires multiple factors to occur simultaneously.

Affects ESXi versions before 7.0 Update 3 (19193900).

When using vSAN with a storage controller that uses the lsi_msgpt35 driver, a host can encounter one or more of the following:
- Stuck I/O condition (see KB 71207 for further details)
- Disks or disk groups are taken offline
- PSOD (no PSOD trace is available at this time)

You may see messages similar to the following in the vmkernel log (specific times and details will vary), with path, driver, and SCSI H:0x8 errors:

2021-11-24T12:53:09.824Z cpu33:2098069)HPP: HppThrottleLogForDevice:1070: Error status H:0x8 D:0x0 P:0x0 . from device naa.5000c500a19ca6ef repeated 10240 times, hppAction = 3
2021-11-24T12:53:10.596Z cpu12:2098067)ScsiDeviceIO: 4277: Cmd(0x45bc65297ac0) 0x88, CmdSN 0x6c6e100c from world 0 to dev "naa.5000c500a19ca9a3" failed H:0x8 D:0x0 P:0x0
2021-11-24T12:53:11.133Z cpu2:2098067)ScsiDeviceIO: 4277: Cmd(0x45bc6527e0c0) 0x28, CmdSN 0x7b456b19 from world 0 to dev "naa.5000c500a1999bb3" failed H:0x8 D:0x0 P:0x0
2021-11-24T12:53:13.004Z cpu1:10484749) [HB state abcdef02 offset 4161536 gen 131 stampUS 4294797547629 uuid 615cad72-4a3176a6-3710-4c52624f2444 jrnl <FB 8388608> drv 24.82 lockImpl 4 ip 192.168.195.117]
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_static_config_pages: 4929: TimeSyncInterval value read from Manufacturing page-11 is zero. Periodic Time-Sync will be disabled.
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_display_ioc_capabilities: 4606: SAS3408: FWVersion(14.00.02.00), ChipRevision(0x01), BiosVersion(00.00.00.00)
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_display_ioc_capabilities: 4613: FWPackageVersion(14.00.02.06)
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _base_send_port_enable: 5362: Command terminated due to timeout
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _debug_dump: 244: Port enable request dump
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: offset:data
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: [0x00]:06000000
2021-11-24T12:53:13.061Z cpu1:2103503)WARNING: lsi_msgpt35_0: _base_make_ioc_operational: 5625: Port Enable failed - Timeout
2021-11-24T12:53:13.061Z cpu56:2097936)lsi_msgpt35_0: _scsih_remove_device: 9717: ENTER: C0:T1, handle(0x0000), sas_addr(0x300705b01088f6e0), portId(0)
2021-11-24T12:53:13.061Z cpu56:2097936)lsi_msgpt35_0: _scsih_remove_device: 9721: ENTER: enclosure level(0x0000), connector name(C1 )
2021-11-24T12:53:13.061Z cpu56:2097936)WARNING: ScsiPath: 11252: Path lost for adapter vmhba0 target 1 channel 0 lun 0
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: _ctl_process_mpt_command: 1267: Command terminated due to timeout
2021-11-24T12:53:13.061Z cpu1:2103503)lsi_msgpt35_0: msgpt_afd_release: 1431: Diagnostic Trace Buffer was already released
2021-11-24T12:53:13.061Z cpu4:2097584)ScsiPath: 9180: DeletePath : adapter=vmhba0, channel=0, target=1, lun=0
2021-11-24T12:53:13.061Z cpu4:2097584)HPP: HppUnclaimPath:3861: Unclaiming path vmhba0:C0:T1:L0
2021-11-24T12:53:13.061Z cpu4:2097584)ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
[...] D:0x0 P:0x0 . from device naa.5000c500a1999bb3 repeated 81920 times, hppAction = 3
2021-11-24T12:53:13.500Z cpu4:2097584)ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
2021-11-24T12:53:13.749Z cpu12:2097584)ScsiDevice: 10527: device mpx.vmhba0:C0:T1:L0 refCount is 3; waiting for 1.
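
To check a saved copy of the vmkernel log for the kinds of messages shown above, a small script along the following lines may help. This is only a sketch: the pattern names and the function scan_vmkernel_lines are hypothetical, the regular expressions are derived solely from the sample messages in this article, and a match by itself does not confirm this issue.

```python
import re

# Hypothetical pattern set, built only from the sample log lines in this
# article. Real logs may word these messages differently.
PATTERNS = {
    "host_status_0x8": re.compile(r"failed H:0x8 D:0x0 P:0x0|Error status H:0x8"),
    "port_enable_timeout": re.compile(
        r"lsi_msgpt35_\d+: .*Port Enable failed - Timeout"
    ),
    "path_lost": re.compile(r"ScsiPath: \d+: Path lost for adapter"),
}


def scan_vmkernel_lines(lines):
    """Count how many lines match each pattern of interest."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python scan_vmkernel.py vmkernel.log
    with open(sys.argv[1], errors="replace") as handle:
        for name, count in scan_vmkernel_lines(handle).items():
            print(f"{name}: {count}")
```

On an ESXi host the log is typically found at /var/log/vmkernel.log; copy it off the host before scanning it.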
lsi_msgpt35 driver versions from 15.xx through 18.00.01.00 have a bug related to the uptime of the drive, which opens a potential window for this issue every 49 days 17 hours 2 minutes 47.295 seconds of uptime. The window lasts a few milliseconds, until the counter resets to 0. During this window the controller loses access to the drives. If certain IOCTL commands are issued to the drives within this window, further I/O becomes stuck and I/O timeouts occur, with no way to clear the stuck I/O. Once vSAN detects the stuck I/O, it initiates a PSOD or takes the affected disk group offline (see KB 71207 for further details).
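
The interval quoted above converts to 4,294,967,295 milliseconds, which equals 2^32 - 1. That is consistent with a 32-bit millisecond counter wrapping around, although the article itself only says that the counter resets to 0. The sketch below verifies the arithmetic and estimates the time remaining until the next window for a given uptime. The function name ms_until_next_window is hypothetical, and the calculation assumes the window recurs at exactly the stated interval, measured from zero uptime.

```python
from datetime import timedelta

# Interval from the article: 49 days 17 hours 2 minutes 47.295 seconds,
# expressed in milliseconds.
PERIOD_MS = ((49 * 24 + 17) * 3600 + 2 * 60 + 47) * 1000 + 295


def ms_until_next_window(uptime_ms: int) -> int:
    """Milliseconds until the next window, assuming windows occur at
    every whole multiple of PERIOD_MS of uptime (an assumption)."""
    return PERIOD_MS - (uptime_ms % PERIOD_MS)


if __name__ == "__main__":
    # The stated interval is exactly 2**32 - 1 milliseconds.
    print(PERIOD_MS == 2**32 - 1)
    # Example: 40 days of uptime.
    uptime = int(timedelta(days=40).total_seconds() * 1000)
    print(timedelta(milliseconds=ms_until_next_window(uptime)))
```

This is useful only as a rough planning aid, for example to see whether a host is approaching a multiple of the interval before the driver can be upgraded.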
This carries the same risk as any PSOD or disk group offline event in vSAN. During a PSOD, VMs running on the host will crash. Depending on the storage policy in use and its compliance state, data may be unavailable until the host is rebooted.
Upgrade to lsi_msgpt35 version 18.00.02.00 or higher (as per vSAN HCL guidance for your build and controller) as soon as possible. The issue is fixed in the inbox driver of ESXi 7.0 Update 3 (19193900).
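
The version range given in this article can be expressed as a simple comparison, sketched below. The helper names are hypothetical. Two assumptions are made: "15.xx" is treated as starting at 15.00.00.00, and any version below the fixed 18.00.02.00 release is treated as affected. The parser expects a dotted numeric version and ignores anything after a hyphen; always confirm the result against the vSAN HCL for your build and controller.

```python
FIRST_AFFECTED = (15, 0, 0, 0)  # assumption: lower bound for "15.xx"
FIRST_FIXED = (18, 0, 2, 0)     # fixed driver version named in this article


def parse_version(text):
    """Turn '17.00.02.00' (optionally with a '-suffix') into (17, 0, 2, 0)."""
    numeric_part = text.strip().split("-")[0]
    return tuple(int(part) for part in numeric_part.split("."))


def is_affected(driver_version):
    """True if the version falls in the affected range described above."""
    version = parse_version(driver_version)
    return FIRST_AFFECTED <= version < FIRST_FIXED


if __name__ == "__main__":
    for candidate in ("17.00.02.00", "18.00.01.00", "18.00.02.00"):
        print(candidate, is_affected(candidate))
```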
If a PSOD is encountered, reboot the host to clear the condition. If a disk group goes offline, reboot the host to clear the condition, then recreate the disk group.
For more information, see the Lenovo advisory on this issue: https://support.lenovo.com/ie/en/solutions/ht512561-esxi-node-psod-or-multiple-drives-missing-lenovo-thinksystem