...
Symptom 1: "DIF Error" entries in the vmkernel logs

2020-04-06T08:25:12.154Z cpu74:2098447)qlnativefc: vmhba2(af:0.0): iocb(s) 0x430a045f7340 Returned STATUS.
2020-04-06T08:25:12.154Z cpu74:2098447)qlnativefc: vmhba2(af:0.0): DIF ERROR in cmd: 0x28 Type=0x0 lba=0xb100 actRefTag=0x1000000, expRefTag=0xb100, actAppTag=0x0, expAppTag=0x0, actGuard=0x400, expGuard=0xa671.
2020-04-06T08:25:12.154Z cpu74:2098447)ScsiDeviceIO: 3449: Cmd(0x45a74c43f980) 0x28, CmdSN 0x70af4 from world 2100597 to dev "naa.60060e80165251000001973100000105" failed H:0xf D:0x0 P:0x0 Invalid sense data: 0x1a 0x1b 0x45.
2020-04-06T08:25:12.155Z cpu0:2107305)qlnativefc: vmhba2(af:0.0): iocb(s) 0x430a04577780 Returned STATUS.
2020-04-06T08:25:12.155Z cpu0:2107305)qlnativefc: vmhba2(af:0.0): DIF ERROR in cmd: 0x28 Type=0x0 lba=0xb100 actRefTag=0x1000000, expRefTag=0xb100, actAppTag=0x0, expAppTag=0x0, actGuard=0x400, expGuard=0xa671.
2020-04-06T08:25:12.155Z cpu2:2098444)ScsiDeviceIO: 3449: Cmd(0x459b62387340) 0x28, CmdSN 0x70af5 from world 2100597 to dev "naa.60060e80165251000001973100000105" failed H:0xf D:0x0 P:0x0 Invalid sense data: 0x4f 0x2 0x43.

Additional symptoms might also include:
- Disk I/O operation failures reported by VMs
- Filesystems in Linux guest OS become read-only due to underlying disk I/O latency or failure
- Unresponsive or sluggish ESXi host
- A high number of H:0xf SCSI error codes in the vmkernel logs
- A large number of issues were observed with qlnativefc driver versions 3.1.29.0 and 3.1.31.0, but the issue is not limited to these driver versions

Symptom 2: "Data Integrity Field (DIF) Error" entries appear in the VMkernel logs only if debug logging is enabled in the qlnativefc driver

2020-07-31T10:01:02.130Z cpu0:66211)qlnativefc: vmhba1(37:0.0): Data phase error, rediscover DIF capability : senseKey = 0x5 : asc = 0x4b : ascq = 0x82
2020-07-31T10:01:02.143Z cpu17:480587)qlnativefc: vmhba1(37:0.0): Data phase error, rediscover DIF capability : senseKey = 0x5 : asc = 0x4b : ascq = 0x82
2020-07-31T10:01:02.144Z cpu0:2122267)qlnativefc: vmhba1(37:0.0): Data phase error, rediscover DIF capability : senseKey = 0x5 : asc = 0x4b : ascq = 0x82

Additional symptoms might also include:
- Disk I/O operation failures reported by VMs
- Unresponsive virtual machines
- Unresponsive or sluggish ESXi host
- hostd reported as unresponsive
- The VMkernel logs show a large number of "state in doubt; requested fast path state update" messages:

2020-07-26T22:40:56.347Z cpu17:66374)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.xxx.ID" state in doubt; requested fast path state update...
2020-07-26T22:40:56.546Z cpu17:66374)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.xxx.ID" state in doubt; requested fast path state update...
2020-07-26T22:41:56.346Z cpu0:66374)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.xxx.ID" state in doubt; requested fast path state update...
- I/O errors reported against LUNs, specifically those that have DIF disabled on the storage array:

2020-07-29T06:15:00.439Z cpu23:65624)ScsiDeviceIO: 3015: Cmd(0x439a86176840) 0x28, CmdSN 0x49855f from world 0 to dev "naa.xxx-ID" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2020-07-29T06:15:00.439Z cpu15:2952583)Partition: 427: Failed read for "naa.xxx-ID": I/O error
2020-07-29T06:15:00.439Z cpu15:2952583)Partition: 1007: Failed to read protective mbr on "naa.xxx-ID" : I/O error

This was mostly observed with the combination of HP 3PAR (3PARdata) arrays that have DIF-disabled LUNs and the qlnativefc driver, which has DIF enabled by default.

In such cases the HBA failure counters (Failed Commands, Failed Blocks Read, Failed Blocks Written) are very high; refer to the HBA statistics for this information:

vmhba1:
  Successful Commands: 728172193
  Blocks Read: 44452702311
  Blocks Written: 13163331407
  Read Operations: 325720036
  Write Operations: 355132847
  Reserve Operations: 6142
  Reservation Conflicts: 431652
  Failed Commands: 48551498
  Failed Blocks Read: 403288824638317
  Failed Blocks Written: 311812910615
  Failed Read Operations: 48078480
  Failed Write Operations: 50916

The issue was observed in the following driver versions:
- vSphere 6.5: 2.1.96.0
- vSphere 6.7: 3.1.31.0
The issue was also observed in vSphere 7.0.

The issue was mostly observed with QLogic QLE2690 single-port 16Gb Fibre Channel adapters; QLogic ISP2532-based 8Gb Fibre Channel adapters from the same vendor had no reported issues.
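A few ESXi shell checks can help confirm these symptoms. This is a minimal sketch that assumes the default vmkernel log location and a standard qlnativefc installation; paths and output formats can vary between ESXi builds.

# Count DIF errors and H:0xf terminations logged by the driver (default log path assumed)
$ grep -c "DIF ERROR" /var/log/vmkernel.log
$ grep -c "H:0xf" /var/log/vmkernel.log

# List HBAs and the driver that claims them, to identify the adapters using qlnativefc
$ esxcli storage core adapter list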
The 3PAR storage array reports an illegal request made by the driver. The QLogic driver would mistakenly send Data Integrity Field (DIF) enabled I/O to a LUN that did not support it, resulting in the errors seen in the logs.
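To check on a given host whether the driver currently has DIF protection enabled, the qlnativefc module parameters can be inspected. This is a sketch that assumes the ql2xt10difvendor parameter referenced in the workaround below; the available parameters and their defaults can differ between driver releases.

# Show the qlnativefc module parameters, filtered to DIF-related entries (e.g. ql2xt10difvendor)
$ esxcli system module parameters list -m qlnativefc | grep -i dif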
To enable debug logging on the qlnativefc driver, follow the steps below.

NOTE: The DIF errors appear in the VMkernel logs only when the driver logging level is changed to debug mode.

To enable dynamically, without requiring a reboot:
$ /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -i MOD_PARM/qlogic -s scsi-qlaenable-log -k DRIVERINFO

To make the setting persistent across reboots (reboot required):
$ esxcfg-module -s "ql2xextended_error_logging=1" qlnativefc

Once debug logging is enabled, the DIF errors are reported in the logs.

The issue is fixed in the QLogic driver versions listed below:
- vSphere 6.5: QLogic driver version 2.1.101.0
- vSphere 6.7: QLogic driver version 3.1.36.0
- vSphere 7.0: QLogic driver version 4.1.14.0

To disable debug logging on the qlnativefc driver, use one of the following commands:
$ esxcfg-module -s "ql2xextended_error_logging=0" qlnativefc (this command requires a reboot)
$ /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -i MOD_PARM/qlogic -s scsi-qladisable-log -k DRIVERINFO (this command does not require a reboot)

NOTE: Symptom 1 above was only noticed with the vSphere 6.7 version of the qlnativefc driver; however, the fix and the workaround for both Symptom 1 and Symptom 2 are the same, and the fixes are available in the driver releases listed above.
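After updating to one of the fixed driver releases, the installed version can be confirmed from the ESXi shell. This is a sketch assuming the driver ships as the standard qlnativefc VIB; image profile or vendor tools report the same information.

# Confirm the installed qlnativefc version matches or exceeds the fixed release for this vSphere version
$ esxcli software vib list | grep -i qlnativefc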
The workaround is to disable DIF in the driver. Follow the step below; this will address the reported DIF errors:

$ esxcfg-module -s "ql2xt10difvendor=0" qlnativefc

An ESXi host reboot is required after disabling the parameter in the driver.
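A typical way to apply and verify the workaround is sketched below. The verification step is not part of the original instructions but uses the same esxcfg-module utility to read back the configured options.

# Disable DIF in the qlnativefc driver (use straight quotes)
$ esxcfg-module -s "ql2xt10difvendor=0" qlnativefc

# Reboot the ESXi host for the parameter to take effect
$ reboot

# After the host comes back up, confirm the option is set on the module
$ esxcfg-module -g qlnativefc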