Issue Description

ESXi hosts lose access to ScaleIO volumes due to IO_FAULT_RESERVATION_CONFLICT.

Scenario

This can happen when the configuration of one of the hosts sharing the same volumes changes, for example when:
- one of the ESXi hosts is rebooted
- the volume is mapped to a new ESXi host

Symptoms

Error messages, including but not limited to the following, may be seen in the VMkernel log:

2017-07-10T17:57:40.059Z cpu2:32967)HBX: 2961: Waiting for timed out [HB state abcdef02 offset 3526656 gen 51 stampUS 6291663389308 uuid 5903bfd0-a9cf888e-56c8-1402ec750038 jrnl drv 14.60] on vol ''
...
2017-07-10T17:57:44.312Z cpu30:32851)ScsiDeviceIO: 2369: Cmd(0x412e88aa3740) 0x2a, CmdSN 0x58dbf4 from world 32813 to dev "eui." failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x0 0x0.
...
2017-07-10T17:58:19.646Z cpu9:32872)HBX: 612: Reading HB at 3527168 on vol '' failed: Timeout
...
2017-07-10T17:59:33.661Z cpu3:34122)Fil3: 15438: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2017-07-10T17:59:33.661Z cpu3:34122)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded
2017-07-10T17:59:33.682Z cpu18:34122)Fil3: 15438: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2017-07-10T17:59:33.682Z cpu18:34122)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded
2017-07-10T17:59:33.751Z cpu18:34122)Fil3: 15438: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2017-07-10T17:59:33.751Z cpu18:34122)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded

With newer versions of the SDC, the scini module might indicate reservation conflicts:

2017-07-10T21:42:40.839Z cpu12:33503)scini: mapVolIO_ReportIOErrorIfNeeded:361: ScaleIO R2_0:[211482] IO-ERROR comb: . offsetInComb 0. SizeInLB 1. SDS_ID . Comb Gen b. Head Gen 2653.
2017-07-10T21:42:40.839Z cpu12:33503)scini: mapVolIO_ReportIOErrorIfNeeded:374: ScaleIO R2_0:Vol ID 0x. Last fault Status SUCCESS(65). Last error Status SUCCESS(65) Reason (reservation conflict) Retry count (0) chan (1)
2017-07-10T21:42:40.839Z cpu12:33503)scini: blkScsi_PrintIOInfo:3304: ScaleIO R2_0:hCmd 0x4136803d9d40, OpCode 0x28, rc 53 scsiStat 24, senseCode 5, asc 0, ascq 0
2017-07-10T21:42:41.792Z cpu12:32833)ScsiDeviceIO: 2369: Cmd(0x4136803d9d40) 0x28, CmdSN 0x3bb from world 0 to dev "eui." failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x0 0x0.
2017-07-10T21:42:41.793Z cpu34:33507)Partition: 423: Failed read for "eui.": I/O error
2017-07-10T21:42:41.793Z cpu34:33507)Partition: 1003: Failed to read protective mbr on "eui." : I/O error
2017-07-10T21:42:41.793Z cpu34:33507)WARNING: Partition: 1112: Partition table read from device eui. failed: I/O error
2017-07-10T21:42:41.793Z cpu34:33507)ScsiDevice: 3445: Successfully registered device "eui." from plugin "NMP" of type 0
2017-07-10T21:42:41.793Z cpu34:33507)ScsiEvents: 301: EventSubsystem: Device Events, Event Mask: 180, Parameter: 0x410ae65aa220, Registered!

In the SDS trc log:

10/07 13:58:30.985746 0x7efcf296ceb0:ioh_NewRequest:05490: Write to comb - Done rc is IO_FAULT_RESERVATION_CONFLICT (Lba 710960 16), volume (dit)

Impact

The entire cluster may lose access to the volumes, or experience severe performance degradation.
One ESXi node has reserved the volume, preventing all other nodes from accessing it.
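To confirm these symptoms on an affected host, the VMkernel log can be searched for the signatures quoted above. A minimal sketch, assuming logs are in the standard ESXi location /var/log/vmkernel.log (adjust the path if logging is redirected to a remote syslog server):

```shell
# check_reservation_conflicts LOGFILE
# Prints log lines matching the reservation-conflict signatures shown above:
# the scini "reservation conflict" reason, the IO_FAULT_RESERVATION_CONFLICT
# return code, and SCSI commands failing with host status H:0x5.
check_reservation_conflicts() {
  grep -iE "reservation conflict|IO_FAULT_RESERVATION_CONFLICT|failed H:0x5" "$1"
}

# On an ESXi host this would typically be invoked as:
#   check_reservation_conflicts /var/log/vmkernel.log
```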
Workaround

Do not unmap volumes from, or remove, SDCs. This will not clear the SCSI reservation and may make the issue more difficult to troubleshoot.

As an immediate workaround, release the reservation on the LUN, or reset the LUN:

The release command must be issued from the holder of the reservation. The initiator that holds the reservation can be identified by following this VMware KB: https://kb.vmware.com/kb/10051

From that host, issue this command to release the reservation:

vmkfstools -L release /vmfs/devices/disks/eui.

Otherwise, from any host that has the volume mapped, issue the following command to reset the volume. This drops all in-flight IOs to the volume and should be used with caution. Also see VMware KB: https://kb.vmware.com/kb/1002293

vmkfstools -L lunreset /vmfs/devices/disks/eui.

This issue is seen when the hosts are configured differently with regard to datastore locking, or when a previous configuration was not effective on all hosts and rebooting one of them made it behave differently from the others. To avoid this issue in the future, ensure all hosts sharing the same volumes and datastores are configured identically, and that any new configuration is loaded and effective on all hosts at the same time. Also, prefer ATS over SCSI-2 reservations where possible.

ATS locking can be configured at different levels, which are described in this VMware KB: https://kb.vmware.com/kb/2146451:

On the datastore level: the current mode can be seen with this command:

vmkfstools -Ph -v1 /vmfs/volumes/VMFS-volume-name

In the output, "Mode: public ATS-Only" means ATS can be used on the datastore; "Mode: public" means SCSI-2 reservations are used.

On the host level: this is the "HardwareAcceleratedLocking" parameter in /etc/vmware/esx.conf. When this parameter is not present in the configuration file, or is set to 1, it is enabled and the host can use ATS locking. If it is set to 0, ATS locking is disabled, and even for "public ATS-Only" datastores the host will use SCSI-2 reservations.
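When auditing many datastores for consistent locking configuration, the "Mode:" line from captured vmkfstools output can be classified mechanically. A minimal sketch over saved command output; the exact output format may vary between ESXi releases:

```shell
# datastore_lock_mode FILE
# FILE contains captured output of: vmkfstools -Ph -v1 /vmfs/volumes/<name>
# Prints "ATS" for "Mode: public ATS-Only", "SCSI-2" for plain
# "Mode: public", and "unknown" otherwise. The ATS-Only check must come
# first, since "Mode: public" is a prefix of the ATS-Only mode string.
datastore_lock_mode() {
  if grep -q "Mode: public ATS-Only" "$1"; then
    echo "ATS"
  elif grep -q "Mode: public" "$1"; then
    echo "SCSI-2"
  else
    echo "unknown"
  fi
}

# Typical use on an ESXi host (datastore name is a placeholder):
#   vmkfstools -Ph -v1 /vmfs/volumes/VMFS-volume-name > /tmp/ds.out
#   datastore_lock_mode /tmp/ds.out
```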
Also on the host level, whether to use ATS for the heartbeat on VMFS5 datastores: this is the "useATSForHBOnVMFS5" parameter in /etc/vmware/esx.conf. When this parameter is not present in the configuration file, or is set to 1, it is enabled and the host will use ATS for the VMFS5 heartbeat. Otherwise, SCSI-2 reservations are used.
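Rather than editing /etc/vmware/esx.conf directly, both host-level parameters can be inspected and set through the esxcli advanced-settings interface. A sketch of the corresponding commands, to be run on each ESXi host (configuration fragment, not runnable outside an ESXi shell):

```shell
# Check current values of the two host-level locking options described above
# (VMFS3.HardwareAcceleratedLocking and VMFS3.UseATSForHBOnVMFS5).
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

# Enable both (value 1) so the host uses ATS locking and the ATS heartbeat;
# run on every host sharing the datastores so the configuration is uniform.
esxcli system settings advanced set -i 1 -o /VMFS3/HardwareAcceleratedLocking
esxcli system settings advanced set -i 1 -o /VMFS3/UseATSForHBOnVMFS5
```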