Multiple VMs may crash due to very high latency and log congestion.

In vSAN Skyline Health, you will observe a single drive failure and log congestion.

In vmkernel.log, you may notice stuck descriptor events:

DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor

In vobd.log, you will observe the affected disk hitting transient errors:

2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10606062774178us: [esx.problem.vob.vsan.lsom.devicerepair] Device 521a74ce-c980-c16c-ff3d-38a036233daf is in offline state and is getting repaired

In vsandevicemonitord.log, entries show that DDH tried to repair and re-mount the disk, but the attempts failed continuously due to sustained errors from the device:

2022-06-03 01:44:16,575 INFO vsandevicemonitord stderr None, stdout b"VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nVsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/naa.500a0751281163a2', create disk with null id\nVsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nErrors: \nUnable to mount: Failed to open device /vmfs/devices/disks/naa.500a0751281163a2\n" from command /sbin/localcli vsan storage diskgroup mount -d naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Mounting failed on VSAN device naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Repair attempt 131 for device 521a74ce-c980-c16c-ff3d-38a036233daf

vSAN performance graphs may show high congestion.
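As a quick triage aid, the signatures above can be grepped for directly on the host. The sketch below runs against a sample line copied from the log excerpts in this article rather than a live log, so the paths and counts are illustrative only; adjust log locations to match your syslog configuration.

```shell
# Log locations on a typical ESXi host (illustrative):
#   /var/log/vmkernel.log           -> "Stuck descriptor" events
#   /var/log/vobd.log               -> vob.vsan.lsom.devicerepair events
#   /var/log/vsandevicemonitord.log -> repeated "Repair attempt" entries
# Sample line taken from the vobd.log excerpt above:
SAMPLE='2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures'
# Count device-repair events in the sample (on a host, grep the real log instead):
REPAIR_EVENTS=$(printf '%s\n' "$SAMPLE" | grep -c 'vob.vsan.lsom.devicerepair')
echo "devicerepair events found: $REPAIR_EVENTS"
# On the host itself you would run, for example:
#   grep -c 'vob.vsan.lsom.devicerepair' /var/log/vobd.log
#   grep 'Stuck descriptor' /var/log/vmkernel.log
```

A repair-event count that keeps growing alongside "Repair attempt N" entries in vsandevicemonitord.log indicates the device is stuck in the repair loop described above.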
Because RELOG did not run on the failed disk, PLOG records built up, causing congestion and latency at the VM level. RELOG is an internal vSAN process that frees up space in the LSOM layer through log reclamation. RELOG does not run on a device that remains in the repair state for a long time, which can lead to log build-up.
This results in a cluster-wide performance issue due to log congestion. These issues have been reported on non-dedup disk groups.
The issue has been fixed in ESXi 6.7 U3 P05 and 7.0 U3d and later releases.
If you notice any drive reporting an "Operational health error" in Skyline Health and the symptoms match those described above, follow these steps:

1. Put the affected host into maintenance mode, choosing "ensure object accessibility".
2. Remove the faulty disk from the disk group.
3. Replace the failed drive and add the new drive to the disk group.
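The replacement steps can also be driven from the ESXi command line. The sketch below is illustrative, not a definitive procedure: the NAA ID is the example device from the log excerpts in this article, and the commands are echoed (dry-run) rather than executed, so nothing is changed on the host. Verify the option names against your ESXi release before running them for real.

```shell
# Hypothetical target: the example NAA ID from the log excerpts in this article.
DISK="naa.500a0751281163a2"
# Step 1: enter maintenance mode with "ensure object accessibility".
CMD_MM="esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility"
# Step 2: remove the faulty capacity disk from the disk group.
CMD_RM="esxcli vsan storage remove --disk=$DISK"
# Dry-run: print the commands instead of executing them.
echo "$CMD_MM"
echo "$CMD_RM"
```

After physically replacing the drive, the new disk is added back to the disk group through vCenter (or `esxcli vsan storage add`) and the host is taken out of maintenance mode.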
This behavior is reported in ESXi 6.7.x and 7.0 GA / 7.0 U1. After applying the fix, vSAN processes RELOG on the disk under repair, avoiding PLOG log build-up.