Multiple VMs may crash due to very high latency and log congestion.

In vSAN Skyline Health, you will observe a single drive failure and log congestion.

In vmkernel.log, you may notice stuck descriptor events:

DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor

In vobd.log, you will observe the affected disk hitting transient errors:

2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2022-05-31T11:42:46.065Z: [vSANCorrelator] 10606062774178us: [esx.problem.vob.vsan.lsom.devicerepair] Device 521a74ce-c980-c16c-ff3d-38a036233daf is in offline state and is getting repaired

In vsandevicemonitord.log, entries show that DDH tried to repair and re-mount the disk, but the attempts failed continuously due to sustained errors from the device:

2022-06-03 01:44:16,575 INFO vsandevicemonitord stderr None, stdout b"VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nVsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/naa.500a0751281163a2', create disk with null id\nVsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/naa.500a0751281163a2, errno (5)\nErrors: \nUnable to mount: Failed to open device /vmfs/devices/disks/naa.500a0751281163a2\n" from command /sbin/localcli vsan storage diskgroup mount -d naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Mounting failed on VSAN device naa.500a0751281163a2.
2022-06-03 01:44:16,575 INFO vsandevicemonitord Repair attempt 131 for device 521a74ce-c980-c16c-ff3d-38a036233daf

vSAN performance graphs may show high congestion.
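As a quick triage aid, the signatures above can be grepped for directly on the host. The sketch below runs against a sample line copied from the log excerpts in this article rather than a live log, so the paths and counts are illustrative only; adjust log locations to match your syslog configuration.

```shell
# Log locations on a typical ESXi host (illustrative):
#   /var/log/vmkernel.log           -> "Stuck descriptor" events
#   /var/log/vobd.log               -> vob.vsan.lsom.devicerepair events
#   /var/log/vsandevicemonitord.log -> repeated "Repair attempt" entries
# Sample line taken from the vobd.log excerpt above:
SAMPLE='2022-05-31T11:42:46.065Z: [vSANCorrelator] 10605891965954us: [vob.vsan.lsom.devicerepair] vSAN device 521a74ce-c980-c16c-ff3d-38a036233daf is being repaired due to I/O failures'
# Count device-repair events in the sample (on a host, grep the real log instead):
REPAIR_EVENTS=$(printf '%s\n' "$SAMPLE" | grep -c 'vob.vsan.lsom.devicerepair')
echo "devicerepair events found: $REPAIR_EVENTS"
# On the host itself you would run, for example:
#   grep -c 'vob.vsan.lsom.devicerepair' /var/log/vobd.log
#   grep 'Stuck descriptor' /var/log/vmkernel.log
```

A repair-event count that keeps growing alongside "Repair attempt N" entries in vsandevicemonitord.log indicates the device is stuck in the repair loop described above.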
Because RELOG did not run on the failed disk, PLOG records built up, causing congestion and latency at the VM level. RELOG is an internal vSAN process that frees up space in the LSOM layer through log reclamation. RELOG does not run on a device that remains in the repair state for a long time, which can lead to log build-up.
This results in a cluster-wide performance issue due to log congestion. These issues have been reported on non-dedup disk groups.
The issue has been fixed in ESXi 6.7 U3 P05 and 7.0 U3d and later releases.
If you notice any drive reporting an "Operational health error" in Skyline Health and the symptoms match those described above, follow these steps:

1. Put the affected host into maintenance mode, choosing "ensure object accessibility".
2. Remove the faulty disk from the disk group.
3. Replace the failed drive and add the new drive to the disk group.
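The replacement steps can also be driven from the ESXi command line. The sketch below is illustrative, not a definitive procedure: the NAA ID is the example device from the log excerpts in this article, and the commands are echoed (dry-run) rather than executed, so nothing is changed on the host. Verify the option names against your ESXi release before running them for real.

```shell
# Hypothetical target: the example NAA ID from the log excerpts in this article.
DISK="naa.500a0751281163a2"
# Step 1: enter maintenance mode with "ensure object accessibility".
CMD_MM="esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility"
# Step 2: remove the faulty capacity disk from the disk group.
CMD_RM="esxcli vsan storage remove --disk=$DISK"
# Dry-run: print the commands instead of executing them.
echo "$CMD_MM"
echo "$CMD_RM"
```

After physically replacing the drive, the new disk is added back to the disk group through vCenter (or `esxcli vsan storage add`) and the host is taken out of maintenance mode.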
This behavior is reported in ESXi 6.7.x and 7.0 GA / 7.0 U1. After applying the fix, vSAN processes RELOG on the disk under repair, avoiding PLOG log build-up.