Info
An issue exists where any SATA, SAS, or NVMe SSD configured in a VMware All-Flash vSAN disk group may be mistakenly reported as failed and marked by vSAN as having a permanent error. This occurs because Medium Errors continue to be reported after the ESXi operating system makes multiple attempts to remap the bad area. SMART data retrieved directly from the SSD shows that the drive still has spare space available to remap bad areas.
vSAN 6.7 may not allow recovery from a single Unrecoverable Read Error (URE) that occurs in the metadata regions of an All-Flash vSAN disk group unless the disk group is first removed from vSAN.
Depending on the version of ESXi and the features enabled, the host may perform an "autoDG" creation operation on the failed disk group in an attempt to repair the bad area. As a result, a drive may be reported as failed after multiple attempts to repair it using the "autoDG" operation.
This may happen because of how vSAN interacts with various vendor drives in the handling of the 5-10% area used for metadata operations. According to VMware KB 81121, the autoDG creation feature runs a TRIM utility, and by default TRIM runs only on the 5-10% metadata region. If the bad area lies beyond that 5-10% of the drive, it is not remapped, which causes premature replacement of the drive.
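The reasoning above can be illustrated with a minimal Python sketch. It is not vSAN or ESXi code; the function name `in_trimmed_region`, its parameters, and the assumption that the trimmed area is a contiguous range at the start of the drive's logical block addresses (LBAs) are hypothetical and used only to show why a bad area outside the default TRIM range would be left unrepaired.

```python
def in_trimmed_region(bad_lba: int, total_lbas: int, trim_percent: int = 10) -> bool:
    """Return True if bad_lba falls inside the first trim_percent of the drive.

    Assumption (not confirmed by the article): the region covered by the
    default TRIM pass is the leading trim_percent of the LBA range.
    """
    if total_lbas <= 0:
        raise ValueError("total_lbas must be positive")
    if not 0 <= bad_lba < total_lbas:
        raise ValueError("bad_lba must lie within the drive's LBA range")
    if not 0 < trim_percent <= 100:
        raise ValueError("trim_percent must be between 1 and 100")

    # Integer arithmetic avoids floating-point rounding at the boundary.
    trimmed_lbas = total_lbas * trim_percent // 100
    return bad_lba < trimmed_lbas


if __name__ == "__main__":
    total = 1_000_000  # hypothetical drive size in LBAs
    for lba in (50_000, 500_000):
        covered = in_trimmed_region(lba, total)
        status = "inside" if covered else "outside"
        print(f"LBA {lba} is {status} the default TRIM range")
```

Under these assumptions, a Medium Error at an LBA outside the trimmed range would not be touched by the default operation, which is consistent with the drive continuing to report errors even though SMART data shows spare space available.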