Symptoms
The following symptoms can be seen:
Running ESXi 7.0 Update 1 or later
'SSD Congestion' alarms in Skyline Health point to one or several DiskGroups in the cluster
Increasing 'ssdCongestion' / 'logCongestion' values when running the GSS congestion check one-liner:
Example:
# for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done
Tue Jul 20 19:56:18 UTC 2021
5218efcf-206f-800d-00a3-b945fe425409
   memCongestion:0
   slabCongestion:0
   ssdCongestion:227 <------- This is already too high. Even a value below 100 that keeps incrementing is enough to suspect this issue.
   iopsCongestion:0
   logCongestion:0 <------- In some cases logCongestion has increased and no ssdCongestion is present.
   compCongestion:0
   mdCongestion:0
   memCongestionLocalMax:0
   slabCongestionLocalMax:0
   ssdCongestionLocalMax:227
   iopsCongestionLocalMax:0
   logCongestionLocalMax:0
   compCongestionLocalMax:0
   mdCongestionLocalMax:0
On the host that owns the affected DiskGroup, under 'Host → Monitor → vSAN → Performance → Disks → Diskgroup → ', the "Write Buffer Free Percentage" is below 70% and the "Cache Disk De-stage Rate" metric shows no throughput.
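Because a value that keeps incrementing matters more than its absolute size, it can help to compare two samples of the one-liner's output taken some time apart. The following is a minimal sketch of that comparison, not an official tool. The helper names `cong_value` and `cong_growing` are hypothetical, and the first sample (ssdCongestion:180) is made-up illustration data; only the value 227 comes from the example above.

```shell
#!/bin/sh
# Sketch only: flag a congestion counter that grew between two samples.
# cong_value and cong_growing are hypothetical helper names.

# Print the value of counter $1 from "name:value" text read on stdin.
# Matching is exact, so ssdCongestion does not match ssdCongestionLocalMax.
cong_value() {
    tr -s ' \t' '\n\n' | awk -F: -v key="$1" '$1 == key { print $2; exit }'
}

# Succeed (exit status 0) when the later sample $2 is larger than the earlier $1.
cong_growing() {
    [ "$2" -gt "$1" ]
}

# sample1 is invented for illustration; sample2 reuses the value from the
# example output. On a host, both would come from the vsish command shown
# above, captured some time apart.
sample1='memCongestion:0 slabCongestion:0 ssdCongestion:180 logCongestion:0'
sample2='memCongestion:0 slabCongestion:0 ssdCongestion:227 logCongestion:0'

for key in ssdCongestion logCongestion; do
    before=$(printf '%s\n' "$sample1" | cong_value "$key")
    after=$(printf '%s\n' "$sample2" | cong_value "$key")
    if cong_growing "$before" "$after"; then
        echo "$key increased: $before -> $after"
    fi
done
```

The parser reads text from stdin rather than calling vsish directly, so the comparison logic can be tried on saved output before being run against a live host.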
Purpose
To provide guidance on addressing a known issue
Cause
Due to an underflow of the outstanding IO counter, the vSAN elevator believes that the capacity device still has outstanding IO to be de-staged and waits for it to complete before de-staging the next data. However, no IOs are actually pending on the capacity disk, so the elevator ends up de-staging no data at all.
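The effect of such an underflow can be illustrated with a toy model. This is not vSAN code: the 32-bit counter width, the limit `MAX_OUTSTANDING`, and the helper names `dec_u32` and `can_destage` are all assumptions chosen only to show why a counter decremented below zero makes a scheduler wait indefinitely.

```shell
#!/bin/sh
# Toy model only. The counter width (32-bit) and the limit below are
# assumptions for illustration, not actual vSAN internals.
MASK=4294967295      # 2^32 - 1, emulates an unsigned 32-bit counter
MAX_OUTSTANDING=1    # hypothetical limit on in-flight de-stage IOs

# Decrement an emulated unsigned 32-bit counter, wrapping around below zero.
dec_u32() {
    echo $(( ($1 - 1) & MASK ))
}

# Succeed when the counter says there is room to issue the next de-stage IO.
can_destage() {
    [ "$1" -lt "$MAX_OUTSTANDING" ]
}

outstanding=0                              # nothing is actually in flight
outstanding=$(dec_u32 "$outstanding")      # one decrement too many: underflow
echo "outstanding=$outstanding"

if can_destage "$outstanding"; then
    echo "de-stage next batch"
else
    echo "waiting for outstanding IO that does not exist"
fi
```

In this model the wrapped counter holds a very large value, so the check never passes again even though no IO is pending, which mirrors the stalled de-staging described above.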
Impact / Risks
Overall vSAN performance could be impacted if PLOG consumption buildup has already caused vSAN congestion.
VMs may start presenting different problems such as:
Increased latency
Switching to a "Read-Only" mode
Guest OS getting stuck
Resolution
This issue is fixed in vSAN 7.0 U3g (EP5). Update to this build or a newer one to address the issue.