When an ESXi Host that is a member of a DRS Cluster is put into Maintenance Mode, DRS can automatically migrate its Virtual Machines to other compatible hosts in the Cluster. If an ESXi Host with vGPU Virtual Machines is put into Maintenance Mode, the “Enter maintenance mode” task does not complete and reports the failure event: “DRS failed to generate a vMotion recommendation for a virtual machine on a host entering Maintenance Mode.”

DRS does not automatically migrate vGPU Virtual Machines when an ESXi Host enters Maintenance Mode, because long Virtual Machine Stun Times would disrupt the workload. The Virtual Infrastructure Admin must remediate manually by explicitly migrating the ESXi Host’s vGPU Virtual Machines.

For more information about vMotion and Virtual Machine Stun Time, see the following documentation:
* Using vMotion to Migrate vGPU Virtual Machines
* The vMotion Process Under the Hood - VMware vSphere Blog
* Virtual Machine Conditions and Limitations for vSphere vMotion
The vGPU architecture results in long Virtual Machine Stun Times during vMotion.
Starting with vSphere 8.0 U2, DRS can estimate the Stun Time for a given vGPU VM configuration. When the DRS Cluster Advanced Options are set and the Estimated VM Devices Stun Time for a VM is lower than the VM Devices vMotion Stun Time limit, DRS will automate VM migrations.

To enable this functionality, make sure your infrastructure meets the following requirements:
* Healthy vSphere Lifecycle Services (refer to https://kb.vmware.com/s/article/91891)
* Configuration of the VM's vGPU devices through the vCenter UI only
* Healthy vMotion network (example: vMotion NICs set up through Cluster QuickStart)

Then add the following DRS Cluster Advanced Option:
Option: PassthroughDrsAutomation
Value: 1

For vGPU VMs with Stun Times exceeding the "vMotion Stun Time Limit" (default 100 seconds), a VI Admin can either add the following DRS Cluster Advanced Option:
Option: VmDevicesStunTimeTolerated
Value: <number of seconds, greater than any VM's Estimated Stun Time in the Cluster> (default 100 seconds)

OR

modify the "vMotion Stun Time Limit" in the VM's Configuration -> "VM Options" tab -> "Advanced" section.

If needed, the Workaround below will allow evacuation even during vMotion network health degradation.
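The stun time comparison described for vSphere 8.0 U2 can be illustrated with a short Python sketch. This is not how DRS is implemented and it calls no vSphere API: the function names, VM names, and estimated stun times are hypothetical, and a single cluster-wide limit is assumed for simplicity. Only the 100-second default and the rule "estimated stun time lower than the limit" come from the text above.

```python
from typing import Dict, List

# Default "vMotion Stun Time Limit" quoted in the article, in seconds.
DEFAULT_STUN_TIME_LIMIT_SECS = 100


def vms_blocking_automation(
    estimated_stun_secs: Dict[str, float],
    stun_time_limit_secs: float = DEFAULT_STUN_TIME_LIMIT_SECS,
) -> List[str]:
    """Return the VMs whose estimated stun time is NOT lower than the limit.

    These are the VMs that, under the rule described above, would not be
    migrated automatically.
    """
    return sorted(
        name
        for name, secs in estimated_stun_secs.items()
        if secs >= stun_time_limit_secs
    )


def min_tolerated_value(estimated_stun_secs: Dict[str, float]) -> int:
    """Smallest whole number of seconds greater than every estimate.

    Illustrates the 'greater than any VM's Estimated Stun Time' guidance
    for VmDevicesStunTimeTolerated.
    """
    return int(max(estimated_stun_secs.values())) + 1


# Hypothetical estimates, in seconds, for two vGPU VMs.
estimates = {"vm-a": 20.0, "vm-b": 130.0}
blocking = vms_blocking_automation(estimates)
suggested = min_tolerated_value(estimates)
print(blocking, suggested)
```

With these made-up numbers, only vm-b exceeds the default limit, so a VI Admin would either raise VmDevicesStunTimeTolerated above 130 seconds or change that VM's own limit.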
With vCenter Server 7.0 Update 3f and vSphere 7.0.3 or newer, a DRS Cluster Advanced Options override was added to give Virtual Infrastructure Admins a way to OPT-IN to automated evacuation of vGPU Virtual Machines:
Option: VgpuMMAutomationTimeoutSecs
Value: -1

The above override comes with the following behavior changes:
* Evacuation of vGPU Virtual Machines is automated, subject to the 100 second vMotion timeout.
* During Switchover, a vGPU Virtual Machine's Stun Time may exceed 10 seconds (dependent on both network bandwidth and the size of the vGPU profile).
* Evacuation of Virtual Machines is serialized to avoid network contention.

Requirements:
* Extra vGPU host capacity in the DRS cluster (example: duplicate host configuration for the host going into Maintenance Mode).
* No compatibility issues reported for the VMs on the host going into Maintenance Mode.