During the snapshot removal step of a Veeam Backup & Replication task, the source vSphere VM loses connectivity temporarily.
Veeam does not remove the snapshot itself; it sends an API call to the vSphere environment to have the action performed. The snapshot removal process significantly reduces the total IOPS that can be delivered by the VM, due to additional locks on the VMFS storage resulting from increased metadata updates as well as the added I/O load of the snapshot removal process itself. In most environments, if the target storage is already above 30-40% IOPS load, which is not uncommon with a busy SQL/Exchange server, the snapshot removal process will easily push that load past the 80% mark and likely much higher. Most storage arrays see a significant latency penalty once IOPS load exceeds 80%, which is detrimental to application performance.
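The arithmetic behind the 30-40% and 80% figures can be sketched as follows. This is only an illustration: the multiplier applied during snapshot removal (`removal_factor`) is a hypothetical input that you would have to measure in your own environment, and the function names are not part of any Veeam or VMware tool.

```python
# Threshold quoted in the text: latency rises sharply at roughly 80% IOPS load.
LATENCY_KNEE_PCT = 80.0


def projected_load_pct(baseline_pct: float, removal_factor: float) -> float:
    """Estimate storage load during snapshot removal, capped at 100%.

    removal_factor is an ASSUMED multiplier (>= 1) describing how much the
    snapshot removal inflates the baseline load; it is not a vendor figure.
    """
    if baseline_pct < 0 or removal_factor < 1:
        raise ValueError("baseline_pct must be >= 0 and removal_factor >= 1")
    return min(100.0, baseline_pct * removal_factor)


def max_safe_factor(baseline_pct: float) -> float:
    """Largest multiplier the baseline can absorb before reaching the knee."""
    if baseline_pct <= 0:
        raise ValueError("baseline_pct must be > 0")
    return LATENCY_KNEE_PCT / baseline_pct


def crosses_knee(baseline_pct: float, removal_factor: float) -> bool:
    """True if the projected load reaches or exceeds the latency knee."""
    return projected_load_pct(baseline_pct, removal_factor) >= LATENCY_KNEE_PCT
```

For example, a datastore at 40% baseline load can only absorb a 2x increase before reaching 80%, which is why an already busy SQL/Exchange datastore leaves little headroom for snapshot removal.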
The following test should be performed when connectivity to the VM is not critical, for instance, during off-peak hours. To isolate the vSphere snapshot removal event, Veeam suggests the following isolation test:
1. Create a snapshot on the VM in question.
2. Leave the snapshot on the VM for the duration of time that a Veeam job runs against that VM.
3. Remove the snapshot.
4. Observe the VM during the snapshot removal.
If you observe the same connectivity issues during this test as during the Veeam job run, the problem likely lies within the vSphere environment itself. Review the following list of troubleshooting steps and known issues. If none of the following options resolve the issue, we recommend contacting Broadcom support directly regarding the problem with removing the snapshot.
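To make the observation step of the isolation test measurable, one option is to probe a TCP port on the VM at a fixed interval while the snapshot is being removed and then summarize the gaps. The sketch below uses only the Python standard library; the host, port, and interval are placeholders you would choose yourself, and a failed TCP probe is only a rough proxy for the connectivity loss described above.

```python
import socket
import time
from typing import Iterable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded)


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def collect_samples(host: str, port: int, duration_s: float,
                    interval_s: float = 1.0) -> List[Sample]:
    """Probe the VM repeatedly for duration_s seconds and record each result."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.time(), tcp_probe(host, port)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Group consecutive failed probes into (start, end) windows.

    start is the timestamp of the first failed probe; end is the timestamp of
    the next successful probe, or of the last sample if the run ends while down.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows
```

A typical run would call `collect_samples("vm-hostname", 3389, duration_s=600)` (hypothetical host and port) just before removing the snapshot and then pass the result to `outage_windows` to compare against the gaps seen during the Veeam job.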
- If the VM being stunned is stored on an NFS 3.0 datastore, see the section below about a Known Issue with NFS 3.0 and HotAdd.
- Check for snapshots on the VM while no Veeam job is running and remove any that are found. Veeam Backup & Replication can back up a VM that has snapshots present. However, it has been observed that snapshot stun may occur when vSphere attempts to remove a snapshot created during a Veeam job operation and a snapshot was already present on the VM before the Veeam job.
- Check for orphaned snapshots on the VM. Reference: Finding and listing virtual machine snapshots
- Reduce the number of concurrent tasks that are occurring within Veeam. This will reduce the number of active snapshot tasks on the datastores.
- Move the VM to a datastore with more available IOPS, or split the VM's disks across multiple datastores to distribute the load more evenly.
- If the VM's CPU usage spikes heavily during snapshot consolidation, consider increasing the CPU reservation for that VM.
- Ensure you are on the latest build of your current version of vSphere, hypervisors, VMware Tools, and SAN firmware when applicable.
- Move the VM to a host with more available resources.
- If possible, change the time of day that the VM gets backed up or replicated to a time when the least storage activity occurs.
- Use a workingDir to redirect snapshots to a different datastore than the one the VM resides on. Reference: Creating snapshots in a different location than default virtual machine directory for VMware ESXi and VMware ESX
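As a rough aid for the snapshot checks above, the sketch below scans a list of file names from a VM's datastore folder for snapshot disk files. It assumes the commonly seen naming scheme in which snapshot disks carry a six-digit sequence number (for example `name-000001.vmdk`, `name-000001-delta.vmdk`, or `name-000001-sesparse.vmdk`); the function name and the sample file names are hypothetical. A match only shows that snapshot disks exist, so the result still needs to be compared against what Snapshot Manager reports before anything is treated as orphaned.

```python
import re
from typing import Dict, Iterable, List

# ASSUMED naming scheme: "<disk>-NNNNNN.vmdk" descriptors plus optional
# "-delta" or "-sesparse" extent files carrying the same six-digit sequence.
SNAPSHOT_DISK_RE = re.compile(
    r"^(?P<base>.+)-(?P<seq>\d{6})(?:-(?:delta|sesparse))?\.vmdk$"
)


def find_snapshot_disks(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each base disk name to the sorted snapshot disk files found for it."""
    found: Dict[str, List[str]] = {}
    for name in filenames:
        match = SNAPSHOT_DISK_RE.match(name)
        if match:
            found.setdefault(match.group("base"), []).append(name)
    return {base: sorted(names) for base, names in found.items()}
```

The file list could come from `os.listdir()` on a mounted datastore path or from a saved directory listing; an empty result while no Veeam job is running suggests that no leftover snapshot disks match this naming pattern.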
Known Issue with NFS 3.0 and HotAdd
Resolved in ESXi 8.0 U2
The VMware article regarding this issue has been updated to indicate that the underlying cause of the snapshot stun with NFS 3.0 and Virtual Appliance (HotAdd) transport mode was resolved in the VMware ESXi 8.0 Update 2b release.
Broadcom KB323118: Virtual machines residing on NFS storage become unresponsive during a snapshot removal operation
Related Broadcom Articles
- Snapshot removal stops a virtual machine for long time (323397)
- Virtual machines residing on NFS storage become unresponsive during a snapshot removal operation (323118)
- High VM Stun time during snapshot deletion or SVmotion failure on ESXi 6.7U2 or later (317708)
- Virtual machine becomes unresponsive or inactive when taking memory snapshot (321376)
- Finding and listing virtual machine snapshots (344559)
- A virtual machine can freeze under load when you take quiesced snapshots or use custom quiescing scripts (343375)
- VM snapshot stun times correlate with the number of virtual disks (337998)