Symptoms
Running thousands of HCX vMotion, Replication Assisted vMotion (RAV), and/or Cold migrations may cause the Mobility Agent to fail after some time for a given Service Mesh in an HCX deployment. Migration workflows then fail with connectivity errors between the source ESXi host and the IX/MA:
Error: vMotion failed. System Error. Source side error is : Source side relocate failed for the virtual machine. Migration to host <10.1.1.1> failed with error Connection closed by remote host, possibly due to timeout.
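When many workflows fail at once, it can help to check which of them carry this exact signature. The following is a minimal sketch, assuming the failure messages have been copied or exported into plain text lines; the function name find_failed_hosts and the input format are hypothetical and are not part of any HCX tooling.

```python
import re

# Pattern built only from the error text quoted above. The angle brackets
# around the host address are treated as optional.
SIGNATURE = re.compile(
    r"vMotion failed.*Migration to host <?(?P<host>[^\s>]+)>? failed with error "
    r"Connection closed by remote host"
)


def find_failed_hosts(lines):
    """Return the destination host from every line that matches the signature."""
    hosts = []
    for line in lines:
        match = SIGNATURE.search(line)
        if match:
            hosts.append(match.group("host"))
    return hosts


sample = [
    "Error: vMotion failed. System Error. Source side error is : Source side "
    "relocate failed for the virtual machine. Migration to host <10.1.1.1> "
    "failed with error Connection closed by remote host, possibly due to timeout.",
    "Some unrelated message",
]
print(find_failed_hosts(sample))
```

A match only indicates that the message has the same wording; the same connectivity error can have other causes, so the result is a hint rather than a diagnosis.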
Purpose
This document serves as a reference for recovering the HCX vMotion/RAV/Cold migration services after a Mobility Agent crash.
Cause
vMotion/RAV/Cold migrations are serviced by the HCX Mobility Agent. A Virtual Machine has multiple related objects, including CPU IDs, feature flags, etc. After a migration completed successfully, those referenced objects were not released from the Mobility Agent and accumulated, eventually leading to memory exhaustion after a few thousand workflows.
Impact / Risks
This only affects vMotion/RAV/Cold migrations. There is no impact to VR Bulk migration or to Network Extension services. The issue affects HCX 4.0.0 and later releases that predate the fix.
Resolution
The issue is resolved in HCX version 4.2.1. Upgrading to that version is required to prevent the issue from recurring.
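The affected range stated in this article (4.0.0 and later, fixed in 4.2.1) can be expressed as a simple version comparison. The following is a minimal sketch, assuming the installed version is available as a dotted numeric string such as "4.1.0"; the helper names parse_version and is_affected are hypothetical, and build suffixes or other non-numeric parts are not handled.

```python
# Bounds taken from this article: first affected release and the fixed release.
FIRST_AFFECTED = (4, 0, 0)
FIXED_IN = (4, 2, 1)


def parse_version(text):
    """Turn a dotted numeric string such as '4.2.1' into a tuple (4, 2, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version):
    """Return True if the version is at or after 4.0.0 and before 4.2.1."""
    return FIRST_AFFECTED <= parse_version(version) < FIXED_IN


for candidate in ("3.5.3", "4.0.0", "4.1.0", "4.2.1"):
    print(candidate, is_affected(candidate))
```

Tuples compare element by element, which avoids the ordering mistakes that plain string comparison makes with multi-digit components.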
Workaround
As a potential workaround, the following steps can be performed to recover Mobility Agent services:
Re-deploy the IX appliance for a given Service Mesh.
Remove and re-add the vMotion and RAV services in the Service Mesh to recreate the IX appliance.