...
This document describes the functional capacity for migrations using vSphere Replication (vSR) Bulk migration and Replication Assisted vMotion (RAV) in HCX. The supported scale numbers are per HCX Manager, irrespective of the number of Site Pairings or Service Mesh/IX appliances deployed.

Considerations for concurrent migration

Several factors, at both the source and the target HCX Manager, can limit the number of concurrent migrations performed using Bulk and RAV (initial/delta sync):

- Data storage IOPS capacity
- Shared vs. dedicated host resources:
  - Overall ESXi host resources available for all services
  - CPU and memory reservations for the IX appliance VM
  - pNIC/VMNIC capacity and shared load
  - Dedicated vmk interfaces for different services such as management/vMotion/vSR
- Network infrastructure throughout the entire data path:
  - Data center local network
  - Service provider network infrastructure between the source and target sites
  - Bandwidth availability
  - Latency and path reliability (packet loss): vSphere Replication (vSR) performance drops sharply with higher packet loss and/or higher latency. vSphere Replication has a built-in tolerance for high latency, but throughput is reduced significantly.

  Note: The HCX Transport Analytics feature can be used to measure network infrastructure throughput during the migration planning phase. Refer to the VMware HCX User Guide.

- Workload VM conditions:
  - Number of disks
  - Total size and size per disk
  - Active services/applications
  - Data churn/disk changes

Default (baseline) HCX Manager resource allocation:

  vCPU  RAM (GB)  Disk Size (GB)
  4     12        60

The supported numbers for concurrent Bulk/RAV migrations per baseline HCX Manager deployment are:

- 300 concurrent migrations per HCX Manager
- 200 concurrent migrations per Service Mesh/IX appliance
- 1 Gbps maximum per migration workflow
- 1.6 Gbps maximum per IX appliance (regardless of the number of concurrent migration workflows)
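As a back-of-envelope illustration of these ceilings, the sketch below computes the aggregate replication capacity available when replication load is spread across multiple IX appliances. The appliance count is an assumed example value, not a recommendation:

```shell
#!/bin/sh
# Back-of-envelope math for the ceilings above. With a 1.6 Gbps cap per
# IX appliance, total replication throughput scales with the number of
# service mesh/IX pairs. The appliance count below is hypothetical.
per_ix_gbps_x10=16        # 1.6 Gbps per IX appliance, scaled by 10 for integer math
ix_appliances=3           # assumed: one service mesh per workload cluster
aggregate_x10=$(( per_ix_gbps_x10 * ix_appliances ))
echo "Aggregate replication ceiling: $(( aggregate_x10 / 10 )).$(( aggregate_x10 % 10 )) Gbps"
```

This is only a ceiling on the replication transport; actual throughput also depends on the storage, host, and network factors listed above.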
Scale up migration concurrency

To improve concurrent migration scalability, resources on the HCX Connector and the HCX Cloud Manager must be increased as follows:

Baseline migration concurrency: supports 300 migrations (Bulk & RAV) per HCX Manager

  vCPU  RAM (GB)  Disk Size (GB)  Tuning
  4     12        60              N/A

Extended migration concurrency: supports 600 migrations (Bulk & RAV) per HCX Manager

  vCPU  RAM (GB)  Disk Size (GB)  Tuning
  32    48        300             Y

Increase resources on the HCX Connector/Cloud Manager

The following procedure must be used to increase the resource allocation on both the HCX Connector and the HCX Cloud Manager VMs.

Requirements and considerations before increasing resources on the HCX Connector and Cloud Manager:

- Do NOT exceed the recommended allocations; doing so may cause the HCX Connector/Cloud Manager to malfunction.
- Both the HCX Cloud Manager and the HCX Connector must be running HCX 4.7.0 or later.
- There must be NO active migration or configuration workflows while making these resource changes.
- Changes must be made during a scheduled maintenance window.
- There is NO impact to Network Extension services.
- There is NO change in concurrency for the HCX vMotion/cold migration workflows.
- The concurrent migration limit specified for HCX Replication Assisted vMotion (RAV) applies ONLY to the initial and delta sync. During the RAV switchover stage, only one relocation is serviced at a time, on a serial basis.
- Additional service meshes/IX appliances should be deployed for unique workload clusters to aggregate the replication capacity of multiple IX appliances. A different service mesh can be deployed for each workload cluster at the source and/or target.
- With multiple service meshes/IX appliances, RAV switchovers can run in parallel across meshes; within a single service mesh/IX pair, however, switchover is always sequential.

Procedure

IMPORTANT: It is recommended to take snapshots of the HCX Connector and Cloud Manager VMs before executing these steps.

Step 1: Increase the vCPU and memory of the HCX Manager to 32 vCPU and 48 GB respectively.
1. Log in to the vCenter that hosts the HCX Manager.
2. Shut down the HCX Manager VM's guest OS using the vCenter UI.
3. Edit the HCX Manager VM to increase the vCPU and memory reservations. Refer to: Virtual CPU Configuration, Virtual Memory Configuration.
4. Power on the HCX Manager VM.

Step 2: Add a 300 GB disk to the HCX Connector and Cloud Manager.

IMPORTANT: The following steps can be used to add a 300 GB disk to both HCX Managers.

1. Refer to VMware Knowledge Base article 1003940 for adding a new virtual disk to an existing Linux virtual machine.
2. Mount the new disk on the HCX Manager:

   mount /dev/sdc1 /common_ext
   df -hT   # verify that /common_ext is mounted and has the correct filesystem type

3. Add an entry to /etc/fstab so that the mounted disk persists across reboots and HCX Manager upgrades:

   vi /etc/fstab
   /dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2

   Note: Use the Linux vi editor to modify the file:
   1. Press the ESC key for normal mode.
   2. Press the "i" key for insert mode.
   3. Type ":q!" to exit the editor without saving the file.
   4. Type ":wq!" to save the updated file and exit the editor.

Step 3: Stop the HCX services:

   # systemctl stop postgresdb
   # systemctl stop zookeeper
   # systemctl stop kafka
   # systemctl stop app-engine
   # systemctl stop web-engine
   # systemctl stop appliance-management

Step 4: Redirect the existing contents of "kafka-db" and "postgres-db" to the newly created disk.

1. Move the directory /common/kafka-db to /common/kafka-db.bak:

   cd /common
   mv kafka-db kafka-db.bak

2. Create a new directory /common_ext/kafka-db:

   cd /common_ext
   mkdir kafka-db

   Note: The Kafka contents do not need to be copied; they are regenerated after the kafka/app-engine services restart.

3. Change the ownership and permissions of this directory to match /common/kafka-db.bak:

   chmod 755 kafka-db
   chown kafka:kafka kafka-db

4. Create a soft link from /common/kafka-db to /common_ext/kafka-db:
   cd /common
   ln -s /common_ext/kafka-db kafka-db

5. Move the directory /common/postgres-db to /common/postgres-db.bak as a backup:

   cd /common
   mv postgres-db postgres-db.bak

6. Copy the contents of /common/postgres-db.bak to /common_ext/postgres-db and change the ownership to postgres. Note: Use the "-R" option so that ownership of /common_ext/postgres-db is changed recursively:

   cp -r /common/postgres-db.bak /common_ext/postgres-db
   chown -R postgres:postgres /common_ext/postgres-db

7. Create a soft link from /common/postgres-db to /common_ext/postgres-db:

   cd /common
   ln -s /common_ext/postgres-db postgres-db

Step 5: Start the HCX services:

   # systemctl start postgresdb
   # systemctl start zookeeper
   # systemctl start kafka
   # systemctl start app-engine
   # systemctl start web-engine
   # systemctl start appliance-management

Performance tuning on the HCX Manager

In addition to increasing HCX resources, you must perform the following tuning steps to scale concurrent migrations.

IMPORTANT: The changes made in this procedure do not persist across an HCX Manager upgrade.

Procedure

Step 6: Stop the HCX services again.

1. Log in to the HCX Connector/Cloud Manager root console.
2. Stop the services:

   # systemctl stop postgresdb
   # systemctl stop zookeeper
   # systemctl stop kafka
   # systemctl stop app-engine
   # systemctl stop web-engine
   # systemctl stop appliance-management

Step 7: Increase the memory allocation for the app-engine framework.

Edit the "app-engine-start" file to increase the Java memory allocation and max perm size:

   vi /etc/systemd/app-engine-start
   JAVA_OPTS="-Xmx4096m -Xms4096m -XX:MaxPermSize=1024m ...

Step 8: Increase thread pooling for the Mobility Migration services.

Edit "MobilityMigrationService.zql" and "MobilityTransferService.zql" to increase the thread counts:

   vi /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql
   "numberOfThreads": "50",

   vi /opt/vmware/deploy/zookeeper/MobilityTransferService.zql
   "numberOfThreads":50,

Step 9: Increase the message size limit for the Kafka framework.
1. Edit "vchsApplication.zql" and update "kafkaMaxMessageSizeBytes" from "2097152" to "4194304":

   vi /opt/vmware/deploy/zookeeper/vchsApplication.zql
   "kafkaMaxMessageSizeBytes":4194304

2. Edit the Kafka "server.properties" file and update "message.max.bytes" from "2097152" to "4194304":

   vi /etc/kafka/server.properties
   message.max.bytes=4194304

Step 10: Start the HCX services:

   # systemctl start postgresdb
   # systemctl start zookeeper
   # systemctl start kafka
   # systemctl start app-engine
   # systemctl start web-engine
   # systemctl start appliance-management

Step 11: Verify that the following services are running on the HCX Connector/Cloud Manager:

   admin@hcx [ ~ ]$ systemctl --type=service | grep "zoo\|kaf\|web\|app\|postgres"
   app-engine.service           loaded active running App-Engine
   appliance-management.service loaded active running Appliance Management
   kafka.service                loaded active running Kafka
   postgresdb.service           loaded active running PostgresDB
   web-engine.service           loaded active running WebEngine
   zookeeper.service            loaded active running Zookeeper

IMPORTANT: If the HCX Manager fails to reboot, or if any of the services listed above fails to start, revert the configuration changes immediately and ensure the system comes back online. Snapshots can also be used to revert the above configuration if any of the steps fail.

Note: Reverting to a snapshot does not restore the HCX Connector/Cloud Manager's compute resources (vCPU/memory).
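The Step 11 verification can also be scripted. The sketch below is one possible way to check the symlinks from Step 4 and the service states from Step 11; the helper names (check_symlink, check_service) are illustrative, not part of HCX:

```shell
#!/bin/sh
# Sketch of post-procedure sanity checks. Helper names are hypothetical;
# only standard tools (readlink, systemctl) are used.

check_symlink() {
    # Succeeds when $1 is a symlink pointing at $2.
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

check_service() {
    # Succeeds when the given systemd service is active.
    systemctl is-active --quiet "$1"
}

# Expected state on the HCX Manager after Steps 1-10:
#   check_symlink /common/kafka-db    /common_ext/kafka-db    || echo "kafka-db link missing"
#   check_symlink /common/postgres-db /common_ext/postgres-db || echo "postgres-db link missing"
#   for s in postgresdb zookeeper kafka app-engine web-engine appliance-management; do
#       check_service "$s" || echo "$s is not running"
#   done
```

Any reported failure is a signal to revert per the IMPORTANT note above before proceeding with migrations.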
Follow Step 1 (in reverse) to restore the vCPU and memory of the HCX Manager to the baseline values of 4 vCPU and 12 GB respectively, if needed.

Recommendations for operating concurrent migrations at scale

- As a best practice, use vSphere Monitoring and Performance to monitor the HCX Connector and Cloud Manager CPU and memory usage. Do NOT exceed the recommended limits, as that could cause system instability and failed migration workflows.
- In a scaled-up environment, expect CPU utilization to increase significantly for short periods while migration operations are being processed; there may also be a temporary delay in UI responses for migration progress events.
- Limit the concurrency of MON configuration changes on the target cloud while concurrent Bulk migrations into MON-enabled segments are in switchover.
- Follow the migration events and estimates in the HCX UI to identify any slowness caused by the infrastructure or the network. Additionally, vSphere Replication status can be monitored from the source ESXi host; refer to VMware Knowledge Base article 87028.
- If a source ESXi host is heavily loaded in terms of memory or I/O, replication performance will be affected. As a result, the Bulk/RAV workflow may take longer to complete the initial base sync even when there is no slowness in the underlying data path. Note: In such cases, the recommendation is to relocate the source VM's compute to another, less-loaded ESXi host using native vCenter vMotion. This does not impact the ongoing replication process and requires no changes to the migration workflow.

The Bulk/RAV migration workflow consists of multiple stages (initial/delta sync, offline sync, disk consolidation, data checksum, VM instantiation, and so on), and most of them do not depend on the network infrastructure. The time to complete a migration for any given VM, from start to finish, therefore varies with conditions and is not a simple calculation based on the VM's size and the assumed network bandwidth.
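To make that concrete, the naive bandwidth-only calculation below yields at best a lower bound for the transfer stage alone. The 1 TiB VM size is an assumed example; churn during sync and the non-network stages all add time on top:

```shell
#!/bin/sh
# Naive lower-bound estimate for the data-transfer stage only, using the
# 1.6 Gbps per-IX-appliance ceiling. The 1 TiB VM size is a hypothetical
# example. Real migrations take longer than this: data churn, the other
# workflow stages, and shared appliance load all add time, which is why
# this figure is not a reliable total-duration estimate.
vm_bytes=$(( 1024 * 1024 * 1024 * 1024 ))   # 1 TiB of VM disk data
link_bps=1600000000                          # 1.6 Gbps IX appliance ceiling
seconds=$(( vm_bytes * 8 / link_bps ))
echo "Transfer-stage lower bound: $(( seconds / 60 )) minutes"
```

Under these assumptions the script prints a floor of 91 minutes; treat any such figure as a minimum for one stage, not a completion estimate.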
References

- VMware HCX Configuration Limits
- Network Underlay Characterization, for more information about HCX dependencies on the network infrastructure between sites
- VMware HCX Bulk Migration Operations & Best Practices

Contact your cloud provider regarding the availability of this procedure for scaling up your cloud data center. For scale-up requirements on VMware Cloud on AWS (VMConAWS), open a service request with the VMware Support team.