...
An HCX Network Extension (NE) appliance VM may experience a system/kernel crash during the operation stage. A dump similar to the following can be seen in the logs:

2022-10-15T08:13:15.106Z| vcpu-2| I125: Guest: <1>[ 92.248389] BUG: unable to handle kernel paging request at 0000000000025280
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <6>[ 92.248654] PGD 0 P4D 0
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.248881] Oops: 0000 [#1] SMP PTI
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.248953] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G OE 4.19.245-1.ph3-esx #1-photon
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.249028] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.249120] RIP: 0010:get_rps_cpu+0x89e/0x920
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.249162] Code: ff ff ff e9 aa f8 ff ff 8b 76 1c 89 b5 70 ff ff ff 49 63 ca 48 c7 c2 00 52 02 00 44 8b 85 70 ff ff ff 48 8b 34 cd c0 b4 9b bc <8b> 8c 32 80 00 00 00 8b 94 16 ec 00 00 00 41 29 c8 29 ca 44 39 c2

IMPORTANT: The Network Extension appliance VM goes through a reboot during the crash event as part of its self-recovery process.

Location of the crash dump: on the ESXi host, go to the Network Extension VM directory and open vmware.log.
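To confirm that a given reboot matches this specific issue, the appliance's vmware.log can be searched for the Oops signature shown above. The following is a minimal Python sketch; the log path is a placeholder, since the real file lives in the NE appliance VM's directory on the ESXi datastore:

import re
import sys

# Placeholder: pass the real path to the NE appliance's vmware.log,
# i.e. the file in the VM's directory on the ESXi datastore.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "vmware.log"

# Signature strings taken from the crash dump shown in this article.
SIGNATURES = [
    re.compile(r"BUG: unable to handle kernel paging request"),
    re.compile(r"get_rps_cpu\+0x"),
]

def matches_known_crash(path):
    # Return the log lines matching the known get_rps_cpu Oops signature.
    hits = []
    with open(path, errors="replace") as log:
        for line in log:
            if any(sig.search(line) for sig in SIGNATURES):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for hit in matches_known_crash(LOG_PATH):
        print(hit)

If both the paging-request line and the get_rps_cpu frame are present, the crash very likely matches the issue described here.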
This article identifies a known issue that causes an HCX Network Extension appliance VM system/kernel crash and provides a procedure to clear it.
This is buggy behavior identified in the get_rps_cpu() function (the Linux kernel's receive packet steering CPU-selection routine) running on the Network Extension appliance. It usually occurs during slow kernel networking initialization, or under a few rare, unknown abnormal system conditions.

Note: This is purely a datapath symptom. It may be triggered when certain special workload VMs with a specific traffic type are connected to the NE appliance over a given extended segment, for example a Splunk VM.
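Since the trigger is tied to specific workload VMs on the extended segment, it can help to enumerate which VMs are attached to that segment. Below is a minimal pyVmomi sketch under the assumption that the extended segment appears as a named network in vCenter; the vCenter address, credentials, and network name are placeholders, not HCX tooling:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: real values depend on the environment.
VCENTER, USER, PWD = "vcenter.example.com", "administrator@vsphere.local", "secret"
EXTENDED_NETWORK = "L2E_example-segment"

def vms_on_network(si, network_name):
    # List the names of VMs whose NICs are attached to the named network.
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Network], True)
    try:
        for net in view.view:
            if net.name == network_name:
                return [vm.name for vm in net.vm]
        return []
    finally:
        view.Destroy()

si = SmartConnect(host=VCENTER, user=USER, pwd=PWD,
                  sslContext=ssl._create_unverified_context())
try:
    print(vms_on_network(si, EXTENDED_NETWORK))
finally:
    Disconnect(si)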
All HCX versions are affected. The Network Extension service is impacted for the duration of the system/kernel crash. There is NO impact to HCX migration services.
This issue is fixed in the HCX 4.5.2 release.
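As a quick sanity check, an installed version string can be compared against the fixed release. A minimal sketch, assuming the version string is obtained separately (for example from the HCX Manager appliance):

def has_fix(version, fixed=(4, 5, 2)):
    # True if an HCX version string such as "4.5.2" is at or above the fix.
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= fixed

print(has_fix("4.5.1"))  # False: this build is still affected
print(has_fix("4.5.2"))  # True: this build carries the fix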
As soon as a crash is observed on a given NE appliance, the recommendation is to follow the steps below:

1. Try to isolate any workload VMs that were recently migrated and are sitting on the extended network corresponding to the NE appliance that crashed.
2. Upon identifying such a VM, collect its traffic profile. If it appears to be a busy VM based on certain traffic types, proceed to the next step.
3. Disconnect that specific workload VM from the L2E extended segment, or reverse migrate the VM back to the on-premises/source side so its traffic is no longer bridged over the extended datapath (see the sketch after this list).

There is NO need to redeploy the NE appliance: the appliance VM performs a self-reboot and should recover on its own. There is also NO need to disconnect or reverse migrate the other workload VMs connected to that Network Extension appliance VM; once the special workload VM is disconnected, they should continue operating normally on the same NE appliance.
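For the disconnect option in step 3, the suspect VM's network adapter can be administratively disconnected without powering the VM off, either from the vSphere Client or programmatically. A minimal pyVmomi sketch follows, with all names as placeholders; note that NICs backed by a distributed portgroup or NSX segment use a different backing type and would need a different match:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: real values depend on the environment.
VCENTER, USER, PWD = "vcenter.example.com", "administrator@vsphere.local", "secret"
VM_NAME, EXTENDED_NETWORK = "splunk-vm-01", "L2E_example-segment"

def disconnect_nic(si, vm_name, network_name):
    # Set connected=False on the VM NIC backed by the named network.
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        vm = next(v for v in view.view if v.name == vm_name)
    finally:
        view.Destroy()
    for dev in vm.config.hardware.device:
        if (isinstance(dev, vim.vm.device.VirtualEthernetCard)
                and getattr(dev.backing, "deviceName", None) == network_name):
            dev.connectable.connected = False
            spec = vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=dev)
            return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[spec]))
    raise LookupError("no NIC on %s backed by %s" % (vm_name, network_name))

si = SmartConnect(host=VCENTER, user=USER, pwd=PWD,
                  sslContext=ssl._create_unverified_context())
try:
    task = disconnect_nic(si, VM_NAME, EXTENDED_NETWORK)
    print("reconfigure task submitted:", task.info.key)
finally:
    Disconnect(si)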