...
An HCX Network Extension (NE) appliance VM may experience a system/kernel crash during the operation stage. A dump similar to the following can be seen in the logs:

2022-10-15T08:13:15.106Z| vcpu-2| I125: Guest: <1>[ 92.248389] BUG: unable to handle kernel paging request at 0000000000025280
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <6>[ 92.248654] PGD 0 P4D 0
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.248881] Oops: 0000 [#1] SMP PTI
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.248953] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G OE 4.19.245-1.ph3-esx #1-photon
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.249028] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.249120] RIP: 0010:get_rps_cpu+0x89e/0x920
2022-10-15T08:13:15.107Z| vcpu-2| I125: Guest: <4>[ 92.249162] Code: ff ff ff e9 aa f8 ff ff 8b 76 1c 89 b5 70 ff ff ff 49 63 ca 48 c7 c2 00 52 02 00 44 8b 85 70 ff ff ff 48 8b 34 cd c0 b4 9b bc <8b> 8c 32 80 00 00 00 8b 94 16 ec 00 00 00 41 29 c8 29 ca 44 39 c2

IMPORTANT: The Network Extension appliance VM goes through a reboot during the crash event as part of its self-recovery process.

Location of the crash dump: on the ESXi host, go to the Network Extension VM directory and open vmware.log.
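To confirm that a given reboot matches this specific issue, the appliance's vmware.log can be searched for the Oops signature shown above. The following is a minimal Python sketch; the log path is a placeholder, since the real file lives in the NE appliance VM's directory on the ESXi datastore:

import re
import sys

# Placeholder: pass the real path to the NE appliance's vmware.log,
# i.e. the file in the VM's directory on the ESXi datastore.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "vmware.log"

# Signature strings taken from the crash dump shown in this article.
SIGNATURES = [
    re.compile(r"BUG: unable to handle kernel paging request"),
    re.compile(r"get_rps_cpu\+0x"),
]

def matches_known_crash(path):
    # Return the log lines matching the known get_rps_cpu Oops signature.
    hits = []
    with open(path, errors="replace") as log:
        for line in log:
            if any(sig.search(line) for sig in SIGNATURES):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for hit in matches_known_crash(LOG_PATH):
        print(hit)

If both the paging-request line and the get_rps_cpu frame are present, the crash very likely matches the issue described here.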
This article identifies a known issue that causes an HCX Network Extension appliance VM system/kernel crash and provides a procedure to clear it.
This is buggy behavior identified in the get_rps_cpu() function (the Linux kernel's receive packet steering CPU-selection routine) running on the Network Extension appliance. It usually occurs during slow kernel networking initialization, or under a few rare, unknown abnormal system conditions.

Note: This is purely a datapath symptom. It may be triggered when certain special workload VMs with a specific traffic type are connected to the NE appliance over a given extended segment, for example a Splunk VM.
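Since the trigger is tied to specific workload VMs on the extended segment, it can help to enumerate which VMs are attached to that segment. Below is a minimal pyVmomi sketch under the assumption that the extended segment appears as a named network in vCenter; the vCenter address, credentials, and network name are placeholders, not HCX tooling:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: real values depend on the environment.
VCENTER, USER, PWD = "vcenter.example.com", "administrator@vsphere.local", "secret"
EXTENDED_NETWORK = "L2E_example-segment"

def vms_on_network(si, network_name):
    # List the names of VMs whose NICs are attached to the named network.
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Network], True)
    try:
        for net in view.view:
            if net.name == network_name:
                return [vm.name for vm in net.vm]
        return []
    finally:
        view.Destroy()

si = SmartConnect(host=VCENTER, user=USER, pwd=PWD,
                  sslContext=ssl._create_unverified_context())
try:
    print(vms_on_network(si, EXTENDED_NETWORK))
finally:
    Disconnect(si)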
All HCX versions are affected. The Network Extension service is impacted for the duration of the system/kernel crash. There is NO impact to HCX migration services.
This issue is fixed in the HCX 4.5.2 release.
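As a quick sanity check, an installed version string can be compared against the fixed release. A minimal sketch, assuming the version string is obtained separately (for example from the HCX Manager appliance):

def has_fix(version, fixed=(4, 5, 2)):
    # True if an HCX version string such as "4.5.2" is at or above the fix.
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= fixed

print(has_fix("4.5.1"))  # False: this build is still affected
print(has_fix("4.5.2"))  # True: this build carries the fix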
As soon as a crash is observed on a given NE appliance, the recommendation is to follow the steps below:

1. Try to isolate any workload VMs that were recently migrated and are sitting on the extended network corresponding to the NE appliance that crashed.
2. Upon identifying such a VM, collect its traffic profile. If it appears to be a busy VM based on certain traffic types, proceed to the next step.
3. Disconnect that specific workload VM from the L2E extended segment, or reverse migrate the VM back to the on-premises/source side so its traffic is no longer bridged over the extended datapath (see the sketch after this list).

There is NO need to redeploy the NE appliance: the appliance VM performs a self-reboot and should recover on its own. There is also NO need to disconnect or reverse migrate the other workload VMs connected to that Network Extension appliance VM; once the special workload VM is disconnected, they should continue operating normally on the same NE appliance.
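For the disconnect option in step 3, the suspect VM's network adapter can be administratively disconnected without powering the VM off, either from the vSphere Client or programmatically. A minimal pyVmomi sketch follows, with all names as placeholders; note that NICs backed by a distributed portgroup or NSX segment use a different backing type and would need a different match:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: real values depend on the environment.
VCENTER, USER, PWD = "vcenter.example.com", "administrator@vsphere.local", "secret"
VM_NAME, EXTENDED_NETWORK = "splunk-vm-01", "L2E_example-segment"

def disconnect_nic(si, vm_name, network_name):
    # Set connected=False on the VM NIC backed by the named network.
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        vm = next(v for v in view.view if v.name == vm_name)
    finally:
        view.Destroy()
    for dev in vm.config.hardware.device:
        if (isinstance(dev, vim.vm.device.VirtualEthernetCard)
                and getattr(dev.backing, "deviceName", None) == network_name):
            dev.connectable.connected = False
            spec = vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=dev)
            return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[spec]))
    raise LookupError("no NIC on %s backed by %s" % (vm_name, network_name))

si = SmartConnect(host=VCENTER, user=USER, pwd=PWD,
                  sslContext=ssl._create_unverified_context())
try:
    task = disconnect_nic(si, VM_NAME, EXTENDED_NETWORK)
    print("reconfigure task submitted:", task.info.key)
finally:
    Disconnect(si)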