VMware - Defect ID: 85112

High Availability in Velocloud SD-WAN edge- HA Active/Active Panic issue

Last updated on 4/4/2023

Symptoms

On a High Availability (HA) enabled VMware SD-WAN edge, an Active/Active panic is occasionally observed. As a result, one of the following might happen:

1. The standby edge goes to the UNKNOWN state.
2. Sometimes, this results in partial traffic loss.

Purpose

This KB article documents the scenarios that lead to an Active/Active panic and provides information on their resolution.

Affected versions: 3.4.x and 4.x

Cause

The following scenarios can cause an Active/Active panic:

1. Network outage: An HA interface flap might cause both edges to move to the Active state. A network outage is not a software issue, and the new Active edge is expected to restart to recover from the Active/Active panic.
Resolution: None. This is expected behaviour.

2. Heavy synchronisation traffic: When there is heavy synchronisation traffic between the active and standby edges for a brief period, because of a route flap or a high number of flows, an Active/Active panic is possible.
Resolution: Fixed in software release 3.4.4, bug-id VLENG-44640.

3. Heavy flow processing load on edge: When a high number of flows is processed per second on the active edge, an Active/Active panic is possible.
Resolution: Not resolved yet. Currently documented as a known issue; engineering ticket VLENG-66183 is open for tracking.

4. Standby reboot: When the standby edge is rebooted (most commonly observed on edge5x0 and edge6x0), an Active/Active panic is possible.
Resolution: Fixed in software releases 4.2.2 and 4.3.0, bug-id VLENG-60006.

5. Interrupt 16 (IRQ16) disable (Edge 6X0 models): A disabled IRQ16 slows I/O operations, which leads to thread starvation and results in an Active/Active panic on 6X0 hardware. A disabled IRQ16 can be detected by running the commands below on both the active and standby edges. If the interrupt counters have not increased after the sleep, IRQ16 is disabled.

cat /proc/interrupts | grep mmc
sleep 5
cat /proc/interrupts | grep mmc
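The manual IRQ16 check in item 5 can be scripted so the two counter readings are compared automatically. The following is a minimal sketch, not a VMware-provided tool: the function names sum_mmc_counts and mmc_irq_status and the --live flag are hypothetical, and it assumes the usual Linux /proc/interrupts layout (an interrupt label, then one numeric counter per CPU, then the controller and device names).

```shell
#!/bin/sh
# Sketch only: helper names and the --live flag are illustrative assumptions.

# Sum the per-CPU counters of every /proc/interrupts line that mentions "mmc".
# Reads the snapshot text on stdin and prints the total.
sum_mmc_counts() {
    awk '/mmc/ {
        # Field 1 is the IRQ label; the numeric fields that follow are per-CPU counts.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
    }
    END { printf "%.0f\n", total }'
}

# Compare two snapshots (passed as strings).
# Prints "advancing" if the mmc counters grew, otherwise "stalled".
mmc_irq_status() {
    before=$(printf '%s\n' "$1" | sum_mmc_counts)
    after=$(printf '%s\n' "$2" | sum_mmc_counts)
    if [ "$after" -gt "$before" ]; then
        echo "advancing"
    else
        echo "stalled"
    fi
}

# Live check: take two snapshots 5 seconds apart, as in the manual procedure.
if [ "${1:-}" = "--live" ]; then
    first=$(cat /proc/interrupts)
    sleep 5
    second=$(cat /proc/interrupts)
    mmc_irq_status "$first" "$second"
fi
```

Run with --live on both the active and the standby edge. A "stalled" result corresponds to the condition this article describes as IRQ16 being disabled; an "advancing" result means the counters increased during the 5-second window.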

Impact / Risks

1. Edges running Enhanced HA will have partial traffic loss.
2. When both edges are in the active state and simultaneously try to establish BGP/OSPF neighbor relationships, the neighbor relationship can hang in rare cases.

Resolution

As mentioned in the Cause section, some of the scenarios are fixed and others are documented as known issues that are planned to be fixed.

Workaround

There are no baseline numbers for scalability, throughput, or flows per second that avoid an Active/Active panic. This document also does not recommend increasing the HA failover timeout value to escape an Active/Active panic, since doing so increases traffic loss during HA failover.
