...
On a High Availability (HA) enabled VMware SD-WAN edge, an Active/Active panic is occasionally observed. As a result, one of the following might happen:
1. The standby edge goes to the UNKNOWN state.
2. In some cases, this results in partial traffic loss.
This KB article documents the scenarios that lead to an Active/Active panic and provides information on their resolution.
Affected versions: 3.4.x and 4.x
Causes of Active/Active panic
1. Network outage: An HA interface flap might cause both edges to move to the Active state. A network outage is not a software issue, and the new Active edge is expected to restart to recover from the Active/Active panic.
Resolution: None. This is expected behaviour.
2. Heavy synchronisation traffic: When there is heavy synchronisation traffic between the active and standby edges for a brief period, caused by a route flap or a high number of flows, an Active/Active panic is possible.
Resolution: Fixed in software release 3.4.4, bug-id VLENG-44640.
3. Heavy flow processing load on edge: When a high number of flows is processed per second on the active edge, an Active/Active panic is possible.
Resolution: Not resolved yet. Currently documented as a known issue; engineering ticket VLENG-66183 is open to track it.
4. Standby reboot: When the standby edge is rebooted (most commonly observed on edge5x0 and edge6x0), an Active/Active panic is possible.
Resolution: Fixed in software releases 4.2.2 and 4.3.0, bug-id VLENG-60006.
5. Interrupt 16 (IRQ16) disabled (Edge 6X0 models): When IRQ16 is disabled, I/O operations slow down. This leads to thread starvation and results in an Active/Active panic on 6X0 hardware. To detect whether IRQ16 is disabled, run the following commands on both the active and standby edges. If the interrupt counters have not increased after the sleep, IRQ16 is disabled.
    cat /proc/interrupts | grep mmc
    sleep 5
    cat /proc/interrupts | grep mmc
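The IRQ16 check described above compares two readings of the mmc interrupt counters by eye. The following is a minimal sketch of how that comparison could be scripted. It is not a VMware-provided tool: the function names (sum_mmc_counters, irq_verdict, check_mmc_irq) are hypothetical, and the script assumes the standard Linux /proc/interrupts layout, in which the per-CPU counters are the numeric fields that follow the IRQ label.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any VMware tooling.

# Sum the per-CPU counters on every line matching "mmc".
# Reads /proc/interrupts-formatted text on stdin.
sum_mmc_counters() {
    awk '/mmc/ {
        # Field 1 is the IRQ label (for example "16:"). The per-CPU counters
        # follow it, up to the first non-numeric field (the controller name).
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
    }
    END { print total + 0 }'
}

# Compare two samples: print "increasing" if the second is larger,
# otherwise "stalled".
irq_verdict() {
    if [ "$2" -gt "$1" ]; then
        echo "increasing"
    else
        echo "stalled"
    fi
}

# Take two samples, separated by an interval in seconds (default 5),
# and report the result. A missing mmc line also reports "stalled",
# because both sums are then 0.
check_mmc_irq() {
    interval="${1:-5}"
    before=$(sum_mmc_counters < /proc/interrupts)
    sleep "$interval"
    after=$(sum_mmc_counters < /proc/interrupts)
    echo "mmc interrupt count: before=$before after=$after"
    irq_verdict "$before" "$after"
}
```

After sourcing the script on an edge, running check_mmc_irq prints both counter sums and a verdict. A "stalled" result corresponds to the case described above in which the counters did not increase after the sleep. As with the manual commands, the check should be run on both the active and standby edges.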
1. Edges running Enhanced HA will have partial traffic loss.
2. When both edges are in the active state and simultaneously try to establish BGP/OSPF neighbor relationships, the neighbor relationship can hang in rare cases.
As mentioned in the causes section, some of the scenarios are fixed, and others are documented as known issues that are planned to be fixed.
Workaround
There are no baseline numbers for scalability, throughput, or flows per second that avoid an Active/Active panic. This document also does not recommend increasing the HA failover timeout value to avoid an Active/Active panic, because doing so increases traffic loss during HA failover.