...
Packet loss may be observed between hosts connected via vPC to vPC VTEPs in a VXLAN BGP EVPN fabric that uses eBGP as an underlay when one of the vPC VTEPs comes back online after a reload or power outage. Specifically, the packet loss starts when the vPC Delay Restore timer of the reloaded vPC peer expires and continues while the NVE source loopback (that is, the loopback interface referenced by the "source-interface {interface}" command) is still held in an Administratively Down state. The total duration of the packet loss will vary, but usually ranges from 60 seconds to several minutes depending on the precise vPC Delay Restore and NVE source loopback hold-down timers.

The precise symptoms and order of operations for this software defect are best explained through an example. Consider the following VXLAN BGP EVPN Multi-Site topology:

+-------------------------------------------------------------------------+
|                                                                         |
|                Site 1 VXLAN BGP EVPN Fabric - ASN 64501                 |
|                                                                         |
|                          +-----------------+                            |
|            +-------------+   VPC Host 1    +------------+               |
|            |             +-----------------+            |               |
|            |                                            |               |
|    +-------+--------+    Po1 - VPC Peer-Link    +-------+--------+      |
|    |  Site-1-BGW-1  +---------------------------+  Site-1-BGW-2  |      |
|    |   192.0.2.2    |---------------------------|   192.0.2.3    |      |
|    +-------+--------+         192.0.2.1         +-------+--------+      |
|            |                                            |               |
+------------+--------------------------------------------+---------------+
             |                                            |
             |                                            |
+------------+--------------------------------------------+---------------+
|            |                                            |               |
|    +-------+--------+         192.0.2.4         +-------+--------+      |
|    |   192.0.2.5    +---------------------------+   192.0.2.6    |      |
|    |  Site-2-BGW-1  |    Po1 - VPC Peer-Link    |  Site-2-BGW-2  |      |
|    +-------+--------+                           +-------+--------+      |
|            |                                            |               |
|            |             +-----------------+            |               |
|            +-------------+   VPC Host 2    +------------+               |
|                          +-----------------+                            |
|                                                                         |
|                Site 2 VXLAN BGP EVPN Fabric - ASN 64502                 |
|                                                                         |
+-------------------------------------------------------------------------+

Site-1-BGW-1 and Site-1-BGW-2 act as vPC Border Gateways for the Site 1 VXLAN BGP EVPN fabric. Similarly, Site-2-BGW-1 and Site-2-BGW-2 act as vPC Border Gateways for the Site 2 VXLAN BGP EVPN fabric. The IP address within each node represents the primary IP address of that node's NVE source loopback, while the IP address in between both nodes represents the secondary IP address shared by both nodes' NVE source loopbacks (which is required for vPC VTEPs).

Notice that Site 1 and Site 2 are in different BGP autonomous systems, which means the BGP peering between Site 1's vPC Border Gateways and Site 2's vPC Border Gateways will be an eBGP peering. Assume that there is also an iBGP peering across the vPC Peer-Link of each site (e.g. Site-1-BGW-1 is peering with Site-1-BGW-2, and Site-2-BGW-1 is peering with Site-2-BGW-2). Also, notice that each pair of vPC Border Gateways has a vPC-connected host.

Consider a scenario where Site-2-BGW-1 is reloaded. A minimal amount of packet loss is expected as packets in flight are lost when Site-2-BGW-1 powers off. As Site-2-BGW-1 comes back online, two key timers (among others) will start counting down on Site-2-BGW-1:

1. The vPC Delay Restore timer, which brings vPCs up on expiration.
2. The NVE source loopback hold-down timer, which on expiration brings the NVE source loopback (that is, the loopback interface referenced by the "source-interface {loopback-interface}" command under the NVE1 interface) out of an Administratively Down state and into an up/up state.

Simultaneously, the iBGP peering between Site-2-BGW-1 and Site-2-BGW-2 will come up over the vPC Peer-Link. The NVE source loopback hold-down timer will keep Site-2-BGW-1's NVE loopback in an Administratively Down state. However, Site-2-BGW-2's NVE loopback is in an up/up state and will be advertised by BGP to all BGP peers, including Site-2-BGW-1. Site-2-BGW-1 will install a route to the primary and secondary IP addresses of Site-2-BGW-2's NVE source loopback in its routing table.
Site-2-BGW-1 will also advertise reachability to the primary and secondary IP addresses of Site-2-BGW-2's NVE source loopback to Site-1-BGW-1 and Site-1-BGW-2 via eBGP, citing Site-2-BGW-1 as a valid path. Site-1-BGW-1 and Site-1-BGW-2 will successfully install this path in their routing tables because, from an eBGP perspective, the AS Path of both paths is identical. The result is ECMP (Equal-Cost Multi-Pathing) to the NVE source loopback's secondary IP address through both Site-2-BGW-1 and Site-2-BGW-2.

While the vPC Delay Restore timer and the NVE source loopback hold-down timer are both still running on Site-2-BGW-1, connectivity through Site-2-BGW-1 and Site-2-BGW-2 will work as expected. Any VXLAN-encapsulated packet that ingresses Site-2-BGW-1 and is destined for a vPC-connected host will be forwarded across the vPC Peer-Link to Site-2-BGW-2 for decapsulation and forwarding. This is because Site-2-BGW-1's vPC is down (due to the vPC Delay Restore timer) and Site-2-BGW-1's NVE source loopback is down (due to the NVE source loopback hold-down timer), which prevents Site-2-BGW-1 from decapsulating and forwarding the native Ethernet packet locally.

When the vPC Delay Restore timer expires on Site-2-BGW-1, Site-2-BGW-1 will bring up its vPCs to connected hosts. If the vPC Delay Restore timer expires before the NVE source loopback hold-down timer, you will observe packet loss to vPC-connected hosts. This is because VXLAN-encapsulated packets that ingress Site-2-BGW-1 cannot be decapsulated by Site-2-BGW-1, since the NVE source loopback is still in an Administratively Down state on Site-2-BGW-1 due to the NVE source loopback hold-down timer. As a result, Site-2-BGW-1 will forward these VXLAN-encapsulated packets to Site-2-BGW-2. Site-2-BGW-2 will decapsulate these packets, but it will drop the packets instead of forwarding them out of its local vPC due to a violation of the vPC Loop Avoidance Rule.
In other words, because Site-2-BGW-1's local vPC to the vPC-connected host is up, Site-2-BGW-2 assumes that Site-2-BGW-1 should have been able to decapsulate and forward these VXLAN-encapsulated packets locally instead of sending them across the vPC Peer-Link, so Site-2-BGW-2 incorrectly drops these packets as a loop prevention mechanism.

When the NVE source loopback hold-down timer finally expires, the NVE source loopback on Site-2-BGW-1 will transition to an up/up state. Site-2-BGW-1 is now able to decapsulate VXLAN-encapsulated packets and forward them out of its local vPC successfully. Packet loss to vPC-connected hosts stops shortly after the NVE source loopback hold-down timer expires.
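The sequence above can be summarized as a small state model. The following Python sketch is illustrative only: the class, function, and outcome labels are hypothetical names chosen for this example, not NX-OS internals. It maps the two timer-controlled states of the reloaded peer (Site-2-BGW-1) to the fate of a VXLAN-encapsulated packet that ingresses that peer and is destined for a vPC-connected host. The case where the NVE source loopback is up while the vPC is still down is not described in this bug report; the outcome shown for it is an assumption based on normal vPC forwarding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReloadedPeerState:
    """Simplified state of the reloaded vPC peer (Site-2-BGW-1 in the example)."""
    vpc_up: bool           # True once the vPC Delay Restore timer has expired
    nve_loopback_up: bool  # True once the NVE source loopback hold-down timer has expired


def forwarding_outcome(state: ReloadedPeerState) -> str:
    """Fate of a VXLAN packet for a vPC host that ingresses the reloaded peer."""
    if state.nve_loopback_up and state.vpc_up:
        # Steady state: the reloaded peer decapsulates and uses its own vPC.
        return "forwarded-locally"
    if state.nve_loopback_up:
        # ASSUMPTION (not stated in the bug report): the reloaded peer
        # decapsulates, and the native frame reaches the host through the
        # other peer because the local vPC member is still down.
        return "forwarded-via-peer"
    if state.vpc_up:
        # Defect window: the packet crosses the Peer-Link still encapsulated,
        # the other peer decapsulates it, then drops it under the
        # vPC Loop Avoidance Rule because the reloaded peer's vPC is up.
        return "dropped-by-peer"
    # Both timers still running: the other peer decapsulates and forwards.
    return "forwarded-via-peer"


if __name__ == "__main__":
    for vpc_up in (False, True):
        for loopback_up in (False, True):
            state = ReloadedPeerState(vpc_up=vpc_up, nve_loopback_up=loopback_up)
            print(state, "->", forwarding_outcome(state))
```

Only the combination of an up vPC and a down NVE source loopback produces a drop, which is why the relative order in which the two timers expire decides whether packet loss occurs.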
This issue has been observed in VXLAN BGP EVPN Multi-Site fabrics when Border Gateways have hosts connected to them via vPC and eBGP peerings are used between the Border Gateways of two sites. Packet loss is only observed for traffic destined to these vPC-connected hosts. No packet loss is observed for hosts (vPC-connected or otherwise) connected to leaf switches elsewhere in the same site's EVPN fabric. As of this time, this issue has only been observed with Border Gateways. However, it is also expected to affect normal vPC VTEPs when an eBGP underlay is used instead of an iBGP underlay.
You can proactively avoid this issue by ensuring the NVE source loopback hold-down timer is set to a value less than the vPC Delay Restore timer. Keep in mind, however, the following note from the Cisco Nexus 9000 Series NX-OS VXLAN Configuration Guide, Release 9.3(x), chapter "Configuring VXLAN BGP EVPN":

"During the vPC Border Gateway boot up process the NVE source loopback interface undergoes the hold down timer twice instead of just once. This is a day-1 and expected behavior."

Therefore, any adjustment to the vPC "delay-restore" or NVE "hold-down-time" value should account for this second pass of the hold-down timer when applying this workaround.
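As a rough aid for sizing the two timers, the arithmetic behind the workaround can be sketched in Python. This is a simplified model under stated assumptions, not a description of the actual NX-OS boot sequence: it assumes both timers start at the same instant and that the second hold-down pass begins immediately after the first. The function names and the example values (150 and 180 seconds) are hypothetical and chosen for illustration only.

```python
def effective_hold_down(hold_down_s: int, passes: int = 2) -> int:
    """Total time the NVE source loopback stays down during boot.

    ASSUMPTION: the hold-down timer runs `passes` times back to back
    (twice on a vPC Border Gateway, per the configuration guide note).
    """
    if hold_down_s < 0 or passes < 1:
        raise ValueError("hold_down_s must be >= 0 and passes must be >= 1")
    return hold_down_s * passes


def estimated_loss_window(delay_restore_s: int, hold_down_s: int, passes: int = 2) -> int:
    """Seconds during which the vPC is up but the NVE source loopback is still down."""
    return max(0, effective_hold_down(hold_down_s, passes) - delay_restore_s)


def workaround_satisfied(delay_restore_s: int, hold_down_s: int, passes: int = 2) -> bool:
    """True if the loopback comes up strictly before the vPC Delay Restore timer expires."""
    return effective_hold_down(hold_down_s, passes) < delay_restore_s


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(estimated_loss_window(delay_restore_s=150, hold_down_s=180))  # 210
    print(workaround_satisfied(delay_restore_s=150, hold_down_s=180))   # False
    print(workaround_satisfied(delay_restore_s=400, hold_down_s=180))   # True
```

Under these assumptions, a hold-down value that looks safely below the Delay Restore value on its own can still leave a loss window once the second pass is counted, so the comparison should use the doubled hold-down time.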