Symptom
In certain vPC designs, when an endpoint (EP) moves between different vPC pairs of Leaves, the old vPC pair might publish an EP DELETE to COOP after the new vPC pair has already published an ADD.
This deletes the EP entry from the COOP DB on the Spines, so traffic sent to the Spine-proxy is blackholed until the new pair refreshes the EP in COOP (which might take up to 1 hour).
Conditions
The issue was seen in the following design:
- Blade chassis with two blade switches attached to two separate vPC pairs of Leaves: BSW-A to 101/102 and BSW-B to 103/104
- When one of the vPCs (the one to which the EP was pinned on the Blade chassis side) went down, the Blade chassis failed over to the other vPC, which connects to a different pair of Leaves.
Workaround
- Clear the EP entry on the new vPC pair so that it re-publishes the EP into COOP:
vsh -c "clear system internal epm endpoint key vrf EXAMPLE_TX:EXAMPLE_VRF ip 192.0.2.100"
- If multiple EPs are affected, you can clear all EPs in the VRF:
vsh -c "clear system internal epm endpoint vrf EXAMPLE_TX:EXAMPLE_VRF all"
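When only a handful of EPs are affected, the per-EP clear can be repeated for each address instead of clearing the whole VRF. The following is a minimal sketch of that idea, not a tested procedure: it only prints the commands so they can be reviewed before being run on the new vPC pair. The helper name build_clear_cmd, the AFFECTED_IPS variable, and the second address 192.0.2.101 are hypothetical and used for illustration only; the command text itself is the per-EP clear shown above.

```shell
#!/bin/sh
# Dry run: print one per-EP clear command for each affected address.
# VRF and the first IP come from the workaround above; the second IP
# is a hypothetical placeholder.
VRF="EXAMPLE_TX:EXAMPLE_VRF"
AFFECTED_IPS="192.0.2.100 192.0.2.101"

build_clear_cmd() {
  # $1 = VRF name (tenant:vrf), $2 = endpoint IP address
  printf 'vsh -c "clear system internal epm endpoint key vrf %s ip %s"\n' "$1" "$2"
}

# Nothing is executed here; the commands are only printed for review.
for ip in $AFFECTED_IPS; do
  build_clear_cmd "$VRF" "$ip"
done
```

Printing rather than executing keeps the sketch safe to run anywhere; the output lines can be pasted on the Leaves once the list of addresses has been checked.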
Further Problem Description
This is a timing issue: if COOP processes the EP DELETE caused by the port going down on the old vPC pair after the EP has already been learned and published by the new vPC pair, the DELETE from the old pair overrides the entry in the COOP DB.
This is an enhancement request to introduce checks between COOP and EPM that prevent this issue.
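The ordering problem described above can be illustrated with a toy model. This is a simplified sketch, not the actual COOP implementation: it assumes a single EP and a database that applies each message in arrival order without checking which vPC pair published it. The names coop_owner, coop_apply and coop_lookup are hypothetical.

```shell
#!/bin/sh
# Toy model of the COOP DB entry for one EP (assumption: messages are
# applied strictly in arrival order, with no check on the publisher).
coop_owner=""   # empty string means "no entry for this EP"

coop_apply() {
  # $1 = ADD or DELETE, $2 = vPC pair that published the message
  case "$1" in
    ADD)    coop_owner="$2" ;;
    DELETE) coop_owner="" ;;   # applied regardless of who owns the entry
  esac
}

coop_lookup() {
  if [ -n "$coop_owner" ]; then
    echo "present via $coop_owner"
  else
    echo "missing"
  fi
}

# Expected order: the old pair's DELETE arrives before the new pair's ADD.
coop_owner="101/102"
coop_apply DELETE "101/102"
coop_apply ADD "103/104"
in_order=$(coop_lookup)

# Race: the old pair's DELETE arrives after the new pair's ADD.
coop_owner="101/102"
coop_apply ADD "103/104"
coop_apply DELETE "101/102"
raced=$(coop_lookup)

echo "in order: $in_order"
echo "raced:    $raced"
```

In the second sequence the entry published by 103/104 is removed by the late DELETE from 101/102, which corresponds to the window in which traffic sent to the Spine-proxy is blackholed.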