Symptom
On ASR9k CXR systems, the pfi-protect component uses an incorrect MTU value: the correct MTU for ASR9k is 1084, but 7708 is used. As a result, if more than 3 interfaces flap in quick succession, the bfd_agent process can batch more than 3 "BFD session down" notifications into one message. That message exceeds the real MTU and fails to be delivered, which in turn causes a transient FIB/forwarding inconsistency.
Conditions
This problem triggers only when at least 4 "BFD session down" events occur in very quick succession. It is specific to the ASR9k CXR platform and is not present on any other XR platform.
Workaround
If an SRLG group contains more than 3 interfaces, adjust the BFD minimum interval in steps of 25 ms for every 3 members. This should ensure that bfd_agent notifications arrive for no more than 3 interfaces at a time.
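One possible reading of this workaround can be sketched as follows. This is only an illustration: the function name, the interface names, and the 50 ms base interval are hypothetical, and the script merely computes a staggered plan; it does not generate or apply any router configuration.

```python
# Sketch of the staggering rule: every group of 3 SRLG members gets a
# BFD minimum interval 25 ms larger than the previous group.
STEP_MS = 25      # step from the workaround text
GROUP_SIZE = 3    # members per step from the workaround text


def staggered_intervals(interfaces, base_ms):
    """Map each interface to a BFD minimum interval in milliseconds.

    base_ms is an assumed starting interval chosen by the operator.
    """
    return {
        name: base_ms + STEP_MS * (index // GROUP_SIZE)
        for index, name in enumerate(interfaces)
    }


if __name__ == "__main__":
    # Hypothetical SRLG with 7 members and a hypothetical 50 ms base.
    members = [f"intf{n}" for n in range(1, 8)]
    for name, interval in staggered_intervals(members, 50).items():
        print(f"{name}: {interval} ms")
```

With 7 members, the first three share the base interval, the next three use base + 25 ms, and the seventh uses base + 50 ms, so no more than 3 members share the same interval.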
Further Problem Description
[1] Example of identifying the issue from the trace:
#show protection-notif trace location 0/x/cpu0:
Mar 31 00:26:50.514 protect/server-trig 0/1/CPU0 40# t8 SERVER bfd_agent[125]: Trigger failed for event 'BFD Session Down', input count 7, reason: 'ce' detected the 'warning' condition 'msg data size greater then MTU' (0x45d91200)
Mar 31 00:26:50.514 protect/server-err- 0/1/CPU0 t8 SERVER bfd_agent[125]: ERROR - failed to send trigger message for event 'BFD Session Down', length 2000 - GSP gang deliver event returned: 'ce' detected the 'warning' condition 'msg data size greater then MTU' (0x45d91200)
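When a saved trace is long, the failure line can be picked out with a small script. The sketch below is an assumption about how one might do this offline; the function name and pattern are illustrative, and the pattern is built only from the trace line shown above, so other software versions may word the message differently.

```python
import re

# Matches the "Trigger failed" line shown in the trace above.
# "greater then MTU" is spelled exactly as it appears in the trace.
TRIGGER_FAILED = re.compile(
    r"Trigger failed for event '(?P<event>[^']+)', "
    r"input count (?P<count>\d+), "
    r".*msg data size greater then MTU"
)

SAMPLE = (
    "Mar 31 00:26:50.514 protect/server-trig 0/1/CPU0 40# t8 SERVER "
    "bfd_agent[125]: Trigger failed for event 'BFD Session Down', "
    "input count 7, reason: 'ce' detected the 'warning' condition "
    "'msg data size greater then MTU' (0x45d91200)"
)


def find_mtu_failures(trace_text):
    """Return (event, input_count) for each MTU-related trigger failure."""
    hits = []
    for line in trace_text.splitlines():
        match = TRIGGER_FAILED.search(line)
        if match:
            hits.append((match.group("event"), int(match.group("count"))))
    return hits


if __name__ == "__main__":
    for event, count in find_mtu_failures(SAMPLE):
        print(f"{event}: batch of {count} interfaces was rejected")
```

An input count above 3 on such a line indicates that the batched message was too large for the GSP gang MTU.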
[2] MTU calculations for GSP gang
Protect message overhead: 68
Per interface data sizes:
BFD data: 272
Interface handle: 4
Per interface data size: 272 + 4 = 276
Message size for 3 interfaces:
Per interface data size * 3 + Protect message overhead
(276 * 3) + 68 = 896 ==> This is less than the GSP gang MTU of 1084, so the message is delivered
Message size for 4 interfaces:
Per interface data size * 4 + Protect message overhead
(276 * 4) + 68 = 1172 ==> This is greater than the GSP gang MTU of 1084, so the message is not delivered
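The same arithmetic can be reproduced with a short script. This is a minimal sketch using only the sizes listed in [2] (assumed here to be bytes); the constant and function names are illustrative and do not come from any platform code. It also shows that a batch of 7 interfaces comes to 2000, which matches the "input count 7" and "length 2000" values in the trace in [1].

```python
# Sizes taken from the figures in [2]; assumed to be in bytes.
PROTECT_OVERHEAD = 68
BFD_DATA = 272
INTERFACE_HANDLE = 4
PER_INTERFACE = BFD_DATA + INTERFACE_HANDLE  # 276

GSP_GANG_MTU = 1084  # correct MTU for ASR9k
WRONG_MTU = 7708     # value pfi-protect was using


def message_size(interface_count):
    """Size of one protect message carrying interface_count entries."""
    return PER_INTERFACE * interface_count + PROTECT_OVERHEAD


def max_interfaces(mtu):
    """Largest number of interface entries that fit in one message."""
    return (mtu - PROTECT_OVERHEAD) // PER_INTERFACE


if __name__ == "__main__":
    for count in (3, 4, 7):
        size = message_size(count)
        verdict = "delivered" if size <= GSP_GANG_MTU else "not delivered"
        print(f"{count} interfaces: {size} -> {verdict}")
    print("fits under correct MTU:", max_interfaces(GSP_GANG_MTU))
    print("fits under wrong MTU:", max_interfaces(WRONG_MTU))
```

Under the correct MTU at most 3 entries fit in one message, while the incorrect value of 7708 would leave room for 27, which is consistent with batches larger than 3 being built and then rejected.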