...
OSPF NSR will not sync for some OSPF processes: + show redundancy will show that some OSPF processes reached NSR sync while others did not: Redundancy information for node 0/RSP0/CPU0: ========================================== Node 0/RSP0/CPU0 is in ACTIVE role Node Redundancy Partner (0/RSP1/CPU0) is in STANDBY role Standby node in 0/RSP1/CPU0 is ready Standby node in 0/RSP1/CPU0 is NSR-not-configured Node 0/RSP0/CPU0 is in process group PRIMARY role Process Redundancy Partner (0/RSP1/CPU0) is in BACKUP role Backup node in 0/RSP1/CPU0 is ready Backup node in 0/RSP1/CPU0 is not NSR-ready Group Primary Backup Status --------- --------- --------- --------- v6-routing 0/RSP0/CPU0 0/RSP1/CPU0 Ready mcast-routing 0/RSP0/CPU0 0/RSP1/CPU0 Ready netmgmt 0/RSP0/CPU0 0/RSP1/CPU0 Ready v4-routing 0/RSP0/CPU0 0/RSP1/CPU0 Ready v4-routing 0/RSP0/CPU0 0/RSP1/CPU0 Not NSR-Ready central-services 0/RSP0/CPU0 0/RSP1/CPU0 Ready dsc 0/RSP0/CPU0 0/RSP1/CPU0 Ready dlrsc 0/RSP0/CPU0 0/RSP1/CPU0 Ready Process Group Details --------------------- Current primary rmf state: Ready All backup not-ready bits clear - backup should be ready Current primary rmf state for NSR: Not Ready Reason for backup not NSR-ready 1028 0/RSP0/CPU0 ospf v4-routing No connection to standby Not ready set Sun Jan 24 05:10:01 2021: 6 hours, ago 1030 0/RSP0/CPU0 ospf v4-routing No connection to standby Not ready set Sun Jan 24 05:10:01 2021: 6 hours, ago 1029 0/RSP0/CPU0 ospf v4-routing No connection to standby Not ready set Sun Jan 24 05:10:01 2021: 6 hours, ago 1032 0/RSP0/CPU0 ospf v4-routing No connection to standby Not ready set Sun Jan 24 05:10:01 2021: 6 hours, ago Reload and boot info ---------------------- A9K-RSP440-TR reloaded Fri Jul 12 01:49:24 2019: 1 year, 28 weeks, 1 day, 9 hours, 20 minutes ago Active node booted Sun Jan 24 04:10:29 2021: 6 hours, 59 minutes ago Last switch-over Sun Jan 24 04:28:34 2021: 6 hours, 41 minutes ago Standby node boot Sun Jan 24 05:03:18 2021: 6 hours, 6 minutes ago Standby node last went not ready Sun Jan 24 05:04:43 2021: 6 hours, 5 minutes ago Standby node last went ready Sun Jan 24 05:07:34 2021: 6 hours, 2 minutes ago There have been 2 switch-overs since reload Active node reload "Cause: self-reset to use new boot image" Standby node reload "Cause: MBI-HELLO reloading node on receiving reload notification" + show ospf vrf all statistics nsr | i "FSM state|VRF" Fri Jan 29 05:14:02.184 EST OSPF CGN11, VRF cgn1-in-1 NSR Statistics: FSM state: Nsr-Available OSPF CGN12, VRF cgn1-in-2 NSR Statistics: FSM state: Connecting <<< OSPF CGN21, VRF cgn2-in-1 NSR Statistics: FSM state: Nsr-Available OSPF CGN22, VRF cgn2-in-2 NSR Statistics: FSM state: Nsr-Available OSPF CGN31, VRF cgn3-in-1 NSR Statistics: FSM state: Connecting <<< OSPF CGN32, VRF cgn3-in-2 NSR Statistics: FSM state: Nsr-Available + NSR timeouts because NSR messages get delivered to the wrong OSPF process ID: observation for OSPF process CGN21 ---------------------------------- OSPF CGN21 NSR Statistics: NSR revision: 77590 FSM state: Connecting ---- >>> connectivity issue FSM flags 0xf1af IF_MAP(1),HA_OPT(1),DSENDP(1),CONFIG(1) BIGRCV(0),BIGPUL(0),DISCON(0),LIBNSR(1) RAWCON(1),ST_WAI(0),CON_ST(1),CON_AC(0) PEERID(1),NODEID(1),QAD_IN(1),INIT__(1) FSM Running Timers rev_ack_tmr: Will expire in 0 secs My nodeid: 0x41 Partner nodeid: 0x51 NSR version 1 Partner NSR version 1 Fabric MTU: 8844 NBR QAD queue id: 3 LSA QAD queue id: 4 NBR QAD pending msgs: 0 LSA QAD pending msgs: 0 Pending init-sync NBR msgs: 0 Pending init-sync LSA msgs: 0 Neighbor TLV sequence no: 0 Interface TLV sequence no: 0 Timers processed per pulse: 32 Attempts to connect to active: 2 Failures to connect to active: 0 Conn to active opens: 2 Conn to active closes: 1 Conn to active errors: 0 Scheduler Statistics: Events processed per pulse: 112 Events pending: 0 Events to NSR thread: 318860 Events processed: 318860 Bad pulses received: 0 Good pulses received: 318848 Pulses to NSR thread: 318848 Pulse send failures: 0 High priority queue: Quantum: 32, Max q length: 2048 Remaining quantum: 10 Events pending: 0 Peak queue length: 5 Events queued: 318838 Event enqueue fails: 0 Events dequeued: 318838 Medium priority queue: Quantum: 16, Max q length: 4096 Remaining quantum: 8 Events pending: 0 Peak queue length: 2 Events queued: 8 Event enqueue fails: 0 Events dequeued: 8 Low priority queue: Quantum: 8, Max q length: 4096 Remaining quantum: 2 Events pending: 0 Peak queue length: 1 Events queued: 14 Event enqueue fails: 0 Events dequeued: 14 NSR pulse statistics for the Router Thread: Events processed per pulse: 112 Events pending: 0 Events to Router thread: 21 Events processed: 21 Bad pulses received: 0 Good pulses received: 15 Pulses to Router thread: 15 Pulse send failures: 0 High priority queue: Quantum: 32, Max q length: 2048 Remaining quantum: 23 Events pending: 0 Peak queue length: 6 Events queued: 9 Event enqueue fails: 0 Events dequeued: 9 Medium priority queue: Quantum: 16, Max q length: 4096 Remaining quantum: 13 Events pending: 0 Peak queue length: 1 Events queued: 3 Event enqueue fails: 0 Events dequeued: 3 Low priority queue: Quantum: 8, Max q length: 4096 Remaining quantum: 7 Events pending: 0 Peak queue length: 1 Events queued: 9 Event enqueue fails: 0 Events dequeued: 9 Error history info: NSR Rev 77580, Total 1 errors seen Jan 24 11:10:16.895: Nsr Rev Tx timeout: seq_num(77582) (1 times) ---- >>> timeout message NSR Rev 77581, Total 1 errors seen Jan 24 11:10:17.183: Nsr Rev Tx timeout: seq_num(77583) (1 times) NSR Rev 77582, Total 1 errors seen Jan 24 11:10:17.471: Nsr Rev Tx timeout: seq_num(77584) (1 times) NSR Rev 77583, Total 1 errors seen Jan 24 11:10:17.759: Nsr Rev Tx timeout: seq_num(77585) (1 times) NSR Rev 77584, Total 1 errors seen Jan 24 11:10:18.047: Nsr Rev Tx timeout: seq_num(77586) (1 times) NSR Rev 77585, Total 1 errors seen Jan 24 11:10:18.335: Nsr Rev Tx timeout: seq_num(77587) (1 times) NSR Rev 77586, Total 1 errors seen Jan 24 11:10:18.623: Nsr Rev Tx timeout: seq_num(77588) (1 times) NSR Rev 77587, Total 1 errors seen Jan 24 11:10:18.911: Nsr Rev Tx timeout: seq_num(77589) (1 times) NSR Rev 77588, Total 1 errors seen Jan 24 11:10:19.199: Nsr Rev Tx timeout: seq_num(77590) (1 times) NSR Rev 77589, Total 1 errors seen Jan 24 11:10:19.487: Nsr Rev Tx timeout: seq_num(77591) (1 + from traces: ++ Standby RSP is receiving messages from Active RSP and sending the ACK. The problem is that Active RSP is receiving the ACK on some different JID. ACTIVE NODE: bgl-ads-4325:/auto/tftp-blr-users2/devanegi/showtech-tcp_nsr-2021-Jan-30.053723.UTC > bgl-ads-4325:/auto/tftp-blr-users2/devanegi/showtech-tcp_nsr-2021-Jan-30.053723.UTC > cat node0_RSP0_CPU0.tech | grep "seq_num 1809567" Jan 30 05:37:24.552 ospf/npl/pakevents 0/RSP0/CPU0 t13 [jid=1032]nsr_pl_client_send2: Sending DATA of len 236, seq_num 1809567, wheel_id 3 to NAL (chksum:2951) --> JID 1032 sending message to Standby Jan 30 05:37:24.553 ospf/npl/pakevents 0/RSP0/CPU0 t13 [jid=1029]npl_recv_cb: ACK, wheel_id 3 seq_num 1809567 app off 256 app len 24 --> JID 1029 receiving the ACK from Standby Jan 30 05:37:24.680 ospf/npl/pakretx 0/RSP0/CPU0 t13 [jid=1032]npl_walk_retransmit_list: Sending RETX packet of len 236, seq_num 1809567, wheel_id 3 (1 of 3), err 0 to NAL Jan 30 05:37:24.808 ospf/npl/pakretx 0/RSP0/CPU0 t13 [jid=1032]npl_walk_retransmit_list: Sending RETX packet of len 236, seq_num 1809567, wheel_id 3 (2 of 3), err 0 to NAL Jan 30 05:37:24.840 ospf/npl/pakerrors 0/RSP0/CPU0 t13 [jid=1032]npl_prod_invoke_timeout: Calling app timeout wheel_id 3, seq_num 1809567 app_offset 200, app_len 36, peer_hdl 136ba964 --> Timeout after 3 retries bgl-ads-4325:/auto/tftp-blr-users2/devanegi/showtech-tcp_nsr-2021-Jan-30.053723.UTC > bgl-ads-4325:/auto/tftp-blr-users2/devanegi/showtech-tcp_nsr-2021-Jan-30.053723.UTC > STANDBY NODE: bgl-ads-4325:/auto/tftp-blr-users2/devanegi/showtech-tcp_nsr-2021-Jan-30.053723.UTC > bgl-ads-4325:/auto/tftp-blr-users2/devanegi/showtech-tcp_nsr-2021-Jan-30.053723.UTC > cat node0_RSP1_CPU0.tech | grep "seq_num 1809567" Jan 30 05:37:24.552 ospf/npl/pakevents 0/RSP1/CPU0 t13 [jid=1032]npl_recv_cb: Pak : DATA, wheel_id 3 seq_num 1809567 app off 256 app len 36 --> JID 1032 receiving message on Standby Jan 30 05:37:24.552 ospf/npl/pakevents 0/RSP1/CPU0 t13 [jid=1032]nsr_pl_client_send_ack: Sending ACK of len 224, seq_num 1809567, wheel_id 3 to NAL --> Sending ACK for received packet by JID 1032 Jan 30 05:37:24.680 ospf/npl/pakevents 0/RSP1/CPU0 t13 [jid=1032]npl_recv_cb: Pak : DATA, wheel_id 3 seq_num 1809567 app off 256 app len 36 Jan 30 05:37:24.808 ospf/npl/pakevents 0/RSP1/CPU0 t13 [jid=1032]npl_recv_cb: Pak : DATA, wheel_id 3 seq_num 1809567 app off 256 app len 36
Certain processes names will result in the same internal hash result when passing the NSR packets from the OSPF component to the Transport component. This will result in packets from a certain OSPF process being delivered to the wrong OSPF process.
Modify the OSPF process name. Tested with the following alternative names and it worked: CGN-11 CGN-12 CGN-13 CGN-14 CGN-15 CGN-16 CGN-21 CGN-22 CGN-23 CGN-24 CGN-25 CGN-26 CGN-31 CGN-32 CGN-33 CGN-34 CGN-35 CGN-36 CGN-41 CGN-42 CGN-43 CGN-44 CGN-45 CGN-46 CGN-51 CGN-52 CGN-53 CGN-54 CGN-55 CGN-56 CGN-61 CGN-62 CGN-63 CGN-64 CGN-65 CGN-66 CGN-71 CGN-72 CGN-73 CGN-74 CGN-75 CGN-76 CGN-81 CGN-82 CGN-83 CGN-84 CGN-85 CGN-86 CGN-91 CGN-92 CGN-93 CGN-94 CGN-95 CGN-96
Certain processes names will result in the same internal hash result when passing the NSR packets from the OSPF component to the Transport component. This will result in packets from a certain OSPF process being delivered to the wrong OSPF process.