...
+ ALL bundles down:

    RP/0/RP0/CPU0:ASR9K1#show bundle bundle-ether 5
    Tue May 28 05:14:27.678 CDT
    Bundle-Ether5
      Status:                            Down
      Local links:                       0 / 0 / 3
      Local bandwidth:                   0 (0) kbps
      MAC address (source):              74a0.2f8a.ed07 (Chassis pool)
      Inter-chassis link:                No
      Minimum active links / bandwidth:  1 / 180000000 kbps
      Maximum active links:              64
      Wait while timer:                  2000 ms
      Load balancing:
        Link order signaling:            Not configured
        Hash type:                       Default
        Locality threshold:              None
      LACP:                              Not operational
        Flap suppression timer:          Off
        Cisco extensions:                Disabled
        Non-revertive:                   Disabled
      mLACP:                             Not configured
      IPv4 BFD:                          Operational
        State:                           Down
        Mode:                            Not configured
        Fast detect:                     Enabled
        Start timer:                     60 s
        Neighbor-unconfigured timer:     60 s
        Preferred min interval:          100 ms
        Preferred multiple:              3
        Destination address:             10.240.5.173
      IPv6 BFD:                          Not configured

      Port                  Device           State        Port ID         B/W, kbps
      --------------------  ---------------  -----------  --------------  ----------
      Hu0/0/0/3             Local            Configured   0x8000, 0x0000  100000000
          Bundle is in the process of being replicated to this location   <<<<< HERE
      Hu0/1/0/9             Local            Configured   0x8000, 0x0000  100000000
          Bundle is in the process of being replicated to this location   <<<<< HERE
      Hu0/3/0/2             Local            Configured   0x8000, 0x0000  100000000
          Bundle is in the process of being replicated to this location   <<<<< HERE

+ Failing QoS configurations on physical interfaces (not on bundles):

    RP/0/RP0/CPU0:May 28 03:03:49 : cfgmgr-rp[136]: %MGBL-CONFIG-4-FAILED : Some 'sh_p_policymgr_rp' Configuration could not be restored by Configuration Manager. To view failed config(s) use the command - 'show configuration failed startup'

    RP/0/RP0/CPU0:ASR9K1#show configuration failed startup
    Tue May 28 05:32:19.436 CDT
    !!03:03:32 CDT Tue May 28 2019
    !! SEMANTIC ERRORS: This configuration was rejected by
    !! the system due to semantic errors. The individual
    !! errors with each failed configuration command can be
    !! found below.
    interface TenGigE0/7/0/3
     service-policy output 10GE_policy
    !!% 'infra-app-obj' detected the 'warning' condition 'app-obj: Producer not found'
    !
    interface TenGigE0/7/0/35
     service-policy output 10GE_policy
    !!% 'infra-app-obj' detected the 'warning' condition 'app-obj: Producer not found'
    !
    interface HundredGigE0/5/0/5
     service-policy output 100GE_policy
    !!% 'infra-app-obj' detected the 'warning' condition 'app-obj: Producer not found'
    !

+ Removing the QoS policy from the bundles allows them to come up. The policy that had to be removed:

    interface Bundle-Ether5
     service-policy output BDLE_policy
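Whether a given policy actually made it onto an interface can be checked with standard IOS-XR/ASR9K show commands (a suggested check, not part of the original capture; the interface name below is taken from the failed configuration above):

    show policy-map interface TenGigE0/7/0/3 output
    show qos interface TenGigE0/7/0/3 output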
Following an upgrade from 5.2.5 to 6.4.2, the customer reloaded the chassis to complete the FPD upgrade.
+ Workaround: restart the policymgr_rp process:

    process restart policymgr_rp location all
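After the restart, the service policies that were rejected at startup still need to be reapplied. A possible recovery sequence with standard IOS-XR commands (a sketch, not from the original case notes; 'load configuration failed startup' pulls the rejected lines back into the target configuration for a fresh commit):

    show processes policymgr_rp location all   (confirm the process is running again)
    configure
     load configuration failed startup
     commit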
+ Syslogs/traces pointing to NRS issues, apparently caused by transient transport problems between the active RSP0 and standby RSP1, which could be the trigger for the policymgr issues:

    RP/0/RP0/CPU0:May 28 03:02:19.004 CDT: BM-ADJ[132]: gang_create: nrs_register failed: 'NRS' detected the 'try again' condition 'NRS server did not respond within the timeout period'
    RP/0/RP0/CPU0:May 28 03:02:19.004 CDT: BM-ADJ[132]: prot_client_event_register failed 'NRS' detected the 'try again' condition 'NRS server did not respond within the timeout period'

    May 28 03:01:20.733 nrs/lib/error 0/RP0/CPU0 t5651 JID:345 nrs_async_send_internal: failed retcode = 0x82ce0a00 ('NRS' detected the 'try again' condition 'Problem connection to the server')
    May 28 03:01:20.733 nrs/lib/error 0/RP0/CPU0 t5651 JID:345 nrs_async_send: nrs_async_send_internal failed. iov_number = 3, default_size = 20, planetype = 1, rc = 0x82ce0a00 ('NRS' detected the 'try again' condition 'Problem connection to the server')

+ The NRS issues above occurred while policymgr was coming up and caused its communication with NRS to fail. This DDTS tracks why policymgr did not retry its registration with NRS.

+ The failure happened in the cleanup path. Ideally, policymgr retries endlessly until it succeeds, but here the cleanup from the earlier failure never completed because the policymgr process hung in __pthread_cond_destroy: the cleanup path in this scenario double-frees a mutex.

+ Because policymgr_rp was hung, it never finished initializing. As a result, the QoS policies could not be downloaded to the line cards, which produced the QoS configuration failures at startup and again on every subsequent retry.
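If the same condition is suspected on a live system, the hang can be confirmed from the CLI with standard IOS-XR process-debugging commands (a suggested sequence; the PID placeholder below must be replaced with the actual policymgr_rp PID from the 'show processes' output):

    show processes blocked location 0/RP0/CPU0
    show processes policymgr_rp location 0/RP0/CPU0
    follow process <policymgr_rp-pid> location 0/RP0/CPU0

A thread parked in __pthread_cond_destroy in the resulting stack trace would corroborate the double-free in the cleanup path.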