...
TACACS authentication for SSH sessions may fail while a large configuration is being applied by ASCII replay (e.g. after a disruptive NXOS upgrade or downgrade). AAA reports the TACACS server as dead, even though the TACACS server IP is reachable from the switch (e.g. via ping):

  2021 Nov 28 14:44:23.069 Nexus9K %$ VDC-1 %$ %ASCII-CFG-2-CONFIG_REPLAY_STATUS: Ascii Replay Started.

While the problem is present, a large number of tacacsd messages are seen in the MTS recv_q and pers_q queues in the output of "show system internal mts buffer summary":

  node  sapno  recv_q  pers_q  npers_q  log_q  app/sap_description
  sup   112    576     179     0        0      tacacsd/Tacacs Daemon

%AAA-1-AAA_SESSION_LIMIT_REJECT messages can be seen in the syslog:

  %$ VDC-1 %$ %AAA-1-AAA_SESSION_LIMIT_REJECT: aaa request rejected as maximum aaa sessions are in progress

Shortly after ASCII replay is done, the problem self-recovers and TACACS authentication starts working:

  2021 Nov 28 14:44:32.498 Nexus9K %$ VDC-1 %$ %ASCII-CFG-2-CONFIG_REPLAY_STATUS: Ascii Replay Done.

A CPU packet capture on the switch does not show any packets sent to the TACACS server during the issue.
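The queue depths in the "show system internal mts buffer summary" output quoted above can be checked with a small script when the output has been saved as text. The following is a minimal sketch, assuming the seven-column layout shown in that output; the function names and the threshold of 100 messages are hypothetical and chosen only for illustration, not values documented for NXOS.

```python
# Sample text using the column layout quoted in the symptom description.
SAMPLE = """\
node  sapno  recv_q  pers_q  npers_q  log_q  app/sap_description
sup   112    576     179     0        0      tacacsd/Tacacs Daemon
"""


def parse_mts_summary(text):
    """Parse data rows of saved MTS buffer summary output into dicts.

    Assumes each data row has six whitespace-separated fields followed
    by a free-text description, as in the output quoted above.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 6)
        # Skip the header and any line whose counters are not integers.
        if len(parts) < 7 or not all(p.isdigit() for p in parts[1:6]):
            continue
        sapno, recv_q, pers_q, npers_q, log_q = (int(p) for p in parts[1:6])
        rows.append({
            "node": parts[0],
            "sapno": sapno,
            "recv_q": recv_q,
            "pers_q": pers_q,
            "npers_q": npers_q,
            "log_q": log_q,
            "app": parts[6].strip(),
        })
    return rows


def tacacsd_backlog(rows, threshold=100):
    """Return tacacsd rows whose recv_q + pers_q reaches the threshold.

    The threshold is an illustrative assumption, not an NXOS limit.
    """
    return [
        r for r in rows
        if r["app"].startswith("tacacsd")
        and r["recv_q"] + r["pers_q"] >= threshold
    ]


if __name__ == "__main__":
    for row in tacacsd_backlog(parse_mts_summary(SAMPLE)):
        print(row["app"], "queued:", row["recv_q"] + row["pers_q"])
```

Summing recv_q and pers_q follows the symptom description, which names those two queues as the ones that fill with tacacsd messages.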
The problem is seen when the following conditions are met:
1. A large configuration is present on the switch.
2. AAA remote accounting to a TACACS server is configured.
3. ASCII replay is the trigger, e.g. due to a disruptive NXOS upgrade or downgrade.
1. Disable AAA remote accounting to the TACACS server before the reboot/upgrade, and re-enable it after ASCII replay is done. This prevents AAA messages from queueing up in MTS during ASCII replay, so TACACS authentication requests do not time out.
2. Use fallback to local switch authentication while the issue is present.
TACACS authentication issues during ASCII replay based bootup are caused by the large number of AAA accounting requests generated to the TACACS server at this time, which are buffered in the MTS queues. Any accounting or authentication request issued during this time has to wait for its turn in the queue, and a longer wait may cause remote authentication to time out. When that happens, local authentication is tried as a fallback method. This behavior is expected due to the FIFO design of the MTS queues. During the short period of MTS queue churn, local authentication can be used instead. After ASCII replay has finished and the AAA messages in the MTS buffer are fully processed, these TACACS authentication failures no longer occur because there is no extended wait time in the queue.

TACACS dead timer behavior:
Note that with tacacs deadtimer configured, the period of TACACS authentication failures is extended by the duration of the timer: the initial failures caused by MTS queue churn trigger the dead timer, and AAA considers the server dead until the timer expires. To recover before the timer expires, the command "test aaa server tacacs+ vrf " can be used to send a test packet. If the server sends back a successful response to the test request, it is marked alive again immediately.
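The FIFO queueing effect described above can be illustrated with a toy model. This is a minimal sketch, not a model of the actual MTS implementation: it assumes a single FIFO queue drained at a fixed rate, and the per-message service time (10 ms) and authentication timeout (5000 ms) are hypothetical values picked for illustration. The backlog of 755 messages is the sum of the recv_q and pers_q values quoted in the symptom output.

```python
from collections import deque


def fifo_wait_times(queue, service_ms):
    """Return (kind, wait_ms) for each request in a single FIFO queue.

    All requests are assumed to be queued at time 0 and served one at
    a time, each taking service_ms milliseconds.
    """
    waits = []
    clock_ms = 0
    pending = deque(queue)
    while pending:
        kind = pending.popleft()
        waits.append((kind, clock_ms))  # time spent waiting before service
        clock_ms += service_ms
    return waits


def auth_outcome(backlog, service_ms=10, timeout_ms=5000):
    """Outcome for one authentication request queued behind a backlog.

    backlog is the number of accounting requests already in the queue.
    service_ms and timeout_ms are illustrative assumptions.
    """
    queue = ["accounting"] * backlog + ["authentication"]
    _, wait_ms = fifo_wait_times(queue, service_ms)[-1]
    return "remote" if wait_ms <= timeout_ms else "local-fallback"


if __name__ == "__main__":
    for backlog in (0, 100, 755):
        print(backlog, auth_outcome(backlog))
```

Under these assumed numbers, an authentication request behind 755 queued messages waits 7550 ms and exceeds the timeout, while the same request with an empty or short queue is served in time. This matches the description that failures stop once the backlog from ASCII replay has been processed.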