Symptom
A Nexus 5k switch may experience SNMP timeouts or a crash of the SNMP process (snmpd) when polling the CISCO-RMON-CONFIG-MIB ("1.3.6.1.4.1.9.9.103") or the CISCO-PROCESS-MIB ("1.3.6.1.4.1.9.9.109"). The cause is an MTS queue leak that occurs when either MIB is polled.
Example of a timeout:
user@server:~$ snmpwalk -c community -v 2c xx.xx.xx.xx iso.3.6.1.4.1.9.9.103.1.4
Timeout: No Response from xx.xx.xx.xx
Example of a crash (note: signal 6 is SIGABRT; here it means the process was killed for being unresponsive):
%SYSMGR-2-SERVICE_CRASHED: Service "snmpd" (PID XXXXX) hasn't caught signal 6 (core will be saved).
Conditions
This bug is only encountered on Nexus 5600 switch models where "show module" indicates that the supervisor engine is in slot "0". For example:
SWITCH# show module
Mod Ports Module-Type                         Model                  Status
--- ----- ----------------------------------- ---------------------- -----------
0   0     Nexus 5624Q Supervisor              N5K-C5624Q-SUP         active *
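The slot check described above can be scripted against captured "show module" output. The following is a minimal sketch, not a Cisco tool: the function name supervisor_slots is hypothetical, and it assumes the supervisor row starts with the slot number and contains the word "Supervisor" in its Module-Type column, as in the sample output.

```python
def supervisor_slots(show_module_output):
    """Return the slot numbers of rows whose Module-Type mentions a supervisor.

    Assumption: each module row starts with the numeric slot and the
    Module-Type column contains the token "Supervisor", as in the sample.
    """
    slots = []
    for line in show_module_output.splitlines():
        fields = line.split()
        # Module rows begin with a numeric slot; headers and rulers do not.
        if len(fields) >= 2 and fields[0].isdigit() and "Supervisor" in fields:
            slots.append(int(fields[0]))
    return slots


SAMPLE = """\
Mod Ports Module-Type Model Status
--- ----- ----------------------------------- ---------------------- -----------
0 0 Nexus 5624Q Supervisor N5K-C5624Q-SUP active *
"""

if __name__ == "__main__":
    slots = supervisor_slots(SAMPLE)
    print("supervisor slots:", slots)
    print("matches bug condition:", 0 in slots)
```

A switch matches the condition in this section when 0 appears in the returned list.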
The issue can occur during an SNMP walk of the CISCO-PROCESS-MIB for the following OIDs:
.1.3.6.1.4.1.9.9.109.1.1.1.1.2
.1.3.6.1.4.1.9.9.109.1.1.1.1.6
.1.3.6.1.4.1.9.9.109.1.1.1.1.7
.1.3.6.1.4.1.9.9.109.1.1.1.1.8
.1.3.6.1.4.1.9.9.109.1.1.1.1.9
.1.3.6.1.4.1.9.9.109.1.1.1.1.12
.1.3.6.1.4.1.9.9.109.1.1.1.1.13
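A monitoring script can use the OID list above to decide whether a given poll target falls into an affected subtree. The sketch below is illustrative only: the names parse_oid and is_affected_oid are hypothetical, and treating the entire CISCO-RMON-CONFIG-MIB subtree (1.3.6.1.4.1.9.9.103) as affected is an assumption based on the Symptom section. Matching is done per OID component so that, for example, column 2 does not match column 20.

```python
# Subtrees named in this bug report.
AFFECTED_SUBTREES = [
    "1.3.6.1.4.1.9.9.103",  # CISCO-RMON-CONFIG-MIB (whole subtree: an assumption)
    "1.3.6.1.4.1.9.9.109.1.1.1.1.2",
    "1.3.6.1.4.1.9.9.109.1.1.1.1.6",
    "1.3.6.1.4.1.9.9.109.1.1.1.1.7",
    "1.3.6.1.4.1.9.9.109.1.1.1.1.8",
    "1.3.6.1.4.1.9.9.109.1.1.1.1.9",
    "1.3.6.1.4.1.9.9.109.1.1.1.1.12",
    "1.3.6.1.4.1.9.9.109.1.1.1.1.13",
]


def parse_oid(oid):
    """Convert a dotted OID string into a tuple of integers.

    Accepts a leading dot and the 'iso' alias for the first arc.
    """
    parts = oid.strip().lstrip(".").split(".")
    if parts[0] == "iso":
        parts[0] = "1"
    return tuple(int(p) for p in parts)


_AFFECTED = [parse_oid(o) for o in AFFECTED_SUBTREES]


def is_affected_oid(oid):
    """Return True if oid equals or lies beneath any affected subtree."""
    parsed = parse_oid(oid)
    return any(parsed[:len(prefix)] == prefix for prefix in _AFFECTED)


if __name__ == "__main__":
    for candidate in ("iso.3.6.1.4.1.9.9.103.1.4",
                      ".1.3.6.1.4.1.9.9.109.1.1.1.1.2.1",
                      "1.3.6.1.4.1.9.9.109.1.1.1.1.3"):
        print(candidate, is_affected_oid(candidate))
```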
Further Problem Description
MTS (Message and Transaction Service) is the inter-process messaging system that allows SNMP and other NX-OS processes to communicate. In the context of this bug, an MTS queue leak occurs when polling the aforementioned MIBs (CISCO-RMON-CONFIG-MIB or CISCO-PROCESS-MIB).
The following logs may also be seen as a result of MTS queue exhaustion:
%KERN-2-SYSTEM_MSG: [5951622.021949] mts_is_q_space_available_haslock_old(): NO SPACE - node=4, sap=28, ...
%KERN-2-SYSTEM_MSG: [5951622.021968] mts_print_msg_opcode_in_queue: opcode 3176 - XXX messages - kernel
%KERN-2-SYSTEM_MSG: [5951623.643196] [sap 28][pid XXXXX][comm:snmpd] sap recovering failed and so Killed - kernel
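To spot this signature in collected syslog, the messages above can be matched with two regular expressions. This is a rough sketch: the patterns are derived only from the sample lines in this report, the function name summarize_mts_events is hypothetical, and other NX-OS releases may word these messages differently.

```python
import re

# Patterns derived from the sample log lines in this report (an assumption
# that other releases use the same wording).
NO_SPACE = re.compile(
    r"mts_is_q_space_available\w*\(\): NO SPACE - node=(\d+), sap=(\d+)"
)
KILLED = re.compile(
    r"\[sap (\d+)\]\[pid (\w+)\]\[comm:(\w+)\] sap recovering failed"
)


def summarize_mts_events(lines):
    """Return (event, sap, process) tuples for MTS exhaustion messages.

    event is "no_space" (queue full) or "killed" (process killed after
    SAP recovery failed); process is None for "no_space" events.
    """
    events = []
    for line in lines:
        match = NO_SPACE.search(line)
        if match:
            events.append(("no_space", int(match.group(2)), None))
            continue
        match = KILLED.search(line)
        if match:
            events.append(("killed", int(match.group(1)), match.group(3)))
    return events


SAMPLE_LOG = [
    "%KERN-2-SYSTEM_MSG: [5951622.021949] mts_is_q_space_available_haslock_old(): NO SPACE - node=4, sap=28, ...",
    "%KERN-2-SYSTEM_MSG: [5951622.021968] mts_print_msg_opcode_in_queue: opcode 3176 - XXX messages - kernel",
    "%KERN-2-SYSTEM_MSG: [5951623.643196] [sap 28][pid XXXXX][comm:snmpd] sap recovering failed and so Killed - kernel",
]

if __name__ == "__main__":
    for event in summarize_mts_events(SAMPLE_LOG):
        print(event)
```

In the sample, both matched events refer to SAP 28, and the "killed" event names snmpd, which ties the queue exhaustion to the SNMP process.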