Symptom
The Nexus 1000v VSM stops transmitting heartbeat requests for approximately 15 seconds, resulting in VEM removal due to heartbeat loss.
Example 'show logging' output from the VSM:
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 3 (heartbeats lost)
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 5 (heartbeats lost)
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 6 (heartbeats lost)
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 8 (heartbeats lost)
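Several VEMs removed in the same second is what separates this symptom from a single host losing connectivity. As a rough aid for scanning a saved 'show logging' capture, the sketch below groups removed VEM numbers by timestamp. It assumes the log lines have exactly the layout shown above; the function name and regular expression are illustrative and not part of any Cisco tooling.

```python
import re
from collections import defaultdict

# Assumed layout: "<year> <mon> <day> <hh:mm:ss> <host> %VEM_MGR-2-...: Removing VEM <n> (heartbeats lost)"
REMOVE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"%VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM (?P<vem>\d+) \(heartbeats lost\)"
)

def vems_removed_by_time(log_text):
    """Return {timestamp: [VEM numbers removed at that timestamp]}."""
    removed = defaultdict(list)
    for line in log_text.splitlines():
        match = REMOVE_RE.match(line.strip())
        if match:
            removed[match.group("ts")].append(int(match.group("vem")))
    return dict(removed)

# The four log lines from the example above.
SAMPLE = """\
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 3 (heartbeats lost)
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 5 (heartbeats lost)
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 6 (heartbeats lost)
2014 Jun 3 21:18:08 n1k-vsm-syd %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 8 (heartbeats lost)
"""

if __name__ == "__main__":
    for ts, vems in vems_removed_by_time(SAMPLE).items():
        print(ts, "removed VEMs:", vems)
```

A timestamp that maps to many VEM numbers points at the VSM side rather than at an individual host.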
On newer releases (SV2(2.1a)), the issue can be identified by checking for 'aipc drop' and 'socket fail' in the output of:
show stun statistics aipc
Conditions
The issue has been observed on Nexus 1000v releases SV1(5.2b) and SV2(2.1a).
Workaround
The issue is fixed in Nexus 1000v release SV3(1.1).
The issue has been observed to occur less frequently when the following files are not filling up on the VSM. Engage TAC to check that they are not filling up.
/var/sysmgr/policy-agent/logs/vsm_pa_coupler.log
/isan/apache/logs/access_log
/isan/apache/logs/error_log
/isan/apache/logs/ssl_request_log
/var/sysmgr/tmp_logs/fwm.out
/var/tmp/httpd.sh.log
Further Problem Description
The VSM heartbeat transmit failures are caused by:
1. Buffer overrun at the STUN layer, which is used for encrypting VSM/VEM communication
2. Incorrect handling of transmit failures: packets are silently dropped instead of being put back into the queue for retransmission
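
The second cause can be illustrated with a generic transmit queue. The sketch below is not the VSM implementation. It only contrasts the two behaviours described above: discarding a packet when the send fails, versus putting it back on the queue for another attempt. All names (drain_drop_on_failure, drain_requeue_on_failure, make_flaky_send, max_attempts) are hypothetical.

```python
from collections import deque

def drain_drop_on_failure(queue, send):
    """Faulty pattern: a packet whose send fails is silently discarded."""
    sent = []
    while queue:
        pkt = queue.popleft()
        if send(pkt):
            sent.append(pkt)
        # On failure nothing happens, so pkt is lost.
    return sent

def drain_requeue_on_failure(queue, send, max_attempts=3):
    """Corrected pattern: a failed packet goes back to the head of the queue."""
    sent = []
    attempts = {}
    while queue:
        pkt = queue.popleft()
        if send(pkt):
            sent.append(pkt)
            continue
        attempts[pkt] = attempts.get(pkt, 0) + 1
        if attempts[pkt] < max_attempts:
            queue.appendleft(pkt)  # retry first, preserving packet order
        # After max_attempts failures the packet is given up on.
    return sent

def make_flaky_send(fail_first_n):
    """Build a send function whose first fail_first_n calls fail."""
    calls = {"n": 0}
    def send(pkt):
        calls["n"] += 1
        return calls["n"] > fail_first_n
    return send

if __name__ == "__main__":
    heartbeats = ["hb1", "hb2", "hb3"]
    print(drain_drop_on_failure(deque(heartbeats), make_flaky_send(2)))
    print(drain_requeue_on_failure(deque(heartbeats), make_flaky_send(2)))
```

With two consecutive send failures, the first function delivers only the last heartbeat, while the second delivers all three in order.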