Symptom
The switch may become completely unresponsive and fully isolated:
* the only chassis LED lit is the STATUS light
* the console and management interface are unresponsive
* the control plane stops responding
* all interfaces are down
On more recent versions of code, a hap reset is detected, which causes the switch to reload itself and become operational again. On older versions of code, where no hap reset occurs, the only way to recover is to manually power cycle the switch.
* Syslogs may show the following process crash: ascii-cfg
* The switch may experience hap resets for tahusd, l2fm, vdc_mgr, or pltfm_config
show logging nvram | include core
2021 Dec 11 05:40:38.782532 LEAF47 %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "ascii-cfg" (PID 20023) hasn't caught signal 11 (core will be saved).
2021 Dec 11 05:54:21.252051 LEAF47 %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "ascii-cfg" (PID 26524) hasn't caught signal 11 (core will be saved).
2021 Dec 11 06:04:31.746874 LEAF47 %$ VDC-1 %$ %SYSMGR-SLOT1-2-SERVICE_CRASHED: Service "tahusd" (PID 23621) hasn't caught signal 11 (core will be saved).
2021 Dec 11 06:05:18.666840 LEAF47 %$ VDC-1 %$ %SYSMGR-SLOT1-2-LAST_CORE_BASIC_TRACE: core_client_main: PID 29910 with message filename = 0x102_tahusd_log.23621.tar.gz
* Syslogs may also report a high count of unexpected PFC frames:
%ACLQOS-SLOT1-2-ACLQOS_UNEXPECTED_PFC_FRAMES: Ethernet1/31 received 566935683072 unexpected PFC frames for COS 0
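
To check saved logs for these signatures, the syslog lines above can be matched with a short script. The following is a minimal sketch in Python, assuming the message formats are exactly as shown in the samples above; the function name `scan` and the `WATCHED` set are illustrative and not part of any Cisco tool.

```python
import re

# Pattern for SERVICE_CRASHED messages, with or without a SLOTn field,
# based only on the sample lines in this note.
CRASH_RE = re.compile(
    r'%SYSMGR-(?:SLOT\d+-)?2-SERVICE_CRASHED: '
    r'Service "(?P<service>[^"]+)" \(PID (?P<pid>\d+)\) '
    r"hasn't caught signal (?P<signal>\d+)"
)

# Pattern for the unexpected-PFC-frames message shown in this note.
PFC_RE = re.compile(
    r'%ACLQOS-SLOT\d+-2-ACLQOS_UNEXPECTED_PFC_FRAMES: '
    r'(?P<intf>\S+) received (?P<count>\d+) unexpected PFC frames '
    r'for COS (?P<cos>\d+)'
)

# Services named in the symptom description.
WATCHED = {"ascii-cfg", "tahusd", "l2fm", "vdc_mgr", "pltfm_config"}


def scan(lines):
    """Return (crashes, pfc_reports) found in an iterable of syslog lines.

    crashes:     list of (service, pid, signal)
    pfc_reports: list of (interface, cos, frame_count)
    """
    crashes, pfc_reports = [], []
    for line in lines:
        m = CRASH_RE.search(line)
        if m and m.group("service") in WATCHED:
            crashes.append(
                (m.group("service"), int(m.group("pid")), int(m.group("signal")))
            )
            continue
        m = PFC_RE.search(line)
        if m:
            pfc_reports.append(
                (m.group("intf"), int(m.group("cos")), int(m.group("count")))
            )
    return crashes, pfc_reports
```

A match on either list is only a hint that the switch may be affected; it does not confirm this defect on its own.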
Conditions
Using NXAPI-DME-REST to create or delete checkpoints on a Nexus 9000 with an EVPN VXLAN configuration.
Workaround
Proactive: use NXAPI-CLI to create, delete, and list checkpoints.
Corrective action:
* If the switch has become completely unresponsive and did not reload by itself, a manual power cycle will restore operation.
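
As an illustration of the proactive workaround, the sketch below builds NX-API CLI request bodies for checkpoint operations instead of going through the DME REST model. It assumes the JSON-RPC form of NX-API CLI (POST to the switch's `/ins` endpoint with `Content-Type: application/json-rpc`) and the CLI commands `checkpoint <name>`, `no checkpoint <name>`, and `show checkpoint summary`; verify all of these against the NX-API documentation for the running release. The helper name `checkpoint_payload` is hypothetical, and the script only builds the payload without sending it.

```python
import json


def checkpoint_payload(action, name=None):
    """Build an NX-API CLI JSON-RPC body for a checkpoint operation.

    action: "create", "delete", or "list"
    name:   checkpoint name (required for create and delete)
    """
    if action in ("create", "delete") and not name:
        raise ValueError("a checkpoint name is required for " + action)

    if action == "create":
        cmd = "checkpoint " + name          # assumed CLI syntax
    elif action == "delete":
        cmd = "no checkpoint " + name       # assumed CLI syntax
    elif action == "list":
        cmd = "show checkpoint summary"     # assumed CLI syntax
    else:
        raise ValueError("unknown action: " + action)

    # JSON-RPC request list in the form NX-API CLI is assumed to accept.
    return [
        {
            "jsonrpc": "2.0",
            "method": "cli",
            "params": {"cmd": cmd, "version": 1},
            "id": 1,
        }
    ]


if __name__ == "__main__":
    # Print the body that would be POSTed to https://<switch>/ins
    print(json.dumps(checkpoint_payload("create", "pre_change"), indent=2))
```

The checkpoint name `pre_change` is only an example. Sending the request requires an authenticated HTTPS client, which is omitted here.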