...
An N9K switch may face an unexpected reload on one or more supervisor modules due to a kernel panic resulting from an Out of Memory (OOM) state:

`show system reset-reason`

```
----- reset reason for module 28 (from Supervisor in slot 28) ---
1) At 125065 usecs after Mon Jan 01 01:02:03 2024
    Reason: Kernel Panic   <<<-------------------
    Service:
    Version: 10.2(2)
```

`show logging onboard internal reset-reason`

```
----------------------------
 Module: 28
----------------------------
    Reset Reason for this card:
    Image Version : 10.2(2)
    Reset Reason (LCM): Unknown (0) at time Mon Jan 01 01:08:52 2024
    Reset Reason (SW): Kernel Panic (19) at time Mon Jan 01 01:02:03 2024   <<<-------------------
    Reset Reason (HW): Watchdog Timeout (32) at time Mon Jan 01 01:08:52 2024
```
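For quick triage, a minimal on-box sketch like the one below (assuming the NX-OS Python interpreter and its `cli` module are available) simply wraps the same command shown above and flags any kernel-panic reset reason; it is an illustration, not part of the defect or its fix.

```python
# Minimal on-box check (assumes the NX-OS Python interpreter and the "cli"
# module are available): flag any line of "show system reset-reason" that
# reports a kernel panic.
from cli import cli  # NX-OS on-box CLI wrapper

output = cli("show system reset-reason")
for line in output.splitlines():
    if "Kernel Panic" in line:
        print("Possible OOM-triggered reload detected: " + line.strip())
```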
The kernel panic is the result of an Out of Memory (OOM) state caused by excessive memory utilization in the svc_ifc_eventmgr process. A high volume of BGP updates/churn can trigger this memory leak.
There are no known workarounds available for this issue. If users notice a significant amount of memory being held by the svc_ifc_eventmgr process, they can try a reload to temporarily free up that memory.

Users can track any potential memory growth in the svc_ifc_eventmgr process by checking the size of HEAP memory in the following commands:

`show system internal kernel memory service svc_ifc_eventmgr`

OR

`show system internal kernel memory uuid 1319`   ! 1319 is the UUID for the svc_ifc_eventmgr process
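As a rough illustration of tracking that growth over time, the sketch below polls the first command from a management host and records only the HEAP-related lines so successive runs can be compared. The host address, credentials, and the exact output format are assumptions, and netmiko is simply one common SSH library; capture the output however is convenient in your environment.

```python
# Hypothetical off-box polling sketch (not part of the defect or its fix):
# capture the HEAP-related lines of the command above so successive runs
# can be compared for growth over time.
import time
from netmiko import ConnectHandler  # third-party SSH library, assumed installed

DEVICE = {
    "device_type": "cisco_nxos",
    "host": "192.0.2.10",   # placeholder management IP
    "username": "admin",    # placeholder credentials
    "password": "example",
}
CMD = "show system internal kernel memory service svc_ifc_eventmgr"

def heap_lines(output):
    """Keep only the lines that mention HEAP (output format assumed)."""
    return [line for line in output.splitlines() if "HEAP" in line.upper()]

if __name__ == "__main__":
    with ConnectHandler(**DEVICE) as conn:
        output = conn.send_command(CMD)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for line in heap_lines(output):
        # Run periodically (for example from cron) and compare the values.
        print(stamp + "  " + line.strip())
```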
This issue was marked as a duplicate of defect CSCwb32663. Please refer to the Release Notes for CSCwb32663 for information regarding the software fix.

There are no process logs, core files, or exception logs generated by the event. However, there is data from the kernel panic in the stack-trace outputs. In particular, users can look for call traces pointing to "out_of_memory" and "page_fault" (also memory-related):

`show logging onboard stack-trace`

```
[41332678.110981] Call Trace:
[41332678.110990]  dump_stack+0x6d/0x8b
[41332678.110994]  dump_header+0x6a/0x274
[41332678.110999]  out_of_memory+0x253/0x2e0   <<<-------------------
[41332678.111002]  __alloc_pages_slowpath+0xa0f/0xe30
[41332678.111007]  __alloc_pages_nodemask+0x249/0x280
[41332678.111010]  filemap_fault+0x302/0x6c0
[41332678.111013]  ? __check_object_size+0x45/0x200
[41332678.111017]  ? filemap_map_pages+0x126/0x300
[41332678.111021]  __do_fault+0x3e/0x100
[41332678.111023]  __handle_mm_fault+0x5c1/0xc80
[41332678.111027]  handle_mm_fault+0x100/0x230
[41332678.111031]  __do_page_fault+0x291/0x4b0
[41332678.111035]  do_page_fault+0x2e/0xf0
[41332678.111038]  ? page_fault+0x5/0x20
[41332678.111040]  page_fault+0x1b/0x20   <<<-------------------

[41332678.203105] Call Trace:
[41332678.203114]  dump_stack+0x6d/0x8b
[41332678.203118]  ? prandom_reseed+0x170/0x170
[41332678.203122]  ? panic+0x1/0x247
[41332678.203129]  nxos_panic+0xf2/0x530 [klm_obfl]   <<<-------------------
[41332678.203131]  ? panic+0x1/0x247
[41332678.203135]  kprobe_ftrace_handler+0x8f/0xf0
[41332678.203138]  ? set_ti_thread_flag+0xe/0xe
[41332678.203141]  ? out_of_memory+0x278/0x2e0
[41332678.203146]  ftrace_ops_assist_func+0x97/0x140
[41332678.203154]  0xffffffffc01550da
[41332678.203157] RIP: 0010:panic+0x1/0x247
```

Further down in the stack-trace output, the task-state dump shows that svc_ifc_eventmg (tied to the event manager) was holding a significant amount of memory:

```
[41332678.111158] Tasks state (memory values in pages):
[41332678.111158] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[41332678.111232] [   1778]     0  1778  6082319  5804483       47399852        0             0 svc_ifc_eventmg   <<<-------------------
[41332678.111648] [   6492]     0  6492  1513017    52200        1046430        0             0 bgp
[41332678.111632] [   6482]     0  6482  1438398    40815         907160        0             0 mrib
[41332678.111640] [   6487]     0  6487  1421992    40786         891838        0             0 m6rib
[41332678.111646] [   6490]     0  6490  1308639    19827         766842        0             0 hmm
[41332678.115546] nxos_panic: Kernel panic - not syncing: fatal exception
```

Users may see other information in the syslogs (`show logging nvram` and `show logging logfile`) that points to a memory-exhausted state as well:

```
2024 Jan 01 01:00:58 n9kSW %DAEMON-3-SYSTEM_MSG: error: do_exec_no_pty: fork: Cannot allocate memory - dcos_sshd[XXXX]
2024 Jan 01 01:00:59 n9kSW %DAEMON-3-SYSTEM_MSG: error: do_exec_no_pty: fork: Cannot allocate memory - dcos_sshd[XXXX]
2024 Jan 01 01:01:00 n9kSW %DAEMON-2-SYSTEM_MSG: fatal: fork of unprivileged child failed - dcos_sshd[XXXX]
2024 Jan 01 01:01:00 n9kSW %LOCAL7-3-SYSTEM_MSG: ssh: fork failed: Cannot allocate memory (errno = 12) - dcos-xinetd[XXXXX]
2024 Jan 01 01:01:12 n9kSW %LOCAL7-3-SYSTEM_MSG: ssh: fork failed: Cannot allocate memory (errno = 12) - dcos-xinetd[XXXXX] (message repeated 1 time)
2024 Jan 01 01:01:12 n9kSW %DAEMON-3-SYSTEM_MSG: error: do_exec_no_pty: fork: Cannot allocate memory - dcos_sshd[XXXX]
2024 Jan 01 01:01:12 n9kSW %DAEMON-2-SYSTEM_MSG: fatal: fork of unprivileged child failed - dcos_sshd[XXXX]
2024 Jan 01 01:02:03 n9kSW %SYSMGR-2-CORE_SAVE_FAILED: master_core_client_try_spawn: PID XXXXX with message Unable to start core client. Cannot allocate memory.
```
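If the stack-trace and syslog outputs have been saved to text files, a minimal sketch like the one below can scan them for the memory-exhaustion signatures shown above. The file names are assumptions; only the search strings come from the captures in this section.

```python
# Minimal sketch: scan saved copies of "show logging onboard stack-trace" and
# syslog output for the memory-exhaustion signatures shown above.
import re
import sys

# Signatures taken from the call traces and syslogs in this section.
SIGNATURES = [
    r"out_of_memory",               # kernel OOM path in the call trace
    r"page_fault",                  # memory-related fault handling
    r"Kernel panic - not syncing",  # the panic itself
    r"Cannot allocate memory",      # fork failures once memory is exhausted
]
PATTERN = re.compile("|".join(SIGNATURES))

def scan(path):
    """Print any line in the capture that matches a known OOM signature."""
    with open(path, errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if PATTERN.search(line):
                print("{}:{}: {}".format(path, lineno, line.rstrip()))

if __name__ == "__main__":
    for capture in sys.argv[1:]:   # e.g. stack-trace.txt syslog.txt
        scan(capture)
```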
Users may see high memory utilization in svc_ifc_eventmg elsewhere in the `show tech detail` outputs as well:

`show pie envmon mem-usage detail count 0`

```
2024-01-01 00:00:02
Event Id: xxxxxxxx
Event Class: MEM usage insights
Source Id: 0x1c01
Mod: 28
Memory_Health : Severe Alert   <<<-------------------

MODULE 28:
****** Memory usage ******
Memory Total : 32822690 KB
Memory Used  : 32591742 KB
Memory Free  : 230948 KB
VmallocTotal : 34359738367 KB
VmallocUsed  : 0 KB
Memory_Health : Severe Alert

******* Top users of Memory *********
 PID     VIRT(KB)    RES(KB)   %CPU   %MEM   COMMAND
1768     24265434   23205056   1.30   70.60  svc_ifc_eventmg   <<<-------------------
2670      3666046     368296   0.00    1.10  urib
2353       832582     362756   0.00    1.10  vpx1
2148      1123450     302224   0.00    0.90  clis
```
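As a final illustration, the rough sketch below parses the "Top users of Memory" table shown above and flags any process above a %MEM threshold. The column layout is assumed to match this sample; adjust the pattern if your output differs.

```python
# Rough parsing sketch for the "Top users of Memory" table shown above.
# The column layout (PID, VIRT, RES, %CPU, %MEM, COMMAND) is assumed.
import re

SAMPLE = """\
 PID     VIRT(KB)    RES(KB)   %CPU   %MEM   COMMAND
1768     24265434   23205056   1.30   70.60  svc_ifc_eventmg
2670      3666046     368296   0.00    1.10  urib
"""

ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<virt>\d+)\s+(?P<res>\d+)\s+"
    r"(?P<cpu>[\d.]+)\s+(?P<mem>[\d.]+)\s+(?P<cmd>\S+)"
)

def flag_heavy_users(text, threshold=50.0):
    """Return (process, %MEM) pairs exceeding the given %MEM threshold."""
    hits = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and float(m.group("mem")) >= threshold:
            hits.append((m.group("cmd"), float(m.group("mem"))))
    return hits

if __name__ == "__main__":
    for cmd, mem in flag_heavy_users(SAMPLE):
        print("{} is using {}% of memory".format(cmd, mem))  # svc_ifc_eventmg at 70.6
```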