...
When oom-killer is invoked, the killed process may attempt to come back up. However, each time it is killed, the kernel dumps state via printk, and each run of printk is a chance to access an illegal address. If this happens, the leaf will kernel panic:

...
[1024558.774251] Memory cgroup out of memory: Kill process 837 (systemd) score 5 or sacrifice child
[1024568.975843] sh invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[1024568.975990] Memory cgroup stats for /libvirt/lxc/CentOS7: cache:260324KB rss:1820KB rss_huge:0KB mapped_file:12KB writeback:0KB inactive_anon:260192KB active_anon:1820KB inactive_file:120KB active_file:0KB unevictable:0KB
[1024568.976004] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[1024568.976093] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:20
...
[1024568.976226] Memory cgroup out of memory: Kill process 837 (systemd) score 5 or sacrifice child
[1024579.128733] sh invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[1024579.128852] Memory cgroup stats for /libvirt/lxc/CentOS7: cache:260328KB rss:1816KB rss_huge:0KB mapped_file:120KB writeback:0KB inactive_anon:260192KB active_anon:1808KB inactive_file:0KB active_file:0KB unevictable:0KB
[1024579.128866] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[1024579.128958] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:20
...
[1024579.129082] BUG: unable to handle kernel paging request at ffffffffffffff10 <<<<<<<<<<<<<<<<<< printk accessed illegal address
[1024579.327688] Oops: 0000 [#1] SMP
[1024579.368254] Modules linked in: klm_kgdb(PO)
...
[1024583.732951] (1619191610.665921) (04-23-2021 15:26:50 UTC) cctrl_pre_kgdb_notify: send signal to bfdc begin.
[1024584.903152] (1619191611.835877) (04-23-2021 15:26:51 UTC) cctrl_pre_kgdb_notify: send signal to bfdc end.
[1024585.019582] cctrl: pre kgdb notifier
[1024586.508102] nvram_klm wrote rr=19 rr_str=system crash to nvram <<<<<<<<<<<<<<<<<< system crash
[1024586.578110] __kgdb_notify: Trying to fall info kgdb
[1024600.912727] NV_OOPS_BLOCK = 26,NV_MAX_BLOCK = 27,offset = 1019904, len = 524288
[1024601.002108] Writing to oops block index 0, now at 1
[1024601.062472] nvram oops successfully wrote 109186 bytes
[1024603.890578] Wrote mtdoops at 0 size 65536. Ret 0
[1024603.948753] Succesfully wrote mtdoops at 0 size 65536
[1024604.012211] mtdoops: ready 1, 2 (no erase)
[1024604.063178] pstore: Successfully logged oops info. Size 109128
[1024604.233352] nvram oops successfully wrote 109186 bytes
[1024605.208250] Wrote mtdoops at 65536 size 65536. Ret 0
[1024605.270588] Succesfully wrote mtdoops at 65536 size 65536
[1024605.338204] mtdoops: ready 2, 3 (no erase)
[1024605.389171] pstore: Successfully logged oops info. Size 27104
[1024605.559772] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 18918.276 msecs
[1024605.559774] cctrl: post kgdb notifier
[1024605.559779] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 18918.276 msecs
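The trace above shows the oops being persisted via mtdoops/pstore before the box goes down. After the leaf reboots, one way to confirm the panic signature is to search the persisted oops records. A minimal Python sketch, assuming pstore is mounted at the usual /sys/fs/pstore location (record filenames vary by pstore backend):

import re
from pathlib import Path

PSTORE = Path("/sys/fs/pstore")  # assumption: default pstore mount point
# The signature of the crash documented in this bug:
SIGNATURE = re.compile(r"BUG: unable to handle kernel paging request")

for record in sorted(PSTORE.glob("dmesg-*")):
    text = record.read_text(errors="replace")
    if SIGNATURE.search(text):
        print(record.name, "contains the illegal-address panic signature")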
A leaf has one or more processes (it could be any process) that are consistently invoking oom-killer due to hitting cgroup limits. dmesg/kernel dump output will show which process is hitting a limit. Some example snips:

[478014.360965] Memory cgroup out of memory: Kill process 54787 (node) score 15 or sacrifice child
[478014.360968] Killed process 54787 (node) total-vm:610164kB, anon-rss:4044kB, file-rss:812kB
[478029.766239] node invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0

The example above shows the process (node) hitting an out-of-memory condition and invoking oom-killer.

[1537745.403558] Memory cgroup out of memory: Kill process 14906 (svc_ifc_eventmg) score 965 or sacrifice child <<<<<<<<<<<<<<<<<<<
[1537745.403561] Killed process 14906 (svc_ifc_eventmg) total-vm:6984644kB, anon-rss:3944520kB, file-rss:674336kB
[1537751.895367] svc_ifc_eventmg (36854) Ran 4540 msecs in last 5024 msecs

The example above shows eventmgr hitting the cgroup out-of-memory limit and invoking oom-killer.

NOTE: When this bug was initially filed, it was due to the Tetration Agent hitting a cgroup limit (addressed in CSCvx65896), but again, this could be any process(es).
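To quantify how often oom-killer is firing and which processes are involved, the "invoked oom-killer" and "Kill process" lines shown above can be tallied from a saved dmesg capture. A minimal Python sketch (the dmesg.txt filename is an assumption; point it at whatever log capture you have):

import re
from collections import Counter

# Matches lines such as:
#   [478029.766239] node invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
#   [478014.360965] Memory cgroup out of memory: Kill process 54787 (node) score 15 or sacrifice child
INVOKED = re.compile(r"\]\s*(\S+) invoked oom-killer")
KILLED = re.compile(r"Kill process \d+ \(([^)]+)\)")

invoked, killed = Counter(), Counter()
with open("dmesg.txt") as log:  # assumption: dmesg output saved to a file
    for line in log:
        if m := INVOKED.search(line):
            invoked[m.group(1)] += 1
        if m := KILLED.search(line):
            killed[m.group(1)] += 1

print("oom-killer invoked by:", invoked.most_common())
print("processes killed:", killed.most_common())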
1. Identifying and addressing the reason for the process hitting the cgroup limit should be the priority. Once the process is identified and the reason for the oom condition analyzed, steps should be taken to lower that process's resource usage to stop oom-killer from being invoked (see the sketch below for checking a cgroup's usage against its limit). Example: in the case of the initial crash which caught this, disabling the Tetration Agent on the leaf by removing the analytics policy from the associated Leaf Policy Group will stop the TA process, which in turn will stop it from using resources, therefore no longer invoking oom-killer. Ultimately, upgrading to a fixed version will stop the illegal-address printk kernel panic, but in all scenarios any process which constantly hits an oom condition should be identified and addressed.
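Once the offending process and its cgroup are known, the cgroup's current usage can be compared against its limit to confirm how close it runs to the ceiling. A minimal Python sketch, assuming a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory; the libvirt/lxc/CentOS7 path is taken from the trace above and is only an example:

from pathlib import Path

# Assumption: cgroup v1 memory controller at the default mount point,
# and the cgroup path seen in the trace above.
CGROUP = Path("/sys/fs/cgroup/memory/libvirt/lxc/CentOS7")

def read_int(name: str) -> int:
    return int((CGROUP / name).read_text())

usage = read_int("memory.usage_in_bytes")
limit = read_int("memory.limit_in_bytes")
failcnt = read_int("memory.failcnt")  # how many times the limit was hit

print(f"usage {usage / 2**20:.1f} MiB of {limit / 2**20:.1f} MiB "
      f"({100 * usage / limit:.1f}%), limit hit {failcnt} times")

A steadily climbing failcnt for the cgroup is a strong signal that the process inside it will keep invoking oom-killer until its footprint or the limit is changed.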
The Tetration Agent issue causing the oom condition caught in this defect is addressed in the following defect:

CSCvx65896 - HW Agent (ACI mode) unintentionally restarts

The fix associated with this defect corrects the illegal address request piece caught in the above trace:

[478332.427941] BUG: unable to handle kernel paging request at ffffffffffffff10

In all cases, if some process hit an oom condition/cgroup limit which invoked a dump via printk, that process should be analyzed and the oom condition understood.