...
When oom-killer is invoked, the killed process may attempt to come back up. However, each time it is killed, the kernel dumps state via printk, and each run of printk is a chance to access an illegal address. If this happens, the leaf will kernel panic:

...
[1024558.774251] Memory cgroup out of memory: Kill process 837 (systemd) score 5 or sacrifice child
[1024568.975843] sh invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[1024568.975990] Memory cgroup stats for /libvirt/lxc/CentOS7: cache:260324KB rss:1820KB rss_huge:0KB mapped_file:12KB writeback:0KB inactive_anon:260192KB active_anon:1820KB inactive_file:120KB active_file:0KB unevictable:0KB
[1024568.976004] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[1024568.976093] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:20
...
[1024568.976226] Memory cgroup out of memory: Kill process 837 (systemd) score 5 or sacrifice child
[1024579.128733] sh invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[1024579.128852] Memory cgroup stats for /libvirt/lxc/CentOS7: cache:260328KB rss:1816KB rss_huge:0KB mapped_file:120KB writeback:0KB inactive_anon:260192KB active_anon:1808KB inactive_file:0KB active_file:0KB unevictable:0KB
[1024579.128866] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[1024579.128958] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:20
...
[1024579.129082] BUG: unable to handle kernel paging request at ffffffffffffff10 <<<<<<<<<<<<<<<<<< printk accessed illegal address
[1024579.327688] Oops: 0000 [#1] SMP
[1024579.368254] Modules linked in: klm_kgdb(PO)
...
[1024583.732951] (1619191610.665921) (04-23-2021 15:26:50 UTC) cctrl_pre_kgdb_notify: send signal to bfdc begin.
[1024584.903152] (1619191611.835877) (04-23-2021 15:26:51 UTC) cctrl_pre_kgdb_notify: send signal to bfdc end.
[1024585.019582] cctrl: pre kgdb notifier
[1024586.508102] nvram_klm wrote rr=19 rr_str=system crash to nvram <<<<<<<<<<<<<<<<<< system crash
[1024586.578110] __kgdb_notify: Trying to fall info kgdb
[1024600.912727] NV_OOPS_BLOCK = 26,NV_MAX_BLOCK = 27,offset = 1019904, len = 524288
[1024601.002108] Writing to oops block index 0, now at 1
[1024601.062472] nvram oops successfully wrote 109186 bytes
[1024603.890578] Wrote mtdoops at 0 size 65536. Ret 0
[1024603.948753] Succesfully wrote mtdoops at 0 size 65536
[1024604.012211] mtdoops: ready 1, 2 (no erase)
[1024604.063178] pstore: Successfully logged oops info. Size 109128
[1024604.233352] nvram oops successfully wrote 109186 bytes
[1024605.208250] Wrote mtdoops at 65536 size 65536. Ret 0
[1024605.270588] Succesfully wrote mtdoops at 65536 size 65536
[1024605.338204] mtdoops: ready 2, 3 (no erase)
[1024605.389171] pstore: Successfully logged oops info. Size 27104
[1024605.559772] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 18918.276 msecs
[1024605.559774] cctrl: post kgdb notifier
[1024605.559779] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 18918.276 msecs
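The trace above shows the oops being persisted via mtdoops/pstore before the box goes down. After the leaf reboots, one way to confirm the panic signature is to search the persisted oops records. A minimal Python sketch, assuming pstore is mounted at the usual /sys/fs/pstore location (record filenames vary by pstore backend):

import re
from pathlib import Path

PSTORE = Path("/sys/fs/pstore")  # assumption: default pstore mount point
# The signature of the crash documented in this bug:
SIGNATURE = re.compile(r"BUG: unable to handle kernel paging request")

for record in sorted(PSTORE.glob("dmesg-*")):
    text = record.read_text(errors="replace")
    if SIGNATURE.search(text):
        print(record.name, "contains the illegal-address panic signature")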
A leaf has one or more processes (it could be any process) that are consistently invoking oom-killer due to hitting cgroup limits. dmesg/kernel dump output will show which process is hitting a limit. Some example snips:

[478014.360965] Memory cgroup out of memory: Kill process 54787 (node) score 15 or sacrifice child
[478014.360968] Killed process 54787 (node) total-vm:610164kB, anon-rss:4044kB, file-rss:812kB
[478029.766239] node invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0

The example above shows the process (node) hitting an out-of-memory condition and invoking oom-killer.

[1537745.403558] Memory cgroup out of memory: Kill process 14906 (svc_ifc_eventmg) score 965 or sacrifice child <<<<<<<<<<<<<<<<<<<
[1537745.403561] Killed process 14906 (svc_ifc_eventmg) total-vm:6984644kB, anon-rss:3944520kB, file-rss:674336kB
[1537751.895367] svc_ifc_eventmg (36854) Ran 4540 msecs in last 5024 msecs

The example above shows eventmgr hitting the cgroup out-of-memory limit and invoking oom-killer.

NOTE: When this bug was initially filed, it was due to the Tetration Agent hitting a cgroup limit (addressed in CSCvx65896), but again, this could be any process(es).
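To quantify how often oom-killer is firing and which processes are involved, the "invoked oom-killer" and "Kill process" lines shown above can be tallied from a saved dmesg capture. A minimal Python sketch (the dmesg.txt filename is an assumption; point it at whatever log capture you have):

import re
from collections import Counter

# Matches lines such as:
#   [478029.766239] node invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
#   [478014.360965] Memory cgroup out of memory: Kill process 54787 (node) score 15 or sacrifice child
INVOKED = re.compile(r"\]\s*(\S+) invoked oom-killer")
KILLED = re.compile(r"Kill process \d+ \(([^)]+)\)")

invoked, killed = Counter(), Counter()
with open("dmesg.txt") as log:  # assumption: dmesg output saved to a file
    for line in log:
        if m := INVOKED.search(line):
            invoked[m.group(1)] += 1
        if m := KILLED.search(line):
            killed[m.group(1)] += 1

print("oom-killer invoked by:", invoked.most_common())
print("processes killed:", killed.most_common())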
1. Identifying and addressing the reason for the process hitting the cgroup limit should be the priority. Once the process is identified and the reason for the oom condition analyzed, steps should be taken to lower that process's resource usage to stop oom-killer from being invoked (see the sketch below for checking a cgroup's usage against its limit). Example: in the case of the initial crash which caught this, disabling the Tetration Agent on the leaf by removing the analytics policy from the associated Leaf Policy Group will stop the TA process, which in turn will stop it from using resources, therefore no longer invoking oom-killer. Ultimately, upgrading to a fixed version will stop the illegal-address printk kernel panic, but in all scenarios any process which constantly hits an oom condition should be identified and addressed.
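Once the offending process and its cgroup are known, the cgroup's current usage can be compared against its limit to confirm how close it runs to the ceiling. A minimal Python sketch, assuming a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory; the libvirt/lxc/CentOS7 path is taken from the trace above and is only an example:

from pathlib import Path

# Assumption: cgroup v1 memory controller at the default mount point,
# and the cgroup path seen in the trace above.
CGROUP = Path("/sys/fs/cgroup/memory/libvirt/lxc/CentOS7")

def read_int(name: str) -> int:
    return int((CGROUP / name).read_text())

usage = read_int("memory.usage_in_bytes")
limit = read_int("memory.limit_in_bytes")
failcnt = read_int("memory.failcnt")  # how many times the limit was hit

print(f"usage {usage / 2**20:.1f} MiB of {limit / 2**20:.1f} MiB "
      f"({100 * usage / limit:.1f}%), limit hit {failcnt} times")

A steadily climbing failcnt for the cgroup is a strong signal that the process inside it will keep invoking oom-killer until its footprint or the limit is changed.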
The Tetration Agent issue causing the oom condition caught in this defect is addressed in the following defect:

CSCvx65896 - HW Agent (ACI mode) unintentionally restarts

The fix associated with this defect corrects the illegal address request piece caught in the above trace:

[478332.427941] BUG: unable to handle kernel paging request at ffffffffffffff10

In all cases, if some process hit an oom condition/cgroup limit which invoked a dump via printk, that process should be analyzed and the oom condition understood.