Symptom
The stack_mgr process can fail unexpectedly on a switch stack, producing log output such as:
Chassis 1 reloading, reason - EHSA keepalive timeout
Apr 1 11:22:07.739: %PMAN-3-RPSWITCH: R0/0: RP switch initiated. Critical process stack_mgr has failed (rc 0)
Apr 1 11:22:16.452: %PMAN-3-PROCHOLDDOWN: R0/0: The process stack_mgr has been helddown (rc 139)
Apr 1 11:22:22.078: %PMAN-5-EXITACTION: F0/0: pvp: Process manager is exiting: reload fp action requested
INFO: rcu_sched detected stalls on CPUs/tasks:
3-...: (9 ticks this GP) idle=b25/140000000000000/0 softirq=57412647/57412647 fqs=5049
(detected by 2, t=5288 jiffies, g=26983533, c=26983532, q=54430)
NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [flashutil:2940]
Kernel panic - not syncing: softlockup: hung tasks
CPU: 1 PID: 2940 Comm: flashutil Tainted: G W O L 4.4.155 #1
Hardware name: Cisco Craw64 on DopplerG 2.0 (DT)
Call trace:
[] dump_backtrace+0x0/0x148
[] show_stack+0x14/0x20
[] dump_stack+0x98/0xbc
[] panic+0xf8/0x25c
[] watchdog+0x0/0x48
[] __hrtimer_run_queues+0xf0/0x178
[] hrtimer_interrupt+0x98/0x1c8
[] arch_timer_handler_phys+0x30/0x40
[] handle_percpu_devid_irq+0x78/0xa0
[] generic_handle_irq+0x24/0x38
[] __handle_domain_irq+0x5c/0xb8
[] gic_handle_irq+0x58/0xb0
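When triaging collected console or syslog output, it can help to check for the log lines quoted above. The following is a minimal sketch, not a vendor tool: the function name `match_signatures`, the signature labels, and the regular expressions are all illustrative assumptions derived only from the messages shown in this note. A match means the log resembles this symptom; it does not by itself confirm this defect.

```python
import re

# Patterns are derived from the log lines quoted in this note.
# The labels and the helper below are hypothetical, for illustration only.
SIGNATURES = {
    "ehsa_timeout": re.compile(r"reloading, reason - EHSA keepalive timeout"),
    "stack_mgr_failed": re.compile(
        r"%PMAN-3-RPSWITCH:.*Critical process stack_mgr has failed"
    ),
    "stack_mgr_helddown": re.compile(
        r"%PMAN-3-PROCHOLDDOWN:.*process stack_mgr has been helddown"
    ),
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s!"),
}


def match_signatures(log_text):
    """Return the sorted labels of all signatures found in log_text."""
    found = set()
    for line in log_text.splitlines():
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(label)
    return sorted(found)
```

Calling `match_signatures(open(path).read())` on a saved log (where `path` is whatever file holds the captured output) returns a list of labels; an empty list means none of the quoted lines were found.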
Conditions
The crash occurs when a large number of Dot1x sessions are active across the stack and the "default interface range" command is applied to multiple interfaces at once.
Workaround
This issue has been seen only in the 16.11.x releases. It has not been seen in 16.12.3 and later releases.