Symptoms
Impact:
Failover of the Active CP to Standby due to a terminated crond process.
Environment:
DELL EMC Hardware: Connectrix ED-DCX6-4B
DELL EMC Hardware: Connectrix ED-DCX6-8B
Brocade Software: Fabric OS 8.1.0c
Problem:
The Active Control Processor (CP) failed over from active to standby because the internal crond process hung.
Errdump:
[HAMK-1004], 7353, SLOT 1 CHASSIS, INFO, SWITCHNAME, Resetting standby CP (double reset may occur).
[FSSM-1003], 7354, SLOT 1 CHASSIS, WARNING, SWITCHNAME, HA State out of sync.
[ESM-3000], 7355, SLOT 1 FID 127, INFO, SWITCHNAME -VF127, Warm Recovery starting.
[ESM-3001], 7359, SLOT 1 FID 128, INFO, SWITCHNAME -VF127, Warm Recovery complete.
[truncated]
[EM-1033], 7370, SLOT 1 CHASSIS, ERROR, SWITCHNAME, CP in Slot 2 set to faulty because CP ERROR asserted.
[EM-1047], 7371, SLOT 1 CHASSIS, INFO, SWITCHNAME, CP in slot 2 not faulty, CP ERROR deasserted.
[FV-1002], 7372, SLOT 1 FID 128, INFO, SWITCHNAME, Flow Vision Config Replay Completed Successfully.
[truncated]
[HAM-1004], 7388, SLOT 2 CHASSIS, INFO, SWITCHNAME, Processor rebooted - Software Fault:Kernel Panic.
[truncated]
[FSSM-1002], 7395, SLOT 1 CHASSIS, INFO, SWITCHNAME, HA State is in sync
[FSSM-1002], 7396, SLOT 2 CHASSIS, INFO, SWITCHNAME, HA State is in sync.
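When comparing a switch's errdump against the sequence above, it can help to split each entry into fields and pick out the messages that mark the failover. The following is a minimal Python sketch, not a Fabric OS tool. It assumes each entry has the layout seen in this article ([message ID], sequence number, location, severity, switch name, message text). The names parse_errdump, failover_entries, and FAILOVER_IDS are hypothetical, and the set of message IDs is only an illustrative selection taken from the entries shown here.

```python
import re

# Assumed entry layout, based only on the sample entries in this article:
# [MSG-ID], sequence, location, severity, switch name, message text
ENTRY_RE = re.compile(
    r"\[(?P<msg_id>[A-Z]+-\d+)\],\s*(?P<seq>\d+),\s*(?P<location>[^,]+),\s*"
    r"(?P<severity>[A-Z]+),\s*(?P<switch>[^,]+),\s*(?P<text>.*)"
)

# Illustrative selection of message IDs from the errdump in this article.
FAILOVER_IDS = {"HAMK-1004", "FSSM-1003", "EM-1033", "HAM-1004"}


def parse_errdump(lines):
    """Turn errdump lines into dictionaries; lines that do not match are skipped."""
    entries = []
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["seq"] = int(entry["seq"])
            entries.append(entry)
    return entries


def failover_entries(entries):
    """Keep only the entries whose message ID is in FAILOVER_IDS."""
    return [e for e in entries if e["msg_id"] in FAILOVER_IDS]


sample = [
    "[HAMK-1004], 7353, SLOT 1 CHASSIS, INFO, SWITCHNAME, Resetting standby CP (double reset may occur).",
    "[FSSM-1003], 7354, SLOT 1 CHASSIS, WARNING, SWITCHNAME, HA State out of sync.",
    "[ESM-3000], 7355, SLOT 1 FID 127, INFO, SWITCHNAME -VF127, Warm Recovery starting.",
    "[HAM-1004], 7388, SLOT 2 CHASSIS, INFO, SWITCHNAME, Processor rebooted - Software Fault:Kernel Panic.",
]

for e in failover_entries(parse_errdump(sample)):
    print(e["seq"], e["msg_id"], e["severity"], e["text"])
```

The message text is matched last and greedily, so commas inside the message (as in the EM-1047 entry) do not break the split.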
Core file output:
The captured core file shows the following hung-task trace for crond:
INFO: task crond:5245 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond D 0ff2bf04 0 5245 1352 0x00000000
Call Trace:
STACK MAGIC 0x57ac6e9d
[7a687cd0] [400cc610] perf_event_task_sched_out+0x2c/0x350 (unreliable)
[7a687d90] [40009c0c] __switch_to+0xb0/0xcc
[7a687db0] [4058c8c8] schedule+0x318/0x5e4
[7a687e20] [4058d184] schedule_timeout+0x2ac/0x358
[7a687e70] [4058c3dc] wait_for_common+0x124/0x188
[7a687ec0] [4058c598] wait_for_completion+0x30/0x48
[7a687ed0] [40050090] do_fork+0x1c0/0x3b4
[7a687f30] [40009594] sys_vfork+0x64/0x7c
[7a687f40] [4001369c] ret_from_syscall+0x0/0x3c
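The trace above shows crond blocked inside sys_vfork, waiting in wait_for_completion. To check other console or core output for the same signature, the hung-task line and the call-trace frames can be extracted with a short script. This is a rough Python sketch that assumes the message wording and frame layout shown above; summarize_hung_task is a hypothetical helper name, and the sample text is abbreviated from the trace in this article.

```python
import re

# Matches the kernel hung-task line as it appears in the trace above.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Matches frames of the form "[stack addr] [code addr] function+0xOFF/0xSIZE".
FRAME_RE = re.compile(
    r"^\s*\[[0-9a-f]+\] \[[0-9a-f]+\] (?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+",
    re.MULTILINE,
)


def summarize_hung_task(text):
    """Return process name, PID, blocked seconds, and frame names, or None."""
    match = HUNG_RE.search(text)
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        "secs": int(match.group("secs")),
        # Frames are listed innermost first, in the order the kernel printed them.
        "frames": FRAME_RE.findall(text),
    }


trace = """INFO: task crond:5245 blocked for more than 120 seconds.
Call Trace:
[7a687ec0] [4058c598] wait_for_completion+0x30/0x48
[7a687ed0] [40050090] do_fork+0x1c0/0x3b4
[7a687f30] [40009594] sys_vfork+0x64/0x7c
[7a687f40] [4001369c] ret_from_syscall+0x0/0x3c
"""

info = summarize_hung_task(trace)
print(info["comm"], info["pid"], info["secs"])
print(" <- ".join(info["frames"]))
```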
Cause
The crond daemon process entered a hung state, and the logs did not show why it hung. Brocade engineering enhanced later Fabric OS code to capture additional data when a process hangs, which should help determine a root cause.
Resolution
Upgrade to Fabric OS 8.1.2 to enable more data capture.