...
The VSM card is silently reloaded/restarted, due to one of the following:

1. The VSM LC-XR QNX kernel detects an exception and triggers an explicit LC-XR crash.
2. The WDSYSMON process in the VSM LC-XR VM internally calls a reboot of the LC-XR VM, upon detecting a persistent CPU hog lasting more than 30 seconds.
3. The RSP CANB server triggers a watchdog reset on the VSM LC, upon observing that the watchdog toggle is not happening on the LC.

#1 can be observed and checked through the following:
---------------------------------------------
As a result of the VSM card crash/reload, the 'crashinfo' file in the LC-XR VM /lcdisk0:/dumper directory is expected to be generated and to indicate the crash reason as:
" Crash Reason: Kernel Crash Exception at 0xfe6c33aa signal 5 c=2 f=0"

Expected signature of syslog messages:
In the following syslogs, 'envmon_lc' is reported as the top user of CPU. In other instances, processes such as 'ntpdc' or 'dev-ahci' could also be reported as the top user of CPU.

LC/0/2/CPU0:Jan 30 17:30:43.464 : wdsysmon[372]: Process envmon_lc pid 176215 prio 10 using 25 percent is the top user of CPU
RP/0/RSP1/CPU0:Jan 30 17:30:57.120 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/2/CPU0 , Power Cycle (0x05000000)
RP/0/RSP0/CPU0:Jan 30 17:30:57.120 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_CPU_RESET : Node 0/2/CPU0 CPU reset detected.
RP/0/RSP0/CPU0:Jan 30 17:30:57.121 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-VSM-500 state:BRINGDOWN
RP/0/RSP0/CPU0:Jan 30 17:30:57.141 : invmgr[255]: %PLATFORM-INV-6-OIROUT : OIR: Node 0/2/1 Sn: N/A removed
RP/0/RSP0/CPU0:Jan 30 17:30:57.151 : invmgr[255]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node: 0/2/CPU0, state: BRINGDOWN
RP/0/RSP0/CPU0:Jan 30 17:31:04.295 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_POST_RESET_NOTIFICATION : Node 0/2/CPU0 , Power Cycle (0x05000000)
RP/0/RSP0/CPU0:Jan 30 17:31:04.296 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-VSM-500 state:ROMMON
---------------------------------------------

#2 can be observed and checked through the following:
---------------------------------------------
As a result of the VSM card crash/reload, the 'crashinfo' file in the LC-XR VM /lcdisk0:/dumper directory is expected to be generated and to indicate the crash reason as:
"Crash Reason: Cause code 0x2c000008 Cause: wdsysmon: persistent hog detected"

Expected signature of syslog messages:

LC/0/2/CPU0:Jan 22 11:11:30.981 : wdsysmon[372]: Persistent Hog detected for more than 20 seconds
LC/0/2/CPU0:Jan 22 11:11:31.583 : wdsysmon[372]: Persistent Hog detected for more than 20 seconds
LC/0/2/CPU0:Jan 22 11:11:32.185 : wdsysmon[372]: Persistent Hog detected for more than 30 seconds
LC/0/2/CPU0:Jan 22 11:11:32.787 : wdsysmon[372]: Persistent hog (lasting more than 30 seconds) detected by wdsysmon on CPU3. Resetting node soon
LC/0/2/CPU0:Jan 22 11:11:32.787 : wdsysmon[372]: Process: , Pid 0, Tid 0, Priority 0, Util 0.0 % is the top user of the CPU
LC/0/2/CPU0:Jan 22 11:11:32.787 : wdsysmon[372]: Process: , Pid 0, Tid 0, Priority 0, Util 0.0 % is the top user of the CPU
LC/0/2/CPU0:Jan 22 11:11:32.846 : syslog_dev[87]: wdsysmon[372] PID-592994450: Fri Jan 22 11:11:32 ISR 2016
LC/0/2/CPU0:Jan 22 11:11:34.818 : wdsysmon[372]: reboot_internal: Incomplete graceful reboot cleanup (Connection timed out)
LC/0/2/CPU0:Jan 22 11:11:34.818 : wdsysmon[372]: Fri Jan 22 11:11:32 2016:sync start
LC/0/2/CPU0:Jan 22 11:11:34.818 : wdsysmon[372]: Fri Jan 22 11:11:32 2016:sync end
LC/0/2/CPU0:Jan 22 11:11:34.818 : wdsysmon[372]: Fri Jan 22 11:11:32 2016:platform_reboot_op start
RP/0/RSP1/CPU0:Jan 22 11:11:43.755 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/2/CPU0 , Power Cycle (0x05000000)
RP/0/RSP0/CPU0:Jan 22 11:11:43.757 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_CPU_RESET : Node 0/2/CPU0 CPU reset detected.
RP/0/RSP0/CPU0:Jan 22 11:11:43.758 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-VSM-500 state:BRINGDOWN
RP/0/RSP0/CPU0:Jan 22 11:11:43.775 : invmgr[255]: %PLATFORM-INV-6-OIROUT : OIR: Node 0/2/1 Sn: N/A removed
RP/0/RSP0/CPU0:Jan 22 11:11:43.784 : invmgr[255]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node: 0/2/CPU0, state: BRINGDOWN
RP/0/RSP1/CPU0:Jan 22 11:11:50.798 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_POST_RESET_NOTIFICATION : Node 0/2/CPU0 , Power Cycle (0x05000000)
RP/0/RSP0/CPU0:Jan 22 11:11:50.800 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-VSM-500 state:ROMMON
---------------------------------------------

#3 can be observed and checked through the following:
---------------------------------------------
Expected signature of syslog messages:

RP/0/RSP0/CPU0:Jan 3 01:21:12.035 : envmon[207]: %PLATFORM-ENVMON-4-CBC_WDOG_EXCEED_THRESHOLD : CBC on node 0/2/CPU0 has not seen watchdog toggle in at least 22 seconds
RP/0/RSP1/CPU0:Jan 3 01:23:20.187 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/2/CPU0 , WDOG SReset (0x06000000)
RP/0/RSP0/CPU0:Jan 3 01:23:20.194 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_CPU_RESET : Node 0/2/CPU0 CPU reset detected.
RP/0/RSP0/CPU0:Jan 3 01:23:20.195 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-VSM-500 state:BRINGDOWN
RP/0/RSP0/CPU0:Jan 3 01:23:20.235 : invmgr[255]: %PLATFORM-INV-6-OIROUT : OIR: Node 0/2/1 Sn: N/A removed
RP/0/RSP0/CPU0:Jan 3 01:23:20.237 : invmgr[255]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node: 0/2/CPU0, state: BRINGDOWN
RP/0/RSP1/CPU0:Jan 3 01:23:36.183 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/2/CPU0 , WDOG HReset (0x07000000)
RP/0/RSP0/CPU0:Jan 3 01:23:37.187 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_PRE_RESET_NOTIFICATION : Node 0/2/CPU0 , WDOG Power Cycle (0x08000000)
RP/0/RSP0/CPU0:Jan 3 01:23:37.307 : ce_switch_srv[54]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 6 (LC_Slot_2) is down
RP/0/RSP1/CPU0:Jan 3 01:23:37.330 : ce_switch_srv[54]: %PLATFORM-CE_SWITCH-6-UPDN : Interface 6 (LC_Slot_2) is down
RP/0/RSP0/CPU0:Jan 3 01:23:43.715 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_POST_RESET_NOTIFICATION : Node 0/2/CPU0 , WDOG Power Cycle (0x08000000)
RP/0/RSP0/CPU0:Jan 3 01:23:43.716 : shelfmgr[403]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-VSM-500 state:ROMMON
RP/0/RSP1/CPU0:Jan 3 01:23:43.718 : canb-server[151]: %PLATFORM-CANB_SERVER-7-CBC_POST_RESET_NOTIFICATION : Node 0/2/CPU0 , WDOG Power Cycle (0x08000000)
---------------------------------------------
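
The three syslog signatures above can be told apart by a few marker strings. The following is a minimal sketch in Python, not a vendor tool: the function name classify_reload and the returned labels are hypothetical, and the patterns are taken only from the sample messages in this note, so other software versions may log differently. Because scenario #1 has no unique syslog marker, the sketch reports it only as a possibility that still has to be confirmed against the 'crashinfo' file.

```python
import re

# Patterns are copied from the sample syslogs in this note; they are
# assumptions about the message format, not an exhaustive list.
WDOG_RESET = re.compile(
    r"CBC_WDOG_EXCEED_THRESHOLD|WDOG (?:SReset|HReset|Power Cycle)"
)
PERSISTENT_HOG = re.compile(r"wdsysmon\[\d+\]: Persistent [Hh]og")
CPU_RESET = re.compile(r"NODE_CPU_RESET|CBC_PRE_RESET_NOTIFICATION")


def classify_reload(syslog_text: str) -> str:
    """Return a hypothetical label for the reload scenario seen in the logs."""
    # Scenario #3: the CANB server reports a watchdog-driven reset.
    if WDOG_RESET.search(syslog_text):
        return "canb-watchdog-reset"
    # Scenario #2: wdsysmon reports a persistent hog before the reset.
    if PERSISTENT_HOG.search(syslog_text):
        return "wdsysmon-persistent-hog"
    # Scenario #1: a plain CPU reset with neither marker above; the
    # crashinfo file must be checked for "Kernel Crash Exception".
    if CPU_RESET.search(syslog_text):
        return "possible-kernel-crash"
    return "no-match"
```

Feeding the function the syslog text collected around the time of the reload returns one label; the watchdog check runs first because the scenario #3 logs also contain the generic CPU reset messages.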
One of the VSM LC-XR logical/virtual CPU cores (0 to 3) appears to get stuck and stops responding to IPIs, which results in either a kernel exception or a CPU-hog-like condition. In almost all of the reload/restart instances, the stuck CPU core was running the idle process, 'procnto-smp-instr', at the time of the VSM card reload/restart.
None