Symptom
After a watchdog timeout reset, no kernel logs or stack traces are available to determine the cause of the timeout, and the reset reason indicates that the kernel did not receive the NMI:
----- reset reason for module 1 (from Supervisor in slot 1) ---
1) At 123456 usecs after Sun May 01 01:02:00 2021
Reason: Watchdog Timeout
Service: HW check by card-client
Version:
"HW check by card-client" indicates that the kernel either did not receive the NMI or was not able to write the reset reason section.
The output of the `show logging onboard internal cardcl` command shows:
IOFPGA POWER DEBUG = 00000000
IOFPGA RESET CAUSE = 00000004
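When checking captured outputs for this signature, a small script can compare them against the values shown above. The following is only a minimal sketch: the function name `matches_symptom` and the idea of passing the two command outputs as plain strings are assumptions for illustration, not part of any Cisco tool. It checks only the three values quoted in this note (the reason, the service, and the IOFPGA reset cause of 00000004) and does not interpret what the individual register bits mean.

```python
import re


def matches_symptom(reset_reason_text: str, cardcl_text: str) -> bool:
    """Return True if both captured outputs show the values quoted in this note.

    Hypothetical helper: both arguments are assumed to be the raw text of the
    reset-reason output and of `show logging onboard internal cardcl`.
    """
    # Reset-reason block must report a watchdog timeout ...
    reason_ok = re.search(
        r"^\s*Reason:\s*Watchdog Timeout\s*$", reset_reason_text, re.MULTILINE
    ) is not None
    # ... attributed to the card-client hardware check rather than the kernel.
    service_ok = re.search(
        r"^\s*Service:\s*HW check by card-client\s*$", reset_reason_text, re.MULTILINE
    ) is not None
    # The onboard log must show the same reset cause value as in this note.
    cause = re.search(r"IOFPGA RESET CAUSE\s*=\s*([0-9A-Fa-f]+)", cardcl_text)
    cause_ok = cause is not None and int(cause.group(1), 16) == 0x4
    return reason_ok and service_ok and cause_ok
```

A match only means the outputs look like the ones in this note; it does not by itself confirm that the race condition described under Conditions was the cause.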
Conditions
The problem may be seen due to a race condition that occurs when CPU performance counter collection happens at the same time as the IOFPGA raises the NMI exception for a watchdog timeout.
Further Problem Description
This is not a fix for watchdog timeout issues themselves. The fix delivered with this DDTS prevents the race condition that may keep a switch from properly logging the reason for a watchdog timeout.