[WIP] Detect hardware watchdog resets as a boot reason#6068

Draft
eriknordmark wants to merge 3 commits into lf-edge:master from eriknordmark:hw-watchdog-bootreason

Conversation

@eriknordmark

Contributor

Description

On a reboot for which EVE recorded no reason of its own, nodeagent guesses
the cause: if the storage controller's SMART power-cycle counter increased, it
reports a dirty power-off (BootReasonPowerFail); otherwise it reports a
kernel panic (BootReasonKernel) or, when SMART is unavailable,
BootReasonUnknown. A device reset by its hardware watchdog (counter
unchanged) is therefore indistinguishable from a kernel bug.

This PR adds a BootReasonHWWatchdog and the signal needed to set it. The
watchdog container reads the watchdog boot status (WDIOC_GETBOOTSTATUS) once
at startup, before it arms the device, and records the set flag names to
/persist/hw_watchdog_bootstatus. When nodeagent reaches the
counter-unchanged / unknown branches and that file shows CARDRESET, it
reports BootReasonHWWatchdog instead of guessing a kernel panic.
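The decision order described above can be sketched in Go as follows. This is a minimal illustration, not the code in nodeagent.go: the function names `guessBootReason` and `hasCardReset`, the numeric values of the constants, and the assumption that the file holds flag names separated by whitespace or commas are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// BootReason is a stand-in for the pillar type. The numeric values are
// illustrative only and do not match eve-api.
type BootReason int

const (
	BootReasonUnknown BootReason = iota
	BootReasonPowerFail
	BootReasonKernel
	BootReasonHWWatchdog
)

// hasCardReset reports whether the recorded flag names include CARDRESET.
// Assumption: names are separated by whitespace or commas.
func hasCardReset(bootStatus string) bool {
	isSep := func(r rune) bool {
		return r == ' ' || r == '\t' || r == '\n' || r == ','
	}
	for _, flag := range strings.FieldsFunc(bootStatus, isSep) {
		if flag == "CARDRESET" {
			return true
		}
	}
	return false
}

// guessBootReason (hypothetical) applies the order from the description:
// a power-cycle increase wins, then CARDRESET, then the old guesses.
func guessBootReason(smartAvailable, powerCycleIncreased bool, bootStatus string) BootReason {
	if smartAvailable && powerCycleIncreased {
		return BootReasonPowerFail
	}
	if hasCardReset(bootStatus) {
		return BootReasonHWWatchdog
	}
	if smartAvailable {
		return BootReasonKernel
	}
	return BootReasonUnknown
}

func main() {
	// Counter unchanged, CARDRESET recorded: hardware watchdog.
	fmt.Println(guessBootReason(true, false, "CARDRESET\n") == BootReasonHWWatchdog)
	// Counter unchanged, empty file: falls back to the kernel guess.
	fmt.Println(guessBootReason(true, false, "") == BootReasonKernel)
}
```

An empty or missing file leaves the existing kernel/unknown guesses untouched, which is the fallback the description relies on for drivers that report no boot status.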

BootReasonHWWatchdog.StartWithSavedConfig() returns false — like the
software watchdog reasons, a device returning from an unexplained hard hang
should wait for the controller rather than immediately restart saved
application config. This is an operator-visible behavior change: resets
that previously fell to BootReasonKernel/BootReasonUnknown and auto-resumed
apps will, on CARDRESET-reporting platforms, no longer auto-resume.
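The behavior change can be pictured with a reduced Go sketch. Only the three boot reasons discussed in this paragraph are modeled, and the constant values are illustrative; the real method in pkg/pillar/types covers more reasons, so the `default` branch here is an assumption made to keep the example short.

```go
package main

import "fmt"

// BootReason is a reduced stand-in for the pillar type; values are illustrative.
type BootReason uint8

const (
	BootReasonUnknown BootReason = iota
	BootReasonKernel
	BootReasonHWWatchdog
)

// StartWithSavedConfig reports whether saved application config may be
// started before the controller has been reached.
func (br BootReason) StartWithSavedConfig() bool {
	switch br {
	case BootReasonHWWatchdog:
		// Unexplained hard hang: wait for the controller.
		return false
	default:
		// Kernel and Unknown auto-resume, per the description above.
		return true
	}
}

func main() {
	for _, br := range []BootReason{BootReasonUnknown, BootReasonKernel, BootReasonHWWatchdog} {
		fmt.Printf("reason=%d startWithSavedConfig=%v\n", br, br.StartWithSavedConfig())
	}
}
```

A reset that used to map to BootReasonKernel and now maps to BootReasonHWWatchdog therefore flips from auto-resume to waiting for the controller.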

The flag is only reported by some watchdog drivers (e.g. AMD sp5100_tco,
many ARM SoC watchdogs); Intel iTCO always reports a zero boot status, so on
that hardware behavior is unchanged.
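To illustrate why a zero boot status changes nothing, here is a small Go sketch that decodes a boot-status bitmask into flag names. The bit values are the WDIOF_* constants from the Linux watchdog UAPI header; the helper name `bootStatusFlagNames` is hypothetical, and the watchdog container in this PR is a shell script, so this shows the idea rather than its implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// statusFlag pairs a WDIOF_* bit with the name recorded for it.
type statusFlag struct {
	bit  uint32
	name string
}

// Status bits from include/uapi/linux/watchdog.h that describe a past event.
var bootStatusFlags = []statusFlag{
	{0x0001, "OVERHEAT"},
	{0x0002, "FANFAULT"},
	{0x0004, "EXTERN1"},
	{0x0008, "EXTERN2"},
	{0x0010, "POWERUNDER"},
	{0x0020, "CARDRESET"},
	{0x0040, "POWEROVER"},
}

// bootStatusFlagNames returns the names of all bits set in status,
// in ascending bit order. A zero status yields no names.
func bootStatusFlagNames(status uint32) []string {
	var names []string
	for _, f := range bootStatusFlags {
		if status&f.bit != 0 {
			names = append(names, f.name)
		}
	}
	return names
}

func main() {
	// A driver that latches the watchdog reset reports the CARDRESET bit.
	fmt.Printf("%q\n", strings.Join(bootStatusFlagNames(0x0020), " "))
	// A driver that always reports zero produces an empty record.
	fmt.Printf("%q\n", strings.Join(bootStatusFlagNames(0), " "))
}
```

With a zero status the recorded list is empty, so a consumer looking for CARDRESET never matches and the previous guesses apply.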

PR dependencies

How to test and validate this PR

  • Unit tests: BootReasonFromString / String / StartWithSavedConfig in
    pkg/pillar/types/zedagenttypes_test.go cover the new value.
  • On hardware whose watchdog driver reports WDIOF_CARDRESET: trigger a
    watchdog reset, then confirm /persist/hw_watchdog_bootstatus contains
    CARDRESET and the device info message reports BootReasonHWWatchdog
    (rather than BootReasonKernel).
  • The wdctl-before-arm read needs per-platform validation: opening
    /dev/watchdog can arm the timer on some drivers, so confirm the device is
    not left ticking before watchdog(8) starts petting it.

Changelog notes

A device reset by its hardware watchdog is now reported with the dedicated
boot reason "hardware watchdog" instead of being attributed to a kernel panic,
on platforms whose watchdog driver supports it.

PR Backports

  • 16.0-stable: No — new enhancement, not a bug fix.
  • 14.5-stable: No — new enhancement, not a bug fix.
  • 13.4-stable: No — new enhancement, not a bug fix.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

eriknordmark and others added 3 commits June 21, 2026 20:13
Introduce a boot reason for a reset caused by the hardware watchdog,
matching BOOT_REASON_HW_WATCHDOG in eve-api. A device that the watchdog
timer reset, with no reason recorded by EVE itself, was previously
indistinguishable from a kernel panic. Treat it like the software
watchdog reasons: do not start saved application config automatically,
since the device came back from an unexplained hard hang.

Also add the HWWatchdogBootStatusFile location for the boot status that
pkg/watchdog records and nodeagent consumes.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Read the watchdog boot status once at startup, before the daemon arms
the device, and write the set flag names to /persist for nodeagent. The
boot status latches the cause of the previous reset; a CARDRESET entry
means the hardware watchdog reset the board. Platforms whose driver does
not report a boot status simply produce an empty file.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When no reboot reason was recorded and the SMART power-cycle counter is
unchanged, a reset was previously reported as a kernel panic. On
platforms whose watchdog driver latches CARDRESET, use that signal to
report BootReasonHWWatchdog instead, distinguishing a hardware watchdog
reset from a kernel bug. Falls back to the existing kernel/unknown
guesses when the flag is absent.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@codecov

codecov Bot commented Jun 21, 2026

Codecov Report

❌ Patch coverage is 23.07692% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 21.09%. Comparing base (2134a38) to head (5ffccdd).
⚠️ Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/pillar/cmd/nodeagent/nodeagent.go | 0.00% | 20 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6068      +/-   ##
==========================================
+ Coverage   20.29%   21.09%   +0.80%     
==========================================
  Files         490      502      +12     
  Lines       91656    93674    +2018     
==========================================
+ Hits        18600    19760    +1160     
- Misses      71496    72122     +626     
- Partials     1560     1792     +232     

Comment thread pkg/watchdog/init.sh
}

reload_watchdog() {
# Firs thinsg first: kill it!

Contributor

I know this is not part of your changes, but it is such a tiny fix that it might as well be done now, and it will make Yetus pass.

@eriknordmark eriknordmark changed the title Detect hardware watchdog resets as a boot reason [WIP] Detect hardware watchdog resets as a boot reason Jun 22, 2026