[WIP] Detect hardware watchdog resets as a boot reason#6068
Draft
eriknordmark wants to merge 3 commits into
Draft
[WIP] Detect hardware watchdog resets as a boot reason#6068eriknordmark wants to merge 3 commits into
eriknordmark wants to merge 3 commits into
Conversation
Introduce a boot reason for a reset caused by the hardware watchdog, matching BOOT_REASON_HW_WATCHDOG in eve-api. A device that the watchdog timer reset, with no reason recorded by EVE itself, was previously indistinguishable from a kernel panic. Treat it like the software watchdog reasons: do not start saved application config automatically, since the device came back from an unexplained hard hang. Also add the HWWatchdogBootStatusFile location for the boot status that pkg/watchdog records and nodeagent consumes. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Read the watchdog boot status once at startup, before the daemon arms the device, and write the set flag names to /persist for nodeagent. The boot status latches the cause of the previous reset; a CARDRESET entry means the hardware watchdog reset the board. Platforms whose driver does not report a boot status simply produce an empty file. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When no reboot reason was recorded and the SMART power-cycle counter is unchanged, a reset was previously reported as a kernel panic. On platforms whose watchdog driver latches CARDRESET, use that signal to report BootReasonHWWatchdog instead, distinguishing a hardware watchdog reset from a kernel bug. Falls back to the existing kernel/unknown guesses when the flag is absent. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6068 +/- ##
==========================================
+ Coverage 20.29% 21.09% +0.80%
==========================================
Files 490 502 +12
Lines 91656 93674 +2018
==========================================
+ Hits 18600 19760 +1160
- Misses 71496 72122 +626
- Partials 1560 1792 +232 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
rene
reviewed
Jun 22, 2026
| } | ||
|
|
||
| reload_watchdog() { | ||
| # Firs thinsg first: kill it! |
Contributor
There was a problem hiding this comment.
I know is not part of your changes, but this is a such tiny fix to not be done now... and it will make Yetus pass....
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
On a reboot for which EVE recorded no reason of its own, nodeagent guesses
the cause: if the storage controller's SMART power-cycle counter increased it
reports a dirty power-off (
BootReasonPowerFail), otherwise it reports akernel panic (
BootReasonKernel) or, when SMART is unavailable,BootReasonUnknown. A device reset by its hardware watchdog (counterunchanged) is therefore indistinguishable from a kernel bug.
This PR adds a
BootReasonHWWatchdogand the signal needed to set it. Thewatchdog container reads the watchdog boot status (
WDIOC_GETBOOTSTATUS) onceat startup, before it arms the device, and records the set flag names to
/persist/hw_watchdog_bootstatus. When nodeagent reaches thecounter-unchanged / unknown branches and that file shows
CARDRESET, itreports
BootReasonHWWatchdoginstead of guessing a kernel panic.BootReasonHWWatchdog.StartWithSavedConfig()returnsfalse— like thesoftware watchdog reasons, a device returning from an unexplained hard hang
should wait for the controller rather than immediately restart saved
application config. This is an operator-visible behavior change: resets
that previously fell to
BootReasonKernel/BootReasonUnknownand auto-resumedapps will, on CARDRESET-reporting platforms, no longer auto-resume.
The flag is only reported by some watchdog drivers (e.g. AMD
sp5100_tco,many ARM SoC watchdogs); Intel
iTCOalways reports a zero boot status, so onthat hardware behavior is unchanged.
PR dependencies
BOOT_REASON_HW_WATCHDOG = 16andBOOT_REASON_KUBE_TRANSITION = 15). Once that merges, aBump eve-apicommit will be added here to vendor the new enum value. Pillar already
compiles without it (the reason maps to the proto via a numeric cast).
How to test and validate this PR
BootReasonFromString/String/StartWithSavedConfiginpkg/pillar/types/zedagenttypes_test.gocover the new value.WDIOF_CARDRESET: trigger awatchdog reset, then confirm
/persist/hw_watchdog_bootstatuscontainsCARDRESETand the device info message reportsBootReasonHWWatchdog(rather than
BootReasonKernel).wdctl-before-arm read needs per-platform validation: opening/dev/watchdogcan arm the timer on some drivers, so confirm the device isnot left ticking before
watchdog(8)starts petting it.Changelog notes
A device reset by its hardware watchdog is now reported with the dedicated
boot reason "hardware watchdog" instead of being attributed to a kernel panic,
on platforms whose watchdog driver supports it.
PR Backports
Checklist