Skip to content

feat(nodetask): add DiscoverPeers + RestartPod SeiNodeTask kinds#386

Merged
bdchatham merged 7 commits into
mainfrom
feat/seinodetask-discover-peers-restart-pod
Jun 6, 2026
Merged

feat(nodetask): add DiscoverPeers + RestartPod SeiNodeTask kinds#386
bdchatham merged 7 commits into
mainfrom
feat/seinodetask-discover-peers-restart-pod

Conversation

@bdchatham

Copy link
Copy Markdown
Collaborator

What

Two new composable SeiNodeTask kinds to refresh a running node's persistent-peers without baking peer-drift logic into the core SeiNode reconcile:

  • DiscoverPeers (sidecar-backed) — re-resolves the target SeiNode's spec.peers and writes persistent-peers into config.toml on the PVC via the existing sidecar discover-peers task. Scalar merge → preserves the rest of the config, so no config-apply needed. Disk-only: the running seid picks it up on its next restart.
  • RestartPod (controller-side) — restarts the target's pod so seid re-reads config.toml.

Sequenced DiscoverPeersRestartPod applies fresh peers to a live node (the arctic-1 EC2→K8s validator migration use case), composable into Chaos Mesh Workflows.

Why

Migrated K8s validators need their persistent-peers updated after a spec.peers change (e.g. adding the proposer cohort). We established that:

  • peers is not in the controller's rollout templateHash, so a peers edit doesn't trigger a plan;
  • the NodeUpdate (image) plan runs neither discover-peers nor config-apply, and mark-ready is a no-op — so an image bump restarts seid onto the stale config.toml;
  • there is no working hot-reload of persistent-peers into a live seid (seictl config-reload's seid-signal is a TODO).

Rather than enshrine a peer-drift trigger in the core reconcile (a heavier, harder-to-reverse change with fleet-wide blast radius), this delivers the capability as opt-in, composable tasks. Promote to an automatic controller trigger later if it proves worth it (tracked separately).

Design notes

  • RestartPod completion is content-addressed by pod UID. Execute deletes only the captured-UID pod; Status completes when an owned Ready pod has a different UID. The UID is captured at synthesis and persisted on status.task (survives reconcile reconstruction). This is immune to the creationTimestamp 1s-resolution race a wall-clock epoch would have. It does not reuse replace-pod (which keys on StatefulSet revision and no-ops when the spec is unchanged).
  • Bounded by default. RestartPod defaults to a 10m timeout, DiscoverPeers 2m (explicit spec.timeoutSeconds always wins; sidecar gov/await kinds stay unbounded). A hand-authored RestartPod whose pod never goes Ready transitions to Failed(reason=Timeout) rather than requeuing forever.
  • Fail-fast on empty peers. A target with no spec.peers — including a label source whose status.resolvedPeers is empty — fails with ParamsBuildFailed rather than submitting a zero-source discover-peers that would silently wipe persistent-peers.
  • Double-sign safe. RestartPod is the same single-replica + OnDelete + RWO-PVC stop-then-start as the existing replace-pod; multi-replica is a terminal guard.
  • Atomicity caveat (documented on the kind): DiscoverPeers succeeds then RestartPod fails → fresh config on disk + old seid still running; transient, observable, self-healing on the next successful RestartPod.

Tests

  • Unit: taskParamsForKind for both kinds (ec2Tags/label/static/mixed-empty-label/empty-peers/nil-target; RestartPod params + UID threading), restart-pod execution (UID match/no-op/missing/terminating/multi-replica guard, same-second replacement regression), CEL union (reject zero/multiple payloads, kind immutability).
  • Reconcile (fake client): both kinds happy-path, empty-peers→Failed, label-empty-resolved→Failed, pod-missing→waits→completes, never-Ready→Timeout.
  • envtest: CEL validation against a real apiserver.
  • make test + make test-integration green; make lint 0 new findings vs main.

Follow-up (deferred)

  • Promote peer-drift to an automatic controller trigger once proven.
  • seid pod has no explicit terminationGracePeriodSeconds (inherits 30s) — if seid needs longer to flush on shutdown, worth setting in the StatefulSet generator. Independent of this PR.

🤖 Generated with Claude Code

Two composable SeiNodeTask kinds to refresh a running node's persistent peers
without baking peer-drift logic into the core SeiNode reconcile:

- DiscoverPeers (sidecar-backed): re-resolves the target's spec.peers and writes
  persistent-peers into config.toml on the PVC via the sidecar discover-peers
  task (scalar merge — no config-apply needed; disk-only, the running seid picks
  it up on next restart).
- RestartPod (controller-side): restarts the target's pod so seid re-reads
  config.toml. Completion is content-addressed by pod UID — Execute deletes only
  the captured-UID pod; Status completes when an owned Ready pod has a different
  UID. UID is captured at synthesis and persisted on status.task, so it survives
  reconcile reconstruction and is immune to creationTimestamp second-resolution
  races (unlike a wall-clock epoch).

Sequenced DiscoverPeers -> RestartPod applies fresh peers to a live validator
(the arctic-1 EC2->K8s migration use case) via declarative task CRs, composable
into Chaos Mesh Workflows. The DiscoverPeers-ok/RestartPod-fails window is a
transient, self-healing divergence (documented on the kind), and RestartPod
carries a bounded default timeout (10m; DiscoverPeers 2m) so a hand-authored CR
cannot hang silently.

Empty spec.peers (incl. a label source with no resolvedPeers) fails fast rather
than submitting a zero-source discover-peers that would wipe persistent-peers.
Conditions follow the Ready/Failed latch + ObservedGeneration conventions;
status patches use optimistic concurrency.

Deferred (tracked separately): promoting peer-drift to an automatic controller
trigger once this proves valuable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 6, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Touches live validator pods (delete/restart) and on-disk peer config; mistakes could affect connectivity or cause brief downtime, though single-replica/RWO guards and fail-fast peer validation limit blast radius.

Overview
Adds two SeiNodeTask kinds for opt-in peer refresh on running nodes: DiscoverPeers (sidecar writes persistent-peers to disk from the target’s spec.peers) and RestartPod (deletes a caller-supplied pod UID so seid reloads config.toml). CRD/API gains union validation, executionStartedAt / restartedPodUID status fields, and clarified timeout semantics (execution budget starts after requirePhase, with per-kind defaults for the new kinds).

The nodetask reconciler delegates param building to task.SeiNodeTaskParamsFor, uses FailureReason for stable fail reasons, stamps execution metadata at synthesis, and gains pod delete RBAC. restart-pod is a new controller task (UID-based completion, shared podCycle with refactored replace-pod). DiscoverPeers fails fast when peer sources are empty/unresolved (unlike the planner init path).

Tests: broad unit/reconcile coverage, new nodetask envtest for CEL rules, and make test-integration extended to that package.

Reviewed by Cursor Bugbot for commit 467ea31. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread internal/controller/nodetask/controller.go
Comment thread internal/controller/nodetask/controller.go Outdated
Comment thread internal/controller/nodetask/controller.go Outdated
Comment thread internal/controller/nodetask/controller.go Outdated
Comment thread internal/controller/nodetask/controller.go
…out, sweep

Cursor Bugbot (HIGH) + CI + maintainer review:
- Timeout measured from execution start, not task creation: new
  status.task.executionStartedAt (stamped once the target is ready + the task is
  synthesized) so the requirePhase wait isn't charged against the per-kind
  execution budget. spec.timeoutSeconds is now execution-only.
- lint goconst: hoisted repeated test literals to package consts.
- Share peer-source building: new task.DiscoverPeerSources is the single impl;
  planner.discoverPeersTask now wraps it (removed its duplicate). The shared
  empty-label-skip also removes a latent planner edge that could write an empty
  persistent-peers for a not-yet-resolved label source.
- taskParamsForKind is now a thin dispatcher; all per-kind param constructors
  live in the task package (internal/task/seinodetask_params.go). Behavior-
  preserving for existing kinds.
- Moved per-kind timeout consts + the unsupported-kind sentinel
  (task.ErrUnsupportedKind) to their proper homes; trimmed verbose doc comments
  to load-bearing rationale.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Collaborator Author

Pushed 2a91ca9 addressing all review feedback:

  • Cursor Bugbot (HIGH, timeout includes target wait): execution budget now measured from a new status.task.executionStartedAt (stamped once the target is ready), so the requirePhase wait is no longer charged against the per-kind timeout. spec.timeoutSeconds is now execution-only (documented).
  • CI lint (goconst): repeated test literals hoisted to consts.
  • Your 4 comments: consts → top of file; ErrUnsupportedKind → task package; shared peer-source builder (planner de-duped); taskParamsForKind → thin dispatcher with per-kind constructors in the task package; verbose doc comments trimmed.

Two expert cross-review rounds + a regression pass signed off on the refactor. Note: spec.timeoutSeconds semantics changed from total→execution-only (additive executionStartedAt status field, back-compat).

Comment thread internal/planner/planner.go Outdated
… empty sources

- DRY: extract a shared podCycle helper (StatefulSet fetch + selector/multi-replica
  guards + owned-pod lookup/delete) used by both replace-pod and restart-pod.
  Behavior-preserving — replace-pod keeps its revision gate (runs before the shared
  guards) + synchronous no-readiness-wait completion; restart-pod keeps its
  captured-UID delete predicate + different-UID-Ready readiness wait.
- Fix Bugbot MEDIUM "init plan empty discover-peers": the shared DiscoverPeerSources
  skips an unresolved label source, so the planner's init path could submit a
  zero-source discover-peers that wipes persistent-peers. New controller-side
  discover-peers-init task re-derives sources from the LIVE node each reconcile:
  empty sources -> Failed (consumes one retry-budget unit, re-pends, re-evaluates),
  so it never writes zero sources and converges once status.resolvedPeers populates,
  bounded by discoverPeersMaxRetries. A build-time gate can't converge (executor
  freezes params; buildRunningPlan has no discover-peers). Nodetask path unchanged
  (synthesis-time fail-fast).

Note: discover-peers-init uses setFailed-in-Execute(returns nil) deliberately — only
the Status-failed path is governed by the retry budget; returning an error would be
unbounded. Don't "fix" it to return err.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Collaborator Author

Pushed 1aaaefa:

  • Bugbot MEDIUM (init plan empty discover-peers): fixed — new controller-side discover-peers-init task re-derives peer sources from the live node each reconcile; empty/unresolved sources defer via the retry budget (bounded) instead of submitting a zero-source write that could wipe persistent-peers. Converges once status.resolvedPeers populates.
  • Bugbot HIGH (timeout includes target wait): already resolved in 2a91ca9 (execution budget from executionStartedAt); the re-anchored inline was the original-commit comment.
  • DRY (maintainer follow-up): extracted a shared podCycle helper so restart-pod reuses replace-pod's pod-deletion plumbing — both behavior-preserving.

Regression cross-review signed off on both. CI + Bugbot re-running.

Comment thread internal/controller/nodetask/controller.go
When no owned pod exists at synthesis (STS cache-lag / brief gap), targetPodUID
returned empty and restart-pod then deleted nothing and completed on any owned
Ready pod — a silent no-op (seid never restarts, DiscoverPeers peers never apply).

- Synthesis: for RestartPod, if targetPodUID is empty, defer (requeue every
  targetWaitInterval, TargetReady=False/PodNotObserved) until an owned pod is
  observed and its UID captured, bounded by requirePhaseTimeout (anchored on
  status.startedAt) → Failed/RestartTargetUnresolved if it never appears. A real
  UID proceeds as before (delete it; complete only on a different-UID Ready pod).
- Backstop: restart-pod Status with an empty UID now stays Running (never
  completes); a stray empty UID surfaces via the execution timeout as Failed,
  never success.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Collaborator Author

Ready for human review — automated bar met (f477e34)

All CI checks green and Cursor Bugbot passes with no new findings:

Cursor Bugbot   pass     lint   pass     test   pass     test-integration   pass     verify-generated   pass

Review trail (each fix had a regression cross-review before push):

Finding Resolution
Bugbot HIGH — timeout includes requirePhase wait executionStartedAt (exec-only budget) — 2a91ca9
CI lint goconst hoisted test literals to consts — 2a91ca9
Maintainer ×4 (consts→top, sentinel→task pkg, shared peer builder, thin dispatcher) 2a91ca9
Bugbot MEDIUM — init zero-source discover-peers discover-peers-init defers via retry budget — 1aaaefa
Maintainer — DRY restart/replace shared podCycle helper — 1aaaefa
Bugbot MEDIUM — empty-UID silent skip synthesis-defer + task backstop — f477e34

The three inline comments still shown are all from earlier commits and resolved above (the HIGH @387 is the original first-commit comment re-anchored to the diff; Bugbot's per-commit check is green).

Note: spec.timeoutSeconds is now execution-only (additive status.task.executionStartedAt, back-compat). Deferred follow-up (separate): promote peer-drift to an automatic controller trigger once proven.

Over to you for the final look, @bdchatham.

Comment thread internal/controller/nodetask/controller.go Outdated
Comment thread internal/controller/nodetask/controller.go Outdated
…nd; restart fail-fast

Maintainer review: the shared peer builder forced one interface across the
planner (frozen plan params) and the nodetask (live, fail-fast) paths, which
spawned the discover-peers-init task. Unwind it.

- Delete discover-peers-init (task + test + registry). Revert the planner's init
  discover-peers to main — byte-identical (own inline builder, frozen
  sidecar discover-peers). The nodetask DiscoverPeers builds its own peer sources
  with its own empty/label-unresolved fail-fast (never submits a zero-source
  write that would wipe persistent-peers). Slight, deliberate duplication; no
  shared interface to conflict.
- ErrUnsupportedKind is now a typed error carrying the kind (errors.As), not a
  sentinel.
- RestartPod: no owned pod at synthesis -> fail fast with reason
  RestartTargetPodNotFound (drop the awaitRestartTarget bounded-defer). targetPodUID
  reads uncached (APIReader) to avoid cache-lag false negatives. Task-side backstop
  kept (empty UID never completes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Collaborator Author

Pushed 39bd88c — unwound the over-built complexity per your call + addressed the two new comments:

  • Unwind: deleted discover-peers-init; reverted the planner's init discover-peers to main (byte-identical — git diff origin/main -- internal/planner/planner.go is empty); the nodetask DiscoverPeers now builds its own peer sources with its own empty/label-unresolved fail-fast (slight, deliberate duplication — no shared interface to conflict). The nodetask still can't submit a zero-source write that would wipe persistent-peers.
  • :149ErrUnsupportedKind is now a typed error carrying the kind (errors.As).
  • :159 — RestartPod no-pod-at-synthesis → fail fast (RestartTargetPodNotFound), dropping the awaitRestartTarget defer; uncached read avoids cache-lag false negatives; task-side backstop kept.

Net: the init path is untouched vs main, and the two new tasks are self-contained. Expert regression cross-review signed off. CI + Bugbot re-running.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 39bd88c. Configure here.

Comment thread internal/controller/nodetask/controller.go
Comment thread internal/controller/nodetask/controller.go Outdated
…asks

The execution-timeout deadline is gated on status.task.executionStartedAt (added
this PR). A SeiNodeTask already Running when the controller upgrades to this build
has a populated status.task but a nil anchor, so no timeout applied → it could run
unbounded after the upgrade. Lazy-stamp executionStartedAt=now (idempotent, only
when nil) before the deadline check, giving in-flight legacy tasks a bounded fresh
budget from the upgrade moment. Not anchored to startedAt (would put the deadline
in the past for a long-running task and spuriously fail it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Collaborator Author

Pushed ec2b591. Status of the Bugbot findings on the latest commit:

  • In-flight tasks lose timeout (HIGH)fixed: driveTask lazy-stamps executionStartedAt=now when a populated status.task has a nil anchor (a task synthesized by a pre-this-change controller), so in-flight legacy tasks get a bounded fresh budget on upgrade instead of running unbounded. Idempotent; not anchored to startedAt (would back-date the deadline).
  • Task timeout includes target wait (HIGH)stale/re-anchored, resolved in 2a91ca9: the budget measures from executionStartedAt (stamped after the target is ready), not startedAt.
  • Empty pod UID skips restart (MEDIUM)resolved in 39bd88c: no owned pod at synthesis → fail fast (RestartTargetPodNotFound); task-side backstop keeps an empty UID in Running (never completes). The defer machinery is gone.

CI + Bugbot re-running on ec2b591.

Comment thread internal/controller/nodetask/controller.go Outdated
Comment thread internal/controller/nodetask/controller.go Outdated
…ram errors

Address two review comments on the SeiNodeTask DiscoverPeers/RestartPod work:

1. Param-build errors now carry their own condition reason. A ReasonedError
   interface + FailureReason(err) helper replace the synthesis-site
   `if errors.As(...)` that special-cased only UnsupportedKind. ErrUnsupportedKind
   reports UnsupportedKind; a new paramsError reports ParamsBuildFailed; both
   call sites (synthesis + driveTask) derive the reason via FailureReason with no
   conditional.

2. RestartPod takes the pod UID as a required spec field (spec.restartPod.podUID,
   CEL-enforced non-empty) instead of the controller discovering it. The
   synthesis path copies it verbatim into status.task.restartedPodUID; the
   targetPodUID fetch, the APIReader/reader() plumbing, the GetAPIReader() wiring
   in main.go, and the RestartTargetPodNotFound fail-fast are removed.

The caller-owns-UID contract is documented on the PodUID field and the
RestartPod kind doc: fetch the UID as late as possible; a stale UID completes as
a no-op (the controller does not re-validate against the live pod).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham merged commit 204c1c0 into main Jun 6, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant