Teardown automation #29

Merged
solsson merged 3 commits into main from teardown-automation
Jun 22, 2026
@solsson solsson commented Jun 22, 2026

No description provided.

Yolean k8s-qa and others added 3 commits June 22, 2026 10:55
A dev cluster left running after a task is paused or finished is pure
cost. A lifetime.maxRun budget makes the cluster expire on its own,
taking the cheapest environment-appropriate action.

Model:
- Trigger is an absolute TTL (maxRun, a Go duration), counted from
  when the cluster STARTS, re-anchored on every `y-cluster start`.
  An appliance disk may boot days after it was built, so the
  countdown must begin at boot, not at provision.
- The trigger lives where the cost lives. LOCAL (qemu): a host-side
  timer fires `y-cluster lifetime reap`, which runs onExpiry (stop by
  default; pause or teardown opt-in). The host is the cost, so a host
  timer is correct. CLOUD (GCP appliance): GCP-native
  --max-run-duration deletes the instance with no dependency on this
  host or the cluster staying up; the boot=no data disk survives.

Surface:
- config: lifetime{maxRun,onExpiry} on CommonConfig, validated;
  schemas regenerated.
- qemu state sidecar: additive lifetime/onExpiry/expiresAt (no
  stateVersion bump); armed at provision, re-anchored on start.
- pkg/lifetime: host timer (systemd-run --user, at fallback) and the
  GCP flag translation; reap re-checks the deadline and re-arms if
  not due, so extend is safe and a stale timer is harmless.
- cmd: `y-cluster lifetime` status/reap/extend/arm/disarm/gcp-flags;
  provision/start arm the host timer, stop/teardown disarm it.

qemu-only for the local provisioner today, matching the rest of the
lifecycle surface; docker/multipass keep the "not yet implemented"
shape. Idle-based expiry and an in-cluster reaper are deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

libguestfs (virt-customize / virt-sysprep / virt-tar-out / virt-format)
builds a supermin appliance from /boot/vmlinuz-$(uname -r). Ubuntu ships
those images mode 0600, so after every kernel upgrade a fresh root-only
image lands and the tools fail with the opaque "supermin exited with
error status 1". Downstream users were left rediscovering workarounds,
and the ones they found do not hold: a one-off `chmod` is lost on the
next upgrade, and `dpkg-statoverride` is pinned to one versioned path
(and easy to typo, e.g. vmlinux vs vmlinuz) so the next kernel arrives
0600 again.

Standardize on one durable remedy and surface it from both places the
failure can appear:

- pkg/provision/qemu/libguestfs.go: requireReadableHostKernel() checks
  the running kernel image is readable before the libguestfs call sites
  (prepare-export's virt-customize/virt-tar-out; the data-disk
  virt-format preflight) and, if not, returns an actionable error whose
  fix is a /etc/kernel/postinst.d hook that re-applies 0644 on every
  future kernel. The binary previously had NO check here.
- scripts/_check-host-kernel.sh: one sourced helper carrying the same
  message, replacing four divergent copy-pasted blocks (which recommended
  the fragile per-version dpkg-statoverride) across the appliance build /
  e2e scripts. Kept in sync with the Go message.

Detect-and-print only: y-cluster never modifies the host. The remedy is
a single copy-pasteable block the user runs once with sudo; the hook
then covers all future kernels.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Surfaced when PR #29 rebased this work onto main and CI ran on it for
the first time.

Lint (staticcheck):
- ST1005 (pkg/provision/qemu/libguestfs.go): the remediation error
  string ended with a period. Drop it; the multi-line body is unchanged
  (only a trailing newline/punctuation trips ST1005).
- SA4006 (pkg/lifetime/timer.go): Disarm assigned logger but never used
  it. Log the disarm at debug.

Test (prepare-export ordering):
- TestPrepareExport_NoSavedState/VMNotRunning failed on the CI runner
  because its kernel image is mode 0600. requireReadableHostKernel ran
  BEFORE the cheap correctness preconditions, so an unreadable kernel
  masked the actionable "run provision" / "start the cluster" errors.
  Move the kernel capability check to after loadState/IsRunning/disk and
  before the live phase mutates anything. requireReadableHostKernel is
  now a var so a host-independent regression test can force it to fail
  and assert the precondition error still wins (this gap was invisible
  on dev hosts whose running kernel happens to be readable).

scripts/_check-host-kernel.sh is kept though nothing on this main-based
branch sources it yet: the appliance workflow branch (which carries the
appliance-*.sh scripts that source it) will be rebased onto this.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@solsson solsson merged commit 6af8cc4 into main Jun 22, 2026
11 checks passed