Teardown automation by solsson · Pull Request #29 · Yolean/y-cluster

solsson · 2026-06-22T08:58:20Z

No description provided.

A dev cluster left running after a task is paused or finished is pure cost. A lifetime.maxRun budget makes the cluster expire on its own, taking the cheapest environment-appropriate action.

Model:
- Trigger is an absolute TTL (maxRun, a Go duration), counted from when the cluster STARTS, re-anchored on every `y-cluster start`. An appliance disk may boot days after it was built, so the countdown must begin at boot, not at provision.
- The trigger lives where the cost lives.
  LOCAL (qemu): a host-side timer fires `y-cluster lifetime reap`, which runs onExpiry (stop by default; pause or teardown opt-in). The host is the cost, so a host timer is correct.
  CLOUD (GCP appliance): GCP-native --max-run-duration deletes the instance with no dependency on this host or the cluster staying up; the boot=no data disk survives.

Surface:
- config: lifetime{maxRun,onExpiry} on CommonConfig, validated; schemas regenerated.
- qemu state sidecar: additive lifetime/onExpiry/expiresAt (no stateVersion bump); armed at provision, re-anchored on start.
- pkg/lifetime: host timer (systemd-run --user, at fallback) and the GCP flag translation; reap re-checks the deadline and re-arms if not due, so extend is safe and a stale timer is harmless.
- cmd: `y-cluster lifetime` status/reap/extend/arm/disarm/gcp-flags; provision/start arm the host timer, stop/teardown disarm it.

qemu-only for the local provisioner today, matching the rest of the lifecycle surface; docker/multipass keep the "not yet implemented" shape. Idle-based expiry and an in-cluster reaper are deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
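The re-anchor and reap rules in this commit message can be sketched in a few lines of Go. This is a minimal illustration, not the code in pkg/lifetime: the names `Lifetime`, `State`, `ParseLifetime`, `Anchor` and `Reap` are hypothetical, and rejecting a non-positive maxRun is an assumption about what "validated" means here.

```go
package main

import (
	"fmt"
	"time"
)

// Lifetime mirrors the lifetime{maxRun,onExpiry} config (hypothetical type).
type Lifetime struct {
	MaxRun   time.Duration
	OnExpiry string
}

// State holds the sidecar fields a reap needs (hypothetical type).
type State struct {
	OnExpiry  string
	ExpiresAt time.Time
}

// ParseLifetime validates the config values. maxRun is a Go duration string;
// an empty onExpiry defaults to "stop", and pause/teardown are opt-in.
func ParseLifetime(maxRun, onExpiry string) (Lifetime, error) {
	d, err := time.ParseDuration(maxRun)
	if err != nil {
		return Lifetime{}, fmt.Errorf("lifetime.maxRun: %w", err)
	}
	if d <= 0 { // assumption: a zero or negative budget is rejected
		return Lifetime{}, fmt.Errorf("lifetime.maxRun must be positive, got %q", maxRun)
	}
	switch onExpiry {
	case "":
		onExpiry = "stop"
	case "stop", "pause", "teardown":
	default:
		return Lifetime{}, fmt.Errorf("lifetime.onExpiry: unknown action %q", onExpiry)
	}
	return Lifetime{MaxRun: d, OnExpiry: onExpiry}, nil
}

// Anchor computes the deadline from a start time. Calling it again on every
// start is what re-anchors the countdown at boot rather than at provision.
func (l Lifetime) Anchor(startedAt time.Time) State {
	return State{OnExpiry: l.OnExpiry, ExpiresAt: startedAt.Add(l.MaxRun)}
}

// Reap re-checks the deadline. If it is not due yet it reports how long to
// re-arm the timer for, so a stale or early timer does no harm.
func (s State) Reap(now time.Time) (due bool, rearmIn time.Duration) {
	if now.Before(s.ExpiresAt) {
		return false, s.ExpiresAt.Sub(now)
	}
	return true, 0
}

func main() {
	lt, err := ParseLifetime("8h", "")
	if err != nil {
		fmt.Println("invalid lifetime:", err)
		return
	}
	provisioned := time.Date(2026, 6, 22, 9, 0, 0, 0, time.UTC)
	st := lt.Anchor(provisioned)
	fmt.Println("armed at provision, expires:", st.ExpiresAt)

	// The disk boots two days later; start re-anchors the deadline.
	started := provisioned.Add(48 * time.Hour)
	st = lt.Anchor(started)

	// A timer left over from the old deadline fires one hour after start.
	due, rearmIn := st.Reap(started.Add(time.Hour))
	fmt.Println("due:", due, "re-arm in:", rearmIn)

	// The re-armed timer fires at the real deadline.
	due, _ = st.Reap(started.Add(8 * time.Hour))
	fmt.Println("due:", due, "action:", st.OnExpiry)
}
```

Because the deadline is stored and the timer only prompts a re-check, extending is just a matter of moving `ExpiresAt`; whichever timer fires first will re-arm instead of acting.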

libguestfs (virt-customize / virt-sysprep / virt-tar-out / virt-format) builds a supermin appliance from /boot/vmlinuz-$(uname -r). Ubuntu ships those images mode 0600, so after every kernel upgrade a fresh root-only image lands and the tools fail with the opaque "supermin exited with error status 1". Downstream users were left rediscovering workarounds, and the ones they found do not hold: a one-off `chmod` is lost on the next upgrade, and `dpkg-statoverride` is pinned to one versioned path (and easy to typo, e.g. vmlinux vs vmlinuz) so the next kernel arrives 0600 again.

Standardize on one durable remedy and surface it from both places the failure can appear:
- pkg/provision/qemu/libguestfs.go: requireReadableHostKernel() checks the running kernel image is readable before the libguestfs call sites (prepare-export's virt-customize/virt-tar-out; the data-disk virt-format preflight) and, if not, returns an actionable error whose fix is a /etc/kernel/postinst.d hook that re-applies 0644 on every future kernel. The binary previously had NO check here.
- scripts/_check-host-kernel.sh: one sourced helper carrying the same message, replacing four divergent copy-pasted blocks (which recommended the fragile per-version dpkg-statoverride) across the appliance build / e2e scripts. Kept in sync with the Go message.

Detect-and-print only: y-cluster never modifies the host. The remedy is a single copy-pasteable block the user runs once with sudo; the hook then covers all future kernels.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
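A detect-and-print check of this kind could look roughly like the sketch below. It is not the code in pkg/provision/qemu/libguestfs.go: the helper names (`kernelImagePath`, `runningKernelRelease`, `checkReadable`) and the error wording are assumptions, reading the release from /proc/sys/kernel/osrelease is one Linux-only way to get the `uname -r` value, and the real message carries a copy-pasteable remedy block that is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"strings"
)

// kernelImagePath returns the image supermin reads for a kernel release.
func kernelImagePath(release string) string {
	return "/boot/vmlinuz-" + release
}

// runningKernelRelease returns the equivalent of `uname -r` (Linux only).
func runningKernelRelease() (string, error) {
	b, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// checkReadable only opens the file for reading; it never changes the host.
// A permission failure is turned into an error that names the durable fix.
func checkReadable(path string) error {
	f, err := os.Open(path)
	if err == nil {
		return f.Close()
	}
	if errors.Is(err, fs.ErrPermission) {
		// Hypothetical wording; no trailing punctuation, as staticcheck ST1005 asks.
		return fmt.Errorf("host kernel image %s is not readable by this user, so libguestfs cannot build its supermin appliance; "+
			"install a /etc/kernel/postinst.d hook that re-applies mode 0644 to each new kernel image", path)
	}
	return fmt.Errorf("check host kernel image %s: %w", path, err)
}

func main() {
	release, err := runningKernelRelease()
	if err != nil {
		fmt.Println("cannot determine running kernel:", err)
		return
	}
	if err := checkReadable(kernelImagePath(release)); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("host kernel image is readable")
}
```

Opening the file rather than inspecting its mode bits answers the question libguestfs actually cares about: whether the current user can read the image.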

Surfaced when PR #29 rebased this work onto main and CI ran on it for the first time.

Lint (staticcheck):
- ST1005 (pkg/provision/qemu/libguestfs.go): the remediation error string ended with a period. Drop it; the multi-line body is unchanged (only a trailing newline/punctuation trips ST1005).
- SA4006 (pkg/lifetime/timer.go): Disarm assigned logger but never used it. Log the disarm at debug.

Test (prepare-export ordering):
- TestPrepareExport_NoSavedState/VMNotRunning failed on the CI runner because its kernel image is mode 0600. requireReadableHostKernel ran BEFORE the cheap correctness preconditions, so an unreadable kernel masked the actionable "run provision" / "start the cluster" errors. Move the kernel capability check to after loadState/IsRunning/disk and before the live phase mutates anything. requireReadableHostKernel is now a var so a host-independent regression test can force it to fail and assert the precondition error still wins (this gap was invisible on dev hosts whose running kernel happens to be readable).

scripts/_check-host-kernel.sh is kept though nothing on this main-based branch sources it yet: the appliance workflow branch (which carries the appliance-*.sh scripts that source it) will be rebased onto this.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
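The ordering fix and the test seam can be illustrated with a small Go sketch. Only the name requireReadableHostKernel comes from the commit message; the `cluster` type, the sentinel errors and their wording, and the shape of `prepareExport` are hypothetical stand-ins for the real loadState/IsRunning/disk preconditions.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the real precondition messages.
var (
	errNoSavedState = errors.New("no saved state: run provision first")
	errNotRunning   = errors.New("cluster is not running: start the cluster first")
	errKernel       = errors.New("host kernel image is not readable")
)

// requireReadableHostKernel is a package-level var rather than a func so a
// test can replace it and stay independent of the host's real kernel mode.
var requireReadableHostKernel = func() error { return nil }

// cluster is a stand-in for what the state and runtime checks report.
type cluster struct {
	hasState bool
	running  bool
}

// prepareExport runs the cheap correctness preconditions first, then the
// host capability check, and only then would enter the mutating live phase.
func prepareExport(c cluster) error {
	if !c.hasState {
		return errNoSavedState
	}
	if !c.running {
		return errNotRunning
	}
	if err := requireReadableHostKernel(); err != nil {
		return fmt.Errorf("prepare-export: %w", err)
	}
	// The live phase would start here; nothing has been mutated before this point.
	return nil
}

func main() {
	// Simulate a CI runner whose kernel image is mode 0600.
	requireReadableHostKernel = func() error { return errKernel }

	fmt.Println(prepareExport(cluster{}))                              // precondition error wins
	fmt.Println(prepareExport(cluster{hasState: true}))                // precondition error wins
	fmt.Println(prepareExport(cluster{hasState: true, running: true})) // kernel error surfaces last
}
```

With the check forced to fail, the first two calls still report the actionable precondition errors, which is the regression the host-independent test is meant to pin down.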

Yolean k8s-qa and others added 3 commits June 22, 2026 10:55

solsson merged commit 6af8cc4 into main Jun 22, 2026
11 checks passed

solsson merged 3 commits into main from teardown-automation
