Teardown automation #29
Merged
A dev cluster left running after a task is paused or finished is pure
cost. A lifetime.maxRun budget makes the cluster expire on its own,
taking the cheapest environment-appropriate action.
Model:
- Trigger is an absolute TTL (maxRun, a Go duration), counted from
when the cluster STARTS, re-anchored on every `y-cluster start`.
An appliance disk may boot days after it was built, so the
countdown must begin at boot, not at provision.
- The trigger lives where the cost lives. LOCAL (qemu): a host-side
timer fires `y-cluster lifetime reap`, which runs onExpiry (stop by
default; pause or teardown opt-in). The host is the cost, so a host
timer is correct. CLOUD (GCP appliance): GCP-native
--max-run-duration deletes the instance with no dependency on this
host or the cluster staying up; the boot=no data disk survives.
Surface:
- config: lifetime{maxRun,onExpiry} on CommonConfig, validated;
schemas regenerated.
- qemu state sidecar: additive lifetime/onExpiry/expiresAt (no
stateVersion bump); armed at provision, re-anchored on start.
- pkg/lifetime: host timer (systemd-run --user, with `at` as fallback) and the
GCP flag translation; reap re-checks the deadline and re-arms if
not due, so extend is safe and a stale timer is harmless.
- cmd: `y-cluster lifetime` status/reap/extend/arm/disarm/gcp-flags;
provision/start arm the host timer, stop/teardown disarm it.
qemu-only for the local provisioner today, matching the rest of the
lifecycle surface; docker/multipass keep the "not yet implemented"
shape. Idle-based expiry and an in-cluster reaper are deferred.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
libguestfs (virt-customize / virt-sysprep / virt-tar-out / virt-format)
builds a supermin appliance from /boot/vmlinuz-$(uname -r). Ubuntu ships
those images mode 0600, so after every kernel upgrade a fresh root-only
image lands and the tools fail with the opaque "supermin exited with
error status 1". Downstream users were left rediscovering workarounds,
and the ones they found do not hold: a one-off `chmod` is lost on the
next upgrade, and `dpkg-statoverride` is pinned to one versioned path
(and easy to typo, e.g. vmlinux vs vmlinuz) so the next kernel arrives
0600 again.

Standardize on one durable remedy and surface it from both places the
failure can appear:
- pkg/provision/qemu/libguestfs.go: requireReadableHostKernel() checks
the running kernel image is readable before the libguestfs call sites
(prepare-export's virt-customize/virt-tar-out; the data-disk
virt-format preflight) and, if not, returns an actionable error whose
fix is a /etc/kernel/postinst.d hook that re-applies 0644 on every
future kernel. The binary previously had NO check here.
- scripts/_check-host-kernel.sh: one sourced helper carrying the same
message, replacing four divergent copy-pasted blocks (which recommended
the fragile per-version dpkg-statoverride) across the appliance build /
e2e scripts. Kept in sync with the Go message.

Detect-and-print only: y-cluster never modifies the host. The remedy is
a single copy-pasteable block the user runs once with sudo; the hook
then covers all future kernels.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surfaced when PR #29 rebased this work onto main and CI ran on it for
the first time.

Lint (staticcheck):
- ST1005 (pkg/provision/qemu/libguestfs.go): the remediation error
string ended with a period. Drop it; the multi-line body is unchanged
(only a trailing newline/punctuation trips ST1005).
- SA4006 (pkg/lifetime/timer.go): Disarm assigned logger but never used
it. Log the disarm at debug.

Test (prepare-export ordering):
- TestPrepareExport_NoSavedState/VMNotRunning failed on the CI runner
because its kernel image is mode 0600. requireReadableHostKernel ran
BEFORE the cheap correctness preconditions, so an unreadable kernel
masked the actionable "run provision" / "start the cluster" errors.
Move the kernel capability check to after loadState/IsRunning/disk and
before the live phase mutates anything. requireReadableHostKernel is
now a var so a host-independent regression test can force it to fail
and assert the precondition error still wins (this gap was invisible on
dev hosts whose running kernel happens to be readable).

scripts/_check-host-kernel.sh is kept though nothing on this main-based
branch sources it yet: the appliance workflow branch (which carries the
appliance-*.sh scripts that source it) will be rebased onto this.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No description provided.