## Problem

When a VM crashes or the parent `bbox` process is killed (SIGKILL, OOM, etc.), `WithCleanDataDir()` never runs. The COW-cloned rootfs at `~/.config/broodbox/vms/<name>/data/rootfs-work/` survives — potentially ~800MB per orphaned VM.
Brood-box already has stale cleanup for two similar cases:

- `infravm.CleanupStaleLogs()` removes old VM log directories using PID sentinel files
- `infraws.CleanupStaleSnapshots()` removes old workspace snapshots

But orphaned `rootfs-work/` dirs inside VM data directories are not covered.
## Proposal

Extend the existing stale cleanup to also handle orphaned VM data directories (which contain the `rootfs-work/` clone).

## Approach
go-microvm already persists VM state in `<dataDir>/state.json` with the runner PID and an `active` flag. The cleanup logic should:

1. On startup, scan `~/.config/broodbox/vms/*/data/state.json`
2. For each entry where `active: true`, check if the PID is still alive
3. If the PID is dead, the VM was orphaned — remove the entire data directory (including `rootfs-work/`)
go-microvm's `terminateStaleRunner()` already does steps 1-2 for killing orphaned processes. The data dir cleanup is the missing step 3 — `cleanDataDir` handles it for the current VM's data dir, but not for data dirs from other crashed VMs.
## Where this lives

This could be:

- A new `CleanupStaleVMData()` function in `internal/infra/vm/` alongside the existing cleanup helpers
- Called from the composition root (`cmd/bbox/main.go`) on startup, next to the existing `CleanupStaleLogs` and `CleanupStaleSnapshots` calls
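The startup wiring might look like the following. This is a hypothetical sketch of the composition root, not the actual `cmd/bbox/main.go`; `CleanupStaleVMData` is stubbed so the snippet compiles standalone.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// Stub so the snippet is self-contained; the real function would live in
// internal/infra/vm next to CleanupStaleLogs and CleanupStaleSnapshots.
func CleanupStaleVMData(vmsRoot string) error { return nil }

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		log.Fatal(err)
	}
	vmsRoot := filepath.Join(home, ".config", "broodbox", "vms")

	// Stale cleanup is best-effort: a failure here should not block startup,
	// so log and continue rather than abort.
	if err := CleanupStaleVMData(vmsRoot); err != nil {
		log.Printf("stale VM data cleanup: %v", err)
	}
}
```

Running the cleanup before any new VM starts keeps it from racing against a VM that is still writing its own `state.json`.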
## Notes

- The PID-liveness check is already battle-tested in the existing stale cleanup code
- This is safe for concurrent VMs: each VM has its own named data directory, and we only clean dirs whose PID is confirmed dead
- This pairs with the image cache GC work — cleaning both the rootfs cache and orphaned rootfs clones covers both sides of the disk waste