DST: corrupt-manifest-and-restart op + fail-closed coverage #25
Merged
Adds an Op variant that overwrites the highest-numbered manifest generation file with garbage and restarts. Recovery falls back to generation N-1 per the existing fail-closed semantics; the reference model resyncs from the SUT post-restart so subsequent ops in the sequence are still modeled correctly. The op is a no-op when only one generation exists (no fallback target). A separate `corrupt_only_manifest_generation_is_fatal` unit test covers the case the state machine deliberately skips: corruption of the only generation file must fail closed (Recovery returns Err).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
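A minimal sketch of the corruption step, for illustration only. It assumes manifests live directly under a `db_root` directory and are named `manifest-NNNNNN.json`; the helper names (`generation_of`, `list_generations`, `corrupt_latest_manifest`) are hypothetical and not taken from this PR, and the no-op guard here counts files on disk rather than reading a `current_generation` field.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Parses the generation number out of a name like `manifest-000042.json`.
/// Returns None for any file that does not match that pattern.
fn generation_of(name: &str) -> Option<u64> {
    name.strip_prefix("manifest-")?
        .strip_suffix(".json")?
        .parse()
        .ok()
}

/// Lists (generation, path) pairs found under `db_root`, sorted ascending.
fn list_generations(db_root: &Path) -> io::Result<Vec<(u64, PathBuf)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(db_root)? {
        let entry = entry?;
        if let Some(generation) = entry.file_name().to_str().and_then(generation_of) {
            found.push((generation, entry.path()));
        }
    }
    found.sort();
    Ok(found)
}

/// Overwrites the highest-numbered manifest with garbage bytes.
/// Returns Ok(None) without touching anything when fewer than two
/// generations exist, because there would be no fallback target.
fn corrupt_latest_manifest(db_root: &Path) -> io::Result<Option<u64>> {
    let generations = list_generations(db_root)?;
    if generations.len() < 2 {
        return Ok(None); // no-op: only one (or zero) generations on disk
    }
    let (generation, path) = &generations[generations.len() - 1];
    fs::write(path, b"\xFFnot json\x00")?;
    Ok(Some(*generation))
}
```

In a state machine, the op would call something like `corrupt_latest_manifest(&db_root)` and then restart the system under test; the restart itself is outside the scope of this sketch.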
Summary
- Adds `Op::CorruptLatestManifestAndRestart` to the DST state machine. It overwrites the highest-numbered `manifest-NNNNNN.json` with garbage, then restarts; recovery falls back to generation N-1 per the existing fail-closed semantics. The reference model resyncs from the SUT post-restart (re-derives `acked`, per-account sums, and closed periods from the actual on-disk state), so subsequent ops in the sequence remain correctly modeled.
- Adds a `corrupt_only_manifest_generation_is_fatal` unit test for the case the state machine intentionally skips: when there's no older generation to fall back to, Recovery must return `Err` rather than silently start with an empty DB.

Why this matters
Manifest corruption is the production failure mode billing operators most worry about: a partial write, disk bitrot, or fsck-style intervention that leaves the on-disk state inconsistent. Phase A already wired up generation rollback; this PR checks, under random op sequences, that the rollback keeps the database internally consistent (raw == rollup post-recovery, no panics, no ghost events) and that the truly unrecoverable case fails loudly rather than silently.
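The fail-closed rule described above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the recovery code from this PR: `Recovered`, `looks_valid`, and `recover` are hypothetical names, the validity check is a stand-in for real JSON parsing, and the sketch only tries the newest generation and the one before it, since the PR text mentions fallback to N-1 and does not say whether recovery walks further back.

```rust
use std::collections::BTreeMap;

/// Outcome of the sketch recovery.
#[derive(Debug, PartialEq)]
enum Recovered {
    /// No manifest files at all: start with a fresh, empty database.
    Fresh,
    /// The manifest of this generation was loaded successfully.
    Generation(u64),
}

/// Stand-in validity check (hypothetical). A real implementation would
/// parse the manifest and verify its contents.
fn looks_valid(bytes: &[u8]) -> bool {
    bytes.first() == Some(&b'{') && bytes.last() == Some(&b'}')
}

/// Maps generation number -> manifest bytes and picks what to load.
fn recover(manifests: &BTreeMap<u64, Vec<u8>>) -> Result<Recovered, String> {
    if manifests.is_empty() {
        return Ok(Recovered::Fresh);
    }
    // Newest first, at most two candidates: generation N, then N-1.
    for (generation, bytes) in manifests.iter().rev().take(2) {
        if looks_valid(bytes) {
            return Ok(Recovered::Generation(*generation));
        }
    }
    // Manifests exist but none is usable: fail closed instead of
    // silently starting with an empty database.
    Err("no usable manifest generation; refusing to start empty".to_string())
}
```

The distinction between the empty map (`Fresh`) and a map holding only corrupt entries (`Err`) is the point of the sketch: the second case must not be treated like the first.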
Notes
- The op is a no-op when `current_generation < 2` (no fallback target), so a corruption-restart never produces an unrecoverable state mid-sequence.
- `drop(harness)` would have deleted `db_root` before Recovery ran, and the test would pass for the wrong reason (Recovery sees no manifest → fresh DB).

Test plan
- `cargo test --test dst`: 2 tests pass (state machine + fail-closed)
- `cargo test`: full suite passes
- `CorruptLatestManifestAndRestart` interleavings (op weight = 1, so most 5–30-op sequences exercise it)

🤖 Generated with Claude Code
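The `drop(harness)` hazard mentioned in the Notes can be illustrated with a small sketch. It assumes a harness that owns its scratch directory and removes it on drop; the `Harness` type and `manifest_count` helper below are hypothetical stand-ins, not the types used in this repository.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical harness that owns a scratch directory and deletes it on drop.
struct Harness {
    db_root: PathBuf,
}

impl Harness {
    fn new(name: &str) -> std::io::Result<Self> {
        let db_root = std::env::temp_dir()
            .join(format!("dst-sketch-{}-{}", name, std::process::id()));
        fs::create_dir_all(&db_root)?;
        Ok(Harness { db_root })
    }
}

impl Drop for Harness {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored in this sketch.
        let _ = fs::remove_dir_all(&self.db_root);
    }
}

/// Counts `manifest-*` files under `db_root`. A missing directory counts
/// as zero, which is exactly how a deleted root can look like a fresh DB.
fn manifest_count(db_root: &Path) -> usize {
    match fs::read_dir(db_root) {
        Ok(entries) => entries
            .filter_map(|entry| entry.ok())
            .filter(|entry| entry.file_name().to_string_lossy().starts_with("manifest-"))
            .count(),
        Err(_) => 0,
    }
}
```

If the harness were dropped before the recovery check, the corrupt manifest would be gone along with the directory, and a check that expects `Err` could instead observe "no manifest", which is the fresh-DB path. Keeping the harness alive until after the assertion avoids that false pass.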