Rollup → raw segment provenance (closes spec §19.10) #15
Merged
Each rollup segment now records the list of raw segment IDs whose
events contributed to it. The explain endpoint surfaces this as a
`rollup_inputs: { rollup_id → [raw_id, ...] }` map so an operator
can drill from an invoice line back to the source raw segments.
This closes the deferred half of spec §19.10 ("invoice snapshots must
reference both a watermark and a source segment set"). The watermark
half was already covered by `Manifest.watermarks.hourly_rollup_ms`;
this is the segment-set half.
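The drill-down described above can be sketched as follows. This is a minimal illustration, not code from this PR: the helper name `raw_segments_for` is hypothetical, and it assumes the caller has already deserialized the `rollup_inputs` object into a `BTreeMap<String, Vec<String>>` and knows which rollup IDs back the invoice line in question.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper: given the rollup IDs behind an invoice line and the
/// `rollup_inputs` map from the explain response, return the distinct raw
/// segment IDs in sorted order. Rollup IDs absent from the map contribute
/// nothing.
fn raw_segments_for(
    rollup_ids: &[&str],
    rollup_inputs: &BTreeMap<String, Vec<String>>,
) -> Vec<String> {
    let mut raw: BTreeSet<String> = BTreeSet::new();
    for id in rollup_ids {
        if let Some(inputs) = rollup_inputs.get(*id) {
            // A raw segment may feed several rollups; the set deduplicates it.
            raw.extend(inputs.iter().cloned());
        }
    }
    raw.into_iter().collect()
}
```

Collecting through a `BTreeSet` means a raw segment shared by several rollups is reported once, and the result order does not depend on the order of the inputs.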
Changes:
- SegmentMeta gains `input_segment_ids: Vec<String>` with
#[serde(default)] so existing manifests deserialize unchanged.
Empty for raw segments (they're ground truth) and compacted
segments (their provenance lives in `compacted_replacements`).
- RollupWorker.tick tracks per-(hour, bucket) inputs in a
BTreeSet<String> during the scan, then stamps the sorted list on
the rollup segment meta. Sorted via BTreeSet so the on-disk
encoding is deterministic.
- write_rollup_segment / build_rollup_segment_meta thread the
Vec<String> through.
- handle_explain reads each overlapping rollup segment's
input_segment_ids and adds them to the response as a
`rollup_inputs` object. Existing `rollup_segments` list of IDs
is kept for callers that don't need provenance.
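The per-(hour, bucket) tracking in the list above can be sketched roughly like this. It is an illustration under assumptions, not the actual `RollupWorker.tick` code: the `Event` struct, its field names, `HOUR_MS`, and `collect_inputs` are all hypothetical stand-ins for whatever the scan really iterates over.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical event shape: only the fields this sketch needs.
struct Event {
    segment_id: String, // raw segment the event was read from
    ts_ms: u64,         // event timestamp in milliseconds
    bucket: String,     // aggregation bucket key
}

const HOUR_MS: u64 = 3_600_000;

/// Groups contributing raw segment IDs by (hour start, bucket).
/// Events outside [window_start_ms, window_end_ms) are skipped, so a segment
/// whose events all fall outside the window never appears as an input.
fn collect_inputs(
    events: &[Event],
    window_start_ms: u64,
    window_end_ms: u64,
) -> BTreeMap<(u64, String), Vec<String>> {
    let mut inputs: BTreeMap<(u64, String), BTreeSet<String>> = BTreeMap::new();
    for ev in events {
        if ev.ts_ms < window_start_ms || ev.ts_ms >= window_end_ms {
            continue;
        }
        let hour_start = ev.ts_ms - ev.ts_ms % HOUR_MS;
        inputs
            .entry((hour_start, ev.bucket.clone()))
            .or_default()
            .insert(ev.segment_id.clone());
    }
    // BTreeSet iterates in ascending order, so each Vec comes out sorted
    // and deduplicated without a separate sort step.
    inputs
        .into_iter()
        .map(|(key, ids)| (key, ids.into_iter().collect::<Vec<String>>()))
        .collect()
}
```

Because membership is decided per event rather than per segment, a segment is listed only if at least one of its events actually lands in the rollup, which is the behaviour the second test in this PR checks.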
Tests (tests/rollup_provenance.rs, 3 tests):
- rollup_segment_records_contributing_raw_segments — two raw
segments → rollup lists both as inputs
- rollup_input_only_lists_segments_that_actually_contributed —
out-of-hour segment is NOT listed
- explain_endpoint_surfaces_rollup_inputs_map — full end-to-end
through the HTTP layer
Total tests: 91 (was 88; +3). Clean under RUSTFLAGS=-D warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Each rollup segment now records the list of raw segment IDs whose events contributed to it. The `explain` endpoint surfaces this so operators can drill from an invoice line back to the source raw segments. Closes the deferred half of spec §19.10.
What changed
- `SegmentMeta` gains `input_segment_ids: Vec<String>` with `#[serde(default)]` so existing manifests deserialize unchanged. Empty for raw segments (ground truth) and compacted segments (their lineage already lives in `Manifest.compacted_replacements`). Populated for rollup segments.
- `RollupWorker.tick` now tracks per-`(hour, bucket)` inputs in a `BTreeSet<String>` during the scan and stamps the sorted list on the output rollup segment's meta. Sorting via `BTreeSet` keeps the encoding deterministic.
- `handle_explain` adds a `rollup_inputs: { rollup_id → [raw_id, ...] }` map to the response. The existing flat `rollup_segments` list of IDs is kept for callers that don't need provenance.
Example explain response
    {
      "account_id": "acc_x",
      "watermark_ms": 1701900000000,
      "lines": [...],
      "rollup_segments": ["rollup_abc..."],
      "rollup_inputs": {
        "rollup_abc...": ["raw_def...", "raw_ghi..."]
      },
      "raw_segments": [...],
      "corrections": [...]
    }
Spec coverage
Tests
3 new tests in `tests/rollup_provenance.rs`:
- `rollup_segment_records_contributing_raw_segments`: two raw segments → rollup lists both
- `rollup_input_only_lists_segments_that_actually_contributed`: out-of-hour segment is not listed
- `explain_endpoint_surfaces_rollup_inputs_map`: full end-to-end through the HTTP layer
Test plan
- `cargo build --all-targets` clean with `-D warnings`
- `cargo test --all-targets`: 91 tests pass (was 88; +3)
🤖 Generated with Claude Code