Parquet export for raw events (Phase D operability) #17
Merged
Bridges the internal billing-grade format to a standard analytical
format. The external reviewer's framing was "internal format optimized
for billing; external format Parquet for warehouse/BI/debug" — this is
the bridge.
New module src/export/parquet.rs:
- `export_raw_segments(state, output_path)` — snapshots manifest,
reads every raw segment, builds an Arrow RecordBatch, writes a
Parquet file with zstd compression.
- Flat schema (no nested types) so any Parquet consumer can read
without struct/map support:
event_id, kind, correction_original_event_id (nullable),
correction_reason (nullable), account_id, subscription_id
(nullable), product_id, meter_id, model_id (nullable),
timestamp_ms i64, quantity Decimal128(38,0), unit, source,
dimensions_canonical (JSON string), ingested_at_ms i64.
- i128 quantity → Decimal128(38, 0) preserves exact value; serde_json
of SmallDimensions is canonical because BTreeMap iterates by key.
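A minimal sketch of the flattening and canonical-dimensions idea described above, using only the standard library. The `Event`, `CorrectionRef`, and `FlatRow` types, the `flatten` and `canonical_json` helpers, and the assumption that dimensions are string-to-string pairs are all hypothetical stand-ins; the real types live in the crate and are not shown here, and the real export uses serde_json rather than the hand-rolled serializer below.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the crate's internal types (not from the PR).
struct CorrectionRef {
    original_event_id: String,
    reason: String,
}

struct Event {
    event_id: String,
    correction_ref: Option<CorrectionRef>,
    // Assumption: dimensions behave like a string-to-string BTreeMap.
    dimensions: BTreeMap<String, String>,
}

// One flat row: the nested correction_ref becomes two nullable columns.
struct FlatRow {
    event_id: String,
    correction_original_event_id: Option<String>,
    correction_reason: Option<String>,
    dimensions_canonical: String,
}

// Minimal JSON string escaping: quotes, backslashes, control characters.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// BTreeMap iterates in ascending key order, so the output does not depend
// on the order in which the dimensions were inserted.
fn canonical_json(dims: &BTreeMap<String, String>) -> String {
    let body: Vec<String> = dims
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
        .collect();
    format!("{{{}}}", body.join(","))
}

// Flattens one event; a missing correction_ref yields two None (null) cells.
fn flatten(e: &Event) -> FlatRow {
    FlatRow {
        event_id: e.event_id.clone(),
        correction_original_event_id: e
            .correction_ref
            .as_ref()
            .map(|c| c.original_event_id.clone()),
        correction_reason: e.correction_ref.as_ref().map(|c| c.reason.clone()),
        dimensions_canonical: canonical_json(&e.dimensions),
    }
}
```

Because the key order is fixed by the map, two events with the same dimensions produce byte-identical `dimensions_canonical` strings, which keeps the column usable for equality joins and grouping in a warehouse.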
Deps: arrow 55 + parquet 55 (zstd feature). Compile-time cost ~10s on
first build; cached afterward.
For MVP simplicity this is dump-all-segments-to-one-file; per-day
partitioning + streaming row groups for arbitrarily large exports
land in a follow-up if needed.
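For reference, one possible shape for the per-day partitioning mentioned above. This is a sketch only, not part of this change: `day_bucket` and `partition_by_day` are hypothetical names, and it assumes `timestamp_ms` is milliseconds since the Unix epoch with day boundaries taken in UTC.

```rust
use std::collections::BTreeMap;

const MS_PER_DAY: i64 = 86_400_000;

// Days since the Unix epoch (UTC). div_euclid rounds toward negative
// infinity, so pre-1970 timestamps land in the correct (negative) day.
fn day_bucket(timestamp_ms: i64) -> i64 {
    timestamp_ms.div_euclid(MS_PER_DAY)
}

// Groups row indices by day bucket. Each bucket could then be written to
// its own file or row group by a follow-up exporter.
fn partition_by_day(timestamps_ms: &[i64]) -> BTreeMap<i64, Vec<usize>> {
    let mut parts: BTreeMap<i64, Vec<usize>> = BTreeMap::new();
    for (i, &ts) in timestamps_ms.iter().enumerate() {
        parts.entry(day_bucket(ts)).or_default().push(i);
    }
    parts
}
```

Returning indices rather than copied rows keeps the sketch independent of the event type; a streaming variant would instead flush a bucket once the input moves past that day.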
Tests (tests/parquet_export.rs, 5 tests):
- write_parquet_round_trips_event_fields (incl. i128::MAX/4,
Correction with negative quantity, correction_ref flattening)
- schema_matches_canonical_shape (column names, nullability,
Decimal128(38,0) for quantity)
- empty_input_writes_valid_zero_row_parquet
- export_raw_segments_round_trips_through_manifest
- export_empty_manifest_produces_zero_row_file
Total tests: 96 (was 91; +5). Clean under RUSTFLAGS=-D warnings.
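A note on the quantity bounds exercised by the tests: a decimal with precision 38 and scale 0 covers integers up to 10^38 - 1 in magnitude, which is slightly narrower than the full i128 range (i128::MAX is roughly 1.7 × 10^38). The i128::MAX/4 value used in the round-trip test falls inside that range. The guard below is a hypothetical helper, not something this PR adds; it sketches how a caller could check a quantity against the declared precision before export.

```rust
// Largest magnitude representable at precision 38, scale 0: 10^38 - 1.
const DECIMAL38_MAX: i128 = 10_i128.pow(38) - 1;

// Hypothetical guard: true if `q` fits in 38 decimal digits.
fn fits_decimal128_38(q: i128) -> bool {
    (-DECIMAL38_MAX..=DECIMAL38_MAX).contains(&q)
}
```

Whether out-of-range values are rejected, truncated, or written as-is depends on the Arrow and Parquet writer settings, which are not shown here.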
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Bridges the internal billing-grade format to standard Parquet for warehouse / BI / debug. The reviewer's framing was "internal format optimized for billing; external format Parquet for everything else" — this is the bridge.
What's in
New module src/export/parquet.rs:
- `export_raw_segments(state, output_path)`: snapshots the manifest, reads every raw segment, builds an Arrow RecordBatch, writes a Parquet file with zstd compression.
- Columns:
  event_id, kind, correction_original_event_id (from correction_ref),
  correction_reason (from correction_ref), account_id, subscription_id,
  product_id, meter_id, model_id, timestamp_ms, quantity, unit, source,
  dimensions_canonical (serde_json of SmallDimensions; BTreeMap → canonical order),
  ingested_at_ms.
Deps
arrow = 55 + parquet = 55 (with zstd feature). Compile-time cost is ~10 s on first build, cached afterward.
Scope notes
This is the simplest possible export — dump every segment in the manifest to one Parquet file. No partitioning, no filtering, no streaming row groups. Per-day partitioning + streaming for arbitrarily large exports land in a follow-up if/when needed.
The memtable is not included (only durably-flushed segments). For a full point-in-time export, the caller should drive a flush first.
Tests
tests/parquet_export.rs, 5 tests:
- write_parquet_round_trips_event_fields: including i128::MAX/4, a Correction with quantity -100, and correction_ref flattening
- schema_matches_canonical_shape: column names, nullability, Decimal128(38, 0) for quantity
- empty_input_writes_valid_zero_row_parquet
- export_raw_segments_round_trips_through_manifest: full end-to-end
- export_empty_manifest_produces_zero_row_file
Test plan
- cargo build --all-targets clean with -D warnings
- cargo test --all-targets: 96 tests pass (was 91; +5)
🤖 Generated with Claude Code