Parquet export for raw events (Phase D operability) #17

Merged
pbudzik merged 1 commit into main from feat/parquet-export
May 16, 2026
@pbudzik pbudzik commented May 16, 2026

Owner

Summary

Bridges the internal billing-grade format to standard Parquet for warehouse / BI / debug. The reviewer's framing was "internal format optimized for billing; external format Parquet for everything else" — this is the bridge.

What's in

New module src/export/parquet.rs:

  • export_raw_segments(state, output_path) — snapshots the manifest, reads every raw segment, builds an Arrow RecordBatch, writes a Parquet file with zstd compression.
  • Flat schema — no nested types, so any Parquet consumer can read it without struct/map support:

| Column | Type | Notes |
| --- | --- | --- |
| event_id | Utf8 | |
| kind | Utf8 | "Usage" / "Correction" / "Retraction" |
| correction_original_event_id | Utf8, nullable | flattened from correction_ref |
| correction_reason | Utf8, nullable | flattened from correction_ref |
| account_id | Utf8 | |
| subscription_id | Utf8, nullable | |
| product_id | Utf8 | |
| meter_id | Utf8 | |
| model_id | Utf8, nullable | |
| timestamp_ms | Int64 | |
| quantity | Decimal128(38, 0) | i128 fits exactly |
| unit | Utf8 | |
| source | Utf8 | |
| dimensions_canonical | Utf8 | serde_json of SmallDimensions (BTreeMap → canonical order) |
| ingested_at_ms | Int64 | |
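
To illustrate how a nested `correction_ref` can become the two nullable columns above, here is a minimal sketch. The `Event`, `CorrectionRef`, and `FlatCorrection` types and the `flatten_correction` helper are hypothetical stand-ins, since the PR does not show the real definitions; only the column names come from the schema.

```rust
// Hypothetical shapes: the real event types are not shown in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct CorrectionRef {
    original_event_id: String,
    reason: String,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct Event {
    event_id: String,
    kind: String, // "Usage" / "Correction" / "Retraction"
    correction_ref: Option<CorrectionRef>,
}

/// The two nullable columns derived from `correction_ref`.
#[derive(Debug, PartialEq)]
struct FlatCorrection {
    correction_original_event_id: Option<String>,
    correction_reason: Option<String>,
}

/// Splits the optional nested reference into two independent nullable values.
/// Both are `None` when the event carries no correction reference.
fn flatten_correction(event: &Event) -> FlatCorrection {
    match &event.correction_ref {
        Some(c) => FlatCorrection {
            correction_original_event_id: Some(c.original_event_id.clone()),
            correction_reason: Some(c.reason.clone()),
        },
        None => FlatCorrection {
            correction_original_event_id: None,
            correction_reason: None,
        },
    }
}

fn main() {
    let usage = Event {
        event_id: "e1".to_string(),
        kind: "Usage".to_string(),
        correction_ref: None,
    };
    let fix = Event {
        event_id: "e2".to_string(),
        kind: "Correction".to_string(),
        correction_ref: Some(CorrectionRef {
            original_event_id: "e1".to_string(),
            reason: "double count".to_string(),
        }),
    };
    for e in [&usage, &fix] {
        println!("{} ({}): {:?}", e.event_id, e.kind, flatten_correction(e));
    }
}
```

Keeping the two columns independent and nullable means a reader without struct support sees plain strings and nulls, which matches the flat-schema goal stated above.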

Deps

arrow = 55 + parquet = 55 (with zstd feature). Compile-time cost is ~10 s on first build, cached afterward.

Scope notes

This is the simplest possible export: it dumps every segment in the manifest to one Parquet file. There is no partitioning, no filtering, and no streaming of row groups. Per-day partitioning and streaming for arbitrarily large exports will land in a follow-up if and when they are needed.

The memtable is not included (only durably-flushed segments). For a full point-in-time export, the caller should drive a flush first.
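
The flush-then-export ordering can be sketched with a toy model. Everything here is hypothetical: `Store`, `flush`, and `export_rows` are in-memory stand-ins used only to show why unflushed events are missing from an export, not the crate's actual API.

```rust
/// Hypothetical in-memory stand-in for the store.
struct Store {
    memtable: Vec<u64>,      // event ids accepted but not yet flushed
    segments: Vec<Vec<u64>>, // durably flushed raw segments
}

impl Store {
    /// Moves the memtable contents into a new segment (no-op when empty).
    fn flush(&mut self) {
        if !self.memtable.is_empty() {
            let segment = std::mem::take(&mut self.memtable);
            self.segments.push(segment);
        }
    }

    /// Mirrors the export's scope: only flushed segments are visited.
    fn export_rows(&self) -> Vec<u64> {
        self.segments.iter().flatten().copied().collect()
    }
}

fn main() {
    let mut store = Store {
        memtable: vec![3],
        segments: vec![vec![1, 2]],
    };

    // Without a flush, event 3 is still in the memtable and is not exported.
    println!("before flush: {:?}", store.export_rows());

    // Driving a flush first gives a full point-in-time export.
    store.flush();
    println!("after flush:  {:?}", store.export_rows());
}
```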

Tests

tests/parquet_export.rs — 5 tests:

  • write_parquet_round_trips_event_fields — including i128::MAX/4, a Correction with -100, and correction_ref flattening
  • schema_matches_canonical_shape — column names, nullability, Decimal128(38, 0) for quantity
  • empty_input_writes_valid_zero_row_parquet
  • export_raw_segments_round_trips_through_manifest — full end-to-end
  • export_empty_manifest_produces_zero_row_file
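
A note on the `i128::MAX/4` test value: Decimal128 is limited to 38 decimal digits of precision, while `i128::MAX` itself has 39 digits, so quantities at or above 10^38 in magnitude exceed the declared precision. The sketch below only counts digits with the standard library; `decimal_digits` and `fits_precision_38` are hypothetical helpers, and how the exporter treats out-of-range values is not shown in this PR.

```rust
/// Number of base-10 digits in the magnitude of `v`.
fn decimal_digits(v: i128) -> u32 {
    let mut magnitude = v.unsigned_abs();
    let mut digits = 1;
    while magnitude >= 10 {
        magnitude /= 10;
        digits += 1;
    }
    digits
}

/// True when `v` is representable at precision 38, scale 0 (|v| < 10^38).
fn fits_precision_38(v: i128) -> bool {
    decimal_digits(v) <= 38
}

fn main() {
    for v in [-100_i128, i128::MAX / 4, i128::MAX] {
        println!(
            "{v}: {} digits, fits Decimal128(38, 0): {}",
            decimal_digits(v),
            fits_precision_38(v)
        );
    }
}
```

Under this check, `i128::MAX/4` and `-100` both fit, which is consistent with the values used in the round-trip test.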

Test plan

  • cargo build --all-targets clean with -D warnings
  • cargo test --all-targets — 96 tests pass (was 91; +5)
  • CI green

🤖 Generated with Claude Code

Bridges the internal billing-grade format to a standard analytical
format. The external reviewer's framing was "internal format optimized
for billing; external format Parquet for warehouse/BI/debug" — this is
the bridge.

New module src/export/parquet.rs:
  - `export_raw_segments(state, output_path)` — snapshots manifest,
    reads every raw segment, builds an Arrow RecordBatch, writes a
    Parquet file with zstd compression.
  - Flat schema (no nested types) so any Parquet consumer can read
    without struct/map support:
      event_id, kind, correction_original_event_id (nullable),
      correction_reason (nullable), account_id, subscription_id
      (nullable), product_id, meter_id, model_id (nullable),
      timestamp_ms i64, quantity Decimal128(38,0), unit, source,
      dimensions_canonical (JSON string), ingested_at_ms i64.
  - i128 quantity → Decimal128(38, 0) preserves exact value; serde_json
    of SmallDimensions is canonical because BTreeMap iterates by key.
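
The canonical-ordering point can be checked with the standard library alone. This is a sketch under assumptions: a `BTreeMap<String, String>` stands in for SmallDimensions, and the hand-written `canonical_json` replaces serde_json so the example has no dependencies.

```rust
use std::collections::BTreeMap;

/// Minimal JSON string escaping, enough for this illustration.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serializes the map as a compact JSON object. BTreeMap iterates in key
/// order, so the output does not depend on insertion order.
fn canonical_json(dims: &BTreeMap<String, String>) -> String {
    let fields: Vec<String> = dims
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
        .collect();
    format!("{{{}}}", fields.join(","))
}

fn main() {
    let mut a = BTreeMap::new();
    a.insert("region".to_string(), "us".to_string());
    a.insert("model".to_string(), "m1".to_string());

    let mut b = BTreeMap::new();
    b.insert("model".to_string(), "m1".to_string());
    b.insert("region".to_string(), "us".to_string());

    // Same contents, different insertion order, identical serialized form.
    assert_eq!(canonical_json(&a), canonical_json(&b));
    println!("{}", canonical_json(&a));
}
```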

Deps: arrow 55 + parquet 55 (zstd feature). Compile-time cost ~10s on
first build; cached afterward.

For MVP simplicity this is dump-all-segments-to-one-file; per-day
partitioning + streaming row groups for arbitrarily large exports
land in a follow-up if needed.

Tests (tests/parquet_export.rs, 5 tests):
  - write_parquet_round_trips_event_fields (incl. i128::MAX/4,
    Correction with negative quantity, correction_ref flattening)
  - schema_matches_canonical_shape (column names, nullability,
    Decimal128(38,0) for quantity)
  - empty_input_writes_valid_zero_row_parquet
  - export_raw_segments_round_trips_through_manifest
  - export_empty_manifest_produces_zero_row_file

Total tests: 96 (was 91; +5). Clean under RUSTFLAGS=-D warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pbudzik pbudzik merged commit b37150f into main May 16, 2026
1 check passed