Pure Lean 4 implementation of Apache Arrow + Apache Parquet — no C, no C++, no FFI.
The entire Arrow IPC (Schema / Message / File / Stream with Flatbuffers metadata) and Parquet (Thrift-compact metadata, Dremel shredding, all non-compressed encodings, statistics) codebase is implemented in Lean 4 and verified cross-implementation against pyarrow.
Status: spec-coverage phase complete. See
docs/spec-coverage.md.
- Pure Lean 4 reader + writer for the stable Apache Arrow format: primitives, all nested types, dictionary batches, BinaryView / Utf8View / ListView / Union (dense + sparse) / RunEndEncoded.
- Pure Lean 4 reader + writer for Apache Parquet: Thrift Compact Protocol, FileMetaData / ColumnIndex / OffsetIndex / BloomFilterHeader, all physical types, DATA_PAGE / DATA_PAGE_V2 / DICTIONARY_PAGE, and every non-compressed encoding (PLAIN, RLE, BIT_PACKED, Dictionary, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, DELTA_LENGTH_BYTE_ARRAY, BYTE_STREAM_SPLIT).
- Dremel shredding + reassembly for nested Parquet columns.
- A performance-competitive Arrow implementation. Spec coverage traded
against runtime speed; expect 10-1000× slowdown vs
arrow-cpp. - A compression layer. SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, LZ4_RAW all
return
Error.unsupportedper the pure-Lean charter (no native codec dependency). Compress / decompress outside the library. - A Parquet modular encryption implementation. AES is out of scope.
- An Arrow Flight / Flight SQL / Dataset / Substrait / ADBC layer.
- Lean 4.29.1 (see
lean-toolchain) - uv (for pyarrow-backed integration gates; install: https://docs.astral.sh/uv/)
- That's it. No system C libraries, no compression libs, no Nix flake.
# Build the library + every test binary.
lake build
# Run the pure-Lean IEEE-754 golden-vector test.
lake exe ieee754-test
# Run the full cross-impl integration harness (pyarrow-backed).
bash scripts/run_integration.sh --quick
# Emit the structured final report (PASS / SKIPPED counts per gate).
uv run scripts/final_report.pyThe final-report script is the authoritative spec-coverage gate. It enforces zero FAIL and enumerates SKIPPED items per the D-list.
From the spec-coverage runner (lake exe spec-coverage):
- 30 implemented acceptance criteria
- 0 partial
- 2 skipped (infeasible per pure-Lean charter — compression codecs + AES)
- 0 not-yet
From the final integration report (uv run scripts/final_report.py):
- Arrow IPC oracle: 10/10 PASS
- Arrow IPC cutover: 7/7 PASS
- Parquet cross-verify: 17/17 PASS (two-directional pyarrow roundtrip)
- Parquet codec-unsupported hard-gate: 3/3 PASS
- Parquet statistics: 13/13 PASS
- Parquet Dremel: 10/10 PASS
- Parquet nested / dictionary / encoding: all PASS
- IEEE-754 golden vectors: 20/20 PASS
- Flatbuffers POC / rich-schema / message / roundtrip / cross-impl: all PASS
- Overall: PASS — zero FAIL
| Topic | File |
|---|---|
| Spec coverage matrix | docs/spec-coverage.md |
| Per-type integration gate | docs/integration-type-matrix.md |
| Harness design + Go upgrade path | docs/integration-harness.md |
| Full acceptance spec | .omc/specs/deep-interview-pure-lean-arrow-parquet.md |
Apache-2.0 (derivative of Apache schemas — the original IDL is vendored
under LICENSE-APACHE with NOTICE).
Parts of this repository were created with assistance from AI-powered coding tools. Design choices, architectural decisions, and final validation were performed independently by the author.