Skip to content

mocusez/arrow-lean

arrow-lean

Pure Lean 4 implementation of Apache Arrow + Apache Parquet — no C, no C++, no FFI.

The entire Arrow IPC (Schema / Message / File / Stream with Flatbuffers metadata) and Parquet (Thrift-compact metadata, Dremel shredding, all non-compressed encodings, statistics) codebase is implemented in Lean 4 and verified cross-implementation against pyarrow.

Status: spec-coverage phase complete. See docs/spec-coverage.md.

What this is

  • Pure Lean 4 reader + writer for the stable Apache Arrow format: primitives, all nested types, dictionary batches, BinaryView / Utf8View / ListView / Union (dense + sparse) / RunEndEncoded.
  • Pure Lean 4 reader + writer for Apache Parquet: Thrift Compact Protocol, FileMetaData / ColumnIndex / OffsetIndex / BloomFilterHeader, all physical types, DATA_PAGE / DATA_PAGE_V2 / DICTIONARY_PAGE, and every non-compressed encoding (PLAIN, RLE, BIT_PACKED, Dictionary, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, DELTA_LENGTH_BYTE_ARRAY, BYTE_STREAM_SPLIT).
  • Dremel shredding + reassembly for nested Parquet columns.

What this isn't

  • A performance-competitive Arrow implementation. Spec coverage traded against runtime speed; expect 10-1000× slowdown vs arrow-cpp.
  • A compression layer. SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, LZ4_RAW all return Error.unsupported per the pure-Lean charter (no native codec dependency). Compress / decompress outside the library.
  • A Parquet modular encryption implementation. AES is out of scope.
  • An Arrow Flight / Flight SQL / Dataset / Substrait / ADBC layer.

Requirements

  • Lean 4.29.1 (see lean-toolchain)
  • uv (for pyarrow-backed integration gates; install: https://docs.astral.sh/uv/)
  • That's it. No system C libraries, no compression libs, no Nix flake.

Build and test

# Build the library + every test binary.
lake build

# Run the pure-Lean IEEE-754 golden-vector test.
lake exe ieee754-test

# Run the full cross-impl integration harness (pyarrow-backed).
bash scripts/run_integration.sh --quick

# Emit the structured final report (PASS / SKIPPED counts per gate).
uv run scripts/final_report.py

The final-report script is the authoritative spec-coverage gate. It enforces zero FAIL and enumerates SKIPPED items per the D-list.

Current status

From the spec-coverage runner (lake exe spec-coverage):

  • 30 implemented acceptance criteria
  • 0 partial
  • 2 skipped (infeasible per pure-Lean charter — compression codecs + AES)
  • 0 not-yet

From the final integration report (uv run scripts/final_report.py):

  • Arrow IPC oracle: 10/10 PASS
  • Arrow IPC cutover: 7/7 PASS
  • Parquet cross-verify: 17/17 PASS (two-directional pyarrow roundtrip)
  • Parquet codec-unsupported hard-gate: 3/3 PASS
  • Parquet statistics: 13/13 PASS
  • Parquet Dremel: 10/10 PASS
  • Parquet nested / dictionary / encoding: all PASS
  • IEEE-754 golden vectors: 20/20 PASS
  • Flatbuffers POC / rich-schema / message / roundtrip / cross-impl: all PASS
  • Overall: PASS — zero FAIL

Where to look next

Topic File
Spec coverage matrix docs/spec-coverage.md
Per-type integration gate docs/integration-type-matrix.md
Harness design + Go upgrade path docs/integration-harness.md
Full acceptance spec .omc/specs/deep-interview-pure-lean-arrow-parquet.md

License

Apache-2.0 (derivative of Apache schemas — the original IDL is vendored under LICENSE-APACHE with NOTICE).

AI Assistance Disclosure

Parts of this repository were created with assistance from AI-powered coding tools. Design choices, architectural decisions, and final validation were performed independently by the author.

About

No description, website, or topics provided.

Resources

License

Apache-2.0, Apache-2.0 licenses found

Licenses found

Apache-2.0
LICENSE
Apache-2.0
LICENSE-APACHE

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors