File Format API for PyIceberg #3100

@nssalian

Feature Request / Improvement

Problem

The write path in pyiceberg/io/pyarrow.py is hardcoded to Parquet. The write.format.default table property exists but is never read. Adding a new format (ORC, Vortex, Lance) requires modifying the monolithic write_file() function. The read path already dispatches across multiple formats; the write path should too.

Proposal

Introduce a File Format API aligned with Java Iceberg's File Format API (design doc).

New module pyiceberg/io/fileformat.py:

  • FileFormatWriter (ABC)
  • FileFormatModel (ABC)
  • FormatRegistry
  • DataFileStatistics (currently in pyarrow.py, but I think it would be good to consolidate it here for metrics)
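
To make the shape of the new module concrete, here is a minimal sketch of how the two ABCs and the registry could fit together. Only the class names come from this proposal. Every method name and signature (`write`, `close`, `create_writer`, `register`, `get`) is an assumption for illustration, the `FileFormat` enum is a local stand-in for PyIceberg's existing enum so the snippet runs on its own, and `DataFileStatistics` is cut down to a single hypothetical field.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Iterable, List


class FileFormat(str, Enum):
    # Local stand-in for PyIceberg's FileFormat enum (assumption).
    PARQUET = "PARQUET"
    ORC = "ORC"


@dataclass
class DataFileStatistics:
    # Hypothetical, trimmed-down shape; the existing class carries more metrics.
    record_count: int


class FileFormatWriter(ABC):
    """Writes records to one data file and reports statistics on close."""

    @abstractmethod
    def write(self, rows: Iterable[Dict[str, Any]]) -> None: ...

    @abstractmethod
    def close(self) -> DataFileStatistics: ...


class FileFormatModel(ABC):
    """Describes one file format and creates writers for it."""

    @property
    @abstractmethod
    def format(self) -> FileFormat: ...

    @abstractmethod
    def create_writer(self, path: str, properties: Dict[str, str]) -> FileFormatWriter: ...


class FormatRegistry:
    """Maps a FileFormat to its model; keyed by FileFormat only."""

    def __init__(self) -> None:
        self._models: Dict[FileFormat, FileFormatModel] = {}

    def register(self, model: FileFormatModel) -> None:
        self._models[model.format] = model

    def get(self, file_format: FileFormat) -> FileFormatModel:
        try:
            return self._models[file_format]
        except KeyError:
            raise ValueError(f"No format model registered for {file_format.value}") from None

    def formats(self) -> List[FileFormat]:
        return list(self._models)
```

Keying the registry by `FileFormat` alone keeps lookup trivial. A format that is not registered fails with a clear error instead of silently falling back to Parquet.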

Changes to pyiceberg/io/pyarrow.py:

  • ParquetFormatWriter / ParquetFormatModel, extracted from the existing write_parquet() logic (currently inside write_file())
  • write_file() refactored to read write.format.default, look up the format model, and dispatch.
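
The dispatch step could look roughly like the sketch below. It is a simplified illustration, not the real `write_file()` signature: `dispatch_write`, `register_writer`, `resolve_writer`, and the `(path, rows) -> record count` writer shape are all hypothetical names. The only pieces taken from Iceberg itself are the `write.format.default` property and its documented default of `parquet`.

```python
from typing import Callable, Dict, List, Mapping

WRITE_FORMAT_DEFAULT = "write.format.default"
DEFAULT_FORMAT = "parquet"  # Iceberg's documented default for this property

# Hypothetical writer shape for the sketch: (path, rows) -> number of records written.
WriterFn = Callable[[str, List[dict]], int]

_WRITERS: Dict[str, WriterFn] = {}


def register_writer(name: str, writer: WriterFn) -> None:
    """Register a writer under a case-insensitive format name."""
    _WRITERS[name.lower()] = writer


def resolve_writer(table_properties: Mapping[str, str]) -> WriterFn:
    """Read write.format.default and look up the matching writer."""
    name = table_properties.get(WRITE_FORMAT_DEFAULT, DEFAULT_FORMAT).lower()
    if name not in _WRITERS:
        raise ValueError(f"No writer registered for {WRITE_FORMAT_DEFAULT}={name!r}")
    return _WRITERS[name]


def dispatch_write(path: str, rows: List[dict], table_properties: Mapping[str, str]) -> int:
    """Stand-in for the refactored write_file(): resolve the format, then delegate."""
    return resolve_writer(table_properties)(path, rows)
```

With this shape, adding a format means one `register_writer` call and no change to the dispatch function.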

TCK in tests/io/test_file_format_tck.py:

  • pytest-parametrized round-trip, statistics, type-coverage, and null-handling tests for every registered format.
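
As a rough sketch of the parametrization idea, the tests below run the same checks once per registered format. `JsonLinesFormat` is a hypothetical in-memory stand-in so the example needs no Parquet dependency, and `REGISTERED_FORMATS` is an assumed placeholder for whatever the registry would expose. Only the round-trip and null-handling cases are shown.

```python
import json
from typing import Any, Dict, List

import pytest


class JsonLinesFormat:
    """Hypothetical in-memory format used only to exercise the test shape."""

    name = "jsonl"

    def write(self, rows: List[Dict[str, Any]]) -> bytes:
        return "\n".join(json.dumps(row) for row in rows).encode("utf-8")

    def read(self, data: bytes) -> List[Dict[str, Any]]:
        return [json.loads(line) for line in data.decode("utf-8").splitlines()]


# In a real TCK this list would be built from the format registry (assumption).
REGISTERED_FORMATS = [JsonLinesFormat()]

ROWS = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]


@pytest.mark.parametrize("fmt", REGISTERED_FORMATS, ids=lambda f: f.name)
def test_round_trip(fmt):
    # Whatever is written must be read back unchanged.
    assert fmt.read(fmt.write(ROWS)) == ROWS


@pytest.mark.parametrize("fmt", REGISTERED_FORMATS, ids=lambda f: f.name)
def test_nulls_preserved(fmt):
    # Nulls must survive the round trip as None, not as a sentinel value.
    rows = fmt.read(fmt.write(ROWS))
    assert rows[1]["name"] is None
```

Because the format list drives the parametrization, a newly registered format is covered by every test without touching the test file.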

Phased rollout:

  • ABCs and registry first, then Parquet extraction with TCK tests, and finally write_file() dispatch.

Java ↔ Python Mapping

| Java | Python |
| --- | --- |
| FormatModel<D, S> | FileFormatModel (ABC, no type params) |
| FileAppender<D> / ModelWriteBuilder | FileFormatWriter (ABC) |
| FormatModelRegistry | FormatRegistry (keyed by FileFormat only) |
| Metrics | DataFileStatistics (existing) |
| TCK | test_file_format_tck.py |

Scope

This proposal covers the abstraction layer and the Parquet extraction only. No new format writers are included; ORC write support (#20) and any future formats (Avro, etc.) would be follow-ups once this lands.

References