Skip to content

feat: Add metadata-only replace API to Table for REPLACE snapshot operations#3131

Open
qzyu999 wants to merge 5 commits intoapache:mainfrom
qzyu999:feature/core-rewrite-api
Open

feat: Add metadata-only replace API to Table for REPLACE snapshot operations#3131
qzyu999 wants to merge 5 commits intoapache:mainfrom
qzyu999:feature/core-rewrite-api

Conversation

@qzyu999
Copy link

@qzyu999 qzyu999 commented Mar 9, 2026

Closes #3130

Rationale for this change

In a current PR (#3124, part of #1092), the proposed replace() API accepts a PyArrow dataframe (pa.Table), forcing the table engine to physically serialize data during a metadata transaction commit. This couples execution with the catalog, diverges from Java Iceberg's native RewriteFiles builder behavior, and fails to register under Operation.REPLACE.

This PR redesigns table.replace() and transaction.replace() to accept Iterable[DataFile] inputs. By externalizing physical data writing (e.g., compaction via Ray), the new explicit metadata-only _RewriteFiles SnapshotProducer can natively swap snapshot pointers in the manifests, perfectly inheriting ancestral sequence numbers for DELETED entries to ensure time-travel equivalence.

Are these changes tested?

Yes.

Fully exhaustive test coverage has been added to tests/table/test_replace.py. The suite validates:

  1. Context manager executions tracking valid history growth (len(table.history())).
  2. Snapshot summary bindings asserting strict Operation.REPLACE tags.
  3. Accurate evaluation of delta-metrics (added/deleted files and records tracking perfectly).
  4. Low-level serialization: Bypassed high-level discard filters on manifest.fetch_manifest_entry(discard_deleted=False) to natively assert that status=DELETED overrides are accurately preserving avro sequence numbers.
  5. Idempotent edge cases where replace([], []) successfully short-circuits the commit loop without mutating history.

Are there any user-facing changes?

Yes.

The method signature for Table.replace() and Transaction.replace() has been updated from the original PR #3124.
It no longer accepts a PyArrow DataFrame (df: pa.Table). Instead, it now requests two arguments:
files_to_delete: Iterable[DataFile] and files_to_add: Iterable[DataFile], following the convention seen in the Java implementation.

(Please add the changelog label)

AI Disclosure

AI was used to help understand the code base and draft code changes. All code changes have been thoroughly reviewed, ensuring that the code changes are in line with a broader understanding of the codebase.

  • Worth deeper review after AI-assistance:
  • The test_invalid_operation() in tests/table/test_snapshots.py previously used Operation.REPLACE as a value to test invalid operations, but with this change Operation.REPLACE becomes valid. In place I just put a dummy Operation.
  • The _RewriteFiles in pyiceberg/table/update/snapshot.py overrides the _deleted_entries and _existing_manifests functions. I sought to test this thoroughly that it was done correctly. I am thinking it's possible to improve the test suite to make this more rigorous. I am open to suggestions on how that could be done.

qzyu999 added 4 commits March 9, 2026 15:40
- Fixed positional argument type mismatch for `snapshot_properties` in [_RewriteFiles](iceberg-python/pyiceberg/table/update/snapshot.py)
- Added missing `Catalog` type annotations to pytest fixtures in [test_replace.py](iceberg-python/tests/table/test_replace.py)
- Added strict `is not None` assertions for `table.current_snapshot()` to satisfy mypy Optional checking
- Auto-formatted tests with ruff
…ass enum validation (Operation.REPLACE is valid so we can no longer use it in test_invalid_operation)
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

There are a couple of things we should check (and add as test cases):

Scanning Manifests

  • Every file marked for replacement was found in at least one manifest — if any are missing, abort
  • Matched entries are rewritten as DELETED with the new snapshot ID
  • Unmatched entries are carried over as EXISTING
  • Manifests with no affected files are reused unchanged

New Manifest

  • All incoming replacement files are present with status ADDED
  • The new snapshot ID is applied to every entry

Manifest List

  • Includes the new ADDED manifest
  • Includes all rewritten manifests with DELETED entries
  • Includes all unchanged manifests
  • Includes any pre-existing delete manifests, passed through as-is

Invariant Check

  • Records added ≤ records removed
  • If the difference is due to prior soft-deletes, confirm those delete files account for it
  • Records added never exceed records removed — if they do, the operation is invalid

Snapshot

  • Has a unique snapshot ID
  • Parent points to the previous snapshot
  • Sequence number is exactly previous + 1
  • Operation type is set to "replace"
  • Manifest list path is correct
  • Summary counts (files and records) are accurate

"""
return UpdateStatistics(transaction=self)

def replace(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not expose replace as a public function as we cannot guarantee that the files_to_delete and files_to_add contains the same records.

I think we should start at _RewriteFiles

Comment on lines +84 to +93
assert len(manifest_files) == 2 # One for ADDED, one for DELETED

# Check that sequence numbers were handled properly natively by verifying the manifest contents
entries = []
for manifest in manifest_files:
for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False):
entries.append(entry)

# One entry for ADDED (new file), one for DELETED (old file)
assert len(entries) == 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to test a bit more on the status of each file.

https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestRewriteFiles.java is a good reference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Add metadata-only replace API to Table for REPLACE snapshot operations

2 participants