feat: Add metadata-only replace API to Table for REPLACE snapshot operations#3131
Open
qzyu999 wants to merge 5 commits intoapache:mainfrom
Open
feat: Add metadata-only replace API to Table for REPLACE snapshot operations#3131qzyu999 wants to merge 5 commits intoapache:mainfrom
qzyu999 wants to merge 5 commits intoapache:mainfrom
Conversation
- Fixed positional argument type mismatch for `snapshot_properties` in [_RewriteFiles](iceberg-python/pyiceberg/table/update/snapshot.py) - Added missing `Catalog` type annotations to pytest fixtures in [test_replace.py](iceberg-python/tests/table/test_replace.py) - Added strict `is not None` assertions for `table.current_snapshot()` to satisfy mypy Optional checking - Auto-formatted tests with ruff
…ass enum validation (Operation.REPLACE is valid so we can no longer use it in test_invalid_operation)
kevinjqliu
reviewed
Mar 25, 2026
Contributor
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for the PR.
There are a couple of things we should check (and add as test cases):
Scanning Manifests
- Every file marked for replacement was found in at least one manifest — if any are missing, abort
- Matched entries are rewritten as DELETED with the new snapshot ID
- Unmatched entries are carried over as EXISTING
- Manifests with no affected files are reused unchanged
New Manifest
- All incoming replacement files are present with status ADDED
- The new snapshot ID is applied to every entry
Manifest List
- Includes the new ADDED manifest
- Includes all rewritten manifests with DELETED entries
- Includes all unchanged manifests
- Includes any pre-existing delete manifests, passed through as-is
Invariant Check
- Records added ≤ records removed
- If the difference is due to prior soft-deletes, confirm those delete files account for it
- Records added never exceed records removed — if they do, the operation is invalid
Snapshot
- Has a unique snapshot ID
- Parent points to the previous snapshot
- Sequence number is exactly previous + 1
- Operation type is set to "replace"
- Manifest list path is correct
- Summary counts (files and records) are accurate
| """ | ||
| return UpdateStatistics(transaction=self) | ||
|
|
||
| def replace( |
Contributor
There was a problem hiding this comment.
we should not expose replace as a public function as we cannot guarantee that the files_to_delete and files_to_add contains the same records.
I think we should start at _RewriteFiles
Comment on lines
+84
to
+93
| assert len(manifest_files) == 2 # One for ADDED, one for DELETED | ||
|
|
||
| # Check that sequence numbers were handled properly natively by verifying the manifest contents | ||
| entries = [] | ||
| for manifest in manifest_files: | ||
| for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False): | ||
| entries.append(entry) | ||
|
|
||
| # One entry for ADDED (new file), one for DELETED (old file) | ||
| assert len(entries) == 2 |
Contributor
There was a problem hiding this comment.
we need to test a bit more on the status of each file.
https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestRewriteFiles.java is a good reference
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3130
Rationale for this change
In a current PR (#3124, part of #1092), the proposed
replace()API accepts a PyArrow dataframe (pa.Table), forcing the table engine to physically serialize data during a metadata transaction commit. This couples execution with the catalog, diverges from Java Iceberg's nativeRewriteFilesbuilder behavior, and fails to register underOperation.REPLACE.This PR redesigns
table.replace()andtransaction.replace()to acceptIterable[DataFile]inputs. By externalizing physical data writing (e.g., compaction via Ray), the new explicit metadata-only_RewriteFilesSnapshotProducer can natively swap snapshot pointers in the manifests, perfectly inheriting ancestral sequence numbers forDELETEDentries to ensure time-travel equivalence.Are these changes tested?
Yes.
Fully exhaustive test coverage has been added to
tests/table/test_replace.py. The suite validates:len(table.history())).Operation.REPLACEtags.manifest.fetch_manifest_entry(discard_deleted=False)to natively assert thatstatus=DELETEDoverrides are accurately preserving avro sequence numbers.replace([], [])successfully short-circuits the commit loop without mutating history.Are there any user-facing changes?
Yes.
The method signature for
Table.replace()andTransaction.replace()has been updated from the original PR #3124.It no longer accepts a PyArrow DataFrame (
df: pa.Table). Instead, it now requests two arguments:files_to_delete: Iterable[DataFile]andfiles_to_add: Iterable[DataFile], following the convention seen in the Java implementation.(Please add the
changeloglabel)AI Disclosure
AI was used to help understand the code base and draft code changes. All code changes have been thoroughly reviewed, ensuring that the code changes are in line with a broader understanding of the codebase.
test_invalid_operation()intests/table/test_snapshots.pypreviously usedOperation.REPLACEas a value to test invalid operations, but with this changeOperation.REPLACEbecomes valid. In place I just put a dummy Operation._RewriteFilesinpyiceberg/table/update/snapshot.pyoverrides the_deleted_entriesand_existing_manifestsfunctions. I sought to test this thoroughly that it was done correctly. I am thinking it's possible to improve the test suite to make this more rigorous. I am open to suggestions on how that could be done.