
Core, Spark: scan based remove dangling delete action #15727

Open

kinolaev wants to merge 1 commit into apache:main from kinolaev:enhanced-remove-dangling-delete-files

Conversation

@kinolaev

The current implementation of RemoveDanglingDeletesSparkAction keeps equality delete files whose lower/upper bounds do not overlap the lower/upper bounds of data files with a lower sequence number. Such delete files are dangling: they are always skipped during scans, and as they accumulate, scans slow down.
This PR collects all delete files returned by a full scan into a set and then removes every delete file that is not in that set. This approach guarantees that all dangling delete files are removed, regardless of their type.
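The set-based approach described above can be sketched as follows. This is an illustrative outline in plain Java collections, not Iceberg's actual API: class and method names (`liveDeleteFiles`, `danglingDeleteFiles`, delete files represented as path strings) are assumptions for the sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the scan-based idea: a full scan reports every delete file that
// still applies to at least one live data file; any delete file known to the
// table metadata that the scan never returned is dangling and can be removed.
public class DanglingDeleteSketch {

    // Union of the delete files referenced by each scan task (i.e. still in use).
    static Set<String> liveDeleteFiles(List<List<String>> deletesPerScanTask) {
        Set<String> live = new HashSet<>();
        deletesPerScanTask.forEach(live::addAll);
        return live;
    }

    // Every delete file in table metadata that the scan did not reference.
    static List<String> danglingDeleteFiles(List<String> allDeleteFiles, Set<String> live) {
        return allDeleteFiles.stream()
            .filter(path -> !live.contains(path))
            .collect(Collectors.toList());
    }
}
```

Because membership in the scan result is the only criterion, the sketch treats positional deletes, equality deletes, and DVs uniformly, which is the point of the PR.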

Signed-off-by: Sergei Nikolaev <kinolaev@gmail.com>
@kinolaev force-pushed the enhanced-remove-dangling-delete-files branch from 9db4e51 to 79de418 on March 23, 2026 00:41
"DV data sequence number (%s) must be greater than or equal to data file sequence number (%s)",
dv.dataSequenceNumber(),
seq);
if (dv != null && dv.dataSequenceNumber() < seq) {
Contributor


Why do we need to relax this check?

Author


We can think of a DV file that references a data file with a greater sequence number either 1) as a spec violation or 2) as a dangling delete file for a previously deleted data file with the same name.
In the first case, given that different engines can work with the same table, I would prefer that Spark ignore it instead of failing, especially because there is no way in Spark to fix the violation using only SQL without Java: the remove dangling deletes action can only be invoked as part of the rewrite_data_files procedure, which would fail on this check during the scan, before the action is even called. And if you accept this PR, there will be no way to fix the violation with Spark even using Java.
The second case, I agree, is very unlikely but still possible, and scans should ignore dangling delete files. I'm sorry if I've missed the part of the spec that makes it a spec violation.
That is why I've proposed relaxing the check. An alternative would be to find dangling equality deletes by copying the DeleteFileIndex.canContainEqDeletesForFile method into the action (or making the class and the method public). I can do that if it's a more appropriate solution.
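The relaxed check being discussed can be illustrated with a minimal sketch. This is not the PR's actual code: the class and method names (`DvCheckSketch`, `dvApplies`) and the standalone signature are assumptions; the point is only the behavioral change from throwing to skipping.

```java
// Sketch of relaxing the DV sequence-number check: instead of failing when a
// DV references a data file with a greater sequence number, the scan treats
// the DV as not applicable (dangling) and simply skips it, leaving it for the
// remove-dangling-deletes action to clean up later.
public class DvCheckSketch {

    // Returns true only when the DV can legally apply to the data file, i.e.
    // the DV's data sequence number is >= the data file's sequence number.
    // A strict variant would throw on the "<" case; this relaxed variant
    // just excludes the DV from the scan.
    static boolean dvApplies(Long dvDataSequenceNumber, long dataFileSeq) {
        if (dvDataSequenceNumber == null) {
            return false; // no DV attached to this data file
        }
        return dvDataSequenceNumber >= dataFileSeq;
    }
}
```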

