Core, Spark: scan based remove dangling delete action#15727
kinolaev wants to merge 1 commit into apache:main
Signed-off-by: Sergei Nikolaev <kinolaev@gmail.com>
9db4e51 to 79de418
    "DV data sequence number (%s) must be greater than or equal to data file sequence number (%s)",
    dv.dataSequenceNumber(),
    seq);
    if (dv != null && dv.dataSequenceNumber() < seq) {
Why do we need to relax this check?
We can treat a DV file that references a data file with a greater sequence number either 1) as a spec violation or 2) as a dangling delete file for a previously deleted data file with the same name.
In the first case, given that different engines can work with the same table, I would prefer that Spark ignore it instead of failing, especially because there is no way in Spark to fix the violation using only SQL, without Java: the remove dangling delete action can only be called as part of the rewrite_data_files procedure, which will fail on this check during the scan, before the action is called. And if you accept this PR, there will be no way to fix the violation with Spark even using Java.
The second case is, I agree, very unlikely but still possible, and scans should ignore dangling delete files. I'm sorry if I've missed the part of the spec that makes it a spec violation.
That is why I've proposed relaxing the check. An alternative would be to find dangling equality deletes by copying the DeleteFileIndex.canContainEqDeletesForFile method into the action (or by making the class and the method public). I can do that if it's a more appropriate solution.
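To make the difference between the two behaviours concrete, here is a minimal sketch in plain Java. It does not use the Iceberg API: the `Dv` class, `strictLookup`, and `relaxedLookup` are hypothetical names, and returning `null` for a stale DV is an assumption about what "ignore" would mean, since the body of the new `if` is not shown in the diff above.

```java
// Hypothetical stand-in for a deletion vector entry; not the Iceberg DeleteFile type.
final class DvSequenceCheckSketch {
  static final class Dv {
    final String referencedDataFile;
    final long dataSequenceNumber;

    Dv(String referencedDataFile, long dataSequenceNumber) {
      this.referencedDataFile = referencedDataFile;
      this.dataSequenceNumber = dataSequenceNumber;
    }
  }

  // Strict behaviour: a DV older than the data file it references aborts the scan.
  static Dv strictLookup(Dv dv, long dataFileSeq) {
    if (dv != null && dv.dataSequenceNumber < dataFileSeq) {
      throw new IllegalStateException(
          String.format(
              "DV data sequence number (%s) must be greater than or equal to data file sequence number (%s)",
              dv.dataSequenceNumber, dataFileSeq));
    }
    return dv;
  }

  // Relaxed behaviour (assumed): a DV older than the data file is treated as
  // dangling and skipped, so the scan can continue.
  static Dv relaxedLookup(Dv dv, long dataFileSeq) {
    if (dv != null && dv.dataSequenceNumber < dataFileSeq) {
      return null;
    }
    return dv;
  }
}
```

With the relaxed variant, a DV written at sequence number 3 for a data file at sequence number 5 is dropped from the scan result, while the strict variant throws before any cleanup action can run.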
The current implementation of RemoveDanglingDeletesSparkAction keeps equality delete files even if their lower/upper bounds don't overlap with the lower/upper bounds of any data file with a lower sequence number.
These delete files are dangling because they are always skipped during scans. As they accumulate, scans can slow down.
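The condition described above can be sketched as follows. This is a simplified illustration, not the logic of `DeleteFileIndex`: it assumes a single `long` column with known bounds, ignores partitions and missing statistics, and `FileStats`, `canApply`, and `isDangling` are hypothetical names.

```java
import java.util.List;

final class EqDeleteBoundsSketch {
  // Hypothetical per-file metadata: data sequence number plus bounds of one long column.
  static final class FileStats {
    final long sequenceNumber;
    final long lowerBound;
    final long upperBound;

    FileStats(long sequenceNumber, long lowerBound, long upperBound) {
      this.sequenceNumber = sequenceNumber;
      this.lowerBound = lowerBound;
      this.upperBound = upperBound;
    }
  }

  // An equality delete can affect a data file only if the data file is strictly
  // older and the two value ranges intersect.
  static boolean canApply(FileStats eqDelete, FileStats dataFile) {
    boolean dataFileIsOlder = dataFile.sequenceNumber < eqDelete.sequenceNumber;
    boolean rangesOverlap =
        eqDelete.lowerBound <= dataFile.upperBound && dataFile.lowerBound <= eqDelete.upperBound;
    return dataFileIsOlder && rangesOverlap;
  }

  // The delete file is dangling when no data file can be affected by it.
  static boolean isDangling(FileStats eqDelete, List<FileStats> dataFiles) {
    for (FileStats dataFile : dataFiles) {
      if (canApply(eqDelete, dataFile)) {
        return false;
      }
    }
    return true;
  }
}
```

A check based on sequence numbers alone would keep a delete file as long as some older data file exists, which is how non-overlapping equality deletes survive cleanup.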
This PR collects all delete files from a full scan into a set and then removes all delete files that aren't in the set. This approach guarantees that all dangling delete files are deleted regardless of their type.
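The set-based approach can be illustrated with a short, self-contained sketch. It is not the action's actual code: file paths stand in for delete file entries, the `scanPlan` map (data file path to the delete files the scan attached to it) is an assumed simplification of scan planning output, and `danglingDeletes` is a hypothetical helper.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ScanBasedDanglingDeletesSketch {
  // allDeleteFiles: every delete file currently tracked by the table.
  // scanPlan: for each data file in a full scan, the delete files the scan would apply to it.
  static Set<String> danglingDeletes(
      Collection<String> allDeleteFiles, Map<String, List<String>> scanPlan) {
    // Step 1: collect every delete file that the full scan actually uses.
    Set<String> live = new HashSet<>();
    for (List<String> deletes : scanPlan.values()) {
      live.addAll(deletes);
    }

    // Step 2: anything tracked by the table but never used by the scan is dangling.
    Set<String> dangling = new TreeSet<>(allDeleteFiles);
    dangling.removeAll(live);
    return dangling;
  }
}
```

Because the result is defined purely by what the scan uses, the same set difference covers position deletes, equality deletes, and DVs without type-specific rules.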