[VL][DELTA] Support Delta CDF scan offload#12218
Run Gluten Clickhouse CI on x86
Could you add a DV-enabled CDF regression test to confirm the behavior? Something like: compared against vanilla Spark (checkAnswer)
import org.apache.spark.sql.delta.BatchCDFSchemaEndVersion
import org.apache.spark.sql.delta.commands.cdc.CDCReader

object DeltaCDFRelationHelper {
Do you know what happens in Delta 4.1/4.2/4.3? Would it take the code from 4.0 folder?
Yes @felipepessoto I checked the current Maven profile/source wiring:
- spark-4.0 sets delta.version=4.0.1 and delta.binary.version=40
- spark-4.1 sets delta.version=4.1.0 and delta.binary.version=40
- the Delta profile adds sources from src-delta${delta.binary.version}/main/scala
So with the current profile setup, both Spark 4.0 / Delta 4.0.x and Spark 4.1 / Delta 4.1.x use gluten-delta/src-delta40/....
For future Delta 4.2 / 4.3 support, it depends on how those profiles are introduced. If they keep delta.binary.version=40, they will continue to use the same src-delta40 helper. If Delta changes the relevant CDF APIs and Gluten introduces a new binary bucket, then we should add a new version-specific helper.
This also relates to the earlier folder-activation discussion in #11924: supporting family-level folders like src-spark4 / src-delta4 would make this cleaner and avoid copying code across Spark/Delta 4.x profiles when the APIs stay compatible. I think that belongs in a separate Maven/source-layout refactor rather than in this CDF PR.
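To make the folder resolution above concrete, here is a minimal sketch in plain Scala. It only models the property substitution described in this comment; the case class and object names are hypothetical, and the real wiring lives in the Maven profiles rather than in code like this.

```scala
// Hypothetical model of the profile wiring discussed above.
// The values mirror the ones quoted in this thread; nothing here reads the actual POM.
final case class DeltaProfile(
    sparkProfile: String,
    deltaVersion: String,
    deltaBinaryVersion: String)

object SourceDirSketch {
  val profiles: Seq[DeltaProfile] = Seq(
    DeltaProfile("spark-4.0", "4.0.1", "40"),
    DeltaProfile("spark-4.1", "4.1.0", "40"))

  // Mirrors src-delta${delta.binary.version}/main/scala.
  def sourceDir(p: DeltaProfile): String =
    s"src-delta${p.deltaBinaryVersion}/main/scala"

  def main(args: Array[String]): Unit =
    profiles.foreach { p =>
      println(s"${p.sparkProfile} (Delta ${p.deltaVersion}) -> ${sourceDir(p)}")
    }
}
```

Both profiles resolve to the same folder because the folder depends only on delta.binary.version, not on delta.version. A future profile would pick up a different helper only if it sets a different binary version.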
Thanks @felipepessoto, good point! I agree this is worth covering explicitly. I’ll add a focused DV-enabled CDF regression with

One important nuance: this PR is not intended to claim full native DV support. DV-backed Delta scan execution is still a separate active area in Gluten/Velox and may continue to fall back where native DV handling is not available. The goal of the regression here is to make sure enabling DV does not break the CDF planning path or result correctness.
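For discussion, a rough sketch of what such a DV-enabled regression could look like. This is not the test added in the PR: the object name, table name, and row data are hypothetical, and it assumes a SparkSession with Delta (and optionally Gluten) on the classpath. It also assumes that toggling spark.gluten.enabled at session level is enough to obtain a vanilla Spark baseline.

```scala
import org.apache.spark.sql.SparkSession

object DvCdfRegressionSketch {
  // Runs the CDF query with Gluten toggled on or off and returns rows in a stable order.
  // Assumption: spark.gluten.enabled can be switched per session to get a vanilla baseline.
  def cdfRows(spark: SparkSession, table: String, glutenEnabled: Boolean): Seq[String] = {
    val key = "spark.gluten.enabled"
    val previous = spark.conf.getOption(key)
    spark.conf.set(key, glutenEnabled.toString)
    try {
      spark.sql(s"SELECT * FROM table_changes('$table', 0)")
        .collect().map(_.toString).sorted.toSeq
    } finally {
      // Restore whatever the session had before this call.
      previous match {
        case Some(v) => spark.conf.set(key, v)
        case None => spark.conf.unset(key)
      }
    }
  }

  def run(spark: SparkSession): Unit = {
    val table = "dv_cdf_sketch" // hypothetical table name
    spark.sql(s"DROP TABLE IF EXISTS $table")
    spark.sql(
      s"""CREATE TABLE $table (id INT, name STRING) USING delta
         |TBLPROPERTIES (
         |  'delta.enableChangeDataFeed' = 'true',
         |  'delta.enableDeletionVectors' = 'true')""".stripMargin)
    spark.sql(s"INSERT INTO $table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
    spark.sql(s"DELETE FROM $table WHERE id = 2")
    spark.sql(s"UPDATE $table SET name = 'z' WHERE id = 3")

    val vanilla = cdfRows(spark, table, glutenEnabled = false)
    val gluten = cdfRows(spark, table, glutenEnabled = true)
    assert(vanilla == gluten, s"CDF rows differ:\nvanilla=$vanilla\ngluten=$gluten")
  }
}
```

The comparison only checks result parity. It deliberately does not assert that the DV-backed scans run natively, which matches the scope stated above.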
What changes are proposed in this pull request?
Addresses #12195.
Delta CDF reads enter Spark as CDCReader.DeltaCDFRelation, so they do not initially have the normal FileSourceScanExec + DeltaParquetFileFormat shape that Gluten's existing Delta scan offload rule recognizes.

This PR adds a Gluten Delta planner strategy, wired from the Velox Delta component, that recognizes DeltaCDFRelation, expands it through Delta's own CDF batch planning path, and rewrites the original projection/filter attributes onto the expanded logical plan. After that, the existing Delta scan offload path can plan the underlying CDF file scans as DeltaScanTransformer.

The change is intentionally scoped to Gluten's Delta/Spark planning layer rather than Velox C++:
- DeltaCDFScanStrategy for table_changes(...) and DataFrame readChangeFeed scans.
- Wiring of the strategy from VeloxDeltaComponent.
- Regression coverage for readChangeFeed, column mapping, and a startingVersion = 0 case.

One planner/test nuance: Delta CDF expansion can keep a Spark-side ExistingRDD branch for synthesized change rows, including on the tested update/delete CDF paths. The regression suite therefore compares results against vanilla Spark and asserts that the expanded CDF file scans are transformed to DeltaScanTransformer, rather than requiring the entire expanded CDF union to be globally fallback-free.

How was this patch tested?
Local checks used JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home.
- git diff --check
- ./dev/format-scala-code.sh check
- ./build/mvn -pl gluten-delta -am -Pspark-3.5 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile
- ./build/mvn -pl gluten-delta -am -Pspark-4.0 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile
- ./build/mvn -pl gluten-delta -am -Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile

Earlier cross-version compile checks also passed before the test-expectation-only update:
- ./build/mvn -pl gluten-delta -am -Pspark-3.3 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
- ./build/mvn -pl gluten-delta -am -Pspark-3.4 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
- ./build/mvn -pl gluten-delta -am -Pspark-3.5 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
- ./build/mvn -pl gluten-delta -am -Pspark-4.0 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
- ./build/mvn -pl gluten-delta -am -Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile

Full native Velox runtime and benchmarking are left to CI / a native Gluten-optimized environment; this local checkout does not have cpp/build/releases/libgluten.so and the Velox external project build available.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: IBM BOB
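
As an illustration of the planner/test nuance in the description, the following sketch shows one way to check that at least one CDF file scan was offloaded without requiring the whole plan to be fallback-free. It is an assumption-laden sketch, not the suite from this PR: the object and method names are hypothetical, nodes are matched by class simple name so the snippet compiles without Gluten, and it assumes Spark and Delta are on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper

object CdfPlanShapeSketch extends AdaptiveSparkPlanHelper {
  // DataFrame-style CDF read starting from table version 0.
  def changeFeed(spark: SparkSession, table: String): DataFrame =
    spark.read.format("delta")
      .option("readChangeFeed", "true")
      .option("startingVersion", 0L)
      .table(table)

  // Collects executed-plan nodes whose class simple name matches, looking through AQE wrappers.
  def nodesNamed(df: DataFrame, simpleName: String): Seq[SparkPlan] = {
    df.collect() // execute first so an adaptive plan is finalized before inspection
    collect(df.queryExecution.executedPlan) {
      case p if p.getClass.getSimpleName == simpleName => p
    }
  }

  // Requires at least one offloaded scan; a Spark-side branch elsewhere in the union is tolerated.
  def assertCdfScansOffloaded(df: DataFrame): Unit = {
    val offloaded = nodesNamed(df, "DeltaScanTransformer")
    assert(offloaded.nonEmpty, "expected at least one DeltaScanTransformer in the executed plan")
  }
}
```

Asserting on the presence of the transformed scans, rather than on the absence of any fallback, keeps the check stable when the expanded CDF union retains an ExistingRDD branch for synthesized change rows.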