[VL] Refactor Gluten to use upstream Velox Iceberg connector #12219
infvg wants to merge 1 commit into
Run Gluten Clickhouse CI on x86 |
|
Thanks for the PR. Will we let Gluten directly reference Velox's main branch? This seems feasible if Velox PRs are verified by Gluten CI. Otherwise, code changes in Velox can break the Gluten build/tests. |
|
@philo-he not directly yet, we still need 5 commits. I've included them in this branch here: |
|
@philo-he No. The PR is for the iceberg connector refactor only. We will still have to use IBM/velox. |
Pull request overview
This PR updates Gluten’s Velox integration to a newer upstream Velox branch and switches Iceberg execution to use the upstream Velox Iceberg connector, removing/relaxing previous “enhanced features” gating so Iceberg support is available in the standard Velox backend.
Changes:
- Add and register a dedicated Velox Iceberg connector ID, and route Iceberg scans/splits through it (planner + runtime + query context connector configs).
- Update Iceberg write path to match the upstream connector APIs (IcebergWriter + JNI wiring) and make Spark-side reflection more tolerant of upstream Iceberg/SparkWrite changes.
- Adjust Iceberg partition-data JSON parsing to support additional JSON shapes, and update Iceberg write metrics expectation in tests.
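
To illustrate the partition-data change in the last bullet, here is a minimal sketch of accepting more than one JSON shape and validating the value count. It is not the code in `PartitionDataJson.java`: the class name `PartitionValuesSketch`, the wrapper key `"values"`, and the use of already-decoded `List`/`Map` objects in place of a JSON library are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not the actual PartitionDataJson implementation. */
final class PartitionValuesSketch {
  private PartitionValuesSketch() {}

  /**
   * Normalizes already-decoded partition data into a flat list of values.
   * Accepts a JSON array (decoded as List) or a JSON object that wraps the
   * array under the assumed key "values" (decoded as Map).
   */
  static List<Object> normalize(Object decoded, int expectedFieldCount) {
    Object candidate = decoded;
    if (decoded instanceof Map) {
      // Object form: unwrap the array stored under the assumed key.
      candidate = ((Map<?, ?>) decoded).get("values");
    }
    if (!(candidate instanceof List)) {
      throw new IllegalArgumentException("Unsupported partition data shape: " + decoded);
    }
    List<Object> values = new ArrayList<>((List<?>) candidate);
    // Count validation: one value per partition field is expected.
    if (values.size() != expectedFieldCount) {
      throw new IllegalArgumentException(
          "Expected " + expectedFieldCount + " partition values but got " + values.size());
    }
    return values;
  }
}
```

Under these assumptions, both `normalize(List.of(1, "a"), 2)` and `normalize(Map.of("values", List.of(1, "a")), 2)` yield the same two-element list, while a count mismatch fails fast instead of producing a misaligned partition tuple.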
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| gluten-iceberg/src/main/scala/org/apache/iceberg/spark/source/IcebergWriteUtil.scala | Makes reflection access to SparkWrite.writeProperties optional for compatibility across Iceberg versions. |
| gluten-iceberg/src/main/java/org/apache/gluten/connector/write/PartitionDataJson.java | Extends partition JSON parsing to accept array/object forms and adds validation on counts. |
| ep/build-velox/src/get-velox.sh | Updates default Velox branch values used by the dependency fetch script. |
| cpp/velox/substrait/SubstraitToVeloxPlanValidator.h | Adds Iceberg connector ID to validator’s connector set. |
| cpp/velox/substrait/SubstraitToVeloxPlan.cc | Routes table scans to Iceberg connector when the split info indicates Iceberg. |
| cpp/velox/jni/VeloxJniWrapper.cc | Removes enhanced-feature compile gating around Iceberg JNI and returns “enhanced enabled” unconditionally. |
| cpp/velox/config/VeloxConfig.h | Introduces kIcebergConnectorId. |
| cpp/velox/compute/WholeStageResultIterator.cc | Uses Iceberg connector ID for Iceberg splits and populates connector session config for it. |
| cpp/velox/compute/VeloxRuntime.h | Makes Iceberg writer APIs always available (no enhanced-feature gating). |
| cpp/velox/compute/VeloxRuntime.cc | Registers/unregisters a scoped Iceberg connector per runtime instance. |
| cpp/velox/compute/VeloxConnectorIds.h | Adds iceberg ID and icebergRegistered tracking. |
| cpp/velox/compute/VeloxBackend.h | Adds Iceberg connector factory API. |
| cpp/velox/compute/VeloxBackend.cc | Implements Iceberg connector creation using upstream Velox IcebergConnector. |
| cpp/velox/compute/iceberg/IcebergWriter.h | Updates writer state/types to align with upstream connector expectations. |
| cpp/velox/compute/iceberg/IcebergWriter.cc | Refactors Iceberg writer setup (QueryCtx/ConnectorQueryCtx/DataSink) and write slicing behavior. |
| cpp/velox/CMakeLists.txt | Always builds Iceberg writer sources (no enhanced-feature gating). |
| backends-velox/src-iceberg/test/scala/org/apache/gluten/execution/enhanced/VeloxIcebergSuite.scala | Updates expected Iceberg write metric numWrittenFiles. |
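
The `IcebergWriteUtil.scala` row describes making a reflective field lookup optional so that it tolerates Iceberg versions where the field is absent. A rough Java sketch of that pattern follows. The names `OptionalFieldAccess`, `WriteWithProperties`, and `WriteWithoutProperties` are stand-ins invented here, and the real change is in Scala against `SparkWrite`, so treat this as an illustration of the idea rather than the PR's code.

```java
import java.lang.reflect.Field;
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of version-tolerant reflective access. */
final class OptionalFieldAccess {
  private OptionalFieldAccess() {}

  /** Returns the field value if the field exists on the target's class, else empty. */
  static Optional<Object> readField(Object target, String fieldName) {
    try {
      Field field = target.getClass().getDeclaredField(fieldName);
      field.setAccessible(true);
      return Optional.ofNullable(field.get(target));
    } catch (ReflectiveOperationException e) {
      // Field missing (or inaccessible) in this library version: treat as absent.
      return Optional.empty();
    }
  }

  /** Stand-in for a library version that has the field. */
  static final class WriteWithProperties {
    private final Map<String, String> writeProperties =
        Collections.singletonMap("example.key", "example.value");
  }

  /** Stand-in for a library version that dropped the field. */
  static final class WriteWithoutProperties {}
}
```

A caller can then fall back to defaults, for example `readField(write, "writeProperties").orElse(Collections.emptyMap())`, instead of failing when the field is not there.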
connectorConfig_);
connectorConfig_,
icebergConfig);
dataSink_.get();
auto filteredRowVector =
    std::make_shared<RowVector>(pool_.get(), rowType_, nullptr, inputRowVector->size(), std::move(dataColumns));
This PR moves Gluten to the latest Velox branch with minimal changes and switches the Iceberg implementation to the new upstream Velox Iceberg connector. It also removes most traces of the enhanced branch.