[CORE] Drop 15.0.0-gluten Arrow version rename and depend on vanilla Apache Arrow#12244
… Apache Arrow

The custom 15.0.0-gluten artifact coordinate forced every contributor to run dev/build-arrow.sh before they could build gluten, even though the Java side of that build no longer carries any load-bearing modifications:

* The 883-line modify_arrow_dataset_scan_option.patch added CSV / Substrait dataset Java classes (CsvFragmentScanOptions, ConvertUtil, etc.). Every consumer of those classes inside gluten was deleted by apache#12130 along with the Arrow-CSV / Arrow-Dataset JVM code path. The patch is no longer applied to the Arrow Java build here; the file itself is kept because get-velox.sh still copies it into Velox's CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still load-bearing for ppc64le builds, but does not require an artifact rename: it only patches the binary-resource lookup inside the arrow-c-data JNI jar and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are unchanged.

After this change, on x86_64 / aarch64 every gluten-arrow Arrow dependency resolves from Maven Central (arrow-c-data:15.0.0, arrow-dataset:15.0.0, arrow-vector:15.0.0, arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for the Spark 4.x profiles). ppc64le builds still rely on dev/build-arrow.sh to produce locally-patched 15.0.0 artifacts; the local-m2 install overrides Central as before.

Note: this PR removes the artifact-rename indirection but does not yet unbundle Arrow from the gluten-velox bundle. The bundle still ships unshaded Arrow (per apache#12226) at the same vanilla coordinates. Removing the bundled Arrow in favour of Spark's bundled copy is a separate follow-up driven by the discussion on apache#12226.
Run Gluten Clickhouse CI on x86
# Arrow Java libraries
${MVN_CMD} install -Parrow-jni -P arrow-c-data -pl c,dataset -am \
  -Darrow.c.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.dataset.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.cpp.build.dir=$ARROW_INSTALL_DIR/lib \
  -Dmaven.test.skip -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -Dassembly.skipAssembly
Do we still need to build Arrow Java locally?
Mostly no — Maven Central's The reason it's still wired into
Happy to add a follow-up commit gating
@Jenkins-J Can you fix this?
Per @zhztheplayer, once we completely remove the CSV reader, we should be able to use Arrow 7 and Arrow 11. I reopened the PR: #12148
@sezruby we can give this a try, thanks. We'd also keep an eye on the glibc compatibility of the official arrow-c-data jar.
Maven Central's arrow-c-data / arrow-dataset jars at the pinned version already ship libarrow_cdata_jni and libarrow_dataset_jni for x86_64 (Linux / macOS / Windows) and aarch_64 (Linux / macOS), so contributors on those archs no longer need build-arrow.sh's mvn install step: gluten-arrow resolves the same artifact transitively from Central.

Skip build_arrow_java when uname -m is not ppc64le; ppc64le still needs the local install because Central's jar carries no ppcle_64 native, and support_ibm_power.patch (kept) adds that arch case to JniLoader.java.

build_arrow_cpp and prepare_arrow_build stay unconditional: Velox links against the static C++ Arrow regardless of arch, and the patched source tree is needed for that build path.

Saves ~10 min of redundant `mvn install` on every dev bootstrap on x86_64 / aarch64. Behavior on ppc64le is unchanged.

Follow-up to review feedback on the parent commit.
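The arch gate described in this commit message could look roughly like the sketch below. It is an illustration only, not the code in dev/build-arrow.sh: the helper name `needs_local_arrow_java` and the echoed messages are hypothetical, and the real `mvn install` step is reduced to a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: the real build_arrow_java in dev/build-arrow.sh may differ.

# Hypothetical helper: succeeds (exit 0) only when the local Arrow Java
# build is still required. The machine name can be passed as an argument
# so the decision is testable off ppc64le; it defaults to `uname -m`.
needs_local_arrow_java() {
  local arch="${1:-$(uname -m)}"
  [ "$arch" = "ppc64le" ]
}

build_arrow_java() {
  if ! needs_local_arrow_java "$@"; then
    # x86_64 / aarch64: Arrow jars resolve from Maven Central instead.
    echo "Skipping Arrow Java build: Maven Central jars are used on this arch"
    return 0
  fi
  # Placeholder for the existing patched `mvn install` step (ppc64le only).
  echo "Running local Arrow Java build"
}
```

Taking the arch as an optional argument keeps the gate a pure function of its input, which makes the ppc64le branch easy to exercise on other hardware.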
@zhztheplayer @philo-he Pushed a follow-up commit ( Two CI failures on this run, both unrelated to this PR's diff:
Could you re-trigger those two lanes when you get a chance? One caveat worth flagging: every gluten CI lane runs
@sezruby, thank you for the continued efforts. In the CI workflow, should we remove the Arrow artifact sharing from the build job? Including, but not limited to, the following: gluten/.github/workflows/velox_backend_x86.yml Lines 98 to 102 in 9d1a5fc
@zhztheplayer, based on the following build guide, it appears that java-jni-manylinux-2014 is used in Arrow's official build. Since it is CentOS 7-based and built against glibc 2.17, this should ensure glibc compatibility on higher-version environments.
In Gluten, when we convert Velox data to Arrow C_data, we should link against the Arrow in Velox, not the lib in the Arrow jar, right? Then we copy the data pointers to the Arrow jar's and send them to the JVM. If any project uses Spark's Arrow C_data, then we should be fine. Spark uses 4 Arrow libraries: arrow-format-12.0.1.jar, arrow-memory-core-12.0.1.jar, arrow-memory-netty-12.0.1.jar, and arrow-vector-12.0.1.jar. The c_data lib is in arrow-c-data-12.0.1.jar. If there is a libc conflict, we can build that jar only.
@sezruby Does lance-spark use arrow-c-data.jar? Did you build locally or download from Maven?
Yes it does use
Looks good, this should avoid the libc incompatibilities.
Yes, and I think we still need to verify the change on CI. Can you push a debug commit to remove the pre-built Arrow Java jar from local Maven repo, then see if CI can pass? Once verified, that debug commit can be reverted before merging.
Previously the build-native lanes copied /root/.m2/repository/org/apache/arrow/*
(pre-built Arrow Java jars baked into the Docker image) into a workspace
.m2 path and uploaded them as an `arrow-jars-*-${sha}` artifact. Every
downstream test lane downloaded that artifact into its own /root/.m2 so its
Maven build resolved Arrow from the pre-built copy instead of Maven Central.
This pre-bake has two problems now:
1. It hides whether gluten can actually resolve Arrow from Central (the
change in the parent commits depends on this — gluten-arrow now resolves
arrow-c-data:15.0.0 / arrow-dataset:15.0.0 etc. from Central on
x86_64 / aarch64).
2. It exercises the locally-patched 15.0.0-gluten Arrow that no longer
exists after the parent commits remove that artifact rename.
Removed:
- The mkdir + cp lines that staged Arrow jars after each native build
(velox_backend_x86.yml, velox_backend_arm.yml, velox_backend_ansi.yml,
velox_backend_enhanced.yml).
- The `arrow-jars-*` upload-artifact steps that published the staged copy.
- Every Download Arrow Jars / Download All Arrow Jar Artifacts step in
every downstream test lane (~30 occurrences across the four PR-CI
workflows).
- A leftover `ls -l .../arrow-dataset/15.0.0-gluten/` debug command in
velox_backend_x86.yml that referenced the now-removed coordinate.
Untouched: velox_nightly.yml and build_bundle_package.yml. Those build
release artifacts and may legitimately want to bake in a specific Arrow.
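A cleanup like the one listed above can be double-checked with a small scan for leftovers. The sketch below is an assumption-laden illustration, not part of this commit: the function name `find_stale_arrow_refs` is hypothetical, and the two patterns are taken from the `arrow-jars-*` artifact name and the `15.0.0-gluten` coordinate mentioned in the message.

```shell
#!/usr/bin/env bash
# Sketch only: lists files under a directory that still mention the
# pre-baked Arrow artifact name or the removed 15.0.0-gluten coordinate.

# Hypothetical helper. Prints matching file paths and returns 1 when any
# are found; prints nothing and returns 0 when the tree is clean.
find_stale_arrow_refs() {
  local root="$1"
  local hits
  # -r: recurse, -l: print file names only, -E: extended regex.
  hits="$(grep -rlE 'arrow-jars-|15\.0\.0-gluten' "$root" 2>/dev/null || true)"
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}
```

Run against the workflows directory (for example `find_stale_arrow_refs .github/workflows`), hits in velox_nightly.yml and build_bundle_package.yml would be expected if those files still stage Arrow jars, since the commit leaves them untouched on purpose.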
@FelixYBW I've created a PR over in the arrow repository related to version 15 (apache/arrow#50138). Let me know if there are any modifications I should make to the PR.
What changes were proposed in this pull request?
Make gluten depend on vanilla Apache Arrow from Maven Central instead of the locally-built `15.0.0-gluten` artifact, so x86_64 / aarch64 contributors don't need to run `dev/build-arrow.sh`'s Java build.

Three commits:

1. Drop the `15.0.0-gluten` rename. Removes the `arrow-gluten.version` property from the parent / Spark-4.x profile poms; switches every `gluten-arrow/pom.xml` Arrow dep to the vanilla `${arrow.version}` coordinate (15.0.0 for Spark 3.x default; 18.1.0 already in the Spark-4.0/4.1 profiles); drops the `versions:set -DnewVersion=15.0.0-gluten` step in `build-arrow.sh`. (Note: [MINOR][VL] Drop modify_arrow_dataset_scan_option.patch #12148 already removed the `modify_arrow_dataset_scan_option.patch` upstream; picked up via merge.)
2. `build_arrow_java()` now early-returns on non-ppc64le. Maven Central's `arrow-c-data:15.0.0` already ships `libarrow_cdata_jni` for `x86_64/` (Linux/macOS/Windows) and `aarch_64/` (Linux/macOS), so no local build is needed. Saves ~10 min per dev bootstrap. The ppc64le path is byte-for-byte unchanged (Central's jar has no `ppcle_64/` native, so `support_ibm_power.patch` + the local `mvn install` are still required there).
3. Removes the `mkdir + cp` lines that staged Arrow jars, the `arrow-jars-*` upload-artifact steps, and every downstream `Download Arrow Jars` step (~30 occurrences across 4 PR-CI workflows). Test lanes now resolve Arrow from Maven Central, which makes CI green a real signal that the change works against the published artifacts. Also cleaned up a stale `ls -l .../arrow-dataset/15.0.0-gluten/` debug command.

Patch audit (for the artifact rename drop)

`build-arrow.sh` previously applied four patches and renamed the resulting jars to `15.0.0-gluten`:

- `modify_arrow.patch`
- `modify_arrow_dataset_scan_option.patch`
- `cmake-compatibility.patch`
- `support_ibm_power.patch`: `JniLoader` switch case

A sweep (`grep -rn 'org.apache.arrow.dataset' --include='*.java' --include='*.scala'`) confirms the only main-source consumers of `arrow-dataset` are `gluten-arrow/.../ArrowNativeMemoryPool` and `ArrowReservationListener`, which use only the upstream `org.apache.arrow.dataset.jni.{NativeMemoryPool, ReservationListener}` types, with no patched classes.

Effect on contributors

- `dev/build-arrow.sh` is no longer required for the JVM Arrow side. Local builds skip the ~10-min Arrow Java compile.
- `dev/build-arrow.sh`'s C++ build still runs (Velox links static cpp Arrow); its Java build still runs (only on ppc64le, gated by `uname -m`). Same dev loop as today.

Effect on shading

Independent of bundling. The bundled gluten-velox-bundle still ships unshaded `org.apache.arrow.*` per #12226. Unbundling Arrow entirely (relying on Spark's shipped Arrow at runtime) was tried in #12245, which was closed because Spark 3.3 / 3.4 ship Arrow 7 / 11 and the LCD-pin trade-offs aren't worth it. Re-evaluate when those versions are dropped.

How was this patch tested?

- `mvn dependency:tree -pl gluten-arrow` shows every `org.apache.arrow:*` resolved at vanilla `15.0.0` / `18.1.0` from Central.
- `grep -rn 'arrow-gluten\.version\|15\.0\.0-gluten'`: no matches.

References

- `modify_arrow_dataset_scan_option.patch` (now landed upstream)
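The dependency-tree check under "How was this patch tested?" can be made mechanical with a small filter. The sketch below is a hedged illustration rather than anything this PR adds: the function name `check_arrow_versions` is hypothetical, and it assumes the usual `group:artifact:type:version:scope` line format that `mvn dependency:tree` prints.

```shell
#!/usr/bin/env bash
# Sketch only: reads `mvn dependency:tree` output on stdin and reports any
# org.apache.arrow coordinate that is not at a vanilla 15.0.0 / 18.1.0.

# Hypothetical helper. Prints offending lines and returns 1 when any are
# found; prints nothing and returns 0 otherwise.
check_arrow_versions() {
  local bad
  # Keep only Arrow lines, then drop those whose version field is exactly
  # 15.0.0 or 18.1.0 (so 15.0.0-gluten is still reported).
  bad="$(grep -F 'org.apache.arrow:' \
        | grep -vE ':(15\.0\.0|18\.1\.0)(:|$)' || true)"
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}
```

A possible invocation is `mvn dependency:tree -pl gluten-arrow | check_arrow_versions`. The allow-list is hard-coded to the two versions named in this description and would need updating whenever `${arrow.version}` changes.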