
[CORE] Drop 15.0.0-gluten Arrow version rename and depend on vanilla Apache Arrow #12244

Merged: philo-he merged 4 commits into apache:main from sezruby:arrow-drop-gluten-rename on Jun 9, 2026

sezruby commented Jun 5, 2026

Contributor

What changes were proposed in this pull request?

Make gluten depend on vanilla Apache Arrow from Maven Central instead of the locally-built 15.0.0-gluten artifact, so x86_64 / aarch64 contributors don't need to run dev/build-arrow.sh's Java build.

Three commits:

  1. Drop the 15.0.0-gluten rename. Removes the arrow-gluten.version property from the parent / Spark-4.x profile poms; switches every gluten-arrow/pom.xml Arrow dep to the vanilla ${arrow.version} coordinate (15.0.0 for Spark 3.x default; 18.1.0 already in the Spark-4.0/4.1 profiles); drops the versions:set -DnewVersion=15.0.0-gluten step in build-arrow.sh. (Note: #12148, "[MINOR][VL] Drop modify_arrow_dataset_scan_option.patch", already removed modify_arrow_dataset_scan_option.patch upstream; that removal was picked up via merge.)
  2. Skip local Arrow Java build on x86_64 / aarch64. build_arrow_java() now early-returns on non-ppc64le. Maven Central's arrow-c-data:15.0.0 already ships libarrow_cdata_jni for x86_64/ (Linux/macOS/Windows) and aarch_64/ (Linux/macOS) — no local build needed. Saves ~10 min per dev bootstrap. ppc64le path is byte-for-byte unchanged (Central's jar has no ppcle_64/ native, so support_ibm_power.patch + the local mvn install are still required there).
  3. Stop sharing pre-built Arrow Java jars between CI lanes. Removes the mkdir + cp lines that staged Arrow jars, the arrow-jars-* upload-artifact steps, and every downstream Download Arrow Jars step (~30 occurrences across 4 PR-CI workflows). Test lanes now resolve Arrow from Maven Central, which makes CI green a real signal that the change works against the published artifacts. Also cleaned up a stale ls -l .../arrow-dataset/15.0.0-gluten/ debug command.

Patch audit (for the artifact rename drop)

build-arrow.sh previously applied four patches and renamed the resulting jars to 15.0.0-gluten:

| Patch | Lines | Touches | Status |
|---|---|---|---|
| modify_arrow.patch | 135 | C++ only | Still applied |
| modify_arrow_dataset_scan_option.patch | 883 | Adds dead Java classes; C++ Substrait | Removed by #12148 upstream |
| cmake-compatibility.patch | 34 | C++ only | Unchanged |
| support_ibm_power.patch | 28 | ppc64le JniLoader switch case | Still applied on ppc64le; doesn't need a custom artifact coordinate |

A sweep (grep -rn 'org.apache.arrow.dataset' --include='*.java' --include='*.scala') confirms the only main-source consumers of arrow-dataset are gluten-arrow/.../ArrowNativeMemoryPool and ArrowReservationListener, which use only the upstream org.apache.arrow.dataset.jni.{NativeMemoryPool, ReservationListener} types — no patched classes.

Effect on contributors

  • x86_64 / aarch64: dev/build-arrow.sh is no longer required for the JVM Arrow side. Local builds skip the ~10-min Arrow Java compile.
  • ppc64le: dev/build-arrow.sh's C++ build still runs (Velox links static cpp Arrow); its Java build still runs (only on ppc64le, gated by uname -m). Same dev loop as today.

Effect on shading

This change is independent of bundling. The bundled gluten-velox-bundle still ships unshaded org.apache.arrow.* per #12226. Unbundling Arrow entirely (relying on Spark's shipped Arrow at runtime) was tried in #12245, which was closed because Spark 3.3 / 3.4 ship Arrow 7 / 11 and the trade-offs of pinning to the lowest-common-denominator (LCD) Arrow version aren't worth it. Re-evaluate when those Spark versions are dropped.

How was this patch tested?

  • mvn dependency:tree -pl gluten-arrow shows every org.apache.arrow:* resolved at vanilla 15.0.0 / 18.1.0 from Central.
  • Sweep for stale references: grep -rn 'arrow-gluten\.version\|15\.0\.0-gluten' — no matches.
  • CI: with commit 3, every test lane now downloads Arrow from Maven Central rather than from the workspace artifact, so a green run validates the Central-only path end-to-end.
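
The stale-reference sweep listed above could be wrapped in a small helper so that a script fails when a leftover reference exists. This is a minimal sketch: the function name `check_stale_arrow_refs` is hypothetical and not part of this PR; only the grep pattern is taken from the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): fail if any file under the
# given directory still mentions the removed property or version string.
check_stale_arrow_refs() {
  local root="${1:-.}"
  # Same pattern as the sweep in the PR description (GNU grep alternation).
  if grep -rn 'arrow-gluten\.version\|15\.0\.0-gluten' "$root"; then
    echo "stale Arrow references found under $root" >&2
    return 1
  fi
  return 0
}
```

Running `check_stale_arrow_refs .` from the repository root prints any matching lines and returns non-zero if there are any.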

References

… Apache Arrow

The custom 15.0.0-gluten artifact coordinate forced every contributor to run
dev/build-arrow.sh before they could build gluten, even though the Java side
of that build no longer carries any load-bearing modifications:

* The 883-line modify_arrow_dataset_scan_option.patch added CSV / Substrait
  dataset Java classes (CsvFragmentScanOptions, ConvertUtil, etc.). Every
  consumer of those classes inside gluten was deleted by apache#12130 along with
  the Arrow-CSV / Arrow-Dataset JVM code path. The patch is no longer applied
  to the Arrow Java build here; the file itself is kept because get-velox.sh
  still copies it into Velox's CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still
  load-bearing for ppc64le builds, but does not require an artifact rename: it
  only patches the binary-resource lookup inside the arrow-c-data JNI jar
  and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are
  unchanged.

After this change, on x86_64 / aarch64 every gluten-arrow Arrow dependency
resolves from Maven Central (arrow-c-data:15.0.0, arrow-dataset:15.0.0,
arrow-vector:15.0.0, arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for
the Spark 4.x profiles). ppc64le builds still rely on dev/build-arrow.sh
to produce locally-patched 15.0.0 artifacts — the local-m2 install
overrides Central as before.

Note: this PR removes the artifact-rename indirection but does not yet
unbundle Arrow from the gluten-velox bundle. The bundle still ships
unshaded Arrow (per apache#12226) at the same vanilla coordinates. Removing
the bundled Arrow in favour of Spark's bundled copy is a separate
follow-up driven by the discussion on apache#12226.
@github-actions github-actions Bot added CORE works for Gluten Core BUILD VELOX labels Jun 5, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Run Gluten Clickhouse CI on x86

Comment thread dev/build-arrow.sh
Comment on lines 113 to 116
    # Arrow Java libraries
    ${MVN_CMD} install -Parrow-jni -P arrow-c-data -pl c,dataset -am \
      -Darrow.c.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.dataset.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.cpp.build.dir=$ARROW_INSTALL_DIR/lib \
      -Dmaven.test.skip -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -Dassembly.skipAssembly

Member

Do we still need to build Arrow Java locally?

@sezruby

sezruby commented Jun 5, 2026

Contributor Author

> Do we still need to build Arrow Java locally?

Mostly no — Maven Central's arrow-c-data:15.0.0 jar already ships libarrow_cdata_jni for x86_64/ (Linux/macOS/Windows) and aarch_64/ (Linux/macOS), so x86_64 / aarch64 contributors no longer need the local Java build after this PR.

The reason it's still wired into dev/builddeps-veloxbe.sh unconditionally:

  • ppc64le has no native in the Central jar. support_ibm_power.patch (kept) adds the ppc64le → ppcle_64 arch case to JniLoader.java and the local mvn install step bakes a locally-built libarrow_cdata_jni.so for ppc64le into the resulting arrow-c-data:15.0.0 jar in ~/.m2, overriding Central.
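
To check which natives a given arrow-c-data jar carries, something like the following could be used. It is a sketch under assumptions: the function names are hypothetical, the arch directory names are the ones mentioned in this thread (x86_64/, aarch_64/, ppcle_64/), and the exact resource layout inside the jar is not assumed beyond "arch directory, then the library file".

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Map `uname -m` output to the arch directory name used inside the jar.
arrow_arch_dir() {
  case "$1" in
    x86_64)  echo "x86_64" ;;
    aarch64) echo "aarch_64" ;;
    ppc64le) echo "ppcle_64" ;;
    *)       return 1 ;;
  esac
}

# Read a jar listing (for example from `jar tf some.jar`) on stdin and
# succeed if it contains a libarrow_cdata_jni native under the arch
# directory that corresponds to the given `uname -m` value.
jar_has_cdata_native() {
  local dir
  dir="$(arrow_arch_dir "$1")" || return 1
  grep -Eq "(^|/)${dir}/libarrow_cdata_jni\."
}
```

For example, `jar tf arrow-c-data-15.0.0.jar | jar_has_cdata_native "$(uname -m)"` returns 0 when the jar has a native for the current Linux machine and non-zero otherwise.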

Happy to add a follow-up commit gating build_arrow_java on [[ $(uname -m) == ppc64le ]] so x86_64 / aarch64 users skip ~10 min of redundant work. Holding off in this PR because there's no ppc64le CI lane to confirm the conditional doesn't break the patched build, and I don't have a qemu setup locally to validate it either.

@FelixYBW

FelixYBW commented Jun 8, 2026

Contributor

> ppc64le has no native in the Central jar.

@Jenkins-J Can you fix this?

@FelixYBW

FelixYBW commented Jun 8, 2026

Contributor

Spark 3.3.1 ships Arrow 7.0.0
Spark 3.4.4 ships Arrow 11.0.0
Spark 3.5.5 ships Arrow 15.0.0
Spark 4.0 / 4.1 ship Arrow 18.x

Per @zhztheplayer, once we completely remove the csv reader, we should be able to use Arrow 7 and Arrow 11.

I re-opened the PR: #12148

@zhztheplayer

Member

> Happy to add a follow-up commit gating build_arrow_java on [[ $(uname -m) == ppc64le ]] so x86_64 / aarch64 users skip ~10 min of redundant work.

@sezruby we can give this a try, thanks.

We'd also keep an eye on the glibc compatibility of the official arrow-c-data jar.

Maven Central's arrow-c-data / arrow-dataset jars at the pinned version already
ship libarrow_cdata_jni and libarrow_dataset_jni for x86_64 (Linux / macOS /
Windows) and aarch_64 (Linux / macOS), so contributors on those archs no longer
need build-arrow.sh's mvn install step — gluten-arrow resolves the same
artifact transitively from Central.

Skip build_arrow_java when uname -m is not ppc64le; ppc64le still needs the
local install because Central's jar carries no ppcle_64 native, and
support_ibm_power.patch (kept) adds that arch case to JniLoader.java.

build_arrow_cpp and prepare_arrow_build stay unconditional — Velox links
against the static C++ Arrow regardless of arch, and the patched source tree
is needed for that build path.

Saves ~10 min of redundant `mvn install` on every dev bootstrap on
x86_64 / aarch64. Behavior on ppc64le is unchanged.

Follow-up to review feedback on the parent commit.
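
The gate described in this commit message can be sketched roughly as follows. This is an illustration, not the actual diff: the helper name `needs_local_arrow_java` and the echoed messages are hypothetical, and the real build-arrow.sh may structure the check differently.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and messages are hypothetical.

# Succeeds only on ppc64le, where Maven Central's jar carries no ppcle_64
# native and the locally patched install is still required.
needs_local_arrow_java() {
  local arch="${1:-$(uname -m)}"
  [ "$arch" = "ppc64le" ]
}

build_arrow_java() {
  if ! needs_local_arrow_java "$@"; then
    echo "Skipping local Arrow Java build; resolving Arrow from Maven Central."
    return 0
  fi
  echo "ppc64le detected: running the existing patched Arrow Java build."
  # ... the existing mvn install steps would stay here unchanged ...
}
```

Passing the architecture as an optional argument is only there to make the check easy to exercise on a machine of a different architecture; calling `build_arrow_java` with no arguments falls back to `uname -m`.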
@github-actions

github-actions Bot commented Jun 8, 2026

Run Gluten Clickhouse CI on x86

@sezruby

sezruby commented Jun 8, 2026

Contributor Author

@zhztheplayer @philo-he Pushed a follow-up commit (a9d80cce3) that early-returns from build_arrow_java() when uname -m != ppc64le, so x86_64 / aarch64 contributors skip the redundant local install — gluten-arrow resolves arrow-c-data / arrow-dataset from Maven Central on those archs. ppc64le path is byte-for-byte unchanged.

Two CI failures on this run, both unrelated to this PR's diff:

  • spark-test-spark35-slow (2m31s) — failed at container init: Curl error (7): Couldn't connect to server for http://vault.centos.org/centos/8/AppStream/x86_64/os/repodata/repomd.xml. CentOS 8 mirror outage.
  • spark-test-spark41 (22m10s) — failed in a SparkScriptTransformationExec test asserting some_non_existent_command produces a SparkException. The error stack is from Spark's subprocess machinery, not gluten code; looks like an environment / flake.

Could you re-trigger those two lanes when you get a chance?

One caveat worth flagging: every gluten CI lane runs builddeps-veloxbe.sh with BUILD_ARROW=OFF and uses the pre-built Arrow baked into the Docker image, so the BUILD_ARROW=ON path (where build-arrow.sh actually executes) is not covered by CI. That means the conditional I added isn't exercised by any lane — it relies on a safe-by-construction argument: early-return on non-ppc64le; ppc64le branch byte-for-byte unchanged from before. Worth keeping in mind for review, and probably worth a separate followup to add at least one CI lane that does run BUILD_ARROW=ON end-to-end.

@github-actions

github-actions Bot commented Jun 9, 2026

Run Gluten Clickhouse CI on x86

@philo-he

philo-he commented Jun 9, 2026

Member

@sezruby, thank you for the continued efforts.

In the CI workflow, should we remove the Arrow artifact sharing from the build job? Including, but not limited to, the following:

    cp -r /root/.m2/repository/org/apache/arrow/* /work/.m2/repository/org/apache/arrow/

    - uses: actions/upload-artifact@v4
      with:
        name: arrow-jars-centos-7-${{github.sha}}
        path: .m2/repository/org/apache/arrow/
        if-no-files-found: error

@philo-he

philo-he commented Jun 9, 2026

Member

> We'd also keep an eye on the glibc compatibility of the official arrow-c-data jar.

@zhztheplayer, based on the following build guide, it appears that java-jni-manylinux-2014 is used in Arrow's official build. Since it is CentOS 7-based and built against glibc 2.17, this should ensure glibc compatibility on higher-version environments.

https://arrow.apache.org/java/main/developers/building.html
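
One way to sanity-check the glibc 2.17 expectation against an actual native library is to look at the versioned glibc symbols it references. The helper below is a hedged sketch: the name `max_glibc_version` is hypothetical, and feeding it the output of `objdump -T` on a `libarrow_cdata_jni.so` extracted from the jar is an assumed workflow, not something done in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: helper name is hypothetical.
# Reads text containing versioned symbol references (for example the output
# of `objdump -T libarrow_cdata_jni.so`) on stdin and prints the highest
# GLIBC_x.y version mentioned, which is the minimum glibc the library needs.
max_glibc_version() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}
```

A result of 2.17 or lower would be consistent with a manylinux2014 build; `sort -V` (version sort) is assumed to be available, as it is in GNU coreutils.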

@FelixYBW

FelixYBW commented Jun 9, 2026

Contributor

> > We'd also keep an eye on the glibc compatibility of the official arrow-c-data jar.
>
> @zhztheplayer, based on the following build guide, it appears that java-jni-manylinux-2014 is used in Arrow's official build. Since it is CentOS 7-based and built against glibc 2.17, this should ensure glibc compatibility on higher-version environments.
>
> https://arrow.apache.org/java/main/developers/building.html

In Gluten, when we convert Velox data to Arrow C data, we link against the Arrow library in Velox, not the native library in the Arrow jar, right? Then we copy the data pointers over to the Arrow jar's side and send them to the JVM.

If any project uses Spark's Arrow C_data, then we should be fine.

Spark uses 4 Arrow libraries: arrow-format-12.0.1.jar, arrow-memory-core-12.0.1.jar, arrow-memory-netty-12.0.1.jar, and arrow-vector-12.0.1.jar.

The c_data library is in arrow-c-data-12.0.1.jar. If there is a libc conflict, we can build just that jar.

@FelixYBW

FelixYBW commented Jun 9, 2026

Contributor

@sezruby Does lance-spark use arrow-c-data.jar? Did you build locally or download from maven?

@zhztheplayer

Member

> @sezruby Does lance-spark use arrow-c-data.jar?

Yes it does use arrow-c-data https://github.com/lance-format/lance-spark/blob/48ddbd12d5bf28c5d886ed99b0f21a3824ef98b3/pom.xml#L182-L186.

@zhztheplayer

Member

> based on the following build guide, it appears that java-jni-manylinux-2014 is used in Arrow's official build.

Looks good, this should avoid the libc incompatibilities.

@zhztheplayer

zhztheplayer commented Jun 9, 2026

Member

@sezruby

> One caveat worth flagging: every gluten CI lane runs builddeps-veloxbe.sh with BUILD_ARROW=OFF and uses the pre-built Arrow baked into the Docker image, so the BUILD_ARROW=ON path (where build-arrow.sh actually executes) is not covered by CI.

Yes, and I think we still need to verify the change on CI. Can you push a debug commit to remove the pre-built Arrow Java jar from local Maven repo, then see if CI can pass?

Once verified, that debug commit can be reverted before merging.
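
A debug step of this kind might look roughly like the sketch below. The function name is hypothetical, and the default path simply mirrors the /root/.m2 location quoted earlier in this thread; the actual CI step, if one were added, could differ.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.
purge_local_arrow_jars() {
  local repo="${1:-/root/.m2/repository}"
  # Remove every cached org.apache.arrow artifact so the next Maven build
  # has to resolve Arrow from a remote repository instead of the image.
  rm -rf "${repo}/org/apache/arrow"
}
```

Other cached artifacts under the same repository are left untouched, so only Arrow resolution is forced back to the remote.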

@github-actions

github-actions Bot commented Jun 9, 2026

Run Gluten Clickhouse CI on x86

@zhztheplayer

zhztheplayer commented Jun 9, 2026

Member

@sezruby

I see @philo-he mentioned there are CI steps copying arrow jars for subsequent CI tests. Can we clean them up in this PR? By doing so I think we don't need a debug commit that I mentioned in my last comment.

By all means we'd verify the official Arrow Java jars on CI.

Previously the build-native lanes copied /root/.m2/repository/org/apache/arrow/*
(pre-built Arrow Java jars baked into the Docker image) into a workspace
.m2 path and uploaded them as an `arrow-jars-*-${sha}` artifact. Every
downstream test lane downloaded that artifact into its own /root/.m2 so its
Maven build resolved Arrow from the pre-built copy instead of Maven Central.

This pre-bake has two problems now:

1. It hides whether gluten can actually resolve Arrow from Central (the
   change in the parent commits depends on this — gluten-arrow now resolves
   arrow-c-data:15.0.0 / arrow-dataset:15.0.0 etc. from Central on
   x86_64 / aarch64).
2. It exercises the locally-patched 15.0.0-gluten Arrow that no longer
   exists after the parent commits remove that artifact rename.

Removed:
- The mkdir + cp lines that staged Arrow jars after each native build
  (velox_backend_x86.yml, velox_backend_arm.yml, velox_backend_ansi.yml,
  velox_backend_enhanced.yml).
- The `arrow-jars-*` upload-artifact steps that published the staged copy.
- Every Download Arrow Jars / Download All Arrow Jar Artifacts step in
  every downstream test lane (~30 occurrences across the four PR-CI
  workflows).
- A leftover `ls -l .../arrow-dataset/15.0.0-gluten/` debug command in
  velox_backend_x86.yml that referenced the now-removed coordinate.

Untouched: velox_nightly.yml and build_bundle_package.yml — those build
release artifacts and may legitimately want to bake-in a specific Arrow.
@sezruby sezruby force-pushed the arrow-drop-gluten-rename branch from 2a5d395 to 4249dde Compare June 9, 2026 15:28
@github-actions github-actions Bot added the INFRA label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Run Gluten Clickhouse CI on x86

@philo-he philo-he changed the title from "[CORE] Drop 15.0.0-gluten Arrow version rename, depend on vanilla Apache Arrow" to "[CORE] Drop 15.0.0-gluten Arrow version rename and depend on vanilla Apache Arrow" on Jun 9, 2026

@philo-he philo-he left a comment

Member

Thanks.

@Jenkins-J

Contributor

@FelixYBW I've created a PR over in the arrow repository related to version 15 (apache/arrow#50138). Let me know if there are any modifications I should make to the PR.

@philo-he philo-he merged commit 3d585da into apache:main Jun 9, 2026
66 checks passed
@prestodb-ci

prestodb-ci commented Jun 9, 2026

Failed to get lakehouse/gluten rebase URL from queue item URL http://ci.ibm.prestodb.dev/queue/item/375492/: build not available after 6 attempts


Labels

BUILD CORE works for Gluten Core INFRA VELOX
