Skip to content

feat: add distributed zonemap index build#473

Closed
beinan wants to merge 7 commits into
lance-format:mainfrom
beinan:user/beinan/zonemap-create-index-distributed
Closed

feat: add distributed zonemap index build#473
beinan wants to merge 7 commits into
lance-format:mainfrom
beinan:user/beinan/zonemap-create-index-distributed

Conversation

@beinan

@beinan beinan commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR adds a distributed zonemap index build path for in .

It keeps the SQL surface unchanged, but changes the execution strategy from the driver-only/single-process flow to Spark-side fragment processing and logical segment commit.

This is intended as the distributed alternative to #466, which stays open as the simpler single-process solution.

What Changed

  • build zonemap index segments on Spark tasks and commit them through
  • support Spark-side zonemap pruning for segmented zonemap indices
  • batch multiple fragments into a single zonemap segment so task and segment counts stay bounded
  • keep zonemap restricted to single-column indexing
  • update docs and tests for distributed zonemap behavior

Validation

Tested with Spark 4.0 / Scala 2.13:

[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.lance:lance-spark-bundle-3.5_2.12:jar:0.4.0-beta.3
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-shade-plugin is missing. @ line 73, column 21
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.lance:lance-spark-bundle-3.4_2.12:jar:0.4.0-beta.3
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-shade-plugin is missing. @ line 76, column 21
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.lance:lance-spark-bundle-3.5_2.13:jar:0.4.0-beta.3
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-shade-plugin is missing. @ line 77, column 21
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.lance:lance-spark-bundle-3.4_2.13:jar:0.4.0-beta.3
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-shade-plugin is missing. @ line 78, column 21
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.lance:lance-spark-bundle-4.0_2.13:jar:0.4.0-beta.3
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-shade-plugin is missing. @ line 79, column 21
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.lance:lance-spark-bundle-4.1_2.13:jar:0.4.0-beta.3
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-shade-plugin is missing. @ line 79, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] lance-spark-root [pom]
[INFO] lance-spark-base_2.13 [jar]
[INFO] lance-spark-4.0_2.13 [jar]
[INFO]
[INFO] ---------------------< org.lance:lance-spark-root >---------------------
[INFO] Building lance-spark-root 0.4.0-beta.3 [1/3]
[INFO] from pom.xml
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- checkstyle:3.3.1:check (validate) @ lance-spark-root ---
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Starting audit...
Audit done.
[INFO] You have 0 Checkstyle violations.
[INFO]
[INFO] --- spotless:2.43.0:apply (spotless-check) @ lance-spark-root ---
[INFO]
[INFO] ------------------< org.lance:lance-spark-base_2.13 >-------------------
[INFO] Building lance-spark-base_2.13 0.4.0-beta.3 [2/3]
[INFO] from lance-spark-base_2.13/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[WARNING] 2 problems were encountered while building the effective model for org.apache.yetus:audience-annotations:jar:0.5.0 during dependency collection step for project (use -X to see details)
[INFO]
[INFO] --- checkstyle:3.3.1:check (validate) @ lance-spark-base_2.13 ---
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Starting audit...
Audit done.
[INFO] You have 0 Checkstyle violations.
[INFO]
[INFO] --- spotless:2.43.0:apply (spotless-check) @ lance-spark-base_2.13 ---
[INFO]
[INFO] --- build-helper:3.2.0:add-source (add-java-source) @ lance-spark-base_2.13 ---
[INFO] Source directory: /home/beinanwang/o/lance-spark/lance-spark-base_2.12/src/main/java added.
[INFO]
[INFO] --- resources:3.0.2:resources (default-resources) @ lance-spark-base_2.13 ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/beinanwang/o/lance-spark/lance-spark-base_2.13/src/main/resources
[INFO]
[INFO] --- scala:4.9.10:add-source (scala-compile-first) @ lance-spark-base_2.13 ---
[INFO]
[INFO] --- scala:4.9.10:compile (scala-compile-first) @ lance-spark-base_2.13 ---
[INFO] Compiler bridge file: /home/beinanwang/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.12.0-bin_2.13.16__65.0-1.12.0_20260104T231500.jar
[INFO] compile in 2.3 s
[INFO]
[INFO] --- compiler:3.8.1:compile (default-compile) @ lance-spark-base_2.13 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- compiler:3.8.1:compile (default) @ lance-spark-base_2.13 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- resources:3.0.2:testResources (default-testResources) @ lance-spark-base_2.13 ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 70 resources
[INFO]
[INFO] --- scala:4.9.10:testCompile (scala-test-compile) @ lance-spark-base_2.13 ---
[INFO] Compiler bridge file: /home/beinanwang/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.12.0-bin_2.13.16__65.0-1.12.0_20260104T231500.jar
[INFO] compile in 0.1 s
[INFO]
[INFO] --- compiler:3.8.1:testCompile (default-testCompile) @ lance-spark-base_2.13 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- surefire:3.2.5:test (default-test) @ lance-spark-base_2.13 ---
[INFO]
[INFO] -------------------< org.lance:lance-spark-4.0_2.13 >-------------------
[INFO] Building lance-spark-4.0_2.13 0.4.0-beta.3 [3/3]
[INFO] from lance-spark-4.0_2.13/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- checkstyle:3.3.1:check (validate) @ lance-spark-4.0_2.13 ---
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Starting audit...
Audit done.
[INFO] You have 0 Checkstyle violations.
[INFO]
[INFO] --- spotless:2.43.0:apply (spotless-check) @ lance-spark-4.0_2.13 ---
[INFO] Spotless.Java is keeping 1 files clean - 0 were changed to be clean, 0 were already clean, 1 were skipped because caching determined they were already clean
[INFO] Spotless.Scala is keeping 3 files clean - 0 were changed to be clean, 0 were already clean, 3 were skipped because caching determined they were already clean
[INFO]
[INFO] --- antlr4:4.13.1:antlr4 (default) @ lance-spark-4.0_2.13 ---
[INFO] No grammars to process
[INFO] ANTLR 4: Processing source directory /home/beinanwang/o/lance-spark/lance-spark-base_2.12/src/main/antlr4
[INFO]
[INFO] --- build-helper:3.2.0:add-source (add-java-source) @ lance-spark-4.0_2.13 ---
[INFO] Source directory: /home/beinanwang/o/lance-spark/lance-spark-base_2.12/src/main/java added.
[INFO] Source directory: /home/beinanwang/o/lance-spark/lance-spark-3.5_2.12/src/main/java added.
[INFO] Source directory: /home/beinanwang/o/lance-spark/lance-spark-4.0_2.13/src/main/java added.
[INFO] Source directory: /home/beinanwang/o/lance-spark/lance-spark-4.0_2.13/src/main/scala added.
[INFO] Source directory: /home/beinanwang/o/lance-spark/lance-spark-4.0_2.13/target/generated-sources/antlr4 added.
[INFO]
[INFO] --- resources:3.0.2:resources (default-resources) @ lance-spark-4.0_2.13 ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO]
[INFO] --- scala:4.9.10:compile (scala-compile-first) @ lance-spark-4.0_2.13 ---
[INFO] Compiler bridge file: /home/beinanwang/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.12.0-bin_2.13.16__65.0-1.12.0_20260104T231500.jar
[INFO] compile in 0.4 s
[INFO]
[INFO] --- compiler:3.8.1:compile (default-compile) @ lance-spark-4.0_2.13 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- build-helper:3.2.0:add-test-source (add-java-test-source) @ lance-spark-4.0_2.13 ---
[INFO] Test Source directory: /home/beinanwang/o/lance-spark/lance-spark-base_2.12/src/test/java added.
[INFO] Test Source directory: /home/beinanwang/o/lance-spark/lance-spark-3.5_2.12/src/test/java added.
[INFO] Test Source directory: /home/beinanwang/o/lance-spark/lance-spark-4.0_2.13/src/test/java added.
[INFO]
[INFO] --- resources:3.0.2:testResources (default-testResources) @ lance-spark-4.0_2.13 ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 70 resources
[INFO]
[INFO] --- scala:4.9.10:testCompile (scala-test-compile) @ lance-spark-4.0_2.13 ---
[INFO] Compiler bridge file: /home/beinanwang/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.12.0-bin_2.13.16__65.0-1.12.0_20260104T231500.jar
[INFO] compile in 0.1 s
[INFO]
[INFO] --- compiler:3.8.1:testCompile (default-testCompile) @ lance-spark-4.0_2.13 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- surefire:3.2.5:test (default-test) @ lance-spark-4.0_2.13 ---
[INFO] Using auto detected provider org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
[INFO]
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.lance.spark.update.CreateIndexStandardSyntaxTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.235 s -- in org.lance.spark.update.CreateIndexStandardSyntaxTest
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for lance-spark-root 0.4.0-beta.3:
[INFO]
[INFO] lance-spark-root ................................... SUCCESS [ 1.645 s]
[INFO] lance-spark-base_2.13 .............................. SUCCESS [ 4.527 s]
[INFO] lance-spark-4.0_2.13 ............................... SUCCESS [ 12.492 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.830 s
[INFO] Finished at: 2026-04-23T01:00:54Z
[INFO] ------------------------------------------------------------------------

Notes

@github-actions github-actions Bot added the enhancement New feature or request label Apr 23, 2026
beinan and others added 6 commits April 27, 2026 17:07
Simplify useLogicalSegmentCommit to only return true for ZONEMAP.
BTREE fragment-mode segment commit is not yet supported by the
lance-jni layer (missing page_lookup.lance files), so remove
the BTREE path and clean up related test assertions and docs.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Allow users to control the number of index segments created during
distributed zonemap builds via the num_segments DDL option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@beinan beinan force-pushed the user/beinan/zonemap-create-index-distributed branch from 6392ab6 to b18d409 Compare May 11, 2026 17:49
@beinan

beinan commented May 11, 2026

Copy link
Copy Markdown
Contributor Author

Superseded by #516 which is rebased cleanly onto current main.

@beinan beinan closed this May 11, 2026
hamersaw pushed a commit that referenced this pull request Jun 4, 2026
…516)

## Summary

- Add zonemap as a new index type in `CREATE INDEX` DDL with distributed
build support
- Batch fragments into configurable segments via `num_segments` option
(defaults to `spark.default.parallelism`)
- Each segment is built in parallel on Spark executors and committed as
a logical index on the driver
- Zonemap indexes currently support single column only

## What Changed

- `AddIndexExec.scala`: Zonemap-specific path with
`ZonemapIndexJob`/`ZonemapIndexTask` and `commitIndexSegments`
- `create-index.md`: Document zonemap index type, options, and usage
- Tests: unit tests for segment creation/validation and integration test

## Notes

- Rebased cleanly onto current `main`
- Depends on lance-core `7.0.0-beta.10` or newer which includes zonemap
segment support
- Supersedes PR #473 and closed PR #466

## Test plan

- [x] CI passes (lint, unit tests, integration tests across all
Spark/Scala versions)
- [x] Zonemap index creation with default segment count
- [x] Zonemap index creation with explicit `num_segments`
- [x] Repeated zonemap index creation replaces existing segments
- [x] Query correctness after zonemap index creation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Beinan Wang <beinanwang@microsoft.com>
ivscheianu pushed a commit to ivscheianu/lance-spark that referenced this pull request Jun 12, 2026
…ance-format#516)

## Summary

- Add zonemap as a new index type in `CREATE INDEX` DDL with distributed
build support
- Batch fragments into configurable segments via `num_segments` option
(defaults to `spark.default.parallelism`)
- Each segment is built in parallel on Spark executors and committed as
a logical index on the driver
- Zonemap indexes currently support single column only

## What Changed

- `AddIndexExec.scala`: Zonemap-specific path with
`ZonemapIndexJob`/`ZonemapIndexTask` and `commitIndexSegments`
- `create-index.md`: Document zonemap index type, options, and usage
- Tests: unit tests for segment creation/validation and integration test

## Notes

- Rebased cleanly onto current `main`
- Depends on lance-core `7.0.0-beta.10` or newer which includes zonemap
segment support
- Supersedes PR lance-format#473 and closed PR lance-format#466

## Test plan

- [x] CI passes (lint, unit tests, integration tests across all
Spark/Scala versions)
- [x] Zonemap index creation with default segment count
- [x] Zonemap index creation with explicit `num_segments`
- [x] Repeated zonemap index creation replaces existing segments
- [x] Query correctness after zonemap index creation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Beinan Wang <beinanwang@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant