feat: add distributed zonemap index build with configurable segments#516
Conversation
hamersaw
left a comment
There was a problem hiding this comment.
This looks pretty good. Thanks for the PR! A few things that we should tighten up IMO.
416231b to
fbe05fb
Compare
|
Thanks for the thorough review! All feedback has been addressed in the latest push (force-pushed as a single clean commit on latest main): Scan-side changes removed entirely:
Index creation fixes:
|
29ef8a5 to
78bb5ea
Compare
|
@hamersaw All review feedback has been addressed — the key change since your last review is simplifying |
|
@LuciferYang do you mind making a pass here? Specifically, I'm interested with how this compares to your proposal (#513) to support building distributed ZoneMap indexes. |
will give feedback later today. |
|
@hamersaw @beinan Had a closer look. After the May 12 force-push on #516, the two PRs are adjacent rather than overlapping.
The main difference between the two distributed paths is that For reference, sf=100 What I'd want to inherit from #516:
Suggested path:
One nit on #516, will leave inline: |
|
@LuciferYang Thanks for the thorough comparison — the side-by-side table is really helpful. You're right about
Agreed on the suggested path — happy to land #516 as the distributed foundation, then have #513 rebase its distributed path onto |
LuciferYang
left a comment
There was a problem hiding this comment.
Does the PR description also need to be updated?
LuciferYang
left a comment
There was a problem hiding this comment.
The "What Changed" section still references LanceScanBuilder.java, LanceScan.java, LanceInputPartition.java, LanceFragmentScanner.java, and LanceCountStarPartitionReader.java, plus the bullet "Add segmented zonemap scan support with Spark-side post-scan filtering fallback" in Summary. None of this is in the current diff.
others LGTM
Add zonemap as a new index type in CREATE INDEX DDL with distributed build support. Each segment is built in parallel on Spark executors and committed as a logical index on the driver. Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
804c1b9 to
5c18049
Compare
Sorry for my delay, just updated. can we merge this pr? @LuciferYang @hamersaw |
hamersaw
left a comment
There was a problem hiding this comment.
Thanks for the iterations on this!
Black-box end-to-end gate for lance-format#510: a standalone `CREATE INDEX ... USING zonemap` index (the lance-format#516 build path) must feed per-column min/max to Spark's CBO, and the kill-switch must turn it off. Asserts the optimizer's row-count estimate for an impossible range predicate tightens only when column stats are reported — the proof the unit tests (ColumnStatsAggregatorTest) cannot give. - BaseZonemapColumnStatsTest (base): logic + two cases (standalone zonemap, num_segments>1 segmented build). - 3.4 / 3.5 runners; 4.0 / 4.1 inherit the 3.5 runner via build-helper. Requires rebase onto main for lance-format#516's USING zonemap support to run green.
Resolves conflict in AddIndexExec.scala between: - feat: support deferred index creation with WITH (train=false) - feat: add distributed zonemap index build with configurable segments (lance-format#516) The resolution preserves both intents: - ZONEMAP early-returns via useLogicalSegmentCommit before the train check, so train=false correctly applies only to BTree/FTS (non-segmented) paths. - createIndexJob is only called when train=true, avoiding wasted work.
…ance-format#516) ## Summary - Add zonemap as a new index type in `CREATE INDEX` DDL with distributed build support - Batch fragments into configurable segments via `num_segments` option (defaults to `spark.default.parallelism`) - Each segment is built in parallel on Spark executors and committed as a logical index on the driver - Zonemap indexes currently support single column only ## What Changed - `AddIndexExec.scala`: Zonemap-specific path with `ZonemapIndexJob`/`ZonemapIndexTask` and `commitIndexSegments` - `create-index.md`: Document zonemap index type, options, and usage - Tests: unit tests for segment creation/validation and integration test ## Notes - Rebased cleanly onto current `main` - Depends on lance-core `7.0.0-beta.10` or newer which includes zonemap segment support - Supersedes PR lance-format#473 and closed PR lance-format#466 ## Test plan - [x] CI passes (lint, unit tests, integration tests across all Spark/Scala versions) - [x] Zonemap index creation with default segment count - [x] Zonemap index creation with explicit `num_segments` - [x] Repeated zonemap index creation replaces existing segments - [x] Query correctness after zonemap index creation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Beinan Wang <beinanwang@microsoft.com>
Summary
CREATE INDEXDDL with distributed build supportnum_segmentsoption (defaults tospark.default.parallelism)What Changed
AddIndexExec.scala: Zonemap-specific path withZonemapIndexJob/ZonemapIndexTaskandcommitIndexSegmentscreate-index.md: Document zonemap index type, options, and usageNotes
main7.0.0-beta.10or newer which includes zonemap segment supportTest plan
num_segments🤖 Generated with Claude Code