[spark] support building BTree index through procedure #6956

steFaiz · 2026-01-06T03:55:10Z

Purpose

Part of #6834

This PR adds support for creating a BTree index through the existing Spark CreateGlobalIndexProcedure.

The whole process is illustrated below:

First, all values of the indexed column, together with their row IDs, are scanned from the specified partitions.
All data is then range-shuffled and sorted by <partition, indexed field>.
Each Spark partition holds a key range that is disjoint from the others, and each writer can write key ranges for multiple Paimon partitions. The number of Spark partitions is controlled by the records-per-range and max-parallelism options.
Note that the effective number of records in each BTree file will not exactly equal records-per-range, for two reasons: (1) Spark's range shuffle is implemented through sampling; (2) if a Paimon partition spans multiple Spark partitions, the first and last output files may contain relatively few records (like the green-colored index writers in the picture above).
Finally, the driver collects all commit messages.
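
The steps above can be sketched without Spark. This is a rough in-memory illustration only: the class `BTreeRangePlanSketch`, the `IndexEntry` type, and the rule "ceil(total / records-per-range), capped at max-parallelism" are assumptions made for this example, and the exact slicing used here stands in for Spark's sampling-based range shuffle.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BTreeRangePlanSketch {

    /** Hypothetical record: Paimon partition, indexed value, row id. */
    static final class IndexEntry {
        final String partition;
        final long key;
        final long rowId;

        IndexEntry(String partition, long key, long rowId) {
            this.partition = partition;
            this.key = key;
            this.rowId = rowId;
        }
    }

    /** Assumed rule: ceil(total / recordsPerRange), clamped to [1, maxParallelism]. */
    static int rangeCount(long totalRecords, long recordsPerRange, int maxParallelism) {
        long wanted = (totalRecords + recordsPerRange - 1) / recordsPerRange;
        return (int) Math.max(1, Math.min(maxParallelism, wanted));
    }

    /** Sorts by <partition, key>, then cuts the sorted run into contiguous, disjoint ranges. */
    static List<List<IndexEntry>> plan(
            List<IndexEntry> entries, long recordsPerRange, int maxParallelism) {
        List<IndexEntry> sorted = new ArrayList<>(entries);
        sorted.sort(
                Comparator.comparing((IndexEntry e) -> e.partition)
                        .thenComparingLong(e -> e.key));

        int ranges = rangeCount(sorted.size(), recordsPerRange, maxParallelism);
        int base = sorted.size() / ranges;
        int extra = sorted.size() % ranges;

        List<List<IndexEntry>> out = new ArrayList<>();
        int from = 0;
        for (int i = 0; i < ranges; i++) {
            // The first `extra` ranges take one more record so sizes differ by at most one.
            int to = from + base + (i < extra ? 1 : 0);
            out.add(new ArrayList<>(sorted.subList(from, to)));
            from = to;
        }
        return out;
    }
}
```

Because a range is cut by position rather than by partition, one range may hold the tail of one Paimon partition and the head of the next, which is why a single writer has to handle multiple partitions.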

Tests

Please see org.apache.paimon.spark.procedure.CreateGlobalIndexProcedureTest for the unit tests.

API and Format

This PR does not modify any existing public API.

Documentation

Will be added ASAP

Copilot

Pull request overview

This PR introduces support for building BTree global indexes in Spark through the existing CreateGlobalIndexProcedure. BTree indexes provide efficient point lookups and range queries for high-cardinality data types like integers, doubles, and strings, complementing the existing Bitmap index implementation.

Key changes:

Implements a custom topology builder (BTreeIndexTopoBuilder) that uses Spark's range shuffle and sorting capabilities to distribute index building across partitions
Refactors GlobalIndexBuilder from concrete class to abstract class with separate implementations for default (bitmap) and BTree indexes
Adds configuration options for controlling BTree index parallelism and records per range
Fixes a serialization bug where the null check meant for extraFieldIds incorrectly checked the indexMeta field instead
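
To make the last item concrete, here is a minimal sketch of that kind of null-check mix-up. The `Meta` class and both `read*` methods are hypothetical stand-ins written for this example; they are not Paimon's actual serializer code, only an illustration of guarding one nullable field with a check on another.

```java
import java.util.ArrayList;
import java.util.List;

final class NullCheckSketch {

    /** Hypothetical metadata holder with two independently nullable fields. */
    static final class Meta {
        final byte[] indexMeta;
        final int[] extraFieldIds;

        Meta(byte[] indexMeta, int[] extraFieldIds) {
            this.indexMeta = indexMeta;
            this.extraFieldIds = extraFieldIds;
        }
    }

    private static List<Integer> toList(int[] ids) {
        List<Integer> out = new ArrayList<>();
        for (int id : ids) { // throws NullPointerException when ids is null
            out.add(id);
        }
        return out;
    }

    /** Buggy shape: the guard looks at indexMeta, but the value read is extraFieldIds. */
    static List<Integer> readBuggy(Meta m) {
        return m.indexMeta == null ? null : toList(m.extraFieldIds);
    }

    /** Fixed shape: the guard and the value refer to the same field. */
    static List<Integer> readFixed(Meta m) {
        return m.extraFieldIds == null ? null : toList(m.extraFieldIds);
    }
}
```

In this sketch the buggy version fails in two ways: it silently drops `extraFieldIds` when only `indexMeta` is null, and it throws when `indexMeta` is set but `extraFieldIds` is null.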

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| `BTreeIndexTopoBuilder.java` | Implements distributed BTree index building using Spark range shuffle and sort; orchestrates parallel index file generation |
| `BTreeGlobalIndexBuilder.java` | Handles per-partition BTree index file writing with automatic flushing based on record count or partition boundaries |
| `BTreeGlobalIndexBuilderFactory.java` | Factory class for creating BTree index builders and topology builders via service loader pattern |
| `IndexFieldsExtractor.java` | Utility class for extracting partition, index field, and row ID from records during index building |
| `GlobalIndexTopoBuilder.java` | Interface change to support custom topology builders with direct access to SparkSession and data sources |
| `GlobalIndexBuilderContext.java` | Enhanced context to support nullable partition info and full range tracking for BTree indexes |
| `GlobalIndexBuilder.java` | Refactored to abstract class with iterator-based build method, supporting both singleton and parallel writers |
| `DefaultGlobalIndexBuilder.java` | Extracted default (bitmap) index building logic from the original GlobalIndexBuilder |
| `CreateGlobalIndexProcedure.java` | Modified to support custom topology builders that bypass traditional shard-based splitting |
| `BTreeIndexOptions.java` | Adds configuration options for records per range and max parallelism; fixes typo in compression-level key |
| `IndexManifestEntrySerializer.java` | Fixes bug where null check incorrectly evaluated `indexMeta` instead of `extraFieldIds` |
| `IndexFileMetaSerializer.java` | Fixes same serialization bug as IndexManifestEntrySerializer |
| `CreateGlobalIndexProcedureTest.scala` | Adds comprehensive tests for BTree index creation with single and multiple partitions, including overlap detection |
| Service files | Registers BTreeGlobalIndexerFactory and BTreeGlobalIndexBuilderFactory for service loader discovery |
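
The flushing behavior described for `BTreeGlobalIndexBuilder.java` (a new file on reaching the record limit or on crossing a partition boundary) can be sketched as below. This is an illustrative stand-in under stated assumptions, not the actual builder: `FlushingWriterSketch`, its `write` method, and the string form of each "file" are invented for the example, and the input is assumed to be already sorted by <partition, key>.

```java
import java.util.ArrayList;
import java.util.List;

final class FlushingWriterSketch {

    /**
     * Walks records sorted by <partition, key> and closes the current file when either
     * the record limit is reached or the partition changes. Each output file is
     * described as "partition[firstKey..lastKey]".
     */
    static List<String> write(String[] partitions, long[] keys, int recordsPerFile) {
        List<String> files = new ArrayList<>();
        String current = null;
        long first = 0;
        long last = 0;
        int buffered = 0;

        for (int i = 0; i < keys.length; i++) {
            boolean partitionChanged = buffered > 0 && !partitions[i].equals(current);
            if (partitionChanged || buffered == recordsPerFile) {
                files.add(current + "[" + first + ".." + last + "]");
                buffered = 0;
            }
            if (buffered == 0) {
                // Start a new file for this record's partition.
                current = partitions[i];
                first = keys[i];
            }
            last = keys[i];
            buffered++;
        }
        if (buffered > 0) {
            files.add(current + "[" + first + ".." + last + "]");
        }
        return files;
    }
}
```

With partitions `a, a, a, b, b`, keys `1, 2, 3, 1, 5`, and a limit of 2 records per file, this yields `a[1..2]`, `a[3..3]`, `b[1..5]`. The short `a[3..3]` file shows how a partition boundary can leave a file with fewer records than the configured limit.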


...ark-ut/src/test/scala/org/apache/paimon/spark/procedure/CreateGlobalIndexProcedureTest.scala

...-common/src/main/java/org/apache/paimon/spark/globalindex/btree/BTreeGlobalIndexBuilder.java

...rk-common/src/main/java/org/apache/paimon/spark/globalindex/btree/BTreeIndexTopoBuilder.java

...ark-ut/src/test/scala/org/apache/paimon/spark/procedure/CreateGlobalIndexProcedureTest.scala

JingsongLi

+1 thanks @steFaiz

steFaiz added 2 commits January 6, 2026 11:48

[spark] support building BTree index through procedure

dbf17b5

minor fix

94008be

steFaiz closed this Jan 6, 2026

steFaiz reopened this Jan 6, 2026

fix scala type inference

851accd

JingsongLi requested a review from Copilot January 6, 2026 07:00

Copilot started reviewing on behalf of JingsongLi January 6, 2026 07:00

Copilot AI reviewed Jan 6, 2026


fix typo

2fa0831

steFaiz closed this Jan 6, 2026

steFaiz reopened this Jan 6, 2026

JingsongLi closed this Jan 7, 2026

JingsongLi reopened this Jan 7, 2026

JingsongLi approved these changes Jan 7, 2026


JingsongLi merged commit 22bc3ab into apache:master Jan 7, 2026
37 of 76 checks passed
