feat(index): support flattened JSON sub-doc indexing by wirybeaver · Pull Request #7377 · lance-format/lance

wirybeaver · 2026-06-19T18:20:25Z

Motivation

Lance JSON inverted indexes should preserve array/object structure so path queries over arrays of objects can distinguish exact positions from wildcards. Tantivy documents a known JSON-index flaw: because an array document is treated as a bag of terms, cart.product_type:sneakers AND cart.attributes.color:red can match a document where sneakers and red live in different array objects. Lance's current single-token-stream JSON index has the same class of problem.

This PR introduces FlattenedSubDocs mode for new JSON indexes, flattening each JSON row into one or more sub-documents so those terms must match inside the same flattened array element before the result is mapped back to the original Lance row id.

It also keeps existing JSON indexes readable and adds a disable_cross_array_unnest index parameter that defaults to false. Setting it to true is an opt-in tradeoff: it bounds memory use for documents with many sibling arrays, at the cost of exact matching across those arrays.

Summary

Add explicit JSON tokenizer modes: SingleDocument for existing JSON indexes and FlattenedSubDocs for new JSON indexes.
Flatten JSON arrays into multiple internal inverted-index sub-docs, each mapped back to the original Lance row id through DocSet.
Normalize bracketed JSON query paths into value tokens plus $idx constraint tokens, and deduplicate flattened JSON search results by row id using max score.
Add persisted/index param disable_cross_array_unnest, default false. The default preserves exact Cartesian-product semantics. When set to true, sibling arrays are indexed independently to avoid sub-doc explosion, matching Pinot's DisableCrossArrayUnnest memory tradeoff.
Expose disable_cross_array_unnest through Rust, Python, and Java inverted-index params.
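
To illustrate the "deduplicate by row id using max score" step from the summary, here is a minimal sketch. It is not the code in this PR: the function name `dedup_by_row_id` and the `(row_id, score)` tuple shape are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Collapse sub-doc hits into one hit per row, keeping the highest score.
/// `hits` holds hypothetical `(row_id, score)` pairs, one per matching sub-doc.
fn dedup_by_row_id(hits: &[(u64, f32)]) -> Vec<(u64, f32)> {
    let mut best: HashMap<u64, f32> = HashMap::new();
    for &(row_id, score) in hits {
        best.entry(row_id)
            .and_modify(|s| *s = s.max(score))
            .or_insert(score);
    }
    let mut out: Vec<(u64, f32)> = best.into_iter().collect();
    // Highest score first; ties broken by row id so the order is deterministic.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    out
}

fn main() {
    // Two sub-docs of row 7 and two sub-docs of row 3 matched the query.
    let hits = [(7, 0.5), (3, 1.25), (7, 2.0), (3, 0.75)];
    for (row_id, score) in dedup_by_row_id(&hits) {
        println!("row {row_id}: score {score}");
    }
}
```

Taking the max rather than the sum keeps a row's score independent of how many sub-docs its arrays happened to produce.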

Example 1 - Used to explain the idea

Raw Documents

doc-0

{"foo":[{"bar":["x","y"]}]}

doc-1

{"foo":[{"bar":["y"]},{"bar":"z"}]}

Flattened Documents

Flatten ID	Doc ID	Flattened Path	Value
0	0	`foo[0].bar[0]`	`"x"`
1	0	`foo[0].bar[1]`	`"y"`
2	1	`foo[0].bar[0]`	`"y"`
3	1	`foo[1].bar`	`"z"`

Token Dictionary

Each token is represented as: (path, type, value)

Path	Type	Value	Flatten Posting
`foo..bar.`	str	`x`	`0`
`foo..bar.`	str	`y`	`1, 2`
`foo..bar`	str	`z`	`3`
`foo$idx`	num	`0`	`0, 1, 2`
`foo$idx`	num	`1`	`3`
`foo..bar$idx`	num	`0`	`0, 2`
`foo..bar$idx`	num	`1`	`1`

Notes

Walking the path from left to right, each array index [<idx>] is replaced by a single ., so foo[0].bar[1] becomes foo..bar. while foo[1].bar (where bar is not an array) becomes foo..bar.

foo$idx represents the index of the element in the foo array.

foo..bar$idx represents the index inside the bar array.
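
The path rules in these notes can be sketched as a small function. This is an illustration under assumptions, not the tokenizer from this PR: `NormalizedPath` and `normalize_path` are hypothetical names, and the sketch assumes keys never contain `[` or `]`. It also treats `[*]` as a position that emits no `$idx` constraint, as described for Search Query B.

```rust
/// Result of normalizing one bracketed JSON path (hypothetical type).
#[derive(Debug, PartialEq)]
struct NormalizedPath {
    /// Path used for the value token, e.g. `foo..bar.`
    value_path: String,
    /// `(prefix$idx, index)` constraint tokens, one per explicit `[<idx>]`.
    idx_constraints: Vec<(String, u64)>,
}

fn normalize_path(path: &str) -> Result<NormalizedPath, String> {
    let mut value_path = String::new();
    let mut idx_constraints = Vec::new();
    let mut rest = path;
    while let Some(open) = rest.find('[') {
        let close = rest[open..]
            .find(']')
            .map(|i| open + i)
            .ok_or_else(|| format!("unclosed '[' in {}", path))?;
        // Copy the literal segment that precedes the bracket.
        value_path.push_str(&rest[..open]);
        let inner = &rest[open + 1..close];
        if inner != "*" {
            let idx: u64 = inner
                .parse()
                .map_err(|_| format!("bad array index '{}' in {}", inner, path))?;
            // The constraint key is the normalized prefix up to this array.
            idx_constraints.push((format!("{}$idx", value_path), idx));
        }
        // Every bracket, explicit or wildcard, collapses to a single '.'.
        value_path.push('.');
        rest = &rest[close + 1..];
    }
    value_path.push_str(rest);
    Ok(NormalizedPath { value_path, idx_constraints })
}

fn main() {
    for path in ["foo[0].bar[1]", "foo[1].bar", "foo[0].bar[*]"] {
        println!("{} -> {:?}", path, normalize_path(path));
    }
}
```

Keying each constraint by the normalized prefix is what lets `foo$idx` and `foo..bar$idx` live in the same token dictionary as the value tokens.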

Search Query A

Query:

foo[0].bar[0] = "y"

Lookup tokens:

Token	Flatten Posting
`foo..bar.`, `str`, `"y"`	`1, 2`
`foo..bar$idx`, `num`, `0`	`0, 2`
`foo$idx`, `num`, `0`	`0, 1, 2`

Flatten Posting Intersection:

{1, 2} ∩ {0, 2} ∩ {0, 1, 2} = {2}

Lookup original document:

Flatten ID 2 -> Doc ID 1

Search Query B

Query:

foo[0].bar[*] = "y"

Lookup tokens:

Token	Flatten Posting
`foo..bar.`, `str`, `"y"`	`1, 2`
`foo$idx`, `num`, `0`	`0, 1, 2`

Notes

When a position in the search path is [*], no prefix$idx constraint token is generated for that position, so any array index matches there.

Flatten Posting Intersection:

{1, 2} ∩ {0, 1, 2} = {1, 2}

Lookup original document:

Flatten ID [1, 2] -> Doc ID [0, 1]
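
Both queries follow the same two steps: intersect the posting lists of the lookup tokens, then map the surviving flatten IDs back to doc IDs. A minimal sketch of those steps on the Example 1 data follows; `intersect` and `to_doc_ids` are hypothetical helpers written for this illustration, and the flatten-to-doc table is a plain slice rather than the DocSet used in the PR.

```rust
use std::collections::BTreeSet;

/// Intersect the posting lists of every lookup token.
fn intersect(postings: &[&[u32]]) -> BTreeSet<u32> {
    let mut iter = postings.iter();
    let mut acc: BTreeSet<u32> = match iter.next() {
        Some(first) => first.iter().copied().collect(),
        None => return BTreeSet::new(),
    };
    for list in iter {
        let other: BTreeSet<u32> = list.iter().copied().collect();
        acc = acc.intersection(&other).copied().collect();
    }
    acc
}

/// Map flatten IDs to original doc IDs; the set removes duplicates.
fn to_doc_ids(flatten_ids: &BTreeSet<u32>, flatten_to_doc: &[u32]) -> BTreeSet<u32> {
    flatten_ids
        .iter()
        .map(|&f| flatten_to_doc[f as usize])
        .collect()
}

fn main() {
    // Flatten IDs 0 and 1 come from doc-0; 2 and 3 come from doc-1.
    let flatten_to_doc = [0, 0, 1, 1];
    // Posting lists copied from the token dictionary above.
    let bar_y: &[u32] = &[1, 2]; // (foo..bar., str, y)
    let bar_idx_0: &[u32] = &[0, 2]; // (foo..bar$idx, num, 0)
    let foo_idx_0: &[u32] = &[0, 1, 2]; // (foo$idx, num, 0)

    let a = intersect(&[bar_y, bar_idx_0, foo_idx_0]);
    println!("Query A: flatten {:?} -> docs {:?}", a, to_doc_ids(&a, &flatten_to_doc));

    let b = intersect(&[bar_y, foo_idx_0]);
    println!("Query B: flatten {:?} -> docs {:?}", b, to_doc_ids(&b, &flatten_to_doc));
}
```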

Example 2: sibling arrays

{"foo":[{"bar":["x","y"]},{"bar":["a","b"]}],"foo2":["u"]}
{"foo":[{"bar":["y","z"]}],"foo2":["u"]}

Expected flattened sub-docs:

row0/sub0: foo$idx=0, foo..bar$idx=0, foo..bar.=x, foo2$idx=0, foo2.=u
row0/sub1: foo$idx=0, foo..bar$idx=1, foo..bar.=y, foo2$idx=0, foo2.=u
row0/sub2: foo$idx=1, foo..bar$idx=0, foo..bar.=a, foo2$idx=0, foo2.=u
row0/sub3: foo$idx=1, foo..bar$idx=1, foo..bar.=b, foo2$idx=0, foo2.=u
row1/sub0: foo$idx=0, foo..bar$idx=0, foo..bar.=y, foo2$idx=0, foo2.=u
row1/sub1: foo$idx=0, foo..bar$idx=1, foo..bar.=z, foo2$idx=0, foo2.=u

Query behavior:

foo[0].bar[0],str,y -> row1 only
foo[0].bar[*],str,y -> row0 and row1

Example 3: `disable_cross_array_unnest`

Input:

{"a":["x","y"],"b":["u","v"],"c":1}

Default disable_cross_array_unnest=false produces exact Cartesian-product sub-docs:

sub0: a$idx=0, a.=x, b$idx=0, b.=u, c=1
sub1: a$idx=0, a.=x, b$idx=1, b.=v, c=1
sub2: a$idx=1, a.=y, b$idx=0, b.=u, c=1
sub3: a$idx=1, a.=y, b$idx=1, b.=v, c=1

With disable_cross_array_unnest=true, sibling arrays are indexed independently:

sub0: a$idx=0, a.=x, c=1
sub1: a$idx=1, a.=y, c=1
sub2: b$idx=0, b.=u, c=1
sub3: b$idx=1, b.=v, c=1

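This avoids combinatorial sub-doc growth: the number of sub-docs becomes the sum of the sibling array lengths instead of their product. The cost is accuracy for queries that constrain values across multiple sibling arrays, because no single flattened sub-doc contains terms from both arrays, so such a query can fail to match a row that does contain both values.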
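
The two modes can be compared with a small sketch. It is a simplification made for illustration, not the flattening code in this PR: it handles only top-level arrays of strings plus scalar fields, the `flatten` and `element` functions are hypothetical, and skipping empty arrays is an assumption of the sketch.

```rust
/// Tokens contributed by element `i` of the top-level array `name`.
fn element(name: &str, i: usize, v: &str) -> Vec<String> {
    vec![format!("{}$idx={}", name, i), format!("{}.={}", name, v)]
}

/// Build sub-docs (as token lists) for top-level sibling arrays and scalars.
fn flatten(
    arrays: &[(&str, &[&str])],
    scalars: &[(&str, &str)],
    disable_cross_array_unnest: bool,
) -> Vec<Vec<String>> {
    let scalar_tokens: Vec<String> =
        scalars.iter().map(|(k, v)| format!("{}={}", k, v)).collect();
    let mut sub_docs: Vec<Vec<String>> = Vec::new();
    if disable_cross_array_unnest {
        // Independent arrays: one sub-doc per element, sum of lengths in total.
        for (name, values) in arrays {
            for (i, v) in values.iter().enumerate() {
                sub_docs.push(element(name, i, v));
            }
        }
    } else {
        // Cartesian product: product of array lengths in total.
        sub_docs.push(Vec::new());
        for (name, values) in arrays {
            if values.is_empty() {
                continue; // assumption: an empty array contributes nothing
            }
            let mut next = Vec::new();
            for partial in &sub_docs {
                for (i, v) in values.iter().enumerate() {
                    let mut doc = partial.clone();
                    doc.extend(element(name, i, v));
                    next.push(doc);
                }
            }
            sub_docs = next;
        }
    }
    if sub_docs.is_empty() {
        sub_docs.push(Vec::new()); // row with no array elements: scalars only
    }
    // Scalar fields are repeated in every sub-doc.
    for doc in &mut sub_docs {
        doc.extend(scalar_tokens.iter().cloned());
    }
    sub_docs
}

fn main() {
    let arrays: [(&str, &[&str]); 2] = [("a", &["x", "y"]), ("b", &["u", "v"])];
    let scalars = [("c", "1")];
    for disable in [false, true] {
        println!("disable_cross_array_unnest={}", disable);
        for (i, doc) in flatten(&arrays, &scalars, disable).iter().enumerate() {
            println!("  sub{}: {}", i, doc.join(", "));
        }
    }
}
```

With two arrays of length 2 both modes yield four sub-docs, but three arrays of length 10 would give 1000 sub-docs in the default mode and 30 with the flag set.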

Example 4: Tantivy nested-object false positive

Tantivy documents this pitfall for JSON arrays: a document is a bag of terms, so cart.product_type:sneakers AND cart.attributes.color:red can match even when sneakers and red come from different objects in the same cart array. FlattenedSubDocs mode avoids that by preserving each array element as a separate internal inverted-index document.

Raw Documents

row0 should not match because sneakers and red are in different cart objects:

{"cart_id":3234234,"cart":[{"product_type":"sneakers","attributes":{"color":"white"}},{"product_type":"t-shirt","attributes":{"color":"red"}}]}

row1 should match because both terms are in the same cart object:

{"cart_id":3234235,"cart":[{"product_type":"sneakers","attributes":{"color":"red"}}]}

Flattened Sub-Docs

row0/sub0: cart$idx=0, cart..product_type=sneakers, cart..attributes.color=white, cart_id=3234234
row0/sub1: cart$idx=1, cart..product_type=t-shirt, cart..attributes.color=red, cart_id=3234234
row1/sub0: cart$idx=0, cart..product_type=sneakers, cart..attributes.color=red, cart_id=3234235

Correct Query Shape For Nested-Object Semantics

Use one JSON MatchQuery that contains all related JSON triplets separated by ;, and set Operator::And on that MatchQuery:

FtsQuery::Match(
    MatchQuery::new(
        "cart[*].product_type,str,sneakers;cart[*].attributes.color,str,red".to_string(),
    )
    .with_column(Some("json_field".to_string()))
    .with_operator(Operator::And),
)

Expected result:

row1 only

Here ; only separates JSON triplets. Operator::And is what requires all generated tokens to match, and because they are inside one MatchQuery, the intersection happens on flattened sub-doc ids before Lance maps matches back to row ids.

Anti-Example: Row-Level Boolean Composition

Do not express nested-object constraints as separate BooleanQuery.must children:

FtsQuery::Boolean(BooleanQuery {
    must: vec![
        MatchQuery::new("cart[*].product_type,str,sneakers".to_string())
            .with_column(Some("json_field".to_string()))
            .into(),
        MatchQuery::new("cart[*].attributes.color,str,red".to_string())
            .with_column(Some("json_field".to_string()))
            .into(),
    ],
    should: vec![],
    must_not: vec![],
})

Each child MatchQuery runs independently and returns Lance row ids. The outer boolean query then composes those row ids, so row0 can still match at row level even though no single cart object contains both terms.
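
The difference between the two query shapes can be reproduced on the Example 4 sub-docs with a toy model. This sketch does not call Lance: `SubDoc`, `match_within_sub_doc`, and `match_at_row_level` are hypothetical names, and tokens are modeled as plain `path=value` strings.

```rust
use std::collections::BTreeSet;

/// One flattened sub-doc: the row it came from plus its tokens.
struct SubDoc {
    row_id: u64,
    tokens: Vec<&'static str>,
}

/// The three sub-docs listed under "Flattened Sub-Docs" (cart_id omitted).
fn example_sub_docs() -> Vec<SubDoc> {
    vec![
        SubDoc { row_id: 0, tokens: vec!["cart..product_type=sneakers", "cart..attributes.color=white"] },
        SubDoc { row_id: 0, tokens: vec!["cart..product_type=t-shirt", "cart..attributes.color=red"] },
        SubDoc { row_id: 1, tokens: vec!["cart..product_type=sneakers", "cart..attributes.color=red"] },
    ]
}

/// Models one MatchQuery with AND: all tokens must be in the same sub-doc.
fn match_within_sub_doc(sub_docs: &[SubDoc], required: &[&str]) -> BTreeSet<u64> {
    sub_docs
        .iter()
        .filter(|d| required.iter().all(|t| d.tokens.iter().any(|tok| *tok == *t)))
        .map(|d| d.row_id)
        .collect()
}

/// Models separate `must` children: each term resolves to row ids on its own,
/// and only those row-id sets are intersected.
fn match_at_row_level(sub_docs: &[SubDoc], required: &[&str]) -> BTreeSet<u64> {
    let mut result: Option<BTreeSet<u64>> = None;
    for term in required {
        let rows: BTreeSet<u64> = sub_docs
            .iter()
            .filter(|d| d.tokens.iter().any(|tok| *tok == *term))
            .map(|d| d.row_id)
            .collect();
        result = Some(match result {
            Some(acc) => acc.intersection(&rows).copied().collect(),
            None => rows,
        });
    }
    result.unwrap_or_default()
}

fn main() {
    let docs = example_sub_docs();
    let query = ["cart..product_type=sneakers", "cart..attributes.color=red"];
    println!("single MatchQuery: rows {:?}", match_within_sub_doc(&docs, &query));
    println!("BooleanQuery.must: rows {:?}", match_at_row_level(&docs, &query));
}
```

In this model the single-query shape returns only row 1, while the row-level shape also returns row 0, which is the false positive described above.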

Apache Pinot JSON Index References

JSON flattening behavior and array index keys: JsonUtils.flatten
Pinot disableCrossArrayUnnest config field: JsonIndexConfig
Mutable JSON index SPI, including write entrypoint and reader inheritance: MutableJsonIndex
Immutable JSON index writer SPI: JsonIndexCreator
Immutable JSON index reader SPI: JsonIndexReader
Immutable writer implementation: BaseJsonIndexCreator
Mutable implementation read/write behavior: MutableJsonIndexImpl
Immutable reader implementation: ImmutableJsonIndexReader

Test Plan

PASS: cargo fmt --all --check
PASS: cargo test -p lance-index scalar::inverted::tokenizer::document_tokenizer::tests --lib
PASS: cargo test -p lance-index scalar::inverted::tokenizer::tests --lib
PASS: cargo test -p lance test_json_inverted_flattened_sub_doc_prevents_cross_object_match --lib
PASS: cargo test -p lance test_json_inverted_ --lib
PASS: cargo test -p lance test_auto_infer_lance_tokenizer --lib
PASS: cargo check -p lance-index --tests
PASS: cargo check -p lance --tests
PASS: cargo clippy --all --tests --benches -- -D warnings
PASS: cargo fmt --manifest-path ./lance-jni/Cargo.toml --all --check from java/
PASS: ./mvnw spotless:check from java/
PASS: ./mvnw test from java/
PASS: uv run pytest python/tests/test_scalar_index.py::test_json_inverted_match_query from python/
PASS: uv run make lint from python/

github-actions · 2026-06-21T21:13:31Z

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

Start a vote following the Lance community voting process.
Format specification modifications need 3 binding +1 votes (excluding the
proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
Once the vote passes, link the completed vote in this PR. It should not be
merged until the vote is linked.

Add explicit JsonTokenizerMode values, with SingleDocument for existing JSON indexes and FlattenedSubDocs for new JSON indexes.
Flatten JSON arrays into multiple internal inverted-index sub-docs while mapping each sub-doc back to the original row id.
Normalize bracketed JSON query paths into value tokens plus constraint tokens, and deduplicate flattened JSON search results by row id.
Add persisted disable_cross_array_unnest index metadata, defaulting to false. When set, flattened JSON tokenization indexes sibling arrays independently instead of producing their Cartesian product, matching Pinot's memory-saving DisableCrossArrayUnnest behavior.
Expose disable_cross_array_unnest through Rust, Python, and Java inverted-index params.

Test Plan:

PASS: cargo fmt --all --check
PASS: cargo test -p lance-index scalar::inverted::tokenizer::document_tokenizer::tests --lib
PASS: cargo test -p lance-index scalar::inverted::tokenizer::tests --lib
PASS: cargo test -p lance test_json_inverted_ --lib
PASS: cargo test -p lance test_auto_infer_lance_tokenizer --lib
PASS: cargo check -p lance-index --tests
PASS: cargo check -p lance --tests
PASS: cargo clippy --all --tests --benches -- -D warnings
PASS: PATH=/home/user/.cargo/bin:$PATH cargo fmt --manifest-path ./lance-jni/Cargo.toml --all --check (from java/)
PASS: ./mvnw spotless:check (from java/)
PASS: PATH=/home/user/.cargo/bin:$PATH ./mvnw test (from java/)
PASS: PATH=/home/user/.local/bin:$PATH uv run pytest python/tests/test_scalar_index.py::test_json_inverted_match_query (from python/)
PASS: PATH=/home/user/.local/bin:/home/user/.cargo/bin:$PATH uv run make lint (from python/)
NOTE: PATH=/home/user/.local/bin:$PATH make install (from python/) built pylance and installed deps, then failed at pre-commit install because core.hooksPath is set to /etc/git-hooks.

github-actions Bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels Jun 19, 2026

wirybeaver force-pushed the lance-json branch from 1725477 to 7a59f30 Compare June 21, 2026 21:13

github-actions Bot added the A-format (On-disk format: protos and format spec docs) label Jun 21, 2026

wirybeaver changed the title ~~feat(index): support array object paths in JSON FTS~~ feat(index): support flattened JSON sub-doc indexing Jun 21, 2026

wirybeaver force-pushed the lance-json branch from 7a59f30 to 63a1201 Compare June 21, 2026 22:37

github-actions Bot added the A-python (Python bindings) and A-java (Java bindings + JNI) labels Jun 21, 2026

wirybeaver mentioned this pull request Jun 21, 2026 in: feat: Support JSON as data type #4516 (Open, 8 tasks)

test(index): cover JSON cross-object sub-doc matching

9a6a1ef

wirybeaver force-pushed the lance-json branch from 34afab6 to 9a6a1ef Compare June 21, 2026 23:33

wirybeaver wants to merge 2 commits into lance-format:main from wirybeaver:lance-json
