feat(index): support flattened JSON sub-doc indexing #7377

Draft
wirybeaver wants to merge 2 commits into lance-format:main from wirybeaver:lance-json

@wirybeaver wirybeaver commented Jun 19, 2026

Motivation

Lance JSON inverted indexes should preserve array/object structure so path queries over arrays of objects can distinguish exact positions from wildcards. Tantivy documents a known JSON-index flaw: because an array document is treated as a bag of terms, cart.product_type:sneakers AND cart.attributes.color:red can match a document where sneakers and red live in different array objects. Lance's current single-token-stream JSON index has the same class of problem.

This PR introduces FlattenedSubDocs mode for new JSON indexes, flattening each JSON row into one or more sub-documents so those terms must match inside the same flattened array element before the result is mapped back to the original Lance row id.

It also keeps existing JSON indexes readable. The new disable_cross_array_unnest parameter defaults to false; setting it to true is an opt-in tradeoff that limits memory use for documents with many sibling arrays, at the cost of exact matching across those arrays.

Summary

  • Add explicit JSON tokenizer modes: SingleDocument for existing JSON indexes and FlattenedSubDocs for new JSON indexes.
  • Flatten JSON arrays into multiple internal inverted-index sub-docs, each mapped back to the original Lance row id through DocSet.
  • Normalize bracketed JSON query paths into value tokens plus $idx constraint tokens, and deduplicate flattened JSON search results by row id using max score.
  • Add a persisted index parameter, disable_cross_array_unnest, defaulting to false. The default preserves exact Cartesian-product semantics. When set to true, sibling arrays are indexed independently to avoid sub-doc explosion, matching Pinot's DisableCrossArrayUnnest memory tradeoff.
  • Expose disable_cross_array_unnest through Rust, Python, and Java inverted-index params.
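
The row-id deduplication step in the summary can be pictured with a minimal sketch. It assumes each matching sub-doc yields a (row_id, score) pair; the function name and the tuple layout are illustrative and are not taken from this PR's code.

```rust
use std::collections::HashMap;

/// Collapse sub-doc hits into one hit per row, keeping the highest score.
/// `hits` holds hypothetical (row_id, score) pairs, one per matching sub-doc.
fn dedup_by_row_max_score(hits: &[(u64, f32)]) -> Vec<(u64, f32)> {
    let mut best: HashMap<u64, f32> = HashMap::new();
    for &(row_id, score) in hits {
        best.entry(row_id)
            .and_modify(|s| *s = (*s).max(score))
            .or_insert(score);
    }
    let mut out: Vec<(u64, f32)> = best.into_iter().collect();
    // Highest score first; ties are broken by row id so the order is deterministic.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    out
}
```

Two sub-docs of the same row therefore surface as a single result carrying the better of the two scores.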

Example 1: the basic idea

Raw Documents

doc-0

{"foo":[{"bar":["x","y"]}]}

doc-1

{"foo":[{"bar":["y"]},{"bar":"z"}]}

Flattened Documents

| Flatten ID | Flattened Path | Value |
|---|---|---|
| 0 | foo[0].bar[0] | "x" |
| 1 | foo[0].bar[1] | "y" |
| 2 | foo[0].bar[0] | "y" |
| 3 | foo[1].bar | "z" |

Token Dictionary

Each token is represented as: (path, type, value)

| Path | Type | Value | Flatten Posting |
|---|---|---|---|
| foo..bar. | str | x | 0 |
| foo..bar. | str | y | 1, 2 |
| foo..bar | str | z | 3 |
| foo$idx | num | 0 | 0, 1, 2 |
| foo$idx | num | 1 | 3 |
| foo..bar$idx | num | 0 | 0, 2 |
| foo..bar$idx | num | 1 | 1 |

Notes

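  • Each array index [<idx>] in a path is compressed to a single . as the path is walked from left to right. For example, foo[1].bar becomes foo..bar, and foo[0].bar[1] becomes foo..bar. with a trailing dot.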
  • foo$idx represents the index of the element in the foo array.
  • foo..bar$idx represents the index inside the bar array.

Search Query A

Query:

foo[0].bar[0] = "y"

Lookup tokens:

| Token | Flatten Posting |
|---|---|
| (foo..bar., str, "y") | 1, 2 |
| (foo..bar$idx, num, 0) | 0, 2 |
| (foo$idx, num, 0) | 0, 1, 2 |

Flatten Posting Intersection:

{1, 2} ∩ {0, 2} ∩ {0, 1, 2} = {2}

Lookup original document:

Flatten ID 2 -> Doc ID 1
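
The intersect-then-map step of Query A can be sketched as below. This is an illustration only: posting lists are plain vectors of flatten ids, and the flatten-id-to-doc-id mapping is a slice indexed by flatten id, standing in for whatever DocSet does internally.

```rust
use std::collections::BTreeSet;

/// Intersect posting lists of flattened sub-doc ids.
fn intersect_postings(postings: &[Vec<u32>]) -> BTreeSet<u32> {
    let mut lists = postings.iter();
    let mut acc: BTreeSet<u32> = match lists.next() {
        Some(first) => first.iter().copied().collect(),
        None => return BTreeSet::new(),
    };
    for list in lists {
        let other: BTreeSet<u32> = list.iter().copied().collect();
        acc = acc.intersection(&other).copied().collect();
    }
    acc
}

/// Map flatten ids back to original doc ids; the set removes duplicates.
/// `flatten_to_doc[i]` is the doc id that flatten id `i` came from.
fn to_doc_ids(flatten_ids: &BTreeSet<u32>, flatten_to_doc: &[u32]) -> BTreeSet<u32> {
    flatten_ids
        .iter()
        .map(|&flatten_id| flatten_to_doc[flatten_id as usize])
        .collect()
}
```

For Example 1 the mapping is [0, 0, 1, 1] (flatten ids 0 and 1 come from doc-0, 2 and 3 from doc-1), so intersecting [1, 2], [0, 2] and [0, 1, 2] leaves flatten id 2, which maps to doc id 1.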

Search Query B

Query:

foo[0].bar[*] = "y"

Lookup tokens:

| Token | Flatten Posting |
|---|---|
| (foo..bar., str, "y") | 1, 2 |
| (foo$idx, num, 0) | 0, 1, 2 |

Notes

  • A [*] in the query path matches any array index at that position, so no prefix$idx constraint token is generated for it.

Flatten Posting Intersection:

{1, 2} ∩ {0, 1, 2} = {1, 2}

Lookup original document:

Flatten ID [1, 2] -> Doc ID [0, 1]
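
How the lookup tokens of Query A and Query B follow from the bracketed query path can be sketched as follows. The function name and the "path,type,value" string encoding of a token are assumptions made for illustration; the PR's tokenizer may represent tokens differently.

```rust
/// Turn a bracketed query path into lookup tokens, each encoded as
/// "path,type,value". Returns None if a bracket is unbalanced or an index
/// is neither a number nor `*`.
fn lookup_tokens(query_path: &str, value_type: &str, value: &str) -> Option<Vec<String>> {
    let mut token_path = String::new();
    let mut tokens = Vec::new();
    let mut rest = query_path;
    while let Some(open) = rest.find('[') {
        let close = open + rest[open..].find(']')?;
        token_path.push_str(&rest[..open]);
        let index = &rest[open + 1..close];
        if index != "*" {
            // A concrete index adds a `$idx` constraint token for this position.
            let n: usize = index.parse().ok()?;
            tokens.push(format!("{}$idx,num,{}", token_path, n));
        }
        // Every bracket, concrete or wildcard, is compressed to a single dot.
        token_path.push('.');
        rest = &rest[close + 1..];
    }
    token_path.push_str(rest);
    // The value token goes first, followed by the index constraints.
    tokens.insert(0, format!("{},{},{}", token_path, value_type, value));
    Some(tokens)
}
```

With this sketch, foo[0].bar[0] yields the value token plus two $idx constraints, while foo[0].bar[*] yields the value token plus only the foo$idx constraint, matching the two lookup tables above.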

Example 2: sibling arrays

{"foo":[{"bar":["x","y"]},{"bar":["a","b"]}],"foo2":["u"]}
{"foo":[{"bar":["y","z"]}],"foo2":["u"]}

Expected flattened sub-docs:

row0/sub0: foo$idx=0, foo..bar$idx=0, foo..bar.=x, foo2$idx=0, foo2.=u
row0/sub1: foo$idx=0, foo..bar$idx=1, foo..bar.=y, foo2$idx=0, foo2.=u
row0/sub2: foo$idx=1, foo..bar$idx=0, foo..bar.=a, foo2$idx=0, foo2.=u
row0/sub3: foo$idx=1, foo..bar$idx=1, foo..bar.=b, foo2$idx=0, foo2.=u
row1/sub0: foo$idx=0, foo..bar$idx=0, foo..bar.=y, foo2$idx=0, foo2.=u
row1/sub1: foo$idx=0, foo..bar$idx=1, foo..bar.=z, foo2$idx=0, foo2.=u

Query behavior:

foo[0].bar[0],str,y -> row1 only
foo[0].bar[*],str,y -> row0 and row1
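
The sub-docs listed for Example 2 can be reproduced with a small recursive flattener. This is a sketch under stated assumptions, not the PR's tokenizer: the Json enum is defined locally to avoid a dependency, sub-docs are shown as "path=value" strings, and an empty array is assumed to contribute no tokens rather than dropping the row.

```rust
/// Minimal JSON value, defined locally to keep the sketch dependency-free.
#[allow(dead_code)] // `Num` is not used by the Example 2 data below.
enum Json {
    Str(String),
    Num(i64),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// One flattened sub-doc, shown as "path=value" tokens.
type SubDoc = Vec<String>;

/// Flatten `value` (located at `path`) into sub-docs using Cartesian-product
/// semantics, i.e. the default disable_cross_array_unnest=false behaviour.
fn flatten(value: &Json, path: &str) -> Vec<SubDoc> {
    match value {
        Json::Str(s) => vec![vec![format!("{}={}", path, s)]],
        Json::Num(n) => vec![vec![format!("{}={}", path, n)]],
        Json::Arr(items) => {
            let mut out = Vec::new();
            for (i, item) in items.iter().enumerate() {
                // The array index is compressed to "." in the element path.
                for mut sub in flatten(item, &format!("{}.", path)) {
                    sub.insert(0, format!("{}$idx={}", path, i));
                    out.push(sub);
                }
            }
            if out.is_empty() {
                out.push(Vec::new()); // assumption: an empty array adds no tokens
            }
            out
        }
        Json::Obj(fields) => {
            // Cartesian product of the sub-docs produced by each field.
            let mut acc: Vec<SubDoc> = vec![Vec::new()];
            for (key, child) in fields {
                let child_path = if path.is_empty() {
                    key.clone()
                } else {
                    format!("{}.{}", path, key)
                };
                let parts = flatten(child, &child_path);
                let mut next = Vec::with_capacity(acc.len() * parts.len());
                for left in &acc {
                    for right in &parts {
                        let mut merged = left.clone();
                        merged.extend(right.iter().cloned());
                        next.push(merged);
                    }
                }
                acc = next;
            }
            acc
        }
    }
}

/// Build an array of strings.
fn strs(items: &[&str]) -> Json {
    Json::Arr(items.iter().map(|s| Json::Str(s.to_string())).collect())
}

/// The first raw document of Example 2.
fn example_row0() -> Json {
    let bar = |items: &[&str]| Json::Obj(vec![("bar".to_string(), strs(items))]);
    Json::Obj(vec![
        ("foo".to_string(), Json::Arr(vec![bar(&["x", "y"]), bar(&["a", "b"])])),
        ("foo2".to_string(), strs(&["u"])),
    ])
}
```

Calling flatten(&example_row0(), "") returns four sub-docs in the same order as row0/sub0 through row0/sub3 above.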

Example 3: disable_cross_array_unnest

Input:

{"a":["x","y"],"b":["u","v"],"c":1}

Default disable_cross_array_unnest=false produces exact Cartesian-product sub-docs:

sub0: a$idx=0, a.=x, b$idx=0, b.=u, c=1
sub1: a$idx=0, a.=x, b$idx=1, b.=v, c=1
sub2: a$idx=1, a.=y, b$idx=0, b.=u, c=1
sub3: a$idx=1, a.=y, b$idx=1, b.=v, c=1

With disable_cross_array_unnest=true, sibling arrays are indexed independently:

sub0: a$idx=0, a.=x, c=1
sub1: a$idx=1, a.=y, c=1
sub2: b$idx=0, b.=u, c=1
sub3: b$idx=1, b.=v, c=1

This avoids combinatorial sub-doc growth. The cost is accuracy: a query that constrains values in more than one sibling array (for example a[*]=x together with b[*]=u) can no longer be answered exactly, because no single flattened sub-doc contains terms from both sibling arrays.
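
The growth difference between the two modes comes down to product versus sum. The helper below is a hypothetical back-of-the-envelope calculation for one object whose sibling arrays hold only scalars; it is not part of the PR, and the choice to count a row without arrays as one sub-doc is an assumption.

```rust
/// Number of sub-docs for one object with sibling scalar arrays of the
/// given lengths (hypothetical helper for estimating index size).
fn sub_doc_count(array_lens: &[usize], disable_cross_array_unnest: bool) -> usize {
    if array_lens.is_empty() {
        return 1; // assumption: a row without arrays is a single sub-doc
    }
    if disable_cross_array_unnest {
        // Sibling arrays are indexed independently: one sub-doc per element.
        array_lens.iter().sum()
    } else {
        // Cartesian product: one sub-doc per combination of elements.
        array_lens.iter().product()
    }
}
```

Example 3 happens to give 4 sub-docs in both modes (2 × 2 and 2 + 2). The gap opens quickly with more or longer arrays: three sibling arrays of 10 elements each give 1000 sub-docs by default and 30 with disable_cross_array_unnest=true.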

Example 4: Tantivy nested-object false positive

Tantivy documents this pitfall for JSON arrays: a document is a bag of terms, so cart.product_type:sneakers AND cart.attributes.color:red can match even when sneakers and red come from different objects in the same cart array. FlattenedSubDocs mode avoids that by preserving each array element as a separate internal inverted-index document.

Raw Documents

row0 should not match because sneakers and red are in different cart objects:

{"cart_id":3234234,"cart":[{"product_type":"sneakers","attributes":{"color":"white"}},{"product_type":"t-shirt","attributes":{"color":"red"}}]}

row1 should match because both terms are in the same cart object:

{"cart_id":3234235,"cart":[{"product_type":"sneakers","attributes":{"color":"red"}}]}

Flattened Sub-Docs

row0/sub0: cart$idx=0, cart..product_type=sneakers, cart..attributes.color=white, cart_id=3234234
row0/sub1: cart$idx=1, cart..product_type=t-shirt, cart..attributes.color=red, cart_id=3234234
row1/sub0: cart$idx=0, cart..product_type=sneakers, cart..attributes.color=red, cart_id=3234235

Correct Query Shape For Nested-Object Semantics

Use one JSON MatchQuery that contains all related JSON triplets separated by ;, and set Operator::And on that MatchQuery:

FtsQuery::Match(
    MatchQuery::new(
        "cart[*].product_type,str,sneakers;cart[*].attributes.color,str,red".to_string(),
    )
    .with_column(Some("json_field".to_string()))
    .with_operator(Operator::And),
)

Expected result:

row1 only

Here ; only separates JSON triplets. Operator::And is what requires all generated tokens to match, and because they are inside one MatchQuery, the intersection happens on flattened sub-doc ids before Lance maps matches back to row ids.

Anti-Example: Row-Level Boolean Composition

Do not express nested-object constraints as separate BooleanQuery.must children:

FtsQuery::Boolean(BooleanQuery {
    must: vec![
        MatchQuery::new("cart[*].product_type,str,sneakers".to_string())
            .with_column(Some("json_field".to_string()))
            .into(),
        MatchQuery::new("cart[*].attributes.color,str,red".to_string())
            .with_column(Some("json_field".to_string()))
            .into(),
    ],
    should: vec![],
    must_not: vec![],
})

Each child MatchQuery runs independently and returns Lance row ids. The outer boolean query then composes those row ids, so row0 can still match at row level even though no single cart object contains both terms.
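
The difference between the two query shapes can be simulated on the three flattened sub-docs of Example 4. The sketch models only the matching logic: sub-docs are (row id, tokens) pairs, and both function names are made up for illustration; no Lance API is involved.

```rust
use std::collections::BTreeSet;

/// A flattened sub-doc: the originating row id plus its "path=value" tokens.
type SubDoc = (u64, Vec<&'static str>);

/// The three sub-docs of Example 4 (cart_id tokens omitted for brevity).
fn cart_sub_docs() -> Vec<SubDoc> {
    vec![
        (0, vec!["cart$idx=0", "cart..product_type=sneakers", "cart..attributes.color=white"]),
        (0, vec!["cart$idx=1", "cart..product_type=t-shirt", "cart..attributes.color=red"]),
        (1, vec!["cart$idx=0", "cart..product_type=sneakers", "cart..attributes.color=red"]),
    ]
}

/// One MatchQuery with Operator::And: all terms must occur in the same
/// sub-doc, and only then is the hit mapped back to its row id.
fn rows_matching_in_one_sub_doc(sub_docs: &[SubDoc], terms: &[&str]) -> BTreeSet<u64> {
    sub_docs
        .iter()
        .filter(|(_, tokens)| terms.iter().all(|t| tokens.iter().any(|tok| tok == t)))
        .map(|(row, _)| *row)
        .collect()
}

/// Separate `must` children: each term is resolved to row ids on its own,
/// and the row-id sets are intersected afterwards.
fn rows_matching_per_term(sub_docs: &[SubDoc], terms: &[&str]) -> BTreeSet<u64> {
    let mut result: Option<BTreeSet<u64>> = None;
    for term in terms {
        let rows = rows_matching_in_one_sub_doc(sub_docs, &[*term]);
        result = Some(match result {
            None => rows,
            Some(acc) => acc.intersection(&rows).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

For the terms cart..product_type=sneakers and cart..attributes.color=red, the first function returns row 1 only, while the second returns rows 0 and 1, which is the false positive the anti-example warns about.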

Apache Pinot JSON Index References

Test Plan

  • PASS: cargo fmt --all --check
  • PASS: cargo test -p lance-index scalar::inverted::tokenizer::document_tokenizer::tests --lib
  • PASS: cargo test -p lance-index scalar::inverted::tokenizer::tests --lib
  • PASS: cargo test -p lance test_json_inverted_flattened_sub_doc_prevents_cross_object_match --lib
  • PASS: cargo test -p lance test_json_inverted_ --lib
  • PASS: cargo test -p lance test_auto_infer_lance_tokenizer --lib
  • PASS: cargo check -p lance-index --tests
  • PASS: cargo check -p lance --tests
  • PASS: cargo clippy --all --tests --benches -- -D warnings
  • PASS: cargo fmt --manifest-path ./lance-jni/Cargo.toml --all --check from java/
  • PASS: ./mvnw spotless:check from java/
  • PASS: ./mvnw test from java/
  • PASS: uv run pytest python/tests/test_scalar_index.py::test_json_inverted_match_query from python/
  • PASS: uv run make lint from python/

@github-actions github-actions bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels Jun 19, 2026
@github-actions

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions bot added the A-format (On-disk format: protos and format spec docs) label Jun 21, 2026
@wirybeaver wirybeaver changed the title from "feat(index): support array object paths in JSON FTS" to "feat(index): support flattened JSON sub-doc indexing" Jun 21, 2026
Add explicit JsonTokenizerMode values, with SingleDocument for existing JSON indexes and FlattenedSubDocs for new JSON indexes.

Flatten JSON arrays into multiple internal inverted-index sub-docs while mapping each sub-doc back to the original row id. Normalize bracketed JSON query paths into value tokens plus constraint tokens, and deduplicate flattened JSON search results by row id.

Add persisted disable_cross_array_unnest index metadata, defaulting to false. When set, flattened JSON tokenization indexes sibling arrays independently instead of producing their Cartesian product, matching Pinot's memory-saving DisableCrossArrayUnnest behavior.

Expose disable_cross_array_unnest through Rust, Python, and Java inverted-index params.

Test Plan:

PASS: cargo fmt --all --check

PASS: cargo test -p lance-index scalar::inverted::tokenizer::document_tokenizer::tests --lib

PASS: cargo test -p lance-index scalar::inverted::tokenizer::tests --lib

PASS: cargo test -p lance test_json_inverted_ --lib

PASS: cargo test -p lance test_auto_infer_lance_tokenizer --lib

PASS: cargo check -p lance-index --tests

PASS: cargo check -p lance --tests

PASS: cargo clippy --all --tests --benches -- -D warnings

PASS: PATH=/home/user/.cargo/bin:$PATH cargo fmt --manifest-path ./lance-jni/Cargo.toml --all --check (from java/)

PASS: ./mvnw spotless:check (from java/)

PASS: PATH=/home/user/.cargo/bin:$PATH ./mvnw test (from java/)

PASS: PATH=/home/user/.local/bin:$PATH uv run pytest python/tests/test_scalar_index.py::test_json_inverted_match_query (from python/)

PASS: PATH=/home/user/.local/bin:/home/user/.cargo/bin:$PATH uv run make lint (from python/)

NOTE: PATH=/home/user/.local/bin:$PATH make install (from python/) built pylance and installed deps, then failed at pre-commit install because core.hooksPath is set to /etc/git-hooks.
@github-actions github-actions bot added the A-python (Python bindings) and A-java (Java bindings + JNI) labels Jun 21, 2026