feat(index): support flattened JSON sub-doc indexing #7377
Draft
wirybeaver wants to merge 2 commits into
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
Add explicit JsonTokenizerMode values, with SingleDocument for existing JSON indexes and FlattenedSubDocs for new JSON indexes. Flatten JSON arrays into multiple internal inverted-index sub-docs while mapping each sub-doc back to the original row id. Normalize bracketed JSON query paths into value tokens plus constraint tokens, and deduplicate flattened JSON search results by row id.

Add persisted disable_cross_array_unnest index metadata, defaulting to false. When set, flattened JSON tokenization indexes sibling arrays independently instead of producing their Cartesian product, matching Pinot's memory-saving DisableCrossArrayUnnest behavior. Expose disable_cross_array_unnest through Rust, Python, and Java inverted-index params.

Test Plan:
PASS: cargo fmt --all --check
PASS: cargo test -p lance-index scalar::inverted::tokenizer::document_tokenizer::tests --lib
PASS: cargo test -p lance-index scalar::inverted::tokenizer::tests --lib
PASS: cargo test -p lance test_json_inverted_ --lib
PASS: cargo test -p lance test_auto_infer_lance_tokenizer --lib
PASS: cargo check -p lance-index --tests
PASS: cargo check -p lance --tests
PASS: cargo clippy --all --tests --benches -- -D warnings
PASS: PATH=/home/user/.cargo/bin:$PATH cargo fmt --manifest-path ./lance-jni/Cargo.toml --all --check (from java/)
PASS: ./mvnw spotless:check (from java/)
PASS: PATH=/home/user/.cargo/bin:$PATH ./mvnw test (from java/)
PASS: PATH=/home/user/.local/bin:$PATH uv run pytest python/tests/test_scalar_index.py::test_json_inverted_match_query (from python/)
PASS: PATH=/home/user/.local/bin:/home/user/.cargo/bin:$PATH uv run make lint (from python/)
NOTE: PATH=/home/user/.local/bin:$PATH make install (from python/) built pylance and installed deps, then failed at pre-commit install because core.hooksPath is set to /etc/git-hooks.
Motivation
Lance JSON inverted indexes should preserve array/object structure so path queries over arrays of objects can distinguish exact positions from wildcards. Tantivy documents a known JSON-index flaw: because an array document is treated as a bag of terms, `cart.product_type:sneakers AND cart.attributes.color:red` can match a document where `sneakers` and `red` live in different array objects. Lance's current single-token-stream JSON index has the same class of problem.

This PR introduces `FlattenedSubDocs` mode for new JSON indexes, flattening each JSON row into one or more sub-documents so those terms must match inside the same flattened array element before the result is mapped back to the original Lance row id.

It also keeps existing JSON indexes readable and adds `disable_cross_array_unnest=false` by default, with an opt-in memory-safety tradeoff (i.e. `disable_cross_array_unnest=true`) for documents with many sibling arrays.

Summary
- Add explicit `JsonTokenizerMode` values, with `SingleDocument` for existing JSON indexes and `FlattenedSubDocs` for new JSON indexes.
- Flatten JSON arrays into multiple internal inverted-index sub-docs while mapping each sub-doc back to the original row id … `DocSet`.
- Normalize bracketed JSON query paths into value tokens plus `$idx` constraint tokens, and deduplicate flattened JSON search results by row id using max score.
- Add persisted index metadata `disable_cross_array_unnest`, default `false`. The default preserves exact Cartesian-product semantics. When set to `true`, sibling arrays are indexed independently to avoid sub-doc explosion, matching Pinot's `DisableCrossArrayUnnest` memory tradeoff.
- Expose `disable_cross_array_unnest` through Rust, Python, and Java inverted-index params.

Example 1 - Used to explain the idea
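Raw Documents

doc-0
`{"foo":[{"bar":["x","y"]}]}`

doc-1
`{"foo":[{"bar":["y"]},{"bar":"z"}]}`

Flattened Documents

| Sub-doc id | Source | Flattened path | Value |
| --- | --- | --- | --- |
| 0 | doc-0 | `foo[0].bar[0]` | `"x"` |
| 1 | doc-0 | `foo[0].bar[1]` | `"y"` |
| 2 | doc-1 | `foo[0].bar[0]` | `"y"` |
| 3 | doc-1 | `foo[1].bar` | `"z"` |

Token Dictionary

Each token is represented as: `(path, type, value)`

| Path | Value | Sub-doc ids |
| --- | --- | --- |
| `foo..bar.` | `x` | 0 |
| `foo..bar.` | `y` | 1, 2 |
| `foo..bar` | `z` | 3 |
| `foo$idx` | `0` | 0, 1, 2 |
| `foo$idx` | `1` | 3 |
| `foo..bar$idx` | `0` | 0, 2 |
| `foo..bar$idx` | `1` | 1 |

Search Query A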
Query:
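Lookup tokens:

| Token (path, type, value) | Sub-doc ids |
| --- | --- |
| `foo..bar.,str,"y"` | 1, 2 |
| `foo..bar$idx,num,0` | 0, 2 |
| `foo$idx,num,0` | 0, 1, 2 |

Flatten Posting Intersection: {1, 2} ∩ {0, 2} ∩ {0, 1, 2} = {2}

Lookup original document: sub-doc 2 maps back to doc-1.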
Search Query B
Query:
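Lookup tokens:

| Token (path, type, value) | Sub-doc ids |
| --- | --- |
| `foo..bar.,str,"y"` | 1, 2 |
| `foo$idx,num,0` | 0, 1, 2 |

Flatten Posting Intersection: {1, 2} ∩ {0, 1, 2} = {1, 2}

Lookup original document: sub-docs 1 and 2 map back to doc-0 and doc-1.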
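The two lookups in Example 1 can be replayed as plain set intersections. The following is a minimal sketch rather than Lance code: the posting lists are copied from the Token Dictionary, the sub-doc-to-document mapping is inferred from those postings, and the names `POSTINGS`, `SUB_DOC_TO_DOC`, and `search` are hypothetical.

```python
# Posting lists from the Token Dictionary: (path, value) -> flattened sub-doc ids.
POSTINGS = {
    ("foo..bar.", "x"): {0},
    ("foo..bar.", "y"): {1, 2},
    ("foo..bar", "z"): {3},
    ("foo$idx", 0): {0, 1, 2},
    ("foo$idx", 1): {3},
    ("foo..bar$idx", 0): {0, 2},
    ("foo..bar$idx", 1): {1},
}

# Assumed mapping from flattened sub-doc id back to the raw document.
SUB_DOC_TO_DOC = {0: "doc-0", 1: "doc-0", 2: "doc-1", 3: "doc-1"}


def search(tokens):
    """Intersect posting lists on sub-doc ids, then deduplicate by document."""
    matched = set.intersection(*(POSTINGS.get(t, set()) for t in tokens))
    return sorted({SUB_DOC_TO_DOC[sub_doc] for sub_doc in matched})


# Query A constrains both array positions; Query B constrains only foo's index.
QUERY_A = [("foo..bar.", "y"), ("foo..bar$idx", 0), ("foo$idx", 0)]
QUERY_B = [("foo..bar.", "y"), ("foo$idx", 0)]

print(search(QUERY_A))  # ['doc-1']
print(search(QUERY_B))  # ['doc-0', 'doc-1']
```

Because the intersection runs on sub-doc ids before the mapping step, dropping the `foo..bar$idx` constraint in Query B is what widens the match from doc-1 to both documents.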
Example 2: sibling arrays
`{"foo":[{"bar":["x","y"]},{"bar":["a","b"]}],"foo2":["u"]}`

`{"foo":[{"bar":["y","z"]}],"foo2":["u"]}`

Expected flattened sub-docs:
Query behavior:
Example 3: `disable_cross_array_unnest`

Input:
`{"a":["x","y"],"b":["u","v"],"c":1}`

Default `disable_cross_array_unnest=false` produces exact Cartesian-product sub-docs:

With `disable_cross_array_unnest=true`, sibling arrays are indexed independently:

This avoids combinatorial sub-doc growth. Queries that constrain values across multiple sibling arrays can lose accuracy because no single flattened sub-doc contains terms from both sibling arrays.
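The difference between the two settings can be sketched for the Example 3 input. This is an illustration under stated assumptions, not the PR's tokenizer: it handles only top-level arrays of scalars, it assumes scalar fields such as `c` are copied into every sub-doc, and the function name `flatten` and the `a[0]`-style keys are hypothetical.

```python
from itertools import product


def flatten(doc, disable_cross_array_unnest=False):
    """Flatten a one-level JSON object into sub-docs (simplified sketch)."""
    # Assumption: non-array fields are shared by every sub-doc.
    scalars = {k: v for k, v in doc.items() if not isinstance(v, list)}
    # One list per array; each entry holds a single element of that array.
    per_array = [
        [{f"{key}[{i}]": item} for i, item in enumerate(values)]
        for key, values in doc.items()
        if isinstance(values, list)
    ]
    if not per_array:
        return [scalars]
    if disable_cross_array_unnest:
        # Sibling arrays are indexed independently: one sub-doc per element.
        return [{**scalars, **elem} for elems in per_array for elem in elems]
    # Default: Cartesian product across sibling arrays.
    sub_docs = []
    for combo in product(*per_array):
        merged = dict(scalars)
        for elem in combo:
            merged.update(elem)
        sub_docs.append(merged)
    return sub_docs


DOC = {"a": ["x", "y"], "b": ["u", "v"], "c": 1}
CROSS = flatten(DOC)
INDEPENDENT = flatten(DOC, disable_cross_array_unnest=True)

for sub_doc in CROSS:
    print("cross:", sub_doc)
for sub_doc in INDEPENDENT:
    print("independent:", sub_doc)
```

For this input both modes happen to yield four sub-docs (2 × 2 versus 2 + 2), but the growth differs: three sibling arrays of ten elements each would give 1000 sub-docs by product and 30 when indexed independently. In the independent mode no sub-doc holds both an `a` and a `b` element, which is why cross-array constraints cannot be satisfied there.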
Example 4: Tantivy nested-object false positive
Tantivy documents this pitfall for JSON arrays: a document is a bag of terms, so `cart.product_type:sneakers AND cart.attributes.color:red` can match even when `sneakers` and `red` come from different objects in the same `cart` array. `FlattenedSubDocs` mode avoids that by preserving each array element as a separate internal inverted-index document.

Raw Documents

row0 should not match because `sneakers` and `red` are in different `cart` objects:
`{"cart_id":3234234,"cart":[{"product_type":"sneakers","attributes":{"color":"white"}},{"product_type":"t-shirt","attributes":{"color":"red"}}]}`

row1 should match because both terms are in the same `cart` object:
`{"cart_id":3234235,"cart":[{"product_type":"sneakers","attributes":{"color":"red"}}]}`

Flattened Sub-Docs
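As a rough illustration of why flattening removes the false positive, the sketch below models one sub-doc per `cart` element and compares an intersection on sub-doc ids with an intersection on row ids. The sub-doc layout, the dotted field names, and the helper names are assumptions for illustration only; `cart_id` and `$idx` tokens are left out for brevity.

```python
# (row id, flattened fields) for each cart element; assumed layout.
SUB_DOCS = [
    (0, {"cart.product_type": "sneakers", "cart.attributes.color": "white"}),
    (0, {"cart.product_type": "t-shirt", "cart.attributes.color": "red"}),
    (1, {"cart.product_type": "sneakers", "cart.attributes.color": "red"}),
]

TERMS = [("cart.product_type", "sneakers"), ("cart.attributes.color", "red")]


def sub_doc_ids(path, value):
    """Posting list for one term, expressed as flattened sub-doc ids."""
    return {i for i, (_, fields) in enumerate(SUB_DOCS) if fields.get(path) == value}


def to_rows(ids):
    """Map sub-doc ids back to row ids, deduplicating."""
    return {SUB_DOCS[i][0] for i in ids}


# Intersect on sub-doc ids first, then map to rows: both terms must share an element.
SUB_DOC_LEVEL = to_rows(set.intersection(*(sub_doc_ids(p, v) for p, v in TERMS)))

# Map each term to rows first, then intersect: terms may come from different elements.
ROW_LEVEL = set.intersection(*(to_rows(sub_doc_ids(p, v)) for p, v in TERMS))

print(SUB_DOC_LEVEL)  # {1}
print(ROW_LEVEL)      # {0, 1}
```

The row-level result includes row0 because `sneakers` and `red` each occur somewhere in that row, which is the bag-of-terms behavior the flattened mode is meant to avoid.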
Correct Query Shape For Nested-Object Semantics
Use one JSON `MatchQuery` that contains all related JSON triplets separated by `;`, and set `Operator::And` on that `MatchQuery`:

Expected result:

Here `;` only separates JSON triplets. `Operator::And` is what requires all generated tokens to match, and because they are inside one `MatchQuery`, the intersection happens on flattened sub-doc ids before Lance maps matches back to row ids.

Anti-Example: Row-Level Boolean Composition
Do not express nested-object constraints as separate `BooleanQuery.must` children:

Each child `MatchQuery` runs independently and returns Lance row ids. The outer boolean query then composes those row ids, so row0 can still match at row level even though no single `cart` object contains both terms.

Apache Pinot JSON Index References
- `disableCrossArrayUnnest` config field: JsonIndexConfig

Test Plan
- `cargo fmt --all --check`
- `cargo test -p lance-index scalar::inverted::tokenizer::document_tokenizer::tests --lib`
- `cargo test -p lance-index scalar::inverted::tokenizer::tests --lib`
- `cargo test -p lance test_json_inverted_flattened_sub_doc_prevents_cross_object_match --lib`
- `cargo test -p lance test_json_inverted_ --lib`
- `cargo test -p lance test_auto_infer_lance_tokenizer --lib`
- `cargo check -p lance-index --tests`
- `cargo check -p lance --tests`
- `cargo clippy --all --tests --benches -- -D warnings`
- `cargo fmt --manifest-path ./lance-jni/Cargo.toml --all --check` from `java/`
- `./mvnw spotless:check` from `java/`
- `./mvnw test` from `java/`
- `uv run pytest python/tests/test_scalar_index.py::test_json_inverted_match_query` from `python/`
- `uv run make lint` from `python/`