perf(index): use NGram posting cardinalities for regex conjunction search #7390

Draft
everySympathy wants to merge 1 commit into lance-format:main from everySympathy:codex/fast-regex-search

everySympathy (Contributor) commented Jun 22, 2026

Summary

This PR improves regex conjunction search on the NGram index by making posting-list evaluation more cost-aware.

The existing regex NGram acceleration can derive required trigrams from regex patterns, but conjunction queries may still load and intersect posting lists without considering posting-list size. This PR adds posting-list cardinality metadata and uses rare-first ordering to reduce unnecessary work, especially when a required trigram is missing or when a regex contains many common required trigrams.
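To make "required trigrams" concrete, here is a toy sketch. It assumes the pattern is just literal runs separated by `.*`, as in the benchmark queries below. The PR does not show how the real index derives trigrams, so `required_trigrams` is a hypothetical helper, not the actual implementation.

```rust
use std::collections::BTreeSet;

/// Toy extraction (hypothetical helper): treats the pattern as literal runs
/// separated by `.*` and returns every distinct 3-character window of each run.
/// Any match of the pattern must contain all of these trigrams.
fn required_trigrams(pattern: &str) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    for literal in pattern.split(".*") {
        let chars: Vec<char> = literal.chars().collect();
        // Runs shorter than three characters yield no windows.
        for window in chars.windows(3) {
            out.insert(window.iter().collect::<String>());
        }
    }
    out
}
```

For example, `abcd.*xyz` yields the trigrams `abc`, `bcd`, and `xyz`, and a row can only match if all three are present.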

Changes

| Area | Change |
| --- | --- |
| Index format | Adds a cardinality column to NGram postings, storing the row count of each trigram posting list. |
| Query planning | Sorts required regex conjunction trigrams by cardinality before lookup. |
| Missing token handling | Short-circuits pure conjunction regex queries immediately when any required trigram is absent. |
| Bitmap intersection | Sorts loaded posting lists by actual bitmap cardinality before intersection, preserving rare-first CPU behavior after unordered async reads. |
| Compatibility | Keeps older two-column NGram postings readable when cardinality metadata is absent. |
| Remap/update | Recomputes cardinality when remapping posting lists and writes the destination in the current format. |
| Benchmarking | Adds a dedicated ngram_load_compat benchmark for old two-column postings load performance. |

Why

For a regex conjunction, every required trigram must be present. If one required trigram is absent, the candidate set is empty and there is no need to load/intersect other posting lists.

When all required trigrams are present, intersecting smaller posting lists first avoids cloning and intersecting large Roaring bitmaps early. This is the same selectivity principle used by inverted-index query planning: rare terms are more valuable filters than common terms.
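The two points above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ToyIndex` and its fields are hypothetical, and a `BTreeSet<u32>` stands in for a Roaring bitmap so the example needs only the standard library.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a Roaring bitmap of row ids (assumption for this sketch).
type Posting = BTreeSet<u32>;

/// Hypothetical in-memory index: cheap cardinality metadata plus posting lists.
struct ToyIndex {
    cardinalities: HashMap<String, u64>,
    postings: HashMap<String, Posting>,
}

impl ToyIndex {
    fn from_postings(postings: HashMap<String, Posting>) -> Self {
        let cardinalities = postings
            .iter()
            .map(|(token, list)| (token.clone(), list.len() as u64))
            .collect();
        Self { cardinalities, postings }
    }

    /// Returns the candidate rows and how many posting lists were loaded.
    /// Note: this toy returns an empty set for an empty `required` slice.
    fn conjunction(&self, required: &[&str]) -> (Posting, usize) {
        // 1. Consult metadata only: a missing trigram means no row can match.
        let mut planned: Vec<(&str, u64)> = Vec::with_capacity(required.len());
        for &token in required {
            match self.cardinalities.get(token) {
                Some(&count) => planned.push((token, count)),
                None => return (Posting::new(), 0),
            }
        }
        // 2. Rare first: the smallest posting lists are the strongest filters.
        planned.sort_by_key(|&(_, count)| count);
        // 3. Intersect in that order, stopping once the result is empty.
        let mut loaded = 0;
        let mut acc: Option<Posting> = None;
        for (token, _) in planned {
            let list = &self.postings[token];
            loaded += 1;
            let next: Posting = match acc {
                None => list.clone(),
                Some(current) => current.intersection(list).copied().collect(),
            };
            if next.is_empty() {
                return (next, loaded);
            }
            acc = Some(next);
        }
        (acc.unwrap_or_default(), loaded)
    }
}
```

In this sketch a query with one absent trigram loads zero posting lists, which mirrors the negative-conjunction case the PR targets. Starting from the rarest list also means only a small set is ever cloned.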

Benchmark Results

Regex Query Benchmarks

| Dataset | Query | Baseline | This PR | Change |
| --- | --- | --- | --- | --- |
| 200k rows | `commonmarkerabcdefghijklmnopqrstuvwx.*missingmarker` | ~854 us | ~451 us | ~47% faster |
| 200k rows | `commoncommon.*missingmarker` | ~487 us | ~402 us | ~17% faster |
| 200k rows | `densecommonprefix.*longrareanchor.*densecommonsuffix` | ~2.85 ms | ~2.62 ms | ~8% faster |
| 200k rows | `commoncommon.*raremarker` | ~3.80 ms | ~3.68 ms | ~3% faster |
| 10M rows | `commonmarkerabcdefghijklmnopqrstuvwx.*missingmarker` | ~10.60 ms | ~0.49 ms | ~21.8x faster |

The biggest improvement is in negative conjunction cases where a regex contains many common required trigrams plus a missing required trigram. The new path can return an empty candidate set before loading large common posting lists.

Old Index Compatibility Benchmark

A new benchmark was added for loading older two-column NGram postings without cardinality metadata:

cargo bench -p lance-index --bench ngram_load_compat -- old_two_column_postings

| Benchmark | Baseline | This PR |
| --- | --- | --- |
| old_two_column_postings(50000) | [1.9769 ms, 2.0122 ms, 2.0450 ms] | [1.9578 ms, 1.9714 ms, 1.9852 ms] |

The two ranges overlap, with this PR's slightly lower, so the schema-based compatibility path does not regress old-format index loading.

Compatibility

| Index format | Behavior |
| --- | --- |
| New three-column postings: tokens, cardinality, posting_list | Uses cardinality metadata for regex conjunction planning. |
| Old two-column postings: tokens, posting_list | Still readable; token_cardinalities is None, and the index falls back safely. |

The old-format path now checks the index file schema before choosing the projection, instead of attempting a read that may fail.
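A minimal sketch of that schema check, assuming the schema is available as a list of column names. The column names come from the table above; `choose_projection` and its return shape are hypothetical and do not reflect the actual lance-index API.

```rust
// Column names as listed in the compatibility table.
const TOKENS: &str = "tokens";
const CARDINALITY: &str = "cardinality";
const POSTING_LIST: &str = "posting_list";

/// Hypothetical helper: picks the columns to read based on the file schema,
/// and reports whether cardinality metadata is available.
fn choose_projection(schema_columns: &[&str]) -> (Vec<&'static str>, bool) {
    let has_cardinality = schema_columns.iter().any(|c| *c == CARDINALITY);
    if has_cardinality {
        // New three-column format: cardinality can drive query planning.
        (vec![TOKENS, CARDINALITY, POSTING_LIST], true)
    } else {
        // Old two-column format: read what exists, plan without cardinality.
        (vec![TOKENS, POSTING_LIST], false)
    }
}
```

Deciding from the schema up front avoids issuing a read for a column that an old index file does not contain.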

Testing

  • cargo fmt --all
  • cargo test -p lance-index ngram
  • cargo test -p lance-index --bench ngram_load_compat
  • cargo bench -p lance-index --bench ngram_load_compat -- old_two_column_postings
  • cargo clippy --all --tests --benches -- -D warnings

github-actions bot added the A-index (Vector index, linalg, tokenizer) and performance labels on Jun 22, 2026

codecov Bot commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 98.13084% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/ngram.rs | 98.13% | 2 Missing and 2 partials ⚠️ |
