feat(search): BM25Index — build-once / query-many lexical index by thorwhalen · Pull Request #23 · i2mint/vd

thorwhalen · 2026-06-06T11:01:29Z

bm25_lexical_search re-tokenizes every document and recomputes document frequencies and lengths on every call: O(N) work per query, O(N·Q) for a batch of Q queries over N documents. This does not scale. A batch lexical/hybrid pass over a ~16.8k-doc collection did not finish within 10 minutes of CPU time, which surfaced during the ir eval run.

Change: extract a reusable BM25Index that builds the query-independent term statistics (df, lengths, mean length, per-doc tokens) once in __init__, then answers many queries via .search(query, *, limit, k1, b). bm25_lexical_search becomes a thin one-shot wrapper, BM25Index(collection, filter=...).search(...), so existing behavior is preserved exactly (covered by a new equivalence test). The PR also exposes vd.BM25Index and adds reuse, filter, and empty-query tests plus a doctest.
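To illustrate the build-once / query-many split, here is a minimal sketch, not the vd implementation. It assumes the collection is a plain {doc_id: text} dict, a simple regex tokenizer, a callable filter(doc_id, text), and one common BM25 variant with default k1=1.5 and b=0.75. The names SketchBM25Index, sketch_bm25_search, and tokenize are hypothetical, and vd's actual collection type, filter semantics, defaults, and scoring details may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; vd's tokenizer may differ)."""
    return re.findall(r"\w+", text.lower())


class SketchBM25Index:
    """Illustrative build-once / query-many BM25 index over {doc_id: text}."""

    def __init__(self, docs, filter=None):
        # Query-independent statistics are computed exactly once, here.
        # `filter` is a hypothetical callable(doc_id, text) -> bool.
        self.tokens = {
            doc_id: Counter(tokenize(text))
            for doc_id, text in docs.items()
            if filter is None or filter(doc_id, text)
        }
        self.lengths = {d: sum(tf.values()) for d, tf in self.tokens.items()}
        self.n_docs = len(self.tokens)
        total = sum(self.lengths.values())
        self.mean_length = total / self.n_docs if self.n_docs else 0.0
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tokens.values() for term in tf)

    def search(self, query, *, limit=10, k1=1.5, b=0.75):
        """Return up to `limit` (doc_id, score) pairs, best first."""
        terms = set(tokenize(query))
        if not terms or not self.n_docs:
            return []
        avg = self.mean_length or 1.0  # guard against all-empty documents
        scores = {}
        for doc_id, tf in self.tokens.items():
            norm = 1 - b + b * self.lengths[doc_id] / avg
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f:
                    df = self.df[term]
                    idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
                    score += idf * f * (k1 + 1) / (f + k1 * norm)
            if score > 0:
                scores[doc_id] = score
        # Sort by descending score, breaking ties by doc_id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


def sketch_bm25_search(docs, query, *, filter=None, limit=10, k1=1.5, b=0.75):
    """One-shot wrapper: build an index, answer one query, discard it."""
    return SketchBM25Index(docs, filter=filter).search(
        query, limit=limit, k1=k1, b=b
    )


docs = {
    "a": "the quick brown fox",
    "b": "lazy dogs sleep all day",
    "c": "a quick red fox jumps",
}
index = SketchBM25Index(docs)          # pay the O(N) build cost once
hits = index.search("quick fox")       # each later query reuses the statistics
```

Because the wrapper delegates to the same class, the one-shot and reusable paths return identical results by construction, which is the property an equivalence test would check.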

First customer: i2mint/ir (ir.retrieve caches a BM25Index per corpus). 49 hybrid tests + 8 doctests pass.
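One way a caller could cache an index per corpus is sketched below. This is an assumption about the general pattern, not a description of how ir.retrieve does it. IndexCache and the stand-in builder are hypothetical names, and the builder is a placeholder for whatever constructs the real index.

```python
class IndexCache:
    """Build an index once per corpus key and reuse it on later lookups."""

    def __init__(self, build_index):
        self._build_index = build_index  # callable: corpus -> index
        self._indexes = {}
        self.builds = 0  # counts how many times the builder actually ran

    def get(self, corpus_key, corpus):
        # Only the first lookup for a given key pays the build cost.
        if corpus_key not in self._indexes:
            self._indexes[corpus_key] = self._build_index(corpus)
            self.builds += 1
        return self._indexes[corpus_key]

    def invalidate(self, corpus_key):
        # Call this when the corpus changes, or the cached index goes stale.
        self._indexes.pop(corpus_key, None)


# Stand-in builder for illustration; a real caller would build a BM25 index.
cache = IndexCache(build_index=lambda corpus: {"n_docs": len(corpus)})
corpus = {"a": "the quick brown fox", "b": "lazy dogs sleep all day"}

first = cache.get("demo", corpus)
second = cache.get("demo", corpus)  # served from the cache, no rebuild
```

The trade-off is the usual one for caches: the index must be invalidated or rebuilt whenever the underlying corpus changes.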

Closes #22

bm25_lexical_search re-tokenized every document and recomputed document frequencies/lengths on every call (O(N) per query, O(N*Q) for a batch), which did not scale. Extract a reusable BM25Index that builds the query-independent term statistics once and answers many queries via .search(query, *, limit, k1, b); bm25_lexical_search becomes a thin one-shot wrapper (BM25Index(collection, filter=...).search(...)) with identical behavior. Adds equivalence + reuse + filter tests. First customer: i2mint/ir (caches a BM25Index per corpus). Closes #22

CI installed no optional extras, so pytest-asyncio (declared only in the dev group) was absent and all 12 tests in tests/test_async.py errored with 'async def functions are not natively supported'. This is a pre-existing failure surfaced by dependency drift (master's last CI run predates it). Add a lightweight 'test' extra (pytest + pytest-cov + pytest-asyncio, no heavy backend deps) and point [tool.wads.ci.install].extras at it.

thorwhalen mentioned this pull request Jun 6, 2026

perf(retrieve): cache BM25 index per corpus (lexical/hybrid no longer rebuild per query) i2mint/ir#22

Merged

thorwhalen merged commit 2486c7f into master Jun 6, 2026
12 checks passed

thorwhalen deleted the feat/bm25-index branch June 6, 2026 12:42

thorwhalen merged 2 commits into master from feat/bm25-index