feat(search): BM25Index — build-once / query-many lexical index #23

Merged: thorwhalen merged 2 commits into master from feat/bm25-index on Jun 6, 2026

Conversation

@thorwhalen

Member

bm25_lexical_search re-tokenizes every document and recomputes document frequencies and lengths on every call, so it costs O(N) per query and O(N·Q) for a batch of Q queries over N documents. It doesn't scale: a batch lexical/hybrid pass over a ~16.8k-doc collection didn't finish in 10 min of CPU time (surfaced via the ir eval run).

Change: extract a reusable BM25Index that builds the query-independent term statistics (df, lengths, mean length, per-doc tokens) once in __init__, then answers many queries via .search(query, *, limit, k1, b). bm25_lexical_search becomes a thin one-shot wrapper, BM25Index(collection, filter=...).search(...), so existing behavior is preserved exactly (covered by a new equivalence test). Also exposes vd.BM25Index and adds reuse, filter, and empty-query tests plus a doctest.
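
For readers who have not seen the diff, here is a minimal sketch of the build-once / query-many pattern described above. It is not the code merged in this PR. The class name SketchBM25Index, the helper names tokenize and bm25_search_once, the plain dict standing in for a collection, the regex tokenizer, the default k1 and b values, and the Lucene-style IDF variant are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the real tokenizer may differ)."""
    return re.findall(r"\w+", text.lower())


class SketchBM25Index:
    """Hypothetical stand-in for BM25Index: build statistics once, query many times."""

    def __init__(self, docs, filter=None):
        # docs: mapping of doc id -> text (assumed shape of a "collection").
        # filter: optional predicate on (doc_id, text), applied at build time.
        self.term_freqs = {
            doc_id: Counter(tokenize(text))
            for doc_id, text in docs.items()
            if filter is None or filter(doc_id, text)
        }
        # Query-independent statistics, computed exactly once.
        self.lengths = {d: sum(tf.values()) for d, tf in self.term_freqs.items()}
        n = len(self.term_freqs)
        self.mean_length = sum(self.lengths.values()) / n if n else 0.0
        self.df = Counter()
        for tf in self.term_freqs.values():
            self.df.update(tf.keys())  # each doc counts once per distinct term

    def search(self, query, *, limit=10, k1=1.5, b=0.75):
        terms = tokenize(query)
        if not terms or not self.term_freqs:
            return []
        n = len(self.term_freqs)
        avg = self.mean_length or 1.0  # guard against all-empty documents
        scores = {}
        for doc_id, tf in self.term_freqs.items():
            norm = 1 - b + b * self.lengths[doc_id] / avg
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f:
                    df = self.df[term]
                    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
                    score += idf * f * (k1 + 1) / (f + k1 * norm)
            if score > 0:
                scores[doc_id] = score
        # Highest score first; ties broken by doc id for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:limit]


def bm25_search_once(docs, query, *, filter=None, **kwargs):
    """One-shot wrapper: builds a throwaway index, then runs a single query."""
    return SketchBM25Index(docs, filter=filter).search(query, **kwargs)


docs = {
    "a": "the cat sat on the mat",
    "b": "dogs chase cats",
    "c": "the stock market fell",
}
index = SketchBM25Index(docs)  # tokenization and df computed here, once
for q in ["cat", "the market", ""]:
    # The reused index and the one-shot wrapper should agree on every query.
    assert index.search(q) == bm25_search_once(docs, q)
```

In this sketch the per-query cost is still linear in the number of documents, because every document is scored. The saving comes from no longer re-tokenizing and re-counting the corpus on each call, which is the overhead the description attributes to the old bm25_lexical_search.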

First customer: i2mint/ir (ir.retrieve caches a BM25Index per corpus). 49 hybrid tests + 8 doctests pass.

Closes #22

bm25_lexical_search re-tokenized every document and recomputed
document frequencies/lengths on every call (O(N) per query, O(N*Q) for
a batch), which did not scale. Extract a reusable BM25Index that builds
the query-independent term statistics once and answers many queries via
.search(query, *, limit, k1, b); bm25_lexical_search becomes a thin
one-shot wrapper (BM25Index(collection, filter=...).search(...)) with
identical behavior. Adds equivalence + reuse + filter tests.

First customer: i2mint/ir (caches a BM25Index per corpus).

Closes #22

CI installed no optional extras, so pytest-asyncio (declared only in the
dev group) was absent and all 12 tests/test_async.py tests errored with
'async def functions are not natively supported' — a pre-existing failure
surfaced by dependency drift (master's last CI run predates it). Add a
lightweight 'test' extra (pytest + pytest-cov + pytest-asyncio, no heavy
backend deps) and point [tool.wads.ci.install].extras at it.
@thorwhalen thorwhalen merged commit 2486c7f into master Jun 6, 2026
12 checks passed
@thorwhalen thorwhalen deleted the feat/bm25-index branch June 6, 2026 12:42

Development

Successfully merging this pull request may close these issues.

search: add BM25Index (build-once / query-many) so lexical/hybrid scale
