perf(retrieve): cache BM25 index per corpus (lexical/hybrid no longer rebuild per query) #22

Merged
thorwhalen merged 2 commits into master from perf/cache-bm25-index on Jun 6, 2026

@thorwhalen

Closes #21. Needs i2mint/vd#23 (vd.BM25Index).

ir.retrieve._lexical_ranked rebuilt the candidate collection and recomputed every BM25 term statistic on every query, so lexical and hybrid retrieval cost O(corpus × queries) and did not scale (surfaced by the eval run, ir_05 §6.1).

Change: build a vd.BM25Index once per candidate set and cache it on the corpus instance (keyed by surfaces+filter); the query-independent statistics are reused across queries. surfaces is materialized once (it feeds both the filter and the cache key).
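The caching pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ToyBM25Index` is a hypothetical stand-in for `vd.BM25Index` (whose real API is not shown here), and `Corpus`, `lexical_index`, and `allowed_ids` are assumed names chosen for the example.

```python
import math
from collections import Counter


class ToyBM25Index:
    """Hypothetical stand-in for an index type such as vd.BM25Index."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # Query-independent statistics, computed once at build time.
        self.ids = list(docs)
        self.tfs = [Counter(docs[i].lower().split()) for i in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        total = sum(self.lengths)
        self.avgdl = total / len(self.lengths) if total else 1.0
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(self.ids)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
        self.k1, self.b = k1, b

    def search(self, query, top_k=10):
        # Only the query-dependent scoring happens per query.
        terms = query.lower().split()
        scored = []
        for doc_id, tf, dl in zip(self.ids, self.tfs, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: -pair[1])[:top_k]


class Corpus:
    """Hypothetical corpus holding {record_id: {surface_name: text}}."""

    def __init__(self, records):
        self.records = records
        self._bm25_cache = {}  # (surfaces, allowed_ids) -> ToyBM25Index
        self.builds = 0        # counts index builds, for illustration only

    def lexical_index(self, surfaces, allowed_ids=None):
        # Materialize once: the same tuple selects text and forms the cache key.
        surfaces = tuple(surfaces)
        key = (surfaces, allowed_ids)  # allowed_ids: frozenset or None
        if key not in self._bm25_cache:
            docs = {
                rid: " ".join(fields.get(s, "") for s in surfaces)
                for rid, fields in self.records.items()
                if allowed_ids is None or rid in allowed_ids
            }
            self._bm25_cache[key] = ToyBM25Index(docs)
            self.builds += 1
        return self._bm25_cache[key]


corpus = Corpus({
    "a": {"title": "cache bm25 index", "body": "build once"},
    "b": {"title": "dense retrieval", "body": "embedding vectors"},
    "c": {"title": "hybrid search", "body": "bm25 plus dense"},
})
for q in ["bm25 index", "dense vectors", "hybrid"]:
    hits = corpus.lexical_index(["title", "body"]).search(q)
```

In this sketch the loop issues three queries but the index is built only once; a different surface set or filter produces a different key and therefore a separate cached index. Materializing `surfaces` as a tuple matters because a one-shot iterable would otherwise be exhausted before it could serve as the key.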

Behavior-preserving: skills strict reproduces dense 0.6585 / lexical 0.6407 / hybrid 0.6941 exactly. 94 tests pass.

Scaling proof: the full 600-query reports run (16,836-record corpus, all three modes) previously did not finish within 10 min of CPU time (it was killed and down-sampled to 100). It now completes in ~3 min 16 s and confirms the finding on the full set: hybrid 0.537 > dense 0.500 > lexical 0.401.
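As a rough back-of-the-envelope check on why the rebuild dominated, the snippet below counts index-build work in units of "one pass over one record" for a single lexical mode, using the corpus and query counts quoted above. The unit of work is an assumption made for illustration; it ignores per-query scoring, which both versions still pay.

```python
records = 16_836   # reports corpus size, from the run above
queries = 600      # full query set, from the run above

# Before: statistics recomputed over the whole corpus for every query.
rebuild_passes = records * queries

# After: statistics computed once per candidate set and reused.
cached_passes = records

reduction = rebuild_passes // cached_passes
print(rebuild_passes, cached_passes, reduction)
```

Under this simplified model the build work for one candidate set shrinks by a factor equal to the number of queries.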

Refs #12.

_lexical_ranked built the candidate collection and recomputed all BM25
term statistics on every query, making lexical/hybrid O(corpus*queries)
— they did not scale (a full lexical/hybrid pass over the 16,836-record
reports corpus did not finish in 10 min CPU; ir_05 §6.1). Now a
vd.BM25Index is built once per candidate set and cached on the corpus
instance (keyed by surfaces+filter); the index's query-independent
statistics are reused across queries. Behavior-preserving: skills strict
reproduces dense 0.6585 / lexical 0.6407 / hybrid 0.6941 exactly.

Needs vd.BM25Index (i2mint/vd#22).

Closes #21. Refs #12

Records that the per-query BM25 rebuild (ir_05 §6.1) is resolved, and
upgrades the reports row to the full 600-gold run (now feasible): hybrid
0.537 > dense 0.500 > lexical 0.401. Refs #21 #12
@thorwhalen thorwhalen merged commit ccacd66 into master Jun 6, 2026
9 of 18 checks passed
@thorwhalen thorwhalen deleted the perf/cache-bm25-index branch June 6, 2026 12:44

Linked issue: retrieve: cache BM25 index per corpus (lexical/hybrid rebuild it on every query)