perf(retrieve): cache BM25 index per corpus (lexical/hybrid no longer rebuild per query) by thorwhalen · Pull Request #22 · i2mint/ir

thorwhalen · 2026-06-06T11:05:31Z

Closes #21. Needs i2mint/vd#23 (vd.BM25Index).

ir.retrieve._lexical_ranked rebuilt the candidate collection and recomputed every BM25 term statistic on every query, so lexical and hybrid retrieval cost O(corpus × queries) and did not scale (surfaced by the eval run, ir_05 §6.1).

Change: build a vd.BM25Index once per candidate set and cache it on the corpus instance (keyed by surfaces+filter); the query-independent statistics are reused across queries. surfaces is materialized once (it feeds both the filter and the cache key).
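The caching pattern described above can be sketched as follows. This is a minimal illustration, not the actual change: `ToyBM25Index` and `ToyCorpus` are hypothetical stand-ins (the real `vd.BM25Index` API is not shown in this PR text), and the `kinds` filter, the record layout, and the whitespace tokenizer are assumptions made only for the example.

```python
import math
from collections import Counter


class ToyBM25Index:
    """Hypothetical stand-in for a BM25 index; NOT the vd.BM25Index API."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        tokenized = [docs[i].lower().split() for i in self.ids]
        # Query-independent statistics: computed once per candidate set.
        self.tfs = [Counter(tokens) for tokens in tokenized]
        self.lengths = [len(tokens) for tokens in tokenized]
        self.avgdl = (sum(self.lengths) / max(len(self.lengths), 1)) or 1.0
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(self.ids)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def rank(self, query):
        """Only this per-query scoring loop runs for each new query."""
        terms = query.lower().split()
        scored = []
        for doc_id, tf, dl in zip(self.ids, self.tfs, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: -pair[1])


class ToyCorpus:
    """Hypothetical corpus that caches one index per (surfaces, filter) key."""

    def __init__(self, records):
        self.records = records  # assumed layout: {id: {"kind": ..., surface: text}}
        self._bm25_cache = {}
        self.builds = 0  # counts index builds, for illustration only

    def lexical_ranked(self, query, surfaces=("title", "body"), kinds=None):
        surfaces = tuple(surfaces)  # materialized once: used for text and cache key
        kinds = None if kinds is None else frozenset(kinds)
        key = (surfaces, kinds)
        index = self._bm25_cache.get(key)
        if index is None:
            docs = {
                rid: " ".join(rec.get(s, "") for s in surfaces)
                for rid, rec in self.records.items()
                if kinds is None or rec.get("kind") in kinds
            }
            index = ToyBM25Index(docs)
            self._bm25_cache[key] = index
            self.builds += 1
        return index.rank(query)


corpus = ToyCorpus({
    "a": {"kind": "skill", "title": "BM25 cache", "body": "cache the index per corpus"},
    "b": {"kind": "report", "title": "dense retrieval", "body": "embedding vectors"},
})
first = corpus.lexical_ranked("bm25 cache")
second = corpus.lexical_ranked("embedding vectors")  # reuses the cached index
```

In this sketch a second query with the same surfaces and filter skips the build entirely, while a different surfaces tuple or filter produces a separate cache entry. Cache invalidation when the corpus changes is deliberately left out.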

Behavior-preserving: skills strict reproduces dense 0.6585 / lexical 0.6407 / hybrid 0.6941 exactly. 94 tests pass.

Scaling proof: the full 600-query reports run (16,836-record corpus, all three modes) previously did not finish in 10 min of CPU time (it was killed and down-sampled to 100). It now completes in ~3 min 16 s and confirms the finding on the full set: hybrid 0.537 > dense 0.500 > lexical 0.401.

Refs #12.

_lexical_ranked built the candidate collection and recomputed all BM25 term statistics on every query, making lexical/hybrid O(corpus*queries), so they did not scale (a full lexical/hybrid pass over the 16,836-record reports corpus did not finish in 10 min CPU; ir_05 §6.1).

Now a vd.BM25Index is built once per candidate set and cached on the corpus instance (keyed by surfaces+filter); the index's query-independent statistics are reused across queries.

Behavior-preserving: skills strict reproduces dense 0.6585 / lexical 0.6407 / hybrid 0.6941 exactly.

Needs vd.BM25Index (i2mint/vd#22). Closes #21. Refs #12

Records that the per-query BM25 rebuild (ir_05 §6.1) is resolved, and upgrades the reports row to the full 600-gold run (now feasible): hybrid 0.537 > dense 0.500 > lexical 0.401. Refs #21 #12

thorwhalen added 2 commits June 6, 2026 13:02

docs(eval): note BM25 caching fix in ir_05 (§6.1) + full reports n=600

5f239f5

thorwhalen merged commit ccacd66 into master Jun 6, 2026
9 of 18 checks passed

thorwhalen deleted the perf/cache-bm25-index branch June 6, 2026 12:44

thorwhalen merged 2 commits into master from perf/cache-bm25-index
