diff --git a/CHANGELOG.md b/CHANGELOG.md index 9cf1abf..6dec955 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Documentation + +- Reframed bitmap-overlap docs around the checked Lean proof spine: query + symmetry, quotient sufficiency, finite threshold optimality, and idealized + hypergeometric calibration, while preserving the real-encoder caveats. + ## [0.2.0] - 2026-05-26 First public release on crates.io / PyPI — the crate was not published before diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 598d86b..fcab97d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -14,6 +14,10 @@ Contributions to the code, the docs, and the paper are all welcome. reproducible `examples/bench_rank` synthetic run. Real-corpus numbers are user-runnable and reported in the paper — please don't add unverifiable performance claims. +- **Keep theory claims bounded.** Cite theorem names or formalization docs for + proof-backed statements, and preserve the finite in-model vs real-encoder + caveat. The Lean bitmap theorem proves a constant-weight overlap admission + model under explicit assumptions; it is not a blanket retrieval guarantee. - **MSRV is Rust 1.89.** Don't use newer standard-library or language APIs. - **Stable surface.** The persistence file magics (`.tvr` / `.tvrq` / `.tvbm` / `.tvsb`) and the public method names diff --git a/README.md b/README.md index ab838a5..f5d1683 100644 --- a/README.md +++ b/README.md @@ -66,35 +66,42 @@ Two further paths, for callers who need them: latency-critical callers know it exists. - **`MultiBucketBitmap`** *(behind `--features experimental`)* — the multi-bucket bilinear-overlap probe behind the research-side decomposition; - a scaffold for the theory, not a production path. + an algebraic scaffold, not the top-bucket theorem surface or a production + path. -## The bitmap prefilter has a closed-form null +## The bitmap prefilter has a checked finite model The `Bitmap` prefilter scores candidates by `popcount(Q AND D)` over each -document's fixed-size top-bucket set. Two *unrelated* documents — modelled as -independent uniform top-bucket sets — overlap **hypergeometrically**, -`H(dim, n_top, n_top)`, with expected overlap `n_top² / dim` (e.g. 16 at -`dim = 256`, `n_top = 64`). So the filter's **selectivity** — how often an -unrelated document clears a given overlap threshold — is closed-form and -data-independent, not a tuned cutoff. (Whether *true* neighbours clear the bar -is empirical; this is an exact candidate-generation null, not a -retrieval-optimality theorem.) Two pieces of this are separately machine-checked in Lean 4, both `sorry`-free -on Lean's standard axiom base (`propext`, `Classical.choice`, `Quot.sound`): +document's fixed-size top-bucket set. In the idealized uniform constant-weight +null, two unrelated `n_top`-active bitmaps in `dim` coordinates overlap +**hypergeometrically**, `H(dim, n_top, n_top)`, with expected overlap +`n_top² / dim` (e.g. 16 at `dim = 256`, `n_top = 64`). That makes the null +selectivity of an overlap cutoff closed-form. + +The current proof story is stronger than a closed-form null alone. Two pieces +are machine-checked in Lean 4, both `sorry`-free on Lean's standard axiom base +(`propext`, `Classical.choice`, `Quot.sound`): - the **ordinal invariance** on which the rank transform rests — that a vector's sorting permutation is unchanged by any strictly monotone reparametrisation of its coordinates — in [`takens-formalization`](https://github.com/Project-Navi/takens-formalization) (theorem `isOrdinalPatternOf_comp_strictMono`); and -- the **shape of the bitmap candidate filter** — that under a finite - monotone-likelihood-ratio overlap-tilt model, an overlap-count *threshold* - (the popcount cutoff) is the Bayes-optimal deterministic admission rule, and - the uniform constant-weight null gives that threshold exactly the - hypergeometric upper tail — in +- the **finite constant-weight bitmap admission model** — symmetry makes + literal overlap the canonical query-preserving invariant, quotient + sufficiency reduces the decision to that evidence, a finite overlap-tilt + signal model makes an overlap-count threshold Bayes-optimal among + deterministic admission rules, and the uniform constant-weight bitmap null + assigns that same threshold event exactly the hypergeometric upper tail — in [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization) (theorem `exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`). - This is an *in-model* result: it fixes the optimal rule's shape under the - stated assumptions, not a claim that any given corpus obeys the model. + +This is an *in-model* result. It proves the rule shape and the idealized finite +null under explicit quotient, symmetry, and monotone-overlap assumptions. It +does not prove that real encoders satisfy those assumptions, that the textbook +hypergeometric is every deployment corpus's null, or that ordinal quotients are +representation-complete. Whether true neighbours clear a cutoff remains an +empirical contract to measure. Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md). @@ -149,6 +156,10 @@ Wheels target CPython 3.10+ (abi3); to build from source instead, see - **Index-file trust model:** [`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md), [`THREAT_MODEL.md`](THREAT_MODEL.md) +- **Formal proof spine:** [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization), + including its [`proof-spine`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/proof-spine.md), + [`theorem-map`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/theorem-map.md), + and [`reviewer brief`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/reviewer-brief.md). - **API docs:** - **Paper (OrdVec / RankQuant):** _link TBD — see [Research collaboration](#research-collaboration)._ @@ -236,8 +247,10 @@ Collaboration we're actively seeking: - **Real-corpus evaluation** — running the modes against public corpora (GloVe, MTEB / BEIR, OpenAI embedding dumps) beyond the synthetic benchmark. -- **Theory** — formalising the hypergeometric candidate-generation null and - the rank-cosine invariants. +- **Theory** — extending and independently auditing the `sorry`-free Lean + formalization, especially the finite bitmap proof spine, rank-cosine + invariants, and empirical diagnostics for when real encoders meet or violate + the model assumptions. - **Independent reproduction** — re-running the benchmark on other hardware and reporting the numbers. diff --git a/ROADMAP.md b/ROADMAP.md index ec35520..e964d3a 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -57,5 +57,6 @@ into more hosts rather than forcing callers to adapt to it. ## Out of scope here -Benchmarks and the intellectual framing live in the OrdVec / RankQuant paper; -this roadmap concerns the crate as an embeddable building block. +Benchmarks and the intellectual framing live in the OrdVec / RankQuant paper +and the companion `ordvec-formalization` Lean repository; this roadmap concerns +the crate as an embeddable building block. diff --git a/docs/RANK_MODES.md b/docs/RANK_MODES.md index 8381e17..296ec3d 100644 --- a/docs/RANK_MODES.md +++ b/docs/RANK_MODES.md @@ -128,23 +128,29 @@ don't have a hypergeometric null because they don't have fixed bucket cardinalities — their score distribution depends on the unknown embedding distribution. -**A research program this suggests — now partly machine-checked.** The -chain — representation → statistic → retrieval theorem → systems -implementation — now has a checked link in the middle. Under a finite -monotone-likelihood-ratio overlap-tilt model (the formal cousin of a -shared-latent-support model, where relevant documents have elevated -coordinates on a query-specific support set `S_q`), the top-bucket -overlap statistic is monotone in the likelihood ratio for relevance, -and an overlap-count threshold — the popcount cutoff — is the -Bayes-optimal deterministic admission rule, with the uniform -constant-weight null assigning that threshold exactly the -hypergeometric upper tail. That is proved `sorry`-free in -[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization) -(`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`). -It is an *in-model* result: it fixes the shape of the optimal rule -under the stated assumptions, not a claim that any given corpus obeys -the model — whether real neighbours clear the bar stays empirical, -which is what the bench and the paper measure. +**Checked finite model: symmetry, quotient sufficiency, threshold, +calibration.** The proof chain now has a larger machine-checked middle +than the implementation docs used to claim. In +[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization), +Lean proves that literal bitmap overlap is the canonical invariant +under query-preserving coordinate relabelings; finite quotient +sufficiency reduces the admission decision to ordered overlap +evidence when the likelihood ratio factors through it; a finite +monotone-likelihood-ratio overlap-tilt model makes an overlap-count +threshold Bayes-optimal among deterministic admission rules; and the +uniform constant-weight bitmap null gives that same threshold event +the exact hypergeometric upper tail. The headline theorem is +`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`; +the proof path is summarized in the formalization repo's +`docs/proof-spine.md`, with theorem names in `docs/theorem-map.md`. + +It is still an *in-model* result. The theorem is about literal +constant-weight bitmaps, finite deterministic admission, and explicit +symmetry / quotient / monotone-overlap assumptions. It does not prove +that real encoders satisfy those assumptions, that the textbook +hypergeometric is every deployment corpus's null, or that ordinal +quotients are representation-complete. Whether real neighbours clear a +cutoff stays empirical, which is what the bench and the paper measure. The systems consequence is what the bench measures: at a moderate M the bitmap probe captures most of exact RankQuant's top-10 neighbours, @@ -512,3 +518,7 @@ multi-seed stability is your call. retrieval still works after the removal is the interesting result: on the corpora tested, those components were carrying less than the dense-quantization literature assumes. +6. **Formal boundary.** The Lean result supports the constant-weight + bitmap overlap admission model and its idealized null calibration. + It does not replace real-corpus recall, null-fit, or monotonicity + checks for a deployed encoder. diff --git a/examples/bench_rank.rs b/examples/bench_rank.rs index e8c76b0..ff6e102 100644 --- a/examples/bench_rank.rs +++ b/examples/bench_rank.rs @@ -18,6 +18,11 @@ //! - recall@10 against FP32 brute-force cosine ground truth //! - candidate-recall (CR) for the two-stage modes (ANN probe quality) //! +//! The bitmap rows also serve as an empirical complement to the Lean +//! constant-weight overlap model: they measure whether this synthetic or +//! user-supplied corpus behaves like the monotone-overlap regime assumed by the +//! theorem. +//! //! DETERMINISM. The corpus, queries, and ground truth are fully seeded, //! so every QUALITY column (R@10, CR, bytes/vec, total MiB, ns/dim) is //! bit-identical across runs on the same machine. Only the wall-clock diff --git a/ordvec-python/README.md b/ordvec-python/README.md index 24649a4..a76707c 100644 --- a/ordvec-python/README.md +++ b/ordvec-python/README.md @@ -21,8 +21,20 @@ scores, ids = q.search_asymmetric(np.random.randn(8, 1024).astype(np.float32), k |-------|---------| | `Rank` | Full-precision rank vectors (u16 per coordinate). | | `RankQuant` | Bucketed ranks, `bits` ∈ {1, 2, 4}; symmetric + asymmetric (float-query LUT) scoring. | -| `Bitmap` | Top-bucket bitmap per document; `popcount(Q AND D)` candidate scoring. | -| `SignBitmap` | Sign bitmap for sign-cosine candidate generation. | +| `Bitmap` | Constant-weight top-bucket bitmap per document; `popcount(Q AND D)` candidate scoring. | +| `SignBitmap` | Sign bitmap for sign-cosine candidate generation; separate from the constant-weight bitmap theorem. | + +## Theory and calibration + +`Bitmap` exposes the constant-weight top-bucket overlap statistic formalized in +[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization). +In that finite Lean model, literal bitmap overlap is the query-preserving +quotient statistic, an overlap threshold is Bayes-optimal under explicit +monotone-overlap assumptions, and the idealized uniform constant-weight null +calibrates that threshold by the hypergeometric upper tail. + +This is not a deployment guarantee for every encoder or corpus. Real-corpus +recall, monotonicity, and null fit remain empirical diagnostics. ## Installation diff --git a/ordvec-python/pyproject.toml b/ordvec-python/pyproject.toml index defc182..25794ac 100644 --- a/ordvec-python/pyproject.toml +++ b/ordvec-python/pyproject.toml @@ -49,6 +49,7 @@ dependencies = [ Homepage = "https://github.com/Fieldnote-Echo/ordvec" Repository = "https://github.com/Fieldnote-Echo/ordvec" Issues = "https://github.com/Fieldnote-Echo/ordvec/issues" +Formalization = "https://github.com/Fieldnote-Echo/ordvec-formalization" [tool.maturin] manifest-path = "Cargo.toml" diff --git a/ordvec-python/python/ordvec/__init__.py b/ordvec-python/python/ordvec/__init__.py index 73d5ce7..7c2b9d9 100644 --- a/ordvec-python/python/ordvec/__init__.py +++ b/ordvec-python/python/ordvec/__init__.py @@ -15,6 +15,14 @@ read/write functions are reached through the classes' ``write()`` / ``load()`` methods rather than exposed as standalone free functions. +``Bitmap`` exposes the constant-weight top-bucket overlap statistic formalized +in the companion ``ordvec-formalization`` Lean repo: under explicit finite +symmetry, quotient, and monotone-overlap assumptions, an overlap threshold is +Bayes-optimal and the idealized uniform constant-weight null gives that +threshold the hypergeometric upper tail. This is an in-model candidate-admission +claim, not a guarantee that real encoders or deployment corpora satisfy those +assumptions. + The ``*Index`` names are back-compat aliases for the pre-0.2 turbovec-python rank-mode classes; they are kept only to ease script migration and are not part of the documented surface — new code should use the OrdVec ontology names above. diff --git a/ordvec-python/src/lib.rs b/ordvec-python/src/lib.rs index f21a1aa..31472b7 100644 --- a/ordvec-python/src/lib.rs +++ b/ordvec-python/src/lib.rs @@ -1,8 +1,8 @@ //! Python bindings for [`ordvec`](https://github.com/Fieldnote-Echo/ordvec) — the //! training-free ordinal & sign vector-quantization crate. //! -//! Exposes the four retrieval types under the OrdVec ontology — [`Rank`], -//! [`RankQuant`], [`Bitmap`], [`SignBitmap`] — as a single abi3 extension module +//! Exposes the four retrieval types under the OrdVec ontology — `Rank`, +//! `RankQuant`, `Bitmap`, `SignBitmap` — as a single abi3 extension module //! (`_ordvec`) wrapped by the `ordvec` Python package. //! //! The core crate is aliased as `ordvec_core` throughout, so the Rust namespace @@ -487,7 +487,7 @@ impl Rank { .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string())) } - /// Load a `Rank` index from a `.tvr` file previously written by [`Rank.write`]. + /// Load a `Rank` index from a `.tvr` file previously written by [`Rank::write`]. /// /// `path` is forwarded to the filesystem unmodified — no `..` / traversal /// sanitisation — so treat it as trusted input (see the module docstring). @@ -664,7 +664,7 @@ impl RankQuant { .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string())) } - /// Load a `RankQuant` index from a `.tvrq` file written by [`RankQuant.write`]. + /// Load a `RankQuant` index from a `.tvrq` file written by [`RankQuant::write`]. /// /// `path` is forwarded to the filesystem unmodified — no `..` / traversal /// sanitisation — so treat it as trusted input (see the module docstring). @@ -694,6 +694,10 @@ impl RankQuant { /// filled are returned as ``-1``. Uses the same AVX-512 → AVX2 → scalar /// dispatch as ``search_asymmetric``. /// + /// If the shortlist came from [`Bitmap`], this is the exact RankQuant + /// rerank stage over that survivor set; it does not itself apply or + /// calibrate a bitmap overlap threshold. + /// /// ``candidates`` may be a 1-D array of any integer dtype — the ``uint32`` /// emitted by ``top_m_candidates``/``top_m_candidates_batched`` or a plain /// ``int64`` index array (``np.arange``, ``np.where(...)[0]``, fancy-index @@ -762,6 +766,8 @@ impl Bitmap { /// Construct a top-bucket bitmap index. `dim` must be a positive multiple of /// 64; `n_top` (how many coordinates per document are flagged "top", e.g. /// dim/4 for the b=2-equivalent top quarter) must satisfy `0 < n_top < dim`. + /// Every stored document and query bitmap has exactly `n_top` active + /// coordinates, matching the constant-weight overlap model. #[new] fn new(dim: usize, n_top: usize) -> PyResult { if dim == 0 || !dim.is_multiple_of(64) { @@ -821,7 +827,9 @@ impl Bitmap { } /// Bitmap-overlap search: returns the top-`k` doc indices by - /// popcount(query_top AND doc_top). + /// popcount(query_top AND doc_top). This uses the overlap statistic from + /// the finite threshold theorem but returns a top-k ranking, not a + /// calibrated threshold-admission set. fn search<'py>( &self, py: Python<'py>, @@ -848,7 +856,9 @@ impl Bitmap { /// Return top-`m` candidate doc IDs for a single query as a 1-D `uint32` /// array. Used as the candidate generator for two-stage retrieval (bitmap - /// probe → exact RankQuant rerank). + /// probe → exact RankQuant rerank). This is a fixed-budget shortlist over + /// the formal overlap statistic; the theorem's admission rule is an + /// explicit overlap threshold. fn top_m_candidates<'py>( &self, py: Python<'py>, @@ -868,7 +878,8 @@ impl Bitmap { /// Build the query-side top-`n_top` bitmap from an FP32 query, returned as a /// 1-D `uint64` array of `dim / 64` words (the doc-side packing). Pairs with - /// [`Bitmap.body_overlap_scores_subset`] for staged rescoring. + /// [`Bitmap::body_overlap_scores_subset`] for staged rescoring. The returned + /// bitmap has exactly `n_top` active coordinates. fn build_query_bitmap_fp32<'py>( &self, py: Python<'py>, @@ -889,6 +900,8 @@ impl Bitmap { /// dim)` f32 array; returns a 2-D `uint32` array of shape `(batch, m_eff)` /// where `m_eff = min(m, len(index))`. The column count is `m_eff` /// regardless of `batch`, so an empty `(0, dim)` input returns `(0, m_eff)`. + /// Each row has the same fixed-budget semantics as + /// [`Bitmap::top_m_candidates`]. fn top_m_candidates_batched<'py>( &self, py: Python<'py>, @@ -928,11 +941,12 @@ impl Bitmap { } /// Chunked batched candidate generation: like - /// [`Bitmap.top_m_candidates_batched`] but processes `queries` in groups of + /// [`Bitmap::top_m_candidates_batched`] but processes `queries` in groups of /// `batch_size` rows in parallel — use when the full query workload is /// larger than one batch fits efficiently in cache. `queries` is a 2-D `(n, /// dim)` f32 array; returns a 2-D `uint32` array `(n, m_eff)`. `batch_size` - /// must be > 0. + /// must be > 0. Each row has the same fixed-budget semantics as + /// [`Bitmap::top_m_candidates`]. fn top_m_candidates_batched_chunked<'py>( &self, py: Python<'py>, @@ -991,9 +1005,11 @@ impl Bitmap { /// Compute bitmap-overlap scores for a subset of doc IDs against a pre-built /// query bitmap. `q_bitmap` is a 1-D `uint64` array of `dim / 64` words - /// (e.g. from [`Bitmap.build_query_bitmap_fp32`]); `doc_ids` is a 1-D + /// (e.g. from [`Bitmap::build_query_bitmap_fp32`]); `doc_ids` is a 1-D /// integer array of any dtype (converted to `uint32`) whose ids must be in /// range. Returns a 1-D `uint32` array of overlap scores aligned to `doc_ids`. + /// These are the exact overlap values to compare against an explicit + /// calibrated cutoff when using threshold admission. /// /// `doc_ids` must additionally be sorted ascending. This is a *Python-side /// ergonomic policy*, not a core requirement: the Rust core accepts unsorted @@ -1058,7 +1074,7 @@ impl Bitmap { .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string())) } - /// Load a `Bitmap` index from a `.tvbm` file written by [`Bitmap.write`]. + /// Load a `Bitmap` index from a `.tvbm` file written by [`Bitmap::write`]. /// /// `path` is forwarded to the filesystem unmodified — no `..` / traversal /// sanitisation — so treat it as trusted input (see the module docstring). @@ -1104,7 +1120,8 @@ impl SignBitmap { /// 1-bit-per-coord sign-cosine retrieval substrate. `dim` must be a positive /// multiple of 64, at most `MAX_SIGN_BITMAP_DIM`. No `n_top` parameter — the /// threshold is data-independent (bit set iff coord > 0). Storage: `dim / 8` - /// bytes per doc (128 B at D = 1024). + /// bytes per doc (128 B at D = 1024). This is separate from the + /// constant-weight `Bitmap` theorem and its hypergeometric overlap null. #[new] fn new(dim: usize) -> PyResult { if dim == 0 || !dim.is_multiple_of(64) { @@ -1297,7 +1314,7 @@ impl SignBitmap { } /// Load a `SignBitmap` from a `.tvsb` file previously written by - /// [`SignBitmap.write`]. Raises `IOError` if the file is missing, malformed, + /// [`SignBitmap::write`]. Raises `IOError` if the file is missing, malformed, /// or its payload length disagrees with the header-declared shape. /// /// `path` is forwarded to the filesystem unmodified — no `..` / traversal diff --git a/ordvec-python/tests/test_bitmap.py b/ordvec-python/tests/test_bitmap.py index 8cf95a3..0aee255 100644 --- a/ordvec-python/tests/test_bitmap.py +++ b/ordvec-python/tests/test_bitmap.py @@ -3,9 +3,9 @@ Exercise the top-bucket bitmap candidate-generator API (add, search, top_m_candidates, save/load, accessors) and the two-stage flow with a RankQuant rerank. Algorithmic correctness (kernel-vs-scalar parity, -hypergeometric null distribution checks, bilinear identity proofs) lives -in the crate's Rust tests under `tests/index/`; these tests cover the -pyo3 boundary only. +constant-weight bitmap invariants, bilinear identity checks) lives in the +crate's Rust tests under `tests/index/`; the formal overlap-null theorem lives +in `ordvec-formalization`. These tests cover the pyo3 boundary only. """ from __future__ import annotations diff --git a/ordvec-python/tests/test_redteam_fuzz.py b/ordvec-python/tests/test_redteam_fuzz.py index 436b624..21fba01 100644 --- a/ordvec-python/tests/test_redteam_fuzz.py +++ b/ordvec-python/tests/test_redteam_fuzz.py @@ -1013,7 +1013,8 @@ def test_bitmap_load_forged_n_top_io_error(tmp_path): def test_bitmap_load_corrupt_popcount_io_error(tmp_path): # Zero the payload → each row popcount 0 != n_top, violating the per-row - # popcount invariant the hypergeometric model assumes. + # popcount invariant the constant-weight bitmap-null / formal overlap model + # assumes. idx = Bitmap(dim=128, n_top=32) idx.add(unit_vectors(20, 128)) real = str(tmp_path / "real.tvbm") diff --git a/src/bitmap.rs b/src/bitmap.rs index 6b9cf3c..07536a1 100644 --- a/src/bitmap.rs +++ b/src/bitmap.rs @@ -8,11 +8,18 @@ //! For `dim=1024, n_top=256` this is 128 B/doc — half of RankQuant b=2's //! storage — and equivalent to the top bucket of RankQuant b=2. //! -//! Score = `popcount(Q_bitmap AND D_bitmap) ∈ [0, n_top]`. The null -//! distribution for a random doc is hypergeometric -//! `H(N=dim, K=n_top, n=n_top)` with mean `n_top² / dim`, which lets -//! callers compute a closed-form p-value for the overlap — a property -//! magnitude quantizers don't have. +//! Score = `popcount(Q_bitmap AND D_bitmap) ∈ [0, n_top]`. Under the +//! idealized uniform constant-weight bitmap null, a random document's overlap +//! is hypergeometric `H(N=dim, K=n_top, n=n_top)` with mean `n_top² / dim`, +//! which lets callers compute a closed-form p-value for an overlap cutoff — a +//! property magnitude quantizers don't have. +//! +//! This is also the implementation surface for the companion Lean +//! formalization's finite model: literal constant-weight bitmap overlap is the +//! query-preserving quotient statistic, an overlap-tail rule is Bayes-optimal +//! under explicit monotone-overlap assumptions, and the same threshold event has +//! the hypergeometric upper tail under the idealized uniform null. The theorem +//! is in-model only; real encoders and deployment nulls remain empirical. //! //! Intended primary use: candidate generator for two-stage retrieval //! (bitmap probe → top-M candidates → exact RankQuant rerank). @@ -24,6 +31,12 @@ use crate::util::{and_popcount, assert_all_finite, result_buffer_len, TopK}; use crate::SearchResults; /// Top-bucket bitmap index for constant-composition coarse scoring. +/// +/// The checked finite theorem attached to this surface concerns threshold +/// admission on literal `n_top`-active bitmaps. The top-k helpers below are +/// engineering conveniences over the same overlap statistic; calibrated +/// threshold admission still requires choosing and validating a cutoff for the +/// caller's corpus. pub struct Bitmap { dim: usize, n_top: usize, @@ -99,7 +112,9 @@ impl Bitmap { /// Build the query-side bitmap from the *FP32 query directly* /// (top `n_top` coordinates by value). This preserves the /// "rich query, cheap docs" asymmetry of the rank-cosine paper: - /// the query side never gets rank-quantised. + /// the query side never gets rank-quantised. The returned query bitmap has + /// exactly `n_top` active coordinates, matching the constant-weight + /// overlap model. pub fn build_query_bitmap_fp32(&self, q: &[f32]) -> Vec { assert_eq!(q.len(), self.dim); assert_all_finite(q); @@ -122,6 +137,10 @@ impl Bitmap { /// Search: returns the top-`k` documents by popcount-overlap with /// the query's top-bucket bitmap. Scores are `popcount(Q AND D)` /// reported as f32 (in `[0, n_top]`). + /// + /// This ranks by the same overlap statistic used by the finite threshold + /// theorem, but it returns a top-k list rather than applying a calibrated + /// admission cutoff. pub fn search(&self, queries: &[f32], k: usize) -> SearchResults { let nq = queries.len() / self.dim; assert_eq!(queries.len(), nq * self.dim); @@ -172,7 +191,9 @@ impl Bitmap { /// Return the top-`m` candidate document indices for a single /// query, ordered by bitmap-overlap desc. Helper for two-stage - /// retrieval. + /// retrieval. This is a fixed-budget shortlist over the formal overlap + /// statistic; the theorem's admission rule is a threshold event, not an + /// `m`-sized survivor set. /// /// For large `m` this would exhibit O(N · m) behaviour through a /// streaming top-k buffer (each replacement triggers a linear @@ -224,7 +245,8 @@ impl Bitmap { /// /// `queries` is a flat `batch * dim` f32 slice. Returns /// `Vec>` with one top-`m` set per query, sorted by - /// overlap descending. + /// overlap descending. Each row has the same fixed-budget semantics as + /// [`Self::top_m_candidates`]. #[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"] pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec> { let dim = self.dim; @@ -295,7 +317,8 @@ impl Bitmap { /// `batch_size` and runs each chunk through /// [`Self::top_m_candidates_batched`] in parallel via rayon. Use /// when the full query workload is larger than one batch fits - /// efficiently in L2/L3. + /// efficiently in L2/L3. This preserves the fixed-budget shortlist + /// semantics of [`Self::top_m_candidates`]. #[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"] pub fn top_m_candidates_batched_chunked( &self, @@ -343,7 +366,9 @@ impl Bitmap { /// index addresses at most `u32::MAX` documents. /// /// Public surface to support staged-pipeline callers that need to - /// rescore a small survivor set under the exact body overlap. + /// rescore a small survivor set under the exact body overlap. The returned + /// overlaps are also the values to compare against an explicit calibrated + /// cutoff when using threshold admission. pub fn body_overlap_scores_subset(&self, q_bitmap: &[u64], doc_ids: &[u32], out: &mut [u32]) { let qpv = self.qwords_per_vec; assert_eq!(q_bitmap.len(), qpv); diff --git a/src/fastscan.rs b/src/fastscan.rs index a39b8de..5149bd5 100644 --- a/src/fastscan.rs +++ b/src/fastscan.rs @@ -21,6 +21,8 @@ //! `pub(crate)`: production callers should reach for //! [`RankQuant::search_asymmetric`](crate::RankQuant::search_asymmetric), //! whose AVX-512 → AVX2 → scalar dispatch is the maintained surface. +//! This latency path is not part of the constant-weight bitmap overlap +//! calibration theorem. //! //! # Provenance //! diff --git a/src/lib.rs b/src/lib.rs index 3972dec..24dd8b4 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -20,6 +20,15 @@ //! coordinate, set when the coordinate is positive) for sign-cosine //! candidate generation. //! +//! The [`Bitmap`] candidate score is the implementation surface with the +//! strongest formal story: in the companion Lean formalization, literal +//! constant-weight bitmap overlap is the query-preserving quotient statistic, +//! an overlap threshold is Bayes-optimal under an explicit finite +//! monotone-overlap signal model, and the idealized uniform constant-weight null +//! calibrates that threshold by the hypergeometric upper tail. This is a finite +//! in-model theorem, not a claim that real encoders automatically satisfy the +//! quotient, symmetry, or null assumptions. +//! //! ```no_run //! use ordvec::{Rank, RankQuant}; //! @@ -67,8 +76,9 @@ pub use sign_bitmap::SignBitmap; pub use quant::search_asymmetric_byte_lut; // `MultiBucketBitmap` underwrites the bilinear bucket-overlap -// decomposition but is not stable public API. It is reachable only with -// the `experimental` feature; the default surface excludes it. +// decomposition but is not the constant-weight top-bucket theorem surface and +// is not stable public API. It is reachable only with the `experimental` +// feature; the default surface excludes it. #[cfg(feature = "experimental")] pub use multi_bucket::MultiBucketBitmap; diff --git a/src/multi_bucket.rs b/src/multi_bucket.rs index 878fb18..0b4a2d9 100644 --- a/src/multi_bucket.rs +++ b/src/multi_bucket.rs @@ -8,7 +8,7 @@ //! score(q, d) = Σ_{a, b} W[a, b] · |Q_a ∩ D_b| //! ``` //! -//! for arbitrary weights `W[2^bits][2^bits]` is the formal object the +//! for arbitrary weights `W[2^bits][2^bits]` is the algebraic object the //! scoring decomposes into. For **outer-product weights** //! `W[a, b] = (a − c)(b − c)` with `c = (2^bits − 1) / 2` this is //! algebraically identical to the symmetric RankQuant per-coord score @@ -25,6 +25,10 @@ //! bilinear decomposition empirically and serve as the reference for //! **truncated** weight matrices (top-k buckets only, diagonal-only, //! banded) which are the principled candidate-generation primitives. +//! +//! The companion Lean theorem currently attaches to the literal top-bucket +//! constant-weight [`crate::Bitmap`] overlap statistic and its threshold tail, +//! not to arbitrary bilinear weights over all buckets. use rayon::prelude::*; diff --git a/src/quant.rs b/src/quant.rs index 2d9562d..0e4a2ff 100644 --- a/src/quant.rs +++ b/src/quant.rs @@ -549,6 +549,10 @@ impl RankQuant { /// probe), but the copy grows linearly in `M`. For very large `M` /// (e.g. misuse via FFI), a full [`Self::search_asymmetric`] may be /// cheaper; a gather-free in-place scan is tracked for the FFI work. + /// + /// If the candidate list came from [`crate::Bitmap`], this method reranks + /// that shortlist exactly under RankQuant; it does not itself carry the + /// bitmap threshold-calibration guarantee. pub fn search_asymmetric_subset( &self, query: &[f32], diff --git a/src/rank_io.rs b/src/rank_io.rs index d34d010..32be790 100644 --- a/src/rank_io.rs +++ b/src/rank_io.rs @@ -541,9 +541,10 @@ pub(crate) fn load_bitmap(path: impl AsRef) -> io::Result<(usize, usize, u let bitmaps = read_le_vec(&mut f, payload_bytes / 8, u64::from_le_bytes)?; // Constant-composition invariant: every document bitmap must have exactly // `n_top` bits set (it flags the document's top `n_top` coordinates). The - // hypergeometric null model and the documented `[0, n_top]` score range - // both assume this, so a forged row with valid shape but a different - // popcount would break both. Verify per-row popcount at the boundary. + // idealized uniform constant-weight hypergeometric null model and the + // documented `[0, n_top]` score range both assume this, so a forged row + // with valid shape but a different popcount would break both. Verify + // per-row popcount at the boundary. for (row_idx, row) in bitmaps.chunks_exact(qpv).enumerate() { let pop: u32 = row.iter().map(|w| w.count_ones()).sum(); if pop as usize != n_top { diff --git a/src/sign_bitmap.rs b/src/sign_bitmap.rs index 2f5435a..8efd63a 100644 --- a/src/sign_bitmap.rs +++ b/src/sign_bitmap.rs @@ -17,6 +17,10 @@ //! candidate selector takes top-M docs by **lowest** Hamming //! (= **highest** agreement). //! +//! This is a separate sign-agreement primitive. It is not a constant-weight +//! bitmap space and is not covered by [`crate::Bitmap`]'s hypergeometric +//! overlap-tail theorem. +//! //! Kernel architecture mirrors [`crate::Bitmap`] (single-query //! and CHUNK=8 batched hot+tail paths under AVX-512 VPOPCNTDQ). The //! only material difference is `_mm512_xor_si512` in place of