Fieldnote-Echo · project-navi-bot · May 28, 2026 · May 28, 2026
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Documentation
+
+- Reframed bitmap-overlap docs around the checked Lean proof spine: query
+  symmetry, quotient sufficiency, finite threshold optimality, and idealized
+  hypergeometric calibration, while preserving the real-encoder caveats.
+
 ## [0.2.0] - 2026-05-26
 
 First public release on crates.io / PyPI — the crate was not published before

@@ -14,6 +14,10 @@ Contributions to the code, the docs, and the paper are all welcome.
   reproducible `examples/bench_rank` synthetic run. Real-corpus numbers are
   user-runnable and reported in the paper — please don't add unverifiable
   performance claims.
+- **Keep theory claims bounded.** Cite theorem names or formalization docs for
+  proof-backed statements, and preserve the finite in-model vs real-encoder
+  caveat. The Lean bitmap theorem proves a constant-weight overlap admission
+  model under explicit assumptions; it is not a blanket retrieval guarantee.
 - **MSRV is Rust 1.89.** Don't use newer standard-library or language APIs.
 - **Stable surface.** The persistence file magics (`.tvr` / `.tvrq` /
   `.tvbm` / `.tvsb`) and the public method names

@@ -66,35 +66,42 @@ Two further paths, for callers who need them:
   latency-critical callers know it exists.
 - **`MultiBucketBitmap`** *(behind `--features experimental`)* — the
   multi-bucket bilinear-overlap probe behind the research-side decomposition;
-  a scaffold for the theory, not a production path.
+  an algebraic scaffold, not the top-bucket theorem surface or a production
+  path.
 
-## The bitmap prefilter has a closed-form null
+## The bitmap prefilter has a checked finite model
 
 The `Bitmap` prefilter scores candidates by `popcount(Q AND D)` over each
-document's fixed-size top-bucket set. Two *unrelated* documents — modelled as
-independent uniform top-bucket sets — overlap **hypergeometrically**,
-`H(dim, n_top, n_top)`, with expected overlap `n_top² / dim` (e.g. 16 at
-`dim = 256`, `n_top = 64`). So the filter's **selectivity** — how often an
-unrelated document clears a given overlap threshold — is closed-form and
-data-independent, not a tuned cutoff. (Whether *true* neighbours clear the bar
-is empirical; this is an exact candidate-generation null, not a
-retrieval-optimality theorem.) Two pieces of this are separately machine-checked in Lean 4, both `sorry`-free
-on Lean's standard axiom base (`propext`, `Classical.choice`, `Quot.sound`):
+document's fixed-size top-bucket set. In the idealized uniform constant-weight
+null, two unrelated `n_top`-active bitmaps in `dim` coordinates overlap
+**hypergeometrically**, `H(dim, n_top, n_top)`, with expected overlap
+`n_top² / dim` (e.g. 16 at `dim = 256`, `n_top = 64`). That makes the null
+selectivity of an overlap cutoff closed-form.
+
+The current proof story is stronger than a closed-form null alone. Two pieces
+are machine-checked in Lean 4, both `sorry`-free on Lean's standard axiom base
+(`propext`, `Classical.choice`, `Quot.sound`):
 
 - the **ordinal invariance** on which the rank transform rests — that a vector's
   sorting permutation is unchanged by any strictly monotone reparametrisation
   of its coordinates — in
   [`takens-formalization`](https://github.com/Project-Navi/takens-formalization)
   (theorem `isOrdinalPatternOf_comp_strictMono`); and
-- the **shape of the bitmap candidate filter** — that under a finite
-  monotone-likelihood-ratio overlap-tilt model, an overlap-count *threshold*
-  (the popcount cutoff) is the Bayes-optimal deterministic admission rule, and
-  the uniform constant-weight null gives that threshold exactly the
-  hypergeometric upper tail — in
+- the **finite constant-weight bitmap admission model** — symmetry makes
+  literal overlap the canonical query-preserving invariant, quotient
+  sufficiency reduces the decision to that evidence, a finite overlap-tilt
+  signal model makes an overlap-count threshold Bayes-optimal among
+  deterministic admission rules, and the uniform constant-weight bitmap null
+  assigns that same threshold event exactly the hypergeometric upper tail — in
   [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
   (theorem `exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).
-  This is an *in-model* result: it fixes the optimal rule's shape under the
-  stated assumptions, not a claim that any given corpus obeys the model.
+
+This is an *in-model* result. It proves the rule shape and the idealized finite
+null under explicit quotient, symmetry, and monotone-overlap assumptions. It
+does not prove that real encoders satisfy those assumptions, that the textbook
+hypergeometric is every deployment corpus's null, or that ordinal quotients are
+representation-complete. Whether true neighbours clear a cutoff remains an
+empirical contract to measure.
 
 Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md).
 
@@ -149,6 +156,10 @@ Wheels target CPython 3.10+ (abi3); to build from source instead, see
 - **Index-file trust model:**
   [`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md),
   [`THREAT_MODEL.md`](THREAT_MODEL.md)
+- **Formal proof spine:** [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization),
+  including its [`proof-spine`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/proof-spine.md),
+  [`theorem-map`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/theorem-map.md),
+  and [`reviewer brief`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/reviewer-brief.md).
 - **API docs:** <https://docs.rs/ordvec>
 - **Paper (OrdVec / RankQuant):** _link TBD — see
   [Research collaboration](#research-collaboration)._
@@ -236,8 +247,10 @@ Collaboration we're actively seeking:
 
 - **Real-corpus evaluation** — running the modes against public corpora
   (GloVe, MTEB / BEIR, OpenAI embedding dumps) beyond the synthetic benchmark.
-- **Theory** — formalising the hypergeometric candidate-generation null and
-  the rank-cosine invariants.
+- **Theory** — extending and independently auditing the `sorry`-free Lean
+  formalization, especially the finite bitmap proof spine, rank-cosine
+  invariants, and empirical diagnostics for when real encoders meet or violate
+  the model assumptions.
 - **Independent reproduction** — re-running the benchmark on other hardware
   and reporting the numbers.
 

@@ -57,5 +57,6 @@ into more hosts rather than forcing callers to adapt to it.
 
 ## Out of scope here
 
-Benchmarks and the intellectual framing live in the OrdVec / RankQuant paper;
-this roadmap concerns the crate as an embeddable building block.
+Benchmarks and the intellectual framing live in the OrdVec / RankQuant paper
+and the companion `ordvec-formalization` Lean repository; this roadmap concerns
+the crate as an embeddable building block.
@@ -128,23 +128,29 @@ don't have a hypergeometric null because they don't have fixed
 bucket cardinalities — their score distribution depends on the
 unknown embedding distribution.
 
-**A research program this suggests — now partly machine-checked.** The
-chain — representation → statistic → retrieval theorem → systems
-implementation — now has a checked link in the middle. Under a finite
-monotone-likelihood-ratio overlap-tilt model (the formal cousin of a
-shared-latent-support model, where relevant documents have elevated
-coordinates on a query-specific support set `S_q`), the top-bucket
-overlap statistic is monotone in the likelihood ratio for relevance,
-and an overlap-count threshold — the popcount cutoff — is the
-Bayes-optimal deterministic admission rule, with the uniform
-constant-weight null assigning that threshold exactly the
-hypergeometric upper tail. That is proved `sorry`-free in
-[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
-(`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).
-It is an *in-model* result: it fixes the shape of the optimal rule
-under the stated assumptions, not a claim that any given corpus obeys
-the model — whether real neighbours clear the bar stays empirical,
-which is what the bench and the paper measure.
+**Checked finite model: symmetry, quotient sufficiency, threshold,
+calibration.** The proof chain now has a larger machine-checked middle
+than the implementation docs used to claim. In
+[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization),
+Lean proves that literal bitmap overlap is the canonical invariant
+under query-preserving coordinate relabelings; finite quotient
+sufficiency reduces the admission decision to ordered overlap
+evidence when the likelihood ratio factors through it; a finite
+monotone-likelihood-ratio overlap-tilt model makes an overlap-count
+threshold Bayes-optimal among deterministic admission rules; and the
+uniform constant-weight bitmap null gives that same threshold event
+the exact hypergeometric upper tail. The headline theorem is
+`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`;
+the proof path is summarized in the formalization repo's
+`docs/proof-spine.md`, with theorem names in `docs/theorem-map.md`.
+
+It is still an *in-model* result. The theorem is about literal
+constant-weight bitmaps, finite deterministic admission, and explicit
+symmetry / quotient / monotone-overlap assumptions. It does not prove
+that real encoders satisfy those assumptions, that the textbook
+hypergeometric is every deployment corpus's null, or that ordinal
+quotients are representation-complete. Whether real neighbours clear a
+cutoff stays empirical, which is what the bench and the paper measure.
 
 The systems consequence is what the bench measures: at a moderate M
 the bitmap probe captures most of exact RankQuant's top-10 neighbours,
@@ -512,3 +518,7 @@ multi-seed stability is your call.
    retrieval still works after the removal is the interesting result:
    on the corpora tested, those components were carrying less than the
    dense-quantization literature assumes.
+6. **Formal boundary.** The Lean result supports the constant-weight
+   bitmap overlap admission model and its idealized null calibration.
+   It does not replace real-corpus recall, null-fit, or monotonicity
+   checks for a deployed encoder.
@@ -18,6 +18,11 @@
 //! - recall@10 against FP32 brute-force cosine ground truth
 //! - candidate-recall (CR) for the two-stage modes (ANN probe quality)
 //!
+//! The bitmap rows also serve as an empirical complement to the Lean
+//! constant-weight overlap model: they measure whether this synthetic or
+//! user-supplied corpus behaves like the monotone-overlap regime assumed by the
+//! theorem.
+//!
 //! DETERMINISM. The corpus, queries, and ground truth are fully seeded,
 //! so every QUALITY column (R@10, CR, bytes/vec, total MiB, ns/dim) is
 //! bit-identical across runs on the same machine. Only the wall-clock

@@ -21,8 +21,20 @@ scores, ids = q.search_asymmetric(np.random.randn(8, 1024).astype(np.float32), k
 |-------|---------|
 | `Rank` | Full-precision rank vectors (u16 per coordinate). |
 | `RankQuant` | Bucketed ranks, `bits` ∈ {1, 2, 4}; symmetric + asymmetric (float-query LUT) scoring. |
-| `Bitmap` | Top-bucket bitmap per document; `popcount(Q AND D)` candidate scoring. |
-| `SignBitmap` | Sign bitmap for sign-cosine candidate generation. |
+| `Bitmap` | Constant-weight top-bucket bitmap per document; `popcount(Q AND D)` candidate scoring. |
+| `SignBitmap` | Sign bitmap for sign-cosine candidate generation; separate from the constant-weight bitmap theorem. |
+
+## Theory and calibration
+
+`Bitmap` exposes the constant-weight top-bucket overlap statistic formalized in
+[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization).
+In that finite Lean model, literal bitmap overlap is the query-preserving
+quotient statistic, an overlap threshold is Bayes-optimal under explicit
+monotone-overlap assumptions, and the idealized uniform constant-weight null
+calibrates that threshold by the hypergeometric upper tail.
+
+This is not a deployment guarantee for every encoder or corpus. Real-corpus
+recall, monotonicity, and null fit remain empirical diagnostics.
 
 ## Installation
 

@@ -49,6 +49,7 @@ dependencies = [
 Homepage = "https://github.com/Fieldnote-Echo/ordvec"
 Repository = "https://github.com/Fieldnote-Echo/ordvec"
 Issues = "https://github.com/Fieldnote-Echo/ordvec/issues"
+Formalization = "https://github.com/Fieldnote-Echo/ordvec-formalization"
 
 [tool.maturin]
 manifest-path = "Cargo.toml"

@@ -15,6 +15,14 @@
 read/write functions are reached through the classes' ``write()`` / ``load()``
 methods rather than exposed as standalone free functions.
 
+``Bitmap`` exposes the constant-weight top-bucket overlap statistic formalized
+in the companion ``ordvec-formalization`` Lean repo: under explicit finite
+symmetry, quotient, and monotone-overlap assumptions, an overlap threshold is
+Bayes-optimal and the idealized uniform constant-weight null gives that
+threshold the hypergeometric upper tail. This is an in-model candidate-admission
+claim, not a guarantee that real encoders or deployment corpora satisfy those
+assumptions.
+
 The ``*Index`` names are back-compat aliases for the pre-0.2 turbovec-python
 rank-mode classes; they are kept only to ease script migration and are not part
 of the documented surface — new code should use the OrdVec ontology names above.