Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Documentation

- Reframed bitmap-overlap docs around the checked Lean proof spine: query
symmetry, quotient sufficiency, finite threshold optimality, and idealized
hypergeometric calibration, while preserving the real-encoder caveats.

## [0.2.0] - 2026-05-26

First public release on crates.io / PyPI — the crate was not published before
Expand Down
4 changes: 4 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,10 @@ Contributions to the code, the docs, and the paper are all welcome.
reproducible `examples/bench_rank` synthetic run. Real-corpus numbers are
user-runnable and reported in the paper — please don't add unverifiable
performance claims.
- **Keep theory claims bounded.** Cite theorem names or formalization docs for
proof-backed statements, and preserve the finite in-model vs real-encoder
caveat. The Lean bitmap theorem proves a constant-weight overlap admission
model under explicit assumptions; it is not a blanket retrieval guarantee.
- **MSRV is Rust 1.89.** Don't use newer standard-library or language APIs.
- **Stable surface.** The persistence file magics (`.tvr` / `.tvrq` /
`.tvbm` / `.tvsb`) and the public method names
Expand Down
53 changes: 33 additions & 20 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,35 +66,42 @@ Two further paths, for callers who need them:
latency-critical callers know it exists.
- **`MultiBucketBitmap`** *(behind `--features experimental`)* — the
multi-bucket bilinear-overlap probe behind the research-side decomposition;
a scaffold for the theory, not a production path.
an algebraic scaffold, not the top-bucket theorem surface or a production
path.

## The bitmap prefilter has a closed-form null
## The bitmap prefilter has a checked finite model

The `Bitmap` prefilter scores candidates by `popcount(Q AND D)` over each
document's fixed-size top-bucket set. Two *unrelated* documents — modelled as
independent uniform top-bucket sets — overlap **hypergeometrically**,
`H(dim, n_top, n_top)`, with expected overlap `n_top² / dim` (e.g. 16 at
`dim = 256`, `n_top = 64`). So the filter's **selectivity** — how often an
unrelated document clears a given overlap threshold — is closed-form and
data-independent, not a tuned cutoff. (Whether *true* neighbours clear the bar
is empirical; this is an exact candidate-generation null, not a
retrieval-optimality theorem.) Two pieces of this are separately machine-checked in Lean 4, both `sorry`-free
on Lean's standard axiom base (`propext`, `Classical.choice`, `Quot.sound`):
document's fixed-size top-bucket set. In the idealized uniform constant-weight
null, two unrelated `n_top`-active bitmaps in `dim` coordinates overlap
**hypergeometrically**, `H(dim, n_top, n_top)`, with expected overlap
`n_top² / dim` (e.g. 16 at `dim = 256`, `n_top = 64`). That makes the null
selectivity of an overlap cutoff closed-form.

The current proof story is stronger than a closed-form null alone. Two pieces
are machine-checked in Lean 4, both `sorry`-free on Lean's standard axiom base
(`propext`, `Classical.choice`, `Quot.sound`):

- the **ordinal invariance** on which the rank transform rests — that a vector's
sorting permutation is unchanged by any strictly monotone reparametrisation
of its coordinates — in
[`takens-formalization`](https://github.com/Project-Navi/takens-formalization)
(theorem `isOrdinalPatternOf_comp_strictMono`); and
- the **shape of the bitmap candidate filter** — that under a finite
monotone-likelihood-ratio overlap-tilt model, an overlap-count *threshold*
(the popcount cutoff) is the Bayes-optimal deterministic admission rule, and
the uniform constant-weight null gives that threshold exactly the
hypergeometric upper tail — in
- the **finite constant-weight bitmap admission model** — symmetry makes
literal overlap the canonical query-preserving invariant, quotient
sufficiency reduces the decision to that evidence, a finite overlap-tilt
signal model makes an overlap-count threshold Bayes-optimal among
deterministic admission rules, and the uniform constant-weight bitmap null
assigns that same threshold event exactly the hypergeometric upper tail — in
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
(theorem `exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).
This is an *in-model* result: it fixes the optimal rule's shape under the
stated assumptions, not a claim that any given corpus obeys the model.

This is an *in-model* result. It proves the rule shape and the idealized finite
null under explicit quotient, symmetry, and monotone-overlap assumptions. It
does not prove that real encoders satisfy those assumptions, that the textbook
hypergeometric is every deployment corpus's null, or that ordinal quotients are
representation-complete. Whether true neighbours clear a cutoff remains an
empirical contract to measure.

Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md).

Expand Down Expand Up @@ -149,6 +156,10 @@ Wheels target CPython 3.10+ (abi3); to build from source instead, see
- **Index-file trust model:**
[`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md),
[`THREAT_MODEL.md`](THREAT_MODEL.md)
- **Formal proof spine:** [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization),
including its [`proof-spine`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/proof-spine.md),
[`theorem-map`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/theorem-map.md),
and [`reviewer brief`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/reviewer-brief.md).
- **API docs:** <https://docs.rs/ordvec>
- **Paper (OrdVec / RankQuant):** _link TBD — see
[Research collaboration](#research-collaboration)._
Expand Down Expand Up @@ -236,8 +247,10 @@ Collaboration we're actively seeking:

- **Real-corpus evaluation** — running the modes against public corpora
(GloVe, MTEB / BEIR, OpenAI embedding dumps) beyond the synthetic benchmark.
- **Theory** — formalising the hypergeometric candidate-generation null and
the rank-cosine invariants.
- **Theory** — extending and independently auditing the `sorry`-free Lean
formalization, especially the finite bitmap proof spine, rank-cosine
invariants, and empirical diagnostics for when real encoders meet or violate
the model assumptions.
- **Independent reproduction** — re-running the benchmark on other hardware
and reporting the numbers.

Expand Down
5 changes: 3 additions & 2 deletions ROADMAP.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,5 +57,6 @@ into more hosts rather than forcing callers to adapt to it.

## Out of scope here

Benchmarks and the intellectual framing live in the OrdVec / RankQuant paper;
this roadmap concerns the crate as an embeddable building block.
Benchmarks and the intellectual framing live in the OrdVec / RankQuant paper
and the companion `ordvec-formalization` Lean repository; this roadmap concerns
the crate as an embeddable building block.
44 changes: 27 additions & 17 deletions docs/RANK_MODES.md
Original file line number Diff line number Diff line change
Expand Up @@ -128,23 +128,29 @@ don't have a hypergeometric null because they don't have fixed
bucket cardinalities — their score distribution depends on the
unknown embedding distribution.

**A research program this suggests — now partly machine-checked.** The
chain — representation → statistic → retrieval theorem → systems
implementation — now has a checked link in the middle. Under a finite
monotone-likelihood-ratio overlap-tilt model (the formal cousin of a
shared-latent-support model, where relevant documents have elevated
coordinates on a query-specific support set `S_q`), the top-bucket
overlap statistic is monotone in the likelihood ratio for relevance,
and an overlap-count threshold — the popcount cutoff — is the
Bayes-optimal deterministic admission rule, with the uniform
constant-weight null assigning that threshold exactly the
hypergeometric upper tail. That is proved `sorry`-free in
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
(`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).
It is an *in-model* result: it fixes the shape of the optimal rule
under the stated assumptions, not a claim that any given corpus obeys
the model — whether real neighbours clear the bar stays empirical,
which is what the bench and the paper measure.
**Checked finite model: symmetry, quotient sufficiency, threshold,
calibration.** The proof chain now has a larger machine-checked middle
than the implementation docs used to claim. In
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization),
Lean proves that literal bitmap overlap is the canonical invariant
under query-preserving coordinate relabelings; finite quotient
sufficiency reduces the admission decision to ordered overlap
evidence when the likelihood ratio factors through it; a finite
monotone-likelihood-ratio overlap-tilt model makes an overlap-count
threshold Bayes-optimal among deterministic admission rules; and the
uniform constant-weight bitmap null gives that same threshold event
the exact hypergeometric upper tail. The headline theorem is
`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`;
the proof path is summarized in the formalization repo's
`docs/proof-spine.md`, with theorem names in `docs/theorem-map.md`.

It is still an *in-model* result. The theorem is about literal
constant-weight bitmaps, finite deterministic admission, and explicit
symmetry / quotient / monotone-overlap assumptions. It does not prove
that real encoders satisfy those assumptions, that the textbook
hypergeometric is every deployment corpus's null, or that ordinal
quotients are representation-complete. Whether real neighbours clear a
cutoff stays empirical, which is what the bench and the paper measure.

The systems consequence is what the bench measures: at a moderate M
the bitmap probe captures most of exact RankQuant's top-10 neighbours,
Expand Down Expand Up @@ -512,3 +518,7 @@ multi-seed stability is your call.
retrieval still works after the removal is the interesting result:
on the corpora tested, those components were carrying less than the
dense-quantization literature assumes.
6. **Formal boundary.** The Lean result supports the constant-weight
bitmap overlap admission model and its idealized null calibration.
It does not replace real-corpus recall, null-fit, or monotonicity
checks for a deployed encoder.
5 changes: 5 additions & 0 deletions examples/bench_rank.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,11 @@
//! - recall@10 against FP32 brute-force cosine ground truth
//! - candidate-recall (CR) for the two-stage modes (ANN probe quality)
//!
//! The bitmap rows also serve as an empirical complement to the Lean
//! constant-weight overlap model: they measure whether this synthetic or
//! user-supplied corpus behaves like the monotone-overlap regime assumed by the
//! theorem.
//!
//! DETERMINISM. The corpus, queries, and ground truth are fully seeded,
//! so every QUALITY column (R@10, CR, bytes/vec, total MiB, ns/dim) is
//! bit-identical across runs on the same machine. Only the wall-clock
Expand Down
16 changes: 14 additions & 2 deletions ordvec-python/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,20 @@ scores, ids = q.search_asymmetric(np.random.randn(8, 1024).astype(np.float32), k
|-------|---------|
| `Rank` | Full-precision rank vectors (u16 per coordinate). |
| `RankQuant` | Bucketed ranks, `bits` ∈ {1, 2, 4}; symmetric + asymmetric (float-query LUT) scoring. |
| `Bitmap` | Top-bucket bitmap per document; `popcount(Q AND D)` candidate scoring. |
| `SignBitmap` | Sign bitmap for sign-cosine candidate generation. |
| `Bitmap` | Constant-weight top-bucket bitmap per document; `popcount(Q AND D)` candidate scoring. |
| `SignBitmap` | Sign bitmap for sign-cosine candidate generation; separate from the constant-weight bitmap theorem. |

## Theory and calibration

`Bitmap` exposes the constant-weight top-bucket overlap statistic formalized in
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization).
In that finite Lean model, literal bitmap overlap is the query-preserving
quotient statistic, an overlap threshold is Bayes-optimal under explicit
monotone-overlap assumptions, and the idealized uniform constant-weight null
calibrates that threshold by the hypergeometric upper tail.

This is not a deployment guarantee for every encoder or corpus. Real-corpus
recall, monotonicity, and null fit remain empirical diagnostics.

## Installation

Expand Down
1 change: 1 addition & 0 deletions ordvec-python/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ dependencies = [
Homepage = "https://github.com/Fieldnote-Echo/ordvec"
Repository = "https://github.com/Fieldnote-Echo/ordvec"
Issues = "https://github.com/Fieldnote-Echo/ordvec/issues"
Formalization = "https://github.com/Fieldnote-Echo/ordvec-formalization"

[tool.maturin]
manifest-path = "Cargo.toml"
Expand Down
8 changes: 8 additions & 0 deletions ordvec-python/python/ordvec/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,14 @@
read/write functions are reached through the classes' ``write()`` / ``load()``
methods rather than exposed as standalone free functions.

``Bitmap`` exposes the constant-weight top-bucket overlap statistic formalized
in the companion ``ordvec-formalization`` Lean repo: under explicit finite
symmetry, quotient, and monotone-overlap assumptions, an overlap threshold is
Bayes-optimal and the idealized uniform constant-weight null gives that
threshold the hypergeometric upper tail. This is an in-model candidate-admission
claim, not a guarantee that real encoders or deployment corpora satisfy those
assumptions.

The ``*Index`` names are back-compat aliases for the pre-0.2 turbovec-python
rank-mode classes; they are kept only to ease script migration and are not part
of the documented surface — new code should use the OrdVec ontology names above.
Expand Down
Loading
Loading