Low recall for IVF_SQ / IVF_HNSW_SQ with metric="dot" on MSMARCO #7352

I observed unexpectedly low recall when using SQ-based indexes with dot-product search on the MSMARCO 1M dataset.

**Environment**
- Lance tag: `v8.0.0-beta.18` (commit `909dea18b1`)
- Python package: `8.0.0-beta.18`
- Dataset: MSMARCO Web Search 1M – 1M base vectors, 9,376 queries, dim 768
- Metric: `dot`; ground‑truth: `msmarco-1M-gt100`; evaluated metric: `recall@10`
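
For reference, here is a minimal sketch of how `recall@10` can be computed against a ground-truth file that stores more neighbours per query than are evaluated (such as `gt100`). The function name `recall_at_k` and the choice to compare against only the first `k` ground-truth IDs are assumptions for illustration, not necessarily what the benchmark harness does.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true top-k neighbours found in the retrieved top-k.

    retrieved:    one list of result IDs per query, best first
    ground_truth: one list of exact neighbour IDs per query, best first
                  (may hold more than k entries, e.g. 100)
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("one result list per query is required")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        true_top_k = set(truth[:k])  # assumption: only the first k truths count
        total += len(true_top_k.intersection(got[:k])) / k
    return total / len(retrieved)


# Two toy queries with k=2: the first finds both neighbours, the second finds one.
example = recall_at_k([[1, 2], [3, 9]], [[2, 1, 7], [3, 4, 8]], k=2)
```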

**Index Configurations**
- `IVF_HNSW_SQ`: `num_partitions=1`, `num_bits=8`, `m=16`, `ef_construction=200`, `nprobes=1`, `ef ∈ {20,40,80,160,320,640}`
- `IVF_SQ`: `num_partitions=1024`, `num_bits=8`, `nprobes ∈ {16,32,64,96,128}`
- `IVF_HNSW_FLAT` (baseline): `num_partitions=1`, `m=16`, `ef_construction=200`, `nprobes=1`, `ef ∈ {20,40,80,160,320,640}`

**Results**

| Index           | nprobes | ef   | recall@10 |
|-----------------|---------|------|-----------|
| `IVF_HNSW_SQ`   | 1       | 20   | 0.0250    |
| `IVF_HNSW_SQ`   | 1       | 40   | 0.0370    |
| `IVF_HNSW_SQ`   | 1       | 80   | 0.0490    |
| `IVF_HNSW_SQ`   | 1       | 160  | 0.0577    |
| `IVF_HNSW_SQ`   | 1       | 320  | 0.0641    |
| `IVF_HNSW_SQ`   | 1       | 640  | 0.0684    |
| `IVF_SQ`        | 16      | N/A  | 0.0918    |
| `IVF_SQ`        | 32      | N/A  | 0.0891    |
| `IVF_SQ`        | 64      | N/A  | 0.0862    |
| `IVF_SQ`        | 96      | N/A  | 0.0841    |
| `IVF_SQ`        | 128     | N/A  | 0.0827    |

**Baseline (`IVF_HNSW_FLAT`)**

| Index           | nprobes | ef   | recall@10 |
|-----------------|---------|------|-----------|
| `IVF_HNSW_FLAT` | 1       | 20   | 0.5179    |
| `IVF_HNSW_FLAT` | 1       | 40   | 0.6522    |
| `IVF_HNSW_FLAT` | 1       | 80   | 0.7566    |
| `IVF_HNSW_FLAT` | 1       | 160  | 0.8382    |
| `IVF_HNSW_FLAT` | 1       | 320  | 0.8983    |
| `IVF_HNSW_FLAT` | 1       | 640  | 0.9377    |

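The gap is substantial: the flat baseline reaches 0.9377 recall@10 at `ef=640`, whereas `IVF_HNSW_SQ` peaks at 0.0684 and `IVF_SQ` stays between 0.0827 and 0.0918. `IVF_SQ` recall also decreases as `nprobes` grows (0.0918 at 16, 0.0827 at 128), which is the opposite of the expected trend.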

**Possible Cause**

I suspect the dot-product distance computation for SQ is incorrect.

SQ encodes each floating-point value with a lower-bound offset:

```text
code = (value - lower_bound) / scale
```

So the approximate dequantized value is:

```text
value ≈ lower_bound + scale * code
```

Let `cx` be the SQ code of the data vector `x`, and `cq` be the SQ code of the query vector `q`.

The current SQ dot path appears to compute dot product directly in code space, roughly as:

```text
dot(x, q) = scale² * sum(cx_i * cq_i)
```

However, the dot product between the dequantized vectors should be computed as:

```text
dot(x, q)
≈ sum((lower_bound + scale * cx_i) * (lower_bound + scale * cq_i))
```

Expanding this gives:

```text
dot(x, q)
≈ scale² * sum(cx_i * cq_i)
 + scale * lower_bound * sum(cx_i)
 + scale * lower_bound * sum(cq_i)
 + dim * lower_bound²
```

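So if the implementation only uses the code-space dot term, it misses the offset-related terms. With the shared `lower_bound` and `scale` assumed above, `scale * lower_bound * sum(cq_i)` and `dim * lower_bound²` are identical for every candidate of a given query and do not affect ordering, but `scale * lower_bound * sum(cx_i)` varies per data vector, so dropping it can reorder candidates whenever `lower_bound` is nonzero. L2 distance does not have this problem, because the offset cancels in `x_i - q_i`. This may explain the very low recall observed for `IVF_SQ` and `IVF_HNSW_SQ` with `metric="dot"`.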
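
To make the effect concrete, here is a small self-contained sketch that ranks three toy vectors with the code-space-only score and with the expanded, offset-aware score. It is not taken from the Lance code base: `LOWER_BOUND`, `SCALE`, the function names, and the toy codes are hypothetical values chosen so the arithmetic is exact.

```python
LOWER_BOUND = -1.0  # hypothetical shared offset
SCALE = 0.5         # hypothetical step size: codes 0..4 map to -1.0..1.0


def code_space_dot(cx, cq):
    """Only the scale² * sum(cx_i * cq_i) term."""
    return SCALE ** 2 * sum(a * b for a, b in zip(cx, cq))


def offset_aware_dot(cx, cq):
    """All four terms of the expansion above."""
    return (code_space_dot(cx, cq)
            + SCALE * LOWER_BOUND * sum(cx)
            + SCALE * LOWER_BOUND * sum(cq)
            + len(cx) * LOWER_BOUND ** 2)


def dequantized_dot(cx, cq):
    """Reference: decode both vectors, then take the plain dot product."""
    x = [LOWER_BOUND + SCALE * c for c in cx]
    q = [LOWER_BOUND + SCALE * c for c in cq]
    return sum(a * b for a, b in zip(x, q))


def rank_by(score, data_codes, query_codes):
    """Indices of data vectors, highest dot product first."""
    scores = [score(cx, query_codes) for cx in data_codes]
    return sorted(range(len(data_codes)), key=lambda i: -scores[i])


query_codes = [4, 1]                     # decodes to (1.0, -0.5)
data_codes = [[0, 0], [1, 4], [4, 0]]    # (-1, -1), (-0.5, 1), (1, -1)

code_space_ranking = rank_by(code_space_dot, data_codes, query_codes)
offset_aware_ranking = rank_by(offset_aware_dot, data_codes, query_codes)
reference_ranking = rank_by(dequantized_dot, data_codes, query_codes)
```

In this toy case the offset-aware score agrees with the dequantized reference (`[2, 0, 1]`), while the code-space-only score swaps the last two candidates (`[2, 1, 0]`). With 768 dimensions and 1M vectors, the per-vector `sum(cx_i)` term could plausibly scramble the top 10 far more severely.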

