Observed unexpectedly low recall when using SQ-based indexes with dot‑product search on the MSMARCO 1M dataset.
Environment
- Lance tag: v8.0.0-beta.18 (commit 909dea18b1)
- Python package: 8.0.0-beta.18
- Dataset: MSMARCO Web Search 1M – 1M base vectors, 9,376 queries, dim 768
- Metric: dot; ground‑truth: msmarco-1M-gt100; evaluated metric: recall@10
Index Configurations
- IVF_HNSW_SQ: num_partitions=1, num_bits=8, m=16, ef_construction=200, nprobes=1, ef ∈ {20,40,80,160,320,640}
- IVF_SQ: num_partitions=1024, num_bits=8, nprobes ∈ {16,32,64,96,128}
- IVF_HNSW_FLAT (baseline): num_partitions=1, m=16, ef_construction=200, nprobes=1, ef ∈ {20,40,80,160,320,640}
Results
| Index | nprobes | ef | recall@10 |
| --- | --- | --- | --- |
| IVF_HNSW_SQ | 1 | 20 | 0.0250 |
| IVF_HNSW_SQ | 1 | 40 | 0.0370 |
| IVF_HNSW_SQ | 1 | 80 | 0.0490 |
| IVF_HNSW_SQ | 1 | 160 | 0.0577 |
| IVF_HNSW_SQ | 1 | 320 | 0.0641 |
| IVF_HNSW_SQ | 1 | 640 | 0.0684 |
| IVF_SQ | 16 | N/A | 0.0918 |
| IVF_SQ | 32 | N/A | 0.0891 |
| IVF_SQ | 64 | N/A | 0.0862 |
| IVF_SQ | 96 | N/A | 0.0841 |
| IVF_SQ | 128 | N/A | 0.0827 |
Baseline (IVF_HNSW_FLAT)
| Index | nprobes | ef | recall@10 |
| --- | --- | --- | --- |
| IVF_HNSW_FLAT | 1 | 20 | 0.5179 |
| IVF_HNSW_FLAT | 1 | 40 | 0.6522 |
| IVF_HNSW_FLAT | 1 | 80 | 0.7566 |
| IVF_HNSW_FLAT | 1 | 160 | 0.8382 |
| IVF_HNSW_FLAT | 1 | 320 | 0.8983 |
| IVF_HNSW_FLAT | 1 | 640 | 0.9377 |
The gap is substantial: the flat baseline reaches 0.9377 recall@10 at ef=640, whereas IVF_HNSW_SQ peaks at 0.0684 and IVF_SQ remains around 0.08–0.09.
Possible Cause
I suspect the dot-product distance computation for SQ is incorrect.
SQ encodes each floating-point value with a lower-bound offset:
code = (value - lower_bound) / scale
So the approximate dequantized value is:
value ≈ lower_bound + scale * code
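To make the encoding concrete, here is a minimal NumPy sketch of the round trip described above. It is not Lance's implementation: the helper names `sq_encode` and `sq_decode`, the `[-1, 1]` bounds, and the round-to-nearest choice are assumptions made for illustration only.

```python
import numpy as np

def sq_encode(values, lower_bound, upper_bound, num_bits=8):
    """Hypothetical helper: map floats in [lower_bound, upper_bound] to integer codes."""
    levels = (1 << num_bits) - 1              # 255 for 8 bits
    scale = (upper_bound - lower_bound) / levels
    codes = np.round((values - lower_bound) / scale)
    # uint8 storage means this sketch only supports num_bits <= 8.
    return np.clip(codes, 0, levels).astype(np.uint8), scale

def sq_decode(codes, lower_bound, scale):
    """Hypothetical helper: approximate the original floats from the codes."""
    return lower_bound + scale * codes.astype(np.float32)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=768).astype(np.float32)

codes, scale = sq_encode(x, lower_bound=-1.0, upper_bound=1.0)
x_hat = sq_decode(codes, lower_bound=-1.0, scale=scale)

# With round-to-nearest, the per-component error is bounded by about scale / 2.
max_err = float(np.max(np.abs(x - x_hat)))
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

The point to notice is that `lower_bound` enters the decoded value additively, so any distance computed purely on `codes` has to account for it separately.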
Let cx be the SQ code of the data vector x, and cq be the SQ code of the query vector q.
The current SQ dot path appears to compute dot product directly in code space, roughly as:
dot(x, q) = scale² * sum(cx_i * cq_i)
However, the dot product between the dequantized vectors should be computed as:
dot(x, q)
≈ sum((lower_bound + scale * cx_i) * (lower_bound + scale * cq_i))
Expanding this gives:
dot(x, q)
≈ scale² * sum(cx_i * cq_i)
+ scale * lower_bound * sum(cx_i)
+ scale * lower_bound * sum(cq_i)
+ dim * lower_bound²
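So if the implementation only uses the code-space dot term, it misses the offset-related terms. For a fixed query, the sum(cq_i) and dim * lower_bound² terms are identical for every data vector and do not affect ranking, but the scale * lower_bound * sum(cx_i) term varies per data vector. Dropping it can change the candidate ordering significantly, and may explain the very low recall observed for IVF_SQ and IVF_HNSW_SQ with metric="dot".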
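The expansion can be checked numerically. The sketch below uses synthetic uniform data rather than MSMARCO, assumes the data and query share one `lower_bound` and `scale` (as in the formulas above), and does not call Lance at all. It only compares the code-space term against the full expansion, so the overlap numbers it prints say nothing about the recall figures reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, levels = 64, 1000, 255
lower_bound, upper_bound = -1.0, 1.0          # assumed bounds, for illustration
scale = (upper_bound - lower_bound) / levels

def encode(v):
    # Same formula as above: code = (value - lower_bound) / scale, rounded.
    return np.clip(np.round((v - lower_bound) / scale), 0, levels).astype(np.int64)

base = rng.uniform(lower_bound, upper_bound, size=(n, dim))
query = rng.uniform(lower_bound, upper_bound, size=dim)
cx, cq = encode(base), encode(query)

# Code-space term only (what the SQ dot path is suspected to compute).
code_term = scale**2 * (cx @ cq)

# Full expansion with the offset terms restored.
full = (code_term
        + scale * lower_bound * cx.sum(axis=1)   # varies per data vector
        + scale * lower_bound * cq.sum()         # constant for a given query
        + dim * lower_bound**2)                  # constant

# Reference values: dot product of dequantized vectors, and the exact dot product.
dequant = (lower_bound + scale * cx) @ (lower_bound + scale * cq)
exact = base @ query

def top_k(scores, k=10):
    # Larger dot product means a better match.
    return set(np.argsort(-scores)[:k].tolist())

overlap_full = len(top_k(full) & top_k(exact))
overlap_code_only = len(top_k(code_term) & top_k(exact))
print("expansion matches dequantized dot:", np.allclose(full, dequant))
print("top-10 overlap with exact (full expansion):", overlap_full)
print("top-10 overlap with exact (code term only):", overlap_code_only)
```

If the suspicion is right, a fix would only need the per-vector `sum(cx_i)` (which could be precomputed at index time) in addition to the code-space dot product.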