Summary
After overwriting an indexed vector column in place via the low-level LanceFragment.update_columns API and then delta-optimizing the index with optimize_indices(num_indices_to_merge=0), a KNN vector query returns the updated row twice: once with the stale (pre-update) indexed vector and once with the new value.
Reproduction
import os
import tempfile

import lance
import numpy as np
import pyarrow as pa

ndim = 16
np.random.seed(42)
uri = os.path.join(tempfile.mkdtemp(), "ds")

def fsl(arr):
    # Pack an (n, ndim) float32 array into a FixedSizeList<float32, ndim> column.
    return pa.FixedSizeListArray.from_arrays(pa.array(arr.reshape(-1), pa.float32()), list_size=ndim)

# First write: 1000 random vectors. Append: a single row (id=10000) in its own fragment.
bulk = np.random.uniform(-1, 1, (1000, ndim)).astype(np.float32)
lance.write_dataset(pa.table({"id": pa.array(range(1000)), "vector": fsl(bulk)}), uri)
ds = lance.write_dataset(
    pa.table({"id": pa.array([10000]), "vector": fsl(np.full((1, ndim), 2.0, np.float32))}),
    uri, mode="append")
# Build the index while id=10000 still has its original vector (all 2.0).
ds = ds.create_index("vector", index_type="IVF_PQ", num_partitions=1, num_sub_vectors=ndim)

# Overwrite the vector column of the appended fragment in place (all 10.8).
frag = ds.get_fragment(1)
rowids = frag.to_table(columns=["id"], with_row_id=True)["_rowid"].to_pylist()
upd = pa.table({"_rowid": pa.array(rowids, pa.uint64()),
                "vector": pa.array([[10.8] * ndim] * len(rowids), pa.list_(pa.float32(), ndim))})
uf, fm = frag.update_columns(upd)
ds = lance.LanceDataset.commit(ds.uri, lance.LanceOperation.Update(updated_fragments=[uf], fields_modified=fm),
                               read_version=ds.version)

# Delta-optimize: add a new index segment without merging the existing one.
ds.optimize.optimize_indices(num_indices_to_merge=0)
ds = lance.dataset(ds.uri)

# Query at the new vector; id=10000 should come back exactly once.
res = ds.to_table(columns=["id"], nearest={"column": "vector", "q": np.full(ndim, 10.8, np.float32), "k": 10},
                  with_row_id=True).to_pandas()
print(int((res["id"] == 10000).sum()))  # prints 2; expected 1
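To see which rows are duplicated, rather than only counting id == 10000, the _rowid column that the query already requests via with_row_id=True can be grouped. The following is a minimal sketch: duplicated_row_ids is a hypothetical helper, not part of Lance, and the sample list is made up so the block runs without a dataset.

```python
from collections import Counter


def duplicated_row_ids(row_ids):
    """Return {row_id: count} for every row id that appears more than once."""
    return {rid: n for rid, n in Counter(row_ids).items() if n > 1}


# Made-up row ids standing in for res["_rowid"].tolist() from the script above.
sample = [4294967296, 17, 4294967296, 42]
dupes = duplicated_row_ids(sample)
print(dupes)  # {4294967296: 2}
```

Against the reproduction it would be called as duplicated_row_ids(res["_rowid"].tolist()); a non-empty result means the same row was returned more than once.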
Root cause
update_columns rewrites only the column data file, keeping the fragment id and row address. Committing the Update prunes the updated fragment from the existing index segment's fragment_bitmap, but that segment's index file still physically holds the row's old vector. optimize_indices(num_indices_to_merge=0) then appends a new delta segment for the updated fragment (with the new vector) without rewriting the old segment.
At query time DatasetPreFilter is built once from the union of all segment bitmaps, so it covers every fragment and applies no restriction. Both segments are searched and there is no cross-segment row-id dedup on the vector path, so the old segment returns the row with its stale vector and the new delta segment returns it with the new value: a duplicate. (The same stale segment also misranks the row for queries near its old vector.)
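The mechanism described above can be modelled in a few lines of plain Python. This is a toy model, not Lance's code: Segment, knn_shared_prefilter and knn_per_segment_filter are hypothetical names, vectors are 2-dimensional, and distances are exact rather than PQ-approximated. The second function shows one possible way to get a single result per row (restricting each segment to its own bitmap); it is an assumption for illustration and not necessarily what the fix will do.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Toy stand-in for one index segment (hypothetical, not a Lance type)."""
    name: str
    fragment_bitmap: set  # fragment ids the segment is recorded as covering
    entries: dict         # (fragment_id, row_offset) -> vector stored in the index file


def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_shared_prefilter(segments, query, k):
    # One filter built from the union of all bitmaps: every fragment is allowed,
    # so stale entries left in an older segment are not filtered out.
    allowed = set().union(*(s.fragment_bitmap for s in segments))
    hits = [(l2(vec, query), addr, s.name)
            for s in segments
            for addr, vec in s.entries.items()
            if addr[0] in allowed]
    return sorted(hits)[:k]


def knn_per_segment_filter(segments, query, k):
    # Each segment only returns rows from fragments in its own bitmap.
    hits = [(l2(vec, query), addr, s.name)
            for s in segments
            for addr, vec in s.entries.items()
            if addr[0] in s.fragment_bitmap]
    return sorted(hits)[:k]


# State after the Update commit and the delta optimize:
# fragment 1 is pruned from the old bitmap, but its old vector is still in the file.
old = Segment("old", {0}, {(0, 0): [0.5, 0.5], (1, 0): [2.0, 2.0]})
delta = Segment("delta", {1}, {(1, 0): [10.8, 10.8]})

query = [10.8, 10.8]
shared = knn_shared_prefilter([old, delta], query, k=10)
scoped = knn_per_segment_filter([old, delta], query, k=10)

print(sum(addr == (1, 0) for _, addr, _ in shared))  # 2: stale hit plus new hit
print(sum(addr == (1, 0) for _, addr, _ in scoped))  # 1: only the delta segment's hit
```

Querying the toy model near the old vector ([2.0, 2.0]) with the shared filter ranks the stale entry first, which is the misranking noted in the parenthetical above.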
Expected
The updated row appears exactly once, served by the segment that owns the fragment.
I'll open a PR with the fix.