Vector search returns an updated row twice after update_columns + optimize_indices #7370

Description

@wombatu-kun

Summary

After overwriting an indexed vector column in place via the low-level LanceFragment.update_columns API and then delta-optimizing the index with optimize_indices(num_indices_to_merge=0), a KNN vector query returns the updated row twice: once with the stale (pre-update) indexed vector and once with the new value.

Reproduction

import lance, numpy as np, pyarrow as pa, tempfile, os

ndim = 16
np.random.seed(42)
uri = os.path.join(tempfile.mkdtemp(), "ds")

def fsl(arr):
    # Wrap an (n, ndim) float32 array as a fixed-size-list Arrow array.
    return pa.FixedSizeListArray.from_arrays(pa.array(arr.reshape(-1), pa.float32()), list_size=ndim)

# First write: 1000 random rows. Second write (append): a single row with id=10000.
bulk = np.random.uniform(-1, 1, (1000, ndim)).astype(np.float32)
lance.write_dataset(pa.table({"id": pa.array(range(1000)), "vector": fsl(bulk)}), uri)
ds = lance.write_dataset(
    pa.table({"id": pa.array([10000]), "vector": fsl(np.full((1, ndim), 2.0, np.float32))}),
    uri, mode="append")

# Index both fragments in a single segment.
ds = ds.create_index("vector", index_type="IVF_PQ", num_partitions=1, num_sub_vectors=ndim)

# Overwrite the vector of the appended row in place (same fragment id, same row address).
frag = ds.get_fragment(1)
rowids = frag.to_table(columns=["id"], with_row_id=True)["_rowid"].to_pylist()
upd = pa.table({"_rowid": pa.array(rowids, pa.uint64()),
                "vector": pa.array([[10.8] * ndim] * len(rowids), pa.list_(pa.float32(), ndim))})
uf, fm = frag.update_columns(upd)
ds = lance.LanceDataset.commit(ds.uri, lance.LanceOperation.Update(updated_fragments=[uf], fields_modified=fm),
                               read_version=ds.version)

# Delta-optimize: index the updated fragment in a new segment without merging the old one.
ds.optimize.optimize_indices(num_indices_to_merge=0)
ds = lance.dataset(ds.uri)

# Query at the new vector value and count how often the updated row comes back.
res = ds.to_table(columns=["id"], nearest={"column": "vector", "q": np.full(ndim, 10.8, np.float32), "k": 10},
                  with_row_id=True).to_pandas()
print(int((res["id"] == 10000).sum()))   # prints 2; expected 1

Root cause

update_columns rewrites only the column data file, keeping the fragment id and row address. Committing the Update prunes the updated fragment from the existing index segment's fragment_bitmap, but that segment's index file still physically holds the row's old vector. optimize_indices(num_indices_to_merge=0) then appends a new delta segment for the updated fragment (with the new vector) without rewriting the old segment.
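
The state this paragraph describes can be modelled in a few lines of plain Python. This is only a toy sketch, not Lance code: `Segment`, its `fragment_bitmap` (a Python set) and `stored` (a dict) are hypothetical stand-ins, the vector is collapsed to a single float, and the row address layout (fragment id in the upper 32 bits) is assumed to be the default scheme without stable row ids.

```python
from dataclasses import dataclass, field

def row_addr(fragment_id: int, offset: int) -> int:
    # Assumed layout: fragment id in the upper 32 bits, row offset in the lower 32.
    return (fragment_id << 32) | offset

@dataclass
class Segment:
    """Toy stand-in for one index segment (not a Lance type)."""
    fragment_bitmap: set                         # fragments the segment claims to cover
    stored: dict = field(default_factory=dict)   # row address -> indexed value

addr = row_addr(1, 0)  # the single row of fragment 1

# After create_index: one segment covers fragments 0 and 1.
old = Segment(fragment_bitmap={0, 1}, stored={addr: 2.0})

# Committing the Update only prunes the bitmap; the index file is untouched.
old.fragment_bitmap.discard(1)

# optimize_indices(num_indices_to_merge=0) adds a delta segment for fragment 1.
delta = Segment(fragment_bitmap={1}, stored={addr: 10.8})

# The old segment no longer claims fragment 1 but still physically holds the row.
print(old.fragment_bitmap, addr in old.stored, old.stored[addr])
print(delta.fragment_bitmap, addr in delta.stored, delta.stored[addr])
```

The mismatch between `old.fragment_bitmap` and `old.stored` is the precondition for the duplicate: the same row address is physically present in two segments.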

At query time DatasetPreFilter is built once from the union of all segment bitmaps, so it covers every fragment and applies no restriction. Both segments are searched and there is no cross-segment row-id dedup on the vector path, so the old segment returns the row with its stale vector and the new delta segment returns it with the new value: a duplicate. (The same stale segment also misranks the row for queries near its old vector.)
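
A minimal simulation of this query path, again in plain Python rather than Lance internals. The segment tuples, the function name `knn_no_dedup`, and the distance values are all made up for illustration; only the control flow (one prefilter from the union of bitmaps, every segment searched, top-k merged without row-id dedup) follows the description above.

```python
import heapq

# Row 0 of fragment 1, assuming the fragment id sits in the upper 32 bits.
ROW = (1 << 32) | 0

# Toy segments: (fragment_bitmap, {row_address: distance_to_query}).
# Distances are illustrative only.
old_segment = ({0}, {ROW: 77.4, (0 << 32) | 5: 90.0})   # stale entry for ROW still on disk
delta_segment = ({1}, {ROW: 0.0})                        # fresh entry for ROW

def knn_no_dedup(segments, k):
    # Prefilter built once from the union of all bitmaps: it covers every fragment.
    allowed = set().union(*(bitmap for bitmap, _ in segments))
    hits = []
    for _, entries in segments:
        for addr, dist in entries.items():
            if (addr >> 32) in allowed:      # always true here, so nothing is filtered
                hits.append((dist, addr))
    return heapq.nsmallest(k, hits)          # merge top-k with no row-id dedup

result = knn_no_dedup([old_segment, delta_segment], k=10)
print(sum(1 for _, addr in result if addr == ROW))   # 2
```

Because the filter is global rather than per segment, the stale hit from `old_segment` survives alongside the fresh one from `delta_segment`.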

Expected

The updated row appears exactly once, served by the segment that owns the fragment.
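
One way to get that behaviour, sketched on the same kind of toy model, is to mask each segment's hits with that segment's own fragment bitmap instead of the global union. This is an assumption about a possible approach, not necessarily what the fix will do; `knn_owned_only`, the segment tuples, and the distances are hypothetical.

```python
# Row 0 of fragment 1, assuming the fragment id sits in the upper 32 bits.
ROW = (1 << 32) | 0

# Toy segments: (fragment_bitmap, {row_address: distance_to_query}).
# Distances are illustrative only.
old_segment = ({0}, {ROW: 77.4, (0 << 32) | 5: 90.0})   # stale entry for ROW
delta_segment = ({1}, {ROW: 0.0})                        # fresh entry for ROW

def knn_owned_only(segments, k):
    hits = []
    for bitmap, entries in segments:
        for addr, dist in entries.items():
            # Keep a hit only if this segment's own bitmap covers the row's fragment.
            if (addr >> 32) in bitmap:
                hits.append((dist, addr))
    return sorted(hits)[:k]

result = knn_owned_only([old_segment, delta_segment], k=10)
print(sum(1 for _, addr in result if addr == ROW))   # 1
```

With the per-segment mask, the stale entry in `old_segment` is dropped because fragment 1 is no longer in its bitmap, so the row is served once, by the delta segment.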

I'll open a PR with the fix.

Labels: bug, critical-fix
