Skip to content

fix(index): drop stale scalar index entries after stable-row-id update#7359

Open
wkalt wants to merge 3 commits into
lance-format:mainfrom
wkalt:ticket/gen-640/stale-scalar-index-stable-rowid-update
Open

fix(index): drop stale scalar index entries after stable-row-id update#7359
wkalt wants to merge 3 commits into
lance-format:mainfrom
wkalt:ticket/gen-640/stale-scalar-index-stable-rowid-update

Conversation

@wkalt

@wkalt wkalt commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Under stable row ids an update deletes a row's old copy and rewrites it to a new fragment under the same row id. optimize_indices kept the old value->row_id entry, so queries for the old value returned the updated row and BTree optimize errored ("from_sorted_iter called with non-sorted input").

  • build_stable_row_id_filter now subtracts each fragment's deletion vector so the old-row allow-list holds only live rows (fixes BTree).
  • BitmapIndex::update applies that filter to old postings via OldIndexDataFilter::retain_old_rows.
  • optimize routes FTS through InvertedIndex::merge_segments (which filters old partitions) instead of the reference-only update path.

Adds a regression test covering all three index types.

@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer bug Something isn't working labels Jun 18, 2026
Under stable row ids an update deletes a row's old copy and rewrites it to
a new fragment under the same row id. optimize_indices kept the old
value->row_id entry, so queries for the old value returned the updated row
and BTree optimize errored ("from_sorted_iter called with non-sorted
input").

- build_stable_row_id_filter now subtracts each fragment's deletion vector
  so the old-row allow-list holds only live rows (fixes BTree).
- BitmapIndex::update applies that filter to old postings via
  OldIndexDataFilter::retain_old_rows.
- optimize routes FTS through InvertedIndex::merge_segments (which filters
  old partitions) instead of the reference-only update path.

Adds a regression test covering all three index types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wkalt wkalt force-pushed the ticket/gen-640/stale-scalar-index-stable-rowid-update branch from 8ff2484 to 3d26238 Compare June 18, 2026 15:32
@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.30942% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/bitmap.rs 79.31% 2 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

@wkalt

wkalt commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Here is a simple example failure:

import tempfile
import lance
import pyarrow as pa

uri = tempfile.mkdtemp() + "/ds"

# 100 rows, cat = "A" for the first 50, "B" for the rest.
tbl = pa.table({
    "id": pa.array(range(100), pa.int32()),
    "cat": pa.array(["A"] * 50 + ["B"] * 50),
})
ds = lance.write_dataset(tbl, uri, enable_stable_row_ids=True)
ds.create_scalar_index("cat", "BITMAP")

# Change the first 25 rows A -> B. Now only ids 25..49 are still "A".
ds.update({"cat": "'B'"}, where="id < 25")

# Optimize_indices should not change a result
print("before optimize, cat='A':", ds.count_rows(filter="cat = 'A'"), "(want 25)")
ds.optimize.optimize_indices()
print("after optimize,  cat='A':", ds.count_rows(filter="cat = 'A'"), "(want 25)")
$ python3 repro.py
before optimize, cat='A': 25 (want 25)
after optimize,  cat='A': 50 (want 25)

wkalt and others added 2 commits June 22, 2026 15:25
live_row_ids swallowed deletion-vector read failures via `.ok().flatten()`,
falling back to the "no deletions" branch and putting the deleted rows back into
the stable-row-id allow-list as stale entries. Propagate with `?` instead, and
add a regression test covering an unreadable deletion file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
IndexType::Inverted is handled by a dedicated arm in
merge_indices_with_unindexed_frags (lance-format#6737) and never reaches
merge_scalar_indices, so its arm here and the open_and_merge_inverted_segments
helper were unreachable. Remove them; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)
.await?
}
// NOTE: IndexType::Inverted never reaches here -- it is handled by the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check this? Looks like this could be a regression, I asked Codex to generate a test case.

The concern is that old_data_filter is only built from the selected tail segment(s). With multiple scalar index segments, default optimize_indices can merge the newest segment plus the new update fragment while leaving an older segment unchanged. If the update deleted rows covered by that older segment, stale old-value postings can remain visible under stable row IDs.

#[tokio::test]
async fn test_optimize_scalar_index_drops_stale_rows_across_segments_after_update() {
    use crate::Dataset;
    use crate::dataset::builder::DatasetBuilder;
    use crate::dataset::{UpdateBuilder, WriteParams};
    use crate::index::{CreateIndexBuilder, DatasetIndexExt};
    use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator};
    use arrow_schema::{DataType, Field, Schema};
    use lance_core::utils::tempfile::TempStrDir;
    use lance_index::IndexType;
    use lance_index::optimize::OptimizeOptions;
    use lance_index::scalar::{BuiltinIndexType, ScalarIndexParams};
    use std::sync::Arc;

    let test_dir = TempStrDir::default();
    let test_uri = test_dir.as_str();

    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("num", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from_iter_values(0..100)),
            Arc::new(Int32Array::from_iter_values(0..100)),
        ],
    )
    .unwrap();
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema.clone());
    let mut dataset = Dataset::write(
        reader,
        test_uri,
        Some(WriteParams {
            enable_stable_row_ids: true,
            max_rows_per_file: 50,
            ..Default::default()
        }),
    )
    .await
    .unwrap();

    let params = ScalarIndexParams::for_builtin(BuiltinIndexType::BTree);
    let fragments = dataset.get_fragments();

    let mut segments = Vec::new();
    for fragment in &fragments {
        segments.push(
            CreateIndexBuilder::new(&mut dataset, &["num"], IndexType::BTree, &params)
                .name("num_idx".to_string())
                .fragments(vec![fragment.id() as u32])
                .execute_uncommitted()
                .await
                .unwrap(),
        );
    }
    dataset
        .commit_existing_index_segments("num_idx", "num", segments)
        .await
        .unwrap();

    let res = UpdateBuilder::new(Arc::new(dataset.clone()))
        .update_where("id < 25")
        .unwrap()
        .set("num", "-1")
        .unwrap()
        .build()
        .unwrap()
        .execute()
        .await
        .unwrap();
    dataset = res.new_dataset.as_ref().clone();

    dataset
        .optimize_indices(&OptimizeOptions::default())
        .await
        .unwrap();

    let dataset = DatasetBuilder::from_uri(test_uri).load().await.unwrap();

    assert_eq!(
        dataset.scan().filter("num = 0").unwrap().count_rows().await.unwrap(),
        0
    );
    assert_eq!(
        dataset.scan().filter("num >= 0").unwrap().count_rows().await.unwrap(),
        75
    );
}

On this PR, the test fails with num = 0 returning 1 and num >= 0 returning 100.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-index Vector index, linalg, tokenizer bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants