fix(index): drop stale scalar index entries after stable-row-id update #7359
Open
wkalt wants to merge 3 commits into
Under stable row ids an update deletes a row's old copy and rewrites it to
a new fragment under the same row id. optimize_indices kept the old
value->row_id entry, so queries for the old value returned the updated row
and BTree optimize errored ("from_sorted_iter called with non-sorted
input").
- build_stable_row_id_filter now subtracts each fragment's deletion vector
so the old-row allow-list holds only live rows (fixes BTree).
- BitmapIndex::update applies that filter to old postings via
OldIndexDataFilter::retain_old_rows.
- optimize routes FTS through InvertedIndex::merge_segments (which filters
old partitions) instead of the reference-only update path.
Adds a regression test covering all three index types.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
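The first bullet can be pictured with a small model. This is only a sketch, not the PR's code: `Fragment`, `live_row_ids` and `build_allow_list` here are hypothetical stand-ins, and a plain `BTreeSet<u64>` replaces whatever row-id set type Lance actually uses. It assumes each fragment records the stable row id of every physical row plus a deletion vector of local offsets.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a fragment: the stable row id of each physical
/// row (indexed by offset) and the offsets that are marked deleted.
struct Fragment {
    row_ids: Vec<u64>,
    deleted_offsets: BTreeSet<usize>,
}

/// Stable row ids of the rows that are still live in this fragment,
/// i.e. all row ids minus those whose offset is in the deletion vector.
fn live_row_ids(frag: &Fragment) -> BTreeSet<u64> {
    frag.row_ids
        .iter()
        .enumerate()
        .filter(|(offset, _)| !frag.deleted_offsets.contains(offset))
        .map(|(_, id)| *id)
        .collect()
}

/// Allow-list for old index entries: the union of live row ids over the
/// fragments that the old index covered.
fn build_allow_list(old_fragments: &[Fragment]) -> BTreeSet<u64> {
    old_fragments.iter().flat_map(live_row_ids).collect()
}
```

With a fragment holding row ids 0..4 whose first two offsets were deleted by an update, the allow-list contains only ids 2 and 3, so old postings for the rewritten rows are dropped when the index is merged.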
Force-pushed from 8ff2484 to 3d26238.
Here is a simple example failure:
live_row_ids swallowed deletion-vector read failures via `.ok().flatten()`, falling back to the "no deletions" branch and putting the deleted rows back into the stable-row-id allow-list as stale entries. Propagate with `?` instead, and add a regression test covering an unreadable deletion file.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
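The difference between the two error-handling shapes can be shown in isolation. The sketch below is illustrative only: `ReadResult`, `live_rows_swallowing` and `live_rows_propagating` are hypothetical names, the error type is a plain `String`, and the deletion vector is modelled as a set of row ids rather than Lance's real type.

```rust
use std::collections::BTreeSet;

/// Hypothetical result of reading a deletion file: Ok(None) means the
/// fragment has no deletion file, Err means the file could not be read.
type ReadResult = Result<Option<BTreeSet<u64>>, String>;

/// Buggy shape: `.ok().flatten()` collapses a read error into None,
/// which is indistinguishable from "no deletions".
fn live_rows_swallowing(all: &BTreeSet<u64>, read: ReadResult) -> BTreeSet<u64> {
    match read.ok().flatten() {
        Some(deleted) => all.difference(&deleted).copied().collect(),
        None => all.clone(),
    }
}

/// Fixed shape: `?` returns the read error to the caller instead of
/// treating every row as live.
fn live_rows_propagating(
    all: &BTreeSet<u64>,
    read: ReadResult,
) -> Result<BTreeSet<u64>, String> {
    Ok(match read? {
        Some(deleted) => all.difference(&deleted).copied().collect(),
        None => all.clone(),
    })
}
```

Given an unreadable deletion file, the first function reports every row as live, which is how deleted rows re-enter the allow-list; the second surfaces the failure so the optimize step can abort.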
IndexType::Inverted is handled by a dedicated arm in merge_indices_with_unindexed_frags (lance-format#6737) and never reaches merge_scalar_indices, so its arm here and the open_and_merge_inverted_segments helper were unreachable. Remove them; no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jackye1995
reviewed
Jun 23, 2026
)
.await?
}
// NOTE: IndexType::Inverted never reaches here -- it is handled by the
Can you check this? It looks like this could be a regression; I asked Codex to generate a test case.
The concern is that old_data_filter is only built from the selected tail segment(s). With multiple scalar index segments, default optimize_indices can merge the newest segment plus the new update fragment while leaving an older segment unchanged. If the update deleted rows covered by that older segment, stale old-value postings can remain visible under stable row IDs.
#[tokio::test]
async fn test_optimize_scalar_index_drops_stale_rows_across_segments_after_update() {
    use crate::Dataset;
    use crate::dataset::builder::DatasetBuilder;
    use crate::dataset::{UpdateBuilder, WriteParams};
    use crate::index::{CreateIndexBuilder, DatasetIndexExt};
    use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator};
    use arrow_schema::{DataType, Field, Schema};
    use lance_core::utils::tempfile::TempStrDir;
    use lance_index::IndexType;
    use lance_index::optimize::OptimizeOptions;
    use lance_index::scalar::{BuiltinIndexType, ScalarIndexParams};
    use std::sync::Arc;

    let test_dir = TempStrDir::default();
    let test_uri = test_dir.as_str();
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("num", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from_iter_values(0..100)),
            Arc::new(Int32Array::from_iter_values(0..100)),
        ],
    )
    .unwrap();
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema.clone());
    let mut dataset = Dataset::write(
        reader,
        test_uri,
        Some(WriteParams {
            enable_stable_row_ids: true,
            max_rows_per_file: 50,
            ..Default::default()
        }),
    )
    .await
    .unwrap();

    let params = ScalarIndexParams::for_builtin(BuiltinIndexType::BTree);
    let fragments = dataset.get_fragments();
    let mut segments = Vec::new();
    for fragment in &fragments {
        segments.push(
            CreateIndexBuilder::new(&mut dataset, &["num"], IndexType::BTree, &params)
                .name("num_idx".to_string())
                .fragments(vec![fragment.id() as u32])
                .execute_uncommitted()
                .await
                .unwrap(),
        );
    }
    dataset
        .commit_existing_index_segments("num_idx", "num", segments)
        .await
        .unwrap();

    let res = UpdateBuilder::new(Arc::new(dataset.clone()))
        .update_where("id < 25")
        .unwrap()
        .set("num", "-1")
        .unwrap()
        .build()
        .unwrap()
        .execute()
        .await
        .unwrap();
    dataset = res.new_dataset.as_ref().clone();
    dataset
        .optimize_indices(&OptimizeOptions::default())
        .await
        .unwrap();

    let dataset = DatasetBuilder::from_uri(test_uri).load().await.unwrap();
    assert_eq!(
        dataset.scan().filter("num = 0").unwrap().count_rows().await.unwrap(),
        0
    );
    assert_eq!(
        dataset.scan().filter("num >= 0").unwrap().count_rows().await.unwrap(),
        75
    );
}

On this PR, the test fails with `num = 0` returning 1 and `num >= 0` returning 100.
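The multi-segment concern raised in this review can also be reproduced in a toy model, independent of Lance. Everything in the sketch is an assumption made for illustration: `Segment`, `merge_with_filter` and `lookup` are hypothetical names, a segment is just a map from value to stable row ids, and a lookup is assumed to union the matches from every segment.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical index segment: indexed value -> stable row ids.
type Segment = BTreeMap<i32, BTreeSet<u64>>;

/// Merge newly written rows into one segment, keeping only the old
/// postings whose row id is in `allow` (the live-row allow-list).
fn merge_with_filter(
    segment: &Segment,
    allow: &BTreeSet<u64>,
    new_rows: &[(i32, u64)],
) -> Segment {
    let mut merged = Segment::new();
    for (value, ids) in segment {
        let kept: BTreeSet<u64> = ids.intersection(allow).copied().collect();
        if !kept.is_empty() {
            merged.insert(*value, kept);
        }
    }
    for (value, id) in new_rows {
        merged.entry(*value).or_default().insert(*id);
    }
    merged
}

/// A lookup consults every segment and unions the matching row ids.
fn lookup(segments: &[Segment], value: i32) -> BTreeSet<u64> {
    segments
        .iter()
        .filter_map(|s| s.get(&value))
        .flatten()
        .copied()
        .collect()
}
```

If row 0 lives in an older segment and is updated to `-1`, merging only the newest segment with the rewritten row leaves the older segment's posting for the old value in place. In this model both `lookup(&segments, 0)` and `lookup(&segments, -1)` then return row 0, which mirrors the `num = 0` result reported above.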