Skip to content

fix: recovery from crash during optimization#246

Open
zhourrr wants to merge 7 commits intomainfrom
feat/more_ut
Open

fix: recovery from crash during optimization#246
zhourrr wants to merge 7 commits intomainfrom
feat/more_ut

Conversation

@zhourrr
Copy link
Collaborator

@zhourrr zhourrr commented Mar 20, 2026

Greptile Summary

This PR implements crash recovery for collection optimization by cleaning up stale index files left from a previous interrupted run, alongside a suite of typo/rename fixes and a new end-to-end crash-during-optimize test.

Key changes:

  • Core fix (segment.cc): Before building a vector index file, the code now checks whether the file already exists (as crash residue from a prior incomplete optimization) and removes it before re-creating it. This prevents merge_vector_indexer from operating on a potentially partial/corrupt file after recovery.
  • Incomplete coverage: The FileExists + RemoveFile guard is applied to the QuantizeType::UNDEFINED branch and the flat-vector sub-branch of the quantized path, but is missing from the non-RABITQ quantized index path (MakeQuantizeVectorIndexPath, line ~1715) and the RABITQ quantized index path (line ~1787). A crash during quantized index construction would leave residue that is silently handed to merge_vector_indexer on the next recovery run.
  • API/typo cleanup: CreateComapctTaskCreateCompactTask, new_segmnet_metanew_segment_meta, indiceindex, createcreate_if_missing, and a log message typo fix — all consistent and correct.
  • New test infra: A collection_optimizer binary and optimize_recovery_test.cc provide an end-to-end crash-during-optimize scenario. The test itself has several robustness issues flagged in earlier review threads (flaky WIFSIGNALED assertion, copy-paste insert loop instead of verify loop at line 185).
  • Widened float tolerance in Doc::ValueEqual (100× for float, 1000× for double) affects all production callers of Doc::operator==, not only tests; no explanatory comment is present.

Confidence Score: 2/5

  • The fix is partially correct but leaves quantized vector index paths unguarded against crash residue, which could silently corrupt an index during a post-crash recovery.
  • The core crash-recovery logic only covers 2 of the 4 vector-index file creation paths in create_vector_index. The non-RABITQ quantized path and the RABITQ path both allocate a new block ID and write to MakeQuantizeVectorIndexPath without the FileExists guard. Combined with unresolved test issues from the prior review round, the PR is not yet safe to merge as-is.
  • src/db/index/segment/segment.cc — quantized index creation paths (lines ~1715 and ~1787) are missing the crash-residue guard added elsewhere in the same function.

Important Files Changed

Filename Overview
src/db/index/segment/segment.cc Core crash-recovery fix: adds FileExists + RemoveFile guards before creating vector index files to clear crash residue, but only covers 2 of 4 index-file creation paths — the quantized (non-RABITQ) and RABITQ sub-paths in create_vector_index are missing the same guard.
tests/db/crash_recovery/optimize_recovery_test.cc New test for crash-during-optimize recovery; contains the WIFSIGNALED flakiness and a final verification loop inserting instead of verifying (both noted in prior threads), and uses ASSERT_GE inside RunOptimizer which will not propagate failures correctly from non-test contexts.
tests/db/crash_recovery/collection_optimizer.cc New helper binary for driving collection optimization in crash-recovery tests; missing newline after final stats output (noted in prior thread), and opens with a different CollectionOptions than the test (noted as intentional by the developer).
src/db/index/common/doc.cc Float equality tolerances widened 100× (float) and 1000× (double) with no explanatory comment; affects all production Doc::operator== callers, not just tests (noted in prior thread).
src/db/index/segment/segment_helper.h Fixed typo CreateComapctTask → CreateCompactTask; clean rename, no functional change.
src/db/collection.cc Updated three call-sites from the now-renamed CreateComapctTask to CreateCompactTask; straightforward typo fix.
tests/db/crash_recovery/write_recovery_test.cc Renamed test collection paths and added max_doc_count_per_segment; improved ASSERT messages; CollectionOptions mismatch with data_generator still present (noted in prior thread).
tests/db/crash_recovery/CMakeLists.txt Extracted common library list into CRASH_RECOVERY_COMMON_LIBS and added collection_optimizer binary target; clean refactor.
tests/db/crash_recovery/utility.h Changed default vector metric from COSINE to L2 for the test schema; likely motivated by the crash-recovery test needing deterministic nearest-neighbor results with zero vectors.
tests/db/index/segment/segment_helper_test.cc Updated six call-sites from CreateComapctTask to CreateCompactTask to match the renamed API; no functional change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[create_vector_index called on recovery] --> B{QuantizeType == UNDEFINED?}

    B -- Yes --> C[allocate_block_id\nMakeVectorIndexPath]
    C --> D{FileExists?\n✅ GUARDED}
    D -- Yes --> E[LOG_WARN\nRemoveFile]
    D -- No --> F[merge_vector_indexer]
    E --> F

    B -- No --> G{metric_type changed\nOR indexers count != 1?}
    G -- Yes --> H[allocate_block_id\nMakeVectorIndexPath\nfor flat/raw]
    H --> I{FileExists?\n✅ GUARDED}
    I -- Yes --> J[LOG_WARN\nRemoveFile]
    I -- No --> K[merge_vector_indexer\nfor flat/raw]
    J --> K

    G -- No --> L[reuse existing\nraw_vector_provider]

    K --> M{QuantizeType == RABITQ?}
    L --> M

    M -- No --> N[allocate_block_id\nMakeQuantizeVectorIndexPath]
    N --> O{FileExists?\n❌ NOT GUARDED}
    O --> P[merge_vector_indexer\nfor quantized index]

    M -- Yes --> Q[Train RabitQ converter]
    Q --> R[allocate_block_id\nMakeQuantizeVectorIndexPath]
    R --> S{FileExists?\n❌ NOT GUARDED}
    S --> T[merge_vector_indexer\nfor RABITQ index]

    style D fill:#90EE90
    style I fill:#90EE90
    style O fill:#FFB6C1
    style S fill:#FFB6C1
Loading

Comments Outside Diff (6)

  1. tests/db/crash_recovery/write_recovery_test.cc, line 129 (link)

    P1 SetUp/TearDown reference stale path after rename

    dir_path_ was renamed from "crash_test_db" to "write_recovery_test_db" in this PR, but both SetUp (line 129) and TearDown (line 134) still attempt to delete the old "./crash_test_db" path. As a result:

    • The actual test directory write_recovery_test_db is never cleaned up between test runs, causing test pollution and potential false-positives on repeated runs.
    • The delete of crash_test_db is a no-op since that directory no longer exists.
  2. tests/db/crash_recovery/write_recovery_test.cc, line 134 (link)

    P1 Stale path in TearDown after rename

    Same rename miss as in SetUp. The TearDown must also reference the updated directory name.

  3. src/db/index/segment/segment.cc, line 1715-1718 (link)

    P1 Missing crash residue cleanup for quantized index paths

    The PR adds FileExists + RemoveFile guards before building vector index files at lines 1634 and 1678 (for MakeVectorIndexPath), but the corresponding quantized index files built via MakeQuantizeVectorIndexPath — at this location (non-RabitQ, line ~1715) and the RabitQ path (~line 1787) — receive no such guard.

    The same crash scenario applies: if a crash occurs after the flat index block ID is committed to segment_meta_ but during the quantized index build, block_id_allocator_ will be reset to max_block_id + 1 from the persisted meta on recovery, producing the same quant_block_id and therefore the same file path. The orphaned (and possibly corrupted) quantized index file will then be silently re-opened by merge_vector_indexer rather than being detected and cleaned up.

          std::string index_file_path = FileHelper::MakeQuantizeVectorIndexPath(
              path_, column, segment_meta_->id(), quant_block_id);
    +     if (FileHelper::FileExists(index_file_path)) {
    +       LOG_WARN(
    +           "Quantize index file[%s] already exists (possible crash residue); "
    +           "cleaning and overwriting.",
    +           index_file_path.c_str());
    +       FileHelper::RemoveFile(index_file_path);
    +     }
          auto vector_indexer = merge_vector_indexer(
              index_file_path, column, *field_with_new_index_params, concurrency);

    The same fix should be applied to the RabitQ quantized index path at line ~1787.

  4. src/db/index/common/doc.cc, line 1153-1166 (link)

    P2 Loose float tolerance may hide data corruption

    The tolerance for float vectors is widened 100× (from 1e-6f to 1e-4f) and for double vectors 1000× (from 1e-9 to 1e-6). Doc::operator== — and by extension the crash-recovery integration tests — now accepts vectors whose components differ by up to ~0.0001 as "equal".

    While some precision loss during compaction/reconstruction is expected (e.g., quantisation round-trips, format conversions), a 100× relaxation is very coarse for float32. Any data corruption that introduces per-element errors below 1e-4f will be silently accepted by the equality check, defeating the purpose of the data-integrity assertions in the new test.

    Consider tightening the tolerance back towards the original values, or — if a specific lossy path is known to cause up to this level of error — adding a comment explaining the maximum expected error and why it is acceptable, so future readers do not loosen it further.

  5. src/db/index/segment/segment.cc, line 1715-1718 (link)

    P1 Missing crash residue check for quantize index file

    The PR adds FileHelper::FileExists / FileHelper::RemoveFile guards before creating the flat vector index at lines 1634 and 1678, but the analogous guard is absent for MakeQuantizeVectorIndexPath in the non-RABITQ quantize path here.

    If the optimizer crashes after quant_block_id has been allocated and the flat index file has been fully written, but during writing of the quantize index file, the quant file will be a partial crash residue. On the next recovery run, allocate_block_id() will produce the same quant_block_id (block counter recovered from persistent state), constructing the same path — but merge_vector_indexer is called against the pre-existing partial file with no cleanup, potentially causing corruption or an incorrect index.

    The flat index path at line 1678 is protected; the quantize path should be too:

          std::string index_file_path = FileHelper::MakeQuantizeVectorIndexPath(
              path_, column, segment_meta_->id(), quant_block_id);
    +     if (FileHelper::FileExists(index_file_path)) {
    +       LOG_WARN(
    +           "Quantize index file[%s] already exists (possible crash residue); cleaning "
    +           "and overwriting.",
    +           index_file_path.c_str());
    +       FileHelper::RemoveFile(index_file_path);
    +     }
          auto vector_indexer = merge_vector_indexer(
              index_file_path, column, *field_with_new_index_params, concurrency);
  6. src/db/index/segment/segment.cc, line 1715-1718 (link)

    P1 Crash-residue check missing for quantized index paths

    The PR adds a FileHelper::FileExists + FileHelper::RemoveFile guard before building index files in the QuantizeType::UNDEFINED branch (lines ~1634–1639) and the flat-vector sub-branch of the quantized case (lines ~1678–1683). However, the same guard is not applied here when building the quantized index file via MakeQuantizeVectorIndexPath.

    If optimization crashes after the flat/raw vector index is successfully written but before this quantized index file is committed, the next recovery run will reallocate the same quant_block_id, construct the same file path, and call merge_vector_indexer on an already-existing (potentially partial) file. The resulting behavior—silently overwriting, failing, or producing a corrupt index—is undefined.

    The RABITQ path at line ~1787 has the same gap:

    std::string index_file_path = FileHelper::MakeQuantizeVectorIndexPath(
        path_, column, segment_meta_->id(), quant_block_id);
    if (FileHelper::FileExists(index_file_path)) {
      LOG_WARN(
          "Index file[%s] already exists (possible crash residue); cleaning "
          "and overwriting.",
          index_file_path.c_str());
      FileHelper::RemoveFile(index_file_path);
    }
    auto vector_indexer = merge_vector_indexer(
        index_file_path, column, *field_with_new_index_params, concurrency);

    Apply the same pattern at both the non-RABITQ quantized path (lines 1715–1718) and the RABITQ path (lines 1787–1790).

Reviews (4): Last reviewed commit: "Apply suggestion from @greptile-apps[bot..." | Re-trigger Greptile

@zhourrr zhourrr changed the title feat: add crash recovery test for optimize fix: recovery from crash during optimization Mar 20, 2026
@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 23, 2026

@greptile

@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 23, 2026

@greptile

Copy link
Collaborator

@feihongxu0824 feihongxu0824 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

auto collection = result.value();
uint64_t doc_count{collection->Stats().value().doc_count};
ASSERT_EQ(doc_count, (num_batches + 500) * batch_size);
for (uint64_t doc_id = 0; doc_id < doc_count; doc_id++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

上面也有类似的逻辑,建议抽取为辅助函数或者加点注释

@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 23, 2026

@greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants