SDSTOR-22424: redo destroy pg #435

Merged
JacksonYao287 merged 1 commit into eBay:stable/v4.x from JacksonYao287:fix-issues-of-destroy-pg
Jun 24, 2026

@JacksonYao287 JacksonYao287 commented Jun 16, 2026

Collaborator

Fix crash-recovery gap in PG destroy, handle stale destroyed PGs on restart

Extracted PG resource cleanup into a reusable destroy_pg_resource() helper and fixed two crash-recovery bugs:

Problem 1 — Stale destroyed PGs after crash:
When a repl_dev is marked destroyed (via leave()) but a crash occurs before pg_destroy cleans up the PG superblk, subsequent recovery finds a PG superblk with no corresponding repl_dev.
Previously this triggered a fatal error log and left the PG resources dangling.

Fix: on_pg_meta_blk_found now detects this case and tracks the pg_id in destoryed_stale_pgs_. In on_replica_restart, destroy_pg_resource is called for each stale PG to reclaim its chunks, index table, and superblk.
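
A minimal sketch of the track-then-reclaim flow described above, not the actual HomeObject code. `PgManagerSketch`, `PgResources`, the `repl_dev_found` parameter, and the `stale_pgs_` member are hypothetical stand-ins; only the callback names mirror the description.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

using pg_id_t = std::uint16_t;

// Hypothetical stand-in for a PG's on-disk resources (chunks, index table, superblk).
struct PgResources {
    bool has_repl_dev; // false when the repl_dev was marked destroyed and not loaded
};

class PgManagerSketch {
public:
    // Called once per PG superblk discovered during recovery.
    void on_pg_meta_blk_found(pg_id_t pg_id, bool repl_dev_found) {
        resources_[pg_id] = PgResources{repl_dev_found};
        if (!repl_dev_found) {
            // No repl_dev for this superblk: remember the PG instead of failing.
            stale_pgs_.insert(pg_id);
        }
    }

    // Called after recovery; reclaims every stale PG and returns the ids reclaimed.
    std::vector< pg_id_t > on_replica_restart() {
        std::vector< pg_id_t > reclaimed;
        for (pg_id_t pg_id : stale_pgs_) {
            destroy_pg_resource(pg_id);
            assert(!pg_exists(pg_id)); // the PG must be gone after cleanup
            reclaimed.push_back(pg_id);
        }
        stale_pgs_.clear();
        return reclaimed;
    }

    bool pg_exists(pg_id_t pg_id) const { return resources_.count(pg_id) != 0; }

private:
    // Idempotent cleanup: erasing a PG that is already gone is a no-op.
    void destroy_pg_resource(pg_id_t pg_id) { resources_.erase(pg_id); }

    std::unordered_map< pg_id_t, PgResources > resources_;
    std::set< pg_id_t > stale_pgs_;
};
```

In this sketch, a second call to `on_replica_restart()` reclaims nothing, because the stale set is cleared after the first pass.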

Problem 2 — Crash mid-destroy during baseline resync:
If a crash occurs after pg_destroy marks the PG destroyed but before the superblk is removed, log replay on recovery could in principle write to the already-destroyed index table. In practice this is safe, because m_rd_sb->last_snapshot_lsn is persisted before pg_destroy is triggered, so all logs at or before that LSN are skipped on recovery (see raft_repl_dev::need_skip_processing). Added a guard in on_log_replay_done that returns early when the PG is in DESTROYED state with a live repl_dev, letting the next snapshot message re-trigger pg_destroy.
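
The ordering argument above can be sketched as follows. This is an illustration under assumptions, not the HomeStore implementation: `ReplDevSb`, `persist_snapshot_lsn`, `lsns_to_replay`, and `should_run_post_replay` are hypothetical names, and `need_skip_processing` here only mirrors the idea of the real function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = std::int64_t;

enum class PgState { ALIVE, DESTROYED };

// Hypothetical, simplified view of the persisted repl_dev superblk.
struct ReplDevSb {
    lsn_t last_snapshot_lsn{-1};
};

// Step that must happen before pg_destroy starts during baseline resync.
inline void persist_snapshot_lsn(ReplDevSb& sb, lsn_t snapshot_last_log_idx) {
    assert(snapshot_last_log_idx >= sb.last_snapshot_lsn); // assumed: the LSN never moves backward
    sb.last_snapshot_lsn = snapshot_last_log_idx;
}

// Entries at or before the persisted snapshot LSN are not replayed.
inline bool need_skip_processing(const ReplDevSb& sb, lsn_t lsn) { return lsn <= sb.last_snapshot_lsn; }

// Returns the LSNs that would actually be replayed after a restart.
inline std::vector< lsn_t > lsns_to_replay(const ReplDevSb& sb, const std::vector< lsn_t >& log) {
    std::vector< lsn_t > out;
    for (lsn_t lsn : log) {
        if (!need_skip_processing(sb, lsn)) { out.push_back(lsn); }
    }
    return out;
}

// Guard for on_log_replay_done: returns false when post-replay work should be
// skipped because a DESTROYED PG still has a live repl_dev.
inline bool should_run_post_replay(PgState state, bool repl_dev_alive) {
    return !(state == PgState::DESTROYED && repl_dev_alive);
}
```

With the snapshot LSN persisted at 10, a log holding LSNs 8 through 12 replays only 11 and 12, so nothing at or before the snapshot reaches the destroyed index table.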

Additional fixes:

  • Simplified _exit_pg to rely on permanent_destroy → on_destroy → pg_destroy instead of manually finding and destroying the PG
  • Made repl_dev_ checks conditional in destroy_pg_superblk to handle null repl_dev
  • Renamed can_chunks_in_pg_be_gc → is_pg_alive for clarity
  • Fixed PGMetrics to use data_service().get_blk_size() instead of dereferencing potentially-null repl_dev_
  • Reset latch LSN and no-space-left error state when destroying a PG during baseline resync

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Codecov Report

❌ Patch coverage is 61.29032% with 12 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@e1c23e1). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
...ib/homestore_backend/replication_state_machine.cpp | 25.00% | 5 Missing and 1 partial ⚠️
src/lib/homestore_backend/hs_pg_manager.cpp | 73.68% | 4 Missing and 1 partial ⚠️
src/lib/homestore_backend/gc_manager.cpp | 0.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v4.x     #435   +/-   ##
==============================================
  Coverage               ?   53.72%           
==============================================
  Files                  ?       36           
  Lines                  ?     5392           
  Branches               ?      678           
==============================================
  Hits                   ?     2897           
  Misses                 ?     2195           
  Partials               ?      300           

@JacksonYao287 JacksonYao287 force-pushed the fix-issues-of-destroy-pg branch from 0b5e4a4 to c686ce2 June 16, 2026 11:59
Comment thread src/lib/homestore_backend/hs_pg_manager.cpp Outdated
Comment thread src/lib/homestore_backend/replication_state_machine.cpp
@@ -499,7 +503,7 @@ void ReplicationStateMachine::write_snapshot_obj(std::shared_ptr< homestore::sna
if (home_object_->pg_exists(pg_data->pg_id())) {
LOGI("pg already exists, clean pg resources before snapshot, pg={} {}", pg_data->pg_id(), log_suffix);
// Need to pause state machine before destroying the PG, if fail, let raft retry.

Collaborator

The comment is out of date; also, we don't have a branch that returns false as of now.

Collaborator Author

Let's remove these out-of-date comments after addressing the other comments on this PR.

Comment thread src/lib/homestore_backend/hs_pg_manager.cpp Outdated
Comment thread src/lib/homestore_backend/replication_state_machine.cpp Outdated
@JacksonYao287 JacksonYao287 force-pushed the fix-issues-of-destroy-pg branch from c686ce2 to 41de75e June 18, 2026 07:11
Comment on lines +1064 to +1066
// error. but actually, this is not a problem. since before we starting pg_destroy in baseline resync,
// m_rd_sb->last_snapshot_lsn will be persisted upto the snapshot.get_last_log_idx(). then all the log less than
// or equal to m_rd_sb->last_snapshot_lsn will not be replayed or committed after recovery. so, the concern is

@JacksonYao287 JacksonYao287 Jun 18, 2026

Collaborator Author

@xiaoxichen In the baseline resync (BR) case, before we start destroying the PG, we persist m_rd_sb->last_snapshot_lsn up to snapshot.get_last_log_idx(). raft_repl_dev#need_skip_processing then skips replaying all of those logs in the recovery path, so we will not touch the destroyed resources (pg_index_table, etc.). So for the BR case we don't need to wait in pg_destroy for all appended logs to be committed.

Collaborator

so basically you reverted your changes

@xiaoxichen xiaoxichen Jun 18, 2026

Collaborator

I don't get the point of this comment: you said that for BR there is no need to redo destroy PG, but we do it anyway.

The concern about log replay vs. destroy is not valid here: if we reach this point, log replay has already finished. If we want to record the reasoning for why waiting for log commit is not needed, better to rephrase this paragraph and move it to destroy_pg.

Similarly for L1043-1052: those lines explain the situations in which a PG can be destroyed. Better to move them to destroy_pg rather than keep them here, especially since we use the same action in the recovery path for all sources.

@JacksonYao287 JacksonYao287 requested a review from xiaoxichen June 18, 2026 07:16
xiaoxichen previously approved these changes Jun 18, 2026

@xiaoxichen xiaoxichen left a comment

Collaborator

LGTM aside from inline comments cleanup.

Try to use LLM to polish the language a bit.

@JacksonYao287

Collaborator Author

I need to rethink the implementation here; it seems I missed some repl_dev details.

@JacksonYao287

Collaborator Author

Let me explain redo destroy pg in more detail. Basically, a destroyed PG falls into one of two cases:

1 RaftReplDev::leave() is not called, so m_rd_sb->destroy_pending = 0x0: only BR belongs to this case. Here the repl_dev will be recovered and the log will be replayed. We need to do nothing to redo destroy pg, whether the PG state is alive or destroyed. The reason is that if the PG superblk exists, the first snapshot message (obj_id.shard_seq_num == 0) was not handled successfully and the crash happened before the PG superblk was destroyed. On recovery, the leader will resend the first snapshot message, and the follower will handle it again and call pg_destroy to redo the PG destroy.

2 RaftReplDev::leave() is called, so m_rd_sb->destroy_pending = 0x1: remove_member, destroy_repl_dev and destroy_raft_group all belong to this case. If m_rd_sb->destroy_pending = 0x1, the repl_dev will not be loaded and no log will be replayed, so permanent_destroy will not see this repl_dev, but the PG resources are probably still there if a crash happened. So we need to redo the PG destroy to carefully reclaim them.
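
The two cases can be read as a small decision function over what recovery finds on disk. The sketch below is an illustration only; `RecoveredState`, `RecoveryAction`, and `decide` are hypothetical names, and the real recovery path is spread across several callbacks.

```cpp
#include <cassert>
#include <cstdint>

enum class RecoveryAction {
    LOAD_AND_REPLAY, // case 1: repl_dev is recovered; a resent snapshot message redoes pg_destroy if needed
    REDO_PG_DESTROY, // case 2: repl_dev is not loaded; leftover PG resources are reclaimed locally
    NOTHING          // case 2 with no PG superblk left: nothing to reclaim
};

// Hypothetical summary of the state found on disk after a crash.
struct RecoveredState {
    std::uint8_t destroy_pending; // 0x0 or 0x1, as in m_rd_sb->destroy_pending
    bool pg_superblk_exists;
};

inline RecoveryAction decide(const RecoveredState& s) {
    assert(s.destroy_pending == 0x0 || s.destroy_pending == 0x1);
    if (s.destroy_pending == 0x0) {
        // leave() was never called: only baseline resync can have started a destroy.
        return RecoveryAction::LOAD_AND_REPLAY;
    }
    // leave() was called: the repl_dev is skipped on load, so cleanup must be redone here.
    return s.pg_superblk_exists ? RecoveryAction::REDO_PG_DESTROY : RecoveryAction::NOTHING;
}
```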

@xiaoxichen ptal

@xiaoxichen

Collaborator

1 RaftReplDev::leave() is not called, so m_rd_sb->destroy_pending = 0x0: only BR belongs to this case. Here the repl_dev will be recovered and the log will be replayed. We need to do nothing to redo destroy pg, whether the PG state is alive or destroyed. The reason is that if the PG superblk exists, the first snapshot message (obj_id.shard_seq_num == 0) was not handled successfully and the crash happened before the PG superblk was destroyed. On recovery, the leader will resend the first snapshot message, and the follower will handle it again and call pg_destroy to redo the PG destroy.

Is there any possibility that it is case 2 but it crashed just before setting destroy_pending = 0x1?

@JacksonYao287 JacksonYao287 changed the title from "redo destroy pg" to "SDSTOR-22424: redo destroy pg" Jun 22, 2026
@JacksonYao287

JacksonYao287 commented Jun 22, 2026

Collaborator Author

Is there any possibility that it is case 2 but it crashed just before setting destroy_pending = 0x1?

Yes, of course, but it does not matter, since pg_destroy is always called after RaftReplDev::leave() has been called successfully (it is called in permanent_destroy, in gc_repl_devs()#leave_group), and RaftReplDev::leave() is the only way to change destroy_pending to 0x1.

Let's discuss it case by case:

1 destroy_raft_group: proposes a log of type HS_CTRL_DESTROY, and leave() is called in on_commit of this log. If a crash happens before destroy_pending is changed to 0x1, then on recovery the repl_dev will be recovered (destroy_pending is still 0x0), this log will be committed again, leave() will be called again, and permanent_destroy will then call pg_destroy to reclaim the PG resources.

2 destroy_repl_dev (exit_pg): calls leave() directly. If a crash happens before destroy_pending is changed to 0x1, nothing is lost, since all exit_pg does now is change destroy_pending to 0x1; no PG resource is destroyed there.

3 remove_from_cluster: calls leave() when a RemovedFromCluster is received. If a crash happens before destroy_pending is changed to 0x1, then after recovery it will receive RemovedFromCluster again.

In summary, if it is case 2 but the crash happened just before setting destroy_pending = 0x1, the PG resources will not have been touched and everything will go well.
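
A toy model of the argument above, assuming only what the comment states: leave() is the sole place destroy_pending is set, and pg_destroy runs strictly after it. `Node`, `CrashPoint`, `leave_and_destroy`, and `recover` are hypothetical names for illustration.

```cpp
#include <cassert>

// Hypothetical per-replica state that survives a crash.
struct Node {
    bool destroy_pending{false};
    bool pg_resources_present{true};
};

enum class CrashPoint { NONE, BEFORE_SET_PENDING, AFTER_SET_PENDING };

// Models leave() followed by permanent_destroy -> pg_destroy, stopping at the
// injected crash point. Returns true only if the whole sequence completed.
inline bool leave_and_destroy(Node& n, CrashPoint crash) {
    if (crash == CrashPoint::BEFORE_SET_PENDING) { return false; }
    n.destroy_pending = true; // leave(): the only place the flag is set
    if (crash == CrashPoint::AFTER_SET_PENDING) { return false; }
    assert(n.destroy_pending); // pg_destroy never runs before leave() succeeded
    n.pg_resources_present = false;
    return true;
}

// Models restart: leftover resources are reclaimed only when the flag was persisted.
inline void recover(Node& n) {
    if (n.destroy_pending && n.pg_resources_present) { n.pg_resources_present = false; }
}
```

A crash before the flag is set leaves both fields untouched, so recovery does nothing and the original trigger simply runs again; a crash after the flag is set is the case the redo path reclaims.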

@xiaoxichen xiaoxichen left a comment

Collaborator

lgtm

@JacksonYao287 JacksonYao287 merged commit 5a3b409 into eBay:stable/v4.x Jun 24, 2026
50 of 51 checks passed
@JacksonYao287 JacksonYao287 deleted the fix-issues-of-destroy-pg branch June 24, 2026 01:21
3 participants