SDSTOR-22424: redo destroy pg by JacksonYao287 · Pull Request #435 · eBay/HomeObject

JacksonYao287 · 2026-06-16T09:26:42Z

Fix crash-recovery gap in PG destroy, handle stale destroyed PGs on restart

Extracted PG resource cleanup into a reusable destroy_pg_resource() helper and fixed two crash-recovery bugs:

Problem 1 — Stale destroyed PGs after crash:
When a repl_dev is marked destroyed (via leave()) but a crash occurs before pg_destroy cleans up the PG superblk, subsequent recovery finds a PG superblk with no corresponding repl_dev.
Previously this triggered a fatal error log and left the PG resources dangling.

Fix: on_pg_meta_blk_found now detects this case and tracks the pg_id in destoryed_stale_pgs_. On on_replica_restart, destroy_pg_resource is called for each stale PG to reclaim its chunks, index table, and superblk.
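
The bookkeeping described in this fix can be pictured with a minimal, self-contained sketch. It is not the PR's code: `PgRecoverySketch`, `pg_id_t`, the `has_repl_dev` flag and the in-memory sets are hypothetical stand-ins, and the real `destroy_pg_resource()` reclaims chunks, the index table and the superblk rather than erasing an id from a set.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using pg_id_t = std::uint16_t; // assumption: the real id type may differ

// Hypothetical model of recovery bookkeeping; names are illustrative only.
class PgRecoverySketch {
public:
    // Called once per PG superblk found during recovery.
    // has_repl_dev tells whether a repl_dev was recovered for this PG.
    void on_pg_meta_blk_found(pg_id_t pg_id, bool has_repl_dev) {
        resources_.insert(pg_id); // the superblk exists, so resources exist
        if (has_repl_dev) {
            live_pgs_.insert(pg_id);
        } else {
            // Superblk without repl_dev: remember it instead of failing.
            stale_pgs_.insert(pg_id);
        }
    }

    // Called after restart; reclaims every stale PG and returns the
    // reclaimed ids in ascending order.
    std::vector< pg_id_t > on_replica_restart() {
        std::vector< pg_id_t > reclaimed;
        for (pg_id_t pg_id : stale_pgs_) {
            destroy_pg_resource(pg_id);
            reclaimed.push_back(pg_id);
        }
        stale_pgs_.clear();
        return reclaimed;
    }

    bool is_live(pg_id_t pg_id) const { return live_pgs_.count(pg_id) != 0; }
    bool has_resources(pg_id_t pg_id) const { return resources_.count(pg_id) != 0; }

private:
    // Stand-in for reclaiming chunks, index table and superblk.
    void destroy_pg_resource(pg_id_t pg_id) { resources_.erase(pg_id); }

    std::set< pg_id_t > live_pgs_;
    std::set< pg_id_t > stale_pgs_;
    std::set< pg_id_t > resources_;
};
```

Deferring the cleanup to the restart hook, rather than doing it inside the superblk callback, keeps the callback cheap and lets all stale PGs be reclaimed in one pass; calling the restart hook a second time is a no-op in this sketch.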

Problem 2 — Crash mid-destroy during baseline resync:
If a crash occurs after pg_destroy marks the PG destroyed but before the superblk is removed, log replay on recovery could in principle attempt to write to the destroyed index table. In practice this is safe because m_rd_sb->last_snapshot_lsn is persisted before pg_destroy is triggered, so all logs at or before that LSN are skipped on recovery (see raft_repl_dev::need_skip_processing). Added a guard in on_log_replay_done to return early when the PG is in DESTROYED state with a live repl_dev, letting the next snapshot message re-trigger pg_destroy.
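
The two conditions in this paragraph can be written down as a small sketch. Both functions below are simplified, hypothetical models written for illustration: `need_skip_processing_sketch` only mirrors the "at or before last_snapshot_lsn" rule stated above and is not the real `raft_repl_dev::need_skip_processing`, and `should_finish_log_replay` only models the early-return guard.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical PG states for this sketch.
enum class PgState { ALIVE, DESTROYED };

// Models the skip rule: a log entry at or before the persisted
// last_snapshot_lsn is not replayed on recovery.
inline bool need_skip_processing_sketch(std::int64_t lsn, std::int64_t last_snapshot_lsn) {
    return lsn <= last_snapshot_lsn;
}

// Models the guard added to the log-replay-done hook: returns false
// (meaning "return early") when the PG is DESTROYED but its repl_dev is
// still live, so that the next snapshot message can re-trigger the destroy.
inline bool should_finish_log_replay(PgState state, bool has_live_repl_dev) {
    return !(state == PgState::DESTROYED && has_live_repl_dev);
}
```

Under these assumptions, a log with LSN 100 is skipped when the persisted snapshot LSN is 100 or higher, and post-replay work proceeds only for PGs that are still alive.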

Additional fixes:

Simplified _exit_pg to rely on permanent_destroy → on_destroy → pg_destroy instead of manually finding and destroying the PG
Made repl_dev_ checks conditional in destroy_pg_superblk to handle null repl_dev
Renamed can_chunks_in_pg_be_gc → is_pg_alive for clarity
Fixed PGMetrics to use data_service().get_blk_size() instead of dereferencing potentially-null repl_dev_
Reset latch LSN and no-space-left error state when destroying a PG during baseline resync

codecov-commenter · 2026-06-16T10:10:56Z


Codecov Report

❌ Patch coverage is 61.29032% with 12 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@e1c23e1). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ib/homestore_backend/replication_state_machine.cpp | 25.00% | 5 Missing and 1 partial ⚠️ |
| src/lib/homestore_backend/hs_pg_manager.cpp | 73.68% | 4 Missing and 1 partial ⚠️ |
| src/lib/homestore_backend/gc_manager.cpp | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@              Coverage Diff               @@
##             stable/v4.x     #435   +/-   ##
==============================================
  Coverage               ?   53.72%           
==============================================
  Files                  ?       36           
  Lines                  ?     5392           
  Branches               ?      678           
==============================================
  Hits                   ?     2897           
  Misses                 ?     2195           
  Partials               ?      300


xiaoxichen · 2026-06-16T12:40:19Z

@@ -499,7 +503,7 @@ void ReplicationStateMachine::write_snapshot_obj(std::shared_ptr< homestore::sna
        if (home_object_->pg_exists(pg_data->pg_id())) {
            LOGI("pg already exists, clean pg resources before snapshot, pg={} {}", pg_data->pg_id(), log_suffix);
            // Need to pause state machine before destroying the PG, if fail, let raft retry.


The comment is out of date, and we currently have no branch that returns false.

Let's remove this out-of-date comment after addressing the other comments on this PR.

JacksonYao287 · 2026-06-18T07:14:20Z

+        // error. but actually, this is not a problem. since before we starting pg_destroy in baseline resync,
+        // m_rd_sb->last_snapshot_lsn will be persisted upto the snapshot.get_last_log_idx(). then all the log less than
+        // or equal to m_rd_sb->last_snapshot_lsn will not be replayed or committed after recovery. so, the concern is


@xiaoxichen in the baseline resync case, before we start destroying the pg, we persist m_rd_sb->last_snapshot_lsn up to snapshot.get_last_log_idx(). Then raft_repl_dev#need_skip_processing skips replaying all those logs in the recovery path (so we will not hit the destroyed resources, such as pg_index_table). So we don't need to wait for all the appended logs to be committed in pg_destroy for the BR case.

So basically you reverted your changes.

I don't get the point of this comment: you said that for BR there is no need to redo destroy PG, but we do it anyway.

The concern about log replay vs. destroy is not valid here: if we reach this point, log replay has already been done. If we want to record the reasoning for why waiting for log commit is not needed, better to rephrase this paragraph and move it to destroy_pg.

Similarly for L1043-1052: those lines explain the situations in which a PG can be destroyed. Better to move them to destroy_pg rather than keep them here, especially since we use the same action in the recovery path for all sources.

xiaoxichen

LGTM aside from inline comments cleanup.

Try to use LLM to polish the language a bit.

JacksonYao287 · 2026-06-22T04:03:18Z

I need to rethink the implementation here again; it seems I missed some repl_dev details.

JacksonYao287 · 2026-06-22T09:18:47Z

Let me explain redo destroy pg in more detail. Basically, a destroyed pg falls into one of two cases:

1 RaftReplDev::leave() is not called and thus m_rd_sb->destroy_pending = 0x0: only BR belongs to this case. In this case the repl_dev will be recovered and the log will be replayed. Here we need to do nothing to redo destroy pg, no matter whether the pg state is alive or destroyed. The reason is that if the pg super blk exists, the first snapshot message (obj_id.shard_seq_num == 0) was not successfully handled and the crash happened before the pg super blk was destroyed. Then, on recovery, the leader will resend the first snapshot message, and the follower will handle it again and call pg_destroy to redo the pg destroy.

2 RaftReplDev::leave() is called and thus m_rd_sb->destroy_pending = 0x1: remove_member, destroy_repl_dev and destroy_raft_group all belong to this case. If m_rd_sb->destroy_pending = 0x1, this repl_dev will not be loaded and there is no log replay, so permanent_destroy will not see this repl_dev, but the pg resources are probably still there if a crash happened. So we need to redo pg destroy to carefully reclaim them.

@xiaoxichen ptal
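
As a reading aid, the two cases above can be condensed into a small decision function. This is only a sketch of the reasoning in this comment: `RecoveredState`, `RedoDestroyAction` and `classify_redo_destroy` are hypothetical names, and the two booleans stand in for "a pg super blk was found on recovery" and the value of m_rd_sb->destroy_pending.

```cpp
#include <cassert>

// Hypothetical summary of what recovery observes for one PG.
struct RecoveredState {
    bool pg_superblk_exists; // a pg super blk was found on disk
    bool destroy_pending;    // stand-in for m_rd_sb->destroy_pending != 0
};

// Hypothetical outcomes for the "redo destroy pg" decision.
enum class RedoDestroyAction {
    NothingLeft,     // no pg super blk: nothing to reclaim
    NoLocalRedo,     // case 1: repl_dev is recovered; a resent first
                     // snapshot message calls pg_destroy again if needed
    ReclaimOnRestart // case 2: repl_dev is not loaded; reclaim explicitly
};

inline RedoDestroyAction classify_redo_destroy(const RecoveredState& s) {
    if (!s.pg_superblk_exists) { return RedoDestroyAction::NothingLeft; }
    if (s.destroy_pending) { return RedoDestroyAction::ReclaimOnRestart; }
    return RedoDestroyAction::NoLocalRedo;
}
```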

xiaoxichen · 2026-06-22T09:31:42Z

> 1 RaftReplDev::leave() is not called and thus m_rd_sb->destroy_pending = 0x0: only BR belongs to this case. In this case the repl_dev will be recovered and the log will be replayed. Here we need to do nothing to redo destroy pg, no matter whether the pg state is alive or destroyed. The reason is that if the pg super blk exists, the first snapshot message (obj_id.shard_seq_num == 0) was not successfully handled and the crash happened before the pg super blk was destroyed. Then, on recovery, the leader will resend the first snapshot message, and the follower will handle it again and call pg_destroy to redo the pg destroy.

Is there any possibility that it is case 2 but crashed just before setting destroy_pending = 0x1?

JacksonYao287 · 2026-06-22T10:14:53Z

> any possibility it is case 2 but crashed just before setting the destroy_pending = 0x1?

Yes, of course, but it does not matter, since pg_destroy is always called after RaftReplDev::leave() has been successfully called (it is called in permanent_destroy in gc_repl_devs()#leave_group), and RaftReplDev::leave() is the only way to change destroy_pending to 0x1.

Let's discuss case by case:

1 destroy_raft_group: proposes a log of type HS_CTRL_DESTROY, and leave() is called in on_commit of this log. If a crash happens before destroy_pending is changed to 0x1, then on recovery the repl_dev will be recovered (destroy_pending is still 0x0), this log will be committed again, leave() will be called again, and then permanent_destroy will call pg_destroy to reclaim the pg resources.

2 destroy_repl_dev (exit_pg): calls leave() directly. If a crash happens before destroy_pending is changed to 0x1, nothing happens, since all exit_pg does now is change destroy_pending to 0x1; no pg resource is destroyed there.

3 remove_from_cluster: calls leave() when receiving a RemovedFromCluster. If a crash happens before destroy_pending is changed to 0x1, then after recovery it will receive RemovedFromCluster again.

In summary, if it is case 2 but crashed just before setting destroy_pending = 0x1, the pg resources will not be touched and everything will go well.
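
The ordering argument in this reply (resources are only reclaimed after leave() has set destroy_pending) can be checked with a toy model. Everything here is hypothetical: `ReplicaSketch`, the `crash_before_persist` flag and the boolean fields are illustrative and do not correspond to real HomeStore or HomeObject types.

```cpp
#include <cassert>

// Toy model of one replica; fields stand in for persisted state.
struct ReplicaSketch {
    bool destroy_pending = false;     // stand-in for m_rd_sb->destroy_pending
    bool pg_resources_present = true; // chunks, index table, superblk

    // Models leave(): returns false if a crash is injected before the
    // flag is persisted, in which case no state changes.
    bool leave(bool crash_before_persist) {
        if (crash_before_persist) { return false; }
        destroy_pending = true;
        return true;
    }

    // Models the destroy path: resources are reclaimed only once
    // destroy_pending has been set by a successful leave().
    void permanent_destroy() {
        if (destroy_pending) { pg_resources_present = false; }
    }
};
```

In this model a crash before the flag is persisted leaves both fields untouched, so a retried leave() (triggered by log re-commit or a repeated RemovedFromCluster, per the cases above) starts from a clean state.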

xiaoxichen

lgtm

JacksonYao287 requested review from Besroy, xiaoxichen and yuwmao June 16, 2026 10:14

JacksonYao287 force-pushed the fix-issues-of-destroy-pg branch from 0b5e4a4 to c686ce2 Compare June 16, 2026 11:59

xiaoxichen reviewed Jun 16, 2026


JacksonYao287 force-pushed the fix-issues-of-destroy-pg branch from c686ce2 to 41de75e Compare June 18, 2026 07:11

JacksonYao287 commented Jun 18, 2026


JacksonYao287 requested a review from xiaoxichen June 18, 2026 07:16

xiaoxichen previously approved these changes Jun 18, 2026


redo destroy pg

a3118a7

JacksonYao287 dismissed xiaoxichen’s stale review via a3118a7 June 22, 2026 08:59

JacksonYao287 force-pushed the fix-issues-of-destroy-pg branch from 41de75e to a3118a7 Compare June 22, 2026 08:59

JacksonYao287 changed the title ~~redo destroy pg~~ SDSTOR-22424: redo destroy pg Jun 22, 2026

xiaoxichen approved these changes Jun 23, 2026


JacksonYao287 merged commit 5a3b409 into eBay:stable/v4.x Jun 24, 2026
50 of 51 checks passed

JacksonYao287 deleted the fix-issues-of-destroy-pg branch June 24, 2026 01:21
