refactor(workflow): S10 self-healing recovery events#1583

Merged
gsxdsm merged 3 commits into
feature/workflow-owned-merge-s08-workflow-owned-merge-processing from
feature/workflow-owned-merge-s10-self-healing-recovery-events
Jun 17, 2026

Conversation

@gsxdsm

@gsxdsm gsxdsm commented Jun 9, 2026

Collaborator

Stack Slice

  • Slice: S10
  • Milestone: Gate C
  • Base branch: feature/workflow-owned-merge-s09-workflow-owned-retry-state
  • Full plan: docs/plans/2026-06-09-003-refactor-workflow-owned-merge-full-migration-slices-plan.md

Goal

Convert self-healing merge/retry lifecycle mutations into typed workflow recovery events and node wakes.

Dependency

S5 runtime driver, S8 merge processing, and S9 retry state.

Expected Scope

packages/engine/src/self-healing.ts; restart recovery coordinator; recovery policy; workflow runtime; recovery tests.

Expected Tests

Mergeable in-review recovery event, stale merge status event, transient merge retry event, already-landed finalize event, autoMerge false terminal behavior, dedupe.

Exit Gate

Self-healing no longer directly requeues, pauses, fails, unpauses, or moves merge/retry tasks except through guarded workflow primitives.

Status

Draft stack placeholder. This PR reserves ordering and review context; implementation should replace or extend the handoff artifact before this slice is marked ready.

Implementation Added

  • Added typed workflow recovery event publisher that wakes recovery-router work items.
  • Added TaskStore-backed tests for deduped runnable recovery work.
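
A minimal sketch of how such a publisher could dedupe runnable recovery work, assuming an in-memory stand-in for the TaskStore. The method names `upsertWorkflowWorkItem` and `listWorkflowWorkItemsForTask` and the `runnable` state appear in the review discussion on this PR; the `WorkItem` shape, the `recovery:<taskId>:<kind>` run-id format, the `InMemoryStore` class, and the `publishRecoveryEvent` helper are hypothetical and are not the code added in this slice.

```typescript
// Hypothetical shapes; the real TaskStore types are not shown on this page.
type WorkItemState = "runnable" | "succeeded" | "failed" | "cancelled" | "exhausted";

interface WorkItem {
  taskId: string;
  runId: string;
  kind: string;
  state: WorkItemState;
}

// In-memory stand-in for the TaskStore, keyed by runId so upserts are idempotent.
class InMemoryStore {
  private items = new Map<string, WorkItem>();

  upsertWorkflowWorkItem(item: WorkItem): WorkItem {
    const existing = this.items.get(item.runId);
    if (existing) return existing; // same runId: no second item is created
    this.items.set(item.runId, item);
    return item;
  }

  listWorkflowWorkItemsForTask(taskId: string): WorkItem[] {
    return Array.from(this.items.values()).filter((i) => i.taskId === taskId);
  }
}

// Assumed deterministic run id: one id per (task, kind) pair.
function baseRunId(taskId: string, kind: string): string {
  return `recovery:${taskId}:${kind}`;
}

// Publishing maps a self-healing signal to a runnable work item.
function publishRecoveryEvent(store: InMemoryStore, taskId: string, kind: string): WorkItem {
  return store.upsertWorkflowWorkItem({
    taskId,
    runId: baseRunId(taskId, kind),
    kind,
    state: "runnable",
  });
}

// Publishing the same signal twice leaves a single runnable item.
const demoStore = new InMemoryStore();
publishRecoveryEvent(demoStore, "task-1", "transient-merge-failure");
publishRecoveryEvent(demoStore, "task-1", "transient-merge-failure");
const demoItems = demoStore.listWorkflowWorkItemsForTask("task-1");
```

Keying the upsert on a deterministic run id is what makes the second publish a no-op in this sketch; it does not yet handle republishing after the first item has finished.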

Summary by CodeRabbit

Release Notes

  • New Features

    • Added workflow recovery events to support self-healing capabilities for failed or blocked workflow executions.
    • Recovery events automatically create new recovery attempts when previous attempts reach terminal states.
  • Documentation

    • Added planning documentation for self-healing recovery events functionality.
  • Tests

    • Added comprehensive test coverage for recovery event publishing and handling behavior.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9dd1b6f0-2d89-488a-86ea-ed93eb97b718

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch from 4fd5a3b to b57eaa0 Compare June 9, 2026 20:30
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 30ef86b to f1badbe Compare June 9, 2026 20:30
@gsxdsm gsxdsm marked this pull request as ready for review June 9, 2026 20:35
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR adds a typed workflow recovery event publisher (publishWorkflowRecoveryEvent) that maps self-healing lifecycle signals into "recovery" work items on the TaskStore, addresses the previously flagged lastError/source fallback, and introduces terminal-state awareness via UUID-suffixed runIds so that an event can be re-published after its work item has been processed.

  • workflow-recovery-events.ts introduces recoveryRunIdForPublish to deduplicate in-flight recovery items and create fresh UUID-suffixed runIds when prior items are terminal, but the deduplication logic only matches items with runId === baseRunId (exact match), leaving UUID-suffixed items from earlier terminal cycles invisible during subsequent lookups.
  • Test coverage handles the two-event dedupe case and the null lastError for informational events, but the three-event scenario (fast re-fire while the second item is still active) and autoMerge: false terminal behavior (listed in the PR's expected tests) are not covered.

Confidence Score: 4/5

Safe to merge after the deduplication logic in recoveryRunIdForPublish is extended to also search UUID-suffixed items for non-terminal states.

The recovery event publisher has a deduplication gap: once an item moves past the base runId into UUID-suffixed territory, subsequent publishes while that item is still active bypass the dedupe check and create a second runnable work item, causing duplicate recovery processing of the same task.

packages/engine/src/workflow-recovery-events.ts — specifically the recoveryRunIdForPublish function's item search predicate.
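
One way the lookup could be extended to close this gap is sketched below. It is an illustration under stated assumptions, not the fix that landed: the function name `recoveryRunIdForPublishSketch`, the `RecoveryItem` and `RecoveryStore` shapes, the `running` state, the synchronous store call, and the injected `makeSuffix` generator (standing in for a UUID) are all hypothetical. The terminal state names are the ones listed in the review comments on this PR.

```typescript
type ItemState = "runnable" | "running" | "succeeded" | "failed" | "cancelled" | "exhausted";

interface RecoveryItem {
  runId: string;
  state: ItemState;
}

// The list method is required here, so the dedupe lookup cannot be skipped.
interface RecoveryStore {
  listWorkflowWorkItemsForTask(taskId: string): RecoveryItem[];
}

const TERMINAL = new Set<ItemState>(["succeeded", "failed", "cancelled", "exhausted"]);

// Matches the base runId and any "<baseRunId>:<suffix>" rotation of it.
// Assumes no other run id legitimately starts with "<baseRunId>:".
function isSameRecoveryRun(runId: string, baseRunId: string): boolean {
  return runId === baseRunId || runId.startsWith(`${baseRunId}:`);
}

function recoveryRunIdForPublishSketch(
  store: RecoveryStore,
  taskId: string,
  baseRunId: string,
  makeSuffix: () => string, // a UUID generator in practice
): string {
  const related = store
    .listWorkflowWorkItemsForTask(taskId)
    .filter((item) => isSameRecoveryRun(item.runId, baseRunId));

  // Any non-terminal item, base or suffixed, is in flight: reuse its runId.
  const active = related.find((item) => !TERMINAL.has(item.state));
  if (active) return active.runId;

  // Nothing published yet: use the base id. All prior items terminal: rotate.
  return related.length === 0 ? baseRunId : `${baseRunId}:${makeSuffix()}`;
}

// Replays the three-event scenario from the summary above.
const items: RecoveryItem[] = [];
const store: RecoveryStore = { listWorkflowWorkItemsForTask: () => items };
let counter = 0;
const makeSuffix = () => `suffix-${++counter}`;
const base = "recovery:task-1:transient-merge-failure";

const first = recoveryRunIdForPublishSketch(store, "task-1", base, makeSuffix);
items.push({ runId: first, state: "succeeded" }); // Item A, already terminal
const second = recoveryRunIdForPublishSketch(store, "task-1", base, makeSuffix);
items.push({ runId: second, state: "runnable" }); // Item B, still active
const third = recoveryRunIdForPublishSketch(store, "task-1", base, makeSuffix);
```

In this sketch the third publish resolves to Item B's run id, so an upsert keyed on that id would not create an Item C.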

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/engine/src/workflow-recovery-events.ts | Introduces typed recovery event publisher with terminal-state awareness, but recoveryRunIdForPublish only deduplicates against the exact-match base runId; UUID-suffixed items from prior terminal cycles are invisible to the search, so a third publish while the second item is still active creates a duplicate runnable work item. |
| packages/engine/src/__tests__/workflow-recovery-events.test.ts | Covers single-terminal-cycle republish and null lastError for informational events; missing a three-event scenario that would expose the UUID-suffix deduplication gap, and no coverage for autoMerge: false terminal behavior. |
| packages/engine/src/index.ts | Adds clean public exports for publishWorkflowRecoveryEvent, WorkflowRecoveryEventInput, WorkflowRecoveryEventKind, and WorkflowRecoveryEventStore; no issues. |
| docs/plans/workflow-owned-merge-stack/s10-self-healing-recovery-events.md | Documentation-only planning artifact for the S10 slice; no functional concerns. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant SH as Self-Healing
    participant Pub as publishWorkflowRecoveryEvent
    participant RId as recoveryRunIdForPublish
    participant Store as TaskStore

    SH->>Pub: "kind=transient-merge-failure, taskId"
    Pub->>RId: baseRunId
    RId->>Store: listWorkflowWorkItemsForTask
    Store-->>RId: [] none
    RId-->>Pub: baseRunId
    Pub->>Store: "upsertWorkflowWorkItem runId=baseRunId state=runnable"
    Store-->>Pub: Item A runnable

    Note over Store: Item A transitions to terminal

    SH->>Pub: 2nd event same kind
    Pub->>RId: baseRunId
    RId->>Store: listWorkflowWorkItemsForTask
    Store-->>RId: Item A terminal
    RId-->>Pub: baseRunId:uuid1
    Pub->>Store: "upsertWorkflowWorkItem runId=baseRunId:uuid1"
    Store-->>Pub: Item B runnable

    Note over Store,Pub: Item B still active

    SH->>Pub: 3rd event fast re-fire
    Pub->>RId: baseRunId
    RId->>Store: listWorkflowWorkItemsForTask
    Store-->>RId: Item A terminal plus Item B runnable
    Note over RId: find runId===baseRunId returns Item A only
    RId-->>Pub: baseRunId:uuid2
    Pub->>Store: "upsertWorkflowWorkItem runId=baseRunId:uuid2"
    Store-->>Pub: Item C runnable DUPLICATE of Item B

Reviews (8): Last reviewed commit: "Address PR review feedback (#1583)" | Re-trigger Greptile

@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch from b57eaa0 to 7de5076 Compare June 9, 2026 20:38
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from f1badbe to cbdbe87 Compare June 9, 2026 20:38
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch from 7de5076 to 95230fa Compare June 9, 2026 20:48
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from cbdbe87 to 6451e39 Compare June 9, 2026 20:48
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch from 95230fa to c85ab12 Compare June 9, 2026 23:30
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 6451e39 to 2e67ef0 Compare June 9, 2026 23:31
Comment thread packages/engine/src/workflow-recovery-events.ts Outdated
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 2e67ef0 to eacaa0f Compare June 10, 2026 00:23
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch 2 times, most recently from d3ade66 to e84e8f6 Compare June 10, 2026 03:48
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from eacaa0f to 11687a5 Compare June 10, 2026 03:48
Comment thread packages/engine/src/workflow-recovery-events.ts
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 11687a5 to 082995d Compare June 11, 2026 15:25
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch from e84e8f6 to 19e6540 Compare June 11, 2026 15:25
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 082995d to d0d7fb7 Compare June 11, 2026 15:35
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch 2 times, most recently from 4eb81d1 to 86bea2f Compare June 11, 2026 15:39
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from d0d7fb7 to 46ef418 Compare June 11, 2026 15:39
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s09-workflow-owned-retry-state branch from 86bea2f to c2df552 Compare June 11, 2026 15:44
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 46ef418 to 3d1e5a9 Compare June 11, 2026 15:44
gsxdsm added 2 commits June 11, 2026 08:48
Address PR #1583 feedback by keeping in-flight recovery dedupe while issuing fresh work when a previous deterministic recovery event has reached a terminal state.
@gsxdsm gsxdsm force-pushed the feature/workflow-owned-merge-s10-self-healing-recovery-events branch from 3d1e5a9 to b16d22e Compare June 11, 2026 15:49
@gsxdsm

gsxdsm commented Jun 11, 2026

Collaborator Author

> A second invocation after the recovery router has completed the work item will throw an unhandled error.

Addressed: recovery events now dedupe in-flight work but create a fresh run id after terminal prior work, so recurring recovery signals can be published safely.

Base automatically changed from feature/workflow-owned-merge-s09-workflow-owned-retry-state to feature/workflow-owned-merge-s08-workflow-owned-merge-processing June 16, 2026 22:04

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/engine/src/__tests__/workflow-recovery-events.test.ts`:
- Around line 99-109: The assertion on the listWorkflowWorkItemsForTask call
uses a direct toEqual comparison with a fixed array order, which makes the test
fragile if row ordering changes. Replace the toEqual check with
expect.arrayContaining to match items regardless of order, and add a separate
length check to ensure exactly the expected number of items are returned. This
decouples the test from the specific ordering of workflow items.
- Around line 73-109: The test at line 73 currently validates the recurring
recovery event invariant only when the previous event reaches the "succeeded"
terminal state. Expand the test coverage to verify that
publishWorkflowRecoveryEvent can publish fresh recovery items when the previous
event transitions to any known terminal state: "succeeded", "failed",
"cancelled", and "exhausted". For each terminal state, use
store.transitionWorkflowWorkItem to move the first recovery work item to that
state, then assert that a subsequent publishWorkflowRecoveryEvent call creates a
new work item with the expected properties, ensuring the regression is fully
guarded across all terminal surface cases.

In `@packages/engine/src/workflow-recovery-events.ts`:
- Around line 20-21: The `listWorkflowWorkItemsForTask` method in the
`WorkflowRecoveryEventStore` interface is currently optional (marked with `?`),
which prevents `publishWorkflowRecoveryEvent` from guaranteeing deduplication
and terminal-state runId rotation across all store implementations. Remove the
optional marker (`?`) from the `listWorkflowWorkItemsForTask` method signature
to make it a required method, ensuring all implementations of this interface
must provide this capability for proper recovery-event handling.
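
The two test suggestions above can be illustrated without a test runner. The sketch below is framework-free and hypothetical throughout: `publish` is a tiny fake rather than publishWorkflowRecoveryEvent, the `base:<n>` suffix stands in for a UUID, and setting `state` directly stands in for store.transitionWorkflowWorkItem. In a Vitest suite the loop would typically become `it.each` and the comparison would become `expect.arrayContaining` plus a length check.

```typescript
type State = "runnable" | "succeeded" | "failed" | "cancelled" | "exhausted";

interface Item {
  runId: string;
  state: State;
}

const TERMINAL_STATES: State[] = ["succeeded", "failed", "cancelled", "exhausted"];

// Tiny fake publisher: reuse an active item, otherwise append a fresh one.
function publish(items: Item[], baseRunId: string): Item {
  const active = items.find((i) => !TERMINAL_STATES.includes(i.state));
  if (active) return active;
  const runId = items.length === 0 ? baseRunId : `${baseRunId}:${items.length}`;
  const item: Item = { runId, state: "runnable" };
  items.push(item);
  return item;
}

// Order-independent comparison: same length and every expected item present.
function sameItemsAnyOrder(actual: Item[], expected: Item[]): boolean {
  return (
    actual.length === expected.length &&
    expected.every((e) => actual.some((a) => a.runId === e.runId && a.state === e.state))
  );
}

// One case per terminal state, mirroring a table-driven test.
const results: Record<string, boolean> = {};
for (const terminal of TERMINAL_STATES) {
  const items: Item[] = [];
  const first = publish(items, "base");
  first.state = terminal; // stand-in for transitioning the first work item
  publish(items, "base"); // should create a fresh runnable item

  // Expected rows are listed in reverse order on purpose.
  results[terminal] = sameItemsAnyOrder(items, [
    { runId: "base:1", state: "runnable" },
    { runId: "base", state: terminal },
  ]);
}
```

Checking length separately matters because a containment check alone would still pass if an unexpected extra work item were created.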
🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e7374297-4e83-4fe3-a182-66a9fbf840e8

📥 Commits

Reviewing files that changed from the base of the PR and between 31129df and b16d22e.

📒 Files selected for processing (4)
  • docs/plans/workflow-owned-merge-stack/s10-self-healing-recovery-events.md
  • packages/engine/src/__tests__/workflow-recovery-events.test.ts
  • packages/engine/src/index.ts
  • packages/engine/src/workflow-recovery-events.ts

Comment thread packages/engine/src/__tests__/workflow-recovery-events.test.ts Outdated
Comment thread packages/engine/src/__tests__/workflow-recovery-events.test.ts Outdated
Comment thread packages/engine/src/workflow-recovery-events.ts Outdated
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

An unexpected error occurred while generating fixes: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)

- Make listWorkflowWorkItemsForTask required on WorkflowRecoveryEventStore so the terminal-state runId rotation guard can't be silently bypassed
- Expand recurring-recovery regression test to all terminal states (succeeded/failed/cancelled/exhausted) via it.each
- Use order-independent arrayContaining + length assertion for recovery work items

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@stage-review

stage-review Bot commented Jun 17, 2026


Ready to review this PR? Stage has broken it down into 3 individual chapters for you:

Title
1 Document self-healing recovery events plan
2 Implement workflow recovery event publisher
3 Verify recovery event publishing and deduplication

Chapters generated by Stage for commit 21bb295 on Jun 17, 2026 2:19pm UTC.

@gsxdsm gsxdsm merged commit b5072ce into feature/workflow-owned-merge-s08-workflow-owned-merge-processing Jun 17, 2026
2 checks passed
@gsxdsm gsxdsm deleted the feature/workflow-owned-merge-s10-self-healing-recovery-events branch June 17, 2026 21:45