From eb9d20046330c4f98b8de0c9c06f9b59565f1601 Mon Sep 17 00:00:00 2001
From: Yvette Carlisle
Date: Wed, 3 Jun 2026 00:34:58 +0800
Subject: [PATCH] {"schema":"decodex/commit/1","summary":"Add lane-control recovery guidance","authority":"XY-705"}

---
 docs/index.md                              |   3 +
 docs/runbook/index.md                      |   4 +
 docs/runbook/lane-control-recovery.md      | 155 +++++++++++++++++++++
 docs/spec/lane-control.md                  |   5 +-
 plugins/decodex/skills/automation/SKILL.md |  32 +++++
 plugins/decodex/skills/manual-cli/SKILL.md |  26 ++++
 6 files changed, 224 insertions(+), 1 deletion(-)
 create mode 100644 docs/runbook/lane-control-recovery.md

diff --git a/docs/index.md b/docs/index.md
index 20f37a4c..09fb7340 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -35,6 +35,9 @@ The split below is by question type, not by human-versus-agent audience.
 - Need Decodex operator lane-control capability support, including inspect,
   pause/resume, scan, interrupt, steer, retained retry/resume, manual attention, or
   unsupported/deferred controls -> `docs/spec/lane-control.md`
+- Need the post-control recovery sequence after lane interrupt, hard fallback, broad
+  steer, task replacement, or ambiguous retained evidence ->
+  `docs/runbook/lane-control-recovery.md`
 - Need public static-site contracts, GitHub bundle schemas, signal-entry schemas, or
   release-delta schemas -> `docs/spec/`
 - Need runbooks, migrations, validation steps, troubleshooting, or operational
diff --git a/docs/runbook/index.md b/docs/runbook/index.md
index 8c200dbd..ee01a088 100644
--- a/docs/runbook/index.md
+++ b/docs/runbook/index.md
@@ -30,6 +30,10 @@ Question this index answers: "which sequence should I execute?"
   `decodex.space` custom-domain setup for the static public site.
 - [`linear-archive-hygiene.md`](./linear-archive-hygiene.md) for dry-run-first archive
   hygiene of old terminal Linear issues by repo label.
+- [`lane-control-recovery.md`](./lane-control-recovery.md) for deciding whether to
+  inspect, resume, scan, keep or remove queue labels, or route manual attention after
+  interrupt, hard fallback, broad steer, task replacement, or ambiguous recovery
+  evidence.
 - [`local-github-signal-workflow.md`](./local-github-signal-workflow.md) for collecting
   GitHub change bundles, running Codex editorial analysis, validating signal entries,
   and publishing static site content.
diff --git a/docs/runbook/lane-control-recovery.md b/docs/runbook/lane-control-recovery.md
new file mode 100644
index 00000000..b4a42539
--- /dev/null
+++ b/docs/runbook/lane-control-recovery.md
@@ -0,0 +1,155 @@
+# Lane-Control Recovery
+
+Goal: Give agents and operators a bounded recovery sequence after Decodex lane
+interrupt, hard fallback, broad steer, task replacement, or ambiguous retained-lane
+evidence.
+
+Read this when: A lane-control request has returned, timed out, fallen back, changed
+task content materially, or left unclear evidence about whether a retained lane should
+resume, requeue, stop, or require human attention.
+
+Inputs: Registered project id, issue identifier, run id, attempt number, current turn
+id when available, control request result, `decodex status` or
+`decodex status --json`, private evidence from `decodex evidence`, tracker state,
+retained worktree state, and PR lineage when present.
+
+Depends on: [`../spec/lane-control.md`](../spec/lane-control.md),
+[`../spec/tracker-tools.md`](../spec/tracker-tools.md),
+[`../reference/operator-control-plane.md`](../reference/operator-control-plane.md),
+[`./recover-review-handoff.md`](./recover-review-handoff.md), the Decodex `automation`,
+`manual-cli`, and `labels` skills, plus the registered project `project.toml` and
+`WORKFLOW.md`.
+
+Verification: The chosen path should cite the inspection evidence, the control outcome,
+the retained worktree or PR lineage when relevant, and the supported Decodex command,
+API, label skill, or issue-scoped tracker tool used for the next mutation.
+
+## Recovery Principle
+
+Lane control is not a shortcut around retained-lane lifecycle. `turn/steer` can carry
+broad operator text, and `hard_interrupt_fallback` can stop a recorded child process
+when explicitly requested, but recovery still has to preserve audit, lane identity,
+workflow policy, and useful local work.
+
+Do not directly kill hidden `_attempt` children, edit runtime DB rows, or mutate Linear
+labels to simulate lane control. The normal paths are CLI/API lane controls, retained
+retry/resume, explicit recovery commands, label-skill actions, issue-scoped tracker
+tools, and manual attention. If an operator had to stop a process outside Decodex
+controls for immediate host safety, treat the next state as ambiguous evidence until
+the lane, private evidence, and worktree have been inspected.
+
+## Inspect First
+
+Run the smallest set of inspections that can prove the lane identity and current owner:
+
+```sh
+decodex lane inspect --run-id --json
+decodex status --json
+decodex diagnose --json
+decodex evidence --run-id --attempt --json
+```
+
+Use the local HTTP API only against the same trusted listener when CLI access is not
+the active surface:
+
+```sh
+curl -sS 'http://127.0.0.1:8912/api/lane/inspect?projectId=&issue=&runId='
+```
+
+Before mutating anything, confirm:
+
+- project id and registered project path
+- issue identifier, tracker state, and service-scoped labels
+- branch, worktree, and whether the worktree is active, retained, queued-attention, or
+  cleanup-only
+- run id, attempt, thread id, current turn id, and process/protocol liveness
+- control outcome such as accepted, rejected, timed out, failed, or
+  `hard_interrupt_fallback`
+- private evidence and public lifecycle signal
+- PR URL, head branch, and head SHA when the lane has crossed review handoff
+
+If these facts do not prove the requested lane, do not steer, interrupt, retry, resume,
+or clean labels.
+
+## Decision Tree
+
+| Evidence after inspection | Agent decision | Supported next action |
+| --- | --- | --- |
+| Active lane still matches the issue, branch, run id, attempt, and turn. | Let the runtime continue or wait for the control result. | No label change. Use the next CLI/API control only when the operator explicitly asks. |
+| Soft interrupt was accepted and the runtime is still resolving the attempt. | Wait for status, protocol activity, or evidence to settle. | Re-inspect; do not requeue or force-kill. |
+| Hard fallback reports `hard_interrupt_fallback`. | Treat it as an interrupted runtime event, not a graceful completion. | Inspect retained worktree and evidence; resume only if lineage is exact. |
+| Retained worktree has useful local changes and lineage matches issue, branch, runtime evidence, and PR when present. | Resume or repair the same lane. | Use `decodex run ` when the registered workflow makes it eligible, or use the specific retained recovery runbook. |
+| Review handoff marker is missing or stale but the retained PR lane appears recoverable. | Diagnose before rebind. | Run `decodex recover review-handoff diagnose ` and follow [`recover-review-handoff.md`](./recover-review-handoff.md). |
+| Queue label or tracker state was changed and the scheduler should observe it before the next poll. | Request a refresh, not a retry. | `POST /api/linear-scan` with `projectId`, or no body for all enabled projects. |
+| Queue label should be added, removed, or interpreted. | Use service-scoped label policy. | Follow the `labels` skill; do not guess `` or clear `needs-attention` before fixing the blocker. |
+| Broad steer materially changes the objective or acceptance contract. | Preserve audit and resolve lifecycle explicitly. | Update and requeue the same issue, create a new issue/lane, or route the owned run to manual attention. |
+| Operator wants a different issue or replacement task. | Treat as task replacement, not steer. | Stop or pause through supported controls as needed, then create/update/requeue through the supported lifecycle. |
+| Evidence is missing, contradictory, or would require guessing whether local work is safe to overwrite. | Stop automatic recovery. | Use manual attention with structured public blockers and keep private evidence local. |
+
+## Broad Steer Examples
+
+Broad steer can be delivered by the runtime, but it does not erase lifecycle authority.
+
+Example: an active lane is implementing "add lane-control guidance" and an operator
+steers "ignore that and add dashboard retry buttons." The CLI/API may accept the steer
+when the run id and expected turn id match. After the turn resolves, an agent must
+inspect the diff and evidence. If the issue still has the old objective and the diff
+now contains dashboard controls, do not hand off the PR as if the original issue was
+satisfied. Preserve the steer audit and either create a replacement issue, update and
+requeue the current issue through explicit lifecycle, or route manual attention.
+
+Example: an operator steers "narrow this to docs only; do not touch Rust." If the issue
+still accepts that scope and the resulting diff matches the same acceptance criteria,
+the lane may continue after inspection. The agent should still cite the steer evidence
+and ensure the review handoff summary does not imply unrequested runtime behavior
+changed.
+
+## Interrupt And Hard Fallback Examples
+
+Example: `decodex lane interrupt XY-123 --run-id run-abc` reports a soft interrupt
+request. Re-run `decodex lane inspect` or `decodex status --json`. If protocol
+activity shows the same turn is still stopping, wait or inspect private evidence; do
+not kill the child process from the side.
+
+Example: `decodex lane interrupt XY-123 --run-id run-abc --force` reports
+`hard_interrupt_fallback`. Inspect the retained worktree before retry. If the worktree
+contains a partial patch that still belongs to `XY-123`, resume through
+`decodex run XY-123` only when `WORKFLOW.md` eligibility, runtime evidence, branch,
+and PR lineage still match. If the patch belongs to a replaced task or the issue state
+is unclear, route manual attention.
+
+## Label And Scan Rules
+
+`POST /api/linear-scan` only asks the local listener to refresh Linear-backed intake and
+status before the next scheduled poll. It does not start an attempt, retry a failed
+lane, or change labels.
+
+Keep `decodex:queued:` when the issue is still intended for automation and
+the scheduler simply needs to observe a changed state. Remove it only through the
+labels skill when the issue should no longer be an intake candidate. Keep
+`decodex:needs-attention` until the recorded blocker is resolved; clearing it is not a
+recovery shortcut.
+
+During an owned automation run, agents use issue-scoped tracker tools for progress,
+review handoff, manual attention, and terminal finalization. Outside the owned run,
+operators use the documented CLI/API controls and label procedures.
+
+## Manual Attention Route
+
+Use manual attention when:
+
+- lane identity cannot be proven from current evidence
+- retained work may be overwritten or discarded without a human decision
+- broad steer or task replacement changed the issue authority
+- hard fallback stopped a process but retained worktree state is unclear
+- Linear labels, active ownership, or tracker state conflict with runtime evidence
+- PR lineage cannot be validated after review handoff
+
+The valid owned-agent path is:
+
+1. add the configured `decodex:needs-attention` label
+2. call `issue_comment` with `kind = "manual_attention"` and structured public fields
+3. call `issue_terminal_finalize(path = "manual_attention")`
+
+Keep host-local paths, private payloads, raw steer text, process diagnostics, account
+details, and secrets out of the public Linear fields.
diff --git a/docs/spec/lane-control.md b/docs/spec/lane-control.md
index 87d78cb4..4d6d5985 100644
--- a/docs/spec/lane-control.md
+++ b/docs/spec/lane-control.md
@@ -6,7 +6,10 @@ Status: normative
 Read this when: You are implementing, validating, or using CLI/API controls for active
 or retained Decodex lanes.
 Not this document: The full runtime state machine, the low-level app-server method
-schema, dashboard layout, or tracker-tool payload schema.
+schema, dashboard layout, tracker-tool payload schema, or the step-by-step recovery
+sequence after a control action. Use
+[`../runbook/lane-control-recovery.md`](../runbook/lane-control-recovery.md) for
+post-control recovery decisions.
 Defines: The lane-control capability matrix, supported and deferred controls, audit
 requirements, and policy boundary for inspect, pause/resume, scan, interrupt, steer,
 retained retry/resume, and manual-attention controls.
diff --git a/plugins/decodex/skills/automation/SKILL.md b/plugins/decodex/skills/automation/SKILL.md
index 1af79a45..e22fce51 100644
--- a/plugins/decodex/skills/automation/SKILL.md
+++ b/plugins/decodex/skills/automation/SKILL.md
@@ -22,6 +22,9 @@ Operate Decodex as the retained-lane control plane for automatic development.
 - `docs/spec/lane-control.md` owns CLI/API-first lane-control capabilities, including
   inspect, pause/resume, scan, interrupt, steer, retained resume/retry, manual
   attention, and deferred controls.
+- `docs/runbook/lane-control-recovery.md` owns the post-control decision trees for
+  agents after interrupt, hard fallback, broad steer, task replacement, or ambiguous
+  recovery evidence.
 - `docs/spec/workflow-file.md` owns `WORKFLOW.md` schema and field semantics.
 - `docs/reference/operator-control-plane.md` owns the current status/dashboard field
   map.
@@ -111,6 +114,8 @@ terminal automation signal.
 ## Lane Controls
 
 Read `docs/spec/lane-control.md` before using or explaining operator controls.
+Read `docs/runbook/lane-control-recovery.md` before retrying, resuming, relabeling, or
+escalating after a control action or ambiguous recovery signal.
 
 Rules for agents:
 
@@ -146,6 +151,33 @@ Rules for agents:
   owned agent run, use issue-scoped tools for progress, review handoff, manual
   attention, and terminal finalization. Outside the owned lane, use documented CLI/API
   controls and the labels skill.
+- Do not directly kill hidden `_attempt` children or edit runtime DB rows to force a
+  lane state. Use the supported interrupt, retained retry/resume, recovery, and
+  manual-attention paths. If an operator had to stop a process for immediate host
+  safety outside Decodex controls, treat the lane as evidence-ambiguous until
+  `status`, `diagnose`, `evidence`, and the retained worktree have been inspected.
+
+Post-control decision tree for automation agents:
+
+1. Inspect the current lane and private evidence before deciding whether the control
+   succeeded, failed, timed out, or fell back to `hard_interrupt_fallback`.
+2. If the lane is still active and identity still matches the issue, branch, run id,
+   attempt, and current turn, let the runtime continue or wait for the control result;
+   do not requeue or clear labels.
+3. If the lane is interrupted, failed, or retained with useful local work, resume only
+   when the retained worktree, branch, issue, runtime evidence, and PR lineage still
+   prove the same lane. Use runtime lifecycle entrypoints such as `decodex run
+   `; do not restart from a guessed branch.
+4. If a queued or relabeled issue should be observed sooner, request a Linear scan with
+   `POST /api/linear-scan`. Keep or remove queue labels only through the labels skill
+   or the supported tracker-tool path for the owned issue.
+5. If a broad steer materially changes the requested objective, acceptance criteria, or
+   issue authority, preserve the local control audit and resolve lifecycle explicitly:
+   update and requeue the issue, create a new lane, or route the owned run to manual
+   attention. Do not silently hand off a PR whose diff no longer matches the issue.
+6. If evidence cannot prove whether to resume, retry, requeue, or discard retained
+   work, stop automatic recovery and use manual attention with structured public
+   blockers.
 
 ## Boundaries
 
diff --git a/plugins/decodex/skills/manual-cli/SKILL.md b/plugins/decodex/skills/manual-cli/SKILL.md
index 1570e7a3..13c7b276 100644
--- a/plugins/decodex/skills/manual-cli/SKILL.md
+++ b/plugins/decodex/skills/manual-cli/SKILL.md
@@ -15,6 +15,8 @@ runtime-owned retained-lane lifecycle.
 - `README.md` for the current CLI shape.
 - `Makefile.toml` before running repo-native checks.
 - `docs/spec/lane-control.md` before using CLI/API lane controls.
+- `docs/runbook/lane-control-recovery.md` before deciding what to do after interrupt,
+  hard fallback, broad steer, task replacement, or ambiguous recovery evidence.
 - `docs/reference/operator-control-plane.md` when interpreting `status` or dashboard
   fields.
 - `docs/runbook/linear-archive-hygiene.md` before archiving old terminal Linear issues.
@@ -83,6 +85,30 @@ CLI/API lane controls:
 - Do not use active-lane UI controls, direct runtime DB edits, raw
   `thread/inject_items`, or tracker-state mutations as substitutes for the lane-control
   contract.
+- Do not kill hidden `_attempt` children to simulate interrupt. Use
+  `decodex lane interrupt ... --force` or the API `"force": true` only when explicit
+  operator intent allows hard fallback. If an emergency host-safety stop happens
+  outside Decodex controls, inspect local evidence and route recovery explicitly before
+  retrying or cleaning labels.
+
+Post-control CLI recovery:
+
+1. Inspect again with `decodex lane inspect `, `decodex status --json`, and
+   `decodex evidence ` when a control request returns, times out, or reports
+   `hard_interrupt_fallback`.
+2. If identity still matches an active lane, wait for the runtime-owned attempt or use
+   the next supported control. Do not remove `decodex:active:` by hand.
+3. If the lane is retained and lineage is exact, use the registered workflow path such
+   as `decodex run ` for retry/resume. If status reports a retained review
+   handoff mismatch, use `docs/runbook/recover-review-handoff.md`.
+4. If the operator changed labels or issue state and wants the scheduler to notice
+   before the next poll, request `POST /api/linear-scan`; this is a refresh request,
+   not a retry command.
+5. If the new operator text replaces the task or changes acceptance materially, do not
+   hide that as steer. Resolve the old lane explicitly, then update/requeue the same
+   issue or create a new issue for the replacement work.
+6. If the evidence is ambiguous or useful retained work would be overwritten, route to
+   manual attention instead of direct Linear label mutation.
 
 Manual commit and landing are separate narrow workflows: