Subagent runtime model fallback (#198)#209
Conversation
Implementation PlanProblem Analysis
Approach: Child-Side FallbackWe pass the full fallback list into the child process and teach
Deliverables
Files to Create or Modify
Detailed Design1. Passing the fallback list to the childIn if (!modelOverride && Array.isArray(config.model)) {
effectiveConfig = { ...effectiveConfig, modelFallbacks: config.model };
}In 2. AgentSession fallback consumptionAdd to fallbackModels?: string[];In this._fallbackModels = config.fallbackModels ?? [];
this._fallbackModelIndex = 0;3. Model switching on retry exhaustionModify
4. Surfacing the resolved model
5. Logging design principlePer the issue's "Fail loud, not verbose":
Acceptance Criteria Mapping
Testing Approach
Risks and Open Questions
Plan created by mach6 |
Dinesh vs Gilfoyle Review — Plan for issue 1983 rounds of mass destructionBest of the Banter
VerdictCritical (Gilfoyle won, Dinesh conceded)
Important (Gilfoyle won after debate)
Contested (Dinesh held his ground)
Dismissed (Gilfoyle was nitpicking)
Strengths (even Gilfoyle grudgingly acknowledged)
ScoreGilfoyle: 8 | Dinesh: 2 Amended Plan — Recommended ChangesThe following amendments should be applied to the implementation plan before work begins:
Review by mach6 /dg |
Scope ClarificationThe implementation plan for this issue needs a significant course correction. After review, the correct scope is much narrower than what was previously planned. What this feature is NOT
What this feature ISThe fallback exists only at subagent spawn time, in the parent process.
Flow:
Approach: Parent-side respawn. The child process receives only a single Files to modifyOnly
Files to NOT modify
All previously planned changes to the above files should be discarded. The feature lives entirely in the subagent spawn logic. |
Additional Design NoteAn acceptable implementation approach: make a lightweight API call at spawn time to verify the model is actually responsive before committing to it. If the probe returns a rate limit, quota, or auth error, skip that model and try the next fallback immediately. This avoids the respawn overhead entirely. This is preferred over spawning blindly and then parsing exit codes / stderr to detect failure. |
Amended Implementation Plan — Issue 198Problem Analysis
The previous plan attempted child-side fallback, which would have touched ScopeOnly No changes to Approach: Spawn-Time Probe with Fallback LoopInstead of spawning blindly and then parsing exit codes / stderr to detect failure (respawn approach), we make a lightweight API call at spawn time to verify the model is actually responsive. If the probe returns a rate limit, quota, or auth error, we skip that model and try the next fallback immediately. This avoids respawn overhead entirely. Flow:
The probe is best-effort: it only runs when a Deliverables
Files to Create or Modify
Files to NOT Modify
Detailed Design1.
|
| Criterion | How it's addressed |
|---|---|
| Primary model hits rate limit → attempt next fallback | executeSingle probes each fallback in order; skips models that fail the probe for any reason |
| Transparent to parent | Parent sees normal success/fail from spawnSubagent; fallback logic is entirely inside executeSingle |
| Resolved model surfaced in result | agent_start JSONL event carries the active model ID; spawnSubagent already captures it. Fallback summary is prepended to result.output |
| Spawn-time fallback unchanged | resolveModelWithFallbacks logic is untouched; probe is an additional gate before spawning |
| Every fallback logged loudly | Skipped models are logged via console.error AND summarized in result.output so the parent agent sees them |
| Final failure is concise | If all fallbacks fail, existing resolveModelWithFallbacks error behavior is preserved |
Testing Approach
Extend packages/coding-agent/test/subagent-model-fallback.test.ts:
- Probe succeeds on first model — mock
completeto return success; verifyspawnSubagentis called with the first model - Probe skips failed model, falls back to second — mock
completeto return an error for the first model and success for the second; verify the second model is used and the fallback summary is present inresult.output - Probe skips on any error type — mock
completewith rate-limit, quota, timeout, and 5xx errors; verify fallback behavior in all cases - Probe exhausts all configured fallbacks, parent model succeeds — mock all probes to fail; verify parent model is used with warning
- Single model config skips probing — verify
completeis not called whenconfig.modelis a string - Model override skips probing — verify
completeis not called when a per-invocation override is set
Mock strategy:
vi.mock("@dreb/ai", async () => ({ complete: vi.fn(), ... }))- Mock registry with
find()returningModelobjects - Spy on
console.errorto verify log lines
Risks and Open Questions
-
Probe costs: Each probe is a 1-token API call. For a 4-model fallback list where the first 3 fail, this costs ~3 minimal API calls. This is acceptable given it prevents a full subagent spawn + failure cycle.
-
Transient network errors skip good models: A local network blip could cause all probes to fail, pushing everything to the parent model fallback. This is acceptable because the parent model is guaranteed to be working (the parent session is using it), and it is good enough for any subagent task.
-
Probe false negatives: A model could pass the probe but fail on the first real request in the child (e.g., rate limit triggers between probe and spawn). This is an accepted trade-off; the probe catches the common case of immediate unavailability. A future enhancement could add respawn-on-failure as a secondary safety net, but that is out of scope for this plan.
-
No registry = no probe: In environments without a
ModelRegistry, the probe is skipped and existing behavior is preserved. This matches the current test infrastructure. -
completeimport from@dreb/ai:packages/coding-agentalready depends on@dreb/ai. The import is a simple addition with no package-level changes.
Plan created by mach6
Progress UpdateImplemented the amended plan for subagent runtime model fallback:
Verification run:
Commit: Progress tracked by mach6 |
Code ReviewCriticalNone. Importantfinding 1 — Parent abort during probe loop is reported as model unavailability finding 2 — Already-aborted parent signal may remove the probe timeout safety net finding 3 — Fallback summary output is not tested Suggestionsfinding 4 — Cancellation is not checked at the top of the fallback loop finding 5 — Per-model skip reasons are discarded on total failure finding 6 — Registry-less array fallback boundary is untested finding 7 — Aborted assistant messages from the probe are not directly tested finding 8 — Probe timeout and parent-abort propagation are untested finding 9 — Parent model failure after all probes fail is untested in the new resolver finding 10 — Missing explicit test for per-invocation override probe skip finding 11 — Redundant ternary in finding 12 — Strengths
Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker, simplifier Reviewed by mach6 |
Review AssessmentReview comment: #209 (comment) Classifications
Action Plan
Assessment by mach6 |
Progress UpdateFixed the genuine review findings for subagent spawn-time model probes:
Verification run:
Commit: Progress tracked by mach6 |
Code ReviewCriticalNone. Importantfinding 1 — finding 2 — Per-invocation model override probe skip is not integration-tested Suggestionsfinding 3 — finding 4 — finding 5 — Probe skip when finding 6 — Redundant finding 7 — Duplicate skipped-model list formatting finding 8 — Verbose inline attempted-model array construction in final error message Strengths
Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker, simplifier Reviewed by mach6 |
Review AssessmentReview comment: #209 (comment) Classifications
Action Plan
Assessment by mach6 |
Progress UpdateFixed the two genuine review findings for subagent fallback test coverage:
Verification run:
Note: Commit: Progress tracked by mach6 |
Code ReviewCriticalNone. Importantfinding 1 — Probe-at-spawn only partially addresses "runtime rate limits trigger fallback" finding 2 — Parent model final fallback bypasses availability probe finding 3 — Suggestionsfinding 4 — finding 5 — Duplicated bullet-list construction across formatting helpers finding 6 — Minor test coverage gaps in helper functions Strengths
Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker, simplifier Reviewed by mach6 |
Review AssessmentReview comment: #209 (comment) Classifications
Action PlanNo action needed before merge — all findings are nitpicks or false positives. Assessment by mach6 |
Also fix flaky web-search-queue timing test by using monotonic performance.now() instead of Date.now() which can go backwards during NTP clock corrections.
Closes #198
Subagent model fallback lists are lost at spawn time — runtime rate limits cannot trigger fallback.
Implementation plan posted as a comment below.