Skip to content

Make subagent probes use normal model defaults#216

Merged
m-aebrer merged 4 commits into
masterfrom
feature/issue-215-subagent-probes-normal-defaults
May 19, 2026
Merged

Make subagent probes use normal model defaults#216
m-aebrer merged 4 commits into
masterfrom
feature/issue-215-subagent-probes-normal-defaults

Conversation

@m-aebrer
Copy link
Copy Markdown
Collaborator

Closes #215

Plan to make subagent model availability probes use the same headless coding-agent request path and model defaults as normal execution, instead of a synthetic low-level 1-token request.

Implementation plan posted as a comment below.

@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Implementation Plan

Problem analysis

Subagent fallback probing currently exercises a different request path from normal dreb execution. probeModelAvailability() in packages/coding-agent/src/core/tools/subagent.ts calls the low-level complete() API with a synthetic prompt and maxTokens: 1. Normal TUI/headless coding-agent turns go through Agent / runAgentLoop / streamSimple, with ThinkingLevel defaults resolved by the coding-agent session and then mapped by provider streamSimple implementations into provider-specific reasoning options.

That divergence matters for OpenAI Responses reasoning models. The synthetic probe can produce max_output_tokens: 1 and no normal reasoning effort, while the real path uses normal output defaults and passes reasoning effort through the SimpleStreamOptions path. OpenAI's reasoning docs also confirm reasoning tokens count against max_output_tokens, so an artificially tiny output budget is not representative for availability.

The issue comment adds an important constraint: do not copy the normal request-building logic into the probe. The probe must literally import and reuse the same runtime path, or use headless CLI/session execution if that is the safest way to stay representative.

Deliverables

  1. Replace the synthetic low-level probe with a headless normal coding-agent request path.

    • Add a small reusable probe runner, likely in packages/coding-agent/src/core/model-availability-probe.ts or a similarly named core helper.
    • The runner should perform one minimal headless agent turn through existing coding-agent/agent-core machinery, not construct provider request options itself.
    • Preferred shape: use the same Agent / runAgentLoop / streamSimple path that normal print/TUI sessions use, with the same convertToLlm and API-key lookup behavior. If implementation shows the in-process path still diverges from CLI initialization, use the existing headless CLI path (--mode json --ui agent --no-session) instead.
    • The probe prompt can remain minimal, but the request construction must be normal.
  2. Remove the artificial cost-saving options from fallback probing.

    • Stop passing maxTokens: 1 from probeModelAvailability().
    • Do not add provider-specific probe hacks in openai-responses.ts as the primary fix.
    • Let the normal simple-stream/default option path choose output token defaults and reasoning behavior.
  3. Reuse shared thinking/default logic instead of duplicating it.

    • If the probe runner needs to resolve an effective ThinkingLevel, extract that logic from createAgentSession() into a shared helper and make both the normal session setup and probe import it.
    • The probe should mirror the subagent child process defaults for the resolved model: reasoning models get the normal default thinking level, non-reasoning models get off.
    • Avoid introducing a parallel table of provider/model reasoning defaults in subagent.ts.
  4. Preserve existing fallback semantics and diagnostics.

    • Fallback lists still probe each configured model in order.
    • Single-model configs and per-call overrides still skip probing.
    • Auth, rate-limit, quota, timeout, and provider errors still skip to the next fallback.
    • Failure summaries should continue to include compact diagnostic detail, but should make it clear when the failure came from the headless normal probe path rather than model-name resolution.
  5. Update user-facing docs that currently describe the old behavior.

    • packages/coding-agent/README.md says probes are “lightweight 1-token API call”; update this to describe normal headless request-path probing.
    • Check root README.md; it mentions subagents but not probe details, so update it only if implementation changes user-visible subagent behavior beyond the package README description.

Files to create or modify

  • packages/coding-agent/src/core/tools/subagent.ts

    • Remove the direct complete() probe request.
    • Delegate to the new shared/headless probe runner.
    • Keep timeout, abort propagation, skipped-model aggregation, and fallback summaries.
  • packages/coding-agent/src/core/model-availability-probe.ts or equivalent new helper

    • Own the one-turn headless probe implementation.
    • Import and use existing normal request/agent-loop code; do not build provider-specific payloads.
    • Return the same { ok: true } | { ok: false, reason, aborted? } shape or a small internal equivalent consumed by subagent.ts.
  • packages/coding-agent/src/core/sdk.ts and/or a new shared thinking helper if needed

    • Extract effective thinking-level resolution only if the probe cannot otherwise reuse session setup directly.
    • Keep normal session behavior unchanged.
  • packages/coding-agent/src/core/agent-session.ts

    • Only touch if the probe needs live parent/session values or if the shared helper must be wired through the subagent tool factory.
    • Do not expand scope unless necessary; current child subagent spawning does not explicitly pass --thinking.
  • packages/coding-agent/test/subagent-model-fallback.test.ts

    • Update existing probe tests that currently mock complete() and assert maxTokens: 1.
    • Add regression coverage for an openai-responses reasoning model so the probe cannot silently regress to low-level complete() or tiny token budgets.
  • packages/ai/test/openai-responses-copilot-provider.test.ts or a focused new OpenAI Responses payload test

    • Add a provider-level guard showing the normal streamSimple path for a reasoning model includes normal reasoning options and does not rely on the old probe shape.
    • This should be a regression guard, not a provider-specific probe fix.
  • packages/coding-agent/README.md

    • Replace the outdated “1-token API call” description.

Testing approach

  • Unit tests in packages/coding-agent/test/subagent-model-fallback.test.ts:

    • Clean probe success uses the new headless/normal probe helper, not complete().
    • Probe errors, returned error messages, abort-before-start, abort-during-probe, and timeout behavior still produce the existing fallback result shapes.
    • Fallback loop uses first model when the normal probe succeeds.
    • Fallback loop skips a failed normal probe and uses the next fallback.
    • Parent-model last-resort behavior remains unchanged.
    • Single-model configs and per-invocation overrides still do not probe.
    • New regression: an openai-responses reasoning model probe does not pass maxTokens: 1 and does use the normal reasoning/default path.
  • Provider/request-shape tests in packages/ai/test/:

    • Capture an OpenAI Responses streamSimple payload for a reasoning model and assert it carries a normal reasoning effort/summary path.
    • Assert the old synthetic shape is not what the probe depends on.
  • Documentation verification:

    • Confirm packages/coding-agent/README.md no longer promises “1-token” probes.
  • Command verification after implementation:

    • npx vitest --run packages/coding-agent/test/subagent-model-fallback.test.ts
    • The focused OpenAI Responses test file added/modified under packages/ai/test/
    • npm run build
    • npm run verify-workspace-links
    • If credentials are available, manually launch a subagent fallback-list scenario with an OpenAI Responses reasoning model first in the list and confirm it is not falsely skipped.

Acceptance criteria

  • Subagent fallback probing no longer uses a synthetic low-level complete() request with maxTokens: 1.
  • The probe imports and uses the same normal request path used by coding-agent/headless execution for model option construction.
  • OpenAI Responses reasoning models that work in normal dreb execution are not rejected by the probe due to disabled/missing reasoning defaults or an invalid tiny output budget.
  • Genuine unavailable models still fall through to later fallback-list entries with loud diagnostics.
  • Existing fallback summary output remains understandable and includes enough provider error detail to distinguish model unavailability from request-shape rejection.
  • Tests fail if the probe regresses to the old low-level request shape.
  • User-facing subagent docs match the new behavior.

Risks and open questions

  • A fully representative headless probe costs more than the old 1-token request. That is acceptable for correctness, but timeout and prompt design should keep it bounded.
  • Running through more of the normal session path may load more configuration than the current probe. Keep session persistence in-memory/no-session and avoid side effects beyond the one model request.
  • The child subagent currently relies on normal CLI/session defaults for thinking; it does not explicitly inherit a transient parent thinking level. The probe should match the child’s actual defaults unless implementation intentionally changes both probe and child spawning together.
  • Avoid treating OpenAI Responses as the only affected provider. The fix should be provider-agnostic because the root problem is request-path divergence.

Plan created by mach6

@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Progress Update

Implemented the subagent probe default fix.

Changes in this batch:

  • Switched subagent model availability probing from the low-level complete() path to completeSimple() / streamSimple, matching the normal agent-loop request path.
  • Removed the synthetic maxTokens: 1 probe override and added shared coding-agent thinking/default clamping so reasoning models are probed with representative options.
  • Added regression coverage for OpenAI Responses reasoning models and the normal streamSimple request payload.
  • Updated subagent docs to remove the old “1-token API call” behavior.
  • Refreshed xAI test model IDs from removed grok-3 / grok-3-fast names to current grok-4.3 so the repository typecheck passes with the generated model registry.

Verification run:

  • npm run check
  • npx vitest --run packages/coding-agent/test/subagent-model-fallback.test.ts packages/ai/test/openai-responses-copilot-provider.test.ts
  • npm run build
  • npm run verify-workspace-links
  • npm test
  • Commit hook tests passed before commit

Commit: a22f7d4


Progress tracked by mach6

@m-aebrer m-aebrer marked this pull request as ready for review May 19, 2026 20:03
@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Code Review

Critical

None.

Important

Finding 1 — Probe can falsely skip slow reasoning models due to full-weight reasoning timeout
Severity: medium · Confidence: 87
Files: packages/coding-agent/src/core/tools/subagent.ts, packages/coding-agent/src/core/thinking.ts

The new probe uses resolveEffectiveThinkingLevel(model, undefined), which defaults reasoning-capable models to the normal xhigh thinking level, and then waits for completeSimple() to finish within the existing ten-second probe timeout. For slow reasoning models, even a trivial prompt can exceed that budget, causing an available model to be marked unavailable with a timeout and skipped in the fallback chain. Consider capping probe reasoning to minimal or low while still using the normal streamSimple path and avoiding the old disabled-reasoning/max-token shape.

Finding 2 — New thinking.ts helper behavior is not directly unit-tested
Severity: medium · Confidence: 90
File: packages/coding-agent/src/core/thinking.ts

resolveEffectiveThinkingLevel() and thinkingLevelToReasoning() are new exported helpers, but they are only covered indirectly through subagent probe tests. Important paths such as an undefined model, a custom defaultThinkingLevel, and non-default reasoning levels are not directly asserted. Add focused unit tests for these helpers so future changes cannot silently break the shared thinking clamp.

Finding 3 — createAgentSession initialization clamp is not tested at the SDK level
Severity: medium · Confidence: 85
File: packages/coding-agent/src/core/sdk.ts

The PR refactors the session initialization clamp from an inline guard to resolveEffectiveThinkingLevel(model, thinkingLevel), but there is no SDK-level test asserting that createAgentSession() clamps a non-reasoning model to thinkingLevel: "off". Existing model-switch tests cover a separate runtime clamp path. Add a createAgentSession test with an explicit high thinking level and a non-reasoning model, asserting the resulting session uses off.

Suggestions

Finding 4 — OpenAI Responses reasoning probe coverage is partly mock-only
Severity: low · Confidence: 82
File: packages/coding-agent/test/subagent-model-fallback.test.ts

The subagent regression test for an OpenAI Responses reasoning model mocks completeSimple, so it verifies the probe options but not the provider payload. This is partially mitigated by the new provider-level streamSimple payload test, but that test uses the GitHub Copilot OpenAI Responses path. Consider adding a provider-level streamSimple payload test with a plain provider: "openai" OpenAI Responses reasoning model to make the chain complete.

Finding 5 — resolveEffectiveThinkingLevel can be simplified
Severity: low · Confidence: 92
File: packages/coding-agent/src/core/thinking.ts

The local effectiveThinkingLevel variable only names thinkingLevel ?? defaultThinkingLevel inside a tiny helper. Inlining it in the return expression would keep behavior identical and make the function more concise.

Finding 6 — thinkingLevelToReasoning may be unnecessary as a single-use export
Severity: low · Confidence: 80
Files: packages/coding-agent/src/core/thinking.ts, packages/coding-agent/src/core/tools/subagent.ts

thinkingLevelToReasoning() is a single ternary and appears to be used only by the subagent probe. Consider collapsing that conversion at the call site, or making the helper non-exported until another caller needs it, to avoid expanding the exported helper surface prematurely.

Strengths

  • The core path change from complete() to completeSimple() removes the low-level provider path that caused the original false rejection.
  • The probe no longer passes maxTokens: 1, and tests assert this directly.
  • Shared thinking clamp logic avoids duplicating provider/model reasoning defaults in the probe.
  • Existing fallback, abort, timeout, and parent-model fallback semantics are preserved and tested.
  • The OpenAI Responses payload test captures the real request payload and verifies normal output-token defaults and reasoning options.
  • Documentation was updated to remove the outdated one-token probe description.

Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker, simplifier


Reviewed by mach6

@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Review Assessment

Review comment: #216 (comment)

Additional manual validation note: the TUI path was confirmed working after the fix.

Classifications

Finding Classification Reasoning
finding 1: Probe can falsely skip slow reasoning models due to full-weight reasoning timeout genuine Confirmed. probeModelAvailability() uses resolveEffectiveThinkingLevel(model, undefined), which defaults reasoning models to xhigh, then waits for completeSimple() under the existing ten-second timeout. A working slow reasoning model can time out and be skipped as unavailable.
finding 2: New thinking.ts helper behavior is not directly unit-tested genuine Confirmed. The new exported helpers are only covered indirectly through mocked probe tests; focused coverage is missing for undefined model, custom default level, and non-default reasoning levels.
finding 3: createAgentSession initialization clamp is not tested at the SDK level genuine Confirmed. The refactored initialization clamp in createAgentSession() is not covered by an SDK-level test with a non-reasoning model and explicit high thinking level; existing tests cover a separate runtime model-switch path.
finding 4: OpenAI Responses reasoning probe coverage is partly mock-only false-positive The subagent probe test is mocked, but the alleged missing plain OpenAI provider payload test already exists: the streamSimple applies reasoning defaults without synthetic one-token cap test constructs an api: "openai-responses", provider: "openai" reasoning model and asserts normal reasoning plus max_output_tokens: 32000, not 1.
finding 5: resolveEffectiveThinkingLevel can be simplified nitpick True but stylistic. Inlining the local variable would not affect behavior or correctness.
finding 6: thinkingLevelToReasoning may be unnecessary as a single-use export nitpick True but stylistic/API-surface preference. The helper is currently single-use, but keeping it does not create a correctness issue.

Action Plan

  1. Fix the probe false-negative risk so availability probes do not skip working reasoning models solely because an xhigh response exceeds the hardcoded timeout, while preserving the required completeSimple / streamSimple request path.
  2. Add focused unit tests for resolveEffectiveThinkingLevel() and thinkingLevelToReasoning(), covering undefined model, custom default level, non-reasoning clamp to off, and non-default reasoning levels.
  3. Add an SDK-level createAgentSession() test using a non-reasoning model plus explicit high thinking level, asserting the session starts with thinkingLevel: "off".

Assessment by mach6

@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Progress Update

Fixed review findings 1-3.

Changes in this batch:

  • Increased the default subagent model availability probe timeout from 10 seconds to a named 120-second default, preserving the completeSimple / streamSimple probe path, normal reasoning defaults, and caller-provided timeout overrides.
  • Added fake-timer regression coverage proving the named default probe timeout is used without waiting in real time.
  • Added focused unit tests for resolveEffectiveThinkingLevel() and thinkingLevelToReasoning() covering undefined models, custom defaults, non-reasoning clamps, and reasoning level passthrough.
  • Added SDK-level createAgentSession() coverage verifying explicit high / xhigh thinking is clamped to off for non-reasoning models.

Verification run:

  • npx vitest --run packages/coding-agent/test/subagent-model-fallback.test.ts packages/coding-agent/test/thinking.test.ts packages/coding-agent/test/sdk-session-manager.test.ts
  • npx biome check packages/coding-agent/src/core/tools/subagent.ts packages/coding-agent/test/subagent-model-fallback.test.ts packages/coding-agent/test/thinking.test.ts packages/coding-agent/test/sdk-session-manager.test.ts
  • npm run build
  • npm run verify-workspace-links
  • npm test
  • Commit hook checks passed before commit

Commit: 8170da3


Progress tracked by mach6

@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Code Review

Critical

None.

Important

Finding 1 — resolveEffectiveThinkingLevel default-arg call path untested in thinking.test.ts
Severity: medium · Confidence: 90
File: packages/coding-agent/test/thinking.test.ts

The production call in probeModelAvailability is resolveEffectiveThinkingLevel(model, undefined) — two arguments, relying on the DEFAULT_THINKING_LEVEL module-level constant ("xhigh") as the third-parameter default. thinking.test.ts only tests the explicit-default-override path (resolveEffectiveThinkingLevel(reasoningModel, undefined, "low")). There's no test for the no-third-arg form that exercises the actual default import. If DEFAULT_THINKING_LEVEL were broken or changed, the unit test wouldn't catch it.

Finding 2 — SDK session-restore path with non-reasoning model is untested
Severity: medium · Confidence: 85
File: packages/coding-agent/test/sdk-session-manager.test.ts

sdk.ts has two thinking-level paths: (a) fresh session with explicit thinkingLevel, and (b) session restore where thinkingLevel is loaded from stored session data then clamped. The new SDK test only covers path (a) by passing thinkingLevel: "high" explicitly. The untested scenario: an ongoing session with a stored high thinking level is reloaded with a non-reasoning model — the restored level should be clamped to "off" by resolveEffectiveThinkingLevel.

Suggestions

Finding 3 — "xhigh" missing from resolveEffectiveThinkingLevel test.each
Severity: low · Confidence: 85
File: packages/coding-agent/test/thinking.test.ts

The test.each for "reasoning model preserves explicit level" covers ["minimal", "low", "medium", "high"] but omits "xhigh", which is the default probe value and a valid AgentThinkingLevel. The thinkingLevelToReasoning test.each correctly includes "xhigh", making the omission inconsistent.

Finding 4 — DEFAULT_MODEL_AVAILABILITY_PROBE_TIMEOUT_MS value never pinned
Severity: low · Confidence: 80
File: packages/coding-agent/test/subagent-model-fallback.test.ts

The "uses the named default timeout" test verifies the mechanism (default timeout = constant) but never asserts the constant's value. If the constant were changed from 120_000 to 30_000, the test would still pass. A one-line expect(DEFAULT_MODEL_AVAILABILITY_PROBE_TIMEOUT_MS).toBe(120_000) would pin it.

Finding 5 — Unnecessary intermediate variable in resolveEffectiveThinkingLevel
Severity: low · Confidence: 92
File: packages/coding-agent/src/core/thinking.ts

const effectiveThinkingLevel = thinkingLevel ?? defaultThinkingLevel; is assigned and consumed on the very next line. Inlining as return model?.reasoning ? (thinkingLevel ?? defaultThinkingLevel) : "off"; is equivalently clear and more concise.

Strengths

  • Core path change from complete() to completeSimple() cleanly removes the root cause without introducing new complexity.
  • Shared thinking helpers (thinking.ts) avoid duplicating provider/model reasoning defaults and are used by both the session init and the probe.
  • Error handling is comprehensive — all failure modes surface { ok: false } with actionable reason strings.
  • Regression tests directly assert the two sides of the issue 215 bug (no maxTokens, correct reasoning defaults).
  • OpenAI Responses provider-level test validates actual request payload shape with normal max_output_tokens: 32000.
  • Documentation accurately describes the new behavior.
  • All issue 215 acceptance criteria are fully met.

Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker, simplifier


Reviewed by mach6

@m-aebrer
Copy link
Copy Markdown
Collaborator Author

Review Assessment

#216 (comment)

Classifications

Finding Classification Reasoning
finding 1: default-arg call path untested in thinking.test.ts false-positive The two-argument production path is covered in subagent-model-fallback.test.ts where probeModelAvailability() calls the helper and tests assert reasoning: "xhigh". The reviewer missed that integration coverage.
finding 2: SDK session-restore path with non-reasoning model untested genuine sdk.ts restores thinkingLevel from stored session data when hasExistingSession is true, then clamps via resolveEffectiveThinkingLevel. This restore-then-clamp path has no test with a non-reasoning model.
finding 3: "xhigh" missing from test.each nitpick Implementation is value-agnostic (thinkingLevel ?? defaultThinkingLevel) and "xhigh" is exercised via the default probe path and thinkingLevelToReasoning tests. Minor test polish.
finding 4: timeout constant value never pinned genuine The test imports the constant and advances timers by it, so it would still pass if the value regressed to a lower number. The intended 120s value is not pinned.
finding 5: unnecessary intermediate variable nitpick Stylistic. The local variable is clear and harmless.

Action Plan

  1. Pin the subagent probe default timeout value with expect(DEFAULT_MODEL_AVAILABILITY_PROBE_TIMEOUT_MS).toBe(120_000).
  2. Add an SDK session-restore regression test for a stored high/xhigh thinking level with a non-reasoning model, asserting session.thinkingLevel clamps to "off".

Assessment by mach6

@m-aebrer m-aebrer merged commit bf6a97c into master May 19, 2026
2 checks passed
@m-aebrer m-aebrer deleted the feature/issue-215-subagent-probes-normal-defaults branch May 19, 2026 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make subagent probes use normal model defaults

1 participant