fix(triage): defer instead of error on budget-exceeded (OPENHUMAN-TAURI-X)#1558
fix(triage): defer instead of error on budget-exceeded (OPENHUMAN-TAURI-X)#1558CodeGhost21 wants to merge 3 commits into
Conversation
…RI-X) When the OpenHuman backend rejected an inference call with "400 Bad Request — Budget exceeded — add credits to continue", classify_error fell through to ArmError::Fatal (it's neither 429 nor 5xx/transient). run_triage then returned Err, which the composio bus subscriber logged with tracing::error!, paging Sentry on every Composio trigger fire (108 occurrences in 5 days). Budget exhaustion is user-actionable, not a code bug. Route it through a new ArmError::BudgetExhausted variant that skips the cloud retry (same wall) and falls through to the local arm; if local is unavailable or also fails, return TriageOutcome::Deferred (Ok, not Err) with a budget-named reason so the bus logs warn, not error.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds detection and an ArmError::BudgetExhausted variant, maps budget-related cloud errors to it, and updates run_triage_with_arms to skip cloud retry on budget exhaustion and fall back to local or produce budget-specific deferred reasons. Tests cover classification and triage flows. ChangesBudget-Aware Triage Fallback
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/openhuman/agent/triage/evaluator_tests.rs (1)
367-450: ⚡ Quick winCover the retry-then-budget branch too.
These additions cover budget exhaustion on the first cloud call and with no local arm, but the separate branch where attempt
#1is retryable and attempt#2returnsBudgetExhaustedis still unexercised. A small counter-based regression test for that sequence would lock down the newCloudAfterRetrycontrol flow.As per coding guidelines, achieve ≥ 80% coverage on changed lines; PRs must pass
diff-covercheck viacargo-llvm-cov.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/triage/evaluator_tests.rs` around lines 367 - 450, Add a new tokio test (e.g., cloud_budget_exhausted_after_retry_skips_local) that uses mock_agent_run_turn with a counter: on first call return a retryable/transient Err (so the triage will schedule a cloud retry), on second call return the Budget exceeded error string used elsewhere; then call run_triage_with_arms(cloud_arm(), Some(local_arm()), &envelope()) and assert the outcome is a Decision with TriageResolutionPath::LocalFallback and run.used_local true, and assert the counter equals 2 to ensure we hit cloud twice (first transient then budget) but do not perform a third cloud retry; reference the same helpers used in existing tests (mock_agent_run_turn, run_triage_with_arms, cloud_arm, local_arm, envelope, TriageResolutionPath::LocalFallback) and mirror the assertion style from cloud_budget_exhausted_skips_retry_and_falls_to_local.src/openhuman/agent/triage/evaluator.rs (1)
175-208: ⚡ Quick winAdd correlation fields to the new budget/defer warnings.
These warn paths are now the main breadcrumb for budget-exhausted triage, but they only emit the message text. Please include stable identifiers from
envelopeplus the active arm/path here so an operator can tie the warning back to the triggering event without correlating adjacent logs manually.As per coding guidelines, use
log/tracingwith stable grep-friendly prefixes and correlation fields.Also applies to: 253-257
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/triage/evaluator.rs` around lines 175 - 208, Update the two tracing::warn! calls that handle Err(ArmError::BudgetExhausted(...)) (the one at the initial cloud attempt and the one inside the retry match) to include stable correlation fields from the envelope and the active resolution path; for example add envelope identifiers (e.g. envelope.id or envelope.request_id — use the actual stable id field name on the Envelope struct) and the path string (e.g. TriageResolutionPath::Cloud and TriageResolutionPath::CloudAfterRetry or a derived &str) as structured fields to the warn! macro so operators can grep/ correlate these budget-exhausted warnings to the triggering event and path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/triage/evaluator_tests.rs`:
- Around line 113-121: The test is moving the non-Copy ArmError twice by using
matches!(err, ...) and then matching err in the panic message; update the
assertion to avoid moving err by matching on a reference (e.g., use
matches!(&err, ArmError::BudgetExhausted(_)) and adjust the panic match to match
&err with patterns like ArmError::Retryable { .. } => "Retryable", etc.), or
alternatively compute a label string from err first (let label = match &err {
... }) and then call assert!(matches!(&err, ArmError::BudgetExhausted(_)),
"...", label = label) so err is not moved twice; apply this change around the
classify_error call and the assert block.
---
Nitpick comments:
In `@src/openhuman/agent/triage/evaluator_tests.rs`:
- Around line 367-450: Add a new tokio test (e.g.,
cloud_budget_exhausted_after_retry_skips_local) that uses mock_agent_run_turn
with a counter: on first call return a retryable/transient Err (so the triage
will schedule a cloud retry), on second call return the Budget exceeded error
string used elsewhere; then call run_triage_with_arms(cloud_arm(),
Some(local_arm()), &envelope()) and assert the outcome is a Decision with
TriageResolutionPath::LocalFallback and run.used_local true, and assert the
counter equals 2 to ensure we hit cloud twice (first transient then budget) but
do not perform a third cloud retry; reference the same helpers used in existing
tests (mock_agent_run_turn, run_triage_with_arms, cloud_arm, local_arm,
envelope, TriageResolutionPath::LocalFallback) and mirror the assertion style
from cloud_budget_exhausted_skips_retry_and_falls_to_local.
In `@src/openhuman/agent/triage/evaluator.rs`:
- Around line 175-208: Update the two tracing::warn! calls that handle
Err(ArmError::BudgetExhausted(...)) (the one at the initial cloud attempt and
the one inside the retry match) to include stable correlation fields from the
envelope and the active resolution path; for example add envelope identifiers
(e.g. envelope.id or envelope.request_id — use the actual stable id field name
on the Envelope struct) and the path string (e.g. TriageResolutionPath::Cloud
and TriageResolutionPath::CloudAfterRetry or a derived &str) as structured
fields to the warn! macro so operators can grep/ correlate these
budget-exhausted warnings to the triggering event and path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 300215d0-23b7-4fac-9958-b8081e96a775
📒 Files selected for processing (2)
src/openhuman/agent/triage/evaluator.rssrc/openhuman/agent/triage/evaluator_tests.rs
- Avoid double-move of `err` in the top_up/out_of_credits classify test by matching on `&err` and pre-computing a label string. - Add `cloud_budget_exhausted_on_retry_falls_through_to_local` to cover the path where cloud is transient on attempt tinyhumansai#1 and budget-exceeded on the retry — must skip a third cloud call and dispatch to local. - Carry envelope correlation fields (source / display_label / external_id) and the active resolution path on the two budget-exhausted warn! sites so operators can grep these back to the triggering event.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/triage/evaluator.rs`:
- Around line 233-240: The Deferred branch currently injects raw upstream error
text into TriageOutcome::Deferred.reason (see TriageOutcome::Deferred, variable
reason and match on cloud_budget_exhausted), violating the safe-to-log contract;
change reason to a stable, scrubbed message (e.g. "cloud retry exhausted; local
arm unavailable" or "cloud budget exhausted; local arm unavailable") without
interpolating err/cloud_err, and emit the original error details separately at
debug/trace level (e.g., tracing::debug! or log::debug!) with a fixed prefix
like "[rpc]" or "[triage]" and any correlation IDs; apply the same fix to the
other similar branch around the 255–268 region so no raw upstream error text
flows into Deferred.reason while operator details remain available in debug
logs.
- Around line 463-480: The phrase matching in is_inference_budget_exceeded
currently uses normalized.contains(...) which can match substrings inside words
(e.g., "stop updating" matching "top up"); change it to operate on token
windows: after computing normalized and splitting into Vec<String> words, for
each needle (split into needle_tokens) slide a window of the same length over
the normalized words and require an exact token-by-token match (contiguous
words) before returning true. Update the logic that iterates the needles (the
array of strings like "budget exceeded", "top up", etc.) to perform this
token-window containment check instead of using contains so only
whole-word/whole-phrase matches trigger BudgetExhausted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6b61012-47ce-4e85-ac65-6cad406c2e9f
📒 Files selected for processing (2)
src/openhuman/agent/triage/evaluator.rssrc/openhuman/agent/triage/evaluator_tests.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/agent/triage/evaluator_tests.rs
graycyrus
left a comment
There was a problem hiding this comment.
Review
Solid fix for a real production issue — the BudgetExhausted variant is well-designed, the skip-retry logic is correct, and the test coverage is thorough (5 new tests covering all budget paths including the retry-then-budget sequence). Correlation fields on the warn sites are a nice touch.
One actionable item below on the Deferred.reason field. The contains() matching in is_inference_budget_exceeded is actually safe — the normalization + whitespace collapse makes it effectively word-boundary matching (CodeRabbit's "stop updating" example doesn't actually match "top up" after normalization).
| let reason = match cloud_budget_exhausted { | ||
| Some(err) => format!("cloud budget exhausted; local arm unavailable: {err}"), | ||
| None => "cloud retry exhausted; local arm unavailable".to_string(), | ||
| }; |
There was a problem hiding this comment.
[major] Raw upstream error text leaks into Deferred.reason
This field gets logged via warn! on line 261 and may be surfaced in RPC responses. The pre-PR pattern uses stable, scrubbed copy ("cloud retry exhausted; local arm unavailable"). Splicing raw err in breaks the safe-to-log contract and could carry PII or sensitive upstream text.
Same issue at lines 255-258 where cloud_err and err are interpolated.
Suggestion — keep reason scrubbed, log the raw error separately at debug:
let reason = match cloud_budget_exhausted {
Some(_) => "cloud budget exhausted; local arm unavailable".to_string(),
None => "cloud retry exhausted; local arm unavailable".to_string(),
};And for the local-also-failed path:
let reason = match cloud_budget_exhausted {
Some(_) => "cloud budget exhausted; local arm failed".to_string(),
None => "cloud + local both failed".to_string(),
};
tracing::debug!(
cloud_err = ?cloud_budget_exhausted,
local_err = %err,
"[triage::evaluator] deferred — raw error details"
);- Deferred.reason is documented as scrubbed/safe-to-log but the
no-local and both-failed branches were interpolating raw upstream
`err` / `cloud_err` text. Replace with stable strings ("cloud budget
exhausted; local arm unavailable", etc.) and route the original
errors to debug/warn logs with the [triage::evaluator] target plus
envelope correlation fields.
- is_inference_budget_exceeded used String::contains, which fires on
cross-word substrings (e.g. "stop updating" matches "top up").
Switch to token-window matching against the normalized word list so
only whole-word/whole-phrase needles trigger BudgetExhausted.
- Add regression test covering the substring false-positive.
Summary
[composio][triage] run_triage failedpaging on every Composio trigger fire when the user is out of inference credits (108 occurrences over 5 days).ArmError::BudgetExhaustedvariant so the triage evaluator treats budget/top-up/out-of-credits responses as user-actionable instead of fatal: skip the cloud retry (same wall), fall through to local, and if local is unavailable or also fails returnTriageOutcome::Deferred(Ok, notErr). The composio bus subscriber therefore logswarn, nevererror— no Sentry page.is_inference_budget_exceeded_errorinchannels/providers/web.rs(budget exceeded / top up / add credits / out of credits / no remaining credits) and is evaluated inline to avoid a cross-domain dep on the web-channel module.Root cause
The backend rejected the inference call with
400 Bad Request — Budget exceeded — add credits to continue.classify_erroronly recognises 429 and 5xx/timeout/connection patterns as retryable, so a 400 fell through toArmError::Fatal.run_triagereturnedErr, andcomposio/bus.rs:307logged it withtracing::error!, which Sentry's tracing layer ships as an error event. Every webhook fire that arrived while the user was out of credits paged.Test plan
cargo test openhuman::agent::triage::evaluator— 18 passed (15 existing + 3 new)classify_string_recognises_budget_exceeded_as_budget_exhausted— exact Sentry payload classifies asBudgetExhausted, notFatalclassify_string_recognises_top_up_and_out_of_credits_as_budget_exhausted— sibling phrasings coveredcloud_budget_exhausted_skips_retry_and_falls_to_local— only 2 dispatches (no cloud retry); local produces the decisioncloud_budget_exhausted_without_local_returns_deferred_not_err— defers with budget-named reason; no cloud retrycargo checkclean;cargo fmt --checkcleanSummary by CodeRabbit
Bug Fixes
Tests