feat(run-eval): add router-classified-3tier MODELS entry + recursive preflight#3636
feat(run-eval): add router-classified-3tier MODELS entry + recursive preflight#3636juanmichelini wants to merge 2 commits into
router-classified-3tier MODELS entry + recursive preflight#3636Conversation
…preflight Companion change to OpenHands/benchmarks#742 (intelligent per-instance model routing). With this PR the SDK can dispatch a router-shaped llm_config to the evaluation pipeline; the benchmarks side already understands the intelligent-router-v0 shape and will classify each instance and route to the matching tier model. Changes: - New MODELS entry 'router-classified-3tier' (classifier=minimax-m2.7, tiers={kimi-k2.6, minimax-m2.7, gpt-5.5}, default iter5 routing). - New helpers ROUTER_CONFIG_KIND and is_router_config(). - check_model() now detects router entries and recurses into each tier sub-model, aggregating success/failure. - Pydantic validator in tests learns about RouterLLMConfig and the registry's llm_config is now 'RouterLLMConfig | LLMConfig'. - 14 new tests covering the new entry, is_router_config, and recursive preflight. Note: the matching OpenHands/evaluation change to eval-job/scripts/build_matrix.py (handle no-top-level-model router entries when deriving the GCS slug) is required for end-to-end dispatch and will be opened separately. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: router-classified-3tier
Taste Rating
🟢 Good taste - Clean implementation with minimal complexity.
Analysis
This PR adds intelligent per-instance model routing. The design is sound: router config discriminator separates router entries from plain model entries, check_model recurses into tier sub-models during preflight, and pydantic validation catches internal consistency errors at test-time.
What works well:
- is_router_config() is a clean, side-effect-free predicate
- _check_router_tiers aggregates results cleanly without duplicating logic
- RouterLLMConfig model validator enforces reference consistency
- 14 new tests cover key paths with appropriate mocking
Style Notes (minor):
- Block comment (~Line 440-455) explaining routing table is verbose - the table is self-evident from the code
- Comment referencing build_matrix.py (~Line 572) may drift since that code is out-of-scope per PR
Risk Assessment: 🟢 LOW
Pure additive change. Existing plain-model paths unchanged. Pydantic union is backward-compatible.
Verdict
✅ Worth merging - Core logic sound, tests comprehensive, design extensible.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
The router entry and recursive preflight behavior work locally, but the real Run Eval resolver CLI currently aborts because the live proxy rejects the kimi-k2.6 tier.
Does this PR achieve its stated goal?
Partially. The PR does add router-classified-3tier and changes preflight from the old model=unknown failure mode into recursive tier validation; I verified that against a local OpenAI-compatible endpoint with real litellm HTTP calls. However, exercising the actual resolver CLI as the workflow would (MODEL_IDS=router-classified-3tier) fails preflight against the default live proxy because moonshot/kimi-k2.6 is rejected, so the new model is not currently dispatch-ready in this environment.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run created/used the project environment and the resolver executed successfully. |
| CI Status | 🟡 At refresh: 22 successful checks, 6 in progress, 3 skipped. I did not run tests/linters locally. |
| Functional Verification |
Functional Verification
Test 1: Model resolution before/after
Step 1 — Establish baseline on origin/main:
Ran a short user-style resolver invocation for find_models_by_id(["router-classified-3tier"]):
has_router_entry= False
ERROR: Model ID 'router-classified-3tier' not found. Available models: ...
find_models_by_id_ok= False
SystemExit 1
This confirms the base branch cannot dispatch this model id at all.
Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9.
Step 3 — Re-run with the fix in place:
Ran the same resolver flow:
type= list
[
{
"display_name": "Router (3-tier, classifier=minimax-m2.7)",
"id": "router-classified-3tier",
"llm_config": {
"classifier_model_id": "minimax-m2.7",
"fallback_model_id": "gpt-5.5",
"kind": "intelligent-router-v0",
"tiers": { ... },
"vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"]
}
}
]
This confirms the new model id resolves to a router-shaped payload with no top-level model.
Test 2: Recursive preflight behavior before/after
Step 1 — Establish baseline on origin/main:
Ran check_model() on a router-shaped config:
success= False
✗ Router Test: Bad request - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=unknown
This confirms the old code path treated router configs as plain configs and tried model=unknown.
Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9 and started an in-process local OpenAI-compatible HTTP endpoint.
Step 3 — Re-run with the fix in place:
Ran check_model() on the real router-classified-3tier entry using the local endpoint:
has_router_entry= True
resolved_ids= ['router-classified-3tier']
is_router_config= True
top_level_model_present= False
tier_ids= ['gpt-5.5', 'kimi-k2.6', 'minimax-m2.7']
preflight_success= True
Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))
Captured HTTP requests from litellm:
[
{"path": "/chat/completions", "model": "moonshot/kimi-k2.6", "temperature": 1.0, "top_p": null, "reasoning_effort": null},
{"path": "/chat/completions", "model": "minimax/MiniMax-M2.7", "temperature": 1.0, "top_p": 0.95, "reasoning_effort": null},
{"path": "/chat/completions", "model": "openai/gpt-5.5", "temperature": null, "top_p": null, "reasoning_effort": "high"}
]This confirms recursive preflight now hits each tier and forwards the per-tier parameters.
Test 3: Actual workflow-style CLI execution against the live proxy
Step 1 — Run the actual resolver CLI for the new model:
Ran:
LLM_API_KEY="$LLM_API_KEY" LITELLM_API_KEY="$LLM_API_KEY" OPENAI_API_KEY="$LLM_API_KEY" MODEL_IDS=router-classified-3tier GITHUB_OUTPUT=/tmp/resolve_model_config_output.txt uv run python .github/run-eval/resolve_model_config.pyObserved:
Resolved 1 model(s): router-classified-3tier
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Preflight LLM check for 1 model(s)...
Checking Router (3-tier, classifier=minimax-m2.7)...
Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
✗ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✗ Router (3-tier, classifier=minimax-m2.7): one or more tiers failed
✗ Some models failed preflight check
ERROR: Preflight LLM check failed
exit_code=1
--- GITHUB_OUTPUT ---
(missing)
This shows the real workflow-style dispatch path currently aborts before producing GITHUB_OUTPUT.
Step 2 — Compare the underlying plain tier:
Ran the same CLI for MODEL_IDS=kimi-k2.6:
Resolved 1 model(s): kimi-k2.6
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Checking Kimi K2.6...
✗ Kimi K2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
ERROR: Preflight LLM check failed
exit_code=1
This suggests the recursion itself is working correctly, but the kimi-k2.6 tier is not currently usable through the live proxy credentials/environment I exercised.
Issues Found
- 🟠 Issue:
MODEL_IDS=router-classified-3tieris not currently dispatch-ready against the live default proxy because thekimi-k2.6tier fails preflight withInvalid model name passed in model=moonshot/kimi-k2.6. The plainkimi-k2.6entry fails the same way, so this looks like a proxy provisioning/model-name issue rather than a recursion bug, but it still blocks the PR’s stated dispatch-readiness goal.
Automated QA review generated by an AI agent (OpenHands) on behalf of the requester.
| "fallback_model_id": "gpt-5.5", | ||
| "tiers": { | ||
| "kimi-k2.6": { | ||
| "model": "litellm_proxy/moonshot/kimi-k2.6", |
There was a problem hiding this comment.
🟠 Important: I exercised the actual resolver CLI with MODEL_IDS=router-classified-3tier against the live default proxy using the available LLM credentials. Preflight recursed correctly, but this tier failed with Invalid model name passed in model=moonshot/kimi-k2.6; running the plain MODEL_IDS=kimi-k2.6 entry failed the same way. Until the proxy/model name is provisioned or this tier is changed to a reachable model, the new router model aborts before writing GITHUB_OUTPUT, so the dispatching end is not fully ready.
Automated QA finding generated by an AI agent (OpenHands) on behalf of the requester.
Summary
Companion PR to
OpenHands/benchmarks#742(per-instance intelligent model routing for the 5 default-agent benchmarks). That PR makes the receiving end ready: when a benchmark's--llm-config-pathpoints at anintelligent-router-v0JSON, each instance is classified once and the agent conversation is routed to the matching tier model.This PR makes the dispatching end ready:
resolve_model_config.MODELSnow contains arouter-classified-3tierentry whosellm_configis exactly that router payload, andcheck_model(preflight) knows how to recurse into the tier sub-models.After both PRs land, dispatching
Run Evalwithmodel_ids=router-classified-3tierwill produce a run that routes per instance instead of running a single model end-to-end. Until then the entry is dormant on the SDK side and harmless to existing flows.What's in this PR
.github/run-eval/resolve_model_config.pyrouter-classified-3tier; new helpersROUTER_CONFIG_KINDandis_router_config();check_model()now detects router entries and recurses into each tier sub-model via a new_check_router_tiers()helper..github/run-eval/ADDINGMODEL.mdtests/cross/test_resolve_model_config.pyRouterLLMConfigpydantic validator (mirrorsLLMConfig);EvalModelConfig.llm_configis nowRouterLLMConfig | LLMConfig; 14 new tests covering the registry entry, the predicate, and the recursive preflight.Total: +412 / −3 across 3 files.
The new MODELS entry
Each tier sub-config is byte-identical to the matching plain MODELS entry (
kimi-k2.6,minimax-m2.7,gpt-5.5), so all proxy provisioning that already works for those models keeps working here. The classifier reusesminimax-m2.7, exactly mirroringOpenHands/benchmarks's sample router config.Preflight: recursing into tier sub-models
A router payload has no top-level
"model"— so the existingcheck_modelwould have calledlitellm.completion(model="unknown", …)and failed in a confusing way. The new shape:_check_router_tiersrunscheck_modelon each tier sub-model and aggregates the result. Per-entry output stays a one-liner in the preflight summary, with indented per-tier diagnostics directly underneath:If any tier fails (provisioning, parameter shape, etc.) the aggregate fails and the per-tier failure line is surfaced so the cause is obvious from the workflow log.
Pydantic validator update
tests/cross/test_resolve_model_config.pyalready enforces that every MODELS entry validates againstEvalModelConfig. Without the router shape that test fails for the new entry because router payloads have nomodelfield. The fix is a newRouterLLMConfig(parallelsLLMConfig) andEvalModelConfig.llm_config: RouterLLMConfig | LLMConfig. Pydantic union resolution picksRouterLLMConfigfor payloads carryingkind: "intelligent-router-v0"andLLMConfigotherwise. Existing models are unaffected.RouterLLMConfigadditionally enforces internal consistency:classifier_model_id,fallback_model_id, everyroutingtarget, and everyvision_capable_model_idsentry must all be keys intiers. This catches typos at test-time instead of at run-time.New tests (14)
TestRouterClassified3Tier(5): the entry is router-shaped, refs are consistent, every tier is a validlitellm_proxy/…config, the iter5 5-category routing table is complete, the payload satisfiesRouterLLMConfig.TestIsRouterConfig(6): plain configs, missing-kind, missing-tiers, wrong-kind, canonical-payload, non-dict inputs.TestCheckModelRouterRecursion(4): all tiers succeed → router passes (withlitellm.completioncalled once per tier andmodel=correctly forwarded); one tier failure → router fails; emptytiersshort-circuits without ever calling litellm; per-tier parameters (temperature,top_p) are forwarded correctly.All tests use the existing
litellm.completion-mock pattern fromTestTestModel; no real network calls.Verification
uv run ruff format .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py— cleanuv run ruff check .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py—All checks passed!uv run pyright .github/run-eval/resolve_model_config.py—0 errors, 0 warnings, 0 informationsuv run pytest tests/cross/test_resolve_model_config.py— 58 passed (44 pre-existing + 14 new), 0 failed.find_models_by_id(["router-classified-3tier"])returns the full routerllm_configas themodels_jsonpayload that would be passed downstream.Out of scope (will be a separate PR)
The matching change to
OpenHands/evaluation/eval-job/scripts/build_matrix.pyis still needed for end-to-end dispatch. That script currently derives the GCS artifact slug fromllm_config["model"]and will exit withERROR: llm_config missing 'model'when handed a router payload. It needs to detectis_router_config(llm_config), fall back to deriving the slug from the entry'sid(e.g."router-classified-3tier"→"router-classified-3tier"), and otherwise pass thellm_configthrough to the benchmark untouched. That's a one-file change I can put up next; opening it separately to keep the two reviews independent.How to test end-to-end after the matching
evaluationPR landsRun Evalwithmodel_ids=router-classified-3tier,benchmark=swebench,eval_limit=10.metadata.routingis non-null in the resultingresults.tar.gz(vs.nullin the gpt-5.4 run we just looked at).benchmarks.utils.intelligent_routinglogger) likeintelligent-routing instance=… category=Frontend model=kimi-k2.6 ….output.jsonl[*].metrics.costs[*].modelcontains a mix of the three tier model strings instead of a single repeated value.This PR was prepared by an AI agent (OpenHands) on behalf of @juanmichelini.
@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:dc25347-pythonRun
All tags pushed for this build
About Multi-Architecture Support
dc25347-python) is a multi-arch manifest supporting both amd64 and arm64dc25347-python-amd64) are also available if needed