feat(run-eval): add `router-classified-3tier` MODELS entry + recursive preflight by juanmichelini · Pull Request #3636 · OpenHands/software-agent-sdk

juanmichelini · 2026-06-10T17:03:16Z

Summary

Companion PR to OpenHands/benchmarks#742 (per-instance intelligent model routing for the 5 default-agent benchmarks). That PR makes the receiving end ready: when a benchmark's --llm-config-path points at an intelligent-router-v0 JSON, each instance is classified once and the agent conversation is routed to the matching tier model.

This PR makes the dispatching end ready: resolve_model_config.MODELS now contains a router-classified-3tier entry whose llm_config is exactly that router payload, and check_model (preflight) knows how to recurse into the tier sub-models.

After both PRs land, dispatching Run Eval with model_ids=router-classified-3tier will produce a run that routes per instance instead of running a single model end-to-end. Until then the entry is dormant on the SDK side and harmless to existing flows.

What's in this PR

File	Purpose
`.github/run-eval/resolve_model_config.py`	New MODELS entry `router-classified-3tier`; new helpers `ROUTER_CONFIG_KIND` and `is_router_config()`; `check_model()` now detects router entries and recurses into each tier sub-model via a new `_check_router_tiers()` helper.
`.github/run-eval/ADDINGMODEL.md`	New "Two kinds of MODELS entries" section documenting the plain-vs-router distinction and pointing at the canonical router entry.
`tests/cross/test_resolve_model_config.py`	New `RouterLLMConfig` pydantic validator (mirrors `LLMConfig`); `EvalModelConfig.llm_config` is now `RouterLLMConfig | LLMConfig`; 14 new tests covering the registry entry, the predicate, and the recursive preflight.

Total: +412 / −3 across 3 files.

The new MODELS entry

"router-classified-3tier": {
    "id": "router-classified-3tier",
    "display_name": "Router (3-tier, classifier=minimax-m2.7)",
    "llm_config": {
        "kind": "intelligent-router-v0",
        "classifier_model_id": "minimax-m2.7",
        "fallback_model_id": "gpt-5.5",
        "tiers": {
            "kimi-k2.6":     {"model": "litellm_proxy/moonshot/kimi-k2.6",     "temperature": 1.0, "inline_image_urls": True},
            "minimax-m2.7":  {"model": "litellm_proxy/minimax/MiniMax-M2.7",   "temperature": 1.0, "top_p": 0.95},
            "gpt-5.5":       {"model": "litellm_proxy/openai/gpt-5.5",         "reasoning_effort": "high"},
        },
        "routing": {
            "Frontend":                 "kimi-k2.6",
            "Issue Resolution (other)": "minimax-m2.7",
            "Greenfield":               "gpt-5.5",
            "Testing":                  "gpt-5.5",
            "Information Gathering":    "gpt-5.5",
        },
        "vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"],
    },
},

Each tier sub-config is byte-identical to the matching plain MODELS entry (kimi-k2.6, minimax-m2.7, gpt-5.5), so all proxy provisioning that already works for those models keeps working here. The classifier reuses minimax-m2.7, exactly mirroring OpenHands/benchmarks's sample router config.

Preflight: recursing into tier sub-models

A router payload has no top-level "model" — so the existing check_model would have called litellm.completion(model="unknown", …) and failed in a confusing way. The new shape:

def check_model(model_config, api_key, base_url, timeout=60):
    llm_config = model_config.get("llm_config", {})
    if is_router_config(llm_config):
        return _check_router_tiers(model_config, api_key, base_url, timeout)
    # ... existing plain-model code path, unchanged

_check_router_tiers runs check_model on each tier sub-model and aggregates the result. Per-entry output stays a one-liner in the preflight summary, with indented per-tier diagnostics directly underneath:

  Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
  ✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))

If any tier fails (provisioning, parameter shape, etc.) the aggregate fails and the per-tier failure line is surfaced so the cause is obvious from the workflow log.
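The recursion and aggregation described above can be sketched without litellm. This is a minimal illustration, not the PR's code: `check_entry` and the injected `probe` callable are hypothetical names standing in for `check_model` and the real `litellm.completion` call, and treating an empty `tiers` mapping as a failure is an assumption (the PR only states that this case short-circuits without calling litellm).

```python
from typing import Any, Callable

ROUTER_CONFIG_KIND = "intelligent-router-v0"

# Hypothetical probe: takes one plain llm_config, returns (ok, detail message).
Probe = Callable[[dict[str, Any]], tuple[bool, str]]


def check_entry(model_config: dict[str, Any], probe: Probe) -> tuple[bool, str]:
    """Preflight one registry entry, recursing into tiers for router payloads."""
    name = model_config.get("display_name", model_config.get("id", "unknown"))
    llm_config = model_config.get("llm_config", {})

    if llm_config.get("kind") != ROUTER_CONFIG_KIND:
        # Plain entry: a single probe call decides the result.
        ok, detail = probe(llm_config)
        return ok, f"{'✓' if ok else '✗'} {name}: {detail}"

    tiers = llm_config.get("tiers") or {}
    if not tiers:
        # Assumption: an empty router fails fast without touching the backend.
        return False, f"✗ {name}: router config has no tiers"

    lines = [f"{name}: validating {len(tiers)} tier model(s)..."]
    all_ok = True
    for tier_id, tier_config in tiers.items():
        # Each tier is a plain config, so the same function handles it.
        ok, line = check_entry(
            {"display_name": f"{name} :: {tier_id}", "llm_config": tier_config},
            probe,
        )
        all_ok = all_ok and ok
        lines.append("  " + line)

    summary = f"OK ({len(tiers)} tier(s))" if all_ok else "one or more tiers failed"
    lines.append(f"{'✓' if all_ok else '✗'} {name}: {summary}")
    return all_ok, "\n".join(lines)
```

Every tier is probed even after one fails, so the report lists all broken tiers in a single run rather than only the first.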

Pydantic validator update

tests/cross/test_resolve_model_config.py already enforces that every MODELS entry validates against EvalModelConfig. Without the router shape that test fails for the new entry because router payloads have no model field. The fix is a new RouterLLMConfig (parallels LLMConfig) and EvalModelConfig.llm_config: RouterLLMConfig | LLMConfig. Pydantic union resolution picks RouterLLMConfig for payloads carrying kind: "intelligent-router-v0" and LLMConfig otherwise. Existing models are unaffected.

RouterLLMConfig additionally enforces internal consistency: classifier_model_id, fallback_model_id, every routing target, and every vision_capable_model_ids entry must all be keys in tiers. This catches typos at test-time instead of at run-time.
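The consistency rule can be illustrated with a dependency-free sketch. The PR implements it as a pydantic validator on `RouterLLMConfig`; the function below, `find_dangling_refs`, is a hypothetical stdlib stand-in that applies the same rule to a raw dict and is not taken from the PR.

```python
from typing import Any


def find_dangling_refs(llm_config: dict[str, Any]) -> list[str]:
    """Describe every reference in a router payload that is not a key in `tiers`."""
    tier_ids = set(llm_config.get("tiers", {}))

    # Collect (field label, referenced tier id) pairs from all four places.
    refs: list[tuple[str, str]] = [
        ("classifier_model_id", llm_config.get("classifier_model_id", "")),
        ("fallback_model_id", llm_config.get("fallback_model_id", "")),
    ]
    refs += [
        (f"routing[{category!r}]", target)
        for category, target in llm_config.get("routing", {}).items()
    ]
    refs += [
        (f"vision_capable_model_ids[{i}]", model_id)
        for i, model_id in enumerate(llm_config.get("vision_capable_model_ids", []))
    ]

    return [
        f"{field} -> {target!r} is not a tier"
        for field, target in refs
        if target not in tier_ids
    ]
```

An empty return value means the payload is internally consistent; a typo such as routing `Frontend` to `kimi-k2.5` would show up as one entry naming the offending field.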

New tests (14)

TestRouterClassified3Tier (5): the entry is router-shaped, refs are consistent, every tier is a valid litellm_proxy/… config, the iter5 5-category routing table is complete, the payload satisfies RouterLLMConfig.
TestIsRouterConfig (6): plain configs, missing-kind, missing-tiers, wrong-kind, canonical-payload, non-dict inputs.
TestCheckModelRouterRecursion (4): all tiers succeed → router passes (with litellm.completion called once per tier and model= correctly forwarded); one tier failure → router fails; empty tiers short-circuits without ever calling litellm; per-tier parameters (temperature, top_p) are forwarded correctly.

All tests use the existing litellm.completion-mock pattern from TestTestModel; no real network calls.
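The TestIsRouterConfig cases listed above suggest a predicate along the following lines. The names `ROUTER_CONFIG_KIND` and `is_router_config` come from the PR, but the body is a guess reconstructed from the test descriptions; in particular, requiring `tiers` to be a dict (rather than merely present) is an assumption.

```python
from typing import Any

ROUTER_CONFIG_KIND = "intelligent-router-v0"


def is_router_config(llm_config: Any) -> bool:
    """True only for dict payloads tagged with the router kind and carrying tiers."""
    return (
        isinstance(llm_config, dict)
        and llm_config.get("kind") == ROUTER_CONFIG_KIND
        # Assumption: `tiers` must be a mapping; an empty one still counts as a router.
        and isinstance(llm_config.get("tiers"), dict)
    )
```

Accepting an empty `tiers` mapping here keeps the "empty tiers short-circuits" case reachable in the preflight rather than silently treating such a payload as a plain model.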

Verification

uv run ruff format .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py — clean
uv run ruff check .github/run-eval/resolve_model_config.py tests/cross/test_resolve_model_config.py — All checks passed!
uv run pyright .github/run-eval/resolve_model_config.py — 0 errors, 0 warnings, 0 informations
uv run pytest tests/cross/test_resolve_model_config.py — 58 passed (44 pre-existing + 14 new), 0 failed.
Sanity-checked find_models_by_id(["router-classified-3tier"]) returns the full router llm_config as the models_json payload that would be passed downstream.

Out of scope (will be a separate PR)

The matching change to eval-job/scripts/build_matrix.py in OpenHands/evaluation is still needed for end-to-end dispatch. That script currently derives the GCS artifact slug from llm_config["model"] and will exit with ERROR: llm_config missing 'model' when handed a router payload. It needs to detect router payloads with is_router_config(llm_config), fall back to deriving the slug from the entry's id (here, "router-classified-3tier"), and otherwise pass the llm_config through to the benchmark untouched. That's a one-file change I can put up next; I'm opening it separately to keep the two reviews independent.
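A rough sketch of the slug fallback described above, under stated assumptions: `artifact_slug` is a hypothetical name, and the normalisation step is invented for illustration, since the actual slug rules in build_matrix.py are not shown in this PR. Only the branch order (router payload uses `id`, plain payload uses `model`, anything else is an error) follows the description.

```python
import re
from typing import Any

ROUTER_CONFIG_KIND = "intelligent-router-v0"


def artifact_slug(entry: dict[str, Any]) -> str:
    """Pick the string a slug is built from, then normalise it."""
    llm_config = entry.get("llm_config", {})
    if llm_config.get("kind") == ROUTER_CONFIG_KIND:
        # Router payloads have no top-level "model", so use the entry id.
        source = entry["id"]
    elif "model" in llm_config:
        source = llm_config["model"]
    else:
        raise ValueError("llm_config missing 'model'")
    # Hypothetical normalisation: keep [A-Za-z0-9._-], collapse the rest to "-".
    return re.sub(r"[^A-Za-z0-9._-]+", "-", source).strip("-")
```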

How to test end-to-end after the matching `evaluation` PR lands

Dispatch Run Eval with model_ids=router-classified-3tier, benchmark=swebench, eval_limit=10.
Check that metadata.routing is non-null in the resulting results.tar.gz (vs. null in the gpt-5.4 run we just looked at).
Confirm per-instance routing log lines (benchmarks.utils.intelligent_routing logger) like intelligent-routing instance=… category=Frontend model=kimi-k2.6 ….
Confirm output.jsonl[*].metrics.costs[*].model contains a mix of the three tier model strings instead of a single repeated value.
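The last check can be scripted. The sketch below assumes only the field path named above (`metrics.costs[*].model` on each line of output.jsonl); `count_models` is a hypothetical helper, and records with missing or null `metrics` are skipped rather than treated as errors.

```python
import json
from collections import Counter
from typing import Iterable


def count_models(jsonl_lines: Iterable[str]) -> Counter[str]:
    """Count how often each model string appears in metrics.costs[*].model."""
    counts: Counter[str] = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for cost in (record.get("metrics") or {}).get("costs") or []:
            model = cost.get("model")
            if model:
                counts[model] += 1
    return counts
```

An open file handle for output.jsonl can be passed directly, since file objects iterate line by line. A routed run should yield more than one key in the result, while a single-model run yields exactly one.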

This PR was prepared by an AI agent (OpenHands) on behalf of @juanmichelini.


Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:dc25347-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-dc25347-python \
  ghcr.io/openhands/agent-server:dc25347-python

All tags pushed for this build

ghcr.io/openhands/agent-server:dc25347-golang-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang-amd64
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:dc25347-golang-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang-arm64
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:dc25347-java-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java-amd64
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:dc25347-java-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java-arm64
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:dc25347-python-amd64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python-amd64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python-amd64
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:dc25347-python-arm64
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python-arm64
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python-arm64
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:dc25347-golang
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-golang
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-golang
ghcr.io/openhands/agent-server:dc25347-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:dc25347-java
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-java
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-java
ghcr.io/openhands/agent-server:dc25347-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:dc25347-python
ghcr.io/openhands/agent-server:dc25347887e8394255a699a36c4bf39e91a5b4b9-python
ghcr.io/openhands/agent-server:feat-router-classified-3tier-model-python
ghcr.io/openhands/agent-server:dc25347-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., dc25347-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., dc25347-python-amd64) are also available if needed

…preflight

Companion change to OpenHands/benchmarks#742 (intelligent per-instance model routing). With this PR the SDK can dispatch a router-shaped llm_config to the evaluation pipeline; the benchmarks side already understands the intelligent-router-v0 shape and will classify each instance and route to the matching tier model.

Changes:
- New MODELS entry 'router-classified-3tier' (classifier=minimax-m2.7, tiers={kimi-k2.6, minimax-m2.7, gpt-5.5}, default iter5 routing).
- New helpers ROUTER_CONFIG_KIND and is_router_config().
- check_model() now detects router entries and recurses into each tier sub-model, aggregating success/failure.
- Pydantic validator in tests learns about RouterLLMConfig and the registry's llm_config is now 'RouterLLMConfig | LLMConfig'.
- 14 new tests covering the new entry, is_router_config, and recursive preflight.

Note: the matching OpenHands/evaluation change to eval-job/scripts/build_matrix.py (handle no-top-level-model router entries when deriving the GCS slug) is required for end-to-end dispatch and will be opened separately.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-06-10T17:03:42Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED


github-actions · 2026-06-10T17:03:55Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED


all-hands-bot · 2026-06-13T00:33:01Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: router-classified-3tier

Taste Rating

🟢 Good taste - Clean implementation with minimal complexity.

Analysis

This PR adds the dispatch side of intelligent per-instance model routing. The design is sound: a `kind` discriminator separates router entries from plain model entries, check_model recurses into tier sub-models during preflight, and pydantic validation catches internal consistency errors at test-time.

What works well:

is_router_config() is a clean, side-effect-free predicate
_check_router_tiers aggregates results cleanly without duplicating logic
RouterLLMConfig model validator enforces reference consistency
14 new tests cover key paths with appropriate mocking

Style Notes (minor):

Block comment (~Line 440-455) explaining routing table is verbose - the table is self-evident from the code
Comment referencing build_matrix.py (~Line 572) may drift since that code is out-of-scope per PR

Risk Assessment: 🟢 LOW

Pure additive change. Existing plain-model paths unchanged. Pydantic union is backward-compatible.

Verdict

✅ Worth merging - Core logic sound, tests comprehensive, design extensible.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

⚠️ QA Report: PASS WITH ISSUES

The router entry and recursive preflight behavior work locally, but the real Run Eval resolver CLI currently aborts because the live proxy rejects the kimi-k2.6 tier.

Does this PR achieve its stated goal?

Partially. The PR does add router-classified-3tier and changes preflight from the old model=unknown failure mode into recursive tier validation; I verified that against a local OpenAI-compatible endpoint with real litellm HTTP calls. However, exercising the actual resolver CLI as the workflow would (MODEL_IDS=router-classified-3tier) fails preflight against the default live proxy because moonshot/kimi-k2.6 is rejected, so the new model is not currently dispatch-ready in this environment.

Phase	Result
Environment Setup	✅ `uv run` created/used the project environment and the resolver executed successfully.
CI Status	🟡 At refresh: 22 successful checks, 6 in progress, 3 skipped. I did not run tests/linters locally.
Functional Verification	⚠️ Resolver + recursion verified locally; live proxy preflight for the new router model fails.

Functional Verification

Test 1: Model resolution before/after

Step 1 — Establish baseline on origin/main:
Ran a short user-style resolver invocation for find_models_by_id(["router-classified-3tier"]):

has_router_entry= False
ERROR: Model ID 'router-classified-3tier' not found. Available models: ...
find_models_by_id_ok= False
SystemExit 1

This confirms the base branch cannot dispatch this model id at all.

Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9.

Step 3 — Re-run with the fix in place:
Ran the same resolver flow:

type= list
[
  {
    "display_name": "Router (3-tier, classifier=minimax-m2.7)",
    "id": "router-classified-3tier",
    "llm_config": {
      "classifier_model_id": "minimax-m2.7",
      "fallback_model_id": "gpt-5.5",
      "kind": "intelligent-router-v0",
      "tiers": { ... },
      "vision_capable_model_ids": ["kimi-k2.6", "gpt-5.5"]
    }
  }
]

This confirms the new model id resolves to a router-shaped payload with no top-level model.

Test 2: Recursive preflight behavior before/after

Step 1 — Establish baseline on origin/main:
Ran check_model() on a router-shaped config:

success= False
✗ Router Test: Bad request - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=unknown

This confirms the old code path treated router configs as plain configs and tried model=unknown.

Step 2 — Apply the PR changes:
Checked out dc25347887e8394255a699a36c4bf39e91a5b4b9 and started an in-process local OpenAI-compatible HTTP endpoint.

Step 3 — Re-run with the fix in place:
Ran check_model() on the real router-classified-3tier entry using the local endpoint:

has_router_entry= True
resolved_ids= ['router-classified-3tier']
is_router_config= True
top_level_model_present= False
tier_ids= ['gpt-5.5', 'kimi-k2.6', 'minimax-m2.7']
preflight_success= True
  Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✓ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✓ Router (3-tier, classifier=minimax-m2.7): OK (3 tier(s))

Captured HTTP requests from litellm:

[
  {"path": "/chat/completions", "model": "moonshot/kimi-k2.6", "temperature": 1.0, "top_p": null, "reasoning_effort": null},
  {"path": "/chat/completions", "model": "minimax/MiniMax-M2.7", "temperature": 1.0, "top_p": 0.95, "reasoning_effort": null},
  {"path": "/chat/completions", "model": "openai/gpt-5.5", "temperature": null, "top_p": null, "reasoning_effort": "high"}
]

This confirms recursive preflight now hits each tier and forwards the per-tier parameters.

Test 3: Actual workflow-style CLI execution against the live proxy

Step 1 — Run the actual resolver CLI for the new model:
Ran:

LLM_API_KEY="$LLM_API_KEY" LITELLM_API_KEY="$LLM_API_KEY" OPENAI_API_KEY="$LLM_API_KEY" \
  MODEL_IDS=router-classified-3tier \
  GITHUB_OUTPUT=/tmp/resolve_model_config_output.txt \
  uv run python .github/run-eval/resolve_model_config.py

Observed:

Resolved 1 model(s): router-classified-3tier
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
Preflight LLM check for 1 model(s)...
  Checking Router (3-tier, classifier=minimax-m2.7)...
    Router (3-tier, classifier=minimax-m2.7): validating 3 tier model(s)...
    ✗ Router (3-tier, classifier=minimax-m2.7) :: kimi-k2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
    ✓ Router (3-tier, classifier=minimax-m2.7) :: minimax-m2.7: OK
    ✓ Router (3-tier, classifier=minimax-m2.7) :: gpt-5.5: OK
✗ Router (3-tier, classifier=minimax-m2.7): one or more tiers failed
✗ Some models failed preflight check
ERROR: Preflight LLM check failed
exit_code=1
--- GITHUB_OUTPUT ---
(missing)

This shows the real workflow-style dispatch path currently aborts before producing GITHUB_OUTPUT.

Step 2 — Compare the underlying plain tier:
Ran the same CLI for MODEL_IDS=kimi-k2.6:

Resolved 1 model(s): kimi-k2.6
✓ Proxy reachable at https://llm-proxy.app.all-hands.dev
  Checking Kimi K2.6...
  ✗ Kimi K2.6: Bad request - litellm.BadRequestError: Litellm_proxyException - /chat/completions: Invalid model name passed in model=moonshot/kimi-k2.6. Call `/v1/models` to view available models for your key.
ERROR: Preflight LLM check failed
exit_code=1

This suggests the recursion itself is working correctly, but the kimi-k2.6 tier is not currently usable through the live proxy credentials/environment I exercised.

Issues Found

🟠 Issue: MODEL_IDS=router-classified-3tier is not currently dispatch-ready against the live default proxy because the kimi-k2.6 tier fails preflight with Invalid model name passed in model=moonshot/kimi-k2.6. The plain kimi-k2.6 entry fails the same way, so this looks like a proxy provisioning/model-name issue rather than a recursion bug, but it still blocks the PR’s stated dispatch-readiness goal.

Automated QA review generated by an AI agent (OpenHands) on behalf of the requester.

all-hands-bot · 2026-06-13T00:40:04Z

+            "fallback_model_id": "gpt-5.5",
+            "tiers": {
+                "kimi-k2.6": {
+                    "model": "litellm_proxy/moonshot/kimi-k2.6",


🟠 Important: I exercised the actual resolver CLI with MODEL_IDS=router-classified-3tier against the live default proxy using the available LLM credentials. Preflight recursed correctly, but this tier failed with Invalid model name passed in model=moonshot/kimi-k2.6; running the plain MODEL_IDS=kimi-k2.6 entry failed the same way. Until the proxy/model name is provisioned or this tier is changed to a reachable model, the new router model aborts before writing GITHUB_OUTPUT, so the dispatching end is not fully ready.

Automated QA finding generated by an AI agent (OpenHands) on behalf of the requester.

juanmichelini requested a review from all-hands-bot June 13, 2026 00:31

juanmichelini marked this pull request as ready for review June 13, 2026 00:31

Merge branch 'main' into feat/router-classified-3tier-model

dc25347

all-hands-bot reviewed Jun 13, 2026

juanmichelini wants to merge 2 commits into main from feat/router-classified-3tier-model

3 participants
