feat(llm-access): add anthropic upstream channel probes#58
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the management of Anthropic Upstream Channels by moving the channel administration, model refreshing, model testing, and token rollups from the main Kiro Gateway overview page to a dedicated, independent page. It introduces database schema updates to persist probe states (such as model IDs, latency, and errors), implements backend endpoints for refreshing and testing models, and updates the frontend UI and routing accordingly. Feedback on the changes suggests optimizing the string truncation logic in anthropic_upstream_probe.rs to use a single iterator pass instead of traversing the string twice.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if sanitized.chars().count() > MAX_ERROR_CHARS { | ||
| sanitized = sanitized.chars().take(MAX_ERROR_CHARS).collect::<String>(); | ||
| sanitized.push_str("..."); | ||
| } |
There was a problem hiding this comment.
The current string truncation logic traverses the string twice: once to count the characters using sanitized.chars().count(), and a second time to take the characters using sanitized.chars().take(). According to the general rules, we should use a single iterator pass to check for remaining characters and perform the truncation, which avoids the redundant traversal.
let mut char_iter = sanitized.chars();
let mut truncated: String = char_iter.by_ref().take(MAX_ERROR_CHARS).collect();
if char_iter.next().is_some() {
truncated.push_str("...");
}
sanitized = truncated;References
- When truncating a string and appending a truncation marker based on whether the string exceeds a limit, use a single iterator pass (e.g., calling
next()on the iterator after taking the limit) to check for remaining characters, rather than traversing the string twice withcount().
There was a problem hiding this comment.
Addressed in d4eb724: sanitize_error no longer does count + second take; it normalizes whitespace then truncates with a single char iterator pass.
acking-you
left a comment
There was a problem hiding this comment.
CR — feat(llm-access): add anthropic upstream channel probes
整体是一个质量很高的 PR:核心 bug 修复正确且完整,探针路径加固良好,URL 校验既在写入侧也在构造侧,并采纳了之前 review 的反馈。多智能体审查:3 mapper + 7 维 finder + 每条对抗式验证,共 33 个 agent;23 候选,21 confirmed / 2 refuted——但确认项里绝大多数是 LOW/nit,只有一条安全项值得在合并前处理。
先表扬几处已核对无误的设计:
- ✅ wrapper 修复正确且完整:
UsageAccountingControlStore现在转发record_anthropic_upstream_channel_usage;系统性检查了ControlStore全部 5 个方法,无遗漏的 no-op 吞没(admin 探针状态走独立的AdminAnthropicUpstreamStore、指向真实 Postgres repo,不受影响)。 - ✅ base_url 的 SSRF 校验已经做了且有测试:
build_anthropic_endpoint_url+ 写入侧normalize_anthropic_upstream_base_url都强制 https、拒绝 localhost/metadata/私网 IP(admin.rs 有专门测试)——PR#56 的遗留隐患在此被补上。 - ✅ 探针响应有 1MB 上限 + 30s 超时 + 错误 sanitize/截断;
build_anthropic_endpoint_url是很好的共享复用。 - ✅ 两条被否决也说明这点:
UsageEvent现已#[derive(Default)](我在 PR#57 提的“无 Default”已被采纳);写入侧 SSRF 校验已有测试。
🟡 建议合并前处理
- probe:197 探针的 HTTP client 跟随 3xx 重定向、且无私网 DNS 过滤——绕过了 base_url 的 host 白名单:合法的公网 base_url 若被 302 到
http://169.254.169.254/...、或公网域名解析到私网 IP(DNS rebind),请求会被跟随、x-api-key 被带过去、响应体还回显给 admin。同仓的 media/cctest client 已经redirect(none()) + PrivateFilteringDnsResolver,照搬即可(dispatch 路径同源,建议一并加固)。
🟡 测试缺口
- probe:347 整个探针模块零单测;wrapper 转发测试只覆盖 5 个 ControlStore 方法中的 1 个,bug 类(默认 no-op 被吞)无系统性防护。
🟢 LOW / nit
- admin:3018 test 每次点击都发真实可计费的上游请求,无 rate-limit/cooldown,且
billable_tokens=0使花费不进计费 rollup(仅在 journal 可见)。 - admin:3029 probe→save 非事务:探测期间频道被删 → 已花 token 但返回 404(建议加 warn)。
- anthropic_upstream.rs:33
model_ids_from_json_text把损坏 JSON 静默降级为空 vec(无日志)。 - 复用:probe:182 探针 HTTP 骨架与
dispatch_one_route重复(且 anthropic-version 常量重复、探针漏发 dispatch 转发的 anthropic-beta/accept/user-agent → 漂移);probe:48 每个错误路径重复整段 ProbeOutput literal;is_private_or_loopback_ip在 admin.rs 与 pool/lib.rs 各有一份逐字拷贝。 - 前端页:581 探针按钮中文标签混在英文页面里;
format_timestamp_opt与 overview 页逐字重复。
详见各行内评论。唯一值得合并前处理的是 #1(重定向/DNS 重新打开 SSRF),其余都是低优清理与测试补强。
| }; | ||
| }, | ||
| }; | ||
| let client = match provider::provider_client(target.proxy.as_ref()) { |
There was a problem hiding this comment.
🟡 [security] 探针 HTTP client 跟随重定向 + 无私网 DNS 过滤,绕过 base_url 白名单
build_anthropic_endpoint_url(pool/lib.rs:168-197)确实校验了 base_url 的 host(仅 https、拒绝 localhost/metadata/字面私网 IP),但这是字符串/解析期校验。探针经 provider::provider_client(target.proxy) 用的是通用 build_provider_client(provider/client.rs:15-31),它既没 redirect(Policy::none()) 也没私网过滤 DNS resolver。reqwest 0.12 默认跟随最多 10 次重定向,跨 host 重定向只剥 Authorization/Cookie/Proxy-Authorization/Www-Authenticate,不剥自定义 x-api-key。于是两个绕过:
- 合法公网 base_url 的
/models//messages返回302 -> http://169.254.169.254/...(或任意内网 host)→ 被跟随、x-api-key 带过去、响应体经upstream_error_summary回显给 admin(returning SSRF + 凭据外泄); - 公网域名解析到私网/loopback IP(内部 DNS / DNS rebinding)→ 连接期才暴露,字符串校验看不到。
同仓 30 行外就有正确做法:media client(provider.rs:544-555)与 cctest client(556-565)都 redirect(reqwest::redirect::Policy::none()) + dns_resolver(Arc::new(kiro_media::PrivateFilteringDnsResolver))。
建议:给探针用一个专用 client(像 media/cctest 那样设 redirect(none) + 私网过滤 DNS),refresh_models 与 test_messages_model 都用它。注意 dispatch 路径(anthropic_upstream_dispatch.rs)用的是同一个通用 client,建议一并加固。属 admin-trust,但触发不需要 admin 恶意(被攻陷/恶意上游的 302、或 DNS rebind 即可越过信任边界),故 medium。
There was a problem hiding this comment.
Fixed in d4eb724: Anthropic upstream probe and dispatch now use a dedicated client that disables redirects. Direct/no-proxy mode also uses the existing PrivateFilteringDnsResolver so DNS rebinding to private/local addresses is rejected at connect-time resolution. Added regression tests for both redirect blocking and localhost DNS rejection.
| return internal_error("Failed to load Anthropic upstream channel").into_response(); | ||
| }, | ||
| }; | ||
| let output = anthropic_upstream_probe::test_messages_model(&target, &model).await; |
There was a problem hiding this comment.
🟢 [low] 模型测试每次都发真实可计费请求,无限流,且花费不进计费 rollup
test_admin_anthropic_upstream_model(:2985)仅 ensure_admin_access 后即 :3018 无条件调用 test_messages_model,后者发一个真实的 POST /messages(max_tokens:8, "hi",消耗上游真实 token)。usage_event_for_messages_test 把 billable_tokens 硬编码为 0(probe.rs:318)却照抄真实 input/output token,所以上游被计费、但网关自己的计费/配额 rollup 看不到这笔花费(仅 raw journal 有)。没有 per-channel cooldown/限流——脚本化或多客户端快速点击可驱动无界真实花费(前端 in-flight guard 只防单会话)。
建议:(1) 在 handler 上加一个轻量 per-channel cooldown(last_test_at_ms 已持久化,可直接用);(2) 让探针花费可观测——要么给 billable_tokens 填真实值、要么单列一个 admin_probe 花费计数(routing_diagnostics 已打 admin_probe:true 可过滤)。
附带(nit):探针/上游失败现在返回 HTTP 200 + ok:false(admin.rs:2963 一带),而非映射到错误状态码——调用方需读 body 判断成败。
There was a problem hiding this comment.
Fixed in eb64284. The admin model-test handler now enforces a 30s per-channel cooldown from persisted last_test_at_ms before sending the real POST /messages probe. The usage event still keeps billable_tokens=0 by design so admin tests do not debit user quota, but routing_diagnostics_json now includes admin_probe_billable_tokens computed with the same billable-token formula for observability.
| value.map(non_negative_i64_to_u64) | ||
| } | ||
|
|
||
| fn model_ids_from_json_text(raw: &str) -> Vec<String> { |
There was a problem hiding this comment.
🟢 [low] model_ids_from_json_text 把损坏 JSON 静默降级为空 vec
serde_json::from_str::<Vec<String>>(raw).unwrap_or_default()(:33-35,解码点 :93)。迁移 0034 的 CHECK 只约束 jsonb_typeof(model_ids) = 'array',并不保证是“字符串数组”。所以存进 [1,2] 或 [{"id":"x"}] 能过约束、但 from_str::<Vec<String>> 失败 → 静默变成 []、无任何日志,admin 只看到“无模型”而非错误。
当前唯一写入方 save_admin_anthropic_upstream_models_status(:535)写的是类型化 Vec<String>,所以带内不会损坏;风险只在带外写入/未来更宽松的写路径。
建议:把 unwrap_or_default() 换成显式 match/map_err,在解析失败分支 tracing::warn!(带 channel 上下文)再回退空 vec,让真正的损坏可观测;可选地把 DB CHECK 收紧到“元素均为 string”。
There was a problem hiding this comment.
Fixed in eb64284. model_ids decoding now uses an explicit match and emits tracing::warn with the channel name and parse error before falling back to an empty list, so malformed JSONB written out-of-band is observable instead of silently looking like no models.
| checked_at_ms: output.checked_at_ms, | ||
| error: output.error.clone(), | ||
| }; | ||
| match state |
There was a problem hiding this comment.
🟢 [low] probe→save 非事务:探测期间频道被删 → 已花 token 却返回 404
test_admin_anthropic_upstream_model::3018 已发出真实可计费 POST、:3021 已 journal 记录,随后 :3029 save_admin_anthropic_upstream_test_status;若频道在 load(:3004) 与 save 之间被删除,UPDATE 命中 0 行 → 返回 Ok(None) → handler 映射成 404(:3044)。即上游已被扣费、事件已记,但 admin 看到 404。
罕见、admin-only、每次 save 各自原子,属设计边角。建议:save_* 返回 Ok(None) 时(:3044 分支)tracing::warn!(channel/model/event_id),让这笔“孤儿花费”在日志可见,而不是只表现为 404。
There was a problem hiding this comment.
Fixed in eb64284. The Ok(None) branch after saving model-test status now logs channel, model, and usage event_id before returning 404, so a probe that spent upstream tokens but lost the channel row during save is visible in service logs.
| upstream_request_body_json, | ||
| }; | ||
| } | ||
| let url = match build_messages_url(&target.base_url) { |
There was a problem hiding this comment.
🟢 [reuse] 探针 HTTP 骨架与 dispatch_one_route 重复,且已开始漂移
test_messages_model(此处起)与 refresh_models 都重复了 “build_url → provider_client → 设 x-api-key/anthropic-version/content-type → send → read-capped-body → 按状态分类” 这套骨架,而 provider/anthropic_upstream_dispatch.rs::dispatch_one_route(:361-444)也是同一套。已经出现漂移:dispatch 会转发 anthropic-beta/accept/user-agent 并接受调用方传入的 anthropic-version,而探针硬编码 ANTHROPIC_VERSION 常量(probe.rs:20,与 dispatch 的 fallback 字符串重复)且不转发这些头——于是探针不是真实请求的忠实复现。
建议:抽一个共享 request builder(放在 pool crate 或 provider util),如 anthropic_messages_request(client, url, api_key, anthropic_version) -> RequestBuilder 设好这几个头,dispatch 与探针都调用它;再抽一个 send_and_read_capped(req, cap) 包住 body 截断。纯质量清理,不改行为。
There was a problem hiding this comment.
Addressed the shared stable parts in eb64284. The default Anthropic version and x-api-key/anthropic-version request header setup now live in llm-access-anthropic-pool and are used by both dispatch and admin probes. Dispatch still forwards anthropic-beta/accept/user-agent from the user request, while the admin probe intentionally remains a fixed minimal health-check request instead of pretending to replay arbitrary user request headers.
| pub upstream_request_body_json: String, | ||
| } | ||
|
|
||
| pub(crate) async fn refresh_models( |
There was a problem hiding this comment.
🟢 [nit] 每个错误路径重复整段 ProbeOutput struct literal
refresh_models / test_messages_model 在 ~5 个早返回错误分支里各自手写完整的 ModelsProbeOutput / MessagesProbeOutput(6-9 个字段),只有 status/error/error_class/latency 不同。建议加构造助手,例如 ModelsProbeOutput::error(started, checked_at_ms, status_code, msg) / MessagesProbeOutput::error(.., error_class, upstream_request_body_json),把这些早返回收成一行,减少重复与漏填风险。
There was a problem hiding this comment.
Fixed in eb64284. ModelsProbeOutput and MessagesProbeOutput now have ok/failure constructors, and the early-return probe error paths use them instead of repeating the full struct literal each time.
| </div> | ||
| <div class={classes!("space-y-2")}> | ||
| <div class={classes!("flex", "gap-2", "flex-wrap")}> | ||
| <button type="button" class={classes!("btn-terminal", "text-xs")} disabled={is_refreshing} onclick={on_refresh_models}>{ if is_refreshing { "刷新中..." } else { "刷新状态" } }</button> |
There was a problem hiding this comment.
🟢 [nit] 前端:中文标签混入英文页面 + 工具函数逐字重复
探针动作按钮/标签是硬编码中文(约 :581 一带),而本页其余文案是英文——建议统一语言(或走 i18n)。另外 format_timestamp_opt(页首 :21 一带)与 overview 页里的同名函数逐字重复,建议提到一个共享的 frontend util。
附带(同类 dedup):is_private_or_loopback_ip 在 admin.rs:7737 与 llm-access-anthropic-pool/src/lib.rs:230 各有一份逐字拷贝——SSRF 判定逻辑分两处维护,建议合并到 pool crate 一处、admin 复用,避免“改一处漏一处”。
There was a problem hiding this comment.
Fixed in eb64284. The Anthropic upstream page actions now use English labels, format_timestamp_opt is shared from llm_access_shared, and admin.rs now reuses llm-access-anthropic-pool::is_private_or_loopback_ip instead of keeping a duplicate SSRF helper.
| } | ||
| } | ||
|
|
||
| async fn read_limited_response_body(response: reqwest::Response) -> Result<Bytes, String> { |
There was a problem hiding this comment.
🟡 [test gap] 探针模块零单测;wrapper 转发测试只覆盖 5 个方法中的 1 个
anthropic_upstream_probe.rs 没有任何 #[cfg(test)]——success / 非 2xx / 超时 / 超大 body(content-length 与流式两条上限)/ 解析失败 / proxy_error 这些分支全未测,read_limited_response_body 的 1MB 上限也未测。
另外 usage_accounting_control_store_forwards_anthropic_channel_usage 只验证了 record_anthropic_upstream_channel_usage 这 1 个转发;bug 类(ControlStore 默认 no-op 被 wrapper 吞没)没有系统性防护——若将来再加一个带默认 no-op 的变更方法、wrapper 漏转发,又会重演本 PR 修的这个 bug。
建议:(1) 给探针加单测(用 mock/本地 HTTP server 覆盖上述分支 + body 上限);(2) 把会变更状态的 rollup/usage trait 方法改成无默认实现(强制每个 ControlStore impl 显式实现),或加一个 wrapper-completeness 测试逐个验证转发——从根上防住这一类“静默吞没”。
There was a problem hiding this comment.
Improved in eb64284. Probe module now has unit coverage for admin-test usage diagnostics, Anthropic error-summary parsing, and both oversized-body paths: Content-Length over cap and streaming past cap. Also removed default no-op implementations for the ControlStore usage mutation methods, so wrappers/stubs must implement record_codex_image_key_usage and record_anthropic_upstream_channel_usage explicitly. I did not make probe HTTP tests bypass the production HTTPS/non-local base_url validation; that security rule stays enforced.
| .await | ||
| } | ||
|
|
||
| async fn record_anthropic_upstream_channel_usage( |
There was a problem hiding this comment.
✅ [confirmed-good + defense-in-depth] wrapper 修复正确且完整
已核实:本 PR 在 UsageAccountingControlStore(生产路径在启用 usage accounting 时包裹 Postgres repo)新增了对 record_anthropic_upstream_channel_usage 的转发(此处 :1974-1982),有针对性测试 usage_accounting_control_store_forwards_anthropic_channel_usage。系统性核对了 ControlStore 全部 5 个方法——都已转发,无其他静默 no-op;record_codex_image_key_usage 早已转发,故此前只有 anthropic 这一个回归。修复正确,无需改动。
为防该 bug 类复发,建议把会变更状态的 rollup/usage 方法设为 trait 必需方法(去掉默认 no-op 实现),让编译器强制每个 impl/wrapper 都实现(详见 probe:347 的测试评论)。
Summary
Adds a dedicated Anthropic upstream channel admin page and probe workflow, and fixes the production bug where direct Anthropic channel usage was visible in usage logs but the channel rollup card stayed at zero.
Root cause
Production wraps the control store with
UsageAccountingControlStorewhen usage accounting is enabled. Direct Anthropic dispatch calledrecord_anthropic_upstream_channel_usage, but the wrapper did not forward that method, so the trait default no-op swallowed channel rollup updates while regular usage events still existed.What changed
Anthropic 直连/Anthropic 测试badges can render without opening detail./admin/kiro-gateway/upstream-channels, avoiding the backend API path collision.Verification
cargo test -p llm-access --jobs 4 usage_accounting_control_store_forwards_anthropic_channel_usagecargo test -p llm-access-store --jobs 4 duckdb_repository_persists_usage_events_with_default_featurecargo test -p llm-access -p llm-access-core -p llm-access-store -p llm-access-anthropic-pool -p llm-access-migrations --jobs 4 anthropic_upstreamcargo clippy -p llm-access -p llm-access-core -p llm-access-store -p llm-access-anthropic-pool -p llm-access-migrations --tests --jobs 4 -- -D warningscargo clippy -p static-flow-frontend --target wasm32-unknown-unknown --jobs 4 -- -D warningscargo test -p static-flow-backend --jobs 4 admin_kirocargo clippy -p static-flow-backend --jobs 4 -- -D warningsbash scripts/build_frontend_selfhosted.shProduction validation
llm-accessAPI:20260628T201928Z-12544120993a.llm-access-usage-worker:20260628T202154Z-12544120993a.ylupstream models online: 29 models, statusok, latency 74ms.ylmodelclaude-opus-4-7online: statusok, latency 2496ms.for-external-cache-new;ylchannel rollup moved from zero to non-zero usage.direct_anthropicrouting diagnostics./admin/kiro-gateway/upstream-channelsreturns SPA HTML and/admin/kiro-gateway/anthropic-upstreamsremains JSON API.