Add locale evaluation harness by deankarn · Pull Request #8 · rust-playground/universal-translator

deankarn · 2026-05-10T03:49:35Z

Summary

Python harness under eval/ for scoring candidate BCP 47 locale codes against the running translator API. Decides which codes to promote into the Language enum based on two reference-free signals.

Two phases:

1. Output-language consistency (free, local): translate a fixed source set, feed outputs back through `/detect-language`, compute fraction whose detected base matches the target's base. Catches locales where the model echoes the source, returns English, or produces gibberish.
2. LLM-as-judge sample (Anthropic API): sample N source/translation pairs, ask Claude to score fluency + adequacy on a 1–5 scale.
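
As a rough illustration of the consistency signal, the sketch below compares primary language subtags. The helper names `base_language` and `consistency` are hypothetical and are not taken from `eval/harness.py`; the detected tags are made-up inputs standing in for `/detect-language` responses.

```python
def base_language(tag: str) -> str:
    """Return the lowercased primary language subtag of a BCP 47 tag."""
    # BCP 47 separates subtags with "-"; "_" is tolerated here as an
    # assumption, since some detectors emit tags such as "pt_BR".
    return tag.replace("_", "-").split("-", 1)[0].lower()


def consistency(detected_tags: list[str], target_tag: str) -> float:
    """Fraction of outputs whose detected base matches the target's base."""
    if not detected_tags:
        return 0.0
    target_base = base_language(target_tag)
    hits = sum(1 for tag in detected_tags if base_language(tag) == target_base)
    return hits / len(detected_tags)


# Hypothetical detections for four outputs translated into "pt-BR".
detected = ["pt", "pt-PT", "en", "pt-BR"]
score = consistency(detected, "pt-BR")
print(f"{score:.2f}")
```

Comparing only the base subtag means a `pt-PT` detection still counts toward a `pt-BR` target, so this signal catches wrong-language output but not wrong-region output.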

Classifies each candidate as PASS / BORDERLINE / FAIL on configurable thresholds (defaults: ≥ 0.90 consistency and ≥ 3.5 judge mean → PASS; < 0.85 consistency or < 3.0 judge mean → FAIL; anything in between → BORDERLINE).
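
One way to read the default thresholds is sketched below. The function name `classify` and its parameter names are hypothetical, and treating a missing judge score (as with `--skip-judge`) as "no judge constraint" is an assumption rather than documented behaviour.

```python
from typing import Optional


def classify(
    consistency: float,
    judge_mean: Optional[float] = None,
    pass_consistency: float = 0.90,
    pass_judge: float = 3.5,
    fail_consistency: float = 0.85,
    fail_judge: float = 3.0,
) -> str:
    """Map the two signals to PASS / BORDERLINE / FAIL."""
    # Either signal falling below its FAIL floor is enough to fail.
    if consistency < fail_consistency:
        return "FAIL"
    if judge_mean is not None and judge_mean < fail_judge:
        return "FAIL"
    # PASS needs both signals at or above their PASS thresholds.
    # Assumption: with no judge score, only consistency is checked.
    judge_ok = judge_mean is None or judge_mean >= pass_judge
    if consistency >= pass_consistency and judge_ok:
        return "PASS"
    return "BORDERLINE"


print(classify(0.95, 4.0))
print(classify(0.87, 4.0))
print(classify(0.95, 2.9))
```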

Output: a results CSV plus a translations CSV for manual review.

Best-effort, not authoritative — reference-free QE has known biases. Treat results as directional. Calibrate thresholds against known-good (WMT24++-validated) locales before trusting them on new candidates.

Promotion is a separate follow-up: the Language enum is unchanged in this PR. Locales that survive the harness will be promoted in subsequent PRs, one curated batch at a time, with a results report committed alongside.

Files

- `eval/harness.py`: main entry; CLI flags for candidates path, API URL, judge model, sample size, skip-judge, limit, output
- `eval/sources.txt`: 30 neutral English source sentences (declarative/question/imperative/formal/casual/numbers/idioms)
- `eval/candidates.example.csv`: template (the real `candidates.csv` is gitignored; per-project, may be private)
- `eval/README.md`: workflow, prerequisites, threshold tuning, cost notes per model tier
- `eval/requirements.txt`: `requests`, `anthropic`
- `.gitignore`: excludes `eval/candidates.csv`, `eval/results/*`, `eval/.venv/`
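
For orientation, a CLI with the flags listed for `eval/harness.py` might be declared roughly as follows. Only `--candidates`, `--skip-judge`, `--limit`, and `--sample-judge` appear in this PR; the names `--api-url`, `--judge-model`, and `--output`, and every default value, are assumptions made for the sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="harness.py",
        description="Score candidate BCP 47 locale codes (illustrative sketch).",
    )
    # Flag names that appear in the PR text; defaults are assumed.
    parser.add_argument("--candidates", default="eval/candidates.csv")
    parser.add_argument("--sample-judge", type=int, default=10)
    parser.add_argument("--skip-judge", action="store_true")
    parser.add_argument("--limit", type=int, default=None)
    # Hypothetical flag names for the remaining options in the file list.
    parser.add_argument("--api-url", default="http://localhost:8080")
    parser.add_argument("--judge-model", default=None)
    parser.add_argument("--output", default="eval/results")
    return parser


args = build_parser().parse_args(
    ["--candidates", "eval/candidates.example.csv", "--skip-judge", "--limit", "3"]
)
print(args.candidates, args.skip_judge, args.limit)
```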

Cost

Cost is dominated by Phase 4 (LLM-as-judge) calls. With `--sample-judge 10`:

- Haiku 4.5: ~$0.01 per candidate
- Sonnet 4.6: ~$0.04 per candidate
- Opus 4.7: ~$0.20 per candidate

Phase 1 is free (local API). `--skip-judge` runs Phase 1 only.
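
To budget a batch, the per-candidate figures above can simply be multiplied by the number of candidates. The sketch below does that for a hypothetical batch of 50; the tier keys are labels for this example only (not API model identifiers), and the figures apply to `--sample-judge 10`.

```python
# Approximate judge cost per candidate at --sample-judge 10, from the list above.
COST_PER_CANDIDATE_USD = {
    "haiku-4.5": 0.01,
    "sonnet-4.6": 0.04,
    "opus-4.7": 0.20,
}


def estimate_judge_cost(n_candidates: int, tier: str, skip_judge: bool = False) -> float:
    """Rough judge cost in USD for a batch of candidates."""
    if skip_judge:
        # With --skip-judge only the free, local consistency phase runs.
        return 0.0
    return n_candidates * COST_PER_CANDIDATE_USD[tier]


# Hypothetical batch size, chosen only for illustration.
for tier in COST_PER_CANDIDATE_USD:
    print(f"{tier}: ~${estimate_judge_cost(50, tier):.2f}")
```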

Test plan

- `python3 eval/harness.py --help`: usage prints cleanly
- `python3 eval/harness.py --skip-judge` (no candidates.csv): fails with clear hint
- `python3 eval/harness.py --candidates eval/candidates.example.csv --skip-judge --limit 1` (API not running): fails with clear hint
- With API running: `python3 eval/harness.py --candidates eval/candidates.example.csv --skip-judge --limit 3` produces a results CSV with consistency scores
- With `ANTHROPIC_API_KEY` and API running: full run with `--limit 3` produces judge scores and PASS/FAIL classifications
- Calibration: on validated WMT24++ locales already in the enum, harness should produce PASS for all (otherwise thresholds are too strict)
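
The calibration step could be checked mechanically along these lines. This is only a sketch: the column names `locale` and `verdict` and the sample rows are hypothetical, since the actual layout of the results CSV is not shown in this PR.

```python
import csv
import io


def calibration_failures(results_csv_text: str) -> list[str]:
    """Return known-good locales that did not receive a PASS verdict."""
    reader = csv.DictReader(io.StringIO(results_csv_text))
    return sorted(row["locale"] for row in reader if row["verdict"] != "PASS")


# Hypothetical results for locales that are already in the enum.
sample = "locale,verdict\nde-DE,PASS\nja-JP,PASS\nsw-KE,BORDERLINE\n"
misses = calibration_failures(sample)
if misses:
    print("thresholds may be too strict for:", ", ".join(misses))
```

A non-empty list here suggests loosening the thresholds before applying them to new candidates.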

Python script under `eval/` for scoring candidate BCP 47 locale codes against the running translator API to decide which to add to the translate-side `Language` enum.

Two-phase scoring:

1. Output-language consistency (Phase 1, free): translate fixed English source sentences via `/translate`, run each output through `/detect-language`, compute fraction whose detected base matches the target's base. Catches locales where the model echoes the source, returns English, or produces gibberish.
2. LLM-as-judge sample (Phase 4, Anthropic API): sample N source/translation pairs, ask Claude to score fluency and adequacy 1–5.

Classifies each candidate as PASS / BORDERLINE / FAIL on configurable thresholds (defaults: ≥ 0.90 consistency and ≥ 3.5 judge mean → PASS). Outputs results CSV plus a translations CSV for manual review.

Best-effort, not authoritative: reference-free QE has known biases. Treat results as directional. Calibrate thresholds against known-good (WMT24++-validated) locales before trusting them on new candidates.

Files:

- `eval/harness.py`: main entry, CLI flags for candidates path, API URL, judge model, sample size, skip-judge, limit, output.
- `eval/sources.txt`: 30 neutral English source sentences covering declarative/question/imperative/formal/casual/numbers/idioms.
- `eval/candidates.example.csv`: template (real candidates.csv is gitignored; per-project, may be private).
- `eval/README.md`: workflow, prerequisites, threshold tuning, cost notes per model tier (Haiku ~$0.01/candidate, Sonnet ~$0.04, Opus ~$0.20).
- `eval/requirements.txt`: `requests`, `anthropic`.
- `.gitignore`: excludes `eval/candidates.csv`, `eval/results/*`, `eval/.venv/`.

The enum is unchanged in this PR. Promotion of locales that pass the harness is a separate follow-up, one curated batch per PR with a results report committed alongside.

deankarn · 2026-05-11T04:30:09Z

closing in favour of #7

deankarn closed this May 11, 2026

deankarn wants to merge 1 commit into main from eval-harness
