Add locale evaluation harness #8

Closed
deankarn wants to merge 1 commit into main from eval-harness

@deankarn
Contributor

Summary

Python harness under eval/ for scoring candidate BCP 47 locale codes against the running translator API. Decides which codes to promote into the Language enum based on two reference-free signals.

Two phases:

  1. Output-language consistency (free, local) — translate a fixed source set, feed outputs back through /detect-language, compute fraction whose detected base matches the target's base. Catches locales where the model echoes the source, returns English, or produces gibberish.
  2. LLM-as-judge sample (Anthropic API) — sample N source/translation pairs, ask Claude to score fluency + adequacy on a 1–5 scale.
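The consistency signal in step 1 can be sketched as below. This is a minimal illustration, not the harness code: `base_subtag` and `consistency` are hypothetical names, and the detected tags are assumed to have already been collected from /detect-language as plain strings.

```python
from typing import Iterable


def base_subtag(tag: str) -> str:
    """Return the lowercased primary language subtag of a BCP 47 tag."""
    # BCP 47 separates subtags with "-"; tolerate "_" as some systems emit it.
    return tag.replace("_", "-").split("-", 1)[0].lower()


def consistency(target: str, detected: Iterable[str]) -> float:
    """Fraction of detected tags whose base matches the target's base."""
    detected = list(detected)
    if not detected:
        # Assumption: no outputs means no evidence, scored as 0.0.
        return 0.0
    want = base_subtag(target)
    hits = sum(1 for tag in detected if base_subtag(tag) == want)
    return hits / len(detected)


# Three of four outputs were detected as Portuguese; one fell back to English.
print(consistency("pt-BR", ["pt", "pt-PT", "en", "pt-BR"]))  # 0.75
```

Comparing only the base subtag means a `pt-PT` detection still counts for a `pt-BR` target, so this signal cannot distinguish regional variants; it only flags outputs in the wrong language.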

Classifies each candidate as PASS / BORDERLINE / FAIL on configurable thresholds (defaults: ≥ 0.90 consistency and ≥ 3.5 judge mean → PASS; < 0.85 consistency or < 3.0 judge mean → FAIL; anything in between → BORDERLINE).
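A minimal sketch of the classification rule, reading the defaults as "PASS needs both signals above their PASS thresholds, FAIL needs either signal below its FAIL threshold, everything else is BORDERLINE". The function name and keyword arguments are hypothetical, and treating a missing judge score (for example under `--skip-judge`) as "classify on consistency alone" is an assumption, not confirmed harness behaviour.

```python
from typing import Optional


def classify(
    consistency: float,
    judge_mean: Optional[float],
    *,
    pass_consistency: float = 0.90,
    pass_judge: float = 3.5,
    fail_consistency: float = 0.85,
    fail_judge: float = 3.0,
) -> str:
    """Map the two signals to PASS / BORDERLINE / FAIL."""
    # Either signal below its FAIL threshold is enough to fail.
    if consistency < fail_consistency:
        return "FAIL"
    if judge_mean is not None and judge_mean < fail_judge:
        return "FAIL"
    # Assumption: with no judge score, only consistency is checked.
    judge_ok = judge_mean is None or judge_mean >= pass_judge
    if consistency >= pass_consistency and judge_ok:
        return "PASS"
    return "BORDERLINE"


print(classify(0.95, 4.0))  # PASS
print(classify(0.95, 3.2))  # BORDERLINE
print(classify(0.80, 4.5))  # FAIL
```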

Output: a results CSV plus a translations CSV for manual review.

Best-effort, not authoritative: reference-free quality estimation (QE) has known biases, so treat results as directional. Calibrate thresholds against known-good (WMT24++-validated) locales before trusting them on new candidates.
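One possible way to approach calibration, sketched under assumptions: run the harness on known-good locales, then take the lowest score seen on each signal as a ceiling for the corresponding PASS threshold. The helper name is hypothetical and the scores below are placeholder values for illustration, not measured results.

```python
from typing import Dict, Tuple


def calibration_ceiling(
    known_good: Dict[str, Tuple[float, float]]
) -> Tuple[float, float]:
    """Highest PASS thresholds that would still pass every known-good locale.

    `known_good` maps a locale code to (consistency, judge_mean).
    """
    min_consistency = min(c for c, _ in known_good.values())
    min_judge = min(j for _, j in known_good.values())
    return min_consistency, min_judge


# Placeholder scores, NOT real harness output.
scores = {
    "de-DE": (0.97, 4.4),
    "ja-JP": (0.93, 4.1),
    "fr-FR": (1.00, 3.8),
}
print(calibration_ceiling(scores))  # (0.93, 3.8)
```

If the configured PASS thresholds sit above these ceilings, at least one validated locale would be misclassified, which suggests the thresholds are too strict.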

Promotion is a separate follow-up: the Language enum is unchanged in this PR. Locales that survive the harness will be promoted in subsequent PRs, one curated batch at a time, with a results report committed alongside.

Files

  • `eval/harness.py` — main entry; CLI flags for candidates path, API URL, judge model, sample size, skip-judge, limit, output
  • `eval/sources.txt` — 30 neutral English source sentences (declarative/question/imperative/formal/casual/numbers/idioms)
  • `eval/candidates.example.csv` — template (the real `candidates.csv` is gitignored — per-project, may be private)
  • `eval/README.md` — workflow, prerequisites, threshold tuning, cost notes per model tier
  • `eval/requirements.txt` — `requests`, `anthropic`
  • `.gitignore` — excludes `eval/candidates.csv`, `eval/results/*`, `eval/.venv/`

Cost

Run cost is dominated by the judge calls in the LLM-as-judge phase (Phase 4). Approximate cost per candidate with `--sample-judge 10`:

  • Haiku 4.5: ~$0.01 per candidate
  • Sonnet 4.6: ~$0.04
  • Opus 4.7: ~$0.20

Phase 1 is free (local API). `--skip-judge` runs Phase 1 only.
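For budgeting a run, the per-candidate figures above can be turned into a rough estimate. This sketch assumes cost scales linearly with the judge sample size, which the figures do not confirm; the tier keys and function name are hypothetical.

```python
# Approximate per-candidate judge cost in USD at --sample-judge 10,
# taken from the figures listed above.
COST_PER_CANDIDATE_USD = {"haiku": 0.01, "sonnet": 0.04, "opus": 0.20}


def estimate_cost(n_candidates: int, tier: str, sample_judge: int = 10) -> float:
    """Rough judge cost for a run; assumes linear scaling in sample size."""
    per_candidate = COST_PER_CANDIDATE_USD[tier] * sample_judge / 10
    return n_candidates * per_candidate


# 50 candidates judged with the Sonnet tier at the default sample size.
print(f"${estimate_cost(50, 'sonnet'):.2f}")  # $2.00
```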

Test plan

  • `python3 eval/harness.py --help` — usage prints cleanly
  • `python3 eval/harness.py --skip-judge` (no candidates.csv) — fails with clear hint
  • `python3 eval/harness.py --candidates eval/candidates.example.csv --skip-judge --limit 1` (API not running) — fails with clear hint
  • With API running: `python3 eval/harness.py --candidates eval/candidates.example.csv --skip-judge --limit 3` — produces a results CSV with consistency scores
  • With `ANTHROPIC_API_KEY` and API running: full run with `--limit 3` produces judge scores and PASS/FAIL classifications
  • Calibration: on validated WMT24++ locales already in the enum, harness should produce PASS for all (otherwise thresholds are too strict)

Python script under `eval/` for scoring candidate BCP 47 locale codes
against the running translator API to decide which to add to the
translate-side `Language` enum.

Two-phase scoring:

1. Output-language consistency (Phase 1, free) — translate fixed
   English source sentences via /translate, run each output through
   /detect-language, compute fraction whose detected base matches the
   target's base. Catches locales where the model echoes the source,
   returns English, or produces gibberish.
2. LLM-as-judge sample (Phase 4, Anthropic API) — sample N source/
   translation pairs, ask Claude to score fluency and adequacy 1–5.

Classifies each candidate as PASS / BORDERLINE / FAIL on configurable
thresholds (defaults: ≥ 0.90 consistency and ≥ 3.5 judge mean → PASS).
Outputs results CSV plus a translations CSV for manual review.

Best-effort, not authoritative — reference-free QE has known biases.
Treat results as directional. Calibrate thresholds against
known-good (WMT24++-validated) locales before trusting them on
new candidates.

Files:

- `eval/harness.py` — main entry, CLI flags for candidates path,
  API URL, judge model, sample size, skip-judge, limit, output.
- `eval/sources.txt` — 30 neutral English source sentences covering
  declarative/question/imperative/formal/casual/numbers/idioms.
- `eval/candidates.example.csv` — template (real candidates.csv is
  gitignored — per-project, may be private).
- `eval/README.md` — workflow, prerequisites, threshold tuning, cost
  notes per model tier (Haiku ~$0.01/candidate, Sonnet ~$0.04, Opus
  ~$0.20).
- `eval/requirements.txt` — `requests`, `anthropic`.
- `.gitignore` — excludes `eval/candidates.csv`, `eval/results/*`,
  `eval/.venv/`.

The enum is unchanged in this PR — promotion of locales that pass
the harness is a separate follow-up, one curated batch per PR with a
results report committed alongside.
@deankarn
Contributor Author

closing in favour of #7

@deankarn closed this May 11, 2026