Add locale evaluation harness #8
Closed
deankarn wants to merge 1 commit into
Python script under `eval/` for scoring candidate BCP 47 locale codes against the running translator API to decide which to add to the translate-side `Language` enum.

Two-phase scoring:

1. Output-language consistency (Phase 1, free): translate fixed English source sentences via `/translate`, run each output through `/detect-language`, compute the fraction whose detected base matches the target's base. Catches locales where the model echoes the source, returns English, or produces gibberish.
2. LLM-as-judge sample (Phase 4, Anthropic API): sample N source/translation pairs, ask Claude to score fluency and adequacy 1–5.

Classifies each candidate as PASS / BORDERLINE / FAIL on configurable thresholds (defaults: ≥ 0.90 consistency and ≥ 3.5 judge mean → PASS). Outputs a results CSV plus a translations CSV for manual review.

Best-effort, not authoritative: reference-free QE has known biases. Treat results as directional. Calibrate thresholds against known-good (WMT24++-validated) locales before trusting them on new candidates.

Files:

- `eval/harness.py`: main entry, CLI flags for candidates path, API URL, judge model, sample size, skip-judge, limit, output.
- `eval/sources.txt`: 30 neutral English source sentences covering declarative/question/imperative/formal/casual/numbers/idioms.
- `eval/candidates.example.csv`: template (the real `candidates.csv` is gitignored: per-project, may be private).
- `eval/README.md`: workflow, prerequisites, threshold tuning, cost notes per model tier (Haiku ~$0.01/candidate, Sonnet ~$0.04, Opus ~$0.20).
- `eval/requirements.txt`: `requests`, `anthropic`.
- `.gitignore`: excludes `eval/candidates.csv`, `eval/results/*`, `eval/.venv/`.

The enum is unchanged in this PR. Promotion of locales that pass the harness is a separate follow-up, one curated batch per PR with a results report committed alongside.
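The Phase 1 consistency score described above could be computed roughly as follows. This is a minimal sketch, not the code in `eval/harness.py`: the helper names `base_subtag` and `consistency` are hypothetical, and the detected tags are made-up inputs standing in for `/detect-language` responses.

```python
def base_subtag(tag: str) -> str:
    """Return the lowercased primary language subtag of a BCP 47 tag."""
    # Accept underscore-separated tags too (e.g. "zh_Hant_TW"); this leniency
    # is an assumption, not something the PR specifies.
    return tag.replace("_", "-").split("-")[0].lower()


def consistency(target: str, detected: list[str]) -> float:
    """Fraction of detected tags whose base subtag matches the target's base."""
    if not detected:
        return 0.0
    want = base_subtag(target)
    hits = sum(1 for d in detected if base_subtag(d) == want)
    return hits / len(detected)


# Hypothetical detector outputs for four translations into pt-BR:
# three share the base "pt", one came back as English.
detected = ["pt", "pt-PT", "en", "pt-BR"]
print(consistency("pt-BR", detected))  # 0.75
```

Comparing only the base subtag means `pt-PT` output counts as a hit for a `pt-BR` target, so this signal catches wrong-language output but not wrong-region output.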
closing in favour of #7
Summary
Python harness under `eval/` for scoring candidate BCP 47 locale codes against the running translator API. Decides which codes to promote into the `Language` enum based on two reference-free signals.

Two phases:

1. Output-language consistency (Phase 1, free): translate fixed English source sentences via `/translate`, run each output through `/detect-language`, compute the fraction whose detected base matches the target's base. Catches locales where the model echoes the source, returns English, or produces gibberish.
2. LLM-as-judge sample (Phase 4, Anthropic API): sample N source/translation pairs, ask Claude to score fluency and adequacy 1–5.

Classifies each candidate as PASS / BORDERLINE / FAIL on configurable thresholds (defaults: ≥ 0.90 consistency, ≥ 3.5 judge mean → PASS; < 0.85 consistency or < 3.0 judge → FAIL).
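The default thresholds can be read as a three-way decision rule. The sketch below is one plausible reading, not the harness's actual code: the function name `classify` and its parameter names are hypothetical, and checking FAIL before PASS is an assumption about how the two conditions combine.

```python
def classify(
    consistency: float,
    judge_mean: float,
    pass_consistency: float = 0.90,
    pass_judge: float = 3.5,
    fail_consistency: float = 0.85,
    fail_judge: float = 3.0,
) -> str:
    """Map the two scores to PASS / BORDERLINE / FAIL using the stated defaults."""
    # Either signal falling below its FAIL floor is enough to fail.
    if consistency < fail_consistency or judge_mean < fail_judge:
        return "FAIL"
    # Both signals must clear their PASS bar.
    if consistency >= pass_consistency and judge_mean >= pass_judge:
        return "PASS"
    # Everything in between is left for manual review.
    return "BORDERLINE"


print(classify(0.95, 4.0))  # PASS
print(classify(0.87, 4.0))  # BORDERLINE
print(classify(0.95, 2.9))  # FAIL
```

Under this reading, BORDERLINE covers any candidate with consistency of at least 0.85 and a judge mean of at least 3.0 that misses one of the PASS bars, which is where the translations CSV is most useful for manual review.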
Output: a results CSV plus a translations CSV for manual review.
Promotion is a separate follow-up: the `Language` enum is unchanged in this PR. Locales that survive the harness will be promoted in subsequent PRs, one curated batch at a time, with a results report committed alongside.

Files
Cost
Phase 4 cost is dominated by judge calls. With `--sample-judge 10`:
Phase 1 is free (local API). `--skip-judge` runs Phase 1 only.
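For rough budgeting, the per-candidate figures quoted in the PR description (Haiku ~$0.01, Sonnet ~$0.04, Opus ~$0.20) can be multiplied out as below. This is only a back-of-the-envelope sketch: the names `COST_PER_CANDIDATE_USD` and `estimate_cost` are hypothetical, the figures are approximate, and it is an assumption that they correspond to `--sample-judge 10`.

```python
# Approximate per-candidate judge cost in USD, taken from the PR description.
# Assumption: these apply at the sample size you intend to run.
COST_PER_CANDIDATE_USD = {"haiku": 0.01, "sonnet": 0.04, "opus": 0.20}


def estimate_cost(n_candidates: int, tier: str, skip_judge: bool = False) -> float:
    """Estimate total judge spend; with the judge skipped only Phase 1 runs, which is free."""
    if skip_judge:
        return 0.0
    return n_candidates * COST_PER_CANDIDATE_USD[tier]


# Example: a hypothetical batch of 25 candidate locales.
for tier in COST_PER_CANDIDATE_USD:
    print(f"{tier}: ~${estimate_cost(25, tier):.2f}")
```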
Test plan