Regional language variant support (translate + detect) by deankarn · Pull Request #7 · rust-playground/universal-translator

deankarn · 2026-05-08T17:57:08Z

Summary

Adds regional locale variants to the translate-side Language enum, expands the detector to emit broader BCP 47 codes, and ships a MetricX-based evaluation harness used to curate the final set.

Final shape: 111 translate variants, 108 detect codes. Harness PASS rate climbed from 52 → 74 over the branch.

Translate

WMT24++ base + 11 regional pairs (ar_EG, ar_SA, es_MX, fr_CA, fr_FR, pt_BR, pt_PT, sw_KE, sw_TZ, zh_CN, zh_TW); 3 added base (He, Is, Fil); 4 added regional (en_GB, en_US, es_ES, zh_HK); harness-validated additions for the rest.
18 originally-added Tier B/C/D codes dropped after the model produced wrong-language or script-salad output for them (ff, jv, kac, ln, lu, luo, mai, nso, ny, om, sd, sn, so, to, wo, xh, yo, zu). All 18 were branch-only — none shipped in v0.0.5.
Variant naming: pt_BR style under #[allow(non_camel_case_types)]; code() → BCP 47 dash form; full_name() → English label.
Prompt format switched from BCP 47 codes to full English names + a target-script anchor — the 4B model was misinterpreting ambiguous 2-letter codes (si → Slovene, or → Spanish, af → Hindi). Recovered ~21 PASS.

Detect

Pipeline:

Within-script overrides (pre-lingua): ৰ/ৱ → as, Sorani letters (ێ ۆ ڕ ڵ ڤ) → ckb.
Script-only fast path: ml, kn, ta, te, gu, pa, or, km, lo, my, bo, si, am.
Lingua (75 base codes).
Within-script disambiguation: Chinese Simp/Trad, sr-Cyrl/sr-Latn, az-*, pa-*, mn-*, he → yi (Yiddish niqqud + ligatures).
Dialect markers (Aho-Corasick, streaming early-return): pt-BR/pt-PT, en-US/en-GB, fr-CA/fr-FR, es-MX/es-ES, zh-TW → zh-HK, hi/mr → ne.

LanguageDetectionResult returns both language: String (raw) and translate_language: Option<Language> (translate-side equivalent via FromStr). One mapping table — same FromStr does input parsing and detect-output mapping. When auto-detect returns a lingua-only code (cy, ka, eu, …), the engine returns UnsupportedLanguage.

Evaluation harness (`eval/`)

Self-contained, no third-party API calls. make calibrate runs every candidate against the live API and scores with Google's MetricX-23-QE. English source sentences plus a parallel German set (German source for en/en-GB/en-US — MetricX is OOD on identity en→en). Drove every drop/keep decision in this PR.

Breaking wire changes

LanguageDetectionResult drops translation_supported; use translate_language !== null.
Language::FromStr no longer collapses pt-br → pt, zh-cn → zh, fr-ca → fr, es-mx → es.
Chinese detect output is zh-CN / zh-TW (was zh-Hans / zh-Hant; script form still accepted as input alias).
18 enum variants removed.

/languages wire format unchanged — still {"languages": [code, ...]}. New ?for=detect query returns the broader detect-side coverage.

See CHANGELOG.md for the full breakdown.

Test plan

Translate side: 70 Language variants (was 55) — adds 4 base codes (He, Is, Fil, Zu) and 11 regional pairs from WMT24++ (ar_EG, ar_SA, es_MX, fr_CA, fr_FR, pt_BR, pt_PT, sw_KE, sw_TZ, zh_CN, zh_TW). FromStr accepts BCP 47 in dash or underscore form, case-insensitive, with aliases for zh-Hans/zh-Hant, nb/nn, tl, iw. Unknown region tags fall back to the base language. Prompt now uses BCP 47 codes (matches the official chat-template format). Detect side: Detector returns String (BCP 47), broader than the translate enum. Pipeline: lingua → script disambiguation (zh-CN/zh-TW, sr-Cyrl/sr-Latn, az-*, pa-Guru/pa-Arab, mn-*) → heuristic dialect markers (pt-BR/pt-PT, en-US/en-GB, fr-CA/fr-FR via Aho-Corasick with streaming early-return) → Malayalam fallback. LanguageDetectionResult exposes both raw `language: String` and `translate_language: Option<Language>` so callers see the precise detector output and the translate-side equivalent — populated via the same FromStr used for input parsing (one mapping table, two consumers). Boundary contract: when auto-detect returns a lingua-only code (cy, ka, eu, …), engine returns UnsupportedLanguage rather than silently falling back. Surface: CLI gains `languages --for translate|detect`, API gains `/languages?for=translate|detect`. detect-language exposes translate_language field. Breaking wire changes: /languages response is now `[{code,name},...]` (was `[code,...]`); LanguageDetectionResult drops translation_supported (use translate_language !== null); FromStr regional collapse aliases removed (pt-br no longer collapses to pt — it parses to pt_BR); Chinese detect output is region form (zh-CN/zh-TW) instead of script form (zh-Hans/zh-Hant). Adds translator-core/src/dialect.rs with curated marker tables and streaming early-return scoring (commit when winner ≥ 2 hits and beats loser by ≥ 2). New aho-corasick dependency. Documentation: README gains 107-entry support table (translate ∪ detect), API/CLI/ENGINE/docs/models updated, full CHANGELOG entry.

Drops a self-contained eval/ harness that exercises every translate-side language against TranslateGemma 4B and scores quality with Google's MetricX-23-QE (no third-party API calls). - harness.py: orchestrates per-language runs against the live API, computes detector consistency rate + per-language MetricX statistics, writes results-<ts>.csv / translations-<ts>.csv - metricx_runner.py: standalone MetricX inference that bypasses transformers' Trainer to avoid MPS device routing issues on Apple Silicon - Makefile: target-driven setup with PYTHON_BIN auto-detection (Python 3.10/3.11 required by metricx pins; 3.12+ lacks wheels for the transformers / sentencepiece versions metricx pins) - sources.txt + sources-de.txt: 30 parallel sentences each — English source for most targets, German source for en/en-GB/en-US (MetricX is OOD when scoring identity translations en→en) - candidates.example.csv: tier-tagged seed list; users copy to candidates.csv and edit to scope a run eval/candidates.csv, calibration.csv, results/* and .metricx/ are gitignored (per-project state).

…rgets Harness runs surfaced a class of failures where the 4B model produced wrong-language or script-salad output for low-resource targets. Two root causes: ambiguous 2-letter codes in the prompt (si → Slovene, or → Spanish, af → Hindi) and over-permissive enum coverage. Prompt format - translate_gemma_prompt now uses full English language names plus a target-script anchor in the system turn (Output only the translated text in {name}, using the native script of {name}). Recovered ~17 PASS on the harness with no regressions on previously-passing languages. Language enum - Drops 18 branch-only variants the model could not actually translate: Ff, Jv, Kac, Ln, Lu, Luo, Mai, Nso, Ny, Om, Sd, Sn, So, To, Wo, Xh, Yo, Zu. Confirmed via translation-output inspection: jv outputs pure Indonesian, pa outputs Hindi (kept pa since it shipped on master), kac outputs Burmese, mi outputs fluent-fake Maori (kept mi — verified real Maori), etc. None of the 18 were in the v0.0.5 release. - Total variants: 129 → 111. Detector - Script-only fast path runs before lingua, commits unambiguous Unicode blocks deterministically: ml, kn, ta, te, gu, pa (Gurmukhi), or, km, lo, my, bo, si, am. Fixes detection where lingua misroutes or doesn't cover. - Within-script disambiguation: bn → as on Assamese-distinctive letters ৰ (U+09F0), ৱ (U+09F1); ar → ckb on Sorani Kurdish letters ێ ۆ ڕ ڵ ڤ; he → yi on Yiddish double-vav/yod digraphs and precomposed ligatures U+05F0–U+05F2. - Dialect markers (Aho-Corasick, dialect.rs) extended: es-MX vs es-ES; zh-HK refinement on top of zh-TW (Cantonese particles); hi → ne when Nepali copula/verb/day-name markers fire (छन्, हुनेछ, गरिने, बिहीबार, etc.). - DETECT_SUPPORTED_CODES list updated with the new emitted codes. Harness summary after these changes: PASS 54 → 75 (+21), FAIL 74 → 52 (-22).

- CHANGELOG.md (Unreleased): expanded Added section with script-only fast path, within-script disambiguators (as, ckb, yi), new dialect pairs (es-MX/es-ES, zh-HK, hi/ne); rewrote prompt-format Changed entry; added Removed entry listing the 18 dropped variants with rationale. - CLAUDE.md: counts updated (translate 70 → 111, detect 95 → 108); detection pipeline section reflects script-only fast path and full set of within-script / dialect refinements; prompt-format description updated. - README.md: am/si/ne/yi rows updated to show detect ✓ + both ✓ (new heuristics); zu row stripped of translate ✓ (dropped); summary line points to GET /languages as the live source of truth instead of asserting a count.

U+06ED is ARABIC SMALL LOW MEEM (a combining mark), not the Sorani-distinctive yeh-with-small-v. Test sorani_kurdish_distinctive_letter_refines_ar caught it.

…arkers Three issues surfaced in the latest harness run: - he regressed 100% → 60%: raw double-vav/yod match false-positived on legitimate Hebrew (`הכנסייה` synagogue, `האוויר` the air). Replaced with niqqud (U+05B0–U+05BD, U+05BF) + precomposed ligature (U+05F0–U+05F2) check — modern Hebrew prose rarely uses niqqud, Yiddish requires it. - as stuck at 73% / ckb at 77%: when text mixed Devanagari + Bengali or Arabic + Persian, lingua picked the wrong base (hi/fa) so the bn→as and ar→ckb refinement never ran. Added detect_within_script_override that fires before lingua on Assamese letters (ৰ/ৱ) and Sorani letters (ێ ۆ ڕ ڵ ڤ), overriding lingua entirely. - ne stuck at 3%: only 1 Nepali marker per typical sentence, below the ≥2 commit threshold. Expanded marker set with 13 high-frequency forms (गर्नुहोस्, सक्नुहुन्छ, उनले, तपाईंले, आइपुग्यो, बिहान, etc.) so 2+ fire on most Nepali outputs. Also fleshed out script_only_name with the new script-only codes (kn, ta, te, gu, pa, or, as, ckb, yi) so detect_with_confidence reports the right display name.

…arkers (v2) Swaps order so within-script overrides (as/ckb) run BEFORE the unique-script fast path. Otherwise stray Tamil/Bengali codepoints in mixed-script outputs beat the distinctive-letter override. Enriches Nepali marker list from 24 → 65: tense morphology (रहेको, भएका, थिए, थियो), vocabulary (नयाँ, अर्को, ठूलो, राम्रो, गाउँ, बुबा, सबैभन्दा), pronouns (हामी, हाम्रो, आफू), distinctive postpositions (लागि, अगाडि). Adds 22 Hindi-distinctive markers (की, नया, बड़ा, रहा है, रहे हैं, सकते हैं, पहुँच) so the Hindi side actively votes against ne when present.

Lingua often misroutes Nepali text to mr (Marathi shares Devanagari script and similar morphology). The dialect dispatch only handled 'hi' as base, so mr-detected Nepali fell through unchanged. Both hi and mr now feed into hi_ne_pair so Nepali markers can refine either base toward 'ne' on commit. Added a regression test for the past-perfect phrase 'भएका थिए' that surfaced in harness output.

The earlier change to return {code, name} objects was unnecessary — Language serializes as its BCP 47 code string already, and clients can call .code() / .full_name() on the deserialized enum without a name field on the wire. This restores v0.0.5 wire compatibility for /languages. - /languages?for=translate → Vec<Language> (same wire format as main). - /languages?for=detect → Vec<String> (new endpoint; detect emits codes outside the translate enum so it can't be Vec<Language>). - Removed LanguageEntry from translator-core::types. - translator-api-client::languages() returns Vec<Language>; new languages_detect() returns Vec<String>.

deankarn added 5 commits May 8, 2026 10:56

Fix Sorani Kurdish codepoint — ێ is U+06CE not U+06ED

e28917f

U+06ED is ARABIC SMALL LOW MEEM (a combining mark), not the Sorani-distinctive yeh-with-small-v. Test sorani_kurdish_distinctive_letter_refines_ar caught it.

deankarn mentioned this pull request May 11, 2026

Add locale evaluation harness #8

Closed

6 tasks

deankarn added 4 commits May 11, 2026 16:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regional language variant support (translate + detect)#7

Regional language variant support (translate + detect)#7
deankarn wants to merge 9 commits into
mainfrom
region-language-support

deankarn commented May 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

deankarn commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Translate

Detect

Evaluation harness (eval/)

Breaking wire changes

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

deankarn commented May 8, 2026 •

edited

Loading

Evaluation harness (`eval/`)