Regional language variant support (translate + detect)#7
Draft
deankarn wants to merge 9 commits into
Draft
Conversation
Translate side: 70 Language variants (was 55) — adds 4 base codes
(He, Is, Fil, Zu) and 11 regional pairs from WMT24++ (ar_EG, ar_SA,
es_MX, fr_CA, fr_FR, pt_BR, pt_PT, sw_KE, sw_TZ, zh_CN, zh_TW).
FromStr accepts BCP 47 in dash or underscore form, case-insensitive,
with aliases for zh-Hans/zh-Hant, nb/nn, tl, iw. Unknown region tags
fall back to the base language. Prompt now uses BCP 47 codes
(matches the official chat-template format).
Detect side: Detector returns String (BCP 47), broader than the
translate enum. Pipeline: lingua → script disambiguation
(zh-CN/zh-TW, sr-Cyrl/sr-Latn, az-*, pa-Guru/pa-Arab, mn-*) → heuristic
dialect markers (pt-BR/pt-PT, en-US/en-GB, fr-CA/fr-FR via
Aho-Corasick with streaming early-return) → Malayalam fallback.
LanguageDetectionResult exposes both raw `language: String` and
`translate_language: Option<Language>` so callers see the precise
detector output and the translate-side equivalent — populated via
the same FromStr used for input parsing (one mapping table, two
consumers).
Boundary contract: when auto-detect returns a lingua-only code (cy,
ka, eu, …), engine returns UnsupportedLanguage rather than silently
falling back.
Surface: CLI gains `languages --for translate|detect`, API gains
`/languages?for=translate|detect`. detect-language exposes
translate_language field.
Breaking wire changes: /languages response is now `[{code,name},...]`
(was `[code,...]`); LanguageDetectionResult drops translation_supported
(use translate_language !== null); FromStr regional collapse aliases
removed (pt-br no longer collapses to pt — it parses to pt_BR);
Chinese detect output is region form (zh-CN/zh-TW) instead of script
form (zh-Hans/zh-Hant).
Adds translator-core/src/dialect.rs with curated marker tables and
streaming early-return scoring (commit when winner ≥ 2 hits and
beats loser by ≥ 2). New aho-corasick dependency.
Documentation: README gains 107-entry support table (translate ∪
detect), API/CLI/ENGINE/docs/models updated, full CHANGELOG entry.
Drops a self-contained eval/ harness that exercises every translate-side language against TranslateGemma 4B and scores quality with Google's MetricX-23-QE (no third-party API calls). - harness.py: orchestrates per-language runs against the live API, computes detector consistency rate + per-language MetricX statistics, writes results-<ts>.csv / translations-<ts>.csv - metricx_runner.py: standalone MetricX inference that bypasses transformers' Trainer to avoid MPS device routing issues on Apple Silicon - Makefile: target-driven setup with PYTHON_BIN auto-detection (Python 3.10/3.11 required by metricx pins; 3.12+ lacks wheels for the transformers / sentencepiece versions metricx pins) - sources.txt + sources-de.txt: 30 parallel sentences each — English source for most targets, German source for en/en-GB/en-US (MetricX is OOD when scoring identity translations en→en) - candidates.example.csv: tier-tagged seed list; users copy to candidates.csv and edit to scope a run eval/candidates.csv, calibration.csv, results/* and .metricx/ are gitignored (per-project state).
…rgets
Harness runs surfaced a class of failures where the 4B model produced
wrong-language or script-salad output for low-resource targets. Two root
causes: ambiguous 2-letter codes in the prompt (si → Slovene, or → Spanish,
af → Hindi) and over-permissive enum coverage.
Prompt format
- translate_gemma_prompt now uses full English language names plus a
target-script anchor in the system turn (Output only the translated text
in {name}, using the native script of {name}). Recovered ~17 PASS on the
harness with no regressions on previously-passing languages.
Language enum
- Drops 18 branch-only variants the model could not actually translate:
Ff, Jv, Kac, Ln, Lu, Luo, Mai, Nso, Ny, Om, Sd, Sn, So, To, Wo, Xh, Yo, Zu.
Confirmed via translation-output inspection: jv outputs pure Indonesian,
pa outputs Hindi (kept pa since it shipped on master), kac outputs Burmese,
mi outputs fluent-fake Maori (kept mi — verified real Maori), etc.
None of the 18 were in the v0.0.5 release.
- Total variants: 129 → 111.
Detector
- Script-only fast path runs before lingua, commits unambiguous Unicode
blocks deterministically: ml, kn, ta, te, gu, pa (Gurmukhi), or, km, lo,
my, bo, si, am. Fixes detection where lingua misroutes or doesn't cover.
- Within-script disambiguation: bn → as on Assamese-distinctive letters
ৰ (U+09F0), ৱ (U+09F1); ar → ckb on Sorani Kurdish letters ێ ۆ ڕ ڵ ڤ;
he → yi on Yiddish double-vav/yod digraphs and precomposed ligatures
U+05F0–U+05F2.
- Dialect markers (Aho-Corasick, dialect.rs) extended: es-MX vs es-ES;
zh-HK refinement on top of zh-TW (Cantonese particles); hi → ne when
Nepali copula/verb/day-name markers fire (छन्, हुनेछ, गरिने, बिहीबार, etc.).
- DETECT_SUPPORTED_CODES list updated with the new emitted codes.
Harness summary after these changes: PASS 54 → 75 (+21), FAIL 74 → 52 (-22).
- CHANGELOG.md (Unreleased): expanded Added section with script-only fast path, within-script disambiguators (as, ckb, yi), new dialect pairs (es-MX/es-ES, zh-HK, hi/ne); rewrote prompt-format Changed entry; added Removed entry listing the 18 dropped variants with rationale. - CLAUDE.md: counts updated (translate 70 → 111, detect 95 → 108); detection pipeline section reflects script-only fast path and full set of within-script / dialect refinements; prompt-format description updated. - README.md: am/si/ne/yi rows updated to show detect ✓ + both ✓ (new heuristics); zu row stripped of translate ✓ (dropped); summary line points to GET /languages as the live source of truth instead of asserting a count.
U+06ED is ARABIC SMALL LOW MEEM (a combining mark), not the Sorani-distinctive yeh-with-small-v. Test sorani_kurdish_distinctive_letter_refines_ar caught it.
…arkers Three issues surfaced in the latest harness run: - he regressed 100% → 60%: raw double-vav/yod match false-positived on legitimate Hebrew (`הכנסייה` synagogue, `האוויר` the air). Replaced with niqqud (U+05B0–U+05BD, U+05BF) + precomposed ligature (U+05F0–U+05F2) check — modern Hebrew prose rarely uses niqqud, Yiddish requires it. - as stuck at 73% / ckb at 77%: when text mixed Devanagari + Bengali or Arabic + Persian, lingua picked the wrong base (hi/fa) so the bn→as and ar→ckb refinement never ran. Added detect_within_script_override that fires before lingua on Assamese letters (ৰ/ৱ) and Sorani letters (ێ ۆ ڕ ڵ ڤ), overriding lingua entirely. - ne stuck at 3%: only 1 Nepali marker per typical sentence, below the ≥2 commit threshold. Expanded marker set with 13 high-frequency forms (गर्नुहोस्, सक्नुहुन्छ, उनले, तपाईंले, आइपुग्यो, बिहान, etc.) so 2+ fire on most Nepali outputs. Also fleshed out script_only_name with the new script-only codes (kn, ta, te, gu, pa, or, as, ckb, yi) so detect_with_confidence reports the right display name.
…arkers (v2) Swaps order so within-script overrides (as/ckb) run BEFORE the unique-script fast path. Otherwise stray Tamil/Bengali codepoints in mixed-script outputs beat the distinctive-letter override. Enriches Nepali marker list from 24 → 65: tense morphology (रहेको, भएका, थिए, थियो), vocabulary (नयाँ, अर्को, ठूलो, राम्रो, गाउँ, बुबा, सबैभन्दा), pronouns (हामी, हाम्रो, आफू), distinctive postpositions (लागि, अगाडि). Adds 22 Hindi-distinctive markers (की, नया, बड़ा, रहा है, रहे हैं, सकते हैं, पहुँच) so the Hindi side actively votes against ne when present.
Lingua often misroutes Nepali text to mr (Marathi shares Devanagari script and similar morphology). The dialect dispatch only handled 'hi' as base, so mr-detected Nepali fell through unchanged. Both hi and mr now feed into hi_ne_pair so Nepali markers can refine either base toward 'ne' on commit. Added a regression test for the past-perfect phrase 'भएका थिए' that surfaced in harness output.
The earlier change to return {code, name} objects was unnecessary — Language
serializes as its BCP 47 code string already, and clients can call
.code() / .full_name() on the deserialized enum without a name field on
the wire. This restores v0.0.5 wire compatibility for /languages.
- /languages?for=translate → Vec<Language> (same wire format as main).
- /languages?for=detect → Vec<String> (new endpoint; detect emits codes
outside the translate enum so it can't be Vec<Language>).
- Removed LanguageEntry from translator-core::types.
- translator-api-client::languages() returns Vec<Language>; new
languages_detect() returns Vec<String>.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds regional locale variants to the translate-side
Languageenum, expands the detector to emit broader BCP 47 codes, and ships a MetricX-based evaluation harness used to curate the final set.Final shape: 111 translate variants, 108 detect codes. Harness PASS rate climbed from 52 → 74 over the branch.
Translate
ar_EG,ar_SA,es_MX,fr_CA,fr_FR,pt_BR,pt_PT,sw_KE,sw_TZ,zh_CN,zh_TW); 3 added base (He,Is,Fil); 4 added regional (en_GB,en_US,es_ES,zh_HK); harness-validated additions for the rest.ff,jv,kac,ln,lu,luo,mai,nso,ny,om,sd,sn,so,to,wo,xh,yo,zu). All 18 were branch-only — none shipped in v0.0.5.pt_BRstyle under#[allow(non_camel_case_types)];code()→ BCP 47 dash form;full_name()→ English label.si→ Slovene,or→ Spanish,af→ Hindi). Recovered ~21 PASS.Detect
Pipeline:
ৰ/ৱ→as, Sorani letters (ێ ۆ ڕ ڵ ڤ) →ckb.ml,kn,ta,te,gu,pa,or,km,lo,my,bo,si,am.sr-Cyrl/sr-Latn,az-*,pa-*,mn-*,he→yi(Yiddish niqqud + ligatures).pt-BR/pt-PT,en-US/en-GB,fr-CA/fr-FR,es-MX/es-ES,zh-TW→zh-HK,hi/mr→ne.LanguageDetectionResultreturns bothlanguage: String(raw) andtranslate_language: Option<Language>(translate-side equivalent viaFromStr). One mapping table — sameFromStrdoes input parsing and detect-output mapping. When auto-detect returns a lingua-only code (cy,ka,eu, …), the engine returnsUnsupportedLanguage.Evaluation harness (
eval/)Self-contained, no third-party API calls.
make calibrateruns every candidate against the live API and scores with Google's MetricX-23-QE. English source sentences plus a parallel German set (German source foren/en-GB/en-US— MetricX is OOD on identity en→en). Drove every drop/keep decision in this PR.Breaking wire changes
LanguageDetectionResultdropstranslation_supported; usetranslate_language !== null.Language::FromStrno longer collapsespt-br→pt,zh-cn→zh,fr-ca→fr,es-mx→es.zh-CN/zh-TW(waszh-Hans/zh-Hant; script form still accepted as input alias)./languageswire format unchanged — still{"languages": [code, ...]}. New?for=detectquery returns the broader detect-side coverage.See
CHANGELOG.mdfor the full breakdown.Test plan
cargo test --workspacepasses.cargo clippy --workspace --all-targets -- -D warningsclean.cargo run -p translator-cli -- translate -t "How are you?" -l pt-BR,pt-PT— outputs differ.cargo run -p translator-cli -- detect-language "Dette er Bokmål" --output json—language: "nb",translate_language: "no".cargo run -p translator-cli -- detect "繁體中文測試"→zh-TW.cargo run -p translator-cli -- detect "ৰোগী আছে"→as.cargo run -p translator-cli -- detect "די קאָנפערענס"→yi.cargo run -p translator-cli -- detect "ئەو کوێرە"→ckb.cargo run -p translator-cli -- detect "बिहीबार गरिने छ"→ne.cargo run -p translator-cli -- translate -t "Croeso i Gymru" -l en—UnsupportedLanguage(boundary contract).curl http://localhost:3000/languages | jq '.languages[0]'→"af"(still an array of strings, v0.0.5-compatible).cd eval && make calibrate— harness runs end-to-end, PASS ≥ 74.