Skip to content

Regional language variant support (translate + detect)#7

Draft
deankarn wants to merge 9 commits into
mainfrom
region-language-support
Draft

Regional language variant support (translate + detect)#7
deankarn wants to merge 9 commits into
mainfrom
region-language-support

Conversation

@deankarn
Copy link
Copy Markdown
Contributor

@deankarn deankarn commented May 8, 2026

Summary

Adds regional locale variants to the translate-side Language enum, expands the detector to emit broader BCP 47 codes, and ships a MetricX-based evaluation harness used to curate the final set.

Final shape: 111 translate variants, 108 detect codes. Harness PASS rate climbed from 52 → 74 over the branch.

Translate

  • WMT24++ base + 11 regional pairs (ar_EG, ar_SA, es_MX, fr_CA, fr_FR, pt_BR, pt_PT, sw_KE, sw_TZ, zh_CN, zh_TW); 3 added base (He, Is, Fil); 4 added regional (en_GB, en_US, es_ES, zh_HK); harness-validated additions for the rest.
  • 18 originally-added Tier B/C/D codes dropped after the model produced wrong-language or script-salad output for them (ff, jv, kac, ln, lu, luo, mai, nso, ny, om, sd, sn, so, to, wo, xh, yo, zu). All 18 were branch-only — none shipped in v0.0.5.
  • Variant naming: pt_BR style under #[allow(non_camel_case_types)]; code() → BCP 47 dash form; full_name() → English label.
  • Prompt format switched from BCP 47 codes to full English names + a target-script anchor — the 4B model was misinterpreting ambiguous 2-letter codes (si → Slovene, or → Spanish, af → Hindi). Recovered ~21 PASS.

Detect

Pipeline:

  1. Within-script overrides (pre-lingua): /as, Sorani letters (ێ ۆ ڕ ڵ ڤ) → ckb.
  2. Script-only fast path: ml, kn, ta, te, gu, pa, or, km, lo, my, bo, si, am.
  3. Lingua (75 base codes).
  4. Within-script disambiguation: Chinese Simp/Trad, sr-Cyrl/sr-Latn, az-*, pa-*, mn-*, heyi (Yiddish niqqud + ligatures).
  5. Dialect markers (Aho-Corasick, streaming early-return): pt-BR/pt-PT, en-US/en-GB, fr-CA/fr-FR, es-MX/es-ES, zh-TWzh-HK, hi/mrne.

LanguageDetectionResult returns both language: String (raw) and translate_language: Option<Language> (translate-side equivalent via FromStr). One mapping table — same FromStr does input parsing and detect-output mapping. When auto-detect returns a lingua-only code (cy, ka, eu, …), the engine returns UnsupportedLanguage.

Evaluation harness (eval/)

Self-contained, no third-party API calls. make calibrate runs every candidate against the live API and scores with Google's MetricX-23-QE. English source sentences plus a parallel German set (German source for en/en-GB/en-US — MetricX is OOD on identity en→en). Drove every drop/keep decision in this PR.

Breaking wire changes

  • LanguageDetectionResult drops translation_supported; use translate_language !== null.
  • Language::FromStr no longer collapses pt-brpt, zh-cnzh, fr-cafr, es-mxes.
  • Chinese detect output is zh-CN / zh-TW (was zh-Hans / zh-Hant; script form still accepted as input alias).
  • 18 enum variants removed.

/languages wire format unchanged — still {"languages": [code, ...]}. New ?for=detect query returns the broader detect-side coverage.

See CHANGELOG.md for the full breakdown.

Test plan

  • cargo test --workspace passes.
  • cargo clippy --workspace --all-targets -- -D warnings clean.
  • cargo run -p translator-cli -- translate -t "How are you?" -l pt-BR,pt-PT — outputs differ.
  • cargo run -p translator-cli -- detect-language "Dette er Bokmål" --output jsonlanguage: "nb", translate_language: "no".
  • cargo run -p translator-cli -- detect "繁體中文測試"zh-TW.
  • cargo run -p translator-cli -- detect "ৰোগী আছে"as.
  • cargo run -p translator-cli -- detect "די קאָנפערענס"yi.
  • cargo run -p translator-cli -- detect "ئەو کوێرە"ckb.
  • cargo run -p translator-cli -- detect "बिहीबार गरिने छ"ne.
  • cargo run -p translator-cli -- translate -t "Croeso i Gymru" -l enUnsupportedLanguage (boundary contract).
  • curl http://localhost:3000/languages | jq '.languages[0]'"af" (still an array of strings, v0.0.5-compatible).
  • cd eval && make calibrate — harness runs end-to-end, PASS ≥ 74.

deankarn added 5 commits May 8, 2026 10:56
Translate side: 70 Language variants (was 55) — adds 4 base codes
(He, Is, Fil, Zu) and 11 regional pairs from WMT24++ (ar_EG, ar_SA,
es_MX, fr_CA, fr_FR, pt_BR, pt_PT, sw_KE, sw_TZ, zh_CN, zh_TW).
FromStr accepts BCP 47 in dash or underscore form, case-insensitive,
with aliases for zh-Hans/zh-Hant, nb/nn, tl, iw. Unknown region tags
fall back to the base language. Prompt now uses BCP 47 codes
(matches the official chat-template format).

Detect side: Detector returns String (BCP 47), broader than the
translate enum. Pipeline: lingua → script disambiguation
(zh-CN/zh-TW, sr-Cyrl/sr-Latn, az-*, pa-Guru/pa-Arab, mn-*) → heuristic
dialect markers (pt-BR/pt-PT, en-US/en-GB, fr-CA/fr-FR via
Aho-Corasick with streaming early-return) → Malayalam fallback.
LanguageDetectionResult exposes both raw `language: String` and
`translate_language: Option<Language>` so callers see the precise
detector output and the translate-side equivalent — populated via
the same FromStr used for input parsing (one mapping table, two
consumers).

Boundary contract: when auto-detect returns a lingua-only code (cy,
ka, eu, …), engine returns UnsupportedLanguage rather than silently
falling back.

Surface: CLI gains `languages --for translate|detect`, API gains
`/languages?for=translate|detect`. detect-language exposes
translate_language field.

Breaking wire changes: /languages response is now `[{code,name},...]`
(was `[code,...]`); LanguageDetectionResult drops translation_supported
(use translate_language !== null); FromStr regional collapse aliases
removed (pt-br no longer collapses to pt — it parses to pt_BR);
Chinese detect output is region form (zh-CN/zh-TW) instead of script
form (zh-Hans/zh-Hant).

Adds translator-core/src/dialect.rs with curated marker tables and
streaming early-return scoring (commit when winner ≥ 2 hits and
beats loser by ≥ 2). New aho-corasick dependency.

Documentation: README gains 107-entry support table (translate ∪
detect), API/CLI/ENGINE/docs/models updated, full CHANGELOG entry.
Drops a self-contained eval/ harness that exercises every translate-side
language against TranslateGemma 4B and scores quality with Google's MetricX-23-QE
(no third-party API calls).

- harness.py: orchestrates per-language runs against the live API, computes
  detector consistency rate + per-language MetricX statistics, writes
  results-<ts>.csv / translations-<ts>.csv
- metricx_runner.py: standalone MetricX inference that bypasses transformers'
  Trainer to avoid MPS device routing issues on Apple Silicon
- Makefile: target-driven setup with PYTHON_BIN auto-detection (Python 3.10/3.11
  required by metricx pins; 3.12+ lacks wheels for the transformers / sentencepiece
  versions metricx pins)
- sources.txt + sources-de.txt: 30 parallel sentences each — English source for
  most targets, German source for en/en-GB/en-US (MetricX is OOD when scoring
  identity translations en→en)
- candidates.example.csv: tier-tagged seed list; users copy to candidates.csv
  and edit to scope a run

eval/candidates.csv, calibration.csv, results/* and .metricx/ are gitignored
(per-project state).
…rgets

Harness runs surfaced a class of failures where the 4B model produced
wrong-language or script-salad output for low-resource targets. Two root
causes: ambiguous 2-letter codes in the prompt (si → Slovene, or → Spanish,
af → Hindi) and over-permissive enum coverage.

Prompt format
- translate_gemma_prompt now uses full English language names plus a
  target-script anchor in the system turn (Output only the translated text
  in {name}, using the native script of {name}). Recovered ~17 PASS on the
  harness with no regressions on previously-passing languages.

Language enum
- Drops 18 branch-only variants the model could not actually translate:
  Ff, Jv, Kac, Ln, Lu, Luo, Mai, Nso, Ny, Om, Sd, Sn, So, To, Wo, Xh, Yo, Zu.
  Confirmed via translation-output inspection: jv outputs pure Indonesian,
  pa outputs Hindi (kept pa since it shipped on master), kac outputs Burmese,
  mi outputs fluent-fake Maori (kept mi — verified real Maori), etc.
  None of the 18 were in the v0.0.5 release.
- Total variants: 129 → 111.

Detector
- Script-only fast path runs before lingua, commits unambiguous Unicode
  blocks deterministically: ml, kn, ta, te, gu, pa (Gurmukhi), or, km, lo,
  my, bo, si, am. Fixes detection where lingua misroutes or doesn't cover.
- Within-script disambiguation: bn → as on Assamese-distinctive letters
  ৰ (U+09F0), ৱ (U+09F1); ar → ckb on Sorani Kurdish letters ێ ۆ ڕ ڵ ڤ;
  he → yi on Yiddish double-vav/yod digraphs and precomposed ligatures
  U+05F0–U+05F2.
- Dialect markers (Aho-Corasick, dialect.rs) extended: es-MX vs es-ES;
  zh-HK refinement on top of zh-TW (Cantonese particles); hi → ne when
  Nepali copula/verb/day-name markers fire (छन्, हुनेछ, गरिने, बिहीबार, etc.).
- DETECT_SUPPORTED_CODES list updated with the new emitted codes.

Harness summary after these changes: PASS 54 → 75 (+21), FAIL 74 → 52 (-22).
- CHANGELOG.md (Unreleased): expanded Added section with script-only fast
  path, within-script disambiguators (as, ckb, yi), new dialect pairs
  (es-MX/es-ES, zh-HK, hi/ne); rewrote prompt-format Changed entry; added
  Removed entry listing the 18 dropped variants with rationale.
- CLAUDE.md: counts updated (translate 70 → 111, detect 95 → 108); detection
  pipeline section reflects script-only fast path and full set of within-script
  / dialect refinements; prompt-format description updated.
- README.md: am/si/ne/yi rows updated to show detect ✓ + both ✓ (new heuristics);
  zu row stripped of translate ✓ (dropped); summary line points to
  GET /languages as the live source of truth instead of asserting a count.
U+06ED is ARABIC SMALL LOW MEEM (a combining mark), not the Sorani-distinctive
yeh-with-small-v. Test sorani_kurdish_distinctive_letter_refines_ar caught it.
@deankarn deankarn mentioned this pull request May 11, 2026
6 tasks
deankarn added 4 commits May 11, 2026 16:28
…arkers

Three issues surfaced in the latest harness run:

- he regressed 100% → 60%: raw double-vav/yod match false-positived on
  legitimate Hebrew (`הכנסייה` synagogue, `האוויר` the air). Replaced with
  niqqud (U+05B0–U+05BD, U+05BF) + precomposed ligature (U+05F0–U+05F2)
  check — modern Hebrew prose rarely uses niqqud, Yiddish requires it.
- as stuck at 73% / ckb at 77%: when text mixed Devanagari + Bengali or
  Arabic + Persian, lingua picked the wrong base (hi/fa) so the bn→as
  and ar→ckb refinement never ran. Added detect_within_script_override
  that fires before lingua on Assamese letters (ৰ/ৱ) and Sorani letters
  (ێ ۆ ڕ ڵ ڤ), overriding lingua entirely.
- ne stuck at 3%: only 1 Nepali marker per typical sentence, below the
  ≥2 commit threshold. Expanded marker set with 13 high-frequency forms
  (गर्नुहोस्, सक्नुहुन्छ, उनले, तपाईंले, आइपुग्यो, बिहान, etc.) so 2+ fire on
  most Nepali outputs.

Also fleshed out script_only_name with the new script-only codes (kn, ta,
te, gu, pa, or, as, ckb, yi) so detect_with_confidence reports the right
display name.
…arkers (v2)

Swaps order so within-script overrides (as/ckb) run BEFORE the unique-script
fast path. Otherwise stray Tamil/Bengali codepoints in mixed-script outputs
beat the distinctive-letter override.

Enriches Nepali marker list from 24 → 65: tense morphology (रहेको, भएका,
थिए, थियो), vocabulary (नयाँ, अर्को, ठूलो, राम्रो, गाउँ, बुबा, सबैभन्दा),
pronouns (हामी, हाम्रो, आफू), distinctive postpositions (लागि, अगाडि).
Adds 22 Hindi-distinctive markers (की, नया, बड़ा, रहा है, रहे हैं, सकते हैं,
पहुँच) so the Hindi side actively votes against ne when present.
Lingua often misroutes Nepali text to mr (Marathi shares Devanagari script
and similar morphology). The dialect dispatch only handled 'hi' as base, so
mr-detected Nepali fell through unchanged.

Both hi and mr now feed into hi_ne_pair so Nepali markers can refine either
base toward 'ne' on commit. Added a regression test for the past-perfect
phrase 'भएका थिए' that surfaced in harness output.
The earlier change to return {code, name} objects was unnecessary — Language
serializes as its BCP 47 code string already, and clients can call
.code() / .full_name() on the deserialized enum without a name field on
the wire. This restores v0.0.5 wire compatibility for /languages.

- /languages?for=translate → Vec<Language> (same wire format as main).
- /languages?for=detect → Vec<String> (new endpoint; detect emits codes
  outside the translate enum so it can't be Vec<Language>).
- Removed LanguageEntry from translator-core::types.
- translator-api-client::languages() returns Vec<Language>; new
  languages_detect() returns Vec<String>.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant