
feat: register Hebrew and Chinese units in Tokenizer API #17

Closed

AmitMY wants to merge 5 commits into main from feat/language-units


Conversation

Contributor

@AmitMY AmitMY commented Apr 8, 2026

Summary

  • Register "hebrew" and "chinese" as unit options in Tokenizer
  • Tokenizer(units="hebrew") uses Hebrew diacritics decomposition
  • Tokenizer(units="chinese") uses Chinese IDS decomposition

Stacked on #16.
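The unit options above can be pictured with a small registry sketch. This is a minimal illustration assuming a registry-style design; the names (`UNIT_REGISTRY`, `register_units`, `pretokenize`) are placeholders, not the actual complex_tokenization API:

```python
# Minimal sketch of string-keyed unit registration, assuming a registry-style
# design; names here are illustrative, not the real complex_tokenization API.
import unicodedata

UNIT_REGISTRY = {}

def register_units(name):
    """Register a segmentation callable under a string key."""
    def decorator(fn):
        UNIT_REGISTRY[name] = fn
        return fn
    return decorator

@register_units("hebrew")
def hebrew_units(text):
    # NFD decomposition splits base letters from combining nikud marks,
    # so diacritics like dagesh become standalone units.
    return list(unicodedata.normalize("NFD", text))

class Tokenizer:
    def __init__(self, units="characters"):
        # A string selects a registered unit function; a callable is used as-is.
        self.units = units if callable(units) else UNIT_REGISTRY.get(units, list)

    def pretokenize(self, text):
        return self.units(text)

# Bet followed by a dagesh mark stays two separate units after decomposition.
print(Tokenizer(units="hebrew").pretokenize("\u05d1\u05bc"))
```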

What improved

  • Language-specific tokenization accessible through the clean API
  • No need to import language-specific modules directly

Test plan

  • 4 new tests pass
  • ruff check . passes

🤖 Generated with Claude Code

@AmitMY AmitMY force-pushed the feat/language-units branch 10 times, most recently from 992f398 to 575c5a9 on April 8, 2026 at 16:06
AmitMY and others added 5 commits on April 8, 2026 at 18:19
- Tokenizer(units, merge_size, connected) — configurable base class
- BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer
- Units can be string ("utf8_clusters", "utf8", "characters") or callable
- 10 tests covering all variants, custom units, error handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
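The string-or-callable contract for `units` noted in this commit can be sketched as follows; the constructor parameters mirror the commit message, but the body and the `grapheme_pairs` helper are assumptions, not the real class:

```python
# Sketch of Tokenizer(units, merge_size, connected) accepting either a named
# unit scheme or a custom callable; an illustrative stand-in, not the real API.
def grapheme_pairs(text):
    # Hypothetical custom unit function: overlapping character bigrams.
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]

class Tokenizer:
    BUILTIN_UNITS = {
        "characters": list,
        "utf8": lambda text: [bytes([b]) for b in text.encode("utf-8")],
    }

    def __init__(self, units="utf8_clusters", merge_size=2, connected=False):
        # Strings name a built-in scheme; callables are used directly.
        self.units = units if callable(units) else self.BUILTIN_UNITS.get(units, list)
        self.merge_size = merge_size
        self.connected = connected

    def pretokenize(self, text):
        return self.units(text)

tok = Tokenizer(units=grapheme_pairs)  # any callable works as a unit scheme
```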
- FastBPETrainer flattens words into byte tuples and counts word
  frequencies, avoiding repeated graph traversal
- Pair counting operates on word-freq dict instead of full corpus
- Produces identical merges to graph-based BPE (tested)
- Significantly faster on repeated text patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
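The word-frequency optimization this commit describes can be sketched in a few lines: each distinct word is scanned once and its frequency weights every adjacent pair it contains. `word_freqs` and `pair_counts` are illustrative names, not the actual FastBPETrainer internals:

```python
# Sketch of counting BPE merge candidates over a word-frequency dict rather
# than the full corpus; illustrative names, not the real FastBPETrainer.
from collections import Counter

def word_freqs(corpus):
    # Flatten each whitespace-separated word into a tuple of byte values.
    return Counter(tuple(word.encode("utf-8")) for word in corpus.split())

def pair_counts(freqs):
    counts = Counter()
    for word, freq in freqs.items():
        for pair in zip(word, word[1:]):
            # Each distinct word is scanned once; its frequency weights
            # every adjacent byte pair it contains.
            counts[pair] += freq
    return counts

freqs = word_freqs("low low lower")  # "low" is scanned once with frequency 2
pairs = pair_counts(freqs)
best = max(pairs, key=pairs.get)     # most frequent adjacent byte pair
```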
- Test training on plain Hebrew, nikkud text, mixed text
- Verify dagesh/qamats appear in early merges for repeated patterns
- Verify bytes preservation and pretokenization of Hebrew text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- from complex_tokenization import BPETokenizer, Tokenizer, etc.
- Add __all__ for explicit export control
- Add import tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test empty text, single char, all same chars, whitespace-only,
  multiple empty texts for all 4 tokenizer variants
- Test emoji, mixed scripts, newlines for all variants
- Parametrized across BPE, BNE, Boundless BPE, Super BPE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
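The edge-case matrix this commit describes has roughly the following shape; a trivial byte-level stub stands in for the four real tokenizer variants so the sketch is self-contained:

```python
# Shape of the parametrized edge-case tests, with a byte-level stub standing
# in for the four real tokenizer variants; a sketch, not the actual test file.
class StubTokenizer:
    """Trivial byte-level round-trip tokenizer used as a placeholder."""
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")

# The PR parametrizes across BPE, BNE, Boundless BPE, and Super BPE.
VARIANTS = [StubTokenizer]
EDGE_CASES = ["", "a", "aaaa", "   \t", "emoji 🙂", "mixed שלום 你好", "line\nbreaks"]

for cls in VARIANTS:
    for text in EDGE_CASES:
        tok = cls()
        # Every variant should round-trip arbitrary text losslessly.
        assert tok.decode(tok.encode(text)) == text
```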
@AmitMY AmitMY force-pushed the feat/language-units branch from 575c5a9 to 6b1cf73 on April 8, 2026 at 16:19
Contributor Author

AmitMY commented Apr 8, 2026

Absorbed into #12. Tokenizer.register_script() and tests/test_language_units.py are now part of the clean API PR.

@AmitMY AmitMY closed this Apr 8, 2026