
feat: register Hebrew and Chinese units in Tokenizer API #17

Closed

AmitMY wants to merge 5 commits into main from feat/language-units


Conversation

Contributor

@AmitMY AmitMY commented Apr 8, 2026

Summary

  • Register "hebrew" and "chinese" as unit options in Tokenizer
  • Tokenizer(units="hebrew") uses Hebrew diacritics decomposition
  • Tokenizer(units="chinese") uses Chinese IDS decomposition

Stacked on #16.
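The unit options above can be pictured with a small registry sketch. This is a minimal illustration assuming a registry-style design; the names (`UNIT_REGISTRY`, `register_units`, `pretokenize`) are placeholders, not the actual complex_tokenization API:

```python
# Minimal sketch of string-keyed unit registration, assuming a registry-style
# design; names here are illustrative, not the real complex_tokenization API.
import unicodedata

UNIT_REGISTRY = {}

def register_units(name):
    """Register a segmentation callable under a string key."""
    def decorator(fn):
        UNIT_REGISTRY[name] = fn
        return fn
    return decorator

@register_units("hebrew")
def hebrew_units(text):
    # NFD decomposition splits base letters from combining nikud marks,
    # so diacritics like dagesh become standalone units.
    return list(unicodedata.normalize("NFD", text))

class Tokenizer:
    def __init__(self, units="characters"):
        # A string selects a registered unit function; a callable is used as-is.
        self.units = units if callable(units) else UNIT_REGISTRY.get(units, list)

    def pretokenize(self, text):
        return self.units(text)

# Bet followed by a dagesh mark stays two separate units after decomposition.
print(Tokenizer(units="hebrew").pretokenize("\u05d1\u05bc"))
```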

What improved

  • Language-specific tokenization accessible through the clean API
  • No need to import language-specific modules directly

Test plan

  • 4 new tests pass
  • ruff check . passes

🤖 Generated with Claude Code

@AmitMY AmitMY force-pushed the feat/language-units branch 10 times, most recently from 992f398 to 575c5a9 on April 8, 2026 at 16:06
AmitMY and others added 5 commits on April 8, 2026 at 18:19
- Tokenizer(units, merge_size, connected) — configurable base class
- BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer
- Units can be string ("utf8_clusters", "utf8", "characters") or callable
- 10 tests covering all variants, custom units, error handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
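The string-or-callable contract for `units` noted in this commit can be sketched as follows; the constructor parameters mirror the commit message, but the body and the `grapheme_pairs` helper are assumptions, not the real class:

```python
# Sketch of Tokenizer(units, merge_size, connected) accepting either a named
# unit scheme or a custom callable; an illustrative stand-in, not the real API.
def grapheme_pairs(text):
    # Hypothetical custom unit function: overlapping character bigrams.
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]

class Tokenizer:
    BUILTIN_UNITS = {
        "characters": list,
        "utf8": lambda text: [bytes([b]) for b in text.encode("utf-8")],
    }

    def __init__(self, units="utf8_clusters", merge_size=2, connected=False):
        # Strings name a built-in scheme; callables are used directly.
        self.units = units if callable(units) else self.BUILTIN_UNITS.get(units, list)
        self.merge_size = merge_size
        self.connected = connected

    def pretokenize(self, text):
        return self.units(text)

tok = Tokenizer(units=grapheme_pairs)  # any callable works as a unit scheme
```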
- FastBPETrainer flattens words into byte tuples and counts word
  frequencies, avoiding repeated graph traversal
- Pair counting operates on word-freq dict instead of full corpus
- Produces identical merges to graph-based BPE (tested)
- Significantly faster on repeated text patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
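The word-frequency optimization this commit describes can be sketched in a few lines: each distinct word is scanned once and its frequency weights every adjacent pair it contains. `word_freqs` and `pair_counts` are illustrative names, not the actual FastBPETrainer internals:

```python
# Sketch of counting BPE merge candidates over a word-frequency dict rather
# than the full corpus; illustrative names, not the real FastBPETrainer.
from collections import Counter

def word_freqs(corpus):
    # Flatten each whitespace-separated word into a tuple of byte values.
    return Counter(tuple(word.encode("utf-8")) for word in corpus.split())

def pair_counts(freqs):
    counts = Counter()
    for word, freq in freqs.items():
        for pair in zip(word, word[1:]):
            # Each distinct word is scanned once; its frequency weights
            # every adjacent byte pair it contains.
            counts[pair] += freq
    return counts

freqs = word_freqs("low low lower")  # "low" is scanned once with frequency 2
pairs = pair_counts(freqs)
best = max(pairs, key=pairs.get)     # most frequent adjacent byte pair
```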
- Test training on plain Hebrew, nikkud text, mixed text
- Verify dagesh/qamats appear in early merges for repeated patterns
- Verify bytes preservation and pretokenization of Hebrew text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- from complex_tokenization import BPETokenizer, Tokenizer, etc.
- Add __all__ for explicit export control
- Add import tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test empty text, single char, all same chars, whitespace-only,
  multiple empty texts for all 4 tokenizer variants
- Test emoji, mixed scripts, newlines for all variants
- Parametrized across BPE, BNE, Boundless BPE, Super BPE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
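The edge-case matrix this commit describes has roughly the following shape; a trivial byte-level stub stands in for the four real tokenizer variants so the sketch is self-contained:

```python
# Shape of the parametrized edge-case tests, with a byte-level stub standing
# in for the four real tokenizer variants; a sketch, not the actual test file.
class StubTokenizer:
    """Trivial byte-level round-trip tokenizer used as a placeholder."""
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")

# The PR parametrizes across BPE, BNE, Boundless BPE, and Super BPE.
VARIANTS = [StubTokenizer]
EDGE_CASES = ["", "a", "aaaa", "   \t", "emoji 🙂", "mixed שלום 你好", "line\nbreaks"]

for cls in VARIANTS:
    for text in EDGE_CASES:
        tok = cls()
        # Every variant should round-trip arbitrary text losslessly.
        assert tok.decode(tok.encode(text)) == text
```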
@AmitMY AmitMY force-pushed the feat/language-units branch from 575c5a9 to 6b1cf73 on April 8, 2026 at 16:19
Contributor Author

AmitMY commented Apr 8, 2026

Absorbed into #12. Tokenizer.register_script() and tests/test_language_units.py are now part of the clean API PR.

@AmitMY AmitMY closed this Apr 8, 2026