Route byte-level llama tokenizers to TokenizersBackend by subin9 · Pull Request #47017 · huggingface/transformers

subin9 · 2026-07-02T05:11:17Z

Model type "llama" spans both SentencePiece (Llama-1/2) and byte-level (Llama-3 / tiktoken) tokenizers under one Hub tokenizer_class (LlamaTokenizerFast). In v5, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer/decoder, which silently drops spaces for byte-level repos (see #45488), e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-*.

The existing MODEL_IDS_TO_TOKENIZERS_BACKEND allowlist only covers specific checkpoints (e.g. the 8B) and misses others (the 70B is still broken). Instead, for the small set of dual-scheme model types, inspect the serialized tokenizer.json: if it declares a ByteLevel pre_tokenizer/decoder, route to TokenizersBackend (which respects tokenizer.json). SentencePiece Llama-1/2 stays on LlamaTokenizer unchanged.

What does this PR do?

model_type == "llama" covers two incompatible tokenizer schemes under a single Hub tokenizer_class (LlamaTokenizerFast):

- Llama-1/2: SentencePiece (Metaspace, ▁)
- Llama-3 and its derivatives (e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-*): byte-level BPE (GPT-2 / tiktoken, Ġ)
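
To make the two space markers concrete, here is a minimal stdlib-only sketch. It rebuilds the GPT-2 byte-to-character table used by byte-level BPE and contrasts it with a simplified Metaspace substitution. The function names are illustrative and are not part of transformers or tokenizers; the Metaspace version ignores options such as prepending the marker.

```python
def gpt2_bytes_to_unicode() -> dict:
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    printable = (
        list(range(0x21, 0x7F))     # "!" .. "~"
        + list(range(0xA1, 0xAD))   # "¡" .. "¬"
        + list(range(0xAE, 0x100))  # "®" .. "ÿ"
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Remaining bytes (including space, 0x20) are remapped above U+00FF.
            table[b] = chr(256 + shift)
            shift += 1
    return table


def byte_level_surface(text: str) -> str:
    """How text looks after byte-level mapping, before any BPE merges."""
    table = gpt2_bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


def metaspace_surface(text: str) -> str:
    """Simplified Metaspace: replace each space with U+2581."""
    return text.replace(" ", "\u2581")


print(byte_level_surface("Hello world"))  # HelloĠworld
print(metaspace_surface("Hello world"))   # Hello▁world
```

A vocabulary built on one convention has no entry for the other convention's marker, which is why the two schemes cannot share a pre-tokenizer.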

In v5, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer/decoder, overwriting whatever
tokenizer.json declares. For byte-level repos this drops spaces on both encode and decode
("Hello world" → "Helloworld") without raising any error, so accuracy regresses silently (see #45488).

The current mitigation (MODEL_IDS_TO_TOKENIZERS_BACKEND, from #46091) is a per-checkpoint allowlist. It fixes
deepseek-ai/deepseek-r1-distill-llama-8b but misses others — for example DeepSeek-R1-Distill-Llama-70B is
still broken on main (it is not in the list).

Instead of enumerating checkpoints, this PR adds a small content-based rule: for the (rare) set of dual-scheme
model types ({"llama"}), inspect the serialized tokenizer.json; if it declares a ByteLevel
pre-tokenizer/decoder, route to TokenizersBackend (which respects tokenizer.json). SentencePiece Llama-1/2
(no ByteLevel in tokenizer.json) keeps using LlamaTokenizer unchanged. LlamaTokenizer.__init__ is not
modified, so there is no risk to the SentencePiece path.
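
The content-based rule described above can be sketched as follows. This is not the PR's code: the helper names, the `default` parameter, and the choice to accept ByteLevel in either the pre-tokenizer or the decoder are assumptions made for illustration, and the two JSON fragments are trimmed stand-ins for real tokenizer.json files.

```python
import json

DUAL_SCHEME_MODEL_TYPES = {"llama"}  # the set named in the description


def _node_is_byte_level(node) -> bool:
    """True if a serialized pre_tokenizer/decoder node is, or contains, ByteLevel."""
    if not isinstance(node, dict):
        return False  # covers null / missing sections
    if node.get("type") == "ByteLevel":
        return True
    # Sequence nodes nest their children under "pretokenizers" or "decoders".
    children = node.get("pretokenizers") or node.get("decoders") or []
    return any(_node_is_byte_level(child) for child in children)


def tokenizer_json_is_byte_level(tokenizer_json: dict) -> bool:
    # Assumption: a ByteLevel component in either section is enough.
    return _node_is_byte_level(tokenizer_json.get("pre_tokenizer")) or _node_is_byte_level(
        tokenizer_json.get("decoder")
    )


def pick_backend(model_type: str, tokenizer_json: dict, default: str) -> str:
    """Return the backend name to use; `default` is the existing resolution."""
    if model_type in DUAL_SCHEME_MODEL_TYPES and tokenizer_json_is_byte_level(tokenizer_json):
        return "TokenizersBackend"
    return default


LLAMA3_LIKE = json.loads("""
{
  "pre_tokenizer": {"type": "Sequence", "pretokenizers": [
      {"type": "Split"}, {"type": "ByteLevel"}]},
  "decoder": {"type": "ByteLevel"}
}
""")
LLAMA2_LIKE = json.loads("""
{
  "pre_tokenizer": {"type": "Metaspace"},
  "decoder": {"type": "Sequence", "decoders": [
      {"type": "Replace"}, {"type": "ByteFallback"}, {"type": "Fuse"}]}
}
""")

print(pick_backend("llama", LLAMA3_LIKE, default="LlamaTokenizer"))  # TokenizersBackend
print(pick_backend("llama", LLAMA2_LIKE, default="LlamaTokenizer"))  # LlamaTokenizer
```

Because the decision depends only on the serialized file, a new byte-level checkpoint is routed correctly without anyone adding its repo id to a list.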

This covers any byte-level Llama repo (8B, 70B, future distills) without maintaining a checkpoint allowlist.

Fixes #45488 for Llama-3-derived checkpoints.

Verification

Round-trip on "Hello world.\nI'm an AI, so I don't have consciousness." (transformers main):

| repo | tokenizer | before (`main`) | after |
| --- | --- | --- | --- |
| `DeepSeek-R1-Distill-Llama-70B` (not in allowlist) | byte-level | `LlamaTokenizer`, roundtrip False (`"Helloworld.I'manAI."`) | `TokenizersBackend`, roundtrip True |
| `DeepSeek-R1-Distill-Llama-8B` | byte-level | `TokenizersBackend`, roundtrip True | `TokenizersBackend`, roundtrip True |
| `NousResearch/Llama-2-7b-hf` | SentencePiece | `LlamaTokenizer`, roundtrip True | `LlamaTokenizer`, roundtrip True (unchanged) |
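
For readers who want to reproduce this kind of check, a round-trip test reduces to comparing decode(encode(text)) with the input. The sketch below is offline and uses stand-in codecs rather than real tokenizers: `lossless_*` mimics a correct backend and `whitespace_dropping_decode` only imitates the symptom reported above. All names here are hypothetical.

```python
from typing import Callable, List

SAMPLE = "Hello world.\nI'm an AI, so I don't have consciousness."


def roundtrips(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    text: str,
) -> bool:
    """True when decoding the encoded ids reproduces the input exactly."""
    return decode(encode(text)) == text


def lossless_encode(text: str) -> List[int]:
    # Stand-in for a healthy tokenizer: one id per UTF-8 byte.
    return list(text.encode("utf-8"))


def lossless_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


def whitespace_dropping_decode(ids: List[int]) -> str:
    # Stand-in for the regression: spaces and newlines vanish on decode.
    return bytes(i for i in ids if i not in (0x20, 0x0A)).decode("utf-8")


print(roundtrips(lossless_encode, lossless_decode, SAMPLE))             # True
print(roundtrips(lossless_encode, whitespace_dropping_decode, SAMPLE))  # False
```

With a real tokenizer `tok` loaded through `AutoTokenizer.from_pretrained`, the same helper could be called with `lambda t: tok.encode(t, add_special_tokens=False)` and `tok.decode`; whether the numbers in the table were produced with or without special tokens is not stated here.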

A no-network unit test for the detection helper is added in
tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_tokenizer_json_is_byte_level.

github-actions · 2026-07-02T05:12:26Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto


github-actions · 2026-07-02T05:31:13Z

CI recap

Dashboard: View test results in Grafana
Latest run: 28567266014:2
Result: success | Jobs: 14 | Tests: 50,054 | Failures: 0 | Duration: 15h 19m

Rocketknight1 · 2026-07-02T11:45:53Z

cc @ArthurZucker @itazap

tboerstad · 2026-07-02T21:05:25Z

This has been an issue since 5.3.0 I believe, would be great to get this fixed.
Thank you @subin9

subin9 force-pushed the fix-llama3-bytelevel-tokenizer-routing branch from bf64c6f to e073da1 on July 2, 2026 05:17

subin9 wants to merge 1 commit into huggingface:main from subin9:fix-llama3-bytelevel-tokenizer-routing
