Skip to content

Route byte-level llama tokenizers to TokenizersBackend#47017

Open
subin9 wants to merge 1 commit into
huggingface:mainfrom
subin9:fix-llama3-bytelevel-tokenizer-routing
Open

Route byte-level llama tokenizers to TokenizersBackend#47017
subin9 wants to merge 1 commit into
huggingface:mainfrom
subin9:fix-llama3-bytelevel-tokenizer-routing

Conversation

@subin9

@subin9 subin9 commented Jul 2, 2026

Copy link
Copy Markdown

CI

Model type "llama" spans both SentencePiece (Llama-1/2) and byte-level (Llama-3 / tiktoken) tokenizers under one Hub tokenizer_class (LlamaTokenizerFast). In v5, LlamaTokenizer.init unconditionally installs a Metaspace pre-tokenizer/decoder, which silently drops spaces for byte-level repos (see #45488), e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-*.

The existing MODEL_IDS_TO_TOKENIZERS_BACKEND allowlist only covers specific checkpoints (e.g. the 8B) and misses others (the 70B is still broken). Instead, for the small set of dual-scheme model types, inspect the serialized tokenizer.json: if it declares a ByteLevel pre_tokenizer/decoder, route to TokenizersBackend (which respects tokenizer.json). SentencePiece Llama-1/2 stays on LlamaTokenizer unchanged.

What does this PR do?

model_type == "llama" covers two incompatible tokenizer schemes under a single Hub tokenizer_class
(LlamaTokenizerFast):

  • Llama-1/2 — SentencePiece (Metaspace, )
  • Llama-3 and its derivatives (e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-*) — byte-level BPE (GPT-2 / tiktoken, Ġ)

In v5, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer/decoder, overwriting whatever
tokenizer.json declares. For byte-level repos this silently drops spaces on both encode and decode
("Hello world""Helloworld"), which is a silent accuracy regression — see #45488.

The current mitigation (MODEL_IDS_TO_TOKENIZERS_BACKEND, from #46091) is a per-checkpoint allowlist. It fixes
deepseek-ai/deepseek-r1-distill-llama-8b but misses others — for example DeepSeek-R1-Distill-Llama-70B is
still broken on main
(it is not in the list).

Instead of enumerating checkpoints, this PR adds a small content-based rule: for the (rare) set of dual-scheme
model types ({"llama"}), inspect the serialized tokenizer.json; if it declares a ByteLevel
pre-tokenizer/decoder, route to TokenizersBackend (which respects tokenizer.json). SentencePiece Llama-1/2
(no ByteLevel in tokenizer.json) keeps using LlamaTokenizer unchanged. LlamaTokenizer.__init__ is not
modified, so there is no risk to the SentencePiece path.

This covers any byte-level Llama repo (8B, 70B, future distills) without maintaining a checkpoint allowlist.

Fixes #45488 for Llama-3-derived checkpoints.

Verification

Round-trip on "Hello world.\nI'm an AI, so I don't have consciousness." (transformers main):

repo tokenizer before (main) after
DeepSeek-R1-Distill-Llama-70B (not in allowlist) byte-level LlamaTokenizer, roundtrip False ("Helloworld.I'manAI.") TokenizersBackend, roundtrip True
DeepSeek-R1-Distill-Llama-8B byte-level TokenizersBackend, True TokenizersBackend, True
NousResearch/Llama-2-7b-hf SentencePiece LlamaTokenizer, True LlamaTokenizer, True (unchanged)

A no-network unit test for the detection helper is added in
tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_tokenizer_json_is_byte_level.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

Model type "llama" spans both SentencePiece (Llama-1/2) and byte-level
(Llama-3 / tiktoken) tokenizers under one Hub tokenizer_class
(LlamaTokenizerFast). In v5, LlamaTokenizer.__init__ unconditionally
installs a Metaspace pre-tokenizer/decoder, which silently drops spaces
for byte-level repos (see huggingface#45488), e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-*.

The existing MODEL_IDS_TO_TOKENIZERS_BACKEND allowlist only covers specific
checkpoints (e.g. the 8B) and misses others (the 70B is still broken). Instead,
for the small set of dual-scheme model types, inspect the serialized
tokenizer.json: if it declares a ByteLevel pre_tokenizer/decoder, route to
TokenizersBackend (which respects tokenizer.json). SentencePiece Llama-1/2
stays on LlamaTokenizer unchanged.
@subin9 subin9 force-pushed the fix-llama3-bytelevel-tokenizer-routing branch from bf64c6f to e073da1 Compare July 2, 2026 05:17
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

CI recap

Dashboard: View test results in Grafana
Latest run: 28567266014:2
Result: success | Jobs: 14 | Tests: 50,054 | Failures: 0 | Duration: 15h 19m

@Rocketknight1

Copy link
Copy Markdown
Member

cc @ArthurZucker @itazap

@tboerstad

Copy link
Copy Markdown
Contributor

This has been an issue since 5.3.0 I believe, would be great to get this fixed.
Thank you @subin9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family

4 participants