# Add native DeepSeek-V3.2 support #44481

XingyuHu109 wants to merge 14 commits into huggingface:main
Conversation
adc22b1 to 03d1246
Pull request overview
This PR adds native Hugging Face Transformers support for the new DeepSeek-V3.2 architecture (deepseek_v32) so official checkpoints (e.g. deepseek-ai/DeepSeek-V3.2) resolve through standard auto-classes without requiring trust_remote_code.
Changes:
- Introduces the `deepseek_v32` model family (config + PyTorch modeling), registers it in auto-class mappings, and adds docs/tests.
- Improves robustness of dynamic attention/experts-implementation detection (avoids `KeyError` when `sys.modules` lacks the model module entry).
- Extends the `accelerate` integration to support `preload_module_classes` dispatch (relevant for disk-offloaded MoE blocks) and fixes FP8 quantizer device-map validation precedence.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/transformers/models/deepseek_v32/modular_deepseek_v32.py` | Modular source defining DeepSeek-V3.2 config/model with DSA indexer + MoE. |
| `src/transformers/models/deepseek_v32/configuration_deepseek_v32.py` | Generated config for `deepseek_v32`. |
| `src/transformers/models/deepseek_v32/modeling_deepseek_v32.py` | Generated PyTorch implementation + generation support. |
| `src/transformers/models/deepseek_v32/__init__.py` | Lazy import plumbing for the new model package. |
| `src/transformers/models/auto/configuration_auto.py` | Registers the `deepseek_v32` config in the auto config mapping + display name. |
| `src/transformers/models/auto/modeling_auto.py` | Registers the `deepseek_v32` model + causal LM in the auto model mappings. |
| `src/transformers/models/__init__.py` | Exposes the new model package under `transformers.models`. |
| `tests/models/deepseek_v32/test_modeling_deepseek_v32.py` | New unit tests validating config fields, auto-class resolution, caching shapes, and disk-offloaded MoE behavior. |
| `docs/source/en/model_doc/deepseek_v32.md` | New model documentation page + usage example. |
| `docs/source/en/_toctree.yml` | Adds the DeepSeek-V3.2 docs page to the docs navigation. |
| `src/transformers/integrations/accelerate.py` | Adds `preload_module_classes` dispatch support with a compatibility shim for hook recursion. |
| `src/transformers/modeling_utils.py` | Makes `_can_set_attn_implementation` / `_can_set_experts_implementation` robust to missing `sys.modules` entries. |
| `src/transformers/quantizers/quantizer_finegrained_fp8.py` | Fixes device-map validation precedence and updates `isinstance` check style. |
| `tests/quantization/finegrained_fp8/test_fp8.py` | Adds a test ensuring pre-quantized FP8 models allow disk offload in device maps. |
| `tests/utils/test_modeling_utils.py` | Adds a regression test for a missing `sys.modules` module-cache entry. |
| `utils/check_config_attributes.py` | Allows V3.2 "metadata" fields not referenced by modeling code. |
| `src/transformers/conversion_mapping.py` / `docs/source/en/weightconverter.md` | Adds `deepseek_v32` to the weight conversion pattern mapping. |
```python
    Compatibility shim for `accelerate` releases that keep recursing into children after attaching a preloaded
    parent block hook. When `preload_module_classes` is active, the parent hook already manages its submodules.
    """
    if not isinstance(execution_device, Mapping) and not isinstance(offload, dict):
```
In `_attach_align_device_hook_on_blocks_for_preload`, the early-return branch checks `not isinstance(offload, dict)`, but later the function treats `offload` as a generic `Mapping`. If `accelerate` ever passes a non-dict mapping (e.g. `types.MappingProxyType`), this condition will be true and the code will incorrectly treat the mapping like a scalar boolean, potentially attaching the wrong hooks. Use `Mapping` consistently here (i.e. `not isinstance(offload, Mapping)`) so the scalar-vs-mapping logic is correct.
```diff
-    if not isinstance(execution_device, Mapping) and not isinstance(offload, dict):
+    if not isinstance(execution_device, Mapping) and not isinstance(offload, Mapping):
```
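The distinction matters because a read-only mapping passes the `Mapping` check but not the `dict` check; a quick illustration:

```python
from collections.abc import Mapping
from types import MappingProxyType

# A read-only view over a dict: it is a Mapping, but not a dict subclass,
# so an `isinstance(offload, dict)` check would wrongly send it down the
# scalar branch of the hook-attachment logic.
offload = MappingProxyType({"model.layers.0": True})

print(isinstance(offload, dict))     # False
print(isinstance(offload, Mapping))  # True
```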
```python
>>> model = DeepseekV32ForCausalLM.from_pretrained("meta-deepseek_v32/DeepseekV32-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-deepseek_v32/DeepseekV32-2-7b-hf")
```
The docstring example references a model repo (meta-deepseek_v32/DeepseekV32-2-7b-hf) that does not match the official checkpoint this PR adds native support for (deepseek-ai/DeepSeek-V3.2). This will mislead users copying the snippet; please update the example to use the official model id (and matching tokenizer).
```diff
->>> model = DeepseekV32ForCausalLM.from_pretrained("meta-deepseek_v32/DeepseekV32-2-7b-hf")
->>> tokenizer = AutoTokenizer.from_pretrained("meta-deepseek_v32/DeepseekV32-2-7b-hf")
+>>> model = DeepseekV32ForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3.2")
+>>> tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.2")
```
```python
@unittest.skip("Not sure MoE can pass this + indexer outputs are not deterministic wrt padding")
def test_left_padding_compatibility(self):
    pass

@unittest.skip("Not sure MoE can pass this + indexer outputs are not deterministic wrt padding")
def test_sdpa_padding_matches_padding_free_with_position_ids(self):
    pass

@unittest.skip("Not sure MoE can pass this + indexer outputs are not deterministic wrt padding")
```
Several tests are skipped with reasons phrased as uncertainty (e.g. "Not sure MoE can pass this..."). Skip reasons should be specific and actionable (ideally referencing a known limitation or a tracking issue) so it’s clear whether this is an expected permanent limitation or a temporary gap to fix.
```diff
-@unittest.skip("Not sure MoE can pass this + indexer outputs are not deterministic wrt padding")
-def test_left_padding_compatibility(self):
-    pass
-@unittest.skip("Not sure MoE can pass this + indexer outputs are not deterministic wrt padding")
-def test_sdpa_padding_matches_padding_free_with_position_ids(self):
-    pass
-@unittest.skip("Not sure MoE can pass this + indexer outputs are not deterministic wrt padding")
+@unittest.skip("Skipped: MoE routing with the DSA indexer produces non-deterministic outputs with respect to padding, so left-padding compatibility cannot be reliably tested.")
+def test_left_padding_compatibility(self):
+    pass
+@unittest.skip("Skipped: MoE routing with the DSA indexer produces non-deterministic outputs with respect to padding, so SDPA vs padding-free behavior with position_ids cannot be reliably compared.")
+def test_sdpa_padding_matches_padding_free_with_position_ids(self):
+    pass
+@unittest.skip("Skipped: MoE routing with the DSA indexer produces non-deterministic outputs with respect to padding, making this overfitting test unreliable.")
```
All checks are green now, and the DeepSeek-V3.2 native load/generation path has been validated on this branch. Would appreciate a review when someone working in this area has time.

[For maintainers] Suggested jobs to run (before merge): `run-slow: auto, deepseek_v32, finegrained_fp8`
cc @ArthurZucker! |
Summary
This PR adds native Transformers support for DeepSeek-V3.2.
It introduces a new
deepseek_v32model family so the official checkpoints resolve through the standard auto classes withouttrust_remote_code. The implementation keeps the DeepSeek-V3 MoE structure and plugs in the in-tree DSA attention/indexer path that V3.2 uses. Docs and a dedicated test module are included as well.Closes #41196.
Validation
- `PYTHONPATH=src python -m pytest tests/models/deepseek_v32/test_modeling_deepseek_v32.py -q` → 123 passed, 129 skipped
- `PYTHONPATH=src python -m ruff check src/transformers/models/deepseek_v32 tests/models/deepseek_v32`
- `AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3.2")` → `DeepseekV32Config`
- `AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3.2", ...)` → `DeepseekV32ForCausalLM`
- Generation with the `deepseek-ai/DeepSeek-V3.2` checkpoint completed successfully

Note

The public tokenizer config does not currently ship a `chat_template`, so `apply_chat_template(...)` still needs an explicit template.
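Until the checkpoint ships a `chat_template`, callers can pass one explicitly via the `chat_template=` argument. The ChatML-style template below is purely illustrative (it is NOT the model's official chat format); the pure-Python rendering mirrors what `tokenizer.apply_chat_template(messages, chat_template=CHAT_TEMPLATE, tokenize=False)` would produce for it:

```python
# Illustrative Jinja template only -- NOT the official DeepSeek-V3.2 format.
CHAT_TEMPLATE = (
    "{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}\n{% endfor %}"
)

messages = [{"role": "user", "content": "Hello"}]

# With transformers installed, the equivalent call would be:
#   tokenizer.apply_chat_template(messages, chat_template=CHAT_TEMPLATE, tokenize=False)
# Rendered by hand here to show the shape of the output:
rendered = "".join(f"<|{m['role']}|>{m['content']}\n" for m in messages)
print(rendered)  # <|user|>Hello
```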