[Bug]: Llama-4-Maverick-17B-128E-Instruct quantization skips all MoE experts → missing expert weights → vLLM load failure #2060

Description

@shubhra

⚙️ Your current environment

The output of `python collect_env.py`:
### Environment Information ###
Operating System: `Linux-6.8.0-85-generic-x86_64-with-glibc2.39`
Python Version: `3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]`
llm-compressor Version: `0.8.2.dev54+g6fea8880.d20251119`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.57.1`
torch Version: `2.9.0`
CUDA Devices: `['NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200', 'NVIDIA B200']`
AMD Devices: `None`

🐛 Describe the bug

When running `examples/quantization_w4a4_fp4/llama4_example.py` to quantize the Llama-4-Maverick-17B-128E-Instruct model, the generated `config.json` places all MoE expert modules under the `ignore` list, so none of the routed experts are quantized even though they should be.

The shared expert is quantized correctly, but all 128 routed experts remain unquantized, so no quantized expert tensors such as `w1_weight`, `w2_weight`, or `w3_weight` are produced.
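
A quick way to confirm this from the saved output (a minimal sketch, not part of the original report; the `out/` directory name is a placeholder and it assumes the ignore list is serialized under `quantization_config` in `config.json`) is to count how many ignore entries point at expert modules:

```python
# Count how many ignored modules in the generated config.json are expert modules.
# "out/config.json" is a placeholder path for the quantized model's save directory.
import json

with open("out/config.json") as f:
    cfg = json.load(f)

ignore = cfg.get("quantization_config", {}).get("ignore", [])
expert_entries = [name for name in ignore if "experts" in name]
print(f"{len(expert_entries)} of {len(ignore)} ignored modules are expert modules")
```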

This leads to vLLM failing to load the model with:

    KeyError: 'layers.17.feed_forward.experts.126.w1_weight'

because the expected quantized expert parameters were never created.
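
The missing parameters can also be confirmed directly from the checkpoint's safetensors index (a rough sketch, not from the report; the output directory name and the `weight_packed`/`weight_scale` suffixes used to spot quantized tensors are assumptions):

```python
# List which expert tensors were actually written to the checkpoint shards.
import json
from pathlib import Path

ckpt = Path("Llama-4-Maverick-17B-128E-Instruct-NVFP4")  # placeholder output dir

# model.safetensors.index.json maps every saved tensor name to its shard file.
index = json.loads((ckpt / "model.safetensors.index.json").read_text())
names = list(index["weight_map"])

expert_tensors = [n for n in names if ".feed_forward.experts." in n]
quantized = [n for n in expert_tensors if "weight_packed" in n or "weight_scale" in n]

print(f"{len(expert_tensors)} expert tensors saved, {len(quantized)} carry quantized data")
```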

Impact:
MoE expert weights are silently omitted during quantization, producing incomplete checkpoints incompatible with vLLM inference.

🛠️ Steps to reproduce

No response

Labels

bug: Something isn't working
llama: For any PR / issue related to Llama herd support
