
How To: Quantization: Qwen/Qwen3-VL-235B-A22B-Instruct NVFP4 #2038


Description

@tianruochen

I am using the script examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py.
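
For context, here is a minimal sketch of what an NVFP4 oneshot run with llm-compressor typically looks like. The loader class, ignore patterns, calibration dataset, and sample counts are illustrative assumptions, not the exact contents of the example script:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

# Assumption: the example script may use a model-specific class instead.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 = FP4 weights and activations with FP8 block scales. Layers kept
# out of the 4-bit scheme (the ignore patterns here are illustrative).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*"],
)

oneshot(
    model=model,
    recipe=recipe,
    dataset="flickr30k",          # illustrative calibration set
    num_calibration_samples=512,  # illustrative
    max_seq_length=2048,
)

model.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4")
processor.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4")
```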

It runs out of GPU memory partway through weight calibration:
Calibrating weights:  81%|████████  | 29597/36472 [26:11<06:05, 18.83it/s]
Traceback (most recent call last):
  File "/workspace/home/scripts/quantize/qwen3_vl_moe_w4a4_fp4.py", line 92, in <module>
    oneshot(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 74, in __call__
    LifecycleCallbacks.calibration_epoch_start()
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session_functions.py", line 150, in calibration_epoch_start
    return cls.event(EventType.CALIBRATION_EPOCH_START, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session_functions.py", line 85, in event
    return active_session().event(event_type, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session.py", line 187, in event
    mod_data = self._lifecycle.event(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/lifecycle.py", line 204, in event
    data = mod.update_event(state=self.state, event=event, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/modifier.py", line 123, in update_event
    self.on_event(state, event, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 96, in on_event
    self.on_start(state, None)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 91, in on_start
    update_weight_zp_scale(module)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 173, in update_weight_zp_scale
    call_observer(module=module, base_name="weight")
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 125, in call_observer
    updated_scale, updated_zero_point = observer(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/base.py", line 55, in forward
    return self.get_qparams(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/base.py", line 178, in get_qparams
    scale, zero_point = self.get_qparams_along_dim(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/min_max.py", line 137, in get_qparams_along_dim
    return self.calculate_qparams(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/min_max.py", line 119, in calculate_qparams
    return calculate_qparams(
  File "/usr/local/lib/python3.11/site-packages/compressed_tensors/quantization/utils/helpers.py", line 105, in calculate_qparams
    scales = torch.clamp(scales, max=FP8_E4M3_DATA.max, min=FP8_E4M3_DATA.min)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 2.94 MiB is free. Process 716000 has 79.10 GiB memory in use. Of the allocated memory 78.49 GiB is allocated by PyTorch, and 4.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This was run on 8× 80 GB H100 GPUs.
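
Based on the error text and the general memory guidance for llm-compressor, two things I plan to try (a hedged sketch, not a confirmed fix; the loader class below is an assumption):

```python
import os

# Suggested by the OOM message itself: expandable segments can reduce
# allocator fragmentation. Must be set before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from transformers import AutoModelForImageTextToText  # assumption: actual class may differ

# Loading with no device_map keeps the weights on CPU, so llm-compressor's
# sequential pipeline can onload one layer at a time during calibration
# instead of holding all 235B parameters resident across the 8 GPUs.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    torch_dtype="auto",
)
```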

Labels

nvfp4 (NVFP4 support), qwen (Qwen support)
