
How To: Quantization: Qwen/Qwen3-VL-235B-A22B-Instruct NVFP4 #2038


Description

@tianruochen

I am using the script examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py.
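
For context, here is a minimal sketch of what an NVFP4 oneshot run with llm-compressor typically looks like. The loader class, ignore patterns, calibration dataset, and sample counts are illustrative assumptions, not the exact contents of the example script:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

# Assumption: the example script may use a model-specific class instead.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 = FP4 weights and activations with FP8 block scales. Layers kept
# out of the 4-bit scheme (the ignore patterns here are illustrative).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*"],
)

oneshot(
    model=model,
    recipe=recipe,
    dataset="flickr30k",          # illustrative calibration set
    num_calibration_samples=512,  # illustrative
    max_seq_length=2048,
)

model.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4")
processor.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4")
```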

It runs out of GPU memory partway through weight calibration:
Calibrating weights:  81%|████████  | 29597/36472 [26:11<06:05, 18.83it/s]
Traceback (most recent call last):
  File "/workspace/home/scripts/quantize/qwen3_vl_moe_w4a4_fp4.py", line 92, in <module>
    oneshot(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 74, in __call__
    LifecycleCallbacks.calibration_epoch_start()
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session_functions.py", line 150, in calibration_epoch_start
    return cls.event(EventType.CALIBRATION_EPOCH_START, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session_functions.py", line 85, in event
    return active_session().event(event_type, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session.py", line 187, in event
    mod_data = self._lifecycle.event(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/lifecycle.py", line 204, in event
    data = mod.update_event(state=self.state, event=event, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/modifier.py", line 123, in update_event
    self.on_event(state, event, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 96, in on_event
    self.on_start(state, None)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 91, in on_start
    update_weight_zp_scale(module)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 173, in update_weight_zp_scale
    call_observer(module=module, base_name="weight")
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 125, in call_observer
    updated_scale, updated_zero_point = observer(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/base.py", line 55, in forward
    return self.get_qparams(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/base.py", line 178, in get_qparams
    scale, zero_point = self.get_qparams_along_dim(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/min_max.py", line 137, in get_qparams_along_dim
    return self.calculate_qparams(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/min_max.py", line 119, in calculate_qparams
    return calculate_qparams(
  File "/usr/local/lib/python3.11/site-packages/compressed_tensors/quantization/utils/helpers.py", line 105, in calculate_qparams
    scales = torch.clamp(scales, max=FP8_E4M3_DATA.max, min=FP8_E4M3_DATA.min)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 2.94 MiB is free. Process 716000 has 79.10 GiB memory in use. Of the allocated memory 78.49 GiB is allocated by PyTorch, and 4.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This was run on 8× 80 GB H100 GPUs.
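
Based on the error text and the general memory guidance for llm-compressor, two things I plan to try (a hedged sketch, not a confirmed fix; the loader class below is an assumption):

```python
import os

# Suggested by the OOM message itself: expandable segments can reduce
# allocator fragmentation. Must be set before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from transformers import AutoModelForImageTextToText  # assumption: actual class may differ

# Loading with no device_map keeps the weights on CPU, so llm-compressor's
# sequential pipeline can onload one layer at a time during calibration
# instead of holding all 235B parameters resident across the 8 GPUs.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    torch_dtype="auto",
)
```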

Labels

nvfp4 (NVFP4 support), qwen (Qwen support)
