Description
Running the example script `examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py` fails with a CUDA out-of-memory error partway through weight calibration:
```
Calibrating weights:  81%|████████  | 29597/36472 [26:11<06:05, 18.83it/s]
Traceback (most recent call last):
  File "/workspace/home/scripts/quantize/qwen3_vl_moe_w4a4_fp4.py", line 92, in <module>
    oneshot(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 74, in __call__
    LifecycleCallbacks.calibration_epoch_start()
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session_functions.py", line 150, in calibration_epoch_start
    return cls.event(EventType.CALIBRATION_EPOCH_START, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session_functions.py", line 85, in event
    return active_session().event(event_type, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/session.py", line 187, in event
    mod_data = self._lifecycle.event(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/core/lifecycle.py", line 204, in event
    data = mod.update_event(state=self.state, event=event, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/modifier.py", line 123, in update_event
    self.on_event(state, event, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 96, in on_event
    self.on_start(state, None)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 91, in on_start
    update_weight_zp_scale(module)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 173, in update_weight_zp_scale
    call_observer(module=module, base_name="weight")
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 125, in call_observer
    updated_scale, updated_zero_point = observer(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/base.py", line 55, in forward
    return self.get_qparams(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/base.py", line 178, in get_qparams
    scale, zero_point = self.get_qparams_along_dim(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/min_max.py", line 137, in get_qparams_along_dim
    return self.calculate_qparams(
  File "/usr/local/lib/python3.11/site-packages/llmcompressor/observers/min_max.py", line 119, in calculate_qparams
    return calculate_qparams(
  File "/usr/local/lib/python3.11/site-packages/compressed_tensors/quantization/utils/helpers.py", line 105, in calculate_qparams
    scales = torch.clamp(scales, max=FP8_E4M3_DATA.max, min=FP8_E4M3_DATA.min)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 2.94 MiB is free. Process 716000 has 79.10 GiB memory in use. Of the allocated memory 78.49 GiB is allocated by PyTorch, and 4.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Hardware: 8× H100 GPUs (80 GB each).
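Since only ~4 MiB is reserved-but-unallocated here, fragmentation may not be the root cause, but the allocator hint from the error message is cheap to try before re-running. A minimal sketch (the script path is the one from this report; whether it resolves the OOM is not verified):

```shell
# Suggested by the OOM message itself: let the CUDA caching allocator
# grow segments instead of failing on fragmented free space.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then re-run the same calibration script, e.g.:
# python examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py
```

The variable must be set in the environment before the process imports torch; setting it from inside an already-running Python process has no effect.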