Env: transformers 5.12.1, deepspeed 0.18.9, torch 2.12.0+cu130, peft 0.19.1, bitsandbytes 0.49.2, accelerate 1.14.0; 8x B200 (178GB) / 2TB RAM; model MiniMaxAI/MiniMax-M3 (428B sparse MoE VL, minimax_m3_vl).
During the correctly sharded (world_size=8) load, a single GPU spikes to ~180GB against a ~94GB steady-state partition and OOMs. The large fused MoE-expert parameter (128 experts) appears to be materialized in full on one GPU before being scattered. This is a load-time transient, independent of sequence length, and `offload_param: {device: cpu}` does not prevent the GPU spike during the `from_pretrained` init path.
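For reference, a minimal sketch of where the offload setting sits in a DeepSpeed config. ZeRO stage 3 is assumed here, since parameter partitioning and `offload_param` apply to that stage. The variable name `ds_config` and the `pin_memory` value are illustrative and not copied from the actual run.

```python
# Sketch of the relevant part of a DeepSpeed config (assumed, not the exact one used).
ds_config = {
    "zero_optimization": {
        "stage": 3,  # assumption: ZeRO-3, where parameters are partitioned across ranks
        "offload_param": {
            "device": "cpu",     # the setting from the report
            "pin_memory": True,  # illustrative value, not taken from the run
        },
    },
}

print(ds_config["zero_optimization"]["offload_param"]["device"])  # prints: cpu
```

With this in place the steady-state partition is as expected; the report is that the transient spike during load still lands on a GPU.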
Ask: stream/scatter very large parameters during partitioning without materializing them in full on a single GPU, and honor `remote_device='cpu'` during the `from_pretrained` init path so the load can stage through CPU.
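To illustrate why streaming would help, here is a rough back-of-the-envelope model of the peak memory on the staging GPU. It is a simplification under stated assumptions, not a description of DeepSpeed's actual partitioning code: the function names are hypothetical, the 80 GiB parameter size is a made-up figure, and chunking by expert (128 chunks) is just one plausible granularity.

```python
GIB = 2**30


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def peak_full_materialization(param_bytes: int, world_size: int) -> int:
    """Peak bytes on the staging rank when the whole parameter is built
    on one GPU and then scattered: full tensor plus this rank's shard."""
    shard = ceil_div(param_bytes, world_size)
    return param_bytes + shard


def peak_streamed(param_bytes: int, world_size: int, num_chunks: int) -> int:
    """Peak bytes when the parameter is scattered chunk by chunk:
    this rank's final shard buffer plus one chunk in flight."""
    shard = ceil_div(param_bytes, world_size)
    chunk = ceil_div(param_bytes, num_chunks)
    return shard + chunk


# Hypothetical size for the fused expert parameter (not measured).
param_bytes = 80 * GIB

print(peak_full_materialization(param_bytes, world_size=8) / GIB)      # 90.0
print(peak_streamed(param_bytes, world_size=8, num_chunks=128) / GIB)  # 10.625
```

Under this model the transient scales with the chunk size rather than the full parameter size, which is the behavior being requested.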