zero.Init partitioning of large fused MoE-expert tensors spikes a single GPU (transient full materialization) -> OOM during load even when the sharded model fits

**Env:** transformers 5.12.1, deepspeed 0.18.9, torch 2.12.0+cu130, peft 0.19.1, bitsandbytes 0.49.2, accelerate 1.14.0; 8x B200 (178GB) / 2TB RAM; model MiniMaxAI/MiniMax-M3 (428B sparse MoE VL, minimax_m3_vl).

During the correctly sharded (world_size=8) load, a single GPU spikes to ~180GB against a ~94GB steady-state partition and OOMs. The very large fused MoE-expert parameter (128 experts) appears to be materialized in full on one GPU before being scattered. This is a load-time transient, independent of sequence length, and `offload_param: {device: cpu}` does not prevent the GPU spike during the `from_pretrained` init path.
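For reference, a minimal sketch of the kind of ZeRO-3 config this refers to, plus the arithmetic behind the spike. Only `offload_param: {device: cpu}` and the GB figures come from this report; the remaining config keys and values (`pin_memory`, `bf16`, micro-batch size) are assumptions added to make the fragment complete, not the exact config used.

```python
# Hypothetical reconstruction of the relevant ZeRO-3 settings. The report only
# states `offload_param: {device: cpu}`; every other key/value is an assumption.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

# Figures taken from the report, in GB.
gpu_capacity_gb = 178       # per-GPU memory on the B200s
steady_partition_gb = 94    # per-rank footprint once sharded
observed_peak_gb = 180      # transient peak on the GPU that OOMs

# Size of the transient allocation implied by the peak, and the room available.
transient_gb = observed_peak_gb - steady_partition_gb
headroom_gb = gpu_capacity_gb - steady_partition_gb
print(f"transient ~{transient_gb} GB vs headroom {headroom_gb} GB")
```

The implied transient (~86GB) is slightly larger than the free headroom (84GB), which is consistent with one large tensor being held in full on top of the steady-state partition.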

**Ask:** stream/scatter very large parameters during partitioning without a full single-GPU materialization, and honor `remote_device='cpu'` during the `from_pretrained` init path so the load can stage through CPU.
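To illustrate the first half of the ask, here is a rough sketch of a streamed scatter plan in plain Python. It is not DeepSpeed code: `plan_streamed_scatter` and `Copy` are hypothetical names, and the sketch ignores the padding and alignment that a real ZeRO-3 partitioner would apply. The point is only that the staging buffer can be bounded by a chunk size rather than by the size of the full parameter.

```python
from typing import Iterator, NamedTuple


class Copy(NamedTuple):
    """One bounded copy from the flat source tensor into a rank's shard."""
    rank: int        # destination rank
    src_start: int   # start index in the flattened full parameter
    src_end: int     # end index (exclusive) in the flattened full parameter
    dst_offset: int  # offset inside the destination rank's shard


def plan_streamed_scatter(numel: int, world_size: int, chunk_numel: int) -> Iterator[Copy]:
    """Yield copies so that at most `chunk_numel` elements are staged at once.

    Hypothetical helper: each rank owns a contiguous slice of ceil(numel / world_size)
    elements, and that slice is filled chunk by chunk instead of first
    materializing the whole parameter on one device.
    """
    if numel < 0 or world_size <= 0 or chunk_numel <= 0:
        raise ValueError("numel must be >= 0; world_size and chunk_numel must be > 0")
    shard = -(-numel // world_size)  # ceiling division
    for rank in range(world_size):
        lo = min(rank * shard, numel)
        hi = min(lo + shard, numel)
        for start in range(lo, hi, chunk_numel):
            end = min(start + chunk_numel, hi)
            yield Copy(rank, start, end, start - lo)


# Toy sizes: 10 elements, 4 ranks, staging buffer of 2 elements.
plan = list(plan_streamed_scatter(numel=10, world_size=4, chunk_numel=2))
peak_staged = max(c.src_end - c.src_start for c in plan)
total_copied = sum(c.src_end - c.src_start for c in plan)
print(len(plan), peak_staged, total_copied)
```

With a plan like this, the peak extra memory on any device is one chunk, and the same plan could stage through pinned CPU memory when `remote_device='cpu'` is requested.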

zero.Init partitioning of large fused MoE-expert tensors spikes a single GPU (transient full materialization) -> OOM during load even when the sharded model fits #8085
