zero.Init partitioning of large fused MoE-expert tensors spikes a single GPU (transient full materialization) -> OOM during load even when the sharded model fits #8085

Description

@trevorgordon981

Env: transformers 5.12.1, deepspeed 0.18.9, torch 2.12.0+cu130, peft 0.19.1, bitsandbytes 0.49.2, accelerate 1.14.0; 8x B200 (178GB) / 2TB RAM; model MiniMaxAI/MiniMax-M3 (428B sparse MoE VL, minimax_m3_vl).

During the correctly sharded load (world_size=8), a single GPU spikes to ~180GB against a steady-state partition of ~94GB, then OOMs. The giant fused MoE-expert parameter (128 experts) appears to be materialized in full on one GPU before being scattered. This is a load-time transient, independent of sequence length, and offload_param: {device: cpu} does not prevent the GPU spike during the from_pretrained init path.
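To make the suspected mechanism concrete, here is a rough back-of-the-envelope model of the peak on the rank that stages a parameter. It is only a sketch: `peak_gpu_bytes` is a hypothetical helper, the sizes below are illustrative placeholders rather than measurements from this run, and the model assumes the staging rank briefly holds its resident shards, the staged data, and its own new shard at the same time.

```python
GiB = 1024 ** 3

def peak_gpu_bytes(resident_bytes, param_bytes, world_size, chunk_bytes=None):
    """Simplified peak bytes on the rank that stages one parameter.

    resident_bytes: shards this rank already holds.
    param_bytes:    full (unpartitioned) size of the parameter being loaded.
    chunk_bytes:    staging buffer size if the parameter is streamed;
                    None means the whole parameter is materialized first.
    """
    staged = param_bytes if chunk_bytes is None else min(chunk_bytes, param_bytes)
    own_shard = -(-param_bytes // world_size)  # ceil division
    return resident_bytes + staged + own_shard

# Hypothetical sizes, chosen only to show the shape of the problem.
resident = 50 * GiB
fused_experts = 64 * GiB

full = peak_gpu_bytes(resident, fused_experts, world_size=8)
streamed = peak_gpu_bytes(resident, fused_experts, world_size=8, chunk_bytes=1 * GiB)
print(full // GiB, streamed // GiB)
```

Under this model the transient grows with the full parameter size when it is materialized whole, but only with the chunk size when it is streamed, which is why the spike would not depend on sequence length.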

Ask: stream or scatter very large parameters in chunks during partitioning, without materializing the full tensor on a single GPU, and honor remote_device='cpu' during the from_pretrained init path so the load can stage through CPU.
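A minimal sketch of the kind of chunked partitioning being requested, in plain Python with lists standing in for tensors. This is not DeepSpeed's implementation: `stream_partition` and `read_elements` are hypothetical names, and the even split of a flat parameter across ranks is an assumption made for illustration.

```python
def stream_partition(read_elements, numel, world_size, chunk):
    """Split a flat parameter into world_size shards, one chunk at a time.

    read_elements(start, stop) returns elements [start, stop) of the flat
    parameter (a stand-in for reading a checkpoint slice staged on CPU).
    Returns (shards, peak_staged), where peak_staged is the largest number
    of elements held in the staging buffer at any point.
    """
    shard_size = -(-numel // world_size)  # ceil division; last shard may be short
    shards = [[] for _ in range(world_size)]
    peak_staged = 0
    for start in range(0, numel, chunk):
        stop = min(start + chunk, numel)
        staged = read_elements(start, stop)  # only one chunk is ever staged
        peak_staged = max(peak_staged, len(staged))
        pos = start
        while pos < stop:
            # Route each piece of the chunk to the rank that owns it.
            rank = pos // shard_size
            end = min(stop, (rank + 1) * shard_size)
            shards[rank].extend(staged[pos - start:end - start])
            pos = end
    return shards, peak_staged

data = list(range(10))
shards, peak_staged = stream_partition(lambda a, b: data[a:b], len(data), 4, 4)
print(shards, peak_staged)
```

The staging buffer never exceeds `chunk` elements regardless of the parameter's total size, so the transient is bounded by a tunable constant instead of by the fused expert tensor.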
