ZeRO-3: stream partitioning of oversized parameters in zero.Init by Achyuthan-S · Pull Request #8103 · deepspeedai/DeepSpeed

Achyuthan-S · 2026-06-30T11:06:27Z

Problem

Under zero.Init (ZeRO-3), every parameter is moved to the accelerator, broadcast in full, and then sliced into per-rank partitions. A single very large fused parameter — e.g. a 128-expert MoE weight — must be fully materialized on one device during this step, which can OOM that device during a from_pretrained load even when the sharded model fits. offload_param: {device: cpu} does not help: it only controls where the resulting partition is stored, not where the full tensor is staged.

Closes #8085.

Change

Adds an opt-in ZeRO-3 config stage3_partition_stream_chunk_size (default 0 = disabled). When set, a parameter with more elements than the threshold that is not already on the accelerator (the host-staged from_pretrained / low_cpu_mem_usage path) is partitioned by streaming its flattened data through fixed-size chunks: stage a chunk on the accelerator → broadcast from the owner rank → copy only this rank's slice into ds_tensor. The full tensor is never materialized on a single device, bounding the partition-time peak to roughly the chunk size.
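The streaming path described above can be illustrated with a minimal single-process sketch. This is not the PR's implementation: plain Python lists stand in for tensors, the broadcast is simulated by letting every rank read the same staged chunk, and the names `partition_size`, `stream_partition`, and `standard_partition` are hypothetical. The ceil-divide partition size with zero padding is also an assumption.

```python
def partition_size(numel, world_size):
    """Elements per rank, rounded up so every rank holds the same count (assumed)."""
    return (numel + world_size - 1) // world_size


def stream_partition(flat, world_size, chunk_size):
    """Partition `flat` across ranks by streaming fixed-size chunks.

    Only one chunk is "staged" at a time; each rank copies just the part of
    the chunk that falls inside its own partition. Returns one zero-padded
    list per rank.
    """
    numel = len(flat)
    part = partition_size(numel, world_size)
    shards = [[0.0] * part for _ in range(world_size)]
    for start in range(0, numel, chunk_size):
        end = min(start + chunk_size, numel)
        chunk = flat[start:end]  # stand-in for: stage on accelerator, broadcast from owner
        for rank in range(world_size):
            lo = max(start, rank * part)
            hi = min(end, (rank + 1) * part)
            if lo < hi:  # this chunk overlaps this rank's partition
                dst = lo - rank * part
                shards[rank][dst:dst + (hi - lo)] = chunk[lo - start:hi - start]
    return shards


def standard_partition(flat, world_size):
    """Reference path: materialize the full (padded) tensor, then slice it."""
    part = partition_size(len(flat), world_size)
    padded = list(flat) + [0.0] * (part * world_size - len(flat))
    return [padded[r * part:(r + 1) * part] for r in range(world_size)]


flat = [float(i) for i in range(10)]
assert stream_partition(flat, world_size=3, chunk_size=4) == standard_partition(flat, 3)
```

In this sketch the largest temporary buffer is one chunk, whereas `standard_partition` builds the whole padded tensor before slicing, which is the spike the PR is trying to avoid.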

With the default (0) the standard broadcast-then-partition path runs unchanged. Streaming is skipped for the nvme / quantized / ZeRO++ secondary-partition paths, which stage parameters differently.
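Enabling the option might look like the fragment below. It is a sketch under stated assumptions: the key is assumed to sit under `zero_optimization` alongside the other `stage3_*` options, and its value is assumed to count elements (the description says "more elements than the threshold"). The value of 10,000,000 is only an example, chosen because 10M fp32 elements is 40 MB.

```python
import json

# Hypothetical DeepSpeed config fragment; key placement and units are assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Controls where the resulting partition is stored, not where the
        # full tensor is staged (per the PR description).
        "offload_param": {"device": "cpu"},
        # 0 (the default) disables streaming; a positive value opts in.
        "stage3_partition_stream_chunk_size": 10_000_000,
    },
}

print(json.dumps(ds_config, indent=2))
```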

Validation

Correctness — new unit test covers the chunk/partition overlap math (incl. padding, single-rank). End-to-end, the streamed partition reconstructs bit-for-bit identically to the standard path across world sizes 1–3, with padding, all_gather round-trip, and offload_param: cpu.
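The chunk/partition overlap math mentioned above amounts to intersecting two half-open intervals. The helper below is a hypothetical illustration (the name `overlap` and its signature are not taken from the PR), followed by the kinds of cases the description lists: padding and single-rank.

```python
def overlap(chunk_start, chunk_len, rank, part_size):
    """Intersect chunk [chunk_start, chunk_start + chunk_len) with a rank's
    partition [rank * part_size, (rank + 1) * part_size).

    Returns (offset_in_chunk, offset_in_partition, length), or None when the
    chunk does not touch this rank's partition.
    """
    part_start = rank * part_size
    lo = max(chunk_start, part_start)
    hi = min(chunk_start + chunk_len, part_start + part_size)
    if lo >= hi:
        return None
    return lo - chunk_start, lo - part_start, hi - lo


# 10 elements over 3 ranks -> part_size 4; a chunk covering elements [2, 8).
assert overlap(2, 6, rank=0, part_size=4) == (0, 2, 2)
assert overlap(2, 6, rank=1, part_size=4) == (2, 0, 4)
assert overlap(2, 6, rank=2, part_size=4) is None

# Padding: the last real elements [8, 10) fill only half of rank 2's partition.
assert overlap(8, 2, rank=2, part_size=4) == (0, 0, 2)

# Single rank: the partition is the whole tensor, so every chunk lands in it.
assert overlap(4, 4, rank=0, part_size=10) == (0, 4, 4)
```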

NCCL + peak memory (2× NVIDIA L40S):

[A] NCCL correctness (gathered streamed == standard): True
[B] peak GPU memory during zero.Init (world=2, dim=22528, fp32)
    full param         : 2.03 GB   partition/rank: 1.02 GB   chunk: 40 MB
    streaming OFF peak : 3.05 GB
    streaming ON peak  : 1.10 GB
    peak reduction     : 1.95 GB (64% lower)
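The reported figures can be cross-checked with simple arithmetic. The sketch below assumes a square `dim × dim` fp32 parameter and decimal gigabytes (1 GB = 10^9 bytes), which is what reproduces the 2.03 GB full-parameter size. The "estimate" lines are a back-of-the-envelope model, not something the benchmark output states.

```python
GB = 1e9  # decimal gigabytes (assumed; matches the reported 2.03 GB)

dim, bytes_per_elem, world = 22528, 4, 2
chunk_bytes = 40e6  # the 40 MB chunk from the benchmark

full_bytes = dim * dim * bytes_per_elem
partition_bytes = full_bytes / world

# Rough model: the standard path holds the full tensor plus this rank's
# partition; the streaming path holds the partition plus one chunk.
off_estimate = full_bytes + partition_bytes
on_estimate = partition_bytes + chunk_bytes

print(f"full param        : {full_bytes / GB:.2f} GB")
print(f"partition per rank: {partition_bytes / GB:.2f} GB")
print(f"OFF peak estimate : {off_estimate / GB:.2f} GB")
print(f"ON peak estimate  : {on_estimate / GB:.2f} GB")

# Reduction computed from the reported peaks (3.05 GB and 1.10 GB).
reported_off, reported_on = 3.05, 1.10
reduction_pct = (reported_off - reported_on) / reported_off * 100
print(f"reported reduction: {reported_off - reported_on:.2f} GB ({reduction_pct:.0f}% lower)")
```

Under this model the streaming-off estimate matches the measured 3.05 GB. The streaming-on estimate comes out slightly below the measured 1.10 GB, and the thread does not break down the difference.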

Scope

Applies to parameters that reach partitioning off-GPU (the from_pretrained / low_cpu_mem_usage path this issue targets). Parameters constructed directly on the accelerator inside zero.Init are unaffected — the spike there happens at construction time, which can be addressed as a follow-up.

cc @tohtana @tjruwase @loadams

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc35972244

chatgpt-codex-connector · 2026-06-30T11:08:46Z

+            if not self._should_stream_partition(param):
+                param.data = param.data.to(self.local_device)


Avoid checking streaming before ZeRO metadata exists

When stage3_partition_stream_chunk_size is set and zero.Init(module=prebuilt_model, ...) is used, this new pre-check runs on ordinary torch.nn.Parameters before _zero_init_param() calls _convert_to_deepspeed_param(). _should_stream_partition() immediately asks for _partition_world_size(param), which dereferences param.ds_process_group; that attribute is only installed later in _convert_to_deepspeed_param(), so the module-conversion path raises AttributeError even for parameters smaller than the chunk size. Move the stream decision until after conversion, or make the pre-check use the default process group without requiring ZeRO metadata.

Achyuthan-S · 2026-06-30

Fixed. _should_stream_partition now gates on the global num_partitions instead of _partition_world_size(param), so the zero.Init(module=...) path no longer dereferences param.ds_process_group before _convert_to_deepspeed_param attaches it. The per-parameter group is still used in the actual partitioning (_partition_param_streaming), which runs after conversion. Added a DistributedTest that exercises the module= path with streaming enabled to guard this.
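The ordering problem and its fix can be shown with a toy model. Everything below is hypothetical scaffolding (`FakeGroup`, `FakeParam`, `FakeInit` and their attributes are stand-ins, not DeepSpeed classes). How the real predicate uses the world size is not stated in the thread, so here it serves only as a sanity check.

```python
class FakeGroup:
    def __init__(self, size):
        self.size = size


class FakeParam:
    """Stand-in for a plain parameter with no ZeRO metadata attached yet."""

    def __init__(self, numel, on_accelerator=False):
        self.numel = numel
        self.on_accelerator = on_accelerator


class FakeInit:
    def __init__(self, num_partitions, chunk_size):
        self.num_partitions = num_partitions  # known globally, before conversion
        self.chunk_size = chunk_size

    def convert(self, param):
        # Stand-in for the conversion step that attaches ZeRO metadata.
        param.ds_process_group = FakeGroup(self.num_partitions)

    def _eligible(self, param, world):
        return (world >= 1 and self.chunk_size > 0
                and param.numel > self.chunk_size
                and not param.on_accelerator)

    def should_stream_before_fix(self, param):
        # Reads per-parameter metadata: fails if called before convert().
        return self._eligible(param, param.ds_process_group.size)

    def should_stream_after_fix(self, param):
        # Reads only context-level state: safe before convert().
        return self._eligible(param, self.num_partitions)


ctx = FakeInit(num_partitions=2, chunk_size=1000)
param = FakeParam(numel=5000)

try:
    ctx.should_stream_before_fix(param)
    raised = False
except AttributeError:
    raised = True

assert raised                              # pre-fix check breaks on unconverted params
assert ctx.should_stream_after_fix(param)  # post-fix check works without metadata
```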

Under zero.Init, each parameter is broadcast and partitioned by first materializing the full tensor on a single device. A single very large fused parameter (e.g. a 128-expert MoE weight) can exceed device memory during a from_pretrained load even when the sharded model fits; offload_param does not help because it only controls where the resulting partition is stored.

Add an opt-in stage3_partition_stream_chunk_size: a parameter larger than the threshold that is not already on the accelerator is partitioned by streaming its flattened data through fixed-size chunks (stage chunk -> broadcast from owner rank -> copy this rank's slice), bounding the partition-time device peak to roughly the chunk size. Defaults to 0 (disabled), leaving the existing path unchanged.

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>

Achyuthan-S · 2026-07-01T06:10:11Z

Hey @tohtana, I have been working on this issue and opened a PR with a solution.
It would be great if you could review it and let me know whether the approach works.
Thank you!

Copilot AI review requested due to automatic review settings June 30, 2026 11:06

Achyuthan-S requested review from loadams, tjruwase and tohtana as code owners June 30, 2026 11:06

Copilot AI reviewed Jun 30, 2026

chatgpt-codex-connector Bot reviewed Jun 30, 2026

Achyuthan-S mentioned this pull request Jun 30, 2026

zero.Init partitioning of large fused MoE-expert tensors spikes a single GPU (transient full materialization) -> OOM during load even when the sharded model fits #8085

Open

Achyuthan-S force-pushed the fix/zero-init-stream-large-params branch from cc35972 to cc0fb6a Compare June 30, 2026 11:18

Achyuthan-S wants to merge 1 commit into deepspeedai:master from Achyuthan-S:fix/zero-init-stream-large-params
