FSDP orchestration: apply + loading/saving by 3outeille · Pull Request #46990 · huggingface/transformers

3outeille · 2026-07-01T05:09:05Z

Summary

FSDP:
- Only one model has a base_fsdp_plan for now; a follow-up PR will add it to the other models.
- FSDP is now wired through from_pretrained.
- Loading goes through shard-on-read; saving works like TP (DCP optional).
- Add FSDP CI
- DistributedMixin
TP:
- DistributedConfig is wired everywhere (no more tp_plan=auto).
- TP itself is left untouched (no DTensor yet).

Stack

Based on Shard on read #46717 (A-PR-3 dual-path loading)

Wire distributed_config from_pretrained/save_pretrained alongside the legacy tp_plan path, add distributed/utils.py for mesh orchestration and checkpoint I/O, and extend sharding_utils with DTensor gather/optimizer fusion helpers needed by save/load.
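To make the shard-on-read loading and gather-before-save ideas above concrete, here is a minimal pure-Python sketch. It is not the code from this PR: `shard_bounds`, `read_shard`, and `gather_full` are hypothetical names, lists of rows stand in for tensors, and the splitting rule (ceil-sized chunks along dim 0, in the style of `torch.chunk`) is an assumption about one common convention rather than a statement of what `sharding_utils` does.

```python
def shard_bounds(dim_size, world_size, rank):
    """Return the [start, end) rows owned by `rank` (hypothetical helper).

    Assumes ceil-sized chunks along dim 0; trailing ranks may get a
    shorter or empty slice when dim_size is not divisible by world_size.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    chunk = -(-dim_size // world_size)  # ceiling division
    start = min(rank * chunk, dim_size)
    end = min(start + chunk, dim_size)
    return start, end


def read_shard(read_rows, dim_size, world_size, rank):
    """Shard-on-read: ask storage only for the rows this rank owns."""
    start, end = shard_bounds(dim_size, world_size, rank)
    return read_rows(start, end)


def gather_full(shards):
    """Gather-before-save: concatenate per-rank shards in rank order."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full


if __name__ == "__main__":
    # A fake 10-row "checkpoint tensor"; slicing stands in for a partial read.
    checkpoint = [[float(i), float(i) + 0.5] for i in range(10)]
    world_size = 4
    shards = [
        read_shard(lambda s, e: checkpoint[s:e], len(checkpoint), world_size, r)
        for r in range(world_size)
    ]
    print([len(s) for s in shards])
    print(gather_full(shards) == checkpoint)
```

The point of the sketch is that no rank ever materializes the full tensor while loading, while a save in the regular (non-DCP) layout needs the shards reassembled in rank order.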


HuggingFaceDocBuilderDev · 2026-07-01T05:52:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wire FSDP tests into the dynamic PR CI caller, mirroring tests_tensor_parallel_ci: detect tests_fsdp_ci_test_list.txt, run with is_fsdp_test marker and RUN_FSDP_TESTS, and exclude FSDP tests from the tests_torch job. Companion to huggingface/transformers#46990 (tests_fetcher changes stay in transformers). Co-authored-by: Cursor <cursoragent@cursor.com>
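The CI wiring described above can be pictured with a small sketch. Only the file name `tests_fsdp_ci_test_list.txt`, the `is_fsdp_test` marker, and the `RUN_FSDP_TESTS` variable come from the text; the function names, the value `"1"` for the variable, and the one-test-path-per-whitespace-token file layout are assumptions made for illustration, not the actual workflow code.

```python
from pathlib import Path

TEST_LIST_NAME = "tests_fsdp_ci_test_list.txt"  # file name taken from the PR text


def build_fsdp_job(workdir):
    """Return (env, argv) for an FSDP test job, or None if nothing to run.

    Hypothetical helper: the job exists only when the test list file is
    present and non-empty.
    """
    list_path = Path(workdir) / TEST_LIST_NAME
    if not list_path.is_file():
        return None
    tests = list_path.read_text().split()  # assumed: whitespace-separated paths
    if not tests:
        return None
    env = {"RUN_FSDP_TESTS": "1"}  # assumed value; the text only names the variable
    argv = ["python", "-m", "pytest", "-m", "is_fsdp_test", *tests]
    return env, argv


def torch_job_marker_expression():
    """Marker expression that keeps FSDP tests out of the generic torch job."""
    return "not is_fsdp_test"


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        print(build_fsdp_job(tmp))  # no list file yet
        (Path(tmp) / TEST_LIST_NAME).write_text("tests/fsdp/test_a.py\n")
        print(build_fsdp_job(tmp))
        print(torch_job_marker_expression())
```

Selecting with `-m is_fsdp_test` in one job and `-m "not is_fsdp_test"` in the other is what keeps the two jobs from running the same tests twice.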

3outeille · 2026-07-03T03:52:49Z

run-slow: cohere2_moe, deepseek_v4, glm_moe_dsa, gpt_oss

github-actions · 2026-07-03T03:54:11Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/cohere2_moe", "models/deepseek_v4", "models/glm_moe_dsa", "models/gpt_oss"]
quantizations: []


github-actions · 2026-07-03T05:02:23Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	1bcc0fb0	workflow commit (merge commit)
PR	37df13ee	branch commit (from PR)
main	70544cd9	base commit (on `main`)

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.


github-actions · 2026-07-03T05:09:40Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_moe, deepseek_v4, glm_moe_dsa, gpt_oss

github-actions · 2026-07-03T05:46:32Z

CI recap

Dashboard: View test results in Grafana
Latest run: 28639792108:1
Result: success | Jobs: 15 | Tests: 79,359 | Failures: 0 | Duration: 16h 45m


3outeille and others added 4 commits July 1, 2026 05:08

Merge branch 'split/a-pr-3-dual-path-loading' into split/a-pr-4-fsdp-…

17c6d40

…orchestration

add fsdp plan to 2 models for now

00eb116

add tests fsdp mixin

be296dd

3outeille mentioned this pull request Jul 1, 2026

Add native FSDP2 module + migration #46707

Open

linting

05900d6

3outeille added 8 commits July 1, 2026 06:15

refactor test fsdp mixin

fc2423b

test fsdp mixin cleaning

5bbd820

remove fsdp policy in tests + trim down further

b6d0b67

test fsdp clean

ea36123

restore test_modeling_utils

bec4d23

linting

8d3d329

start trim down stuff

6316ee1

fix

6e9004e

3outeille marked this pull request as draft July 1, 2026 08:49

3outeille added 13 commits July 2, 2026 04:38

breaking: cleaning modeling_utils.py

68df491

load path with fsdp (dtensor) and tp (old tp) is linked

16b0b29

linting

e976a44

add saving

a2fb155

styling

5f52f19

fix tp ci

7f54301

add fsdp to ci

99f79ac

linting

b451490

pick one model only for this PR

54ff4d1

restore

3399539

trigger fsdp ci

11cf79c

doc cleaning + tp_size remove

5b7ac3e

fix tp ci for ep

06b0c39

3outeille mentioned this pull request Jul 3, 2026

Add FSDP CI job to PR workflow huggingface/transformers-test-ci#80

Merged

edit doc

7a94d77

3outeille changed the title from "FSDP orchestration: mesh init, distribute-before-load, DCP save" to "FSDP orchestration: apply + loading/saving" Jul 3, 2026

3outeille added 2 commits July 3, 2026 03:17

move distributed function to utils + guarding

f3c742b

linting

37df13e

3outeille marked this pull request as ready for review July 3, 2026 03:49

Merge branch 'split/a-pr-3-dual-path-loading' into split/a-pr-4-fsdp-…

86875d2

…orchestration

Merge branch 'split/a-pr-3-dual-path-loading' into split/a-pr-4-fsdp-…

450579b

…orchestration

3outeille requested a review from ArthurZucker July 3, 2026 05:10

3outeille wants to merge 31 commits into split/a-pr-3-dual-path-loading from split/a-pr-4-fsdp-orchestration


3 participants
