FSDP orchestration: apply + loading/saving #46990
Wire distributed_config into from_pretrained/save_pretrained alongside the legacy tp_plan path, add distributed/utils.py for mesh orchestration and checkpoint I/O, and extend sharding_utils with DTensor gather/optimizer fusion helpers needed by save/load.
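To make the load/save round trip concrete, here is a minimal sketch of the idea in plain Python: each rank reads only its slice of a parameter on load, and the slices are concatenated back on save. It assumes dim-0 sharding with ceil-division chunks (the same split sizes `torch.chunk` produces). The function names `shard_bounds`, `read_shard`, and `gather_full` are hypothetical and are not the helpers added in this PR, which operate on DTensors and are not shown here.

```python
from typing import List, Tuple


def shard_bounds(dim_size: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, stop) rows of dim 0 owned by `rank`.

    Assumption: ceil-division chunks, so trailing ranks may own fewer
    rows, or none at all when dim_size is small.
    """
    chunk = -(-dim_size // world_size)  # ceil(dim_size / world_size)
    start = min(rank * chunk, dim_size)
    stop = min(start + chunk, dim_size)
    return start, stop


def read_shard(rows: List[List[float]], world_size: int, rank: int) -> List[List[float]]:
    """Load path: keep only the rows this rank owns instead of the full tensor."""
    start, stop = shard_bounds(len(rows), world_size, rank)
    return rows[start:stop]


def gather_full(shards: List[List[List[float]]]) -> List[List[float]]:
    """Save path: concatenate per-rank shards (in rank order) into the full tensor."""
    full: List[List[float]] = []
    for shard in shards:
        full.extend(shard)
    return full
```

In a real run each rank would hold only its own shard and the gather would be a collective; the list of shards here stands in for that communication step.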
Wire FSDP tests into the dynamic PR CI caller, mirroring tests_tensor_parallel_ci: detect tests_fsdp_ci_test_list.txt, run with is_fsdp_test marker and RUN_FSDP_TESTS, and exclude FSDP tests from the tests_torch job. Companion to huggingface/transformers#46990 (tests_fetcher changes stay in transformers). Co-authored-by: Cursor <cursoragent@cursor.com>
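The selection logic described in this commit message can be sketched as follows. This is an illustration only, not the actual CI workflow: the function names are hypothetical, the value `"1"` for `RUN_FSDP_TESTS` is an assumption, and the real caller may invoke pytest differently. Only the file name, the `is_fsdp_test` marker, and the environment variable name come from the message above.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# File name taken from the commit message.
FSDP_TEST_LIST = "tests_fsdp_ci_test_list.txt"


def plan_fsdp_job(workdir: str) -> Optional[Tuple[List[str], Dict[str, str]]]:
    """Return (command, extra_env) for the FSDP job, or None if there is nothing to run."""
    list_file = Path(workdir) / FSDP_TEST_LIST
    if not list_file.is_file():
        return None  # no test list detected, so the FSDP job is skipped
    tests = list_file.read_text().split()
    if not tests:
        return None
    # Select only tests carrying the FSDP marker.
    cmd = ["python", "-m", "pytest", "-m", "is_fsdp_test", *tests]
    env = {"RUN_FSDP_TESTS": "1"}  # assumed value; only the variable name is from the source
    return cmd, env


def torch_job_marker_args() -> List[str]:
    """Marker expression that keeps FSDP tests out of the tests_torch job."""
    return ["-m", "not is_fsdp_test"]
```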
run-slow: cohere2_moe, deepseek_v4, glm_moe_dsa, gpt_oss
This comment contains models: ["models/cohere2_moe", "models/deepseek_v4", "models/glm_moe_dsa", "models/gpt_oss"]
CI Results
The test failure analysis could not be completed. Please check the workflow run for details.
[For maintainers] Suggested jobs to run (before merge): run-slow: cohere2_moe, deepseek_v4, glm_moe_dsa, gpt_oss
Summary
- base_fsdp_plan. Will do another PR to edit every other model later
- from_pretrained shard-on-read + saving like TP (DCP optional)
- DistributedConfig everywhere (no more tp_plan=auto)
Stack