Skip to content

Fix save_pretrained with offloading and weight conversions#47018

Open
Cyrilvallez wants to merge 5 commits into
mainfrom
fix-save-offloading
Open

Fix save_pretrained with offloading and weight conversions#47018
Cyrilvallez wants to merge 5 commits into
mainfrom
fix-save-offloading

Conversation

@Cyrilvallez

@Cyrilvallez Cyrilvallez commented Jul 2, 2026

Copy link
Copy Markdown
Member

CI

What does this PR do?

As per the title. Currently, since offloaded weights are on meta device and reverse conversions happens BEFORE we reload them from disk, the resulting converted tensors are on meta as well, and then we will usually fail when trying to load them back from the model, since they changed name.
Since we cannot load everything back to cpu at the beginning with offloading (we have a constrained cpu environment usually, otherwise we would not offload), the best is to skip conversion at the beginning, load all params of a given file shard to cpu, and reverse-convert only this shard before starting the next. This is however not fullproof for one-weight-to-many conversions, because the reverse is many-weight-to-one, meaning we need several weights to be able to perform the conversion, and they may not live all in the same shard if we are unlucky with how the shards were created. In those cases, we raise a nice and clear error, as we cannot do better without probably overloading the memory compl,etely anyway.

Supersedes #46902

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

CI recap

Dashboard: View test results in Grafana
Latest run: 28581147185:1
Result: success | Jobs: 13 | Tests: 46,177 | Failures: 0 | Duration: 16h 9m

@kylesayrs kylesayrs left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'll test this with some of my large moe models later today.

I think the one difference I can find between this and #46902 is that

  • #46902 handles many-to-one reversion optimally without erroring, but suffers increased cpu memory usage for unlucky sorting + one-to-many reversion (but afaict this can be completely negated)
  • (this PR) handles many-to-one reversion with an error if unlucky, but handles one-to-many reversion optimally without erroring

My only caution is that, when many-to-one reversions do happen, the unlucky case of splitting across shards can be common, even when it's just (2/3-to-1). When compressed-tensors needed similar logic to fuse fused weights (3-to-1) and dequantization (2-to-1), we found that almost half of the shards ended up splitting these params.

By contrast, unlucky sorting + one-to-many reversion can essentially be avoided if the state_dict is sorted before split_torch_state_dict_into_shards.

So I do think that we should try to figure out when many-to-one conversions happen and whether this is going to block those kinds of models from being saved. If HF starts using many-to-one conversions more heavily, then this might have to change.

EDIT: I thought of one such use case where downstream user LLM Compressor will want to support reverting linearized MoE weights into packed 3D weights in order to match the original checkpoint's format for models like https://huggingface.co/collections/Qwen/qwen3-vl.

is_offloaded = True
warnings.warn(
"Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory "
"exceeds the `shard_size` (50GB default)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"exceeds the `shard_size` (50GB default) and/or the largest model weight size (this can "
"be very large for MoE models with fused experts)."

Comment on lines 3512 to 3516
if (
hasattr(self, "hf_device_map")
and len(set(self.hf_device_map.values())) > 1
and ("cpu" in self.hf_device_map.values() or "disk" in self.hf_device_map.values())
):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support full disk offloading

Suggested change
if hasattr(self, "hf_device_map") and (
len(set(self.hf_device_map.values())) > 1 or "disk" in self.hf_device_map.values()
):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants