Fix save_pretrained with offloading and weight conversions#47018
Fix save_pretrained with offloading and weight conversions#47018Cyrilvallez wants to merge 5 commits into
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
CI recapDashboard: View test results in Grafana |
There was a problem hiding this comment.
Looks good to me! I'll test this with some of my large moe models later today.
I think the one difference I can find between this and #46902 is that
- #46902 handles many-to-one reversion optimally without erroring, but suffers increased cpu memory usage for unlucky sorting + one-to-many reversion (but afaict this can be completely negated)
- (this PR) handles many-to-one reversion with an error if unlucky, but handles one-to-many reversion optimally without erroring
My only caution is that, when many-to-one reversions do happen, the unlucky case of splitting across shards can be common, even when it's just (2/3-to-1). When compressed-tensors needed similar logic to fuse fused weights (3-to-1) and dequantization (2-to-1), we found that almost half of the shards ended up splitting these params.
By contrast, unlucky sorting + one-to-many reversion can essentially be avoided if the state_dict is sorted before split_torch_state_dict_into_shards.
So I do think that we should try to figure out when many-to-one conversions happen and whether this is going to block those kinds of models from being saved. If HF starts using many-to-one conversions more heavily, then this might have to change.
EDIT: I thought of one such use case where downstream user LLM Compressor will want to support reverting linearized MoE weights into packed 3D weights in order to match the original checkpoint's format for models like https://huggingface.co/collections/Qwen/qwen3-vl.
| is_offloaded = True | ||
| warnings.warn( | ||
| "Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory " | ||
| "exceeds the `shard_size` (50GB default)" |
There was a problem hiding this comment.
| "exceeds the `shard_size` (50GB default) and/or the largest model weight size (this can " | |
| "be very large for MoE models with fused experts)." |
| if ( | ||
| hasattr(self, "hf_device_map") | ||
| and len(set(self.hf_device_map.values())) > 1 | ||
| and ("cpu" in self.hf_device_map.values() or "disk" in self.hf_device_map.values()) | ||
| ): |
There was a problem hiding this comment.
Support full disk offloading
| if hasattr(self, "hf_device_map") and ( | |
| len(set(self.hf_device_map.values())) > 1 or "disk" in self.hf_device_map.values() | |
| ): |
What does this PR do?
As per the title. Currently, since offloaded weights are on meta device and reverse conversions happens BEFORE we reload them from disk, the resulting converted tensors are on meta as well, and then we will usually fail when trying to load them back from the model, since they changed name.
Since we cannot load everything back to cpu at the beginning with offloading (we have a constrained cpu environment usually, otherwise we would not offload), the best is to skip conversion at the beginning, load all params of a given file shard to cpu, and reverse-convert only this shard before starting the next. This is however not fullproof for one-weight-to-many conversions, because the reverse is many-weight-to-one, meaning we need several weights to be able to perform the conversion, and they may not live all in the same shard if we are unlucky with how the shards were created. In those cases, we raise a nice and clear error, as we cannot do better without probably overloading the memory compl,etely anyway.
Supersedes #46902