FSDP v2 with DCP checkpoint #240
Conversation
Cool! I will check it in these two days when I'm free!
@jd-nuva Looks correct! Could you please provide the relevant training file, such as …
This is the part that would require deeper integration, because it only works best when training is resumed from an FSDP v2 checkpoint at step 0 in the first place. From numerous attempts, I found the existing Bagel code has a few weight-copy / post-init operations, and simply running the training script while directly calling these utility functions leads to subtle but severe training-quality errors. So far the best path is …
I think alternatively we can get the utility functions merged first, then I can put up a separate script like …
Cool! Can you provide these files to help me debug?
@jd-nuva I tried the current logic, but I’m running into another issue. Since I need to train both T2I and I2T, I have to run …
Hmm, I haven't run that. The FSDP v2 logic should strictly be a replacement of the existing FSDP v1 path only, and …
Summary
The existing Bagel code initializes on CPU first, materializing all tensors before sharding or moving them to GPU. As a result, on my 8xH100 machine, initialization alone takes 15–20 minutes before any work can start, which significantly slows down iteration.
This PR switches to FSDP v2, which is natively built on distributed DTensor and provides finer-grained control over initialization, sharding, gradient clipping, and checkpointing.
The benefit of the FSDP v2 approach is that we can use meta / empty initialization to shard the model locally without materializing all tensors on CPU, so each GPU worker only needs to read the tensors it is responsible for instead of an entire copy.
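For reference, a minimal sketch of the meta-device init + FSDP v2 sharding flow (an illustration, not the actual code in this PR). It assumes a recent PyTorch where `fully_shard` is exposed under `torch.distributed.fsdp`; `model_cls` and `model.blocks` are placeholder names, not the real Bagel modules:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP v2 API

def build_sharded_model(model_cls, *args, **kwargs):
    # 1. Construct on the meta device: no tensor storage is allocated,
    #    so construction is near-instant regardless of model size.
    with torch.device("meta"):
        model = model_cls(*args, **kwargs)

    # 2. Register FSDP v2 sharding on each block, then the root module.
    #    Parameters become DTensors sharded over the mesh; still no storage.
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))
    for block in model.blocks:  # hypothetical attribute
        fully_shard(block, mesh=mesh)
    fully_shard(model, mesh=mesh)

    # 3. Allocate only the local shards on GPU; weights are then filled in
    #    from a checkpoint rather than a full CPU copy.
    model.to_empty(device="cuda")
    return model
```

Because storage is only ever allocated for the local shards in step 3, no rank pays the cost of materializing the full model.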
Perf difference
On an 8xH100 dev machine, model init time drops from ~15 minutes to ~30 seconds, and the approach naturally extends to multi-node distributed training via replication.
Next Steps
The proposed scripts and functions were tested on my local 8xH100 host with numerically correct and stable loss (~0.31 across the first ~20 or so steps, gradient norm < 0.1, no explosions or NaNs).
However, there is no training-script integration yet, since I've made significant modifications to the existing Bagel codebase for this; that will need more input and guidance from the Bagel team.
NOTE: The existing script has many sharp edges around post-init and weight copying, so the safest path I found is to initialize on CPU first using the existing CPU + FSDP v1 code, shard with FSDP v2 and save as a DCP checkpoint, then have all subsequent training runs load that DCP checkpoint directly at step 0.
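A minimal sketch of that save-once / load-thereafter flow with `torch.distributed.checkpoint` (DCP); the checkpoint path and function names below are hypothetical, not the exact code in this PR:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict,
    set_model_state_dict,
)

CKPT_DIR = "checkpoints/fsdp2_step0"  # hypothetical path

def save_step0_checkpoint(model):
    # One-time run: model initialized via the existing CPU + FSDP v1 path,
    # then re-sharded with FSDP v2. Each rank writes only its local shards.
    state_dict = get_model_state_dict(model)
    dcp.save(state_dict, checkpoint_id=CKPT_DIR)

def load_step0_checkpoint(model):
    # All subsequent runs: model built on meta device and sharded with
    # FSDP v2 (see sketch above); each rank reads only the shards it owns.
    state_dict = get_model_state_dict(model)
    dcp.load(state_dict, checkpoint_id=CKPT_DIR)
    set_model_state_dict(model, state_dict)
```

Since DCP reads and writes are shard-local, this avoids both the slow full CPU materialization and the sharp post-init / weight-copy edges on every run after the first.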