Hi, thank you for sharing this very inspiring work!
While reading the paper and trying to reproduce the results, I ran into a question about the training setup:
In the official Wan training code, the batch size does not seem to be easily adjustable (it is typically limited to 1 per GPU). However, the paper mentions training on 8× A800 GPUs with a global batch size of 64.
I would like to clarify:
Is the global batch size of 64 achieved via gradient accumulation (e.g., gradient_accumulation_steps = 8)?
Or are there additional modifications to the training framework (e.g., data parallelism, FSDP/DeepSpeed, or other optimizations)?
If possible, could you share more details about the training configuration? That would be very helpful for reproduction.
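For reference, here is a minimal sketch of the setup I had in mind when asking this: plain PyTorch DDP with gradient accumulation, where 8 GPUs × per-GPU batch size 1 × 8 accumulation steps = 64. This is only my guess at the configuration, not your actual code, and `build_model`, `dataloader`, and `compute_loss` are placeholders:

```python
# Sketch of my assumed setup (NOT the official Wan training code):
# 8 GPUs under DDP, per-GPU batch size 1, gradient accumulation of 8 steps,
# i.e. a global batch size of 8 * 1 * 8 = 64.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # one process per GPU, launched via torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_model().cuda()             # placeholder for the Wan model
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

grad_accum_steps = 8                     # 8 GPUs * batch 1 * 8 accum = 64 global

optimizer.zero_grad()
for step, batch in enumerate(dataloader):    # placeholder loader, batch size 1
    loss = compute_loss(model, batch)        # placeholder loss computation
    (loss / grad_accum_steps).backward()     # scale so accumulation averages the loss
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

If the real setup instead relies on FSDP/DeepSpeed or a larger per-GPU batch, I would appreciate any pointers to the relevant configuration.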
Thanks a lot!