Hi, thank you for sharing this very inspiring work!
While reading the paper and trying to reproduce the results, I ran into a question about the training setup:
In the official Wan training code, the batch size does not seem to be easily adjustable (it is typically limited to 1 per GPU). However, the paper mentions training on 8× A800 GPUs with a global batch size of 64.
I would like to clarify:
Is the global batch size of 64 achieved via gradient accumulation (e.g., gradient_accumulation_steps = 8)?
Or are there additional modifications to the training framework (e.g., data parallelism, FSDP/DeepSpeed, or other optimizations)?
If possible, could you share more details about the training configuration? That would be very helpful for reproduction.
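For reference, here is a minimal sketch of the setup I had in mind when asking this: plain PyTorch DDP with gradient accumulation, where 8 GPUs × per-GPU batch size 1 × 8 accumulation steps = 64. This is only my guess at the configuration, not your actual code, and `build_model`, `dataloader`, and `compute_loss` are placeholders:

```python
# Sketch of my assumed setup (NOT the official Wan training code):
# 8 GPUs under DDP, per-GPU batch size 1, gradient accumulation of 8 steps,
# i.e. a global batch size of 8 * 1 * 8 = 64.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # one process per GPU, launched via torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_model().cuda()             # placeholder for the Wan model
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

grad_accum_steps = 8                     # 8 GPUs * batch 1 * 8 accum = 64 global

optimizer.zero_grad()
for step, batch in enumerate(dataloader):    # placeholder loader, batch size 1
    loss = compute_loss(model, batch)        # placeholder loss computation
    (loss / grad_accum_steps).backward()     # scale so accumulation averages the loss
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

If the real setup instead relies on FSDP/DeepSpeed or a larger per-GPU batch, I would appreciate any pointers to the relevant configuration.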
Thanks a lot!