About batchsize=64 #2

@wghr123

Description

Hi, thank you for sharing this very inspiring work!

While reading the paper and trying to reproduce the results, I ran into a question about the training setup:

In the official Wan training code, the batch size does not seem to be easily adjustable (it is typically limited to a per-GPU batch size of 1). However, the paper mentions training on 8× A800 GPUs with a global batch size of 64.

I would like to clarify:

1. Is the global batch size of 64 achieved via gradient accumulation (e.g., `gradient_accumulation_steps = 8`)? A rough sketch of what I have in mind is included below.
2. Or are there additional modifications to the training framework (e.g., data parallelism, FSDP/DeepSpeed, or other optimizations)?
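
For reference, this is a minimal sketch of my current understanding, not the actual repo code: with 8 data-parallel ranks, per-GPU batch size 1, and an assumed `grad_accum_steps = 8`, the effective global batch would be 64. The model, optimizer, and dummy data below are placeholders I made up for illustration.

```python
# Sketch only: 8 GPUs * per-GPU batch 1 * 8 accumulation steps = global batch 64.
# Names and values here are my assumptions, not taken from the Wan training code.
import torch

world_size = 8          # number of A800 GPUs (data-parallel ranks)
per_gpu_batch = 1       # per-GPU batch size, as in the official training code
grad_accum_steps = 8    # hypothetical value needed to reach a global batch of 64
assert world_size * per_gpu_batch * grad_accum_steps == 64

model = torch.nn.Linear(16, 16)  # stand-in for the diffusion model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Per-rank view: each "batch" has size 1; gradients are accumulated locally,
# and (under DDP/FSDP) would also be averaged across the 8 ranks.
for step, batch in enumerate(torch.randn(64, 1, 16)):  # dummy data, batch size 1
    loss = model(batch).pow(2).mean()
    (loss / grad_accum_steps).backward()  # scale so accumulated grads average out
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Is something along these lines what was used, or does the actual setup differ?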

If possible, could you share more details about the training configuration? That would be very helpful for reproduction.

Thanks a lot!
