Inference on single_example_image.json takes 1 hour on 4 RTX 3090 GPUs. Is there anything I can do to speed it up?

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPU_NUM=4
torchrun --nproc_per_node=$GPU_NUM --standalone generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --dit_fsdp --t5_fsdp \
    --ulysses_size=$GPU_NUM \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multigpu

This is the script I used. PYTORCH_CUDA_ALLOC_CONF is set because of an OOM problem.
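As a rough back-of-the-envelope sketch, the snippet below estimates how the run time might change if `--sample_steps` were lowered from 40. It assumes wall-clock time scales linearly with the number of sampling steps, which is only an approximation: fixed costs such as model loading and decoding are ignored, and fewer steps may reduce output quality. The helper name `estimate_minutes` is hypothetical and not part of the repository.

```shell
# Rough estimate only: assumes run time scales linearly with --sample_steps
# and ignores fixed costs (model loading, audio encoding, decoding).
estimate_minutes() {
    # $1 = measured minutes, $2 = steps used for that measurement, $3 = candidate steps
    echo $(( $1 * $3 / $2 ))
}

# Baseline from the run above: 60 minutes at 40 steps.
for steps in 40 30 20; do
    echo "sample_steps=$steps: ~$(estimate_minutes 60 40 "$steps") min"
done
```

Timing a short run at two different step counts would show how far the linear assumption holds on this hardware.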

I need 1 hr to inference single_example_image.json on 4 3090 GPUs, is there anything I can do to increase the speed? #197
