Description
(Qwen_Image) ➜ DiffSynth-Studio git:(main) ✗ sh examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00001-of-00004.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00002-of-00004.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00003-of-00004.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00004-of-00004.safetensors"
]
[rank6]: Traceback (most recent call last):
[rank6]: File "/nas/xyq/DiffSynth-Studio/examples/qwen_image/model_training/train.py", line 136, in <module>
[rank6]: model = QwenImageTrainingModule(
[rank6]: File "/nas/xyq/DiffSynth-Studio/examples/qwen_image/model_training/train.py", line 31, in __init__
[rank6]: self.pipe = QwenImagePipeline.from_pretrained(torch_dtype=torch.bfloat16, device=device, model_configs=model_configs, tokenizer_config=tokenizer_config, processor_config=processor_config)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/pipelines/qwen_image.py", line 71, in from_pretrained
[rank6]: model_pool = pipe.download_and_load_models(model_configs, vram_limit)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/diffusion/base_pipeline.py", line 293, in download_and_load_models
[rank6]: model_pool.auto_load_model(
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/models/model_loader.py", line 70, in auto_load_model
[rank6]: model = self.load_model_file(config, path, vram_config, vram_limit=vram_limit)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/models/model_loader.py", line 40, in load_model_file
[rank6]: model = load_model(
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/core/loader/model.py", line 46, in load_model
[rank6]: state_dict = state_dict_converter(state_dict)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/utils/state_dict_converters/qwen_image_text_encoder.py", line 4, in QwenImageTextEncoderStateDictConverter
[rank6]: v = state_dict[k]
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/core/vram/disk_map.py", line 62, in __getitem__
[rank6]: param = self.files[file_id].get_tensor(name)
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 130.00 MiB. GPU 6 has a total capacity of 44.53 GiB of which 131.94 MiB is free. Including non-PyTorch memory, this process has 44.39 GiB memory in use. Of the allocated memory 44.11 GiB is allocated by PyTorch, and 4.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
[rank6]:[W115 17:02:46.284882681 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506260 closing signal SIGTERM
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506261 closing signal SIGTERM
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506262 closing signal SIGTERM
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506263 closing signal SIGTERM
W0115 17:02:46.361000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506264 closing signal SIGTERM
W0115 17:02:46.361000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506265 closing signal SIGTERM
W0115 17:02:46.361000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506267 closing signal SIGTERM
E0115 17:02:47.089000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 6 (pid: 1506266) of binary: /nas/syh/miniconda3/envs/Qwen_Image/bin/python3.10
Traceback (most recent call last):
File "/nas/syh/miniconda3/envs/Qwen_Image/bin/accelerate", line 7, in <module>
sys.exit(main())
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1272, in launch_command
multi_gpu_launcher(args)
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher
distrib_run.run(args)
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
examples/qwen_image/model_training/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2026-01-15_17:02:46
host : iZl4vdqozw0b94f5yrqsbpZ
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 1506266)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[W115 17:02:47.738886932 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())

I am using 8 * L20 GPUs.
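
For reference, the `accelerate launch` warning at the top only concerns implicit defaults. A minimal sketch of passing them explicitly (the training arguments that Qwen-Image-Edit-2511.sh forwards to train.py are left as a placeholder here, not reproduced):

```sh
# Sketch only: make the launch parameters explicit instead of relying on defaults.
# $TRAIN_ARGS stands in for whatever arguments Qwen-Image-Edit-2511.sh already passes to train.py.
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --dynamo_backend no \
  examples/qwen_image/model_training/train.py $TRAIN_ARGS
```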
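The OOM message itself suggests `expandable_segments`, and the later allocator warning notes that `PYTORCH_CUDA_ALLOC_CONF` is deprecated in favor of `PYTORCH_ALLOC_CONF`. A minimal sketch of applying that hint when launching; it only mitigates fragmentation, so it may not help if each rank genuinely needs more than the ~44.5 GiB an L20 provides:

```sh
# Sketch of applying the allocator hint from the error message before launching.
# Fragmentation mitigation only; it does not reduce how much each rank loads onto its GPU.
PYTORCH_ALLOC_CONF=expandable_segments:True \
  sh examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh
```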