Description
(Qwen_Image) ➜ DiffSynth-Studio git:(main) ✗ sh examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
	More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00002-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00003-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00004-of-00005.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/transformer/diffusion_pytorch_model-00005-of-00005.safetensors"
]
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Loading models from: [
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00001-of-00004.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00002-of-00004.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00003-of-00004.safetensors",
"/nas/xyq/Qwen-Image/Qwen-Image-Edit-2511/text_encoder/model-00004-of-00004.safetensors"
]
[rank6]: Traceback (most recent call last):
[rank6]: File "/nas/xyq/DiffSynth-Studio/examples/qwen_image/model_training/train.py", line 136, in <module>
[rank6]: model = QwenImageTrainingModule(
[rank6]: File "/nas/xyq/DiffSynth-Studio/examples/qwen_image/model_training/train.py", line 31, in __init__
[rank6]: self.pipe = QwenImagePipeline.from_pretrained(torch_dtype=torch.bfloat16, device=device, model_configs=model_configs, tokenizer_config=tokenizer_config, processor_config=processor_config)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/pipelines/qwen_image.py", line 71, in from_pretrained
[rank6]: model_pool = pipe.download_and_load_models(model_configs, vram_limit)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/diffusion/base_pipeline.py", line 293, in download_and_load_models
[rank6]: model_pool.auto_load_model(
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/models/model_loader.py", line 70, in auto_load_model
[rank6]: model = self.load_model_file(config, path, vram_config, vram_limit=vram_limit)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/models/model_loader.py", line 40, in load_model_file
[rank6]: model = load_model(
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/core/loader/model.py", line 46, in load_model
[rank6]: state_dict = state_dict_converter(state_dict)
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/utils/state_dict_converters/qwen_image_text_encoder.py", line 4, in QwenImageTextEncoderStateDictConverter
[rank6]: v = state_dict[k]
[rank6]: File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/diffsynth/core/vram/disk_map.py", line 62, in __getitem__
[rank6]: param = self.files[file_id].get_tensor(name)
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 130.00 MiB. GPU 6 has a total capacity of 44.53 GiB of which 131.94 MiB is free. Including non-PyTorch memory, this process has 44.39 GiB memory in use. Of the allocated memory 44.11 GiB is allocated by PyTorch, and 4.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Loaded model: {
"model_name": "qwen_image_dit",
"model_class": "diffsynth.models.qwen_image_dit.QwenImageDiT",
"extra_kwargs": null
}
Downloading Model from https://www.modelscope.cn to directory: /nas/xyq/DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2511
[rank6]:[W115 17:02:46.284882681 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506260 closing signal SIGTERM
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506261 closing signal SIGTERM
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506262 closing signal SIGTERM
W0115 17:02:46.360000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506263 closing signal SIGTERM
W0115 17:02:46.361000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506264 closing signal SIGTERM
W0115 17:02:46.361000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506265 closing signal SIGTERM
W0115 17:02:46.361000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1506267 closing signal SIGTERM
E0115 17:02:47.089000 1505942 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 6 (pid: 1506266) of binary: /nas/syh/miniconda3/envs/Qwen_Image/bin/python3.10
Traceback (most recent call last):
File "/nas/syh/miniconda3/envs/Qwen_Image/bin/accelerate", line 7, in <module>
sys.exit(main())
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1272, in launch_command
multi_gpu_launcher(args)
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/accelerate/commands/launch.py", line 899, in multi_gpu_launcher
distrib_run.run(args)
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/nas/syh/miniconda3/envs/Qwen_Image/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
examples/qwen_image/model_training/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2026-01-15_17:02:46
host : iZl4vdqozw0b94f5yrqsbpZ
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 1506266)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[W115 17:02:47.738886932 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())

I am using 8 * L20 GPUs.
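
For reference, the `accelerate launch` warning at the top only concerns implicit defaults. A minimal sketch of passing them explicitly (the training arguments that Qwen-Image-Edit-2511.sh forwards to train.py are left as a placeholder here, not reproduced):

```sh
# Sketch only: make the launch parameters explicit instead of relying on defaults.
# $TRAIN_ARGS stands in for whatever arguments Qwen-Image-Edit-2511.sh already passes to train.py.
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --dynamo_backend no \
  examples/qwen_image/model_training/train.py $TRAIN_ARGS
```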
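The OOM message itself suggests `expandable_segments`, and the later allocator warning notes that `PYTORCH_CUDA_ALLOC_CONF` is deprecated in favor of `PYTORCH_ALLOC_CONF`. A minimal sketch of applying that hint when launching; it only mitigates fragmentation, so it may not help if each rank genuinely needs more than the ~44.5 GiB an L20 provides:

```sh
# Sketch of applying the allocator hint from the error message before launching.
# Fragmentation mitigation only; it does not reduce how much each rank loads onto its GPU.
PYTORCH_ALLOC_CONF=expandable_segments:True \
  sh examples/qwen_image/model_training/lora/Qwen-Image-Edit-2511.sh
```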