Description
Hello! When using GD for multi-task training, I encountered an issue: in multi-task mode, the loss never converges, as if the model isn't learning at all, and the evaluation results are all zero. However, when switching to single-task mode, training proceeds normally and the loss decreases as expected. What might be the cause of this? Are there any key parameter configurations or training strategies that need adjustment? Looking forward to your reply, thank you very much!
Here are some training logs:
```
Rank[0/1] 08/02/2025 07:21:43 INFO stats.py:335 | Epoch[1] Step[1675] GlobalStep[3824] Training Speed: 18.99 samples/sec across all devices. Average Step Time: 0.42 sec. Estimated Remaining Time: 11:15:22. Learning Rate Group 0: 1.00000e-04. Learning Rate Group 1: 1.00000e-04.
Rank[0/1] 08/02/2025 07:21:56 INFO loss_tracker.py:84 | Epoch[1/NA] Step[1699] GlobalStep[3848/99999]: loss_noise_mse[0.3521] loss_fk_mse[0.1989] loss_depth[0.0457] total_loss[0.5967]
...
Rank[0/1] 08/02/2025 07:52:35 INFO stats.py:335 | Epoch[3] Step[902] GlobalStep[7349] Training Speed: 18.83 samples/sec across all devices. Average Step Time: 0.42 sec. Estimated Remaining Time: 10:55:54. Learning Rate Group 0: 1.00000e-04. Learning Rate Group 1: 1.00000e-04.
Rank[0/1] 08/02/2025 07:52:47 INFO loss_tracker.py:84 | Epoch[3/NA] Step[924] GlobalStep[7371/99999]: loss_noise_mse[0.3393] loss_fk_mse[0.1990] loss_depth[0.0457] total_loss[0.5841]
```
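To quantify the stagnation, the per-task losses can be pulled out of log lines like the ones above and compared between the two snapshots. This is just a minimal stdlib sketch (the regex and variable names are illustrative, not part of the training framework); it shows that `loss_fk_mse` and `loss_depth` are essentially flat between GlobalStep 3848 and 7371:

```python
import re

# The two loss_tracker lines from the logs above (message part only).
LOG_LINES = [
    "Epoch[1/NA] Step[1699] GlobalStep[3848/99999]: loss_noise_mse[0.3521] loss_fk_mse[0.1989] loss_depth[0.0457] total_loss[0.5967]",
    "Epoch[3/NA] Step[924] GlobalStep[7371/99999]: loss_noise_mse[0.3393] loss_fk_mse[0.1990] loss_depth[0.0457] total_loss[0.5841]",
]

def parse_losses(line):
    """Extract every name[value] loss entry from one log line."""
    return {name: float(val)
            for name, val in re.findall(r"(loss_\w+|total_loss)\[([\d.]+)\]", line)}

early, late = (parse_losses(line) for line in LOG_LINES)
for name in early:
    drop = early[name] - late[name]
    print(f"{name}: {early[name]:.4f} -> {late[name]:.4f} (drop {drop:+.4f})")
```

Over roughly 3,500 global steps the total loss drops by only about 0.013, and the per-task terms other than `loss_noise_mse` do not move at all, which matches the "not learning" behavior described above.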