Loss does not converge during multi-task training #12

@user123812

Description

Hello! When using GD for multi-task training, I ran into an issue: in multi-task mode the loss never converges, as if the model isn't learning at all, and the evaluation results are also zero. However, when switching to single-task mode, training proceeds normally and the loss decreases as expected. What might be the cause of this behavior? Are there any key parameter configurations or training strategies that need adjustment? Looking forward to your reply, thank you very much!
Here are some training logs:

Rank[0/1] 08/02/2025 07:21:43 INFO stats.py:335 | Epoch[1] Step[1675] GlobalStep[3824] Training Speed: 18.99 samples/sec across all devices. Average Step Time: 0.42 sec. Estimated Remaining Time: 11:15:22. Learning Rate Group 0: 1.00000e-04. Learning Rate Group 1: 1.00000e-04.

Rank[0/1] 08/02/2025 07:21:56 INFO loss_tracker.py:84 | Epoch[1/NA] Step[1699] GlobalStep[3848/99999]: loss_noise_mse[0.3521] loss_fk_mse[0.1989] loss_depth[0.0457] total_loss[0.5967]
…………
Rank[0/1] 08/02/2025 07:52:35 INFO stats.py:335 | Epoch[3] Step[902] GlobalStep[7349] Training Speed: 18.83 samples/sec across all devices. Average Step Time: 0.42 sec. Estimated Remaining Time: 10:55:54. Learning Rate Group 0: 1.00000e-04. Learning Rate Group 1: 1.00000e-04.
Rank[0/1] 08/02/2025 07:52:47 INFO loss_tracker.py:84 | Epoch[3/NA] Step[924] GlobalStep[7371/99999]: loss_noise_mse[0.3393] loss_fk_mse[0.1990] loss_depth[0.0457] total_loss[0.5841]
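
For context, here is a minimal sketch of the kind of multi-task loss combination I have in mind (the function and weight names below are illustrative assumptions, not the actual GD code); with all weights at 1.0 it reduces to the plain sum shown in the logs above:

```python
# Minimal sketch of a weighted multi-task loss combination (PyTorch).
# Weight names/values are illustrative assumptions, not the actual GD config.
import torch

def combine_losses(loss_noise_mse: torch.Tensor,
                   loss_fk_mse: torch.Tensor,
                   loss_depth: torch.Tensor,
                   w_noise: float = 1.0,
                   w_fk: float = 1.0,
                   w_depth: float = 1.0) -> torch.Tensor:
    # With all weights at 1.0 this is the unweighted sum seen in the logs,
    # e.g. 0.3521 + 0.1989 + 0.0457 = 0.5967.
    return (w_noise * loss_noise_mse
            + w_fk * loss_fk_mse
            + w_depth * loss_depth)

# Example usage with dummy scalar losses taken from the log values:
total = combine_losses(torch.tensor(0.3521),
                       torch.tensor(0.1989),
                       torch.tensor(0.0457))
```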
