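```python
import torch

# Debug: on rank 0, log the L2 norm of every parameter's main_grad after backward.
if torch.distributed.get_rank() == 0:
    for name, p in model.named_parameters():
        # main_grad is the gradient buffer that Megatron's DistributedDataParallel
        # wrapper attaches to each parameter; getattr avoids an AttributeError if absent.
        main_grad = getattr(p, "main_grad", None)
        if main_grad is not None:
            logger.info(
                f"[DEBUG_GRAD_NORM][POST_BACKWARD_LAYER] {name} "
                f"norm={main_grad.float().norm().item()}"
            )
```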
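The snippet above logs one L2 norm per parameter. Because the global gradient norm over all parameters equals sqrt(sum_i ||g_i||^2), those per-parameter lines can be combined offline to recover the global norm and to spot which parameters carry NaN/Inf or unusually large gradients. Below is a minimal pure-Python sketch, assuming the log lines keep exactly the `[DEBUG_GRAD_NORM][POST_BACKWARD_LAYER] {name} norm={value}` format; the helper names `parse_grad_norm_lines` and `summarize_grad_norms` and the parameter names in the example are hypothetical, not part of Megatron-LM.

```python
import math
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical parser for lines emitted by the debug snippet. re.search is used so
# that any logger prefix (timestamp, level, ...) before the tag is ignored.
_LINE_RE = re.compile(r"\[DEBUG_GRAD_NORM\]\[POST_BACKWARD_LAYER\] (\S+) norm=(\S+)")


def parse_grad_norm_lines(lines: Iterable[str]) -> Dict[str, float]:
    """Extract {parameter name: main_grad L2 norm} from debug log lines."""
    norms: Dict[str, float] = {}
    for line in lines:
        m = _LINE_RE.search(line)
        if m:
            # float() accepts "nan" and "inf", which is how Python prints non-finite floats.
            norms[m.group(1)] = float(m.group(2))
    return norms


def summarize_grad_norms(
    norms: Dict[str, float], top_k: int = 3
) -> Tuple[float, List[str], List[Tuple[str, float]]]:
    """Return (global L2 norm, non-finite parameter names, top-k largest finite norms).

    The global norm is sqrt(sum_i ||g_i||^2); a single NaN/Inf entry makes it NaN/Inf.
    """
    global_norm = math.sqrt(sum(n * n for n in norms.values()))
    non_finite = [name for name, n in norms.items() if not math.isfinite(n)]
    finite = [(name, n) for name, n in norms.items() if math.isfinite(n)]
    largest = sorted(finite, key=lambda kv: kv[1], reverse=True)[:top_k]
    return global_norm, non_finite, largest


if __name__ == "__main__":
    # Hypothetical log excerpt; real parameter names depend on the model.
    log = [
        "INFO [DEBUG_GRAD_NORM][POST_BACKWARD_LAYER] vision_model.patch_embed.weight norm=3.0",
        "INFO [DEBUG_GRAD_NORM][POST_BACKWARD_LAYER] language_model.output_layer.weight norm=4.0",
    ]
    total, bad, top = summarize_grad_norms(parse_grad_norm_lines(log))
    print(total, bad, top)
```

Recomputing the global norm from the per-parameter lines gives a value that can be compared against the grad norm reported by the training loop; a mismatch, or a non-empty non-finite list, narrows down where the problem enters.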
Environment:
Observed behavior
Steps to reproduce
Run SFT training on qwen3-vl-8b using image data (base64-encoded).
After custom_backward (https://github.com/NVIDIA/Megatron-LM/blob/core_v0.12.1/megatron/core/pipeline_parallel/schedules.py#L130) has executed, add the debug code shown above at https://github.com/NVIDIA/Megatron-LM/blob/core_v0.12.1/megatron/core/pipeline_parallel/schedules.py#L516.
The output is as follows:
Training hyperparameters: