There's a bug in the GRPO advantage calculation at line 159 of verl/trainer/core_algos.py. The standard deviation computation has extra brackets that create an incorrect tensor shape.
Line 159
id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
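id2score[idx] is already a list, so wrapping it in additional brackets, [id2score[idx]], creates a nested list. torch.tensor then builds a tensor of shape (1, N) instead of (N,), where N is the number of scores in the group. Dropping the outer brackets, i.e. torch.std(torch.tensor(id2score[idx])), gives the intended 1-D tensor.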
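A minimal sketch of the shape difference, not taken from the repository: the `scores` list is a hypothetical stand-in for `id2score[idx]`, and it assumes the group holds plain Python floats.

```python
import torch

# Hypothetical group of scores standing in for id2score[idx];
# plain floats are an assumption made for this sketch.
scores = [1.0, 2.0, 3.0]

nested = torch.tensor([scores])  # extra brackets, as on line 159
flat = torch.tensor(scores)      # without the extra brackets

print(nested.shape)  # torch.Size([1, 3])
print(flat.shape)    # torch.Size([3])

# With no dim argument, torch.std reduces over all elements of its input.
std_nested = torch.std(nested)
std_flat = torch.std(flat)
print(std_nested, std_flat)
```

Since torch.std without a dim argument reduces over every element, both calls return the same value in this sketch. The difference is in the intermediate tensor shape, which would start to matter if a dim argument were ever passed.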