I noticed that the labels of the validation set are slightly imbalanced. With seed 0 under my environment settings, I get something like this: Counter({3: 113, 1: 112, 5: 107, 0: 100, 6: 99, 7: 98, 9: 98, 2: 94, 8: 90, 4: 89}). I haven't tested it yet, but maybe stratified sampling would be better?
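
To make the suggestion concrete, here is a minimal sketch of what a stratified split could look like. It is not the repo's actual split code: the function name `stratified_split`, the 50,000-sample label list, and the validation size of 1,000 are all assumptions for illustration (1,000 simply matches the total of the counts above).

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_size, seed=0):
    """Hypothetical helper: split indices so that each class appears in the
    validation set in (roughly) the same proportion as in the full label list."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    n = len(labels)
    val_idx = []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Per-class quota, proportional to class frequency. Because of
        # rounding, the total may differ slightly from val_size when the
        # classes do not divide evenly.
        k = round(val_size * len(idxs) / n)
        val_idx.extend(idxs[:k])

    val_set = set(val_idx)
    train_idx = [i for i in range(n) if i not in val_set]
    return train_idx, val_idx


# Assumed toy data: 10 classes with 5,000 samples each.
labels = [i % 10 for i in range(50000)]
train_idx, val_idx = stratified_split(labels, val_size=1000, seed=0)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

With perfectly balanced classes as in this toy data, every class gets the same quota regardless of the seed. If scikit-learn is already a dependency, `train_test_split(..., stratify=labels)` from `sklearn.model_selection` should give the same kind of split without custom code.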