Any progress on making the training batchable?

Has anyone managed to get the model to converge (both stage 1 and stage 2) with a batch size larger than 1?
If not, what are the blockers?


In my experience, stage 1 seems to be unstable with a batch size larger than 1.
Stage 2, though, should be achievable given a good collate function and proper masking.
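
To make the stage 2 idea concrete, here is a rough sketch of what I mean by a collate function plus masking. It is plain Python and not code from this repo: the names `collate`, `masked_mean`, and `PAD_VALUE` are made up for illustration, and I am assuming each sample is a variable-length sequence with one loss value per step.

```python
from typing import List, Sequence, Tuple

PAD_VALUE = 0.0  # hypothetical padding value, an assumption for this sketch


def collate(batch: Sequence[Sequence[float]]) -> Tuple[List[List[float]], List[List[int]]]:
    """Pad variable-length sequences to the longest one and build a 0/1 mask."""
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [PAD_VALUE] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real step, 0 = padding
    return padded, mask


def masked_mean(per_step_loss: Sequence[Sequence[float]],
                mask: Sequence[Sequence[int]]) -> float:
    """Average the loss over real (unmasked) steps only."""
    total = sum(
        loss * m
        for loss_row, mask_row in zip(per_step_loss, mask)
        for loss, m in zip(loss_row, mask_row)
    )
    count = sum(m for mask_row in mask for m in mask_row)
    return total / max(count, 1)  # guard against an all-padding batch
```

The point is that padded positions never contribute to the loss, so a batch of size N should behave like N separate samples as far as the averaged loss goes. In a real setup the same mask would also have to reach any attention or normalization layers that see the padded positions.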

Any comments on this are more than welcome, thanks!