After merging #80, further optimizations can possibly be applied:
- Evaluate whether the attn_mask (which masks out padded inputs and the cls token) is necessary for training
- Evaluate whether we can change the feed-forward dimension back to 512 (as in torch.v1)
- Try to implement `torch.compile` for deployment (probably not working due to variable input shapes) and for preprocessing