Indeed. Potentially, these could work better than dropping a timeseries transformer into the current setup. Nevertheless, it's worth implementing each of the variants, i.e. consider the following:
- Simple causal transformer to replace the current GRU
- A specialised 'timeseries transformer' instead
- Overhaul the setup a little bit and use DT
- Same as the previous option, but with ODT instead of DT
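
For the first option, a rough sketch of what a causal transformer standing in for the GRU might look like in PyTorch. Everything here is an assumption for illustration: the class name `CausalTransformer`, the batch-first `(batch, time, features)` layout, the learned positional embedding, and the hyperparameters are all hypothetical and not taken from the current setup.

```python
import torch
import torch.nn as nn


class CausalTransformer(nn.Module):
    """Hypothetical stand-in for a batch-first GRU.

    Maps (batch, time, input_size) -> (batch, time, d_model), where the
    output at step t depends only on inputs at steps <= t.
    """

    def __init__(self, input_size, d_model=64, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.in_proj = nn.Linear(input_size, d_model)
        # Learned positional embedding (an assumption; sinusoidal would also work).
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4 * d_model,
            dropout=0.0,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        h = self.in_proj(x) + self.pos_emb(positions)
        # Additive mask: -inf above the diagonal blocks attention to future steps,
        # 0 on and below the diagonal leaves past and present steps visible.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device),
            diagonal=1,
        )
        return self.encoder(h, mask=mask)


if __name__ == "__main__":
    model = CausalTransformer(input_size=8)
    out = model(torch.randn(2, 10, 8))
    print(out.shape)  # torch.Size([2, 10, 64])
```

Unlike a GRU, this returns no recurrent hidden state, so any code that carries a hidden state between steps would need to keep a window of past inputs instead (sequences longer than `max_len` are not handled here).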