Using self-play, create and train the model:
- First, randomly initialise the model's weights
- In each epoch, play a batch of games guided by MCTS, recording every state visited along with the model's predictions and the final rewards
- At the end of the epoch, retrain the model on the recorded data: the target for each state is either the final game reward (Monte Carlo style) or the model's predicted value of the next state (temporal-difference bootstrapping)
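The loop above can be sketched as follows. This is a minimal illustration, not a real implementation: the game is a hypothetical toy random walk, the model is a single linear weight, and the MCTS-guided move selection is stubbed out with random moves. It uses the Monte Carlo variant, where every recorded state targets the final game reward.

```python
import random

class ToyModel:
    """Minimal value model: predicts a reward from a single numeric state."""
    def __init__(self):
        # Step 1: random initialisation
        self.w = random.uniform(-0.1, 0.1)

    def predict(self, state):
        return self.w * state

    def train(self, states, targets, lr=0.01):
        # One squared-error gradient step per recorded example
        for s, t in zip(states, targets):
            error = self.predict(s) - t
            self.w -= lr * error * s

def play_game(model, length=5):
    """Stand-in for an MCTS-guided game: moves are random here.
    Returns the visited states and the final reward."""
    state, states = 0, []
    for _ in range(length):
        states.append(state)
        state += random.choice([-1, 1])
    return states, (1.0 if state > 0 else -1.0)

def train_by_self_play(epochs=10, games_per_epoch=20):
    model = ToyModel()
    for _ in range(epochs):
        all_states, all_targets = [], []
        # Step 2: play games and record states and rewards
        for _ in range(games_per_epoch):
            states, reward = play_game(model)
            all_states.extend(states)
            # Monte Carlo target: every state predicts the final reward.
            # (The TD alternative would target model.predict(next_state).)
            all_targets.extend([reward] * len(states))
        # Step 3: retrain at the end of the epoch
        model.train(all_states, all_targets)
    return model
```

In a real system the random move selection would be replaced by an MCTS search that uses `model.predict` to evaluate leaf positions, and the linear model by a neural network.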