
Q-Learning for Tic Tac Toe Game

To generate the Q-Learning Table in CSV format for the Tic Tac Toe game, run:

pip install -r requirements.txt
python q_learn.py

Q-Learning Table output example:

State         A1    A2    A3     A4    A5    A6    A7    A8    A9
o-- --x ox-   -     -9.1  -9.1   -7.4  -9.1  -     -     -     -9.1
-x- oox -x-   7.2   -     -0.18  -     -     -     7.12  -     -0.18

where:
State - the state of the 3x3 board, described by a 9-character string (shown with spaces between board rows), where '-' marks an empty cell and 'x' or 'o' an occupied one; for example, 'oxx -ox --o' encodes a board with 'o' in cells 1, 5 and 9 and 'x' in cells 2, 3 and 6;
A1, ..., A9 - the agent's rewards for moving to cells 1-9 in the current State; the higher the value, the more preferable the action.

The actions correspond to cells of the gameboard:

| 1 | 2 | 3 |
-------------
| 4 | 5 | 6 |
-------------
| 7 | 8 | 9 |
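
For illustration, a state string can be decoded into this layout as follows (the helper below is a sketch, not part of the repository):

# Hypothetical helper, not from q_learn.py: render a state string
# such as 'oxx -ox --o' in the 3x3 layout shown above.
def print_board(state):
    cells = state.replace(" ", "")  # drop the separators between board rows
    for row in range(3):
        print("| " + " | ".join(cells[3 * row : 3 * row + 3]) + " |")

print_board("oxx -ox --o")
# | o | x | x |
# | - | o | x |
# | - | - | o |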

Two files are generated for convenience: q-table-x.csv and q-table-o.csv, holding the X-player's and O-player's moves respectively. They may be joined, because the files contain non-overlapping states.
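
For instance, assuming the column layout from the example output above, the two files could be joined with pandas:

import pandas as pd

# Load both per-player tables (column layout as in the example above).
q_x = pd.read_csv("q-table-x.csv")
q_o = pd.read_csv("q-table-o.csv")

# The states in the two files do not overlap, so plain concatenation
# produces one combined table.
q_all = pd.concat([q_x, q_o], ignore_index=True)
q_all.to_csv("q-table.csv", index=False)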

Details

The core of the algorithm is the Bellman update equation:

Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (reward + gamma * max(Q[next_state]))

where:
alpha - the learning rate, which determines to what extent newly acquired information overrides old information (0..1);
gamma - the discount factor, which determines the importance of future rewards (0..1);
reward - the reward for the move that changes state to next_state.
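
A minimal sketch of this update in Python (the dict-of-dicts Q-table layout and the function name are illustrative, not taken from q_learn.py):

# Illustrative only: Q maps state -> {action: value}.
def q_update(Q, state, action, reward, next_state, alpha, gamma):
    # Best value reachable from next_state; 0.0 for terminal or unseen states.
    next_best = max(Q.get(next_state, {}).values(), default=0.0)
    old = Q.setdefault(state, {}).get(action, 0.0)
    Q[state][action] = (1 - alpha) * old + alpha * (reward + gamma * next_best)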

This basic algorithm is tuned in the following ways.

The more training, the more accurate the Q-Learning Table

The tuned algorithm doesn't use constant alpha and gamma. They are calculated as:

alpha = gamma = min(i / N, 0.9) 

where i is the current training iteration and N is the total number of training cycles.
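
In a training loop this schedule might look like the sketch below (N and the loop body are placeholders):

N = 100000  # total training cycles (placeholder value)

for i in range(N):
    # Both coefficients grow with training progress, capped at 0.9.
    alpha = gamma = min(i / N, 0.9)
    # ... play one training game and update Q with these coefficients ...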

A minus sign is applied to max(Q[next_state])

The update formula is corrected to:

Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (reward + gamma * -max(Q[next_state]))

because the next state is evaluated from the opponent's move. Whatever benefits the opponent is a penalty for the current player, and a penalty is a negative value, hence the minus sign.
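
In code, the corrected update only flips the sign of the opponent's best value (same illustrative Q-table layout as in the sketch above):

def q_update_vs_opponent(Q, state, action, reward, next_state, alpha, gamma):
    # next_state is a position where the opponent moves.
    opponent_best = max(Q.get(next_state, {}).values(), default=0.0)
    old = Q.setdefault(state, {}).get(action, 0.0)
    # The opponent's gain is the current player's penalty, hence the minus sign.
    Q[state][action] = (1 - alpha) * old + alpha * (reward + gamma * -opponent_best)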

Higher rewards

In the general case, penalty = -1 and reward = 0 are used. The tuned algorithm instead uses:

reward = max_step_count * penalty 

where max_step_count = 9.

The Q-Learning Table distinguishes winning lines better when the terminal reward propagates back to the first move of the game without being washed out: nine moves at a -1 penalty are fully compensated by a reward of magnitude 9.
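
As a worked check of the numbers (constant names here are illustrative):

PENALTY = -1        # per-move penalty
MAX_STEP_COUNT = 9  # a tic-tac-toe game lasts at most nine moves
REWARD = MAX_STEP_COUNT * PENALTY  # magnitude 9, per the formula above

# Nine moves at -1 each accumulate to -9, so a terminal reward of
# magnitude 9 still dominates after propagating back to the first move.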
