To generate a Q-Learning Table in CSV format for the Tic Tac Toe game, run:

    pip install -r requirements.txt
    python q_learn.py

Q-Learning Table output example:
| State | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 |
|---|---|---|---|---|---|---|---|---|---|
| o-- --x ox- | - | -9.1 | -9.1 | -7.4 | -9.1 | - | - | - | -9.1 |
| -x- oox -x- | 7.2 | - | -0.18 | - | - | - | 7.12 | - | -0.18 |
where:
State - the state of the 3x3 board, described by a 9-character string with '-' for an empty cell and 'x' or 'o' for an occupied one; in the examples the three board rows are separated by spaces. For example, the board state for the image above is 'oxx -ox --o';
A1, ..., A9 - the agent's rewards for moving to cells 1-9 in the current State; the higher the value, the more preferable the action.
The actions on the gameboard are:

    | 1 | 2 | 3 |
    -------------
    | 4 | 5 | 6 |
    -------------
    | 7 | 8 | 9 |
Two files are generated for convenience: q-table-x.csv and q-table-o.csv, for the X-player's and the O-player's moves respectively. They may be joined, because the files contain disjoint sets of states.
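As a minimal sketch, the tables can be consumed with pandas (using pandas here is an assumption, not part of the project; the State and A1..A9 column names follow the example above):

```python
import pandas as pd

# The two generated files contain disjoint sets of states,
# so a plain concatenation joins them safely.
q_table = pd.concat([pd.read_csv("q-table-x.csv"),
                     pd.read_csv("q-table-o.csv")]).set_index("State")

# Pick the best legal move for a board state. Occupied cells carry
# no value ('-' in the example above), so they are coerced to NaN
# and skipped by idxmax().
state = "-x- oox -x-"
values = pd.to_numeric(q_table.loc[state], errors="coerce")
print(f"best move for {state!r}: {values.idxmax()}")  # e.g. 'A1'
```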
The core of the algorithm is the Bellman equation:

    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (reward + gamma * max(Q[next_state]))
where:
alpha - the learning rate; determines to what extent newly acquired information overrides old information (0..1);
gamma - the discount factor; determines the importance of future rewards (0..1);
reward - the reward for the move that changes state to next_state.
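As a sketch, the update can be written directly in Python; the dictionary-of-dictionaries storage for Q and the function name are illustrative assumptions, not necessarily the project's actual code:

```python
def q_update(Q, state, action, reward, next_state, alpha, gamma):
    """One Bellman update. Q maps state -> {action: value}."""
    # Best value achievable from the next state; 0.0 if never visited.
    next_best = max(Q[next_state].values()) if Q.get(next_state) else 0.0
    old_value = Q.setdefault(state, {}).get(action, 0.0)
    Q[state][action] = (1 - alpha) * old_value \
                       + alpha * (reward + gamma * next_best)
```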
This algorithm is tuned: the tuned variant does not use constant alpha and gamma. Instead, they are calculated as

    alpha = gamma = min(i / N, 0.9)

where i is the current training iteration and N is the total number of training cycles.
The update formula is also corrected to

    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (reward + gamma * -max(Q[next_state]))

because next_state belongs to the opponent's move: the opponent's benefit is the current player's penalty, so the opponent's best value enters with a negative sign.
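Combining the schedule and the sign correction, a sketch of the tuned update (same assumed storage layout as above):

```python
def tuned_q_update(Q, state, action, reward, next_state, i, N):
    """Tuned Bellman update: growing alpha/gamma, negated opponent value."""
    alpha = gamma = min(i / N, 0.9)  # both grow with training progress
    # next_state is evaluated from the opponent's perspective, so the
    # opponent's best value counts against the current player.
    opponent_best = max(Q[next_state].values()) if Q.get(next_state) else 0.0
    old_value = Q.setdefault(state, {}).get(action, 0.0)
    Q[state][action] = (1 - alpha) * old_value \
                       + alpha * (reward + gamma * -opponent_best)
```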
In the general case, penalty = -1 and reward = 0 are used. The tuned algorithm instead uses

    reward = max_step_count * |penalty|

where max_step_count = 9, i.e. a winning reward of 9. The Q-Learning Table distinguishes winning lines better when the reward is propagated back to the first move of the game without being blurred: nine moves with a -1 penalty each are fully compensated by the winning reward.
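In code, this reward scheme reduces to a pair of constants (the names are illustrative):

```python
MAX_STEP_COUNT = 9   # a Tic Tac Toe game lasts at most nine moves
PENALTY = -1         # charged for every move
WIN_REWARD = MAX_STEP_COUNT * -PENALTY  # = 9

# Nine moves at -1 each sum to -9; the +9 winning reward compensates
# them, so the win signal survives propagation back to the game's
# first move.
```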