
Add random selfplay#57

Open
vwxyzjn wants to merge 3 commits into master from new-rsp

Conversation

@vwxyzjn
Collaborator

@vwxyzjn vwxyzjn commented Feb 5, 2022

Continue from #35

@vwxyzjn vwxyzjn mentioned this pull request Feb 5, 2022
@vwxyzjn
Collaborator Author

vwxyzjn commented Feb 7, 2022

https://wandb.ai/gym-microrts/gym-microrts/runs/3k4i5p4y?workspace=user-costa-huang tracks the run.
Surprisingly, just playing against past selves is enough to produce a SOTA bot, as shown below.

image

In comparison, playing against the latest self performs much worse, as shown below:

image
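
For context, here is a minimal sketch of the past-selves idea, assuming a pool of frozen policy snapshots sampled uniformly at random (the `OpponentPool` class and its methods are hypothetical illustrations, not the code in this PR):

```python
import copy
import random


class OpponentPool:
    """Keeps frozen snapshots of past policies and samples one at random.

    Illustrates "random selfplay": each rollout faces a randomly chosen
    past self instead of always facing the latest policy.
    """

    def __init__(self, max_size=100):
        self.snapshots = []
        self.max_size = max_size

    def add(self, policy):
        # Store a deep copy so later gradient updates do not leak into the pool.
        self.snapshots.append(copy.deepcopy(policy))
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)

    def sample(self):
        return random.choice(self.snapshots)


# Usage sketch:
# pool.add(agent)            # snapshot the learner every N updates
# opponent = pool.sample()   # opponent for the next rollout
```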

@vwxyzjn vwxyzjn requested a review from kachayev February 7, 2022 01:47
Contributor

@kachayev kachayev left a comment


Overall this makes sense, and I can see why playing against only the latest version leads to some sort of overfitting (I guess). I'm not sure how well this scales with the number of historical versions, but as the simplest approach it sounds perfectly reasonable.

@vwxyzjn
Collaborator Author

vwxyzjn commented Feb 9, 2022

Thanks for reviewing, @kachayev!

I just discovered a problem with this implementation: we are only training the agent that starts from the top left of the map, so when we randomly sample a self from the past, that self was never trained to start from the bottom right. As a result, we are essentially training an agent to play against a random player... I will need to fix this by playing with p1_idx and p2_idx.

@kachayev
Contributor

kachayev commented Feb 9, 2022

Oh, that's a really good point! Should this be part of the environment settings, like a random placement of opponents? It should be simple to add; we just need to be careful with flipping player ids in the observations.
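
For illustration, a rough sketch of what flipping the player perspective could look like, assuming a grid observation of shape (H, W, C) where two channels mark unit ownership for player 1 and player 2 (the channel indices below are hypothetical, not gym-microrts' actual layout):

```python
import numpy as np

# Hypothetical channel indices for "owned by player 1" / "owned by player 2".
P1_OWNER_CHANNEL = 1
P2_OWNER_CHANNEL = 2


def flip_perspective(obs: np.ndarray) -> np.ndarray:
    """Present the board from the opposite player's point of view.

    Rotates the map 180 degrees and swaps the ownership channels, so a policy
    trained from one starting corner can be reused from the other.
    """
    flipped = np.rot90(obs, k=2, axes=(0, 1)).copy()
    flipped[..., [P1_OWNER_CHANNEL, P2_OWNER_CHANNEL]] = (
        flipped[..., [P2_OWNER_CHANNEL, P1_OWNER_CHANNEL]]
    )
    return flipped
```

Actions produced from the flipped view would of course need the inverse transform before being sent back to the environment.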

@vwxyzjn
Collaborator Author

vwxyzjn commented Feb 9, 2022

Oh, that's a really good point! Should this be part of the environment settings, like a random placement of opponents?

I considered something like this but abandoned the idea because it made training twice as slow. At first we thought it was a good kind of slow, because maybe the agent would learn something general, such as moving towards the enemy instead of "just going to the bottom right".

However, it turned out the agent just learned "going to the bottom right" and "going to the top left", which is not that exciting from a generalization standpoint and therefore kind of a waste of compute.

Ultimately this is something we should do (at least offer an option for randomized starting locations), but it's probably not a big priority right now.

@vwxyzjn
Collaborator Author

vwxyzjn commented Feb 9, 2022

I have done some index manipulation:

p1_idxs = [1, 3, 5, 7, 9, 11, 12, 14, 16, 18, 20, 22]
p2_idxs = [0, 2, 4, 6, 8, 10, 13, 15, 17, 19, 21, 23]

Now the agent issues actions for the player starting from the bottom right in the 1st, 3rd, 5th, 7th, 9th, and 11th environments, and for the player starting from the top left in the rest.
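
Roughly, the routing could look like the sketch below, where the learner controls the indices in p1_idxs and the sampled past self controls the indices in p2_idxs (the function and variable names other than the index lists are hypothetical; the actual wiring is in ppo_gridnet_rs.py):

```python
import numpy as np

p1_idxs = [1, 3, 5, 7, 9, 11, 12, 14, 16, 18, 20, 22]  # indices controlled by the learner
p2_idxs = [0, 2, 4, 6, 8, 10, 13, 15, 17, 19, 21, 23]  # indices controlled by a past self


def split_obs(obs):
    """Split the stacked per-player observations into learner / opponent views."""
    return obs[p1_idxs], obs[p2_idxs]


def merge_actions(learner_actions, opponent_actions, num_envs=24):
    """Interleave both players' actions back into environment order."""
    actions = np.empty((num_envs,) + learner_actions.shape[1:], dtype=learner_actions.dtype)
    actions[p1_idxs] = learner_actions
    actions[p2_idxs] = opponent_actions
    return actions
```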

@vwxyzjn
Collaborator Author

vwxyzjn commented Feb 9, 2022

Interestingly, the current runs suggest that playing against an almost-random player (red line) is still better than playing against the latest self (blue lines). The experiments were run with three random seeds each. I am going to run the experiments with the corrected ppo_gridnet_rs.py.

image

@vwxyzjn
Collaborator Author

vwxyzjn commented Feb 19, 2022

When using the corrected implementation, random selfplay performs no better than naive / latest selfplay:

image
