README.md

Learning to play Tetris with Monte Carlo Tree Search and Temporal Difference Learning

My personal project for the love of Tetris.

See the agent in action here!

(Warning: the code is a hot mess riddled with inconsistent styles and unclear naming. Read it at your own risk.)

Introduction

This project started out as an exercise in applying Deep Q-Learning to Tetris, one of my favourite puzzle games of all time. However, I soon realized that it was almost impossible to train an agent to perform anywhere near human level, because the rewards in Tetris are sparse and depend on long sequences of actions (imagine how many actions you need to perform to clear even one line!). Around that time, AlphaGo beat Lee Sedol in dominating fashion, which reignited my hopes for a better agent. I also believed that a model-based approach should perform significantly better than model-free approaches (Q-learning, policy gradients, etc.). So here it is: the MCTS-TD agent, inspired by AlphaGo and specializing in Tetris.

How is this related to AlphaGo/Zero?

At the core of AlphaGo, the agent searches the game tree based on the upper confidence bound applied to trees (UCT). Unlike vanilla MCTS, which has to simulate the entire game to estimate the value of the current state, AlphaGo uses a neural network to infer the value (winning probability) and the policy (likely next moves) of the current state, and uses them to calculate the upper confidence bound for each move. In my agent, I use exponential moving averages and variances, with initial values from the neural network, to calculate the upper confidence bound based on the central limit theorem, which I believe is more appropriate for single-player games with unbounded rewards. Another difference is that AlphaGo uses the final score of each game as the training target, while this agent uses a bootstrapped target, hence Temporal Difference Learning.
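To make the idea concrete, here is a minimal sketch of a confidence bound built from exponential moving statistics, plus a one-step bootstrapped target. It is not the repository's code: the `NodeStats` class, the smoothing factor `alpha`, the multiplier `z`, the discount `gamma`, and the choice to count the network prior as one pseudo-visit are all assumptions made for illustration, and the actual update rule and constants may differ.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeStats:
    """Running value statistics for one child node (hypothetical structure)."""
    mean: float      # running estimate of the return, seeded by the network
    var: float       # running estimate of the return variance, seeded by the network
    visits: int = 1  # assumption: the network prior counts as one pseudo-visit

    def update(self, ret: float, alpha: float = 0.1) -> None:
        """Fold one backed-up return into the exponential moving statistics."""
        delta = ret - self.mean
        self.mean += alpha * delta
        # Standard incremental form of an exponentially weighted variance.
        self.var = (1.0 - alpha) * (self.var + alpha * delta * delta)
        self.visits += 1

    def ucb(self, z: float = 1.96) -> float:
        """Mean plus z standard errors, a normal (CLT-style) approximation."""
        return self.mean + z * math.sqrt(self.var / self.visits)


def select_child(children: Sequence[NodeStats], z: float = 1.96) -> int:
    """Return the index of the child with the largest upper confidence bound."""
    return max(range(len(children)), key=lambda i: children[i].ucb(z))


def td_target(reward: float, next_value: float, gamma: float = 1.0) -> float:
    """One-step bootstrapped target: immediate reward plus the estimated next value."""
    return reward + gamma * next_value
```

Because the bound is built from an estimated mean and variance rather than from a fixed reward range, it does not require the rewards to lie in a known interval, which is the property the paragraph above argues for.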

How is this different from other Tetris Bots?

Most of the superhuman Tetris bots seen on YouTube or in other games use heuristics (number of holes, height of each column, smoothness of the surface, etc.) to model the reward. Using heuristics can substantially simplify the problem, since the rewards become much denser (you get a reward for each piece you drop) and are highly correlated with the final score. However, such handcrafted rewards can bias your agent toward the target you set (minimizing holes in the board or the height of the columns) instead of the true target (clearing lines). Furthermore, such heuristics do not generalize beyond Tetris, meaning that you have to handcraft rewards for each game you want your bot to play. This agent differs from those bots in that it can be applied to any environment satisfying certain requirements.
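For contrast, a heuristic reward of the kind those bots use might look like the sketch below. It is not taken from this project or from any particular bot: the board layout (a list of rows with the top row first, nonzero meaning filled), the function names, and the weights are all assumptions chosen for illustration.

```python
from typing import List

Board = List[List[int]]  # assumption: board[0] is the top row, nonzero = filled


def column_heights(board: Board) -> List[int]:
    """Height of each column, measured from the bottom of the board."""
    rows = len(board)
    heights = []
    for c in range(len(board[0])):
        height = 0
        for r in range(rows):
            if board[r][c]:
                height = rows - r  # first filled cell from the top
                break
        heights.append(height)
    return heights


def count_holes(board: Board) -> int:
    """Count empty cells that have at least one filled cell above them."""
    holes = 0
    for c in range(len(board[0])):
        seen_filled = False
        for r in range(len(board)):
            if board[r][c]:
                seen_filled = True
            elif seen_filled:
                holes += 1
    return holes


def bumpiness(board: Board) -> int:
    """Sum of absolute height differences between adjacent columns."""
    h = column_heights(board)
    return sum(abs(a - b) for a, b in zip(h, h[1:]))


def heuristic_reward(board: Board, w_holes: float = -1.0,
                     w_height: float = -0.5, w_bump: float = -0.2) -> float:
    """Weighted sum of handcrafted features; the weights are illustrative only."""
    return (w_holes * count_holes(board)
            + w_height * max(column_heights(board))
            + w_bump * bumpiness(board))
```

Every term here encodes a belief about what a good board looks like, which is exactly the bias described above. The agent in this project is instead rewarded only for the lines it clears.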

Prerequisite

torch==1.0.0
numpy==1.14.2
numba==0.39.0
tables==3.4.2
matplotlib==2.1.2
tensorflow==1.12.0 (not supported anymore, switch to PyTorch instead)

You'll also need the Tetris environment from here, and you'll need to modify the sys.path.append call in play.py so that it includes the path to pyTetris.
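The path change amounts to something like the following sketch. The directory used here is a placeholder, not the location the project expects, so substitute the path of your own pyTetris checkout.

```python
import os
import sys

# Hypothetical location of the pyTetris checkout; replace with your own path.
PYTETRIS_PATH = os.path.expanduser("~/pyTetris")

# Make the environment importable without adding duplicate entries.
if PYTETRIS_PATH not in sys.path:
    sys.path.append(PYTETRIS_PATH)
```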

How to run it?

play.py: script for self-play or manual play
train.py: script for training the neural network
tools/plot_score.py: script for plotting the score curve
tools/plot_loss.py: script for plotting the loss curve
tools/replay.py: GUI for replaying games

The default routine is written in cycle.sh. If you are unsure what to do, simply run ./cycle.sh and things should get going.

Results

In the default routine (cycle.sh), each iteration consists of 100 games of self-play with 300 MCTS simulations per move to generate the training data, and 1 benchmark game with 1500 MCTS simulations per move to test the performance of the agent.

The left graph shows normal self-play (300 simulations) and the right graph shows benchmark self-play (1500 simulations). As a baseline, a vanilla MCTS agent (no neural network) averages about 7 lines with 300 simulations per move.

As the graphs show, the agent was still improving after 13 iterations (1300 games). However, one iteration takes more than 10 hours on my potato of a computer, so I had to terminate training early. To the best of my knowledge, this result beats all non-heuristic agents.

GitHub - hrpan/tetris_mcts: MCTS project for Tetris
