
Investigating Kasparov Chess Moves with Scikit-Learn


I am practicing my machine learning (ML) skills and became curious whether chess moves could be predicted, somehow.  Crazy, right? ;-) tl;dr: chess.py

For a previous project, I collected 547 Garry Kasparov games in PGN format.  That collection yields 21,088 Kasparov-only moves, and for this project I want to look at how prediction accuracy changes as the number of moves considered increases.

My hypothesis is that accuracy will drop sharply as the number of moves increases, since the number of unique moves (and thus the complexity) grows over time.  So I would be happily surprised, if not amazed, if a prediction accuracy of even 20% were possible.

So how do you represent a full game's board positions in a format that is digestible by a machine learning algorithm?  And how do you represent the target move to predict?  What I chose to do is flatten the Forsyth-Edwards Notation (or FEN) of each position into a comma-separated value (or CSV) row.  Each piece is encoded as an arbitrary number and unoccupied squares are zero.  I chose white p=10, r=11, n=12, b=13, q=14, k=15 and black P=20, R=21, N=22, B=23, Q=24, K=25.
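
Here is a minimal sketch of that flattening, assuming the python-chess library (the original code may well parse FEN strings directly; the function name flatten_board is mine):

import chess

# Piece values per the encoding above: white pawn=10 ... king=15;
# black pieces add 10 to the white value.
PIECE_CODES = {
    chess.PAWN: 10, chess.ROOK: 11, chess.KNIGHT: 12,
    chess.BISHOP: 13, chess.QUEEN: 14, chess.KING: 15,
}

def flatten_board(board, move_number):
    row = [move_number]
    for square in chess.SQUARES:  # a1 through h8, 64 squares
        piece = board.piece_at(square)
        if piece is None:
            row.append(0)  # unoccupied squares are zero
        else:
            offset = 0 if piece.color == chess.WHITE else 10
            row.append(PIECE_CODES[piece.piece_type] + offset)
    return row

print(flatten_board(chess.Board(), 1))  # reproduces the example row below, minus the target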

For the target, I chose to use the standard algebraic notation (or SAN) of the moves.   (But as of this writing, I wonder if using alternate chess move notations would change prediction accuracy. Hmmm…)

Also, I add the number of the move to the beginning of the list.  So for the initial game setup and an initial move we have:

1,11,12,13,14,15,13,12,11,10,10,10,10,10,10,10,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,20,20,20,20,20,20,21,22,23,24,25,23,22,21,e4

For a full game, we have rows 1, 3, 5, etc. when Kasparov is the white player and 2, 4, 6, etc. when he plays black.  Across every game in the collection, with every move considered, there are 21,088 rows (as mentioned above).
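
As a sketch of how such rows could be generated from a PGN game with python-chess (an assumption on my part; reading the PGN headers to determine which color Kasparov plays is elided here):

import chess.pgn

def game_rows(game, kasparov_is_white):
    # Replay the game, emitting one row per Kasparov move: the flattened
    # board before the move, plus the SAN of the move as the target.
    rows = []
    board = game.board()
    for ply, move in enumerate(game.mainline_moves(), start=1):
        white_to_move = (ply % 2 == 1)
        if white_to_move == kasparov_is_white:
            rows.append(flatten_board(board, ply) + [board.san(move)])
        board.push(move)
    return rows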

With this CSV data in hand we can analyze it with various ML algorithms.

The first thing is to break the data into training and testing sets.  I do this with the scikit-learn train_test_split() function.  Next is to train different models and check out their accuracy!  (Please see the code for details.)
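
For instance, a hedged sketch of the split, assuming the CSV has been loaded with pandas (the file name and variable names here are mine, not necessarily the original code's):

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('kasparov.csv', header=None)  # hypothetical file name
X = data.iloc[:, :-1]  # move number plus the 64 encoded squares
y = data.iloc[:, -1]   # target: the SAN move
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)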

The first model I tried is k-nearest neighbors (or KNN).  Here is a snip of the code for it:

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 nearest neighbors
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

This yields 0.12101669195751139 for all moves in a game: a 12% prediction accuracy.  I expected this!  And looking at the accuracy graph for k = 1 to 25, the line drops steadily as k increases.  Again, I expected this…
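
The sweep behind that graph looks something like this (a sketch; it assumes the X/y split from above):

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

k_scores = []
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    k_scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

plt.plot(range(1, 26), k_scores)
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()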

[Figure: KNN prediction accuracy for k = 1 to 25]

The next model I tried is Multinomial Naive Bayes.  This results in a 7.2% prediction accuracy.  Oof.

How about a support vector machine?  That’s a bit better, with a 15.4% prediction accuracy.

OK, lastly, let’s look at a decision tree: for all moves, that weighs in at a whopping 17.1%!  Woo.
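
For reference, here is a hedged sketch covering those three classifiers.  Each uses scikit-learn defaults, which is an assumption on my part; the original code may set hyperparameters differently:

from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Fit each model on the same split and report its test accuracy.
for model in (MultinomialNB(), SVC(), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(type(model).__name__, metrics.accuracy_score(y_test, y_pred))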

What does the accuracy graph look like as a function of the number of moves considered?  For example, what is the prediction accuracy when considering only the first move of each game?  What about as that number increases?

Here is a snip that performs that analysis (given the function definitions in the code):

import matplotlib.pyplot as plt

k_range = range(1, 15)  # consider only the first k moves of each game
scores = []
for k in k_range:
    create_csv(k)  # rebuild the CSV restricted to the first k moves
    X_train, X_test, y_train, y_test = train()
    accuracy = decision_tree(X_train, X_test, y_train, y_test)
    print(accuracy)
    scores.append(accuracy)

plt.plot(k_range, scores)
plt.xlabel('Moves')
plt.ylabel('Accuracy')
plt.show()

And here is the graph:

[Figure: decision-tree prediction accuracy vs. number of moves considered]

Dismal!  ;-)

