Web Scraping and Visualizing Chess Data

 3 years ago
source link: https://www.tuicool.com/articles/A3umQzm

The motivation

I’ve been learning about web scraping and data visualization mainly through articles published on this site, and while I’ve found dozens of articles that give a quick intro to web scraping, very few go beyond that. I hope to provide suggestions on how to find data within a site’s html source code, how to scrape multiple pages and collect data, and what to do once one actually has the data she wants.

The topic

I started playing chess in November of 2018 and since then have played 1,566 games of varying speeds for an average of about 5 a day. I play using chess.com’s app, which tracks some basic stats about each user in addition to maintaining a full back history of every game played.

Screenshot of Chess.com’s stats page

While chess.com does an incredible job with its UX, the stats page leaves something to be desired. Since I’ve been looking for projects to practice and learn more about web scraping and data visualization, I decided to scrape statistics about my past games and visualize some of the insights gathered.

The Website Structure

The “Archive” page lets a user look through all of her past games should she want to review or analyze them. Looking at it from a web scraping perspective, one sees various information about each game stored neatly in rows including:

The type of game played

The user’s piece color, name, elo, and country

The opponent’s piece color, name, elo, and country

The result of the game (win or loss)

The number of moves played in the game

The date of the game

Screenshot of Chess.com’s game archive

The first thing to do is inspect the page (right click → Inspect) to view its underlying HTML source code. Clicking the selector tool in the top left and then clicking the game icon points me directly to the code for this part of the website.

To collect this data, I use the Requests library to grab the site’s HTML code and the BeautifulSoup library to parse through it and get the data I want. The general idea would be something like the following:

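import re
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.chess.com/games/archive/eono619')
text = page.text
b = BeautifulSoup(text, 'html.parser')
# Find the first matching span, then extract and clean its text
content = b.find('span', attrs={'class': re.compile("archive-games-game-time")})
content.getText().strip()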

In this snippet of code I:

1. Grab the page with requests

2. Create a BeautifulSoup object from it

3. Use the object’s .find() method to find the HTML element that contains the data I want by passing in its tag (‘span’) and attribute. This yields the following somewhat messy element: “<span class="archive-games-game-time"> 5 min </span>”

4. Finally grab just the text with the .getText() method and clean it with the string method .strip()

Each datapoint has its own unique HTML tags and attributes, but this is the general format for pulling data from a webpage. Since I want all of the games on the page, rather than just a single one, I use BeautifulSoup’s .findAll() method to grab all of the elements on the page with that tag and attribute. For each datapoint I want, I create a helper function like the one below that pulls all of these for a given page.

def pull_speed(Bsoup):
    # Collect the cleaned time-control text of every game on the parsed page
    times = []
    for i in Bsoup.findAll('span', attrs={'class': re.compile("archive-games-game-time")}):
        times.append(i.getText().strip())

    return times
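
Since each helper differs only in its tag and class pattern, the pattern can also be written once and reused. The following is a minimal sketch under that assumption: pull_text and the inline sample_html are hypothetical names introduced for illustration, and only the archive-games-game-time class comes from the page itself.

```python
import re
from bs4 import BeautifulSoup

def pull_text(soup, tag, class_pattern):
    """Return the stripped text of every `tag` whose class matches `class_pattern`."""
    pattern = re.compile(class_pattern)
    return [el.get_text().strip() for el in soup.find_all(tag, attrs={'class': pattern})]

# Hypothetical stand-in for a downloaded archive page
sample_html = (
    '<span class="archive-games-game-time"> 5 min </span>'
    '<span class="archive-games-game-time"> 10 min </span>'
    '<span class="something-else"> ignored </span>'
)
soup = BeautifulSoup(sample_html, 'html.parser')
speeds = pull_text(soup, 'span', 'archive-games-game-time')
print(speeds)
```

Testing a helper against a small inline string like this makes it easy to check the tag and class pattern before pointing it at the live site.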

In order to scrape over multiple pages, I turn to the URL:

https://www.chess.com/games/archive/eono619?gameOwner=my_game&gameTypes%5B0%5D=chess960&gameTypes%5B1%5D=daily&gameType=live&page=2

By changing the page number in the final parameter, I’m able to iterate through the pages. My code for the overall scrape initializes empty lists to store results, iterates through each page and calls the helper functions to scrape the data I want, and then collects everything in a pandas dataframe for analysis.
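
A rough sketch of that loop might look like the following. It is not the exact code from the notebook: page_url, scrape_pages, the fetch and helpers arguments, the assumption that pages are numbered from 1, and the pause between requests are all illustrative choices. Only the URL and its query parameters are taken from above.

```python
import time
from urllib.parse import urlencode

BASE_URL = 'https://www.chess.com/games/archive/eono619'

def page_url(page):
    """Build the archive URL for one page; urlencode escapes the [ ] in the keys."""
    params = [
        ('gameOwner', 'my_game'),
        ('gameTypes[0]', 'chess960'),
        ('gameTypes[1]', 'daily'),
        ('gameType', 'live'),
        ('page', page),
    ]
    return BASE_URL + '?' + urlencode(params)

def scrape_pages(n_pages, fetch, helpers, delay=1.0):
    """Call every helper on every page and collect the results column by column.

    fetch:   hypothetical callable taking a URL and returning a parsed page
    helpers: dict mapping a column name to a function returning a list per page
    """
    results = {name: [] for name in helpers}
    for page in range(1, n_pages + 1):  # assumes page numbering starts at 1
        parsed = fetch(page_url(page))
        for name, helper in helpers.items():
            results[name].extend(helper(parsed))
        time.sleep(delay)  # pause between requests to be polite to the server
    return results
```

Here fetch could be something like lambda url: BeautifulSoup(requests.get(url).text, 'html.parser') and helpers could be {'speed': pull_speed}. Passing the returned dictionary to pandas.DataFrame then gives one column per helper, provided every helper returns exactly one value per game.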

Full notebook and data on my GitHub ( https://github.com/eonofrey/chess_scrape_2 )

My dataframe in the end has 9 columns and 1,566 rows containing everything I need to dive deeper into my games.

The Visualizations

The first plot I made shows how my elo has progressed over time, along with a 30-day moving average to smooth out some of the choppiness inherent to granular data. There’s very clearly an upward trend, which makes me happy to see since it shows I’m actually learning the game despite the occasional rough patch.
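
One way to compute such a moving average is pandas’ time-based rolling window. This is a small sketch on made-up data: the games dataframe, its elo column, and the elo_30d name are assumptions for illustration, not the columns of my actual dataframe.

```python
import pandas as pd

# Hypothetical sample: one row per game, with the date and my rating after it
games = pd.DataFrame(
    {'elo': [800, 900, 1000]},
    index=pd.to_datetime(['2019-01-01', '2019-01-11', '2019-02-15']),
)

# A time-based window needs a sorted DatetimeIndex;
# '30D' averages over the 30 days leading up to each game
games = games.sort_index()
games['elo_30d'] = games['elo'].rolling('30D').mean()
print(games)
```

Calling games[['elo', 'elo_30d']].plot() afterwards draws the raw rating and the smoothed line on the same axes.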

Next, I looked at the different countries I played against using Seaborn’s countplot, a really helpful function that plots the counts of each value of the variable passed into it.

Unfortunately, the number of games I’ve played per country has a very long tail. Passing the order parameter into the countplot function fixes this, since it lets me plot only a select slice of the countries (order = speed_games['opponent_country'].value_counts().iloc[1:25].index). Here I skip the top entry, the US, since it’s massively over-represented in this chart, and keep the next 24 most frequent countries.
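
To make the slicing concrete, here is a minimal sketch on made-up data. The top_countries function and the sample rows are hypothetical; the value_counts().iloc[...].index chain is the same one used above.

```python
import pandas as pd

def top_countries(df, column='opponent_country', skip=1, stop=25):
    """Return the most frequent values of `column`, skipping the first `skip` entries."""
    # value_counts() sorts from most to least frequent, so position 0 is the top country
    return df[column].value_counts().iloc[skip:stop].index

# Hypothetical sample data
speed_games = pd.DataFrame(
    {'opponent_country': ['US'] * 5 + ['IN'] * 3 + ['GB'] * 2 + ['CA']}
)
order = top_countries(speed_games, stop=3)
print(list(order))
```

The resulting index can then be handed to Seaborn, for example sns.countplot(data=speed_games, y='opponent_country', order=order), which draws only the listed countries in that order.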

After this, I decided to process my dataframe a little by grouping on different fields and looking at various summary statistics. I’ve found the two best ways to do this are the .resample() and .groupby() methods. Here I group the data by months and weeks to get a count of games played for each time period and add a horizontal line to show the mean number of games played at each frequency.
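
The two approaches can be sketched side by side on a tiny made-up dataframe. The games frame and its result column are assumptions for illustration; in practice the index would hold the date of each scraped game.

```python
import pandas as pd

# Hypothetical sample: one row per game, indexed by the date it was played
games = pd.DataFrame(
    {'result': [1, 0, 1, 1]},
    index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-20', '2019-02-03']),
)

# resample() buckets a DatetimeIndex by a calendar frequency; empty weeks count as 0
weekly = games.resample('W').size()

# groupby() accepts any key, here the calendar month of each game
monthly = games.groupby(games.index.to_period('M')).size()

print(weekly.mean(), monthly.mean())
```

Note that resample() keeps empty periods while groupby() only returns periods that contain games, which changes the mean. After plotting either series with .plot(kind='bar'), matplotlib’s ax.axhline(weekly.mean()) adds the horizontal mean line.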

In order to visualize how well I play against players of different countries, I needed to do a bit more preprocessing. I first group the data by country and get the number of games played as well as the average of the result (1=win, 0=loss) which equates to a win percentage. I then join the two dataframes on the country column so I’m able to limit my chart to countries I’ve played at least 15 games against.
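
A compact variant of this preprocessing is sketched below. It computes both statistics in a single named aggregation instead of joining two dataframes, so it is an alternative to the steps described rather than a copy of them; the function name, column names, and sample rows are assumptions.

```python
import pandas as pd

def win_rate_by_country(df, min_games=15):
    """Games played and win rate per opponent country, keeping well-sampled countries."""
    stats = df.groupby('opponent_country')['result'].agg(games='count', win_pct='mean')
    return stats[stats['games'] >= min_games].sort_values('win_pct', ascending=False)

# Hypothetical sample data: result is 1 for a win and 0 for a loss
games = pd.DataFrame({
    'opponent_country': ['US', 'US', 'US', 'IN'],
    'result': [1, 0, 1, 1],
})
by_country = win_rate_by_country(games, min_games=2)
print(by_country)
```

The minimum-games filter matters because a country with a single game would otherwise show up with a 0% or 100% win rate.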

Next I investigated whether my win percentage varied by day of the week. Interestingly enough, I seem to play better on the weekend.
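
A day-of-week breakdown like this can be sketched with the index’s day_name() method. The sample games and the win_rate_by_weekday name are hypothetical, and the sketch assumes the dataframe is indexed by game date with a 0/1 result column.

```python
import pandas as pd

DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def win_rate_by_weekday(df):
    """Mean of the 0/1 result column for each day of the week, Monday first."""
    by_day = df.groupby(df.index.day_name())['result'].mean()
    # reindex puts the days in calendar order; days without games become NaN
    return by_day.reindex(DAYS)

# Hypothetical sample: two Saturdays and two Mondays
games = pd.DataFrame(
    {'result': [1, 0, 1, 1]},
    index=pd.to_datetime(['2019-01-05', '2019-01-07', '2019-01-12', '2019-01-14']),
)
by_day = win_rate_by_weekday(games)
print(by_day)
```

Without the reindex step the days would be sorted alphabetically, which makes the weekend pattern harder to spot in a bar chart.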

Finally, I looked into my win percentage with each color. Since white moves first, it has a slight advantage over black and is expected to win more often. My data backs this up, showing that I win about 4% more often with white than with black.

I also looked at the average number of moves played in games I win vs. games I lose. Interestingly enough, games I win last two moves longer on average than games I lose.

There are more aspects of this data I could have explored, but I’ll cut the article here to keep it concise. By carrying out this project, I improved my web scraping and data visualization skills and even ended up with some insights about my chess game as an added benefit. My ultimate hope is that someone uses the lessons and code snippets here to help with their own projects. For those interested, my full script and dataset can be found on my GitHub page: https://github.com/eonofrey/chess_scrape_2

