
A Practical Guide for Exploratory Data Analysis: English Premier League

 3 years ago
source link: https://mc.ai/a-practical-guide-for-exploratory-data-analysis-english-premier-league/


Exploring 2019–2020 season of English Premier League

Data is the fuel of every machine learning or deep learning model. Without data, the models are useless. Before building and training a model, we should explore and understand the data at hand. By understanding, I mean the correlations, structures, distributions, characteristics and trends in the data. A comprehensive understanding of the data is very useful in building a robust and well-designed model, and we can often draw valuable conclusions just by exploring it.

In this post, I will walk through an exploratory data analysis of the English Premier League 2019–2020 season dataset, which is available on Kaggle.

Let’s start by reading the data into a Pandas dataframe:

import numpy as np
import pandas as pd

df_epl = pd.read_csv("../input/epl-stats-20192020/epl2020.csv")
print(df_epl.shape)
(576, 45)

The dataset has 576 rows and 45 columns. To display all the columns, we need to adjust the display.max_columns setting.

pd.set_option("display.max_columns", 45)
df_epl.head()

The output does not fit on the screen, but we can see all the columns by using the scroll bar. The dataset includes the statistics for 288 games. There are 576 rows because each game is represented by two rows, one from the home team's side and one from the away team's side. For instance, the first two rows represent the “Liverpool-Norwich” game.

The first column (“Unnamed: 0”) is redundant so we can just drop it:

df_epl.drop(['Unnamed: 0'], axis=1, inplace=True)
df_epl = df_epl.reset_index(drop=True)

The dataset includes lots of different statistics about games.

  • xG, xGA: Expected goals for team and opponent
  • scored, missed: Goals scored and conceded
  • xpts, pts: Expected and received points
  • wins, draws, losses: Binary variables showing the result of the game
  • tot_goal, tot_con: Total goals scored and conceded from the beginning of the season

There are also basic stats such as shots, shots on target, corner kicks, yellow cards, and red cards. We also have information about the date and time of the games.

Let’s start with days:

df_epl.matchDay.value_counts()

Most of the games are played on Saturdays.
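Because every game appears twice in the data (one home row and one away row), value_counts on the full frame counts each game two times. The ranking of days is unaffected, but the absolute numbers are doubled. Here is a minimal sketch of counting each game once. It uses a small made-up frame that only borrows the h_a and matchDay column names; the rows and the day labels are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): three games, two rows each (home 'h' and away 'a').
toy = pd.DataFrame({
    "h_a": ["h", "a", "h", "a", "h", "a"],
    "matchDay": ["Fri", "Fri", "Sat", "Sat", "Sat", "Sat"],
})

# Counting all rows doubles every game.
rows_per_day = toy["matchDay"].value_counts()

# Keeping only the home rows counts each game exactly once.
games_per_day = toy.loc[toy["h_a"] == "h", "matchDay"].value_counts()

print(rows_per_day)
print(games_per_day)
```

On the real frame, the same filter would be df_epl[df_epl.h_a == 'h'].matchDay.value_counts().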

We can quickly create the standings based on the total number of points collected so far. Because tot_points is cumulative, the maximum value for each team is its most up-to-date total:

df_epl[['teamId','tot_points']].groupby('teamId').max().sort_values(by='tot_points', ascending=False)[:10]

I only displayed the first 10 teams. If you are a football (i.e. soccer) fan, you may have heard of the success of Liverpool dominating the English Premier League this season. Liverpool leads by 25 points.
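The size of the leader's margin can be read directly from the sorted standings. The sketch below shows the idea on a made-up frame with hypothetical teams "A", "B" and "C"; only the teamId and tot_points column names come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): cumulative points after each of two games per team.
toy = pd.DataFrame({
    "teamId": ["A", "A", "B", "B", "C", "C"],
    "tot_points": [3, 6, 1, 4, 0, 1],
})

# tot_points is cumulative, so the per-team maximum is the current total.
standings = (
    toy.groupby("teamId")["tot_points"]
    .max()
    .sort_values(ascending=False)
)

# Gap between the first and the second team in the table.
lead = standings.iloc[0] - standings.iloc[1]

print(standings)
print("Lead of the top team:", lead)
```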

Advancements in technology and data science have brought new stats to football. One relatively new type is “expected” stats, such as expected goals and expected points. Let’s check how close the expected and actual values are. There are different ways to make the comparison. One way is to check the distribution of the difference:

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline

plt.figure(figsize=(10,6))
plt.title("Expected vs Actual Goals - Distribution of Difference", fontsize=18)
diff_goal = df_epl.xG - df_epl.scored
# kdeplot draws the same density curve as the deprecated distplot(hist=False)
sns.kdeplot(diff_goal, color='blue')

It resembles a normal distribution with a mean close to zero. Thus, expected values are generally very close to the actual values, although there are, of course, some exceptions. These exceptions are what make football exciting.

We get a similar distribution with expected and actual points:

The difference between expected and actual points lies between -3 and +3. The tails of the plotted curve extend slightly beyond these limits because the curve is a smoothed density estimate rather than a histogram of the raw values.
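The visual comparison can be complemented with a few summary statistics of the two differences. This is a minimal sketch on made-up numbers; only the column names xG, scored, xpts and pts come from the dataset, and the helper names goal_diff and pts_diff are introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical): four team-game rows.
toy = pd.DataFrame({
    "xG": [1.5, 0.8, 2.1, 0.4],
    "scored": [1, 1, 3, 0],
    "xpts": [1.9, 0.9, 2.4, 0.6],
    "pts": [3, 0, 3, 1],
})

# Expected minus actual, for goals and for points.
diff = pd.DataFrame({
    "goal_diff": toy["xG"] - toy["scored"],
    "pts_diff": toy["xpts"] - toy["pts"],
})

# A mean near zero suggests expected values are unbiased on average;
# the standard deviation shows how far single games deviate.
summary = diff.agg(["mean", "std", "min", "max"])
print(summary)
```

Since points per game are 0, 1 or 3 and expected points lie between 0 and 3, the raw pts_diff values always stay within the interval from -3 to +3.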

I do not know how the expected goals stats are calculated, but they should be somewhat related to shots and shot accuracy. We can check the correlation between expected goals (xG) and some other stats using the corr function of pandas.

df_epl[df_epl.h_a == 'h'][['xG','HS.x','HST.x','HtrgPerc','tot_goal']].corr()

Shots and shots on target are clearly correlated with expected goals. There is also a weak positive correlation between expected goals and the number of goals a team has scored so far in the season.

We can also get an idea about the performance of goalkeepers using expected goals and actual goals. If a team concedes fewer goals than the opponent's expected goals, it indicates that the goalkeeper performs well. On the other hand, if a team concedes more goals than expected, the goalkeeper's performance is not so good.

df_epl['keep_performance'] = df_epl['missed'] / df_epl['xGA']
df_epl[['teamId','keep_performance']].groupby('teamId').mean().sort_values(by='keep_performance', ascending=False)[:5]

Man City concedes 2.22 times as many goals as expected, which is an indication of poor goalkeeper performance. The blame is not only on the keeper; the defensive players also share responsibility in this situation.

On the other hand, Newcastle United and Leicester have an outstanding goalkeeper performance.
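One caveat of averaging per-game ratios is that a single game with a very small xGA can dominate a team's mean. An alternative, shown here only as a sketch and not as the method used above, is to divide the season totals instead. The frame below is made up; teams "A" and "B" and their numbers are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two games per team.
toy = pd.DataFrame({
    "teamId": ["A", "A", "B", "B"],
    "missed": [1, 2, 0, 1],
    "xGA": [0.2, 2.0, 1.0, 1.0],
})

# Approach used in the article: mean of the per-game ratios.
toy["keep_performance"] = toy["missed"] / toy["xGA"]
mean_of_ratios = toy.groupby("teamId")["keep_performance"].mean()

# Alternative: ratio of the totals, which weights games by their xGA.
totals = toy.groupby("teamId")[["missed", "xGA"]].sum()
ratio_of_totals = totals["missed"] / totals["xGA"]

print(pd.DataFrame({
    "mean_of_ratios": mean_of_ratios,
    "ratio_of_totals": ratio_of_totals,
}))
```

In this toy example, team A's low-xGA game inflates its mean of ratios well above its ratio of totals, while team B's two values agree.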

We can also check whether the match day has an effect on a team's performance. Liverpool has only lost 5 points in the season, so let’s check it for the second-placed team, which is Man City.

df_epl[df_epl.teamId == 'Man City'][['pts','matchDay']].groupby('matchDay').agg(['mean','count'])

It seems like Man City does not like Sundays. Their average on Fridays is 0 points, but that comes from a single game, so we cannot draw a reliable conclusion from it.

Let’s have a look at how many goals are scored per game on average. One way is to add the goals scored and conceded in each row and then take the mean:

df_epl['goals']= df_epl['scored'] + df_epl['missed']
df_epl['goals'].mean()
2.7222222222222223

The average is 2.72 goals per game. Home teams usually score more than away teams and thus collect more points, which is often attributed to the support of fans in the stadium.

df_epl[['h_a','scored','pts']].groupby('h_a').mean()

Home teams, in general, dominate the games. We can also see this in the number of shots per game. Let’s compare the shots of home teams and away teams:

print("Home team stats \n {} \n".format(df_epl[df_epl.h_a == 'h'][['HS.x','HST.x','HtrgPerc']].mean()))
print("Away team stats \n {} \n".format(df_epl[df_epl.h_a == 'a'][['AS.x','AST.x','AtrgPerc']].mean()))

Home teams outperform away teams in shots and shots on target. However, shot accuracy is slightly better for away teams than for home teams.
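The accuracy comparison can also be computed from the shot counts directly. The sketch below assumes that accuracy means shots on target divided by shots, and that every row carries both the home (HS.x, HST.x) and away (AS.x, AST.x) figures of its game; the source does not define HtrgPerc or AtrgPerc, so both points are assumptions. The numbers are made up.

```python
import pandas as pd

# Toy data (hypothetical): two games, each stored as a home and an away row
# that repeat the same game-level shot counts.
toy = pd.DataFrame({
    "h_a": ["h", "a", "h", "a"],
    "HS.x": [10, 10, 20, 20],   # home shots
    "HST.x": [4, 4, 6, 6],      # home shots on target
    "AS.x": [5, 5, 10, 10],     # away shots
    "AST.x": [2, 2, 5, 5],      # away shots on target
})

# Keep one row per game so that no game is counted twice.
games = toy[toy["h_a"] == "h"]

# Overall accuracy: all shots on target divided by all shots.
home_accuracy = games["HST.x"].sum() / games["HS.x"].sum()
away_accuracy = games["AST.x"].sum() / games["AS.x"].sum()

print("Home accuracy: {:.3f}".format(home_accuracy))
print("Away accuracy: {:.3f}".format(away_accuracy))
```

The toy numbers are chosen so that the home side shoots more often while the away side is more accurate, which mirrors the pattern described above.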

One way to measure the performance of a team is to compare the points it collects with its expected points. There is, of course, a “luck” factor in some cases, but it is an interesting stat. So, let’s check it by averaging the difference between actual points and expected points. This will show how successful each team is at meeting expectations.

df_epl['performance'] = df_epl['pts'] - df_epl['xpts']
df_epl[['teamId','performance']].groupby('teamId').mean().sort_values(by='performance', ascending=False)

[Output: teams above expectation (top of the list) and teams below expectation (bottom of the list)]

Liverpool outperforms the others by far, which makes sense because they have only lost 5 points out of a possible 87 in 29 games. Man City, Man Utd, and Chelsea get some surprising results because they perform below expectation on average.
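The split into teams above and below expectation can be made explicit by looking at the sign of the average difference. This is a minimal sketch on made-up numbers for two hypothetical teams; it also reports the season total next to the per-game mean.

```python
import pandas as pd

# Toy data (hypothetical): two games per team.
toy = pd.DataFrame({
    "teamId": ["A", "A", "B", "B"],
    "pts": [3, 3, 0, 1],
    "xpts": [2.0, 1.5, 1.2, 1.8],
})

# Actual minus expected points for every game.
toy["performance"] = toy["pts"] - toy["xpts"]

# Per-game mean and season total for each team, best first.
summary = (
    toy.groupby("teamId")["performance"]
    .agg(["mean", "sum"])
    .sort_values("mean", ascending=False)
)

# Positive mean: above expectation; negative mean: below expectation.
above = summary[summary["mean"] > 0]
below = summary[summary["mean"] < 0]

print(above)
print(below)
```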

Some referees tend to use yellow and red cards more readily than others. I think players keep that in mind. Let’s see how many cards each referee shows per game on average:

df_epl['cards'] = df_epl['HY.x'] + df_epl['HR.x'] + df_epl['AY.x'] + df_epl['AR.x']
df_epl[['Referee.x','cards']].groupby('Referee.x').mean().sort_values(by='cards', ascending=False)[:10]

Players should be more careful when the referee is M Dean.
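An average is easier to trust when it comes with the number of games behind it, as in the Man City match day table earlier. The sketch below counts each game once (home rows only) and reports both the mean and the game count per referee. The referees "R1" and "R2" and all card numbers are made up; only the column names come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): three games, each stored as a home and an away row.
toy = pd.DataFrame({
    "h_a": ["h", "a", "h", "a", "h", "a"],
    "Referee.x": ["R1", "R1", "R1", "R1", "R2", "R2"],
    "HY.x": [2, 2, 1, 1, 0, 0],   # home yellow cards
    "HR.x": [0, 0, 1, 1, 0, 0],   # home red cards
    "AY.x": [3, 3, 2, 2, 1, 1],   # away yellow cards
    "AR.x": [0, 0, 0, 0, 0, 0],   # away red cards
})

# One row per game, so the count column is a number of games.
games = toy[toy["h_a"] == "h"].copy()
games["cards"] = games[["HY.x", "HR.x", "AY.x", "AR.x"]].sum(axis=1)

per_referee = (
    games.groupby("Referee.x")["cards"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(per_referee)
```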

