27

Pandas Groupby and Data-Handling Tips- FIFA Player Data

 4 years ago
source link: https://www.tuicool.com/articles/En2MR3y
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

FIFA 19 Complete Player Dataset

uumEFfU.png!web

Courtesy of FIFA International Soccer

Making Groupings and Presenting/Using that Data

Grouping in Pandas represents one of the most powerful features of the library. It couples simplicity with efficiency. This helpfully cuts down on the amount of code that needs to be written when answering data science questions. The aim of this article is to showcase some useful features when grouping in Pandas and blend it with some equally useful Pandas tips.

As Usual, I will use a Football Example

The data-set used for this tutorial article is available from Kaggle from the following link and is entitled ‘FIFA 19 complete player dataset’. I have included import library code to import pandas and matplotlib, and dropped unnecessary columns.

In addition, as I know from Kaggle, this Dataframe will possess over 80 columns. I set the max_columns attribute from the display module to 90, so that a sliding bar appears at the bottom of my fifa_19_players_df Dataframe that I have read in. I encourage re-assigning the columns attribute when working with Dataframes with a large number of columns to avoid a truncated view and to enable a better sense of the underlying data.

iuUvMbf.png!web

By re-assigning the max_columns attribute from the display method to encompass all the columns in a Dataframe, a better sense of the data can be gleaned!

1. Creating a Groupby Object

In Pandas, the groupby method is called directly on the Dataframe object, in this case, fifa_19_players_data. In the example shown below, the groupby object that is returned is based on the common column values within the ‘Club’ column and assigned the variable name ‘Team’.

This team variable represents a container of grouped Dataframes. Each Dataframe within this container is grouped on a common Club.

2. Groupby object Methods

size() and get_group()

To demonstrate that the team variable points to a groupby object which represents a collection of Dataframes, there is a convenient method called get_group() which can be called directly on the groupby object.

In order to access the groups initially, the size() method can be called on the groupby object which returns a series. In this series the Club is the index and the column is the number of rows in each grouped Dataframe.

Here, I have method chained to obtain the 5 largest Dataframes. As outputted from this command, Arsenal have 33 players represented. To view these players directly, we can call the get_group method on the team groupby object, with ‘Arsenal’ passed as a string argument.

ZJfU73v.png!web

Once this Dataframe is derived, we can perform plots such as the horizontal bar plot depicted. In this example, I sort the values in the arsenal Dataframe, by the ‘Wage’ column, set the ascending parameter argument to False, and take the first 10 results using the head method before plotting.

3. Customization of Plots using Built in Templates with user-defined color choices

The ‘ggplot’ template in combination with the ‘#61D199’ color argument, gives the horizontal bar plot a nice professional edge. The ggplot is supplied as a string to the use method from the style module. Aubameyang is clearly the highest weekly earner from the top available from the 2019 data-set.

iauMRfE.png!web

#OzilOut -with Ozil not playing and making a dent on Arsenal’s bottom line, is it time to say goodbye to the German?

4. Pandas Styler object

-To include the horizontal bar

The horizontal bar plot produced using the plot method called on the Dataframe above gives a good visual representation of the weekly wage. This information can also be conveyed directly within the Dataframe by calling the bar method on a Pandas Styler object.

To illustrate, when I append the .style onto the end of my arsenal Dataframe, from which I have extracted 4 columns as a list and pass this object to Python’s built in type function, a Styler object is returned.

It is now simple to call the bar method on this Styler object. The Styler object has a subset parameter which accepts either a string column name or a list of column names. For consistency, I have passed the same color argument, as the horizontal plot argument above. If 2 columns are passed to the subset parameter, note that 2 different color arguments can be passed as a list to the color parameter. In this way, user customization of the resulting Styler object can be flexibly adjusted.

QbeQfiu.png!web

5. The .agg() Method on a groupby object

The agg method can be usefully called on a groupby object. To illustrate a use case, I would like to know the average value for the Overall column in addition to the acceleration and shot power columns for each grouped Club.

In order to do this, I call the agg method on my team variable which points to a groupby object, which was created earlier, and pass it a Python dictionary.

In this dictionary, the keys are the columns in the original ‘fifa_19_players_data’ as strings, and the values are the functions that I would like to use when aggregating, which is also passed as a string. I sort the values by the overall column, from highest to lowest by setting the ascending argument to False, and view the 10 top search results. Juventus come out top with an average 82.28 for their respective overall mean.

NvQriyE.png!web

6. Precision options and settings

For presentation purposes, I often change the settings option of the float number. To change from the default of 6 decimal places to 2, I simply call the set option method directly on the pandas library (which here has the common alias of pd), and run the cell again. The floating point numbers presented have now been rounded to 2 decimal places.

This method is particularity useful, because it does not change the underlying data, and maintains the precision. If the intention was to export the data, the numeric floating point numbers would still maintain their accuracy to 6 decimal places.

iYBBvqe.png!web

7. Create a Dataframe With the ‘Best’ Player from each team

From this data-set, the metric which may distinguish the ‘ best ’ football player at each club is the Overall column found in the original fifa_19_players_data Dataframe.

Once more to create that familiar groupby object, I call the groupby method on the fifa_19_players_data with the Club column passed as a string. I assign this groupby object to the variable team.

To create a Dataframe with the ‘ best ’ players from each team, I use the Dataframe constructor method, and pass only one argument. This argument is the columns attribute of the original fifa_19_players_data Dataframe to the columns parameter. This creates an empty Dataframe, with the same column headers as the fifa_19_players_data Dataframe. I call this Dataframe greatest_club_players_df.

I then use a for loop to iterate through my groupby objects. I assign two temporary variables, one for the grouping and one for the Dataframe associated with that grouping. For each Dataframe within the grouping (which I have assigned the variable name data), I use the nlargest method and take the highest value from the Overall column. I assign this row to the variable best player. Finally, on each iteration I append this row to greatest_club_players_df.

I then sort the greatest_club_players_df by the Overall column and view the highest columns first. No surprise to see that Portugal’s Ronaldo and Argentina’s Messi top the overall quality for both Juventus and Barcelona respectively.

iEr2aeJ.png!web

Still shot for my Jupyter Notebook. The replica code can be found in the Github gist above.

Concluding Tip

To round off, lets pay homage to the top player as determined by the Overall column in the greatest_club_players_df Dataframe.

I will use a cell body magic from Jupyter Notebook which will enable the incorporation of a picture. I write 2 % signs, followed by html, and then the html tag underneath. I run the cell as as ‘code’ cell type, and now this picture of Cristiano Ronaldo from as.com is included in the Notebook, and can be used to better aid in story telling through the medium of code!

ii6zqyn.png!web

Thumps up for Pandas!


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK