
Some pandas notes (part 1, EDA)

source link: https://andrewpwheeler.com/2022/06/06/some-pandas-notes-part-1-eda/

Teaching my son python at the moment – when I gave him a final project to do on his own, he complained and said he did not remember how to do one step from the prior lessons.

In the end, I am guessing I really only have maybe 30 commands in any language memorized, and then maybe another 70 I know of (but often need to look at the docs for their less-often-used arguments).

Here I figured it would be good for learners for me to put down my notes on pandas dataframes in python – at least the fewer than 20 commands I regularly use.

So first off, in what I think will be a 3 part series, is using pandas for exploratory data analysis (EDA). Besides the tables I will show here, I often only use histograms and smooth conditional plots, and even those I don't use very often.

I debated on writing a post that EDA is overrated – I spend way more time understanding business objectives, and after that point tuning a reasonably non-linear machine learning model and examining its results dominates the work. That makes writing a post on EDA quite easy though: I can show a dozen pandas methods I regularly use, and that covers over 90% of the typical EDA work I do.

Here I will show an example using the PPP loan data available to download; in particular I will be using the over $150k loan data. A nice aspect of pandas is that you can read in a CSV file directly from a url.

# Python code examples for EDA
import pandas as pd

# PPP loans, see full dataset
# at https://data.sba.gov/dataset/ppp-foia
url = r'https://data.sba.gov/dataset/8aa276e2-6cab-4f86-aca4-a7dde42adf24/resource/501af711-1c91-477a-80ce-bf6428eb9253/download/public_150k_plus_220403.csv'

# Can read in from url directly
pp_loans = pd.read_csv(url)
pp_loans.shape # over 900k rows

At work I am often querying a database using SQLAlchemy instead of reading a csv file. But this is a nice example to show off – the data is too big to really work effectively in Excel, but in pandas it should be no problem for most people's machines.
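
For reference, a minimal sketch of that database pattern – the connection string and table name here are made-up placeholders, so substitute your own:

# Minimal sketch of reading from a database instead of a csv
# (connection string and table are placeholders)
from sqlalchemy import create_engine

eng = create_engine('postgresql://user:pass@host/dbname')
db_loans = pd.read_sql('SELECT * FROM ppp_loans', eng)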

So you can see the first thing I look at is data.shape to get the number of rows and columns. Next I often just print the columns using list(data):

# Can use list to see all of the variables
list(pp_loans)

So you can access/modify the actual column names via data.columns, but this is a bit nicer printing.
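
For example, you can peek at or rename columns directly – BorrowerName is a real field in this data, but the new name here is just for illustration:

# Columns are an Index object you can inspect/modify
print(pp_loans.columns[:3])

# rename returns a copy, so this leaves pp_loans alone
renamed = pp_loans.rename(columns={'BorrowerName': 'Borrower'})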

Next I often summarize the data using the data.describe() method. I like transposing this result to be easier to read though.

# Transpose describe for easier reading
pp_loans.describe().T

Here we get a bit of a mess of scientific notation (also this only shows summaries of the float data by default, not categorical or dates). If you want printing that is a bit easier on the eyes, I typically have this pd.set_option() handy. You can also scale the data; here I divide everything by 1000.

# If you dont want scientific notation
pd.set_option('display.float_format', '{:,.0f}'.format)
pp_loans.describe().T/1000
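
And since describe() only covers the numeric columns by default, it takes an include argument to summarize the string columns as well:

# Summaries of the string/object columns
pp_loans.describe(include='object').T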

One of the things I like the most about this is the count, so you can see which fields have missing values (here they should likely be set to 0 for the *_PROCEED variables). The second thing I look at are the min/max – sometimes you can spot rogue missing value codes or censoring in the data. Then looking at the quartiles is typically informative as to how skewed the data is. Only then do I bother with the mean, which for binary data is reasonably informative.
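
If you do decide those missing values really are zeroes, a minimal sketch of filling them in (using filter to grab the *_PROCEED columns):

# Filling in 0s for the missing *_PROCEED values
proceed = pp_loans.filter(like='_PROCEED').columns
pp_loans[proceed] = pp_loans[proceed].fillna(0)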

You can see for this particular data it is the loans over $150k – here that apparently means the CurrentApprovalAmount, which ranges from $150k up to $10 million (the apparent cap on PPP loans). So this is a $150k-plus, conditional-on-approval dataset.

The next step, I am typically looking at the distribution of various categorical data. For that I often just use data['variable'].value_counts():

# Value counts is maybe my favorite
pp_loans['BusinessType'].value_counts()


This prints out all of the business types in this example, but with a very large number of categories it will limit the printout to the head/tail. Here we can see corporations are the most common, with LLCs and S-Corps the next most frequent.
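
value_counts also takes a normalize argument if you would rather see proportions than raw counts:

# Proportions instead of raw counts
pp_loans['BusinessType'].value_counts(normalize=True)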

The next most common pandas method I use is data.groupby(), this produces summary aggregations conditional on different groups in the data. For example, here we can do the total sum of the loans (per million dollars) for each of the business types.

# Groupby with stats is next, per million
x = 'BusinessType'
y = 'CurrentApprovalAmount'
pp_loans.groupby(x)[y].sum().sort_values()/1e6

You can see that you can chain multiple pandas methods together, which is just a typical python object oriented pattern – so I tack a .sort_values() onto the end of the groupby. Note again these are in millions of dollars, so co-ops got over a billion dollars of PPP loans in this data (it will be more overall; these again are only loans over $150k, and the link at the beginning has additional datasets with smaller loan amounts).
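
If you want a few statistics at once without a full describe(), groupby also has an agg method that accepts a list of functions:

# Multiple aggregate stats in one call
pp_loans.groupby(x)[y].agg(['size', 'mean', 'sum'])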

You can subsequently apply more complicated functions in groupby; just as an example, here I create my own normal-approximation confidence interval of the mean for large samples.

# Can do more complicated functions
def ci95(x):
    # normal approximation 95% CI for the mean,
    # using the standard error (std / sqrt(n))
    mx = x.mean()
    se = x.std() / len(x)**0.5
    res = pd.DataFrame([[mx - 2*se, mx + 2*se]])
    res.columns = ['low','high']
    return res

pp_loans.groupby(x)[y].apply(ci95)

I often use this for functions with binary data, but otherwise we can again chain methods together, so we can use data.groupby(x)[y].describe() to get those summaries for each of our groups. Also as an FYI you can pass in multiple y values to summarize via a list, although it gets a bit hairy in the output:

# And can have multiple outcome variables
ys = [y,'PAYROLL_PROCEED']
pp_loans.groupby(x)[ys].describe()

One of the common things I do at this point is to simply dump the results to a CSV file I can look at, e.g. something like:

# Can chain together and save result to csv
pp_loans.groupby(x)[ys].describe().to_csv('aggstats.csv')

Or assign the result to a new object – some python editors (such as Spyder) have a data browser to examine it.
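
So for example something like this (agg_stats is just a name I picked):

# Assign to an object to browse in the editor
agg_stats = pp_loans.groupby(x)[ys].describe()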

The final EDA part I use most often is simply sorting the dataset and looking at the head/tail.

# Can sort values and look at head/tail
pp_loans.sort_values(by=ys,ascending=False,inplace=True,ignore_index=True)
pp_loans[['BorrowerName'] + ys].head()

I often will export the head of the data as well, e.g. something like data.head().to_csv('checkdata.csv'), to view all of the fields and their contents when first figuring out the fields, how often they are filled in, etc.

In the next part of the series, I will describe the pandas data manipulation operations I regularly use, e.g. getting a single column vs multiple columns, filtering, string manipulation functions, etc.

