

sidetable - Create Simple Summary Tables in Pandas
source link: https://pbpython.com/sidetable.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.


Introduction
Today I am happy to announce the release of a new pandas utility library called sidetable. This library makes it easy to build a frequency table and simple summary of missing values in a DataFrame. I have found it to be a useful tool when starting data exploration on a new data set and I hope others find it useful as well.
This project is also an opportunity to illustrate how to use pandas new API to register custom DataFrame accessors. This API allows you to build custom functions for working with pandas DataFrames and Series and could be really useful for building out your own library of custom pandas accessor functions.
sidetable
At its core, sidetable is a super-charged version of pandas
value_counts
with a
little bit of
crosstab
mixed in. For instance, let’s look at some data on School
Improvement Grants so we can see how sidetable can help us explore a new data
set and figure out approaches for more complex analysis.
The only external dependency is pandas version >= 1.0. Make sure it is installed, then install sidetable:
python -m pip install sidetable
Once sidetable is installed, you need to import it to get the pandas accessor registered.
import pandas as pd import sidetable df = pd.read_csv('https://github.com/chris1610/pbpython/blob/master/data/school_transform.csv?raw=True', index_col=0)

Now that sidetable is imported, you have a new accessor on all your DataFrames -
stb
that you can use to build summary tables. For instance, we can use
.stb.freq()
to build a frequency table to show how many schools were included by state with cumulative totals and percentages:
df.stb.freq(['State'])

This example shows that CA occurs 92 times and represents 12.15% of the total number of schools. If you include FL in the counts, you now have 163 total schools that represent 21.5% of the total.
As of version 0.6, sidetable displays percents correctly. The examples shown here are proportions not percentages. Newer versions will multiply the results times 100 - which is the correct way to represent a percent. Recent versions of sidetable also use lower cases titles for column names. Users are encouraged to use the most recent version.
For comparison, here’s
value_counts(normalize=True)
next to sidetable’s output:

I think you’ll agree sidetable provides a lot more insight with not much more effort.
But wait, there’s more!
What if we want a quick view of the states that contribute around 50% of the total? Use the
thresh
argument to group all of the rest into an “Others” category:
df.stb.freq(['State'], thresh=.5)

This is handy. Now we can see that 8 states contributed almost 50% of the total and all the other states account for the remainder.
If we want, we can rename the catch-all category using
other_label
df.stb.freq(['State'], thresh=.5, other_label='Rest of states')
One of the useful features of sidetable is that it can group columns together to further understand the distribution. For instance, what if we want to see how the various “Transformation Models” are applied across Regions?
df.stb.freq(['Region', 'Model Selected'])

This view is a quick way to understand the interaction and distribution of the various data elements. I find that this is an easy way to explore data and get some insights that might warrant further analysis. A table like this is also easy to share with others since it is relatively simple to understand.
You could definitely perform this analysis with standard pandas (that’s all that is behind the scenes after all). It is cumbersome though, to remember the code. My experience is that if it is tough to remember then you are less likely to do it. simpletable tries to make this type of summary very easy to do.
Up until now, we have been counting the number of instances. What might be much more
interesting is looking at the total breakdown by
Award Amount
. sidetable
allows you to pass a
value
column that can be summed (instead of counting occurrences).
df.stb.freq(['Region'], value='Award_Amount')

This view gives us insight that the Northeast has the least amount of dollars spent on these projects and that 37% of the total spend went to schools in the South region.
Finally, we can look at the types of models selected and determine the 80/20 breakdown of the allocated dollars:
df.stb.freq(['Region', 'Model Selected'], value='Award_Amount', thresh=.82, other_label='Remaining')

If you’re familiar with pandas crosstab, then one way to look at sidetable is that
it is an expanded version of a
crosstab
with some convenience functions to view
the data more easily:

One of sidetable’s goals is that its output is easy to interpret. If you would like to leverage
pandas style functions to format your output for improved readability, sidetable can
format Percentage and Amount columns to be more readable. This is not used by
default but can be seen by passing
style=True
to the function:
df.stb.freq(['Region'], value='Award_Amount', style=True)

So far I have only shown the
freq
function but in the interest of showing how to
add other functions to the library, here’s an example of building a simple missing values table:
df.stb.missing()

In this table, there are 10 missing values in the Region column that represent a little less than 1.3% of the total values in that column.
You can get similar information using
df.info()
but I find this table
easier to interpret when it comes to quickly identifying missing values:

The documentation shows more information on usage and other options. Please check it out and let me know if it is useful to you.
One thing I do want to do is thank three people for their contributions to make sidetable work.
- Peter Baumgartner - For the original inspiration in this tweet thread
- Steve Miller - For an article that illustrates the value of looking at frequency distribution article
- Ted Petrou - Made this post showing how to count null values in a DataFrame.
Each of these references was leveraged very heavily to make sidetable. Thank you!
Finally, the functionality in
missing
is not meant to be a replacement for the
excellent missingno module. The implementation included in sidetable is a quick summary
version and does not include any of the useful visualizations in missingno.
Introducing the pandas accessor API
If you would like to learn how to build your own accessor, it’s actually relatively straightforward. As a reference, you can view the file that does all the work here.
Here’s a short summary of how to get started. At the top of your file import pandas to get access to the decorator:
import pandas as pd @pd.api.extensions.register_dataframe_accessor("stb") class SideTableAccessor: def __init__(self, pandas_obj): self._validate(pandas_obj) self._obj = pandas_obj
This portion of code creates the accessor class and defines the accessor value which I have
chosen as
stb
. Once this is in place, any time you import the python module containing
this code, you will get the accessor registered and available on all DataFrames.
When the class is instantiated, the current pandas DataFrame will be validated through the
_validate()
method and then the DataFrame will be reference in subsequent functions
using
self._obj
In this case, I don’t really do much with the validate method but you could choose to add more logic:
@staticmethod def _validate(obj): # verify this is a DataFrame if not isinstance(obj, pd.DataFrame): raise AttributeError("Must be a pandas DataFrame")
All of the work is done in the
freq
and
missing
functions. For the most part,
it is all standard pandas code. You just need to make sure you return a valid DataFrame.
For example, here is the full version of the
missing
function at the time of this article:
def missing(self, clip_0=False, style=False): """ Build table of missing data in each column. clip_0 (bool): In cases where 0 counts are generated, remove them from the list style (bool): Apply a pandas style to format percentages Returns: DataFrame with each Column including total Missing Values, Percent Missing and Total rows """ missing = pd.concat([self._obj.isna().sum(), self._obj.isna().mean()], axis='columns').rename(columns={ 0: 'Missing', 1: 'Percent' }) missing['Total'] = len(self._obj) if clip_0: missing = missing[missing['Missing'] > 0] results = missing[['Missing', 'Total', 'Percent']].sort_values(by=['Missing'], ascending=False) if style: format_dict = {'Percent': '{:.2%}', 'Total': '{0:,.0f}'} return results.style.format(format_dict) else: return results
In your “normal” pandas code, you would reference the DataFrame using
df
but here, use
self._obj
as your DataFrame to perform your concatenation and sorting.
I can see this as a very useful approach for building your own custom flavor of pandas functions. If you have certain transformation, cleaning or summarizing data that you do, then this might be an approach to consider - instead of just copying and pasting the code from file to file.
Summary
Pandas has a very rich API but sometimes it can take a lot of typing and wrangling to get the data in the format that is easy to understand. sidetable can make some of those summary tasks a lot easier by building frequency tables on combinations of your data and identifying gaps in your data.
sidetable does not replace any of the sophisticated analysis you will likely need to do to answer complex questions. However, it is a handy tool for quickly analyzing your data and identifying patterns you may want to investigate further.
In addition, I want sidetable to serve as an example of how to build you own pandas accessor that streamlines your normal analysis process.
I hope you find sidetable useful. If you have ideas for improvements or bug reports, head on over to github and let me know. I hope this can grow over time and become a useful tool that helps many others. I am curious to see what the community does with it.
Recommend
-
7
Introduction The pandas read_...
-
1
Drifted logs and simple summaryotsukare Thoughts after a day of work I have summaries for the previous week that I should probably published at least for my own record. Travels...
-
16
Let PROC FREQ create graphs of your two-way tables 9 ...
-
10
How to Create Relationships Between Multiple Tables Using Data Model in Excel By Tamal Das Published 8 hours ago...
-
7
react-native-awesome-table React-Native Simple Data Tables Examples cam be found on our Storybook https://garrylachman.github....
-
5
the distribution of strings on simple strings works, but not on a series of strings in pandas advertisements I'm very new to python & pand...
-
10
Simple tables with CSS grid layoutThis website uses cookies to ensure you get the best experience on our website. Learn more
-
9
Bringing Kafka based architecture to the next level using simple PostgreSQL tablesWritten by Tõnis PoolWe’ve mentioned in
-
10
Notifications🆘 Need Any Help? Today at 3:11 PM🤑 1k Brands Publish Here Today at 3:32 PM✍🏽 How Do I Get Published?Today at 3:25...
-
6
Tablesaw: The EASIEST way to process tables in Java (Data Frames / Pandas like) ...
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK