pandas GroupBy: Your Guide to Grouping Data in Python

 1 month ago
source link: https://realpython.com/pandas-groupby/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Your Guide to Grouping Data in Python – Real Python


Before you proceed, make sure that you have the latest version of pandas available within a new virtual environment:

PS> python -m venv venv
PS> venv\Scripts\Activate.ps1
(venv) PS> python -m pip install pandas

In this tutorial, you’ll focus on three datasets:

You can download the source code for all the examples in this tutorial by clicking on the link below:

Download Datasets: Click here to download the datasets that you’ll use to learn about pandas’ GroupBy in this tutorial.

Once you’ve downloaded the .zip file, unzip the file to a folder called groupby-data/ in your current directory. Before you read on, ensure that your directory tree looks like this:

└── groupby-data/
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv

With pandas installed, your virtual environment activated, and the datasets downloaded, you’re ready to jump in!

Example 1: U.S. Congress Dataset

You’ll jump right into things by dissecting a dataset of historical members of Congress. You can read the CSV file into a pandas DataFrame with read_csv():

# pandas_legislators.py

import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
df = pd.read_csv(
    usecols=list(dtypes) + ["birthday", "last_name"],

The dataset contains members’ first and last names, birthday, gender, type ("rep" for House of Representatives or "sen" for Senate), U.S. state, and political party. You can use df.tail() to view the last few rows of the dataset:

>>> from pandas_legislators import df
>>> df.tail()
      last_name first_name   birthday gender type state       party
11970   Garrett     Thomas 1972-03-27      M  rep    VA  Republican
11971    Handel      Karen 1962-04-18      F  rep    GA  Republican
11972     Jones     Brenda 1959-10-24      F  rep    MI    Democrat
11973    Marino        Tom 1952-08-15      M  rep    PA  Republican
11974     Jones     Walter 1943-02-10      M  rep    NC  Republican

The DataFrame uses categorical dtypes for space efficiency:

>>> df.dtypes
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.

The Hello, World! of pandas GroupBy

Now that you’re familiar with the dataset, you’ll start with a Hello, World! for the pandas GroupBy operation. What is the count of Congressional members, on a state-by-state basis, over the entire history of the dataset? In SQL, you could find this answer with a SELECT statement:

SELECT state, count(name)
GROUP BY state
ORDER BY state;

Here’s the near-equivalent in pandas:

>>> n_by_state = df.groupby("state")["last_name"].count()
>>> n_by_state.head(10)
AK     16
AL    206
AR    117
AS      2
AZ     48
CA    361
CO     90
CT    240
DC      2
DE     97
Name: last_name, dtype: int64

You call .groupby() and pass the name of the column that you want to group on, which is "state". Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation.

You can pass a lot more than just a single column name to .groupby() as the first argument. You can also specify any of the following:

  • A list of multiple column names
  • A dict or pandas Series
  • A NumPy array or pandas Index, or an array-like iterable of these

Here’s an example of grouping jointly on two columns, which finds the count of Congressional members broken out by state and then by gender:

>>> df.groupby(["state", "gender"])["last_name"].count()
state  gender
AK     F           0
       M          16
AL     F           3
       M         203
AR     F           5
WI     M         196
WV     F           1
       M         119
WY     F           2
       M          38
Name: last_name, Length: 116, dtype: int64

The analogous SQL query would look like this:

SELECT state, gender, count(name)
GROUP BY state, gender
ORDER BY state, gender;

As you’ll see next, .groupby() and the comparable SQL statements are close cousins, but they’re often not functionally identical.

pandas GroupBy vs SQL

This is a good time to introduce one prominent difference between the pandas GroupBy operation and the SQL query above. The result set of the SQL query contains three columns:

  1. state
  2. gender
  3. count

In the pandas version, the grouped-on columns are pushed into the MultiIndex of the resulting Series by default:

>>> n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
>>> type(n_by_state_gender)
<class 'pandas.core.series.Series'>
>>> n_by_state_gender.index[:5]
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M'),
            ('AR', 'F'),
            ('AR', 'M')],
           names=['state', 'gender'])

To more closely emulate the SQL result and push the grouped-on columns back into columns in the result, you can use as_index=False:

>>> df.groupby(["state", "gender"], as_index=False)["last_name"].count()
    state gender  last_name
0      AK      F          0
1      AK      M         16
2      AL      F          3
3      AL      M        203
4      AR      F          5
..    ...    ...        ...
111    WI      M        196
112    WV      F          1
113    WV      M        119
114    WY      F          2
115    WY      M         38

[116 rows x 3 columns]

This produces a DataFrame with three columns and a RangeIndex, rather than a Series with a MultiIndex. In short, using as_index=False will make your result more closely mimic the default SQL output for a similar operation.

Note: In df.groupby(["state", "gender"])["last_name"].count(), you could also use .size() instead of .count(), since you know that there are no NaN last names. Using .count() excludes NaN values, while .size() includes everything, NaN or not.

Also note that the SQL queries above explicitly use ORDER BY, whereas .groupby() does not. That’s because .groupby() does this by default through its parameter sort, which is True unless you tell it otherwise:

>>> # Don't sort results by the sort keys
>>> df.groupby("state", sort=False)["last_name"].count()
DE      97
VA     432
SC     251
MD     305
PA    1053
AK      16
PI      13
VI       4
GU       4
AS       2
Name: last_name, dtype: int64

Next, you’ll dive into the object that .groupby() actually produces.

How pandas GroupBy Works

Before you get any further into the details, take a step back to look at .groupby() itself:

>>> by_state = df.groupby("state")
>>> print(by_state)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>

What is DataFrameGroupBy? Its .__str__() value that the print function shows doesn’t give you much information about what it actually is or how it works. The reason that a DataFrameGroupBy object can be difficult to wrap your head around is that it’s lazy in nature. It doesn’t really do any operations to produce a useful result until you tell it to.

Note: In this tutorial, the generic term pandas GroupBy object refers to both DataFrameGroupBy and SeriesGroupBy objects, which have a lot in common.

One term that’s frequently used alongside .groupby() is split-apply-combine. This refers to a chain of three steps:

  1. Split a table into groups.
  2. Apply some operations to each of those smaller tables.
  3. Combine the results.

It can be difficult to inspect df.groupby("state") because it does virtually none of these things until you do something with the resulting object. A pandas GroupBy object delays virtually every part of the split-apply-combine process until you invoke a method on it.

So, how can you mentally separate the split, apply, and combine stages if you can’t see any of them happening in isolation? One useful way to inspect a pandas GroupBy object and see the splitting in action is to iterate over it:

>>> for state, frame in by_state:
...     print(f"First 2 entries for {state!r}")
...     print("------------------------")
...     print(frame.head(2), end="\n\n")
First 2 entries for 'AK'
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent

First 2 entries for 'AL'
    last_name first_name   birthday gender type state       party
912   Crowell       John 1780-09-18      M  rep    AL  Republican
991    Walker       John 1783-08-12      M  sen    AL  Republican

If you’re working on a challenging aggregation problem, then iterating over the pandas GroupBy object can be a great way to visualize the split part of split-apply-combine.

There are a few other methods and properties that let you look into the individual groups and their splits. The .groups attribute will give you a dictionary of {group name: group label} pairs. For example, by_state.groups is a dict with states as keys. Here’s the value for the "PA" key:

>>> by_state.groups["PA"]
Int64Index([    4,    19,    21,    27,    38,    57,    69,    76,    84,
            11842, 11866, 11875, 11877, 11887, 11891, 11932, 11945, 11959,
           dtype='int64', length=1053)

Each value is a sequence of the index locations for the rows belonging to that particular group. In the output above, 4, 19, and 21 are the first indices in df at which the state equals "PA".

You can also use .get_group() as a way to drill down to the sub-table from a single group:

>>> by_state.get_group("PA")
      last_name first_name   birthday gender type state                party
4        Clymer     George 1739-03-16      M  rep    PA                  NaN
19       Maclay    William 1737-07-20      M  sen    PA  Anti-Administration
21       Morris     Robert 1734-01-20      M  sen    PA   Pro-Administration
27      Wynkoop      Henry 1737-03-02      M  rep    PA                  NaN
38       Jacobs     Israel 1726-06-09      M  rep    PA                  NaN
...         ...        ...        ...    ...  ...   ...                  ...
11891     Brady     Robert 1945-04-07      M  rep    PA             Democrat
11932   Shuster       Bill 1961-01-10      M  rep    PA           Republican
11945   Rothfus      Keith 1962-04-25      M  rep    PA           Republican
11959  Costello       Ryan 1976-09-07      M  rep    PA           Republican
11973    Marino        Tom 1952-08-15      M  rep    PA           Republican

This is virtually equivalent to using .loc[]. You could get the same output with something like df.loc[df["state"] == "PA"].

It’s also worth mentioning that .groupby() does do some, but not all, of the splitting work by building a Grouping class instance for each key that you pass. However, many of the methods of the BaseGrouper class that holds these groupings are called lazily rather than at .__init__(), and many also use a cached property design.

Next, what about the apply part? You can think of this step of the process as applying the same operation (or callable) to every sub-table that the splitting stage produces.

From the pandas GroupBy object by_state, you can grab the initial U.S. state and DataFrame with next(). When you iterate over a pandas GroupBy object, you’ll get pairs that you can unpack into two variables:

>>> state, frame = next(iter(by_state))  # First tuple from iterator
>>> state
>>> frame.head(3)
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent
7442   Grigsby     George 1874-12-02      M  rep    AK          NaN

Now, think back to your original, full operation:

>>> df.groupby("state")["last_name"].count()
AK      16
AL     206
AR     117
AS       2
AZ      48

The apply stage, when applied to your single, subsetted DataFrame, would look like this:

>>> frame["last_name"].count()  # Count for state == 'AK'

You can see that the result, 16, matches the value for AK in the combined result.

The last step, combine, takes the results of all of the applied operations on all of the sub-tables and combines them back together in an intuitive way.

Read on to explore more examples of the split-apply-combine process.

About Joyk

Aggregate valuable and interesting links.
Joyk means Joy of geeK