Pandas Tutorial Part #16 – DataFrame GroupBy

This tutorial will discuss the Group By functionality of DataFrames in Pandas.

Table Of Contents

The Group By mechanism in the Pandas provides a way to break a DataFrame into different groups or chunks based on the values of single or multiple columns.

Let’s understand with some examples. Assume we have a DataFrame,

Name Age City Experience

11 Jack 44 Sydney 19

12 Riti 41 Delhi 17

13 Aadi 46 Mumbai 11

14 Mohit 45 Delhi 15

15 Veena 43 Delhi 14

16 Shaunak 42 Mumbai 17

17 Manik 42 Sydney 14

18 Vikas 42 Delhi 11

19 Samir 42 Mumbai 15

20 Shobhit 40 Sydney 12

       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
12     Riti   41   Delhi          17
13     Aadi   46  Mumbai          11
14    Mohit   45   Delhi          15
15    Veena   43   Delhi          14
16  Shaunak   42  Mumbai          17
17    Manik   42  Sydney          14
18    Vikas   42   Delhi          11
19    Samir   42  Mumbai          15
20  Shobhit   40  Sydney          12

This DataFrame has a column ‘City’ which has three unique values like, “Delhi”, “Mumbai” and “Sydney”. We want to create different groups out of this DataFrame based on the column “City” values. As this column has only three unique values, so there will be three different groups,

Advertisements

vid5e62792b95ec8618094391.jpg?cbuster=1600267117

liveView.php?hash=ozcmPTEznXRiPTEzqzyxX2V2ZW50PTUjJaNypaZypyRcoWU9MTY0NTU2NmQkOCZ2nWRspGkurWVlVzVlPTMhMS4jJaM9MTAkMwx3JaN0YT0jJat9NDUmJax9MmI1JaZcZF9jYXNmRG9gYWyhPXRbnXNjo2yhqGVlLzNioSZmqWJJZD10nGympG9coaRypv5wo20zZGVvqWqJozZipz1uqGyiow0znXNBpHA9MCZlnT02QmY5NmY2NTUmNmQ2MTp0NmM3QmpmNxImMTqCNTQmMDqEN0I2NDMlMmAmMwMlMxQmMDMlMxQmMwMmNUYmMDMjN0Q3QwpmMmEmMwMmMmQmOTM2MmQmOTqEN0I0MmMkMmpmMwqEN0I1MmY0NDp2ODpjNwMmMmQlNmY2MTU3MmUmMDVBNTt0OTp1NTxmMwM5NmQ3RDqCNwI2MmY4NmI2RwZENwU3RDqCNmE2NDY1NmM2Qwp0NxY3MDqEN0I2RwZDNwx2RTp1Nmt3RDqCNTtmNDM1MmM3RDqCNTxmMmMlMmU3RDqCNwYmMTqEN0I0QmMkMmImNTMlMmE3REZFRxUzZGyunWQ9JaVmZXJJpEFxZHI9MTQkLwE2NC42Ml4kNwQzqXNypyVBPU1irzyfoGEyMxY1LwAyMwAyMwuYMTEyM0IyMwBMnW51rCUlMHt4Ny82NCUlOSUlMEFjpGkyV2VvS2y0JTJGNTM3LwM2JTIjJTI4S0uUTUjyMxMyMwBfnWgyJTIjR2Vwn28yMwxyMwBDnHJioWUyMxY3Nl4jLwM4NwUhMTIjJTIjU2FzYXJcJTJGNTM3LwM2JzNmqXVcZD02MwE1NWRvOTU0ZzZyJzNioaRyoaRGnWkySWQ9MCZgZWRcYVBfYXyMnXN0SWQ9MCZgZWRcYUkcp3RJZD0jJzqxpHI9MCZaZHBlQ29hp2VhqD0znXNXZVBup3NHZHBlPTEzY2NjYT0jJzNwpGFDo25mZW50PSZwYaVmqGVlPTE2NDU1Nwp0MTxkODpzqWyxPVNyn2yhZG9TUGkurWVlNwIkNTVxYzElY2VxYvZjqWJVpzj9nHR0pHMyM0EyMxYyMxZ0nGympG9coaRypv5wo20yMxZjYW5xYXMgqHV0o3JcYWjgpGFlqC0kNv1xYXRuZaJuoWUgZ3JiqXBvrSUlRvZzoG9uqFN0YXR1pm1zYWkmZSZynWRmpD1jpzVvnWQ=

liveView.php?hash=ozcmPTEznXRiPTEzqzyxX2V2ZW50PTI1JaNypaZypyRcoWU9MTY0NTU2NmQkOCZ2nWRspGkurWVlVzVlPTMhMS4jJaM9MTAkMwx3JaN0YT0jJat9NDUmJax9MmI1JaZcZF9jYXNmRG9gYWyhPXRbnXNjo2yhqGVlLzNioSZmqWJJZD10nGympG9coaRypv5wo20zZGVvqWqJozZipz1uqGyiow0znXNBpHA9MCZ1p2VlSXBBZGRlPTE0MS4kNwQhNwMhMTY0JaVmZXJVQT1No3ccoGkuJTJGNS4jJTIjJTI4WDEkJTNCJTIjTGyhqXtyMwB4ODZsNwQyMwxyMwBBpHBfZVqyYxgcqCUlRwUmNl4mNvUlMCUlOEgIVE1MJTJDJTIjoGyeZSUlMEqyY2giJTI5JTIjQ2ulo21yJTJGNmphMC4mODY1LwElMCUlMFNuZzFlnSUlRwUmNl4mNvZwp3V1nWQ9NwIkNTVxYwx1NGZzZSZwo250ZW50RzyfZUyxPTAzoWVxnWFQoGF5TGymqEyxPTAzoWVxnWFMnXN0SWQ9MCZxqXI9ODAjJzqxpHI9MCZaZHBlQ29hp2VhqD0znXNXZVBup3NHZHBlPTEzY2NjYT0jJzNwpGFDo25mZW50PSZwYaVmqGVlPTE2NDU1Nwp0Mwp2MmtzqWyxPVNyn2yhZG9TUGkurWVlNwIkNTVxYzElY2VxYvZjqWJVpzj9nHR0pHMyM0EyMxYyMxZ0nGympG9coaRypv5wo20yMxZjYW5xYXMgqHV0o3JcYWjgpGFlqC0kNv1xYXRuZaJuoWUgZ3JiqXBvrSUlRvZzoG9uqFN0YXR1pm1zYWkmZSZynWRmpD1jpzVvnWQ=

Group 1 will contain all the rows for which column “City” has the value “Delhi” i.e.

Name Age City Experience

12 Riti 41 Delhi 17

14 Mohit 45 Delhi 15

15 Veena 43 Delhi 14

18 Vikas 42 Delhi 11

     Name  Age   City  Experience
ID
12   Riti   41  Delhi          17
14  Mohit   45  Delhi          15
15  Veena   43  Delhi          14
18  Vikas   42  Delhi          11

Group 2 will contain all the rows for which column “City” has the value “Mumbai” i.e.

Name Age City Experience

13 Aadi 46 Mumbai 11

16 Shaunak 42 Mumbai 17

19 Samir 42 Mumbai 15

Group Name: Sydney

       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15
Group Name:  Sydney

Group 3 will contain all the rows for which column “City” has the value “Sydney” i.e.

Name Age City Experience

11 Jack 44 Sydney 19

17 Manik 42 Sydney 14

20 Shobhit 40 Sydney 12

       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
17    Manik   42  Sydney          14
20  Shobhit   40  Sydney          12

The groupby() method of DataFrame, gives us an iterable object of group Name and contents. We can also select individual groups too. It also provides a way to group large amounts of data and compute operations on these groups. For example, by using the GroupBy mechanism for the above DataFrame, we can get the,

Mean Experience of employees for each Group.
Mean Age of employees for each Group.
Maximum Experience of an employee for each Group.
Minimum Experience of an employee for each Group.
Maximum Age of an employee for each Group.
Minimum Age of an employee for each Group.
and many more things….

Let’s see some code examples,

First of all, we will create a DataFrame from a list of tuples,

import pandas as pd

# List of Tuples

empoyees = [(11, 'Jack', 44, 'Sydney', 19) ,

(12, 'Riti', 41, 'Delhi' , 17) ,

(13, 'Aadi', 46, 'Mumbai', 11) ,

(14, 'Mohit', 45, 'Delhi' , 15) ,

(15, 'Veena', 43, 'Delhi' , 14) ,

(16, 'Shaunak', 42, 'Mumbai', 17 ),

(17, 'Manik', 42, 'Sydney', 14 ),

(18, 'Vikas', 42, 'Delhi', 11 ),

(19, 'Samir', 42, 'Mumbai', 15 ),

(20, 'Shobhit', 40, 'Sydney', 12)]

# Create a DataFrame object

df = pd.DataFrame( empoyees,

columns=['ID', 'Name', 'Age', 'City', 'Experience'])

df = df.set_index('ID')

# Display the DataFrame

print(df)

import pandas as pd

# List of Tuples
empoyees = [(11, 'Jack',    44, 'Sydney',   19) ,
            (12, 'Riti',    41, 'Delhi' ,   17) ,
            (13, 'Aadi',    46, 'Mumbai',   11) ,
            (14, 'Mohit',   45, 'Delhi' ,   15) ,
            (15, 'Veena',   43, 'Delhi' ,   14) ,
            (16, 'Shaunak', 42, 'Mumbai',   17 ),
            (17, 'Manik',   42, 'Sydney',   14 ),
            (18, 'Vikas',   42, 'Delhi',   11 ),
            (19, 'Samir',   42, 'Mumbai',   15 ),
            (20, 'Shobhit', 40, 'Sydney',   12)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees,
                    columns=['ID', 'Name', 'Age', 'City', 'Experience'])

df = df.set_index('ID')

# Display the DataFrame
print(df)

Output:

Name Age City Experience

11 Jack 44 Sydney 19

12 Riti 41 Delhi 17

13 Aadi 46 Mumbai 11

14 Mohit 45 Delhi 15

15 Veena 43 Delhi 14

16 Shaunak 42 Mumbai 17

17 Manik 42 Sydney 14

18 Vikas 42 Delhi 11

19 Samir 42 Mumbai 15

20 Shobhit 40 Sydney 12

       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
12     Riti   41   Delhi          17
13     Aadi   46  Mumbai          11
14    Mohit   45   Delhi          15
15    Veena   43   Delhi          14
16  Shaunak   42  Mumbai          17
17    Manik   42  Sydney          14
18    Vikas   42   Delhi          11
19    Samir   42  Mumbai          15
20  Shobhit   40  Sydney          12

This DataFrame has ten rows and four columns.

DataFrame.groupby() method

DataFrame’s groupby() method accepts column names as arguments. Based on the column values, it creates several groups and returns a DataFrameGroupBy object that contains information about these groups.

For example, let’s create groups based on the column “City”,

# Create Groups based on values in column 'city'

groupObj = df.groupby('City')

print(groupObj)

# Create Groups based on values in column 'city'
groupObj = df.groupby('City')

print(groupObj)

Output

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002895CA14048>

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002895CA14048>

The groupby() function created three groups because column ‘City’ has three unique values. It returned a DataFrameGroupBy object with information regarding all three groups.

Iterate over all the DataFrame Groups

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy is an iterable object. It means using a for loop, we can iterate over all the created Groups,

# Iterate over all the groups

for grpName, rows in df.groupby('City'):

print("Group Name: ", grpName)

print('Group Content: ')

print(rows)

# Iterate over all the groups
for grpName, rows in df.groupby('City'):
    print("Group Name: ", grpName)
    print('Group Content: ')
    print(rows)

Output:

Group Name: Delhi

Group Content:

Name Age City Experience

12 Riti 41 Delhi 17

14 Mohit 45 Delhi 15

15 Veena 43 Delhi 14

18 Vikas 42 Delhi 11

Group Name: Mumbai

Group Content:

Name Age City Experience

13 Aadi 46 Mumbai 11

16 Shaunak 42 Mumbai 17

19 Samir 42 Mumbai 15

Group Name: Sydney

Group Content:

Name Age City Experience

11 Jack 44 Sydney 19

17 Manik 42 Sydney 14

20 Shobhit 40 Sydney 12

Group Name:  Delhi
Group Content: 
     Name  Age   City  Experience
ID
12   Riti   41  Delhi          17
14  Mohit   45  Delhi          15
15  Veena   43  Delhi          14
18  Vikas   42  Delhi          11
Group Name:  Mumbai
Group Content: 
       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15
Group Name:  Sydney
Group Content: 
       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
17    Manik   42  Sydney          14
20  Shobhit   40  Sydney          12

We iterated over the DataFrameGroupBy object using a for loop. For each of the groups, it returns a tuple. This tuple contains the group Name and a DataFrame containing the rows in that Group. In our example, the group names were the unique values of “City” Column i.e. “Delhi”, “Mumbai” and “Sydney”. So, basically, a group was created for each city, and the group name is the corresponding city name.

Get first row of each Group

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function first(), and it returns a DataFrame containing the first row of each of the Group. For example,

# Get first row of each group

firstRowDf = df.groupby('City').first()

print(firstRowDf)

# Get first row of each group
firstRowDf = df.groupby('City').first()

print(firstRowDf)

Output:

Name Age Experience

Delhi Riti 41 17

Mumbai Aadi 46 11

Sydney Jack 44 19

        Name  Age  Experience
City
Delhi   Riti   41          17
Mumbai  Aadi   46          11
Sydney  Jack   44          19

There were three unique values in the column “City”, therefore 3 groups were created. The first() function fetched the first row of each of the Group and returned a DataFrame populated with that. The returned DataFrame has a row for each of the city and it is the first row from each of the city groups.

Get the count of number of DataFrame Groups

The DataFrameGroupBy object also provides a function size(), and it returns the count of rows in each of the groups created by the groupby() function. For example,

# Get the size of DataFrame groups

print(df.groupby('City').size())

# Get the size of DataFrame groups
print(df.groupby('City').size())

Output:

Delhi 4

Mumbai 3

Sydney 3

dtype: int64

Delhi     4
Mumbai    3
Sydney    3
dtype: int64

As there were three unique values in the column “City”, therefore 3 groups were created by groupby() function. The size() function returned a Series containing the count of number of rows for each of the group.

Get a specific DataFrame Group by the group name

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function get_group(). It accepts a group name as an argument and returns a DataFrame containing only rows for that group. For example, for our DataFrame, the groupby(“City”) function created three objects and returned a DataFrameGroupBy object. The group names were the unique values in column “City” i.e. “Delhi”, “Mumbai” and “Sydney”. So, let’s select the rows of Group named “Mumbai”,

# Get a specific group by the group name

specificGroup = df.groupby('City').get_group('Mumbai')

print(specificGroup)

# Get a specific group by the group name
specificGroup = df.groupby('City').get_group('Mumbai')

print(specificGroup)

Output:

Name Age City Experience

13 Aadi 46 Mumbai 11

16 Shaunak 42 Mumbai 17

19 Samir 42 Mumbai 15

       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15

It returned a DataFrame containing all the rows for “Mumbai” group.

Statistical operations on the DataFrame GroupBy object

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function mean(). It returns the mean values of all numeric columns for each Group. For example,

# Get the mean of all numeric columns in each of the group

meanValues = df.groupby('City').mean()

print(meanValues)

# Get the mean of all numeric columns in each of the group
meanValues = df.groupby('City').mean()

print(meanValues)

Output:

Age Experience

Delhi 42.750000 14.250000

Mumbai 43.333333 14.333333

Sydney 42.000000 15.000000

              Age  Experience
City
Delhi   42.750000   14.250000
Mumbai  43.333333   14.333333
Sydney  42.000000   15.000000

Our DataFrame has two numeric columns ‘Age’ and ‘Experience’. The mean() function calculated the mean values for these columns in each of the groups and returned a DataFrame. Basically, this DataFrame contains the mean of employees’ age and Experience of employees in each of the three cities.

If we are interested in mean values of a single column only, then we can first select the column and later call the mean() function. For example,

# Get the mean Age of employees

# in each of the three cities

meanAge = df.groupby('City')['Age'].mean()

print(meanAge)

# Get the mean Age of employees 
# in each of the three cities
meanAge = df.groupby('City')['Age'].mean()

print(meanAge)

Output:

Delhi 42.750000

Mumbai 43.333333

Sydney 42.000000

Name: Age, dtype: float64

City
Delhi     42.750000
Mumbai    43.333333
Sydney    42.000000
Name: Age, dtype: float64

It returned the mean Age of employees in each of the three cities. Apart from mean() function, the DataFrameGroupBy object also provides many other functions for aggregations.

DataFrame GroupBy and agg() method

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function agg(), and it returns aggregated values based on the list, dict or callable objects. For example,

Let’s get the mean for all values in column “Age” and the sum of all values in column “Experience” for each of the Group created by the groupby(‘City’) method,

# Get the mean Age column values in each group and

# Sum of Experience column values in each group

values = df.groupby('City').agg({'Age': 'mean',

'Experience': 'sum'})

print(values)

# Get the mean Age column values in each group and 
# Sum of Experience column values in each group
values = df.groupby('City').agg({'Age': 'mean',
                                 'Experience': 'sum'})

print(values)

Output:

Age Experience

Delhi 42.750000 57

Mumbai 43.333333 43

Sydney 42.000000 45

              Age  Experience
City
Delhi   42.750000          57
Mumbai  43.333333          43
Sydney  42.000000          45

It returned a DataFrame, which contains the mean Age of all employees in each of the Group and some of the Experience of employees for each of the Group i.e. in each of the unique cities.

We can also provide a list of aggregation functions if we want to perform them on each of the numeric columns for each Group. For example,

# Get the size, sum and mean of values

# of each numeric column in each group

values = df.groupby('City').agg(['size', 'sum', 'mean'])

print(values)

# Get the size, sum and mean of values
# of each numeric column in each group
values = df.groupby('City').agg(['size', 'sum', 'mean'])

print(values)

Output:

Age Experience

size sum mean size sum mean

Delhi 4 171 42.750000 4 57 14.250000

Mumbai 3 130 43.333333 3 43 14.333333

Sydney 3 126 42.000000 3 45 15.000000

        Age                 Experience
       size  sum       mean       size sum       mean
City
Delhi     4  171  42.750000          4  57  14.250000
Mumbai    3  130  43.333333          3  43  14.333333
Sydney    3  126  42.000000          3  45  15.000000

It returned a DataFrame, which contains the size, sum and mean of “Age” and “Experience” columns for each of the Group.

We can also pass our custom functions in the list instead of standard aggregation functions like mean() or sum() etc. For example, we want to create a category for each of the DataFrame groups. We grouped by our DataFrame based on column “City,” and now we want to take the mean Experience of each employee in these groups. If the mean of employee’s Experience in Group is greater than or equal to 15, then that Group can be considered as ‘Senior’ group; otherwise, the group’s category will be ‘Junior’. Let’s see a practical example of this,

def group_category(x):

m = x.mean()

return 'Seniors' if m >= 15 else 'Juniors'

# Get the mean of Experience column values in each group and

# employee category for each group based on the mean experience value

values = df.groupby('City')['Experience'].agg(['mean', group_category])

print(values)

def group_category(x):
    m = x.mean()
    return 'Seniors' if m >= 15 else 'Juniors'


# Get the mean of Experience column values in each group and 
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(['mean', group_category])

print(values)

Output:

mean group_category

Delhi 14.250000 Juniors

Mumbai 14.333333 Juniors

Sydney 15.000000 Seniors

             mean group_category
City
Delhi   14.250000        Juniors
Mumbai  14.333333        Juniors
Sydney  15.000000        Seniors

The mean Experience of employees from Delhi & Mumbai is less than 15; therefore, the category for these groups is ‘Junior’. Whereas the mean Experience of employees from ‘Sydney’ is greater than or equal to 15, the category for this Group is ‘Seniors’. We can also use the lambda function instead of a normal function. For example,

# Get the mean of Experience column values in each group and

# employee category for each group based on the mean experience value

values = df.groupby('City')['Experience'].agg(

['mean',

lambda x: 'Seniors'

if x.mean() >= 15

else 'Juniors'])

print(values)

# Get the mean of Experience column values in each group and 
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(
                                    ['mean',
                                    lambda x:   'Seniors'
                                                 if x.mean() >= 15
                                                 else 'Juniors'])

print(values)

Output:

mean <lambda_0>

Delhi 14.250000 Juniors

Mumbai 14.333333 Juniors

Sydney 15.000000 Seniors

             mean <lambda_0>
City
Delhi   14.250000    Juniors
Mumbai  14.333333    Juniors
Sydney  15.000000    Seniors

The output for this is the same as the previous example. The only difference is that we used a lambda function here to get the category for each of the Group.

Summary:

We learned about the usage of groupby() method in the DataFrame.

Pandas Tutorials -Learn Data Analysis with Python

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Pandas Tutorial Part #16 – DataFrame GroupBy

Pandas Tutorial Part #16 – DataFrame GroupBy

DataFrame.groupby() method

Iterate over all the DataFrame Groups

Get first row of each Group

Get the count of number of DataFrame Groups

Get a specific DataFrame Group by the group name

Statistical operations on the DataFrame GroupBy object

DataFrame GroupBy and agg() method

Pandas Tutorials -Learn Data Analysis with Python

Are you looking to make a career in Data Science with Python?

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Join a LinkedIn Community of Python Developers

Recommend

梦日记多人版 —— 你的梦中，我亦存在

Pandas Tutorial Part #9 – Filter DataFrame Rows

Pandas Tutorial Part #11 – DataFrame attributes & methods

Pandas – Check if all values in a Column are Equal

How to Navigate a Major Career Crossroads with Ease

The First Step to An Inclusive Work Culture: Getting Your Colleagues’ Names Righ...

#MeToo and the City

Brynne Schroeder, PhD

Weeding Is Fundamental: On Libraries and Throwing Away Books

Here’s Why You’re Half-Assing Your Job

About Joyk