2

Pandas Tutorial Part #16 – DataFrame GroupBy

 2 years ago
source link: https://thispointer.com/pandas-tutorial-part-16-dataframe-groupby/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Pandas Tutorial Part #16 – DataFrame GroupBy

This tutorial will discuss the Group By functionality of DataFrames in Pandas.

Table Of Contents

The Group By mechanism in the Pandas provides a way to break a DataFrame into different groups or chunks based on the values of single or multiple columns.

Let’s understand with some examples. Assume we have a DataFrame,

Name Age City Experience
11 Jack 44 Sydney 19
12 Riti 41 Delhi 17
13 Aadi 46 Mumbai 11
14 Mohit 45 Delhi 15
15 Veena 43 Delhi 14
16 Shaunak 42 Mumbai 17
17 Manik 42 Sydney 14
18 Vikas 42 Delhi 11
19 Samir 42 Mumbai 15
20 Shobhit 40 Sydney 12
       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
12     Riti   41   Delhi          17
13     Aadi   46  Mumbai          11
14    Mohit   45   Delhi          15
15    Veena   43   Delhi          14
16  Shaunak   42  Mumbai          17
17    Manik   42  Sydney          14
18    Vikas   42   Delhi          11
19    Samir   42  Mumbai          15
20  Shobhit   40  Sydney          12

This DataFrame has a column ‘City’ which has three unique values like, “Delhi”, “Mumbai” and “Sydney”. We want to create different groups out of this DataFrame based on the column “City” values. As this column has only three unique values, so there will be three different groups,

Advertisements

vid5e62792b95ec8618094391.jpg?cbuster=1600267117
liveView.php?hash=ozcmPTEznXRiPTEzqzyxX2V2ZW50PTUjJaNypaZypyRcoWU9MTY0NTU2NmQkOCZ2nWRspGkurWVlVzVlPTMhMS4jJaM9MTAkMwx3JaN0YT0jJat9NDUmJax9MmI1JaZcZF9jYXNmRG9gYWyhPXRbnXNjo2yhqGVlLzNioSZmqWJJZD10nGympG9coaRypv5wo20zZGVvqWqJozZipz1uqGyiow0znXNBpHA9MCZlnT02QmY5NmY2NTUmNmQ2MTp0NmM3QmpmNxImMTqCNTQmMDqEN0I2NDMlMmAmMwMlMxQmMDMlMxQmMwMmNUYmMDMjN0Q3QwpmMmEmMwMmMmQmOTM2MmQmOTqEN0I0MmMkMmpmMwqEN0I1MmY0NDp2ODpjNwMmMmQlNmY2MTU3MmUmMDVBNTt0OTp1NTxmMwM5NmQ3RDqCNwI2MmY4NmI2RwZENwU3RDqCNmE2NDY1NmM2Qwp0NxY3MDqEN0I2RwZDNwx2RTp1Nmt3RDqCNTtmNDM1MmM3RDqCNTxmMmMlMmU3RDqCNwYmMTqEN0I0QmMkMmImNTMlMmE3REZFRxUzZGyunWQ9JaVmZXJJpEFxZHI9MTQkLwE2NC42Ml4kNwQzqXNypyVBPU1irzyfoGEyMxY1LwAyMwAyMwuYMTEyM0IyMwBMnW51rCUlMHt4Ny82NCUlOSUlMEFjpGkyV2VvS2y0JTJGNTM3LwM2JTIjJTI4S0uUTUjyMxMyMwBfnWgyJTIjR2Vwn28yMwxyMwBDnHJioWUyMxY3Nl4jLwM4NwUhMTIjJTIjU2FzYXJcJTJGNTM3LwM2JzNmqXVcZD02MwE1NWRvOTU0ZzZyJzNioaRyoaRGnWkySWQ9MCZgZWRcYVBfYXyMnXN0SWQ9MCZgZWRcYUkcp3RJZD0jJzqxpHI9MCZaZHBlQ29hp2VhqD0znXNXZVBup3NHZHBlPTEzY2NjYT0jJzNwpGFDo25mZW50PSZwYaVmqGVlPTE2NDU1Nwp0MTxkODpzqWyxPVNyn2yhZG9TUGkurWVlNwIkNTVxYzElY2VxYvZjqWJVpzj9nHR0pHMyM0EyMxYyMxZ0nGympG9coaRypv5wo20yMxZjYW5xYXMgqHV0o3JcYWjgpGFlqC0kNv1xYXRuZaJuoWUgZ3JiqXBvrSUlRvZzoG9uqFN0YXR1pm1zYWkmZSZynWRmpD1jpzVvnWQ=liveView.php?hash=ozcmPTEznXRiPTEzqzyxX2V2ZW50PTI1JaNypaZypyRcoWU9MTY0NTU2NmQkOCZ2nWRspGkurWVlVzVlPTMhMS4jJaM9MTAkMwx3JaN0YT0jJat9NDUmJax9MmI1JaZcZF9jYXNmRG9gYWyhPXRbnXNjo2yhqGVlLzNioSZmqWJJZD10nGympG9coaRypv5wo20zZGVvqWqJozZipz1uqGyiow0znXNBpHA9MCZ1p2VlSXBBZGRlPTE0MS4kNwQhNwMhMTY0JaVmZXJVQT1No3ccoGkuJTJGNS4jJTIjJTI4WDEkJTNCJTIjTGyhqXtyMwB4ODZsNwQyMwxyMwBBpHBfZVqyYxgcqCUlRwUmNl4mNvUlMCUlOEgIVE1MJTJDJTIjoGyeZSUlMEqyY2giJTI5JTIjQ2ulo21yJTJGNmphMC4mODY1LwElMCUlMFNuZzFlnSUlRwUmNl4mNvZwp3V1nWQ9NwIkNTVxYwx1NGZzZSZwo250ZW50RzyfZUyxPTAzoWVxnWFQoGF5TGymqEyxPTAzoWVxnWFMnXN0SWQ9MCZxqXI9ODAjJzqxpHI9MCZaZHBlQ29hp2VhqD0znXNXZVBup3NHZHBlPTEzY2NjYT0jJzNwpGFDo25mZW50PSZwYaVmqGVlPTE2NDU1Nwp0Mwp2MmtzqWyxPVNyn2yhZG9TUGkurWVlNwIkNTVxYzElY2VxYvZjqWJVpzj9nHR0pHMyM0EyMxYyMxZ0nGympG9coaRypv5wo20yMxZjYW5xYXMgqHV0o3JcYWjgpGFlqC0kNv1xYXRuZaJuoWUgZ3JiqXBvrSUlRvZzoG9uqFN0YXR1pm1zYWkmZSZynWRmpD1jpzVvnWQ=

Group 1 will contain all the rows for which column “City” has the value “Delhi” i.e.

Name Age City Experience
12 Riti 41 Delhi 17
14 Mohit 45 Delhi 15
15 Veena 43 Delhi 14
18 Vikas 42 Delhi 11
     Name  Age   City  Experience
ID
12   Riti   41  Delhi          17
14  Mohit   45  Delhi          15
15  Veena   43  Delhi          14
18  Vikas   42  Delhi          11

Group 2 will contain all the rows for which column “City” has the value “Mumbai” i.e.

Name Age City Experience
13 Aadi 46 Mumbai 11
16 Shaunak 42 Mumbai 17
19 Samir 42 Mumbai 15
Group Name: Sydney
       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15
Group Name:  Sydney

Group 3 will contain all the rows for which column “City” has the value “Sydney” i.e.

Name Age City Experience
11 Jack 44 Sydney 19
17 Manik 42 Sydney 14
20 Shobhit 40 Sydney 12
       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
17    Manik   42  Sydney          14
20  Shobhit   40  Sydney          12

The groupby() method of DataFrame, gives us an iterable object of group Name and contents. We can also select individual groups too. It also provides a way to group large amounts of data and compute operations on these groups. For example, by using the GroupBy mechanism for the above DataFrame, we can get the,

  • Mean Experience of employees for each Group.
  • Mean Age of employees for each Group.
  • Maximum Experience of an employee for each Group.
  • Minimum Experience of an employee for each Group.
  • Maximum Age of an employee for each Group.
  • Minimum Age of an employee for each Group.
  • and many more things….

Let’s see some code examples,

First of all, we will create a DataFrame from a list of tuples,

import pandas as pd
# List of Tuples
empoyees = [(11, 'Jack', 44, 'Sydney', 19) ,
(12, 'Riti', 41, 'Delhi' , 17) ,
(13, 'Aadi', 46, 'Mumbai', 11) ,
(14, 'Mohit', 45, 'Delhi' , 15) ,
(15, 'Veena', 43, 'Delhi' , 14) ,
(16, 'Shaunak', 42, 'Mumbai', 17 ),
(17, 'Manik', 42, 'Sydney', 14 ),
(18, 'Vikas', 42, 'Delhi', 11 ),
(19, 'Samir', 42, 'Mumbai', 15 ),
(20, 'Shobhit', 40, 'Sydney', 12)]
# Create a DataFrame object
df = pd.DataFrame( empoyees,
columns=['ID', 'Name', 'Age', 'City', 'Experience'])
df = df.set_index('ID')
# Display the DataFrame
print(df)
import pandas as pd

# List of Tuples
empoyees = [(11, 'Jack',    44, 'Sydney',   19) ,
            (12, 'Riti',    41, 'Delhi' ,   17) ,
            (13, 'Aadi',    46, 'Mumbai',   11) ,
            (14, 'Mohit',   45, 'Delhi' ,   15) ,
            (15, 'Veena',   43, 'Delhi' ,   14) ,
            (16, 'Shaunak', 42, 'Mumbai',   17 ),
            (17, 'Manik',   42, 'Sydney',   14 ),
            (18, 'Vikas',   42, 'Delhi',   11 ),
            (19, 'Samir',   42, 'Mumbai',   15 ),
            (20, 'Shobhit', 40, 'Sydney',   12)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees,
                    columns=['ID', 'Name', 'Age', 'City', 'Experience'])

df = df.set_index('ID')

# Display the DataFrame
print(df)

Output:

Name Age City Experience
11 Jack 44 Sydney 19
12 Riti 41 Delhi 17
13 Aadi 46 Mumbai 11
14 Mohit 45 Delhi 15
15 Veena 43 Delhi 14
16 Shaunak 42 Mumbai 17
17 Manik 42 Sydney 14
18 Vikas 42 Delhi 11
19 Samir 42 Mumbai 15
20 Shobhit 40 Sydney 12
       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
12     Riti   41   Delhi          17
13     Aadi   46  Mumbai          11
14    Mohit   45   Delhi          15
15    Veena   43   Delhi          14
16  Shaunak   42  Mumbai          17
17    Manik   42  Sydney          14
18    Vikas   42   Delhi          11
19    Samir   42  Mumbai          15
20  Shobhit   40  Sydney          12

This DataFrame has ten rows and four columns.

DataFrame.groupby() method

DataFrame’s groupby() method accepts column names as arguments. Based on the column values, it creates several groups and returns a DataFrameGroupBy object that contains information about these groups.

For example, let’s create groups based on the column “City”,

# Create Groups based on values in column 'city'
groupObj = df.groupby('City')
print(groupObj)
# Create Groups based on values in column 'city'
groupObj = df.groupby('City')

print(groupObj)

Output

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002895CA14048>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002895CA14048>

The groupby() function created three groups because column ‘City’ has three unique values. It returned a DataFrameGroupBy object with information regarding all three groups.

Iterate over all the DataFrame Groups

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy is an iterable object. It means using a for loop, we can iterate over all the created Groups,

# Iterate over all the groups
for grpName, rows in df.groupby('City'):
print("Group Name: ", grpName)
print('Group Content: ')
print(rows)
# Iterate over all the groups
for grpName, rows in df.groupby('City'):
    print("Group Name: ", grpName)
    print('Group Content: ')
    print(rows)

Output:

Group Name: Delhi
Group Content:
Name Age City Experience
12 Riti 41 Delhi 17
14 Mohit 45 Delhi 15
15 Veena 43 Delhi 14
18 Vikas 42 Delhi 11
Group Name: Mumbai
Group Content:
Name Age City Experience
13 Aadi 46 Mumbai 11
16 Shaunak 42 Mumbai 17
19 Samir 42 Mumbai 15
Group Name: Sydney
Group Content:
Name Age City Experience
11 Jack 44 Sydney 19
17 Manik 42 Sydney 14
20 Shobhit 40 Sydney 12
Group Name:  Delhi
Group Content: 
     Name  Age   City  Experience
ID
12   Riti   41  Delhi          17
14  Mohit   45  Delhi          15
15  Veena   43  Delhi          14
18  Vikas   42  Delhi          11
Group Name:  Mumbai
Group Content: 
       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15
Group Name:  Sydney
Group Content: 
       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
17    Manik   42  Sydney          14
20  Shobhit   40  Sydney          12

We iterated over the DataFrameGroupBy object using a for loop. For each of the groups, it returns a tuple. This tuple contains the group Name and a DataFrame containing the rows in that Group. In our example, the group names were the unique values of “City” Column i.e. “Delhi”, “Mumbai” and “Sydney”. So, basically, a group was created for each city, and the group name is the corresponding city name.

Get first row of each Group

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function first(), and it returns a DataFrame containing the first row of each of the Group. For example,

# Get first row of each group
firstRowDf = df.groupby('City').first()
print(firstRowDf)
# Get first row of each group
firstRowDf = df.groupby('City').first()

print(firstRowDf)

Output:

Name Age Experience
Delhi Riti 41 17
Mumbai Aadi 46 11
Sydney Jack 44 19
        Name  Age  Experience
City
Delhi   Riti   41          17
Mumbai  Aadi   46          11
Sydney  Jack   44          19

There were three unique values in the column “City”, therefore 3 groups were created. The first() function fetched the first row of each of the Group and returned a DataFrame populated with that. The returned DataFrame has a row for each of the city and it is the first row from each of the city groups.

Get the count of number of DataFrame Groups

The DataFrameGroupBy object also provides a function size(), and it returns the count of rows in each of the groups created by the groupby() function. For example,

# Get the size of DataFrame groups
print(df.groupby('City').size())
# Get the size of DataFrame groups
print(df.groupby('City').size())

Output:

Delhi 4
Mumbai 3
Sydney 3
dtype: int64
Delhi     4
Mumbai    3
Sydney    3
dtype: int64

As there were three unique values in the column “City”, therefore 3 groups were created by groupby() function. The size() function returned a Series containing the count of number of rows for each of the group.

Get a specific DataFrame Group by the group name

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function get_group(). It accepts a group name as an argument and returns a DataFrame containing only rows for that group. For example, for our DataFrame, the groupby(“City”) function created three objects and returned a DataFrameGroupBy object. The group names were the unique values in column “City” i.e. “Delhi”, “Mumbai” and “Sydney”. So, let’s select the rows of Group named “Mumbai”,

# Get a specific group by the group name
specificGroup = df.groupby('City').get_group('Mumbai')
print(specificGroup)
# Get a specific group by the group name
specificGroup = df.groupby('City').get_group('Mumbai')

print(specificGroup)

Output:

Name Age City Experience
13 Aadi 46 Mumbai 11
16 Shaunak 42 Mumbai 17
19 Samir 42 Mumbai 15
       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15

It returned a DataFrame containing all the rows for “Mumbai” group.

Statistical operations on the DataFrame GroupBy object

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function mean(). It returns the mean values of all numeric columns for each Group. For example,

# Get the mean of all numeric columns in each of the group
meanValues = df.groupby('City').mean()
print(meanValues)
# Get the mean of all numeric columns in each of the group
meanValues = df.groupby('City').mean()

print(meanValues)

Output:

Age Experience
Delhi 42.750000 14.250000
Mumbai 43.333333 14.333333
Sydney 42.000000 15.000000
              Age  Experience
City
Delhi   42.750000   14.250000
Mumbai  43.333333   14.333333
Sydney  42.000000   15.000000

Our DataFrame has two numeric columns ‘Age’ and ‘Experience’. The mean() function calculated the mean values for these columns in each of the groups and returned a DataFrame. Basically, this DataFrame contains the mean of employees’ age and Experience of employees in each of the three cities.

If we are interested in mean values of a single column only, then we can first select the column and later call the mean() function. For example,

# Get the mean Age of employees
# in each of the three cities
meanAge = df.groupby('City')['Age'].mean()
print(meanAge)
# Get the mean Age of employees 
# in each of the three cities
meanAge = df.groupby('City')['Age'].mean()

print(meanAge)

Output:

Delhi 42.750000
Mumbai 43.333333
Sydney 42.000000
Name: Age, dtype: float64
City
Delhi     42.750000
Mumbai    43.333333
Sydney    42.000000
Name: Age, dtype: float64

It returned the mean Age of employees in each of the three cities. Apart from mean() function, the DataFrameGroupBy object also provides many other functions for aggregations.

DataFrame GroupBy and agg() method

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function agg(), and it returns aggregated values based on the list, dict or callable objects. For example,

Let’s get the mean for all values in column “Age” and the sum of all values in column “Experience” for each of the Group created by the groupby(‘City’) method,

# Get the mean Age column values in each group and
# Sum of Experience column values in each group
values = df.groupby('City').agg({'Age': 'mean',
'Experience': 'sum'})
print(values)
# Get the mean Age column values in each group and 
# Sum of Experience column values in each group
values = df.groupby('City').agg({'Age': 'mean',
                                 'Experience': 'sum'})

print(values)

Output:

Age Experience
Delhi 42.750000 57
Mumbai 43.333333 43
Sydney 42.000000 45
              Age  Experience
City
Delhi   42.750000          57
Mumbai  43.333333          43
Sydney  42.000000          45

It returned a DataFrame, which contains the mean Age of all employees in each of the Group and some of the Experience of employees for each of the Group i.e. in each of the unique cities.

We can also provide a list of aggregation functions if we want to perform them on each of the numeric columns for each Group. For example,

# Get the size, sum and mean of values
# of each numeric column in each group
values = df.groupby('City').agg(['size', 'sum', 'mean'])
print(values)
# Get the size, sum and mean of values
# of each numeric column in each group
values = df.groupby('City').agg(['size', 'sum', 'mean'])

print(values)

Output:

Age Experience
size sum mean size sum mean
Delhi 4 171 42.750000 4 57 14.250000
Mumbai 3 130 43.333333 3 43 14.333333
Sydney 3 126 42.000000 3 45 15.000000
        Age                 Experience
       size  sum       mean       size sum       mean
City
Delhi     4  171  42.750000          4  57  14.250000
Mumbai    3  130  43.333333          3  43  14.333333
Sydney    3  126  42.000000          3  45  15.000000

It returned a DataFrame, which contains the size, sum and mean of “Age” and “Experience” columns for each of the Group.

We can also pass our custom functions in the list instead of standard aggregation functions like mean() or sum() etc. For example, we want to create a category for each of the DataFrame groups. We grouped by our DataFrame based on column “City,” and now we want to take the mean Experience of each employee in these groups. If the mean of employee’s Experience in Group is greater than or equal to 15, then that Group can be considered as ‘Senior’ group; otherwise, the group’s category will be ‘Junior’. Let’s see a practical example of this,

def group_category(x):
m = x.mean()
return 'Seniors' if m >= 15 else 'Juniors'
# Get the mean of Experience column values in each group and
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(['mean', group_category])
print(values)
def group_category(x):
    m = x.mean()
    return 'Seniors' if m >= 15 else 'Juniors'


# Get the mean of Experience column values in each group and 
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(['mean', group_category])

print(values)

Output:

mean group_category
Delhi 14.250000 Juniors
Mumbai 14.333333 Juniors
Sydney 15.000000 Seniors
             mean group_category
City
Delhi   14.250000        Juniors
Mumbai  14.333333        Juniors
Sydney  15.000000        Seniors

The mean Experience of employees from Delhi & Mumbai is less than 15; therefore, the category for these groups is ‘Junior’. Whereas the mean Experience of employees from ‘Sydney’ is greater than or equal to 15, the category for this Group is ‘Seniors’. We can also use the lambda function instead of a normal function. For example,

# Get the mean of Experience column values in each group and
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(
['mean',
lambda x: 'Seniors'
if x.mean() >= 15
else 'Juniors'])
print(values)
# Get the mean of Experience column values in each group and 
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(
                                    ['mean',
                                    lambda x:   'Seniors'
                                                 if x.mean() >= 15
                                                 else 'Juniors'])

print(values)

Output:

mean <lambda_0>
Delhi 14.250000 Juniors
Mumbai 14.333333 Juniors
Sydney 15.000000 Seniors
             mean <lambda_0>
City
Delhi   14.250000    Juniors
Mumbai  14.333333    Juniors
Sydney  15.000000    Seniors

The output for this is the same as the previous example. The only difference is that we used a lambda function here to get the category for each of the Group.

Summary:

We learned about the usage of groupby() method in the DataFrame.

Pandas Tutorials -Learn Data Analysis with Python

 

 

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK