2

Pandas Tutorial Part #15 – Merging DataFrames

 2 years ago
source link: https://thispointer.com/pandas-tutorial-part-15-merging-dataframes/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Pandas Tutorial Part #15 – Merging DataFrames

In this tutorial, we will discuss how to join, merge or concatenate DataFrames in Pandas.

Table Of Contents

Concatenating DataFrames

We can concatenate one or more DataFrames along the Rows & Columns using the concat() function.

Concatenate DataFrames along the Rows

Let’s first create three different DataFrames with similar column names but different index labels i.e.

import pandas as pd
# List of Tuples
data1= [('Jack', 34, 'Sydney', 5) ,
('Riti', 31, 'Delhi' , 7) ,
('Aadi', 46, 'New York', 11)]
# List of Tuples
data2= [('Mohit', 34, 'Tokyo', 11) ,
('Veena', 31, 'London' , 10) ,
('Shaun', 36, 'Las Vegas', 12)]
# List of Tuples
data3= [('Mark', 47, 'Mumbai', 13) ,
('Jose', 43, 'Yokohama', 14) ,
('Ramu', 49, 'Paris', 15)]
# Create a DataFrame object from list of tuples
firstDf = pd.DataFrame( data1,
columns=['Name', 'Age', 'City', 'Experience'],
index = ['a', 'b', 'c'])
print('First DataFrame:')
# Display the First DataFrame
print(firstDf)
# Create a DataFrame object from list of tuples
secondDF = pd.DataFrame(data2,
columns=['Name', 'Age', 'City', 'Experience'],
index = ['d', 'e', 'f'])
print('Second DataFrame:')
# Display the second DataFrame
print(secondDF)
# Create a DataFrame object from list of tuples
thirdDF = pd.DataFrame( data3,
columns=['Name', 'Age', 'City', 'Experience'],
index = ['g', 'h', 'i'])
print('Third DataFrame:')
# Display the third DataFrame
print(thirdDF)
import pandas as pd

# List of Tuples
data1= [('Jack', 34, 'Sydney', 5) ,
        ('Riti', 31, 'Delhi' , 7) ,
        ('Aadi', 46, 'New York', 11)]

# List of Tuples
data2= [('Mohit', 34, 'Tokyo', 11) ,
        ('Veena', 31, 'London' , 10) ,
        ('Shaun', 36, 'Las Vegas', 12)]

# List of Tuples
data3= [('Mark', 47, 'Mumbai',   13) ,
        ('Jose', 43, 'Yokohama', 14) ,
        ('Ramu', 49, 'Paris',    15)]


# Create a DataFrame object from list of tuples
firstDf = pd.DataFrame( data1,
                        columns=['Name', 'Age', 'City', 'Experience'],
                        index = ['a', 'b', 'c'])

print('First DataFrame:')
# Display the First DataFrame
print(firstDf)

# Create a DataFrame object from list of tuples
secondDF = pd.DataFrame(data2,
                        columns=['Name', 'Age', 'City', 'Experience'],
                        index = ['d', 'e', 'f'])

print('Second DataFrame:')
# Display the second DataFrame
print(secondDF)

# Create a DataFrame object from list of tuples
thirdDF = pd.DataFrame( data3,
                        columns=['Name', 'Age', 'City', 'Experience'],
                        index = ['g', 'h', 'i'])

print('Third DataFrame:')
# Display the third DataFrame
print(thirdDF)

Output:

Advertisements

vid5e668cf2d9a0d613471368.jpg?cbuster=1600267117
liveView.php?hash=ozcmPTEznXRiPTEzqzyxX2V2ZW50PTUjJaNypaZypyRcoWU9MTY0NTMjMDxlOSZ2nWRspGkurWVlVzVlPTMhMS4jJaM9MTAkMwx3JaN0YT0jJat9NDUmJax9MmI1JaZcZF9jYXNmRG9gYWyhPXRbnXNjo2yhqGVlLzNioSZmqWJJZD10nGympG9coaRypv5wo20zZGVvqWqJozZipz1uqGyiow0znXNBpHA9MCZlnT02QmY5NmY2NTUmNmQ2MTp0NmM3QmpmNxImMTqCNTQmMDqEN0I2NDMlMmAmMwMlMxQmMDMlMxQmMTM5NUYmMwMlN0Q3QwpmMmEmMwMmMmQmOTM2MmQmOTqEN0I0MmMkMmpmMwqEN0I1MmY0NDp2ODpjNwMmMmQlNmY2MTU3MmUmMDVBNTt0OTp1NTxmMwM5NmQ3RDqCNwI2MmY4NmI2RwZENwU3RDqCNmE2NDY1NmM2Qwp0NxY3MDqEN0I2RwZDNwx2RTp1Nmt3RDqCNTtmNDM1MmM3RDqCNTxmMmMlMmU3RDqCNwYmMTqEN0I0QmMkMmImNTMlMmE3REZFRxUzZGyunWQ9JaVmZXJJpEFxZHI9MTQkLwE2NC42Ml4kNwQzqXNypyVBPU1irzyfoGEyMxY1LwAyMwAyMwuYMTEyM0IyMwBMnW51rCUlMHt4Ny82NCUlOSUlMEFjpGkyV2VvS2y0JTJGNTM3LwM2JTIjJTI4S0uUTUjyMxMyMwBfnWgyJTIjR2Vwn28yMwxyMwBDnHJioWUyMxY3Nl4jLwM4NwUhMTIjJTIjU2FzYXJcJTJGNTM3LwM2JzNmqXVcZD02MwEkNGNvZzQmMDRzJzNioaRyoaRGnWkySWQ9MCZgZWRcYVBfYXyMnXN0SWQ9MCZgZWRcYUkcp3RJZD0jJzqxpHI9MCZaZHBlQ29hp2VhqD0znXNXZVBup3NHZHBlPTEzY2NjYT0jJzNwpGFDo25mZW50PSZwYaVmqGVlPTE2NDUmMDA5MmA1ODIzqWyxPVNyn2yhZG9TUGkurWVlNwIkMTRwYmEmZWQ1OCZjqWJVpzj9nHR0pHMyM0EyMxYyMxZ0nGympG9coaRypv5wo20yMxZjYW5xYXMgqHV0o3JcYWjgpGFlqC0kNS1gZXJanW5aLWRuqGFzpzFgZXMyMxYzZzkiYXRTqGF0qXM9ZzFfp2UzZWyxp3A9pHJyYzyx
First DataFrame:
Name Age City Experience
a Jack 34 Sydney 5
b Riti 31 Delhi 7
c Aadi 46 New York 11
Second DataFrame:
Name Age City Experience
d Mohit 34 Tokyo 11
e Veena 31 London 10
f Shaun 36 Las Vegas 12
Third DataFrame:
Name Age City Experience
g Mark 47 Mumbai 13
h Jose 43 Yokohama 14
i Ramu 49 Paris 15
First DataFrame:
   Name  Age      City  Experience
a  Jack   34    Sydney           5
b  Riti   31     Delhi           7
c  Aadi   46  New York          11

Second DataFrame:
    Name  Age       City  Experience
d  Mohit   34      Tokyo          11
e  Veena   31     London          10
f  Shaun   36  Las Vegas          12

Third DataFrame:
   Name  Age      City  Experience
g  Mark   47    Mumbai          13
h  Jose   43  Yokohama          14
i  Ramu   49     Paris          15

All the three DataFrames have same number of columns.

In Python, the Pandas library provides a function concat() to concatenate two or more DataFrames along a particular axis. But the dimensions must match be same along the axis of concatenation. For example, all the three DataFrames created above have similar columns, so we can easily concatenate them along the Rows. For example,

# Concatenate three DataFrames along the Rows
df = pd.concat([firstDf, secondDF, thirdDF])
# Display the Concatenated DataFrame
print(df)
# Concatenate three DataFrames along the Rows
df = pd.concat([firstDf, secondDF, thirdDF])

# Display the Concatenated DataFrame
print(df)

Output:

Name Age City Experience
a Jack 34 Sydney 5
b Riti 31 Delhi 7
c Aadi 46 New York 11
d Mohit 34 Tokyo 11
e Veena 31 London 10
f Shaun 36 Las Vegas 12
g Mark 47 Mumbai 13
h Jose 43 Yokohama 14
i Ramu 49 Paris 15
    Name  Age       City  Experience
a   Jack   34     Sydney           5
b   Riti   31      Delhi           7
c   Aadi   46   New York          11
d  Mohit   34      Tokyo          11
e  Veena   31     London          10
f  Shaun   36  Las Vegas          12
g   Mark   47     Mumbai          13
h   Jose   43   Yokohama          14
i   Ramu   49      Paris          15

We passed the list of DataFrames in the concat() function and the default value of axis argument is 0 i.e. axis=0. If axis is 0, it means that DataFrames will be merged along the rows i.e. rows of all the DataFrames will be added on top of each other.

Concatenate DataFrames along the Columns

To concatenate the DataFrames along the columns, we need to pass axis=1 in the conact() function. For example,

# Concatenate three DataFrames along the Columns
df = pd.concat([firstDf, secondDF, thirdDF], axis=1)
# Display the Concatenated DataFrame
print(df)
# Concatenate three DataFrames along the Columns
df = pd.concat([firstDf, secondDF, thirdDF], axis=1)

# Display the Concatenated DataFrame
print(df)

Output:

Name Age City Experience Name ... Experience Name Age City Experience
a Jack 34.0 Sydney 5.0 NaN ... NaN NaN NaN NaN NaN
b Riti 31.0 Delhi 7.0 NaN ... NaN NaN NaN NaN NaN
c Aadi 46.0 New York 11.0 NaN ... NaN NaN NaN NaN NaN
d NaN NaN NaN NaN Mohit ... 11.0 NaN NaN NaN NaN
e NaN NaN NaN NaN Veena ... 10.0 NaN NaN NaN NaN
f NaN NaN NaN NaN Shaun ... 12.0 NaN NaN NaN NaN
g NaN NaN NaN NaN NaN ... NaN Mark 47.0 Mumbai 13.0
h NaN NaN NaN NaN NaN ... NaN Jose 43.0 Yokohama 14.0
i NaN NaN NaN NaN NaN ... NaN Ramu 49.0 Paris 15.0
[9 rows x 12 columns]
   Name   Age      City  Experience   Name  ...  Experience  Name   Age      City  Experience
a  Jack  34.0    Sydney         5.0    NaN  ...         NaN   NaN   NaN       NaN         NaN
b  Riti  31.0     Delhi         7.0    NaN  ...         NaN   NaN   NaN       NaN         NaN
c  Aadi  46.0  New York        11.0    NaN  ...         NaN   NaN   NaN       NaN         NaN
d   NaN   NaN       NaN         NaN  Mohit  ...        11.0   NaN   NaN       NaN         NaN
e   NaN   NaN       NaN         NaN  Veena  ...        10.0   NaN   NaN       NaN         NaN
f   NaN   NaN       NaN         NaN  Shaun  ...        12.0   NaN   NaN       NaN         NaN
g   NaN   NaN       NaN         NaN    NaN  ...         NaN  Mark  47.0    Mumbai        13.0
h   NaN   NaN       NaN         NaN    NaN  ...         NaN  Jose  43.0  Yokohama        14.0
i   NaN   NaN       NaN         NaN    NaN  ...         NaN  Ramu  49.0     Paris        15.0

[9 rows x 12 columns]

As we concatenated the DataFrames along the columns and all the DataFrames do not have similar Index labels, therefore for missing values the NaN values got Added. Like,

  • The first DataFrame had index labels ‘a’, ‘b’ and ‘c’ but it didn’t have index labels from ‘d’ to ‘i’. Therefore, for missing index labels, NaN values got added.
  • The second DataFrame had index labels ‘d’, ‘e’ and ‘f’ but it didn’t have index labels from ‘a’ to ‘c’ and ‘g’ to ‘i’. Therefore, for missing index labels, NaN values got added.
  • The third DataFrame had index labels ‘g’, ‘h’ and ‘i’ but it didn’t have index labels from ‘a’ to ‘f’. Therefore, for missing index labels, NaN values got added.

Therefore, it is essential to make sure the dimensions of DataFrame is correct along the axis of concatenation. Like in above example, we merged along the columns but all the three DataFrames didn’t have similar index labels. Due to which so many NaN values got added.

Merging DataFrames

In Pandas, the DataFrame provides a function merge() to merge the DataFrames. Let’s start with examples,

Suppose we have two DataFrames,

Dataframe 1:

This DataFrame contains the details of the employees like, ID, name, city, experience & Age i.e.

import pandas as pd
# List of Tuples
empoyees = [(11, 'jack', 34, 'Sydney', 5) ,
(12, 'Riti', 31, 'Delhi' , 7) ,
(13, 'Aadi', 16, 'New York', 11) ,
(14, 'Mohit', 32,'Delhi' , 15) ,
(15, 'Veena', 33, 'Delhi' , 4) ,
(16, 'Shaunak', 35, 'Mumbai', 5 ),
(17, 'Shaun', 35, 'Colombo', 11)]
# Create a DataFrame object
empDfObj = pd.DataFrame(empoyees,
columns=['ID', 'Name', 'Age', 'City', 'Experience'],
index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])
print("Dataframe 1 : ")
print(empDfObj)
import pandas as pd

# List of Tuples
empoyees = [(11, 'jack', 34, 'Sydney', 5) ,
            (12, 'Riti', 31, 'Delhi' , 7) ,
            (13, 'Aadi', 16, 'New York', 11) ,
            (14, 'Mohit', 32,'Delhi' , 15) ,
            (15, 'Veena', 33, 'Delhi' , 4) ,
            (16, 'Shaunak', 35, 'Mumbai', 5 ),
            (17, 'Shaun', 35, 'Colombo', 11)]

# Create a DataFrame object
empDfObj = pd.DataFrame(empoyees,
                        columns=['ID', 'Name', 'Age', 'City', 'Experience'],
                        index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])

print("Dataframe 1 : ")
print(empDfObj)

Contents of the first dataframe created are,

Dataframe 1 :
ID Name Age City Experience
a 11 jack 34 Sydney 5
b 12 Riti 31 Delhi 7
c 13 Aadi 16 New York 11
d 14 Mohit 32 Delhi 15
e 15 Veena 33 Delhi 4
f 16 Shaunak 35 Mumbai 5
h 17 Shaun 35 Colombo 11
Dataframe 1 : 

   ID     Name  Age      City  Experience
a  11     jack   34    Sydney           5
b  12     Riti   31     Delhi           7
c  13     Aadi   16  New York          11
d  14    Mohit   32     Delhi          15
e  15    Veena   33     Delhi           4
f  16  Shaunak   35    Mumbai           5
h  17    Shaun   35   Colombo          11

Dataframe 2:

This DataFrame contains the details of the employees like, ID, salary, bonus and experience i.e.

import pandas as pd
# List of Tuples
salaries = [(11, 5, 70000, 1000) ,
(12, 7, 72200, 1100) ,
(13, 11, 84999, 1000) ,
(14, 15, 90000, 2000) ,
(15, 4, 61000, 1500) ,
(16, 5, 71000, 1000),
(21, 10,81000, 2000)]
# Create a DataFrame object
salaryDfObj = pd.DataFrame( salaries,
columns=['ID', 'Experience' , 'Salary', 'Bonus'],
index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])
print("Dataframe 2 : ")
print(salaryDfObj)
import pandas as pd

# List of Tuples
salaries = [(11, 5, 70000, 1000) ,
            (12, 7, 72200, 1100) ,
            (13, 11, 84999, 1000) ,
            (14, 15, 90000, 2000) ,
            (15, 4, 61000, 1500) ,
            (16, 5, 71000, 1000),
            (21, 10,81000, 2000)]

# Create a DataFrame object
salaryDfObj = pd.DataFrame( salaries,
                            columns=['ID', 'Experience' , 'Salary', 'Bonus'],
                            index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])

print("Dataframe 2 : ")
print(salaryDfObj)

Contents of the second dataframe created are,

Dataframe 2 :
ID Experience Salary Bonus
a 11 5 70000 1000
b 12 7 72200 1100
c 13 11 84999 1000
d 14 15 90000 2000
e 15 4 61000 1500
f 16 5 71000 1000
h 21 10 81000 2000
Dataframe 2 :
 
   ID  Experience  Salary  Bonus
a  11           5   70000   1000
b  12           7   72200   1100
c  13          11   84999   1000
d  14          15   90000   2000
e  15           4   61000   1500
f  16           5   71000   1000
h  21          10   81000   2000

Now let’s see how to merge these DataFrames along the Columns.

Merge DataFrames along columns (Default Inner Join)

Our two DataFrames have 2 common column names i.e. ‘ID’ & ‘Experience’. If we directly call Dataframe.merge() on these two DataFrames, without any additional arguments. Then it will merge the columns of the both the DataFrames for the rows in which values of ‘ID’ & ‘Experience’ are equal i.e.

# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj)
print('Contents of the Merged Dataframe :')
print(mergedDf)
# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj)

print('Contents of the Merged Dataframe :')
print(mergedDf)

Merged Dataframe contents are:

ID Name Age City Experience Salary Bonus
0 11 jack 34 Sydney 5 70000 1000
1 12 Riti 31 Delhi 7 72200 1100
2 14 Mohit 32 Delhi 15 90000 2000
3 15 Veena 33 Delhi 4 61000 1500
   ID   Name  Age    City  Experience  Salary  Bonus
0  11   jack   34  Sydney           5   70000   1000
1  12   Riti   31   Delhi           7   72200   1100
2  14  Mohit   32   Delhi          15   90000   2000
3  15  Veena   33   Delhi           4   61000   1500

It merged the contents of the both the DataFrames, but it picked only those rows where value of both the key columns “ID” & “Experience” were same. All other rows got skipped. It happened because by default “inner” join was used and key columns were “ID” and “Experience”.

In the above example,

Key column value combinations that were unique in DataFrame empDfObj (left DataFrame) were,

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

Key column value combinations that were unique in DataFrame salaryDfObj (right DataFrame) were,

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

As by default key is the common column names. Therefore key in our case was a combination of column “ID” and “Experience”. Therefore the above mentioned unique rows from both the DataFrames got skipped while merging.

So basically by default merge() function uses the Inner Join and it does the intersaction of keys in both the DataFrames. Results will be same if we pass how argument with value ‘inner’ i.e.

# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj, how='inner')
print('Contents of the Merged Dataframe :')
print(mergedDf)
# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj, how='inner')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Merge DataFrames along columns Using Left Join

What is Left Join ?

If Merging the two Dataframes on Key Columns and the Join type is “Left Join” then,

  • Include all rows from the Left DataFrame.
  • For the key combinations which are unique to Left DataFrame only, add NaN values for the columns included from Right DataFrame.
  • Skip rows from the Right DataFrame which have unique values in the key combinations.

By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘left’. It will merge two datframes on key columns using left join,

# Merge two Dataframes on common columns using Left join
mergedDf = empDfObj.merge(salaryDfObj, how='left')
print('Contents of the Merged Dataframe :')
print(mergedDf)
# Merge two Dataframes on common columns using Left join
mergedDf = empDfObj.merge(salaryDfObj, how='left')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Contents of the merged dataframe :

ID Name Age City Experience Salary Bonus
0 11 jack 34 Sydney 5 70000.0 1000.0
1 12 Riti 31 Delhi 7 72200.0 1100.0
2 13 Aadi 16 New York 11 NaN NaN
3 14 Mohit 32 Delhi 15 90000.0 2000.0
4 15 Veena 33 Delhi 4 61000.0 1500.0
5 16 Shaunak 35 Mumbai 5 NaN NaN
6 17 Shaun 35 Colombo 11 NaN NaN
   ID     Name  Age      City  Experience   Salary   Bonus
0  11     jack   34    Sydney           5  70000.0  1000.0
1  12     Riti   31     Delhi           7  72200.0  1100.0
2  13     Aadi   16  New York          11      NaN     NaN
3  14    Mohit   32     Delhi          15  90000.0  2000.0
4  15    Veena   33     Delhi           4  61000.0  1500.0
5  16  Shaunak   35    Mumbai           5      NaN     NaN
6  17    Shaun   35   Colombo          11      NaN     NaN

It picked all rows from left dataframe (empDfObj) and but there were 3 key unique column combinations in the Dataframe empDfObj (left DataFrame) i.e.

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

For these key column combinations, NaN values were used for columns (Salary and Bonus) merged from the Right DataFrame i.e. salaryDfObj.

Key column value combinations that were unique in DataFrame salaryDfObj (right side DataFrame) were just skipped while merging i.e.

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

So, basically in the Left Join, all the preference is given to the left side DataFrame.

Merge DataFrames along columns Using Right Join

What is Right Join ?

If Merging the two Dataframes on Key Columns and the Join type is “Right Join” then,

  • Include all rows from the Right DataFrame.
  • For the key combinations which are unique to Right DataFrame only, add NaN values for the columns included from Left DataFrame.
  • Skip rows from the Left DataFrame which have unique values in the key combinations.

By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘right’. It will merge two datframes on key columns using right join,

# Merge two Dataframes on common columns using right join
mergedDf = empDfObj.merge(salaryDfObj, how='right')
print('Contents of the Merged Dataframe :')
print(mergedDf)
# Merge two Dataframes on common columns using right join
mergedDf = empDfObj.merge(salaryDfObj, how='right')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Contents of the merged dataframe :

ID Name Age City Experience Salary Bonus
0 11 jack 34.0 Sydney 5 70000 1000
1 12 Riti 31.0 Delhi 7 72200 1100
2 13 NaN NaN NaN 33 84999 1000
3 14 Mohit 32.0 Delhi 15 90000 2000
4 15 Veena 33.0 Delhi 4 61000 1500
5 16 NaN NaN NaN 14 71000 1000
6 21 NaN NaN NaN 10 81000 2000
   ID   Name   Age    City  Experience  Salary  Bonus
0  11   jack  34.0  Sydney           5   70000   1000
1  12   Riti  31.0   Delhi           7   72200   1100
2  13    NaN   NaN     NaN          33   84999   1000
3  14  Mohit  32.0   Delhi          15   90000   2000
4  15  Veena  33.0   Delhi           4   61000   1500
5  16    NaN   NaN     NaN          14   71000   1000
6  21    NaN   NaN     NaN          10   81000   2000

It picked all rows from right dataframe (salaryDfObj ) and but there were 3 key unique column combinations in the Dataframe salaryDfObj (right DataFrame) i.e.

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

For these key column combinations, NaN values were used for columns (Name, Age and City) merged from the Left DataFrame i.e. empDfObj.

Key column value combinations that were unique in DataFrame empDfObj (left side DataFrame) were just skipped while merging i.e.

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

So, basically in the Right Join, all the preference is given to the right side DataFrame.

Merge DataFrames along columns Using Outer Join

What is Outer Join ?

If Merging the two Dataframes on Key Columns and the Join type is “Outer Join” then,

  • Include all rows from the Right & Left DataFrame.
  • For the key combinations which are unique to Left DataFrame only, add NaN values for the columns included from Right DataFrame.
  • For the key combinations which are unique to Right DataFrame only, add NaN values for the columns included from Left DataFrame.

By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘outer’. It will merge two datframes on key columns using Outer join,

# Merge two Dataframes on common columns using Outer join
mergedDf = empDfObj.merge(salaryDfObj, how='outer')
print('Contents of the Merged Dataframe :')
print(mergedDf)
# Merge two Dataframes on common columns using Outer join
mergedDf = empDfObj.merge(salaryDfObj, how='outer')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Contents of the merged dataframe :

Contents of the Merged Dataframe :
ID Name Age City Experience Salary Bonus
0 11 jack 34.0 Sydney 5 70000.0 1000.0
1 12 Riti 31.0 Delhi 7 72200.0 1100.0
2 13 Aadi 16.0 New York 11 NaN NaN
3 14 Mohit 32.0 Delhi 15 90000.0 2000.0
4 15 Veena 33.0 Delhi 4 61000.0 1500.0
5 16 Shaunak 35.0 Mumbai 5 NaN NaN
6 17 Shaun 35.0 Colombo 11 NaN NaN
7 13 NaN NaN NaN 33 84999.0 1000.0
8 16 NaN NaN NaN 14 71000.0 1000.0
9 21 NaN NaN NaN 10 81000.0 2000.0
Contents of the Merged Dataframe :


   ID     Name   Age      City  Experience   Salary   Bonus
0  11     jack  34.0    Sydney           5  70000.0  1000.0
1  12     Riti  31.0     Delhi           7  72200.0  1100.0
2  13     Aadi  16.0  New York          11      NaN     NaN
3  14    Mohit  32.0     Delhi          15  90000.0  2000.0
4  15    Veena  33.0     Delhi           4  61000.0  1500.0
5  16  Shaunak  35.0    Mumbai           5      NaN     NaN
6  17    Shaun  35.0   Colombo          11      NaN     NaN
7  13      NaN   NaN       NaN          33  84999.0  1000.0
8  16      NaN   NaN       NaN          14  71000.0  1000.0
9  21      NaN   NaN       NaN          10  81000.0  2000.0

It picked all rows from right dataframe (salaryDfObj ) and but there were 3 key unique column combinations in the Dataframe salaryDfObj (right DataFrame) i.e.

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

For these key column combinations, NaN values were used for columns (Name, Age and City) merged from the Left DataFrame i.e. empDfObj.

It picked all rows from left dataframe (empDfObj) and but there were 3 key unique column combinations in the Dataframe empDfObj (left DataFrame) i.e.

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

For these key column combinations, NaN values were used for columns (Salary and Bonus) merged from the Right DataFrame i.e. salaryDfObj.

Basically in case of Outer Join, equal weightage is given to both Left and Right DataFrames.

Summary

We learned about different ways to merge or concatenate DataFrames in Pandas.

Pandas Tutorials -Learn Data Analysis with Python

 

 

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK