9

Split a column into multiple columns in Pandas

 1 year ago
source link: https://thispointer.com/split-a-column-into-multiple-columns-in-pandas/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

In this article, we will discuss how to split a DataFrame column into two columns in Python.

Table Of Contents

Overview Pandas DataFrame

In Pandas, the DataFrame contains three elements rows, columns, and data. It is a two-dimensional object which contains columns and rows. Where columns represent the content and rows representing the index. DataFrame is like a tabular data structure.

String column values in DataFrame

In Pandas, a DataFrame column can contain delimited string values. It means, multiple values in a single column that are either separated by dashes, whitespace, or comma. For example,

Advertisements

RollNo student_name student_address
0 10 Reema Surat_Gujarat
1 20 Rekha Pune_Maharastra
2 30 Jaya Delhi_Uttar Pradesh
   RollNo student_name      student_address
0      10        Reema        Surat_Gujarat
1      20        Rekha      Pune_Maharastra
2      30         Jaya  Delhi_Uttar Pradesh

Here, we have the requirement to split a single column into two different columns. For example, in the above DataFrame split the student_address column to two different columns “city” and “state” like,

RollNo student_name city state
0 10 Reema Surat Gujarat
1 20 Rekha Pune Maharastra
2 30 Jaya Delhi Uttar Pradesh
   RollNo student_name   city          state
0      10        Reema  Surat        Gujarat
1      20        Rekha   Pune     Maharastra
2      30         Jaya  Delhi  Uttar Pradesh

There are different ways to do this. Let’s discuss them one by one.

Split DataFrame column into two columns using Series.str.split()

In pandas, DataFrame columns are called Series, and to convert the column into a string data we can use Series.str() function. The Series.str.split() function is used to break up single column values into multiple columns based on a specified separator or delimiter. The Series.str.split() function is similar to the Python string split() method, but split() method works on the all Dataframe columns, whereas the Series.str.split() method works on a specified column only.

Syntax of Series.str.split() method

Series.str.split(pat=None, n=-1, expand=False)
Series.str.split(pat=None, n=-1, expand=False)
  • pat: string type; It is a regular expression or a delimiter symbol to split on. By default it is whitespace. It is optional.
  • n: int type; It specifies the number of splits, default is -1.
  • expand: bool type; default is False
    • If True, return DataFrame/MultiIndex expanding dimensionality.
    • If False, return Series/Index, containing lists of strings.

Example of Series.str.split() method with an underscore as a delimiter to split the student_address column to two different columns “city” and “state”.

import pandas as pd
# create a Dataframe
df = pd.DataFrame({
'RollNo': [10, 20, 30],
'student_name': ['Reema', 'Rekha', 'Jaya'],
'student_address': ['Surat_Gujarat', 'Pune_Maharastra', 'Delhi_Uttar Pradesh'] })
# show the dataframe
print(df)
print("***********")
df[['city','state']] = df.student_address.str.split("_", expand=True)
print(df)
import pandas as pd

# create a Dataframe
df = pd.DataFrame({
    'RollNo': [10, 20, 30],
    'student_name': ['Reema', 'Rekha', 'Jaya'],
    'student_address': ['Surat_Gujarat', 'Pune_Maharastra', 'Delhi_Uttar Pradesh'] })

# show the dataframe
print(df)   

print("***********")

df[['city','state']] = df.student_address.str.split("_", expand=True)

print(df)

Output

RollNo student_name student_address
0 10 Reema Surat_Gujarat
1 20 Rekha Pune_Maharastra
2 30 Jaya Delhi_Uttar Pradesh
***********
RollNo student_name student_address city state
0 10 Reema Surat_Gujarat Surat Gujarat
1 20 Rekha Pune_Maharastra Pune Maharastra
2 30 Jaya Delhi_Uttar Pradesh Delhi Uttar Pradesh
   RollNo student_name      student_address
0      10        Reema        Surat_Gujarat
1      20        Rekha      Pune_Maharastra
2      30         Jaya  Delhi_Uttar Pradesh

***********

   RollNo student_name      student_address   city          state
0      10        Reema        Surat_Gujarat  Surat        Gujarat
1      20        Rekha      Pune_Maharastra   Pune     Maharastra
2      30         Jaya  Delhi_Uttar Pradesh  Delhi  Uttar Pradesh

In the above script, we have used the Series.str.split() method to split student_address column values into the city and state columns. The delimiter ‘‘ underscore is specified between both the values, so both values are separated by ‘‘. We passed ‘_’ as the first argument in the Series. str.split() method.

Example 2:

Split two different columns values into four new columns, where comma is the delimiter.

import pandas as pd
# create a Dataframe
df = pd.DataFrame({
'RollNo': [10, 20, 30],
'student_name': ['Reema,Thakkar', 'Rekha,Chande', 'Jaya,Sachde'],
'student_address': ['Surat,Gujarat', 'Pune,Maharastra', 'Delhi,Uttar Pradesh'] })
# show the dataframe
print(df)
print('***********')
# Split column student_name to Name and Surname
df[['Name','Surname']]=df["student_name"].str.split(",",expand=True)
# Split column student_address to City and State
df[['City','State']]=df["student_address"].str.split(",",expand=True)
print(df)
import pandas as pd

# create a Dataframe
df = pd.DataFrame({
    'RollNo': [10, 20, 30],
    'student_name': ['Reema,Thakkar', 'Rekha,Chande', 'Jaya,Sachde'],
    'student_address': ['Surat,Gujarat', 'Pune,Maharastra', 'Delhi,Uttar Pradesh'] })

# show the dataframe
print(df)   

print('***********')

# Split column  student_name to Name and Surname
df[['Name','Surname']]=df["student_name"].str.split(",",expand=True)

# Split column  student_address to City and State
df[['City','State']]=df["student_address"].str.split(",",expand=True)

print(df)

Output

RollNo student_name student_address
0 10 Reema,Thakkar Surat,Gujarat
1 20 Rekha,Chande Pune,Maharastra
2 30 Jaya,Sachde Delhi,Uttar Pradesh
***********
RollNo student_name student_address Name Surname City State
0 10 Reema,Thakkar Surat,Gujarat Reema Thakkar Surat Gujarat
1 20 Rekha,Chande Pune,Maharastra Rekha Chande Pune Maharastra
2 30 Jaya,Sachde Delhi,Uttar Pradesh Jaya Sachde Delhi Uttar Pradesh
   RollNo   student_name      student_address
0      10  Reema,Thakkar        Surat,Gujarat
1      20   Rekha,Chande      Pune,Maharastra
2      30    Jaya,Sachde  Delhi,Uttar Pradesh

***********

   RollNo   student_name      student_address   Name  Surname   City          State
0      10  Reema,Thakkar        Surat,Gujarat  Reema  Thakkar  Surat        Gujarat
1      20   Rekha,Chande      Pune,Maharastra  Rekha   Chande   Pune     Maharastra
2      30    Jaya,Sachde  Delhi,Uttar Pradesh   Jaya   Sachde  Delhi  Uttar Pradesh

In the above script, we have used Series. str. split() method to split values in column student_name into two columns Name and Surname, and column student_address into columns City and State.

Split DataFrame column into two columns using apply() method

In Pandas, the apply() method can also be used to split one column values into multiple columns. The DataFrame.apply method() can execute a function on all values of single or multiple columns. Then inside that function, we can split the string value to multiple values. Then we can assign all these splitted values into new columns.

Syntax of DataFrame.apply() method

dataframe.apply(func, axis, raw, result_type, args, kwds)
dataframe.apply(func, axis, raw, result_type, args, kwds)
  • Parameters:
    • func: Required. A function to apply to the DataFrame.
    • axis: Default-0, 1
    • raw: True or False Optional, default False.
    • result_type: ‘expand’,’reduce’,’broadcast or None Optional, default None. Specifies how the result will be returned
    • args: a tuple Optional, arguments to send into the function
    • kwds: keyword arguments Optional, keyword arguments to send into the function

Example 1:

Example of DataFrame.apply() method to split a column into multiple columns. Where an underscore is the delimiter.

import pandas as pd
# create a Dataframe
df = pd.DataFrame({
'RollNo': [10, 20, 30],
'student_name': ['Reema', 'Rekha', 'Jaya'],
'student_address': ['Surat_Gujarat', 'Pune_Maharastra', 'Delhi_Uttar Pradesh'] })
# show the dataframe
print(df)
# Split column student_address into two columns City and State
df[['City', 'State']] = df["student_address"].apply(lambda x: pd.Series(str(x).split("_")))
print(df)
import pandas as pd

# create a Dataframe
df = pd.DataFrame({
    'RollNo': [10, 20, 30],
    'student_name': ['Reema', 'Rekha', 'Jaya'],
    'student_address': ['Surat_Gujarat', 'Pune_Maharastra', 'Delhi_Uttar Pradesh'] })

# show the dataframe
print(df)

# Split column student_address into two columns City and State
df[['City', 'State']] = df["student_address"].apply(lambda x: pd.Series(str(x).split("_")))

print(df)

Output

RollNo student_name student_address
0 10 Reema Surat_Gujarat
1 20 Rekha Pune_Maharastra
2 30 Jaya Delhi_Uttar Pradesh
RollNo student_name student_address City State
0 10 Reema Surat_Gujarat Surat Gujarat
1 20 Rekha Pune_Maharastra Pune Maharastra
2 30 Jaya Delhi_Uttar Pradesh Delhi Uttar Pradesh
   RollNo student_name      student_address
0      10        Reema        Surat_Gujarat
1      20        Rekha      Pune_Maharastra
2      30         Jaya  Delhi_Uttar Pradesh

   RollNo student_name      student_address   City          State
0      10        Reema        Surat_Gujarat  Surat        Gujarat
1      20        Rekha      Pune_Maharastra   Pune     Maharastra
2      30         Jaya  Delhi_Uttar Pradesh  Delhi  Uttar Pradesh

In the above script, we have the used pandas DataFrame.apply() method to split a column student_address into two columns city and state. For that, we applied a lambda function on all values of column student_address. This lambda function broke each value in that column to two different values i.e. City and State. Then we assigned those values to two new columns.

Example 2:

Example of DataFrame.apply() method with comma as a delimiter, to split two different columns values into four new columns.

import pandas as pd
# create a Dataframe
df = pd.DataFrame({
'RollNo': [10, 20, 30],
'student_name': ['Reema,Thakkar', 'Rekha,Chande', 'Jaya,Sachde'],
'student_address': ['Surat,Gujarat', 'Pune,Maharastra', 'Delhi,Uttar Pradesh'] })
# show the DataFrame
print(df)
print('*******************')
# Split column student_name into Name and Surname
df[['Name', 'Surname']] = df["student_name"].apply(lambda x: pd.Series(str(x).split(",")))
# Split column student_address into City and State
df[['City', 'State']] = df["student_address"].apply(lambda x: pd.Series(str(x).split(",")))
print(df)
import pandas as pd

# create a Dataframe
df = pd.DataFrame({
    'RollNo': [10, 20, 30],
    'student_name': ['Reema,Thakkar', 'Rekha,Chande', 'Jaya,Sachde'],
    'student_address': ['Surat,Gujarat', 'Pune,Maharastra', 'Delhi,Uttar Pradesh'] })

# show the DataFrame
print(df)

print('*******************')

# Split column student_name into Name and Surname
df[['Name', 'Surname']] = df["student_name"].apply(lambda x: pd.Series(str(x).split(",")))

# Split column student_address into City and State
df[['City', 'State']] = df["student_address"].apply(lambda x: pd.Series(str(x).split(",")))

print(df)

Output

RollNo student_name student_address
0 10 Reema,Thakkar Surat,Gujarat
1 20 Rekha,Chande Pune,Maharastra
2 30 Jaya,Sachde Delhi,Uttar Pradesh
*******************
RollNo student_name student_address Name Surname City State
0 10 Reema,Thakkar Surat,Gujarat Reema Thakkar Surat Gujarat
1 20 Rekha,Chande Pune,Maharastra Rekha Chande Pune Maharastra
2 30 Jaya,Sachde Delhi,Uttar Pradesh Jaya Sachde Delhi Uttar Pradesh
   RollNo   student_name      student_address
0      10  Reema,Thakkar        Surat,Gujarat
1      20   Rekha,Chande      Pune,Maharastra
2      30    Jaya,Sachde  Delhi,Uttar Pradesh

*******************

   RollNo   student_name      student_address   Name  Surname   City          State
0      10  Reema,Thakkar        Surat,Gujarat  Reema  Thakkar  Surat        Gujarat
1      20   Rekha,Chande      Pune,Maharastra  Rekha   Chande   Pune     Maharastra
2      30    Jaya,Sachde  Delhi,Uttar Pradesh   Jaya   Sachde  Delhi  Uttar Pradesh

In the above script, we have applied the pandas.DataFrame.apply() method to split two columns student_name and student_address into four new columns i.e. student_name into Name and Surname, and student_address into City and State.

Summary

In the article, we have discussed what is a string column in a DataFrame? The we listed out different ways to split string column into two columns and also explain each method with practical examples.

Pandas Tutorials -Learn Data Analysis with Python

 

 

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK