5

Data Wrangling Solutions — Working With Dates — Part 3

 2 years ago
source link: https://towardsdatascience.com/data-wrangling-solutions-working-with-dates-part-3-b8b5caccc4ed
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Data Wrangling Solutions — Working With Dates — Part 3

Extract information from columns containing DateTime data.

Photo by Waldemar Brandt on Unsplash

In the last two tutorials, we went from importing data containing date columns to converting them from non-DateTime datatype to DateTime datatype. In this tutorial, we will go through a few data-wrangling scenarios and their solutions that we encounter while working with DateTime columns.

Assumption and Recommendation

If you would like to implement the codes while following through the tutorial, use the sample data and the associated Jupiter notebook available in the Scenario_7 folder of this GitHub link.

To learn about GitHub, please go through this tutorial. This tutorial will help you set up a new Python environment if it is not set up already.

Following is the list of Python concepts and pandas functions/ methods used in the tutorial:

Pandas functions

  • read_csv
  • to_datetime
  • DateOffset
  • date accessors

Challenges & Solutions

Knowing the dataset

As explained in the last tutorial, we have created a dummy dataset that contains multiple versions of a data field, release_date. The data dictionary of this dummy dataset is as follows:

  • release_date — Actual date column with first date value deleted.
  • release_date_int — Another column containing date information but in an integer format, for example, date 2020–02–12 is present as 20200212 in YYYYMMDD format.
  • release_date_text — Column containing dates in text format, and # as the separator. The format code is YYYY#DD#MM.
  • Year — Column containing only year values of the date data.
  • Month — Column containing only month values of the date data.
  • Day — Column containing information on days

A snapshot of the raw dataset is shared below:

1*KiUXQuKxjUfqRvp3v5sAwA.png?q=20
data-wrangling-solutions-working-with-dates-part-3-b8b5caccc4ed
Sample Data Snapshot (Image by Author)

In all our scenarios explained below, we have assumed that this dummy data is either successfully read or transformed into a DateTime variable within the Python environment. A sample view of the read data is as follows:

1*S68-f3WXqdCPzmEWtnOiVQ.png?q=20
data-wrangling-solutions-working-with-dates-part-3-b8b5caccc4ed
Sample Data Read in Python (Image By Author)

Challenge 1 — Create a new column containing dates n days/ month/ year ahead of the existing one.

This is one of the common scenarios we always encounter when working with datasets. To achieve this objective, we will use Pandas's built-in function, DateOffset. In the sample code below, we are using our existing date columns to create a new column where the date values are 7 days ahead:

#### Sample Code
#### A view to existing columns
imp_data["release_date_text"].head()#### Output
0 2019-03-17
1 2004-08-06
2 2014-10-10
3 2012-03-09
4 2009-02-05#### New column 7 days ahead
imp_data["release_date_text_7daysplus"] = imp_data["release_date_text"] + pd.DateOffset(days=7)
1*3zy7gT-3jzsZOZsCUtNpRg.png?q=20
data-wrangling-solutions-working-with-dates-part-3-b8b5caccc4ed
Sample Output — Date Ahead (Image By Author)

Explanation

  • The new column release_date_text_7daysplusis created using DateOffset function.
  • DateOffset — The function offsets the date values by n period. The value of n can be any integer, and the period can be years, months, week, days, and hours among others. Depending upon the offset period type, we should use the associated keyword as the argument (in our example, it’s days). For a complete list of keywords, refer to this link.

Challenge 2 — Extracting various components of date columns

One of the desired scenarios is to extract information like a weekday, week number from dates. We can achieve this in Pandas using the date accessor that is represented by dt. The sample code below will demonstrate how we can extract the weekdays using one of the date columns.

#### Sample Code
#### A view to existing columns
imp_data["release_date_text"].head()#### Output
0 2019-03-17
1 2004-08-06
2 2014-10-10
3 2012-03-09
4 2009-02-05#### Extracting Weekdays
imp_data["Weekdays"] = imp_data["release_date_int"].dt.dayofweek
imp_data["Weekdays"].head(3)#### Ouput
0 6
1 4
2 4
Name: Weekdays, dtype: int64#### Extracting year
imp_data["year"] = imp_data["release_date_int"].dt.year
imp_data["year"].head(3)#### Ouput
0 2019
1 2004
2 2014
Name: year, dtype: int64

Explanation

  • Observe the new columns containing the weekday and year information.
  • Date accessors— The date accessors (represented as dt) provide ready-to-use methods that one can use to extract specific components from a DateTime object. Please note that we can apply the DateTime accessors only to a dataframe series (single column). For a complete list of these components, refer to this link.

Challenge 3 —Finding date differences or using logical operations on two different date columns

With pandas, this task is as simple as adding or subtracting 2 columns. Let’s straight look at the following examples to understand:

#### Sample Code
#### Finding difference between 2 dates

imp_data["days_difference"] = imp_data["release_date_text_7daysplus"] - imp_data["release_date_text"]
imp_data["days_difference"].head()#### Sample Ouptut
0 7 days
1 7 days
2 7 days
3 7 days
4 7 days
Name: days_difference, dtype: timedelta64[ns]#### Comparing Dates
imp_data["Is_bigger"] = imp_data["release_date_text_7daysplus"] > imp_data["release_date_text"]
imp_data["Is_bigger"].head()#### Sample Output
0 True
1 True
2 True
3 True
4 True
Name: Is_bigger, dtype: bool

Please note, in all the above scenarios, the output of the operations is a series object and if we want to use it further, we should store it as a new column in the dataframe.

Closing note

We are now ending the tutorial with a sample problem. Using all that we have learned in the last three DateTime tutorials, can you think of a solution to manually correct a wrong date present in a cell of the dataframe? For the answer, refer to the Jupyter Notebook shared through the GitHub Repository.

HAPPY LEARNING ! ! ! !


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK