Dealing with Missing Data: Practical Imputation Methods


According to various sources, data scientists spend about 80% of their time cleaning data rather than creating insights. Here are some reports that support this claim.

According to CrowdFlower's 2015 report, 66.7% of participants stated that cleaning and organizing data is one of their most time-consuming tasks, although that report does not quantify the time spent on it.

In the same organization's 2016 report, the top answer to the question "What do data scientists spend the most time doing?" was cleaning data, at 60%; in 2017 the figure was 51%.

[Figure: Time Spent on Activities in Data Science]

Finally, according to Kaggle's 2018 report, data cleaning and organizing tasks consume around 80% of the time on a given project.

Dealing with missing data constitutes the main task in the data cleaning process. Python provides a great library, Pandas, for data manipulation, and it is an excellent tool for finding, replacing or removing missing (or null) values in a dataframe. All of these approaches have their place; however, removing missing data can sometimes affect our machine learning algorithms negatively. Therefore, caution is needed, and several things should be taken into consideration when dealing with missing values, such as how many values are missing and how big the dataset is.
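As a quick reference, these are the basic Pandas calls this article builds on. The tiny dataframe below is only an illustration, not the article's dataset:

```python
import pandas as pd
import numpy as np

# A tiny illustrative dataframe with some missing values
df = pd.DataFrame({"km": [45000, np.nan, 120000],
                   "gears": [5, 6, np.nan]})

print(df.isnull().sum())   # count missing values per column
print(df.dropna())         # drop rows with any missing value (often too aggressive)
print(df.fillna(0))        # replace missing values with a constant
```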


Since it is the easiest option, there is a tendency to simply drop missing values during data cleaning. However, instead of removing missing data, replacing or filling them with proper values can be a better choice. Nevertheless, finding those proper values is a common challenge for data scientists.

The Pandas library provides a variety of options for filling missing values. For the dataset below, we will use different fillna strategies in accordance with the characteristics of the respective columns.

Let’s have a look at our dataset.

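The original screenshots are not reproduced here. As a rough sketch, the dataset can be loaded and inspected as below; the file name cars.csv is an assumption, while the column names (age, km, cylinders, gears, upholstery) are the ones discussed later in the article. The resulting df is used in the sketches that follow:

```python
import pandas as pd

# Load the cars dataset (the file name is an assumption)
df = pd.read_csv("cars.csv")

print(df.head())           # preview the first rows
print(df.isnull().sum())   # count missing values per column
```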

As seen above, our dataset has some missing values. Let’s use visualization to analyze the dataset better.

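The plot itself is not shown in this copy. One common way to produce the kind of visualization described below, with missing cells highlighted in yellow, is a seaborn heatmap of the null mask; this is a sketch, not necessarily the exact code the author used:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Missing cells (True in the null mask) show up as the bright yellow end
# of the viridis colormap
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.show()
```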

Parts of the cylinders, gears, upholstery and km columns are colored in yellow, representing missing values. Since some of these columns are categorical and others are numerical, we can't use mean imputation to fill the missing values in all of them. Different methods are required, depending on the characteristics of each column. Here we will use the mean, the mode and specific values (with a limit) to fill missing values.

1. Using mean():

The mean of a dataframe/column is a common replacement for missing values. In our sample dataframe, we will use the mean of the km column together with the groupby and transform methods.

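The code screenshot is missing from this copy; a minimal sketch of the group-wise mean imputation described below, assuming the dataset has an age column to group on:

```python
# Fill missing km values with the mean km of cars of the same age
df["km"] = df["km"].fillna(df.groupby("age")["km"].transform("mean"))
```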

Here groupby helps us calculate the mean km of the cars grouped by age. Then transform enables us to fill each missing km value with the mean of the car's respective age group.

2. Using mode():

When the median or mean cannot be used, the mode is especially useful for filling categorical missing values. In this method, missing values are replaced by the most common value in the dataframe/column.

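The screenshot is again unavailable. A sketch of mode imputation for the categorical columns mentioned earlier (which columns were actually filled this way is an assumption):

```python
# Fill categorical columns with their most frequent (mode) value
for col in ["cylinders", "gears"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])
```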

3. Filling with a specific value (with limit):

Sometimes, due to the characteristics of the data, the above methods can't be applied: a column may contain only a few repeating values, and filling every missing entry with an aggregate or the mode would distort their distribution. In that case, we can fill missing values in proportion to the valid values at hand.

The number of missing values in the upholstery column is 465.

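This count, shown as a screenshot in the original, can be reproduced with:

```python
df["upholstery"].isnull().sum()   # 465
```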

The remaining 1,534 rows contain either "leather" (1,280) or "cloth" (254). As a proportion, around 80% of the non-missing values in this column are "leather" and the rest (about 20%) are "cloth".

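The corresponding value counts, also a screenshot in the original, would look roughly like this:

```python
df["upholstery"].value_counts()
# leather    1280
# cloth       254

df["upholstery"].value_counts(normalize=True)
# leather    0.83...   (roughly 80%)
# cloth      0.16...   (roughly 20%)
```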

Taking this into consideration, we can replace the missing values in this column by setting a limit and using the proportion of "leather" to "cloth" values, roughly 4 to 1 in our case. In other words, around 370 of the 465 missing values should be filled with "leather" and about 95 with "cloth".

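The original code screenshot is missing. A sketch of what such a proportional fill could look like, using fillna's limit parameter; the exact call the author used is not shown in this copy:

```python
# Fill at most 370 of the missing upholstery values with "leather",
# then fill the remaining ones (up to 95) with "cloth"
df["upholstery"] = df["upholstery"].fillna("leather", limit=370)
df["upholstery"] = df["upholstery"].fillna("cloth", limit=95)
```

Note that fillna with a limit fills the first matching missing rows in order, so the result is not randomized; sampling the fill values instead would avoid that positional bias if it matters for your data.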

So to recap:

-Explore your dataset: Spend some time analyzing your dataset and, with the help of the columns that don't have missing values, try to segment your data into groups that yield useful patterns for calculations. Make use of the groupby and transform functions in this regard.

-Then choose the best method to fill your data, considering its effects on the desired outcome.

Do you want to learn imputation methods in SQL Server? If so, follow me for the next story.
