6 Essential Tips to Solve Data Science Projects

 2 years ago
source link: https://hackernoon.com/6-essential-tips-to-solve-data-science-projects
Davis David

Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.

Data science projects focus on solving social or business problems using data. Solving them can be very challenging for beginners in this field, and you will need a different skill set depending on the type of data problem you want to solve.

In this article, you will learn some technical tips that can help you be more productive when working on different data science projects and achieve your goals.

TABLE OF CONTENTS

  1. Spend Time on Data Preparation
  2. Train with Cross-Validation 
  3. Train Many algorithms and Run Many Experiments 
  4. Tune Your Hyperparameters 
  5. Take Advantage of Cloud Platforms 
  6. Apply Ensemble Methods 


1. Spend Your Time on Data Preparation 

Data preparation is the process of cleaning and transforming your raw data into useful features that you can use to analyze and build predictive models. This step is crucial and can be difficult to accomplish; it often takes the largest share of your time (commonly cited as around 60% of a data science project).

Data is collected from different sources in different formats, which makes every data science project unique; you may need to apply different techniques to prepare your data.

Remember: if your data is not prepared well, don't expect your models to produce the best results.

Here is the list of activities you can do in data preparation:

  • Exploratory data analysis: analyze and visualize your data.
  • Data cleaning: identify and correct mistakes or errors in the data, e.g. missing values.
  • Feature selection: identify the features that are most relevant to the task.
  • Data transforms: change the scale or distribution of features/variables.
  • Feature engineering: derive new variables from available data.
  • Split data: prepare your train and test sets, e.g. 75% for training and 25% for testing.

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”— Prof. Pedro Domingos from the University of Washington
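As a minimal sketch of the activities above, assuming a small made-up dataset (the column names `age` and `income` are purely illustrative), data preparation might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# a small illustrative dataset (hypothetical values)
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 35, np.nan, 50],
    "income": [40000, 52000, 48000, 61000, 45000, 58000, 39000, 72000],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# data cleaning: fill missing values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# data transform: rescale income to the 0-1 range
df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# feature engineering: derive a new variable from available data
df["income_per_age"] = df["income"] / df["age"]

# split data: 75% for training, 25% for testing
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```

Real projects involve far more steps, but the same pattern (clean, transform, engineer, split) applies.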


2. Train with Cross-Validation

Cross-validation is a statistical method for assessing the effectiveness of predictive models. It is a very useful technique because it helps you avoid overfitting. It is recommended to set up cross-validation in the early stages of your data science project.

There are different cross-validation techniques that you can try, as listed below; k-fold cross-validation is the most commonly recommended.

  • Leave-one-out cross-validation
  • Leave-p-out cross-validation
  • Holdout cross-validation
  • Repeated random subsampling validation
  • K-fold cross-validation
  • Stratified k-fold cross-validation
  • Time series cross-validation
  • Nested cross-validation
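For example, k-fold cross-validation in scikit-learn might look like this (a sketch using the built-in iris dataset as a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print("scores per fold:", scores)
print("mean accuracy:", scores.mean())
```

The mean score across folds is a more reliable performance estimate than a single train/test split.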

3. Train Many Algorithms and Run Many Experiments

There is no better way to find the highest-performing predictive model than training your data with different algorithms. You also need to run many experiments to find the hyperparameter values that produce the best performance.

It is recommended to try multiple algorithms to understand how model performance changes and then select the algorithm that produces the best result.
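One way to do this is to evaluate several candidate algorithms with the same cross-validation setup and compare their scores (a sketch using scikit-learn's built-in breast cancer dataset as a placeholder for your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# candidate algorithms to compare
models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# evaluate each with the same cross-validation setup, then pick the best
results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(results, key=results.get)
print(results)
print("best model:", best)
```

Using identical cross-validation for every candidate keeps the comparison fair.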

4. Tune Your Hyperparameters 

A hyperparameter is a parameter whose value is used to control the learning process of an algorithm. Hyperparameter optimization or tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm that will give the best results/performance.

Here is a list of recommended techniques and libraries to use:

  • Random Search
  • Grid Search
  • Scikit-Optimize
  • Optuna
  • Hyperopt
  • Keras Tuner

Here is a simple example that shows how you can use Random Search to tune your hyperparameters.

from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# instantiate logistic regression (liblinear supports both l1 and l2 penalties)
logistic = LogisticRegression(solver='liblinear')

# define search space
distributions = dict(C=uniform(loc=0, scale=4), penalty=['l1', 'l2'])

# define search
clf = RandomizedSearchCV(logistic, distributions, random_state=0)

# execute search (X and y are your prepared features and labels)
search = clf.fit(X, y)

# print best parameters
print(search.best_params_)

{'C': 2, 'penalty': 'l1'}
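Grid Search works similarly but exhaustively tries every combination in a fixed grid rather than sampling randomly; a sketch with scikit-learn (using the built-in iris dataset in place of your own data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# liblinear supports both l1 and l2 penalties
logistic = LogisticRegression(solver='liblinear')

# every combination of C and penalty is evaluated with cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(logistic, param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
```

Grid Search is thorough but grows expensive quickly, which is why Random Search is often preferred for large search spaces.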

5. Take Advantage of Cloud Platforms 

A local machine often cannot handle training predictive models on large datasets: the process can be very slow, and you will not be able to run as many experiments as you want. Cloud platforms can help you solve this problem.

Put simply, a cloud platform offers computing services and resources over the internet. Cloud platforms provide large computational power that lets you train your models on large datasets and run many experiments in a short period compared to your local machine.

The most common cloud platforms are:

  • Google Cloud Platform 
  • Microsoft Azure 
  • Amazon Web Services
  • IBM Cloud 

Most of these platforms offer free trials, so you can try them out and select the one that best fits your data science project.

6. Apply Ensemble Methods

Sometimes multiple models perform better than one. You can take advantage of this by applying ensemble methods, which combine multiple base models into one group model that performs better than each model alone.

Here is a simple example of a voting classifier, which combines more than one algorithm to make predictions.

# import models and the evaluation metric
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# instantiate individual models
clf_1 = KNeighborsClassifier()
clf_2 = LogisticRegression()
clf_3 = DecisionTreeClassifier()

# create voting classifier
voting_ens = VotingClassifier(
    estimators=[('knn', clf_1), ('lr', clf_2), ('dt', clf_3)], voting='hard')

# fit and predict with the individual models and the ensemble model
# (X_train, X_test, y_train, y_test are your prepared splits)
for clf in (clf_1, clf_2, clf_3, voting_ens):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

The results show that the VotingClassifier performs better than the individual models.

I hope you find these technical tips useful in your data science projects. Mastering these techniques requires a lot of practice and experimentation, but then you will be able to achieve the goals of your data science projects and get the best results.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

And you can read more articles like this here.
