
Kaggle Competitions Top Classification Algorithm

source link: https://towardsdatascience.com/choosing-the-best-classification-algorithm-f254f68cca39


Comparing and Choosing the best Algorithm for Classification Problems


If you are a beginner Data Scientist, you probably feel overwhelmed by the number of possible algorithms to choose from, and if you have tried some of them, you have probably realized that they are all pretty good. But don’t worry: in this article we are going to take a closer look at classification problems, and we are going to do it with a practical business case.

In my previous article, I showed you how to use the K-Means algorithm to group items into similar categories: specifically, we took an unlabeled dataset and clustered different types of Whisky into 15 groups. That gave us a labeled dataset that we can now use to compare the performance of different classification algorithms. Let’s do that!

Let’s download the labeled dataset here, load the data into a Pandas data frame, and try the simplest classification algorithm, Logistic Regression:
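
Here is a minimal sketch of this step, assuming the data sits in a CSV file with a “label” column holding the K-Means groups (adjust the file and column names to the actual dataset from the previous article):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Assumed file and column names -- adjust them to the dataset from the K-Means article.
df = pd.read_csv("whisky_labeled.csv")
X = df.drop(columns=["label"])   # flavour features
y = df["label"]                  # the 15 cluster labels produced by K-Means

# Only 20% of the rows go into the training set (see the note below the results).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Train accuracy", log_reg.score(X_train, y_train))
print("Test accuracy", log_reg.score(X_test, y_test))
```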

  • Train accuracy 0.959910913140311
  • Test accuracy 0.9560622914349277

I have used a very small training split (only 20% of the data) because this is not a very complex problem, and I wanted to test how well these algorithms perform with small samples.

Now let’s try the KNN algorithm:

We can use 15 neighbors because we know there are 15 groups of Whiskies.
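
Reusing the same train/test split, a minimal KNN sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

# 15 neighbours, matching the 15 whisky groups found by K-Means.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
print("Train accuracy", knn.score(X_train, y_train))
print("Test accuracy", knn.score(X_test, y_test))
```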
  • Train accuracy 0.951002227171492
  • Test accuracy 0.9443826473859844

Now we train a Support Vector Machine:
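
Again on the same split, a sketch using scikit-learn’s SVC with its default settings:

```python
from sklearn.svm import SVC

# Support Vector Classifier with its default RBF kernel.
svm = SVC()
svm.fit(X_train, y_train)
print("Train accuracy", svm.score(X_train, y_train))
print("Test accuracy", svm.score(X_test, y_test))
```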

  • Train accuracy 0.9510022271714922
  • Test accuracy 1.0

Now it’s time to compare the performance of Decision Trees and Random Forests:
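
A sketch fitting both models on the same split:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

tree.fit(X_train, y_train)
forest.fit(X_train, y_train)

print("Trees Train accuracy", tree.score(X_train, y_train))
print("Forest Train accuracy", forest.score(X_train, y_train))
print("Trees Test accuracy", tree.score(X_test, y_test))
print("Forest Test accuracy", forest.score(X_test, y_test))
```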

  • Trees Train accuracy 0.9510022271714922
  • Forest Train accuracy 0.9510022271714922
  • Trees Test accuracy 0.996662958843159
  • Forest Test accuracy 0.996662958843159

And, finally, let’s try XGboost:
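
A minimal sketch using the scikit-learn-style XGBClassifier with default hyperparameters:

```python
from xgboost import XGBClassifier

# The scikit-learn wrapper around XGBoost; the K-Means labels 0..14 are already
# the integer class labels it expects.
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
print("Train accuracy", xgb.score(X_train, y_train))
print("Test accuracy", xgb.score(X_test, y_test))
```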

  • Train accuracy 1.0
  • Test accuracy 0.996662958843159

OK, so let’s pause and have a look at the accuracies obtained.

  • First of all, we can see that the simplest algorithms, Logistic Regression and KNN, performed the worst, with train and test accuracies around 0.95.
  • Then we have more robust algorithms like Support Vector Machines, Decision Trees, and Random Forests, which performed very well on the test set, with accuracy close to 1, but lower accuracy on the train set, close to 0.95.
  • Finally, the top performer was XGboost with an impressive performance in both train and test sets.

This shouldn’t be a surprise. In fact, XGBoost is an extremely powerful algorithm and has risen to dominate Kaggle competitions for non-perceptual problems (perceptual problems are dominated by neural networks).

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm, engineered so that compute time and memory are used very efficiently; a key design goal was to make the best use of available resources to train the model. The algorithm works iteratively: it repeatedly builds new models and combines them into an ensemble. We start the cycle by calculating the errors for each observation in the dataset, then build a new model to predict those errors and add this error-predicting model to the ensemble of models.

To make a prediction, we add the predictions from all previous models. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.
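
To make the cycle concrete, here is a toy illustration of that residual-fitting loop on synthetic data, with plain decision trees standing in for XGBoost’s optimized implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one noisy sine-wave feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y_demo)   # the ensemble starts from a trivial model
trees = []

for _ in range(50):
    residuals = y_demo - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)                           # new model predicts those errors
    prediction += learning_rate * tree.predict(X_demo)    # add it to the ensemble
    trees.append(tree)

# The final prediction for new data is the sum of all the trees' (scaled) outputs.
```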

So if we have enough data and maximum accuracy is the goal, XGboost is the go-to technique.

Now let’s summarize the best use cases for each of the other algorithms:

  • Logistic Regression performs better when the number of noise variables is less than or equal to the number of explanatory variables.
  • KNN works better than Logistic Regression on problems with a high signal-to-noise ratio (SNR) and small amounts of data (it becomes very slow as the dataset grows).
  • SVM is similar to KNN but handles problems with many outliers in the data better. Also, KNN works well on datasets with many features and less training data, while the opposite is true for SVM.
  • Trees and Forests also work well with a high SNR but can handle much bigger datasets than KNN, which is why they are much more commonly used in data science.

OK, so that was it for today, hope you enjoyed this article.

Happy coding!

