
K-Nearest Neighbors (KNN): Solving Classification Problems


In this tutorial, we are going to use the K-Nearest Neighbors (KNN) algorithm to solve a classification problem. Firstly, what exactly do we mean by classification?

Classification means assigning each observation to one of a set of discrete groups, e.g. classifying a fruit as either an apple or an orange.

The KNN algorithm is one of the most basic, yet most commonly used, algorithms for solving classification problems. KNN classifies a test observation by finding the k training observations closest to it (typically by Euclidean distance) and assigning the class held by the majority of those neighbours.
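To make the mechanics concrete, here is a toy sketch of that idea (an illustration only, not the scikit-learn implementation we will use later): compute the new point's distance to every training point, take the k closest, and return the majority label among them.

import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training observations
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbours
    return np.bincount(y_train[nearest]).argmax()

x_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(x_train, y_train, np.array([8.5, 8.5])))  # prints 1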


As we dive deeper into our case study, you will see exactly how this works. First of all, let’s take a look at the specific case study that we will analyse using KNN.

Our case study

In this particular instance, KNN is used to classify consumers according to their internet usage. Certain consumers will use more data (in megabytes) than others, and certain factors will have an influence on the level of usage. For simplicity, let's set this up as a classification problem.

Our dependent variable (usage per week in megabytes) is expressed as a 1 if the person's usage exceeds 15,000 MB per week, and 0 if it does not. Therefore, we are splitting consumers into two separate groups based on their usage (1 = heavy users, 0 = light users).
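As an aside, if you had the raw weekly usage in megabytes rather than the ready-made 0/1 label, the encoding could be derived as below (the column name usage_mb and the values are hypothetical; our dataset already ships with the 0/1 usage column):

import pandas as pd

raw = pd.DataFrame({'usage_mb': [22000, 9000, 15500, 4000]})  # made-up values
raw['usage'] = (raw['usage_mb'] > 15000).astype(int)  # 1 = heavy, 0 = light
print(raw)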

The independent variables (or the variables that are hypothesised to directly influence usage – the dependent variable) are as follows:

  • Income per month
  • Hours of video per week
  • Webpages accessed per week
  • Gender (0 = Female, 1 = Male)
  • Age

To clarify:

Dependent variable: a variable that is influenced by other variables. In this case, data usage is being influenced by other factors.

Independent variable: a variable that influences another variable. For instance, the more hours of video a person watches per week, the more data that person will consume.

Load libraries

Firstly, let’s open up a Python environment and load the following libraries:

import numpy as np
import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import mglearn
import os

As we go through the tutorial, the uses for the above libraries will become evident.

Note that I used Python 3.6.5 at the time of writing this tutorial. As an example, the mglearn library can be installed with pip as follows:

pip3 install mglearn
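The remaining libraries used in this tutorial can be installed the same way; these are their package names as published on PyPI:

pip3 install numpy pandas statsmodels scikit-learn matplotlib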

Load data and define variables

Before we dive into the analysis itself, we will first:

1. Load the CSV file into the Python environment using the os and pandas libraries

2. Stack the independent variables with numpy and statsmodels

Firstly, the file path where the CSV is located is set. The dataset itself can be found here, titled internetlogit.csv.

path="/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()
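If you would rather not change the working directory, the same file can equally be read with an explicit path (adjust the path to wherever you saved the CSV):

variables = pd.read_csv(os.path.join(path, 'internetlogit.csv'))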

Then, we load the CSV file using pandas (imported as pd, the short notation we specified above) and extract each column:

variables=pd.read_csv('internetlogit.csv')
usage=variables['usage']
income=variables['income']
videohours=variables['videohours']
webpages=variables['webpages']
gender=variables['gender']
age=variables['age']
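Before going further, it is worth a quick sanity check on what was loaded, for example:

print(variables.head())                    # first five rows
print(variables.shape)                     # (rows, columns)
print(variables['usage'].value_counts())   # balance of heavy vs. light users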

Finally, we are defining our dependent variable (usage) as y, and our independent variables as x.

y=usage
x=np.column_stack((income,videohours,webpages,gender,age))
x=sm.add_constant(x,prepend=True)
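Note that sm.add_constant prepends a column of ones (an intercept term, a statsmodels convention). KNN does not use an intercept, and after min-max scaling this constant column collapses to all zeros (visible as column 0 in the scaled output below), so it has no effect on the distance calculations. You can confirm the resulting dimensions:

print(x.shape)  # (966, 6): the constant plus our five independent variables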

MinMaxScaler and Train-Test Split

To further prepare the data for meaningful analysis with KNN, it is necessary to:

1. Scale the data between 0 and 1 using MinMaxScaler in order for the KNN algorithm to interpret it properly. KNN is distance-based: a feature with a large range (e.g. income, in the thousands) would otherwise dominate the distance calculation and drown out features with small ranges (e.g. gender, which is 0 or 1). Scaling every independent variable onto the same 0 to 1 range puts them on an equal footing (a manual sketch of the scaling formula follows after this list).

2. Partition the data into training and test data. In this instance, 80% of the data is apportioned to the training segment, while 20% is apportioned to the test segment. Specifically, the KNN model will be built with the training data, and the results will then be validated against the test data to gauge classification accuracy.
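For reference, min-max scaling maps each feature onto [0, 1] via (x - min) / (max - min). A minimal sketch of what MinMaxScaler does under the hood (sklearn additionally guards against constant columns, where max - min would be zero):

def min_max_scale(column):
    # Map the values of one feature column linearly onto [0, 1]
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

print(min_max_scale(np.array([10.0, 20.0, 30.0])))  # [0.  0.5 1. ]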

x_scaled = MinMaxScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y,
                                                    test_size=0.2)
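One caveat: train_test_split shuffles the data randomly, so the exact scores you see later will vary from run to run. If you want reproducible results, you can optionally fix the seed (the value 42 here is arbitrary) and stratify on the labels so that both splits keep a similar heavy/light ratio:

x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.2, random_state=42, stratify=y)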

Now, our data has been split and the independent variables have been scaled appropriately.

To get a closer look at our scaled variables, let's view x_scaled as a pandas DataFrame:

pd.DataFrame(x_scaled)

0         1         2         3    4         5
0    0.0  0.501750  0.001364  0.023404  0.0  0.414634
1    0.0  0.853250  0.189259  0.041489  0.0  0.341463
2    0.0  0.114500  0.000000  0.012766  1.0  0.658537
..   ...       ...       ...       ...  ...       ...
963  0.0  0.106500  0.061265  0.014894  0.0  0.073171
964  0.0  0.926167  0.033951  0.018085  1.0  0.926829
965  0.0  0.975917  0.222488  0.010638  1.0  0.634146

You can see that all of our variables are now on a scale between 0 and 1, so each feature contributes comparably to the distance calculations.

Classification with KNN

Now that we have loaded and prepared our data, we are ready to run the KNN itself!

n_neighbors = 1

Firstly, we will run with 1 nearest neighbor (where n_neighbors = 1) and obtain a test score:

print(x_train.shape, y_train.shape)   # dimensions of the training data
print(x_test.shape, y_test.shape)     # dimensions of the test data
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
print("Test set score: {:.2f}".format(knn.score(x_test, y_test)))

We obtain the following output:

Test set score: 0.91
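Beyond the aggregate score, the fitted model can classify individual consumers. Here is a sketch with a hypothetical new observation (the feature values below are made up for illustration). Note that new data must pass through the same scaling as the training data, so we keep a fitted scaler object instead of calling fit_transform a second time:

scaler = MinMaxScaler().fit(x)  # fit once on the raw design matrix

# Hypothetical consumer: [constant, income, videohours, webpages, gender, age]
new_raw = np.array([[1.0, 4500, 10, 300, 1, 35]])
print(knn.predict(scaler.transform(new_raw)))  # e.g. [1] = heavy user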

Moreover, we can visualise how 1-nearest-neighbor classification works using mglearn (note that this built-in plot illustrates the mechanism on mglearn's own example dataset, not on our data):

mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.show()

[Figure: 1-nearest-neighbor classification plot]

n_neighbors = 5

Now, what happens if we decide to use 5 nearest neighbors? Let’s find out!

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
print("Test set score: {:.2f}".format(knn.score(x_test, y_test)))

We now obtain a higher test set score of 0.95:

Test set score: 0.95

When we visualise this (again via mglearn.plots.plot_knn_classification, this time with n_neighbors=5), we see that each test prediction is now based on 5 nearest neighbors instead of 1:

[Figure: 5-nearest-neighbors classification plot]

In this instance, we see that increasing the number of nearest neighbors increased the accuracy rate against our test data. Now, try varying the number of nearest neighbors and see what accuracy rates you get!
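A systematic way to do that is to loop over a range of values for k and compare the resulting test scores:

for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print("k = {:2d}, test set score: {:.2f}".format(k, knn.score(x_test, y_test)))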

Summary

In this tutorial, you have learned:

  • What a classification problem is
  • How KNN can be used to solve classification problems
  • How to configure data for effective analysis with KNN

Many thanks for reading, and feel free to leave any questions in the comments below!

