Visualizing Support Vector Machine Decision Boundary

Decision Boundary (Picture: Author’s Own Work, Saitama, Japan)

In a previous post I described principal component analysis (PCA) in detail, and in another the mathematics behind the support vector machine (SVM) algorithm. Here, I will combine SVM, PCA, and grid-search cross-validation into a pipeline that finds the best parameters for binary classification, and eventually plot a decision boundary to show how well our algorithm has performed. What you can expect to learn or review in this post:

Joint-plots and representing data in a meaningful way with the Seaborn library.
If you have more than 2 components in principal component analysis, how do you choose and show which 2 components are more relevant than the others?
Creating a pipeline with PCA and SVM to find the best-fit parameters through grid-search cross-validation.
Finally, we choose the 2 principal components to represent the SVM decision boundary in a 3D/2D plot, drawn using Matplotlib.
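
As a rough preview of the pipeline idea in the list above, here is a minimal sketch. The scaling step, the RBF kernel, the train/test split, and the small parameter grid are all assumptions made for illustration; they are not taken from the post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed steps: standardize, project with PCA, classify with an RBF SVM.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("svc", SVC(kernel="rbf")),
])

# Illustrative grid only; "<step>__<param>" addresses a pipeline step.
param_grid = {
    "pca__n_components": [2, 4],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1],
}

# 5-fold cross-validated search over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Putting PCA inside the pipeline means it is refit on each cross-validation fold, so the held-out fold never influences the projection.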

1. Know the Data-Set Better: Joint-plots and Seaborn

Here, I have used the scikit-learn cancer data-set, a relatively easy data-set for studying binary classification, with the 2 classes being Malignant and Benign. Let’s look at the first few rows of the data-frame.
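
A minimal sketch of loading the data into a data-frame, assuming pandas is available; the variable names `cancer` and `df` and the `target` column name are my own choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the breast cancer data-set bundled with scikit-learn.
cancer = load_breast_cancer()

# Build a data-frame of the features and attach the class labels.
# In scikit-learn's encoding, 0 is malignant and 1 is benign.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

print(df.head())  # first five rows
print(df.shape)   # 30 feature columns plus the target column
```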

As we can see, there are 569 samples and 30 features in total in the data-set, and our task is to separate malignant samples from benign ones. After checking that there are no missing data, we look at the feature names and at correlation plots of the mean features.
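
These checks could look something like the sketch below. Selecting the mean features by the `"mean "` prefix in the column names is an assumption about how to pick them; it relies on scikit-learn's naming of the columns.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Total number of missing entries across all columns.
n_missing = int(df.isnull().sum().sum())
print("missing values:", n_missing)

# Names of all 30 features.
print(list(cancer.feature_names))

# Keep only the 'mean ...' columns for the correlation study.
mean_cols = [c for c in df.columns if c.startswith("mean ")]
print(len(mean_cols), "mean features")
```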

Below is the correlation plot of the mean features, plotted using the seaborn library. As expected, ‘area’, ‘perimeter’, and ‘radius’ are highly correlated.

Fig. 1: Correlation plot of mean features.
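
A plot like Fig. 1 could be produced roughly as follows. This is a sketch assuming a seaborn heatmap of the Pearson correlation matrix; the colour map, figure size, and annotation format are my own choices and may differ from the original figure.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
mean_cols = [c for c in df.columns if c.startswith("mean ")]

# Pearson correlation between every pair of mean features.
corr = df[mean_cols].corr()

fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation of mean features")
fig.tight_layout()
```

In a script, follow this with `plt.show()` or `fig.savefig(...)` to see the figure; in a notebook it is displayed automatically.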

We can use the seaborn jointplot to understand the relationship between individual features. Let’s see 2 examples below where, as an alternative to scatter plots, I have opted for 2D density plots. On the right panel I used the ‘hex’ setting, where, along with the histograms, we can see how many points are concentrated in each small hexagonal area. The darker the hexagon, the more points (observations) fall in that region, and this intuition can also be checked against the histograms plotted on the boundaries for the 2 features.

Fig. 2: Joint-plots can carry more info than simple scatter plots.

On the left, apart from the histograms of the individual features plotted on the boundaries, the contours represent the 2D kernel density estimate (KDE). KDEs are often more useful than discrete histograms alone, and you can find one fantastic explanation here.

We can also plot some pair plots to study which features are ‘more relevant’ for separating malignant from benign samples. Let’s see one example below:

Fig. 3: Pair plots of a few features in the cancer data-set. Code can be found in my GitHub.

Once we have played with the data-set enough to explore and understand what we have in hand, let’s move on to the main classification task.
