
Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP

source link: https://towardsdatascience.com/dimensionality-reduction-for-data-visualization-pca-vs-tsne-vs-umap-be4aa7b1cb29?gi=914f933ae327

Visualising a high-dimensional dataset using PCA, t-SNE and UMAP



Photo by Hin Bong Yeung on Unsplash

In this story, we will go through three dimensionality reduction techniques used specifically for data visualization: PCA (Principal Component Analysis), t-SNE and UMAP. We will explore them in detail using the Sign Language MNIST dataset, without going in depth into the maths behind the algorithms.

What is Dimensionality Reduction?

Many machine learning problems involve thousands of features, and having such a large number of features brings along many problems. The most important ones are that it:

  • Makes the training extremely slow
  • Makes it difficult to find a good solution

This is known as the curse of dimensionality. Dimensionality reduction, in simple terms, is the process of reducing the number of features to the most relevant ones.

Reducing the dimensionality does lose some information; like most compression processes, it comes with drawbacks. Even though training becomes faster, the system may perform slightly worse, but this is often acceptable: sometimes reducing the dimensionality filters out some of the noise and unnecessary detail.

Dimensionality reduction is mostly used for:

  • Data Compression
  • Noise Reduction
  • Data Classification
  • Data Visualization

One of the most important applications of dimensionality reduction is data visualization. Dropping the dimensionality down to two or three makes it possible to visualize the data on a 2D or 3D plot, so important insights can be gained by analysing patterns such as clusters.

Main Approaches for Dimensionality Reduction

There are two main approaches to reducing dimensionality: projection and manifold learning.

  • Projection: This technique projects every high-dimensional data point onto a suitable lower-dimensional subspace, in a way that approximately preserves the distances between the points.
  • Manifold Learning: Many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie; this is called manifold learning. It relies on the manifold hypothesis (or assumption), which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. In most cases this assumption is based on observation or experience rather than theory or pure logic.[4]

Now let's briefly explain the three techniques (PCA, t-SNE, UMAP) before jumping into the use case.

PCA

One of the best-known dimensionality reduction techniques is PCA (Principal Component Analysis). It works by identifying the hyperplane which lies closest to the data and then projecting the data onto that hyperplane, while retaining most of the variation in the dataset.

Principal Components

The axis that explains the maximum amount of variance in the training set is called the first principal component.

The axis orthogonal to it that explains the most remaining variance is called the second principal component. As we go to higher dimensions, PCA would find a third component orthogonal to the other two, and so on. For visualization purposes we stick to two or at most three principal components.


Source: Packt_Pub, via: Hackernoon

It is very important to choose the right hyperplane so that, when the data is projected onto it, it retains the maximum amount of information about how the original data is distributed.
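To make this concrete, here is a minimal NumPy sketch (my own illustration, not part of the use case below) of what PCA does under the hood: centre the data, take the SVD, and project onto the top-k orthogonal axes.

import numpy as np

def pca_project(X, k=2):
    # Illustrative sketch of PCA: centre the data, then project onto the
    # top-k right singular vectors (the principal components)
    X_centred = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]              # rows are the orthogonal principal axes
    return X_centred @ components.T  # coordinates in the k-dimensional subspace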

t-SNE (t-Distributed Stochastic Neighbour Embedding)

t-SNE, or t-distributed stochastic neighbour embedding, was created in 2008 by Laurens van der Maaten and Geoffrey Hinton. It is a dimensionality reduction technique particularly well suited for the visualization of high-dimensional datasets.

t-SNE takes a high-dimensional dataset and reduces it to a low-dimensional representation that retains a lot of the original information. It does so by giving each data point a location in a two- or three-dimensional map. The technique finds clusters in the data, thereby making sure that an embedding preserves the meaning in the data: t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.[2]
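As a minimal sketch of how this looks in code (a toy example of mine, not the article's dataset), scikit-learn's TSNE exposes the perplexity parameter, which roughly controls how many neighbours each point "pays attention to":

import numpy as np
from sklearn.manifold import TSNE

X_toy = np.random.rand(500, 50)          # a toy high-dimensional dataset
tsne_toy = TSNE(n_components=2,          # embed into 2D for plotting
                perplexity=30,           # roughly the effective number of neighbours
                random_state=42)         # t-SNE is stochastic; fix the seed to reproduce
X_embedded = tsne_toy.fit_transform(X_toy)  # shape (500, 2)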

For a quick visualization of this technique, refer to the animation below (it is taken from an amazing tutorial by Cyrille Rossant, which I highly recommend checking out).

link: https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/


Source: Cyrille Rossant, via O'Reilly

UMAP (Uniform Manifold Approximation and Projection)

Uniform Manifold Approximation and Projection, created in 2018 by Leland McInnes, John Healy and James Melville, is a general-purpose manifold learning and dimension reduction algorithm.

UMAP is a nonlinear dimensionality reduction method and is very effective for visualizing clusters or groups of data points and their relative proximities.

The significant difference from t-SNE is scalability: UMAP can be applied directly to sparse matrices, thereby eliminating the need to apply any dimensionality reduction such as PCA or Truncated SVD (Singular Value Decomposition) as a prior pre-processing step.[1]
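As a quick sketch of that scalability claim (illustrative only; the toy sparse matrix below is mine), umap-learn accepts a SciPy sparse matrix directly, with no PCA or Truncated SVD step in front:

import scipy.sparse as sp
import umap

# A toy sparse matrix standing in for a large, sparse feature set
X_sparse = sp.random(1000, 5000, density=0.01, format='csr', random_state=42)

# UMAP consumes the sparse input directly, without a prior dimensionality reduction step
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_sparse)
print(embedding.shape)  # (1000, 2)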

Put simply, it is similar to t-SNE but probably faster, and therefore possibly better suited for visualization (let's find out in the tutorial below).

Use Case

Now we are going to go through the above-mentioned use case where all three techniques will be applied. Specifically, we will try to visualize a high-dimensional dataset using these techniques: the Sign-Language-MNIST dataset: https://www.kaggle.com/datamunge/sign-language-mnist

(Sign-Language-MNIST Dataset), screenshot from kaggle.com
import numpy as np
import pandas as pd
import time
# For plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
#PCA
from sklearn.decomposition import PCA
#TSNE
from sklearn.manifold import TSNE
#UMAP
import umap

The Data

train = pd.read_csv('/kaggle/input/sign-language-mnist/sign_mnist_test/sign_mnist_test.csv')
train.head()


Size of the train Data
# Setting the label and the feature columns
y = train.loc[:,'label'].values
x = train.loc[:,'pixel1':].values
print(np.unique(y))
The number of unique labels

There are 24 unique labels, each representing a distinct hand sign (a letter of the sign-language alphabet).
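As a small aside (a hypothetical helper of mine, not part of the original code), the numeric labels can be mapped to letters for more readable plots; the Kaggle page notes that the labels correspond to letters A-Z, with 9 (J) and 25 (Z) absent because those signs involve motion.

import string

# Hypothetical convenience mapping: label i -> letter (0=A, 1=B, ...)
label_to_letter = {i: string.ascii_uppercase[i] for i in range(26)}
letter_labels = [label_to_letter[lbl] for lbl in y]
print(sorted(set(letter_labels)))  # 24 letters; J and Z do not appear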

# Applying PCA
start = time.time()
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))
principal = pd.DataFrame(data=principalComponents,
                         columns=['principal component 1', 'principal component 2', 'principal component 3'])
principal.shape

After applying PCA, the transformed data has only 3 features, compared to the 784 features of the original x data.

The number of dimensions has been cut down drastically whilst trying to retain as much of the ‘variation’ in the information as possible.
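To see how much of that variation actually survives the cut to three components (a small check of mine, not in the original notebook), we can inspect the fitted PCA model's explained variance:

# Fraction of the total variance captured by each of the 3 retained components,
# and their sum: how much of the original variation the 3D projection keeps
print(pca.explained_variance_ratio_)
print('Total variance retained: {:.1%}'.format(pca.explained_variance_ratio_.sum()))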

Drawbacks of PCA

The main drawback of PCA is that it is highly influenced by outliers present in the data. Moreover, PCA is a linear projection , which means it can’t capture non-linear dependencies.

PCA in 2D space

# Plotting PCA 2D
plt.style.use('dark_background')
plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through PCA', fontsize=24);
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')


Image by Author

From the 2D plot, we can see the two components definitely hold some information, especially for specific signs, but clearly not enough to set all of them apart.

PCA in 3D space

# Plotting PCA 3D
ax = plt.figure(figsize=(12,10)).gca(projection='3d')
ax.scatter(
xs=principalComponents[:, 0],
ys=principalComponents[:, 1],
zs=principalComponents[:, 2],
c=y,
cmap='gist_rainbow'
)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('Visualizing sign-language-mnist through PCA in 3D', fontsize=24);
plt.show()


Image by Author

t-SNE with Scikit learn

One thing to note is that t-SNE is very computationally expensive; hence its documentation mentions that:

“It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.”[2]

start = time.time()
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(x)
tsne = TSNE(random_state=42, n_components=3, verbose=0, perplexity=40, n_iter=400).fit_transform(pca_result_50)
print('Duration: {} seconds'.format(time.time() - start))

Thus, I applied PCA first, choosing to retain 50 principal components from the original data, to cut down on the processing power and time it would take to compute the dimensionality reduction if we had used the original data directly.
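As a quick sanity check (again, an addition of mine rather than part of the original notebook), the fitted pca_50 object can report how much of the original variance those 50 components still carry:

# How much of the original variance the 50-component reduction keeps before t-SNE
print('Variance retained by 50 components: {:.1%}'.format(pca_50.explained_variance_ratio_.sum()))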

The speed of the three techniques will be analysed and compared in detail in the sections further down.

T-SNE in 2D space

#Visualising t-SNE 2D
fig = plt.figure(figsize=(12,8))
plt.scatter(tsne[:, 0], tsne[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.title('Visualizing sign-language-mnist through t-SNE in 2D', fontsize=24);
plt.xlabel('tsne_1')
plt.ylabel('tsne_2')


Image by Author

T-SNE in 3D space

#Visualising t-SNE 3D
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(tsne[:, 0], tsne[:, 1],tsne[:,2], c=y, cmap='gist_rainbow')
ax.set_xlabel('tsne_1')
ax.set_ylabel('tsne_2')
ax.set_zlabel('tsne_3')
plt.title('Visualizing sign-language-mnist through TSNE in 3D', fontsize=24);
plt.show()


Image by Author

Implementing UMAP

UMAP has different hyperparameters that can have an impact on the resulting embeddings:

  • n_neighbors

This parameter controls how UMAP balances local versus global structure in the data. Low values of n_neighbors force UMAP to focus on very local structure, while higher values make it focus on larger neighbourhoods.

  • min_dist

This parameter controls how tightly UMAP is allowed to pack points together. Lower values mean the points will be clustered closely and vice versa.

  • n_components

This parameter allows the user to determine the dimensionality of the reduced dimension space.

  • metric

This parameter controls how distance is computed in the ambient space of the input data.

For more detailed information, I suggest checking out the UMAP documentation:

https://umap-learn.readthedocs.io/en/latest/

UMAP (default settings)

For this tutorial, I have chosen to keep the default settings, apart from n_components, which I set to 3 for the 3D plot. It would be best to experiment with different hyperparameter settings to get the best out of the algorithm.
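For illustration only (the values below are arbitrary, not recommendations; the actual run that follows keeps the defaults), this is how the hyperparameters described above would be passed:

# Illustrative, non-default settings for the hyperparameters described above
reducer_tuned = umap.UMAP(n_neighbors=30,      # larger -> more emphasis on global structure
                          min_dist=0.25,       # larger -> points packed less tightly
                          n_components=3,      # dimensionality of the embedding
                          metric='euclidean',  # distance used in the original space
                          random_state=42)
# embedding_tuned = reducer_tuned.fit_transform(x)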

start = time.time()
reducer = umap.UMAP(random_state=42,n_components=3)
embedding = reducer.fit_transform(x)
print('Duration: {} seconds'.format(time.time() - start))

UMAP in 2D space

# Visualising UMAP in 2d
fig = plt.figure(figsize=(12,8))
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], c=y, cmap='gist_rainbow')
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(24)).set_ticks(np.arange(24))
plt.xlabel('umap_1')
plt.ylabel('umap_2')
plt.title('Visualizing sign-language-mnist with UMAP in 2D', fontsize=24);


Image by Author

We can clearly see that UMAP does a great job of separating the signs compared to t-SNE and PCA, already in 2D space.

UMAP in 3D space

# Visualising UMAP in 3d
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1],reducer.embedding_[:, 2], c=y, cmap='gist_rainbow')
ax.set_xlabel('umap_1')
ax.set_ylabel('umap_2')
ax.set_zlabel('umap_3')
plt.title('Visualizing sign-language-mnist through UMAP in 3D', fontsize=24);
plt.show()


Image by Author

Comparison between the Dimension Reduction Techniques: PCA vs t-SNE vs UMAP


PCA (top row) vs t-SNE (middle row) vs UMAP (bottom row), Image by Author

By comparing the visualisations produced by the three models, we can see that PCA was not able to do a good job of differentiating the signs. This is mainly because PCA is a linear projection, which means it can't capture non-linear dependencies.

t-SNE does a better job than PCA when it comes to visualising high-dimensional datasets. Similar hand signs are clustered together, even though there are big agglomerations of data points on top of each other from the 2D perspective.

UMAP outperformed the other two techniques by a reasonable margin. If we look at the 2D and 3D plots, we can clearly see that the signs are separated very well compared to the first two techniques. If we applied a clustering algorithm to this embedding, we would be able to assign labels to the clusters.

In terms of speed, UMAP is much faster than t-SNE. Another problem faced by t-SNE is the need for a prior dimensionality reduction step, otherwise it would take even longer to compute, so we can confidently state that UMAP is much faster than t-SNE. PCA is the fastest of them all; however, it does not do a very good job.

Comparison of the speed (computation times)

Note that the above table was constructed from the computation times measured on a Kaggle kernel using its GPU.

UMAP can also be used for preprocessing, while t-SNE does not have major uses outside visualisation. This means that UMAP can often provide a better "big picture" view of the data while also preserving local neighbour relations.[3]
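A minimal sketch of that preprocessing use (my own illustration, assuming a scikit-learn classifier on top): fit UMAP on a training split, transform both splits with the learned mapping, and train the classifier on the embedding.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_tr, X_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit UMAP on the training split only, then reuse the learned mapping on the test split
pre = umap.UMAP(n_components=10, random_state=42).fit(X_tr)
clf = KNeighborsClassifier().fit(pre.transform(X_tr), y_tr)
print('Accuracy on held-out data:', clf.score(pre.transform(X_te), y_te))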

Summary

We have explored three dimensionality reduction techniques for data visualization (PCA, t-SNE, UMAP) and used them to visualize a high-dimensional dataset in 2D and 3D plots.

Based on this tutorial, for this particular use case we can say that:

  • PCA did not work very well in separating the different signs (24). Instead of arbitrarily fixing the number of dimensions to 3, it is usually better to choose the number of dimensions that adds up to a sufficiently large proportion of the variance; but since this is a data visualization problem, fixing it to 3 was the most reasonable thing to do (see the sketch after this list).
  • t-SNE managed to do a better job of separating the clusters; the visualization in 2D and 3D was definitely better than PCA. However, it took a very long time to compute its embeddings.
  • UMAP turned out to be the most effective manifold learning technique in terms of displaying the different clusters, some of which were very well defined, and it was significantly faster than the t-SNE implementation.
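For completeness, here is what choosing the dimensionality by explained variance rather than by a fixed number looks like with scikit-learn (a sketch, not part of the original notebook): passing a float to n_components keeps just enough components to reach that fraction of the variance.

# Keep as many components as are needed to explain 95% of the variance
pca_var = PCA(n_components=0.95)
x_reduced = pca_var.fit_transform(x)
print(x_reduced.shape[1], 'components retained for 95% of the variance')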

References

[1] McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.

[2] van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding

[3] Kaggle.com (2020). Visualizing Kannada MNIST with t-SNE. Available at: https://www.kaggle.com/parulpandey/visualizing-kannada-mnist-with-t-sne

[4] Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron

