
The most gentle introduction to Principal Component Analysis


In the last months, I’ve been spending a considerable amount of time revisiting some of my forgotten maths knowledge. This allowed me to finally get my head around some key concepts in Machine Learning that I had kind of understood before, but never deeply enough to explain them to someone else in a clear way. So today, we’re going to test myself and see how clearly I can explain a concept such as Principal Component Analysis.

The objective of this story is to give a strong intuition about how it works, without getting too much into the nitty-gritty mathematical details. However, explaining some of the basic concepts will be unavoidable. I’d love to read your feedback in the comments section, since the idea is to continue writing this kind of story, covering key concepts in Machine Learning for data enthusiasts who don’t yet have a solid maths background.

Let’s start by understanding what Principal Component Analysis, or PCA as we’ll call it from now on, actually is. From Wikipedia, PCA is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. In simpler words, PCA is often used to simplify data, reduce noise, and find unmeasured “latent variables”. This means that PCA will help us find a reduced number of features that represent our original dataset in a compressed way, capturing up to a certain portion of its variance depending on the number of new features we end up selecting. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component, in turn, has the highest variance possible while being uncorrelated with the preceding components.

A few important things were mentioned in the previous paragraph, but before getting into that, let’s make a quick review of some fundamental concepts.

Throughout this story, we’ll be using a subset of a dataset available at Kaggle and compiled by professors at Columbia Business School trying to find an answer to the question What influences love at first sight? Data was gathered from participants in experimental speed dating events from 2002–2004. During the events, the attendees would have a four-minute “first date” with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

Our dataset, called ‘sd’ (an acronym for speed dating), contains in total 551 rows and the following 29 features:

sd.columns

Index(['subject_id', 'wave', 'like_sports', 'like_tvsports', 'like_exercise',
       'like_food', 'like_museums', 'like_art', 'like_hiking', …

As you can see, these columns represent interests and perceptions, and some of them might be related to each other: for example, enjoying music and movies, or liking sports and tv sports. That’s why, even though 29 columns is not much, our objective by the end of this story will be to have reduced the total number of columns to a handful of uncorrelated features explaining most of these relationships between interests and perceptions. I.e. we want to end up with a smaller set of features explaining the original relationships between the initial columns of our dataset. And we’ll do that using linear algebra and PCA.

Let’s start by remembering that whenever we’re dealing with a dataset, what we really have is just a big matrix containing one vector per feature.

As mentioned before, the features of our dataset are going to be more or less correlated, i.e. if we were to plot pairs of these vectors from our feature matrix, some of them would present similar trends, some might go in opposite directions, some might be almost identical and, of course, we could find anything in between. This measure of the strength and direction of the relationship between two vectors is the correlation.

If we plot two features against each other on a scatterplot, the data points will cluster tightly around a straight line when the correlation is strong. And the closer the final value is to −1 or 1, the stronger the relationship between those features:

  • −1 Strong inverse relationship
  • +1 Strong direct relationship

When the correlation is weak, the data points are spread further apart from any such line. The closer the correlation is to 0, the weaker the relationship:

The equation of the correlation between two variables looks messy, but it’s actually quite simple:
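Here it is written out (this is the standard Pearson correlation coefficient):

r = \frac{\sum_i (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_i (X_i - \bar{x})^2}\,\sqrt{\sum_i (Y_i - \bar{y})^2}}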

Where:

  • Xi and Yi are each of the elements of our vectors X and Y,
  • and x-bar and y-bar are the means of the vectors X and Y respectively

If we measured the correlation between all features in our dataset, we’d end up with an n×n matrix, where n is the total number of features in our dataset and the diagonal represents the correlation of each feature against itself (which is always 1). You can find this matrix easily in Python using pandas:

sd.corr()

We can easily see some features are actually quite related. For example, and probably unexpectedly, like_exercise and like_tvsports show a correlation of 0.489…quite strong! While some others, such as like_reading and like_hiking, have a very weak correlation.

In a very similar way, instead of the correlation we could be interested in the covariance between two variables. While the correlation tells us about both the direction and the magnitude of how two features vary with each other, the covariance gives us just the direction in which two quantities vary together, i.e. if a certain variable has a positive covariance with respect to another, they increase or decrease together. In our example above, the covariance between like_tvsports and like_sports is 3.91, while the covariance between like_shopping and like_hiking is -0.37.
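For reference, the standard formula for the sample covariance between two vectors X and Y is:

\text{cov}(X, Y) = \frac{1}{n - 1} \sum_i (X_i - \bar{x})(Y_i - \bar{y})

It shares its core with the correlation formula above; in fact, the correlation is just the covariance divided by the product of the two standard deviations, which is why the covariance keeps the direction of the relationship but not a normalised magnitude.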

Cool, let’s leave this concept on hold for a while now in order to get a better understanding of two key concepts in PCA: eigenvalues and eigenvectors.

Let’s start from the basics: matrices and vectors are of course much more than just another way of representing our dataset. It would be impossible to go through all the uses of matrices and vectors in this story, but one of them is particularly important for the sake of understanding eigenvalues and eigenvectors: transformations.

Short story long, a vector points out how a feature or a variable spans a certain space (and yes, before any outraged mathematician jumps in, I know that’s not the perfect technical definition…just trying to keep it simple). For example, suppose I’m buying oranges and apples and, for some reason, I decide to make two visits to the supermarket. On the first one I buy 2 oranges and 3 apples, while the second time I buy 10 oranges and just 1 apple. If we add the price of those apples and oranges into the analysis, this could be represented in several ways:

  1. As individual functions
  2. As individual vectors
  3. As a matrix

If we knew that the prices of those oranges and apples were 3 and 2 respectively, we’d now have a vector of prices, which we could use in a matrix multiplication to find our total spending:

Therefore, our set of equations, i.e. our matrix, is asking: what price vector do I need in order to get a transformed product at the position [12, 32]? In other words, our matrix is just making a transformation of our original price vector [3, 2] into our total spending vector [12, 32].
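If you want to check the arithmetic yourself, here’s a quick numpy sketch (numpy isn’t used anywhere else in this story, so treat it purely as an illustration):

import numpy as np

quantities = np.array([[2, 3],
                       [10, 1]])   # rows are the two visits, columns are [oranges, apples]
prices = np.array([3, 2])          # price of an orange and of an apple

print(quantities @ prices)         # [12 32] -> total spending per visit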

This is also part of what happens behind the scenes when we fit any kind of linear regression in Machine Learning: we’re just trying to find a vector of coefficients that, when multiplied by our feature matrix, gives us the best possible approximation to our target variable vector. However, without getting into much detail of what happens in such complex cases, there are certain kinds of transformations that a matrix can perform over a vector that are important to know:

  • A matrix with numbers along the diagonal and zeros everywhere else is called a stretching matrix, and it performs a stretch transformation: numbers greater than 1 along the diagonal will effectively stretch every vector in the space, a negative number will flip the corresponding axis around, and a fraction anywhere along the diagonal will squeeze everything. Let’s see an example using what are known as basis vectors, i.e. those vectors spanning along the points [1,0] and [0,1] (see the numpy sketch after this list):
  • Another kind of transformation we can find is a shearing , where all points along a given line remain fixed while other points are shifted:
  • We can also have rotations :
  • And then there’s a matrix that doesn’t do anything at all, one whose columns are just the basis vectors mentioned before: the identity matrix:
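To make those bullet points concrete, here’s a small numpy sketch (my own example, not from the original article) applying each kind of transformation to the basis vectors:

import numpy as np

basis = np.array([[1, 0],
                  [0, 1]])              # columns are the basis vectors [1,0] and [0,1]

stretch = np.array([[2, 0],
                    [0, 3]])            # stretches the x axis by 2 and the y axis by 3
shear = np.array([[1, 1],
                  [0, 1]])              # points along the x axis stay fixed, everything else shifts
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # 90-degree rotation
identity = np.eye(2)                    # does nothing at all

print(stretch @ basis)     # [[2 0] [0 3]]
print(shear @ basis)       # [[1 1] [0 1]]
print(rotation @ basis)    # [[0 -1] [1 0]] (up to floating point error)
print(identity @ basis)    # [[1 0] [0 1]]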

We could then have combinations of all these transformations, and even more complex cases with several dimensions. However, this ground knowledge should be enough to keep going for now. We have been building up our algebra understanding, and now we’re ready to finally talk about eigenvectors and eigenvalues. For that, however, it’s important to understand that even though we have been speaking of transformations as something applied to just one vector, they are usually applied to every vector in a space. And the key to understanding Principal Component Analysis is that even though some vectors will completely change their position, some others will land on the same line they started on. Take the following example:

As you can see on the image above, the horizontal vector keeps the same length and points in the same direction. The vertical vector points in the same direction, but its length has changed. And the diagonal vector completely changed both its angle and length. The horizontal and vertical vectors will then be the eigenvectors of this transformation. Moreover, given that the horizontal vector didn’t change at all, we can say it has an eigenvalue of 1, whereas the vertical one increased its length, so it could have an associated eigenvalue of, for example, 2.

Can you spot the eigenvector in the following shear transformation?

Just the horizontal vector did not change. So in this case we have just one eigenvector.

Let’s quickly wrap up before continuing:

  • Eigenvectors: those vectors which lie along the same span both before and after a transformation
  • Eigenvalues: the amount by which each of those vectors has been stretched or squeezed

Well, this is great! The concept of what eigenvectors and eigenvalues are should be quite clear by now. At least by inspection, we can easily spot them in a graph. Though, of course, only as long as we’re dealing with simple cases like the ones we just saw. Imagine now that instead of doing 2D operations, we had a multidimensional space. Scaling, shears and all the transformations we mentioned are going to operate in a similar way, but they are going to be harder or even impossible to spot by inspection. Moreover, we could have a combination of transformations, like a shear plus a rotation.

We’ll need a new method for spotting our eigenvectors and eigenvalues. Let’s try moving into an equation form to generalise what we found to all kinds of vectors and matrices:
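In symbols, this is the standard eigenvector equation:

A X = \lambda X

where λ is just a number: the eigenvalue associated with the eigenvector X.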

In the equation above, A is an n-dimensional matrix, meaning that it is an n×n matrix, and therefore X has to be an n-dimensional vector. Remember one of the examples we saw before:

We could now rearrange this equation. But for that, we’ll have to use one of the concepts we mentioned before: the identity matrix. That matrix that doesn’t perform any transformation at all:
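Using the identity matrix, the standard rearrangement goes:

A X - \lambda I X = 0 \;\Rightarrow\; (A - \lambda I) X = 0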

Now, we’re not interested in the case where the vector X is zero, since that is when it has no length or direction. So instead we’ll look for the part in between brackets, A − λI, to squash X down to zero, and only a very particular kind of matrix can do that.

To move forward we’ll have to introduce a new concept: the determinant of a matrix. We could talk extensively about this concept, since it’s part of several tools in Machine Learning and Data Science. However, the main idea is that the determinant is a number that can be computed from the elements of a square matrix, i.e. a matrix with the same number of rows and columns, and it encodes certain properties of the linear transformation described by that matrix. Specifically, it shows us how the linear transformation changes area or volume. Take the following example:

Did you see what’s happening? Everything got bigger by a factor of a·d. So, for example, a determinant of 2 would tell us that a matrix operation doubles the area or volume of our vector space. Cool, now we’re ready to understand where the determinant sits in our previous equation:
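In standard form, that condition is:

\det(A - \lambda I) = 0

and, for reference, the determinant of a general 2×2 matrix is:

\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc

which for the diagonal example above reduces to a·d.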

Remember we said that we’ll look for the part in between brackets of this equation to squash X down to zero. And do you notice what’s happening there? Nothing more than a matrix operation: since we’re multiplying our scalar λ by the identity matrix, which has ones along the diagonal and zeros everywhere else, in the end we are just subtracting from our matrix A another matrix with our λ term along the diagonal. We can then rewrite our equation above in matrix form, using the determinant to solve our problem:
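For a generic 2×2 matrix with entries a, b, c, d, that looks like:

\det\left( \begin{pmatrix} a & b \\ c & d \end{pmatrix} - \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) = \det\begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix} = 0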

Subtracting these two matrices just gives us another matrix, and that matrix can only squash a non-zero vector down to zero when its determinant is zero, i.e. when it collapses area or volume down to nothing. Evaluating the determinant above, we end up with the following equation:
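Sticking with the generic 2×2 entries from above:

(a - \lambda)(d - \lambda) - bc = 0 \;\Leftrightarrow\; \lambda^2 - (a + d)\lambda + (ad - bc) = 0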

This equation is known as the characteristic polynomial. However, don’t worry: you just need to know for now that our eigenvalues are simply the solutions of this equation, and that we can then plug those eigenvalues back into our original equation to find the eigenvectors. Let’s go through a concrete example:
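A matrix consistent with the eigenvalues used below (the specific choice of matrix here is mine) is the vertical-stretch matrix:

A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \qquad (1 - \lambda)(2 - \lambda) - 0 \cdot 0 = 0 \;\Rightarrow\; \lambda = 1 \text{ or } \lambda = 2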

We can now plug λ = 1 and λ = 2 into our original equation and find our eigenvectors:
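Continuing with the matrix assumed above:

\lambda = 1: \; (A - I)X = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;\Rightarrow\; X_2 = 0

\lambda = 2: \; (A - 2I)X = \begin{pmatrix} -1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;\Rightarrow\; X_1 = 0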

What does all this tell us? Well, in the case where λ = 1, according to our final expression, we need X2 to be zero, but we don’t know anything about X1. This is because any vector pointing along the horizontal axis could be an eigenvector of this system, i.e. any vector with X2 equal to zero and any value whatsoever for X1 will be an eigenvector. The result is very similar for λ = 2, but the other way around: we could take any value for X2 while keeping X1 equal to zero. See the following example:
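In practice you wouldn’t solve this by hand; numpy can do the eigen-decomposition for you (again assuming the example matrix above):

import numpy as np

A = np.array([[1, 0],
              [0, 2]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [1. 2.]
print(eigenvectors)   # the columns are the eigenvectors: [1, 0] and [0, 1]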

Wow, this has been quite a journey! All the way from the main concepts up to actually finding the eigenvalues and eigenvectors of a transformation. However, we still have a couple more things left to tie up. Do you remember way back at the beginning when we talked about correlation and covariance? We can easily get the covariance matrix from the love-at-first-sight dataset by using the following command:

sd.cov()

In the following example I’m dropping the subject_id and wave columns first:

sd.drop(labels=['wave', 'subject_id'], axis=1).cov()

As a reference, in PCA we use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales . However, the Sklearn algorithm we’ll see uses the covariance matrix by default, so you must apply some standardisation before proceeding if your features have different scales. For example, having weight and height within your data. Unfortunately, I just can’t cover standardisation in this story, but I’ll leave you this other story from Clar Liu also in Towards Data Science if you’re interested in learning more about it.

Now, here’s the cool thing: since our covariance matrix is a square matrix, it can be used to do our “eigenmagic” and find the eigenvectors and eigenvalues. Take the following case of a scenario with only two variables/features:
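As a stand-in for that two-variable case, here’s a sketch doing the “eigenmagic” on the covariance matrix of two of our columns (the choice of columns is mine, and the printed numbers will depend on the actual data):

import numpy as np

two_features = sd[['like_sports', 'like_tvsports']]    # any two columns would do
cov_matrix = two_features.cov().values                 # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print(eigenvalues)    # variance along each principal direction
print(eigenvectors)   # the columns are the principal directions (eigenvectors)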

Great, we already have everything we need to understand what’s happening underneath Principal Component Analysis. However, the whole PCA process doesn’t end once we’ve found our eigenvectors and eigenvalues. The final step is to build a matrix with all or some of the eigenvectors we found, and then do something called a projection, to transform our original dataset into the new vector space we just found. In other words, we’ll use the eigenvectors to transform our samples into a smaller set of features expressing as much of the variance of the original data as possible, where the largest eigenvalue we found corresponds to the largest explained variance, the second largest to the second largest variance, and so on. In the end, we’ll have as many features as eigenvalues and eigenvectors we end up choosing. If the first two eigenvalues seem to explain most of the variance in our dataset, then we probably won’t need to choose more than those two.

Luckily for us, the implementation of all this in Sklearn is super simple and we don’t actually need to cover projections in this story. However, if you’re interested in learning more about them, I recommend going through this set of videos from Khan Academy.
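That said, for the curious, here’s a minimal numpy sketch of what the projection step boils down to (the variable names are mine, all columns of sd are assumed numeric, and Sklearn takes care of all of this for us below):

import numpy as np

# Centre the data and get the covariance matrix plus its eigen-decomposition
X = sd.values.astype(float)
X_centred = X - X.mean(axis=0)
cov_matrix = np.cov(X_centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)   # eigh: for symmetric matrices

# Sort the eigenvectors by decreasing eigenvalue and keep the top 5
order = np.argsort(eigenvalues)[::-1]
top_5 = eigenvectors[:, order[:5]]

# Project: every row of the original data expressed in the new 5-feature space
components = X_centred @ top_5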

Let’s finish this super long story by quickly seeing how to use the PCA module from Sklearn! First, imports:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In my case, all my features are on the same scale, so I don’t need to transform my data. However, if that’s not your case, you can do that by simply running the following:

import pandas as pd

scaler = StandardScaler()
transformed_data = scaler.fit_transform(data)
data_transformed = pd.DataFrame(transformed_data, columns=data.columns)

Cool, now we only need two lines of code to make our Principal Component Analysis:

sd_pca = PCA(n_components=5)
sd_pca.fit(sd)

As you can see, even though we could find as many components as features we have, Sklearn allows us to specify the number of components we’ll want to keep in order to speed up the computation.

We can now inspect the components, and the explained_variance_ratio_ attribute will tell us how much of the variance in the original data is encapsulated in the new component variables.

sd_exp_var_eigenvals = sd_pca.explained_variance_
sd_exp_var_pct = sd_pca.explained_variance_ratio_

print(sd_exp_var_eigenvals)
print(sd_exp_var_pct)

[2.16041482 0.88925024 0.73820887 0.72229813 0.49907009]
[0.43128576 0.17752191 0.14736937 0.14419309 0.09962986]

Finally, the transform function in the PCA will create your new component variable matrix:

components_matrix = sd_pca.transform(sd)

We’ll end up with just 5 vectors reducing the dimensionality of our data while minimising loss, with the first three components alone accounting for 75.6% of the variance within our original dataset (0.431 + 0.178 + 0.147 ≈ 0.756).
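A handy one-liner for this kind of check (not in the original code above, but it uses the same fitted PCA object) is the cumulative sum of the ratios:

print(sd_pca.explained_variance_ratio_.cumsum())   # the k-th entry is the variance share captured by the first k components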

Well, I reckon this is more than enough by now. As always, I’m eager to read your comments and feedback.

Don’t forget to take a look at some of my other stories:

Or just visit my profile on Medium and check out any of my other stories.

See you around, and thanks for reading!

