Principal Component Analysis — Math and Intuition (Post 3)


As promised, this is the third and last post on Principal Component Analysis — Math and Intuition. We had a brief introduction to PCA in Post 1 with a real world example and grasped the intuition. In Post 2, we learned some of the most important concepts in Linear Algebra relevant to PCA and to various other data science applications. With all the hard work done, it is now time to use our solid mathematical framework and connect the dots to really understand how and why PCA works.

PCA is a dimensionality reduction technique. It simplifies a complex dataset, making it computationally tractable. There is no feature elimination involved; rather, PCA works by extracting the significant bits of information from all features in the original dataset and creating a smaller number of new features. In simple words, if you have a dataset with n dimensions (n features), applying PCA reduces it to a k-dimensional feature space where k < n.
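
To make this concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available, of reducing a made-up 10-feature dataset to 3 new features. Note that scikit-learn expects samples as rows and features as columns; the random data and the choice of k = 3 are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples x 10 features (made-up data)

pca = PCA(n_components=3)             # keep k = 3 new features
X_reduced = pca.fit_transform(X)      # shape (100, 3)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of the total variance each new feature keeps
```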

Let us use an example here. You have a dataset on hotel rooms listing the room areas and their respective prices.

[Table: hotel rooms with Room area (sq ft) and Room price ($)]

The 2-dimensional feature space can be plotted as shown below.

[Figure: scatter plot of Room price ($) vs Room area (sq ft)]

The goal is to reduce the 2 dimensions that are represented as x-axis and y-axis respectively, into one.

Note that I have chosen a 2D dataset because it is easy to visualise. In a real world scenario, you may have thousands of features or more, and therefore the need for dimensionality reduction. Also, please do not confuse the above plot with linear regression; although it looks similar, it is an entirely different concept. There is no prediction going on here.

Coming back to PCA, we would like to simplify our original dataset, which is described on a 2D basis. Recall the concept of basis change that we encountered in Post 2. The question we need to ask here is,

Is there a basis, which is a linear combination of the old basis, that best describes the data?

Mathematically, let X and Y be two m × n matrices related by a transformation P. Matrix X is the original dataset, with its rows holding the original features and its columns holding the individual samples; Y is the same data re-expressed in terms of new features, or principal components.

PX = Y
  • P is a matrix that transforms X into Y.
  • Rows of X are the original features; its columns are the individual data points expressed in the old basis.
  • Rows of transformation matrix P are a set of new basis vectors for expressing the columns of X.
  • Matrix Y is a re-representation of X, its rows being the new features or principal components.

Now that we have a mathematical representation of our goal, we would like to know what the ideal choice for the transformation matrix P is. That depends on the properties we want the features of matrix Y to exhibit. This question will be answered in the following sections on Variance and Redundancy. Let us deal with them individually.
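
Before turning to those sections, here is a small NumPy sketch of the change of basis itself, in the convention above (features as rows of X, samples as columns). The numbers and the 45-degree rotation used for P are made up; choosing the right P is exactly what the next sections work out.

```python
import numpy as np

# Hypothetical data: X is m x n with m = 2 original features (rows) and n = 5 samples (columns).
X = np.array([[300.0, 450.0, 520.0, 610.0, 700.0],    # room area (sq ft)
              [ 90.0, 130.0, 150.0, 175.0, 200.0]])   # room price ($)

# A made-up orthonormal basis P (a 45-degree rotation). PCA will later tell us
# which P to pick; any orthonormal P already defines a valid re-expression.
theta = np.pi / 4
P = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

Y = P @ X                        # every column of X re-expressed in the new basis
print(Y.shape)                   # (2, 5): same data, new coordinates
print(np.allclose(P.T @ Y, X))   # True: P is orthonormal, so P^T undoes the change of basis
```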

Projection error and Variance

The 2D feature space of our example dataset is shown again below:

[Figure: scatter plot of Room price ($) vs Room area (sq ft)]

We want to re-express this data in one dimension, or compress the two features into one. Thus, all we need to do is draw a line passing through the data cloud and project each data point onto the line. Of all the possible lines we could draw, I have represented two, in black and orange.

[Figure: the data cloud with two candidate lines drawn through it, one black and one orange]

Which line do you think best represents the data? I hope you chose the black line, as it is closer to most of the data points (see figure below). The point where a red tick intersects the black/orange line is the projection of the corresponding blue data point. Pay attention to the projection distances, or the size of the red ticks, that connect each data point to the black and orange lines respectively. The black line minimizes the cumulative projection distances of the data points. In other words, the black vector minimizes the projection error, or information loss, as we move from representing our data in 2D to 1D.

[Figure: data points projected onto the black and orange lines, with projection distances shown as red ticks]

It should be noted that the variance, or the ‘spread’ of the data, is maximum along the black line. If this interpretation is not very obvious, the following animation may help. It shows how ‘maximum variance’ and ‘minimum projection error’ are reached at the same time, namely along the magenta ticks on either side of the data cloud.

[Animation: a line rotating through the data cloud; the projected points spread out the most exactly when the projection error is smallest]
Courtesy: Making sense of principal component analysis, eigenvectors & eigenvalues

Thus, a desirable property of matrix Y is that its first principal component should lie along the line that minimizes projection error while simultaneously maximizing the variance of the projected data.
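
The equivalence of ‘maximum variance’ and ‘minimum projection error’ can also be checked numerically. The sketch below, on synthetic data loosely resembling our example (the data generation is invented purely for illustration), sweeps candidate directions and reports where each criterion is optimized.

```python
import numpy as np

rng = np.random.default_rng(1)
area = rng.normal(500, 100, 200)               # made-up room areas
price = 0.3 * area + rng.normal(0, 10, 200)    # made-up, roughly linear prices
data = np.column_stack([area, price])
data = data - data.mean(axis=0)                # center the cloud

angles = np.linspace(0, np.pi, 1800, endpoint=False)
variances, errors = [], []
for theta in angles:
    u = np.array([np.cos(theta), np.sin(theta)])        # unit vector along a candidate line
    coords = data @ u                                    # 1D coordinates of the projections
    variances.append(coords.var())
    residuals = data - np.outer(coords, u)               # each point minus its projection
    errors.append((residuals ** 2).sum(axis=1).mean())   # mean squared projection distance

variances, errors = np.array(variances), np.array(errors)
print(np.degrees(angles[variances.argmax()]))   # direction of maximum variance
print(np.degrees(angles[errors.argmin()]))      # direction of minimum projection error
# The two angles coincide: for centered data, the mean squared distance from the origin
# equals the variance along the line plus the projection error, so maximizing one term
# of this fixed sum minimizes the other.
```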

Redundancy

PCA exploits the inherent redundancy in a dataset to reduce dimensions. Consider the following plots of 2D feature spaces covering the possible spectrum of data redundancies.

[Figure: three 2D feature spaces, labelled A, B and C, showing increasing degrees of redundancy]

Figure A plots Staff salaries vs Room area (sq ft), which are uncorrelated with each other. The two dimensions in Figure A do not exhibit any redundancy and cannot be reduced by PCA. At the other extreme, Figure C is a plot of Room area in square metres vs Room area in square feet. There is complete correlation between the two features, so in this scenario it is safe to eliminate one of the two, as they both essentially give the same information.

An ideal scenario where the role of PCA is appreciated is Figure B, which is the same plot as our previous example, Room price ($) vs Room area (sq ft). Figure B shows some correlation between the two features, indicating that the data can be re-expressed by a new feature that is a linear combination of the old features. Thus, when we change our basis from 2D to 1D by projecting each data point onto the black line as shown previously, we are also eliminating feature redundancy. As you can observe in Figure B, the data points become uncorrelated precisely along the black line passing through the data cloud. The same is demonstrated in the reoriented figure below.

[Figure: the same data viewed along the new basis; the points show no correlation along the black line]
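
One quick way to see these three regimes is to compute the correlation coefficient between each pair of features. The sketch below does this on synthetic stand-ins for Figures A, B and C; all the numbers (the salaries, the 0.8 slope, the noise level) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
area_sqft = rng.normal(500, 100, n)

salary   = rng.normal(3000, 400, n)                 # Figure A: unrelated to room area
price    = 0.8 * area_sqft + rng.normal(0, 40, n)   # Figure B: partially correlated
area_sqm = area_sqft * 0.092903                     # Figure C: perfectly correlated

print(np.corrcoef(area_sqft, salary)[0, 1])    # ~0: no redundancy to exploit
print(np.corrcoef(area_sqft, price)[0, 1])     # clearly positive: some redundancy
print(np.corrcoef(area_sqft, area_sqm)[0, 1])  # ~1: one of the two features is enough
```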

Variance is the spread of the data for one variable, whereas covariance is a measure of how two variables vary together. If we denote the features Room area (sq ft) and Room price ($) as variables x and y respectively,

Var(x) = Σ (xᵢ − x̄)² / (n − 1)
Var(y) = Σ (yᵢ − ȳ)² / (n − 1)
Cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

and the covariance matrix can be computed as follows:

S = | Var(x)      Cov(x, y) |
    | Cov(x, y)   Var(y)    |

Note that I haven’t gone into the details of the variance and covariance calculations, as they are straightforward. The take-home message is that a covariance matrix is always symmetric, with the entries on the main diagonal being the variances of the individual variables. All the other entries are the covariances of each pair of variables.
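
As a quick check, NumPy’s np.cov reproduces exactly this structure; the room areas and prices below are made-up numbers.

```python
import numpy as np

x = np.array([400.0, 450.0, 500.0, 550.0, 600.0])   # room area (sq ft), made up
y = np.array([110.0, 130.0, 155.0, 160.0, 185.0])   # room price ($), made up

S = np.cov(x, y)                 # 2 x 2 covariance matrix (divides by n - 1)
print(S)
print(np.allclose(S, S.T))       # True: a covariance matrix is symmetric
print(S[0, 0], x.var(ddof=1))    # diagonal entries are the variances
print(S[1, 1], y.var(ddof=1))
print(S[0, 1], S[1, 0])          # off-diagonal entries are the covariance of the pair
```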

Coming back to our example of Room area (sq ft) vs Room price ($): once we change the basis and reduce dimensions from 2D to 1D, the features become uncorrelated with each other, or in other words their covariance is 0. The covariance matrix is therefore a diagonal matrix.

[Figure: the covariance matrix of the transformed data, with the variances on the diagonal and zeros everywhere else]
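
This claim can be verified numerically. The sketch below builds a synthetic 2D cloud, takes the orthonormal eigenvectors of its covariance matrix as the rows of P (anticipating the result derived below), and checks the covariance of the transformed data; the data itself is made up.

```python
import numpy as np

rng = np.random.default_rng(3)
area = rng.normal(500, 100, 300)              # made-up room areas
price = 0.3 * area + rng.normal(0, 10, 300)   # made-up room prices
X = np.vstack([area, price])                  # 2 x n, features as rows
X = X - X.mean(axis=1, keepdims=True)         # center each feature

S_x = np.cov(X)                               # covariance matrix of the original features
eigvals, eigvecs = np.linalg.eigh(S_x)        # symmetric matrix => orthonormal eigenvectors
P = eigvecs.T                                 # rows of P = eigenvectors of S_x

Y = P @ X                                     # change of basis
S_y = np.cov(Y)
print(np.round(S_y, 6))                       # off-diagonal entries ~ 0: the new features are uncorrelated
```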

Summarizing, the properties that we want the features/principal components of matrix Y to exhibit are:

  • The principal component should be along the direction that maximizes the variance of projected data.
  • Features of matrix Y should be uncorrelated with each other, i.e. its covariance matrix should be a diagonal matrix.

Let us revisit the mathematical representation of the goal of PCA that we derived earlier:

PX = Y

X is the original dataset and P is a transformation matrix applied to it. Matrix Y is the new dataset whose rows are the new features, or principal components. We have established the properties we want these features to have. The goal is to reduce redundancy, or more precisely, to make the covariance matrix of Y (let us call it S_Y) diagonal. Therefore, in our equation PX = Y, the choice of matrix P should be such that S_Y is diagonalized.

We know that a symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors. Recall the Spectral theorem of Linear Algebra we learnt in Post 2:

If A is symmetric, then A is orthogonally diagonalizable and has only real eigenvalues.
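
Here is a short numerical check of this theorem using NumPy’s eigh, which is designed for symmetric matrices; the 4 x 4 matrix below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                       # an arbitrary symmetric matrix

eigvals, Q = np.linalg.eigh(A)          # eigendecomposition; eigenvalues are real
print(np.allclose(Q.T @ Q, np.eye(4)))  # True: the eigenvectors are orthonormal
print(np.round(Q.T @ A @ Q, 6))         # diagonal matrix with the eigenvalues on the diagonal
```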

This was indeed the last piece of our puzzle. In theory, PCA assumes that the basis vectors, i.e. the rows of matrix P, are the orthonormal eigenvectors of the covariance matrix of X. Applying this transformation P to X results in a matrix Y such that S_Y is diagonalized. Secondly, PCA assumes that the directions with the largest variances are the most important, or ‘most principal’. The rows of matrix P are rank-ordered by their corresponding variances, which here are the eigenvalues. By eliminating the rows of matrix P with low eigenvalues, the directions in which the variance is low are ignored. This makes it possible to effectively reduce the number of dimensions without significant loss of information.
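
Putting all of the above together, here is a minimal from-scratch sketch of this recipe, in the features-as-rows convention used in this post; the function name, the synthetic 5-feature dataset, and the choice of k = 2 are all made up for illustration.

```python
import numpy as np

def pca(X, k):
    """Sketch of the recipe above. X: m x n matrix, m features (rows) x n samples (columns).
    Returns the k x n reduced dataset and the k chosen directions (rows of P)."""
    X_centered = X - X.mean(axis=1, keepdims=True)   # center each feature
    S_x = np.cov(X_centered)                         # m x m covariance matrix of X
    eigvals, eigvecs = np.linalg.eigh(S_x)           # symmetric => orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]                # rank-order by variance (eigenvalue)
    P = eigvecs[:, order].T                          # rows of P = eigenvectors, largest variance first
    Y = P @ X_centered                               # re-express the data in the new basis
    return Y[:k], P[:k]                              # keep only the top-k principal components

# Hypothetical usage: 5 original features, 200 samples, reduced to 2 components.
rng = np.random.default_rng(5)
X = rng.normal(size=(5, 200))
X[3] = 2 * X[0] + 0.1 * rng.normal(size=200)         # inject some redundancy between features
Y, P = pca(X, k=2)
print(Y.shape)                                       # (2, 200)
print(np.round(np.cov(Y), 3))                        # ~diagonal: the new features are uncorrelated
```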

References

  1. A Tutorial on Principal Component Analysis, an excellent tutorial for anyone interested in further reading on the topic.
  2. No references list is perhaps complete without this undeniably awesome answer on Stack Exchange: Making sense of principal component analysis, eigenvectors & eigenvalues.
  3. Principal Component Analysis (PCA).
