
Statistical Decision Theory


In this post, we will discuss some theory that provides the framework for developing machine learning models.

Let’s get started!

If we consider a real-valued random input vector, X, and a real-valued random output variable, Y, the goal is to find a function f(X) for predicting the value of Y. This requires a loss function, L(Y, f(X)), which allows us to penalize errors in prediction. One example of a commonly used loss function is the squared error loss:
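$$L(Y, f(X)) = (Y - f(X))^2$$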

The loss function is the squared difference between the true outcome values and our predictions. If f(X) = Y, meaning our predictions equal the true outcome values, the loss is zero. So we'd like to find a way to choose a function f(X) that gives us values as close to Y as possible.

Given our loss function, we have a criterion for selecting f(X). We can calculate the expected squared prediction error by integrating the loss function over x and y:
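$$\mathrm{EPE}(f) = E\big[(Y - f(X))^2\big] = \int [y - f(x)]^2 \, P(dx, dy)$$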

where P(X, Y) is the joint probability distribution of the input and output. We can then condition on X and write the expected squared prediction error as follows:
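$$\mathrm{EPE}(f) = E_X \, E_{Y|X}\big([Y - f(X)]^2 \mid X\big)$$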

We can then minimize this expected squared prediction error pointwise, by finding the value, c, that minimizes the error given X:
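$$f(x) = \underset{c}{\mathrm{argmin}} \; E_{Y|X}\big([Y - c]^2 \mid X = x\big)$$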

The solution to this is:
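$$f(x) = E(Y \mid X = x)$$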

This is the conditional expectation of Y, given X = x, known as the regression function. Put another way, the regression function gives the conditional mean of Y, given our knowledge of X. Interestingly, the k-nearest neighbors method is a direct attempt at implementing this from training data. With nearest neighbors, for each x, we can ask for the average of the y's where the input x equals a specific value. Our estimator for Y can then be written as:
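$$\hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$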

where Ave denotes the sample average and N_k(x) is the neighborhood containing the k points in the sample closest to the target point x. In other words, we approximate the expected value by an average over the sample data, and we relax conditioning on a single point to conditioning on a region of k nearby neighbors. As the sample size gets larger, the points in the neighborhood are likely to be close to x, and as the number of neighbors, k, gets larger, the mean becomes more stable.
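To make the estimator concrete, here is a minimal sketch of k-nearest neighbors regression in Python with NumPy. It assumes one-dimensional inputs for simplicity; the function name `knn_regress` and the toy data are illustrative, not from the original post.

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Estimate E[Y | X = x0] by averaging the y's of the k nearest x's."""
    # Distance from the target point to every training input (1-D inputs assumed)
    dists = np.abs(X - x0)
    # Indices of the k closest training points: the neighborhood N_k(x0)
    nearest = np.argsort(dists)[:k]
    # Average the outcomes in the neighborhood to estimate the conditional mean
    return y[nearest].mean()

# Illustrative toy data: y = x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = X**2 + rng.normal(0, 0.1, 200)

print(knn_regress(0.5, X, y, k=10))  # should land near 0.25
```

Raising k in this sketch averages over a wider region, trading locality for stability, which is exactly the behavior described above.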

If you’re interested in learning more, The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, is a great resource. Thank you for reading!

