
Statistical Decision Theory


In this post, we will discuss some theory that provides the framework for developing machine learning models.

Let’s get started!

If we consider a real-valued random input vector, X, and a real-valued random output variable, Y, the goal is to find a function f(X) for predicting the value of Y. This requires a loss function, L(Y, f(X)), which allows us to penalize errors in prediction. One example of a commonly used loss function is the squared error loss:
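$$L(Y, f(X)) = (Y - f(X))^2$$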

The loss function is the squared difference between the true outcome values and our predictions. If f(X) = Y, meaning our predictions equal the true outcome values, the loss is zero. So we'd like to find a way to choose a function f(X) that gives us values as close to Y as possible.

Given our loss function, we have a criterion for selecting f(X). We can calculate the expected squared prediction error by integrating the loss function over x and y:
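$$\mathrm{EPE}(f) = E\big[(Y - f(X))^2\big] = \int [y - f(x)]^2 \, P(dx, dy)$$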

where P(X, Y) is the joint probability distribution of the input and output. We can then condition on X and write the expected squared prediction error as follows:
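$$\mathrm{EPE}(f) = E_X \, E_{Y|X}\big([Y - f(X)]^2 \mid X\big)$$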

We can then minimize this expected squared prediction error pointwise, by finding the value, c, that minimizes the error given X:
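$$f(x) = \underset{c}{\mathrm{argmin}} \; E_{Y|X}\big([Y - c]^2 \mid X = x\big)$$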

The solution to this is:
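$$f(x) = E(Y \mid X = x)$$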

This is the conditional expectation of Y, given X = x, known as the regression function. Put another way, the regression function gives the conditional mean of Y, given our knowledge of X. Interestingly, the k-nearest neighbors method is a direct attempt at implementing this from training data. With nearest neighbors, for each x, we can ask for the average of the y's where the input x equals a specific value. Our estimator for Y can then be written as:
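$$\hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$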

where Ave denotes the sample average and N_k(x) is the neighborhood containing the k points in the sample closest to the target point x. In other words, we approximate the expected value by an average over the sample data, and we relax conditioning on a single point to conditioning on a region of k nearby neighbors. As the sample size gets larger, the points in the neighborhood are likely to be close to x, and as the number of neighbors, k, gets larger, the mean becomes more stable.
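To make the estimator concrete, here is a minimal sketch of k-nearest neighbors regression in Python with NumPy. It assumes one-dimensional inputs for simplicity; the function name `knn_regress` and the toy data are illustrative, not from the original post.

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Estimate E[Y | X = x0] by averaging the y's of the k nearest x's."""
    # Distance from the target point to every training input (1-D inputs assumed)
    dists = np.abs(X - x0)
    # Indices of the k closest training points: the neighborhood N_k(x0)
    nearest = np.argsort(dists)[:k]
    # Average the outcomes in the neighborhood to estimate the conditional mean
    return y[nearest].mean()

# Illustrative toy data: y = x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = X**2 + rng.normal(0, 0.1, 200)

print(knn_regress(0.5, X, y, k=10))  # should land near 0.25
```

Raising k in this sketch averages over a wider region, trading locality for stability, which is exactly the behavior described above.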

If you’re interested in learning more, The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, is a great resource. Thank you for reading!

