Why Use Cross Entropy in Classification Task?

2021-10-16 · 366 words · 2 mins read

In classification tasks, the de facto loss to use is the cross entropy loss.

Suppose we have 10 classes. We would like the network to predict the probability that the current sample belongs to each of the 10 classes. However, the raw output from a neural network is just unbounded floating point values (logits), so the softmax function is used to normalize the output into the range (0, 1).

After softmax, all output values are between 0 and 1 and they sum to 1, so the output can now be treated as a probability distribution over the classes. The class with the largest probability is the prediction.
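As a minimal sketch (using NumPy, with made-up logits for one sample and 10 classes), softmax exponentiates each raw score and divides by the sum, so the outputs are positive and sum to 1:

```python
import numpy as np

# Hypothetical raw network outputs (logits) for one sample and 10 classes.
logits = np.array([1.2, -0.3, 0.5, 2.1, 0.0, -1.7, 0.8, 0.3, -0.9, 1.5])

# Softmax: subtract the max first for numerical stability (does not change the result).
exp_scores = np.exp(logits - logits.max())
probs = exp_scores / exp_scores.sum()

print(probs)           # all values in (0, 1)
print(probs.sum())     # 1.0
print(probs.argmax())  # index of the predicted class
```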

Now suppose we have a batch of N data samples with their class labels. From the point of view of maximum likelihood estimation (MLE), we want to find the network parameters that maximize the product of the probabilities each sample receives for its ground-truth class.

For example, if we have data samples $x_1$, $x_2$ and $x_3$ with class labels 1, 3 and 5, then we want to find the network parameters that maximize $p_{1,1} \cdot p_{2,3} \cdot p_{3,5}$, where $p_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$.
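More generally (a sketch using the notation from the example above, where $p_{i, y_i}$ is the probability the network assigns to sample $i$ for its ground-truth class $y_i$, and $\theta$ denotes the network parameters), the MLE objective over the batch is:

$$
\max_{\theta} \; \prod_{i=1}^{N} p_{i, y_i}
$$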

Then, following MLE, we take derivatives of this objective with respect to the parameters and optimize. However, the product form is inconvenient to differentiate, which is why the log() function is applied here.

Why the log function? Because:

(1) log is monotonically increasing, so the objective is effectively the same: maximizing the original likelihood is equivalent to maximizing its log. (2) log(x * y) = log(x) + log(y), which turns the product into a sum and greatly simplifies the calculation of derivatives, as shown below.
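Concretely, taking the log turns the product into a sum, and by monotonicity the maximizer is unchanged (a sketch, continuing the notation above):

$$
\log \prod_{i=1}^{N} p_{i, y_i} = \sum_{i=1}^{N} \log p_{i, y_i}
$$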

Another question: why do we use the minus sign?

Because in machine learning we always talk about minimizing a loss/cost, and minimizing the negative log likelihood is equivalent to maximizing the log likelihood. It is just a convention: by adding a minus sign, we transform the problem of maximizing the likelihood into minimizing a loss function. They are essentially the same, but "loss function" is the more familiar term to machine learning practitioners.

The cross entropy loss is also called log loss.
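As a rough sketch in PyTorch (the logits and labels are made up), the cross entropy loss computed by `torch.nn.functional.cross_entropy` on raw logits matches the negative mean log probability of the ground-truth classes computed by hand:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 3 samples and 5 classes, plus their labels.
logits = torch.randn(3, 5)
labels = torch.tensor([0, 2, 4])

# Built-in cross entropy (applies log-softmax internally).
loss_builtin = F.cross_entropy(logits, labels)

# Manual version: softmax -> pick ground-truth probability -> -log -> mean.
probs = F.softmax(logits, dim=1)
loss_manual = -torch.log(probs[torch.arange(3), labels]).mean()

print(loss_builtin.item(), loss_manual.item())  # the two values agree (up to float error)
```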

To be continued…


Author jdhao

LastMod 2021-10-25

License CC BY-NC-ND 4.0
