

Kullback-Leibler divergence
source link: https://www.tuicool.com/articles/hit/iqyiEj3

This is where entropy comes in. Entropy tells us the ideal number of bits required to represent a piece of information for transmission.
Let’s consider an example.
Suppose you are an official at NASA and your job is to receive the temperature readings of a particular planet from a fellow astronaut stationed at the International Space Station. Years of observing this planet's temperatures have shown that it can only take 8 different levels.
possible_temperatures_levels = [100, 105, 110, 145, 160, 162, 174, 180]
Each of them has a certain probability of occurring attached to it. It’s clear that some of them occur more commonly than the others.
# (of course, they all sum up to 1)
temperature_probs = [0.30, 0.40, 0.05, 0.06, 0.07, 0.02, 0.05, 0.05]
Let’s consider a trivial example at first.
Suppose the astronaut wants to convey today's temperature to the NASA official. He could do it by sending the string "Today's temperature is 105°C". This string is 28 characters, or 8 x 28 = 224 bits, long. Do we really need this many bits? If we shorten the string to just "105", we still require 3 x 8 = 24 bits to convey the information. Since we have 8 different temperature levels, the ideal way to go about it would be to use 3 bits, where each 3-bit pattern represents a specific temperature level. Using 24 bits to represent information worth only 3 bits is a sin. Also, note that if a receiver successfully receives the information that the temperature is 105°C, it has reduced its uncertainty (or confusion) by a factor of 8 (as mentioned earlier).
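For concreteness, here is a minimal sketch of what such a fixed-length scheme looks like (my addition, not from the original post): each of the 8 levels simply gets a distinct 3-bit pattern.

temps = [100, 105, 110, 145, 160, 162, 174, 180]

# Assign each temperature level a distinct 3-bit pattern based on its index.
fixed_code = {t: format(i, "03b") for i, t in enumerate(temps)}
print(fixed_code)
# {100: '000', 105: '001', 110: '010', 145: '011', 160: '100', 162: '101', 174: '110', 180: '111'}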
Well, this was easy. But how do we find the magic number of 3 mathematically? That is easy too, right? Just take a $\log_2$ of the number of available unique pieces of information (e.g. $\log_2(8) = 3$) and we're sorted. Well, no!
That would only work if the probabilities of all the temperature levels were the same. It is to be noted that in the case of uneven probabilities (as in our example), we can further save on the number of bits being transferred, and that is what entropy addresses. The intuition here is that we don't need as many bits for something that occurs much more frequently compared to something that occurs rarely. If 105°C occurs 40% of the time and 162°C occurs only 2% of the time, then it doesn't make sense to use the same bit code (encoding) for representing both these pieces of information.
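To put rough numbers on that intuition, here is a small sketch (my addition, not from the original post): the "ideal" codeword length for an outcome with probability p is about log2(1/p) bits, so frequent temperatures deserve short codewords and rare ones long codewords.

from math import log2

# Ideal (Shannon) codeword lengths: the rarer the outcome, the longer the codeword.
for temp, p in [(105, 0.40), (100, 0.30), (162, 0.02)]:
    print(f"{temp}: p = {p}, about {log2(1 / p):.2f} bits")
# 105: p = 0.4, about 1.32 bits
# 100: p = 0.3, about 1.74 bits
# 162: p = 0.02, about 5.64 bits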
I hope you understand the problem now.
If the astronaut sends us a continuous stream of information about the temperatures such as,
100,105,110,162,100......and so on.
where each number/temperature is obviously represented by some kind of a bit code.
The frequency of occurrence of each temperature in this stream follows the given probability distribution. The NASA official gets puzzled and ponders over a question - "How many bits or guesses do I need, on average, to correctly predict a temperature level in the stream?". Rephrasing his concern in layman's terms: on average, how many yes/no questions does he need to ask to correctly tell the temperature value he has just received? This is known as entropy - the average amount of uncertainty the NASA official has to resolve before correctly predicting something.
Mathematically, it is represented as

$$H(p) = -\sum_{i=1}^{N} p_i \log_2(p_i)$$
Throughout this post, we deal with a discrete set of probabilities. I’ll expand on continuous set of probabilities later on in the post.
One may argue that in order to find the number of “guesses” to rightly predict something from the stream, we can just multiply the probabilities. For example,
import numpy as np

np.prod([0.30, 0.40, 0.05, 0.06, 0.07, 0.02, 0.05, 0.05])
# result = 0.0000000012600000000000002
However, calculating the product serves no purpose as the result is extremely tiny and doesn't provide a lot of insight. Multiplication is also an expensive operation. A simple way to tackle this is to convert the product into a sum by taking a logarithm. Utilizing the property

$$\log(a \cdot b) = \log(a) + \log(b)$$

we get

$$\log_2\Big(\prod_i p_i\Big) = \sum_i \log_2(p_i)$$
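As a quick numerical check (my addition, not from the original post), the sum of logs carries the same information as the log of the vanishingly small product, without ever forming that product:

import numpy as np

probs = [0.30, 0.40, 0.05, 0.06, 0.07, 0.02, 0.05, 0.05]

# Log of the product vs. sum of the logs: same value, but the second form
# never has to represent a number as tiny as 1.26e-9.
print(np.log2(np.prod(probs)))         # ~ -29.56
print(sum(np.log2(p) for p in probs))  # ~ -29.56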
Entropy means the average number of bits/guesses. Had the probabilities been the same for all the temperatures, we could have just used $\log_2(8) = 3$ bits. But since we are given a probability distribution, we weight each of the $\log_2\big(\frac{1}{p_i}\big)$ terms with its respective probability. Why? Because when we calculate a plain average of a set of numbers, we assume that each number in the set is equally likely. When each number comes with its own probability, we take a weighted average instead, which gives exactly the entropy formula above: $H(p) = \sum_i p_i \log_2\big(\frac{1}{p_i}\big) = -\sum_i p_i \log_2(p_i)$.
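To make this concrete, here is a small sketch (my addition, not from the original post) that computes the entropy of the temperature distribution introduced earlier:

from math import log2

temperature_probs = [0.30, 0.40, 0.05, 0.06, 0.07, 0.02, 0.05, 0.05]

# Weighted average of log2(1/p): frequent readings contribute few bits,
# rare readings contribute many, each weighted by how often it occurs.
entropy = sum(p * log2(1 / p) for p in temperature_probs)
print(entropy)  # ~2.32 bits per temperature reading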
Obtaining an entropy roughly equal to 2.3 tells us that, on average, we need to ask about 2.3 questions, or spend 2.3 bits per piece of information, in order to clearly determine what the message stands for and transmit it efficiently. This also means that only about 2.3 of the 3 bits in the coding scheme described earlier carry useful information (we'd be wasting roughly 0.7 bits per message if we used a 3-bit pattern on this distribution).
For folks who can't get their head around what 2.3 bits would look like, don't think in terms of a binary number but in terms of a portion or part of a piece of information. Let me propose 2 ways of looking at it which will hopefully make this more lucid.
- Considering the planetary temperature example mentioned above, we know that we have 8 different kinds of information and 2.3 is our entropy. Now if we use 2.3 bits per piece of information, a message carrying 8 of them would be 2.3 x 8 = 18.4 bits long, which we cannot literally construct. But if we send such a message 10 times, we can do it in about 184 bits, in contrast to the 3 x 8 x 10 = 240 bits of the fixed 3-bit scheme. Fractional bits pay off at the scale of a long transmission. (A concrete variable-length code that gets close to this bound is sketched right after this list.)
- Consider a state election where 4 candidates (3 women and 1 man) are contesting. Suppose we find some dirt on the only male candidate and he gets eliminated from the race. We still have 3 candidates we can vote for. The dirt certainly gave me more than 0 bits of information, because I know more than I knew before. But at the same time, it did not give me one complete bit of information, because that would have required reducing my uncertainty by a factor of 2 and being left with only 2 candidates in total (refer to the third paragraph in this post).
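To see what a code that approaches 2.3 bits per reading actually looks like, here is a sketch of a Huffman code for the 8 temperature levels (my addition, not from the original post). The exact codewords depend on how ties are broken, but the average codeword length works out to about 2.37 bits, close to the 2.32-bit entropy bound and clearly better than the flat 3-bit scheme.

import heapq

temps = [100, 105, 110, 145, 160, 162, 174, 180]
probs = [0.30, 0.40, 0.05, 0.06, 0.07, 0.02, 0.05, 0.05]

# Each heap entry: (probability, tie-breaker, {temperature: codeword-so-far}).
heap = [(p, i, {t: ""}) for i, (t, p) in enumerate(zip(temps, probs))]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    # Repeatedly merge the two least likely subtrees, prepending one bit
    # to every codeword inside each of them.
    p1, _, codes1 = heapq.heappop(heap)
    p2, _, codes2 = heapq.heappop(heap)
    merged = {t: "0" + c for t, c in codes1.items()}
    merged.update({t: "1" + c for t, c in codes2.items()})
    heapq.heappush(heap, (p1 + p2, next_id, merged))
    next_id += 1

codes = heap[0][2]
avg_len = sum(p * len(codes[t]) for t, p in zip(temps, probs))
print(codes)    # e.g. {105: '0', 100: '10', ...} - frequent levels get short codewords
print(avg_len)  # ~2.37 bits per reading, vs. 3 bits for the fixed-length code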
Cross-Entropy
Since we now know what entropy is, cross entropy should be fairly easy to understand. A Wikipedia lift tells us that,
Cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.
In simple terms: given that

$$H(p) = -\sum_i p_i \log_2(p_i)$$

represents entropy, we can say that

$$H(p, q) = -\sum_i p_i \log_2(q_i)$$

represents cross entropy. You ask how? Well, consider that you have 2 probability distributions p and q, where p represents the true distribution and q represents a predicted distribution. Both p and q are distributions over the same set of events. Imagine the last step of a neural network wherein you obtain a probability distribution (our q) and then compare it with a distribution you already know beforehand (our p), where both distributions describe the same set of classes, such as animals, plants, cars, or whatever your model is trained on.
With entropy, we find the optimal average number of bits required to transmit information using the probability distribution that actually governs its set of events. If we instead use the distribution q to devise a code for information that really comes from p, the average number of bits we'll need is known as the cross entropy. This is quite insightful: cross entropy tells us how different the 2 distributions are. The closer q is to p, the closer the cross entropy gets to the entropy of p.
Hence, it is represented as follows:

$$H(p, q) = -\sum_i p_i \log_2(q_i)$$
Let’s write some code to better understand this.
import numpy as np
import matplotlib.pyplot as plt

obtained_probability = np.random.dirichlet(np.ones(8))
true_probability = np.random.dirichlet(np.ones(8))

# Dirichlet samples are valid probability vectors, i.e. each sums to 1.
assert np.isclose(obtained_probability.sum(), 1.0)
assert np.isclose(true_probability.sum(), 1.0)

print("Obtained Probability: ", obtained_probability)
# [0.22374847 0.01351315 0.26085442 0.04087474 0.05121301 0.04455142 0.14301258 0.22223221]
print("True Probability: ", true_probability)
# [0.29786589 0.01631927 0.00065958 0.27396781 0.11610311 0.10891934 0.01762108 0.16854391]

plt.bar(np.arange(8), obtained_probability, width=0.35, color="r")
plt.bar(np.arange(8) + 0.35, true_probability, width=0.35, color="b")
plt.show()
All we’ve done till now is generate 2 random probability distributions such that they add up to 1 and plotted them. Below, we’ve written certain routines to calculate entropies and cross-entropies for these 2 distributions.
from math import log

def entropy(x):
    return -sum([x_i * log(x_i, 2) for x_i in x])

def cross_entropy(p, q):
    return -sum([p_i * log(q_i, 2) for p_i, q_i in zip(p, q)])

entropy(true_probability)                              # 2.380762398444909
entropy(obtained_probability)                          # 2.5644798655740377
cross_entropy(true_probability, obtained_probability)  # 3.4115389365100524
cross_entropy(true_probability, true_probability)      # 2.380762398444909
Notice that the cross entropy is never lower than the entropy of the true distribution, since we've performed the encoding using the wrong distribution. Also, when both distributions are the same, the cross entropy and the entropy are equal to each other.
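Tying this back to the temperature example, here is a quick check (my addition, not from the original post): if we encode the temperature stream with a code built for a uniform distribution over the 8 levels, the cross entropy comes out to 3 bits, which is exactly the flat 3-bit scheme from earlier, roughly 0.7 bits worse per reading than the 2.32-bit entropy.

from math import log

def cross_entropy(p, q):
    return -sum([p_i * log(q_i, 2) for p_i, q_i in zip(p, q)])

temperature_probs = [0.30, 0.40, 0.05, 0.06, 0.07, 0.02, 0.05, 0.05]
uniform_probs = [1 / 8] * 8

# Coding the true temperature distribution as if every level were equally
# likely costs log2(8) = 3 bits per reading, regardless of the true probabilities.
print(cross_entropy(temperature_probs, uniform_probs))  # ~3.0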
Kullback-Leibler divergence
Yayy! So, we’ve accomplished 95% of the crucial stuff required to understand KL divergence.
The KL divergence between two probability distributions is just the number of extra bits we need if we encode the information represented by one using the probability distribution of the other. It is the difference between cross entropy and entropy. Yes, that's it. You can take a guess regarding how it is mathematically represented, and you will probably be right:

$$D_{KL}(p \parallel q) = H(p, q) - H(p) = \sum_i p_i \log_2\left(\frac{p_i}{q_i}\right)$$
The lower the cross entropy, the lower the KL divergence. The term divergence is itself self-explanatory: the amount by which one distribution diverges from the other is basically what KL divergence measures.
Writing the code for it should be fairly rudimentary now.
def kld(p, q):
    return cross_entropy(p, q) - entropy(p)

kld(true_probability, obtained_probability)  # 1.0307765380651435
In our case, the divergence comes out to be about 1.03, i.e. encoding with the obtained distribution costs roughly one extra bit per message compared to encoding with the true distribution.
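As one last sanity check (my addition, continuing the code above), the KL divergence of a distribution from itself is zero, since the cross entropy then collapses to the entropy:

# No extra bits are needed when the encoding distribution matches the true one.
print(kld(true_probability, true_probability))  # 0.0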
I hope that you’ve clearly understood KL divergence by now. See ya next time! probably with the explanation of a different topic which I personally find hard to understand.