
A Gentle Introduction to Deep Learning: Part 4

source link: https://www.tuicool.com/articles/3qiQzue
Photo by Markus Spiske

“Life is a school of probability.” ~ Walter Bagehot

This is part 4 of the series I started to share what I am learning and to show you the real essence of Machine Learning.

In part 3 you got a complete mathematical understanding of PCA. If you haven’t seen that part, I suggest checking out all the previous parts ( part 1 , part 2 , part 3 ) to get a complete picture of what I am doing here. So without any delay, let’s begin.

Introduction

Probability is one of those concepts that we use on a daily basis without realising that we are applying it.

Life is full of uncertainties. Will I pass my final exams? Will it rain today? Will I get the job? In artificial intelligence we are trying to achieve human-level intelligence, and since our brain also works on probability and deduction, we design artificial intelligence the same way. So in this part we are going to learn about probability theory and some related concepts.

What is Probability?

Probability is a mathematical framework for representing uncertain statements. It is a measure of how likely an event is.


The branch of mathematics that deals with probability is known as probability theory.

Some terminology that we will use going forward:

  • An experiment is an uncertain situation which could have multiple outcomes. Whether it rains on a given day is an experiment.
  • An outcome is the result of a single trial. So, if it rains today, the outcome of today’s trial of the experiment is “It rained”.
  • An event is one or more outcomes of an experiment. “It rained” is one of the possible events for this experiment.

Random Variables

A random variable, usually written as x, is a variable whose possible values are numerical outcomes of a random phenomenon. The values that a random variable can take on are represented as x₁, x₂, … For example,

x = outcome of a coin toss

Possible outcomes:

x₁ = 1 if heads

x₂ = 0 if tails

There are two types of random variables: discrete and continuous.

Probability Distributions

A probability distribution is a description of how likely a random variable is to take on each of its possible states. For example, look at this probability distribution of a fair die roll:

[Chart: probability distribution of a fair die roll, where each outcome 1 to 6 has probability 1/6]

The way we describe probability distributions depends on whether the random variables are discrete or continuous.

Discrete Random Variables and Probability Mass Function

A random variable which can take only a countable number of distinct values is known as a discrete random variable. For example, the number of children in a family or the number of cancer patients in a hospital.

The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values. It is known as a probability mass function ( PMF ) and is typically denoted by P.

A probability mass function maps a state of a random variable to the probability of that random variable taking on that state. A PMF can also act on many variables at the same time:

P (x = x , y = y ) denotes the probability that x = x and y = y simultaneously.

A PMF P on a random variable x must satisfy the following conditions:

  • The domain of P must be the set of all possible states of x.
  • ∀x ∈ x, 0 ≤ P ( x ) ≤ 1. An impossible state has probability 0 and a state that is guaranteed to occur has probability 1.
  • ∑ₓ P ( x ) = 1. We refer to this property as being normalized.
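The three conditions above are easy to check mechanically. As a minimal sketch, here is the PMF of a fair die (the die example is illustrative) stored as a plain dictionary, with the conditions verified:

```python
# PMF of a fair die roll: each of the 6 states gets probability 1/6.
pmf = {x: 1 / 6 for x in range(1, 7)}

# Condition 2: every probability lies in [0, 1].
assert all(0 <= p <= 1 for p in pmf.values())

# Condition 3: the probabilities sum to 1 (normalized), up to float error.
assert abs(sum(pmf.values()) - 1.0) < 1e-9
```

The dictionary’s keys play the role of the domain (condition 1): every state the variable can take appears exactly once.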

Continuous Random Variables and Probability Density Function

A continuous random variable is one which can take an infinite number of possible values. Continuous variables are usually measurements. For example, the height of a person, the amount of sugar in an orange, or the time required to run a mile.

A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values, and is represented by the area under a curve (in advanced mathematics, this is known as an integral ). The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.

The probability distribution used to describe continuous random variables is known as a probability density function ( PDF ). To be a PDF, p must satisfy the following conditions:

  • The domain of p must be the set of all possible states of x.
  • ∀x ∈ x, p ( x ) ≥ 0
  • ∫ p ( x ) dx = 1

You might be wondering, “Did I forget to mention p ( x ) ≤ 1?” Let me assure you that I haven’t. The point to note here is that we do not require p ( x ) ≤ 1. Let’s discuss this in more detail.

The first thing to understand is that, unlike a probability mass function, the output of a probability density function is not itself a probability. To get a probability from a PDF we need to find the area under the curve over an interval of values.

Let’s take an example to understand this better. Say a food chain advertises its hamburgers as weighing 0.25 pounds, but we all know no hamburger is exactly 0.25 pounds: a randomly selected hamburger might weigh 0.23 pounds or 0.249 pounds. There are infinitely many possible values, which means weight is a continuous variable. Now say you decide to calculate the probability of selecting a hamburger weighing between 0.2 and 0.3 pounds, i.e. p (0.2 ≤ x ≤ 0.3).

Say you weigh 100 hamburgers and create a graph like this:

[Graph: distribution of hamburger weights with a smooth density curve]

Such a curve is called a probability density function. The area under the whole curve is 1 (since area under the curve represents probability), and to find the probability over an interval you find the area under the curve between the endpoints, i.e. you calculate the integral mentioned in the third condition above (∫ p(x)dx ):

[Graph: shaded area under the density curve between 0.2 and 0.3]

This area represents a probability, and it satisfies the condition:

0 ≤ p (0.2 ≤ x ≤ 0.3) ≤ 1
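This “probability = area under the PDF” idea can be checked numerically. The sketch below assumes hamburger weights follow a normal distribution with mean 0.25 and standard deviation 0.02; these numbers are invented for illustration, not taken from real data. Note that the density at the peak is about 19.9, far above 1, even though the area over any interval stays between 0 and 1:

```python
import math

mu, sigma = 0.25, 0.02  # assumed mean and spread of hamburger weights

def pdf(x):
    # Normal density: (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area(f, a, b, n=100_000):
    # Midpoint Riemann sum approximating the integral of f from a to b.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

p = area(pdf, 0.2, 0.3)  # p(0.2 <= x <= 0.3): a genuine probability, <= 1
density = pdf(0.25)      # a density value: can be much larger than 1
```

Here `p` comes out close to 0.99 while `density` is about 19.9, which is exactly why the condition p(x) ≤ 1 is not required for densities.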

Types of Probability

Joint Probability

Let’s take a simple example, say you have two discrete random variables x and y, where x and y represents disease and symptoms respectively.

[Table: joint probabilities P(x, y) for the four combinations of x and y]

Here, for x, 0 represents that the person has the disease and 1 represents that the person is fit. Similarly, for y, 0 represents that the person has symptoms of the disease and 1 represents that the person doesn’t have symptoms.

Now, joint probability is the probability of two different events occurring at the same time. For the above example, the joint probability that a person is fit (x = 1) and has no symptoms (y = 1) is:

P (x = 1, y = 1) = 0.3

Marginal Probability

Marginal probability is the probability of a single event on its own, irrespective of the other variables. To calculate it we simply sum up the joint probabilities. Let’s take the above example again:

[Table: joint probabilities P(x, y), summed over x to obtain the marginal of y]

To calculate P (y = 1), we can simply add:

P (y = 1) = P (x = 0, y = 1) + P (x = 1, y = 1)

Generally this can be written as:

∀x ∈ x, P (x = x ) = ∑_y P (x = x, y = y )

For continuous variables, we need to integrate instead of summing:

P ( x ) = ∫ p ( x, y ) dy
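The discrete case is easy to sketch in code. In the table below, only P(x=1, y=1) = 0.3 comes from the example above; the other three entries are made up so that the table sums to 1:

```python
# A small joint distribution over two binary variables, stored as a dict.
joint = {
    (0, 0): 0.1, (0, 1): 0.4,
    (1, 0): 0.2, (1, 1): 0.3,
}

def marginal_y(y):
    # P(y = y) = sum over all values of x of P(x = x, y = y)
    return sum(p for (xv, yv), p in joint.items() if yv == y)

p_y1 = marginal_y(1)  # P(y=1) = P(x=0, y=1) + P(x=1, y=1) = 0.4 + 0.3
```

Summing a marginal over all its values recovers 1, since the joint table itself is normalized.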

Conditional Probability

In many cases, we are interested in the probability of some event, given that some other event has happened. This is called conditional probability .

So the conditional probability of y = A given that x = B, written P (y = A | x = B), is defined as:

P (y = A | x = B) = P (y = A, x = B) / P (x = B)

Conditional Probability

We cannot compute the conditional probability conditioned on an event that never happens. Some examples of conditional probability are:

  • Probability of a person liking Harry Potter given that the person likes fiction.
  • Probability of having a disease given that test results are positive.
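A conditional probability can be read straight off a joint table by dividing by the relevant marginal. The table below is made up for illustration (it only loosely mirrors the disease/symptom example):

```python
# Joint distribution over two binary variables (illustrative numbers).
joint = {
    (0, 0): 0.1, (0, 1): 0.4,
    (1, 0): 0.2, (1, 1): 0.3,
}

# Marginal P(x = 1), obtained by summing out y.
p_x1 = sum(p for (xv, yv), p in joint.items() if xv == 1)

# Conditional probability: P(y=1 | x=1) = P(x=1, y=1) / P(x=1)
p_y1_given_x1 = joint[(1, 1)] / p_x1
```

Note that the division is only defined when P(x = 1) > 0, which is exactly the “cannot condition on an event that never happens” caveat above.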

Bayes’ Rule

We often find ourselves in a situation where we know P (y | x ) and need to know P (x | y). For example, we may know the probability that test results are positive given that a person has the disease, but what we actually want is the probability that a person has the disease given that the test results are positive.

For this, we need to know that the joint probability of two events x and y, can also be expressed as:

P(x, y) = P(x | y) P(y) = P(y |x) P(x)

This rule is also known as the product rule or multiplication rule.

So from the above expression we can derive P(x | y) given the value of P(y | x):

P(x | y) = P(x, y) / P(y)
P(x, y) = P(y | x) P(x)
so, P(x | y) = P(y | x) P(x) / P(y)
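To make Bayes’ rule concrete, here is a numerical sketch of the disease-testing example. All three input numbers are invented for illustration (a 1% prior, 95% sensitivity, 5% false-positive rate), not figures from the text:

```python
p_disease = 0.01            # P(x): prior probability of having the disease
p_pos_given_disease = 0.95  # P(y | x): test is positive given disease
p_pos_given_healthy = 0.05  # P(y | not x): false-positive rate

# P(y) by marginalizing over the two cases (disease / no disease).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(x | y) = P(y | x) P(x) / P(y)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

With these numbers the posterior is only about 16%: even a fairly accurate test gives a surprisingly low probability of disease when the disease itself is rare.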

The Chain Rule of Conditional Probabilities

Generalizing the product rule leads to the chain rule .

Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable each:

P(x₁, x₂, …, xₙ) = P(x₁) P(x₂ | x₁) P(x₃ | x₁, x₂) ⋯ P(xₙ | x₁, …, xₙ₋₁)

Chain rule of probability

Let’s take an example,

P(a,b,c) = P(a | b,c) P(b,c),
P(b,c) = P(b|c) P(c)
so, P(a,b,c) = P(a | b,c) P(b | c) P(c)
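The decomposition can be verified numerically on any joint distribution. The sketch below builds an arbitrary (randomly generated) joint table over three binary variables and checks that the product of the three conditionals recovers the joint:

```python
import itertools
import random

random.seed(0)

# An arbitrary normalized joint distribution over binary a, b, c.
states = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in states]
joint = {s: w / sum(weights) for s, w in zip(states, weights)}

def P(a=None, b=None, c=None):
    # Marginal probability of the specified assignments, summing out the rest.
    return sum(p for (av, bv, cv), p in joint.items()
               if (a is None or av == a)
               and (b is None or bv == b)
               and (c is None or cv == c))

lhs = P(a=1, b=0, c=1)                         # P(a, b, c)
p_a_given_bc = P(a=1, b=0, c=1) / P(b=0, c=1)  # P(a | b, c)
p_b_given_c = P(b=0, c=1) / P(c=1)             # P(b | c)
rhs = p_a_given_bc * p_b_given_c * P(c=1)      # chain-rule product
```

`lhs` and `rhs` agree to floating-point precision, exactly as the derivation above predicts.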

Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and the other involving only y:

p(x, y) = p(x)p(y)

In other words, the occurrence of one event does not affect the occurrence of the other in any way. Independence is denoted x ⊥ y.

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

p(x, y | z) = p(x | z) p(y | z)

Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average, or mean, value that f takes on when x is drawn from P. For discrete variables it can be computed with a summation:

E[f(x)] = ∑ₓ P(x) f(x)

Expectation for discrete variables

while for continuous variables, it is computed with an integral:

E[f(x)] = ∫ p(x) f(x) dx

Expectation for continuous random variables

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

Var(f(x)) = E[ (f(x) − E[f(x)])² ]

Variance

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

Cov(f(x), g(y)) = E[ (f(x) − E[f(x)]) (g(y) − E[g(y)]) ]

Covariance
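All three quantities can be estimated from samples by replacing expectations with averages. In the Monte Carlo sketch below, the linear relationship y = 2x + noise is invented for illustration, so the covariance should come out near 2:

```python
import random

random.seed(42)
n = 100_000

xs = [random.gauss(0, 1) for _ in range(n)]          # x ~ Normal(0, 1)
ys = [2 * x + random.gauss(0, 0.5) for x in xs]      # y linearly related to x

mean_x = sum(xs) / n                                 # estimates E[x] = 0
var_x = sum((x - mean_x) ** 2 for x in xs) / n       # estimates Var(x) = 1
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(xs, ys)) / n            # estimates Cov(x, y) = 2
```

A positive covariance here reflects the positive linear relationship; scaling y up would scale the covariance too, which is why it measures both relatedness and scale.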

Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine learning.

Bernoulli Distribution

Its name might sound too complex and scary but it is the easiest distribution to understand.

All cricket fans know how we decide, at the beginning of a match, who is going to bat or bowl: a toss! It all depends on whether you win or lose the toss. Say if the toss results in heads you win, else you lose.

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So a random variable x with a Bernoulli distribution can take the value 1 with probability of success p and the value 0 with probability of failure q = 1 − p.

The probability mass function (PMF) is given by:

P(x) = pˣ (1 − p)¹⁻ˣ

Bernoulli Distribution

where x ∈ {0, 1}

It can also be written as:

P(x) = 1-p, x = 0
P(x) = p, x = 1

The probabilities of success and failure need not be equal. Say the probability that it is going to rain tomorrow is 0.15 and the probability of no rain tomorrow is 0.85; this chart shows the resulting Bernoulli distribution:

[Bar chart: Bernoulli distribution with P(rain) = 0.15 and P(no rain) = 0.85]

The expected value of a random variable x from a Bernoulli distribution can be computed as:

E(x) = 1 * (p) + 0 * (1-p) = p

And the variance will be:

V(x) = E(x²) − [E(x)]² = p − p² = p(1 − p)
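Both formulas are easy to confirm by simulation. The sketch below uses the rain example’s p = 0.15 (an illustrative number) and checks that the sample mean approaches p and the sample variance approaches p(1 − p):

```python
import random

random.seed(7)
p = 0.15
n = 100_000

# Draw n Bernoulli samples: 1 with probability p, else 0.
samples = [1 if random.random() < p else 0 for _ in range(n)]

mean = sum(samples) / n                           # should approach p = 0.15
var = sum((s - mean) ** 2 for s in samples) / n   # should approach p(1-p) = 0.1275
```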

Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.

It’s the generalization of the Bernoulli distribution: where a Bernoulli random variable has two possible outcomes, a categorical random variable has multiple possible outcomes. Conversely, the categorical distribution is the special case of the multinomial distribution with a single trial; with multiple trials, the counts of the outcomes follow a multinomial distribution.

Let’s understand this with an example: throw a six-sided die and observe the outcome. The possible outcomes (the sample space) are 1, 2, 3, 4, 5, 6, each with probability 1/6. A single throw follows a categorical distribution; the counts of each outcome over n = 50 throws follow a multinomial distribution.

Uniform Distribution

When you roll a die, the outcomes are 1 to 6 and the probabilities of these outcomes are equally likely; that is the basis of the uniform distribution.

In a uniform distribution, all n possible outcomes are equally likely.

For a continuous uniform distribution, the PDF is given by:

p(x) = 1 / (b − a) for a ≤ x ≤ b, and p(x) = 0 otherwise

Uniform distribution

where a and b are parameters.

[Graph: uniform density, constant at 1/(b − a) on the interval [a, b]]

The mean and variance for a random variable x following a uniform distribution are:

Mean = E(x) = (a+b)/2
Variance = Var(x) = (b-a)²/12

Binomial Distribution

A binomial distribution has only two possible outcomes per trial, 1 (success) and 0 (failure), but unlike the Bernoulli distribution it is repeated multiple times (it has multiple trials).

We can use the cricket example again here: on a toss you either get heads (success) or tails (failure). Winning the toss today doesn’t mean you will win the toss tomorrow.

Each trial is independent since the outcome of the previous toss doesn’t affect or determine the outcome of the current toss.

On the basis of the above explanation, the properties of a binomial distribution are:

  1. Each trial is independent.
  2. There are only two possible outcomes in a trial- either a success or a failure.
  3. A total number of n identical trials are conducted.
  4. The probability of success and failure is the same for all trials (trials are identical).

P(x = k) = [ n! / (k! (n − k)!) ] pᵏ qⁿ⁻ᵏ, k = 0, 1, …, n

Mathematical representation of the binomial distribution

The mean and variance of a binomial distribution are given by:

Mean, µ = n*p
Variance, Var(X) = n*p*q
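A binomial sample is just the number of successes across n independent Bernoulli trials, so the formulas above can be checked by simulation. The parameters below (10 tosses per experiment, p = 0.5) are illustrative:

```python
import random

random.seed(1)
n_trials, p = 10, 0.5

def binomial_sample():
    # Count successes across n_trials independent Bernoulli trials.
    return sum(1 for _ in range(n_trials) if random.random() < p)

draws = [binomial_sample() for _ in range(50_000)]

mean = sum(draws) / len(draws)                          # ~ n*p = 5
var = sum((d - mean) ** 2 for d in draws) / len(draws)  # ~ n*p*q = 2.5
```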

Gaussian Distribution

The most commonly (normally!) used distribution over real numbers is the normal distribution, or Gaussian distribution.

[Graph: the bell-shaped curve of the normal distribution]

A distribution is a normal distribution if it has the following characteristics:

  • The mean, median and mode of the distribution coincide.
  • The curve of the distribution is bell-shaped and symmetrical about the line x = μ(mean).
  • The total area under the curve is 1(obviously!!).
  • Exactly half of the values are to the left of the center and the other half to the right.

The PDF of a random variable x following a normal distribution is given by:

p(x) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

The two parameters μ ∈ ℝ and σ ∈ (0, ∞) control the normal distribution. The parameter μ is the mean of the distribution and also gives the coordinate of the central peak. σ and σ² are the standard deviation and variance of the distribution respectively.

In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons:

  • Many distributions we wish to model are genuinely close to normal distributions. The central limit theorem (covered below) shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise.
  • Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We will discuss this idea in greater detail later in upcoming parts with some more mathematical foundation.

Central Limit Theorem

The formal definition of the Central Limit Theorem (CLT) states:

The Central Limit Theorem (CLT) is a statistical theory stating that, given a sufficiently large sample size drawn from a population with a finite level of variance, the means of all samples from that population will be approximately normally distributed, and their average will be approximately equal to the mean of the population.

Simplifying it more,

The central limit theorem states that when many successive random samples are taken from a population, the sampling distribution of the means of those samples becomes approximately normal with mean ( μ) and variance ( σ²/N) as the sample size (N) becomes larger, irrespective of the shape of the population distribution.

Let’s take an example to understand

Suppose we draw a random sample of size n ( x₁, x₂, ……, xₙ ) from a population random variable that is distributed with mean ( μ) and standard deviation ( σ).

Do this repeatedly, drawing many samples from the population, and calculate the mean x̄ of each sample.

We will treat the x̄ values as another distribution, which we will call the sampling distribution of the mean.

Given a distribution with mean ( μ ) and variance ( σ² ), the sampling distribution of the mean approaches a normal distribution with mean ( μ ) and variance ( σ²/n ) as the sample size n increases. The amazing and very interesting thing about the central limit theorem is that no matter what the shape of the original ( parent ) distribution, the sampling distribution of the mean approaches a normal distribution.

A normal distribution is approached very quickly as n increases. Note that n is the size of each sample, not the number of samples.
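The CLT is easy to see by simulation. In the sketch below the parent population is exponential, which is heavily skewed and looks nothing like a bell curve, yet the sample means come out with mean μ = 1 and variance σ²/N = 1/50, as the theorem predicts (the sample size 50 and the 20,000 repetitions are arbitrary choices):

```python
import random
import statistics

random.seed(0)
N = 50                # size of each sample
num_samples = 20_000  # number of repeated samples

# Parent population: exponential with rate 1 (mean 1, variance 1, skewed).
sample_means = [
    sum(random.expovariate(1.0) for _ in range(N)) / N
    for _ in range(num_samples)
]

m = statistics.mean(sample_means)       # ~ population mean mu = 1
v = statistics.pvariance(sample_means)  # ~ sigma^2 / N = 1/50 = 0.02
```

Plotting a histogram of `sample_means` would show a near-symmetric bell shape, despite the skewed parent.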

Poisson Distribution

This distribution can be applied to some interesting examples that you can relate very easily.

Suppose you work at a call centre: how many calls do you get in a day? It can be any number. The total number of calls at a call centre in a day can be modelled by a Poisson distribution. Some more examples are:

  • The number of emergency calls recorded at a hospital in a day.
  • Number of thefts reported in an area on a day.
  • Number of suicides reported in a particular city.

The Poisson distribution is applicable in situations where events occur at random points in time or space, and our interest lies only in the number of occurrences of the event.

The following assumptions are necessary for Poisson distributions:

  • Any successful event should not influence the outcome of another successful event.
  • The probability of success is proportional to the length of the interval; intervals of equal length have equal probability of success.
  • The probability of more than one success in a very small interval approaches zero as the interval becomes smaller.

The PMF of a random variable x for Poisson distribution is given by:

P(x = k) = e^(−μ) μᵏ / k!, k = 0, 1, 2, …

where μ is the mean number of events in an interval of length t:

μ = λ*t

where t is the length of the time interval and λ is the rate at which events occur.

The graph of a Poisson distribution is shown below:

[Graph: Poisson PMF]

The mean and variance of the Poisson distribution are:

E(x) = μ

Var(x) = μ
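Both facts can be verified directly from the PMF, with no sampling needed. The sketch below uses an illustrative μ = 4 (say, a mean of 4 calls per day) and sums the PMF over enough terms that the truncated tail is negligible:

```python
import math

mu = 4.0  # illustrative mean number of events per interval

def poisson_pmf(k, mu):
    # P(x = k) = e^(-mu) * mu^k / k!
    return math.exp(-mu) * mu ** k / math.factorial(k)

ks = range(60)  # the tail beyond k = 60 is negligible for mu = 4

total = sum(poisson_pmf(k, mu) for k in ks)                  # ~ 1 (normalized)
mean = sum(k * poisson_pmf(k, mu) for k in ks)               # E(x) = mu
var = sum((k - mean) ** 2 * poisson_pmf(k, mu) for k in ks)  # Var(x) = mu
```

The equality of mean and variance is the distinctive signature of the Poisson distribution.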

Exponential Distribution

Let’s consider the call centre example again. What about the interval of time between calls? Some other examples are:

  • Length of time between metro arrivals.
  • The life of an air conditioner.
  • How much time will elapse before an earthquake occurs in a given region?
  • How long will a piece of machinery work without breaking down?

These types of questions, where we need to find the waiting time until a given event occurs, can be answered by the exponential distribution. It is widely used for survival analysis.

The PDF of a random variable x with an exponential distribution is given by:

p(x) = λ e^(−λx) for x ≥ 0, and p(x) = 0 otherwise

Exponential Distribution

where the parameter λ > 0 is called the rate.

For survival analysis, λ is called the failure rate of a device at any time t, given that it has survived up to t.

The mean and variance of the exponential distribution are:

E(x) = 1/λ
Var(x) = (1/λ)²
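These two formulas can be checked by simulation; the rate λ = 2 below is an illustrative choice (say, 2 calls per hour, so a mean wait of half an hour):

```python
import random

random.seed(5)
lam = 2.0
n = 200_000

# Draw n waiting times from an exponential distribution with rate lam.
samples = [random.expovariate(lam) for _ in range(n)]

mean = sum(samples) / n                           # ~ 1/lam = 0.5
var = sum((s - mean) ** 2 for s in samples) / n   # ~ (1/lam)^2 = 0.25
```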

Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the curve. This is explained better by the graph shown below.

[Graph: exponential PDFs for several values of the rate λ]

Laplace Distribution

The Laplace distribution represents the distribution of differences between two independent variables having identical exponential distributions. It is also called the double exponential distribution.

Like the normal distribution, this distribution is unimodal (one peak) and symmetrical. However, it has a sharper peak than the normal distribution.

The general PDF for laplace distribution is:

p(x; μ, b) = (1 / (2b)) exp( −|x − μ| / b )

Laplace Distribution (μ is the location and b > 0 the scale)

Mixtures of Distributions

It is also common to define probability distributions by combining other, simpler probability distributions. A mixture distribution is a mixture of two or more probability distributions.

The parent populations can be univariate or multivariate, although the mixed distributions should have the same dimensionality. In addition, they should either be all discrete probability distributions or all continuous probability distributions.

A mixture distribution can be defined by the following formula:

p(x) = ∑ₖ λₖ fₖ(x) = λ₁f₁(x) + λ₂f₂(x) + ⋯ + λₙfₙ(x)

where f₁, f₂, … fₙ are the component distributions and λₖ are the mixing weights (i.e. the probabilities for how much each individual distribution contributes to the mixture distribution). The weights must satisfy:

  • λₖ > 0,
  • Σₖλₖ = 1

Examples when to use mixture distribution:

  • To show how variables can be differently distributed. Say you are investigating how stress affects exam scores in a school. Two distributions commonly used for this are the normal distribution and a bimodal distribution (two peaks). Your random variable could follow the normal distribution with probability 0.7 and the bimodal distribution with probability 0.3 (note that the probabilities add up to 1).
  • When you have no idea what an outcome will be. For example, say you are thinking about investing in the stock of a tech company. They are about to release a gadget and you think this will make the stock rise dramatically, with a mean of 100% and a standard deviation of 25%. But there are also rumours that this gadget might have major bugs hindering its release, which would make the stock fall, with a mean of 30% and a standard deviation of 15%. As you don’t know whether the gadget will be released or not, the mixture is equally weighted (i.e. 50% for the falling distribution and 50% for the rising distribution).
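Sampling from a mixture follows directly from the formula: first pick a component k with probability λₖ, then sample from that component. The sketch below uses a two-component Gaussian mixture whose weights and parameters are entirely invented for illustration:

```python
import random

random.seed(11)
weights = [0.3, 0.7]                     # mixing weights lambda_k, sum to 1
components = [(-2.0, 0.5), (3.0, 1.0)]   # (mean, std) of each Gaussian

def sample_mixture():
    # Step 1: choose a component index k with probability lambda_k.
    k = random.choices(range(len(weights)), weights=weights)[0]
    # Step 2: sample from the chosen Gaussian component.
    mu, sigma = components[k]
    return random.gauss(mu, sigma)

draws = [sample_mixture() for _ in range(100_000)]

# The mixture mean is the weighted average of component means:
# 0.3 * (-2) + 0.7 * 3 = 1.5
mean = sum(draws) / len(draws)
```

A histogram of `draws` would show two peaks, one per component, with heights reflecting the mixing weights.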

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components are Gaussians, each with its own parametrized mean and covariance. This is a very wide topic and I am going to discuss it in detail in a later article.

This concludes this part of the series. Here I covered the basics of probability, probability distributions, and related topics. In the next article we will look at some broader and more complicated topics that I am currently working on. I hope you enjoyed this part, and we will learn more in the upcoming parts too.

P.S. : If you are looking to try Deep Learning projects somewhere then do check out this online collaborative notebook DeepNote .

References

Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Cournville.

Probability and its types

Joint, Marginal and Conditional probability by Fong Chun Chan’s blog.

Marginal probability explanation

Bayes’ Theorem and Conditional Probability by Jeremy Orloff and Jonathan Bloom.

Introduction to Conditional Probability and Bayes theorem by Analytics Vidhya.

Bayes Rule and conditional probability example.

Article on Bayesian Probability .

Explained pdf on conditional independence .

Probability Distribution and its types

Article on Basic Probability Distributions.

Common types of Probability Distributions.

Article on Probability Distributions by Sean Owen.

Article on Categorical Distribution.

Article on Mixture Distribution.

Detailed article on Mixture Distribution.

Stackoverflow answer on Why is Normal Distribution a default choice?

Probability Density Function

Blog on Probability Density Function by Penn State Eberly College of Science.

Probability Density Functions explained with examples by Jonny Brooks.

What is the Physical significance of PDF?

Difference between a probability density function and a cumulative distribution function

Random Variables

Explained article on random variable.

Article on Discrete and Continuous Random Variable .

Expectation and Variance

Expectation, Variance and Standard Deviation by Jeremy Orloff and Jonathan Bloom .

Article on Expectation and Variance .

Difference between mean and expectation .

Difference between average and expected value.

Article on formulas for expectation and variance .

Article on Variance and Covariance.

Central Limit Theorem

Understanding Central Limit Theorem .

The only theorem Data Scientists need to know .

Detailed explanation on CLT .

