Illustration with Python: Confidence Interval

Oct 28 ·7min read

This article uses knowledge in the central limit theorem , and also the concept of the weak law of large numbers and Chebyshev’s inequality , you can review these topics by visiting the links. If you would like to follow along, you can get the code from this link: Jupyter Notebook .

In my opinion, this topic is the most confusing theorem compare with others, even the article in Wikipedia mentions the misunderstanding about it, so I will try my best to explain it.

Before diving in, keep in mind that the mean of the population (the thing we what to estimate) is a constant, there is no randomness about the number.

The confidence interval is an estimator we use to estimate the value of population parameters. The interval will create a range that might contain the values. When we create the interval, we use a sample mean. Recall the central limit theorem, if we sample many times, the sample mean will be normally distributed.

I create the sample mean distribution to demonstrate this estimator.

# use gamma distribution
shape, scale = 2.0, 2.0  # mean=4, std=2*sqrt(2)
s = np.random.gamma(shape, scale, 1000000)
mu = shape*scale # mean and standard deviation
sigma = scale*np.sqrt(shape)# create sample mean distribution
meansample = []
# sample size
samplesize = 500
for j in range(0,50000):
    # sampling 500 sample from population
    rc = random.choices(s, k=samplesize)
    # collect mean of each sample
    meansample.append(sum(rc)/len(rc))plt.figure(figsize=(20,10))
plt.hist(meansample, 200, density=True, color='lightblue')
plt.show()

nYzqaa6.jpg!web

The sample means distribution is normally distributed with the mean equal to the population mean which is 4. We can change this distribution to standard normal distribution and use the Z table to calculate the probability.

I use the Z table to create a range that covers 95% of the sample mean, as the following picture.

# set mean and 95% probability
plt.figure(figsize=(20,10))
plt.hist(meansample, 200, density=True, color='lightblue')
plt.plot([mu,mu],[0, 3.2], 'k-', lw=4, color='green')
plt.plot([mu-(1.96*sigma/np.sqrt(samplesize)),mu-(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=2, color='navy')
plt.plot([mu+(1.96*sigma/np.sqrt(samplesize)),mu+(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=2, color='navy')
plt.show()

jeUVJjQ.jpg!web

The green line is the true mean or population mean. The two blue lines are a 95% border.

If we sample from the population and calculate the mean, 95% of the time, we will get the mean within two blue lines.

Unfortunately, we can not sample 50,000 times and create this distribution. Suppose that we can sample only 1 time and we get a sample mean with value 3.85. The mean we get is the black line.

# suppose that we sample 500 data that has a mean as 3.85
# Xbar mean = 3.85, sigma = sigma/np.sqrt(samplesize)
m, ss = 3.85, sigma/np.sqrt(samplesize) # mean and standard deviation
plt.figure(figsize=(20,10))
plt.hist(meansample, 200, density=True, color='lightblue')
plt.plot([mu,mu],[0, 3.2], 'k-', lw=4, color='green')
plt.plot([mu-(1.96*sigma/np.sqrt(samplesize)),mu-(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=2, color='navy')
plt.plot([mu+(1.96*sigma/np.sqrt(samplesize)),mu+(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=2, color='navy')
plt.plot([m,m],[0, 3.2], 'k-', lw=2, color='black')
plt.show()

IvQvMjY.jpg!web

The black line is the mean we got from sampling.

From the mean we got, we want to create an estimator to estimate the value of the true mean, so we create an interval by using a 95% range (the yellow area). I will explain later why we use a 95% range.

# the interval we create cover the population mean because it is within 95% range from population mean
plt.figure(figsize=(20,10))plt.hist(meansample, 200, density=True, color='lightblue')
plt.plot([mu,mu],[0, 3.2], 'k-', lw=4, color='green')
plt.plot([mu-(1.96*sigma/np.sqrt(samplesize)),mu-(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=1, color='navy')
plt.plot([mu+(1.96*sigma/np.sqrt(samplesize)),mu+(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=1, color='navy')
plt.plot([m,m],[0, 3.2], 'k-', lw=4, color='black')
plt.plot([m-(1.96*ss),m-(1.96*ss)],[0, 3.2], 'k-', lw=2, color='red')
plt.plot([m+(1.96*ss),m+(1.96*ss)],[0, 3.2], 'k-', lw=2, color='red')
# Create a Rectangle patch
plt.gca().add_patch(plt.Rectangle((m-(1.96*ss), 0),2*(1.96*ss),3.21, fill=True, linewidth=3, fc=(1,1,0,0.3)))
plt.xlim(3.43, 4.67) 
plt.show()

7nAJZrE.jpg!web

The yellow area is an interval of the sample mean.

We can see that the interval covers the true mean (the green line). The interval (the yellow area) covers the true mean because the sample means fall within a 95% range from the population mean (the green line).

Since the sample means is random, there is a 5% probability that the mean will fall outside a 95% range. This is what happens when the sample mean falls outside a 95% range.

# if the interval is not within 95% range from population mean the interval will not cover the true population mean
plt.figure(figsize=(20,10))
m = 3.72
plt.hist(meansample, 200, density=True, color='lightblue')
plt.plot([mu,mu],[0, 3.2], 'k-', lw=4, color='green')
plt.plot([mu-(1.96*sigma/np.sqrt(samplesize)),mu-(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=1, color='navy')
plt.plot([mu+(1.96*sigma/np.sqrt(samplesize)),mu+(1.96*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=1, color='navy')
plt.plot([m,m],[0, 3.2], 'k-', lw=4, color='black')
plt.plot([m-(1.96*ss),m-(1.96*ss)],[0, 3.2], 'k-', lw=2, color='red')
plt.plot([m+(1.96*ss),m+(1.96*ss)],[0, 3.2], 'k-', lw=2, color='red')
# Create a Rectangle patch
plt.gca().add_patch(plt.Rectangle((m-(1.96*ss), 0),2*(1.96*ss),3.21, fill=True, linewidth=3, fc=(1,1,0,0.3)))
plt.xlim(3.43, 4.67) 
plt.show()

Q7BNFrM.jpg!web

If we sample and get the mean that falls outside blue lines, the interval will not cover the true mean.

The yellow area does not cover the population mean.

We use the 95% range because it matches with the probability of sample mean. If the sample means is on the border of probability the range still covers that true mean, but if it is a bit further, it will not cover. The probability that the sample mean will fall within two blue lines is 95%, so the interval of 5% outside the two lines will not cover the true mean.

The number 95% is an arbitrary number, you can set it to be any number. For example, we can set it to be a 90% range.

# use 90% instead of 95%
plt.figure(figsize=(20,10))
m = 3.72
plt.hist(meansample, 200, density=True, color='lightblue')
plt.plot([mu,mu],[0, 3.2], 'k-', lw=4, color='green')
plt.plot([mu-(1.645*sigma/np.sqrt(samplesize)),mu-(1.645*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=1, color='navy')
plt.plot([mu+(1.645*sigma/np.sqrt(samplesize)),mu+(1.645*sigma/np.sqrt(samplesize))],[0, 3.2], 'k-', lw=1, color='navy')
plt.plot([m,m],[0, 3.2], 'k-', lw=4, color='black')
plt.plot([m-(1.645*ss),m-(1.645*ss)],[0, 3.2], 'k-', lw=2, color='red')
plt.plot([m+(1.645*ss),m+(1.645*ss)],[0, 3.2], 'k-', lw=2, color='red')
# Create a Rectangle patch
plt.gca().add_patch(plt.Rectangle((m-(1.645*ss), 0),2*(1.645*ss),3.21, fill=True, linewidth=3, fc=(1,1,0,0.3)))
plt.xlim(3.43, 4.67) 
plt.show()

BJJ32qQ.jpg!web

When we change the confidence from 95% to 90%, the interval becomes narrower.

Therefore, when we say 95% confidence means we are confident that 95% of the samples will have a mean that is within the 95% range from the true mean (within two blue lines). The problem is if we knew the true mean we would not bother doing this. To solve the problem, we change the interpretation into “we are confident that 95% of the samplings, the samples will have a mean that can create the interval which covers the true mean”.

Like I mentioned above, the only thing that is random here is the sample mean, so we CANNOT say that the probability that true mean is within the interval because the population means is not a random variable, the true mean is a number.

In theory, if we sample 100 times, 95 times we will have a sample mean that has an interval that covers the true mean, so I use python to simulate 100 samplings and this is what happens.

# simulate 100 interval with 5,000 sample size
mu = shape*scale # mean
sigma = scale*np.sqrt(shape) # standard deviation
intervallist = []
k = 1.96
# sample size
samplesize = 5000
# start count
c = 0
for i in range(0,100):
    # sample 100 sample
    rs = random.choices(s, k=samplesize)
    # calculate mean
    mean = np.mean(rs)
    upbound = mean + k*sigma/np.sqrt(samplesize)
    lowbound = mean - k*sigma/np.sqrt(samplesize)
    # collect difference between sample mean and mu
    intervallist.append([lowbound,mean,upbound])
    if upbound >= mu and lowbound <= mu:
        c += 1
        
print("number of interval that cover the expected values:", c)
# set figure size.
plt.figure(figsize=(20,10))
# plot box plots of each sample mean.
plt.boxplot(intervallist)
plt.plot([1, 100],[mu,mu], 'k-', lw=2, color='red')
# show plot.
plt.show()

MBZnya7.jpg!web

The red line is a true mean, the box plots are intervals.

The number might not be exact 95 because it is a probability after all.

One last thing, the sample size has a huge impact on this topic. If the sample size increases, the variance of sample means will decrease. Thus, the interval range also decreases.

# simulate 100 interval with 20,000 sample size
intervallist = []
# sample size
samplesize = 20000
# start count
c = 0
for i in range(0,100):
    # sample 100 sample
    rs = random.choices(s, k=samplesize)
    # calculate mean
    mean = np.mean(rs)
    upbound = mean + k*sigma/np.sqrt(samplesize)
    lowbound = mean - k*sigma/np.sqrt(samplesize)
    # collect difference between sample mean and mu
    intervallist.append([lowbound,mean,upbound])
    if upbound >= mu and lowbound <= mu:
        c += 1
        
print("number of interval that cover the expected values:", c)
# set figure size.
plt.figure(figsize=(20,10))
# plot box plots of each sample mean.
plt.boxplot(intervallist)
plt.plot([1, 100],[mu,mu], 'k-', lw=2, color='red')
# show plot.
plt.show()

uemE3qM.jpg!web

Notice the y-axis scale is smaller.

If we have a large sample size, the estimator will be more precise.

The code can be found in this link: Jupyter Notebook .

Illustration with Python: Confidence Interval

Illustration with Python: Confidence Interval

Recommend

Flink 实时计算指标对数方案

JWT在Gin中的使用

When to switch to Python 3.8

深度学习遇见代码搜索，一篇论文概览神经代码搜索

Ten simple rules for helping newcomers become contributors to open projects

5G套餐价格曝光：起步价128元，最高598元

GitHub - appbrewery/EggTimer-iOS13: Learn to Code While Building Apps - The Comp...

好像种草 hhkb 了，求打醒

一文看懂 NLP 里的模型框架 Encoder-Decoder 和 Seq2Seq

Issue 205 – October 29th 2019

About Joyk