
What’s Linear About Logistic Regression


There are already plenty of amazing articles and videos on Logistic Regression, but it was a struggle for me to understand the connection between the probabilities and the linearity of the model, so I figured I'd document it here for myself and for anyone who might be going through the same thing.

This will also shed some light on where the ‘Logistic’ part of Logistic Regression comes from!

The focus of this blog will be on building an intuitive understanding of the relationship between the logistic model and the linear model, so I’m just going to do an overview of what Logistic Regression is and dive into that relationship. For a more complete explanation of this awesome algorithm, here are some of my favorite resources:

Now let’s get to the gist of Logistic Regression.

What is Logistic Regression?

Like Linear Regression, Logistic Regression is used to model the relationship between a set of independent variables and a dependent variable.

Unlike Linear Regression, the dependent variable is categorical, which is why it’s considered a classification algorithm.

Logistic Regression could be used to predict whether:

  • An email is spam or not spam
  • A tumor is malignant or not
  • A student will pass or fail an exam
  • I will regret snacking on cookies at 12 am

The applications listed above are examples of Binomial/Binary Logistic Regression where the target is dichotomous (2 possible values), but you could have more than 2 classes (Multinomial Logistic Regression).

These classifications are made based on the probabilities produced by the model and some threshold (typically 0.5). E.g. A student is predicted to pass if her probability of passing is greater than 0.5.
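As a tiny illustration (the probability values below are made up), the thresholding step looks like this:

# Hypothetical predicted probabilities of passing for three students
probabilities = [0.91, 0.42, 0.73]

# Apply the usual 0.5 threshold to turn each probability into a class label
labels = [1 if p > 0.5 else 0 for p in probabilities]
print(labels)  # [1, 0, 1]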

Let’s start digging into how these probabilities are calculated.

The Sigmoid Function

If we visualize a dataset with binary target variables, we’d get something like this:

[Figure: scatter plot of observations with a binary (0/1) target variable]

There are a couple of reasons why fitting a line might not be a good idea here:

  1. In Linear Regression, the dependent variable can range from negative infinity to positive infinity, but here we're trying to predict probabilities, which must lie between 0 and 1.
  2. Even if we created rules to map those out-of-bound values to a label, the classifier would be very sensitive to outliers, which would hurt its performance.

So, instead of a straight line, we model it with an S shape that flattens out near 0 and 1:

[Figure: an S-shaped (sigmoid) curve flattening out near 0 and 1]

This is called a sigmoid function and it has this form:

$$p(x) = \frac{1}{1 + e^{-\beta X}}$$

This function returns the probability that an observation belongs to a class based on some combination of factors.
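As a quick sketch (my own, not from the original post), here's the sigmoid in NumPy, where z stands in for the linear combination βX:

import numpy as np

def sigmoid(z):
    # Map any real-valued score z to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.98
print(sigmoid(-4))  # ~0.02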

And if we rearrange to isolate the linear part, we get the log of the odds, or the logit:

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta X$$

Notice that when p(x) ≥ 0.5, βX ≥ 0.
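A small sketch (again my own) to see that numerically: the logit is the inverse of the sigmoid, so probabilities at or above 0.5 map to non-negative values of βX.

import numpy as np

def logit(p):
    # Log of the odds: the inverse of the sigmoid function
    return np.log(p / (1 - p))

print(logit(0.5))  # 0.0   -> p = 0.5 corresponds to betaX = 0
print(logit(0.9))  # ~2.2  -> p > 0.5 corresponds to betaX > 0
print(logit(0.1))  # ~-2.2 -> p < 0.5 corresponds to betaX < 0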

But wait a minute, where did this magical function come from and how did the linear model get in there? To answer that, we’ll take a look at how Logistic Regression forms its decision boundary.

Decision Boundary

Behind every great Logistic Regression model is an unobservable (latent) linear regression model, because the question it’s really trying to answer is:

“What is the probability an observation belongs to class 1 given some characteristics x?”

$$P(y = 1 \mid x)$$

Let’s take a look at an example.

Suppose we want to predict whether a student will pass an exam based on how much time she spent studying and sleeping:

[Figure: sample rows of the exam dataset with columns Studied, Slept, and Passed (source: scilab)]

Let's understand our data better by plotting Studied against Slept, color-coding the points by class to visualize the split:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Load the exam data: hours studied, hours slept, and the pass/fail label
exams = pd.read_csv('data_classification.csv', names=['Studied', 'Slept', 'Passed'])

# Scatter plot of Studied vs. Slept, colored by class (red = 0, blue = 1)
fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['red', 'blue']
ax.scatter(exams.Studied, exams.Slept, s=25, marker="o",
           c=exams['Passed'], cmap=matplotlib.colors.ListedColormap(colors))
plt.show()
[Figure: scatter plot of Studied vs. Slept, colored by Passed]

Looking at this plot, we can hypothesize a few relationships:

  • Students who spend enough time studying and get lots of sleep are likely to pass
  • Students who sleep less than 2 hours but spend 8+ hours studying will probably still pass (I was for sure in this group)
  • Students who slack on studying and forego sleep have probably accepted their fate of not passing

The idea here is there’s a clear line separating these two classes, and we’re hoping Logistic Regression is going to find that for us. Let’s fit a Logistic Regression model and overlay this plot with the model’s decision boundary.

from sklearn.linear_model import LogisticRegression

# Features are hours Studied and Slept; the target is the Passed label
features = exams.drop(['Passed'], axis=1)
target = exams['Passed']

# Fit the model and generate class predictions on the training data
logmodel = LogisticRegression()
logmodel.fit(features, target)
predictions = logmodel.predict(features)

You can print out the parameter estimates:
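The original screenshot isn't reproduced here, but with scikit-learn the estimates live on the fitted model object:

# The fitted intercept (beta_0) and the coefficients for Studied and Slept
print(logmodel.intercept_)
print(logmodel.coef_)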

[Output: the fitted intercept and the coefficients for Studied and Slept]

Using those estimates, we can calculate the boundary. Since our threshold is set at 0.5, I'm holding the logit at 0 (recall that p(x) = 0.5 corresponds to βX = 0). This also lets us view the boundary in 2D:

$$\beta_0 + \beta_1 \cdot \text{Studied} + \beta_2 \cdot \text{Slept} = 0 \quad\Rightarrow\quad \text{Slept} = \frac{-\beta_0 - \beta_1 \cdot \text{Studied}}{\beta_2}$$
# Solve beta_0 + beta_1*Studied + beta_2*Slept = 0 for Slept
exams['boundary'] = (-logmodel.intercept_[0] - (logmodel.coef_[0][0] * features['Studied'])) / logmodel.coef_[0][1]

Here’s what it looks like on our scatter plot:

plt.scatter(exams['Studied'],exams['Slept'], s=25, marker="o", c=exams['Passed'], cmap=matplotlib.colors.ListedColormap(colors))
plt.plot(exams['Studied'], exams['boundary'])
plt.show()
[Figure: the Studied vs. Slept scatter plot with the fitted decision boundary overlaid]

That looks reasonable! So how does Logistic Regression use this line to assign class labels? It looks at the signed distance between each observation and the line: every point above the line is labeled 1 and every point below it is labeled 0. A point exactly on the line could belong to either class (probability 0.5), so to classify an observation as 1, we're interested in the probability that its distance from the line is greater than 0.
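As a quick sanity check on that story (a sketch, assuming the target is coded 0/1), scikit-learn's decision_function returns βX for each observation, and thresholding it at 0 reproduces the model's own predictions:

import numpy as np

# decision_function returns beta_0 + beta_1*Studied + beta_2*Slept for each row
scores = logmodel.decision_function(features)

# A positive score means the point falls on the "pass" side of the boundary
manual_predictions = (scores > 0).astype(int)
print(np.array_equal(manual_predictions, predictions))  # True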

As it turns out, in Logistic Regression, this distance is assumed to follow the logistic distribution.

In other words, the error term of the latent linear regression model in Logistic Regression is assumed to follow the logistic distribution.
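A small simulation (with an arbitrary made-up score, not from the original post) illustrates this: adding logistic noise to a linear score and checking how often the result exceeds 0 recovers the sigmoid probability.

import numpy as np

rng = np.random.default_rng(0)

beta_x = 1.2  # an arbitrary value of the linear score betaX
noise = rng.logistic(loc=0, scale=1, size=1_000_000)  # logistic-distributed error term

# Fraction of simulated latent scores betaX + eps that land above 0
simulated = np.mean(beta_x + noise > 0)
analytic = 1 / (1 + np.exp(-beta_x))  # sigmoid(betaX)
print(simulated, analytic)  # both come out around 0.769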

This means when we ask:

$$P(y = 1 \mid x)$$

We’re really asking:

$$P(\beta X + \varepsilon > 0 \mid x)$$

To calculate this probability, we integrate the logistic distribution's density to get its cumulative distribution function:

$$P(\beta X + \varepsilon > 0) = P(\varepsilon > -\beta X) = \frac{1}{1 + e^{-\beta X}}$$

Oh hey! It’s the sigmoid function :).
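If you have SciPy handy, you can confirm the match directly: the CDF of the standard logistic distribution is exactly the sigmoid.

import numpy as np
from scipy.stats import logistic

z = np.linspace(-6, 6, 13)
print(np.allclose(logistic.cdf(z), 1 / (1 + np.exp(-z))))  # True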

Tada! You should now be able to walk back and forth between the sigmoid function and the linear regression function more intuitively. I hope understanding this connection gives you a deeper appreciation for Logistic Regression, as it did for me.

