

source link: https://towardsdatascience.com/what-to-do-when-your-model-has-a-non-normal-error-distribution-f7c3862e475f?gi=e85fab0fc142

OPTIMIZATION AND MACHINE LEARNING

What to Do When Your Model Has a Non-Normal Error Distribution

How to use warping to fit arbitrary error distributions

Mar 13 · 13 min read


Photo by Neil Rosenstech on Unsplash

One of the most important things a model can tell us is how certain it is about a prediction. An answer to this question can come in the form of an error distribution: a probability distribution about a point prediction telling us how likely each error delta is.

The error distribution can be every bit as important as the point prediction.

Suppose you’re an investor considering two different opportunities (A and B) and using a model to predict the one-year returns (as a percentage of the amount invested). The model predicts A and B to have the same expected one-year return of 10% but produces these error distributions:

[Figure: predicted error distribution for opportunity A]

[Figure: predicted error distribution for opportunity B]

Even though both opportunities have the same expected return, the error distributions show how different they are. B is tightly distributed about its expected value with little risk of losing money, whereas A is more like a lottery: there’s a small probability of a high payout (~500% return), but the majority of the time we lose everything (~-100% return).

A point prediction tells us nothing about where target values are likely to be distributed. If it’s important to know how far off a prediction can be, or whether target values cluster in fat tails, then an accurate error distribution becomes essential.

An easy way to get the error distribution wrong is to try to force it into a form it doesn’t take. This frequently happens when we reach for the convenient, but often misapplied, normal distribution.

The normal distribution is popular for good reason. In addition to making math easier, the central limit theorem tells us the normal distribution can be a natural choice for many problems.

How can a normal distribution come about naturally?

Let X denote a feature matrix and b denote a vector of regressors. Suppose target values are generated by the equation

y_i = x_i · b + e_i

where

e_i = E_{i,1} + E_{i,2} + ⋯ + E_{i,m}

The central limit theorem says that if the E’s are independent, identically distributed random variables with finite variance, then their sum will approach a normal distribution as m increases.

Even when E is wildly non-normal, e will be close to normal if the summation contains enough terms.

Let’s look at a concrete example. Set b = (-2, 3). Let the entries of X be generated independently from the uniform distribution on [-1, 1]. We’ll generate the E’s from this decidedly non-normal distribution:

[Figure: probability density of the non-normal distribution used to generate the E’s]

We normalize the error distribution for e to have unit variance and allow the number of terms m to vary. Here are histograms of the errors (in orange) from least-squares models fit on simulation runs for different values of m, overlaid with the expected histogram of the errors if they were normally distributed (in blue)¹.

[Figures: error histograms for increasing values of m, overlaid with the normal distribution’s expected histogram]

With greater values of m, the error histogram gets closer to that of the normal distribution.
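Here’s a rough sketch of that simulation in Python (my own code, not the article’s; the article’s exact non-normal distribution isn’t reproduced, so a centered exponential stands in for the E’s):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 100                      # number of samples and of summed noise terms
b = np.array([-2.0, 3.0])             # regressors

# Features drawn independently from the uniform distribution on [-1, 1].
X = rng.uniform(-1.0, 1.0, size=(n, 2))

# Stand-in for the article's non-normal distribution: a centered exponential.
E = rng.exponential(1.0, size=(n, m)) - 1.0
e = E.sum(axis=1) / np.sqrt(m)        # normalize the summed noise to unit variance

y = X @ b + e

# Least-squares fit; the residual histogram approaches a normal shape as m grows.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b_hat
print("fitted b:", b_hat)
print("residual skew:", ((residuals - residuals.mean())**3).mean() / residuals.std()**3)
```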

When there’s reason to think that error terms break down into sums of independent, identically distributed factors like this, the normal distribution is a good choice. But in the general case, we have no reason to assume it. And indeed, many error distributions are not normal, exhibiting skew and fat tails.

What should we do when we have non-normality in an error distribution?

This is where warping helps us². It uses the normal distribution as a building block but gives us knobs to locally adjust the distribution to better fit the errors from the data.

To see how warping works, observe that if f(y) is a monotonically increasing surjective function and p(z) is a probability density function, then p(f(y))f′(y) forms a new probability density function: it is non-negative because f′(y) ≥ 0, and after applying the substitution u = f(y), we see that

∫ p(f(y)) f′(y) dy = ∫ p(u) du = 1

Let’s look at an example to see how f can reshape a distribution. Suppose p(z) is the standard normal distribution N(0, 1) and f(y) is defined by

f(y) = y,  for y ≤ 0
f(y) = c·y,  for y ≥ 1

where c > 0, and on [0, 1], f is a spline that smoothly transitions between y and c·y. Here’s what f looks like for a few different values of c:

[Figure: the warping function f for several values of c]

and here is what the resulting warped probability distributions look like³:

[Figure: the warped probability density functions for several values of c]

When c = 2 , area is redistributed from the standard normal distribution so that the probability density function (PDF) peaks and then quickly falls off so as to have a thinner right tail. When c = 0.5 , the opposite happens: the PDF falls off quickly and then slows its rate of decline so as to have a fatter right tail.
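A small sketch of this reshaping (my own code; the article doesn’t specify the spline, so a cubic Hermite spline on [0, 1] stands in for it, and the derivative is taken numerically):

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def make_warp(c):
    # f(y) = y for y <= 0 and f(y) = c*y for y >= 1, joined on [0, 1] by a
    # cubic Hermite spline matching values and slopes at the endpoints.
    spline = CubicHermiteSpline([0.0, 1.0], [0.0, c], [1.0, c])
    def f(y):
        y = np.asarray(y, dtype=float)
        return np.where(y <= 0.0, y, np.where(y >= 1.0, c * y, spline(y)))
    return f

def warped_pdf(f, y, eps=1e-6):
    # p(f(y)) * f'(y) with p the standard normal; f' via central differences.
    z = f(y)
    fprime = (f(y + eps) - f(y - eps)) / (2.0 * eps)
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi) * fprime

# The warped density should still integrate to ~1 for any c > 0.
y = np.linspace(-8.0, 12.0, 40001)
dy = y[1] - y[0]
for c in (0.5, 1.0, 2.0):
    print(c, (warped_pdf(make_warp(c), y) * dy).sum())
```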

Now, imagine f is parameterized by a vector ψ that allows us to make arbitrary localized adjustments to the rate of increase. (More on how to parameterize f later). Then with suitable ψ , f can fit a wide range of different distributions. If we can find a way to properly adjust ψ , then this will give us a powerful tool to fit error distributions.

How to adjust warping parameters?

A better-fitting error distribution makes the errors on the training data more likely. It follows that we can find warping parameters by maximizing likelihood on the training data.

First, let’s look at how maximizing likelihood works without warping.

Let θ denote the parameter vector for a given regression model and let g(x; θ) represent the prediction of the model for feature vector x. If we use a normal distribution with standard deviation σ to model the error distribution of predictions, then the likelihood of the n training points is

∏_{i=1}^{n} (2πσ²)^(-1/2) · exp(−(y_i − g(x_i; θ))² / (2σ²))

and the log-likelihood is

−(n/2)·log(2πσ²) − (1/(2σ²)) · Σ_{i=1}^{n} (y_i − g(x_i; θ))²

Put

RSS = Σ_{i=1}^{n} (y_i − g(x_i; θ))²

(RSS stands for residual sum of squares )

For θ fixed, σ maximizes likelihood when

σ² = RSS / n

More generally, if σ² = c·RSS (c > 0), then the log-likelihood simplifies to

−(n/2)·log RSS + constant

and we see that likelihood is maximized when θ minimizes RSS.
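As a quick numerical check of this simplification (a sketch assuming Gaussian residuals, not code from the article), the maximized log-likelihood differs from −(n/2)·log RSS only by a term that doesn’t depend on θ:

```python
import numpy as np

def gaussian_log_likelihood(residuals, sigma):
    n = residuals.size
    return (-n * np.log(sigma)
            - 0.5 * n * np.log(2.0 * np.pi)
            - (residuals**2).sum() / (2.0 * sigma**2))

rng = np.random.default_rng(1)
r = rng.normal(size=200)                 # residuals from some fitted model
rss = (r**2).sum()
n = r.size

sigma_hat = np.sqrt(rss / n)             # the sigma that maximizes likelihood
constant = 0.5 * n * (np.log(n) - np.log(2.0 * np.pi) - 1.0)

# The maximized log-likelihood equals -(n/2) log RSS plus a constant,
# so maximizing likelihood over theta means minimizing RSS.
print(gaussian_log_likelihood(r, sigma_hat))
print(-0.5 * n * np.log(rss) + constant)
```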

Now, suppose we warp the target space with the monotonic function f parameterized by ψ, and let f(y; ψ) denote a warped target value. Then the likelihood with the warped error distribution is

∏_{i=1}^{n} (2πσ²)^(-1/2) · exp(−(f(y_i; ψ) − g(x_i; θ))² / (2σ²)) · f′(y_i; ψ)

and the log-likelihood becomes

−(n/2)·log(2πσ²) − (1/(2σ²)) · Σ_{i=1}^{n} (f(y_i; ψ) − g(x_i; θ))² + Σ_{i=1}^{n} log f′(y_i; ψ)

Or, with

RSS = Σ_{i=1}^{n} (f(y_i; ψ) − g(x_i; θ))²

and σ² = c·RSS, the log-likelihood simplifies to

−(n/2)·log RSS + Σ_{i=1}^{n} log f′(y_i; ψ) + constant

To fit the error distribution, we’ll use an optimizer to find the parameters (θ, ψ) that maximize this likelihood.

For an optimizer to work, it requires a local approximation to the objective that it can use to iteratively improve on parameters. To build such an approximation, we’ll need to compute the gradient of the log-likelihood with respect to the parameter vector.

Put

L(θ, ψ) = −(n/2)·log RSS + Σ_{i=1}^{n} log f′(y_i; ψ)

We can use L as a proxy for the log-likelihood since it differs only by a constant.
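In code, the proxy might look something like this (a sketch; `g`, `warp`, and `warp_deriv` are assumed stand-ins for the base model’s prediction function, f(·; ψ), and f′(·; ψ)):

```python
import numpy as np

def log_likelihood_proxy(theta, psi, X, y, g, warp, warp_deriv):
    """L(theta, psi) = -(n/2) log RSS + sum_i log f'(y_i; psi),
    where RSS is computed on the warped targets z_i = f(y_i; psi)."""
    n = y.size
    z = warp(y, psi)                      # warped targets
    residuals = z - g(X, theta)           # latent-space residuals
    rss = (residuals**2).sum()
    return -0.5 * n * np.log(rss) + np.log(warp_deriv(y, psi)).sum()
```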

Warping is a general process that can be applied to any base regression model, but we’ll focus on the simplest base model, linear regression.

How to warp a linear regression model?

With linear regression, we can derive a closed form for θ . Let Q and R be the matrices of the QR-factorization of the feature matrix X

X = Q·R

where Q is orthogonal and R is rectangular upper triangular. Put

z_i = f(y_i; ψ)

and let θ̂ denote the vector that minimizes RSS for the warped targets z

θ̂ = argmin_θ Σ_{i=1}^{n} (z_i − x_i · θ)²

Put

z′ = Qᵀ·z

Then

RSS = ‖z − X·θ̂‖² = ‖Qᵀ·z − R·θ̂‖² = ‖z′ − R·θ̂‖²

If X has m linearly independent columns, then the first m rows of the rectangular triangular matrix R have non-zero entries on the diagonal and the remaining rows are 0. It follows that

(R·θ̂)_i = z′_i

for i ≤ m and

(R·θ̂)_i = 0

for i > m. Therefore,

RSS = Σ_{i > m} (z′_i)²

Let P be the n x n diagonal matrix with

P_ii = 0 for i ≤ m,   P_ii = 1 for i > m

Set

r = P·Qᵀ·z

Then

RSS = ‖r‖² = zᵀ·Q·P·Qᵀ·z

Substituting these equations into the log-likelihood proxy, we get

L(θ̂, ψ) = −(n/2)·log(zᵀ·Q·P·Qᵀ·z) + Σ_{i=1}^{n} log f′(y_i; ψ)
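A quick numerical check of this identity (my own sketch, not the article’s code): the squared norm of the last n − m entries of Qᵀz matches the residual sum of squares from a direct least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 3
X = rng.normal(size=(n, m))
z = rng.normal(size=n)                      # warped targets for some fixed psi

Q, R = np.linalg.qr(X, mode="complete")     # full QR factorization
theta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)

rss_direct = ((z - X @ theta_hat)**2).sum()
rss_qr = ((Q.T @ z)[m:]**2).sum()           # ||P Q^T z||^2, with P zeroing the first m rows
print(rss_direct, rss_qr)                   # the two agree
```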

And differentiating with respect to a warping parameter ψ_j gives us

∂L/∂ψ_j = −n · (zᵀ·Q·P·Qᵀ·(∂z/∂ψ_j)) / (zᵀ·Q·P·Qᵀ·z) + Σ_{i=1}^{n} (∂f′(y_i; ψ)/∂ψ_j) / f′(y_i; ψ)

where ∂z/∂ψ_j denotes the vector with entries ∂f(y_i; ψ)/∂ψ_j.

Using these derivatives, an optimizer can climb to warping parameters ψ that maximize the likelihood of the training data.
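Putting the pieces together, here is one way the search could be set up (a sketch using SciPy’s general-purpose optimizer rather than the article’s exact procedure; the analytic gradient above could be supplied, but a gradient-free method is used for brevity, and the warp parameterization is assumed to keep f′ > 0 for every ψ):

```python
import numpy as np
from scipy.optimize import minimize

def neg_proxy(psi, Q, y, m, warp, warp_deriv):
    # -L(psi): RSS is computed via the QR projection, so theta never appears
    # explicitly; it is implicitly the RSS-minimizing value for this psi.
    n = y.size
    z = warp(y, psi)
    rss = ((Q.T @ z)[m:]**2).sum()
    return 0.5 * n * np.log(rss) - np.log(warp_deriv(y, psi)).sum()

def fit_warping(X, y, warp, warp_deriv, psi0):
    n, m = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")
    result = minimize(neg_proxy, psi0, args=(Q, y, m, warp, warp_deriv),
                      method="Nelder-Mead")
    return result.x
```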

How to make predictions with a warped linear regression model?

Now that we’ve found warping parameters, we need to make predictions.

Consider how this works in a standard ordinary least squares model without warping. Suppose data is generated from the model

y = x · b + ε

where ε is distributed N(0, σ²). Let X and y denote the training data. The regressors that minimize the RSS of the training data are

b̂ = (Xᵀ·X)⁻¹·Xᵀ·y

If x′ and y′ denote an out-of-sample feature vector and target value

y′ = x′ · b + ε′

then the error of the out-of-sample prediction is

e′ = y′ − x′ · b̂

Because ε and ε′ are normally distributed, it follows that e′ is normally distributed and its variance is⁴

Var(e′) = σ² · (1 + x′ᵀ·(Xᵀ·X)⁻¹·x′)

We rarely know the noise variance σ², but we can obtain an unbiased estimate of it from the training residuals

σ̂² = RSS / (n − p)

where p is the number of regressors.
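These formulas translate directly into code (a sketch assuming X has full column rank):

```python
import numpy as np

def ols_prediction_with_error(X, y, x_new):
    n, p = X.shape
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ b_hat)**2).sum()
    sigma2_hat = rss / (n - p)                 # unbiased noise variance estimate
    XtX_inv = np.linalg.inv(X.T @ X)
    pred = x_new @ b_hat
    pred_var = sigma2_hat * (1.0 + x_new @ XtX_inv @ x_new)
    return pred, np.sqrt(pred_var)             # point prediction and error std
```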

Suppose now the ordinary least squares model is fitted to the warped target values

z_i = f(y_i; ψ)

Ordinary least squares gives us a point prediction and error distribution for the latent space, but we need to invert the warping to get a prediction for the target space.

Let ẑ′ represent the latent prediction for an out-of-sample feature vector x′. If σ̂² is the estimated latent noise variance, then the probability of a target value y is

p(y) = (2πσ̂²)^(-1/2) · exp(−(f(y; ψ) − ẑ′)² / (2σ̂²)) · f′(y; ψ)

and the expected target value is

E[y] = ∫ y · (2πσ̂²)^(-1/2) · exp(−(f(y; ψ) − ẑ′)² / (2σ̂²)) · f′(y; ψ) dy

After making the substitution u = f(y), the expected value can be rewritten as

E[y] = ∫ f⁻¹(u) · (2πσ̂²)^(-1/2) · exp(−(u − ẑ′)² / (2σ̂²)) du

The inverse of f can be computed using Newton’s method to find the root of f(y) − u, and the integral can be efficiently evaluated with a Gauss–Hermite quadrature.
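A sketch of both steps (my own code; `warp` and `warp_deriv` are again assumed helpers for f(·; ψ) and f′(·; ψ)):

```python
import numpy as np

def warp_inverse(u, psi, warp, warp_deriv, y0=0.0, tol=1e-10, max_iter=100):
    # Newton's method for the root of f(y; psi) - u.
    y = y0
    for _ in range(max_iter):
        step = (warp(y, psi) - u) / warp_deriv(y, psi)
        y -= step
        if abs(step) < tol:
            break
    return y

def expected_target(z_pred, sigma, psi, warp, warp_deriv, num_points=32):
    # E[y] = (1/sqrt(pi)) * sum_k w_k * f_inv(z_pred + sqrt(2)*sigma*s_k),
    # using Gauss-Hermite nodes s_k and weights w_k.
    s, w = np.polynomial.hermite.hermgauss(num_points)
    u = z_pred + np.sqrt(2.0) * sigma * s
    y_vals = np.array([warp_inverse(ui, psi, warp, warp_deriv) for ui in u])
    return (w * y_vals).sum() / np.sqrt(np.pi)
```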

What are some effective functions for warping?

Let’s turn our attention to the warping function f(y; ψ) and how to parameterize it. We’d like for the parameterization to allow for a wide range of different functions, but we also need to ensure that it only permits monotonically increasing surjective warping functions.

Observe that the warping function is invariant under rescaling: c·f(y; ψ) leads to the same results as f(y; ψ). Set θ′ so that g(x; θ′) = c·g(x; θ). Then the log-likelihood proxy L(θ′, ψ) for c·f(y; ψ) is

−(n/2)·log(c²·RSS) + Σ_{i=1}^{n} log(c·f′(y_i; ψ)) = −(n/2)·log RSS + Σ_{i=1}^{n} log f′(y_i; ψ) = L(θ, ψ)

What’s important is how the warping function changes the relative spacing between target values.

One effective family of functions for warping is

f(t; ψ) = t + Σ_{j=1}^{k} a_j · tanh(b_j · (t + c_j)),   with a_j, b_j ≥ 0 and ψ = (a_1, b_1, c_1, …, a_k, b_k, c_k)

Each tanh step allows for a localized change to the warping function’s slope. The t term ensures that the warping function is monotonic and surjective and that it reverts to the identity when t is far from any step. And because of the invariance to scaling, it’s unnecessary to add a scaling coefficient to t.

We’ll make one additional adjustment so that the warping function zeros the mean. Put

f̄(t; ψ) = f(t; ψ) − (1/n)·Σ_{i=1}^{n} f(y_i; ψ)
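Here is one way this family might be implemented (a sketch based on the description above; the flat parameter layout and the exponentiation that keeps the a’s and b’s positive are my own choices, not necessarily the article’s):

```python
import numpy as np

def warp(t, psi):
    # f(t; psi) = t + sum_j a_j * tanh(b_j * (t + c_j)), with a_j, b_j > 0.
    # Assumed layout: psi = [log a_1, log b_1, c_1, log a_2, log b_2, c_2, ...].
    t = np.asarray(t, dtype=float)
    a, b, c = np.exp(psi[0::3]), np.exp(psi[1::3]), psi[2::3]
    return t + (a * np.tanh(b * (t[..., None] + c))).sum(axis=-1)

def warp_deriv(t, psi):
    # f'(t; psi) = 1 + sum_j a_j * b_j * sech^2(b_j * (t + c_j)) > 0
    t = np.asarray(t, dtype=float)
    a, b, c = np.exp(psi[0::3]), np.exp(psi[1::3]), psi[2::3]
    return 1.0 + (a * b / np.cosh(b * (t[..., None] + c))**2).sum(axis=-1)

def warp_centered(t, psi, y_train):
    # Zero-mean adjustment: subtract the mean of the warped training targets
    # (one reading of the adjustment described above).
    return warp(t, psi) - warp(y_train, psi).mean()
```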

An Example Problem

The Communities and Crime Dataset⁵ provides crime statistics for different localities across the United States. As a regression problem, the task is to predict the violent crime rate from different socio-economic indicators. We’ll fit a warped linear regression model to the dataset and compare how it performs to an ordinary least squares model.

Let’s look at the warping function fit to maximize the log-likelihood on the training data.

[Figure: the warping function fit to the communities and crime training data]

Let σ denote the estimated noise standard deviation in the latent space. To visualize how this function changes an error distribution, we’ll plot the range

[f⁻¹(f(y) − σ), f⁻¹(f(y) + σ)]

across the target values y

[Figure: the ±σ error range in the target space across target values]

Warping makes a prediction’s error range smaller at lower target values⁶.

To see if warping leads to better results, let’s compare the performance of a warped linear regression model (WLR) to an ordinary least squares model (OLS) on a ten-fold cross-validation of the communities dataset. We use mean log-likelihood (MLL) as the performance measure: MLL averages the log-likelihood of each out-of-sample prediction in the cross-validation⁷.
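A sketch of how such a comparison could be set up (scikit-learn’s KFold is assumed for the splits; `fit` and `predict_log_density` stand for the model-fitting and predictive-density routines sketched earlier and are hypothetical names):

```python
import numpy as np
from sklearn.model_selection import KFold

def mean_log_likelihood(X, y, fit, predict_log_density, n_splits=10):
    # Average out-of-sample log-likelihood over a k-fold cross-validation.
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        model = fit(X[train_idx], y[train_idx])
        scores.append(predict_log_density(model, X[test_idx], y[test_idx]).mean())
    return float(np.mean(scores))
```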

[Table: mean out-of-sample log-likelihood (MLL) for the OLS and WLR models on the ten-fold cross-validation]

The results show warped linear regression performing substantially better. Drilling down on a few randomly chosen predictions and their error distributions helps explain why.

[Figures: a few randomly chosen out-of-sample predictions with their OLS and WLR error distributions]

The target values are naturally bounded below at zero, and warping reshapes the probability density function to taper off so that more of the probability mass falls on valid target values.

Summary

It can be tempting to use a normal distribution to model errors. It makes the math easier, and the central limit theorem tells us normality arises naturally when errors break down into sums of independent, identically distributed random variables.

But many regression problems don’t fit into such a framework and error distributions can be far from normal.

When faced with non-normality in the error distribution, one option is to transform the target space. With the right function f, it may be possible to achieve normality when we replace the original target values y with f(y). Specifics of the problem can sometimes lead to a natural choice for f. At other times, we might approach the problem with a toolbox of fixed transformations and hope that one unlocks normality. But that can be an ad hoc process.

Warping turns the transformation step into a maximum likelihood problem. Instead of applying fixed transformations, warping uses parameterized functions that can approximate arbitrary transformations and fits the functions to the problem with the help of an optimizer.

Through the transformation function, warping can capture aspects of non-normality in error distributions such as skew and fat tails. For many problems, it leads to better performance on out-of-sample predictions and avoids the ad hoc guesswork of working with fixed transformations.

