
Outlier Detection with Hampel Filter


How to implement the Hampel filter in Python from scratch

Recently I stumbled upon a new (to me) outlier detection algorithm: the Hampel filter. In this short article, I would like to describe how it works and how to use it in practice. To the best of my knowledge, there is no Python library containing this algorithm, so we will implement it from scratch using two different approaches (for-loop and pandas). In the end, we will see which one performs better in terms of execution speed.

Hampel Filter

Let’s start with a quick theoretical introduction. There are not many online resources describing the algorithm (it does not even have a Wikipedia page), but it is simple enough to quickly understand the logic. I also put some references at the end of the article.

The goal of the Hampel filter is to identify and replace outliers in a given series. It uses a sliding window of configurable width to go over the data. For each window (the given observation plus the 2 * window_size surrounding elements, window_size on each side), we calculate the median and the standard deviation estimated via the median absolute deviation (MAD).

For the MAD to be a consistent estimator of the standard deviation, we must multiply it by a constant scale factor k. The factor depends on the distribution; for the Gaussian distribution it is approximately 1.4826.

If the considered observation differs from the window median by more than a given number of these estimated standard deviations, we treat it as an outlier and replace it with the median.
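
In symbols, with w denoting window_size, k ≈ 1.4826 the scale factor mentioned above, and t the chosen number of standard deviations, the rule described above can be written as:

m_i = median(x_{i-w}, ..., x_{i+w})
S_i = k * median(|x_{i-w} - m_i|, ..., |x_{i+w} - m_i|)
|x_i - m_i| > t * S_i  =>  x_i is flagged as an outlier and replaced by m_i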

The Hampel filter has two configurable parameters:

  • the size of the sliding window
  • the number of standard deviations used to identify an outlier (the threshold)

We select these two parameters depending on the use case. A higher standard deviation threshold makes the filter more forgiving; a lower one flags more points as outliers. Setting the threshold to 0 corresponds to John Tukey’s median filter.

As the filter uses a sliding window, it makes the most sense to use it with time-series data, where the order of the data is governed by time.

Python Implementation

Import libraries

The first step is importing the required libraries.

import matplotlib.pyplot as plt
import warnings
import pandas as pd
import numpy as np

Random walk with outliers

Before implementing the algorithm, we create an artificial dataset using a random walk. We define a function that accepts the percentage of outliers as a parameter and randomly scales some of the random walk increments by a constant multiplier.
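
The exact data-generating function is available in the GitHub repository linked at the end of the article. A minimal sketch of what it could look like, matching the call signature used below, is shown here; the argument names, the default multiplier and the way the spike is applied are assumptions on my part rather than the original code.

def random_walk_with_outliers(origin, n_steps, perc_outliers=0.0, outlier_mult=10, seed=42):
    # simulate a +/-1 random walk starting at `origin`
    np.random.seed(seed)
    steps = np.random.choice([-1, 1], size=n_steps)
    rw = origin + np.cumsum(steps).astype(float)
    # pick which observations become outliers
    n_outliers = int(np.round(perc_outliers * n_steps))
    outlier_ind = np.sort(np.random.choice(n_steps, size=n_outliers, replace=False))
    # scale the increment at those positions so that only that single observation spikes
    rw[outlier_ind] += steps[outlier_ind] * (outlier_mult - 1)
    return rw, outlier_ind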

Using the function we generate artificial data and plot it below, together with the introduced outliers.

rw, outlier_ind = random_walk_with_outliers(0, 1000, 0.01)
plt.plot(np.arange(len(rw)), rw)
plt.scatter(outlier_ind, rw[outlier_ind], c='r', label='outlier')
plt.title('Random Walk with outliers')
plt.xlabel('Time steps')
plt.ylabel('Values')
plt.legend();

Random walk with the introduced outliers

Evaluating the results

We also need to define a function for evaluating the results of our outlier detection algorithms. For that, we will plot the actual vs. detected outliers and return a short summary of the performance.
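
The exact evaluation helper is also in the repository; a possible sketch, consistent with the tp, fp, fn variables used later in the article, could look like this (the plot labels and the printed summary are my own choices):

def evaluate_detection(series, true_indices, detected_indices):
    # split the detections into true positives, false positives and false negatives
    tp = np.array(sorted(set(true_indices) & set(detected_indices)), dtype=int)
    fp = np.array(sorted(set(detected_indices) - set(true_indices)), dtype=int)
    fn = np.array(sorted(set(true_indices) - set(detected_indices)), dtype=int)

    # plot the series together with the actual vs. detected outliers
    plt.plot(np.arange(len(series)), series)
    plt.scatter(tp, series[tp], c='g', label='true positive')
    plt.scatter(fp, series[fp], c='r', label='false positive')
    plt.scatter(fn, series[fn], c='k', label='false negative')
    plt.title('Evaluation of the outlier detection')
    plt.legend()

    print(f'Detected {len(tp)} out of {len(true_indices)} outliers '
          f'({len(fp)} false positives, {len(fn)} false negatives).')
    return tp, fp, fn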

Hampel filter implementation

1. for-loop implementation

We start with the for-loop implementation of the Hampel filter. We base the code on the one from the pracma R package.

We run the algorithm on the RW series. We chose a window size of 10, but this should be determined empirically for the best performance.

res, detected_outliers = hampel_filter_forloop(rw, 10)

We evaluate the results using the previously defined helper function.

tp, fp, fn = evaluate_detection(rw, outlier_ind, detected_outliers)


Evaluation of the for-loop implementation

In the plot, we can see that the algorithm correctly identified 8 (out of 10) outliers, mistook one observation for an outlier (red dot) and missed 2 outliers (black dots). Perhaps better performance can be achieved by tuning the window size. However, this is sufficient for the purpose of the exercise.

We additionally plot the transformed series, in which the outliers were replaced with the window median.

plt.plot(np.arange(len(res)), res);
plt.scatter(outlier_ind, rw[outlier_ind], c='g', label='true outlier')
plt.scatter(fp, rw[fp], c='r', label='false positive')
plt.title('Cleaned series (without detected outliers)')
plt.legend();


Transformed series (removed outliers)

2. pandas implementation

For the pandas implementation we make use of the rolling method of a pd.Series and a lambda function.

In the rolling method we specify the full window length (2 * window_size + 1) and use centering, so the considered observation sits in the middle of the window. Before running the algorithm, we transform the RW from np.ndarray to pd.Series.
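
As before, the exact code is in the repository; a sketch of the pandas variant might look like this (the use of raw=True in apply and the boolean mask are my own choices):

def hampel_filter_pandas(input_series, window_size, n_sigmas=3):
    k = 1.4826  # scale factor for a Gaussian distribution
    new_series = input_series.copy()

    # rolling median and rolling MAD over a centered window of 2 * window_size + 1
    rolling_median = input_series.rolling(window=2 * window_size + 1, center=True).median()
    rolling_mad = k * input_series.rolling(window=2 * window_size + 1, center=True) \
                                  .apply(lambda x: np.median(np.abs(x - np.median(x))), raw=True)

    # flag observations deviating from the rolling median by more than n_sigmas * MAD-std
    outlier_mask = np.abs(input_series - rolling_median) > n_sigmas * rolling_mad
    indices = list(input_series.index[outlier_mask])
    new_series[indices] = rolling_median[indices]

    return new_series, indices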

rw_series = pd.Series(rw)
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
tp, fp, fn = evaluate_detection(rw, outlier_ind, detected_outliers)


Evaluation of the pandas implementation

The results of both approaches are identical, which is always a good sign :)

Performance comparison

At this point, we test the two implementations against each other in terms of execution speed. We expect that the pandas one will perform faster.

First, we test the for-loop implementation:

%%timeit
res, detected_outliers = hampel_filter_forloop(rw, 10)
# 67.9 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Then, we run the analogical test for the pandas implementation:

%%timeit
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
# 76.1 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We see that the pandas implementation turned out to be slower. One hypothesis to test is that the pandas implementation will be faster for a larger series. That is why we also increase the length of the random walk series to 100000 and test the performance once again.

rw, outlier_ind = random_walk_with_outliers(0, 10 ** 5, 0.01)
rw_series = pd.Series(rw)

Having prepared the data, we move to test the performance:

%%timeit
res, detected_outliers = hampel_filter_forloop(rw, 10)
# 6.75 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
# 6.76 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It turns out that the two implementations now achieve comparable execution times, with the pandas one being more stable (lower standard deviation).

Bonus: Increasing the speed of the for-loops with numba

As a bonus, we explore the possibility of speeding up the for-loop-based code with numba. numba is a library that, under the hood, translates Python code into optimized machine code. Provided the code can be translated, numba can make certain algorithms written in Python approach the speed of C.

The best part about numba is that (if possible) the speed boost comes at a very small price in terms of coding. We need to import the library and add the @jit decorator before the function we want to translate to machine code.

The nopython argument indicates whether we want numba to compile purely to machine code or to fall back to Python code when necessary. Ideally, this should always be set to True, as long as numba does not return any errors.
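
The decorated function could then look as follows; this sketch simply mirrors the for-loop implementation above, and whether every operation compiles in nopython mode may depend on your numba version:

from numba import jit

@jit(nopython=True)
def hampel_filter_forloop_numba(input_series, window_size, n_sigmas=3):
    n = len(input_series)
    new_series = input_series.copy()
    k = 1.4826  # scale factor for a Gaussian distribution
    indices = []

    # same logic as the plain for-loop version, now compiled to machine code
    for i in range(window_size, n - window_size):
        window = input_series[(i - window_size):(i + window_size + 1)]
        x0 = np.median(window)
        S0 = k * np.median(np.abs(window - x0))
        if np.abs(input_series[i] - x0) > n_sigmas * S0:
            new_series[i] = x0
            indices.append(i)

    return new_series, indices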

Below we test the execution speed.

%%timeit
res, detected_outliers = hampel_filter_forloop_numba(rw, 10)
# 108 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It takes only 108 ms, compared to 6.76 s in the case of the pandas implementation. That is a massive speed-up of roughly 63 times!

Conclusions

Summing up, in this article we explained how the Hampel filter works for outlier detection and how to implement it in Python. We also compared the implementations in terms of execution speed.

It is also good to consider the drawbacks of the Hampel filter:

  • it has problems detecting outliers at the beginning and the end of the series: when the window is incomplete (even on one side), the function does not flag possible outliers
  • it struggles to detect outliers that lie close to each other (within the window range), as nearby outliers distort the window’s median and MAD

As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. You can find the code used for this article on my GitHub .

References

  • Hampel, F. R., “The Influence Curve and Its Role in Robust Estimation,” Journal of the American Statistical Association, 69, 382–393, 1974
  • Liu, Hancong, Sirish Shah, and Wei Jiang, “On-line Outlier Detection and Data Cleaning,” Computers and Chemical Engineering, Vol. 28, March 2004, pp. 1635–1647
  • Suomela, Jukka, “Median Filtering Is Equivalent to Sorting,” 2014
