
Outlier Detection with Hampel Filter


How to implement the Hampel filter in Python from scratch

Recently I stumbled upon a new (to me) outlier detection algorithm: the Hampel filter. In this short article, I would like to describe how it works and how to use it in practice. To the best of my knowledge, there is no Python library containing this algorithm, so we will implement it from scratch using two different approaches (for-loop and pandas). In the end, we will see which one performs better in terms of execution speed.

Hampel Filter

Let’s start with a quick theoretical introduction. There are not many online resources describing the algorithm (it does not even have a Wikipedia page), but it is simple enough to quickly understand the logic. I also put some references at the end of the article.

The goal of the Hampel filter is to identify and replace outliers in a given series. It uses a sliding window of configurable width to go over the data. For each window (the given observation plus the 2 * window_size surrounding elements, window_size on each side), we calculate the median and the standard deviation estimated via the median absolute deviation (MAD).

For the MAD to be a consistent estimator of the standard deviation, we must multiply it by a constant scale factor k. The factor depends on the distribution; for the Gaussian distribution it is approximately 1.4826.

If the considered observation differs from the window median by more than a given number of these estimated standard deviations, we treat it as an outlier and replace it with the median.
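
In symbols, with w denoting window_size, k ≈ 1.4826 the scale factor mentioned above, and t the chosen number of standard deviations, the rule described above can be written as:

m_i = median(x_{i-w}, ..., x_{i+w})
S_i = k * median(|x_{i-w} - m_i|, ..., |x_{i+w} - m_i|)
|x_i - m_i| > t * S_i  =>  x_i is flagged as an outlier and replaced by m_i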

The Hampel filter has two configurable parameters:

  • the size of the sliding window
  • the number of standard deviations used to identify an outlier (the threshold)

We select these two parameters depending on the use case. A higher standard deviation threshold makes the filter more forgiving; a lower one flags more points as outliers. Setting the threshold to 0 corresponds to John Tukey’s median filter.

As the filter uses a sliding window, it makes the most sense to use it with time-series data, where the order of the data is governed by time.

Python Implementation

Import libraries

The first step is importing the required libraries.

import matplotlib.pyplot as plt
import warnings
import pandas as pd
import numpy as np

Random walk with outliers

Before implementing the algorithm, we create an artificial dataset using a random walk. We define a function that accepts the percentage of outliers as a parameter and randomly scales some of the random walk increments by a constant multiplier.
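
The exact data-generating function is available in the GitHub repository linked at the end of the article. A minimal sketch of what it could look like, matching the call signature used below, is shown here; the argument names, the default multiplier and the way the spike is applied are assumptions on my part rather than the original code.

def random_walk_with_outliers(origin, n_steps, perc_outliers=0.0, outlier_mult=10, seed=42):
    # simulate a +/-1 random walk starting at `origin`
    np.random.seed(seed)
    steps = np.random.choice([-1, 1], size=n_steps)
    rw = origin + np.cumsum(steps).astype(float)
    # pick which observations become outliers
    n_outliers = int(np.round(perc_outliers * n_steps))
    outlier_ind = np.sort(np.random.choice(n_steps, size=n_outliers, replace=False))
    # scale the increment at those positions so that only that single observation spikes
    rw[outlier_ind] += steps[outlier_ind] * (outlier_mult - 1)
    return rw, outlier_ind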

Using the function we generate artificial data and plot it below, together with the introduced outliers.

rw, outlier_ind = random_walk_with_outliers(0, 1000, 0.01)
plt.plot(np.arange(len(rw)), rw)
plt.scatter(outlier_ind, rw[outlier_ind], c='r', label='outlier')
plt.title('Random Walk with outliers')
plt.xlabel('Time steps')
plt.ylabel('Values')
plt.legend();

Random walk with the introduced outliers

Evaluating the results

We also need to define a function for evaluating the results of our outlier detection algorithms. For that, we will plot the actual vs. detected outliers and return a short summary of the performance.
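
The exact evaluation helper is also in the repository; a possible sketch, consistent with the tp, fp, fn variables used later in the article, could look like this (the plot labels and the printed summary are my own choices):

def evaluate_detection(series, true_indices, detected_indices):
    # split the detections into true positives, false positives and false negatives
    tp = np.array(sorted(set(true_indices) & set(detected_indices)), dtype=int)
    fp = np.array(sorted(set(detected_indices) - set(true_indices)), dtype=int)
    fn = np.array(sorted(set(true_indices) - set(detected_indices)), dtype=int)

    # plot the series together with the actual vs. detected outliers
    plt.plot(np.arange(len(series)), series)
    plt.scatter(tp, series[tp], c='g', label='true positive')
    plt.scatter(fp, series[fp], c='r', label='false positive')
    plt.scatter(fn, series[fn], c='k', label='false negative')
    plt.title('Evaluation of the outlier detection')
    plt.legend()

    print(f'Detected {len(tp)} out of {len(true_indices)} outliers '
          f'({len(fp)} false positives, {len(fn)} false negatives).')
    return tp, fp, fn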

Hampel filter implementation

1. for-loop implementation

We start with the for-loop implementation of the Hampel filter. We base the code on the one from the pracma R package.

We run the algorithm on the RW series. We chose a window size of 10, but this should be determined empirically for the best performance.

res, detected_outliers = hampel_filter_forloop(rw, 10)

We evaluate the results using the previously defined helper function.

tp, fp, fn = evaluate_detection(rw, outlier_ind, detected_outliers)


Evaluation of the for-loop implementation

In the plot, we can see that the algorithm correctly identified 8 (out of 10) outliers, mistook one observation for an outlier (red dot) and missed 2 outliers (black dots). Perhaps better performance can be achieved by tuning the window size. However, this is sufficient for the purpose of the exercise.

We additionally plot the transformed series, in which the outliers were replaced with the window median.

plt.plot(np.arange(len(res)), res);
plt.scatter(outlier_ind, rw[outlier_ind], c='g', label='true outlier')
plt.scatter(fp, rw[fp], c='r', label='false positive')
plt.title('Cleaned series (without detected outliers)')
plt.legend();


Transformed series (removed outliers)

2. pandas implementation

For the pandas implementation we make use of the rolling method of a pd.Series and a lambda function.

In the rolling method we specify the full window length (2 * window_size + 1) and use centering, so the considered observation sits in the middle of the window. Before running the algorithm, we transform the RW from np.ndarray to pd.Series.
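
As before, the exact code is in the repository; a sketch of the pandas variant might look like this (the use of raw=True in apply and the boolean mask are my own choices):

def hampel_filter_pandas(input_series, window_size, n_sigmas=3):
    k = 1.4826  # scale factor for a Gaussian distribution
    new_series = input_series.copy()

    # rolling median and rolling MAD over a centered window of 2 * window_size + 1
    rolling_median = input_series.rolling(window=2 * window_size + 1, center=True).median()
    rolling_mad = k * input_series.rolling(window=2 * window_size + 1, center=True) \
                                  .apply(lambda x: np.median(np.abs(x - np.median(x))), raw=True)

    # flag observations deviating from the rolling median by more than n_sigmas * MAD-std
    outlier_mask = np.abs(input_series - rolling_median) > n_sigmas * rolling_mad
    indices = list(input_series.index[outlier_mask])
    new_series[indices] = rolling_median[indices]

    return new_series, indices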

rw_series = pd.Series(rw)
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
tp, fp, fn = evaluate_detection(rw, outlier_ind, detected_outliers)


Evaluation of the pandas implementation

The results of both approaches are identical, which is always a good sign :)

Performance comparison

At this point, we test the two implementations against each other in terms of execution speed. We expect that the pandas one will perform faster.

First, we test the for-loop implementation:

%%timeit
res, detected_outliers = hampel_filter_forloop(rw, 10)
# 67.9 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Then, we run the analogical test for the pandas implementation:

%%timeit
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
# 76.1 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We see that the pandas implementation turned out to be slower. One hypothesis to test is that the pandas implementation will be faster for a larger series. That is why we also increase the length of the random walk series to 100000 and test the performance once again.

rw, outlier_ind = random_walk_with_outliers(0, 10 ** 5, 0.01)
rw_series = pd.Series(rw)

Having prepared the data, we move to test the performance:

%%timeit
res, detected_outliers = hampel_filter_forloop(rw, 10)
# 6.75 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
# 6.76 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It turns out that the two implementations now achieve comparable execution times, with the pandas one being more stable (lower standard deviation).

Bonus: Increasing the speed of the for-loops with numba

As a bonus, we explore the possibility of speeding up the for-loop-based code with numba. numba is a library that, under the hood, translates Python code into optimized machine code. Provided the code can be translated, numba can make certain algorithms written in Python approach the speed of C.

The best part about numba is that (if possible) the speed boost comes at a very small price in terms of coding. We need to import the library and add the @jit decorator before the function we want to translate to machine code.

The nopython argument indicates whether we want numba to compile purely to machine code or to fall back to Python code when necessary. Ideally, this should always be set to True, as long as numba does not return any errors.
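
The decorated function could then look as follows; this sketch simply mirrors the for-loop implementation above, and whether every operation compiles in nopython mode may depend on your numba version:

from numba import jit

@jit(nopython=True)
def hampel_filter_forloop_numba(input_series, window_size, n_sigmas=3):
    n = len(input_series)
    new_series = input_series.copy()
    k = 1.4826  # scale factor for a Gaussian distribution
    indices = []

    # same logic as the plain for-loop version, now compiled to machine code
    for i in range(window_size, n - window_size):
        window = input_series[(i - window_size):(i + window_size + 1)]
        x0 = np.median(window)
        S0 = k * np.median(np.abs(window - x0))
        if np.abs(input_series[i] - x0) > n_sigmas * S0:
            new_series[i] = x0
            indices.append(i)

    return new_series, indices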

Below we test the execution speed.

%%timeit
res, detected_outliers = hampel_filter_forloop_numba(rw, 10)
# 108 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It takes only 108 ms, compared to 6.76 s in the case of the pandas implementation. That is a massive speed-up of roughly 63 times!

Conclusions

Summing up, in this article we explained how the Hampel filter works for outlier detection and how to implement it in Python. We also compared the implementations in terms of execution speed.

It is also good to consider the drawbacks of the Hampel filter:

  • it has problems detecting outliers at the beginning and the end of the series: when the window is incomplete (even on one side), the function does not flag possible outliers
  • it struggles to detect outliers that lie close to each other (within the window range), as nearby outliers distort the window’s median and MAD

As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. You can find the code used for this article on my GitHub .

References

  • Hampel, F. R., “The Influence Curve and Its Role in Robust Estimation,” Journal of the American Statistical Association, 69, 382–393, 1974
  • Liu, Hancong, Sirish Shah, and Wei Jiang, “On-line Outlier Detection and Data Cleaning,” Computers and Chemical Engineering, Vol. 28, March 2004, pp. 1635–1647
  • Suomela, Jukka, “Median Filtering Is Equivalent to Sorting,” 2014
