source link: https://pkghosh.wordpress.com/2020/09/27/time-series-change-point-detection-with-two-sample-statistic-on-spark-with-application-for-retail-sales-data/

Time Series Change Point Detection with Two Sample Statistic on Spark with Application for Retail Sales Data

The goal of change point detection is to detect the times when statistically significant and sustained changes occur in a time series. It has a wide range of applications in various domains including retail, medical, IoT, finance, business and meteorology. In this post we will go through a solution, implemented on Spark, based on a non parametric two sample statistic to identify change points. Retail eCommerce sales data will be used as an example to showcase the solution. Abrupt changes in sales can occur for various reasons, e.g. cannibalization by a competing product, or a sudden increase in sales due to a wrongly posted sale price.

The implementation is part of my OSS project beymani on GitHub. As with all my projects, the implementation, driven heavily by configuration, is completely reusable and applicable to any use case from any domain.

Change Point Detection

Change Point Detection (CPD) is essentially an unconstrained optimization problem and can be stated as follows: segment a time series in such a way that the data in each segment is stationary. Data is stationary when its statistical properties are time invariant.

The change point locations are the variables in the optimization problem. There is a cost associated with each segment, and the goal is to segment the time series in such a way that the total cost across all segments is minimized.

It can be framed as a constrained optimization problem by adding a regularizer penalty when too many change points are detected. Two questions need to be answered for any CPD solution. First, how do we search for and find candidate change points? Second, what metric do we use at those points to evaluate them? We will address these questions next.
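As a minimal sketch of this objective (my own illustration, not the beymani implementation), here is a Python example where the per-segment cost is the sum of squared deviations from the segment mean, with an optional penalty term per change point acting as the regularizer:

```python
import statistics

def segment_cost(values):
    """Cost of one segment: sum of squared deviations from the segment mean,
    low when the segment is internally homogeneous (stationary in the mean)."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values)

def total_cost(series, change_points, penalty=0.0):
    """Total cost of a segmentation: per-segment costs plus a regularizer
    penalty per change point that discourages over-segmentation."""
    bounds = [0] + sorted(change_points) + [len(series)]
    return (sum(segment_cost(series[a:b]) for a, b in zip(bounds, bounds[1:]))
            + penalty * len(change_points))

# A series with an obvious level shift at index 4: splitting there
# yields a much lower total cost than keeping one segment.
series = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1]
print(total_cost(series, []))    # one segment, high cost
print(total_cost(series, [4]))   # split at the shift, low cost
```

With a large enough penalty, the split stops paying off, which is exactly how the regularizer limits the number of detected change points.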

Change Point Detection Metric

There are various techniques for CPD based on supervised and unsupervised learning. Our focus is on unsupervised learning techniques. Popular unsupervised techniques are listed below, under 2 categories, parametric and non parametric. There are also good survey articles reviewing different CPD techniques. These metrics allow us to measure the degree of change at the point under consideration. These are the parametric techniques.

  1. Maximum Likelihood : The cost is based on the log likelihood of each segment. The time series is considered as iid samples from piecewise constant parametric distributions
  2. Piecewise Linear Regression : A linear regression model is fitted in each segment. At a change point there is a significant change in the linear regression model. The cost is the sum of the least squares cost for each segment
  3. Mahalanobis Metric : Mahalanobis distance is used in the cost. Relevant for multivariate data with significant covariance

The non parametric techniques are more flexible since they are not limited by the assumption of a parametric distribution. They are as follows

  1. Empirical Cumulative Distribution : Based on the Cumulative Distribution Function (CDF) of the 2 samples on either side of the change point. There are 3 flavors of this technique. In the Kolmogorov Smirnov (KS) test, the maximum distance between the 2 CDFs is used. Cramer Von Mises (CVM) uses a weighted sum of CDF differences with more weight in the central region of the distribution. Anderson Darling (AD) uses a weighted sum of CDF differences with more weight in the tail regions of the distribution
  2. Empirical Cumulative Distribution Likelihood : Based on the empirical cumulative distribution, but uses likelihood. There are 3 flavors of the technique, corresponding to the KS, CVM and AD tests alluded to earlier, with the acronyms ZK, ZC and ZA respectively
  3. Kernel Based : The original signal is transformed to a kernel space. The cost is based on mean shift in the kernel space, using the kernel Fisher Discriminant ratio between the 2 samples
  4. Subspace Model : Based on a state space model. At a change point there is a significant difference between the models
  5. Graph Based : Each observation is a vertex. An edge connects 2 data points, with a weight depending on the distance between them. The weights of edges connecting points across the 2 samples are used
  6. Cluster Based : The basic idea is that data points after a change point will belong to a different cluster compared to data points before the change point
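To make the first category concrete, here is a simplified Python sketch of the KS flavor (my own illustration; CVM and AD differ only in how the CDF differences are weighted and summed):

```python
def ks_statistic(left, right):
    """Kolmogorov Smirnov flavor: maximum absolute distance between the two
    empirical CDFs, evaluated at every point of the pooled sample."""
    def ecdf(sample, x):
        # Fraction of the sample at or below x.
        return sum(1 for v in sample if v <= x) / len(sample)
    pooled = left + right
    return max(abs(ecdf(left, x) - ecdf(right, x)) for x in pooled)

# Well separated samples give a statistic near 1, identical samples give 0.
print(ks_statistic([1.0, 1.2, 0.9, 1.1, 1.0], [2.0, 2.1, 1.9, 2.2, 2.0]))  # → 1.0
print(ks_statistic([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))                      # → 0.0
```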

There is one simple and interesting supervised classifier technique called the virtual classifier. The data points on one side of a potential change point are labelled with 1 and the data points on the other side are labelled with -1. If the test accuracy of a classifier trained on these labels is around 0.5, there is no change point. If it's close to 1.0, the boundary is highly likely to be a change point.
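A small self-contained sketch of the virtual classifier idea, using a one-feature decision stump as the (hypothetical) classifier with a simple even/odd train/test split:

```python
def stump_accuracy(series, split):
    """Virtual classifier check at a candidate change point: label points
    left of the split -1 and right of it +1, fit a one-feature decision
    stump on even-indexed points, and score it on odd-indexed points.
    Accuracy near 1.0 suggests a real change point; near 0.5 suggests none."""
    data = [(v, -1 if i < split else 1) for i, v in enumerate(series)]
    train, test = data[0::2], data[1::2]
    # Fit: try each training value as a threshold, in both orientations.
    best = None
    for t, _ in train:
        for sign in (1, -1):
            acc = sum(1 for v, y in train
                      if (sign if v > t else -sign) == y) / len(train)
            if best is None or acc > best[0]:
                best = (acc, t, sign)
    _, t, sign = best
    return sum(1 for v, y in test if (sign if v > t else -sign) == y) / len(test)

series = [1.0, 1.1, 0.9, 1.0, 1.1, 5.0, 5.2, 4.9, 5.1, 5.0]
print(stump_accuracy(series, 5))  # true change point: accuracy 1.0
print(stump_accuracy(series, 2))  # not a change point: noticeably lower
```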

Change Point Search

So far we have discussed various metrics to evaluate the degree of change at a point under consideration. Given a time series, how do we go about finding candidate change points to evaluate? Once we have a candidate change point, we can use any of the aforementioned metrics to decide whether it's truly a change point. There are 3 techniques for the change point search. They are all greedy in nature.

Sliding Window : A window slides across the time series. The middle of the window is a candidate change point. The half windows on each side are the 2 samples. The window size is an important parameter.
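The sliding window search can be sketched in a few lines of Python (a simplified illustration; the statistic is pluggable, here just the absolute difference of half-window means):

```python
def sliding_window_candidates(series, window, stat, threshold):
    """Slide a window across the series; at each position, compare the two
    half windows with the supplied two sample statistic and flag the window
    center when the statistic exceeds the threshold."""
    half = window // 2
    flagged = []
    for center in range(half, len(series) - half):
        left = series[center - half:center]
        right = series[center:center + half]
        if stat(left, right) > threshold:
            flagged.append(center)
    return flagged

# Hypothetical statistic: absolute difference of half-window means.
mean_shift = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))

series = [1.0] * 10 + [5.0] * 10
print(sliding_window_candidates(series, 6, mean_shift, threshold=2.0))  # → [9, 10, 11]
```

Note that several adjacent positions get flagged around the true shift, which is why the post later talks about isolating a single change point within a band.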

Binary Search : The best change point as per the chosen metric is found. The time series is partitioned at the change point. The process is repeated for each partition.

Bottom Up Search : The time series is split into many small segments. Adjacent segments that are statistically similar are merged. The process is repeated.

Two Sample Statistic

My implementation supports 6 algorithms from the categories Empirical Cumulative Distribution and Empirical Cumulative Distribution Likelihood. All of them come under the broad category of 2 sample statistics, and they return a statistic value based on the 2 samples on either side of a potential change point.

The final determination is made by a statistical hypothesis test, with the null hypothesis being that the two samples belong to the same distribution and the alternative hypothesis being that they belong to different distributions. So when the null hypothesis is rejected, we have a change point.

Whatever two sample statistic we use, we need its distribution, so that we can find the critical value based on which we decide whether or not to reject the null hypothesis. Distributions of these two sample statistics are not readily available, so I implemented a python script to generate the distributions using Monte Carlo Simulation. It can be used for any of the 6 two sample statistics.

At each iteration of the simulator a random non parametric distribution is generated. A second non parametric distribution is generated by mutating the first distribution by some random amount. Two samples are generated by sampling each distribution, and the statistic is calculated from the 2 samples.
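The simulation loop can be sketched as follows. This is a simplified stand-in for the actual script: it draws both samples from the same randomly chosen parent distribution (a normal with random parameters instead of a random non parametric distribution), which approximates the no-change case whose upper tail gives the critical values:

```python
import random

def simulate_statistic_distribution(stat, n_iter=1000, sample_size=50, seed=42):
    """Monte Carlo sketch: at each iteration draw two samples from the same
    randomly chosen parent distribution, record the two sample statistic,
    then read off upper tail quantiles as critical values."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_iter):
        # Random parent distribution; a normal with random mean and spread
        # stands in for the post's randomly generated non parametric one.
        mu, sigma = rng.uniform(-5, 5), rng.uniform(0.5, 3.0)
        a = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        b = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        values.append(stat(a, b))
    values.sort()
    return {q: values[int(q * (n_iter - 1))] for q in (0.90, 0.95, 0.99)}

# Hypothetical statistic: absolute difference of sample means.
mean_shift = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))
quantiles = simulate_statistic_distribution(mean_shift)
print(quantiles)  # upper tail critical values at 90/95/99% confidence
```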

Change Point Detection Spark Job

The retail sales data consists of product ID, time stamp and quantity sold in the previous hour. One of the products has a sharp decline in sales at some point in time due to cannibalization by a competing product. Here is some sample data

GHT56FGT8K,1589853600,130
DK75HUI45X,1589857200,49
GHT56FGT8K,1589857200,121
DK75HUI45X,1589860800,59
GHT56FGT8K,1589860800,116
DK75HUI45X,1589864400,46
GHT56FGT8K,1589864400,89

The Scala Spark implementation uses a sliding window and at each window position calculates the chosen two sample statistic value at the center of the window. If the value is beyond a threshold based on the critical value at the chosen confidence level, the corresponding time stamp is marked as a change point. There is a choice of 6 different two sample statistics, as set in the configuration. I have used the Cramer Von Mises (CVM) statistic. The window size is another important parameter set in the configuration.
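Logically the job does something like the following plain Python sketch (the real implementation is Scala on Spark with the CVM statistic; here the grouping is done with itertools and a hypothetical mean-difference statistic stands in for CVM, with synthetic data shaped like the sample above):

```python
from itertools import groupby

def detect_per_product(records, window, stat, threshold):
    """Sketch of the job's logic: group (product ID, time stamp, quantity)
    records by product, sort each group by time stamp, then run a sliding
    window two sample test over the quantity series."""
    half = window // 2
    out = {}
    for pid, group in groupby(sorted(records), key=lambda r: r[0]):
        rows = list(group)                      # time ordered by the sort
        qty = [q for _, _, q in rows]
        out[pid] = [rows[c][1]                  # time stamp at window center
                    for c in range(half, len(qty) - half)
                    if stat(qty[c - half:c], qty[c:c + half]) > threshold]
    return out

# Synthetic data: one product with a level shift, one flat.
records = ([("GHT56FGT8K", t, 1.0) for t in range(10)]
           + [("GHT56FGT8K", t, 5.0) for t in range(10, 20)]
           + [("DK75HUI45X", t, 2.0) for t in range(20)])
mean_shift = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))
print(detect_per_product(records, 6, mean_shift, threshold=2.0))
```

On Spark the grouping step would be distributed across products, which is what makes the approach scale to many time series.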

Here is the upper tail distribution for the CVM statistic based on the Monte Carlo simulation alluded to earlier. I have selected the critical value of 38.863, based on an alpha of .01, i.e. a 99% confidence level.

0.850           35.818
0.860           35.920
0.870           36.008
0.880           36.081
0.890           36.224
0.900           36.359
0.910           36.581
0.920           36.684
0.930           36.828
0.940           36.969
0.950           37.145
0.960           37.352
0.970           37.722
0.980           38.020
0.990           38.863
1.000           40.651

Here is a plot showing the sales data for one of the products with a significant change point. The change point area is shown with a red band. Because the sales drop is gradual, we have multiple change points around where the sales drop. To isolate an instant of time as the change point, you could either take the center of the band or use the time stamp where the 2 sample statistic has the highest value within the band.
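The second option, picking the peak of the statistic within the band, is a one-liner (hypothetical statistic values for illustration):

```python
def isolate_change_point(band, stat_values):
    """Pick the single index within a flagged band where the two sample
    statistic peaks."""
    return max(band, key=lambda i: stat_values[i])

# Hypothetical statistic values within a flagged band of time stamps.
print(isolate_change_point([9, 10, 11], {9: 2.7, 10: 4.0, 11: 2.7}))  # → 10
```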

[Figure: Product sale change point]

You could argue that change points could be detected manually through visualization. But that will never scale. You may have hundreds of IoT sensor streams that you want to process in near real time.

Wrapping Up

Although we used time series in this post, CPD can be used for any kind of sequence data, time series or not.

Change Point Detection (CPD) is a kind of anomaly detection, except that the anomalies are sustained for a long time. If you were to use anomaly detection, all the data points to the right of the change point would be flagged as anomalous. Moreover, there could be multiple change points in a time series. So it won't be practical to use anomaly detection for CPD. Here is a tutorial document with details on how to run the use case in this post.

