
Grafana Prometheus: Detecting anomalies in time series

source link: https://blog.davidvassallo.me/2021/10/01/grafana-prometheus-detecting-anomalies-in-time-series/

In our previous post, we explored how to detect anomalies in time series with the 3-sigma rule using InfluxDB. In this article we’ll do the same using Grafana and Prometheus.

As a quick recap, the 3-sigma rule states that approximately 99.7% of our “normal” data should fall within 3 standard deviations of the mean. This article explores how we can measure how many standard deviations a datapoint lies from the mean, and alert whenever this goes above 3.0 (in other words, whenever our Z-Score goes above 3.0) using Grafana + Prometheus.

The Z-Score Formula
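
The Z-score of a datapoint x, given a mean μ and a standard deviation σ, is:

z = (x − μ) / σ

For example, with μ = 100 and σ = 10, a datapoint of x = 145 has a Z-score of (145 − 100) / 10 = 4.5 – well past a threshold of 3.0.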

So we need to query Prometheus (via Grafana) for three things:

  • A datapoint representing a recent range interval (x)
  • The mean of our data over a longer period of time (μ)
  • The standard deviation over the same (longer) period of time (σ)
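
In PromQL terms, these three ingredients combine into the following general shape (a sketch only – <metric>, <short> and <long> are placeholders that we’ll fill in below):

( avg_over_time(<metric>[<short>]) - avg_over_time(<metric>[<long>]) ) / stddev_over_time(<metric>[<long>])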

For the sake of discussion, let’s focus on the “node_disk_writes_completed_total” metric. The mean and standard deviation are very easily extracted using the built-in functions in Prometheus:

avg_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[1d])
stddev_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[1d])

The above two expressions give us the average and standard deviation calculated over a day – hence the [1d] range selector.
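
As a side note, evaluating these [1d] aggregations on every dashboard refresh can get expensive on busy servers. One option – not part of the original setup, so treat it as a sketch with illustrative rule names – is to precompute the baselines with Prometheus recording rules:

groups:
  - name: disk_write_baselines
    rules:
      # Daily mean of completed disk writes
      - record: instance_device:node_disk_writes_completed_total:avg_1d
        expr: avg_over_time(node_disk_writes_completed_total[1d])
      # Daily standard deviation over the same window
      - record: instance_device:node_disk_writes_completed_total:stddev_1d
        expr: stddev_over_time(node_disk_writes_completed_total[1d])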


The last piece of the puzzle is to grab a datapoint… the “x” in our formula above. I tackled this by taking another average over time, but over a range interval tied to the dashboard’s interval variable rather than fixed to 1 day as above:

avg_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[$__rate_interval])

Note the use of [$__rate_interval] above: https://grafana.com/docs/grafana/latest/datasources/prometheus/#using-__rate_interval

Putting it all together

So implementing our formula above, we get the following query:

(
  avg_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[$__rate_interval])
  -
  avg_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[1d])
)
/
stddev_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[1d])
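
As an aside, the same expression can drive a Prometheus alerting rule rather than just a dashboard panel. A minimal sketch follows – note that the dashboard variables ($node, $job, $diskdevices, $__rate_interval) only exist inside Grafana, so the rule below assumes a fixed 5m short window, and the alert name and annotation text are illustrative:

groups:
  - name: disk_write_anomalies
    rules:
      - alert: DiskWriteRateAnomalous
        # Z-score of the last 5m of disk writes against the daily baseline
        expr: |
          (
            avg_over_time(node_disk_writes_completed_total[5m])
            - avg_over_time(node_disk_writes_completed_total[1d])
          )
          / stddev_over_time(node_disk_writes_completed_total[1d])
          > 3
        for: 10m
        annotations:
          summary: "Disk writes on {{ $labels.instance }} are more than 3 standard deviations above the daily mean"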

Results

The above queries were written against an input source whose graph was as follows:

Note the large spike to the right: we would want to be alerted to such a large spike, but not necessarily to the smaller spikes on the left. This is what our anomaly check gives us:

If a threshold of >= 3 is set, the smaller spikes would not trigger an alert, but the large spike would, since its Z-score comfortably exceeds the threshold.

Another example would be the below time series:

The above time series has many more spikes (in signal-speak, the series is a lot “noisier”), so we wouldn’t want a static threshold that fires on every spike, but only on spikes that are out of the norm. Running our anomaly query we get:

Note how the results approach a Z-score of 2.5 but never exceed our threshold of 3, automatically accounting for the fact that the signal is noisier. Another interesting point is that the anomaly query automatically handles two different scenarios, as shown below:
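
One caveat worth adding (my note, not from the original screenshots): the query as written yields a signed Z-score, so only values far above the mean cross a > 3 threshold. If drops far below the mean should also count as anomalies, the whole expression can be wrapped in PromQL’s abs() before comparing against the threshold:

abs(
  (
    avg_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[$__rate_interval])
    - avg_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[1d])
  )
  / stddev_over_time(node_disk_writes_completed_total{instance="$node",job="$job",device=~"$diskdevices"}[1d])
) > 3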
