Staying in Control with Moving Averages - Part 2

Last week, we introduced how to build a simple Control Chart using the new moving_avg pipeline aggregations. The demonstration used a very simple dataset, where the trend was very flat, and the spike was very obvious. In fact, it was so obvious you could probably catch it with a simple threshold.

This week, we'll show how the same control chart can be used in trickier scenarios, such as constantly increasing linear trends, or cyclic/seasonal data.

Linear Trends

The example from last week was very simple, and a threshold set by eye would have been sufficient. For example, you could easily determine the ideal mean, calculate three standard deviations yourself, and alert when it goes above that point. This works well for flat trends, but what if your data happens to have a constant linear trend?
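To make the hand-set threshold concrete, here is a minimal Python sketch, assuming some made-up hourly readings (the data and the function names `static_threshold` and `breaches` are illustrative, not part of last week's article). It calibrates a fixed "mean plus three standard deviations" limit on a quiet stretch of data, then applies it to a flat series and to the same series with a steady upward drift.

```python
from statistics import mean, pstdev

def static_threshold(baseline, sigmas=3.0):
    """Fixed upper limit: baseline mean plus `sigmas` standard deviations."""
    return mean(baseline) + sigmas * pstdev(baseline)

def breaches(values, limit):
    """Indices of the values that exceed a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]

# Hypothetical hourly readings: a flat series with one spike at index 30.
flat = [10.0 + (i % 3) * 0.5 for i in range(48)]
flat[30] = 25.0

# The same readings with a steady upward drift of 0.5 per hour.
trending = [v + 0.5 * i for i, v in enumerate(flat)]

# Calibrate the limit on the first 24 "known good" hours of the flat data.
limit = static_threshold(flat[:24])

flat_alerts = breaches(flat, limit)       # only the spike
trend_alerts = breaches(trending, limit)  # nearly every hour once the drift kicks in
print(flat_alerts, len(trend_alerts))
```

On the flat series the fixed limit isolates the spike. On the trending series the same limit fires for almost every hour, which is the failure mode the rest of this section addresses.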

Just as a refresher, here is the aggregation we built last week:

{
   "size": 0,
   "aggs": {
      "date_histo": {
         "date_histogram": {
            "field": "tick",
            "interval": "hour"
         },
         "aggs": {
            "stats": {
               "extended_stats": {
                  "field": "value"
               }
            },
            "movavg_mean": {
              "moving_avg": {
                "buckets_path": "stats.avg",
                "window": 24,
                "model": "ewma",
                "settings": {
                    "alpha": 0.1
                }
              }
            },
            "movavg_std": {
              "moving_avg": {
                "buckets_path": "stats.std_deviation",
                "window": 24,
                "model": "ewma"
              }
            },
            "shewhart_ucl": {
              "bucket_script": {
                "buckets_path": {
                  "mean": "movavg_mean.value",
                  "std": "movavg_std.value"
                },
                "script": "mean + (3 * std)"
              }
            }
         }
      }
   }
}

Let's re-use that same aggregation on some data with a constant linear trend, which includes the same spike on Thursday. Without changing anything, we'll see:

(Chart legend: smoothed average in purple, max value in yellow, upper control limit in green.)

As you can see, a simple threshold would no longer work; it would be triggered by the natural growth of the values. There are several ways you could work around it (plot a linear threshold trigger, diff against yesterday, etc.). But the control chart takes this scenario in stride without modification.

Because the threshold is generated dynamically based on the "local" data in the moving averages, the constant linear trend is no problem and everything just works.
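The idea of a "local" threshold can be sketched in a few lines of Python. This is an illustration of the technique, not a replica of how Elasticsearch evaluates `moving_avg`: the hourly buckets are made up, the limit for each bucket is built only from the buckets before it, and the `alpha_std` value is an arbitrary choice for the sketch.

```python
from statistics import mean, pstdev

def ewma(values, alpha):
    """Exponentially weighted average of `values`, oldest first."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def control_chart(buckets, window=24, alpha_mean=0.1, alpha_std=0.3, sigmas=3.0):
    """Indices of buckets whose max exceeds the rolling upper control limit.

    `buckets` is a list of per-hour sample lists. The limit for bucket i uses
    only the `window` buckets before it (an assumption of this sketch).
    """
    avgs = [mean(b) for b in buckets]
    stds = [pstdev(b) for b in buckets]
    flagged = []
    for i in range(1, len(buckets)):
        lo = max(0, i - window)
        ucl = ewma(avgs[lo:i], alpha_mean) + sigmas * ewma(stds[lo:i], alpha_std)
        if max(buckets[i]) > ucl:
            flagged.append(i)
    return flagged

def make_bucket(center):
    """Three hypothetical samples spread around an hourly center value."""
    return [center - 1.0, center, center + 1.0]

# 120 hours of data drifting upward by 0.1 per hour, with one spike at hour 80.
buckets = [make_bucket(10.0 + 0.1 * i) for i in range(120)]
buckets[80][2] += 8.0

flagged = control_chart(buckets)
print(flagged)
```

Because the mean and standard deviation are re-estimated from recent buckets, the limit drifts upward with the data and only the spike is flagged. Note that the drift here is small relative to the noise; a very steep trend would outrun an `ewma` with a low `alpha`.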

Cyclic Trends

Cyclic trends are even more fun. Imagine your data has some seasonality. In this case, I just plotted a random sine wave, but you'll see this cyclic behavior everywhere in real data: sales numbers, server utilization, queue lengths, etc. Cycles can be very tricky for simpler spike detection algorithms. The algorithm needs to differentiate between the natural peaks and real spikes.

If we apply the exact same aggregation as before, we get a decent chart:

(Chart legend: smoothed average in purple, max value in yellow, upper control limit in green.)

You'll notice a problem though. The maximum values in yellow consistently "trip" the threshold (in green) on the leading edge. It looks like the green threshold lags behind the data, and never quite anticipates the upcoming cycle.

The problem is the moving average model. Simpler models like linear and ewma always display a certain amount of lag, and in particular struggle with cyclic data. The lag was present in all the previous examples (go look), it just usually isn’t a problem with non-seasonal data.
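The lag is easy to reproduce outside Elasticsearch. The sketch below, assuming a clean synthetic sine wave with a 69-hour period and an `alpha` of 0.1, smooths the wave with a plain exponentially weighted average and compares where the raw and smoothed peaks fall in the last cycle. The helper name `ewma_series` is made up for this example.

```python
import math

def ewma_series(values, alpha):
    """Running exponentially weighted average, one output per input value."""
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

period = 69
wave = [math.sin(2 * math.pi * t / period) for t in range(3 * period)]
smooth = ewma_series(wave, alpha=0.1)

# Look only at the last full cycle, after the start-up transient has faded.
last_cycle = range(2 * period, 3 * period)
raw_peak = max(last_cycle, key=lambda t: wave[t])
smooth_peak = max(last_cycle, key=lambda t: smooth[t])

lag = smooth_peak - raw_peak           # hours by which the smoothed peak trails
shrink = max(smooth[t] for t in last_cycle)  # smoothed amplitude (raw is ~1.0)
print(lag, round(shrink, 2))
```

The smoothed curve peaks several hours after the data and with a reduced amplitude. On the rising edge of each cycle that leaves the threshold below where it should be, which is why the maximum values keep tripping it.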

Instead, we should use holt_winters, a moving average model that includes terms that can account for seasonality. Let's replace the two previous moving averages with this:

"movavg_mean": {
  "moving_avg": {
    "buckets_path": "stats.avg",
    "window": 200,
    "model": "holt_winters",
    "settings": {
        "period": 69
    }
  }
},
"movavg_std": {
  "moving_avg": {
    "buckets_path": "stats.std_deviation",
    "window": 150,
    "model": "holt_winters",
    "settings": {
        "period": 69
    }
  }
},

You'll notice a few changes. Obviously, we swapped ewma for holt_winters. Next, we changed the window size. Holt-Winters requires a larger window so that it can more accurately model seasonal behavior. Finally, we specified how large the "period" of the data is. In this case, it is roughly 69 hours from peak to peak. Holt-Winters has more parameters that are tunable, but we are going to rely on the automatic minimization to choose those for us.
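If you don't already know the period of your data, one rough way to find it is to measure the spacing between peaks. The Python sketch below assumes a clean (or already smoothed) series; on noisy data the naive peak finder would pick up every wiggle. The function names are hypothetical. The window check reflects the usual Holt-Winters requirement of two full periods of data to bootstrap the seasonal terms, which is worth confirming against the documentation for your Elasticsearch version.

```python
import math

def peak_indices(values):
    """Indices of local maxima (above the left neighbour, not below the right)."""
    return [
        t for t in range(1, len(values) - 1)
        if values[t - 1] < values[t] >= values[t + 1]
    ]

def estimate_period(values):
    """Average spacing between consecutive peaks, rounded to whole buckets."""
    peaks = peak_indices(values)
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to estimate a period")
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return round(sum(gaps) / len(gaps))

def window_covers_two_periods(window, period):
    """True if the window holds at least two full seasonal cycles."""
    return window >= 2 * period

# Synthetic hourly data: four cycles of a sine wave, 69 hours per cycle.
wave = [math.sin(2 * math.pi * t / 69) for t in range(4 * 69)]

period = estimate_period(wave)
print(period, window_covers_two_periods(200, period), window_covers_two_periods(100, period))
```

Both window sizes used in the snippet above (200 and 150) cover two 69-hour periods; a window of 100 would not.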

The graph that we get out looks much better:

(Chart legend: smoothed average in purple, max value in yellow, upper control limit in green.)

The threshold now lines up with the data perfectly, and we correctly detect the spike (and nothing else). You will notice a new anomaly though. Exactly one period after the first spike a new spike exists where there wasn’t one previously. And if you look closely, you'll see a tiny spike two periods afterwards which is also new.

This is an artifact from Holt-Winters. Forecasts are built based on past seasonal data, and since the past data had a spike, you'll see traces of that in future forecasts. This artifact can be diminished slightly by increasing the window length, and in general isn't usually large enough to trigger a "threshold breach".
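To see where the echo comes from, here is a hand-rolled additive Holt-Winters in Python. It is a sketch of the general technique, not Elasticsearch's implementation: the initialisation is a simple one chosen for brevity, and `alpha`, `beta` and `gamma` are fixed by hand rather than minimised. The data is a synthetic sine wave with one injected spike.

```python
import math

def holt_winters_additive(values, period, alpha=0.3, beta=0.1, gamma=0.3):
    """One-step-ahead additive Holt-Winters forecasts.

    The first two periods seed the level, trend and seasonal terms. The result
    has one entry per input value; entries before index `period` are None.
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full periods of data")
    first_mean = sum(values[:period]) / period
    second_mean = sum(values[period:2 * period]) / period
    level = first_mean
    trend = (second_mean - first_mean) / period
    seasonal = [v - first_mean for v in values[:period]]

    forecasts = [None] * period
    for t in range(period, len(values)):
        s = seasonal[t % period]
        forecasts.append(level + trend + s)  # predicted before seeing values[t]
        new_level = alpha * (values[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (values[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts

period, spike_at = 69, 200
clean = [math.sin(2 * math.pi * t / period) for t in range(5 * period)]
spiky = list(clean)
spiky[spike_at] += 5.0  # a single anomalous hour

clean_fc = holt_winters_additive(clean, period)
spiky_fc = holt_winters_additive(spiky, period)

# How far the spike pushes the forecast one and two periods later.
echo_1 = spiky_fc[spike_at + period] - clean_fc[spike_at + period]
echo_2 = spiky_fc[spike_at + 2 * period] - clean_fc[spike_at + 2 * period]
print(round(echo_1, 2), round(echo_2, 2))
```

The spike is absorbed into the seasonal term for that position in the cycle, so the forecast is pushed up at the same position one period later, and by a smaller amount the period after that. How quickly the echo fades depends on the smoothing parameters, so the sizes here will not match the chart above.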

Extra Credit: Configuring a Watcher alert

If you have Watcher installed -- an alerting and notification plugin for Elasticsearch -- it is trivial to add a watch which will alert you when a spike has been detected. We will define a watch that checks every hour (finer granularity is unnecessary, since the data is only logged at 1hr intervals).

Then we plop in our aggregation, set up some email and logging notifications, and define the condition. The condition simply checks whether the maximum value is greater than the upper control limit.

curl -XPUT 'http://localhost:9200/_watcher/watch/log_error_watch' -d '{
  "trigger": {
    "schedule": {
      "interval": "1h"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["reactor_logs"],
        "body": {
          "size": 0,
          "aggs": {
            "histo": {
              "date_histogram": {
                "field": "tick",
                "interval": "hour"
              },
              "aggs": {
                "stats": {
                  "extended_stats": {
                    "field": "value"
                  }
                },
                "movavg_mean": {
                  "moving_avg": {
                    "buckets_path": "stats.avg",
                    "window": 24,
                    "model": "ewma",
                    "settings": {
                      "alpha": 0.1
                    }
                  }
                },
                "movavg_std": {
                  "moving_avg": {
                    "buckets_path": "stats.std_deviation",
                    "window": 24,
                    "model": "ewma"
                  }
                },
                "shewhart_ucl": {
                  "bucket_script": {
                    "buckets_path": {
                      "mean": "movavg_mean.value",
                      "std": "movavg_std.value"
                    },
                    "script": "mean + (3 * std)"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "inline": "def lastBucket = ctx.payload.aggregations.histo.buckets.last(); return lastBucket.stats.max > lastBucket.shewhart_ucl.value"
    }
  },
  "actions": {
    "log_error": {
      "logging": {
        "text": "Reactor Meltdown!"
      }
    },
    "send_email": { 
      "email": {
        "to": "user@example.com", 
        "subject": "Watcher Notification - Reactor Meltdown!",
        "body": "Reactor is melting down, please investigate. :)"
      }
    }
  }
}'

With this in place, Watcher will email you as soon as the maximum value exceeds the upper control limit. It is fairly trivial to extend this to log or alert on “warnings”, such as when values exceed two standard deviations instead of three, or have remained above the mean for more than 10 consecutive hours. The sky is the limit!
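The two warning rules mentioned above are simple enough to sketch. In a real watch this logic would live in the condition script; the Python below only illustrates the checks, and the function names and sample numbers are made up.

```python
def classify(max_value, mean, std):
    """'alert' above mean + 3 std, 'warning' above mean + 2 std, else 'ok'."""
    if max_value > mean + 3 * std:
        return "alert"
    if max_value > mean + 2 * std:
        return "warning"
    return "ok"

def longest_run_above(values, means):
    """Length of the longest streak where each value sits above its paired mean."""
    longest = current = 0
    for v, m in zip(values, means):
        current = current + 1 if v > m else 0
        longest = max(longest, current)
    return longest

# Hypothetical hourly values against a smoothed mean of 10 and std of 1.
levels = [classify(v, 10.0, 1.0) for v in (11.0, 12.5, 13.5)]

# Twelve hours above the mean, one dip, then three more hours above it.
hourly = [11.0] * 12 + [9.0] + [11.0] * 3
sustained = longest_run_above(hourly, [10.0] * len(hourly)) > 10
print(levels, sustained)
```

The first rule gives an early warning band below the alert threshold. The second catches a sustained shift that never produces a single large spike.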

Conclusion

I hope this article was interesting. Most folks are acquainted with the smoothing capabilities of moving averages; they are great for smoothing out noise so you can see the more general trend. But they can also be the building blocks for much richer functionality, such as finding anomalous data points in a dynamic dataset. It's fairly remarkable how powerful simple, statistical techniques can be in practice.

Because of the new functionality in pipeline aggregations, all of this functionality can now be expressed in Elasticsearch itself. And when coupled with Watcher, you can build robust alerting and notifications directly from your data, without having to pipe it to an external system first.

In upcoming posts, we'll look at how you can forecast with moving averages, other methods for anomaly detection, and more. Stay tuned!