August 3, 2015

Staying in Control with Moving Averages - Part 1

In manufacturing and business processes, there is a common tool called a control chart. Created in the 1920s by Dr. Walter Shewhart, a control chart is used to determine if a process is "in control" or "out of control".

At the time, Dr. Shewhart was working at Bell Labs trying to improve the signal quality of telephone lines. Poorly machined components were a leading cause of signal degradation, so improving manufacturing processes to produce more uniform components was a critical step in improving signal quality.

Dr. Shewhart realized that all processes, manufacturing or otherwise, have some amount of natural variation. The key was to identify when the variation was behaving normally ("in control"), and when it suddenly began to change ("out of control"). A process that has gone out of control needs to be halted so the problem can be fixed, instead of churning out sloppy manufactured components.

Control charts work by triggering an alert when a value diverges from the mean by more than a set amount. In practice, they are simple and intuitive to read, and they often act as front-line anomaly detectors because of that simplicity and robustness.

Smoothing with Moving Averages

Control charts can be built fairly easily in Elasticsearch using a combination of aggregations, including the new pipeline aggregations. To get started, let’s look at some synthetic data that I generated for this post. For fun, we can imagine it is the coolant temperature (in degrees Celsius) of a nuclear reactor.

Let’s take a look at the data first, using a date_histogram bucket and an extended_stats metric:

{
   "size": 0,
   "aggs": {
      "histo": {
         "date_histogram": {
            "field": "tick",
            "interval": "hour"
         },
         "aggs": {
            "stats": {
               "extended_stats": {
                  "field": "value"
               }
            }
         }
      }
   }
}

In the graph, we are plotting the `avg` for each bucket:


As you can see, the data is basically a flat trend, with random variation around 30. The data is noisy, so the first thing you might like to do is smooth it out so you can see the general trend better. Moving averages are great for this.

A moving average basically takes a window of values, computes the average, then moves the window forward one step. There are several different types of moving averages that you can choose from. We are going to use an Exponentially-Weighted Moving Average (EWMA). This type of moving average reduces the "importance" of a datapoint exponentially as it becomes "older" in the window. This helps keep the moving average centered on the data instead of lagging behind.
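To make the idea concrete, here is a minimal Python sketch of an exponentially weighted moving average computed over a sliding window. The function names `ewma` and `moving_ewma` are hypothetical, and the sketch uses the common recursive form seeded with the oldest value in the window. Including the current point in its own window is a choice made here for illustration, not a statement about how Elasticsearch handles it internally.

```python
def ewma(values, alpha=0.3):
    """EWMA of one window of values, ordered oldest first."""
    if not values:
        raise ValueError("window must contain at least one value")
    avg = values[0]  # seed with the oldest value in the window
    for v in values[1:]:
        # Newer values get weight alpha; the running average decays by (1 - alpha).
        avg = alpha * v + (1 - alpha) * avg
    return avg


def moving_ewma(series, window=24, alpha=0.3):
    """Slide a window across the series and emit one EWMA per position."""
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)  # early windows are shorter
        smoothed.append(ewma(series[start:i + 1], alpha))
    return smoothed


if __name__ == "__main__":
    # Made-up hourly averages, purely for illustration.
    hourly_avg = [30.2, 29.8, 30.5, 29.9, 30.1, 30.4]
    print(moving_ewma(hourly_avg, window=3, alpha=0.1))
```

Because each step multiplies the running average by (1 - alpha), a point's influence shrinks geometrically as newer points arrive, which is the "exponential" part of the name.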

In the following query, we add a movavg_mean moving average pipeline aggregation that computes the moving average of each bucket's avg (i.e. a sliding mean of means):

{
   "size": 0,
   "aggs": {
      "histo": {
         "date_histogram": {
            "field": "tick",
            "interval": "hour"
         },
         "aggs": {
            "stats": {
               "extended_stats": {
                  "field": "value"
               }
            },
            "movavg_mean": {
              "moving_avg": {
                "buckets_path": "stats.avg",
                "window": 24,
                "model": "ewma",
                "settings": {
                    "alpha": 0.1   
                }
              }
            }
         }
      }
   }
}

There are a few interesting bits here:

buckets_path points to the avg value calculated inside our extended_stats metric
window is set to 24, meaning we want to average the last 24 hours together
model is set to ewma
And finally, we configure some settings for this particular model. The setting alpha controls how "smooth" the generated moving average is. The default (0.3) is usually pretty good, but I liked the look of 0.1 better for this demo. Check out the docs for more info on how alpha functions.
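One way to build intuition for alpha is to look at how much weight each point in the window ends up with. The sketch below assumes the common recursive form of EWMA (avg = alpha * v + (1 - alpha) * avg, seeded with the oldest value); the helper name `ewma_weights` is hypothetical and this is not taken from the Elasticsearch source.

```python
def ewma_weights(window, alpha):
    """Effective weight of each point in a window, ordered oldest first."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # The seed (oldest) value is never scaled by alpha, only decayed.
    weights = [(1 - alpha) ** (window - 1)]
    # Every later point is scaled by alpha, then decayed once per newer point.
    for age in range(window - 2, -1, -1):
        weights.append(alpha * (1 - alpha) ** age)
    return weights


if __name__ == "__main__":
    for alpha in (0.1, 0.3):
        w = ewma_weights(24, alpha)
        # Show the weight of the newest point and of the point 12 hours back.
        print(alpha, w[-1], w[-13])
```

A smaller alpha gives the newest point less weight and spreads the remainder across older points, which is why 0.1 produces a smoother line than the default of 0.3.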

And the resulting graph now includes a nicely smoothed line (purple):

In control?

So, the question is... does this chart look "in control"? Is there a reason you should shut down the reactor, or is everything operating smoothly?

I admit, I was being sneaky in the previous graph: I plotted the average. As discussed previously, the average is a pretty poor metric in most cases. In this dataset, it is hiding a big spike that I placed on Thursday.

If we plot the maximum value in each bucket (yellow line) the spike is immediately clear:

I hope you turned the reactor off on Thursday! ;)
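A small Python illustration of why the average can hide a spike like this one. The readings are made up for the example (60 per-minute values in one hourly bucket), and `summarize_bucket` is a hypothetical helper, not part of any Elasticsearch client.

```python
from statistics import mean


def summarize_bucket(values):
    """Return the two metrics plotted in this post for one bucket."""
    return {"avg": mean(values), "max": max(values)}


if __name__ == "__main__":
    quiet = [30.0] * 60            # a normal hour
    spiky = [30.0] * 59 + [90.0]   # the same hour with one extreme reading

    print(summarize_bucket(quiet))
    print(summarize_bucket(spiky))
```

In this made-up bucket a single reading of 90 only moves the average from 30 to 31, while the max jumps from 30 to 90. The more datapoints a bucket holds, the more thoroughly the average dilutes a short spike.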

How might we have detected this spike? In this chart, the anomaly is absurdly clear. You could use a simple threshold. But as we'll see later, thresholds often fail under more complex patterns.

Instead, let's build a control chart. Control charts consider a process "out of control" if datapoints start falling more than three standard deviations away from the mean. With that in mind, we can modify our aggregation to turn it into a bona fide control chart. To do so, we need to add two new aggregations: a moving average on the standard deviation, and a script that calculates the upper limit:

{
   "size": 0,
   "aggs": {
      "date_histo": {
         "date_histogram": {
            "field": "tick",
            "interval": "hour"
         },
         "aggs": {
            "stats": {
               "extended_stats": {
                  "field": "value"
               }
            },
            "movavg_mean": {
              "moving_avg": {
                "buckets_path": "stats.avg",
                "window": 24,
                "model": "ewma",
                "settings": {
                    "alpha": 0.1   
                }
              }
            },
            "movavg_std": {
              "moving_avg": {
                "buckets_path": "stats.std_deviation",
                "window": 24,
                "model": "ewma"
              }
            },
            "shewhart_ucl": {
              "bucket_script": {
                "buckets_path": {
                  "mean": "movavg_mean.value",
                  "std": "movavg_std.value"
                },
                "script": "mean + (3 * std)"
              }
            }
         }
      }
   }
}

The new "movavg_std" pipeline agg is very simple: it is an EWMA (with default settings) that averages the stats.std_deviation metric over the last 24 hours.

The "shewhart_ucl" pipeline agg is a bucket_script which computes the "upper control limit", i.e. the value above which you start worrying because the process has gone out of control. Think of it as a dynamic threshold. The threshold is calculated by multiplying the rolling standard deviation by three, then adding it to the rolling mean.

I omitted it for brevity, but most control charts also include a "lower control limit". To add that, you would simply copy "shewhart_ucl", change the script to subtract three standard deviations instead of adding them, and rename the copy to "shewhart_lcl".
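The arithmetic behind both limits is small enough to sketch outside Elasticsearch. The Python below is an illustration under assumptions: `control_limits`, `out_of_control`, and the `sigmas` parameter are hypothetical names, and the inputs stand in for the rolling mean and rolling standard deviation that the moving averages produce for a bucket.

```python
def control_limits(mean, std, sigmas=3):
    """Return (lower, upper) control limits around a rolling mean."""
    spread = sigmas * std
    return mean - spread, mean + spread


def out_of_control(value, mean, std, sigmas=3):
    """True if value falls outside the control limits."""
    lower, upper = control_limits(mean, std, sigmas)
    return value < lower or value > upper


if __name__ == "__main__":
    # Made-up numbers: rolling mean of 30 with a rolling std deviation of 2.
    print(control_limits(30, 2))
    print(out_of_control(33, 30, 2))
    print(out_of_control(90, 30, 2))
```

Because the mean and standard deviation are both rolling values, the limits move with the data, which is what makes this a dynamic threshold rather than a fixed one.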

Note: I’m using an inline script for convenience. You can replace it with a static script if dynamic, inline scripting is disabled on your cluster.

Smoothed average: purple

Max value: yellow

Upper control limit: green

We can graph this and see that the spike (yellow) shoots up past the control limit (green). In a real system, this is when you send out an alert or email. Or maybe something more drastic, since this is a nuclear reactor we are modeling ;)

Conclusion

That's all for this week. To recap, we used the new pipeline aggregations to smooth out our data with a moving average. We then constructed a control chart to dynamically find outliers by calculating an "upper control limit" based on the moving average and a moving standard deviation.

In part two, we'll look at how the same control chart can be used for more interesting data patterns, such as linear trends and cyclic behavior. We'll also see how to integrate it with Watcher so that we can receive email notifications automatically. Check it out!