Averages Can Be Misleading: Try a Percentile

And now we can see that Antarctica has a particularly slow 95th percentile (for some strange reason):

{
    ...

    "aggregations": {
        "country" : {
            "buckets": [
                {
                    "key" : "AY",
                    "doc_count" : 20391,
                    "load_time_outlier": {
                        "95.0": 1205
                    }
                },
                ...

Percentiles Are (Usually) Approximate

All good things come at a price, and with percentiles that price is usually approximation. Fundamentally, percentiles are very expensive to calculate exactly. If you want the 95th percentile, you need to sort all your values from least to greatest, then find the value at myArray[count(myArray) * 0.95].
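In Python, that exact method looks something like the sketch below (exact_percentile is an illustrative name, not anything from Elasticsearch):

def exact_percentile(values, pct):
    # The exact method: sort every value, then index into the sorted list.
    ordered = sorted(values)
    # Clamp the index so pct=100 doesn't run past the end of the list.
    index = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[index]

load_times = [72, 85, 91, 103, 110, 125, 180, 210, 540, 1205]
print(exact_percentile(load_times, 95))  # 1205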

This works fine for small data that fits in memory, but it simply fails when you have terabytes of data spread over a cluster of servers (which is common for Elasticsearch users). The exact method just won't work for Elasticsearch.

Instead, we use an algorithm called T-Digest (you can read more about it here). Without getting bogged down in technical details, it is sufficient to make the following claims about T-Digest (a small experiment after the list illustrates them):

  • For small datasets, your percentiles will be highly accurate (potentially 100% exact if the data is small enough)
  • For larger datasets, T-Digest will begin to trade accuracy for memory savings so that your node doesn't explode
  • Extreme percentiles (e.g. 95th) tend to be more accurate than interior percentiles (e.g. 50th)
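To get a feel for those claims, here is a quick experiment using the third-party Python tdigest package (pip install tdigest). Note this is an independent implementation, not the one inside Elasticsearch, and the TDigest/batch_update/percentile API shown is that package's, so treat this as a sketch:

import random
from tdigest import TDigest  # third-party package, assumed API

random.seed(42)
values = [random.uniform(0, 1000) for _ in range(100_000)]

# The digest summarizes all 100,000 values in a small set of centroids.
digest = TDigest()
digest.batch_update(values)

ordered = sorted(values)
for pct in (50, 95, 99):
    exact = ordered[int(len(ordered) * pct / 100)]
    approx = digest.percentile(pct)
    print(f"p{pct}: exact={exact:.2f} approx={approx:.2f}")

On a run like this, the approximations should land close to the exact values, with the tails (95th, 99th) typically closer in relative terms than the median.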

The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

[Figure: relative error of the percentile approximation on a uniform distribution, as a function of the number of collected values, for several requested percentiles]

The absolute error of a percentile is the difference between the true value and the approximate value. It is often more useful to express that as a relative percentage of the true value than as an absolute difference. In the chart, we can see that at 1,000 values, the 50th percentile is 0.26% off the true 50th percentile. In absolute terms, if the true 50th was 100ms, T-Digest might have told us 100.26ms. Practically speaking, the error is often negligible, especially when you are looking at the more extreme percentiles.
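Spelled out as code, the relative error quoted above is simply (relative_error is an illustrative helper):

def relative_error(true_value, approx_value):
    # Express the error as a percentage of the true value.
    return abs(true_value - approx_value) / true_value * 100

print(relative_error(100.0, 100.26))  # ~0.26, the example from the chart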

The chart also shows how precision improves as you add more data. The error diminishes for large numbers of values because the law of large numbers makes the distribution of values more and more uniform, and the t-digest tree can do a better job of summarizing it. This would not be the case on more skewed distributions.

The memory-vs-accuracy tradeoff is configurable via a compression parameter; you can find more details about it in the documentation.

Conclusion

Now armed with some basic knowledge about percentiles, hopefully you are beginning to see applications all over your data. These approximate algorithms are exciting new territory for Elasticsearch. We look forward to your feedback on the mailing list or Twitter!
