03 April 2014

Count on Elasticsearch!

Von Adrien Grand

One feature that has been requested for a very long time landed in Elasticsearch and is available in the 1.1.0 release: the ability to count unique values for a particular field. There are lots of interesting things that this feature allows to compute:

  • the number of visitors on your website,
  • the number of people you wrote e-mails to,
  • etc…

Example

This feature is exposed under the form of an aggregation called cardinality, so that you can benefit from all the goodness of aggregations, especially composability. So taking back the unique number of visitors on your website as an example, you could put this cardinality aggregation under a date histogram aggregation in order to see the trend over months:

curl -XGET "http://localhost:9200/_search" -d'
{
    "aggs": {
        "monthly": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "month"
            },
            "aggs": {
                "visitor_count": {
                    "cardinality": {
                        "field": "ip_address"
                    }
                }
            }
        }
    }
}'

Under the hood

As long as a dataset is small and can fit on a single machine, computing the number of unique values is simple and is just a matter of adding elements into a hash set and returning its size. However, memory usage of this solution is linear with the number of unique values, which makes it impractical to evaluate high cardinalities. Another issue is that if you want to compute the cardinality of a dataset that is stored on several machines, summing up the cardinalities returned on each machine is not good enough since there might be overlap. So you actually need to stream these sets to a single location where they can be merged in order to know the actual cardinality.

Fortunately, there are other algorithms that address these challenges, in particular linear counting and HyperLogLog. I won't explain how they work given that there are already some excellent blog posts that do it. However, here are a few important things to know about these algorithms:

  • they don't work on values directly but on their hashes,
  • they return approximate results,
  • linear counting has excellent precision on low-cardinality sets but either requires more memory on larger sets or starts having very bad precision,
  • HyperLogLog can estimate the cardinality of arbitrarily-large sets with fixed memory usage.

Because of the different characteristics of these algorithms, there is something interesting that can be done: using linear counting on small cardinalities and HyperLogLog on larger ones. This way, we would have the best of both worlds: excellent precision on small cardinalities and fixed memory usage whatever the cardinality to estimate is. But this also means that a linear counter needs to be slightly modified in such a way that it can be upgraded to an HyperLogLog counter. This is exactly what the HyperLogLog++ algorithm, that we use for the cardinality aggregation, does.

Precision and memory usage

As we just saw, precision degrades at some point in order to keep memory usage bounded. Elasticsearch makes this threshold configurable through the precision_threshold parameter. For example, if you configure a precision_threshold of 1000, you could expect precision to be excellent if the return value is < 1000 and a bit more approximate otherwise. The memory usage of this aggregation also depends on this parameter: for a value of N, you should expect a memory usage of about 8 * N bytes per shard per aggregation bucket.

For example, we built the following chart that computes the relative error for various sets of random values, depending on the precision threshold and the actual cardinality:

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the relative error remained way under under 5%, even when counting millions of items.

Hopefully this post gives you insights about what the cardinality aggregation does and how it works. We look forward to your feedback on the mailing list or Twitter!