All About Caching | Elasticsearch: The Definitive Guide [1.x]

WARNING: The 1.x versions of Elasticsearch have passed their EOL dates. If you are running a 1.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

› › ›

« Dealing with Null Values Filter Order »

All About Cachingedit

Earlier in this chapter (Internal Filter Operation), we briefly discussed how filters are calculated. At their heart is a bitset representing which documents match the filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, these bitsets can be reused wherever the same filter is used, without having to reevaluate the entire filter again.

These cached bitsets are “smart”: they are updated incrementally. As you index new documents, only those new documents need to be added to the existing bitsets, rather than having to recompute the entire cached filter over and over. Filters are real-time like the rest of the system; you don’t need to worry about cache expiry.

Independent Filter Cachingedit

Each filter is calculated and cached independently, regardless of where it is used. If two different queries use the same filter, the same filter bitset will be reused. Likewise, if a single query uses the same filter in multiple places, only one bitset is calculated and then reused.

Let’s look at this example query, which looks for emails that are either of the following:

In the inbox and have not been read
Not in the inbox but have been marked as important

"bool": {
   "should": [
      { "bool": {
            "must": [
               { "term": { "folder": "inbox" }}, 
               { "term": { "read": false }}
            ]
      }},
      { "bool": {
            "must_not": {
               "term": { "folder": "inbox" } 
            },
            "must": {
               "term": { "important": true }
            }
      }}
   ]
}

These two filters are identical and will use the same bitset.

Even though one of the inbox clauses is a must clause and the other is a must_not clause, the two clauses themselves are identical. This means that the bitset is calculated once for the first clause that is executed, and then the cached bitset is used for the other clause. By the time this query is run a second time, the inbox filter is already cached and so both clauses will use the cached bitset.

This ties in nicely with the composability of the query DSL. It is easy to move filters around, or reuse the same filter in multiple places within the same query. This isn’t just convenient to the developer—it has direct performance benefits.

Controlling Cachingedit

Most leaf filters—those dealing directly with fields like the term filter—are cached, while compound filters, like the bool filter, are not.

Leaf filters have to consult the inverted index on disk, so it makes sense to cache them. Compound filters, on the other hand, use fast bit logic to combine the bitsets resulting from their inner clauses, so it is efficient to recalculate them every time.

Certain leaf filters, however, are not cached by default, because it doesn’t make sense to do so:

Script filters: The results from https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-script-filter.html cannot be cached because the meaning of the script is opaque to Elasticsearch.
Geo-filters: The geolocation filters, which we cover in more detail in Geolocation, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them.
Date ranges: Date ranges that use the now function (for example "now-1h"), result in values accurate to the millisecond. Every time the filter is run, now returns a new time. Older filters will never be reused, so caching is disabled by default. However, when using now with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.

Sometimes the default caching strategy is not correct. Perhaps you have a complicated bool expression that is reused several times in the same query. Or you have a filter on a date field that will never be reused. The default caching strategy can be overridden on almost any filter by setting the _cache flag:

{
    "range" : {
        "timestamp" : {
            "gt" : "2014-01-02 16:15:14" 
        },
        "_cache": false 
    }
}

	It is unlikely that we will reuse this exact timestamp.
	Disable caching of this filter.

Later chapters provide examples of when it can make sense to override the default caching strategy.

« Dealing with Null Values Filter Order »