19 August 2014

Optimizing Elasticsearch Searches

By Alex Brasetvik

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Simple Suggestions for Speedier Searches

Elasticsearch can query, filter and aggregate in many ways. Often there are several ways to solve the same problem – and possibly with very different performance characteristics. This article will cover some important optimizations that can buy you a lot of performance.

Find a Friend in Filters

Understanding how filters work is essential to making searches faster. A lot of search optimization is really about how to use filters, where to place them and when to (not) cache them.

Take this search, for example:

query:
    bool:
        must:
            - term:
                tag: "elasticsearch"
            - multi_match:
                query: "query tuning"
                fields: ["title^5", "body"]

Elasticsearch will search for documents that are tagged with elasticsearch and that contain query tuning, preferably in the title. However, as all resulting documents are required to contain elasticsearch, the tag query has no impact on the scoring – the tag query acts as a filter.

This is the key property of filters: the result will be the same for all searches, hence the result of a filter can be cached and reused for subsequent searches. Caching them is quite cheap, as you can store them as a compact bitmap. When you search with filters that have been cached, you are essentially manipulating in-memory bitmaps - which is just about as fast as it can possibly get.
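To see why combining cached filters is so cheap, here is a toy sketch in Python that represents each filter result as a bitmap with one bit per document. The document IDs and variable names are made up for illustration, and the real filter cache is more sophisticated than a plain integer.

```python
def to_bitmap(doc_ids):
    """Pack a collection of matching document IDs into an integer bitmap."""
    bitmap = 0
    for doc_id in doc_ids:
        bitmap |= 1 << doc_id
    return bitmap


def from_bitmap(bitmap):
    """Return the sorted document IDs whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical cached filter results for an index with eight documents.
tag_elasticsearch = to_bitmap([0, 2, 3, 7])
type_book = to_bitmap([2, 3, 4, 5])

# Applying both filters is a single bitwise AND; no document is re-read.
both = tag_elasticsearch & type_book
matching_docs = from_bitmap(both)
```

Once the two bitmaps exist, every later search that uses the same filters only pays for the AND.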

A rule of thumb is to use filters when you can and queries when you must: when you need the actual scoring from the queries.

Filter First

Having realized that we want to use a filter instead of a query, a common rewrite is something like this:

query:
    multi_match:
        query: "query tuning"
        fields: ["title^5", "body"]
# Make it a filter, because filters are faaast.
filter: # renamed to post_filter in 1.0
    term:
        tag: "elasticsearch"
# But wait. This is wrong! Explanation follows.

This is one of the most common errors I see, and probably the reason why the top-level filter was renamed to post_filter in version 1.0, to emphasize that it is a filter that happens after (post) the query phase. To understand why this rewrite may actually be for the worse, we’ll first have a look at the various places you can place a filter in a search.

High Level View of Filter Locations

Filters can appear in a filtered query, in filter aggregations, and in post filters. Filters are also useful for e.g. scoring in function score queries, but in that context they do not reduce the document set.

The filtering that happens in the filtered query – at the top of the figure – is applied to everything. Aggregations as well as hits are affected by the filtering. This is not true of filtering that happens in the post_filter. Its entire purpose is to have a filter that does not affect aggregations.

In the (suboptimal) rewrite that we did above, we moved the tag query component into a post_filter. The outcome of this is that all documents matching the "query tuning" query will be scored, and only then is the filter applied. While we have gained cacheability of the tag filter, we have potentially increased the cost of scoring significantly. Post filters are useful when you need aggregations to be unfiltered, but hits to be filtered. You should not be using post_filter (or its deprecated top-level synonym filter) if you do not have facets or aggregations.

Instead, this query should be rewritten to a filtered-query, like this:

query:
    filtered:
        query:
            multi_match:
                query: "query tuning"
                fields: ["title^5", "body"]
        filter:
            term:
                tag: "elasticsearch"

The filtered-query is smart enough to apply filters before queries, which reduces the search space before expensive scoring is performed.
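As a rough sketch of how the two filter locations can coexist, the Python helper below builds a search body with the tag filter inside a filtered-query and adds a post_filter only when there are aggregations that should stay unfiltered. The helper name and the section field used for the facet are assumptions made for this example.

```python
def filtered_search(query, filter_, aggs=None, post_filter=None):
    """Build a search body that filters before scoring (1.x query DSL)."""
    body = {"query": {"filtered": {"query": query, "filter": filter_}}}
    if aggs is not None:
        body["aggs"] = aggs
    if post_filter is not None:
        # Applied to the hits only, after aggregations have been computed.
        body["post_filter"] = post_filter
    return body


search = filtered_search(
    query={"multi_match": {"query": "query tuning",
                           "fields": ["title^5", "body"]}},
    filter_={"term": {"tag": "elasticsearch"}},
    # Hypothetical facet: count hits per section across all sections ...
    aggs={"sections": {"terms": {"field": "section"}}},
    # ... while only showing hits from the section the user picked.
    post_filter={"term": {"section": "news"}},
)
```

Without aggregations, the post_filter argument would simply be left out and everything would go in the filtered-query.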

Use a filter
Place filters in the right place. (Probably in a filtered-query!)

Combining Filters

Elasticsearch has several ways to combine filters: and, or, not, and … bool.

You should probably always use bool rather than and or or. For more detailed reasoning for this, see Zachary Tong’s post All About Elasticsearch Filter Bitsets. The gist is that most filters can be cached, while some filters (e.g. geo_distance or script) need to work document-by-document anyway. Therefore, you’ll want cached (and therefore cheap) filters to be applied before the expensive ones.

These subtle differences have, for the most part, been worked into the bool filter so you no longer have to worry about them, but it is always a good idea to test anyway!

That said, you still need to think about which order you filter in. You want the more selective filters to run first. Say you filter on type: book and tag: elasticsearch. If you have 30 million documents, 10 million of type book and only 10 tagged Elasticsearch, you’ll want to apply the tag filter first. It reduces the number of documents much more than the book filter does.
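A minimal sketch of that ordering in Python: given filters paired with a rough estimate of how many documents each one matches, it puts the most selective filter first in a bool filter. The helper name is hypothetical and the estimates are assumed to come from elsewhere (for example, from count requests you have run yourself).

```python
def bool_filter_by_selectivity(filters_with_counts):
    """Build a bool filter whose must clauses are listed most-selective first.

    filters_with_counts: list of (filter, estimated_matching_docs) pairs.
    """
    ordered = sorted(filters_with_counts, key=lambda pair: pair[1])
    return {"bool": {"must": [f for f, _ in ordered]}}


combined = bool_filter_by_selectivity([
    ({"term": {"type": "book"}}, 10000000),
    ({"term": {"tag": "elasticsearch"}}, 10),
])
```

As with the other advice here, it is worth measuring whether the ordering makes a difference for your data.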

Combine filters with bool.
Order filters by their selectivity.

Cache Granularity and Acceleration Filters

The cacheability of filters is an important reason why they can be so fast. Not all filters can (sensibly) be cached, however. Consider a user with a smartphone at location x wanting to see nearby events occurring within the next hour. These two filters (location and time) would be highly specific to that user and to that exact time. It is unlikely that those filters will be reused, so it makes no sense to cache them.

In such scenarios it can be useful to add auxiliary filters that are less specific, but cacheable. For example, while it is unlikely that finding documents within 5 kilometers of the specific location (63.4305083, 10.3951494) (in downtown Trondheim) will be reused, any similar distance filter for users in the same area will fall within the much wider grid defined by the geohash u5r. (This is not necessarily true near meridians or the equator). Another possibility would be to filter on city or county, for instance.
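The following Python sketch illustrates the idea. It computes the geohash prefix for a coordinate and combines a cacheable term filter on that prefix with the exact distance filter. The geohash3 field is an assumption: it stands for a three-character geohash that you would index with each document yourself. The sketch also ignores the border caveat, so a real implementation would include neighbouring cells.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision):
    """Encode a coordinate as a geohash string of the given length."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def nearby_filter(lat, lon, distance="5km"):
    """Coarse, cacheable cell filter first; exact distance check second."""
    return {"bool": {"must": [
        # Assumes a hypothetical `geohash3` field indexed with each document.
        {"term": {"geohash3": geohash(lat, lon, 3)}},
        {"geo_distance": {"distance": distance,
                          "location": {"lat": lat, "lon": lon}}},
    ]}}


trondheim = nearby_filter(63.4305083, 10.3951494)
```

Every user in the same cell produces the same term filter, so that part can be served from the cache.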

Similarly, Elasticsearch does not cache any time filter using the now keyword in date math unless a rounding is specified. If you want to find all documents with timestamp >= 'now-1h', the filter will not be cached, because now is (hopefully) continuously moving. However, any document that is less than an hour old is also necessarily less than one day old. Thus, you can have a filter like timestamp >= 'now/1d' AND timestamp >= 'now-1h'. The timestamp >= 'now/1d' component, which should be applied first, can be cached because it is rounded to the current day. It is not exactly what we want, but it reduces the number of documents that need to be considered for the now-1h filter. One caveat: during the first hour of a day, the rounded component excludes documents from just before midnight. Rounding now-1h instead of now (i.e. 'now-1h/d') avoids that.
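Here is one way such a pair of time filters might look, sketched as a Python helper that builds the filter body. The helper name is made up. Note that it rounds now-1h rather than now, so that documents from just before midnight are not dropped during the first hour of a day.

```python
def recent_filter(field="timestamp"):
    """Documents from the last hour, with a cacheable day-level pre-filter."""
    return {"bool": {"must": [
        # Rounded down to the start of a day, so the result can be cached.
        {"range": {field: {"gte": "now-1h/d"}}},
        # Moves continuously with `now`, so it is not cached.
        {"range": {field: {"gte": "now-1h"}}},
    ]}}


last_hour = recent_filter()
```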

In other words, filters that seem redundant can speed up things a lot, because they can be cached and reduce the search space for filters that cannot. When you face a challenge with a filter that is not being cached, you should consider if you can accelerate the filter enough in other ways.

Mind which filters can(not) be cached.
Consider cacheable “accelerator” filters to reduce the burden of more expensive filters.

Filter Aggregations

As with queries and filters, there can be multiple ways of achieving the same aggregation. The filter aggregation (or facet) is incredibly useful, even when a terms or range aggregation could do the same.

Assume you have a web site with three different sections, and you want to show how many hits there are in each section. The most obvious approach would be to do a terms aggregation on the section field to get an aggregation that says e.g. {general: 123, news: 40, blog: 12}. To limit the search to a section, you would use a term filter like {term: {section: news}}. You might even be using these filters for function scores as well. A cached filter can be reused in many settings. Since you are already paying for the filters’ memory, it can make sense to replace the terms aggregation with a filters aggregation.

A terms aggregation will need the entire section field in memory, then count and bucket for every request. AND-ing together a few bitmaps is probably a lot faster. This can work well for low-cardinality fields: I am not suggesting replacing all your terms aggregations with a huge number of filters!
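For the three-section example, the replacement could be sketched like this in Python: one filter aggregation per known section value, each reusing the same term filter you would use to restrict a search to that section. The helper name and the section_ prefix on the aggregation names are arbitrary choices for this example.

```python
def section_count_aggs(sections, field="section"):
    """One filter aggregation per known value of a low-cardinality field."""
    return {
        "section_" + name: {"filter": {"term": {field: name}}}
        for name in sections
    }


aggs = section_count_aggs(["general", "news", "blog"])
```

Each aggregation then reports a doc_count for its section, which is the same number the terms aggregation would have put in the corresponding bucket.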

The same can apply to range aggregations.

Consider whether your aggregation can be implemented with a filter aggregation instead.

Aggregation Abundance

Aggregations are powerful, but they can easily dominate the performance cost of your searches – and consume a great deal of memory.

Therefore, it can be worthwhile to minimize the number of aggregations you do. If you have a search results page where not all facets are visible, consider lazy loading the aggregation when the user enables the facet.

The same holds for pagination. When a user requests a second page of hits, the facets in the navigation will remain the same – after all, they’re aggregates. Therefore, you can skip the aggregations and just ask for the hits. This can make your user interface more stateful and complex, of course, but you can save a lot of CPU cycles on your backend.
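A small Python sketch of the pagination case: aggregations are only requested together with the first page, and later pages ask for hits alone. The function and parameter names are hypothetical, and the caller is assumed to keep the facet counts from the first response.

```python
def search_body(query, page, page_size=10, aggs=None):
    """Request aggregations only with the first page of hits."""
    body = {"query": query, "from": page * page_size, "size": page_size}
    if page == 0 and aggs:
        body["aggs"] = aggs
    return body


query = {"match": {"body": "query tuning"}}
facets = {"sections": {"terms": {"field": "section"}}}

first_page = search_body(query, page=0, aggs=facets)
second_page = search_body(query, page=1, aggs=facets)
```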

Aggregations are expensive. Reuse cached results or skip them entirely if possible.

Scoring and Scrolling

We mentioned above that you should filter when you can and query when you need scoring. Elasticsearch has really powerful scoring capabilities, and you can express quite intricate relevancy rules.

Scoring happens in two phases. First, there is the query phase, and then you may have rescorers that apply more detailed and expensive scoring rules to documents that survive the first round(s). Conceptually, they are a bit like the accelerator filters - we reduce the space where more computationally expensive scoring happens.
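As an illustration, the Python sketch below adds a rescore phase to a search body: a cheap match query picks candidates, and a more expensive phrase query is only run on the top window_size hits of each shard. The helper name, the field and the chosen numbers are assumptions for the example.

```python
def with_rescore(body, rescore_query, window_size=50):
    """Return a copy of body with a rescore phase over the top hits per shard."""
    rescored = dict(body)
    rescored["rescore"] = {
        "window_size": window_size,
        "query": {"rescore_query": rescore_query},
    }
    return rescored


cheap = {"query": {"match": {"body": "query tuning"}}}
# The phrase query is costlier, so it only sees the first-round survivors.
expensive = {"match": {"body": {"query": "query tuning",
                                "type": "phrase", "slop": 2}}}

search = with_rescore(cheap, expensive, window_size=100)
```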

Where Scoring Happens

Elasticsearch works hard to do as little as possible to find the top n results. Calculating the scores for hits we are not going to return anyway is just wasteful. But if you want to do really deep pagination and want e.g. hits 10 000 000 – 10 000 010, it will require a lot of expensive scoring just to show those 10 hits. It is not that uncommon to have a “Last” link in a search results paginator, which will put you in this situation. This is quite questionable UX-wise as well: “Hey, check out the worst results!”

If you really do need to scroll through huge result sets, such as when reindexing, use the scroll and scan APIs.
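The loop around a scrolled search can be sketched in Python as below. To keep it independent of any particular client library, the two HTTP calls are passed in as callables. start_scroll and continue_scroll are hypothetical names: the first is assumed to issue the initial search (for example with search_type=scan and a scroll timeout), the second to fetch the next batch for a given scroll ID.

```python
def scroll_hits(start_scroll, continue_scroll):
    """Yield every hit from a scrolled search.

    start_scroll() returns the response of the initial request.
    continue_scroll(scroll_id) returns the response of a follow-up request.
    """
    response = start_scroll()
    # With the scan search type the first response carries no hits,
    # only a scroll ID, so an empty first batch must not end the loop.
    for hit in response["hits"]["hits"]:
        yield hit
    while True:
        response = continue_scroll(response["_scroll_id"])
        hits = response["hits"]["hits"]
        if not hits:
            break  # an empty batch means the scroll is exhausted
        for hit in hits:
            yield hit
```

Each follow-up request uses the scroll ID from the most recent response, since the ID may change between batches.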

Index vs. Search Time

When you work with Elasticsearch, it is important to get your text analysis and mappings right to support the searches you need to do. For the time being, changing mappings and reindexing can be quite painful.

It is not unusual to see suboptimal searches used to work around the fact that the original mappings were not designed to support that kind of search. A common example is searching for substrings. If you have indexed "AbstractPluginFactory" as "abstractpluginfactory" (the default analyzer will lowercase terms), you cannot search for "plugin". Elasticsearch has capabilities to let you wrap wildcards around your search, i.e. "*plugin*". Do not do that.

Instead, index properly. In this case, you could use an ngram-analyzer, or a CamelCase-tokenizer.
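To make the ngram idea concrete, this Python sketch computes the terms an ngram analysis would roughly produce for the lowercased token. It is an illustration of the concept only, not the analyzer itself, and the gram sizes 3 to 8 are arbitrary.

```python
def ngrams(term, min_gram, max_gram):
    """All substrings of term with length between min_gram and max_gram."""
    return {
        term[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(term) - n + 1)
    }


indexed_terms = ngrams("abstractpluginfactory", 3, 8)
# "plugin" is now an indexed term, so a plain term lookup finds the document
# without any wildcard scanning at search time.
found = "plugin" in indexed_terms
```

The price is a larger index, which is the usual trade of index-time work for search-time speed.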

Things to Avoid

If you search the documentation for optimization, you will find the index optimization API. Unless you have an index that is no longer changing, you should probably avoid it. It’s for merging segments in an index, which you can learn more about in our article on Elasticsearch from the Bottom Up. If you run it on an index with lots of indexing activity, you will hurt performance big-time.

You probably should not _optimize.

As mentioned earlier, there are filters that can be cached, and there are filters that are not cacheable. There is a _cache option you can put on a filter to force it to be cached. Be careful with it. Enabling it at will can reduce performance: it can cause other filters to be expunged from the cache, and the cost of running the filter the first time can increase since it must now run across all documents.

Enabling _cache on all filters does not magically make things faster.

Occasionally, I see an over-complicated search where the goal is to do as much as possible in as few search requests as possible. These tend to have filters as late as possible, completely contrary to the advice in Filter First. Do not be afraid to use multiple search requests to satisfy your information need. The multi-search API lets you send a batch of search requests.
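A multi-search request body is a sequence of newline-delimited JSON lines, alternating a header line and a search body line. The Python sketch below serializes two simpler searches into that format. The index name articles and the helper name are assumptions for the example.

```python
import json


def msearch_body(searches):
    """Serialize (header, body) pairs into newline-delimited JSON."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


tag_filter = {"term": {"tag": "elasticsearch"}}
payload = msearch_body([
    # Search 1: the scored hits, filtered before scoring.
    ({"index": "articles"},
     {"query": {"filtered": {"query": {"match": {"body": "query tuning"}},
                             "filter": tag_filter}}}),
    # Search 2: counts only, with no scoring and no hits returned.
    ({"index": "articles"},
     {"size": 0,
      "query": {"filtered": {"filter": tag_filter}},
      "aggs": {"sections": {"terms": {"field": "section"}}}}),
])
```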

Do not shoehorn everything into a single search request.

Whenever you use a script for something, consider whether there are other approaches to the same problem. When you do need to resort to scripts, be careful with how you access document fields. If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents. If you use doc['field_name'] instead, the field data APIs will be used.

As covered in Index vs. Search Time, some things are better to do when indexing than when searching. If you have indexed a timestamp and need to filter by weekday, you could use a script. However, it would probably be better to just index the weekday. You can use a transform-script to do that, which is okay.
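If you would rather compute the weekday in your own indexing code, a sketch in Python could look like this. The field names and the timestamp format are assumptions. The point is only that the value is computed once per document, so that searches can use a plain, cacheable term filter.

```python
from datetime import datetime


def with_weekday(doc, timestamp_field="timestamp"):
    """Return a copy of doc with a precomputed weekday field (1 = Monday)."""
    timestamp = datetime.strptime(doc[timestamp_field], "%Y-%m-%dT%H:%M:%S")
    enriched = dict(doc)
    enriched["weekday"] = timestamp.isoweekday()
    return enriched


doc = with_weekday({"title": "Optimizing searches",
                    "timestamp": "2014-08-19T10:00:00"})

# At search time, filtering on Tuesdays no longer needs a script.
tuesday_filter = {"term": {"weekday": 2}}
```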

Avoid scripts in searches when you can.

Optimize Maturely

Optimizations do not always apply. Test and confirm. Also, is it really your bottleneck?

While we have covered several things that can improve or hurt search performance, it is important to know where your bottlenecks are.

There is no point in trying to shave milliseconds off your filters if you spend a majority of the time establishing SSL connections because you use a poor client library.

Changing the way you cache filters can improve that one search you are working on right now, but it can also cause higher filter cache churn, negatively impacting overall performance. It is important to test changes both in isolation and in the bigger picture.

There are few rules that are absolute and without exceptions when it comes to optimizing searches, so proceed judiciously.

Further Reading

This article has focused on how you can improve your searches. It has not touched sharding and partitioning strategies, nor production considerations, such as the importance of having sufficient memory. These issues and more are covered in various other articles, which may be of interest: