Advanced tuning: finding and fixing slow Elasticsearch queries

Elasticsearch is a very flexible and feature-rich application that provides many different ways to query your data. But have you ever experienced query speeds that were less than you hoped for? With a distributed system like Elasticsearch, many factors can contribute to query performance, including external ones such as load-balancer settings and network latency (bandwidth, NICs, drivers).

In this blog post, I will discuss what can cause slow queries and how to identify them in Elasticsearch. It is based on some of our common troubleshooting methods, some of which require you to be fairly familiar with how Elasticsearch works.

Common causes of slow Elasticsearch queries

Before we look into some of the trickier cases, let's start off with some of the most common causes of slow queries and their solutions.

Symptom: High resource utilization when inactive

Every shard consumes resources (CPU/memory). Even with zero indexing and search requests, the mere existence of shards adds overhead to the cluster.

Problem

The cluster has too many shards, to the point where any query may seem slow to execute. A good rule of thumb is to keep the number of non-frozen shards per node below 20 per GB of configured heap. For example, a node with a 30GB heap should hold fewer than 600 non-frozen shards.

Solution

Reduce shard count, freeze indices, and/or add additional nodes to cope with the load. Consider hot/warm architecture (works very well for time-based indices) together with the rollover/shrink feature in Elasticsearch to efficiently manage shard count. A good start to any deployment would be to perform proper capacity planning to help determine the optimal number of shards for each search use case.
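
As a quick sanity check, the _cat APIs show how many shards each node currently holds and how much heap each node has configured, so the rule of thumb above can be compared against reality. A minimal sketch, with the output columns trimmed for readability:

GET /_cat/allocation?v&h=node,shards,disk.percent

GET /_cat/nodes?v&h=name,node.role,heap.max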

Symptom: High thread pool rejected count

The search thread pool is showing an ongoing increase in its "rejected" count, which is cumulative since the last cluster restart.

GET /_cat/thread_pool/search?v&h=node_name,name,active,rejected,completed

The response looks like this:

node_name            name   active rejected completed
instance-0000000001  search      0       10         0
instance-0000000002  search      0       20         0
instance-0000000003  search      0       30         0

Problem

Queries are targeting too many shards, exceeding the number of cores in the cluster. This creates queued tasks in the search thread pool, leading to search rejections. Another common cause is slow disk I/O causing searches to queue up, or in some cases completely saturated CPUs.

Solution

Adopt a 1 primary : 1 replica model when creating indices. An index template is a good way to roll out this setting at index creation time (1P:1R is the default in Elasticsearch 7.0 and later). Elasticsearch 5.1 and later supports search task cancellation, which can be useful when the slow query shows up in the Task Management API. To improve disk I/O, check out our storage recommendations and be sure to use the recommended hardware for maximum performance.
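
For example, an index template along these lines rolls out the 1P:1R setting to newly created indices. This is a minimal sketch; the template name and the my-logs-* index pattern are placeholders:

PUT /_template/one-primary-one-replica
{
  "index_patterns": ["my-logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}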

Symptom: High CPU and indexing latency

Metrics correlation shows high CPU utilization and indexing latency when the cluster is overwhelmed.

Problem

Cluster is heavily indexing, which affects search performance.

Solution

Increasing index.refresh_interval (the amount of time between when a document gets indexed and when it becomes visible to search) to a value like 30s generally helps improve indexing performance. Actual mileage may vary, so testing is key. This ensures the shards don't have to work as hard by creating a new segment every 1 second (the default).
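
A minimal sketch of applying this, assuming a hypothetical index named my-index:

PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}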

For a heavy indexing use case, check out our index tuning recommendations to optimise both index and search performance.

Symptom: Increased latency with more replica shards

Increased query latency can be observed after increasing the replica shard count (e.g., from 1 to 2). If more data is present, then cached data will get evicted sooner, leading to an increase in operating system page faults.

Problem

The filesystem cache doesn't have enough memory to cache frequently queried parts of the index. Elasticsearch's query cache implements an LRU eviction policy: when the cache becomes full, the least recently used data is evicted to make way for new data.

Solution

Set aside at least 50% of the physical RAM for the filesystem cache. The more memory, the more that can be cached, especially if the cluster is experiencing I/O issues. Assuming heap size has been properly configured, any remaining physical RAM that can be used for the filesystem cache goes a long way toward speeding up search performance.

For example, a server with 128GB of RAM might have 30GB set aside for heap and the remaining memory left to the filesystem cache (sometimes called the OS cache). The operating system uses it to cache recently accessed 4KB blocks of data, so if the same files are read again and again, most read requests are served directly from memory instead of going to disk.
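
As a rough sketch of that split on such a server, the heap is pinned in jvm.options (or the equivalent setting for your deployment), and everything left over remains available to the operating system for the filesystem cache:

-Xms30g
-Xmx30g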

Besides the filesystem cache, Elasticsearch uses the query cache and the request cache to speed up searches. All of these caches can be optimised by using a search request preference to route certain search requests to the same set of shards every time, instead of alternating between the different copies available. This makes better use of the request cache, node query cache, and filesystem cache.
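
A minimal sketch of that routing, assuming a hypothetical index my-index and using an arbitrary session string as the preference value so that repeated searches hit the same shard copies:

GET /my-index/_search?preference=user-session-123
{
  "query": {
    "match_all": {}
  }
}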

Symptom: High utilization when sharing resources

The operating system shows consistently high CPU / disk I/O usage. A performance gain can be seen after stopping third-party applications.

Problem

There is resource (CPU and/or Disk I/O) contention between other processes (e.g., Logstash) and Elasticsearch itself.

Solution

Avoid running Elasticsearch together with other resource-intensive applications on shared hardware.

Symptom: High heap usage aggregating highly unique fields

Poor performance when running aggregations on fields that contain highly unique values (e.g., IDs, usernames, email addresses, etc.). During heap dump analysis, you should see Java objects with terms like "search", "buckets", and "aggregation" consuming much of the heap.

Problem

Aggregations are running on high-cardinality fields requiring lots of resources to fetch many buckets. There can also be nested aggregations involving nested fields and/or join fields.
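
For illustration, the problematic pattern usually looks like a terms aggregation over a field that is close to unique per document; the index and field names in this sketch are hypothetical:

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "terms": {
        "field": "user_id",
        "size": 10000
      }
    }
  }
}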

Solution

To improve the performance of high-cardinality terms aggregations, have a read of this blog post by my colleague from the consulting team: https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch

For further tuning, check out our recommendations on nested fields and join fields to better improve your aggregation performance.

Occasional slow queries

Generally speaking, occasional or intermittent slow queries can benefit from some of the tuning for index / tuning for search recommendations. Occasional slow queries should be closely correlated with one or more monitoring metrics (e.g., CPU, memory, and disk I/O).

Elasticsearch has another useful feature called adaptive replica selection (ARS) that makes the coordinating node aware of the load on data nodes and lets it choose the best shard copies for executing a search, improving both search throughput and latency. ARS can be a big help for occasional slowdowns by more evenly spreading the load during query time. In Elasticsearch 7.0 and later, ARS will be turned on by default.
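
On versions before 7.0, where ARS is not yet on by default, it can be enabled with a dynamic cluster setting. A minimal sketch:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}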

Consistent slow queries

For consistent slow queries, we can try removing features from the query one by one and check whether the query is still slow. Finding the simplest query that reproduces the performance issue helps to isolate and identify the problem:

  • Is it still slow without highlighting?
  • Is it still slow without aggregations?
  • Is it still slow if size is set to 0? (when size is set to 0, Elasticsearch caches the results of search requests to make search faster)

Can it benefit from some of the "tuning for search" recommendations?

During troubleshooting, it is often useful to:

  • Get a query response with profile turned on (see the sketch after this list).
  • Gather nodes hot threads output with the query running in a while(true) loop. This will help understand where CPU time is spent.
  • Profile the query using this more user-friendly version of the profile API.
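
A minimal sketch of a profiled search request against a hypothetical index my-index; the response includes a per-shard breakdown of where query time was spent:

GET /my-index/_search
{
  "profile": true,
  "query": {
    "match": {
      "message": "error"
    }
  }
}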

If a query comes from a Kibana visualization, use the visualization spy panel (Kibana version 6.3 and earlier) or dashboard inspect panel (Kibana version 6.4 and later) to view and export the actual query request and import it into the profile API for further analysis.

Catching slow or expensive queries

Sometimes it can be hard to catch slow or expensive queries with different requests/threads being processed concurrently within a distributed application like Elasticsearch. Things get more complicated when there is no control over users running expensive queries that slow down cluster performance (e.g., long garbage collection (GC) cycles) or, worse, cause an out-of-memory (OOM) situation.

In Elasticsearch 7.0, we introduced a new circuit-breaking strategy that measures real heap memory usage at the time memory is being reserved. This new strategy improves node resiliency against expensive queries that overload the cluster; it is turned on by default and can be controlled with the new cluster setting indices.breaker.total.use_real_memory.

However, we should note that these are best efforts; for scenarios that are not covered by them, it is useful to collect a heap dump after an OOM crash, or a heap dump from a running JVM, to better understand the root cause.

Elasticsearch has another protection setting (the max bucket soft limit) to safeguard the cluster against OOM. This max bucket aggregation setting stops execution and fails the search request when the number of buckets (which defaults to 10,000 in version 7.0) is exceeded (e.g., when running multiple layers of aggregations).
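
The limit is exposed as the dynamic cluster setting search.max_buckets; a minimal sketch of adjusting it, where 20000 is purely an illustrative value:

PUT /_cluster/settings
{
  "transient": {
    "search.max_buckets": 20000
  }
}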

To further identify potentially expensive queries, we can set the request circuit breaker (indices.breaker.request.limit) to a low threshold to isolate queries, and gradually raise the threshold to narrow down to specific ones.
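
A minimal sketch of temporarily lowering the request breaker with a transient cluster setting (40% is an illustrative value; the default is 60% of the JVM heap):

PUT /_cluster/settings
{
  "transient": {
    "indices.breaker.request.limit": "40%"
  }
}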

Slowlogs

Slow-running queries can also be identified by turning on slowlogs in Elasticsearch. Slowlogs work at the shard level, which means they apply only to data nodes. Coordinating-only/client nodes are excluded, as they do not hold data (indices/shards).
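
Slowlog thresholds are dynamic index-level settings. A minimal sketch for a hypothetical index my-index, with illustrative thresholds that should be tuned to your own latency expectations:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms"
}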

Slowlogs help to answer questions like:

  • How long did the query take?
  • What was the content of the query request body?

Sample slowlog output:

[2019-02-11T16:47:39,882][TRACE][index.search.slowlog.query] [2g1yKIZ] [logstash-20190211][4] took[10.4s], took_millis[10459], total_hits[16160], types[], stats[], 
search_type[QUERY_THEN_FETCH], total_shards[10], source[{"size":0,"query":{"bool":{"must":[{"range":{"timestamp":{"from":1549266459837,"to":1549871259837,"include_lower":true,
"include_upper":true,"format":"epoch_millis","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":[],"excludes":[]},"stored_fields":"*","docvalue_fields":
[{"field":"timestamp","format":"date_time"},{"field":"utc_time","format":"date_time"}],"script_fields":{"hour_of_day":{"script":{"source":"doc['timestamp'].value.getHourOfDay()",
"lang":"painless"},"ignore_failure":false}},"aggregations":{"maxAgg":{"max":{"field":"bytes"}},"minAgg":{"min":{"field":"bytes"}}}}], id[]],

Slowlog message breakdown:

  • [2019-02-11T16:47:39,882]: Query date
  • [TRACE]: Log level
  • [index.search.slowlog.query]: Query phase of the search slowlog
  • [2g1yKIZ]: Node name
  • [logstash-20190211]: Index name
  • [4]: Shard number the query executed on
  • took[10.4s]: Processing time taken on shard [4]. Note: when looking at slowlogs, we want to refrain from adding up all the times from different shards, as each shard may be executing in parallel.
  • took_millis[10459]: Time taken in milliseconds
  • total_hits[16160]: Total hits
  • search_type[QUERY_THEN_FETCH]: Search type
  • total_shards[10]: Total shards of the index
  • source[]: The query body that was executed

Audit logs

Customers with a Gold or Platinum subscription, which includes Elastic security features, can turn on audit logs to capture more details about the query. (Users can start a 30-day trial to test drive Elastic security features.) Audit logging helps to answer questions like:

  • When did the query happen?
  • Who executed the query?
  • What’s the content of the query?

We need to tune the audit settings, as the default ones are fairly chatty:

  1. Enable security audit log: Set xpack.security.audit.enabled: true in your elasticsearch.yml.
  2. Enable log or index in security audit outputs: Set xpack.security.audit.outputs:[logfile, index] in your elasticsearch.yml.

    Notes:
    • The xpack.security.audit.outputs setting only applies to version 6.0-6.2 and 5.x. Version 7.0 no longer accepts this setting and defaults to json output (<clustername>_audit.json) when xpack.security.audit.enabled is set to true.
    • For the purpose of troubleshooting, we recommend choosing logfile over index as the verbosity of audit logging may introduce unwanted stress to the cluster performance where the security index grows beyond its intended size.
    • Audit mode can be very verbose, so consider turning it off once troubleshooting is done.
  3. Include authentication_success access in your events list: Set xpack.security.audit.logfile.events.include: authentication_success in your elasticsearch.yml.

    Notes:
    • This setting is not included in the default events. Setting this will overwrite the default ones.
    • If you need to add one more event (not replace), please write the existing default event list first and then add the above setting after the last entry.
  4. Output the request body in the audit events: Set xpack.security.audit.logfile.events.emit_request_body: true in your elasticsearch.yml. (A combined sketch of these settings appears below.)
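
Putting these together, the relevant lines in elasticsearch.yml would look roughly like this sketch on a 7.0 cluster (which writes the audit trail to a JSON log file by default; on 6.x you would also set xpack.security.audit.outputs as described above):

xpack.security.audit.enabled: true
xpack.security.audit.logfile.events.include: authentication_success
xpack.security.audit.logfile.events.emit_request_body: true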

With these settings, you can trace a user query like the following example:

  • User: louisong
  • Querying time: 2019-05-01T19:26:53,554 (UTC)
  • Query endpoint: _msearch (usually means it is a query issued from Kibana Visualization/Dashboard)
  • Query body: starts from "request.body": in the following log:

    {"@timestamp":"2019-05-01T19:26:53,554", "node.id":"Z1z_64sIRcy4dW2eqyqzMw", "event.type":"rest", "event.action":"authentication_success", "user.name":"louisong", "origin.type":"rest", "origin.address":"127.0.0.1:51426", "realm":"default_native", "url.path":"/_msearch", "url.query":"rest_total_hits_as_int=true&ignore_throttled=true", "request.method":"POST", "request.body":"{\"index\":\"*\",\"ignore_unavailable\":true,\"preference\":1556709820160}\n{\"aggs\":{\"2\":{\"terms\":{\"field\":\"actions\",\"size\":5,\"order\":{\"_count\":\"desc\"},\"missing\":\"__missing__\"}}},\"size\":0,\"_source\":{\"excludes\":[]},\"stored_fields\":[\"*\"],\"script_fields\":{},\"docvalue_fields\":[{\"field\":\"access_token.user_token.expiration_time\",\"format\":\"date_time\"},{\"field\":\"canvas-workpad.@created\",\"format\":\"date_time\"},{\"field\":\"canvas-workpad.@timestamp\",\"format\":\"date_time\"},{\"field\":\"creation_time\",\"format\":\"date_time\"},{\"field\":\"expiration_time\",\"format\":\"date_time\"},{\"field\":\"maps-telemetry.timeCaptured\",\"format\":\"date_time\"},{\"field\":\"task.runAt\",\"format\":\"date_time\"},{\"field\":\"task.scheduledAt\",\"format\":\"date_time\"},{\"field\":\"updated_at\",\"format\":\"date_time\"},{\"field\":\"url.accessDate\",\"format\":\"date_time\"},{\"field\":\"url.createDate\",\"format\":\"date_time\"}],\"query\":{\"bool\":{\"must\":[{\"range\":{\"canvas-workpad.@timestamp\":{\"format\":\"strict_date_optional_time\",\"gte\":\"2019-05-01T11:11:53.498Z\",\"lte\":\"2019-05-01T11:26:53.498Z\"}}}],\"filter\":[{\"match_all\":{}},{\"match_all\":{}}],\"should\":[],\"must_not\":[]}},\"timeout\":\"30000ms\"}\n", "request.id":"qrktsPxyST2nVh29GG7tog"}

Conclusion

In this article, we talked about the common causes of slow queries and solutions to address them. We also discussed the different methods to identify consistent and occasional slow queries. It is common to see slow queries as a symptom of a wider cluster performance issue that needs to be addressed.

At Elastic, we are constantly seeking to improve query times and working to bring faster query performance in future releases. I hope you find this article useful when dealing with slow queries. Feel free to post questions in our Elasticsearch discuss forum for further discussion, or watch these two stories and discover how you can optimize Elasticsearch's performance.