---
title: Task queue backlog
description: A backlogged task queue can lead to rejected requests or an unhealthy cluster state. Contributing factors can include uneven or resource constrained hardware,...
url: https://www.elastic.co/docs/troubleshoot/elasticsearch/task-queue-backlog
products:
  - Elasticsearch
applies_to:
  - Elastic Stack: Generally available
---

# Task queue backlog
A backlogged task queue can lead to [rejected requests](https://www.elastic.co/docs/troubleshoot/elasticsearch/rejected-requests) or an [unhealthy cluster state](https://www.elastic.co/docs/troubleshoot/elasticsearch/red-yellow-cluster-status). Contributing factors can include [uneven or resource-constrained hardware](/docs/troubleshoot/elasticsearch/hotspotting#causes-hardware), a large number of tasks triggered at the same time, expensive tasks that use [high CPU](https://www.elastic.co/docs/troubleshoot/elasticsearch/high-cpu-usage) or induce [high JVM memory pressure](https://www.elastic.co/docs/troubleshoot/elasticsearch/high-jvm-memory-pressure), and long-running tasks.

## Diagnose a backlogged task queue

To identify the cause of the backlog, try these diagnostic actions.
- [Check thread pool status](#diagnose-task-queue-thread-pool)
- [Inspect node hot threads](#diagnose-task-queue-hot-thread)
- [Identify long-running node tasks](#diagnose-task-queue-long-running-node-tasks)
- [Look for long-running cluster tasks](#diagnose-task-queue-long-running-cluster-tasks)
- [Monitor slow logs](#diagnose-task-slow-logs)


### Check thread pool status

Use the [cat thread pool API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) to monitor active threads, queued tasks, rejections, and completed tasks:
```json
GET /_cat/thread_pool?v=true&s=name,node_name&h=node_name,name,active,queue,rejected,completed
```

To interpret these [thread pool](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/thread-pool-settings) metrics:
- The `active` and `queue` statistics are point-in-time values.
- The `rejected` and `completed` statistics are cumulative since node start-up.
- The thread pool fills `active` until it reaches `pool_size`, then fills `queue` until it reaches `queue_size`, after which it [rejects requests](https://www.elastic.co/docs/troubleshoot/elasticsearch/rejected-requests).
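
The fill order described above can be sketched as a toy model. This is only an illustration of the accounting, not how Elasticsearch implements its thread pools: it assumes tasks never complete, and the `PoolState` and `submit` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    active: int = 0    # threads currently busy
    queue: int = 0     # tasks waiting for a thread
    rejected: int = 0  # tasks turned away (cumulative)


def submit(state: PoolState, pool_size: int, queue_size: int) -> PoolState:
    """Account for one incoming task using the fill order described above."""
    if state.active < pool_size:
        state.active += 1    # a free thread picks the task up
    elif state.queue < queue_size:
        state.queue += 1     # all threads are busy, so the task waits
    else:
        state.rejected += 1  # the queue is full, so the request is rejected
    return state


# Ten tasks arrive at a pool with 4 threads and room for 3 queued tasks.
state = PoolState()
for _ in range(10):
    submit(state, pool_size=4, queue_size=3)
print(state)  # PoolState(active=4, queue=3, rejected=3)
```

In a real cluster, tasks finish and free threads, so `queue` drains between bursts. A `queue` that stays near `queue_size` is the signal worth investigating.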

There are a number of things that you can check as potential causes for the queue backlog:
- Look for continually high `queue` metrics, which indicate long-running tasks or [CPU-expensive tasks](https://www.elastic.co/docs/troubleshoot/elasticsearch/high-cpu-usage).
- Look for bursts of elevated `queue` metrics, which indicate opportunities to spread traffic volume.
- Determine whether thread pool issues are specific to a [node role](https://www.elastic.co/docs/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles).
- Check whether a specific node is depleting faster than others within a [data tier](https://www.elastic.co/docs/manage-data/lifecycle/data-tiers). This might indicate [hot spotting](https://www.elastic.co/docs/troubleshoot/elasticsearch/hotspotting).
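
Because `rejected` is cumulative, a single poll cannot tell you whether rejections are still happening. One way to check is to compare two polls. The sketch below assumes the cat thread pool output was requested with a header row and with at least the `node_name`, `name`, and `rejected` columns, and that node names contain no spaces. The helper names and the sample data are hypothetical.

```python
def parse_cat_table(text: str) -> list[dict[str, str]]:
    """Parse whitespace-separated cat API output that starts with a header row."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def new_rejections(before: list[dict[str, str]],
                   after: list[dict[str, str]]) -> dict[tuple[str, str], int]:
    """Return {(node, pool): increase in `rejected`} between two polls."""
    earlier = {(r["node_name"], r["name"]): int(r["rejected"]) for r in before}
    deltas = {}
    for row in after:
        key = (row["node_name"], row["name"])
        delta = int(row["rejected"]) - earlier.get(key, 0)
        if delta > 0:  # ignore unchanged counters and resets after a restart
            deltas[key] = delta
    return deltas


# Hypothetical sample output from two polls, made up for illustration.
FIRST_POLL = """
node_name name   active queue rejected completed
node-1    search 13     250   12       90431
node-2    search 4      0     0        88012
"""
SECOND_POLL = """
node_name name   active queue rejected completed
node-1    search 13     310   40       90977
node-2    search 5      0     0        88590
"""

print(new_rejections(parse_cat_table(FIRST_POLL), parse_cat_table(SECOND_POLL)))
```

Here only `node-1` shows new `search` rejections, which is the kind of single-node pattern that can point to hot spotting.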


### Inspect node hot threads

If a particular thread pool queue is backed up, periodically poll the CPU-related APIs to gauge task progression against resource constraints:
- the [nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads)
  ```json
  GET /_nodes/hot_threads
  ```
- the [cat nodes API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes)
  ```json
  GET /_cat/nodes?v=true&s=cpu:desc
  ```

If `cpu` is consistently elevated or a hot thread's stack trace does not rotate over an extended period, investigate [high CPU usage](/docs/troubleshoot/elasticsearch/high-cpu-usage#check-hot-threads).
Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a [task management API response](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) to identify any overlap with specific tasks. For example, if hot threads suggest the node is spending time in `search`, filter the [task management API for search tasks](#diagnose-task-queue-long-running-node-tasks).

### Identify long-running node tasks

Long-running tasks can also cause a backlog. Use the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) to check for excessive `running_time_in_nanos` values:
```json
GET /_tasks?pretty=true&human=true&detailed=true
```

You can filter on a specific `action`, such as [bulk indexing](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) or search-related tasks. If investigating particular nodes, this API can be filtered to specific `nodes`.
- Filter on [bulk index](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) actions:
  ```json
  GET /_tasks?human&detailed&actions=indices:data/write/bulk
  ```
- Filter on search actions:
  ```json
  GET /_tasks?human&detailed&actions=indices:data/read/search
  ```

Long-running tasks might need to be [canceled](#resolve-task-queue-backlog-stuck-tasks).
Refer to [this video](https://www.youtube.com/watch?v=lzw6Wla92NY) for a walkthrough of how to troubleshoot using the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) output.
You can also check the [Tune for search speed](https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/search-speed) and [Tune for indexing speed](https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed) pages for more information.

### Look for long-running cluster tasks

Use the [cat pending tasks API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-pending-tasks) to identify delays in cluster state synchronization:
```json
GET /_cat/pending_tasks?v=true
```

Cluster state synchronization can be expected to fall behind when a [cluster is unstable](https://www.elastic.co/docs/troubleshoot/elasticsearch/troubleshooting-unstable-cluster), but otherwise this state usually indicates an unworkable [cluster setting](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-get-settings) override or traffic pattern.
There are a few common `source` issues to check for:
- `ilm-`: [index lifecycle management (ILM)](https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management) polls every `10m` by default, as determined by the [`indices.lifecycle.poll_interval`](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/index-lifecycle-management-settings) setting. Each poll starts asynchronous tasks that run as node tasks. If ILM continually reports as a cluster pending task, this setting is likely being overridden. Otherwise, the cluster likely has a misconfigured [index count relative to master heap size](/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-count-recommendation).
- `put-mapping`: Elasticsearch enables [dynamic mapping](https://www.elastic.co/docs/manage-data/data-store/mapping/dynamic-mapping) by default. This, or the [update mapping API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-mapping), triggers a mapping update. In this case, the corresponding cluster log will contain an `update_mapping` entry with the name of the affected index.
- `shard-started`: Indicates [active shard recoveries](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-recovery). Overriding [`cluster.routing.allocation.*` settings](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings#cluster-shard-allocation-settings) can cause pending tasks and recoveries to back up.
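
To see which of these `source` values dominates the pending queue, you can tally the cat pending tasks output by prefix. The following is a minimal sketch that assumes the output includes a header row and that `source` is the fourth and last column. The function name and the sample rows are hypothetical.

```python
from collections import Counter

# Source prefixes discussed in the list above.
KNOWN_SOURCES = ("ilm-", "put-mapping", "shard-started")


def count_pending_by_source(text: str) -> Counter:
    """Count pending cluster tasks per known source prefix."""
    counts = Counter()
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        parts = line.split(None, 3)  # the source column may contain spaces
        if len(parts) < 4:
            continue
        source = parts[3]
        label = next((p for p in KNOWN_SOURCES if source.startswith(p)), "other")
        counts[label] += 1
    return counts


# Hypothetical sample output, made up for illustration.
SAMPLE = """
insertOrder timeInQueue priority source
1685        855ms       HIGH     put-mapping [my-index/abc123]
1686        843ms       HIGH     put-mapping [my-index/abc123]
1693        753ms       URGENT   shard-started [my-index][0]
1700        10ms        NORMAL   ilm-execute-cluster-state-steps [my-index]
1701        5ms         NORMAL   cluster_reroute(api)
"""

for label, count in count_pending_by_source(SAMPLE).most_common():
    print(label, count)
```

A tally that is consistently dominated by one prefix across several polls narrows down which of the checks above to start with.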


### Monitor slow logs

If you can't investigate backlogged tasks while an incident is in progress, consider enabling [slow logs](https://www.elastic.co/docs/reference/elasticsearch/index-settings/slow-log) so that you can review them later.
For example, you can analyze slow search logs with the [search profiler](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-profile) to find and optimize time-consuming requests.

## Recommendations

As noted earlier, task backlogs are frequently caused by:
- a traffic volume spike
- [expensive tasks](#diagnose-task-queue-hot-thread) that are causing [high CPU](https://www.elastic.co/docs/troubleshoot/elasticsearch/high-cpu-usage)
- [long-running tasks](#diagnose-task-queue-long-running-node-tasks)
- [hot spotting](https://www.elastic.co/docs/troubleshoot/elasticsearch/hotspotting), particularly from [uneven or resource-constrained hardware](/docs/troubleshoot/elasticsearch/hotspotting#causes-hardware)

Many of these can be investigated in isolation as unintended changes in traffic pattern or configuration. Refer to the following recommendations to address recurring or long-standing symptoms.

### Address CPU-intensive tasks

If an individual task is causing a [thread pool `queue`](#diagnose-task-queue-thread-pool) to back up due to [high CPU usage](https://www.elastic.co/docs/troubleshoot/elasticsearch/high-cpu-usage), try [canceling the task](#resolve-task-queue-backlog-stuck-tasks) and then optimizing it before retrying.
This problem can surface due to a number of possible causes:
- Creating new tasks or modifying scheduled tasks which either run frequently or are broad in their effect, such as [index lifecycle management](https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management) policies or [rules](https://www.elastic.co/docs/explore-analyze/alerts-cases)
- Performing traffic load testing
- Doing extended look-backs, especially across [data tiers](https://www.elastic.co/docs/manage-data/lifecycle/data-tiers)
- [Searching](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-search) or performing [bulk updates](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) to a high number of indices in a single request


### Cancel stuck tasks

If an active task’s [hot thread](#diagnose-task-queue-hot-thread) shows no progress, consider [canceling the task](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks#task-cancellation) if it's flagged as `cancellable`.
If you consistently encounter `cancellable` tasks running longer than expected, consider:
- setting a [`search.default_search_timeout`](/docs/solutions/search/the-search-api#search-timeout)
- ensuring [scroll requests are cleared](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/paginate-search-results#clear-scroll) in a timely manner

For example, you can use the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-tasks-list) to identify and cancel searches that consume excessive CPU time.
```json
GET _tasks?actions=*search&detailed
```

The response `description` contains the search request and its queries. The `running_time_in_nanos` field shows how long the search has been running.
```json
{
  "nodes" : {
    "oTUltX4IQMOUUVeiohTt8A" : {
      "name" : "my-node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "tasks" : {
        "oTUltX4IQMOUUVeiohTt8A:464" : {
          "node" : "oTUltX4IQMOUUVeiohTt8A",
          "id" : 464,
          "type" : "transport",
          "action" : "indices:data/read/search",
          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
          "start_time_in_millis" : 4081771730000,
          "running_time_in_nanos" : 13991383,
          "cancellable" : true
        }
      }
    }
  }
}
```

To cancel this example search to free up resources, you would run:
```json
POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
```
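
If you review task lists often, the identify-then-cancel steps can be scripted. The sketch below works on a parsed task management API response shaped like the example above (trimmed here to the fields it reads). It only builds the cancel request paths and sends nothing. The `long_running_tasks` and `cancel_path` helpers and the `0.01` second threshold are hypothetical, so choose a threshold that fits your workload.

```python
def long_running_tasks(response: dict, min_seconds: float) -> list[tuple[str, float]]:
    """Return (task_id, running_seconds) for cancellable tasks, longest first."""
    found = []
    for node in response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task["running_time_in_nanos"] / 1e9
            if task.get("cancellable") and seconds >= min_seconds:
                found.append((task_id, seconds))
    return sorted(found, key=lambda item: item[1], reverse=True)


def cancel_path(task_id: str) -> str:
    """Build the request path that cancels a single task."""
    return f"/_tasks/{task_id}/_cancel"


# Trimmed copy of the example response shown above.
RESPONSE = {
    "nodes": {
        "oTUltX4IQMOUUVeiohTt8A": {
            "name": "my-node",
            "tasks": {
                "oTUltX4IQMOUUVeiohTt8A:464": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 13991383,
                    "cancellable": True,
                }
            },
        }
    }
}

for task_id, seconds in long_running_tasks(RESPONSE, min_seconds=0.01):
    print("POST", cancel_path(task_id), f"({seconds:.3f}s)")
```

Printing the paths instead of sending them keeps a person in the loop, which is a reasonable default because canceling a task discards its work.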

For additional tips on how to track and avoid resource-intensive searches, refer to [Avoid expensive searches](/docs/troubleshoot/elasticsearch/high-jvm-memory-pressure#reduce-jvm-memory-pressure-setup-searches).

### Address hot spotting

If a specific node’s thread pool is depleting faster than its [data tier](https://www.elastic.co/docs/manage-data/lifecycle/data-tiers) peers, try addressing uneven node resource utilization, also known as "hot spotting". For details about reparative actions you can take, such as rebalancing shards, refer to the [Hot spotting](https://www.elastic.co/docs/troubleshoot/elasticsearch/hotspotting) troubleshooting documentation.

### Increase available resources

By default, Elasticsearch allocates processors equal to the number reported available by the operating system. You can override this behavior by adjusting the value of [`node.processors`](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/thread-pool-settings#node.processors), but this advanced setting should be configured only after you've performed load testing.
In some cases, you might need to increase the problematic thread pool `size`. For example, it might help to increase a stuck [`force_merge` thread pool](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/thread-pool-settings). If the setting is automatically calculated to `1` based on available CPU processors, you can increase the value to `2` in `elasticsearch.yml`, for example:
```yaml
thread_pool.force_merge.size: 2
```