---
title: Hot spotting
description: Computer hot spotting may occur in Elasticsearch when resource utilizations are unevenly distributed across nodes. Temporary spikes are not usually considered...
url: https://www.elastic.co/docs/troubleshoot/elasticsearch/hotspotting
products:
  - Elasticsearch
applies_to:
  - Elastic Stack: Generally available
---

# Hot spotting
Computer [hot spotting](https://en.wikipedia.org/wiki/Hot_spot_(computer_programming)) may occur in Elasticsearch when resource utilization is unevenly distributed across [nodes](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/node-settings). Temporary spikes are not usually considered problematic, but sustained, significantly uneven utilization may lead to cluster bottlenecks and should be reviewed.
Watch [this video](https://www.youtube.com/watch?v=Q5ODJ5nIKAM) for a walkthrough of troubleshooting a hot spotting issue.
<admonition title="Simplify monitoring with AutoOps">
  AutoOps is a monitoring tool that simplifies cluster management through performance recommendations, resource utilization visibility, and real-time issue detection with resolution paths. Learn more about [AutoOps](https://www.elastic.co/docs/deploy-manage/monitor/autoops).
</admonition>


## Detect hot spotting

To check for hot spotting, you can investigate both active utilization levels and historical node statistics.

### Active

Hot spotting most commonly surfaces as significantly elevated resource utilization (of `disk.percent`, `heap.percent`, or `cpu`) among a subset of nodes as reported via [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes). Individual spikes aren’t necessarily problematic, but if utilization repeatedly spikes or consistently remains high over time (for example, longer than 30 seconds), the resource may be experiencing problematic hot spotting.
For example, let’s showcase two separate plausible issues using cat nodes:
```json
GET _cat/nodes?v&s=master,name&h=name,master,node.role,heap.percent,disk.used_percent,cpu
```

Assume the same output was returned twice, five minutes apart:
```json
name   master node.role heap.percent disk.used_percent cpu
node_1 *      hirstm              24                20  95
node_2 -      hirstm              23                18  18
node_3 -      hirstmv             25                90  10
```

Here we see two significantly uneven utilizations: the master node is at `cpu: 95` and a hot node is at `disk.used_percent: 90`. This indicates that hot spotting is occurring on these two nodes, and not necessarily from the same root cause.
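A comparison like this can also be scripted. The following is a minimal Python sketch, assuming the cat nodes rows have already been parsed into dictionaries. The `find_outliers` helper, the 2x-median ratio, and the 50% floor are illustrative assumptions, not Elastic recommendations.

```python
from statistics import median

# Rows mirror the cat nodes example above; all values are percentages.
rows = [
    {"name": "node_1", "heap.percent": 24, "disk.used_percent": 20, "cpu": 95},
    {"name": "node_2", "heap.percent": 23, "disk.used_percent": 18, "cpu": 18},
    {"name": "node_3", "heap.percent": 25, "disk.used_percent": 90, "cpu": 10},
]

def find_outliers(rows, metrics, ratio=2.0, floor=50):
    """Return (node, metric, value) tuples where the value is at least `floor`
    and more than `ratio` times the median of that metric across all nodes.
    Both thresholds are assumptions chosen for illustration."""
    flagged = []
    for metric in metrics:
        mid = median(float(r[metric]) for r in rows)
        for r in rows:
            value = float(r[metric])  # cat APIs may return numbers as strings
            if value >= floor and value > ratio * mid:
                flagged.append((r["name"], metric, value))
    return flagged

outliers = find_outliers(rows, ["heap.percent", "disk.used_percent", "cpu"])
for name, metric, value in outliers:
    print(f"{name}: {metric}={value:g}")
```

Running the check against two polls taken a few minutes apart, and keeping only nodes flagged in both, helps separate sustained hot spotting from a temporary spike.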

### Historical

A second way to notice hot spotting as it builds up is to poll the [node statistics API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-stats) for index-related performance metrics.
```json
GET _nodes/stats/indices
```

This request returns node operational metrics such as `query`, `refresh`, and `index`. It allows you to gauge:
- the total events attempted per node
- the node's average processing time per event type

These metrics accumulate during the uptime for each individual node. To help view the output, you can parse the response using a third-party tool such as [JQ](https://jqlang.github.io/jq/):
```bash
cat nodes_stats.json | jq -rc '.nodes[]|.name as $n|.roles as $r|.indices|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{node:$n, roles:$r, metric:$m, total:.total, avg_millis:(.total_time_in_millis?/.total|round)}'
```
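If jq is not available, the same summary can be sketched in Python. This is a minimal sketch that mirrors the filter above. The `summarize_node_stats` helper and the trimmed `sample` response are hypothetical and exist only to illustrate the calculation.

```python
def summarize_node_stats(stats):
    """Mirror the jq filter: one row per node and metric group that exposes
    both `total` and `total_time_in_millis`, with a positive total."""
    rows = []
    for node in stats["nodes"].values():
        for metric, values in node.get("indices", {}).items():
            if not isinstance(values, dict):
                continue
            total = values.get("total")
            millis = values.get("total_time_in_millis")
            if total is None or millis is None or total <= 0:
                continue
            rows.append({
                "node": node["name"],
                "roles": node.get("roles", []),
                "metric": metric,
                "total": total,
                "avg_millis": round(millis / total),
            })
    return rows

# Hypothetical, heavily trimmed response used only to exercise the function.
sample = {
    "nodes": {
        "abc123": {
            "name": "node_1",
            "roles": ["data_hot"],
            "indices": {
                "docs": {"count": 10, "deleted": 0},
                "refresh": {"total": 200, "total_time_in_millis": 5000},
                "merges": {"total": 10, "total_time_in_millis": 1234},
                "flush": {"total": 0, "total_time_in_millis": 0},
            },
        }
    }
}

summary = summarize_node_stats(sample)
for row in summary:
    print(row)
```

Python's `round` rounds exact halves to the nearest even integer while jq rounds them away from zero, so the two outputs can differ by one millisecond in rare cases.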

If the results indicate that multiple major operations are slow across nodes, the cluster is likely under-provisioned. If a particular operation type or node stands out, it likely indicates [shard distribution issues](#causes-shards), which you might compare against [indices stats](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-stats).
```json
GET _stats?level=shards
```

These metrics accumulate over the lifetime of each individual shard. You can parse this response with a tool like JQ to compare with earlier performance:
```bash
cat indices_stats.json | jq -rc '.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $sh|.value[]|.routing.primary as $p|.routing.node[:4] as $n|to_entries[]|.key as $m|.value|objects|select(.total and .total_time_in_millis)|select(.total>0)|{index:$i, shard:$sh, primary:$p, node:$n, metric:$m, total:.total, avg_millis:(.total_time_in_millis/.total|round)}'
```


## Causes

Historically, clusters experience hot spotting mainly as an effect of hardware, shard distribution, or task load. From a user’s perspective, hot spotting may manifest as slower search responses, delayed indexing, ingestion backlogs, or timeouts during queries and bulk operations. We review these causes in order of their potential scope of impact.

### Hardware

Here are some common improper hardware setups that might contribute to hot spotting:
- Resources are allocated non-uniformly. For example, one hot node is given half the CPU of its peers. Elasticsearch expects all nodes on a [data tier](https://www.elastic.co/docs/manage-data/lifecycle/data-tiers) to share the same hardware profiles or specifications. To check this, use the [cat nodes API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes):
  ```json
  ```
- Resources are consumed by another service on the host, including other Elasticsearch nodes. Refer to our [dedicated host](/docs/deploy-manage/deploy/self-managed/installing-elasticsearch#dedicated-host) recommendation.
- Resources experience different network or disk throughputs. For example, one node’s I/O is lower than that of its peers. Refer to [Use faster hardware](/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed#indexing-use-faster-hardware) for more information.
- A JVM that has been configured with a heap larger than 31GB. Refer to [Set the JVM heap size](https://www.elastic.co/docs/reference/elasticsearch/jvm-settings#set-jvm-heap-size) for more information.
- Only the problematic nodes report [memory swapping](https://www.elastic.co/docs/deploy-manage/deploy/self-managed/setup-configuration-memory).


### Shard distributions

Elasticsearch indices are divided into one or more [shards](https://en.wikipedia.org/wiki/Shard_(database_architecture)) which can sometimes be poorly distributed. Elasticsearch accounts for this by [balancing shard counts](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings) across data nodes. As [introduced in version 8.6](https://www.elastic.co/blog/whats-new-elasticsearch-kibana-cloud-8-6-0), Elasticsearch by default also enables [desired balancing](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings) to account for ingest load. A node may still experience hot spotting, either because of write-heavy indices or because of the overall set of shards it hosts.

#### Node level

You can check shard balancing via [cat allocation](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-allocation), though as of version 8.6, [desired balancing](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings) may no longer aim for strictly equal shard counts across nodes. Note that both methods may temporarily show a problematic imbalance during [cluster stability issues](https://www.elastic.co/docs/deploy-manage/distributed-architecture/discovery-cluster-formation/cluster-fault-detection).
For example, let’s showcase two separate plausible issues using cat allocation:
```json
GET _cat/allocation?v&s=node&h=node,shards,disk.percent,disk.indices,disk.used
```

Which could return:
```json
node   shards disk.percent disk.indices disk.used
node_1    446           19      154.8gb   173.1gb
node_2     31           52       44.6gb   372.7gb
node_3    445           43      271.5gb   289.4gb
```

Here we see two distinct situations. First, `node_2` has recently restarted, so it has a much lower number of shards than all other nodes. This also explains why `disk.indices` is much smaller than `disk.used` while shards are recovering, as seen via [cat recovery](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-recovery). While `node_2`'s shard count is low, it may become a write hot spot due to ongoing [ILM rollovers](https://www.elastic.co/docs/reference/elasticsearch/index-lifecycle-actions/ilm-rollover). This is a common root cause of write hot spots covered in the next section.
The second situation is that `node_3` has a higher `disk.percent` than `node_1`, even though they hold roughly the same number of shards. This occurs when either shards are not evenly sized (refer to [Aim for shards of up to 200M documents, or with sizes between 10GB and 50GB](/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-size-recommendation)) or when there are a lot of empty indices.
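Both observations can be derived from the cat allocation rows with a little arithmetic. The following minimal Python sketch uses the example values above. The `to_bytes` and `summarize_allocation` helpers are hypothetical names, and the sketch assumes sizes use binary units such as `gb`.

```python
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4, "pb": 1024**5}

def to_bytes(size):
    """Convert a cat-style size such as '154.8gb' into bytes."""
    text = size.strip().lower()
    # Try two-letter suffixes before the bare 'b' suffix.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognized size: {size!r}")

rows = [
    {"node": "node_1", "shards": 446, "disk.indices": "154.8gb", "disk.used": "173.1gb"},
    {"node": "node_2", "shards": 31, "disk.indices": "44.6gb", "disk.used": "372.7gb"},
    {"node": "node_3", "shards": 445, "disk.indices": "271.5gb", "disk.used": "289.4gb"},
]

def summarize_allocation(rows):
    """Per node: average shard size and disk space not attributed to indices, in GB."""
    gb = UNITS["gb"]
    summary = {}
    for r in rows:
        indices_gb = to_bytes(r["disk.indices"]) / gb
        used_gb = to_bytes(r["disk.used"]) / gb
        summary[r["node"]] = {
            "avg_shard_gb": round(indices_gb / int(r["shards"]), 2),
            "non_index_gb": round(used_gb - indices_gb, 1),
        }
    return summary

allocation = summarize_allocation(rows)
for node, values in allocation.items():
    print(node, values)
```

In this example, `node_2` stands out for the large amount of disk space not yet attributed to indices, and `node_3` stands out for an average shard size nearly double that of `node_1`.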
Cluster rebalancing based on desired balance does much of the heavy lifting of keeping nodes from hot spotting. It can be limited either by nodes hitting [watermarks](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings#disk-based-shard-allocation) (refer to [fixing disk watermark errors](https://www.elastic.co/docs/troubleshoot/elasticsearch/fix-watermark-errors)) or by a write-heavy index having far fewer total shards than there are nodes being written to.
You can confirm hot spotted nodes via [the nodes stats API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-stats). Consider polling twice and comparing the difference between the two responses, because a single poll returns stats accumulated over the node’s full [node uptime](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-usage). For example, to check all nodes' indexing stats:
```json
GET _nodes/stats?human&filter_path=nodes.*.name,nodes.*.indices.indexing
```
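Comparing two polls can be sketched in Python as follows. This is a minimal sketch, assuming two nodes stats responses were captured some time apart and loaded as dictionaries. The `indexing_delta` helper and the sample values are hypothetical. `index_total` and `index_time_in_millis` are the indexing counters reported under `indices.indexing`.

```python
def indexing_delta(before, after):
    """Per-node change in indexing counters between two nodes stats responses."""
    deltas = {}
    for node_id, node in after["nodes"].items():
        previous = before["nodes"].get(node_id)
        if previous is None:
            continue  # node joined between polls, so there is no baseline
        current_ix = node["indices"]["indexing"]
        previous_ix = previous["indices"]["indexing"]
        ops = current_ix["index_total"] - previous_ix["index_total"]
        millis = current_ix["index_time_in_millis"] - previous_ix["index_time_in_millis"]
        if ops < 0:
            continue  # counters went backwards, which suggests a node restart
        deltas[node["name"]] = {
            "ops": ops,
            "avg_millis": round(millis / ops, 2) if ops else 0.0,
        }
    return deltas

# Hypothetical, trimmed responses used only to exercise the function.
first_poll = {"nodes": {
    "a": {"name": "node_1", "indices": {"indexing": {"index_total": 1000, "index_time_in_millis": 2000}}},
    "b": {"name": "node_2", "indices": {"indexing": {"index_total": 1000, "index_time_in_millis": 2000}}},
}}
second_poll = {"nodes": {
    "a": {"name": "node_1", "indices": {"indexing": {"index_total": 1500, "index_time_in_millis": 4500}}},
    "b": {"name": "node_2", "indices": {"indexing": {"index_total": 1100, "index_time_in_millis": 2100}}},
}}

delta = indexing_delta(first_poll, second_poll)
for name, values in delta.items():
    print(name, values)
```

In the sample data, `node_1` handled five times as many indexing operations as `node_2` between the polls and at a higher average time per operation, which a single poll of lifetime totals would not show.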


#### Index level

Hot spotted nodes frequently surface via [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool)'s `write` and `search` queue backups. For example:
```json
GET _cat/thread_pool/write,search?v=true&s=n,nn&h=n,nn,q,a,r,c
```

Which could return:
```json
n      nn       q a r    c
search node_1   3 1 0 1287
search node_2   0 2 0 1159
search node_3   0 1 0 1302
write  node_1 100 3 0 4259
write  node_2   0 4 0  980
write  node_3   1 5 0 8714
```

Here you can see two distinct situations. First, `node_1` has a severely backed up write queue compared to other nodes. Second, `node_3` shows historically completed writes that are at least double those of any other node. These are both probably due to either poorly distributed write-heavy indices, or to multiple write-heavy indices allocated to the same node. Since primary and replica writes are roughly the same amount of cluster work, we usually recommend setting [`index.routing.allocation.total_shards_per_node`](https://www.elastic.co/docs/reference/elasticsearch/index-settings/total-shards-per-node#total-shards-per-node) to force index spreading after lining up index shard counts to total nodes.
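The same reading of the thread pool output can be sketched in Python. This is a minimal sketch using the example rows above. The `pool_hot_spots` helper, the queue threshold of 50, and the 2x-median ratio for completed tasks are illustrative assumptions.

```python
from statistics import median

# Rows mirror the cat thread pool example: n=pool, nn=node, q=queue, c=completed.
rows = [
    {"n": "search", "nn": "node_1", "q": 3, "c": 1287},
    {"n": "search", "nn": "node_2", "q": 0, "c": 1159},
    {"n": "search", "nn": "node_3", "q": 0, "c": 1302},
    {"n": "write", "nn": "node_1", "q": 100, "c": 4259},
    {"n": "write", "nn": "node_2", "q": 0, "c": 980},
    {"n": "write", "nn": "node_3", "q": 1, "c": 8714},
]

def pool_hot_spots(rows, queue_threshold=50, completed_ratio=2.0):
    """Flag nodes with a deep queue, or with completed tasks at or above
    `completed_ratio` times the per-pool median. Thresholds are assumptions."""
    by_pool = {}
    for r in rows:
        by_pool.setdefault(r["n"], []).append(r)
    findings = []
    for pool, members in by_pool.items():
        mid = median(int(m["c"]) for m in members)
        for m in members:
            if int(m["q"]) >= queue_threshold:
                findings.append((pool, m["nn"], "queue", int(m["q"])))
            if mid > 0 and int(m["c"]) >= completed_ratio * mid:
                findings.append((pool, m["nn"], "completed", int(m["c"])))
    return findings

findings = pool_hot_spots(rows)
for finding in findings:
    print(finding)
```

Queue depth reflects the current backlog, while completed counts accumulate since node start, so the two signals point at active and historical hot spotting respectively.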
It’s important to monitor and be aware of any changes in your data ingestion flow, as sudden increases or shifts can lead to high CPU usage and ingestion delays. Depending on your cluster architecture, optimizing the number of primary shards can significantly improve ingestion performance. For more details, see [Clusters, nodes, and shards](https://www.elastic.co/docs/deploy-manage/distributed-architecture/clusters-nodes-shards).
We normally recommend that write-heavy indices have sufficient primary `number_of_shards` and replica `number_of_replicas` to spread evenly across indexing nodes. Alternatively, you can [reroute](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-reroute) shards to quieter nodes to relieve the nodes with write hot spotting.
If it’s not obvious which indices are problematic, you can introspect further via [the index stats API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-stats) by running:
```json
```

For more advanced analysis, you can poll for shard-level stats, which lets you compare joint index-level and node-level stats. This analysis doesn’t account for node restarts or shard rerouting, but it serves as an overview:
```json
GET _stats?level=shards&human&expand_wildcards=all&ignore_unavailable=true
```

For example, you can use the [third-party JQ tool](https://stedolan.github.io/jq) to process the output saved as `indices_stats.json`:
```sh
cat indices_stats.json | jq -rc '[.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $s|.value[]|{node:.routing.node[:4], index:$i, shard:$s, primary:.routing.primary, size:.store.size, total_indexing:.indexing.index_total, time_indexing:.indexing.index_time_in_millis, total_query:.search.query_total, time_query:.search.query_time_in_millis } | .+{ avg_indexing: (if .total_indexing>0 then (.time_indexing/.total_indexing|round) else 0 end), avg_search: (if .total_query>0 then (.time_query/.total_query|round) else 0 end) }]' > shard_stats.json

# show top written-to shard simplified stats which contain their index and node references
cat shard_stats.json | jq -rc 'sort_by(-.avg_indexing)[]' | head
```


### Task loads

Shard distribution problems will most likely surface as task load, as seen above in the [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) example. Tasks can also hot spot a node either because individual tasks are expensive or because overall traffic volume is high, which will surface in [backlogged tasks](https://www.elastic.co/docs/troubleshoot/elasticsearch/task-queue-backlog).