Elasticsearch cluster performance metrics provide a quick and easy way for you to see how an Elasticsearch cluster has been performing on Elastic Cloud over the last 24 hours. For example, you can use these metrics to see how many indexing and search requests your cluster nodes are handling and how long it takes to respond to them. The performance metrics shown in the Elastic Cloud Console apply only to your cluster.
To see metrics for an extended period beyond the last 24 hours or to get more detailed monitoring information than what is provided by these cluster performance metrics, use the X-Pack monitoring features (called Marvel in versions before 5.0). Note that accurate X-Pack monitoring information for some system metrics, such as CPU utilization, is available as of version 5.2, but other metrics might not reflect the actual utilization for your cluster nodes. For always accurate system metrics, you can rely on the cluster performance metrics described in this section.
Cluster performance metrics are available directly in the Elastic Cloud Console. There is nothing you have to do to enable these metrics, but it does take a bit of time to collect meaningful data when you first create or change a cluster. Don’t see an active graph for all metrics? Send more work to your cluster.
To access cluster performance metrics:
- Log into the Elastic Cloud Console.
On the Deployments page, select your deployment.
Narrow the list of the deployments by name, ID, or choose from several other filters. Use a combination of them to further define the list. For example, you might want to select Is unhealthy and Has master problems to get a short list of deployments that need attention.
- From your deployment menu, go to the Elasticsearch page.
- Scroll down to the Cluster Performance Metrics section to see metrics for your cluster.
The following metrics are available:
Shows the usage of the CPU resources assigned to your Elasticsearch cluster, as a percentage. CPU resources are relative to the size of your cluster, so that a cluster with 32GB of RAM gets assigned twice as many CPU resources as a cluster with 16GB of RAM. All clusters are guaranteed their share of CPU resources, as Elastic Cloud infrastructure does not overcommit any resources. CPU credits permit boosting the performance of smaller clusters temporarily, so that CPU usage can exceed 100%.
Shows your remaining CPU credits, measured in seconds of CPU time. CPU credits enable the boosting of CPU resources assigned to your cluster to improve performance temporarily when it is needed most. These credits help a smaller cluster perform as if it were assigned the CPU resources of a larger cluster, but only for a limited time. Because they are intended to help the performance of smaller clusters, CPU credits are available only on smaller clusters up to and including 8 GB of RAM.
When you initially create a cluster, you receive a credit of 60 seconds worth of CPU time. You can accumulate additional credits when your CPU usage is less than what your cluster is assigned and you use credits when your CPU usage is being boosted to improve performance. At most, you can accumulate one hour worth of additional CPU time per GB of RAM for your cluster. For example: A cluster with 4 GB of RAM, can at most accumulate four hours worth of additional CPU time and can consume all of these CPU credits within four hours when loaded heavily with requests.
If you observe declining performance on a smaller cluster over time, check to see if you have depleted your CPU credits. If you have, this is an indicator that you might need to think about increasing the size of your cluster to handle the workload with consistent performance.
CPU credits persist through cluster restarts, but they are tied to your existing cluster nodes. Operations that create new cluster nodes will forfeit existing CPU credits. For example: If your cluster nodes have to be moved to a different host by Elastic due to a maintenance issue or if you resize your cluster, CPU credits are lost. These operations always create new cluster nodes in order to avoid any downtime for you.
Shows the number of requests that your cluster receives per second, separated into search requests and requests to index documents. This metric provides a good indicator of the volume of work that your cluster typically handles over time which, together with other performance metrics, helps you determine if your cluster is sized correctly. Also lets you see at a glance if there is a sudden increase in the volume of user requests that might explain an increase in response times.
Indicates the amount of time that it takes for your Elasticsearch cluster to complete a search query, in milliseconds. Response times won’t tell you about the cause of a performance issue, but they are often a first indicator that something is amiss with the performance of your Elasticsearch cluster.
Indicates the amount of time that it takes for your Elasticsearch cluster to complete an indexing operation, in milliseconds. Response times won’t tell you about the cause of a performance issue, but they are often a first indicator that something is amiss with the performance of your Elasticsearch cluster.
Indicates the total memory used by the JVM heap over time. Memory pressure that consistently remains above 75% indicates that you might need to resize your cluster or reduce the memory consumption soon. The higher the pressure, the less memory is available and the more frequent the garbage collection becomes. Memory pressure that is consistently above 85% indicates that you need to resize your cluster or reduce memory consumption immediately. More aggressive garbage collection will impact performance and running out of memory can lead to cluster unavailability and reboots. To learn more, see how high memory pressure can cause performance issues.
Indicates the overhead involved in JVM garbage collection to reclaim memory. Elasticsearch is configured to initiate garbage collection when the Java heap reaches 75% memory usage, which requires spending some CPU resources to reclaim memory. Initially, garbage collection uses the less aggressive ConcurrentMarkSweep (CMS) collector. If the less aggressive garbage collection does not free up memory for a needed memory allocation quickly enough, the JVM triggers more aggressive stop-the-world garbage collection, at the cost of halting all threads on the JVM until the collector finishes.
Performance correlates directly with resources assigned to your cluster on Elastic Cloud and many of these metrics will show some sort of correlation with each other when you are trying to determine the cause of a performance issue. Take a look at some of the scenarios included in this section to learn how you can determine the cause of performance issues.
It is not uncommon for performance issues on Elastic Cloud to be caused by an undersized cluster that cannot cope with the workload it is being asked to handle. If your cluster performance metrics often shows high CPU usage or excessive memory pressure, consider increasing the size of your cluster soon to improve performance. This is especially true for clusters that regularly reach 100% of CPU usage or that suffer out-of-memory failures; it is better to resize your cluster early when it is not yet maxed out than to have to resize a cluster that is already overwhelmed. Changing the configuration of your cluster adds some overhead as data needs to be migrated to the new nodes, which can increase the load on a cluster further and delay configuration changes.
Got an overwhelmed cluster that needs to be upsized? Try enabling maintenance mode first. It will likely help with configuration changes.
Work with the metrics shown in Cluster Performance Metrics section to help you find the information you need:
Hover on any part of a graph to get additional information. For example, hovering on a section of a graph that shows response times reveals the percentile that responses fall into at that point in time:
Zoom in on a graph by drawing a rectangle to select a specific time window. As you zoom in one metric, other performance metrics change to show data for the same time window.
- Pan around with to make sure that you can see the right parts of a metric graph as you zoom in.
- Reset the metric graph axes with , which returns the graphs to their original scale.
Cluster performance metrics are shown per node and are color-coded to indicate which running Elasticsearch instance they belong to.
For clusters that suffer out-of-memory failures, it can be difficult to determine whether the clusters are in a completely healthy state afterwards. For this reason, Elastic Cloud automatically reboots clusters that suffer out-of-memory failures.
You will receive an email notification to let you know that a restart occurred. For repeated alerts, the emails are aggregated so that you do not receive an excessive number of notifications. Either resizing your cluster to reduce memory pressure or reducing the workload that a cluster is being asked to handle can help avoid these cluster restarts.