Scenario: Why Is Performance Degrading over Time?

You have a small Elasticsearch cluster hosted on Elastic Cloud and you have noticed that performance has declined recently. Search response times have gone up, and overall the system does not perform as well as it used to. You have already looked at the cluster performance metrics and confirmed that both index and search response times have increased steadily and remained higher than before. So what explains the performance degradation?

When you look in the Cluster Performance Metrics section of the Elastic Cloud Console, you see the following metrics:

CPU usage versus CPU credits over time

Between just after 00:10 and 00:20, excessively high CPU usage consumes CPU credits until none are left. On smaller clusters, up to and including 4 GB of RAM, CPU credits temporarily boost the assigned CPU resources to improve performance when it is needed most, but credits are by their nature limited. You accumulate CPU credits when you use less than your assigned share of CPU resources, and you consume credits when you use more CPU resources than assigned. When you max out your CPU resources, credits let your cluster temporarily consume more than 100% of the assigned resources. This explains why CPU usage exceeds 100%, with peaks that reach well over 400% for one node. As the credits are depleted, CPU usage gradually drops until it returns to 100% at 00:30, when no more CPU credits are available. After 00:30, credits gradually begin to accumulate again.
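To make the earn-and-spend behavior concrete, here is a minimal sketch of a credit balance over time. It is a toy model, not the accounting that Elastic Cloud actually uses: the function name `simulate_cpu_credits`, the one-minute step, the credit unit (percent-minutes), and the `max_credits` cap are all assumptions chosen for illustration.

```python
def simulate_cpu_credits(demand, baseline=100.0, credits=0.0, max_credits=600.0):
    """Toy model: each entry in `demand` is the CPU % wanted during one minute.

    All parameters and units are illustrative assumptions, not Elastic Cloud values.
    Returns a list of (cpu_used_percent, credit_balance) tuples, one per minute.
    """
    history = []
    for wanted in demand:
        if wanted <= baseline:
            # Under the assigned share: bank the unused portion as credits.
            credits = min(max_credits, credits + (baseline - wanted))
            used = wanted
        else:
            # Over the assigned share: boost only as far as the credits allow.
            boost = min(wanted - baseline, credits)
            credits -= boost
            used = baseline + boost
        history.append((used, credits))
    return history


# 10 quiet minutes at 40% CPU, then 10 minutes in which the workload wants 400% CPU.
history = simulate_cpu_credits([40.0] * 10 + [400.0] * 10)
for minute, (used, balance) in enumerate(history):
    print(f"minute {minute:2d}: used {used:5.1f}%  credits {balance:6.1f}")
```

In this sketch, the quiet period builds up a balance, the heavy period spends it within a couple of minutes, and usage then falls back to the 100% baseline even though the workload still wants more. That is the same shape as the chart: a burst above 100% followed by a return to the assigned share once the credits run out.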

If you need your cluster to sustain a certain level of performance, you can rely on CPU boosting to handle the workload only temporarily. To ensure that performance can be sustained, consider increasing the size of your cluster.