Keeping on top of the health of your deployments is a key part of the shared responsibilities between Elastic and yourself. This section provides some best practices to help you monitor and understand the ongoing state of your deployments and their resources.
In the normal course of using your Elasticsearch Add-On for Heroku deployments, health warnings and errors are expected to appear from time to time. Following are the most common scenarios and methods to resolve them.
- Health warning messages
Health warning messages will sometimes appear on the main page for one of your deployments, as well as on the Logs and metrics page.
A single warning is rarely cause for concern, as it often just reflects ongoing, routine maintenance activity occurring on the Elasticsearch Add-On for Heroku platform. In many cases the warning will disappear over time. If you’d rather not wait, you can run a no-op plan as described in How do I resolve deployment health warnings?.
- Configuration change failures
In more serious cases, your deployment may be unable to restart. The failure can be due to a variety of causes, the most frequent of these being invalid secure settings, expired plugins or bundles, or out of memory errors. The problem is often not detected until an attempt is made to restart the deployment following a routine configuration change, such as a deployment resizing.
We’ve collected together the most common of these causes so that you can resolve configuration change failures as quickly and painlessly as possible. And, just in case these solutions don’t help, you can always ask for help.
- Out of memory errors
Out of memory errors (OOMs) may occur during your deployment’s normal operations, and these can have a very negative impact on performance. Common causes of memory shortages are oversharding, data retention oversights, and the overall request volume.
On your deployment page, you can check the JVM memory pressure indicator to get the current memory usage of each node of your deployment. You can also review the common causes of high JVM memory usage to help diagnose the source of unexpectedly high memory pressure levels. To learn more, check How does high memory pressure affect performance?.
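If you want to check memory pressure outside the deployment page, the following is a minimal sketch, not an official tool. It assumes that you have already fetched the JSON body of the Elasticsearch `GET _nodes/stats/jvm` API and that memory pressure can be approximated as the fill rate of the old generation pool. The function names, the sample data, and the 75% threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def old_gen_pressure(nodes_stats: dict) -> Dict[str, float]:
    """Map each node name to its old-generation pool fill rate, in percent."""
    pressure: Dict[str, float] = {}
    for node in nodes_stats.get("nodes", {}).values():
        old_pool = node["jvm"]["mem"]["pools"]["old"]
        max_bytes = old_pool["max_in_bytes"]
        if max_bytes > 0:  # skip nodes that report no usable maximum
            pressure[node["name"]] = round(
                100.0 * old_pool["used_in_bytes"] / max_bytes, 1
            )
    return pressure


def nodes_over(pressure: Dict[str, float], threshold: float = 75.0) -> List[str]:
    """Return the names of nodes at or above the threshold (an assumed value)."""
    return sorted(name for name, pct in pressure.items() if pct >= threshold)


# Hypothetical, trimmed-down response body used only to exercise the functions.
SAMPLE_STATS = {
    "nodes": {
        "node-id-a": {
            "name": "instance-0000000000",
            "jvm": {"mem": {"pools": {"old": {
                "used_in_bytes": 805306368, "max_in_bytes": 1073741824}}}},
        },
        "node-id-b": {
            "name": "instance-0000000001",
            "jvm": {"mem": {"pools": {"old": {
                "used_in_bytes": 268435456, "max_in_bytes": 1073741824}}}},
        },
    }
}

if __name__ == "__main__":
    pressure = old_gen_pressure(SAMPLE_STATS)
    print(pressure)
    print(nodes_over(pressure))
```

A single reading says little on its own. Sampling the value repeatedly and looking for nodes that stay above the threshold is more useful than reacting to one spike.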
In a production environment, it’s important to set up dedicated health monitoring in order to retain the logs and metrics that can be used to troubleshoot any health issues in your deployments. In the event that you need to contact our support team, they can use the retained data to help diagnose any problems that you may encounter.
You have the option of sending logs and metrics to a separate, specialized monitoring deployment, which ensures that they’re available in the event of a deployment outage. The monitoring deployment also gives you access to Kibana’s stack monitoring features, through which you can view health and performance data for all of your deployment resources.
Read through our How to set up monitoring guide to learn more.
Preconfigured logs and metrics
In a non-production environment, you may choose to rely on the logs and metrics that are available for your deployment by default. The deployment Logs and metrics page displays any current deployment health warnings, and from there you can also view standard log files from the last 24 hours.
The logs capture any activity related to your deployments, their component resources, snapshotting behavior, and more. You can use the search bar to filter the logs by, for example, a specific instance (instance-0000000005), a configuration file (roles.yml), an operation type (autoscaling), or a component.
The deployment Activity page gives you quick access to a record of all configuration changes that have been applied to your deployment, including the timing, the initiating user, and whether or not the change was successful.
Understanding deployment health
We’ve compiled some guidelines to help you ensure the health of your deployments over time. These can help you to better understand the available performance metrics, and to make decisions involving performance and high availability.
- Why is my master node unavailable?
- Learn about common symptoms and possible actions that you can take to resolve issues when the master node becomes unhealthy or unavailable.
- Why are my shards unavailable?
- Learn how to troubleshoot issues related to unassigned shards.
- Why is performance degrading over time?
- Learn how to address performance degradation on a smaller Elasticsearch cluster.
- Is my cluster really highly available?
- High availability involves more than setting multiple availability zones (although that’s really important!). Learn how to assess performance and workloads to determine if your deployment has adequate resources to mitigate a potential node failure.
- How does high memory pressure affect performance?
- Learn about typical memory usage patterns, how to assess when the deployment memory usage levels are problematic, how this impacts performance, and how to resolve memory-related issues.
- Why are my cluster response times suddenly so much worse?
- Learn about the common causes of increased query response times and decreased performance in your deployment.
Understanding cluster issues
This section provides guidelines on how to address some common cluster issues.
- Elasticsearch cluster is unreachable
When the Elasticsearch cluster is unreachable, other connected services like Kibana, APM, and Enterprise Search also become unreachable. Possible causes include stopped routing, too many requests, CPU or heap starvation, an unavailable elected master node, or a failed plan change. To remediate, you can try one of the following:
- Elasticsearch cluster is unable to ingest data
The Elasticsearch cluster is unable to ingest data because the disk is full. To remediate, you can try one of the following:
- Elasticsearch cluster data is missing
Some data in the Elasticsearch cluster is missing, and search results may be incorrect or unavailable. You might see messages about unassigned shards or an unhealthy master node. To remediate, you can try one of the following:
For more information, you can also check how to fix common cluster issues.
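As a rough first pass at telling the three issues above apart, the sketch below inspects responses that you have already fetched from the Elasticsearch `GET _cluster/health` and `GET _nodes/stats/fs` APIs. It is an illustration under stated assumptions, not a diagnostic tool: the function names, the message strings, the sample data, and the 10% free-space threshold are all hypothetical, and a missing health response is simply treated as "unreachable".

```python
from typing import List, Optional


def triage_health(health: Optional[dict]) -> List[str]:
    """Summarize a _cluster/health body; None stands for 'no response'."""
    if health is None:
        return ["cluster unreachable: no health response"]
    findings: List[str] = []
    status = health.get("status")
    if status == "red":
        findings.append("status red: at least one primary shard is unassigned")
    elif status == "yellow":
        findings.append("status yellow: at least one replica shard is unassigned")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        findings.append(f"{unassigned} unassigned shards")
    return findings


def low_disk_nodes(nodes_stats: dict, min_free_percent: float = 10.0) -> List[str]:
    """Return names of nodes whose available disk is below the assumed threshold."""
    low: List[str] = []
    for node in nodes_stats.get("nodes", {}).values():
        total = node["fs"]["total"]
        if total["total_in_bytes"] <= 0:
            continue  # nothing meaningful to compute for this node
        free_pct = 100.0 * total["available_in_bytes"] / total["total_in_bytes"]
        if free_pct < min_free_percent:
            low.append(node["name"])
    return sorted(low)


# Hypothetical, trimmed-down response bodies used only to exercise the functions.
SAMPLE_HEALTH = {"status": "red", "unassigned_shards": 3}
SAMPLE_FS = {
    "nodes": {
        "node-id-a": {
            "name": "instance-0000000000",
            "fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 400}},
        },
        "node-id-b": {
            "name": "instance-0000000001",
            "fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 50}},
        },
    }
}

if __name__ == "__main__":
    print(triage_health(SAMPLE_HEALTH))
    print(low_disk_nodes(SAMPLE_FS))
```

The output only narrows down which of the scenarios to read about first. It does not replace the remediation steps in the linked guides.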