Keeping your deployment healthy

Keeping on top of the health of your deployments is a key part of the responsibilities shared between Elastic and you. This section provides some best practices to help you monitor and understand the ongoing state of your deployments and their resources.

Health warnings

In the normal course of using your Elasticsearch Service deployments, health warnings and errors are expected to appear from time to time. The following are the most common scenarios and the methods to resolve them.

Health warning messages

Health warning messages will sometimes appear on the main page for one of your deployments, as well as on the Logs and metrics page.

A single warning is rarely cause for concern, as it often just reflects ongoing, routine maintenance activity occurring on the Elasticsearch Service platform. In many cases the warning will disappear over time. If you’d rather not wait, you can run a no-op plan as described in How do I resolve deployment health warnings?.

Configuration change failures

In more serious cases, your deployment may be unable to restart. The failure can be due to a variety of causes, the most frequent of these being invalid secure settings, expired plugins or bundles, or out of memory errors. The problem is often not detected until an attempt is made to restart the deployment following a routine configuration change, such as a deployment resizing.

We’ve collected the most common of these causes so that you can resolve configuration change failures as quickly and painlessly as possible. If these solutions don’t help, you can always ask for help.

Out of memory errors

Out of memory errors (OOMs) may occur during your deployment’s normal operations, and they can severely degrade performance. Common causes of memory shortages are oversharding, data retention oversights, and the overall request volume.

On your deployment page, you can check the JVM memory pressure indicator to see the current memory usage of each node of your deployment. You can also review the common causes of high JVM memory usage to help diagnose the source of unexpectedly high memory pressure levels. To learn more, check How does high memory pressure affect performance?.
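As a rough illustration of what such an indicator reflects, the sketch below estimates per-node JVM memory pressure from an Elasticsearch node stats response (`GET _nodes/stats/jvm`), taking the fill ratio of the old generation memory pool. The `memory_pressure` helper, the trimmed sample response, and the 75% threshold are assumptions made for this example, not the exact logic behind the indicator on the deployment page.

```python
from typing import Dict

# Assumed threshold (percent) at which a node is flagged in this sketch.
HIGH_PRESSURE_PERCENT = 75.0


def memory_pressure(node_stats: dict) -> Dict[str, float]:
    """Return {node name: old generation pool usage in percent}."""
    result = {}
    for node in node_stats["nodes"].values():
        old_pool = node["jvm"]["mem"]["pools"]["old"]
        used = old_pool["used_in_bytes"]
        maximum = old_pool["max_in_bytes"]
        result[node["name"]] = round(100.0 * used / maximum, 1)
    return result


# Hypothetical, heavily trimmed response shaped like GET _nodes/stats/jvm.
sample = {
    "nodes": {
        "a1": {
            "name": "instance-0000000000",
            "jvm": {"mem": {"pools": {"old": {
                "used_in_bytes": 300, "max_in_bytes": 1000}}}},
        },
        "b2": {
            "name": "instance-0000000001",
            "jvm": {"mem": {"pools": {"old": {
                "used_in_bytes": 800, "max_in_bytes": 1000}}}},
        },
    }
}

pressure = memory_pressure(sample)
# Nodes at or above the assumed threshold, sorted for stable output.
flagged = sorted(
    name for name, pct in pressure.items() if pct >= HIGH_PRESSURE_PERCENT
)
print(pressure)
print(flagged)
```

A single high reading is less informative than a trend, so a check like this is most useful when run repeatedly and compared over time.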

Health monitoring

In a production environment, it’s important to set up dedicated health monitoring in order to retain the logs and metrics that can be used to troubleshoot any health issues in your deployments. In the event that you need to contact our support team, they can use the retained data to help diagnose any problems that you may encounter.

You have the option of sending logs and metrics to a separate, specialized monitoring deployment, which ensures that they’re available in the event of a deployment outage. The monitoring deployment also gives you access to Kibana’s stack monitoring features, through which you can view health and performance data for all of your deployment resources.

As part of health monitoring, it’s also a best practice to configure alerting, so that you can be notified right away about any deployment health issues.
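As a minimal sketch of the kind of condition an alert might encode, the function below turns a cluster health response (`GET _cluster/health`) into a notification message. The `health_alert` helper, the sample responses, and the rule of notifying on any non-green status are assumptions made for this example; it is not a replacement for the alerting setup described in the monitoring guide.

```python
from typing import Optional


def health_alert(health: dict) -> Optional[str]:
    """Return a notification message, or None when the cluster is green."""
    status = health["status"]
    if status == "green":
        return None
    return (
        f"{health['cluster_name']} is {status}: "
        f"{health['unassigned_shards']} unassigned shard(s)"
    )


# Hypothetical, trimmed responses shaped like GET _cluster/health.
healthy = {"cluster_name": "my-deployment", "status": "green",
           "unassigned_shards": 0}
degraded = {"cluster_name": "my-deployment", "status": "yellow",
            "unassigned_shards": 3}

print(health_alert(healthy))
print(health_alert(degraded))
```

Keeping the check as a pure function of the response makes it easy to test against saved responses before wiring it to any notification channel.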

Read through our How to set up monitoring guide to learn more.

Preconfigured logs and metrics

In a non-production environment, you may choose to rely on the logs and metrics that are available for your deployment by default. The deployment Logs and metrics page displays any current deployment health warnings, and from there you can also view standard log files from the last 24 hours.

The logs capture any activity related to your deployments, their component resources, snapshotting behavior, and more. You can use the search bar to filter the logs by, for example, a specific instance (instance-0000000005), a configuration file (roles.yml), an operation type (snapshot, autoscaling), or a component (kibana, ent-search).

Understanding deployment health

We’ve compiled some guidelines to help you ensure the health of your deployments over time. These can help you to better understand the available performance metrics, and to make decisions involving performance and high availability.

Why is performance degrading over time?
Address performance degradation on a smaller Elasticsearch cluster.
Is my cluster really highly available?
High availability involves more than setting multiple availability zones (although that’s really important!). Learn how to assess performance and workloads to determine if your deployment has adequate resources to mitigate a potential node failure.
How does high memory pressure affect performance?
Learn about typical memory usage patterns, how to assess when the deployment memory usage levels are problematic, how this impacts performance, and how to resolve memory-related issues.
Why are my cluster response times suddenly so much worse?
Learn about the common causes of increased query response times and decreased performance in your deployment.
Why did my node move to a different host?
Learn about why we may, from time to time, relocate your Elasticsearch Service deployments across hosts.