Troubleshoot upgrades
Most Elasticsearch upgrades succeed without issues, as long as you plan and prepare for them carefully. This page describes the problems you're most likely to encounter during a rolling upgrade, and how to resolve them.
You can avoid most of these issues by completing the steps in the Upgrade Assistant before you start. For more information, refer to Resolve Upgrade Assistant issues.
During a rolling upgrade, Elasticsearch supports running two versions at the same time (the earlier version and the later version), but never more than two, and only for the duration of the upgrade.
If your nodes share the same configuration (other than node roles), and you follow the recommended upgrade order, any potential issues will surface as you upgrade the first node.
Monitor the upgrade at a high level by checking the list of cluster nodes, and at a low level by tailing the logs of the restarting node.
To monitor which nodes have been upgraded, use the cat nodes API:
GET _cat/nodes?v=true&h=name,ip,role,master,version,uptime&s=uptime
In an example three-node cluster, the first node's upgrade progresses as follows:
All nodes are present in the cluster.
    name                  ip          role   master version uptime
    instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
    instance-0000000001   10.42.1.10  himrst -      8.19.x  20d
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d
As the node shuts down, it stops syncing with the elected master.
    name                  ip          role   master version uptime
    instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
    instance-0000000001
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d
The elected master removes the node from the cluster, so it no longer appears.
    name                  ip          role   master version uptime
    instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d
After the node restarts and rejoins the cluster, it appears again, now running the later version.
    name                  ip          role   master version uptime
    instance-0000000001   10.42.1.10  himrst -      9.x.x   5s
    instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d
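To track progress without reading the table by eye, you can group the nodes by the version they report. The following is a minimal sketch, not an official tool. It assumes the text output of the cat nodes request shown above, with `name` as the first column. The function names `nodes_by_version` and `is_mixed_version` are hypothetical.

```python
from collections import defaultdict

# Sample output taken from the last step of the example above.
SAMPLE = """\
name                  ip          role   master version uptime
instance-0000000001   10.42.1.10  himrst -      9.x.x   5s
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d
"""


def nodes_by_version(cat_nodes_text):
    """Group node names by the value in the `version` column."""
    lines = [line for line in cat_nodes_text.splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("name")  # assumed to be the first column
    version_col = header.index("version")

    groups = defaultdict(list)
    for line in lines[1:]:
        cols = line.split()
        # A node that is shutting down can appear with only its name filled in.
        version = cols[version_col] if len(cols) > version_col else "unknown"
        groups[version].append(cols[name_col])
    return dict(groups)


def is_mixed_version(groups):
    """Return True while more than one known version is reported."""
    known = [version for version in groups if version != "unknown"]
    return len(known) > 1


if __name__ == "__main__":
    groups = nodes_by_version(SAMPLE)
    print(groups)
    print("mixed versions:", is_mixed_version(groups))
```

While `is_mixed_version` returns True, the rolling upgrade is still in progress.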
If a node doesn't rejoin the cluster, inspect its restart logs.
While a node is restarting, you can tail its logs for information about the upgrade-and-restart process. Try filtering for logs related to discovery and cluster formation events. For example:
- In Discover on an attached monitoring cluster, apply a Lucene filter on .monitoring*:
  "node-join" OR "node-left" OR "master node changed" OR "elected-as-master" OR exitcode OR initializing OR fatal OR "publish_address"
- On the host, tail the Elasticsearch logs through a grep filter:
  grep -Ei 'node-join|node-left|master node changed|elected-as-master|exitcode|initializing|fatal|publish_address'
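If you collect logs in a script rather than a terminal, you can apply the same filter in code. The sketch below reuses the keywords from the grep pattern above as a case-insensitive regular expression. The sample lines are invented placeholders, not real Elasticsearch log output, and `filter_log_lines` is a hypothetical helper name.

```python
import re

# Same alternation as the grep -Ei pattern above; IGNORECASE mirrors the -i flag.
PATTERN = re.compile(
    r"node-join|node-left|master node changed|elected-as-master|"
    r"exitcode|initializing|fatal|publish_address",
    re.IGNORECASE,
)

# Invented placeholder lines for illustration only.
SAMPLE_LINES = [
    "example: node-join of instance-0000000001",
    "example: routine refresh completed",
    "example: FATAL error during startup",
    "example: publish_address {10.42.1.10:9300}",
]


def filter_log_lines(lines):
    """Keep only lines that mention a discovery or cluster formation keyword."""
    return [line for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    for line in filter_log_lines(SAMPLE_LINES):
        print(line)
```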
Based on your findings, refer to the common error resolutions.
During a rolling upgrade, the cluster continues to operate normally.
New functionality stays inactive, or runs in a backward-compatible mode, until the last node running the earlier version leaves the cluster. New and updated features become fully operational only when every node is running the later version.
Normally, a node running the earlier version leaves the cluster only when you shut it down to upgrade it. The last earlier-version node leaves when there are no more nodes to upgrade.
The following sections describe edge cases that can disrupt this process:
- A node unexpectedly leaves the cluster.
- Nodes are upgraded out of the recommended order.
- The cluster isn't highly available.
Because of cluster fault detection, a node running the earlier version might leave the cluster before you deliberately shut it down. It might leave temporarily, or stay out until you intervene. Recover the node into the cluster before you continue the rolling upgrade.
If a node unexpectedly leaves the cluster during a rolling upgrade, the upgrade might pause to prevent data loss. When this happens, the Deployment Activity shows the status Waiting until cluster recovers and reports fewer nodes than expected.
If all the remaining earlier-version nodes unexpectedly leave the cluster during an upgrade, the cluster does the following:
- Reports its state as fully upgraded
- Automatically activates new functionality
- Leaves its backward-compatible mode
Afterward, you can't return the cluster to a state that's compatible with the earlier-version nodes.
Nodes running the earlier version can no longer join the fully upgraded cluster. Their Elasticsearch logs report failed to join errors, with a Caused by such as:
- node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater
- node with version [x.x.x] may not join a cluster with minimum version [y.y.y]
- node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]
- handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]
Elasticsearch preserves the data in the data paths of the older nodes and uses it to recover the cluster to health after you fully upgrade them. To bring these nodes back into the cluster, upgrade them.
If a node leaving the cluster causes the cluster health API to report red, the upgrade might pause to protect your data. If this happens, contact us with one of the following:
- Elastic Cloud Hosted deployment ID
- Elastic Cloud Enterprise diagnostic flagged --deployments for the problematic deployment ID, after attempting a pause and resume instance
If you stop half or more of the master-eligible nodes at the same time during the upgrade, the cluster becomes unavailable because too few remain to form a voting quorum.
Production environments should have at least three master-eligible nodes for high availability. In a test or development environment with only one or two master-eligible nodes, you can't avoid stopping half or more of them, so the cluster always becomes unavailable at some point during the upgrade.
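The quorum rule described above comes down to simple arithmetic. The following sketch models only the rule as stated here (stopping half or more of the master-eligible nodes loses the quorum). It does not model Elasticsearch voting configurations, which can behave differently in some cluster sizes. The function name `keeps_quorum` is hypothetical.

```python
def keeps_quorum(master_eligible_total, stopped):
    """Return True if the remaining master-eligible nodes are a strict majority.

    Simplified model: the cluster stays available only when fewer than half
    of the master-eligible nodes are stopped at the same time.
    """
    if master_eligible_total < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    if not 0 <= stopped <= master_eligible_total:
        raise ValueError("stopped must be between 0 and the total")
    remaining = master_eligible_total - stopped
    return remaining * 2 > master_eligible_total


if __name__ == "__main__":
    for total, stopped in [(3, 1), (3, 2), (2, 1), (1, 1)]:
        print(total, stopped, keeps_quorum(total, stopped))
```

With three master-eligible nodes you can stop one at a time and keep the quorum. With one or two, stopping any node fails the check, which matches the behavior described for test and development environments.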
Restart all the stopped master-eligible nodes so the cluster can re-form. This might trigger a premature cluster version update; to reduce this risk, upgrade the master-eligible nodes last.
When nodes restart, they can encounter errors that also occur outside of upgrades. The most common are:
- Startup failures from misconfigured systemctl timeout settings
- Startup failures from misconfigured settings that trip bootstrap checks
- Errors during node discovery and cluster formation
- Circuit breaker or watermark errors from temporary resource shortages
- Issues caused by insufficient high availability
The rest of this page covers errors specific to the rolling upgrade itself.
These bootstrap checks occur only during rolling upgrades.
Elasticsearch indices are compatible across sequential major versions only. When a restarting node tries to load metadata for an outdated, incompatible index, it fails with an error such as:
- The index [index-000001] created in version [y-1.x.x] with current compatibility version [y-1.x.x] must be marked as read-only using the setting [index.blocks.write] set to [true] before upgrading to y+1.z.z.
- Cannot start this node because it holds metadata for indices with version [y-1.x.x] with which this node of version [y+1.z.z] is incompatible. Revert this node to version [y.y.y] and delete any indices with versions earlier than [y.0.0] before upgrading to version [y+1.z.z]. If all such indices have already been deleted, revert this node to version [y.y.y] and wait for it to join the cluster to clean up any older indices from its metadata.
- cannot upgrade node because incompatible indices created with version [y-1.x.x] exist, while the minimum compatible index version is [y.y.y]. Upgrade your older indices by reindexing them in version [y+1.z.z] first
This error means the Upgrade Assistant found issues that still need to be resolved.
Before you begin the upgrade again, revert the node to the earlier version, rejoin it to the cluster, and complete every critical item in the Upgrade Assistant. For more details, refer to Resolve Upgrade Assistant issues.
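As a rough pre-check, you can compare the version each index was created in against the target version. The sketch below applies only the simplified rule stated above: an index is compatible with the same major version and the one immediately before it. It ignores the read-only exception mentioned in the first error message, and it doesn't replace the Upgrade Assistant. The index names, version numbers, and function names are hypothetical.

```python
def major_of(version):
    """Return the major component of a dotted version string such as '8.19.0'."""
    return int(version.split(".")[0])


def incompatible_indices(index_created_versions, target_version):
    """List indices created more than one major version before the target.

    index_created_versions maps an index name to the version that created it.
    """
    target_major = major_of(target_version)
    return sorted(
        name
        for name, created in index_created_versions.items()
        if major_of(created) < target_major - 1
    )


if __name__ == "__main__":
    # Hypothetical inventory of indices and their creation versions.
    inventory = {
        "index-000001": "7.17.0",
        "logs-current": "8.19.0",
    }
    print(incompatible_indices(inventory, "9.0.0"))
```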
If the Elasticsearch configuration contains settings that are no longer valid in the later version, the node might fail to start with an error such as:
- unknown setting [X] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
- The configuration setting [X] is required
This error means you didn't fully review breaking changes during preparation. Resolve every unknown setting startup error before you continue. For common examples, refer to Troubleshoot node bootlooping.
You might see shard allocation issues if:
- Node upgrades didn't follow the recommended upgrade order
- One of the edge cases described earlier occurs
Beyond the common allocation issues, these errors appear only during rolling upgrades:
- incompatible index versions
  - illegal_argument_exception: The index [my_index] was created with version [X.X.X] but the minimum compatible version is [Y.Y.Y]
  - java.lang.IllegalStateException: index [my_index] version not supported: X.X.X maximum compatible index version is: Y.Y.Y
- incompatible shard versions
  - cannot allocate replica shard to a node with version [X.X.X] since this is older than the primary version [Y.Y.Y]
If you encounter any of these, continue upgrading your nodes. The data allocates as more nodes reach the later version.
These issues can appear after an Elasticsearch upgrade if specific upgrade tasks remain unfinished.
If Kibana doesn't start after its upgrade, or reports Kibana server is not ready yet, make sure you re-enabled shard allocation.
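Re-enabling shard allocation is typically done by resetting the cluster.routing.allocation.enable setting through the cluster update settings API. The sketch below only builds that request and doesn't send it. It assumes the setting was changed as a persistent setting during the upgrade. The base URL is a placeholder, authentication and TLS are omitted, and the function name is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: replace with your cluster endpoint


def build_reset_allocation_request(base_url=BASE_URL):
    """Build a PUT _cluster/settings request that resets allocation to its default."""
    # A null value removes the override, so the default (all shards) applies again.
    body = {"persistent": {"cluster.routing.allocation.enable": None}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    request = build_reset_allocation_request()
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To send the request, pass it to `urllib.request.urlopen` after adding the credentials your cluster requires.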
If you set upgrade_mode for transform indices, you might see unexpected errors after the upgrade, such as:
- Cannot stop any Transform while the Transform feature is upgrading (408)
- Transform task will not be assigned while upgrade mode is enabled.
Set enabled=false to exit upgrade mode for transforms.
If you set upgrade_mode for machine learning indices, you might see unexpected errors after the upgrade, such as:
- You don't have permission to manage Machine Learning jobs. Access to the plugin requires the Machine Learning feature to be visible in this space.
- Index migration in progress. Indices related to Machine Learning are currently being upgraded. Some actions will not be available during this time.
Set enabled=false to exit upgrade mode for machine learning.
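The two enabled=false calls above can be scripted the same way. The sketch below only builds the requests and doesn't send them. It assumes the endpoints are _transform/set_upgrade_mode and _ml/set_upgrade_mode, so confirm them against the API reference for your version. The base URL is a placeholder, authentication is omitted, and the helper name is hypothetical.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: replace with your cluster endpoint

# Assumed endpoint paths for each feature; verify for your Elasticsearch version.
UPGRADE_MODE_PATHS = {
    "transform": "/_transform/set_upgrade_mode",
    "ml": "/_ml/set_upgrade_mode",
}


def build_exit_upgrade_mode_request(feature, base_url=BASE_URL):
    """Build a POST request that sets enabled=false for the given feature."""
    path = UPGRADE_MODE_PATHS[feature]  # raises KeyError for unknown features
    return urllib.request.Request(
        url=f"{base_url}{path}?enabled=false",
        method="POST",
    )


if __name__ == "__main__":
    for feature in ("transform", "ml"):
        request = build_exit_upgrade_mode_request(feature)
        print(request.get_method(), request.full_url)
```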