Troubleshoot upgrades

Most Elasticsearch upgrades succeed without issues, as long as you plan and prepare for them carefully. This page describes the problems you're most likely to encounter during a rolling upgrade, and how to resolve them.

You can avoid most of these issues by completing the steps in the Upgrade Assistant before you start. For more information, refer to Resolve Upgrade Assistant issues.

Monitor an upgrade

During a rolling upgrade, Elasticsearch supports running two versions at the same time (the earlier version and the later version), but never more than two, and only for the duration of the upgrade.

If your nodes share the same configuration (other than node roles), and you follow the recommended upgrade order, any potential issues will surface as you upgrade the first node.

Monitor the upgrade at a high level by checking the list of cluster nodes, and at a low level by tailing the logs of the restarting node.

Poll cluster nodes

To monitor which nodes have been upgraded, use the cat nodes API:

GET _cat/nodes?v=true&h=name,ip,role,master,version,uptime&s=uptime

In an example three-node cluster, the first node's upgrade progresses as follows:

All nodes are present in the cluster.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
instance-0000000001   10.42.1.10  himrst -      8.19.x   20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d

As the node shuts down, it stops syncing with the elected master, so the cat nodes output lists its name with the remaining columns empty.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
instance-0000000001
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d

The elected master removes the node from the cluster, so it no longer appears.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d

After the node restarts and rejoins the cluster, it appears again, now running the later version.

name                  ip          role   master version uptime
instance-0000000001   10.42.1.10  himrst -      9.x.x     5s
instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d

If a node doesn't rejoin the cluster, inspect its restart logs.
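
To follow this progression from a script rather than by eye, you could parse the text that the cat nodes request returns. The following is a minimal sketch under stated assumptions, not an official client: the names parse_cat_nodes and version_counts are hypothetical, the sample text is copied from the last table above, and the parser assumes that only trailing columns are ever empty (as in the shutting-down row shown earlier). For anything more robust, requesting the cat API output with format=json is likely a better starting point.

```python
from collections import Counter


def parse_cat_nodes(text):
    """Parse `_cat/nodes?v=true` text output into a list of dicts.

    Assumption: cells never contain spaces, and a row with missing
    cells (a node that is shutting down) is only missing trailing ones.
    Missing cells are stored as None.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    nodes = []
    for ln in lines[1:]:
        cells = ln.split()
        cells += [None] * (len(header) - len(cells))  # pad short rows
        nodes.append(dict(zip(header, cells)))
    return nodes


def version_counts(nodes):
    """Count nodes per reported version, skipping rows with no version."""
    return Counter(n["version"] for n in nodes if n["version"])


# Sample copied from the example output above.
SAMPLE = """\
name                  ip          role   master version uptime
instance-0000000001   10.42.1.10  himrst -      9.x.x     5s
instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d
"""

if __name__ == "__main__":
    print(dict(version_counts(parse_cat_nodes(SAMPLE))))
```

While the count shows two versions, the rolling upgrade is still in progress. A node whose row has no version is one that has stopped syncing with the elected master.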

Check node logs

While a node is restarting, you can tail its logs for information about the upgrade-and-restart process. Try filtering for logs related to discovery and cluster formation events. For example:

In Discover on an attached monitoring cluster, apply a Lucene filter on .monitoring*:

"node-join" OR "node-left" OR "master node changed" OR "elected-as-master" OR exitcode OR initializing OR fatal OR "publish_address"

On the host, tail the Elasticsearch logs through a grep filter:

grep -Ei 'node-join|node-left|master node changed|elected-as-master|exitcode|initializing|fatal|publish_address'

Based on your findings, refer to the common error resolutions.
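
If you have already collected a log file and would rather filter it in a script, the same case-insensitive pattern as the grep filter above can be applied in Python. This is only an illustrative sketch: the function name filter_upgrade_events is hypothetical, and the code makes no assumption about where your logs are stored.

```python
import re

# Same alternatives as the grep -Ei filter above, matched case-insensitively.
UPGRADE_EVENTS = re.compile(
    r"node-join|node-left|master node changed|elected-as-master|"
    r"exitcode|initializing|fatal|publish_address",
    re.IGNORECASE,
)


def filter_upgrade_events(lines):
    """Return the lines that mention discovery, cluster formation,
    or startup events, with trailing newlines removed."""
    return [ln.rstrip("\n") for ln in lines if UPGRADE_EVENTS.search(ln)]
```

Because the function accepts any iterable of strings, you can pass it an open file object for a log file of your choice, or a list of lines pulled from another source.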

How a rolling upgrade works

During a rolling upgrade, the cluster continues to operate normally.

New functionality stays inactive, or runs in a backward-compatible mode, until the last node running the earlier version leaves the cluster. New and updated features become fully operational only when every node is running the later version.

Normally, a node running the earlier version leaves the cluster only when you shut it down to upgrade it. The last earlier-version node leaves when there are no more nodes to upgrade.

The following sections describe edge cases that can disrupt this process:

A node unexpectedly leaves the cluster.
Nodes are upgraded out of the recommended order.
The cluster isn't highly available.

Unexpected node disconnect

Because of cluster fault detection, a node running the earlier version might leave the cluster before you deliberately shut it down. The node might stay out temporarily, or indefinitely until you intervene. Recover the node into the cluster before you continue the rolling upgrade.

Note

If a node unexpectedly leaves the cluster during a rolling upgrade, the upgrade might pause to prevent data loss. When this happens, the Deployment Activity shows the status Waiting until cluster recovers and reports fewer nodes than expected.

Premature cluster version update

If all the remaining earlier-version nodes unexpectedly leave the cluster during an upgrade, the cluster does the following:

Reports its state as fully upgraded
Automatically activates new functionality
Leaves its backward-compatible mode

Afterward, you can't return the cluster to a state that's compatible with the earlier-version nodes.

Nodes running the earlier version can no longer join the fully upgraded cluster. Their Elasticsearch logs report "failed to join" errors, with a "Caused by" message such as:

node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater
node with version [x.x.x] may not join a cluster with minimum version [y.y.y]
node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]
handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]

Elasticsearch preserves the data in the data paths of the older nodes and uses it to recover the cluster to health after you fully upgrade them. To bring these nodes back into the cluster, upgrade them.

Note

If a node leaving the cluster causes the cluster health API to report red, the upgrade might pause to protect your data. If this happens, contact us with one of the following:

Your Elastic Cloud Hosted deployment ID
An Elastic Cloud Enterprise diagnostic, run with the --deployments flag for the problematic deployment ID after you attempt to pause and resume the instance

Stopping master-eligible nodes

If you stop half or more of the master-eligible nodes at the same time during the upgrade, the cluster becomes unavailable because too few remain to form a voting quorum.

Production environments should have at least three master-eligible nodes for high availability. In a test or development environment with only one or two master-eligible nodes, you can't avoid stopping half or more of them, so the cluster always becomes unavailable at some point during the upgrade.
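
The "half or more" rule can be worked through as simple arithmetic. The sketch below applies only the rule as stated on this page; the function names are hypothetical, and the code does not model how Elasticsearch manages its voting configuration internally.

```python
def keeps_quorum(master_eligible: int, stopped: int) -> bool:
    """True if fewer than half of the master-eligible nodes are stopped."""
    if master_eligible < 1 or not 0 <= stopped <= master_eligible:
        raise ValueError("invalid node counts")
    return 2 * stopped < master_eligible


def max_safe_to_stop(master_eligible: int) -> int:
    """Largest number of master-eligible nodes you can stop at once
    while still stopping fewer than half of them."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return (master_eligible - 1) // 2
```

With three master-eligible nodes the result is 1, so you can restart them one at a time. With one or two the result is 0, which matches the statement above that such clusters always become unavailable at some point during the upgrade.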

Restart all the stopped master-eligible nodes so the cluster can re-form. This might trigger a premature cluster version update; to reduce this risk, upgrade the master-eligible nodes last.

Common issues

When nodes restart, they can encounter errors that also occur outside of upgrades. The most common are:

Startup failures from misconfigured systemd timeout settings
Startup failures from misconfigured settings that trip bootstrap checks
Errors during node discovery and cluster formation
Circuit breaker or watermark errors from temporary resource shortages
Issues caused by insufficient high availability

The rest of this page covers errors specific to the rolling upgrade itself.

Bootstrap checks

These bootstrap checks occur only during rolling upgrades.

Index compatibility

Elasticsearch indices are compatible across sequential major versions only. When a restarting node tries to load metadata for an outdated, incompatible index, it fails with an error such as:

The index [index-000001] created in version [y-1.x.x] with current compatibility version [y-1.x.x] must be marked as read-only using the setting [index.blocks.write] set to [true] before upgrading to y+1.z.z.
Cannot start this node because it holds metadata for indices with version [y-1.x.x] with which this node of version [y+1.z.z] is incompatible. Revert this node to version [y.y.y] and delete any indices with versions earlier than [y.0.0] before upgrading to version [y+1.z.z]. If all such indices have already been deleted, revert this node to version [y.y.y] and wait for it to join the cluster to clean up any older indices from its metadata.
cannot upgrade node because incompatible indices created with version [y-1.x.x] exist, while the minimum compatible index version is [y.y.y]. Upgrade your older indices by reindexing them in version [y+1.z.z] first

This error means the Upgrade Assistant found issues that still need to be resolved.

Before you begin the upgrade again, revert the node to the earlier version, rejoin it to the cluster, and complete every critical item in the Upgrade Assistant. For more details, refer to Resolve Upgrade Assistant issues.
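
As a rough pre-check, the "sequential major versions only" rule can be expressed as a comparison of major version numbers. The sketch below is a simplification and not a replacement for the Upgrade Assistant: the function names are hypothetical, only the major component of each version string is inspected, and it merely flags indices that need attention (for example reindexing, deleting, or marking read-only, as the error messages above suggest) without deciding which action applies.

```python
def major(version: str) -> int:
    """Return the major component of a version string such as '8.19.0'."""
    return int(version.split(".")[0])


def needs_attention_before_upgrade(index_created: str, target: str) -> bool:
    """True if the index was created more than one major version
    before the target version of the upgrade."""
    return major(index_created) < major(target) - 1
```

For example, an index created in a 7.x release would be flagged before an upgrade to a 9.x release, while an index created in an 8.x release would not.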

Unknown settings

If the Elasticsearch configuration contains settings that are no longer valid in the later version, the node might fail to start with an error such as:

unknown setting [X] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
The configuration setting [X] is required

This error means you didn't fully review breaking changes during preparation. Resolve every unknown setting startup error before you continue. For common examples, refer to Troubleshoot node bootlooping.

Shard allocation issues

You might see shard allocation issues if:

Node upgrades didn't follow the recommended upgrade order
One of the edge cases described earlier occurred

Beyond the common allocation issues, these errors appear only during rolling upgrades:

Incompatible index versions:
- illegal_argument_exception: The index [my_index] was created with version [X.X.X] but the minimum compatible version is [Y.Y.Y]
- java.lang.IllegalStateException: index [my_index] version not supported: X.X.X maximum compatible index version is: Y.Y.Y
Incompatible shard versions:
- cannot allocate replica shard to a node with version [X.X.X] since this is older than the primary version [Y.Y.Y]

If you encounter any of these, continue upgrading your nodes. The data allocates as more nodes reach the later version.

Transform upgrade mode

If upgrade mode is enabled for transforms, you might see errors such as:

Cannot stop any Transform while the Transform feature is upgrading (408)
Transform task will not be assigned while upgrade mode is enabled.

Set enabled=false to exit upgrade mode for transforms.

Machine learning upgrade mode

If you set upgrade_mode for machine learning indices, you might see unexpected errors after the upgrade, such as:

You don't have permission to manage Machine Learning jobs. Access to the plugin requires the Machine Learning feature to be visible in this space.
Index migration in progress. Indices related to Machine Learning are currently being upgraded. Some actions will not be available during this time.

Set enabled=false to exit upgrade mode for machine learning.