
Troubleshoot upgrades

Most Elasticsearch upgrades succeed without issues, as long as you plan and prepare for them carefully. This page describes the problems you're most likely to encounter during a rolling upgrade, and how to resolve them.

You can avoid most of these issues by completing the steps in the Upgrade Assistant before you start. For more information, refer to Resolve Upgrade Assistant issues.

During a rolling upgrade, Elasticsearch supports running two versions at the same time (the earlier version and the later version), but never more than two, and only for the duration of the upgrade.

If your nodes share the same configuration (other than node roles), and you follow the recommended upgrade order, any potential issues will surface as you upgrade the first node.

Monitor the upgrade at a high level by checking the list of cluster nodes, and at a low level by tailing the logs of the restarting node.

To monitor which nodes have been upgraded, use the cat nodes API:

    GET _cat/nodes?v=true&h=name,ip,role,master,version,uptime&s=uptime

In an example three-node cluster, the first node's upgrade progresses as follows:

  1. All nodes are present in the cluster.

    name                  ip          role   master version uptime
    instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
    instance-0000000001   10.42.1.10  himrst -      8.19.x   20d
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d
    		
  2. As the node shuts down, it stops syncing with the elected master.

    name                  ip          role   master version uptime
    instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
    instance-0000000001
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d
    		
  3. The elected master removes the node from the cluster, so it no longer appears.

    name                  ip          role   master version uptime
    instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d
    		
  4. After the node restarts and rejoins the cluster, it appears again, now running the later version.

    name                  ip          role   master version uptime
    instance-0000000001   10.42.1.10  himrst -      9.x.x     5s
    instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
    tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d
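
To track the progression above in a script rather than by eye, the cat nodes output can be parsed into per-node records. The following is a minimal sketch, assuming the column set requested in the earlier `GET _cat/nodes` call. The function names are illustrative and not part of any Elasticsearch client, and the sample reuses the placeholder versions from step 4.

```python
from collections import Counter

# Sample taken from step 4 above (first node upgraded, two still pending).
# "9.x.x" and "8.19.x" are the placeholder versions used in this page.
SAMPLE = """\
name                  ip          role   master version uptime
instance-0000000001   10.42.1.10  himrst -      9.x.x     5s
instance-0000000000   10.42.4.93  himrst *      8.19.x   20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x   20d
"""


def parse_cat_nodes(text):
    """Turn whitespace-separated cat nodes output into dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    nodes = []
    for line in lines[1:]:
        fields = line.split()
        # A node that is shutting down can appear with only its name (step 2),
        # so pad the missing columns with None.
        fields += [None] * (len(header) - len(fields))
        nodes.append(dict(zip(header, fields)))
    return nodes


def version_counts(nodes):
    """Count nodes per reported version; None means the version column was empty."""
    return Counter(node["version"] for node in nodes)


def pending_nodes(nodes, target_version):
    """Names of nodes that do not yet report the target version."""
    return [node["name"] for node in nodes if node["version"] != target_version]


if __name__ == "__main__":
    nodes = parse_cat_nodes(SAMPLE)
    print(dict(version_counts(nodes)))
    print(pending_nodes(nodes, "9.x.x"))
```

More than one key in `version_counts` indicates a mixed-version cluster, which is expected only while the rolling upgrade is in progress.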
    		

If a node doesn't rejoin the cluster, inspect its restart logs.

While a node is restarting, you can tail its logs for information about the upgrade-and-restart process. Try filtering for logs related to discovery and cluster formation events. For example:

  • In Discover on an attached monitoring cluster, apply a Lucene filter on .monitoring*:

    "node-join" OR "node-left" OR "master node changed" OR "elected-as-master" OR exitcode OR initializing OR fatal OR "publish_address"
  • On the host, tail the Elasticsearch logs through a grep filter:

    grep -Ei 'node-join|node-left|master node changed|elected-as-master|exitcode|initializing|fatal|publish_address'
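
If you prefer to post-process logs in a script, the same keyword filter can be expressed in Python. This is a rough equivalent of the grep filter above, not an official tool, and the function name is illustrative.

```python
import re

# Same alternation as the grep -Ei filter above; IGNORECASE mirrors the -i flag.
DISCOVERY_PATTERN = re.compile(
    r"node-join|node-left|master node changed|elected-as-master|"
    r"exitcode|initializing|fatal|publish_address",
    re.IGNORECASE,
)


def iter_discovery_lines(lines):
    """Yield only the lines that mention a discovery or cluster formation keyword."""
    for line in lines:
        if DISCOVERY_PATTERN.search(line):
            yield line
```

Because the function is a generator, it can consume a live stream: iterate over `iter_discovery_lines(sys.stdin)` and write each line to `sys.stdout` while piping the node's log into the script.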
    		

Based on your findings, refer to the common error resolutions.

During a rolling upgrade, the cluster continues to operate normally.

New functionality stays inactive, or runs in a backward-compatible mode, until the last node running the earlier version leaves the cluster. New and updated features become fully operational only when every node is running the later version.

Normally, a node running the earlier version leaves the cluster only when you shut it down to upgrade it. The last earlier-version node leaves when there are no more nodes to upgrade.

The following sections describe edge cases that can disrupt this process:

Cluster fault detection might remove a node running the earlier version from the cluster before you deliberately shut it down. The node can stay out temporarily, or indefinitely until you intervene. Recover the node into the cluster before you continue the rolling upgrade.

Note

If a node unexpectedly leaves the cluster during a rolling upgrade, the upgrade might pause to prevent data loss. When this happens, the Deployment Activity shows the status "Waiting until cluster recovers" and reports fewer nodes than expected.

If all the remaining earlier version nodes unexpectedly leave the cluster during an upgrade, the cluster does the following:

  • Reports its state as fully upgraded
  • Automatically activates new functionality
  • Leaves its backward-compatible mode

Afterward, you can't return the cluster to a state that's compatible with the earlier version nodes.

Nodes running the earlier version can no longer join the fully upgraded cluster. Their Elasticsearch logs report "failed to join" errors, with a "Caused by" message such as:

  • node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater
  • node with version [x.x.x] may not join a cluster with minimum version [y.y.y]
  • node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]
  • handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]

To bring these nodes back into the cluster, upgrade them. Elasticsearch preserves the data in the data paths of the older nodes and uses it to restore cluster health after they rejoin.

Note

If a node leaving the cluster causes the cluster health API to report red, the upgrade might pause to protect your data. If this happens, contact us with one of the following:

If you stop half or more of the master-eligible nodes at the same time during the upgrade, the cluster becomes unavailable because too few remain to form a voting quorum.

Production environments should have at least three master-eligible nodes for high availability. In a test or development environment with only one or two master-eligible nodes, you can't avoid stopping half or more of them, so the cluster always becomes unavailable at some point during the upgrade.

Restart all the stopped master-eligible nodes so the cluster can re-form. This might trigger a premature cluster version update; to reduce this risk, upgrade the master-eligible nodes last.
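
The "half or more" rule above reduces to simple arithmetic. The sketch below encodes only that rule of thumb as stated on this page. It does not model the cluster's actual voting configuration, which Elasticsearch manages itself, and the function names are illustrative.

```python
def loses_quorum(master_eligible: int, stopped: int) -> bool:
    """True if stopping `stopped` of `master_eligible` nodes at once stops half or more."""
    if master_eligible < 1 or not 0 <= stopped <= master_eligible:
        raise ValueError("expected at least one node and 0 <= stopped <= master_eligible")
    return 2 * stopped >= master_eligible


def max_safe_to_stop(master_eligible: int) -> int:
    """Largest number of master-eligible nodes that is still fewer than half."""
    if master_eligible < 1:
        raise ValueError("expected at least one master-eligible node")
    return (master_eligible - 1) // 2


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, max_safe_to_stop(n))
```

With one or two master-eligible nodes the result is 0, which matches the statement that such clusters always become unavailable at some point during the upgrade. With three, you can stop one node at a time.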

When nodes restart, they can encounter errors that also occur outside of upgrades. The most common are:

The rest of this page covers errors specific to the rolling upgrade itself.

These bootstrap checks occur only during rolling upgrades.

Elasticsearch indices are compatible across sequential major versions only. When a restarting node tries to load metadata for an outdated, incompatible index, it fails with an error such as:

  • The index [index-000001] created in version [y-1.x.x] with current compatibility version [y-1.x.x] must be marked as read-only using the setting [index.blocks.write] set to [true] before upgrading to y+1.z.z.
  • Cannot start this node because it holds metadata for indices with version [y-1.x.x] with which this node of version [y+1.z.z] is incompatible. Revert this node to version [y.y.y] and delete any indices with versions earlier than [y.0.0] before upgrading to version [y+1.z.z]. If all such indices have already been deleted, revert this node to version [y.y.y] and wait for it to join the cluster to clean up any older indices from its metadata.
  • cannot upgrade node because incompatible indices created with version [y-1.x.x] exist, while the minimum compatible index version is [y.y.y]. Upgrade your older indices by reindexing them in version [y+1.z.z] first

This error means the Upgrade Assistant found issues that still need to be resolved.

Before you begin the upgrade again, revert the node to the earlier version, rejoin it to the cluster, and complete every critical item in the Upgrade Assistant. For more details, refer to Resolve Upgrade Assistant issues.
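
As a rough pre-check, the "sequential major versions only" rule can be applied to a list of index creation versions. The sketch below is a simplification under stated assumptions: it compares major version numbers only, it does not account for indices marked read-only (which the first error message above treats separately), and it is no substitute for the Upgrade Assistant. The index names, versions, and function names are hypothetical.

```python
def major(version: str) -> int:
    """Extract the major component from a dotted version string such as '8.15.0'."""
    return int(version.split(".", 1)[0])


def incompatible_indices(created_versions: dict, target_version: str) -> list:
    """Names of indices created more than one major version before the target."""
    target = major(target_version)
    return sorted(
        name
        for name, created in created_versions.items()
        if target - major(created) > 1
    )


if __name__ == "__main__":
    # Hypothetical inventory: index name -> version the index was created in.
    inventory = {"index-000001": "7.10.2", "logs-2024": "8.15.0"}
    print(incompatible_indices(inventory, "9.0.0"))
```

Any index this check returns would need attention in the Upgrade Assistant before the node can load its metadata on the later version.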

If the Elasticsearch configuration contains settings that are no longer valid in the later version, the node might fail to start with an error such as:

  • unknown setting [X] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
  • The configuration setting [X] is required

This error means you didn't fully review breaking changes during preparation. Resolve every unknown setting startup error before you continue. For common examples, refer to Troubleshoot node bootlooping.

You might see shard allocation issues if:

Beyond the common allocation issues, these errors appear only during rolling upgrades:

  • incompatible index versions

    • illegal_argument_exception: The index [my_index] was created with version [X.X.X] but the minimum compatible version is [Y.Y.Y]
    • java.lang.IllegalStateException: index [my_index] version not supported: X.X.X maximum compatible index version is: Y.Y.Y
  • incompatible shard versions

    • cannot allocate replica shard to a node with version [X.X.X] since this is older than the primary version [Y.Y.Y]

If you encounter any of these, continue upgrading your nodes. The data allocates as more nodes reach the later version.

Post-upgrade issues

These issues can appear after an Elasticsearch upgrade if specific upgrade tasks remain unfinished.

Kibana availability

If Kibana doesn't start after its upgrade, or reports "Kibana server is not ready yet", make sure you re-enabled shard allocation.
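
Re-enabling shard allocation means resetting the `cluster.routing.allocation.enable` setting through the cluster update settings API. The sketch below only builds the HTTP request with the standard library, so you can inspect it before sending it. The base URL is an assumption, authentication and TLS are omitted, and the function name is illustrative.

```python
import json
import urllib.request


def reenable_allocation_request(base_url: str) -> urllib.request.Request:
    """Build a PUT _cluster/settings request that resets allocation to its default."""
    # A null value removes the persistent override, restoring the default behavior.
    body = {"persistent": {"cluster.routing.allocation.enable": None}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumed local, unsecured endpoint; adjust for your deployment.
    request = reenable_allocation_request("http://localhost:9200")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To send the request, pass it to `urllib.request.urlopen` once you have added whatever credentials your cluster requires.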

Transform upgrade mode

If you set upgrade_mode for transform indices, you might see unexpected errors after the upgrade, such as:

  • Cannot stop any Transform while the Transform feature is upgrading (408)
  • Transform task will not be assigned while upgrade mode is enabled.

To exit upgrade mode for transforms, call the transform set upgrade mode API with enabled=false.

Machine learning upgrade mode

If you set upgrade_mode for machine learning indices, you might see unexpected errors after the upgrade, such as:

  • You don't have permission to manage Machine Learning jobs. Access to the plugin requires the Machine Learning feature to be visible in this space.
  • Index migration in progress. Indices related to Machine Learning are currently being upgraded. Some actions will not be available during this time.

To exit upgrade mode for machine learning, call the machine learning set upgrade mode API with enabled=false.
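
Both upgrade-mode resets follow the same pattern. The sketch below assumes the endpoints are `POST _transform/set_upgrade_mode` and `POST _ml/set_upgrade_mode`, each taking an `enabled` query parameter; confirm them against the API reference for your version. It only builds the requests, the base URL is an assumption, authentication is omitted, and the function name is illustrative.

```python
import urllib.request

# Assumed endpoint paths for the two set upgrade mode APIs.
UPGRADE_MODE_PATHS = {
    "transform": "/_transform/set_upgrade_mode",
    "ml": "/_ml/set_upgrade_mode",
}


def exit_upgrade_mode_request(base_url: str, feature: str) -> urllib.request.Request:
    """Build a POST request that sets enabled=false for the given feature."""
    if feature not in UPGRADE_MODE_PATHS:
        raise ValueError(f"unknown feature: {feature!r}")
    url = base_url.rstrip("/") + UPGRADE_MODE_PATHS[feature] + "?enabled=false"
    return urllib.request.Request(url=url, method="POST")


if __name__ == "__main__":
    for feature in ("transform", "ml"):
        request = exit_upgrade_mode_request("http://localhost:9200", feature)
        print(request.get_method(), request.full_url)
```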