Perform ECE hosts maintenance

This page describes how to safely perform maintenance on hosts in your ECE installation. Host maintenance refers to actions that are not part of taking care of Elastic Cloud Enterprise itself and that you might need to perform for a number of different reasons, including:

To apply urgent operating system patches or hot fixes
To perform regularly scheduled software or hardware upgrades
To enable new features, such as encryption of data at rest
To meet updated installation prerequisites

Overview

This section outlines the available methods for performing host maintenance in ECE.

Which method you choose depends on the impact of the maintenance on ECE services running on the host. Low-risk changes such as OS patching might only require stopping container services and restarting the host. More disruptive changes, such as hardware replacements or major operating system upgrades, are typically safer to perform by removing and reinstalling the host.

If your host maintenance could disrupt ECE, use the method that deletes the host from your installation. All described methods include a step that vacates the affected hosts by moving all Elastic Stack instances off them and are generally considered safe, provided that your ECE installation still has sufficient resources available to operate after the host has been removed.

Individual hosts maintenance

Use the following methods to perform maintenance on one or more ECE hosts while keeping the rest of the platform running.

Disable services and restart the host (nondestructive)
- For Docker-based installations: disable the Docker service
- For Podman-based installations: disable the Podman-related services
Remove and reinstall the host (destructive)

Entire ECE installation maintenance

Use this method when the maintenance activity requires shutting down the entire ECE platform.

Shut down all ECE hosts

Single or multiple hosts maintenance

The following methods allow you to perform maintenance on individual hosts while keeping the rest of the platform running.

Disable services and restart the host (nondestructive)

The way that you disable container services differs based on the platform you used to deploy your ECE hosts.

For Docker-based installations: disable the Docker service

This method lets you perform maintenance actions on hosts without first removing the associated host from your Elastic Cloud Enterprise installation. It works by disabling the Docker daemon. The host remains a part of your ECE installation throughout these steps but will be offline and the resources it provides will not be available.

To perform host maintenance:

Recommended: If the host holds the allocator role and you have enough spare capacity:
1. Enable maintenance mode on the allocator.
2. Move all nodes off the allocator and to other allocators in your installation. Moving all nodes lets you retain the same level of redundancy for highly available Elasticsearch clusters and ensures that other clusters without high availability remain available.
Important

Skipping Step 1 will affect the availability of clusters with nodes on the allocator.

Disable the Docker daemon:

    sudo systemctl disable docker
    sudo systemctl disable docker.socket

Reboot the host:
```
sudo reboot
```
Perform your maintenance on the host, such as patching the operating system.

Enable the Docker daemon:

    sudo systemctl enable docker
    sudo systemctl enable docker.socket

Reboot the host again:
```
sudo reboot
```
If you enabled maintenance mode in Step 1: Take the allocator out of maintenance mode.
Optional for allocators: ECE will start using the allocator again as you create new or change existing clusters, but it will not automatically redistribute nodes to an allocator after it becomes available. If you want to move nodes back to the same allocator after host maintenance, you need to manually move the nodes and specify the allocator as a target.
Verify that all ECE services and deployments are back up by checking that the host shows a green status in the Cloud UI.

After the host shows a green status in the Cloud UI, it is fully functional again and can be used as before.
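
As a minimal sketch, not part of the official procedure, the helper below can confirm that both Docker units are disabled before the first reboot. The function name `all_units_disabled` is hypothetical. It only reads the state lines that `systemctl is-enabled` prints, one per unit, and assumes a systemd-based host.

```shell
# Reads unit states on stdin, one per line, as printed by
# `systemctl is-enabled <unit> [<unit>...]`.
# Succeeds only if at least one line was read and every line is "disabled".
all_units_disabled() {
  seen=0
  while IFS= read -r state; do
    seen=1
    [ "$state" = "disabled" ] || return 1
  done
  [ "$seen" -eq 1 ]
}
```

For example, `systemctl is-enabled docker docker.socket | all_units_disabled && echo "Docker stays down after reboot"` prints the message only when neither unit will start at boot.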

For Podman-based installations: disable the Podman-related services

This method lets you perform maintenance actions on hosts without first removing the associated host from your Elastic Cloud Enterprise installation. It works by disabling the Podman-related services. The host remains a part of your ECE installation throughout these steps but will be offline and the resources it provides will not be available.

To perform host maintenance:

Recommended: If the host holds the allocator role and you have enough spare capacity:
1. Enable maintenance mode on the allocator.
2. Move all nodes off the allocator and to other allocators in your installation. Moving all nodes lets you retain the same level of redundancy for highly available Elasticsearch clusters and ensures that other clusters without high availability remain available.
Important

Skipping Step 1 will affect the availability of clusters with nodes on the allocator.

Disable the Podman service, Podman socket, and Podman restart service:

    sudo systemctl disable podman.service
    sudo systemctl disable podman.socket
    sudo systemctl disable podman-restart.service

Reboot the host:
```
sudo reboot
```
After rebooting, confirm there are no running containers by running the following command. The output should be empty.
```
sudo podman ps
```
If an frc-* or fac-* container is returned in the output, stop it:
```
sudo podman stop $(sudo podman ps -a --filter "name=fac" --filter "name=frc" --format "{{.ID}}")
```
Perform your maintenance on the host, such as patching the operating system.

Re-enable the Podman-related services:

    sudo systemctl enable podman.service
    sudo systemctl enable podman.socket
    sudo systemctl enable podman-restart.service

Reboot the host again:
```
sudo reboot
```
Confirm the containers have started:
```
sudo podman ps -a
```
The -a flag lists all containers, including stopped ones, so that none are overlooked.
If you enabled maintenance mode in Step 1, take the allocator out of maintenance mode.
Optional for allocators: ECE will start using the allocator again as you create new or change existing clusters, but it will not automatically redistribute nodes to an allocator after it becomes available. If you want to move nodes back to the same allocator after host maintenance, you need to manually move the nodes and specify the allocator as a target.
Verify that all ECE services and deployments are back up by checking that the host shows a green status in the Cloud UI.

After the host shows a green status in the Cloud UI, it is fully functional again and can be used as before.
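
The container check that follows the first reboot in the procedure above can be sketched as a small filter. This is an illustration only: the function name `leftover_ece_containers` is hypothetical, and it assumes that ECE-managed container names start with `frc-` or `fac-`, as the step above describes.

```shell
# Reads container names on stdin, one per line, for example from
# `sudo podman ps --format '{{.Names}}'`.
# Prints the names that start with "frc-" or "fac-" and returns 1 if any
# were found; prints nothing and returns 0 otherwise.
leftover_ece_containers() {
  leftover=$(grep -E '^(frc|fac)-' || true)
  if [ -n "$leftover" ]; then
    printf '%s\n' "$leftover"
    return 1
  fi
  return 0
}
```

For example, `sudo podman ps --format '{{.Names}}' | leftover_ece_containers` prints nothing and succeeds when no ECE containers are running, which is the state you want before starting maintenance.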

Remove and reinstall the host (destructive)

This method lets you perform potentially destructive maintenance actions on hosts. It works by deleting the associated host, which removes the host from your Elastic Cloud Enterprise installation. To add the host to your ECE installation again after host maintenance is complete, you must reinstall ECE.

To perform host maintenance:

If the host holds the allocator role:
1. Enable maintenance mode on the allocator.
2. Move all nodes off the allocator and to other allocators in your installation. Moving all nodes lets you retain the same level of redundancy for highly available clusters and ensures that other clusters without high availability remain available.
  
  Important
  
  Do not skip this step or you will affect the availability of clusters with nodes on the allocator. You are in the process of removing the host from your installation and whatever ECE artifacts are stored on it will be lost.
Delete the host from your ECE installation.
Perform the maintenance on your host, such as enabling encryption of data at rest.
Reinstall ECE on the host as if it were a new host and assign the same roles as before.
Optional for allocators: ECE will start using the allocator again as you create new or change existing clusters, but it will not automatically redistribute nodes to an allocator after it becomes available. If you want to move nodes back to the same allocator after host maintenance, you need to manually move the nodes and specify the allocator as a target.

After the host shows a green status in the Cloud UI, the host is part of your ECE installation again and can be used as before.

Entire ECE installation maintenance

The following method is used when maintenance requires stopping the entire ECE installation.

Shut down all ECE hosts

This method lets you temporarily shut down all hosts in your ECE platform, for example, for data center moves or planned power outages. It is offered as a non-guaranteed and less destructive alternative to fully rebuilding your ECE infrastructure.

To shut down all ECE hosts:

Stop routing requests on all non-system deployments to avoid unnecessary incoming traffic during your shutdown.
Make sure all Elasticsearch clusters of all deployments are healthy.
Take a successful snapshot on each deployment, including system deployments.
Disable traffic from load balancers.
Shut down all allocators.
Shut down all non-director hosts.
Shut down directors.
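
The shutdown order in the steps above can be sketched as a small sorting helper. The inventory format (a hostname followed by comma-separated roles) and the function name `shutdown_order` are hypothetical. Placing any host that holds the director role in the last phase, even if it is also an allocator, is an assumption rather than documented behavior.

```shell
# Input on stdin: one host per line, "<hostname> <comma-separated roles>"
# (hypothetical inventory format).
# Output: "<phase> <hostname>", sorted by phase and then hostname:
#   1 = has the allocator role but not the director role
#   2 = any other host without the director role
#   3 = any host with the director role (assumption: directors always go last)
shutdown_order() {
  awk '
    NF >= 2 {
      roles = "," $2 ","
      if (index(roles, ",director,"))       phase = 3
      else if (index(roles, ",allocator,")) phase = 1
      else                                  phase = 2
      print phase, $1
    }
  ' | sort -k1,1n -k2,2
}
```

Feeding an inventory file to the function, for example `shutdown_order < hosts.txt`, gives a list you can work through from top to bottom. Reversing the phases gives the start-up order only roughly, because the start-up steps require a ZooKeeper quorum check after the directors are up.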

Guidance on terminating deployments

Do not terminate system deployments, as it can cause issues and you may lose access to the Cloud UI.
As a general best practice, we do not recommend terminating the deployments that run your workloads. Terminating a deployment deletes all of its resources, and you will need to restore the data from a snapshot later.

After performing maintenance, start up the ECE hosts:

Start all directors.
Verify that there is a healthy ZooKeeper quorum: at least one server should report zk_server_state leader, and both zk_followers and zk_synced_followers should match the number of ZooKeeper followers:
```
docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo "$(hostname) port is $i" && echo mntr | nc localhost ${i}; done'
```
Start all remaining hosts.
Re-enable traffic from load balancers.
Re-enable routing requests based on deployment priority.
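
The ZooKeeper quorum check in the start-up steps above can be automated with a short parser for the `mntr` output. This is a sketch under stated assumptions: the function name `zk_quorum_ok` is hypothetical, the expected follower count is passed in by you, and the check treats the quorum as healthy only when both `zk_followers` and `zk_synced_followers` equal that count.

```shell
# Usage: <mntr output of all directors> | zk_quorum_ok <expected_followers>
# Succeeds if at least one server reports "zk_server_state leader" and both
# zk_followers and zk_synced_followers equal the expected follower count.
zk_quorum_ok() {
  awk -v want="$1" '
    BEGIN { followers = -1; synced = -1 }
    $1 == "zk_server_state" && $2 == "leader" { leaders++ }
    $1 == "zk_followers"                      { followers = $2 }
    $1 == "zk_synced_followers"               { synced = $2 }
    END { exit !(leaders >= 1 && followers == want && synced == want) }
  '
}
```

With three directors you would expect one leader and two followers, so you can collect the output of the `docker exec` command from every director and pipe it into `zk_quorum_ok 2`. A non-zero exit status means the quorum is not yet healthy and the remaining hosts should not be started.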