When installing ECE, you will notice that several Elasticsearch clusters get created as part of the installation process. Those are the system clusters which are part of the ECE control plane. You must make sure that they are configured and sized correctly to ensure you have a production-ready installation.
We will review each cluster and provide recommendations to make sure that you are following best practices when starting your ECE journey.
By default, the system clusters have a dedicated
system_owned flag set to
true to avoid mistakenly changing the configuration of those clusters. Most configuration changes suggested in this section do not require this flag to be set to
false, but there are some cases where changing the flag might be required. If you do change this flag, always make sure to set it back to
true once you have completed the changes. The flag can be set by navigating to the Data section in the Advanced cluster configuration page.
- Admin console -
- Stores the state of your deployments, plans, and other operational data. If this cluster is not available, there will be several unexpected behaviors in the Cloud UI, such as stale or wrong status indicators for deployments, allocators, runners, and more.
- Logging and metrics -
- As part of an ECE environment, a Beats sidecar with Filebeat and Metricbeat is installed on each ECE host. The logs and metrics collected by those beats are indexed in the
logging-and-metricscluster. This includes ECE service logs, such as proxy logs, director logs, and more. It also includes hosted deployments logs, security cluster audit logs, and metrics, such as CPU and disk usage. Data is collected from all hosts. This information is critical in order to be able to monitor ECE and troubleshoot issues. You can also use this data to configure watches to alert you in case of an issue, or machine learning jobs that can provide alerts based on anomalies or forecasting.
- Security -
- When you enable the user management feature, you trigger the creation of a third system cluster named
security. This cluster stores all security-related configurations, such as native users and the related native realm, integration with SAML or LDAP as external authentication providers and their role mapping, and the realm ordering. The health of this cluster is critical to provide access to the ECE Cloud UI and REST API. To learn more, see Configure role-based access control.
ECE supports the concept of availability zones and, as described in the Playbook for production, requires three availability zones to be configured for the best fault tolerance. Alternatively, you can also configure ECE with two availability zones.
The system clusters are created when you install ECE or enable the user management feature, at which point they are not yet configured for high availability. As soon as you finish the installation process, you should change the configuration to ensure your system clusters are highly available and deployed across two or three availability zones. To configure your system clusters to be highly available, navigate to the Edit page for the cluster and change the number of availability zones under Fault tolerance.
logging-and-metrics cluster, you might want to also make sure that your Kibana instance and other components are deployed across multiple availability zones, since you will often access that cluster using Kibana. You can change the availability zones for Kibana on the same Edit page.
ECE lets you manage snapshot repositories, so that you can back up and restore your clusters. This mechanism allows you to centrally manage your snapshot repositories, assigning them to deployments, and restoring snapshots to an existing or new deployment.
security clusters have a key role in making sure your ECE installation is operational, it’s important that you configure a snapshot repository after you complete your ECE installation and enable snapshots for both the
security clusters, so that you can easily restore them if needed.
As mentioned earlier, the
logging-and-metrics cluster stores important information about your environment logs and metrics. There are also additional configurations provided out-of-the-box, such as index patterns, visualizations, and dashboards, that will require running an external script to recreate if you do not have a snapshot to restore from. We recommend that you also back up the
logging-and-metrics cluster, though it is up to you to decide if that information should be available to be restored.
To configure snapshot repositories, see Add snapshot repository configurations.
security clusters require relatively small amounts of RAM and almost no disk space, so increasing their size to 4 GB or 8 GB RAM per data node should be sufficient.
logging-and-metric cluster should be sized according to the expected workload, which will affect the daily ingest size and which depends on the number of ECE hosts, deployments, and which logs will be enabled, such as slow logs, audit logs, and more. As with any other time-series data, you should also make sure to properly manage your indices and delete old indices based on your desired retention period.
In the case of the
security system clusters, the team managing ECE and assigned to the platform admin role should have permission to change each system cluster configuration and also to access each cluster itself.
logging-and-metrics cluster is different since, as an ECE admin, you likely want to provide users with access to the cluster in
order to troubleshoot issues without your assistance, for example. In order to manage access to that cluster, you can configure roles that
will provide access to the relevant indices, map those to users, and manage access to Kibana by leveraging the Elastic security
integration with your authentication provider, such as LDAP, SAML, or AD. To configure one of those security realms, see
LDAP, Active Directory or SAML.
Enabling integration with external authentication provider requires that you set the
system_owned flag to
false in order to change the elasticsearch.yaml configuration. Remember to set the flag back to
true after you are done.