Collection of best practices and other information around running Elasticsearch on AWS.
When selecting disk please be aware of the following order of preference:
- EFS - Avoid as the sacrifices made to offer durability, shared storage, and grow/shrink come at performance cost, such file systems have been known to cause corruption of indices, and due to Elasticsearch being distributed and having built-in replication, the benefits that EFS offers are not needed.
- EBS - Works well if running a small cluster (1-2 nodes) and cannot tolerate the loss all storage backing a node easily or if running indices with no replicas. If EBS is used, then leverage provisioned IOPS to ensure performance.
- Instance Store - When running clusters of larger size and with replicas the ephemeral nature of Instance Store is ideal since Elasticsearch can tolerate the loss of shards. With Instance Store one gets the performance benefit of having disk physically attached to the host running the instance and also the cost benefit of avoiding paying extra for EBS.
Prefer Amazon Linux AMIs as since Elasticsearch runs on the JVM, OS dependencies are very minimal and one can benefit from the lightweight nature, support, and performance tweaks specific to EC2 that the Amazon Linux AMIs offer.
Networking throttling takes place on smaller instance types in both the form of bandwidth and number of connections. Therefore if large number of connections are needed and networking is becoming a bottleneck, avoid instance types with networking labeled as
- When running in multiple availability zones be sure to leverage shard allocation awareness so that not all copies of shard data reside in the same availability zone.
- Do not span a cluster across regions. If necessary, use a cross cluster search.
- If you have split your nodes into roles, consider tagging the EC2 instances by role to make it easier to filter and view your EC2 instances in the AWS console.
- Consider enabling termination protection for all of your instances to avoid accidentally terminating a node in the cluster and causing a potentially disruptive reallocation.