How to design your Elasticsearch data storage architecture for scale | Elastic Blog

How to design your Elasticsearch data storage architecture for scale

Elasticsearch allows you to store, search, and analyze large amounts of structured and unstructured data. This speed, scale, and flexibility make the Elastic Stack a powerful solution for a wide variety of use cases, like system observability, security (threat hunting and prevention), enterprise search, and more. Because of this flexibility, effectively architecting your deployment’s data storage for scale is incredibly important.

Now, it stands to reason that every Elasticsearch use case is different. Your own use case, deployment, and business situation will have certain tolerances and thresholds for things like total cost of ownership, ingest performance, query performance, number and size of backups, mean time to recovery, and more.

So, as you begin to think about these various factors, there are three questions you might consider to help short-circuit what could otherwise be a complex decision matrix:

  • How much data loss can your use case / deployment withstand?
  • How much does performance factor into your business objectives?
  • How much downtime can your project handle?

In this blog we will review several data storage options you can use and we’ll discuss the various pros and cons of each. By the end of this blog, you will have a better understanding of how to architect your own (unique) Elasticsearch deployment’s data storage for scale.

Don’t want to worry about any of this? Some good news: adopting Elasticsearch Service on Elastic Cloud means we’ll handle architecting for scale for you.

Options at-a-glance

Below is a quick reference for the data storage options we will be covering in this blog. Later sections outline the pros and cons in more detail, along with what to expect with regard to data loss, performance, and downtime.

|                 | RAID 0 | RAID 1         | RAID 5         | RAID 6          | Multiple Data Paths |
|-----------------|--------|----------------|----------------|-----------------|---------------------|
| Data protection | None   | 1 disk failure | 1 disk failure | 2 disk failures | None                |
| Performance*    | NX     | X/2            | (N - 1)X       | (N - 2)X        | 1X to NX            |
| Capacity        | 100%   | 50%            | 67% - 94%      | 50% - 88%       | 100%                |

Pros
  • RAID 0: Easy setup; high performance; high capacity; Elasticsearch sees only 1 disk
  • RAID 1: High data protection; easy recovery; Elasticsearch sees only 1 disk
  • RAID 5: Medium data protection; easy recovery; medium to high capacity; medium to high performance gains; Elasticsearch sees only 1 disk
  • RAID 6: High data protection; easy recovery; medium to high capacity; medium to high performance gains; Elasticsearch sees only 1 disk
  • Multiple Data Paths: Easy setup; add disks; high capacity

Cons
  • RAID 0: No fault tolerance; long recovery; potential for permanent data loss if there are no replica shards
  • RAID 1: Low capacity; no performance gains for writes; can only use 2 disks
  • RAID 5: Some capacity loss; only 1 disk can fail before the array fails; potentially long recovery; if more than 1 disk fails the potential for data loss exists; during recovery the array’s performance is reduced
  • RAID 6: Small to medium capacity loss; potentially long recovery; during a recovery the array’s performance is reduced
  • Multiple Data Paths: Shards not balanced between paths; single disk watermark affects the whole node; performance is not consistent; disks are not hot swappable; adding a disk can lead to hot spotting

* N represents the number of disks in the volume. X represents the number of read/write IOPS that a single disk is capable of. Higher numbers mean more performance.
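
To get a feel for how the numbers in the quick reference scale with the number of disks, here is a minimal Python sketch of the formulas for the striped layouts (RAID 0, 5, and 6). It is not an official sizing tool: `VolumeEstimate` and `estimate` are hypothetical names made up for this example, and the minimum disk counts (3 for RAID 5, 4 for RAID 6) are the usual minimums for those levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeEstimate:
    protection: int     # disk failures tolerated without losing the array
    performance: float  # in multiples of X, the IOPS of a single disk
    capacity: float     # usable fraction of the raw disk capacity


def estimate(layout: str, n: int) -> VolumeEstimate:
    """Apply the at-a-glance formulas to a volume of n disks."""
    if layout == "raid0":
        if n < 1:
            raise ValueError("RAID 0 needs at least 1 disk")
        return VolumeEstimate(0, float(n), 1.0)
    if layout == "raid5":
        if n < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return VolumeEstimate(1, float(n - 1), (n - 1) / n)
    if layout == "raid6":
        if n < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return VolumeEstimate(2, float(n - 2), (n - 2) / n)
    raise ValueError(f"unknown layout: {layout}")


if __name__ == "__main__":
    # Compare the three layouts for a few array sizes.
    for disks in (4, 8, 16):
        for layout in ("raid0", "raid5", "raid6"):
            print(disks, layout, estimate(layout, disks))
```

The capacity ranges in the quick reference correspond to small and large arrays: a 3-disk RAID 5 keeps about 67% of raw capacity, and a 4-disk RAID 6 keeps 50%. The percentages climb as disks are added because the parity overhead stays fixed at one or two disks.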

As you begin to look at scaling your disk capacity there are a few good options to choose from. Let’s take a look at some of these and discuss the pros and cons of each. As each situation is unique, there isn’t one path that can work for everyone.

RAID 0

RAID has been a cornerstone for combining multiple disks for decades. RAID has three components: mirroring, striping, and parity. Each number in RAID indicates a unique combination of these components.

The number 0 represents striping in RAID. Striping is splitting up data into chunks and writing those chunks across all disks in the volume.

Performance and capacity

Striping improves read/write performance as all disks are able to write in parallel. In effect, this will multiply your writes and reads by how many disks you have in the array. So if you have 6 disks in a RAID 0 array then you would have ~6x read/write speed.
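
To make striping concrete, here is a small Python sketch that deals chunks of data round-robin across a set of in-memory "disks" and reads them back. It is only an illustration of the idea: real RAID controllers work on fixed-size blocks at the device level, and the `stripe` and `unstripe` function names are hypothetical.

```python
def stripe(data: bytes, disks: int, chunk_size: int) -> list[list[bytes]]:
    """Split data into chunks and deal them round-robin across the disks."""
    layout: list[list[bytes]] = [[] for _ in range(disks)]
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        layout[chunk_index % disks].append(data[offset:offset + chunk_size])
    return layout


def unstripe(layout: list[list[bytes]]) -> bytes:
    """Read the chunks back in the order they were written."""
    rows = max(len(disk) for disk in layout)
    chunks = []
    for row in range(rows):
        for disk in layout:
            if row < len(disk):
                chunks.append(disk[row])
    return b"".join(chunks)


if __name__ == "__main__":
    layout = stripe(b"abcdefghij", disks=3, chunk_size=2)
    print(layout)
    print(unstripe(layout))
```

Because every disk holds only every Nth chunk, all disks can work in parallel, but no single disk holds a complete copy of anything. That is where both the speed and the fragility of RAID 0 come from.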

Recovery

RAID 0 offers no recovery, so Elasticsearch must handle recovery via snapshots or replicas. Depending on the size of the disks and the transport mechanism used to copy the data onto the array, this can be very time-consuming. During the recovery step, network traffic and the performance of other nodes will be impacted.

Caveats

As Elasticsearch indices are made up of many shards, any index that has a shard on a RAID 0 volume that suffers a disk failure can become corrupted if no replicas of that shard exist elsewhere. This will result in permanent data loss unless you have snapshots to restore from (for example, managed with snapshot lifecycle management, SLM) or have configured Elasticsearch to keep replicas.
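
As a rough illustration of those two safety nets, the Python sketch below builds the JSON request bodies for setting the replica count on an index and for creating an SLM policy. It only prints the bodies rather than sending them. The function names, the schedule, the snapshot name pattern, and the retention period are illustrative assumptions, and the snapshot repository (`my_repository` here) is assumed to be registered already.

```python
import json


def replica_settings(replicas: int) -> dict:
    """Body for PUT /<index>/_settings to change the replica count."""
    if replicas < 0:
        raise ValueError("replicas must be zero or more")
    return {"index": {"number_of_replicas": replicas}}


def slm_policy(repository: str) -> dict:
    """Body for PUT /_slm/policy/<policy-id>. All values are examples."""
    return {
        "schedule": "0 30 1 * * ?",          # cron: every day at 01:30
        "name": "<nightly-snap-{now/d}>",    # date-math snapshot name
        "repository": repository,            # must already be registered
        "config": {"indices": ["*"]},        # which indices to snapshot
        "retention": {"expire_after": "30d"},
    }


if __name__ == "__main__":
    print(json.dumps(replica_settings(1), indent=2))
    print(json.dumps(slm_policy("my_repository"), indent=2))
```

With at least one replica on another node, losing a RAID 0 array costs you recovery time rather than data. Snapshots cover the case where every copy of a shard is lost.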

Pros and cons

Pros (+)
  • Easy to set up
  • High performance, as it can use 100% of the disks’ available read and write speed
  • High capacity, as the array can use all of the disk capacity for storage
  • Elasticsearch sees the array as a single large disk, so watermark levels and shard distribution work as expected.

Cons (-)
  • If a disk fails then all data on the entire array is lost, not just the single disk. This may impact many indices.
  • Recovery will require the cluster to copy replica shards to a new array and will be resource heavy and time consuming.
  • Potential for permanent data loss if there are no replica shards

RAID 1

Performance and capacity

Mirroring is represented in RAID with the number 1. Mirroring is the process of writing the same data to another disk, which in effect creates a copy of the data. Although data is written to both disks, most RAID implementations do not use both disks to read. Thus read and write performance is effectively half of what the two disks could deliver combined. As the same data is written to both disks, you also lose half of your capacity.

Recovery

RAID 1 is made to span only two disks. Thus if one disk fails, then the data is still preserved on the other disk. This creates high data redundancy at the cost of performance and capacity. When one disk fails you simply replace it and the data is then copied onto the new disk.

Caveats

In most cases, RAID 1 is paired with RAID 0, as RAID 1 only supports 2 disks. This means you would stripe multiple RAID 1 volumes together as a RAID 0. This is called RAID 10 and is used when you have four or more disks.

This means that you have some of the performance benefits of RAID 0 paired with the redundancy of RAID 1. The performance of RAID 10 depends on how many disks are in the array: it is NX/2, where N is the number of disks and X is the IOPS of a single disk.

Pros and cons

Pros (+)
  • High data protection, as all data is mirrored to another disk.
  • Easy recovery when a disk fails: just replace the failed disk and the data will be copied to it.
  • Elasticsearch sees the array as a single disk.

Cons (-)
  • Low capacity, as RAID 1 will take 50% of your total disk capacity for data redundancy.
  • Effectively a 50% reduction in total read/write performance
  • Only covers 2 disks, unless using RAID 10

RAID 5, 6

Performance and capacity

Parity is represented in RAID with several numbers: 2, 3, 4, 5, and 6. We will focus on 5 and 6, as 2, 3, and 4 are mostly superseded by other RAID levels. Parity is a way for computers to fix, or calculate, data that is missing due to a disk failure. Parity adds data protection to the performance of striping. This data protection does come at a cost: RAID 5 uses one disk’s worth of capacity for parity, and RAID 6 uses two.
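
If you are wondering how a missing disk can be "calculated", the Python sketch below shows the single-parity idea behind RAID 5 using XOR. It is a conceptual illustration only. Real arrays rotate the parity blocks across all disks, and RAID 6 adds a second, independently computed parity so that it can survive two failures. The `xor_blocks` helper and the sample data are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
    )


if __name__ == "__main__":
    # Three data blocks, one per disk, plus one parity block.
    data_blocks = [b"ELAS", b"TICS", b"EARC"]
    parity = xor_blocks(data_blocks)

    # Simulate losing the second disk, then rebuild it from the
    # surviving data blocks and the parity block.
    survivors = [data_blocks[0], data_blocks[2], parity]
    rebuilt = xor_blocks(survivors)
    print(rebuilt == data_blocks[1])
```

The rebuild has to read every surviving disk to recompute the missing block, which is why a degraded array is slower and why rebuilds take longer as disks get bigger.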

Recovery

Another point to consider with RAID 5 and 6 is rebuild time. A rebuild happens when a new disk is added to the array to replace a failed one. For spinning media, add more disks instead of adding disk capacity. More disks will improve read and write speeds as well as rebuild times. For SSDs, check whether the higher-capacity models also offer faster reads and writes, as many do. If this is the case, then higher-capacity disks will help with read/write performance.

RAID 5 can suffer one disk failure before it loses data from the array. RAID 6 can suffer two disk failures.

Caveats

While RAID 5 and 6 can withstand a disk failure, this doesn’t come without consequence. Once a disk has failed, a RAID 5 array is effectively as fragile as RAID 0 until a new disk has been added and rebuilt: if another disk fails, all data on the array will be lost and will need to be recovered from other shards or a snapshot. For this reason many recommend running RAID 6, as batches of disks can fail at nearly the same time. As mentioned in the beginning, understanding your project’s tolerance for performance loss and data loss is key to deciding between these two.

Losing a disk will also greatly affect the performance of both RAID 5 and 6 as the array will need to use the parity to recalculate data being read from the disks.

Pros and cons

Pros (+)
  • Medium data protection. The array can compute and reconstruct the missing data from a single failed disk.
  • Easy recovery when a disk fails: just add another disk and the data will be reconstructed.
  • Medium to high capacity. You only lose one (RAID 5) or two (RAID 6) disks’ worth of storage to parity.
  • Medium to high performance gains. You only lose one (RAID 5) or two (RAID 6) disks’ worth of write performance to parity.
  • Elasticsearch sees the array as a single disk.

Cons (-)
  • Small to medium capacity loss due to parity.
  • RAID 5 can only suffer a single disk failure. RAID 6 can only suffer two disk failures.
  • Depending on the size of the disks, restoring a disk can take more than 24 hours.
  • While a restore is in progress you are at risk of total data loss if one more disk fails (RAID 5) or two more disks fail (RAID 6).
  • While in a restore the array will operate with reduced read/write performance.

Multiple data paths in Elasticsearch (MDP)

Elasticsearch has a setting called path.data, which is used to configure the filesystem location(s) for Elasticsearch data files. When a list is specified for path.data, Elasticsearch will use multiple locations for storing data files. For example, if your elasticsearch.yml contained the following:

path.data: [ /mnt/path1, /mnt/path2, /mnt/path3 ]

Then Elasticsearch would write to multiple filesystem locations. Each of these paths could be a separate disk.

Defining multiple data paths allows a user to set up Elasticsearch to work with multiple data stores.

Performance and capacity

Elasticsearch splits the data by shards, and each shard is written to the data path with the most free space. If one shard receives most of the writes, then your performance is limited to the speed of one data path. If, however, your data is being written evenly across data paths, then you will get the write speed of all the disks being used. Elasticsearch does not ensure that writes span many data paths, so performance is variable and not consistent.

MDP does not include any mirroring or parity of the data. This means that all of the disks’ capacity can be used to house data, unless a watermark is hit first (explained in more detail in the caveats section). To use the full disk capacity you will need at least as many shards as data paths.

Adding a new data path to a node that already has data on other paths can lead to hot spotting of the new disk. As Elasticsearch won’t balance shards between data paths, all new shards will be sent to the new path because it has the most free space. And because Elasticsearch only reads updates to elasticsearch.yml on startup, adding any disk will require a full node restart.
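
The hot spotting effect is easy to see with a toy simulation. The Python sketch below is a simplified model of the "most free space wins" behavior described above, not Elasticsearch’s actual allocation code. The `place_shards` function name, the path names, and the sizes are all made up for illustration.

```python
def place_shards(
    free_gb: dict[str, float], shard_sizes_gb: list[float]
) -> dict[str, list[float]]:
    """Send each new shard to whichever path has the most free space."""
    free = dict(free_gb)  # copy so the caller's numbers are not modified
    placement: dict[str, list[float]] = {path: [] for path in free}
    for size in shard_sizes_gb:
        target = max(free, key=lambda path: free[path])
        placement[target].append(size)
        free[target] -= size
    return placement


if __name__ == "__main__":
    # Two well-used paths plus one freshly added, empty 500 GB disk.
    free_space = {"/mnt/path1": 100.0, "/mnt/path2": 120.0, "/mnt/path3": 500.0}
    for path, shards in place_shards(free_space, [50.0] * 5).items():
        print(path, shards)
```

In this example all five new shards land on the new disk, because even after receiving them it still has more free space than the older paths. All of the write load for those new shards therefore hits a single disk.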

Recovery

Disks will go bad. When a storage device that is part of the multiple data paths dies, the cluster will turn yellow and the data located on that device will not be accessible. However, Elasticsearch may continue trying to write to the failed data path, which will throw IOExceptions. To prevent this from occurring, it is important to remove the problem data path from the Elasticsearch node configuration by doing the following:

  1. Disable shard allocation.
  2. Stop the node.
  3. Remove the bad data path from the Elasticsearch config.
  4. Restart the node.
  5. Re-enable shard allocation.
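
The steps above can be sketched in Python as follows. This is a hedged outline rather than a runnable maintenance script: it only builds the request bodies for the cluster settings API and the new path.data list, and leaves sending the requests and restarting the node to you. The helper names are hypothetical, and using "primaries" as the value while the node is down is an assumption borrowed from the usual rolling-restart procedure.

```python
import json
from typing import Optional


def allocation_settings(enable: Optional[str]) -> dict:
    """Body for PUT _cluster/settings. None resets the setting to its default."""
    return {"persistent": {"cluster.routing.allocation.enable": enable}}


def remove_data_path(paths: list[str], bad_path: str) -> list[str]:
    """Return the path.data list without the failed path."""
    return [path for path in paths if path != bad_path]


if __name__ == "__main__":
    current_paths = ["/mnt/path1", "/mnt/path2", "/mnt/path3"]

    # 1. Disable shard allocation (here: restrict it to primaries).
    print(json.dumps(allocation_settings("primaries")))
    # 2. Stop the node (outside the scope of this sketch).
    # 3. Remove the bad data path from elasticsearch.yml.
    print("path.data:", remove_data_path(current_paths, "/mnt/path2"))
    # 4. Restart the node (outside the scope of this sketch).
    # 5. Re-enable shard allocation by resetting the setting.
    print(json.dumps(allocation_settings(None)))
```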

To prevent permanent data loss, ensure you have Elasticsearch configured with replication enabled. The default is one replica per shard.

Elasticsearch will then rebalance shards from other nodes. Elasticsearch doesn’t balance shards between data paths; see the caveats section for further details. If there is not enough capacity on the cluster to rebalance, then the cluster will stay yellow.

To replace the failed disk, take care to unmount it and deactivate any LVM groups first. This helps to ensure the OS can properly handle the new disk. Once the new disk is installed, you will need to add the path back into the Elasticsearch path.data setting and restart the node. After that, Elasticsearch will start using the new path.

Caveats

When a device that is specified in a multiple data path dies, the path is ignored for the next shard allocation. However, subsequent rounds will consider the path eligible for shard allocations again. This can lead to IOException errors unless the steps mentioned above are taken.

Elasticsearch handles watermarks on a per-node basis. This is important to be aware of because if one data path hits a watermark, the entire node is treated as having hit it, even if the other data paths have enough free space. Depending on which watermark is reached, the node will stop accepting new shards (low), have shards actively moved off it (high), or be put into a read-only state (flood stage).

Let’s take for example a node with 6 data paths of 500 GB each, for a total capacity of about 3 TB. It is possible for a single data path to reach 95% disk utilization, which is the default flood stage watermark. This would make the entire node hit the flood stage watermark, putting the node into read-only. This will happen even if the other disks on the node are at less than 50% utilization.
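
Here is the same example as a small Python sketch. It models the behavior described above, where the fullest data path decides the state of the whole node. It assumes the default watermark thresholds (85% low, 90% high, 95% flood stage), which can be changed in the cluster settings. The `node_watermark` function and the utilization figures are hypothetical.

```python
from typing import Optional

# Default disk watermarks as fractions of disk used (an assumption here,
# since they are configurable per cluster).
DEFAULT_WATERMARKS = {"low": 0.85, "high": 0.90, "flood_stage": 0.95}


def node_watermark(
    path_usage: dict[str, float],
    watermarks: dict[str, float] = DEFAULT_WATERMARKS,
) -> Optional[str]:
    """Return the highest watermark breached by the fullest data path."""
    worst = max(path_usage.values())
    breached = None
    for name in ("low", "high", "flood_stage"):
        if worst >= watermarks[name]:
            breached = name
    return breached


if __name__ == "__main__":
    # Six 500 GB data paths: five at 40% used, one nearly full.
    usage = {f"/mnt/path{i}": 0.40 for i in range(1, 7)}
    usage["/mnt/path3"] = 0.96

    average = sum(usage.values()) / len(usage)
    print(f"average utilization: {average:.0%}")
    print("node state:", node_watermark(usage))
```

Even though the node as a whole is only about half full, the single full path is enough to push it over the flood stage watermark. This is why uneven shard sizes are more dangerous with multiple data paths than with a single RAID volume.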

Elasticsearch does not handle shard balancing within a single node, i.e. it will not balance shards between data paths. So if a user is using multiple data paths, Elasticsearch will place each shard on the disk with the most free space available at the time. If one shard gets more data than the others and fills up a disk, Elasticsearch won’t move shards to compensate. This can lead to an uneven distribution of IO load that is easy to miss. Additionally, adding a new disk to a node containing data on other paths can lead to a high IO load on the new path.

Pros and cons

Pros (+)
  • Easy to set up
  • Able to add disks at any time
  • High utilization of disk capacity

Cons (-)
  • Elasticsearch will not balance between data paths
  • When a single data store hits a watermark the whole node will be affected
  • Performance is gated by how well the data is distributed between data stores
  • Disks cannot be hot swapped
  • Adding disks can lead to overloading the disk

Conclusion

Getting to the point where you need to start growing your data is exciting! While we have talked through many options and tools in this blog — remember that there is no “one size fits all” solution when it comes to architecting your data storage for scale.

Should you still have questions or concerns about how to architect your Elasticsearch deployment’s storage for scale, you can find me and an always-growing community of users waiting to help in our Discuss forums.

Better yet… if you don’t feel like managing all of this yourself, then I’ll echo the good news from earlier: Elastic Cloud will manage all of this for you. Growing is as easy as adding another node. No servers to unbox, rack, and manage. And it works where you (likely) already are — on Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Why not take it for a free spin today?