08 November 2017 Engineering

Elasticsearch Preview: Countermeasures against filling up disks

By Alexander Reelsen

More checks, fewer problems

This post covers an upcoming change in Elasticsearch 6.0 to the disk allocation decider that you should be aware of. We will also briefly look at some improvements to our logging infrastructure, as these affect disk space usage too.

The new disk threshold decider behaviour

In a previous post we outlined how much disk space you can save just by upgrading. That's an awesome thing, but of course only one part of the equation: you can still run out of disk space. There are dozens of reasons this can happen. For instance, your monitoring is broken, you are receiving an insane data spike (maybe due to a DDoS attack), huge merges are going on, or one of your nodes goes offline and shards get relocated.

Elasticsearch has a list of allocation deciders, which check whether a shard should be allocated on a node. For example, these deciders make sure that a primary shard and its replica never end up on the same node. Allocation deciders also take the shard allocation filtering rules and the total shard limit per node into account. Each of those deciders basically returns a decision telling the caller whether it is OK to put a shard on this node or not.

One of those allocation deciders is called the DiskThresholdDecider, which checks whether there is enough disk space to allocate a shard. This decider allows you to configure a low and a high watermark. The low watermark decides whether a shard may be allocated on a node, based on the remaining disk space. The high watermark is used to move shards away from a node once a certain amount of its disk is used. This gives the remaining shards some more breathing room.

So far, so good. One precondition of this decider is that it is able to properly read the available disk space. Most of the time this is not an issue, but you still might want to check. The easiest way to find out is to use the Nodes Stats API and check the fs information. You could do this via

GET _nodes/stats/fs?human

# another way of checking
GET /_cluster/allocation/explain?include_disk_info=true
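
If you want to turn the Nodes Stats response into a quick per-node number, something along these lines might help. It is only a sketch: it assumes the response has already been parsed into a Python dict, it uses the available_in_bytes field as the basis for "used" space (an assumption on our side, not necessarily what the decider does internally), and the sample response with its node IDs and names is made up for illustration.

```python
def disk_used_percent_by_node(nodes_stats):
    """Map node name -> percentage of disk used, from a parsed
    `GET _nodes/stats/fs` response."""
    result = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        total = node["fs"]["total"]
        total_bytes = total["total_in_bytes"]
        available_bytes = total["available_in_bytes"]
        if total_bytes <= 0:
            # Disk information could not be read for this node; skip it.
            continue
        used = 100.0 * (total_bytes - available_bytes) / total_bytes
        result[node.get("name", node_id)] = round(used, 1)
    return result


# Made-up sample response, trimmed to the fields used above.
sample_response = {
    "nodes": {
        "hypothetical-id-1": {
            "name": "node-a",
            "fs": {"total": {"total_in_bytes": 100_000_000_000,
                             "available_in_bytes": 12_000_000_000}},
        },
        "hypothetical-id-2": {
            "name": "node-b",
            "fs": {"total": {"total_in_bytes": 0,
                             "available_in_bytes": 0}},
        },
    }
}

print(disk_used_percent_by_node(sample_response))
```

A node that is missing from the result is worth a closer look, as it hints that its disk information could not be read.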

Ok, so this is good, right? We get close to running out of disk space, we move the shard away, everything is awesome. Yeah, no. Only sometimes.

What if there is no space on other nodes to move the shard or there is only one node? Then we cannot move it, and at some point we will run out of disk space, risking data corruption.

Wouldn't it be great to just stop indexing when we risk running out of disk space? Yes, it would. That's exactly what Elasticsearch does from 6.0 onwards.

A new flood_stage watermark has been added to the disk threshold decider. If that watermark is passed (by default it is set to 95%), indices are marked as read-only. Which indices will be affected? Every writeable index that has a shard on the affected node. In addition, the indices are not fully read-only: the block that gets applied, index.blocks.read_only_allow_delete, still allows deleting the index itself, so you can free up disk space.
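
To make the interplay of the three watermarks a bit more tangible, here is a heavily simplified sketch. It is not the decider's actual implementation. The 95% flood stage comes from the paragraph above, 85% and 90% are the documented defaults for the low and high watermark, and the function name and the returned labels are made up for illustration.

```python
# Thresholds expressed as percentage of disk used.
LOW_WATERMARK = 85.0    # documented default
HIGH_WATERMARK = 90.0   # documented default
FLOOD_STAGE = 95.0      # default mentioned in this post


def disk_threshold_action(used_percent,
                          low=LOW_WATERMARK,
                          high=HIGH_WATERMARK,
                          flood=FLOOD_STAGE):
    """Return a (hypothetical) label for what happens on a node
    with the given percentage of disk used."""
    if used_percent > flood:
        # Indices with a shard on this node get the read-only block.
        return "block_writes"
    if used_percent > high:
        # Shards are moved away from this node, if possible.
        return "relocate_shards_away"
    if used_percent > low:
        # No further shards are allocated to this node.
        return "no_new_shards"
    return "ok"


for used in (50.0, 87.0, 92.0, 97.0):
    print(used, disk_threshold_action(used))
```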

One important tidbit needs to be taken care of by the cluster administrator. Once the indices have been switched into this read-only mode, you have to manually mark them as writeable again. There is no automatic mechanism that switches them back to writeable once enough space has been reclaimed.

In order to re-enable an index for writing again and to remove that setting, execute

PUT my_index/_settings
{
  "index.blocks.read_only_allow_delete" : null
}
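
If you would rather script this than type it into Console, the same request could be built in Python roughly like this. This is a minimal sketch using only the standard library. The helper name build_unblock_request and the base URL http://localhost:9200 are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_unblock_request(index, base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that removes the
    read_only_allow_delete block from the given index."""
    # Setting the value to null (None in Python) removes the setting.
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_unblock_request("my_index")
print(request.get_method(), request.full_url)

# To actually send it against a reachable cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Make sure enough disk space has been reclaimed before doing this, otherwise the index will end up read-only again.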

Wait, there's more!

So, this is a great protection against running out of disk space. But this is not the only way to run out of disk space. Elasticsearch produces logs which get written into dedicated log files. If you have a rogue query that hits you several hundred times per second, this query might generate lots of log entries. This is one of the reasons you want to have your data and your logs on different partitions or even on different disks, so that logging I/O does not affect your query/index I/O.

With Elasticsearch 6.0 we do a couple of things (some are new, some have been there before) with regard to the log files it creates:

  • Logs with a date timestamp will be rolled at midnight
  • Logs will be rolled after 128 MB, even during the day
  • Rolled logs will be compressed
  • The total size of logs, including compressed logs, will not exceed 2 GB
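
The last point, the cap on the total size, can be pictured with a small sketch. This is an illustration of the idea only, not how log4j2 implements it: it assumes rolled logs are walked from newest to oldest and everything beyond the accumulated 2 GB gets deleted. The function name and the file names are made up.

```python
MAX_TOTAL_BYTES = 2 * 1024 ** 3  # the 2 GB cap from the list above


def rolled_logs_to_delete(rolled_logs, max_total_bytes=MAX_TOTAL_BYTES):
    """rolled_logs is a list of (name, size_in_bytes, modified_timestamp).
    Returns the names of the files that would be deleted."""
    newest_first = sorted(rolled_logs, key=lambda entry: entry[2], reverse=True)
    accumulated = 0
    to_delete = []
    for name, size, _modified in newest_first:
        accumulated += size
        if accumulated > max_total_bytes:
            # Everything past the cap, i.e. the oldest files, goes away.
            to_delete.append(name)
    return to_delete


one_gib = 1024 ** 3
logs = [
    ("hypothetical-2017-11-06.log.gz", one_gib, 1),
    ("hypothetical-2017-11-07.log.gz", one_gib, 2),
    ("hypothetical-2017-11-08.log.gz", one_gib, 3),
]
print(rolled_logs_to_delete(logs))
```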

If you want to customize this behaviour, you can always change the log4j2.properties file in the config directory.

Hopefully you won’t need to face this protection, but if you want to play around with it, we are thankful for any feedback. Participate in the Pioneer Program by finding and filing new bugs, and be eligible for Elastic swag.