05 October 2016 Engineering

Bootstrap checks: Annoying you now instead of devastating you later!

By Jason Tedor and Nik Everett

Elasticsearch 5.x has ten new "bootstrap checks" which run when Elasticsearch starts up to check for configuration problems that might cause exciting failures after a node sees serious use. If any of these checks fail and the node is bound to a non-local IP address, the node will abort during startup. We earnestly hope that these checks are just a minor annoyance and that they prevent major heartache.

Serial Killer

Concrete example time! We recently added a check that fails if we detect that the JVM that Elasticsearch is running in is using the "serial" collector. The serial collector is designed for single-threaded applications or applications running with extremely small heaps. Elasticsearch is very much a multi-threaded application and will not fare well with an extremely small heap, so the collector isn’t a good fit for it. Worse, this collector can cause nodes to time out of the cluster: it stops the world while it runs, and because the collection process is single-threaded those pauses can be quite substantial for large heaps. If these pauses run longer than ninety seconds the master node might time out its periodic pings to the node and remove it from the cluster. When the node finishes its long GC cycle it will rejoin the cluster, causing the node to "bounce" in and out of the cluster.

We hope the serial collector check saves someone a lot of trouble down the line. In fact we’re fairly sure it will because we’ve seen the serial collector in action. One thing buying a subscription from Elastic entitles you to is support and "one of my nodes is periodically leaving the cluster and coming back" is a thing that comes up from time to time. Sometimes we track the failures back to the serial garbage collector. Users won’t see those failures in production on 5.x because Elasticsearch will refuse to start with the serial garbage collector at all.

Since the startup scripts explicitly set up a different collector, we expect this check to trigger very rarely, but given how insidious the failure mode is we thought it was worth checking.
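To make the idea concrete, here is a minimal sketch of how a process could notice that it is running on the serial collector. It is an illustration under stated assumptions, not the code Elasticsearch ships: the class and method names are hypothetical, and it relies on HotSpot reporting the serial collector's two generations under the MXBean names "Copy" and "MarkSweepCompact".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class for illustration; not the Elasticsearch implementation.
final class SerialCollectorSketch {

    // Assumption: HotSpot names the serial collector's young and old
    // generation beans "Copy" and "MarkSweepCompact".
    static final List<String> SERIAL_BEANS = Arrays.asList("Copy", "MarkSweepCompact");

    // True only when both serial beans are present in the given names.
    static boolean looksLikeSerial(List<String> collectorNames) {
        return collectorNames.containsAll(SERIAL_BEANS);
    }

    // Names of the collectors the running JVM actually registered.
    static List<String> currentCollectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = currentCollectorNames();
        if (looksLikeSerial(names)) {
            System.err.println("JVM is using the serial collector " + names + "; refusing to continue");
            System.exit(1);
        }
        System.out.println("collectors in use: " + names);
    }
}
```

Starting the sketch on a HotSpot JVM with `-XX:+UseSerialGC` should take the failing branch; with any other collector it just prints what it found.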

The File Limits are Too Damn Low

Another issue that we see very frequently is "too many open files". This happens when a user (often unknowingly) has their file descriptor limits set far too conservatively, like 4096. Elasticsearch needs lots of file descriptors to open many indices composed of many segments (and, on Unix-based operating systems, things like network sockets are file descriptors too). If this limit is set too low and Elasticsearch bumps up against the limit, things like shard allocation can fail and disaster can ensue.

So, we added the file descriptor bootstrap check. It’s the original bootstrap check, inspired by issues opened on our GitHub repository and posts on our discussion forum. At startup it simply checks the current file descriptor limit and screams loudly if it’s below 65536. It can take a few minutes to track down exactly the right incantation to increase the limit on your operating system, and this is super annoying, but it sure beats losing data in production because you never knew the limit should be higher.
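A check like this is small enough to sketch in a few lines. The version below is an approximation and not the Elasticsearch source: the class and method names are made up, it reads the limit through the JDK's `com.sun.management.UnixOperatingSystemMXBean` (only available on Unix-like JVMs), and it treats an unknown limit as "nothing to complain about", which is an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.Optional;

// Hypothetical class for illustration; not the Elasticsearch implementation.
final class FileDescriptorCheckSketch {

    // Threshold named in the post.
    static final long MIN_FILE_DESCRIPTORS = 65536;

    // Returns a failure message when the limit is known and too low.
    // A negative value means "unknown" and is not treated as a failure.
    static Optional<String> check(long maxFileDescriptors) {
        if (maxFileDescriptors >= 0 && maxFileDescriptors < MIN_FILE_DESCRIPTORS) {
            return Optional.of("max file descriptors [" + maxFileDescriptors
                + "] for elasticsearch process likely too low, increase to at least ["
                + MIN_FILE_DESCRIPTORS + "]");
        }
        return Optional.empty();
    }

    // Reads the per-process limit, or -1 when the platform does not expose it.
    static long currentLimit() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        Optional<String> failure = check(currentLimit());
        System.out.println(failure.orElse("file descriptor limit looks fine"));
    }
}
```

Keeping `check` separate from `currentLimit` means the threshold logic can be exercised with any number, without touching the real limit of the machine it runs on.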

These Warnings go to Eleven

A core value of the Elasticsearch team is "it should be easy to start exploring Elasticsearch". In the past few years this has had to compete with values like "Elasticsearch should be secure" and "Elasticsearch should be stable" but it is still important. In that vein, we still want Elasticsearch to start for folks that just download a distribution and start it without configuring anything. In that case every failing bootstrap check will simply print a WARN like this:

[2016-09-23T12:19:25,588][WARN ][o.e.b.BootstrapCheck     ] [DOteo05] max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]

and Elasticsearch will start as though nothing was wrong. We hope these warnings make a first-time user think "Elasticsearch cares enough about stability to warn me when something is poorly configured."

The trouble is that people ignore warnings. We really want first-time users to scan the list of warnings, realize that they should go fix those things before going to production, and keep playing with Elasticsearch. We hope they play long and hard and fall in love with Elasticsearch like we all did. After all that, we expect folks that saw the warnings to forget that there are things to clean up. And, of course, there are folks who start Elasticsearch in the background and don’t read the logs at all, missing the warnings entirely.

So for these bootstrap checks to really be effective they have to change from warnings to actual errors at some point. They have to stop Elasticsearch from starting and make you fix them. We want to be sure that any "real" installation of Elasticsearch passes all of these checks so we don’t have to help anyone recover from a disaster caused by something like running out of file descriptors. The best measure we could find for what makes a "real" installation is binding to a non-local IP address. We briefly toyed with the idea of an explicit "production mode" flag but decided against it because it would be too easy for folks to turn it off to "work around" one of the bootstrap checks.

Binding to a non-local IP address is a fairly good way to detect a "real" installation because you can’t form a multi-node cluster without doing so. And you can’t serve requests from servers other than the one running Elasticsearch either. Well, you could play tricks with tunnels but that is fairly arcane. Anyway, we think that 99% of folks that have a "real" "production" installation of Elasticsearch will bind to a non-local IP address. We also think that most folks that bind to a non-local IP address are either setting up a "real" installation of Elasticsearch or are running Elasticsearch in a container or VM, presumably set up using a configuration management system like Puppet/Chef/Ansible/Salt/bash-scripts-and-duct-tape.
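The warn-or-abort decision described above can be sketched roughly as follows. This is a simplified illustration with hypothetical names, and it assumes that "non-local" means "not a loopback address"; the real logic in Elasticsearch may treat other address types differently.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

// Hypothetical class for illustration; not the Elasticsearch implementation.
final class BootstrapEnforcementSketch {

    // Assumption: any bound address that is not loopback counts as non-local.
    static boolean enforce(List<InetAddress> boundAddresses) {
        for (InetAddress address : boundAddresses) {
            if (!address.isLoopbackAddress()) {
                return true;
            }
        }
        return false;
    }

    // Warn when checks are not enforced, refuse to start when they are.
    static void report(List<String> failures, boolean enforce) {
        if (failures.isEmpty()) {
            return;
        }
        if (enforce) {
            throw new IllegalStateException("bootstrap checks failed: " + String.join("; ", failures));
        }
        for (String failure : failures) {
            System.err.println("WARN " + failure);
        }
    }

    // Helper for IP literals so callers need not handle the checked exception.
    static InetAddress ip(String literal) {
        try {
            return InetAddress.getByName(literal);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + literal, e);
        }
    }
}
```

With this shape, a node bound only to `127.0.0.1` gets a warning per failed check and keeps starting, while the same failures on a node that is also bound to something like `192.168.1.10` stop it cold.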

So those are the bootstrap checks: annoying you now so you don’t get paged on a Saturday morning at 3 AM because a bulk import caused a long GC which caused a node to drop out of the cluster, all because the wrong garbage collector is configured.