29 January 2014

Why We Built Marvel

By Boaz Leskes

Yesterday, Steven Schuurman announced our latest product, Marvel. Judging by the twitter storm alone, people are just as excited about it as we are. Today, I would like to take the opportunity to tell more about how it works and how it came to be. Marvel is the result of all of our experiences helping users and providing support to customers. Most importantly, the product has come from our own needs for its capabilities and insights.

What happened at 3 AM this morning?

Since Elasticsearch first went public 4 years ago, its adoption has been nothing short of impressive. As the number of users has grown, so have the number of questions and requests for help on the user mailing list and in IRC (#elasticsearch on Freenode). We also have numerous inquiries that come in from our customers on dedicated support contracts. The questions come, of course, in many flavors. Some would be about how to best use a feature, or help with a certain aspect of the Query DSL. Others would report a problem and ask for help figuring out what has happened, be it about an exception from the logs, garbage collection taking longer than ideal or an indication that memory has run out. Sometimes a quick API call would be enough to find out the cause of the problem and resolve it quickly. Other times, issues are a result of a more complex sequence of events.

Take, for example, the following scenario: you are the proud owner of a cluster. You have an application which analyzes time-based data. Following best practices, you have your recent data on powerful SSD-equipped machines. As data gets older, you have a nightly cronjob that uses Elasticsearch's Shard Allocation feature to indicate the data should be moved to cheaper & less capable machines. One morning, just as you walk into the office, you notice that one of the old-data machines is stressing out and run out of capacity. But why? And why now? There is no spike in search traffic, and indexing goes to the more powerful nodes. The only change you can think of is that nightly cronjob, but it runs at 3 AM, 6 hours before the node started having problems.

As it happens, the cronjob started a chain of events that led to the current situation. In order not to impact performance, Elasticsearch throttles moving data around. Since the job issued the command at 3 AM, more and more data moved. Only a couple of hours later, the node's maximum capacity was reached. Once you've found out this information and spent some time digging into logs and thinking hard, the solution is easy: temporarily move some data back to the SSD machines, provision another cheap node and move the data to it.

Upon reflection, you realize you need some way of seeing cluster behavior over time. This functionality would have allowed you to see the increasing usage trend and to easily trace back the problem to its start at 3 AM in the morning. It would also have allowed you to see that the same thing happened yesterday, and the day before, except then it was not a problem - yet. It would also be great if just at the beginning of that trend, you could see an indication that Elasticsearch had started to relocate data … but we'll get to that later.

As you can imagine, such stories are not unique. In fact, things can get even more complex. Repeatedly, we ask customers to send us the logs from all their nodes. We also ask to call the stats API repeatedly while running heavy queries, what management commands were actually issued, when they were issued, etc. Once we get the information, we scan it carefully and try to correlate the different information streams and compare them to how we know Elasticsearch behaves. The process is manual, intensive and time consuming. Especially during those moments where time is not necessarily on your side.

Being engineers, we kept thinking about how we could improve and automate this process. We wanted to build something that would make our own lives easier and help all of our users: tools to monitor clusters & to collect and analyze the vast number of statistics Elasticsearch exposes. To accomplish this, we needed a place to store all this data, an analytics engine to analyze it and a tool to visualize the results. As it turns out, we have an intimate knowledge of just the right tools for the job. Based on Elasticsearch and Kibana, we've set out to build a smart solution: Marvel.

Cluster, this is Mission Control

Marvel is a plugin for Elasticsearch that hooks into the heart of Elasticsearch clusters and immediately starts shipping statistics and change events. By default, these events are stored in the very same Elasticsearch cluster. However, you can send them to any other Elasticsearch cluster of your choice.

Once data is extracted and stored, the second aspect of Marvel kicks in - a set of dedicated dashboards built specifically to give both a clear overview of cluster status and to supply the tools needed to deep dive into the darkest corners of Elasticsearch.

Overview, Nodes & Indices

The Overview dashboard is the one offered by default when navigating to http://localhost:9200/_plugin/marvel with your favorite browser. The dashboard displays the essentials metrics you need to know that your cluster is healthy. The dashboard also provides an overview of your nodes and indices, displayed in two clean tables along with the relevant key metrics. These tables serve as an entry point to more details on the Node Statistics and Index Statistics dashboards, where you can see more than 90 different metrics plotted over time. Simply click on a table cell or select multiple nodes/indices to compare and you'll be transferred to the relevant place in the detailed dashboard.

The Node Statistics dashboard displays metric charts from the perspective of one or more nodes. Metrics include hardware level metrics (like load and CPU usage), process and JVM metrics (memory usage, GC), and node level Elasticsearch metrics such as field data usage, search requests rate and thread pool rejection.

The Index Statistics dashboard is very similar to the Node Statistics dashboard, but it shows you all the metrics from the perspective of one or more indices. The metrics are per index, with data aggregated from all of the nodes in the cluster. For example, the 'store size' chart shows the total size of the index data across the whole cluster.


Cluster Events

The Cluster Pulse dashboard allows you to see any event of interest in the cluster. Typical events include nodes joining or leaving, master election, index creation, shard (re)allocation and more. Think of the Cluster Pulse Dashboard as your window into the nerve system of Elasticsearch.

Easy access to the REST API

Marvel also comes with a lightweight developer console, based on the popular Chrome extension Sense. The console is handy when you want to make an extra API call to check something or perhaps tweak a setting. The developer console understands both JSON and the Elasticsearch API, offering suggestions and auto-completes. It’s also quite handy to use to prototype queries, dive into your data or look at the current version of a specific document.

Back to 3AM in the morning

Let's go back to our example story, but this time with Marvel installed on the cluster. When you were first notified of a problem on a node, you would have gone to the Overview dashboard and confirmed that it’s in trouble. Let's assume the issue was a memory problem. You'd click on the JVM memory metric - which is red at the moment - opening up the Node Statistics dashboard to take a look at the history of the problematic node’s memory. You'd see a clear, slow growth pattern starting around 3 AM. Wondering what had happened there - and probably not remembering your cronjob yet - you'd go to the Cluster Pulse dashboard and see that Elasticsearch was relocating shards from SSD machines from that time. From there, you'd move to Sense, undo the allocation setting for the relevant index and go provision another node.

Ok, cool. So how do I get it?

Marvel is a plugin and you install it just as you would any other Elasticsearch plugin. Give it a try:

./bin/plugin -i elasticsearch/marvel/latest

followed by restarting the nodes and opening http://localhost:9200/_plugin/marvel in your browser.

Marvel is licensed free for development use and runs on Elasticsearch versions 0.90.9 and up. For detailed instructions please see the documentation.

Looking ahead

We have great plans for Marvel but they also depend on the feedback we get from you. All we can say for now is that as the analytical and visualization powers of Elasticsearch & Kibana grow, the future looks … Marvelous!

Please do let us know what think, either via our Google Group, Twitter or IRC (#elasticsearch).