In the spirit of knowledge sharing, we wanted to tell our Elasticsearch story. Like many others out there, TrackJS is a small company scaling quickly, and open source solutions can play a critical role in growing your organization without breaking the bank.
But before you dive in, we invite you to get an in-depth look at how we use Elasticsearch in an upcoming webinar, "Scale Fast or Die Trying," on February 3. We'll give you the inside scoop on what happened when we hit #1 on Hacker News. (And if you'd like to get even more out of the session, grab a free trial of our software at trackjs.com.)
In the Beginning
TrackJS started simple, with an application server and a database server. The database was a traditional RDBMS with a normalized schema and proper referential integrity. Our application performs many aggregate counts, and aggregates in SQL inevitably mean a GROUP BY, which often equates to a table scan. This hurt performance, and once we had a few million errors stored, we couldn't afford a box large enough to process them in any reasonable timeframe. We tried removing foreign keys, tweaking the indexes, and denormalizing everything; in the end, we concluded it was the wrong tool for the job. We wanted near real-time analytics, and SQL wasn't going to give it to us.
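The shape of the problem can be sketched with a toy schema. The table and column names below are illustrative, not our actual schema, and SQLite stands in for the RDBMS:

```python
import sqlite3

# Toy, illustrative schema -- not TrackJS's actual tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE errors (id INTEGER PRIMARY KEY, customer_id INTEGER, message TEXT)"
)
conn.executemany(
    "INSERT INTO errors (customer_id, message) VALUES (?, ?)",
    [(1, "TypeError"), (1, "TypeError"), (2, "ReferenceError")],
)

# An aggregate count per customer: without a covering index, this kind of
# GROUP BY degrades to a scan over the whole errors table, which is what
# made the query cost grow with total stored errors.
rows = conn.execute(
    "SELECT customer_id, COUNT(*) FROM errors "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

At a few million rows, every dashboard load pays that scan again, which is why caching or pre-aggregation only delays the problem.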
The Right Tool for the Job
We auditioned several NoSQL databases to replace our relational database. The test was simple: perform several aggregate counts over one million errors on a single core VM with 1.5GB of RAM, and see who does it best. It was meager hardware, but we were curious to see how various tools would perform. We will not name names here, but many challengers came up short. (One even experienced catastrophic data loss.) When using only the defaults on Windows machines, Elasticsearch was able to handle the challenge without breaking a sweat.
Running on Azure
TrackJS is primarily hosted on Microsoft Azure. We host our Elasticsearch cluster on a few virtual machines in a single Azure Cloud Service. Azure gives us load balancing out of the box. It does not support multicast, so we explicitly turn it off in our Elasticsearch config and instead rely on the unicast hosts list.
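On the Elasticsearch 1.x line, that amounted to a couple of lines in elasticsearch.yml; the host addresses below are placeholders:

```yaml
# elasticsearch.yml (ES 1.x era): disable multicast discovery, which Azure
# does not support, and list cluster peers explicitly instead.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
```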
Originally, our web tier was not on the same virtual network as our Elasticsearch cluster. Azure exposed a public IP/hostname, so it wasn't a big deal (or so we thought). We started experiencing random connectivity issues, but only when connecting to Elasticsearch from our web tier. After a lot of diagnostics, we believe they were related to the load balancer/NAT our web tier went through. Today, we connect directly over internal IPs, and the system is much more stable.
Index Per Customer
Next, we had to figure out how to index our error data. Elasticsearch gives you lots of options, which can be overwhelming. We really didn't know what we were doing, or know how to organize documents, so we went with the simplest thing we could think of: assign one index per customer.
At the time we were using a three-node (Windows) cluster. Each node had two cores and 3.5GB of RAM. Each customer index had three shards with one replica. This meant that the data for each customer was spread over all the nodes, and every node had to participate in every customer query. When we hit 800 customers, the per-index overhead, coupled with thousands of shards packed onto three nodes, started causing problems. We saw random disconnects and timeouts as the boxes failed to respond to pings in a timely fashion. CPU was always pegged, and multi-index queries were impossible. More money and larger hardware could have bought us time, but we had some fundamental data-organization issues to address first.
Elasticsearch 1.0 and Backup
When we first started using Elasticsearch, there was no good solution for backup. Many people use Elasticsearch as a projection of some other canonical data store that does have backup capability. We didn't have that situation; Elasticsearch was it. Being ex-enterprise guys, we knew we couldn't just "wing it," so we made our own. We hand-rolled a system that pushed a serialized version of our error data into cloud storage, should the need to restore ever arise.
Once Elasticsearch 1.0 came out with snapshot capability, we jumped on it and continue to use it to this day. We back up to Amazon S3 and restore to our dev environment just to make sure it's all working. So far we've had great success. We believe it's possible to use Elasticsearch as your primary data store and still provide excellent retention and uptime guarantees.
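For readers who haven't used the snapshot API: registering an S3 repository looks roughly like the request body below. The repository and bucket names are made up for illustration, and on the 1.x line the S3 repository type came from the cloud-aws plugin:

```python
import json

# Hypothetical repository settings for illustration.
repo_body = {
    "type": "s3",
    "settings": {"bucket": "my-es-backups", "region": "us-east-1"},
}

# The two calls against the cluster would then be roughly:
#   PUT /_snapshot/s3_backups                 (body: repo_body, registers the repo)
#   PUT /_snapshot/s3_backups/snap_2015_01_15 (takes an incremental snapshot)
print(json.dumps(repo_body))
```

Restoring into a separate dev cluster, as described above, doubles as a periodic test that the backups are actually usable.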
Alias Per Customer
TIP: Use aliases from day one, even with a single index you think you'll never have to change.
Each workload is different so there's no "best" way to organize your data in Elasticsearch. However, in our case it was apparent that we had too many customers and not enough money to pull off the index-per-customer approach. Our next idea was to put all of the error data into a single index and create an alias for each customer. It worked and continues to be the approach we use today.
We still have a three-node cluster, but upgraded our machines to run Linux and each node now has 14GB of RAM. Our single error index consists of 50 shards. We intentionally oversharded the index so we can add additional nodes for scaling at-will. Each customer has their own alias. The alias consists of a term filter (by customer ID) as well as a routing value (also the customer ID). This means that each customer's data is located on a single shard. To serve a search request the cluster need only contact one node, and one shard within that node, to craft a response. This also ensures we get accurate count data for each aggregation.
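The per-customer alias described above can be sketched as an action for the `_aliases` API. The field name `customerId` is an assumption about our mapping; in the 1.x alias API, the `routing` key applies to both indexing and search:

```python
import json

# Sketch of the alias we create for each customer. The term filter scopes
# every search through the alias to that customer's documents, and the
# routing value pins the customer's reads and writes to a single shard.
def customer_alias_action(index, customer_id):
    return {
        "add": {
            "index": index,
            "alias": "customer-%s" % customer_id,
            "filter": {"term": {"customerId": customer_id}},  # field name assumed
            "routing": str(customer_id),
        }
    }

# Sent as: POST /_aliases  with body {"actions": [action, ...]}
action = customer_alias_action("errors", 1234)
print(json.dumps(action))
```

Because routing and filtering use the same customer ID, a query through the alias touches exactly one shard, which is what makes the aggregation counts both fast and accurate.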
This strategy is working well, though some cracks in the foundation are appearing. We currently run a delete-by-query to truncate each customer's data as it reaches the end of our retention period. This occasionally causes stop-the-world garbage collection (GC) pauses and drops a node for a time. In addition, our field data cache sizes are creeping up as we ingest more (and more varied) errors and aggregate them. Finally, we are worried about so-called "hot shards": shards that were randomly assigned heavy-use customers and are much larger than the others. There is no easy way to rebalance them at the moment.
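The retention delete looks roughly like the query below (Elasticsearch 1.x supported `DELETE /{index}/_query` natively). The field names and the 30-day window are illustrative, not our actual retention policy:

```python
# Sketch of a retention delete-by-query. Deleting documents this way marks
# them for cleanup during segment merges, which is what drives the memory
# pressure and GC pauses described above.
expire_query = {
    "query": {
        "filtered": {  # ES 1.x filtered-query syntax
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"customerId": 1234}},          # field name assumed
                        {"range": {"timestamp": {"lt": "now-30d"}}},  # window assumed
                    ]
                }
            }
        }
    }
}
# Sent as: DELETE /errors/_query  (body: expire_query)
```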
The Future: Index Per Day
To address the GC pauses, we're planning to migrate to an index-per-day strategy, with customer aliases pointing at several indices at once (based on each customer's retention period). This will let us delete data by dropping an entire index, without putting memory pressure on the JVM. Creating a new index each day also makes it easier to change our type mappings: today, changing an existing mapping means reindexing all the data, but with a new index each day we'll be able to evolve our mappings as necessary. We've also upgraded from 1.0.1 to 1.4.1, which should give us quite a few aggregation performance optimizations and cluster stability enhancements. So far it's been a huge improvement.
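Under that plan, a customer's alias would simply enumerate the daily indices inside their retention window. A sketch, assuming a hypothetical `errors-YYYY.MM.DD` naming scheme:

```python
from datetime import date, timedelta

# Sketch: list the daily index names a customer's alias should cover.
# The "errors-YYYY.MM.DD" naming scheme is an assumption for illustration.
def indices_for_retention(today, retention_days):
    return [
        "errors-%s" % (today - timedelta(days=n)).strftime("%Y.%m.%d")
        for n in range(retention_days)
    ]

names = indices_for_retention(date(2015, 1, 15), 3)
print(names)  # ['errors-2015.01.15', 'errors-2015.01.14', 'errors-2015.01.13']
```

Expiring data then becomes repointing the alias and dropping the oldest index, which is a near-free metadata operation compared to delete-by-query.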
To address the growing field data cache, we're beginning our experiments with doc values. By default, when doing aggregations, Elasticsearch pulls the field data into memory. This is an expensive operation and, in our case, the greatest driver of hardware cost. An alternative exists that precomputes the field data at index time: doc values. With the release of Elasticsearch 1.4.1, doc values have been tuned for performance and are approaching the speed of the field data cache. We're hopeful this will let us continue to scale without breaking the bank.
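Doc values are opted into per field in the mapping. A sketch, with an illustrative type and field name; note that in 1.x, doc values required a `not_analyzed` string (or a numeric/date field):

```python
import json

# Sketch of a mapping that enables doc values on a field we aggregate over.
# Type name "error" and field name "browser" are assumptions for illustration.
mapping = {
    "error": {
        "properties": {
            "browser": {
                "type": "string",
                "index": "not_analyzed",  # required for doc values on strings in 1.x
                "doc_values": True,       # store column data on disk at index time
            }
        }
    }
}
# Applied on index creation, e.g.: PUT /errors-2015.01.15  (mappings in the body)
print(json.dumps(mapping))
```

Because mappings are fixed once a field exists, the index-per-day rollover above is also what lets us flip fields like this to doc values without a full reindex.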
Elasticsearch has been central to our growth and success, and underpins a lot of what we hope to do in the future. We look forward to continuing to work together.