
Scaling TrackJS with Elasticsearch for Fun and Profit

We're pleased to welcome TrackJS co-founders Todd H. Gardner and Eric Brandes as guest bloggers for Elasticsearch. Both active web developers with a “particular love of JavaScript,” Gardner and Brandes are fueling the growth of their (proudly) Minnesota-based company with the help of Elasticsearch. Here is their story.


JavaScript errors are cryptic, and they rarely provide enough information to identify what went wrong. TrackJS, a JavaScript error reporting service for modern web applications, helps solve this problem. Similar to an airplane's black box, TrackJS captures events from the application, user, and network leading up to the error, so you can recreate and fix the problem. If you're building a JavaScript app, this is an essential tool.

In the spirit of knowledge sharing, we wanted to tell our Elasticsearch story. Like many others out there, TrackJS is a small company scaling quickly, and open source solutions can play a critical role in growing your organization without breaking the bank.

TrackJS uses Elasticsearch to provide real-time JavaScript error reporting analytics to our customers. It underpins our backend and allows us to slice and dice client-side errors. We've encountered a number of scaling issues as we've grown, and have gained hard-won knowledge in the process. We don't have an unlimited budget, so the answer for us is not always “add a bigger box.” This post is a raw look at how our business has grown to process over 200 million error events per month on a limited budget, while still providing a great user experience.

But before you dive in, we invite you to get an in-depth look at how we use Elasticsearch in an upcoming webinar, “Scale Fast or Die Trying,” on February 3. We'll give you the inside scoop on what happened when we hit #1 on Hacker News. (And if you'd like to get even more out of the session, we invite you to grab a free trial of our software at trackjs.com.)

In the Beginning

TrackJS started simple, with an application server and database server. The database was a traditional RDBMS with a normalized schema and proper referential integrity. Our application performs many aggregate counts. To do aggregates in SQL inevitably means a GROUP BY, which often equates to a table scan. This negatively impacts performance, and once we hit a few million errors stored, we couldn't afford a large enough box to process things in any reasonable timeframe. We tried removing foreign keys, tweaking the indexes, denormalizing everything, and in the end, we concluded it was the wrong tool for the job. We wanted near real-time analytics, and SQL wasn't going to give it to us.

The Right Tool for the Job

We auditioned several NoSQL databases to replace our relational database. The test was simple: perform several aggregate counts over one million errors on a single core VM with 1.5GB of RAM, and see who does it best. It was meager hardware, but we were curious to see how various tools would perform. We will not name names here, but many challengers came up short. (One even experienced catastrophic data loss.) When using only the defaults on Windows machines, Elasticsearch was able to handle the challenge without breaking a sweat.

Running on Azure

TrackJS is primarily hosted on Microsoft Azure. We host our Elasticsearch cluster on a few virtual machines in a single Azure Cloud Service. Azure gives us load balancing out of the box. It does not support multicast, so we explicitly turn it off in our Elasticsearch config and instead rely on the unicast hosts list.
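
For reference, that change is just a couple of lines in each node's elasticsearch.yml on the 1.x line we run; the host IPs below are placeholders rather than our actual addresses:

```yaml
# elasticsearch.yml (Elasticsearch 1.x discovery settings)
# Azure has no multicast support, so disable it and list the other
# cluster nodes explicitly for unicast discovery.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
```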

Originally, our web tier was not on the same virtual network as our Elasticsearch cluster. Azure exposed a public IP/hostname, so it wasn't a big deal (or so we thought). We started experiencing random connectivity issues, but only when connecting to Elasticsearch from our web tier. After lots of diagnostics, we believe it was related to the load balancer/NAT that our web tier's connections went through. Today, we connect directly over internal IPs and the system is much more stable.

Index Per Customer

Next, we had to figure out how to index our error data. Elasticsearch gives you lots of options, which can be overwhelming. We really didn't know what we were doing or how to organize our documents, so we went with the simplest thing we could think of: one index per customer.

At the time we were using a three-node (Windows) cluster. Each node had two cores and 3.5GB of RAM. Each customer index had three shards with one replica. This meant that the data for each customer was spread over all the nodes, and to serve a query for a customer, each node had to be involved. When we hit 800 customers, the amount of overhead for each index, coupled with the number of shards spread over that many nodes, started causing problems. We saw random disconnects and timeouts as the boxes wouldn't respond to pings in a timely fashion. CPU was always pegged, and it was impossible to run queries spanning multiple indices. More money and larger hardware could have solved the problem for us, but we had some fundamental data organization issues we needed to address.
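
For illustration, creating one of those per-customer indices looked roughly like the sketch below, shown here with the official Python client; the index name, customer ID, and host are placeholders rather than our actual setup.

```python
# Rough sketch of the index-per-customer approach (Elasticsearch 1.x,
# official Python client). Host and names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.0.0.4:9200"])

def create_customer_index(customer_id):
    # One index per customer: three primary shards plus one replica each,
    # so every customer's data ended up spread across all three nodes.
    es.indices.create(
        index="customer-%s" % customer_id,
        body={
            "settings": {
                "number_of_shards": 3,
                "number_of_replicas": 1,
            }
        },
    )

create_customer_index("1234")
```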

Elasticsearch 1.0 and Backup

When we first started using Elasticsearch there was no good solution for backup. Many people use Elasticsearch as a projection of some other canonical data store which does have backup capability. We didn't have that situation; Elasticsearch was it. Being ex-enterprise guys we knew we couldn't just “wing it,” so we made our own. We hand-rolled a system that shoved a serialized version of our error data into cloud storage should the need to restore ever arise.

Once Elasticsearch 1.0 came out with snapshot capability, we jumped on it and continue to use it to this day. We back up to Amazon S3 and restore to our dev environment just to make sure it's all working. So far we've had great success. We believe it's possible to use Elasticsearch as your primary data store and still provide excellent retention and uptime guarantees.
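
In case it helps, registering an S3 repository and taking a snapshot looks roughly like the sketch below with the Python client. The bucket, region, and names are made up, and on the 1.x line the "s3" repository type comes from the AWS cloud plugin.

```python
# Sketch of snapshot-based backup to S3 (Elasticsearch 1.x, with the AWS
# cloud plugin providing the "s3" repository type). Names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.0.0.4:9200"])

# Register the snapshot repository once.
es.snapshot.create_repository(
    repository="s3_backup",
    body={
        "type": "s3",
        "settings": {"bucket": "trackjs-es-backups", "region": "us-east-1"},
    },
)

# Take a snapshot on a schedule (e.g. nightly).
es.snapshot.create(
    repository="s3_backup",
    snapshot="errors-2015-01-20",
    wait_for_completion=False,
)

# Restoring into the dev environment is the mirror image:
# es.snapshot.restore(repository="s3_backup", snapshot="errors-2015-01-20")
```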

Alias Per Customer

TIP: Use aliases, even with a single index you think you'll never have to change.

Each workload is different, so there's no “best” way to organize your data in Elasticsearch. However, in our case it was apparent that we had too many customers and not enough money to pull off the index-per-customer approach. Our next idea was to put all of the error data into a single index and create an alias for each customer. It worked, and it remains the approach we use today.

We still have a three-node cluster, but we upgraded our machines to run Linux and each node now has 14GB of RAM. Our single error index consists of 50 shards. We intentionally oversharded the index so we can add additional nodes and scale at will. Each customer has their own alias. The alias consists of a term filter (by customer ID) as well as a routing value (also the customer ID). This means that each customer's data is located on a single shard. To serve a search request the cluster need only contact one node, and one shard within that node, to craft a response. This also ensures we get accurate count data for each aggregation.
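
To make that concrete, here is a sketch of a filtered, routed alias and a query through it using the Python client; the field names and customer ID are hypothetical, and the syntax shown is the 1.x filtered-alias form.

```python
# Sketch of the alias-per-customer setup on a single shared "errors" index
# (Elasticsearch 1.x). Field names and IDs are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.0.0.4:9200"])
customer_id = "1234"

# The alias filters by customer ID and routes requests to a single shard.
es.indices.put_alias(
    index="errors",
    name="customer-%s" % customer_id,
    body={
        "filter": {"term": {"customerId": customer_id}},
        "routing": customer_id,
    },
)

# Application code only ever talks to the alias, so a search (including
# aggregation counts) touches one shard on one node.
results = es.search(
    index="customer-%s" % customer_id,
    body={
        "size": 0,
        "aggs": {"errors_by_browser": {"terms": {"field": "browser"}}},
    },
)
```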

This strategy is working well, though some cracks in the foundation are appearing. We currently run a delete-by-query to truncate a customer's data as it reaches the end of our retention period. This will occasionally cause stop-the-world garbage collection (GC) pauses and drop a node for a time. In addition, we're starting to see our field data cache sizes creep up as we ingest more (and varied) errors and aggregate them. Finally, we are worried about so-called “hot shards”: shards that were randomly assigned heavy-use customers and have grown much larger than the others. There is no easy way to rebalance those at the moment.
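
The retention cleanup that triggers those pauses is essentially a delete-by-query scoped to a customer's alias, something like the sketch below; the date field and window are hypothetical, and delete-by-query is still a core API on the 1.x line we run.

```python
# Sketch of the retention cleanup: delete-by-query against a customer's
# alias (Elasticsearch 1.x). Field name and window are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.0.0.4:9200"])

# Remove error documents older than the retention period for one customer.
es.delete_by_query(
    index="customer-1234",
    body={"query": {"range": {"timestamp": {"lt": "now-30d"}}}},
)
```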

The Future: Index Per Day

To address the GC pauses we're planning on migrating to an index-per-day strategy, with customer aliases pointing at several indices at once (based on their retention period). This will let us delete data (drop an entire index) without putting memory pressure on the JVM. Creating a new index per day also lets us more easily change our type mappings. Currently, if we want to change existing mappings, we have to reindex all the data. With a new index each day we'll be able to evolve our mappings as necessary. We've also upgraded from 1.0.1 to 1.4.1, which should give us quite a few aggregation performance optimizations and cluster stability enhancements. So far it's been a huge improvement.
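
Under that plan, a nightly job would roll each customer's alias forward onto the new daily index and drop whole indices once they age out, roughly like the sketch below; the index names, field names, and retention window are illustrative.

```python
# Sketch of the planned index-per-day rollover: attach the customer alias
# to the new daily index, detach it from the expired one, then drop the
# old index outright. Names and the retention window are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.0.0.4:9200"])
customer_id = "1234"

es.indices.update_aliases(
    body={
        "actions": [
            {
                "add": {
                    "index": "errors-2015-01-21",
                    "alias": "customer-%s" % customer_id,
                    "filter": {"term": {"customerId": customer_id}},
                    "routing": customer_id,
                }
            },
            {
                "remove": {
                    "index": "errors-2014-12-22",
                    "alias": "customer-%s" % customer_id,
                }
            },
        ]
    }
)

# Dropping the expired index frees disk and heap without a delete-by-query.
es.indices.delete(index="errors-2014-12-22")
```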

Doc Values

To address the field data cache increases, we're beginning our experiments with doc values. By default, when doing aggregations, Elasticsearch pulls the field data into memory. This is an expensive operation and, in our case, the greatest driver of hardware cost. An alternative exists that pre-computes the field data at index time: doc values. With the release of Elasticsearch 1.4.1, doc values have been tuned for performance and are approaching the speed of the field data cache. We're hopeful this will let us continue to scale without breaking the bank.
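
Enabling doc values is a per-field mapping change, so with daily indices we can turn it on for new data as we go. A sketch of what that mapping might look like, using the 1.4-era syntax and hypothetical field names:

```python
# Sketch of enabling doc values on fields we aggregate over, using
# Elasticsearch 1.4-era mapping syntax. Field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.0.0.4:9200"])

es.indices.create(
    index="errors-2015-01-21",
    body={
        "mappings": {
            "error": {
                "properties": {
                    "customerId": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True,
                    },
                    "timestamp": {"type": "date", "doc_values": True},
                }
            }
        }
    },
)
```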

Elasticsearch has been central to our growth and success, and underpins a lot of what we hope to do in the future. We look forward to continuing to work together.

P.S. Don't forget to register for the upcoming webinar for more in-depth technical discussion about our story and insight into how small teams can make a big impact with Elasticsearch. And grab a free trial of TrackJS and start finding and fixing your production JavaScript bugs.