Pythonic Analytics with Elasticsearch
Editor's note: With PyCon 2015 starting today in Montreal, Canada, we thought it might be nice to have the team of Pythonistas over at Parse.ly discuss how they make use of Elasticsearch for their analytics needs.
Parse.ly works with top media companies across the web to deliver a real-time analytics dashboard for digital storytellers. We also run an API platform for integrating analytics and content recommendations into websites like NewYorker.com and Arstechnica.com.
Python is central to everything we do at the company. So, when we evaluate open source technologies, strong Python support is one of the first things we look for.
Last year, we built a time series engine called Mage with Python at its core. To make this time series engine work at massive scale, we had to take advantage of technologies outside of the Python community.
Systems like Apache Storm and Apache Kafka caught our eye, but their Python support was lacking. So, we released our own open source modules like streamparse (for Storm) and pykafka (for Kafka). I am presenting our work on streamparse at PyCon this year.
This work allowed us to continue using Python, even while our real-time data processing backend spread load among hundreds of cores and many nodes. We maintained a design centered around immutability, reliability, and performance, with message volumes exceeding hundreds of thousands per second.
Elasticsearch's killer features for analytics
We first came across Elasticsearch about three years ago, because our team was very experienced with Lucene and Solr. We tracked its development over time, but got very excited about Elasticsearch's direction in 2014.
We recently wrote an in-depth post about Lucene internals called Lucene: The Good Parts. When we first evaluated Elasticsearch, we considered it merely as an alternative platform for full-text search and "hosted" Lucene. But, that all changed last year.
Two features caught our eye that made Elasticsearch particularly interesting to Parse.ly: aggregations and time-based indices. When we dug into Elasticsearch some more, we also found an extremely Python-friendly platform for document storage and analytics.
What's more, aggregations offer one of the most flexible ways to model time series data. For Parse.ly's use cases, they met the need to arbitrarily filter time series data, while offering ways to group (terms bucket), resample (date_histogram bucket), and calculate statistics (metric aggregations).
Though learning the aggregation query API is an exercise in mind-bending declarative JSON, we eventually built our own wrapper for this atop the excellent Python libraries officially available through Elastic, like elasticsearch-py and elasticsearch-dsl-py.
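To give a feel for that declarative JSON, here is a minimal sketch of the kind of nested aggregation body described above: filter a time range, group by site, resample into daily buckets, and sum a metric. The "events" index and the "site", "timestamp", and "views" field names are illustrative assumptions, not Parse.ly's actual schema.

```python
def pageview_agg_body(start, end):
    """Build an aggregation request body: range filter + terms +
    date_histogram + sum, nested in the declarative JSON style."""
    return {
        "size": 0,  # only aggregation results, no raw hits
        "query": {
            "range": {"timestamp": {"gte": start, "lt": end}}
        },
        "aggs": {
            "by_site": {
                # group results by site
                "terms": {"field": "site", "size": 10},
                "aggs": {
                    "per_day": {
                        # resample each group into daily buckets
                        "date_histogram": {"field": "timestamp",
                                           "interval": "day"},
                        "aggs": {
                            # compute a statistic per bucket
                            "total_views": {"sum": {"field": "views"}}
                        }
                    }
                }
            }
        }
    }

body = pageview_agg_body("2015-04-01", "2015-04-08")
# With elasticsearch-py, this body would be submitted as something like:
#   es.search(index="events", body=body)
```

Wrapping builders like this in Python functions is one way to tame the nesting; elasticsearch-dsl-py offers a more idiomatic chained-call interface over the same structure.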
Time-based indices are a data management technique where you group all records together by some time bucket, such as month, day, or hour. Then, a scheduled job expunges any records older than a certain TTL.
In Elasticsearch, this is so convenient because dropping an old index is a cheap operation -- essentially equivalent to unlinking a few files. In time series or log data use cases, you may only care to keep a trailing 30 days or 6 months of data. Time-based indices help you accomplish this.
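The pattern can be sketched in a few lines of pure Python. The "events-" prefix and the 30-day retention window are assumptions for illustration; the actual deletion would go through the cluster API (e.g. es.indices.delete in elasticsearch-py).

```python
from datetime import date, timedelta

PREFIX = "events-"

def index_for(day):
    """Daily index name, e.g. events-2015.04.10."""
    return PREFIX + day.strftime("%Y.%m.%d")

def expired_indices(existing, today, retention_days=30):
    """Return index names older than the retention window.

    Zero-padded date suffixes sort lexicographically, so a simple
    string comparison against the cutoff name is enough."""
    cutoff = index_for(today - timedelta(days=retention_days))
    return sorted(name for name in existing if name < cutoff)

today = date(2015, 4, 10)
existing = [index_for(today - timedelta(days=d)) for d in range(0, 40)]
to_drop = expired_indices(existing, today)
# Dropping each of these is the cheap operation described above --
# roughly equivalent to unlinking a few files:
#   for name in to_drop:
#       es.indices.delete(index=name)
```

Contrast this with deleting individual documents by timestamp, which forces Lucene to rewrite segments; dropping whole indices sidesteps that cost entirely.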
What's great about using Elasticsearch's time-based indices on a Python project is that the Elastic team maintains the curator module, which is written in Python and offers both an extensive Python API and an easy-to-use CLI for quick management operations.
Scaling Elasticsearch in production
Together with Elasticsearch's built-in replication and sharding model, Python programmers have a scalable platform atop which to build Pythonic analytics applications.
However, Elasticsearch is not a silver bullet for the challenges of analytics over huge data sets. Parse.ly operates a large Elasticsearch cluster, and our backend processes billions of data points per day to store over 4 billion documents in Elasticsearch.
We had to write our own tooling for index versioning, schema management, monitoring, and cluster expansion, along with our own query layer that took advantage of the above Elasticsearch features.
We also had to think hard about things like compressing our data and optimizing our use of the Lucene index format, so that we could take terabytes of raw customer data and store it in hundreds of megabytes of flexible rollups and summaries in Elasticsearch.
However, the result of this work is that we have one of the most flexible analytics stores we can imagine, and a data partitioning and sharding scheme that can help us as we grow. Our customers pay us for our beautiful analytics dashboard (written with AngularJS and d3.js) and our rich HTTP/JSON APIs (powered by Tornado).
Elasticsearch helps us meet our customers' massive query needs, so that we can focus on our core product and business value. It lets us offer a platform at a disruptively low price point for the enterprise web analytics market -- with a total cost of ownership that is often a fraction of a single data analyst's salary.
To the next thousand customers!
Our backend cluster is now spread over 50 production nodes, and growing. Those nodes are dedicated to serving hundreds of the web's highest-traffic sites -- from top news websites like The Telegraph, to culture/politics destinations like The New Republic. Operating at this scale is not easy, and required us to master and combine several open source technologies (Kafka, Storm, Redis, Cassandra, Elasticsearch) while also building plenty of tech on our own.
Among all of these technologies, Elasticsearch offered one of the best experiences for Python programmers. It allowed our team to meet several important analytics needs in a Pythonic way. We know it is the right technology to help us get to our next thousand customers.
We are happy to be Elasticsearch users, and we are excited for what's to come in Elasticsearch 2.0!
If you want to talk more about my experience with Elasticsearch, I'll be at PyCon 2015 in Montreal from April 10 - 15. Just reach out to me via @amontalenti on Twitter!