December 9, 2015

How the Elastic stack keeps our taxis rolling

mytaxi_logo

The Elastic-Technology-Stack supports our business in two different areas. It keeps our app responsive as a cache and also provides a backend for an intense amount of logs. This article will outline the architecture of our Logging Cluster.
Mytaxi is currently Europe’s most successful taxi app. Everyday we connect thousands of customers with professional taxi-drivers. Booking, procuration and payment have to run fast and smooth 24/7.

The Logging-Cluster

We started out in 2013 with a pretty simple standard configuration of the Elastic technology stack. It had two Elasticsearch nodes and one node for Logstash and Kibana. This worked well back in the days before our exponential growth started in 2014. This setup was running until mid 2015. The cluster was abandoned during our growth phase and thus it was lacking performance. By mid 2015 we had about 2 TB of data stored in Elasticsearch.

Also, in 2013 we decided to move away from a monolithic backend structure towards microservices. By now we have split our backend into around 50 microservices. We needed to gather logs from all of these services in one place. By July 2015 we decided to set up a new cluster that would support our needs for the upcoming years. The requirements were:

Save our logs for 90 days (a day is ~40 GB worth of logs)
Read performance was a bottleneck when people query large time ranges for data. We wanted to increase the performance by having more nodes.
The number of shards is configured during cluster creation. It cannot be changed. We wanted to have a number that is evenly divisible by 2, 4, 5 and 10. The least common multiple is 20.
Each worker should manage an even number of shards.
Although money is not the main focus, we wanted to reach a price/performance maximum.

Hardware setup and installation procedure

With these requirements we went to AWS and checked for a reasonable instance size. Three instance sizes were in line with our requirements (prices may be outdated):

i2.xlarge - 4 CPUs, 30,5 GB RAM, 800 GB SSD - 686,62 $/month
i2.2xlarge - 8 CPUs, 60,5 GB RAM, 2x800 GB SSD - 1510,57 $/month
m1.xlarge - 4 CPUs, 15 GB RAM, 4x420 GB HDD - 277,43 $/month

The m1.xlarge seemed to be gold here. They are cheap and they have 4 HDDs which could be combined to a RAID0. Ten m1.xlarge instances would form a cluster with 150 GB RAM and 16,8 TB of disk. With this much disk we could easily increase the amount of time that we keep the logs.

To configure ten nodes and install an equal configuration of Elasticsearch, we used Ansible. The playbook combined roles for:

Bootstrapping the AWS instance (setup hostname, install common packages etc.)
Setting up a RAID0 in software
Installing a monitoring software that reports to a monitoring proxy in the same VPC
Connecting the instances to our directory server, so that backend engineers can log in via SSH
Installing and configuring Elasticsearch including several plugin

Logcluster architecture

With the 10 Elasticsearch nodes running, you might wonder what the whole system looks like. Our microservices report their application logs into an AWS Elasticache (Redis). Logstash will pick up the logs from there and process them. The image below gives an overview of the architecture.

Logcluster usage examples

At the beginning of November 2015 we decided to push through a large change in our backend architecture. We started to set up every one of our microservices in a Docker container running on an AWS ECS cluster. This major change was accompanied by our Logcluster. As the migration needed to be performed without service interruption, we followed the Kibana screens closely. Even the tiniest amount of increased exception rates from an application, could indicate problems in the live service. So after we started containers for a new service, everyone would stare at the large TV set that shows Kibana and they would cheer when there were no exceptions. Although we were a bit frightened when batman showed up...

Conclusion

It is our goal to provide the best taxi experience in Europe and expand our service to do more and more cities over the next years. With around 50 microservices we need an information hub to get an overview of the overall system status. For us the Elastic-Technology-Stack provides this overview. But it also gives developers the chance to dig deep into bugs and closely follow the impact of changes.