September 17, 2015 User Story

How Avaaz uses the ELK-Stack to Improve their Production Systems

By Peter Colclough

Avaaz - meaning "voice" in several European, Middle Eastern and Asian languages - launched in 2007 with a simple democratic mission: organize citizens of all nations to close the gap between the world we have and the world most people everywhere want. Avaaz is a global petitioning site with more than 40 million members, and it was hitting some particularly interesting issues. Central at the time was the fact that they were outgrowing their database architecture and hitting database access issues. A victim of their own success.

The Issue is to find out 'Why' 

As we all know, we have logs: we have slow logs, we have Nagios, but none of these gives the exact information about what a server is doing at any point in time. Briefly, I had come across Lucene three years ago as a 'search engine'. It was being used with Solr by a previous retail client. It was spectacular. At the time I had investigated using mysql-proxy to read the actual query metrics directly off the servers, using a hand-crafted Lua script attached to the proxy. While never used in anger, this was more than helpful in ironing out some database issues ... but we didn't have a sane place to plot the results on a graph and analyze them.

Looking back at other Clients

Another client of mine called Madbits, a startup in image interpretation (a dark art like no other: http://www.farabet.net/) acquired by Twitter in 2014, had issues with a MongoDB setup, and also with retrieving information in anything close to a reasonable time for a web site. Part of the issue was down to MongoDB and the way it had been set up (never, ever, just have a single unit... it is not designed for that). After scaling out Mongo we had stunning results, but we then had issues with 'user generated tags'. Enter Lucene. We immediately faced a challenge in trying to use it. MongoDB stores documents in BSON (a binary JSON format) but Lucene uses Java objects. We could have done the conversion ourselves, but instead we discovered Elasticsearch. We could flip documents between MongoDB and Elasticsearch with very few, if any, changes. Wow! That works. So in came Elasticsearch and later Kibana.

However, we also needed to log the DB. So, with some handy Logstash parsing, we were able to push the Mongo logs into ES. We then used Kibana to graph them, which was also stunning. Direct insight into the workings of a fully sharded DB (across 3 data centres) in graph form... cool.

Introducing Elasticsearch at Avaaz

At Avaaz we had issues with MySQL. So we put in mysql-proxy, altered the Lua script to output JSON to a file, used Logstash to push that into ES, and we were there. Not only did mysql-proxy give us the ability to scale out the slaves, we also got actual insight into every query going into each machine. For the record, this now includes a couple of AWS RDS instances too. We ran up some Kibana 3 dashboards, and the results were astonishing.
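To make the pipeline above a little more concrete, here is a minimal Python sketch of the shape of the data, not the actual setup (which used a Lua script and Logstash). It assumes the proxy writes one JSON object per line; the field names `server`, `query` and `duration_us` and the index name `mysql-queries` are hypothetical. The second function only builds a newline-delimited body in the style of the Elasticsearch bulk API and does not send anything.

```python
import json
from typing import Iterable, Iterator, List


def parse_query_log(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON line, skipping blanks and garbage."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A half-written line at the end of the file is expected; skip it.
            continue
        if isinstance(record, dict):
            yield record


def to_bulk_body(records: List[dict], index: str) -> str:
    """Build a bulk-style body: an action line followed by a document line."""
    parts = []
    for record in records:
        parts.append(json.dumps({"index": {"_index": index}}))
        parts.append(json.dumps(record))
    if not parts:
        return ""
    # The bulk format expects every line, including the last, to end in "\n".
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    sample = [
        '{"server": "db1", "query": "SELECT 1", "duration_us": 120}',
        "not json at all",
    ]
    print(to_bulk_body(list(parse_query_log(sample)), "mysql-queries"))
```

In the setup described here Logstash does this job, so a script like this is only useful for seeing what one query record looks like before it reaches the cluster.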

The only issue we had was the sheer volume of data. We were recording off the live systems, onto a development ES cluster. Initially we were getting in excess of 150m documents per 24 hours on a normal day. On a busy day that could double. Elasticsearch was fine: it consumed the data with no issues, and the only concerns were storage and expansion. So we tweaked the cluster setup, added some more disk (RAID 0 and a reboot - simple), then made the decision to add a _ttl of 1 day, allowing Elasticsearch to manage the deletion, and built a series of aggregation scripts. This enabled us to see the last 24 hours, and have Elasticsearch automatically drop the documents that were too old. The aggregated information helped us see what was happening on a day by day basis, aggregated by the hour. On our new production cluster we are saving 3 days at a time, so we can see what happened at the weekend.
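The hourly roll-up idea can be sketched in a few lines of Python. This is an illustration only, not the aggregation scripts mentioned above: it assumes each record carries a hypothetical `server` name, an ISO 8601 `timestamp` and a `duration_us` value, and it reduces them to one small summary per server per hour, which is the part worth keeping once the raw documents have expired.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def aggregate_by_hour(records: Iterable[dict]) -> Dict[Tuple[str, str], dict]:
    """Roll raw query records up into per-server, per-hour summaries."""
    buckets = defaultdict(lambda: {"count": 0, "total_us": 0})
    for record in records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(record["timestamp"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (record["server"], hour.isoformat())
        buckets[key]["count"] += 1
        buckets[key]["total_us"] += record["duration_us"]
    return dict(buckets)


if __name__ == "__main__":
    raw = [
        {"server": "db1", "timestamp": "2015-09-11T15:27:30", "duration_us": 100},
        {"server": "db1", "timestamp": "2015-09-11T15:41:08", "duration_us": 300},
        {"server": "db1", "timestamp": "2015-09-11T16:00:01", "duration_us": 50},
    ]
    for key, summary in sorted(aggregate_by_hour(raw).items()):
        print(key, summary)
```

At 150m documents a day, a summary like this is tiny by comparison, which is why the raw data can be allowed to age out after a day or three.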

What changed?

What did this show us? A lot - and please bear in mind that the code was 6 years old, and had been written with an expected size of 10% of current levels. So, using Elasticsearch indices and Kibana graphs, we were able to query the dataset right the way down to threads. The graphs allow us to zero in on a given spike, so we can easily examine what caused it... and solve the issue. This showed, in no particular order:

  • How we could save 60m connections per day that did absolutely nothing. They were being created by default, one for each 'type' of machine, just in case they were needed. What seemed like a good idea at the start of the system turned out to be a major bugbear once it had grown.
  • How 40m queries per day should have been on the slaves, not the master, saving the master for what it should be doing: accepting data changes.
  • How some queries were constructed in such a way that they caused lockouts, where lockouts weren't wanted.
  • How the current monitoring software was consuming connection sockets at a rate that was unsustainable in a high load environment. Thanks, Cacti.

So we made code alterations. We dropped the unnecessary connections and moved queries onto the slaves. We replaced the replication monitoring with a simple script that has 1 connection per machine - and only 1 connection. We upped the number of allowable connections. We slept at night. All down to Elasticsearch and Kibana. Oh, and some rather clever Lua code attached to mysql-proxy, that I wrote 2 years earlier for a completely different client.
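As an illustration of the second finding in the list above, here is a rough Python sketch of how logged queries could be checked for reads that landed on the master. It is an assumption-laden simplification, not the analysis we ran in Kibana: the `server` and `query` field names and the server label `master` are hypothetical, and classifying by leading keyword ignores cases such as reads inside a transaction, which may still need to stay on the master.

```python
from typing import Iterable

# Statements that only read data, judged by their first keyword.
READ_KEYWORDS = ("select", "show", "describe", "explain")


def is_read_query(sql: str) -> bool:
    """Return True for statements that look read-only and take no row locks."""
    text = sql.strip().rstrip(";").strip().lower()
    if not text.startswith(READ_KEYWORDS):
        return False
    # SELECT ... FOR UPDATE takes locks, so it belongs on the master.
    return not text.endswith("for update")


def count_reads_on_master(records: Iterable[dict], master: str = "master") -> int:
    """Count logged read-only queries that were sent to the master."""
    return sum(
        1
        for record in records
        if record["server"] == master and is_read_query(record["query"])
    )


if __name__ == "__main__":
    log = [
        {"server": "master", "query": "SELECT * FROM members"},
        {"server": "master", "query": "UPDATE members SET active = 1"},
        {"server": "slave1", "query": "SELECT 1"},
    ]
    print(count_reads_on_master(log))
```

A count like this, plotted per hour, is the kind of number that makes it obvious how much work could move off the master.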

We have since expanded the process. We now graph table hits, allowing us to remove unneeded tables and columns that were created eons ago and never removed. We log disk usage by the DB, and actual disk usage. We are now setting up warnings: because Elasticsearch is so very fast at producing information, we can find 'pinch' points very quickly.

What's next?

So where next? Avaaz has a particular need to target its audience. In a lot of cases we need to do this geographically. So, while we have some really cunning MySQL queries that will do a 'radius' search, we are now building into our cluster enough data to allow searches by radius, by shape, and for members within the bounds of city limits. We will then add to that 'types of petition', so if you sign a petition of one type we will be able to produce 'you may also like...' type petitions to look at. All using the search capabilities of Elasticsearch, and the mapping that Kibana 4, using Leaflet JS, produces. How cool will this be? More to come on this when we have it running...
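For readers who have not met a 'radius' search before, the underlying calculation can be sketched in plain Python with the haversine formula. This is not our MySQL query or an Elasticsearch geo query, just an illustration under assumptions: the `lat` and `lon` field names are hypothetical, the Earth is treated as a sphere of radius 6371 km, and a real system would use an index rather than scanning every member.

```python
import math
from typing import Iterable, List

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def members_within(
    members: Iterable[dict], lat: float, lon: float, radius_km: float
) -> List[dict]:
    """Return the members whose location lies within radius_km of a point."""
    return [
        m for m in members
        if haversine_km(lat, lon, m["lat"], m["lon"]) <= radius_km
    ]


if __name__ == "__main__":
    members = [
        {"name": "London", "lat": 51.5074, "lon": -0.1278},
        {"name": "Paris", "lat": 48.8566, "lon": 2.3522},
        {"name": "New York", "lat": 40.7128, "lon": -74.0060},
    ]
    nearby = members_within(members, 51.5074, -0.1278, 400)
    print([m["name"] for m in nearby])
```

Doing this per row is exactly what gets expensive at 40 million members, which is the motivation for pushing geographic searches into the search cluster instead.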

Avaaz Worldwide User Distribution

Peter Colclough has been in the Software Industry for 35 years, covering all aspects of technical implementation. For the last 18 months he’s been working as an independent consultant with Avaaz, a global petitioning site, and has had previous engagements with organisations from major Retail, Oil & Gas, Governments, and Big Data with RAL, the UK branch of the Large Hadron Collider (1 Petabyte per second... but not kept): http://home.web.cern.ch/about/computing

Want to hear more of these stories? Then join us for the Elastic{ON} Tour in London on November 3.