How do you ensure a communications platform is delivering an industry-leading experience for 250 million customers worldwide?
By using the ELK stack to monitor logs and manage performance in real time
Case study highlights
Deliver high performance
- Keep up with the speed of business
- Improve operations productivity by 100%
- Reduce performance issue response time from days to minutes
Provide actionable insights
- Provide a stellar customer experience
- Maintain high availability and performance
- Gain valuable business intelligence
Providing reliable service for 250 million customers
Tango is a free mobile messaging service based in California, with 250 million registered users in 224 countries. Through communication, social features and a compelling content platform, Tango users discover engaging ways to connect, get social and have fun.
"In our industry, the customer experience is the most important aspect," explains Guy Fighel, Director of Engineering at Tango. "Every time we have an outage or performance degradation, we lose a customer on that particular operation. And if that's repeated over and over, we lose the customer to our competition. Our number one priority is to keep everything working, with very good performance and minimal downtime."
Log analysis is an effective approach to performance management. In the past, Tango used command line tools to manually grab logs on the backend. On the client side, they pushed all the logs to a huge database, but they had to know what they were looking for and where to find it. They also lacked critical capabilities such as correlating events from the backend with events from the client, and alerting on incidents or thresholds. As a result, resolving performance issues took a long time.
"We were completely blind to some events," says Fighel. "And when we finally found out about them, it was just too late, and we found ourselves in the middle of a crisis."
Using ELK to improve operations productivity by 100%
To gain visibility into logs for monitoring and troubleshooting infrastructure performance, Tango deployed the full ELK stack. Elasticsearch – serving as the core search and analytics engine – is at the heart of the stack, while Logstash serves as the data pipeline and Kibana is the data visualization tool.
"With ELK, we can search logs based on specific types, time, region – all the parameters we want," Fighel says.
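As a rough illustration of the kind of search Fighel describes, the sketch below builds an Elasticsearch query body that filters logs by type, region and time window. It uses the bool/filter form of the Elasticsearch query DSL. The field names `type` and `region` and the helper `build_log_query` are assumptions made for this example and do not come from Tango's deployment; `@timestamp` is the field Logstash adds to events by default.

```python
from datetime import datetime, timedelta, timezone


def build_log_query(log_type, region, minutes_back, now=None):
    """Build a query body filtering logs by type, region and time window.

    `type` and `region` are hypothetical field names chosen for illustration.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes_back)
    return {
        "query": {
            "bool": {
                # Filters match exactly and do not affect relevance scoring.
                "filter": [
                    {"term": {"type": log_type}},
                    {"term": {"region": region}},
                    {"range": {"@timestamp": {
                        "gte": start.isoformat(),
                        "lte": now.isoformat(),
                    }}},
                ]
            }
        },
        # Newest events first.
        "sort": [{"@timestamp": {"order": "desc"}}],
    }
```

A body like this could be sent to the search endpoint of whichever index holds the logs, or reproduced interactively in Kibana.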
Tango ships logs from servers via Logstash to a Redis cluster, and then on to Elasticsearch. The stack pulls all the logs from the production backend, receives the logs pushed from clients all over the world, and then correlates the two. On top, Kibana serves as the dashboard.
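The buffer-then-correlate flow described here can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Tango's implementation: an in-memory `deque` stands in for the Redis cluster, the functions `ship` and `drain_and_correlate` are hypothetical, and the `session_id` key used to match backend and client events is an assumption, since the case study does not say how the two sides are joined.

```python
import json
from collections import defaultdict, deque


def ship(queue, source, event):
    """Shipper side: tag the event with its source and push it onto the buffer."""
    queue.append(json.dumps({**event, "source": source}))


def drain_and_correlate(queue):
    """Indexer side: pop buffered events and group them by session."""
    by_session = defaultdict(lambda: {"backend": [], "client": []})
    while queue:
        event = json.loads(queue.popleft())
        by_session[event["session_id"]][event["source"]].append(event)
    # Keep only sessions seen from both sides, so the two views can be compared.
    return {
        sid: sides
        for sid, sides in by_session.items()
        if sides["backend"] and sides["client"]
    }


buffer = deque()  # stands in for the Redis cluster
ship(buffer, "backend", {"session_id": "s1", "op": "send", "server_ms": 40})
ship(buffer, "client", {"session_id": "s1", "op": "send", "client_ms": 310})
ship(buffer, "client", {"session_id": "s2", "op": "call", "client_ms": 95})
correlated = drain_and_correlate(buffer)
```

Placing a queue between the shippers and the indexer means a burst of logs, or a brief indexing slowdown, fills the buffer instead of dropping events.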
"We know what to expect from our clients, so we can see even a slight change in performance," Fighel explains. "ELK gives us the visibility. We are actually measuring response times in real time for 250 million customers all around the world. This is amazing. It's like sitting with a customer anywhere in the world, and I can see what the system is doing, and when and why it is not performing well. This is the real value of ELK.
"ELK has enabled us to achieve a 100% improvement in productivity," he continues. "Since we implemented ELK, our response time to performance issues has dropped dramatically, to five minutes after an incident or even faster. Before ELK, it could be days before we even realized we had an issue.
"The bottom line for our business is that ELK gives us the capability to monitor our uptime and performance, and analyze and solve issues as quickly as possible," Fighel adds. "With ELK, we can ensure Tango is highly available and delivering high performance."
Gaining business intelligence through log analysis
In addition to performance management, Tango also leverages ELK for Business Intelligence (BI). For example, ELK provides Tango with analytics on which features are used most frequently, and which version of Tango is most popular.
"We can do some basic BI analysis with ELK, based on the operational and infrastructure data coming both from the client and the servers," Fighel says. "This helps us pinpoint the features that are working and the features that are not. Then we can change a feature or add a new one to improve the customer experience."
For example, Tango uses ELK to identify specific geographical regions with low performance, possibly due to less reliable networks. Then Tango can partner with local cloud providers to enhance performance with a proxy layer in that region.
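A minimal sketch of this kind of regional analysis, assuming each log event carries a `region` and a `response_ms` field (both names are hypothetical, as is the `slow_regions` helper): group response times by region, take the median, and report the regions whose median exceeds a chosen threshold. In practice the grouping would more likely run as an aggregation inside Elasticsearch; plain Python is used here only to show the logic.

```python
from collections import defaultdict
from statistics import median


def slow_regions(events, threshold_ms):
    """Return {region: median response time} for regions above the threshold."""
    by_region = defaultdict(list)
    for event in events:
        by_region[event["region"]].append(event["response_ms"])
    medians = {region: median(times) for region, times in by_region.items()}
    return {region: m for region, m in medians.items() if m > threshold_ms}


# Illustrative data only, not real measurements.
sample_events = [
    {"region": "region-a", "response_ms": 100},
    {"region": "region-a", "response_ms": 120},
    {"region": "region-b", "response_ms": 900},
    {"region": "region-b", "response_ms": 700},
    {"region": "region-b", "response_ms": 800},
]
flagged = slow_regions(sample_events, threshold_ms=500)
```

The median is used instead of the mean so that a handful of extreme outliers does not flag an otherwise healthy region.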
"This type of analysis is done exclusively with the ELK stack," Fighel says. "We didn't have any other option to analyze this before."
Complementing APM with log analysis
Tango uses New Relic for Application Performance Management (APM) to ensure the Tango app is performing for customers. Fighel says that log analysis via ELK is a critically important complement to APM.
"We use ELK and New Relic APM side-by-side," he points out. "If you look at my screen, you will see APM on one side and Kibana on the other side. We can analyze the application performance issues with APM, but we have ELK to see the logs from servers, a broad set of data coming from the client side, so we can analyze performance issues from another perspective."
Monitoring Elasticsearch with Marvel
"How do you monitor the monitor?" Fighel asks. "This is always the question. It is important to have something to monitor the monitoring solution. Before Marvel, this was very hard. Now we use Marvel to monitor our Elasticsearch clusters. Marvel gives the flexibility and ease of monitoring Elasticsearch itself."
Immediate response to performance issues
Before ELK, Tango might not find out about a performance issue for days. With ELK, the Tango team responds to issues in minutes.
In addition to performance management, Tango also uses ELK to gain valuable business intelligence from logs, such as customer behavior, to help with business and product strategy.
Increased operations productivity
ELK saves Tango from many manual tasks, increasing team productivity by 100%.
Marvel, the new monitoring system designed specifically for Elasticsearch, enables Tango to ensure that this mission-critical performance management system is up and running.