Providing Reliable Service for 250 Million Customers
Tango is a free mobile messaging service based in California, with 250 million registered users in 224 countries. Through communication, social features and a compelling content platform, Tango users discover engaging ways to connect, get social and have fun.
"In our industry, the customer experience is the most important aspect," explains Guy Fighel, Director of Engineering at Tango. "Every time we have an outage or performance degradation, we lose a customer on that particular operation. And if that's repeated over and over, we lose the customer to our competition. Our number one priority is to keep everything working, with very good performance and minimal downtime."
Log analysis is an effective approach to performance management. In the past, Tango used command line tools to manually grab logs on the backend. On the client side, they pushed all the logs to a huge database, but they had to know what they were looking for and where to find it. They also lacked critical capabilities, such as correlating events from the backend with events from the client, and alerting on incidents or thresholds. Consequently, resolving performance issues took a long time.
"We were completely blind to some events," says Fighel. "And when we finally found out about them, it was just too late, and we found ourselves in the middle of a crisis."
Using the Elastic Stack to Improve Operations Productivity by 100%
To gain visibility into logs for monitoring and troubleshooting infrastructure performance, Tango deployed the full Elastic Stack. Elasticsearch – serving as the core search and analytics engine – is at the heart of the stack, while Logstash serves as the data pipeline and Kibana is the data visualization tool.
"With the Elastic Stack, we can search logs based on specific types, time, region – all the parameters we want," Fighel says.
Tango ships logs from servers via Logstash to a Redis cluster, and then on to Elasticsearch. The Elastic Stack collects all the logs from the production backend, receives the logs sent by clients all over the world, and correlates the two. Kibana sits on top as the dashboard.
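The correlation step can be pictured with a small sketch. It assumes that backend and client log events carry a shared identifier, here a hypothetical `request_id` field, and groups the events that appear on both sides. The article does not say which key Tango correlates on or how the correlation is implemented, so this is only one plausible shape for the idea.

```python
from collections import defaultdict


def correlate(backend_events, client_events, key="request_id"):
    """Group backend and client events that share the same correlation key.

    `request_id` is a hypothetical field chosen for this example.
    """
    grouped = defaultdict(lambda: {"backend": [], "client": []})
    for event in backend_events:
        if key in event:
            grouped[event[key]]["backend"].append(event)
    for event in client_events:
        if key in event:
            grouped[event[key]]["client"].append(event)
    # Keep only the keys that were seen on both the backend and the client.
    return {k: v for k, v in grouped.items() if v["backend"] and v["client"]}


# Made-up sample events for illustration only.
backend = [
    {"request_id": "r1", "service": "messaging", "latency_ms": 40},
    {"request_id": "r2", "service": "messaging", "latency_ms": 35},
]
client = [
    {"request_id": "r1", "region": "apac-south", "round_trip_ms": 900},
    {"request_id": "r3", "region": "eu-west", "round_trip_ms": 120},
]
matched = correlate(backend, client)
```

In this sample only `r1` appears on both sides, and the pairing shows a fast backend response next to a slow client round trip, which points toward the network rather than the server.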
"We know what to expect from our clients, so we can see even a slight change in performance. The Elastic Stack gives us the visibility. We are actually measuring response times in real time for 250 million customers all around the world. This is amazing. It's like sitting with a customer anywhere in the world, and I can see what the system is doing, and when and why it is not performing well. This is the real value of the Elastic Stack."
Guy Fighel, Director of Engineering, Tango
"The Elastic Stack has enabled us to achieve a 100% improvement in productivity," he continues. "Since we implemented the Elastic Stack, our response time to performance issues has dramatically decreased to five minutes after an incident or even faster. Before the Elastic Stack, it could be days before we even realized we had an issue."
"The bottom line for our business is that the Elastic Stack gives us the capability to monitor our uptime and performance, and analyze and solve issues as quickly as possible," Fighel adds. " With the Elastic Stack, we can ensure Tango is highly-available and delivering high performance."
Gaining Business Intelligence through Log Analysis
In addition to performance management, Tango also leverages the Elastic Stack for Business Intelligence (BI). For example, the Elastic Stack provides Tango with analytics on which features are used most frequently, and which version of Tango is most popular.
"We can do some basic BI analysis with the Elastic Stack, based on the operational and infrastructure data coming both from the client and the servers," Fighel says. "This helps us pinpoint the features that are working and the features that are not. Then we can change a feature or add a new one to improve the customer experience."
For example, Tango uses the Elastic Stack to identify specific geographical regions with low performance, possibly due to less reliable networks. Then Tango can partner with local cloud providers to enhance performance with a proxy layer in that region.
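A rough sketch of this kind of regional analysis follows. It assumes each client log entry yields a (region, latency in milliseconds) pair and flags regions whose median latency exceeds a chosen threshold. The region names, sample values, and the 300 ms threshold are invented for illustration, and the median is only one reasonable summary statistic; the article does not state which metric Tango uses.

```python
from collections import defaultdict
from statistics import median


def slow_regions(samples, threshold_ms, min_samples=3):
    """Return {region: median latency} for regions whose median exceeds threshold_ms.

    `samples` is an iterable of (region, latency_ms) pairs. Regions with fewer
    than `min_samples` measurements are skipped to avoid noisy conclusions.
    """
    by_region = defaultdict(list)
    for region, latency_ms in samples:
        by_region[region].append(latency_ms)

    flagged = {}
    for region, values in by_region.items():
        if len(values) < min_samples:
            continue
        mid = median(values)
        if mid > threshold_ms:
            flagged[region] = mid
    return flagged


# Made-up measurements for illustration only.
samples = [
    ("us-west", 120), ("us-west", 140), ("us-west", 130),
    ("apac-south", 480), ("apac-south", 520), ("apac-south", 610),
    ("eu-west", 900),  # a single sample, ignored by min_samples
]
flagged = slow_regions(samples, threshold_ms=300)
```

With these numbers only `apac-south` is flagged, which is the sort of signal that could prompt a closer look at network reliability in that region.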
"This type of analysis is done exclusively with the Elastic Stack," Fighel says. "We didn't have any other option to analyze this before."
Complementing APM with Log Analysis
Tango uses New Relic for Application Performance Management (APM) to ensure the Tango app is performing for customers. Fighel says that log analysis via the Elastic Stack is a critically important complement to APM.
"We use the Elastic Stack and New Relic APM side-by-side," he points out. "If you look at my screen, you will see APM on one side and Kibana on the other side. We can analyze the application performance issues with APM, but we have the Elastic Stack to see the logs from servers, a broad set of data coming from the client side, so we can analyze performance issues from another perspective."
Monitoring Elasticsearch with X-Pack Monitoring
"How do you monitor the monitor?" Fighel asks. "This is always the question. It is important to have something to monitor the monitoring solution. Before X-Pack Monitoring, this was very hard. Now we use X-Pack Monitoring to monitor our Elasticsearch clusters and it gives the flexibility and ease of monitoring Elasticsearch itself."