Building a World-Class Content Delivery Network (CDN)
The Telefónica brand has been built by delivering strong, reliable services for its customers. Much of this stems from Telefónica’s ongoing focus on innovation to ensure quality of service across its networks.
Over the last few years, the proliferation of new voice, internet, and video services has dramatically increased the complexity of Telefónica’s delivery methodology. This has led to a sharp increase in the volume of diverse logging and metric data created around both service delivery and consumption. In response, telecommunication companies of all sizes have invested heavily in infrastructure management. Many of these solutions were developed to provide operational insight into specific parts of that infrastructure. What was missing, however, was a way to extract, unify, and analyze data spread across multiple, disparate systems, and to do it all in real time.
Like many others in the industry, Telefónica had created its own home-grown systems, which were cumbersome and expensive to maintain, and offered very little technical flexibility. Additionally, incidents were only understood after they happened, and latency was problematic. The internal system provided a repository for data but no meaningful way to analyze that data or to take action based on new insights.
Telefónica found their solution with the Elastic Stack, which allowed them to combine and analyze different data sources without the need for a unified data format. Telefónica is innovating to create a data management platform that provides real-time access to the operational and commercial value in the data they possess, leading to an improved overall customer experience.
Discovering the Power of the Elastic Stack
Global Video Monitoring Technical Lead Álvaro Aldana and his team at Telefónica Global Video Unit had been experimenting with early iterations of the company’s Content Delivery Network (CDN), including monitoring them using a mix of open source and proprietary solutions. The aim was to start scaling the service portfolio to onboard more customers, while also using insight hidden within log and metric data to maintain performance. With the rapid growth of Video on Demand (VoD), mobile, and internet services, Álvaro’s team knew they needed a sophisticated, highly scalable solution that would allow for the instantaneous ingestion and real-time analysis of data from multiple sources. After experimenting with several options, the team selected the Elastic Stack as the perfect fit for removing ad-hoc developments and bringing the platform to an enterprise scale.
Within a few short months, the team re-engineered the platform to ingest client transactions and video streaming logs into Elasticsearch to gain insight into consumption and service performance. For example, they were able to see which channels their customers were watching and, finally, the associated bitrate statistics and latencies, data that had been consistently overlooked prior to adopting the Elastic Stack. Not only could they see the composition of Telefónica’s viewing audience and which content they were watching, the team could also monitor the proportion of viewing taking place live or on-demand in particular geographies and at specific times of day.
Analyzing Log Data and Anomalies at Scale
Log data provides valuable information about what’s happening within and across large networks. Logs record the events happening within a system, such as logins, user interactions, and errors, as intermittent text-based records. The more systems and formats involved, the more complicated the challenge.
Telefónica found Elasticsearch to be the perfect tool for monitoring and analyzing large volumes of differently formatted data, and discovered its power for finding anomalies, spotting trends, and forecasting.
By being able to explore log data in real time (regardless of the source log format), the team can easily explore new relationships and correlations as fast as they can come up with new ideas. This newfound freedom to explore has not only enabled Telefónica to move from problem solving to system optimization, it has also revealed a new, even larger role for data analysis within the broader business.
For example, the team can easily see the number of errors occurring against every video fragment and can compare this against infrastructure usage. This has been a crucial development, as the team can now instantly tell which servers are most heavily used, for what reason, and where to focus engineering resources. By increasing the volume and variety of data ingested, queried, analyzed, and stored, they are able to report potential problems to their operations teams with a greater level of insight, resolve issues in a more proactive and efficient manner, and optimize network performance in real time.
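The comparison described above, errors per video fragment set against server usage, can be illustrated with a minimal sketch in plain Python. It is not Telefónica’s implementation: the record fields (`server`, `fragment`, `status`), the sample values, and the rule that a status of 500 or above counts as an error are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical delivery log records; the field names and values are
# illustrative assumptions, not Telefónica's actual log schema.
LOGS = [
    {"server": "edge-01", "fragment": "ch1/seg-0001.ts", "status": 200},
    {"server": "edge-01", "fragment": "ch1/seg-0002.ts", "status": 503},
    {"server": "edge-02", "fragment": "ch1/seg-0002.ts", "status": 504},
    {"server": "edge-02", "fragment": "ch2/seg-0001.ts", "status": 200},
    {"server": "edge-01", "fragment": "ch2/seg-0001.ts", "status": 200},
]


def summarize(logs):
    """Count requests per server, errors per server, and errors per fragment."""
    requests_per_server = Counter()
    errors_per_server = Counter()
    errors_per_fragment = Counter()
    for entry in logs:
        requests_per_server[entry["server"]] += 1
        # Assumption: any 5xx status is treated as a delivery error.
        if entry["status"] >= 500:
            errors_per_server[entry["server"]] += 1
            errors_per_fragment[entry["fragment"]] += 1
    return requests_per_server, errors_per_server, errors_per_fragment


requests, server_errors, fragment_errors = summarize(LOGS)

# Busiest servers first, with their error counts alongside.
for server, total in requests.most_common():
    print(f"{server}: {total} requests, {server_errors[server]} errors")

# Fragments with the most errors first.
for fragment, count in fragment_errors.most_common():
    print(f"{fragment}: {count} errors")
```

In a search engine such as Elasticsearch, the same grouping would typically be expressed as aggregations over indexed documents rather than as a loop over records in memory.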
Since incorporating Elasticsearch into the CDN in 2014, Telefónica has seen an explosion in the volume of content consumed as new users join the platform. Telefónica’s customer numbers have doubled over the last three years alone and, as a result, the team continued to experiment.
In particular, Álvaro’s team has expanded into detecting anomalies based on the contents of logs. The team is using Elastic machine learning to analyze patterns in other logs from around the organization, specifically logs from Telefónica’s end-to-end video platform: encoding/decoding activity, content workflow, and other server activity sitting outside the core CDN. Elastic machine learning features automatically model the behavior of the Elasticsearch data, including trends, periodicity, and more. Prior to activating machine learning features, the team was not able to easily detect these anomalies. Detecting the factors that influence these anomalies has allowed their engineers to identify issues faster, streamline root cause analysis, and reduce false positives, which in turn has helped protect their quality of service standards.
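To make the idea of log-based anomaly detection concrete, here is a deliberately simple stand-in: a trailing-window z-score over per-minute error counts. It is not the model used by Elastic machine learning, which handles trends and periodicity; the function name, window size, threshold, and sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=5, threshold=3.0):
    """Return the indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations.

    This is a simplified illustration, not Elastic's algorithm.
    """
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if counts[i] != mu:
                anomalies.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical error counts per minute, with one obvious spike.
errors_per_minute = [4, 5, 6, 5, 4, 5, 6, 40, 5, 4]
print(find_anomalies(errors_per_minute))  # prints [7]
```

A fixed window like this one is skewed by the spike for the next few minutes, which is one reason production systems rely on richer models of normal behavior.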
We see great promise for the application of Elastic machine learning across the estate, in a range of use cases. In fact, it is already helping us greatly with service management logging, identifying novel problems inside content delivery and streaming services that might otherwise be hidden. These hidden problems can damage our image, so being able to identify these tiny issues in real time with Elasticsearch means we are far more responsive, the content delivery services perform well, and our reputation for quality is maintained.
Álvaro Aldana, Global Video Monitoring Technical Lead, Telefónica
As Telefónica saw a steady increase in consumption of their digital services, they realised that they needed to analyze and store greater volumes of data. They needed access to 15 to 25 days of data, compared to the three days they had historically retained. The team was particularly interested in making the platform easily available to developers without performance dips when a user ran a large query.
Additionally, in less than four months, Telefónica moved its video platform logging from a previous solution to Elasticsearch, gaining a more holistic understanding of the system, surfacing anomalies with machine learning features, and saving costs at the same time.
Álvaro and his colleagues have worked closely with Elastic field and support teams to build and fine-tune the platform, testing and expanding their mix of hardware to find the perfect combination.
It is all about how well it integrates with other solutions, particularly with our previous supplier, and how easy it is to configure. Working in collaboration with Elastic, we’re able to fine-tune each component of the platform to the point where we’re seeing major improvements. The performance of the platform has improved significantly, allowing us to process 200,000 documents per second, all achieved through fine-tuning and our close partnership with Elastic’s support team.
Álvaro Aldana, Global Video Monitoring Technical Lead, Telefónica
The team has reported immediate improvements in platform processing power, but the most notable improvements have been in operational processes. Álvaro can now see, in real time, whether a software patch is effective or how a new update is affecting the time it takes for a video fragment to be served to the end viewer. For Telefónica, these have been the most noticeable benefits of moving to Elastic products and working with the Elastic team.
Before Elastic, Telefónica had only a limited subset of service metrics, based on batch processes. Now, the CDN development teams are able to see fully consolidated KPIs in real time and build real-time dashboards for making immediate decisions.
“Being able to see changes in real time has transformed the way we manage the CDN and it is something that wasn’t possible before we started working with the Elastic Stack,” Álvaro noted. “We are able to improve rapidly because we have a powerful ecosystem of tools built on Elasticsearch. We have been able to develop quickly, expanding the solutions into which it integrates, so that the Elastic Stack now sits firmly at the core of our operational framework.”
Innovation around the combination of log data and machine learning is giving Telefónica a holistic view of its CDN, moving it from a model of management and maintenance to network optimisation, which is critical to improving the service overall. Using Elasticsearch allows administrators to find anomalies and pinpoint causality faster. It also becomes possible to model and analyze vast volumes of historic data, to not only learn from past failures, but also to identify patterns, trends, precursors, and warning signs.
Telefónica’s focus on network performance is, the team believes, the foundation of and the secret to maintaining customer loyalty both now and into the future. They will be expanding the implementation of the Elastic Stack to their video platform applications, such as customer portals, digital rights management, content management, and customer provisioning. But it is the mix of technology that Álvaro believes will allow Telefónica to remain competitive no matter how the telecommunications sector changes and customer needs evolve.
“Only by innovating around network performance, and moving to a model of optimisation versus simple monitoring, will we build the kind of network that our customers will trust. Reliability and resilience will remain our key focus as we grow and deliver our service portfolio in new and interesting ways,” Álvaro concluded. “What Elastic has brought us is a highly sensitive and intelligent platform which gives us the power to respond in real time and better prepares us for growth.”