CenturyLink: Building a Network Management Framework with Open Source

This post is a recap of a community talk given at a recent Elastic{ON} Tour event. Interested in seeing more talks like this? Check out the conference archive or find out when the Elastic{ON} Tour is coming to a city near you.

CenturyLink is the third-largest telecommunications company in the United States and has created one of the largest networks in the world, spanning more than 450,000 route miles, long enough to go around the Earth 18 times. At any given moment, they’re dealing with serious volumes of data from a plethora of different network devices, and those devices continue to evolve year after year. Networks that once supported landlines and pagers have had to adapt to deal with requests from smartphones, smart speakers, and tablets, all of which generate new and different types of data, and more of it. And CenturyLink itself has evolved, too, growing substantially over the years through mergers and acquisitions as well as organic growth. Today, CenturyLink’s environment includes more than 300,000 devices that generate more than 1 million alerts daily.

CenturyLink’s rapid growth, together with the complex types and volumes of data it handles across its massive network, has posed challenges for the company’s logging and metrics analytics strategy. On the front lines is CenturyLink’s Service Assurance team, which is responsible for making sure that network operations centers (NOCs) have the tools they need to keep the network running optimally at all times. Key to that effort are the system logs and metrics that trigger alerts informing Operations when, for example, a fiber line has been cut or traffic is being rerouted.

When CenturyLink was relatively small and was juggling only a few network management systems (NMSs), a basic architecture — an adapter on top of the NMSs that fed into an Oracle database, which in turn fed into a display — was enough to keep those alerts flowing. But as CenturyLink acquired new partners that used different alarm types and disparate network management structures, database queries started to get complicated. Struggling database performance translated into repeated outages and downtime for their NMS app, which meant Operations couldn’t see what was going on in the network. It quickly became clear that this structure wouldn’t be able to scale as CenturyLink grew.

CenturyLink began mapping out their entire system with one goal in mind: identifying a simpler, more robust architecture that would be able to scale with the company’s future growth. They scored an early victory by implementing a CQRS (Command Query Responsibility Segregation) architecture that separated the transaction manager and query cache into different components, with a REST service on top. Then they began working to integrate these homegrown elements with a number of other pieces that had been built over the years, with the goal of simplifying the overall system. On the query side, they tried implementing H2, distributed MySQL databases, and even a HashMap distributed across multiple applications, and they began to see some improvements — reduced downtime and a happier Operations center. But the complexity of the queries and the volume of the alarms made it difficult to continue to support this in a relational database management system.
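To make the CQRS idea concrete, here is a minimal sketch of how a write side (the transaction manager) can be kept separate from a read side (the query cache). It is illustrative only and is not CenturyLink's implementation: the class names, the alarm fields, and the in-memory storage are all assumptions, and the REST layer is left out.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm shape; real NMS alarms carry many more fields.
    alarm_id: str
    device: str
    severity: str
    active: bool = True


class QueryCache:
    """Read side: holds a view of alarms that is shaped for lookups only."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Alarm] = {}

    def apply(self, alarm: Alarm) -> None:
        # Called by the write side whenever an alarm changes.
        self._by_id[alarm.alarm_id] = alarm

    def active_by_severity(self, severity: str) -> List[Alarm]:
        return [
            a for a in self._by_id.values()
            if a.active and a.severity == severity
        ]


class TransactionManager:
    """Write side: accepts commands, owns the state, and feeds the read side."""

    def __init__(self, cache: QueryCache) -> None:
        self._state: Dict[str, Alarm] = {}
        self._cache = cache

    def raise_alarm(self, alarm_id: str, device: str, severity: str) -> None:
        alarm = Alarm(alarm_id, device, severity)
        self._state[alarm_id] = alarm
        self._cache.apply(alarm)

    def clear_alarm(self, alarm_id: str) -> None:
        # Raises KeyError if the alarm was never raised.
        cleared = replace(self._state[alarm_id], active=False)
        self._state[alarm_id] = cleared
        self._cache.apply(cleared)
```

In this sketch, commands such as raise_alarm and clear_alarm go only through the TransactionManager, while displays read only from the QueryCache. Because nothing else depends on how the cache stores its data, the read side could in principle be swapped for a different store without touching the write path.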

They found their answer in Elasticsearch. By incorporating Elasticsearch into the query side of their CQRS architecture, they were able to query more data than ever, faster than ever. Queries that had previously taken 1 or 2 minutes now took only seconds. Not only was Elasticsearch easy to set up — they were able to spin up a cluster instantly — but it also scaled effortlessly. Gone were the days of bracing for the next acquisition announcement. With this simpler network architecture, which also included Redis and Kafka and a custom UI working in concert with Elasticsearch, new network management systems and devices could be smoothly brought into the fold. Today, CenturyLink is supporting 1,000 users on this application, with greatly reduced outages and zero production defects.

Working with open source solutions like Elasticsearch has made the Service Assurance team realize that it’s possible to do even more with their data. In addition to the fault and performance data that they are already feeding into their systems, they are also collecting information about network utilization, weather, and even how technicians solved problems in the past — and they want to bring more of this data into their environment so that they can make more informed decisions. To make that a reality, the Service Assurance team envisions a large framework in which various types of data can be put through predictive algorithms and then operationalized in real time in order to produce actionable insights for the Operations team. To build this framework, CenturyLink has continued to leverage the flexibility of the Elastic Stack, with custom Logstash plugins and Elasticsearch playing key roles. And they plan on developing custom Kibana plugins in the future.

Want more details on what CenturyLink is doing to get better use out of their data? Watch The Elastic Evolution of CenturyLink’s Network Management System from Elastic{ON} Tour Denver.
