Elasticsearch as a time series database for telemetry data at NS1

NS1 is a leading Domain Name System (DNS) host and web traffic management company that equips customers with resilient, redundant, and multi-purpose DNS solutions for various application delivery stacks. Providing industry-leading web redundancy and speed means NS1’s time series database (TSDB) needs to collect and analyze incredible amounts of telemetry data — nominal throughput spikes can reach 700k data points per second — in as close to real time as possible.

This telemetry data comes from a globally distributed infrastructure that spans across systems, providers, and third-party integrations that all dynamically scale and emit logs, time series data, packet captures, sflow, and more. As NS1 continued to grow, establishing an architecture that could support real-time telemetry became a major need — one that NS1 DevOps and Software Engineer Christian Saide describes as being “difficult to solve elegantly.”

The team’s initial solution was a 13-server OpenTSDB cluster for handling OS and application metrics, as well as a three-server Elasticsearch cluster for collecting log data and general unstructured telemetry. This dual system worked well enough … until NS1’s terabytes-per-day operations continued to scale.

“These systems started to fail. Our telemetry systems started requiring more time to manage than the infrastructure it was intended to monitor.” - Christian Saide, DevOps and Software Engineer, NS1

As the team deliberated options for restructuring their TSDB, they established some core requirements:

  • A single system for metrics and unstructured telemetry
  • Granular control over what data is stored and retained
  • Easy scalability
  • Large, diverse, responsive community for support
  • Sufficient analysis capabilities

NS1 decided to go all in with Elastic. Not only would their primary need for a single, scalable TSDB be solved with Elasticsearch, but they could now smoothly transition to the Kibana UI to make sense of the mountains of diverse data being ingested by a variety of Elastic Beats: Metricbeat, Filebeat, Packetbeat, and Heartbeat. Product features aside, it was in large part Elastic’s community that won over the team and influenced their decision.

“The community and all of the plugins all fit together and make a cohesive environment for you to utilize,” Saide explains. “There is a myriad of blog posts and different organizations and companies out there that are explaining how they use Elasticsearch and providing insight to others on how they might be able to utilize it … [] has been an absolutely amazing tool for us.”

They also found value in data introspection and data structure support. Gone were the days of simple keyword and wildcard matching; they could now perform searches for everything from IP range and numeric lookups to actual full-text queries and tokenization. “These are incredibly powerful tools to allow your operations team to actually take a look at what you’re collecting and use that in interesting ways to solve problems or understand what’s happening in an infrastructure.”
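To make those search types concrete, here is a minimal sketch of an Elasticsearch query body that combines an IP range lookup, a numeric range lookup, and a full-text match. It is illustrative only and not taken from NS1's setup: the field names `client_ip`, `latency_ms`, and `message` are hypothetical, as is the helper `build_telemetry_query`. The sketch assumes `client_ip` is mapped as an `ip` field (so a `term` query accepts CIDR notation), `latency_ms` as a numeric field, and `message` as an analyzed `text` field.

```python
import json


def build_telemetry_query(cidr, min_latency_ms, max_latency_ms, text):
    """Build a query body with an IP-range filter, a numeric range filter,
    and a full-text match. All field names here are hypothetical."""
    return {
        "query": {
            "bool": {
                # Filters narrow the result set without affecting relevance scoring.
                "filter": [
                    # On an `ip`-mapped field, a term query accepts CIDR notation.
                    {"term": {"client_ip": cidr}},
                    # Numeric lookup bounded on both sides (inclusive).
                    {"range": {"latency_ms": {"gte": min_latency_ms,
                                              "lte": max_latency_ms}}},
                ],
                # Full-text match: the query string is analyzed (tokenized)
                # the same way the field was at index time.
                "must": [
                    {"match": {"message": text}},
                ],
            }
        }
    }


query = build_telemetry_query("10.0.0.0/8", 100, 500, "connection timeout")
print(json.dumps(query, indent=2))
```

The resulting JSON could be sent as the body of a search request against whichever index holds the telemetry documents. Putting the IP and numeric conditions in `filter` rather than `must` is a design choice in this sketch: they are yes/no conditions, so they do not need to contribute to scoring.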

“This has been an incredibly effective solution for us at NS1.” - Christian Saide, DevOps and Software Engineer, NS1

NS1’s biggest triumph by far has been their ability to manage a single operational telemetry cluster. “I cannot tell you how much time, money, and just cognitive brainspace that this has already saved us,” Saide says. “[W]e get to focus on one and really learn one cluster. And it has been an absolutely amazing opportunity for us to spend more time working on other systems that are more directly related to our company.”

Elastic’s clustering capabilities quickly became a hit at NS1. The automatic and manual multicast and unicast clustering options in Elasticsearch — and inherent fault tolerance therein — meant the team could readily scale up and down as needed without worrying about their cluster. They discovered they could take down nodes for major upgrades or maintenance with zero downtime.

Pioneering a major TSDB architecture overhaul doesn’t come without its learning curve, though. In the Leveraging Elasticsearch as a Time Series Database webinar, Saide details not only the successes, but also the challenges that came with NS1’s rebuild and even dives into suggestions for adopting Elasticsearch as a primary cluster source.

Webinar: Leveraging Elasticsearch as a time series database - lessons from a DevOps engineer