How Oak Ridge National Laboratory optimized its supercomputers with Elastic


The history of Oak Ridge National Laboratory, tucked in the hills of Tennessee, is that of a top-secret government facility, where scientists once raced to unlock the secrets of atomic energy.

These days, the lab, while still maintaining programs researching nuclear science, has a much broader mandate, studying everything from biological and environmental systems to clean energy to the structure of the COVID-19 virus.

Underpinning nearly everything Oak Ridge studies is its supercomputing program, where world-class researchers push the limits of computational power in service of scientific advancement.


And underpinning Oak Ridge’s supercomputers, helping to keep them stable and performant, is technology from Elastic. Oak Ridge’s latest supercomputer, Summit, was deployed in 2018 and has a peak performance of 200 petaFLOPS, or 200 quadrillion calculations per second. While impressive for its time, that’s nothing compared to the lab’s forthcoming supercomputer, Frontier, due to come fully online later this year.

Frontier will have a peak performance of 1.5 exaFLOPS — a 650% increase from Summit. As the first exascale computer in the United States, it will help scientists achieve previously impossible breakthroughs in energy and national security research.

Frontier will occupy the space of nearly two football fields and require 40 megawatts of power to run. Compared to Summit’s 13 megawatt power load, Frontier’s power draw means that even small tweaks can translate into huge efficiencies in operation. That in turn translates into better economics and faster breakthroughs for researchers using Frontier to solve previously unsolvable problems.

All this means that speed and performance are critical for the team building Frontier to optimize, and it is why that team has turned to Elastic to monitor and tune its performance.

The analytics and monitoring team at Oak Ridge recently discussed how they use Elastic logging to help keep a complex system like Frontier stable, and utilize Kibana data visualization to pinpoint infrastructure efficiencies. Here, we share some of their insights, which are useful to anyone running Elastic at any scale, in any size organization.

How an Elastic-powered insight led to $2 million in savings

The Oak Ridge team enables scientific research in dynamic simulation, superconductivity, turbulent flow, quantum materials and earth science simulations. They are constantly searching for an edge in keeping their supercomputers stable and efficient. One such breakthrough will save Oak Ridge National Laboratory $2 million in annual cooling infrastructure costs.

Summit requires a massive amount of water to cool itself. By analyzing real-time data, the team identifies configuration efficiencies that can be made without interrupting the work of the data producers, reducing cooling and energy costs by seven figures.

How Oak Ridge uses Elastic to scale its research mission

Working with computer speeds that are measured in petaFLOPS on a daily basis, the Oak Ridge team knows a thing or two about scale. But they have to be able to do something with the incredible amount of data Summit generates, beyond just saving it. That’s where Elastic comes in: the Oak Ridge team uses Elastic as both a data store and an analytics engine.

Currently, the team’s Elastic deployment has six data nodes with 2.7 petabytes of usable storage, and plans for expansion. With data tiering in place for data lifecycle management, there is no hard limit on daily ingestion. The system does have a soft limit of 1.5 terabytes of data per day, supported by economical access to older data in the cold and frozen tiers thanks to Elastic’s pricing model.
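As an illustration of how that kind of tiering can be expressed, here is a minimal sketch of an index lifecycle management (ILM) policy submitted through Elasticsearch’s REST API. The policy name, tier timings, snapshot repository, and endpoint are hypothetical placeholders, not Oak Ridge’s actual configuration.

```python
import requests

ES = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

# Hypothetical ILM policy: roll hot indices over as they grow, then migrate
# data through the warm, cold, and frozen tiers as it ages.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {"min_age": "7d", "actions": {"shrink": {"number_of_shards": 1}}},
            "cold": {"min_age": "30d", "actions": {}},
            "frozen": {
                "min_age": "90d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "telemetry-snapshots"}
                },
            },
        }
    }
}

resp = requests.put(f"{ES}/_ilm/policy/telemetry-tiering", json=policy)
resp.raise_for_status()
```

A policy along these lines caps the size of any one index generation while letting older, rarely queried data sit on cheaper storage, which is what makes an unbounded daily ingest budget economically workable.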

This distributed, scalable architecture helps the team be prepared for new research proposals and programs, allowing for easy addition of datasets to the system.

How the Oak Ridge supercomputer datastream works

The Oak Ridge team deals with some of the most complex system setups imaginable. Yet to simplify its streaming data ecosystem, the team uses a setup that would be familiar to anyone who has deployed Elastic in any size organization.

The Oak Ridge team uses Kafka to deliver real-time data feeds directly to Elastic's Logstash tool. Logstash then parses these data streams to enable powerful queries in Elasticsearch and visualization in Kibana. Output can also flow into Prometheus to monitor containerized workloads, and on into Grafana or Nagios for additional visualization and alerting.
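The pipeline above runs through Logstash, but the same Kafka-to-Elasticsearch flow can be sketched in a few lines of Python for readers who want to see the moving parts. This is a simplified stand-in, not Oak Ridge’s pipeline; the topic, hosts, and index name are hypothetical, and the sketch assumes the kafka-python and requests libraries.

```python
import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

ES = "http://localhost:9200"     # hypothetical Elasticsearch endpoint
TOPIC = "summit-telemetry"       # hypothetical Kafka topic

# Consume JSON telemetry records from Kafka (the role Logstash plays above).
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

lines = []
for message in consumer:
    # Each bulk item is an action line followed by the document itself.
    lines.append(json.dumps({"index": {"_index": "summit-telemetry"}}))
    lines.append(json.dumps(message.value))

    if len(lines) >= 1000:  # flush every 500 documents to keep ingestion efficient
        resp = requests.post(
            f"{ES}/_bulk",
            data="\n".join(lines) + "\n",
            headers={"Content-Type": "application/x-ndjson"},
        )
        resp.raise_for_status()
        lines.clear()
```

In the real deployment, Logstash also parses and enriches each record before indexing; the sketch skips that step to keep the data path visible.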

The Oak Ridge team uses this analysis to make data-driven decisions about Summit’s infrastructure. The scientists producing the data can also request access to the indices, letting them use Elastic’s data store and analytics engine without spending hours on specialized training.
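Once they have access, those researchers can query the indices with ordinary Elasticsearch requests. The example below is a generic, hypothetical aggregation (hourly averages of a made-up water-temperature field), not one of Oak Ridge’s actual datasets.

```python
import requests

ES = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

# Hypothetical query: average cooling-water temperature per hour over the last day.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1d"}}},
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {"avg_temp": {"avg": {"field": "water_temp_c"}}},
        }
    },
}

resp = requests.get(f"{ES}/summit-telemetry/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_temp"]["value"])
```

The same aggregation can be built visually in Kibana without writing any query DSL at all.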

Optimizing Elastic, from the Oak Ridge supercomputer team

As Gina Tourassi, director of Oak Ridge's National Center for Computational Sciences, wrote, "Data-driven scientific discovery has taken an accelerated turn and the trend will continue with growing advances in artificial intelligence."

Working at a facility at the vanguard of high-performance computing (HPC) on the international stage, the Oak Ridge team offers three tips for using Elastic for HPC monitoring:

Tuning

  • Know your hardware: Tune your Elastic cluster according to CPU, storage, and memory availability, in order to create a stable, reliable system
  • Aim for the magic ratios: Elastic ratios, such as no more than roughly 20 shards per gigabyte of heap, are important rules of thumb (see the sketch after this list)
  • Tune storage and data tiers: Use hot, warm, cold, frozen ratios that match Elastic's recommendations for optimal performance
  • Keep index shard size at about 50 gigabytes per shard: For high-throughput data pipelines, increase the batch size or number of workers for Logstash to sustain this target
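The first two tips can be turned into concrete settings. Below is a minimal, hypothetical sketch: it estimates a cluster-wide shard budget from the 20-shards-per-gigabyte-of-heap guideline, then registers an index template whose attached ILM policy (the one sketched earlier) rolls indices over near the recommended shard size. The heap size, node count, and names are assumptions, not Oak Ridge’s figures.

```python
import requests

ES = "http://localhost:9200"      # hypothetical Elasticsearch endpoint
HEAP_GB_PER_DATA_NODE = 31        # assumed JVM heap per data node
DATA_NODES = 6

# Rule of thumb: keep the cluster at or under ~20 shards per GB of heap.
shard_budget = 20 * HEAP_GB_PER_DATA_NODE * DATA_NODES
print(f"Approximate cluster-wide shard budget: {shard_budget}")

# Hypothetical index template; the attached ILM policy rolls indices over
# at ~50 GB per primary shard, in line with Elastic's sizing guidance.
template = {
    "index_patterns": ["summit-telemetry-*"],
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            "index.lifecycle.name": "telemetry-tiering",
            "index.lifecycle.rollover_alias": "summit-telemetry",
        }
    },
}

resp = requests.put(f"{ES}/_index_template/summit-telemetry", json=template)
resp.raise_for_status()
```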

Troubleshooting

  • Check your logs: Use a monitoring cluster to get an idea of where to start digging for issues if any arise
  • Common areas to inspect: High CPU or memory usage, unexpected network traffic patterns, and data pipeline or configuration problems. Use Logstash's filtering capabilities to narrow down problem areas (a few starting-point API calls are sketched below).
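When an issue does surface, a handful of standard Elasticsearch diagnostic endpoints are a reasonable first stop. The calls below are generic examples against a hypothetical endpoint, not Oak Ridge’s monitoring setup.

```python
import requests

ES = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

# Cluster health: status, relocating and unassigned shards, pending tasks.
print(requests.get(f"{ES}/_cluster/health?pretty").text)

# Per-node operating system and JVM stats: CPU load, memory, heap pressure.
print(requests.get(f"{ES}/_nodes/stats/os,jvm?pretty").text)

# Hot threads: a quick read on what busy nodes are actually spending time on.
print(requests.get(f"{ES}/_nodes/hot_threads").text)
```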

Recovery

  • Pre-plan update and upgrade strategies: Have a recovery plan to reduce data loss or downtime and optimize recovery
  • Some things to look for: Slow recovery times may point to underlying issues, such as disk I/O or network bottlenecks, that can also drag down overall cluster performance (see the sketch after this list)
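For watching a recovery in flight, Elasticsearch exposes a recovery API, and recovery bandwidth can be raised or throttled through a cluster setting. The example below is a generic sketch; the 100 MB/s figure is purely illustrative and should be matched to the hardware at hand.

```python
import requests

ES = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

# Watch active shard recoveries: stage, bytes recovered, source and target nodes.
print(requests.get(f"{ES}/_cat/recovery?active_only=true&v=true").text)

# If disk or network bandwidth is the bottleneck, recovery throughput can be
# tuned cluster-wide; the value here is illustrative, not a recommendation.
settings = {"persistent": {"indices.recovery.max_bytes_per_sec": "100mb"}}
resp = requests.put(f"{ES}/_cluster/settings", json=settings)
resp.raise_for_status()
```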

Ready for Frontier

At ElasticON, the Oak Ridge team presented their approach to building the United States’ contribution to the exascale revolution. Elastic is proud to be part of the solution that is keeping the world’s fastest computer online.

Additional resources