How Oak Ridge National Laboratory optimized its supercomputers with Elastic

Suzzanna Martinez, Paul Smalera, Justin Wright

March 2, 2022

The history of Oak Ridge National Laboratory, tucked in the hills of Tennessee, is that of a top-secret government facility, where scientists once raced to unlock the secrets of atomic energy.

These days, the lab, while still maintaining programs researching nuclear science, has a much broader mandate, studying everything from biological and environmental systems to clean energy to the structure of the COVID-19 virus.

Underpinning nearly everything Oak Ridge studies is its supercomputing program, where world-class researchers push the limits of computational power in service of scientific advancement.

And underpinning Oak Ridge’s supercomputers, helping to keep them stable and performant, is technology from Elastic. Oak Ridge’s latest supercomputer, Summit, was deployed in 2018 and has a peak performance of 200 petaFLOPS, or 200 quadrillion calculations per second. While impressive for its time, that’s nothing compared to the lab’s forthcoming supercomputer, Frontier, due to come fully online later this year.

Frontier will have a peak performance of 1.5 exaFLOPS — a 650% increase from Summit. As the first exascale computer in the United States, it will help scientists achieve previously impossible breakthroughs in energy and national security research.

Frontier will occupy the space of nearly two football fields and require 40 megawatts of power to run, roughly three times Summit’s 13-megawatt load. At that scale, even small tweaks can translate into huge efficiencies in operation. That in turn translates into better economics and faster breakthroughs for researchers using Frontier to solve previously unsolvable problems.

All this makes speed and performance critical for the team building Frontier, which is why that team has turned to Elastic to monitor and optimize the system.

The analytics and monitoring team at Oak Ridge recently discussed how they use Elastic logging to help keep a complex system like Frontier stable, and utilize Kibana data visualization to pinpoint infrastructure efficiencies. Here, we share some of their insights, which are useful to anyone running Elastic at any scale, in any size organization.

How an Elastic-powered insight led to $2 million in savings

The Oak Ridge team enables scientific research in dynamic simulation, superconductivity, turbulent flow, quantum materials and earth science simulations. They are constantly searching for an edge in keeping their supercomputers stable and efficient. One such breakthrough will save Oak Ridge National Laboratory $2 million in annual cooling infrastructure costs.

Summit requires a massive amount of water to cool itself. By analyzing real-time data, the team identifies configuration improvements that can be made without interrupting the work of the data producers, reducing cooling and energy costs by seven figures.

How Oak Ridge uses Elastic to scale its research mission

Working daily with computer speeds measured in petaFLOPS, the Oak Ridge team knows a thing or two about scale. But they have to be able to do something with the enormous amount of data Summit generates, beyond just saving it. That’s where Elastic comes in. The Oak Ridge team uses Elastic as both a data store and an analytics engine.

Currently, Summit has six data nodes with 2.7 petabytes of usable storage, with plans for expansion. Because the team implements data tiering for data lifecycle management, Summit has no hard limit on daily ingestion. The system does have a soft limit of 1.5 terabytes of data per day, and older data stays economically accessible in the cold and frozen tiers thanks to Elastic’s pricing model.
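
As a rough illustration of data tiering, an Elasticsearch index lifecycle management (ILM) policy moves indices through hot, warm, cold, and frozen phases as they age. The sketch below builds such a policy body as a plain Python dictionary. The phase ages, the rollover thresholds, the `build_tiering_policy` helper, and the `example-repo` snapshot repository name are all assumptions chosen for illustration, not Oak Ridge's actual settings.

```python
import json


def build_tiering_policy(
    hot_max_shard_size="50gb",
    warm_after="7d",
    cold_after="30d",
    frozen_after="90d",
    snapshot_repository="example-repo",  # hypothetical repository name
):
    """Return an ILM policy body (as a dict) with hot, warm, cold, and frozen phases."""
    return {
        "policy": {
            "phases": {
                # Hot: actively written; roll over to a new index by size or age.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": hot_max_shard_size,
                            "max_age": "1d",
                        }
                    }
                },
                # Warm: no longer written; merge segments to save resources.
                "warm": {
                    "min_age": warm_after,
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                # Cold: rarely queried; lower its recovery priority.
                "cold": {
                    "min_age": cold_after,
                    "actions": {"set_priority": {"priority": 0}},
                },
                # Frozen: served from a searchable snapshot in cheaper storage.
                "frozen": {
                    "min_age": frozen_after,
                    "actions": {
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repository
                        }
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tiering_policy(), indent=2))
```

A body like this is what the ILM policy API (`PUT _ilm/policy/<policy_id>`) accepts; the right thresholds depend on the hardware and retention needs of a given cluster.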

This distributed, scalable architecture helps the team be prepared for new research proposals and programs, allowing for easy addition of datasets to the system.

How the Oak Ridge supercomputer datastream works

The Oak Ridge team deals with some of the most complex system setups imaginable. Yet to simplify its streaming data ecosystem, the team uses a setup that would be familiar to anyone who has deployed Elastic in any size organization.

The Oak Ridge team uses Kafka to provide real-time data feeds directly to Elastic's Logstash tool. Logstash then parses these data streams to enable powerful queries in Elasticsearch and output visualization in Kibana. Output can also flow into Prometheus to monitor containerized workloads, and subsequently into Grafana or Nagios for additional visualization and alerting.
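
To make the parsing step concrete, the sketch below shows in plain Python the kind of transformation a parsing stage performs: turning a raw log line into a structured document that can be queried by field. The log line format, the field names, and the `parse_line` helper are hypothetical. This is not Logstash configuration, and in the pipeline described above Logstash itself does this work.

```python
import re
from datetime import datetime, timezone

# Hypothetical line format: "<ISO 8601 timestamp> <host> <component>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<host>\S+)\s+(?P<component>[^:]+):\s+(?P<message>.*)$"
)


def parse_line(line):
    """Parse one raw log line into a dict, or return None if it cannot be parsed."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    doc = match.groupdict()
    try:
        parsed = datetime.fromisoformat(doc.pop("timestamp"))
    except ValueError:
        return None
    # Store the event time in UTC under the conventional "@timestamp" field.
    doc["@timestamp"] = parsed.astimezone(timezone.utc).isoformat()
    return doc


if __name__ == "__main__":
    sample = "2022-03-02T12:00:00+00:00 node-01 cooling: inlet temperature 21.5C"
    print(parse_line(sample))
```

Once events are structured this way, a query can filter on a field such as `host` or `component` instead of searching raw text.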

The Oak Ridge team uses this analysis to make data-driven decisions about the infrastructure of Summit. Also, the scientists producing data can request access to the indices. They too can then use Elastic’s data store and analytics engine technology without spending hours doing specialized training.

Optimizing Elastic, from the Oak Ridge supercomputer team

As Gina Tourassi, director of Oak Ridge's computer team, wrote, "Data-driven scientific discovery has taken an accelerated turn and the trend will continue with growing advances in artificial intelligence."

As a facility at the vanguard of high-performance computing (HPC) on the international stage, the Oak Ridge team offers tips in three areas for using Elastic for HPC monitoring:

Tuning

Know your hardware: Tune your Elastic cluster according to CPU, storage, and memory availability, in order to create a stable, reliable system
Aim for the magic ratios: Elastic ratios such as 20 shards per 1 gigabyte of heap are an important rule of thumb
Tune storage and data tiers: Use hot, warm, cold, frozen ratios that match Elastic's recommendations for optimal performance
Keep index shard size at 50 gigabytes per shard: For high-throughput data pipelines, increase the batch size or number of workers for Logstash to match this figure
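
The shard rules of thumb in this list reduce to simple arithmetic. The sketch below, using hypothetical helper names, estimates a shard budget for a node from its JVM heap and the number of primary shards needed to keep each shard near a target size, taken here as 50 gigabytes. The example inputs (a 30-gigabyte heap, and one day of data at the 1.5-terabyte soft limit treated as 1,500 gigabytes on disk) are assumptions for illustration.

```python
import math

SHARDS_PER_GB_HEAP = 20    # rule of thumb: 20 shards per gigabyte of heap
TARGET_SHARD_SIZE_GB = 50  # rule of thumb: target size per shard, in gigabytes


def max_shards_for_heap(heap_gb):
    """Upper bound on shards a node should hold, given its JVM heap in gigabytes."""
    return int(heap_gb * SHARDS_PER_GB_HEAP)


def primary_shards_for_index(index_size_gb, target_shard_gb=TARGET_SHARD_SIZE_GB):
    """Primary shards needed so that no shard exceeds the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


if __name__ == "__main__":
    print(max_shards_for_heap(30))         # 600
    print(primary_shards_for_index(1500))  # 30
```

These are starting points rather than hard limits; actual on-disk index size differs from ingested volume, so the numbers are best checked against a running cluster.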

Troubleshooting

Check your logs: Use a monitoring cluster to get an idea of where to start digging for issues if any arise
Common areas to inspect: High CPU or memory usage, unexpected network traffic patterns, and data pipeline or configuration problems. Use Logstash filtering capabilities to identify problem areas.

Recovery

Pre-plan update and upgrade strategies: Have a recovery plan to reduce data loss or downtime and optimize recovery
Some things to look for: Slow recovery times may indicate underlying issues, such as bottlenecks in disk I/O or networking, that could slow down recovery and reduce cluster performance

Ready for Frontier

At ElasticON, the Oak Ridge team presented their approach to building the United States’s contribution to the exascale revolution. Elastic is proud to be part of the solution that is keeping the world’s fastest computer online.

Additional resources

Log monitoring and anomaly detection at scale at ORNL
A single platform for US Government security and logging compliance
View more civilian government use cases and solutions
Get in touch with Elastic's ORNL team at federal@elastic.co
