Founded by the University of California, Berkeley in 1952, Lawrence Livermore National Laboratory, a part of the U.S. Department of Energy and the National Nuclear Security Administration, applies science and technology to make the world a safer place. Its defining responsibility is ensuring the safety, security, and reliability of the U.S. nuclear deterrent. In addition, the Laboratory’s science and engineering teams research counterterrorism, biosecurity, energy, environmental security, advanced materials, and other key focus areas.
These activities are supported by a high-performance computing (HPC) environment. The imminent deployment in 2024 of El Capitan, a new supercomputer capable of more than two exaflops, will add to the Laboratory’s advanced capabilities for modeling, simulation, and artificial intelligence. This could create more accurate and more predictive models for complex multi-physics problems such as inertial confinement fusion (ICF) at the National Ignition Facility. Physicists will get better answers to their questions and, in the case of ICF science, potentially save millions of dollars on fusion target fabrication.
Ian Lee, Security Operations Team Lead, is responsible for the overall security operations of the HPC environment, which hosts about 4,000 laboratory and collaborator users. "We have multi-disciplinary teams working on some of the most important security challenges facing the world. Keeping our systems available to support them on their mission is our top priority," he says.
The security team collects logs from across the HPC environment to identify vulnerabilities and issues. It then applies sophisticated risk management frameworks to prioritize alerts and allocate resources to resolve them. The team must also respond to growing requirements from the federal government (such as M-21-31) that impact logging, scanning, and remediation timelines. "We have a number of dashboards and workflows where data comes in from the system and then we generate tickets out to our ticketing system based on this live information," says Lee.
Taking supercomputing to new heights
Keen to reduce the pressure on resources, Lee looked for opportunities to automate logging and remediation activities that go beyond simply alerting an engineer to an issue. "If a website goes offline, it might be due to maintenance rather than an error. It is much better if you can provide extra information up front to the people tasked with fixing the alert."
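The idea of attaching context to an alert before it reaches an engineer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Laboratory's actual workflow: the `MaintenanceWindow` and `Alert` types, the `enrich_alert` function, and the host name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MaintenanceWindow:
    """A planned outage for one host (hypothetical structure)."""
    host: str
    start: datetime
    end: datetime


@dataclass
class Alert:
    """An alert as it might arrive from monitoring (hypothetical structure)."""
    host: str
    message: str
    timestamp: datetime
    context: dict = field(default_factory=dict)


def enrich_alert(alert, windows):
    """Annotate an alert with whether it falls inside a planned maintenance window."""
    in_maintenance = any(
        w.host == alert.host and w.start <= alert.timestamp <= w.end
        for w in windows
    )
    alert.context["planned_maintenance"] = in_maintenance
    alert.context["suggested_action"] = (
        "no action: host is in planned maintenance"
        if in_maintenance
        else "investigate outage"
    )
    return alert


# Example data: one maintenance window and one alert raised inside it.
windows = [
    MaintenanceWindow(
        host="web01.example.org",
        start=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
        end=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
    )
]
alert = enrich_alert(
    Alert("web01.example.org", "website offline",
          datetime(2024, 1, 10, 3, 15, tzinfo=timezone.utc)),
    windows,
)
print(alert.context)
```

With this kind of enrichment, the person receiving the ticket can see at a glance whether the outage was expected, rather than having to look it up.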
The Laboratory was already using Elastic components (Logstash, Filebeat) to gather data from its HPC clusters, and it investigated whether Elasticsearch and Kibana could be applied to all of its scanning and logging activities.
Lee also spoke to his peers at Oak Ridge National Laboratory, who have used Elastic for years. "They were able to describe Elastic’s performance in conditions similar to those in our own environment. This gave us full confidence to dive into the Elastic ecosystem."
The Laboratory team then spent two months fleshing out the solution architecture with a group of Elastic engineers. It is now migrating to Elastic Security for its SIEM, including centralized logging for cyber analytics across the HPC environment. Elastic Observability will be used for data aggregation, analysis, and evaluation of logs, metrics, and event data. The team is configuring Kibana dashboards that warn engineers of system vulnerabilities and errors.
The Laboratory sees benefits from Elastic’s out-of-the-box integrations, including with Apache and NGINX web servers, as well as Auditd, which collects logs from its Linux operating systems. Other integrations gather data from the Tenable vulnerability scanning system and various switches, routers, and firewalls.
Getting started in less than an hour
Even in the initial stages of deployment, the Laboratory has seen performance improvements, especially in speed and simplicity at the back end. Rather than using multiple shippers, most data is now managed through Elastic Agent connected to Fleet Server and sent directly into Elasticsearch. This provides a single, unified method for adding logs, metrics, and other types of data to a host. Search speed matters as well: the team can now query this data fast enough to act quickly, whereas results were previously too slow.
Speed of deployment is another important advantage. The Laboratory is using Ansible playbooks to spin up the Elastic containers on all the different nodes of its physical cluster. This process is fast and repeatable, which means the Laboratory can set up a working Elastic cluster in less than an hour.
As the Laboratory gears up for the deployment of the El Capitan exascale supercomputer, Elastic Observability will monitor metric data such as low-level hardware performance, counter data, voltages, clock speeds, and error rates on the physical hardware.
Smoothing the way to production
Lee drew on the expertise of the Elastic Professional Services team, especially during the early phases of the project. "We had two dedicated Elastic consultants who were invaluable when it came to ironing out aspects of the deployment," says Lee. "Especially when making the transition from pre-production to the production environment."
The Elastic consultants studied the Laboratory’s environment and its previous observability and security workflows to identify alternative approaches that are natively supported and easier to deploy with Elastic.
Now that the deployments of Elastic Security and Elastic Observability are starting to deliver results, Lee is looking at other potential Elastic uses. "We have lots of wikis, as well as internal and external websites, that our users rely on for information about our HPC systems. We hope to create a unified search function that can index across all these resources using Elastic Enterprise Search," he says.
The Laboratory sees an opportunity to accelerate and streamline its processes through machine learning. It plans to use Elastic algorithms that identify normal behavior and set alerts for deviations from those baselines. "Being able to do this automatically is something that we are looking at and excited about," says Lee.
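The baseline-and-deviation idea can be illustrated with a simple rolling z-score check. This is a minimal sketch, not Elastic's machine learning implementation, which the article does not describe. The `find_deviations` function, its default parameters, and the sample error counts are assumptions made purely for illustration.

```python
from statistics import mean, stdev


def find_deviations(values, baseline_size=10, threshold=3.0):
    """Return the indices of values that deviate from a rolling baseline.

    The baseline for each value is the previous `baseline_size` values.
    A value is flagged when it lies more than `threshold` standard
    deviations from the baseline mean.
    """
    flagged = []
    for i in range(baseline_size, len(values)):
        window = values[i - baseline_size:i]
        mu = mean(window)
        sigma = stdev(window)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical per-interval error counts with one obvious spike.
error_counts = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 48, 5, 6]
print(find_deviations(error_counts))
```

Note that this sketch keeps flagged values in later baselines, which temporarily inflates the standard deviation after a spike. A production system would typically handle that case and model seasonality as well.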
While these developments will help keep the Laboratory at the forefront of security, innovation, and discovery, Lee is staying down to earth. "In addition to security, one of our primary responsibilities is advancing science in the U.S. Although we’re only just past our first year partnering with Elastic, early results show that we can continue to expand our support for this mission and the teams behind these exciting breakthroughs," he says.
For a deeper dive, watch the presentation.