In this post I will introduce sysgrok, a research prototype in which we are investigating how large language models (LLMs), like OpenAI's GPT models, can be applied to problems in the domains of performance optimization, root cause analysis, and systems engineering. You can find it on GitHub.
sysgrok can do things like:
- Take the top most expensive functions and processes identified by a profiler, explain the functionality that each provides, and suggest optimizations
- Take a host and a description of a problem that host is encountering and automatically debug the issue and suggest remediations and further actions
- Take source code that has been annotated by a profiler, explain the hot paths, and suggest ways to improve the performance of the code
sysgrok's functionality is aimed at three broad categories of solution:
As an analysis engine for performance, reliability, and other systems-related data. In this mode, the LLM is fed the output of some other tool used by the engineer (e.g., a Linux command line tool, a profiler, or an Observability platform). sysgrok's goal is to interpret, summarize, and form hypotheses about the state of the system using an LLM. It may also then suggest optimizations or remediations.
As a focused, automated solution to specific performance- and reliability-related tasks. There are some tasks that come up repeatedly in performance engineering and SRE work. For these we can build focused, automated assistants that can be used directly by the engineer or by sysgrok itself in solving other problems. For example, in performance engineering, answering the question, "Is there a faster version of this library with equivalent functionality?" is common. sysgrok supports this directly.
As an automated root-cause analysis tool for performance and reliability issues. The former two categories of solution are a mix of data analysis, interpretation, search, and summarization. Crucially, they are applied in a focused manner to data that the engineer has themselves gathered. In sysgrok, we are also investigating a third approach to problem solving with LLMs, in which the LLM is combined with other tools to autonomously perform root-cause analysis and resolution of a given problem. In this approach, the LLM is given a problem description (e.g., "The web server is experiencing high latency") and told what capabilities are available to it (e.g., "ssh to a host," "execute arbitrary Linux command line tools"). The LLM is then asked for the actions to take, using capabilities available to it, to triage the problem. Those actions are executed by sysgrok, and the LLM is asked to analyze the outcomes, triage the issue, suggest remediations, and recommend next steps.
sysgrok is still very much in its early days, but we're releasing it as it's already useful for a variety of tasks — we hope it will be a convenient base for others to perform similar experiments. Please feel free to send us PRs or open issues on GitHub if you have any ideas!
LLMs, such as OpenAI's GPT models, have exploded in popularity in the past several months, providing a natural language interface to, and engine at the core of, all sorts of products, from customer assistant chat bots, to data manipulation assistants, to coding assistants. An interesting aspect of this trend is that essentially all of these applications make use of off-the-shell, generic models that have not been specifically trained or fine-tuned for the task at hand. Instead, they have been trained on large swathes of the internet as a whole and, as a result, are applicable to a broad variety of tasks.
So, can we make use of these models to help with performance analysis, debugging, and optimization? There are a variety of methodologies for investigating performance issues, triaging root causes, and coming up with optimizations. At its core though, any performance analysis work is going to involve looking at the output of various tools, such as Linux command line tools or an observability platform, and interpreting that output in order to form a hypothesis about the state of the system. Among the material the GPT models have been trained on are sources that cover software engineering, debugging, infrastructure analysis, operating system internals, Kubernetes, Linux commands and their usage, and performance analysis methodologies. As a result, the models can be used to summarize, interpret, and hypothesize on the data and problems that performance engineers encounter on a day-to-day basis, and this can accelerate the pace at which an engineer moves through their analysis.
We can go further though and move beyond using the LLM purely for data analysis and question answering within the context of the engineer's own investigative process. As we'll show later in this post, the LLM itself can be used to drive the process in some scenarios, with the LLM deciding what commands to run, or what data sources to look at, to debug an issue.
For the full set of capabilities that sysgrok supports, check out the GitHub repository. Broadly, it supports three approaches to problem solving:
In this mode, the LLM is fed the output of another tool being used by the engineer, such as a Linux command line tool, a profiler, or an observability platform. sysgrok's goal is to interpret, summarize, and suggest remediations.
As an example, the topn sub-command takes the top most expensive functions as reported by a profiler, explains the output, and then suggests ways to optimize the system.
This video also shows the chat functionality provided by sysgrok. When provided with the –chat argument, sysgrok will drop into a chat session after each response from the LLM.
This capability can also be applied generically to the output of Linux command line tools. For example, in the article Linux Performance Analysis in 60 seconds, Brendan Gregg outlines 10 commands that an SRE should run when they first connect to a host that is experiencing a performance or stability issue. The analyzecmd sub-command takes as input a host to connect to and a command to execute and then analyzes and summarizes the output of the command for the user. We can use this to automate the process described by Gregg and give the user a one-paragraph summary of all the data generated by the 10 commands, saving them the hassle of having to go through the command's output one by one.
There are some tasks that come up repeatedly in performance engineering and SRE work. For these we can build focused, automated assistants that can be used either directly by the engineer or by sysgrok itself in solving other problems.
As an example, the findfaster subcommand takes as input the name of a library or program and uses the LLM to find a faster, equivalent, replacement for it. This is a very common task in performance engineering.
Another example of this approach in sysgrok is the explainfunction sub-command. This sub-command takes the name of a library and a function within that library. It explains what the library does and its common use-cases, and then it explains the function. Finally, it suggests possible optimizations if that library and function are consuming a significant amount of CPU resources.
Usage of LLMs is not just limited to focused question answering, summarization, and similar tasks. Nor is it limited to one-shot usage, where they are posed a single, isolated question. The sysgrok debughost sub-command demonstrates how an LLM can be used as the "brain" in an agent with the goal of automated problem solving. In this mode, the LLM is embedded within a process that uses the LLM to decide how to debug a particular issue and gives it the ability to connect to hosts, execute commands, and access other data sources.
The debughost command is probably the most experimental part of sysgrok at the moment. It demonstrates one step on the path toward automated agents for performance analysis, but there's a significant amount of R&D required to get there.
In this post I've introduced sysgrok, a new, open-source AI assistant for analyzing, understanding and optimizing systems. We've also discussed the three broad categories of approach that sysgrok implements:
An analysis engine for performance, reliability, and other systems related data: See the topn, stacktrace, analyzecmd, and code subcommands.
Focused, automated solutions to specific performance- and reliability-related tasks: See the explainprocess, explainfunction, and findfaster sub-commands.
Automated root-cause analysis for performance and reliability issues: See the debughost sub-command.
You can find the sysgrok project on GitHub. Please feel free to create PRs and Issues, or you can reach out to me directly via firstname.lastname@example.org if you'd like to discuss the project or applications of LLMs in general.