Unlocking whole-system visibility with Elastic Universal Profiling™

Identify, optimize, measure, repeat!

SREs and developers who want to maintain robust, efficient systems and achieve optimal code performance need effective tools to measure and improve code performance. Profilers are invaluable for these tasks, as they can help you boost your app's throughput, ensure consistent system reliability, and gain a deeper understanding of your code's behavior at runtime. However, traditional profilers can be cumbersome to use, as they often require code recompilation and are limited to specific languages. Additionally, they can also have a high overhead that negatively affects performance and makes them less suitable for quick, real-time debugging in production environments.

To address the limitations of traditional profilers, Elastic^® recently announced the general availability of Elastic Universal Profiling, a continuous profiling product that is refreshingly straightforward to use, eliminating the need for instrumentation, recompilations, or restarts. Moreover, Elastic Universal Profiling does not require on-host debug symbols and is language-agnostic, allowing you to profile any process running on your machines — from your application's code to third-party libraries and even kernel functions.

However, even the most advanced tools require a certain level of expertise to interpret the data effectively. The wealth of visual profiling data — flamegraphs, stacktraces, or functions — can initially seem overwhelming. This blog post aims to demystify continuous profiling and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling.

Let’s begin.

Stacktraces: The cornerstone for profiling

It all begins with a stacktrace — a snapshot capturing the cascade of function calls.

A stacktrace is a snapshot of the call stack of an application at a specific point in time. It captures the sequence of function calls that the program has made up to that point. In this way, a stacktrace serves as a historical record of the call stack, allowing you to trace back the steps that led to a particular state in your application.

Further, stacktraces are the foundational data structure that profilers rely on to determine what an application is executing at any given moment. This is particularly useful when, for instance, your infrastructure monitoring indicates that your application servers are consuming 95% of CPU resources. While utilities such as 'top -H' can show the top processes that are consuming CPU, they lack the granularity needed to identify the specific lines of code (in the top process) responsible for the high usage.

In the case of Elastic Universal Profiling, eBPF is used to perform sampling of every process that is keeping a CPU core busy. Unlike most instrumentation profilers that focus solely on your application code, Elastic Universal Profiling provides whole-system visibility — it profiles not just your code, but also code you don't own, including third-party libraries and even kernel operations.

The diagram below shows how the Universal Profiling agent works at a very high level. Step 5 indicates the ingestion of the stacktraces into the profiling collector, a new part of the Elastic Stack.

_ Just _ _ deploy the profiling host agent _ and receive profiling data (in Kibana^®) a few minutes later. _ Get started now __ . _

Unwinder eBPF programs (bytecode) are sent to the kernel.
The kernel verifies that the BPF program is safe. If accepted, the program is attached to the probes and executed when the event occurs.
The eBPF programs pass the collected data to userspace via maps.
The agent reads the collected data from maps. The data transferred from the agent to the maps are process-specific and interpreter-specific meta-information that help the eBPF unwinder programs perform unwinding.
Stacktraces, metrics, and metadata are pushed to the Elastic Stack.
Visualize data as flamegraphs, stacktraces, and functions via Kibana.

While stacktraces are the key ingredient for most profiling tools, interpreting them can be tricky. Let's take a look at a simple example to make things a bit easier. The table below shows a group of stacktraces from a Java application and assigns each a percentage to indicate its share of CPU time consumption.

Table 1: Grouped Stacktraces with CPU Time Percentage

Percentage	Function Calls
60%	startApp -> authenticateUser -> processTransaction
20%	startApp -> loadAccountDetails -> fetchRecentTransactions
10%	startApp -> authenticateUser -> processTransaction -> verifyFunds
2%	startApp -> authenticateUser -> processTransaction ->libjvm.so
1%	startApp -> authenticateUser -> processTransaction ->libjvm.so ->vmlinux: asm_common_interrupt ->vmlinux: asm_sysvec_apic_timer_interrupt

The percentages above represent the relative frequency of each specific stacktrace compared to the total number of stacktraces collected over the observation period, not actual CPU usage percentages. Also, the libjvm.so and kernel frames (vmlinux:*) in the example are commonly observed with whole-system profilers like Elastic Universal Profiling.

Also, we can see that 60% of the time is spent in the sequence startApp; authenticateUser; processTransaction. An additional 10% of the processing time is allocated to verifyFunds, a function invoked by processTransaction. Given these observations, it becomes evident that optimization initiatives would yield the most impact if centered on the processTransaction function, as it is one of the most expensive functions. However, real-world stacktraces can be far more intricate than this example. So how do we make sense of them quickly? The answer to this problem resulted in the creation of flamegraphs.

Flamegraphs: A visualization of stacktraces

While the above example may appear straightforward, it scarcely reflects the complexities encountered when aggregating multiple stacktraces across a fleet of machines on a continuous basis. The depth of the stack traces and the numerous branching paths can make it increasingly difficult to pinpoint where code is consuming resources. This is where flamegraphs, a concept popularized by Brendan Gregg, come into play.

A flamegraph is a visual interpretation of stacktraces, designed to quickly and accurately identify the functions that are consuming the most resources. Each function is represented by a rectangle, where the width of the rectangle represents the amount of time spent in the function, and the number of stacked rectangles represents the stack depth. The stack depth is the number of functions that were called to reach the current function.

Elastic Universal Profiling uses icicle graphs, which is an inverted variant of the standard flamegraph. In an icicle graph, the root function is at the top, and its child functions are shown below their parents –– making it easier to see the hierarchy of functions and how they are related to each other.

In most flamegraphs, the y-axis represents stack depth, but there is no standardization for the x-axis. Some profiling tools use the x-axis to indicate the passage of time; in these instances, the graph is more accurately termed a flame chart. Others sort the x-axis alphabetically. Universal Profiling sorts functions on the x-axis based on relative CPU percentage utilization, starting with the function that consumes the most CPU time on the left, as shown in the example icicle graph below.

Debugging and optimizing performance issues: Stacktraces, TopN functions, flamegraphs

SREs and SWEs can use Universal Profiling for troubleshooting, debugging, and performance optimization. It builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions , reduce wasteful computations , and debug complex issues faster.

To this end, Universal Profiling offers three main visualizations: Stacktraces, TopN Functions, and flamegraphs.

Stacktrace view

The stacktraces view shows grouped stacktrace graphs by threads, hosts, Kubernetes deployments, and containers. It can be used to detect unexpected CPU spikes across threads and drill down into a smaller time range to investigate further with a flamegraph. Refer to the documentation for details.

TopN functions view

Universal Profiling's topN functions view shows the most frequently sampled functions, broken down by CPU time, annualized CO₂, and annualized cost estimates. You can use this view to identify the most expensive functions across your entire fleet, and then apply filters to focus on specific components for a more detailed analysis. Clicking on a function name will redirect you to the flamegraph, enabling you to examine the call hierarchy.

Flamegraphs view

The flamegraph page is where you will most likely spend the most time, especially when debugging and optimizing. We recommend that you use the guide below to identify performance bottlenecks and optimization opportunities with flamegraphs. The three key elements-conditions to look for are width , hierarchy , and height.

Width matters: In icicle graphs, wider rectangles signify functions taking up more CPU time. Always read the graph from left to right and note the widest rectangles, as these are the prime hot spots.

Hierarchy matters: Navigate the graph's stack to understand function relationships. This vertical examination will help you identify whether one or multiple functions are responsible for performance bottlenecks. This could also uncover opportunities for code improvements, such as swapping an inefficient library or avoiding unnecessary I/O operations.

Height matters: Elevated or tall stacks in the graph usually point to deep call hierarchies. These can be an indicator of complex and less efficient code structures that may require attention.

Also, when navigating a flamegraph, you may want to look for specific function names to validate your assumptions on their presence: in the Universal Profiling flamegraphs view, there is a “Search” bar at the bottom left corner of the view. You can input a regex, and the match will be highlighted in the flamegraph; by clicking on the left and right arrows next to the Search bar, you can move across the occurrences on the flamegraph and spot callers and callee of the matched function.

In summary,

Scan horizontally from left to right, focusing on width for CPU-intensive functions.
Examine vertically to examine the stack and spot bottlenecks.
Look for towering stacks to identify potential complexities in the code.

To recap, use topN functions to generate optimization hypotheses and validate them with stacktraces and/or flamegraphs. Use stacktraces to monitor CPU utilization trends and to delve into the finer details. Use flamegraphs to quickly debug and optimize your code, using width, hierarchy, and height as guides.

_ Identify. Optimize. Measure. Repeat! _

Measure the impact of your change

For the very first time in history, developers can now measure the performance (gained or lost), cloud cost, and carbon footprint impact of every deployed change.

Once you have identified a performance issue and applied fixes or optimizations to your code, it is essential to measure the impact of your changes. The differential topN functions and differential flamegraph pages are invaluable for this, as they can help you identify regressions and measure your change impact not only in terms of performance but also in terms of carbon emissions and cost savings.

The Diff column indicates a change in the function’s rank.

You may need to use tags or other metadata, such as container and deployment name, in combination with time ranges to differentiate between the optimized and non-optimized changes.

Universal Profiling: The key to optimizing application resources

Computational efficiency is no longer just a nice-to-have, but a must-have from both a financial and environmental sustainability perspective. Elastic Universal Profiling provides unprecedented visibility into the runtime behavior of all your applications, so you can identify and optimize the most resource-intensive areas of your code. The result is not merely better-performing software but also reduced resource consumption, lower cloud costs, and a reduction in carbon footprint. Optimizing your code with Universal Profiling is not only the right thing to do for your business, it’s the right thing to do for our world.

Get started with Elastic Universal Profiling today.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

Unlocking whole- system visibility with Elastic Universal Profiling™