Continuous profiling: The key to more efficient and cost-effective applications


Recently, Elastic Universal Profiling™ became generally available. It is the part of our Observability solution that lets users do whole-system, continuous profiling in production environments. If you're not familiar with continuous profiling, you are probably wondering what Universal Profiling is and why you should care. That's what we will address in this post.

Efficiency is important (again)

Before we jump into continuous profiling, let's start with the "Why should I care?" question. To do that, I'd like to talk a bit about efficiency and some large-scale trends happening in our industry that are making efficiency, specifically computational efficiency, important again. I say again because in the past, when memory and storage on a computer were very limited and you had to worry about every byte of code, efficiency was an important aspect of developing software.

The end of Moore’s Law

First, the Moore's Law era is drawing to a close. This was inevitable simply due to physical limits on how small you can make transistors and the connections between them. For a long time, software developers had the luxury of not worrying about complexity and efficiency because the next generation of hardware would mitigate any negative cost or performance impact.

If you can't rely on an endless progression of ever faster hardware, you should be interested in computational efficiency.

The move to Software-as-a-Service

Another trend to consider is the shift from software vendors that sold customers software to run themselves to Software-as-a-Service businesses. A traditional software vendor didn't have to worry too much about the efficiency of their code. That issue largely fell to the customer to address; a new software version might dictate a hardware refresh to the latest and most performant. For a SaaS business, inefficient software usually degrades the customer’s experience and it certainly impacts the bottom line. 

If you are a SaaS business in a competitive environment, you should be interested in computational efficiency.

Cloud migration

Next is the ongoing migration to cloud computing. One of the benefits of cloud computing is the ease of scaling, both hardware and software. In the cloud, we are not constrained by the limits of our data centers or the next hardware purchase. Instead, we simply spin up more cloud instances to mitigate performance problems. In addition to infrastructure scalability, microservices architectures, containerization, and the rise of Kubernetes and similar orchestration tools mean that scaling services is simpler than ever. It's not uncommon to have thousands of instances of a service running in a cloud environment. This ease of scaling accounts for another trend, namely that many businesses are dealing with skyrocketing cloud computing costs.

If you are a business with ever increasing cloud costs, you should be interested in computational efficiency.

Our changing climate

Lastly, if none of those reasons pique your interest, let's consider a global problem that all of us should have in mind — namely, climate change. There are many things that need to be addressed to tackle climate change, but with our dependence on software in every part of our society, computational efficiency is certainly something we should be thinking about. 

Thomas Dullien, distinguished engineer at Elastic and one of the founders of optimyze, points out that if you can save 20% of the CPU load on 800 servers, assuming 300W power consumption for each server, that code change is worth 160 metric tons of CO2 saved per year. That may seem like a drop in the bucket, but if all businesses focus more on computational efficiency, it will make an impact. Also, let's not forget the financial benefits: those 160 metric tons of CO2 savings also represent a significant annual cost savings.
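That back-of-envelope math is easy to reproduce. The one figure the example doesn't state is the grid carbon intensity; the value below (~0.38 kg CO2 per kWh, roughly a global-average figure) is an assumption on our part, chosen to show how the pieces fit together:

```python
# Back-of-envelope savings from a 20% efficiency gain on 800 servers.
# The grid carbon intensity is an assumed global-average figure,
# not a number from the original example.
SERVERS = 800
WATTS_PER_SERVER = 300
EFFICIENCY_GAIN = 0.20
HOURS_PER_YEAR = 24 * 365
KG_CO2_PER_KWH = 0.38  # assumed grid carbon intensity

saved_kw = SERVERS * WATTS_PER_SERVER * EFFICIENCY_GAIN / 1000   # 48 kW
saved_kwh_per_year = saved_kw * HOURS_PER_YEAR                   # 420,480 kWh
saved_tonnes_co2 = saved_kwh_per_year * KG_CO2_PER_KWH / 1000

print(f"{saved_kwh_per_year:,.0f} kWh saved per year")
print(f"{saved_tonnes_co2:.0f} metric tons of CO2 saved per year")
```

Multiply those kilowatt-hours by your local electricity price and the dollar savings fall out of the same calculation.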

If you live on planet Earth, you should be interested in computational efficiency.

Performance engineering

Whose job is it to worry about computational efficiency? Application developers usually pay at least some attention to efficiency as they develop their code. Profiling is a common approach for a developer to understand the performance of their code, and there is an entire portfolio of profiling tools available. Frequently, however, schedule pressures trump time spent on performance analysis and computational efficiency. In addition, performance problems may not become apparent until an application is running at scale in production and interacting (and competing) with everything else in that environment. Many profiling tools are not well suited to production use because they require code instrumentation and recompilation and add significant overhead.

When inefficient code makes it into production and begins to cause performance problems, the next line of defense is the Operations or SRE team. Their mission is to keep everything humming, and performance problems will certainly draw attention. Observability tools such as APM can shed light on these types of issues and lead the team to a specific application or service, but these tools offer only limited visibility into the full system. Without a profiling solution in the production environment, third-party libraries and operating system kernel functions remain hidden.

So, what can these teams do when there is a need to investigate a performance problem in production? That's where continuous profiling comes into the picture.

Continuous profiling

Continuous profiling is not a new idea. Google published a paper about it in 2010 and began implementing continuous profiling in its environments around that time. Facebook and Netflix followed suit not long afterward. 

Typically, continuous profiling tools have been the domain of dedicated performance engineering or operating system engineering teams, which are usually only found at extremely large scale enterprises like the ones mentioned above. The key idea is to run profiling on every server, all of the time. That way, when your observability tools point you to a specific part of an application, but you need a more detailed view into exactly where that application is consuming CPU resources, the profiling data will be there, ready to use. 

Another benefit of continuous profiling is that it provides a view of CPU intensive software across your entire environment — whether that is a very CPU intensive function or the aggregate of a relatively small function that is run thousands of times a second in your environment.
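To make that aggregate view concrete: a sampling profiler periodically records which function is on-CPU, and over many samples the counts converge on the true distribution of CPU time. A toy sketch of the aggregation step, using simulated samples and hypothetical function names:

```python
from collections import Counter

def hotspots(samples):
    """Aggregate on-CPU samples into a ranked hotspot list.

    Each sample is the name of the function that was executing
    when the profiler took the sample.
    """
    counts = Counter(samples)
    total = len(samples)
    return [(fn, n / total) for fn, n in counts.most_common()]

# Simulated samples: checksum() is individually cheap, but it runs so
# often across the fleet that in aggregate it dominates CPU time.
samples = ["checksum"] * 700 + ["parse_request"] * 200 + ["render"] * 100
for fn, share in hotspots(samples):
    print(f"{fn:>15}: {share:5.1%} of CPU samples")
```

A real continuous profiler does this across every process on every host, but the principle is the same: cheap periodic samples, aggregated into a fleet-wide picture.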

While profiling tools are not new, most of them have significant gaps. Let's look at two of the biggest.

  • Limited visibility. Modern distributed applications are composed of a complex mix of building blocks, including custom software functions, third-party software libraries, networking software, operating system services, and more and more often, orchestration software such as Kubernetes. To fully understand what is happening in an application, you need visibility into each piece. However, even if a developer has the ability to profile their own code, everything else remains invisible. To make matters worse, most profiling tools require instrumenting the code, which adds overhead, so even your developers' own code typically goes unprofiled in production.
  • Missing symbols in production. All of these code building blocks typically have descriptive names (some more intuitive than others) so that developers can understand and make sense of them. In a running program, these descriptive names are usually referred to as symbols. For a human being to make sense of the execution of a running application, these names are very important. Unfortunately, software running in production almost always has these human-readable symbols stripped away for space efficiency, since they are not needed by the CPU executing the software. Without the symbols, it is much more difficult to understand the full picture of what's happening in the application. To illustrate this, think of the last time you were in an SMS chat on your mobile device and you only had some of the people in the chat group in your address book while the rest simply appeared as phone numbers — it's very hard to tell who is saying what.

Elastic Universal Profiling: Continuous profiling for all

Our goal is to allow any business, large or small, to make computational efficiency a core consideration for all of the software that they run. Universal Profiling imposes very low overhead on your servers so it can be used in production and it provides visibility to everything running on every machine. It opens up the possibility of seeing the financial unit cost and CO2 impact of every line of code running on every system in your business. How do we do that?

Whole-system visibility, made simple

Universal Profiling is based on eBPF, which means that it imposes very low overhead (our goal is less than 1% CPU and less than 250MB of RAM) on your servers because it doesn't require code instrumentation. That low overhead means it can be run continuously, on every server, even in production. 

eBPF also lets us deploy a single profiler agent on a host and peek inside the operating system to see every line of code executing on the CPU. That means we have visibility into all of those application building blocks described above — the operating system itself as well as containerization and orchestration frameworks without complex configuration.

All the symbols

A key part of Universal Profiling is our hosted symbolization service. This means that symbols are not required on your servers, which not only eliminates the need to recompile software with symbols, but also helps reduce overhead by allowing the Universal Profiling agent to send very sparse data back to the Elasticsearch platform, where it is enriched with all of the missing symbols. Since we maintain a repository of symbols for the most popular third-party software libraries and Linux operating systems, the Universal Profiling UI can show you all the symbols.
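The division of labor can be sketched as follows: the agent ships only raw instruction addresses plus an identifier for the executable, and the backend resolves them against its symbol repository. Everything in this sketch (the repository contents, the build IDs, the function names) is hypothetical, purely to illustrate the idea:

```python
import bisect

# Hypothetical backend symbol repository:
# build ID -> sorted list of (function start address, name).
SYMBOL_REPO = {
    "libssl-buildid-abc123": [
        (0x1000, "SSL_read"),
        (0x1800, "SSL_write"),
        (0x2400, "ssl3_record_decrypt"),
    ],
}

def symbolize(build_id, address):
    """Map a raw address to the enclosing function name, server-side."""
    table = SYMBOL_REPO.get(build_id)
    if not table:
        return f"<unknown {hex(address)}>"
    starts = [start for start, _ in table]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return f"<unknown {hex(address)}>"
    return table[i][1]

# The agent only ever sends (build_id, address) pairs — no symbols
# need to live on the production host.
print(symbolize("libssl-buildid-abc123", 0x1850))  # SSL_write
```

Because the lookup happens in the backend, production binaries stay stripped and the per-sample payload stays tiny.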

Your favorite language, and then some

Universal Profiling is multilanguage. We support all of today's popular programming languages, including Python, Go, Java (and any other JVM-based language), Ruby, Node.js, PHP, Perl, and of course, C and C++, which is critical since these languages still underlie so many third-party libraries used by the other languages. In addition, we support profiling native code, a.k.a. machine language.

Speaking of native code, every profiler is tied to specific CPU architectures, and most tools today support only the Intel x86 architecture. Universal Profiling supports both x86 and ARM-based processors. With the expanding use of ARM-based servers, especially in cloud environments, Universal Profiling future-proofs your continuous profiling.

A flamegraph showing traces across Python, Native, Kernel, and Java code

Many businesses today employ polyglot programming — that is, they use multiple languages to build an application — and Universal Profiling is the only profiler available that can build a holistic view across all of these languages. This will help you look for hotspots in the environment, leading you to "unknown unknowns" that warrant deeper performance analysis. That might be a simple interest rate calculation that should be efficient and lightweight but, surprisingly, isn't. Or perhaps it is a service that is reused much more frequently than originally expected, resulting in thousands of instances running across your environment every second, making it a prime target for efficiency improvement.
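A flamegraph like the one above is built by folding stack samples: identical call paths are merged and counted, so the widest frames correspond to the code that was on-CPU most often, regardless of which language each frame came from. A minimal sketch of the folding step, with hypothetical mixed-language stacks:

```python
from collections import Counter

def fold(stack_samples):
    """Collapse raw stack samples into flamegraph 'folded' lines.

    Each sample is a list of frames from outermost caller down to the
    function that was on-CPU; identical paths are merged and counted.
    """
    return Counter(";".join(stack) for stack in stack_samples)

# Hypothetical samples crossing Python, native, and kernel frames.
samples = [
    ["main.py:handle", "libcrypto:sha256", "kernel:copy_user"],
    ["main.py:handle", "libcrypto:sha256", "kernel:copy_user"],
    ["main.py:handle", "Java:RateCalc.compute"],
]
for path, count in fold(samples).most_common():
    print(count, path)
```

The "count path" lines are exactly the folded-stack format that flamegraph tooling consumes; the profiler's job is to capture those stacks cheaply and across language boundaries.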

Visualize your impact

Elastic Universal Profiling has an intuitive UI that immediately shows you the impact of any given function, including the time it spends executing on the CPU and how much that costs both in dollars and in carbon emissions.

Annualized dollar cost and CO2 emissions for any function

Finally, with the level of software complexity in most production environments, there's a good chance that making a code change will have unanticipated effects across the environment. That code change may be due to a new feature being rolled out or a change to improve efficiency. In either case, a differential view, before and after the change, will help you understand the impact.

Performance, CO2, and cost improvements of a more efficient hashing function

Let's recap

Computational efficiency is an important topic, both from the perspective of the ultra-competitive business climate we all work in and from living through the challenges of our planet's changing climate. Improving efficiency can be a challenging endeavor, but we can't even begin to attempt to make improvements without knowing where to focus our efforts. Elastic Universal Profiling is here to provide every business with visibility into computational efficiency.

How will you use Elastic Universal Profiling in your business?  

  • If you are an application developer or part of the site reliability team, Universal Profiling will provide you with unprecedented visibility into your applications that will not only help you troubleshoot performance problems in production, but also understand the impact of new features and deliver an optimal user experience.
  • If you are involved in cloud and infrastructure financial management and capacity planning, Universal Profiling will provide you with unprecedented visibility into the unit cost of every line of code that your business runs.
  • If you are involved in your business’s ESG initiative, Universal Profiling will provide you with unprecedented visibility into your CO2 emissions and open up new avenues for reducing your carbon footprint.

These are just a few examples. For more ideas, read how AppOmni benefits from Elastic Universal Profiling.

You can get started with Elastic Universal Profiling right now!

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.