The one certainty you will find in IT, developer, and SRE roles is that things always change! One hot topic in DevOps communities is observability. It's a long word, and you may be wondering what it really means and how you can add it to your skillset. Here's a quick primer to get you going on your path to observability.
Where did the term “observability” come from?
The concept of observability dates back to the mid-20th century, when it was originally used in control theory to describe how the internal states of a system can be inferred from knowledge of the system's external outputs. Rudolf E. Kalman, a Hungarian-American electrical engineer and mathematician, coined the term in the 1960s in several seminal papers on dynamical and observable systems.
While the term is still used in the mathematical context, today it is commonly used in the context of infrastructure, service, and modern software application stacks.
What is observability?
While observability is a term used in multiple disciplines, in IT operations, observability brings insights from your infrastructure and application telemetry data. There are two parts to observability that can be implemented in your organization: producing the telemetry data itself, and adopting a solution that turns that data into insight.
While software is often specified with functional requirements, there are often non-functional requirements as well, such as availability, scalability, and usability. Observability is a non-functional requirement that becomes increasingly necessary as organizations adopt cloud technologies and their runtime complexity grows. Telemetry data based on the combination of logs, metrics, and traces allows you to understand how your applications, services, and infrastructure are performing. To truly know what is going on in your application ecosystem, your entire application environment and its components should be "observable", as defined by the original concept sixty years ago.
Once telemetry data is being produced, having an observability solution that aggregates, correlates, and analyzes all the telemetry data from your applications and infrastructure will bring you full-stack observability. An observability platform is the solution you use to evaluate and understand the end to end performance of your environment. But, without telemetry data from your applications, services, and infrastructure, there is no observability. An observability solution gives you visibility both in real-time and from a historical perspective.
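To make the "producing telemetry data" half concrete, here is a minimal sketch of what a structured, correlatable log event might look like, using only Python's standard library. The service name, field names, and the idea of attaching a shared trace ID are illustrative assumptions, not a specific vendor's format.

```python
import json
import logging
import time
import uuid

# Hypothetical sketch: emit a structured log line carrying the contextual
# metadata (service name, trace ID) an observability platform would use
# to correlate this event with related metrics and traces.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout-service")  # assumed service name

def emit_event(name, **fields):
    record = {
        "timestamp": time.time(),
        "service": "checkout-service",
        "trace_id": uuid.uuid4().hex,  # shared ID ties this log to a trace
        "event": name,
        **fields,
    }
    logger.info(json.dumps(record))
    return record

event = emit_event("order_placed", currency="GBP", latency_ms=182)
```

Because every event is machine-parseable JSON with consistent keys, an aggregation layer can filter, join, and group these records rather than grepping free-form text.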
Why is observability important for modern DevOps?
When data centers and monolithic applications roamed the Earth, production changes were infrequent and thoroughly planned. External dependencies were documented, and overall application health could be determined by monitoring a few well-known metrics. As companies increasingly digitize and adopt cloud technologies like Kubernetes, they dramatically increase their runtime complexity, making observability a critical DevOps initiative.
With the distributed nature of modern applications, no single team or individual has a complete picture of all the dependencies. Telemetry data (metrics, logs, and traces) is often siloed in different tools. Developer and operations teams spend too much time triaging problems due to swivel-chair investigations, resulting in a higher mean time to detection (MTTD) and mean time to resolution (MTTR). To address this increasing application complexity, you'll need more data to truly understand your environment and your users' experience. Observability addresses this complexity by instituting that mindset for both infrastructure and application telemetry, along with platforms that can consolidate, correlate, and analyze the performance of today's modern, distributed architectures for cloud-native applications.
What questions can observability answer for developers, DevOps, ITOps, and SRE teams?
The benefits of observability vary based on your role: a developer will likely look at it differently than an operations engineer, and differently again than a business owner. While not an exhaustive list, here are some example IT operations, developer, and DevOps questions that you would be able to answer with a unified observability solution.
- Which of my services is unhealthy?
- What is our average response time for a given operation?
- What is causing some users to experience longer load times than others?
- How has the latest change created latency or impacted application performance?
- How am I doing compared to my SLO for this service?
- Which service should I try to tune first?
- What is the overall user experience for my application?
- How is the business performing? For example, is there too much shopping cart abandonment in my e-commerce application?
How is observability different from monitoring?
Monitoring allows you to answer simple questions about your application ecosystem, such as "which of my servers are overutilized?" or "which applications have high response times or latency?" It is based on the assumption that dependencies are well understood and that overall health can be determined by monitoring the variance of a few known metrics. These measurements for servers, VMs, or application services often miss the context needed to show how a specific component interacts with upstream services and how it impacts the downstream user experience. Monitoring focuses on answering the known unknowns.
An observability solution can help you extract even more information from your telemetry data (metrics, logs, traces), allowing you not only to see the internal state of your distributed systems, but also to uncover the unknown unknowns: the things that may be going wrong that you didn't even know to look for. By adding metadata to your metrics, logs, and traces, you can understand the context of those data points and issues relative to an application or user transaction. As an example, imagine an order processing system: transactions are failing, but not every transaction, which makes debugging extremely challenging. If you are monitoring the right metrics, they might show that something is wrong and even alert you. However, a notification is only possible if the metrics are being monitored and violate a threshold, which is unlikely for intermittent problems. In such cases, when users complain, there is often little investigation that can be done. Since observability has all the telemetry data, operations teams can ask questions, correlate issues, and find that, globally, only people checking out in British pounds are having trouble. This operational insight then leads directly to identifying issues with the currency API.
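The currency scenario above can be approximated with a toy query: given transaction records tagged with the kind of metadata an observability platform attaches, group the failures by each dimension and see which value they share. The records and field names here are invented purely for illustration.

```python
from collections import Counter

# Toy transaction records; "currency" and "region" stand in for the
# high-cardinality metadata an observability platform would capture.
transactions = [
    {"status": "ok",     "currency": "USD", "region": "us-east"},
    {"status": "failed", "currency": "GBP", "region": "eu-west"},
    {"status": "ok",     "currency": "EUR", "region": "eu-west"},
    {"status": "failed", "currency": "GBP", "region": "us-east"},
    {"status": "ok",     "currency": "GBP", "region": "eu-west"},
]

def failure_breakdown(records, dimension):
    """Count failed transactions by one metadata dimension."""
    return Counter(r[dimension] for r in records if r["status"] == "failed")

print(failure_breakdown(transactions, "currency"))  # every failure is GBP
print(failure_breakdown(transactions, "region"))    # failures span regions
```

Note that not every GBP transaction fails (the problem is intermittent), yet every failure is GBP, while the "region" dimension shows no pattern. That is exactly the kind of correlation that points at the currency API.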
Investing in an observability initiative allows you to intelligently troubleshoot, diagnose, and correlate application issues in an end-to-end manner, no matter how complex your application environment. And even take a look back at how an application performed in the past versus where it is today. The overall goal? The ability to analyze all your telemetry data for interactive investigations and actionable insights.
How do logs and metrics work within observability?
Capturing and storing logs and metrics is often the first step towards observability and also for basic monitoring. Metrics and logs allow you to measure the performance of infrastructure (CPU, memory, I/O metrics) and application technologies in your environment, in near real-time for interactive investigation.
All of your logging and metrics data is also observability data. To get to observability, you will need to store it at the right granularity, along with additional contextual information, in a common data store with a common schema. It is important to support high cardinality and high dimensionality for the data to be meaningful for observability. You absolutely need logging and metrics data for any observability initiative, but they alone are not sufficient for today's modern, cloud-native applications, whether on AWS, Azure, or Google Cloud.
What do application performance monitoring (APM) and distributed tracing add to observability?
While logs and metrics provide some visibility, adding distributed traces as part of your observability initiative expands your diagnostic toolbox. Application performance monitoring via distributed traces helps you stitch your transactions together into a logical collection (using metadata) representing a user's journey through your services or application. This is a great start to ensuring a predictable customer experience.
Through instrumentation of your applications and services, you can now tag and track your telemetry data with custom metadata to assist debugging and troubleshooting. Distributed traces and APM help you understand where there may be bottlenecks or performance issues within an application or, from the user's perspective, an online transaction. The addition of machine learning and AIOps can even surface anomalies and potential issues automatically, helping you truly optimize your applications.
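To show the core idea behind spans, parent-child relationships, and custom tags, here is a deliberately tiny, hand-rolled tracing sketch in standard-library Python. A real agent (for example, an OpenTelemetry SDK) would propagate context across processes and ship spans to a backend; this toy version only records them in a list.

```python
import time
import uuid
from contextlib import contextmanager

spans = []              # in a real system, finished spans ship to a backend
_current_parent = [None]  # toy stand-in for propagated trace context

@contextmanager
def span(name, **tags):
    """Toy tracing span: records timing, parentage, and custom tags."""
    s = {"id": uuid.uuid4().hex[:8], "name": name,
         "parent": _current_parent[0], "tags": tags}
    _current_parent[0] = s["id"]
    start = time.perf_counter()
    try:
        yield s
    finally:
        s["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_parent[0] = s["parent"]
        spans.append(s)

# One user transaction: a checkout that calls a downstream payment service.
with span("checkout", currency="GBP"):
    with span("payment-api"):   # child span for the downstream call
        time.sleep(0.01)
```

The custom tag (`currency="GBP"`) is exactly the kind of metadata that later lets you slice traces by business dimension during an investigation.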
While APM and distributed traces can help you within a transaction or application, observability broadens the scope further by giving you visibility across your ENTIRE application ecosystem. This helps you understand the impact across applications and services given the complexity of today's distributed software architectures and ephemeral cloud environments that rely on (potentially oversubscribed) shared resources (storage, compute, memory, microservices, serverless functions).
What kind of telemetry data is needed for an observability initiative?
For your observability solution to be effective, you need to collect as much data as possible with high dimensionality and cardinality — you can't answer the questions if you don't have the data dimensions. Logs, metrics, and traces are often referred to as the "pillars of observability". Tying this large collection of telemetry data together will be the metadata/dimensionality you add to make sense of it all. The metadata you add will be unique to your application, how it is architected, and how you want to analyze it.
Nearly as important as the types of telemetry data you will collect is how it is stored and how long it is stored. You can always remove unneeded data, but it’s much harder to add it back after the fact. That means your observability solution will need cost-effective storage for the large volume of telemetry data you generate, at the granularity you need, and the time-frame you need it for. You will also need to have the ability to access your data in the ways that you want for quick and flexible analysis and rich visualizations. Be cautious with observability data stores that don’t allow you full access and use of standardized or open source tools (example: OpenTelemetry) for future analysis.
What are unknown unknowns and why are they important within the field of observability?
While monitoring works for "known unknown" issues, often the hardest problems to solve are the transient, infrequent ones that you don't even know to look for. They might arise from changes in the environment, the application, or a combination of factors. These are the types of problems that solidify the need to capture anything and everything, data-wise, so that you can understand and resolve the "unknown unknowns".
An observability platform or solution with a unified and scalable data store allows you not only to determine what was happening at the time of a particular issue, but also helps you understand when unexpected events occur by putting different signals into context and at your fingertips when you need to explore further. This ability to look back historically at application and infrastructure performance is a key differentiator for an observability platform versus a monitoring solution.
How can observability help with productivity and eliminate tool silos within IT Operations and development teams?
Quite often, organizations start out with a need for log aggregation. They get that up and running, then decide that they should also start gathering metrics from their expanding infrastructure, so they add another tool. This process repeats again with APM. Before long, they've got multiple monitoring tools for their application ecosystem, along with already siloed data stores, with different groups potentially having their own tools.
While it's certainly possible to triage and resolve issues and errors with siloed tools, the swivel-chair approach to diagnosing issues (troubleshooting across multiple, single-purpose tools) makes it harder. You lose the correlated, contextual view that an observability solution delivers. Also remember that multiple monitoring tools multiply every aspect of the overhead as well: administration, training, and storage, to name a few.
Having all your telemetry data in a single data store within the right observability solution that scales and provides coverage across dev, stage, and production is ideal. With the right research and investigation, finding an observability solution that allows you to instrument and observe everything in your IT environment in a cost-effective manner is within your reach.
There are a lot of approaches to observability, and given the evolution and adoption of cloud and cloud-native technologies, we imagine observability will continue to evolve at a torrid pace. The ultimate goal? Full end-to-end observability across your application environments with all of your telemetry data. The ability to see across your infrastructure and applications, all the way to the end-user experience, from a single dashboard. And the ability to correlate and contextualize telemetry data to your specific needs at any point in time for root cause analysis and debugging efforts.
A comprehensive observability platform will allow you to monitor day-to-day application performance, troubleshoot known issues, optimize applications, and even uncover problems that you were totally unaware of as your environment evolves. Choosing the right observability platform will future-proof your capabilities and organizational knowledge for the long term as your DevOps teams continue to adopt newer technologies to accelerate innovation, improve digital experiences, and shorten time to market.