This blog was originally posted on tfir.io.
Kubernetes is a popular container orchestration system at the heart of the Cloud Native Computing Foundation projects. It automates the deployment, lifecycle, and operations of containers, containerized applications, and "pods," which are groups of one or more containers. The platform itself, along with each of these workloads, may generate event data. There are different kinds of data associated with these processes. Logs can range from simple "yep, got here" debug messages to detailed web server access logs which provide transaction information. Metrics, or time series data, are numeric values measured at a regular interval — for example, the number of instantaneous operations per second, cache hit rates, count of customers accessing your site, or basic things such as how much CPU or memory a container has been using in the past five seconds.
Observability takes these logs and metrics and makes them searchable, usually in a correlating data store, often combining them with application trace data. This trace data, or detailed application performance monitoring (APM) information, captures where applications or services are spending their time, how and what they are interacting with, and any errors that are encountered. Logs and metrics provide a black-box view of an application, while APM data shows what's going on inside applications.
Combined logs, metrics, and application trace data can help reduce mean time to detection and resolution of errors or incidents, but as application deployment models evolve (as with Kubernetes deployments) it becomes very important to understand where things are actually happening in a dynamic environment. That's where metadata comes in.
What exactly is metadata?
As defined by Webster, metadata is "data that provides information about other data." Sounds pretty straightforward, right? There are many common places you'll find metadata — the page hosting this blog post has metadata. It has SEO tags, hints to help different browsers properly format the page, and keywords to help describe the page. Similarly, one picture on your mobile device has a whole lot of metadata — here's just a snippet:
ExifTool Version Number         : 11.11
File Name                       : 60398459048__A20828DD-FAA4-4133-BA1F-059DEC9E7332.jpeg
Directory                       : .
File Size                       : 2.8 MB
File Modification Date/Time     : 2020:02:21 08:30:01-05:00
File Access Date/Time           : 2020:02:21 08:30:23-05:00
File Inode Change Date/Time     : 2020:02:21 08:30:22-05:00
MIME Type                       : image/jpeg
JFIF Version                    : 1.01
Acceleration Vector             : -0.03239090369 -0.9104139204 -0.3862404525
Exif Byte Order                 : Big-endian (Motorola, MM)
Make                            : Apple
Camera Model Name               : iPhone XS
Orientation                     : Rotate 90 CW
X Resolution                    : 72
Y Resolution                    : 72
Exif Image Width                : 4032
Exif Image Height               : 3024
"How does knowing the MIME type of an iPhone picture help me with my observability initiatives?" you might ask. It doesn't, because that's not what the image metadata is for — but it should give you an idea of what metadata brings to the table. The iPhone metadata enables you to filter on size, orientation, or how shaky you were when you took the picture (apparently I've got to work on that). Let's go through some things that will help you correlate and navigate your observability data, namely the logs, metrics, and application trace data from your environment.
Software and hardware deployment trends
Gone are the glory days of monolithic applications on single-purpose, bare-metal servers in a single data center. Sure, it's still common to see them running dedicated workloads, and there's nothing wrong with that; many large-scale applications and products demand as much compute as they can get. On the whole, however, industry trends with both software and hardware deployment models are moving towards microservices and containers.
This "shift-to-the-right" trend isn't a big bang — many shops will have multiple software deployment models working in parallel, along with multiple hardware patterns. Virtual machines or cloud instances run client/server or SOA applications, while containers run images for their microservices, orchestrated by Kubernetes or Docker. In many cases applications and services in one deployment model are leveraging services in another — that fancy new microservice might still be using a database hosted on bare metal.
The nature of these heterogeneous systems makes metadata even more important. As we shift towards containers, pods, and dynamically scheduled microservices, it becomes even more difficult to walk into the data center and point to a box and say "my application is on that one." It might be, but it might also be on those other four servers behind you.
That's where location metadata comes in. I'm not talking about latitude and longitude (though that could be handy), but rather an address scheme — something that would at least let you logically see where a piece of data (whether it be a log, metric, or application trace data) comes from.
Location, location, location
What you need will vary based on your setup but you should plan for the future. How much you can capture will depend on what a given job is running on — if it's a monolithic application on bare metal you're not going to capture Kubernetes pod details, which is fine. We basically want to provide breadcrumbs so we can see where things are running. We'll see why in a little bit.
With location we are shooting for a metadata hierarchy, down to the application level — where the job is physically running.
The data center metadata should include a unique identifier for each data center — city name, for example. This information can get a little fuzzy when talking about cloud providers, but there are some parallel analogies. In this scenario we can leverage the cloud provider and the region that we're running in, for example, GCP, europe-west1 and availability zone b.
If you have dedicated tiers in your data centers — perhaps specific hosts are reserved for prod or test, or divided across projects — make sure to add that to your metadata as well. It's kind of like a walled-off portion of your data center, or a data center within a data center.
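As a rough illustration of the data center level, the sketch below builds location metadata for the cloud example above (GCP, europe-west1, zone b) plus an optional tier. The function name `location_metadata` and the `dc.*` key names are assumptions made for this example, not an established schema.

```python
def location_metadata(provider, region, zone, tier=None):
    """Build data-center-level metadata as flat key/value pairs.

    All key names here are hypothetical; pick whatever naming your
    observability solution expects and use it consistently.
    """
    meta = {
        "dc.provider": provider,
        "dc.region": region,
        "dc.zone": zone,
        # A readable identifier derived from the three parts above.
        "dc.name": f"{provider}-{region}-{zone}",
    }
    # Only record a tier (prod, test, a project name, ...) when one exists.
    if tier is not None:
        meta["dc.tier"] = tier
    return meta


if __name__ == "__main__":
    print(location_metadata("GCP", "europe-west1", "b", tier="prod"))
```

Deriving `dc.name` from its parts is one possible convention; a city name or any other unique identifier would serve the same purpose.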
Whether we're running on bare metal, virtualized, or on a cloud instance, we've got some host information regularly available to us. Each host will have attributes for hostname, IP address(es), hardware model or instance type, configured RAM and storage, even operating system information. You might go so far as to include even more detailed information, such as where this host resides — the floor number in the data center, which rack, which row, which shelf even. It wouldn't be the first time that an entire rack was impacted by bad power or wiring!
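Much of this host information can be read with the Python standard library alone. The following is a minimal sketch, assuming hypothetical `host.*` key names; physical placement such as rack and row is not discoverable from the operating system, so it is passed in by hand here.

```python
import os
import platform
import socket


def host_metadata(rack=None, row=None):
    """Collect basic host attributes using only the standard library."""
    hostname = socket.gethostname()
    meta = {
        "host.name": hostname,
        "host.os.family": platform.system(),
        "host.os.release": platform.release(),
        "host.architecture": platform.machine(),
        "host.cpu_count": os.cpu_count(),
    }
    # Name resolution can fail on some machines, so treat the IP as optional.
    try:
        meta["host.ip"] = socket.gethostbyname(hostname)
    except OSError:
        pass
    # Physical placement has to come from inventory data (assumed inputs).
    if rack is not None:
        meta["host.rack"] = rack
    if row is not None:
        meta["host.row"] = row
    return meta


if __name__ == "__main__":
    for key, value in sorted(host_metadata(rack="12", row="C").items()):
        print(f"{key}: {value}")
```

Details such as configured RAM, storage, or instance type usually need platform-specific tooling or a cloud provider's metadata service, which this sketch leaves out.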
At this point we have enough information to identify where each thing is running — but only down to the host level, and each host might be running many different services or applications. When we start talking about the applications and services we need to add the corresponding level of metadata. This is where things can get a little if-then-else, so we'll keep it at a high level and work through the more complex scenario, with microservices orchestrated by Kubernetes. Applying this to bare-metal applications and virtualized environments should be pretty straightforward.
Containers in Kubernetes and Docker automatically have some level of metadata available, which should be included. At the very least we'd want to include the container and/or pod name, the image and version used as the base of the container, and the start time. Ideally we'll also include the network name and IP info, along with any network, memory, or storage quotas. Notice the parallels to the host information? Containers and virtual machines are basically hosts that run on another host, so it makes sense that we want to extract the same information.
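One common way for a process to learn about its own pod is through environment variables populated by the Kubernetes Downward API. The sketch below assumes the deployment manifest has been set up to export the variables listed in `ENV_TO_FIELD`; those variable names and the target field names are hypothetical and chosen only for this example.

```python
import os

# Hypothetical environment variable -> metadata field.
# None of these variables exist unless the pod spec defines them.
ENV_TO_FIELD = {
    "POD_NAME": "pod.name",
    "POD_NAMESPACE": "pod.namespace",
    "POD_IP": "pod.ip",
    "NODE_NAME": "host.name",  # ties the pod back to the host level
    "CONTAINER_IMAGE": "container.image",
}


def container_metadata(environ=None):
    """Return whichever container and pod fields are present in the environment."""
    if environ is None:
        environ = os.environ
    return {
        field: environ[var]
        for var, field in ENV_TO_FIELD.items()
        if var in environ
    }


if __name__ == "__main__":
    # Simulated environment, so the example also runs outside a cluster.
    fake_env = {"POD_NAME": "cart-7d9f", "NODE_NAME": "Host 1"}
    print(container_metadata(fake_env))
```

Recording the node name next to the pod name is what later lets you answer "which physical host is this pod on?" without a separate lookup.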
That said, when working with a virtualized environment we can draw the same analogies. A virtual host is going to have the same high-level details as a physical host: a name, an IP address, and memory and storage limits.
We have a bit of a dilemma at this point: we have some duplicate field names. It is important to remember that we want to maintain a hierarchy. The host metadata is higher in that hierarchy than a container or a virtual machine:
├── NYC DC 1
│   ├── Host 1
│   │   ├── vm 1
│   │   ├── vm 2
│   │   └── vm 3
│   └── Host 2
│       ├── vm 1
│       └── vm 2
└── NYC DC 2
    └── Host 1
        ├── vm 1
        ├── vm 2
        ├── vm 3
        └── vm 4
With this it's pretty obvious that we need to put values in different, predictable namespaces to make sure that these names don't collide. A great way to do this is to pass them along as key/value pairs; for example, the metadata for vm 2 on Host 1 in our NYC DC 1 might include:
dc.name: "NYC DC 1"
dc.floor: 2
<other dc fields>
host.name: "Host 1"
host.IP: …
host.available_memory_mb: 16384
vm.name: "vm 2"
vm.IP: …
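To show how namespacing keeps duplicate field names apart, here is a small sketch that flattens a nested hierarchy into dotted key/value pairs. The `flatten` helper is written for this example, and the sample values are taken from the text above.

```python
def flatten(nested, prefix=""):
    """Turn a nested dict into flat, dot-separated key/value pairs."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the namespace down with us.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


hierarchy = {
    "dc": {"name": "NYC DC 1", "floor": 2},
    "host": {"name": "Host 1", "available_memory_mb": 16384},
    "vm": {"name": "vm 2"},
}

flat = flatten(hierarchy)

if __name__ == "__main__":
    for key, value in flat.items():
        print(f"{key}: {value!r}")
```

Each level has a `name` field, but after flattening they live under `dc.name`, `host.name`, and `vm.name`, so nothing collides.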
Containers are a little different when talking about a hierarchy, since one Kubernetes cluster could span hosts. In this case, we care not only about the location information for a given pod or container, but also the corresponding orchestration metadata, as mentioned above. Next, we'll see how metadata helps us gain better observability into our applications.
Now that we know how to fully address things running in our systems we can start talking about gathering the actual data (recall that metadata describes other data). The "three pillars of observability" are logs, metrics, and application trace data (otherwise known as APM data), occasionally with uptime data shown as a fourth "pillar." When gathering logs and metrics we want to gather them at each layer of our ecosystem, as indicated in the diagram below, which includes information on what types of observability data should be gathered for each abstraction:
For example, we'll want to gather logs, metrics, and availability data from each host or network element in our data centers, but add APM for applications and services.
We enrich all of the above with the corresponding metadata for each tier to increase the visibility of our applications, infrastructure, and entire ecosystem. There are many ways to accomplish this: we can send the metadata with each event or trace, which allows for fast search and filtering, or we can store the metadata from the static portions of our ecosystem and cross-reference it later. While the latter method would save a bit of space, it runs the risk of the metadata being stale or out of date, especially in dynamic ecosystems.
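The two approaches can be contrasted in a few lines. This is only a sketch: the function names, the `STATIC_METADATA` table, and the sample event are all made up for illustration.

```python
# Hypothetical lookup table for the "store once, cross-reference later" approach.
STATIC_METADATA = {
    "Host 1": {"dc.name": "NYC DC 1", "dc.floor": 2},
}


def enrich_at_source(event, metadata):
    """Attach metadata when the event is emitted; the event's own fields win."""
    return {**metadata, **event}


def enrich_at_query(event, lookup):
    """Join metadata later, keyed on the host name carried by the event."""
    extra = lookup.get(event.get("host.name"), {})
    return {**extra, **event}


event = {"host.name": "Host 1", "message": "yep, got here"}

if __name__ == "__main__":
    print(enrich_at_source(event, {"dc.name": "NYC DC 1", "dc.floor": 2}))
    print(enrich_at_query(event, STATIC_METADATA))
```

Both calls produce the same enriched event here, but only as long as the lookup table is current. If a workload moves to another host or data center before the table is refreshed, the query-time join will quietly attach the wrong location.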
Putting the pieces together
With our metadata-enriched observability data we can slice and dice based on the facets we choose — not just limited to looking at specific APM data or logs. For example, we can break down current CPU utilization per host, per service:
Or look at the same parameter over time:
This allows us to pick out details that we want to pivot on, answering questions that we otherwise couldn't, such as:
- Is my US data center overutilized compared to my EMEA data center?
- Are any racks in my ecosystem encountering more errors than others?
- Can I repurpose some dev infrastructure for prod?
- Which physical hosts are the containers and pods for my ecommerce app running on?
- And, finally, why are my iPhone pictures always blurry?
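Questions like these boil down to grouping enriched events by a metadata field. As a minimal sketch, the helper below averages a metric per facet; the `average_by` function, the field names, and the sample numbers are all invented for illustration, and a real observability data store would do this aggregation for you.

```python
from collections import defaultdict
from statistics import mean


def average_by(events, facet, field):
    """Average a numeric field, grouped by the value of a metadata facet."""
    groups = defaultdict(list)
    for event in events:
        # Skip events that lack either the facet or the metric.
        if facet in event and field in event:
            groups[event[facet]].append(event[field])
    return {key: mean(values) for key, values in groups.items()}


# Made-up, already-enriched metric events (CPU as a whole-number percentage).
events = [
    {"dc.name": "US", "host.name": "Host 1", "cpu.pct": 80},
    {"dc.name": "US", "host.name": "Host 2", "cpu.pct": 60},
    {"dc.name": "EMEA", "host.name": "Host 3", "cpu.pct": 30},
]

if __name__ == "__main__":
    print(average_by(events, "dc.name", "cpu.pct"))
    print(average_by(events, "host.name", "cpu.pct"))
```

Swapping the facet from `dc.name` to `host.name` (or a rack or service field) pivots the same data without touching how it was collected, which is the payoff of sending the metadata along in the first place.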
Metadata adds new dimensions to your analysis, providing new ways to aggregate, slice, and dice your data to help answer your business and operations questions. Make sure that the solution you use for your observability initiatives is prepared to grow with you and takes searchable, navigable metadata into account, for example through a shared schema such as the Elastic Common Schema, so that your metadata ends up where you expect it.
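If you already have in-house field names, adopting a shared schema is mostly a renaming exercise. The sketch below maps a few hypothetical in-house keys onto Elastic Common Schema field names (`cloud.provider`, `cloud.region`, `cloud.availability_zone`, `host.name`, `host.ip`). The left-hand keys and the `to_ecs` helper are assumptions for this example, and any mapping should be checked against the ECS field reference for the version you use.

```python
# Hypothetical in-house key -> ECS field name.
CUSTOM_TO_ECS = {
    "dc.provider": "cloud.provider",
    "dc.region": "cloud.region",
    "dc.zone": "cloud.availability_zone",
    "host.name": "host.name",
    "host.IP": "host.ip",
}


def to_ecs(metadata, mapping=CUSTOM_TO_ECS):
    """Rename known keys; return unmapped ones separately for review."""
    renamed, unmapped = {}, {}
    for key, value in metadata.items():
        if key in mapping:
            renamed[mapping[key]] = value
        else:
            unmapped[key] = value
    return renamed, unmapped


if __name__ == "__main__":
    renamed, unmapped = to_ecs({"dc.region": "europe-west1", "dc.floor": 2})
    print(renamed)
    print(unmapped)
```

Returning the unmapped keys instead of dropping them makes it easier to spot fields, such as a data center floor or rack, that still need a home in your schema.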