Kubernetes Observability from alert to root cause: Dashboards, Alerts, and Anomaly Detection with Elastic

Tue, 21 Apr 2026 00:00:00 GMT

Kubernetes observability with Elastic: dashboards, alerts, and anomaly detection

Kubernetes observability with Elastic is built for the operator who gets paged at 3 AM. That operator is often in a terminal, a chat tool, or an IDE. They need an answer that is grounded in what is happening in the cluster right now.

The new Elastic Kubernetes integration includes dashboards with drilldowns, alert rule templates, and ML anomaly detection jobs. Elastic also offers Agentic Investigations, which drive investigations automatically.

This post covers the foundational observability components: dashboards, drilldowns, alert templates, and anomaly detection jobs. Part 2 will cover the agentic investigations, including workflows, agent skills, and MCP tools and views.

The new Kubernetes integration content in this post is generally available across Elastic Cloud Hosted, Serverless, and self-managed deployments.

Dashboards designed for drill-down, not just display

The new Kubernetes dashboards are organized around a three-tier design: a cluster Overview that surfaces what needs attention at a glance, object summary pages for clusters, nodes, namespaces, workloads, and pods, and object detail pages that give you the full picture for any single entity.

Every layer connects to the next. Click any entity in a summary table, then either apply it as a filter on the current view or open its dedicated detail page.

Here's what that looks like when something's actually wrong:

Following a restart cascade from overview to container

Overview: The Overview surfaces what needs attention across your cluster. You can see top pods by CPU, top namespaces by container restarts, and top nodes by memory utilization on one screen. When the "container restarts" panel starts climbing, you know where to look.

Namespaces Overview: Click into the flagged namespace with 1232 restarts and CPU limit utilization at 116%. The detail view plots CPU and memory against requests and limits over time. This shows both the size and duration of the overage.

Namespace Details: This view breaks the namespace down by pod. Click the pod driving the restarts.

Pod Details: The pod detail dashboard is organized into capacity, metrics, and containers sections. Container restarts are flagged in red at the top of the page. Most panels are metric-driven, and the dashboard also links to correlated pod logs in Discover.

It takes four clicks to move from the Cluster Overview to container logs that explain the failure. These dashboards are starting points for your team. You can copy and customize them with ES|QL visualizations.

Alert rules that fire on day one

The integration ships with pre-built alerting rule templates for states that are wrong by definition. No historical baseline or warmup period is required. Enable them during setup and they work immediately.

These rules do not ask, "Is this abnormal for this service?" They ask, "Is this a known bad state in Kubernetes?" A pod in CrashLoopBackOff is always a problem. A container killed by the kernel for exceeding its memory limit is always a problem.
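As a minimal sketch of what a "known bad state" looks like at the Kubernetes API level, the function below inspects one entry of a pod's status.containerStatuses list. The field names (state.waiting.reason, lastState.terminated.reason) come from the Kubernetes Pod API. The function name known_bad_state is hypothetical, and this is an illustration of the idea, not how the integration's rules are implemented.

```python
from typing import Optional


def known_bad_state(container_status: dict) -> Optional[str]:
    """Return the name of a known bad state for one containerStatuses entry.

    Hypothetical helper for illustration only. It checks two states that
    are wrong by definition and need no historical baseline.
    """
    # A container waiting with reason CrashLoopBackOff is being restarted
    # repeatedly by the kubelet with a growing back-off delay.
    waiting = (container_status.get("state") or {}).get("waiting") or {}
    if waiting.get("reason") == "CrashLoopBackOff":
        return "CrashLoopBackOff"

    # A previous termination with reason OOMKilled means the container
    # was killed for exceeding its memory limit.
    terminated = (container_status.get("lastState") or {}).get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        return "OOMKilled"

    return None


crashing = {"state": {"waiting": {"reason": "CrashLoopBackOff"}}, "restartCount": 14}
healthy = {"state": {"running": {}}, "lastState": {}, "restartCount": 0}
print(known_bad_state(crashing), known_bad_state(healthy))
```

Neither check depends on what is normal for the workload, which is why rules of this kind can be enabled on day one.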

Like the Kubernetes dashboards, these alerts are built on ES|QL queries. You can see that in the CrashLoopBackOff definition below. If you are new to ES|QL, you can start with the ES|QL docs.

The alert templates cover:

CrashLoopBackOff detection - Fires when a pod's restart count exceeds a configurable threshold within a rolling window. The default catches a real restart cycle without triggering on routine restarts during a rolling deployment.
Container OOMKilled - Surfaces kernel-level container terminations due to memory limits. These events are easy to miss in dashboards and often precede wider failures. This rule fires on any occurrence.
Deployment below desired replicas - Fires when a deployment runs fewer replicas than declared for longer than a grace period. This catches scaling failures and partially failed rollouts.
Pod stuck in Pending - Fires when a pod cannot be scheduled past a configurable time threshold. This surfaces node capacity problems, missing resources, and affinity failures before availability drops.
Node disk pressure - Fires immediately when the Kubernetes DiskPressure node condition is True. A node condition is a direct state signal, not a statistical threshold.
Persistent volume near capacity - Alerts when storage utilization crosses a configurable threshold before writes start failing.
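The rolling-window logic behind the CrashLoopBackOff template can be sketched in a few lines. This is an assumption-laden illustration, not the shipped ES|QL: the function names, the 10-minute window, and the threshold of 3 are all hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def restarts_in_window(samples, now, window):
    """Count restarts observed inside the rolling window ending at `now`.

    samples: list of (timestamp, cumulative_restart_count) tuples.
    Assumes the counter does not reset inside the window.
    """
    recent = [count for ts, count in samples if now - window <= ts <= now]
    if len(recent) < 2:
        return 0
    return max(recent) - min(recent)


def crashloop_fires(samples, now, window=timedelta(minutes=10), threshold=3):
    """Fire when the restart increase in the window exceeds the threshold.

    The default window and threshold are hypothetical, not the template's.
    """
    return restarts_in_window(samples, now, window) > threshold


now = datetime(2026, 4, 21, 3, 0)

# A real restart cycle: the counter climbs from 1 to 6 inside the window.
crash_loop = [(now - timedelta(minutes=m), c)
              for m, c in [(12, 0), (9, 1), (6, 3), (3, 5), (0, 6)]]

# A rolling deployment: a single restart inside the window.
rollout = [(now - timedelta(minutes=9), 2), (now, 3)]

print(crashloop_fires(crash_loop, now), crashloop_fires(rollout, now))
```

The threshold is what separates a restart cycle from the one or two restarts a routine rollout can produce.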

Each template is parameterized. Adjust thresholds in the ES|QL query to match your environment. Connect notifications to PagerDuty, Slack, or another destination in your runbook.
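To show what a tunable parameter does in practice, here is a rough sketch of the grace-period idea behind the "Deployment below desired replicas" template. The function name, the sample layout, and the 5-minute grace period are hypothetical; the actual templates express this in ES|QL.

```python
from datetime import datetime, timedelta


def below_desired_longer_than(samples, grace):
    """Return True when the deployment has been short of replicas for `grace`.

    samples: list of (timestamp, desired, available), sorted oldest first.
    Only the most recent unbroken run of below-desired samples counts.
    """
    run_start = None
    for ts, desired, available in samples:
        if available < desired:
            if run_start is None:
                run_start = ts  # first sample of a new below-desired run
        else:
            run_start = None  # deployment recovered, so reset the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= grace


t0 = datetime(2026, 4, 21, 3, 0)
minute = timedelta(minutes=1)

# Brief dip during a rollout that recovers within two minutes.
brief_dip = [(t0, 3, 3), (t0 + minute, 3, 2), (t0 + 2 * minute, 3, 3)]

# Partially failed rollout: one replica never comes back.
stuck = [(t0, 3, 3)] + [(t0 + i * minute, 3, 2) for i in range(1, 8)]

grace = timedelta(minutes=5)  # hypothetical grace period
print(below_desired_longer_than(brief_dip, grace),
      below_desired_longer_than(stuck, grace))
```

A shorter grace period pages sooner but is more likely to fire during normal rollouts, which is the trade-off to weigh when adjusting it.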

Anomaly detection jobs with ML baselines

Alert rules catch what is definitively wrong. ML anomaly detection catches patterns that often precede failures. If you are new to this area, see the Elastic anomaly detection overview.

A pod that always runs at 85% memory utilization might be healthy. A pod that grew from 40% to 85% over twelve hours is usually not healthy. A static threshold often catches this only after an OOM kill. The ML module should catch the trajectory earlier.
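The difference between a static threshold and a trajectory check can be illustrated with the two pods described above. This sketch uses a plain least-squares slope as a stand-in; Elastic's anomaly detection uses its own statistical modeling, and the 90% threshold and 1 point-per-hour slope limit are hypothetical values for the example.

```python
STATIC_THRESHOLD = 90.0  # hypothetical static alert threshold, in percent
MAX_SLOPE = 1.0          # hypothetical limit: percentage points per hour


def slope_per_hour(samples):
    """Ordinary least-squares slope of (hour, utilization_percent) samples."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def static_alert(samples):
    """Fire only when the latest sample crosses the fixed threshold."""
    return samples[-1][1] >= STATIC_THRESHOLD


def trajectory_alert(samples):
    """Fire when memory is growing faster than the allowed slope."""
    return slope_per_hour(samples) > MAX_SLOPE


# Pod that always runs at 85% memory utilization.
steady = [(h, 85.0) for h in range(13)]

# Pod that grows from 40% to 85% over twelve hours.
leaking = [(h, 40.0 + 3.75 * h) for h in range(13)]

for name, pod in [("steady", steady), ("leaking", leaking)]:
    print(name, static_alert(pod), trajectory_alert(pod))
```

Both pods end at 85%, so the static check treats them identically. Only the slope separates the steady pod from the one heading toward its limit.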

The integration ships with ML module configurations that learn workload baselines and flag meaningful deviations. These jobs need 24 to 48 hours of data before results become useful. Results become more reliable as jobs continue to run.

The included modules

1. Pod memory growth anomalies

What it learns: Per-pod memory consumption patterns over time.
What it flags: Growth trajectories that are inconsistent with baseline behavior, such as a slow leak that never crosses the hard limit.
Why ML (not alert rule): The alert rule catches the OOMKill after the fact. The ML job catches the trajectory that leads there.

2. Network I/O anomalies

What it learns: Per-pod network transmit and receive byte-rate patterns.
What it flags: Unusual spikes or drops relative to the pod baseline. A spike can indicate a runaway process or unexpected load. A drop can indicate a network partition that causes the pod to go idle.
Why ML (not alert rule): Normal network traffic varies by time of day and workload type. A batch job pod at high throughput during its normal window is expected. The same throughput outside that window can be anomalous.
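The time-of-day point can be made concrete with a simplified per-hour baseline. This is a sketch under stated assumptions: a mean and standard deviation per hour of day with a z-score cutoff, which is far cruder than what the ML job does. The function names, the cutoff of 3.0, the deviation floor, and the sample byte rates are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """Build {hour_of_day: (mean, stdev)} from (hour_of_day, rate) samples."""
    by_hour = defaultdict(list)
    for hour, rate in history:
        by_hour[hour].append(rate)
    return {hour: (mean(rates), pstdev(rates)) for hour, rates in by_hour.items()}


def is_anomalous(baseline, hour, rate, z_threshold=3.0):
    """Compare a rate against the baseline for the same hour of day."""
    mu, sigma = baseline[hour]
    # Floor the deviation so a very flat history does not make every
    # small wobble look anomalous (hypothetical choice for this sketch).
    sigma = max(sigma, 0.05 * abs(mu), 1.0)
    return abs(rate - mu) / sigma > z_threshold


# Batch job pod: busy at 02:00, nearly idle at 14:00 (rates in MB/s).
history = [(2, 900.0), (2, 1000.0), (2, 1100.0),
           (14, 10.0), (14, 12.0), (14, 11.0)]
baseline = hourly_baseline(history)

print(is_anomalous(baseline, 2, 1000.0))   # high throughput in its normal window
print(is_anomalous(baseline, 14, 1000.0))  # same throughput outside that window
```

The same byte rate is judged against a different expectation depending on the hour, which a single static threshold cannot express.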

3. Pod restart frequency

What it learns: Per-workload restart rate patterns during deployments, scaling events, and routine operations.
What it flags: Restart patterns that are anomalous relative to each workload's own history. This is distinct from the CrashLoopBackOff alert rule, which fires on a fixed threshold regardless of context.
Why ML (not alert rule): A deployment that restarts twice during every rollout can be healthy. The same deployment restarting twice on a Tuesday afternoon may be unhealthy. The alert rule cannot distinguish these cases without workload history.
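The rollout example can be sketched with explicit context. Here the rollout windows are supplied by hand, as a stand-in for the history that the ML job learns on its own. The function name and the timestamps are hypothetical.

```python
from datetime import datetime


def restarts_outside_rollouts(restart_times, rollout_windows):
    """Return restart timestamps that fall outside every rollout window.

    rollout_windows: list of (start, end) datetimes, supplied manually here.
    """
    return [
        t for t in restart_times
        if not any(start <= t <= end for start, end in rollout_windows)
    ]


# One rollout on Monday morning.
rollouts = [(datetime(2026, 4, 20, 10, 0), datetime(2026, 4, 20, 10, 15))]

# Two restarts during the rollout, two on Tuesday afternoon.
restarts = [
    datetime(2026, 4, 20, 10, 5),
    datetime(2026, 4, 20, 10, 7),
    datetime(2026, 4, 21, 14, 30),
    datetime(2026, 4, 21, 14, 32),
]

unexpected = restarts_outside_rollouts(restarts, rollouts)
print(len(unexpected), unexpected)
```

A fixed-threshold rule sees four restarts either way. With context, only the two Tuesday restarts stand out.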

Here's our Single Metric Viewer showing anomalies triggered against a specific pod, for the memory growth job:

And here's the multi-series Anomaly Explorer view of the same job, showing detections firing across a variety of pods:

Try it yourself: the OTel Astronomy Shop

If you do not have a Kubernetes cluster ready, you can use the OpenTelemetry Astronomy Shop demo environment. It uses the same commands as Getting Started Step 2, Path A, but points to demo services. Create the namespace and secret, then run the Helm install. Telemetry from all 16 services, Kafka, and PostgreSQL starts flowing into Elastic without instrumentation changes.

The demo ships with a built-in feature flag service, flagd, that lets you activate failure scenarios. Enable cartServiceFailure and watch the checkout-service restart cascade unfold in real time. The CrashLoopBackOff alert rule fires. The ML modules begin establishing baselines. If you have the investigation workflow enabled, it runs automatically when the alert fires.

Getting started

Step 1 - Install the Kubernetes integration. Dashboards are available immediately. No additional configuration is required.

Step 2 - Deploy data collection. There are two supported paths, both based on Helm. Choose the one that fits your deployment model.

Path A - OpenTelemetry (EDOT collector): This path uses the opentelemetry-kube-stack Helm chart with the Elastic Distribution of OpenTelemetry (EDOT) collector. Create a namespace and a secret with your endpoint and API key, then install:

kubectl create namespace opentelemetry-operator-system

kubectl create secret generic elastic-secret-otel \
  --namespace opentelemetry-operator-system \
  --from-literal=elastic_otlp_endpoint='https://.elastic.cloud:443' \
  --from-literal=elastic_api_key=''

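helm repo add open-telemetry 'https://open-telemetry.github.io/opentelemetry-helm-charts' --force-update

helm upgrade --install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack \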
  --namespace opentelemetry-operator-system \
  --values 'https://raw.githubusercontent.com/elastic/elastic-agent/refs/tags/v9.3.2/deploy/helm/edot-collector/kube-stack/managed_otlp/values.yaml' \
  --version '0.12.4'

Path B - Elastic Agent (standalone): This path uses the elastic/elastic-agent Helm chart. The default manifest includes resource limits that may not be appropriate for production. Review the Scaling Elastic Agent on Kubernetes guide before deploying.

helm repo add elastic https://helm.elastic.co/ && \
helm install elastic-agent elastic/elastic-agent \
  --version 9.3.2 \
  -n kube-system \
  --set outputs.default.url=https://.es.elastic.cloud:443 \
  --set outputs.default.type=ESPlainAuthAPI \
  --set outputs.default.api_key=$(echo "" | base64 -d) \
  --set kubernetes.enabled=true

Step 3 - Enable the alert rule templates. Go to Observability > Alerts in Kibana. The Kubernetes templates are in the rule library. Enable the templates relevant to your environment, set thresholds, and connect your notification channel.

Step 4 - Let the ML modules warm up. After 24 to 48 hours, anomaly detection modules establish baselines and begin surfacing pattern-based deviations. Longer-running jobs usually produce better baselines. Find results in the ML Anomaly Explorer, linked from the Kubernetes dashboards.

Steps 5, 6, and 7 - Agentic content will be covered in Part 2 (forthcoming), Kubernetes observability with Elastic: Agentic Investigations.

What's next

The next step is the layer that runs investigation workflows when an alert fires. That includes skills that encode investigation logic, tools that expose facts like ML state and topology, and MCP apps that render outputs in places like Claude Desktop or VS Code. These technical preview capabilities are available today and will be covered in Part 2 (forthcoming), Kubernetes observability with Elastic: Agentic Investigations.

If you are running Kubernetes on Elastic today, tell us which investigation steps you repeat manually on every incident. Tell us which remediations you would trust a workflow to propose. You can join the Elastic Community Discussion here.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

Elastic Observability Labs - Articles by Jesse Miller