<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Articles by Jesse Miller</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Mon, 08 Jun 2026 15:18:17 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Articles by Jesse Miller</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Agentic Powered Kubernetes Investigations with Elastic Observability and MCP]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-powered-kubernetes-observability-elastic-mcp</link>
            <guid isPermaLink="false">ai-powered-kubernetes-observability-elastic-mcp</guid>
            <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[See how Elastic's Agentic powered Kubernetes observability uses an MCP App and agent skills to let agents investigate clusters, detect anomalies, and automate root cause analysis.]]></description>
            <content:encoded><![CDATA[<p>Agentic powered Kubernetes observability is now available in Elastic Observability. Whether you are using Elastic Observability's UI or your own agentic workflows, Elastic provides a set of capabilities to help investigate the Kubernetes issue at hand. We have released an <a href="https://github.com/elastic/example-mcp-app-observability">MCP (Model Context Protocol) App</a> that lets AI agents like Claude and Cursor query Elastic Observability to understand K8s failures and surface ML anomalies without leaving your chat interface.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/kubernetes-dashboards-alerts-anomaly-detection">Part 1</a>, we covered how Elastic's Kubernetes integration ships telemetry via the EDOT Collector into Elasticsearch. In this post, we go further with an MCP (Model Context Protocol) app server that exposes that telemetry as AI-callable tools, complete with interactive React UIs rendered inline. We'll also cover how to take it further with Elastic Workflows: automated runbooks that handle the full root cause analysis loop from alert to remediation proposal.</p>
<h2>Observability MCP App that renders where you work</h2>
<p>The Elastic Observability MCP App (tech preview) ships six views, one per tool. Each renders inline when the tool returns, and each surfaces opinionated next-step prompts as clickable buttons so you don't have to guess the right follow-up. MCP Apps take it further than standalone agent workflows — they render live, interactive views directly inside your chat or IDE, inline in the conversation, without a context switch to Kibana.</p>
<h3>Cluster health rollup</h3>
<p>Ask &quot;what's broken?&quot; or &quot;give me a status report&quot; and get a one-shot orientation: overall health badge, degraded services with reasons, top pod memory consumers, anomaly severity breakdown, and service throughput — all in one inline view.</p>
<p>The view adapts based on what your deployment supports. APM gives you service health. Kubernetes metrics add pod and node context. ML jobs layer in anomalies. If a signal isn't present, the view tells you what's missing rather than failing. We'll begin with a status report of the Kubernetes cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-health-summary.png" alt="Elastic MCP app showing AI-generated Kubernetes cluster health summary with anomaly breakdown" /></p>
<p>Compound reports like the health summary present condensed data with expandable details, so you choose how much information to view at once. Suggested investigation actions offer guidance on the specific results returned and point you to other tools worth running.</p>
<h3>Service dependency graph</h3>
<p>Ask &quot;what calls checkout?&quot; or &quot;show me the topology&quot; and get a layered dependency graph — upstream callers, downstream dependencies, protocols, call volume, and latency per edge. Hover over an edge to highlight the full call path. Let's ask Claude to &quot;Show me the service dependencies of the frontend&quot;:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-topology.png" alt="Service dependency topology for Kubernetes frontend service in Elastic AI observability app" /></p>
<p>Zoom, pan, and hover to get all the details you need to understand the complex service relationships:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-topology-zoom.png" alt="Zoomed service dependency graph showing Kubernetes frontend connections in Elastic MCP observability" /></p>
<h3>Anomaly details</h3>
<p>Ask &quot;what's anomalous?&quot; or &quot;is anything unusual in checkout?&quot; and get one of two views, chosen automatically. If multiple entities are affected, the overview mode shows severity counts, affected entities, and a by-job breakdown. If a single entity is the focus, the detail mode shows score, actual vs. typical values with a comparison bar, deviation percentage, and a time-series when available. Let's check on the frontend service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-anomaly-details.png" alt="ML anomaly details for Kubernetes frontend pod memory, surfaced by AI observability MCP tool" /></p>
<p>This isn't an ES|QL query. It's an explanation of the results of a previously defined anomaly detection job. As discussed in Part 1 of this blog series, the Kubernetes integration ships with a few such jobs for you to enable. This tool will help you make the most of them.</p>
<h3>Observe</h3>
<p>Observe is the agent's primary access primitive for Elastic — one tool, with two modes for three different needs. Say &quot;what is the network throughput of each of my kubernetes clusters&quot; for a table or chart of results. Say &quot;tell me when memory drops below 80MB&quot; or &quot;watch the frontend memory for anything unusual for the next 10 minutes&quot; and it blocks until the condition fires or the window expires.</p>
<p>The view adapts to the mode: a results table for one-shot queries, a live trend chart with current/peak/baseline stats for sampling and threshold conditions, and a severity-scored trigger card for anomaly mode. We'll use it here to identify the busiest Kubernetes node:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-observe-k8s-services.png" alt="AI observability tool querying Kubernetes node service counts via Elastic MCP" /></p>
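<p>The blocking behavior described above (wait until a condition fires or the window expires) can be pictured as a simple polling loop. The sketch below is illustrative only: <code>watch</code>, its parameters, and the sampler callback are hypothetical names chosen for this example, not the MCP App's actual API.</p>

```python
import time
from typing import Callable, Optional


def watch(
    sample: Callable[[], float],
    condition: Callable[[float], bool],
    window_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll `sample` until `condition` holds or `window_s` elapses.

    Returns the first value that satisfied the condition, or None if the
    window expired without a match.
    """
    deadline = clock() + window_s
    while True:
        value = sample()
        if condition(value):
            return value
        # Stop if the next poll would land after the end of the window.
        if clock() + interval_s > deadline:
            return None
        sleep(interval_s)


# Hypothetical usage: "tell me when memory drops below 80MB".
readings = iter([95.0, 88.0, 79.5, 70.0])
fired = watch(lambda: next(readings), lambda mb: mb < 80.0,
              window_s=600, interval_s=0.0, sleep=lambda _: None)
print(fired)
```

<p>Injecting <code>sleep</code> and <code>clock</code> keeps the loop testable without real waiting. A real implementation would sample from Elasticsearch rather than from an in-memory list.</p>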
<h3>Assess risk with a blast radius</h3>
<p>Ask &quot;what happens if this node goes down?&quot; and get a radial impact diagram: the target node at center, full-outage deployments in red, degraded in amber, unaffected in gray. A floating summary card shows pods at risk and rescheduling feasibility. Single-replica deployments are flagged as single points of failure. Here's what would happen if our busy node were to fail:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-blast-radius.png" alt="Kubernetes blast radius analysis showing node failure impact across deployments in Elastic MCP app" /></p>
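<p>The three impact classes in the diagram can be derived from where each deployment's replicas are scheduled. The following is a minimal sketch under assumptions: the input shape (one <code>(deployment, node)</code> pair per running pod) and the function name are hypothetical, and it ignores rescheduling feasibility, which the real tool also reports.</p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input: (deployment, node) for every running pod.
Pod = Tuple[str, str]


def blast_radius(pods: List[Pod], failed_node: str) -> Dict[str, str]:
    """Classify each deployment by how a single node failure would affect it."""
    total: Dict[str, int] = defaultdict(int)
    lost: Dict[str, int] = defaultdict(int)
    for deployment, node in pods:
        total[deployment] += 1
        if node == failed_node:
            lost[deployment] += 1

    impact = {}
    for deployment, count in total.items():
        if lost[deployment] == count:
            impact[deployment] = "full_outage"   # every replica is on the node
        elif lost[deployment] > 0:
            impact[deployment] = "degraded"      # some replicas survive
        else:
            impact[deployment] = "unaffected"
    return impact


pods = [
    ("cart", "node-a"),                          # single replica: a single point of failure
    ("frontend", "node-a"), ("frontend", "node-b"),
    ("checkout", "node-b"),
]
print(blast_radius(pods, "node-a"))
```

<p>In this toy input, <code>cart</code> has one replica on the failing node, so it is a full outage, while <code>frontend</code> keeps one replica and is only degraded.</p>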
<h3>Alert management</h3>
<p>With the alert management tool, you can create, list, inspect, and delete alerts. We'll create an alert next, but first use Observe once more to take a quick baseline so we know the alert threshold will make sense:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-observe-memory.png" alt="Live Kubernetes pod memory chart generated by AI observability app using Elastic MCP" /></p>
<p>Say &quot;alert me if frontend memory goes above 75MB&quot; and the agent creates a persistent Kibana alerting rule — a saved object that keeps running after the conversation ends. The view renders a live rule card: rule name, condition, window, check interval, KQL filter, and tags. Next-step buttons offer to verify the rule, watch the metric stabilize, or check current cluster health. The agent confirms what was created and where to find it in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-create-alert.png" alt="AI-created Kubernetes alert rule for frontend pod memory via Elastic MCP observability tool" /></p>
<h3>MCP App architecture</h3>
<p>The app is composed of a Node.js server, six model-facing tools wired to six single-file view resources, app-only tools for re-queries, and vite-plugin-singlefile bundling. Tools are grouped by deployment backend (Universal, APM-dependent, K8s-dependent, ML-dependent), so the agent and the user both know up front which tools apply to a given deployment instead of discovering capability gaps at call time. The repo includes six Skills as separate .zip artifacts that teach the agent when and how to call each tool.</p>
<p>The following diagram shows the three components that make up the app: the MCP host (Claude Desktop, VS Code, or similar), which holds the LLM and the Claude skills that teach it how to use the tools; the MCP app server, a single Node.js process that exposes the tool registry, bundles the React UI views, and handles all communication with Elastic; and the Elastic Stack itself, where Elasticsearch and Kibana serve as the live data and alerting backends.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-architecture-application.png" alt="Architecture diagram of AI-powered Kubernetes observability app built on Elastic MCP" /></p>
<p>The diagram below traces the flow of a user request: Claude reads the relevant skill file to understand which tool to call and how to fill its parameters, calls the tool which triggers server-side queries against Elasticsearch and Kibana, and receives back a compact text summary alongside a React UI resource that renders inline as an interactive widget.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/mcp-app-architecture-chat-flow.png" alt="Chat flow diagram showing AI Kubernetes monitoring request lifecycle through Elastic MCP server" /></p>
<h2>From alert to root cause: Investigation Workflows</h2>
<p>Alert rules tell you something is wrong. ML modules tell you the pattern. Elastic Workflows run the diagnosis — automatically, the moment an alert fires.</p>
<p>We're shipping a Kubernetes Investigation Workflow (technical preview) that triggers on a Kubernetes alert and returns a structured root cause summary before you've opened a single dashboard. The SRE who gets paged opens the alert and finds the investigation already done.</p>
<p>The workflow is a directed graph of steps that queries multiple data sources — primarily via Elasticsearch Query Language (ES|QL), with an Elasticsearch search for the ML anomaly lookup. <code>if</code> steps branch on query results, choosing which corroboration to run (ML memory anomaly vs log classification) and whether to assess upstream health (only when APM dependencies exist). AI steps appear at three points: classifying log patterns on the non-OOM path, classifying upstream degraded-vs-healthy, and a final <code>ai.summarize</code> that synthesizes all structured evidence into a root-cause narrative.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/k8s-workflow.png" alt="Elastic AI workflow for automated Kubernetes CrashLoopBackOff investigation" /></p>
<p><strong>What the investigation workflow looks like in practice</strong></p>
<p>The example execution below is based on the OpenTelemetry Astronomy Shop running against Elastic — 16 services, Kafka, PostgreSQL, all pre-instrumented via OTLP. Alongside the Shop's real telemetry, we injected a synthetic OOMKill cascade, which writes synthetic K8s and APM signals into the same namespace via the EDOT data streams. The workflow can't tell our signals from real ones — it just investigates the alert.</p>
<p><strong>Alert fires:</strong> CrashLoopBackOff — app-deployment in oteldemo-esyox-default. Restart count: 6.</p>
<p><strong>Workflow step 1 — Characterize pod and container context</strong></p>
<p>The workflow queries K8s metrics for restart count, last termination reason, and utilization against declared limits.</p>
<p>Result: Last termination reason OOMKilled, restart count 6. (Note: kubeletstats utilization was unavailable for this pod/window — the workflow continues gracefully.)</p>
<p><strong>Workflow branches:</strong> Termination reason is OOMKilled, so the workflow takes the memory-investigation path, not the log-investigation path.</p>
<p><strong>Workflow step 2a — Consult ML anomaly results</strong></p>
<p>Rather than recomputing memory trends, the workflow queries the ML anomaly index for an active <code>k8s_pod_memory_growth</code> anomaly.</p>
<p>Result: No anomaly — the spike is flagged load-driven, not a suspected leak.</p>
<p><strong>Workflow step 3 — Check upstream service health</strong></p>
<p>The workflow enumerates upstream dependencies from APM <code>service_destination.1m</code> aggregates, then compares current error rate and mean latency against the same hour 7 days ago. An AI classification step decides whether upstream degradation preceded the alert.</p>
<p>Result: One upstream, api-gateway. Current mean latency 15.13 ms, error rate 41.26%. Baseline (168h ago): identical. Classification: upstream_healthy (within the 5× error / 3× latency thresholds). Upstream is ruled out.</p>
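<p>To make the baseline comparison concrete, here is a rough sketch of a relative-threshold check. It is an assumption-laden illustration: the workflow uses an AI classification step, whereas this code applies the quoted 5× error / 3× latency factors as a plain deterministic rule, and the <code>upstream_degraded</code> label and all names are hypothetical.</p>

```python
from dataclasses import dataclass

ERROR_RATE_FACTOR = 5.0   # "5x error" threshold quoted in the post
LATENCY_FACTOR = 3.0      # "3x latency" threshold quoted in the post


@dataclass
class UpstreamStats:
    mean_latency_ms: float
    error_rate_pct: float


def classify_upstream(current: UpstreamStats, baseline: UpstreamStats) -> str:
    """Compare current stats with the baseline from 168h earlier."""
    error_degraded = current.error_rate_pct > ERROR_RATE_FACTOR * baseline.error_rate_pct
    latency_degraded = current.mean_latency_ms > LATENCY_FACTOR * baseline.mean_latency_ms
    return "upstream_degraded" if (error_degraded or latency_degraded) else "upstream_healthy"


# Values from the example run: current and baseline are identical.
api_gateway = UpstreamStats(mean_latency_ms=15.13, error_rate_pct=41.26)
print(classify_upstream(api_gateway, api_gateway))
```

<p>Because the check is relative to the baseline, an upstream with a high but unchanged error rate still classifies as healthy. That matches the example: the question is whether the upstream changed before the alert, not whether it is good in absolute terms.</p>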
<p><strong>Workflow step 4 — Correlate with recent K8s changes</strong></p>
<p>Event log for the namespace shows a tight cycle of Pulled → Created → Started → Killing → BackOff repeating roughly every 60–90 seconds. No deployments or scaling events in the past two hours.</p>
<p><strong>Workflow output:</strong></p>
<pre><code>ROOT CAUSE HYPOTHESIS (confidence: high)

app-deployment is OOMKilling under memory pressure. The pod has restarted
6 times with termination reason OOMKilled. ML flagged the memory spike as
load-driven (no leak). Upstream api-gateway is healthy at current vs 7-day
baseline. This is a resource-allocation issue — the container's memory
limit is too low for its real working set.

Evidence:
- 6 restarts, last termination reason OOMKilled
- No ML memory-growth anomaly → leak_suspected=false (load-driven)
- Upstream api-gateway unchanged vs 7d baseline (15.13 ms, 41.26%) → healthy
- K8s events show tight Pulled/Created/Started/Killing/BackOff cycles;
  no deployments in the last 2h

Likely cause: memory limit insufficient for actual working set under load.

Recommended next steps:
1. Raise the app-deployment memory limit based on observed usage
2. Review application code for memory-optimization opportunities
3. Consider graceful degradation on high-load paths

Downstream impact: none identified from APM destination metrics.
</code></pre>
<p>The output above is what the alert looks like when you open it — not a link to a bunch of logs or a dashboard, but an answer.</p>
<p>The same workflow is accessible as an MCP tool from Claude Desktop, VS Code, or any MCP-compatible client. When a developer asks &quot;why is checkout erroring?&quot; from their IDE, the agent calls the workflow and returns the same structured output inline — same evidence, same root cause, without leaving the editor.</p>
<p>Here's an animated walkthrough of the workflow execution:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/k8s-workflow-walkthrough.gif" alt="Walkthrough of AI-powered Kubernetes root cause analysis workflow in Elastic" /></p>
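<p>The branching described in this section can be summarized in a few lines. This is only a sketch of the control flow as the post describes it; the step names are illustrative labels, not the workflow's real step IDs, and the actual workflow is defined in Elastic Workflows rather than Python.</p>

```python
from typing import List


def plan_steps(termination_reason: str, has_apm_dependencies: bool) -> List[str]:
    """Return the ordered steps for one investigation run."""
    steps = ["characterize_pod"]
    if termination_reason == "OOMKilled":
        steps.append("ml_memory_anomaly_lookup")   # memory-investigation path
    else:
        steps.append("ai_classify_log_patterns")   # log-investigation path
    if has_apm_dependencies:
        # Upstream health is assessed only when APM dependencies exist.
        steps.append("ai_classify_upstream_health")
    steps.append("correlate_k8s_changes")
    steps.append("ai_summarize")
    return steps


print(plan_steps("OOMKilled", has_apm_dependencies=True))
```

<p>For the example alert above, the plan takes the memory path and includes the upstream check, since api-gateway appears as an APM dependency.</p>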
<h2>Observability Skill for Kubernetes investigations</h2>
<p>We're also shipping a single, comprehensive investigation Skill (<code>observability-k8s-investigation</code>) that encodes the full diagnostic protocol for Kubernetes workload, node, and control-plane issues. It is an opinionated investigation methodology that includes the reasoning an experienced SRE applies instinctively but rarely writes down. You'll get this by keeping Kibana up to date, as it's baked into our AI Agent skills. It starts with governing principles that prevent the most common misdiagnoses:</p>
<ul>
<li><strong>Absence of evidence is not evidence.</strong> If log queries return zero rows, report <code>no_logs_available</code> — don't infer a failure mode from empty results.</li>
<li><strong>OOMKilled does not mean memory leak by default.</strong> Compare current usage against a 7-day baseline before claiming a leak. The limit may simply be undersized.</li>
<li><strong>Average CPU metrics hide throttling.</strong> A pod can look healthy at 40–60% average utilization while being severely throttled at p99. Look at max and p95, not just average.</li>
<li><strong>Co-symptoms are not causes.</strong> Two services degrading simultaneously usually share an upstream cause. Only attribute causation when one service's degradation clearly precedes the other's and the delta is large.</li>
</ul>
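<p>The point about averages is easy to see with numbers. The sketch below uses made-up per-minute CPU utilization samples for one pod (a hypothetical series, with utilization standing in for throttling) and compares the average with p95 and max.</p>

```python
import statistics

# Hypothetical per-minute CPU utilization (% of limit) for one pod:
# mostly moderate, with short bursts that hit the limit.
cpu_pct = [35.0] * 54 + [100.0] * 6

average = statistics.fmean(cpu_pct)
# quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
p95 = statistics.quantiles(cpu_pct, n=20, method="inclusive")[-1]
peak = max(cpu_pct)

print(f"avg={average:.1f}% p95={p95:.1f}% max={peak:.1f}%")
```

<p>The average lands near 40%, which looks healthy, while p95 and max sit at the limit. In a real investigation you would confirm with the container's throttling counters rather than utilization alone.</p>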
<p>From there, the Skill encodes a failure-mode taxonomy covering 16 distinct K8s failure patterns across workload, node, control-plane, autoscaling, and networking layers — from OOMKilled and CFS throttling through admission webhook blocks and StatefulSet split-brain. Each mode has a pivotal signal that identifies it and a corroboration checklist that confirms it.</p>
<p>The investigation flow follows a structured arc: orient (resolve the target pod, namespace, deployment), characterize (get restart count, termination reasons, utilization), classify (match against the taxonomy), corroborate (pull events, logs, APM, baseline comparisons), and synthesize (produce a root cause hypothesis at calibrated confidence — high, medium, or low — with explicit evidence and recommended next steps).</p>
<p>When two failure modes fit the evidence, the Skill names both and says which it believes is causal and why. When evidence is ambiguous, it says so. &quot;Competing hypotheses are a valid output&quot; is an explicit design principle — manufacturing false confidence is treated as a failure mode of the investigation itself.</p>
<h2>Getting started</h2>
<p>These capabilities build on the Kubernetes integration described in Part 1. Once you have dashboards and data collection running:</p>
<p><strong>Step 1 — Enable investigation workflows</strong> (technical preview). Import the Kubernetes Crashloop Investigation Workflow from the Workflows page in Kibana, and optionally configure it to trigger on an alert rule.</p>
<p><strong>Step 2 — Install the MCP App on an MCP-compatible client</strong> (technical preview). The <a href="https://github.com/elastic/example-mcp-app-observability">MCP App for Observability repo</a> is on GitHub (see its Releases page for downloads). When installing the app, remember to also install and enable the included skills. You can then access the app's tools from your favorite agentic client; setup instructions are in the repo's README.</p>
<p><strong>Step 3 — Leverage the K8s Investigation Skill</strong> (technical preview). This one is a freebie if you're using Agent Builder, because it's baked into AI Agent Skills. The Skill teaches the agent when and how to call the underlying tools and workflows, ensuring consistent diagnostics in conversational contexts.</p>
<h2>What's next</h2>
<p>Investigation workflows diagnose what's broken in the services you're monitoring. The next question is harder: what about the services you're not monitoring?</p>
<p>We're thinking about topology-aware coverage intelligence — automatically discovering every workload deployed in your cluster via the Kubernetes API, cross-referencing against telemetry flowing into Elastic, and surfacing the gap. &quot;You have 47 services. 11 have no distributed traces. Here's your riskiest blind spot.&quot; That capability is under consideration and will likely be the subject of a future post.</p>
<p>In parallel, we're extending workflows toward remediation — not just diagnosis but action: creating a case with the investigation summary attached, proposing a rollback for human approval, or scaling a workload to buy time while the root cause is addressed.</p>
<p>If you're running Kubernetes on Elastic today, tell us which investigation steps you repeat manually on every incident, which remediations you'd trust a workflow to propose, and which MCP tools we should build next. Join the conversation in the <a href="https://discuss.elastic.co/c/observability">Elastic community discussion forum</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-powered-kubernetes-observability-elastic-mcp/header.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Kubernetes Observability from alert to root cause: Dashboards, Alerts, and Anomaly Detection with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-dashboards-alerts-anomaly-detection</link>
            <guid isPermaLink="false">kubernetes-dashboards-alerts-anomaly-detection</guid>
            <pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Kubernetes observability with Elastic includes dashboards, alert rules, and ML anomaly detection for alerts with root-cause context.]]></description>
            <content:encoded><![CDATA[<h1>Kubernetes observability with Elastic: Dashboards, Alerts, and Anomaly Detection</h1>
<p>Kubernetes observability with Elastic is built for the operator who gets paged at 3 AM. That operator is often in a terminal, a chat tool, or an IDE. They need an answer that is grounded in what is happening in the cluster right now.</p>
<p>The new <a href="https://www.elastic.co/docs/reference/integrations/kubernetes">Elastic Kubernetes integration</a> is built for that operator. It includes dashboards with drilldowns, alert rule templates, and ML anomaly detection jobs. Elastic also offers Agentic Investigations, which drive investigations automatically.</p>
<p>This post covers the foundational observability components (dashboards, drilldowns, alert templates, and anomaly detection jobs). Part 2 covers the agentic investigations: workflows, agent skills, and MCP tools and views.</p>
<p>The new Kubernetes integration content in this post is generally available across Elastic Cloud Hosted, Serverless, and self-managed deployments.</p>
<hr />
<h2>Dashboards designed for drill-down, not just display</h2>
<p>The new Kubernetes dashboards are organized around a three-tier design: a cluster Overview that surfaces what needs attention at a glance, object summary pages for clusters, nodes, namespaces, workloads, and pods, and object detail pages that give you the full picture for any single entity.</p>
<p>Every layer connects to the next. Click any entity in a summary table and choose whether to apply it as a filter on the current view or open its dedicated detail page.</p>
<p>Here's what that looks like when something's actually wrong:</p>
<p><strong>Following a restart cascade from overview to container</strong></p>
<p><strong>Overview:</strong> The Overview surfaces what needs attention across your cluster.
You can see top pods by CPU, top namespaces by container restarts, and top nodes by memory utilization in one screen.
When the &quot;container restarts&quot; panel starts climbing, you know where to look.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/overview-dashboard.jpg" alt="Kubernetes observability with Elastic, cluster overview dashboard showing top pods by CPU and container restarts by namespace" /></p>
<p><strong>Namespaces Overview:</strong> Click into the flagged namespace with 1232 restarts and CPU limit utilization at 116%.
The detail view plots CPU and memory against requests and limits over time.
This shows both the size and duration of the overage.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/namespace-overview.jpg" alt="Kubernetes observability with Elastic, namespace overview showing multiple namespaces" /></p>
<p><strong>Namespace Details:</strong> We can get more info on the various pods in this namespace here.
Click the pod driving the restarts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/namespace-details.jpg" alt="Kubernetes observability with Elastic, namespace detail view showing CPU limit utilization at 116% and container restart count" /></p>
<p><strong>Pod Details:</strong> The pod detail dashboard is organized into capacity, metrics, and containers sections.
Container restarts are flagged in red at the top of the page.
Most panels are metric-driven, and the dashboard also links to correlated pod logs in Discover.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/pod-details.jpg" alt="Kubernetes observability with Elastic, pod detail dashboard with container restart alerts, capacity metrics, and log drilldown links" /></p>
<p>It takes four clicks to move from the Cluster Overview to container logs that explain the failure.
These dashboards are starting points for your team.
You can copy and customize them with ES|QL visualizations.</p>
<hr />
<h2>Alert rules that fire on day one</h2>
<p>The integration ships with pre-built alerting rule templates for states that are wrong by definition.
No historical baseline or warmup period is required.
Enable them during setup and they work immediately.</p>
<p>These rules do not ask, &quot;Is this abnormal for this service?&quot;
They ask, &quot;Is this a known bad state in Kubernetes?&quot;
A pod in CrashLoopBackOff is always a problem.
A container killed by the kernel for exceeding its memory limit is always a problem.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/alert-list.png" alt="Kubernetes observability with Elastic, list of alerts with the CrashLoopBackOff alert rule selected" /></p>
<p>Like the Kubernetes dashboards, these alerts are built on ES|QL queries.
You can see that in the CrashLoopBackOff definition below.
If you are new to ES|QL, you can start with the <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql">ES|QL docs</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/alert-detail.png" alt="Kubernetes observability with Elastic, ES|QL query that defines the CrashLoopBackOff alert rule" /></p>
<p>The alert templates cover:</p>
<ul>
<li><strong>CrashLoopBackOff detection</strong> - Fires when a pod's restart count exceeds a configurable threshold within a rolling window.
The default catches a real restart cycle without triggering on routine restarts during a rolling deployment.</li>
<li><strong>Container OOMKilled</strong> - Surfaces kernel-level container terminations due to memory limits.
These events are easy to miss in dashboards and often precede wider failures.
This rule fires on any occurrence.</li>
<li><strong>Deployment below desired replicas</strong> - Fires when a deployment runs fewer replicas than declared for longer than a grace period.
This catches scaling failures and partially failed rollouts.</li>
<li><strong>Pod stuck in Pending</strong> - Fires when a pod cannot be scheduled past a configurable time threshold.
This surfaces node capacity problems, missing resources, and affinity failures before availability drops.</li>
<li><strong>Node disk pressure</strong> - Fires immediately when the Kubernetes DiskPressure node condition is <code>True</code>.
A node condition is a direct state signal, not a statistical threshold.</li>
<li><strong>Persistent volume near capacity</strong> - Alerts when storage utilization crosses a configurable threshold before writes start failing.</li>
</ul>
<p>Each template is parameterized.
Adjust thresholds in the ES|QL query to match your environment.
Connect notifications to PagerDuty, Slack, or another destination in your runbook.</p>
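<p>As an illustration of the "threshold within a rolling window" idea behind the CrashLoopBackOff template, here is a small sketch. The shipped rule is an ES|QL query; this Python version only mirrors the logic, and the function name, the 10-minute window, and the threshold of 3 are assumptions made for the example, not the template's defaults.</p>

```python
from datetime import datetime, timedelta
from typing import List


def restart_rule_fires(
    restart_times: List[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=10),   # hypothetical default
    threshold: int = 3,                          # hypothetical default
) -> bool:
    """Fire when more than `threshold` restarts fall inside the rolling window."""
    window_start = now - window
    recent = [t for t in restart_times if window_start < t <= now]
    return len(recent) > threshold


now = datetime(2026, 4, 21, 3, 0, 0)
# One restart during a rollout an hour ago: should not fire.
rollout = [now - timedelta(minutes=60)]
# Six restarts roughly 90 seconds apart: a real restart cycle.
crash_loop = [now - timedelta(seconds=90 * i) for i in range(1, 7)]

print(restart_rule_fires(rollout, now), restart_rule_fires(crash_loop, now))
```

<p>Tuning the window and threshold is the trade-off the template describes: wide enough to catch a real restart cycle, tight enough to ignore routine restarts during a rolling deployment.</p>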
<hr />
<h2>Anomaly detection jobs with ML baselines</h2>
<p>Alert rules catch what is definitively wrong.
ML anomaly detection catches patterns that often precede failures.
If you are new to this area, see the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-overview.html">Elastic anomaly detection overview</a>.</p>
<p>A pod that always runs at 85% memory utilization might be healthy.
A pod that grew from 40% to 85% over twelve hours is usually not healthy.
A static threshold often catches this only after an OOM kill.
The ML module should catch the trajectory earlier.</p>
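<p>A small numeric sketch shows why trajectory matters more than level here. It is not how Elastic's anomaly detection works internally; it simply fits a least-squares slope to two made-up hourly series (one steady at 85%, one growing from 40% to 85%) and checks them against a hypothetical static threshold of 90%.</p>

```python
hours = list(range(13))  # samples taken hourly over twelve hours

steady = [85.0] * 13                                  # always at 85%
growing = [40.0 + (45.0 / 12.0) * h for h in hours]   # 40% -> 85%

STATIC_THRESHOLD = 90.0  # hypothetical static alert threshold


def growth_per_hour(utilization_pct):
    """Least-squares slope of utilization against time (percentage points/hour)."""
    n = len(hours)
    mean_x = sum(hours) / n
    mean_y = sum(utilization_pct) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, utilization_pct))
    sxx = sum((x - mean_x) ** 2 for x in hours)
    return sxy / sxx


for name, series in (("steady", steady), ("growing", growing)):
    breached = max(series) >= STATIC_THRESHOLD
    print(name, f"slope={growth_per_hour(series):.2f}", f"threshold_breached={breached}")
```

<p>Neither series crosses the static threshold, so a threshold rule stays silent for both. Only the slope separates the steady pod from the one that is climbing toward its limit.</p>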
<p>The integration ships with ML module configurations that learn workload baselines and flag meaningful deviations.
These jobs need 24 to 48 hours of data before results become useful.
Results become more reliable as jobs continue to run.</p>
<h3>The included modules</h3>
<p><strong>1. Pod memory growth anomalies</strong></p>
<ul>
<li><strong>What it learns:</strong> Per-pod memory consumption patterns over time.</li>
<li><strong>What it flags:</strong> Growth trajectories that are inconsistent with baseline behavior, such as a slow leak that never crosses the hard limit.</li>
<li><strong>Why ML (not alert rule):</strong> The alert rule catches the OOMKill after the fact.
The ML job catches the trajectory that leads there.</li>
</ul>
<p><strong>2. Network I/O anomalies</strong></p>
<ul>
<li><strong>What it learns:</strong> Per-pod network transmit/receive byte rate patterns.</li>
<li><strong>What it flags:</strong> Unusual spikes or drops relative to the pod baseline.
A spike can indicate a runaway process or unexpected load.
A drop can indicate a network partition that causes the pod to go idle.</li>
<li><strong>Why ML (not alert rule):</strong> Normal network traffic varies by time of day and workload type.
A batch job pod at high throughput during its normal window is expected.
The same throughput outside that window can be anomalous.</li>
</ul>
<p><strong>3. Pod restart frequency</strong></p>
<ul>
<li><strong>What it learns:</strong> Per-workload restart rate patterns during deployments, scaling events, and routine operations.</li>
<li><strong>What it flags:</strong> Restart patterns that are anomalous relative to each workload's own history.
This is distinct from the CrashLoopBackOff alert rule, which fires on a fixed threshold regardless of context.</li>
<li><strong>Why ML (not alert rule):</strong> A deployment that restarts twice during every rollout can be healthy.
The same deployment restarting twice on a Tuesday afternoon may be unhealthy.
The alert rule cannot distinguish these cases without workload history.</li>
</ul>
<p>Here's our Single Metric Viewer showing anomalies triggered against a specific pod, for the memory growth job:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/single-metric-viewer.png" alt="Kubernetes observability with Elastic, ML Single Metric Viewer showing pod memory growth anomaly detection for one pod" /></p>
<p>And here is the multi-series Anomaly Explorer view of the same job, showing detections across multiple pods:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/anomaly-explorer.png" alt="Kubernetes observability with Elastic, Anomaly Explorer showing pod memory anomaly detections across multiple pods" /></p>
<hr />
<h2>Try it yourself: the OTel Astronomy Shop</h2>
<p>If you do not have a Kubernetes cluster ready, you can use the OpenTelemetry Astronomy Shop demo environment.
It uses the same commands as Getting Started Step 2, Path A, but points to demo services.
Create the namespace and secret, then run the Helm install.
Telemetry from all 16 services, Kafka, and PostgreSQL starts flowing into Elastic without instrumentation changes.</p>
<p>The demo ships with a built-in feature flag service, <code>flagd</code>, that lets you activate failure scenarios.
Enable <code>cartServiceFailure</code> and watch the checkout-service restart cascade unfold in real time.
The CrashLoopBackOff alert rule fires.
The ML modules begin establishing baselines.
If you have the investigation workflow enabled, it runs automatically when the alert fires.</p>
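<p>One way to follow the cascade from a terminal is to stream pod status changes with <code>kubectl</code>. This is a minimal sketch: the helper name <code>watch_demo_pods</code> is made up for illustration, and the namespace argument is whichever namespace you installed the demo into.</p>
<pre><code class="language-bash"># Hypothetical helper: stream pod status changes for one namespace.
# Usage: watch_demo_pods NAMESPACE
watch_demo_pods() {
  kubectl get pods --namespace "$1" --watch
}
</code></pre>
<p>While the failure flag is active, the RESTARTS column for the affected pods should climb as their containers crash and restart.</p>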
<hr />
<h2>Getting started</h2>
<p><strong>Step 1 - Install the Kubernetes integration.</strong>
Dashboards are available immediately.
No additional configuration is required.</p>
<p><strong>Step 2 - Deploy data collection.</strong>
There are two supported paths, both based on Helm.
Choose the one that fits your deployment model.</p>
<p><strong>Path A - OpenTelemetry (EDOT collector):</strong>
This path uses the <code>opentelemetry-kube-stack</code> Helm chart with the Elastic Distribution of OpenTelemetry (EDOT) collector.
Create a namespace and a secret with your endpoint and API key, then install:</p>
<pre><code class="language-bash">kubectl create namespace opentelemetry-operator-system

kubectl create secret generic elastic-secret-otel \
  --namespace opentelemetry-operator-system \
  --from-literal=elastic_otlp_endpoint='https://&lt;your-endpoint&gt;.elastic.cloud:443' \
  --from-literal=elastic_api_key='&lt;your-api-key&gt;'

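# Register the OpenTelemetry Helm chart repository so the chart reference below resolves
helm repo add open-telemetry 'https://open-telemetry.github.io/opentelemetry-helm-charts' --force-update

helm upgrade --install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack \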
  --namespace opentelemetry-operator-system \
  --values 'https://raw.githubusercontent.com/elastic/elastic-agent/refs/tags/v9.3.2/deploy/helm/edot-collector/kube-stack/managed_otlp/values.yaml' \
  --version '0.12.4'
</code></pre>
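<p>To confirm the Path A install, one option is to check the Helm release and the pods it created. The snippet below is a minimal sketch, not part of the official instructions: the helper name <code>check_edot_install</code> is made up for illustration, and it assumes <code>kubectl</code> and <code>helm</code> point at the cluster you installed into.</p>
<pre><code class="language-bash"># Hypothetical helper: show the Helm release status, then list the pods
# in the namespace created above.
check_edot_install() {
  ns="opentelemetry-operator-system"
  helm status opentelemetry-kube-stack --namespace "$ns"
  kubectl get pods --namespace "$ns"
}
</code></pre>
<p>Run <code>check_edot_install</code> after the Helm command returns. Pods that stay in a state other than Running are the first place to look if no data arrives.</p>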
<p><strong>Path B - Elastic Agent (standalone):</strong>
This path uses the <code>elastic/elastic-agent</code> Helm chart.
The default manifest includes resource limits that may not be appropriate for production.
Review the <a href="https://www.elastic.co/docs/reference/fleet/scaling-on-kubernetes">Scaling Elastic Agent on Kubernetes guide</a> before deploying.</p>
<pre><code class="language-bash">helm repo add elastic https://helm.elastic.co/ &amp;&amp; \
helm install elastic-agent elastic/elastic-agent \
  --version 9.3.2 \
  -n kube-system \
  --set outputs.default.url='https://&lt;your-endpoint&gt;.es.elastic.cloud:443' \
  --set outputs.default.type=ESPlainAuthAPI \
  --set outputs.default.api_key=$(echo &quot;&lt;your-base64-api-key&gt;&quot; | base64 -d) \
  --set kubernetes.enabled=true
</code></pre>
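<p>The <code>base64 -d</code> step in the command above decodes the API key before Helm receives it. The sketch below shows that round trip with a dummy value, assuming the base64 string is the encoded form of an Elasticsearch API key, which is the base64 of <code>id:api_key</code>. The credentials here are placeholders, not a real key.</p>
<pre><code class="language-bash"># Dummy credentials for illustration only; not a real API key.
plain='example-id:example-secret'

# The encoded form of an API key is the base64 of "id:api_key".
encoded=$(printf '%s' "$plain" | base64)

# Decoding recovers the plain "id:api_key" pair, which is what the
# command substitution in the helm command passes to --set.
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "$decoded"
</code></pre>
<p>If you already have the key in plain <code>id:api_key</code> form, the decode step is unnecessary and the value can be passed directly.</p>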
<p><strong>Step 3 - Enable the alert rule templates.</strong>
Go to Observability &gt; Alerts in Kibana.
The Kubernetes templates are in the rule library.
Enable the templates relevant to your environment, set thresholds, and connect your notification channel.</p>
<p><strong>Step 4 - Let the ML modules warm up.</strong>
After 24 to 48 hours, anomaly detection modules establish baselines and begin surfacing pattern-based deviations.
Longer-running jobs usually produce better baselines.
Find results in the ML Anomaly Explorer, linked from the Kubernetes dashboards.</p>
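<p>While the modules warm up, you can also check job progress outside the UI with the anomaly detection job statistics API. This is a minimal sketch: the helper name <code>ml_job_stats</code> and the <code>ES_URL</code> and <code>ES_API_KEY</code> environment variables are assumptions for illustration, and <code>ES_URL</code> must be your Elasticsearch endpoint rather than the OTLP endpoint used in Path A.</p>
<pre><code class="language-bash"># Hypothetical helper: ES_URL and ES_API_KEY are assumed environment
# variables holding your Elasticsearch endpoint and an encoded API key.
ml_job_stats() {
  curl --silent --show-error \
    --header "Authorization: ApiKey ${ES_API_KEY}" \
    "${ES_URL}/_ml/anomaly_detectors/_stats"
}
</code></pre>
<p>In the response, each job's <code>state</code> and <code>data_counts.processed_record_count</code> indicate whether the job is open and how much data it has analyzed so far.</p>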
<p><strong>Steps 5, 6, and 7 - Agentic content</strong> will be covered in Part 2 (forthcoming), Kubernetes observability with Elastic: Agentic Investigations.</p>
<hr />
<h2>What's next</h2>
<p>The next step is the layer that runs investigation workflows when an alert fires.
That includes skills that encode investigation logic, tools that expose facts like ML state and topology, and MCP apps that render outputs in places like Claude Desktop or VS Code.
These technical preview capabilities are available today and will be covered in Part 2 (forthcoming), Kubernetes observability with Elastic: Agentic Investigations.</p>
<p>If you are running Kubernetes on Elastic today, tell us which investigation steps you repeat manually on every incident.
Tell us which remediations you would trust a workflow to propose.
You can <a href="https://discuss.elastic.co/c/observability">join the Elastic Community Discussion here</a>.</p>
<hr />
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion.</em>
<em>Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-dashboards-alerts-anomaly-detection/header.jpg" length="0" type="image/jpeg"/>
        </item>
    </channel>
</rss>