Jesse Miller

Agentic Powered Kubernetes Observability with Elastic Observability and MCP

See how Elastic's Agentic powered Kubernetes observability uses an MCP App and agent skills to let agents investigate clusters, detect anomalies, and automate root cause analysis.

Agentic powered Kubernetes observability is now available in Elastic Observability. Whether you are using Elastic Observability's UI or your own agentic workflows, Elastic provides a set of capabilities to help you investigate the Kubernetes issue at hand. We have released an MCP (Model Context Protocol) App that lets AI agents like Claude and Cursor query Elastic Observability to understand Kubernetes failures and surface ML anomalies without leaving your chat interface.

In Part 1, we covered how Elastic's Kubernetes integration ships telemetry via the EDOT Collector into Elasticsearch. In this post, we go further with an MCP (Model Context Protocol) app server that exposes that telemetry as AI-callable tools, complete with interactive React UIs rendered inline. We'll also cover how to take it further with Elastic Workflows: automated runbooks that handle the full root cause analysis loop from alert to remediation proposal.

Observability MCP App that renders where you work

The Elastic Observability MCP App (tech preview) ships six views, one per tool. Each renders inline when the tool returns, and each surfaces opinionated next-step prompts as clickable buttons so you don't have to guess the right follow-up. MCP Apps take it further than standalone agent workflows — they render live, interactive views directly inside your chat or IDE, inline in the conversation, without a context switch to Kibana.

Cluster health rollup

Ask "what's broken?" or "give me a status report" and get a one-shot orientation: overall health badge, degraded services with reasons, top pod memory consumers, anomaly severity breakdown, and service throughput — all in one inline view.

The view adapts based on what your deployment supports. APM gives you service health. Kubernetes metrics add pod and node context. ML jobs layer in anomalies. If a signal isn't present, the view tells you what's missing rather than failing. We'll begin with a status report of the Kubernetes cluster:

Compound reports like the health summary present condensed data with expandable detail, so you choose how much information to view at once. Suggested investigation actions provide guidance on the specific results returned and orient you toward other tools worth running.

Service dependency graph

Ask "what calls checkout?" or "show me the topology" and get a layered dependency graph — upstream callers, downstream dependencies, protocols, call volume, and latency per edge. Hover over an edge to highlight the full call path. Let's ask Claude to "Show me the service dependencies of the frontend":

Zoom, pan, and hover to get all the details you need to understand the complex service relationships:

Anomaly details

Ask "what's anomalous?" or "is anything unusual in checkout?" and get one of two views, chosen automatically. If multiple entities are affected, the overview mode shows severity counts, affected entities, and a by-job breakdown. If a single entity is the focus, the detail mode shows score, actual vs. typical values with a comparison bar, deviation percentage, and a time-series when available. Let's check on the frontend service:

This isn't an ES|QL query — it's an explanation of the results of a previously defined anomaly detection job. As discussed in Part 1 of this blog series, the Kubernetes integration ships with several anomaly detection jobs for you to enable. This tool helps you make the most of them.

Observe

Observe is the agent's primary access primitive for Elastic: one tool with two modes covering three needs. Say "what is the network throughput of each of my kubernetes clusters" for a table or chart of results. Say "tell me when memory drops below 80MB" or "watch the frontend memory for anything unusual for the next 10 minutes" and it blocks until the condition fires or the window expires.

The view adapts to the mode: a results table for one-shot queries, a live trend chart with current/peak/baseline stats for sampling and threshold conditions, and a severity-scored trigger card for anomaly mode. We'll use it here to identify the busiest Kubernetes node:
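The watch modes described above can be sketched as a condition evaluated over sampled values. This is a minimal illustration, not the app's implementation: `Sample`, `WatchCondition`, and `evaluateWatch` are hypothetical names, and the anomaly branch uses a simple z-score stand-in for the real ML scoring.

```typescript
// Hypothetical sketch of a blocking "watch" condition. Not the MCP App's
// actual API; names and the z-score anomaly check are illustrative.
type Sample = { timestamp: number; value: number };

type WatchCondition =
  | { kind: "threshold"; op: "above" | "below"; limit: number }
  | { kind: "anomaly"; baselineMean: number; baselineStd: number; maxZ: number };

// Returns the first sample that fires the condition, or null if the
// sampling window expires without a trigger.
function evaluateWatch(samples: Sample[], cond: WatchCondition): Sample | null {
  for (const s of samples) {
    if (cond.kind === "threshold") {
      const fired = cond.op === "above" ? s.value > cond.limit : s.value < cond.limit;
      if (fired) return s;
    } else {
      // Stand-in anomaly check: flag a sample far from the baseline mean.
      const z = Math.abs(s.value - cond.baselineMean) / cond.baselineStd;
      if (z > cond.maxZ) return s;
    }
  }
  return null;
}
```

In this framing, a one-shot query simply returns its result table, while a watch loops over incoming samples until `evaluateWatch` returns non-null or the window closes.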

Assess risk with a blast radius

Ask "what happens if this node goes down?" and get a radial impact diagram: the target node at center, full-outage deployments in red, degraded in amber, unaffected in gray. A floating summary card shows pods at risk and rescheduling feasibility. Single-replica deployments are flagged as single points of failure. What would happen if our busy node were to fail:
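The impact classes in that diagram can be illustrated with a small sketch. This is a hypothetical simplification, assuming the tool derives per-deployment replica placement from Kubernetes metrics; `Deployment`, `classifyImpact`, and `singlePointsOfFailure` are illustrative names only.

```typescript
// Illustrative blast-radius classification for a node failure.
// Hypothetical names; the real tool computes this from K8s metrics in Elastic.
type Deployment = { name: string; replicas: number; replicasOnNode: number };

type Impact = "full-outage" | "degraded" | "unaffected";

function classifyImpact(d: Deployment): Impact {
  if (d.replicasOnNode === 0) return "unaffected";
  // Every replica lives on the failing node: the deployment goes down.
  if (d.replicasOnNode >= d.replicas) return "full-outage";
  return "degraded";
}

// Single-replica deployments pinned to the node are single points of failure.
function singlePointsOfFailure(deps: Deployment[]): string[] {
  return deps
    .filter((d) => d.replicas === 1 && d.replicasOnNode === 1)
    .map((d) => d.name);
}
```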

Alert management

With the alert management tool, you can create, list, get info, and delete alerts. We'll create an alert next, but first use Observe once more to take a quick baseline so we know the alert will make sense:

Say "alert me if frontend memory goes above 75MB" and the agent creates a persistent Kibana alerting rule — a saved object that keeps running after the conversation ends. The view renders a live rule card: rule name, condition, window, check interval, KQL filter, and tags. Next-step buttons offer to verify the rule, watch the metric stabilize, or check current cluster health. The agent confirms what was created and where to find it in Kibana:
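To make the shape of that saved object concrete, here is a sketch of the kind of request body an agent could send to Kibana's alerting API (`POST /api/alerting/rule`). The top-level field names (`name`, `rule_type_id`, `consumer`, `schedule`, `params`, `tags`) follow Kibana's public alerting API, but the specific `rule_type_id` and the `params` shape shown are assumptions for illustration; they vary by rule type.

```typescript
// Sketch of a Kibana alerting rule payload. The rule_type_id and params
// shape here are illustrative assumptions, not a verified configuration.
function buildMemoryAlertRule(service: string, thresholdBytes: number) {
  return {
    name: `${service} memory above ${thresholdBytes / 1e6}MB`,
    rule_type_id: "metrics.alert.threshold", // assumed rule type for this sketch
    consumer: "alerts",
    schedule: { interval: "1m" }, // check interval
    tags: ["mcp-app", "kubernetes"],
    params: {
      // Illustrative condition; real params depend on the chosen rule type.
      criteria: [
        {
          metric: "kubernetes.pod.memory.usage.bytes",
          comparator: ">",
          threshold: [thresholdBytes],
        },
      ],
      filterQuery: `service.name: "${service}"`,
    },
  };
}
```

Because the rule is a persistent saved object, it keeps evaluating on its schedule long after the chat session that created it has ended.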

MCP App Architecture

The app is composed of a Node.js server, six model-facing tools wired to six single-file view resources, app-only tools for re-queries, and vite-plugin-singlefile bundling. Tools are grouped by deployment backend (Universal, APM-dependent, K8s-dependent, ML-dependent), so the agent and the user both know up front which tools apply to a given deployment instead of discovering capability gaps at call time. The repo includes six Skills as separate .zip artifacts that teach the agent when and how to call each tool.
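The backend grouping can be sketched as a simple capability gate: given what the deployment supports, only the applicable tools are surfaced. The tool names and the `availableTools` helper below are illustrative, not the app's real registry.

```typescript
// Illustrative capability-based tool gating, mirroring the grouping of tools
// by deployment backend. Tool names and helper are hypothetical.
type Backend = "universal" | "apm" | "k8s" | "ml";

type Capabilities = { apm: boolean; k8s: boolean; ml: boolean };

const TOOL_BACKENDS: Record<string, Backend> = {
  observe: "universal",
  alert_management: "universal",
  service_dependencies: "apm",
  cluster_health: "k8s",
  blast_radius: "k8s",
  anomaly_details: "ml",
};

// Universal tools are always exposed; the rest require their backend.
function availableTools(caps: Capabilities): string[] {
  return Object.entries(TOOL_BACKENDS)
    .filter(([, backend]) => backend === "universal" || caps[backend])
    .map(([name]) => name);
}
```

Gating up front means the agent never plans an investigation around a tool that would fail at call time on this deployment.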

The following diagram shows the three components that make up the app: the MCP host (Claude Desktop, VS Code, or similar), which holds the LLM and the Claude skills that teach it how to use the tools; the MCP app server, a single Node.js process that exposes the tool registry, bundles the React UI views, and handles all communication with Elastic; and the Elastic Stack itself, where Elasticsearch and Kibana serve as the live data and alerting backends.

The diagram below traces the flow of a user request: Claude reads the relevant skill file to understand which tool to call and how to fill its parameters, calls the tool which triggers server-side queries against Elasticsearch and Kibana, and receives back a compact text summary alongside a React UI resource that renders inline as an interactive widget.

From alert to root cause: Investigation Workflows

Alert rules tell you something is wrong. ML modules tell you the pattern. Elastic Workflows run the diagnosis — automatically, the moment an alert fires.

We're shipping a Kubernetes Investigation Workflow (technical preview) that triggers on a Kubernetes alert and returns a structured root cause summary before you've opened a single dashboard. The SRE who gets paged opens the alert and finds the investigation already done.

The workflow is a directed graph of steps that queries multiple data sources — primarily via the Elasticsearch Query Language (ES|QL), with an Elasticsearch search for the ML anomaly lookup. Conditional if steps branch on query results, choosing which corroboration to run (ML memory anomaly vs. log classification) and whether to assess upstream health (only when APM dependencies exist). AI steps appear at three points: classifying log patterns on the non-OOM path, classifying upstream services as degraded vs. healthy, and a final ai.summarize step that synthesizes all structured evidence into a root-cause narrative.
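The branch logic reduces to a couple of small decisions. The sketch below is a hedged illustration of that control flow, assuming the real workflow expresses it declaratively as graph steps; `choosePath` and `corroborationStep` are hypothetical names.

```typescript
// Hedged sketch of the workflow's branching. The real workflow is a
// declarative graph of ES|QL queries and AI steps; names here are illustrative.
type PodContext = { lastTerminationReason: string | null; restartCount: number };

// The termination reason decides which investigation path runs.
function choosePath(ctx: PodContext): "memory-investigation" | "log-investigation" {
  return ctx.lastTerminationReason === "OOMKilled"
    ? "memory-investigation"
    : "log-investigation";
}

// On the memory path, the ML anomaly result decides leak vs. load;
// on the other path, an AI step classifies log patterns instead.
function corroborationStep(
  path: ReturnType<typeof choosePath>,
  mlAnomalyActive: boolean
): "suspected-leak" | "load-driven" | "classify-log-patterns" {
  if (path === "memory-investigation") {
    return mlAnomalyActive ? "suspected-leak" : "load-driven";
  }
  return "classify-log-patterns";
}
```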

What the investigation workflow looks like in practice

The example execution below is based on the OpenTelemetry Astronomy Shop running against Elastic — 16 services, Kafka, PostgreSQL, all pre-instrumented via OTLP. Alongside the Shop's real telemetry, we injected a synthetic OOMKill cascade, which writes synthetic K8s and APM signals into the same namespace via the EDOT data streams. The workflow can't tell our signals from real ones — it just investigates the alert.

Alert fires: CrashLoopBackOff — app-deployment in oteldemo-esyox-default. Restart count: 6.

Workflow step 1 — Characterize pod and container context

The workflow queries K8s metrics for restart count, last termination reason, and utilization against declared limits.

Result: Last termination reason OOMKilled, restart count 6. (Note: kubeletstats utilization was unavailable for this pod/window — the workflow continues gracefully.)

Workflow branches: Termination reason is OOMKilled, so the workflow takes the memory-investigation path, not the log-investigation path.

Workflow step 2a — Consult ML anomaly results

Rather than recomputing memory trends, the workflow queries the ML anomaly index for an active k8s_pod_memory_growth anomaly.

Result: No anomaly — the spike is flagged load-driven, not a suspected leak.

Workflow step 3 — Check upstream service health

The workflow enumerates upstream dependencies from APM service_destination.1m aggregates, then compares current error rate and mean latency against the same hour 7 days ago. An AI classification step decides whether upstream degradation preceded the alert.

Result: One upstream — api-gateway. Current mean latency 15.13 ms, error rate 41.26%. Baseline (168h ago): identical. Classification: upstream_healthy — within 5× error / 3× latency thresholds. Upstream is ruled out.
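The 5× error / 3× latency thresholds translate directly into a classification rule. This is a sketch of that comparison only; `ServiceStats` and `classifyUpstream` are illustrative names, and the real step feeds its inputs from the APM aggregates described above.

```typescript
// Sketch of the upstream health classification, using the 5x error-rate
// and 3x latency thresholds from the workflow. Names are illustrative.
type ServiceStats = { errorRate: number; meanLatencyMs: number };

function classifyUpstream(
  current: ServiceStats,
  baseline: ServiceStats
): "upstream_healthy" | "upstream_degraded" {
  const errorBlown = current.errorRate > 5 * baseline.errorRate;
  const latencyBlown = current.meanLatencyMs > 3 * baseline.meanLatencyMs;
  return errorBlown || latencyBlown ? "upstream_degraded" : "upstream_healthy";
}
```

With api-gateway's current stats identical to its 7-day baseline, neither threshold trips, so the upstream is ruled out.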

Workflow step 4 — Correlate with recent K8s changes

Event log for the namespace shows a tight cycle of Pulled → Created → Started → Killing → BackOff repeating roughly every 60–90 seconds. No deployments or scaling events in the past two hours.

Workflow output:

ROOT CAUSE HYPOTHESIS (confidence: high)

app-deployment is OOMKilling under memory pressure. The pod has restarted
6 times with termination reason OOMKilled. ML flagged the memory spike as
load-driven (no leak). Upstream api-gateway is healthy at current vs 7-day
baseline. This is a resource-allocation issue — the container's memory
limit is too low for its real working set.

Evidence:
- 6 restarts, last termination reason OOMKilled
- No ML memory-growth anomaly → leak_suspected=false (load-driven)
- Upstream api-gateway unchanged vs 7d baseline (15.13 ms, 41.26%) → healthy
- K8s events show tight Pulled/Created/Started/Killing/BackOff cycles;
  no deployments in the last 2h

Likely cause: memory limit insufficient for actual working set under load.

Recommended next steps:
1. Raise the app-deployment memory limit based on observed usage
2. Review application code for memory-optimization opportunities
3. Consider graceful degradation on high-load paths

Downstream impact: none identified from APM destination metrics.

The output above is what the alert looks like when you open it — not a link to a bunch of logs or a dashboard, but an answer.

The same workflow is accessible as an MCP tool from Claude Desktop, VS Code, or any MCP-compatible client. When a developer asks "why is checkout erroring?" from their IDE, the agent calls the workflow and returns the same structured output inline — same evidence, same root cause, without leaving the editor.

Here's an animated walkthrough of the workflow execution:

Observability Skill for Kubernetes investigations

We're also shipping a single, comprehensive investigation Skill (observability-k8s-investigation) that encodes the full diagnostic protocol for Kubernetes workload, node, and control-plane issues. It is an opinionated investigation methodology that captures the reasoning an experienced SRE applies instinctively but rarely writes down. It's baked into our AI Agent skills, so you get it simply by keeping Kibana up to date. It starts with governing principles that prevent the most common misdiagnoses:

  • Absence of evidence is not evidence. If log queries return zero rows, report no_logs_available — don't infer a failure mode from empty results.
  • OOMKilled does not mean memory leak by default. Compare current usage against a 7-day baseline before claiming a leak. The limit may simply be undersized.
  • Average CPU metrics hide throttling. A pod can look healthy at 40–60% average utilization while being severely throttled at p99. Look at max and p95, not just average.
  • Co-symptoms are not causes. Two services degrading simultaneously usually share an upstream cause. Only attribute causation when one service's degradation clearly precedes the other's and the delta is large.
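The second principle has a direct procedural reading: compare current usage against the 7-day baseline before claiming a leak. The sketch below illustrates that check; the 1.5× growth threshold is an illustrative assumption, not a value from the Skill, and the names are hypothetical.

```typescript
// Illustration of "OOMKilled does not mean memory leak by default":
// classify only after comparing against a 7-day baseline. The 1.5x
// growth threshold is an assumed value for this sketch.
type MemoryEvidence = {
  oomKilled: boolean;
  currentUsageBytes: number;
  baseline7dUsageBytes: number;
};

function classifyOomCause(
  e: MemoryEvidence
): "leak_suspected" | "undersized_limit" | "no_oom" {
  if (!e.oomKilled) return "no_oom";
  const growth = e.currentUsageBytes / e.baseline7dUsageBytes;
  // Sustained growth well beyond baseline suggests a leak; otherwise the
  // limit is likely just too small for the real working set.
  return growth > 1.5 ? "leak_suspected" : "undersized_limit";
}
```

Applied to the example investigation earlier in this post, usage near its 7-day baseline plus an OOMKill yields "undersized_limit", which matches the workflow's load-driven conclusion.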

From there, the Skill encodes a failure-mode taxonomy covering 16 distinct K8s failure patterns across workload, node, control-plane, autoscaling, and networking layers — from OOMKilled and CFS throttling through admission webhook blocks and StatefulSet split-brain. Each mode has a pivotal signal that identifies it and a corroboration checklist that confirms it.

The investigation flow follows a structured arc: orient (resolve the target pod, namespace, deployment), characterize (get restart count, termination reasons, utilization), classify (match against the taxonomy), corroborate (pull events, logs, APM, baseline comparisons), and synthesize (produce a root cause hypothesis at calibrated confidence — high, medium, or low — with explicit evidence and recommended next steps).

When two failure modes fit the evidence, the Skill names both and says which it believes is causal and why. When evidence is ambiguous, it says so. "Competing hypotheses are a valid output" is an explicit design principle — manufacturing false confidence is treated as a failure mode of the investigation itself.

Getting started

These capabilities build on the Kubernetes integration described in Part 1. Once you have dashboards and data collection running:

Step 1 — Enable investigation workflows (technical preview). Import the Kubernetes Crashloop Investigation Workflow from the Workflows page in Kibana, and optionally configure it to trigger on an alert rule.

Step 2 — Install the MCP App on an MCP-compatible client (technical preview). The MCP App for Observability repo can be found on GitHub (see the Releases page for downloads). When installing the app, don't forget to also install and enable the included skills. Access the app's tools from your favorite agentic client — instructions are in the README at the GitHub link above.

Step 3 — Leverage the K8s Investigation Skill (technical preview). This one is a freebie if you're using Agent Builder, because it's baked into AI Agent Skills. The Skill teaches the agent when and how to call the underlying tools and workflows, ensuring consistent diagnostics in conversational contexts.

What's next

Investigation workflows diagnose what's broken in the services you're monitoring. The next question is harder: what about the services you're not monitoring?

We're thinking about topology-aware coverage intelligence — automatically discovering every workload deployed in your cluster via the Kubernetes API, cross-referencing against telemetry flowing into Elastic, and surfacing the gap. "You have 47 services. 11 have no distributed traces. Here's your riskiest blind spot." That capability is under consideration and will likely be the subject of a future post.

In parallel, we're extending workflows toward remediation — not just diagnosis but action: creating a case with the investigation summary attached, proposing a rollback for human approval, or scaling a workload to buy time while the root cause is addressed.

If you're running Kubernetes on Elastic today, tell us which investigation steps you repeat manually on every incident, which remediations you'd trust a workflow to propose, and which MCP tools we should build next. You can join the discussion in the Elastic Community.
