Bahubali Shetti

Agentic CI/CD: Kubernetes Deployment Gates with Elastic MCP Server

Deploy agentic CI/CD gates with Elastic MCP Server. Integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability (O11y)

The "Build-Push-Deploy" cycle is never as simple as it sounds. High-availability environments require automated guardrails: proactive checks that prevent a deployment from even starting if the target cluster is under stress. Today these checks are generally performed with APIs and scripts during the CI/CD process. Gates at different stages ensure that application tests have passed, the artifact is clean, the infrastructure is stable, and more.

With AI and agents, these gates are becoming more sophisticated. Increasingly, they rely on a Model Context Protocol (MCP) server to perform the check. This newer, "agentic" approach lets your CI/CD pipeline act as an intelligent agent that asks your cluster for its health status before making a change.

A standard Kubernetes deployment workflow generally follows these high-level steps:

  1. Verification Gate: Ensuring all automated testing has passed.

  2. Artifact Creation: Building the Docker container.

  3. Environment Gate: Verifying that the production Kubernetes environment, supporting infrastructure, and existing applications are healthy.

  4. Kubernetes Deployment: Triggering the final release. Modern workflows often use GitOps tools like ArgoCD or Flux, where a simple image tag update in Docker Hub automatically synchronizes the cluster.
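The gate sequence above can be sketched as a simple decision flow. This is a minimal Python sketch, not the actual workflow; the check functions and return values are hypothetical placeholders:

```python
# Minimal sketch of the gated "Build-Push-Deploy" flow described above.
# The helper functions are hypothetical placeholders, not real APIs.

def build_and_push_image() -> str:
    # Placeholder for the Docker build and push steps.
    return "myapp:latest"

def run_pipeline(tests_passed: bool, cluster_healthy: bool) -> str:
    # 1. Verification gate: all automated tests must pass.
    if not tests_passed:
        return "blocked: tests failed"
    # 2. Artifact creation: build and push the container image.
    image = build_and_push_image()
    # 3. Environment gate: verify the target cluster is healthy.
    if not cluster_healthy:
        return "blocked: cluster unhealthy"
    # 4. Kubernetes deployment: trigger the GitOps sync (e.g., ArgoCD).
    return f"deployed {image}"
```

For example, `run_pipeline(True, False)` stops at the environment gate and returns `"blocked: cluster unhealthy"` without deploying.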

Kubernetes health checks can range from simple to complex depending on your Service Level Objectives (SLOs) and operational maturity. Typically, the primary goal is to ensure the cluster is healthy and not nearing a resource bottleneck. Common "red flag" metrics used in these gates include:

| Red Flag | Scenario | SRE Meaning |
| --- | --- | --- |
| Pod Count > 90% | High pod density | Approaching node-level scheduling limits. |
| CPU Usage > 70% | High real-time load | Risk of CPU throttling during deployment. |
| Memory Usage > 80% | Memory pressure | High risk of Out-of-Memory (OOM) kills. |
| OOM Terminating Processes | Resource limits reached | Inadequate pod configuration or sizing. |
| Available vs. Requested | Capacity imbalance | Risk of deployment failure due to insufficient reserved space. |
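The percentage thresholds in the table translate directly into code. The sketch below evaluates them against a metrics snapshot; the metric names and dictionary shape are illustrative assumptions, not a real API:

```python
# Evaluate the "red flag" thresholds from the table above.
# Metric names and the input dictionary shape are illustrative assumptions.

THRESHOLDS = {
    "pod_count_pct": 90.0,    # Pod Count > 90%
    "cpu_usage_pct": 70.0,    # CPU Usage > 70%
    "memory_usage_pct": 80.0, # Memory Usage > 80%
}

def red_flags(metrics: dict) -> list:
    """Return the names of the thresholds the cluster currently exceeds."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

# A cluster with memory pressure but healthy CPU and pod density:
flags = red_flags({"pod_count_pct": 60.0,
                   "cpu_usage_pct": 45.0,
                   "memory_usage_pct": 85.5})
```

Here `flags` contains only `"memory_usage_pct"`, so a gate built on this check would block the deployment for memory pressure alone.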

In this post, I will show you how to build a CI/CD pipeline that integrates Observability AI agents with GitHub Actions via a Model Context Protocol (MCP) server, creating automated pre-deployment health checks for Kubernetes clusters.

By introducing an observability checkpoint before deployment, we transform the pipeline into an intelligent system that:

  • Queries real-time metrics from Kubernetes clusters

  • Analyzes capacity using custom ES|QL queries

  • Makes autonomous decisions about deployment readiness

  • Prevents failures proactively rather than reacting to them

  • Provides actionable feedback to engineering teams

Here is the architecture of what is deployed in this blog and how it works.

As you can see, the flow uses Elastic Observability, which stores and analyzes Kubernetes OpenTelemetry metrics from the opentelemetry-kube-stack-cluster-stats-collector (deployed via the OpenTelemetry Operator).

GitHub Actions calls the Observability Kubernetes Agent via the Elastic MCP server; the agent has tools that check for some of the "red flag" issues identified in the table above.

Based on the results, GitHub Actions either stops the process or continues to deploy the artifact by triggering ArgoCD.

The Observability Kubernetes Agent, along with some of the tools it uses, was built with Elastic's Agent Builder capability. These are then exposed via the MCP server.

The overall set of components used here includes:

  1. GitHub Actions: Orchestrates the build and deployment workflow

  2. Elastic MCP Server: Serverless endpoint that exposes AI agents

  3. Observability Kubernetes Agent: Custom agent with specialized ESQL tools

  4. Kubernetes Cluster: Target deployment environment with metrics collection

  5. ES|QL Query Tools: Precision queries for node and pod resource analysis

What Happens When a Kubernetes Health Check Fails in GitHub Actions?

How the Pipeline Blocks a Deployment Automatically

When the cluster exceeds capacity thresholds, the workflow automatically blocks the deployment. In this scenario I didn't load the cluster; instead, I used a simple check of whether more than 25% of resources were in use to purposely stop the deployment.

The workflow shows:

  • Build Docker Image (28s)

  • Push to Docker Hub (5s)

  • K8s Health via Elastic O11y K8s Agent (16s) - FAILED

  • Deploy to otel-test Cluster - BLOCKED

Annotation: "Cluster has resource issues - blocking deployment"

What Does the AI Agent's Health Check Response Look Like?

The agent provides detailed analysis:

Step 1: Finding Kubernetes analysis agent...
Found agent: Observability Kubernetes Agent (kubernetes_analysis_agent)

Step 2: Querying cluster health...
Prompt: tell me if my cluster otel-test is using more than 25% memory or CPU on any of its nodes

Agent Response:
================================================================
Yes, your cluster "otel-test" has nodes and pods using more than 25% of resources.

++Node exceeding 25%:++
- ip-192-168-165-175.us-west-2.compute.internal
  - Memory: 36.44%
  - CPU: 7.99% (below threshold)

++All other nodes are below the 25% threshold++ for both CPU and memory.

While the query for pods doesn't show percentage values directly, the data indicates
normal resource usage patterns for the pods in your cluster, with none appearing to
consume excessive resources relative to their allocations.
================================================================

Cluster has resource issues - blocking deployment
Error: Process completed with exit code 1.

As you can see, a prompt was sent to the Observability Kubernetes Agent via MCP, rather than having to build custom logic or call yet another script.
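In the workflow, the gate step only needs to turn the agent's natural-language answer into a pass/fail exit code. The keyword heuristic below is an assumption about how such a step might be written, not the exact implementation used here:

```python
# Decide whether to block the deployment based on the agent's answer.
# The "Yes, ..." / "No, ..." heuristic is an illustrative assumption.

def deployment_allowed(agent_response: str) -> bool:
    """Return False if the agent's answer indicates resource issues."""
    # The agent confirms or denies exceeding the threshold in its
    # first sentence, e.g. 'Yes, your cluster "otel-test" has ...'
    first_line = agent_response.strip().splitlines()[0].lower()
    return not first_line.startswith("yes")

response = ('Yes, your cluster "otel-test" has nodes and pods '
            "using more than 25% of resources.")
if not deployment_allowed(response):
    print("Cluster has resource issues - blocking deployment")
    # In CI, exiting non-zero here (sys.exit(1)) blocks the deploy step.
```

A more robust version could ask the agent for structured JSON output instead of parsing prose, but the exit-code contract with GitHub Actions stays the same.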

This single check prevented:

  • A deployment that would have failed

  • Wasted CI/CD minutes

  • Potential service degradation

  • Manual SRE intervention

What it provided:

  • Actionable intelligence for capacity planning

How to Build a Kubernetes Health Check Agent in Elastic

Building the agent isn't hard; Elastic's Agent Builder UI makes it easy to create one and have it running in minutes.

How to Configure the Observability Kubernetes Agent

Other than naming the agent, you need to provide it with some instructions.

Custom Instructions:

# Agent Instructions

## Primary Role
You are a Kubernetes monitoring assistant that helps users analyze cluster performance
and resource utilization. Your primary goal is to provide clear, accurate information
about Kubernetes clusters using available data sources.

## Tool Selection Guidelines
1. When users ask about Kubernetes metrics, node performance, or cluster health:
   - Use ESQL tools for detailed analysis
   - Query metrics from kubeletstatsreceiver.otel-default

2. For alert-related queries:
   - Use the alerts tool to check active alerts

3. Always provide context about:
   - Time ranges queried
   - Cluster names
   - Resource thresholds

How to Write ES|QL Queries for Kubernetes Node and Pod Metrics

I created several tools that check node CPU and memory, pod CPU and memory, and OOM events from pods. Additionally, the Observability Kubernetes Agent uses a number of the out-of-the-box (OOTB) tools, such as observability_alerts, as part of its abilities.

Here is an example of the node CPU and memory tool, which uses a simple ES|QL query against OpenTelemetry metrics to check the CPU and memory utilization in the cluster.

ES|QL Query:

FROM metrics-kubeletstatsreceiver.otel-default
| WHERE resource.attributes.k8s.cluster.name == ?cluster_name
  AND @timestamp > NOW() - 3 hours
| STATS
    avg_cpu_usage = AVG(metrics.k8s.node.cpu.usage),
    avg_memory_usage = AVG(metrics.k8s.node.memory.usage),
    avg_memory_available = AVG(metrics.k8s.node.memory.available),
    avg_memory_working_set = AVG(metrics.k8s.node.memory.working_set)
  BY resource.attributes.k8s.node.name
| EVAL
    cpu_usage_pct = avg_cpu_usage * 100,
    memory_usage_pct = (avg_memory_working_set / (avg_memory_working_set + avg_memory_available)) * 100
| SORT cpu_usage_pct DESC, memory_usage_pct DESC
| KEEP resource.attributes.k8s.node.name, cpu_usage_pct, memory_usage_pct
| LIMIT 100

Parameters:

  • cluster_name (string): Name of the K8s cluster to analyze
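To make the EVAL step in the query concrete, here is the memory-percentage arithmetic applied in plain Python. The byte values are hypothetical inputs, chosen so the result matches the 36.44% figure the agent reported for the flagged node:

```python
# Reproduce the memory_usage_pct arithmetic from the ES|QL EVAL above:
#   (working_set / (working_set + available)) * 100
# Byte values are hypothetical, picked to reproduce the agent's 36.44%.

def memory_usage_pct(working_set_bytes: float, available_bytes: float) -> float:
    return working_set_bytes / (working_set_bytes + available_bytes) * 100

pct = memory_usage_pct(3_644_000_000, 6_356_000_000)
print(round(pct, 2))  # 36.44
```

Note that the denominator is working set plus available memory, not total node memory; this treats the kubelet-reported available bytes as the remaining headroom.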

How to Expose the Agent via the Elastic MCP Server

Once configured, the agent is automatically available via Elastic's MCP server running in your Observability project. The MCP server provides a standardized interface that any MCP-compatible client can query.

MCP Endpoint:

https://your-elastic-project.elastic.cloud/mcp

Authentication: Uses Elastic API keys for secure access
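From a client's perspective, talking to the MCP server amounts to sending authenticated JSON-RPC 2.0 requests. The payload builder below is a sketch under stated assumptions: the `tools/call` method and `{name, arguments}` params follow the MCP specification, while the tool name, argument key, and endpoint are illustrative:

```python
import json

# Illustrative endpoint (matches the placeholder above); not a real URL.
MCP_ENDPOINT = "https://your-elastic-project.elastic.cloud/mcp"

def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 'tools/call' request as defined by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def auth_headers(api_key: str) -> dict:
    # Elastic API keys are sent in the Authorization header.
    return {"Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json"}

payload = build_tool_call(
    "kubernetes_analysis_agent",
    {"query": "is cluster otel-test over 25% CPU or memory?"},
)
body = json.dumps(payload)
# POST `body` with auth_headers(...) to MCP_ENDPOINT using your HTTP client.
```

Any MCP-compatible client (or a plain HTTP step in GitHub Actions) can send this request; only the API key and project URL change between environments.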

Why Agentic CI/CD Matters for Kubernetes Operations

Agentic CI/CD represents an evolution in proactive deployment strategies. By integrating Elastic Observability AI agents with GitHub Actions via MCP, we've created a system that:

  • Prevents failures before they happen

  • Provides real-time cluster health insights

  • Makes data-driven deployment decisions

  • Reduces operational burden on SRE teams

  • Improves overall deployment reliability

This approach is at the cutting edge of modern CI/CD practices. While traditional pipelines focus solely on the "Build-Push-Deploy" cycle, agentic pipelines introduce automated pre-deployment guardrails using observability data, transforming your CI/CD infrastructure into an intelligent agent that actively protects production environments.

Resources and Next Steps

Sign up for Elastic Cloud Serverless and try this out with your pipeline.

Documentation

Share this article