Adrian Chen, Vu Pham, Emily McAlister

Automated Reliability: The Architecture of Self-Healing Enterprises

Discover how to close the remediation gap using automation and artificial intelligence. Learn to build self-healing systems that detect, analyse, and fix infrastructure issues automatically. Improve system reliability and eliminate manual operations today.


Your enterprise systems are built for speed. You have adopted microservices to decouple teams, moved to the cloud to scale infrastructure on demand, and deployed Kubernetes to orchestrate containers. You have achieved velocity, but you likely sacrificed clarity. The complexity of your distributed environment now exceeds the cognitive capacity of any single human operator. When systems break, your Mean Time To Resolution (MTTR) suffers not because you lack data, but because the gap between detection and action is too wide.

This guide outlines the strategy to close that gap. It details how you move from passive Observability—watching a dashboard turn red—to proactive auto-remediation, where the system identifies, analyses, and fixes itself. You will see how Elastic Workflows, Elastic Observability, out-of-the-box (OOTB) Observability Agents, and custom agents in Agent Builder converge to solve the "Trust Deficit" that plagues traditional automation. We examine the technical implementation of self-healing systems and the business imperatives that make this transition necessary for the modern Chief Information Officer (CIO). You will learn to construct a reliability architecture that acts faster than your best engineer.

The Entropy of Scale

The Failure of Manual Operations

Systems generate telemetry at machine speed—millions of log lines, metric data points, and traces per second. Yet, your response to failure operates at human speed. You rely on an on-call engineer to wake up, interpret a notification, log in to a VPN, navigate a console, and execute a command. This latency is the "Remediation Gap". In that gap, revenue bleeds, customer trust erodes, and technical debt accumulates.

Manual remediation fails at scale for three specific reasons: cognitive overload, context switching, and the fear of action.

Cognitive Overload and the Signal-to-Noise Problem

Your Observability tools ingest petabytes of data. Finding the signal in that noise is the primary challenge for your Site Reliability Engineers (SREs). Traditional alerting relies on static thresholds, where a rule is set at a specific value, for example, "Alert if CPU > 90%." In a dynamic cloud environment, static thresholds are insufficient because they generate noise from false positives. A database compaction process might spike CPU usage safely every night, or a Java application might naturally consume memory up to its heap limit before garbage collection. This behaviour is perfectly normal and expected, yet it creates confusion and fatigue for SREs who are repeatedly alerted to false-positive issues.
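To make the contrast concrete, here is a minimal, illustrative Python sketch (this is not Elastic's ML; all function names, thresholds, and data are assumptions for illustration). It compares a static threshold against a simple seasonal baseline that compares each sample with the same hour on previous days:

```python
import statistics

def static_alerts(cpu, threshold=90):
    """Static rule: alert whenever CPU exceeds a fixed threshold."""
    return [i for i, v in enumerate(cpu) if v > threshold]

def seasonal_alerts(cpu, period=24, min_history=2, floor=5.0, sigmas=3):
    """Seasonal baseline: compare each sample with the same hour on
    previous days, so a routine nightly spike is 'normal' for that hour."""
    alerts = []
    for i in range(min_history * period, len(cpu)):
        # Same-hour samples from earlier days
        hist = [cpu[j] for j in range(i - period, -1, -period)]
        mean, sd = statistics.mean(hist), statistics.pstdev(hist)
        if cpu[i] > mean + sigmas * max(sd, floor):
            alerts.append(i)
    return alerts

# Four days of hourly CPU: a safe compaction spike every midnight,
# plus one genuine anomaly at hour 90 (a normally quiet 18:00).
cpu = [40.0] * 96
for day in range(4):
    cpu[24 * day] = 95.0
cpu[90] = 95.0

print(static_alerts(cpu))    # flags every routine midnight spike as well
print(seasonal_alerts(cpu))  # flags only the genuine anomaly at hour 90
```

The static rule pages someone for every routine midnight compaction; the seasonal baseline fires only for the spike that breaks the learned daily pattern. Elastic's anomaly detection models baselines far more robustly than this sketch, but the principle is the same.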

When you bombard engineers with these low-fidelity alerts, you create "Alert Fatigue", where SREs become overwhelmed with large volumes of alerts, many of them false positives. When this happens, SREs stop trusting the pager. They treat alerts as suggestions rather than mandates, which slows the response to real issues. Many organisations look to automation as the solution to alert fatigue, to fast-track response to issues in the environment. However, when alerts are inaccurate and untrustworthy, automation creates more problems than it solves. If you automate a restart based on a false positive, you turn your monitoring system into a chaos monkey that attacks your own infrastructure.

The Cost of Context Switching

Consider the anatomy of a typical incident. An alert fires in your Observability platform. The engineer sees "High Latency on Service Checkout." Now the toil begins. They open a new tab for the infrastructure provider to check the pod status. They open another tab for the APM traces. They open Jira to see recent changes. They open Slack to ask if anyone pushed code as part of a change.

This fragmentation of data is expensive. Every context switch breaks the analyst's flow, widening the Remediation Gap through delays in determining the root cause and the fix. In a major incident, you need analysts focused on resolving the issue at hand, not losing time switching between tools and data sets trying to work out what has happened. You need the context and the control plane to exist in the same interface so analysts can identify and resolve issues as fast as possible. The separation of "looking" and "doing" forces your engineers to act as human routers, copying IDs and error messages between incompatible tools.

The Blast Radius and the Fear of Automation

You hesitate to turn on auto-remediation because you fear the unknown impact of automation operating outside the expected bounds. A script that restarts a service works fine when one node fails. If a global configuration error causes all nodes to fail simultaneously, that same script might trigger a cascading restart loop that takes down the entire platform.

This fear leads to "Change Management Paralysis". You implement rigid approval boards and manual checks to prevent automation from running amok. You trade speed for reliability, but ultimately lose both. The solution is not to avoid automation, but to implement "Safe Automation" that understands dependency chains and business impact.
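One way to build such "Safe Automation" is a blast-radius guardrail in front of every automated action. The sketch below is illustrative Python, not an Elastic API; the function name and the 25% threshold are our assumptions. It refuses to remediate when too much of the fleet is failing at once, on the theory that a fleet-wide failure signals a systemic cause that needs a human:

```python
def safe_to_remediate(failing_hosts, fleet_size, max_blast_radius=0.25):
    """Guardrail: allow automation only for isolated failures.
    A fleet-wide failure likely means a global config error, where an
    automated restart loop would make things worse."""
    if fleet_size == 0:
        return False
    return len(failing_hosts) / fleet_size <= max_blast_radius

# One sick node out of twenty: safe to restart automatically.
print(safe_to_remediate(["node-07"], 20))                       # True
# Eighteen of twenty failing at once: stop and page a human.
print(safe_to_remediate([f"node-{i}" for i in range(18)], 20))  # False
```

The same idea appears later in Scenario B, where the workflow counts healthy peers before deciding how loudly to escalate.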

The Architecture of Action

Unified Observability and Remediation

To reduce the Remediation Gap, you must combine your observability data and action systems. You cannot have one tool for seeing and another for acting. Elastic Observability and Elastic Workflows provide this unification, enabling organisations to take intelligent response actions that proactively remediate issues in the environment. You ingest logs, metrics, and traces into Elastic, where machine-learning and AI-driven analysis runs out of the box against your environment's behaviour baseline. From there, Elastic uses AIOps to detect anomalies with high precision and engages proactive notifications and automations to resolve the issue quickly. You use AI to diagnose the root cause and determine the resolution steps. You use Workflows to execute the fix.

Combining Elastic Observability, AI & Agent Builder, and Workflows creates a closed-loop architecture for automatically resolving issues:

  1. Sense: Ingest telemetry.
  2. Think: Analyze with Machine Learning and AI.
  3. Act: Trigger a Workflow.
  4. Verify: Measure the result with telemetry.

Elastic Workflows: The Orchestration Engine

Elastic Workflows is the mechanism that turns insight into action. It sits directly within the Elastic platform, which means it has zero-latency access to your data. This matters when taking intelligent response actions under time pressure: the automation receives full context for analysis and can execute informed, logical decisions.

A Workflow acts as a structured "recipe" for automation. It defines a sequence of steps that execute based on a trigger. Workflows are defined in YAML, which allows you to treat operations as code. You version control them, review them in Pull Requests, and test them before deployment.

AIOps and SLOs: The Vitals of the Enterprise

You solve the "Trust Deficit" by moving away from static thresholds and adopting Service Level Objectives (SLOs) and AIOps.

Understanding SLOs through the "Vital Signs" Analogy

Think of your IT system as a patient in a hospital. An SLO is the target for the patient's vital signs (e.g., "Heart rate must stay between 60 and 100").

  • SLI (Service Level Indicator): This is the heart monitor. It measures the reality (e.g., "Current heart rate is 105").
  • Error Budget: This is the patient's resilience. The patient can tolerate a heart rate of 105 for a few minutes (budget consumption) without permanent damage. However, if it stays there for an hour (budget exhaustion), you have a crisis.

In auto-remediation, we use the Burn Rate of this budget to determine urgency. A slow burn (minor degradation) is like a slight fever; you treat it with medication (email notification) in the morning. A fast burn (outage) is cardiac arrest; you treat it with a defibrillator (immediate pager/auto-restart) instantly.
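In code, this triage logic reduces to a burn-rate calculation. The sketch below is illustrative Python (the 14.4 fast-burn threshold is a common SRE rule of thumb for a one-hour window on a 30-day SLO; the function names are our assumptions, not an Elastic API):

```python
def burn_rate(observed_error_rate, slo_target):
    """How many times faster than 'allowed' the error budget is burning.
    A rate of 1.0 spends exactly the full budget over the SLO window."""
    error_budget = 1.0 - slo_target      # 99.9% SLO -> 0.1% budget
    return observed_error_rate / error_budget

def triage(rate, fast_burn=14.4, slow_burn=1.0):
    """Route by urgency: defibrillator, medication, or observation."""
    if rate >= fast_burn:
        return "page"    # cardiac arrest: immediate pager / auto-restart
    if rate >= slow_burn:
        return "email"   # slight fever: handle in the morning
    return "ignore"

print(triage(burn_rate(0.02, 0.999)))    # 2% errors vs 0.1% budget -> page
print(triage(burn_rate(0.002, 0.999)))   # burning 2x the budget -> email
```

Scenario A below wires exactly this kind of routing decision into a workflow, with business hours as an extra input.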

Elastic Agent Builder: The Cognitive Engine

Standard automation is brittle because it is deterministic. It follows strict rules: "If X, do Y." Real-world incidents are messy. They require probabilistic reasoning and adaptation. Elastic Agent Builder provides this capability using Generative AI and Retrieval Augmented Generation (RAG).

Agent Builder allows you to define multiple agents that can be used to build agentic AI experiences. It does more than just "chat": it integrates into Elastic Workflows as a reasoning agent that can interact with other agents as needed.

  • Contextual Analysis: It looks at the specific logs related to an alert. It deciphers cryptic stack traces. It determines whether there are related alerts or cases.
  • Institutional Memory: It searches your internal runbooks, Jira tickets, and Confluence pages (ingested into Elasticsearch). It finds that "Error 503 on Service Payment" was fixed last month by rotating a specific key.
  • Code Generation: It writes the specific ES|QL query needed to verify the impact of the issue.

Technical Implementation Scenarios

We will now look at two distinct scenarios: an Application-Level issue (SLO breach) and an Infrastructure-Level issue (Disk Space Exhaustion).

Scenario A: Application SLO Breach (Smart Escalation)

The Objective: Handle a service degradation intelligently. If the service is merely "degraded" (breaching a warning threshold but not down) and the time is near the start of the business day, we route the alert to email to avoid waking up engineers. If the service is "down" or if it is a critical off-hours failure, we page the on-call team.

Workflow Logic:

  1. Trigger: SLO Burn Rate alert fires.
  2. Check Context: Calculate if it is "Start of Business Day."
  3. Decision: If Alert is Warning AND Time is Start of Day → Email. Else → PagerDuty.

Workflow YAML Definition:

name: Smart Escalation
enabled: false
description: Escalation based on business hours
triggers:
  - type: alert

steps:
  # Check for start day
  - name: start_day_check
    type: console
    with:
      message: |
        {%- assign hour = "now" | date: "%H", "Australia/Sydney" | plus: 0 %}
        {%- if hour >= 7 and hour <= 9 %}
          true
        {%- else %}
          false
        {%- endif %}
  - name: get_severity
    type: console
    with:
      message: '{{event.alerts[0]["kibana.alert.severity"] | default: "low"}}'
  # routing logic
  - name: routing_logic
    type: if
    condition: 'steps.get_severity.output: critical AND steps.start_day_check.output: false'
    steps:
      - name: hard_notification
        type: pagerduty
        connector-id: "# Enter connector UUID here"
        with:
          eventAction: "trigger"
          severity: "{{steps.get_severity.output}}"
          summary: 'CRITICAL: {{event.alerts[0]["monitor.name"]}} SLO breached'
        
    else:
      - name: soft_notification
        type: email
        connector-id: Elastic-Cloud-SMTP
        with:
          to: ["<your_email>"]
          subject: '{{event.alerts[0]["kibana.alert.rule.name"]}} Failed'
          message: 'Reason: {{event.alerts[0]["kibana.alert.reason"]}}'

Scenario B: Infrastructure Auto-Remediation (Disk Space)

The Challenge:

In a large enterprise, "Out of Disk Space" is a silent killer. When /var/log fills up:

  • Nginx crashes because it cannot write access logs.
  • MySQL panics because it cannot write transaction logs.
  • SSH fails because it cannot write to /var/log/auth.log, locking you out of the server you need to fix.

By the time the server stops responding to ping, it is too late. You need to catch the trend while the server is still reporting to your observability solution.
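Catching the trend is a simple extrapolation problem. This illustrative Python sketch (the function name and the linear-growth assumption are ours, not an Elastic feature) estimates time-to-full from recent hourly disk-usage samples, the kind of signal you want to alert on before the crash:

```python
def hours_until_full(usage_pct, capacity=100.0):
    """Linear extrapolation over hourly disk-usage samples (percent).
    Returns None when usage is flat or shrinking."""
    if len(usage_pct) < 2:
        return None
    growth_per_hour = (usage_pct[-1] - usage_pct[0]) / (len(usage_pct) - 1)
    if growth_per_hour <= 0:
        return None
    return (capacity - usage_pct[-1]) / growth_per_hour

# /var/log climbing ~2% an hour and already at 90%: five hours to act.
print(hours_until_full([80, 82, 84, 86, 88, 90]))  # 5.0
print(hours_until_full([55, 55, 55]))              # None (stable)
```

Five hours is plenty of time for an automated cleanup; five minutes after SSH stops writing auth logs is not.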

Note: To allow grouping of the infra service, we utilized the Add custom field feature of the Elastic Agent policy to add the my_org.custom.service field.

The Workflow:

  1. Phase 1: Context & Health Verification
    • Detect: Elastic detects disk usage > 90% over time
    • Verify (Freshness): Confirm the host is still sending logs (it hasn't crashed yet).
    • Impact Analysis (HA): Check if other instances of this service are healthy. If 9/10 nodes are healthy, this is a P3 issue. If 1/2 nodes are healthy, it is P1.
  2. Phase 2: Remediate
    • Remediate: Trigger an Ansible Tower playbook to rotate logs and clear cache.
    • Audit: Log the result.

Workflow YAML Definition:

name: "Disk_Space_Remediation"
description: "Detects full disk, checks HA status, and triggers Ansible Tower cleanup."
enabled: true
consts:
  ansible_webhook: "<your_ansible_url>"
triggers:
  - type: alert
steps:
  # ---------------------------------------------------------
  # Phase 1: Context & Health Verification
  # ---------------------------------------------------------
  # Check 1: Are logs still flowing? (Avoid false positives from dead nodes)
  - name: check_log_freshness
    type: elasticsearch.esql.query
    with:
      query: |
        FROM logs-*
        | WHERE host.name == "{{event.alerts[0]['host.hostname']}}"
        | STATS latest_log = MAX(@timestamp)
        | EVAL latency_sec = (TO_LONG(NOW()) - TO_LONG(latest_log)) / 1000
        | LIMIT 1
  # Get how many hosts serving that service within 24h
  - name: count_service_hosts
    type: elasticsearch.esql.query
    with:
      query: |
        FROM logs-*
        | WHERE @timestamp >= (NOW() - TO_TIMEDURATION("24 hours")) AND my_org.custom.service == "{{event.alerts[0]['kibana.alert.grouping'].my_org.custom.service}}"
        | STATS COUNT_DISTINCT(host.hostname)
  # Get how many hosts of the service have alerts within the last 15m 
  - name: count_distinct_alerts
    type: elasticsearch.esql.query
    with:
      query: |
        FROM .alerts-observability.metrics.alerts-default
        | WHERE @timestamp >= (NOW() - TO_TIMEDURATION("15 minutes")) AND kibana.alert.grouping.my_org.custom.service == "{{event.alerts[0]['kibana.alert.grouping'].my_org.custom.service}}"
        | STATS COUNT_DISTINCT(kibana.alert.grouping.host.name)
  # ---------------------------------------------------------
  # Phase 2: Remediation
  # ---------------------------------------------------------
  - name: routing_logic
    type: if
    condition: 'steps.check_log_freshness.output.values[0][1] <= 300'
    steps:
      # Action: Trigger Ansible Tower Job Template (Log Cleanup)
      # We pass the hostname as an 'extra_var' to the playbook
      - name: trigger_ansible
        type: http
        with:
          url: "{{consts.ansible_webhook}}"
          method: POST
          body:
            extra_vars:
              target_host: "{{event.alerts[0]['host.hostname']}}"
              remediation_type: clear_var_log
          headers: 
            Accept: application/json
            Content-Type: application/json
      # Action: Notify SRE with Context (SLO/HA)
      # If HA is healthy (>75% of the nodes), send standard notification. If HA is at risk, escalate.
      - name: notify_result
        type: console
        with:
          message: |
            {%- assign num_alerted_hosts = steps.count_distinct_alerts.output.values[0][0] | plus: 0 %}
            {%- assign num_service_hosts = steps.count_service_hosts.output.values[0][0] | plus: 0 %}
            {%- assign critical_ratio = num_alerted_hosts | times: 1.0 | divided_by: num_service_hosts %}
            {%- if critical_ratio >= 0.25 %} #ops-critical {%- else %} #ops-alerts{%- endif %}
            *Auto-Remediation Triggered: Disk Space*
            *Host:* {{event.alerts[0]['host.hostname']}}
            *Service:* {{event.alerts[0]['kibana.alert.grouping'].my_org.custom.service}}
            *Number of alerted instances:* {{steps.count_distinct_alerts.output.values[0][0]}}/{{steps.count_service_hosts.output.values[0][0]}}.
            *Action:* Ansible Playbook ID 15 triggered.
    else:
      # Fallback: If logs are stale, the agent might be dead. Manual intervention needed.
      - name: manual_escalation
        type: console
        with:
          message: |
            CRITICAL: Host {{event.alerts[0]['host.hostname']}} Unresponsive & Disk Full

Note: The notify_result step uses a console placeholder to consolidate demo output.

Analysis of the Infrastructure Scenario

This workflow demonstrates a mature "self-healing" capability:

  1. Predictive vs. Reactive: By triggering at 90% (a safe threshold before systems start to crash), the workflow fixes the issue before the OS locks up the filesystem.
  2. Safety Checks: The check_log_freshness step ensures we don't try to run a remote playbook on a server that has already disconnected, which would likely fail and create noise.
  3. Business Awareness (HA): The notification logic (built on count_service_hosts and count_distinct_alerts) understands priority. A disk issue on 1 of 50 web servers is a standard alert. A disk issue on 1 of 2 database nodes is a critical emergency. The workflow adjusts the Slack channel destination dynamically based on this reality.
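The routing decision itself is small enough to restate in Python. This sketch mirrors the Liquid logic in the notify_result step (the 0.25 ratio comes from the workflow above; the function name is our own):

```python
def pick_channel(alerted_hosts, service_hosts, critical_ratio=0.25):
    """Escalate when a large fraction of a service's fleet alerts at once:
    1 of 50 web servers is routine, 1 of 2 database nodes is an emergency."""
    if service_hosts and alerted_hosts / service_hosts >= critical_ratio:
        return "#ops-critical"
    return "#ops-alerts"

print(pick_channel(1, 50))  # "#ops-alerts"  (2% of the fleet affected)
print(pick_channel(1, 2))   # "#ops-critical" (half the fleet affected)
```

Note that the arithmetic must be floating-point; Liquid's divided_by truncates integer operands, which is why the workflow multiplies by 1.0 before dividing.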

Scenario C: AI-Augmented Triage (The "Digital Specialist")

The Challenge:

Application errors are often ambiguous. An alert says "Payment Failed: Error 9001." No dashboard explains this. The answer lies in a PDF runbook titled Legacy Payment API v2.0 stored on a forgotten SharePoint or Wiki, or in experienced SRE or application team members' heads.

Usually, the SRE wastes 30 minutes searching for this document or finding a team member with the contextual background knowledge.

The Solution:

By combining Search and Generative AI, we can couple telemetry and context into a single system. Using a Retrieval Augmented Generation (RAG) approach, Elastic surfaces relevant contextual information alongside telemetry through natural language queries, supporting analysts without tool switching. We index our runbooks and documentation into Elastic, storing contextual information for use by AI Agents as part of intelligent automation workflows. In this scenario, when the workflow triggers, it searches an index ('sre-knowledge-base') to fetch any applicable runbooks or documented knowledge articles relevant to the issue. This information is then handed to the AI Agent to synthesise a fix and provide the analyst with recommended action steps.

The Workflow:

  1. Phase 1: Retrieve Knowledge
    • Trigger: "Application Error" alert fires.
    • Retrieve Context (RAG): The workflow performs a Semantic Search against the ‘sre-knowledge-base’ index using the error message from the alert to find relevant contextual information about the error.
  2. Phase 2: AI Analysis
    • Synthesize (AI): The workflow passes the Alert Logs and the Retrieved Runbook Snippets to the AI Agent, along with a prompt which asks: "Based on these logs and this runbook, what is the fix?"
  3. Phase 3: Communication
    • Notify: The workflow posts the analysis to Slack.

Workflow YAML Definition:

name: AI_Runbook_Assistant
enabled: true
description: Uses RAG to find runbooks for unknown errors and suggests fixes.
triggers:
  - type: alert

steps:
  # ---------------------------------------------------------
  # Phase 1: Retrieve Knowledge (The "Memory" Lookup)
  # ---------------------------------------------------------
  
  # Perform a Semantic Search against internal runbooks
  # We use the alert reason as the search query

  - name: search_knowledge_base
    type: elasticsearch.esql.query
    with:
      query: |
        FROM sre-knowledge-base*
        | WHERE semantic_text : "What caused '{{event.alerts[0]['kibana.alert.rule.parameters'].criteria[0].value}}' of '{{event.alerts[0]['kibana.alert.grouping'].service.name}}' and if available, how to resolve?"
        | KEEP title, text
        | LIMIT 3

  # ---------------------------------------------------------
  # Phase 2: AI Analysis (The "Specialist" Reasoning)
  # ---------------------------------------------------------
  # Pass the retrieved knowledge + live logs to the AI
  - name: ai_analysis
    type: ai.agent
    with:
      agent_id: observability.agent
      message: |
        ISSUE DETECTED:
        {{ event.alerts[0]["kibana.alert.context"].conditions}} for {{event.alerts[0]['kibana.alert.grouping'].service.name}} 

        RELEVANT INTERNAL RUNBOOKS FOUND:
        {% for doc in steps.search_knowledge_base.output.values limit: 3 offset: 0 %}
        - Title: {{doc[0]}}
          Content: {{doc[1]}}
        {% endfor %}

        TASK:
        Analyze the issue. If the runbooks provide a solution, summarize it in a concise manner. 
        If not, suggest standard triage steps.
        Provide the output in Slack Markdown format.

  # ---------------------------------------------------------
  # Phase 3: Communication
  # ---------------------------------------------------------
  
  - name: notify_slack
    type: console 
    with:
      message: |
        channel: "#incident-war-room"
        text: |
          🚨 *Incident Detected*
          *Issue:* {{ event.alerts[0]["kibana.alert.context"].conditions}} for {{event.alerts[0]['kibana.alert.grouping'].service.name}} 
        
          🤖 *AI Agent Analysis:*
          {{ steps.ai_analysis.output.content }}
          

Note: The notification to Slack is simulated using a console step to consolidate demo output.

Analysis of the AI-Augmented Scenario

This workflow changes the nature of incident response:

  1. Instant Context: It automates the "Search" phase of troubleshooting. The engineer enters the war room and the relevant runbook is already there.
  2. Semantic Understanding: Unlike traditional keyword search, the semantic query in the search_knowledge_base step finds the required runbook even if the wording doesn't match exactly (e.g., searching for "Login Fail" finds "Authentication Timeout").
  3. Automated AI Triage: Leverage AI agents for first pass analysis and suggested remediation steps, giving analysts a head start on identifying the root cause and remediation steps. Get the output automatically sent to the engineers for proactive response action.
  4. Reduced MTTR: By putting the solution in front of the engineer immediately, you eliminate the initial investigation delay.

The screenshots below show the transparency Workflows provides across the data analysed and transferred in each step of the process. You can see how Search is used to retrieve the appropriate run book, which is then incorporated with automated AI analysis for a comprehensive notification to engineers, which contains initial triage and remediation steps.

Figure 1: Output from automated Search for runbooks using Workflow

Figure 2: Output from automated AI triage of error using Elastic Workflows

Figure 3: AI Analysis, which is sent to Slack using Elastic Workflows

The Agentic Shift: Integrating Elastic Agent Builder

While workflows are powerful, they are deterministic. They follow the "If X, then Y" logic you explicitly code. However, modern operations often require probabilistic reasoning—the ability to figure out "X" in the first place. Additionally, RAG-based architectures often rely on a one-shot prompt-and-response approach, which can be limiting for complex scenarios. This is where Elastic Agent Builder enables agentic AI experiences, in which AI Agents reason and act autonomously across a series of steps to produce more accurate responses.

Bridging Reasoning and Action

Elastic Agent Builder allows you to create specialised AI agents that live within your Elastic environment. These agents have secure access to data in your Elastic deployment - both telemetry and unstructured information in the knowledge base for contextual background. More crucially, they can dynamically query this information and access Workflows as Tools, allowing them to assess, reason, and then take action appropriately.

This creates a powerful bidirectional architecture:

  • Elastic Workflows call Agents: A workflow pauses to ask an agent for further investigation and decision (e.g., "Is this log pattern malicious?").
  • Agents call Elastic Workflows: An agent, during a chat with an SRE, calls a workflow to execute a safe, pre-approved action (e.g., "Run the Disk Cleanup Workflow").

The "Hands" of the AI (MCP)

A major risk with AI in operations is hallucination. You do not want an LLM to guess commands and steps to execute. When using AI in automation, we need to give it tools and information to ground the LLM and ensure responses and actions are accurate and controlled. You solve this with the Model Context Protocol (MCP).

MCP is the standard that allows your Agent to connect to external systems safely. Instead of giving the Agent raw shell access, you give it a "Tool" called cleanup_disk, which has a specific command attached to it in the external system. Different types of tools can be defined, including Workflows, ES|QL queries or index patterns that expose a set of indices to the Agent. For example, instead of searching the ‘sre-knowledge-base’ index as a step in the workflow, a tool could be created to allow Agents to query the data autonomously when asked to do analysis. Elastic has the ability to serve tools via MCP to other systems as needed, as well as to native Agents defined in Agent Builder. For example, a ‘cleanup_disk’ tool maps directly to the deterministic workflow we built in Scenario B, which can be used as part of a revised AI-powered remediation flow:

The Human-in-the-loop Interaction Flow:

  1. SRE: "Agent, the checkout service is failing."
  2. Agent: Queries logs (using ES|QL tool). "It looks like the disk is full on node-01."
  3. SRE: "Fix it."
  4. Agent: "I will run the Disk_Space_Remediation workflow for node-01." (The agent invokes the workflow via MCP, where it is defined as the ‘cleanup_disk’ tool).
  5. Elastic Workflows: Executes the Ansible job. (Deterministic, logged, and safe).

The Self-Healing Flow:

  1. Alert: "The checkout service is failing."
  2. Alert Action: "Run the SRE Agent Runbook workflow"
  3. Elastic Workflows: Pass the alert and its context to the SRE agent.
  4. SRE Agent: Queries logs (using ES|QL tool). "It looks like the disk is full on node-01."
  5. SRE Agent: I have authorization and the tool to remediate this.
  6. SRE Agent: "I will run the Disk_Space_Remediation workflow for node-01." (The agent invokes the workflow via MCP, where it is defined as the ‘cleanup_disk’ tool).
  7. Elastic Workflows: Executes the Ansible job. (Deterministic, logged, and safe).

Implementation: Connecting Agent Builder

To enable this, you register your workflows as tools in the Agent Builder UI.

  1. Define the Tool: In Agent Builder, create a new tool. Select "Elastic Workflow", and provide the tool name and description. The description should be informative, as the AI Agents use it to understand how the tool works.
  2. Bind the Workflow: Choose the Disk_Space_Remediation workflow from the drop-down.
  3. Define Schema: Tell the agent what inputs the workflow needs (e.g., target_host).
  4. Deploy: Create the AI Agent in the Agent Builder UI, which has access to the tool we just created. It can now execute the defined workflow, ensuring a sequence of actions is completed in a deterministic manner.

Custom Knowledge Base Tool creation API:

POST kbn:/api/agent_builder/tools
{
  "id": "search_knowledge_base",
  "type": "esql",
  "description": "Semantic Search against internal runbooks",
  "tags": [
    "knowledge-base",
    "observability"
  ],
  "configuration": {
    "query": """FROM sre-knowledge-base*
        | WHERE semantic_text : "What caused '?alert_criteria' of '?service_name' and if available, how to resolve?"
        | KEEP title, text
        | LIMIT 3""",
    "params": {}
  }
}

Custom Cleanup Tool creation API:

POST kbn:/api/agent_builder/tools
{
  "id": "cleanup_disk",
  "type": "workflow",
  "description": "Predefined Disk Space remediation workflow to perform the required cleanups. ",
  "tags": [],
  "configuration": {
    "workflow_id": "workflow-0f1f5ba7-0e75-4469-92b1-b85a032074e5",
    "wait_for_completion": true
  }
}

Custom SRE Agent creation API:

POST kbn:/api/agent_builder/agents
{
  "id": "sre_agent",
  "name": "SRE Agent",
  "description": "An SRE agent to analyze issues, provide summarized solutions from runbook where they exist or suggest standard triage steps when they don't exist. ",
  "labels": [],
  "avatar_color": "#61A2FF",
  "avatar_symbol": "🕵",
  "configuration": {
    "instructions": """You are a Senior SRE.

The issues/alerts detected will be provided in the following format
ISSUE DETECTED:
<alert_criteria> for <service_name>

Use the search_knowledge_base tool to extract known runbooks based on the alert_criteria and service name. 
For each runbook result, format it as follows.
- Title: <title from search_knowledge_base>
   Content: <text from search_knowledge_base>

TASK:
Analyze the issue. If the runbooks provide a solution, summarize it in a concise manner. 
If not, suggest standard triage steps. If the determined solution is to remediate via a disk cleanup, use the cleanup_disk tool. 
Provide the output in Slack Markdown format. Avoid unnecessary token usage; be concise whilst descriptive. """,
    "tools": [
      {
        "tool_ids": [
          "platform.core.get_workflow_execution_status",
          "observability.get_alerts",
          "search_knowledge_base",
          "cleanup_disk"
        ]
      }
    ]
  }
}

Workflow YAML Definition:

version: "1"
name: SRE Agent Runbook
description: Use a custom SRE agent to suggest fixes and perform disk cleanup where appropriate.
enabled: true
triggers:
  - type: alert
steps:
  - name: sre_analysis
    type: ai.agent
    with:
      agent_id: sre_agent
      message: |
        ISSUE DETECTED:
        {{ event.alerts[0]["kibana.alert.context"].conditions}} for {{event.alerts[0]['kibana.alert.grouping'].service.name}} 
  - name: notify_slack
    type: console
    with:
      message: |+
        channel: "#incident-war-room"
        text: |
          🚨 *Incident Detected*
          *Issue:* {{ event.alerts[0]["kibana.alert.context"].conditions}} for {{event.alerts[0]['kibana.alert.grouping'].service.name}} 

          🤖 *SRE Agent Analysis:*
          {{ steps.sre_analysis.output }}

Note: The notification to Slack is simulated using a console step to consolidate demo output.

The screenshots below show how we create custom tools for the agent, then create a custom agent scoped to specific tooling. This level of control simplifies the workflow and sets up the framework for how users and agents interact, and where their boundaries lie.

Figure 4: Create a custom search_knowledge_base tool

Figure 5: Create a custom cleanup_disk tool

Figure 6: Create a custom SRE Agent with defined tools and Instructions

Figure 7: Leverage the custom SRE Agent within Elastic Workflows

This shifts your operating model from reactive, manual human-driven analysis and action to a proactive, automated analysis and action model. Using the tools explored in this article, Elastic brings together data, context and remediation for an intelligent, automated response that is reliable and controlled, with the option for ‘Human-in-the-loop’ check points or full AI-driven automation. This supports analysts in bridging the remediation gap and reducing MTTR by using AI and automation to augment root cause analysis efforts.

Conclusion

You operate in an era when the complexity of your systems has outpaced the capacity for manual management. The "Remediation Gap" is the single largest source of inefficiency in your IT operations. You cannot hire enough engineers to close it. You must close it with automation.

Proactive operations rely on bringing together telemetry data, contextual information and automation and AI capabilities to reduce time to repair. Elastic Observability provides the vision across systems in your environment. Elastic Workflows provides the hands to automate response actions. Elastic Agent Builder provides the brain for intelligent automation and AI-assisted triage. By unifying these three elements, you build a system that is not just monitored, but resilient. You move from a reactive posture—waiting for the phone to ring—to a proactive one, where the system heals itself before the customer notices.

For more examples of Elastic Workflows, check out the documentation and this GitHub repository.

Start small. Pick one recurring issue. Build one workflow. Measure the time saved. Trust is built on evidence. Once you prove that the machine can fix the machine, you change the nature of your operations forever.

Sign up for Elastic Cloud Serverless or Elastic Cloud and try this out.
