Automate root cause analysis for an observability alert
This guide walks through building an observability workflow that responds to an alert by running an Elastic Agent Builder agent for root-cause analysis, generating a case title and description from the agent's output, opening the case, and attaching both the alert and the agent's reasoning trace as comments.
The workflow is adapted from root-cause-analysis-rca-workflow.yaml in the elastic/workflows library.
If you're new to workflows, complete Build your first workflow first.
- Permissions. All on Analytics > Workflows, Observability > Cases, and whatever Agent Builder privilege is required to invoke agents in your space. Refer to Kibana privileges.
- Alerting rule. A configured observability alerting rule that fires on the conditions you want to auto-investigate (metric thresholds, SLO burn rate, anomaly detection, or custom query).
- SRE agent. An Elastic Agent Builder agent configured to investigate observability signals. The source workflow uses an agent named sre-agent. Substitute your agent ID.
- Attach the workflow to the rule. After saving the workflow, attach it as an action on the alerting rule. Refer to Alert triggers.
The workflow runs in a single pass when an alert fires:
- An alert trigger starts the workflow with the alert payload at event.
- An ai.agent step runs an initial analysis of the alert and returns the agent's conversation ID so follow-up steps can continue the same conversation.
- Two more ai.agent calls reuse the conversation to generate a case title and a case description.
- A cases.createCase step opens the case with the agent-generated title and description.
- cases.addAlerts attaches the triggering alert.
- A kibana.request step fetches the agent's conversation transcript.
- cases.addComment steps append the agent's reasoning trace and the raw analysis as comments for auditability.
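As a rough orientation, the parts of the alert payload that this guide's steps read can be pictured as follows. This is an illustrative sketch inferred only from the template expressions used in the workflow (event.alerts[0]._id, event.alerts[0]._index, event.rule.id, event.rule.name). The values are placeholders, and the real payload likely carries many more fields.

```yaml
# Illustrative shape only: limited to the fields this guide's steps reference.
# Placeholder values; not a complete or authoritative schema.
event:
  alerts:
    - _id: "<alert document ID>"      # read by the cases.addAlerts step
      _index: "<alerts index name>"   # read by the cases.addAlerts step
  rule:
    id: "<rule ID>"
    name: "<rule name>"
```

The workflow only attaches the first alert (alerts[0]), so a rule that groups several alerts into one notification would need additional handling.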
-
Trigger on observability alerts
triggers:
  - type: alert

Attach the workflow to the alerting rule you want to investigate.
-
Run initial RCA
Call the agent with the alert payload. Keep create-conversation: true so follow-up steps can continue the conversation and the agent has context when generating the title and description:

steps:
  - name: rca_analysis
    type: ai.agent
    agent-id: "{{ consts.agent_id }}"
    connector-id: "{{ consts.connector_id }}"
    create-conversation: true
    with:
      prompt: |
        Investigate the following alert and propose root causes.
        Keep your analysis and data exploration brief to preserve context.
        <alert>
        {{ event | json }}
        </alert>

The agent's response is at steps.rca_analysis.output.message, and the conversation ID is at steps.rca_analysis.output.conversation_id.
-
Generate a case title and description
Reuse the conversation (so the agent remembers its analysis) and ask it for a title and description:
- name: case_title
  type: ai.agent
  agent-id: "{{ consts.agent_id }}"
  connector-id: "{{ consts.connector_id }}"
  conversation-id: "{{ steps.rca_analysis.output.conversation_id }}"
  with:
    prompt: "Based on your analysis, produce a clear case title. Output only the title."
- name: case_description
  type: ai.agent
  agent-id: "{{ consts.agent_id }}"
  connector-id: "{{ consts.connector_id }}"
  conversation-id: "{{ steps.rca_analysis.output.conversation_id }}"
  with:
    prompt: "Based on your analysis, produce a clear case description. Output only the description."

Reusing the conversation ID keeps token usage low and ensures the title and description match the earlier analysis.
-
-
Open the case
Create the case with the agent-generated title and description:
- name: create_case
  type: cases.createCase
  with:
    title: "{{ steps.case_title.output.message }}"
    description: "{{ steps.case_description.output.message }}"
    owner: "observability"
    severity: "medium"
    tags: ["auto-rca", "ai-generated"]

owner is observability for observability cases.
-
Attach the alert
- name: attach_alert
  type: cases.addAlerts
  with:
    case_id: "{{ steps.create_case.output.id }}"
    alerts:
      - alertId: "{{ event.alerts[0]._id }}"
        index: "{{ event.alerts[0]._index }}"
        rule:
          id: "{{ event.rule.id }}"
          name: "{{ event.rule.name }}"
-
Attach the agent's analysis and reasoning
Append the raw analysis as one comment and the reasoning trace as another. Fetch the reasoning trace with a kibana.request against the Agent Builder conversations API:

- name: add_analysis
  type: cases.addComment
  with:
    case_id: "{{ steps.create_case.output.id }}"
    comment: "{{ steps.rca_analysis.output.message }}"
- name: get_conversation
  type: kibana.request
  with:
    method: GET
    path: /api/agent_builder/conversations/{{ steps.rca_analysis.output.conversation_id }}
- name: add_reasoning
  type: cases.addComment
  with:
    case_id: "{{ steps.create_case.output.id }}"
    comment: |
      ## AI investigation summary
      [View full conversation]({{ kibanaUrl }}/app/agent_builder/conversations/{{ steps.rca_analysis.output.conversation_id }})
      {%- for round in steps.get_conversation.output.rounds %}
      {%- for step in round.steps %}
      {%- if step.type == "reasoning" %}
      - **Reasoning:** {{ step.reasoning }}
      {%- elsif step.type == "tool_call" %}
      - **Action:** `{{ step.tool_id }}`
      {%- endif %}
      {%- endfor %}
      {%- endfor %}

The Liquid loop walks the conversation's rounds and formats each reasoning step and tool call as a bullet. The comment becomes an auditable record of how the agent reached its conclusion.
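To make the Liquid loop easier to follow, here is a sketch of the part of the conversation response it walks. The shape is inferred only from the fields the template reads (rounds, steps, type, reasoning, tool_id). The values are placeholders, and the actual API response may include additional fields and step types that the loop simply skips.

```yaml
# Illustrative shape only, inferred from the fields the Liquid loop reads.
# Placeholder values; not a complete or authoritative response schema.
rounds:
  - steps:
      - type: reasoning
        reasoning: "<agent's reasoning text>"        # rendered as a "Reasoning" bullet
      - type: tool_call
        tool_id: "<ID of the tool the agent called>" # rendered as an "Action" bullet
```

Steps whose type is neither reasoning nor tool_call match neither branch of the if/elsif, so they produce no bullet in the comment.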
Full workflow YAML
name: observability--root-cause-analysis
description: Investigate an observability alert with an AI agent, then open a case populated with the analysis and reasoning trace.
enabled: true
tags: ["rca", "ai", "observability"]
triggers:
- type: alert
consts:
agent_id: "sre-agent"
connector_id: "your-connector-id"
steps:
- name: rca_analysis
type: ai.agent
agent-id: "{{ consts.agent_id }}"
connector-id: "{{ consts.connector_id }}"
create-conversation: true
with:
prompt: |
Investigate the following alert and propose root causes.
Keep your analysis and data exploration brief.
<alert>
{{ event | json }}
</alert>
- name: case_title
type: ai.agent
agent-id: "{{ consts.agent_id }}"
connector-id: "{{ consts.connector_id }}"
conversation-id: "{{ steps.rca_analysis.output.conversation_id }}"
with:
prompt: "Based on your analysis, produce a clear case title. Output only the title."
- name: case_description
type: ai.agent
agent-id: "{{ consts.agent_id }}"
connector-id: "{{ consts.connector_id }}"
conversation-id: "{{ steps.rca_analysis.output.conversation_id }}"
with:
prompt: "Based on your analysis, produce a clear case description. Output only the description."
- name: create_case
type: cases.createCase
with:
title: "{{ steps.case_title.output.message }}"
description: "{{ steps.case_description.output.message }}"
owner: "observability"
severity: "medium"
tags: ["auto-rca", "ai-generated"]
- name: attach_alert
type: cases.addAlerts
with:
case_id: "{{ steps.create_case.output.id }}"
alerts:
- alertId: "{{ event.alerts[0]._id }}"
index: "{{ event.alerts[0]._index }}"
rule:
id: "{{ event.rule.id }}"
name: "{{ event.rule.name }}"
- name: add_analysis
type: cases.addComment
with:
case_id: "{{ steps.create_case.output.id }}"
comment: "{{ steps.rca_analysis.output.message }}"
- name: get_conversation
type: kibana.request
with:
method: GET
path: /api/agent_builder/conversations/{{ steps.rca_analysis.output.conversation_id }}
- name: add_reasoning
type: cases.addComment
with:
case_id: "{{ steps.create_case.output.id }}"
comment: |
## AI investigation summary
[View full conversation]({{ kibanaUrl }}/app/agent_builder/conversations/{{ steps.rca_analysis.output.conversation_id }})
{%- for round in steps.get_conversation.output.rounds %}
{%- for step in round.steps %}
{%- if step.type == "reasoning" %}
- **Reasoning:** {{ step.reasoning }}
{%- elsif step.type == "tool_call" %}
- **Action:** `{{ step.tool_id }}`
{%- endif %}
{%- endfor %}
{%- endfor %}
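While testing, it can help to print the values that later steps depend on before a case is created. The following is a minimal sketch that assumes a console step type that logs a message is available in your version. The step name debug_outputs is hypothetical, and the step would be placed after case_description in the steps list.

```yaml
# Sketch only: assumes a `console` step type with a `message` parameter exists
# in your version. `debug_outputs` is a hypothetical step name.
- name: debug_outputs
  type: console
  with:
    message: |
      conversation: {{ steps.rca_analysis.output.conversation_id }}
      title: {{ steps.case_title.output.message }}
      description: {{ steps.case_description.output.message }}
```

Remove the step once the workflow behaves as expected, because the agent's output can be long.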
- Route by service. Use a switch step on event.alerts[0].service.name to pick different agents for different services (a database-focused agent for DB alerts, a frontend-focused agent for RUM alerts, and so on).
- Summarize before paging. Add an ai.summarize step that turns the analysis into a one-liner and post it to the on-call Slack channel.
- Gate destructive remediation. If you want the workflow to trigger remediation, add an if step that only runs when the agent's confidence is high, and invoke a child workflow that handles the remediation in isolation.
- Correlate across signals. Add elasticsearch.esql.query steps before the agent call to pull metric and log context in the alert's time window, and feed them into the agent's prompt.
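The "correlate across signals" idea could look roughly like the following sketch. The step name recent_errors, the logs-* index pattern, the 15-minute window, and the ES|QL query are all hypothetical, and the query parameter name and the steps.recent_errors.output path are assumptions to check against the step reference for your version.

```yaml
# Sketch only: step name, index pattern, time window, and query are hypothetical.
# The `query` parameter and the `output` path are assumptions, not confirmed API.
- name: recent_errors
  type: elasticsearch.esql.query
  with:
    query: |
      FROM logs-*
      | WHERE @timestamp > NOW() - 15 minutes AND log.level == "error"
      | STATS errors = COUNT(*) BY service.name
      | SORT errors DESC
      | LIMIT 10

# The agent step then receives the query result alongside the alert.
- name: rca_analysis
  type: ai.agent
  agent-id: "{{ consts.agent_id }}"
  connector-id: "{{ consts.connector_id }}"
  create-conversation: true
  with:
    prompt: |
      Investigate the following alert and propose root causes.
      Keep your analysis and data exploration brief.
      <alert>
      {{ event | json }}
      </alert>
      <recent_errors>
      {{ steps.recent_errors.output | json }}
      </recent_errors>
```

Keeping the query result small (here, a top-10 aggregation) limits how much context the extra data consumes in the agent's prompt.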
- Observability workflows: The outcome this workflow supports.
- AI steps reference: Parameters for ai.agent and related AI steps.
- Elastic Agent Builder for Observability: How Agent Builder integrates with observability workflows.
- Cases action steps: Full reference for cases.* steps.
- elastic/workflows examples folder: More end-to-end examples.