Automate root cause analysis for an observability alert

This guide walks through building an observability workflow that responds to an alert by running an Elastic Agent Builder agent for root-cause analysis, generating a case title and description from the agent's output, opening the case, and attaching both the alert and the agent's reasoning trace as comments.

The workflow is adapted from root-cause-analysis-rca-workflow.yaml in the elastic/workflows library.

If you're new to workflows, start with Build your first workflow.

  • Permissions. The All privilege on Analytics > Workflows and Observability > Cases, plus the Agent Builder privilege required to invoke agents in your space. Refer to Kibana privileges.
  • Alerting rule. A configured observability alerting rule that fires on the conditions you want to auto-investigate (metric thresholds, SLO burn rate, anomaly detection, or custom query).
  • SRE agent. An Elastic Agent Builder agent configured to investigate observability signals. The source workflow uses an agent named sre-agent. Substitute your agent ID.
  • Attach the workflow to the rule. After saving the workflow, attach it as an action on the alerting rule. Refer to Alert triggers.

The workflow runs in a single pass when an alert fires:

  1. An alert trigger starts the workflow with the alert payload at event.
  2. An ai.agent step runs an initial analysis of the alert and returns the agent's conversation ID so follow-up steps can continue the same conversation.
  3. Two more ai.agent calls reuse the conversation to generate a case title and a case description.
  4. A cases.createCase step opens the case with the agent-generated title and description.
  5. cases.addAlerts attaches the triggering alert.
  6. A kibana.request step fetches the agent's conversation transcript.
  7. cases.addComment steps append the agent's reasoning trace and the raw analysis as comments for auditability.

Build the workflow step by step:

  1. Trigger on observability alerts

    triggers:
      - type: alert

    Attach the workflow to the alerting rule you want to investigate.

  2. Run initial RCA

    Call the agent with the alert payload. Keep create-conversation: true so follow-up steps can continue the conversation and the agent has context when generating the title and description:

    steps:
      - name: rca_analysis
        type: ai.agent
        agent-id: "{{ consts.agent_id }}"
        connector-id: "{{ consts.connector_id }}"
        create-conversation: true
        with:
          prompt: |
            Investigate the following alert and propose root causes.
            Keep your analysis and data exploration brief to preserve context.
    
            <alert>
            {{ event | json }}
            </alert>

    The agent's response is at steps.rca_analysis.output.message, and the conversation ID is at steps.rca_analysis.output.conversation_id.
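
    The steps in this guide read the agent and connector IDs from consts.agent_id and consts.connector_id. A minimal sketch of how those constants might be declared, assuming the workflow definition accepts a top-level consts map. The connector ID shown is a hypothetical placeholder:

    ```yaml
    # Sketch only: assumes a top-level `consts` map is supported.
    consts:
      agent_id: sre-agent              # agent named in the source workflow; substitute your own
      connector_id: my-llm-connector   # hypothetical placeholder; replace with your connector ID
    ```

    Keeping both IDs in one place means the three ai.agent steps stay in sync when you switch agents or connectors.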

  3. Generate a case title and description

    Reuse the conversation (so the agent remembers its analysis) and ask it for a title and description:

    - name: case_title
      type: ai.agent
      agent-id: "{{ consts.agent_id }}"
      connector-id: "{{ consts.connector_id }}"
      conversation-id: "{{ steps.rca_analysis.output.conversation_id }}"
      with:
        prompt: "Based on your analysis, produce a clear case title. Output only the title."
    
    - name: case_description
      type: ai.agent
      agent-id: "{{ consts.agent_id }}"
      connector-id: "{{ consts.connector_id }}"
      conversation-id: "{{ steps.rca_analysis.output.conversation_id }}"
      with:
        prompt: "Based on your analysis, produce a clear case description. Output only the description."

    Reusing the conversation ID means the agent does not rerun the investigation for each prompt, which keeps token usage down and ensures the title and description match the earlier analysis.

  4. Open the case

    Create the case with the agent-generated title and description:

    - name: create_case
      type: cases.createCase
      with:
        title: "{{ steps.case_title.output.message }}"
        description: "{{ steps.case_description.output.message }}"
        owner: "observability"
        severity: "medium"
        tags: ["auto-rca", "ai-generated"]

    The owner value determines which Kibana solution the case belongs to. Use observability for Observability cases.
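
    Agent output is free text, so the generated title can carry stray whitespace or run long. A hedged variant of the same step, using the standard Liquid strip and truncate filters. The 160-character cap is an assumption about the case title length limit, so confirm it for your version:

    ```yaml
    - name: create_case
      type: cases.createCase
      with:
        # strip removes surrounding whitespace; truncate caps the length.
        # 160 is an assumed limit. Liquid's truncate counts the trailing
        # "..." toward that length.
        title: "{{ steps.case_title.output.message | strip | truncate: 160 }}"
        description: "{{ steps.case_description.output.message }}"
        owner: "observability"
        severity: "medium"
        tags: ["auto-rca", "ai-generated"]
    ```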

  5. Attach the alert

    Attach the triggering alert to the new case. event.alerts[0] is the first alert in the payload:

    - name: attach_alert
      type: cases.addAlerts
      with:
        case_id: "{{ steps.create_case.output.id }}"
        alerts:
          - alertId: "{{ event.alerts[0]._id }}"
            index: "{{ event.alerts[0]._index }}"
            rule:
              id: "{{ event.rule.id }}"
              name: "{{ event.rule.name }}"

  6. Attach the agent's analysis and reasoning

    Append the raw analysis as one comment and the reasoning trace as another. Fetch the reasoning trace with a kibana.request against the Agent Builder conversations API:

    - name: add_analysis
      type: cases.addComment
      with:
        case_id: "{{ steps.create_case.output.id }}"
        comment: "{{ steps.rca_analysis.output.message }}"
    
    - name: get_conversation
      type: kibana.request
      with:
        method: GET
        path: /api/agent_builder/conversations/{{ steps.rca_analysis.output.conversation_id }}
    
    - name: add_reasoning
      type: cases.addComment
      with:
        case_id: "{{ steps.create_case.output.id }}"
        comment: |
          ## AI investigation summary
    
          [View full conversation]({{ kibanaUrl }}/app/agent_builder/conversations/{{ steps.rca_analysis.output.conversation_id }})
    
          {%- for round in steps.get_conversation.output.rounds %}
          {%- for step in round.steps %}
          {%- if step.type == "reasoning" %}
          - **Reasoning:** {{ step.reasoning }}
          {%- elsif step.type == "tool_call" %}
          - **Action:** `{{ step.tool_id }}`
          {%- endif %}
          {%- endfor %}
          {%- endfor %}

    The Liquid loop walks the conversation's rounds and formats each reasoning step and tool call as a bullet. The comment becomes an auditable record of how the agent reached its conclusion.

Ideas for extending the workflow:

  • Route by service. Use a switch step on event.alerts[0].service.name to pick different agents for different services (a database-focused agent for DB alerts, a frontend-focused agent for RUM alerts, and so on).
  • Summarize before paging. Add an ai.summarize step that turns the analysis into a one-liner, then post it to the on-call Slack channel.
  • Gate destructive remediation. If you want the workflow to trigger remediation, add an if step that only runs when the agent's confidence is high, and invoke a child workflow that handles the remediation in isolation.
  • Correlate across signals. Add elasticsearch.esql.query steps before the agent call to pull metric and log context in the alert's time window, and feed them into the agent's prompt.
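
As an illustration of the "Correlate across signals" idea, here is a sketch of a query step placed before rca_analysis. The step name recent_errors, the logs-* index pattern, the with.query parameter name, and the shape of the step output are all assumptions. For simplicity the query uses a fixed 15-minute lookback rather than the alert's own time window:

```yaml
- name: recent_errors            # hypothetical step name
  type: elasticsearch.esql.query
  with:
    # Assumed parameter name and index pattern; adjust to your data.
    query: |
      FROM logs-*
      | WHERE @timestamp > NOW() - 15 minutes AND log.level == "error"
      | STATS errors = COUNT(*) BY service.name
      | SORT errors DESC
      | LIMIT 10

- name: rca_analysis
  type: ai.agent
  agent-id: "{{ consts.agent_id }}"
  connector-id: "{{ consts.connector_id }}"
  create-conversation: true
  with:
    prompt: |
      Investigate the following alert and propose root causes.

      <alert>
      {{ event | json }}
      </alert>

      <recent_errors>
      {{ steps.recent_errors.output | json }}
      </recent_errors>
```

Passing the query result as raw JSON avoids depending on a particular output schema. Keeping the result small (here, ten rows) leaves more of the agent's context for its own exploration.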