Joe Reuter

Automated Error Triage: From Reactive to Autonomous

Learn how to automate error triage by using Elasticsearch log clustering and AI agents, turning production logs into actionable root cause reports.

8 min read

The engineering feedback loop is often pictured as a clean cycle: shipping a feature, monitoring its health, triaging issues, identifying bugs, and deploying fixes. However, in large-scale cloud environments, the path from monitoring to identification frequently becomes a bottleneck. When thousands of Kibana instances running on Elastic Cloud emit millions of logs across a vast codebase, the lag between an error occurring and an engineer understanding its root cause—the Maintenance Gap—can stretch from hours to months.

To close this gap, we built an automated pipeline that moves beyond simple monitoring. By automating the discovery and investigation phases, we have shifted the focus of the engineer from "what happened?" to "is this fix correct?"

The Bottleneck in the Feedback Loop

In a high-velocity engineering environment, the path from deployment to resolution involves several distinct stages: Ship, Monitor, Triage, Identify, Fix, and Review/Deploy.

Velocity typically stalls during triage and identification. While catastrophic failures are reported immediately, smaller errors—intermittent UI glitches or failed background tasks—often go unreported. This dependency on manual reporting creates an inflated time to resolution; by the time a report is filed and routed, the issue may have already impacted the fleet for days.

By automating discovery and investigation, even these "paper cut" bugs are quantified before they accumulate into significant technical debt. The goal is to ensure that by the time a developer enters the cycle to write a fix, the detective work is already complete.

Discovery: Automated Log Clustering

The first challenge in this process is separating signal from noise. In a massive production environment, creating a ticket for every individual error event is unmanageable.

Instead of analyzing individual log lines, we automate the triage process using the CATEGORIZE grouping function in ES|QL (the Elasticsearch Query Language). CATEGORIZE clusters text messages into groups of similarly formatted values, turning unstructured telemetry into a prioritized backlog of distinct error patterns.

For example, a query like the following runs on a rolling window across all Kibana error logs:

FROM kibana-server-logs
| WHERE log.level == "ERROR"
    AND @timestamp >= NOW() - 7 days
| STATS count = COUNT(*) BY category = CATEGORIZE(message)
| SORT count DESC

The result is a table of regex-like categories and their occurrence counts:

count  category
1,247  .*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?
812    .*?Connection.+?error.*?
3      .*?Disconnected.*?

A category like TypeError Cannot read properties of undefined reading document with 1,200+ hits over the past week tells us there is a real, recurring defect worth investigating. A category like Connection error spread uniformly across the fleet is more likely infrastructure noise.
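The intuition behind this clustering can be sketched in a few lines of Python. The toy version below collapses messages that share the same alphabetic "shape" into one bucket; Elastic's actual CATEGORIZE implementation uses a more sophisticated token-based clustering, so treat this purely as illustration.

```python
import re
from collections import Counter

def toy_categorize(message: str) -> str:
    # Keep only alphabetic tokens so variable parts (ids, numbers,
    # addresses) do not split one error pattern into many categories.
    return " ".join(re.findall(r"[A-Za-z]+", message))

logs = [
    "TypeError: Cannot read properties of undefined (reading 'document')",
    "TypeError: Cannot read properties of undefined (reading 'document')",
    "Connection error: peer 10.0.0.7 reset",
    "Connection error: peer 10.0.0.9 reset",
    "Disconnected",
]

# Rough equivalent of STATS count = COUNT(*) BY category | SORT count DESC
categories = Counter(toy_categorize(m) for m in logs).most_common()
```

Both connection errors land in one category despite the differing peer addresses, which is exactly what makes the counts meaningful.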

The output is used to automatically file prioritized issues in a backlog, each enriched with the category, its regex, the occurrence count, and deep links into the raw telemetry. This automation ensures the feedback loop no longer waits for a user report to trigger an investigation; the discovery is proactive and immediate. These prioritized clusters then serve as the direct input for our autonomous investigation agent.
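A minimal sketch of the filing step, assuming a generic issue-tracker payload; the field names and the Discover-style deep link are illustrative placeholders, not the exact format used internally.

```python
from urllib.parse import quote

def build_backlog_issue(category: str, count: int) -> dict:
    """Turn one CATEGORIZE result row into a prioritized backlog issue.

    The URL below stands in for a deep link into the raw telemetry;
    a real Kibana Discover link carries more state than this.
    """
    return {
        "title": f"[auto-triage] {category} ({count} hits / 7d)",
        "labels": ["auto-triage", "needs-investigation"],
        "body": (
            f"Category regex: {category}\n"
            f"Occurrences (last 7 days): {count}\n"
            f"Telemetry: https://kibana.example/app/discover?query={quote(category)}"
        ),
        # Simple occurrence-count prioritization, matching the SORT above.
        "priority": count,
    }

issue = build_backlog_issue(".*?Connection.+?error.*?", 812)
```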

Investigation: The Automated Detective

Once an error pattern is identified, the pipeline moves to the identification phase. We deployed an AI agent to run a complete investigation of each issue. Navigating a codebase of Kibana's complexity is a significant time sink; the agent accelerates this by correlating information across the stack using ES|QL.

Protocol-Driven Investigation

It is important to distinguish this agent from a traditional automation script. The agent does not follow a hardcoded state machine; instead, it is provided with a protocol that outlines investigation goals and available tools.

The protocol prescribes a phased approach: understand the error, analyze its distribution, correlate with other data sources, find the source, and report. Each phase is described in terms of goals, not commands. The following excerpt shows how the protocol defines the first investigation step:

### Phase 1: Understand the Error
- Review the pre-extracted error details from the backlog issue
- Check for similar/overlapping error backlog issues (include closed!)
  - the categorization is often imperfect; closed issues may have
    valuable context about fixes
- Query for error overview statistics
- Get sample error messages to understand the actual content

The agent is also provided with an ES|QL reference guide and a library of query templates. Here is one of the templates for analyzing version distribution (a common first step to determine whether an error is a regression):

FROM logging-*:cluster-kibana-*
| WHERE @timestamp >= NOW() - 4 hours
    AND log.level == "ERROR"
    AND message : "TypeError Cannot read properties"
| STATS
    error_count = COUNT(*),
    deployments = COUNT_DISTINCT(ece.deployment)
  BY `docker.container.labels.org.label-schema.version`
| SORT error_count DESC
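One way to read the result of a query like this: if errors concentrate in the newest version and are absent from older ones, the pattern is probably a regression. A rough heuristic, with a made-up threshold:

```python
def looks_like_regression(version_counts: dict[str, int],
                          version_order: list[str],
                          threshold: float = 0.9) -> bool:
    """True if at least `threshold` of the errors sit in the newest version.

    version_order lists versions oldest-to-newest; real version ordering
    needs proper semver comparison, elided here.
    """
    total = sum(version_counts.values())
    if total == 0:
        return False
    newest = version_order[-1]
    return version_counts.get(newest, 0) / total >= threshold

counts = {"8.14.0": 3, "8.15.0": 2, "8.16.0": 480}
```

With these illustrative counts, 480 of 485 errors fall in the newest version, so the heuristic flags a likely regression.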

Because the agent has the autonomy to choose which tools to call—and in what order—based on the results of previous queries, it can adapt its strategy to the specific error. It might decide to skip proxy analysis if the telemetry suggests a background task failure, or it might dive deep into git history if ES|QL reveals the bug only exists on a specific version. This flexibility allows it to navigate the nuance of a massive codebase without requiring a pre-defined path for every possible failure mode.
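The difference from a hardcoded state machine can be sketched as a loop in which the next step is chosen by the model at runtime rather than by the script. Everything below is schematic: the tool names are hypothetical, and choose_next_step stands in for an LLM call that reads the protocol and the findings so far.

```python
def run_investigation(tools, choose_next_step, max_steps=20):
    """Protocol-driven loop: the controller never fixes the order of calls."""
    findings = []
    for _ in range(max_steps):
        step = choose_next_step(findings)   # in production: an LLM decision
        if step is None:                    # agent judges the report complete
            break
        tool_name, args = step
        findings.append((tool_name, tools[tool_name](**args)))
    return findings

# A deterministic stand-in for the model, purely for demonstration.
def scripted_chooser(findings):
    if not findings:
        return ("esql_query", {"query": "FROM logs | STATS COUNT(*)"})
    if len(findings) == 1:
        return ("git_log", {"path": "src/plugins"})
    return None

tools = {
    "esql_query": lambda query: f"ran: {query}",
    "git_log": lambda path: f"history of {path}",
}
result = run_investigation(tools, scripted_chooser)
```

Swapping the scripted chooser for a model call is what lets the same loop skip proxy analysis for one error and dive into git history for another.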

Lessons Learned: Query Discipline

Direct LLM access to production clusters requires tactical constraints to manage costs and performance. We codified several requirements into the investigation workflow to ensure efficiency:

  • Query Budgets: The agent is restricted to ~15-20 queries per investigation, forcing it to form a hypothesis before retrieving data.

  • The 4-Hour Rule: The agent starts with a small time window (the most recent 1-4 hours) to leverage caches and reduce compute costs.

  • Optimal Operators: The agent prefers equality filters and the MATCH (:) operator over LIKE or regex, which can make queries 50-1000× faster.

  • Fail-Fast Timeouts: Every query has a strict timeout, requiring the agent to refine its filters rather than retrying expensive operations.
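These constraints are easy to enforce mechanically, outside the model. Below is a sketch of a wrapper that spends a query budget and passes a fail-fast timeout to whatever actually executes the ES|QL; the executor signature is an assumption, not a real client API.

```python
class QueryBudgetExceeded(Exception):
    pass

class BudgetedQueryRunner:
    def __init__(self, execute, max_queries=20, timeout_s=10.0):
        self.execute = execute          # stand-in for a thin ES|QL client
        self.max_queries = max_queries  # the ~15-20 query budget
        self.timeout_s = timeout_s      # fail-fast timeout per query
        self.used = 0

    def run(self, esql: str):
        if self.used >= self.max_queries:
            raise QueryBudgetExceeded(
                f"{self.used} queries used; refine the hypothesis instead")
        self.used += 1
        return self.execute(esql, timeout_s=self.timeout_s)

runner = BudgetedQueryRunner(lambda q, timeout_s: "ok", max_queries=2)
```

Raising instead of silently retrying is the point: the agent is forced back into hypothesis-forming rather than burning compute on repeated expensive scans.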

Source Code Contextualization

To complete the identification phase, the agent correlates telemetry with the git history and source files. It uses the stack trace and log patterns to narrow its search, parsing through potential code matches faster than a manual search. By identifying the specific line of code producing the error and checking recent PRs, the agent links a production symptom directly to its technical root cause.
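A small example of the narrowing step: pulling file and line candidates out of a Node-style stack trace before searching the repository. The regex covers the common `at fn (path:line:col)` shape only, and the trace below is invented for illustration (the file name comes from the case study that follows, the line numbers do not).

```python
import re

# Matches path:line:col where the path ends in .ts, .tsx, or .js
FRAME = re.compile(r"(?P<path>[\w./-]+\.(?:tsx?|js)):(?P<line>\d+):\d+")

def frames(stack_trace: str):
    """Return (path, line) pairs, most-specific frame first."""
    return [(m["path"], int(m["line"])) for m in FRAME.finditer(stack_trace)]

trace = (
    "TypeError: Cannot read properties of undefined (reading 'document')\n"
    "    at usePreview (public/processor_outcome_preview.tsx:142:31)\n"
    "    at renderWithHooks (node_modules/react-dom/cjs/react-dom.js:14985:18)"
)
candidates = frames(trace)
```

Filtering out frames under node_modules/ and ranking the rest against recent git history is the obvious next refinement.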

Real-World Case Study: The Streams UI Crash

The value of this autonomous investigation is best illustrated by the rare edge cases it uncovers. In one instance, the clustering system surfaced a sporadic pattern:

.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?

A human might have dismissed this as generic telemetry noise, but the agent's investigation revealed a reproducible race condition in the Streams UI:

  1. Quantification: Using ES|QL, the agent analyzed the error distribution and identified the specific application context (Streams) and the relevant loggers.

  2. Code Analysis: It identified a logic error in processor_outcome_preview.tsx. The code was indexing into an array (originalSamples[currentDoc.index].document) without verifying the element existed.

  3. Root Cause: The agent realized that when a user changed filters while a row was expanded, the currentDoc.index became stale before the next render cleared it.

  4. Outcome: The agent provided a suggested fix (guarding the access) and recommended a regression test around filter changes during row expansion.
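The actual fix lives in TypeScript, but the shape of the guard translates directly. A Python analogue, with hypothetical data shapes mirroring the names in the report:

```python
def previewed_document(original_samples, current_doc):
    # Guard the indexed access: after a filter change, current_doc's index
    # can be stale for one render, so verify the element exists before
    # reading "document" (the unguarded version raised the TypeError).
    if current_doc is None:
        return None
    if not 0 <= current_doc["index"] < len(original_samples):
        return None
    return original_samples[current_doc["index"]].get("document")

samples = [{"document": {"id": 1}}]
```

The stale-index case now degrades to "no preview" instead of a crash, which is the behavior the suggested regression test would pin down.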

This case highlights the economics of autonomous triage. Sifting through thousands of "noisy" logs to find the few that represent real, fixable UI crashes is a non-starter for senior engineers. Agents process this volume at a fraction of the cost, acting as a high-fidelity filter that ensures human time is only spent on verified, actionable issues.

The Future of Engineering Velocity

Automating triaging and identification is the first step. We are currently layering in the ability to pass these findings to a coding agent for draft Pull Requests. Beyond production errors, we are also investigating agentic exploratory testing to stress-test features during the pre-release phase and catch bugs before they ever reach a user.

This autonomous layer is complementary to, not a replacement for, classic quality gates. Unit tests, API-level checks, and UI integration tests remain the primary defense. Our approach provides a safety net for the failures that inevitably bypass these gates in a complex environment, ensuring they are addressed with the same rigor as pre-release bugs.

As we move toward a more agent-driven development process, the ability to rapidly validate that changes are safe and to control overall quality is the primary bottleneck for engineering velocity. While code generation itself is becoming a commodity, the "reasoning" required to verify that a change is both correct and safe remains the most critical hurdle. By focusing our automation on the discovery and root-cause analysis of failures, we ensure that our engineering teams can scale their impact without being buried by the operational weight of maintaining quality. The goal is to build a system that can understand, diagnose, and eventually fix itself.

For more information on Elastic and its observability capabilities, check out Elastic Observability. You can also sign up for a free trial to try it out yourself.
