Turn Dashboards Into an Investigation Tool with ES|QL Variable Controls

Static dashboards are useful until the first incident, when the default view hides the signal you need. ES|QL variable controls on a Kibana dashboard let you go from a healthy-looking fleet overview to a clear root cause without editing a single query.

In this post, we’ll show how ES|QL variable controls turn dashboards into interactive investigation tools, and how to set them up to uncover problems that averages hide. When you select a value in a control, every panel that uses that variable adapts.

The dashboard

This is a custom "Infrastructure Overview" dashboard that monitors 10 hosts across 3 AWS regions using OpenTelemetry host metrics. It has four line charts (CPU, Memory, Disk, Load average), with ES|QL variable controls at the top.

With the default dashboard controls (AVG aggregation, region breakdown, 15-minute buckets, all hosts selected), everything looks healthy. Smooth diurnal cycles across all three regions.

But there is a problem hiding in this view.

The problem with fixed queries

A fixed chart query hardcodes decisions that need to change during an investigation:

The aggregation function (AVG, MAX, MIN, MEDIAN)
The dimension used to slice the data (host, region, availability zone)
Which hosts are included or excluded
The time bucket interval (1m, 5m, 15m, 1h)

With those baked in, every change means editing queries across multiple panels.

ES|QL variable controls

ES|QL variable controls inject user-selected values into queries at runtime. Two types:

Value controls (?variable): replace a value in the query, such as a time interval or a list of hostnames
Structure controls (??variable): replace a function name or field name, such as the aggregation function or the dimension used to slice data

One query pattern, reused across all panels.

The query

The original static CPU query looks like this:

TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != "idle"
| STATS AVG(system.cpu.utilization)
  BY BUCKET(@timestamp, 1 minute), resource.attributes.host.name

To adapt this query to use variable controls, each hardcoded part has to be replaced with a variable. The aggregation function, the time bucket, and the breakdown dimension are straightforward replacements. The hostname filter requires one extra step: we want the control to allow selecting multiple hosts at once, and an equality filter against a single value matches only one host at a time. MV_CONTAINS checks whether the values of its second argument are all present in the multi-value list passed as its first argument, so MV_CONTAINS(?hostname, resource.attributes.host.name) returns true when the row's host name is one of the values selected in the control.
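As a quick check of that behavior, here is a minimal sketch that can be run on its own. The column names selected and host are illustrative, and web-01 is a hypothetical host name (only db-01 appears in this post). The list stands in for what a multi-select ?hostname control would supply.

```esql
// Illustrative only: "selected" mimics a multi-select control value,
// "host" mimics the host name on a single row. web-01 is hypothetical.
ROW selected = ["db-01", "web-01"], host = "db-01"
| EVAL keep = MV_CONTAINS(selected, host)
```

Because db-01 is in the list, keep should come back true. Change host to a name that is not in the list and it should come back false.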

After replacing each part, the query becomes:

TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != "idle"
  AND MV_CONTAINS(?hostname, resource.attributes.host.name)
| STATS ??aggregation(system.cpu.utilization)
  BY BUCKET(@timestamp, ?interval), ??breakdown

The same pattern applies to all four panels (CPU, Memory, Disk, Load). Changing any control updates every panel at once.
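For example, a Memory panel following the same pattern might look like the sketch below. The metric name system.memory.utilization and the state filter are assumptions about what this data stream contains (the post only shows the CPU query), so adjust both to match your own fields.

```esql
// Sketch of a Memory panel reusing the same four variables.
// Assumed: system.memory.utilization exists and carries attributes.state.
TS metrics-hostmetricsreceiver.otel-default
| WHERE system.memory.utilization IS NOT NULL
  AND attributes.state == "used"
  AND MV_CONTAINS(?hostname, resource.attributes.host.name)
| STATS ??aggregation(system.memory.utilization)
  BY BUCKET(@timestamp, ?interval), ??breakdown
```

Only the metric and its state filter change. The variables stay identical, which is what lets one control drive every panel.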

The controls

Hostname (?hostname): Filters to the hosts selected in the control. Configured as "Values from a query" with multi-select enabled. The control runs an ES|QL query that returns the available host names, and MV_CONTAINS in the chart queries is what allows more than one host to be selected.
Aggregation (??aggregation): Swaps the aggregation function. Static values control with AVG, MAX, MIN, MEDIAN.
Time interval (?interval): Controls the time bucket size. Static values control with 1 minute, 5 minutes, 15 minutes, 1 hour.
Breakdown (??breakdown): Swaps the dimension used to slice the data. Static values control with resource.attributes.host.name, resource.attributes.cloud.region, resource.attributes.cloud.availability_zone.
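
The query behind the Hostname control is not shown in this post. A minimal sketch, assuming the host names live in the same data stream the charts read from, could be:

```esql
// Hypothetical source query for the "Values from a query" control:
// returns one row per distinct host name.
FROM metrics-hostmetricsreceiver.otel-default
| STATS BY resource.attributes.host.name
```

Any query that returns a single column of host names should serve the same purpose.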

The investigation

The dashboard opens with AVG aggregation, region breakdown, 15-minute buckets, and all hosts selected. Nothing looks wrong. The first change is switching the aggregation from AVG to MAX and the time interval to 1 minute. A bump immediately appears in us-east-1 around March 7: roughly 68%, where the normal peak sits around 57%. The average was hiding this because one host's intermittent spikes were diluted across the five hosts in the region.

Next, switching the breakdown from region to host pinpoints the source. db-01 stands out with spikes to 65-70% while its normal baseline sits around 24%. Every other host follows its expected pattern.

Setting the hostname control to only db-01 isolates the incident. CPU shows intermittent bursts, not sustained saturation. Memory climbs from 85% to 93%, Load from 2.4 to 3.0, Disk from 67% to 73%. All four panels corroborate a 4-hour event window.
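
To make the end state concrete: with MAX, a 1-minute interval, host breakdown, and only db-01 selected, the CPU panel behaves roughly as if it ran the query below. This is a conceptual expansion written by hand for illustration, not the literal text Kibana sends.

```esql
// Hand-written equivalent of the CPU query after the control selections:
// ??aggregation = MAX, ?interval = 1 minute,
// ??breakdown = resource.attributes.host.name, ?hostname = db-01
TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != "idle"
  AND MV_CONTAINS(["db-01"], resource.attributes.host.name)
| STATS MAX(system.cpu.utilization)
  BY BUCKET(@timestamp, 1 minute), resource.attributes.host.name
```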

Why structure your queries with variable controls

A dashboard built with variable controls supports investigation paths that did not exist when the dashboard was built. Without them, every dashboard is a frozen perspective chosen at build time. When an incident does not match that perspective, someone has to edit queries or build a new dashboard under pressure. With controls, the panels adapt.

Value controls like ?hostname and ?interval handle what you filter and how granular the data is. Structure controls like ??aggregation and ??breakdown handle how you aggregate and how you slice. Because the panels share one query pattern, a fix or improvement applies everywhere, and a new investigation path is a single value added to a control. Together, they turn a static dashboard into an investigation surface.