Loading

Approximate STATS queries

ES|QL STATS commands summarize large volumes of data into aggregated statistics. For many analytics workloads, exact results are not strictly necessary — approximate results with known error bounds are sufficient, and can be computed dramatically faster. The approximation setting enables this: ES|QL rewrites your query to use random sampling and extrapolation, returning estimates together with confidence intervals and a certification flag.

Approximation breaks the dependency between performance and dataset size. Accuracy depends principally on the data characteristics and the query itself, not on how many rows are in the source index. This means the performance advantage grows as your data grows.

To approximate a STATS query, prepend SET approximation=true; to your existing query:

SET approximation=true;
FROM web_traffic | WHERE @timestamp >= NOW()-1w
                 | STATS total_hits = COUNT(),
                         avg_load_time = AVG(page_load_ms)
                   BY country_code
                 | SORT total_hits DESC
                 | LIMIT 5
		

No other changes to the query are required. The underlying rewrite — sampling, extrapolation, and confidence interval computation — is handled automatically.

An approximate query returns the same columns as the exact query, plus two additional columns for each estimated quantity.

Consider a query that counts page hits and computes the average page load time per country over the last week. The exact results might be:

total_hits avg_load_time country_code
12483710 237.4 US
4271856 189.2 DE
3804219 211.7 GB
2156033 195.8 FR
1847291 302.6 JP

The approximate query returns additional columns for each estimated quantity:

total_hits avg_load_time country_code _approximation_confidence_interval(total_hits) _approximation_certified(total_hits) _approximation_confidence_interval(avg_load_time) _approximation_certified(avg_load_time)
12510000 237.1 US [12430000, 12590000] true [235.8, 238.4] true
4284000 189.5 DE [4240000, 4328000] true [187.9, 191.1] true
3790000 212.3 GB [3752000, 3828000] true [210.1, 214.5] true
2162000 195.1 FR [2134000, 2190000] true [192.8, 197.4] true
1839000 303.8 JP [1814000, 1864000] true [300.2, 307.4] true

The additional columns are:

  • _approximation_confidence_interval(<agg>): The central 90% confidence interval for the estimate. This is an interval that has a high probability (0.9) of containing the true value.
  • _approximation_certified(<agg>): A boolean indicating whether the statistical properties of the estimate are behaving as expected. When true, the confidence interval is trustworthy. When false, the estimate may still be accurate, but the distribution of the approximation could not be confirmed to satisfy the assumptions used to compute the interval.

The default settings work well for most queries, but you can tune the confidence interval computation and sample size to trade off between accuracy and speed.

Computing confidence intervals adds overhead. If you only need point estimates, disable them by setting confidence_level to null:

SET approximation={"confidence_level":null};
FROM web_traffic | WHERE @timestamp >= NOW()-1d
                 | STATS total_bytes = SUM(response_bytes),
                         avg_load_time = AVG(page_load_ms)
                   BY datacenter_region
                 | SORT total_bytes DESC
                 | LIMIT 10
		

This skips the interval and certification computation and can yield additional speedup.

The default sample size is 1,000,000 rows for grouped STATS (queries with a BY clause) and 100,000 rows otherwise. If you find results are too imprecise — particularly for high-cardinality grouping — you can increase the sample size:

SET approximation={"rows":5000000};
FROM web_traffic | WHERE @timestamp >= NOW()-1w
                 | STATS total_hits = COUNT(*),
                         avg_load_time = AVG(page_load_ms)
                   BY url_path
                 | SORT total_hits DESC
                 | LIMIT 25
		

Larger sample sizes improve accuracy at the cost of reduced speedup. As long as the sample size remains well below the total row count, you will still see performance benefits.

Some aggregations can be computed directly from summary statistics maintained in the index (for example, simple COUNT(*) over an indexed numeric field with no grouping). The query planner detects these cases automatically and executes the query exactly, since it is already fast. You do not need to handle this yourself — when this happens, the confidence intervals will have zero length, indicating that the results are exact.

Approximation works with aggregation functions where sampling and extrapolation produce statistically sound estimates. The functions that are not currently supported and will cause the query to fall back to exact execution include:

  • COUNT_DISTINCT
  • MIN
  • MAX
  • FIRST
  • LAST
  • TOP
  • ABSENT
  • PRESENT
  • ST_CENTROID_AGG
  • ST_EXTENT_AGG

Some of these (such as MIN and MAX) are intrinsically difficult to estimate reliably from samples without making strong distributional assumptions. Rather than risk accidental misuse, they are excluded from automatic approximation.

The following query patterns are not currently supported for approximation and fall back to exact execution:

  • Queries using the TS source command
  • Queries using the FORK or JOIN processing commands
  • Pipelines containing two or more STATS commands

Approximation works best on large, broad queries. Certain query patterns reduce or eliminate the benefit.

If a query's WHERE clause matches only a small fraction of the data, sampling provides little benefit — the data is already small. ES|QL detects this during the rewrite phase and falls back to exact execution. However, the rewrite step itself adds some overhead, so if you know in advance that your query will match very few rows, it is better to run without approximation.

When the BY expression has very high cardinality (many distinct values), individual groups may end up with very few sampled rows. This can cause:

  • Infrequent groups being dropped entirely (groups with fewer than 10 samples are excluded from results).
  • Large estimation errors for groups that are retained.
  • If the grouping field is unique per document, the query may return no results at all.

Sorting by ascending count (finding the rarest groups) is particularly problematic, as heavy hitters may require sampling most of the dataset.

If accuracy for high-cardinality queries matters, increase the sample size using the rows configuration option. As a rule of thumb, you want at least a few hundred samples per group for reliable estimates.

For expert users who want full control, ES|QL provides the SAMPLE command. This gives you raw sampled data with no automatic extrapolation or confidence interval computation:

FROM web_traffic | SAMPLE 0.01
                 | STATS unique_visitors = COUNT_DISTINCT(client_ip)
		

This computes the distinct count of client IPs on roughly 1% of the data. Since COUNT_DISTINCT is not supported by automatic approximation, SAMPLE is the alternative — but interpreting the result and accounting for sampling bias is your responsibility.

You can also use SAMPLE to build custom estimation pipelines, for example by extracting frequency profiles:

FROM web_traffic | SAMPLE 0.01
                 | STATS c = COUNT(*) BY search_phrase
		

Adjusting the sample probability and observing how results change gives you a sense of convergence.

Aspect Detail
Enable approximation SET approximation=true;
Disable confidence intervals SET approximation={"confidence_level":null};
Custom sample size SET approximation={"rows":N};
Default sample size (grouped) 1,000,000 rows
Default sample size (ungrouped) 100,000 rows
Confidence interval default Central 90% interval
Minimum samples per group 10