Approximate STATS queries
ES|QL STATS commands summarize large volumes of data into aggregated statistics. For many analytics workloads, exact results are not strictly necessary — approximate results with known error bounds are sufficient, and can be computed dramatically faster. The approximation setting enables this: ES|QL rewrites your query to use random sampling and extrapolation, returning estimates together with confidence intervals and a certification flag.
Approximation breaks the dependency between performance and dataset size. Accuracy depends principally on the data characteristics and the query itself, not on how many rows are in the source index. This means the performance advantage grows as your data grows.
To approximate a STATS query, prepend SET approximation=true; to your existing query:
SET approximation=true;
FROM web_traffic | WHERE @timestamp >= NOW()-1w
| STATS total_hits = COUNT(),
avg_load_time = AVG(page_load_ms)
BY country_code
| SORT total_hits DESC
| LIMIT 5
No other changes to the query are required. The underlying rewrite — sampling, extrapolation, and confidence interval computation — is handled automatically.
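To make the rewrite concrete, here is a minimal Python sketch of what sampling and extrapolation mean for COUNT and AVG. This illustrates the statistical idea only, not the actual ES|QL implementation; the dataset and sampling probability are invented:

```python
import random

random.seed(42)

# Invented toy dataset: 100,000 "rows" of page-load times in milliseconds.
rows = [random.uniform(50, 500) for _ in range(100_000)]

p = 0.01  # sampling probability
sample = [r for r in rows if random.random() < p]

# Extrapolation: each sampled row stands for 1 / p rows.
est_count = len(sample) / p          # estimates COUNT()
est_avg = sum(sample) / len(sample)  # AVG needs no rescaling

print(round(est_count), round(est_avg, 1))
```

The count estimate scales the sample size by 1/p, while averages are already unbiased on the sample itself; this is why both kinds of aggregation can be estimated from the same pass over the sampled rows.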
An approximate query returns the same columns as the exact query, plus two additional columns for each estimated quantity.
Consider a query that counts page hits and computes the average page load time per country over the last week. The exact results might be:
| total_hits | avg_load_time | country_code |
|---|---|---|
| 12483710 | 237.4 | US |
| 4271856 | 189.2 | DE |
| 3804219 | 211.7 | GB |
| 2156033 | 195.8 | FR |
| 1847291 | 302.6 | JP |
The approximate query returns additional columns for each estimated quantity:
| total_hits | avg_load_time | country_code | _approximation_confidence_interval(total_hits) | _approximation_certified(total_hits) | _approximation_confidence_interval(avg_load_time) | _approximation_certified(avg_load_time) |
|---|---|---|---|---|---|---|
| 12510000 | 237.1 | US | [12430000, 12590000] | true | [235.8, 238.4] | true |
| 4284000 | 189.5 | DE | [4240000, 4328000] | true | [187.9, 191.1] | true |
| 3790000 | 212.3 | GB | [3752000, 3828000] | true | [210.1, 214.5] | true |
| 2162000 | 195.1 | FR | [2134000, 2190000] | true | [192.8, 197.4] | true |
| 1839000 | 303.8 | JP | [1814000, 1864000] | true | [300.2, 307.4] | true |
The additional columns are:
- _approximation_confidence_interval(<agg>): The central 90% confidence interval for the estimate. This is an interval that has a high probability (0.9) of containing the true value.
- _approximation_certified(<agg>): A boolean indicating whether the statistical properties of the estimate are behaving as expected. When true, the confidence interval is trustworthy. When false, the estimate may still be accurate, but the distribution of the approximation could not be confirmed to satisfy the assumptions used to compute the interval.
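The interval can be understood through a normal approximation. A hedged Python sketch, assuming the sampled row count is binomial and using the standard z-value for a central 90% interval (the actual computation inside ES|QL may differ):

```python
import math

# Assumed inputs: n rows observed in the sample, sampling probability p.
n, p = 1_000, 0.01

est = n / p  # extrapolated COUNT estimate

# Normal approximation: n ~ Binomial(N, p), so Var(n / p) ≈ n * (1 - p) / p**2.
se = math.sqrt(n * (1 - p)) / p

z = 1.645  # z-value for a central 90% interval
lo, hi = est - z * se, est + z * se
print(f"estimate={est:.0f}  90% CI=[{lo:.0f}, {hi:.0f}]")
```

The certification flag exists precisely because this normal-approximation reasoning can fail, for example when a group has very few samples or the data is heavily skewed.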
The default settings work well for most queries, but you can tune the confidence interval computation and sample size to trade off between accuracy and speed.
Computing confidence intervals adds overhead. If you only need point estimates, disable them by setting confidence_level to null:
SET approximation={"confidence_level":null};
FROM web_traffic | WHERE @timestamp >= NOW()-1d
| STATS total_bytes = SUM(response_bytes),
avg_load_time = AVG(page_load_ms)
BY datacenter_region
| SORT total_bytes DESC
| LIMIT 10
This skips the interval and certification computation and can yield additional speedup.
The default sample size is 1,000,000 rows for grouped STATS (queries with a BY clause) and 100,000 rows otherwise. If you find results are too imprecise — particularly for high-cardinality grouping — you can increase the sample size:
SET approximation={"rows":5000000};
FROM web_traffic | WHERE @timestamp >= NOW()-1w
| STATS total_hits = COUNT(*),
avg_load_time = AVG(page_load_ms)
BY url_path
| SORT total_hits DESC
| LIMIT 25
Larger sample sizes improve accuracy at the cost of reduced speedup. As long as the sample size remains well below the total row count, you will still see performance benefits.
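For count-like aggregations, the relative standard error typically scales as 1/sqrt(sample size), so halving the error requires roughly quadrupling the sample. A small Python illustration of that scaling (an assumed statistical model, not measured behavior):

```python
import math

def relative_standard_error(n, p=0.01):
    # For an extrapolated count n / p with n ~ Binomial(N, p):
    # SE(n / p) = sqrt(n * (1 - p)) / p, so the relative error is sqrt((1 - p) / n).
    return math.sqrt(n * (1 - p)) / p / (n / p)

for n in (100_000, 1_000_000, 5_000_000):
    print(f"{n:>9} rows: {relative_standard_error(n):.4%}")
```

Under this model, the default ungrouped sample of 100,000 rows already yields sub-percent relative error, which is why the defaults suffice for most queries.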
Some aggregations can be computed directly from summary statistics maintained in the index (for example, simple COUNT(*) over an indexed numeric field with no grouping). The query planner detects these cases automatically and executes the query exactly, since it is already fast. You do not need to handle this yourself — when this happens, the confidence intervals will have zero length, indicating that the results are exact.
Approximation works with aggregation functions where sampling and extrapolation produce statistically sound estimates. The following functions are not currently supported; using any of them causes the query to fall back to exact execution:
COUNT_DISTINCT, MIN, MAX, FIRST, LAST, TOP, ABSENT, PRESENT, ST_CENTROID_AGG, ST_EXTENT_AGG
Some of these (such as MIN and MAX) are intrinsically difficult to estimate reliably from samples without making strong distributional assumptions. Rather than risk accidental misuse, they are excluded from automatic approximation.
The following query patterns are not currently supported for approximation and fall back to exact execution:
- Queries using the TS source command
- Queries using the FORK or JOIN processing commands
- Pipelines containing two or more STATS commands
Approximation works best on large, broad queries. Certain query patterns reduce or eliminate the benefit.
If a query's WHERE clause matches only a small fraction of the data, sampling provides little benefit — the data is already small. ES|QL detects this during the rewrite phase and falls back to exact execution. However, the rewrite step itself adds some overhead, so if you know in advance that your query will match very few rows, it is better to run without approximation.
When the BY expression has very high cardinality (many distinct values), individual groups may end up with very few sampled rows. This can cause:
- Infrequent groups being dropped entirely (groups with fewer than 10 samples are excluded from results).
- Large estimation errors for the groups that are retained.
- No results at all, if the grouping field is unique per document.
Sorting by ascending count (finding the rarest groups) is particularly problematic, as rare groups may require sampling most of the dataset before they are observed at all.
If accuracy for high-cardinality queries matters, increase the sample size using the rows configuration option. As a rule of thumb, you want at least a few hundred samples per group for reliable estimates.
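That rule of thumb translates into a simple back-of-envelope calculation. A sketch assuming roughly uniform group sizes (skewed data needs more, since small groups fall below their share):

```python
def rows_needed(group_cardinality, samples_per_group=300):
    # Back-of-envelope sample size: a few hundred samples per group,
    # assuming roughly uniform group sizes.
    return group_cardinality * samples_per_group

# e.g. roughly 10,000 distinct grouping values:
print(rows_needed(10_000))  # -> 3000000
```

In this example, grouping on a field with about 10,000 distinct values would call for a rows setting of a few million rather than the 1,000,000 default.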
For expert users who want full control, ES|QL provides the SAMPLE command. This gives you raw sampled data with no automatic extrapolation or confidence interval computation:
FROM web_traffic | SAMPLE 0.01
| STATS unique_visitors = COUNT_DISTINCT(client_ip)
This computes the distinct count of client IPs on roughly 1% of the data. Since COUNT_DISTINCT is not supported by automatic approximation, SAMPLE is the alternative — but interpreting the result and accounting for sampling bias is your responsibility.
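As an illustration of that responsibility, the following Python sketch (with invented numbers) shows why a plain count extrapolates linearly from a sample while a distinct count does not:

```python
p = 0.01  # SAMPLE probability used in the query

# Invented numbers read back from the sampled query results.
sample_count = 12_400     # COUNT(*) over the sample
sample_distinct = 9_800   # COUNT_DISTINCT(client_ip) over the sample

# A plain count extrapolates linearly: each sampled row stands for 1 / p rows.
est_count = sample_count / p

# A distinct count does not: values absent from the sample are simply
# unobserved, so the sampled distinct count is only a lower bound.
lower_bound_distinct = sample_distinct
print(est_count, lower_bound_distinct)
```

Estimating the true distinct count from a sample requires dedicated estimators with distributional assumptions, which is one reason COUNT_DISTINCT is excluded from automatic approximation.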
You can also use SAMPLE to build custom estimation pipelines, for example by extracting frequency profiles:
FROM web_traffic | SAMPLE 0.01
| STATS c = COUNT(*) BY search_phrase
Adjusting the sample probability and observing how results change gives you a sense of convergence.
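One way to gauge convergence is to run the pipeline at two sample probabilities and compare the extrapolated frequencies. A toy Python sketch with an invented, skewed phrase distribution standing in for the sampled query:

```python
import random

random.seed(0)

# Invented, skewed distribution standing in for web_traffic.search_phrase.
phrases = ["login"] * 5_000 + ["pricing"] * 2_000 + ["careers"] * 50

def sampled_freqs(p):
    # Emulates: FROM web_traffic | SAMPLE p | STATS c = COUNT(*) BY search_phrase,
    # with counts extrapolated back by 1 / p.
    counts = {}
    for phrase in phrases:
        if random.random() < p:
            counts[phrase] = counts.get(phrase, 0) + 1
    return {phrase: c / p for phrase, c in counts.items()}

# If the estimates barely move as p grows, the heavy hitters have converged;
# rare phrases ("careers") remain unstable at any small p.
print(sampled_freqs(0.05))
print(sampled_freqs(0.20))
```

Stable estimates for the frequent phrases across probabilities suggest the sample is large enough; wildly varying (or missing) rare phrases signal that those tails cannot be trusted at the current rate.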
| Aspect | Detail |
|---|---|
| Enable approximation | SET approximation=true; |
| Disable confidence intervals | SET approximation={"confidence_level":null}; |
| Custom sample size | SET approximation={"rows":N}; |
| Default sample size (grouped) | 1,000,000 rows |
| Default sample size (ungrouped) | 100,000 rows |
| Confidence interval default | Central 90% interval |
| Minimum samples per group | 10 |