---
title: Aggregating data for faster performance
description: When you aggregate data, Elasticsearch automatically distributes the calculations across your cluster. Then you can feed this aggregated data into the...
url: https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/ml-configuring-aggregation
products:
  - Elasticsearch
  - Machine Learning
applies_to:
  - Elastic Cloud Serverless: Generally available
  - Elastic Stack: Generally available
---

# Aggregating data for faster performance
When you aggregate data, Elasticsearch automatically distributes the calculations across your cluster. You can then feed this aggregated data into the machine learning features instead of raw results, which reduces the volume of data that must be analyzed.

## Requirements

There are a number of requirements for using aggregations in datafeeds.

### Aggregations

- Your aggregation must include a `date_histogram` aggregation or a top-level `composite` aggregation, which in turn must contain a `max` aggregation on the time field. This ensures that the aggregated data is a time series and that the timestamp of each bucket is the time of the last record in the bucket.
- The `time_zone` parameter in the date histogram aggregation must be set to `UTC`, which is the default value.
- The name of the aggregation and the name of the field that it operates on need to match. For example, if you use a `max` aggregation on a time field called `responsetime`, the name of the aggregation must also be `responsetime`.
- For `composite` aggregation support, there must be exactly one `date_histogram` value source. That value source must not be sorted in descending order. Additional `composite` aggregation value sources are allowed, such as `terms`.
- The `size` parameter of the non-composite aggregations must match the cardinality of your data. A greater value of the `size` parameter increases the memory requirement of the aggregation.
- If you set the `summary_count_field_name` property to a non-null value, the anomaly detection job expects to receive aggregated input. The property must be set to the name of the field that contains the count of raw data points that have been aggregated. It applies to all detectors in the job.
- The influencers or the partition fields must be included in the aggregation of your datafeed, otherwise they are not included in the job analysis. For more information on influencers, refer to [Influencers](/docs/explore-analyze/machine-learning/anomaly-detection/ml-ad-run-jobs#ml-ad-influencers).
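
Some of the requirements above can be checked mechanically before a datafeed is created. The following Python sketch is one hypothetical way to do that for the `date_histogram` case only: it looks for the date histogram, then verifies the `time_zone` and the `max` sub-aggregation on the time field. The function names are illustrative and are not part of any Elastic client, and the check assumes a well-formed aggregation object.

```python
def _sub_aggs(body):
    """Return the sub-aggregations of an aggregation body (either key spelling)."""
    return body.get("aggregations") or body.get("aggs") or {}


def find_date_histogram(aggs):
    """Return the body of the first date_histogram found, searching depth-first."""
    for body in aggs.values():
        if "date_histogram" in body:
            return body
        found = find_date_histogram(_sub_aggs(body))
        if found is not None:
            return found
    return None


def check_time_requirements(aggs, time_field):
    """Return a list of problems; an empty list means these checks passed."""
    body = find_date_histogram(aggs)
    if body is None:
        return ["no date_histogram aggregation found"]
    problems = []
    # UTC is the default, so a missing time_zone is accepted.
    if body["date_histogram"].get("time_zone", "UTC") != "UTC":
        problems.append("time_zone must be UTC")
    # The max aggregation must be named after the time field it operates on.
    max_agg = _sub_aggs(body).get(time_field, {}).get("max", {})
    if max_agg.get("field") != time_field:
        problems.append(f"no max aggregation named '{time_field}' on field '{time_field}'")
    return problems
```

Calling `check_time_requirements(datafeed_aggs, "time")` on a datafeed's aggregation object returns an empty list when both checks pass. Composite aggregations, the `size` parameter, and influencer coverage are deliberately out of scope for this sketch.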


### Intervals

- The bucket span of your anomaly detection job must be divisible by the value of the `calendar_interval` or `fixed_interval` in your aggregation (with no remainder).
- If you specify a `frequency` for your datafeed, it must be divisible by the `calendar_interval` or the `fixed_interval`.
- Anomaly detection jobs cannot use `date_histogram` or `composite` aggregations with an interval measured in months because the length of the month is not fixed; they can use weeks or smaller units.
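
The divisibility rules above are plain arithmetic once the durations are expressed in the same unit. The following Python sketch illustrates this under a simplifying assumption: it only understands fixed-length durations written with `s`, `m`, `h`, or `d`, and it rejects anything else (including calendar units). The helper names are hypothetical.

```python
import re

# Fixed-length units only; calendar units such as weeks are not handled here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(duration):
    """Convert a duration such as "360s" or "60m" to a number of seconds."""
    match = re.fullmatch(r"([1-9]\d*)([smhd])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def is_whole_multiple(span, interval):
    """True when `span` divides by `interval` with no remainder."""
    return to_seconds(span) % to_seconds(interval) == 0


print(is_whole_multiple("60m", "360s"))  # True: 3600 s is 10 x 360 s
print(is_whole_multiple("60m", "7m"))    # False: 3600 s is not a multiple of 420 s
```

The same check applies to a datafeed `frequency`: pass the frequency as `span` and the aggregation interval as `interval`.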


## Limitations

- If your [datafeed uses aggregations with nested `terms` aggs](#aggs-dfeeds) and model plot is not enabled for the anomaly detection job, neither the **Single Metric Viewer** nor the **Anomaly Explorer** can plot and display an anomaly chart. In these cases, an explanatory message is shown instead of the chart.
- Your datafeed can contain multiple aggregations, but only the ones with names that match values in the job configuration are fed to the job.
- Using [scripted metric](https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-scripted-metric-aggregation) aggregations is not supported in datafeeds.
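
To see which aggregations would actually reach the job under the name-matching rule above, you can compare aggregation names with the names used in the job configuration. The following Python sketch is illustrative only. It assumes the job refers to fields through `field_name`, `by_field_name`, `over_field_name`, `partition_field_name`, `influencers`, and `summary_count_field_name`, plus the time field, and the function names are hypothetical.

```python
DETECTOR_KEYS = ("field_name", "by_field_name", "over_field_name", "partition_field_name")


def job_field_names(analysis_config, time_field):
    """Collect every field name that the job configuration refers to."""
    names = set(analysis_config.get("influencers", []))
    names.add(time_field)
    summary = analysis_config.get("summary_count_field_name")
    if summary:
        names.add(summary)
    for detector in analysis_config.get("detectors", []):
        for key in DETECTOR_KEYS:
            if detector.get(key):
                names.add(detector[key])
    return names


def aggregation_names(aggs):
    """Collect the names of all aggregations, including nested ones."""
    names = set()
    for name, body in aggs.items():
        names.add(name)
        names |= aggregation_names(body.get("aggregations") or body.get("aggs") or {})
    return names


def matching_aggregation_names(analysis_config, time_field, aggs):
    """Names that appear both as an aggregation and in the job configuration."""
    return aggregation_names(aggs) & job_field_names(analysis_config, time_field)
```

Aggregations whose names are missing from the returned set, other than the bucketing aggregation itself, are candidates for removal or renaming.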


## Recommendations

- When your detectors use [metric](https://www.elastic.co/docs/reference/machine-learning/ml-metric-functions) or [sum](https://www.elastic.co/docs/reference/machine-learning/ml-sum-functions) analytical functions, it’s recommended to set the `date_histogram` or `composite` aggregation interval to a tenth of the bucket span. This creates finer, more granular time buckets, which are ideal for this type of analysis.
- When your detectors use [count](https://www.elastic.co/docs/reference/machine-learning/ml-count-functions) or [rare](https://www.elastic.co/docs/reference/machine-learning/ml-rare-functions) functions, set the interval to the same value as the bucket span.
- If you have multiple influencers or partition fields or if your field cardinality is more than 1000, use [composite aggregations](https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation).
  To determine the cardinality of your data, you can run searches such as:
  ```js
  GET .../_search
  {
    "aggs": {
      "service_cardinality": {
        "cardinality": {
          "field": "service"
        }
      }
    }
  }
  ```
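
The recommendations above reduce to two small decisions, which the following Python sketch spells out. It is a rough illustration, not an official rule engine: the function family names (`metric`, `sum`, `count`, `rare`) mirror the wording of this section, and both helper names are hypothetical.

```python
def recommended_interval_seconds(bucket_span_seconds, function_family):
    """Suggest an aggregation interval, in seconds, for a given bucket span."""
    if function_family in ("metric", "sum"):
        # A tenth of the bucket span gives finer time buckets.
        if bucket_span_seconds % 10 != 0:
            raise ValueError("bucket span is not evenly divisible by 10")
        return bucket_span_seconds // 10
    if function_family in ("count", "rare"):
        # Use the bucket span itself as the interval.
        return bucket_span_seconds
    raise ValueError(f"no recommendation for function family {function_family!r}")


def should_use_composite(field_cardinality, split_field_count):
    """True when the guidance above points to a composite aggregation."""
    return split_field_count > 1 or field_cardinality > 1000


print(recommended_interval_seconds(3600, "metric"))  # 360
print(recommended_interval_seconds(3600, "count"))   # 3600
```

For a 60 minute bucket span and a metric detector this yields 360 seconds, which is the `fixed_interval` used in the examples on this page.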


## Including aggregations in anomaly detection jobs

When you create or update an anomaly detection job, you can include aggregated fields in the analysis configuration. In the datafeed configuration object, you can define the aggregations.
```json

{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "time"
  },
  "datafeed_config":{
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "time": {
            "max": {"field": "time"}
          },
          "airline": {
            "terms": {
              "field": "airline",
              "size": 100
            },
            "aggregations": {
              "responsetime": {
                "avg": {
                  "field": "responsetime"
                }
              }
            }
          }
        }
      }
    }
  }
}
```

Use the following format to define a `date_histogram` aggregation to bucket by time in your datafeed:
```js
"aggregations": {
  ["bucketing_aggregation": {
    "bucket_agg": {
      ...
    },
    "aggregations": {
      "date_histogram_aggregation": {
        "date_histogram": {
          "field": "time"
        },
        "aggregations": {
          "timestamp": {
            "max": {
              "field": "time"
            }
          }
          [,"<first_term>": {
            "terms":{...
            }
            [,"aggregations" : {
              [<sub_aggregation>]+
            } ]
          }]
        }
      }
    }
  }
}
```


## Composite aggregations

Composite aggregations are optimized for queries that are either `match_all` or `range` filters. Use composite aggregations in your datafeeds for these cases. Other types of queries may cause the `composite` aggregation to be inefficient.
The following is an example of a job with a datafeed that uses a `composite` aggregation to bucket the metrics based on time and terms:
```json

{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field":"time"
  },
  "datafeed_config":{
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "composite": {
          "size": 1000,
          "sources": [
            {
              "time_bucket": {
                "date_histogram": {
                  "field": "time",
                  "fixed_interval": "360s",
                  "time_zone": "UTC"
                }
              }
            },
            {
              "airline": {
                "terms": {
                  "field": "airline"
                }
              }
            }
          ]
        },
        "aggregations": {
          "time": {
            "max": {
              "field": "time"
            }
          },
          "responsetime": {
            "avg": {
              "field": "responsetime"
            }
          }
        }
      }
    }
  }
}
```

Use the following format to define a composite aggregation in your datafeed:
```js
"aggregations": {
  "composite_agg": {
    "composite": {
      "sources": [
        {
          "date_histogram_agg": {
            "date_histogram": {
              "field": "time",
              ...settings...
            }
          }
        },
        ...other valid sources...
      ],
      ...composite agg settings...
    },
    "aggregations": {
      "timestamp": {
        "max": {
          "field": "time"
        }
      },
      ...other aggregations...
      [,"aggregations": {
        [<sub_aggregation>]+
      }]
    }
  }
}
```
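
If you generate datafeed configurations programmatically, the composite structure described in this section can be assembled from a few inputs. The following Python sketch is one possible way to do so. The function and its parameters are hypothetical, the source name `time_bucket` and the wrapper name `buckets` are taken from the example job in this section, and it assumes each metric aggregation is named after the field it operates on.

```python
def build_composite_aggs(time_field, fixed_interval, terms_fields, metric_aggs, size=1000):
    """Build a datafeed `aggregations` object around a composite aggregation.

    metric_aggs maps a field name to an aggregation type, for example
    {"responsetime": "avg"}.
    """
    # Exactly one date_histogram value source, in UTC.
    sources = [{
        "time_bucket": {
            "date_histogram": {
                "field": time_field,
                "fixed_interval": fixed_interval,
                "time_zone": "UTC",
            }
        }
    }]
    # Additional value sources, one terms source per field.
    for field in terms_fields:
        sources.append({field: {"terms": {"field": field}}})

    # The max aggregation on the time field is required.
    sub_aggs = {time_field: {"max": {"field": time_field}}}
    for field, agg_type in metric_aggs.items():
        sub_aggs[field] = {agg_type: {"field": field}}

    return {
        "buckets": {
            "composite": {"size": size, "sources": sources},
            "aggregations": sub_aggs,
        }
    }
```

For example, `build_composite_aggs("time", "360s", ["airline"], {"responsetime": "avg"})` produces the same `aggregations` object as the example job in this section.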


## Nested aggregations

You can also use complex nested aggregations in datafeeds.
The next example uses the [`derivative` pipeline aggregation](https://www.elastic.co/docs/reference/aggregations/search-aggregations-pipeline-derivative-aggregation) to find the first-order derivative of the counter `system.network.out.bytes` for each value of the field `beat.name`.
<note>
  `derivative` or other pipeline aggregations may not work within `composite` aggregations. See [composite aggregations and pipeline aggregations](https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation#search-aggregations-bucket-composite-aggregation-pipeline-aggregations).
</note>

```js
"aggregations": {
  "beat.name": {
    "terms": {
      "field": "beat.name"
    },
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m"
        },
        "aggregations": {
          "@timestamp": {
            "max": {
              "field": "@timestamp"
            }
          },
          "bytes_out_average": {
            "avg": {
              "field": "system.network.out.bytes"
            }
          },
          "bytes_out_derivative": {
            "derivative": {
              "buckets_path": "bytes_out_average"
            }
          }
        }
      }
    }
  }
}
```
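
As a rough illustration of what the `bytes_out_derivative` values represent, the following Python sketch computes first-order differences over a list of per-bucket averages. It is a simplification written for this page rather than the Elasticsearch implementation: it assumes every bucket has a value and ignores gap handling and unit normalization.

```python
def first_order_derivative(values):
    """Return the change from the previous bucket for each bucket.

    The first bucket has no predecessor, so its entry is None.
    """
    if not values:
        return []
    return [None] + [current - previous for previous, current in zip(values, values[1:])]


# Average of a counter in three consecutive 5 minute buckets.
print(first_order_derivative([100, 150, 130]))  # [None, 50, -20]
```

Feeding the differences rather than the raw counter lets the job model the rate of change instead of an ever-growing total.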


## Single bucket aggregations

You can also use single bucket aggregations in datafeeds. The following example shows two `filter` aggregations, each using `value_count` to count the values of the `error` field for one server.
```js
{
  "job_id":"servers-unique-errors",
  "indices": ["logs-*"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "time",
        "fixed_interval": "360s",
        "time_zone": "UTC"
      },
      "aggregations": {
        "time": {
          "max": {"field": "time"}
        },
        "server1": {
          "filter": {"term": {"source": "server-name-1"}},
          "aggregations": {
            "server1_error_count": {
              "value_count": {
                "field": "error"
              }
            }
          }
        },
        "server2": {
          "filter": {"term": {"source": "server-name-2"}},
          "aggregations": {
            "server2_error_count": {
              "value_count": {
                "field": "error"
              }
            }
          }
        }
      }
    }
  }
}
```


## Using `aggregate_metric_double` field type in datafeeds

<note>
  It is not currently possible to use `aggregate_metric_double` type fields in datafeeds without aggregations.
</note>

You can use fields with the [`aggregate_metric_double`](https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/aggregate-metric-double) field type in a datafeed with aggregations. You must retrieve the `value_count` of the `aggregate_metric_double` field in an aggregation and then use it as the `summary_count_field_name`, so that the job receives the correct count of data points behind each aggregated value.
In the following example, `presum` is an `aggregate_metric_double` type field that has all the possible metrics: `[ min, max, sum, value_count ]`. To use an `avg` aggregation on this field, you need to perform a `value_count` aggregation on `presum` and then set the name of that aggregation, `my_count`, as the `summary_count_field_name`:
```js
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "function": "avg",
        "field_name": "my_avg"
      }
    ],
    "summary_count_field_name": "my_count" 
  },
  "data_description": {
    "time_field": "timestamp"
  },
  "datafeed_config": {
    "indices": [
      "my_index"
    ],
    "datafeed_id": "datafeed-id",
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "timestamp",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "timestamp": {
            "max": {"field": "timestamp"}
          },
          "my_avg": {
            "avg": {
              "field": "presum"
            }
          },
          "my_count": {
            "value_count": {
              "field": "presum"
            }
          }
        }
      }
    }
  }
}
```
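
The reason the `value_count` matters can be shown with a little arithmetic. The following Python sketch models two pre-aggregated documents as plain dictionaries, which is an assumption made for illustration and not how Elasticsearch stores the field. It computes the average as the total of the `sum` values divided by the total of the `value_count` values, which is how an average is derived from this field type as far as the mapping reference describes it.

```python
def avg_and_count(docs):
    """Return (average, number of raw data points) for pre-aggregated documents."""
    total = sum(doc["sum"] for doc in docs)
    count = sum(doc["value_count"] for doc in docs)
    if count == 0:
        raise ValueError("no data points to average")
    return total / count, count


# Two documents in one time bucket, together summarizing four raw data points.
bucket_docs = [
    {"sum": 30.0, "value_count": 3},
    {"sum": 10.0, "value_count": 1},
]
print(avg_and_count(bucket_docs))  # (10.0, 4)
print(len(bucket_docs))            # 2
```

The bucket contains two documents but four raw data points, so a plain document count would understate how much data backs the average. Using the `value_count` aggregation as the `summary_count_field_name` gives the job the figure of four.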