Create a datafeed | Elasticsearch API documentation

Create a datafeed Generally available; Added in 5.4.0

PUT /_ml/datafeeds/{datafeed_id}

Datafeeds retrieve data from Elasticsearch for analysis by an anomaly detection job. You can associate only one datafeed with each anomaly detection job. The datafeed contains a query that runs at a defined interval (`frequency`). If you are concerned about delayed data, you can add a delay (`query_delay`) at each interval. By default, the datafeed uses the following query: `{"match_all": {"boost": 1}}`.

When Elasticsearch security features are enabled, your datafeed remembers which roles the user who created it had at the time of creation and runs the query using those same roles. If you provide secondary authorization headers, those credentials are used instead. You must use Kibana, this API, or the create anomaly detection jobs API to create a datafeed. Do not add a datafeed directly to the .ml-config index. Do not give users write privileges on the .ml-config index.

Required authorization

Index privileges: read
Cluster privileges: manage_ml

Path parameters

datafeed_id string Required

A character string that uniquely identifies the datafeed. This identifier can contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters.
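
The character rules above can be checked on the client before sending the request. The following is a minimal sketch, assuming only the rules stated in this section; the helper name `is_valid_datafeed_id` is hypothetical, and the server may enforce additional constraints that are not listed here.

```python
import re

# Pattern derived only from the rules stated above: lowercase letters, digits,
# hyphens, and underscores, starting and ending with a letter or digit.
_DATAFEED_ID = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def is_valid_datafeed_id(datafeed_id: str) -> bool:
    """Return True if the identifier satisfies the documented character rules."""
    return _DATAFEED_ID.fullmatch(datafeed_id) is not None


print(is_valid_datafeed_id("datafeed-test-job"))  # True
print(is_valid_datafeed_id("_datafeed"))          # False: starts with an underscore
```
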

Query parameters

allow_no_indices boolean

A setting that does two separate checks on the index expression. If false, the request returns an error (1) if any wildcard expression (including _all and *) resolves to zero matching indices or (2) if the complete set of resolved indices, aliases or data streams is empty after all expressions are evaluated. If true, index expressions that resolve to no indices are allowed and the request returns an empty result.
expand_wildcards string | array[string]
Type of index that wildcard patterns can match. If the request can target data streams, this argument determines whether wildcard expressions match hidden data streams. Supports comma-separated values.

Supported values include:
- all: Match any data stream or index, including hidden ones.
- open: Match open, non-hidden indices. Also matches any non-hidden data stream.
- closed: Match closed, non-hidden indices. Also matches any non-hidden data stream. Data streams cannot be closed.
- hidden: Match hidden data streams and hidden indices. Must be combined with open, closed, or both.
- none: Wildcard expressions are not accepted.
Values are all, open, closed, hidden, or none.
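
As an illustration of the value rules above, the sketch below parses a comma-separated expand_wildcards string and applies the stated constraint that hidden is combined with open, closed, or both. The function name `parse_expand_wildcards` is hypothetical, and this is a client-side reading of the description; Elasticsearch remains the authority on which combinations it accepts.

```python
from typing import List

ALLOWED_VALUES = {"all", "open", "closed", "hidden", "none"}


def parse_expand_wildcards(value: str) -> List[str]:
    """Split a comma-separated value and check it against the documented rules."""
    parts = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [part for part in parts if part not in ALLOWED_VALUES]
    if unknown:
        raise ValueError(f"unsupported expand_wildcards values: {unknown}")
    # Documented rule: hidden must be combined with open, closed, or both.
    if "hidden" in parts and not {"open", "closed"} & set(parts):
        raise ValueError("hidden must be combined with open, closed, or both")
    return parts


print(parse_expand_wildcards("open,hidden"))  # ['open', 'hidden']
```
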
ignore_throttled boolean Deprecated

If true, concrete, expanded, or aliased indices are ignored when frozen.
ignore_unavailable boolean

If false, the request returns an error if it targets a concrete (non-wildcarded) index, alias, or data stream that is missing, closed, or otherwise unavailable. If true, unavailable concrete targets are silently ignored.

application/json

Body Required

aggregations object

If set, the datafeed performs aggregation searches. Support for aggregations is limited and should be used only with low cardinality data.
chunking_config object

Datafeeds might be required to search over long time periods, for several months or years. This search is split into time chunks in order to ensure the load on Elasticsearch is managed. Chunking configuration controls how the size of these time chunks is calculated; it is an advanced configuration option.
- mode string Required
  
  If the mode is auto, the chunk size is dynamically calculated; this is the recommended value when the datafeed does not use aggregations. If the mode is manual, chunking is applied according to the specified time_span; use this mode when the datafeed uses aggregations. If the mode is off, no chunking is applied.
  
  Values are auto, manual, or off.
- time_span string
  
  The time span that each search will be querying. This setting is applicable only when the mode is set to manual.
  
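
A chunking_config object for the request body can be assembled as in the sketch below. The helper name `build_chunking_config` is hypothetical. Requiring time_span in manual mode and rejecting it in other modes are assumptions drawn from the descriptions above, not documented server behavior.

```python
from typing import Dict, Optional


def build_chunking_config(mode: str, time_span: Optional[str] = None) -> Dict[str, str]:
    """Build a chunking_config object for a create datafeed request body."""
    if mode not in ("auto", "manual", "off"):
        raise ValueError("mode must be auto, manual, or off")
    config = {"mode": mode}
    if mode == "manual":
        # Assumption: manual chunking is only meaningful with an explicit time_span.
        if time_span is None:
            raise ValueError("manual mode needs a time_span, for example '3h'")
        config["time_span"] = time_span
    elif time_span is not None:
        # time_span applies only when the mode is manual.
        raise ValueError("time_span applies only when mode is manual")
    return config


print(build_chunking_config("manual", "3h"))  # {'mode': 'manual', 'time_span': '3h'}
```
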
delayed_data_check_config object

Specifies whether the datafeed checks for missing data and the size of the window. The datafeed can optionally search over indices that have already been read in an effort to determine whether any data has subsequently been added to the index. If missing data is found, it is a good indication that the query_delay is set too low and the data is being indexed after the datafeed has passed that moment in time. This check runs only on real-time datafeeds.
- check_window string
  
  The window of time that is searched for late data. This window of time ends with the latest finalized bucket. It defaults to null, which causes an appropriate check_window to be calculated when the real-time datafeed runs. In particular, the default check_window span calculation is based on the maximum of 2h or 8 * bucket_span.
  
- enabled boolean Required
  
  Specifies whether the datafeed periodically checks for delayed data.
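
The default check_window described above is based on the maximum of 2h or 8 * bucket_span. The arithmetic can be sketched as follows; the function name `default_check_window` is hypothetical, and the real calculation may apply further limits that this section does not describe.

```python
from datetime import timedelta


def default_check_window(bucket_span: timedelta) -> timedelta:
    """Return the larger of 2 hours and 8 bucket spans."""
    return max(timedelta(hours=2), 8 * bucket_span)


print(default_check_window(timedelta(minutes=5)))  # 2:00:00
print(default_check_window(timedelta(hours=1)))    # 8:00:00
```
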
frequency string

The interval at which scheduled queries are made while the datafeed runs in real time. The default value is either the bucket span for short bucket spans, or, for longer bucket spans, a sensible fraction of the bucket span. When frequency is shorter than the bucket span, interim results for the last (partial) bucket are written then eventually overwritten by the full bucket results. If the datafeed uses aggregations, this value must be divisible by the interval of the date histogram aggregation.

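
When the datafeed uses aggregations, the frequency must be divisible by the date histogram interval. A minimal client-side check is sketched below. Expressing both durations as whole seconds and the name `frequency_fits_histogram` are assumptions made for illustration; the API itself takes time unit strings.

```python
def frequency_fits_histogram(frequency_seconds: int, histogram_interval_seconds: int) -> bool:
    """Return True if the frequency is a whole multiple of the histogram interval."""
    if frequency_seconds <= 0 or histogram_interval_seconds <= 0:
        raise ValueError("durations must be positive")
    return frequency_seconds % histogram_interval_seconds == 0


print(frequency_fits_histogram(600, 300))  # True: 10m is two 5m intervals
print(frequency_fits_histogram(450, 300))  # False: 7.5m is not a multiple of 5m
```
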
indices string | array[string]

An array of index names. Wildcards are supported. If any of the indices are in remote clusters, the master nodes and the machine learning nodes must have the remote_cluster_client role.
indices_options object

Specifies index expansion options that are used during search.
- allow_no_indices boolean
  
  A setting that does two separate checks on the index expression. If false, the request returns an error (1) if any wildcard expression (including _all and *) resolves to zero matching indices or (2) if the complete set of resolved indices, aliases or data streams is empty after all expressions are evaluated. If true, index expressions that resolve to no indices are allowed and the request returns an empty result.
- expand_wildcards string | array[string]
  Type of index that wildcard patterns can match. If the request can target data streams, this argument determines whether wildcard expressions match hidden data streams. Supports comma-separated values, such as open,hidden.
  
  Supported values include:
  
  all: Match any data stream or index, including hidden ones.
  
  open: Match open, non-hidden indices. Also matches any non-hidden data stream.
  
  closed: Match closed, non-hidden indices. Also matches any non-hidden data stream. Data streams cannot be closed.
  
  hidden: Match hidden data streams and hidden indices. Must be combined with open, closed, or both.
  
  none: Wildcard expressions are not accepted.
- ignore_unavailable boolean
  
  If false, the request returns an error if it targets a concrete (non-wildcarded) index, alias, or data stream that is missing, closed, or otherwise unavailable. If true, unavailable concrete targets are silently ignored.
  
  Default value is false.
- ignore_throttled boolean
  
  If true, concrete, expanded or aliased indices are ignored when frozen.
  
  Default value is true.
job_id string

Identifier for the anomaly detection job.
max_empty_searches number

If a real-time datafeed has never seen any data (including during any initial training period), it automatically stops and closes the associated job after this many real-time searches return no documents. In other words, it stops after frequency times max_empty_searches of real-time operation. If not set, a datafeed with no end time that sees no data remains started until it is explicitly stopped. By default, it is not set.
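
As a worked example of the rule above, the time a real-time datafeed runs without seeing data before it stops is frequency times max_empty_searches. The function name `time_until_auto_stop` and the sample values are hypothetical.

```python
from datetime import timedelta


def time_until_auto_stop(frequency: timedelta, max_empty_searches: int) -> timedelta:
    """Return how long a datafeed that never sees data runs before it stops."""
    return frequency * max_empty_searches


# A 150-second frequency with max_empty_searches set to 10 gives 25 minutes.
print(time_until_auto_stop(timedelta(seconds=150), 10))  # 0:25:00
```
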
query object

The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch.

query_delay string

The number of seconds behind real time that data is queried. For example, if data from 10:04 a.m. might not be searchable in Elasticsearch until 10:06 a.m., set this property to 120 seconds. The default value is randomly selected between 60s and 120s. This randomness improves the query performance when there are multiple jobs running on the same node.

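
The 10:04 a.m. example above can be reproduced with a short calculation. This is a sketch only: the function name `required_query_delay` is hypothetical, and the date is arbitrary.

```python
from datetime import datetime


def required_query_delay(event_time: datetime, searchable_time: datetime) -> str:
    """Return the lag between event time and searchability as a seconds string."""
    lag_seconds = int((searchable_time - event_time).total_seconds())
    if lag_seconds < 0:
        raise ValueError("searchable_time must not be earlier than event_time")
    return f"{lag_seconds}s"


# Data from 10:04 that becomes searchable at 10:06 needs a delay of 120 seconds.
print(required_query_delay(datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 6)))  # 120s
```
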
runtime_mappings object

Specifies runtime fields for the datafeed search.
- * object Additional properties
  
  fields object
  
  For type composite
  
  * object Additional properties
  
  type string Required
  
  Values are boolean, composite, date, double, geo_point, geo_shape, ip, keyword, long, or lookup.
  
  fetch_fields array[object]
  
  For type lookup
  
  field string Required
  
  Path to field or array of paths. Some APIs support wildcards in the path to select multiple fields.
  
  format string
  
  format string
  
  A custom format for date type runtime fields.
  
  input_field string
  
  For type lookup
  
  target_field string
  
  For type lookup
  
  target_index string
  
  For type lookup
  
  script object
  
  Painless script executed at query time.
  
  source
  
  id string
  
  The id for a stored script.
  
  params object
  
  Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.
  
  * object Additional properties
  
  lang
  
  options object
  
  * string Additional properties
  
  type string Required
  
  Field type, which can be: boolean, composite, date, double, geo_point, ip, keyword, long, or lookup.
  
  Values are boolean, composite, date, double, geo_point, geo_shape, ip, keyword, long, or lookup.
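
To make the nesting of runtime_mappings concrete, the sketch below builds a request body with one runtime field. The field name `bytes_kb`, the assumption that the index contains a numeric `bytes` field, and the helper name `build_datafeed_body` are all hypothetical; only the `type` and `script.source` keys come from the attribute list above.

```python
import json


def build_datafeed_body() -> dict:
    """Build a create datafeed request body that defines one runtime field."""
    return {
        "job_id": "test-job",
        "indices": ["kibana_sample_data_logs"],
        "runtime_mappings": {
            # Hypothetical runtime field; assumes a numeric `bytes` field exists.
            "bytes_kb": {
                "type": "double",
                # Painless script executed at query time.
                "script": {"source": "emit(doc['bytes'].value / 1024.0)"},
            }
        },
    }


print(json.dumps(build_datafeed_body(), indent=2))
```
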
script_fields object

Specifies scripts that evaluate custom expressions and return script fields to the datafeed. The detector configuration objects in a job can contain functions that use these script fields.
- * object Additional properties
  
  script object Required
  
  source string | object
  
  The script source.
  
  One of: string, SearchRequestBody object
  
  id string
  
  The id for a stored script.
  
  params object
  
  Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.
  
  * object Additional properties
  
  lang string
  
  Specifies the language the script is written in.
  
  Supported values include:
  
  painless: Painless scripting language, purpose-built for Elasticsearch.
  
  expression: Lucene’s expressions language, compiles a JavaScript expression to bytecode.
  
  mustache: Mustache templated, used for templates.
  
  java: Expert Java API
  
  Values are painless, expression, mustache, or java.
  
  options object
  
  * string Additional properties
  
  ignore_failure boolean
scroll_size number

The size parameter that is used in Elasticsearch searches when the datafeed does not use aggregations. The maximum value is the value of index.max_result_window, which is 10,000 by default.

Default value is 1000.
headers object Generally available; Added in 8.0.0

Responses

200 application/json
- aggregations object
- authorization object
  
  api_key object
  
  If an API key was used for the most recent update to the datafeed, its name and identifier are listed in the response.
  
  id string Required
  
  The identifier for the API key.
  
  name string Required
  
  The name of the API key.
  
  roles array[string]
  
  If a user ID was used for the most recent update to the datafeed, its roles at the time of the update are listed in the response.
  
  service_account string
  
  If a service account was used for the most recent update to the datafeed, the account name is listed in the response.
- chunking_config object Required
  
  mode string Required
  
  If the mode is auto, the chunk size is dynamically calculated; this is the recommended value when the datafeed does not use aggregations. If the mode is manual, chunking is applied according to the specified time_span; use this mode when the datafeed uses aggregations. If the mode is off, no chunking is applied.
  
  Values are auto, manual, or off.
  
  time_span string
  
  The time span that each search will be querying. This setting is applicable only when the mode is set to manual.
  
- delayed_data_check_config object
  
  check_window string
  
  The window of time that is searched for late data. This window of time ends with the latest finalized bucket. It defaults to null, which causes an appropriate check_window to be calculated when the real-time datafeed runs. In particular, the default check_window span calculation is based on the maximum of 2h or 8 * bucket_span.
  
  enabled boolean Required
  
  Specifies whether the datafeed periodically checks for delayed data.
- datafeed_id string Required
- frequency string
  
  A duration. Units can be nanos, micros, ms (milliseconds), s (seconds), m (minutes), h (hours) and d (days). Also accepts "0" without a unit and "-1" to indicate an unspecified value.
  
- indices array[string] Required
- job_id string Required
- indices_options object
  
  Controls how to deal with unavailable concrete indices (closed or missing), how wildcard expressions are expanded to actual indices (all, closed or open indices) and how to deal with wildcard expressions that resolve to no indices.
  
  allow_no_indices boolean
  
  A setting that does two separate checks on the index expression. If false, the request returns an error (1) if any wildcard expression (including _all and *) resolves to zero matching indices or (2) if the complete set of resolved indices, aliases or data streams is empty after all expressions are evaluated. If true, index expressions that resolve to no indices are allowed and the request returns an empty result.
  
  expand_wildcards string | array[string]
  
  Type of index that wildcard patterns can match. If the request can target data streams, this argument determines whether wildcard expressions match hidden data streams. Supports comma-separated values, such as open,hidden.
  
  Supported values include:
  
  all: Match any data stream or index, including hidden ones.
  
  open: Match open, non-hidden indices. Also matches any non-hidden data stream.
  
  closed: Match closed, non-hidden indices. Also matches any non-hidden data stream. Data streams cannot be closed.
  
  hidden: Match hidden data streams and hidden indices. Must be combined with open, closed, or both.
  
  none: Wildcard expressions are not accepted.
  
  ignore_unavailable boolean
  
  If false, the request returns an error if it targets a concrete (non-wildcarded) index, alias, or data stream that is missing, closed, or otherwise unavailable. If true, unavailable concrete targets are silently ignored.
  
  Default value is false.
  
  ignore_throttled boolean
  
  If true, concrete, expanded or aliased indices are ignored when frozen.
  
  Default value is true.
- max_empty_searches number
- query object Required
  
  An Elasticsearch Query DSL (Domain Specific Language) object that defines a query.
  
- query_delay string Required
  
  A duration. Units can be nanos, micros, ms (milliseconds), s (seconds), m (minutes), h (hours) and d (days). Also accepts "0" without a unit and "-1" to indicate an unspecified value.
  
- runtime_mappings object
  
  * object Additional properties
  
  fields object
  
  For type composite
  
  * object Additional properties
  
  fetch_fields array[object]
  
  For type lookup
  
  field
  
  format string
  
  format string
  
  A custom format for date type runtime fields.
  
  input_field string
  
  For type lookup
  
  target_field string
  
  For type lookup
  
  target_index string
  
  For type lookup
  
  script object
  
  Painless script executed at query time.
  
  params object
  
  Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.
  
  options object
  
  type string Required
  
  Field type, which can be: boolean, composite, date, double, geo_point, ip, keyword, long, or lookup.
  
  Values are boolean, composite, date, double, geo_point, geo_shape, ip, keyword, long, or lookup.
- script_fields object
  
  * object Additional properties
  
  script object Required
  
  source
  
  id string
  
  The id for a stored script.
  
  params object
  
  Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.
  
  * object Additional properties
  
  lang
  
  options object
  
  * string Additional properties
  
  ignore_failure boolean
- scroll_size number Required

PUT /_ml/datafeeds/{datafeed_id}

PUT _ml/datafeeds/datafeed-test-job?pretty
{
  "indices": [
    "kibana_sample_data_logs"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ]
    }
  },
  "job_id": "test-job"
}

resp = client.ml.put_datafeed(
    datafeed_id="datafeed-test-job",
    pretty=True,
    indices=[
        "kibana_sample_data_logs"
    ],
    query={
        "bool": {
            "must": [
                {
                    "match_all": {}
                }
            ]
        }
    },
    job_id="test-job",
)

const response = await client.ml.putDatafeed({
  datafeed_id: "datafeed-test-job",
  pretty: "true",
  indices: ["kibana_sample_data_logs"],
  query: {
    bool: {
      must: [
        {
          match_all: {},
        },
      ],
    },
  },
  job_id: "test-job",
});

response = client.ml.put_datafeed(
  datafeed_id: "datafeed-test-job",
  pretty: "true",
  body: {
    "indices": [
      "kibana_sample_data_logs"
    ],
    "query": {
      "bool": {
        "must": [
          {
            "match_all": {}
          }
        ]
      }
    },
    "job_id": "test-job"
  }
)

$resp = $client->ml()->putDatafeed([
    "datafeed_id" => "datafeed-test-job",
    "pretty" => "true",
    "body" => [
        "indices" => array(
            "kibana_sample_data_logs",
        ),
        "query" => [
            "bool" => [
                "must" => array(
                    [
                        "match_all" => new ArrayObject([]),
                    ],
                ),
            ],
        ],
        "job_id" => "test-job",
    ],
]);

curl -X PUT -H "Authorization: ApiKey $ELASTIC_API_KEY" -H "Content-Type: application/json" -d '{"indices":["kibana_sample_data_logs"],"query":{"bool":{"must":[{"match_all":{}}]}},"job_id":"test-job"}' "$ELASTICSEARCH_URL/_ml/datafeeds/datafeed-test-job?pretty"

client.ml().putDatafeed(p -> p
    .aggregations(Map.of())
    .datafeedId("datafeed-test-job")
    .expandWildcards(List.of())
    .headers(Map.of())
    .indices("kibana_sample_data_logs")
    .jobId("test-job")
    .query(q -> q
        .bool(b -> b
            .filter(List.of())
            .must(m -> m
                .matchAll(ma -> ma)
            )
            .mustNot(List.of())
            .should(List.of())
        )
    )
    .runtimeMappings(Map.of())
    .scriptFields(Map.of())
);

Request example

An example body for a `PUT _ml/datafeeds/datafeed-test-job?pretty` request.

{
  "indices": [
    "kibana_sample_data_logs"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ]
    }
  },
  "job_id": "test-job"
}