Create datafeeds API

Instantiates a datafeed.

Request

PUT _ml/datafeeds/<feed_id>

Prerequisites

  • You must create an anomaly detection job before you create a datafeed.
  • If Elasticsearch security features are enabled, you must have manage_ml or manage cluster privileges to use this API. See Security privileges.

Description

You can associate only one datafeed with each anomaly detection job.

  • You must use Kibana or this API to create a datafeed. Do not put a datafeed directly to the .ml-config index using the Elasticsearch index API. If Elasticsearch security features are enabled, do not give users write privileges on the .ml-config index.
  • When Elasticsearch security features are enabled, your datafeed remembers which roles the user who created it had at the time of creation and runs the query using those same roles.

Path parameters

<feed_id>
(Required, string) A character string that uniquely identifies the datafeed. This identifier can contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters.
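
The naming rule above can be checked on the client before the request is sent. The following is a minimal sketch based only on the rule as stated here. The function name is_valid_feed_id is hypothetical, and Elasticsearch applies its own validation, which might impose further limits (for example, on length) that are not described in this section.

```python
import re

# Pattern derived from the rule stated above: lowercase alphanumerics,
# hyphens, and underscores; the first and last characters must be
# alphanumeric. This is an assumption-based sketch, not the server's
# own validator.
_FEED_ID = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def is_valid_feed_id(feed_id: str) -> bool:
    """Return True if feed_id satisfies the documented naming rule."""
    return _FEED_ID.fullmatch(feed_id) is not None


for candidate in ("datafeed-total-requests", "-leading-hyphen", "UpperCase"):
    print(candidate, is_valid_feed_id(candidate))
```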

Request body

aggregations
(Optional, object) If set, the datafeed performs aggregation searches. Support for aggregations is limited; use them only with low-cardinality data. For more information, see Aggregating data for faster performance.
chunking_config

(Optional, object) Datafeeds might be required to search over long time periods, such as several months or years. This search is split into time chunks to ensure that the load on Elasticsearch is managed. The chunking configuration controls how the size of these time chunks is calculated and is an advanced configuration option. A chunking configuration object has the following properties:

chunking_config.mode

(string) There are three available modes:

  • auto: The chunk size is dynamically calculated. This is the default and recommended value.
  • manual: Chunking is applied according to the specified time_span.
  • off: No chunking is applied.
chunking_config.time_span
(time units) The time span that each search queries. This setting applies only when the mode is set to manual. For example: 3h.
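
To illustrate what manual chunking means, the following sketch splits a long search range into consecutive windows of at most time_span. It is a conceptual illustration only: the function name manual_chunks is hypothetical, and how Elasticsearch aligns or sizes chunks internally is not described in this section.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def manual_chunks(
    start: datetime, end: datetime, time_span: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [chunk_start, chunk_end) windows covering [start, end)."""
    if time_span <= timedelta(0):
        raise ValueError("time_span must be positive")
    chunk_start = start
    while chunk_start < end:
        # The final chunk is shortened so it never runs past the end time.
        chunk_end = min(chunk_start + time_span, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end


# An 8-hour range with time_span = 3h gives two full chunks and one 2-hour chunk.
chunks = list(
    manual_chunks(
        datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 8), timedelta(hours=3)
    )
)
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), "->", chunk_end.isoformat())
```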
delayed_data_check_config

(Optional, object) Specifies whether the datafeed checks for missing data and the size of the window. For example: {"enabled": true, "check_window": "1h"}.

The datafeed can optionally search over indices that have already been read in an effort to determine whether any data has subsequently been added to the index. If missing data is found, it is a good indication that the query_delay option is set too low and the data is being indexed after the datafeed has passed that moment in time. See Working with delayed data.

This check runs only on real-time datafeeds.

delayed_data_check_config.enabled
(boolean) Specifies whether the datafeed periodically checks for delayed data. Defaults to true.
delayed_data_check_config.check_window
(time units) The window of time that is searched for late data. This window of time ends with the latest finalized bucket. It defaults to null, which causes an appropriate check_window to be calculated when the real-time datafeed runs. In particular, the default check_window span calculation is based on the maximum of 2h or 8 * bucket_span.
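
The default calculation described above can be written out as a short sketch. The function name default_check_window is hypothetical, and the sketch covers only the stated rule (the larger of 2h and 8 * bucket_span); any additional limits that Elasticsearch applies are not modeled.

```python
from datetime import timedelta


def default_check_window(bucket_span: timedelta) -> timedelta:
    """Return the larger of 2 hours and 8 bucket spans."""
    return max(timedelta(hours=2), 8 * bucket_span)


# With a 5-minute bucket span, 8 buckets cover only 40 minutes, so 2h is used.
print(default_check_window(timedelta(minutes=5)))
# With a 30-minute bucket span, 8 buckets cover 4 hours, which exceeds 2h.
print(default_check_window(timedelta(minutes=30)))
```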
frequency

(Optional, time units) The interval at which scheduled queries are made while the datafeed runs in real time. The default value is either the bucket span for short bucket spans, or, for longer bucket spans, a sensible fraction of the bucket span. For example: 150s.

To learn more about the relationship between time-related settings, see Interaction between time-related settings.

indices
(Required, array) An array of index names. Wildcards are supported. For example: ["it_ops_metrics", "server*"].

If any indices are in remote clusters then cluster.remote.connect must not be set to false on any machine learning nodes.

job_id
(Required, string) Identifier for the anomaly detection job.
max_empty_searches
(Optional, integer) If a real-time datafeed has never seen any data (including during any initial training period), it automatically stops itself and closes its associated job after this many real-time searches that return no documents. In other words, it stops after frequency times max_empty_searches of real-time operation. If this property is not set, a datafeed with no end time that sees no data remains started until it is explicitly stopped. By default, this property is not set.
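
As a worked example of the "frequency times max_empty_searches" rule, the following sketch computes how long a datafeed that never sees data would keep running. The function name time_until_auto_stop and the sample values are illustrative assumptions.

```python
from datetime import timedelta


def time_until_auto_stop(frequency: timedelta, max_empty_searches: int) -> timedelta:
    """Return the real-time running period before an empty datafeed stops."""
    if max_empty_searches < 1:
        raise ValueError("max_empty_searches must be at least 1")
    return frequency * max_empty_searches


# A frequency of 150s with max_empty_searches = 10 gives 1500 seconds.
stop_after = time_until_auto_stop(timedelta(seconds=150), 10)
print(stop_after.total_seconds())
```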
query
(Optional, object) The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value: {"match_all": {"boost": 1}}.
query_delay

(Optional, time units) The number of seconds behind real time that data is queried. For example, if data from 10:04 a.m. might not be searchable in Elasticsearch until 10:06 a.m., set this property to 120 seconds. The default value is randomly selected between 60s and 120s. This randomness improves the query performance when there are multiple jobs running on the same node.

To learn more about the relationship between time-related settings, see Interaction between time-related settings.

script_fields
(Optional, object) Specifies scripts that evaluate custom expressions and return script fields to the datafeed. The detector configuration objects in a job can contain functions that use these script fields. For more information, see Transforming data with script fields and Script fields.
scroll_size
(Optional, unsigned integer) The size parameter that is used in Elasticsearch searches. The default value is 1000.

Interaction between time-related settings

Time-related settings have the following relationships:

  • Each query runs query_delay after the end of its frequency interval.
  • When frequency is shorter than the bucket_span of the associated job, interim results are written for the last (partial) bucket and are later overwritten by the full bucket results.
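
The two relationships above can be traced with a small schedule calculation. This is a conceptual sketch under stated assumptions, not a description of the actual scheduler: the function name schedule is hypothetical, buckets are assumed to be aligned to the Unix epoch, and a search is assumed to complete a bucket only when its interval ends exactly on a bucket boundary.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def schedule(
    first_interval_end: datetime,
    frequency: timedelta,
    query_delay: timedelta,
    bucket_span: timedelta,
    count: int,
) -> List[Tuple[datetime, datetime, bool]]:
    """Return (interval_end, run_at, completes_bucket) for each search."""
    rows = []
    for i in range(count):
        interval_end = first_interval_end + i * frequency
        # The search for an interval runs query_delay after the interval ends.
        run_at = interval_end + query_delay
        # Assumption: a bucket is complete only at an epoch-aligned boundary;
        # any other search leaves a partial bucket with interim results.
        completes_bucket = (interval_end - EPOCH) % bucket_span == timedelta(0)
        rows.append((interval_end, run_at, completes_bucket))
    return rows


rows = schedule(
    first_interval_end=datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc),
    frequency=timedelta(minutes=5),
    query_delay=timedelta(seconds=90),
    bucket_span=timedelta(minutes=15),
    count=3,
)
for interval_end, run_at, completes_bucket in rows:
    kind = "full bucket" if completes_bucket else "interim"
    print(interval_end.time(), run_at.time(), kind)
```

With these sample values, the first two searches fall inside a 15-minute bucket and the third ends on its boundary.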

Examples

PUT _ml/datafeeds/datafeed-total-requests
{
  "job_id": "total-requests",
  "indices": ["server-metrics"]
}

When the datafeed is created, you receive the following results:

{
  "datafeed_id": "datafeed-total-requests",
  "job_id": "total-requests",
  "query_delay": "83474ms",
  "indices": [
    "server-metrics"
  ],
  "query": {
    "match_all": {
      "boost": 1.0
    }
  },
  "scroll_size": 1000,
  "chunking_config": {
    "mode": "auto"
  }
}
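
A request that sets several of the optional properties described in the request body section could be assembled as follows. This is a sketch only: the property values are illustrative, the job and index names are reused from the example above, and sending the request (HTTP client, authentication) is left out.

```python
import json

feed_id = "datafeed-total-requests"
path = f"_ml/datafeeds/{feed_id}"

# Illustrative values; only job_id and indices are required.
body = {
    "job_id": "total-requests",
    "indices": ["server-metrics"],
    "query": {"match_all": {"boost": 1}},
    "frequency": "150s",
    "query_delay": "120s",
    "scroll_size": 1000,
    "max_empty_searches": 10,
    "chunking_config": {"mode": "manual", "time_span": "3h"},
    "delayed_data_check_config": {"enabled": True, "check_window": "1h"},
}

payload = json.dumps(body, indent=2)
print(f"PUT {path}")
print(payload)
```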