Working with anomaly detection at scale

There are many advanced configuration options for anomaly detection jobs, some of them affect the performance or resource usage significantly. This guide contains a list of considerations to help you plan for using anomaly detection at scale.

In this guide, you’ll learn how to:

Understand the impact of configuration options on the performance of anomaly detection jobs

Prerequisites:

This guide assumes you’re already familiar with how to create anomaly detection jobs. If not, refer to Anomaly detection.

The following recommendations are not sequential – the numbers just help to navigate between the list items; you can take action on one or more of them in any order. You can implement some of these changes on existing jobs; others require you to clone an existing job or create a new one.

1. Consider autoscaling, node sizing, and configuration

An anomaly detection job runs on a single node and requires sufficient resources to hold its model in memory. When a job is opened, it will be placed on the node with the most available memory at that time.

The memory available to the machine learning native processes can roughly be thought of as total machine RAM minus that which is required for the operating system, Elasticsearch and any other software that is running on the same machine.

The available memory for machine learning on a node must be sufficient to accommodate the size of the largest model. The total available memory across all machine learning nodes must be sufficient to accommodate the memory requirement for all simultaneously open jobs.

In Elastic Cloud, dedicated machine learning nodes are provisioned with most of the RAM automatically being available to the machine learning native processes. If deploying self-managed, then we recommend deploying dedicated machine learning nodes and increasing the value of xpack.ml.max_machine_memory_percent from the default 30%. The default of 30% has to be set low in case other software is running on the same machine and to leave memory free for an OS file system cache on machine learning nodes that are also data nodes. If you use dedicated machine learning nodes as recommended and do not run any other software on them then it would be reasonable to run with a 2GB JVM heap and set xpack.ml.max_machine_memory_percent to 90% on machines with at least 24GB of RAM. This maximizes the number of machine learning jobs that can be run.

Increasing the number of nodes will allow distribution of job processing as well as fault tolerance. If running many jobs, even small memory ones, then consider increasing the number of nodes in your environment.

In Elastic Cloud, you can enable autoscaling so that the machine learning nodes in your cluster scale up or down based on current machine learning memory and CPU requirements. The Elastic Cloud infrastructure allows you to create machine learning jobs up to the size that fits on the maximum node size that the cluster can scale to (usually somewhere between 58GB and 64GB) rather than what would fit in the current cluster. If you attempt to use autoscaling outside of Elastic Cloud, then set xpack.ml.max_ml_node_size to define the maximum possible size of a machine learning node. Creating machine learning jobs with model memory limits larger than the maximum node size can support is not allowed, as autoscaling cannot add a node big enough to run the job. On a self-managed deployment, you can set xpack.ml.max_model_memory_limit according to the available resources of the machine learning node. This prevents you from creating jobs with model memory limits too high to open in your cluster.

2. Use dedicated results indices

For large jobs, use a dedicated results index. This ensures that results from a single large job do not dominate the shared results index. It also ensures that the job and results (if results_retention_days is set) can be deleted more efficiently and improves renormalization performance. By default, anomaly detection job results are stored in a shared index. To change to use a dedicated result index, you need to clone or create a new job.

3. Disable model plot

By default, model plot is enabled when you create jobs in Kibana. If you have a large job, however, consider disabling it. You can disable model plot for existing jobs by using the Update anomaly detection jobs API.

Model plot calculates and stores the model bounds for each analyzed entity, including both anomalous and non-anomalous entities. These bounds are used to display the shaded area in the Single Metric Viewer charts. Model plot creates one result document per bucket per split field value. If you have high cardinality fields and/or a short bucket span, disabling model plot reduces processing workload and results stored.

4. Understand how detector configuration can impact model memory

The following factors are most significant in increasing the memory required for a job:

High cardinality of the by or partition fields
Multiple detectors
A high distinct count of influencers within a bucket

Optimize your anomaly detection job by choosing only relevant influencer fields and detectors.

If you have high cardinality by or partition fields, ensure you have sufficient memory resources available for the job. Alternatively, consider if the job can be split into smaller jobs by using a datafeed query. For very high cardinality, using a population analysis may be more appropriate.

To change partitioning fields, influencers and/or detectors, you need to clone or create a new job.

5. Optimize the bucket span

Short bucket spans and high cardinality detectors are resource intensive and require more system resources.

Bucket span is typically between 15m and 1h. The recommended value always depends on the data, the use case, and the latency required for alerting. A job with a longer bucket span uses less resources because fewer buckets require processing and fewer results are written. Bucket spans that are sensible dividers of an hour or day work best as most periodic patterns have a daily cycle.

If your use case is suitable, consider increasing the bucket span to reduce processing workload. To change the bucket span, you need to clone or create a new job.

6. Set the `scroll_size` of the datafeed

This consideration only applies to datafeeds that do not use aggregations. The scroll_size parameter of a datafeed specifies the number of hits to return from Elasticsearch searches. The higher the scroll_size the more results are returned by a single search. When your anomaly detection job has a high throughput, increasing scroll_size may decrease the time the job needs to analyze incoming data, however may also increase the pressure on your cluster. You cannot increase scroll_size to more than the value of index.max_result_window which is 10,000 by default. If you update the settings of a datafeed, you must stop and start the datafeed for the change to be applied.

7. Set the model memory limit

The model_memory_limit job configuration option sets the approximate maximum amount of memory resources required for analytical processing. When you create an anomaly detection job in Kibana, it provides an estimate for this limit. The estimate is based on the analysis configuration details for the job and cardinality estimates, which are derived by running aggregations on the source indices as they exist at that specific point in time.

If you change the resources available on your machine learning nodes or make significant changes to the characteristics or cardinality of your data, the model memory requirements might also change. You can update the model memory limit for a job while it is closed, using either of the following methods:

In Kibana, navigate to the Anomaly Detection Jobs page, open the edit flyout for the closed job, and select Apply under the Model memory limit field to apply the estimate. This option is available only when the current value differs from the estimate.
Use the update anomaly detection jobs API to set model_memory_limit manually.

To decrease the limit to less than the current model memory usage, you must recreate the job.

Tip

You can view the current model size statistics with the get anomaly detection job stats and get model snapshots APIs. You can also obtain a model memory limit estimate at any time by running the estimate anomaly detection jobs model memory API. However, you must provide your own cardinality estimates.

As a job approaches its model memory limit, the memory status is soft_limit and older models are more aggressively pruned to free up space. If you have categorization jobs, no further examples are stored. When a job exceeds its limit, the memory status is hard_limit and the job no longer models new entities. It is therefore important to have appropriate memory model limits for each job. If you reach the hard limit and are concerned about the missing data, ensure that you have adequate resources then recreate the job with a larger model memory limit.

8. Pre-aggregate your data

You can speed up the analysis by summarizing your data with aggregations.

Anomaly detection jobs use summary statistics that are calculated for each bucket. The statistics can be calculated in the job itself or via aggregations. It is more efficient to use an aggregation when it’s possible, as in this case, the data node does the heavy-lifting instead of the machine learning node.

You may want to use chunking_config to tune your search speed when your datafeeds use aggregations. In these cases, set chunking_config.mode to manual and experiment with the time_span value. Increasing it may speed up search. However, the higher the chunking time_span, the higher number of buckets are included in the search response. Thus, if you hit the search.max_buckets limit, decrease time_span to reduce the number of buckets per response.

In certain cases, you cannot do aggregations to increase performance. For example, categorization jobs use the full log message to detect anomalies, so this data cannot be aggregated. If you have many influencer fields, it may not be beneficial to use an aggregation either. This is because only a few documents in each bucket may have the combination of all the different influencer fields.

See Aggregating data for faster performance to learn more.

9. Optimize the results retention

Set a results retention window to reduce the amount of results stored.

Anomaly detection results are retained indefinitely by default, except for predefined machine learning configurations for logs which retain results for 120 days ( ). Results build up over time, and your result index may be quite large. A large results index is slow to query and takes up significant space on your cluster. Consider how long you wish to retain the results and set results_retention_days accordingly – for example, to 30 or 60 days – to avoid unnecessarily large result indices. Deleting old results does not affect the model behavior. You can change this setting for existing jobs.

10. Optimize the renormalization window

Reduce the renormalization window to reduce processing workload.

When a new anomaly has a much higher score than any anomaly in the past, the anomaly scores are adjusted on a range from 0 to 100 based on the new data. This is called renormalization. It can mean rewriting a large number of documents in the results index. Renormalization happens for results from the last 30 days or 100 bucket spans (depending on which is the longer) by default. When you are working at scale, set renormalization_window_days to a lower value, so the workload is reduced. You can change this setting for existing jobs and changes will take effect after the job has been reopened.

11. Optimize the model snapshot retention

Model snapshots are taken periodically, to ensure resilience in the event of a system failure and to allow you to manually revert to a specific point in time. These are stored in a compressed format in an internal index and kept according to the configured retention policy. Load is placed on the cluster when indexing a model snapshot and index size is increased as multiple snapshots are retained.

When working with large model sizes, consider how frequently you want to create model snapshots using background_persist_interval. The default is every 3 to 4 hours. Increasing this interval reduces the periodic indexing load on your cluster, but in the event of a system failure, you may be reverting to an older version of the model.

Also consider how long you wish to retain snapshots using model_snapshot_retention_days and daily_model_snapshot_retention_after_days. Retaining fewer snapshots substantially reduces index storage requirements for model state, but also reduces the granularity of model snapshots from which you can revert.

For more information, refer to Model snapshots.

12. Optimize your search queries

If you are operating on a big scale, make sure that your datafeed query is as efficient as possible. There are different ways to write Elasticsearch queries and some of them are more efficient than others. See Tune for search speed to learn more about Elasticsearch performance tuning.

You need to clone or recreate an existing job if you want to optimize its search query.

13. Consider using population analysis

Population analysis is more memory efficient than individual analysis of each series. It builds a profile of what a "typical" entity does over a specified time period and then identifies when one is behaving abnormally compared to the population. Use population analysis for analyzing high cardinality fields if you expect that the entities of the population generally behave in the same way.

14. Reduce the cost of forecasting

There are two main performance factors to consider when you create a forecast: indexing load and memory usage. Check the cluster monitoring data to learn the indexing rate and the memory usage.

Forecasting writes a new document to the result index for every forecasted element of the for every bucket. Jobs with high partition or by field cardinality create more result documents, as do jobs with small bucket span and longer forecast duration. Only three concurrent forecasts may be run for a single job.

To reduce indexing load, consider a shorter forecast duration and/or try to avoid concurrent forecast requests. Further performance gains can be achieved by reviewing the job configuration; for example by using a dedicated results index, increasing the bucket span and/or by having lower cardinality partitioning fields.

The memory usage of a forecast is restricted to 20 MB by default. From 7.9, you can extend this limit by setting max_model_memory to a higher value. The maximum value is 40% of the memory limit of the anomaly detection job or 500 MB. If the forecast needs more memory than the provided value, it spools to disk. Forecasts that spool to disk generally run slower. If you need to speed up forecasts, increase the available memory for the forecast. Forecasts that would take more than 500 MB to run won’t start because this is the maximum limit of disk space that a forecast is allowed to use.