Finding anomalies in time series dataedit

The machine learning features automate the analysis of time series data by creating accurate baselines of normal behavior in your data and using them to identify anomalous events or patterns.

The results of machine learning analysis are stored in Elasticsearch and you can use Kibana to help you visualize and explore the results. For example, the Machine Learning app provides charts that illustrate the actual data values, the bounds for the expected values, and the anomalies that occur outside these bounds:

Example screenshot from the Machine Learning Single Metric Viewer in Kibana

Using proprietary machine learning algorithms, the following circumstances are detected:

  • Anomalies related to temporal deviations in values, counts, or frequencies
  • Statistical rarity
  • Unusual behaviors for a member of a population

Automated periodicity detection and quick adaptation to changing data ensure that you don’t need to specify algorithms, models, or other data science-related configurations in order to get the benefits of machine learning.

Anomaly detection algorithmsedit

The anomaly detection machine learning features use a bespoke amalgamation of different techniques such as clustering, various types of time series decomposition, Bayesian distribution modeling, and correlation analysis. These analytics provide sophisticated real-time automated anomaly detection for time series data.

The machine learning analytics statistically model the time-based characteristics of your data by observing historical behavior and adapting to new data. The model represents a baseline of normal behavior and can therefore be used to determine how anomalous new events are.

Anomaly detection results are written for each bucket span. These results include scores that are aggregated in order to reduce noise and normalized in order to rank the most mathematically significant anomalies.

1. Define the problemedit

The machine learning features in Elastic Stack enable you to seek anomalies in your data in many different ways. For example, there are functions that calculate metrics, analyze geographic data, or seek rare events in your data set. You can also optionally analyze your data relative to a specific population or group the data based on specific attributes. For the full list of functions, see Function reference.

The most important considerations are the data sets that you have available and the type of anomalous behavior you want to detect.

If you are uncertain where to begin, Kibana can recognize certain types of data and suggest useful machine learning jobs. Likewise, some integrations in Fleet include anomaly detection job configuration information, dashboards, searches, and visualizations that are customized to help you analyze your data.

2. Set up the environmentedit

If you want to use machine learning features, there must be at least one machine learning node in your cluster and all master-eligible nodes must have machine learning enabled. For more information about these settings, see machine learning nodes.

If Elastic Stack security features are enabled, you must also ensure your users have the necessary privileges. See Setup and security.

If your data is located outside of Elasticsearch, you cannot use Kibana to create your jobs and you cannot use datafeeds to retrieve your data in real time. Posting data directly to anomaly detection jobs is deprecated, in a future major version a datafeed will be required.

3. Create a jobedit

Anomaly detection jobs contain the configuration information and metadata necessary to perform the machine learning analysis.

You can create anomaly detection jobs by using the create anomaly detection jobs API. Kibana also provides wizards to simplify the process:

Create New Job
  • The single metric wizard creates simple jobs that have a single detector. A detector applies an analytical function to specific fields in your data. In addition to limiting the number of detectors, the single metric wizard omits many of the more advanced configuration options.
  • The multi-metric wizard creates jobs that can have more than one detector, which is more efficient than running multiple jobs against the same data.
  • The population wizard creates jobs that detect activity that is unusual compared to the behavior of the population. For more information, see Performing population analysis.
  • The categorization wizard creates jobs that group log messages into categories and use count or rare functions to detect anomalies within them. See Detecting anomalous categories of data.
  • The advanced wizard creates jobs that can have multiple detectors and enables you to configure all job settings.

Kibana can also recognize certain types of data and provide specialized wizards for that context. For example, there are anomaly detection jobs for the sample eCommerce orders and sample web logs data sets, as well as for data generated by the Elastic Security and Observability solutions, Beats, and Fleet Integrations. For a list of all the customized jobs, see Supplied configurations.

When you create an anomaly detection job in Kibana, the job creation wizards can provide advice based on the characteristics of your data. By heeding these suggestions, you can create jobs that are more likely to produce insightful machine learning results. The most important concepts are covered here; for a description of all the job properties, see the create anomaly detection jobs API.

Example screenshot from the single metric job wizard in Kibana
Bucket spanedit

The machine learning features use the concept of a bucket to divide the time series into batches for processing.

The bucket span is part of the configuration information for an anomaly detection job. It defines the time interval that is used to summarize and model the data. This is typically between 5 minutes to 1 hour and it depends on your data characteristics. When you set the bucket span, take into account the granularity at which you want to analyze, the frequency of the input data, the typical duration of the anomalies, and the frequency at which alerting is required.

The bucket span must contain a valid time interval. When you create an anomaly detection job in Kibana, you can choose to estimate a bucket span value based on your data characteristics. If you choose a value that is larger than one day or is significantly different than the estimated value, you receive an informational message.

Detectorsedit

Each anomaly detection job must have one or more detectors. A detector defines the type of analysis that will occur and which fields to analyze.

Detectors can also contain properties that affect which types of entities or events are considered anomalous. For example, you can specify whether entities are analyzed relative to their own previous behavior or relative to other entities in a population. There are also multiple options for splitting the data into categories and partitions. Some of these more advanced job configurations are described in Examples.

If your job does not contain a detector or the detector does not contain a valid function, you receive an error. If a job contains duplicate detectors, you also receive an error. Detectors are duplicates if they have the same function, field_name, by_field_name, over_field_name and partition_field_name.

Influencersedit

When anomalous events occur, we want to know why. To determine the cause, however, you often need a broader knowledge of the domain. If you have suspicions about which entities in your data set are likely causing irregularities, you can identify them as influencers in your anomaly detection jobs. That is to say, influencers are fields that you suspect contain information about someone or something that influences or contributes to anomalies in your data.

Influencers can be any field in your data. If you use datafeeds, however, the field must exist in your datafeed query or aggregation; otherwise it is not included in the job analysis. If you use a query in your datafeed, there is an additional requirement: influencer fields must exist in the query results in the same hit as the detector fields. Datafeeds process data by paging through the query results; since search hits cannot span multiple indices or documents, datafeeds have the same limitation.

Influencers do not need to be fields that are specified in your anomaly detection job detectors, though they often are. If you use aggregations in your datafeed, it is possible to use influencers that come from different indices than the detector fields. However, both indices must have a date field with the same name, which you specify in the data_description.time_field property for the datafeed.

Picking an influencer is strongly recommended for the following reasons:

  • It allows you to more easily assign blame for anomalies
  • It simplifies and aggregates the results

If you use Kibana, the job creation wizards can suggest which fields to use as influencers. The best influencer is the person or thing that you want to blame for the anomaly. In many cases, users or client IP addresses make excellent influencers.

As a best practice, do not pick too many influencers. For example, you generally do not need more than three. If you pick many influencers, the results can be overwhelming and there is a small overhead to the analysis.

Cardinalityedit

If there are logical groupings of related entities in your data, machine learning analytics can make data models and generate results that take these groupings into consideration. For example, you might choose to split your data by user ID and detect when users are accessing resources differently than they usually do.

If the field that you use to split your data has many different values, the job uses more memory resources. In Kibana, if the cardinality of the by_field_name, over_field_name, or partition_field_name is greater than 1000, the job creation wizards advise that there might be high memory usage. Likewise if you are performing population analysis and the cardinality of the over_field_name is below 10, you are advised that this might not be a suitable field to use.

Model memory limitsedit

For each anomaly detection job, you can optionally specify a model_memory_limit, which is the approximate maximum amount of memory resources that are required for analytical processing. The default value is 1 GB. Once this limit is approached, data pruning becomes more aggressive. Upon exceeding this limit, new entities are not modeled.

You can also optionally specify the xpack.ml.max_model_memory_limit setting. By default, it’s not set, which means there is no upper bound on the acceptable model_memory_limit values in your jobs.

If you set the model_memory_limit too high, it will be impossible to open the job; jobs cannot be allocated to nodes that have insufficient memory to run them.

If the estimated model memory limit for an anomaly detection job is greater than the model memory limit for the job or the maximum model memory limit for the cluster, the job creation wizards in Kibana generate a warning. If the estimated memory requirement is only a little higher than the model_memory_limit, the job will probably produce useful results. Otherwise, the actions you take to address these warnings vary depending on the resources available in your cluster:

  • If you are using the default value for the model_memory_limit and the machine learning nodes in the cluster have lots of memory, the best course of action might be to simply increase the job’s model_memory_limit. Before doing this, however, double-check that the chosen analysis makes sense. The default model_memory_limit is relatively low to avoid accidentally creating a job that uses a huge amount of memory.
  • If the machine learning nodes in the cluster do not have sufficient memory to accommodate a job of the estimated size, the only options are:

    • Add bigger machine learning nodes to the cluster, or
    • Accept that the job will hit its memory limit and will not necessarily find all the anomalies it could otherwise find.

If you are using Elastic Cloud Enterprise or the hosted Elasticsearch Service on Elastic Cloud, xpack.ml.max_model_memory_limit is set to prevent you from creating jobs that cannot be allocated to any machine learning nodes in the cluster. If you find that you cannot increase model_memory_limit for your machine learning jobs, the solution is to increase the size of the machine learning nodes in your cluster.

Dedicated indicesedit

For each anomaly detection job, you can optionally specify a dedicated index to store the anomaly detection results. As anomaly detection jobs may produce a large amount of results (for example, jobs with many time series, small bucket span, or with long running period), it is recommended to use a dedicated results index by choosing the Use dedicated index option in Kibana or specifying the results_index_name via the Create anomaly detection jobs API.

Datafeedsedit

If you create anomaly detection jobs in Kibana, you must use datafeeds to retrieve data from Elasticsearch for analysis. When you create an anomaly detection job, you select a data view and Kibana configures the datafeed for you under the covers.

You can associate only one datafeed with each anomaly detection job. The datafeed contains a query that runs at a defined interval (frequency). By default, this interval is calculated relative to the bucket span of the anomaly detection job. If you are concerned about delayed data, you can add a delay before the query runs at each interval. See Handling delayed data.

Datafeeds can also aggregate data before sending it to the anomaly detection job. There are some limitations, however, and aggregations should generally be used only for low cardinality data. See Aggregating data for faster performance.

When the Elasticsearch security features are enabled, a datafeed stores the roles of the user who created or updated the datafeed at that time. This means that if those roles are updated, the datafeed subsequently runs with the new permissions that are associated with the roles. However, if the user’s roles are adjusted after creating or updating the datafeed, the datafeed continues to run with the permissions that were associated with the original roles.

One way to update the roles that are stored within the datafeed without changing any other settings is to submit an empty JSON document ({}) to the update datafeed API.

If the data that you want to analyze is not stored in Elasticsearch, you cannot use datafeeds. You can however send batches of data directly to the job by using the post data to jobs API. [7.11.0] Deprecated in 7.11.0.

4. Open the jobedit

An anomaly detection job must be opened in order for it to be ready to receive and analyze data. It can be opened and closed multiple times throughout its lifecycle.

After you start the job, you can start the datafeed, which retrieves data from your cluster. A datafeed can be started and stopped multiple times throughout its lifecycle. When you start it, you can optionally specify start and end times. If you do not specify an end time, the datafeed runs continuously. Datafeeds with an end time close their corresponding jobs when they are stopped. For this reason, when historical data is analysed, there is no need to stop the datafeed and/or close the job as they are stopped and closed automatically when the end time is reached. However, when a datafeed without an end time is stopped, it does not close the corresponding job automatically.

You can perform both these tasks in Kibana or use the open anomaly detection jobs and start datafeeds APIs.

5. View the job resultsedit

After the anomaly detection job has processed some data, you can view the results in Kibana.

Depending on the capacity of your machine, you might need to wait a few seconds for the machine learning analysis to generate initial results.

There are two tools for examining the results from anomaly detection jobs in Kibana: the Anomaly Explorer and the Single Metric Viewer.

Bucket resultsedit

When you view your machine learning results, each bucket has an anomaly score. This score is a statistically aggregated and normalized view of the combined anomalousness of all the record results in the bucket.

The machine learning analytics enhance the anomaly score for each bucket by considering contiguous buckets. This extra multi-bucket analysis effectively uses a sliding window to evaluate the events in each bucket relative to the larger context of recent events. When you review your machine learning results, there is a multi_bucket_impact property that indicates how strongly the final anomaly score is influenced by multi-bucket analysis. In Kibana, anomalies with medium or high multi-bucket impact are depicted in the Anomaly Explorer and the Single Metric Viewer with a cross symbol instead of a dot. For example:

Examples of anomalies with multi-bucket impact in Kibana

In this example, you can see that some of the anomalies fall within the shaded blue area, which represents the bounds for the expected values. The bounds are calculated per bucket, but multi-bucket analysis is not limited by that scope.

If you have more than one anomaly detection job, you can also obtain overall bucket results, which combine and correlate anomalies from multiple jobs into an overall score. When you view the results for job groups in Kibana, it provides the overall bucket scores. For more information, see Get overall buckets API.

Bucket results provide the top level, overall view of the anomaly detection job and are ideal for alerts. For example, the bucket results might indicate that at 16:05 the system was unusual. This information is a summary of all the anomalies, pinpointing when they occurred. When you identify an anomalous bucket, you can investigate further by examining the pertinent records.

Influencer resultsedit

The influencer results show which entities were anomalous and when. One influencer result is written per bucket for each influencer that affects the anomalousness of the bucket. The machine learning analytics determine the impact of an influencer by performing a series of experiments that remove all data points with a specific influencer value and check whether the bucket is still anomalous. That means that only influencers with statistically significant impact on the anomaly are reported in the results. For jobs with more than one detector, influencer scores provide a powerful view of the most anomalous entities.

For example, the high_sum_total_sales anomaly detection job for the eCommerce orders sample data uses customer_full_name.keyword and category.keyword as influencers. You can examine the influencer results with the get influencers API. Alternatively, you can use the Anomaly Explorer in Kibana:

Influencers in the Kibana Anomaly Explorer

On the left is a list of the top influencers for all of the detected anomalies in that same time period. The list includes maximum anomaly scores, which in this case are aggregated for each influencer, for each bucket, across all detectors. There is also a total sum of the anomaly scores for each influencer. You can use this list to help you narrow down the contributing factors and focus on the most anomalous entities.

You can also explore swim lanes that correspond to the values of an influencer. In this example, the swim lanes correspond to the values for the customer_full_name.keyword. By default, the swim lanes are sorted according to which entity has the maximum anomaly score values. You can click on the sections in the swim lane to see details about the anomalies that occurred in that time interval.

The anomaly scores that you see in each section of the Anomaly Explorer might differ slightly. This disparity occurs because for each anomaly detection job, there are bucket results, influencer results, and record results. Anomaly scores are generated for each type of result. The anomaly timeline in Kibana uses the bucket-level anomaly scores. If you view swim lanes by influencer, it uses the influencer-level anomaly scores, as does the list of top influencers. The list of anomalies uses the record-level anomaly scores.

6. Tune the jobedit

While your anomaly detection job is open, you might find that you need to alter its configuration or settings.

Calendars and scheduled eventsedit

Sometimes there are periods when you expect unusual activity to take place, such as bank holidays, "Black Friday", or planned system outages. If you identify these events in advance, no anomalies are generated during that period. The machine learning model is not ill-affected and you do not receive spurious results.

You can create calendars and scheduled events in the Settings pane on the Machine Learning page in Kibana or by using Machine learning anomaly detection APIs.

A scheduled event must have a start time, end time, and description. In general, scheduled events are short in duration (typically lasting from a few hours to a day) and occur infrequently. If you have regularly occurring events, such as weekly maintenance periods, you do not need to create scheduled events for these circumstances; they are already handled by the machine learning analytics.

You can identify zero or more scheduled events in a calendar. Anomaly detection jobs can then subscribe to calendars and the machine learning analytics handle all subsequent scheduled events appropriately.

If you want to add multiple scheduled events at once, you can import an iCalendar (.ics) file in Kibana or a JSON file in the add events to calendar API.

  • You must identify scheduled events before your anomaly detection job analyzes the data for that time period. Machine learning results are not updated retroactively.
  • If your iCalendar file contains recurring events, only the first occurrence is imported.
  • Bucket results are generated during scheduled events but they have an anomaly score of zero.
  • If you use long or frequent scheduled events, it might take longer for the machine learning analytics to learn to model your data and some anomalous behavior might be missed.

Custom rulesedit

By default, anomaly detection is unsupervised and the machine learning models have no awareness of the domain of your data. As a result, anomaly detection jobs might identify events that are statistically significant but are uninteresting when you know the larger context. Machine learning custom rules enable you to customize anomaly detection.

Custom rules – or job rules as Kibana refers to them – instruct anomaly detectors to change their behavior based on domain-specific knowledge that you provide. When you create a rule, you can specify conditions, scope, and actions. When the conditions of a rule are satisfied, its actions are triggered.

For example, if you have an anomaly detector that is analyzing CPU usage, you might decide you are only interested in anomalies where the CPU usage is greater than a certain threshold. You can define a rule with conditions and actions that instruct the detector to refrain from generating machine learning results when there are anomalous events related to low CPU usage. You might also decide to add a scope for the rule, such that it applies only to certain machines. The scope is defined by using machine learning filters.

Filters contain a list of values that you can use to include or exclude events from the machine learning analysis. You can use the same filter in multiple anomaly detection jobs.

If you are analyzing web traffic, you might create a filter that contains a list of IP addresses. For example, maybe they are IP addresses that you trust to upload data to your website or to send large amounts of data from behind your firewall. You can define the scope of a rule such that it triggers only when a specific field in your data matches one of the values in the filter. Alternatively, you can make it trigger only when the field value does not match one of the filter values. You therefore have much greater control over which anomalous events affect the machine learning model and appear in the machine learning results.

For more information, see Customizing detectors with custom rules.

Model snapshotsedit

Elastic Stack machine learning features calculate baselines of normal behavior then extrapolate anomalous events. These baselines are accomplished by generating models of your data.

To ensure resilience in the event of a system failure, snapshots of the machine learning model for each anomaly detection job are saved to an internal index within the Elasticsearch cluster. The amount of time necessary to save these snapshots is proportional to the size of the model in memory. By default, snapshots are captured approximately every 3 to 4 hours. You can change this interval (background_persist_interval) when you create or update a job.

To reduce the number of snapshots consuming space on your cluster, at the end of each day, old snapshots are automatically deleted. The age of each snapshot is calculated relative to the timestamp of the most recent snapshot. By default, if there are snapshots over one day older than the newest snapshot, they are deleted except for the first snapshot each day. As well, all snapshots over ten days older than the newest snapshot are deleted. You can change these retention settings (daily_model_snapshot_retention_after_days and model_snapshot_retention_days) when you create or update a job. If you want to exempt a specific snapshot from this clean up, use Kibana or the update model snapshots API to set retain to true.

You can see the list of model snapshots for each job with the get model snapshots API or in the Model snapshots tab on the Job Management page in Kibana:

Example screenshot with a list of model snapshots

There are situations other than system failures where you might want to revert to using a specific model snapshot. The machine learning features react quickly to anomalous input and new behaviors in data. Highly anomalous input increases the variance in the models and machine learning analytics must determine whether it is a new step-change in behavior or a one-off event. In the case where you know this anomalous input is a one-off, it might be appropriate to reset the model state to a time before this event. For example, after a Black Friday sales day you might consider reverting to a saved snapshot. If you know about such events in advance, however, you can use calendars and scheduled events to avoid impacting your model.

7. Forecast future behavioredit

After the machine learning features create baselines of normal behavior for your data, you can use that information to extrapolate future behavior.

You can use a forecast to estimate a time series value at a specific future date. For example, you might want to determine how many users you can expect to visit your website next Sunday at 0900.

You can also use it to estimate the probability of a time series value occurring at a future date. For example, you might want to determine how likely it is that your disk utilization will reach 100% before the end of next week.

Each forecast has a unique ID, which you can use to distinguish between forecasts that you created at different times. You can create a forecast by using the forecast anomaly detection jobs API or by using Kibana. For example:

Example screenshot from the Machine Learning Single Metric Viewer in Kibana

The yellow line in the chart represents the predicted data values. The shaded yellow area represents the bounds for the predicted values, which also gives an indication of the confidence of the predictions.

When you create a forecast, you specify its duration, which indicates how far the forecast extends beyond the last record that was processed. By default, the duration is 1 day. Typically the farther into the future that you forecast, the lower the confidence levels become (that is to say, the bounds increase). Eventually if the confidence levels are too low, the forecast stops. For more information about limitations that affect your ability to create a forecast, see Unsupported forecast configurations.

You can also optionally specify when the forecast expires. By default, it expires in 14 days and is deleted automatically thereafter. You can specify a different expiration period by using the expires_in parameter in the forecast anomaly detection jobs API.

8. Close the jobedit

If you need to stop your anomaly detection job, an orderly shutdown ensures that:

  • Datafeeds are stopped
  • Buffers are flushed
  • Model history is pruned
  • Final results are calculated
  • Model snapshots are saved
  • Anomaly detection jobs are closed

This process ensures that jobs are in a consistent state in case you want to subsequently re-open them.

Stopping datafeedsedit

When you stop a datafeed, it ceases to retrieve data from Elasticsearch. You can stop a datafeed by using Kibana or the stop datafeeds API. For example, the following request stops the feed1 datafeed:

POST _ml/datafeeds/feed1/_stop

You must have manage_ml, or manage cluster privileges to stop datafeeds. For more information, see Security privileges.

A datafeed can be started and stopped multiple times throughout its lifecycle.

Stopping all datafeedsedit

If you are upgrading your cluster, you can use the following request to stop all datafeeds:

POST _ml/datafeeds/_all/_stop

Closing anomaly detection jobsedit

When you close an anomaly detection job, it cannot receive data or perform analysis operations. You can close a job by using the close anomaly detection job API. For example, the following request closes the job1 job:

POST _ml/anomaly_detectors/job1/_close

You must have manage_ml, or manage cluster privileges to stop anomaly detection jobs. For more information, see Security privileges.

If you submit a request to close an anomaly detection job and its datafeed is running, the request first tries to stop the datafeed. This behavior is equivalent to calling the stop datafeeds API with the same timeout and force parameters as the close job request.

Anomaly detection jobs can be opened and closed multiple times throughout their lifecycle.

Closing all anomaly detection jobsedit

If you are upgrading your cluster, you can use the following request to close all open anomaly detection jobs on the cluster:

POST _ml/anomaly_detectors/_all/_close

Next stepsedit

For a more detailed walk-through of machine learning features, see Getting started with machine learning.

For more advanced settings and scenarios, see Examples.

Refer to Working with anomaly detection at scale to learn more about the particularities of large anomaly detection jobs.

Further readingedit

Interpretability in ML: Identifying anomalies, influencers, and root causes