Detecting anomalous categories of dataedit

Categorization is a machine learning process that tokenizes a text field, clusters similar data together, and classifies it into categories. It works best on machine-written messages and application output that typically consist of repeated elements. For example, it works well on logs that contain a finite set of possible messages:

{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}

Categorization is tuned to work best on data like log messages by taking token order into account, including stop words, and not considering synonyms in its analysis. Complete sentences in human communication or literary text (for example email, wiki pages, prose, or other human-generated content) can be extremely diverse in structure. Since categorization is tuned for machine data, it gives poor results for human-generated data. It would create so many categories that they couldn’t be handled effectively. Categorization is not natural language processing (NLP).

When you create a categorization anomaly detection job, the machine learning model learns what volume and pattern is normal for each category over time. You can then detect anomalies and surface rare events or unusual types of messages by using count or rare functions.

In Kibana, there is a categorization wizard to help you create this type of anomaly detection job. For example, the following job generates categories from the contents of the message field and uses the count function to determine when certain categories are occurring at anomalous rates:

Creating a categorization job in Kibana
API example
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}

This field is used to derive categories.

The categories are used in a detector by setting by_field_name, over_field_name, or partition_field_name to the keyword mlcategory. If you do not specify this keyword in one of those properties, the API request fails.

You can use the Anomaly Explorer in Kibana to view the analysis results:

Categorization results in the Anomaly Explorer

For this type of job, the results contain extra information for each anomaly: the name of the category (for example, mlcategory 2) and examples of the messages in that category. You can use these details to investigate occurrences of unusually high message counts.

If you use the advanced anomaly detection job wizard in Kibana or the create anomaly detection jobs API, there are additional configuration options. For example, the optional categorization_examples_limit property specifies the maximum number of examples that are stored in memory and in the results data store for each category. The default value is 4. Note that this setting does not affect the categorization; it just affects the list of visible examples. If you increase this value, more examples are available, but you must have more storage available. If you set this value to 0, no examples are stored.

Another advanced option is the categorization_filters property, which can contain an array of regular expressions. If a categorization field value matches the regular expression, the portion of the field that is matched is not taken into consideration when defining categories. The categorization filters are applied in the order they are listed in the job configuration, which enables you to disregard multiple sections of the categorization field value. In this example, you might create a filter like [ "\\[statement:.*\\]"] to remove the SQL statement from the categorization algorithm.

Per-partition categorizationedit

If you enable per-partition categorization, categories are determined independently for each partition. For example, if your data includes messages from multiple types of logs from different applications, you can use a field like the ECS event.dataset field as the partition_field_name and categorize the messages for each type of log separately.

If your job has multiple detectors, every detector that uses the mlcategory keyword must also define a partition_field_name. You must use the same partition_field_name value in all of these detectors. Otherwise, when you create or update a job and enable per-partition categorization, it fails.

When per-partition categorization is enabled, you can also take advantage of a stop_on_warn configuration option. If the categorization status for a partition changes to warn, it doesn’t categorize well and can cause a lot of unnecessary resource usage. When you set stop_on_warn to true, the job stops analyzing these problematic partitions. You can thus avoid an ongoing performance cost for partitions that are unsuitable for categorization.

Customizing the categorization analyzeredit

Categorization uses English dictionary words to identify log message categories. By default, it also uses English tokenization rules. For this reason, if you use the default categorization analyzer, only English language log messages are supported, as described in the Limitations.

If you use the categorization wizard in Kibana, you can see which categorization analyzer it uses and highlighted examples of the tokens that it identifies. You can also change the tokenization rules by customizing the way the categorization field values are interpreted:

Editing the categorization analyzer in Kibana

The categorization analyzer can refer to a built-in Elasticsearch analyzer or a combination of zero or more character filters, a tokenizer, and zero or more token filters. In this example, adding a pattern_replace character filter achieves exactly the same behavior as the categorization_filters job configuration option described earlier. For more details about these properties, see the categorization_analyzer API object.

If you use the default categorization analyzer in Kibana or omit the categorization_analyzer property from the API, the following default values are used:

POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "char_filter" : [
        "first_line_with_letters"
      ],
      "tokenizer" : "ml_standard",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}

If you specify any part of the categorization_analyzer, however, any omitted sub-properties are not set to default values.

The ml_standard tokenizer and the day and month stopword filter are more or less equivalent to the following analyzer, which is defined using only built-in Elasticsearch tokenizers and token filters:

PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter" : [
        "first_line_with_letters" 
      ],
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_./]+" 
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, 
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, 
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, 
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, 
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}

Only consider the first line of the message with letters for categorization purposes.

Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.

By default, categorization ignores tokens that begin with a digit.

By default, categorization also ignores tokens that are hexadecimal numbers.

Underscores, hyphens, and dots are removed from the beginning of tokens.

Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default categorization_analyzer and this example analyzer is that using the ml_standard tokenizer is several times faster. The ml_standard tokenizer also tries to preserve URLs, Windows paths and email addresses as single tokens. Another difference in behavior is that this custom analyzer does not include accented letters in tokens whereas the ml_standard tokenizer does, although that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are separated by spaces, you might get better results if you change the day or month words in the stop token filter to the appropriate words in your language. If you are categorizing messages in a language where words are not separated by spaces, you must use a different tokenizer as well in order to get sensible categorization results.

It is important to be aware that analyzing for categorization of machine generated log messages is a little different from tokenizing for search. Features that work well for search, such as stemming, synonym substitution, and lowercasing are likely to make the results of categorization worse. However, in order for drill down from machine learning results to work correctly, the tokens that the categorization analyzer produces must be similar to those produced by the search analyzer. If they are sufficiently similar, when you search for the tokens that the categorization analyzer produces then you find the original document that the categorization field value came from.