Engineering

# Categorizing Non-English Log Messages in Machine Learning for Elasticsearch

Editor's Note (August 3, 2021): This post uses deprecated features. Please reference the map custom regions with reverse geocoding documentation for current instructions.

Machine learning (ML) in the Elastic Stack has the ability to group log messages into categories and then look for anomalies in some other statistic for each of those categories. But prior to version 6.2 there was a problem: the code used to determine the category for each log message made the assumption that the log messages were in English.

Before version 6.2, non-English characters were completely ignored during categorization. If the log messages contained no English words at all, every message would be placed in the same category. If the messages were mostly non-English words with a few English words mixed in, those few English words would dominate the categorization, leading to very strange results.

Version 6.2 has taken the first step towards addressing this problem. The default tokenizer used to split log messages into tokens prior to categorization now splits into words consisting of letters from all alphabets.

For example, consider these two log messages, the first in English and the second (saying the same thing) in Korean:

1. Directory change success
2. 디렉토리 변경 성공

The pre-6.2 categorization tokenizer would have tokenized these as follows:

1. Directory, change, success
2. (no tokens)

The Korean message was tokenized into absolutely nothing, because it didn't contain any letters from the Latin alphabet!

Starting from version 6.2, the default categorization tokenizer will tokenize like this:

1. Directory, change, success
2. 디렉토리, 변경, 성공

Where previously the main categorization algorithm had nothing left to work on in the case of the Korean log message, that message is now sensibly split into three tokens.
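The change in behaviour can be approximated with two regular expressions. This is a simplified sketch using Python's `re` module (where `\w` matches letters from any alphabet), not the actual ml_classic implementation:

```python
import re

msg_ko = "디렉토리 변경 성공"  # Korean for "Directory change success"

def latin_tokens(text):
    # Pre-6.2 behaviour (simplified): only Latin-alphabet letters form tokens
    return re.findall(r"[A-Za-z]+", text)

def any_letter_tokens(text):
    # 6.2 behaviour (simplified): letters from any alphabet form tokens
    return re.findall(r"[^\W\d_]+", text)

print(latin_tokens(msg_ko))       # the Korean message yields no tokens at all
print(any_letter_tokens(msg_ko))  # three sensible tokens
```

Both functions produce the same three tokens for the English message; only the second produces anything useful for the Korean one.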

But sometimes even more customization of the categorization analyzer is necessary or beneficial. A simple categorization job can be defined as follows:

    {
      "analysis_config" : {
        "categorization_field_name" : "message",
        "bucket_span" : "30m",
        "detectors" : [{
          "function" : "count",
          "by_field_name" : "mlcategory",
          "detector_description" : "Unusual message counts"
        }]
      },
      "data_description" : {
        "time_field" : "timestamp"
      }
    }

Starting from version 6.2, this can be written more verbosely (spelling out the default categorization_analyzer in full) like this:

    {
      "analysis_config" : {
        "categorization_field_name" : "message",
        "bucket_span" : "30m",
        "detectors" : [{
          "function" : "count",
          "by_field_name" : "mlcategory",
          "detector_description" : "Unusual message counts"
        }],
        "categorization_analyzer" : {
          "tokenizer" : "ml_classic",
          "filter" : [
            { "type" : "stop", "stopwords" : [
              "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
              "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
              "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
              "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
              "GMT", "UTC"
            ] }
          ]
        }
      },
      "data_description" : {
        "time_field" : "timestamp"
      }
    }

This shows that the default categorization_analyzer consists of a tokenizer called ml_classic, and a stop filter that removes day and month names and abbreviations.

If our Korean log messages contain day or month names written as words, you can probably see the next improvement we can make. (If dates are written using only numbers in your log files, this change won't make any difference, so spare yourself the effort.)

    {
      "analysis_config" : {
        "categorization_field_name" : "메시지",
        "bucket_span" : "30m",
        "detectors" : [{
          "function" : "count",
          "by_field_name" : "mlcategory",
          "detector_description" : "비정상적 메시지의 개수"
        }],
        "categorization_analyzer" : {
          "tokenizer" : "ml_classic",
          "filter" : [
            { "type" : "stop", "stopwords" : [
              "월요일", "화요일", "수요일", "목요일", "금요일", "토요일", "일요일",
              "일월", "이월", "삼월", "사월", "오월", "유월", "칠월", "팔월", "구월", "시월", "십일월", "십이월",
              "KST"
            ] }
          ]
        }
      },
      "data_description" : {
        "time_field" : "타임스탬프"
      }
    }

As well as customizing the categorization_analyzer's token filters, you can also customize the tokenizer itself. For English log messages the ml_classic tokenizer does exactly what the hardcoded tokenizer did in version 6.1 and earlier. It has to, because otherwise ML jobs that use categorization would hit a backwards compatibility problem when upgrading to version 6.2.

The ml_classic tokenizer and the day and month stopword filter are more or less equivalent to the following analyzer, which is defined using only built-in Elasticsearch tokenizers and token filters:

    "categorization_analyzer" : {
      "tokenizer" : {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+"
      },
      "filter" : [
        { "type" : "pattern_replace", "pattern" : "^[0-9].*" },
        { "type" : "pattern_replace", "pattern" : "^[-0-9A-Fa-f.]+$" },
        { "type" : "pattern_replace", "pattern" : "^[^0-9A-Za-z]+" },
        { "type" : "pattern_replace", "pattern" : "[^0-9A-Za-z]+$" },
        { "type" : "stop", "stopwords" : [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }

(The reason it's only more or less equivalent in version 6.2 is that characters from all alphabets are included in tokens, not just those from the Latin alphabet as suggested by the patterns above.)
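To see how the pieces of this analyzer fit together, here is a rough Python simulation of the same pipeline. It is an illustration of the filter chain, not the real ml_classic code, and it deliberately sticks to the Latin-alphabet patterns shown above:

```python
import re

# Day and month names removed by the stop filter (abbreviated set for brevity)
STOPWORDS = {
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
    "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    "GMT", "UTC",
}

def classic_like_tokens(message):
    # simple_pattern_split: break on runs of characters outside [-0-9A-Za-z_.]
    tokens = re.split(r"[^-0-9A-Za-z_.]+", message)
    out = []
    for tok in tokens:
        if re.match(r"^[0-9]", tok):              # drop tokens starting with a digit
            continue
        if re.fullmatch(r"[-0-9A-Fa-f.]+", tok):  # drop hex/number/punctuation-only tokens
            continue
        tok = re.sub(r"^[^0-9A-Za-z]+", "", tok)  # trim leading non-alphanumerics
        tok = re.sub(r"[^0-9A-Za-z]+$", "", tok)  # trim trailing non-alphanumerics
        if not tok or tok in STOPWORDS:           # stop filter (including empty strings)
            continue
        out.append(tok)
    return out

print(classic_like_tokens(
    "Jan 06 13:45:01 server1 rpc.statd[1584]: Caught signal 15, un-registering and exiting"))
```

Note how the timestamp, the process ID, and the month abbreviation all disappear, leaving only the tokens that characterize the message itself.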

This tokenization strategy clearly won't work for languages that don't have spaces between words, such as Chinese and Japanese. To get sensible categorization for these languages, the tokenizer needs to be changed to one that knows how to split character sequences into words, such as the ICU tokenizer.

    {
      "analysis_config" : {
        "categorization_field_name" : "message",
        "bucket_span" : "30m",
        "detectors" : [{
          "function" : "count",
          "by_field_name" : "mlcategory",
          "detector_description" : "异常的消息数量"
        }],
        "categorization_analyzer" : {
          "tokenizer" : "icu_tokenizer",
          "filter" : [
            { "type" : "stop", "stopwords" : [
              "星期一", "星期二", "星期三", "星期四", "星期五", "星期六", "星期日",
              "一月", "二月", "三月", "四月", "五月", "六月", "七月", "八月", "九月", "十月", "十一月", "十二月",
              "CST"
            ] }
          ]
        }
      },
      "data_description" : {
        "time_field" : "timestamp"
      }
    }
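The need for a different tokenizer is easy to demonstrate: with no delimiters between words, a letters-based split returns an entire Chinese message as a single token. Below is a sketch using Python's `re` module with an illustrative message; the icu_tokenizer itself performs dictionary-based word segmentation and is provided by the analysis-icu plugin.

```python
import re

# Illustrative message: "machine learning service started", written
# in Chinese without any spaces between words
msg_zh = "机器学习服务已启动"

# A letters-based split (the 6.2 default behaviour, simplified) only finds
# word boundaries at non-letter characters, so the whole message is one token
tokens = re.findall(r"[^\W\d_]+", msg_zh)
print(tokens)  # a single token containing the entire message
```

A single token per message puts us back in the pre-6.2 situation for Korean: every message collapses into the same category.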

Two things to be aware of when customizing the categorization_analyzer are:

• Although techniques such as lowercasing, stemming, and decompounding work well for search, for categorizing machine-generated log messages it's best not to do these things. For example, stemming rules out the possibility of distinguishing "service starting" from "service started". In human-generated text this could be appropriate, as people use slightly different words when writing about the same thing. But for machine-generated log messages from a given program, different words mean a different message.
• The tokens generated by the categorization_analyzer need to be sufficiently similar to those generated by the analyzer used at index time that when you search for them you'll match the original message. This is required for drilldown from category definitions to the original data to work.

Earlier, I said that version 6.2 has taken the first step towards addressing the problem of categorizing non-English log messages. So what's the second step? The answer is opening up the dictionary used to weight the tokens. When categorizing English text we give dictionary words a higher weighting when deciding which category a message belongs to, and verbs an even higher weighting. The English dictionary used to do this cannot currently be changed, so when categorizing non-English logs every token is given a weight of 1. You'll probably still get a reasonable split in most cases, but some results will differ: "machine learning service started" and "machine learning service stopped" would be considered different categories in English, but, without weighting, translations of these messages into other languages could well end up in the same category. We'll make the categorization dictionary customizable in a future version of the Elastic Stack.
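To build some intuition for why the weighting matters, here is a toy similarity calculation. The numbers, weights, and overlap measure are purely illustrative, not Elastic's actual categorization algorithm: with every token weighted 1 the two messages look quite similar, whereas boosting the differing verbs pushes them further apart.

```python
def weighted_similarity(tokens_a, tokens_b, weight):
    """Weighted Jaccard-style overlap between two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    shared = sum(weight(t) for t in a & b)
    total = sum(weight(t) for t in a | b)
    return shared / total

started = ["machine", "learning", "service", "started"]
stopped = ["machine", "learning", "service", "stopped"]

# Without a dictionary every token weighs 1, as for non-English logs today
flat = weighted_similarity(started, stopped, lambda t: 1)

# With an (illustrative) English dictionary: ordinary words weigh 2, verbs 3
verbs = {"started", "stopped"}
weighted = weighted_similarity(started, stopped,
                               lambda t: 3 if t in verbs else 2)

print(flat)      # 3 shared / 5 total = 0.6
print(weighted)  # 6 shared / 12 total = 0.5
```

The lower weighted score makes it easier for a similarity threshold to separate the two messages into distinct categories.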

Try this out in the latest release. And if you need help defining your categorization_analyzer, ask us in the Discuss forum.