Categorizing Non-English Log Messages in Machine Learning for Elasticsearch | Elastic Blog

Categorizing Multilingual Logs with Elasticsearch Machine Learning

Editor's Note (September 7, 2018): This post refers to X-Pack. Starting with the 6.3 release, the X-Pack code is now open and fully integrated as features into the Elastic Stack.

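Machine learning (ML) in the Elastic Stack can group log messages into categories and then look for anomalies in the categorized data. In 6.2 this categorization became more customizable, which makes it more likely to work with non-English log messages.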

6.2 . . , .

From 6.2, the tokenizer used for categorization can be customized.

, ,

  1. Directory change success

6.2 .

  1. Directory change success
  2. (, . 😅)

!

6.2 .

  1. Directory change success

, .

The customization is done through the categorization analyzer (categorization_analyzer). A categorization job configured without one looks like this:

{
  "analysis_config" : {
    "categorization_field_name" : "message", 
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory", 
      "detector_description" : "Unusual message counts"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}

In 6.2 the same job, with the analyzer written out explicitly (as a categorization_analyzer), looks like this:

{
  "analysis_config" : {
    "categorization_field_name" : "message", 
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory", 
      "detector_description" : "Unusual message counts"
    }],
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic", 
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}

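This categorization_analyzer combines the ml_classic tokenizer with a stop filter that removes English day names, month names, and time zone abbreviations.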

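The categorization_analyzer can be adapted to other languages. For Korean log messages, the stop words could be replaced like this: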

{
  "analysis_config" : {
    "categorization_field_name" : "메시지", 
    "bucket_span" : "30m",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory", 
      "detector_description" : "비정상적 메시지의 개수"
    }],
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic", 
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "월요일", "화요일", "수요일", "목요일", "금요일", "토요일", "일요일",
          "일월", "이월", "삼월", "사월", "오월", "유월", "칠월", "팔월", "구월", "시월", "십일월", "십이월",
          "KST"
        ] }
      ]
    }
  },
  "data_description" : {
    "time_field" : "타임스탬프"
  }
}
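The job bodies above all share one shape and differ only in the field names, the tokenizer, and the stop word list. As a minimal sketch, the Python below builds such a body from those parts and serializes it to JSON. The helper name `build_categorization_job` and its parameters are hypothetical and exist only for illustration; the keys are the ones used in the examples above, and nothing here sends a request to Elasticsearch.

```python
import json


def build_categorization_job(message_field, time_field, tokenizer, stopwords,
                             bucket_span="30m",
                             description="Unusual message counts"):
    """Return a job body shaped like the examples above (hypothetical helper)."""
    return {
        "analysis_config": {
            "categorization_field_name": message_field,
            "bucket_span": bucket_span,
            "detectors": [{
                "function": "count",
                "by_field_name": "mlcategory",
                "detector_description": description,
            }],
            "categorization_analyzer": {
                "tokenizer": tokenizer,
                # A single stop filter holding the language-specific stop words.
                "filter": [{"type": "stop", "stopwords": list(stopwords)}],
            },
        },
        "data_description": {"time_field": time_field},
    }


korean_stopwords = [
    "월요일", "화요일", "수요일", "목요일", "금요일", "토요일", "일요일",
    "KST",
]
job = build_categorization_job("message", "timestamp", "ml_classic",
                               korean_stopwords)

# ensure_ascii=False keeps the Korean stop words readable in the output.
job_json = json.dumps(job, ensure_ascii=False, indent=2)
print(job_json)
```

Building the body from a dictionary and letting `json.dumps` write it out avoids the easy mistake of leaving the `filter` array or the `categorization_analyzer` object unclosed when editing the JSON by hand.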

The ml_classic tokenizer reproduces the categorization tokenization of 6.1 and earlier. It is available with ML from 6.2.

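The ml_classic tokenizer and the day/month stop word (stopword) filter are roughly equivalent to the following analyzer, which is built only from standard Elasticsearch components: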

    "categorization_analyzer" : {
      "tokenizer" : {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+"
      },
      "filter" : [
        { "type" : "pattern_replace", "pattern" : "^[0-9].*" },
        { "type" : "pattern_replace", "pattern" : "^[-0-9A-Fa-f.]+$" },
        { "type" : "pattern_replace", "pattern" : "^[^0-9A-Za-z]+" }, 
        { "type" : "pattern_replace", "pattern" : "[^0-9A-Za-z]+$" }, 
        { "type" : "stop", "stopwords" : [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }

( 6.2 .)
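To see what the analyzer above does to a message, the sketch below approximates it in plain Python: a split on the same character class, the four pattern replacements with an empty replacement, and the same stop word list. It is an illustration only. Python's `re` module stands in for the Java regular expressions Elasticsearch uses, the function name `approximate_tokens` is made up for this example, and the Korean sample message is an assumed example, not taken from real logs.

```python
import re

# Same character class as the simple_pattern_split tokenizer above.
SPLIT = re.compile(r"[^-0-9A-Za-z_.]+")

# The four pattern_replace filters, each replacing its match with "".
REPLACEMENTS = [
    re.compile(r"^[0-9].*"),          # tokens that start with a digit
    re.compile(r"^[-0-9A-Fa-f.]+$"),  # hexadecimal-looking tokens
    re.compile(r"^[^0-9A-Za-z]+"),    # leading punctuation
    re.compile(r"[^0-9A-Za-z]+$"),    # trailing punctuation
]

STOPWORDS = {
    "",
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
    "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
    "January", "February", "March", "April", "May", "June", "July", "August",
    "September", "October", "November", "December",
    "Jan", "Feb", "Mar", "Apr", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    "GMT", "UTC",
}


def approximate_tokens(message):
    """Approximate the tokens the analyzer above would keep for a message."""
    kept = []
    for token in SPLIT.split(message):
        for pattern in REPLACEMENTS:
            token = pattern.sub("", token)
        # The "" stop word drops tokens that the replacements emptied.
        if token not in STOPWORDS:
            kept.append(token)
    return kept


print(approximate_tokens("Directory change success"))
# → ['Directory', 'change', 'success']
print(approximate_tokens("Wed Feb 7 service started"))
# → ['service', 'started']
print(approximate_tokens("디렉토리 변경 성공"))
# → []
```

Because the split pattern treats every character outside the ASCII letters, digits, and a few punctuation marks as a separator, a message written entirely in Korean leaves no tokens at all in this approximation.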

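Other tokenizers can be used as well. For example, for Chinese log messages the ICU tokenizer (provided by the analysis-icu plugin) could be used: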

{
  "analysis_config" : {
    "categorization_field_name" : "message", 
    "bucket_span" : "30m",
    "detectors" :[{
      "function" : "count",
      "by_field_name" : "mlcategory",
      "detector_description" : "异常的消息数量"
    }],
    "categorization_analyzer" : {
      "tokenizer" : "icu_tokenizer",
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "星期一", "星期二", "星期三", "星期四", "星期五", "星期六", "星期日",
          "一月", "二月", "三月", "四月", "五月", "六月", "七月", "八月", "九月", "十月", "十一月", "十二月",
          "CST"
        ] }
      ]
    }
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}

Points to keep in mind when customizing the categorization_analyzer:

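  • Steps that help search, such as lowercasing, stemming, and decompounding, tend to make categorization worse. For example, stemming makes service starting and service started indistinguishable, although they are messages that should fall into different categories.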
  • categorization_analyzer . .
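The stemming point in the list above can be illustrated with a toy example. The `toy_stem` function below is a deliberately crude suffix stripper written for this sketch. It is not the stemmer Elasticsearch uses, and real stemmers are more careful, but the effect on the two sample messages is the same in kind.

```python
def toy_stem(token):
    """Strip a couple of English suffixes. A toy, not a real stemmer."""
    for suffix in ("ing", "ed"):
        # Only strip when a reasonably long stem is left behind.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokens(message, stem=False):
    """Whitespace tokens of a message, optionally passed through toy_stem."""
    words = message.split()
    return [toy_stem(word) for word in words] if stem else words


starting = "service starting"
started = "service started"

# Without stemming the two messages stay distinct.
print(tokens(starting), tokens(started))
# → ['service', 'starting'] ['service', 'started']

# With stemming both collapse to the same token sequence.
print(tokens(starting, stem=True), tokens(started, stem=True))
# → ['service', 'start'] ['service', 'start']
```

Once the token sequences are identical, nothing that works on tokens alone can tell the two messages apart any more.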

6.2 . ? . , . , 1 . , machine learning service started machine learning service stopped , . .

If you have questions about using categorization_analyzer, ask on the Discuss forum.