
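Train, evaluate, monitor, infer: End-to-end machine learning in Elastic

Machine learning pipelines have evolved tremendously over the past several years. With a wide variety of tools and frameworks available to simplify building, training, and deployment, the turnaround time for machine learning model development has dropped sharply. Even with these simplifications, however, many of these tools still come with a steep learning curve. Not so with Elastic.

To use machine learning in the Elastic Stack, all you really need is for your data to be stored in Elasticsearch. Once it is there, extracting valuable insights from it takes only a few clicks in Kibana. Machine learning is fully integrated into the Elastic Stack, so you can easily and intuitively build a fully operational end-to-end machine learning pipeline. This blog post shows you how.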

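Why Elastic?

Being a search company means that Elastic is built to handle large amounts of data efficiently. Elasticsearch Query DSL makes searching and aggregating data for analysis simple and intuitive. Large datasets can be visualized in a variety of ways in Kibana. The Elastic machine learning interface offers straightforward feature and model selection, model training, and hyperparameter tuning. Once you have trained and tuned your model, you can also use Kibana to evaluate and visually monitor it. This makes the Elastic Stack a one-stop shop for production-level machine learning.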

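Example dataset: EMBER 2018

We are going to demonstrate end-to-end machine learning in the Elastic Stack using the EMBER dataset, released by Endgame to enable malware detection using static features derived from portable executable (PE) files. For this demonstration, we will use the EMBER (Endgame Malware BEnchmark for Research) 2018 dataset, an open source collection of 1 million samples. Each sample includes the sha256 hash of the sample file, the month the file was first seen, a label, and the features derived from the file.

For this experiment, we will select 300K samples (150K malicious and 150K benign) from the EMBER 2018 dataset. To perform supervised learning on the samples, we must first select some features. The features in the dataset are static features derived from the content of the binary files. We decided to experiment with the general, file header, and section information, the strings, and the byte histograms in order to study how the model performs with different subsets of the EMBER features.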

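End-to-end machine learning in the Elastic Stack: A walkthrough

For this demo, we will use the Python Elasticsearch Client to insert data into Elasticsearch, the data frame analytics feature of Elastic machine learning to create training jobs, and Kibana to visually monitor the models after training.

We will create two supervised jobs, one using the general, file header, and section information plus the strings, and the other using only the byte histograms as features. This demonstrates training multiple models in the Stack at the same time and, later on, visualizing multiple candidate models.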

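Setting up Elasticsearch

To use machine learning in the Elastic Stack, we first need to spin up Elasticsearch with a machine learning node. For this, we can start a 14-day free trial of Elastic Cloud, which is available for anyone to try. Our example deployment has the following settings: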

  • Cloud Platform: Amazon Web Services
  • Region: US West (N. California)
  • Optimization: I/O Optimized
  • Customize Deployment: Enable Machine Learning

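We also need to create an API key and assign it appropriate privileges so that the Python Elasticsearch Client can interact with Elasticsearch. In this walkthrough we will be inserting data into the ember_ml index, so we create the key as follows: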

POST /_security/api_key 
{ 
  "name": "my_awesome_key",
  "role_descriptors": { 
    "role_1": { 
      "cluster": ["all"],
      "index": [ 
        { 
          "names": ["ember_*"],
          "privileges": ["all"] 
        } 
      ] 
    } 
  } 
}

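Data ingest

Once our Elasticsearch instance is set up, we begin by ingesting data into an Elasticsearch index. First, we create an index called ember_ml, and then we ingest the documents that make up our dataset into it using the Python Elasticsearch Client. We ingest all the features required for both models into a single index, using the Streaming Bulk Helper to bulk ingest the documents into Elasticsearch. The Python code to create the ember_ml index and bulk ingest documents into it is as follows: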

import elasticsearch 
import certifi 
from elasticsearch import Elasticsearch, helpers 
# Long list of documents to be inserted into Elasticsearch; only one is shown as an example
documents = [ 
  { 
    "_index": "ember_ml",
    "_id":"771434adbbfa2ff5740eb91d9deb51828e0f4b060826b590cd9fd8dd46ee0d40",
    "_source": { 
      "sha256":"771434adbbfa2ff5740eb91d9deb51828e0f4b060826b590cd9fd8dd46ee0d4b",
      "appeared":"2018-01-06 00:00:00",
      "label":1,
      "byte_0":0.1826012283563614,
      "byte_1":0.006036404054611921,
      "byte_2":0.003830794943496585,
      "byte_3":0.004225482698529959,
      "byte_4":0.004388001281768084,
      "byte_5":0.0036218424793332815,
      "byte_6":0.0035289747174829245,
      "byte_7":0.004666604567319155,
      "byte_8":0.004225482698529959,
      "byte_9":0.0029253342654556036,
      "byte_10":0.0034361069556325674,
      "byte_11":0.003993313293904066,
      "byte_12":0.004039747174829245,
      "byte_13":0.0029253342654556036,
      "byte_14":0.0030182020273059607,
      "byte_15":0.0036450594197958708,
      "byte_16":0.004573736805468798,
      "byte_17":0.002693164860829711,
      "byte_18":0.002507429337128997,
      "byte_19":0.0026699479203671217,
      "byte_20":0.003505757777020335,
      "byte_21":0.0022056091111153364,
      "byte_22":0.0032503714319318533,
      "byte_23":0.0025770801585167646,
      "byte_24":0.005363112781196833,
      "byte_25":0.002600297098979354,
      "byte_26":0.0025538632180541754,
      "byte_27":0.0031807206105440855,
      "byte_28":0.0034593238960951567,
      "byte_29":0.0022288260515779257,
      "byte_30":0.002507429337128997,
      "byte_31":0.0025770801585167646,
      "byte_32":0.004921990912407637,
      "byte_33":0.0028092495631426573,
      "byte_34":0.0017877042992040515,
      "byte_35":0.0033664561342447996,
      "byte_36":0.002437778515741229,
      "byte_37":0.0021359582897275686,
      "byte_38":0.0016716195968911052,
      "byte_39":0.0020430905278772116,
      "byte_40":0.003227154491469264,
      "byte_41":0.0025770801585167646,
      "byte_42":0.0017644873587414622,
      "byte_43":0.0032039375510066748,
      "byte_44":0.003296805312857032,
      "byte_45":0.003134286729618907,
      "byte_46":0.0028324665036052465,
      "byte_47":0.003505757777020335,
      "byte_48":0.0038772288244217634,
      "byte_49":0.0035521916579455137,
      "byte_50":0.0031110697891563177,
      "byte_51":0.00417904881760478,
      "byte_52":0.004225482698529959,
      "byte_53":0.0032503714319318533,
      "byte_54":0.0035289747174829245,
      "byte_55":0.003320022253319621,
      "byte_56":0.0030878528486937284,
      "byte_57":0.003575408598408103,
      "byte_58":0.002182392170652747,
      "byte_59":0.0029021173249930143,
      "byte_60":0.002344910753890872,
      "byte_61":0.0020430905278772116,
      "byte_62":0.0015555348945781589,
      "byte_63":0.0020198735874146223,
      "byte_64":0.004016530234366655,
      "byte_65":0.004457652103155851,
      "byte_66":0.0036450594197958708,
      "byte_67":0.0036218424793332815,
      "byte_68":0.0038075780030339956,
      "byte_69":0.0033432391937822104,
      "byte_70":0.004852340091019869,
      "byte_71":0.004039747174829245,
      "byte_72":0.00480590621009469,
      "byte_73":0.002971768146380782,
      "byte_74":0.002693164860829711,
      "byte_75":0.0039468794129788876,
      "byte_76":0.0036450594197958708,
      "byte_77":0.0034361069556325674,
      "byte_78":0.0028324665036052465,
      "byte_79":0.0028324665036052465,
      "byte_80":0.005664933007210493,
      "byte_81":0.0029949850868433714,
      "byte_82":0.0031110697891563177,
      "byte_83":0.004527302924543619,
      "byte_84":0.003923662472516298,
      "byte_85":0.0029949850868433714,
      "byte_86":0.004016530234366655,
      "byte_87":0.004573736805468798,
      "byte_88":0.004109397996217012,
      "byte_89":0.003296805312857032,
      "byte_90":0.0033664561342447996,
      "byte_91":0.0034593238960951567,
      "byte_92":0.0031110697891563177,
      "byte_93":0.0022984768729656935,
      "byte_94":0.0022288260515779257,
      "byte_95":0.002275259932503104,
      "byte_96":0.002855683444067836,
      "byte_97":0.0035986255388706923,
      "byte_98":0.0026699479203671217,
      "byte_99":0.0037843610625714064,
      "byte_100":0.004364784341305494,
      "byte_101":0.004016530234366655,
      "byte_102":0.004713038448244333,
      "byte_103":0.003505757777020335,
      "byte_104":0.005479197483509779,
      "byte_105":0.0032503714319318533,
      "byte_106":0.00366827636025846,
      "byte_107":0.004016530234366655,
      "byte_108":0.005061292555183172,
      "byte_109":0.005014858674257994,
      "byte_110":0.0039468794129788876,
      "byte_111":0.004109397996217012,
      "byte_112":0.004596953745931387,
      "byte_113":0.0021127413492649794,
      "byte_114":0.0046433876268565655,
      "byte_115":0.004086181055754423,
      "byte_116":0.005664933007210493,
      "byte_117":0.005293461959809065,
      "byte_118":0.0039468794129788876,
      "byte_119":0.0038075780030339956,
      "byte_120":0.0035289747174829245,
      "byte_121":0.004480869043618441,
      "byte_122":0.00183413818012923,
      "byte_123":0.0032503714319318533,
      "byte_124":0.0027163818012923002,
      "byte_125":0.002066307468339801,
      "byte_126":0.003505757777020335,
      "byte_127":0.002252042992040515,
      "byte_128":0.0033432391937822104,
      "byte_129":0.0032039375510066748,
      "byte_130":0.001741270418278873,
      "byte_131":0.003923662472516298,
      "byte_132":0.003830794943496585,
      "byte_133":0.0033664561342447996,
      "byte_134":0.0034361069556325674,
      "byte_135":0.0014162332518026233,
      "byte_136":0.002600297098979354,
      "byte_137":0.00304141896776855,
      "byte_138":0.0022984768729656935,
      "byte_139":0.0037147102411836386,
      "byte_140":0.0051773772574961185,
      "byte_141":0.003296805312857032,
      "byte_142":0.0031575036700814962,
      "byte_143":0.0015555348945781589,
      "byte_144":0.003064635908231139,
      "byte_145":0.002693164860829711,
      "byte_146":0.0012304977281019092,
      "byte_147":0.0015555348945781589,
      "byte_148":0.003830794943496585,
      "byte_149":0.0028092495631426573,
      "byte_150":0.00208952440880239,
      "byte_151":0.0014626671327278018,
      "byte_152":0.0026699479203671217,
      "byte_153":0.004388001281768084,
      "byte_154":0.0019502228824421763,
      "byte_155":0.0017644873587414622,
      "byte_156":0.004086181055754423,
      "byte_157":0.0017180534778162837,
      "byte_158":0.003412890015169978,
      "byte_159":0.002252042992040515,
      "byte_160":0.002507429337128997,
      "byte_161":0.002437778515741229,
      "byte_162":0.002623514039441943,
      "byte_163":0.0022288260515779257,
      "byte_164":0.0020430905278772116,
      "byte_165":0.0022984768729656935,
      "byte_166":0.0017180534778162837,
      "byte_167":0.0010911960853263736,
      "byte_168":0.002159175230190158,
      "byte_169":0.0015091010136529803,
      "byte_170":0.003227154491469264,
      "byte_171":0.0025770801585167646,
      "byte_172":0.0027628156822174788,
      "byte_173":0.0029253342654556036,
      "byte_174":0.0013697993708774447,
      "byte_175":0.001648402656428516,
      "byte_176":0.003134286729618907,
      "byte_177":0.0016019687755033374,
      "byte_178":0.002437778515741229,
      "byte_179":0.001927005941979587,
      "byte_180":0.0027163818012923002,
      "byte_181":0.004016530234366655,
      "byte_182":0.003227154491469264,
      "byte_183":0.00241456157527864,
      "byte_184":0.0025538632180541754,
      "byte_185":0.00208952440880239,
      "byte_186":0.001648402656428516,
      "byte_187":0.002275259932503104,
      "byte_188":0.0025538632180541754,
      "byte_189":0.0028092495631426573,
      "byte_190":0.0021359582897275686,
      "byte_191":0.0027395987417548895,
      "byte_192":0.0030878528486937284,
      "byte_193":0.0027395987417548895,
      "byte_194":0.00208952440880239,
      "byte_195":0.002878900384530425,
      "byte_196":0.0021359582897275686,
      "byte_197":0.00208952440880239,
      "byte_198":0.0027395987417548895,
      "byte_199":0.0019734397064894438,
      "byte_200":0.003064635908231139,
      "byte_201":0.002066307468339801,
      "byte_202":0.0012304977281019092,
      "byte_203":0.00183413818012923,
      "byte_204":0.003389673074707389,
      "byte_205":0.00304141896776855,
      "byte_206":0.0029021173249930143,
      "byte_207":0.0024609954562038183,
      "byte_208":0.0029021173249930143,
      "byte_209":0.002507429337128997,
      "byte_210":0.0022288260515779257,
      "byte_211":0.0019734397064894438,
      "byte_212":0.0023913446348160505,
      "byte_213":0.0017180534778162837,
      "byte_214":0.0032735883723944426,
      "byte_215":0.0023216938134282827,
      "byte_216":0.003412890015169978,
      "byte_217":0.0025538632180541754,
      "byte_218":0.002530646277591586,
      "byte_219":0.004550519865006208,
      "byte_220":0.003320022253319621,
      "byte_221":0.002437778515741229,
      "byte_222":0.003389673074707389,
      "byte_223":0.002855683444067836,
      "byte_224":0.0031575036700814962,
      "byte_225":0.0018109212396666408,
      "byte_226":0.002182392170652747,
      "byte_227":0.003737927181646228,
      "byte_228":0.0036218424793332815,
      "byte_229":0.0014626671327278018,
      "byte_230":0.0024609954562038183,
      "byte_231":0.002600297098979354,
      "byte_232":0.0024609954562038183,
      "byte_233":0.0015323179541155696,
      "byte_234":0.001137629966251552,
      "byte_235":0.004341567400842905,
      "byte_236":0.004782689269632101,
      "byte_237":0.0024609954562038183,
      "byte_238":0.0016716195968911052,
      "byte_239":0.0028092495631426573,
      "byte_240":0.0036218424793332815,
      "byte_241":0.00183413818012923,
      "byte_242":0.0035289747174829245,
      "byte_243":0.002623514039441943,
      "byte_244":0.0022984768729656935,
      "byte_245":0.001741270418278873,
      "byte_246":0.003296805312857032,
      "byte_247":0.003412890015169978,
      "byte_248":0.003134286729618907,
      "byte_249":0.0023913446348160505,
      "byte_250":0.0012304977281019092,
      "byte_251":0.0067561292089521885,
      "byte_252":0.005943536292761564,
      "byte_253":0.0031575036700814962,
      "byte_254":0.004480869043618441,
      "byte_255":0.038958024233579636,
      "strings_0":488,
      "strings_1":7.477458953857422,
      "strings_2":3649,
      "strings_3":0.011784050613641739,
      "strings_4":0.0043847630731761456,
      "strings_5":0.003562619909644127,
      "strings_6":0.005206905771046877,
      "strings_7":0.004110715351998806,
      "strings_8":0.003014524467289448,
      "strings_9":0.003562619909644127,
      "strings_10":0.005755001213401556,
      "strings_11":0.006029048934578896,
      "strings_12":0.003014524467289448,
      "strings_13":0.0019183338154107332,
      "strings_14":0.010961906984448433,
      "strings_15":0.006577144376933575,
      "strings_16":0.006851192098110914,
      "strings_17":0.008769526146352291,
      "strings_18":0.013428336940705776,
      "strings_19":0.011784050613641739,
      "strings_20":0.012058097869157791,
      "strings_21":0.014250479638576508,
      "strings_22":0.013428336940705776,
      "strings_23":0.01315428875386715,
      "strings_24":0.01068785972893238,
      "strings_25":0.01315428875386715,
      "strings_26":0.012880241498351097,
      "strings_27":0.010139764286577702,
      "strings_28":0.010413811542093754,
      "strings_29":0.0027404767461121082,
      "strings_30":0.006029048934578896,
      "strings_31":0.004658810794353485,
      "strings_32":0.0021923815365880728,
      "strings_33":0.0027404767461121082,
      "strings_34":0.004110715351998806,
      "strings_35":0.005755001213401556,
      "strings_36":0.01589476503431797,
      "strings_37":0.011784050613641739,
      "strings_38":0.01397643145173788,
      "strings_39":0.010413811542093754,
      "strings_40":0.016168814152479172,
      "strings_41":0.015346670523285866,
      "strings_42":0.012332146055996418,
      "strings_43":0.013428336940705776,
      "strings_44":0.01452452689409256,
      "strings_45":0.00986571703106165,
      "strings_46":0.016442861407995224,
      "strings_47":0.014798575080931187,
      "strings_48":0.012058097869157791,
      "strings_49":0.01068785972893238,
      "strings_50":0.010413811542093754,
      "strings_51":0.015620717778801918,
      "strings_52":0.010139764286577702,
      "strings_53":0.013428336940705776,
      "strings_54":0.015072622336447239,
      "strings_55":0.014250479638576508,
      "strings_56":0.011510002426803112,
      "strings_57":0.012880241498351097,
      "strings_58":0.01397643145173788,
      "strings_59":0.012332146055996418,
      "strings_60":0.01068785972893238,
      "strings_61":0.00931762158870697,
      "strings_62":0.00986571703106165,
      "strings_63":0.005206905771046877,
      "strings_64":0.003014524467289448,
      "strings_65":0.003014524467289448,
      "strings_66":0.003562619909644127,
      "strings_67":0.0043847630731761456,
      "strings_68":0.01397643145173788,
      "strings_69":0.010413811542093754,
      "strings_70":0.017539052292704582,
      "strings_71":0.017539052292704582,
      "strings_72":0.02000548131763935,
      "strings_73":0.016442861407995224,
      "strings_74":0.014250479638576508,
      "strings_75":0.01452452689409256,
      "strings_76":0.01260619331151247,
      "strings_77":0.011510002426803112,
      "strings_78":0.013428336940705776,
      "strings_79":0.014798575080931187,
      "strings_80":0.016442861407995224,
      "strings_81":0.01452452689409256,
      "strings_82":0.017813099548220634,
      "strings_83":0.015072622336447239,
      "strings_84":0.00931762158870697,
      "strings_85":0.01452452689409256,
      "strings_86":0.014250479638576508,
      "strings_87":0.015620717778801918,
      "strings_88":0.014250479638576508,
      "strings_89":0.012332146055996418,
      "strings_90":0.013702384196221828,
      "strings_91":0.01397643145173788,
      "strings_92":0.00986571703106165,
      "strings_93":0.006303096655756235,
      "strings_94":0.004110715351998806,
      "strings_95":0.0027404767461121082,
      "strings_96":0.0027404767461121082,
      "strings_97":0.0024664292577654123,
      "strings_98":0.007399287540465593,
      "strings_99":6.4175848960876465,
      "strings_100":0,
      "strings_101":0,
      "strings_102":0,
      "strings_103":3,
      "general_info_0":43072,
      "general_info_1":110592,
      "general_info_2":0,
      "general_info_3":0,
      "general_info_4":5,
      "general_info_5":0,
      "general_info_6":1,
      "general_info_7":0,
      "general_info_8":0,
      "general_info_9":0,
      "file_header_0":1142459136,
      "file_header_1":0,
      "file_header_2":0,
      "file_header_3":0,
      "file_header_4":0,
      "file_header_5":0,
      "file_header_6":1,
      "file_header_7":0,
      "file_header_8":0,
      "file_header_9":0,
      "file_header_10":0,
      "file_header_11":0,
      "file_header_12":0,
      "file_header_13": -1,
      "file_header_14":0,
      "file_header_15": -1,
      "file_header_16": -1,
      "file_header_17":0,
      "file_header_18":0,
      "file_header_19":0,
      "file_header_20":0,
      "file_header_21":0,
      "file_header_22":0,
      "file_header_23":0,
      "file_header_24":0,
      "file_header_25":0,
      "file_header_26":0,
      "file_header_27":0,
      "file_header_28":1,
      "file_header_29":0,
      "file_header_30":0,
      "file_header_31":0,
      "file_header_32":0,
      "file_header_33":0,
      "file_header_34":0,
      "file_header_35":0,
      "file_header_36":0,
      "file_header_37":0,
      "file_header_38":0,
      "file_header_39":0,
      "file_header_40":0,
      "file_header_41":0,
      "file_header_42": -1,
      "file_header_43":0,
      "file_header_44":0,
      "file_header_45":0,
      "file_header_46":0,
      "file_header_47":0,
      "file_header_48":0,
      "file_header_49":0,
      "file_header_50":0,
      "file_header_51":0,
      "file_header_52":0,
      "file_header_53":2,
      "file_header_54":48,
      "file_header_55":4,
      "file_header_56":0,
      "file_header_57":4,
      "file_header_58":0,
      "file_header_59":32768,
      "file_header_60":4096,
      "file_header_61":4096,
      "sections_0":3,
      "sections_1":1,
      "sections_2":0,
      "sections_3":1,
      "sections_4":3,
      "sections_5":0,
      "sections_6":0,
      "sections_7":0,
      "sections_8":0,
      "sections_9":0,
      "sections_10":0,
      "sections_11":0,
      "sections_12":0,
      "sections_13":0,
      "sections_14":0,
      "sections_15":0,
      "sections_16":0,
      "sections_17":0,
      "sections_18":0,
      "sections_19":0,
      "sections_20":0,
      "sections_21":0,
      "sections_22":0,
      "sections_23":0,
      "sections_24":0,
      "sections_25":0,
      "sections_26":0,
      "sections_27":0,
      "sections_28":0,
      "sections_29":0,
      "sections_30":0,
      "sections_31":0,
      "sections_32":0,
      "sections_33":0,
      "sections_34":0,
      "sections_35":0,
      "sections_36":0,
      "sections_37":0,
      "sections_38":0,
      "sections_39":0,
      "sections_40":0,
      "sections_41":0,
      "sections_42":0,
      "sections_43":0,
      "sections_44":0,
      "sections_45":0,
      "sections_46":0,
      "sections_47":0,
      "sections_48":0,
      "sections_49":0,
      "sections_50":0,
      "sections_51":0,
      "sections_52": -42048,
      "sections_53":0,
      "sections_54":0,
      "sections_55":0,
      "sections_56":0,
      "sections_57":0,
      "sections_58":0,
      "sections_59":0,
      "sections_60":0,
      "sections_61":0,
      "sections_62":0,
      "sections_63":0,
      "sections_64":0,
      "sections_65":0,
      "sections_66":0,
      "sections_67":0,
      "sections_68":0,
      "sections_69":0,
      "sections_70":0,
      "sections_71":0,
      "sections_72":0,
      "sections_73":0,
      "sections_74":0,
      "sections_75":0,
      "sections_76":0,
      "sections_77":0,
      "sections_78":0,
      "sections_79":0,
      "sections_80":0,
      "sections_81":0,
      "sections_82":0,
      "sections_83":0,
      "sections_84":0,
      "sections_85":0,
      "sections_86":0,
      "sections_87":0,
      "sections_88":0,
      "sections_89":0,
      "sections_90":0,
      "sections_91":0,
      "sections_92":0,
      "sections_93":0,
      "sections_94":0,
      "sections_95":0,
      "sections_96":0,
      "sections_97":0,
      "sections_98":0,
      "sections_99":0,
      "sections_100":0,
      "sections_101":0,
      "sections_102": -11.691457748413086,
      "sections_103":0,
      "sections_104":0,
      "sections_105":0,
      "sections_106":0,
      "sections_107":0,
      "sections_108":0,
      "sections_109":0,
      "sections_110":0,
      "sections_111":0,
      "sections_112":0,
      "sections_113":0,
      "sections_114":0,
      "sections_115":0,
      "sections_116":0,
      "sections_117":0,
      "sections_118":0,
      "sections_119":0,
      "sections_120":0,
      "sections_121":0,
      "sections_122":0,
      "sections_123":0,
      "sections_124":0,
      "sections_125":0,
      "sections_126":0,
      "sections_127":0,
      "sections_128":0,
      "sections_129":0,
      "sections_130":0,
      "sections_131":0,
      "sections_132":0,
      "sections_133":0,
      "sections_134":0,
      "sections_135":0,
      "sections_136":0,
      "sections_137":0,
      "sections_138":0,
      "sections_139":0,
      "sections_140":0,
      "sections_141":0,
      "sections_142":0,
      "sections_143":0,
      "sections_144":0,
      "sections_145":0,
      "sections_146":0,
      "sections_147":0,
      "sections_148":0,
      "sections_149":0,
      "sections_150":0,
      "sections_151":0,
      "sections_152": -102464,
      "sections_153":0,
      "sections_154":0,
      "sections_155":0,
      "sections_156":0,
      "sections_157":2,
      "sections_158":0,
      "sections_159":0,
      "sections_160":0,
      "sections_161":0,
      "sections_162":0,
      "sections_163":0,
      "sections_164":2,
      "sections_165":0,
      "sections_166":0,
      "sections_167":2,
      "sections_168":0,
      "sections_169":0,
      "sections_170":0,
      "sections_171":0,
      "sections_172":0,
      "sections_173":0,
      "sections_174":0,
      "sections_175":0,
      "sections_176":0,
      "sections_177":0,
      "sections_178":0,
      "sections_179":0,
      "sections_180":0,
      "sections_181":2,
      "sections_182":0,
      "sections_183":0,
      "sections_184":0,
      "sections_185":0,
      "sections_186":0,
      "sections_187":0,
      "sections_188":0,
      "sections_189":0,
      "sections_190":0,
      "sections_191":0,
      "sections_192":0,
      "sections_193":0,
      "sections_194":0,
      "sections_195":0,
      "sections_196":0,
      "sections_197":0,
      "sections_198":0,
      "sections_199":0,
      "sections_200":0,
      "sections_201":0,
      "sections_202":0,
      "sections_203":0,
      "sections_204":0,
      "sections_205":2,
      "sections_206":0,
      "sections_207":0,
      "sections_208":0,
      "sections_209":0,
      "sections_210":0,
      "sections_211":0,
      "sections_212":0,
      "sections_213":0,
      "sections_214":0,
      "sections_215":0,
      "sections_216":0,
      "sections_217":0,
      "sections_218": -1,
      "sections_219":0,
      "sections_220":0,
      "sections_221":0,
      "sections_222":0,
      "sections_223":0,
      "sections_224":0,
      "sections_225":0,
      "sections_226":0,
      "sections_227":0,
      "sections_228":3,
      "sections_229":0,
      "sections_230":0,
      "sections_231":0,
      "sections_232":0,
      "sections_233":0,
      "sections_234":0,
      "sections_235":0,
      "sections_236":0,
      "sections_237":0,
      "sections_238":0,
      "sections_239":0,
      "sections_240":0,
      "sections_241":0,
      "sections_242":3,
      "sections_243":0,
      "sections_244":0,
      "sections_245":0,
      "sections_246":0,
      "sections_247":0,
      "sections_248":0,
      "sections_249":0,
      "sections_250":0,
      "sections_251":0,
      "sections_252": -1,
      "sections_253":0,
      "sections_254":0 
    } 
  } 
] 
url = "YOUR_ELASTICSEARCH_ENDPOINT_URL"
api_key = "YOUR_API_KEY"
api_id = "YOUR_API_ID"
# Initialize the Elasticsearch client
es = Elasticsearch(
        url,
        api_key=(api_id, api_key),
        use_ssl=True,
        ca_certs=certifi.where()
    )
# Create the index
es.indices.create(index="ember_ml")
# Bulk ingest the documents into Elasticsearch
try:
    for success, info in helpers.streaming_bulk(es, documents, chunk_size=2500):
        if not success:
            print("A document failed:", info)
except elasticsearch.ElasticsearchException:
    print("Failed to insert")

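Note that the feature vectors need to be flattened: each feature must be a separate field of a supported data type (numeric, boolean, text, keyword, or IP) in each document, because data frame analytics does not support arrays with more than one element. Also note that the "appeared" (first seen) field from the EMBER dataset has been changed to match an Elasticsearch-compatible date format, for the time series visualizations later on.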
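
As a rough illustration of the flattening step, the sketch below turns a nested feature record into one scalar field per feature. The input layout (a dict with a `histogram` list and an `appeared` month string), the helper name `flatten_sample`, and the choice to map a month to the first day of that month are all assumptions made for illustration; they are not necessarily the preprocessing used for this post.

```python
from datetime import datetime


def flatten_sample(sample):
    """Flatten a nested feature record into one scalar field per feature.

    `sample` is a hypothetical raw record; its layout is an assumption.
    """
    doc = {"sha256": sample["sha256"], "label": sample["label"]}

    # Assumption: the raw record stores only the month ("YYYY-MM").
    # Expand it to a full "yyyy-MM-dd HH:mm:ss" timestamp string.
    first_seen = datetime.strptime(sample["appeared"], "%Y-%m")
    doc["appeared"] = first_seen.strftime("%Y-%m-%d %H:%M:%S")

    # One field per array element: histogram[3] becomes "byte_3".
    for i, value in enumerate(sample["histogram"]):
        doc[f"byte_{i}"] = value
    return doc


raw = {
    "sha256": "0" * 64,  # placeholder hash, not a real sample
    "label": 1,
    "appeared": "2018-01",
    "histogram": [0.25, 0.5, 0.25],  # shortened; the real histogram has 256 bins
}
flat = flatten_sample(raw)
print(flat)
```

The same loop would be repeated for the other feature groups (strings, general_info, file_header, sections) so that no array-valued field remains in the document.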

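To make sure all our data has been ingested into Elasticsearch in the right format, we run the following queries in the Dev Tools Console (Management -> Dev Tools):

To get a count of the number of documents: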

GET ember_ml/_count

To search the documents in the index and make sure they are in the right format:

GET ember_ml/_search

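Once we have verified that the data in Elasticsearch looks as expected, we are ready to create our analytics jobs. Before creating the jobs, however, we need to define an index pattern for them. Index patterns tell Kibana (and consequently the job) which Elasticsearch indices contain the data you want to work with. We create the index pattern ember_* to match our index ember_ml.

Model training

Once the index pattern is created, we create two analytics jobs with the two subsets of features, as mentioned above. This can be done through the Machine Learning app in Kibana. We configure each job as follows: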

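  • Job type: We select "classification" to predict whether a given binary is malicious or benign. The underlying classification model in Elastic machine learning is a type of boosting called boosted tree regression, which combines multiple weak models into a composite one. It uses decision trees to learn to predict the probability that a data point belongs to a certain class.
  • Dependent variable: "label" in our case, 1 for malicious and 0 for benign.
  • Fields to include: We select the fields we would like to include in training.
  • Training percentage: We recommend an iterative approach to training, especially when working with a large dataset (that is, start by creating a training job with a small training percentage, evaluate the performance, and then decide whether the training percentage needs to be increased). Because we are working with a sizeable dataset (300K documents), we start with a training percentage of 10%.
  • Additional information options: We leave the defaults as they are, but you can choose to set hyperparameters for the training job at this stage.
  • Job details: We assign an appropriate job ID and destination index for the job.
  • Create index pattern: We disable this, since we will create a single index pattern that matches the destination indices of both training jobs in order to visualize their results together.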
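
The settings above can also be expressed as a request body for the create data frame analytics jobs API (`PUT _ml/data_frame/analytics/<job_id>`). The sketch below only builds such a body in Python and does not contact a cluster. The job ID `ember_main` and the helper names are assumptions for illustration, and the field counts are read off the sample document shown earlier; the jobs in this post were created through the Kibana UI.

```python
def feature_names(prefix, count):
    """Return field names such as byte_0 ... byte_{count-1}."""
    return [f"{prefix}_{i}" for i in range(count)]


def build_classification_job(dest_index, features, training_percent=10):
    """Build a classification job body (a sketch, not the exact job used here)."""
    return {
        "source": {"index": "ember_ml"},
        "dest": {"index": dest_index},
        "analysis": {
            "classification": {
                "dependent_variable": "label",
                "training_percent": training_percent,
            }
        },
        # The dependent variable is listed alongside the features.
        "analyzed_fields": {"includes": ["label"] + features},
    }


# Feature groups and sizes as they appear in the sample document.
main_features = (
    feature_names("general_info", 10)
    + feature_names("file_header", 62)
    + feature_names("sections", 255)
    + feature_names("strings", 104)
)
byte_features = feature_names("byte", 256)

main_job = build_classification_job("main_preds", main_features)
bytes_job = build_classification_job("bytes_preds", byte_features)

# With a 7.x Python client, a body like this could be submitted with, e.g.:
# es.ml.put_data_frame_analytics(id="ember_main", body=main_job)  # hypothetical job ID
print(len(main_job["analyzed_fields"]["includes"]))
print(len(bytes_job["analyzed_fields"]["includes"]))
```

Keeping the two jobs as two calls to the same builder makes it easy to check that they differ only in their feature subset and destination index.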

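We follow the process above to create two analytics jobs, one with only the byte histogram as features (destination index: bytes_preds) and one with everything except the byte histogram as features (destination index: main_preds). The analytics job determines the best encoding for each feature, the best-performing features, and the optimal hyperparameters for the model. Job progress can also be tracked in the Machine Learning app: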

Tracking job progress in the Machine Learning app

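Model evaluation

Once the jobs have completed, we can view the prediction results by clicking the View button next to each completed job. Clicking View gives us a data frame style view of the contents of the destination index, along with the confusion matrix for the model. Each row in the data frame (shown below) indicates whether the sample was used in training, the model prediction, the label, and the class probability and score: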

Data frame view of the main_preds destination index

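We use the confusion matrices to evaluate and compare the performance of the two models. Each row of the confusion matrix here represents instances of the actual class and each column represents instances of the predicted class, giving us a measure of the true positives and false positives (top row) and the false negatives and true negatives (bottom row).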

Confusion matrix for the model with general, file header, and section information, and strings as features

Confusion matrix for the model with byte histograms as features
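
For readers who want to reproduce the numbers behind a confusion matrix outside Kibana, here is a minimal sketch in plain Python. The label and prediction lists are toy values invented for illustration, not results from the models in this post, and the helper names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["tp"] += 1
        elif a != positive and p == positive:
            counts["fp"] += 1
        elif a == positive and p != positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def summary_rates(counts):
    """Derive accuracy, false positive rate and false negative rate."""
    total = sum(counts.values())
    negatives = counts["fp"] + counts["tn"]
    positives = counts["fn"] + counts["tp"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "false_positive_rate": counts["fp"] / negatives if negatives else 0.0,
        "false_negative_rate": counts["fn"] / positives if positives else 0.0,
    }


# Toy data: 1 = malicious, 0 = benign.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(actual, predicted)
rates = summary_rates(counts)
print(counts)
print(rates)
```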

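We see that both models have fairly high accuracy (at least for demo purposes!), so we decide against another round of training or hyperparameter tuning. In the next section, we will see how to compare the two models visually in Kibana and decide which one to deploy.

Model monitoring

Once we have the predictions for both models in their respective destination indices, we create an index pattern (in this case, *_preds) that matches both, so that we can build model monitoring dashboards in Kibana. In this example, the monitoring dashboard serves two purposes: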

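  • Compare the performance of the byte-histogram-only model with the other model; we use TSVB visualizations for this.
  • Track different metrics of the better-performing model; we use vertical bar visualizations for the prediction probabilities and the counts of benign and malicious samples, and TSVB to track the false positive rate and false negative rate.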

False negative rate and false positive rate of the two trained models over time

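By observing the false negative rate and false positive rate of the two models over a considerable time range, and by looking at the confusion matrices shown in the previous section, we conclude that the model trained on general, file header, and section information, and strings is the better-performing model. We then plot the various metrics of this model that we would like to track, assuming it is the one we want to deploy and monitor after deployment.

Dashboard of various model performance metrics created in Kibana

In real-world use cases, such monitoring dashboards can be used to compare candidate models for production and, once a model has been deployed, to identify signs of model decay in production (for example, bursts of false positives) and trigger a relevant response (for example, training a new model). In the next section, we will see how to deploy our chosen model in a machine learning production pipeline.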
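
To make the idea of tracking error rates over time concrete, the following sketch groups predictions by the month in which each sample first appeared and computes the false positive rate and false negative rate per month. It works on plain tuples rather than on the destination index, and the records are toy values invented for illustration; the dashboards in this post were built with TSVB, not with this code.

```python
from collections import defaultdict


def monthly_error_rates(records):
    """Compute per-month FPR and FNR.

    `records` is an iterable of (appeared, label, prediction) tuples, where
    `appeared` looks like "2018-01-06 00:00:00" and 1 means malicious.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for appeared, label, prediction in records:
        month = appeared[:7]  # "2018-01-06 00:00:00" -> "2018-01"
        if label == 1:
            counts[month]["tp" if prediction == 1 else "fn"] += 1
        else:
            counts[month]["fp" if prediction == 1 else "tn"] += 1

    result = {}
    for month in sorted(counts):
        c = counts[month]
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        result[month] = {
            # None marks a month with no samples of that class.
            "false_positive_rate": c["fp"] / negatives if negatives else None,
            "false_negative_rate": c["fn"] / positives if positives else None,
        }
    return result


toy_records = [
    ("2018-01-06 00:00:00", 0, 0),
    ("2018-01-09 00:00:00", 0, 1),
    ("2018-01-15 00:00:00", 1, 1),
    ("2018-02-02 00:00:00", 1, 0),
    ("2018-02-03 00:00:00", 1, 1),
    ("2018-02-20 00:00:00", 0, 0),
]
monthly = monthly_error_rates(toy_records)
for month, month_rates in monthly.items():
    print(month, month_rates)
```

A sudden rise in either rate from one month to the next is the kind of signal that could prompt retraining.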

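Deploying our supervised model to enrich data at ingest time

In addition to model training and evaluation, the Elastic Stack also provides a way to use trained models in ingest pipelines. This in turn lets you use machine learning models to enrich your data at ingest time. In this section, we will look at how to do exactly that with the malware classification model we trained above!

Suppose that we have an incoming stream of data extracted from binaries that we want to classify as malicious or benign. We will ingest this data into Elasticsearch through an ingest pipeline and reference our trained malware classification model in an inference processor.

First, let's create our inference processor and ingest pipeline. The most important part of the inference processor is the trained model and its model_id, which we can look up with the following REST API call in the Kibana Console: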

GET _ml/inference

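This returns a list of the trained models in the cluster and, for each model, its characteristics, such as the model_id (which we should note down for inference), the fields used to train the model, when the model was trained, and so on.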

Sample output from a call to retrieve information about trained models, showing the model_id, which is required for configuring inference processors

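If you have a large number of trained models in your cluster, it can be helpful to run the above API call with a wildcard query based on the name of the data frame analytics job that was used to train the model. In this case, the models we care about were trained with jobs named ember_*, so we can run

GET _ml/inference/ember_*

to quickly narrow the list down to the models we want.

Once we have noted down the model_id, we can create our ingest pipeline configuration. The full configuration is shown below. Take note of the configuration block named inference. It references the model we want to use to enrich our documents. It also specifies a target_field (which in this case we have set to is_malware, but which can of course be set according to preference); this is used as the prefix for the machine learning fields that are added when the inference processor processes a document.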

PUT _ingest/pipeline/malware-classification
{
  "description":"Classifies incoming binaries as malicious or benign",
  "processors": [
    {
      "inference": {
        "model_id": "ember_main-1598557689011",
        "target_field": "is_malware",
        "inference_config": {
          "classification": {
            "num_top_classes":2
          }
        }
      }
    }
  ]
}
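
If you prefer to manage the pipeline from Python rather than from the Kibana Console, the same configuration can be built as a dictionary. The sketch below only constructs and prints the body; the helper name `build_inference_pipeline` is an assumption, and the commented-out client call shows one way it might be submitted with a 7.x Python client.

```python
import json


def build_inference_pipeline(model_id, target_field="is_malware"):
    """Build an ingest pipeline body with a single inference processor."""
    return {
        "description": "Classifies incoming binaries as malicious or benign",
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    # Prefix for the fields the processor adds to each document.
                    "target_field": target_field,
                    "inference_config": {
                        "classification": {"num_top_classes": 2}
                    },
                }
            }
        ],
    }


pipeline = build_inference_pipeline("ember_main-1598557689011")
print(json.dumps(pipeline, indent=2))

# Possible submission with a 7.x Python client (not executed here):
# es.ingest.put_pipeline(id="malware-classification", body=pipeline)
```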

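Now suppose we are ingesting documents that contain features of binaries and want to enrich this data with a prediction of the maliciousness of each binary. An abridged sample document is shown below: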

{ 
          "appeared" :"2020-04-01 00:00:00",
          "byte_0" :0.1622137576341629,
          "byte_1" :0.007498478516936302,
          "byte_2" :0.003992937505245209,
          "byte_3" :0.00546838915720582,
          "byte_4" :0.007421959958970547,
          ...
          "byte_253" :0.0019106657709926367,
          "byte_254" :0.003551538335159421,
          "byte_255" :0.1782810389995575,
          "strings_0" :3312.0,
          "strings_1" :24.97675132751465,
          "strings_2" :82723.0,
          "strings_3" :0.07208394259214401,
          "strings_4" :8.099319529719651E-4,
          "strings_5" :0.005427753087133169,
           ...
          "strings_100" :0.0,
          "strings_101" :39.0,
          "strings_102" :0.0,
          "strings_103" :9.0,
          "general_info_0" :1130496.0,
          "general_info_1" :1134592.0,
          "general_info_2" :1.0,
          "general_info_3" :0.0,
          "general_info_4" :247.0,
          "general_info_5" :1.0,
          "general_info_6" :1.0,
          "general_info_7" :1.0,
          "general_info_8" :1.0,
          "general_info_9" :0.0,
          "file_header_0" :1.511340288E9,
          "file_header_1" :0.0,
          "file_header_2" :0.0,
          "file_header_3" :0.0,
          "file_header_4" :0.0,
          "file_header_5" :0.0,
          "file_header_6" :1.0,
          "file_header_7" :0.0,
          "file_header_8" :0.0,
          "file_header_9" :0.0,
           ...
          "file_header_59" :262144.0,
          "file_header_60" :1024.0,
          "file_header_61" :4096.0,
          "sections_0" :5.0,
          "sections_1" :0.0,
          "sections_2" :0.0,
          "sections_3" :1.0,
          "sections_4" :1.0,
          "sections_5" :0.0,
           ...
          "sections_253" :0.0,
          "sections_254" :0.0 
}

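We can ingest this document using one of the index APIs and pipe it through the malware-classification pipeline we created above. An example API call that ingests this document into a destination index named main_preds is shown below. To save space, the document has been abridged.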

POST main_preds/_doc?pipeline=malware-classification 
{ 
          "appeared" :"2020-04-01 00:00:00",
          "byte_0" :0.1622137576341629,
          "byte_1" :0.007498478516936302,
          "byte_2" :0.003992937505245209,
          "byte_3" :0.00546838915720582,
          "byte_4" :0.007421959958970547,
          "byte_5" :0.0025378242135047913,
          "byte_6" :0.002135345945134759,
          "byte_7" :0.001892974367365241,
          "byte_8" :0.007126075681298971,
          "byte_9" :0.001768250367604196,
          "byte_10" :0.0055223405789583921,
          "byte_11" :0.001283507444895804,
          "byte_12" :0.008042919423431158,
          "byte_13" :0.001533839968033135,
          "byte_14" :0.0010570581071078777,
          "byte_15" :0.006860705558210611,
...

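As a result, our destination index main_preds now contains a new document that has been enriched with the predictions from our trained machine learning model. If we view the document (for example, using the Discover tab), we see that, per our configuration, the inference processor has added the trained model's predictions to the document. In this case, our document (which represents an unknown binary that we want to classify as malicious or benign) has been assigned class 1, which means our model predicts this binary to be malicious.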

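A snippet from the ingested document showing the enrichment from our trained machine learning model

As new documents with predictions are added to the destination index, they are automatically picked up by the Kibana dashboards, providing insight into how the trained model performs on new samples over time.

Conclusion

In a production environment, the work does not (or should not) stop at model deployment. The pipeline needs a way to evaluate models efficiently before they reach customer environments and to monitor them closely once deployed. This helps data science teams foresee problems in the wild and take the necessary action when there are signs of model decay.

In this blog post, we explored why the Elastic Stack is a great platform for managing such end-to-end machine learning pipelines, with its excellent storage capabilities, model training, built-in tuning, and the range of visualization tools in Kibana.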