字段太多啦！3 招防止 Elasticsearch 映射爆炸式增长

作者

Ugo Sangiorgi

2022年6月3日

当系统具备这三样东西时，即可称为“可观测”：日志、指标和痕迹。指标和痕迹采用的是可预测结构，而日志（尤其是应用程序日志）通常由非结构化数据组成，需要经过收集和解析才能发挥作用。因此，掌控日志信息可以说是实现可观测性过程中最艰难的环节。

本文将深入探讨开发人员使用 Elasticsearch 管理日志时可以遵循的三种有效策略。如需更详细的概述，请观看下方视频。

[相关文章：利用 Elastic 改善云中的数据管理和可观测性]

针对数据应用 Elasticsearch

有时，我们无法控制在集群中接收到的日志类型。以日志分析服务提供商为例，它既要在特定预算范围内存储客户的日志，又要做到妥善存储（Elastic 咨询部门为多种类似用例提供过服务）。

通常情况下，我们会让客户索引字段，“以防”搜索时需要用到这些字段。如果你是这种情况，采用以下方法将会非常有用，它可以帮助你节约成本并确定集群性能的重点关注领域。

我们首先来概述一下问题。以下方的 JSON 文档（包含三个字段：message、transaction.user、transaction.amount）为例：

{
 "message": "2023-06-01T01:02:03.000Z|TT|Bob|3.14|hello",
 "transaction": {
   "user": "bob",
   "amount": 3.14
 }
}

此类文档的索引所对应的映射可能如下所示：

PUT dynamic-mapping-test
{
 "mappings": {
   "properties": {
     "message": {
       "type": "text"
     },
     "transaction": {
       "properties": {
         "user": {
           "type": "keyword"
         },
         "amount": {
           "type": "long"
         }
       }
     }
   }
 }
}

但是，借助 Elasticsearch，我们可以索引新字段，而不必预先指定一个映射，然后就可以轻松载入新数据，这也是 Elasticsearch 如此简单易用的一个原因。因此，不按原始映射方式索引内容也是可以的，例如：

POST dynamic-mapping-test/_doc
{
 "message": "hello",
 "transaction": {
   "user": "hey",
   "amount": 3.14,
   "field3": "hey there, new field with arbitrary data"
 }
}

GET dynamic-mapping-test/_mapping 将显示索引的新映射结果。transaction.field3 现在由 text 和 keyword 组成，它们实际上是两个新字段。

{
 "dynamic-mapping-test" : {
   "mappings" : {
     "properties" : {
       "transaction" : {
         "properties" : {
           "user" : {
             "type" : "keyword"
           },
           "amount" : {
             "type" : "long"
           },
           "field3" : {
             "type" : "text",
             "fields" : {
               "keyword" : {
                 "type" : "keyword",
                 "ignore_above" : 256
               }
             }
           }
         }
       },
       "message" : {
         "type" : "text"
       }
     }
   }
 }
}

很好，但是现在问题又来了：如果完全无法控制发送给 Elasticsearch 的内容，我们很容易面临映射爆炸式增长的问题。你可以随意创建子字段和子子字段，且这些字段都拥有相同的两种类型 text 和 keyword，例如：

POST dynamic-mapping-test/_doc
{
 "message": "hello",
 "transaction": {
   "user": "hey",
   "amount": 3.14,
   "field3": "hey there, new field",
   "field4": {
     "sub_user": "a sub field",
     "sub_amount": "another sub field",
     "sub_field3": "yet another subfield",
     "sub_field4": "yet another subfield",
     "sub_field5": "yet another subfield",
     "sub_field6": "yet another subfield",
     "sub_field7": "yet another subfield",
     "sub_field8": "yet another subfield",
     "sub_field9": "yet another subfield"
   }
 }
}

存储上述字段既耗费内存又占用磁盘空间，因为数据结构的构建方式需要实现可搜索和可聚合。这些字段有可能从未被使用过，它们的存在只是以防搜索时需要用到。

当被要求优化索引时，我们建议采取的第一步是，查看索引中每个字段的使用情况，确定哪些字段真正被搜索了，哪些只是在浪费资源。

策略 1：设为 strict

如果想要完全掌控 Elasticsearch 中存储的日志结构以及存储的方式，可以对映射设定一个清晰的定义，这样任何偏离要求的内容根本不会被存储。

通过在顶层或某些子字段中使用 dynamic: strict，我们可以拒绝与 mappings 定义不匹配的文档，迫使发送方遵守预定义的映射模式：

PUT dynamic-mapping-test
{
 "mappings": {
   "dynamic": "strict",
   "properties": {
     "message": {
       "type": "text"
     },
     "transaction": {
       "properties": {
         "user": {
           "type": "keyword"
         },
         "amount": {
           "type": "long"
         }
       }
     }
   }
 }
}

然后，当我们尝试使用额外字段来索引文档时…

POST dynamic-mapping-test/_doc
{
 "message": "hello",
 "transaction": {
   "user": "hey",
   "amount": 3.14,
   "field3": "hey there, new field"
   }
 }
}

…会得到如下回应：

{
 "error" : {
   "root_cause" : [
     {
       "type" : "strict_dynamic_mapping_exception",
       "reason" : "mapping set to strict, dynamic introduction of [field3] within [transaction] is not allowed"
     }
   ],
   "type" : "strict_dynamic_mapping_exception",
   "reason" : "mapping set to strict, dynamic introduction of [field3] within [transaction] is not allowed"
 },
 "status" : 400
}

如果你非常确定只想存储映射中的内容，则此策略会强制发送方遵守预定义的映射规则。

策略 2：不要都设为 strict

我们可以更灵活一点，使用 "dynamic": "false" 让文档顺利通过，即便它们并不是完全符合我们的期望。

PUT dynamic-mapping-disabled
{
 "mappings": {
   "dynamic": "false",
   "properties": {
     "message": {
       "type": "text"
     },
     "transaction": {
       "properties": {
         "user": {
           "type": "keyword"
         },
         "amount": {
           "type": "long"
         }
       }
     }
   }
 }
}

在应用这一策略时，我们接受所收到的所有文档，但只对映射中指定的字段进行索引，从而使额外字段根本无法搜索。换句话说，新字段不会耗费内存，只会占用磁盘空间。这些字段仍可在搜索的 hits 中可见，其中包括 top_hits 聚合。但是，我们无法对其进行搜索或聚合，因为没有创建相应数据结构来保存其中的内容。

结果不必是全有或全无 - 你甚至可以将根设置为 strict 并创建子字段以接受新字段，而无需索引这些字段。我们的在内部对象上设置动态映射 (Setting dynamic on inner objects) 文档详细介绍了具体操作。

PUT dynamic-mapping-disabled
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "message": {
        "type": "text"
      },
      "transaction": {
        "dynamic": "false",
        "properties": {
          "user": {
            "type": "keyword"
          },
          "amount": {
            "type": "long"
          }
        }
      }
    }
  }
}

策略 3：运行时字段

Elasticsearch 支持读取模式和写时模式，每个模式都有各自的注意事项。使用 dynamic:runtime 时，新字段会作为运行时字段添加到映射中。我们会索引映射中指定的字段，并使额外字段仅在查询时可搜索/可聚合。换句话说，我们不会在新字段上预先浪费内存，但代价是查询响应速度变慢，因为运行时会构建数据结构。

PUT dynamic-mapping-runtime
{
 "mappings": {
   "dynamic": "runtime",
   "properties": {
     "message": {
       "type": "text"
     },
     "transaction": {
       "properties": {
         "user": {
           "type": "keyword"
         },
         "amount": {
           "type": "long"
         }
       }
     }
   }
 }
}

让我们索引大型文档：

POST dynamic-mapping-runtime/_doc
{
 "message": "hello",
 "transaction": {
   "user": "hey",
   "amount": 3.14,
   "field3": "hey there, new field",
   "field4": {
     "sub_user": "a sub field",
     "sub_amount": "another sub field",
     "sub_field3": "yet another subfield",
     "sub_field4": "yet another subfield",
     "sub_field5": "yet another subfield",
     "sub_field6": "yet another subfield",
     "sub_field7": "yet another subfield",
     "sub_field8": "yet another subfield",
     "sub_field9": "yet another subfield"
   }
 }
}

GET dynamic-mapping-runtime/_mapping 将显示映射在索引大型文档时发生了更改：

{
 "dynamic-mapping-runtime" : {
   "mappings" : {
     "dynamic" : "runtime",
     "runtime" : {
       "transaction.field3" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_amount" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field3" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field4" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field5" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field6" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field7" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field8" : {
         "type" : "keyword"
       },
       "transaction.field4.sub_field9" : {
         "type" : "keyword"
       }
     },
     "properties" : {
       "transaction" : {
         "properties" : {
           "user" : {
             "type" : "keyword"
           },
           "amount" : {
             "type" : "long"
           }
         }
       },
       "message" : {
         "type" : "text"
       }
     }
   }
 }
}

新字段现在可以像普通关键字字段一样进行搜索。请注意，数据类型是在索引第一个文档时的猜测结果，但这也可以使用动态模板来控制。

GET dynamic-mapping-runtime/_search
{
 "query": {
   "wildcard": {
     "transaction.field4.sub_field6": "yet*"
   }
 }
}

结果：

{
…
 "hits" : {
   "total" : {
     "value" : 1,
     "relation" : "eq"
   },
   "hits" : [
     {
       "_source" : {
         "message" : "hello",
         "transaction" : {
           "user" : "hey",
           "amount" : 3.14,
           "field3" : "hey there, new field",
           "field4" : {
             "sub_user" : "a sub field",
             "sub_amount" : "another sub field",
             "sub_field3" : "yet another subfield",
             "sub_field4" : "yet another subfield",
             "sub_field5" : "yet another subfield",
             "sub_field6" : "yet another subfield",
             "sub_field7" : "yet another subfield",
             "sub_field8" : "yet another subfield",
             "sub_field9" : "yet another subfield"
           }
         }
       }
     }
   ]
 }
}

好极了！当你不知道要采集哪种类型的文档时，不难看出这个策略是多么有用，因此使用运行时字段听起来像是一种保守的方法，可以在性能和映射复杂性之间取得很好的权衡。

关于使用 Kibana 和运行时字段的注意事项

请记住，如果在 Kibana 上使用搜索栏进行搜索时没有指定字段（例如，只键入 “hello”，而不是 "message: hello"），则该搜索将匹配所有字段，其中包括我们声明的所有运行时字段。你可能不希望出现这种情况，因此我们的索引必须使用动态设置 index.query.default_field。请将其设置为全部或部分映射字段，并让运行时字段显式查询（例如，"transaction.field3: hey"）。

更新后的最终映射结果如下所示：

PUT dynamic-mapping-runtime
{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "message": {
        "type": "text"
      },
      "transaction": {
        "properties": {
          "user": {
            "type": "keyword"
          },
          "amount": {
            "type": "long"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "query": {
        "default_field": [
          "message",
          "transaction.user"
        ]
      }
    }
  }
}

选择最佳策略

每种策略都有自己的优点和缺点，因此最佳策略最终将取决于你的具体用例。下表总结了三种策略的优缺点，可以帮助你根据自己的需求做出正确的选择：

战略	优势	劣势
#1 - strict	可以确保存储的文档遵守映射规则	如果文档包含未在映射中声明的字段，则文档将被拒绝
#2 - dynamic: false	存储的文档可以有任意数量的字段，但只有映射的字段才会使用资源	未映射的字段不能用于搜索或聚合
#3 - 运行时字段	策略 2 的所有优点可以像任何其他字段一样在 Kibana 中使用运行时字段	查询运行时字段时搜索响应速度相对较慢

可观测性是 Elastic Stack 真正擅长的领域。无论是在跟踪受影响的系统的同时安全存储多年的金融交易数据，还是采集数 TB 的每日网络指标，我们的客户都以极低的成本、十倍的速度实现了可观测性。

想要开始使用 Elastic 可观测性？最好的方式是在云中使用。立即开始免费试用 Elastic Cloud！

Elasticsearch Platform

ELK Stack

Elastic Cloud

可观测性

安全性

搜索

按行业

按解决方案

客户聚焦

开发人员

保持联系

学习

帮助

字段太多啦！3 招防止 Elasticsearch 映射爆炸式增长

针对数据应用 Elasticsearch

策略 1：设为 strict

策略 2：不要都设为 strict

策略 3：运行时字段

关于使用 Kibana 和运行时字段的注意事项

选择最佳策略

注册 Elastic Cloud 免费试用版

关注我们

关于我们

加入我们

新闻稿

合作伙伴

信任和安全性

投资者关系

卓越奖

关于我们

加入我们

新闻稿

合作伙伴

信任和安全性

投资者关系

卓越奖

Elasticsearch Platform

ELK Stack

可观测性

安全性

搜索

按行业

按解决方案

字段太多啦！3 招防止 Elasticsearch 映射爆炸式增长

针对数据应用 Elasticsearch

策略 1：设为 strict

策略 2：不要都设为 strict

策略 3：运行时字段

关于使用 Kibana 和运行时字段的注意事项

选择最佳策略

分享

注册 Elastic Cloud 免费试用版

关注我们

关于我们

加入我们

新闻稿

合作伙伴

信任和安全性

投资者关系

卓越奖