April 19, 2018

Practical BM25 - Part 3: Considerations for Picking b and k1 in Elasticsearch

This is the final post in the three-part Practical BM25 series about similarity ranking (relevancy). If you're just joining, check out Part 1: How Shards Affect Relevance Scoring in Elasticsearch and Part 2: The BM25 Algorithm and its Variables.

Picking b and k1

It’s worth noting that picking b and k1 is generally not the first thing to do when your users aren’t finding documents quickly. The default values of b = 0.75 and k1 = 1.2 work pretty well for most corpora, so you’re likely fine with the defaults. More likely, you want to start with something like:

Boosting or adding constant scores for things like exact phrase matches in a bool query
Making use of synonyms to match other terms the user may be interested in
Adding fuzziness, typeahead, phonetic matching, stemming, and other text/analysis components to help with misspellings, language differences, etc.
Adding or using a function score to decay the scores of older documents or documents which are further geographically from the end user

Part of what makes Elasticsearch so powerful is how expressive you can be with these primitives to create a very robust search experience. But let’s say you’ve done everything else and want to look at the last mile of b and k1... how do you choose them?

It’s been fairly well studied that there are no “best” b and k1 values for all data/queries. Users that do change the b and k1 parameters typically do so incrementally by evaluating each increment. The Rank Eval API in Elasticsearch can help with the evaluation stage.

When experimenting with b and k1, you should first consider their bounds. I would also suggest looking into past experiments to give you a rough feel for the type of experimentation you may be interested in doing — especially if you’re just getting into this for the first time:

b needs to be between 0 and 1. Many experiments test values in increments of around 0.1 and most experiments seem to show the optimal b to be in a range of 0.3-0.9 (Lipani, Lupu, Hanbury, Aizawa (2015); Taylor, Zaragoza, Craswell, Robertson, Burges (2006); Trotman, Puurula, Burgess (2014); etc.)
k1 is typically evaluated in the 0 to 3 range, though there’s nothing to stop it from being higher. Many experiments have focused on increments of 0.1 to 0.2 and most experiments seem to show the optimal k1 to be in a range of 0.5-2.0
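
As a rough illustration of the incremental approach, the sketch below enumerates candidate (b, k1) pairs over the ranges reported above and keeps the pair that scores best. The names candidate_grid, pick_best, and toy_metric are hypothetical. In practice the evaluate callback would apply each pair to a test index and compute a relevance metric over a set of judged queries (for example with the Rank Eval API); the toy metric here only stands in for that step.

```python
from itertools import product


def candidate_grid(b_range=(0.3, 0.9), k1_range=(0.5, 2.0), step=0.1):
    """Return every (b, k1) pair in the given ranges at the given increment."""
    def steps(lo, hi):
        count = int(round((hi - lo) / step))
        # Round each value so float drift does not produce 0.30000000000000004
        return [round(lo + i * step, 2) for i in range(count + 1)]

    return list(product(steps(*b_range), steps(*k1_range)))


def pick_best(candidates, evaluate):
    """Return the (b, k1) pair with the highest value of evaluate(b, k1)."""
    return max(candidates, key=lambda pair: evaluate(*pair))


def toy_metric(b, k1):
    """Placeholder metric (an assumption): peaks at b = 0.6, k1 = 1.5."""
    return -(abs(b - 0.6) + abs(k1 - 1.5))


grid = candidate_grid()
best_b, best_k1 = pick_best(grid, toy_metric)
print(f"{len(grid)} candidates, best b={best_b}, k1={best_k1}")
```

A full grid grows quickly, so a coarser step or a narrower range is a reasonable first pass before refining around the best pair.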

For k1, you should be asking, “when do we think a term is likely to be saturated?” In very long documents like books (especially fictional or multi-topic books), many different terms are likely to appear several times in a work, even when a term isn’t highly relevant to the work as a whole. For example, “eye” or “eyes” can appear hundreds of times in a fictional book even when eyes are not one of the primary subjects of the book. A book that mentions “eyes” a thousand times, though, likely has a lot more to do with eyes. You may not want terms to be saturated as quickly in this situation, so there’s some suggestion that k1 should generally trend toward larger numbers when the text is a lot longer and more diverse. For the inverse situation, it’s been suggested to set k1 on the lower side: it’s very unlikely that a collection of short news articles would mention “eyes” dozens to hundreds of times without being highly related to eyes as a subject.
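
To make the saturation idea concrete, here is a minimal sketch that uses the tfNorm formula as it appears in Elasticsearch explain output. It assumes a document of exactly average length so that only k1 matters; the helper names tf_norm and saturation are made up for this illustration. Because tfNorm can never exceed k1 + 1, the sketch reports how close each term frequency gets to that ceiling.

```python
def tf_norm(freq, k1, b=0.75, field_length=1.0, avg_field_length=1.0):
    """tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))."""
    length_factor = 1 - b + b * field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * length_factor)


def saturation(freq, k1):
    """Fraction of the tfNorm ceiling (k1 + 1) reached by an average-length document."""
    return tf_norm(freq, k1) / (k1 + 1)


for k1 in (0.5, 1.2, 3.0):
    row = ", ".join(f"freq={f}: {saturation(f, k1):.2f}" for f in (1, 10, 100))
    print(f"k1={k1}: {row}")
```

With a small k1 the first few occurrences already account for most of the achievable score, while a larger k1 leaves room for the hundredth occurrence to still matter.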

For b, you should be asking, “when do we think a document is likely to be very long, and when should that hinder its relevance to a term?” Documents which are highly specific like engineering specifications or patents are lengthy in order to be more specific about a subject. Their length is unlikely to be detrimental to the relevance, so a lower b may be more appropriate. On the opposite side, documents which touch on several different topics in a broad way, such as news articles (a political article may touch on economics, international affairs, and certain corporations) or user reviews, often benefit from a larger b so that topics irrelevant to a user’s search, including spam and the like, are penalized.
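
The effect of b can be sketched the same way. The example below compares a short and a long document that contain a term equally often, using the tfNorm formula from Elasticsearch explain output. The field lengths, the term frequency, and the helper name tf_norm are assumptions chosen only for illustration.

```python
def tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))."""
    length_factor = 1 - b + b * field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * length_factor)


AVG_LENGTH = 100   # assumed average field length in terms
FREQ = 3           # the term occurs three times in both documents
K1 = 1.2

for b in (0.0, 0.3, 0.75, 0.9):
    short_doc = tf_norm(FREQ, K1, b, field_length=50, avg_field_length=AVG_LENGTH)
    long_doc = tf_norm(FREQ, K1, b, field_length=400, avg_field_length=AVG_LENGTH)
    print(f"b={b}: short={short_doc:.3f} long={long_doc:.3f} long/short={long_doc / short_doc:.2f}")
```

At b = 0 the two documents score identically, and as b approaches 1 the long document is penalized more and more heavily relative to the short one.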

These are general starting points, but ultimately you should test any parameters you set. This also shows how tightly relevance is bound to having similar documents (similar language, similar general structure, etc.) in the same index.
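
To test a candidate pair, the values have to reach the index as a custom similarity. The sketch below builds the two JSON fragments involved: an index settings block that defines a BM25 similarity, and a field mapping fragment that refers to it by name. The similarity name tuned_bm25 and the helper bm25_index_fragments are hypothetical, and the surrounding mapping layout depends on your Elasticsearch version, so only the field-level fragment is shown.

```python
import json


def bm25_index_fragments(b, k1, similarity_name="tuned_bm25", field="title"):
    """Return (settings, field_mapping) dicts for a custom BM25 similarity."""
    settings = {
        "index": {
            "similarity": {
                # "BM25" is the built-in similarity type; b and k1 are its parameters
                similarity_name: {"type": "BM25", "b": b, "k1": k1}
            }
        }
    }
    # The text field opts in to the custom similarity by name
    field_mapping = {field: {"type": "text", "similarity": similarity_name}}
    return settings, field_mapping


settings, field_mapping = bm25_index_fragments(b=0.5, k1=1.5)
print(json.dumps({"settings": settings}, indent=2))
print(json.dumps(field_mapping, indent=2))
```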

Explain API

Now that you understand how the BM25 algorithm works and how the parameters work, I want to briefly cover one of the handiest tools in the Elasticsearch toolbox for giving you more information to answer the “why” questions that inevitably come up. If you’ve ever had to answer the question “why is document x ranked higher than document y” the Explain API can help you significantly. Let’s look at document 4 in people3, this time with a two-term query:

GET /people3/_doc/4/_explain
{
  "query": {
    "match": {
         "title": "shane connelly"
     }
  }
}

This returns a wealth of information about how document 4 was scored against this query:

{
  "_index": "people3",
  "_type": "_doc",
  "_id": "4",
  "matched": true,
  "explanation": {
    "value": 0.71437943,
    "description": "sum of:",
    "details": [
      {
        "value": 0.102611035,
        "description": "weight(title:shane in 3) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 0.102611035,
            "description": "score(doc=3,freq=1.0 = termFreq=1.0\n), product of:",
            "details": [
              {
                "value": 0.074107975,
                "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details": [
                  {
                    "value": 6,
                    "description": "docFreq",
                    "details": []
                  },
                  {
                    "value": 6,
                    "description": "docCount",
                    "details": []
                  }
                ]
              },
              {
                "value": 1.3846153,
                "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details": [
                  {
                    "value": 1,
                    "description": "termFreq=1.0",
                    "details": []
                  },
                  {
                    "value": 5,
                    "description": "parameter k1",
                    "details": []
                  },
                  {
                    "value": 1,
                    "description": "parameter b",
                    "details": []
                  },
                  {
                    "value": 3,
                    "description": "avgFieldLength",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "fieldLength",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value": 0.61176836,
        "description": "weight(title:connelly in 3) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 0.61176836,
            "description": "score(doc=3,freq=1.0 = termFreq=1.0\n), product of:",
            "details": [
              {
                "value": 0.44183275,
                "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details": [
                  {
                    "value": 4,
                    "description": "docFreq",
                    "details": []
                  },
                  {
                    "value": 6,
                    "description": "docCount",
                    "details": []
                  }
                ]
              },
              {
                "value": 1.3846153,
                "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details": [
                  {
                    "value": 1,
                    "description": "termFreq=1.0",
                    "details": []
                  },
                  {
                    "value": 5,
                    "description": "parameter k1",
                    "details": []
                  },
                  {
                    "value": 1,
                    "description": "parameter b",
                    "details": []
                  },
                  {
                    "value": 3,
                    "description": "avgFieldLength",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "fieldLength",
                    "details": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

We can see the individual values of k1 and b, but also the fieldLength and avgFieldLength and other constituent parts of the score for each term! So with our final score of 0.71437943, we can see that 0.102611035 came from “shane” and 0.61176836 came from “connelly.” Connelly is a much rarer term in our corpus, so it carries a much higher IDF, which seems to have been the primary influencer of the score. But we can also see that the length of the document (2 terms vs an average of 3) has raised the “tfNorm” part of the score as well. If we felt that this was unfair, we could potentially decrease the value of b to compensate. Of course any changes to b or k1 have an effect on more than just the one given query here, so if you do end up changing these, make sure to re-test across many queries and many documents.
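
As a sanity check, the numbers in the explain output can be recomputed by hand from the two formulas it prints. The sketch below plugs in the docFreq, docCount, k1, b, fieldLength, and avgFieldLength values from the response above; the function names are made up for this example, and small differences in the last digits are expected because Lucene works with single-precision floats.

```python
import math


def idf(doc_freq, doc_count):
    """idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


def term_score(doc_freq, b=1.0, k1=5.0):
    """Score of one query term for document 4: idf * tfNorm, using values from the response."""
    return idf(doc_freq, doc_count=6) * tf_norm(1, k1, b, field_length=2, avg_field_length=3)


shane = term_score(doc_freq=6)
connelly = term_score(doc_freq=4)
total = shane + connelly
print(f"shane={shane:.6f} connelly={connelly:.6f} total={total:.6f}")

# Same document and query with a lower b: the short title gets less of a boost
lower_b_total = term_score(doc_freq=6, b=0.5) + term_score(doc_freq=4, b=0.5)
print(f"total with b=0.5: {lower_b_total:.6f}")
```

The second calculation shows the direction of the change described above: lowering b shrinks the advantage this two-term title gets for being shorter than average.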

Please do note that the Explain API is a debug tool and should be treated as such. Just like you wouldn’t run your production application in debug mode under normal circumstances, you should avoid calling _explain in your production deployment of Elasticsearch under normal circumstances.

Last Words

BM25 isn’t the only scoring algorithm in town! There’s the classic TF/IDF, divergence from randomness, and many, many more, not to mention hyperlink-based modifiers like PageRank, and you can generally combine many of these together! Additionally, a variety of permutations on the core BM25 algorithm have popped up over the years. For example, there has been some academic effort to pick/suggest/account for optimal values for k1 and b automatically with some of these BM25 permutations. In fact, there is some reason/evidence to believe that at least k1 would be best optimized on a term-by-term basis (Lv, Zhai (2011)). With this, it’d be natural to ask “why BM25?” or “why BM25 with these particular k1 = 1.2 and b = 0.75 values?”

The short answer is that there doesn’t appear to be any silver bullet in algorithms or in picking k1 or b values, but BM25 with k1 = 1.2 and b = 0.75 seems to give very good results for most cases. In “Improvements to BM25 and Language Models Examined” (Trotman, Puurula, Burgess (2014)), Trotman et al. searched over b = 0-1 and k1 = 0-3, and applied a number of different relevance algorithms, including ones that attempt to automatically tune the BM25 parameters. I think they state it best in their conclusion:

“This investigation examined 9 ranking functions, 2 relevance feedback methods, 5 stemming algorithms, and 2 stop word lists. It shows that stop words are ineffective, that stemming is effective, that relevance feedback is effective, and that the combination of not stopping, stemming, and feedback typically leads to improvements on a plain ranking function. However, there is no clear evidence that any one of the ranking functions is systematically better than the others.”

So, as we started this blog, so shall we end: most of your tuning efforts are likely best spent making use of the expressive Elasticsearch query language, index/linguistic controls, and incorporating user feedback, all of which can be done via the Elasticsearch APIs. For those who do all of this and then want to dive really deeply, consider varying the k1 and b parameters. For those who want to go even further, Elasticsearch supports pluggable similarity algorithms and ships with a number of the more common ones. But no matter what you do, make sure you test your changes!

References

Lipani, A., Lupu, M., Hanbury, A., Aizawa, A. (2015). Verboseness Fission for BM25 Document Length Normalization. Association for Computing Machinery

Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Burges, C. (2006). Optimisation methods for ranking functions with multiple parameters. Association for Computing Machinery

Trotman, A., Puurula, A., Burgess, B. (2014). Improvements to BM25 and Language Models Examined. Association for Computing Machinery

Lv, Y., Zhai, C. (2011). Adaptive term frequency normalization for BM25. Association for Computing Machinery
