24 April 2018 Engineering

Language Models in Elasticsearch

By Mayya Sharipova

Background

Elasticsearch provides various ways for users to define how matching documents will be scored. One of these ways is through the configurable similarity module. While the default scoring method in Elasticsearch is BM25, other alternative scoring methods, such as language models and divergence from randomness, have been widely researched and have demonstrated promising performance. This blog will present an introduction to one such alternative scoring method based on language modelling. We don't set a goal here to contrast various scoring methods (this will be a topic of future blogs), but rather present a theoretical foundation of the scoring method based on language models, as well as their practical usage in Elasticsearch.

1. What is a language model?

Language models have been widely used in the Natural Language Processing (NLP) field for speech recognition and generation tasks. A language model for a given string tells you the probability of a string to occur within a language. Given a language with a certain vocabulary, a language model represents the probability distribution of terms in the vocabulary [language_model].

For example, imagine, that I have a very simple language with the following vocabulary:

elasticsearch
kibana
logstash
beats
x-pack

The language model for my language can have the following probability distribution, showing the probability of various vocabulary terms to occur:

Language model: M
elasticsearch 0.4
kibana 0.3
logstash 0.1
beats 0.1
x-pack 0.1

Thus, the probability of my language model M to generate the term "elasticsearch" is 0.4.

There are various types of language models. The most simple one (presented above) is the Unigram Language Model. The Unigram Language Model assumes that terms occur independently from each other. For our model, it would mean that "elasticsearch" occurring in a document doesn't influence the probability of "kibana" occurring in the same document. While this is not always true, and many terms tend to co-occur together, it is often good enough approximation for information retrieval purposes.

2. Language modelling in Information Retrieval

An approach to use language modelling for documents retrieval is following. For every document in our collection we build a separate language model and rank documents by the likelihood of their respective language models to generate a given query [zhai_lafferty]. More precisely, we first build a separate language model for each document in the collection:

doc1 -> Md1
doc2 -> Md2

docn -> Mdn

To find the best matching documents for the given query q, for every document d in the collection, we estimate the likelihood of observing q in d. Or, in other words, we estimate what is the probability of the document's language model Md to generate q : P(q|Md). We then rank documents by these estimated probabilities. Notice how the original problem to know whether a document is relevant for a query, in the language modelling approach reverses to computing the probability that the query is extracted from the document.

The intuition behind this approach is as follows: When a user has an information need, she imagines what kind of documents will satisfy this need; she builds in her mind an imaginary model of these documents, visualizing what terms are representative of these documents. Based on these imaginary documents' models and their representative terms, the user formulates a query, predicting that this query should generate these documents. During the search time, we reverse the process. For each document, we estimate the probability that the user will use the query to find a document such as this one. In other words, what is the probability of this document model to generate the given query [manning, büttcher]?

3. Estimating probabilities

If we have a query q consisting of a single term t, how can we calculate the probability of a document language model Md to generate t? The simplest way to estimate the probability is based on the counts of this term appearing in the document [manning, büttcher]:

$$P(t | M_d) =\frac{tf_{t,d}}{L_d} (1)$$

t - a term in a query
d - document
Md - language model of a document
tft,d - term frequency of term t in a document d, number of times t occurs in d
Ld - the length of document d - the total number of terms that d contains
P(t| Md) - stands for probability of the term t given Md

Thus, the more often t appears in d, the higher is the probability that t can be used to retrieve d. Notice, that for terms not appearing in the document, this probability is 0.

If the query consists of several terms, we assume that terms are independent of each other (unigram model), and thus to calculate the probability for the whole query, we need to multiply the probabilities of individual terms:

$$P(q | M_d) =\prod_{t \in q} P(t | M_d) = \prod_{t \in q} \frac{tf_{t,d} }{L_d}(2)$$

q - query, consisting of a single or several terms t
Π - stands for multiplication

The formula above is called Query Likelihood Model, as it estimates the likelihood of the query given the document, or how well the document d fits the particular query q [zhai_lafferty, query_likelihood_model].

Let's have an example of calculating scores using the model of the formula (2) for the query q consisting of two terms "desert" "people" for the following collection of three documents:

d1: "This is the desert. There are no people in the desert. The Earth is large."

d2: "'Where are the people?' resumed the little prince at last. 'It's a little lonely in the desert…' ,' It is lonely when you're among people, too,' said the snake."

d3: " 'What makes the desert beautiful,' said the little prince, 'is that somewhere it hides a well' "

Let's start our score calculations. For document d1, we see tf("desert") is 2 because there are 2 occurrences of the term, and Ld is 15 because d1 is 15 terms long. For the second term, tf("people") is 1 for this document. So we multiply these together to get 0.0089:

$$P(q | M_{d1}) = \frac{2}{15} \times \frac{1}{15} = 0.0089 $$

Similarly, for documents d2 and d3:

$$P(q | M_{d2}) = \frac{1}{28} \times \frac{2}{28} = 0.00255 $$

$$P(q | M_{d3}) = \frac{1}{16} \times \frac{0}{16} = 0 $$

If we rank our documents, document d1 with the highest scoring with be ranked the highest, and document d3 with the lowest scoring, the lowest.

4. Smoothing Language Models

Notice from the previous section how document d3 just because it doesn't have a single query term got a score of 0 (since we use multiplication of individual term scores). Documents are very sparsely represented and can't possibly contain all terms that could be used to query them. On the other hand, just because a document contains a certain term, doesn't mean that the document is about this term [manning]. This is the problem, as documents contain only limited amount of text, language models built from them, are not perfect. To improve the accuracy of these document models, we need to adjust them, or smooth them with a background model. The model for the whole collection of documents could serve as a good background model, as it contains much more documents than a single document.

A general idea of smoothing is adjusting a document model Md based on the collection model Mc, making Md to be a little bit similar to Mc by bringing some mass from Mc to Md [manning]. This allows us to improve the accuracy of the model, as well as avoid zero probabilities for unseen words.

We do two things for smoothing adjustment:

  1. For terms that are not present in Md, we assign some probability, which will be fraction of that of Mc. This will prevent Md from generating zero probabilities for unseen words.
  2. For terms that are present in Md, we will combine both probabilities from Md and a fraction of Mc to discount the original probability from Md.This should improve the overall accuracy of the document model.

There are various ways to implement smoothing. The two well-performed methods that are also implemented in Lucene, and exposed in Elasticsearch are Jelinek Mercer smoothing and Dirichlet smoothing.

4.1 Jelinek Mercer smoothing

Jelinek and Mercer proposed to linearly combine the probability of generating a term from a document with the probability of generating it from the collection using λ parameter.

$$P(t | d) =(1- \lambda) P(t | M_d) + \lambda P(t | M_c) (3)$$

λ - lambda, a value between (0,1]
Md - language model of a document
Mc - a language model for the whole collection

Or for the whole query:

$$P(q | d) =\prod_{t \in q} ((1 - \lambda) \frac{tf_{t,d}}{L_d} + \lambda \frac{tf_t}{L_c}) (4) $$

q - query, consisting of a single or several terms t
t - a term in a query
d - document
tft,d - term frequency of term t in a document d, number of times t occurs in d
Ld - the length of document d - the total number of terms in d
Lc - the length of the collection c - the total number of terms in c

The choice of the λ parameter is important:

  • Choosing λ close to 1 means a bigger amount of smoothing. By increasing λ, we are increasing the importance of the collection model, and diminishing the importance of document model. This is a good choice for longer queries, as the behaviour will be similar to the disjunction of the query terms. Documents that contain fewer query terms may still get good scores.
  • Choosing λ close to 0 means a little amount of smoothing. This is a good choice for shorter queries, as the behaviour will similar to the conjunction of query terms. Here we are concerned with finding documents that contain the most query terms, and these documents will be ranked highest.

In practical information retrieval applications, it is common to use logarithms to calculate scores. If we follow this approach and take the logarithm of the formula (4), and do some transformations, we will get the formula (5), which is the way Jelinek-Mercer scoring is implemented in Lucene and Elasticsearch:

$$P(q| d) = \sum_{t \in q} \log(1 + \frac{(1- \lambda) M_d}{\lambda M_c}) = \sum_{t \in q} \log(1 + \frac{(1- \lambda) \frac{tf_{t,d}}{L_d}}{\lambda \frac{tf_t + 1}{L_c + 1}}) (5)$$

λ - default value is 0.1 in Elasticsearch.

Here's an example of calculating scores using formula (5). We can use the same document collection and query as we used in Section 3:

q: "desert" "people"

d1: "This is the desert. There are no people in the desert. The Earth is large."

d2: "'Where are the people?' resumed the little prince at last. 'It's a little lonely in the desert…' ,' It is lonely when you're among people, too,' said the snake."

d3: " 'What makes the desert beautiful,' said the little prince, 'is that somewhere it hides a well' "

λ : 0.1
tf("desert"): 4, total number of occurence of "desert" in the collection across all documents
tf("people"): 3, total number of occurence of "people" in the collection across all documents
Lc : 59, total number of tokens in collection
Mc("desert") = (4 + 1) / (59 + 1) = 5/60
Mc("people") = (3 + 1) / (59 + 1) = 4/60

$$ P(q | M_{d1}) = \log(1 + \frac{0.9 * \frac{2}{15}}{0.1 * \frac{5}{60}} ) + \log(1 + \frac{0.9 * \frac{1}{15}}{0.1 * \frac{4}{60}} ) = 2.7343674 + 2.302585 = 5.036952 $$

$$ P(q | M_{d2}) = \log(1 + \frac{0.9 * \frac{1}{28}}{0.1 * \frac{5}{60}} ) + \log(1 + \frac{0.9 * \frac{2}{28}}{0.1 * \frac{4}{60}} ) = 1.5804503 + 2.364889 = 3.9453392 $$

$$ P(q | M_{d3}) = \log(1 + \frac{0.9 * \frac{1}{16}}{0.1 * \frac{5}{60}} ) + \log(1 + \frac{0.9 * \frac{0}{16}}{0.1 * \frac{4}{60}} ) = 2.0476928 + 0 = 2.0476928 $$

Notice, how unlike the traditional definition of a smoothed language model, the term "people" for document d3 will still get a score 0. In fact, Lucene will not even calculate the score for this term for this document, as the posting list of the term "people" will not contain d3.

4.2 Dirichlet smoothing

The intuition behind the Dirichlet smoothing is that we added μ terms to each document in the collection and distributed them according to the statistics of the collection [büttcher]. For example, if μ = 1000, we will add 50.85 terms of "people" (1000 * 3/59) to the document d3 from the previous section. With this and other terms, the length of document d3 will be increased by 1000. Of course, in reality, we are not going to add any terms to any document, we just use this intuition for calculating probability distribution using the following formula:

$$P(t | d) = \frac{tf_{t,d} + \mu P(t | M_c)}{L_d + \mu} (6) $$

μ - mu, a value > 0
tft,d - term frequency of term t in a document d, number of times t occurs in d
Ld - the length of document d - the total number of terms in d
Mc - a language model for the whole collection

The choice of μ should depend on the length of a document. The longer the document, the higher value of μ should be to fill the impact.

μ - default value is 2000 in Lucene and Elasticsearch.

5. Full example in Elasticsearch

Let's look at an example how to use Language Model Similarity in Elasticsearch. We will use Elasticsearch version 6.2.3, which uses Lucene version 7.2.1, but this should work with any version of Elasticsearch that supports LMJelinekMercer similarity. Let's create an index with the Jelinek-Mercer similarity:

PUT /my_index
{
    "settings" : {
        "index" : {
            "similarity" : {
              "my_similarity" : {
                "type" : "LMJelinekMercer",
                "lambda" : 0.1
              }
            }
        },
        "number_of_shards": 1
    },
    "mappings": {
      "doc": {
        "properties": {
          "my_text" : { 
            "type" : "text", 
            "similarity" : "my_similarity" 
          }
        }
      }
    }
}

Put some documents into it:

 
PUT my_index/doc/1
{
   "my_text": "This is the desert. There are no people in the desert. The Earth is large."
}
PUT my_index/doc/2
{
   "my_text": "'Where are the people?' resumed the little prince at last. 'It's a little lonely in the desert…' ,' It is lonely when you're among people, too,'  said the snake."
}
PUT my_index/doc/3
{
   "my_text": " 'What makes the desert beautiful,' said the little prince, 'is that somewhere it hides a well' "
}

Now, let's run search request with explanation of the query scores:

GET my_index/_search
{
    "query": {
        "match" : {
            "my_text" : "desert people"
        }
    },
    "explain": true
}

Below is the display of the score of the first document. Notice the "collection probability" value in the explanation of scores. This value will be combined with the document probability in the calculation of the final score in the way we showed in the Section 4.1.

"hits": [
      {
        "_shard": "[my_index][0]",
        "_node": "hNw0go7pSfK6uXr-frGVuw",
        "_index": "my_index",
        "_type": "doc",
        "_id": "1",
        "_score": 5.036952,
        "_source": {
          "my_text": "This is the desert. There are no people in the desert. The Earth is large."
        },
        "_explanation": {
          "value": 5.036952,
          "description": "sum of:",
          "details": [
            {
              "value": 2.7343674,
              "description": "weight(my_text:desert in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 2.7343674,
                  "description": "score(LMJelinekMercerSimilarity, doc=0, freq=2.0), computed from:",
                  "details": [
                    {
                      "value": 0.1,
                      "description": "lambda",
                      "details": []
                    },
                    {
                      "value": 0.083333336,
                      "description": "collection probability",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 2.302585,
              "description": "weight(my_text:people in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 2.302585,
                  "description": "score(LMJelinekMercerSimilarity, doc=0, freq=1.0), computed from:",
                  "details": [
                    {
                      "value": 0.1,
                      "description": "lambda",
                      "details": []
                    },
                    {
                      "value": 0.06666667,
                      "description": "collection probability",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
....

This concludes this introductory blog on language modelling. In the future blogs, we plan to expand on this by contrasting various scoring methods: probabilistic models versus language models versus divergence from randomness etc.

6. References

  • [language_model]: https://en.wikipedia.org/wiki/Language_model
  • [zhai_lafferty]: Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
  • [manning]: Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval, pages 237–240. Cambridge University Press, 2009
  • [büttcher]: Büttcher, Stefan, Charles LA Clarke, and Gordon V. Cormack. Information retrieval: Implementing and evaluating search engines, pages 286-298, Mit Press, 2016.
  • [query_likelihood_model]: https://en.wikipedia.org/wiki/Query_likelihood_model