Similarity module

A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.

Configuring a custom similarity is considered an expert feature; the built-in similarities are most likely sufficient, as described in the mapping section.

Configuring a similarity

Most similarities, built-in or custom, have options that can be set in the index settings, as shown below. These settings can be provided when creating an index or when updating index settings.

"similarity" : {
  "my_similarity" : {
    "type" : "DFR",
    "basic_model" : "g",
    "after_effect" : "l",
    "normalization" : "h2",
    "normalization.h2.c" : "3.0"
  }
}

Here we configure the DFRSimilarity so that it can be referenced as my_similarity in mappings, as illustrated in the example below:

{
  "book" : {
    "properties" : {
      "title" : { "type" : "string", "similarity" : "my_similarity" }
    }
  }
}
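
The settings and the mapping are shown separately above. As a rough sketch, and assuming both are sent together as the body of a single index-creation request, they could be combined as follows. The "settings" / "index" / "mappings" wrapper is the usual request layout rather than something this page spells out, so treat it as an assumption.

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_similarity" : {
          "type" : "DFR",
          "basic_model" : "g",
          "after_effect" : "l",
          "normalization" : "h2",
          "normalization.h2.c" : "3.0"
        }
      }
    }
  },
  "mappings" : {
    "book" : {
      "properties" : {
        "title" : { "type" : "string", "similarity" : "my_similarity" }
      }
    }
  }
}
```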

Available similarities

Default similarity

The default similarity is based on the TF/IDF model. It has the following option:

discount_overlaps
Determines whether overlap tokens (tokens with a position increment of 0) are ignored when computing norms. Defaults to true, meaning overlap tokens do not count when computing norms.

Type name: default
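
As an illustration only, a named similarity of this type that counts overlap tokens when computing norms might be declared as below. The name my_tfidf is hypothetical, and the "settings" / "index" wrapper assumes the block is sent as the body of an index-creation request.

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_tfidf" : {
          "type" : "default",
          "discount_overlaps" : false
        }
      }
    }
  }
}
```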

BM25 similarity

Another TF/IDF-based similarity, with built-in tf normalization, that is supposed to work better for short fields (such as names). See Okapi_BM25 for more details. This similarity has the following options:

k1

Controls non-linear term frequency normalization (saturation).

b

Controls to what degree document length normalizes tf values.

discount_overlaps

Determines whether overlap tokens (tokens with a position increment of 0) are ignored when computing norms. Defaults to true, meaning overlap tokens do not count when computing norms.

Type name: BM25
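
A minimal sketch of a BM25 configuration that sets all three options. The name my_bm25 is hypothetical, and the values for k1 and b are commonly used starting points chosen for illustration, not recommendations made by this page. The "settings" / "index" wrapper assumes an index-creation request body.

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_bm25" : {
          "type" : "BM25",
          "k1" : "1.2",
          "b" : "0.75",
          "discount_overlaps" : true
        }
      }
    }
  }
}
```

A field would then opt in through its mapping with "similarity" : "my_bm25", in the same way as the title field in the earlier mapping example.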

DFR similarity

Similarity that implements the divergence from randomness framework. This similarity has the following options:

basic_model

Possible values: be, d, g, if, in, ine and p.

after_effect

Possible values: no, b and l.

normalization

Possible values: no, h1, h2, h3 and z.

All normalization options except the first (no) need a normalization value, set with a key such as normalization.h2.c in the configuration example above.

Type name: DFR

IB similarity

Information-based model. This similarity has the following options:

distribution

Possible values: ll and spl.

lambda

Possible values: df and ttf.

normalization

Same as in DFR similarity.

Type name: IB
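
Putting the three options together, an IB similarity might be configured roughly as follows. The name my_ib and the particular values are illustrative assumptions; the normalization keys simply reuse the ones from the DFR example earlier on this page, and the "settings" / "index" wrapper assumes an index-creation request body.

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_ib" : {
          "type" : "IB",
          "distribution" : "ll",
          "lambda" : "df",
          "normalization" : "h2",
          "normalization.h2.c" : "3.0"
        }
      }
    }
  }
}
```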

LM Dirichlet similarity

LM Dirichlet similarity. This similarity has the following options:

mu

Defaults to 2000.

Type name: LMDirichlet
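
A sketch of an LM Dirichlet configuration that states mu explicitly. The name my_lm_dirichlet is hypothetical, the value shown is just the documented default, and the "settings" / "index" wrapper assumes an index-creation request body.

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_lm_dirichlet" : {
          "type" : "LMDirichlet",
          "mu" : "2000"
        }
      }
    }
  }
}
```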

LM Jelinek Mercer similarity

LM Jelinek Mercer similarity. This similarity has the following options:

lambda

The optimal value depends on both the collection and the query: around 0.1 for title queries and around 0.7 for long queries. Defaults to 0.1.

Type name: LMJelinekMercer
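
For a field that is mostly searched with long queries, lambda could be raised toward the 0.7 figure mentioned above, as in this sketch. The name my_lm_jm is hypothetical, and the "settings" / "index" wrapper assumes an index-creation request body.

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_lm_jm" : {
          "type" : "LMJelinekMercer",
          "lambda" : "0.7"
        }
      }
    }
  }
}
```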

Default and Base Similarities

By default, Elasticsearch uses whatever similarity is configured as default. However, the similarity functions queryNorm() and coord() are not per-field. Consequently, expert users who want to change the implementation used for these two methods, without changing the default, can configure a similarity with the name base. That similarity is then used for the two methods.

You can change the default similarity for all fields like this:

index.similarity.default.type: BM25
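
The same idea can be written in JSON form. The sketch below assumes an index-creation request body and makes an illustrative choice of types: BM25 as the default for all fields, with the TF/IDF similarity (type default) registered under the name base so that it supplies queryNorm() and coord().

```json
{
  "settings" : {
    "index" : {
      "similarity" : {
        "default" : { "type" : "BM25" },
        "base" : { "type" : "default" }
      }
    }
  }
}
```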