Elasticsearch 0.90.4 released

The Elasticsearch dev team are pleased to announce the release of Elasticsearch 0.90.4, which is based on Lucene 4.4. You can download it here.

As usual, this release contains numerous optimizations and bug fixes, which you can read about in the release notes, but there are some exciting new features which warrant a special mention in this blog post:

Function score

Elasticsearch already had extensive support for custom manipulation of the relevance _score in the Query DSL, through three queries:

  • custom_score query: modify the relevance _score by running a script on fields in the document
  • custom_filters_score query: modify the _score of documents which match one or more filters using a boost value or script (which benefits from filter caching)
  • custom_boost_factor query: modify the _score by a multiplier, rather than using a `boost` value which gets normalized

In 0.90.4 we present the new function_score query which combines the functionality of the above three queries to make them more flexible and efficient, and adds support for predefined functions. These allow you to include recency, distance-from-a-point, popularity or even randomness in your scoring model.

Decay functions:

A common use case is boosting recently created documents over older ones. You can achieve this by applying a decay function to the timestamp field inside a function_score query. The scale parameter controls how quickly the recency effect decays. In the following example, blog posts from one month ago will score half as much as blog posts from today:

curl localhost:9200/blogposts/_search -d '
{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "elasticsearch"}
      },
      "gauss": {
        "timestamp": {
          "scale": "4w"
        }
      }
    }
  }
}'

The above query would, however, reduce the score of relevant-but-very-old docs to almost zero, because the Gaussian decay function (like a bell curve) outputs a number between 0 and 1, which is then multiplied with the query _score. If you want old-but-relevant docs to still be ranked highly, then a better recency calculation would output a multiplier between 1 and 2:

curl localhost:9200/blogposts/_search -d '
{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "elasticsearch"}
      },
      "functions": [
        { "boost":  1 },
        {
          "gauss": {
            "timestamp": {
              "scale": "4w"
            }
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}'

The score_mode tells the function_score query that the output from both functions should be added together (i.e. 1 + gauss()) before being multiplied with the _score from the query.
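
To make the arithmetic concrete, here is a rough Python sketch of the two scoring variants described above. It is not Elasticsearch's own code: it assumes the Gaussian decay has the form exp(-d²/(2σ²)) with σ² = -scale²/(2·ln(decay)), a default decay of 0.5, and no offset, which is consistent with the "half as much after one month" behaviour of the 4w scale. The function names are made up for illustration.

```python
import math

def gauss_decay(distance, scale, decay=0.5):
    """Assumed Gaussian decay: 1.0 at distance 0, `decay` at distance == scale."""
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

def multiplied(query_score, age_weeks, scale_weeks=4):
    """First example: the query _score is multiplied by gauss() directly."""
    return query_score * gauss_decay(age_weeks, scale_weeks)

def summed(query_score, age_weeks, scale_weeks=4):
    """Second example: score_mode "sum" gives _score * (1 + gauss())."""
    return query_score * (1.0 + gauss_decay(age_weeks, scale_weeks))

# Compare both variants for documents of increasing age (in weeks),
# all with the same hypothetical query _score of 1.0.
for age in (0, 4, 12, 52):
    print(age, round(multiplied(1.0, age), 4), round(summed(1.0, age), 4))
```

Under these assumptions the first variant drives a year-old document's score to practically zero, while the second never drops below the original query _score.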

These decay functions can be applied to any date, numeric or geo_point field. For instance, if a user searches for a hotel in Central London which costs less than £80 per night, and has received good ratings from other users, you may have a number of very highly rated hotels which almost match all of the criteria, but perhaps cost a bit more, or lie just outside Central London.

Instead of using filters, which would just exclude results which don’t match, you can use a decay function for each of these criteria:

curl localhost:9200/hotels/_search -d '
{
  "query": {
    "function_score": {
      "functions": [
        { "gauss":  { "price": { "origin": 0,      "scale": 40    }}},
        { "gauss":  { "loc":   { "origin": "51,0", "scale": "5km" }}},
        { "linear": { "stars": { "origin": 5,      "scale": 1     }}}
      ]
    }
  }
}'

The hotels which most closely match all of the criteria will receive the highest scores, but almost-matches will still appear in the results.
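
As a rough illustration of how the three functions might combine, the Python sketch below scores a few hypothetical hotels. It assumes that the function outputs are multiplied together, that decay defaults to 0.5, that the Gaussian has the form exp(-d²/(2σ²)) with σ² = -scale²/(2·ln(decay)), and that the linear decay is max(0, (s - d)/s) with s = scale/(1 - decay). The hotels and the precomputed distances in km are invented; check the documentation for the exact formulas.

```python
import math

def gauss_decay(distance, scale, decay=0.5):
    """Assumed Gaussian decay: 1.0 at the origin, `decay` at distance == scale."""
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

def linear_decay(distance, scale, decay=0.5):
    """Assumed linear decay: falls in a straight line and bottoms out at 0."""
    s = scale / (1.0 - decay)
    return max(0.0, (s - distance) / s)

def hotel_score(price, km_from_origin, stars):
    """Multiply the three decay outputs, mirroring the functions in the query."""
    return (gauss_decay(abs(price - 0), 40)        # price: origin 0, scale 40
            * gauss_decay(km_from_origin, 5)       # loc: scale 5km
            * linear_decay(abs(stars - 5), 1))     # stars: origin 5, scale 1

# Hypothetical hotels: (price per night, km from the origin point, star rating)
hotels = {
    "cheap, central, 4 stars": (60, 1.0, 4),
    "pricier, just outside, 5 stars": (95, 6.0, 5),
    "cheap, central, 2 stars": (50, 0.5, 2),
}
for name, attrs in sorted(hotels.items(), key=lambda kv: -hotel_score(*kv[1])):
    print(f"{name}: {hotel_score(*attrs):.4f}")
```

In this sketch the pricier hotel just outside the centre keeps a small non-zero score instead of being filtered out, while the linear function cuts the 2-star hotel to zero.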

The shape of the decay can be controlled by choosing the most appropriate function (Gaussian, exponential or linear), and the rate of decay can be controlled with the `scale` and `decay` parameters. See the documentation for more information.

Random sorting

Occasionally, users have the need to sort documents randomly. Perhaps you have sponsored listings that you want to include in your results, but you want to ensure that all sponsors get equal exposure. One way to do this would be to sort results randomly. However, you want a single user to see results in a consistent order, so that they won’t see duplicate results on pages 2 or 3.

This can now be achieved with the function_score query, using, for example, the timestamp of the user’s initial search as the random seed:

curl localhost:9200/sponsors/_search -d '
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch"}},
      "random_score": {
        "seed": 1379333621000
      }
    }
  }
}'

As long as you use the same seed for each search, results will be sorted consistently. This allows you to choose whether you want to randomise results differently for every new search that a user runs, or to keep the same sort order for a single user throughout their session.
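
One way an application might keep the ordering stable across pages is to store the seed once per session and reuse it for every page request. The Python sketch below only builds the request bodies; the helper name, page size, and session handling are assumptions for illustration, and sending the body to the cluster is left out.

```python
import json

def sponsored_search_body(seed, page, page_size=10):
    """Build a function_score request body for one page of a seeded random ordering."""
    return {
        "from": page * page_size,   # offset of the first hit on this page
        "size": page_size,
        "query": {
            "function_score": {
                "query": {"match": {"title": "elasticsearch"}},
                "random_score": {"seed": seed},
            }
        },
    }

# Hypothetical session: the seed is the timestamp of the user's first search
# and is reused unchanged for pages 1, 2 and 3.
session_seed = 1379333621000
for page in range(3):
    print(json.dumps(sponsored_search_body(session_seed, page)))
```

Picking a fresh seed for each new search instead would reshuffle the results every time.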

Disk-space aware shard allocation

The shard allocator can now take the amount of free space left on a disk into account when deciding where to allocate shards. This can be enabled in the config file or dynamically on a live cluster by setting cluster.routing.allocation.disk.threshold_enabled to true. When enabled, the allocator takes two watermarks into account:

The low watermark (cluster.routing.allocation.disk.watermark.low, default 0.7) prevents new shards from being allocated to nodes whose disk usage is greater than 70%. The high watermark (cluster.routing.allocation.disk.watermark.high, default 0.85) tells the allocator to actively try to reallocate shards when a node’s disk usage rises above 85%. Both watermarks can also use absolute values like "500mb".
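
The two thresholds can be pictured as a simple three-way decision per node. The Python sketch below is a deliberate simplification for illustration, not the allocator's actual logic: it handles only the fractional form of the watermarks, and the function name and return strings are invented.

```python
# Settings named in the text, with the defaults quoted above.
settings = {
    "cluster.routing.allocation.disk.threshold_enabled": True,
    "cluster.routing.allocation.disk.watermark.low": 0.7,
    "cluster.routing.allocation.disk.watermark.high": 0.85,
}

def disk_decision(used_fraction, low=0.7, high=0.85):
    """Classify a node by its disk usage (0.0 to 1.0) against both watermarks."""
    if used_fraction > high:
        return "relocate shards away"   # above the high watermark
    if used_fraction > low:
        return "no new shards"          # between the low and high watermarks
    return "can receive shards"         # below the low watermark

low = settings["cluster.routing.allocation.disk.watermark.low"]
high = settings["cluster.routing.allocation.disk.watermark.high"]
for usage in (0.50, 0.75, 0.90):
    print(f"{usage:.0%} used: {disk_decision(usage, low, high)}")
```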

Suggester improvements

The completion suggester made its first appearance in 0.90.3, and this release adds:

  • fuzzy completion, to allow matching of misspelled words #3465
  • completion stats, so that you can see how much memory is being used by each suggester #3522
  • reduced memory usage for payloads, which now accept scalar values instead of requiring JSON objects #3550

We’ve also added result highlighting to the phrase suggester, to highlight the parts of the phrase that the suggester has added or corrected. #3442

Other improvements

  • The clear scroll API allows you to expire old scrolls as soon as you are finished with them, freeing up resources more quickly #3657
  • The rest.action.multi.allow_explicit_index setting allows you to restrict the bulk, multi-get and multi-search APIs to just the indices used in the URL, making it easier to use a URL-based security approach #3636
  • Named queries have been added to the existing named filter support, so that you can tell which query or filter clauses a document matched #3581

We hope you enjoy this new release. Please download and use 0.90.4, and let us know what you think.