
Elasticsearch 1.1.0, 1.0.2 and 0.90.13 released


Today we are happy to announce the release of Elasticsearch 1.1.0, based on Lucene 4.7, along with the bug-fix releases Elasticsearch 1.0.2 and Elasticsearch 0.90.13.

You can download all three releases and read their full change lists on the downloads page.


New features in 1.1.0

Elasticsearch 1.1.0 is packed with new features: better multi-field search, search templates, and the ability to create aliases at index-creation time, either manually or via an index template. In particular, the new aggregations framework has enabled us to support more advanced analytics: the cardinality agg for counting unique values, the significant_terms agg for finding uncommonly common terms, and the percentiles agg for understanding the distribution of your data.

We will be blogging about all of these new features in more detail, but for now we’ll give you a taste of what each feature adds:


Better multi-field search

Up until now, the multi_match query was field-centric: it looked for all of the words in the query string in a single field, then combined the results for each field. While this is useful, often what we really need is a term-centric query which looks for each term in any of the fields. In other words, a query that treats multiple fields as if they were one big field.

The multi_match query now supports three types of execution:

best_fields

(field-centric, default) Find the field that best matches the query string. Useful for finding a single concept like “full text search” in either the title or the body field.

most_fields

(field-centric) Find all matching fields and add up their scores. Useful for matching against multi-fields, where the same text has been analyzed in different ways to improve the relevance score: with/without stemming, shingles, edge-ngrams etc.

cross_fields

(term-centric) A new execution mode which looks for each term in any of the listed fields. Useful for documents whose identifying features are spread across multiple fields, such as first_name and last_name. This mode also supports the minimum_should_match parameter in a more natural way than the other two modes.

The new cross_fields execution goes one step further and blends the term frequencies for each field to produce more relevant results. For instance, an unusual first name like “Smith” will no longer skew the results for a search like “John Smith”. Instead the term frequencies for first_name:Smith and last_name:Smith will be blended together as if they were one field.
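For example, running the “John Smith” search above as a term-centric query across both name fields might look something like this:

GET /_search
{
    "query": {
        "multi_match": {
            "query": "John Smith",
            "type": "cross_fields",
            "fields": [ "first_name", "last_name" ],
            "operator": "and"
        }
    }
}

With operator set to and, every term must appear in at least one of the listed fields, rather than all terms having to appear in a single field.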

See the multi_match query docs for more.


Search templates

Sometimes it can be useful to predefine a search request as a template, and allow the user to specify just the parameters that they are interested in. The new search template endpoint uses Mustache templates to allow you to do just that. Templates can either be specified in the request itself, or be stored on each node in the cluster and referred to by name. For instance, we could provide an interface for querying the title field with:

GET /_search/template
{
    "template": {
        "query": { "match": { "title": "{{search_terms}}" }},
        "size": "{{search_size}}"
    },
    "params": {
        "search_terms": "quick brown fox",
        "search_size": 10
    }
}
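To reuse a template across requests, it can instead be stored on each node and called by name. As a sketch, assuming we had saved the template section above in a file called query_title.mustache (a name of our own choosing) in each node's config/scripts directory, we could run:

GET /_search/template
{
    "template": "query_title",
    "params": {
        "search_terms": "quick brown fox",
        "search_size": 10
    }
}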

Mustache templates support conditional blocks and repeatable sections, which will allow you to implement complicated logic in your templates.

As well as templating a complete search request, you can also template just a portion of the query using the new template query, as in the sketch below.
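Here the templated match clause from the earlier example is embedded directly inside an otherwise normal search request:

GET /_search
{
    "query": {
        "template": {
            "query": { "match": { "title": "{{search_terms}}" }},
            "params": {
                "search_terms": "quick brown fox"
            }
        }
    }
}

For more information, see the search templates and template query documentation.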


Alias support in indices and templates

Often, when you create an index, you already know which alias (or aliases) you want to associate with it. Why should you have to assign those aliases in a second step? Now the create-index API accepts an aliases section alongside settings, mappings and warmers. Better still, index templates also support an aliases section, which will be a great help to our Logstash users: at midnight, not only can we automatically create a new index for the new day, but we can also add an alias for it automatically.
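For example, creating an index together with its aliases in a single call might look like this (the index and alias names are ours, for illustration):

PUT /logs-2014-03-25
{
    "settings": {
        "number_of_shards": 1
    },
    "aliases": {
        "logs-current": {}
    }
}

The same aliases section can be added to an index template, so that each day's new index picks up its aliases automatically.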

See the create index docs and the index templates docs for more.


cardinality aggregation

With aggregations we can count things like how many page requests a user has made, at scale and in real time. However, in a distributed environment, it is very difficult to then ask: “how many unique pages did each user look at?” With the new cardinality aggregation we can do just that. Counting unique values in a high cardinality field can consume a lot of memory, so instead of producing completely accurate counts, we allow you to trade precision for memory usage.

This aggregation uses the HyperLogLog++ algorithm to estimate unique counts: it counts low-cardinality fields with high accuracy, then switches over to estimation for high-cardinality fields. The precision_threshold parameter allows you to control exactly how much memory you want to allow the aggregation to use.
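Putting the two questions above together, a sketch of a per-user unique-page count (assuming documents with user and page fields) might look like this:

GET /_search
{
    "size": 0,
    "aggs": {
        "users": {
            "terms": { "field": "user" },
            "aggs": {
                "unique_pages": {
                    "cardinality": {
                        "field": "page",
                        "precision_threshold": 1000
                    }
                }
            }
        }
    }
}

Counts at or below the precision_threshold are expected to be close to exact; beyond it, they become estimates.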

See the cardinality aggregation docs for more.


significant_terms aggregation

With the terms aggregation, we could ask: “what are the popular words in the news today?” and get back results like “the”, “and” and “or”… but those are the popular words every day. What we are really interested in is: “what are the uncommonly common words in the news today?”

The significant_terms aggregation gives us exactly this information. It compares the term frequency for each term in our results against the background term frequency that we would expect for typical documents in the index, and finds the unusual terms: the ones that set our result set apart from the typical. A search for “bird flu” could tell us that “H5N1” is a significant term, while a search for “vampires” might suggest “dracula” and “van helsing”. This aggregation even allows us to build classifier models automatically from data that already exists in an index.
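As a sketch, finding the uncommonly common terms in news articles matching “bird flu” (the index and field names are ours, for illustration) might look like this:

GET /news/_search
{
    "size": 0,
    "query": { "match": { "content": "bird flu" }},
    "aggs": {
        "uncommonly_common": {
            "significant_terms": { "field": "content" }
        }
    }
}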

See the significant_terms aggregation docs for more.


percentiles aggregation

The percentiles aggregation allows us to estimate the distribution of our data and to find anomalies. It is less important to know that our average page load time is 50ms, and more important to realise that 2% of pages take 30 seconds to load! Once we know about the outliers, we have something to investigate.

As with the cardinality aggregation, calculating completely accurate percentiles over billions of records in a distributed environment in real time is very costly. This aggregation uses the TDigest algorithm to approximate percentiles, and the compression parameter allows you to trade accuracy for memory usage. You can specify which percentiles you are interested in with, for example, "percents": [95, 99, 99.9].
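For example, to look at the tail of page load times (assuming a load_time field), we might run:

GET /_search
{
    "size": 0,
    "aggs": {
        "load_time_outliers": {
            "percentiles": {
                "field": "load_time",
                "percents": [95, 99, 99.9],
                "compression": 100
            }
        }
    }
}

Higher compression values produce more accurate percentiles at the cost of more memory.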

See the percentiles aggregation docs for more information.


Conclusion

We hope you are as excited about these new features as we are. Please download Elasticsearch 1.1.0 and try them out. You can report any problems on the GitHub issues page.