1.0.0.Beta1 released

Today we are delighted to announce the release of Elasticsearch 1.0.0.Beta1, the first public release on the road to 1.0.0. The countdown has begun!

You can download Elasticsearch 1.0.0.Beta1 here.

In each beta release we will add one major new feature, giving you the chance to try it out, to break it, to figure out what is missing and to tell us about it. Your use cases, ideas and feedback are essential to making Elasticsearch awesome.

The main feature we are showcasing in this first beta is Distributed Percolation.

WARNING: This is a beta release - it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

Distributed percolation

For those of you who aren’t familiar with percolation, it is “search reversed”. Instead of running a query to find matching docs, percolation allows you to find queries which match a doc. Think of people registering alerts like: tell me when a newspaper publishes an article mentioning "Elasticsearch".

Percolation has been supported by Elasticsearch for a long time. In the pre-1.0 implementation, queries are stored in a special _percolator index which is replicated to all nodes, meaning that all queries exist on all nodes. The idea was to have the queries alongside the data.

But users are using it at a scale that we never expected, with hundreds of thousands of registered queries and high indexing rates. Having all queries on every node just doesn’t scale.

Enter Distributed Percolation.

In the new implementation, queries are registered under the special .percolator type within the same index as the data. This means that queries are distributed along with the data, and percolation can happen in a distributed manner across potentially all nodes in the cluster. It also means that an index can be made as big or small as required. The more nodes you have, the more percolation you can do.
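As a rough illustration of the new layout, the sketch below builds the two HTTP requests involved: one that registers a query under the .percolator type, and one that percolates a not-yet-indexed document against the same index. The index name "news", the type "article", the query id "es-alert" and the helper function names are all hypothetical, and the request shapes reflect our reading of the percolation docs for this beta, so they may change. Nothing is sent over the network.

```python
import json

# Hypothetical names, chosen only for illustration.
INDEX = "news"
DOC_TYPE = "article"


def register_query_request(query_id, query):
    """Build the request that stores a query under the .percolator type."""
    path = "/{0}/.percolator/{1}".format(INDEX, query_id)
    return "PUT", path, json.dumps({"query": query})


def percolate_request(doc):
    """Build the request that percolates a document that is not yet indexed."""
    path = "/{0}/{1}/_percolate".format(INDEX, DOC_TYPE)
    return "GET", path, json.dumps({"doc": doc})


if __name__ == "__main__":
    alert = {"match": {"body": "Elasticsearch"}}
    requests_to_send = (
        register_query_request("es-alert", alert),
        percolate_request({"body": "Elasticsearch 1.0.0.Beta1 released"}),
    )
    for method, path, body in requests_to_send:
        print("{0} {1}\n{2}\n".format(method, path, body))
```

Because the query lives in the same index as the documents it targets, both requests address "news"; there is no separate percolator index to keep in sync.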

We have removed the ability to percolate while indexing a document, as that didn’t scale either. But we’ve added a host of new features that make percolation awesome.

  • You can percolate a document that has not yet been indexed, or you can percolate an indexed document by specifying the index, type and id.
  • The multi-percolate API allows you to percolate docs in bulk, saving on network overhead.
  • Percolation queries can highlight the documents that they match, providing more context to why a particular query matched.
  • The percolation count API tells you how many queries match the current document.
  • Facet support allows you to facet on the metadata of matching queries.
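A few of the features listed above can be sketched in the same spirit. The helpers below build the path for percolating an already-indexed document, the path for the count variant, and a newline-delimited body for the multi-percolate API. The index and type names and the helper names are hypothetical, and the endpoint and body shapes are assumptions based on our reading of the percolation docs, so treat this as a sketch rather than a reference.

```python
import json

INDEX = "news"        # hypothetical index name
DOC_TYPE = "article"  # hypothetical type name


def percolate_existing_path(doc_id):
    """Path for percolating a document that is already indexed."""
    return "/{0}/{1}/{2}/_percolate".format(INDEX, DOC_TYPE, doc_id)


def count_path():
    """Path for the count variant, which reports only how many queries match."""
    return "/{0}/{1}/_percolate/count".format(INDEX, DOC_TYPE)


def multi_percolate_body(docs):
    """Build a bulk body: one header line followed by one doc line per document."""
    lines = []
    for doc in docs:
        header = {"percolate": {"index": INDEX, "type": DOC_TYPE}}
        lines.append(json.dumps(header, sort_keys=True))
        lines.append(json.dumps({"doc": doc}, sort_keys=True))
    # The body is newline-delimited and ends with a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(percolate_existing_path("1"))
    print(count_path())
    print(multi_percolate_body([{"body": "first"}, {"body": "second"}]))
```

Batching several documents into one multi-percolate body is what saves the network overhead: one round trip instead of one per document.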

Check out the distributed percolation docs here.

Doc Values

The most common problem that users have with Elasticsearch is with fielddata. In order to sort or facet on field values, those values need to be easily accessible. By far the fastest way of accessing field values is by loading them into memory, which is known as fielddata. String fields in particular can use large amounts of RAM and can cause slow garbage collection and even out of memory exceptions.

The other big new feature in this release is the addition of doc values. With this new functionality, field values can be stored on disk at index time, instead of in memory. It’s not as fast as holding fielddata in memory - facets take longer because they need to hit disk - but this is achieved with a fraction of the memory, meaning that you can now calculate facets, sort on fields or access field values in scripts for much larger datasets than was previously possible.

Doc values need to be set up in the field mapping at index time, and work on numeric, geo and not_analyzed string values, both single and multi-valued. They cannot be enabled on analyzed string fields as that would require a second analysis phase. This may change in the future.
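A mapping that opts fields in to doc values might look roughly like the one built below. The helper name, the type name "article" and the field names are hypothetical, and the "fielddata": {"format": "doc_values"} setting is our understanding of the syntax in this beta; please check the docvalues docs before relying on it.

```python
import json


def doc_values_mapping(doc_type, fields):
    """Return a mapping that requests doc values for each given field.

    `fields` maps a field name to its type, e.g. "string", "long" or "geo_point".
    """
    properties = {}
    for name, field_type in fields.items():
        # Assumed syntax: doc values are selected through the fielddata format.
        field = {"type": field_type, "fielddata": {"format": "doc_values"}}
        if field_type == "string":
            # Doc values are not available on analyzed strings.
            field["index"] = "not_analyzed"
        properties[name] = field
    return {doc_type: {"properties": properties}}


if __name__ == "__main__":
    mapping = doc_values_mapping("article", {"tag": "string", "views": "long"})
    print(json.dumps(mapping, indent=2, sort_keys=True))
```

Since the values are written at index time, a mapping like this has to be in place before documents are indexed; existing documents would need to be reindexed to pick it up.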

Check out the docvalues docs here.

Stopwords disabled by default

The standard analyzer, the default analyzer used by Elasticsearch, comes with stopwords enabled. Not only that, it uses English stopwords by default. It doesn't matter what language your text is in, or which full text field you're indexing, it'll have stopwords removed. This can cause confusion, like wondering why { "country_code": "no" } doesn't match anything, or why you can't find any plays called "To be or not to be". With modern hardware and queries like the common terms query, stopwords are less useful than they used to be.

This release sets the default stopwords list for the standard analyzer to empty, meaning that no stopwords are removed by default. You can still enable stopwords, but now you decide where they apply. For the sake of backwards compatibility, indices created with a previous version of Elasticsearch will continue to use the English stopwords list.
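To make the effect concrete, here is a toy model of the old and new defaults, followed by index settings that opt one analyzer back in to English stopwords. The toy_analyze function is only a stand-in (lowercase, split on whitespace) and not the real standard analyzer, the stopword list is the English list as we understand Lucene ships it, and the analyzer name "english_stop" is hypothetical.

```python
# English stopword list as we understand Lucene ships it (an assumption).
ENGLISH_STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def toy_analyze(text, stopwords=frozenset()):
    """Rough stand-in for analysis: lowercase, split on whitespace, drop stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]


# Index settings that re-enable English stopwords for a single analyzer.
# The analyzer name "english_stop" is hypothetical.
SETTINGS = {
    "analysis": {
        "analyzer": {
            "english_stop": {"type": "standard", "stopwords": "_english_"}
        }
    }
}


if __name__ == "__main__":
    title = "To be or not to be"
    print(toy_analyze(title, ENGLISH_STOPWORDS))  # old default: English stopwords
    print(toy_analyze(title))                     # new default: nothing removed
    print(toy_analyze("no", ENGLISH_STOPWORDS))   # the country code case
```

With the English list applied, every word in the play's title is a stopword, so nothing is left to match on; with the empty default, all six tokens survive.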

Still to come

There are three major new features still to come:

  • snapshot/restore for backing up the whole cluster or individual indices, see #3826
  • aggregations, or facets reborn, see #3300
  • the _cat api, for easy console-based insight into your cluster (documentation to follow)

Please download Elasticsearch 1.0.0.Beta1, try it out, and let us know what you think.