Elasticsearch 1.3.0 And 1.2.3 Released

Today, we are happy to announce the release of Elasticsearch 1.3.0, based on Lucene 4.9, along with a bugfix release of Elasticsearch 1.2.3.

You can download them and read the full changes list here:

Elasticsearch 1.3.0 is the latest stable release. It is chock full of new features, performance and stability improvements, and bugfixes.  We recommend upgrading, especially for users with high indexing or aggregation loads. The full change log is available in the Elasticsearch 1.3.0 release notes, but we will highlight the most important changes below:

Security

Elasticsearch previously allowed any request to return its response in a JSONP format.  While very useful, this meant that any web page you view could send requests to any Elasticsearch instance which you have access to.   JSONP is now disabled by default (BREAKING:#6795) but can be enabled if you so choose (#6164).

Scripting

In the 1.2 branch, we disabled dynamic scripting by default. This was a good decision from a security standpoint, but made it more difficult to use scripting with Elasticsearch.  This release adds a number of awesome scripting features which give you the best of both worlds:

  • MVEL has been the default scripting language since scripting was first added to Elasticsearch.  Unfortunately, the project hasn't had much love of late and we've seen an increasing number of bugs when it's used with more recent JVMs.  On top of that, there is no way to sandbox it. While it is still the default language in this release, MVEL is deprecated and will be removed in the next release. From 1.4 onwards, it will be available as a plugin only.
  • The new scripting language of choice is Groovy (#6571), which is an active and growing project, is more flexible than MVEL, performs significantly faster and can be sandboxed (#6233). This means that...
  • Dynamic scripting is enabled by default for sandboxed languages. You can now use Groovy scripts safely via the API, by specifying { "lang": "groovy" }.
  • We have added the sandboxed Lucene "expressions" library into core (#6819). It is a simpler, more limited option than Groovy, but has the advantage of being incredibly fast, outperforming even native Java scripts thanks to tight integration with Lucene data structures. Use { "lang": "expression" }.
  • Scripts can now be stored in the special .scripts index via the indexed scripts API (#5484) rather than having to store them in the config directory on every node.
  • Search templates are really just a special case of scripting (with Mustache as the "scripting" language), so you can now also store search templates in the .scripts index (#5921). (See preregistered templates).
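
As a rough illustration of the scripting options above, this Python sketch builds two request bodies: a search that evaluates an inline Groovy script, and the same search referring to a script stored in the .scripts index. The field name price, the script id double_price and the use of a script_id key are assumptions made for the example, so check them against the scripting documentation for your version.

```python
import json


def inline_groovy_search(field_name, script):
    """Search body with one script field evaluated by the sandboxed Groovy engine."""
    return {
        "query": {"match_all": {}},
        "script_fields": {
            field_name: {"script": script, "lang": "groovy"},
        },
    }


def indexed_script_search(field_name, script_id):
    """Same search, but referring to a stored script by id (assumed key: script_id)."""
    return {
        "query": {"match_all": {}},
        "script_fields": {
            field_name: {"script_id": script_id, "lang": "groovy"},
        },
    }


# Hypothetical field and script names, for illustration only.
inline_body = inline_groovy_search("double_price", "doc['price'].value * 2")
indexed_body = indexed_script_search("double_price", "double_price")

print(json.dumps(inline_body, indent=2))
print(json.dumps(indexed_body, indent=2))
```

Either body would be sent to a search endpoint. The stored variant assumes the script was saved beforehand through the indexed scripts API.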

Aggregations

Aggregations keep on getting better and better. We have added three new aggregations:

  • The top hits aggregation (#6124) resolves the most +1'ed issue in the history of Elasticsearch: Field collapsing/combining. You can now bucket on a common field like brand_name, and get the best matching documents for each bucket. Even highlighting is supported!
  • The percentile ranks aggregation (#6386) turns the percentiles aggregation on its head. While the percentiles agg will (for example) tell you the response time that you achieve on 95% of all web requests, the percentile ranks aggregation will tell you what percentage of responses are achieved within the specified time.
  • The geo bounds aggregation (#5634) is a useful addition for mapping.  It returns a geo-bounding box which encompasses all the geo-points in a bucket, allowing you to zoom to the appropriate level on the map.
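
To make the three new aggregations more concrete, here is a minimal Python sketch of a search body that combines them. Only brand_name comes from the text above. The response_time and location fields, the aggregation names and the threshold values are hypothetical.

```python
import json


def new_aggregations_body():
    """Search body using top_hits, geo_bounds and percentile_ranks together."""
    return {
        "size": 0,  # only the aggregation results are of interest here
        "aggs": {
            "by_brand": {
                "terms": {"field": "brand_name"},
                "aggs": {
                    # Best matching document in each brand bucket.
                    "best_match": {"top_hits": {"size": 1}},
                    # Bounding box around all geo-points in the bucket.
                    "viewport": {"geo_bounds": {"field": "location"}},
                },
            },
            # Percentage of responses achieved within 100 and 500 (hypothetical units).
            "within_time": {
                "percentile_ranks": {"field": "response_time", "values": [100, 500]},
            },
        },
    }


body = new_aggregations_body()
print(json.dumps(body, indent=2))
```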

Besides these new features, aggregation performance and memory usage have also received a lot of love:

  • The performance of terms aggregations on high cardinality fields has improved dramatically (#6518).
  • Hierarchical aggregations (terms, grouped by terms, grouped by terms... etc) can result in combinatorial explosion and runaway memory usage.  The terms aggregation now supports a collect_mode parameter which controls the order in which results are collected. The default depth_first order is usually the better option, but for deep hierarchies, especially on high cardinality fields, the breadth_first order can be much more efficient. You can now run more complex aggregations that previously would have been aborted due to a lack of resources.
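
The following Python sketch shows where collect_mode sits in a nested terms aggregation, together with a deliberately simplified worst-case bucket count for the two modes. The actors field and the cardinalities are hypothetical, and the counting model is an assumption for illustration, not a description of the actual implementation.

```python
import json


def nested_terms_body(field, size, collect_mode):
    """Terms grouped by terms on the same field, with an explicit collect_mode."""
    return {
        "aggs": {
            "top": {
                "terms": {"field": field, "size": size, "collect_mode": collect_mode},
                "aggs": {
                    "co_occurring": {"terms": {"field": field, "size": size}},
                },
            }
        }
    }


def worst_case_buckets(parent_buckets_kept, child_cardinality):
    """Simplified model: every kept parent bucket may hold every child term."""
    return parent_buckets_kept * child_cardinality


body = nested_terms_body("actors", 10, "breadth_first")
print(json.dumps(body, indent=2))

# depth_first: child buckets are built under every parent term before pruning.
depth_first = worst_case_buckets(10000, 10000)
# breadth_first: parents are pruned to the top 10 first, then children are built.
breadth_first = worst_case_buckets(10, 10000)
print(depth_first, breadth_first)
```

Under this model, pruning the parent level first cuts the worst case by a factor of a thousand, which is the effect the breadth_first mode is aiming for on high cardinality fields.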

Indexing

Elasticsearch is used for a range of very different use cases, such as large document search, centralised logging, and high performance analytics. Document size and indexing rate varies dramatically between these use cases, so it is important that Elasticsearch is able to adapt dynamically.  We are closer to an auto-regulating system than ever before, thanks to changes that have been made in recent releases and to the following:

  • Elasticsearch keeps track of the versions of recently indexed documents in memory. Previously, this data structure was only cleared when a flush happened, which meant it could become very large during heavy indexing.  Now, we clear the version map whenever we refresh (#6363), and if the version map becomes too large, it will trigger a refresh automatically (#6378). 
  • The translog flush threshold is no longer controlled by the number of changes in the log, but by the size of the log in bytes (#6783). This delivers improved performance for small documents, such as log messages, as it reduces the number of flushes that occur.
  • Looking up the version number for existing documents is now faster (#6212) and documents with auto-generated IDs are faster still (#6584).
  • The amount of RAM used for the indexing buffer is adjusted automatically during indexing.  If an index has not received updates for a while, the buffer size is reduced automatically. Previously, changing the size of the buffer could introduce long unexpected pauses while Elasticsearch reopened the index.  This is no longer necessary and the buffer size can now be changed without pausing (#6856).
  • If the _all field is enabled, but no other field uses index time boosting, a more efficient token stream is used (#6187).
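
To see why a size-based translog threshold helps with small documents, consider this back-of-the-envelope Python sketch. Both thresholds and the document size are hypothetical values chosen for illustration, not the defaults shipped with this release.

```python
def flushes_by_op_count(num_docs, ops_threshold):
    """Flushes triggered when the translog is limited by number of operations."""
    return num_docs // ops_threshold


def flushes_by_size(num_docs, doc_bytes, size_threshold_bytes):
    """Flushes triggered when the translog is limited by its size in bytes."""
    return (num_docs * doc_bytes) // size_threshold_bytes


# Hypothetical workload: one million log messages of roughly 200 bytes each.
NUM_DOCS = 1000000
DOC_BYTES = 200

op_based = flushes_by_op_count(NUM_DOCS, ops_threshold=5000)
size_based = flushes_by_size(NUM_DOCS, DOC_BYTES, size_threshold_bytes=200 * 1024 * 1024)

print(op_based, size_based)
```

With these numbers, the operation-count threshold forces 200 flushes, while the whole workload fits under the byte threshold without triggering one.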

At Elasticsearch we like good defaults.  We don't want our users to have to twiddle many confusing knobs in the hope that one particular combination might deliver what they need; it should just work. With the above changes, we have reduced the number of settings that you need to know about to just three:

  • Reduce the value of index.merge.scheduler.max_thread_count to 1 if you are using spinning disks.
  • Turn off store throttling if you are using SSDs or if you are doing heavy bulk indexing.
  • Increase the refresh_interval if you are doing heavy bulk indexing, or you are happy with your search results being refreshed less frequently than once every second. It is better to use a refresh interval like "30s" rather than disabling it completely with "-1".
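
The three settings above could be expressed as the following setting fragments, sketched here in Python. Where each one is applied (the config file, the index settings API or the cluster settings API) is left open, and the indices.store.throttle.type key with the value "none" is an assumption about how store throttling is turned off.

```python
import json


def spinning_disk_settings():
    """Limit merge threads to one for spinning disks."""
    return {"index.merge.scheduler.max_thread_count": 1}


def no_store_throttling_settings():
    """Turn off store throttling (assumed setting name and value)."""
    return {"indices.store.throttle.type": "none"}


def bulk_indexing_settings(refresh_interval="30s"):
    """Refresh less often during heavy bulk indexing, rather than disabling with -1."""
    return {"index.refresh_interval": refresh_interval}


settings = {}
settings.update(spinning_disk_settings())
settings.update(no_store_throttling_settings())
settings.update(bulk_indexing_settings())
print(json.dumps(settings, indent=2, sort_keys=True))
```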

Suggesters

We have a new in-house suggesters expert, so expect a number of improvements to suggesters in the near future. To start off with, we have added the much-requested ability to limit "did-you-mean" phrase suggestions to phrases that actually exist in the index (#3482).

Mapping

The new transform feature adds the ability to use scripting to transform the source document on-the-fly during indexing (#6599). This doesn't change the _source field that is stored on disk, but it changes how the _source field is indexed.
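
A mapping that uses a transform might look roughly like the body built by this Python sketch. The type name example, the title, content and suggest fields, and the Groovy script itself are all hypothetical.

```python
import json


def mapping_with_transform(type_name, script):
    """Mapping whose transform script runs on the source document before indexing."""
    return {
        type_name: {
            "transform": {"script": script, "lang": "groovy"},
            "properties": {
                "title": {"type": "string"},
                "content": {"type": "string"},
                "suggest": {"type": "string"},
            },
        }
    }


# Copy content into suggest for titles starting with "t"; the stored _source is unchanged.
script = (
    "if (ctx._source['title']?.startsWith('t')) "
    "ctx._source['suggest'] = ctx._source['content']"
)
mapping = mapping_with_transform("example", script)
print(json.dumps(mapping, indent=2))
```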

Mapping has also seen some significant performance improvements:

  • Adding many new fields to the mapping is exponentially faster than it used to be (#6707, #6648).
  • Analyzers (and the resources that they require) are now much more shareable (#6714), which greatly reduces the amount of memory needed to support many fields with the same analysis chain.  This improvement has been pushed back into Lucene so that other Lucene-based projects can benefit from it.
  • Similarly, analyzers for date fields are now shared, which reduces resource usage for mappings with many date fields (#6843).

Disk, files and I/O

The slowest component of any modern server is the disk.  Small improvements to I/O and disk usage can make a big difference to the performance of the system overall.  With this in mind, we have made the following changes:

  • Not all shards are equal. Some are bigger and some are smaller.  It is possible to end up in a situation where one node holds considerably more data than other nodes, which can impact performance.  The disk-based shard allocation decider has been available for a while, but we now feel comfortable with the new defaults (#6201) and have enabled it by default (#6200).  (Also: #6209 and #6196)
  • With version 1.0.0, we changed the default file system storage type from NioFS to MMap for better performance. Unfortunately, we have seen heavy users run out of MMap space because of memory fragmentation.  Not all file access patterns benefit from using MMap: it is very good for random access, but for sequential streaming NioFS performs just as well. With this release we introduce a new hybrid file system storage type (#6636) which gives us the best of both worlds.  Files that benefit from random access use MMap, while the other file types use NioFS.
  • The upgrade to Lucene 4.9 brings better compression for norms (field length normalization), especially for documents where most fields are of a similar length.  This compression not only improves disk usage but has a significant impact on memory usage too.
  • Elasticsearch has always written checksums for segments, but previously they were used only for comparison and not for verification.  Lucene has now added checksums for all files in a segment, so we have switched to using Lucene's code rather than our own (#5924) and we now verify segments when recovering a shard, relocating a shard, or restoring a shard from a snapshot.  More checksumming improvements to follow in later releases.
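
The difference between merely comparing checksums and verifying them can be illustrated with a small Python sketch. It appends a CRC32 footer to some bytes and recomputes it on read. This is a toy layout chosen for the example, not Lucene's actual file format.

```python
import struct
import zlib


def append_checksum(payload):
    """Return the payload followed by a 4-byte big-endian CRC32 footer."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)


def verify_checksum(stored):
    """Recompute the CRC32 of the payload and compare it with the footer."""
    if len(stored) < 4:
        return False
    payload, footer = stored[:-4], stored[-4:]
    (expected,) = struct.unpack(">I", footer)
    return (zlib.crc32(payload) & 0xFFFFFFFF) == expected


good = append_checksum(b"segment data")
# Flip one bit in the first byte to simulate corruption during a copy.
corrupted = bytes([good[0] ^ 1]) + good[1:]

print(verify_checksum(good), verify_checksum(corrupted))
```

Comparing two stored checksums only shows that two copies claim to be the same file. Recomputing the checksum from the bytes, as verify_checksum does, is what detects corruption that occurred during recovery, relocation or restore.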

Resiliency

With a complex asynchronous distributed system like Elasticsearch, there are always complicated corner cases which can impact the stability of a system.  You can read about the many low-level improvements and bugfixes that have been added in this release here. These resiliency improvements are part of an ongoing effort to make Elasticsearch rock solid.

Please download Elasticsearch 1.3.0, try it out, and let us know what you think on Twitter (@elasticsearch). You can report any problems on the GitHub issues page.