This Week in Elasticsearch and Apache Lucene - 2016-01-11
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Here’s your cookbook on how to deploy #Elasticsearch 2.0.0 with @chef. https://t.co/sE5IBrpdlE pic.twitter.com/HPV5JFKZpb
— elastic (@elastic) December 17, 2015
- Translog fixes:
  - Failure to create a translog writer could leak open translog readers. (backported to 2.0 and 2.1)
  - Tightened up the logic around deleting older translogs, which was causing recovery failures with missing translog files. (backported to 2.0 and 2.1)
  - Tragic event exceptions were not being caught in the unbuffered translog writer. (backported to 2.0 and 2.1)
  - The translog is now synced to disk after recovery from the primary to avoid losing data if the node shuts down. (backported to 2.0 and 2.1)
  - Translog flushes could be disabled if a recovery takes a long time, resulting in translogs growing unbounded. (backported to 2.0 and 2.1)
  - Don't attempt (and log a failure message) to remove a temporary checkpoint file when it is in use. (backported to 2.0 and 2.1)
- Index and bulk thread pool sizes are now limited to the number of available processors - any more than that is harmful. (backported to 2.1)
- A failed replication request could result in failing both the target and source shards, instead of just the target shard. (backported to 1.7, 2.0, 2.1)
- BitSetFilter caches were being duplicated, which could cause significant memory usage on clusters with many indices. (backported to 1.7, 2.0, 2.1)
- An exception in a cluster state task listener could prevent other tasks from being notified.
- The `missing` bucket on the terms agg didn't work with all execution modes. (backported to 2.1)
- Prevent replicas from being relocated to older nodes when the primary is already on a newer node. (backported to 1.7, 2.0, 2.1)
- Restored camelCase variants of analysis components for backwards compatibility. (backported to 2.0 and 2.1)
- Exceptions from scheduled-once tasks were being swallowed silently, but are now logged. (backported to 1.7, 2.0, 2.1)
- Better exception when nodes without the licensing plugin installed attempt to join the cluster.
- The get-field-mapping API could use a lot of extra memory when used with many fields. (backported to 1.7, 2.0, 2.1)
- The bucket selector script was not being applied to empty buckets.
- A number of performance improvements have been made to the BalancedShardsAllocator.
- The Warmers API has been removed as it is no longer needed.
- Translog now always uses a buffered stream which is managed automatically by Java, removing the need for read/write locks.
- Query terms in percolator queries are now indexed so that the percolator only needs to check queries that have a chance of matching.
- The Task Management framework has been merged. Next steps here.
- Cluster settings are now applied atomically, can be unset/reset to the default value, and support strict validation.
- Geo-point and boolean fields now support multi-fields.
- TF/IDF similarity has been renamed from `default` to `classic` to make way for the new Lucene 6 default: BM25
- Replica shards must be failed before primary shards when the primary shard fails.
- Recovery threadpools and throttling have been greatly simplified.
- Segments are now always written as compound files when flushed.
- Node ingest:
- Azure repository settings can be configured globally, and now support timeouts.
- Starting a proof of concept to incorporate sequence numbers into the translog, with a simplified approach that doesn't rely on ref counting.
- Pull-based bulk processor, which can be used by the reindex API.
- Shard failure reporting should wait for a new master to be elected.
- Adding an option to the reroute API to force the assignment of a stale primary shard.
- Reindexing will soon be a one-step action.
- Discussions have started for the next major (6.0.0) Lucene release
- Lucene's GitHub mirror from Subversion will stop working any day now, but work is well underway to convert our source control to Git, and we also have a workaround script in the meantime
- Here's a cool visualization of Lucene and Solr's source code history
- The confusion matrix in the classifier module can now give you its overall precision and recall
- Dimensional values fields now report their per-dimension global min and max value
- The JavaScript compiler in the expressions module now throws `IllegalArgumentException` if you use an unrecognized function name, or pass the wrong number of arguments to a function
- Some small improvements to the code generated by the expressions module
- `NRTCachingDirectory` failed to implement the new `createTempOutput` method, and yours truly broke the build by accidentally suppressing one test case, leading to failing the build if you override a method without an
- `CustomAnalyzer` now gives you compile-time safety when defining its components, by accepting factory classes instead of string SPI names
- Improve exception handlers in analyzer factories
- Fix possible resource leak in
- A tricky test failure shows how difficult geo intersection APIs are to get right, and the k-d tree is only as good as the geo APIs it invokes
- `GeoPointInPolygonQuery` now uses a point-orientation-based line-crossing algorithm to test whether a point is inside the polygon
- `GeoUtils` now properly handles rectangles too close to the earth's poles
- Some sizable speedups in scoring
- Fix highlighter when multiple adjacent stop words appear
- Prune some unused
- `UninvertingReader`, which fakes doc values by the slow uninversion of postings, was incorrectly hiding some
- Factor separate tests from XML query parser's
- `SortField.equals` now takes the missing value into account, and we now hide the `SortField`'s missing value behind a getter
- `BlendedInfixSuggester` now has a scoring mode that even more strongly favors matches near the start of each suggestion
- `BooleanQuery` should not create bulk scorers only to throw them away
- Codec level encryption offers fine-grained control over which parts of the index need encryption
- A new LSH (locality sensitive hashing) `TokenFilter` and query is an alternative to the standard
- You can pass all tests, only to see CI sometimes fail thanks to randomized testing!
- Can we remove
decompounding messes up
- The release smoke tester is confused when versions greater than the one you are now testing have already been released
- `MoreLikeThisQuery` should keep track of which terms came from which fields
- `FilterLeafReader` should be abstract
- It's very dangerous to use `MMapDirectory` and incorrectly close your `IndexReader` while searches are still in flight
- More progress on the challenging change to push retrying of file deletion down under the `Directory` abstraction, instead of making it the caller's job
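
For readers curious about the line-crossing test mentioned in the GeoPointInPolygonQuery item above, here is a minimal sketch of the general idea in plain Java. It is not Lucene's implementation: the class and method names (`PointInPolygonSketch`, `orient`, `contains`) are made up for illustration, it works on flat x/y coordinates rather than latitude/longitude, and it ignores date-line and pole handling entirely. The idea is to count how many polygon edges a ray from the point towards +x crosses, using the orientation (sign of a cross product) of the point relative to each edge instead of computing the intersection explicitly.

```java
// Hypothetical illustration only; not taken from Lucene.
final class PointInPolygonSketch {

    // Twice the signed area of triangle (a, b, p).
    // Positive when p lies to the left of the directed edge a -> b.
    static double orient(double ax, double ay, double bx, double by,
                         double px, double py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // Even-odd rule: the point is inside when a ray towards +x
    // crosses the polygon boundary an odd number of times.
    // Points exactly on an edge are reported as outside in this sketch.
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        int n = xs.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            double ax = xs[j], ay = ys[j];
            double bx = xs[i], by = ys[i];
            // Does the edge straddle the horizontal line y = py?
            if ((ay <= py) != (by <= py)) {
                double o = orient(ax, ay, bx, by, px, py);
                boolean upward = by > ay;
                // Upward edge: crossing is to the right of p when p is left of it.
                // Downward edge: crossing is to the right of p when p is right of it.
                if (o != 0 && upward == (o > 0)) {
                    inside = !inside;
                }
            }
        }
        return inside;
    }
}
```

With the square whose corners are (0,0), (4,0), (4,4) and (0,4), calling `contains` for the point (2,2) reports inside, while (5,2) and (-1,2) report outside. Avoiding the division that an explicit intersection would need is one reason an orientation test is attractive, although how Lucene handles boundary and spherical cases is not covered here.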
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!