This Week in Elasticsearch and Apache Lucene - 2016-01-11
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Here’s your cookbook on how to deploy #Elasticsearch 2.0.0 with @chef. https://t.co/sE5IBrpdlE pic.twitter.com/HPV5JFKZpb
— elastic (@elastic) December 17, 2015
Elasticsearch Core
- Translog fixes:
  - Failure to create a translog writer could leak open translog readers. (backported to 2.0 and 2.1)
  - Tightened up the logic around deleting older translogs which was causing recovery failure with missing translog files. (backported to 2.0 and 2.1)
  - Tragic event exceptions were not being caught in the unbuffered translog writer. (backported to 2.0 and 2.1)
  - The translog is now synced to disk after recovery from the primary to avoid losing data if the node shuts down. (backported to 2.0 and 2.1)
  - Translog flushes could be disabled if a recovery takes a long time, resulting in translogs growing unbounded. (backported to 2.0 and 2.1)
  - Don't attempt (and log a failure message) to remove a temporary checkpoint file when it is in use. (backported to 2.0 and 2.1)
- Index and bulk thread pool sizes are now limited to the number of available processors - any more than that is harmful. (backported to 2.1)
- A failed replication request could result in failing both the target and source shards, instead of just the target shard. (backported to 1.7, 2.0, 2.1)
- BitSetFilter caches were being duplicated, which could cause significant memory usage on clusters with many indices. (backported to 1.7, 2.0, 2.1)
- An exception in a cluster state task listener could prevent other tasks from being notified.
- The `missing` bucket on the terms agg didn't work with all execution modes. (backported to 2.1)
- Prevent replicas from being relocated to older nodes when the primary is already on a newer node. (backported to 1.7, 2.0, 2.1)
- Restored camelCase variants of analysis components for backwards compatibility. (backported to 2.0 and 2.1)
- Exceptions from scheduled-once tasks were being swallowed silently, but are now logged. (backported to 1.7, 2.0, 2.1)
- Better exception when nodes without the licensing plugin installed attempt to join the cluster.
- The get-field-mapping API could use a lot of extra memory when used with many fields. (backported to 1.7, 2.0, 2.1)
- The bucket selector script was not being applied to empty buckets.
- A number of performance improvements have been made to the BalancedShardsAllocator.
- The Warmers API has been removed as it is no longer needed.
- Translog now always uses a buffered stream which is managed automatically by Java, removing the need for read/write locks.
- Query terms in percolator queries are now indexed so that the percolator only needs to check queries that have a chance of matching.
- The Task Management framework has been merged. Next steps here.
- Cluster settings are now applied atomically, can be unset/reset to the default value, and support strict validation.
- Geo-point and boolean fields now support multi-fields.
- TF/IDF similarity has been renamed from `default` to `classic` to make way for the new Lucene 6 default: BM25
- Replica shards must be failed before primary shards when the primary shard fails.
- Recovery threadpools and throttling have been greatly simplified.
- Segments are now always written as compound files when flushed.
- Node ingest:
  - Node ingest has now been merged into core as a module.
  - The simulate API no longer requires index and type.
  - An on_failure handler can be specified for each processor or for a whole pipeline.
  - All processors now live in a single package: `org.elasticsearch.ingest.processor`
- Azure repository settings can be configured globally, and now support timeouts.
- Starting a proof of concept to incorporate sequence numbers into the translog, using a simplified approach that doesn't rely on ref counting.
- Pull-based bulk processor, which can be used by the reindex API.
- Shard failure reporting should wait for a new master to be elected.
- Adding an option to the reroute API to force the assignment of a stale primary shard.
- Reindexing will soon be a one-step action.
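
To picture the percolator change listed above (indexing query terms so that only queries with a chance of matching are checked), here is a minimal sketch in Java. It is an illustration under simplifying assumptions, not Elasticsearch's implementation: the class `PercolatorSketch`, its methods, and the model of a query as "a set of terms that must all appear" are all hypothetical.

```java
import java.util.*;

// Hypothetical illustration only; this is NOT Elasticsearch's percolator code.
class PercolatorSketch {
    // Inverted index: term -> ids of the stored queries that mention that term.
    private final Map<String, Set<String>> termToQueryIds = new HashMap<>();
    // Assumption: each stored query is just a set of terms that must all be present.
    private final Map<String, Set<String>> queries = new HashMap<>();

    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>(Arrays.asList(requiredTerms));
        queries.put(queryId, terms);
        for (String term : terms) {
            termToQueryIds.computeIfAbsent(term, t -> new HashSet<>()).add(queryId);
        }
    }

    // Cheap pre-selection: queries that share at least one term with the document.
    Set<String> candidates(Set<String> docTerms) {
        Set<String> result = new TreeSet<>();
        for (String term : docTerms) {
            result.addAll(termToQueryIds.getOrDefault(term, Collections.<String>emptySet()));
        }
        return result;
    }

    // The full (more expensive) check runs only against the candidates.
    Set<String> percolate(Set<String> docTerms) {
        Set<String> matches = new TreeSet<>();
        for (String id : candidates(docTerms)) {
            if (docTerms.containsAll(queries.get(id))) {
                matches.add(id);
            }
        }
        return matches;
    }
}
```

With many registered queries, a document containing only a handful of terms touches just the queries that mention those terms, instead of every stored query.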
Apache Lucene
- Discussions have started for the next major (6.0.0) Lucene release
- Lucene's github mirror from subversion will stop working any day now, but work is well underway to convert our source control to git, and we also have a workaround script in the meantime
- Here's a cool visualization of Lucene and Solr's source code history
- The confusion matrix in the classifier module can now give you its overall precision and recall
- Dimensional values fields now report their per-dimension global min and max value
- The JavaScript compiler in the expressions module now throws `ParseException` instead of `IllegalArgumentException` if you use an unrecognized function name, or pass the wrong number of arguments to a function
- Some small improvements to the code generated by the expressions module
- `NRTCachingDirectory` failed to implement the new `createTempOutput` method, and yours truly broke the build by accidentally suppressing one test case, leading to failing the build if you override a method without an `@Override` annotation
- `CustomAnalyzer` now gives you compile time safety when defining its components, by accepting factory classes instead of string SPI names
- Improve exception handlers in analyzer factories
- Fix possible resource leak in `SynonymFilterFactory`
- A tricky test failure shows how difficult geo intersection APIs are to get right, and the k-d tree is only as good as the geo APIs it invokes
- `GeoPointInPolygonQuery` now uses a point orientation based line crossing algorithm to test whether a point is inside the polygon
- Improve `GeoUtils` to properly handle rectangles too close to the earth's poles
- Some sizable speedups in scoring `MUST_NOT` clauses
- Fix highlighter when multiple adjacent stop words appear
- Prune some unused `test-framework` dependencies
- `UninvertingReader`, which fakes doc values by the slow uninversion of postings, was incorrectly hiding some `FieldInfos` properties
- Factor separate tests from XML query parser's `TestParser`
- `SortField.equals` now takes the missing value into account, and we now hide the `SortField`'s missing value behind a getter
- `BlendedInfixSuggester` now has a scoring mode that even more strongly favors matches near the start of each suggestion
- `BooleanQuery` should not create bulk scorers only to throw them away
- Codec level encryption offers fine-grained control over which parts of the index need encryption
- A new LSH (locality sensitive hashing) `TokenFilter` and query is an alternative to the standard `MoreLikeThisQuery`
- You can pass all tests, only to see CI sometimes fail thanks to randomized testing!
- Can we remove `ToParentBlockJoinCollector`?
- `JapaneseAnalyzer`'s decompounding messes up `PhraseQuery` matching
- The release smoke tester is confused when versions greater than the one you are now testing have already been released
- `MoreLikeThisQuery` should keep track of which terms came from which fields
- `FilterLeafReader` should be abstract
- It's very dangerous to use `MMapDirectory` and incorrectly close your `IndexReader` while searches are still in flight
- More progress on the challenging change to push retrying of file deletion down under the `Directory` abstraction, instead of making it the caller's job
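
For readers wondering what "overall precision and recall" of a confusion matrix can mean, here is a small Java sketch. It assumes one common definition (macro averaging: compute precision and recall per class, then average over classes) and a `counts[actual][predicted]` layout. The class `ConfusionMatrixSketch` is hypothetical and does not use or reproduce the Lucene classification module's API, which may define the overall figures differently.

```java
// Hypothetical illustration; not the Lucene classification module's API.
class ConfusionMatrixSketch {
    // Precision for one class: correct predictions of cls / all predictions of cls.
    static double precision(long[][] counts, int cls) {
        long predicted = 0;
        for (long[] row : counts) {
            predicted += row[cls];
        }
        return predicted == 0 ? 0.0 : (double) counts[cls][cls] / predicted;
    }

    // Recall for one class: correct predictions of cls / all actual instances of cls.
    static double recall(long[][] counts, int cls) {
        long actual = 0;
        for (long c : counts[cls]) {
            actual += c;
        }
        return actual == 0 ? 0.0 : (double) counts[cls][cls] / actual;
    }

    // Assumption: "overall" means the unweighted mean across classes (macro average).
    static double macroPrecision(long[][] counts) {
        double sum = 0;
        for (int c = 0; c < counts.length; c++) {
            sum += precision(counts, c);
        }
        return sum / counts.length;
    }

    static double macroRecall(long[][] counts) {
        double sum = 0;
        for (int c = 0; c < counts.length; c++) {
            sum += recall(counts, c);
        }
        return sum / counts.length;
    }
}
```

For a two-class matrix `{{8, 2}, {1, 9}}`, the per-class recalls are 0.8 and 0.9, so the macro-averaged recall is 0.85.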
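
The point-in-polygon item above mentions an orientation based line crossing test. As a rough idea of how such a test can work, the Java sketch below counts how many polygon edges a horizontal ray from the point crosses, using the sign of a cross product to decide on which side of each edge the point lies. It is a planar simplification under stated assumptions (no dateline or pole handling, boundary points left unspecified); the class `PointInPolygonSketch` is hypothetical and is not the code used by `GeoPointInPolygonQuery`.

```java
// Hypothetical planar illustration; not Lucene's GeoPointInPolygonQuery code.
class PointInPolygonSketch {
    // Cross product (b - a) x (p - a): positive when p lies to the left of the directed edge a -> b.
    static double orient(double ax, double ay, double bx, double by, double px, double py) {
        return (bx - ax) * (py - ay) - (px - ax) * (by - ay);
    }

    // xs/ys hold the polygon vertices in order; the last vertex connects back to the first.
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        int n = xs.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            double ax = xs[j], ay = ys[j], bx = xs[i], by = ys[i];
            // Only edges that straddle the horizontal line y = py can be crossed by the ray.
            if ((ay > py) != (by > py)) {
                double o = orient(ax, ay, bx, by, px, py);
                // A ray toward +x crosses an upward edge when p is on its left,
                // and a downward edge when p is on its right.
                if ((by > ay) == (o > 0)) {
                    inside = !inside;
                }
            }
        }
        return inside;
    }
}
```

An odd number of crossings means the point is inside. Using the orientation sign avoids computing the intersection coordinate with a division, which is one reason this style of test is attractive.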
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!