This Week in Elasticsearch and Apache Lucene - 2016-06-06
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Supporting #Elasticsearch customers is encoded into our DNA. @mcmesser explains how https://t.co/GJDN2jGRee pic.twitter.com/Lu80GbJ01F
— elastic (@elastic) June 2, 2016
Elasticsearch Core
Changes in master:
- Upgraded to Lucene 6.0.1.
- Legacy IP fields and the new point-based IP fields can now be used together at query time.
- Painless and Lucene Expressions gained an improved Date API consistent with other engines, and Painless now has compile-time exceptions to match its run-time exceptions.
- Ingest processors support an ignore_failure option.
- Mappings in dynamic templates are now validated when templates are created or updated.
- The new multi-field matrix stats aggregation can compute correlations between fields plus more stats.
- Thanks to Index UUIDs, index deletion no longer needs a specialised acknowledgement mechanism but can use the standard cluster state acknowledgement.
- The `bootstrap.mlockall` setting has been renamed to `bootstrap.memory_lock`.
- Elasticsearch can no longer be run as root.
- Recovery throttling during replica shard relocation was counting recoveries against the replica instead of the primary; this has been fixed.
- Segment stats no longer report index_writer_max_memory as it no longer has meaning.
- The shrink API should check that there is enough disk space on the shrink node before starting, and should keep the target index on the shrink node until shrinking has completed.
- Snapshot UUIDs will enable more robust snapshot deletions.
- Windows does a better job of detecting the installed JVM.
- RoutingNodes continues to receive cleanups.
- AggregatorBuilder and PipelineAggregatorBuilder do not need generics.
- Empty query bodies are handled at parse time.
- Scheduled pings should start after the transport starts.
- Clearer exceptions are thrown when a node tries to join the wrong cluster or is of the wrong version.
- Missing `cloud.node.auto_attributes` setting has been added.
- PageCacheRecycler is just an implementation detail of BigArrays.
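
To make the "correlations between fields" item above concrete, here is a minimal sketch of a pairwise correlation matrix over several numeric fields. It assumes Pearson correlation and uses hypothetical field names (`income`, `poverty`) and sample values; it illustrates the statistic only and is not the aggregation's implementation or its request syntax.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(fields):
    """Map of field name -> values in, nested dict of pairwise correlations out."""
    names = sorted(fields)
    return {a: {b: pearson(fields[a], fields[b]) for b in names} for a in names}


# Hypothetical per-document values for two numeric fields.
fields = {"income": [30, 40, 50, 60], "poverty": [20, 15, 10, 5]}
matrix = correlation_matrix(fields)
```

In this sample data `poverty` falls linearly as `income` rises, so the off-diagonal entries of `matrix` come out strongly negative while each field correlates perfectly with itself.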
Ongoing:
- A Java HTTP REST client is coming soon.
- The index shrink API will be able to shrink to N shards, instead of just 1.
- The alias rollover API will roll over to a new index when conditions like max_docs or max_age are met.
- The task list API will be able to fetch the status of finished tasks from the .results index.
- The reindex API can pull data from a remote cluster of a different major version of Elasticsearch.
- Reindex and update-by-query will support document deletion.
- CRUD requests can be made to block until their changes are visible to search.
- Creating an index should not turn the cluster red.
- _msearch should limit the number of parallel requests that are run to avoid overwhelming search queues.
- Ingest processors are moving to a module to make development easier.
- The data directory should not use the cluster name as the first directory, but this behaviour will continue to be supported during 5.x.
- The percolator shouldn't bother running queries that are guaranteed to match.
- The scroll API will be able to partition a scroll into multiple slices for parallel processing.
- The function_score query will gain a `query_score` function for scriptable query-based scoring.
- The plugin manager gains a progress bar for monitoring downloads.
- Failing to rewrite a query should not leak SearchContexts.
- Thread pool settings are now node level, not cluster level, and plugins can register their own thread pools.
- Using a closed transport client should throw a meaningful exception.
- _msearch will report the HTTP status of each search request.
- ConstructingObjectParser gains optional constructor arguments.
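
The idea behind partitioning a scroll into slices, mentioned in the list above, can be sketched as follows: each document is assigned to exactly one of N slices, so N workers can each consume one slice in parallel without overlap. This is an illustration only. The CRC32-modulo assignment, the `slice_for` and `partition` helpers, and the document IDs are all assumptions, not the function or API Elasticsearch uses.

```python
import zlib


def slice_for(doc_id, max_slices):
    """Deterministically map a document ID to a slice number in [0, max_slices)."""
    if max_slices < 1:
        raise ValueError("max_slices must be at least 1")
    # CRC32 is used here only because it is stable across runs (assumption).
    return zlib.crc32(doc_id.encode("utf-8")) % max_slices


def partition(doc_ids, max_slices):
    """Split document IDs into max_slices disjoint lists."""
    slices = [[] for _ in range(max_slices)]
    for doc_id in doc_ids:
        slices[slice_for(doc_id, max_slices)].append(doc_id)
    return slices


# Hypothetical document IDs; each slice could be handed to its own worker.
doc_ids = [f"doc-{i}" for i in range(100)]
slices = partition(doc_ids, 4)
```

Because the assignment depends only on the document ID and the slice count, every worker can compute its own slice membership independently, with no coordination between workers.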
Apache Lucene
- 6.1.0 release is coming soon
- 6.0.1 bug fix release is out!
- The sort policeman strikes again: indexing dimensional points gets a nice speedup (see annotation O) by switching to radix sort
- 2D polygon construction is faster now
- Properly quantizing double x/y/z values for `geo3d` to sidestep quantization effects that confound testing is not easy
- `BooleanQuery` has a tricky bug that sometimes assigns the wrong scores to hits
- `InetAddressPoint` and `LatLonPoint` are ready to graduate out of sandbox
- `DocValuesDocIdSet`'s ref count dropped to zero and it will now be removed
- Should `MatchNoDocsQuery` include a reason for its creation?
- `TermQuery` could delay term lookups until they are truly needed but the change adds non-trivial code complexity
- Can we use doc values instead of a heap-resident bitset to implement block-joins?
- Some queries already create whole bitsets so it would be nice if the query cache could store them directly instead of inefficiently cloning them
- The new point queries were missing some getters
- Jenkins found some fun test bugs with Lucene's new half floats
- A tricky span queries test failure came down to using the right merge policy to preserve docID order
- `IndexWriter` sometimes applies multiple doc values updates to the same document in the wrong order
- Our custom javadocs linter, to catch broken links and other silly problems in our javadocs, is fragile
- A test bug in `MoreLikeThisTest` remains tricky to fix
- Geo3D's test leniency gets twice as large
- Test randomization caused a scary looking yet actually harmless test failure in the new `HardLinkCopyDirectoryWrapper`
- A new Ukrainian lemmatizer leads to discussions about how it differs from the existing hunspell-based tokenizer
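
For readers unfamiliar with radix sort, mentioned in the points-indexing item above, here is a generic sketch of the technique. It is a least-significant-byte-first radix sort over non-negative integers. The source does not say which variant Lucene adopted, and Lucene's own code is Java operating on encoded point values, so treat the function name and the 32-bit key width as assumptions for illustration.

```python
def radix_sort(values, key_bits=32):
    """Sort non-negative integers below 2**key_bits, one byte per pass (LSD)."""
    out = list(values)
    limit = 1 << key_bits
    if any(v < 0 or v >= limit for v in out):
        raise ValueError("values must be in [0, 2**key_bits)")
    for shift in range(0, key_bits, 8):
        # Distribute by the current byte; appending keeps each pass stable,
        # which is what lets earlier (lower-byte) ordering survive later passes.
        buckets = [[] for _ in range(256)]
        for v in out:
            buckets[(v >> shift) & 0xFF].append(v)
        out = [v for bucket in buckets for v in bucket]
    return out


sorted_values = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```

Unlike comparison sorts, the work here is proportional to the number of values times the number of key bytes, which is why it can pay off when sorting large batches of fixed-width keys.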
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!