This Week in Elasticsearch and Apache Lucene - 2016-05-23
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Unwittingly using deprecated features in #elasticsearch? Maybe not for much longer
— Chris Earle (@pickypg) May 21, 2016
Elasticsearch Core
Changes in 2.x:
- New child types can now be added to existing parent types.
- Elasticsearch no longer returns the decoded path with an error.
Changes in master:
- The profiler has been refactored to extend profiling beyond just queries.
- Improvements to the Painless scripting language keep flowing.
- Translog checkpointing and fsyncing have been moved outside the index writer's global lock, allowing for more concurrency.
- The ingest node gained a Sort processor.
- Added Google Cloud Storage snapshot/restore repository plugin.
- The reindex API learned to back off and retry when it encounters search failures, and the default batch size has been increased to 1000.
- Delete-by-query is back in core, reimplemented using the reindex infrastructure.
- File system I/O statistics are available again on Linux.
- Added logging for small but frequent garbage collections, which indicate memory pressure.
- Command line settings no longer use the `es.` prefix.
- The percolator cache has been removed, saving a large amount of heap. Performance is still better than before thanks to query indexing. Added support for MatchNoDocs query in the percolator.
- Terms and significant terms aggs now support string include/exclude for IP addresses and dates.
- Registered missing `indices.query.bool.max_clause_count` setting.
- Fuzzy, regexp, prefix, and wildcard queries are only supported on text/keyword fields.
- Elasticsearch heap size defaults to 256MB min and 2GB max.
- Failing shard allocations now give up after 5 attempts, instead of looping forever.
- The Debian package no longer tries to create the data dir, as this is already handled by Elasticsearch.
- Filter the -server flag from the JVM options file when installing a service on Windows.
- Compilation on Java 9 works again.
- Fixed time unit rounding for hours, minutes, and seconds to cope with DST.
- Named queries are now significantly faster than before when used with expensive queries.
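The reindex item above mentions backing off and retrying after search failures. As a rough illustration of that general pattern only, and not of Elasticsearch's actual implementation, exponential backoff might be sketched in Java as follows. The class `BackoffRetry`, its method names, and the delay values are all hypothetical.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper illustrating retry with exponential backoff (not Elasticsearch code). */
final class BackoffRetry {
    private BackoffRetry() {}

    /**
     * Delay to wait before retry number {@code attempt} (0-based):
     * the initial delay doubled once per previous attempt, capped at the maximum.
     */
    static long delayMillis(int attempt, long initialDelayMillis, long maxDelayMillis) {
        if (attempt < 0 || initialDelayMillis <= 0 || maxDelayMillis < initialDelayMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = initialDelayMillis;
        for (int i = 0; i < attempt && delay < maxDelayMillis; i++) {
            // Jump straight to the cap when doubling would reach or pass it (also avoids overflow).
            delay = delay > maxDelayMillis / 2 ? maxDelayMillis : delay * 2;
        }
        return Math.min(delay, maxDelayMillis);
    }

    /** Runs the task, sleeping with a growing delay between failed attempts. */
    static <T> T call(Callable<T> task, int maxRetries, long initialDelayMillis, long maxDelayMillis)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                throw e; // never retry an interruption
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                Thread.sleep(delayMillis(attempt, initialDelayMillis, maxDelayMillis));
            }
        }
    }
}
```

With an initial delay of 100 ms and a cap of 1000 ms, the waits in this sketch grow as 100, 200, 400, 800, and then stay at 1000.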
Ongoing:
- Persistent node IDs will be replaced by node names, which will have to be unique across the cluster.
- Creating an index shouldn't turn the cluster red.
- Task management should persist the status of long running tasks after they have finished.
- Added dedicated masters to the tests - once stable, will add replication tests that use IndexShards without relying on nodes.
- Highlighting doesn't play well with GeoPointInBBoxQuery.
- Version lookups during indexing are expensive. Is there a low-risk optimisation that could skip the ID lookup if Lucene does the check instead of Elasticsearch?
- Use of deprecated features should inform the clients so they can warn.
- Adding profile support to aggregations.
- Scroll requests can be split for parallel processing.
- Multi-shard indices can be merged into a single shard.
- Rally will be getting a logging data set, probably from the World Cup, and Elasticsearch will get infrastructure for microbenchmarking.
- The Azure repository isn't removing deleted files.
- Refactoring of IndicesClusterStateService continues, to make cluster state updates more testable.
- Decreasing the delayed allocation timeout can lead to longer delays. Refactoring delayed shard allocation logic to make it simpler to test and maintain.
- The cluster allocation explain API will report when allocation is still waiting for the shard state.
- Snapshot UUIDs will enable more robust handling of snapshot deletion and partially failed deletes.
- The ingest compound processor should be internal only and should report the wrapped processor in error messages.
- epoch_millis should support the full range of Longs.
- We should use Java's Base64 library instead of our own version.
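The last item refers to the `java.util.Base64` class that ships with Java 8. A minimal sketch of what using it looks like is below; the class name `Base64RoundTrip` and the sample inputs are illustrative only, and nothing here reflects how Elasticsearch calls its own codec internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative wrapper around the JDK's built-in Base64 codec (Java 8+). */
final class Base64RoundTrip {
    private Base64RoundTrip() {}

    /** Encodes UTF-8 text with the standard alphabet, including padding. */
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes standard Base64 back into UTF-8 text. */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    /** Encodes raw bytes with the URL-safe alphabet and no trailing padding. */
    static String encodeUrlSafe(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}
```

The URL-safe variant swaps `+` and `/` for `-` and `_`, which is convenient for identifiers that end up in URLs.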
Apache Lucene
- A 6.0.1 bugfix release is coming soon
- Highlighter and geo point queries do not mix
- A test bug in `MoreLikeThisTest` is tricky to fix
- An extremely thin slice down from the north pole confuses `Geo3d`
- `Geo3d` gets a doc-values field to enable sort-by-distance, using the unique "distance from a shape's boundary"
- `Geo3d` now uses `Double.POSITIVE_INFINITY` instead of `Double.MAX_VALUE` when a point is outside of the shape
- `DateRangePrefixTree` lets you control the calendar template
- Every time we prepare for a release (this time 6.0.1) we find and fix fun bugs in our release scripts
- The `equals` and `hashCode` methods will become abstract in the `Query` base class
- Our build scripts should use `-release` instead of `-source` and `-target` to ensure full Java 8 binary compatibility even when compiling with Java 9
- Lucene will soon support half-float points, using 2 bytes to represent a floating point number
- `ToParentBlockJoinQuery`'s explain is lame today and should instead include the explanation from its children
- Codec-level encryption continues iterating
- Our randomized tests found a failure in heatmap facets
- We see a nice performance gain by poaching Solr's `ExpandingIntArray` for collecting hits
- A new lemmatizer appears for Ukrainian
- `QueryParser` should let you sometimes use unescaped internal operator characters
- Lucene's highlighter is confused when you try to highlight `SynonymQuery`
- Our Brazilian analyzer has a bug in its stop words file
- `SlowCompositeReaderWrapper` should move out of Lucene's sources
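To give a feel for what "2 bytes to represent a floating point number" means in the half-float item above, here is a small decoding sketch. It assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits), which this post does not confirm as Lucene's encoding, and the class name `HalfFloatSketch` is hypothetical.

```java
/** Hypothetical decoder for 16-bit floats, assuming the IEEE 754 binary16 layout. */
final class HalfFloatSketch {
    private HalfFloatSketch() {}

    /** Converts 16 half-float bits into the float value they represent. */
    static float halfToFloat(short half) {
        int bits = half & 0xFFFF;            // treat the short as unsigned
        int sign = bits >>> 15;              // 1 sign bit
        int exponent = (bits >>> 10) & 0x1F; // 5 exponent bits, bias 15
        int mantissa = bits & 0x3FF;         // 10 mantissa bits

        float magnitude;
        if (exponent == 0) {
            // Zero or subnormal: mantissa * 2^-24, with no implicit leading 1.
            magnitude = Math.scalb((float) mantissa, -24);
        } else if (exponent == 0x1F) {
            // All exponent bits set: infinity or NaN.
            magnitude = mantissa == 0 ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            // Normal: (1 + mantissa / 1024) * 2^(exponent - 15).
            magnitude = Math.scalb((float) (1024 + mantissa), exponent - 25);
        }
        return sign == 0 ? magnitude : -magnitude;
    }
}
```

Under this layout the largest finite value is 65504 and only about three decimal digits of precision survive, which is the trade-off for halving the storage of a 4-byte float.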
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!