This Week in Elasticsearch and Apache Lucene - 2016-05-31
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Using #Elasticsearch for #geohazards & working w' @esa to map ground deformation @terradue https://goo.gl/vcKs4X
— elastic (@elastic) May 27, 2016
Elasticsearch Core
Changes in 2.x:
- Highlighting on queries with geo-points no longer throws an exception.
- The _only_nodes search preference now handles multiple node names and attributes, and round robins between all matching nodes.
- Named queries, especially range or fuzzy queries, got a significant speed boost.
- Time settings were not handling decimal points for seconds correctly.
- Ensure that network messages have been successfully sent before trace logging them.
- An empty filter in the percolator no longer throws an NPE.
- The cat-indices API now expands wildcards to include closed indices.
- Joda-time upgraded to fix loading of time zones from scripts.
- Time zone rounding had some bugs in edge cases and DST transitions.
- Reindex API now defaults to using batches of 1,000 hits.
- The Azure repository now deletes files correctly.
- Dynamic templates with a match_mapping_type are no longer ignored if followed by a match: "*".
Changes in master:
- The new shrink API allows a multi-shard index to be merged down to a single shard.
- Command line settings no longer use the `es.` prefix, and system properties can no longer be set with `-D`.
- Dots in field names are supported again.
- Delete-by-query plugin has been removed as the feature has returned to core.
- Epoch datetimes now support the full range of Java Long.
- Painless scripting:
  - Removing ambiguous grammars greatly speeds up script compilation.
  - Many more Java APIs have been whitelisted, including java.time.
  - Greatly improved script exception output.
- The status of long-running tasks like reindex is now persisted to an index after the task has finished.
- The cluster allocation explain API now reports whether it is still waiting for shard information.
- The _source parameter in nested hits now uses absolute paths.
- Elasticsearch now warns if the minimum_master_nodes setting is lower than a quorum.
- XContent objects and arrays must be closed explicitly to avoid bugs with incorrect nesting.
- When recovering from the transaction log, don't add the same transactions back into the log.
- Custom plugins path is no longer supported.
- Doc stats are now pulled from IndexWriter instead of IndexReader for more accurate values.
- A relocating replica shard will no longer fail the target shard if the old replica fails.
- Doc-values for the _type field can now be accessed correctly.
- Plugins can no longer register shard allocation commands.
- The internal ingest CompoundProcessor now exposes the actual processor throwing an exception.
- Liveness requests should never trip a circuit breaker.
- Our custom Base64 implementation has been replaced with Java's implementation.
- The percolator has moved to a module, and is able to optimise more queries.
- Nodes log their OS and JVM version during startup.
- Delayed shard allocation has been refactored and simplified.
- Lists of modules and plugins are now maintained in generated resource files.
- The Java HTTP client needs to target Java 7 as well as 8 and 9.
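The quorum warning in the list above can be illustrated with a little arithmetic. The following is a minimal sketch, not Elasticsearch's actual code: it assumes a quorum means a strict majority of master-eligible nodes, that is `(n / 2) + 1`, and the class and method names are hypothetical.

```java
// Hypothetical illustration only; not taken from the Elasticsearch code base.
class QuorumCheck {

    // Assumption: a quorum is a strict majority of master-eligible nodes.
    static int quorum(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1; // integer division rounds down
    }

    // True when the configured value is too low to rule out a split brain.
    static boolean belowQuorum(int minimumMasterNodes, int masterEligibleNodes) {
        return minimumMasterNodes < quorum(masterEligibleNodes);
    }

    public static void main(String[] args) {
        int masterEligible = 3;
        for (int configured = 1; configured <= masterEligible; configured++) {
            System.out.printf("minimum master nodes = %d, quorum = %d, warn = %b%n",
                    configured, quorum(masterEligible),
                    belowQuorum(configured, masterEligible));
        }
    }
}
```

With three master-eligible nodes, only a configured value of 1 falls below the majority of 2, so that is the only case in which this sketch would warn.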
Ongoing:
- The cluster name should not be appended to path.data.
- How should we deal with empty queries in 5.x?
- How can we implement scroll slicing without requiring users to provide a field to slice on?
- We should be able to reindex from a remote Elasticsearch cluster.
- It should be possible to sort/aggregate ip fields coming from 2.x/5.0 indices.
- REST responses should include warnings when deprecated syntax has been used.
- CRUD requests will be able to wait until their changes are visible to search.
- Index deletion requests should only be acknowledged after the cluster state has been persisted.
- Global check points for sequence IDs should be merged soon.
- Function score query should be able to combine scores from different queries using a script.
- Index templates should be validated when created or updated.
- Reindex and update-by-query should support document deletion.
- Rally's benchmark script should be separate from the tracks being benchmarked, which allows benchmarking different versions of Elasticsearch. A logging data set is also being added.
Apache Lucene
- 6.0.1 vote has passed so the bits will be set free shortly
- The leaky-abstraction `SlowCompositeReaderWrapper` is finally gone from Lucene, a process at least 7 years in the making, dating back to when Lucene first switched to near-real-time searching!
- Lucene now supports half-float points, using 2 bytes to represent a floating point number
- Some as yet unexplained scary bug lurks deep inside `IndexWriter` when you mix updates of documents and doc values
- A new Ukrainian lemmatizer leads to discussions about how it differs from the existing hunspell-based tokenizer
- The confusion matrix computation in Lucene's classifier module now uses the macro average to avoid bias
- Don't throw a `NullPointerException` if you try to run `SimpleNaiveBayesClassifier` on a non-existing field
- A new directory wrapper uses hard links when possible to optimize copying files
- More release script improvements
- Lucene's internal `BytesRefHash` class, used in several hotspots including buffering postings in RAM, gets a nice speedup by switching to a radix sort, and it looks like dimensional points, also sort intensive during indexing, will get a nice speedup as well
- `ArrayUtil` had accumulated some rust
- `ToParentBlockJoinQuery`'s explain now includes the explanation from its children
- Can we use doc values instead of a heap-resident bitset to implement block-joins?
- Another user falls into the unfortunately common trap of thinking Lucene's stored fields store all information about a field
- A Java 1.9 javadocs bug causes our javadocs to fail
- `TermAutomatonQuery`, a fun query letting you query with complex graph-like phrases, had a ridiculously costly `hashCode` implementation
- `DocIdSetBuilder`, a hot spot used to efficiently gather matching docIDs, gets a nice speedup, so much so that we were able to switch the `LatLonPoint` queries back to it
- `IndexWriter` should tell you the effective order of concurrent operations
- We upgraded `ForbiddenAPIs` to version 2.1 to get a number of improvements including better Java 1.9 support
- The `equals` and `hashCode` methods are now abstract in the `Query` base class
- Remove the added dependency from highlighter to spatial
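To see why a macro average helps avoid bias, as mentioned for the classifier module in the list above, here is a small sketch. It is not Lucene's implementation: the class name, the choice of precision as the metric, and the sample matrix are all assumptions made for illustration. A macro average computes the metric per class and then averages the classes equally, whereas a micro average pools all predictions, so a dominant class can hide poor results on a rare one.

```java
// Hypothetical illustration only; not Lucene's classifier code.
class MacroAverage {

    // matrix[actual][predicted] holds document counts.
    static double macroPrecision(long[][] matrix) {
        int classes = matrix.length;
        double sum = 0.0;
        for (int c = 0; c < classes; c++) {
            long predictedAsC = 0;
            for (int actual = 0; actual < classes; actual++) {
                predictedAsC += matrix[actual][c];
            }
            // Per-class precision; a class that was never predicted scores 0.
            sum += predictedAsC == 0 ? 0.0 : (double) matrix[c][c] / predictedAsC;
        }
        return sum / classes; // every class weighs the same
    }

    // Pools all predictions, so large classes dominate the result.
    static double microPrecision(long[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix.length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up counts: class 0 is common, class 1 is rare and poorly predicted.
        long[][] matrix = {
            {90, 10},
            {5, 5}
        };
        System.out.printf("macro = %.3f, micro = %.3f%n",
                macroPrecision(matrix), microPrecision(matrix));
    }
}
```

On these made-up counts the pooled figure is much higher than the per-class average, because the rare class contributes few predictions to the pool but half of the macro score.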
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!