This Week in Elasticsearch and Apache Lucene - 2016-07-11
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Just wrote a new blogpost about the new #elasticsearch java RestClient. The post shows how to use the client. https://t.co/5Sf6topxgS
— Jettro Coenradie (@jettroCoenradie) July 8, 2016
Changes in 2.x:
- The query DSL in the percolator should respect the _id field in the same way as in search.
- Stale routing data can lead to CRUD requests bouncing between nodes during primary relocation.
- Netty3 has been upgraded to 3.10.6.Final.
Changes in master:
- Elasticsearch can now reindex from a remote cluster (including clusters of a different major version).
- REST responses include "Warning:" headers when a request has used deprecated functionality.
- fields has been renamed to stored_fields and will only retrieve values from fields marked as stored. docvalue_fields should be used to retrieve doc values.
- All nodes now have persistent IDs which survive restarts, making it easier to track the same node in monitoring tools.
- The field-stats API was missing the datatype of each field.
- The change to inline reroutes when a node joins the cluster has been reapplied to master with fixes.
- Elasticsearch no longer catches Throwable, which can hide important exceptions like OOM and stack overflows. Instead, unrecoverable errors should cause a node to die with dignity.
- Log messages for batched cluster state updates were difficult to interpret.
- The _size field now has support for doc values.
- The percolator can extract terms from queries used inside function_score. Range queries which use "now" cannot be used with the percolator.
- The Java REST client now provides shorter performRequest() methods for when the body or query string is not required. Also, it defaults to using http for sniffing, and logs correct URLs even when the path is missing a leading slash.
- Sequence IDs should allow fast replica recovery from the translog instead of having to always copy segments.
- Work continues on moving aggregations to use NamedWritable instead of their custom AggregationStreams.
- Networking code is being refactored to make it possible to move Netty3 to a module, and to add a Netty4 module.
- The eradication of Guice from the codebase continues.
- CRUD exceptions on the primary need to be replicated for proper sequence ID accounting.
- Similarities should be dynamically updatable.
- The cardinality agg should use a static precision as it is less confusing to users.
- The profile API will also count the number of method calls.
- Scaled floats (backed by a long field) will make storing percentages more efficient.
- Data loss can occur when indexing during primary relocation while there are ongoing replica recoveries.
- Cockroach actions are persistent long-running tasks that can survive cluster restart.
- The allocation explain API will also report disk usage.
- Script exceptions should not be swallowed by the ingest API.
- The query evaluation framework will gain a REST API and other evaluation metrics such as reciprocal rank and discounted cumulative gain.
- The FVH should be able to extract terms from nested queries.
- The Java REST client will be benchmarked against the transport client.
- The REST client needs an async API.
- The analysis API is being expanded to accept custom tokenizers, token filters and character filters.
- The create-index API should wait for sufficient shards to be assigned before returning.
- The elasticsearch-tool command-line tool will allow truncating corrupt transaction logs to recover data that has already been indexed.
- Node-left and node-failed messages should be batch processed.
- Rally will provide a static benchmark page with annotations, as annotations are not yet supported in Kibana.
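For readers unfamiliar with the two ranking metrics named in the query evaluation item above, here is a minimal Python sketch of how they are commonly defined. It is illustrative only and is not the Elasticsearch implementation: the function names and sample ratings are hypothetical, and the DCG shown is the linear-gain variant (an exponential-gain variant also exists).

```python
import math


def reciprocal_rank(relevant):
    """Return 1/rank of the first relevant hit, or 0.0 if no hit is relevant.

    `relevant` is a list of booleans in result order (rank 1 first).
    """
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain, linear-gain form (an assumption here).

    Each graded relevance value is divided by log2(rank + 1), so hits
    further down the result list contribute less to the total.
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


# Hypothetical judgments for the top four hits of a single query.
print(reciprocal_rank([False, True, False, True]))
print(dcg([3, 2, 0, 1]))
```

Reciprocal rank only rewards the position of the first relevant hit, while DCG takes graded judgments for the whole result list into account.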
- We are removing the archaic coord factors from Lucene's scoring, starting with BooleanQuery, and next with Weight.normalize, since modern scoring models handle term saturation better
- Dimensional points get their first index format change, to better compress the on-disk docIDs, with additional improvements to values storage coming soon, giving a 9.7% disk size reduction on our
- The benchmarks module's TestQualityRun uses BM25 scoring
- A user thread leads to improved documentation about the maximum length of binary doc values fields
- MemoryIndex.toString breaks if payloads are used
- IndexWriter.getMaxCompletedSequenceNumber was incorrect after
- The new Ukrainian lemmatizer's dictionary contains unique token + lemma pairs
- At long last, Lucene's classic query parser no longer pre-splits on whitespace, leaving that decision to the analyzer instead
- Our ant build.xml files had redundant
MultiTermAwareComponent for better multi-term query support
Collection.size() == 0
IllegalArgumentException if the points for path segment intersections are ambiguous
- DecimalDigitFilter has problems with digits that use Unicode's non-BMP supplemental characters
- The new releasedJirasRegex.py helps automate some parts of the lengthy release process
- DelegatingWeight delegates all methods to another
- Can we improve the default behavior of query parsers and multi-term queries?
- Lucene's default MMapDirectory can lead to JVM crashes if the user incorrectly concurrently closes a still-in-use
- BooleanScorer has high overhead for tiny segments
- How should
- Adding a single method to Terms to return all terms as a string is too dangerous
- Is it
- Something is invoking
RAMDirectory be able to copy from any other
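The coord item in the list above mentions that modern scoring models handle term saturation better. As a rough illustration of what saturation means, here is a minimal Python sketch of the textbook BM25 term-frequency component. This is not Lucene's code: the function name is hypothetical, and k1 = 1.2 and b = 0.75 are assumed here as commonly used defaults.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component (illustrative sketch).

    The value grows with tf but is bounded above by k1 + 1, so repeating
    a term many times yields diminishing returns. The `b` parameter
    controls how strongly document length is normalized.
    """
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + length_norm)


# For a document of average length, the contribution flattens as tf grows.
for tf in (1, 2, 5, 20, 100):
    print(tf, round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))
```

Because each term's contribution is capped, a document cannot score highly just by repeating one query term, which is part of why a separate coordination factor that rewards matching more clauses becomes less necessary.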
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!