This Week in Elasticsearch and Apache Lucene - 2016-07-11
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Just wrote a new blogpost about the new #elasticsearch java RestClient. The post shows how to use the client. https://t.co/5Sf6topxgS
— Jettro Coenradie (@jettroCoenradie) July 8, 2016
Elasticsearch Core
Changes in 2.x:
- The query DSL in the percolator should respect the _id field in the same way as in search.
- Stale routing data can lead to CRUD requests bouncing between nodes during primary relocation.
- Netty3 has been upgraded to 3.10.6.Final.
Changes in master:
- Elasticsearch can now reindex from a remote cluster (including clusters of a different major version).
- REST responses include "Warning:" headers when a request has used deprecated functionality.
- fields has been renamed to stored_fields and will only retrieve values from fields marked as stored. docvalue_fields should be used to retrieve doc values.
- All nodes now have persistent IDs which survive restarts, making it easier to track the same node in monitoring tools.
- The field-stats API was missing the datatype of each field.
- The change to inline reroutes when a node joins the cluster has been reapplied to master with fixes.
- Elasticsearch no longer catches Throwable, which can hide important exceptions like OOM and stack overflows. Instead, unrecoverable errors should cause a node to die with dignity.
- Log messages for batched cluster state updates were difficult to interpret.
- The _size field now has support for doc-values.
- The percolator can extract terms from queries used inside function_score. Range queries which use "now" cannot be used with the percolator.
- The Java REST client now provides shorter performRequest() methods when the body or query string is not required. Also, it defaults to using http for sniffing, and logs correct URLs even when the path is missing a leading slash.
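The fields rename in the list above changes how a search request asks for stored values versus doc values. A minimal sketch of such a request body, built as a Python dictionary: the field names title and price are hypothetical, and the code only constructs the JSON rather than sending it to a cluster.

```python
import json


def build_search_body(stored, docvalues):
    """Build a search body that separates stored fields from doc-value fields."""
    return {
        "query": {"match_all": {}},
        # Retrieves values only for fields that are marked as stored.
        "stored_fields": list(stored),
        # Retrieves values from doc values instead.
        "docvalue_fields": list(docvalues),
    }


# "title" and "price" are hypothetical field names used for illustration.
body = build_search_body(["title"], ["price"])
print(json.dumps(body, indent=2))
```

Keeping the two lists separate makes it explicit which values come from stored fields and which come from doc values.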
Ongoing changes:
- Sequence IDs should allow fast replica recovery from the translog instead of having to always copy segments.
- Work continues on moving aggregations to use NamedWritable instead of their custom AggregationStreams.
- Networking code is being refactored to make it possible to move Netty3 to a module, and to add a Netty4 module.
- The eradication of Guice from the codebase continues.
- CRUD exceptions on the primary need to be replicated for proper sequence ID accounting.
- Similarities should be dynamically updatable.
- The cardinality agg should use a static precision as it is less confusing to users.
- The profile API will also count the number of method calls.
- Scaled floats (backed by a long field) will make storing percentages more efficient.
- Data loss can occur when indexing during primary relocation while there are ongoing replica recoveries.
- Cockroach actions are persistent long-running tasks that can survive cluster restart.
- The allocation explain API will also report disk usage.
- Script exceptions should not be swallowed by the ingest API.
- The query evaluation framework will gain a REST API and other evaluation metrics such as reciprocal rank and discounted cumulative gain.
- The FVH should be able to extract terms from nested queries.
- The Java REST client will be benchmarked against the transport client.
- The REST client needs an async API.
- The analysis API is being expanded to accept custom tokenizers, token filters and character filters.
- The create-index API should wait for sufficient shards to be assigned before returning.
- The elasticsearch-tool command-line tool will allow truncating corrupt transaction logs to recover data that has already been indexed.
- Node-left and node-failed messages should be batch processed.
- Rally will provide a static benchmark page with annotations, as annotations are not yet supported in Kibana.
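The two evaluation metrics named in the list above, reciprocal rank and discounted cumulative gain, can be sketched in a few lines. This uses the common textbook definitions and is not taken from the Elasticsearch implementation; the function names and the sample ratings are assumptions made for illustration.

```python
import math


def reciprocal_rank(relevant_flags):
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def discounted_cumulative_gain(gains):
    """Sum graded relevance gains, each discounted by log2(rank + 1)."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains, start=1)
    )


# Hypothetical ratings for one query's ranked results.
rr = reciprocal_rank([False, True, False])
dcg = discounted_cumulative_gain([3, 2, 0, 1])
print(rr, dcg)
```

Reciprocal rank rewards only the position of the first relevant hit, while discounted cumulative gain credits every graded hit and penalizes those that appear lower in the ranking.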
Apache Lucene
- We are removing the archaic queryNorm and coord factors from Lucene's scoring, starting with BooleanQuery, and next with Weight.normalize, since modern scoring models handle term saturation better
- Dimensional points get their first index format change, to better compress the on-disk docIDs, with additional improvements to values storage coming soon, giving a 9.7% disk size reduction on our OpenStreetMaps benchmark
- The benchmarks module's TestQualityRun uses BM25 scoring
- A user thread leads to improved documentation about the maximum length of binary doc values fields
- MemoryIndex.toString breaks if payloads are used
- IndexWriter.getMaxCompletedSequenceNumber was incorrect after IndexWriter.deleteAll
- The new Ukrainian lemmatizer's dictionary contains unique token + lemma pairs
- At long last, Lucene's classic query parser no longer pre-splits on whitespace, leaving that decision to the analyzer instead
- Our ant build.xml files had redundant failonerror=true
- ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory implement MultiTermAwareComponent for better multi-term query support
- Explanation implements equals and hashCode
- Explanation.toHtml is gone
- Use Collection.isEmpty instead of Collection.size() == 0
- Geo3D throws IllegalArgumentException if the points for path segment intersections are ambiguous
- DecimalDigitFilter has problems with digits that use Unicode's non-BMP supplemental characters
- The new releasedJirasRegex.py helps automate some parts of the lengthy release process
- DelegatingWeight delegates all methods to another Weight
- Can we improve the default behavior of query parsers and multi-term queries?
- Lucene's default MMapDirectory can lead to JVM crashes if the user incorrectly concurrently closes a still-in-use IndexReader
- BooleanScorer has high overhead for tiny segments
- How should FieldInfo and FieldInfos implement toString?
- Adding a single method to Terms to return all terms as a string is too dangerous
- Is it Levenstein or Levenshtein?
- Something is invoking toString on a Field object unexpectedly
- Should RAMDirectory be able to copy from any other Directory on init?
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!