09 January 2017

This Week in Elasticsearch and Apache Lucene - 2017-01-09


Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Finished write-up over holidays: Everything we know about #elasticsearch for e-commerce sites @elastic @sprysys https://t.co/y1tXjMU5Cw
— Martin Loetzsch (@martin_loetzsch) January 9, 2017

Elasticsearch Core

Changes in 5.1:

The field stats API now has support for geo-point fields.
Don't close store under CancellableThreads.
Ensure shrunk indices carry over version information from their source.
Primary relocation for shadow indices had a hidden bug that caused the source to fail itself before performing a full recovery.

Changes in 5.2:

The cluster-allocation-explain API now uses the allocation process itself to explain shard allocation decisions, which means that the explanation will always be in sync with reality.
Snapshot repositories containing pre-2.x snapshots compressed with LZF can now be read again (although the pre-2.x snapshots cannot be restored).
The minimum_number_should_match parameter is deprecated in favour of minimum_should_match.
Certain exceptions during index deletion weren't being caught and could cause cluster state application to fail.
The low level node handshake has been moved from #connectToNode to #openConnection to prevent bypassing.
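
For the minimum_should_match rename, a minimal sketch of updating an existing query body might look like the following. It assumes the deprecated name appears as a key in a bool query; the helper migrate_query and the sample field and term values are hypothetical, used only for illustration. It renames every key with the deprecated name, so it would need adjusting if a document field happened to share that name.

```python
from typing import Any

# Parameter names taken from the note above.
DEPRECATED = "minimum_number_should_match"
PREFERRED = "minimum_should_match"


def migrate_query(node: Any) -> Any:
    """Return a copy of a query body with the deprecated key renamed.

    Hypothetical helper: walks nested dicts and lists and leaves the
    input untouched.
    """
    if isinstance(node, dict):
        migrated = {}
        for key, value in node.items():
            new_key = PREFERRED if key == DEPRECATED else key
            migrated[new_key] = migrate_query(value)
        return migrated
    if isinstance(node, list):
        return [migrate_query(item) for item in node]
    return node


# Sample bool query (field name and terms are made up for this sketch).
old_query = {
    "bool": {
        "should": [
            {"term": {"tag": "search"}},
            {"term": {"tag": "lucene"}},
            {"term": {"tag": "analytics"}},
        ],
        "minimum_number_should_match": 2,
    }
}

new_query = migrate_query(old_query)
```

The resulting new_query carries minimum_should_match with the same value, while old_query is left as it was.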

Changes in 5.3:

Added infrastructure for storing sensitive settings (e.g. passwords) in a password-protected keystore.
The new ToXContentObject interface represents complete objects, while ToXContent is to be used for object fragments which don't output opening and closing curlies.
Painless strings didn't support escaping of quotes with backslashes.
Add support for ca-central-1 region to EC2 and S3 plugins.

Changes in master:

Socket, ServerSocket, and HttpServer usages in tests have been replaced with mocksocket versions to move SocketPermissions out of core. Also, IfConfig.logIfNecessary (which also requires socket permissions) has been moved into bootstrap, before the security manager is applied.
Added the first method, ping, to the new Java HighLevelRestClient.
Version now implements Comparable.
Disable the Netty recycler and pooled allocator as they seem to be more trouble than they are worth.

Ongoing changes:

#namedObject is replacing SearchExtRegistry, AggregatorParsers, Suggesters, and AllocationCommands.
The syntax for lower/upper bounds of stddev can be simplified.
All booleans everywhere should be strictly evaluated.
Can field collapsing on search hits be done more efficiently and simply with search instead of top hits?
Custom routing can be used to target a subset of shards instead of just one shard.
Aggs over indices which return a mix of floats and integers should treat all numbers as doubles.
Nested and parent-child queries are ignoring the ignore_unmapped parameter.
Remove unneeded weak reference from prefix logger to resolve potential memory leak.
S3 is being moved to use the new secure settings infra.
Remove support for the _all field in 6.0.
Sequence numbers allow for fast recovery when a replica has fallen out of sync with the primary.
The timezone and date format should be normalised when rewriting range queries for caching.
An adjacency matrix aggregation can show co-occurrence of terms.
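
To make the adjacency matrix idea concrete, here is a rough sketch in plain Python of what "co-occurrence of terms" means. It is not the Elasticsearch implementation: the function adjacency_counts, the sample documents, and the "A&B" key format for pair buckets are all assumptions made for this illustration. Each named filter gets its own count, and each pair of filters gets a count of the documents matching both.

```python
from itertools import combinations
from typing import Callable, Dict, List

Doc = Dict[str, List[str]]


def adjacency_counts(
    docs: List[Doc], filters: Dict[str, Callable[[Doc], bool]]
) -> Dict[str, int]:
    """Count documents per filter and per pair of co-occurring filters."""
    counts: Dict[str, int] = {}
    names = sorted(filters)  # sorted so pair keys are deterministic
    for doc in docs:
        matched = [name for name in names if filters[name](doc)]
        for name in matched:
            counts[name] = counts.get(name, 0) + 1
        for first, second in combinations(matched, 2):
            key = f"{first}&{second}"  # assumed key format for this sketch
            counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up documents tagged with terms.
docs = [
    {"tags": ["lucene", "search"]},
    {"tags": ["search", "analytics"]},
    {"tags": ["lucene", "search", "analytics"]},
    {"tags": ["lucene"]},
]

# One filter per term; the default argument binds each tag to its lambda.
filters = {
    tag: (lambda doc, tag=tag: tag in doc["tags"])
    for tag in ("lucene", "search", "analytics")
}

matrix = adjacency_counts(docs, filters)
```

In this sample, the "lucene&search" entry counts the two documents tagged with both terms, which is the kind of relationship such an aggregation would surface.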

Apache Lucene

Lucene should better optimize the case of costly multi-term and point queries AND'd with a fast, restrictive query
WordDelimiterGraphFilter, intended to replace WordDelimiterFilter, would finally work correctly with positional queries at search time, eventually fixing this Elasticsearch issue
Dimensional points now try harder to split on all dimensions, even in slivery cases
UnifiedHighlighter should let you customize how candidate passages are created, wrapping a sentence BreakIterator by default
Lucene's GroupingSearch helper class could use some improvements
The Surround query parser should be modernized to use the numerous new Lucene APIs added since it was created
Like SynonymQuery, which scores multiple synonym terms as a single term, SpanSynonymQuery would do the same thing for span queries
BooleanQuery could quite easily allow for per-document minimumShouldMatch, instead of the single global value you can provide today
Now that LongValuesSource has moved to Lucene's core, the suggester module no longer needs to depend on the queries module. Likewise, the expressions module should use DoubleValuesSource, removing its dependency on the queries module, and the facets module should only use LongValuesSource and DoubleValuesSource.
The ComplexPhraseQueryParser should also handle a single multi-term query in quotes
AnalyzingInfixSuggester no longer relies on the misc module since we promoted index sorting as a core feature
The FlattenGraphFilter, added so multi-token synonyms work correctly, was not correctly handling broken incoming token offsets
AutomatonTermsEnum gave a confusing exception if you passed a special-case CompiledAutomaton
DrillSideways, letting you still see other facet counts even after you've drilled down, now uses threads to gain concurrency
Lucene was not always enforcing that PositionLengthAttribute was > 0, possibly causing illegal cyclic token streams
Query parsers can now handle graph token streams, finally fixing multi-token synonyms with positional queries to behave correctly
A possible optimization to DocValuesRangeQuery may not pan out
The obscure git mailmap feature lets us coalesce commits by the same person using different names and/or email addresses over time
We had to disable a test case on Java 9 but we are not exactly sure why
The QueryNode.toQueryString API in the flexible query parser illegally claims to create a string which, when parsed, would result in exactly this same query node
CustomAnalyzer was only applying character normalization to the last TokenFilter
Java 9 breaks Lucene's efforts to estimate RAM usage of JDK runtime classes
A tricky non-reproducing test failure turned out to be a concurrency bug in AnalyzingInfixSuggester
LongValuesSource and DoubleValuesSource have been promoted to Lucene's core module
LeafFieldComparator.setScorer is now allowed to throw IOException
Index sorting was failing to ask the codec to create a mutable bit set while flushing a new segment
Our source code checks (tabs vs whitespace, nocommits, etc.) should also check .xml and .template files
A Jenkins test failure in Geo3D resulted in adding a new threshold, Vector.MINIMUM_ANGULAR_RESOLUTION, to reject too-small slivers
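
The items above about graph token streams and PositionLengthAttribute can be illustrated with a small model. This is a sketch in plain Python, not Lucene code: the Token tuple and token_graph_edges function are hypothetical. It treats each token as an edge from its start position to start plus position length, which shows why a position length of 0 is illegal: the edge would begin and end at the same node, creating a cycle.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    position_increment: int  # how far to advance from the previous token
    position_length: int     # how many positions this token spans


def token_graph_edges(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Turn a token stream into (from_node, to_node, term) edges."""
    edges = []
    position = -1  # the first token's increment of 1 moves this to 0
    for token in tokens:
        if token.position_length < 1:
            # A zero-length token would be a self-loop in the graph.
            raise ValueError(
                f"position length must be > 0 for term {token.term!r}"
            )
        position += token.position_increment
        edges.append((position, position + token.position_length, token.term))
    return edges


# Made-up example: "wifi" as a single token spanning the two-token
# synonym "wi fi". Both paths lead from node 0 to node 2.
stream = [
    Token("wifi", 1, 2),
    Token("wi", 0, 1),
    Token("fi", 1, 1),
]

edges = token_graph_edges(stream)
```

Under this model, a query parser that understands the graph can see that "wifi" and "wi fi" are alternative paths over the same span, rather than treating "wifi" and "wi" as simple same-position synonyms.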

Watch This Space

Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!
