16 January 2017

This Week in Elasticsearch and Apache Lucene - 2017-01-16

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Protecting Against Attacks that Hold Your Data for Ransom #elasticsearch https://t.co/Fg5wNC2aQJ
— elastic (@elastic) January 13, 2017

Elasticsearch Core

Multi-word synonyms and synonym graphs

There has been much work recently on improving Lucene's handling of graph token streams, where analysis of text, whether from a document during indexing or from a query during searching, produces multiple overlapping paths or interpretations for the tokens. Multi-word synonyms do this and have long been buggy when used with proximity queries. Thanks to the recent addition of SynonymGraphFilter, as well as improvements to Lucene's query parsers that translate the token graph into separate queries, such analysis chains are finally handled correctly at search time. WordDelimiterFilter is also being fixed to produce correct graphs. These changes have already been exposed in Elasticsearch, and then subsequently in Lucene, thanks to Matt Weber. Graph token streams still present challenges, though: FlattenGraphFilter must be used during indexing, but not searching, since a Lucene index cannot represent a graph. There are also a number of token filters that should produce a graph but do not yet, such as ShingleTokenFilter, EdgeNGramTokenFilter and decompounders.
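To make the idea of a token graph concrete, here is a minimal Python sketch (not Lucene code). It models tokens with a position increment and a position length, enumerates the paths through the graph, and then shows what remains when position length is dropped, as it is in an index. The token stream is hand-written for the text "dns is down" with a hypothetical synonym rule mapping "dns" to "domain name system"; the function names are illustrative assumptions.

```python
from collections import defaultdict

# Each token is (term, position_increment, position_length). The stream is
# hand-written for illustration; it is not captured from a real analyzer.
TOKENS = [
    ("domain", 1, 1),
    ("dns",    0, 3),  # starts where "domain" starts, spans three positions
    ("name",   1, 1),
    ("system", 1, 1),
    ("is",     1, 1),
    ("down",   1, 1),
]


def to_arcs(tokens):
    """Convert tokens into (start_node, end_node, term) arcs."""
    arcs = []
    pos = -1
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((pos, pos + pos_len, term))
    return arcs


def all_paths(tokens):
    """Enumerate every reading of the token graph, start to end."""
    arcs = to_arcs(tokens)
    outgoing = defaultdict(list)
    for start, end, term in arcs:
        outgoing[start].append((end, term))
    final = max(end for _, end, _ in arcs)

    def walk(node):
        if node == final:
            yield []
            return
        for end, term in outgoing[node]:
            for rest in walk(end):
                yield [term] + rest

    return sorted(walk(0))


def positions_only(tokens):
    """Keep only the start position per term, discarding position length."""
    return {term: start for start, _end, term in to_arcs(tokens)}


if __name__ == "__main__":
    for path in all_paths(TOKENS):
        print(" ".join(path))
    print(positions_only(TOKENS))
```

With position length available, a query parser can turn each path into its own phrase query. With positions only, "dns" sits at position 0 and "is" at position 3, so a phrase query for "dns is" would no longer see them as adjacent, while "dns" and "name" would wrongly look adjacent.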

Changes in 5.2:

A source filtering bug introduced in 5.1 meant that any fields that were a prefix of the specified pattern were being included incorrectly.
New network connections should not be opened when transport is closing or already closed, and channels opened during handshakes should not leak if the handshake is interrupted or times out.
Upgrade to Netty 4.1.7 should fix a refcounting bug exposed in Java 9.
A default chunk size of -1 for the Azure repository plugin resulted in corrupt snapshots.
The Painless loop counter should be higher for update scripts than for search scripts.
The shard recovery message had swapped doc counts for target and source shards.
Reindex-from-remote had a race condition when trying to clear the remote scroll ID, and was ignoring user-supplied source filtering.
The NodeConnectionService now relies on the current cluster state to determine which nodes should be connected to.
The parent/child and nested queries were ignoring the ignore_unmapped parameter.
Single index and delete operations are now executed as bulk operations internally.

Changes in 5.x:

ParseFieldMatcher and ParseFieldMatcherSupplier are no more.
SearchRequestParsers have been removed in favour of NamedObjects.
As part of the synonym graph filter changes, query_string and simple_query_string now support token graphs, as does the match query (now with prefix or cutoff_frequency), and the analyze API now returns the position length by default when it is greater than 1.
Affix settings (e.g. foo.bar.USER_VALUE.enabled) can now be validated and updated dynamically.
S3 repository settings have been moved to use the new secure settings infrastructure.
The Java High Level REST client continues to make progress: JSON hits can now be parsed into InternalSearchHit/s, and the response from index requests can be parsed into IndexResponse.

Changes in master:

Painless whitelisted some methods to make it easier to access IP and Binary fields in a script.
As part of the change to limit connect and accept permissions to Netty4 only, Netty channels have been extended with versions which wrap connect and accept in doPrivileged blocks.
When terms aggs across indices result in a mixture of longs and doubles, longs should be promoted to doubles.
The deprecated percolate and mpercolate APIs have been removed (in favour of search and msearch).
The _all field will not be allowed on new indices created in v6.0.0 and above.
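The terms aggregation item above can be illustrated with a small Python sketch. It is not the Elasticsearch implementation; the function name and the dict-based bucket shape are assumptions made for the example. The idea is that if any index reports a floating-point key, every key is promoted to a float before the per-index buckets are merged.

```python
def merge_term_buckets(per_index_buckets):
    """Merge {term: doc_count} dicts whose keys may be ints or floats.

    If any key anywhere is a float, all keys are promoted to float so the
    merged result has one consistent key type.
    """
    promote = any(
        isinstance(key, float)
        for buckets in per_index_buckets
        for key in buckets
    )
    merged = {}
    for buckets in per_index_buckets:
        for key, count in buckets.items():
            key = float(key) if promote else key
            merged[key] = merged.get(key, 0) + count
    return merged


if __name__ == "__main__":
    longs = {1: 2, 2: 1}        # field mapped as long in one index
    doubles = {1.0: 3, 2.5: 1}  # same field mapped as double in another
    print(merge_term_buckets([longs, doubles]))
```

In Python, `1` and `1.0` already compare equal as dict keys, so the promotion here mainly fixes the key type. In a statically typed setting, long and double keys are distinct values, which is why an explicit promotion step is needed.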

Upcoming changes:

A new cross-cluster federated search will replace the Tribe Node and fix some of its liabilities.
Sequence-number based recovery will allow quick-syncs of replicas when a primary fails.
Field collapsing at search time should be more efficient than the top-hits aggregation.
Use custom routing to target a group of shards at index time, instead of just a single shard.
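As a rough illustration of routing to a group of shards, the sketch below shows one possible scheme: the routing value picks a base shard, and the document ID picks an offset within a fixed-size partition. This is a sketch under stated assumptions, not the committed design. The function names and the `partition_size` parameter are hypothetical, and `zlib.crc32` is only a stand-in hash.

```python
import zlib


def _hash(value):
    # Stand-in hash for illustration only.
    return zlib.crc32(value.encode("utf-8"))


def shard_for(routing, doc_id, num_shards, partition_size=1):
    """Pick a shard for a document.

    The routing value selects a base shard; the document ID selects an
    offset in [0, partition_size). With partition_size == 1 this collapses
    to plain custom routing, where one routing value maps to one shard.
    """
    offset = _hash(doc_id) % partition_size
    return (_hash(routing) + offset) % num_shards


if __name__ == "__main__":
    shards = {shard_for("user-42", f"doc-{i}", 10, 3) for i in range(1000)}
    print(sorted(shards))  # at most three distinct shards
```

A search for a given routing value then only needs to visit that group of shards, while a heavy tenant's documents are spread over more than one shard.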

Apache Lucene

Now that Lucene has upgraded to Groovy 2.4.8 to fix Java 9 issues, the 6.4.0 release process will begin!
WordDelimiterGraphFilter, which will replace WordDelimiterFilter, should finally work correctly with positional queries at search time, eventually fixing an Elasticsearch issue
TokenStreamToAutomaton failed to handle holes (deleted tokens) correctly and we have promoted its test case to the core module's tests
The new FunctionScoreQuery and FunctionMatchQuery use the new DoubleValuesSource for scoring and matching
Starting with Lucene 7.0, IndexWriter will reject broken offsets in term vectors
An inexplicable ClassCastException surfaces when estimating heap bytes used by an FST
EdgeNGramTokenFilter deletes incoming token payloads
It's unreasonably difficult to intersect an automaton with the terms from doc values
TermsQuery has been promoted to Lucene's core and renamed to TermInSetQuery, letting us remove the dependency from the facets module on the queries module
UnifiedHighlighter should let you customize how candidate passages are created, wrapping a sentence BreakIterator by default, and Passage is now public for better extensibility
Lucene's GroupingSearch helper class is now simpler to use
DisjunctionScorer should expose the children scorers matching the current hit to facilitate highlighting
Now that LongValuesSource has moved to Lucene's core, the suggester module no longer needs to depend on the queries module. Likewise, the expressions module should use DoubleValuesSource, removing its dependency on the queries module, and the facets module should only use LongValuesSource and DoubleValuesSource.
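For the item above about rejecting broken offsets, the following Python sketch shows the kind of check involved. It is not Lucene's code, and the function name is hypothetical. It assumes the usual offset invariants: a start offset is non-negative, an end offset is not smaller than its start offset, and start offsets never go backwards from one token to the next.

```python
def check_offsets(offsets):
    """Validate a sequence of (start_offset, end_offset) pairs.

    Raises ValueError on the first pair that violates the assumed
    invariants; returns None if the whole sequence is consistent.
    """
    last_start = 0
    for i, (start, end) in enumerate(offsets):
        if start < 0 or end < start:
            raise ValueError(
                f"token {i}: bad offsets start={start}, end={end}"
            )
        if start < last_start:
            raise ValueError(
                f"token {i}: start offset {start} goes backwards "
                f"(previous start was {last_start})"
            )
        last_start = start


if __name__ == "__main__":
    check_offsets([(0, 3), (4, 6), (4, 10)])  # consistent, no error
    try:
        check_offsets([(0, 3), (5, 8), (4, 6)])
    except ValueError as err:
        print(err)
```

Token filters that rewrite or split tokens without adjusting offsets are a typical source of such violations, which is why a strict check at index time surfaces analysis chain bugs early.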

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
