This Week in Elasticsearch and Apache Lucene - 2017-01-16
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Elasticsearch Core
Multi-word synonyms and synonym graphs
There has been much work recently on improving Lucene's handling of graph token streams, where analysis of text, either from a document during indexing or from a query during searching, produces multiple overlapping paths or interpretations for the tokens. Multi-word synonyms do this and have long been buggy when used with proximity queries, but thanks to the recent addition of SynonymGraphFilter, as well as improvements to Lucene's query parsers to translate the token graph into separate queries, such analysis chains are finally handled correctly at search time. WordDelimiterFilter is also being fixed to produce correct graphs. These changes have already been exposed in Elasticsearch, and then subsequently in Lucene, thanks to Matt Weber. Graph token streams still present challenges, though, such as the need to use FlattenGraphFilter during indexing, but not searching, since a Lucene index cannot represent a graph. There are also a number of token filters that should produce a graph but do not yet, such as EdgeNGramTokenFilter and decompounders.
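To make the idea of a token graph more concrete, here is a minimal Python sketch. It is a toy model and not Lucene's implementation: the `Token` type, the `enumerate_paths` and `drop_position_lengths` helpers, and the "dns" / "domain name system" synonym are all assumptions made up for illustration. Each token carries a position and a position length, and walking the graph from start to end yields the separate term sequences that a query parser could turn into separate queries. Discarding the position lengths, which is roughly what happens when only positions are stored, merges the paths and creates a phrase that was never in the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for one analyzed token (not a Lucene class)."""
    term: str
    position: int          # graph node where the token starts
    position_length: int   # how many positions the token spans


def enumerate_paths(tokens):
    """Return every term sequence that walks the graph from node 0 to the end."""
    outgoing = defaultdict(list)
    for tok in tokens:
        outgoing[tok.position].append(tok)
    end = max(tok.position + tok.position_length for tok in tokens)

    def walk(node):
        if node == end:
            return [[]]
        paths = []
        for tok in outgoing[node]:
            for rest in walk(tok.position + tok.position_length):
                paths.append([tok.term] + rest)
        return paths

    return walk(0)


def drop_position_lengths(tokens):
    """Naive simplification: keep positions, forget how far each token spans."""
    return [tok._replace(position_length=1) for tok in tokens]


# "dns server" with the assumed synonym dns -> "domain name system".
TOKENS = [
    Token("dns", 0, 3),      # spans the same three positions as the synonym
    Token("domain", 0, 1),
    Token("name", 1, 1),
    Token("system", 2, 1),
    Token("server", 3, 1),
]

PATHS = enumerate_paths(TOKENS)
FLAT_PATHS = enumerate_paths(drop_position_lengths(TOKENS))

for path in PATHS:
    print("graph:", " ".join(path))
for path in FLAT_PATHS:
    print("flat: ", " ".join(path))
```

With the graph intact, the two paths are "dns server" and "domain name system server". Once position lengths are dropped, "dns name system server" appears as a path, so a phrase query for "dns name" would match text that never contained it. This is the kind of proximity-query bug the paragraph above refers to.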
Changes in 5.2:
- A source filtering bug introduced in 5.1 meant that any fields that were a prefix of the specified pattern were being included incorrectly.
- New network connections should not be opened when transport is closing or already closed, and channels opened during handshakes should not leak if the handshake is interrupted or times out.
- Upgrade to Netty 4.1.7 should fix a refcounting bug exposed in Java 9.
- A default chunk size of -1 for the Azure repository plugin resulted in corrupt snapshots.
- The Painless loop counter should be higher for update scripts than for search scripts.
- The shard recovery message had swapped doc counts for target and source shards.
- Reindex-from-remote had a race condition when trying to clear the remote scroll ID, and was ignoring user-supplied source filtering.
- The NodeConnectionService now relies on the current cluster state to determine which nodes should be connected to.
- The parent/child and nested queries were ignoring the
- Single index and delete operations are now executed as bulk operations internally.
- ParseFieldMatcher and ParseFieldMatcherSupplier are no more.
- SearchRequestParsers have been removed in favour of NamedObjects.
- As part of the synonym graph filter changes, simple_query_string now supports token graphs, as does the match query (now with cutoff_frequency), and the analyze API now returns the position length by default when it is greater than 1.
- Affix settings (e.g. foo.bar.USER_VALUE.enabled) can now be validated and updated dynamically.
- S3 repository settings have been moved to use the new secure settings infrastructure.
- The Java High Level REST client continues to make progress: JSON hits can now be parsed into InternalSearchHit/s, and the response from index requests can be parsed into IndexResponse.
- Painless whitelisted some methods to make it easier to access IP and Binary fields in a script.
- As part of the change to limit accept permissions to Netty4 only, Netty channels have been extended with versions which wrap
- When terms aggs across indices result in a mixture of longs and doubles, longs should be promoted to doubles.
- The deprecated mpercolate APIs have been removed (in favour of
- The _all field will not be allowed on new indices created in v6.0.0 and above.
- A new cross-cluster federated search will replace the Tribe Node and fix some of its liabilities.
- Sequence-number based recovery will allow quick-syncs of replicas when a primary fails.
- Field collapsing at search time should be more efficient than the top-hits aggregation.
- Use custom routing to target a group of shards at index time, instead of just a single shard.
- Now that Lucene has upgraded to Groovy 2.4.8 to fix Java 9 issues, the 6.4.0 release process will begin!
- WordDelimiterFilter, would finally work with positional queries correctly at search time, eventually fixing this Elasticsearch issue
- TokenStreamToAutomaton failed to handle holes (deleted tokens) correctly, and we have promoted its test case to the core module's tests
- The new FunctionMatchQuery uses the new DoubleValuesSource for scoring and matching
- Starting with Lucene 7.0, IndexWriter will reject broken offsets in term vectors
- An inexplicable ClassCastException surfaces when estimating heap bytes used by an
- EdgeNGramTokenFilter deletes incoming token payloads
- It's unreasonably difficult to intersect an automaton with the terms from doc values
- TermsQuery has been promoted to Lucene's core and renamed to TermInSetQuery, letting us remove the dependency from the facets module on the
- UnifiedHighlighter should let you customize how candidate passages are created, wrapping a sentence BreakIterator by default, and Passage is now public for better extensibility
- The GroupingSearch helper class is now simpler to use
- DisjunctionScorer should expose the children scorers matching the current hit to facilitate highlighting
- Now that LongValuesSource has moved to Lucene's core, the suggester module no longer needs to depend on the queries module. Likewise, the expressions module should use DoubleValuesSource, removing its dependency on the queries module, and the facets module should only use
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!