Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Protecting Against Attacks that Hold Your Data for Ransom #elasticsearch https://t.co/Fg5wNC2aQJ
— elastic (@elastic) January 13, 2017
Elasticsearch Core
Multi-word synonyms and synonym graphs
There has been much work recently on improving Lucene's handling of graph token streams, where analysis of text, either from a document during indexing or from a query during searching, produces multiple overlapping paths or interpretations for the tokens. Multi-word synonyms do this and have long been buggy when used with proximity queries, but thanks to the recent addition of SynonymGraphFilter, as well as improvements to Lucene's query parsers to translate the token graph into separate queries, such analysis chains are finally handled correctly at search time. WordDelimiterFilter is also being fixed to produce correct graphs. These changes have already been exposed in Elasticsearch, and then subsequently in Lucene, thanks to Matt Weber. Graph token streams still present challenges, though, such as the need to use FlattenGraphFilter during indexing, but not searching, since a Lucene index cannot represent a graph. There are also a number of token filters that should produce a graph but do not yet, such as ShingleTokenFilter, EdgeNGramTokenFilter and decompounders.
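To make the idea of a token graph concrete, here is a minimal Python sketch (it is not Lucene code). It models each token with a position increment and a position length, treats tokens as edges between positions, and lists every path through the graph. The Token class, the helper names and the "dns" / "domain name service" synonym are illustrative assumptions, and the sketch assumes a stream without holes. The last step drops position lengths to mimic an index that stores positions only, which hints at why the graph is lost at index time.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position_increment: int   # distance from the previous token's start position
    position_length: int = 1  # number of positions this token spans


def build_graph(tokens):
    """Map each start position to its outgoing (term, end position) edges."""
    edges = defaultdict(list)
    position = -1
    for token in tokens:
        position += token.position_increment
        edges[position].append((token.term, position + token.position_length))
    return edges


def enumerate_paths(edges, start=0):
    """Return every path through the graph as a list of terms."""
    if start not in edges:
        return [[]]
    paths = []
    for term, end in edges[start]:
        for rest in enumerate_paths(edges, end):
            paths.append([term] + rest)
    return paths


def drop_position_lengths(tokens):
    """Mimic a store that keeps positions but not position lengths."""
    return [Token(t.term, t.position_increment, 1) for t in tokens]


# Hypothetical analysis of "dns is down" with "dns" expanded to a multi-word synonym.
tokens = [
    Token("dns", 1, 3),   # spans the three positions of its synonym
    Token("domain", 0),   # starts at the same position as "dns"
    Token("name", 1),
    Token("service", 1),
    Token("is", 1),
    Token("down", 1),
]

graph_paths = enumerate_paths(build_graph(tokens))
flat_paths = enumerate_paths(build_graph(drop_position_lengths(tokens)))

for path in graph_paths:
    print("graph:", " ".join(path))
for path in flat_paths:
    print("flat: ", " ".join(path))
```

With position lengths intact, the two readings ("dns is down" and "domain name service is down") stay separate. Once they are dropped, "dns name service is down" shows up as a path even though no analysis step produced that phrase.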
Changes in 5.2:
- A source filtering bug introduced in 5.1 meant that any fields that were a prefix of the specified pattern were being included incorrectly.
- New network connections should not be opened when transport is closing or already closed, and channels opened during handshakes should not leak if the handshake is interrupted or times out.
- Upgrade to Netty 4.1.7 should fix a refcounting bug exposed in Java 9.
- A default chunk size of -1 for the Azure repository plugin resulted in corrupt snapshots.
- The Painless loop counter should be higher for update scripts than for search scripts.
- The shard recovery message had swapped doc counts for target and source shards.
- Reindex-from-remote had a race condition when trying to clear the remote scroll ID, and was ignoring user-supplied source filtering.
- The NodeConnectionService now relies on the current cluster state to determine which nodes should be connected to.
- The parent/child and nested queries were ignoring the ignore_unmapped parameter.
- Single index and delete operations are now executed as bulk operations internally.
- ParseFieldMatcher and ParseFieldMatcherSupplier are no more.
- SearchRequestParsers have been removed in favour of NamedObjects.
- As part of the synonym graph filter changes, query_string and simple_query_string now support token graphs, as does the match query (now with prefix or cutoff_frequency), and the analyze API now returns the position length by default when it is greater than 1.
- Affix settings (e.g. foo.bar.USER_VALUE.enabled) can now be validated and updated dynamically.
- S3 repository settings have been moved to use the new secure settings infrastructure.
- The Java High Level REST client continues to make progress: JSON hits can now be parsed into InternalSearchHit/s, and the response from index requests can be parsed into IndexResponse.
- Painless whitelisted some methods to make it easier to access IP and Binary fields in a script.
- As part of the change to limit connect and accept permissions to Netty4 only, Netty channels have been extended with versions which wrap connect and accept in doPrivileged blocks.
- When terms aggs across indices result in a mixture of longs and doubles, longs should be promoted to doubles.
- The deprecated percolate and mpercolate APIs have been removed (in favour of search and msearch).
- The _all field will not be allowed on new indices created in v6.0.0 and above.
- A new cross-cluster federated search will replace the Tribe Node and fix some of its liabilities.
- Sequence-number based recovery will allow quick-syncs of replicas when a primary fails.
- Field collapsing at search time should be more efficient than the top-hits aggregation.
- Use custom routing to target a group of shards at index time, instead of just a single shard.
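As a rough illustration of the last item, the following Python sketch shows one way a routing value can select a small group of shards rather than a single one: the routing value picks where a window of shards starts, and the document id picks an offset inside that window. The function names, the partition_size parameter and the crc32 stand-in hash are assumptions made for this sketch; Elasticsearch uses its own hash function and setting names.

```python
import zlib


def stable_hash(value: str) -> int:
    # Stand-in hash for illustration only; not the hash Elasticsearch uses.
    return zlib.crc32(value.encode("utf-8"))


def pick_shard(routing: str, doc_id: str, num_shards: int, partition_size: int) -> int:
    # The routing value fixes the start of a window of shards;
    # the document id chooses an offset inside that window.
    offset = stable_hash(doc_id) % partition_size
    return (stable_hash(routing) + offset) % num_shards


# All documents sharing one routing value land in a window of at most 3 shards.
shards = {
    pick_shard("user-42", f"doc-{i}", num_shards=10, partition_size=3)
    for i in range(1000)
}
print(sorted(shards))

# A partition size of 1 degenerates to classic single-shard routing.
single = {
    pick_shard("user-42", f"doc-{i}", num_shards=10, partition_size=1)
    for i in range(1000)
}
print(sorted(single))
```

Spreading one routing value over a few shards keeps searches for that value narrow while avoiding a single oversized shard for very large tenants.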
Apache Lucene
- Now that Lucene has upgraded to Groovy 2.4.8 to fix Java 9 issues, the 6.4.0 release process will begin!
- WordDelimiterGraphFilter, to replace WordDelimiterFilter, would finally work with positional queries correctly at search time, eventually fixing this Elasticsearch issue
- TokenStreamToAutomaton failed to handle holes (deleted tokens) correctly and we have promoted its test case to the core module's tests
- The new FunctionScoreQuery and FunctionMatchQuery use the new DoubleValuesSource for scoring and matching
- Starting with Lucene 7.0, IndexWriter will reject broken offsets in term vectors
- An inexplicable ClassCastException surfaces when estimating heap bytes used by an FST
- EdgeNGramTokenFilter deletes incoming token payloads
- It's unreasonably difficult to intersect an automaton with the terms from doc values
- TermsQuery has been promoted to Lucene's core and renamed to TermInSetQuery, letting us remove the dependency from the facets module on the queries module
- UnifiedHighlighter should let you customize how candidate passages are created, wrapping a sentence BreakIterator by default, and Passage is now public for better extensibility
- Lucene's GroupingSearch helper class is now simpler to use
- DisjunctionScorer should expose the children scorers matching the current hit to facilitate highlighting
- Now that LongValuesSource has moved to Lucene's core, the suggester module no longer needs to depend on the queries module. Likewise, the expressions module should use DoubleValuesSource, removing its dependency on the queries module, and the facets module should only use LongValuesSource and DoubleValuesSource.
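For readers wondering what "broken offsets" means in practice, here is a small Python sketch of the kind of check involved (it is not Lucene's code). It assumes that an offset pair is broken when the start is negative, the end comes before the start, or the start goes backwards relative to the previous token; the function name and error message are made up for illustration.

```python
def check_offsets(offsets):
    """Raise ValueError on the first (start, end) pair that breaks the rules."""
    last_start = 0
    for start, end in offsets:
        if start < 0 or end < start or start < last_start:
            raise ValueError(
                f"broken offsets: start={start}, end={end}, last_start={last_start}"
            )
        last_start = start


# Tokens that overlap or share a start offset are fine, e.g. a synonym
# stacked on the first word of the original text.
check_offsets([(0, 3), (0, 18), (4, 8)])

# Offsets that go backwards are rejected.
try:
    check_offsets([(4, 8), (0, 3)])
except ValueError as error:
    print(error)
```

Token filters that split or rewrite tokens without adjusting offsets carefully are the usual source of such sequences.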
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!