Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Finished write-up over holidays: Everything we know about #elasticsearch for e-commerce sites @elastic @sprysys https://t.co/y1tXjMU5Cw
— Martin Loetzsch (@martin_loetzsch) January 9, 2017
Elasticsearch Core
Changes in 5.1:
- The field stats API now has support for geo-point fields.
- Don't close store under CancellableThreads.
- Ensure shrunk indices carry over version information from their source.
- Primary relocation for shadow indices had a hidden bug that caused the source to fail itself before performing a full recovery.
Changes in 5.2:
- The cluster-allocation-explain API now uses the allocation process itself to explain shard allocation decisions, which means that the explanation will always be in sync with reality.
- Snapshot repositories containing pre-2.x snapshots compressed with LZF can now be read again (although the pre-2.x snapshots cannot be restored).
- The minimum_number_should_match parameter is deprecated in favour of minimum_should_match.
- Certain exceptions during index deletion weren't being caught and could cause cluster state application to fail.
- The low level node handshake has been moved from #connectToNode to #openConnection to prevent bypassing.
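If you keep stored queries that still use the deprecated minimum_number_should_match spelling, the rename can be applied mechanically. Here is a minimal sketch, assuming the query body is held as plain Python dicts and lists and that the parameter sits on a bool query as in the illustrative example. The helper name rename_deprecated_msm is hypothetical and not part of any Elasticsearch client.

```python
DEPRECATED = "minimum_number_should_match"
PREFERRED = "minimum_should_match"


def rename_deprecated_msm(node):
    """Return a copy of a query body with the deprecated key renamed.

    Walks nested dicts and lists. If both spellings appear in the same
    object, the preferred one wins and the deprecated one is dropped.
    """
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == DEPRECATED:
                continue  # handled after the loop
            out[key] = rename_deprecated_msm(value)
        if DEPRECATED in node and PREFERRED not in node:
            out[PREFERRED] = node[DEPRECATED]
        return out
    if isinstance(node, list):
        return [rename_deprecated_msm(item) for item in node]
    return node


# Illustrative bool query using the deprecated spelling.
query = {
    "bool": {
        "should": [
            {"term": {"color": "red"}},
            {"term": {"size": "m"}},
        ],
        "minimum_number_should_match": 1,
    }
}
migrated = rename_deprecated_msm(query)
```

The sketch renames the key wherever it appears as a dict key, so a document field that happened to share the name would be renamed too. A real migration would probably want to restrict the rewrite to known query types.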
Changes in 5.3:
- Added infrastructure for storing sensitive settings (e.g. passwords) in a password-protected keystore.
- The new ToXContentObject interface represents complete objects, while ToXContent is to be used for object fragments which don't output opening and closing curlies.
- Painless strings didn't support escaping of quotes with backslashes.
- Add support for ca-central-1 region to EC2 and S3 plugins.
Changes in master:
- Socket, ServerSocket, and HttpServer usages in tests replaced with mocksocket versions to move SocketPermissions out of core. Also, moved IfConfig.logIfNecessary (which also requires socket perms) into bootstrap, before the security manager is applied.
- Added the first method, ping, to the new Java HighLevelRestClient.
- Version now implements Comparable.
- Disable the Netty recycler and pooled allocator as they seem to be more trouble than they are worth.
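To see why a comparable Version is handy, here is a rough Python analogue. It is not the Elasticsearch class, which is Java; the three-field layout and the parse helper are assumptions made for illustration. Once versions have an ordering, finding the oldest or newest node version in a cluster is a one-liner.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """Hypothetical stand-in for a release version.

    order=True makes instances compare field by field, in declaration
    order: major first, then minor, then revision.
    """
    major: int
    minor: int
    revision: int

    @classmethod
    def parse(cls, text):
        # Assumes exactly three dot-separated integers, e.g. "5.1.2".
        major, minor, revision = (int(part) for part in text.split("."))
        return cls(major, minor, revision)


nodes = [Version.parse(v) for v in ("5.1.2", "5.0.0", "5.2.0", "2.4.4")]
oldest = min(nodes)
newest = max(nodes)
```

Comparing parsed integers rather than raw strings matters here: as strings, "5.10.0" would sort before "5.9.0".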
Ongoing changes:
- #namedObject is replacing SearchExtRegistry, AggregatorParsers, Suggesters, and AllocationCommands.
- The syntax for lower/upper bounds of stddev can be simplified.
- All booleans everywhere should be strictly evaluated.
- Can field collapsing on search hits be done more efficiently and simply with search instead of top hits?
- Custom routing can be used to target a subset of shards instead of just one shard.
- Aggs over indices which return a mix of floats and integers should treat all numbers as doubles.
- Nested and parent-child queries are ignoring the ignore_unmapped parameter.
- Remove unneeded weak reference from prefix logger to resolve potential memory leak.
- S3 is being moved to use the new secure settings infra.
- Remove support for the _all field in 6.0.
- Sequence numbers allow for fast recovery when a replica has fallen out of sync with the master.
- The timezone and date format should be normalised when rewriting range queries for caching.
- An adjacency matrix aggregation can show co-occurrence of terms.
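To make the adjacency matrix idea a little more concrete, here is a small Python sketch of the counting involved. It assumes documents are plain dicts and filters are named predicates. The function name and the "A&B" key format for pairs are choices made for this illustration, not a description of the aggregation's actual request or response format.

```python
from itertools import combinations


def adjacency_matrix(docs, filters):
    """Count documents per named filter and per pair of filters.

    docs    -- iterable of documents (any objects)
    filters -- dict mapping a filter name to a predicate over a document
    Returns a dict keyed by "A" for single filters and "A&B" for pairs,
    omitting empty buckets.
    """
    names = sorted(filters)
    counts = {}
    for doc in docs:
        # Names stay in sorted order, so pair keys are deterministic.
        matched = [name for name in names if filters[name](doc)]
        for name in matched:
            counts[name] = counts.get(name, 0) + 1
        for left, right in combinations(matched, 2):
            key = f"{left}&{right}"
            counts[key] = counts.get(key, 0) + 1
    return counts


docs = [
    {"tags": ["lucene", "search"]},
    {"tags": ["lucene", "java"]},
    {"tags": ["search"]},
    {"tags": ["lucene", "search", "java"]},
]
filters = {
    "java": lambda d: "java" in d["tags"],
    "lucene": lambda d: "lucene" in d["tags"],
    "search": lambda d: "search" in d["tags"],
}
matrix = adjacency_matrix(docs, filters)
```

With these four documents, "lucene" and "search" co-occur twice while "java" and "search" co-occur only once, which is the kind of relationship the aggregation is meant to surface.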
Apache Lucene
- Lucene should better optimize the case of costly multi-term and point queries AND'd with a fast, restrictive query
- WordDelimiterGraphFilter, to replace WordDelimiterFilter, would finally work with positional queries correctly at search time, eventually fixing this Elasticsearch issue
- Dimensional points now try harder to split on all dimensions, even in slivery cases
- UnifiedHighlighter should let you customize how candidate passages are created, wrapping a sentence BreakIterator by default
- Lucene's GroupingSearch helper class could use some improvements
- The Surround query parser should be modernized to use the numerous new Lucene APIs added since it was created
- Like SynonymQuery, which scores multiple synonym terms as a single term, SpanSynonymQuery would do the same thing for span queries
- BooleanQuery could quite easily allow for per-document minimumShouldMatch, instead of the single global value you can provide today
- Now that LongValuesSource has moved to Lucene's core, the suggester module no longer needs to depend on the queries module. Likewise, the expressions module should use DoubleValuesSource, removing its dependency on the queries module, and the facets module should only use LongValuesSource and DoubleValuesSource.
- The ComplexPhraseQueryParser should also handle a single multi-term query in quotes
- AnalyzingInfixSuggester no longer relies on the misc module since we promoted index sorting as a core feature
- The FlattenGraphFilter, added so multi-token synonyms work correctly, was not correctly handling broken incoming token offsets
- AutomatonTermsEnum gave a confusing exception if you passed a special-case CompiledAutomaton
- DrillSideways, letting you still see other facet counts even after you've drilled down, now uses threads to gain concurrency
- Lucene was not always enforcing that PositionLengthAttribute was > 0, possibly causing illegal cyclic token streams
- Query parsers can now handle graph token streams, finally fixing multi-token synonyms with positional queries to behave correctly
- A possible optimization to DocValuesRangeQuery may not pan out
- The obscure git mailmap feature lets us coalesce commits by the same person using different names and/or email addresses over time
- We had to disable a test case on Java 9 but we are not exactly sure why
- The QueryNode.toQueryString API in the flexible query parser illegally claims to create a string which, when parsed, would result in exactly this same query node
- CustomAnalyzer was only applying character normalization to the last TokenFilter
- Java 9 breaks Lucene's efforts to estimate RAM usage of JDK runtime classes
- A tricky non-reproducing test failure turned out to be a concurrency bug in AnalyzingInfixSuggester
- LongValuesSource and DoubleValuesSource have been promoted to Lucene's core module
- LeafFieldComparator.setScorer is now allowed to throw IOException
- Index sorting was failing to ask the codec to create a mutable bit set while flushing a new segment
- Our source code checks (tabs vs whitespace, nocommits, etc.) should also check .xml and .template files
- A Jenkins test failure in Geo3D resulted in adding a new threshold, Vector.MINIMUM_ANGULAR_RESOLUTION, to reject too-small slivers
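The PositionLengthAttribute item in the list above is easier to picture if you treat a token stream as a graph: each token is an edge that starts at its position and ends position-length nodes later. The Python sketch below is only an illustration of that idea, not Lucene code. The function name and the (term, position increment, position length) tuple layout are assumptions. It rejects a position length below 1, since such a token would end at or before the node where it starts.

```python
def token_graph_edges(tokens):
    """Turn (term, position_increment, position_length) triples into edges.

    Each token becomes an edge (from_node, to_node, term). A position
    length below 1 would make an edge that ends at or before its own
    start node, so it is rejected.
    """
    edges = []
    position = -1  # a first token with increment 1 lands on node 0
    for term, pos_inc, pos_len in tokens:
        if pos_inc < 0:
            raise ValueError(f"{term!r}: position increment must be >= 0")
        if pos_len < 1:
            raise ValueError(f"{term!r}: position length must be > 0")
        position += pos_inc
        edges.append((position, position + pos_len, term))
    return edges


# "wifi" spans two positions, in parallel with "wi" followed by "fi".
tokens = [("wifi", 1, 2), ("wi", 0, 1), ("fi", 1, 1), ("router", 1, 1)]
edges = token_graph_edges(tokens)
```

In this example "wifi" and the pair "wi", "fi" both lead from node 0 to node 2, which is the shape a multi-token synonym takes in a graph token stream.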
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!