This Week in Elasticsearch and Apache Lucene - 2016-11-21
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Every shard deserves a home https://t.co/Iyns3sX7MQ via @elastic #Elasticsearch #DevOps
— Daniel Berman (@proudboffin) November 21, 2016
Elasticsearch Core
Changes in 2.x:
- Remove cluster update tasks when task times out to prevent memory leak.
- Added option to disable caching of term queries (disabled by default).
- A date-range with a bucket script that left from or to unbounded could result in an NPE.
- Update Joda time to v2.9.5 which fixes a bug with parsing ZZZ time zones.
- Log messages should be truncated from the end because the beginning of the message contains the most interesting parts.
- Non-dynamic settings should not be resettable with null.
- The default global search timeout was not being respected.
- The engine failed to report when it was being throttled.
- Grok's trace_match behaviour didn't work with only one grok pattern.
- Instructions for disabling deprecation logging were incorrect.
- Emulate Java8 FilePermissions in Java 9, which has removed pathname canonicalization when constructing FilePermission objects.
- Tribe nodes were prevented from using non-default ports by the security manager.
- Log4J should be shut down when the JVM exits instead of when the node exits to avoid problems in the Tribe node.
- The parent field should not be added to nested documents.
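The log truncation item above can be illustrated with a minimal sketch. This is not Elasticsearch's implementation; the function name `truncate_from_end` and the format of the trailing marker are assumptions made for illustration. It keeps the start of an over-long message and drops the tail.

```python
def truncate_from_end(message: str, max_chars: int) -> str:
    """Keep the first max_chars characters and note how much was dropped.

    Hypothetical helper: the name and marker format are illustrative only.
    """
    if len(message) <= max_chars:
        return message
    dropped = len(message) - max_chars
    # The head of the message is preserved; only the tail is cut.
    return f"{message[:max_chars]}... [{dropped} more characters]"


if __name__ == "__main__":
    print(truncate_from_end("failed to parse query [very long body follows]", 21))
```

Cutting at the end keeps the leading part of the message, which usually names the failing operation.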
Changes in 5.1:
- Queries against boolean fields that use values other than true, false, "true", or "false" are now deprecated.
- Uncommitted mapping updates should not affect existing indices.
- IndexAlreadyExistsException has been replaced with ResourceAlreadyExistsException.
- Painless gains the Elvis ?: operator.
- Functions in Painless now have access to reserved variables like doc, which allows Kibana to use (most) scripts both for scripted fields and in the script query.
- The update to Tika v1.14 allows the ingest mapper attachment to handle docs > 100kB.
- The match_phrase_prefix query didn't work on boosted fields.
- Term queries are no longer cached by default as they are fast already, and queries with many terms can result in other filters never being cached.
- The split ingest processor gains the ignore_missing parameter.
- Parsing of the level parameter in index and node stats is now strict.
- Handle DST shifts that happen one hour after midnight.
- Alias filters should be parsed on the coordinating node to allow caching of filters which use now() and to ensure that all shards see the same filter.
- Upgrade to Lucene 6.3.0.
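For readers unfamiliar with the Elvis operator mentioned in the 5.1 list, here is a rough Python sketch of its usual semantics: the left operand is returned unless it is null, in which case the right operand is used. The helper name `elvis` is hypothetical, and Python's `None` stands in for null. Because this is a plain function call, both arguments are evaluated up front, which a real operator need not do.

```python
def elvis(value, default):
    """Mimic `value ?: default`: fall back only when value is None.

    Hypothetical helper for illustration; not part of Painless or Elasticsearch.
    """
    return default if value is None else value


if __name__ == "__main__":
    print(elvis(None, "n/a"))  # None falls back to the default
    print(elvis(0, 42))        # 0 is kept because it is not None
```

Note the difference from Python's `or`, which would also replace falsy values such as 0 or an empty string.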
Changes in 6.0:
- Removed netty3 in favour of netty4.
- Removed store throttling, which has been handled automatically by Lucene since 2.x.
- Parsing of the metrics parameter in node stats is now strict.
- Enabled 5.x -> 6.x bwc tests.
- The sequence IDs branch has been merged into master to allow for easier development.
Ongoing:
- The synonym graph token filter will provide correct handling of multi-word synonyms in phrase and contextual queries.
- The cross_fields execution of the multi_match query doesn't handle synonyms correctly.
- Allow term aggs to be partitioned so that more terms can be retrieved using multiple requests.
- Lazy DNS resolution of unicast hosts would allow starting Elasticsearch before DNS entries are ready and relookup of changed entries, but has an impact on ping timeouts in Zen discovery.
- Highlighting doesn't work on keyword fields which contain non-ascii characters.
- Writing UTF8 to StreamOutput can be done more efficiently using a local buffer.
- The tribe node should be able to merge custom cluster state metadata.
- The master node should be able to retry assigning a primary shard to a node that has the shard store locked during shard state fetching.
- Binary field values should be accessible in scripts.
- Add a size ingest processor to replace the _size metafield.
- Slow application of cluster state changes can hold on to many old cluster states resulting in OOM. The whole cluster state is not required, and can be replaced with revision numbers to indicate which states have been applied.
- The new unified highlighter offers more flexible highlighting, but doesn't work with queries that need access to an index reader.
- Remove the deprecated groovy, javascript, and python scripting languages.
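The idea behind partitioning term aggregations, listed above as ongoing work, can be sketched conceptually: each term is assigned to one of N partitions by hashing, and each request asks for a single partition, so the full set of terms is retrieved over N requests. The sketch below is only an illustration of that idea. The function names and the choice of CRC32 as the hash are assumptions and do not describe how Elasticsearch implements it.

```python
import zlib


def partition_of(term: str, num_partitions: int) -> int:
    """Assign a term to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions


def terms_in_partition(terms, partition: int, num_partitions: int):
    """Return the terms one hypothetical request for `partition` would see."""
    return sorted(t for t in terms if partition_of(t, num_partitions) == partition)


if __name__ == "__main__":
    all_terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
    num_partitions = 3
    # One "request" per partition; together they cover every term exactly once.
    for p in range(num_partitions):
        print(p, terms_in_partition(all_terms, p, num_partitions))
```

Because the hash is deterministic, every term falls into exactly one partition, so the results of the separate requests never overlap.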
Apache Lucene
- QueryBuilder should allow subclasses to override createFieldQuery to allow for query parsers that properly handle multi-token synonyms
- SynonymFilter, which has long-standing tricky bugs with multi-token synonyms, may soon be replaced by SynonymGraphFilter, offering a path to fix those bugs
- Index time sorting now supports sorting on multi-valued fields using selectors
- The simplistic in-heap BKD index can be optimized to require less heap
- Codec-level encryption continues iterating, this time getting some technical documentation
- Lucene may soon have an implementation of a logistic regression classifier
- Subclasses of the primary node in NRT segment based replication should have access to the IndexWriter
- ASCIIFoldingFilterFactory.getMultiTermComponent no longer emits the original token even when preserveOriginal is true
- The classic query parser no longer allows autoGeneratePhraseQueries when splitOnWhitespace is false
- The changes-to-html generator should not rely on Jira being up
- A possible ICU 58.1 upgrade is on hold because of unexplained ICU assertion test failures
- A write-once attribute analysis chain has some very nice properties but is a massive change
- Can we make it easier to create graph tokenizers and token filters?
- Nested SpanNearQueries miss some hits today, but the fix is surprisingly tricky and only addresses some cases
- Lucene should provide axiomatic similarities, six in all
- The release smoke tester now fails if CHANGES.txt seems to be from the future
- UnifiedHighlighter's Passage is now public, for better extensibility; its passage relevancy has improved and it should let you set the max passage character length
- AnalyzingInfixSuggester now closes its IndexWriter by default after build
- We were missing backwards compatibility coverage for sorted indices
- The document based suggester sometimes threw NullPointerException on level 2 ghost fields
- End-of-line characters were causing false failures in our precommit tests
- Doc values queries now use the new iterator API directly
- Lucene's RollingBuffer utility class was missing a getter for its internal buffer size
- The new BooleanSimilarity caused problems for tests that expect similarity implementations to be sane
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!