This Week in Elasticsearch and Apache Lucene - 2016-10-17
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Upgrade to #Elasticsearch 5.0.0-rc1 paying back. A small benchmark in my project shows consistent reduction in response time by 25%.
— DoHyung Kim (@_dohyung_) October 13, 2016
Elasticsearch Core
2.4:
- The match_phrase_prefix query on the _all field was running a term query instead of a prefix query.
- The position_increment_gap on not_analyzed fields should be ignored for bwc instead of throwing an exception.
- The multi_match query should not accept an array of query strings.
- AbstractArrays that release their bytes more than once can lead to incorrect circuit breaker memory counting.
- The FVH was not extracting terms from the SynonymQuery.
- The results from script queries should not be cached.
- Cluster settings updates should not be blocked by circuit breakers.
- Changing how we split strings broke parsing of the S3 repository base path.
- Shadow replicas should not increment their allocation ID when being promoted to primary.
- The elasticsearch-plugin script no longer displays plugin version numbers as this introduces confusion about how plugins should be referred to.
- Netty4 was not closing connections correctly.
- Cat APIs are now sortable.
- Mustache gains a {{#url}} function that knows how to do URL escaping.
- Source filtering should be able to step into paths in dotted field names.
- Update scripts now have the current timestamp available in ctx._now.
- Self-referencing objects can result in stack overflow exceptions.
- Source-filtering automatons should only be compiled once.
- The whole multi-get request should not fail if an alias resolves to too many indices.
- Alias filters should be parsed on the coordinating node instead of on each shard, so that filters are the same for all shards, and so they can be rewritten.
- Keep snapshot restore state and the routing table in sync to avoid failed communication resulting in an unrecoverable state.
- Sequence numbers are now written to commits using Lucene's API, so that the max seq id is accurately reflected in the commit, including deletes.
- Bulk requests have been simplified and limited to DocumentRequests, which simplifies execution and error handling.
- Threadpools no longer impose an artificial limit of 32 processors.
- ObjectParser is now used by Score, Field, and ScriptSortBuilder.
- Synonyms should be parsed with the analyzer chain specified in the analyzer, not just whitespace.
- Logstash and Beats 5.x templates should be tested for bwc against master.
- Alias names should undergo the same validation as index names.
- Reindex and friends should be parallelisable.
- Storing the execution context of Painless scripts can make them much faster.
- Can the _all field be replaced by an _all query?
- Negated index expressions should only be taken into consideration when there is a positive wildcard.
- Template validation should take into account other possible matching templates.
- The tribe node should be able to store custom cluster state metadata, in order to support licensing.
- Searches should be cancellable in the task management API.
- The term suggester should have the option to return exact matches.
- Rank evaluation should support search templates.
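As a rough illustration of the dotted-field-name item in the list above, the following Python sketch filters a document source by a single include path, treating each dot either as a step into a sub-object or as part of a literal key. The function name filter_source and the sample document are hypothetical, and this is not how Elasticsearch implements source filtering (the list above mentions automatons); it is only a minimal model of the intended behaviour.

```python
_MISSING = object()  # sentinel, so that JSON null (None) leaves are still kept


def filter_source(source, include_path):
    """Keep only the part of `source` reachable via a dotted include path."""

    def walk(node, parts):
        if not parts:
            return node  # path fully consumed: keep this value as-is
        if not isinstance(node, dict):
            return _MISSING  # cannot step into a scalar (lists are ignored here)
        kept = {}
        # A key may itself contain dots, so try every prefix of the
        # remaining path as a literal key: "foo", "foo.bar", "foo.bar.baz".
        for i in range(1, len(parts) + 1):
            key = ".".join(parts[:i])
            if key in node:
                sub = walk(node[key], parts[i:])
                if sub is not _MISSING:
                    kept[key] = sub
        return kept if kept else _MISSING

    result = walk(source, include_path.split("."))
    return {} if result is _MISSING else result


# Hypothetical document whose top-level key already contains a dot.
doc = {"foo.bar": {"baz": 1, "qux": 2}, "other": 3}
print(filter_source(doc, "foo.bar.baz"))
```

With the sample document, the include path foo.bar.baz keeps only the baz entry under the dotted key foo.bar, and the same path also matches a conventionally nested foo, bar, baz structure.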
Apache Lucene
- SimpleQueryParser now parses * to MatchAllDocsQuery
- The 7.0 codec now stores norms sparsely and will soon store sparse doc values more efficiently, taking advantage of the new iterator-based APIs for doc values
- Lucene70NormsFormat was over-synchronized
- A fun usage of Lucene's Automaton APIs to implement source filtering in Elasticsearch led to Lucene simplifications to make it clear that an automaton's initial state is always 0
- FastVectorHighlighter failed to highlight SynonymQuery
- The benchmark module now supports all Lucene highlighters
- JapaneseNumberFilter should not invoke incrementToken on its input after it had already returned false
- Lucene60SegmentInfoFormat gets its own dedicated test case
- The code fragment in the javadocs for LRUQueryCache was stale
- DisjunctionMaxQuery does not work correctly with sub-queries that return negative scores
- Lucene's complex efforts to pretend it has no schema were buggy if you suddenly started indexing dimensional points into a pre-existing index; this does not affect Elasticsearch since its mappings ensure we never mix points and no points for a given field in the same index
- Our AssertingDocValuesFormat, used by tests to validate all arguments and return values passed to and from the doc values APIs, was too lenient in checking the target argument to advance; stricter checking would have prevented this otherwise tricky-to-debug test failure
- Should the dimensional points APIs take a field name up front, like most other LeafReader APIs?
- A nasty JVM bug causing an unexpected AssertionError in Lucene's ByteSliceReader has finally been fixed
- The UAX_URL_EMAIL tokenizer still needs to be modernized to detect all top-level domain names
- Lucene's facets module does not let you get facets without hits today
- A lack of synchronization or of the volatile keyword means that, in theory, an NRT reader refresh might not see a recent change to the index, though in practice it is likely a non-issue
- Lucene's classic QueryParser mis-parses OR as AND, but it's unfortunately a known limitation
- LMDirichletSimilarity incorrectly ignores query terms that do not appear in the current segment
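To make the DisjunctionMaxQuery item above more concrete, here is a small Python sketch of disjunction-max scoring: the best sub-score plus a tie-breaker multiple of the remaining sub-scores. The list above does not say how the negative-score problem arose, so the second function is purely a hypothetical failure mode (a running maximum that starts at zero), included only to show why negative sub-scores need care. Both function names are made up for this example and neither is Lucene code.

```python
def dismax_score(sub_scores, tie_breaker=0.0):
    """Best sub-score plus tie_breaker times the sum of the other sub-scores."""
    if not sub_scores:
        raise ValueError("at least one matching sub-query is required")
    best = max(sub_scores)
    return best + tie_breaker * (sum(sub_scores) - best)


def dismax_score_zero_init(sub_scores, tie_breaker=0.0):
    """Hypothetical buggy variant: the running maximum starts at 0.0.

    This is an assumption made for illustration, not a description of the
    actual Lucene issue. With all-negative sub-scores the maximum is never
    updated, so the result is wrong.
    """
    best = 0.0
    total = 0.0
    for score in sub_scores:
        total += score
        best = max(best, score)
    return best + tie_breaker * (total - best)


negative = [-2.0, -0.5]
print(dismax_score(negative))            # best sub-score is -0.5
print(dismax_score_zero_init(negative))  # stuck at the 0.0 starting value
```

For non-negative sub-scores the two functions agree, which is one reason a problem of this kind could go unnoticed until a sub-query produces a negative score.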
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!