This Week in Elasticsearch and Apache Lucene
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
My blog about using @Elasticsearch as a time series database: https://t.co/KaUitUuYvb
— Felix Barnsteiner (@felix_b) November 4, 2015
Elasticsearch Core
- Benchmarks comparing the old completion suggester using payloads with the new completion suggester using doc values.
- The Azure repository plugin now has support for multiple repository credentials.
- Transport options are now immutable.
- The thread interrupt flag is now restored properly after an InterruptedException. Work still needs to be done to make the BulkProcessor handle interrupts correctly.
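To illustrate the interrupt-flag item above, here is a minimal sketch of the general Java idiom. It is not the actual Elasticsearch change: the class `InterruptRestoreSketch` and the helper `pollQuietly` are hypothetical names invented for this example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class InterruptRestoreSketch {

    /**
     * Hypothetical helper: waits for the next item and returns null if the
     * wait times out or is interrupted.
     */
    static String pollQuietly(BlockingQueue<String> queue, long timeoutMillis) {
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // The interrupt flag is cleared when InterruptedException is thrown.
            // Set it again so that callers further up the stack can still
            // observe the interrupt and react to it.
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread.currentThread().interrupt();      // simulate an interrupt arriving
        String item = pollQuietly(queue, 1000);  // returns at once because of the interrupt
        System.out.println("item = " + item);
        System.out.println("still interrupted = " + Thread.currentThread().isInterrupted());
        Thread.interrupted();                    // clear the flag before exiting
    }
}
```

Without the call to `Thread.currentThread().interrupt()` in the catch block, the interrupt would be silently lost and code higher up would have no way to know that it should stop.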
Changes in master:
- After a number of issues with delayed allocation were fixed, delayed allocation has been refactored to make it easier to follow.
- As part of the change to wait for shard failures to be acknowledged, the acknowledgement step now respects a timeout.
- The OS stats now include the CPU usage, which will also be exposed in Marvel.
- Added a variable-length long encoding that supports negative values.
- Response filtering now uses native Jackson streaming support, making it faster with less custom code. It also handles escaping of dots correctly and can be used to filter `_source`.
- GeoShapes are now built with ShapeBuilders, much like we do with QueryBuilders. The next step is to make them serializable.
- The move to Gradle continues with:
  - Improve integration test startup behavior
  - Fix integTest output if the elasticsearch script fails
  - Add javadocs jars
  - Fix RecoveryBackwardsCompatibilityIT
  - Improve test output on errors and when debugging
  - Use JDK at JAVA_HOME for compiling/testing, and improve build info output
  - Add jar hell check before tests run
- On top of that, Gradle can now build RPMs and Debian packages, and we'll soon have a test to check that packages are signed.
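For readers curious how a variable-length long encoding can support negative values, one common approach is to combine ZigZag mapping with a 7-bits-per-byte varint. The sketch below shows that general technique. We are assuming it for illustration, and it is not necessarily byte-for-byte what the commit does; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class ZigZagVLongSketch {

    static void writeZLong(ByteArrayOutputStream out, long value) {
        // ZigZag mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        // Small magnitudes, positive or negative, become small unsigned numbers.
        long zigzag = (value << 1) ^ (value >> 63);
        // Emit 7 bits per byte, low bits first; a set high bit means more bytes follow.
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80));
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    static long readZLong(ByteArrayInputStream in) {
        long zigzag = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) {
                throw new IllegalStateException("truncated input");
            }
            zigzag |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        // Undo the ZigZag mapping.
        return (zigzag >>> 1) ^ -(zigzag & 1);
    }

    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeZLong(out, value);
        return out.toByteArray();
    }

    static long decode(byte[] bytes) {
        return readZLong(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 1L, -65L, 1L << 40, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long v : samples) {
            byte[] bytes = encode(v);
            System.out.println(v + " -> " + bytes.length + " byte(s), decoded " + decode(bytes));
        }
    }
}
```

A plain varint would spend the maximum number of bytes on every negative number, because the sign bit is the highest bit. The ZigZag step moves the sign into the lowest bit, so `-1` fits in a single byte.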
Ongoing work:
- Twelve aggregations have been refactored in the Aggs Refactoring branch... Only 30 left to do!
- A PR for the Query Profiler is looking for reviews.
- The tribe node doesn't play well with non-default configuration paths.
- Make the BulkProcessor back off and retry after request handling has been rejected due to a full queue.
- The new scripting language is much more succinct than it was before and is now able to call methods on string constants.
- The first step in ensuring that writes are not lost while the primary is relocated is ready for review: this decouples the actual write from the notion of which node is in charge of ensuring that the write happens.
- Sequence numbers are now being added to each write and work has started on local checkpoints.
- We will store allocation IDs with each shard to ensure that we choose the most recent shard at recovery time.
- A task management prototype has been implemented, but testing it is proving difficult.
- Work has started on splitting the `string` field datatype into separate `text` and `keyword` field types.
- Improvements to exceptions to stop swallowing stack traces.
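The back-off-and-retry idea mentioned above for the BulkProcessor can be sketched generically as exponential backoff around a rejected request. This is only an illustration under our own assumptions: `callWithBackoff`, its parameters and the use of `RejectedExecutionException` as the "queue full" signal are hypothetical choices for this example, not the BulkProcessor API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

public class BackoffRetrySketch {

    /**
     * Hypothetical helper: runs the request and, if it is rejected, waits and
     * tries again, doubling the wait after each rejection.
     */
    static <T> T callWithBackoff(Callable<T> request, int maxRetries, long initialDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return request.call();
            } catch (RejectedExecutionException e) {
                if (attempt >= maxRetries) {
                    throw e;        // retries exhausted, surface the rejection
                }
                Thread.sleep(delay); // give the queue time to drain
                delay *= 2;          // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated request: rejected twice, then accepted.
        String result = callWithBackoff(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new RejectedExecutionException("queue full");
            }
            return "indexed";
        }, 5, 10L);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

Doubling the delay keeps a struggling node from being hammered with immediate retries, while the retry cap ensures a persistent failure is still reported to the caller.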
Apache Lucene
- We have a volunteer release manager for 5.4.0!
- Factor out components of the XML query parser so consumers can inherit from classes without requiring `sandbox` module code
- Optimize stored fields retrieval to avoid skipping the last field in the document if it's not needed, but this will only help if the last field is more than 16 KB at default settings
- Upgrade randomized testing to version 2.3.1, to get several improvements
- Prefix-coding the values in block KD-tree leaf blocks gives a sizable reduction in the already small index size, with only a small slowdown in query performance
- Add optimizations/specializations for bulk merging in the common case of one-dimensional values, resulting in sizable speedups that make indexing faster than with numeric fields
- `WordDelimiterFilter` should respect `KeywordAttribute`
- Don't use `null` to represent sorting by relevance!
- `PhraseQuery` does, in fact, allow more than one term at the same position, but it's interpreted differently (conjunction) than `MultiPhraseQuery` (disjunction)
- Many similarities incorrectly treat a 0 norm value as an infinitely long document
- A `GeoDistanceRangeQuery` that overlaps one or both of the poles is problematic, adding risk to Santa Claus's upcoming flight planning
- `UnicodeWhitespaceTokenizer` splits tokens on any Unicode-defined whitespace character
- The new `matchCost` method will let Lucene execute conjunction queries, with multiple two-phased clauses, more efficiently
- If you search on a massive shape, such that it manages to wrap around and span the entire earth, we should rewrite that to just match all documents with the field
- `GeoPointDistanceQuery` matches the wrong documents when it has to cross the dateline
- `IOUtils.fsync` should not retry on hitting `IOException` since that means a great disturbance has occurred
- `JapaneseTokenizer` should offer more than two possible tokenizations
- Add some simple optimizations for filters in `BooleanQuery.rewrite`
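As a rough illustration of why prefix-coding the values in a leaf block saves space, the sketch below stores the bytes shared by every value once and then only the differing suffixes. It shows the general idea only, under our own assumptions, and is not Lucene's actual on-disk format; `PrefixCodingSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

public class PrefixCodingSketch {

    /** Number of leading bytes shared by every value in the block. */
    static int commonPrefixLength(byte[][] values) {
        if (values.length == 0) {
            return 0;
        }
        int prefix = values[0].length;
        for (int i = 1; i < values.length; i++) {
            int limit = Math.min(prefix, values[i].length);
            int j = 0;
            while (j < limit && values[i][j] == values[0][j]) {
                j++;
            }
            prefix = j;
        }
        return prefix;
    }

    /**
     * Size in bytes of a block stored as: one byte for the prefix length
     * (an assumption of this sketch), the shared prefix once, then each suffix.
     */
    static int encodedSize(byte[][] values) {
        int prefix = commonPrefixLength(values);
        int size = 1 + prefix;
        for (byte[] value : values) {
            size += value.length - prefix;
        }
        return size;
    }

    /** Big-endian 4-byte encoding of an int, used here as sample values. */
    static byte[] toBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    public static void main(String[] args) {
        // Values that land in the same leaf tend to be close together,
        // so their high-order bytes are often identical.
        byte[][] leaf = {toBytes(1000), toBytes(1001), toBytes(1002), toBytes(1003)};
        int raw = leaf.length * 4;
        System.out.println("shared prefix bytes: " + commonPrefixLength(leaf));
        System.out.println("raw size: " + raw + ", prefix-coded size: " + encodedSize(leaf));
    }
}
```

The trade-off matches the one described in the list above: the block gets smaller, but a reader has to stitch the prefix and suffix back together, which costs a little at query time.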
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!