This Week in Elasticsearch and Apache Lucene - October 19th 2015
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Just released new #elasticsearch #python clients - 2.0.0 for upcoming Elasticsearch 2.0 and 1.8.0 for 1.x versions - https://t.co/VUmgunhAgP
— Honza Král (@HonzaKral) October 13, 2015
Elasticsearch Core
Last week we released Elasticsearch 1.7.3, which contains some good bug fixes for the Tribe node, synced flushing, and snapshot restore.
Changes in 2.0:
- Testing of 2.0.0-RC1 uncovered a serious bug with Field/Document Level Security, which caused OOMs during bulk indexing and required some major refactoring to fix (#14070, #14071, #14084).
- File/directory permissions in RPM/Deb packages have been tightened up so that config files/plugin dirs are readable by elasticsearch but writeable only by root.
- The `default_index_analyzer` has been renamed to `default_analyzer` to be consistent with the `index_analyzer` -> `search_analyzer` mapping changes.
- Guava is gone! A big cleanup to remove a dependency which can often clash with a user's own dependencies.
- The query part of search requests is now parsed on the coordinating node.
In progress:
- There was a lot of debate about the syntax of the simple query string parser and how to make it more intuitive.
- The new scripting language is taking shape. Now working on using invokeDynamic to reduce the need for strict typing.
- In memory fielddata support is being removed in 3.0 for field types which support doc values.
- Trying to offload the memory used by global ordinals for multi-value fields to disk.
- Aggs are being refactored to be parsed on the coordinating node, just like queries.
- GeoPoints v2 PRs are almost ready to be merged into 2.1.
- Multi-dimensional BKD trees may be promoted from Lucene sandbox to core, which would allow us to add support for 3D Geo plus much requested features like IPv6, BigInt, and BigDecimal.
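To give a feel for why a byte-oriented index could cover types like IPv6 and BigInt, here is a minimal Python sketch. It is not Lucene or Elasticsearch code; the helper names `encode_ipv6`, `encode_signed`, and `in_range` are hypothetical and exist only for this illustration. The assumption it shows is that each value is encoded as a fixed-width byte string whose byte-wise order matches the natural order of the values, so a range check becomes a plain byte comparison.

```python
import ipaddress


def encode_ipv6(addr: str) -> bytes:
    """Hypothetical helper: 16 big-endian bytes, so byte order == address order."""
    return ipaddress.IPv6Address(addr).packed


def encode_signed(value: int, width: int = 16) -> bytes:
    """Hypothetical helper: fixed-width, order-preserving encoding of a signed int.

    Adding a bias of 2^(bits-1) maps the signed range onto the unsigned range,
    so negative values sort before positive ones when compared as bytes.
    """
    bias = 1 << (8 * width - 1)
    return (value + bias).to_bytes(width, "big")


def in_range(value: bytes, lower: bytes, upper: bytes) -> bool:
    """Inclusive range check done purely on the encoded bytes."""
    return lower <= value <= upper


if __name__ == "__main__":
    lo, hi = encode_ipv6("2001:db8::"), encode_ipv6("2001:db8::ffff")
    print(in_range(encode_ipv6("2001:db8::42"), lo, hi))
    print(in_range(encode_signed(-7), encode_signed(-10), encode_signed(2**100)))
```

Once every type reduces to comparable bytes, one index structure can in principle serve all of them, which is the appeal of the feature discussed above.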
Apache Lucene
- Upgrade ANTLR to version 4.5.1 for numerous bug fixes
- Add getters for the query cache and caching policy on `IndexSearcher`
- `SpanOrQuery` is now immutable
- `OfflineSorter` now uses Lucene's `Directory` abstraction instead of secretly trying to consume temp directory space
- `BooleanQuery` `hashCode` and `equals` now ignore clause order
- `BoostQuery` now adds parens around the boosted query, for the future Lucene 6.0 only
- At long last we can deprecate the `Filter` class, now that its capabilities are fully folded into `Query` and all internal usage in Lucene has been cutover
- Remove the slow `RegexQuery` from Lucene's sandbox: Lucene's core `RegexpQuery` (note the extra p!) is faster
- Java 9 has stricter type inference
- Simplify the base `Query.equals` method
- Add `GeoPointDistanceRangeQuery` to search a ring instead of a circle
- Refactor recent geo tests to improve test coverage and fix a few accuracy bugs
- We should add a `DimensionalFormat` to Lucene's codec, to enable fast numeric and spatial searching on arbitrary byte[]
- LZ4 decompression can be costly if you load too many stored documents
- We can greatly reduce (96% in one test!) the heap usage for certain doc values by moving the storage for ordinals to disk
- Should we add a new `matchCost` method to `TwoPhaseDocIdSetIterator` to better optimize query execution?
- `TermQuery` should clone the incoming term
- Lucene's classifier should be able to classify an already indexed document
- `JapaneseTokenizer` should produce the top N possible tokenizations, not just the top 2
- Move the "delete file retry logic" down under `Directory`, from `IndexWriter`, since this is really a Windows-only limitation
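For readers wondering what "a ring instead of a circle" means in practice, here is a small Python sketch of the idea behind a distance range query. It is not how `GeoPointDistanceRangeQuery` is implemented; the function names `haversine_km` and `in_ring` and the Earth radius constant are assumptions made for this illustration. A point matches when its great-circle distance from the center falls between a minimum and a maximum radius.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, assumed for this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_ring(center, point, min_km: float, max_km: float) -> bool:
    """True if point lies between min_km and max_km from center (inclusive).

    With min_km == 0 this degenerates to an ordinary circle (distance) check.
    """
    d = haversine_km(center[0], center[1], point[0], point[1])
    return min_km <= d <= max_km


if __name__ == "__main__":
    center = (0.0, 0.0)
    one_degree_north = (1.0, 0.0)  # roughly 111 km away
    print(in_ring(center, one_degree_north, 50, 150))
    print(in_ring(center, one_degree_north, 0, 100))
```

A real index would avoid computing the distance for every document, for example by first pruning with bounding boxes, but the matching rule is the same.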
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!