This Week in Elasticsearch and Apache Lucene - October 19th 2015
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Just released new #elasticsearch #python clients - 2.0.0 for upcoming Elasticsearch 2.0 and 1.8.0 for 1.x versions - https://t.co/VUmgunhAgP
— Honza Král (@HonzaKral) October 13, 2015
Last week we released Elasticsearch 1.7.3, which contained some good bug fixes for the Tribe node, synced flushing, and snapshot restore.
Changes in 2.0:
- Testing of 2.0.0-RC1 uncovered a serious bug with Field/Document Level Security, which caused OOMs during bulk indexing and required some major refactoring to fix (#14070, #14071, #14084).
- File/directory permissions in RPM/Deb packages have been tightened up so that config files/plugin dirs are readable by elasticsearch but writeable only by root.
- The `default_index_analyzer` has been renamed to `default_analyzer` to be consistent with the `index_analyzer` -> `analyzer` mapping changes.
- Guava is gone! A big cleanup to remove a dependency which can often clash with a user's own dependencies.
- The query part of search requests is now parsed on the coordinating node.
- There was a lot of debate about the syntax of the simple query string parser and how to make it more intuitive.
- The new scripting language is taking shape. Now working on using invokeDynamic to reduce the need for strict typing.
- In memory fielddata support is being removed in 3.0 for field types which support doc values.
- Trying to offload the memory used by global ordinals for multi-value fields to disk.
- Aggs are being refactored to be parsed on the coordinating node, just like queries.
- GeoPoints v2 PRs are almost ready to be merged into 2.1.
- Multi-dimensional BKD trees may be promoted from Lucene sandbox to core, which would allow us to add support for 3D Geo plus much requested features like IPv6, BigInt, and BigDecimal.
- Upgrade ANTLR to version 4.5.1 for numerous bug fixes
- Add getters for the query cache and caching policy on
- `SpanOrQuery` is now immutable
- `OfflineSorter` now uses Lucene's `Directory` abstraction instead of secretly trying to consume temp directory space
- `equals` now ignore clause order
- `BoostQuery` now adds parens around the boosted query, for the future Lucene 6.0 only
- At long last we can deprecate the `Filter` class, now that its capabilities are fully folded into `Query` and all internal usage in Lucene has been cut over
- Remove the slow `RegexQuery` from Lucene's sandbox: Lucene's core `RegexpQuery` (note the extra p!) is faster
- Java 9 has stricter type inference
- Simplify the base
to search a ring instead of a circle
- Refactor recent geo tests to improve test coverage and fix a few accuracy bugs
- We should add a `DimensionalFormat` to Lucene's codec, to enable fast numeric and spatial searching on arbitrary byte
- LZ4 decompression can be costly if you load too many stored documents
- We can greatly reduce (96% in one test!) the heap usage for certain doc values by moving the storage for ordinals to disk
- Should we add a new
to better optimize query execution?
- `TermQuery` should clone the incoming term
- Lucene's classifier should be able to classify an already indexed document
- `JapaneseTokenizer` should produce the top N possible tokenizations, not just the top 2
- Move the "delete file retry logic" down under `IndexWriter`, since this is really a Windows-only limitation
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!