This Week in Elasticsearch and Apache Lucene - 2017-05-22
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Using #machinelearning and elasticsearch for security analytics: a deep dive https://t.co/cqnxvAPcct pic.twitter.com/cY2lZCs52Z
— Bruno Costa (@BrunoElo) May 22, 2017
Java High Level REST Client
In an arduous and heroic campaign, three brave knights (Sir Luca, Sir Tanguy, and Sir Christoph) have scythed their way through ferocious hordes of aggregations and have finally destroyed the last one standing. Victory! This means that the new REST client (which will replace the transport client) is close to landing. More remains to be done: the branch still has to be merged to master and then backported to 5.x, plus a lot of tests remain to be written, but the battle is close to being won! The REST client covers only the essential APIs for now (ping, info, index, bulk, get, exists, delete, update, search) but this list will be expanded over time.
Routing around degraded nodes
The latency of a search request depends on the slowest/busiest node involved in the search. The search thread pool in master has been switched to a new type (fixed_auto_queue_size) which behaves like the old type by default. It can, however, be configured to have a target response time (and min and max queue sizes) in which case queue sizes will be automatically adjusted based on response times. Slower nodes will have shorter queues, and so will reject search requests earlier than other nodes. These rejected requests will be forwarded to other nodes. This is the first experimental step in the ability to work around degraded nodes. The next step is to figure out a way to set the target response time automatically.
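To make the idea concrete, here is a toy model of a queue whose capacity follows a target response time. It is only a sketch and not the Elasticsearch implementation: the class name, the parameter names, and the simple "shrink or grow by a fixed step after each frame of tasks" rule are all assumptions chosen for illustration.

```python
class AutoQueueSizer:
    """Toy model: queue capacity shrinks when tasks run slower than the target."""

    def __init__(self, min_size, max_size, target_response_time, frame_size, step):
        self.min_size = min_size
        self.max_size = max_size
        self.target_response_time = target_response_time  # seconds (hypothetical unit)
        self.frame_size = frame_size  # tasks measured before each adjustment
        self.step = step              # capacity change per adjustment
        self.capacity = max_size      # start permissive
        self._durations = []

    def record(self, task_seconds):
        """Record one completed task; adjust capacity at the end of each frame."""
        self._durations.append(task_seconds)
        if len(self._durations) < self.frame_size:
            return
        mean = sum(self._durations) / len(self._durations)
        self._durations.clear()
        if mean > self.target_response_time:
            # Slower than the target: shorten the queue, but respect the minimum.
            self.capacity = max(self.min_size, self.capacity - self.step)
        elif mean < self.target_response_time:
            # Faster than the target: lengthen the queue, but respect the maximum.
            self.capacity = min(self.max_size, self.capacity + self.step)

    def accepts(self, queued):
        """A request is rejected once the queue has reached its current capacity."""
        return queued < self.capacity


slow = AutoQueueSizer(min_size=10, max_size=100, target_response_time=1.0,
                      frame_size=5, step=30)
fast = AutoQueueSizer(min_size=10, max_size=100, target_response_time=1.0,
                      frame_size=5, step=30)
for _ in range(20):
    slow.record(2.0)   # degraded node: every task takes 2 seconds
    fast.record(0.1)   # healthy node: every task takes 0.1 seconds
```

In this model the degraded node ends up with the minimum queue size and starts rejecting requests while the healthy node still accepts them, which is what lets the rejected requests be forwarded elsewhere.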
Changes in 5.4:
- A range query without ranges now throws a proper exception instead of an NPE.
- Legacy geo-point fields no longer throw an AIOBE in the field stats API.
- Range queries in the percolator are able to use now, but a code check prevented it.
- More-like-this can now retrieve terms from documents with custom routing.
- Field collapsing should not throw an exception when a search response contains zero hits.
- Dynamic settings which accept an IP address should be resettable.
- Paths in Windows did not handle parentheses correctly.
- Disabled the Netty recycler on the client.
- Time zone offsets in a date histogram with extended bounds should be applied after rounding the extended bounds.
- A new cluster block allows an index marked as read_only to be deleted.
- Google Cloud repository settings can now be stored in the secure settings keystore.
- Parent-child no longer needs a specialised fielddata implementation, but instead uses vanilla doc values.
- The list of versions used for bwc testing is now generated automatically instead of being maintained manually.
- Scripts no longer have access to term statistics (TF, IDF, etc). This advanced use case can be implemented more efficiently by creating a custom script engine.
- File-based scripts have been removed in favour of stored scripts.
- Native scripts have been removed in favour of implementing a custom script engine.
- Deleting a document in a non-existent index will no longer cause the index to spring into existence, unless external versioning is used.
- The Google Cloud repository settings file must now be stored in the secure keystore instead of in a file on disk.
- Common settings may no longer be shared across plugins.
- Deprecated script settings have been removed.
- Script settings have been extended to support allowed_types (inline and stored) and allowed_contexts (in which APIs scripts may be executed).
- Upgraded the ICU plugin to use ICU4j 56.1, which may require reindexing collation fields.
- The minimum compatible version has been set to 5.4.0 (currently the latest released) and _UNRELEASED version constants have been removed.
- Plugins are now able to register pre-configured tokenizers.
- With sequence numbers, a replica will now learn about the existence of a new primary term by receiving replica requests from the new primary, and will block further requests from the old primary.
- Range fields can benefit from the same optimization as range queries on point fields, to decide whether to execute the query using the index or doc values.
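The sequence-number item in the list above (a replica learning about a new primary term and then blocking the old primary) can be pictured with a small model. This is a minimal sketch under stated assumptions, not Elasticsearch code: the Replica class, its handle method, and the StalePrimaryTermError exception are hypothetical names invented for illustration.

```python
class StalePrimaryTermError(Exception):
    """Raised when a request carries a primary term older than one already seen."""


class Replica:
    def __init__(self):
        self.primary_term = 0   # highest primary term seen so far
        self.applied = []       # operations accepted by this replica

    def handle(self, op_term, operation):
        """Apply a replica request tagged with the sender's primary term."""
        if op_term < self.primary_term:
            # The sender is an old primary: block its request.
            raise StalePrimaryTermError(
                f"term {op_term} is older than current term {self.primary_term}")
        # A higher term means a new primary exists; remember it.
        self.primary_term = op_term
        self.applied.append(operation)


replica = Replica()
replica.handle(1, "index doc A")      # request from the original primary
replica.handle(2, "index doc B")      # first request from the new primary
try:
    replica.handle(1, "index doc C")  # late request from the old primary
    blocked = False
except StalePrimaryTermError:
    blocked = True
```

Here the replica needs no separate notification: the first request from the new primary is enough to raise its known term, after which the old primary's request is refused.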
Apache Lucene
Lucene 6.6
The release process for Lucene 6.6.0 has started. This might be the last 6.x minor release before 7.0.
Indexed geo bounding boxes
There is a patch in progress that adds support for indexing geo bounding boxes using points similarly to how range fields work. These bounding boxes can then be queried at search time and support the same relations as range fields: INTERSECTS, WITHIN, CONTAINS or CROSSES.
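As an illustration of the four relations, here is a small sketch that evaluates them on plain latitude/longitude boxes. It assumes the same meaning as for range fields, with the relation read from the indexed box towards the query box, and treats CROSSES as "intersects but is not within". The Box type and function names are hypothetical, and the sketch ignores boxes that cross the dateline.

```python
from typing import NamedTuple


class Box(NamedTuple):
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


def intersects(indexed: Box, query: Box) -> bool:
    """The two boxes share at least one point."""
    return (indexed.min_lat <= query.max_lat and indexed.max_lat >= query.min_lat
            and indexed.min_lon <= query.max_lon and indexed.max_lon >= query.min_lon)


def within(indexed: Box, query: Box) -> bool:
    """The indexed box lies entirely inside the query box."""
    return (indexed.min_lat >= query.min_lat and indexed.max_lat <= query.max_lat
            and indexed.min_lon >= query.min_lon and indexed.max_lon <= query.max_lon)


def contains(indexed: Box, query: Box) -> bool:
    """The indexed box entirely covers the query box."""
    return within(query, indexed)


def crosses(indexed: Box, query: Box) -> bool:
    """The boxes overlap, but the indexed box is not wholly inside the query box."""
    return intersects(indexed, query) and not within(indexed, query)


query = Box(0.0, 10.0, 0.0, 10.0)
inside = Box(2.0, 4.0, 2.0, 4.0)        # fully inside the query
straddling = Box(5.0, 15.0, 5.0, 15.0)  # overlaps one corner of the query
covering = Box(-5.0, 20.0, -5.0, 20.0)  # fully covers the query
far_away = Box(50.0, 60.0, 50.0, 60.0)  # disjoint from the query
```

With these boxes, inside matches INTERSECTS and WITHIN, straddling matches INTERSECTS and CROSSES, covering matches INTERSECTS, CONTAINS and CROSSES, and far_away matches none of them.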
Greater accuracy of the length normalization factors
Up to Lucene 6.x, norms had to store a scoring factor that combined the index-time boost with length normalization. However, Lucene 7.0 will no longer support index-time boosts, which gave us an opportunity to store length-normalization factors more accurately. In fact, we no longer store the normalization factor but the length itself, which we encode in a single byte while retaining 4 significant bits (and even more for small values). At search time we keep a translation table between the encoded length, which only has 256 possible values, and the length normalization factor. This will give users who search short fields a much better experience, since all lengths up to 40 now get encoded to a different byte, while the previous encoding would already quantize the length normalization factors for lengths of 3 and 4 to the same byte, which many users have complained about over the years.
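The encoding described above can be sketched in a few lines. This is an illustration of the technique (a float-like encoding that keeps 4 significant bits, with the spare byte values spent on storing small lengths exactly), not a copy of Lucene's code: the function names are hypothetical, and the 1/sqrt(length) normalization used for the translation table is an assumption made only so that the table has something to hold.

```python
import math


def long_to_int4(i):
    """Float-like encoding that keeps the 4 most significant bits of i."""
    num_bits = i.bit_length()
    if num_bits < 4:
        return i  # values below 8 are stored exactly
    shift = num_bits - 4
    mantissa = (i >> shift) & 0x07        # the leading 1 bit is implicit
    return mantissa | ((shift + 1) << 3)  # shift + 1, since 0 marks exact values


def int4_to_long(encoded):
    """Inverse of long_to_int4, returning the smallest value with that encoding."""
    mantissa = encoded & 0x07
    shift = (encoded >> 3) - 1
    if shift == -1:
        return mantissa
    return (mantissa | 0x08) << shift


# Byte values left over once the largest 32-bit length has an encoding.
NUM_FREE_VALUES = 255 - long_to_int4(2**31 - 1)


def encode_length(length):
    """Encode a field length into a value that fits in one unsigned byte."""
    if length < NUM_FREE_VALUES:
        return length  # spend the spare byte values on exact small lengths
    return NUM_FREE_VALUES + long_to_int4(length - NUM_FREE_VALUES)


def decode_length(byte):
    if byte < NUM_FREE_VALUES:
        return byte
    return NUM_FREE_VALUES + int4_to_long(byte - NUM_FREE_VALUES)


# Search-time translation table: one normalization factor per possible byte.
# Assumption: 1/sqrt(length) as the factor; length 0 maps to 0.0.
NORM_TABLE = [1.0 / math.sqrt(decode_length(b)) if b > 0 else 0.0
              for b in range(256)]
```

In this sketch every length from 0 to 40 receives its own byte, and lengths 40 and 41 are the first pair to share one, which matches the "up to 40" figure quoted above.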
PostingsHighlighter removal
Over the last few months, a lot of effort has gone into creating a highlighter to rule them all, called the unified highlighter. In particular, there are no features that the postings highlighter has and the unified highlighter doesn't, so we are considering removing the postings highlighter.
Other changes:
- Nested queries mistakenly compute the min of the child scores when the max is requested.
- The classification module got a new classifier based on fuzzy-like-this, and another one based on bm25 scores.
- Thanks to the change to the information that we store in norms, we might be able to improve compression.
- Should the terms dictionary load the terms index lazily or even allow it to be unloaded?
- We should not let methods throw checked exceptions that they do not declare (sneaky throw).
- Should we fail opening indices that have been created with version N-2 or less? Today the versions we check are the versions that wrote segments and the commit point, which means we would still open an old index if it has only been merged with version N-1, even though opening such old indices could raise issues.
- Join queries have equals/hashCode bugs that cause false positives.
- We could allow sorting child documents by fields of their parent.
- CodecUtil would sometimes throw confusing exceptions when run against truncated files.
- Range fields are slow when many documents share the same value, but this appears to be because range queries do not leverage all the information they have on inner nodes.
- Corrupt SegmentInfos can throw a misleading IllegalArgumentException.
- The classic query parser parsed range queries leniently.
- AnalyzingInfixSuggester's textgrams field is unnecessary when minPrefixChars is 0.
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!