18 December 2017

This Week in Elasticsearch and Apache Lucene - 2017-12-18


Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Optimised version map for append-only indexing

#27752 removes the need to track versions of in-memory buffered documents during indexing when every document in the RAM buffer uses an auto-generated ID and is guaranteed not to be a duplicate. This reduces GC overhead drastically in high-throughput scenarios (up to 50%) and improves indexing throughput by 5-10% depending on the workload. This change will come in 6.2.
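The idea can be illustrated with a toy model. The sketch below is not Elasticsearch's implementation: the class `ToyVersionMap` and its methods are hypothetical names, and it ignores complications such as retried requests. It only shows why a buffer that contains nothing but documents with auto-generated IDs does not need per-document version entries.

```python
import uuid


class ToyVersionMap:
    """Toy stand-in for an in-memory version map (hypothetical, for illustration)."""

    def __init__(self):
        self.versions = {}       # doc id -> version, filled only when tracking is needed
        self.append_only = True  # True while every buffered doc has an auto-generated id

    def index(self, doc_id=None):
        if doc_id is None:
            # Auto-generated id: assumed unique, so it cannot collide with a buffered doc.
            doc_id = uuid.uuid4().hex
            if self.append_only:
                return doc_id, 1  # fast path: nothing is recorded
        else:
            # Caller-supplied id: it may be a duplicate, so tracking must start.
            self.append_only = False
        version = self.versions.get(doc_id, 0) + 1
        self.versions[doc_id] = version
        return doc_id, version

    def refresh(self):
        # Once the buffer is flushed, the fast path becomes available again.
        self.versions.clear()
        self.append_only = True
```

In this model, a pure append-only workload never allocates a map entry, which is where the reduction in garbage comes from.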

Elasticsearch 6.2.0 supporting JDK 9

Elasticsearch 6.2.0 will be the first release of Elasticsearch to officially support JDK 9, and it will run out-of-the-box on both JDK 8 and JDK 9. We recommend that users stay on JDK 8, as JDK 9 is not an LTS release of the JDK, but Elasticsearch will move forward with the JDK ecosystem. When JDK 9 reaches end-of-life in March 2018, releases of Elasticsearch will stop supporting JDK 9; we intend to support JDK 10 but there is no guarantee of that at this time. Support for JDK 8 will continue until its end-of-life in September 2018, when JDK 11 will be the next LTS release of the JDK.

New ranking evaluation API

A new ranking evaluation endpoint (_rank_eval) has been added to master and is planned to be backported to 6.2. The ranking evaluation API can be used to evaluate the quality of ranked search results over a set of typical search queries. Users can supply a set of typical queries together with a list of manually rated documents, and the API will run the queries and calculate common information retrieval metrics such as mean reciprocal rank, precision or discounted cumulative gain on the results.
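For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly defined. It is not the code behind `_rank_eval`; the function names are hypothetical, each input is assumed to be the list of manual ratings for one query's hits in ranked order (0 meaning not relevant), and the DCG variant shown is the common exponential-gain form.

```python
import math


def precision_at_k(ratings, k):
    """Fraction of the top-k returned hits that are rated relevant (rating > 0)."""
    top = ratings[:k]
    if not top:
        return 0.0
    return sum(1 for r in top if r > 0) / len(top)


def reciprocal_rank(ratings):
    """1 / rank of the first relevant hit, or 0.0 if no hit is relevant."""
    for rank, rating in enumerate(ratings, start=1):
        if rating > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ratings_per_query):
    """Average of the reciprocal rank over a set of queries."""
    return sum(reciprocal_rank(r) for r in ratings_per_query) / len(ratings_per_query)


def dcg_at_k(ratings, k):
    """Discounted cumulative gain: graded gain, discounted by log2 of the position."""
    return sum(
        (2 ** rating - 1) / math.log2(rank + 1)
        for rank, rating in enumerate(ratings[:k], start=1)
    )
```

Precision and reciprocal rank only need binary relevant/not-relevant judgements, while discounted cumulative gain rewards placing highly rated documents near the top.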

The API is currently marked as experimental and will probably change a bit in the foreseeable future. More details about the current state can be found in the documentation.

Rating documents via the API is a very manual process at the moment, so we only expect to see traction around this feature once we have a UI that makes the interaction much more point-and-click. Brainstorming with the Kibana team is in progress.

Changes in 5.6:

update ingest-attachment to use Tika 1.17 and newer deps #27824
Do not use system properties when building the HttpAsyncClient #27829

Changes in 6.0:

Use AmazonS3.doesObjectExist() method in S3BlobContainer #27723

Changes in 6.1:

Add version support for inner hits in field collapsing (#27822) #27833
No longer unidle shard during recovery #27757

Changes in 6.2:

BREAKING: Remove operationThreaded from Java API #27836
Allow _doc as a type. #27816
Use lastSyncedGlobalCheckpoint in deletion policy #27826
Add NioGroup for use in different transports #27737
Optimize version map for append-only indexing #27752
Fixes ByteSizeValue to serialise correctly #27702
also extract match_all queries when indexing percolator queries #27585
Allow custom service names when installing on windows #25255
Remove potential nio selector leak #27825
Clean Up Painless Cast Object #27794
Use CountedBitSet in LocalCheckpointTracker #27793
Keep commits and translog up to the global checkpoint #27606
Painless: Only allow Painless type names to be the same as the equivalent Java class. #27264
Fix performance of RoutingNodes#assertShardStats #27747
Use typeName() to check field type in GeoShapeQueryBuilder #27730
X-Pack:
- [Watcher] Use index.auto_expand_replicas: 0-1 #3284
- Watcher: Set index and type dynamically in index action #3264
- Fix license messaging for Logstash functionality #3268
- Check for existing x-pack directory on users commands #3271

Changes in 7.0:

Add ranking evaluation API #27478
Fail restore when the shard allocations max retries count is reached #27493
Remove pre 6.0.0 support from InternalEngine #27720
String distance algorithms cleanup #27640

Apache Lucene

Lucene 7.2.0

There is an ongoing vote to release Lucene 7.2.0, which is going well so far.

New committer / PMC member

Ahmet Arslan is now a Lucene/Solr committer and Ishan Chattopadhyaya is now a PMC member.

Other

An optimization to regexp queries that have leading wildcards ends up slowing down some other regexp queries.
CustomScoreQuery, BoostedQuery and BoostingQuery should be deprecated in favour of FunctionScoreQuery.
We could speed up range queries on sorted indices by using binary search to compute the range of matching doc ids.
TermInSetQuery (Elasticsearch's terms query) has a confusing string representation when combined in boolean queries.
The trim filter should be applied for multi-term queries and usable in keyword normalizers.
What is the maximum score of a disjunction? Floating point arithmetic makes it more complicated than it sounds.
The Explanation class should return a java.lang.Number rather than a float so that integer contributions to the score like docCount or totalTermFreq are better formatted and accurate.
While expanding certain nodes into fixed-size arrays can make lookups faster on FSTs, it might also make them space-inefficient.
Most similarities now have much better explanations.
Static analysis found an impossible branch.
The APIs that we introduced to speed up disjunctions and phrase queries when total hits counts are not needed could also be used to speed up sorting by (geo/numeric/date) distance.
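To illustrate the range-query idea from the list above: when an index is sorted by a field, documents matching a range on that field occupy a contiguous block of doc ids, so two binary searches are enough to find it. The sketch below is not Lucene code; `matching_doc_id_range` is a hypothetical helper, and it assumes ascending sort order with doc id `i` holding `sorted_values[i]`.

```python
from bisect import bisect_left, bisect_right


def matching_doc_id_range(sorted_values, lower, upper):
    """Return (first, end) so that doc ids in [first, end) have lower <= value <= upper."""
    first = bisect_left(sorted_values, lower)  # first doc id with value >= lower
    end = bisect_right(sorted_values, upper)   # one past the last doc id with value <= upper
    return first, end


# Usage: values of the sort field, indexed by doc id.
values = [1, 3, 3, 5, 8, 13]
first, end = matching_doc_id_range(values, 3, 8)
matching_doc_ids = range(first, end)
```

The cost is logarithmic in the number of documents rather than proportional to the number of matches, and an empty result shows up as `first == end`.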
