This Week in Elasticsearch and Apache Lucene - 2017-12-18
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Optimised version map for append-only indexing
#27752 automatically optimises away the need to track versions of in-memory buffered documents while indexing, provided that all documents in the RAM buffer use auto-generated IDs and are guaranteed to have no duplicates. This drastically reduces GC overhead in high-throughput scenarios (by up to 50%) and offers a 5-10% indexing throughput improvement depending on the workload. This change will come in 6.2.
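To make the idea concrete, here is a toy model of a version map that skips bookkeeping while the buffer is append-only. It is a sketch under simplifying assumptions, not the code from #27752: the class name, the may_be_duplicate flag and the refresh() method are all hypothetical and only illustrate the condition described above.

```python
class BufferedVersionMap:
    """Toy model of a per-buffer version map (hypothetical, not Elasticsearch's class)."""

    def __init__(self):
        # doc id -> version for documents that are buffered but not yet refreshed
        self.versions = {}
        # True while every buffered doc had an auto-generated, duplicate-free id
        self.append_only = True

    def index(self, doc_id, version, auto_generated_id, may_be_duplicate=False):
        safe_append = auto_generated_id and not may_be_duplicate
        if safe_append and self.append_only:
            # A fresh auto-generated id cannot collide with a buffered doc,
            # so there is nothing to look up later and nothing to track.
            return
        # From the first "unsafe" document onwards, track every version.
        self.append_only = False
        self.versions[doc_id] = version

    def refresh(self):
        # After a refresh the buffered docs are searchable; start over.
        self.versions.clear()
        self.append_only = True


vm = BufferedVersionMap()
vm.index("auto-1", 1, auto_generated_id=True)   # not tracked
vm.index("user-1", 1, auto_generated_id=False)  # tracked, switches mode
vm.index("auto-2", 1, auto_generated_id=True)   # tracked from now on
```

In this model the memory saving comes from never allocating map entries in the common append-only case, which is where the reduced GC pressure would come from.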
Elasticsearch 6.2.0 supporting JDK 9
Elasticsearch 6.2.0 will be the first release of Elasticsearch to officially support JDK 9. Elasticsearch 6.2.0 will run out-of-the-box on both JDK 8 and JDK 9. We recommend that users stay on JDK 8 as JDK 9 is not an LTS release of the JDK, but Elasticsearch will move forward with the JDK ecosystem. When JDK 9 is end-of-life in March 2018, releases of Elasticsearch will stop supporting JDK 9; we intend to support JDK 10 but there is no guarantee of that at this time. Support for JDK 8 will continue until end-of-life in September 2018 when JDK 11 will be the next LTS release of the JDK.
New ranking evaluation API
A new ranking evaluation endpoint (_rank_eval) has been added to master and is planned to be backported to 6.2. The ranking evaluation API can be used to evaluate the quality of ranked search results over a set of typical search queries. Users can supply a set of typical queries together with a list of manually rated documents, and the API will perform the queries and calculate common information retrieval metrics such as mean reciprocal rank, precision or discounted cumulative gain over the results.
The API is currently marked as experimental and will probably change a bit in the foreseeable future. More details about the current state can be found in the documentation.
Ranking via the API is a very manual process at the moment, so we only expect to see traction around this feature once we have a UI to make interaction much more point-and-click. Brainstorming in progress with the Kibana team.
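For readers unfamiliar with the metrics mentioned above, the following sketch computes them by hand from a ranked result list and a set of manual ratings. It uses the textbook definitions and does not claim to match how _rank_eval computes them; the function names, the relevant_threshold parameter and the choice to treat unrated documents as rating 0 are assumptions made for illustration.

```python
import math


def precision_at_k(ranked_ids, ratings, k, relevant_threshold=1):
    """Fraction of the top-k results whose rating meets the threshold."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for d in top if ratings.get(d, 0) >= relevant_threshold)
    return hits / len(top)


def reciprocal_rank(ranked_ids, ratings, relevant_threshold=1):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, d in enumerate(ranked_ids, start=1):
        if ratings.get(d, 0) >= relevant_threshold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries, relevant_threshold=1):
    """Average reciprocal rank over (ranked_ids, ratings) pairs."""
    scores = [reciprocal_rank(r, g, relevant_threshold) for r, g in queries]
    return sum(scores) / len(scores) if scores else 0.0


def dcg(ranked_ids, ratings, k):
    """Discounted cumulative gain with gain 2^rating - 1 and a log2 discount."""
    return sum(
        (2 ** ratings.get(d, 0) - 1) / math.log2(rank + 1)
        for rank, d in enumerate(ranked_ids[:k], start=1)
    )


# One query: manual ratings (0 = irrelevant) and the order the search returned.
ratings = {"doc1": 3, "doc2": 0, "doc3": 1}
ranked = ["doc2", "doc3", "doc1", "doc4"]

p3 = precision_at_k(ranked, ratings, k=3)   # 2 of the top 3 are relevant
rr = reciprocal_rank(ranked, ratings)       # first relevant hit is at rank 2
gain = dcg(ranked, ratings, k=3)
```

Graded ratings only matter for discounted cumulative gain; precision and reciprocal rank reduce each rating to relevant or not relevant.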
Changes in 5.6:
- update ingest-attachment to use Tika 1.17 and newer deps #27824
- Do not use system properties when building the HttpAsyncClient #27829
Changes in 6.0:
- Use AmazonS3.doesObjectExist() method in S3BlobContainer #27723
Changes in 6.1:
- Add version support for inner hits in field collapsing (#27822) #27833
- No longer unidle shard during recovery #27757
Changes in 6.2:
- BREAKING: Remove operationThreaded from Java API #27836
- Allow _doc as a type. #27816
- Use lastSyncedGlobalCheckpoint in deletion policy #27826
- Add NioGroup for use in different transports #27737
- Optimize version map for append-only indexing #27752
- Fixes ByteSizeValue to serialise correctly #27702
- also extract match_all queries when indexing percolator queries #27585
- Allow custom service names when installing on windows #25255
- Remove potential nio selector leak #27825
- Clean Up Painless Cast Object #27794
- Use CountedBitSet in LocalCheckpointTracker #27793
- Keep commits and translog up to the global checkpoint #27606
- Painless: Only allow Painless type names to be the same as the equivalent Java class. #27264
- Fix performance of RoutingNodes#assertShardStats #27747
- Use typeName() to check field type in GeoShapeQueryBuilder #27730
Changes in 7.0:
- Add ranking evaluation API #27478
- Fail restore when the shard allocations max retries count is reached #27493
- Remove pre 6.0.0 support from InternalEngine #27720
- String distance algorithms cleanup #27640
Apache Lucene
Lucene 7.2.0
There is an ongoing vote to release Lucene 7.2.0, which is going well so far.
New committer / PMC member
Ahmet Arslan is now a Lucene/Solr committer and Ishan Chattopadhyaya is now a PMC member.
Other
- An optimization to regexp queries that have leading wildcards ends up slowing down some other regexp queries.
- CustomScoreQuery, BoostedQuery and BoostingQuery should be deprecated in favour of FunctionScoreQuery.
- We could speed up range queries on sorted indices by using binary search to compute the range of matching doc ids.
- TermInSetQuery (Elasticsearch's terms_set query) has a confusing string representation when combined in boolean queries.
- The trim filter should be applied for multi-term queries and usable in keyword normalizers.
- What is the maximum score of a disjunction? Floating point arithmetic makes it more complicated than it sounds.
- The Explanation class should return a java.lang.Number rather than a float so that integer contributions to the score like docCount or totalTermFreq are formatted better and remain accurate.
- While expanding certain nodes into fixed-size arrays can make lookups faster on FSTs, it might also make them space-inefficient.
- Most similarities now have much better explanations.
- Static analysis found an impossible branch.
- The APIs that we introduced to speed up disjunctions and phrase queries when total hits counts are not needed could also be used to speed up sorting by (geo/numeric/date) distance.
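The binary-search idea for range queries on sorted indices, mentioned in the list above, can be sketched as follows. This is an illustration of the general technique and not Lucene code: it assumes the index is sorted by the queried field, so that doc id i holds the i-th smallest value, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def matching_doc_id_range(sorted_values, lower, upper):
    """Return (first, end) so that doc ids first..end-1 satisfy lower <= value <= upper.

    sorted_values[doc_id] is the field value of that document; the list must be
    in ascending order, which is the assumption a sorted index provides.
    """
    first = bisect_left(sorted_values, lower)   # first doc with value >= lower
    end = bisect_right(sorted_values, upper)    # one past the last doc with value <= upper
    return first, end


values_by_doc_id = [3, 5, 5, 8, 13, 21]
first, end = matching_doc_id_range(values_by_doc_id, 5, 13)
matching = list(range(first, end))  # doc ids 1, 2, 3 and 4
```

Two binary searches replace a scan over every value, and the result is a contiguous doc id range rather than a set that has to be materialised.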