This Week in Elasticsearch and Apache Lucene - 2018-04-02

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Highlights

Percolator Bugs

Adrien fixed a few bugs in Percolator that caused queries to be returned that did not actually match the document specified in the request.

Search Optimizations

Jim added an optimization that will speed up composite aggregations in certain cases with range and match_all queries.
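
As a rough illustration, a request of the following shape is the kind that may benefit: a range query combined with a composite aggregation whose leading source is on the same field. This is only a sketch. The field names (`timestamp`, `product`), the aggregation name `per_day` and the helper `composite_request` are assumptions made up for this example, and whether the optimization applies depends on the conditions described in #28745.

```python
import json


def composite_request(field: str, gte: str, lt: str, size: int = 100) -> dict:
    """Build a search body: a range query plus a composite aggregation.

    The leading composite source uses the same field as the range query.
    All names here are illustrative, not taken from the original post.
    """
    return {
        "size": 0,  # only the aggregation buckets are of interest
        "query": {"range": {field: {"gte": gte, "lt": lt}}},
        "aggs": {
            "per_day": {
                "composite": {
                    "size": size,
                    "sources": [
                        # leading source: daily buckets on the filtered field
                        {"day": {"date_histogram": {"field": field, "interval": "1d"}}},
                        # second source: hypothetical keyword field
                        {"product": {"terms": {"field": "product"}}},
                    ],
                }
            }
        },
    }


body = composite_request("timestamp", "2018-03-01", "2018-04-01")
print(json.dumps(body, indent=2))
```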

Secure LDAP Settings

Tim has changed the LDAP secure_bind_password field to be a secure setting. As LDAP bind users tend to be long-lived service accounts with broad access to search the directory, it is great that these credentials are now more secure.

SAML Improvements

Ioannis improved our SAML support with two recent contributions: one allows us to sign generated SAML metadata, and the other adds a few missing SAML realm settings.

XContent Refactoring

Lee has moved a large amount of XContent code out of the Elasticsearch core server code. This work is important because it lays the foundation for a future in which the Java high-level REST client does not depend on the Elasticsearch server code itself.

Formal Modeling

David continues to use formal modeling to find bugs in Elasticsearch, most recently addressing the concurrent execution model of Elasticsearch replicas.

Snapshot Restore

Tanguy addressed some bugs in the snapshot restore functionality. Previously, the restore process read the snapshot’s stored state of the entire cluster even when restoring a single index. In certain cases this global state may not be readable, which caused the restore to fail. Now, index restores only read the index data in the snapshot, which speeds up the restore and avoids these issues.
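
For reference, an index-only restore is a call to the snapshot `_restore` endpoint that names the indices to restore and leaves the global state out. The sketch below only builds the request path and body. The repository name `my_backup`, the snapshot name `snapshot_1`, the index name and the helper `restore_request` are placeholders invented for this example.

```python
import json
from typing import List, Tuple


def restore_request(repository: str, snapshot: str, indices: List[str]) -> Tuple[str, dict]:
    """Return the REST path and JSON body for restoring only some indices."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        "indices": ",".join(indices),   # restore only these indices
        "include_global_state": False,  # do not restore the cluster-wide state
    }
    return path, body


# Placeholder names, for illustration only.
path, body = restore_request("my_backup", "snapshot_1", ["index_1"])
print("POST", path)
print(json.dumps(body))
```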

Changes in 6.2:

  • Bulk processor#awaitClose to close scheduler #29263
  • Propagate ignore_unmapped to inner_hits #29261
  • REST client: hosts marked dead for the first time should not be immediately retried #29230
  • X-Pack:
    • Adds missing SAML Realm Settings #4221

Changes in 6.3:

  • REST high-level client: add support for Indices Update Settings API #28892
  • Fold EngineDiskUtils into Store, for better lock semantics #29156
  • Move trimming unsafe commits from the Engine constructor to Store #29260
  • Search: Validate script query is run with a single script #29304
  • Fix incorrect geohash for lat 90, lon 180 #29256
  • Fix handling of bad requests #29249
  • Prune only gc deletes below the local checkpoint #28790
  • Do not optimize append-only operation if normal operation with higher seq# was seen #28787
  • Allow _update and upsert to read from the transaction log #29264
  • Fix a type check that is always false #27726
  • Optimize the composite aggregation for match_all and range queries #28745
  • X-Pack:
    • Improve error if Indices Permission is too complex
    • Add secure_bind_password to LDAP realm
    • Replace ThrottlerField → Field in comments and string constants
    • [Rollup] Make Rollup a Basic license feature
    • [Rollup] Delegate GetJobs to master
    • Set order of audit log template to 1000
    • All logging audit settings updateable
    • Saml metadata signing
    • LdapUserSearch rebind with bind DN after bind user
    • [Rollup] Select best jobs then execute msearch-per-job

Changes in 7.0:

  • Do not load global state when deleting a snapshot #29278
  • Remove IndicesOptions bwc serialization layer #29281
  • Don’t load global state when only restoring indices #29239

Apache Lucene 7.3

The vote is closing today and no issues have been detected, so there is a good chance that the 7.3 bits will be released later this week. In the meantime we have upgraded Elasticsearch to a recent 7.3 snapshot to make sure Lucene 7.3 doesn't introduce regressions in Elasticsearch.

Interval queries

Alan and Jim worked on a prototype of interval queries based on "Efficient optimally lazy algorithms for minimal-interval semantics" (Paolo Boldi and Sebastiano Vigna, Theoretical Computer Science, 2016). A first iteration has been pushed to the Lucene sandbox. These positional queries are similar to span queries, but we hope that starting from scratch will let us fix some issues with span queries, and eventually either replace span queries or merge this work into them.
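
To give a feel for what "minimal intervals" means, here is a small sketch for the unordered AND case: given the positions of each term in one document, it returns every interval that contains all terms and has no smaller such interval inside it. This is a plain eager sweep written for illustration, not the lazy algorithm from the paper and not the Lucene sandbox code. The function name and the sample positions are made up, and the sketch assumes no two terms occur at the same position.

```python
from typing import Dict, List, Tuple


def minimal_intervals(positions: Dict[str, List[int]]) -> List[Tuple[int, int]]:
    """Return minimal [start, end] intervals covering every term at least once.

    Illustrative eager sweep; assumes no two terms share a position.
    """
    # Flatten to (position, term) events and visit them left to right.
    events = sorted((pos, term) for term, plist in positions.items() for pos in plist)
    last_seen: Dict[str, int] = {}  # most recent position of each term so far
    result: List[Tuple[int, int]] = []
    for pos, term in events:
        last_seen[term] = pos
        if len(last_seen) < len(positions):
            continue  # at least one term has not occurred yet
        # Tightest window ending at `pos` that still contains every term.
        start = min(last_seen.values())
        # It is minimal only if its start moved past the previously emitted
        # start; otherwise it merely extends an earlier interval to the right.
        if not result or start > result[-1][0]:
            result.append((start, pos))
    return result


# Token positions in "quick brown fox quick fox brown"
doc_positions = {"quick": [0, 3], "brown": [1, 5], "fox": [2, 4]}
intervals = minimal_intervals(doc_positions)
print(intervals)
```

With these sample positions the window ending at position 4 starts at position 1, just like the one ending at position 3, so it is skipped as non-minimal.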

Soft-delete-aware IndexWriter

At the moment soft deletes are managed on top of the IndexWriter. Simon is proposing changes to make the IndexWriter aware of soft deletes so that it would be possible to expunge soft deletes through merges just like hard deletes, e.g. with a 7-day retention policy.
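
The idea of expunging soft deletes at merge time under a retention policy can be modelled in a few lines. The sketch below is a conceptual model only, not Lucene's API: the `Doc` class, the `merge_segment` function and the timestamp-based retention rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

DAY = 24 * 60 * 60  # seconds


@dataclass
class Doc:
    doc_id: str
    soft_deleted_at: Optional[float] = None  # epoch seconds; None means live


def merge_segment(docs: List[Doc], now: float, retention_seconds: float) -> List[Doc]:
    """Copy docs into the merged segment, dropping expired soft deletes.

    Live documents are always kept. Soft-deleted documents are kept only
    while they are still inside the retention window.
    """
    kept = []
    for doc in docs:
        is_live = doc.soft_deleted_at is None
        if is_live or now - doc.soft_deleted_at <= retention_seconds:
            kept.append(doc)
    return kept


now = 10 * DAY
segment = [
    Doc("a"),                           # live
    Doc("b", soft_deleted_at=8 * DAY),  # soft-deleted 2 days ago
    Doc("c", soft_deleted_at=2 * DAY),  # soft-deleted 8 days ago
]
merged = merge_segment(segment, now, retention_seconds=7 * DAY)
print([d.doc_id for d in merged])
```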

Nori, a new Korean analyzer

Jim has been working hard with members of the Korea team at Elastic to build a morphological analyzer for Korean using the same underlying ideas and data structures as Kuromoji (the Japanese analyzer), even though the implementation is very different due to differences between these languages. The first prototype received good feedback, and Jim and Robert are now iterating on improving dictionary compression, memory usage and decompounding.

Give queries an API to iterate positions of matches

Simon and then Alan worked for a long time on a new API that would allow scorers to expose positions/offsets. This would have several benefits, including making the implementation of highlighters easier since they could directly ask a query for an iterator over the matching positions. Unfortunately this also introduced considerable complexity and was never merged. Alan recently came back with a proposal of much smaller scope, which only consists of returning the matching positions/offsets for a single document.
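
To show why such an iterator would simplify highlighting, here is a toy model in which a "query" is just a set of terms and a "document" is a string. It is not the proposed Lucene API. The names `iter_matches` and `highlight`, the simple `\w+` tokenization and the `<em>` tags are all assumptions made for this sketch.

```python
import re
from typing import Iterator, Set, Tuple


def iter_matches(text: str, terms: Set[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (position, start_offset, end_offset) for each matching token."""
    for position, token in enumerate(re.finditer(r"\w+", text)):
        if token.group().lower() in terms:
            yield position, token.start(), token.end()


def highlight(text: str, terms: Set[str]) -> str:
    """Wrap every match in <em> tags using only the offsets from iter_matches."""
    out, cursor = [], 0
    for _, start, end in iter_matches(text, terms):
        out.append(text[cursor:start])  # untouched text before the match
        out.append("<em>" + text[start:end] + "</em>")
        cursor = end
    out.append(text[cursor:])  # trailing text after the last match
    return "".join(out)


doc = "The quick brown fox"
query_terms = {"quick", "fox"}
print(list(iter_matches(doc, query_terms)))
print(highlight(doc, query_terms))
```

The highlighter here never re-analyzes the query. It only consumes the offsets it is handed, which is the simplification the proposal is after.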

Other

- TestIndexSorting uses large indexes, which occasionally causes OOMEs with in-memory codecs. We should make these tests simpler or only use efficient codecs for this test.

- Can we populate the filter cache asynchronously?