22 January 2018

This Week in Elasticsearch and Apache Lucene - 2018-01-22

By Clinton Gormley and Adrien Grand

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

SAML 2.0 is now supported by X-Pack

SAML authentication has been a highly requested feature for X-Pack, and after several months of work, X-Pack 6.2 will support SAML 2.0 authentication using the Web SSO profile. SAML stands for Security Assertion Markup Language and is a standard XML-based protocol for exchanging authentication and authorization data between parties. The most common use of SAML is to implement single sign-on (SSO) for applications in an enterprise environment. The SAML specification has a few different versions: V1.0, V1.1, and V2.0. The V2.0 specification was completed in 2005 and is the most common version in use today.

The SAML authentication support has been designed specifically to work with Kibana.

Audit Log Filtering

The audit logs produced by X-Pack can be very verbose, and for some users too verbose. A repeated request over the past few years has been for finer-grained control over which events get logged. Elasticsearch 6.2.0 adds the ability to define filter policies that limit which events are logged. The filters apply to the data of an audit event and can filter on users, roles, indices, and realms.
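As a rough illustration of the idea, here is a minimal Python sketch of policy-based filtering. It is a conceptual model only, not X-Pack's implementation or configuration syntax. It assumes that a policy describes events to drop, that an event matches a policy when every attribute the policy lists matches at least one of its patterns, and that patterns are simple wildcards. The policy name, the event layout, and the function names are all hypothetical.

```python
from fnmatch import fnmatchcase


def matches_policy(event, policy):
    """Return True if the event matches every attribute the policy lists.

    Assumption: an attribute matches when any of the event's values for it
    matches any of the policy's wildcard patterns.
    """
    for attribute, patterns in policy.items():
        values = event.get(attribute, [])
        if not any(fnmatchcase(v, p) for v in values for p in patterns):
            return False
    return True


def should_log(event, policies):
    """An event is logged unless some policy matches it (assumed semantics)."""
    return not any(matches_policy(event, p) for p in policies.values())


# Hypothetical policy: drop events caused by system users in the native realm.
policies = {
    "noisy_system_users": {
        "users": ["kibana", "*_monitor"],
        "realms": ["native"],
    }
}

system_event = {
    "users": ["kibana"],
    "realms": ["native"],
    "roles": ["kibana_system"],
    "indices": [".kibana"],
}
user_event = {
    "users": ["alice"],
    "realms": ["native"],
    "roles": ["analyst"],
    "indices": ["logs-2018.01.22"],
}

print(should_log(system_event, policies))
print(should_log(user_event, policies))
```

In this model the system user's event is filtered out, while the ordinary user's event is still logged because it fails the policy's users condition.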

Aggregations now use Kahan summation to compute sums

When summing up N positive doubles with recursive summation ((((x1 + x2) + x3) + ... ) + xN), the relative error is only bounded by (N-1) * 2^-52. Kahan summation maintains a compensation term in order to improve accuracy, which helps bound the relative error by 2^-52, even when summing up millions of values. The sum, avg, stats and extended_stats aggregations now all use Kahan summation when summing up values from the collected documents. This improves the accuracy of these aggregations at a reasonable cost of 8 extra bytes per bucket, which is accounted for by circuit breakers.
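To make the difference concrete, here is a minimal Python sketch of both approaches. It is not the Elasticsearch code, and the function names are illustrative. The single extra double kept by `kahan_sum` plays the role of the 8-byte per-bucket compensation mentioned above, and `math.fsum` serves only as an accurate reference value.

```python
import math


def naive_sum(values):
    """Recursive summation: rounding errors accumulate with every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan (compensated) summation."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        corrected = v - compensation      # re-inject the previously lost bits
        new_total = total + corrected     # low-order bits of `corrected` may be lost here
        # (new_total - total) is what was actually added; the difference
        # from `corrected` is the rounding error of this step.
        compensation = (new_total - total) - corrected
        total = new_total
    return total


values = [0.1] * 1_000_000
reference = math.fsum(values)  # accurately rounded sum, used as ground truth

print("naive error:", abs(naive_sum(values) - reference))
print("kahan error:", abs(kahan_sum(values) - reference))
```

On inputs like this one, the compensated sum should stay within a few units in the last place of the reference, while the naive sum drifts noticeably as the number of values grows.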

Lucene Rollbacks on Recovery

When a primary fails, it may leave the other shard copies out of sync with each other. This is due to concurrent in-flight indexing operations that may not have been executed on all shard copies. With 6.0 and the sequence numbers work, we already ship the missing operations from the new primary to the replicas. However, the replicas may still have operations that do not exist on the new primary. With 6.2 we are now removing these operations during ops based recovery (file based recovery is a full reset anyway). To do so, shards now keep potentially older Lucene commits that are known to be "safe". These safe commits are guaranteed to contain only operations that exist on the new primary. When recovering, we will only use these commits as a basis and thus throw away any operation in Lucene that is not on the primary. Following on from this, we will work on the translog and, later, on real time rollbacks (i.e., rollbacks that do not require a recovery).
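The selection of a safe commit can be sketched as follows. This is a simplified Python model, not the engine code. It assumes that "known to exist on the new primary" is approximated by a global checkpoint, meaning a sequence number up to which all operations are known to be present on every in-sync copy, and that each commit records the highest sequence number it contains. The `Commit` type and `pick_safe_commit` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    generation: int   # increases with every new commit
    max_seq_no: int   # highest sequence number contained in this commit


def pick_safe_commit(commits: List[Commit], global_checkpoint: int) -> Optional[Commit]:
    """Return the newest commit that holds no operation above the checkpoint.

    Assumption: a commit is "safe" when all of its operations are at or
    below the global checkpoint, so none of them can be missing on the
    new primary.
    """
    safe = [c for c in commits if c.max_seq_no <= global_checkpoint]
    if not safe:
        return None  # no usable commit; a full (file based) recovery is needed
    return max(safe, key=lambda c: c.generation)


commits = [Commit(1, 10), Commit(2, 25), Commit(3, 40)]

# With a checkpoint of 30, the newest commit (max_seq_no=40) may hold
# operations the new primary never saw, so recovery starts from generation 2.
print(pick_safe_commit(commits, global_checkpoint=30))
```

Starting from the older commit discards operations 26 to 40 from Lucene on the replica. The operations that do exist on the new primary are then replayed during ops based recovery.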

6.2 also fixes an issue where ops based recovery threw away the translog on the target shard. This reduced the chance of a future ops based recovery, because the operation history on the target was then gone. For the fix to work, all nodes need to be on 6.2.

Changes in 5.6:

  • Never return null from Strings.tokenizeToStringArray #28224
  • Fallback to TransportMasterNodeAction for cluster health retries #28195
  • Allow update of eager_global_ordinals on _parent. #28014

Changes in 6.1:

  • X-Pack:
    • [Security] Handle cache expiry in token service #3565
    • Watcher: Fix NPE in watcher index template registry #3571

Changes in 6.2:

  • BREAKING: REST high-level client: remove index suffix from indices client method names #28263
  • Add Close Index API to the high level REST client #27734
  • Painless: Add spi jar that will be published for extending whitelists #28302
  • Fix simple_query_string on invalid input #28219
  • Simplify RankEvalResponse output #28266
  • Add client actions to action plugin #28280
  • add toString implementation for UpdateRequest. #27997
  • Dependencies: Update joda time to 2.9.9 #28261
  • Add multi get api to the high level rest client #27337
  • Open engine should keep only starting commit #28228
  • Avoid doing redundant work when checking for self references. #26927
  • [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder #27692
  • Painless: Add whitelist extensions #28161
  • Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter #28225
  • Fix NPE on composite aggregation with sub-aggregations that need scores #28129
  • Fix synonym phrase query expansion for cross_fields parsing #28045
  • Introduce elasticsearch-core jar #28191
  • Limit the analyzed text for highlighting (#27934) #28176
  • Adds metadata to rewritten aggregations #28185
  • X-Pack:
    • Drop native controller from descriptors (except ML) #3650
    • Merge saml (6x) #3648
    • [Security] Add SAML authentication support #3646
    • Introduce plugin-specific env scripts #3649
    • Split transport implementations into client/server #3635
    • Add the ability to refresh tokens obtained via the API #3468
    • Watcher: Improve cluster state listener behaviour #3538
    • Fix for Issue #3403 - Predictable ordering of security realms #3533

Changes in 6.3:

  • Clean up commits when global checkpoint advanced #28140

Changes in 7.0:

  • Unify nio read / write channel contexts #28160
  • X-Pack:
    • Support TLS/SSL renegotiation #3600
    • Add TLS/SSL enabled SecurityNioTransport #3519

Apache Lucene

Can BKD trees be extended to store and query R-Trees?

Support for BKD trees has driven a lot of improvements since they were introduced, like faster indexing of numerics, faster box/polygon/distance queries on geo points and range queries on numerics, support for range fields, etc.

We would like to index geo shapes into the BKD tree as well, but this comes with some challenges: the BKD tree is not aware that longitude wraps, and in the general case shapes cannot be represented accurately by a range of values in each dimension. Hence this proposal to extend the API so that it can behave like an R-tree.
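The longitude-wrapping problem can be illustrated with a small Python sketch. It is not Lucene's API. It assumes the common convention that a bounding box whose western edge has a larger longitude than its eastern edge crosses the antimeridian, and the function name `longitude_ranges` is hypothetical. A structure that only understands one contiguous range per dimension has to treat such a box as two separate ranges.

```python
def longitude_ranges(min_lon, max_lon):
    """Express a box's longitude extent as contiguous [low, high] ranges.

    Assumption: min_lon > max_lon means the box crosses the antimeridian
    (+/-180 degrees), so it cannot be described by a single range.
    """
    if min_lon <= max_lon:
        return [(min_lon, max_lon)]
    # Split at the antimeridian: one range up to +180, one from -180.
    return [(min_lon, 180.0), (-180.0, max_lon)]


# A box around Greenwich fits in one range ...
print(longitude_ranges(-10.0, 10.0))
# ... while a box spanning 170E to 170W needs two.
print(longitude_ranges(170.0, -170.0))
```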

Other: