13 July 2018

This Week in Elasticsearch and Apache Lucene - 2018-07-13

By Paul Sanwald, Adrien Grand, Boaz Leskes, Jay Modi, and Colin Goodheart-Smithe

Elasticsearch

Kerberos

The kerberos realm has been merged into the feature branch. We are currently working on a QA test that uses an actual KDC. Once the lookup realms feature has been completed, we will need to integrate the kerberos realm with lookup realms, but this will not block the completion of the kerberos realm.

Structured Audit Logging

We have raised an in-progress PR for structured audit logging. This PR makes use of the log4j StringMapMessage for audit events, which allows the greatest flexibility in terms of the final log line format. Specifically, log4j defines layouts which format the log line printf-style, utilizing the values from the map message event. This should allow us to experiment with multiple formats by only changing the log4j config.
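To illustrate the idea, a log4j2 layout along these lines could pull individual keys out of the map message; the appender names and map keys below are hypothetical, not the final audit log configuration:

```properties
# Hypothetical log4j2 configuration sketch: a PatternLayout can reference
# individual map message entries via the %map{key} converter, so changing
# the pattern changes the audit line format without any code changes.
appender.audit.type = File
appender.audit.name = audit_file
appender.audit.fileName = logs/audit.log
appender.audit.layout.type = PatternLayout
appender.audit.layout.pattern = [%d{ISO8601}] origin=%map{origin.address} user=%map{user.name} action=%map{action}%n
```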

Cross Cluster Replication

We have merged the rewrite of the ShardFollowingTask. The rewrite was meant to simplify the logic and give us better insight into the various internals of the task for better monitoring and control (for example, how many operations fetched from the leader are buffered up for writes on the follower).

Zen2

The Zen2 project is now managed on our public repo and is now being tracked on a meta issue. Due to the complexity of the project, the POC phase was quite elaborate and the code is relatively mature for a POC. Work will focus on porting the POC code to a production level, simplifying things and adding tests. This is a major milestone. Congratulations to David and Yannick for reaching it!

Auto-Interval Date Histogram

We have merged a PR for a new aggregation, called auto_date_histogram. It works like a date_histogram, but instead of specifying the time interval, you specify the max number of buckets you’d like, and the aggregation chooses the interval that will be closest to the maximum without going over.
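For example, a request along these lines asks for at most ten buckets and lets the aggregation pick the interval (the index and field names here are placeholders):

```
POST /logs/_search
{
  "size": 0,
  "aggs": {
    "events_over_time": {
      "auto_date_histogram": {
        "field": "timestamp",
        "buckets": 10
      }
    }
  }
}
```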

SQL Drivers

Work on the ODBC and JDBC drivers continues, with a goal of getting to parity. We have been working on adding parameterised execution to the ODBC driver, allowing a statement to be prepared once and then executed multiple times with different data parameters. For JDBC, we have been adding single-parameter text-manipulation functions to SQL, which allow users to transform text using functions such as LENGTH, UCASE, LCASE, and LTRIM.
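As a sketch, the new functions can be exercised through the SQL REST endpoint like this (the index and column names are made up for illustration):

```
POST /_xpack/sql?format=txt
{
  "query": "SELECT UCASE(last_name), LENGTH(first_name), LTRIM(title) FROM employees"
}
```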

Changes in 6.3:

  • SQL: HAVING clause should accept only aggregates #31872
  • Fix building AD URL from domain name #31849
  • Watcher: Add ssl.trust email account setting #31684
  • Watcher: Increase HttpClient parallel sent requests #31859
  • Inconsistency between description and example #31858
  • SQL: Fix incorrect HAVING equality #31820

Changes in 6.4:

  • Slack message empty text #31596
  • Date: Add DateFormatters class that uses java.time #31856
  • Tests: Remove use of joda time in some tests #31922
  • Add Get Snapshots High Level REST API #31980
  • Force execution of fetch tasks #31974
  • Add Expected Reciprocal Rank metric #31891
  • SQL: Add support for single parameter text manipulating functions #31874
  • SQL: Support for escape sequences #31884
  • Add Snapshots Status API to High Level Rest Client #31515
  • ingest: date_index_name processor template resolution #31841
  • Test: fix null failure in watcher test #31968
  • Added lenient flag for synonym token filter #31484
  • Fix wrong NaN check in MovingFunctions#stdDev() #31888
  • Add opaque_id to audit logging #31878
  • add support for is_write_index in put-alias body parsing #31674
  • Handle missing values in painless #30975
  • BREAKING: High Level REST Client: Add x-pack-info API #31870
  • Do not return all indices if a specific alias is requested via get aliases api. #29538
  • Ingest: Enable Templated Fieldnames in Rename #31690
  • Ingest: Add ignore_missing option to RemoveProc #31693
  • Add template config for Beat state to X-Pack Monitoring #31809
  • SQL: Remove restriction for single column grouping #31818
  • Check timeZone argument in AbstractSqlQueryRequest #31822
  • Fix profiling of ordered terms aggs #31814

Changes in 6.5:

  • [X-Pack] Beats centralized management: security role + licensing #30520

Changes in 7.0:

  • Tests: Fix SearchFieldsIT.testDocValueFields #31995
  • BREAKING: Remove the ability to index or query context suggestions without context #31007

Apache Lucene

Points-based geo shapes

Lucene 6 introduced a new API called points, which is implemented under the hood with a BKD tree. Since then, points have been the way to go to index and search numerics (1 dimension), geo points, and numeric ranges (2 dimensions), as BKD trees provide faster searching and better compression than the previous inverted-index-based support.

The next logical step would be to implement shape support on top of the points API, but this raises questions: shapes are more commonly indexed using R-trees rather than BKD trees, which are designed for points. However, adding support for a new type of data structure to Lucene isn't a small task, and the proposal to add support for R-trees didn't get much traction.

We are back with a new proposal that uses the BKD tree as an R-tree by tessellating shapes into triangles that are indexed as 6-dimensional points. The minimum and maximum values for each coordinate that the BKD tree provides on inner levels help compute the minimum bounding rectangle that contains all triangles in the leaves, just as an R-tree would. We are also confident that the current split mechanism will perform well. Initial results are very promising.
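The encoding idea can be sketched outside Lucene: treat each triangle of a tessellated shape as one 6-dimensional point, and recover a node's minimum bounding rectangle from the per-dimension minima and maxima. All names below are illustrative, not Lucene's API:

```python
# Illustrative sketch only, not Lucene code: a triangle with vertices
# a, b, c becomes one 6-dimensional point (ax, ay, bx, by, cx, cy).
def triangle_to_point(a, b, c):
    return (a[0], a[1], b[0], b[1], c[0], c[1])

def minimum_bounding_rectangle(points):
    """Compute (min_x, min_y, max_x, max_y) for a set of 6-D triangle points.

    A BKD inner node stores the per-dimension min/max of its subtree;
    dimensions 0, 2, 4 hold x coordinates and 1, 3, 5 hold y coordinates,
    so combining them yields the bounding rectangle of all triangles
    below the node, just as an R-tree node would.
    """
    xs = [p[i] for p in points for i in (0, 2, 4)]
    ys = [p[i] for p in points for i in (1, 3, 5)]
    return (min(xs), min(ys), max(xs), max(ys))

# A unit square tessellated into two triangles:
square = [
    triangle_to_point((0, 0), (1, 0), (1, 1)),
    triangle_to_point((0, 0), (1, 1), (0, 1)),
]
print(minimum_bounding_rectangle(square))  # (0, 0, 1, 1)
```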

Other