This Week in Elasticsearch and Apache Lucene - 2018-09-21

Elasticsearch

Highlights

Structured Audit Logging

We merged structured audit logging to master at the end of last week and have been working on the backport to 6.5. For 6.x, we have decided to keep the old (non-structured) audit log alongside the new format, so the backport requires reproducing the old format on top of the new code.

Append-only indexing and cross-cluster replication

Users of time-based indices usually index documents once and never change them again (metrics, logs, etc.), relying on ES to generate IDs for them. In ES jargon, this indexing pattern is called append-only indexing, and we have various optimizations in place to support it. For cross-cluster replication, those optimizations are not available to the “following” index because, from its perspective, the documents already had IDs when they arrived. Without these optimizations, the following index won't be able to keep up.

We have had numerous discussions on how to support those optimizations on the following index and came up with an exciting plan: automatically detect operations that don't change any existing documents on the primary, capture this information, and transfer it to the replicas and, in turn, to the following shards. This means that these optimizations will automatically be enabled on the following indices even if the indexing operation originally had an ID, so following indices may be able to catch up faster.

Java Time migration

We have reworked Joda time backwards compatibility in scripts so that both the Joda and the Java time APIs can be used at the same time in 6.x, and users get an explicit deprecation warning for each method that will disappear on upgrade to 7.0. We introduced a date formatter interface to be able to parse milliseconds since epoch properly; each date formatter also now has a name, which is needed for mapping changes.

Performance

We have released Rally 1.0.1 with lots of stability fixes but also a few features like a new shrink index API.

Types removal

We have resumed work on types removal, coming up with a plan for REST tests migration, and moving tests and docs to typeless APIs.

Annotated text field type

We have merged work on a new field type, annotated_text. The purpose of this field type is to allow identification of entities in documents without the use of synonyms. It is implemented as an Elasticsearch plugin and includes a highlighter.

Token Filters

We have added two scriptable token filters: condition, which will only apply a given filter subchain if the current token matches a predicate; and predicate_token_filter, which will strip out tokens that don’t match a predicate. We will add a further scriptable filter that allows you to edit the byte content of a token, which should reduce the need to write Java plugins for custom analysis chains.
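
As a rough illustration, index settings that use the two new filters might look like the following. The sketch builds the settings body in Python and prints it as JSON. The filter names (short_lowercase, long_tokens_only), the analyzer name, and the length thresholds are made up for the example, and the script sources assume that the Painless token object exposes getTerm().

```python
import json

# Hypothetical index settings; filter and analyzer names are illustrative only.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                # condition: apply the "lowercase" sub-chain only to tokens
                # shorter than 5 characters; other tokens pass through unchanged.
                "short_lowercase": {
                    "type": "condition",
                    "filter": ["lowercase"],
                    "script": {"source": "token.getTerm().length() < 5"},
                },
                # predicate_token_filter: drop tokens that do not match the predicate.
                "long_tokens_only": {
                    "type": "predicate_token_filter",
                    "script": {"source": "token.getTerm().length() > 3"},
                },
            },
            "analyzer": {
                "example_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["short_lowercase", "long_tokens_only"],
                }
            },
        }
    }
}

# Serialize the body as it would be sent when creating an index.
body = json.dumps(settings, indent=2)
print(body)
```

The two filters are chained here only to show both in one analyzer; in practice they are independent and can be used separately.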

We have also reworked how token filters that refer to other filters, or to their predecessors in the token chain, are constructed. This means that synonym filters now work correctly with conditional or multiplexer filters.

Changes

Changes in 6.4:

  • Allow to clear the fielddata cache per field #33807
  • Watcher: Ensure triggered watch deletion is sync #33799
  • Do not override named S3 client credentials #33793
  • Skip rebalancing when cluster_concurrent_rebalance threshold reached #33329

Changes in 6.5:

  • Introduce a search_throttled threadpool #33732
  • Restore local history from translog on promotion #33616
  • HLRC: Add support for reindex rethrottling #33832
  • Update geolite2 database in ingest geoip plugin #33840
  • ingest: support simulate with verbose for pipeline processor #33839
  • SQL: Fix ANTLR4 Grammar ambiguities. #33854
  • HLRC: Reindex should support requests_per_second parameter #33808
  • SQL: TRUNCATE and ROUND functions #33779
  • Add contains method to LocalCheckpointTracker #33871
  • Create a WatchStatus class for the high-level REST client. #33527
  • Move CompletionStats into the Engine #33847
  • Fix potential NPE in _cat/shards/ with partial CommonStats #33858
  • Allow TokenFilterFactories to rewrite themselves against their preceding chain #33702
  • [Tests] Nudge wait time in RemoteClusterServiceTests #33853
  • Fixing assertions in integration test #33833
  • SQL: Fix issue with options for QUERY() and MATCH(). #33828
  • Move DocsStats into Engine #33835
  • Use the global doc id to generate random scores #33599
  • Profiler: Don’t profile NEXTDOC for ConstantScoreQuery. #33196
  • SQL: Better handling of number parsing exceptions #33776
  • Ensure fully deleted segments are accounted for correctly #33757
  • SQL: Grammar tweak for number declarations #33767
  • Watcher: Use Bulkprocessor in HistoryStore/TriggeredWatchStore #32490
  • Dependencies: Update javax.mail in watcher to 1.6.2 #33664
  • Add create rollup job api to high level rest client #33521
  • Implement xpack.monitoring.elasticsearch.collection.enabled setting #33474
  • [Monitoring] Removing unused version.* fields #33584
  • DiskThresholdDecider#canAllocate can report negative free bytes #33641
  • Only notify ready global checkpoint listeners #33690
  • Structured audit logging #31931
  • [Kerberos] Add realm name & UPN to user metadata #33338
  • Deprecate negative weight in Function Score Query #33624
  • SQL: Return functions in JDBC driver metadata #33672

Changes in 7.0:

  • BREAKING: Add minimal sanity checks to custom/scripted similarities. #33564
  • BREAKING: Core: Default node.name to the hostname #33677
  • add RemoveCorruptedShardDataCommand #32281
  • TESTS: Set SO_LINGER = 0 for MockNioTransport #32560
  • Ensure realtime _get and _termvectors don’t run on the network thread #33814
  • New Annotated_text field type #30364
  • Don’t count hits via the collector if the hit count can be computed from index stats. #33701

Apache Lucene

Indexed shapes

After benchmarking BKD-based geo shapes, we were a bit disappointed by their performance, but we looked into it and found quite a lot of room for improvement!

One issue is that BKD trees are designed to index points. The goal of a BKD tree is to split the space into cells that contain equal numbers of points, in such a way that filtering by any dimension or combination of dimensions is efficient. When indexing shapes and ranges, a better goal is to minimize the intersection between cells, which is what R-trees do. Unfortunately, naively indexing the coordinates of a triangle as a 6D point in a BKD tree doesn't split the space efficiently. Nick is proposing to improve BKD trees so that only a subset of their dimensions is indexed. The underlying idea is that, instead of indexing triangles as 6D points, we would index the minimum bounding rectangle of the triangle as a 4D point and store the remaining information needed to reconstruct the triangle in 3 additional dimensions that wouldn't be indexed. This makes the BKD tree essentially behave like an R-tree and gave a 46% speedup to shape search in the benchmark.
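
To make the idea concrete, here is a minimal sketch of splitting a triangle into an indexed part (its bounding rectangle, 4 values) and an un-indexed payload. This is not the Lucene encoding: the payload is simply the original vertices rather than a compact 3-value form, and the names Box, bounding_box, and boxes_intersect are invented for the example.

```python
from typing import NamedTuple, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


class Box(NamedTuple):
    """Axis-aligned rectangle: the four values a tree would split on."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def bounding_box(triangle: Triangle) -> Box:
    """Return the minimum bounding rectangle of a triangle."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    return Box(min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two rectangles overlap or touch."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


triangle: Triangle = ((0.0, 0.0), (4.0, 1.0), (1.0, 3.0))
indexed = bounding_box(triangle)  # indexed dimensions
payload = triangle                # stand-in for the un-indexed dimensions

# A query box far from the triangle is rejected using the indexed part alone,
# without ever looking at the payload.
query = Box(5.0, 5.0, 6.0, 6.0)
print(indexed, boxes_intersect(indexed, query))
```

Only when the bounding rectangles intersect would the payload need to be decoded and the exact triangle tested against the query.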

Another issue is that the bounding box of points that the BKD tree records in its index for a sub-tree is sometimes larger than the minimum bounding rectangle of those points, especially if there is correlation between dimensions, which is common when storing ranges or shapes. Ignacio addressed this performance issue by making sure to record, on every leaf of the BKD tree, the minimum bounding rectangle that contains all of its points. Again, this brought significant speedups, up to 10x in some cases.
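
A small sketch of why the tighter rectangle helps, using ranges stored as 2D points (min, max). The cell bounds, the leaf contents, and the query are invented numbers, and the helper names are hypothetical; the point is only that the recorded cell can intersect a query that the minimum bounding rectangle of the actual points rules out.

```python
def minimum_bounding_rectangle(points):
    """Return (lower, upper) corner tuples tightly enclosing all points."""
    dims = range(len(points[0]))
    lower = tuple(min(p[d] for p in points) for d in dims)
    upper = tuple(max(p[d] for p in points) for d in dims)
    return lower, upper


def intersects(lower, upper, q_lower, q_upper):
    """True if two axis-aligned boxes overlap in every dimension."""
    return all(lo <= q_hi and q_lo <= hi
               for lo, hi, q_lo, q_hi in zip(lower, upper, q_lower, q_upper))


# Ranges stored as (min, max) points, so the two dimensions are correlated.
leaf_points = [(1, 2), (2, 3), (3, 5)]

# Hypothetical cell bounds implied by the splits that led to this leaf.
cell_lower, cell_upper = (0, 0), (10, 10)

# Tight bounds computed from the points actually stored in the leaf.
mbr_lower, mbr_upper = minimum_bounding_rectangle(leaf_points)

# Query: ranges whose min and max both fall in [4, 10].
query_lower, query_upper = (4, 4), (10, 10)

print(intersects(cell_lower, cell_upper, query_lower, query_upper))  # cell cannot be skipped
print(intersects(mbr_lower, mbr_upper, query_lower, query_upper))    # tight bounds allow a skip
```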

Other