Elasticsearch highlights

This list summarizes the most important enhancements in Elasticsearch 7.9. For the complete list, go to Elasticsearch release highlights.

Fixed retries for cross-cluster replication

Cross-cluster replication now retries operations that failed due to a circuit breaker or a lost remote cluster connection.

Fixed index throttling

When indexing data, Elasticsearch and Lucene use heap memory for buffering. To control memory usage, Elasticsearch moves data from the buffer to disk based on your indexing buffer settings. If ongoing indexing outpaces the relocation of data to disk, Elasticsearch will now throttle indexing. In previous Elasticsearch versions, this feature was broken and throttling was not activated.
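
The size of this indexing buffer is controlled by the `indices.memory.index_buffer_size` node setting, which defaults to 10% of the heap. As an illustrative sketch (the 15% value below is an example, not a recommendation), it can be adjusted in `elasticsearch.yml`:

indices.memory.index_buffer_size: 15%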

EQL

EQL (Event Query Language) is a declarative language designed to identify patterns and relationships between events.

Consider using EQL if you:

  • Use Elasticsearch for threat hunting or other security use cases
  • Search time series data or logs, such as network or system logs
  • Want an easy way to explore relationships between events

A good introduction to EQL and its purpose is available in this blog post. See the EQL in Elasticsearch documentation for an in-depth explanation, as well as the language reference.

This release includes the following features:

  • Event queries
  • Sequences
  • Pipes

An in-depth discussion of the scope of EQL in Elasticsearch can be found in #49581.
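
As an illustrative sketch (the index name and field values below are hypothetical), a single EQL search request can combine all three features: a `sequence by` clause correlates events that share a field value, each bracketed event query filters by event category and conditions, and a pipe such as `head` post-processes the results:

GET /my-index/_eql/search
{
  "query": """
    sequence by host.id
      [ process where process.name == "cmd.exe" ]
      [ network where destination.port == 443 ]
    | head 5
  """
}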

Data streams

A data stream is a convenient, scalable way to ingest, search, and manage continuously generated time series data. Data streams provide a simpler way to split data across multiple indices while still querying it through a single named resource.

See the Data streams documentation to get started.
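
For example (the template and stream names below are illustrative), a data stream requires a matching index template with the `data_stream` object enabled; documents are then appended through the stream name and must include a `@timestamp` field:

PUT _index_template/my-template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {}
}

POST my-data-stream/_doc
{
  "@timestamp": "2020-08-18T12:00:00.000Z",
  "message": "login attempt failed"
}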

Enable fully concurrent snapshot operations

Snapshot operations can now execute in a fully concurrent manner.

  • Create and delete operations can be started in any order
  • Delete operations wait for snapshot finalization to finish and are batched as much as possible to improve efficiency; once enqueued in the cluster state, they prevent new snapshots from starting on data nodes until they have executed
  • Snapshot creation is completely concurrent across shards, but per-shard snapshots are linearized for each repository, as are snapshot finalizations
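
For example (the repository and snapshot names below are hypothetical), several snapshot operations can now be submitted without waiting for earlier ones to complete:

PUT _snapshot/my_repository/snapshot_1?wait_for_completion=false

PUT _snapshot/my_repository/snapshot_2?wait_for_completion=false

DELETE _snapshot/my_repository/old_snapshot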

Improve speed and memory usage of multi-bucket aggregations

Before 7.9, many of the more complex aggregations made a simplifying assumption that required them to duplicate many data structures once per bucket that contained them. The most expensive of these weighed in at a couple of kilobytes each. So for an aggregation like:

POST _search
{
  "aggs": {
    "date": {
      "date_histogram": { "field": "timestamp", "calendar_interval": "day" },
      "aggs": {
        "ips": {
          "terms": { "field": "ip" }
        }
      }
    }
  }
}

When run over three years of data, this aggregation spends a couple of megabytes on bucket accounting alone. More deeply nested aggregations spend even more on this overhead. Elasticsearch 7.9 removes all of this overhead, which should allow aggregations to run better in low-memory environments.

As a bonus, we wrote quite a few Rally benchmarks for aggregations to make sure that these changes didn’t slow down aggregations, so now we can think much more scientifically about aggregation performance. The benchmarks suggest that these changes don’t affect simple aggregation trees and speed up complex aggregation trees of similar or greater depth than the example above. Your actual performance changes will vary, but this optimization should help!

Optimize date_histograms across daylight savings time

Previously, rounding dates on a shard that contained a daylight savings time (DST) transition was drastically slower than on a shard whose dates all fell on one side of the transition, and it also generated a large number of short-lived objects in memory. Elasticsearch 7.9 has a revised and far more efficient implementation that adds only a comparatively small overhead to requests.
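
DST transitions only come into play when a time_zone is specified on the aggregation. As an illustrative sketch (the field name and time zone are examples), a request that triggers this code path looks like:

POST _search
{
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day",
        "time_zone": "Europe/Paris"
      }
    }
  }
}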

Improved resilience to network disruption

Elasticsearch now has mechanisms to safely resume peer recoveries after a network disruption, which would previously have failed any in-progress peer recoveries.

Wildcard field optimised for wildcard queries

Elasticsearch now supports a wildcard field type, which stores values optimised for wildcard grep-like queries. While such queries are possible with other field types, they suffer from constraints that limit their usefulness.

This field type is especially well suited for running grep-like queries on log lines. See the wildcard datatype documentation for more information.
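
For example (the index and field names below are illustrative), a field is mapped as `wildcard` and then searched with a wildcard query:

PUT my-index
{
  "mappings": {
    "properties": {
      "message": { "type": "wildcard" }
    }
  }
}

GET my-index/_search
{
  "query": {
    "wildcard": {
      "message": { "value": "*failed*shard*" }
    }
  }
}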

Indexing metrics and back pressure

Elasticsearch 7.9 tracks metrics about the number of indexing request bytes that are outstanding at each stage of the indexing process (coordinating, primary, and replication). These metrics are exposed in the node stats API. Additionally, the new indexing_pressure.memory.limit setting controls the maximum number of outstanding bytes, which defaults to 10% of the heap. Once this many bytes of a node’s heap are consumed by outstanding indexing bytes, Elasticsearch starts rejecting new coordinating and primary requests.

Additionally, since a failed replication operation can fail a replica, Elasticsearch allows outstanding replication bytes to grow to 1.5 times the limit, and only replication bytes can trigger this higher limit. If replication bytes reach that level, the node stops accepting new coordinating and primary operations until the replication workload has dropped.
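
As an illustrative sketch (the 15% value is an example), the limit can be adjusted in `elasticsearch.yml`, and the per-node metrics can be retrieved from the node stats API:

indexing_pressure.memory.limit: 15%

GET _nodes/stats?filter_path=nodes.*.indexing_pressure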

Inference in pipeline aggregations

In 7.6, we introduced inference, which enables you to make predictions on new data with your regression or classification models via a processor in an ingest pipeline. Now, in 7.9, inference is even more flexible! You can reference a pre-trained data frame analytics model in an aggregation to run inference on the results of the parent bucket aggregation. The aggregation uses the model on those results to provide a prediction. This addition enables you to run classification or regression analysis at search time. If you want to perform analysis on a small set of data, you can generate predictions without setting up a processor in an ingest pipeline.
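
As a hedged sketch (the model ID, field names, and aggregation names below are hypothetical), the inference aggregation references a trained model by its model_id and maps sibling aggregation results to the model’s input fields via buckets_path:

POST _search
{
  "size": 0,
  "aggs": {
    "client": {
      "terms": { "field": "client_ip" },
      "aggs": {
        "total_bytes": { "sum": { "field": "bytes" } },
        "malicious_prediction": {
          "inference": {
            "model_id": "my_classification_model",
            "buckets_path": {
              "total_bytes": "total_bytes"
            }
          }
        }
      }
    }
  }
}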