Today, we are proud to announce the release of Elasticsearch 7.0.0 GA, based on Lucene 8.0.0. This is the latest stable release and is already available for deployment via our Elasticsearch Service on Elastic Cloud.
A big thank you goes out to all the Elastic Pioneers who tested early versions and opened bug reports, and so helped to make this release as good as it is.
Go and download it today!
Elasticsearch 7.0.0 has something for everyone. At Elastic, we constantly talk about speed, scale, and relevance: it’s in our source code. Elasticsearch 7.0 exemplifies this. It’s the fastest, safest, most resilient, and easiest to use version of Elasticsearch ever, and it comes with a boatload of enhancements and new features.
Without further ado, let’s dive into some of the big new developments!
Some of the first questions we typically get asked about any new release of Elasticsearch relate to performance. Search (and Elasticsearch) makes things fast, so it’s naturally one of the first things people gravitate towards. Elasticsearch 7.0 brings a number of changes that improve both search and indexing performance. In fact, some of the improvements we’ve incorporated in Elasticsearch 7.0 can result in unbounded performance improvements! We’re pretty excited, and we think you will be too!
Faster top k retrieval
We have made a significant improvement to search performance in Elasticsearch 7.0 for situations in which the exact hit count is not needed. For example, if your users typically just look at the first page of results on your site and don’t care about exactly how many documents matched, you may be able to show them “more than 10,000 hits” and then provide them with paginated results.
It’s quite common to have users enter frequently-occurring terms like “the” and “a” in their queries, which has historically forced Elasticsearch to score a lot of documents even when those frequent terms couldn’t possibly add much to the score. In these conditions Elasticsearch can now skip calculating scores for records that are identified at an early stage as records that will not be ranked at the top of the result set. This can significantly improve the query speed. The actual number of top results that are scored is configurable, but the default is 10,000. The behavior of queries that have a result set that is smaller than this threshold will not change — i.e. the results count is accurate but there is no performance improvement for queries that match a small number of documents. Because the improvement is based on skipping low ranking records, it does not apply to aggregations. You can read more about this powerful algorithmic development in our blog post Magic WAND: Faster Retrieval of Top Hits in Elasticsearch.
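As a rough illustration of how a client might take advantage of this, here is a minimal Python sketch. It assumes the threshold is set through the track_total_hits search option and that a 7.x response reports the total as an object with value and relation keys. The field name body and both helper names are made up for the example.

```python
import json


def build_search_body(query_text, exact_count_up_to=10_000):
    """Build a search body that only counts hits exactly up to a threshold."""
    return {
        # "body" is a hypothetical text field used for illustration.
        "query": {"match": {"body": query_text}},
        # Assumed option: count hits accurately only up to this number.
        "track_total_hits": exact_count_up_to,
        "size": 10,
    }


def describe_total(total):
    """Turn the hits.total object of a response into display text."""
    if total["relation"] == "gte":
        # The count is a lower bound, not an exact figure.
        return f"more than {total['value']:,} hits"
    return f"{total['value']:,} hits"


print(json.dumps(build_search_body("the quick fox"), indent=2))
print(describe_total({"value": 10000, "relation": "gte"}))
print(describe_total({"value": 42, "relation": "eq"}))
```

Raising the threshold trades some of the speedup for a more precise count, so it is worth picking the smallest value your UI actually needs.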
Building on top of the faster top k retrieval, Elasticsearch 7.0 has several new field types to get the most out of your data. Two to help with core search use cases are rank_feature and rank_features. These can be used to boost documents based on numeric or categorical values that you know are relevant to the scoring while still maintaining the performance of the new faster top k query capabilities. For more information on these fields and how to use them, read our blog.
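To make the two field types a little more concrete, the sketch below builds an index mapping and a query as plain Python dictionaries. The field names pagerank, topics, and content are hypothetical, and the bodies are an assumption about how the rank_feature query is typically combined with a text query, not a definitive recipe.

```python
import json

# Hypothetical mapping: one numeric feature and one sparse set of features.
mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "pagerank": {"type": "rank_feature"},
            "topics": {"type": "rank_features"},
        }
    }
}

# The text match decides which documents qualify; the rank_feature
# clauses in "should" only add to the score of matching documents.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"content": "elasticsearch"}}],
            "should": [
                {"rank_feature": {"field": "pagerank"}},
                {"rank_feature": {"field": "topics.search"}},
            ],
        }
    }
}

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```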
Adaptive Replica Selection
In Elasticsearch 6.x and prior, a series of search requests to the same shard would be forwarded to the primary and each replica in round-robin fashion. This could prove problematic if one node started a long garbage collection: search requests would still be forwarded to the slow node, which hurt search latency.
In 6.1, we added an experimental feature called Adaptive Replica Selection. Each node tracks and compares how long search requests to other nodes take, and uses this information to adjust how frequently to send requests to shards on particular nodes. In our benchmarks, this results in an overall improvement in search throughput and reduced 99th percentile latencies.
This option was disabled by default throughout 6.x, but we’ve heard from users who found the setting very beneficial to real-world search performance, so we’ve turned it on by default starting in Elasticsearch 7.0.
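The idea of tracking response times and steering requests away from slow nodes can be sketched in a few lines of Python. This is a toy model only: the class name, the smoothing factor, and the use of a simple exponentially weighted moving average are assumptions for illustration, not the formula Elasticsearch uses.

```python
class ReplicaPicker:
    """Toy selector that prefers the node with the lowest smoothed latency."""

    def __init__(self, nodes, alpha=0.3):
        self.alpha = alpha  # weight given to the newest observation
        self.ewma_ms = {node: None for node in nodes}

    def record(self, node, took_ms):
        """Fold one observed response time into the node's moving average."""
        previous = self.ewma_ms[node]
        if previous is None:
            self.ewma_ms[node] = float(took_ms)
        else:
            self.ewma_ms[node] = self.alpha * took_ms + (1 - self.alpha) * previous

    def pick(self):
        """Return an unmeasured node if any, else the fastest one seen so far."""
        for node, value in self.ewma_ms.items():
            if value is None:
                return node
        return min(self.ewma_ms, key=self.ewma_ms.get)


picker = ReplicaPicker(["node-a", "node-b", "node-c"])
picker.record("node-a", 12)
picker.record("node-b", 900)  # e.g. stuck in a long garbage collection
picker.record("node-c", 15)
print(picker.pick())
```

A purely greedy picker like this one would never notice that a slow node has recovered, which is one reason a real implementation needs more than a single average per node.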
Minimum round-trip cross-cluster search
In Elasticsearch 5.3, we released a feature called cross-cluster search for users to query across multiple clusters. We’ve since improved on the cross-cluster search framework, adding features to ultimately use it to deprecate and replace tribe nodes as a way to federate queries. In Elasticsearch 7.0, we’re adding a new execution mode for cross-cluster search: one which avoids round-trips when they aren’t necessary. This mode (ccs_minimize_roundtrips) can result in faster searches when the cross-cluster search spans high-latency links, e.g. across a WAN.
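As a small sketch of how a client might opt in, the helper below builds a search path with ccs_minimize_roundtrips passed as a query-string parameter. The cluster aliases, index name, and helper name are hypothetical, and treating the mode as a per-request parameter is an assumption made for this example.

```python
from urllib.parse import urlencode


def ccs_search_path(remote_indices, minimize_roundtrips=True):
    """Build a _search path that targets indices on remote clusters."""
    # Remote indices are addressed as "<cluster alias>:<index name>".
    index_expr = ",".join(remote_indices)
    params = urlencode({"ccs_minimize_roundtrips": str(minimize_roundtrips).lower()})
    return f"/{index_expr}/_search?{params}"


print(ccs_search_path(["cluster_one:logs", "cluster_two:logs"]))
```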
Skip refreshes on “search idle” shards
On the indexing side, Elasticsearch 6.x and prior refreshed each index automatically in the background, by default every second. This provides the “near real time” search capabilities Elasticsearch is known for: by default, documents become available to search requests within 1 second of being added. However, this default 1-second refresh has a significant impact on indexing performance when the refreshes are not needed, e.g. if Elasticsearch isn’t servicing any active searches.
Elasticsearch 7.0 gets much smarter about this behavior by introducing the notion of a shard being “search idle.” A shard now transitions to being search idle after it hasn’t had any searches for 30 seconds by default. Once a shard is search idle, all scheduled refreshes will be skipped until a search comes through, which will trigger the next scheduled refresh. We know that this is going to significantly increase the indexing throughput for many users. The new behavior is only applied if there is no explicit refresh interval set, so do set the refresh interval explicitly for any indices on which you prefer the old behavior.
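The decision described above can be written down as a small predicate. This is a simplified model under stated assumptions: the function and parameter names are invented, and the 30-second default is taken from the text. The relevant index settings are assumed to be index.search.idle.after and index.refresh_interval.

```python
def should_run_scheduled_refresh(now_s, last_search_s,
                                 explicit_refresh_interval=None,
                                 idle_after_s=30):
    """Decide whether a scheduled background refresh should run."""
    if explicit_refresh_interval is not None:
        # An explicitly configured interval keeps the old behavior.
        return True
    # Without an explicit interval, skip refreshes once the shard
    # has gone idle_after_s seconds without serving a search.
    return (now_s - last_search_s) < idle_after_s


print(should_run_scheduled_refresh(now_s=100, last_search_s=90))
print(should_run_scheduled_refresh(now_s=100, last_search_s=60))
print(should_run_scheduled_refresh(now_s=100, last_search_s=60,
                                   explicit_refresh_interval="1s"))
```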
More scalable and resilient
Over the life of Elasticsearch, we’ve tried to be very transparent about any known issues with the stability and scale of the software as well as working rapidly towards improvements. We’re very pleased to announce that with Elasticsearch 7.0, we’ve worked from the ground up to formally model and then rewrite our coordination layer as well as implement strong new protections against issues like out-of-memory (OOM) errors. Let’s dive in!
New cluster coordination
Since the beginning, we focused on making Elasticsearch easy to scale and resilient to catastrophic failures. To support these requirements, we created a pluggable cluster coordination system, with the default implementation known as Zen Discovery. Zen Discovery was meant to be effortless, and give our users peace of mind (as the name implies). The meteoric rise in Elasticsearch usage has taught us a great deal. For instance, Zen’s minimum_master_nodes setting was often misconfigured, which put clusters at a greater risk of split brains and losing data. Maintaining this setting across large and dynamically resizing clusters was also difficult.
In Elasticsearch 7.0, we have completely rethought and rebuilt the cluster coordination layer. The new implementation gives safe sub-second master election times, where Zen may have taken several seconds to elect a new master: valuable time for a mission-critical deployment. With the minimum_master_nodes setting removed, growing and shrinking clusters becomes safer and easier, and there is much less room to misconfigure the system. Most importantly, the new cluster coordination layer gives us strong building blocks for the future of Elasticsearch, ensuring we can build functionality for even more advanced use cases to come.
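To see why the old setting was easy to get wrong, it helps to work through the arithmetic. The sketch below assumes the usual quorum rule for minimum_master_nodes (a majority of master-eligible nodes) and checks whether a network partition could leave two sides that each satisfy the configured minimum. The function names are made up for the example.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def split_brain_possible(master_eligible_nodes, minimum_master_nodes):
    """True if two disjoint groups of nodes could each elect a master."""
    # Two disjoint groups can both reach the minimum only if
    # twice the minimum fits inside the total node count.
    return 2 * minimum_master_nodes <= master_eligible_nodes


# A 3-node cluster configured correctly.
print(quorum(3), split_brain_possible(3, 2))

# The cluster grows to 5 nodes but the setting is left at 2.
print(quorum(5), split_brain_possible(5, 2))
```

The second case is the trap: the value that was safe for three nodes silently becomes unsafe at five unless someone remembers to raise it.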
Better support for small heaps (real-memory circuit breaker)
Elasticsearch 7.0.0 adds an all-new circuit breaker that keeps track of the total memory used by the JVM and will reject requests if they would push usage past a threshold of 95% of the heap allocated to the process. We’ve also changed the default maximum number of buckets returned as part of an aggregation (search.max_buckets) to 10,000; it was unbounded by default in 6.x and prior. Together, these two changes seriously improve the out-of-memory protection of Elasticsearch in 7.x, helping you keep your cluster alive even in the face of adversarial or novice users running large queries and aggregations.
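Both protections boil down to simple threshold checks, sketched here in Python. This is a conceptual model only: the function names are invented, and the real breaker's accounting is more involved than adding one request size to current heap usage.

```python
MIB = 1024 * 1024


def breaker_would_trip(heap_used_bytes, request_bytes, heap_max_bytes,
                       limit_ratio=0.95):
    """Reject a request if serving it would exceed the heap threshold."""
    return heap_used_bytes + request_bytes > limit_ratio * heap_max_bytes


def too_many_buckets(bucket_count, max_buckets=10_000):
    """Reject an aggregation response that exceeds the bucket limit."""
    return bucket_count > max_buckets


heap = 1024 * MIB
print(breaker_would_trip(900 * MIB, 100 * MIB, heap))  # 1000 MiB vs 972.8 MiB limit
print(breaker_would_trip(500 * MIB, 100 * MIB, heap))
print(too_many_buckets(250_000))
```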
Default to one shard
One of the biggest sources of trouble we’ve seen over the years from our users has been oversharding, and defaults play a big role in that. In Elasticsearch 6.x and prior, each index got 5 shards by default. If you had one daily index for 10 different applications and each had the default of 5 shards, you were creating 50 shards per day and it wasn’t long before you had thousands of shards even if you were only indexing a few gigabytes of data per day. Index lifecycle management (ILM) was a first step to help with this: providing native rollover functions to create indexes by size instead of (just) by day and built-in shrink functionality to shrink the number of shards per index. Defaulting indices to 1 shard is the next step in helping to reduce oversharding. Of course, if you have another preferred primary shard count, you can set it via the index settings.
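The arithmetic in the example above is easy to reproduce. The helper below counts primary shards only and uses the numbers from the text (10 applications, one daily index each); the function name and the 30- and 90-day horizons are chosen for illustration.

```python
def primary_shards_after(days, apps=10, primaries_per_index=5):
    """Primary shards created by one daily index per application."""
    return days * apps * primaries_per_index


for days in (1, 30, 90):
    old_default = primary_shards_after(days, primaries_per_index=5)
    new_default = primary_shards_after(days, primaries_per_index=1)
    print(f"{days:>2} days: {old_default:>5} shards with 5 per index, "
          f"{new_default:>4} with 1 per index")
```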
Easier to use
As the Elasticsearch userbase has grown, a larger percentage of our users have less working knowledge of how to make Elasticsearch hum. In response, we’ve focused a lot of our effort on making it easier for users to get things done “right.” Soft limits, circuit breakers, and other warnings help to protect system administrators from new users writing “bad” queries. We released Helm charts to provide a great out-of-the-box experience for users who want to get started quickly on Kubernetes. And as you’ll see below, we’ve continued our investment in making other parts of the Elastic Stack work well with Elasticsearch and generally help users get up and running quicker and with fewer opportunities for mistakes. Have a look below for some examples!
One of the more prominent “getting started hurdles” we’ve seen users run into has been not knowing that Elasticsearch is a Java application and that they need to install one of the supported JDKs first. With 7.0, we’re now releasing versions of Elasticsearch which pre-bundle the JDK to help users get started with Elasticsearch even faster. If you want to bring your own JDK, you can still do so by setting JAVA_HOME before starting Elasticsearch.
JSON logging is now enabled in Elasticsearch in addition to plaintext logs. Starting in 7.0.0, you will find new files with a .json extension in your log directory. This means you can now use filtering tools like jq to pretty print and process your logs in a much more structured manner. You can also expect to find additional information like cluster.uuid, type (and more) in each log line. The “type” field in each JSON log line lets you distinguish log streams when running on Docker.
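If you would rather process these files from a script than with jq, a few lines of Python are enough. The sample lines below are invented for the example; only the type and cluster.uuid field names come from the text, and the other keys are assumptions about what a log line might contain.

```python
import json

# Hypothetical log lines, one JSON object per line.
sample_lines = [
    '{"type": "server", "level": "INFO", "cluster.uuid": "abc123", "message": "started"}',
    '{"type": "deprecation", "level": "WARN", "cluster.uuid": "abc123", "message": "old setting used"}',
]


def filter_by_type(lines, log_type):
    """Parse each line and keep the entries from one log stream."""
    entries = (json.loads(line) for line in lines if line.strip())
    return [entry for entry in entries if entry.get("type") == log_type]


for entry in filter_by_type(sample_lines, "server"):
    print(entry["cluster.uuid"], entry["level"], entry["message"])
```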
Feature-complete high-level REST client for Java
If you’ve been following our blog or our GitHub repository, you may be aware of a task we’ve been working on for quite a while now: creating a next-generation Java client for accessing an Elasticsearch cluster. We started off by working on the most commonly-used features like search and aggregations, and have been working our way through administrative and monitoring APIs. The high-level REST client simplifies network architectures by allowing all actions you’d perform against the cluster from your client application to use a common port. Many of you who use Java are already using this new client, but for those who are still using the TransportClient, now is a great time to upgrade to our high-level REST client (HLRC).
As of 7.0.0, the HLRC now has all the API checkboxes checked to call it “complete” so those of you still using the TransportClient should be able to migrate. We’ll, of course, continue to develop our REST APIs and will add them to this client as we go. For a list of all of the APIs that are available, have a look at our HLRC documentation. To get started, have a look at the getting started with the HLRC section of our docs and if you need help migrating from the TransportClient, have a look at our migration guide.
New indices automatically eligible for replication to other Elasticsearch clusters
In Elasticsearch 6.5, we released cross-cluster replication (CCR) as a beta feature. CCR requires any replicated index to maintain a history of document changes (when a document is updated or deleted) through the soft_deletes index setting, which must be set on the leader index at index creation time. By retaining these soft deletes, a history can be maintained on the leader shards and replayed for replicating index changes to other Elasticsearch clusters. Because the soft_deletes index setting is required for CCR, Elasticsearch 7.0 now enables it by default on any newly created index, which makes CCR much easier to use. Soft deletes will also be valuable for future Elasticsearch data replication improvements outside of CCR. For more details and use cases for CCR, have a look at the blog we published recently or dive straight into the docs!
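The role of retained history is easier to see in a toy model. In the sketch below, a leader keeps every operation (including deletes) in an ordered history, and a follower catches up by replaying everything after its last applied sequence number. The class and method names are invented, and this is a conceptual illustration, not how Elasticsearch or Lucene store soft deletes.

```python
class LeaderShard:
    """Toy leader that retains every operation, including deletes."""

    def __init__(self):
        self.docs = {}
        self.history = []  # entries are (seq_no, op, doc_id, source)

    def _append(self, op, doc_id, source=None):
        self.history.append((len(self.history), op, doc_id, source))

    def index(self, doc_id, source):
        self.docs[doc_id] = source
        self._append("index", doc_id, source)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)
        # The delete stays in the history instead of vanishing.
        self._append("delete", doc_id)

    def changes_since(self, seq_no):
        return [entry for entry in self.history if entry[0] > seq_no]


class FollowerShard:
    """Toy follower that replays the leader's history in order."""

    def __init__(self):
        self.docs = {}
        self.checkpoint = -1  # last sequence number applied

    def pull_from(self, leader):
        for seq_no, op, doc_id, source in leader.changes_since(self.checkpoint):
            if op == "index":
                self.docs[doc_id] = source
            else:
                self.docs.pop(doc_id, None)
            self.checkpoint = seq_no


leader, follower = LeaderShard(), FollowerShard()
leader.index("1", {"user": "kim"})
leader.index("2", {"user": "lee"})
follower.pull_from(leader)
leader.delete("1")
leader.index("2", {"user": "lee", "active": True})
follower.pull_from(leader)
print(follower.docs == leader.docs, follower.checkpoint)
```

Without the retained delete entry, the follower would have no way to learn that document 1 was removed after its first pull.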
Elasticsearch has supported encrypted communications for a long time. However, we recently started supporting JDK 11, which gives us new capabilities. JDK 11 has TLSv1.3 support, so starting with 7.0, we’re supporting TLSv1.3 within Elasticsearch for those of you running JDK 11. To help prevent new users from inadvertently running with low security, we’ve also dropped TLSv1.0 from our defaults. For those running older versions of Java, we have default options of TLSv1.2 and TLSv1.1. Have a look at our TLS setup instructions if you need help getting started.
All new capabilities
Lastly, we always look to expand the horizons of what you can do with Elasticsearch: from helping you achieve new use cases to doubling down on the ones you’re already using it for. With Elasticsearch 7.0, we’ve added some new field types and search capabilities to do exactly that. You’ll find a few of them below.
Up until 7.0, Elasticsearch could only store timestamps with millisecond precision. If you wanted to process events that occur at a higher rate — for example if you want to store and analyze tracing or network packet data in Elasticsearch — you may want higher precision. Historically, we have used the Joda time library to handle dates and times, and Joda lacked support for such high precision timestamps.
JDK 8 introduced an official Java time API which can also handle nanosecond precision timestamps. Over the past year, we’ve been working to migrate our Joda time usage to the native Java time API while trying to maintain backwards compatibility. As of 7.0.0, you can make use of these nanosecond timestamps via a dedicated date_nanos field mapper. Note that aggregations on this field still operate at millisecond resolution to avoid an explosion of buckets.
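A quick worked example shows what is lost at millisecond precision. The timestamps and the field name @timestamp are hypothetical; the mapping body is a sketch of how a date_nanos field might be declared.

```python
import json

# Hypothetical mapping that stores timestamps with nanosecond precision.
mapping = {"mappings": {"properties": {"@timestamp": {"type": "date_nanos"}}}}


def to_millis(epoch_nanos):
    """Truncate a nanosecond epoch timestamp to milliseconds."""
    return epoch_nanos // 1_000_000


# Two events 250 microseconds apart, e.g. two network packets.
first = 1_554_900_000_123_456_789
second = first + 250_000

print(json.dumps(mapping))
print(first != second)                        # distinct at nanosecond precision
print(to_millis(first) == to_millis(second))  # identical once truncated
```

The same truncation explains the aggregation behavior: both events would land in the same millisecond bucket.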
Intervals query: The next-gen span query for legal and patent search
Sometimes our users want to find records in which words or phrases are within a certain distance from each other. In areas like patent and legal search, this is the main way in which experts find documents. It used to be that the only way to do that was span queries, but now we are introducing a brand new way to construct such queries: intervals queries. While span queries are a good tool, they are not always easy to use. Span queries do not use the analyzer, so the person performing the query has to be aware of the analyzer’s logic and perform actions like stemming by hand. Since analyzers can be sophisticated, writing span query logic can be just as sophisticated (and complicated).
The new intervals queries are not just easier to define: they also use the analyzer. This way, the person writing them does not have to be familiar with the transformations performed by the analyzer. In addition, intervals queries are based on sound mathematical research, published in the article Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics. This allowed us to accurately deal with a number of edge cases that were not accurately handled with span queries.
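For a sense of what such a query looks like, the helper below builds an intervals request body as a Python dictionary. The field name claims and the search phrase are hypothetical, and the match rule with its max_gaps and ordered options is shown as one plausible shape of the query, not an exhaustive one.

```python
import json


def intervals_match(field, phrase, max_gaps=-1, ordered=False):
    """Build an intervals query that matches analyzed terms near each other."""
    return {
        "query": {
            "intervals": {
                field: {
                    "match": {
                        # The phrase is analyzed, so no manual stemming is needed.
                        "query": phrase,
                        # Maximum number of positions allowed between terms;
                        # -1 is assumed to mean "no restriction".
                        "max_gaps": max_gaps,
                        # Whether the terms must appear in the given order.
                        "ordered": ordered,
                    }
                }
            }
        }
    }


body = intervals_match("claims", "wireless charging coil", max_gaps=5, ordered=True)
print(json.dumps(body, indent=2))
```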
Script score query (a.k.a. function score 2.0)
With 7.0.0, we are introducing the next generation of our function score capability. This new script_score query provides a new, simpler, and more flexible way to generate a ranking score per record. The script_score query is built from a set of functions, including arithmetic and distance functions, which the user can mix and match to construct arbitrary function score calculations. The modular structure is simpler to use and will open this important functionality to additional users.
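As a rough sketch of what this can look like, the code below builds a script_score request body and mirrors the script's arithmetic in Python so the effect on ranking is visible. The field names message and likes, the script itself, and the helper names are all assumptions chosen for illustration.

```python
import json
import math

# Hypothetical script: scale the text relevance score by a
# dampened popularity signal taken from a numeric field.
SCRIPT = "_score * Math.log(2 + doc['likes'].value)"


def script_score_body(text, script_source=SCRIPT):
    """Wrap a match query in a script_score query."""
    return {
        "query": {
            "script_score": {
                "query": {"match": {"message": text}},
                "script": {"source": script_source},
            }
        }
    }


def expected_score(base_score, likes):
    """Python mirror of the script above, for reasoning about rankings."""
    return base_score * math.log(2 + likes)


print(json.dumps(script_score_body("elasticsearch"), indent=2))
for likes in (0, 10, 1000):
    print(likes, round(expected_score(1.0, likes), 4))
```

The logarithm keeps very popular documents from drowning out textual relevance, which is a common design choice for this kind of boost.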