Elasticsearch 8.3: Easily revive that old data archive


Today, we are pleased to announce the general availability of Elasticsearch 8.3. If you are looking for Kibana and Elastic Cloud, be sure to head over to this blog. In 8.3, you can query old snapshots as simple archives without needing an old Elasticsearch cluster. Plus: new sharding guidance, a new geo grid query, and a new third-party ML model for question answering.

Ready to dive in and get started? We have the links you need:

New era in snapshots 

We’ve got your back(ups)

Since the release of Elasticsearch 1.0, snapshots have undergone many technical innovations. Snapshots started out as backups of Elasticsearch cluster data, including the indices, for business continuity, but were later also used to transfer data between clusters. Recently, snapshots became searchable, giving you the ability to query data stored remotely on low-cost cloud object storage, which reduces total cost of ownership while delivering fast search results.

Now we are excited to introduce the next chapter of snapshots: snapshots as simple archives, generally available in Elasticsearch 8.3.


Data with no end of life 

In 8.3, Elasticsearch can directly search snapshots as old as 5.0 without needing an old Elasticsearch cluster. With this new capability, Elasticsearch can access the snapshot repository, including the full _source of the documents, so you can run simple queries and aggregations. This ensures that data you store in Elasticsearch remains accessible when you upgrade, without requiring a reindex. Snapshots can now be used as archives for governance and compliance, security investigations, and historical lookbacks, regardless of which Elasticsearch version they were indexed on and which version (8.3 onwards) your cluster is running.

With snapshots as simple archives, you choose how to store the data: either restore the snapshot to local storage as a read-only archive index, or keep the data stored remotely via searchable snapshots so that the archived data doesn’t have to fully reside on local disks to be accessed.
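
As a concrete illustration of the second option, mounting an archived index from a snapshot as a searchable snapshot might look like the sketch below. The repository, snapshot, and index names are placeholders; substitute your own.

// Mount an index from an old snapshot so it can be searched without a full restore
POST /_snapshot/my_repository/my_old_snapshot/_mount?wait_for_completion=true
{
  "index": "logs-2016",
  "renamed_index": "archive-logs-2016",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}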

We expect that querying data in snapshots as old as six years won’t be as frequent as querying data ingested six minutes ago. A trade-off in performance and limited query capability is therefore acceptable for most use cases, and pairs nicely with scenarios such as security investigations, governance, and historical lookbacks. Be sure to check out the list of supported queries and aggregations to get started.

Scaling Elasticsearch to new heights

The ELK Stack (or Elastic Stack) is already known for speed, scale, relevance, and simplicity, but we are always looking to push boundaries. Over the last few releases, we have been working on reducing the amount of resources that Elasticsearch shards consume, to free up memory and CPU for important things like indexing and searching. The reduction in idle shard resource usage has been so significant that we are ready to unveil new guidance on how many shards your Elasticsearch cluster should have.

New Elasticsearch shard guidance

Prior to 8.3, the general rule was to keep the number of shards per node below 20 per GB of configured heap. A node with a 30GB heap should therefore have had a maximum of 600 shards.

With 8.3, this rule no longer applies, as we have significantly reduced the amount of resources a shard uses. There are now two considerations.

The first shard sizing consideration is to aim for approximately 3,000 indices or fewer per GB of heap memory on each master node. Knowing how many indices your data nodes hold helps you plan how much heap the master nodes in the cluster should have; see the sketch below for one way to check. For example, if your cluster contains 12,000 indices, then each dedicated master node should have at least 4GB of heap. Your results may vary depending on your use case, but this is a great starting point.
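
One way to feed this rule is the cluster stats API, which reports how many indices and shards the cluster currently holds; the filter_path parameter below just trims the response to the relevant counters.

// How many indices and shards does the cluster hold right now?
GET _cluster/stats?filter_path=indices.count,indices.shards.total

If indices.count comes back as 12,000, the rule above works out to at least 4GB of heap on each dedicated master node.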

The second shard sizing consideration is planning how much heap a data node needs based on its mapped fields. While the resources used by each mapped field can vary, plan for roughly 1kB of heap per field per index on a data node, plus an extra 0.5GB of heap for its workload and other overheads. For example, if a data node holds shards from 1,000 indices, each containing 4,000 mapped fields, then you should allow approximately 1,000 × 4,000 × 1kB = 4GB of heap for the fields and another 0.5GB for the workload, so the heap size should be at least 4.5GB.
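
To estimate the mapped-field part of this calculation, the field capabilities API lists every mapped field for an index or data stream; the index name below is a placeholder.

// List the mapped fields for one index; the count feeds the 1kB-per-field estimate
GET /my-metrics-index/_field_caps?fields=*

Counting the fields returned (roughly, since a few metadata and runtime fields are included) and multiplying by the number of indices on the node and 1kB gives the field-related heap estimate.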

Elasticsearch is a geospatial powerhouse

Elasticsearch “leveled up” its geospatial game in 8.3 with the introduction of the geo grid query. With the geo grid query, you can now natively return all the documents that overlap a specific geo tile. There is no need to reconstruct the geometry or the actual boundaries of the spatial cluster yourself; Elasticsearch does this for you, saving time, reducing complexity, and avoiding discrepancies between the search query and the aggregation bounds. This is especially useful when geometries span tiles on a sphere, like the panels on a soccer ball / football: hexagonal tiles cover the sphere neatly, but computing each tile’s exact boundary is not straightforward.

GET /example/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_grid": {
          "location": {
            "geotile": "12/25/124"
          }
        }
      }
    }
  }
}

The geo grid query also gives you a single source of truth for containment: it matches exactly the intersection test that Elasticsearch itself uses. If a client computes grid-cell bounds at a higher (or lower) precision than Elasticsearch uses when running a corresponding aggregation, the containment check might differ slightly; querying the grid cell directly side-steps any disconnect caused by projection or datum differences between the client and Elasticsearch.
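
To see how the query and the aggregation line up, here is the corresponding geotile_grid aggregation over the same example index and field. Each bucket key it returns is in the same zoom/x/y format (for example 12/25/124) and can be passed straight back into the geo grid query above to fetch the documents behind that bucket.

// Aggregate documents into geo tiles; bucket keys match the geo grid query format
GET /example/_search
{
  "size": 0,
  "aggs": {
    "tiles": {
      "geotile_grid": {
        "field": "location",
        "precision": 12
      }
    }
  }
}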

Support for dots in field names

If you have been using metrics within Elasticsearch, then you may have encountered dots in field names. Before 8.3, Elasticsearch treated dots within field names as an object separator, or a hierarchy. This means a document like this:

{
  "metric.value.max": 42
}

is indexed as if it was formatted like this:

{
  "metric": {
    "value": {
      "max": 42
    }
  }
}

This mapping translates into two object fields, called metric and metric.value, and a long field called metric.value.max.

This becomes a problem when ingesting metrics that come from an external system such as OpenTelemetry. It can be common to have both metric.value and metric.value.max as metric names:

{
  "metric.value": 10,
  "metric.value.max": 42
}

The document will always fail indexing because metric.value would need to be both an object field (to hold metric.value.max) and a long field at the same time, which is illegal. Popular metric monitoring systems like Prometheus also use dots in field names, and the list of data sources using dots is growing.

If you are familiar with this issue, then you may have tried replacing dots with underscores or adding suffixes, which can create other issues such as users not seeing the field names they expect. This is why 8.3 adds support for dots in field names, to better support metric data in Elasticsearch.
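
This post doesn’t show the mapping involved, but in 8.3 this behavior is driven by the new subobjects mapping setting: with subobjects set to false, dotted field names are kept as leaf fields instead of being expanded into objects. A minimal sketch, with an illustrative index name and field types:

// With subobjects: false, metric.value and metric.value.max can coexist as leaf fields
PUT metrics
{
  "mappings": {
    "subobjects": false,
    "properties": {
      "metric.value": { "type": "long" },
      "metric.value.max": { "type": "long" }
    }
  }
}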

Question and answer NLP model support

Since the announcement of Elasticsearch’s new native vector search capabilities and native support for modern natural language processing models, we have been hard at work supporting additional PyTorch models. In 8.3, we are introducing support for question answering. With this new capability, you can extract answers to questions from a document, greatly improving relevance and user experiences.

Getting exactly the right answer to your question from an FAQ or help page can feel like magic. With the question and answer model, along with these other third-party models, our goal is to provide that magical experience through advanced natural language processing without the deep domain expertise required by other systems. Question and answer is in technical preview, so be sure to try it and give us your feedback.
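
As a rough sketch of what a question and answer call can look like, the request below runs inference against a third-party extractive QA model (for example, one trained on SQuAD) that has been uploaded to the cluster with Eland. The model ID, input text, and question are illustrative, and the exact request shape may differ depending on how you deploy the model.

// Ask a deployed question answering model to extract an answer from a passage
POST _ml/trained_models/deepset__roberta-base-squad2/_infer
{
  "docs": [
    {
      "text_field": "Elasticsearch 8.3 can search snapshots as old as version 5.0 as simple archives."
    }
  ],
  "inference_config": {
    "question_answering": {
      "question": "Which snapshot versions can Elasticsearch 8.3 search?"
    }
  }
}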

Wait … there’s more

8.3 is packed with so many features we couldn’t fit them all in this blog. Be sure to check out the release notes for more news on Elasticsearch, Kibana, and Elastic Cloud.

Try it out

Existing Elastic Cloud customers can access these features from the Elastic Cloud console, and check out the Quick Start guides. You can get started with a free 14-day trial of Elastic Cloud or download the free self-managed version.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.