<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title><![CDATA[Elasticsearch Labs]]></title>
        <description><![CDATA[Articles and tutorials from the Search team at Elastic]]></description>
        <link>https://www.elastic.co/search-labs</link>
        <image>
            <url>https://www.elastic.co/search-labs/favicon.svg</url>
            <title>Elasticsearch Labs</title>
            <link>https://www.elastic.co/search-labs</link>
        </image>
        <generator>Elasticsearch Labs</generator>
        <lastBuildDate>Sat, 09 May 2026 18:06:47 GMT</lastBuildDate>
        <atom:link href="https://www.elastic.co/search-labs/rss/feed" rel="self" type="application/rss+xml"/>
        <copyright><![CDATA[© 2026. Elasticsearch B.V. All Rights Reserved]]></copyright>
        <language><![CDATA[en]]></language>
        <item>
            <title><![CDATA[From averages to any percentile: Elasticsearch ships native exponential histogram support in ES|QL]]></title>
            <description><![CDATA[Query any percentile at any time. Elasticsearch natively stores OTel exponential histograms and lets you analyze distributions in ES|QL without fixed buckets or lossy conversions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/otel-histogram-metrics-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/otel-histogram-metrics-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jonas Kunz]]></dc:creator>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f44560ece4dcc46e6a01826b597e094169e99691-848x477.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch adds native support for OpenTelemetry exponential histograms in ES|QL. Unlike fixed-bucket histograms, exponential histograms dynamically adapt to your data — giving you accurate percentile estimates (median, p99, any percentile you want) at query time with guaranteed error bounds. No more pre-defining buckets, no more lossy conversions. </p><p>Just send your OTel metrics to the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/tsds-ingest-otlp">Elasticsearch OTLP/HTTP endpoint</a> and they&#x27;re stored using the new <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/exponential-histogram">exponential_histogram</a> type and queryable immediately. Already have historical data stored in the classic histogram type? A simple <code>::exponential_histogram</code> cast in your ES|QL queries handles the migration transparently. Already using <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/downsampling-time-series-data-stream">downsampling</a>? Both histogram field types are now fully supported.</p><h2>Histogram metrics</h2><p>When dealing with metrics (in OpenTelemetry or Prometheus, for instance), counters and gauges are the most common metric types. Gauges allow you to monitor values that rise or fall (e.g., CPU utilization). Counters allow you to, well, count things, such as the total number of HTTP requests your service is handling. Counters normally just increase in value, with a few exceptions when they reset, like when a server reboots.</p><p>Beyond counting requests, you can additionally collect a counter measuring the total sum of your HTTP response times, which allows you to derive the average response time by dividing that sum by the total number of requests. However, average response times provide limited insights into the collected data and the system behavior. The best insights are gained by analyzing the collected metric distribution, e.g., through median and percentile calculations. This is where counters fall short.</p><p>In the past, workarounds have been applied: classic Prometheus-style histograms, for example, attempt to capture the distribution using a set of counters. By defining fixed buckets (e.g., one for response times in the range <code>[0s, 1s)</code>, one for <code>[1s, 4s)</code>, and so on) and associating a counter with each, we can at least estimate percentiles broadly. However, the key problem here is that we have to know the distribution of our data up front to properly define these buckets.</p><p>To address this, the OpenTelemetry community has come up with a better solution: exponential histograms. Exponential histograms assign collected values to buckets, just like classic Prometheus-style histograms. The key differentiator is that these buckets vary dynamically based on the collected values. The name &quot;exponential&quot; comes from the fact that the bucket sizes increase exponentially: we use small buckets for small values and wider buckets for larger values. You can find an excellent introduction in the <a href="https://opentelemetry.io/blog/2022/exponential-histograms/">OpenTelemetry blog post on exponential histograms</a>.</p><p>Note that in addition to classic histograms, Prometheus also added <a href="https://prometheus.io/docs/specs/native_histograms/">native histograms</a>, which directly map to OTel <a href="https://prometheus.io/docs/specs/native_histograms/#opentelemetry-interoperability">exponential histograms</a>. 
Native histograms have their own <a href="https://prometheus.io/docs/specs/native_histograms/#promql">PromQL syntax</a>. We are actively working on adding support for that syntax to the <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-supports-promql">Elasticsearch PromQL implementation</a>, so that you can directly query exponential histograms using PromQL.</p><h2>Demo setup</h2><p>Let&#x27;s start by collecting some histogram metrics to show how they can be stored and analyzed in Elasticsearch using ES|QL.</p><p>We&#x27;ll focus on a JVM metric: garbage collection durations. OpenTelemetry defines the <a href="https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/#metric-jvmgcduration">jvm.gc.duration</a> metric, which is histogram-typed. The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java agent</a> natively supports collecting this metric.</p><p>We&#x27;ll spin up a JVM running a <a href="https://renaissance.dev/">Renaissance benchmark</a> to put it under stress. We&#x27;ll start that JVM with the vanilla OpenTelemetry Java agent attached and have it send the metrics directly to Elasticsearch.</p><p>You can find the ready-to-run Docker Compose file <a href="https://github.com/JonasKunz/es-histogram-demo">here</a>. You&#x27;ll just need to insert your <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/tsds-ingest-otlp">Elasticsearch OTLP/HTTP endpoint</a> and API key in the <code>docker-compose.yml</code>:</p><pre><code>OTEL_EXPORTER_OTLP_ENDPOINT: https://<elasticsearch url>/_otlp
OTEL_EXPORTER_OTLP_HEADERS: "Authorization=ApiKey <base64 API key>"</code></pre><p>Note that you don&#x27;t have to use this demo setup. We even encourage you to try it with your own application. Here are the other important OpenTelemetry agent settings the demo already sets, which you should include too if you&#x27;re bringing your own app:</p><pre><code>OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE: delta
OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION: BASE2_EXPONENTIAL_BUCKET_HISTOGRAM
OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_ENABLED: "true"</code></pre><p>Let&#x27;s step through them:</p><ul><li><em>Temporality preference</em>: OpenTelemetry supports both cumulative and delta-based histograms. Cumulative means that the histogram is only cleared after an application restart, while delta clears it after each export. At the time of writing, Elasticsearch only supports delta temporality for histograms. We are actively working on supporting cumulative histograms as well.</li><li><em>Default Histogram Aggregation</em>: By default, OpenTelemetry exports histograms in the Prometheus-style fixed bucket format. Since we want to reap the benefits of exponential histograms, we tell the agent to use them instead.</li><li><em>Runtime Telemetry enabled</em>: This tells the agent to actually collect the detailed JVM metrics, which include <code>jvm.gc.duration</code>.</li></ul><p>Now we are ready to go! We&#x27;ll let the application run in the background and switch over to Kibana to analyze the GC metric.</p><h2>Querying with ES|QL</h2><p>Now let&#x27;s open up Kibana and navigate to &quot;Discover&quot;. There we&#x27;ll switch to <a href="https://www.elastic.co/docs/explore-analyze/discover/try-esql">ES|QL mode</a> and start querying the collected data:</p><pre><code>TS metrics-* | STATS COUNT(jvm.gc.duration)</code></pre><p>In response, we see the metric panel shown below. If you don&#x27;t see any data, make sure to double-check the Kibana <a href="https://www.elastic.co/docs/explore-analyze/query-filter/filtering#set-time-filter">time range filter</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b863b2e272ac6584ac193661a6c4419abffdd243-729x190.png" style="width: 100%;" /><p>This number represents the total number of garbage collection operations that happened in our test application during the selected time frame.</p><p>Similarly, we can query the total time spent on those garbage collection operations:</p><pre><code>TS metrics-* | STATS SUM(jvm.gc.duration)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eda37c5fa244a42258bb452d18f5cbab3ff76eaf-717x190.png" style="width: 100%;" /><p>So we have roughly 270k garbage collections, which in total took 713 seconds. Given these two numbers, we can now compute the average if we are still fluent in primary school-level math. Even if not, you can just let ES|QL do that for you:</p><pre><code>TS metrics-* | STATS AVG(jvm.gc.duration)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fc5d601604b05378beeb4e6e94b613f95fe2fbc-712x188.png" style="width: 100%;" /><p>Now we know that the average garbage collection operation took about 3 milliseconds. However, Java experts might know that there are different kinds of garbage collections happening, which can have significantly different pause times. Fortunately, the OpenTelemetry metric comes with attributes, which allow us to slice the data accordingly:</p><pre><code>TS metrics-* | STATS AVG(jvm.gc.duration) BY jvm.gc.action</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/535d44cdb2ec3ed2ac9afe6b259d7b69c9167bbd-989x476.png" style="width: 100%;" /><p>As expected, major garbage collections take a lot more time per collection than minor ones, at least on average. So far, we have done nothing you couldn&#x27;t also achieve by just using counters. Let&#x27;s now use histograms to understand the actual distribution of the GC latency. 
We&#x27;ll look at the data over time (by grouping using <code>TBUCKET</code>) and focus on the major garbage collections:</p><pre><code>TS metrics-* 
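// keep only the major garbage collections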
| WHERE jvm.gc.action == "end of major GC"
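// MIN/MEDIAN/PERCENTILE/MAX are all estimated from the stored exponential histograms at query time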
| STATS MAX(jvm.gc.duration),
        PERCENTILE(jvm.gc.duration, 99),
        MEDIAN(jvm.gc.duration),
        MIN(jvm.gc.duration)
 BY TBUCKET(100)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ce62a4498d5a6e3fcc3bc85ea78f45a18f6e7576-1016x477.png" style="width: 100%;" /><p>The graph now shows us the minimum, maximum, median and 99th percentile for major garbage collections. Note that we aren&#x27;t bound to only querying the median and the 99th percentile. We can query any percentile we&#x27;d like to see, as these are estimated at query time from the raw exponential histograms.</p><h2>A note on backwards compatibility</h2><p>So far, we have seen how you can use the new shiny toy in Elasticsearch and ES|QL: exponential histograms. However, they have only just reached general availability (GA) in the 9.4 release — so what about your historical data?</p><p>Before exponential histograms were added, Elasticsearch was already capable of storing OpenTelemetry histograms in the <code>histogram</code> field type. To do so, we converted them to <a href="https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf">T-Digest</a>, a different data structure that this field type supports. T-Digest provides good accuracy for extreme percentiles (e.g., 99th percentile) at the cost of accuracy for percentiles in the middle of the distribution, such as the median. In contrast, exponential histograms provide a guaranteed upper bound on the relative error for every percentile. As conversions always introduce errors, we are happy to now have native support for exponential histograms, allowing you to collect and analyze your metrics end-to-end without unnecessary conversions.</p><p>So what should you do if you have historical data and want to keep querying it? Thanks to <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-multi-index#esql-multi-index-union-types">ES|QL union types</a>, the answer is actually easy: you just have to add a <code>::exponential_histogram</code> cast to the histogram metrics in your queries:</p><pre><code>TS metrics-* | STATS AVG(jvm.gc.duration::exponential_histogram)</code></pre><p>When this query encounters <code>histogram</code> fields, it will attempt to convert them to exponential histograms. When operating on <code>exponential_histogram</code> fields, the <code>::exponential_histogram</code> cast has no effect. Note that this also works with mixed data sets: if your backing indices use both types, the query will just do the right thing.</p><p>So if you are building queries or dashboards that you expect to run on pre-9.4 ingested data, we recommend that you simply add <code>::exponential_histogram</code> casts.</p><h2>Wrapping up</h2><p>Native support for OpenTelemetry exponential histograms in Elasticsearch gives you better metric fidelity and more flexible analysis in ES|QL. In this blog post, we have shown how to easily ingest histogram metrics, how to analyze them with various ES|QL aggregations, and what difference exponential histograms make.</p><p><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/exponential-histogram">Exponential histograms</a> are <strong>generally available</strong> in Elasticsearch basic starting with the 9.4.0 release. They will be available in Elastic Cloud <a href="https://www.elastic.co/cloud/serverless">Serverless</a> a few weeks after the 9.4.0 release, once <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP</a> (the managed observability OTLP intake) switches to use the Elasticsearch OTLP endpoint. 
We&#x27;ll update this blog post and add a note to the Elastic Cloud Serverless release notes when that happens.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What’s new in Kibana Dashboards: flexible controls, improved defaults and a visual refresh]]></title>
            <description><![CDATA[Kibana 9.4 ships a batch of dashboard improvements — from better defaults to layout controls to a visual refresh. Here's what changed and why.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-dashboards-improvements</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-dashboards-improvements</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Teresa Alvarez Soler]]></dc:creator>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fea993e3578d858b5c4f95244df052b1caeb8409-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Building a time series in Kibana Dashboards used to take 14 clicks. In 9.4, it takes three. Beyond the headline features in this release — a <a href="https://docs.google.com/document/d/1wy3MXxm9bKD4ppP1Uan5VHmIygTqJk6AJQAqkqzDWlY/edit?usp=sharing">brand new API</a> and <a href="https://docs.google.com/document/d/1VzE8S0zQVQXXYUBiV3t-tAa80dFfWuoXY-Nzd2XLmT4/edit?usp=sharing">natural-language dashboard creation with the AI Chat</a> — we are also shipping a collection of focused changes: improved defaults, freely placed controls, a cleaner visual design. Together, they add up to dashboards that are faster to set up and easier to scan.</p><h2>Place controls anywhere in the dashboard</h2><p>Most dashboarding tools lock filters into a fixed bar at the top of the page. Kibana 9.4 breaks that constraint: <strong>controls are now regular panels</strong> you can drag wherever you need them. <strong>Pin them</strong> to the top so they stay visible while you scroll, or place them inside a <strong>collapsible section</strong> where they automatically <strong>scope their filters to that section only</strong> — so a single dashboard can serve multiple use cases without one filter resetting all the panels.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f0dc157853f6ca8ef3c66c163b80407cf91acf75-1672x814.gif" style="width: 100%;" /><h2>Fewer clicks to a readable chart</h2><p>Building a time series in Kibana Dashboards used to take 14 clicks. Now, it takes three. The new defaults give you a <strong>line chart</strong> with a <strong>bottom legend</strong> and cleaner <strong>x-axis labels</strong> — the way you&#x27;d configure them by hand. A right-hand legend still makes sense past ten series, but for fewer, the bottom list avoids the empty margin a side legend leaves behind. For top-value breakdowns, the default jumps <strong>from three series to nine</strong> — matching how many distinct colors the eye can parse at once — and the <strong>colors</strong> themselves are darker and more separated, so series stand out at a glance.</p><p><em>Before (bar chart by default, legend to the right, redundant x-axis label):</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b4d0b0b181ed4007918762a549b0a95dc9abe5c7-876x340.png" style="width: 100%;" /><p><em>Now (line chart by default, list legend at the bottom, higher-contrast color palette for line readability):</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de06a5c7879d4df8f4095e489f98d9ecc24b23b4-867x340.png" style="width: 100%;" /><h2>A calmer look for dense dashboards</h2><p><strong>Dense dashboards</strong> in most tools still look like a wall of boxes. We redesigned Kibana&#x27;s dashboard visuals to feel <strong>flatter</strong> and <strong>more cohesive</strong> — more like a single page than a grid of disconnected panels. Some of these changes are highlighted in the picture below, compared to what the same dashboard looks like in 9.3.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/36c04a1a68aed706ca9eba38aa073e6dacef7cb6-1448x1556.png" style="width: 100%;" /><h2>Other improvements</h2><h3>Collapse filter pills</h3><p>Another common filtering approach is using <strong>filter pills</strong> at the top of the dashboard. 
When you stack many pills, they used to consume <strong>vertical space</strong>; they now <strong>scroll</strong> automatically after a threshold, and you can <strong>collapse</strong> the row to reclaim the space.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3d92a22759326688bce759c26eb4cc81efc28c80-1428x660.gif" style="width: 100%;" /><h3>Closing the gap for ES|QL panels</h3><p>Elasticsearch Query Language (<strong>ES|QL</strong>) panels are closer to parity with the rest of the dashboard: <strong>drilldowns</strong> between dashboards work for ES|QL visualizations, and <strong>click-to-filter</strong> from the chart keeps improving. Upcoming releases aim to close remaining gaps, such as annotations and saving ES|QL visualizations to the library.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/319a69d3174d5b6bdaa6279304471fcc359e98a3-1501x789.png" style="width: 100%;" /><h3>Panel size visibility</h3><p>Before 9.4, it was hard to match panel height and width because there was no readout for panel size. <strong>Panel sizing</strong> is now shown while you resize, so you can align dimensions across panels for a more harmonious layout.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4cb6ae2a09c451b1bc0c4101e6bd613b6c298888-1726x856.gif" style="width: 100%;" /><h3>Easier section reordering</h3><p><strong>Collapsible sections</strong> picked up usability fixes, too: You can reorder sections while they’re expanded, and you can drag a section from its header without hunting for the small drag handle.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db6b114fd84243d64c12746d90c4e337b62038d5-1724x856.gif" style="width: 100%;" /><h3>ES|QL support for Vega visualizations</h3><p><a href="https://www.elastic.co/search-labs/blog/kibana-vega-building-advanced-visualizations">Vega</a> lets you build custom visualizations beyond what the built-in chart types offer — radar charts, chord diagrams, or anything the grammar supports. The hardest part has always been the data query: writing verbose Elasticsearch Query DSL with nested aggregations and format paths just to feed data into the chart. <a href="https://www.elastic.co/docs/explore-analyze/visualize/custom-visualizations-with-vega#vega-esql-queries">Vega now accepts ES|QL</a> as a data source, so you can get the data with a single readable query. Less time wiring up the data, more time on the visualization itself.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/63d1c19f61f7dead135fe9cad3fd8411590fd037-1606x1033.png" style="width: 100%;" /><h2>What’s next</h2><p>This is just the start. Upcoming releases will keep pushing on smarter defaults and closing the gap between ES|QL and data-view visualizations so the two feel interchangeable.</p><p>Got a pain point or a feature request? Hit <strong>Submit feedback</strong> in the top menu — we&#x27;re listening.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/60c0861ade03a764ddb4d3c4a86af31a7dea02ed-1490x634.png" style="width: 100%;" /><h2>How to try it</h2><p>If you use <strong>Elastic Cloud Serverless</strong>, you may already have these changes. Otherwise, <strong>upgrade to 9.4</strong> and then <strong>create a dashboard</strong> or <strong>open an existing one</strong>. 
Many updates apply <strong>automatically to new visualizations</strong>, while <strong>layout</strong> and <strong>panel</strong> options appear in <strong>edit</strong> mode. If you aren’t on <strong>Elastic Cloud</strong> yet, <a href="https://cloud.elastic.co/registration"><strong>start a trial</strong></a> and explore the latest <strong>Kibana</strong> Dashboards there.</p><p><em>The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[30x faster than Prometheus: How we rebuilt Elasticsearch as a leading columnar metrics datastore]]></title>
            <description><![CDATA[Elasticsearch now stores OTel metrics at 3.75 bytes per data point and queries them up to 30x faster than Prometheus. Here's how we rebuilt TSDS and ES|QL.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-columnar-metrics-engine-30x-faster-prometheus</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-columnar-metrics-engine-30x-faster-prometheus</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kostas Krikellas, Martijn Van Groningen, Nhat Nguyen, Felix Barnsteiner]]></dc:creator>
            <pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9c0a835962fc8db27d382b654db787045fb3e4da-1376x768.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch now stores OTel metrics at <strong>3.75 bytes per data point</strong> — down from 25 bytes a year ago — and queries them up to <strong>30x</strong> faster and with up to <strong>2.5x</strong> better storage efficiency, compared to <strong>Prometheus</strong>, <strong>Mimir</strong> and <strong>ClickHouse</strong>. These gains came from rebuilding <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">TSDS</a> storage and the ES|QL compute engine into a <strong>fully columnar metrics engine</strong>, with native OTel ingestion added as part of the effort — all while keeping Elasticsearch&#x27;s ability to store and query logs, traces, and any other data alongside metrics.</p><p>Elasticsearch has supported storing metrics in time-series data streams (<a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">TSDS</a>) since <strong>version 8.7</strong>. This offering mainly focused on storage gains as explained in an earlier <a href="https://www.elastic.co/search-labs/blog/time-series-data-elasticsearch-storage-wins">blog post</a>. Still, performance was not on par with specialized systems for storing and querying metrics, in terms of storage efficiency, indexing throughput and query latency.</p><p>In the past year, we revisited the storage layer, optimized ingestion for OTel metrics and extended the ES|QL compute engine with vectorized processing for time series data. These efforts led to substantial performance wins across the board, compared to earlier versions of TSDS:</p><ol><li>Up to <strong>6.6x</strong> improvement in storage efficiency, reaching 3.75 bytes per data point in OTel metrics</li><li>Up to <strong>50%</strong> improvement in indexing throughput for OTel data</li><li>Up to <strong>160x</strong> improvement in query latency, including blazing fast counter rate evaluation and window support in time series aggregations</li></ol><p>Elasticsearch has thus become a <strong>leading columnar metrics engine</strong>, matching or exceeding the competition (like <strong>Prometheus</strong>, <strong>Mimir</strong>, and <strong>ClickHouse</strong>) in indexing throughput and exceeding it by up to <strong>2.5x</strong> in storage efficiency and <strong>30x</strong> in query performance. All while maintaining the ability to store logs and other data and fully use the rich querying capabilities of ES|QL (e.g. <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/inlinestats-by">inline stats</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join">lookup join</a>) — capabilities that PromQL-based systems lack.</p>
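<p>As a quick sketch of what that combination unlocks — assuming a hypothetical lookup index <code>host_owners</code> keyed on <code>host.name</code> — a time series aggregation can be enriched with metadata in the same query:</p><pre><code>TS metrics-*
| STATS AVG(RATE(system.cpu.time)) BY host.name, TBUCKET(1h)
| LOOKUP JOIN host_owners ON host.name</code></pre>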
<p>Elasticsearch can thus serve as a unified storage and query engine for all user data, with no compromises for metrics and observability applications.</p><h2>How TSDS is organized</h2><p>TSDS has the following properties that help improve the performance of time-series codecs and produce correct results when aggregating data points per time series:</p><ul><li>The <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-metric">metric</a> name and the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-dimension">dimension</a> names and values are used to calculate the <code>_tsid</code>, a unique identifier per time series.</li><li>TSDS are sorted in <code>[_tsid ascending, timestamp descending]</code> order. Each time series is thus stored in sequence on disk, with newer data points appearing first. Since the <code>_tsid</code> is calculated over dimension values, the latter are also clustered on disk.</li><li>Shard routing is based on <code>_tsid</code>, with each <code>_tsid</code> value appearing in one shard only.</li><li>Backing indices are <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-bound-tsds">time-bound</a>, with no overlap over time between them.</li></ul><p>The rest of this post explains how we use these properties to improve storage, indexing, and query performance.</p><h2>Storage optimizations</h2><p>TSDS <a href="https://www.elastic.co/search-labs/blog/time-series-data-elasticsearch-storage-wins">already</a> achieved a very competitive storage footprint, reaching <strong>0.9 bytes per data point</strong>, when many metrics sharing the same dimension values can be combined in a single doc. However, when most data points have a unique set of dimensions (which is typical for OTel or Prometheus metrics), docs end up containing a single data point. In this setup, storage required 25 bytes per data point, with dedicated metrics stores requiring less than 10 bytes per data point.</p><p>To further reduce the storage footprint, we applied a series of optimizations over the past year:</p><h3>Replace inverted indices and BKD trees with doc value skippers</h3><p>Elasticsearch creates inverted indices (for text values) or BKD trees (for numeric values) by default for all non-metric fields, i.e. for <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams#backing-indices">@timestamp</a> and dimensions. These indices improve performance for queries including filters on these fields, but have a significant impact on storage — effectively doubling the footprint for each field. Moreover, they are also processed during <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/merge">segment merging</a>, increasing the CPU, memory and storage overhead and slowing down the system — especially in high ingest throughput scenarios, as is often the case with metrics.</p><p>Lucene has been extended with <a href="https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/index/DocValuesSkipper.html">doc value skippers</a>, a form of hierarchical sparse indices that store the minimum and maximum value of blocks of documents. Range queries can check these min and max values and ignore blocks that don&#x27;t fall into the requested range. Skippers work particularly well on sorted fields. Since TSDS are sorted by <code>[_tsid, timestamp desc]</code>, dimension values also get clustered on disk. 
It&#x27;s therefore possible to replace indices on <code>@timestamp</code> and dimension fields with doc value skippers that <strong>amplify the columnar layout</strong> — each field stored in its own files, with no duplicate tracking of each doc for indexing purposes.</p><p>Doc value skippers have negligible storage overhead — replacing indices with them led to a reduction of <strong>10 bytes</strong> out of the initial 25 bytes per data point in OTel. Moreover, they work very well in practice when queries include filters on time ranges or dimension values (including prefixes and regex) — there was no noticeable regression in query performance in our benchmarks when they replaced separate indices. Doc value skippers have been enabled for TSDS by default since <strong>version 9.3</strong>.</p><h3>Enable synthetic IDs</h3><p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-id-field"><code>_id</code></a> metadata field was another big contributor to the storage footprint. TSDS had already been extended to trim the doc values once they were no longer needed for replication, but the inverted index was kept around to efficiently support the id-based APIs (<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-get">Get</a>, <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-delete">Delete</a>, <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-update">Update</a>).</p><p>The ID value for TSDS is synthesized by combining the <code>_tsid</code> and <code>@timestamp</code> values that uniquely identify each data point. Since these fields are configured with doc value skippers, it&#x27;s possible to replace the inverted index on <code>_id</code> with (a) reconstructing the <code>_tsid</code> and <code>@timestamp</code> values from the <code>_id</code> value, and (b) checking for matches using the corresponding doc value skippers. Care has to be taken to avoid expensive checks for duplicate IDs during metric ingestion, with segment-level Bloom filters keeping the overhead at bay.</p><p>Supporting synthetic IDs in metrics is a first for Elasticsearch. It led to a reduction of <strong>5 bytes</strong> out of the initial 25 bytes per data point for OTel metrics, with no loss of functionality. Synthetic IDs are enabled for TSDS by default in <strong>version 9.4</strong>. We plan to extend their uses in logs and other applications after further evaluation.</p><h3>Trim sequence numbers</h3><p>Sequence numbers are used as part of replication, but also to provide strong consistency semantics on doc modification operations through <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/optimistic-concurrency-control">Optimistic Concurrency Control</a> (OCC). While such semantics are applicable to certain scenarios, they don&#x27;t fit metrics, where concurrent updates are very rare and there is no practical need to guard against concurrent operations on data points with matching IDs. We therefore decided to <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-disable-sequence-numbers">disable the use of sequence numbers</a> and OCC support in all TSDS APIs in <strong>version 9.4</strong>. This leads to a substantial storage reduction of <strong>4 bytes</strong> out of the initial 25 bytes per data point for OTel data, as there&#x27;s no inverted index and sequence numbers get trimmed once no longer needed for replication. 
<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/update-by-query-api">Update</a> and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-delete-by-query">delete</a> by query operations are still supported, albeit with weaker consistency semantics.</p><p>If OCC is still deemed important for a particular metrics application, the old behavior can be restored by setting <code>index.disable_sequence_numbers: false</code> in the index template of the involved TSDS.</p><h3>Use large numeric codec blocks</h3><p>TSDS already uses an advanced codec, as explained in an earlier <a href="https://www.elastic.co/search-labs/blog/time-series-data-elasticsearch-storage-wins#specialized-codecs">article</a>. The codec works very well in most cases, but has poor performance in case of repeated sequences of keywords and numbers, leading to an inflated storage footprint for dimensions containing IP and MAC addresses. We identified that the existing logic for identifying repeated sequences requires larger codec blocks to work well, especially as the sequence length increases. After experimentation, the numeric block size was increased from 128 to 512 elements in <strong>version 9.3</strong>, leading to a reduction of <strong>2 bytes</strong> out of the initial 25 bytes per data point for an OTel dataset containing IP and MAC addresses as dimensions. We&#x27;re also working on a more configurable codec layout that will allow more flexibility with block sizes and other parameters, based on field type and cardinality.</p><h2>Indexing throughput</h2><p>Elasticsearch has support for bulk ingestion of documents. This entrypoint has long been optimized for leniency, ensuring that all docs get accepted. This flexibility, however, incurs additional processing cost during indexing. Metric applications proved good candidates for using different approaches to reduce this overhead, as explained below.</p><h3>Introduce OTLP protobuf entrypoint</h3><p>OTel metrics and Prometheus have established protocols for metrics ingestion, using protocol buffers. In the past, a translation step was required to convert collected protobuf messages to bulk requests that Elasticsearch can consume.</p><p>Elasticsearch was recently extended with endpoints accepting messages from OTel metrics collectors and over Prometheus remote write. Parsing and processing these (binary) messages is cheaper, compared to json parsing, while hash operation over dimensions for <code>_tsid</code> calculations get reused and amortized across more data points within a single protobuf message. Furthermore, <code>_tsid</code>s get evaluated once per doc in the coordinator nodes and propagated to data nodes for indexing, thus deduplicating an expensive step per indexed doc. These improvements led to up to a 20% speedup in indexing throughput for OTel metrics. The OTLP entrypoint was added in version 9.2 (tech preview) and reached GA in <strong>version 9.3</strong>. We&#x27;ve added similar entrypoints for <a href="https://www.elastic.co/observability-labs/blog/prometheus-remote-write-elasticsearch">Prometheus remote write</a> in <strong>version 9.4</strong> (tech preview) and are actively working to cover OTel Logs and Traces.</p><h3>Reduce indexing CPU with doc value skippers</h3><p>In addition to a substantial storage footprint, inverted indices require a lot of cpu to build and reconstruct during segment merging. 
The use of doc value skippers in their place also helps reduce CPU load at ingestion and thus improves indexing throughput by 10%, a welcome bonus on top of the aforementioned storage wins.</p><h3>Synthetic recovery source</h3><p>The original <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-source-field">source</a> of a document, as provided at index time, is never stored for metrics. Still, Elasticsearch needed to temporarily store it for replication purposes. That changed in <strong>version 9.1</strong>, where the source gets synthesized on demand instead. This is known as synthetic recovery source and reduces disk I/O by 50%, with a significant impact on metrics indexing performance. Check out this <a href="https://www.elastic.co/search-labs/blog/elastic-logsdb-tsds-enhancements">article</a> for more details.</p><h2>Query execution</h2><p>Replacing inverted indices with doc value skippers leads to a pure columnar storage layout for TSDS, with metric and dimension fields stored as Lucene doc values, each field encoded and compressed in its own file. Combined with the introduction of the <a href="https://www.elastic.co/blog/elasticsearch-query-language-esql#dedicated-query-engine">ES|QL compute engine</a> that uses vectorized execution internally, it became possible to introduce a fully columnar storage and query processing engine for metrics in Elasticsearch. We pushed this idea to the extreme and implemented a <strong>columnar metrics processing engine</strong> that comfortably outperforms dedicated metrics engines and other columnar stores in query performance.</p><h3>Time series integration in compute engine</h3><p>Time series processing is largely based on applying aggregation functions per time series (or <code>_tsid</code>), such as a <a href="https://opentelemetry.io/docs/specs/otel/metrics/data-model/#gauge">gauge</a> average or a <a href="https://opentelemetry.io/docs/specs/otel/metrics/data-model/#sums">counter</a> rate. These partial results are then reduced by a secondary function to produce results for the grouping dimensions, e.g. per host and process. Observability dashboards are built on top of this execution model, providing summary views of how metrics evolve over time and allowing for quick deep-dives by filtering on dimension values and time ranges.</p><p>To support this execution model, we introduced the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts#description">TS source command</a>, providing a simple yet powerful syntax for executing such queries that combine an inner aggregation function per time series with an outer aggregation over the partial results of the former. For instance, the following query calculates the hourly sum of the rate of search requests per host over the last day:</p><pre><code>TS metrics
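  // TRANGE(1d) keeps only the last day of data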
  | WHERE TRANGE(1d)
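  // inner RATE runs per time series; outer SUM combines the partial results per host and hour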
  | STATS SUM(RATE(search_requests)) BY TBUCKET(1h), host</code></pre><p>To execute this query, the compute engine is aware of how data is stored and applies the inner aggregation function per <code>_tsid</code> value. Since data are sorted by <code>_tsid</code>, time series aggregation functions process metric values as they get fetched, until the <code>_tsid</code> changes or the timestamp belongs to the next time bucket. This leads to vectorized execution of these functions over the fetched columns of metric values, while dimension values are only fetched (once) when the <code>_tsid</code> changes. The evaluation of the secondary aggregation function is also efficient, with partial aggregates stored in arrays of primitive values that get populated when <code>_tsid</code> values change.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8a6987ff92c80ac813167c9aa655d2ec4dd1bd50-1024x559.png" style="width: 100%;" /><p>The compute engine has inherent support for parallel query evaluation, taking full advantage of the available processing cores. Time series aggregations fully use this feature and process data points in parallel as applicable, reducing response times through improved CPU utilization.</p><p>Time series processing in ES|QL was introduced in version 9.2 as tech preview and reaches GA in <strong>version 9.4</strong>. We expect all metrics applications to adopt it and benefit from the much-improved query performance.</p><h3>Zero-copy data decoding and loading</h3><p>Vectorized processing of time series data delivered immediate performance wins (<strong>8x</strong> for some queries), compared to aggregations through the <code>/_search</code> API, but the performance was still inferior compared to competing metrics stores. Benchmarking and profiling showed that there were too many array copies within the compute engine, between data decoding and evaluation of aggregation functions. To that end, the following optimizations were introduced:</p><ul><li>The codec for TSDS was extended to decode on-disk data directly into primitive arrays inside blocks that the compute engine uses to evaluate time series aggregations. No additional copies required, as the compute engine can bulk-read these blocks and process their arrays, one column at a time.</li><li>Blocks containing the same value repeated N times are represented as constant blocks storing just the value and the count — a form of in-memory run-length encoding — as opposed to an array of length N. Filtering and aggregation operations were extended to efficiently consume these blocks. This reduced memory pressure and CPU overhead for the <code>_tsid</code> and dimension fields, as their values get clustered due to index sorting.</li><li>Documents with null values for the aggregated metric fields are filtered out at the Lucene level, before they get decoded and copied into blocks.</li><li>All filters and regular expressions on the timestamp and dimension fields get pushed down to Lucene, which uses doc value skippers to efficiently filter out non-matching docs.</li></ul><p>Combined, these optimizations led to query execution speedups exceeding <strong>10x</strong> (totaling 80x when combined with the 8x speedup from vectorized execution). 
They shipped with the introduction of the TS source command in <strong>version 9.2</strong> and have been fine-tuned ever since.</p><h3>Optimized counter rate evaluation</h3><p>While most time series aggregations can be trivially parallelized and evaluated, rate evaluation of cumulative counters is tricky as it requires processing all data points in order to detect counter resets (e.g. when a host restarts). To address this, the compute engine uses the <code>_tsid</code> prefix to shard time series across threads. Care has been taken to assign in-order ranges of <code>_tsid</code> values to each thread, as opposed to hash-partitioning <code>_tsid</code>s, so that each thread can scan on-disk data in order, still making use of efficient decoding and zero-copying into blocks. The performance wins are impressive, with rate evaluation performance far exceeding dedicated metrics stores as we shall see in the next section.</p><p>Another interesting problem for cumulative counters is how to properly calculate counter increases for the entire time bucket when there are no data points at the bucket boundary timestamps. Metrics systems often use extrapolation, extending the first and last data points of each time bucket to the boundaries, or calculate the delta between the last data point of adjacent buckets. We posit we can do better, by interpolating between the last data point of each bucket and the first of the next, to get an estimate of the value on each boundary. The delta is then calculated over the interpolated values of the lower and upper boundary of each time bucket. For example, if a counter reports the value 100 one second before a bucket boundary and 120 one second after it, the boundary value is interpolated to 110.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b09313b3359360dc3d6117c436dff05ca6de3653-1360x1520.png" style="width: 100%;" /><h3>Sliding window support</h3><p>Elasticsearch has long supported aggregations bucketed by time, but it was not possible to extend the window of processed data beyond each time bucket. Using windows larger than the time bucket, e.g. a window of 5 minutes for per-minute bucketing, helps smooth out spikes and observe the underlying trend per time series with reduced noise:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66c96d577a9a76f7d5a0cc7df3d121f38fe018bd-1999x1014.png" style="width: 100%;" /><p>All <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">time-series aggregation functions</a> have been extended with window support, as an optional argument. If the window is a multiple of the time bucket (e.g. a 1h window with <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/grouping-functions/tbucket">TBUCKET</a><code>(5m)</code>), the compute engine first aggregates data points over intervals matching the time bucket span, and then combines these partial results per window span. This 2-phase approach eliminates repeated scans of data points and makes optimal reuse of intermediate results, improving response times. Window support was introduced as tech preview in version 9.3 and reaches GA in <strong>version 9.4</strong>.</p>
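<p>As a concrete sketch, reusing the optional window argument that also appears in the benchmark queries below, a 1-hour window evaluated over 5-minute buckets looks like this:</p><pre><code>TS metrics-*
| STATS AVG(AVG_OVER_TIME(system.cpu.load_average.1m, 1h)) BY host.name, TBUCKET(5m)</code></pre>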
<h3>Efficient datetime rounding</h3><p>Queries on time-series data commonly include time bucketing. While data points can be trivially assigned to sub-hour time buckets, larger buckets run into issues like time zones, daylight saving time, and variable days per month. Elasticsearch has elaborate logic for datetime rounding that takes these peculiarities into account, but it carries a relatively high CPU cost when processing time series data.</p><p>To mitigate this, the compute engine has been extended to identify cases where simpler logic can be employed to assign data points to time buckets. For instance, it can identify when the buckets are sub-hour or when time zones and daylight saving time don&#x27;t affect a particular query, and switches to simple modulo operations for datetime rounding (e.g., <code>bucket_start = timestamp - (timestamp % bucket_span)</code>). This led to a further <strong>30%</strong> improvement in response times for certain queries. This change was introduced in <strong>version 9.4</strong>.</p><h2>Performance evaluation</h2><p>To evaluate the performance of our offering and track how it evolves and improves over time, we focused on OTel metrics since (a) OpenTelemetry is the industry standard for collecting metrics, with universal adoption by all cloud providers and (b) they lead to a storage layout with 1 metric per doc, a setup that traditionally hurt performance for Elasticsearch.</p><p>We rely on <a href="https://github.com/elastic/metricsgenreceiver">Metricsgenreceiver</a> to generate datasets. This tool is inspired by <a href="https://github.com/timescale/TSBS">TSBS</a> and produces data simulating the data points collected by the OTel <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver#readme">hostmetricsreceiver</a>. We used two datasets:</p><ol><li>A low-cardinality setup, with 100 hosts sending metrics every 1s, containing 14k time series in total</li><li>A high-cardinality setup, with 10k hosts sending metrics every 10s, containing 1.4M time series in total</li></ol><p>We benchmarked on single-node deployments on EC2, using <a href="https://aws.amazon.com/ec2/instance-types/c6i/">c6i.4xlarge</a> and <a href="https://aws.amazon.com/ec2/instance-types/c8g/">c8g.8xlarge</a> machines for the low- and high-cardinality datasets respectively.</p><p>For competitive comparison, we used Prometheus (v3.11.1), Mimir (v3.0.6) and ClickHouse (v26.3.9.8-lts). Prometheus and Mimir have proper time series processing, e.g. for counter rate, whereas ClickHouse <a href="https://clickhouse.com/docs/use-cases/time-series/analysis-functions">lacks such support</a> and only provides approximate values at best (for instance, it can&#x27;t track counter resets consistently). We still report response times for ClickHouse to showcase that, with Elasticsearch now optimized for columnar query processing, it can exceed competing columnar engines even when they don&#x27;t process the data per time series as expected.</p><p>We strived to use the default configuration for every system (including Elasticsearch), without tweaking them to optimize performance for the particular workload. This reflects the user experience when systems are deployed by new users, without much experience (or time) to tune them before receiving metrics traffic and setting up dashboards. We focused on single-node runs to keep noise low and accommodate all systems (Prometheus doesn&#x27;t offer a multi-node setup out of the box). Elasticsearch performance demonstrably scales well with the number of nodes; we plan to share scalability results in a future post.</p><h3>Storage efficiency and indexing throughput</h3><p>Our efforts to improve storage efficiency paid big dividends: the storage footprint for OTel metrics dropped <strong>from 25 to 3.75</strong> bytes per data point in a year. 
Such an improvement, on top of an offering already optimized for time series, is really impressive and very rare in the industry.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8b3d185e252f427b4753aa93f173f1fe0930acbf-1999x364.png" style="width: 100%;" /><p>The competitive picture looks favorable at this point, with Elasticsearch:</p><ul><li>Slightly outperforming Mimir in storage efficiency and indexing throughput</li><li>Outperforming Prometheus by 2.5x in storage efficiency and by a small margin in indexing throughput</li><li>Outperforming ClickHouse by 2x in storage efficiency and by 40% in indexing throughput</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a4075e617142de780400f0c318b9d1fa99e4e7a8-1187x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/068fb9a3c1c0d8b23b3d34ae2f8ffe968dd71d30-1221x851.png" style="width: 100%;" /><h3>Query performance</h3><p>The novel columnar engine for metrics processing proves very efficient in practice. We used a mix of queries based on gauge averages and counter rates — the most common operations, which require different optimization approaches. The queried interval was 4 hours of data, covering all time series per metric.</p><p>ClickHouse doesn&#x27;t support time series aggregations, so the results have limited value and are not directly comparable to Prometheus or Mimir, which natively support time series processing. We used the published <a href="https://clickhouse.com/docs/use-cases/time-series/analysis-functions">guidelines</a> to adjust each query to get similar results to the extent possible. The point is to show how our columnar engine compares to generic columnar stores.</p><p>Here is a summary of the results:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Mimir</td><td style="border: 1px solid #ccc; padding: 4px;">Prometheus</td><td style="border: 1px solid #ccc; padding: 4px;">ClickHouse †</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Gauge average</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 8x faster</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Counter rate</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 3.5x faster</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Prefix filter on host name</td><td style="border: 1px solid #ccc; padding: 4px;">up to 5x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 5x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 3x faster</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Gauge average with window</td><td style="border: 1px solid #ccc; padding: 4px;">up to 25x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 25x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 4x faster</td>
                    </tr>
              </tbody>
            </table>
          <p>†ClickHouse lacks native support for time series aggregations and counter reset detection.</p><h4>Gauge average</h4><p>We compared the performance of evaluating the per-host hourly average of the per-time-series average memory utilization, using the following queries:</p><pre><code># PromQL
avg by (host.name) (avg_over_time(system.memory.utilization[1h]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
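// inner AVG_OVER_TIME averages each time series within the hour; outer AVG combines those averages per host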
| STATS AVG(AVG_OVER_TIME(system.memory.utilization)) BY host.name, TBUCKET(1h)</code></pre><p>Elasticsearch comfortably outperforms the other systems by up to 30x, on both the low- and high-cardinality datasets.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0a78beb0bd1e5b37cb7001f384fa13d4a44dbcbf-1210x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cedb803077cc660b9b85b4cf0f123be8d4c932d7-1227x851.png" style="width: 100%;" /><h4>Counter rate</h4><p>We next compared the performance of evaluating the per-host hourly average of the CPU counter rate, using the following queries:</p><pre><code># PromQL
avg by (host.name) (rate(system.cpu.time[1h]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
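// RATE detects counter resets per time series before the per-host AVG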
| STATS AVG(RATE(system.cpu.time)) BY host.name, TBUCKET(1h)</code></pre><p>Despite processing data points per time series in order, counter rate performance matches that of the gauge average (even though the involved time series have 6.6x more docs than in the query above). Elasticsearch maintains its wide advantage compared to the other systems and outperforms Mimir and Prometheus by 30x in the low cardinality dataset and by 16x in the high cardinality one.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d31b3fe1c85854a9daa8814a318b9bf86007a823-1227x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/82735acdb0228686ca5d8b3da56da6fc6256eb35-1243x851.png" style="width: 100%;" /><p>It&#x27;s really impressive that, for the high cardinality dataset, Elasticsearch is able to process 4 hours of data for half a million time series in less than 2 seconds, while the other systems take more than 30 seconds, leading to unresponsive dashboards for such queries. ClickHouse is also slower, despite having no logic to detect counter resets and extrapolate/interpolate deltas across buckets.</p><h4>Prefix filter on host name</h4><p>We next compared the performance of filtering on host names based on their prefix, using the following queries:</p><pre><code># PromQL
avg by (host_name)
  (avg_over_time(system_cpu_load_average_1m{host_name=~"host-.*"}[5m]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
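// trailing wildcard makes LIKE a prefix filter on host.name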
| WHERE host.name LIKE "host-*"
| STATS AVG(AVG_OVER_TIME(system.cpu.load_average.1m)) BY host.name, TBUCKET(5m)</code></pre><p>Elasticsearch manages to maintain an advantage of up to 5x compared to the other systems, despite replacing the inverted index on <code>host.name</code> with a doc value skipper.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a28e59f33996df9cfee82aafcf4f30f9f51c045-1205x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0765ee99190d3f77635293e2258250530deb945-1226x851.png" style="width: 100%;" /><h4>Gauge average with window</h4><p>We compared the performance of time series aggregations with a window of 90 minutes and time buckets of 30 minutes, using the following queries:</p><pre><code># PromQL
# query API parameter: step=30m
avg by (host_name) (avg_over_time(system_memory_utilization[90m]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
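// 90m lookback window, evaluated per 30m time bucket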
| STATS AVG(AVG_OVER_TIME(system.memory.utilization, 90m))
    BY host.name, TBUCKET(30m)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b1cb772391b5c16355df75b8be791957d1a0570a-1211x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/87b09774ac09d5fc38b8e534c7643e80a9f811b5-1235x851.png" style="width: 100%;" /><p>Elasticsearch maintains an advantage that reaches 25x for the low cardinality dataset and 8x for the high cardinality one. ClickHouse is outperformed by close to 4x, demonstrating the efficiency of our approach for windowed query operations.</p><h2>What&#x27;s next for Elasticsearch metrics</h2><p>Elasticsearch has been extended with metrics storage and processing capabilities that outperform Prometheus, Mimir, and ClickHouse. We&#x27;re making fast progress on supporting <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-supports-promql">PromQL</a> and <a href="https://www.elastic.co/observability-labs/blog/prometheus-remote-write-elasticsearch-architecture">Prometheus remote write</a>, both also available as tech preview in <strong>version 9.4</strong>. These extensions enable users familiar with Prometheus and related systems to switch their applications to Elasticsearch, with no need to migrate existing dashboards. Since the Prometheus integration reuses the storage and query engine presented in this article, the same performance wins are expected for Prometheus as well. Furthermore, collected metrics can be queried with PromQL and ES|QL, side-by-side or in ES|QL query pipelines, extending analytics capabilities far beyond what has so far been possible with Prometheus-based systems.</p><p>The improvements in storage efficiency, indexing throughput and query performance are already impressive, but we&#x27;re not done. We&#x27;ll be introducing more refinements to the codec for time series data, further reducing bytes per data point. Batch processing of ingested metrics will be further improved, reducing synchronization overhead and redundant processing layers that are not needed for well-formatted collected metrics. We&#x27;re also planning to make wider use of doc value skippers, storing pre-computed aggregates like sum and count per block of values, to shortcut data point loading and processing where applicable, as well as to use more CPU-friendly partitioning and grouping operations.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Multi-tier search with Elastic for ecommerce search governance: Fixing poor recall]]></title>
            <description><![CDATA[Learn how to build a multi-tier retrieval strategy used to execute ecommerce governed search plans and improve recall management. We'll cover how to orchestrate semantic matching while maintaining stable results, facets, and pagination.]]></description>
            <link>https://www.elastic.co/search-labs/blog/multi-tier-search-ecommerce-governance</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/multi-tier-search-ecommerce-governance</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5d64716981e76396b401fd069d0a635b6929ba94-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>A common issue in ecommerce search is poor recall. This occurs when a system lacks a governed fallback strategy. The solution is a multi-tier execution model. This post describes a multi-tier retrieval strategy used to execute governed search plans. It explains how to orchestrate strict, relaxed, and semantic matching while maintaining stable results, facets, and pagination.</p><h2><strong>From policy logic to retrieval architecture</strong></h2><p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance">Part 4</a> provided a technical deep dive into the governed control plane and its implementation using the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/percolator">Elasticsearch percolator</a>. Once the logic layer has identified which policies to apply, the system must address the retrieval strategy used to execute the search.</p><p>Managing the transition from precision to recall is a critical function of any ecommerce search engine. For example, a basic search implementation often defaults to broad keyword matching. If a shopper searches for &quot;organic Pink Lady apples&quot;, this can lead to irrelevant results, such as apple-scented dish soap, apple juice, or organic pink grapefruit, appearing at the top of the list simply because they share a common term. While these items are technically matches, they fail to satisfy the user&#x27;s intent and typically lead to high bounce rates. However, a &quot;No results&quot; page is equally detrimental to conversion. This conflict is resolved by implementing a <strong>three-tier execution model</strong>, which uses the governed control plane to orchestrate a principled fallback strategy.</p><h2><strong>The three-tier execution model</strong></h2><p>This architecture executes up to three retrieval tiers in a sequence, each with a specific matching logic.</p><h3><strong>Highest tier: Strict matching</strong></h3><p><em>Strict matching</em> is a lexical match that requires that <strong>all</strong> query terms appear in the product metadata.</p><ul><li><strong>The logic:</strong> A search for &quot;organic navel oranges&quot; returns only products containing all three terms.</li><li><strong>Application:</strong> This tier provides the highest precision. When a customer types a precise product name, such as &quot;organic navel oranges&quot;, they’re typically seeking that exact item rather than an alternative.</li></ul><h3><strong>Mid-tier: Relaxed matching</strong></h3><p>If the strict tier fails to return sufficient results, the system expands the search parameters.</p><ul><li><strong>The logic:</strong> This tier allows for a subset of terms to lexically match, using <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-minimum-should-match">Elasticsearch&#x27;s minimum_should_match</a> logic.</li><li><strong>Application:</strong> Relaxed matching maintains lexical grounding. A search for &quot;organic navel oranges&quot; might surface &quot;navel oranges&quot; (missing the &quot;organic&quot; term) or &quot;organic oranges&quot; (missing the &quot;navel&quot; term). 
These represent intuitive, keyword-based alternatives for the shopper.</li></ul><h3><strong>Lowest tier: Semantic matching</strong></h3><ul><li><strong>The logic:</strong> This tier uses vector/semantic embeddings (such as Elastic Learned Sparse EncodeR [ELSER], E5, or Jina) to retrieve conceptually related products, regardless of direct keyword overlap.</li><li><strong>Application:</strong> A search for &quot;organic navel oranges&quot; might surface &quot;mandarins&quot; or &quot;clementines&quot;. This serves as the final retrieval tier, intended to provide relevant options when literal keyword matches are unavailable.</li></ul><p>To see this multi-tier orchestration in action and how the Engine steps down from lexical to semantic matching, watch the video: <a href="https://youtu.be/k02NHvIAHsk?si=tJKwmc4ds3zjcRPF">Eliminating Zero-Result Pages: PRISM’s Multi-Tier Search Fallback</a>.</p><h2><strong>Tier orchestration: The &quot;bucket filling&quot; logic</strong></h2><p>While the governed control plane provides the logic and the queries for each tier, the application layer is responsible for the execution. The application executes these tiers sequentially and excludes lower tiers once the accumulated result count on the first page reaches or exceeds 10 items (or whatever number of results you want to display on the first page). This threshold ensures a full first page of results while prioritizing the most accurate retrieval method.</p><h3>Scenario 1: High-intent search (&quot;oranges&quot;)</h3><p>The first tier returns 15 hits. Since 15 is more than 10, the current result set is locked to only strict matches (which can be paged through) and subsequent tiers are not executed.</p><pre><code>Strict tier:   [##########]##### (15 found: exact matches)
Relaxed tier:  [          ]      (Tier bypassed)
Semantic tier: [          ]      (Tier bypassed)</code></pre><h3>Scenario 2: Specific but limited results (&quot;organic blood oranges&quot;)</h3><p>The strict tier finds only four items. Since this is less than 10, the system triggers the relaxed tier, which finds 12 more relevant products. The combined total (16) meets the threshold of 10, so the current result set is locked to the strict and relaxed tiers. Subsequent paging will only surface results from these two tiers (preventing lower-quality semantic hits from appearing on later pages).</p><pre><code>Strict tier:   [####      ]       (4 found)
Relaxed tier:  [    ######]###### (12 found)
Semantic tier: [          ]       (Tier bypassed)</code></pre><h3>Scenario 3: Abstract or intent-based search (&quot;high vitamin C snacks&quot;)</h3><p>Keyword matches are limited (only five hits between tiers 1 and 2). The system triggers the semantic tier to find conceptually relevant items, such as kiwis, guavas, or red peppers, to fill the result set. The result set for this query includes products from all tiers.</p><pre><code>Strict tier:   [##        ]             (2 found)
Relaxed tier:  [  ###     ]             (3 found)
Semantic tier: [     #####]######################...</code></pre><p>This orchestration optimizes for latency, as the computational cost of the semantic tier is only incurred when the keyword-based tiers are insufficient. Additionally, this allows fast-responding keyword results to be displayed while semantic results are integrated shortly after, maintaining a responsive user interface.</p><h2><strong>Determining intent via tier activation</strong></h2><p>The logic used to fill the first page serves a critical secondary purpose: It acts as a diagnostic for user intent. The application uses the logic returned by the governed control plane to determine which tiers remain active for the current result set and paging.</p><p>If the strict and relaxed tiers together yield fewer than 10 results, the query is likely exploratory or abstract. In this case, activating the semantic tier is a benefit. Because the query is diagnosed as exploratory, the system allows the shopper to page through the entire depth of the semantic results. This provides access to conceptually related alternatives that lexical matching would have missed, which is appropriate for an abstract search.</p><p>Conversely, if the strict tier returns a robust set of results (for example, 30 hits), it confirms that the system has found high-precision matches. The user can page through those 30 hits and will likely find what they’re looking for. In this scenario, there’s no need to provide additional, less relevant exploratory hits. By disabling lower tiers for these high-precision queries, we ensure that a shopper deep diving into specific results isn’t distracted by irrelevant semantic fallback as they paginate through the current result set.</p><h2><strong>Governance across tiers</strong></h2><p>A critical component of this architecture is that policies apply globally across all tiers. If a user has a &quot;vegan&quot; preference profile, the governed control plane injects that constraint into the strict, relaxed, and semantic queries. This ensures that even when the system uses semantic fallback to return &quot;mandarins&quot; for an orange search, the results remain compliant with the user&#x27;s broader dietary preferences or business constraints.</p><h2><strong>The problem of facet instability</strong></h2><p>A challenge with multi-tier search is maintaining consistent faceted navigation (sidebar filters). If a search for &quot;chocolate&quot; yields 12 strict results, the sidebar filters might show &quot;dark&quot; and &quot;milk&quot;. If a user selects &quot;dark&quot; and the result count drops, a naive system might trigger the semantic tier to fill the page, which could suddenly introduce &quot;red wine&quot; into the filters due to a semantic relationship.</p><p>The governed control plane identifies which tiers contributed to the initial search and locks the facets to those tiers. This prevents the sidebar from changing unexpectedly during a filtered session, ensuring a stable user experience.</p><h2><strong>The pagination challenge: Seamless multi-tier paging</strong></h2><p>Pagination in a tiered system requires precise state management. As established, the first page determines the scope of the <strong>current result set</strong>. If the first page required semantic results, the user can page through all available results from all three tiers. 
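On the other hand, if the first page was satisfied by high-intent keyword matches, the semantic tier is not retrieved for that specific result set.</p><p>To make the orchestration concrete, here is a minimal Python sketch of the bucket-filling and tier-locking logic described in the preceding sections. The <code>run_tier</code> helper, the tier names, and the page size are illustrative stand-ins rather than the actual implementation:</p><pre><code>PAGE_SIZE = 10  # results needed to fill the first page

def fill_first_page(run_tier, tiers=("strict", "relaxed", "semantic")):
    """Execute tiers in order until the first page is full.

    run_tier(name) is assumed to return the hits for that tier, with all
    governance policies already injected into its query. Returns the first
    page plus the locked tier list that subsequent page requests must reuse
    to keep results, facets, and pagination stable.
    """
    hits, active_tiers = [], []
    for tier in tiers:
        hits.extend(run_tier(tier))
        active_tiers.append(tier)
        if len(hits) >= PAGE_SIZE:
            break  # threshold met: lower tiers stay excluded for this result set
    return hits[:PAGE_SIZE], active_tiers</code></pre><p>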
The governed control plane manages this through:</p><ul><li><strong>Tier locking:</strong> The response includes an array identifying the contributing tiers. The front end sends this array back on subsequent requests to keep the tier composition consistent across all pages.</li><li><strong>Dynamic offset calculation:</strong> The back end calculates an offset based on the requested page and the total products returned in preceding tiers. <strong>Example:</strong> If the first page returned seven strict matches and three relaxed matches, a request for page 2 (starting at index 10) would execute a relaxed tier query with an offset of three.</li><li><strong>ID exclusion for lower tiers:</strong> The system retrieves IDs from the higher tiers (which, by definition, will always be fewer than the page size threshold) and explicitly excludes them from lower-tier results using an ID-only query (which avoids the overhead of a full fetch phase for excluded items).</li></ul><h2><strong>Summary</strong></h2><p>The multi-tier approach ensures search results are precise when data is available and helpful when it is not. By providing a governed fallback sequence for the application to execute, the architecture maintains high relevance while eliminating &quot;no results&quot; scenarios.</p><h2><strong>What&#x27;s next in this series</strong></h2><p>The next posts in this series extend the governed control plane into new territory. Part 6 explores personalization (using purchase history boosting and cohort-aware policies), and Part 7 demonstrates per-query economic optimization. Stay tuned!</p><h2><strong>Put governed ecommerce search into practice</strong></h2><p>The search architecture described in this post, where retrieval tiers, economic weights, and governance constraints compose into a single request, was designed and built by Elastic Services Engineering as part of our repeatable ecommerce search accelerators.</p><p>To learn more about applying these patterns to your business, <strong><a href="https://www.elastic.co/contact">Contact Elastic Professional Services</a>.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Three indices walk into a FROM clause: ES|QL subqueries in Elasticsearch]]></title>
            <description><![CDATA[ES|QL subqueries give each data source its own pipeline and filters, eliminating CASE chains, restoring predicate pushdown, and making multi-index queries extensible by design.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-subquery-from</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-subquery-from</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tyler Perkins]]></dc:creator>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2eb755dd2b2b69b8e0e8867a0da85940eb744176-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language</a> (ES|QL) now has <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-subquery">subqueries in <code>FROM</code></a>. Three indices, different schemas, one query; each source gets its own pipeline with its own filters and transforms. No more <code>CASE</code> chains. No more client-side stitching. Add a fourth source? Add a fourth branch; zero changes to the existing three.</p><h2>The problem: Heterogeneous data, one query</h2><p>Consider a production incident investigation. Errors are spread across three microservices: an API gateway, a payments service, and an auth service, each with different field names and different conventions. Before subqueries, combining them in a single ES|QL query meant cramming everything into one <code>FROM</code> with <code>CASE</code> chains:</p><pre><code>FROM svc-gateway-*, svc-payments-*, svc-auth-* METADATA _index
| WHERE http.response.status_code >= 500
    OR transaction.status IN ("failed", "timeout")
    OR (event.action == "login" AND event.outcome == "failure")
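    // one disjunct per source: the OR prevents pushing any filter down to a single index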
| EVAL
    service = CASE(
      _index LIKE "svc-gateway*", "gateway",
      ... /* one branch per source */),
    error_detail = CASE(
      _index LIKE "svc-gateway*", CONCAT("HTTP ", http.response.status_code::string),
      ... /* one branch per source */)
| KEEP @timestamp, service, error_detail, source.ip
| SORT @timestamp DESC</code></pre><p>This is brittle and slow. The disjunctive <code>OR</code> prevents predicate pushdown; every index scans every condition. Every <code>CASE</code> chain grows with every source. Copy it into five dashboards and three alert rules, and you have eight places to update when anything changes.</p><h2>The fix: Independent pipelines</h2><p>Subqueries replace the monolithic <code>FROM</code> + <code>CASE</code> pattern. Each data source gets its own complete pipeline:</p><pre><code>FROM
  (FROM svc-gateway-*
   | WHERE http.response.status_code >= 500
   | EVAL service = "gateway",
         error_detail = CONCAT("HTTP ", http.response.status_code::string)
   | KEEP @timestamp, service, error_detail, source.ip),
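  // payments branch: same output columns, different source fields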
  (FROM svc-payments-*
   | WHERE transaction.status IN ("failed", "timeout")
   | EVAL service = "payments",
         error_detail = transaction.status
   | KEEP @timestamp, service, error_detail, source.ip),
  (FROM svc-auth-*
   | WHERE event.action == "login" AND event.outcome == "failure"
   | EVAL service = "auth",
         error_detail = CONCAT(event.action, " ", event.outcome)
   | KEEP @timestamp, service, error_detail, source.ip)
| SORT @timestamp DESC
| LIMIT 20</code></pre><p>The gateway branch only scans for HTTP 500s. The payments branch only looks at transaction statuses. The auth branch only checks login failures. Because each branch has its own <code>WHERE</code>, the optimizer pushes filters independently into each index, restoring the predicate pushdown that a single <code>FROM</code> with <code>OR</code> conditions prevents. Fields that exist in one branch but not another are filled with <code>null</code>.</p><p>Adding a fourth service means adding a fourth branch. Existing branches don&#x27;t change.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f210b98824cb34c514b6cdb2d44b2bc81a51bb7b-1999x1084.png" style="width: 100%;" /><h2>Save it as a view</h2><p>This is where subqueries and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views">logical views</a> combine. Wrap the subquery above in a named view, with one API call:</p><pre><code>PUT _query/view/error_triage
{
  "query": "FROM (FROM svc-gateway-* | WHERE ...) , (FROM svc-payments-* | WHERE ...) , (FROM svc-auth-* | WHERE ...)"
}</code></pre><p>Now consumers just write <code>FROM error_triage | STATS error_count = COUNT(*) BY service</code>. Three indices, three pipelines, one name. If you have 10 dashboards and five alert rules consuming this pattern, that&#x27;s 15 copies of the same logic today; with a view, it&#x27;s one definition and zero consumer-side edits when you add a fourth service. See <a href="https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views">Elasticsearch ES|QL Views</a> for the full views deep dive.</p><h2>What you can do inside a branch</h2><p>Each branch supports the full ES|QL pipeline: <code>WHERE</code>, <code>EVAL</code>, <code>STATS</code>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join"><code>LOOKUP JOIN</code></a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/enrich"><code>ENRICH</code></a>, and more. See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-subquery">subquery documentation</a> for the complete list.</p><h2>Aggregate different metrics, and then combine</h2><p>Each branch can compute its own summary before results are merged. This is useful when different indices track the same concept under different field names:</p><pre><code>FROM
  (FROM svc-gateway-*
   | STATS avg_latency = AVG(http.response.time_ms) BY hour = BUCKET(@timestamp, 1 hour)
   | EVAL service = "gateway"),
  (FROM svc-payments-*
   | STATS avg_latency = AVG(transaction.duration_ms) BY hour = BUCKET(@timestamp, 1 hour)
   | EVAL service = "payments")
| SORT hour DESC, service</code></pre><p>Both branches produce <code>avg_latency</code> and <code>hour</code>, but each computes <code>avg_latency</code> from a different source field. The combined result is a single table you can chart or alert on, without normalizing field names at ingest time. This pattern is impossible with a single <code>FROM</code>; you can&#x27;t compute different aggregations per index without subqueries.</p><h2>Subqueries vs. FORK</h2><p>ES|QL also has <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> (now generally available), which creates parallel execution branches from the same input. The distinction:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5728fa2ce08df32b67ef9368c4d75d04b74ab95-1999x992.png" style="width: 100%;" /><p>Different indices → subqueries. Same data, different analyses → FORK.</p><h2>How this compares</h2><p>If you&#x27;re coming from other query languages, here&#x27;s how ES|QL subqueries stack up at the time of writing:</p><p><strong>Splunk SPL/SPL2</strong> has <code>append</code> and <code>multisearch</code> in classic SPL, and SPL2 adds a <a href="https://help.splunk.com/en/splunk-cloud-platform/search/spl2-search-reference/union-command/union-command-examples">union command</a> that merges events from multiple datasets (the closest analog to ES|QL subqueries). Federated Search extends this across remote Splunk deployments (analogous to CCS). The differences are in how the engine handles each branch: ES|QL subqueries give each branch independent predicate pushdown, meaning filters are pushed into each index&#x27;s shard-level structures separately. SPL2 <code>union</code> merges datasets but optimization across branches is limited to what the search scheduler can parallelize. Wrapping ES|QL subqueries in a <a href="https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views">view</a> gives you engine-level encapsulation with role-based access control (RBAC); Splunk&#x27;s equivalent is saved searches and macros, which are text substitutions expanded at parse time.</p><p><strong>SQL databases</strong> have <code>UNION ALL</code>, which is the closest analog. The difference is that SQL <code>UNION ALL</code> typically requires matching column counts and types at parse time. ES|QL subqueries are more forgiving; columns that exist in one branch but not another get null-padded automatically, which matters when your sources have different schemas (the norm in observability data). SQL views solve the reuse problem similarly, but ES|QL views are cluster-level objects, not database-scoped; they work across <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-cross-clusters.html">cross-cluster search</a> boundaries.</p><p><strong>Grafana / Datadog / other dashboarding tools</strong> handle multisource composition at the visualization layer: Run separate queries, merge in the panel. This works for display but breaks for alerting, downstream queries, and anything that needs a single result set programmatically. ES|QL subqueries push the composition into the engine, so alerts, views, and API consumers all get the same unified result.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Capability</td><td style="border: 1px solid #ccc; padding: 4px;">Splunk SPL/SPL2</td><td style="border: 1px solid #ccc; padding: 4px;">SQL UNION ALL</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard-layer merge</td><td style="border: 1px solid #ccc; padding: 4px;">ES|QL subqueries</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Independent filters per source</td><td style="border: 1px solid #ccc; padding: 4px;">SPL2 <code>union</code> merges datasets; optimization is scheduler-level</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">N/A (separate queries)</td><td style="border: 1px solid #ccc; padding: 4px;">Yes; parallel with pushdown</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Schema mismatch handling</td><td style="border: 1px solid #ccc; padding: 4px;">Manual field normalization</td><td style="border: 1px solid #ccc; padding: 4px;">Strict column matching</td><td style="border: 1px solid #ccc; padding: 4px;">Manual in panel config</td><td style="border: 1px solid #ccc; padding: 4px;">Automatic null-padding</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Engine-level reuse</td><td style="border: 1px solid #ccc; padding: 4px;">Text macros (parse-time expansion)</td><td style="border: 1px solid #ccc; padding: 4px;">Database-scoped views</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard variables</td><td style="border: 1px solid #ccc; padding: 4px;">Cluster-level views with RBAC</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Works for alerts + API</td><td style="border: 1px solid #ccc; padding: 4px;">Limited (summary indexing)</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">No; display only</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Add a source</td><td style="border: 1px solid #ccc; padding: 4px;">Edit every macro/saved search</td><td style="border: 1px solid #ccc; padding: 4px;">Add a UNION branch</td><td style="border: 1px solid #ccc; padding: 4px;">Add a panel query</td><td style="border: 1px solid #ccc; padding: 4px;">Add a branch; existing branches unchanged</td>
                    </tr>
              </tbody>
            </table>
          <h2>Current constraints</h2><p>In the Tech Preview release, subqueries are non-correlated; branches run independently and can&#x27;t reference the outer query. They&#x27;re supported in <code>FROM</code> only (not <code>TS</code>), and <code>FORK</code> can&#x27;t be used inside or after subqueries. See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-subquery">subquery documentation</a> for details.</p><h2>What&#x27;s next for subqueries</h2><p><a href="https://github.com/elastic/roadmap/issues/60"><code>WHERE</code> subqueries</a> — <code>WHERE field IN (FROM other_index | ...)</code> and other correlated forms — will extend the composition model from <code>FROM</code> into filtering. This brings the familiar SQL pattern of nested filtering to ES|QL.</p><h2>Try it</h2><p>Subqueries in <code>FROM</code> are available as a Tech Preview. Try them in <a href="https://www.elastic.co/kibana">Kibana</a> Dev Tools or Discover. We&#x27;d love your feedback; file a <a href="https://github.com/elastic/elasticsearch/issues">GitHub issue</a> with the <code>ES|QL</code> label.</p><p><em>ES|QL subqueries in FROM are a Tech Preview feature. Tech Preview features are subject to change and are not covered by the support SLA of GA features. The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch's BBQ vs. TurboQuant: 10–40× faster on CPU and lower ranking noise]]></title>
            <description><![CDATA[A head-to-head look at Elasticsearch BBQ and TurboQuant, including throughput, ranking accuracy, and why uniform quantization wins for CPU vector search with up to 40× faster comparisons and smaller ranking noise.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-bbq-osq-vs-turbo</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-bbq-osq-vs-turbo</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Thomas Veasey]]></dc:creator>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/81cf4cbba360511e166ea59f2cc747cdd69b7dcf-1840x892.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In CPU vector search, Elasticsearch&#x27;s <a href="https://www.elastic.co/search-labs/blog/scalar-quantization-optimization">Optimized Scalar Quantization (OSQ)</a> (the algorithm behind <a href="https://www.elastic.co/search-labs/blog/optimized-scalar-quantization-elasticsearch">Better Binary Quantization (BBQ)</a>) beats <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a> where production systems care most: throughput, ranking accuracy, and storage efficiency. In our tests on Apple M2 Max, OSQ&#x27;s symmetric kernels are 10-40x faster, and on shifted embeddings its 1-bit document encoding beats TurboQuant at 4 bits on ranking accuracy while using far less storage. TurboQuant still wins on raw reconstruction MSE, but that advantage comes mostly from the Hadamard rotation and does not translate into better CPU search behavior.</p><h4>A brief quantization primer</h4><p>Vector search indexes often store millions or billions of embedding vectors, each hundreds or thousands of floats wide. Scalar quantization compresses each float coordinate independently to a small integer, typically 1, 2, or 4 bits, reducing storage by 8-32× and enabling fast integer-arithmetic scoring.<br/>Why does per-coordinate quantization work at all? Because the expected squared error decomposes as a sum of independent per-coordinate terms: $\mathbb{E}[\|x - q(x)\|^2] = \sum_i \mathbb{E}[(x_i - q_i(x_i))^2]$, by linearity of expectation. Each coordinate can be quantized independently regardless of the joint distribution. The remaining question is how to make each per-coordinate quantizer <em>good</em>.</p><h4>BBQ and Optimized Scalar Quantization</h4><p><a href="https://www.elastic.co/search-labs/blog/optimized-scalar-quantization-elasticsearch">Better Binary Quantization (BBQ)</a>, and its underlying algorithm <a href="https://www.elastic.co/search-labs/blog/scalar-quantization-optimization">Optimized Scalar Quantization (OSQ)</a>, have been part of Elasticsearch for multiple releases. OSQ is an evolution of several techniques to make scalar quantization more accurate, specifically for vector search.</p><p>Each vector&#x27;s components are mapped to a uniform grid over an interval $[a, b]$. The interval is initialized from the vector&#x27;s statistics (assuming approximately normal residuals), using the <a href="https://www.elastic.co/search-labs/blog/scalar-quantization-optimization#initializing-the-quantization-interval">exact same optimization objective</a> as TurboQuant but with an additional constraint on centroid positions. They are then refined by coordinate descent to minimize an anisotropic loss $L = (1-\lambda)(x \cdot e)^2/\|x\|^2 + \lambda\|e\|^2$, where $e$ is the quantization error vector. With the production default $\lambda = 0.1$, this deliberately sacrifices some MSE to concentrate accuracy along the query direction. This is the direction that matters for ranking.</p><p>Before quantization, the segment (or cluster in the case of Inverted Vector File (IVF) indices) centroid $c$ is subtracted from every vector. This removes the dominant shared component that would otherwise waste the quantizer&#x27;s dynamic range. Both the symmetric and asymmetric dot-product paths also center the query using the same segment centroid, so the only quantized inner product is between centered residuals. 
The correction terms, $\langle c, x \rangle$ for each vector and $\|c\|^2$, depend on only a single vector each, and can be precomputed exactly. Centering therefore adds no per-pair cost.</p><p>Documents are quantized at 1-bit (32× compression), queries at 4-bit (cheap since there are only a handful per search). The storage constraint binds on documents, not queries, so spending more bits on the query side recovers float-query accuracy while keeping per-document footprint minimal.</p><p>A block-diagonal <a href="https://en.wikipedia.org/wiki/Orthogonal_matrix">orthogonal</a> preconditioner equalizes coordinate variances and normalizes their distribution before quantization. This is the same goal as a full <a href="https://en.wikipedia.org/wiki/Hadamard_transform">Hadamard rotation</a> used by TurboQuant, but with no power-of-2 padding overhead.</p><p>Because the grid is uniform, quantized dot products decompose into integer dot products with scalar corrections. This enables NEON/SVE and SSE/AVX popcount and multiply-accumulate pipelines: bit-plane decomposition for 1-bit and 2-bit, nibble multiply for 4-bit, and a RaBitQ-style mixed 4×1 kernel that decomposes to four 1-bit kernels.</p><p>For a deeper dive into the sparse rotation and what it brings to robustness, see the <a href="https://www.elastic.co/search-labs/blog/robust-optimized-scalar-quantization">Robust Optimized Scalar Quantization</a> blog. For a full walkthrough of optimized scalar quantization, see this <a href="https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch">OSQ deep dive</a>.</p><h4>TurboQuant</h4><p><a href="https://arxiv.org/pdf/2504.19874">TurboQuant</a> (Google, ICLR 2026) takes a slightly different path from the same starting observation: that concentrated, predictable per-coordinate distributions are easy to quantize well.</p><p>Rather than adapting the quantizer per vector, TurboQuant normalizes the vector and applies a shared randomized Hadamard rotation to the entire dataset. This general idea was first proposed and formalized by <a href="https://arxiv.org/pdf/2405.12497">RaBitQ</a>, which showed that, for its quantization scheme, the random rotation yields worst-case error bounds that hold for any fixed unit vector. The idea of implementing the rotation via a Hadamard transform was suggested by <a href="https://weaviate.io/blog/8-bit-rotational-quantization">Weaviate for rotational quantization</a>. After normalization and rotation, each coordinate&#x27;s distribution almost always converges to $\mathcal{N}(0, 1/d)$ in high dimensions, regardless of the original data. TurboQuant builds on this foundation: with the distribution pinned down, it solves for the optimal <a href="https://en.wikipedia.org/wiki/Lloyd%27s_algorithm">Lloyd-Max</a> scalar quantizer, a 1-D $k$-means problem on the known density. The resulting non-uniform centroids bunch up where the density is highest (near zero) and spread out in the tails. This achieves provably near-optimal MSE: within ~2.7× of the information-theoretic lower bound in general, and as tight as 1.45× at 1-bit.</p><p>For inner products, MSE-optimal quantizers introduce a multiplicative bias (which is most severe at 1-bit: $2/\pi \approx 0.64$). 
TurboQuant corrects this with a two-stage design ($Q_\text{prod}$): spend $b-1$ bits on the MSE quantizer, then use the remaining 1 bit for a <a href="https://arxiv.org/pdf/2406.03482">Quantized Johnson-Lindenstrauss</a> (QJL) sketch of the residual, yielding a provably unbiased inner-product estimator.</p><p>The paper&#x27;s nearest-neighbour experiments were conducted on GPU (NVIDIA A100), where the lookup-table access pattern maps naturally onto shared memory.</p><h4>The key design divergence: integer arithmetic vs. lookup tables</h4><p>The difference between uniform and non-uniform centroids may seem minor, but it creates a large computational gap.</p><p>OSQ&#x27;s uniform grid means each quantized coordinate is an integer whose arithmetic meaning is preserved. The dot product of two quantized vectors decomposes into an integer dot product, directly exploitable by SIMD: <code>vpdpbusd</code> on x86, multiply-accumulate and <code>vcnt</code> (popcount) on ARM NEON. The pipeline is branch-free and the data access pattern is sequential.</p><p>TurboQuant&#x27;s non-uniform centroids break this. Each coordinate pair requires looking up a centroid value from a shared codebook, and the access pattern is data-dependent, with each index selecting a different table entry. On NEON, which lacks a float gather instruction, this means scalar loads to build each vector register before the Fused Multiply-Add (FMA). Precomputing per-coordinate product tables ($d \times 2^b$ entries, amortized over all documents) doesn&#x27;t help either: the FMA is relatively cheap on modern cores, so the bottleneck remains the data-dependent gather, not the arithmetic. Our benchmarks confirm this: precomputed ADC tables are no faster (and sometimes slower due to the larger working set) than inline centroid lookup.</p><h4>Terminology used in the comparisons</h4><p>The results sections below refer to several OSQ scoring configurations. All use uniform-grid quantization with scalar correction terms to recover the exact dot product up to quantization noise.</p><p><em>Symmetric $n$-bit</em> quantizes both query and document at $n$ bits per coordinate.</p><p><em>Asymmetric</em> keeps the query as a full float vector and quantizes only the document. The dot product is a float-times-integer sum. This is more expensive per pair than symmetric, but avoids any query quantization noise. TurboQuant&#x27;s scoring is always asymmetric (float query dotted against quantized document via centroid lookup).</p><p><em>1-4</em> is the production configuration for OSQ: documents at 1-bit (32× compression), queries at 4-bit. This exploits the asymmetry of search: there is one query but millions of documents, so query storage is free but document storage is the binding constraint.</p><p><em>Centered</em> means the segment centroid $c$ has been subtracted from all vectors (and the query) before quantization, with the exact correction recovered from precomputed scalar terms. Centering focuses the quantizer&#x27;s dynamic range on the information-bearing residual rather than the shared mean.</p><p>$\lambda$ controls the anisotropic loss tradeoff: $\lambda = 1$ minimizes pure MSE, $\lambda = 0.1$ (production default) sacrifices some MSE to concentrate accuracy along the query direction, the direction that determines ranking.</p><h4>How do they compare in practice?</h4><p>The following results were obtained on an Apple M2 Max. 
The code to reproduce all these results is available <a href="https://github.com/elastic/osq_vs_turboquant">here</a>.</p><h4>Head-to-head: MSE</h4><p>On reconstruction MSE, the metric it was designed to optimize, TurboQuant outperforms plain OSQ at every bit-width.</p><p>Relative MSE ($\|x - \hat{x}\|^2 / \|x\|^2$) on $d = 768$ Gaussian vectors (1,000 vectors, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Bits</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ (λ=0.1)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ (λ=1)</td><td style="border: 1px solid #ccc; padding: 4px;">TurboQuant</td><td style="border: 1px solid #ccc; padding: 4px;">TQ vs OSQ ($\lambda=1$)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">0.512</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.307</td><td style="border: 1px solid #ccc; padding: 4px;">1.18×</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">0.138</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.092</td><td style="border: 1px solid #ccc; padding: 4px;">1.28×</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">0.038</td><td style="border: 1px solid #ccc; padding: 4px;">0.037</td><td style="border: 1px solid #ccc; padding: 4px;">0.026</td><td style="border: 1px solid #ccc; padding: 4px;">1.42×</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.007</td><td style="border: 1px solid #ccc; padding: 4px;">1.61×</td>
                    </tr>
              </tbody>
            </table>
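          <p>To make the metric concrete, here is a toy Python sketch (not the production OSQ code) that measures relative MSE for a naive per-vector uniform grid; the interval choice is illustrative, whereas OSQ optimizes it per vector:</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d, n, bits = 768, 1000, 1
x = rng.standard_normal((n, d))

# Toy interval: mean +/- one standard deviation per vector. OSQ instead
# refines [a, b] per vector by coordinate descent on its anisotropic loss.
mu = x.mean(axis=1, keepdims=True)
sd = x.std(axis=1, keepdims=True)
a, b = mu - sd, mu + sd

levels = (1 << bits) - 1  # number of grid steps (2^bits points)
codes = np.clip(np.round((x - a) / (b - a) * levels), 0, levels)
x_hat = a + codes / levels * (b - a)  # dequantized reconstruction

rel_mse = np.sum((x - x_hat) ** 2) / np.sum(x ** 2)
print(f"toy relative MSE @ {bits}-bit: {rel_mse:.3f}")</code></pre><p>Run as-is, this toy should land noticeably above the 1-bit OSQ figures in the table; that gap is roughly what the interval refinement buys.</p>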
          <p>The $\lambda$ columns reveal that OSQ&#x27;s production setting ($\lambda = 0.1$) deliberately sacrifices MSE for dot-product accuracy. With $\lambda = 1$ (pure MSE), the gap narrows to just 1.18× at 1-bit.</p><p>But where does TurboQuant&#x27;s remaining MSE advantage actually come from: the Lloyd-Max centroids, or the Hadamard rotation? We can answer this directly by applying the same randomized Hadamard rotation to OSQ (zero-pad 768→1024, random sign flips, Walsh-Hadamard butterfly, quantize in rotated space, invert). Theory predicts the MSE improves by a factor of $d/d&#x27; = 768/1024 = 0.75$:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Bits</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ (λ=1)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ + Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">TurboQuant</td><td style="border: 1px solid #ccc; padding: 4px;">Ratio (OSQ / OSQ + Hadamard)</td><td style="border: 1px solid #ccc; padding: 4px;">Theory (d'/d)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.306</td><td style="border: 1px solid #ccc; padding: 4px;">0.307</td><td style="border: 1px solid #ccc; padding: 4px;">1.19</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.092</td><td style="border: 1px solid #ccc; padding: 4px;">0.092</td><td style="border: 1px solid #ccc; padding: 4px;">1.28</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">0.037</td><td style="border: 1px solid #ccc; padding: 4px;">0.028</td><td style="border: 1px solid #ccc; padding: 4px;">0.026</td><td style="border: 1px solid #ccc; padding: 4px;">1.31</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.009</td><td style="border: 1px solid #ccc; padding: 4px;">0.007</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
              </tbody>
            </table>
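          <p>For reference, the randomized Hadamard rotation behind the &quot;OSQ + Hadamard&quot; rows can be sketched in a few lines of Python. Orthonormal scaling is assumed, so the transform is its own inverse up to the sign flips:</p><pre><code>import numpy as np

def fwht(v):
    # Iterative fast Walsh-Hadamard transform; len(v) must be a power of 2.
    v = v.copy()
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            v[i:i + h] = a + v[i + h:i + 2 * h]
            v[i + h:i + 2 * h] = a - v[i + h:i + 2 * h]
        h *= 2
    return v / np.sqrt(n)  # orthonormal scaling preserves norms

rng = np.random.default_rng(0)
d, d_pad = 768, 1024  # zero-pad to the next power of 2
signs = rng.choice([-1.0, 1.0], size=d_pad)  # shared random sign flips

def rotate(x):
    padded = np.zeros(d_pad)
    padded[:d] = x
    return fwht(signs * padded)  # quantize in this rotated space</code></pre>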
          <p>OSQ + Hadamard matches TurboQuant almost exactly at 1-bit (0.306 vs 0.307) and 2-bit (0.092 vs 0.092). TurboQuant&#x27;s MSE advantage is the rotation, not the centroids. At 3–4 bits the Lloyd-Max placement contributes a modest ~1.1× edge, real but small.</p><p>How the improvement ratio from applying the Hadamard transformation to OSQ converges is itself informative: at 4-bit it hits the theoretical 1.33 exactly, but at 1-bit it&#x27;s only 1.19. The shortfall quantifies the value of OSQ&#x27;s data-dependent interval refinement: it already captures ~40% of the dimension expansion and component equalization benefit that Hadamard provides. The coordinate descent is doing some of the same work as the rotation, adapting to each vector rather than relying on a data-oblivious transform. However, the real advantage, as we discuss below, is that this formulation allows us to concentrate accuracy along the query direction.</p><p>This raises a natural question: how does OSQ&#x27;s block-diagonal sparse preconditioner compare to the full Hadamard rotation in practice?</p><h4>Head-to-head: sparse preconditioner vs Hadamard</h4><p>OSQ&#x27;s <a href="https://www.elastic.co/search-labs/blog/robust-optimized-scalar-quantization">sparse preconditioner</a> applies a block-diagonal random orthogonal transformation: dimensions are randomly permuted into blocks (64×64 in production), and each block is multiplied by an independent random orthogonal matrix. This equalizes coordinate distributions within each block. The Hadamard rotation achieves the same goal globally but requires zero-padding to the next power of 2.</p><p>We test on anisotropic Gaussian data ($d = 768$, $\sigma_i$ ramping from 1 to 5 across coordinates), a challenging distribution where some coordinates carry far more variance than others.</p><p>Transform latency ($d = 768$, 1,100 vectors, ARM NEON, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">ns/vec</td><td style="border: 1px solid #ccc; padding: 4px;">Effective dim</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 32×32</td><td style="border: 1px solid #ccc; padding: 4px;">1,811</td><td style="border: 1px solid #ccc; padding: 4px;">768</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 64×64</td><td style="border: 1px solid #ccc; padding: 4px;">4,887</td><td style="border: 1px solid #ccc; padding: 4px;">768</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Full dense</td><td style="border: 1px solid #ccc; padding: 4px;">244,752</td><td style="border: 1px solid #ccc; padding: 4px;">768</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">1,556</td><td style="border: 1px solid #ccc; padding: 4px;">1,024</td>
                    </tr>
              </tbody>
            </table>
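          <p>Concretely, the block-diagonal transform benchmarked above can be sketched as follows (random orthogonal blocks obtained via QR here; the production implementation differs in details):</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d, block = 768, 64
perm = rng.permutation(d)  # random coordinate permutation

# One independent random orthogonal matrix per block (QR of a Gaussian).
rotations = [np.linalg.qr(rng.standard_normal((block, block)))[0]
             for _ in range(d // block)]

def precondition(x):
    # Permute, then rotate each block independently: O(d * block) per vector,
    # no power-of-2 padding, effective dimension stays at d.
    y = x[perm]
    return np.concatenate(
        [q @ y[i * block:(i + 1) * block] for i, q in enumerate(rotations)]
    )</code></pre>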
          <p>Hadamard is the fastest non-trivial option thanks to $O(d \log d)$ butterflies vs $O(d \times b)$ for block size $b$, though all block-diagonal variants are fast enough to be negligible in practice: even the 64×64 block, at 4.9 μs, is tiny compared to typical search latencies. The full dense $d \times d$ rotation is impractical at 244 μs/vec but serves as a theoretical reference. Note that the block-diagonal transform works for arbitrary dimensions: no power-of-2 padding is required, and the effective dimension stays at $d$.</p><p>MSE (relative MSE, $\lambda = 1$, anisotropic data, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">1 bit</td><td style="border: 1px solid #ccc; padding: 4px;">2 bit</td><td style="border: 1px solid #ccc; padding: 4px;">4 bit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">No transform</td><td style="border: 1px solid #ccc; padding: 4px;">0.443</td><td style="border: 1px solid #ccc; padding: 4px;">0.157</td><td style="border: 1px solid #ccc; padding: 4px;">0.0182</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 32×32</td><td style="border: 1px solid #ccc; padding: 4px;">0.368</td><td style="border: 1px solid #ccc; padding: 4px;">0.121</td><td style="border: 1px solid #ccc; padding: 4px;">0.0120</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 64×64</td><td style="border: 1px solid #ccc; padding: 4px;">0.365</td><td style="border: 1px solid #ccc; padding: 4px;">0.119</td><td style="border: 1px solid #ccc; padding: 4px;">0.0117</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Full dense</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.0113</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.0112</td>
                    </tr>
              </tbody>
            </table>
          <p>Even 32×32 blocks recover most of the gap from no-transform (0.443) to full rotation (0.362), 93% at 1-bit. Block 64×64 closes the gap further. On isotropic data (not shown), all methods produce identical MSE (~0.362 at 1-bit), confirming there is nothing to equalize when coordinates already have equal variance.</p><p>Dot-product accuracy (1-4 centered, raw relative dot-product error; note these are raw RMSE including multiplicative bias, which is appropriate for comparing preconditioner variants against each other since the bias structure is similar, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Anisotropic</td><td style="border: 1px solid #ccc; padding: 4px;">Isotropic</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">No transform</td><td style="border: 1px solid #ccc; padding: 4px;">0.690</td><td style="border: 1px solid #ccc; padding: 4px;">0.722</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 32×32</td><td style="border: 1px solid #ccc; padding: 4px;">0.606</td><td style="border: 1px solid #ccc; padding: 4px;">0.724</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 64×64</td><td style="border: 1px solid #ccc; padding: 4px;">0.602</td><td style="border: 1px solid #ccc; padding: 4px;">0.720</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Full dense</td><td style="border: 1px solid #ccc; padding: 4px;">0.595</td><td style="border: 1px solid #ccc; padding: 4px;">0.723</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">0.566</td><td style="border: 1px solid #ccc; padding: 4px;">0.629</td>
                    </tr>
              </tbody>
            </table>
          <p>On isotropic data the block-diagonal methods and the full dense rotation produce the same dot-product error as no transform since there is nothing to fix. Hadamard is the outlier, improving from 0.723 to 0.629. But this improvement is not from better preconditioning: the full dense rotation, which is an equally good random orthogonal transform, shows no improvement at all. The difference is the padding. Hadamard operates in 1024 dimensions, so 1-bit documents store 1024 bits instead of 768. This is 33% more storage. The improvement ratio (0.723 / 0.629 = 1.15) matches $\sqrt{d&#x27;/d} = \sqrt{1024/768} = 1.155$ almost exactly, confirming that the entire dot-product advantage is attributable to the extra bits, not the rotation.</p><p>On anisotropic data, the block-diagonal rotation does help dot-product accuracy (0.690 → 0.602 for block 64), which is the real value from coordinate equalization. Hadamard goes further (0.566), but the incremental improvement over a full dense rotation at the same dimension (0.595 → 0.566) is again consistent with the padding benefit.</p><p>The practical implication: for CPU-based search where storage efficiency matters, the block-diagonal preconditioner delivers the same MSE improvement as Hadamard at the same effective bit rate, works for any dimension without padding, and the dot-product gap we see in our experiments is a padding artifact, not a preconditioning advantage.</p><h4>Head-to-head: dot-product accuracy</h4><p>MSE measures reconstruction quality, but search engines rank by dot products. These are different objectives, and the gap between them is where OSQ&#x27;s design choices pay off.</p><p>We measure relative dot-product error: $\sqrt{\sum(q \cdot x - \hat{q} \cdot \hat{x})^2 / \sum(q \cdot x)^2}$, varying the angle between query and document. The small-angle regime (0°–20°) matters most: real transformer embeddings occupy a narrow cone rather than spreading uniformly on the sphere (<a href="https://arxiv.org/pdf/1909.00512">Ethayarajh 2019</a>). Furthermore, near-parallel vectors, corresponding to the nearest neighbours of a query in the dataset, are where ranking accuracy is critical.</p><p>Our production configuration is 1-bit documents, 4-bit queries, centroid centering, with integer scoring.</p><p>Raw dot-product error conflates two distinct components: a multiplicative bias (a global scale factor that preserves ranking order) and noise (random per-pair deviations that can swap rankings). For search, only the noise matters: a biased estimator that consistently scales all scores by the same factor produces the same ranking as the exact scores. TurboQuant&#x27;s MSE quantizer at 1-bit has a well-known multiplicative bias of $2/\pi \approx 0.64$, meaning raw dot-product errors of ~0.36 are almost entirely this ranking-irrelevant scale factor. To give a fair comparison, we report the debiased RMSE after fitting and removing the best multiplicative scale: $\alpha = \sum(\hat{d} \cdot d) / \sum(d^2)$, then measuring $\sqrt{\sum(\hat{d}/\alpha - d)^2 / \sum d^2}$.</p><p>Zero-mean corpus ($d = 768$, 500 vectors, 5 queries per vector, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Angle</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ asymmetric (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ 1-4 (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @1-bit (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @4-bit (debiased)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0035</td><td style="border: 1px solid #ccc; padding: 4px;">0.0067</td><td style="border: 1px solid #ccc; padding: 4px;">0.0083</td><td style="border: 1px solid #ccc; padding: 4px;">0.0052</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0042</td><td style="border: 1px solid #ccc; padding: 4px;">0.0060</td><td style="border: 1px solid #ccc; padding: 4px;">0.0085</td><td style="border: 1px solid #ccc; padding: 4px;">0.0052</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0057</td><td style="border: 1px solid #ccc; padding: 4px;">0.0074</td><td style="border: 1px solid #ccc; padding: 4px;">0.0091</td><td style="border: 1px solid #ccc; padding: 4px;">0.0052</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20°</td><td style="border: 1px solid #ccc; padding: 4px;">0.010</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.0053</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">45°</td><td style="border: 1px solid #ccc; padding: 4px;">0.027</td><td style="border: 1px solid #ccc; padding: 4px;">0.029</td><td style="border: 1px solid #ccc; padding: 4px;">0.025</td><td style="border: 1px solid #ccc; padding: 4px;">0.0060</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">60°</td><td style="border: 1px solid #ccc; padding: 4px;">0.048</td><td style="border: 1px solid #ccc; padding: 4px;">0.049</td><td style="border: 1px solid #ccc; padding: 4px;">0.042</td><td style="border: 1px solid #ccc; padding: 4px;">0.0074</td>
                    </tr>
              </tbody>
            </table>
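          <p>For concreteness, here is a minimal NumPy sketch of the debiasing step described above (illustrative variable names only; this is the reported metric, not the benchmark harness): fit the single multiplicative scale the estimates carry relative to the exact dot products, divide it out, and report the residual relative RMSE.</p><pre><code>import numpy as np

def debiased_rmse(d_exact, d_est):
    """Ranking-relevant noise: remove the best-fit multiplicative scale first.

    d_exact: exact dot products q . x          (1-D array)
    d_est:   quantized estimates q_hat . x_hat (1-D array)
    """
    # Least-squares fit of the ranking-irrelevant scale factor alpha.
    alpha = np.dot(d_est, d_exact) / np.dot(d_exact, d_exact)
    # Relative RMSE of the rescaled estimates -- the "debiased" numbers in the tables.
    return np.sqrt(np.sum((d_est / alpha - d_exact) ** 2) / np.sum(d_exact ** 2))</code></pre>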
          <p>Shifted corpus (shift = 2.0, modeling real embedding bias, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Angle</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ asymmetric (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ 1-4 (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @1-bit (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @4-bit (debiased)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0008</td><td style="border: 1px solid #ccc; padding: 4px;">0.0013</td><td style="border: 1px solid #ccc; padding: 4px;">0.0073</td><td style="border: 1px solid #ccc; padding: 4px;">0.0054</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0013</td><td style="border: 1px solid #ccc; padding: 4px;">0.0015</td><td style="border: 1px solid #ccc; padding: 4px;">0.0076</td><td style="border: 1px solid #ccc; padding: 4px;">0.0054</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0021</td><td style="border: 1px solid #ccc; padding: 4px;">0.0023</td><td style="border: 1px solid #ccc; padding: 4px;">0.0084</td><td style="border: 1px solid #ccc; padding: 4px;">0.0054</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0041</td><td style="border: 1px solid #ccc; padding: 4px;">0.0043</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.0055</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">45°</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.025</td><td style="border: 1px solid #ccc; padding: 4px;">0.0064</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">60°</td><td style="border: 1px solid #ccc; padding: 4px;">0.022</td><td style="border: 1px solid #ccc; padding: 4px;">0.022</td><td style="border: 1px solid #ccc; padding: 4px;">0.043</td><td style="border: 1px solid #ccc; padding: 4px;">0.0078</td>
                    </tr>
              </tbody>
            </table>
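          <p>Before unpacking these numbers, a toy NumPy illustration of why centering matters on shifted data. This is a deliberately simplified 1-bit quantizer (sign bits with a per-vector scale), not the OSQ implementation; the variable names and setup are illustrative only.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d, n, shift = 768, 500, 2.0
X = rng.normal(size=(n, d)) + shift   # shifted corpus, as in the table above
q = rng.normal(size=d) + shift        # query from the same distribution
centroid = X.mean(axis=0)

def one_bit(v):
    # Toy 1-bit quantizer: sign bits with a per-vector scale (mean magnitude).
    return np.sign(v) * np.abs(v).mean(axis=-1, keepdims=True)

exact = X @ q
for name, est in [
    ("raw", one_bit(X) @ q),                               # quantize raw vectors
    ("centered", (one_bit(X - centroid) + centroid) @ q),  # quantize residuals
]:
    alpha = np.dot(est, exact) / np.dot(exact, exact)
    noise = np.sqrt(np.sum((est / alpha - exact) ** 2) / np.sum(exact ** 2))
    print(f"{name:>8}: debiased noise {noise:.4f}")</code></pre><p>Quantizing the raw vectors spends the single bit mostly re-encoding the shared offset; subtracting the centroid first spends it on the informative residual, which is the effect the shifted-corpus numbers reflect.</p>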
          <p>On zero-mean data, the raw error numbers (omitted for brevity; TQ @1-bit&#x27;s is ~0.363, almost entirely due to the $2/\pi$ multiplicative bias) are misleading; only the debiased ranking noise matters. The asymmetric column (float query, 1-bit document) is the most directly comparable to TQ since both quantize only the document: at 0° OSQ achieves 2.4× lower noise (0.0035 vs 0.0083). This is the payoff of the anisotropic loss ($\lambda = 0.1$), which concentrates accuracy along the query direction at the expense of off-axis components. Quantizing the query to 4 bits gives back some of this advantage (0.0035 → 0.0067), showing that query quantization is now the dominant noise source at small angles. Even so, OSQ symmetric still beats TQ @1-bit by 1.2–1.4× through 10°. The tradeoff is visible at wider angles where TQ @1-bit has <em>lower</em> noise than OSQ (0.042 vs 0.049 at 60°): the Hadamard rotation distributes information uniformly across all directions, while OSQ deliberately favors the directions that matter for search.</p><p>What about $Q_\text{prod}$? TurboQuant&#x27;s inner-product variant was designed to address exactly this bias, spending $b-1$ bits on the MSE quantizer and 1 bit on a QJL sketch of the residual to produce a provably unbiased estimator. At 1-bit $Q_\text{prod}$ is not viable (0 bits for MSE), so the minimum is 2-bit. But for ranking, the cure is worse than the disease: $Q_\text{prod}$ trades ranking-irrelevant bias for ranking-relevant noise. At 60°, $Q_\text{prod}$&#x27;s debiased noise is consistently higher than MSE-only at the same total bit width (0.031 vs 0.025 at 2-bit, 0.011 vs 0.007 at 4-bit) because each bit spent on QJL correction would have been better spent on quantization. Since search cares only about ranking, MSE-only is the better choice. The bias is harmless and the extra quantization bit reduces the noise that actually matters.</p><p>The picture changes on shifted data, where centroid centering gives OSQ a decisive advantage. At 0° the debiased noise drops to 0.0008, which is 9× lower than TQ @1-bit&#x27;s 0.0073, and 7× lower than TQ @4-bit&#x27;s 0.0054. Centering removes the dominant shared component before quantization, letting the quantizer focus its bits on the information-bearing residual. TurboQuant&#x27;s data-oblivious rotation cannot exploit this structure. The advantage persists through 20° (OSQ 0.0041 vs TQ @1-bit 0.012) and only narrows at wide angles (60°: OSQ 0.022 vs TQ @1-bit 0.043), where OSQ remains competitive.</p><p>On shifted data, OSQ at 1-bit per document (debiased noise 0.001) beats TurboQuant at 4-bit per document (debiased noise 0.006): better ranking accuracy at less than one-fifth the storage (768 bits vs 4,096 bits, since TQ pads 768→1024 for the Hadamard transform). This is the payoff of the data-dependent design: centering and anisotropic interval refinement extract structure that a data-oblivious rotation cannot.</p><p>TQ @4-bit MSE is consistently the lowest-noise option on zero-mean data (debiased 0.005–0.008 across all angles), but at 5× the storage cost per document. On shifted data it is actually substantially worse than OSQ symmetric at angles up to 20°.</p><h4>Head-to-head: throughput</h4><p>Throughput is where the uniform grid constraint really shines. Here are the throughput figures on $d = 768$, 10k documents, Apple M2 Max, 100 repetitions:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Bits</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ asymmetric</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ symmetric</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ 1-4</td><td style="border: 1px solid #ccc; padding: 4px;">TurboQuant</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">67 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">7 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">—</td><td style="border: 1px solid #ccc; padding: 4px;">275 ns/doc</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">132 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">14 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">—</td><td style="border: 1px solid #ccc; padding: 4px;">293 ns/doc</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">94 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">22 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">14 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">216 ns/doc</td>
                    </tr>
              </tbody>
            </table>
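          <p>The kernel walk-through below leans on one identity: a quantized dot product decomposes into weighted AND+popcount passes over packed bit-planes. As a minimal sanity check of that arithmetic (pure Python with NumPy; the helper names are ours, and the production kernels use NEON intrinsics as described next):</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d = 768
x4 = rng.integers(0, 16, size=d)  # 4-bit query indices
y1 = rng.integers(0, 2, size=d)   # 1-bit document bits

def pack(bits):
    # Pack a 0/1 vector into bytes: for d = 768 this is one 96-byte plane.
    return np.packbits(bits.astype(np.uint8))

def popcount_and(a, b):
    # popcount(a AND b) over packed planes -- one pass of the kernel.
    return int(np.unpackbits(np.bitwise_and(a, b)).sum())

# Mixed 4x1 decomposition: sum(idx4 * idx1) over all coordinates equals
# 8*pc(plane3 AND bits1) + 4*pc(plane2 AND bits1)
#   + 2*pc(plane1 AND bits1) + pc(plane0 AND bits1)
planes = [pack((x4 // 2**k) % 2) for k in range(4)]  # query bit-planes
bits1 = pack(y1)                                     # packed document bits
score = sum(2**k * popcount_and(planes[k], bits1) for k in range(4))

assert score == int(np.dot(x4, y1))  # matches the exact integer dot product</code></pre>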
          <p>OSQ&#x27;s symmetric kernels are 10–40× faster than TurboQuant!</p><p>We made a fair effort to optimize both implementations to use ARM NEON instructions effectively, but do not claim these are optimal. The key techniques:</p><p>The 1-bit kernel reduces to <code>popcount(a AND b)</code> via NEON&#x27;s <code>vcntq_u8</code>, processing 32 bytes per iteration with dual accumulators for pipeline parallelism. For $d = 768$ the entire packed vector is 96 bytes; a single pass yields 7 ns/doc.</p><p>The 2-bit kernel decomposes each 2-bit index into two bit-planes (precomputed at quantize time), reducing the dot product to 4 AND+popcount passes over the same 96-byte planes: $\sum(2x_1+x_0)(2y_1+y_0) = 4 \cdot \text{pc}(x_1 \wedge y_1) + 2 \cdot \text{pc}(x_1 \wedge y_0) + 2 \cdot \text{pc}(x_0 \wedge y_1) + \text{pc}(x_0 \wedge y_0)$. At 14 ns/doc this is 2× the 1-bit time rather than the naive 4× because all four plane pairs share the same data loads: each 96-byte plane is read once and reused across passes.</p><p>The 4-bit kernel uses direct NEON nibble multiply with <code>vandq/vshrq</code> to split packed bytes into lo/hi nibbles, multiply, and accumulate via <code>vpaddlq_u8</code> widening adds. At 22 ns/doc, this is faster than the 16-popcount bit-plane alternative ($4^2 = 16$ plane combinations).</p><p>A mixed 4×1 kernel is the production workhorse. It precomputes the 4-bit query&#x27;s 4 bit-planes at quantize time (each 96 bytes in the same 1-bit packed layout as the document). Per-document scoring is then 4 AND+popcount passes, i.e., the RaBitQ decomposition: $\sum \text{idx4}_i \times \text{idx1}_i = 8 \cdot \text{pc}(\text{plane3} \wedge \text{bits1}) + 4 \cdot \text{pc}(\text{plane2} \wedge \text{bits1}) + \ldots$ At 14 ns/doc this is 21.3× faster than TurboQuant&#x27;s 1-bit path at 3/4 the document storage.</p><p>TurboQuant&#x27;s bottleneck is the data-dependent gather: each coordinate requires a scalar load from the centroid table to build a NEON float vector. The arithmetic (FMA) is essentially free in comparison.</p><h4>Conclusion</h4><p>TurboQuant is a theoretically elegant construction that builds directly on the OSQ formulation. The provable MSE bound, the unbiased inner-product estimator, and the clean data-oblivious design are real contributions. For applications requiring calibrated scores (not just rankings), or running on GPU hardware where gather operations are cheap, TurboQuant&#x27;s architecture is well-motivated. The calibration-free design is also a natural fit for settings where quantization must happen on the fly with zero training overhead; KV cache compression during LLM inference is a prime example. There, every vector is quantized once as it enters the cache and discarded after the forward pass, so there is no opportunity to amortize a per-vector coordinate descent. A fixed codebook derived from the known post-rotation distribution is exactly the right tool: rotate, snap, store.</p><p>But for CPU-based vector search, the setting where Elasticsearch and most operational systems execute queries, the empirical picture is clear across all three axes:</p><p><strong>MSE:</strong> TurboQuant&#x27;s advantage comes from the Hadamard rotation, not the Lloyd-Max centroids. OSQ with the same rotation matches TurboQuant at 1–2 bits and comes within 1.1× at 3–4 bits. 
OSQ&#x27;s sparse preconditioner already provides this benefit without padding overhead.</p><p><strong>Dot-product accuracy:</strong> After removing ranking-irrelevant multiplicative bias (including TQ&#x27;s $2/\pi$ scale factor at 1-bit), OSQ has 1.2–1.4× lower ranking noise than TQ @1-bit at small angles on zero-mean data even with a quantized query and without the 25% pad, thanks to the anisotropic loss concentrating accuracy along the query direction. On shifted data, the regime that matters in practice because embeddings typically have a non-zero mean, centering amplifies the advantage further: debiased noise of 0.0008 at 0° vs TQ @1-bit&#x27;s 0.0073 and even TQ @4-bit&#x27;s 0.0054. OSQ at 1-bit beats TurboQuant at 4-bit on ranking accuracy at less than 1/5 the storage. TurboQuant&#x27;s $Q_\text{prod}$ variant addresses bias explicitly but trades it for higher noise, making MSE-only the better choice for search.</p><p><strong>Throughput:</strong> 10–40× faster symmetric scoring, with the mixed 4×1 kernel at 14 ns/doc versus TurboQuant&#x27;s 293 ns/doc using NEON intrinsics. This reflects a fundamental architectural divide between integer arithmetic and lookup-table gather, not a constant factor that disappears with batching.</p><p>The uniform grid, far from being a compromise, turns out to be the right trade: it sacrifices a theoretical MSE margin that almost vanishes under equivalent rotation, and in return unlocks the integer-arithmetic pipeline that makes sub-millisecond search at scale practical.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch ES|QL views: One query to rule twelve dashboards]]></title>
            <description><![CDATA[With ES|QL views, you only need one query for multiple dashboards. Define it once and let Elasticsearch keep everything in sync.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tyler Perkins]]></dc:creator>
            <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852bff427ac62b79974d88e27ce9670dc132bc46-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch Query Language (ES|QL) now has <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-views">logical views</a>. Define a query once, and reference it by name in <code>FROM</code>, like an index. Twelve dashboards, one definition, zero copy-paste. Update the view, and every consumer gets the change automatically.</p><p>Views don&#x27;t store data; they re-execute on every read, so results always reflect the current data and the current definition. If you&#x27;ve used views in SQL databases, this will feel familiar. The difference: ES|QL views are engine-level virtual indices stored at the Elasticsearch cluster level, not saved query text that gets expanded client-side. They appear in <a href="https://www.elastic.co/kibana">Kibana</a> autocomplete, support <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-cross-clusters.html">cross-cluster search</a> (CCS), and are governed by dedicated role-based access control (RBAC) privileges.</p><h2>A simple view</h2><p>A view can wrap any ES|QL query. Start with a straightforward filter — HTTP 500 errors from the API gateway:</p><pre><code>PUT _query/view/error_triage
{
  "query": """
    FROM svc-gateway-*
    | WHERE http.response.status_code >= 500
    | KEEP @timestamp, http.response.status_code, url.path, source.ip
  """
}</code></pre><p>Now anyone can write <code>FROM error_triage</code> without knowing the index pattern or filter condition:</p><pre><code>FROM error_triage
| STATS error_count = COUNT(*) BY url.path
| SORT error_count DESC</code></pre><p>The query is defined once. Consumers reference a name.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/68ff5a52b0f3ed3dfaa07d2af6e7f08a8c9c0f55-1999x702.png" style="width: 100%;" /><p>Views support full create, read, list, update, and delete (CRUD) via the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-esql-put-view"><code>_query/view REST API</code></a>.</p><h2>Update propagation</h2><p>Say the team decides <code>error_triage</code> should also capture client errors, not just 500s. Update the definition in place:</p><pre><code>PUT _query/view/error_triage
{
  "query": """
    FROM svc-gateway-*
    | WHERE http.response.status_code >= 400
    | KEEP @timestamp, http.response.status_code, url.path, source.ip
  """
}</code></pre><p>Every dashboard panel, alert rule, and ad-hoc query using <code>FROM error_triage</code> immediately reflects the broader filter. No saved objects to hunt down. No stale copies. Change once, update everywhere.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aad2b22fd4d45f8db1140ae93429c6b9ca345031-1999x440.png" style="width: 100%;" /><h2>Nested views</h2><p>Views can reference other views, enabling layered abstractions. Create views for suspicious IPs and threat intelligence, and then compose them:</p><pre><code>PUT _query/view/suspicious_ips
{
  "query": """
    FROM svc-auth-*
    | WHERE event.action == "login" AND event.outcome == "failure"
    | STATS attempts    = COUNT(*),
            first_seen  = FIRST(@timestamp, @timestamp),
            latest_user = LAST(user.name, @timestamp)
        BY source.ip
    | WHERE attempts > 3
  """
}

PUT _query/view/known_threats
{
  "query": """
    FROM threat-intel
  """
}

PUT _query/view/security_overview
{
  "query": """
    FROM suspicious_ips, known_threats
  """
}
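
// Views support full CRUD via the _query/view API; assuming the read
// endpoint mirrors the PUT path above, you can fetch a stored definition
// back at any time:
GET _query/view/security_overview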

FROM security_overview
| WHERE source.ip IS NOT NULL
| EVAL is_known_threat = threat.category IS NOT NULL
| KEEP source.ip, attempts, threat.category, threat.severity, is_known_threat
| SORT is_known_threat DESC, attempts DESC</code></pre><p>Security teams query <code>FROM security_overview</code> without knowing the underlying data model. They&#x27;re also shielded from any changes made to <code>suspicious_ips</code> by its owner; the abstraction boundary is real, not syntactic.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fd8f17e0a5bd80a73737f0fe08f9e3480d70ad68-1822x642.png" style="width: 100%;" /><h2>Multisource views with subqueries</h2><p>A view can wrap any ES|QL query, including multisource compositions, using <a href="https://www.elastic.co/search-labs/blog/esql-subquery-from"><code>subqueries in FROM</code></a>. Each subquery branch queries one service independently (its own filters, its own field normalization), and the results combine automatically:</p><pre><code>PUT _query/view/all_errors
{
  "query": """
    FROM
      (FROM svc-gateway-*
       | WHERE http.response.status_code >= 500
       | EVAL service = "gateway",
              error_detail = CONCAT("HTTP ", http.response.status_code::string)
       | KEEP @timestamp, service, error_detail, source.ip),
      (FROM svc-payments-*
       | WHERE transaction.status IN ("failed", "timeout")
       | EVAL service = "payments", error_detail = transaction.status
       | KEEP @timestamp, service, error_detail, source.ip)
  """
}</code></pre><p>Consumers just write:</p><pre><code>FROM all_errors
| STATS error_count = COUNT(*) BY service
| SORT error_count DESC</code></pre><p>Two indices, two independent pipelines, one name. To add a third service later, add a third branch; existing branches don&#x27;t change, and every downstream dashboard and alert reflects the update automatically. For a deep dive on subquery syntax and what you can do inside each branch, see <a href="https://www.elastic.co/search-labs/blog/esql-subquery-from">Three Indices Walk Into a FROM Clause</a>.</p><h2>How views work under the hood</h2><p>When you write <code>FROM view_name</code>, ES|QL resolves the view&#x27;s stored query and executes it inline. Views are re-executed on every read, so results always reflect the current data and the current definition.</p><p>Views share a namespace with indices, aliases, and data streams. A view cannot have the same name as any of these (enforced at creation time). This keeps <code>FROM my_name</code> unambiguous regardless of whether the name resolves to a view, an index, or an alias.</p><h2>Security model</h2><p>Views are governed by four dedicated RBAC privileges: <code>create_view</code>, <code>read_view_metadata</code>, <code>delete_view</code>, and <code>manage_view</code>. Elasticsearch checks the privileges of the user running the query (invoker security), not the user who defined the view. The user querying a view needs permissions on both the view and its underlying indices.</p><h2>Kibana integration</h2><p>Views appear in Discover&#x27;s ES|QL editor autocomplete alongside indices. ES|QL-based dashboard panels work with views transparently. In the initial Tech Preview release, view management is API-only. A Kibana UI for creating and managing views is planned.</p><h2>Cross-cluster search</h2><p>A view&#x27;s definition can reference remote indices using <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-cross-clusters.html">CCS syntax</a>:</p><pre><code>PUT _query/view/cross_cluster_errors
{
  "query": """
    FROM cluster-west:logs-*, cluster-east:logs-*
    | WHERE log.level IN ("error", "crit")
  """
}</code></pre><p>Consumers query <code>FROM cross_cluster_errors</code> without knowing which clusters are involved.</p><h2>Current constraints</h2><p>In the Tech Preview release, view management is API-only and SET directives can&#x27;t appear inside view definitions; the caller applies them when querying. Subquery-based views can&#x27;t be nested inside other multisource <code>FROM</code> expressions. See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/views">views documentation</a> for the full list.</p><h2>What&#x27;s next for views</h2><p>Views today are always fresh; they re-execute on read. <a href="https://github.com/elastic/roadmap/issues/49">Materialized views</a> flip that tradeoff: Pre-compute once, read instantly. Think pre-aggregated rollup views for Service Level Agreement (SLA) dashboards that load in milliseconds instead of scanning raw data on every refresh. A Kibana CRUD UI for views, including a &quot;Save as View&quot; workflow in Discover, is also planned.</p><h2>Try it</h2><p>Logical views are available as a Tech Preview. Try them in <a href="https://www.elastic.co/kibana">Kibana</a> Dev Tools or Discover. We&#x27;d love your feedback; file a <a href="https://github.com/elastic/elasticsearch/issues">GitHub issue</a> with the <code>ES|QL</code> label.</p><p><em>ES|QL logical views are a Tech Preview feature. Tech Preview features are subject to change and are not covered by the support SLA of GA features. The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic Agent Builder: How we taught AI agents to manage their own context]]></title>
            <description><![CDATA[Agent Builder in Elasticsearch 9.4 ships dynamically loaded skills, a conversation context store, selective compaction, and external connectors to cut token costs by 40% and let agents handle their own context management.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-agent-builder-ai-agents-context-management</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-agent-builder-ai-agents-context-management</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Anish Mathur, Deepti Dheer, Evan Castle]]></dc:creator>
            <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae1daed5e4424866a30e20884493f859f65a9fd2-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Every developer building agents has to answer one question: what should the agent know, and when? The usual starting point is simple. Write a system prompt, connect a few tools, and the agent works. But as the scope grows, you add more instructions, sources and tools, and runs get longer. Eventually the context window fills up, fidelity drops and token costs climb. Managing context is a first-class concern. This post shares how we built context handling into agents so they can decide what to fetch, what to summarize, what to drop, and how memory should carry across steps.</p><p>Agent Builder in 9.4 makes context the agent&#x27;s problem, not yours. Skills provide reusable instructions that load on demand, so only what the current task needs is in context. Large result sets go into a conversation context store rather than sitting in the prompt. For long-running tasks, context gets compacted selectively so the agent doesn&#x27;t drift. Token and turn counts are monitored as the conversation runs. And connectors handle reaching enterprise data where it lives.</p><p>The goal of all of it is the same. Load the context you need, when you need it. Internally we&#x27;ve seen this cut token costs by up to 40%, and the agent&#x27;s context stays reliable across datasets when it would have degraded before.</p><h2>Getting agents to know what you know</h2><p>Three context problems show up again and again: managing bloated prompts, executing complex actions, and controlling enterprise data sources.</p><p>Firstly, prompts get bloated because every instruction has to live in them. Skills fix that by loading on demand and with fewer input tokens. <strong>Skills </strong>provide structured descriptions of how an agent should operate and act in a specific task. Agent Builder ships with built-in skills for common data analysis patterns, but the real value is that users can also build custom skills. A security team can encode its triage playbook as a skill. An SRE team can describe how they want root cause analysis to proceed. A developer can encode their API design conventions and error handling patterns. Skills are reusable and shareable across agents, which means a pattern that works for one team&#x27;s deployment doesn&#x27;t have to be reinvented by the next.</p><p>In practice, this looks like: a team lead defines a &quot;Summarize this incident&quot; skill with the process they care about, the severity classification their org uses, and the output format their runbook expects. Anyone on the team invokes it by typing in the chat input and selecting it from autocomplete. Skills follow the <a href="https://github.com/elastic/agent-skills">Agent Skills open format</a>, so you can pull them from a shared library, write your own, or craft skills using an agent of your choice.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ad073b40ea280077e71225df2b5747c9a5061f3-1280x766.gif" style="width: 100%;" /><p>In internal testing, we found that removing instructions from the agent prompt and placing them in dynamically loaded skills showed a 21 to 39% reduction in input token usage across test datasets. The key architectural improvement is that skills and their associated tools are loaded only when the agent needs them. 
All other skills stay as lightweight stubs with just a name and description, consuming almost no context.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2479d95570de548aad7d47f036da50e00e13ff56-1328x1176.png" style="width: 100%;" /><p><strong>Chat with your data and act on it too (dashboards, workflows, queries and more): </strong>Agentic tasks don&#x27;t stay simple for long. Agent Builder now has contextual awareness of objects in Kibana. With agentic dashboard creation, a user can describe what they want to see in plain language, and the agent generates a dashboard with panels, visualizations, queries, and everything that&#x27;s needed. Users can refine it conversationally: &quot;break that out by region,&quot; &quot;add a filter for the last 7 days,&quot; &quot;swap the bar chart for a line chart.&quot;</p><p>Dashboards, alerts, and rules also work as inputs. Once a dashboard exists, it can be pulled into the agent&#x27;s context. This unlocks the &quot;act&quot; side of agents. Once a dashboard or alert is in context, the agent can modify it, extend it, or create new ones. The agent can reason about what the data shows, suggest follow-up analyses, or modify the dashboard based on what it sees. It&#x27;s a feedback loop: the user describes intent, the agent produces a visual artifact, and both the user and the agent can reason over that artifact together.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0a931862b202e7293c44b1e9fbcde199480a5187-1280x766.gif" style="width: 100%;" /><p>For business analysts and operations teams, this collapses the gap between &quot;I have a question about the data&quot; and &quot;I have a dashboard I can share with my team,&quot; reducing hours of manual work to a few minutes of conversation.</p><p>Lastly, using enterprise data for context creates governance problems you didn&#x27;t ask for. <strong>Connectors</strong> close the loop for data that lives outside Elastic. We added prebuilt OAuth-based connectors for sources like Google Drive, Salesforce, and Slack. The design principle here is worth calling out: data stays at the source. The agent searches data via the connector with the user&#x27;s own permissions enforced. Agents do not accumulate copies of enterprise data in new locations just to be able to answer questions about it.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/67d8d0c028611faa04d84ce5dbab8078fbf30fa3-1280x767.gif" style="width: 100%;" /><p>This matters more than it might seem. Enterprise data governance isn&#x27;t just a compliance checkbox; it&#x27;s load-bearing infrastructure that most teams don&#x27;t notice until it fails. When agents start routing around it, accumulating copies in vector stores and context windows, you&#x27;ve quietly created a new class of data sprawl that your security team didn&#x27;t sign off on, and your audit logs don&#x27;t capture. The connector approach eliminates this risk by constraint: if the data never moves, it can&#x27;t end up somewhere it shouldn&#x27;t. The user&#x27;s permissions travel with every query because the query goes to the source, not to a cached copy. You get agents that are genuinely useful on enterprise data.</p><h2>Ensure agents don&#x27;t outgrow the context window</h2><p>Giving agents too much context creates a new problem. 
A security analyst investigating a complex threat might pull in dozens of alerts, correlate across multiple indices, and go back and forth with the agent for twenty or thirty turns. At some point, you&#x27;re pushing past what the context window can hold and degrading the quality of the model&#x27;s responses. Each retrieval call also adds latency to the user&#x27;s request and pushes infrastructure costs higher, and a single user interaction can trigger dozens of these calls.</p><p><strong>We built a context store for retrieval results</strong>. As the agent retrieves data from indexes, the results can grow large and crowd the context window. We introduced a temporary store that holds the results of a query in an in-memory &quot;file store&quot; and only pulls the results into the active context when needed. This allows conversations to extend and deal with multiple related data sets without blowing out the context. We are also optimizing the retrieval results themselves, applying top snippets retrieval, which demonstrated a 27 to 34% reduction in token usage.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1ca179ece56a059ebc05b4fc44b48d25c6bc2679-1292x1278.png" style="width: 100%;" /><p>We also added intelligent context compaction for longer interactions: as a conversation progresses, the agent manages what stays in the active context and what gets compressed into a summary that can be retrieved if needed. This isn&#x27;t a simple truncation; it&#x27;s selective compaction that preserves the information most likely to matter for the next turn.</p><p>This enables agents to handle larger result sets, more complex queries, and longer conversations without the token cost scaling linearly with every turn. With context compaction, the context window remains bounded even for chats with 30 or more turns, rather than quickly ballooning to max size.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0541cdd4b21eaf45b92fc90e747f49473aefe65d-983x502.png" style="width: 100%;" /><p>For teams running multi-step investigations or summarizations, this is the difference between an agent that stays coherent through turn thirty and one that starts contradicting itself at turn twelve.</p><p><strong>Monitoring: </strong>In 9.4, we also shipped monitoring for agents to track token usage, with an API available to monitor conversation turns and tool calls. This matters because agents aren&#x27;t static. Their behavior shifts based on the context they receive and the tools they call, and without visibility into those patterns, optimizing cost and performance is guesswork.</p><h2>Agentic consumption model</h2><p>To support these new capabilities, we&#x27;re introducing an agent pricing model that directly aligns cost with the value users gain from their agents as they scale. Agent Builder usage will be measured by Executions. Executions are free for the first 1,000 each month in Elasticsearch and 10,000 in Elastic Security and Observability projects.</p><p>An Agent Builder execution represents a completed round of interaction with the agent. In most cases, sending a chat message and receiving a successful response from the agent counts as one execution. For messages that demand significant processing, it will be calculated as multiple executions based on the total number of input tokens required, grouped into 50,000 input token units. For example, a deep investigative task that requires 130,000 input tokens will be billed as 3 executions. 
This model ensures your consumption aligns with the value your agents deliver and becomes more cost-effective as your agents achieve greater context efficiency.</p><h2>Where are we going with agents</h2><p>Agents that can optimize context over operational data need the same kind of careful context engineering that we&#x27;ve spent years applying to search relevance. Getting the right information in front of the model at the right time and at the right level of detail is the new retrieval problem. These capabilities are foundational to making agents more reliable, scalable, and cost-efficient as they grow.</p><p>Get started with an <a href="https://cloud.elastic.co/registration?onboarding_token=search&amp;pg=en-enterprise-search-page">Elastic Cloud Trial</a>, and check out the documentation <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">here</a>. For existing customers, Agent Builder is available in Cloud Serverless and on the Enterprise Tier in Elastic Cloud Hosted and self-managed.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch percolator for ecommerce search governance: translating ambiguous queries into controlled retrieval strategies]]></title>
            <description><![CDATA[Learn how to use the Elasticsearch percolator to implement search governance. In this blog, we outline the patterns needed to create a governed policy engine in production and create a controlled retrieval strategy.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e485cdd52d78419ff0ac30a4192b953f6d70c61-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>This post is a technical deep dive into the Elasticsearch implementation of the control plane architecture described in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>, showing how to build it using the Elasticsearch percolator. It outlines the patterns used to implement a deterministic, governed policy engine in production.</p><h2><strong>From architecture to implementation</strong></h2><p><a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> described the control plane architecture: reverse matching as a lookup primitive, policy documents that separate match from action, and cascading transformations that compose multiple policies into a single execution plan. This post goes hands-on with the Elasticsearch feature that powers the policy lookup: the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-percolate-query">percolator query</a>.</p><p>The percolator is a natural fit for governance because it inverts the direction of search in exactly the way a control plane needs. This post walks through the implementation step by step, starting with a clear explanation of what the percolator does and why it matters, and then moving through index design, policy storage, query-time evaluation, and multi-policy composition.</p><h2><strong>How normal search works</strong></h2><p>In an ecommerce system, you may have hundreds of thousands or millions of product documents containing fields such as <code>title</code>, <code>category</code>, and <code>price</code>. When a user searches for matching documents, you&#x27;re asking Elasticsearch to compare the user’s search string against one or more stored fields in these product documents. Elasticsearch&#x27;s default analyzer, <a href="https://www.elastic.co/docs/reference/text-analysis/analysis-standard-analyzer">the standard analyzer</a>, lowercases text and splits it into tokens. A search for “oranges” matches “Oranges” because of lowercasing. With a language-aware analyzer that includes stemming, it also matches “orange” because both forms reduce to the same stem. For example, the following <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query">match query</a> returns documents that have “orange” or “oranges” in their <code>“title”</code> field.</p><pre><code>POST products/_search
{
  "query": {
    "match": {
      "title": "oranges"
    }
  }
}</code></pre><p>So for the above query, Elasticsearch returns the product documents whose <code>title</code> field matches &quot;oranges&quot;, which could include results such as &quot;Orange Fruit Spread&quot;, &quot;Orange Juice&quot;, &quot;Juicy oranges&quot;, &quot;Orange Marmalade&quot;, and so on. The key point to remember is that Elasticsearch is commonly used to compare a search string against documents and to return the documents that match the search string.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba758f25616f2106d245ce0d47926c174766e028-642x318.png" style="width: 100%;" /><h2><strong>The governance problem: Finding relevant policies before searching for products</strong></h2><p>As established in <a href="https://www.elastic.co/search-labs/blog/series/governed-search-patterns">Parts 1 through 3</a>, a governed search system does not send the user&#x27;s search string directly to the product catalog. First, it checks whether any policies apply to that search string.</p><p>A merchandiser has decided that when someone searches for exactly &quot;oranges&quot;, results should be restricted to the Fruits category, eliminating orange juice, orange marmalade, and orange soda. That business decision is stored as a policy. When a user types &quot;oranges&quot;, the control plane needs to find that policy, read its instructions, and modify the search against the product catalog accordingly. In order to do this, the control plane needs to figure out which stored policies are relevant for this search string.</p><p>An enterprise deployment might have hundreds or thousands of such policies. Checking them one by one with if/else logic is the application-layer anti-pattern described in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">Part 2</a>. What we need is a way to store all of those policies in an index and instantly find the ones that match a given search string. This is where the percolator comes in.</p><h2><strong>Flipping the direction: The percolator</strong></h2><p>We previously mentioned that in a normal search, Elasticsearch is commonly used to compare a search string against documents and to return the documents that contain that search string.</p><p>The percolator inverts this. With a percolator, you have an index where each document stores a query pattern, and then an incoming search string is checked against these stored queries to determine which of the stored query patterns match.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1e6348531d1c0be57b385f51d248488cf58489ff-642x279.png" style="width: 100%;" /><p>For governance, the &quot;stored query patterns&quot; are policies. Each policy contains a pattern that describes the kind of search string it should match. For example, does the search string exactly match &quot;oranges&quot;, or does the search string contain &quot;olive oil&quot;? The incoming string is the user&#x27;s search text, which arrives at query time and needs to be checked against all stored policy patterns. This is covered in a <a href="https://youtu.be/Ap5K2Y00Xjc?t=246">related PRISM video at 4:09</a>.</p><h2>Step by step: How a search for &quot;oranges&quot; finds its policy</h2><h3>The policy</h3><p>A merchandiser has authored a policy that matches if a user searches for exactly &quot;oranges&quot; without any other words. 
Once the percolator matches, the remainder of the document includes the rules that the control plane will use to build the Product query; in this example, one of the rules is to restrict (filter) results to the Fruits category.</p><pre><code>{
  "percolator": {
    "match_phrase": { "query": "START oranges END" }
  },
  "rule_type": "filter",
  "rule_args": {
    "filters": [
      {
        "field": "categories",
        "values": ["Fruits"],
        "mode": "hard_filter",
        "on_conflict": "soft_boost",
        "on_conflict_boost_weight": 1.0
      }
    ]
  },
  "priority": 0,
  "enabled": true
}</code></pre><p>The <code>percolator</code> field contains the pattern that defines when this policy should fire. In this case, it matches the phrase <code>&quot;START oranges END&quot;</code>. The <code>rule_type</code> and <code>rule_args</code> fields define what the policy should do when it fires. The <code>START</code> and <code>END</code> tokens are boundary markers, which we will explain shortly.</p><p>You can see how a policy is authored in the PRISM Studio UI at <a href="https://youtu.be/Ap5K2Y00Xjc?t=172">2:52 of the related PRISM video</a>.</p><h3>The user searches</h3><p>A shopper types &quot;oranges&quot; into the search bar.</p><h3>The control plane checks for matching policies</h3><p>Before searching the product catalog, the control plane intercepts the user search string, wraps it in boundary markers, and sends it to the percolator:</p><pre><code>POST policies/_search
{
  "query": {
    "percolate": {
      "field": "percolator",
      "document": {
        "query": "START oranges END"
      }
    }
  }
}</code></pre><p>The string <code>&quot;START oranges END&quot;</code> is checked against all stored policy patterns. Internally, Elasticsearch runs the stored policy patterns against this string and returns the ones that match. That&#x27;s the percolator. No if/else chains. No sequential evaluation. The index handles the matching.</p><h3>The control plane applies the policy</h3><p>The control plane reads the matched policies&#x27; actions. The above policy instructs the control plane to restrict results to the Fruits category. The control plane builds the final Elasticsearch query against the product catalog as follows:</p><pre><code>POST products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "oranges" } }
      ],
      "filter": [
        { "terms": { "categories": ["Fruits"] } }
      ]
    }
  }
}</code></pre><p>The user searched for &quot;oranges&quot;. The product catalog receives a query for &quot;oranges&quot; constrained to the Fruits category. Because of this constraint, orange juice, orange marmalade, and orange soda are excluded.</p><h3>Why &quot;orange marmalade&quot; does NOT trigger the oranges policy</h3><p>Suppose a different user searches for &quot;orange marmalade&quot;. The control plane wraps the string and percolates: <code>&quot;START orange marmalade END&quot;</code>. The oranges policy&#x27;s pattern is <code>match_phrase: &quot;START oranges END&quot;</code>. The oranges policy does not match and therefore the policy isn&#x27;t applied, and the results aren&#x27;t constrained to the Fruits category.</p><p>This is the purpose of the <code>START</code> and <code>END</code> boundary markers. Without them, a policy that matches on the word &quot;oranges&quot; could accidentally fire on a query like &quot;orange marmalade&quot;. By wrapping the user&#x27;s search string with <code>START</code> and <code>END</code> and including those markers in the policy&#x27;s pattern, we ensure that the policy only fires when &quot;oranges&quot; is the complete search string, without any other words. This matches both the shopper&#x27;s and the merchandiser&#x27;s intent.</p><h2>A second policy: &quot;olive oil&quot; on the stemmed field</h2><p>Not every policy needs an exact string match. The &quot;olive oil&quot; policy matches on a stemmed field, so it fires regardless of minor word-form variations:</p><pre><code>{
  "percolator": {
    "bool": {
      "should": [
        { "match_phrase": { "query.stemmed": "START olive oil END" } }
      ]
    }
  },
  "rule_type": "filter",
  "rule_args": {
    "filters": [
      {
        "field": "categories",
        "values": ["Olive oils"],
        "mode": "hard_filter",
        "on_conflict": "soft_boost",
        "on_conflict_boost_weight": 1.0
      }
    ]
  },
  "priority": 300,
  "enabled": true
}</code></pre><p>This policy&#x27;s pattern matches against <code>query.stemmed</code> instead of <code>query</code>. When the user&#x27;s search string arrives, it’s stored in both a <code>query</code> field (the exact text) and a <code>query.stemmed</code> field (analyzed with a stemming analyzer that reduces words to their stems, so &quot;olives&quot; and &quot;olive&quot; both reduce to the same stem, as do &quot;oils&quot; and &quot;oil&quot;). The policy&#x27;s pattern is checked against the stemmed version of the string, so it fires regardless of minor word-form variations.</p><p>The <code>START</code> and <code>END</code> boundary markers work on the stemmed field, as well, ensuring this policy only fires when &quot;olive oil&quot; is the entire search string, not when it appears as part of something longer.</p><p>The rest of this post covers the implementation details that make this production-ready: the index mapping that supports both matching modes, how highlights drive phrase removal and consumed phrase tracking, and how multiple conflicting policies compose into a single execution plan.</p><h2><strong>The policy index mapping</strong></h2><p>The policy index needs a percolator field to hold stored query patterns and a text field that mirrors the structure of the incoming search string the percolator will match against. The mapping below is simplified for clarity. A production deployment is more complex, using custom analyzers to handle boundary markers, variable pattern matching (for example, recognizing that &quot;under $4&quot; contains a currency value), and other kinds of analysis.</p><pre><code>PUT policies
{
  "settings": {
    "analysis": {
      "analyzer": {
        // A minimal "stemming" analyzer so this simplified example runs as-is;
        // production deployments use richer custom analyzers (see above).
        "stemming": {
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "percolator": {
        "type": "percolator"
      },
      "query": {
        "type": "text",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "stemming"
          }
        }
      },
      "rule_type": { "type": "keyword" },
      "rule_args": { "type": "object", "enabled": false },
      "priority": { "type": "integer" },
      "enabled": { "type": "boolean" }
    }
  }
}</code></pre><p>The index is named <code>policies</code> because each document represents a complete governed policy as defined in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">Part 2</a>. This includes match criteria, action, priority, and metadata. The <code>rule_type</code> and <code>rule_args</code> fields contain the action component of the policy, which contains the instructions that the control plane will use to compose the query for execution against the product catalog.</p><p>The <code>query</code> field is the string that the percolator matches against. It has two variants: an exact version and a stemmed version. When the user&#x27;s search string arrives, it&#x27;s placed into this field in the temporary in-memory index. Policies that match on <code>query</code> see the exact string; policies that match on <code>query.stemmed</code> see the stemmed version.</p><h2><strong>Percolating with highlights, filtering, and sorting</strong></h2><p>The simple examples above showed minimal percolation requests. In practice, the control plane adds highlighting, filters disabled policies, and sorts by priority:</p><pre><code>POST policies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "percolate": {
            "field": "percolator",
            "document": {
              "query": "START olive oil END"
            }
          }
        },
        {
          "term": { "enabled": true }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "query": {
        "matched_fields": ["query.stemmed"]
      }
    }
  },
  "sort": [
    { "priority": { "order": "desc" } }
  ]
}</code></pre><p>The highlight configuration uses <code>&quot;query&quot;</code> as the field key with <code>&quot;query.stemmed&quot;</code> in <code>matched_fields</code>. This tells Elasticsearch&#x27;s unified <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/highlighting">highlighter</a> to return highlights on the parent <code>query</code> field but to also consider matches from the <code>query.stemmed</code> subfield when determining which tokens to highlight. This is what allows a policy that matches on the stemmed field to still produce accurate highlight spans on the original text, which the control plane needs for phrase removal and consumed phrase tracking.</p><p>The <code>enabled: true</code> filter ensures that disabled policies are skipped. The <code>sort</code> on priority ensures that higher-priority policies are returned first, so the control plane can process them in the correct order for cascading transformations. The <code>highlight</code> field is the most important addition; it tells us exactly which words in the user&#x27;s search string triggered each match.</p><p>The response for an &quot;olive oil&quot; search may look as follows:</p><pre><code>{
  "hits": {
    "hits": [
      {
        "_id": "en_2c3021c8",
        "_source": {
          "rule_type": "filter",
          "rule_args": {
            "filters": [
              {
                "field": "categories",
                "values": ["Olive oils"],
                "mode": "hard_filter",
                "on_conflict": "soft_boost",
                "on_conflict_boost_weight": 1.0
              }
            ]
          },
          "priority": 300
        },
        "highlight": {
          "query": ["<em>START olive oil END</em>"]
        }
      }
    ]
  }
}</code></pre><h2><strong>Why highlights matter</strong></h2><p>Notice the highlight in the response: <code>&quot;&lt;em&gt;START olive oil END&lt;/em&gt;&quot;</code>. Elasticsearch is telling us exactly which words in the user&#x27;s search string caused the policy to match. This isn’t cosmetic. The highlight metadata drives two critical downstream behaviors:</p><p><strong>Phrase removal.</strong> Some policies need to remove the matched text from the search string before constructing the product catalog query. For example, a policy that matches on &quot;cheap&quot; removes that word and converts it into a price filter instead. The highlight identifies exactly which span of the search string the policy matched, so the system knows what to remove.</p><p><strong>Consumed phrase tracking.</strong> As described in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>, when multiple policies match the same search string, a higher-priority policy might remove words that a lower-priority policy also matched on. By comparing each policy&#x27;s highlight against the current (evolving) search string, the system can detect that a phrase has been consumed and skip the lower-priority policy. This prevents double-processing and ensures deterministic behavior.</p><p>You can learn more about how highlighting works in <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/how-es-highlighters-work-internally">this article</a>.</p><h2><strong>From percolation to execution plan</strong></h2><p>The percolator returns a set of matching policies. But as <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> described, the lookup is only half the story. The other half is composing those matches into a coherent execution plan. Here’s what that looks like for a concrete query.</p><h3><strong>Worked example: &quot;Cheap chocolate&quot; during a Christmas campaign</strong></h3><p>Suppose the system has two active policies: the &quot;Cheap chocolate&quot; policy (priority 210) and the &quot;Christmas chocolates&quot; policy (priority 300), both described in detail in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>.</p><p><strong>Step 1: Percolate.</strong> The user searches for &quot;cheap chocolate.&quot; The control plane wraps the search string as <code>&quot;START cheap chocolate END&quot;</code> and sends it to the percolator. Two policies match: The &quot;Cheap chocolate&quot; policy&#x27;s pattern matches on the phrase &quot;cheap chocolate&quot;; and the &quot;Christmas chocolates&quot; policy&#x27;s pattern matches on &quot;chocolate&quot; via the stemmed field.</p><p><strong>Step 2: Sort by priority.</strong> The percolator returns both policies, sorted by priority in descending order. 
<p><strong>Step 3: Apply the cascading transformation.</strong> This is the <code>initial state → [Policy A] → state&#x27; → [Policy B] → state&#x27;&#x27; → execution plan</code> model from <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>.</p><p>The “Christmas chocolates” policy (priority 300) applies first:</p><ul><li>Adds a category hard filter: &quot;Christmas foods and drinks,&quot; &quot;Christmas sweets&quot;.</li><li>Adds a price filter: less than $7.</li><li>Adds a category soft boost: &quot;Advent calendars&quot; (3x).</li></ul><p>The “Cheap chocolate” policy (priority 210) applies next against the modified state:</p><ul><li>Attempts to add a category hard filter: &quot;Chocolates,&quot; &quot;Milk chocolates&quot;; but the Christmas policy already set this field with <code>on_conflict: override</code>, so the Cheap chocolate categories are dropped.</li><li>Attempts to add a price filter of $2. The Christmas policy set <code>on_conflict: restrict</code> for price, and $2 is more restrictive than $7, so $2 wins.</li><li>Removes &quot;cheap&quot; from the search string.</li></ul><p><strong>Step 4: Build the Elasticsearch query.</strong> The control plane assembles the execution plan into a single Elasticsearch query against the product catalog:</p><pre><code>POST products/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "chocolate" } }
          ],
          "filter": [
            { "terms": { "categories": ["Christmas foods and drinks", "Christmas sweets"] } },
            { "range": { "price": { "lt": 2 } } }
          ]
        }
      },
      "functions": [
        {
          "weight": 1
        },
        {
          "filter": { "terms": { "categories": ["Advent calendars"] } },
          "weight": 3
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}</code></pre><p>The original search string was &quot;cheap chocolate&quot;. The query that reaches the product catalog is a governed, intent-aware retrieval plan: The word &quot;cheap&quot; has been consumed and converted into a price constraint, results are restricted to Christmas seasonal categories, Advent calendar products receive a ranking boost, and the price ceiling reflects the more restrictive value from the lower-priority policy. Every transformation is deterministic, traceable, and explainable.</p><p>For a quick overview of how these multipliers interact with the base BM25 score, see <a href="https://youtu.be/Ap5K2Y00Xjc?t=525">8:45 in the related PRISM video</a>, where we briefly discuss multiplicative boosts.</p><h2><strong>Why this scales</strong></h2><p>The percolator is efficient for this use case because of the asymmetry: An enterprise ecommerce system might have millions of products but only hundreds or thousands of governance policies. The percolator is checking one incoming search string against that set of stored policy patterns, not scanning the full product catalog. The cost is proportional to the number of policies, and Elasticsearch applies internal optimizations (indexing terms from stored query patterns, short-circuiting Boolean logic) to keep matching fast.</p><p>Adding a new policy is just indexing a new document. Disabling one is a field update. No code changes, no deploys, no restarts.</p><h2><strong>From lookup to governed retrieval</strong></h2><p>The percolator provides the fast reverse-matching primitive that makes the control plane architecture from <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> practical at scale. Policies are data: stored, indexed, and efficiently matched against incoming search strings. The control plane composes matching policies into a governed execution plan through the cascading transformation and per-field conflict resolution described in Part 3. And the retrieval engine executes the governed execution plan against the product catalog.</p><p>The result is a system where a merchandiser can author a new policy without touching application code, test it against representative queries, promote it to production, and immediately see the effect. The percolator makes the policy lookup fast; the control plane makes the policy composition deterministic; and the governed workflow makes the whole process safe.</p><h2><strong>What&#x27;s next in this series</strong></h2><p>The next post in this series extends the governed control plane into new territory. It introduces a <strong>multi-tier search architecture</strong>, explaining how to orchestrate strict, relaxed, and semantic retrieval while maintaining stable pagination and facets.</p><h2><strong>Put governed ecommerce search into practice</strong></h2><p>The percolator-based control plane described in this post, from index mappings and boundary markers to highlight-driven phrase tracking and cascading policy composition, was built by Elastic Services Engineering as part of our repeatable ecommerce search accelerators. Every query example and policy structure shown here comes from a working system validated against enterprise-scale product catalogs.</p><p>If you want to implement a governed, policy-driven control plane on Elasticsearch, Elastic Services can get you there faster.
Contact <a href="https://www.elastic.co/consulting">Elastic Professional Services</a>.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to measure and improve Elasticsearch search recall: from 0.43 to 0.75 with hybrid search]]></title>
            <description><![CDATA[Learn how to measure and improve search recall in Elasticsearch by combining BM25 lexical search with Jina AI vector embeddings, using the rank_eval API to validate the improvement with real numbers.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-relevance-tuning-improve-recall</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-relevance-tuning-improve-recall</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/492c9b5432a2b9e40cebb3b60f0df019a8c7bf6d-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/solutions/search/full-text">Lexical search</a> using the <a href="https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch">BM25 ranking algorithm</a> is cheap, fast, and very effective for a wide range of queries. But it has a blind spot: queries that don&#x27;t share tokens with your documents. In this article, you’ll measure exactly where BM25 falls short. We&#x27;ll use Elasticsearch&#x27;s <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval">ranking evaluation API</a> (<code>rank_eval</code>) and close that gap by adding <a href="https://www.elastic.co/search-labs/es/blog/jina-embeddings-v3-elastic-inference-service">Jina AI embeddings</a> through <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service</a> (EIS). You’ll see the recall score go from <code>0.43</code> to <code>0.75</code> and understand why.</p><h2>What is recall?</h2><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-recall">Recall</a> measures, on a scale from <code>0</code> to <code>1</code>, how many of the documents that your users actually want appear somewhere in your search results. If a query should surface three products and your search returns only two of them in the top 10, <code>recall@10 = 0.67</code> for that query. It’s a set-based metric: It doesn’t care about the position of the relevant documents within those <em>k</em> results. A relevant document in position 10 counts the same as one in position 1. Having a high recall means that you’re not losing relevant results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b13af2a5d0ca055535d8bfe3dfe4b3d1093ee6da-1457x796.png" style="width: 100%;" /><p>The diagram shows two sets: all relevant documents (left) and what BM25 actually retrieved (top 10, right). Only the intersection counts toward recall: <code>prod_1</code> and <code>prod_2</code> were found, while <code>prod_3</code>, <code>prod_4</code>, and <code>prod_6</code> were missed entirely. Result: <code>Recall@10 = 2/5 = <strong>0.40</strong></code>.</p><h2>Prerequisites</h2><p>Let&#x27;s get down to business to better understand how recall works. This demonstration uses Python. You can follow along in the companion notebook (<a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/relevance-tuning-improving-recall-adding-vectors/notebook.ipynb">notebook.ipynb</a>), where every code block is a cell ready to run.</p><p>The code provided uses the following:</p><ul><li>Elasticsearch 9.3+</li><li>Python 3.10+</li></ul><pre><code>pip install elasticsearch pandas plotly python-dotenv</code></pre><ul><li>A <code>.env</code> file with your Elasticsearch credentials</li></ul><pre><code>ELASTICSEARCH_URL=https://your-cluster-url
ELASTICSEARCH_API_KEY=your-api-key</code></pre><h2>The dataset</h2><p>We’ll use a product catalog of 1,000 products spanning categories such as footwear, electronics, and tools.</p><p>Each document has four fields:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Field</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title`</td><td style="border: 1px solid #ccc; padding: 4px;">text</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`description`</td><td style="border: 1px solid #ccc; padding: 4px;">text</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`brand`</td><td style="border: 1px solid #ccc; padding: 4px;">keyword</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`category`</td><td style="border: 1px solid #ccc; padding: 4px;">keyword</td>
                    </tr>
              </tbody>
            </table>
          <p>The dataset is loaded from <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/relevance-tuning-improving-recall-adding-vectors/dataset.csv"><code>dataset.csv</code></a>.</p><h2>The power and limits of lexical search</h2><p>BM25 is the default ranking algorithm in Elasticsearch and most search engines. It scores documents by how often your query terms appear in them, adjusted for document length and the frequency of those terms across the entire index. You get <a href="https://www.elastic.co/docs/reference/text-analysis/analyzer-reference">analyzers</a> on top: lowercase normalization, stemming, and stopword removal. A query for &quot;running shoes&quot; will match &quot;Running Shoes&quot; and likely &quot;run&quot; as well.</p><p>This works well for a large class of queries:</p><ul><li>&quot;running shoes&quot; immediately matches products with those exact tokens in the title.</li><li>&quot;bluetooth speaker&quot; surfaces portable audio products because the tokens appear verbatim.</li></ul><p>The results are deterministic and explainable: A document ranks highly because the query terms appear in it. Debugging relevance is straightforward.</p><h3>Where it breaks</h3><p>Now let’s try these queries against the same catalog:</p><ul><li><strong>&quot;skincare routine&quot;:</strong> The word &quot;routine&quot; doesn’t appear in any product title. BM25 can partially match on &quot;skincare,&quot; but face serums, body oils, and moisturizers are described using terms like &quot;vitamin C,&quot; &quot;retinol,&quot; or &quot;brightening,&quot; none of which overlap with the query. Products that form a complete skincare routine are scattered across the index with no shared token to anchor them.</li></ul><pre><code>ID: B06XX6DS3P, Score: 9.0552, Title: Replenix Retinol Smooth + Tighten Body Lotion - Collagen-Boosting, Regenerating Anti-Aging Body Cream, Reduces Appearance of Stretch Marks, 6.7 oz.

ID: B08XMPKJ1L, Score: 5.2699, Title: Bio-Oil Skincare Body Oil (Natural) Serum for Scars and Stretchmarks, Face and Body Moisturizer Hydrates Skin, with Organic Jojoba Oil and Vitamin E, For All Skin Types, 6.7 oz

ID: B01CY764KQ, Score: 5.0057, Title: Nike Up Or Down Men Deodorant - Pack of 2 | Long-Lasting Fragrance, Body Spray Combo for Men | Deodorant for Active Living | Nike Men's Deo Set | Ultimate Odor Protection | Grooming Essentials | Signature Nike Scent | High-Performance Men's Deodorant</code></pre><ul><li><strong>&quot;pet travel accessories&quot;:</strong> This is a use-case grouping, not a product category. A dog sling carrier, a pet car seat, and a travel crate are all relevant, but their descriptions talk about portability, safety, and comfort rather than &quot;travel accessories.&quot; BM25 matches &quot;pet&quot; broadly but has no signal to distinguish travel-specific products from the rest of the pet catalog.</li></ul><pre><code>ID: B0BVV7BKTW, Score: 7.4371, Title: Large Foldable Travel Duffel Bag with Shoes Compartment

ID: B07TNPHYNV, Score: 6.6455, Title: 40 Pieces Christmas Bronze Jingle Bells Craft Small Bells

ID: B08R8FRW53, Score: 6.6335, Title: CUBY Dog and Cat Sling Carrier

ID: B08QMCQYGM, Score: 6.5259, Title: YTFGGY Whiteboard Pinstripe Tape 6 Rolls 1/8"

ID: B0CP3LQSWM, Score: 6.2994, Title: Portable Dog Water Bottle 32 Oz</code></pre><p>This is a <strong>recall problem</strong>. The relevant documents exist in your index. BM25 just cannot find them because the user&#x27;s words and the document&#x27;s words do not match closely enough.</p><p>Adding synonyms helps for known cases. But you cannot enumerate every way a user might express an intent. That is where vectors come in.</p><h2>Why you should measure recall</h2><p>Before fixing a problem, you need to quantify it.</p><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-recall"><strong>Recall@k</strong></a> measures how many of the documents that your users actually want appear somewhere in the top k results. Formally:</p><pre><code>Recall@k = (relevant documents found in top k) / (total relevant documents)</code></pre><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-precision"><strong>Precision@k</strong></a> measures how many of the top k results are actually relevant:</p><pre><code>Precision@k = (relevant documents in top k) / k</code></pre><p>High precision means that the results you do return are good. In ecommerce, missing a relevant product (low recall) is often worse than showing a slightly imperfect result (lower precision), because a hidden product is a lost sale.</p><p>Elasticsearch&#x27;s <code>rank_eval</code> API lets you measure both systematically. You provide a list of queries, each with a set of rated documents, and Elasticsearch computes the metrics for you across all queries.</p><h2>Setting up the evaluation</h2><p>The <code>rank_eval</code> API needs a <strong>ratings dataset</strong>: a mapping of queries to the documents that are relevant for each one, along with a relevance grade (0 = not relevant, 1 = relevant, 2 = highly relevant).</p><p>In the notebook, this is the <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr#learning-to-rank-judgement-list">judgments list</a>:</p><pre><code>judgments = [
    # Query 1: "running shoes" BM25 handles well (tokens appear in product titles) 
    {"query_id": "q1", "doc_id": "B09NQJFRW6", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B08JMD4LMM", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B08VRJ6F2Q", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B07S8NRRWR", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B01HD620I8", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B07DX86321", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B0968YVLQ8", "grade": 1, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B093QJ39ZS", "grade": 1, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B096FGSC39", "grade": 1, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B01GVQWVV2", "grade": 1, "query": "running shoes"},

    # Query 2: "skincare routine" intent-based, "routine" never appears in product titles
    {"query_id": "q2", "doc_id": "B08XMPKJ1L", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B0BN3WQB92", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B0BT7B7P5T", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B00NPA2WEY", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B06XX6DS3P", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B07PDRD1KT", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B074J7869B", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B08JV31QW4", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B00K3TVJMQ", "grade": 1, "query": "skincare routine"},

    # Query 3: "study desk setup" intent-based, products are desks/stands/organizers
    {"query_id": "q3", "doc_id": "B08CS35J2T", "grade": 2, "query": "study desk setup"},
    {"query_id": "q3", "doc_id": "B09B3LFDXJ", "grade": 2, "query": "study desk setup"},
    {"query_id": "q3", "doc_id": "B07W58LMND", "grade": 1, "query": "study desk setup"},
    {"query_id": "q3", "doc_id": "B0CHYDX91L", "grade": 1, "query": "study desk setup"},

    # Query 4: "pet travel accessories" use-case grouping, products are carriers/crates/seats
    {"query_id": "q4", "doc_id": "B08R8FRW53", "grade": 2, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B01MYUYX33", "grade": 2, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B003C5RKE4", "grade": 2, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B09GF8GBF6", "grade": 1, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B0CP3LQSWM", "grade": 1, "query": "pet travel accessories"},
]</code></pre><p>The mix is intentional: <code>q1</code> is a query that BM25 handles well (exact tokens in product titles), while <code>q2</code>, <code>q3</code>, and <code>q4</code> are intent-based queries, where what the user wants is expressed as a concept rather than as specific product keywords.</p><h2>Measuring BM25 baseline recall</h2><p>First, set up the Elasticsearch client. (The index itself is created and populated in the vector search section below; both evaluations run against that same index.)</p><pre><code>import os
import json
import pandas as pd
import plotly.graph_objects as go
from elasticsearch import Elasticsearch, helpers
from dotenv import load_dotenv

load_dotenv()

es = Elasticsearch(
    os.getenv("ELASTICSEARCH_URL"),
    api_key=os.getenv("ELASTICSEARCH_API_KEY")
)

INDEX_NAME = "ecommerce-products"</code></pre><p>Now build the <code>rank_eval</code> request for BM25. Each request in the list combines a query with its ratings:</p><pre><code>judgments_df = pd.DataFrame(judgments)

bm25_requests = []
for query_id, query_text in (
    judgments_df[["query_id", "query"]].drop_duplicates().values
):
    relevant_docs = judgments_df[judgments_df["query_id"] == query_id]
    ratings = [
        {"_index": INDEX_NAME, "_id": row["doc_id"], "rating": row["grade"]}
        for _, row in relevant_docs.iterrows()
    ]

    bm25_requests.append({
        "id": query_id,
        "request": {
            "query": {
                "multi_match": {
                    "query": query_text,
                    "fields": ["title", "description"]
                }
            }
        },
        "ratings": ratings,
    })

bm25_eval = {
    "requests": bm25_requests,
    "metric": {"recall": {"k": 10, "relevant_rating_threshold": 1}},
}

bm25_result = es.rank_eval(index=INDEX_NAME, body=bm25_eval)
print("BM25 Recall@10:", bm25_result.body["metric_score"])</code></pre><p>Result:</p><pre><code>BM25 Recall@10: 0.43</code></pre><p><code>0.43</code> means that across all four queries, BM25 finds only 43% of the documents it should find. The shortfall is concentrated in the intent-based queries: &quot;skincare routine&quot; misses face serums and body oils because &quot;routine&quot; never appears in product titles, and &quot;pet travel accessories&quot; retrieves off-topic pet products while missing carriers and crates described in terms of portability and safety rather than &quot;travel accessories.&quot;</p><p>This is our baseline. Now we have a number to beat.</p><h2>Adding vector search with Jina embeddings</h2><p><a href="https://www.elastic.co/docs/solutions/search/vector"><code>Vector search</code></a> encodes documents and queries as high-dimensional vectors, a type of vector made up of hundreds or thousands of numerical values, each encoding a specific feature of the data it represents. Documents with similar meaning end up close together in vector space, even if they share no words. &quot;Gym equipment&quot; and &quot;dumbbell set&quot; will be nearby because the concepts are related. I chose Elasticsearch as my vector database because it supports hybrid search, giving me both semantic understanding and keyword precision out of the box.</p><p><a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">EIS</a> includes out-of-the-box support for embedding models through its <a href="https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference">inference API</a>.</p><h3>Step 1: Using Jina embeddings v5 as an inference endpoint</h3><pre><code>INFERENCE_ENDPOINT_ID = ".jina-embeddings-v5-text-small"</code></pre><p>If your cluster has GPU resources (available in Elastic Cloud and Elasticsearch 9.3+), the embeddings are generated on GPU, which is significantly faster than CPU inference and removes the performance trade-off that historically made vectors expensive at scale.</p><p>Why Jina embeddings specifically? <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">jina-embeddings-v5-text</a> is a multilingual model (119+ languages) with a 32,000-token context window and support for task-specific <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (LoRA) adapters</a>. It works well for short product descriptions out of the box. Read more about <code>jina-embeddings-v5-text</code> model <a href="https://huggingface.co/jinaai/jina-embeddings-v5-text-small">here</a>.</p><h3>Step 2: Create the index with a semantic field</h3><pre><code>index_mappings = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "description": {"type": "text", "copy_to": "semantic_field"},
            "brand": {"type": "keyword"},
            "category": {"type": "keyword"},
            "semantic_field": {
                "type": "semantic_text",
                "inference_id": INFERENCE_ENDPOINT_ID,
            },
        }
    }
}

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, body=index_mappings)
    print(f"Created index: {INDEX_NAME}")</code></pre><p>The <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-semantic-text"><code>semantic_text</code></a> field type is the key here. It’s a higher-level abstraction over <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a>: You point it at an inference endpoint, and Elasticsearch takes care of generating embeddings automatically.</p><p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to"><code>copy_to</code></a> property on <code>title</code> and <code>description</code> means content from both fields flows into <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_field</code></a> for embedding, so a single vector captures the full product representation.</p><h3>Step 3: Index the products</h3><pre><code>def bulk_index(products, index_name):
    actions = []
    for product in products:
        doc_id = product.get("_id")
        source = {k: v for k, v in product.items() if k != "_id"}
        action = {"_index": index_name, "_source": source}
        if doc_id:
            action["_id"] = doc_id
        actions.append(action)

    success, failed = helpers.bulk(es, actions, raise_on_error=False)
    if failed:
        for error in failed:
            print(f"Error: {error}")
    else:
        print(f"Successfully indexed {success} documents")

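# "products" is the list of product dicts loaded from dataset.csv (see the notebook).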
bulk_index(products, INDEX_NAME)</code></pre><p>At index time, Elasticsearch calls the inference endpoint for each document and stores the resulting embedding in <code>semantic_field</code>. No extra code on your side.</p><h2>Hybrid search: Combining BM25 and vectors with RRF</h2><p>Adding vectors improves recall, but using vectors alone risks losing precision on exact-match queries; &quot;running shoes&quot; should still rank verbatim matches first. Hybrid search retains the lexical component specifically to preserve that precision.</p><p>Hybrid search with <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">Reciprocal Rank Fusion</a> (RRF) keeps the best of both:</p><ul><li>BM25 handles exact and near-exact queries with high precision.</li><li>Semantic search handles intent-based and multilingual queries with high recall.</li><li>RRF combines the two ranked lists into a single ranking.</li></ul><p>The RRF formula assigns each document a score based on its rank in each result list:</p><pre><code>score = sum(1 / (rank_constant + rank))</code></pre><p>A document that ranks highly in both lists gets a higher combined score. The <code>rank_constant</code> controls how much weight lower-ranked documents receive.</p><pre><code>hybrid_requests = []
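# Sanity check of the RRF formula above: a document ranked 2nd by BM25 and
# 1st by the semantic retriever, with rank_constant = 5, would score
#   1 / (5 + 2) + 1 / (5 + 1) ≈ 0.31
# Below, each rank_eval request swaps the single query for an rrf retriever.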

for query_id, query_text in (
    judgments_df[["query_id", "query"]].drop_duplicates().values
):
    relevant_docs = judgments_df[judgments_df["query_id"] == query_id]
    ratings = [
        {"_index": INDEX_NAME, "_id": row["doc_id"], "rating": row["grade"]}
        for _, row in relevant_docs.iterrows()
    ]

    hybrid_requests.append({
        "id": query_id,
        "request": {
            "retriever": {
                "rrf": {
                    "retrievers": [
                        {
                            "standard": {
                                "query": {
                                    "multi_match": {
                                        "query": query_text,
                                        "fields": ["title", "description"],
                                    }
                                }
                            }
                        },
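                        # Semantic leg: vector retrieval via the semantic_text field.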
                        {
                            "standard": {
                                "query": {
                                    "match": {
                                        "semantic_field": {"query": query_text}
                                    }
                                }
                            }
                        },
                    ],
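                    # Fuse the top 50 from each list; a smaller rank_constant
                    # weights top-ranked documents more heavily.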
                    "rank_window_size": 50,
                    "rank_constant": 5,
                }
            }
        },
        "ratings": ratings,
    })

hybrid_eval = {
    "requests": hybrid_requests,
    "metric": {"recall": {"k": 10, "relevant_rating_threshold": 1}},
}

hybrid_result = es.rank_eval(index=INDEX_NAME, body=hybrid_eval)
print("Hybrid Recall@10:", hybrid_result.body["metric_score"])</code></pre><p>Result:</p><pre><code>Hybrid Recall@10: 0.75</code></pre><p>Hybrid improves substantially over BM25 (<code>0.43</code>) and preserves precision for exact-match queries like &quot;running shoes.&quot;</p><h2>Results: Before and after</h2><p>Here’s the full comparison across all three approaches:</p><pre><code>methods = {
    "BM25 (Lexical)": bm25_requests,
    "Hybrid (BM25 + Vectors)": hybrid_requests,
}

recall_metric = {"recall": {"k": 10, "relevant_rating_threshold": 1}}

comparison_data = []
for method_name, requests in methods.items():
    result = es.rank_eval(
        index=INDEX_NAME,
        body={"requests": requests, "metric": recall_metric}
    )
    comparison_data.append({
        "method": method_name,
        "recall@10": result.body["metric_score"]
    })

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))</code></pre><p>Result:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">BM25 (Lexical)</td><td style="border: 1px solid #ccc; padding: 4px;">0.43</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hybrid (BM25 + Vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">0.75</td>
                    </tr>
              </tbody>
            </table>
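          <p>The averaged <code>metric_score</code> values above hide per-query variation. Each <code>rank_eval</code> response also reports a score for every individual query in its <code>details</code> object; a minimal way to inspect them, reusing the <code>hybrid_result</code> from earlier:</p><pre><code># Per-query breakdown: "details" is keyed by request id, and each
# entry carries that query's own metric_score.
for query_id, detail in hybrid_result.body["details"].items():
    print(query_id, round(detail["metric_score"], 2))</code></pre>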
          <img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e49f6c10516b0a48a0ad75962c6590ee07311407-700x500.png" style="width: 100%;" /><p>Breaking it down by query:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/40e36dcb7b34cbf4649c512bcb60cef60f1778a6-700x500.png" style="width: 100%;" /><h2>Conclusion</h2><p>Throughout this post, we saw that BM25 lexical search is reliable when users type exact queries, but it loses recall when they search by intent rather than keywords. Using <code>rank_eval</code>, we established a reproducible baseline to measure that gap with real numbers. From there, we added a <code>semantic_text</code> field powered by Jina embeddings and ran the evaluation again. The result: Hybrid search improved recall from <code>0.43</code> to <code>0.75</code> while preserving precision on exact-match queries, though the actual margin will depend on your query mix.</p><p>The pattern scales beyond this example: Collect judgments from your users&#x27; actual queries, run <code>rank_eval</code> as a baseline, add <code>semantic_text</code>, and measure again. You&#x27;ll know exactly what improved and by how much.</p><h2>Next steps</h2><ul><li>Dive deeper into recall and vector search: <a href="https://www.elastic.co/search-labs/blog/recall-vector-search-quantization">Recall and vector search quantization</a> by Jeff Vestal</li><li>Add reranking for even better precision on the top results</li><li>Explore <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html">Elasticsearch hybrid search documentation</a></li><li>Read more about the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html"><code>rank_eval</code> API</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a control plane to govern ecommerce search]]></title>
            <description><![CDATA[How to build a governed control plane for ecommerce that composes conflicting search policies into a single execution plan (without code changes).]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e2cd5244ef8b9a05af6337a4825252f321a9a43-1377x768.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval">Part 1</a> and <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">part 2</a> of this series established why ecommerce search needs a <em>governance layer</em>, a decision layer between the user&#x27;s query and the retrieval engine that classifies intent, enforces constraints, and routes to the correct retrieval strategy (for example, BM25, semantic, hybrid). This post shows how to build that layer using a simple architectural primitive where query interpretation policies are stored as documents and retrieved at query time via fast reverse matching. Because new retrieval policies (for example, “boost brand X” or “only show category Y”) don’t require code changes, the result is a routing layer that stays stable while policies evolve and that keeps the retrieval engines safe in high-stakes environments. If you want to see the end result of this architecture before reading further, check out this video: <a href="https://www.youtube.com/watch?v=e1GuL9CYWAk">Fixing Search Relevance in Seconds: Introducing PRISM</a>.</p><h2>Why query interpretation is often a challenge</h2><p>Storing policies as code (if/else blocks in the application layer) produces tens of thousands of lines of brittle logic that lacks any indexing for efficient policy retrieval at query time. Iteration is slow (a single query behavior change may require a six-week deployment cycle), accountability is unclear (why did results change?), and business users cannot modify search behavior without engineering involvement. This is shown on the left side in the following image:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e2cd5244ef8b9a05af6337a4825252f321a9a43-1377x768.png" style="width: 100%;" /><p>Storing policies as data in an Elasticsearch index is shown on the right side of the above image. This approach solves all of the issues associated with hard-coded query resolution logic. However, for this to work, you need a way to quickly determine which policies match the user query and how conflicts should be resolved. This is where the governed control plane comes in.</p><h2>The control plane pattern</h2><p>A governed control plane sits between the raw user query and an Elasticsearch retrieval. 
It receives user text as its input, and its output is an execution plan that includes filters, boosts, and retrieval routing decisions.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5562da5de521f3c83ed55a13e9be87ca7fa70109-546x489.png" style="width: 100%;" /><p>A control plane pipeline consists of:</p><ol><li><strong>User query: </strong>A user enters a string of what they’re looking for, such as “oranges” or “gift for grandpa”.</li><li><strong>Policy lookup: </strong>Match the user query against the policy index.</li><li><strong>Return matching policies:</strong> Policies that match the user query are returned from the policy index.</li><li><strong>Policy application: </strong>The control plane analyzes these returned policies and composes matched policies into a single coherent execution plan that includes filters, boosts, overrides, and guardrails and that applies the appropriate retrieval method (for example, lexical versus semantic versus hybrid).</li><li><strong>Execute:</strong> The modified <em>intent-aware</em> Elasticsearch query is passed to the application to be executed against a product catalog index.</li><li><strong>Explain (optional):</strong> In addition to creating a query that provides business and intent-aligned results, the control plane provides an optional explainability payload to show which policies were triggered and how they were combined.</li></ol><p>Finding which policies should be applied for a user’s search string requires a fast reverse-matching primitive, which we solve with the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-percolate-query">percolator query</a>. After retrieving relevant policies, combining multiple matched policies into a unified execution plan requires a judgment framework: priorities, conflict strategies, consumed phrase tracking, and cascading transformations that apply policies in sequence rather than independently. Additionally, the most appropriate retrieval technology needs to be selected (for example, <a href="https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25">BM25</a> for “oranges” versus <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> for “gift for grandpa”).</p><h2>Policy lookup: Checking the query before searching for products</h2><p>When a shopper types a query, a search system with a governed control plane doesn’t send that query directly to be executed against the product catalog. First, the query is checked against a set of stored policies and modified to reflect the intent of the query and business priorities.</p><h3>Policy structure</h3><p>Each policy is a simple document that defines two things:</p><ul><li><strong>Match criteria:</strong> What query text should cause this policy to fire. This could be an exact phrase, a single word, a pattern, or a combination.</li><li><strong>Action:</strong> What to do when the policy fires. This could be applying a category filter, excluding products, extracting a price constraint, or changing the retrieval strategy.</li></ul><p>The system finds all matching policies, composes them into an execution plan, and only then runs the product search. 
Taken together, policies act like a knowledgeable store associate who understands what you’re looking for and walks you to the right aisle.</p><h3>The policy pattern</h3><p>The first articles in this series introduced examples of policies in action: constraining &quot;oranges&quot; to the produce category, treating &quot;without peanuts&quot; as an exclusion, and routing &quot;gift for grandpa&quot; to semantic retrieval. The key architectural point is that in each case, the query is checked against stored policies before the product search begins. The policies determine what constraints to apply, which text to modify, and which retrieval strategy to use. The query against the product catalog comes after the policies have been applied and a new rewritten query has been created.</p><h3>Why this is fast</h3><p>An enterprise ecommerce system might have millions of products but only hundreds or thousands of policies. The policy lookup step is searching against a small curated index, not the full product catalog, and is therefore fast. And because policies are stored as data in their own index, a merchandiser adding a new policy doesn&#x27;t touch the application code, and an engineer optimizing the product search doesn&#x27;t touch the policy index. The two concerns evolve independently.</p><p>The examples above describe what happens conceptually. Under the hood, the policy lookup is implemented using the Elasticsearch <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-percolate-query">percolator query</a> type, which is purpose-built for this kind of pattern: matching incoming text against a set of stored queries. <a href="https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance">Part 4</a> in this series provides a hands-on deep dive into the percolator implementation, including index mappings, boundary markers, and highlight-driven phrase tracking. With the lookup mechanism covered in depth in Part 4, let&#x27;s turn to what a policy document actually contains and how the control plane composes multiple policies into a single execution plan.</p><h2>Example policies</h2><p>Now that we&#x27;ve seen what policies do conceptually, let&#x27;s look at what they actually contain. The two policies below have been designed to intentionally conflict, which will demonstrate the conflict resolution system described in subsequent sections.</p><h3>Cheap chocolate</h3><p>The policy shown below detects if a user has submitted a search containing the phrase “cheap chocolate”. If so, results are restricted to the “Chocolates” and “Milk chocolates” categories. This policy also applies a price filter of $2. Also, notice that this policy has a priority of 210; we’ll come back to this when we discuss conflict resolution in more detail.</p><p>The filter mode and conflict strategy settings shown here (hard_filter, soft_boost, restrict, override) are explained in detail in the conflict resolution section below.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbcd66b20fc3aa861b5880ca67daf8e809698717-1002x890.png" style="width: 100%;" /><p>When the above policy is activated, a search for “cheap chocolate” respects the price filter of $2 and restricts results to the “Chocolates” and “Milk chocolates” categories. 
Example results are shown below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3f373af9a985864315d7639440a416e45a882a1b-1133x1146.png" style="width: 100%;" /><h3>Christmas chocolate</h3><p>The policy shown below is an example of a policy that one could imagine applying at Christmas. This example restricts results to “Christmas foods and drinks” and “Christmas sweets”, boosts any products that are also in the “Advent calendars” category, and applies a price filter of less than $7 to focus on affordable seasonal items. Additionally, notice that this policy has a priority of 300. We’ll come back to this when we discuss conflict resolution in more detail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f1179342d0e05cf78266d142b046021a3694368-1007x941.png" style="width: 100%;" /><p>When the above policy is activated without any conflicting policies, a search for “chocolate” respects the $7 price filter, restricts results to the “Christmas foods and drinks” and “Christmas sweets” categories, and boosts any products tagged as “Advent calendars”. Example results are shown below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/662b0e40cb3a9291c17816c33169e9ff5b68f98d-1129x1085.png" style="width: 100%;" /><h2>Combining matched policies</h2><p>The policy lookup described above is half the story. The other half is what happens when multiple policies match the same query.</p><p>In any nontrivial deployment, a single query will routinely trigger several policies at once. &quot;Cheap chocolate&quot; will match both of the policies that we demonstrated above. Each policy is correct in isolation. The challenge is composing them into a single, coherent execution plan without contradictions, without double-counting, and without one policy silently undoing the work of another.</p><p>This isn’t a lookup problem; it’s a judgment problem. The system must decide:</p><ul><li><strong>Order of application:</strong> If a negation policy removes &quot;without peanuts&quot; from the query, does the price policy still see the original text or the modified text?</li><li><strong>Filter conflicts:</strong> If two policies set different price ceilings, which one wins? Is the loser silently dropped, or does it degrade gracefully into a soft boost?</li><li><strong>Phrase ownership:</strong> If two policies both matched on the same word and the first one already consumed it, should the second one still fire?</li></ul><p>A naive implementation (apply all matched policies independently, merge the results) breaks as soon as policies interact. The architecture needs an explicit model for how policies compose. The next two sections describe that model: a priority and conflict resolution framework, and a cascading transformation model that makes policy interaction deterministic.</p><p>The key insight is that policy application isn’t a set of independent operations; it’s a cascading transformation. Each policy receives the rewrite state produced by all higher-priority policies and transforms it further:</p><p><code>initial state → [Policy A] → state&#x27; → [Policy B] → state&#x27;&#x27; → ... → execution plan</code></p><p>The state carries the rewritten query text, accumulated filters, current intent, and any synonym expansions. A high-priority policy can remove text from the query, and every subsequent policy sees the modified query, not the original. Context accumulates.
Order matters.</p><h2>Precedence and conflict resolution: Determinism matters</h2><p>The specific conflict strategies are a design choice. Different organizations may resolve conflicts differently, depending on their business requirements. The following approach illustrates the kind of judgment framework a control plane needs. The important thing is not these specific strategies but that the system has explicit, deterministic strategies rather than letting conflicts resolve through unpredictable interactions.</p><h3>Priority ordering</h3><p>Policies are sorted by priority (highest first). When multiple policies match the same query, they’re applied in priority order. If two policies try to set the same filter field, the higher-priority policy&#x27;s declared strategy for that field takes precedence. If multiple triggered policies have the same priority, the policy with the highest ID is given precedence (as if it were assigned a higher priority); this choice ensures deterministic behavior when conflicts arise.</p><h3>Per-field resolution, not per policy</h3><p>A critical design principle: Conflict resolution operates per field (for example, brand, category, or description), not per policy. When two policies produce filters that overlap on specific fields, only those specific fields are affected by the conflict resolution strategy, and the resolution strategy is defined by the highest-priority matching policy. Non-conflicting fields from both policies survive intact.</p><p>This matters because the alternative of a per-policy approach would force the system to either accept or reject an entire policy when only one of its fields conflicts.</p><p>Per-field resolution preserves the maximum amount of useful constraint information.</p><h3>Three settings per filter field</h3><p>Each filter field in a policy has three independent settings:</p><p><strong>Filter mode:</strong> How the filter is applied when there’s no conflict.</p><ul><li><code>hard_filter</code> (default): Applied as an <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query#score-bool-filter">Elasticsearch <code>bool.filter</code></a> clause. This is useful for excluding unrelated products entirely. For example, restricting a search for &quot;oranges&quot; to the produce category eliminates hits such as orange juice and orange marmalade. Non-matching documents are completely excluded from results.</li><li><code>soft_boost</code>: Applied as an <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">Elasticsearch <code>function_score</code></a> weight with a configurable <code>boost_weight</code>. Documents that match get a ranking boost, but non-matching documents aren’t excluded. This is useful for something like boosting a brand without excluding other brands.</li></ul><p><strong>Conflict strategy:</strong> What happens when a lower-priority policy sets the same field:</p><ul><li><code>override</code>: The higher-priority policy&#x27;s value wins; the lower-priority value is dropped entirely. Valid for all field types.</li><li><code>restrict</code>: Take the more restrictive numeric value (for example, the lower ceiling for <code>price__max</code>, the higher floor for <code>price__min</code>). Valid for numeric range fields only.</li><li><code>merge</code>: Combine both values into a union.
Valid for non-numeric fields only.</li><li><code>soft_boost</code>: Convert the conflicting filter to a <code>function_score</code> weight with a configurable <code>boost_weight</code> instead of a hard filter. For more details on function_score boosting, see <a href="https://www.elastic.co/search-labs/blog/bm25-ranking-multiplicative-boosting-elasticsearch">Influencing BM25 ranking with multiplicative boosting in Elasticsearch</a>. This is only valid for non-negation fields.</li></ul><p><strong>Value:</strong> The actual filter value (for example, a categories list, a price threshold).</p><p><strong>Strategies by field type: </strong>Not all strategies make sense for all field types. For instance, an exclusion is inherently binary, so it cannot be soft-boosted. The following table shows which strategies are available for each field type:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Field type</td><td style="border: 1px solid #ccc; padding: 4px;">Available strategies</td><td style="border: 1px solid #ccc; padding: 4px;">Default</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Negation fields (__not, __match__not)</td><td style="border: 1px solid #ccc; padding: 4px;">override, merge</td><td style="border: 1px solid #ccc; padding: 4px;">override</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Numeric range fields (__max, __min, __gt, __lt)</td><td style="border: 1px solid #ccc; padding: 4px;">restrict, override, soft_boost</td><td style="border: 1px solid #ccc; padding: 4px;">restrict</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">All other fields (keyword, text)</td><td style="border: 1px solid #ccc; padding: 4px;">soft_boost, override, merge</td><td style="border: 1px solid #ccc; padding: 4px;">soft_boost</td>
                    </tr>
              </tbody>
            </table>
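          <p>To make the per-field model concrete, here’s a minimal Python sketch of how the non-boost strategies from the table might resolve a single conflicting field. The function and argument names are illustrative assumptions, not the production implementation:</p><pre><code>def resolve_field(field, winner_value, loser_value, strategy):
    """Resolve one conflicting filter field; "winner" is the value set by
    the higher-priority policy, whose declared strategy applies."""
    if strategy == "override":
        return winner_value                    # lower-priority value is dropped
    if strategy == "restrict":                 # numeric range fields only
        if field.endswith(("__max", "__lt")):  # ceilings: the lower bound wins
            return min(winner_value, loser_value)
        return max(winner_value, loser_value)  # floors: the higher bound wins
    if strategy == "merge":                    # non-numeric fields only
        return sorted(set(winner_value) | set(loser_value))
    # "soft_boost" instead converts the conflicting filters into
    # function_score weights; it is omitted here for brevity.
    raise ValueError(f"unsupported strategy: {strategy}")</code></pre><p>For example, <code>resolve_field("price__max", 7, 2, "restrict")</code> returns <code>2</code>, which is exactly what happens in the worked example below.</p>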
          <p>Negation fields cannot be soft-boosted because exclusions are binary. Converting &quot;never show canned foods&quot; to &quot;slightly prefer not-canned-foods&quot; fundamentally changes the semantics; a product from &quot;canned foods&quot; would still appear, just ranked slightly lower, which defeats the purpose of the exclusion.</p><h2>A concrete example: Searching for &quot;cheap chocolate&quot; during a Christmas campaign</h2><p>Suppose a merchandiser has created the two chocolate policies that we previously demonstrated: a lower-priority one for cheap chocolate and a higher-priority one that will be enabled during Christmas. If both policies are enabled, how they’re combined depends on the filter mode and conflict strategy of the higher-precedence policy:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0405e193522172bde283180df96ed3651178fafc-529x447.png" style="width: 100%;" /><p>This shows two conflicts, one on categories and one on price. It’s worth noting that the query that will be executed after this transformation has the following characteristics:</p><ul><li>Only products from the “Christmas foods and drinks” and “Christmas sweets” categories will be shown.</li><li>Within those categories, if the products are also tagged as being in the “Advent calendars” category, they’ll be boosted by 3x.</li><li>A price filter for $2 is applied, which came from the lower-priority policy (because the higher-priority policy specified “Restrict” on conflict).</li><li>The word “cheap” is removed, only returning products matching “chocolate”.</li></ul><p>With both of these policies enabled, “cheap chocolate” returns results similar to the image shown below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/01bbab1c5bd3d0fd37e39c25973d60141f9796e9-1126x1123.png" style="width: 100%;" /><h3>Relaxing constraints</h3><p>Perhaps the retailer doesn’t want to exclude products in the “Chocolates” and “Milk chocolates” categories during Christmas. The settings on the Christmas policy might have overreached and inadvertently removed categories applied by the “cheap chocolate” policy. This is an example of why it can be preferable to combine lower-priority policies with conflicting higher-priority ones rather than drop them. For example, we could modify the Christmas chocolates promotion so that instead of “Override” on conflict, we do a soft boost. The change to that policy would be as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/45e88311014d67933ca8cf8381d8f91de090e2b4-1090x103.png" style="width: 100%;" /><p>After this modification, the query rewriter transformation pipeline execution for “cheap chocolate” looks as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/396b360e48327421c2c38bcf4a039fb1a6d5a8e0-519x445.png" style="width: 100%;" /><p>With the soft boost on conflict, the conflicting filters are converted into soft boosts rather than being dropped.
The query that will be executed on the product catalog after this transformation has the following characteristics:</p><ul><li>Because “On conflict” is specified as “Soft boost” on the higher-priority policy, the conflicts will be converted to boosts as follows:<ul><li>Products from the “Christmas foods and drinks” and “Christmas sweets” categories will have a boost of 1x applied to them.</li><li>Products from the “Chocolates” and “Milk chocolates” categories will have a boost of 3x applied to them.</li></ul></li><li>As in the previous example, if the products are also tagged as being in the “Advent calendars” category, they’ll be boosted by 3x.</li><li>As in the previous example, a price filter for $2 is applied.</li><li>The word “cheap” is removed, only returning products matching “chocolate”.</li></ul><p>With relaxed filtering, results look as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7a68c54d878dadfe8b1821dd3860b7b60f9ce45f-1126x1123.png" style="width: 100%;" /><h3>Overriding price from a high-priority policy</h3><p>Or perhaps the retailer wants to allow slightly more expensive chocolates to be shown during Christmas by increasing the effective maximum price to $7. To ensure that the max price from the Christmas chocolates policy is not overridden if someone searches for “cheap chocolates”, we can set the conflict mode on the price to “override” rather than “restrict”, as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c2621e6513281f545b84eb77362f2b93e1c46a1f-996x70.png" style="width: 100%;" /><p>With this override, the query for “cheap chocolate” ignores the maximum price defined in the “cheap chocolate” policy and only applies the price specified in the “Christmas chocolates” policy, as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d50da7900beb3c08439e9fd79cbe2ddd98196441-511x389.png" style="width: 100%;" /><p>This is similar to the previous example, with the difference being that the max price is set to the $7 value from the higher-priority policy because that policy specified “Override” on conflict. With the Christmas price filter taking precedence, the results look as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/635ee6353ba84727486e7e053764788fb26b6f44-1134x1079.png" style="width: 100%;" /><p>These three variations (override on categories, soft_boost, and override on price) demonstrate a key property of the system: A merchandiser can change how two policies interact by modifying a setting on a single field within a single policy, without deploying any code. The conflict strategy is the lever that controls business behavior.</p><h2>Consumed phrase tracking</h2><p>There’s a subtler form of conflict: two policies that match on the same phrase. If a higher-priority policy removes &quot;without peanuts&quot; from the query, a lower-priority policy that also matched on &quot;without&quot; has nothing left to act on. The system detects that the matched phrase is no longer present in the rewritten query and skips the lower-priority policy.</p><p>Intent policies are exempt from consumed phrase tracking: They set the retrieval strategy based on the original query match, regardless of what text has been removed by higher-priority policies.</p><p>Priority ordering, per-field conflict resolution, and consumed phrase tracking together give the control plane a deterministic composition model.
With that foundation in place, the system can make a routing decision that would be risky without it.</p><h2>Governance makes retrieval strategy safe</h2><p>An important insight about routing to the correct retrieval method (text, semantic, or hybrid) is that it executes after governance. If your policies have already enforced a “produce category” filter, then semantic retrieval becomes far less risky because the candidate set is constrained. A semantic search over 500 products is a very different proposition from a semantic search over 500,000 SKUs. Governance narrows the blast radius before retrieval begins.</p><p>For example, without governance, a semantic query for “Fruit high in vitamin C under $4”, in addition to fruits, might return vitamin bottles, carrots, and green peppers. The control plane ensures that these undesired results aren’t even considered as part of the semantic expansion.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dccd5b8a94bfa81f68e3d1c4ad8929ce8cc4e5e-990x378.png" style="width: 100%;" /><p>With that constraint in place, the control plane applies pragmatic routing logic:</p><ul><li><strong>Lexical</strong> for navigational and head queries where deterministic precision matters.</li><li><strong>Semantic</strong> for descriptive discovery queries where concept matching helps.</li><li><strong>Hybrid</strong> selectively, when constraints have already been enforced and the business accepts broader recall.</li></ul><h2>From architecture to implementation</h2><p>The governed control plane translates business intent into deterministic, composable execution plans, without embedding that logic in application code. Policies are data: matched at query time, resolved through explicit per-field conflict strategies, and applied as cascading transformations that produce explainable results. Elastic Services Engineering has built and deployed this architecture for enterprise ecommerce teams, using repeatable patterns and accelerators that compress the path from concept to production. You can see a demo of our implementation of a control plane on YouTube: <a href="https://www.youtube.com/watch?v=e1GuL9CYWAk">Fixing Search Relevance in Seconds: Introducing PRISM</a>.</p><h3><strong>What&#x27;s next in this series</strong></h3><p>The next post goes hands-on with the implementation: how the Elasticsearch percolator powers the policy lookup, including index mappings, boundary markers, highlight-driven phrase tracking, and concrete query examples.</p><h2>Put governed ecommerce search into practice</h2><p>The control plane architecture described in this post (per-field conflict resolution, cascading policy transformations, and governance-constrained retrieval routing) was designed and built by Elastic Services Engineering. Every pattern, screenshot, and transformation pipeline shown in this series comes from a working system validated against enterprise-scale product catalogs.</p><p>If you want to implement a governed, policy-driven control plane on Elasticsearch, <a href="https://www.elastic.co/consulting">Elastic Services</a> can get you there faster.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch ES|QL query builder for JavaScript and TypeScript: Fluent, type-safe query construction]]></title>
            <description><![CDATA[Exploring the Elasticsearch ES|QL query builder for JavaScript and TypeScript and explaining how to build ES|QL queries with practical examples.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-query-builder-javascript-typescript</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-query-builder-javascript-typescript</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Margaret Gu]]></dc:creator>
            <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8cc9dc5b2bcae4c589b402d62a5b7c8c6d63fb7-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We&#x27;re pleased to announce that the Elasticsearch Query Language (ES|QL) query builder is now available for JavaScript and TypeScript. It&#x27;s a fluent, type-safe library that lets you construct ES|QL queries with method chaining, automatic value escaping, and full integrated development environment (IDE) support; no more raw string concatenation.</p><p>Learn how to get started with practical examples you can use right away.</p><h2>Elasticsearch ES|QL query builder for JavaScript and TypeScript</h2><p>If you&#x27;ve ever built an ES|QL query in JavaScript, you&#x27;ve probably written something like this:</p><pre><code>const query = `FROM logs-*
| WHERE status_code >= ${minStatus}
  AND host.name == "${hostname}"
  AND @timestamp >= "${startDate}"
| STATS error_count = COUNT(*) BY status_code
| SORT error_count DESC
| LIMIT 10`</code></pre><p>It looks fine until <strong><code>hostname</code></strong> is<strong> <code>O&#x27;Brien&#x27;s server</code></strong> and the whole thing blows up with a parse error. Or until a user passes <strong><code>&quot;; DROP INDEX logs</code></strong> into a search field and you realize you&#x27;ve been building queries with raw string concatenation this entire time.</p><p>There&#x27;s a better way. The ES|QL query builder for JavaScript and TypeScript lets you write queries like this instead:</p><pre><code>import { ESQL, E, f } from '@elastic/elasticsearch-esql-dsl'

const query = ESQL.from('logs-*')
  .where(E('status_code').gte(minStatus))
  .where(E('host.name').eq(hostname))
  .where(E('@timestamp').gte(startDate))
  .stats({ error_count: f.count() })
  .by('status_code')
  .sort(E('error_count').desc())
  .limit(10)</code></pre><p>Values are escaped automatically. You get autocomplete in your editor. And you can see exactly what the query does, without mentally parsing a template literal.</p><p>ES|QL query builders are already available across Elastic&#x27;s language clients, including Python, Ruby, and others. This article focuses on the JavaScript and TypeScript version, walking through practical examples you can start using today.</p><h2>Getting started</h2><p>Install the package:</p><pre><code>npm install @elastic/elasticsearch-esql-dsl</code></pre><p>Here’s a minimal query:</p><pre><code>import { ESQL, E } from '@elastic/elasticsearch-esql-dsl'

const query = ESQL.from('employees')
  .where(E('still_hired').eq(true))
  .sort(E('last_name').asc())
  .limit(10)

console.log(query.render())</code></pre><p>This renders:</p><pre><code>FROM employees
| WHERE still_hired == true
| SORT last_name ASC
| LIMIT 10</code></pre><p>To run it against Elasticsearch:</p><pre><code>import { Client } from '@elastic/elasticsearch'

const client = new Client({ node: 'http://localhost:9200' })
const response = await client.esql.query({ query: query.render() })</code></pre><p>That’s it. No string interpolation, no manual escaping.</p><h2><strong>Building a real query, step by step</strong></h2><p>Let&#x27;s walk through a realistic scenario: You&#x27;re building a dashboard that analyzes web server error logs. We&#x27;ll start simple and layer on features.</p><h3><strong>Step 1: Filter error logs</strong></h3><pre><code>import { ESQL, E } from '@elastic/elasticsearch-esql-dsl'

const errors = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .limit(100)</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| LIMIT 100</code></pre><h3><strong>Step 2: Add a computed column</strong></h3><p>Your timestamps are in milliseconds, but you want response time in seconds:</p><pre><code>const errors = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .eval({ response_secs: E('response_time_ms').div(1000) })
  .limit(100)</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| EVAL response_secs = response_time_ms / 1000
| LIMIT 100</code></pre><h3><strong>Step 3: Aggregate errors by status code</strong></h3><pre><code>import { f } from '@elastic/elasticsearch-esql-dsl'

const errorBreakdown = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .stats({
    error_count: f.count(),
    avg_response: f.avg('response_time_ms'),
  })
  .by('status_code')
  .sort(E('error_count').desc())</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| STATS error_count = COUNT(*), avg_response = AVG(response_time_ms) BY status_code
| SORT error_count DESC</code></pre><p>The <strong><code>f</code></strong> namespace gives you access to 150+ ES|QL function wrappers: aggregations, string functions, date functions, math, geo, and more. They all return chainable expressions, so you can use them anywhere you&#x27;d use <strong><code>E()</code></strong>.</p><h3><strong>Step 4: Use date functions for time-based analysis</strong></h3><pre><code>const hourlyErrors = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .eval({ hour: f.dateTrunc('@timestamp', '1 hour') })
  .stats({ error_count: f.count() })
  .by('hour')
  .sort(E('hour'))</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| EVAL hour = DATE_TRUNC(@timestamp, "1 hour")
| STATS error_count = COUNT(*) BY hour
| SORT hour</code></pre><h3><strong>Step 5: Branch queries safely</strong></h3><p>Every method returns a new query object. The original is never mutated. This means you can build a base query and branch it for different views:</p><pre><code>const base = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .where(E('@timestamp').gte('2026-01-01T00:00:00Z'))
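// Every builder method returns a new object rather than mutating; the three views below all reuse base.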

const byStatus = base
  .stats({ count: f.count() })
  .by('status_code')
  .sort(E('count').desc())

const byHost = base
  .stats({ count: f.count() })
  .by('host.name')
  .sort(E('count').desc())
  .limit(20)

const recent = base
  .sort(E('@timestamp').desc())
  .keep('@timestamp', 'status_code', 'url.path', 'message')
  .limit(50)</code></pre><p>Three different queries, one shared base. Change the filter on <strong><code>base</code>,</strong> and all three update. This is especially useful for dashboards where multiple panels query the same dataset with different aggregations.</p><h2><strong>Three ways to write expressions</strong></h2><p>The domain‑specific language (DSL) gives you flexibility in how you write conditions. Here&#x27;s the same WHERE clause written three different ways:</p><p><strong>Raw strings:</strong> When you&#x27;re writing a quick one-off:</p><pre><code>.where('status_code >= 400 AND host.name == "web-01"')</code></pre><p><strong>The <code>E()</code> expression builder: </strong>When you want type safety and autocomplete:</p><pre><code>import { and_ } from '@elastic/elasticsearch-esql-dsl'

.where(and_(
  E('status_code').gte(400),
  E('host.name').eq('web-01')
))</code></pre><p><strong>The <code>esql</code> template tag: </strong>When you want safe interpolation of dynamic values:</p><pre><code>import { esql } from '@elastic/elasticsearch-esql-dsl'

const minStatus = 400
const host = 'web-01'
.where(esql`status_code >= ${minStatus} AND host.name == ${host}`)</code></pre><p>All three produce the same ES|QL. Pick whichever fits your situation: raw strings for simple cases, <strong><code>E()</code></strong> when building expressions programmatically, and the template tag when mixing literal ES|QL with dynamic values.</p><h2><strong>Keeping queries safe</strong></h2><p>If any part of your query comes from user input, you need to think about injection. ES|QL supports parameter binding, and the DSL makes it straightforward:</p><pre><code>function searchLogs(userQuery: string) {
  const query = ESQL.from('logs-*')
    .where(E('message').eq(E('?')))
    .limit(100)
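  // The rendered query keeps a bare ? placeholder; the user's value is sent separately via params.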

  return client.esql.query({
    query: query.render(),
    params: [userQuery],
  })
}</code></pre><p>The <strong><code>?</code></strong> placeholder is replaced server-side by Elasticsearch, so the user&#x27;s input never touches the query string. No escaping, no injection risk.</p><h2><strong>Beyond the basics</strong></h2><p>Once you&#x27;re comfortable with the core commands, the DSL supports every advanced ES|QL feature:</p><p><strong>Hybrid search with FORK and FUSE:</strong></p><pre><code>const results = ESQL.from('articles')
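  // Each branch is a complete sub-pipeline; fuse('RRF') merges the branch results via reciprocal rank fusion.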
  .fork(
    ESQL.branch()
      .where(f.match('title', 'elasticsearch'))
      .sort(E('_score').desc())
      .limit(50),
    ESQL.branch()
      .where(f.knn('embedding', 10))
      .sort(E('_score').desc())
      .limit(50),
  )
  .fuse('RRF')
  .limit(10)</code></pre><p><strong>Data enrichment:</strong></p><pre><code>const enriched = ESQL.from('logs-*')
  .enrich('ip_lookup')
  .on('client.ip')
  .with('geo.city', 'geo.country')</code></pre><p><strong>Conditional aggregation:</strong></p><pre><code>const stats = ESQL.from('employees')
  .stats({
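    // Aggregations can carry their own WHERE filters, each computed over its own row subset.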
    eng_avg: f.avg('salary').where(E('dept').eq('Engineering')),
    sales_avg: f.avg('salary').where(E('dept').eq('Sales')),
    total: f.count(),
  })</code></pre><p><strong>AI/machine learning (ML) integration:</strong></p><pre><code>const summarized = ESQL.from('docs')
  .completion('Summarize this document')
  .with({ inferenceId: 'my-llm' })</code></pre><p>For the full list of commands and functions, check out the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/javascript-dsl">ES|QL query builder documentation</a>.</p><h2><strong>What&#x27;s next</strong></h2><p>This is the initial release of <strong><code>@elastic/elasticsearch-esql-dsl</code></strong>. You can find the package on <a href="https://www.npmjs.com/package/@elastic/elasticsearch-esql-dsl">npm</a>, explore the source on <a href="https://github.com/elastic/elasticsearch-dsl-js">GitHub</a>, and read the full documentation in the repository. If you run into issues or have feature requests, open an issue; we&#x27;re actively developing this and want to build what JavaScript and TypeScript developers actually need.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How cross-project search (CPS) works in Elasticsearch Serverless]]></title>
            <description><![CDATA[Elastic Cloud Serverless cross-project search (CPS) treats index expressions as cross-project by default. This post explains how TransportSearchAction scopes projects, resolves index expressions, skips projects with no matches, and validates index resolution against allow_no_indices and ignore_unavailable.]]></description>
            <link>https://www.elastic.co/search-labs/blog/cross-project-search-elasticsearch-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/cross-project-search-elasticsearch-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Matteo Piergiovanni, Pawan Kartik]]></dc:creator>
            <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c8bbf6c6b27b8db8a13313f5f3db2cad948f840b-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>With the introduction of cross-project search (CPS) in our Elastic Cloud Serverless offering, we wanted to provide our customers with a streamlined search experience. Our vision is for our users to not worry about which project the data is stored in but instead to focus on the information they want an answer for.</p><p>Users familiar with <a href="https://www.elastic.co/docs/explore-analyze/cross-cluster-search">cross-cluster search</a> (CCS) know that to include a remote cluster in a search you must specifically reference it with a cluster alias prefix, such as <code>remote:metrics*</code> (or <code>*:metrics*</code> to target all remotes). With CPS, &quot;bare&quot; index names or patterns such as <code>metrics*</code> in a search index expression implicitly reference all instances of that index that can be found on the origin project and on all of its linked projects.</p><p>In this post, we pull back the curtain on how the Elasticsearch TransportSearchAction, the code that underlies the _search and _async_search APIs, determines which indices, aliases, and data streams to search on which projects when running a cross-project search.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7c9134b4509dab0dbe26def7e058c40562cfd5-2400x1350.webp" style="width: 100%;" /><h2>Analyzing index expressions</h2><p>One important consequence of the new CPS model is that the Elasticsearch query parameters <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#operation-search-allow_no_indices"><code>allow_no_indices</code></a> and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#operation-search-ignore_unavailable"><code>ignore_unavailable</code></a> require different handling than before. In general, these parameters control whether a search should throw an error if a concrete index name (for example, &quot;logs&quot;) cannot be found or isn’t accessible (<code>ignore_unavailable=false</code>), or if a wildcard pattern (for example, &quot;logs*&quot;) doesn’t match anything or there are no indices at all to search (<code>allow_no_indices=false</code>).</p><p>In CCS, those parameters are analyzed on each cluster separately: Each cluster can just consult which indices, aliases, or data streams exist locally. But in CPS, we need to account for whether each resource (index, alias, or data stream) referenced by an index expression matches on <em>any</em> project (origin or linked), rather than <em>every</em> project.</p><p>To illustrate, suppose a user has one linked project (<code>linked1</code>) and they issue a cross-project query with index expression <code>logs*,metrics-1</code>, along with <code>allow_no_indices=false</code> and <code>ignore_unavailable=false</code>. As long as we find one resource (index, alias, or data stream) that matches <code>logs*</code> and one that matches <code>metrics-1</code> on any project, the search can proceed. If <code>metrics-1</code>, for example, is found on <code>linked1</code> but not the origin project, that suffices to pass the <code>ignore_unavailable=false</code> constraint. Only if it’s found nowhere would we throw an IndexNotFoundException.</p><p>To handle this, the _search API in cross-project search mode needs to gather information from each linked project before kicking off the actual search.</p><p>The serverless node that receives the _search REST request is considered to be the origin project, and it acts as the overall search coordinator. 
To fully analyze and process an index expression before kicking off the actual search, the search coordinator needs to:</p><ol><li>Determine which projects are in scope for the search.</li><li>Determine which indices, aliases, or data streams should be searched in each individual project.</li><li>Do a final reconciliation step over all the information gathered: 1) Have we found all the indices required to proceed (as defined by the allow_no_indices and ignore_unavailable settings)? And 2) are there any projects that should be skipped because they have no matching indices?</li></ol><p>To illustrate, we’ll follow a CPS request against the _search API from start to finish. Suppose that the incoming user request is to search <code>metrics*</code>, that the origin project has two indices that match the expression, namely <code>metrics-ES</code> and <code>metrics-KB</code>, and that the origin project is linked to projects <code>P1</code> and <code>P2</code>, where <code>P1</code> has index <code>metrics-ES</code> and <code>P2</code> has no index, alias, or data stream that matches <code>metrics*</code>.</p><h2>Determining projects in scope</h2><p>The search coordinator starts by grabbing the full list of linked projects from the Elasticsearch cluster state. It then determines, based on the <a href="https://www.elastic.co/docs/explore-analyze/cross-project-search#security">Universal Identity and Access Management</a> (UIAM) credentials provided with the query, which of those projects the user has access to, and keeps only that subset.</p><p>If a <code>project_routing</code> parameter is present on the request, that can further limit which projects are in scope for the query. For instance, <code>&quot;project_routing&quot;:&quot;_alias:_origin&quot;</code> would cull the list to just the origin project, while <code>&quot;project_routing&quot;:&quot;_alias:P*&quot;</code> would include the linked projects P1 and P2 and remove the origin project from the list of projects in scope for the query.</p><p>The index expression itself can also limit which projects to access. Qualified index expressions, like <code>&quot;P*:metrics*&quot;</code>, would indicate that we should search only on projects whose alias starts with &quot;P&quot;. For our case, the &quot;bare&quot; or unqualified index expression <code>&quot;metrics*&quot;</code> means &quot;search on all projects that are in scope,&quot; where &quot;in scope&quot; is modified by security access and project_routing.</p><h2>Determining which indices each project has</h2><p>Once we know which projects to search, we need to determine which matching indices each one has. To support this, the search coordinator creates a <code>ResolvedIndexExpressions</code> data structure that tracks which indices on each project should be included in the search.</p><p>For each index expression provided by the user, that data structure tracks:</p><ul><li>Original expression: The index expression, as provided by the user.</li><li>Local resolution: The local expressions that will replace the original, together with the resolution result.</li><li>Remote expressions: A set of remote expressions, one for each project the query can target.</li></ul><p>On the search coordinator that received the request, we’ve so far been able to determine which projects are in scope (origin, P1, and P2) and which, if any, matching indices exist on origin. 
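</p><p>To make that bookkeeping concrete, here is a minimal TypeScript sketch of the structure it implies; the type and field names are illustrative only, not the actual Elasticsearch classes, and the resolution states are described a little further below:</p><pre><code>// Illustrative sketch only; not the real Elasticsearch internals.
type Resolution = 'SUCCESS' | 'NOT_VISIBLE' | 'UNAUTHORIZED' | 'NONE'

interface LocalResolution {
  result: Resolution
  indices: string[]      // concrete indices, aliases, or data streams matched locally
}

interface RemoteExpression {
  projectAlias: string   // e.g., "P1"
  expression: string     // e.g., "metrics*"
  result?: Resolution    // filled in after contacting the linked project
  indices?: string[]
}

interface ResolvedIndexExpression {
  original: string       // the expression exactly as the user provided it
  local: LocalResolution
  remotes: RemoteExpression[]
}</code></pre><p>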
So, for our example search against <code>metrics*</code>, we’ll have the following structure at that point in time:</p><ul><li>Original expression: <code>metrics*</code>.</li><li>Local resolution: <code>SUCCESS</code>, <code>&lt;metrics-ES, metrics-KB&gt;</code>.</li><li>Remote expressions: <code>&lt;P1:metrics*, P2:metrics*&gt;</code>.</li></ul><p>The remote expressions are left unresolved until we later contact the P1 and P2 projects to fill in that information.</p><p>The index resolution can be any of the following:</p><ul><li><code>SUCCESS</code>: Local resolution completed successfully.</li><li><code>NOT_VISIBLE</code>: Indicates that a non-wildcard expression was resolved to nothing, either because the index doesn’t exist or is closed.</li><li><code>UNAUTHORIZED</code>: Indicates that the expression could be resolved to a concrete index, but the requesting user isn’t authorized to access it.</li><li><code>NONE</code>: No local resolution was attempted, typically because the expression is remote-only (for example, <code>P1:index</code>).</li></ul><h2>Checking indices on the linked projects</h2><p>In the search API, how we check for indices on the linked projects depends on whether the cross-project search is being run with <a href="https://www.elastic.co/docs/explore-analyze/cross-cluster-search#ccs-min-roundtrips-true">minimize round trips</a> or not. Most searches in CPS are set internally to run with minimize_roundtrips=true, so we’ll focus on that pathway.</p><p>In CCS, where index expression analysis is done locally on each cluster, we just send the same request to every cluster and, with minimize_roundtrips=true, each remote cluster sends back an entire SearchResponse that the primary search coordinator collects and eventually merges with all the other responses it receives.</p><p>By contrast, for CPS, an additional phase was introduced to the search flow, in which we contact each linked project to assess which indices are present and which ones the user has access to. This is an additional round trip that uses <code>ResolveIndexAction</code>, the class that implements the functionality of the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#operation-search-ignore_unavailable"><code>_resolve/index</code></a> endpoint. Upon receiving the <code>ResolveIndex</code> responses from all the linked projects, we can fill in the ResolvedIndexExpressions data structure on the primary search coordinator. For this case, since <code>P1</code> has index <code>metrics-ES</code> and <code>P2</code> has no matching indices, the updated data structure would be:</p><ul><li>Original expression: <code>metrics*</code>.</li><li>Local resolution: <code>SUCCESS</code>, <code>&lt;metrics-ES, metrics-KB&gt;</code>.</li><li>Remote expressions: <code>&lt;P1: SUCCESS &lt;metrics-ES&gt;, P2: NOT_VISIBLE&gt;</code>.</li></ul><h2>CPS validator</h2><p>Once we have all the linked project information, we can run the validation to honor the <code>IndicesOptions</code> specified by the caller.</p><p>If the user had specified a qualified index expression (for example, original=<code>P1:metrics*</code>), the CPS validator needs to make sure that project <code>P1</code> has at least one index matching <code>metrics*</code>; otherwise, a 404 index not found exception would be returned to the user. 
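In other words, qualified expressions imply that an index, alias, or data stream matching that name must be present on all projects specified by the qualifier. On the other hand, for “bare” (unqualified) CPS index expressions (for example, original=<code>metrics*</code>), we only need to check whether the original expression matches anywhere: on the origin project or on any linked project.</p><p>Here is a minimal TypeScript sketch of that validation rule, reusing the illustrative <code>ResolvedIndexExpression</code> shape from above (again, not the actual Elasticsearch implementation):</p><pre><code>// Illustrative sketch only. A qualified expression must match on every project
// named by its qualifier; a bare expression must match somewhere (origin or linked).
function validateExpression(expr: ResolvedIndexExpression, qualified: boolean): void {
  const localHit = expr.local.result === 'SUCCESS' && expr.local.indices.length > 0
  const remoteHits = expr.remotes.filter(
    (r) => r.result === 'SUCCESS' && (r.indices?.length ?? 0) > 0,
  )
  if (qualified) {
    if (remoteHits.length !== expr.remotes.length) {
      throw new Error(`index_not_found_exception: ${expr.original}`) // surfaces as a 404
    }
  } else if (!localHit && remoteHits.length === 0) {
    throw new Error(`index_not_found_exception: ${expr.original}`) // surfaces as a 404
  }
}</code></pre><p>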
For the example we’re following in this post, the <code>ResolvedIndexExpressions</code> structure shows that at least one index, alias, or data stream matching <code>metrics*</code> was found, so the search can proceed.</p><p>However, we can also see that the P2 project has no matching indices. In that case, we can remove it from the rest of the query. So now, for the actual query, we’ll:</p><ul><li>On the origin project, search <code>metrics-ES, metrics-KB</code>.</li><li>On the <code>P1</code> project, search <code>metrics-ES</code>.</li><li>Skip the <code>P2</code> project. Note: Since it wasn’t included in the query at all, it won’t show up in the <a href="https://www.elastic.co/docs/explore-analyze/cross-project-search#cps-examples">_cluster/details</a> section of the SearchResponse.</li></ul><h2>Conclusions</h2><p>We’ve described how the TransportSearchAction class in Elasticsearch supports some key new features of cross-project search. That class supports a large number of Elasticsearch REST APIs, such as _search, _async_search, _msearch, _eql, _sql, and _count, all of which use the mechanisms described in this document.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic-caveman: Cutting AI response tokens by 64% without losing the best of Elastic]]></title>
            <description><![CDATA[Learn how to use elastic-caveman to cut AI response tokens while keeping the Elastic agentic brilliance.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-caveman-ai-token-reduction</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-caveman-ai-token-reduction</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sri Kolagani]]></dc:creator>
            <pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b162c28623b9070fd1980bcd891b9dd1e868f2f0-720x421.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When querying Elasticsearch through an AI assistant, you need facts: index names, field mappings, Elasticsearch Query Language (ES|QL) queries, case IDs, sentiment scores. But current large language model (LLM) interfaces wrap every response in conversational padding:</p><p><em>&quot;Of course! I&#x27;d be happy to help you...&quot;</em></p><p><em>&quot;This should give you a good overview...&quot;</em></p><p><em>&quot;Feel free to let me know if you need anything else!&quot;</em></p><p>This isn&#x27;t just annoying; it&#x27;s expensive. Every token costs money and adds latency. For production Elasticsearch queries, that overhead compounds fast. In this post, we introduce elastic-caveman and share the results of a controlled experiment across eight live Model Context Protocol (MCP) scenarios against an Elasticsearch cluster. The findings: 63.6% average token reduction, 817 tokens saved, and zero loss of technical accuracy.</p><h2><strong>Enter elastic-caveman</strong></h2><p><a href="https://github.com/srikolag/elastic-caveman">elastic-caveman</a> tests a simple hypothesis: Strip AI responses to pure signal, and measure the impact. The approach:</p><ul><li>Normal mode: Full conversational AI with greetings, explanations, and sign-offs.</li><li>Caveman mode: Raw data with minimal structural labels only.</li></ul><p>We tested both modes against a live Elasticsearch instance using MCP with real support ticket and Salesforce case data across eight production scenarios.</p><h2><strong>Results: 64% token reduction, zero accuracy loss</strong></h2><p>Here&#x27;s what we found across eight live MCP tool calls: elastic-caveman cut response size dramatically without compromising quality or functionality.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Result</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scenarios tested</td><td style="border: 1px solid #ccc; padding: 4px;">8</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Success rate</td><td style="border: 1px solid #ccc; padding: 4px;">88%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Token reduction</td><td style="border: 1px solid #ccc; padding: 4px;">63.6% average</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total normal tokens</td><td style="border: 1px solid #ccc; padding: 4px;">1,284</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total Caveman tokens</td><td style="border: 1px solid #ccc; padding: 4px;">467</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tokens saved</td><td style="border: 1px solid #ccc; padding: 4px;">817</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max reduction (single scenario)</td><td style="border: 1px solid #ccc; padding: 4px;">91.5%</td>
                    </tr>
              </tbody>
            </table>
          <p><strong>Key preservations (0% loss):</strong></p><ul><li>Technical accuracy</li><li>API paths</li><li>ES|QL syntax</li><li>Field names</li></ul><p>The critical finding: Every field name, case ID, ES|QL query, account name, and sentiment score was preserved exactly. Not approximately. Exactly.</p><h2><strong>Real examples: Before and after</strong></h2><p><strong>Example 1. List indices: 87% reduction</strong></p><p>User: <em>Show me my indices</em></p><p><strong>Normal mode (107 tokens):</strong></p><pre><code>Of course! I'd be happy to help you see your indices. Here's a comprehensive list of all the indices in your Elasticsearch cluster. Each entry shows the index name along with relevant metadata. This should give you a good overview of what's stored in your cluster:

- salesforce-cases
- support-tickets

These are all the indices I found. If you'd like to dive deeper into any specific index, just let me know and I'll be glad to assist!</code></pre><p><strong>Caveman mode (14 tokens):</strong></p><pre><code>Indices: salesforce-cases, support-tickets</code></pre><p><strong>Saved: 93 tokens (86.9%)</strong></p><p><strong>Example 2. Generate ES|QL query: 75% reduction</strong></p><p>User: <em>Show me open critical tickets grouped by product area</em></p><p><strong>Normal mode (208 tokens):</strong></p><pre><code>Sure thing! I'd be happy to help you write that ES|QL query. What you're looking for is a query that will filter for open tickets with critical priority and group them by product area. Here's how you can accomplish this...</code></pre><p><em>[followed by the actual query, plus 150+ tokens of step-by-step explanation]</em></p><p><strong>Caveman mode (52 tokens):</strong></p><pre><code>FROM support-tickets
WHERE status = "Open" AND priority = "Critical"
| STATS count = COUNT(*) BY product_area
| SORT count DESC
| LIMIT 100

Filters open+critical, groups by product_area, counts, sorts desc.</code></pre><p><strong>Saved: 156 tokens (75.0%). ES|QL syntax is character-for-character identical in both modes.</strong></p><p><strong>Example 3. Search recent support tickets: 35% reduction</strong></p><p>User: <em>Show me 5 recent support tickets</em></p><p><strong>Caveman mode (143 tokens):</strong></p><pre><code>5 recent tickets, newest first:
Case     | Priority | Status    | Account               | Subject                                    | Days | Sentiment
CASE-0012 | Critical | Open      | Pinnacle Financial    | Regulatory reporting: incorrect tax calcs  | 2    | -0.94
CASE-0021 | Critical | Escalated | Riverstone Capital    | Security audit blocking renewal            | 2    | -0.93
CASE-0046 | Medium   | Open      | Blackrock Partners    | Pen test findings need vendor response     | 2    | -0.70
CASE-0006 | High     | Open      | United Oil & Gas      | Unauthorized access -- urgent               | 3    | -0.91
CASE-0017 | High     | Open      | Riverstone Capital    | Renewal paperwork stalled                  | 4    | -0.86

All negative sentiment. Two critical cases from April 7. All case IDs, priorities, account names, and sentiment scores preserved verbatim.</code></pre><h2><strong>What gets removed vs. what stays</strong></h2><p>When we clean up the output, we strip out conversational filler like “Of course! I’d be happy to help you…”, “This should give you a good overview…”, and “Would you like me to help you prioritize these?” We keep every piece of factual content:</p><ul><li>ES|QL snippets, like FROM support-tickets WHERE status = &quot;Open&quot;.</li><li>Field names, like sentiment_score, product_area, and resolution_hours.</li><li>Index names, like support-tickets and salesforce-cases.</li><li>Case IDs, like CASE-0012 and CASE-0002.</li><li>Account names, like Pinnacle Financial and United Oil Gas Corp.</li><li>Numeric values, like a sentiment_score of -0.94, 47 duplicates, 18 days, and 27.0 average hours.</li></ul><p>The result is output tightly focused on query syntax, entities, and numbers, discarding only the polite scaffolding.</p><p>Results varied by operation type:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Token reduction</td><td style="border: 1px solid #ccc; padding: 4px;">Why</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metadata listings</td><td style="border: 1px solid #ccc; padding: 4px;">85–92%</td><td style="border: 1px solid #ccc; padding: 4px;">Small payload, maximum filler in normal mode</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ES|QL generation</td><td style="border: 1px solid #ccc; padding: 4px;">70–75%</td><td style="border: 1px solid #ccc; padding: 4px;">Query is identical; explanation is eliminated</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Data-heavy searches</td><td style="border: 1px solid #ccc; padding: 4px;">35–40%</td><td style="border: 1px solid #ccc; padding: 4px;">Actual data dominates, leaving less room for fluff</td>
                    </tr>
              </tbody>
            </table>
          <h2><strong>Complete evaluation breakdown</strong></h2><p>Token savings by query type across all eight scenarios against live MCP data:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scenario</td><td style="border: 1px solid #ccc; padding: 4px;">Normal tokens</td><td style="border: 1px solid #ccc; padding: 4px;">Caveman tokens</td><td style="border: 1px solid #ccc; padding: 4px;">Reduction</td><td style="border: 1px solid #ccc; padding: 4px;">Tokens saved</td><td style="border: 1px solid #ccc; padding: 4px;">MCP tool</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T1: List all streams</td><td style="border: 1px solid #ccc; padding: 4px;">118</td><td style="border: 1px solid #ccc; padding: 4px;">10</td><td style="border: 1px solid #ccc; padding: 4px;">91.5%</td><td style="border: 1px solid #ccc; padding: 4px;">108</td><td style="border: 1px solid #ccc; padding: 4px;">platform.streams.list_streams</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T2: List indices</td><td style="border: 1px solid #ccc; padding: 4px;">107</td><td style="border: 1px solid #ccc; padding: 4px;">14</td><td style="border: 1px solid #ccc; padding: 4px;">86.9%</td><td style="border: 1px solid #ccc; padding: 4px;">93</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.list_indices</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T3: Get index mapping</td><td style="border: 1px solid #ccc; padding: 4px;">143</td><td style="border: 1px solid #ccc; padding: 4px;">40</td><td style="border: 1px solid #ccc; padding: 4px;">72.0%</td><td style="border: 1px solid #ccc; padding: 4px;">103</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.get_index_mapping</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T4: Generate ES|QL query</td><td style="border: 1px solid #ccc; padding: 4px;">208</td><td style="border: 1px solid #ccc; padding: 4px;">52</td><td style="border: 1px solid #ccc; padding: 4px;">75.0%</td><td style="border: 1px solid #ccc; padding: 4px;">156</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.generate_esql</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T5: Execute ES|QL aggregation</td><td style="border: 1px solid #ccc; padding: 4px;">149</td><td style="border: 1px solid #ccc; padding: 4px;">44</td><td style="border: 1px solid #ccc; padding: 4px;">70.5%</td><td style="border: 1px solid #ccc; padding: 4px;">105</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.execute_esql</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T6: Search recent tickets</td><td style="border: 1px solid #ccc; padding: 4px;">221</td><td style="border: 1px solid #ccc; padding: 4px;">143</td><td style="border: 1px solid #ccc; padding: 4px;">35.3%</td><td style="border: 1px solid #ccc; padding: 4px;">78</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T7: Search escalated cases</td><td style="border: 1px solid #ccc; padding: 4px;">198</td><td style="border: 1px solid #ccc; padding: 4px;">128</td><td style="border: 1px solid #ccc; padding: 4px;">35.4%</td><td style="border: 1px solid #ccc; padding: 4px;">70</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T8: ES|QL stats by priority</td><td style="border: 1px solid #ccc; padding: 4px;">140</td><td style="border: 1px solid #ccc; padding: 4px;">36</td><td style="border: 1px solid #ccc; padding: 4px;">74.3%</td><td style="border: 1px solid #ccc; padding: 4px;">104</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.execute_esql</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">TOTALS</td><td style="border: 1px solid #ccc; padding: 4px;">1,284</td><td style="border: 1px solid #ccc; padding: 4px;">467</td><td style="border: 1px solid #ccc; padding: 4px;">63.6%</td><td style="border: 1px solid #ccc; padding: 4px;">817</td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
              </tbody>
            </table>
          <p>Technical accuracy verification:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Accuracy check</td><td style="border: 1px solid #ccc; padding: 4px;">Result</td><td style="border: 1px solid #ccc; padding: 4px;">Details</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ES|QL syntax preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">FROM, WHERE, STATS, SORT, LIMIT identical</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Field names preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">account_id, sentiment_score, product_area verbatim</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Index names preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">support-tickets, salesforce-cases unchanged</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Case IDs preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">CASE-0012, CASE-0002 exact</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Account names preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Pinnacle Financial, United Oil Gas Corp exact</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Numeric values preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Sentiment scores -0.94, -0.88; days open 18, 7 exact</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Priority/status labels</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Critical, Escalated, Open verbatim</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Null values preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">null for low priority resolution hours retained</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error messages preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Tool validation errors quoted verbatim</td>
                    </tr>
              </tbody>
            </table>
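          <p>The token numbers above are straightforward to reproduce for your own prompts. Here is a rough TypeScript sketch; it uses a crude characters-per-token approximation rather than the tokenizer used in this evaluation, so treat its output as an estimate only:</p><pre><code>// Rough estimate only: ~4 characters per token is a common rule of thumb.
// Swap in a real tokenizer for exact counts.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

function compareModes(normal: string, caveman: string): string {
  const n = estimateTokens(normal)
  const c = estimateTokens(caveman)
  const pct = (((n - c) / n) * 100).toFixed(1)
  return `normal=${n} caveman=${c} saved=${n - c} (${pct}%)`
}

console.log(
  compareModes(
    "Of course! I'd be happy to help you see your indices...",
    'Indices: salesforce-cases, support-tickets',
  ),
)</code></pre>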
          <p><em>Zero information loss. 64% fewer tokens.</em></p><h2><strong>Why this matters for Elastic users</strong></h2><p>For teams building AI assistants on Elasticsearch, 64% token reduction means 64% savings on output costs at scale, faster streaming responses, and more context window space for actual data rather than filler. When you&#x27;re debugging an ES|QL query at 2 a.m., you don&#x27;t need an AI telling you it&#x27;s delighted to help; you just need the query response!</p><h2><strong>The bigger picture: Rethinking AI interfaces</strong></h2><p>This experiment reveals something fundamental: Conversational AI interfaces optimize for the wrong metric. They optimize for sounding human when users often just want accurate data, fast.</p><p>For technical workflows, especially data queries, there&#x27;s a strong case for mode-switching:</p><ul><li>Conversational mode: When exploring or learning.</li><li>Caveman mode: When you know what you want and need it now.</li></ul><p>The Elastic MCP server makes this possible by returning structured, accurate responses that work in both modes without modification.</p><h2><strong>How elastic-caveman works</strong></h2><p>elastic-caveman is an Agent Skill, that is, a markdown file with YAML front matter that any compatible AI agent reads and follows. No runtime. No binary. No API calls. Just instructions that reshape how your agent talks when working with Elasticsearch.</p><p>Install with:</p><pre><code>npx skills add srikolag/elastic-caveman</code></pre><p><strong>Supported agents: </strong>Claude Code, Cursor, Codex, Windsurf, GitHub Copilot, Gemini CLI, Roo</p><p><strong>Trigger with: </strong><code>/elastic-caveman</code></p><p><strong>Disable with: </strong>&quot;normal mode&quot; or &quot;verbose&quot;</p><h2><strong>Live in action</strong></h2><p>We tested elastic-caveman with the Claude model to measure its impact on token usage and cost:</p><ul><li><strong>With elastic-caveman:</strong> Token usage was 368 tokens (in) and 1.6k tokens (out), resulting in a cost of <strong>$0.11</strong>.</li><li><strong>Without elastic-caveman:</strong> Token usage was 367 tokens (in) and 1.8k tokens (out), resulting in a cost of <strong>$0.12</strong>.</li></ul><blockquote>Prompt: <em>Get me the critical support tickets from the support-tickets index in kibana for Pinnacle Financial</em></blockquote><p>This test demonstrates the efficiency of elastic-caveman.</p><a href="https://videos.elastic.co/watch/pRHKth67Mv5pGgqSvCdXiF">Watch video on Vidyard</a><h2><strong>What&#x27;s next</strong></h2><p>Caveman mode is just the beginning. Consider dynamic mode switching: Flip between concise and conversational mid-session. Or a hybrid approach: Lean on success, explanatory on errors. Or custom verbosity levels for teams that want something in between. 
The goal isn&#x27;t to make AI assistants robotic; it&#x27;s to give users control over the signal-to-noise ratio.</p><h2><strong>Try it yourself</strong></h2><p>Test caveman mode with your Elasticsearch data:</p><ol><li>Set up the Elastic MCP server.</li><li>Install elastic-caveman.</li><li>Run queries in both normal and caveman modes.</li><li>Compare token counts and accuracy.</li></ol><p>Full evaluation methodology and scripts are available in the <a href="https://github.com/srikolag/elastic-caveman">GitHub repo</a>.</p><h2><strong>The bottom line</strong></h2><p>Across eight real scenarios with live Elasticsearch data, elastic-caveman delivered 64% average token reduction with zero accuracy loss and 100% preservation of ES|QL syntax, field names, and technical values. Sometimes the best AI response isn&#x27;t the chattiest one. Sometimes you just need the data, and with elastic-caveman, you can get it 64% faster.</p><p>Want to optimize your Elasticsearch AI workflows? Check out <a href="https://www.elastic.co/search-labs">Elasticsearch Labs</a> for more tutorials, integrations, and research on building with Elasticsearch and AI. Ready to try it yourself? <a href="https://cloud.elastic.co/registration">Start building</a> with Elasticsearch today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Your Elasticsearch field mappings are lying to you — METRICS_INFO and TS_INFO tell the truth]]></title>
            <description><![CDATA[Two new Elasticsearch ES|QL commands that turn time series discovery into a single-line query: METRICS_INFO and TS_INFO tell you which metrics and series actually exist in your data, not just what the mapping declares.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-metrics-info-ts-info-time-series-catalog</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-metrics-info-ts-info-time-series-catalog</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Dmitry Leontyev]]></dc:creator>
            <pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/96a2ffa07e5b1b89999b3c3d466c113a0a7afc8b-1376x768.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Two new Elasticsearch ES|QL processing commands, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/metrics-info"><code>METRICS_INFO</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts-info"><code>TS_INFO</code></a>, tell you which metrics and time series actually have data for your current query context, not just what the mapping declares. Field mappings enumerate every field ever written; these commands return what&#x27;s actively ingested, typed, and queryable right now, scoped to your time window and filters. A single-line query against 1.84 billion documents and 1.4 million time series returns in ~4 seconds. Available GA in Elasticsearch 9.4 and Elastic Cloud Serverless.</p><h2>Why time series discovery matters</h2><p>Elasticsearch uses <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">time series data streams</a> (TSDS) to efficiently store metrics. TSDS is backed by a fully columnar store, and metrics stored in TSDS in Elasticsearch 9.4 require up to 17x less storage compared to using a standard index. Starting with Elasticsearch 9.2, we&#x27;ve also added <a href="https://www.elastic.co/search-labs/blog/esql-elasticsearch-9-2-multi-field-joins-ts-command#time-series-support-in-es%7Cql-(tech-preview)">time-series support in Elasticsearch Query Language (ES|QL) as a fully supported capability</a> when querying data stored in TSDS.</p><p>If you operate TSDS in Elasticsearch, you already know the pattern: <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-dimension">dimensions</a> identify a series, <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-metric">metrics</a> carry typed values like <code>gauge</code> or <code>counter</code>, and the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code></a> <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts">source command</a> in ES|QL enables <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">time series aggregation functions</a> such as <code>RATE</code> and <code>AVG_OVER_TIME</code>.</p><p>What that pipeline can&#x27;t tell you (but you need to know just as often) is <strong>which metrics and time series actually exist right now, for the slice of data you care about</strong>. Field mappings enumerate every field that was ever declared; they don&#x27;t show what&#x27;s actively being ingested in a specific cluster, environment, or time window. That gap shows up across very different workflows:</p><ul><li><strong>Dashboard building.</strong> Metric and dimension pickers should reflect what the cluster currently holds for the user&#x27;s filters, not every field that has ever been mapped. Otherwise, dropdowns stay cluttered with stale options and panels render empty.</li><li><strong>Onboarding to an unfamiliar TSDS.</strong> A new cluster, a new integration, a customer&#x27;s data. 
A quick list of the metrics being ingested, with their types, units, and applicable dimensions, replaces hours of mapping spelunking and ad hoc probe queries.</li><li><strong>Data quality investigations.</strong> Mapping drift (the same metric declared <code>gauge</code> in one backing index and <code>counter</code> in another) and dimension-cardinality explosions both surface immediately in the catalog output.</li><li><strong>Query validation.</strong> Before running an expensive <code>TS ... | STATS</code> aggregation, confirm that the metric and dimensions you&#x27;re about to use really have data in your window.</li></ul><p>Kibana already relies on this internally. The <a href="https://www.elastic.co/docs/solutions/observability/infra-and-hosts/discover-metrics">dynamic metrics catalog</a> in the observability experience appends <code>METRICS_INFO</code> to the user&#x27;s active <code>TS</code> query so the UI only offers metrics that truly exist for the current filters, rather than every field in the mapping.</p><h2>The problem: Mappings are an inventory of fields, not time series</h2><p>Operations teams routinely need answers to questions that mapping APIs alone cannot answer:</p><ul><li>Which metrics actually have data <strong>in this environment</strong>, <strong>for this cluster</strong>, <strong>in this time range</strong>?</li><li>How are those metrics typed, and which dimensions apply when building or validating a query?</li><li>How many distinct time series exist per metric?</li></ul><p>Until now, answering these questions meant piecing together mapping APIs, ad hoc queries, and guesswork. <code>METRICS_INFO</code> and <code>TS_INFO</code> turn those questions into single-line ES|QL queries that fit naturally into the same pipeline you use for <code>STATS</code>:</p><pre><code>TS k8s
| WHERE cluster == "prod"
| METRICS_INFO
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
              </tbody>
            </table>
          <h2>How these commands integrate with ES|QL pipelined queries</h2><p>Both commands are <strong>processing commands</strong>. Once you run one, the table is replaced: Downstream commands, like <code>KEEP</code>, <code>WHERE</code>, or <code>STATS</code>, operate on <strong>metadata rows</strong>, not the original time series documents.</p><p>A few rules to keep in mind:</p><ul><li>They apply only after a <strong><code>TS</code></strong> source. Using them after <code>FROM</code> or without a preceding <code>TS</code> source produces an error.</li><li>They must appear <strong>before</strong> any <code>STATS</code>, <code>SORT</code>, or <code>LIMIT</code> that runs on the <strong>time series rows</strong> returned by <code>TS</code>. For example, <code>TS ... | STATS ... | METRICS_INFO</code> is invalid; <code>TS ... | METRICS_INFO | STATS ...</code> is valid because <code>STATS</code> then runs on the metadata table.</li><li>You can filter and aggregate <em>after</em> <code>METRICS_INFO</code> or <code>TS_INFO</code> on the metadata columns with the usual processing commands.</li><li>You can include filters <strong>before</strong> them, for example, narrowing by <code>@timestamp</code> or dimensions, so that the produced metadata reflects series that match your query context, not the entire index.</li></ul><p>Conceptually, the pipeline looks like this:</p><pre><code>TS + filters  →  METRICS_INFO or TS_INFO  →  KEEP / WHERE on metadata  →  STATS / SORT / LIMIT</code></pre><p>This design means you can scope a catalog to exactly the slice of data you care about and then post-process the result with more ES|QL commands as desired.</p><h2>How to use METRICS_INFO and TS_INFO in practice</h2><p><strong><code>METRICS_INFO</code></strong> retrieves information about the metrics available in your time series data streams, together with applicable dimensions and other metadata, all scoped to the current <code>TS</code> query. <strong><code>TS_INFO</code></strong> does the same for <strong>individual time series</strong>. Each row is one metric plus the dimension values that identify one series.</p><p>Each command offers a different view of time series metadata: <strong><code>METRICS_INFO</code></strong> collapses what you see into <strong>one row per distinct metric signature</strong>: the metric name plus how it&#x27;s declared (type, unit, field type, which dimension fields apply) as observed across backing indices. <strong><code>TS_INFO</code></strong> emits <strong>one row per metric and time series</strong>, with a <code>dimensions</code> column that holds the concrete label set for each series, formatted as a JSON object (for instance, <code>{&quot;job&quot;:&quot;elasticsearch&quot;,&quot;instance&quot;:&quot;instance_1&quot;}</code>).</p><p>If the same logical metric name shows up with incompatible metadata in different places, you get multiple rows or multi-valued cells. That&#x27;s a useful signal when you&#x27;re tracking down mapping drift.</p><p>Both commands expose the same core columns; only <code>TS_INFO</code> adds <code>dimensions</code>.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Column</td><td style="border: 1px solid #ccc; padding: 4px;">Meaning</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">Name of the metric.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">Data stream(s) that contain this metric; multi-valued when it spans multiple data streams.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">Unit declared in the mapping (e.g., bytes); multi-valued when definitions differ across backing indices; may be null.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">Types such as gauge or counter; multi-valued when definitions differ across backing indices.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch field type (long, double, ...); multi-valued when definitions differ across backing indices.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td><td style="border: 1px solid #ccc; padding: 4px;">Dimension field names for this metric (multi-valued): the union of dimension keys across all time series for that metric.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">dimensions</td><td style="border: 1px solid #ccc; padding: 4px;">TS_INFO only. JSON-encoded dimension key/value pairs that identify one time series.</td>
                    </tr>
              </tbody>
            </table>
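          <p>Before diving into examples, here is a minimal sketch of the ordering rule above: the invalid form is shown as a comment, and only the second form runs, because <code>STATS</code> then operates on the metadata table that <code>METRICS_INFO</code> produced rather than on raw time series rows:</p><pre><code>// Invalid: METRICS_INFO cannot follow STATS, which has already
// aggregated away the time series rows
// TS k8s | STATS docs = COUNT(*) | METRICS_INFO

// Valid: STATS aggregates the metadata rows from METRICS_INFO
TS k8s
| METRICS_INFO
| STATS metrics = COUNT_DISTINCT(metric_name)</code></pre>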
          <p>Start with a catalog of names and types. The smallest useful query is a <code>TS</code> source, <code>METRICS_INFO</code>, and a sort so the table is easy to scan:</p><pre><code>TS k8s
| METRICS_INFO
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
              </tbody>
            </table>
          <p>You can post-process the result as usual in ES|QL. For instance, you can trim columns or filter on metadata before aggregating:</p><pre><code>TS k8s
| WHERE cluster == "prod" AND TRANGE(1d)
| METRICS_INFO
| KEEP metric_name, metric_type
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td>
                    </tr>
              </tbody>
            </table>
          <p>To find <strong>how many</strong> distinct metric names match a pattern (not <strong>which</strong> series), combine <code>METRICS_INFO</code> with <code>STATS</code>:</p><pre><code>TS k8s
| METRICS_INFO
| WHERE metric_name LIKE "network.total*"
| STATS matching_metrics = COUNT_DISTINCT(metric_name)</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">matching_metrics</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td>
                    </tr>
              </tbody>
            </table>
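          <p>The same post-processing works with ES|QL&#x27;s multi-value functions. As a minimal sketch using the standard <code>MV_COUNT</code> function, this surfaces the mapping drift discussed earlier: metrics whose type is declared differently across backing indices come back with a multi-valued <code>metric_type</code>:</p><pre><code>TS k8s
| METRICS_INFO
| WHERE MV_COUNT(metric_type) > 1
| KEEP metric_name, data_stream, metric_type</code></pre>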
          <p>Document predicates <strong>before</strong> the catalog command narrow the catalog to time series with data samples that actually exist in your window. The metrics listed are those with matching data, not every field that has ever been mapped:</p><pre><code>TS k8s
| WHERE cluster == "prod" AND TRANGE(1d)
| METRICS_INFO
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
              </tbody>
            </table>
          <p>Run the <strong>same</strong> scoped pipeline, but swap the middle command for <code>TS_INFO</code>, and the question shifts from “which metrics match” to “which time series identities match”. Each row is one metric plus one combination of dimension values; sort on <code>metric_name</code> and <code>dimensions</code> so related series group together:</p><pre><code>TS k8s
| WHERE cluster == "prod" AND TRANGE(1d)
| TS_INFO
| SORT metric_name, dimensions</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td><td style="border: 1px solid #ccc; padding: 4px;">dimensions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
              </tbody>
            </table>
          <p>That extra column can be used to deduce metric cardinality. Each <code>TS_INFO</code> row is one time series for a given metric, so grouping with <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS</code></a> counts <strong>how many distinct time series exist per metric</strong>:</p><pre><code>TS k8s
| TS_INFO
| STATS series_count = COUNT(*) BY metric_name
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">series_count</td><td style="border: 1px solid #ccc; padding: 4px;">metric_name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td>
                    </tr>
              </tbody>
            </table>
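          <p>Because each <code>dimensions</code> cell is a JSON-encoded string, you can also narrow the series list with ordinary string matching before counting. A sketch, assuming the column can be matched as text and a <code>pod</code> dimension as in the tables above:</p><pre><code>TS k8s
| TS_INFO
| WHERE dimensions LIKE """*"pod":"one"*"""
| STATS series_count = COUNT(*) BY metric_name</code></pre>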
          <p><strong>Choosing between them:</strong> Use <strong><code>METRICS_INFO</code></strong> when you want a compact inventory of metric names and types in the filtered <code>TS</code> context. Use <strong><code>TS_INFO</code></strong> when you need label combinations or per-metric series counts. In practice, skim with <code>METRICS_INFO</code> and then switch to <code>TS_INFO</code> when the answer depends on <strong>which</strong> dimensions apply, not just <strong>what</strong> metrics exist.</p><h2>Under the hood: How the commands are executed</h2><p>Both <code>METRICS_INFO</code> and <code>TS_INFO</code> run inside the same distributed ES|QL execution that powers any <code>TS</code> query. In addition to standard features, like shard-level parallelism, Lucene filter pushdown, and coordinator-side merging, special care has been taken during implementation so that the cost scales with the number of matching time series, not the number of documents. Here&#x27;s how each output row gets produced:</p><p><strong>1. The <code>TS</code> command defines the scope.</strong> <code>TS</code> resolves your data stream pattern to its TSDS <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-bound-tsds">backing indices</a> and turns any filters you place <strong>before</strong> the catalog command, such as a time range on <code>@timestamp</code> or dimension predicates in <code>WHERE</code>, into a Lucene query that runs on every shard that can match. Shards in backing indices outside the time window are pruned up front and never touched.</p><p><strong>2. Each shard iterates over matching documents and tracks one per series.</strong> A TSDS index is physically sorted by <code>_tsid</code> first, then by <code>@timestamp</code> (descending). That sort matters here: All documents belonging to the same time series sit next to each other on disk, so as a shard processes documents in order, it only needs to keep the first document it sees for each new <code>_tsid</code> and skip the rest. The result is one representative document per time series that has at least one document matching your filters.</p><p><strong>3. The mapping tells us what each field is.</strong> The backing index mapping is the source of truth for the metadata that describes each field:</p><ul><li>Fields declared with <code>time_series_metric</code> are metrics, and the mapping carries each metric&#x27;s <code>metric_type</code>, <code>field_type</code>, and (if declared) <code>meta.unit</code>.</li><li>Fields declared with <code>time_series_dimension</code> are dimensions; their paths determine which keys can appear in <code>dimension_fields</code> and <code>dimensions</code>.</li></ul><p><strong>4. Synthetic source fills in the actual dimension and metric presence.</strong> For the one representative document per series, the shard reads a subset of <code>_source</code> containing only the dimension (and metric) paths the mapping declares. TSDS uses <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-source-field#synthetic-source">synthetic <code>_source</code></a>, so that subset is reconstructed primarily from doc values; no stored <code>_source</code> is needed.
From that reconstructed sliver of JSON, the shard learns two things:</p><ul><li>The dimension key/value pairs for this series (the <code>dimensions</code> JSON for <code>TS_INFO</code>, and the set of dimension keys that feed <code>dimension_fields</code> for both commands).</li><li>Which metric fields actually have data for this series in this backing index.</li></ul><p><strong>5. Partial aggregation happens inside each shard.</strong> Shards don&#x27;t ship raw per-series rows upstream. They partially aggregate first, which is a big part of why catalog queries stay cheap.</p><p><strong>6. The coordinator merges across shards and data streams.</strong> Each data node first reduces its own shards&#x27; partial results and streams them to the coordinator, which applies the same merge logic one more time.</p><p><strong>7. The rest of the pipeline runs as usual.</strong> Everything after the catalog command (<code>KEEP</code>, <code>WHERE</code>, <code>STATS</code>, <code>SORT</code>, <code>LIMIT</code>) runs against this consolidated metadata table on the coordinator, exactly like any other ES|QL stage.</p><p>The net effect is that catalog queries do just enough work to identify one representative document per series, read a small reconstructed slice of that document, classify its fields against the mapping, and fold the results into a handful of metadata rows. Because the output cardinality is bounded by the number of matching series (for <code>TS_INFO</code>) or by the number of distinct metric signatures (for <code>METRICS_INFO</code>), not by the number of documents in the window, these commands stay responsive even against long retention windows and high-ingest data streams.</p><p>Against the full high-cardinality TSDB benchmark corpus without a time range filter (1.84 B documents / 1.4 M time series / 2.77 TB uncompressed), on a single-node Elasticsearch cluster (AWS c8gd.8xlarge, 24 cores, 24 GiB heap, NVMe SSD, 3 primary shards, force-merged), <code>METRICS_INFO</code> returns in ~4 seconds.</p><h2>Beyond ad hoc queries</h2><p>These commands also support product workflows inside Kibana. The UI appends <code>METRICS_INFO</code> to a user&#x27;s <code>TS</code> query (when the query doesn&#x27;t already include <code>STATS</code>) <a href="https://www.elastic.co/docs/solutions/observability/infra-and-hosts/discover-metrics">to build a metric catalog</a> aligned with the user&#x27;s filters, rather than relying solely on mappings.</p><p>These new commands are also the foundation of Prometheus-compatible metadata APIs we&#x27;re adding to Elasticsearch, which Prometheus-based tools can use. Stay tuned for a dedicated blog post that goes into more detail.</p><h2>Data quality</h2><p>A multi-valued <code>unit</code>, <code>metric_type</code>, or <code>field_type</code> in the <code>METRICS_INFO</code> output is a concise warning that backing indices aren&#x27;t aligned on a metric&#x27;s definition. <code>TS_INFO</code> makes it easier to see whether an explosion in cardinality comes from a handful of metrics or from dimension cardinality you should account for in alerts and aggregations. For example, ranking metrics by series count surfaces outliers at a glance:</p><pre><code>TS k8s
| TS_INFO
| STATS series_count = COUNT(*) BY metric_name
| SORT series_count DESC</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">series_count</td><td style="border: 1px solid #ccc; padding: 4px;">metric_name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">12000</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td>
                    </tr>
              </tbody>
            </table>
          <p>When a single metric dwarfs the rest, as <code>network.eth0.rx</code> does above, the explosion is concentrated in <strong>a handful of metrics</strong>, and inspecting that metric&#x27;s <code>dimensions</code> pinpoints which label is growing. Comparable counts across metrics instead point at <strong>shared dimension cardinality</strong>, such as a newly introduced <code>pod</code> or <code>instance</code> value propagating through every time series.</p><h2>Availability</h2><p><strong><code>METRICS_INFO</code></strong> and <strong><code>TS_INFO</code></strong> are <strong>generally available</strong> in <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and, starting with the 9.4.0 release, in Elasticsearch at the Basic license tier.</p><p>For command pages (syntax, restrictions, and examples), see <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/metrics-info"><code>METRICS_INFO</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts-info"><code>TS_INFO</code></a>.</p><p>For background on TSDS and the <code>TS</code> command itself, start with the official documentation on <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">time series data streams</a> and the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code> source command</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stop guessing which query is burning your cluster: Query activity in Kibana]]></title>
            <description><![CDATA[Pinpoint long-running Elasticsearch searches from Kibana: live tasks, origin context, and cancel when the cluster allows without living in low-level APIs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-query-activity-long-running-searches</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-query-activity-long-running-searches</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valentin Crettaz]]></dc:creator>
            <pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3515b0a19116ad940210e4fe43026c8e6350b0d7-1536x1024.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Query activity in <a href="https://www.elastic.co/docs/deploy-manage/monitor/query-activity">Kibana</a> gives you a live list of <strong>in-flight search tasks</strong> in Elasticsearch (ES|QL, DSL, SQL, EQL), so <strong>in about a minute</strong> you can move from “something’s slow” to which query, how long it’s been running, and whether you can cancel it. It builds on the Tasks Management API, but it’s the operational view you can show in a war room. If you’re on Elastic Cloud Serverless, it’s available now; on Elastic Cloud Hosted and self-managed, it ships with Kibana 9.4.</p><h2>When “something is slow” finally has an answer</h2><p>Query activity is already available in your Elastic Cloud Serverless project today. For Elastic Cloud Hosted and Elastic Self-Managed clusters, it ships with Kibana 9.4 and is available across all deployments and clusters on that version. Query activity is the Kibana view on top of the Tasks Management API in Elasticsearch. It’s purpose-built for search-related tasks in any query language, including Elasticsearch Query Language (ES|QL), DSL, SQL, and Event Query Language (EQL).</p><p>It always starts the same way. Someone pings you on a Friday: <em>Discover feels stuck.</em> <em>The exec dashboard won’t load.</em> <em>Did we change something?</em> You open your monitoring tabs, squint at CPU, maybe tail a log, and you’re still guessing. Is it one giant ES|QL pipeline? A dashboard nobody remembers? A background rule doing honest work at the worst possible time? The cluster isn’t mysterious on purpose. In-flight work is simply invisible unless you enjoy living in Dev Tools and reconstructing life stories from task IDs and snippets of JSON.</p><p>We built Query activity for everyone who has ever muttered, <em>Just tell me what’s running.</em> It’s a new screen in Kibana that lists active search work in ES|QL, DSL, SQL, or EQL. It shows the queries that are consuming your cluster right now, with enough context to move from <strong>panic to diagnosis</strong> without a scavenger hunt.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33ab7fc484605af4caa2c21913364159a2f23014-1999x775.png" style="width: 100%;" /><h2>The play you know and the one-minute rewrite</h2><p>If you’ve operated Elasticsearch for more than a week, you’ve lived the old script. In <strong>Act I</strong>, someone says the cluster feels slow. In <strong>Act II</strong>, you scatter across shards, heap, slow logs, and sticky-note task IDs. Hours pass, and you still cannot name the query. In <strong>Act III</strong>, maybe you find the culprit before dinner, or maybe next month Act I opens again with the same villain in a fake mustache.</p><p>Query activity replaces that meandering Act II with one default sequence. It’s the same story, compressed from symptom to evidence to origin to action in about a minute. Paste this into your runbook or send it to your on-call channel. It’s the whole innovation in practice.</p><p>1. <strong>Open Query activity</strong> as soon as Act I hits. On Elastic Cloud Hosted and Elastic Self-Managed clusters, go to <strong>Stack Management</strong> and then <strong>Cluster performance</strong>. On Elastic Cloud Serverless, go to <strong>Admin and Settings</strong> and then <strong>Project performance</strong>. Do this before you fork into guesswork.</p><p>2. 
<strong>Refresh the list of queries</strong> once so you’re looking at <strong>right now</strong>, not five minutes ago.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbfe0ba1e09916331c148ca791f1f7d9b0b40985-1999x331.png" style="width: 100%;" /><p>3. <strong>Surface the pressure.</strong> Sort by run time, or tighten the “Run time” filter, until the expensive work floats to the top.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b7d487d2ce4496f7fe306d4c7e2527ac60699d05-764x448.png" style="width: 100%;" /><p>4. <strong>Open the flyout</strong> on the worst offender. You’ll see duration, query type, index breadth, and full query text. That’s your evidence without opening Dev Tools.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7e62154a314a53e705b0faf9057a941e108fd610-1326x1582.png" style="width: 100%;" /><p>5.<strong> Name the owner.</strong> Use <code>trace.id</code> to jump into <strong>Discover</strong> and filter on the audit or query logs for that trace, or use <code>X-Opaque-Id</code> to figure out which dashboard, saved search, or rule this query originates from.</p><p>6. <strong>Resolve Act III.</strong> Let the query finish, fix the upstream, or <strong>cancel</strong> when it’s appropriate and Elasticsearch says the task is cancelable.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3830fdbe460fb947323f35f0626a0fd76beacf41-1434x383.png" style="width: 100%;" /><p>That’s one pass through what used to be three acts. You get attribution instead of folklore and decisions instead of theater.</p><h2>Query activity deep dive</h2><p>The one-minute sequence above is the habit. What follows is the machinery: the concrete controls and signals in Kibana that make that rewrite possible. You get what’s executing, how long it’s been running, where it came from, and what to do next, without stitching clues across tabs.</p><p>Under the hood, this view is powered by Elasticsearch’s Tasks Management API for long-running search tasks. It’s translated into an operational UI that’s built for speed. You can find the outlier quickly, inspect rich details, and act with confidence.</p><p>Here’s how the UI backs each beat of the runbook.</p><p>The main view is a filterable list of running queries. It includes a search bar so you can match anything in the table, including task ID. You also get filters for run time, query language, and source (for example Discover, Dashboard, and similar surfaces). You stay in control of what “noisy” means.</p><p>Refresh is manual on purpose. The table does not auto-refresh. You click <strong>Refresh </strong>when you’re ready, and the UI shows when the last refresh happened. You shouldn’t have to wonder whether the list is stale.</p><p>When you click a task ID, a detail flyout opens. It shows start time, run time, how many indices the query touches, and full query text. When <code>X-Opaque-Id</code> is present, it can help you <a href="https://www.elastic.co/docs/troubleshoot/kibana/trace-elasticsearch-query-to-the-origin-in-kibana">trace an Elasticsearch query to its origin in Kibana</a> so you can turn “mystery load” into “that dashboard, that version.” Previous and next navigation let you walk the queue without jumping back to the list. When <code>trace.id</code> is available, you can open Discover with that trace prefiltered. 
That helps when the incident channel is already busy.</p><p>Where work is cancelable, you can request cancellation from the list or the flyout. There’s a deliberate confirmation step. After you confirm, the cancel control shows a spinner until Elasticsearch reports that the task actually stopped. The goal is oops-proof, not oops-fast.</p><p>Viewing and managing active query work requires appropriate cluster privileges. The UI states clearly when something is missing. For example, users without <code>cluster:manage</code> may not be able to take destructive actions. Users without <code>cluster:monitoring</code> may not see task details. You shouldn’t get a blank screen that feels like the stack is playing hide-and-seek.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4c5a8f7aa5187029ed705f4a7affa9424f33c983-1999x1027.png" style="width: 100%;" /><p>If you’ve been following our broader story around query observability, this is the live side of the house. It’s what is happening now, in the product, with controls you can use. Over time, pair it with historical views, such as query logs and <a href="https://www.elastic.co/search-labs/blog/slow-search-elasticsearch-query-autoops">AutoOps long-running search tasks insights</a> when you need to ask whether this has happened before. When you need to answer what’s eating your cluster in this minute, start with the new Query activity UI in Kibana.</p><h2>Who this is for (and who becomes the hero)</h2><p>Cluster and platform admins get the obvious win: <strong>faster incident response</strong> and less time translating APIs into narratives for stakeholders.</p><p>Centers of excellence and internal search champions get something equally valuable: a teachable moment you can screenshot. This is the query pattern that blew up shared capacity. This is what “interactive” versus “background” pressure looks like when everyone is busy.</p><p>Anyone on the hook for Service Level Agreements (SLAs) gets a cleaner bridge between user experience (“the app is slow”) and search reality (“these three requests are still running, and one of them is huge”).</p><p>You don’t have to be the person who wrote the query to be the person who explains the cluster calmly. That’s the whole point.</p><p>Not every task is cancelable, and deep tuning work still has its place. Query activity doesn’t fix your queries for you. It surfaces in seconds which ones might need attention, and it gives you faster evidence, clearer attribution, and better decisions before you reach for heavier tools.</p><h2>Where to find it</h2><p>You’ll find Query activity in the performance area of each deployment model. In Elastic Cloud Hosted and Elastic Self-Managed clusters, open <code>Stack Management</code> and then <code>Cluster performance</code>. In Serverless projects, open <code>Admin and Settings</code> and then <code>Project performance</code>.</p><p>Threshold hygiene: Open <code>Stack Management</code> and then <code>Advanced Settings</code>. The <code>running_queries:minRunningTime</code> setting defaults to <code>100 ms</code>. Only tasks that have been running longer than that appear. That way you can dial through noise without drowning in instantaneous work.</p><h2>What to do next</h2><p>Walk through the six-step sequence once when the cluster is calm. When Act I hits, you won’t be learning a new UI under pressure. Then repeat it during the next slow moment. 
The gap between assuming and seeing is the whole product story.</p><p>If you aren’t on Elastic Cloud yet, you can still get hands-on with the stack at <a href="https://cloud.elastic.co/registration">cloud.elastic.co/registration</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerating merchandising improvements with a governed control plane]]></title>
            <description><![CDATA[Search behavior changes shouldn't require an engineering ticket. Learn how a governed control plane lets business teams update search policies in hours, without deployments, without risk.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dc802d2ca828ba41d6ff2a0ea1ba67eb0e3bcd10-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval">Part 1</a> in this blog series established why ecommerce search needs a governance layer between the user&#x27;s query and the retrieval engine that classifies intent, enforces business constraints, and routes to the appropriate retrieval strategy. The natural next questions are: Who operates that layer, and how fast can they move?</p><p>This post answers those questions. A governed control plane doesn&#x27;t just improve search relevance; it changes the operating model. It moves search behavior changes from engineering deployment cycles to business-driven workflows, without sacrificing safety or accountability.</p><h2>The scenario that exposes the operating model</h2><p>Imagine that you’re in the weeks leading up to Christmas, and your merchandising team has identified three urgent changes that must immediately be made to search behavior:</p><ul><li><strong>Campaign launch.</strong> Due to an ordering error, there’s an oversupply of in-house branded turkeys. Therefore any query for &quot;turkey&quot; must boost the in-house brand.</li><li><strong>Product recall.</strong> A supplier has recalled a product line. Those products must stop appearing in results for any query that would otherwise surface them.</li><li><strong>Seasonal reinterpretation.</strong> Queries for &quot;stocking&quot; are returning women&#x27;s hosiery and tights. During the holiday season, &quot;stocking&quot; should resolve to Christmas stockings and stocking stuffers. Once the season ends, the policy can be reverted in minutes.</li></ul><p>Under the traditional operating model, where search logic is embedded in application code, each of these changes requires an engineering ticket, a code change, a review cycle, a staging deployment, and a production release. In organizations with conservative release processes, that&#x27;s a timeline measured in weeks, not hours or minutes. The Christmas shopping window closes before engineering can ship the necessary modifications.</p><p>The bottleneck isn’t the retrieval engine; it’s the operating model. The core challenge is that business intent cannot be translated into search behavior without engineering acting as a constant intermediary, turning every strategic pivot into a technical ticket.</p><h2>The anti-pattern: Search logic in application code</h2><p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval">Part 1</a> described how search logic embedded in application code can turn into a &quot;spaghetti&quot; implementation, which creates operational friction. Here’s what that friction looks like at scale. What starts as a few targeted overrides, a filter here, a boost there, grows over time into tens of thousands of lines of if/else branching, regex patterns, and conditional query modifications. This creates problems beyond just technical debt:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fc4d7ea5545512667552af429023fcd7fb316e82-1408x768.png" style="width: 100%;" /><p>This model introduces four systemic frictions that hinder both organizational speed and system scalability:</p><p><strong>Coupling.</strong> Business strategy changes daily. Application infrastructure should remain highly stable.
When both live in the same codebase, a merchandiser&#x27;s request to boost a seasonal product becomes a deployment risk, and a scoring function refactor can silently break a campaign.</p><p><strong>Latency (organizational and computational).</strong> A single query behavior change can require a six-week deployment cycle: ticket, investigation, code change, review, staging, release. Furthermore, the application layer lacks any indexing mechanism to efficiently determine which policies apply to a given query, so policy evaluation often adds meaningful latency at query time as the system walks through sequential if/else checks.</p><p><strong>Accountability gaps.</strong> When results change unexpectedly, nobody can quickly answer <em>why</em>. Was it a synonym update? A scoring change? A new filter added three releases ago? When business logic is distributed across thousands of lines of application code, shipped by different teams across different releases, tracing a relevance change back to its root cause becomes an archaeology project.</p><p><strong>Misallocated engineering.</strong> This model turns skilled software engineers into full-time relevance mechanics. Instead of building platform capabilities, they spend their cycles translating merchandising requests into code changes and debugging interactions and conflicts between hard-coded business policies.</p><h2>The paradigm shift: Policies as data</h2><p>The solution is to decouple business policies from application code entirely. Instead of hard-coding query modifications in middleware, store governed policies as structured documents, each one expressing a discrete business intent, and evaluate them at query time in a dedicated governed control plane layer.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/13d47992fb1d1f3f3887f3800f5ddc83742e9c9c-1408x768.png" style="width: 100%;" /><p>A policy is a first-class data object. It has match criteria (when should this policy fire?), an action (what should it do?), a priority (how does it interact with other policies?), and metadata (a title and a description). The control plane evaluates matching policies, resolves conflicts deterministically, and produces an execution plan including constraints, boosts, and routing decisions that Elasticsearch executes against a product catalog.</p><p>For each additional search requirement, the application code doesn&#x27;t change. The retrieval engine doesn&#x27;t change. What changes is that business decisions are no longer encoded in code. They live in a policy index as data that can be updated without a deployment.</p><p>This changes your org chart, not just your query.</p><h2>Policies vs. triggers vs. rules</h2><p>A note on terminology used in this series: a <em>policy</em> refers to this complete governed document, including a trigger (match criteria), rule (action), priority, enabled/disabled, and metadata. A <em>trigger</em> refers to the matching criteria that determines when this policy fires, and a <em>rule</em> refers specifically to the action inside the policy, such as applying a filter or changing the retrieval strategy.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/281e0fb915a7723a5619b5bd08f855abb4e2c530-966x1412.png" style="width: 100%;" /><h2>The workflow: Author → Test → Promote</h2><p>Moving policies out of code and into data opens the door for business-driven search management. But enabling non-technical teams to alter search behavior requires strict operational guardrails. 
The goal is fast and safe iteration with governance.</p><p>To empower non-technical teams to modify search behavior with confidence, we suggest a workflow built around three core stages, Author, Test, and Promote, supported by a review gate before promotion and an instant disable path after it. Let’s examine each component in detail.</p><p><strong>Author.</strong> A merchandiser creates a policy using structured fields: what the policy should match, what action it should take, and at what priority. The interface guides the business user through what’s expressible.</p><p><a href="https://www.elastic.co/consulting">Elastic Services</a> has built and deployed a governed framework for enterprise ecommerce customers, which has an admin UI that looks as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c8cd24f24320f5661800e34686719bc7d6c78e2-1005x959.png" style="width: 100%;" /><p><strong>Test.</strong> The policy is validated in a non-production environment where the merchandiser can run representative queries and verify that the policy produces the expected behavior, including how it interacts with other active policies. Because the control plane infrastructure is identical across environments, what works in the test environment will work in production.</p><p><strong>Review.</strong> Before a policy is promoted to production, it passes through review. Depending on the organization&#x27;s risk tolerance, this might be a peer review from another merchandiser, an approval from a search lead, or an automated validation that checks for conflicts with existing policies.</p><p><strong>Promote.</strong> Once approved, the policy is promoted to the production policy index. It takes effect on the next query: no code deployment, no engineering release, no staging build. The entire promotion is a data operation: the same JSON document, moved to a different index.</p><p><strong>Disable.</strong> If a production policy produces unexpected behavior, it can be disabled immediately without engineering involvement. Disabling removes the policy from query evaluation instantly, without affecting any other policy in the system.</p><p>This is the &quot;zero-deploy&quot; promise. It doesn&#x27;t mean &quot;no process.&quot; It means the process operates on <em>policy data</em>, not application code. This distinction compresses the change cycle from weeks to hours or minutes.</p>
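<p>To make &quot;policies as data&quot; tangible, here is a hypothetical sketch of what such a policy document might look like once authored. The field names are illustrative, not a published schema, but they mirror the anatomy described above: a trigger, a rule, a priority, an enabled flag, and metadata:</p><pre><code>{
  "title": "Holiday stocking reinterpretation",
  "description": "During the holiday season, resolve 'stocking' to Christmas stockings.",
  "enabled": true,
  "priority": 100,
  "trigger": { "query_terms": ["stocking", "stockings"] },
  "rule": { "action": "boost", "field": "category", "value": "christmas-stockings" }
}</code></pre><p>Under this sketch, promoting the policy amounts to indexing the document into the production policy index, and rolling it back is a one-field update to <code>enabled</code>.</p>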
<h2>Why &quot;zero-deploy&quot; matters for revenue-critical queries</h2><p>The economics of ecommerce search are asymmetric. A small number of high-volume queries (&quot;milk,&quot; &quot;bread,&quot; &quot;oranges,&quot; &quot;diapers&quot;) drive a disproportionate share of revenue. When one of these queries returns unexpected results, the cost is immediate and measurable: Conversion drops, customer complaints spike, and the merchandising team opens an urgent ticket.</p><p>Under the traditional model, the response cycle is:</p><ol><li>The merchant notices the problem.</li><li>The merchandiser files a ticket with engineering.</li><li>Engineering investigates, identifies the cause, and writes a fix.</li><li>The fix goes through code review, staging, and release.</li><li>Production is updated.</li></ol><p>Depending on the organization, steps 2 through 5 may take weeks. For a revenue-critical query during a peak sales period, that latency costs money.</p><p>Under a governed control plane, the response cycle compresses:</p><ol><li>The merchant notices the problem.</li><li>The merchandiser drafts a policy fix (or modifies an existing policy).</li><li>The policy goes through review and is published.</li><li>The fix is live.</li></ol><p>The difference isn&#x27;t just speed. It&#x27;s ownership. The person closest to the business context (the merchandiser who understands why &quot;oranges&quot; should resolve to produce, not beverages) is the person making the change. Engineering is freed from the daily merchandising loop to focus on the platform. This shift also unlocks something that&#x27;s nearly impossible under the traditional model: attributing search performance changes to specific business decisions.</p><h2>Measurability: Which policy moved conversion</h2><p>When policies are discrete, versioned documents stored in an Elasticsearch index, each one becomes independently deployable, and its impact can therefore be measured on its own. You can answer questions that are nearly impossible to answer when business logic is scattered across application code:</p><ul><li>Did the &quot;cheap laptops&quot; price threshold policy improve conversion for that query class, or did it suppress it?</li><li>What was the click-through rate impact of the holiday campaign boost?</li><li>When we rolled back the &quot;oranges&quot; category constraint last Thursday, what happened to add-to-cart rates?</li></ul><p>This turns search governance into a data-driven discipline. Instead of vague &quot;relevance tuning,&quot; where a release contains a dozen changes and nobody can attribute the outcome, you get measurable, attributable impact per policy. Merchandisers can iterate with evidence. Engineers can evaluate whether a policy schema change produced the expected downstream effect. Leadership can see which policies are driving revenue and which are inert.</p><h2>What this means for each role</h2><h3>For merchandisers and business users</h3><p>Search behavior becomes something you can directly influence through structured policies without understanding Elasticsearch syntax or scoring algorithms. You can see what policies are triggered for a given query to understand why it produces specific results, and make changes within hours instead of weeks. The same policy mechanism also supports sponsored product placement: A merchandiser can create a boost policy that elevates a product or brand and flags it for a &#x27;Sponsored&#x27; indicator in the UI, without requiring engineering involvement or additional infrastructure.</p><h3>For search engineers</h3><p>The control plane separates two concerns that are currently entangled: retrieval optimization and business logic. Instead of maintaining tens of thousands of lines of application code that encodes business decisions, you maintain the retrieval engine and the control plane infrastructure. When a merchandiser needs a new campaign boost, they don&#x27;t need engineering to write it.</p><p>This doesn&#x27;t eliminate engineering involvement. Engineers design the policy schema, maintain the control plane, set guardrails on what policies can express, add new capabilities as required, and handle edge cases that fall outside the policy framework.
But the day-to-day operational cadence of modifying query behavior shifts to the people who own the business context.</p><h3>For site reliability engineers and platform teams</h3><p>Because policies are structured documents rather than application code, they fit naturally into existing operational workflows. Policies can be stored in version control, reviewed through pull requests, and deployed through the same continuous integration and continuous deployment (CI/CD) pipelines the team already uses. Conflicts between policies are detected and resolved deterministically at query time through the control plane&#x27;s priority system, not through unpredictable interactions between code branches shipped in different releases.</p><p>When something does go wrong, diagnosing the cause is straightforward: Policies are discrete, named, and individually toggleable. A problematic policy can be disabled or deleted immediately without affecting anything else in the system. Compare that to debugging a relevance regression caused by an interaction between a synonym update, a scoring function change, and a new analyzer, all shipped in the same release with no clear attribution.</p><h2>Beyond manual authoring: Large language model–assisted (LLM-assisted) policy suggestions</h2><p>The policies described so far are authored by humans (a merchandiser identifying a gap and drafting a fix). But the same governed workflow supports a second mode: LLM-assisted policy suggestion.</p><p>An LLM can run offline or in the background, analyzing query logs, identifying patterns where search results underperform, such as queries with high exit rates, low click-through, or frequent reformulations. An LLM can then suggest new policies that enter the same Author → Test → Promote pipeline, where a human evaluates each one before it reaches production.</p><h2>Governance is the enabler, not the constraint</h2><p>It might seem counterintuitive: Adding a governance layer makes the system <em>faster</em> to change, not slower. This is the same pattern that works in other domains. CI/CD pipelines don&#x27;t slow down software delivery; they make it safe to ship frequently. Access control doesn&#x27;t slow down collaboration; it makes it safe to share broadly.</p><p>A governed control plane works the same way. The reason a query behavior change takes six weeks isn&#x27;t that the code change is complex; it&#x27;s that nobody is confident enough to ship it faster, because the blast radius is unclear and the rollback path is uncertain.</p><p>Governance provides that confidence. When every policy is explicit, every conflict is resolved deterministically, and every change can be instantly disabled and then rolled back (because policies are structured JSON documents that can be version controlled using existing workflows), the cost of iteration drops dramatically. Business teams move at the speed of the market. Engineering focuses on the platform.</p><h2>From operating model to architecture</h2><p>The shift from business logic in code to business policies as data is more than a technical refactoring; it&#x27;s an organizational change that puts relevance ownership with the teams closest to the business context. 
But it raises an architectural question: How do you evaluate policies at query time without adding latency or turning the control plane itself into a new form of spaghetti?</p><p>The next post will dig into exactly that: the design pattern that enables fast, deterministic policy evaluation at query time.</p><h2>Put governed ecommerce search into practice</h2><p>The workflow described here, merchandisers authoring, testing, and promoting search policies without engineering deployments, is already available. Elastic Services Engineering designed and built it, and Elastic Services has the skills to deploy it for enterprise ecommerce teams.</p><p>If your organization is ready to move from deployment-gated relevance tuning to business-editable search with governance and auditability, we can accelerate your implementation. Contact <a href="https://www.elastic.co/consulting">Elastic Professional Services</a>.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Preconditioning Vectors: Making Elasticsearch VectorDB Better Binary Quantization work for every vector]]></title>
            <description><![CDATA[Modern quantization techniques can hurt recall when using older models or embeddings that aren’t normally distributed. Learn how preconditioning fixes these vectors through random orthogonal projection, making BBQ more effective and recovering recall.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-bbq-preconditioning-vectors</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-bbq-preconditioning-vectors</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[John Wagster]]></dc:creator>
            <pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f5809208179f0040ba699b2c8731533ef845bc7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch as a vector database offers comprehensive quantization techniques like <a href="https://www.elastic.co/search-labs/blog/optimized-scalar-quantization-elasticsearch">Better Binary Quantization (BBQ)</a>. BBQ and other modern quantization techniques compress vectors down to as little as a single bit per dimension, reducing memory use while retaining impressively accurate distance approximation. For vectors generated from deep learning models, such as Cohere models, this works really well; however, for other kinds of vectors, such as image data or histogram features, recall can be impacted heavily. Preconditioning fixes this by applying a random orthogonal rotation to your vectors before quantization, redistributing variance evenly across dimensions so every bit captures meaningful signal, in some cases improving recall by almost 75%.</p><p>Here, we’ll provide some intuition about the problem and how preconditioning solves it.</p><h2>The problem</h2><p>BBQ quantizes each dimension of a vector independently: Values above the mean become 1, values below it become 0. This works well when every dimension carries roughly the same amount of information. Transformer-based embeddings tend to have this property naturally: their dimensions are learned representations that distribute variance evenly.</p><p>But there are lots of real-world vectors that aren’t like this. Consider a 784-dimension vector representing a grayscale image, like in the Fashion-MNIST dataset. Some pixels near the center of the image, where the clothing actually appears, vary a lot across the dataset. However, other pixels, such as those near the corners, are mostly one color and barely vary at all. When BBQ quantizes these vectors, the high-variance dimensions lose precision because a single bit can&#x27;t capture their range, while the low-variance dimensions become useless. The resulting quantized vectors are poor approximations of the originals, and recall suffers.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d7230d6d1067eb8b40f1f3e2e514388dc846473f-767x789.png" style="width: 100%;" /><p>A representation of Fashion-MNIST images. (credit: <a href="https://www.geeksforgeeks.org/deep-learning/how-to-load-fashion-mnist-dataset-using-pytorch/">geeksforgeeks.org</a>)</p><h2>Preconditioning</h2><p>To fix the problem, we want to spread the information more evenly across dimensions so that each bit captures roughly the same amount of information.</p><p>Preconditioning applies a linear transformation to every vector before quantization. The transformation is an orthogonal rotation that reshuffles how information is distributed across dimensions without changing the distances between vectors. If you want to dig into the math, take a look at this in-depth analysis on <a href="https://www.elastic.co/search-labs/blog/robust-optimized-scalar-quantization">optimized scalar quantization (OSQ) with preconditioners</a>.</p><p>Here’s a graphic to help illustrate how preconditioning can help when applying quantization. This simplified two-dimensional diagram illustrates the idea that the orthogonal rotation helps to increase the spread, or range, of information that was previously quite compressed. 
While this two-dimensional animation is not an exact representation of preconditioning, it gives a good intuition for what roughly happens in higher dimensions, where buckets of dimensions are transformed independently and a random projection can greatly improve the distribution. Imagine that the y-axis represents a corner pixel of our Fashion-MNIST images, which is primarily one shade with very low variance, and that the x-axis represents a center pixel where clothing appears, with very high variance. Without preconditioning, quantizing vectors to a single representative point is not a particularly good discriminator.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/40a3b20218bfd440f9d8f4aca991b31b83df759e-1080x608.gif" style="width: 100%;" /><h3>Let’s look at the data</h3><p>Today, preconditioning is supported in <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a>. Here’s a benchmark showing the impact when visiting different percentages of the total vector dataset.</p><p>Fashion-MNIST Recall (784 dimensions, 60K docs, 5x oversample, k: 10)</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Vectors visited</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline recall</td><td style="border: 1px solid #ccc; padding: 4px;">Preconditioned recall</td><td style="border: 1px solid #ccc; padding: 4px;">% Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.45</td><td style="border: 1px solid #ccc; padding: 4px;">0.77</td><td style="border: 1px solid #ccc; padding: 4px;">71%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3%</td><td style="border: 1px solid #ccc; padding: 4px;">0.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.77</td><td style="border: 1px solid #ccc; padding: 4px;">57%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.50</td><td style="border: 1px solid #ccc; padding: 4px;">0.87</td><td style="border: 1px solid #ccc; padding: 4px;">74%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10%</td><td style="border: 1px solid #ccc; padding: 4px;">0.55</td><td style="border: 1px solid #ccc; padding: 4px;">0.91</td><td style="border: 1px solid #ccc; padding: 4px;">65%</td>
                    </tr>
              </tbody>
            </table>
          <p>GIST (960 dimensions, 1M docs, 5x oversample, k: 10)</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Vectors visited</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline recall</td><td style="border: 1px solid #ccc; padding: 4px;">Preconditioned recall</td><td style="border: 1px solid #ccc; padding: 4px;">% Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.1%</td><td style="border: 1px solid #ccc; padding: 4px;">0.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.69</td><td style="border: 1px solid #ccc; padding: 4px;">41%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.2%</td><td style="border: 1px solid #ccc; padding: 4px;">0.70</td><td style="border: 1px solid #ccc; padding: 4px;">0.77</td><td style="border: 1px solid #ccc; padding: 4px;">10%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.3%</td><td style="border: 1px solid #ccc; padding: 4px;">0.73</td><td style="border: 1px solid #ccc; padding: 4px;">0.85</td><td style="border: 1px solid #ccc; padding: 4px;">16%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">13%</td>
                    </tr>
              </tbody>
            </table>
          <p>SIFT (128 dimensions, 1M documents, 5x oversample, k: 10)</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Vectors visited</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline recall</td><td style="border: 1px solid #ccc; padding: 4px;">Preconditioned recall</td><td style="border: 1px solid #ccc; padding: 4px;">% Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.48</td><td style="border: 1px solid #ccc; padding: 4px;">0.60</td><td style="border: 1px solid #ccc; padding: 4px;">25%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1%</td><td style="border: 1px solid #ccc; padding: 4px;">0.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.71</td><td style="border: 1px solid #ccc; padding: 4px;">20%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3%</td><td style="border: 1px solid #ccc; padding: 4px;">0.71</td><td style="border: 1px solid #ccc; padding: 4px;">0.87</td><td style="border: 1px solid #ccc; padding: 4px;">23%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">7%</td><td style="border: 1px solid #ccc; padding: 4px;">0.72</td><td style="border: 1px solid #ccc; padding: 4px;">0.90</td><td style="border: 1px solid #ccc; padding: 4px;">25%</td>
                    </tr>
              </tbody>
            </table>
          <p>That’s a nice boost in recall; however, this boost comes with a cost. Applying it to all embeddings blindly is inefficient: for datasets that don’t need to be preconditioned, it causes ~2–4% overhead in query latencies with no improvement in recall, plus upwards of 20% additional overhead at index time. For production use cases where you initially see low recall, you may want to evaluate the impact of preconditioning with your specific model and dataset.</p><h2>Here’s the how</h2><p>Preconditioning is available for the <code>bbq_disk</code> index type. Simply set <code>precondition</code> to <code>true</code> in the <code>index_options</code>, like so:</p><pre><code>{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 784,
        "index_options": {
          "type": "bbq_disk",
          "precondition": true
        }
      }
    }
  }
}</code></pre><p>Take a look at the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector#dense-vector-params">dense vector mapping</a> docs for more details.</p><h2>Conclusion</h2><p>BBQ is highly effective for deep learning embeddings, but it can be less effective with embeddings that have uneven variance across dimensions, as can occur in feature-engineered vectors. Preconditioning redistributes that variance so quantization can be more effective. On some datasets, like Fashion-MNIST, we see as much as a 74% improvement in recall!</p><p>For now, we’ve made preconditioning optional. Hopefully, you feel more capable of knowing when it may be beneficial so you try it out yourself. In the future, we plan to iterate on performance and automatically detect when to apply preconditioning.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reindexing data streams due to mapping conflicts]]></title>
            <description><![CDATA[Learn how to fix Elasticsearch mapping conflicts by reindexing data streams. This blog explains the reindexing process and how to ensure new data is correctly mapped.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-mapping-conflicts-reindex-data-streams</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-mapping-conflicts-reindex-data-streams</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Lisa Larribas]]></dc:creator>
            <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f2573aa3d29b3a628e4fce606c803add2641501-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When mapping conflicts arise in fields, whether they’re Elastic Common Schema–standard (ECS-standard) or specific to the data source, reindexing your data using Dev Tools becomes necessary. These conflicts can negatively impact any downstream function following ingestion, potentially causing inaccurate results or preventing the use of the complete dataset in features like visualizations, dashboards, the Security app, and aggregations. This blog post details the steps for this reindexing process.</p><p>This blog&#x27;s content was developed and verified using Elastic versions 9.2.8 and 8.19.14, along with Filestream Integration versions 2.3.0 and 1.2.0.</p><p><strong>Important note:</strong> Depending on your environment, some steps may require specific modifications. Furthermore, be aware that dynamic templates were removed from the <code>@package</code> component template starting with Filestream Integration version 2.3.3.</p><p>Before starting the reindexing process, it’s important to consider the current storage allocation in your environment. The steps outlined below involve creating a copy of the existing backing index, which will temporarily reside in the <a href="https://www.elastic.co/docs/manage-data/lifecycle/data-tiers">hot tier</a>.</p><p><strong><span style="text-decoration:underline">Elasticsearch data tiers</span></strong></p><ul><li><strong>Hot: </strong>The hot tier is the Elasticsearch entry point for time series data, storing the most recent, frequently searched data. Hot tier nodes require fast reads and writes, necessitating more resources and faster storage (SSDs). This tier is mandatory, and new data stream indices are automatically allocated here.</li><li><strong>Warm: </strong>Time series data can move to the warm tier once it’s being queried less frequently than the recently indexed data in the hot tier. The warm tier typically holds data from recent weeks. Updates are still allowed but are likely infrequent. Nodes in the warm tier generally don’t need to be as fast as those in the hot tier. For resiliency, indices in the warm tier should be configured to use one or more replicas.</li><li><strong>Cold: </strong>Data that’s infrequently searched can move from the warm to the cold tier. The cold tier, while still searchable, prioritizes lower storage costs over search speed. Alternatively, the cold tier can store regular indices with replicas instead of searchable snapshots, allowing use of less expensive hardware for older data without reducing disk space requirements compared to the warm tier.</li><li><strong>Frozen: </strong>Data that’s queried infrequently or no longer queried moves from the cold to the frozen tier for its remaining lifecycle. This tier uses a snapshot repository and partially mounted indices to store and load data, reducing local storage and costs while still allowing search. Searches on the frozen tier are generally slower than on the cold tier because Elasticsearch may need to fetch frozen data from the snapshot repository. We recommend dedicated frozen tier nodes.</li></ul><h2>Prerequisites: Determine which fields have conflicts</h2><p>To determine which fields have mapping conflicts, navigate to <strong>Stack Management -&gt; Data Views -&gt; logs-*</strong> (the logs-* data view is the highest level of the hierarchy for data with the <em>logs-</em> prefix). If there are any conflicts, a yellow box will state that. 
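</p><p>For a quick check outside the UI, the field capabilities API reports every type a field is mapped to across the matching indices and, for a conflicting field, which indices use which type. For example, for the <code>log.offset</code> field examined later in this post:</p><pre><code>GET logs-*/_field_caps?fields=log.offset</code></pre><p>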
You may either click <strong>View conflicts</strong> or, under the <strong>Field type</strong> box next to the <strong>Search </strong>box, select <strong>conflict</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7d41594682b601a30a9544b8db678f118b0146ab-2048x720.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41ad800daa6fc244a1123ba7538820bff5de6788-747x182.png" style="width: 100%;" /><p>Clicking the yellow <strong>Conflict</strong> button will reveal which indices are associated with which mapping types.</p><p>This situation (where the field is mapped as both a <code>keyword</code> and a <code>long</code>) typically occurs because data was ingested before a specific mapping type was defined in the <a href="https://www.elastic.co/docs/manage-data/data-store/templates#component-templates">component template</a> for the relevant <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams">data stream</a>. In such cases, Elasticsearch attempts to set the mapping based on its dynamic templates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9973c1935aa52292c1ace09a8e9c0b31ad99e7a2-2048x1085.png" style="width: 100%;" /><p>To determine which mapping is appropriate for the field: if the field is an ECS field, verify it against the <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">ECS field reference</a>. If the field in question is not an ECS field, its value must be reviewed to determine the correct mapping.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c77d0a1520bd1ad17e7ffa1480ecf5e224953e1-418x360.png" style="width: 100%;" /><p>If a field, such as <code>log.offset</code> in this example, isn’t documented in the ECS, the next steps are to investigate the field&#x27;s value, determine which conflicting mapping type has the most backing indices, and examine the component templates of the other indices.</p><p>Typically, the mapping type associated with the highest number of indices is the correct one, but we recommend you verify the value of the field in question to validate this. To confirm the validity of a mapping type (for example, <code>long</code>), you must also verify that the field&#x27;s value is appropriate for that type. This verification can be done by using <strong>Discover </strong>to search for the field in question. Reviewing other data streams that contain the same field can provide additional confirmation.</p><p>To review the values present for the field with the mapping issue, navigate back to the yellow <strong>Conflict</strong> button mentioned earlier, click it, highlight one of the backing indices, and paste it into a <strong>Discover</strong> session. Your Kibana Query Language (KQL) statement should look like the one in the following screenshot, including the <strong><code>_index</code>:</strong> field delimiter.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/781f63b34a9abd427ceb896484da29af446e3326-2048x1063.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b7b0cb847c1694ab605c61a538722f5be004ec86-2048x909.png" style="width: 100%;" /><h2>Prepare the new backing index custom component template</h2><p>To address the mapping conflict in the data stream, first examine the relevant <code>@package</code> component template. 
You can find this under <strong>Stack Management -&gt; Index Management -&gt; Component Template</strong>. Search for the data stream and select the corresponding <code>@package</code> link. This template contains mappings for the fields out of the box and, while it isn’t common to have a mapping mismatch, it’s possible for the more appropriate type to be overlooked.</p><p>Review the template to confirm it contains the necessary field nesting and mapping for the field in question. For example, if the template incorrectly lists <code>log.offset</code> as a <code>keyword</code>, this is the source of the issue.</p><p><strong>Important:</strong> Because modifying <code>@package</code>/managed templates isn’t recommended, you must use or create an <code>@custom</code> component template to correct the mapping type (for example, for <code>log.offset</code>) for all future data.</p><ul><li>We don’t recommend modifying the <code>@package</code>/managed templates, since when you update the integration to a more recent version, any changes you make to the <code>@package</code> template will be overwritten. This is why we recommend using the <code>@custom</code> templates.</li><li>If a data stream is experiencing mapping conflicts, you need to add any missing field (ECS and non-ECS) nestings or mappings to the data stream&#x27;s <code>@custom</code> component template. Create this template if it doesn&#x27;t exist yet, and make sure to specify the correct mapping type for the field.</li><li>If you have multiple conflicts in your data view, apply all the necessary missing mappings for the data stream simultaneously so that the reindex is performed once versus multiple times. Having entries for proper data typing in the <code>@custom</code> component template will ensure any future data ingestion will follow the same mapping guideline.</li></ul><p>To create the <code>@custom</code> component template (or verify it’s in use and populated), navigate to <strong>Index Templates</strong>, type in the name of the data stream in question, and click the appropriate <code>@custom</code> template being used by the data stream. If the template is not yet created, a yellow box will appear, allowing you to create the template through the UI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bea7cb172227bebc28146e3f2f016e112f34cba5-2048x720.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8f63a2e396efbe7f1e62dc08a137a22700be484-2048x296.png" style="width: 100%;" /><p>The screenshot below shows the next page once <strong>Create component template</strong> is selected. Leave the defaults as is on the first page and click <strong>Mappings</strong> or <strong>Next</strong> until you reach the <strong>Mappings</strong> page.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/822f1d864302aa4be438c13756b8372f43fa1b0d-2048x1275.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/affa2f1214af516a5a6b571ab813628ed7649275-2048x1235.png" style="width: 100%;" /><p>For a new incoming field to be mapped explicitly, or for a field with a mapping conflict to be corrected when the data stream rolls over (per the configuration set in the index lifecycle policy), the template needs an entry for the field in question.</p><p>The below will set the mapping for the <code>log.offset</code> field in the <code>@custom</code> component template for the filestream data stream. 
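</p><p>For those who prefer Dev Tools over the UI, a roughly equivalent request is sketched below; the <code>logs-filestream.generic@custom</code> template name is assumed from this example&#x27;s data stream, so verify the name your integration actually uses:</p><pre><code>PUT _component_template/logs-filestream.generic@custom
{
  "template": {
    "mappings": {
      "properties": {
        "log": {
          "properties": {
            "offset": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}</code></pre><p>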
Repeat these steps to add any custom fields, or to update fields from the <code>@package</code> template with the appropriate mappings, as needed for this dataset. In this example, when setting offset to <code>Long</code>, the field type will be <code>Numeric</code> and the Numeric type will be <code>Long</code>. Click <strong>Add field</strong>, and then click outside the field panel to continue.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/affa2f1214af516a5a6b571ab813628ed7649275-2048x1235.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea2975ee4e40ac0e10c4170d2a23125101f7f8da-2048x1136.png" style="width: 100%;" /><p>Once all needed fields have been added, click through to review, and select <strong>Create component template</strong> when ready. All new data being ingested from this step forward will have <code>log.offset</code> set to <code>long</code>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bdf8b8290b0c064c9d88990194b15232ffe85709-2048x1027.png" style="width: 100%;" /><h2>Creating the new backing index structure</h2><p>The new backing index needs to have the existing mappings from the data stream’s component template, as well as the ECS <code>ecs@mappings</code> component template. The <code>ecs@mappings</code> component template is applied after the data stream’s component template as a catchall for additional mappings that potentially weren’t captured in the previous component templates.</p><p>Navigate to the browser tab for the data stream&#x27;s <code>@package</code> mappings. (Go to <strong>Stack Management -&gt; Index Management -&gt; Component Template -&gt; <code>logs-filestream.generic@package</code> -&gt; Manage -&gt; Edit</strong>.) Once there, click on the <strong>Review</strong> section, then <strong>Request</strong>, and finally the <strong>Copy</strong> button on the right. Copying the JSON contents of the component template ensures the remaining field mappings and settings are retained while we update the <code>log.offset</code> field mapping. The JSON will form the backing structure for the newly reindexed backing index.</p><p><strong>Important: </strong>If the template’s JSON were not copied and the reindex continued anyway, the <code>log.offset</code> conflict would be resolved, but new conflicts with the integration would appear because the integrity of the current mappings was not upheld, creating double work to resolve the original issue.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/402b0431b0e19374e9b28a4374ed51dfa5fa44ba-2048x897.png" style="width: 100%;" /><p>Open a second browser tab, navigate to Dev Tools, and paste the copied content. Now, to clean up what was pasted:</p><p><strong>Modifications to the request</strong></p><p><strong>1. Index name:</strong> Replace <code>_component_template/logs-filestream.generic@package</code> with the name of the backing index you intend to reindex, appending <code>-1</code> to the end. For example, use <code>PUT &lt;backing index to reindex&gt;-1</code>.</p><ul><li>The appended <code>-1</code> signifies a reindex and won’t conflict with the default ILM rollover settings, which are based on the index&#x27;s creation date.</li></ul><p><strong>2. 
Settings:</strong> Remove the line <code>&quot;template&quot;</code> (line 3), as well as the very last closing brace for the entire JSON payload. Line 3 should then start with <code>&quot;settings&quot;: {</code>.</p><ul><li>Replace the inner contents of the settings section with <code>&quot;index.codec&quot;: &quot;best_compression&quot;</code>. This action will apply Elastic&#x27;s best compression to the index upon creation.</li><li>Add in <code>&quot;index.lifecycle.name&quot;: &quot;logs&quot;</code>, as well as a line for <code>&quot;index.lifecycle.rollover_alias&quot;: &quot;&quot;</code>.<ol><li>The <code>&quot;index.lifecycle.name&quot;: &quot;logs&quot;</code> entry will apply the logs ILM policy to the new backing index. Modify the ILM policy name if you aren’t using logs.</li><li>The <code>&quot;index.lifecycle.rollover_alias&quot;: &quot;&quot;</code> is blank, since this backing index won’t be rolled over, yet the setting is required to avoid ILM rollover errors into the next ILM phase after hot.</li></ol></li></ul><p><strong>3. Structure:</strong> The request should now include both a <code>settings</code> section and a <code>mappings</code> section. Inside <code>&quot;mappings&quot;: {</code>, you should find <code>&quot;dynamic_templates&quot;</code> and a <code>&quot;properties&quot;</code> section containing hard-coded fields and their mappings.</p><p><strong>4. Dynamic templates modification: </strong>The current dynamic templates section contains entries for fields that may be overwritten when the <code>ecs@mappings</code> dynamic templates are added next, causing redundancy and extra lines that aren’t needed.</p><ul><li>Remove all sections in <code>&quot;dynamic_templates&quot;</code> except for the second section titled <code>&quot;_embedded_ecs-data_stream_to_constant&quot;: {</code>.</li><li>Repeat the same process described above for gathering the <code>@package</code> component template’s contents, but this time gather the dynamic mappings for the <code>ecs@mappings</code> component template.<ul><li>It may be easier to copy the entire contents of the mappings from the UI for the <code>ecs@mappings</code> component template, paste into the working Dev Tools <code>dynamic_templates</code> section, and remove duplicate and unnecessary lines where appropriate. Include these dynamic template setting contents after the <code>&quot;_embedded_ecs-data_stream_to_constant&quot;: {</code> entry. The <code>dynamic_templates</code> section should look very similar to the below sample contents in Dev Tools.</li></ul></li><li><strong>If the <code>dynamic_templates</code> section is not included or is removed altogether</strong>, only the <code>&quot;properties&quot;</code> section under <code>&quot;mappings&quot;</code> would remain, and other fields (review the screenshot below) would get double mappings of <code>text</code> and <code>keyword</code> instead of the appropriate mappings they would have had with the <code>dynamic_templates</code> section included. This will also create issues in the data view by having the fields be double mapped (if not already mapped this way) and will cause additional mapping conflicts.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24e972cd0fc8eadf943b21cfdd80a5d435e705aa-2048x994.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aac13fbe882516e5ed5b5b1b5271c0ae34e80b04-1890x2048.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c43d6f61c8ece4de2d51657f239a0c34ced07cdb-1928x1452.png" style="width: 100%;" /><p><strong>5. Metadata removal:</strong> Delete the last section labeled <code>&quot;_meta&quot;</code>, as well as the section labeled <code>&quot;version&quot;</code>, if present.</p><p><strong>6. Formatting:</strong> Auto-indent the remaining sections, and adjust or remove any unnecessary curly braces that would prevent a successful execution.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ebe1573647500de75315e7655256a0db9604c40d-2048x1402.png" style="width: 100%;" /><p><strong>7. Mapping change:</strong> Navigate to the <code>&quot;properties&quot;</code> section, find <code>&quot;log&quot;</code>, and then locate <code>&quot;offset&quot;</code> nested underneath. Change the type from <code>keyword</code> to <code>long</code>, and remove the line entry (comma included) labeled <code>&quot;ignore_above&quot;: 1024,</code>. If more than one entry was added to the <code>@custom</code> component template created earlier, include them here.</p><p>Your Dev Tools console view should now be similar to the example provided below.</p><pre><code>PUT .ds-logs-filestream.generic-default-2026.04.14-000001-1
{
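  // Shell for the reindexed copy of the conflicted backing index; note the appended -1 in the index name above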
  "settings": {
    "index.codec": "best_compression",
    "index.lifecycle.name": "logs",
    "index.lifecycle.rollover_alias": ""
  },
  "mappings": {
    "dynamic_templates": [
      {
        "_embedded_ecs-data_stream_to_constant": {
          "path_match": "data_stream.*",
          "mapping": {
            "type": "constant_keyword"
          }
        }
      },
      {
        "ecs_timestamp": {
          "mapping": {
            "ignore_malformed": false,
            "type": "date"
          },
          "match": "@timestamp"
        }
      },
      {
        "ecs_message_match_only_text": {
          "path_match": [
            "message",
            "*.message"
          ],
          "mapping": {
            "type": "match_only_text"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_non_indexed_keyword": {
          "path_match": [
            "*event.original"
          ],
          "mapping": {
            "index": false,
            "type": "keyword",
            "doc_values": false
          }
        }
      },
      {
        "ecs_non_indexed_long": {
          "path_match": [
            "*.x509.public_key_exponent"
          ],
          "mapping": {
            "index": false,
            "type": "long",
            "doc_values": false
          }
        }
      },
      {
        "ecs_ip": {
          "path_match": [
            "ip",
            "*.ip",
            "*_ip"
          ],
          "mapping": {
            "type": "ip"
          },
          "match_mapping_type": "string"
        }
      },
      {
        "ecs_wildcard": {
          "path_match": [
            "*.io.text",
            "*.message_id",
            "*registry.data.strings",
            "*url.path"
          ],
          "mapping": {
            "type": "wildcard"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_path_match_wildcard_and_match_only_text": {
          "path_match": [
            "*.body.content",
            "*url.full",
            "*url.original"
          ],
          "mapping": {
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            },
            "type": "wildcard"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_match_wildcard_and_match_only_text": {
          "mapping": {
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            },
            "type": "wildcard"
          },
          "unmatch_mapping_type": "object",
          "match": [
            "*command_line",
            "*stack_trace"
          ]
        }
      },
      {
        "ecs_path_match_keyword_and_match_only_text": {
          "path_match": [
            "*.title",
            "*.executable",
            "*.name",
            "*.working_directory",
            "*.full_name",
            "*file.path",
            "*file.target_path",
            "*os.full",
            "*email.subject",
            "*vulnerability.description",
            "*user_agent.original"
          ],
          "mapping": {
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            },
            "type": "keyword"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_date": {
          "path_match": [
            "*.timestamp",
            "*_timestamp",
            "*.not_after",
            "*.not_before",
            "*.accessed",
            "created",
            "*.created",
            "*.installed",
            "*.creation_date",
            "*.ctime",
            "*.mtime",
            "ingested",
            "*.ingested",
            "*.start",
            "*.end",
            "*.indicator.first_seen",
            "*.indicator.last_seen",
            "*.indicator.modified_at",
            "*threat.enrichments.matched.occurred"
          ],
          "mapping": {
            "type": "date"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_path_match_float": {
          "path_match": [
            "*.score.*",
            "*_score*"
          ],
          "mapping": {
            "type": "float"
          },
          "path_unmatch": "*.version",
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_usage_double_scaled_float": {
          "path_match": "*.usage",
          "mapping": {
            "scaling_factor": 1000,
            "type": "scaled_float"
          },
          "match_mapping_type": [
            "double",
            "long",
            "string"
          ]
        }
      },
      {
        "ecs_geo_point": {
          "path_match": [
            "*.geo.location"
          ],
          "mapping": {
            "type": "geo_point"
          }
        }
      },
      {
        "ecs_flattened": {
          "path_match": [
            "*structured_data",
            "*exports",
            "*imports"
          ],
          "mapping": {
            "type": "flattened"
          },
          "match_mapping_type": "object"
        }
      },
      {
        "all_strings_to_keywords": {
          "mapping": {
            "ignore_above": 1024,
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ],
    "properties": {
      "input": {
        "properties": {
          "type": {
            "ignore_above": 1024,
            "type": "keyword"
          }
        }
      },
      "@timestamp": {
        "ignore_malformed": false,
        "type": "date"
      },
      "ecs": {
        "properties": {
          "version": {
            "ignore_above": 1024,
            "type": "keyword"
          }
        }
      },
      "log": {
        "properties": {
          "file": {
            "properties": {
              "inode": {
                "ignore_above": 1024,
                "type": "keyword"
              },
              "path": {
                "ignore_above": 1024,
                "type": "keyword"
              },
              "device_id": {
                "ignore_above": 1024,
                "type": "keyword"
              },
              "fingerprint": {
                "index": false,
                "type": "keyword"
              }
            }
          },
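          // Corrected mapping: previously keyword with "ignore_above": 1024, now long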
          "offset": {
            "type": "long"
          },
          "level": {
            "ignore_above": 1024,
            "type": "keyword"
          }
        }
      },
      "data_stream": {
        "properties": {
          "namespace": {
            "type": "constant_keyword"
          },
          "type": {
            "type": "constant_keyword"
          },
          "dataset": {
            "type": "constant_keyword"
          }
        }
      },
      "event": {
        "properties": {
          "original": {
            "index": false,
            "type": "keyword",
            "doc_values": false
          },
          "module": {
            "type": "constant_keyword",
            "value": "filestream"
          },
          "dataset": {
            "type": "constant_keyword",
            "value": "filestream.generic"
          }
        }
      },
      "message": {
        "type": "match_only_text"
      },
      "tags": {
        "ignore_above": 1024,
        "type": "keyword"
      }
    }
  }
}</code></pre><p>After your console resembles the example (with any additional custom fields included and custom values specific to your environment), execute the command to create the shell of the new backing index, pausing to resolve any errors that arise.</p><h2>Begin reindex process</h2><p>With the shell of the new backing index successfully created, the next step is to reindex and resolve the mapping conflicts.</p><p><strong>Important:</strong> If the backing index that has the mapping conflict is the most recent index and is the current write index (for example, the ending number for the backing index is -000001), the data stream needs to be rolled over. Rolling over the data stream is needed since the current write index, which is having documents fed into it, is a live backing index and cannot be modified.</p><p>With the correct field mapping now applied to the newer write index via the previously created <code>@custom</code> component template, all new documents will reflect this change.</p><p>This is performed by executing the following: </p><pre><code>POST <full data stream name>/_rollover</code></pre><p>For example: </p><pre><code>POST logs-filestream.generic-default/_rollover</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/22abc1a2f6de0420aa0d56ac498894111df7f4fd-2048x330.png" style="width: 100%;" /><p>Reindexing involves copying the data from an existing backing index to a new one within the same naming convention, typically to apply necessary changes. These modifications could include updates to a component template or the addition of a new ingest pipeline for the data to be processed through.</p><p>Next, the data will be copied from the backing index that has the incorrect mappings into a new backing index. The original backing index has been rolled over, meaning no new documents can be added to it. The new backing index will follow the same naming convention, which preserves data visibility and integrity while applying the correct ILM policy, but will include a <code>-1</code> suffix to indicate that it has been reindexed.</p><p>Adjust the index names as needed and paste the following code into the console. By including <code>wait_for_completion=false</code>, you can track the progress of document copying, which helps estimate the remaining reindexing time. Without this setting, you cannot track the status using the <code>GET _tasks</code> command below and will only be able to check the document count in the newer backing index using <code>GET &lt;backing index name&gt;-1/_count</code>.</p><p><strong>Important: </strong>If issues arise during the reindex process, don’t rerun the reindex command; doing so will restart the process and create duplicate records in the index ending with <code>-1</code>. If a restart is necessary, first delete the index with the trailing <code>-1</code>, and then execute the preceding <code>PUT</code> command to recreate the new backing index shell.</p><pre><code>POST _reindex?wait_for_completion=false
{
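  // wait_for_completion=false returns a task ID for tracking progress with GET _tasks/<task ID>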
  "source": {
    "index": "<source backing index>"
  },
  "dest": {
    "index": "<new backing index>-1"
  }
}

</code></pre><p>For example:</p><pre><code>
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": ".ds-logs-filestream.generic-default-2026.04.13-000001"
  },
  "dest": {
    "index": ".ds-logs-filestream.generic-default-2026.04.13-000001-1"
  }
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/22f9b1f762802ecd20faa7c7c1f76c9d1444aba5-2048x530.png" style="width: 100%;" /><p>Upon execution, the response will include a task ID. You can monitor the reindex progress using this ID with the command: <code>GET _tasks/&lt;task ID&gt;</code>.</p><p>The duration of the reindex depends on the volume of data in the original index. The completion can be tracked by looking for <code>&quot;completed&quot;: true</code> when executing the <code>GET</code> command, which should yield a similar output.</p><p><code>GET _tasks/&lt;task ID&gt;</code></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dbf0fb0a560a78236440b8c3de68cdf5c83e6d7a-2048x824.png" style="width: 100%;" /><p>With the reindexing process finished, the next step is to verify that the mappings for the new backing index, and for the specific field in question, are correct.</p><pre><code>GET <backing index>-1/_mapping</code></pre><p>For example:</p><pre><code>GET .ds-logs-filestream.generic-default-2026.04.13-000001-1/_mapping</code></pre><p>You can verify that the mapping for <code>log.offset</code> is as shown below. To confirm that other fields have only a single mapping entry (not both <code>text</code> and <code>keyword</code>), compare them to a field that was not part of the dynamic template section in the preceding <code>PUT</code> command.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db5c12c0a651e804a916d517e6e260e49a8b835a-2048x1121.png" style="width: 100%;" /><p>If the backing index that’s being reindexed has a large number of documents, it’s helpful to check the status of those documents being copied to the new backing index; this can be done with the following two Dev Tools commands, comparing the counts.</p><p><code>GET .ds-logs-filestream.generic-default-2026.04.14-000001/_count</code></p><p><code>GET .ds-logs-filestream.generic-default-2026.04.14-000001-1/_count</code></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0e49ac79edb0abf9fe99d0e6fd35e96d0e3e0e5-2048x880.png" style="width: 100%;" /><p>Once the counts are verified to match and the correct mappings are present, update the data stream to include the new backing index; otherwise, it would be left orphaned in index management, where the ILM policy would never act on it.</p><ul><li>If successful, the response will be an acknowledgment of <code>true</code>.</li></ul><pre><code>POST _data_stream/_modify
{
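  // Attach the reindexed "-1" index to the data stream so ILM manages it like the other backing indices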
  "actions": [
    {
      "add_backing_index": {
        "data_stream": "logs-filestream.generic-default",
        "index": ".ds-logs-filestream.generic-default-2026.04.14-000001-1"
      }
    }
  ]
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0aa3233377c0175258d37eaa661d56cf9f310d5e-2048x1288.png" style="width: 100%;" /><p>Verify the new backing index is added with the following command, making sure the <code>ilm_policy</code> is correct:</p><pre><code>GET _data_stream/logs-filestream.generic-default</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/af8b75cf260f6f088c28a78da86ad31527e0bfd5-2048x839.png" style="width: 100%;" /><p>Check the ILM status of the backing index next with the following command:</p><ul><li>It’s normal to see that the index is in hot, as it was created very recently (review line 8 or 10).</li></ul><pre><code>GET .ds-logs-filestream.generic-default-2026.04.14-000001-1/_ilm/explain</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cd181a31001c7a3ee2b0599a7388909ce5b50baf-2048x972.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/20f6e7632804f173533e655f0292c3c540f26597-2048x894.png" style="width: 100%;" /><p>Execute the following to transition the backing index from the hot tier to the next tier defined after the hot phase in this data stream’s ILM policy. The specific values for <code>phase</code>, <code>action</code>, and <code>name</code> in the <code>current_step</code> below can be referenced from lines 11, 13, and 15, respectively, in the provided screenshot above.</p><p>The <code>next_step</code> value indicates the subsequent ILM phase or data tier to which the index will transition.</p><p>For example:</p><pre><code>POST _ilm/move/.ds-logs-filestream.generic-default-2026.04.14-000001-1
{
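  // current_step must match the phase, action, and step name reported by _ilm/explain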
  "current_step": {
    "phase": "hot",
    "action": "rollover", 
    "name": "check-rollover-ready"
  },
  "next_step": {
    "phase": "warm" 
  }
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ae60b900ce1d0b46ce26ec301901bc8a9ef750c-2048x1249.png" style="width: 100%;" /><ul><li>It isn’t necessary, but as a safety measure, you may execute the <code>_ilm/explain</code> command again to ensure the backing index has moved to the next phase and is no longer in hot.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/58a460cf2ec443223ea68ba7e7166a7cf9d8c97a-2048x915.png" style="width: 100%;" /><p>Once the following conditions are met, you can safely delete the original backing index that had mapping conflicts:</p><ol><li>A new backing index has been successfully created.</li><li>Documents have been moved to the new index, and the document counts match.</li><li>Mappings have been corrected (both data stream specific and ECS).</li><li>The data stream incorporates the new backing index.</li><li>The ILM policy has been applied and has moved the index out of the hot phase.</li></ol><p><strong>Important:</strong> As an additional check before deleting the original index, you can review the <strong>Data Views</strong> page. Select <code>logs-*</code> and verify that the reindexed backing index (which ends in <code>-1</code>) now appears in the <strong><code>long</code></strong> section. The original backing index should still be present under <strong><code>keyword</code></strong>. If the reindexed backing index is not in the <strong><code>long</code></strong> section, go back and review the preceding steps and make any necessary corrections.</p><p>For example:</p><pre><code>DELETE .ds-logs-filestream.generic-default-2026.04.14-000001</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09d661b20a44929b4736a43eaa3df84180b25f30-2048x1295.png" style="width: 100%;" /><p>After resolving the conflicts, return to the <strong>Data Views</strong> page and select <code>logs-*</code>. If the conflict was solely related to <code>log.offset</code>, you should no longer see any conflicts listed. If there were other conflicts, the original backing index should no longer appear in the conflict list; instead, the new backing index should now be listed in the <code>long</code> section.</p><p>You can also verify in <strong>Discover</strong> that the <code>log.offset</code> field now displays the appropriate icons.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1c3bb7029c99aa4bc6b0931f39f5648654b35ccd-2048x1204.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0af1b1aa3a031c207aa5eb083696dd081d941e67-2048x1001.png" style="width: 100%;" /><p>Continue this process, repeating the above steps for every backing index that has a mapping conflict until all are successfully resolved.</p><p>References:</p><ul><li><a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">ECS field reference</a></li><li><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex">Reindex documents</a></li></ul><h2>Final thoughts</h2><p>By following the steps in this blog, you will resolve mapping conflicts and ensure that all new data is correctly mapped. This is achieved by linking the necessary component templates to your data source. This workflow not only fixes the immediate issues but also establishes a secure and repeatable process for managing schema changes as your data and requirements evolve.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How we built Elasticsearch simdvec to make vector search one of the fastest in the world]]></title>
            <description><![CDATA[How we built Elasticsearch simdvec, the hand-tuned SIMD kernel library behind every vector search query in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-vector-search-simdvec-engine</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-vector-search-simdvec-engine</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Hegarty, Lorenzo Dematte, Simon Cooper]]></dc:creator>
            <pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/df9646091bafbbf0a6dfd212ff8a6bd1e8589708-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch simdvec is the engine behind every vector distance computation in Elasticsearch. It provides hand-tuned AVX-512 and NEON kernels for every vector type Elasticsearch supports. Its bulk scoring architecture hides memory latency through explicit prefetching on x86 and interleaved loading on ARM, outperforming libraries like FAISS and jvector by up to 4x when data exceeds CPU cache. In this post, we explain why we built it, what’s inside, and how it makes Elasticsearch vector search one of the fastest in the world.</p><h2>How we built Elasticsearch simdvec</h2><p>Every vector search query in Elasticsearch, whether <a href="https://arxiv.org/abs/1603.09320">Hierarchical Navigable Small World (HNSW)</a> traversal, inverted file (IVF) scan, or reranking pass, reduces to the same problem: computing distances between vectors, millions of times per query. Elasticsearch supports a wide range of data types and quantization strategies, from float32 to int8, bfloat16, binary, and Better Binary Quantization (BBQ). Each comes with different trade-offs between memory, throughput, and recall. Behind all of it is a single engine: simdvec.</p><p>We built simdvec to make every distance computation as fast as the hardware allows. In this post, we explain why we built it, what’s inside, and where it delivers the most impact.</p><h3>Built like a race car</h3><p>As Formula 1 enthusiasts, with one of us having previously worked with the Ferrari Formula 1 Team, we see a clear parallel. A Formula 1 car is designed with a single purpose: to achieve the best lap time. Engine power, aerodynamics, and chassis design only matter insofar as they contribute to that outcome. The same is true of a vector database, where indexing throughput, query latency, and recall define success.</p><p>While the end result is what matters, reaching the highest levels of performance requires each component to be at its best. It can’t just be <em>good enough</em>, it has to be the <em>best </em>in its category. Simdvec is built with that mindset, focusing on a critical part of the system: the engine. It’s a purpose-built, <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">single instruction multiple data</a> (SIMD) optimized kernel library that provides hand-tuned native C++ distance functions, called from Java via the <a href="https://openjdk.org/projects/panama/">Panama</a> foreign function interface (FFI). It supports bulk scoring, cache line prefetching, and all vector types and layouts used in Elasticsearch.</p><p>That’s the engine behind every query.</p><h3>Why we built our own</h3><p>We started in 2023 with the Panama Vector API in Apache Lucene. It worked well for float32 dot products, but Elasticsearch&#x27;s needs quickly outgrew what it could provide. Elasticsearch supports a wide range of quantized vector types: int8, int4, bfloat16, single-bit, and asymmetric BBQ. Each has different SIMD strategies, packing layouts, and accumulator requirements. Beyond type coverage, Elasticsearch&#x27;s scoring paths demand more than single-pair throughput: HNSW needs to score several graph neighbors in one pass, IVF needs bulk scoring of thousands of candidates with prefetching, and disk-based scoring needs to work directly on mmap&#x27;d memory without copying. 
We looked at what was available, and nothing covered the full set.</p><p>So we built simdvec: hand-tuned native C++ kernels called from Java via FFI, with bulk scoring, prefetching, and support for every vector type Elasticsearch uses. By owning the library, we control the full stack. When we add a new quantization type like BBQ, it gets a tuned SIMD kernel wired all the way through the system. We don&#x27;t wait for an upstream library to support it, and we don&#x27;t compromise on performance for any type. Every vector query in Elasticsearch, whether HNSW, IVF, reranking, or hybrid, runs on this engine, built around the operations and types we actually use.</p><p>Simdvec has separate native libraries for x86 and ARM, each with multiple instruction set architecture (ISA) tiers selected at startup. The call overhead from Java via FFI is very low at <a href="https://github.com/ldematte/simsimd-benchmarks/blob/main/COMPARISON.md#ffm-downcall-overhead-measurements">single-digit nanoseconds</a>.</p><h3>The landscape</h3><p>We&#x27;re not the only ones building SIMD-optimized vector distance kernels. The ecosystem is rich, and we wanted to understand how simdvec performs. Not to rank projects, but to provide context and explain where Elasticsearch&#x27;s engine sits. We selected three projects as reference points, each representing a different approach:</p><ul><li><strong>jvector:</strong> A Java approximate nearest neighbor (ANN) library that uses the Panama Vector API for vectorized distance computation, with optional native C acceleration on x86.</li><li><strong>FAISS:</strong> A widely deployed open source vector search framework, with hand-tuned AVX2/AVX-512 kernels.</li><li><strong>NumKong</strong> (formerly SimSIMD): A comprehensive suite of over 2,000 hand-tuned SIMD kernels spanning distance functions, matrix operations, and geospatial computation.</li></ul><p>Each project serves a different purpose and makes different trade-offs. We include reference numbers from them to give context for simdvec&#x27;s performance on the specific operations that Elasticsearch needs.</p><h3>How we measure</h3><p>The simdvec and <a href="https://github.com/ChrisHegarty/jvector-kernel-benchmarks">jvector benchmarks</a> are written in Java with JMH, the standard JVM microbenchmark harness, with FFI overhead included. For <a href="https://github.com/ldematte/simsimd-benchmarks">NumKong benchmarks</a> and <a href="https://github.com/ChrisHegarty/faiss-kernel-benchmarks">FAISS benchmarks</a>, we wrote small C/C++ harnesses using Google Benchmark, which is the standard C++ microbenchmark framework. Both frameworks report nanoseconds per operation with warmup and iteration calibration. We verified via hardware performance counters that all libraries are using SIMD on both platforms. All the benchmark code is publicly available in the linked GitHub repositories (and, in the case of simdvec, in the <a href="https://github.com/elastic/elasticsearch">elasticsearch</a> repository).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aee116772df161cf86b7668f575ac34c733a23c5-1580x238.png" style="width: 100%;" /><p><strong>Software:</strong> JDK 25.0.2, JMH 1.37, GCC 14, Google Benchmark (latest).</p><h2>One vector at a time</h2><p>The most fundamental operation in vector search is computing the distance between two vectors. 
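In its simplest scalar form, that operation is just a multiply-add loop. The sketch below is TypeScript for illustration only (the production kernels are hand-tuned C++ SIMD; this shows the operation, not the implementation):</p><pre><code>// A dot-product similarity in plain scalar form. simdvec&#x27;s kernels compute
// exactly this, but many lanes at a time in SIMD registers.
function dotProduct(query: Float32Array, doc: Float32Array): number {
  let acc = 0;
  for (let i = 0; i &lt; query.length; i++) {
    acc += query[i] * doc[i]; // one multiply-add per dimension
  }
  return acc; // for normalized vectors, higher means more similar
}</code></pre><p>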
Every HNSW neighbor evaluation, every IVF candidate score, every reranking comparison reduces to this inner loop.</p><p>We measured single-pair throughput at 1024 dimensions on both platforms, starting with float32, the baseline type and the one where the ecosystem is most competitive. We compare simdvec against FAISS and jvector; we excluded NumKong as it uses float64 accumulators for float32, making it 3.2x-5.3x slower (depending on platform), prioritizing numerical precision over throughput. To keep the comparison like-for-like, we benchmark NumKong on int8 instead, where it uses the same accumulator strategy as simdvec.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/352cfa0fc18f123140f746d843404f80127fb1b7-1500x675.png" style="width: 100%;" /><p>On x86, FAISS AVX-512 is the fastest single-pair kernel at 23 ns. Simdvec AVX-512 follows at 28 ns, a gap that reflects the FFI call overhead. Both use 512-bit FMA with multi-accumulator unrolling. At the AVX2 level, the two are much closer, 36 ns and 39 ns respectively, both constrained by the 256-bit register and memory load widths. jvector lands at 44 ns using the Java Panama Vector API. Panama generates good SIMD code, but hand-tuned C++ intrinsics retain an edge.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/95e44d9767f21ec4bccaed835e3c99aa180431ee-1500x495.png" style="width: 100%;" /><p>On ARM, simdvec leads at 70 ns, well ahead of jvector at 110 ns and FAISS at 156 ns. Simdvec has hand-tuned NEON kernels for aarch64. Jvector has no native ARM code and relies on Panama. FAISS relies on compiler auto-vectorization rather than explicit NEON intrinsics, which accounts for the wider gap. This reflects a practical advantage of owning the kernel library: when Elasticsearch expanded to Graviton, we added purpose-built NEON kernels. Neither jvector nor FAISS has prioritized ARM native code to the same degree.</p><p>But Elasticsearch doesn&#x27;t only score float32. <strong>Int8</strong> quantization reduces memory by 4x, bfloat16 by 2x, and BBQ by 32x. Each type needs its own SIMD strategy, and simdvec provides hand-tuned native kernels for all of them.</p><p>Of the libraries we compared, only NumKong has comparable kernels for int8. We measured int8 dot product, squared Euclidean, and cosine at 1024 dimensions.</p><p><strong>Int8 single-pair scoring (1024 dimensions, ns/vec op – lower is better)</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0b852fd5f51d57bd2488886600472ee65ab64da-1594x378.png" style="width: 100%;" /><p>On both architectures, NumKong is as fast or faster at small-to-medium dimensions, where the difference is largely due to lower call overhead (direct C call vs Java FFI). At larger dimensions, simdvec&#x27;s more efficient kernel implementation (which uses cascade unrolling) amortizes the call cost: as dimension increases, <a href="https://github.com/ldematte/simsimd-benchmarks/blob/main/COMPARISON.md#single-pair-i8-nsop-2">this gap closes and eventually reverses</a>. The crossover sits at dimensions between 768 and 1536, depending on function and architecture.</p><p>Despite the slightly higher overhead of Java FFI, simdvec is on par with highly optimized C/C++ libraries. Not only is it the only library with optimized kernels for both float32 <em>and</em> int8; it also leads on ARM, trails FAISS only slightly on x86 (for float32), and stays very close to NumKong on both architectures (for int8). 
And for bfloat16, int4, binary, and BBQ, while alternatives exist, simdvec distinguishes itself through hand-tuned SIMD tailored to each type&#x27;s data layout.</p><p>But a production search engine doesn’t score one vector at a time; it scores thousands per query. The next question is what happens at that scale.</p><h3>Thousands at a time</h3><p>Single-pair performance is only part of the picture. What matters in practice is how systems behave under load. A single HNSW query may score hundreds of graph neighbors. An IVF scan may score thousands of posting list entries. A reranking pass may score tens of thousands of candidates. Single-pair throughput matters, but what matters more is how fast you can score many vectors, and how gracefully performance degrades as the working set spills out of CPU caches.</p><p>Simdvec provides bulk scoring for every data type. These aren&#x27;t just loops over single-pair kernels; they use multi-accumulator inner loops that load the query vector once per dimension stride and share it across multiple document vectors, with explicit cache-line prefetching for the next batch. Neither jvector nor FAISS offers an equivalent (at the time of writing). Jvector has no bulk API, so callers score one pair at a time in a loop. FAISS exposes <code>fvec_inner_products_ny</code>, which, at the time of writing, is implemented as a loop over its single-pair distance function with no query amortization or prefetching.</p><p><strong>Float32.</strong> To measure the impact at the kernel level, we scored a single query against increasing numbers of 1024-dimension float32 document vectors using random access patterns that simulate HNSW-like scattered graph neighbor lookups. The three dataset sizes, 32, 625, and 32,500 vectors, are chosen so the working set exceeds L1, L2, and L3 cache, respectively.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1d90267be1c63ac82b8ba588617ebedb0be0d1b6-1334x558.png" style="width: 100%;" /><p>When the data fits in cache, simdvec is the fastest on both platforms, but the margins are modest since kernel arithmetic dominates. The real separation appears as the working set grows beyond L3. On x86, simdvec scores at 95 ns per vector, while FAISS needs 165 ns and jvector 412 ns. On ARM, the pattern is the same: simdvec holds at 162 ns, while FAISS climbs to 347 ns, and jvector to 476 ns. The prefetching and query amortization in simdvec keep memory latency hidden in a way that a simple loop over single-pair kernels cannot match, and the advantage widens precisely where real search workloads operate, deep in main memory.</p><p><strong>Int8.</strong> The same pattern holds for quantized types. We measured int8 dot product bulk scoring at 1024 dimensions with dataset sizes chosen to exceed the same L1, L2, and L3 cache boundaries, comparing simdvec&#x27;s bulk scoring against NumKong single-pair scoring in a loop.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9ee97d98b40d13b19370b76b33e67ea66bfb3250-1580x338.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/601f50a03fa3e6263cbac9700281dfe3e511de60-1580x338.png" style="width: 100%;" /><p>On x86, simdvec is between 1.2x and 1.9x faster, driven by the combination of explicit prefetching and batch processing. On ARM, simdvec wins again (1.7x to 1.9x faster) across all dataset sizes. 
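</p><p>The shape of that bulk loop is worth making explicit. This is an illustrative TypeScript sketch (the real kernels are C++ intrinsics, and the prefetch instructions they issue have no JavaScript equivalent):</p><pre><code>// Bulk scoring sketch: one query scored against four documents per pass.
// Each query element is loaded once and shared across four documents, and
// the four independent accumulators give the CPU parallel work to hide latency.
function bulkDot4(
  query: Float32Array,
  d0: Float32Array, d1: Float32Array, d2: Float32Array, d3: Float32Array
): [number, number, number, number] {
  let a0 = 0, a1 = 0, a2 = 0, a3 = 0;
  for (let i = 0; i &lt; query.length; i++) {
    const q = query[i]; // amortized: one query load feeds four multiply-adds
    a0 += q * d0[i];
    a1 += q * d1[i];
    a2 += q * d2[i];
    a3 += q * d3[i];
    // the native kernels also prefetch the next batch&#x27;s cache lines here
  }
  return [a0, a1, a2, a3];
}</code></pre><p>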
The advantage comes from batch processing four vectors at a time, providing memory-level parallelism via an interleaved access pattern. In both cases, the most striking result is what happens at the largest dataset size, where it matters the most.</p><p>Results for squared distance and cosine show a similar pattern, with speedups of 1.4x to 1.8x for ARM, and of 1.3x to 3.0x for x86 (details <a href="https://github.com/ldematte/simsimd-benchmarks/blob/main/COMPARISON.md">here</a>).</p><h3>When memory matters</h3><p>Production vector indices typically don&#x27;t fit in CPU cache. A 10M-vector int8 index at 1024 dimensions is 10GB. Scoring candidates means streaming data from DRAM, and that&#x27;s where bulk scoring architecture makes the difference.</p><p>We used hardware performance counters to measure what happens inside the CPU during bulk scoring and found that hiding memory latency requires two fundamentally different strategies, one per architecture.</p><p><strong>On x86, explicit prefetching eliminates cache misses. </strong>The bulk kernel processes vectors sequentially, one fully computed before the next, while issuing prefetch instructions for the next batch. Future data is pulled into L1 before the CPU needs it.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a39548f8060c2a4a5154521a4047dd92d8cd77be-1580x309.png" style="width: 100%;" /><p>On ARM, the same sequential approach performed poorly, even with prefetching. Instead, <strong>the bulk kernel interleaves loads</strong> from four vectors at every stride position, giving the out-of-order engine four independent memory streams. The CPU is not fetching data any faster, but rather waiting less by always having something else to compute while memory requests are in flight. Detailed analysis can be found in <a href="https://github.com/elastic/elasticsearch/issues/145412">this GitHub issue</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/832de3bc0d556a493b7bf3f250196018acfc1585-1580x238.png" style="width: 100%;" /><p>The numbers tell two different stories:</p><ol><li>On x86, prefetching turns 139K cache misses into 19K, and instructions per cycle (IPC) more than doubles. The bulk advantage grows with dataset size, from 1.2x in L2 to 2.8x beyond L3, because prefetching hides progressively more expensive DRAM round trips.</li><li>On ARM, cache misses barely change. What changes is utilization: Backend stalls drop 40% because the interleaved access pattern keeps the pipeline fed. This advantage stays a consistent 1.8x regardless of dataset size, because memory-level parallelism applies whether data comes from cache or DRAM.</li></ol><p>Two architectures, two strategies, one result: At production scale, simdvec keeps the CPU pipeline busy even when vectors are scattered across main memory.</p><h2>What this means for Elasticsearch users</h2><p>These kernel-level capabilities compound. A single vector search query may compute millions of distance operations: HNSW graph traversal, candidate scoring, reranking. Across thousands of concurrent queries, nanoseconds per operation translate directly to query latency and cluster throughput. 
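</p><p>A quick back-of-the-envelope using the beyond-L3 float32 numbers above makes that concrete: scoring one million candidates costs roughly 95 ms of pure distance compute with simdvec on x86, versus roughly 165 ms with FAISS and over 400 ms with jvector. At that scale, per-operation nanoseconds turn into visible differences in end-to-end latency.</p><p>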
Whether you use float32, int8, bfloat16, or BBQ, and whether your index is in memory or on disk, simdvec is the engine underneath: every one of those operations runs through it, tuned down to the last nanosecond.</p><p>The key takeaway is that at production scale, vector search performance isn’t primarily determined by raw SIMD throughput. It’s dominated by how efficiently the system hides memory latency while sustaining compute across millions of small operations.</p><p>The simdvec kernels improve with almost every Elasticsearch release. When new quantization types and hardware platforms emerge, they get tuned kernels from day one. And existing types continue to get faster as we refine the implementations that are already shipping.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Jina embeddings v3 now available on Gemini Enterprise Agent Platform Model Garden]]></title>
            <description><![CDATA[Jina search foundation model, jina-embeddings-v3, is now self-deployable on Gemini Enterprise Agent Platform Model Garden, with more to follow. Run jina-embeddings-v3 on a single L4 GPU inside your own VPC.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v3-gemini-enterprise-model-garden</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v3-gemini-enterprise-model-garden</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sa Zhang]]></dc:creator>
            <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42f72633f1e5453dbfd47730b5f776429f9f633e-721x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today we’re launching <code>jina-embeddings-v3</code>, the first Jina search foundation model to be available on <a href="https://console.cloud.google.com/vertex-ai/publishers/jinaai/model-garden/jina-embeddings-v3">Gemini Enterprise Agent Platform Model Garden</a> as a self-deployable partner model. <em>Self-deployment</em> means the model runs on GPU instances inside your Google Cloud project and Virtual Private Cloud (VPC). No external API calls, no per-token metering, no rate limits.</p><p>With this integration, Elasticsearch users gain a new deployment option that keeps data inside their security perimeter, delivers predictable infrastructure costs, and runs natively on Google Cloud. At the same time, the broader Google Cloud ecosystem gains access to Jina&#x27;s purpose-built, state-of-the-art search and retrieval models.</p><p>This is the first stage of a broader rollout. Together with the models coming next, the lineup will form a complete retrieval stack: Embed your data, embed queries, retrieve and rerank candidates, and extend search to images with multimodal embeddings, all on infrastructure you control. You can start today with <code>jina-embeddings-v3</code>, the model already powering production search pipelines across the Elasticsearch ecosystem via Elastic Inference Service (EIS).</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td><td style="border: 1px solid #ccc; padding: 4px;">Parameters</td><td style="border: 1px solid #ccc; padding: 4px;">Key capability</td><td style="border: 1px solid #ccc; padding: 4px;">Status on Model Garden</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-embeddings-v3`</td><td style="border: 1px solid #ccc; padding: 4px;">Text embedding</td><td style="border: 1px solid #ccc; padding: 4px;">572M</td><td style="border: 1px solid #ccc; padding: 4px;">Proven multilingual workhorse, 8K context, 1024 dim output, truncatable to 32</td><td style="border: 1px solid #ccc; padding: 4px;">Available now</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-embeddings-v5-text-small`</td><td style="border: 1px solid #ccc; padding: 4px;">Text embedding</td><td style="border: 1px solid #ccc; padding: 4px;">677M</td><td style="border: 1px solid #ccc; padding: 4px;">State-of-the-art sub-1B multilingual, 32K context, 1024 dim output, truncatable to 32</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-embeddings-v5-text-nano`</td><td style="border: 1px solid #ccc; padding: 4px;">Text embedding</td><td style="border: 1px solid #ccc; padding: 4px;">239M</td><td style="border: 1px solid #ccc; padding: 4px;">Best-in-class under 500M params, 8K context, 768 dim output, truncatable to 32</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-reranker-v3`</td><td style="border: 1px solid #ccc; padding: 4px;">Reranker</td><td style="border: 1px solid #ccc; padding: 4px;">600M</td><td style="border: 1px solid #ccc; padding: 4px;">Listwise reranker, 131K context, up to 64 documents</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-clip-v2`</td><td style="border: 1px solid #ccc; padding: 4px;">Multimodal embedding</td><td style="border: 1px solid #ccc; padding: 4px;">900M</td><td style="border: 1px solid #ccc; padding: 4px;">Text + image in shared space, 89 languages, and 8K text context, 512×512 images</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
              </tbody>
            </table>
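          <p>The &quot;truncatable to 32&quot; column refers to Matryoshka representation learning: you keep a prefix of the embedding and renormalize it, trading a little quality for much smaller vectors. A minimal TypeScript sketch of the idea (illustrative only, not Jina&#x27;s code):</p><pre><code>// Matryoshka truncation: keep the first k dimensions, then renormalize.
function truncate(embedding: Float32Array, k: number): Float32Array {
  const head = embedding.slice(0, k);
  const norm = Math.hypot(...head); // L2 norm of the kept prefix
  return head.map((v) =&gt; v / norm);
}

// Example: shrink a 1024-dim jina-embeddings-v3 vector to 256 dims,
// cutting vector storage by 4x with only a modest quality loss.
// const compact = truncate(fullEmbedding, 256);</code></pre>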
          <p>Every model runs on a single NVIDIA L4 (24 GB), the most cost-efficient GPU tier on Google Cloud. Most other embedding models on Google Cloud Model Garden require an A100 80 GB or H100, roughly three times the per-hour instance cost before you even start counting tokens.</p><p>No additional commercial license is required when deployed through Vertex AI.</p><h2><strong>Why Model Garden?</strong></h2><p>Why deploy through Model Garden instead of hitting an API? It comes down to three things: control, cost, and context.</p><h3>Your data never leaves the house</h3><p>The biggest draw for most developers is the self-deploy architecture. When you deploy a Jina model through Model Garden, the weights run on GPU instances inside your own Google Cloud project and your own VPC. This is a game-changer for anyone working in industries with data security concerns, like finance or healthcare. Because there are no external API calls, your sensitive data stays within your security perimeter.</p><h3>Scaling with prediction</h3><p>Instead of paying every time you embed a sentence or rerank a document, you pay a flat hourly instance cost. And because every Jina model can run on a single NVIDIA L4, the most affordable GPU tier on Google Cloud, the barrier to entry is low. Whether you process a thousand requests or a billion, your infrastructure bill stays predictable. This is a setup that actually rewards you for growing your traffic rather than taxing you for it.</p><h3>Everything under one roof</h3><p>If your data is already sitting in Elasticsearch on Google Cloud, BigQuery, or Cloud Storage, it makes sense to keep your inference engines nearby. By deploying through Model Garden, Jina search foundation models inherit all the enterprise features you are already using: identity and access management (IAM) for access control, unified billing on your existing Google Cloud invoice, and the ability to plug into Vertex AI Pipelines for machine learning operations (MLOps) workflows.</p><p>While the Jina AI Cloud API and Elastic Cloud offer the fastest path for bursty traffic or existing search workflows, Model Garden is ideal for enterprise applications requiring strict data security and predictable costs at scale. Elastic wants to meet you where you are.</p><h2><strong>Jina AI models</strong></h2><h3><strong>jina-embeddings-v3</strong></h3><p>Our proven multilingual embedding model with 572M parameters and 8K token context. Scores 65.5 on Massive Text Embedding Benchmark (MTEB) English. Supports five task-specific Low-Rank Adaptation (LoRA) adapters (retrieval query/passage, text-matching, classification, clustering) and Matryoshka truncation from 1024 to 64 dimensions. Already widely adopted across the Elasticsearch ecosystem via EIS.</p><p>We’re leading with v3 because many production systems already depend on it. If you’re migrating a v3-based pipeline to Google Cloud, you can now run the same model natively without changing your embedding dimensions or reindexing.</p><h3><strong>jina-embeddings-v5-text (small and nano)</strong></h3><p>Our fifth-generation text embedding models, released February 2026, achieve top-tier performance, competing with models many times their size.</p><p><code>v5-text-small</code> (677M) scores 67.0 on the Multilingual MTEB (MMTEB) benchmark suite, encompassing 131 tasks of nine task types, and 71.7 on the MTEB English benchmark. It’s the strongest sub-1B multilingual embedding model on the MTEB Leaderboard.</p><p><code>v5-text-nano</code> (239M) scores 65.5 on MMTEB. 
No other model under 500M parameters reaches this level. At less than half the size of most comparable models, it’s the natural choice for edge and latency-sensitive deployments.</p><p>Both models support:</p><ul><li><strong>Four task-specific LoRA adapters:</strong> Retrieval, text-matching, classification, clustering. Select the appropriate adapter via the <code>task</code> parameter at inference time.</li><li><strong>Matryoshka dimension truncation:</strong> Reduce embedding dimensions from 1024 (or 768 for nano) down to 32. Quality loss is minimal at moderate truncation (for example, 256 dims). Halving dimensions roughly halves storage.</li><li><strong>Binary quantization:</strong> Compress 1024-dim embeddings from 2KB to 128 bytes with binarization. Special training keeps the quality loss from this compression minimal.</li><li><strong>Multilingual: </strong>119 languages (small) and 93 (nano).</li></ul><h3><strong>jina-reranker-v3</strong></h3><p>A 0.6B parameter multilingual listwise reranker built using a <em>last but not late interaction</em> architecture. The query and up to 64 candidate matches are entered into a single 131K-token context window, and the model performs cross-document comparison before scoring. Jina Reranker v3 achieves 61.94 nDCG@10 on BEIR, outperforming larger rerankers while being 6× smaller. This is fundamentally different from pointwise rerankers that score each document in isolation, and it produces better results, especially for passage retrieval from single documents.</p><h3><strong>jina-clip-v2</strong></h3><p>A 0.9B multimodal, multilingual embedding model that maps text and images into a shared 1024-dimensional space. It supports:</p><ul><li><strong>89 languages</strong> for text-image retrieval.</li><li><strong>512×512 image resolution.</strong></li><li><strong>8K token text input.</strong></li><li><strong>Matryoshka truncation</strong> from 1024 to 64 dimensions for both modalities.</li></ul><p>Highly competitive on image-to-text benchmarks, including multilingual tasks.</p><h2><strong>Getting started</strong></h2><p>Jina Embeddings v3 is live on Model Garden today. Here’s how to get it running.</p><p>You need a Google Cloud project with the Vertex AI API enabled and enough GPU quota for at least one g2-standard-8 instance (NVIDIA L4). If you’re new to Google Cloud, <a href="https://cloud.google.com/vertex-ai/docs/start/cloud-environment">start with the setup guide.</a></p><p><a href="https://console.cloud.google.com/vertex-ai/publishers/jinaai/model-garden/jina-embeddings-v3">The Model Garden page for Jina Embeddings v3</a> walks you through the full flow: Upload the model, create an endpoint, pick your machine type, and deploy. Open it in your own project, and follow the guided steps. A100 and H100 machines are also available where region and quota allow, but L4 is all you need to start.</p><p>From click to first embedding, the whole process takes a few minutes.</p><h2><strong>What comes next</strong></h2><p>Jina Embeddings v3 is the starting point. In the coming weeks, we’ll bring the rest of the Jina retrieval stack to Model Garden: v5 text embeddings (small and nano), jina-reranker-v3, and jina-clip-v2 for multimodal search. All will run on a single L4 GPU with the same self-deploy model.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic Security, Observability, and Search now offer interactive UI in your AI tools]]></title>
            <description><![CDATA[Elastic is the first security vendor to ship an interactive UI in AI Conversations, built on the MCP App standard that Anthropic and OpenAI co-authored. It launches alongside two other Elastic MCP Apps for Observability and Search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/mcp-apps-elastic</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/mcp-apps-elastic</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[David Elgut, Graham Hudgins, Jesse Miller]]></dc:creator>
            <pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/55cea79fd5c554656e8605525a61d4ad3ae0ee1b-2048x1143.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>AI agents are only as good as the surfaces they return. A text-only tool result flattens anything that is naturally visual or interactive: a dashboard, an alert list, an investigation graph, a distributed trace. The conversation becomes a place where you ask and the agent answers, but the moment the work gets dense, you leave for a separate tab, a separate product, a separate authentication boundary.</p><p><a href="https://modelcontextprotocol.io/extensions/apps/overview">MCP Apps</a> change the shape of that answer. A tool can now return an interactive UI alongside its text summary, and the host (Claude Desktop, Claude.ai, VS Code Copilot, Cursor) renders it inline in the conversation. The model keeps the compact text for reasoning. The human gets a live, clickable interface right next to the chat.</p><p>Three properties make this a different kind of integration than &quot;a webhook that returns a URL&quot;:</p><ul><li><strong>Context preservation.</strong> The UI lives inside the conversation. No tab switching, no hand-offs.</li><li><strong>Bidirectional data flow.</strong> The UI can call tools on the MCP server for fresh data, and the host can push new results from the agent back into the UI. No separate API layer or authentication plumbing.</li><li><strong>Sandboxed trust boundary.</strong> MCP Apps run in a host-controlled iframe. They cannot access the parent page, read cookies, or escape their container.</li></ul><p>Security operations run on triage, investigation graphs, and Attack Discovery, where an AI agent correlates hundreds of alerts into a handful of attack chains. Observability means distributed traces and time-series drill-downs. Building in Kibana means a dashboard grid. Flatten any of that to text and you lose the thing that makes it useful. We built MCP Apps for all three and are open-sourcing them together, so the same conversation can move from a triage queue to a dependency graph to a live dashboard without ever leaving the chat.</p><p>Each of the three reference apps is one MCP server serving many interactive views, not a bundle of separate products. The security app alone surfaces six dashboards that share the same server shell, the same tool-visibility model, and the same host bridge. The pattern is small; the surface area is where the value compounds.</p><h2>Elastic Security MCP App</h2><h3>Why it matters for the SOC</h3><p>When an agent tells a SOC analyst, &quot;<em>There are 47 alerts on host-314, here&#x27;s a summary,</em>&quot; it hasn&#x27;t done any work. It&#x27;s just pointed at where the work starts. The actual work lives in the alert list, the process tree, the investigation graph, and the case file. You can&#x27;t do it from a paragraph of text.</p><p>The security MCP App returns the workflow itself. The analyst prompts the agent, and the agent returns an interactive dashboard in the chat where the analyst can drill into alerts, run threat hunts, correlate attack chains, and open cases, all without losing the thread of the conversation. And because the findings, queries, and cases all land back in Elasticsearch, the same investigation is waiting in Kibana where the analyst can pick back up after the conversation has been closed.</p><h3>Six interactive dashboards</h3><p>The <a href="https://github.com/elastic/example-mcp-app-security">Elastic Security MCP App</a> ships six interactive elements, one per major SOC workflow. 
Each one is a React UI that renders inline when the agent calls the corresponding tool:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tool</td><td style="border: 1px solid #ccc; padding: 4px;">What it does</td><td style="border: 1px solid #ccc; padding: 4px;">Interactive UI</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Alert triage</td><td style="border: 1px solid #ccc; padding: 4px;">Fetch, filter, and classify security alerts</td><td style="border: 1px solid #ccc; padding: 4px;">Severity grouping, AI verdict cards, process tree, network events</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Attack Discovery</td><td style="border: 1px solid #ccc; padding: 4px;">AI-correlated attack chain analysis with on-demand generation</td><td style="border: 1px solid #ccc; padding: 4px;">Attack narrative cards with confidence scoring, entity risk, MITRE mapping</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Case management</td><td style="border: 1px solid #ccc; padding: 4px;">Create, search, and manage investigation cases</td><td style="border: 1px solid #ccc; padding: 4px;">Case list with alerts, observables, comments tabs, AI actions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Detection rules</td><td style="border: 1px solid #ccc; padding: 4px;">Browse, tune, and manage detection rules</td><td style="border: 1px solid #ccc; padding: 4px;">Rule browser with KQL search, query validation, noisy-rule analysis</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Threat hunt</td><td style="border: 1px solid #ccc; padding: 4px;">ES|QL workbench with entity investigation</td><td style="border: 1px solid #ccc; padding: 4px;">Query editor, clickable entities and investigation graph</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Sample data</td><td style="border: 1px solid #ccc; padding: 4px;">Generate ECS security events for common attack scenarios</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario picker with four pre-built attack chains</td>
                    </tr>
              </tbody>
            </table>
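          <p>All six follow the same return shape. The sketch below is hypothetical TypeScript (our names, not the app&#x27;s actual code): a tool hands the model a compact text part to reason over and points the host at a UI resource to render inline.</p><pre><code>// Hypothetical shape of an MCP App tool result (illustrative only).
interface ToolResult {
  content: Array&lt;{ type: "text"; text: string }&gt;; // what the model reasons over
  uiResource: string; // what the host renders inline, e.g. "ui://security/alert-triage"
}

// Stub standing in for the app&#x27;s real data fetch (hypothetical).
async function fetchAlerts(host: string, window: string): Promise&lt;unknown[]&gt; {
  return []; // real app: alerts API / ES|QL call scoped to host and window
}

async function alertTriage(host: string, window: string): Promise&lt;ToolResult&gt; {
  const alerts = await fetchAlerts(host, window);
  return {
    content: [{ type: "text", text: `${alerts.length} alerts on ${host}, grouped by severity.` }],
    uiResource: "ui://security/alert-triage",
  };
}</code></pre>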
          <p>Each tool returns a compact text summary that the model can reason over, alongside the interactive UI the analyst acts on. The UI can also fetch fresh data behind the scenes through the MCP host bridge. The full tool model and bridge API live in the <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/architecture.md">repo&#x27;s architecture doc</a>.</p><p>The app also ships with <a href="https://github.com/elastic/example-mcp-app-security/tree/main/skills">Claude Desktop skills</a>, <code>SKILL.md</code> files that teach the agent when and how to use each tool. Download pre-built skill zips from the <a href="https://github.com/elastic/example-mcp-app-security/releases/latest">latest release</a>.</p><h3>From Alert to Case</h3><p>Four skills cover the core SOC loop. Each one picks up a prompt, calls a tool, and returns an interactive dashboard alongside a text summary that the model reasons over. An analyst&#x27;s day usually starts with an alert queue.</p><p><strong>Triage alerts. </strong>Ask the agent to triage by host, rule, user, or time window. The Alert Triage skill returns a dashboard of AI verdicts above the raw alert list, with one verdict per detection rule classifying that rule&#x27;s activity as benign, suspicious, or malicious, each with a confidence score and a recommended action. Click any alert to open a detailed view with a process tree, network events, related alerts, and MITRE ATT&amp;CK tags. No need to context switch between the AI conversation and your alerts dashboard inside Kibana; everything happens in real time inside your conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0239ea916a828ee0bfd0cc046af17330568fe09-2048x1313.png" style="width: 100%;" /><p><strong>Hunt for threats. </strong>Ask the agent to hunt across your indices. The Threat Hunt skill returns an ES|QL workbench with the query pre-populated and auto-executed, with every entity in the results clickable for drill-down. The model writes a short read-out below the table: what&#x27;s unusual, what&#x27;s connected, what&#x27;s worth a closer look. It then offers the next pivot: either go deeper into the threat hunt, or start a new skill within the MCP app that complements the work done so far. A natural way to tie this together is launching an Attack Discovery to gather more context on the alerts you’ve examined and the threats you’ve hunted so far.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9f0bde828bcedb24e2b45995cfa306f03e57d300-2048x1313.png" style="width: 100%;" /><p><strong>Run Attack Discovery.</strong> The Attack Discovery skill triggers the Attack Discovery API and returns a ranked list of findings. Each finding is a set of related alerts stitched into one attack chain, with MITRE tactics, a risk score, a confidence label, and the impacted hosts and users surfaced up front. The agent&#x27;s summary lands below the findings in the same rank order, and the conversation now holds everything needed to act: hunt queries, triage decisions, correlated chains, all staged for the next step.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e52af22dbaa070192170280d519dfb219ad807f5-2048x1313.png" style="width: 100%;" /><p><strong>Open cases without leaving the chat.</strong> Approve findings in bulk or ask the agent to open cases for specific alerts. 
The Case Management skill creates one case per approved finding (source alerts attached, MITRE tactics inherited from the attack chain) and renders the live case list inline. Click a case for its detail view, which includes a row of AI action buttons: Summarize case, Suggest next steps, Extract IOCs, and Generate timeline. Each one drops a structured prompt back into the chat, so the agent picks up the case context without needing a reintroduction. The agent&#x27;s summary sits below the case list and covers the full IR queue, including the cases just opened and earlier findings that still need one.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2057310b450c2917c5434c713f2b34732067429a-2048x1283.png" style="width: 100%;" /><p>Every step in this walkthrough runs the same loop: a prompt comes in, the skill picks it up, the tool returns a compact text summary for the model to reason over, alongside an interactive UI that the analyst acts on. Chain the skills together, and they compose into an end-to-end SOC flow — hunt, triage, correlate, open cases, and drive the next pivot, all with the model carrying the session context across every step. Invoke any one on its own and it&#x27;s still the full dashboard, pointed at whatever slice of your data you name. Either way, the work accumulates inside the conversation; no tab switching, no copy-paste, no hand-offs.</p><p>Two more skills round out the app: a detection-rule browser for tuning noisy rules, and a sample-data generator for spinning up realistic ECS events against a fresh cluster. A follow-up post will go deep on all six, including the investigation graph, the attack-flow canvas, and an end-to-end walkthrough.<br/></p><blockquote><em>&quot;The MCP App for Elastic Security bridges the gap between automated detection and manual hunting. By bringing our security data directly into a single interface within Claude Desktop, we surfaced &#x27;silent&#x27; threats in under an hour, risks that didn&#x27;t trigger standard alerts but required immediate action. It&#x27;s a force multiplier for our analysts.&quot; Mandy Andress, Chief Information Security Officer (CISO), Elastic.</em></blockquote><h3>How it works</h3><p>Each MCP App is a small Node.js server whose tools return both a compact text summary for the model and a React UI that the host renders inline. Because it&#x27;s built on the open MCP App spec, the same server runs on any compatible host; see the <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/architecture.md">repo&#x27;s architecture doc</a> for the full design.</p><h3>Try it</h3><p>Requires Elasticsearch 9.x with Security enabled, plus Kibana for cases, rules, and Attack Discovery. The fastest path is the one-click <code>.mcpb</code> bundle from the <a href="https://github.com/elastic/example-mcp-app-security/releases/latest">latest release</a>: double-click it in Claude Desktop, and you&#x27;ll be prompted for your Elasticsearch URL and API key. 
Setup guides for <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-cursor.md">Cursor</a>, <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-vscode.md">VS Code</a>, <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-claude-code.md">Claude Code</a>, <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-claude-ai.md">Claude.ai</a>, and building from source are in the <a href="https://github.com/elastic/example-mcp-app-security">repo</a>.</p><a href="https://videos.elastic.co/watch/voDxV9RWWpY9HeBYTWTS93">Watch video on Vidyard</a><h2>Elastic Search MCP App: Dashboards built from conversation</h2><p>Every Kibana user knows the dashboard detour: leave what you&#x27;re working on, open Kibana, pick an index, pick fields, pick a visualization, tweak, and save. That&#x27;s five context switches before a single chart is on screen.</p><p>The new <a href="https://github.com/elastic/example-mcp-dashbuilder">example-mcp-dashbuilder</a> reference app collapses that into a prompt. Ask the agent to &quot;<strong>build me a dashboard with revenue metrics, order trends, and category breakdowns</strong>&quot; and the dashboard comes back inside the conversation without any tab-switching required.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b096b493726488f53040b08b5c8bc53ca0e44f9d-2048x1182.png" style="width: 100%;" /><p>Behind that prompt, the agent explores your Elasticsearch data via ES|QL and selects chart types to match the data: bars for comparisons, lines for trends, metric cards for KPIs, and heatmaps for two-dimensional patterns. It lays panels out on Kibana&#x27;s 48-column grid using the Elastic UI Borealis theme, and the result is fully interactive: you can drag, resize, and group panels into collapsible sections right in the chat. When the dashboard looks right, a single tool call exports it to Kibana, preserving ES|QL queries and custom colors. You can also import existing Kibana dashboards back into the chat for AI-assisted editing.</p><p>The principle is the same one behind the Security app: when the artifact is the product, returning it inside the conversation closes the loop between describing what you want and seeing it.</p><p>Under the hood, it follows the same MCP App pattern. A Node.js server registers a <code>view_dashboard</code> model-facing tool alongside a set of app-only tools the UI calls directly (data fetching, layout persistence, time-field detection, export/import). The dashboard view itself is a single self-contained HTML file bundled with <code>vite-plugin-singlefile</code> and served as an MCP App resource. Builders forking the repo get the same server shell and host bridge they see in the Security app, pointed at a different job. The example-mcp-dashbuilder README has the full architecture and chart-type reference.</p><a href="https://videos.elastic.co/watch/wep6PxPiEDyyvsYTKq6v5B">Watch video on Vidyard</a><h2>Elastic Observability MCP App</h2><p>The third reference app, <a href="https://github.com/elastic/example-mcp-app-observability">Elastic Observability MCP App</a>, tackles the SRE version of the same shape problem. When something breaks in production, the answer the on-call engineer needs is not a chart, it is a diagnosis stitched together from K8s metrics, APM topology, ML anomalies, and risk assessment. 
The shape of the answer is a causal story: what failed, why, what depends on it, and what to do next.</p><p>Six tools support the observability investigation workflow:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tool</td><td style="border: 1px solid #ccc; padding: 4px;">Dependency</td><td style="border: 1px solid #ccc; padding: 4px;">What It Does</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">observe</td><td style="border: 1px solid #ccc; padding: 4px;">Universal</td><td style="border: 1px solid #ccc; padding: 4px;">Transient ES|QL + ML-anomaly access primitive — run a query once, live-sample a metric, or block until a threshold or anomaly fires.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">manage-alerts</td><td style="border: 1px solid #ccc; padding: 4px;">Universal</td><td style="border: 1px solid #ccc; padding: 4px;">Create, list, get, and delete Kibana custom-threshold alerting rules. Omit the Kibana URL to run read-only.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ml-anomalies</td><td style="border: 1px solid #ccc; padding: 4px;">ML jobs</td><td style="border: 1px solid #ccc; padding: 4px;">Query ML anomaly records and open an inline anomaly-explainer view.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">apm-health-summary</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Cluster-level health rollup from APM telemetry; layers in K8s and ML context when available.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">apm-service-dependencies</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Service dependency graph — upstream/downstream, protocols, call volume.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">k8s-blast-radius</td><td style="border: 1px solid #ccc; padding: 4px;">Kubernetes metrics</td><td style="border: 1px solid #ccc; padding: 4px;">Node-outage impact — full outage, degraded, unaffected, reschedule feasibility.</td>
                    </tr>
              </tbody>
            </table>
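          <p>Before walking through the views, it helps to see how an agent might shape calls to <code>observe</code>, the primitive the other workflows lean on. The argument names below are illustrative TypeScript, sketched from the table above rather than the app&#x27;s actual schema:</p><pre><code>// Hypothetical argument shapes for observe&#x27;s three modes (illustrative only).
type ObserveArgs =
  | { mode: "query"; esql: string }                        // run once, return a table
  | { mode: "sample"; esql: string; durationSec: number }  // live-sample a metric
  | { mode: "watch"; esql: string; belowThreshold: number; timeoutSec: number }; // block until it fires

const calls: ObserveArgs[] = [
  // One-shot: what is CPU doing right now?
  { mode: "query", esql: "FROM metrics-* | STATS AVG(system.cpu.total.norm.pct)" },
  // Live sample: follow a latency metric for 60 seconds
  { mode: "sample", esql: "FROM traces-apm* | STATS AVG(transaction.duration.us)", durationSec: 60 },
  // Watch: fire when free memory drops below 80MB, give up after 10 minutes
  { mode: "watch", esql: "FROM metrics-* | STATS MIN(system.memory.free)", belowThreshold: 8e7, timeoutSec: 600 },
];</code></pre>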
          <h3><strong>Cluster health rollup</strong></h3><p>Ask &quot;what&#x27;s broken?&quot; or &quot;give me a status report&quot; and get a one-shot orientation: overall health badge, degraded services with reasons, top pod memory consumers, anomaly severity breakdown, and service throughput — all in one inline view. This is the starting point when something feels off, but you don&#x27;t know where to look. The view adapts based on what your deployment supports. APM gives you service health. Kubernetes metrics add pod and node context. ML jobs layer in anomalies.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f31be8bbcffd54623fcec4e51fd914dae666c026-1510x1810.png" style="width: 100%;" /><h3><strong>Service dependency graph</strong></h3><p>Ask &quot;what calls checkout?&quot; or &quot;show me the topology&quot; and get a layered dependency graph — upstream callers, downstream dependencies, protocols, call volume, and latency per edge. Let’s ask Claude to “Show me the service dependencies of the frontend”:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2afec7dbd2153bc3a9f05068f323911af766a258-2048x1184.png" style="width: 100%;" /><p>Zoom, pan, and hover to get all the details you need to understand the complex service relationships:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ff99d2fd0a54c22c9f7b975e9d8d7073c4311e7a-2048x970.png" style="width: 100%;" /><h3><strong>Assess risk with a blast radius</strong></h3><p>Ask &quot;what happens if my k8s node goes down?&quot; and get a radial impact diagram: the target node at center, full-outage deployments in red, degraded in amber, unaffected in gray. A floating summary card shows pods at risk and rescheduling feasibility. Single-replica deployments are flagged as single points of failure. </p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7736bef9609e45a3f0c7f9f3090a314b0834315d-1272x1838.png" style="width: 100%;" /><h3><strong>Observe</strong></h3><p>The agent&#x27;s primary access primitive for Elastic — one tool, three modes for three different needs. Say &quot;what&#x27;s CPU doing right now?&quot; and it runs an ES|QL query once and returns a table. Say &quot;show me frontend latency for the next 60 seconds&quot; and it live-samples the metric, updating the chart in-place. Say &quot;tell me when memory drops below 80MB&quot; or &quot;watch for anything unusual for the next 10 minutes&quot; and it blocks until the condition fires or the window expires. The view adapts to the mode: a results table for one-shot queries, a live trend chart with current/peak/baseline stats for sampling and threshold conditions, and a severity-scored trigger card for anomaly mode.</p><h3>How it works</h3><p>Same MCP App pattern as the Security and Search apps: a Node.js server, six model-facing tools wired to six single-file view resources. Tools are grouped by deployment backend (Universal, APM-dependent, K8s-dependent, ML-dependent), so the agent and the user both know up front which tools apply to a given deployment instead of discovering capability gaps at call time. 
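</p><p>A hedged sketch of that grouping (illustrative TypeScript; the names are ours, not the app&#x27;s):</p><pre><code>// Register only the tools whose backend dependency is actually present.
type Backend = "universal" | "apm" | "k8s" | "ml";

const toolBackends: Record&lt;string, Backend&gt; = {
  "observe": "universal",
  "manage-alerts": "universal",
  "ml-anomalies": "ml",
  "apm-health-summary": "apm",
  "apm-service-dependencies": "apm",
  "k8s-blast-radius": "k8s",
};

function applicableTools(detected: Set&lt;Backend&gt;): string[] {
  detected.add("universal"); // universal tools always apply
  return Object.entries(toolBackends)
    .filter(([, backend]) =&gt; detected.has(backend))
    .map(([name]) =&gt; name);
}

// A deployment with APM but no K8s metrics or ML jobs:
// applicableTools(new Set(["apm"]))
// -&gt; ["observe", "manage-alerts", "apm-health-summary", "apm-service-dependencies"]</code></pre><p>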
The MCP App also includes an example Agent Builder workflow, <code>k8s-crashloop-investigation-otel</code>, which can trigger on a Kubernetes alert and return a structured root cause summary before you&#x27;ve opened a single dashboard.</p><a href="https://videos.elastic.co/watch/xcF3nKDziVgphc5bPAkv9R">Watch video on Vidyard</a><h2>The Agentic Stack, Interactive</h2><p>Three properties about this pattern are worth stating directly. First, the tool result is no longer the end of the work; it is the start of it: the conversation returns an interface you can act on, not a summary you have to act from. Second, the same agent, the same model context, and the same conversation thread can now move across Security, Search, and Observability surfaces without leaving the conversation. Third, this only works because Elasticsearch and Kibana already expose the APIs. The MCP App is a thin interactive layer over the product capabilities we already ship.</p><p>Attack Discovery already powers the correlated findings view inside this app. Inside the stack, the same agentic pattern goes further: Elastic Workflows automate the deterministic steps (enrich entities, create cases, isolate hosts), while Agent Builder reasons over the data and invokes those workflows as tools. The MCP App brings that same security surface into the external conversation; Workflows and Agent Builder deepen it inside the stack. Different entry points, same Elastic APIs underneath.</p><p><strong>Try it:</strong></p><ul><li><strong>Security:</strong> <a href="https://github.com/elastic/example-mcp-app-security">example-mcp-app-security</a></li><li><strong>Search and dashboards:</strong> <a href="https://github.com/elastic/example-mcp-dashbuilder">example-mcp-dashbuilder</a></li><li><strong>Observability: </strong><a href="https://github.com/elastic/example-mcp-app-observability">example-mcp-app-observability</a></li></ul><p>Don&#x27;t have an Elasticsearch cluster yet? Start a free <a href="https://cloud.elastic.co/registration">Elastic Cloud trial</a>. For more on the building blocks behind the security app, see the related Security Labs posts on <a href="https://www.elastic.co/security-labs/from-alert-fatigue-to-agentic-response">Elastic Workflows and Agent Builder</a>, <a href="https://www.elastic.co/security-labs/agent-skills-elastic-security">Agent Skills</a>, and <a href="https://www.elastic.co/security-labs/speeding-apt-attack-discovery-confirmation-with-attack-discovery-workflows-and-agent-builder">Attack Discovery</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing unified API keys for Elastic Cloud Serverless and Elasticsearch]]></title>
            <description><![CDATA[Learn how Elastic unified control plane and data plane authentication in Serverless with a globally distributed IAM architecture. Use one API key for Cloud and Elasticsearch APIs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-cloud-api-keys-unified-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-cloud-api-keys-unified-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[ Alex Chalkias]]></dc:creator>
            <pubDate>Mon, 20 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/864e229f00eb2018084f13dd7f0e390e18383ed4-1980x1188.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Imagine you&#x27;re a site reliability engineer (SRE) responsible for a growing fleet of Elastic Cloud Serverless projects: Elastic Observability for your production infrastructure, Elastic Security for your security operations center (SOC) team, and Elasticsearch for your customer-facing application. Each project has its own Elasticsearch API key. Your continuous integration and continuous delivery (CI/CD) pipeline needs a separate Cloud API key to provision and manage those projects. Rotation day arrives every quarter: You walk through each project, mint fresh keys, update your Terraform state, redeploy your pipelines, and hope nothing falls through the cracks. When an incident hits at 2 a.m. and you need to revoke access fast, you&#x27;re cross-referencing a spreadsheet of credentials to figure out which key belongs to which project and which service.</p><p>Today, that story gets a lot simpler. <strong>Elastic Cloud API keys</strong> can now be used to authenticate directly against <strong>Elasticsearch</strong> and <strong>Kibana</strong> APIs on <strong>Elastic Cloud Serverless</strong>. You can now use a single credential to manage your organization&#x27;s resources <em>and</em> run data operations, like Elasticsearch Query Language (ES|QL) queries, data ingestion, and alerting.</p><p>Let’s look at why we built this, how we engineered a globally distributed identity layer to make it possible, and how it lays the foundation for cross-project search.</p><h2>The secret management burden</h2><p>Building reliable CI/CD pipelines, GitOps workflows, or Terraform automation around data platforms comes with a hidden cost: secret sprawl.</p><p>In the previous model, developers faced a disjointed authentication story:</p><ul><li><strong>Control plane (Elastic Cloud API keys):</strong> Organization-scoped keys used to create projects, invite users, and manage billing via the <a href="https://www.elastic.co/docs/api/doc/cloud/">Elastic Cloud API</a>.</li><li><strong>Data plane (Elasticsearch API keys):</strong> Project-scoped keys created <em>inside</em> a specific Serverless project to interact with <a href="https://www.elastic.co/docs/api/doc/elasticsearch-serverless/">Elasticsearch</a> and <a href="https://www.elastic.co/docs/api/doc/serverless">Kibana</a> APIs.</li></ul><p>This meant that your deployment script had to authenticate to Elastic Cloud, provision a Serverless project, extract a newly minted Elasticsearch API key from that specific project, and then inject <em>that</em> second key into the downstream application or automation tool, resulting in complex pipelines, fragmented audit logs, and a higher risk of credential leaks.</p><h2>Unified authentication in Elastic Cloud Serverless</h2><p>With this release, the split is gone for Serverless projects. You can now create an Elastic Cloud API key that’s explicitly authorized for <strong>Cloud, Elasticsearch, and Kibana APIs</strong>.</p><ul><li><strong>Before:</strong> An Elastic Cloud API key was strictly a control plane token. It could create projects, manage billing, and invite users, but it had a hard boundary; it couldn’t be used to call the Elasticsearch or Kibana APIs inside those projects. You always needed a second, project-specific key for data operations.</li><li><strong>Now:</strong> By opting into <strong>Cloud, Elasticsearch, and Kibana API</strong> access when creating an Elastic Cloud API key, the hard boundary is removed for Serverless. 
That API key becomes a truly unified credential. It retains its ability to manage your organization&#x27;s infrastructure, while simultaneously gaining native access to query, ingest, and analyze data across any authorized Serverless project.</li></ul><p>By unifying this under a single Elastic Cloud API key, you gain a single identity that can be scoped, audited, rotated, and revoked as one unit. Every API call, whether it provisions a new project or runs an ES|QL query, appears under the same credential in your audit logs, giving you a single trail to follow during incident investigations or compliance reviews. Credential rotation becomes a one-step operation instead of a coordinated update across separate control-plane and data-plane secrets. And because role assignments are per-project, a single key can span several projects, managing ingestion in your observability project and running queries in your security project, without juggling separate credentials for each.</p><p>Importantly, <em>unified</em> does not mean <em>all-powerful</em>. By using the <code>role_assignments</code> payload, you can scope a unified key strictly to a single project and a specific role (such as read-only), ensuring the blast radius remains completely contained if a credential is ever exposed. If a developer leaves or an application is decommissioned, you can revoke a single key from the Elastic Cloud Console, immediately terminating access across both the control plane and all associated Elasticsearch projects.</p><p><em>(Note: For Elastic Cloud Hosted/managed deployments, Cloud API keys still only manage the control plane. Support for extending this to hosted stack APIs is planned for a future release.)</em></p><h2>Automating your workflows</h2><p>Getting started is simple. You can configure this entirely through the Elastic Cloud console or automate it using the <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys">Elastic Cloud API</a>.</p><p>The UI process remains the same, but now you can select <strong>Cloud, Elasticsearch, and Kibana API</strong> access under the project role assignment.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6da7949fc93a52b080ba4d5c0aebbd750273a43a-2700x2699.webp" style="width: 100%;" /><p>Here’s how you create a unified key programmatically using the Elastic Cloud API. Notice the <code>application_roles</code> array, as this is what grants the key native access to the Elasticsearch data plane:</p><pre><code>curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: ApiKey $EC_API_KEY" \
  "https://api.elastic-cloud.com/api/v1/users/auth/keys" \
  -d '{
    "description": "unified-automation-key",
    "expiration": "90d",
    "role_assignments": {
      "project": {
        "elasticsearch": [
          {
            "role_id": "elasticsearch-admin",
            "organization_id": "YOUR_ORG_ID",
            "all": false,
            "project_ids": ["YOUR_PROJECT_ID"],
            "application_roles": ["admin"]
          }
        ]
      }
    }
  }'</code></pre><p>Once created, you simply pass this exact same key in the <code>Authorization: ApiKey</code> header to both <code>api.elastic-cloud.com</code> and your specific Serverless Elasticsearch endpoints.</p><h2>Under the hood: Building a distributed identity layer</h2><p>Making a Cloud API key work across both the control plane and the data plane isn&#x27;t as simple as passing a token. It requires solving a fundamental distributed systems challenge.</p><p>Historically, Cloud API keys lived in a centralized global security cluster. That works fine for control plane operations where a higher latency is acceptable. However, Elasticsearch data requests require ultra-low latency. We cannot afford a round trip across the globe to a central control plane to validate every single search query or ingest request.</p><p>To solve this, we introduced a new authentication architecture backed by a globally distributed datastore. The following sequence diagram shows a client sending an Elasticsearch query using an Elastic Cloud API key, illustrating how authentication happens entirely within the local region, without a round trip to the global control plane. Elasticsearch delegates authentication to the Regional IAM Service, which validates the key and resolves its role assignments against a local replica of the globally distributed database. Once authorized, Elasticsearch executes the query and returns results to the client.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3e38d7a862b9981523c5393c441b92eae13aeb90-2401x1351.webp" style="width: 100%;" /><h3>Globally distributed persistence</h3><p>Instead of relying solely on a centralized security cluster, Elastic Cloud API keys and their associated role definitions are now persisted in a globally distributed, highly available database. This database synchronizes identity and access management (IAM) data across the global control plane and the regional data planes where your Serverless projects actually run.</p><h3>Local validation with regional IAM</h3><p>When your client sends a request to Elasticsearch using an Elastic Cloud API key, the request doesn&#x27;t go back to the global control plane. Instead, it gets routed to the new regional IAM service. It validates the key against the local database replica, ensuring that authentication happens with near-zero latency and is completely insulated from global control plane outages.</p><h3>Dynamic role mapping</h3><p>Authentication is only half the battle; the system also needs to authorize the request. The regional IAM service instantly translates your Cloud-level role assignments (for example, <code>application_roles</code>) into native Elasticsearch privileges. Elasticsearch can then authorize and execute the request locally, without ever needing a local <code>.security</code> index.</p><h2>The foundation for Cross-Project Search</h2><p>This distributed identity architecture is a foundational building block for the future of the Elastic platform.</p><p>Because identity and access are now unified and globally synchronized, we have the framework required to securely pass your identity between different projects. This enables the upcoming <strong>Cross-Project Search (CPS)</strong> capabilities for Serverless.</p><p>With CPS, you&#x27;ll be able to query data spanning multiple remote Serverless projects, such as combining security and observability workloads, as easily as if they were a single dataset. 
By relying on unified API keys, the system can automatically evaluate your permissions across all projects simultaneously without requiring you to configure complex trust relationships, certificates, or duplicate credentials on every target project.</p><h2>Learn more</h2><p>Ready to simplify your stack?</p><ul><li>Read the <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys">Elastic Cloud API keys documentation</a> to learn how to assign stack access.</li><li>Check out the <a href="https://www.elastic.co/docs/api/doc/cloud/operation/operation-create-api-key">Create API key (Elastic Cloud API)</a> reference to automate key generation.</li><li>Review <a href="https://www.elastic.co/docs/deploy-manage/api-keys">Elastic API keys</a> for a full comparison of key types across the Elastic platform.</li></ul><p>Start or continue building in <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> today.</p><h2>Disclaimer</h2><p>The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast approximate Elasticsearch ES|QL - part II]]></title>
            <description><![CDATA[Explaining the approach we use to obtain fast approximate Elasticsearch ES|QL queries and the testing we did of error estimation.]]></description>
            <link>https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Thomas Veasey, Jan Kuipers]]></dc:creator>
            <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c0cd6d42f5d061aac15767539209a7c443ed1acd-1000x600.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>As we discussed in our <a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1">previous blog</a>, we’re introducing fast approximate <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql">ES|QL</a> <code>STATS</code> queries, which will be available in version 9.4 of Elasticsearch and the Elastic Stack. This feature allows users to estimate an expensive analytics query, often orders of magnitude faster than running the full query, by relaxing the constraint that it returns the exact value. We believe this has many uses; for example, we’re planning to integrate it into Kibana to obtain fast chart previews where possible.</p><p>In order for you to be able to trust our estimates, we provide error estimates. Furthermore, since there are edge cases in error estimation, we certify when the estimated value and error are trustworthy. In this blog post, we’ll dive into the theory for approximating and estimating the error in such queries, as well as discuss the testing we’ve done.</p><h3>Background</h3><p>In order to estimate ES|QL <code>STATS</code> queries efficiently, we make use of a property that’s shared by many statistics: Their estimates computed from a large number of independent samples from a dataset approach their true value. In the case of an index with some field $X=\{x_i\},$ we can think of the true value of a statistic as its value computed for a random variable with uniform discrete distribution on $X$. In the following we denote this quantity $\theta$; it can be things like <code>AVG</code>, <code>MEDIAN</code>, and so on. If we make $n$ independent draws from $X$, denoted $S$, such that each value is selected with probability $\frac{1}{|X|}$, we have $n$ independent copies of this random variable. The property we rely on means that a sample statistic value $\hat{\theta}$ computed from $S$ approaches $\theta$ as $n$ becomes large. For example, if $\theta$ is the mean of some metric values then $\hat{\theta} = \frac{1}{|S|}\sum_{x\in S} x \rightarrow \frac{1}{|X|}\sum_{x\in X} x=\theta$ as $|S|=n$ becomes large. Indeed, for many statistics the limiting error distribution is known to be normal. Furthermore, it only depends on the distribution of $X$, the size of the sample $n$ and the type of the aggregation $\theta$. This means supported <code>STATS</code> queries can be approximated with fixed accuracy independent of the index size $|X|$.</p><p>It is easy to pick values at random from a Lucene index: create a filter that takes exponentially distributed jumps through the dataset, where the expected jump size is controlled by the desired sample probability. The AND of this filter and any other Lucene query can be performed extremely efficiently, since AND’ing filter queries is one of the things for which it is well optimized. In our other post, we discussed some real-world query examples to give a sense of the speedup we obtain for different levels of accuracy.</p><p>So far, we&#x27;ve only discussed obtaining an estimate of a query. While such a point estimator can be useful, without knowing anything about its error those uses are limited. We found that ES|QL has existing capabilities that make it relatively easy to incorporate cheap, flexible, and accurate error estimation at the same time. We&#x27;ll discuss this next.</p><h3>Error estimates</h3><p>We view providing an accurate understanding of the uncertainty in our estimates as crucial for users to be able to trust the approximation. 
While having the option to quickly estimate an ES|QL query alone can be useful in certain situations, we wanted to provide a richer feature that allows clients to make intelligent choices. For example, if an approximate query is being used to preview a chart and the error is only a couple of pixels, there’s little point in running another expensive query to redraw it.</p><p>The way we&#x27;ve chosen to represent error is by a confidence interval: the $\alpha$-central confidence interval, to be precise. This can be expressed in terms of the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a>, $F$, of the statistic being estimated. Specifically, it&#x27;s the interval that contains the true value of the statistic with probability $\alpha$, whose endpoints are $F^{-1}\left(\frac{1-\alpha}{2}\right)$ and $F^{-1}\left(\frac{1+\alpha}{2}\right)$. Confidence interval calculations are surprisingly subtle. There are also important constraints for our use case that make standard approaches undesirable. Next, we’ll take a look in more detail at the motivation and the design for the approach we’ve adopted.</p><p>A key requirement of the whole project is to dramatically accelerate expensive analytics queries. It’s therefore vital that the overhead of estimating uncertainty isn’t too large compared to estimating the query result itself. We also want the feature to be as general as possible, but “isolated” within the language. In other words, ES|QL is a flexible language, and we want estimation to work with as much of it as possible. At the same time, we don’t want to introduce a cross-cutting feature that incurs development costs on every new feature we ship.</p><p>With these considerations in mind, we chose to estimate confidence intervals by partitioning the sample set and computing the query output on each subsample. This is reminiscent of the bootstrap; however, since we ensure that each partition receives a disjoint random subset of the sample data, we know that they comprise true estimates of the statistic distribution. To achieve the best possible estimate of the statistic itself, we still compute its value on the full sample. For example, to estimate the mean and its distribution, the process can be expressed as follows:</p><pre><code>FROM data | SAMPLE probability
          | EVAL bucketId = RANDOM(B) // B is the number of buckets
          | STATS avg     = AVG(x),
                  avg_0   = AVG(x) WHERE bucketId==0,
                  (...)
                  avg_B-1 = AVG(x) WHERE bucketId==B-1
            BY grouping
          | EVAL confidence_interval = CONFIDENCE_INTERVAL(avg, avg_0, ..., avg_B-1)
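          // (sketch of the internal rewrite: the engine adds the bucket columns and
          // this final reduction automatically; none of it appears in the user's query)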
</code></pre><p>This introduces a complication to account for the discrepancy between the count of values used to estimate a query statistic and the count used to sample its distribution. This is a downside; however, there are some significant advantages.</p><p>Most of the work in analytic queries resides in computing the aggregate statistics: post-processing after a <code>STATS</code> reduction acts on a far smaller table, and the cost is often relatively small. In this scheme, every row in the input data to the <code>STATS</code> command is processed exactly twice compared to just estimating the statistic. Therefore, roughly speaking, we pay a fixed overhead that&#x27;s the same order of magnitude as the cost of estimating the query in order to estimate its uncertainty. Since we often achieve multiple orders of magnitude speedup on the exact query, this is acceptable.</p><p>Because this process uses a plain old table, with extra columns for the distribution samples, we can pass the whole table through any ES|QL pipeline and compute confidence intervals on the final results. For example, if we include <code>EVAL square_avg = avg * avg</code> in the pipeline above, we&#x27;d have exactly the same <code>square_avg</code>, <code>square_avg_0</code>, …, <code>square_avg_B-1</code> extra values. At the end of the pipeline, we have samples from the distribution of the original statistics and all quantities that are computed using them. Therefore, we can apply our standard confidence interval machinery to reduce the table and convert samples into confidence intervals for derived quantities as well. This whole process is essentially transparent to the rest of the ES|QL language, and as we showed above, can be achieved by query rewriting.</p><h3>The confidence interval calculation</h3><p>We have independent samples of the statistic distribution $\{\hat{\theta}_i\}$. However, they&#x27;re computed with fewer values than our estimate $\bar{\theta}$. We also have a relatively small number of distribution samples, to avoid the count discrepancy being too large, and so we don’t inflate the table too much. We therefore prefer a parametric approach for estimating confidence intervals.</p><p>The errors in the statistics for which we support estimation tend to normal distributions in the limit that they&#x27;re computed from many values. So a natural choice, the standard interval, is to estimate the mean and standard deviation from the samples and report the corresponding normal confidence intervals $\left[ m+\sigma \Phi^{-1}\left(\frac{1-\alpha}{2}\right), m+\sigma \Phi^{-1}\left(\frac{1+\alpha}{2}\right) \right]$. Here, $\Phi$ denotes the standard normal distribution function; for a 90% interval, for instance, this is just $m \pm 1.645\,\sigma$. For heavy-tailed data and statistical functions that are sensitive to outliers, such as <code>STD_DEV</code>, convergence to normality can be slow, resulting in poorly calibrated intervals.</p><p>Briefly, in order to assess the quality of the intervals, one can examine their calibration. Specifically, one computes a quantity called the <a href="https://en.wikipedia.org/wiki/Coverage_probability">coverage</a>. For a central confidence interval, it should contain the true statistic value roughly $\alpha n$ times in $n$ trials. In fact, since we seek the central confidence interval, we can make the stronger statement that the true value should be above, or below, the confidence interval endpoints in roughly $\frac{1-\alpha}{2} n$ out of $n$ trials. The empirical coverage is this fraction computed for a large number of trials. 
It allows us to compare alternative approaches by simulation. We return to this when we report our test results.</p><p>In order to obtain better confidence intervals, we tried a couple of different approaches: the <a href="https://en.wikipedia.org/wiki/Cornish%E2%80%93Fisher_expansion">Cornish-Fisher</a> correction of quantiles and an adaptation of <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Deriving_confidence_intervals_from_the_bootstrap_distribution">bias-corrected accelerated</a> (BCa) confidence intervals. Simulation showed BCa provided more robust calibration across a range of confidences, so this is the approach we selected. The basic idea, which was introduced by Efron, is to assume that there exists a monotonic transformation of the underlying statistic $g=g(\theta)$ which, when applied to a distribution sample, normalizes its distribution:</p><div class="katex">\hat{\phi}=\phi+(1+a \phi) \left(N(0,1) - z_0 \right)\qquad(1)</div><p>Here, $\hat{\phi}=g(\hat{\theta})$, $\phi=g(\theta)$ and $N(0,1)$ is the standard normal random variable. This is clearly a relaxation of the assumption that the statistic itself is normally distributed, which is used to derive the standard interval. In fact, this family includes many distributions, since $g$ is only constrained to be monotonic. (You can think of $1+a\phi=1+a g(\theta)$ as a first-order Taylor expansion of the case that the variance is an arbitrary function of the true parameter value. This further relaxes the assumption that the normalizing transformation also stabilizes the variance.) The nice thing about this ansatz is that $g$ never needs to be explicitly computed, and there exist standard approaches for estimating the parameters $a$ and $z_0$ from the distribution samples.</p><p>To handle $z_0$, one simply arranges for the estimate to land at the median of the transformed distribution. If we assume the cumulative distribution function in $\theta$ space is $F_{\theta}$ then $z_0=\Phi^{-1}(F_{\theta}(\bar{\theta}))$, where $\bar{\theta}$ is the estimated statistic value, and as before $\Phi$ is the standard normal distribution function. Typically, $F_{\theta}$ is approximated by the empirical distribution function, computed indirectly by bootstrap. However, somewhat surprisingly, extensive simulation showed that we obtained better calibrated intervals using a normal approximation to our sample values, i.e. $F_{\theta}(\bar{\theta})=\Phi \left( \frac{\bar{\theta}-\hat{m}}{\hat{\sigma}}\right)$ with $\hat{m}$ and $\hat{\sigma}$ the samples&#x27; empirical mean and standard deviation, respectively.</p><p>To complete the procedure, one can rearrange (1) to derive $\alpha_{\theta}$ quantiles for $\theta$ as follows:</p><div class="katex">\alpha_{\theta}=\Phi\left( z_0 + \frac{z_0+z_{\alpha}}{1 - a (z_0+z_{\alpha})}\right)</div><p>where $z_{\alpha}$ is the standard normal z-score for quantile $\alpha$. Typically, one uses the inverse empirical cumulative density estimate of $F_{\theta}^{-1}$ to convert quantiles back to a confidence interval. However, because we have a mismatch between the count of values used to compute distribution samples and the query estimate, we need to do some sort of scaling. 
Exploring options by simulation, we again found it best to use a normal approximation, $F_{\theta}^{-1}(\alpha_{\theta})=\hat{m}+\frac{\hat{\sigma}}{\sqrt{s}}\Phi^{-1}(\alpha_{\theta})=\hat{m}+\frac{\hat{\sigma}}{\sqrt{s}}\left( z_0 + \frac{z_0+z_{\alpha}}{1 - a (z_0+z_{\alpha})}\right)$, where $s$ is the number of distribution samples we use. This is just applying the usual scaling of the standard deviation by $\frac{1}{\sqrt{\text{sample size}}}$.</p><p>Efron showed that in the case that $\hat{\theta}$ is distributed as $f_{\theta}$, i.e. that it depends only on the true value $\theta$, then the acceleration $a$ can be estimated without any knowledge of $g$. In particular, $a=\frac{1}{6}\text{SKEW}\left(\frac{\partial}{\partial \theta} \log f_{\theta}\right)$, one sixth of the skewness of the score function. By assumption, our statistics tend to normal distributions with mean $\theta$. Since skew is translation and scale invariant, this gives that $a \approx \frac{1}{6}\text{SKEW}(\{\hat{\theta}\})$, i.e. one sixth of the skew of our distribution samples. One thing this glosses over is the dependence of skew, and therefore acceleration, on sample size. We know it tends to zero as the count increases; in fact, skew asymptotes to zero as $\frac{1}{\sqrt{\text{sample size}}}$, so we adjust the acceleration to be $\frac{1}{6\sqrt{s}}\text{SKEW}(\{\hat{\theta}\})$ to account for the count mismatch between the samples $\{\hat{\theta}\}$ and the estimate $\bar{\theta}$.</p><p>Although we significantly improve the calibration of confidence intervals by using a better methodology, we still see issues for some of the supported <code>STATS</code> functions in the case that the underlying distribution has very heavy tails. Therefore, we introduce some additional guard rails, which we discuss next.</p><h3>Guard rails</h3><p>To avoid the user having to understand too much about edge cases, we provide additional safeguards that surface when we&#x27;ve been unable to confirm that the distribution samples behave as we expect. This typically happens when the statistic isn’t computed from a sufficient number of values given the metric distribution. It&#x27;s exacerbated by very skewed metric data and certain aggregation functions, such as <code>STD_DEV</code>, which are sensitive to outliers.</p><p>We have some global constraints on the minimum count of values used to estimate a statistic for which we&#x27;ll certify it. For example, if any bucket is empty, then we can’t rely on the distribution samples. This is because ES|QL allows mixing approximate statistics that treat empty buckets differently. For example, consider the following query:</p><pre><code>SET approximation=true;
FROM data | STATS avg = AVG(x), sum = SUM(y) | EVAL mix = avg + sum
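// for an empty bucket, treating it as zero biases the avg estimate, while
// ignoring it biases the sum; hence results with empty buckets aren't certified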
</code></pre><p>There is no self-contained way of correctly assigning a value to <code>mix</code> for empty buckets, since summing requires that we treat them as zero, in which case we bias our estimate of <code>avg</code>. Alternatively, ignoring empty buckets introduces bias in the <code>sum</code>. There is also a global minimum count of values for which we’ve verified our certification method is sufficiently reliable; this is 10.</p><p>We explored a variety of additional tests to certify the results. These were based both on tests of the underlying data distribution, specifically <a href="https://en.wikipedia.org/wiki/Heavy-tailed_distribution#Hill.27s_tail-index_estimator">Hill’s estimator</a>, and on the statistic’s distribution properties. If the true distribution of the statistic is sufficiently normal, then our estimate and confidence interval calculation behave as we expect: The interval is well calibrated and the interval width is representative of the actual error. Therefore, in the end, we chose to use a test based on the p-values for the distribution samples’ <a href="https://en.wikipedia.org/wiki/Skewness">skewness</a> and <a href="https://en.wikipedia.org/wiki/Kurtosis">kurtosis</a> versus a normal distribution null hypothesis. To certify a result, we require that the two-tailed p-values are greater than 0.05 for both tests. As we show below, we found this test was well aligned with our actual needs: to distinguish results for which the estimate and its confidence interval are more and less reliable.</p><p>There&#x27;s a simple trick we can use to boost the accuracy of the test: Create multiple independent distribution samples and use a vote. Given a test to certify results with a failure rate $f$, the probability of $k$ failures in $t$ independent tests is $\frac{t!}{k!(t-k)!}f^k(1-f)^{t-k}$ when the null hypothesis, that the estimate is trustworthy, is true. For example, for a majority vote with $f=0.05$ and $t=3$, the significance of the test is $1-(3\times 0.95^2\times 0.05+0.95^3)=0.007$, i.e. we fail to certify fewer than 1% of trustworthy results. Note that we can compute multiple trials relatively easily using different seeds for the <code>RANDOM</code> bucket identifier.</p>
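<p>As a quick sanity check of that arithmetic, the same computation can be written directly in ES|QL (a minimal sketch; a 3-test majority vote fails only when two or three of the tests fail):</p><pre><code>// probability that a majority of t=3 independent tests fail when each fails
// with probability f=0.05, i.e. 1 - P(0 failures) - P(1 failure)
ROW f = 0.05
| EVAL p_majority_fails = 1 - (POW(1 - f, 3) + 3 * f * POW(1 - f, 2))
// returns approximately 0.007</code></pre>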
<p>This additional check allows us to certify that we trust our estimates and their errors. We surface this information in the approximate query results. When we can’t certify results, they won’t necessarily be inaccurate, but they should be treated with more caution.</p><h3>Testing</h3><p>The two main aims of the testing we discuss here were to understand the calibration of the confidence intervals and to see how well they characterize the statistics&#x27; estimation errors. The count function is particularly well behaved (its error distribution is binomial), so the majority of our testing focused on metric aggregations. We study smooth distributions but make sure we cover a range of tail behaviors. The presence of outliers is the key factor that reduces the accuracy of estimated statistics. For example, if an outlier isn’t sampled at all, it can significantly affect the value of some statistics.</p><p>We explored a range of light-tailed distributions, such as uniform and normal, and skewed and heavy-tailed distributions, such as exponential, log-normal, Cauchy, and Pareto. For each family of distribution, we used multiple parameterizations, focusing primarily on varying the scale parameter. In total, we had 24 distinct data distributions. Figure 1 shows some example sample distributions from this set. Note that we’ve truncated the charts to remove extreme outliers, which are present for both the Cauchy and log-normal distributions.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb088c17f9755c0d1b3173fb917f0af2c0f83847-1712x950.png" style="width: 100%;" /><p>For each data distribution, we evaluated 14 different sample sizes, ranging from 1,000 to 500,000. Then, for each sample set, we evaluated <code>AVG</code>, <code>COUNT</code>, <code>MEDIAN_ABSOLUTE_DEVIATION</code>, <code>MEDIAN</code>, <code>PERCENTILE([25, 75, 90, 95, 99])</code>, <code>SUM</code> and <code>STD_DEV</code> at two levels of confidence, 50% and 90%. In total, we have around 7,500 distinct experiments. For each experiment, we assessed the interval calibration by performing 100 runs and counting the number of times the true statistic lands in the confidence interval. This gives us a binomially distributed estimate for the true confidence interval coverage. The variation we expect in the estimated coverage changes slightly with the level of confidence; for example, at 50% we expect to see values mainly between 0.44 and 0.56, and for 90% we expect to see values mainly between 0.86 and 0.94 using 100 trials.</p>
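<p>These bands correspond to roughly ±1.28 standard errors of a binomially distributed proportion estimated from 100 trials; the arithmetic can be checked with a quick ES|QL sketch (our reading of the quoted numbers):</p><pre><code>// standard error of a binomial proportion over n trials, and a ±1.28 se band
ROW p = 0.5, n = 100.0
| EVAL se = SQRT(p * (1 - p) / n), lower = p - 1.28 * se, upper = p + 1.28 * se
// p = 0.5 gives roughly [0.44, 0.56]; rerunning with p = 0.9 gives [0.86, 0.94]</code></pre>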
<p>Figure 2 shows <a href="https://en.wikipedia.org/wiki/Box_plot">box plots</a> for the empirical coverage for the two confidence levels computed from all experiments. In all cases, the confidence intervals are reasonably well calibrated. Extreme percentiles are biased for small sample sizes, which leads to the increased outlier counts in those cases. As a rule of thumb, you’d want roughly $\frac{10}{p(1-p)}$ samples to ensure that you have enough samples in the appropriate tail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/02a5375025e811ba18c4e823e1d984261bbf6f42-631x763.png" style="width: 100%;" /><p>Next, we examine the degree to which the confidence intervals capture the typical size of the estimate error. To do this, we examine the distribution of the ratio of the estimated statistics&#x27; error and half the confidence interval width for all certified results. The higher the confidence, the wider the interval, so different confidence levels shift the mean of this distribution. Figure 3 shows this distribution computed for the 90% confidence interval. As expected, the distribution is roughly normal, albeit with a tail of some larger errors. We see that in all cases the confidence interval width gives the order of magnitude of the estimated statistics&#x27; actual errors.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2d4ab88a07910edac7e8406ae4942694751f0090-1000x600.png" style="width: 100%;" /><p>We’ve shown that certified results are nearly always reliable; however, we’d also like some insight into the proportion of results that we fail to certify but that are actually reliable, to confirm that the test aligns with our objective. We use <em>reliable</em> here in the fairly strong sense that the confidence interval is well calibrated. Specifically, for the 50% and 90% confidence intervals, we count the proportion of uncertified results for which the confidence interval empirical calibration has an acceptable margin of error, given the number of trials used to estimate it. Using this procedure, the false positive rate across all experiments is around 1%. This agrees well with the failure rate we expect by chance, given our test parameters, and confirms the assumption underlying the test.</p><p>Finally, to better understand the difference between certified and uncertified results, Figure 4 shows the distribution of the ratio of the estimated statistics&#x27; errors and half the 90% confidence interval for the reliable and unreliable results separately. Note that we truncated the range for uncertified intervals.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c0cd6d42f5d061aac15767539209a7c443ed1acd-1000x600.png" style="width: 100%;" /><h3>Wrapping up</h3><p>In this post, we presented the background behind our approach for quickly estimating ES|QL queries and providing an indication of their errors. To do this, we developed an effective confidence interval mechanism. Our approach also allows us to estimate confidence intervals for quantities derived from sampled statistics via other pipeline operations. Quantifying the error comes with a relatively small overhead compared to just estimating the query. Finally, we developed a statistical test to certify results we return. Values that aren’t certified can still be accurate, but we’re less confident in them.</p><p>As well as testing the feature on a range of real-world use cases, which we discuss in <a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1">our companion post</a>, we tested the error estimation by extensive simulation across a range of data characteristics, sample sizes, aggregation functions, and confidence levels. This showed confidence intervals are well calibrated, and the interval itself provides a good approximation of the actual error we observe in the estimates. Finally, we showed that we were able to certify intervals with a low false negative rate.</p><p>We’re planning to integrate this feature into other stack capabilities in the future, so stay tuned.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast approximate Elasticsearch ES|QL - part I]]></title>
            <description><![CDATA[Introducing the work we've done on a fast approximate querying mode for Elasticsearch ES|QL. In many cases, it allows us to achieve orders of magnitude latency reductions while providing accurate estimates.]]></description>
            <link>https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jan Kuipers, Thomas Veasey]]></dc:creator>
            <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/17afc59be8a46957a341faec1f44c9cb0a221894-1918x1176.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Analytics workloads typically involve summarizing large volumes of data into a much smaller number of statistics. The Elasticsearch Query Language (ES|QL) implements this capability using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by">STATS command</a>. This allows you to select various aggregation functions and apply them to the previous query results, as well as grouping the results by one or more ES|QL expressions. This is a flexible operation that, coupled with ES|QL querying capabilities, allows one to perform <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a> on data stored in collections of Elasticsearch indices.</p><p>One of the key requirements for a pleasant user experience is that these operations are performed quickly. Large language model–based (LLM) agents also introduce new <a href="https://arxiv.org/pdf/2509.00997">higher bandwidth and speculative query patterns</a> that can potentially benefit from different optimization strategies.</p><p>In this two-part blog series, we discuss an optimization approach we’re introducing to ES|QL in version 9.4 of Elasticsearch and the Elastic Stack, which exploits a relaxation of the problem. Rather than trying to get exact values for aggregates, we allow ourselves to return approximate values, together with some characterization of their error. A key benefit of approximation is that it breaks the dependency between performance and dataset size: The accuracy with which one can approximate a query doesn’t depend on the original dataset size but, principally, its data characteristics and the query itself. As we’ll see later, this allows us to achieve some dramatic performance improvements.</p><p>In our<a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2"> next blog post</a>, we will discuss the theory behind our approach and the validation we’ve done of its statistical properties. Here, we introduce the syntax and give a sense of how it’s achieved using standard ES|QL and query rewriting. You can explore its performance on a subset of the popular <a href="https://github.com/ClickHouse/ClickBench">ClickBench</a> benchmark. Finally, we discuss some limitations and gotchas that are worth understanding when you use query approximation.</p><h3>Syntax and behavior</h3><p>So how do you actually use it?</p><pre><code>SET approximation=true;
// The query you want to approximate
FROM data | commands | STATS x=agg(...) | commands</code></pre><p>That’s it. You simply introduce the new line <code>SET approximation=true;</code> and write your <code>STATS</code> query pipeline as usual. Below, we discuss some advanced configuration options and some limitations around the <code>agg(...)</code> and <code>commands</code>. However, essentially, we choose defaults so that this will typically provide useful approximations while achieving significant speedups.</p><p>With this change, you’ll see some differences in the query results. Let’s look at a concrete example to illustrate this. Suppose the raw query is as follows:</p><pre><code>FROM sales | WHERE @timestamp >= NOW()-1w
           | STATS count = COUNT() BY item_category
           | SORT count DESC
           | LIMIT 5</code></pre><p>The results might look something like this:</p><pre><code>item_category        | count
---------------------+------
Household Essentials | 5165
Kitchen              | 2132
Storage              | 1121
Home Decor           | 877
Furniture            | 357</code></pre><p>Approximating this query introduces some extra columns for each quantity that&#x27;s estimated:</p><pre><code>item_category        | count | _approximation_confidence_interval(count) | _approximation_certified(count)
---------------------+-------+-------------------------------------------+--------------------------------
Household Essentials | 5150  | [5100, 5250]                              | true
Kitchen              | 2150  | [2100, 2200]                              | true
Storage              | 1120  | [1100, 1150]                              | true
Home Decor           | 880   | [860, 900]                                | true
Furniture            | 330   | [310, 350]                                | true</code></pre><p>The <code>count</code> column now contains an estimate, and you’ll see it’s somewhat different from the exact values above. The <code>_approximation_confidence_interval(count)</code> column defaults to the central 90% confidence interval for the <code>count</code> estimate, and the <code>_approximation_certified(count)</code> column indicates if we’re highly confident that the results and their confidence interval are trustworthy. In outline, the <em>confidence interval</em> is an interval that we expect has a high probability (0.9) of containing the true value of the quantity being estimated. The <em>certified column</em> indicates that the distribution of the approximation is behaving as we expect. When the result isn’t certified, it’s often still accurate, but our test of the properties of its distribution hasn’t been able to confirm this. These quantities are discussed in more detail in our second post.</p><h3>Implementation</h3><p>An approximate query is rewritten before query execution using random sampling and extrapolation. Let’s take a look at the query of the previous section. The part of the rewritten query responsible for obtaining the best estimate looks like:</p><pre><code>FROM sales | SAMPLE probability
           | WHERE @timestamp >= NOW()-1w
           | STATS count = TO_LONG(COUNT() / probability) BY item_category
           | SORT count DESC
           | LIMIT 5</code></pre><p>The query samples a fraction of the data, and therefore the final count has to be extrapolated by scaling up with the inverse of the sample probability. Extrapolation clearly depends on the underlying aggregation function, and we handle this appropriately for all functions we support.</p>
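<p>To make the extrapolation concrete, here’s a sketch of how different aggregation types would be handled (our illustration of the idea rather than the engine’s exact rewrite; <code>amount</code> is a stand-in field): additive aggregates such as <code>COUNT</code> and <code>SUM</code> are scaled by the inverse sample probability, while ratio-style aggregates such as <code>AVG</code> are already unbiased on the sample.</p><pre><code>FROM sales | SAMPLE probability
           | STATS c = TO_LONG(COUNT() / probability), // additive: scale up
                   s = SUM(amount) / probability,      // additive: scale up
                   a = AVG(amount)                     // ratio: no scaling needed
             BY item_category</code></pre>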
<p>To obtain the sample probability, we&#x27;re setting a fixed <code>number_of_rows</code> to be processed by the <code>STATS</code> command. In this case, the probability is calculated as follows:</p><pre><code>FROM sales | WHERE @timestamp >= NOW()-1w
           | STATS total_row_count = COUNT()
           | EVAL probability = number_of_rows / total_row_count</code></pre><p>This query runs before the final approximate query is executed.</p><p>As well as this best estimate, we also need to compute confidence intervals and a statistical test used to certify that the value distribution is behaving as we expect. The intervals are computed using a variant of the <a href="https://blogs.sas.com/content/iml/2017/07/12/bootstrap-bca-interval.html">bias-corrected and accelerated bootstrap confidence interval</a> (BCa) method. Therefore, the data needs to be partitioned into B buckets, which are used in turn to compute the intervals. Omitting some implementation details, this approximate query looks like:</p><pre><code>FROM sales | SAMPLE p
           | WHERE @timestamp >= NOW()-1w
           | EVAL bucketId = RANDOM(B) // B is the number of buckets
           | STATS count     = TO_LONG(COUNT() / p),
                    count_0   = TO_LONG(COUNT() / p) WHERE bucketId==0,
                   (...)
                   count_B-1 = TO_LONG(COUNT() / p) WHERE bucketId==B-1  
             BY item_category
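           // guard rail: partitions estimated from too few sampled rows are
           // dropped rather than reported (see "Limitations and best practices")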
           | WHERE count >= 10
           | SORT count DESC
           | LIMIT 5
           | EVAL ci = CONFIDENCE_INTERVAL(count, count_0, ..., count_B-1),
                  certified = CERTIFIED(count, count_0, ..., count_B-1)
           | DROP bucketId, count_0, ..., count_B-1</code></pre><p>To certify the estimate and confidence interval, there should be enough data, and the distribution of the bucket values should tend to normality.</p><p>Some queries can be efficiently computed using only summary statistics maintained in the index. To handle these correctly, where sampling is both slower and inaccurate, we updated the physical query planner, since detecting this case requires information that’s only available where the data resides. When the planner detects this is possible, it simply executes the query as normal. Such queries are typically fast anyway, and there’s no real side effect, so you don’t need to worry about this when using approximation; however, you’ll see that confidence intervals for such queries always have zero length, indicating the results are exact.</p><h3>Results</h3><p>To explore the performance improvements, we use <a href="https://github.com/ClickHouse/ClickBench">ClickBench</a>. This is a benchmark for analytics workloads for database management systems (DBMS). It comprises approximately 100 million rows, with a focus on clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. The benchmark also defines 43 queries that are typical of ad-hoc analytics and real-time dashboards.</p><p>Some of the queries aren’t suitable for approximation. For example, we don’t support approximating the unique count of a categorical value or computing the minimum and maximum of a metric value. We also don’t care about queries targeting search alone, for which Elasticsearch has excellent performance in any case. We therefore exclude these types of query from our evaluation. Finally, we also want to test a few additional aggregation functions, such as percentiles, which are not well represented in the original query set, so we add some variants of the original metric queries to this end.</p><p>Queries in the benchmark are written using standard SQL and so need porting to use ES|QL syntax. This translation is fairly straightforward. Here’s an example:</p><pre><code>SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits</code></pre><p>becomes:</p><pre><code>FROM hits | STATS s = SUM(AdvEngineID),
                  c = COUNT(*),
                  a = AVG(ResolutionWidth)</code></pre><p>when rewritten in ES|QL.</p><p>For running all benchmarks, we use an Elastic Cloud Hosted instance with 870GB disk, 29GB RAM, and 4 vCPUs; in effect, an Amazon Elastic Compute Cloud (EC2) i3.xlarge instance. In the following results, we simply compare ES|QL with and without query approximation. Extensive results on a range of different hardware setups and datastores can be found <a href="https://benchmark.clickhouse.com/">here</a>. Even with significantly constrained test hardware (matching the vCPUs of the smallest setup), our approximation approach achieves competitive results against much larger systems.</p><p>We run each query and its approximation five times in a random order, clearing the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-clear-cache">query cache</a> between each run. We report the average run time over all five runs. While clearing the cache should be sufficient to avoid most of the advantage of running second, we wanted to avoid any possible accidental prewarming effects, which is why we alternate.</p><p>The results break down into four categories:</p><ol><li>Queries which are rewritten to use index summary statistics (three queries).</li><li>Queries that perform well (13 queries).</li><li>Queries with high cardinality partitioning (seven queries).</li><li>Queries with restrictive filters (12 queries).</li></ol><p>Roughly speaking, for these four categories, approximate querying is: equivalent (1); faster and accurate (2); faster but unreliable (3); and slightly slower (4), compared to exact querying, respectively.</p><p>For category 1, the planner automatically detects that we’re able to perform the query using summary statistics, and we end up executing the queries in the same way. To do this, we need information that’s only available on the data nodes, so we perform the rewrite only after we&#x27;ve estimated the sample probability. Because we&#x27;re able to do this very efficiently, the overhead is small (around 10–15%). In both cases, the results are exact.</p><p>Queries in category 2 run on average 23$\times$ faster if estimating the values and computing confidence intervals, and 72$\times$ faster if just estimating the values, which you can select as follows: <code>SET approximation={&quot;confidence_level&quot;:null}</code>. These headline figures hide quite some variation in the impact of approximation on performance. The table below shows some queries sampled from the range of speedups we see:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate with CI / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate without CI / ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">1725</td><td style="border: 1px solid #ccc; padding: 4px;">145</td><td style="border: 1px solid #ccc; padding: 4px;">15</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10</td><td style="border: 1px solid #ccc; padding: 4px;">4340</td><td style="border: 1px solid #ccc; padding: 4px;">1721</td><td style="border: 1px solid #ccc; padding: 4px;">56</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">13</td><td style="border: 1px solid #ccc; padding: 4px;">32912</td><td style="border: 1px solid #ccc; padding: 4px;">6106</td><td style="border: 1px solid #ccc; padding: 4px;">3821</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">21</td><td style="border: 1px solid #ccc; padding: 4px;">46739</td><td style="border: 1px solid #ccc; padding: 4px;">3284</td><td style="border: 1px solid #ccc; padding: 4px;">2139</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">252505</td><td style="border: 1px solid #ccc; padding: 4px;">6478</td><td style="border: 1px solid #ccc; padding: 4px;">5019</td>
                    </tr>
              </tbody>
            </table>
          <p>Here are the corresponding queries:</p><pre><code>// Query 3
FROM hits | STATS s = SUM(AdvEngineID),
                  c = COUNT(*),
                  a = AVG(ResolutionWidth)

// Query 10
FROM hits | STATS s = SUM(AdvEngineID),
                  c = COUNT(*),
                  a = AVG(ResolutionWidth) BY RegionID
          | SORT c DESC
          | LIMIT 10

// Query 13
FROM hits | WHERE SearchPhrase != ""
          | STATS c = COUNT(*) BY SearchPhrase
          | SORT c DESC
          | LIMIT 10

// Query 21
FROM hits | WHERE URL != ""
          | STATS l = AVG(LENGTH(URL)), c = COUNT(*) BY CounterID
          | WHERE c > 100000
          | SORT l DESC
          | LIMIT 25

// Query 22
FROM hits | WHERE Referer != ""
          | GROK Referer """%{URIPROTO}://(?:www\.)?%{URIHOST:k}"""
          | WHERE k IS NOT NULL
          | STATS l = AVG(LENGTH(Referer)), c = COUNT(*) BY k
          | WHERE c > 100000
          | SORT l DESC
          | LIMIT 25</code></pre><p>We&#x27;ll return to the accuracy of the approximation in the next blog post, but to give a sense of this, we plot below the exact and approximate values for a sample run for query 13:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9ab83c13f42f88253a242d78339356f4a7c48700-2094x1358.png" style="width: 100%;" /><p>For category 3, we get an average speedup of $11\times$. However, the results of queries in this category can miss some partitions and often have large estimation errors. Approximation can still be valuable for such queries, particularly in the context of agentic workflows, but requires larger sample sizes than our default if accuracy is important. As we discuss in the next section, we provide an API to explicitly control the sample size. If the source dataset is sufficiently large, this can be increased and approximation will still yield significant performance improvements. The table below shows a couple of query examples for this category:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate with CI / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate without CI / ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">15</td><td style="border: 1px solid #ccc; padding: 4px;">8256</td><td style="border: 1px solid #ccc; padding: 4px;">1187</td><td style="border: 1px solid #ccc; padding: 4px;">124</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">17</td><td style="border: 1px solid #ccc; padding: 4px;">70641</td><td style="border: 1px solid #ccc; padding: 4px;">2109</td><td style="border: 1px solid #ccc; padding: 4px;">982</td>
                    </tr>
              </tbody>
            </table>
          <p>Here are the corresponding queries:</p><pre><code>// Query 15
FROM hits | STATS c = COUNT(*) BY UserID, SearchPhrase
          | SORT c DESC
          | LIMIT 10

// Query 17
FROM hits | EVAL m = DATE_EXTRACT("minute_of_hour", EventTime)
          | STATS c = COUNT(*) BY UserID, m, SearchPhrase 
          | SORT c DESC
          | LIMIT 10</code></pre><p>Finally, category 4 queries use selective filters and end up being executed exactly, but they run slightly slower because of the work done in the query rewrite stage. Typically, all these queries run fast anyway, so the absolute slowdown is small. On average, they run approximately 14% (around 370ms) slower than running without sampling for our test setup.</p><h3>Limitations and best practices</h3><p>It’s worth explicitly mentioning some limitations. In particular, the following queries are not currently supported:</p><ol><li>Queries using the <code>TS</code> source command.</li><li>Queries using the <code>FORK</code> or <code>JOIN</code> processing command.</li><li>Pipelines which use two or more <code>STATS</code> commands.</li><li>The <code>ABSENT</code>, <code>PRESENT</code>, <code>DISTINCT_COUNT</code>, <code>MIN</code>, <code>MAX</code>, <code>TOP</code>, <code>ST_CENTROID_AGG</code> and <code>ST_EXTENT_AGG</code> aggregation functions.</li></ol><p>We plan to lift some of these restrictions in future releases, such as approximating queries using <code>TS</code>, <code>FORK</code> and <code>JOIN</code>; however, some are intrinsic. For example, while there’s prior art for estimating the <a href="https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution">minimum and maximum</a> of a metric dataset or the count of unique values of a categorical dataset (see, for example, <a href="https://arxiv.org/pdf/2202.02800">this</a> paper), they require making certain distributional assumptions, either explicitly or implicitly. In summary, we view trying to automatically provide estimates of these statistics as being too open to accidental misuse.</p><p>For the expert user, we provide another route: ES|QL supports using the <code>SAMPLE</code> command directly. This allows one to obtain “point estimates” of any query, albeit with no attempt to correct for the impact of sampling or quantify error. For example:</p><pre><code>FROM data | SAMPLE 0.01 | STATS DISTINCT_COUNT(value)</code></pre><p>computes the unique count of the <code>value</code> field on a sample of roughly 1/100th of the dataset. The sample probability can be adjusted to get a sense of how the estimate is asymptoting, or more sophisticated estimation procedures can use <code>STATS COUNT() BY value</code> to estimate the frequency profile of the data.</p>
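<p>For example, one could probe how such a point estimate settles as the sample grows (a sketch, run as three separate queries; <code>value</code> stands in for your field):</p><pre><code>// rerun with increasing sample probabilities and watch where the estimate levels off
FROM data | SAMPLE 0.001 | STATS dc = DISTINCT_COUNT(value)
FROM data | SAMPLE 0.01  | STATS dc = DISTINCT_COUNT(value)
FROM data | SAMPLE 0.1   | STATS dc = DISTINCT_COUNT(value)</code></pre>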
<p>There are a couple of cases that are more problematic for sampling. If a very restrictive filter is applied in the query, then sampling is of little value, since few rows match anyway. In such cases, the rewrite phase discovers that we’d have to sample too large a proportion of the rows to estimate the query, and we revert to running the query without sampling; its result is exact. However, the search procedure to determine the fraction of rows to sample comes with some overhead. One therefore pays a penalty, albeit less than the original query cost, for no benefit. If you know in advance that the query is expected to match relatively few rows, it&#x27;s best to run it without approximation.</p><p>The second case only applies when computing <code>STATS</code> partitioned by some expression. If the cardinality of this expression is very high, then even if many rows are searched, individual statistics may be computed from a small number of rows. Some cases are more problematic than others. Sorting by ascending count, that is, finding the rarest partitions, can be impossible to estimate in a single query: The sample is dominated by heavy hitters, so finding the rare partitions could require sampling most of the dataset. For this particular case, heavy hitting partitions can be estimated first and sometimes efficiently excluded by updating the query. In general, infrequent partitions may be lost in the sampling process, and their statistics&#x27; estimation errors can be high. It’s worth noting that we won’t attempt to estimate any statistic for which we have fewer than 10 samples; such statistics are simply dropped from the result set. In the case of a very high cardinality <code>BY</code> clause, for example, a field whose value is unique for every row, this means the query can return no results. If you find approximate query results are too inaccurate, you have the option to increase the sample size, which by default is 1,000,000 for <code>STATS</code> with grouping and 100,000 otherwise. Currently, this needs to be done manually, and we provide the following API for this:</p><pre><code>SET approximation={"rows":12345678};
FROM data | commands | STATS x=agg(...) | commands</code></pre><p>Occasionally, functions significantly alter the distribution characteristics of the quantities they act on. A contrived example is the following:</p><pre><code>FROM data | STATS sl = SUM(length) | EVAL csl = COS(sl)</code></pre><p>If the variation in the estimate <code>sl</code> is much larger than $2\pi,$ we expect the distribution of <code>csl</code> to be mainly flat in the interval $[-1,1]$ with peaks near both endpoints. In this particular case, it’s not clear that the central confidence interval is a particularly useful concept, since the modes of the distribution lie outside almost all central confidence intervals. In any case, just observing the samples of <code>csl</code>, our standard confidence interval machinery won’t reliably characterize this distribution, and it will underestimate the variability of <code>csl</code>. However, our statistical test should detect this problem, and the result won’t be certified.</p><p>Finally, we note that Elasticsearch implements some query optimization strategies that ideally <a href="https://github.com/elastic/elasticsearch/issues/138151">need to account for the fact that sampling is taking place</a>. These rewrite the query at the Lucene level, and the preprocessing involved in this rewrite can be relatively expensive. Accelerating an expensive string matching operation by first building a suitable data structure makes sense if the query needs to process every row, but if it processes only a small fraction of them, the trade-off is different. This is something we plan to enhance in the future.</p><h3>Conclusions</h3><p>In this blog post, we introduced a new form of query optimization we’re bringing to ES|QL that enables dramatically faster querying by relaxing the constraint that the results are exact. We found on ClickBench that we were able to accurately estimate query values and their confidence intervals up to 100 times faster, and values alone up to 250 times faster, than we can compute them exactly. Furthermore, we expect this advantage to grow as the dataset size increases, because the approximation accuracy is independent of the dataset size. This feature works with many features of the ES|QL language and is enabled by simply prepending <code>SET approximation=true;</code> to the query to estimate.</p><p>As well as providing a point estimate, we also estimate confidence intervals and indicate whether we think that the underlying assumptions used to compute these are satisfied. This allows us to certify results when they are reliable. We explain the theory behind this feature and discuss the testing of its accuracy in our <a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2">next post</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is your ML job's datafeed losing a race it cannot win?]]></title>
            <description><![CDATA[Learn how switching from scroll-based to aggregation-based datafeeds optimizes machine learning jobs for large-scale deployments.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-machine-leaning-jobs-aggregation-datafeeds</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-machine-leaning-jobs-aggregation-datafeeds</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valeriy Khakhutskyy]]></dc:creator>
            <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/023e3e6cb25891f789129d496c181113cc570f1f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>On almost every large Elastic deployment I’ve worked with, there’s an Elastic Security or Elastic Observability anomaly detection (AD) job that looks healthy but is perpetually behind. Six hours behind. Twelve. And the gap never closes.</p><p>The datafeed isn’t broken. It’s doing exactly what it was built to do: reading every raw document, across every shard, every run. On a large cluster with cross-cluster search (CCS) and a broad index pattern, like <code>logs-*</code>, that means scanning billions of documents per bucket. There’s no hardware that makes that sustainable. The datafeed will always be chasing live data and never reaching it.</p><p>The fix is to switch from the default <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-scroll"><strong>scroll-based</strong></a> datafeed configuration to an <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/ml-configuring-aggregation"><strong>aggregation-based</strong> datafeed configuration</a>: Let the data nodes summarize locally, and ship only compact bucket results to the ML node. Same detections, a fraction of the load. The speedup can be dramatic. More than you might expect. The numbers are in the next section. The explanation for <em>why</em> the gap is so large is at the end of the post, for those who want to understand the mechanics.</p><p>One catch worth knowing now: Switching requires creating a new job. The old model doesn’t transfer; weeks of learned baseline are lost. <strong>The right time to make this switch is before the job has been running for months, not after.</strong> That’s the main reason to read this before you deploy.</p><h2><strong>How much faster? Scroll vs. aggregation datafeeds for ML jobs</strong></h2><p>We ran the same job two ways on production data: first scroll-based, and then aggregation-based. The job covered 13 months of history, monitoring 836,000 log events per hour in 15-minute buckets across multiple clusters.</p><p>Training on historical data with scroll-based configuration: <strong>five days of wall-clock time</strong>, 7.9 million sequential requests, and 3.5 TB transferred; with aggregations: <strong>2.3 minutes</strong>, 23 requests, and 34 MB (a 3,374× speedup). Think of it this way: If you start the scroll backfill at 9 a.m. Monday, it will finish Saturday morning. The aggregation version is done by 9:02 a.m.</p><p>On live data, the difference is less dramatic but still meaningful: around <strong>20×</strong> fewer requests per tick. That adds up quickly when the datafeed runs every few minutes around the clock.</p><h2><strong>Before you start</strong></h2><p>Three things worth knowing before diving into the configuration.</p><p><strong>This isn&#x27;t wizard territory.</strong> The standard Kibana job wizards (Single Metric, Multi-Metric, Population) don&#x27;t expose aggregation configuration. To create an aggregation-based job, you need either the Elasticsearch API or Kibana&#x27;s Advanced Job Wizard, with JSON edited by hand. The worked example below shows the most practical path: Configure the job in the Multi-Metric Wizard, and then click <strong>Convert to advanced job</strong> before creating it. 
That gets you a prefilled JSON starting point instead of a blank editor.</p><p><strong>The configuration is unforgiving and mostly silent about it.</strong> There&#x27;s no schema validation that catches a misnamed aggregation key or a <code>fixed_interval</code> that doesn&#x27;t match <code>bucket_span</code>. The job will run, anomalies will fire, and nothing will indicate that the results are based on the wrong data. This is why the five-step pattern exists and why the <strong>Preview</strong> tab is worth using every time: Catching a misconfiguration before the job trains is a 30-second check; catching it a week later is a much worse afternoon.</p><p><strong>The Single Metric Viewer has a known limitation with aggregated jobs.</strong> That viewer reconstructs the &quot;actual&quot; data curve by re-querying the index, but it can&#x27;t reproduce an arbitrary, user-defined aggregation, so the actual-value line is typically missing or approximate. The Anomaly Explorer is unaffected: Anomaly scores, swim lanes, and influencer attribution all work normally. Just don&#x27;t rely on the Single Metric Viewer&#x27;s chart for visual validation of what the model saw.</p><h2><strong>What we can and can’t aggregate</strong></h2><p>Almost every <a href="https://www.elastic.co/docs/reference/machine-learning/machine-learning-functions">ML function</a> works with aggregated datafeeds, but the right aggregation pattern depends on the function.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Function</td><td style="border: 1px solid #ccc; padding: 4px;">Pattern</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`count`, `mean`, `high_mean`, `low_mean`, `sum`, `max`, `min`, `varp`</td><td style="border: 1px solid #ccc; padding: 4px;">Standard: `date_histogram` → `terms` → metric aggregation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`time_of_day`, `time_of_week`</td><td style="border: 1px solid #ccc; padding: 4px;">Minimal: plain `date_histogram`, no `terms` or metric needed</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`rare`, `freq_rare`, `info_content`</td><td style="border: 1px solid #ccc; padding: 4px;">Composite: top-level composite with `date_histogram` as a source</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`categorization`</td><td style="border: 1px solid #ccc; padding: 4px;">`terms` on the `.keyword` subfield of the categorization field</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`lat_long`</td><td style="border: 1px solid #ccc; padding: 4px;">Scroll only</td>
                    </tr>
              </tbody>
            </table>
<p><code>lat_long</code> is the one genuine exception. The configuration is accepted, but <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-geocentroid-aggregation"><code>geo_centroid</code></a> computes the arithmetic mean of all coordinates in a bucket: If the same entity appears in New York and London within the same bucket, the centroid ends up in the Atlantic Ocean, which probably doesn&#x27;t make sense for the use case. Keep <code>lat_long</code> jobs on scroll-based datafeeds.</p><p>The five-step pattern in the next section covers the standard case. We’ll walk through the remaining patterns at the end of the post.</p><h2><strong>The standard five-step pattern: Scroll-based to aggregation datafeed</strong></h2><p>Converting any scroll-based job to an aggregation-based datafeed follows the same five steps. Once you understand the pattern, applying it to any compatible job takes about 10 minutes.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/77cd16165133374a04dbcf71210ea8d36f66b54f-1999x924.png" style="width: 100%;" /><p><strong>Step 1: Add <code>summary_count_field_name: <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-doc-count-field">&quot;doc_count&quot;</a></code> to the analysis config.</strong> This tells the ML engine that incoming data is pre-summarized. Without it, the engine treats each aggregated bucket as a single raw document and produces wrong anomaly scores.</p><p><strong>Step 2: Choose the bucket wrapper topology.</strong> For most functions (<code>count</code>, <code>mean</code>, <code>sum</code>, <code>max</code>, <code>min</code>, <code>varp</code>, <code>time_of_day</code>, <code>time_of_week</code>, and <code>categorization</code>) use a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-datehistogram-aggregation"><code>date_histogram</code></a> at the top level whose <code>fixed_interval</code> matches your <code>bucket_span</code> exactly to ensure accurate analysis. For <code>rare</code>, <code>freq_rare</code>, and <code>info_content</code>, use a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation">composite</a> at the top level with a <code>date_histogram</code> as one of its sources. This routes the datafeed to the composite extractor, which paginates through all field-value combinations rather than truncating to a top-N.</p><p><strong>Step 3: Add a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-max-aggregation"><code>max</code></a> aggregation on <code>@timestamp</code>.</strong> The ML engine needs this to determine the precise end time of each bucket. In the standard topology (Step 2, <code>date_histogram</code> outer), it goes inside the histogram’s <code>aggregations</code>. In the composite topology, it sits as a sibling of the <code>composite</code> aggregation.</p><p><strong>Step 4: Map each analysis field to a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-terms-aggregation"><code>terms</code></a> aggregation</strong>, named exactly after the corresponding field in the analysis config. One categorical field → a single nested <code>terms</code>. Two or more categorical fields → a <code>composite</code> aggregation nested inside the <code>date_histogram</code>, with one <code>terms</code> source per field.
For categorization jobs, use a <code>terms</code> aggregation on the <code>.keyword</code> subfield of the <code>categorization_field_name</code>. The naming rule is strict: The aggregation key must exactly match the field name in the analysis config; the ML engine uses the aggregation name, not the <code>field</code> parameter, to look up values. A mismatch produces silently wrong results; no error, just a job that appears to run while missing everything meaningful.</p><p><strong>Step 5: Map each detector’s metric field</strong> to its Elasticsearch aggregation equivalent:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ML function</td><td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch aggregation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`mean` / `high_mean` / `low_mean`</td><td style="border: 1px solid #ccc; padding: 4px;">`avg`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`sum`</td><td style="border: 1px solid #ccc; padding: 4px;">`sum`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`max`</td><td style="border: 1px solid #ccc; padding: 4px;">`max`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`min`</td><td style="border: 1px solid #ccc; padding: 4px;">`min`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`varp`</td><td style="border: 1px solid #ccc; padding: 4px;">`extended_stats`</td>
                    </tr>
              </tbody>
            </table>
          <p>For <code>count</code>, <code>rare</code>, <code>freq_rare</code>, <code>info_content</code>, <code>time_of_day</code>, <code>time_of_week</code>, and categorization jobs, the ML engine works from <code>doc_count</code> alone; no metric aggregation is needed, and this step can be skipped.</p><h2><strong>Step-by-step example: Building an aggregation-based ML job in Kibana</strong></h2><p>Let’s build this end to end using Kibana’s sample web logs. If you haven’t loaded them yet, go to the Kibana home page and click <strong>Integrations → Sample data → Sample web logs → Add data</strong>. This gives us a data view called <code>Kibana Sample Data Logs</code> and an index called <code>kibana_sample_data_logs</code> with fields including <code>@timestamp</code>, <code>bytes</code> (response size), and <code>geo.dest</code> (destination country).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/75b692b9f38017cd7e4e221d2e89a14f75d3b9dc-1999x1905.png" style="width: 100%;" /><p>We’ll build a job that detects unusually large response sizes: <code>high_mean of bytes</code>, partitioned by destination country (<code>geo.dest</code>), with a 1-hour bucket span.</p><h3><strong>Creating the job with the Multi-Metric Wizard</strong></h3><p>This is how most jobs get created in practice. Navigate to <strong>Machine Learning → Anomaly Detection → Manage Jobs → Create job</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5bb2f2770a76fd535db22b97fc4f72471c43ca7-1999x587.png" style="width: 100%;" /><p>Select the “Kibana Sample Data Logs” data view, and set the time range to cover the full sample dataset. On the job type screen, choose <strong>Multi-metric</strong>.</p><p>In the Multi-Metric Wizard, configure the detector:</p><ul><li><strong>High mean</strong> of <code>bytes</code>.</li><li><strong>Split data by</strong> <code>geo.dest</code>.</li><li><strong>Bucket span:</strong> <code>1h</code>.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3055c91c881e4011521ba0c17cc36c6138595ee-1999x1540.png" style="width: 100%;" /><p>Give the job an ID, and leave everything else at its defaults, but <strong>don’t click Create yet</strong>. On this last configuration step, click on <strong>Preview JSON</strong> and look at the datafeed section. What you’ll see is a plain scroll-based datafeed with no aggregations, just an index pattern and a <code>match_all</code> query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32a1650958525b03a6052d480152933341acdd41-1999x1392.png" style="width: 100%;" /><p>This is the default every wizard produces. On a small cluster, it works fine. On a large cluster with CCS and a broad index pattern, this datafeed will scan every raw document on every run and never catch up with live data.</p><p>Instead of clicking <strong>Create</strong>, click <strong>Convert to advanced job</strong>. This keeps everything you just configured (the detector, the partition field, the bucket span) and drops you directly into the Advanced Wizard, where we can apply the five-step pattern.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/388d279820a639c08b119753f064b3a948ace8c6-1999x1591.png" style="width: 100%;" /><h3><strong>Analysis configuration</strong></h3><p>The conversion prefills the detector, partition field, and bucket span. 
The only change needed here is <strong>Step 1</strong> of the pattern: Open the <strong>Edit JSON</strong> view, and add <code>summary_count_field_name</code> to tell the ML engine that incoming data will be pre-summarized:</p><pre><code>{
  "bucket_span": "1h",
  "summary_count_field_name": "doc_count", // Step 1
  "detectors": [
    {
      "function": "high_mean",
      "field_name": "bytes",
      "partition_field_name": "geo.dest"
    }
  ],
  "influencers": ["geo.dest"]
}</code></pre><h3><strong>Datafeed configuration</strong></h3><p>Switch to the <strong>Datafeed</strong> tab. This is where Steps 2 through 5 of the pattern come together. Remove <code>scroll_size</code> if it’s present, and then enter the aggregations:</p><pre><code>{
  "buckets": {
    "date_histogram": {               // Step 2: bucket wrapper, interval = bucket_span
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "aggregations": {
      "@timestamp": {                 // Step 3: max timestamp anchor
        "max": { "field": "@timestamp" }
      },
      "geo.dest": {                   // Step 4: partition field, name must match exactly
        "terms": {
          "field": "geo.dest",
          "size": 1000
        },
        "aggregations": {
          "bytes": {                  // Step 5: metric field → avg aggregation
            "avg": { "field": "bytes" }
          }
        }
      }
    }
  }
}</code></pre><p>A few notes on this config:</p><ul><li><strong>Step 2:</strong> The <code>date_histogram</code> uses <code>fixed_interval</code>: <code>&quot;1h&quot;</code>, matching <code>bucket_span</code> exactly. A mismatch produces incorrect bucket timing.</li><li><strong>Step 3:</strong> The <code>max</code> aggregation on <code>@timestamp</code> must be named <code>@timestamp</code> and placed inside the histogram’s <code>aggregations</code>; without it, the ML node can’t determine the precise end of each bucket.</li><li><strong>Step 4:</strong> The <code>terms</code> aggregation for the partition field must be named <strong>exactly</strong> after the partition field: <code>geo.dest</code>, not <code>geo.dest_grouping</code> or any alias. The ML engine uses the aggregation name, not the <code>field</code> parameter, to identify which partition value each bucket belongs to. A mismatch silently drops the partition field from results entirely.</li><li><strong>Step 5:</strong> The metric aggregation key <code>bytes</code> matches <code>field_name</code> in the detector exactly. Any mismatch here produces silently wrong anomaly scores.</li></ul><h3><strong>Validate with the preview</strong></h3><p>Before we create the job, let’s use the <strong>Preview</strong> tab. This runs the aggregation against real data and shows exactly what the ML node will receive, a very useful sanity check before committing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/638eeb2ae5b854195dec0e468887300b9afd2c58-1999x1254.png" style="width: 100%;" /><p>Three things to verify in the preview output: <code>doc_count</code> should be present on every bucket and greater than 1. The <code>bytes</code> values should look like average response sizes: numbers in the hundreds to hundreds of thousands for web traffic. And each row should correspond to a distinct (<code>timestamp</code>, <code>geo.dest</code>) pair. If anything looks off, fix it in the JSON editor and rerun the preview.</p><h2><strong>Adding influencer fields</strong></h2><p>In the example above, <code>geo.dest</code> is the partition field. The ML model learns a separate baseline for each destination country, and anomalies are reported per country. But you might also want <code>machine.os</code> to appear as an <strong>influencer</strong> in anomaly results: When the detector fires, you want to see “this looks anomalous for <code>geo.dest: CN</code> and <code>machine.os: win</code> is a contributing factor.” <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/ml-ad-run-jobs#ml-ad-influencers">Influencers</a> don’t drive anomaly detection; they provide context for the anomalies that are found.</p><p>To support an influencer alongside a partition field, the analysis config gains an <code>influencers</code> array:</p><pre><code>{
  "bucket_span": "1h",
  "summary_count_field_name": "doc_count",
  "detectors": [
    {
      "function": "high_mean",
      "field_name": "bytes",
      "partition_field_name": "geo.dest"
    }
  ],
  "influencers": ["geo.dest", "machine.os"]
}</code></pre><p>And now the datafeed needs to aggregate on both fields simultaneously. One <code>terms</code> nested inside another <code>terms</code> won’t work; a nested <code>terms</code> surfaces only the top-N values of the inner field per outer bucket, so you’d silently lose combinations. Instead, use a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation">composite aggregation</a> with one <code>terms</code> source per field, nested inside the <code>date_histogram</code>:</p><pre><code>{
  "buckets": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "aggregations": {
      "@timestamp": {
        "max": { "field": "@timestamp" }
      },
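      // composite nested inside the date_histogram: one bucket per (geo.dest, machine.os) pair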
      "group_by_fields": {
        "composite": {
          "size": 1000,
          "sources": [
            { "geo.dest":   { "terms": { "field": "geo.dest" } } },
            { "machine.os": { "terms": { "field": "machine.os" } } }
          ]
        },
        "aggregations": {
          "bytes": {
            "avg": { "field": "bytes" }
          }
        }
      }
    }
  }
}</code></pre><p><code>composite</code> generates one bucket per unique (<code>geo.dest</code>, <code>machine.os</code>) combination. The ML node sees every pair and can correctly attribute which operating system was contributing when a country’s response sizes spiked. Use the preview to confirm distinct pairs appear. If you only see a handful of rows where you’d expect many, the <code>size</code> parameter on the composite may need to be raised.</p><p>Note that this <code>composite</code> is nested <em>inside</em> the <code>date_histogram</code>, a different structure from the top-level composite used for <code>rare</code>, <code>freq_rare</code>, and <code>info_content</code> below. The distinction matters: Composite nested inside <code>date_histogram</code> routes the datafeed to the standard extractor; composite at the top level routes it to the composite extractor, which paginates through all value combinations across time.</p><h2><strong>Categorization</strong></h2><p>Categorization works with aggregated datafeeds: <code>summary_count_field_name</code> and <code>categorization_field_name</code> can coexist in the same job. The five-step pattern applies directly. Step 2 uses the standard <code>date_histogram</code> topology. Step 4 has one adjustment: Instead of a partition field, we aggregate the text field itself using a <code>terms</code> aggregation on its <code>.keyword</code> subfield, named to match <code>categorization_field_name</code> exactly. Step 5 is skipped. The <code>count</code> detector works from <code>doc_count</code> alone.<br/><strong>Analysis config:</strong></p><pre><code>{
  "bucket_span": "1h",
  "summary_count_field_name": "doc_count",
  "categorization_field_name": "message",
  "detectors": [
    {
      "function": "count",
      "by_field_name": "mlcategory"
    }
  ],
  "influencers": ["mlcategory"]
}</code></pre><p><strong>Datafeed aggregations:</strong></p><pre><code>{
  "buckets": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "aggregations": {
      "@timestamp": {
        "max": { "field": "@timestamp" }
      },
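      // key must match categorization_field_name ("message") exactly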
      "message": {
        "terms": {
          "field": "message.keyword",
          "size": 1000
        }
      }
    }
  }
}</code></pre><p>The datafeed sends one bucket per unique <code>message.keyword</code> value with a <code>doc_count</code> for each. The ML node receives those strings, runs categorization on them, assigning an <code>mlcategory</code> to each, and the <code>count</code> detector tracks how many documents fall into each category per bucket. The naming rule from Step 4 applies: The <code>terms</code> aggregation must be named <code>message</code>, matching <code>categorization_field_name</code> in the analysis config exactly.</p><p>One thing to watch: Keyword fields have a default <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/ignore-above"><code>ignore_above: 256</code></a> limit. Log messages longer than 256 characters won’t be indexed as <code>.keyword</code> and will be silently excluded from the aggregation. If your log messages are long, check the field mapping before using this approach. You may need to raise the limit in your index template.</p><h2><strong>The minimal pattern for <code>time_of_day</code> and <code>time_of_week</code></strong></h2><p><a href="https://www.elastic.co/docs/reference/machine-learning/ml-time-functions"><code>time_of_day</code> and <code>time_of_week</code></a> are the easiest functions to aggregate: They only need a timestamp and a document count. The C++ process extracts the time component from the bucket timestamp and builds a cyclical model of normal activity; <code>doc_count</code> tells it how many events fell in each bucket. No <code>terms</code> sources, no metric aggregation, no composite.<br/><strong>Analysis config:</strong></p><pre><code>{
  "bucket_span": "15m",
  "summary_count_field_name": "doc_count",
  "detectors": [
    { "function": "time_of_day" }
  ]
}</code></pre><p><strong>Datafeed aggregations:</strong></p><pre><code>{
  "time": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "15m"
    },
    "aggregations": {
      "@timestamp": { "max": { "field": "@timestamp" } }
    }
  }
}</code></pre><p>A plain <code>date_histogram</code> is enough; no composite needed. This makes <code>time_of_day</code> and <code>time_of_week</code> particularly CCS-friendly: one request per time chunk, minimal data over the wire. Use the same structure for <code>time_of_week</code>; only the function name changes.</p><p>If you want to add a <code>partition_field_name</code> (for example, to model time-of-day patterns per service), add a <code>terms</code> aggregation inside the histogram’s aggregations following the standard Step 4 pattern.</p><h2><strong>The composite pattern for <code>rare</code>, <code>freq_rare</code>, and <code>info_content</code></strong></h2><p><a href="https://www.elastic.co/docs/reference/machine-learning/ml-rare-functions"><code>rare</code>, <code>freq_rare</code></a>, and <a href="https://www.elastic.co/docs/reference/machine-learning/ml-info-functions"><code>info_content</code></a> all need the composite extractor, the one that paginates through all unique value combinations rather than truncating to top-N. The five-step pattern applies here with a different topology in Step 2: <code>composite</code> goes at the top level (not <code>date_histogram</code>), with <code>date_histogram</code> as a source inside it. Step 3 places the <code>max</code> <code>@timestamp</code> aggregation as a sibling of the <code>composite</code>, and Step 5 is skipped since all three functions work from <code>doc_count</code> alone.</p><p>The datafeed structure is the same for all three functions: a composite at the top level, a <code>date_histogram</code> as one of its sources, and one <code>terms</code> source per analysis field. The only thing that varies is which fields you include as <code>terms</code> sources: <code>rare</code> needs one source for <code>by_field_name</code>; <code>freq_rare</code> needs sources for both <code>by_field_name</code> and <code>over_field_name</code>; <code>info_content</code> needs a source for <code>field_name</code> plus any <code>by_field_name</code> or <code>over_field_name</code> fields. None of the three require a metric aggregation.</p><pre><code>{
  "buckets": {
    "composite": {
      "size": 10000,
      "sources": [
        { "@timestamp":   { "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" } } },
        { "by_field":     { "terms": { "field": "by_field" } } },
        { "over_field":   { "terms": { "field": "over_field" } } }
      ]
    },
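    // Step 3: max timestamp sits as a sibling of the composite, not inside it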
    "aggregations": {
      "@timestamp": { "max": { "field": "@timestamp" } }
    }
  }
}</code></pre><p>A few notes:</p><ul><li>The composite aggregation must be the top-level aggregation, not nested inside a <code>date_histogram</code>. This is what routes the datafeed to the composite extractor.</li><li>The <code>date_histogram</code> is a source inside the composite, not the outer wrapper. Its <code>fixed_interval</code> must divide evenly into <code>bucket_span</code>.</li><li>The <code>max</code> aggregation on <code>@timestamp</code> sits as a sibling of the <code>composite</code> (inside <code>aggregations</code>), not nested inside it.</li><li><code>composite.size</code> controls the page size per round trip. Setting it high (10000) reduces round trips, which matters with CCS latency. With three sources and high-cardinality fields, the total combination count can be large; the extractor paginates automatically.</li></ul><h2><strong>Why aggregation-based datafeeds outperform scroll at scale</strong></h2><p>The gap is structural, not incidental. A scroll-based datafeed reads raw documents one page at a time: Every 1,000 documents is one request, and each waits for the previous one to complete before issuing the next. The number of requests is therefore proportional to the total document count in the time range being backfilled. At 836,000 events per hour over 13 months, that&#x27;s roughly 7.9 billion events, or 7.9 million sequential round trips. Each round trip crosses the CCS boundary, waits for shard responses, and transfers matching documents in full. There’s no parallelism: The datafeed holds a <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-scroll">scroll context</a> open on the remote cluster and processes one page at a time.</p><p>An aggregation-based datafeed works differently. The data nodes summarize data locally, grouping by time bucket and categorical fields, and ship only the bucket results to the ML node. The number of requests is proportional to field cardinalities, not document count. In our example, two influencer fields with six unique combinations produce six result rows per time bucket; the datafeed pages through those in a handful of requests regardless of how many raw events fall in each bucket. Double the ingestion rate and the scroll request count doubles; the aggregation request count stays the same. This is why the gap widens at scale: The more data you have, the worse scroll looks by comparison, and the better aggregations look.</p><p>On live data, the picture is different because each real-time tick covers only one fresh bucket: Scroll issues however many pages fit in that bucket&#x27;s worth of data, while aggregations issue one request. The 20× figure for live data reflects that ratio at 836,000 events per hour with a 15-minute bucket span. The practical threshold where aggregations stop being optional is when <code>(ingestion rate × bucket span) &gt; scroll_size</code>; once a single bucket contains more than <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/anomaly-detection-scale#set-scroll-size">one scroll page</a> of documents, the datafeed can&#x27;t keep pace with live data regardless of hardware. Below that threshold, scroll is fine and aggregations are a nice-to-have. Above it, aggregations are the only sustainable option.</p><p>Scroll-based datafeeds are the right default, and the wizards make the right call for most deployments. 
At scale (more shards, broader index patterns, CCS across tiers), switching to an aggregation-based datafeed is the natural next step: The data nodes summarize where the data lives, the ML node processes compact results, and the detections stay the same. The one cost to know up front is model state: Switching requires a new job, so the earlier you make the move, the less you give up.</p><p>If you hit a case not covered here, an aggregation type that doesn’t map cleanly or a composite that behaves unexpectedly, the <a href="https://discuss.elastic.co/">Elastic Discuss forums</a> are a good place to continue.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How big is too big? Elasticsearch sizing best practices]]></title>
            <description><![CDATA[There’s no hard size limit in Elasticsearch, but there are clear signals you've outgrown your setup. Learn how to size shards, manage node limits, choose storage by tier, and use AutoOps to catch problems before they happen.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-node-shard-size-best-practices</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-node-shard-size-best-practices</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2576886b8938e8265c6fafc9dcee6a4442f6a2fa-1408x768.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>There’s no hard size limit in Elasticsearch. Production clusters run at petabyte scale. But &quot;too big&quot; shows up in three ways: queries slow past your Service Level Agreement (SLA), a node hits its shard ceiling, or storage costs spiral because everything lives on the same expensive tier. This guide walks through each signal, the numbers that matter, and what to do about them.</p><h2>The three limits that actually matter</h2><p>There’s no hard storage ceiling at the node level. Elastic has demonstrated <a href="https://www.elastic.co/blog/querying-a-petabyte-of-cloud-storage-in-10-minutes">a single node querying 1 PiB of data</a>. In earlier versions, the per-shard overhead was high enough that the old rule of thumb was <a href="https://github.com/elastic/elasticsearch/pull/86223">no more than 20 shards per GB of heap</a>. Exceeding that limit meant garbage collection pressure, slow cluster state updates, and unstable nodes. Over 7.x and 8.x, a <a href="https://github.com/elastic/elasticsearch/issues/77466">series of optimizations</a> (more compact metadata serialization, efficient caching, off-heap data structures, and compressed cluster state) reduced per-shard overhead to the point where that rule was retired in 8.3, replaced by <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-3-0">field-density-based sizing</a>.</p><p>What defines your actual ceiling is workload type. Cold nodes at 20 TB with 31 GB of heap handle audit and retention workloads comfortably, because the access pattern is infrequent and aggregation-based. The same spec on a high-concurrency document search workload would struggle.</p><p>The three things to watch operationally:</p><ul><li><strong>Shard size:</strong> Individual shards that are too large slow queries and recovery.</li><li><strong>Shards per node:</strong> Every node has a ceiling, and <a href="https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management">index lifecycle management</a> (ILM) creates shards automatically whether you track them or not.</li><li><strong>Storage tier mismatch:</strong> Keeping data on expensive fast storage longer than needed.</li></ul><h2>Shard size</h2><p>Target between 10 GB and 50 GB per shard. The <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards">official guidance</a> sets the ILM rollover trigger at 50 GB per primary shard, with 10 GB as the suggested floor. Keep each shard under 200 million documents.</p><p>Shards that are too small create unnecessary overhead: more metadata for master nodes, more heap consumed, more network traffic. Shards that are too large slow query execution and make recovery after node failure slow, since Elasticsearch recovers one shard at a time.</p><p>One rule you can stop using: The &quot;20 shards per GB of heap&quot; guideline was deprecated in Elasticsearch 8.3. The replacement is simpler: Watch the 1,000 shard-per-node limit below, and keep shard sizes in the <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-size-recommendation">10–50 GB range (or 200M documents)</a>.</p><p><strong>How to monitor:</strong></p><pre><code># size per shard
GET _cat/shards?h=index,store&v</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4820733cc108ee5a78c7acafcbdf882f09ef661-1290x150.png" style="width: 100%;" /><h2>The shard budget</h2><p>Each non-frozen data node supports up to <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-count-per-node-recommendation">1,000 shards</a>. ILM creates shards on your behalf. If your policy rolls over daily with five primary shards and one replica, that’s 10 shards per day. One node fills up in about 100 days without you changing anything.</p><p>Options when you’re approaching the ceiling:</p><ul><li><strong>Wider rollover intervals:</strong> Roll weekly or monthly if shards are not reaching 50 GB before the time trigger fires.</li><li><strong>Fewer shards per index:</strong> For smaller daily volumes, one or two primary shards is often enough. See <a href="https://www.elastic.co/search-labs/blog/elasticsearch-increase-primary-shard-count">how to increase primary shard count</a> if you need to rebalance an existing index.</li><li><strong>More nodes:</strong> If volume genuinely requires daily rollover at full shard count, distribute across more nodes.</li></ul><p>For master nodes, plan for <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-count-recommendation">1 GB of heap per 3,000 indices</a>.</p><p><strong>How to monitor:</strong></p><pre><code># shards per node
GET _cat/allocation?h=node,shards&v</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0060038cbe02e0dc35c2a24112c82f50e2da6ee-474x98.png" style="width: 100%;" /><h2>Storage</h2><p>The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">search speed guide</a> recommends allocating at least half of system memory to the OS filesystem cache and using directly attached storage. Remote storage generally performs worse. The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">indexing speed guide</a> echoes this, recommending RAID 0 across multiple local SSDs for write-heavy workloads.</p><p>For hot data: Don’t use network-attached storage (NAS). NAS adds latency on every read, and some NAS systems don’t correctly implement POSIX filesystem semantics, which can cause data corruption. Use local SSDs.</p><p>What works at each tier:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tier</td><td style="border: 1px solid #ccc; padding: 4px;">Storage</td><td style="border: 1px solid #ccc; padding: 4px;">Why</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hot</td><td style="border: 1px solid #ccc; padding: 4px;">Local SSD (DAS)</td><td style="border: 1px solid #ccc; padding: 4px;">High I/O, low latency, safe filesystem semantics</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Warm</td><td style="border: 1px solid #ccc; padding: 4px;">HDD acceptable</td><td style="border: 1px solid #ccc; padding: 4px;">Lower query pressure, no active indexing</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cold</td><td style="border: 1px solid #ccc; padding: 4px;">Searchable snapshots</td><td style="border: 1px solid #ccc; padding: 4px;">No replicas needed, ~50% storage savings</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Frozen</td><td style="border: 1px solid #ccc; padding: 4px;">Searchable snapshots</td><td style="border: 1px solid #ccc; padding: 4px;">Up to 20x reduction vs warm (Enterprise license)</td>
                    </tr>
              </tbody>
            </table>
          <p><strong>How to monitor:</strong></p><pre><code># disk usage per node and role
GET _cat/allocation?h=node,node.role,disk.used,disk.avail,disk.percent&v</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8fab2b7d44f4b5b11b89920e9b5ba49eeb48906c-1100x106.png" style="width: 100%;" /><p><em>On Elastic Cloud, skip this section. You select a <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/ec-change-hardware-profile">hardware profile</a> per tier, and Elastic handles storage provisioning.</em></p><h2>Data tiers and ILM</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f8e5dbbc0bbef7cb35abc0f0e82367024cb68730-1999x301.png" style="width: 100%;" /><p>Index lifecycle management moves data through <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html">tiers</a> automatically: hot, warm, cold, frozen, delete. The further data moves from hot, the cheaper the storage.</p><p>The cold and frozen tiers use <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/searchable-snapshots.html">searchable snapshots</a>:</p><ul><li><strong>Cold</strong> (fully mounted): Performance comparable to a regular index, no replicas required, roughly 50% cheaper than warm.</li><li><strong>Frozen</strong> (partially mounted): Up to 20x storage reduction compared to warm, slower queries, requires an Enterprise license.</li></ul><p>The cost difference at scale is significant. A <a href="https://www.elastic.co/search-labs/blog/searchable-snapshots-benchmark">Search Labs benchmark</a> measured 90 TB of data: all-hot cost $28,222 per month. A hot+frozen architecture brought that to $3,290 per month.</p><p>A typical ILM policy for time-series data with a 14-day hot window:</p><pre><code>{
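  // hot (rollover at 50 GB) → warm at 14d → cold at 30d → frozen at 90d → delete at 365d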
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "14d",
        "actions": {
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_repository"
          }
        }
      },
      "frozen": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_repository"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}</code></pre><p>Adjust <code>min_age</code> values to match your query patterns. Data queried weekly can move to cold sooner than data queried daily.</p><h2>AutoOps</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0dd56a74b148bcd95d1cb47eaaa45d99239455a9-1999x670.png" style="width: 100%;" /><p>As of February 2026, <a href="https://www.elastic.co/blog/autoops-free">AutoOps is free for all Elasticsearch users</a> regardless of license tier. On Elastic Cloud, it’s already enabled. For Elastic Self-Managed, Elastic Cloud Enterprise (ECE), and Elastic Cloud on Kubernetes (ECK) deployments, a lightweight Elastic Agent connects your cluster in about five minutes via <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops">Cloud Connect</a>. Internet connectivity is required; air-gapped deployments are not supported.</p><p>AutoOps samples hundreds of metrics every 10 seconds and surfaces issues with root cause analysis and remediation commands. It does not apply fixes automatically.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/16ec78498b40709c38eab2fa78128d11d5a59a6c-1340x546.png" style="width: 100%;" /><p>For large deployments it detects:</p><ul><li>Shards growing past the recommended size range.</li><li>Indices without ILM policies that have grown too large.</li><li>Shard imbalance across nodes.</li><li>Disk watermark violations before they cause allocation failures.</li><li>Indexing rejections and ingestion bottlenecks.</li><li>Slow queries and circuit breaker trips from large aggregations.</li></ul><p>It ships with 100+ customizable alerts and routes notifications to PagerDuty, Slack, Teams, or any webhook.</p><h2>Conclusion</h2><p>Watch shard size (10–50 GB), track your per-node shard budget as ILM rolls indices over, put hot data on local SSDs, and use cold and frozen tiers for data that is rarely queried.</p><p>On Elastic Cloud, hardware profiles and AutoOps handle most of this for you. For self-managed deployments, this is your checklist, and AutoOps via Cloud Connect is your early warning system. 
If you’re unsure how much data your nodes can handle for your specific workload, use <a href="https://github.com/elastic/rally">Rally</a> to benchmark against your own data before committing to a hardware spec.</p><h2>Sources</h2><ul><li><a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards">Size your shards</a></li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html">Data tiers</a></li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-shards-and-replicas-guide">Elasticsearch shards and replicas guide</a></li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-shards-reduce-change-count">How to reduce shard count</a></li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-increase-primary-shard-count">How to increase primary shard count</a></li><li><a href="https://www.elastic.co/search-labs/blog/optimize-elasticsearch-disk-space-and-usage">Optimize disk space and usage</a></li><li><a href="https://www.elastic.co/search-labs/blog/searchable-snapshots-benchmark">Searchable snapshots benchmark</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops">AutoOps documentation</a></li><li><a href="https://github.com/elastic/rally">Rally</a>: Elastic&#x27;s benchmarking framework for testing cluster sizing against your own data</li><li><a href="https://www.elastic.co/virtual-events/optimizing-storage-efficiency-elasticsearch">Optimizing storage efficiency in Elasticsearch</a> webinar by Christian Dahlqvist and Alan Woodward</li><li><a href="https://www.elastic.co/virtual-events/using-rally-to-get-elasticsearch-cluster-size-right">Using Rally to get your cluster size right</a>: webinar by Christian Dahlqvist and Daniel Mitterdorfer on benchmarking methodology</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[New Elasticsearch ES|QL plugin for IntelliJ IDEA]]></title>
            <description><![CDATA[Build and run Elasticsearch ES|QL queries in your IDE with the new plugin for IntelliJ IDEA.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-plugin-intellij-idea</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-plugin-intellij-idea</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Laura Trotta]]></dc:creator>
            <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/91366de35a1b66860ce0d126c8a83e5b25b678f0-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language (ES|QL)</a> is Elasticsearch’s piped query language, designed for intuitive data querying and manipulation. Refer to our <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">getting started guide</a> to learn more.</p><p>The Elasticsearch Java client <a href="https://www.elastic.co/search-labs/blog/esql-queries-to-java-objects">supports ES|QL queries</a> through the DSL, but currently it treats queries as simple strings, with no dedicated helper; and while <a href="https://www.elastic.co/kibana">Kibana</a> offers an excellent <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql-kibana">UI to build ES|QL queries</a>, we’re aware that sometimes having everything needed to write applications in the integrated development environment (IDE) offers a better experience. So, until the Java client extends its type support to ES|QL, we wrote an IntelliJ IDEA plugin that autocompletes, syntax checks, shows documentation, and executes ES|QL queries.</p><p>The plugin currently supports Java, Kotlin, and plain text files, in case the Java Virtual Machine (JVM) isn’t your thing.</p><p>Check it out on the <a href="https://plugins.jetbrains.com/plugin/28898-elasticsearch-es-ql">JetBrains Marketplace page</a> and in the <a href="https://github.com/elastic/esql-idea-plugin">GitHub repository</a> for more information.</p><h2>Prerequisites</h2><ul><li>IDE: IntelliJ IDEA version &gt;= 253 (Community or Ultimate)</li></ul><h2>Usage</h2><p>Install the plugin in IntelliJ IDEA as you would any other plugin: either from the <a href="https://plugins.jetbrains.com/plugin/28898-elasticsearch-es-ql">JetBrains Marketplace</a> or by going to Settings -&gt; Plugins -&gt; Marketplace and searching for “esql”.</p><p>The following examples are written using Java, but Kotlin is also supported and the usage is pretty much the same.</p><p>Create a text block string, write “ES|QL” in a simple comment above it, and you’re done.</p><pre><code>// ES|QL
String query = """
""";</code></pre><p>If you see the Elastic logo appearing on the left:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/620d47b0c241271ab9bf727c37d3ab5f4137ca44-417x55.png" style="width: 100%;" /><p>then everything is working, and you’re ready to write your queries.</p><p>Why text blocks and not simple strings? The ES|QL syntax accepts quotes in various contexts, and escaping them would trigger other errors in the syntax checker, so we decided on text blocks to keep things simple.</p><p>It’s even simpler for txt files, as you can just add the comment and start writing the query right below:</p><pre><code>// ES|QL</code></pre><h3>Connecting to a server instance</h3><p>The plugin can be connected to an Elasticsearch server instance to fetch indices and field names, which will then be added to the autocompletion options. Look for the Elastic logo on the bottom left of of the screen (or wherever you keep your tools), and configure your connection to any server instance:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9f5109542c921ad523458b9156551bf1fca7d41a-418x269.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fe37a53d97f67cbdd8623793bce768e8e2f9ced-577x298.png" style="width: 100%;" /><h3>Autocomplete</h3><p>Start typing while in the text block to automatically open the autocompletion popup, which will return a list of acceptable commands/values to continue writing the query correctly. If you want to manually trigger autocompletion, <code>ctlr+space</code> is the IDE’s shortcut to use:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/987a927ab0e682bb1f9d07c934dd4254a769db20-584x252.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f8e379ea1cca345b38d0b7a0c2a873db6de624f-584x252.png" style="width: 100%;" /><h3>Syntax check</h3><p>The plugin will highlight errors in queries, explaining what to fix:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/15e91eab6c00fb49cfa5c1c6e270beeba534afc3-812x252.png" style="width: 100%;" /><h3>Documentation</h3><p>Hovering with the cursor over commands will display documentation describing what the command can be used for and its correct syntax:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c78295103c2188a3615edb2e001acbaf17523656-1072x627.png" style="width: 100%;" /><h3>Running the query</h3><p>Once connected to a server instance, you can run queries by clicking on the green button beside the Elastic icon: The results will be displayed in the tool window:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bff17bee7269dfb6f30140c9cd78dc203990f21c-1130x441.png" style="width: 100%;" /><p>Or if you’re writing an application, you can use the Java client like so:</p><pre><code>// ES|QL
String query = """
    FROM my-index
    | SORT year DESC
    | LIMIT 10
    """;

// the try-with-resources block closes the client automatically
try (ElasticsearchClient client = ElasticsearchClient.of(e -> e
        .host(serverUrl)
        .apiKey(apiKey))) {

    client.esql().query(QueryRequest.of(qr -> qr.query(query)));
}</code></pre><p>Check our previous <a href="https://www.elastic.co/search-labs/blog/esql-queries-to-java-objects">ES|QL Java Client article</a> for a complete example of mapping ES|QL results to Java objects.</p><h2>How does it work?</h2><p>There’s no AI involved; the plugin is based on the ES|QL <a href="https://www.antlr.org/">ANTLR</a> grammar for autocompletion and syntax checking, and it uses the <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch docs</a> to show documentation.</p><h2>Conclusion</h2><p>The plugin is still experimental, so feel free to report any bugs or feature requests on the <a href="https://github.com/elastic/esql-idea-plugin">GitHub repository</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unsupervised document clustering with Elasticsearch + Jina embeddings]]></title>
            <description><![CDATA[A practical, reproducible approach to unsupervised document clustering with Elasticsearch and Jina embeddings.]]></description>
            <link>https://www.elastic.co/search-labs/blog/unsupervised-document-clustering-elasticsearch-jina-embeddings</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/unsupervised-document-clustering-elasticsearch-jina-embeddings</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Matthew Adams]]></dc:creator>
            <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/662c00694c3e0c2fb2128098bdb6813df9e86a72-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Vector search starts with a query, but what if you don&#x27;t have one?</p><p>Organizations accumulate large document collections (support tickets, legal filings, news feeds, research papers) and need to understand what&#x27;s in them before they can ask the right questions. Without labels or training data, manually reviewing thousands of documents is impractical. Traditional search doesn&#x27;t help when you don&#x27;t know what to search for.</p><p>This post walks through an Elasticsearch-native approach to unsupervised document clustering and temporal story tracking that addresses this discovery problem. By the end, you can trace story arcs like this across days:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/100a60a7fb85da8ab3813fd071a82c93f2c3f318-1300x650.png" style="width: 100%;" /><p><strong>What you&#x27;ll discover:</strong></p><ul><li>Why <strong>clustering embeddings</strong> (not retrieval embeddings) matter when you want topic discovery without a query.</li><li>How density-probed centroid classification groups documents by topic using Elasticsearch k-nearest neighbor (kNN) and batched <code>msearch</code>.</li><li>How <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-significanttext-aggregation"><code>significant_text</code></a> can auto-label clusters so themes are readable without training a model.</li><li>How temporal story chains link daily clusters to show how themes evolve from day to day.</li></ul><p>The pipeline uses ~8,500 February 2025 articles from BBC News and The Guardian as a test corpus. News is convenient because it has clear temporal behavior, but the pattern applies anywhere document discovery matters: legal review, compliance monitoring, research synthesis, customer support triage.</p><p><strong>Stack:</strong></p><ul><li><a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><strong>Jina v5</strong></a> <strong>clustering embeddings:</strong> Task-specific Low-Rank Adaptation (LoRA) adapters for topic grouping. <a href="https://www.elastic.co/blog/elastic-jina-ai">Jina has joined Elastic</a>, and its models are available natively through <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>.</li><li><strong>Elasticsearch:</strong> Scalable <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN</a>, <code>significant_text</code> labeling, and vector storage.</li><li><a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction"><strong>DiskBBQ:</strong></a> A disk-based vector index format that combines <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/bbq">Better Binary Quantization (BBQ)</a> with hierarchical k-means partitioning for approximate nearest neighbors (ANN) acceleration. This index partitioning is internal to vector search and separate from the density-probed clustering algorithm used in this post. 
<code>bbq_disk</code> stores quantized vectors on disk and keeps only partition metadata in heap, dramatically reducing resource requirements compared to <code>bbq_hnsw</code> while maintaining high recall.</li><li><strong>Global clustering + daily temporal linking:</strong> Discovery and story evolution.</li></ul><p><strong>What you&#x27;ll need:</strong></p><ul><li>An Elasticsearch deployment (Elastic Cloud, Elasticsearch Serverless, or Elastic Self-Managed 8.18+/9.0+): <code>bbq_disk</code> requires 8.18 or later. The optional diversify retriever section requires 9.3+ or serverless.</li><li>A <a href="https://jina.ai/embeddings/">Jina API key</a>: The free tier includes 10 million tokens, which covers the core clustering pipeline (~4.25 million tokens). The optional retrieval-versus-clustering comparison uses a second embedding pass.</li><li>A <a href="https://bonobo.capi.gutools.co.uk/register/developer">Guardian API key</a> (free).</li></ul><h2>Setup</h2><p>Install required packages:</p><pre><code>pip install elasticsearch pandas numpy plotly umap-learn python-dotenv pydantic-settings datasets requests</code></pre><p>Optional (only if you run the scraping helpers from the companion repo):</p><pre><code>pip install beautifulsoup4</code></pre><p>Then configure API keys in a <code>.env</code> file at the project root:</p><pre><code>ELASTIC_CLOUD_ID=your-cloud-id        # or ELASTIC_HOST=https://...
ELASTIC_API_KEY=your-api-key
JINA_API_KEY=your-jina-key
GUARDIAN_API_KEY=your-guardian-key</code></pre><p>This notebook calls <code>load_dotenv(override=True)</code>, so local <code>.env</code> values take precedence.</p><pre><code>Connected to Elasticsearch</code></pre><h2>Part 1: Discovery clustering - Why clustering embeddings?</h2><p>Most vector search uses <strong>retrieval embeddings</strong> trained to match a <em>query</em> to relevant <em>documents</em>. That&#x27;s perfect for search, but not for discovery. When you want to find what topics exist in a corpus without any query at all, you need embeddings that group similar documents together.</p><p>Jina v5 solves this with <strong>task-specific Low-Rank Adaptation (LoRA) adapters</strong>. LoRA adds small low-rank updates to targeted internal layers while keeping most base-model weights frozen, so the model behavior shifts toward a specific task without full retraining. The same base model produces different embeddings depending on the <code>task</code> parameter:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Trained for</td><td style="border: 1px solid #ccc; padding: 4px;">Use case</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">retrieval.passage</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document matching</td><td style="border: 1px solid #ccc; padding: 4px;">Search, retrieval augmented generation (RAG)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">clustering</td><td style="border: 1px solid #ccc; padding: 4px;">Topic grouping (optimized for tight clusters)</td><td style="border: 1px solid #ccc; padding: 4px;">Discovery, categorization</td>
                    </tr>
              </tbody>
            </table>
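          <p>As a concrete (and purely illustrative) sketch of switching adapters, the snippet below embeds the same texts twice, changing only the <code>task</code> value. The endpoint URL and response shape are assumptions based on Jina&#x27;s public embeddings API, not code taken from this walkthrough:</p><pre><code># Sketch: embed the same documents with both adapters to compare geometries.
# Assumes JINA_API_KEY is set and `docs` is a list of article texts.
import os
import requests

def embed(texts, task):
    resp = requests.post(
        "https://api.jina.ai/v1/embeddings",  # assumed public Jina endpoint
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={"model": "jina-embeddings-v5-text-small", "input": texts, "task": task},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

clustering_vecs = embed(docs, "clustering")        # tight topical groups
retrieval_vecs = embed(docs, "retrieval.passage")  # query-document matching</code></pre>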
          <p>The clustering adapter is trained to make documents about the same topic <em>closer</em> in embedding space and documents about different topics <em>further apart</em>. The visual comparison below makes the difference concrete.</p><h3>Retrieval vs. clustering: A visual comparison</h3><p>To see the difference, the same 480-document sample is embedded with both task types. Clustering is performed in the original 1024-dimensional embedding space; Uniform Manifold Approximation and Projection (UMAP) is used only to project those embeddings into 2D for visualization. UMAP preserves local neighborhood structure, making it useful for comparing cluster separation.</p><p>In the projection below, look for tighter, more separated color groups in the clustering panel.</p><pre><code>    Full dataset: 8,495 articles
    Sources: guardian: 5749, bbc: 2746
    Date range: 2025-02-01 to 2025-02-28


    Sample: 480 docs across 8 sections
    section
    Film              60
    World news        60
    Australia news    60
    Opinion           60
    Football          60
    US news           60
    Sport             60
    Business          60


    Clustering embeddings: 480
    Retrieval embeddings:  480


    UMAP projection complete</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bcf7a744900560c1c6c63a2dc3af2f9bfd33e11-1100x500.png" style="width: 100%;" /><p><em>Retrieval embeddings (left) spread topics broadly; clustering embeddings (right) produce tighter, more separated groups from the same documents.</em></p><p>The clustering embeddings produce tighter, more visually distinct groups. The retrieval embeddings spread topics out more evenly, which is ideal for search (fine-grained similarity), but for discovery, tight topical clusters are what matters.</p><p>This is why <code>task=&quot;clustering&quot;</code> is used for the rest of this walkthrough.</p><h3>Loading the dataset</h3><p>The corpus combines two news sources for February 2025:</p><ul><li><strong>BBC News</strong> via the <a href="https://huggingface.co/datasets/RealTimeData/bbc_news_alltime">RealTimeData/bbc_news_alltime</a> HuggingFace dataset.</li><li><strong>The Guardian</strong> via the <a href="https://open-platform.theguardian.com/">Guardian Open Platform API</a>.</li></ul><p>Having multiple sources helps validate that clustering finds <em>topics</em> rather than <em>source-specific style</em>.</p><pre><code>    Total articles:  8,495
    
    Source breakdown:
    source
    guardian    5749
    bbc         2746
    
    Date range: 2025-02-01 → 2025-02-28
    Days covered: 28
    
    Sample article:
      Source:  guardian
      Title:   Carbon monoxide poisoning ruled out in death of Gene Hackman and wife, police sa
      Section: Film
      Text:    Authorities have ruled out that Gene Hackman and his wife, Betsy Arakawa, died from carbon monoxide poisoning earlier this week in their home in Santa Fe, New Mexico. The Santa Fe county sheriff, Adan...</code></pre><h3>Embedding with the clustering task</h3><p>The Jina v5 API is called with <code>task=&quot;clustering&quot;</code> for all documents. Embeddings are cached to disk, so subsequent runs skip the API entirely.</p><p>The API call is straightforward. The <code>task</code> parameter is the key difference from typical embedding usage:</p><pre><code>payload = {
    "model": "jina-embeddings-v5-text-small",
    "input": texts,
    "task": "clustering",  # ← This selects the clustering LoRA adapter
}</code></pre><p>The timing below reflects a cache hit. First run against the API takes longer, depending on corpus size.</p><pre><code>    Embeddings ready: 8,495 vectors of dimension 1024
    Time: 0.6s</code></pre><h3>Indexing into a single Elasticsearch index</h3><p>For discovery clustering, the full month goes into one index (<code>docs-clustering-all</code>). Daily partitioning comes later for temporal story linking.</p><p>The index mapping uses <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/bbq"><code>bbq_disk</code></a> for the vector field:</p><pre><code>{
  "embedding": {
    "type": "dense_vector",
    "dims": 1024,
    "index": true,
    "similarity": "cosine",
    "index_options": {
      "type": "bbq_disk"        // hierarchical k-means partitioning for ANN index lookup; separate from this post's clustering algorithm
    }
  }
}</code></pre><p>A 1024-dimensional float32 vector is 4 KB. <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction"><code>bbq_disk</code></a> uses hierarchical k-means to partition vectors into small clusters, binary-quantizes them, and stores the full-precision vectors on disk for rescoring. Only partition metadata lives in heap, so memory requirements stay low even for large corpora. For workloads that can afford more heap, <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/bbq"><code>bbq_hnsw</code></a> builds a Hierarchical Navigable Small World (HNSW) graph for faster lookups at higher resource cost.</p><p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field type supports multiple quantization strategies: <code>bbq_disk</code> and <code>bbq_hnsw</code> are the best fits for high-dimensional embeddings like the 1024-dim vectors used here.</p><pre><code>    Indexed 8,495 documents into docs-clustering-all
    Time: 57.5s</code></pre><h3>Clustering: Density-probed centroid classification</h3><p>Traditional clustering algorithms like HDBSCAN assume you can hold the full N×d vector matrix in memory and run repeated full-pass updates. For 8,495 documents at 1024 dimensions, that&#x27;s manageable (~35 MB), but the approach doesn&#x27;t scale to millions of documents without additional infrastructure.</p><p>This algorithm is conceptually similar to KMeans++ initialization with Voronoi assignment and a noise floor, but it uses Elasticsearch <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search</a> as the compute primitive, keeping almost all work server-side:</p><ol><li><strong>Sample 5% of documents</strong> as density probes (random sample, minimum 50).</li><li><strong>Probe density via batched</strong> <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-msearch"><strong><code>msearch</code></strong></a> <strong>kNN</strong>. Each probe fires a kNN query and records the mean similarity of its neighbours. High mean similarity = dense region of embedding space. <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-msearch"><code>msearch</code></a> sends multiple search requests in a single HTTP call, which is critical here: Density probing generates hundreds of kNN queries, and batching them avoids per-request overhead.</li><li><strong>Select high-density seeds with diversification</strong>: Candidates above median density are sorted by density descending and greedily accepted only when their cosine similarity to every existing seed is below a separation threshold. This is the only client-side compute (~0.01s for 8k docs).</li><li><strong>Classify all docs against centroids via</strong> <strong><code>msearch</code></strong> <strong>kNN</strong>: Each seed acts as a centroid; a kNN search retrieves nearby documents above a similarity threshold. Each document is assigned to whichever centroid returned it with the highest score. Small clusters are dissolved to noise.</li></ol><p>Elasticsearch handles the heavy lifting: <code>msearch</code> for density probes, <code>msearch</code> for classification, and <code>significant_text</code> for labeling. For this corpus (8,495 docs), the 5% density-probe sample launches 425 kNN probe queries, which <code>msearch</code> batches into nine HTTP calls (at batch size 50), avoiding one-request-per-probe overhead. Combined with <code>bbq_disk</code> ANN lookup, this keeps the clustering stage fast and scalable. The kNN queries use a minimal <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/approximate-knn-search"><code>num_candidates</code></a> value for speed during the clustering pass; production search queries should use higher <code>num_candidates</code> values to improve recall at the cost of latency.</p><p>Clusters have natural sizes determined by the embedding space density around each centroid, not by a hard <code>k</code> cap. Dense topic regions produce larger clusters; niche topics produce smaller ones.</p><h4>Why not KMeans or HDBSCAN?</h4><p>KMeans assumes spherical clusters and requires the full N×d matrix in memory. For corpora that fit in memory, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html">HDBSCAN</a> is a strong alternative. 
It handles arbitrary cluster shapes and has well-understood density semantics.</p><p>The density-probed centroid approach targets a different niche: corpora where you want storage, retrieval, and clustering in one system, or where scale makes client-side matrix operations impractical. It uses Elasticsearch kNN as the compute primitive, handles arbitrary cluster sizes, and keeps nearly all computation server-side.</p><pre><code>    Clustered global index in 31.6s
      Total clusters: 82
      Total noise:    2420 (28.5%)
      Density probes: 425 kNN queries via 9 _msearch HTTP calls</code></pre><h4>Understanding the noise rate</h4><p>The ~28% noise rate is by design, not a failure mode. Documents that don&#x27;t fit any dense cluster at the configured <code>similarity_threshold</code> are left unassigned rather than forced into a poor match. This acts as a quality gate: Opinion columns, short articles, and one-off stories naturally resist clustering because they lack the thematic density that defines a coherent group.</p><p>The threshold is tunable: Lowering <code>similarity_threshold</code> produces more aggressive clustering (more documents assigned, but looser clusters), while raising it tightens clusters and increases the noise fraction. For this corpus of mixed news content, ~30% noise is a reasonable operating point. Production deployments should tune the threshold against domain-specific quality criteria.</p><h3>Automatic labels with significant_text</h3><p>Now each cluster needs a human-readable label. Elasticsearch&#x27;s <code>significant_text</code> aggregation finds terms that appear unusually often in a foreground set (the cluster) compared to a background set (the full corpus).</p><p>Under the hood, it uses a statistical heuristic (JLH score by default) that balances absolute and relative frequency shifts; no machine learning or large language model (LLM) calls are involved. A cluster about UK politics might surface terms like <code>starmer</code>, <code>labour</code>, <code>downing</code> because those terms are disproportionately common in that cluster compared to the overall news corpus.</p><p>For this global pass, labels are computed directly against <code>docs-clustering-all</code>, so both foreground and background are drawn from the full month. In Part 2, labeling uses the daily index pattern (<code>docs-clustering-*</code>), a wildcard that lets queries span all matching indices simultaneously, to give <code>significant_text</code> a broader background for better contrast.</p><p>A minimal query shape looks like this:</p><pre><code>{
  "size": 0,
  "query": { "term": { "cluster_id": "72" } },
  "aggs": {
    "label_terms": {
      "significant_text": {
        "field": "text",
        "size": 5,
        "filter_duplicate_text": true
      }
    }
  }
}</code></pre><p><code>significant_text</code> also serves as a quality gate: Clusters that produce no significant terms have no distinguishing vocabulary. They&#x27;re incoherent groupings that should be dissolved back to noise rather than given a misleading label.</p><p>A lightweight deterministic cleanup step removes noisy label terms (numeric tokens, generic words) and falls back to a representative headline when needed. This keeps labels Elasticsearch-native while improving readability.</p><pre><code>    Sample cluster labels:
      cluster   3  (200 docs)  arsenal | mikel | villa
      cluster   1  (198 docs)  volodymyr | ukrainian | kyiv
      cluster   0  (196 docs)  hostages | hamas | israeli
      cluster   4  (187 docs)  scrum | rugby | borthwick
      cluster  52  (185 docs)  fossil | renewable | renewables
      cluster  10  (156 docs)  labour | gwynne | mps
      cluster  40  (151 docs)  novel | novels | literary
      cluster  11  (149 docs)  mewis | sarina | wiegman
      cluster  44  (143 docs)  flooding | rainfall | rain
      cluster  13  (131 docs)  doge | musk | elon
      cluster  12  (128 docs)  murder | insp | knockholt
      cluster   5  (124 docs)  putin | backstop | starmer


    Reassigned 35 docs from incoherent clusters to noise
    Total docs: 8,495
    Clustered:  6,040 (71.1%)
    Noise:      2,455 (28.9%)</code></pre><h3>Visualizing the clusters</h3><p>The visualizations below show what the global clustering pass discovered: a date-wise breakdown of clustered versus noise documents, a UMAP projection of the full month, and a source-mix chart confirming that clusters reflect topics rather than sources.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/99099f5adaa945ae4097c50b0d7151c7dd28872e-1000x400.png" style="width: 100%;" /><p>Daily distribution of clustered versus noise documents across February 2025.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/477c6c7177942955a942f85f5c881da50e517915-1100x700.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4f4302556c8974c416842452cf33bca06e90b966-1100x700.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/327f96a79e382ef30614cb0570aa7fccd822b8f8-1100x700.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/acfb1524a10e24d6ff2412e7c3ec0f2b3ac75193-900x600.png" style="width: 100%;" /><p>Each colored island in the UMAP represents a cluster: a group of articles about the same topic discovered purely from embedding similarity. The gray noise points are articles that didn&#x27;t fit cleanly into any cluster (often short articles, opinion pieces, or one-off stories).</p><p>The source breakdown chart confirms that clusters contain articles from <strong>both</strong> BBC News and The Guardian. The clustering is finding <em>topics</em>, not <em>sources</em>: exactly what unsupervised discovery should produce.</p><h3>Exploring cluster breadth with the diversify retriever</h3><p>Plain kNN returns the documents most similar to a cluster&#x27;s centroid (the dense core). But real clusters cover subtopics. The <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/diversify-retriever"><strong>diversify retriever</strong></a> uses Maximal Marginal Relevance (MMR) to surface documents that are relevant to the centroid but also <em>different from each other</em>.</p><p>The key parameter is <strong>λ (lambda)</strong>:</p><ul><li>λ = 1.0 → pure relevance (same as plain kNN).</li><li>λ = 0.0 → pure diversity (maximally spread results).</li><li>λ = 0.5 → balanced: relevant to the topic, but covering different angles.</li></ul><p>A minimal retriever request shape looks like this:</p><pre><code>{
  "size": 8,
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "embedding",
      "lambda": 0.5,
      "query_vector": "<cluster-centroid-vector>",
      "retriever": {
        "knn": {
          "field": "embedding",
          "query_vector": "<cluster-centroid-vector>",
          "k": 50,
          "num_candidates": 100
        }
      }
    }
  }
}</code></pre><p>The <code>type</code>, <code>field</code>, and <code>query_vector</code> parameters are required at the diversify level: <code>field</code> tells MMR which dense_vector field to use for inter-result similarity, and <code>query_vector</code> provides the reference point for relevance scoring.</p><p>This lets you answer: &quot;What does this cluster actually cover?&quot; rather than just &quot;What&#x27;s at its center?&quot;</p><pre><code>    Exploring cluster 52 (185 docs)
    Label: fossil | renewable | renewables
    Centroid computed (dim=1024)


    ========================================================================
    Plain kNN (closest to centroid)
    ========================================================================
      1. [0.9738] Green campaigners fear ministers are poised to award billions of pounds in fresh subsidies to Drax power station, despite strong concerns...
      2. [0.9710] Thirteen more oil and gas licences could be cancelled as ministers decide new guidance for fossil fuel extraction after a landmark court...
      3. [0.9699] Experts have accused the fossil fuel industry of seeking special treatment after lobbyists argued greenhouse gas emissions from oilfields...
      4. [0.9681] Burning wood is a terrible way of producing electricity . Chopping down trees destroys habitats for wildlife, and growing new trees cannot...
      5. [0.9649] Keir Starmer will do huge damage to the global fight against climate change if he gives in to political pressure and allows the development...
      6. [0.9641] Labour will next week be confronted with stark policy choices that threaten to expose the fault lines between the Treasury and the...
      7. [0.9638] The Drax power station near Selby in north Yorkshire burns imported wood pellets  The government has agreed a new funding arrangement with...
      8. [0.9581] If you care about the world we are handing on to future generations, the news on Thursday morning was dramatic. This January was the...
    
    ========================================================================
    Diversify retriever (MMR, lambda=0.5)
    ========================================================================
      1. [0.9738] Green campaigners fear ministers are poised to award billions of pounds in fresh subsidies to Drax power station, despite strong concerns...
      2. [0.9434] Oil and gas interests have waged a coordinated campaign to kill pro-electrification policies that ban gas connections in new buildings ,...
      3. [0.9303] It was interesting to read that new licences for oil and gas production in the North Sea are being delayed by legal action ( Thirteen more...
      4. [0.9139] The US energy secretary, Chris Wright, has said he “would love to see Australia get in the game of supplying uranium and maybe going down...
      5. [0.9077] Rachel Reeves was facing criticism on Saturday night as it was confirmed that a report she cited as evidence that a third ­runway at...
      6. [0.8996] When Margaret Thatcher opened the Hadley Centre for Climate Change in 1990 journalists suggested she was attempting to appear to be doing...
      7. [0.8993] The vast majority of governments are likely to miss a looming deadline to file vital plans that will determine whether or not the world has...
      8. [0.8987] European imports of seaborne gas shipments fell by a fifth last year to their lowest level since the pandemic, according to a new report,...
    
    Overlap: 1/8 documents appear in both result sets
    
    Avg pairwise similarity (lower = more diverse):
      Plain kNN:          0.9057
      Diversify retriever: 0.6965</code></pre><p>The plain kNN results cluster around one angle of the topic: the documents most similar to the centroid and to each other. The diversify retriever surfaces different facets of the same cluster: subtopics, different sources, and varied perspectives.</p><p>The diversity metric confirms this quantitatively: the average pairwise similarity is lower for the diversify retriever results, meaning that the returned documents cover more ground.</p><p>This is useful for:</p><ul><li><strong>Understanding what a cluster actually covers</strong>, not just its center but also its edges.</li><li><strong>Generating summaries</strong>. Diverse representative docs give an LLM better material.</li><li><strong>Finding representative examples</strong> for human review or downstream labeling.</li><li><strong>Quality checks</strong>. If the diverse results look incoherent, the cluster may need splitting.</li></ul><h2>Part 2: Temporal story chains</h2><h3>Tracking stories across days</h3><p>Part 1 clustered the full month globally for topic discovery. For temporal flow, the same density-probed centroid classification runs independently per day on <strong>daily indices</strong>, and then clusters are linked across adjacent days. Note that the daily clusters are independent of the global clusters from Part 1; each day produces its own cluster assignments and labels tuned to that day&#x27;s content.</p><h4><strong>The linking approach: sample-and-query</strong></h4><p>For each cluster on day A:</p><ol><li>Sample a few representative documents.</li><li>Run kNN against day B&#x27;s index.</li><li>Count how many hits land in each day B cluster.</li><li>If the hit fraction exceeds a threshold (kNN fraction ≥ 0.4), record a link.</li></ol><p>This is fast (only a few docs per cluster are queried, not all of them) and uses Elasticsearch&#x27;s native kNN, no external tools needed.</p><pre><code>Preparing daily indices for temporal linkage...


Indexed 8,495 docs into 28 daily indices


Temporal links found: 808 in 145.4s

Strongest links:
  2025.02.01 'league | arsenal | premier' -> 2025.02.02 'league | season | striker'  (100%)
  2025.02.03 'league | striker | loan' -> 2025.02.04 'league | striker | season'  (100%)
  2025.02.03 'score | operator | gedling' -> 2025.02.04 'league | striker | season'  (100%)
  2025.02.12 'playoff | leg | bayern' -> 2025.02.13 'league | players | injury'  (100%)
  2025.02.14 'league | injury | football' -> 2025.02.15 'league | premier | football'  (100%)
  2025.02.18 'russia | ukraine | talks' -> 2025.02.19 'saudi | russia | arabia'  (100%)
  2025.02.18 'football | league | bayern' -> 2025.02.19 'league | manchester | players'  (100%)
  2025.02.21 'league | premier | manchester' -> 2025.02.22 'game | players | defeat'  (100%)
  2025.02.21 'rugby | calcutta | brilliant' -> 2025.02.22 'game | players | defeat'  (100%)
  2025.02.26 'metals | kyiv | ukrainian' -> 2025.02.27 'ukraine | russia | talks'  (100%)</code></pre><p>A kNN fraction of 100% means every sampled document from the source cluster landed in the same target cluster, the strongest possible cross-day link. Most links above are football-related, which makes sense: Premier League coverage runs daily with high topical consistency.</p><p>The <code>score | operator | gedling</code> → <code>league | striker | season</code> link is an example of a niche local football cluster (Gedling is a non-league club) being absorbed into the broader Premier League cluster on the next day, a natural effect of daily reclustering at different granularity.</p><h3>Building story chains</h3><p>A story chain is a sequence of linked clusters across consecutive days.</p><p>Individual pairwise links tell you that Monday&#x27;s &quot;UK politics&quot; cluster connects to Tuesday&#x27;s. Chains reveal the full arc: a story that starts Monday, evolves through the week, and fades by Friday.</p><p>Chains are built greedily from links with a kNN fraction ≥ 0.4, meaning that at least 40% of sampled documents from the source cluster landed in a single target cluster. Starting from the earliest cluster, the algorithm always follows the strongest outgoing link.<br/></p><pre><code>    Strong links (kNN fraction >= 0.4): 244
    Story chains spanning 3+ days: 18
      Chain 1: 'ukrainian | kyiv | eastern' (19 days: Feb 3 → Feb 21)
      Chain 2: 'playing | opposition' (19 days: Feb 10 → Feb 28)
      Chain 3: 'tadhg | maro | cadan' (10 days: Feb 1 → Feb 10)
      Chain 4: 'invade | china | putin' (8 days: Feb 21 → Feb 28)
      Chain 5: 'elected | labour | leader' (7 days: Feb 12 → Feb 18)
      Chain 6: 'film | swift | awards' (6 days: Feb 2 → Feb 7)
      Chain 7: 'amendment | termination | reporting' (6 days: Feb 12 → Feb 17)
      Chain 8: 'officers | scene | police' (5 days: Feb 1 → Feb 5)</code></pre><p>The longest chain tracks Ukraine–Russia coverage for 19 consecutive days, unsurprising given the sustained geopolitical intensity in February 2025. The second-longest follows Premier League football across 19 days of the month. Shorter chains capture award season (film/awards, six days), Six Nations rugby (10 days), and UK political leadership coverage (seven days). Each chain represents a story arc that the algorithm discovered purely from embedding similarity across daily indices.</p><h3>Sankey: Visualizing story flow</h3><p>A Sankey diagram is a flow visualization where link width represents connection strength. Here, each vertical band is a day, each node is a daily cluster (sized by document count), and each colored path traces one story chain across time. Link width encodes kNN overlap strength: Thicker links mean more sampled documents landed in the target cluster. Colors are consistent per chain, so a single color path from left to right reads as one story&#x27;s progression.</p><p>For example, the Ukraine–Russia chain (visible as one of the longer paths) flows continuously from early February through the third week, with consistently thick links indicating strong topical continuity across days.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/100a60a7fb85da8ab3813fd071a82c93f2c3f318-1300x650.png" style="width: 100%;" /><p><em>Temporal story chains flowing across February 2025. Each colored path is a story persisting across days; link width indicates kNN overlap strength.</em></p><h2>What this approach delivers</h2><p>This walkthrough covered a complete unsupervised document clustering pipeline built on Elasticsearch:</p><ol><li><strong>Clustering embeddings</strong>: Jina v5&#x27;s task-specific adapters produce embeddings optimized for topic grouping, not just query-document matching.</li><li><strong>Global discovery clustering</strong>: Clustering the full month in one index maximizes cross-day topical discovery.</li><li><strong>Density-probed centroid classification</strong>: Sample 5%, probe density via <code>msearch</code> kNN, select diverse high-density seeds, classify all docs against centroids. Elasticsearch handles the heavy compute; only seed selection runs client-side (~0.01s).</li><li><a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-significanttext-aggregation"><strong><code>significant_text</code></strong></a> <strong>labeling</strong>: Significance testing produces meaningful cluster labels without any ML model or manual annotation. 
Clusters that produce no significant terms are incoherent and get demoted to noise — a built-in quality gate.</li><li><strong>Temporal story linking</strong>: Daily indices and sample-and-query cross-index kNN trace how stories evolve over time.</li></ol><p><strong>Key takeaways:</strong></p><ul><li>The embedding task type matters: Clustering embeddings produce measurably tighter topical groups.</li><li>Elasticsearch can serve as both the storage layer <em>and</em> the clustering engine via <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search</a>.</li><li>Density-probed centroid classification keeps nearly all compute server-side and produces clusters with natural sizes determined by embedding space density.</li><li><code>significant_text</code> is fast, interpretable, and effective for both auto-labeling and quality gating.</li></ul><p><strong>When this approach is useful:</strong></p><ul><li>You have timestamped text and want topic discovery without labeled training data.</li><li>You want one stack for storage, vector search, labeling, and temporal linkage.</li></ul><p><strong>Extensions to explore:</strong></p><ul><li>Multi-period clustering (weekly, monthly rollups).</li><li>Real-time ingestion with incremental cluster assignment.</li><li>LLM-generated cluster summaries using the significant_text terms as seeds.</li><li>At larger scale, sampled KMeans centroids can serve as warm-start seeds for density-based clustering, reducing the probe phase cost.</li></ul><h2>Try it yourself</h2><p>Swap in your own timestamped document corpus; any collection of text with dates works with this pipeline. The full notebook and supporting code are available in the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/unsupervised-document-clustering-elasticsearch-jina-embeddings">companion repository</a>.</p><ul><li><a href="https://cloud.elastic.co/registration?onboarding_token=vectorsearch&amp;cta=cloud-registration&amp;tech=trial&amp;plcmt=article%20content&amp;pg=search-labs"><strong>Start a free Elastic Cloud trial</strong></a>: Spin up a managed cluster with <code>bbq_disk</code> support in minutes.</li><li><a href="https://www.elastic.co/elasticsearch/serverless"><strong>Try Elasticsearch Serverless</strong></a>: No cluster management, scales automatically, and supports everything in this walkthrough.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why ecommerce search needs governance]]></title>
            <description><![CDATA[Learn why ecommerce search falls short without governance and how a control layer ensures predictable and intent-driven results, thus improving retrieval.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3793259b01a5653a7520393a2f006610de0d21e7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Ecommerce retailers need to handle various fundamentally different query types within the same system. A shopper searching for “oranges” expects the fruit, not products containing the word “orange”, such as orange juice or orange marmalade, and not semantically related citrus products. A shopper searching for a “gift for grandpa who has a sweet tooth” needs semantic discovery, not literal keyword matching.</p><p><em>Lexical retrieval</em> (text matching), <em>semantic retrieval</em> (matching concepts), and <em>hybrid retrieval</em> (combining lexical and semantic signals) don’t solve these issues on their own. Lexical retrieval may return anything containing the word “oranges”, while pure semantic retrieval on a high-intent query like “oranges” may broaden toward related items, such as lemons or grapefruits. Hybrid retrieval blends these lexical and semantic signals, but it still doesn’t determine if this query should be treated as navigational, which constraints should be enforced, or which business policies should apply. The gap isn’t the retrieval technology itself; it’s the absence of a governance layer that understands what kind of query this is and what constraints should be enforced before retrieval begins.</p><p>In this blog, we explore ecommerce search governance, why it matters, and how a control layer ensures predictable, accurate retrieval.</p><h2>What governance means in ecommerce search</h2><p><em>Governance</em>, in this context, means introducing a decision layer between the user&#x27;s query and the retrieval engine. This layer performs the following functions:</p><ul><li>Classifies query intent: Is this navigation (&quot;oranges&quot;) or discovery (&quot;gift for grandpa&quot;)?</li><li>Applies business constraints: What category boundaries, eligibility rules, availability constraints, or merchandising policies apply?</li><li>Routes to the appropriate strategy: Should this use lexical retrieval, semantic retrieval, or hybrid?</li></ul><p>A governance layer determines which retrieval approach should be used for each query, which constraints must be enforced, and which business policies should apply before retrieval begins. It’s important not to conflate governance with hybrid retrieval: hybrid is one retrieval strategy that combines lexical and semantic signals, while governance is the upstream decision layer that determines whether lexical, semantic, or hybrid should be used.</p><h2>The status quo: The application layer &quot;spaghetti&quot; implementation</h2><p>Currently, many retailers attempt to solve this by adding logic directly into the application layer. 
This often results in <em>spaghetti code</em>: thousands of lines of hard-coded if-then statements, regexes, and complex search templates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f532b099ee103458e15563a711dae92952f8df02-1024x765.png" style="width: 100%;" /><p>This approach can produce the desired search results, as shown above; however, it creates significant operational friction:</p><ul><li><strong>Engineering dependency:</strong> Business users and merchandisers cannot modify search behavior without engineering tickets and long deployment cycles that often span several weeks.</li><li><strong>Fragmentation:</strong> Search logic becomes scattered between application code and search templates, and is difficult to explain or audit, making it risky to evolve.</li></ul><p>Even when teams recognize the need for routing, the debate often focuses on the wrong question: which retrieval method to pick.</p><h2>The false choice: Lexical vs. semantic vs. hybrid</h2><p>Search teams often frame the challenge as a retrieval strategy choice: lexical/BM25 versus semantic/vectors versus hybrid. That framing is understandable (retrieval methods matter), but it misses the most common failure mode in real deployments, which is that using a single retrieval approach for all queries will give suboptimal results.</p><p>Commerce search is a mix of fundamentally different intents:</p><ul><li><strong>Deterministic, high-intent navigation</strong> (&quot;oranges&quot;, “milk”, “chocolate without peanuts”, “cheap olive oil”).</li><li><strong>Exploratory discovery</strong> (&quot;jacket for hiking in the mountains&quot;, &quot;gift for a 12-year-old who likes robotics&quot;).</li><li><strong>Operational constraints</strong> (availability, size, price, color).</li><li><strong>Merchandising and campaigns</strong> (boost, bury, seasonal campaigns).</li></ul><p>When the system routes all of these through the same retrieval strategy, the results are often systematically wrong in predictable ways because the operating model lacks governance. When teams don&#x27;t recognize this as a governance gap, they respond with the only lever they have: more tuning.</p><h2>Why &quot;relevance tuning&quot; can become cyclical</h2><p>Without a routing layer, “relevance” often turns into a never-ending backlog:</p><ul><li>Why is this query showing accessories above the core product?</li><li>Why did this head query suddenly start surfacing related items?</li><li>Why did results change after we added synonyms, adjusted analyzers, or enabled hybrid?</li><li>Why does the business team need an engineering release to fix a single query?</li></ul><p>Teams respond with more tuning: more synonyms, more boosts, more reranking experiments, more exceptions in application code. This can work for a while, but it often produces brittle behavior because the system still lacks an explicit decision layer for determining query type and enforcing the right constraints before retrieval.</p><h2>The anatomy of ecommerce intent: Head and tail</h2><p>In this section, we use “head” and “tail” as practical shorthand for common navigational and exploratory query patterns in ecommerce. 
In the real world, many queries contain aspects of both:</p><h3>Head queries (deterministic intent)</h3><p>These are direct, navigational queries where the user knows exactly what they want:</p><ul><li>Single-item intent (&quot;oranges&quot;, &quot;milk&quot;, &quot;bread&quot;).</li><li>Exact brands or product families (&quot;iPhone 15 Pro&quot;, &quot;Diet Coke&quot;).</li><li>SKUs, model numbers, sizes (&quot;ABC123&quot;, &quot;air max 270&quot;).</li></ul><p>For these queries, lexical retrieval can handle token correspondence (matching words), but the business also expects to respect constraints, return predictable rankings, and have controllable outcomes. A merchandiser needs to ensure that a query resolves within the correct category boundaries, respects eligibility, and surfaces specific business priorities.</p><p>Governance is required to enforce the intended resolution. For example, “oranges” should map to the produce category, not to orange juice, orange marmalade, or orange soda.</p><h3>Tail queries (exploratory discovery)</h3><p>These are descriptive, intent-rich queries where shoppers are exploring:</p><ul><li>&quot;Gift for grandpa who has a sweet tooth&quot;</li><li>&quot;Jacket for hiking in the mountains&quot;</li><li>&quot;Shoes for standing all day&quot;</li></ul><p>Lexical retrieval often struggles here. Semantic retrieval excels because it can connect the query concept to the product, even when wording does not match. But semantic retrieval alone is rarely sufficient either. Real queries often require constraints to be enforced, regardless of which retrieval method is used.</p><h2>Constraints are orthogonal to retrieval method</h2><p>Applying constraints to semantic retrieval doesn’t mean <em>hybrid search</em>. These are orthogonal concepts. Constraints, such as filters and boosts in Elasticsearch, can be applied to any lexical, semantic, or hybrid retrieval. The challenge is deciding how the query should be interpreted, which constraints must be enforced, and which retrieval strategy should be used.</p><p>Below are some examples of queries that combine retrieval with hard constraints:</p><ul><li><strong>Oranges:</strong> Lexical retrieval for “oranges” plus a category constraint, such as “Fruits” or “Produce”, eliminating orange marmalade, orange juice, and orange soda.</li><li><strong>Fruits high in vitamin C under $4:</strong> Semantic retrieval for nutritional intent plus constraints limiting results to the fruit category and products under $4.</li><li><strong>Comfortable shoes for work:</strong> Semantic retrieval for contextual intent plus a category constraint limiting results to shoes.</li></ul><p>These queries can&#x27;t be handled by a single approach:</p><ul><li><strong>Pure lexical retrieval</strong> is often insufficient here because phrases like “high in vitamin C” or “comfortable” may not exist as clean, structured attributes. They may need to be inferred from product descriptions, reviews, or specifications.</li><li><strong>Pure semantic retrieval</strong> is also not always sufficient because, without explicit constraints, a query like “fruits high in vitamin C” might broaden toward vitamin supplements, fruit-flavored drinks, or high-vitamin vegetables outside the intended category and price range.</li></ul><p>A governance layer determines whether a query needs lexical retrieval, semantic understanding, constraint enforcement, or some combination of these. 
Without this layer, ecommerce teams may end up:</p><ul><li><strong>Over-constraining:</strong> Using lexical retrieval for semantic requests (for example, &quot;gift for grandpa&quot;).</li><li><strong>Under-constraining:</strong> Using semantic queries for high-intent head queries (for example, “oranges”).</li></ul><p>The governance challenge is to build a system that can make the right judgment call for each class of query.</p><h2>What happens without governance</h2><p>The most common failure mode is straightforward: Teams take the raw user query and pass it directly into a single retrieval strategy (lexical, semantic, or hybrid), without an intermediate governance layer.</p><h3>Lexical retrieval misses intended resolution</h3><p>When a user searches for “oranges”, a lexical retrieval strategy may return anything containing that token: orange juice, orange marmalade, or orange soda. The system matched the term correctly, but without governance it may not resolve the intended shopping context (the fruit).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/99abc7a46f9c56a26a68d0a089d7ab830b9b5568-1560x814.png" style="width: 100%;" /><h3>Semantic retrieval broadens beyond intended constraints</h3><p>When a user searches for “oranges”, a semantic system may retrieve conceptually related items across nearby product concepts. The system may correctly understand the broader domain (fruit or produce), but without explicit governance it can still over-broaden beyond the user’s intended constraint (specifically oranges).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c9de86363ecbed499fe48259f47b3c5b2c26bc43-1568x796.png" style="width: 100%;" /><h3>The gap is governance</h3><p>What’s required is an upstream decision layer that determines query intent and enforces the right constraints before retrieval begins. This fixes issues such as the following:</p><ul><li>Similar or related items appearing alongside what the user actually wanted.</li><li>Blurred category boundaries (&quot;beverages&quot; versus &quot;produce&quot;).</li><li>Inability to implement seasonal boosts or campaigns.</li><li>Unpredictable and unexplainable results.</li></ul><h2>Intent understanding and routing: The necessary control plane</h2><p>A governed search system introduces a lightweight control plane in front of retrieval (prior to executing a query in Elasticsearch). The control plane will be discussed in detail in parts <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">3</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance">4</a> of this blog series; for now, we discuss only what it can do, not how it works:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/88c3d0f9731a128d73a765dcdffed897308110a6-2680x766.png" style="width: 100%;" /><p>A control plane can detect intent, apply business policies, and select the appropriate retrieval strategy, as follows:</p><p><strong>1. 
Detect intent signals</strong></p><ul><li>Is this query likely navigation versus discovery?</li><li>Is it a known head query (milk, bread, bananas)?</li><li>Is there a known product, brand, or category interpretation (for example, “oranges” should resolve to produce)?</li><li>Is the query an SKU-like pattern?</li><li>Does the query fall under an active campaign or seasonal policy (for example, during Christmas, boost turkey-related results)?</li><li>Does the query imply constraints (category, attributes, exclusions, price/size/color)?</li></ul><p><strong>2. Apply governance and business policies</strong></p><ul><li>Enforce deterministic constraints first (category/attribute/negation/availability).</li><li>Apply active merchandising policies (boost/bury/pin/override).</li><li>Resolve conflicts with precedence rules (for example, campaign overrides versus global policies).</li></ul><p><strong>3. Route to the appropriate retrieval strategy</strong></p><ul><li>Lexical (fast, deterministic) for navigational/high-intent head queries.</li><li>Semantic retrieval for true discovery queries.</li><li>Hybrid where combined lexical and semantic signals add value under explicit business constraints.</li></ul><p>In practice, the output of the control plane is not simply “use hybrid” or “use semantic.” It’s a governed retrieval plan: an interpretation of the shopper’s intent, the constraints and policies that should apply, and the retrieval strategy that should be executed. A few simple examples make this concrete:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Shopper query</td><td style="border: 1px solid #ccc; padding: 4px;">Governed interpretation</td><td style="border: 1px solid #ccc; padding: 4px;">Example retrieval plan</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">“chocolate without peanuts”</td><td style="border: 1px solid #ccc; padding: 4px;">Product-oriented query with a hard exclusion constraint</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical retrieval for chocolate plus an exclusion filter for products containing peanuts</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">“cheap olive oil”</td><td style="border: 1px solid #ccc; padding: 4px;">Product/category query with a price constraint</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical retrieval for olive oil plus a price filter capped at the retailer’s threshold for cheap</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">“fruit high in vitamin C under $4”</td><td style="border: 1px solid #ccc; padding: 4px;">Discovery query requiring semantic understanding plus hard constraints</td><td style="border: 1px solid #ccc; padding: 4px;">Semantic retrieval for nutritional intent, constrained to the fruit category and filtered to products priced under $4</td>
                    </tr>
              </tbody>
            </table>
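          <p>To make the idea of a governed retrieval plan concrete, here is a purely illustrative sketch of how the first plan above (&quot;chocolate without peanuts&quot;) could render as an Elasticsearch query. The index and field names (<code>products</code>, <code>title</code>, <code>allergens</code>) are hypothetical, not a prescribed schema:</p><pre><code># Illustrative only: lexical retrieval plus a hard exclusion constraint,
# expressed as a bool query via the Python client. Schema is hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")

resp = es.search(
    index="products",
    query={
        "bool": {
            "must": [{"match": {"title": "chocolate"}}],       # lexical retrieval
            "must_not": [{"term": {"allergens": "peanuts"}}],  # hard exclusion
        }
    },
)</code></pre><p>The point is not the query itself but where it comes from: the control plane, rather than hand-written application code, decides that this query gets a lexical match plus an exclusion filter.</p>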
          <p>A control plane selects the right policy and retrieval strategy for each query consistently, predictably, and at scale. This makes advanced retrieval methods more predictable in production because intent-aligned constraints are enforced first and routing decisions are explicit rather than implicit.</p><h2>How this relates to other approaches</h2><p>Some teams use improved embedding models to better capture product semantics, which can materially improve semantic retrieval quality. Others use reranking approaches, such as <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">Learning To Rank (LTR)</a>, to optimize result ordering based on engagement or business signals after retrieval. Both are valuable and often complementary. Better embeddings improve similarity matching. Reranking improves ordering among retrieved candidates.</p><p>Governance addresses a different layer of the problem: It sits upstream of retrieval. It decides which retrieval strategy to use (for example, lexical, semantic, or hybrid), what deterministic constraints are required, and which queries should combine multiple business policies.</p><h2>What a governed control plane enables</h2><p>Once a governance layer is in place, the operating model changes fundamentally. Revenue-critical queries become predictable. Business teams can update search behavior without waiting on engineering release cycles. And advanced retrieval methods, like semantic and hybrid, can be adopted incrementally, behind routing and guardrails, instead of as a global on/off switch.</p><p>The <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">next post</a> in this series explores what that operating model looks like in practice and why it may matter as much as the retrieval technology underneath it.</p><p>If a merchandiser has to open a Jira ticket and wait for a deploy to fix a revenue-critical query, the bottleneck isn&#x27;t the engine; it&#x27;s the operating model. Modern ecommerce search needs a way to translate business intent into controlled, auditable search behavior quickly and safely, while still using advanced retrieval where it adds measurable value.</p><h2>What’s next in this series</h2><p>The patterns explored in this series operate upstream of retrieval: translating business intent into the right query strategy before query generation begins. In the <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">next post</a>, we shift from the technical problem to the operational one: what happens when business teams can change search behavior without an engineering deployment, and why governance makes that safe.</p><h2>Put governed ecommerce search into practice</h2><p>Engineering bottlenecks, brittle application-layer logic, and unpredictable search results are problems that Elastic Services can help you solve in enterprise ecommerce services engagements. The governed control plane architecture described in this series was built by Elastic Services Engineering.</p><p>If your team is spending engineering cycles translating merchandising requests into code changes, or if your search relevance backlog never seems to shrink, we can help you assess your current architecture and build a path to governed, business-editable search. 
Contact <a href="https://www.elastic.co/consulting">Elastic Services</a>.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to build agentic AI applications with Mastra and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build agentic AI applications using Mastra and Elasticsearch through a practical example.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-agentic-ai-applications-mastra-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-agentic-ai-applications-mastra-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Enrico Zimuel]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ab30baf2f908534840c5d71a46705773807baf54-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll walk through how to use the <a href="https://mastra.ai/">Mastra</a> TypeScript framework to build agentic applications that interact with <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>.</p><p>We recently contributed to the <a href="https://github.com/mastra-ai/mastra">mastra-ai/mastra</a> open source project by adding support for Elasticsearch as a vector database. With this new feature, you can use Elasticsearch natively in Mastra to store embeddings. In addition to vectors, Elasticsearch provides a suite of advanced features to address all your context engineering requirements (for example, <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">hybrid search and reranking</a>).</p><p>This article details the creation of an agent to implement a retrieval augmented generation (RAG) architecture using Elasticsearch. We’ll showcase a demo project where an agentic approach is used to interact with a corpus of sci-fi movie data stored within Elasticsearch. The project is available at <a href="https://github.com/elastic/mastra-elasticsearch-example">elastic/mastra-elasticsearch-example</a>.</p><h2>Mastra</h2><p>Mastra is a TypeScript framework for creating agentic AI applications.</p><p>A project structure in Mastra looks as follows:</p><pre><code>src/
├── mastra/
│   ├── agents/
│   │   └── weather-agent.ts
│   ├── tools/
│   │   └── weather-tool.ts
│   ├── workflows/
│   │   └── weather-workflow.ts
│   ├── scorers/
│   │   └── weather-scorer.ts
│   └── index.ts
├── .env.example
├── package.json
└── tsconfig.json</code></pre><p>In Mastra, you can build <a href="https://mastra.ai/docs/agents/overview">agents</a>, <a href="https://mastra.ai/docs/agents/using-tools">tools</a>, <a href="https://mastra.ai/docs/workflows/overview">workflows</a>, and <a href="https://mastra.ai/docs/evals/overview">scorers</a>.</p><p>An <strong>agent</strong> is a class that accepts a message as input and produces a response as output. An agent can use tools, large language models (LLMs), and a memory (figure 1).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f6aca2dcc7fcc45d25e06681649be1b2b7eb6781-706x721.png" style="width: 100%;" /><p>An agent&#x27;s <strong>tools</strong> allow it to interact with the &quot;external world,&quot; such as communicating with a web API or performing an internal operation, like querying Elasticsearch. The <strong>memory</strong> component is crucial for storing the history of conversations, including past inputs and outputs. This stored context enables the agent to provide more informed and relevant responses to future questions by using its past interactions.</p><p><strong>Workflows</strong> let you define complex sequences of tasks using clear, structured steps rather than relying on the reasoning of a single agent (figure 2). They give you full control over how tasks are broken down, how data moves between them, and what gets executed when. Workflows run using the built-in execution engine by default or can be deployed to <a href="https://mastra.ai/docs/deployment/workflow-runners">workflow runners</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0dce161cf7891207015dc87532b5b90df1822432-880x252.png" style="width: 100%;" /><p>In Mastra, you can also define scorers, which are automated tests that evaluate agent outputs using model-graded, rule-based, and statistical methods. Scorers return <em>scores</em>: numerical values (typically between 0 and 1) that quantify how well an output meets your evaluation criteria. These scores enable you to objectively track performance, compare different approaches, and identify areas for improvement in your AI systems. Scorers can be customized with your own prompts and scoring functions.</p><h2>Elasticsearch</h2><p>To run the demo project, we need a running Elasticsearch instance. You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and generate an API key to be used for configuring the Mastra integration.</p><p>The API key will be shown as output of the previous command and stored in a <strong>.env</strong> file in the elastic-start-local folder.</p><h2>Install and configure the demo</h2><p>We created an <a href="https://github.com/elastic/mastra-elasticsearch-example">elastic/mastra-elasticsearch-example</a> repository containing the source code of the demo project. The example reported in the repository illustrates how to create an agent in Mastra that implements a RAG architecture for retrieving documents from Elasticsearch.</p><p>We provided a dataset of sci-fi movies for the demo. 
We extracted 500 movies from the IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>.</p><p>The first step is to install the dependencies of the project with npm, using the following command:</p><pre><code>npm install</code></pre><p>Then we need to configure the <strong>.env</strong> file that will contain the settings. We can generate this file by copying the structure from the <strong>.env.example</strong> file, using the following command:</p><pre><code>cp .env.example .env</code></pre><p>Now we can edit the <strong>.env</strong> file, adding the missing information:</p><pre><code>OPENAI_API_KEY=
ELASTICSEARCH_URL=
ELASTICSEARCH_API_KEY=
ELASTICSEARCH_INDEX_NAME=scifi-movies</code></pre><p>The name of the Elasticsearch index defaults to <strong><code>scifi-movies</code></strong>; you can change it through the <code>ELASTICSEARCH_INDEX_NAME</code> env variable.</p><p>We used OpenAI as the embedding service, which means that you need to provide an API key for OpenAI in the <code>OPENAI_API_KEY</code> env variable.</p><p>The embedding model used in the example is <a href="https://developers.openai.com/api/docs/models/text-embedding-3-small">openai/text-embedding-3-small</a>, with an embedding dimension of 1536.</p><p>To generate the final answer, we used the <a href="https://developers.openai.com/api/docs/models/gpt-5-nano">openai/gpt-5-nano</a> model to reduce costs.</p><p>The RAG architecture allows you to use a less powerful (and typically less expensive) final LLM because the heavy lifting of grounding the answer is done by the retrieval component (Elasticsearch in this case).</p><p>The smaller LLM is only responsible for two main tasks:</p><ul><li><strong>Rephrasing/embedding the query:</strong> Converting the user&#x27;s natural language question into a vector embedding for semantic search.</li><li><strong>Synthesizing the answer:</strong> Taking the highly relevant, retrieved context chunks (documents/movies) and synthesizing them into a coherent, final, human-readable answer, following the provided prompt instructions.</li></ul><p>Since the RAG process <strong>provides the exact factual context</strong> needed for the answer, the final LLM doesn&#x27;t need to be massive or highly complex, nor does it need to possess all the required knowledge within its own parameters (which is where large, expensive models excel). It essentially acts as a sophisticated text summarizer and formatter for the context provided by Elasticsearch, rather than as a full-fledged knowledge base itself. This enables the use of models like <code>gpt-5-nano</code> for cost and latency optimization.</p><p>After configuring the .env file, you can ingest the movies into Elasticsearch using the following command:</p><pre><code>npx tsx src/utility/store.ts</code></pre><p>You should see output like the following:</p><pre><code>🚀 Starting ingestion of 500 movies from 500_scifi_movies.jsonl...
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 1/500 (0%) | ok:1 | fail:0 | chunks:1 | eta:19m 33s | current:Capricorn One
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 2/500 (0%) | ok:2 | fail:0 | chunks:2 | eta:10m 32s | current:Doghouse
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 3/500 (1%) | ok:3 | fail:0 | chunks:3 | eta:7m 33s | current:Dinocroc
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 4/500 (1%) | ok:4 | fail:0 | chunks:7 | eta:6m 10s | current:Back to the Future           
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 5/500 (1%) | ok:5 | fail:0 | chunks:9 | eta:5m 14s | current:The Projected Man            
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 6/500 (1%) | ok:6 | fail:0 | chunks:11 | eta:4m 41s | current:I, Robot
...
✅ Ingestion complete in 1m 46s. Success: 500, Failed: 0, Chunks: 693.</code></pre><p>The mapping of the scifi-movies index contains the following fields:</p><ul><li><strong>embedding</strong>, a <code>dense_vector</code> with 1536 dimensions and cosine similarity.</li><li><strong>description</strong>, text containing the description of the movie.</li><li><strong>director</strong>, text containing the name of the director.</li><li><strong>title</strong>, text containing the title of the movie.</li></ul><p>We generated the embeddings from the concatenation of the title and the description. Combining the two fields ensures that the resulting embedding vector captures both the specific, unique identity (title) and the rich, descriptive context (description) of the movie, giving the embedding model a better single representation of the document&#x27;s content and leading to more accurate semantic search results.</p><h2>Run the demo</h2><p>You can run the demo with the following command:</p><pre><code>npm run dev</code></pre><p>This command starts a web application at <strong>localhost:4111</strong>, where you can access Mastra Studio (figure 3).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1567e309df21a12bcf1dfef4429f82342549956c-1705x1079.png" style="width: 100%;" /><p><a href="https://mastra.ai/docs/getting-started/studio">Mastra Studio</a> offers an interactive UI for building and testing your agents, along with a REST API that exposes your Mastra application as a local service. This lets you start building right away without worrying about integration.</p><p>We provided an <strong>Elasticsearch Agent</strong> that uses Mastra&#x27;s <a href="https://mastra.ai/reference/tools/vector-query-tool">createVectorQueryTool</a> to execute semantic search against Elasticsearch. This agent uses the RAG approach to search for relevant documents (that is, movies) to answer the user’s question.</p><p>This agent uses the following prompt:</p><pre><code>You are a helpful assistant that answers questions based on the provided context.
Follow these steps for each response:

1. First, carefully analyze the retrieved context chunks and identify key information.
2. Break down your thinking process about how the retrieved information relates to the query.
3. Draw conclusions based only on the evidence in the retrieved context.
4. If the retrieved chunks don't contain enough information, explicitly state what's missing.

Format your response as:
THOUGHT PROCESS:
- Step 1: [Initial analysis of retrieved chunks]
- Step 2: [Reasoning based on chunks]

FINAL ANSWER:
[Your concise answer based on the retrieved context]

Important: When asked to answer a question, please base your answer only on the context provided in the tool. 
If the context doesn't contain enough information to fully answer the question, please state that explicitly and stop.
Do not add more information than what is present in the retrieved chunks.
Remember: Explain how you're using the retrieved information to reach your conclusions.</code></pre><p>If you click on the <code>Mastra Studio &gt; Agents</code> menu and select <strong>Elasticsearch Agent</strong>, you can test the agent using a chat system. For instance, you can ask for information about sci-fi movies with a question such as:</p><p><em>Find 5 movies or TV series about UFOs</em>.</p><p>You’ll notice that the agent will execute the vectorQueryTool. You can click on the invoked tool to inspect its input and output. At the end of execution, the LLM will reply to your question, using the context retrieved from the scifi-movies index in Elasticsearch (figure 4).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d9998d4f687984de98845dae52d1288166abf448-1344x1071.png" style="width: 100%;" /><p>Mastra executes the following steps internally:</p><ol><li><strong>Vector conversion:</strong> The user&#x27;s question, <em>Find 5 movies or TV series about UFOs,</em> is converted into a vector embedding using OpenAI&#x27;s <code>openai/text-embedding-3-small</code> model.</li><li><strong>Vector search:</strong> This embedding is then used to query Elasticsearch via a vector search.</li><li><strong>Result retrieval:</strong> Elasticsearch returns a set of 10 movies highly relevant to the query (that is, those with vectors closest to the user&#x27;s query vector).</li><li><strong>Answer generation:</strong> The retrieved movies and the original user question are sent to the LLM, specifically <code>openai/gpt-5-nano</code>. The LLM processes this information and generates a final answer, ensuring that the user&#x27;s request for five results is met.</li></ol><h2>The Elasticsearch Agent</h2><p>Here is the source code of the Elasticsearch Agent.</p><pre><code>import { Agent } from "@mastra/core/agent";
import { ElasticSearchVector } from '@mastra/elasticsearch';
import { createVectorQueryTool } from '@mastra/rag';
import { ModelRouterEmbeddingModel } from "@mastra/core/llm";
import { Memory } from "@mastra/memory";

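// Connection settings come from the .env file configured earlier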
const es_url = process.env.ELASTICSEARCH_URL;
const es_apikey = process.env.ELASTICSEARCH_API_KEY;
const es_index_name = process.env.ELASTICSEARCH_INDEX_NAME;
const prompt = 'paste here the full prompt shown in the previous section';

const esVector = new ElasticSearchVector({
  id: 'elasticsearch-vector',
  url: es_url,
  auth: {
    apiKey: es_apikey
  }
});

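// Retrieval tool for the RAG flow: it embeds the user's question and runs a vector search on the index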
const vectorQueryTool = createVectorQueryTool({
  vectorStore: esVector,
  indexName: es_index_name,
  model: new ModelRouterEmbeddingModel("openai/text-embedding-3-small")
});

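// The agent wires together the prompt, the LLM, the retrieval tool, and conversation memory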
export const elasticsearchAgent = new Agent({
  id: "elasticsearch-agent",
  name: "Elasticsearch Agent",
  instructions: prompt,
  model: 'openai/gpt-5-nano',
  tools: { vectorQueryTool },
  memory: new Memory(),
});</code></pre><p>The <strong>vectorQueryTool</strong> is the tool that’s invoked to implement the retrieval part of the RAG example. It uses the <a href="https://mastra.ai/reference/vectors/elasticsearch">ElasticSearchVector</a> implementation that Elastic contributed to Mastra.</p><p>The agent is an instance of the Agent class that consumes the vectorQueryTool, the prompt, and a memory. As you can see, the code needed to connect Elasticsearch to an agent is very minimal.</p><h2>Conclusion</h2><p>This article demonstrated the simplicity and power of integrating Elasticsearch with the Mastra framework to build sophisticated agentic AI applications. Specifically, we walked through creating a RAG agent capable of performing semantic search over a corpus of sci-fi movie data indexed in Elasticsearch.</p><p>A key takeaway is the direct contribution by Elastic to the Mastra open source project, providing native support for Elasticsearch as a vector store. This integration significantly lowers the barrier to entry, as evidenced by the <strong>Elasticsearch Agent</strong> source code. Using the <code>ElasticSearchVector</code> and <code>createVectorQueryTool</code>, the complete setup for connecting Elasticsearch to your agent requires only a few lines of configuration code.</p><p>Elasticsearch provides several advanced features to enhance result relevance. For example, <a href="https://www.elastic.co/elasticsearch/hybrid-search">hybrid search</a> significantly boosts accuracy by combining lexical search with vector search. Another interesting feature is reranking using the latest <a href="https://www.elastic.co/search-labs/tutorials/jina-tutorial/jina-reranker-v3">Jina models</a>, which can be applied at the end of hybrid search. To learn more about these techniques, consult the following articles from Elasticsearch Labs:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch">Elasticsearch hybrid search</a> by Valentin Crettaz</li><li><a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">An introduction to Jina models, their functionality, and uses in Elasticsearch</a> by Scott Martens</li></ul><p>We also encourage you to explore the provided example and begin building your own data-powered agents with Mastra and Elasticsearch. For more information about Mastra, you can have a look at the official documentation <a href="https://mastra.ai/docs">here</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Migrating your OpenShift Elasticsearch 6.x cluster to Elastic Cloud on Kubernetes (ECK)]]></title>
            <description><![CDATA[A step-by-step guide on how to migrate from the legacy OpenShift Elasticsearch Operator (ES 6.x) to the modern Elastic Cloud on Kubernetes (ECK).]]></description>
            <link>https://www.elastic.co/search-labs/blog/openshift-elastic-cloud-kubernetes-migration</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/openshift-elastic-cloud-kubernetes-migration</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Omer Kushmaro, Jamie Parker]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f35bf231e4c8ad4f9a79d4ac4d188bfc2e385a3-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Red Hat&#x27;s OpenShift platform has long been a trusted foundation for enterprise Kubernetes workloads, and for years, its built-in Elasticsearch Operator has made log management simple. But all good things evolve: The OpenShift Elasticsearch Operator reached the end of its supported lifecycle with Red Hat OpenShift Container Platform (OCP) 4.13, and the Elasticsearch 6.x clusters it managed are long out of support. Working closely with Red Hat, we&#x27;ve put together this step-by-step guide to help you move from that legacy setup to <strong>Elastic Cloud on Kubernetes</strong> (ECK), the modern, full-featured operator maintained directly by Elastic. The migration path we&#x27;ve designed respects the OpenShift-native tooling you already rely on, keeps disruption to a minimum, and leaves you on a solid foundation for future upgrades to 8.x, 9.x, and beyond.</p><h2>Why this matters</h2><ul><li><strong>Security and support:</strong> The last Elasticsearch 6.x patch release was on January 13, 2022. ECK lets you upgrade at your own pace, with a supported operator from the creators of Elasticsearch. Remaining on an old Elasticsearch version exposes you to supportability risks and well-known security issues.</li><li><strong>Features you’ve been missing:</strong> Autoscaling, data tiers, machine learning (ML) jobs, searchable snapshots. None of these are available in the legacy operator.</li><li><strong>Future-proof operations:</strong> ECK ships <em>day-and-date</em> with every new Elastic release, so you’re never stuck waiting again.</li></ul><h2>High-level plan</h2>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Phase</td><td style="border: 1px solid #ccc; padding: 4px;">Goal</td><td style="border: 1px solid #ccc; padding: 4px;">Outcome</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0</td><td style="border: 1px solid #ccc; padding: 4px;">Snapshot and sanity-check your 6.x cluster.</td><td style="border: 1px solid #ccc; padding: 4px;">You have a backup in case you need it.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">Install ECK 2.16.1 alongside the Red Hat operator.</td><td style="border: 1px solid #ccc; padding: 4px;">Both operators coexist safely.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">Spin up a fresh, production-ready ES 6.8.23 cluster managed by ECK.</td><td style="border: 1px solid #ccc; padding: 4px;">Empty ECK-managed cluster.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">Restore data into that new cluster.</td><td style="border: 1px solid #ccc; padding: 4px;">All indices now live under ECK.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">Point openshift-logging to the new service and retire the old operator.</td><td style="border: 1px solid #ccc; padding: 4px;">Single source of truth.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5</td><td style="border: 1px solid #ccc; padding: 4px;">Rolling-upgrade Elasticsearch to 7.17.28.</td><td style="border: 1px solid #ccc; padding: 4px;">Latest long-term 7.x.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">6</td><td style="border: 1px solid #ccc; padding: 4px;">Upgrade ECK to 3.3.1.</td><td style="border: 1px solid #ccc; padding: 4px;">Operator on current release.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">7</td><td style="border: 1px solid #ccc; padding: 4px;">Schedule your own jump to 8.x & 9.x.</td><td style="border: 1px solid #ccc; padding: 4px;">You control the timing.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">8</td><td style="border: 1px solid #ccc; padding: 4px;">Clean up</td><td style="border: 1px solid #ccc; padding: 4px;">Removing old operator</td>
                    </tr>
              </tbody>
            </table>
          <p>Feel free to bookmark this list. Each milestone is small, reversible, and validated before you move on.</p><h3>0. Preflight checks</h3><p><strong>A. Health first: </strong>Run <code>/_cat/health</code> and make sure you’re green.</p><p><strong>B. Disk watermarks:</strong> Keep at least 20% free before starting a migration.</p><p><strong>C. Final snapshot:</strong> S3, GCS, or NFS: it doesn’t matter, as long as you can mount the same repository in the new cluster.</p><ol><li>If you don’t have object storage handy in your environment, you can use <a href="https://access.redhat.com/solutions/6408341">this solution post by Red Hat</a> to snapshot your data to local storage on the OpenShift cluster.</li></ol><p><strong>D. Review the documentation:</strong> Elastic provides thorough documentation for <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-data-between-elasticsearch-clusters-with-minimal-downtime">migrating data between Elasticsearch clusters</a>.</p><h3>1. Installing ECK 2.16.1 (your “bridge” operator)</h3><p>ECK 2.16.1 is the last release that still accepts <code>spec.version: 6.8.x</code>, which makes it the ideal bridge between past and future Elasticsearch versions.</p><pre><code>helm repo add elastic https://helm.elastic.co
helm repo update
oc create namespace elastic-system 

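# ECK 2.16.1 is the last operator release that accepts spec.version 6.8.x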
helm install elastic-operator elastic/eck-operator --version=2.16.1 -n elastic-system --create-namespace</code></pre><p>You can keep the Red Hat operator in place; the two watch <em>different</em> Custom Resource Definitions (CRDs), so they won’t step on each other’s toes.</p><p>Keep in mind that with OpenShift, ECK might display some Transport Layer Security (TLS) errors in its logs as OpenShift tries to connect to its healthcheck webhook endpoint via HTTP, but ECK allows TLS communication only. This is a well-known issue and shouldn’t pose a problem.</p><p>You can refer to the <a href="https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/install-using-helm-chart">Elastic documentation</a> if you need to make a local namespaced installation.</p><h3>2. Launching a 6.x cluster under ECK</h3><p>Below is a starter Kubernetes manifest that balances resiliency with cost (three hot-tier nodes that combine the master, data, and ingest roles). Swap storage class names, resources, and snapshot credentials to match your environment.</p><p><strong>Note:</strong> The syntax used below is a bit different from what it would be for newer Elasticsearch versions on ECK.</p><pre><code>apiVersion: elasticsearch.k8s.elastic.co/v1
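# Note: node.master/node.data/node.ingest below is the 6.x role syntax; newer versions use node.roles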
kind: Elasticsearch
metadata:
  name: es-logs
  namespace: elastic # Create this namespace prior, or use another namespace
spec:
  version: 6.8.23
  nodeSets:
    - name: hot
      count: 3
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            storageClassName: gp3-csi   # adjust if needed
            resources:
              requests:
                storage: 100Gi # adjust to your expected data volume
      config:
        node.master: true
        node.data: true
        node.ingest: true
        node.attr.data: hot
        cluster.routing.allocation.awareness.attributes: data
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 16Gi
                  cpu: 2
                limits:
                  memory: 16Gi
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic
spec:
  version: 6.8.23
  count: 1
  elasticsearchRef:
    name: es-logs
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: 1Gi
              cpu: 0.5
            limits:
              memory: 4Gi</code></pre><p>Deploy it, watch pods come up, and you’re ready for data.</p><h3>3. Moving the data</h3><p>To move data from one Elasticsearch cluster to another, you can also further consult <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-data-between-elasticsearch-clusters-with-minimal-downtime">this guide</a> in the Elastic documentation. For the purpose of this post, we’re assuming that snapshot and restore are used.</p><p><strong>Snapshot and restore</strong> are quickest:</p><pre><code># on the old cluster, take a snapshot
PUT _snapshot/log-backups
{
  "type": "s3",
  "settings": { ... }
}

PUT _snapshot/log-backups/final-snap-2025-08-07

# on the new cluster (readonly!)
PUT _snapshot/log-backups
{
  "type": "s3",
  "settings": {
    "readonly": true,
    ...
  }
}

# Perform the restore operation
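# (tip: append ?wait_for_completion=true to wait until the restore finishes)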
POST _snapshot/log-backups/final-snap-2025-08-07/_restore</code></pre><p>Can’t share an object store? Use <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-from-a-self-managed-cluster-with-a-self-signed-certificate-using-remote-reindex"><strong>remote re-index</strong></a> (slower, but works everywhere, though it won’t migrate index templates, component templates, and other cluster state) or pump logs through a one-off Logstash job.</p><h3>4. Configuring the ClusterLogging operator</h3><p>First, we’ll need to decommission our Red Hat operator–managed Elasticsearch cluster. We’ll modify our ClusterLogging like so:</p><pre><code>oc edit clusterlogging instance -n openshift-logging 
---------
 logStore:
    elasticsearch:
      nodeCount: 0 # scale down node count, previously > 0 
      redundancyPolicy: ZeroRedundancy
    type: elasticsearch
  managementState: Managed # this needs to be kept, as it will manage the fluentd instance for us.
  visualization:
    kibana:
      replicas: 0 # scale down kibana as well 
    type: kibana</code></pre><p>Then we’ll define a ClusterLogForwarder to direct the logs from fluentd to our newly built Elasticsearch 6.x cluster managed by ECK. We’ll need to create a secret with the Elasticsearch credentials:</p><pre><code>oc create secret generic eck-es-credentials \
  -n openshift-logging \
  --from-literal=username=elastic \
  --from-literal=password=$(oc get secret es-logs-es-elastic-user -n elastic -o jsonpath='{.data.elastic}' | base64 -d)</code></pre><p>For configuring TLS (as recommended), you’ll need to create a ConfigMap for ClusterLogForwarder to trust the ECK CA certificate. Further <a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/log-collection-and-forwarding">guidance can be found here</a>. We’ll run the following commands:</p><pre><code>oc -n elastic get secret es-logs-es-http-certs-public \
-o go-template='{{index .data "tls.crt" | base64decode}}' > ca.crt

oc -n openshift-logging create configmap eck-es-ca \
--from-file=ca-bundle.crt=ca.crt</code></pre><p>This creates the CA ConfigMap, which we’ll then reference in the ClusterLogForwarder CRD:</p><pre><code>apiVersion: logging.openshift.io/v1
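# Forward collected logs from fluentd to the ECK-managed Elasticsearch cluster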
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: eck-es
      type: elasticsearch
      url: https://es-logs-es-http.elastic.svc:9200
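      # ECK exposes the cluster at <cluster-name>-es-http.<namespace>.svc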
      secret:
        name: eck-es-credentials # this secret needs to be created first
      tls:
        # insecureSkipVerify: true # can be used for lab testing purposes
        ca:
          name: eck-es-ca
  pipelines:
    - name: send-to-eck
      inputRefs:
        - application
        - infrastructure
        - audit
      outputRefs:
        - eck-es</code></pre><p>⚠️ <em>If you’re troubleshooting connectivity issues, you can temporarily set <code>tls.insecureSkipVerify: true</code>, but this shouldn’t be used long term.</em></p><p>Because we’re restoring legacy indices into a fresh ECK-managed cluster, OpenShift Logging will not automatically recreate the legacy index layout or aliases. You must ensure that write aliases exist and point to writable indices. In my case, I needed to verify that the following aliases existed:</p><ul><li><code>app-write</code></li><li><code>infra-write</code></li><li><code>audit-write</code></li></ul><p>Each alias points to an index with dynamic mappings (not recommended for production, but it minimizes errors and troubleshooting steps during the migration). </p><pre><code># Forward ES port to local machine
oc -n elastic port-forward svc/es-logs-es-http 9200:9200

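# Grab the elastic superuser password from the ECK-managed secret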
PASS="$(oc -n elastic get secret es-logs-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)"

# Make sure the write alias points to the correct backing index
curl -s -k -u "elastic:${PASS}" -XPOST "https://localhost:9200/_aliases" \
  -H 'Content-Type: application/json' \
  -d '{
    "actions": [
      { "add": { "index": "infra-000002", "alias": "infra-write", "is_write_index": true } }
    ]
  }'</code></pre><p>Repeat for <code>app-write</code> and <code>audit-write</code> with their respective backing indices. Data should now start flowing into our new ECK-managed cluster.</p><h3>5. Rolling upgrade to 7.17.29, and verify</h3><p>Now you can finally leave 6.x behind.</p><p>A. Check <code>_xpack/migration/deprecations?pretty</code> using curl against Elasticsearch to surface deprecations. This API returns warnings and critical issues to address before you upgrade.</p><p>B. Patch the CRD to upgrade it to the latest 7.x version. I’m using 7.17.29.</p><pre><code>oc -n elastic patch elasticsearch es-logs --type=merge -p '{"spec":{"version":"7.17.29"}}'</code></pre><p>C. ECK restarts nodes <em>one at a time</em>. Your cluster should be online throughout.</p><p>D. Give cluster tasks and shard recoveries time to settle before pressing on.</p><p>E. Don’t forget to upgrade Kibana in the same way.</p><pre><code>oc -n elastic patch kibana kibana --type=merge -p '{"spec":{"version":"7.17.29"}}'</code></pre><p>Once complete, check your Elasticsearch version and Kibana version, as well as the health state:</p><pre><code>oc -n elastic get elasticsearch es-logs
oc -n elastic get kibana kibana</code></pre><h3>6. Operator upgrade: ECK 2.16.1 → 3.3.1</h3><p>ECK upgrades are pleasantly boring:</p><pre><code>helm upgrade elastic-operator elastic/eck-operator -n elastic-system --version 3.3.1</code></pre><p>Watch the operator pod roll. Your Elasticsearch cluster keeps running; only the controller restarts.</p><p>Verify that the upgrade is successful by looking at the operator logs and ensuring that no major errors appear:</p><pre><code>oc logs -n elastic-system sts/elastic-operator</code></pre><p>Then verify the new version of the operator (it will now be 3.3.1):</p><pre><code>helm -n elastic-system list</code></pre><h3>7. Your roadmap to 8.x and 9.x (when you’re ready)</h3><p>You’re now on:</p><ul><li><strong>ECK Operator:</strong> 3.3.1</li><li><strong>Elastic Stack:</strong> 7.17.29</li></ul><p>That pair is fully supported and serves as the official launchpad for 8.x. It’s important to first go through the <a href="https://www.elastic.co/guide/en/security/8.19/upgrade-7.17-8x.html">Elastic upgrade documentation</a>.</p><p>We’ll again go through the procedure of checking for any hard-breaking changes between 7.17.29 and the latest 8.x version (8.19.9):</p><pre><code>GET _migration/deprecations?pretty</code></pre><p>It&#x27;s important to look through the result of this query carefully and to complete the necessary steps, such as reindexing indices and changing mappings.</p><p>Once you’ve addressed all required changes from 7.17.29 to 8.x:</p><pre><code>oc -n elastic patch elasticsearch es-logs --type=merge -p '{"spec":{"version":"8.19.9"}}'
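# upgrade Kibana in lockstep so both components run the same version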
oc -n elastic patch kibana kibana --type=merge -p '{"spec":{"version":"8.19.9"}}'</code></pre><p>ECK will handle the rest. Just remember to upgrade Beats, Logstash pipelines, and client libraries in lockstep to avoid wire-protocol surprises.</p><p>Repeat the process to migrate to the <strong>latest 9.x</strong> version.</p><h3>8. Cleanup</h3><p>Now that you’re no longer using the Red Hat Elasticsearch operator, you can remove it from your cluster. You can do that via the following steps:</p><p>A. In the OpenShift Console, go to <strong>Operators</strong> and then to <strong>Installed Operators</strong>.</p><p>B. In the <strong>Filter By Name</strong> field, enter “Elasticsearch” to find the installed Red Hat Elasticsearch operator.</p><p>C. On the <strong>Operator Details</strong> page, select <strong>Uninstall Operator</strong> from the <strong>Actions</strong> list.</p><p>D. On the <strong>Uninstall Operator?</strong> dialog box, select <strong>Uninstall</strong>. This removes the operator, the operator deployments, and the pods. After this step, the operator stops running and will no longer receive updates.</p><p>All of these steps can be found in <a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html-single/operators/index#olm-deleting-operators-from-a-cluster-using-web-console_olm-deleting-operators-from-a-cluster">this link</a> from Red Hat OpenShift documentation.</p><h2>Wrapping up</h2><p>By installing <strong>ECK 2.16.1</strong> as a bridge, snapshot-restoring into a new cluster, and stepping cleanly through 7.x before landing on <strong>ECK 3.3.1</strong>, you’ve transformed an aging, unsupported logging back end into a modern, secure, first-class Elastic deployment, without surprises or downtime.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Semantic search, now multilingual by default]]></title>
            <description><![CDATA[semantic_text now defaults to jina-embeddings-v5-text on Elastic Inference Service, enabling multilingual semantic search in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/multilingual-semantic-search-jina-embeddings-v5-text</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/multilingual-semantic-search-jina-embeddings-v5-text</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kapil Jadhav, Brendan Jugan, Ranjana Devaji]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/205fc7de02c5d84b99624bffc0488a46df2d3b07-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we’re pleased to announce that <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> now defaults to the <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><code>jina-embeddings-v5-text</code></a> family on <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>, bringing built-in multilingual inference with no additional configuration required.</p><p>EIS provides managed, GPU-accelerated inference tightly integrated with Elasticsearch. With EIS, you don’t need to host, scale, or maintain infrastructure for embedding models.</p><p><a href="https://www.elastic.co/docs/solutions/search/semantic-search">Semantic search</a> retrieves results based on meaning. Text is converted into vector embeddings so queries can match related concepts, even when the exact words differ.</p><p>The <code>semantic_text</code> field type simplifies this entire workflow, with <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">automatic chunking</a>, embedding generation at index time, and seamless querying via the <code>semantic</code> query, without building custom pipelines or managing separate model inference.</p><p>The <code>jina-embeddings-v5-text</code> model family <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">just launched on EIS</a>, giving developers powerful multilingual embeddings accessible as part of the core <code>semantic_text</code> workflow. So now your semantic search works across languages out of the box, and global datasets, such as support articles, product descriptions, user reviews, and multilingual websites, work without extra configuration.</p><p>This default opens up broad, globe-spanning semantic retrieval with no operational overhead.</p><h2>jina-embeddings-v5-text</h2><p>The <strong><code>jina-embeddings-v5-text</code></strong> models represent the latest generation of compact, high-performance multilingual embedding models on EIS.</p><ul><li><strong>State-of-the-art multilingual performance:</strong> Top scores on <a href="https://huggingface.co/spaces/mteb/leaderboard">MMTEB benchmarks</a> across hundreds of languages. <code>jina-embeddings-v5-text-nano</code> leads models under 500M parameters, and <code>jina-embeddings-v5-text-small</code> outperforms significantly larger alternatives.</li><li><strong>Multiple task capabilities:</strong> Spanning across retrieval, semantic matching, clustering, and classification.</li><li><strong>Flexible choices to fit your use case:</strong> Two model sizes (<code>small</code>, <code>nano</code>) let you balance speed, cost, and quality.</li><li><strong>Long-context support:</strong> Embed long texts efficiently, ideal for document collections with extended context.</li></ul><h2>Get started</h2><h3>1. Create index</h3><p>Define a <code>semantic_text</code> field with no additional configuration. Embeddings will be generated automatically at index time using the default model. For production workloads, explicitly specify the model to ensure consistent behavior and results.</p><pre><code>PUT /multilingual-reviews
{
  "mappings": {
    "properties": {
      "product": { "type": "keyword" },
      "review": { "type": "semantic_text" },
      "language": { "type": "keyword" }
    }
  }
}</code></pre><h3>2. Index multilingual documents</h3><p>Add product reviews in six different languages. Each document’s review field is automatically embedded at ingest time, with no separate pipeline or preprocessing needed.</p><pre><code>POST /multilingual-reviews/_bulk?refresh=wait_for
{ "index": { "_id": "1" } }
{ "product": "wireless-headphones", "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.", "language": "en" }
{ "index": { "_id": "2" } }
{ "product": "wireless-headphones", "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.", "language": "es" }
{ "index": { "_id": "3" } }
{ "product": "wireless-headphones", "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。", "language": "ja" }
{ "index": { "_id": "4" } }
{ "product": "wireless-headphones", "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.", "language": "fr" }
{ "index": { "_id": "5" } }
{ "product": "wireless-headphones", "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.", "language": "de" }
{ "index": { "_id": "6" } }
{ "product": "wireless-headphones", "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.", "language": "pt" }</code></pre><h3>3. Search across languages with a query in English</h3><pre><code>GET /multilingual-reviews/_search
{
  "query": {
    "match": {
      "review": "comfortable for long flights"
    }
  }
}</code></pre><p>The results show all six reviews ranked by semantic relevance to the English query:</p><pre><code>{
  "took": 83,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.8275735,
    "hits": [
      {
        "_index": "multilingual-reviews",
        "_id": "4",
        "_score": 0.8275735,
        "_source": {
          "product": "wireless-headphones",
          "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.",
          "language": "fr"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "1",
        "_score": 0.7616198,
        "_source": {
          "product": "wireless-headphones",
          "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.",
          "language": "en"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "5",
        "_score": 0.72122526,
        "_source": {
          "product": "wireless-headphones",
          "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.",
          "language": "de"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "2",
        "_score": 0.6867013,
        "_source": {
          "product": "wireless-headphones",
          "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.",
          "language": "es"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "3",
        "_score": 0.66513836,
        "_source": {
          "product": "wireless-headphones",
          "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。",
          "language": "ja"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "6",
        "_score": 0.61658823,
        "_source": {
          "product": "wireless-headphones",
          "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.",
          "language": "pt"
        }
      }
    ]
  }
}</code></pre><p>Notice that the French review ranks first, even above the English one. That&#x27;s because &quot;<em>très confortable pour les longs trajets en avion</em>&quot; (&quot;very comfortable for long trips by plane&quot;) is a closer semantic match to the query than the English review, which splits its focus across noise cancellation, battery life, and flights. This demonstrates the <code>jina-embeddings-v5-text-small</code> model&#x27;s ability to rank by meaning, not language.</p><h3>4. Search across languages with a Japanese query</h3><pre><code>GET /multilingual-reviews/_search
{
  "query": {
    "match": {
      "review": "長時間のフライトに最適"
    }
  }
}</code></pre><p>The results show all six reviews ranked by semantic relevance to the Japanese query (“Ideal for long-haul flights”):</p><pre><code>{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.7556782,
    "hits": [
      {
        "_index": "multilingual-reviews",
        "_id": "4",
        "_score": 0.7556782,
        "_source": {
          "product": "wireless-headphones",
          "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.",
          "language": "fr"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "1",
        "_score": 0.7395687,
        "_source": {
          "product": "wireless-headphones",
          "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.",
          "language": "en"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "5",
        "_score": 0.68835545,
        "_source": {
          "product": "wireless-headphones",
          "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.",
          "language": "de"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "3",
        "_score": 0.6487931,
        "_source": {
          "product": "wireless-headphones",
          "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。",
          "language": "ja"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "6",
        "_score": 0.6241487,
        "_source": {
          "product": "wireless-headphones",
          "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.",
          "language": "pt"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "2",
        "_score": 0.6183049,
        "_source": {
          "product": "wireless-headphones",
          "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.",
          "language": "es"
        }
      }
    ]
  }
}</code></pre><p>The ranking is nearly identical to the English query: French and English still lead because they&#x27;re the most semantically relevant to &quot;perfect for long flights,&quot; regardless of query language. The Japanese review didn&#x27;t get artificially boosted just because the query was in Japanese. It ranks fourth because it discusses wearing comfort, not flights. Semantic relevance takes priority over language matching.</p><h2>Note: For English-only use cases</h2><p>If you prefer a sparse representation or would like to continue to use Elastic Learned Sparse EncodeR (ELSER) for English workloads, ELSER remains available and fully supported as an option for <code>semantic_text</code>.</p><p>You can explicitly choose ELSER by specifying <code>inference_id: &quot;.elser-2-elastic&quot;</code> in your mappings when creating an index.</p><h2>Conclusion: Semantic search without borders</h2><p>With <code>semantic_text</code> now defaulting to the <code>jina-embeddings-v5-text</code> family on Elastic Inference Service, multilingual semantic search becomes the standard developer experience in Elasticsearch. This means developers can build search, retrieval augmented generation (RAG), and AI applications that work across global datasets without stitching pipelines together.</p><p>Create a <code>semantic_text</code> field, index your data, and start searching. All Elastic Cloud trials have access to Elastic Inference Service. Try it now on <a href="https://cloud.elastic.co/serverless-registration">Elastic Cloud Serverless</a> or Elastic Cloud Hosted, or use <a href="https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service">EIS via Cloud Connect</a> with your self-managed cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to compare two Elasticsearch indices and find missing documents]]></title>
            <description><![CDATA[Exploring approaches for comparing two Elasticsearch indices and finding missing documents.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-index-comparison</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-index-comparison</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[David Pilato]]></dc:creator>
            <pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/099abf465250360ab741a5aa13931fa8884ded34-1376x768.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When managing Elasticsearch indices, you may need to verify that all documents present in one index also exist in another, such as after a reindex operation, a migration, or a data pipeline. Elasticsearch doesn&#x27;t provide a built-in &quot;diff&quot; command for this, but the right approach depends on one key question: <strong>Are your document IDs stable between the two indices?</strong></p><h2>The problem</h2><p>Imagine you have two indices, <code>index-a</code> (source) and <code>index-b</code> (target), and you want to find all documents that exist in <code>index-a</code> but are missing from <code>index-b</code>.</p><p>A naive approach, querying both indices and comparing results in memory, won&#x27;t scale. Elasticsearch is designed to handle millions of documents, and loading them all at once isn’t practical.</p><p>There are two scenarios:</p><ol><li><strong>IDs are stable</strong>: Both indices use the same <code>_id</code> for the same document (for example, <code>emp_no</code> as the document ID). This is the easy case.</li><li><strong>IDs are generated</strong>: Documents were ingested through different pipelines that assigned random or sequential IDs. You can&#x27;t compare by <code>_id</code>; you need to match on content.</li></ol><p>Let&#x27;s walk through both.</p><h2>Step 0 — A lighter CLI for Elasticsearch</h2><p>All the examples in this post use <a href="https://github.com/Anaethelion/escli-rs">escli</a>, a small Rust command line interface (CLI) that wraps the Elasticsearch REST API. It reads your cluster URL and credentials from environment variables, so you don’t have to repeat authentication headers on every command.</p><p>To see why that matters, here&#x27;s a typical <code>_search</code> call with raw <code>curl</code>:</p><pre><code>curl -X GET \
  -H "Authorization: ApiKey $ELASTIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query":{"term":{"user.id":"kimchy"}}}' \
  "$ELASTICSEARCH_URL/my-index-000001/_search"</code></pre><p>With <code>escli</code>, the same request becomes:</p><pre><code>./escli search --index my-index-000001 <<< '{"query":{"term":{"user.id":"kimchy"}}}'</code></pre><p>The credentials live in a <code>.env</code> file that escli sources automatically — no <code>-H &quot;Authorization: ...&quot;</code> on every call, no risk of leaking secrets in shell history. The request body is passed via stdin (<code>&lt;&lt;&lt;</code>), which makes it easy to pipe in multi-line JSON built dynamically with <code>jq</code>.</p><h2>Step 1 — Count documents in both indices</h2><p>Before doing a full scan, get a quick count of each index. If the counts match, the indices are likely in sync, and there’s no need to scan at all.</p><pre><code>./escli count --index index-a
./escli count --index index-b</code></pre><p>The <code>_count</code> API returns:</p><pre><code>{ "count": 1000000 }</code></pre><p>If the counts differ, proceed to the full comparison.</p><h2>Step 2 — When IDs mean something: Use op_type=create</h2><p>If both indices use the same <code>_id</code> for the same document, for example, because you indexed documents using a functional business key like <code>emp_no</code> rather than a generated UUID, you can find and fix missing documents in a single <code>_reindex</code> call.</p><h3>Why functional IDs matter</h3><p>Using a meaningful field as <code>_id</code> (instead of a random UUID) is a best practice when the data has a natural key. It means:</p><ul><li>The same document always gets the same <code>_id</code>, regardless of which pipeline ingested it.</li><li>You can easily update or delete documents by ID.</li><li>You can use <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-index#operation-index-op_type"><code>op_type=create</code></a> to skip documents that already exist in the target.</li><li>No client-side scanning or comparison is needed.</li></ul><h3>The op_type=create trick</h3><p><code>_reindex</code> with <code>op_type=create</code> tries to create each document from the source in the target. If a document with the same <code>_id</code> already exists, Elasticsearch reports it as a <code>version_conflict</code> and moves on. It <strong>doesn’t</strong> overwrite the existing document. Setting <code>conflicts=proceed</code> tells the API to continue instead of aborting on the first conflict.</p><pre><code>./escli reindex <<< '{
  "source": { "index": "index-a" },
  "dest":   { "index": "index-b", "op_type": "create" },
  "conflicts": "proceed"
}'</code></pre><p>The response tells you exactly what happened:</p><pre><code>{
  "total": 1000000,
  "created": 49594,
  "version_conflicts": 950406,
  "failures": []
}</code></pre><ul><li><code>created</code>: Documents that were missing from <code>index-b</code> and have now been added.</li><li><code>version_conflicts</code>: Documents that already existed in <code>index-b</code> and were left untouched.</li></ul><p><strong>No scanning, no client-side comparison, no intermediate file.</strong> Everything happens server-side in about six seconds on a 1M-document dataset.</p><h2>Step 3 — When IDs are not stable: Business-key comparison</h2><p>Sometimes you can&#x27;t rely on <code>_id</code>. A document pipeline that generates IDs at ingestion time will assign a different <code>_id</code> each time the same record is processed. If <code>index-a</code> and <code>index-b</code> were populated by two such pipelines, the same employee record might have <code>_id: &quot;abc123&quot;</code> in one index and <code>_id: &quot;xyz789&quot;</code> in the other, even though the underlying data is identical.</p><p>In this case, you need to match documents by content rather than by ID. The key is to identify a set of fields that together form a unique business key.</p><p>For an employee dataset, a reasonable business key is <code>(first_name, last_name, birth_date)</code>. A document in <code>index-a</code> is &quot;missing&quot; from <code>index-b</code> if no document in <code>index-b</code> has the same combination of those three fields.</p><h3>3a — Scan the source with PIT + search_after</h3><p>Open a <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-open-point-in-time">point in time (PIT)</a> on the source index to get a consistent snapshot, and then paginate through it, fetching only the business-key fields:</p><pre><code>./escli open_point_in_time index-a 5m
# → { "id": "46ToAwMDaWR..." }</code></pre><pre><code>./escli search <<< '{
  "size": 10000,
  "_source": ["first_name", "last_name", "birth_date"],
  "pit": { "id": "46ToAwMDaWR...", "keep_alive": "5m" },
  "sort": [{ "_shard_doc": "asc" }]
}'</code></pre><p>The sort key <code>_shard_doc</code> is the most efficient sort for full-index pagination: it uses the internal Lucene document order with no overhead. Repeat with <code>search_after</code> until the response contains zero hits. Always close the PIT when done:</p><pre><code>./escli close_point_in_time <<< '{"id": "46ToAwMDaWR..."}'</code></pre><h3>3b — Check each page against the target via _msearch</h3><p>For each page of source documents, build one <code>_msearch</code> request with one subquery per document. Each subquery uses a <code>bool/must</code> on the three business-key fields and requests <code>size: 0</code>; we only need to know whether a match exists, not to retrieve the document itself.</p><pre><code>./escli msearch << 'EOF'
{"index": "index-b"}
{"size":0,"query":{"bool":{"must":[{"term":{"first_name.keyword":"Alice1"}},{"term":{"last_name.keyword":"Smith"}},{"term":{"birth_date":"1985-03-12"}}]}}}
{"index": "index-b"}
{"size":0,"query":{"bool":{"must":[{"term":{"first_name.keyword":"Bob2"}},{"term":{"last_name.keyword":"Jones"}},{"term":{"birth_date":"1990-07-24"}}]}}}
EOF</code></pre><p>The response contains one entry per subquery, in the same order:</p><pre><code>{
  "responses": [
    { "hits": { "total": { "value": 1 } } },
    { "hits": { "total": { "value": 0 } } }
  ]
}</code></pre><p><code>total.value == 0</code> means no document in <code>index-b</code> matches that business key; the document is missing. Collect the corresponding <code>_id</code> from the source page.</p><blockquote><strong>Note on</strong> <strong><code>.keyword</code></strong> <strong>subfields</strong>: <code>term</code> queries require exact (keyword) matching. The <code>first_name</code> and <code>last_name</code> fields must have a <code>.keyword</code> subfield in the index mapping. The demo&#x27;s <code>mapping.json</code> includes this.</blockquote><h3>3c — Speed it up with split-by-date</h3><p>If the business key includes a date field, you can partition the source into date slices and run each slice as an independent job. Each slice opens its own PIT with a <code>range</code> filter on <code>birth_date</code>, runs its own msearch loop, and writes its results to a separate file. The parent script launches all slices in parallel and aggregates the results when they’re all done.</p><p>But depending on your use case, you might want to partition by a different field; for example, if you have a <code>team</code> field, you could run one slice per team. The key is to find a field that allows you to split the data into reasonably even chunks that can be processed in parallel.</p><pre><code>[compare] Launching 5 slices in parallel...

  → Slice 1: 1960-01-01 → 1969-12-31 ✅ — 244408 checked, 12207 missing
  → Slice 2: 1970-01-01 → 1979-12-31 ✅ — 243624 checked, 12212 missing
  → Slice 3: 1980-01-01 → 1989-12-31 ✅ — 243551 checked, 11921 missing
  → Slice 4: 1990-01-01 → 1999-12-31 ✅ — 243895 checked, 11991 missing
  → Slice 5: 2000-01-01 → 2009-12-31 ✅ — 24522 checked, 1263 missing</code></pre><h2>Performance on a 1M dataset</h2><p>To validate the approaches, the demo generates 1,000,000 documents in <code>index-a</code> and deliberately skips ~5% in <code>index-b</code> (49,594 missing documents), and then runs the full compare → reindex cycle.</p><p>Results on a MacBook M3 Pro:</p><p><strong>Comparison</strong> (<code>compare-indices.sh</code>):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Strategy</td><td style="border: 1px solid #ccc; padding: 4px;">Compare</td><td style="border: 1px solid #ccc; padding: 4px;">Reindex</td><td style="border: 1px solid #ccc; padding: 4px;">Total</td><td style="border: 1px solid #ccc; padding: 4px;">How it works</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">op_type</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">6s</td><td style="border: 1px solid #ccc; padding: 4px;">6s</td><td style="border: 1px solid #ccc; padding: 4px;">Full _reindex server-side, skips existing</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">business-key</td><td style="border: 1px solid #ccc; padding: 4px;">1m 38s</td><td style="border: 1px solid #ccc; padding: 4px;">4s</td><td style="border: 1px solid #ccc; padding: 4px;">1m 42s</td><td style="border: 1px solid #ccc; padding: 4px;">PIT scan + _msearch by business key</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">split-by-date</td><td style="border: 1px solid #ccc; padding: 4px;">32s</td><td style="border: 1px solid #ccc; padding: 4px;">4s</td><td style="border: 1px solid #ccc; padding: 4px;">36s</td><td style="border: 1px solid #ccc; padding: 4px;">Same as business-key, 5 slices in parallel</td>
                    </tr>
              </tbody>
            </table>
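          <p>For reference, the <code>op_type</code> row in the table boils down to one server-side call. Here is a minimal sketch with the Python client (the demo itself uses shell scripts; the index names come from the demo, while the client setup and local cluster URL are assumptions):</p><pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Copy index-a into index-b. With op_type "create", every document whose _id
# already exists in index-b becomes a version conflict instead of an overwrite.
resp = es.reindex(
    source={"index": "index-a"},
    dest={"index": "index-b", "op_type": "create"},
    conflicts="proceed",  # skip conflicts so only the missing documents are created
    wait_for_completion=True,
)
print(f"created={resp['created']}, conflicts={resp['version_conflicts']}")</code></pre>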
          <p>The <code>op_type=create</code> approach is fastest because everything is server-side and requires no client-side scanning. The <code>split-by-date</code> strategy cuts the total <code>business-key</code> time from 1m 42s down to 36s through parallelism: not bad for a comparison across two 1M-document indices.</p><h2>Decision tree</h2><pre><code>Are _id values stable between both indices?
├── Yes → _reindex with op_type=create          (6s, server-side)
└── No  → Do you have a reliable business key?
          ├── Yes, simple scan is fast enough → business-key   (1m 42s)
          └── Yes, and you need more speed    → split-by-date  (36s, parallel)</code></pre><h2>Conclusion</h2><p>Elasticsearch doesn&#x27;t offer a native index diff command, but the right strategy depends on your data model:</p><ul><li><strong>Use functional</strong> <strong><code>_id</code>s</strong> (a natural business key like <code>emp_no</code>) whenever possible. It unlocks the simplest and fastest approach: <code>_reindex</code> with <code>op_type=create</code> finds and fills gaps in one server-side call.</li><li><strong>When IDs are unstable</strong>, match by business key using PIT + <code>_msearch</code>. Partition by a field and run slices in parallel to recover most of the performance. If you find yourself doing this regularly, consider computing a hash of your business key fields and using it as <code>_id</code> at ingestion time. You get the best of both worlds: stable IDs and efficient lookups.</li></ul><p>The complete demo, including dataset generation, comparison scripts, and reindex scripts, is available at <a href="https://github.com/dadoonet/blog-compare-indices/">https://github.com/dadoonet/blog-compare-indices/</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring Kibana dashboard views with Elastic Workflows]]></title>
            <description><![CDATA[Learn how to use Elastic Workflows to collect Kibana dashboard view metrics every 30 minutes and index them into Elasticsearch, so you can build custom analytics and visualizations on top of your own data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/monitor-kibana-dashboard-views-elastic-workflows</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/monitor-kibana-dashboard-views-elastic-workflows</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0ed94ce00962287b5507f45c92ecb60fdcbf2718-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/kibana">Kibana</a> tracks how many times each dashboard is viewed, but that data isn’t exposed natively in any built-in dashboard. In this article, we’ll use <strong>Elastic Workflows</strong> to automatically collect that data every 30 minutes and index it into Elasticsearch, so we can build our own analytics on top of it.</p><p><a href="https://www.elastic.co/docs/explore-analyze/workflows">Elastic Workflows</a> is a built-in automation engine inside Kibana that lets you define multistep processes using a simple YAML configuration. Each workflow can be triggered on a schedule or event or as a tool in <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic Agent Builder</a>, and each step can call Kibana APIs, query Elasticsearch, or transform data.</p><p>We’ll use dashboard view counts as a concrete example, but the same pattern applies to any metric exposed through the Kibana saved objects API.</p><h2>Prerequisites</h2><ul><li><a href="https://www.elastic.co/cloud">Elastic Cloud</a> or <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed">self-managed </a>cluster running 9.3</li><li><a href="https://www.elastic.co/docs/explore-analyze/workflows/get-started#workflows-prerequisites">Workflows enabled</a> (Advanced settings)</li></ul><h2>Step 1: Explore the raw data in <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/console">Dev Tools</a></h2><p>Before building anything, let&#x27;s understand what data we have. Kibana stores most of its configuration and metadata as <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-saved-objects">saved objects</a> in a dedicated internal index. One of the things Kibana tracks this way is dashboard view counts, using a special saved object type called usage counters. You can query them directly from Dev Tools:</p><pre><code>GET kbn:/api/saved_objects/_find?type=usage-counter&filter=usage-counter.attributes.domainId:"dashboard"%20and%20usage-counter.attributes.counterType:"viewed"&per_page=10000</code></pre><p>The response looks like this:</p><pre><code>{
  "page": 1,
  "per_page": 10000,
  "total": 1,
  "saved_objects": [
    {
      "type": "usage-counter",
      "id": "dashboard:346f3c64-ebca-484d-9d57-ec600067d596:viewed:server:20260310",
      "attributes": {
        "domainId": "dashboard",
        "counterName": "346f3c64-ebca-484d-9d57-ec600067d596",
        "counterType": "viewed",
        "source": "server",
        "count": 1
      },
      ...
    }
  ]</code></pre><p>The <code>counterName</code> field is the dashboard ID, and <code>count</code> is the cumulative view count for that dashboard on that specific day. Kibana creates one counter object per dashboard per day; you can see the date suffix in the object ID (...viewed:server:20260310). The count grows throughout the day as users open the dashboard.</p><p>Rather than replicating this daily-document model in our index, we’ll create one document per workflow execution. Each document records how many views that dashboard had accumulated for the day at the moment of capture.</p><h2>Step 2: Create the destination index</h2><p>We need an index to store our dashboard view snapshots. The following command creates it with explicit mappings so we can aggregate and visualize later. Run this in Dev Tools:</p><pre><code>PUT dashboard-views
{
  "mappings": {
    "properties": {
      "captured_at": {
        "type": "date"
      },
      "dashboard_id": {
        "type": "keyword"
      },
      "dashboard_name": {
        "type": "keyword"
      },
      "view_count": {
        "type": "integer"
      }
    }
  }
}</code></pre><p>Using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/keyword"><code>keyword</code></a> mappings for IDs and names allows <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregations</a>. Using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/number"><code>integer</code></a> for <code>view_count</code> is a safe default: since Kibana resets the counter daily, reaching the 32-bit limit (more than 2 billion views in a single day) isn’t a realistic concern. The type still supports numeric operations such as <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-max-aggregation"><code>max</code></a>, <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-avg-aggregation"><code>avg</code></a>, and <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-min-aggregation"><code>min</code></a>.</p><h2>Step 3: Create the workflow</h2><p>Go to <strong>Stack Management &gt; Workflows &gt; New Workflow</strong>, and paste the following workflow YAML configuration:</p><pre><code>name: dashboard-views-ingestion
triggers:
  - type: scheduled
    with:
      every: 30m

steps:
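  # Step 1: pull every dashboard "viewed" usage counter from the Kibana saved objects API.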
  - name: fetch_dashboard_views
    type: kibana.request
    with:
      method: GET
      path: >-
        /api/saved_objects/_find?type=usage-counter&per_page=10000&filter=usage-counter.attributes.domainId:"dashboard"%20and%20usage-counter.attributes.counterType:"viewed"

  - name: index_each_dashboard
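    # Step 2: for each counter, resolve the dashboard title, then index one snapshot document.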
    type: foreach
    foreach: "{{ steps.fetch_dashboard_views.output.saved_objects }}"
    steps:
      - name: fetch_dashboard_name
        type: kibana.request
        with:
          method: GET
          path: /api/saved_objects/dashboard/{{ foreach.item.attributes.counterName }}
        on-failure:
          continue: true

      - name: index_doc
        type: elasticsearch.request
        with:
          method: POST
          path: /dashboard-views/_doc
          body:
            dashboard_id: "{{ foreach.item.attributes.counterName }}"
            dashboard_name: "{{ steps.fetch_dashboard_name.output.attributes.title }}"
            view_count: "${{ foreach.item.attributes.count | plus: 0 }}"
            captured_at: "{{ execution.startedAt | date: '%Y-%m-%dT%H:%M:%SZ' }}"</code></pre><p>In the next section, let&#x27;s break down the workflow step by step.</p><h3>How the workflow works</h3><h4>Triggers</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5670991d65c64ee833924225c2d375a1be868b13-325x162.png" style="width: 100%;" /><p>The workflow runs on a scheduled trigger every 30 minutes. This gives us time-series data without hammering the API.</p><h4>fetch_dashboard_views</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66eaec147c3d01c524c67cf1c7f663ac56a3259d-812x215.png" style="width: 100%;" /><p>Uses <code>kibana.request</code> to call the Kibana saved objects API. No authentication setup is needed: The workflow engine automatically attaches the correct headers based on the execution context.</p><h4>index_each_dashboard (foreach)</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aad45e8aed8dc81ded6260cd6199ff78dcffe3b4-1892x290.png" style="width: 100%;" /><p>Iterates over the <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-saved-objects"><code>saved_objects</code></a> array returned by the previous step. The current item in each iteration is available as <code>foreach.item</code>. Inside the loop, we run two nested steps for each dashboard.</p><p><strong>1. <code>fetch_dashboard_name</code>:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db21ec5094b743018b9cd66c5052681f14c7d7e3-1999x431.png" style="width: 100%;" /><p>Resolves the human-readable dashboard title by calling <code>GET /api/saved_objects/dashboard/{id}</code>. We add <code>on-failure: continue: true</code> so that if a dashboard was deleted but still has view counters, the loop continues instead of failing the whole execution.</p><p><strong>2. <code>index_doc</code>:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f49dd0c9f0817bb1e1e5d9f4a2b05d13ef331054-1999x626.png" style="width: 100%;" /><p>Indexes each document using <code>POST /dashboard-views/_doc</code> (without an explicit ID), which lets Elasticsearch auto-generate IDs. This creates a new document on every run, building a history of view counts over time rather than overwriting the previous snapshot.</p><p>Two things worth noting:</p><ul><li>The <code>captured_at</code> field uses the date filter to format the timestamp as <a href="https://www.iso.org/iso-8601-date-and-time-format.html">ISO 8601</a>. Without it, the value comes out as a JavaScript date string, like <code>Tue Mar 10 2026 05:03:47 GMT+0000</code>, which Elasticsearch won&#x27;t map as a date.</li><li>The <code>view_count</code> uses <code>${{ }}</code> syntax with <code>| plus: 0</code> to preserve the numeric type. 
Using <code>{{ }}</code> would render it as a string, which would prevent math operations in the dashboard.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6d48c8784d5df6192e8b5175e69dbab5098194bc-919x774.png" style="width: 100%;" /><p><em>The UI allows you to nicely debug each of the workflow steps.</em></p><h2>Step 4: Build the stats dashboard</h2><p>Once the workflow has run a few times and data is collected, create a new dashboard in Kibana using the dashboard-views data view.</p><p>Some panels to start with:</p><ul><li><strong>Top dashboards by views:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/bar-charts"><strong>Bar chart</strong></a> with <code>dashboard_name</code> on the X axis and <code>last_value(view_count)</code> on the Y axis. This shows the current daily view count per dashboard.</li><li><strong>Views over time:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/line-charts"><strong>Line chart</strong></a> with <code>captured_at</code> on the X axis and <code>last_value(view_count)</code> on the Y axis, broken down by <code>dashboard_name</code>. Since each run appends a new document, use last value to get the peak count per time bucket rather than summing duplicates.</li><li><strong>Current snapshot:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/tables"><strong>Data table</strong></a> with the latest <code>captured_at</code> to show the most recent view counts across all dashboards.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e245b95f67daf76a2aaf4cb9df2c75ef4cfef582-1462x747.png" style="width: 100%;" /><p>Since each workflow creates a new document, you can filter by time range to analyze activity in specific periods, compare week over week, or build alerts when a dashboard drops below a view threshold.</p><h2><strong>Conclusion</strong></h2><p>Elastic Workflows is a good fit for this kind of periodic data collection because both the source (Kibana API) and the destination (Elasticsearch) are native, which means zero credential management. The workflow engine handles authentication automatically for <code>kibana.request</code> and <code>elasticsearch.request</code> steps, so the only thing you write is the logic.</p><h2><strong>Resources</strong></h2><ul><li><a href="https://www.elastic.co/docs/explore-analyze/workflows">Elastic Workflows</a></li><li><a href="https://www.elastic.co/docs/api/doc/kibana/">Kibana API</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When TSDS meets ILM: Designing time series data streams that don't reject late data]]></title>
            <description><![CDATA[How TSDS time bounds interact with ILM phases; and how to design policies that tolerate late-arriving metrics.]]></description>
            <link>https://www.elastic.co/search-labs/blog/tsds-ilm-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/tsds-ilm-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Bret Wortman]]></dc:creator>
            <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4c46e4a6f746d9c845857e80de036f5d51cd4e7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Recently, I migrated a customer&#x27;s metrics cluster from &quot;everything in the hot tier&quot; to a hot/cold/frozen architecture. It was a change I’d performed dozens of times before. Within minutes, Logstash stopped advancing data entirely.</p><p>Elasticsearch was rejecting late-arriving metrics. Those rejections caused the pipeline to fall behind, resulting in more late data, which triggered even more rejections. Eventually, the pipeline stalled completely.</p><p>We had to restore from snapshot, reindex the data, and redesign the ingestion pipeline to recover.</p><p>The root cause wasn&#x27;t index lifecycle management (ILM) itself. It was time series data streams (TSDS) and how they enforce time‑bound backing indices.</p><p>TSDS can reduce storage requirements for metrics by 40–70%, but the architectural changes that make TSDS efficient also alter how indices behave over time. Those changes matter when designing ILM policies or when your ingestion pipelines may produce late‑arriving data.</p><h2>TL;DR</h2><p>When using TSDS:</p><ul><li>Backing indices only accept documents within a specific time window.</li><li>If late data arrives after an index moves to cold or frozen, Elasticsearch rejects those documents or routes them to the failure store, if configured.</li></ul><p>Design rule:</p><pre><code>warm_min_age > rollover_max_age + maximum_expected_lateness</code></pre><h2>What is a time series data stream?</h2><p>A<em> time series data stream</em> (TSDS) is a specialized data stream optimized for metrics data. Data is routed so that related documents are located within the same shards, optimizing them for query and retrieval. Here’s how Elasticsearch does it:</p><p>Each document contains:</p><ul><li>A timestamp.</li><li>Dimension fields identifying the time series.</li><li>Metric fields representing measured values.</li></ul><p>Examples include:</p><ul><li>CPU usage per host.</li><li>Request latency per service.</li><li>Temperature readings per sensor.</li></ul><p><em>Dimensions </em>identify what we want to measure, while <em>metrics </em>represent values that change over time.</p><h3>Dimensions</h3><p>Dimensions describe the measured entity.</p><p>Examples:</p><pre><code>host.name
service.name
container.id</code></pre><p>We define them in mappings with:</p><pre><code>time_series_dimension: true</code></pre><h3>Metrics</h3><p>Metrics represent numeric values and are defined using:</p><pre><code>time_series_metric</code></pre><p>Common metric types:</p><ul><li>Gauge: Values that rise and fall.</li><li>Counter: Values that increase until reset.</li></ul><p>Elastic Agent primarily collects metrics and logs data, so even if you haven’t enabled any TSDS indices by hand, you may still have them in your cluster.</p><h3>The _tsid field</h3><p>Elasticsearch internally generates a <code>_tsid</code> value from dimension fields. This allows documents with identical dimensions to be routed to the same shard, improving:</p><ul><li>Compression.</li><li>Query locality.</li><li>Aggregation performance.</li></ul><h2>The key difference: Time‑bound backing indices</h2><p>Traditional data streams always write to the most recent backing index, called the <em>write index</em>, but TSDS behaves differently.</p><p>Each TSDS backing index has a defined time window and only accepts documents with <code>@timestamp</code> values that fall in that window:</p><pre><code>GET _data_stream/my-metrics-data-stream
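# Abridged response: backing indices only accept @timestamp values inside these bounds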
{
  "index_mode": "time_series",
  "time_series": {
    "temporal_ranges": [
      {
        "start": "2026-01-15T14:35:50.000Z",
        "end": "2026-03-16T11:34:40.000Z"
      }
    ]
  }
}</code></pre><p>When a document is indexed, Elasticsearch routes it to the backing index responsible for that timestamp, meaning that, unlike traditional indices, a TSDS may write to multiple backing indices simultaneously.</p><p>For example:</p><ul><li>Real‑time data → newest index.</li><li>Late data → earlier index covering that time range.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/31af2bb3b3dc24db8342e791e1db77a44659ba7a-1589x502.png" style="width: 100%;" /><h2>Designing for late‑arriving data</h2><p>Real ingestion pipelines rarely deliver metrics perfectly on time. Metrics can be delayed by network outages, backlogs along the way, batch ingestion, and edge devices that drop offline, then reconnect and start catching up.</p><p>Traditional indices quietly absorb those delays. TSDS does not.</p><p>If a document&#x27;s timestamp falls outside the range of writable backing indices, Elasticsearch rejects it, meaning your ILM policy must account for late data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32a103b95b20e31615c214271e27811a7ee315ae-1999x691.png" style="width: 100%;" /><h2>The critical constraint</h2><p>Backing indices must remain writable long enough to accept delayed data.</p><p>In practical terms:</p><pre><code>time_until_readonly > maximum_expected_lateness</code></pre><p>Because ILM measures ages from rollover, the operational rule becomes:</p><pre><code>warm_or_cold_min_age > rollover_max_age + maximum_expected_lateness</code></pre><p>For example, if metrics may arrive up to six hours late, indices must remain writable at least six hours after rollover.</p><p>Failing to account for this constraint was exactly what caused the ingestion failure described earlier. Late-arriving data was directed to an earlier index, which was already in the cold tier and therefore write-blocked.</p><h2>Handling rejected documents</h2><p>When TSDS rejects a document, Elasticsearch returns an error, indicating that the timestamp doesn’t fall within the range of writable indices. How your ingestion pipeline handles that error determines whether you lose data or stall ingestion.</p><p>The primary mechanism for handling rejected documents is the failure store.</p><h3>Failure store (recommended in Elasticsearch 9.1+)</h3><p>Elasticsearch 9.1 introduced the failure store, which automatically captures rejected documents. Instead of returning errors to clients, Elasticsearch writes failed documents to a dedicated failure index inside the data stream.</p><p>You can inspect failures using:</p><pre><code>GET metrics-myapp::failures/_search</code></pre><p>Using the failure store prevents ingestion pipelines from choking on rejection errors while preserving failed data for analysis or <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/reindex-tsds">reindexing</a>.</p><h2>Monitoring for rejection issues</h2><p>Late‑arrival problems usually surface as ingestion anomalies. You may notice:</p><ul><li>Sudden drops in indexing rate.</li><li>Spikes in rejected documents.</li><li>A growing number of failure store entries.</li><li>Mismatches between pipeline input and output counts.</li></ul><p>Alerting on these signals allows operators to detect issues before pipelines stall. 
Workflows, machine learning jobs, and other mechanisms can be used to automate detection and notification.</p><h2>Migration checklist for TSDS + ILM</h2><p>If you&#x27;re migrating a metrics cluster to TSDS, introducing ILM tiering, or upgrading to an Elasticsearch version where metrics are TSDS by default, review these items first.</p><h3><strong>1. Measure ingestion latency</strong></h3><p>Before changing ILM policies, determine:</p><ul><li>Normal ingestion delay.</li><li>Worst-case delay during incidents.</li><li>Delays caused by batch pipelines.</li></ul><p>Your ILM design must accommodate the maximum realistic delay.</p><h3><strong>2. Verify index time windows</strong></h3><p>Inspect your TSDS backing indices:</p><pre><code>GET _data_stream/<your-stream></code></pre><p>Look for:</p><ul><li><code>time_series.start_time</code></li><li><code>time_series.end_time</code></li></ul><p>These bounds determine which indices can accept documents. Understanding these windows can help you determine how late data can be before it’s rejected.</p><h3><strong>3. Size the hot tier for late arrivals</strong></h3><p>Ensure backing indices remain writable long enough for delayed data.</p><p>Operational rule:</p><ul><li><code>warm_min_age &gt; rollover_max_age + maximum_expected_lateness</code></li></ul><p>Remember, indices must remain writable for at least six hours if metrics may arrive six hours late.</p><h3><strong>4. Decide how to handle rejected documents</strong></h3><p>Choose a strategy before enabling TSDS:</p><ul><li>Failure store (recommended in Elasticsearch 9.1+).</li><li>Logstash dead letter queue.</li><li>Fallback index for late arrivals.</li><li>Accepting limited data loss.</li></ul><h3><strong>5. Monitor ingestion health</strong></h3><p>Add alerts for:</p><ul><li>Indexing rate drops.</li><li>Rejected documents.</li><li>Failure store growth.</li><li>Pipeline input/output mismatches.</li></ul><p>Late data issues often appear first as ingestion anomalies.</p><h2>Summary</h2><p>Time series data streams provide major storage and performance improvements for metrics workloads, but they introduce an important architectural change: Backing indices are time‑bound, which affects how ILM behaves.</p><p>When using TSDS:</p><ul><li>Indices must remain writable long enough to accept delayed data.</li><li>Ingestion pipelines should handle rejected documents safely.</li></ul><p>The key rule to remember is:</p><pre><code>warm_min_age > rollover_max_age + maximum_expected_lateness</code></pre><p>If you design ILM policies around that constraint, TSDS works extremely well for metrics workloads.</p><p>Ignore it, though, and your ingestion pipeline may discover those time boundaries the hard way.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LINQ to Elasticsearch ES|QL: Write C#, query Elasticsearch]]></title>
            <description><![CDATA[Exploring the new LINQ to Elasticsearch ES|QL provider in the Elasticsearch .NET client, which allows you to write C# code that’s automatically translated to ES|QL queries.]]></description>
            <link>https://www.elastic.co/search-labs/blog/linq-esql-c-elasticsearch-net-client</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/linq-esql-c-elasticsearch-net-client</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Florian Bernd, Martijn Laarman]]></dc:creator>
            <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e54132e915217063e9ed0ec45059c6cfc38e31dd-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Starting with <strong>v9.3.4</strong> and <strong>v8.19.18</strong>, the Elasticsearch .NET client includes a <a href="https://learn.microsoft.com/en-us/dotnet/csharp/linq/">Language Integrated Query (LINQ) </a>provider that translates C# LINQ expressions into <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">Elasticsearch Query Language (ES|QL)</a> queries at runtime. Instead of writing ES|QL strings by hand, you compose queries using <code>Where</code>, <code>Select</code>, <code>OrderBy</code>, <code>GroupBy</code>, and other standard operators. The provider takes care of translation, parameterization, and result deserialization, including per-row streaming that keeps memory usage constant, regardless of result set size.</p><h2>Your first query</h2><p>Start by defining a plain old CLR object (POCO) that maps to your Elasticsearch index. Property names are resolved to ES|QL column names through standard <code>System.Text.Json</code> attributes, like <code>[JsonPropertyName]</code>, or through a configured <code>JsonNamingPolicy</code>. The same <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/source-serialization">source serialization</a> rules that apply across the rest of the client apply here as well.</p><pre><code>using System.Text.Json.Serialization;

public class Product
{
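    // [JsonPropertyName] overrides the default camelCase naming policy for this property.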
    [JsonPropertyName("product_id")]
    public string Id { get; set; }

    public string Name { get; set; }

    public string Brand { get; set; }

    [JsonPropertyName("price_usd")]
    public double Price { get; set; }

    [JsonPropertyName("in_stock")]
    public bool InStock { get; set; }
}</code></pre><p>With the type in place, a query looks like this:</p><pre><code>var minPrice = 100.0;
var brand = "TechCorp";

await foreach (var product in client.Esql.QueryAsync<Product>(q => q
    .From("products")
    .Where(p => p.InStock && p.Price >= minPrice && p.Brand == brand)
    .OrderByDescending(p => p.Price)
    .Take(10)))
{
    Console.WriteLine($"{product.Name}: ${product.Price}");
}</code></pre><p>The provider translates this into the following ES|QL:</p><pre><code>FROM products
| WHERE (in_stock == true AND price_usd >= ?minPrice AND brand == ?brand)
| SORT price_usd DESC
| LIMIT 10</code></pre><p>A few details to note:</p><ul><li><strong>Property name resolution:</strong> <code>p.Price</code> becomes <code>price_usd</code> because of the <code>[JsonPropertyName]</code> attribute, and <code>p.Brand</code> becomes <code>brand</code> following the default camelCase naming policy.</li><li><strong>Parameter capturing:</strong> The C# variables <code>minPrice</code> and <code>brand</code> are captured as named parameters (<code>?minPrice</code>, <code>?brand</code>). They’re sent separately from the query string in the JSON payload, which prevents injection and enables server-side query plan caching.</li><li><strong>Streaming:</strong> <code>QueryAsync&lt;T&gt;</code> returns <code>IAsyncEnumerable&lt;T&gt;</code>. Rows are materialized one at a time as they arrive from Elasticsearch.</li></ul><p>You can also inspect the generated query and its parameters without executing it:</p><pre><code>var query = client.Esql.CreateQuery<Product>()
    .Where(p => p.InStock && p.Price >= minPrice && p.Brand == brand)
    .OrderByDescending(p => p.Price)
    .Take(10);

Console.WriteLine(query.ToEsqlString());
// FROM products | WHERE (in_stock == true AND price_usd >= 100 AND brand == "TechCorp") | SORT price_usd DESC | LIMIT 10

Console.WriteLine(query.ToEsqlString(inlineParameters: false));
// FROM products | WHERE (in_stock == true AND price_usd >= ?minPrice AND brand == ?brand) | SORT price_usd DESC | LIMIT 10

var parameters = query.GetParameters();
// { "minPrice": 100.0, "brand": "TechCorp" }</code></pre><h2>How does this work? A quick LINQ refresher</h2><p>The mechanism that makes LINQ providers possible is the distinction between <code>IEnumerable&lt;T&gt;</code> and <code>IQueryable&lt;T&gt;</code>.</p><p>When you call <code>.Where(p =&gt; p.Price &gt; 100)</code> on an <code>IEnumerable&lt;T&gt;</code>, the lambda compiles to a <code>Func&lt;Product, bool&gt;</code>, a regular delegate that the runtime executes in-process. This is LINQ-to-Objects.</p><p>When you call the same method on an <code>IQueryable&lt;T&gt;</code>, the C# compiler wraps the lambda in an <code>Expression&lt;Func&lt;Product, bool&gt;&gt;</code> instead. This is a data structure that represents the <em>structure</em> of the code rather than its executable form. The expression tree can be inspected, analyzed, and translated into another language at runtime.</p><pre><code>// IEnumerable: the lambda is a compiled delegate
IEnumerable<Product> local = products.Where(p => p.Price > 100);

// IQueryable: the lambda is an expression tree, a data structure
IQueryable<Product> remote = queryable.Where(p => p.Price > 100);</code></pre><p>The <code>IQueryProvider</code> interface is the extension point. Any provider can implement <code>CreateQuery&lt;T&gt;</code> and <code>Execute&lt;T&gt;</code> to translate these expression trees into a target language. Entity Framework uses this to emit SQL. The LINQ to ES|QL provider uses it to emit ES|QL.</p><p>The expression tree for the query above looks like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f864cd18a390831f8d28503a29b5835efb1842f7-1000x720.png" style="width: 100%;" /><p><em>Expression tree for the example query.</em></p><p>The tree is nested inside out: <code>Take</code> wraps <code>OrderByDescending</code>, which wraps <code>Where</code>, which wraps <code>From</code>, which wraps the root <code>EsqlQueryable&lt;Product&gt;</code> constant. The <code>Where</code> predicate is itself a subtree of <code>BinaryExpression</code> nodes for the <code>&amp;&amp;</code>, <code>&gt;=</code>, and <code>==</code> operators, with <code>MemberExpression</code> leaves for property accesses and closure captures for the <code>minPrice</code> and <code>brand</code> variables. This is the data structure that the provider walks to produce the final ES|QL.</p><h2>Under the hood: The translation pipeline</h2><p>The path from a LINQ expression to query results follows a six-stage pipeline:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2a2c772b63d720f61fc9a28b2f85668fa2db8d38-1999x1036.png" style="width: 100%;" /><p><em>Translation pipeline overview.</em></p><h3>1. Expression tree capture</h3><p>When you chain <code>.Where()</code>, <code>.OrderBy()</code>, <code>.Take()</code> and other operators on an <code>IQueryable&lt;T&gt;</code>, the standard LINQ infrastructure builds an expression tree. <code>EsqlQueryable&lt;T&gt;</code> implements <code>IQueryable&lt;T&gt;</code> and delegates to <code>EsqlQueryProvider</code>.</p><h3>2. Translation</h3><p>When the query is executed (by enumerating, calling <code>ToList()</code>, or using <code>await foreach)</code>, the <code>EsqlExpressionVisitor</code> walks the expression tree inside out. It dispatches each LINQ method call to a specialized visitor:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Visitor</td><td style="border: 1px solid #ccc; padding: 4px;">Translates</td><td style="border: 1px solid #ccc; padding: 4px;">Into</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">WhereClauseVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.Where(predicate)</td><td style="border: 1px solid #ccc; padding: 4px;">WHERE condition</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">SelectProjectionVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.Select(selector)</td><td style="border: 1px solid #ccc; padding: 4px;">EVAL + KEEP + RENAME</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">GroupByVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.GroupBy().Select()</td><td style="border: 1px solid #ccc; padding: 4px;">STATS ... BY</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OrderByVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.OrderBy() / .ThenBy()</td><td style="border: 1px solid #ccc; padding: 4px;">SORT field [ASC\|DESC]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">EsqlFunctionTranslator</td><td style="border: 1px solid #ccc; padding: 4px;">EsqlFunctions.*, Math.*, string methods</td><td style="border: 1px solid #ccc; padding: 4px;">80+ ES|QL functions</td>
                    </tr>
              </tbody>
            </table>
          <p>During translation, C# variables referenced in expressions are captured as named parameters.</p><h3>3. Query model</h3><p>The visitors don’t produce strings directly. Instead, they produce <code>QueryCommand</code> objects, an immutable intermediate representation: for the example query, a <code>FromCommand</code>, a <code>WhereCommand</code>, a <code>SortCommand</code>, and a <code>LimitCommand</code>, each representing one ES|QL processing command. These are collected into an <code>EsqlQuery</code> model.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2adc349b6cf655b96b7b3e826a134e8a17fe42fd-1999x1036.png" style="width: 100%;" /><p><em>Query model and command pattern.</em></p><p>This intermediate model is decoupled from both the expression tree and the output format. It can be inspected, intercepted (via <code>IEsqlQueryInterceptor</code>), or modified before formatting.</p><h3>4. Formatting</h3><p><code>EsqlFormatter</code> visits each <code>QueryCommand</code> in order and produces the final ES|QL string. Each command becomes one line, separated by the pipe (|) operator that ES|QL uses to chain processing commands. Identifiers containing special characters are automatically escaped with backticks.</p><h3>5. Execution</h3><p>The formatted ES|QL string and captured parameters are sent to Elasticsearch’s <code>/_query</code> endpoint as a JSON payload. The <code>IEsqlQueryExecutor</code> interface abstracts the transport layer, which is where the layered package architecture comes into play.</p><h3>6. Materialization</h3><p><code>EsqlResponseReader</code> streams the JSON response without buffering the entire result set into memory. A <code>ColumnLayout</code> tree, precomputed once per query, maps flat ES|QL column names (like <code>address.street</code>, <code>address.city</code>) to nested POCO properties. Each row is assembled into a <code>T</code> instance and yielded one at a time via <code>IEnumerable&lt;T&gt;</code> or <code>IAsyncEnumerable&lt;T&gt;</code>.</p><h2>The layered architecture</h2><p>The LINQ to ES|QL functionality is split across three packages:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41b8aae860ecdc2480edcb1c1d4cc9b03cfb78c9-1999x1036.png" style="width: 100%;" /><p><em>Package architecture.</em><br/><a href="https://www.nuget.org/packages/Elastic.Esql"><strong><code>Elastic.Esql</code></strong></a> is the pure translation engine. It has zero HTTP dependencies and contains the expression visitors, query model, formatter, and response reader. You can use it standalone to build and inspect ES|QL queries without an Elasticsearch connection, which is useful for testing, query logging, or building your own execution layer.</p><pre><code>// Translation-only: no Elasticsearch connection needed
var provider = new EsqlQueryProvider();
var query = new EsqlQueryable<Product>(provider)
    .From("products")
    .Where(p => p.InStock)
    .OrderByDescending(p => p.Price);

Console.WriteLine(query.ToEsqlString());
// FROM products | WHERE in_stock == true | SORT price_usd DESC</code></pre><p><a href="https://www.nuget.org/packages/Elastic.Clients.Esql"><strong><code>Elastic.Clients.Esql</code></strong></a> is a lightweight stand-alone ES|QL client. It adds HTTP execution on top of <code>Elastic.Esql</code> via <code>Elastic.Transport</code>. If your application only needs ES|QL and none of the other Elasticsearch APIs, this is the minimal dependency option.</p><p><a href="https://www.nuget.org/packages/Elastic.Clients.Elasticsearch"><strong><code>Elastic.Clients.Elasticsearch</code></strong></a> is the full Elasticsearch .NET client. It also builds on <code>Elastic.Esql</code> and exposes the LINQ provider through the <code>client.Esql</code> namespace. This is the recommended entry point for most applications.</p><p>Both execution-layer packages provide their own implementation of <code>IEsqlQueryExecutor</code>, the strategy interface that bridges translation and transport.</p><p>All three packages are compatible with Native AOT when used with a source-generated <code>JsonSerializerContext</code>. For the full client, see the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/source-serialization#native-aot">Native AOT documentation</a>.</p><h2>Beyond the basics</h2><p>The example above covered filtering, sorting, and pagination. The provider supports a broader set of operations.</p><h3>Aggregations</h3><p><code>GroupBy</code>, combined with aggregate functions in <code>Select</code>, translates to ES|QL <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS ... BY</code></a>:</p><pre><code>var stats = client.Esql.Query<Product, object>(q => q
    .GroupBy(p => p.Brand)
    .Select(g => new
    {
        Brand = g.Key,
        Count = g.Count(),
        AvgPrice = g.Average(p => p.Price),
        MaxPrice = g.Max(p => p.Price)
    }));

// -> FROM products | STATS COUNT(*), AVG(price_usd), MAX(price_usd) BY brand</code></pre><h3>Projections</h3><p><code>Select</code>, with anonymous types generates <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval"><code>EVAL</code></a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/keep"><code>KEEP</code></a>, and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/rename"><code>RENAME</code></a> commands:</p><pre><code>var query = client.Esql.CreateQuery<Product>()
    .Select(p => new { ProductName = p.Name, p.Price, p.InStock });

// -> FROM products | KEEP name, price_usd, in_stock | RENAME name AS ProductName</code></pre><h3>Rich function library</h3><p>Over 80 ES|QL functions are available through the <code>EsqlFunctions</code> class, covering date/time, string, math, IP, pattern matching, and scoring. Standard <code>Math.*</code> and <code>string.*</code> methods are also translated:</p><pre><code>.Where(p => p.Name.Contains("Pro"))       // -> WHERE name LIKE "*Pro*"
.Where(p => EsqlFunctions.CidrMatch(      // -> WHERE CIDR_MATCH(ip, "10.0.0.0/8")
    p.IpAddress, "10.0.0.0/8"))</code></pre><h3>LOOKUP JOIN</h3><p>Cross-index lookups translate to ES|QL <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join"><code>LOOKUP JOIN</code></a>:</p><pre><code>var enriched = client.Esql.Query<Product, object>(q => q
    .LookupJoin<Product, CategoryLookup, string, object>(
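        // arguments: lookup index, local join key, lookup join key, result projection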
        "category-lookup-index",
        product => product.Id,
        category => category.CategoryId,
        (product, category) => new { product.Name, category!.CategoryLabel }));</code></pre><h3>Raw ES|QL escape hatch</h3><p>For ES|QL features not yet covered by the LINQ provider, you can append raw fragments:</p><pre><code>var results = client.Esql.Query<Product>(q => q
    .Where(p => p.InStock)
    .RawEsql("| EVAL discounted = price_usd * 0.9"));</code></pre><h3>Server-side async queries</h3><p>For long-running queries, submit them for background processing on the server:</p><pre><code>await using var asyncQuery = await client.Esql.SubmitAsyncQueryAsync<Product>(
    q => q.Where(p => p.InStock),
    asyncQueryOptions: new EsqlAsyncQueryOptions
    {
        WaitForCompletionTimeout = TimeSpan.FromSeconds(5),
        KeepAlive = TimeSpan.FromMinutes(10)
    });

await asyncQuery.WaitForCompletionAsync();
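// Rows stream back one at a time; results stay available for the KeepAlive window.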
await foreach (var product in asyncQuery.AsAsyncEnumerable())
    Console.WriteLine(product.Name);</code></pre><p>Server-side async queries are especially useful for long-running analytical queries or large-dataset processing that might exceed typical timeout thresholds, and for timeout-sensitive environments with load balancers, API gateways, or proxies that enforce strict HTTP timeouts. Async queries avoid connection drops by decoupling submission from result retrieval.</p><h2>Getting started</h2><p>LINQ to ES|QL is available starting from:</p><ul><li><strong>Elastic.Clients.Elasticsearch v9.3.4</strong> (9.x branch)</li><li><strong>Elastic.Clients.Elasticsearch v8.19.18</strong> (8.x branch)</li></ul><p>Install from NuGet:</p><p><code>dotnet add package Elastic.Clients.Elasticsearch</code></p><p>The entry points are on <code>client.Esql</code>:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Returns</td><td style="border: 1px solid #ccc; padding: 4px;">Use case</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query<T>(...)</td><td style="border: 1px solid #ccc; padding: 4px;">IEnumerable<T></td><td style="border: 1px solid #ccc; padding: 4px;">Synchronous execution</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">QueryAsync<T>(...)</td><td style="border: 1px solid #ccc; padding: 4px;">IAsyncEnumerable<T></td><td style="border: 1px solid #ccc; padding: 4px;">Async streaming</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">CreateQuery<T>()</td><td style="border: 1px solid #ccc; padding: 4px;">IEsqlQueryable<T></td><td style="border: 1px solid #ccc; padding: 4px;">Advanced composition and inspection</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">SubmitAsyncQueryAsync<T>(...)</td><td style="border: 1px solid #ccc; padding: 4px;">EsqlAsyncQuery<T></td><td style="border: 1px solid #ccc; padding: 4px;">Long-running server-side queries</td>
                    </tr>
              </tbody>
            </table>
          <p>For the full feature reference, including query options, multifield access, nested objects, and multivalue field handling, see the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/linq-to-esql">LINQ to ES|QL documentation</a>.</p><h2>Conclusion</h2><p>LINQ to ES|QL brings the full expressiveness of C# LINQ to Elasticsearch&#x27;s ES|QL query language, letting you write strongly typed, composable queries without handcrafting query strings. With automatic parameter capturing, streaming materialization, and a layered package architecture that scales from stand-alone translation to the full Elasticsearch client, it fits naturally into .NET applications of any size. Install the latest client, point your LINQ expressions at an index, and let the provider handle the rest.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From judgment lists to trained Learning to Rank (LTR) models]]></title>
            <description><![CDATA[Learn how to transform judgment lists into training data for Learning To Rank (LTR), design effective features, and interpret what your model learned.]]></description>
            <link>https://www.elastic.co/search-labs/blog/learning-to-rank-models-judgment-lists</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/learning-to-rank-models-judgment-lists</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b43ac8c8f6dd519b28a6ed2b89328b6e76c71e02-1999x1125.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch">Evaluating search query relevance with judgment lists</a>, we built lists and used the <code>_rank_eval</code> API to measure search quality. Though this approach gave us an objective way to evaluate changes, improving relevance still requires manual query tuning.</p><p>If judgment lists answer the question, <strong>“How good is my ranking?”</strong> Learning To Rank (LTR) answers, <strong>“How do I systematically make it better?”</strong></p><p>In this article, we take the next step: using those judgment lists to train an LTR model with <a href="https://xgboost.ai/">XGBoost</a>, <a href="https://eland.readthedocs.io/en/latest/">Eland</a>, and Elasticsearch. We’ll focus on understanding the process rather than on implementation details. For the complete code, refer to the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">companion notebook</a>.</p><h2>What is LTR?</h2><p><a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">LTR</a> uses machine learning (ML) to build a ranking function for your search engine. Instead of manually tuning query weights, you provide examples of proper rankings (your <a href="https://www.elastic.co/search-labs/blog/judgment-lists">judgment list</a>) and let the model learn what makes documents relevant. In Elasticsearch, LTR works as a second-stage reranker that runs after the initial retrieval:</p><ul><li><strong>First stage:</strong> A standard query (<a href="https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch">BM25</a>, vector, or <a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch">hybrid</a>) retrieves candidate documents quickly.</li><li><strong>Second stage:</strong> The LTR model reranks the top results using multiple signals it learned to combine.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/926c037f84e211b6fb5a44a0849823151ab0fa0a-1712x1852.png" style="width: 100%;" /><p>For a deeper introduction, see <a href="https://www.elastic.co/search-labs/blog/elasticsearch-learning-to-rank-introduction">Introducing Learning To Rank (LTR)</a> in Elasticsearch.</p><h2>The journey from judgment list to model</h2><p>A judgment list tells us which documents should rank highly for a given query. But the model cannot learn directly from document IDs. It needs numerical signals that explain <em>why</em> certain documents are potentially relevant.</p><p>The process works like this:</p><ol><li><strong>Start with judgments.</strong> Query-document pairs with relevance grades; for example, you grade doc1 as a good match for the query “DiCaprio performance”.</li><li><strong>Extract features.</strong> For each query-document pair, compute numerical signals, some about the document alone (for example, popularity), and others about how the query and document interact (for example, BM25 score).</li><li><strong>Train the model.</strong> The model learns which feature patterns predict high grades.</li><li><strong>Deploy. 
</strong>Push the trained model to your Elasticsearch cluster.</li><li><strong>Query.</strong> Use the model to rerank search results.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b43ac8c8f6dd519b28a6ed2b89328b6e76c71e02-1999x1125.png" style="width: 100%;" /><p>The key insight is that features must capture what your judgments are measuring. If your judgment list rewards popular thriller movies but your features only include text-matching scores, the model has no way to learn what makes those documents relevant.</p><h2>What are features?</h2><p><a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr#learning-to-rank-feature-extraction"><em>Features</em></a> are numerical values that describe a query-document pair. In Elasticsearch, we define features using queries that return <a href="https://www.elastic.co/search-labs/blog/elasticsearch-scoring-and-explain-api">scores</a>. There are three types:</p><ul><li><strong><em>Query-document features</em></strong> measure how well a query matches a document. Eland provides the <code>QueryFeatureExtractor</code> utility to define these features; the extractor below computes the BM25 relevance score for each query-document pair:</li></ul><pre><code>QueryFeatureExtractor(
    feature_name="title_bm25",
    query={"match": {"title": "{{query}}"}}
)</code></pre><p>This extracts the BM25 score from the title field for each document relative to the query.</p><ul><li><strong><em>Document features</em></strong> are properties of the document that don’t depend on the query. You can extract these using <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query"><code>script_score</code></a> or <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query"><code>function_score</code></a>:</li></ul><pre><code>QueryFeatureExtractor(
    feature_name="popularity",
    query={
        "script_score": {
            "query": {"exists": {"field": "popularity"}},
            "script": {"source": "return doc['popularity'].value;"}
        }
    }
)</code></pre><ul><li><strong><em>Query features</em></strong> describe the query itself, like the number of terms. These are less common but can help the model handle different query types.</li></ul><h2>Designing your feature set</h2><p>Choosing features isn’t random. Each feature should capture a signal that might explain why users prefer certain documents. Let&#x27;s look at the features from the LTR notebook and understand the reasoning:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Feature</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td><td style="border: 1px solid #ccc; padding: 4px;">Purpose</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">Title matches are strong relevance signals. For example, a movie titled Star Wars should rank highly for the query "star wars".</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`actors_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">Some users search by actor name. If they search for "leonardo dicaprio movies", they should get films starring Leonardo DiCaprio.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title_all_terms_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">This is a stricter version of title matching where all query terms must be present. It helps distinguish between exact matches and partial ones.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`actors_all_terms_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">This is the same stricter matching logic as described above but applied specifically to actors.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`popularity`</td><td style="border: 1px solid #ccc; padding: 4px;">Document</td><td style="border: 1px solid #ccc; padding: 4px;">Users generally prefer well-known movies over obscure ones when relevance is similar. A popular Star Wars film should rank above a low-budget parody with "Star Wars" in the title.</td>
                    </tr>
              </tbody>
            </table>
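          <p>In code, this feature set becomes a list of extractors bundled into one configuration. The sketch below follows the companion notebook and uses Eland&#x27;s LTR helpers; treat the exact signatures as the notebook defines them:</p><pre><code>from eland.ml.ltr import LTRModelConfig, QueryFeatureExtractor

# One extractor per row of the feature table above; "{{query}}" is the
# placeholder that gets filled with the user's search terms at extraction time.
ltr_config = LTRModelConfig(
    feature_extractors=[
        QueryFeatureExtractor(
            feature_name="title_bm25",
            query={"match": {"title": "{{query}}"}},
        ),
        QueryFeatureExtractor(
            feature_name="title_all_terms_bm25",
            # "operator": "AND" is what makes this the strict variant:
            # every query term must match the title.
            query={"match": {"title": {"query": "{{query}}", "operator": "AND"}}},
        ),
        # ... actors_bm25, actors_all_terms_bm25, and popularity follow the same pattern
    ]
)</code></pre>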
          <p>Notice the strategy behind this feature set:</p><ul><li><strong>Multiple signals for the same concept. </strong>We have both <code>title_bm25</code> (lenient) and <code>title_all_terms_bm25</code> (strict). The lenient version scores any document where at least one query term matches the title, and the strict version requires all the terms to be present. For short queries, the lenient match might be enough, whereas for longer, more specific queries, strict matching might be more important. The model can learn when to rely on each.</li><li><strong>Text features plus quality features. </strong>Text matching alone can return irrelevant documents that happen to contain the right words. The popularity feature lets the model boost well-known, quality content when text scores are similar.</li><li><strong>Coverage for different query types.</strong> Some queries target titles (&quot;star wars&quot;), and others target actors (&quot;dicaprio movies&quot;). Having features for both means that the model can handle diverse searches.</li></ul><p>When designing your own features, ask yourself, &quot;What signals would a human use to decide if this document is relevant?&quot; Those are your candidate features.</p><h2>Building the training dataset</h2><p>Once features are defined, we extract them for every query-document pair in our judgment list. The result is a training dataset where each row contains:</p><ul><li>The query identifier.</li><li>The document identifier.</li><li>The relevance grade (from our judgment list).</li><li>All feature values.</li></ul><p>Here’s a simplified example (the feature value columns are omitted for brevity):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`query_id`</td><td style="border: 1px solid #ccc; padding: 4px;">`query`</td><td style="border: 1px solid #ccc; padding: 4px;">`doc_id`</td><td style="border: 1px solid #ccc; padding: 4px;">`grade`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">11</td><td style="border: 1px solid #ccc; padding: 4px;">4</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">12180</td><td style="border: 1px solid #ccc; padding: 4px;">3</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">278427</td><td style="border: 1px solid #ccc; padding: 4px;">1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:2</td><td style="border: 1px solid #ccc; padding: 4px;">tom hanks movies</td><td style="border: 1px solid #ccc; padding: 4px;">857</td><td style="border: 1px solid #ccc; padding: 4px;">4</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:2</td><td style="border: 1px solid #ccc; padding: 4px;">tom hanks movies</td><td style="border: 1px solid #ccc; padding: 4px;">13</td><td style="border: 1px solid #ccc; padding: 4px;">3</td>
                    </tr>
              </tbody>
            </table>
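          <p>The feature columns are omitted from the table above for brevity. To populate them, eland's <code>FeatureLogger</code> can replay each judged query against the index and record the extracted scores. Here's a minimal sketch, assuming the <code>es_client</code>, index name, and <code>ltr_config</code> from the companion notebook:</p><pre><code>from eland.ml.ltr import FeatureLogger

# Replays the configured feature extractors for one query against the
# judged documents and returns the feature values for each document.
feature_logger = FeatureLogger(es_client, "movies", ltr_config)

doc_features = feature_logger.extract_features(
    query_params={"query": "star wars"},
    doc_ids=["11", "12180", "278427"],
)
# A feature that doesn't match (e.g., actors_bm25 for this query)
# comes back empty and ends up as NaN in the training dataset.</code></pre>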
          <p>A few things to notice:</p><p><strong>NaN values are normal.</strong> When a query doesn’t match a field, the feature returns no score. The movie <em>Star Wars</em> has a high <code>title_bm25</code> but no <code>actors_bm25</code> because the query &quot;star wars&quot; doesn’t match any actor names.</p><p><strong>Queries are grouped during training.</strong> The <code>query_id</code> column tells the model which documents to compare against each other. For &quot;star wars&quot;, it learns that document 11 (grade 4) should rank above document 278427 (grade 1).</p><p>But here’s the important part: The model doesn’t memorize these specific queries. Instead, it learns general patterns, like &quot;documents with high <code>title_bm25</code> AND high <code>popularity</code> tend to have high grades.&quot; When presented with a new query, the model applies these learned patterns to rank the results.</p><p><strong>Features must explain grade differences.</strong> Look at qid:1: The grade 4 document has a higher <code>title_bm25</code> and higher <code>popularity</code> than the grade 1 document. These patterns are what the model learns.</p><h2>Training the LTR model</h2><p>With the training dataset prepared, we train an <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost</a> model with a ranking objective. The model builds decision trees that learn patterns like:</p><ul><li>&quot;If <code>title_bm25</code> &gt; 10 and <code>popularity</code> &gt; 50, predict high relevance.&quot;</li><li>&quot;If <code>title_bm25</code> is missing but <code>actors_bm25</code> &gt; 12, still predict moderate relevance.&quot;</li></ul><p>Here&#x27;s how the training process works in practice:</p><pre><code>from xgboost import XGBRanker
from sklearn.model_selection import GroupShuffleSplit

# Create the ranker model:
ranker = XGBRanker(
    objective="rank:ndcg",
    eval_metric=["ndcg@10"],
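    # Stop training when ndcg@10 on the eval set stops improving for 20 rounds.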
    early_stopping_rounds=20,
)

# Shape the training and eval data into the expected format.
X = judgments_with_features[ltr_config.feature_names]
y = judgments_with_features["grade"]
groups = judgments_with_features["query_id"]

# Split the dataset into two parts, used for training and evaluation respectively.
group_preserving_splitter = GroupShuffleSplit(n_splits=1, train_size=0.7).split(
    X, y, groups
)
train_idx, eval_idx = next(group_preserving_splitter)

train_features, eval_features = X.loc[train_idx], X.loc[eval_idx]
train_target, eval_target = y.loc[train_idx], y.loc[eval_idx]
train_query_groups, eval_query_groups = groups.loc[train_idx], groups.loc[eval_idx]

# Training the model
ranker.fit(
    X=train_features,
    y=train_target,
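    # XGBRanker's `group` expects one size per query, in the same order as
    # the rows of X, so rows must be contiguous and sorted by query_id.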
    group=train_query_groups.value_counts().sort_index().values,
    eval_set=[(eval_features, eval_target)],
    eval_group=[eval_query_groups.value_counts().sort_index().values],
    verbose=True,
)</code></pre><p>During training, the model tries different combinations of these rules and measures how well the resulting rankings match your judgment grades. It uses a metric called <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain (NDCG)</a> to score itself. A perfect NDCG of 1.0 means that the model&#x27;s ranking exactly matches your judgments. Lower scores mean that some relevant documents are ranking below where they should be.</p><p>The training also uses a technique called <a href="https://xgboosting.com/early-stopping/">early stopping</a>. If the model&#x27;s score stops improving for several rounds, training halts automatically. This prevents the model from memorizing the training data too closely, which would hurt its ability to generalize to new queries.</p><p>The <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">companion notebook</a> contains the complete training code.</p><h2>Understanding what your LTR model learned</h2><p>After training, XGBoost can show you which features the model relied on most. You can generate a feature importance chart using XGBoost&#x27;s built-in visualization:</p><pre><code>from xgboost import plot_importance

plot_importance(ranker, importance_type="weight")</code></pre><p>The <code>importance_type=&quot;weight&quot;</code> parameter shows how often each feature was used in tree splits. Here’s the resulting chart:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/67608c24c50a8775b6e70fa7f522d7b58cefc764-734x460.png" style="width: 100%;" /><p>The F score counts how many times each feature was used to make split decisions across all trees in the model. Higher values mean that the model relied on that feature more often.</p><p>In this example:</p><ul><li><strong><code>popularity</code> (2178):</strong> The most important feature. The model frequently uses popularity to separate relevant from nonrelevant documents.</li><li><strong><code>title_bm25</code> (1642):</strong> Second-most important. Title matches matter a lot for movie searches.</li><li><strong><code>actors_bm25</code> (565):</strong> Moderately important. This is useful for queries that mention actors.</li><li><strong><code>title_all_terms_bm25</code> (211):</strong> Occasionally useful. The stricter matching helps for some queries.</li><li><strong><code>actors_all_terms_bm25</code> (63):</strong> Rarely used. The model found this feature less predictive.</li></ul><p>This chart helps you iterate on your feature set. If a feature that you expected to be important shows near-zero importance, investigate why. Maybe the feature extraction is not working as intended, or maybe that signal doesn’t actually predict relevance in your judgment data.</p><h2>Deploying and using the LTR model</h2><p>Once trained, upload the model to Elasticsearch using Eland:</p><pre><code>MLModel.import_ltr_model(
    es_client=es_client,
    model=ranker,
    model_id="ltr-model-xgboost",
    ltr_model_config=ltr_config,
    es_if_exists="replace",
)</code></pre><p>Once uploaded, the model can be used as a <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/rescorer-retriever">rescorer retriever</a> to be combined with other <a href="https://www.elastic.co/docs/solutions/search/retrievers-overview">retrievers</a> for multistage search pipelines:</p><pre><code>GET movies/_search
{
  "retriever": {
    "rescorer": {
      "rescore": {
        "window_size": 50,
        "learning_to_rank": {
          "model_id": "ltr-model-xgboost",
          "params": {
            "query": "star wars"
          }
        }
      },
      "retriever": {
        "standard": {
          "query": {
            "multi_match": {
              "fields": ["title", "overview", "actors", "director", "tags", "characters"],
              "query": "star wars"
            }
          }
        }
      }
    }
  }
}</code></pre><p>Response (simplified):</p><pre><code> "hits": {
    "total": {
      "value": 852,
      "relation": "eq"
    },
    "max_score": 25.165691,
    "hits": [
      {
        "_index": "movies",
        "_id": "11",
        "_score": 25.165691,
        "_source": {
          "title": "Star Wars"
        }
      },
      {
        "_index": "movies",
        "_id": "12180",
        "_score": 25.092865,
        "_source": {
          "title": "Star Wars: The Clone Wars"
        }
      },
      {
        "_index": "movies",
        "_id": "181812",
        "_score": 23.456198,
        "_source": {
          "title": "Star Wars: The Rise of Skywalker"
        }
      },
      {
        "_index": "movies",
        "_id": "140607",
        "_score": 23.320757,
        "_source": {
          "title": "Star Wars: The Force Awakens"
        }
      },
...</code></pre><p>The first-stage query retrieves candidates using BM25. The LTR model then reranks the top 50 results using all the features it learned to weight.</p><p>For comparison, the <code>multi_match</code> query alone returns some less relevant results in the top positions, which the LTR rescorer corrects:</p><pre><code>{
  "hits": [
    {
      "_index": "movies",
      "_id": "11",
      "_score": 10.971989,
      "_source": {
        "title": "Star Wars"
      }
    },
    {
      "_index": "movies",
      "_id": "12180",
      "_score": 9.923633,
      "_source": {
        "title": "Star Wars: The Clone Wars"
      }
    },
    {
      "_index": "movies",
      "_id": "1022100",
      "_score": 8.9880295,
      "_source": {
        "title": "Andor: A Disney+ Day Special Look"
      }
    },
    {
      "_index": "movies",
      "_id": "278427",
      "_score": 8.845748,
      "_source": {
        "title": "Family Guy Presents: It's a Trap!"
      }
    },
    ...
  ]
}</code></pre><h2>Conclusion</h2><p>The path from judgment lists to a working LTR model involves three key steps: designing features that capture relevance signals, building a training dataset that pairs those features with your judgment grades, and training a model that learns the patterns.</p><p>Our <a href="https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch">previous article</a> becomes the starting point for this process. Your grades define what &quot;relevant&quot; means and how to measure it, and your features give the model the signals to predict it.</p><p>For the complete implementation with a dataset of 9,750 movies and 384,755 judgment rows, see the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">LTR notebook</a>. For advanced use cases, like personalized search, see <a href="https://www.elastic.co/search-labs/blog/personalized-search-elasticsearch-ltr">Personalized search with LTR</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Elasticsearch runtime fields to ES|QL: Adapting legacy tools to current techniques]]></title>
            <description><![CDATA[Learn how to migrate five common Elasticsearch runtime field patterns to their ES|QL equivalents, with side-by-side code comparisons and guidance on when each approach makes sense.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-runtime-fields-to-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-runtime-fields-to-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/657ec44d182de78e6ddabb6632c6844b5a36774d-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields">runtime fields</a> solve the problem of computing values at query time without <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex">reindexing</a>. But they come with <a href="https://www.elastic.co/docs/reference/scripting-languages/painless/painless">Painless scripting</a> complexity and performance costs that scale with document count. <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language (ES|QL)</a> offers a more powerful alternative with a dedicated execution engine, pipeline processing, and no scripting required. In this article, you’ll learn how to map five common runtime field patterns to their ES|QL equivalents, so you can modernize your queries and understand when each approach makes sense.</p><h2>Prerequisites</h2><ul><li>Elasticsearch 8.15+ (for <code>::</code> cast operator support; core ES|QL features available from 8.11)</li></ul><h2>Runtime fields versus ES|QL</h2><p>Runtime fields were introduced in Elasticsearch 7.11 as a way to define fields at query time. Instead of reindexing data, you could write a Painless script that computes values on the fly:</p><pre><code>PUT my-index/_mapping
{
  "runtime": {
    "full_address": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['address'].value + ':' + doc['port'].value)"
      }
    }
  }
}</code></pre><p>This works, but comes with trade-offs:</p><ul><li><strong>Painless scripting overhead:</strong> Every runtime field requires scripting knowledge, and the <a href="https://www.elastic.co/docs/reference/scripting-languages/painless/painless-language-specification">syntax</a> is Java-like, not query-like.</li><li><strong>Performance cost:</strong> Runtime fields evaluate per document at query time. Elasticsearch classifies them as &quot;expensive queries&quot; that <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields#runtime-compromises">can be rejected</a> by cluster settings.</li><li><strong>Isolated computation:</strong> Each runtime field computes independently. There’s no way to chain transforms or use the output of one field in another within the same query.</li></ul><p>ES|QL changes the equation. It has its own execution engine (not translated to Query DSL), runs queries concurrently across nodes, and provides a complete toolkit for field computation: <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval"><code>EVAL</code></a>, <a href="http://elastic.co/docs/reference/query-languages/esql/commands/grok"><code>GROK</code></a>, <a href="http://elastic.co/docs/reference/query-languages/esql/commands/dissect"><code>DISSECT</code></a>, type casting, and pipeline chaining.</p><p>Let&#x27;s see how each runtime field pattern maps to ES|QL.</p><h2>Setting up the example data</h2><p>All the code snippets in this article can be executed in the Kibana <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/console">Dev Tools console</a>.</p><p>To follow along, create a sample index with data that exercises all five patterns. This simulates a server logs scenario with mixed field types, raw messages, and some intentional data quality issues:</p><pre><code>PUT server-logs
{
  "mappings": {
    "properties": {
      "host": { "type": "keyword" },
      "port": { "type": "keyword" },
      "raw_message": { "type": "text" },
      "response_time": { "type": "keyword" },
      "status_code": { "type": "keyword" },
      "region": { "type": "keyword" }
    }
  }
}</code></pre><p>Now index some sample documents:</p><pre><code>POST _bulk
{ "index": { "_index": "server-logs" } }
{ "host": "web-01", "port": "8080", "raw_message": "2024-01-15 INFO user=alice action=login duration=230ms", "response_time": "145", "status_code": "200", "region": "us-east" }
{ "index": { "_index": "server-logs" } }
{ "host": "web-02", "port": "443", "raw_message": "2024-01-15 ERROR user=bob action=upload duration=1200ms", "response_time": "not_available", "status_code": "500", "region": "eu-west" }
{ "index": { "_index": "server-logs" } }
{ "host": "api-01", "port": "3000", "raw_message": "2024-01-15 WARN user=charlie action=query duration=890ms", "response_time": "890", "status_code": "200", "region": "us-east" }
{ "index": { "_index": "server-logs" } }
{ "host": "api-02", "port": "3000", "raw_message": "2024-01-16 INFO user=diana action=export duration=3400ms", "response_time": "3400", "status_code": "200", "region": "ap-south" }
{ "index": { "_index": "server-logs" } }
{ "host": "web-01", "port": "8080", "raw_message": "2024-01-16 ERROR user=eve action=login duration=50ms", "response_time": "50", "status_code": "401", "region": "US-EAST" }
</code></pre><p>Notice that <code>response_time</code> is stored as a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/keyword">keyword</a> (a common real-world mistake), and the last document has <code>&quot;US-EAST&quot;</code> instead of <code>&quot;us-east&quot;</code> (a data quality issue we’ll fix later).</p><h2>Pattern 1: Field concatenation</h2><p>A common runtime field use case is combining two fields into one. For example, creating a <code>host:port</code> identifier.</p><h3>The runtime field approach</h3><p>You can define it inline at query time. Query-time approach avoids modifying the mapping, but you still need Painless scripting, scoping it to a single search request:</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "endpoint": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['host'].value + ':' + doc['port'].value)"
      }
    }
  },
  "fields": ["endpoint"],
  "_source": false
}</code></pre><h3>The ES|QL approach</h3><p>You can run ES|QL queries using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-esql-query"><code>_query API</code></a> endpoint:</p><pre><code>POST _query
{
  "query": """
    FROM server-logs
    | EVAL endpoint = CONCAT(host, ":", port)
    | KEEP host, port, endpoint
    | LIMIT 1
  """
}</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "host", "type": "keyword" },
    { "name": "port", "type": "keyword" },
    { "name": "endpoint", "type": "keyword" }
  ],
  "values": [
    ["web-01", "8080", "web-01:8080"]
  ]
}</code></pre><p><code>CONCAT</code> accepts two or more arguments and always returns a <code>keyword</code>.</p><p><em>Note: For brevity, the remaining ES|QL examples in this article show just the query. Wrap them in <code>POST _query { &quot;query&quot;: &quot;...&quot; }</code> to run them in Kibana Dev Tools.</em></p><h4>When to use</h4><p>If you need <code>endpoint</code> to persist across all queries and be available in Kibana dashboards, use a mapping-level runtime field. If you need it for a single search request within Query DSL, use a query-time runtime field. If you need it for ad-hoc analysis or exploratory work, ES|QL is simpler.</p><h2>Pattern 2: Data extraction from unstructured text</h2><p>Extracting structured data from raw log messages is another classic runtime field pattern.</p><h3>The runtime field approach</h3><p>Painless uses Java&#x27;s regex <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html">Matcher</a> class:</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "log_user": {
      "type": "keyword",
      "script": {
        "source": "def matcher = /user=(\\w+)/.matcher(params._source['raw_message']); if (matcher.find()) { emit(matcher.group(1)); }"
      }
    }
  },
  "fields": ["log_user"],
  "_source": false
}</code></pre><p>This is verbose. You need to know <a href="https://www.elastic.co/docs/explore-analyze/scripting/modules-scripting-regular-expressions-tutorial">Painless regex syntax</a>, handle the <code>Matcher</code> object, and call <code>emit()</code> correctly.</p><h3>The ES|QL approach: GROK</h3><p>ES|QL provides two purpose-built commands for text extraction. <code>GROK</code> uses regex-based patterns:</p><pre><code>FROM server-logs
| GROK raw_message "%{NOTSPACE:timestamp_date} %{WORD:log_level} user=%{WORD:user} action=%{WORD:action} duration=%{WORD:duration}"
| KEEP user, log_level, action, duration</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "user", "type": "keyword" },
    { "name": "log_level", "type": "keyword" },
    { "name": "action", "type": "keyword" },
    { "name": "duration", "type": "keyword" }
  ],
  "values": [
    ["alice", "INFO", "login", "230ms"], ...
  ]
}</code></pre><p><code>GROK</code> uses the <code>%{SYNTAX:SEMANTIC}</code> pattern format. It extracts multiple fields in a single and readable command.</p><h3>The ES|QL approach: DISSECT</h3><p>For structured data with consistent delimiters, <code>DISSECT</code> is faster because it doesn’t use regular expressions:</p><pre><code>FROM server-logs
| DISSECT raw_message "%{timestamp_date} %{log_level} user=%{user} action=%{action} duration=%{duration}"
| KEEP user, log_level, action, duration</code></pre><p>The syntax is nearly identical to <code>GROK</code>, but <code>DISSECT</code> works by splitting on delimiters rather than matching regex patterns. This makes it faster for data that follows a consistent format.</p><h4>When to use GROK vs DISSECT</h4><p>Use <code>DISSECT</code> when your data has a predictable structure (same delimiters, same field order). Use <code>GROK</code> when you need regex flexibility, for example when fields may be optional or formats vary.</p><h2>Pattern 3: Dynamic type conversion</h2><p>When a field is mapped as <code>keyword</code> but contains numeric data (a surprisingly common scenario), runtime fields can cast it at query time.</p><h3>The runtime field approach</h3><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "response_time_long": {
      "type": "long",
      "script": {
        "source": """
          def val = doc['response_time'].value;
          if (val != 'not_available') {
            emit(Long.parseLong(val));
          }
        """
      }
    }
  },
  "fields": ["response_time_long"],
  "_source": false
}</code></pre><p>You need to handle parsing exceptions manually. If <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/Long.html#parseLong-java.lang.String-"><code>Long.parseLong</code></a> fails on an unexpected value, the script throws an error.</p><h3>The ES|QL approach</h3><p>ES|QL provides explicit conversion functions and a shorthand cast operator:</p><pre><code>FROM server-logs
| EVAL response_ms = TO_LONG(response_time)
| KEEP host, response_time, response_ms</code></pre><p>Or with the <code>::</code> cast operator (<a href="https://www.elastic.co/search-labs/blog/esql-timeline-of-improvements">available since 8.15</a>):</p><pre><code>FROM server-logs
| EVAL response_ms = response_time::long
| KEEP host, response_time, response_ms</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "host", "type": "keyword" },
    { "name": "response_time", "type": "keyword" },
    { "name": "response_ms", "type": "long" }
  ],
  "values": [
    ["web-01", "145", 145]
  ]
}</code></pre><p>Both produce the same result. The key difference from Painless: <strong>Failed conversions return <code>null</code> instead of throwing exceptions</strong>. The document with <code>&quot;not_available&quot;</code> simply gets <code>null</code> for <code>response_ms</code>, and ES|QL emits a warning.</p><p>Common conversion functions include:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Function</td><td style="border: 1px solid #ccc; padding: 4px;">Converts to</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_LONG()`</td><td style="border: 1px solid #ccc; padding: 4px;">Long integer</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_INTEGER()`</td><td style="border: 1px solid #ccc; padding: 4px;">Integer</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_DOUBLE()`</td><td style="border: 1px solid #ccc; padding: 4px;">Double</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_DATETIME()`</td><td style="border: 1px solid #ccc; padding: 4px;">Date</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_BOOLEAN()`</td><td style="border: 1px solid #ccc; padding: 4px;">Boolean</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_IP()`</td><td style="border: 1px solid #ccc; padding: 4px;">IP address</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_VERSION()`</td><td style="border: 1px solid #ccc; padding: 4px;">Version</td>
                    </tr>
              </tbody>
            </table>
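          <p>As a further illustration on the sample index, the other numeric <code>keyword</code> fields convert the same way, with either a function or the cast operator:</p><pre><code>FROM server-logs
| EVAL port_num = TO_INTEGER(port), status = status_code::integer
| KEEP host, port_num, status</code></pre>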
          <p>The <code>::</code> operator works with all these types (for example, <code>field::double</code>, <code>field::datetime</code>).</p><h4>When to use</h4><p>ES|QL&#x27;s graceful null handling makes it safer for dirty data. Runtime fields with Painless give you fine-grained control over error handling but require more code. For type conversion specifically, ES|QL is almost always the better choice.</p><h2>Pattern 4: <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/dynamic-field-mapping">Dynamic field</a> handling</h2><p>Runtime fields support <code>&quot;dynamic&quot;: &quot;runtime&quot;</code> in mappings, which prevents <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion">mapping explosion</a> by creating all new fields as runtime fields instead of indexed fields:</p><pre><code>{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}</code></pre><p>Any new field sent to this index becomes a runtime field automatically. This is useful when you ingest semi-structured data with unpredictable field names.</p><h3>Where ES|QL fits</h3><p>ES|QL provides query-time flexibility, but it still needs fields to be visible in the mapping. This is where runtime fields and ES|QL complement each other rather than compete.</p><p>If a field exists in <code>_source</code> but isn’t mapped, ES|QL cannot access it directly. The current workaround is to define a runtime field to make the unmapped field visible:</p><pre><code>PUT dynamic-logs/_mapping
{
  "runtime": {
    "custom_field": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['custom_field'])"
      }
    }
  }
}</code></pre><p>Once defined, ES|QL can query it:</p><pre><code>FROM dynamic-logs
| WHERE custom_field == "some_value"
| KEEP timestamp, custom_field</code></pre><p>This is one scenario where runtime fields remain essential. They act as a bridge, making unmapped data accessible to ES|QL.</p><h2>Pattern 5: Field shadowing for error correction</h2><p>Runtime fields can shadow (override) indexed fields by defining a runtime field with the same name as an existing field. This is useful for correcting data without reindexing.</p><h3>The runtime field approach</h3><p>Remember our data quality issue, where <code>region</code> has inconsistent casing (<code>&quot;US-EAST&quot;</code> versus <code>&quot;us-east&quot;</code>)?</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "region": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['region'].toLowerCase())"
      }
    }
  },
  "fields": ["region"],
  "_source": false
}</code></pre><p>This overrides the indexed <code>region</code> field for all queries. Every search, aggregation, and Kibana visualization will see the lowercase version.</p><h3>The ES|QL approach</h3><p>In ES|QL, the same correction is a one-line <code>EVAL</code>:</p><pre><code>FROM server-logs
| EVAL region = TO_LOWER(region)
| KEEP host, port, region</code></pre><p>When you use <code>EVAL</code> with an existing column name, ES|QL drops the original column and replaces it with the computed value. This is the exact equivalent of field shadowing, but scoped to the current query.</p><p>You can also chain multiple corrections in a pipeline:</p><pre><code>FROM server-logs
| EVAL region = TO_LOWER(region)
| EVAL region = CASE(region == "us-east", "US East", region == "eu-west", "EU West", region == "ap-south", "AP South", region)
| KEEP host, region</code></pre><h4>When to use</h4><p>If the correction should apply to all queries and <a href="https://www.elastic.co/kibana/kibana-dashboard">Kibana dashboards</a>, use runtime field shadowing. If you need to correct data for a specific analysis, ES|QL is more flexible since you can apply different transformations in different queries without modifying the mapping.</p><h2>The ES|QL pipeline advantage: Going beyond runtime fields</h2><p>This is where ES|QL fundamentally surpasses runtime fields. Runtime fields are isolated: each one computes independently, and you cannot use the output of one runtime field as input for another in the same query.</p><p>ES|QL pipelines chain transforms. Here’s a single query that combines multiple patterns:</p><pre><code>FROM server-logs
| GROK raw_message "%{NOTSPACE:log_date} %{WORD:log_level} user=%{WORD:user} action=%{WORD:action} duration=%{INT:duration_raw}ms"
| EVAL duration_ms = duration_raw::long
| EVAL region = TO_LOWER(region)
| WHERE log_level == "ERROR" AND duration_ms > 100
| STATS avg_duration = AVG(duration_ms), error_count = COUNT(*) BY region</code></pre><p>This single query:</p><ul><li><strong>Extracts</strong> fields from raw text (<code>GROK</code>).</li><li><strong>Converts</strong> the duration to a number (<code>EVAL</code> with cast).</li><li><strong>Normalizes</strong> region casing (<code>EVAL</code> with <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions/to_lower"><code>TO_LOWER</code></a>).</li><li><strong>Filters</strong> for errors with high duration (<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where"><code>WHERE</code></a>).</li><li><strong>Aggregates</strong> by region (<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS</code></a>).</li></ul><p>To achieve the same result with runtime fields, you would need to define at least three separate runtime fields (for extraction, conversion, and normalization) and then write a Query DSL query with <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/filter-search-results">filters</a> and <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregations</a>. The ES|QL version is a single, readable pipeline.</p><p>You can even use expressions directly inside aggregations:</p><pre><code>FROM server-logs
| EVAL response_ms = response_time::long
| STATS
    avg_response = AVG(response_ms),
    p95_response = PERCENTILE(response_ms, 95),
    slow_count = COUNT(CASE(response_ms > 1000, 1, null))
  BY host</code></pre><h2>Conclusion</h2><p>What we covered:</p><ul><li>ES|QL provides a full toolkit (<code>EVAL</code>, <code>GROK</code>, <code>DISSECT</code>, type casting with <code>::</code>) that replaces most runtime field patterns without any Painless scripting.</li><li>Failed type conversions in ES|QL return <code>null</code> instead of throwing exceptions, making it safer for real-world data.</li><li>Pipeline processing (chaining <code>GROK</code> into <code>EVAL</code> into <code>WHERE</code> into <code>STATS</code>) goes beyond what runtime fields can do in isolation.</li><li>Runtime fields remain valuable for persistent computed fields, field shadowing across all queries, and as a bridge for unmapped data in ES|QL.</li></ul><p>One important caveat: Both runtime fields and ES|QL compute values at query time, which means they pay the cost on every query. If you find yourself applying the same transformation repeatedly (type corrections, field extraction, data normalization), consider using <a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">ingest pipelines</a> to fix the data at index time instead. Ingest pipelines let you parse, enrich, and transform documents before they’re stored, so queries can work with clean, properly typed fields directly. Runtime fields and ES|QL are great for exploration and ad-hoc analysis, but for production workloads, indexing the right data from the start is almost always the better choice.</p><p><strong>The key takeaway: </strong>Runtime fields aren’t deprecated, and they aren’t going away. But for most query-time computation patterns, ES|QL offers a simpler, more powerful, and more performant approach. And when the transformation is known up front, an ingest pipeline is the most efficient option of all.</p><h2>Next steps</h2><ul><li><a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL documentation</a></li><li><a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields">Runtime fields reference</a></li><li><a href="https://www.elastic.co/search-labs/blog/esql-timeline-of-improvements">ES|QL timeline of improvements</a></li><li><a href="https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields">Getting started with runtime fields</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-process-data-with-dissect-grok">ES|QL processing data with DISSECT and GROK</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating an Elasticsearch MCP server with TypeScript]]></title>
            <description><![CDATA[Learn how to create an Elasticsearch MCP server with TypeScript and Claude Desktop.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0bb24c05fff391f42070c2883182ea6fe9cb9680-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When working with large knowledge bases in Elasticsearch, finding information is only half the battle. Engineers often need to synthesize results from multiple documents, generate summaries, and trace answers back to their sources. Model Context Protocol (MCP) provides a standardized way to connect Elasticsearch with large language model–powered (LLM-powered) applications to accomplish this. While Elastic offers official solutions, like Elastic Agent Builder (which includes an <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP endpoint</a> among its features), building a custom MCP server gives you full control over search logic, result formatting, and how retrieved content is passed to an LLM for synthesis, summaries, and citations.</p><p>In this article, we’ll explore the benefits of building a custom Elasticsearch MCP server and show how to create one in TypeScript that connects Elasticsearch to LLM-powered applications.</p><h2>Why build a custom Elasticsearch MCP server?</h2><p>Elastic provides some alternatives for <a href="https://www.elastic.co/docs/solutions/search/mcp">MCP servers</a>:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">Elastic Agent Builder MCP server for Elasticsearch 9.2+</a></li><li><a href="https://github.com/elastic/mcp-server-elasticsearch?tab=readme-ov-file#elasticsearch-mcp-server">Elasticsearch MCP server for older versions (Python)</a></li></ul><p>If you need more control over how your MCP server interacts with Elasticsearch, building your own custom server gives you the flexibility to tailor it exactly to your needs. For example, Agent Builder&#x27;s MCP endpoint is limited to Elasticsearch Query Language (ES|QL) queries, while a custom server allows you to use the full Query DSL. You also gain control over how results are formatted before being passed to the LLM and can integrate additional processing steps, like the OpenAI-powered summarization we&#x27;ll implement in this tutorial.</p><p>By the end of this article, you’ll have an MCP server in TypeScript that searches for information stored in an Elasticsearch index, summarizes it, and provides citations. We&#x27;ll use Elasticsearch for retrieval, OpenAI&#x27;s <code>gpt-4o-mini</code> model to summarize and generate citations, and Claude Desktop as the MCP client and UI to take in user queries and give responses. The end result is an internal knowledge assistant that helps engineers discover and synthesize best practices across their organization’s technical docs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec5771a874cf9740d4cac6888622cbe8cd6aede7-1999x1133.png" style="width: 100%;" /><h2>Prerequisites:</h2><ul><li>Node.js 20 +</li><li>Elasticsearch</li><li>OpenAI API key</li><li>Claude Desktop</li></ul><h3>What is MCP?</h3><p><a href="https://www.elastic.co/what-is/mcp">MCP</a> is an open standard, created by <a href="https://www.anthropic.com/news/model-context-protocol">Anthropic</a>, that provides secure, bidirectional connections between LLMs and external systems, like Elasticsearch. 
You can read more about the current state of MCP in <a href="https://www.elastic.co/search-labs/blog/mcp-current-state">this article</a>.</p><p>The MCP landscape is <a href="https://www.elastic.co/search-labs/blog/mcp-current-state#mcp-project-updates:-transport,-elicitation,-and-structured-tooling">evolving every day</a>, with servers available for a wide range of use cases. On top of that, it’s easy to build your own custom MCP server, as we’ll show in this article.</p><h3>MCP clients</h3><p>There’s a long <a href="https://modelcontextprotocol.io/clients">list of available MCP clients</a>, each with its own characteristics and limitations. For simplicity and popularity, we’ll use <a href="https://claude.ai/download">Claude Desktop</a> as our MCP client. It will serve as the chat interface where users can ask questions in natural language, and it will automatically invoke the tools exposed by our MCP server to search documents and generate summaries.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66eb0b11473347b6cf2d85718251eeac38d6249d-1999x1491.png" style="width: 100%;" /><h2>Creating an Elasticsearch MCP server</h2><p>Using the <a href="https://github.com/modelcontextprotocol/typescript-sdk">TypeScript SDK</a>, we can easily create a server that understands how to query our Elasticsearch data based on a user query input.</p><p>Here are the steps in this article to integrate the Elasticsearch MCP server with the Claude Desktop client:</p><ol><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#configure-mcp-server-for-elasticsearch">Configure MCP server for Elasticsearch.</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#load-the-mcp-server-into-claude-desktop">Load the MCP server into Claude Desktop.</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#test-it-out">Test it out.</a></li></ol><h3>Configure MCP server for Elasticsearch</h3><p>To begin, let&#x27;s initialize a node application:</p><pre><code>npm init -y</code></pre><p>This will create a <code>package.json</code> file, and with it, we can start installing the necessary dependencies for this application.</p><pre><code>npm install @elastic/elasticsearch @modelcontextprotocol/sdk openai zod && npm install --save-dev ts-node @types/node typescript</code></pre><ul><li><strong>@elastic/elasticsearch</strong> will give us access to the Elasticsearch Node.js library.</li><li><strong>@modelcontextprotocol/sdk</strong> provides the core tools to create and manage an MCP server, register tools, and handle communication with MCP clients.</li><li><strong>openai</strong> allows interaction with OpenAI models to generate summaries or natural language responses.</li><li><strong><a href="https://zod.dev/">zod</a> </strong>helps define and validate structured schemas for input and output data in each tool.</li></ul><p><code>ts-node</code>, <code>@types/node</code>, and <code>typescript</code> will be used during development to type the code and compile the scripts.</p><h4>Set up the dataset</h4><p>To provide the data that Claude Desktop can query using our MCP server, we’ll use a mock <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/dataset.json">internal knowledge base dataset</a>. Here’s what a document from this dataset will look like:</p><pre><code>{
    "id": 5,
    "title": "Logging Standards for Microservices",
    "content": "Consistent logging across microservices helps with debugging and tracing. Use structured JSON logs and include request IDs and timestamps. Avoid logging sensitive information. Centralize logs in Elasticsearch or a similar system. Configure log rotation to prevent storage issues and ensure logs are searchable for at least 30 days.",
    "tags": ["logging", "microservices", "standards"]
}</code></pre><p>To ingest the data, we prepared a script that creates an index in Elasticsearch and loads the dataset into it. You can find it <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/setup.ts">here</a>.</p><h4>MCP server</h4><p>Create a file named <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/index.ts"><code>index.ts</code></a> and add the following code to import the dependencies and handle environment variables:</p><pre><code>// index.ts
import { z } from "zod";
import { Client } from "@elastic/elasticsearch";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import OpenAI from "openai";

const ELASTICSEARCH_ENDPOINT =
  process.env.ELASTICSEARCH_ENDPOINT ?? "http://localhost:9200";
const ELASTICSEARCH_API_KEY = process.env.ELASTICSEARCH_API_KEY ?? "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY ?? "";
const INDEX = "documents";</code></pre><p>Also, let’s initialize the clients to handle the Elasticsearch and OpenAI calls:</p><pre><code>const openai = new OpenAI({
  apiKey: OPENAI_API_KEY,
});

const _client = new Client({
  node: ELASTICSEARCH_ENDPOINT,
  auth: {
    apiKey: ELASTICSEARCH_API_KEY,
  },
});</code></pre><p>To make our implementation more robust and ensure structured input and output, we&#x27;ll define schemas using <a href="https://zod.dev/"><code>zod</code></a>. This allows us to validate data at runtime, catch errors early, and make the tool responses easier to process programmatically:</p><pre><code>const DocumentSchema = z.object({
  id: z.number(),
  title: z.string(),
  content: z.string(),
  tags: z.array(z.string()),
});

const SearchResultSchema = z.object({
  id: z.number(),
  title: z.string(),
  content: z.string(),
  tags: z.array(z.string()),
  score: z.number(),
});

type Document = z.infer<typeof DocumentSchema>;
type SearchResult = z.infer<typeof SearchResultSchema>;</code></pre><p>Learn more about structured outputs <a href="https://www.elastic.co/search-labs/blog/structured-outputs-elasticsearch-guide">here</a>.</p><p>Now let’s initialize the MCP server:</p><pre><code>const server = new McpServer({
  name: "Elasticsearch RAG MCP",
  description:
    "A RAG server using Elasticsearch. Provides tools for document search, result summarization, and source citation.",
  version: "1.0.0",
});</code></pre><h4>Defining the MCP tools</h4><p>With everything configured, we can start writing the tools that will be exposed by our MCP server. This server exposes two tools:</p><ul><li><strong><code>search_docs</code>: </strong>Searches for documents in Elasticsearch using full-text search.</li><li><strong><code>summarize_and_cite</code>:</strong> Summarizes and synthesizes information from previously retrieved documents to answer a user question. This tool also adds citations referencing the source documents.</li></ul><p>Together, these tools form a simple “retrieve-then-summarize” workflow, where one tool fetches relevant documents and the other uses those documents to generate a summarized, cited response.</p><h4>Tool response format</h4><p>Each tool can accept arbitrary input parameters, but it must respond with the following structure:</p><ul><li><strong>Content:</strong> This is the response of the tool in an unstructured format. This field is usually used to return text, images, audio, links, or embeddings. For this application, it will be used to return formatted text with the information generated by the tools.</li><li><strong>structuredContent: </strong>This is an optional return used to provide the results of each tool in a structured format. This is useful for programmatic purposes. Although it isn&#x27;t used in this MCP server, it can be useful if you want to develop other tools or process the results programmatically.</li></ul><p>With that structure in mind, let’s dive into each tool in detail.</p><h4>Search_docs tool</h4><p>This tool performs a <a href="https://www.elastic.co/docs/solutions/search/full-text">full-text search</a> in the Elasticsearch index to retrieve the most relevant documents based on the user query. It highlights key matches and provides a quick overview with relevance scores.</p><pre><code>server.registerTool(
  "search_docs",
  {
    title: "Search Documents",
    description:
      "Search for documents in Elasticsearch using full-text search. Returns the most relevant documents with their content, title, tags, and relevance score.",
    inputSchema: {
      query: z
        .string()
        .describe("The search query terms to find relevant documents"),
      max_results: z
        .number()
        .optional()
        .default(5)
        .describe("Maximum number of results to return"),
    },
    outputSchema: {
      results: z.array(SearchResultSchema),
      total: z.number(),
    },
  },
  async ({ query, max_results }) => {
    if (!query) {
      return {
        content: [
          {
            type: "text",
            text: "Query parameter is required",
          },
        ],
        isError: true,
      };
    }

    try {
      const response = await _client.search({
        index: INDEX,
        size: max_results,
        query: {
          bool: {
            must: [
              {
                multi_match: {
                  query: query,
                  fields: ["title^2", "content", "tags"],
                  fuzziness: "AUTO",
                },
              },
            ],
            should: [
              {
                match_phrase: {
                  title: {
                    query: query,
                    boost: 2,
                  },
                },
              },
            ],
          },
        },
        highlight: {
          fields: {
            title: {},
            content: {},
          },
        },
      });

      const results: SearchResult[] = response.hits.hits.map((hit: any) => {
        const source = hit._source as Document;

        return {
          id: source.id,
          title: source.title,
          content: source.content,
          tags: source.tags,
          score: hit._score ?? 0,
        };
      });

      const contentText = results
        .map(
          (r, i) =>
            `[${i + 1}] ${r.title} (score: ${r.score.toFixed(
              2,
            )})\n${r.content.substring(0, 200)}...`,
        )
        .join("\n\n");

      const totalHits =
        typeof response.hits.total === "number"
          ? response.hits.total
          : (response.hits.total?.value ?? 0);

      return {
        content: [
          {
            type: "text",
            text: `Found ${results.length} relevant documents:\n\n${contentText}`,
          },
        ],
        structuredContent: {
          results: results,
          total: totalHits,
        },
      };
    } catch (error: any) {
      console.log("Error during search:", error);

      return {
        content: [
          {
            type: "text",
            text: `Error searching documents: ${error.message}`,
          },
        ],
        isError: true,
      };
    }
  }
);</code></pre><p><em>We configure <code><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-fuzzy-query">fuzziness</a>: “AUTO”</code> to have a variable typo tolerance based on the length of the token that’s being analyzed. We also set <code>title^2</code> to increase the score of the documents where the match happens on the title field.</em></p><h4>summarize_and_cite tool</h4><p>This tool generates a summary based on documents retrieved in the previous search. It uses OpenAI’s <code>gpt-4o-mini</code> model to synthesize the most relevant information to answer the user’s question, providing responses derived directly from the search results. In addition to the summary, it also returns citation metadata for the source documents used.</p><pre><code>server.registerTool(
  "summarize_and_cite",
  {
    title: "Summarize and Cite",
    description:
      "Summarize the provided search results to answer a question and return citation metadata for the sources used.",
    inputSchema: {
      results: z
        .array(SearchResultSchema)
        .describe("Array of search results from search_docs"),
      question: z.string().describe("The question to answer"),
      max_length: z
        .number()
        .optional()
        .default(500)
        .describe("Maximum length of the summary in characters"),
      max_docs: z
        .number()
        .optional()
        .default(5)
        .describe("Maximum number of documents to include in the context"),
    },
    outputSchema: {
      summary: z.string(),
      sources_used: z.number(),
      citations: z.array(
        z.object({
          id: z.number(),
          title: z.string(),
          tags: z.array(z.string()),
          relevance_score: z.number(),
        })
      ),
    },
  },
  async ({ results, question, max_length, max_docs }) => {
    if (!results || results.length === 0 || !question) {
      return {
        content: [
          {
            type: "text",
            text: "Both results and question parameters are required, and results must not be empty",
          },
        ],
        isError: true,
      };
    }

    try {
      const used = results.slice(0, max_docs);

      const context = used
        .map(
          (r: SearchResult, i: number) =>
            `[Document ${i + 1}: ${r.title}]\n${r.content}`
        )
        .join("\n\n---\n\n");

      // Generate summary with OpenAI
      const completion = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
          {
            role: "system",
            content:
              "You are a helpful assistant that answers questions based on provided documents. Synthesize information from the documents to answer the user's question accurately and concisely. If the documents don't contain relevant information, say so.",
          },
          {
            role: "user",
            content: `Question: ${question}\n\nRelevant Documents:\n${context}`,
          },
        ],
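        // Rough token budget: max_length is in characters, and ~4 characters
        // per token is a common rule of thumb; capped at 1,000 tokens.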
        max_tokens: Math.min(Math.ceil(max_length / 4), 1000),
        temperature: 0.3,
      });

      const summaryText =
        completion.choices[0]?.message?.content ?? "No summary generated.";

      const citations = used.map((r: SearchResult) => ({
        id: r.id,
        title: r.title,
        tags: r.tags,
        relevance_score: r.score,
      }));

      const citationText = citations
        .map(
          (c: any, i: number) =>
            `[${i + 1}] ID: ${c.id}, Title: "${c.title}", Tags: ${c.tags.join(
              ", ",
            )}, Score: ${c.relevance_score.toFixed(2)}`,
        )
        .join("\n");

      const combinedText = `Summary:\n\n${summaryText}\n\nSources used (${citations.length}):\n\n${citationText}`;

      return {
        content: [
          {
            type: "text",
            text: combinedText,
          },
        ],
        structuredContent: {
          summary: summaryText,
          sources_used: citations.length,
          citations: citations,
        },
      };
    } catch (error: any) {
      return {
        content: [
          {
            type: "text",
            text: `Error generating summary and citations: ${error.message}`,
          },
        ],
        isError: true,
      };
    }
  }
);</code></pre><p>Finally, we need to start the server using <a href="https://github.com/modelcontextprotocol/typescript-sdk?tab=readme-ov-file#stdio">stdio</a>. This means the MCP client will communicate with our server by reading and writing to its standard input and output streams. stdio is the simplest transport option and works well for local MCP servers launched as subprocesses by the client. Add the following code at the end of the file:</p><pre><code>const transport = new StdioServerTransport();
server.connect(transport);</code></pre><p>Now compile the project using the following command:</p><pre><code>npx tsc index.ts --target ES2022 --module node16 --moduleResolution node16 --outDir ./dist --strict --esModuleInterop</code></pre><p>This will create a <code>dist</code> folder, and inside it, an <code>index.js</code> file.</p><h3>Load the MCP server into Claude Desktop</h3><p>Follow <a href="https://modelcontextprotocol.io/docs/develop/connect-local-servers">this guide</a> to configure the MCP server with Claude Desktop. In the Claude configuration file, we need to set the following values:</p><pre><code>{
  "mcpServers": {
    "elasticsearch-rag-mcp": {
      "command": "node",
      "args": [   "/Users/user-name/app-dir/dist/index.js"
      ],
      "env": {
        "ELASTICSEARCH_ENDPOINT": "your-endpoint-here",
        "ELASTICSEARCH_API_KEY": "your-api-key-here",
        "OPENAI_API_KEY": "your-openai-key-here"
      }
    }
  }
}</code></pre><p>The <code>args</code> value should point to the compiled file in the <code>dist</code> folder. You also need to set the environment variables in the configuration file with the exact same names defined in the code.</p><h3>Test it out</h3><p>Before executing each tool, click on <strong>Search and Tools</strong> to make sure that the tools are enabled. Here you can also enable or disable each one:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/172981c2a54adabc70d5819013c3007670935605-1999x1002.png" style="width: 100%;" /><p>Finally, let’s test the MCP server from the Claude Desktop chat and start asking questions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/03654c0f8c53c714f801fba8b25747071179209b-1999x1353.png" style="width: 100%;" /><p>For the question “<strong>Search for documents about authentication methods and role-based access control</strong>”, the <code>search_docs</code> tool is executed and returns the following results:</p><pre><code>Most Relevant Documents:
Access Control and Role Management (highest relevance) - This document covers role-based access control (RBAC) principles, including ensuring users only have necessary permissions, regular auditing of user roles, revoking inactive accounts, and implementing just-in-time access for sensitive operations.
User Authentication with OAuth 2.0 - This document explains OAuth 2.0 authentication, which enables secure delegated access without credential sharing. It covers configuring identity providers, token management with limited scope and lifetime, and secure storage of refresh tokens.
Container Security Guidelines - While primarily about container security, this document touches on access control aspects like running containers as non-root users and avoiding embedded credentials.
Incident Response Playbook - This mentions role assignment during incidents (incident commander, communications lead, etc.), which relates to access control in emergency scenarios.
Logging Standards for Microservices - This document includes guidance on avoiding logging sensitive information, which is relevant to authentication security.</code></pre><p>The response is, “Great! I found 5 relevant documents about authentication methods and role-based access control. Here&#x27;s what was found:”</p><p>The tool call returns the source documents as part of its response payload, which are later used to generate citations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25ff4c7e9563d99752f95540dafdc7fd211a66e3-800x530.gif" style="width: 100%;" /><p>It’s also possible to chain multiple tools in a single interaction. In this case, Claude Desktop analyzes the user&#x27;s question and determines that it needs to first call <code>search_docs</code> to retrieve relevant documents and to then pass those results to <code>summarize_and_cite</code> to generate the final answer, all without requiring separate prompts from the user:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6a8a46acb4236e77058f18bcd2f0737b5882c05-1999x1101.png" style="width: 100%;" /><p>In this case, for the query “<strong>What are the main recommendations to improve authentication and access control across our systems? Include references.</strong>”, we obtained the following results:</p><pre><code>Based on the documentation, here are the main recommendations to improve authentication and access control across your systems:

Key Recommendations

1. Implement Role-Based Access Control (RBAC)
Ensure users have only the permissions necessary for their job functions. This principle of least privilege minimizes security risks by limiting access to what's truly needed. [1]

2. Regular Access Audits
Conduct regular audits of user roles and promptly revoke access for inactive accounts. This helps prevent unauthorized access from dormant or compromised accounts. [1]

3. Just-in-Time (JIT) Access
Implement just-in-time access for sensitive operations, where permissions are granted only when needed and for a limited duration. This reduces the attack surface by minimizing standing privileges. [1]

4. OAuth 2.0 for Secure Authentication
Use OAuth 2.0 to enable secure delegated access without sharing user credentials. This protocol allows third-party applications to access resources on behalf of users without exposing passwords. [2]

5. Token Security and Management
- Configure your identity provider to issue tokens with limited scope and lifetime
- Store refresh tokens securely
- Validate access tokens consistently to maintain security [2]

References

Access Control and Role Management (Tags: security, access-control)
User Authentication with OAuth 2.0 (Tags: authentication, oauth)
These recommendations work together to create a defense-in-depth approach, where multiple security layers protect your systems from unauthorized access.</code></pre><p>As in the previous step, we can see the response from each tool for this question:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cb606d356b2f7d5e4878a5eff71bc881869ac0ee-800x585.gif" style="width: 100%;" /><p><em>Note: If a submenu appears asking whether you approve the use of each tool, select <strong>Always allow</strong> or <strong>Allow once</strong>.</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aea942ba9b0037526ea215bec65690f1a5c3099c-1522x250.png" style="width: 100%;" /><h2>Conclusion</h2><p>MCP servers represent a significant step toward standardizing LLM tools for both local and remote applications. Though full compatibility is still in the works, we’re moving fast in that direction.</p><p>In this article, we learned how to build a custom MCP server in TypeScript that connects Elasticsearch to LLM-powered applications. Our server exposes two tools: <code>search_docs</code>, which retrieves relevant documents using Query DSL, and <code>summarize_and_cite</code>, which generates summaries with citations via OpenAI models, all with Claude Desktop serving as the client UI.</p><p>The future of compatibility between different client and server providers looks promising. Next steps include adding more functionality and flexibility to your agent. There’s a practical <a href="https://www.elastic.co/search-labs/blog/llm-functions-elasticsearch-intelligent-query">article</a> on how you can parameterize your queries using search templates to gain precision and flexibility.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Announcing read-only permissions for Kibana dashboards]]></title>
            <description><![CDATA[Introducing read-only dashboards in Kibana, giving dashboard creators granular sharing controls to keep results accurate and protected from unwanted changes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-dashboards-read-only-permissions</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-dashboards-read-only-permissions</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Fabio Busatto, Teresa Alvarez Soler]]></dc:creator>
            <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11e578bc317aea30d2e10ccc0334a532f6af2ef9-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>You’ve been there. You spend an hour building the perfect dashboard to monitor your logs: every chart, every filter, and every label. You share it with your team. A few days later, you open it and something’s off. A colleague tweaked a query. Or someone changed the date range. Maybe they thought they were helping. Now you’re digging through revisions and second-guessing every number. Sound familiar?</p><p>That’s exactly why we built <strong>read-only dashboards</strong>. It’s the control you’ve been asking for. Share dashboards with confidence, without worrying that the next person with edit access will change or break them.</p><p>Note: Read-only permissions are available in Elastic Cloud Serverless and from version 9.3 for Elastic Cloud Hosted and Elastic Self-Managed.</p><h2>When “everyone can edit” gets in the way</h2><p>In Kibana, <em>sharing</em> has usually meant space-level permissions. If someone can create dashboards in a space, they can also edit or delete anyone else’s. That’s great for collaboration until it isn’t. One accidental edit can ripple into wrong decisions, lost trust, and a lot of cleanup.</p><p>We’ve heard the workarounds: <strong>“We put ‘read-only’ in the dashboard name and hope people notice.”</strong> Or: <strong>“We tag them and cross our fingers.”</strong> Hope isn’t a permission model. You needed a real way to lock a dashboard without locking everyone out of the space.</p><h2>What actually goes wrong</h2><p>Deb and Kevin both have edit access to the log monitoring dashboard within the Operations space. Kevin makes some changes to the charts. When Deb comes back, the numbers don’t match what she presented. She has to track down what changed (often from memory), fix it, and wonder how many reports went out with bad data.</p><h2>Read-only dashboards: Ownership and control that make sense</h2><p>Read-only dashboards fix this by giving you control to decide whether other users can edit the dashboard. When you share a dashboard, you choose: <strong>edit</strong> (default, same as today) or <strong>view</strong>. In <strong>view</strong> mode, only you (and Kibana admins) can change or delete it. Everyone else can open it, use it, and trust it, but they can’t modify it.</p><h3>What you get</h3><ul><li><strong>Dashboard integrity:</strong> In <strong>view</strong> mode, other users with edit access in the space cannot modify or delete the dashboard. If they try, they’re told it’s locked. Your charts and logic stay as you left them.</li><li><strong>You stay in control:</strong> You’re the owner. You can always edit, refine, and update. Sharing as view-only doesn’t lock you out; it locks in the version everyone else sees.</li><li><strong>Flexible lifecycle:</strong> You can switch a dashboard back to “can edit” anytime. And Kibana admins can still manage all dashboards (for example, if the owner leaves). No dead ends.</li></ul><p>You can share finalized, mission-critical dashboards widely and know they’ll stay consistent. 
This is available in <strong>all Elastic tiers and offerings</strong>, including Serverless.</p><h3>Who can do what?</h3><p>Quick reference by role:</p><ul><li><strong>Dashboard owner:</strong> You created it; you have full edit access.</li><li><strong>Kibana admin:</strong> Can manage all dashboards.</li><li><strong>User with space edit:</strong> Can create and edit their dashboards; can’t edit or delete view-only dashboards.</li><li><strong>User with space view:</strong> Can only view (and list) dashboards.</li></ul>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Action</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard owner</td><td style="border: 1px solid #ccc; padding: 4px;">Kibana admin</td><td style="border: 1px solid #ccc; padding: 4px;">User with space edit</td><td style="border: 1px solid #ccc; padding: 4px;">User with space view</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">List and view dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Create new dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Modify/delete editable dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Modify/delete read-only dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
              </tbody>
            </table>
          <h2>How to turn on read-only</h2><p>You can set view-only when you save a new dashboard or later from the share menu.</p><h3>When saving a new dashboard</h3><ul><li>Build your dashboard, and click <strong>Save</strong>.</li><li>In the “Save as new dashboard” modal, find <strong>Permissions</strong>.</li><li>Change from <strong>Can edit</strong> to <strong>Can view</strong>.</li><li>Click <strong>Save</strong>. Done. It’s read-only for everyone else.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42a71d1bb55f9d50bd079f53bf45a0e1999b27f7-1214x1306.png" style="width: 100%;" /><h3>For a dashboard you already own</h3><ul><li>Open the dashboard.</li><li>Open the <strong>Share dashboard</strong> menu.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8405938557c879b1d4c262b98cf5a7f66408c04-1246x264.png" style="width: 100%;" /><ul><li>In the sharing modal, find <strong>Permissions</strong> and switch to <strong>Can view</strong>. The change applies immediately; other users in the space can no longer edit or delete it.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/849405bc32701f3ebe0def012d8ae3cf3813ea0a-996x750.png" style="width: 100%;" /><ul><li>You can mouse over the <strong>Share</strong> action to see what type of permissions a given dashboard has.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80541ddb1b1bc567b0aeff693944ea8b6871d6a7-1270x320.png" style="width: 100%;" /><h3>Seeing which dashboards are locked</h3><p>On the main Dashboards list, dashboards you can’t edit or delete have a disabled selection checkbox. This provides an easy way to spot what’s view-only.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24b2eba08df86174db949c662e7886c5aea1b460-1999x876.png" style="width: 100%;" /><p>In the dashboard, you will also find that the Edit action is disabled and a tooltip will appear, explaining that the dashboard has been set as view-only.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0a2f9da6dc854e876fc6dc2a7c3ef8b313b52ef-1358x330.png" style="width: 100%;" /><h2>Try it</h2><p>Read-only dashboards are available now. Create a dashboard, flip it to <strong>Can view</strong>, and share it. Your team gets a single source of truth, and you get peace of mind. No more “please don’t edit” in the title.</p><p>We’d love to hear how you use read-only dashboards. Share your feedback in our <a href="https://discuss.elastic.co">community forum</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The shell tool is not a silver bullet for context engineering]]></title>
            <description><![CDATA[Learn what context-retrieval tools exist for context engineering, how they work, and their trade-offs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/search-tools-context-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/search-tools-context-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Leonie Monigatti]]></dc:creator>
            <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f91e4d07915ba7bf3b7abf15fac8fab3350f7df2-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>The most important tools an agent has are the search tools it can use to build its own context. Recent posts by <a href="https://www.llamaindex.ai/blog/files-are-all-you-need">LlamaIndex</a> and <a href="https://x.com/hwchase17/status/2011814697889316930">LangChain</a> have sparked a discussion: <em>Are a shell tool and a filesystem all an agent needs for context engineering? </em>Unfortunately, the discussion quickly drifted to the wrong focus: filesystem versus database.</p><p>This post refocuses on the question, <em>What are the right search interfaces an agent needs to build its own context?</em> It first covers the trade-offs between shell tools and dedicated database tools. From there, it offers a practical framework for finding the right interfaces for your agent&#x27;s needs.</p><h2>What does &quot;building context&quot; actually mean for an agent?</h2><p>In early <a href="https://www.elastic.co/what-is/retrieval-augmented-generation">retrieval augmented generation (RAG) pipelines</a>, the developer engineered a fixed retrieval pipeline, and the large language model (LLM) was a passive recipient of the context. This was a fundamental limitation: Context was retrieved on every query, whether or not it was needed, with no check that it actually helped.</p><p>With the shift to agentic RAG, agents now have access to a set of search tools to build their own context. For example, both Claude Code [1] and Cursor [2] let the agent choose between different search tools and even combine them for chained queries, depending on what the task actually requires.</p><h2>What search interfaces exist for context engineering?</h2><p>Context can live in different locations, such as on the web, in a local filesystem, or in a database. An agent can interact with each of these out-of-context data sources through different tools:</p><ul><li><strong>Shell tools</strong> can execute shell commands and have access to the local filesystem. Some examples of built-in shell tools are <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/bash-tool">Claude API&#x27;s bash tool</a>, <a href="https://docs.openclaw.ai/tools/exec">OpenClaw&#x27;s exec tool</a>, and <a href="https://docs.langchain.com/oss/python/integrations/tools/bash">LangChain&#x27;s shell tool</a>.</li><li><strong>Dedicated database tools,</strong> such as tools from a Model Context Protocol (MCP) server (for example, the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP server</a>) or custom tools (for example, <code>run_esql(query)</code> or <code>db_list_index()</code>), can query databases.</li><li><strong>Dedicated file search tools</strong> can search and read local (or uploaded) files (without full shell access). 
Some examples of built-in file search tools are <a href="https://ai.google.dev/gemini-api/docs/file-search">Gemini API’s File Search Tool</a> or <a href="https://developers.openai.com/api/docs/guides/tools-file-search">OpenAI’s File Search Tool</a>.</li><li><strong>Web search tools</strong> can retrieve information from the web.</li><li><strong>Memory tools</strong> store and recall from long-term memory (regardless of how it’s stored).</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/115f20c8ded259e508f51524b2c06bdc702d70ab-1999x1050.png" style="width: 100%;" /><p>As you can see, the shell tool is versatile and can be used to retrieve context from different data sources, including:</p><ul><li><strong>Filesystem:</strong> The agent explores the directory structure (ls, find), searches for relevant content (grep, cat), and repeats until it has built sufficient context.</li><li><strong>Database:</strong> The agent can use database command line interface (CLI) tools (for example, <a href="https://www.elastic.co/docs/reference/query-languages/sql/sql-cli"><code>elasticsearch-sql-cli</code></a>), call HTTP APIs via curl, or run scripts, which is especially useful in combination with agent skills, which are reusable, documented examples injected into the agent&#x27;s context to guide correct tool usage (for example, <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills for Elasticsearch</a>).</li><li><strong>Web:</strong> The agent can execute web searches via a curl command through a search provider’s API.</li></ul><p>However, the shell tool provides direct system access and therefore requires safety measures, such as running in an isolated sandbox environment and logging all executed commands.</p><h2>When to use which search interfaces</h2><p>The right search interface depends on your data, your query patterns, and your use case. This section serves as a practical starting point.</p><h3>Filesystems aren’t making databases obsolete</h3><p>The filesystems-versus-databases discussion is not about the storage layer. For example, LangChain explains that <a href="https://x.com/hwchase17/status/2011814697889316930">its memory system</a> doesn’t actually store memory in a real filesystem. Instead, it stores memory in a database and <em>represents</em> it as a set of files to the agent [3].</p><p>Filesystems are a natural fit for file-native use cases, such as coding agents. They also work well as a temporary scratch pad or working memory and for single-user or single-agent scenarios where concurrency isn&#x27;t a concern. In these cases, a physical filesystem or representing the data as a filesystem gives you flexibility before committing to a purpose-built interface.</p><p>But filesystem storage has real downsides, such as weak concurrency, manual schema enforcement, and a lack of atomic transactions. These become more apparent when your application needs to scale or move to a multi-agent scenario. Anyone who ignores these downsides is doomed to <a href="https://dx.tips/oops-database">painfully reinvent worse databases</a> without the decades of engineering behind transaction safety or access control that production databases already provide. Additionally, in most enterprise contexts, you don&#x27;t choose whether to use a database since it&#x27;s already there, storing business-critical data.</p><h3>Shell tool + filesystem</h3><p>A shell tool is the natural starting point for filesystem search. 
Currently, coding agents are driving a lot of progress in the field. Because they work with code in local files, they’re naturally file-heavy use cases. Therefore, LLMs are fine-tuned in the post-training stage for coding tasks. That’s why many LLMs are not only good at writing code but also at using shell commands and navigating filesystems.</p><p>Using a shell tool with built-in CLIs, like <code>ls</code> and <code>grep</code>, to find files is effective. With grep, a query like &quot;Find all files that import <code>matplotlib</code>&quot; is fast, precise, and cheap. But when the agent needs to handle conceptual queries, such as &quot;How does our app handle failed authentication?&quot;, pattern matching with grep can hit a ceiling quickly. Several alternatives that bring semantic search capabilities to the command line have emerged to fill this gap, including <a href="https://github.com/jina-ai/jina-grep-cli"><code>jina-grep</code></a>.</p><p>However, grep and many of its semantic search alternatives run in O(n) over the corpus. For use cases over codebases, this might be fine. However, if your data grows, latency will become noticeable. In this case, an indexed datastore becomes necessary to maintain performance.</p><h3>Shell tool + database</h3><p>Another way to add more search capabilities, such as semantic or hybrid search, over your data is to store it in a database, as Cursor does, for example. Additionally, when data requires complex relational joins or aggregations, a database interface is nonnegotiable.</p><p>When the data lives in a database rather than on the filesystem, a shell tool can serve as a lightweight database interface for certain use cases. If your queries are simple enough for a CLI or a curl call, a dedicated database tool may add unnecessary complexity.</p><p>This approach is also suitable in early exploration stages, when you don&#x27;t yet know what query patterns your agent will actually develop. In this case, Agent Skills can give the agent enough structure to query correctly without committing to a purpose-built tool. However, when the agent requires many iterations to figure out the right way to query the database for repeated tasks, the token overhead of using a shell tool as the interface no longer justifies the simplicity benefit of avoiding an extra tool.</p><h3>Dedicated database tool</h3><p>Especially when repeated query patterns are structured or analytical, dedicated database tools become necessary. A <a href="https://vercel.com/blog/testing-if-bash-is-all-you-need">blog post from Vercel and Braintrust</a> compared agents with different sets of search tools for real-world retrieval tasks over semi-structured data, such as customer support tickets and sales call transcripts (for example, “How many open issues mention &#x27;security&#x27;?&quot; or &quot;Find issues where someone reported a bug and later someone submitted a PR claiming to fix it?&quot;) [4].</p><p>Agents with dedicated database tools used fewer tokens, were faster, and made fewer mistakes than agents with only a shell tool and filesystem. The lesson is that direct database tools are the right choice when the query requires analytical reasoning over semi-structured data.</p><h3>Combining search interfaces</h3><p>No single search interface handles every query well. For example, Cursor combines shell tools (for searches via grep) and semantic search tools and lets the agent select the right tool based on the user’s prompt. 
They report that the agent chooses grep for matching specific symbols or strings, semantic search for conceptual or behavior questions, and both for exploratory tasks.</p><p>The Vercel experiment reports the same: Its hybrid agent with access to both a shell tool and a dedicated database tool achieved the best performance out of all tested agents by first using the dedicated database tools and then verifying the results by grepping through the filesystem. However, this approach uses more tokens and time for reasoning about tool choice and verification.</p><p>The pattern across both examples is the same: Composition beats any single interface, but composition comes at the trade-off of added cost and latency.</p><h2>Practical recommendations for finding the right set of tools</h2><p>The right set of search interfaces is small, purposeful, and specific to your agent&#x27;s actual query patterns. The current best practice is to have an agent with as few tools as possible instead of having an agent with hundreds of MCP tools. This is because the downside of exposing all possible tools up front is that it bloats the context window and confuses the agent about which tool to actually use. For example, Claude Code reportedly only has about 20 tools.</p><p>Instead, the idea of progressive disclosure is to start with a minimal set of tools and let the agent discover additional capabilities only when needed. Research from Anthropic [5] and Cursor [6] has shown that this approach yields token savings between 47% and 85%. Claude Code, for example, implements this directly, allowing the agent to incrementally discover how to query an API or a database, without that knowledge consuming context on every LLM call.</p><p>Once you’re familiar with the agent&#x27;s query patterns, you can revisit the set of search tools that the agent has access to by default. A useful way to think about this trade-off is the <a href="https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering#building-the-right-database-retrieval-tools-%5C(%E2%80%9Clow-floor,-high-ceiling%E2%80%9D%5C">&quot;low floor, high ceiling&quot; principle</a> for deciding which tools make the cut. High-ceiling tools don&#x27;t limit the agent&#x27;s potential. For example, a versatile shell tool lets the agent write full database queries, including ambiguous ones, but at the cost of reasoning overhead, higher latency, and lower reliability.</p><p>Low-floor tools are the opposite. They’re specialized tools that wrap specific queries and are immediately accessible to the agent with minimal reasoning overhead, producing lower cost and higher reliability. But they need upfront engineering, can&#x27;t cover every possible query, and can make it harder for the agent to choose the right tool.</p><p>Think of each tool on a spectrum: Low-floor tools are easy for the agent to use correctly but narrow in scope. High-ceiling tools are versatile but demand more reasoning to use well.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d1b973be4b0a0a25c99c74f02a47e98395a3f7-1200x630.png" style="width: 100%;" /><p>Most agents need a mix of different search tools. But each tool needs to earn its addition. We recommend starting with an all-purpose search tool (for example, a <code>search_database()</code> tool or a shell tool). Then reuse the command logs you&#x27;re already keeping for security purposes to track what your agent actually does, including tool calls, retries, and number of calls per user query. 
And, when you see a query pattern repeating or failing, that&#x27;s the signal to build a purpose-built tool for it.</p><h2>Summary</h2><p>The filesystem-versus-database debate is distracting from the actual question that engineers need to be asking: <em>What are the right search interfaces an agent needs to build its own context?</em> The answer is most likely, <em>Not a single one</em>.</p><p>A shell tool is a versatile tool to interact with different out-of-context sources and thus a good starting point. But it’s less efficient and accurate for use cases with structured analytical queries than dedicated database tools.</p><p>The goal is to find the minimal set of search tools that handles your agent&#x27;s actual query patterns well. Start with a shell tool, and log what your agent actually does. When you see a query pattern repeating and failing, it’s time to engineer specialized tools.</p><h2>References</h2><p>1. Thariq (Anthropic). <a href="https://x.com/trq212/status/2027463795355095314">Lessons from Building Claude Code: Seeing like an Agent</a> (2026).</p><p>2. Cursor: Documentation. <a href="https://cursor.com/docs/agent/tools/search">Semantic &amp; agentic search</a> (2026).</p><p>3. Harrison Chase (LangChain). <a href="https://x.com/hwchase17/status/2011814697889316930">How we built Agent Builder’s memory system</a> (2026).</p><p>4. Ankur Goyal (Braintrust) and Andrew Qu (Vercel). <a href="https://vercel.com/blog/testing-if-bash-is-all-you-need">Testing if &quot;bash is all you need&quot;</a> (2026).</p><p>5. Anthropic. <a href="https://www.anthropic.com/engineering/advanced-tool-use">Introducing advanced tool use on the Claude Developer Platform</a> (2025).</p><p>6. Cursor. <a href="https://cursor.com/blog/dynamic-context-discovery">Dynamic context discovery</a> (2026).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch replicas for load balancing in Serverless]]></title>
            <description><![CDATA[Learn how Elastic Cloud Serverless automatically adjusts index replicas based on search load, ensuring optimal query performance without manual configuration.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-replicas-load-balancing-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-replicas-load-balancing-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Andrei Dan]]></dc:creator>
            <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c4c1e99b892f026b7aba098973593f8298e2ea6-1280x717.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In Elastic Cloud Serverless, we automatically adjust the number of replicas for your indices based on search load, ensuring optimal query performance without any manual configuration. In this blog, we’ll explain how replicas are scaled, when the system adds or removes them, and what this means for your indices.</p><h2>The party is getting crowded</h2><p>You&#x27;re hosting a pizza party. You&#x27;ve got a few friends helping you serve, each stationed at different spots around the room. You give each friend a pizza, and they start handing out slices to hungry guests as they arrive.</p><p>At first, things run smoothly. A few guests trickle in, your friends serve slices, everyone&#x27;s happy. But then word spreads about your sourdough pizzas. The doorbell keeps ringing. Guests pour in. Soon, there&#x27;s a crowd forming around one of your friends, the one holding the pepperoni pizza, which everyone seems to want.</p><p>Your friend with the pepperoni pizza is overwhelmed. Guests are waiting, getting impatient, and a large queue has formed. Meanwhile, your friend holding the margherita pizza is standing around with barely anyone asking for a slice.</p><p>What do you do?</p><p>You order a couple more pepperoni pizzas and hand them to other friends. Now three friends are holding pepperoni instead of one. The crowd spreads out, and suddenly you can serve three times as many guests at once.</p><p>A few things become clear as you host more parties:</p><ul><li><strong>Not all pizzas are equally popular.</strong> Some are in high demand, others have fewer takers. You don&#x27;t need extra &quot;copies&quot; of the unpopular ones. You need extras of the ones with queues.</li><li><strong>Order more pizzas before the queue gets too long.</strong> If you wait until your friend is completely overwhelmed and guests are leaving angry, you&#x27;ve waited too long. Better to get an extra pizza when you see a crowd forming.</li><li><strong>Don&#x27;t throw away pizzas too quickly.</strong> Just because the crowd around the pepperoni thinned out for five minutes doesn&#x27;t mean the rush is over. Maybe they&#x27;re just refilling drinks, or even talking among themselves (is that still a thing?). Keep the extra pizzas ready. If the lull continues for a while, then you can put them away.</li><li><strong>You can only hand out as many pizzas as you have friends who are helping.</strong> If you&#x27;ve only got four friends helping, ten pizzas won’t change the outcome. Only four can be served at once. Match your pizza count to your available hands.</li><li><strong>When a friend leaves, take their pizza.</strong> If one of your friends needs to head out, grab their pizza immediately. You can&#x27;t have pizzas sitting unattended. Hand it to someone else, or put it away.</li></ul><h2>From pizzas to replicas</h2><p>Let&#x27;s map this back to Elasticsearch.</p><p>In our analogy, pizzas are replicas (copies of your index shards), your friends helping serve are search nodes, hungry guests are search queries, and that popular pizza with a crowd around it is a hot index with high search load.</p><p>When search traffic increases on a particular index, we create additional replicas and distribute them across your search nodes. Any replica can serve any query for that index, just like any friend holding pepperoni can hand out pepperoni slices. 
More replicas means higher throughput: Three replicas can handle three times the queries per second of a single replica.</p><h2>Measuring the hunger</h2><p>Before we decide how many pizzas to order, we need to know how hungry the crowd is.</p><p>Elasticsearch tracks the <strong>search load</strong> for every shard. It&#x27;s a metric that captures how much search activity a shard is handling. We aggregate this across all shards of an index to understand the total search demand.</p><p>What matters most is the <strong>relative search load</strong>: What proportion of your project&#x27;s total search traffic is hitting each index? If one index is receiving 60% of all searches while another gets 5%, we know where to add capacity.</p><h2>The math behind the pizzas</h2><p>We calculate the optimal number of replicas following this formula:</p><pre><code>desired_replicas = min(ceil(L × N / (S × X)), N)</code></pre><p>Where:</p><ul><li><strong>L</strong> = the index&#x27;s relative search load (between 0 and 1).</li><li><strong>N</strong> = the number of desired search nodes in your project.</li><li><strong>S</strong> = the number of shards in the index.</li><li><strong>X</strong> = a threshold to avoid hot spots (default: 0.5).</li></ul><p>An example: four search nodes, one index with two primary shards receiving 80% of search traffic:</p><pre><code>desired_replicas = min(ceil(0.8 × 4 / (2 × 0.5)), 4)
                 = min(ceil(3.2), 4)
                 = min(4, 4)
                 = 4</code></pre><p>This hot index gets four replicas distributed across the search nodes.</p><p>The threshold X (defaulting to 0.5) is important. We don&#x27;t wait until a replica is completely overwhelmed; we scale up when it&#x27;s at half capacity. Hand out the extra pizza when you see the crowd forming, not when guests are already leaving.</p><h2>Scale up fast, scale down slow</h2><p>When search load increases, we add replicas immediately. No reason to make users wait.</p><p>When search load drops, we wait a bit before taking any action. We need to see consistent low demand for about 30 minutes before reducing replicas. (This is to deal with spiky traffic where a quiet moment doesn&#x27;t mean the party is over.)</p><p>This matters because adding a replica has a cost. The new replica copies data and warms its caches before serving queries efficiently. Removing replicas too eagerly means constantly paying this startup cost as traffic naturally fluctuates.</p><h2>Respecting topology bounds</h2><p>Replicas can never exceed the number of search nodes. Having more replicas than nodes provides no benefit (you can only serve as many pizzas as you have friends who are helping to serve slices).</p><p>When nodes are removed from your project, we reduce replicas immediately to match. No waiting for the cooldown, as you can&#x27;t have unassigned replicas. The moment a friend leaves, we remove their pizza.</p><h2>The bigger Serverless picture</h2><p>Replicas for search load balancing work alongside other autoscaling systems:</p><ul><li><strong>Search autoscaling</strong> adjusts the number of search nodes (how many friends are helping).</li><li><strong>Replicas for search load balancing</strong> distribute traffic by adjusting replica counts per index (how many pizzas of each kind we need).</li><li><strong>Data stream autosharding</strong> optimizes shard counts for writes (how to slice each pizza, covered in the <a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">previous post</a>).</li></ul><p>An important design principle: Replicas for load balancing don&#x27;t directly trigger search autoscaling. Instead, by distributing search requests across more replicas, they increase resource utilization across your search nodes. This higher utilization then triggers our existing autoscaling logic to add capacity if needed. Replicas for load balancing enable autoscaling to do its job, making sure your search nodes are actually being used, rather than having all traffic bottlenecked on a single replica while other nodes sit idle.</p><h2>What this means for you</h2><p>You don&#x27;t need to predict which indices will be popular. You don&#x27;t need to manually adjust replicas when traffic patterns change. You don&#x27;t need to wake up at 3 a.m. because a surge overwhelmed your busiest index.</p><p>The system watches where queues are forming and orders more pizzas for those spots. Cold indices don&#x27;t waste resources on unnecessary replicas. Hot indices get the capacity they need. Your budget goes where it matters.</p><h2>Conclusion</h2><p>In the <a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">autosharding post</a>, we made sure your pizzas are sliced right. 
Now, with replicas for search load balancing, we make sure you have enough pizzas, in the right hands, when the hungry crowds arrive.</p><p>Try <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and let us handle the pizza logistics.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Elasticsearch Inference API along with Hugging Face models]]></title>
            <description><![CDATA[Learn how to connect Elasticsearch to Hugging Face models using inference endpoints, and build a multilingual blog recommendation system with semantic search and chat completions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hugging-face-elasticsearch-inference-api</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hugging-face-elasticsearch-inference-api</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1417d6ff033712206c9bd4bcc22074ee3437ce96-1999x1125.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>In recent updates, Elasticsearch introduced a native integration to connect to models hosted on the <a href="https://endpoints.huggingface.co/">Hugging Face Inference Service</a>. In this post, we’ll explore how to configure this integration and perform inference through simple API calls using a large language model (LLM). We’ll use <a href="https://huggingface.co/HuggingFaceTB/SmolLM3-3B">SmolLM3-3B</a>, a lightweight general-purpose model with a good balance between resource usage and answer quality.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ddadf1976421a860a7d62087239adb9150d808b-1999x1388.png" style="width: 100%;" /><h2>Prerequisites</h2><ul><li><strong>Elasticsearch 9.3 or Elastic Cloud Serverless: </strong>You can create a cloud deployment following <a href="https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud">these instructions</a>, or you can use the <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart#local-dev-quick-start"><code>start-local</code></a> quickstart instead.</li><li><strong>Python 3.12: </strong>Download Python <a href="https://www.python.org/">here</a>.</li><li><strong>Hugging Face </strong><a href="https://huggingface.co/docs/hub/en/security-tokens">access token</a>.</li></ul><h2>Chat completions using a Hugging Face inference endpoint</h2><p>First, we’ll build a practical example that connects Elasticsearch to a Hugging Face <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">inference endpoint</a> to generate AI-powered recommendations from a collection of blog posts. For the app knowledge base, we’ll use a dataset of company blog articles, which contains valuable but often hard-to-navigate information.</p><p>With this endpoint, <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> retrieves the most relevant articles for a given query, and a Hugging Face LLM generates short, contextual recommendations based on those results.</p><p>Let’s take a look at a high-level overview of the information flow we’re going to build:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1dfbc2323438feaaa42e13ab242dd1f7166f74aa-1200x676.png" style="width: 100%;" /><p>In this article, we’ll test <strong>SmolLM3-3B</strong>’s capacity to combine its compact size with strong multilingual reasoning and tool-calling capabilities. 
Based on a search query, we’ll send all the matching content (in English and Spanish) to the LLM to generate a list of recommended articles with a custom-made description based on the search query and results.</p><p>Here’s what the UI of an article site with an AI recommendation generation system could look like.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d3b86b212f28ff279f2da67a33e6134039f0e4e-1999x949.png" style="width: 100%;" /><p>You can find the full implementation of this application in the linked <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/notebook.ipynb">notebook</a>.</p><h3>Configuring Elasticsearch inference endpoints</h3><p>To use the Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-hugging-face">Hugging Face inference endpoint</a>, we need two important elements: a Hugging Face API key and a running Hugging Face endpoint URL. It should look like this:</p><pre><code>PUT _inference/chat_completion/hugging-face-smollm3-3b
{
    "service": "hugging_face",
    "service_settings": {
        "api_key": "hugging-face-access-token", 
        "url": "url-endpoint" 
    }
}</code></pre><p>The Hugging Face inference endpoint in Elasticsearch supports different task types: <code>text_embedding</code>, <code>completion</code>, <code>chat_completion</code>, and <code>rerank</code>. In this blog post, we use <code>chat_completion</code> because we need the model to generate conversational recommendations based on the search results and a system prompt. This endpoint allows us to perform chat completions directly from Elasticsearch in a simple way using the Elasticsearch API:</p><pre><code>POST _inference/chat_completion/hugging-face-smollm3-3b/_stream
{
  "messages": [
      { "role": "user", "content": "<user prompt>" }
  ]
}</code></pre><p>This will serve as the core of the application, receiving the prompt and the search results that will pass through the model. With the theory covered, let’s start implementing the application.</p><h4>Setting up an inference endpoint on Hugging Face</h4><p>To deploy the Hugging Face model, we’re going to use <a href="https://huggingface.co/inference-endpoints/dedicated">Hugging Face one-click deployments</a>, an easy and fast service for deploying model endpoints. Keep in mind that this is a paid service, and using it may incur additional costs. This step will create the model instance that will be used to generate the article recommendations.</p><p>You can pick a model from the one-click catalog:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b816e9fba1fe172687bf58f5143fb1f838c1077f-549x331.png" style="width: 100%;" /><p>Let’s pick the <strong>SmolLM3-3B</strong> model:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/610d3aba0429f3666c2df3616d513eb6a4397c0c-502x478.png" style="width: 100%;" /><p>From here, grab the Hugging Face endpoint URL:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/025094ddb2cfbd1f0f216a5ec4e119b0f4fa2c42-646x328.png" style="width: 100%;" /><p>As mentioned in the Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-hugging-face">Hugging Face inference endpoints documentation</a>, text generation requires a model that’s compatible with the OpenAI API. For that reason, we need to append the <code>/v1/chat/completions</code> subpath to the Hugging Face endpoint URL. The final result will look like this:</p><pre><code>https://j2g31h0futopfkli.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions</code></pre><p>With this in place, we can start coding in a Python notebook.</p><h4>Generating a Hugging Face API key</h4><p>Create a <a href="https://huggingface.co/join">Hugging Face account</a>, and obtain an API token by following <a href="https://huggingface.co/docs/hub/en/security-tokens#user-access-tokens">these instructions</a>. You can choose between three token types: <em>fine-grained</em> (recommended for production, as it provides access only to specific resources); <em>read</em> (for read-only access); or <em>write</em> (for read and write access). For this tutorial, a read token is sufficient, since we only need to call the inference endpoint. Save this key for the next step.</p><h4>Setting up the Elasticsearch inference endpoint</h4><p>First, let’s declare an Elasticsearch Python client:</p><pre><code>import os

from elasticsearch import Elasticsearch

os.environ["ELASTICSEARCH_API_KEY"] = "your-elasticsearch-api-key"
os.environ["ELASTICSEARCH_URL"] = "https://xxxx.us-central1.gcp.cloud.es.io:443"

es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"], api_key=os.environ["ELASTICSEARCH_API_KEY"]
)</code></pre><p>Next, let’s create an Elasticsearch inference endpoint that uses the Hugging Face model. This endpoint will allow us to generate responses based on the blog posts and the prompt passed to the model.</p><pre><code>INFERENCE_ENDPOINT_ID = "smollm3-3b-pnz"

os.environ["HUGGING_FACE_INFERENCE_ENDPOINT_URL"] = (
 "https://j2g31h0futopfkli.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions"
)
os.environ["HUGGING_FACE_API_KEY"] = "hf_xxxxx"

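# Register the Hugging Face endpoint as an Elasticsearch inference endpoint;
# the chat_completion task type matches the _inference path used later.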
resp = es_client.inference.put(
        task_type="chat_completion",
        inference_id=INFERENCE_ENDPOINT_ID,
        body={
            "service": "hugging_face",
            "service_settings": {
                "api_key": os.environ["HUGGING_FACE_API_KEY"],
                "url": os.environ["HUGGING_FACE_INFERENCE_ENDPOINT_URL"],
            },
        },
    )</code></pre><h3>Dataset</h3><p>The dataset contains the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/dataset.json">blog posts</a> that will be queried, representing a multilingual content set used throughout the workflow:</p><pre><code>// Articles dataset document example: 
{
    "id": "6",
    "title": "Complete guide to the new API: Endpoints and examples",
    "author": "Tomas Hernandez",
    "date": "2025-11-06",
    "category": "tutorial",
    "content": "This guide describes in detail all endpoints of the new API v2. It includes code examples in Python, JavaScript, and cURL for each endpoint. We cover authentication, resource creation, queries, updates, and deletion. We also explain error handling, rate limiting, and best practices. Complete documentation is available on our developer portal."
  }</code></pre><h4>Elasticsearch mappings</h4><p>With the dataset defined, we need to create a data schema that properly fits the blog post structure. The following <a href="https://www.elastic.co/docs/manage-data/data-store/mapping">index mappings</a> will be used to store the data in Elasticsearch:</p><pre><code>INDEX_NAME = "blog-posts"

mapping = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "title": {
                "type": "object",
                "properties": {
                    "original": {
                        "type": "text",
                        "copy_to": "semantic_field",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                    "translated_title": {
                        "type": "text",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                },
            },
            "author": {"type": "keyword", "copy_to": "semantic_field"},
            "category": {"type": "keyword", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "date": {"type": "date"},
            "semantic_field": {"type": "semantic_text"},
        }
    }
}


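# Create the index; copy_to funnels title, author, category, and content into
# semantic_field, which is indexed as a semantic_text field for semantic search.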
es_client.indices.create(index=INDEX_NAME, body=mapping)</code></pre><p>Here, we can see more clearly how the data is structured. We’ll use semantic search to retrieve results based on natural language, along with the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to"><code>copy_to</code></a> property to copy the field contents into the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> field. Additionally, the <code>title</code> field contains two subfields: the <code>original</code> subfield stores the title in either English or Spanish, depending on the original language of the article; and the <code>translated_title</code> subfield is present only for Spanish articles and contains the English translation of the original title.</p><h3>Ingesting data</h3><p>The following code snippet ingests the blog posts dataset into Elasticsearch using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/javascript/bulk_examples">bulk API</a>:</p><pre><code>import json

from elasticsearch import helpers


def build_data(json_file, index_name):
    with open(json_file, "r") as f:
        data = json.load(f)

    for doc in data:
        action = {"_index": index_name, "_source": doc}
        yield action


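# helpers.bulk consumes the generator lazily, indexing documents in batches
# without loading the whole dataset into memory.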
try:
    success, failed = helpers.bulk(
        es_client,
        build_data("dataset.json", INDEX_NAME),
    )
    print(f"{success} documents indexed successfully")

    if failed:
        print(f"Errors: {failed}")
except Exception as e:
    print(f"Error: {str(e)}")</code></pre><p>Now that we have the articles ingested into Elasticsearch, we need to create a function capable of searching against the <code>semantic_text</code> field:</p><pre><code>def perform_semantic_search(query_text, index_name=INDEX_NAME, size=5):
    try:
        query = {
            "query": {
                "match": {
                    "semantic_field": {
                        "query": query_text,
                    }
                }
            },
            "size": size,
        }

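        # A match query against a semantic_text field automatically runs
        # semantic search using the field's associated inference endpoint.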
        response = es_client.search(index=index_name, body=query)
        hits = response["hits"]["hits"]

        return hits
    except Exception as e:
        print(f"Semantic search error: {str(e)}")
        return []</code></pre><p>We also need a function that calls the inference endpoint. In this case, we’ll call the endpoint using the <strong><code>chat_completion</code></strong> task type to get streaming responses:</p><pre><code>import requests

# Assumption: reuse the endpoint and API key set via os.environ earlier,
# aliased to plain names for the REST calls below
ELASTICSEARCH_URL = os.environ["ELASTICSEARCH_URL"]
ELASTICSEARCH_API_KEY = os.environ["ELASTICSEARCH_API_KEY"]


def stream_chat_completion(messages: list, inference_id: str = INFERENCE_ENDPOINT_ID):
    url = f"{ELASTICSEARCH_URL}/_inference/chat_completion/{inference_id}/_stream"
    payload = {"messages": messages}
    headers = {
        "Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}",
        "Content-Type": "application/json",
    }

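    # The _stream endpoint returns server-sent events; each "data: " line
    # carries an OpenAI-style chunk holding the next content delta.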
    try:
        response = requests.post(url, json=payload, headers=headers, stream=True)
        response.raise_for_status()

        for line in response.iter_lines(decode_unicode=True):
            if line:
                line = line.strip()

                if line.startswith("event:"):
                    continue

                if line.startswith("data: "):
                    data_content = line[6:]

                    if not data_content.strip() or data_content.strip() == "[DONE]":
                        continue

                    try:
                        chunk_data = json.loads(data_content)

                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
                            choice = chunk_data["choices"][0]
                            if "delta" in choice and "content" in choice["delta"]:
                                content = choice["delta"]["content"]
                                if content:
                                    yield content

                    except json.JSONDecodeError as json_err:
                        print(f"\nJSON decode error: {json_err}")
                        print(f"Problematic data: {data_content}")
                        continue

    except requests.exceptions.RequestException as e:
        yield f"Error: {str(e)}"</code></pre><p>Now we can write a function that calls the semantic search function, along with the <code>chat_completions</code> inference endpoint and the recommendations endpoint, to generate the data that will be allocated in the cards:</p><pre><code>def recommend_articles(search_query, index_name=INDEX_NAME, max_articles=5):
    print(f"\n{'='*80}")
    print(f"🔍 Search Query: {search_query}")
    print(f"{'='*80}\n")

    articles = perform_semantic_search(search_query, index_name, size=max_articles)

    if not articles:
        print("❌ No relevant articles found.")
        return None, None, None  # keep the arity consistent with the success path

    print(f"✅ Found {len(articles)} relevant articles\n")

    # Build context with found articles
    context = "Available blog articles:\n\n"
    for i, article in enumerate(articles, 1):
        source = article.get("_source", article)
        context += f"Article {i}:\n"
        context += f"- Title: {source.get('title', 'N/A')}\n"
        context += f"- Author: {source.get('author', 'N/A')}\n"
        context += f"- Category: {source.get('category', 'N/A')}\n"
        context += f"- Date: {source.get('date', 'N/A')}\n"
        context += f"- Content: {source.get('content', 'N/A')}\n\n"

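    # The system prompt pins the output to a strict JSON array so the
    # recommendations can be parsed into cards later.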
    system_prompt = """You are an expert content curator that recommends blog articles.

    Write recommendations in a conversational style starting with phrases like:
    - "If you're interested in [topic], this article..."
    - "This post complements your search with..."
    - "For those looking into [topic], this article provides..."


    FORMAT REQUIREMENTS:
    - Return ONLY a JSON array
    - Each element must have EXACTLY these three fields: "article_number", "title", "recommendation"
    - If the original title is in Spanish, use the "translated_title" subfield in the "title" field

    Keep each recommendation concise (2-3 sentences max) and focused on VALUE to the reader.

    EXAMPLE OF CORRECT FORMAT:
    [
        {"article_number": 1, "title": "Article title in english", "recommendation": "If you are interested in [topic], this article provides..."},
        {"article_number": 2, "title": "Article title in english", "recommendation": " for those looking into [topic], this article provides..."}
    ]

    Return ONLY the JSON array following this exact structure."""

    user_prompt = f"""Search query: "{search_query}"

    Generate recommendations for the following articles: {context}
    """

    messages = [
        {"role": "system", "content": "/no_think"},
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    # LLM generation
    print(f"{'='*80}")
    print("🤖 Generating personalized recommendations...\n")

    full_response = ""

    for chunk in stream_chat_completion(messages):
        print(chunk, end="", flush=True)
        full_response += chunk

    return context, articles, full_response</code></pre><p>Finally, we need to extract the information and format it to be printed:</p><pre><code>def display_recommendation_cards(articles, recommendations_text):
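    """Parse the LLM's JSON recommendations and render one ASCII card per article."""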
    print("\n" + "=" * 100)
    print("📇 RECOMMENDED ARTICLES".center(100))
    print("=" * 100 + "\n")

    # Parse JSON recommendations - clean tags and extract JSON
    recommendations_list = []
    try:

        # Clean up <think> tags
        cleaned_text = re.sub(
            r"<think>.*?</think>", "", recommendations_text, flags=re.DOTALL
        )
        # Remove markdown code fences (```json ... ``` or ``` ... ```)
        cleaned_text = re.sub(r"```(?:json)?", "", cleaned_text)
        cleaned_text = cleaned_text.strip()

        parsed = json.loads(cleaned_text)

        # Extract recommendations from list format
        for item in parsed:
            article_number = item.get("article_number")
            title = item.get("title", "")
            rec_text = item.get("recommendation", "")

            if article_number and rec_text:
                recommendations_list.append(
                    {
                        "article_number": article_number,
                        "title": title,
                        "recommendation": rec_text,
                    }
                )
    except json.JSONDecodeError as e:
        print(f"⚠️  Could not parse recommendations as JSON: {e}")
        return

    for i, article in enumerate(articles, 1):
        source = article.get("_source", article)

        # Card border
        print("┌" + "─" * 98 + "┐")

        # Find recommendation and title for this article number
        recommendation = None
        title = None
        for rec in recommendations_list:
            if rec.get("article_number") == i:
                recommendation = rec.get("recommendation")
                title = rec.get("title")
                break

        # Print title (fall back to the indexed title if the LLM omitted one)
        if not title:
            title = source.get("title", "N/A")
        title_lines = textwrap.wrap(f"📌 {title}", width=94)
        for line in title_lines:
            print(f"│  {line}".ljust(99) + "│")

        # Card border
        print("├" + "─" * 98 + "┤")

        # Print recommendation
        if recommendation:
            recommendation_lines = textwrap.wrap(recommendation, width=94)
            for line in recommendation_lines:
                print(f"│  {line}".ljust(99) + "│")

        # Card bottom
        print("└" + "─" * 98 + "┘")</code></pre><p>Let’s test this by asking a question about the security blog posts:</p><pre><code>search_query = "Security and vulnerabilities"

context, articles, recommendations = recommend_articles(search_query)

print("\nElasticsearch context:\n", context)

# Display visual cards
display_recommendation_cards(articles, recommendations)</code></pre><p>Here we can see the cards in the console generated by the workflow:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/730d35212594bb3db30447c3ea7e2a92857287b7-1999x1515.png" style="width: 100%;" /><p>You can see the full results, including all hits and the LLM response, in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/results.md">this file</a>.</p><p>We’re asking for articles related to: “Security and vulnerabilities.” This question is used as the search query against the documents stored in Elasticsearch. The retrieved results are then passed to the model, which generates recommendations based on their content. As we can see, the model did a great job generating engaging short text that can motivate the reader to click on it.</p><h2>Conclusion</h2><p>This example shows how Elasticsearch and Hugging Face can be combined to create a fast and efficient centralized system for AI applications. This approach reduces manual effort and provides flexibility, thanks to Hugging Face’s extensive model catalog. Using SmolLM3-3B, in particular, shows how compact, multilingual models can still deliver meaningful reasoning and content generation when paired with semantic search. Together, these tools offer a scalable and effective foundation for building intelligent content analysis and multilingual applications.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast vs. accurate: Measuring the recall of quantized vector search]]></title>
            <description><![CDATA[Explaining how to measure recall for vector search in Elasticsearch with minimal setup.]]></description>
            <link>https://www.elastic.co/search-labs/blog/recall-vector-search-quantization</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/recall-vector-search-quantization</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeff Vestal]]></dc:creator>
            <pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09f03b9239d66c36763cdab3fafcdac207ff6d83-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Everyone wants vector search to be instant. But high-dimensional vectors are heavy. A single 1,024-dimension float-32 vector takes up significant memory, and comparing it against millions of others is computationally expensive.</p><p>To solve this, search engines like Elasticsearch use two main optimization strategies:</p><ol><li><strong>Approximate search (hierarchical navigable small world [HNSW]):</strong> Instead of scanning every document, we build a navigation graph to jump quickly to the likely neighborhood of the answer.</li><li><strong>Quantization:</strong> We compress the vectors (for example, from 32-bit floats to 8-bit integers or even 1-bit binary values) to reduce memory usage and speed up calculations.</li></ol><p>But optimization often comes with a tax: <strong>accuracy</strong>.</p><p>The fear is valid: &quot;If I compress my data and take shortcuts during the search, will I miss the best results?&quot; &quot;Does this optimization degrade the relevance of my search engine?&quot;</p><p>To prove that Elastic’s quantization doesn’t degrade results, we built a repeatable test harness using the <a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14"><strong>DBPedia-14</strong> dataset</a> to calculate exactly how much accuracy (specifically, <strong>recall)</strong> we trade for speed when using default optimizations in Elasticsearch.</p><p>tldr: It’s likely much less than you think. Check out the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/fast_vs_accurate_measuring_the_recall_of_quantized_vector_search/vector_recall_notebook.ipynb">notebook here</a>, and try it yourself</p><h2><strong>The definitions (for the non-experts)</strong></h2><p>Before we look at the code, let’s level-set on some terms.</p><ul><li><strong>Relevance versus recall:</strong> <strong>Relevance</strong> is subjective (did I find good stuff?). <strong>Recall</strong> is mathematical. If there are 10 documents in the database that are the <em>perfect</em> mathematical matches for your query, and the search engine finds nine of them, your recall is 90% (or 0.9).</li><li><strong>Exact search (flat):</strong> Sometimes called the &quot;brute force&quot; method. The search engine scans every single document in an index and calculates the distance.<ul><li><em>Pros:</em> 100% perfect recall.</li><li><em>Cons:</em> Computationally expensive and slow at scale.</li></ul></li><li><strong>Approximate search (HNSW):</strong> The &quot;shortcut&quot; method. The search engine builds an <a href="https://www.elastic.co/search-labs/blog/hnsw-graph">HNSW</a> graph. It traverses the graph to find the nearest neighbors.<ul><li><em>Pros:</em> Extremely fast and scalable.</li><li><em>Cons:</em> You might miss a neighbor if the graph traversal stops too early.</li></ul></li></ul><h2><strong>The experiment: Exact versus approximate</strong></h2><p>To test recall, we used the <strong>DBPedia-14</strong> dataset, a large dataset of titles and abstracts across 14 ontology classes, commonly used for training and evaluating text categorization models. Specifically, we’ll focus on the &quot;Film&quot; category. 
We wanted to compare the optimized production settings against a mathematically perfect ground truth.</p><p>For this experiment, we are using the <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">jina-embeddings-v5-text-small</a> model, a state-of-the-art multilingual model that leads industry benchmarks for text representation. We chose this model because it defines the current standard for high-performance embeddings. By combining Jina v5’s elite accuracy with Elasticsearch’s native quantization, we can demonstrate a search architecture that is both computationally efficient and uncompromising on retrieval quality.</p><p>We set up an index with a dual mapping, ingesting the same text into two different fields simultaneously:</p><ol><li><strong><code>content.raw</code></strong>: a <code>semantic_text</code> field whose dense vector index type is set to <code>flat</code>. This forces Elasticsearch to perform a brute-force scan of the full Float32 vectors. It returns exact results and serves as our baseline.</li><li><strong><code>content</code></strong>: a <code>semantic_text</code> field with the defaults, using HNSW + Better Binary Quantization (BBQ). This is the standard, optimized production setting for approximate matching.</li></ol><h3><strong>The Recall@10 test</strong></h3><p>For our metric, we used Recall@10.</p><p>We picked 50 random movies and ran the same query against both fields.</p><ul><li>If the <strong>exact (flat)</strong> search says the top 10 neighbors are IDs [1, 2, 3... 10].</li><li>And the <strong>approximate (HNSW)</strong> search returns IDs [1, 2, 3... 9, 99].</li><li>We found nine out of the top 10 correctly. The score is <strong>0.9</strong>.</li></ul><p>Here’s the mapping we used:</p><pre><code># The "Control Group": Forces exact brute-force scan
"raw": {
    "type": "semantic_text",
    "inference_id": ".jina-embeddings-v5-text-small",
    "index_options": {
        "dense_vector": {
            "type": "flat"
        }
    }
}</code></pre><p><strong>The results: The &quot;flat line&quot; of success</strong></p><p>We ran a scale test, reloading the full dataset and testing against index sizes of 1,000 to 40,000 documents.</p><p>Here’s what happened to the recall score:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Documents</td><td style="border: 1px solid #ccc; padding: 4px;">Recall@10 score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1,000</td><td style="border: 1px solid #ccc; padding: 4px;">1.000 (100%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.998 (100%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.992 (99.4%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.999 (99.0%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">40,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.992 (98.8%)</td>
                    </tr>
              </tbody>
            </table>
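          <p>For reference, the Recall@10 scores above reduce to a simple set overlap between the exact and approximate top-10 ID lists. Here’s a minimal sketch of that computation (illustrative; not copied verbatim from the notebook):</p><pre><code>def recall_at_k(exact_ids, approx_ids, k=10):
    # Fraction of the exact top-k neighbors that the approximate search also found
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top.intersection(approx_top)) / k</code></pre>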
          <p>The results were incredibly stable. Even as we scaled up, the approximate search matched the brute-force exact search <strong>&gt;99% of the time</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a4bfacb1d0cce6fdf6df0e1a9d4fc5d4007a66da-1999x1209.png" style="width: 100%;" /><h2><strong>Why did it work so well?</strong></h2><p>You might expect that compressing vectors to binary values would hurt accuracy more than this. The reason it doesn&#x27;t lies in how Elasticsearch handles the retrieval.</p><p>Most embedding models today output Float32 vectors, which are large. To make search efficient, Elasticsearch uses quantization for high-dimensional vectors. Specifically, since 9.2, it uses <a href="https://www.elastic.co/search-labs/blog/elasticsearch-9-1-bbq-acorn-vector-search">BBQ</a> by default.</p><p>BBQ uses a <strong>rescoring</strong> mechanism:</p><ol><li><strong>Traversal:</strong> The search engine uses the compressed (quantized) vectors to traverse the HNSW graph quickly. Because the vectors are small, it can efficiently over-sample, gathering a larger list of candidates (for example, the top 100 roughly similar docs) without a performance penalty.</li><li><strong>Rescore:</strong> Once it has those candidates, it retrieves the full-precision values for just those few documents to calculate the final, precise ranking.</li></ol><p>This gives you the best of both worlds, the speed of quantization for the heavy lifting, and the precision of floats for the final sort.</p><h2><strong>Can we do better?</strong></h2><p>It’s worth noting that the results we’re seeing here are using default settings and a random sampling of data. Think of this as a high-performance starting point. While Jina v5 is a beast, these recall scores aren&#x27;t a &quot;one size fits all&quot; guarantee for every dataset. Every data collection has its own quirks, and while you can definitely tune things further to squeeze out even more performance, you should always benchmark against your own specific data to see where your ceiling is.</p><h2><strong>Conclusion</strong></h2><p>This is a very small-scale test. But the point of the exercise is not to measure the embedding model or BBQ specifically, it’s to demonstrate how you can easily measure the recall of your dataset with minimal setup.</p><p>If you want to run this test on your own data, you can check out the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/fast_vs_accurate_measuring_the_recall_of_quantized_vector_search/vector_recall_notebook.ipynb">notebook here</a> and try it yourself.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Testing Elasticsearch. It just got simpler.]]></title>
            <description><![CDATA[Explaining how Elasticsearch integration tests have become simpler thanks to improvements in Elasticsearch 9.x, the modern Java client, and Testcontainers 2.x.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Piotr Przybyl]]></dc:creator>
            <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3ae41b1f2876d2ad11c8e2b79bbf79955d6902aa-1440x840.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When I first wrote about <a href="https://www.elastic.co/search-labs/blog/series/integration-tests-using-elasticsearch">testing Elasticsearch</a> with Testcontainers for Java, the focus was very pragmatic: if you care about correctness, you should test against a real node; if you care about confidence, your integration tests should resemble production as closely as possible; and if you care about maintainability, your setup shouldn’t turn into a maze of mocks and assumptions.</p><p>That philosophy hasn’t changed.</p><p>What has changed, however, is how little effort it now takes to achieve that goal. With Elasticsearch 9.x, the modern Java client, and Testcontainers 2.x, the experience of writing integration tests feels noticeably smoother, as if a layer of incidental complexity has quietly been removed.</p><p>The example accompanying this article is intentionally modest and can be found <a href="https://github.com/pioorg/elasticsearch9-testcontainers2/blob/main/src/test/java/testing_elasticsearch/ES9TC2DemoTest.java">here</a>.</p><p>It doesn’t attempt to demonstrate sophisticated indexing strategies or elaborate data pipelines; instead, it concentrates on the essentials, because the essentials are precisely where the improvements are most visible.</p><h2>When the tooling stops getting in the way</h2><p>Anyone who has maintained a test suite for a few years will recognize the pattern: You introduce a new library, a transitive dependency pulls something unexpected, and before long, you’re negotiating between versions of testing engines rather than writing tests.</p><p>With Testcontainers 2.x, that negotiation largely disappears. The dependency structure is clearer, the modules are more explicit, and the accidental coupling to older testing frameworks no longer sneaks in behind your back. In practical terms, adding Elasticsearch support to your tests is now as straightforward as declaring:</p><pre><code><dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers-elasticsearch</artifactId>
  <version>2.0.3</version>
  <scope>test</scope>
</dependency></code></pre><p>And, if you’re using JUnit Jupiter integration:</p><pre><code><dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers-junit-jupiter</artifactId>
  <version>2.0.3</version>
  <scope>test</scope>
</dependency></code></pre><p>There are no exclusions to sprinkle in, no legacy engines to silence, and no uneasy feeling that something hidden might surface during the next upgrade. The configuration becomes almost unremarkable, which, in the context of build tooling, is a compliment.</p><h2>A real Elasticsearch node, with security intact</h2><p>In the demo test, we use the official Elasticsearch 9.3.1 Docker image:</p><pre><code>var container =
    new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:9.3.1");

container.start();</code></pre><p>At first glance, this may look similar to older examples, yet the subtle difference lies in what we no longer need to do. <strong>We don’t disable security.</strong> <strong>We don’t bypass SSL.</strong> We don’t simplify the environment just to make the test convenient.</p><p>Instead, once the container is started, we construct a client that uses the REST API and authenticates properly:</p><pre><code>try (var client = ElasticsearchClient.of(c -> c
     .host("https://" + container.getHttpHostAddress())
     .usernameAndPassword("elastic", ElasticsearchContainer.ELASTICSEARCH_DEFAULT_PASSWORD)
     .sslContext(container.createSslContextFromCa())
)) {</code></pre><p>What deserves special mention here is how neat the client construction itself has become. In earlier iterations, creating an Elasticsearch client often meant juggling multiple intermediate objects, configuring transport layers explicitly, wrapping low-level clients, and dedicating some amount of code to what was essentially plumbing. Now, the signal-to-noise ratio is refreshingly high. The builder encapsulates the necessary details, the container provides what the client needs, and the resulting configuration fits comfortably within a few readable lines.</p><p>Just as importantly, the <code>ElasticsearchClient</code> is <code>AutoCloseable</code>, which means it integrates naturally with try-with-resources, ensuring proper cleanup without additional ceremony. The lifecycle is explicit, concise, and self-contained, which is exactly what you want in integration tests that should focus on behavior rather than infrastructure management.</p><p>The container exposes everything required to build a legitimate, secure connection, and the client integrates with it naturally, which means the test environment mirrors production in all the aspects that matter, without imposing additional mental overhead on the developer.</p><p>This alignment between realism and simplicity is, perhaps, one of the most meaningful improvements.</p><h2>Typed APIs change the character of tests</h2><p>The evolution of the Elasticsearch Java client has also reshaped how integration tests read and feel. Where older approaches often involved parsing JSON responses or navigating loosely typed structures, the modern client offers a builder-based, strongly typed API that guides you through valid request shapes at compile time.</p><p>In the demo, we perform a simple cluster health check:</p><pre><code>var health = client.cluster().health();

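// Typed response accessors: no JSON parsing, no string-keyed maps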
Assertions.assertEquals("docker-cluster", health.clusterName());
Assertions.assertEquals(HealthStatus.Green, health.status());</code></pre><p>What’s striking here is not the complexity of the operation, but the absence of friction. There’s no manual extraction from maps, no assertions built on untyped string values, and no detour into low-level response handling. The test code looks indistinguishable from application code, which subtly reinforces the idea that integration tests aren’t a special category of code with different rules, but simply another consumer of the same APIs.</p><p>When the boundary between production code and test code becomes thinner, confidence increases almost by default.</p><h2>Reading the test as a story</h2><p>If you take a look at the full test case:</p><pre><code>@Test
void newClientTest() throws IOException {
    try (var container =
             new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:9.3.1")) {
        
        container.start();
        
        try (
            var client = ElasticsearchClient.of(c ->
                c.host("https://" + container.getHttpHostAddress())
                    .usernameAndPassword("elastic", ElasticsearchContainer.ELASTICSEARCH_DEFAULT_PASSWORD)
                    .sslContext(container.createSslContextFromCa()))) {

            HealthResponse health = client.cluster().health();

            Assertions.assertEquals("docker-cluster", health.clusterName());
            Assertions.assertEquals(HealthStatus.Green, health.status());
        }
    }
}</code></pre><p>you’ll notice that it reads less like a configuration script and more like a short narrative:</p><ul><li>We define the container.</li><li>We start the container.</li><li>We build a client.</li><li>We call a real API.</li><li>We assert the outcome.</li></ul><p>The supporting infrastructure fades into the background, leaving the intent of the test clearly visible. That clarity isn’t accidental; it’s the cumulative effect of incremental improvements across Testcontainers and the Elasticsearch client.</p><h2>The advanced patterns still apply</h2><p>None of the more advanced techniques discussed in earlier articles, <a href="https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests-faster">Faster integration tests with real Elasticsearch</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-improve-performance-integration-tests">Advanced integration tests with real Elasticsearch</a>, have become obsolete. Reusing containers to speed up large test suites, customizing cluster settings, preloading indices, or testing role-based access scenarios remain entirely valid and, in many cases, essential.</p><p>What has improved is the baseline experience. The simplest possible integration test, the one that merely needs a real node and a real client, no longer requires defensive configuration or dependency gymnastics. It’s concise, expressive, and production-like by default.</p><h2>Progress without drama</h2><p>There was no dramatic rewrite of the ecosystem, no disruptive migration guide that forced a rethinking of everything. Instead, there has been a steady refinement of APIs and dependencies, each release smoothing a rough edge here and removing a surprise there.</p><p>The result isn’t flashy, yet it’s tangible. Writing integration tests against Elasticsearch now feels less like assembling a test harness and more like exercising a real system in miniature.</p><p>Sometimes progress announces itself loudly. Sometimes it arrives quietly, in the form of code that simply reads better and requires less explanation. In this case, it’s the latter, and for those of us who care about clean, reliable integration tests, that’s more than enough.</p><p>And what if we could do something similar with Kibana? Sounds appealing? Stay tuned!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI agent memory: Creating smart agents with Elasticsearch managed memory]]></title>
            <description><![CDATA[ Learn how to create smarter and more efficient AI agents by managing memory using Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ai-agent-memory-management-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ai-agent-memory-management-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly, Jeffrey Rengifo]]></dc:creator>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0172bf1bf25f10001de046531d7dfae816c1f80-1200x676.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll learn about using memory techniques to make agents smarter using Elasticsearch as the database for memories and knowledge.</p><h2>Understanding memory in large language models (LLMs)</h2><p>Here&#x27;s something that trips people up: The conversations with LLMs are completely <a href="https://www.geeksforgeeks.org/computer-networks/difference-between-stateless-and-stateful-protocol/">stateless</a>. Every time you send a message, you need to include the entire chat history to &quot;remind&quot; the model what happened before. The ability to keep track of what was asked and answered within a single conversation session is what we call <strong><em>short-term memory</em></strong>.</p><p>But here&#x27;s where it gets interesting: Nothing stops us from manipulating this chat history beyond simple storage. For example, when we want to persist memories like user preferences across different conversations, we inject those into fresh conversations when needed and call it <strong><em>long-term memory</em></strong>.</p><h2>Why mess with chat history?</h2><p>There are three compelling reasons to go beyond simply appending each new message and response to a growing list that gets sent to the LLM with every request:</p><ul><li><strong>Inject useful context:</strong> Add information about previous interactions, like user preferences, without cluttering the current conversation.</li><li><strong>Summarize and remove data:</strong> Clean up information the model has already used to avoid confusion (<em>context poisoning</em>) and keep the model focused.</li><li><strong>Save tokens:</strong> Remove unnecessary data to prevent filling the context window, enabling longer, more meaningful conversations.</li></ul><p>This opens up some sci-fi possibilities. Imagine an agent that selectively remembers things based on its environment or who it&#x27;s talking to, like the TV show <strong><em>Severance</em></strong>, where the main character, Mark, has a chip implanted in his brain that creates two separate identities with distinct memories depending on whether he’s in the office (&quot;innie&quot;) or outside of it (&quot;outie&quot;), switching based on location.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3c08fe6139cae5571288d713e890b7c8ca3f428-1000x1500.png" style="width: 100%;" /><h2>Memory types and selective retrieval in agents: Creating smart agents with Elasticsearch managed memory</h2><p>Not all memories serve the same purpose, and treating them as interchangeable chat history limits how far agents can scale. Modern agent architectures, including frameworks like <a href="https://arxiv.org/abs/2309.02427">Cognitive Architectures for Language Agents (CoALA)</a>, distinguish between <strong>procedural</strong>, <strong>episodic</strong>, and <strong>semantic</strong> memory. Rather than treating all context as a single growing buffer, these architectures recognize that each memory type requires distinct storage, retrieval, and consolidation strategies.</p><h3>Procedural memory: How the agent operates</h3><p><em>Procedural memory</em> defines how an agent behaves, not what it knows or remembers.</p><p>In practice, this includes:</p><ul><li>When to store a memory.</li><li>When to retrieve one.</li><li>How to summarize conversations.</li><li>How to use tools.</li></ul><p>In our system, procedural memory lives primarily in the application code and prompts and isn’t stored in Elasticsearch. 
Instead, Elasticsearch is used by procedural memory.</p><p><strong><em>Procedural memory determines how memory is used, not what’s stored.</em></strong></p><h3>Episodic memory: What happened</h3><p><em>Episodic memory</em> captures specific experiences tied to an entity and a context.</p><p>Examples:</p><ul><li>“Peter’s birthday is tomorrow and he wants steak.”</li><li>“Janice has a report due at 9 am.”</li></ul><p>This is the most dynamic and personal form of memory and the one most prone to context pollution if handled incorrectly.</p><p>In our architecture:</p><ul><li>Episodic memories are stored as documents in Elasticsearch.</li><li>Each memory includes metadata (user, role, timestamp, innie or outie).</li><li>Retrieval is selective, based on who’s asking and in what context.</li></ul><p>This is where the innie/outie model applies as an example of episodic memory isolation.</p><h3>Semantic memory: Ground truth</h3><p><em>Semantic memory</em> represents<strong> abstracted, generalized knowledge about the world</strong>, independent of any single interaction or personal context. Unlike episodic memory, which is tied to who said what and when, semantic memory captures what is true in general.</p><p>In our analogy, the knowledge about <strong>Lumon</strong>, which is the company where Mark works in the show <em>Severance</em>, is world truth shared between innies and outies.</p><p>Things like company handbooks and rules are part of the knowledge being used as semantic memory.</p><p>While episodic memory retrieval prioritizes precision and strong contextual filters (such as identity, role, and time), semantic memory favors high-recall, concept-level retrieval. It’s designed to surface generally true information that can ground reasoning, rather than personal experiences tied to a specific situation.</p><p>Let’s move to architecture and see how these ideas translate into a memory system for our agent.</p><h2>Prerequisites</h2><ul><li>Elasticsearch Elastic Cloud Hosted (ECH) or self-hosted 9.1+ instance.</li><li>Python 3.x.</li><li><a href="https://platform.openai.com/docs/api-reference/authentication">OpenAI API Key</a>.</li></ul><p>The full Python notebook for this application can be found <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/smarter-agents-with-memory/notebook.ipynb">here</a>.</p><h2>Why Elasticsearch?</h2><p>Elasticsearch is an ideal solution for storing both knowledge and memory because it&#x27;s a native vector database ready to scale. It gives us everything we need to manage selective memory:</p><ul><li><strong>Vector database</strong> with <a href="https://www.elastic.co/docs/solutions/search/hybrid-semantic-text">hybrid search</a> to find memories by context, not only by keywords.</li><li><strong>Multiple data types</strong>, including text, numbers, dates, and geolocation.</li><li><strong>Metadata filters</strong> for complex queries across different fields.</li><li><a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level"><strong>Document level security</strong></a> to filter memories based on who&#x27;s asking.</li></ul><h3>Why selective memory improves latency and reasoning</h3><p>Selective memory is not only about correctness and isolation; it also has a direct impact on latency and model performance. 
By narrowing the search space using structured filters (such as memory type, user, or time) before running semantic retrieval, Elasticsearch reduces the number of vectors that need to be scored and the amount of context that must be injected into the LLM. This results in faster retrieval, smaller prompts, and more focused attention for the model, which in practice translates into lower latency, lower token usage, and more accurate responses.</p><p>Episodic memory is inherently temporal: Recent experiences are usually more relevant than older ones, and not all memories should be kept with the same level of detail forever. In human cognition, experiences are gradually forgotten, summarized, or consolidated into more abstract knowledge.</p><p>Memory compression is a whole different topic, but you can implement strategies to summarize and store old memories while retrieving the fresh ones entirely.</p><h2>The setup</h2><p>Following the <strong><em>Severance</em></strong> concept, we&#x27;re creating an agent named Mark with two distinct memory sets:</p><ul><li><strong>Innie memories</strong>: Work-related conversations with colleagues.</li><li><strong>Outie memories</strong>: Personal conversations with friends and family.</li></ul><p>When Mark talks to an innie, he shouldn&#x27;t remember conversations with outies, and vice versa.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0172bf1bf25f10001de046531d7dfae816c1f80-1200x676.png" style="width: 100%;" /><h2>Building the memory system</h2><h3>Memory index structure</h3><p>First, we define our memory schema:</p><pre><code>mappings = {
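    # memory_text is a multi-field: full-text (BM25) search on the base field,
    # plus a semantic_text subfield (memory_text.semantic) for semantic search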
    "properties": {
        "user_id": {"type": "keyword"},
        "memory_type": {"type": "keyword"},
        "created_at": {"type": "date"},
        "memory_text": {
            "type": "text",
            "fields": {
                "semantic": {
                    "type": "semantic_text"
                }
            }
        }
    }
}</code></pre><p><em>Note that we use <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/multi-fields">multi-field</a> for <code>memory_text </code>so we can do both full-text search, and <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic search</a> using the <a href="https://www.elastic.co/search-labs/tutorials/search-tutorial/semantic-search/elser-model">Elastic Learned Sparse EncodeR (ELSER) model</a> (default) against the same field content.</em></p><p>This gives us semantic search capabilities while maintaining structured metadata for filtering.</p><h3>Setting up document level security</h3><p>This is the key piece that makes selective memory work. We create two separate roles: one for innies, one for outies, each with query-level filters built in. When a user with the innie role queries the memories index, Elasticsearch automatically applies a filter that only returns memories where <code>memory_type</code> equals &quot;innie&quot;.</p><p><em>You can find more illustrative examples about access control <a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level#basic-examples">here</a> and about role management <a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/kibana-role-management">here</a>.</em></p><p>Here&#x27;s the innie role:</p><pre><code>innie_role_descriptor = {
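    # Document level security: every query from a user holding this role is
    # implicitly AND-ed with this filter, so only "innie" memories are returned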
    "indices": [
        {
            "names": ["memories"],
            "privileges": ["read", "write"],
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"memory_type": "innie"}}
                    ]
                }
            }
        }
    ]
}</code></pre><p>We create a similar role for outies, just filtering by <code>&quot;memory_type&quot;: &quot;outie&quot;</code> instead.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e675da210be223200e1d5519b5bc62b6f97b3b1-1999x1046.png" style="width: 100%;" /><p>Then we create users and assign them to these roles. For example:</p><ul><li><strong>Peter (outie):</strong> Can only access memories marked as <code>&quot;outie&quot;</code>.</li><li><strong>Janice (innie):</strong> Can only access memories marked as <code>&quot;innie&quot;</code>.</li></ul><p>When Mark (our agent) receives a query, he uses the credentials of whoever is asking. If Peter asks something, Mark uses Peter&#x27;s credentials, which means Elasticsearch automatically filters to only show outie memories. If Janice asks, only innie memories are visible.</p><p>The application code doesn&#x27;t need to apply any filtering; user management is completely decoupled from the application logic. Elasticsearch handles all the security automatically.</p><h3>Creating the agent tools</h3><p>We define three key functions for our agent:</p><ul><li><strong><code>GetKnowledge</code>:</strong> Searches the knowledge base for relevant context (<a href="https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag">traditional retrieval augmented generation [RAG]</a>).</li><li><strong><code>GetMemories</code>:</strong> Retrieves memories using <a href="https://www.elastic.co/what-is/hybrid-search">hybrid search</a> (semantic + keyword):</li></ul><pre><code>def get_memory(query: str):
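    """Hybrid retrieval: blend a semantic retriever and a BM25 multi_match via RRF.

    No security filter is applied here; document level security on the calling
    user's role decides which memories Elasticsearch may return.
    """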
    es_query = {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "semantic": {
                                    "field": "semantic_field",
                                    "query": query
                                }
                            }
                        }
                    },
                    {
                        "standard": {
                            "query": {
                                "multi_match": {
                                    "query": query,
                                    "fields": ["memory_text"]
                                }
                            }
                        }
                    }
                ],
                "rank_window_size": 50,
                "rank_constant": 20
            }
        }
    }
    
    response = user_es_client.search(index="memories", body=es_query)
    return response</code></pre><p><em>Notice that we don&#x27;t apply security filters in the query; Elasticsearch handles that automatically based on the user&#x27;s credentials.</em></p><ul><li><strong><code>SetMemory</code>:</strong> Stores new memories (implementation uses LLM to convert conversations into structured memory records).</li></ul><h3>How the agent uses these tools</h3><p>When a user asks Mark a question, the flow works like this:</p><p>1.<strong> User asks:</strong> &quot;What&#x27;s my favorite family destination?&quot;</p><p>2.<strong> LLM decides to use tools:</strong> OpenAI&#x27;s Response API with function calling lets the LLM decide it needs to call <code>GetMemories</code> with the query <code>&quot;favorite family destination&quot;</code>.</p><p>3.<strong> We execute the function:</strong> Our code calls <code>get_memory(&quot;favorite family destination&quot;)</code> using the user&#x27;s credentials (Peter&#x27;s in this case).</p><p>4. <strong>Elasticsearch filters automatically:</strong> Because we&#x27;re using Peter&#x27;s credentials, only outie memories are returned:<br/></p><pre><code>Memories
peter125: (User name is Peter Johnson. His favorite family destination is Disneyland.)</code></pre><p>5. <strong>We send results back to LLM:</strong> The memory gets added to the conversation context.</p><p>6.<strong> LLM generates an answer:</strong> &quot;Your favorite family destination is Disneyland.&quot;</p><p>Here&#x27;s the actual code that handles this loop:</p><pre><code># Initial call with tools available
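# "tools" holds the function schemas for GetKnowledge, GetMemories, and SetMemory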
response = client.responses.create(
    model="gpt-4.1-mini",
    input=messages,
    tools=tools,
    parallel_tool_calls=True
)

# Execute any tool calls the LLM requested
for tool_call in response.output:
    if tool_call.name == "GetMemories":
        result = get_memory(tool_call.arguments["query"])
        # Add result to messages
        
# Call LLM again with tool results to generate final answer
final_response = client.responses.create(
    model="gpt-4.1-mini",
    input=messages  # Now includes tool results
)</code></pre><p>The key insight: The application doesn&#x27;t decide which memories to retrieve or when. The LLM decides based on the user&#x27;s question, and Elasticsearch ensures that only the right memories are accessible.</p><h2>Testing selective memory</h2><p>Let&#x27;s see it in action:</p><p>Outie conversation (Peter):</p><pre><code>Peter: Hey Mark, my birthday is tomorrow! I'd like to have a steak for dinner.
Mark: That's great! (memory stored)</code></pre><p>Mark stores this as an outie memory associated with Peter. Here&#x27;s what that memory looks like in Elasticsearch:</p><pre><code>{
    "user_id": "peter125",
    "memory_type": "outie",
    "created_at": "2025-10-11T18:02:52.182780",
    "memory_text": "Peter's birthday is tomorrow. He wants steak for dinner."
}</code></pre><p>Innie conversation (Janice):</p><pre><code>Janice: Hey Mark, remember we have to finish the end of year report tomorrow at 9am.
Mark: Thanks for reminding me! (memory stored)</code></pre><p>This creates a separate innie memory:</p><pre><code>{
    "user_id": "janice456",
    "memory_type": "innie", 
    "created_at": "2025-10-11T19:15:33.445821",
    "memory_text": "End of year report deadline tomorrow at 9am with Janice."
}</code></pre><p>Imagine Peter also works at Lumon. A colleague stores a work-related memory about him:</p><pre><code>{
    "user_id": "innie-peter",
    "memory_type": "innie",
    "created_at": "2025-10-11T20:30:00.000000",
    "memory_text": "Peter needs to review the Q4 budget spreadsheet before Friday."
}</code></pre><p>This memory exists in Elasticsearch, but Peter&#x27;s current credentials only grant him the outie role. When he asks Mark about work tasks, this memory is invisible to him; Elasticsearch&#x27;s document level security ensures that it’s never returned.</p><p><em>Note: To allow interaction with these memories, you would need to create a separate user (or assign an additional role) with &quot;innie&quot; access for Peter. This is left as an exercise, but it demonstrates that the same person can have isolated memory contexts, and access is controlled entirely at the security layer.</em></p><h3>Memory isolation test</h3><p>Now Peter starts a new conversation:</p><pre><code>Peter: Hey Mark, do you remember what I want for my birthday?
Mark: Yes! You want steak.

Peter: When do you have to finish the end of year report?
Mark: What are you talking about?</code></pre><p>Perfect! Mark only accesses outie memories when talking to Peter. The agent&#x27;s &quot;brain&quot; is genuinely split, just like in the show.</p><h2>The full implementation</h2><p>The complete working implementation is available in this <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/smarter-agents-with-memory/notebook.ipynb">notebook</a>, where you can:</p><ul><li>Set up the Elasticsearch indices.</li><li>Create roles and users with document level security.</li><li>Build the agent with OpenAI&#x27;s Response API.</li><li>Test the selective memory system.</li></ul><h2>Conclusion</h2><p>Memory isn’t just a place to store past conversations. It’s part of the agent’s architecture. By going beyond raw chat history and separating procedural, episodic, and semantic memory, we can build agents that reason more clearly, scale better, and stay focused over long interactions.</p><p>Selective retrieval reduces context pollution, lowers latency, and improves the quality of the information sent to the LLM. Episodic memory can be filtered by user and time, semantic memory can be used to ground answers in shared knowledge, and procedural memory controls how and when all of this is used.</p><p>Elasticsearch provides the building blocks to implement this in practice through hybrid search, rich metadata, security, and temporal filtering. Just like in <em>Severance</em>, we can create agents with isolated experiences and shared world knowledge. The difference is that here the split is intentional and useful, not a mystery.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Gemini CLI extension for Elasticsearch with tools and skills]]></title>
            <description><![CDATA[Introducing Elastic’s extension for Google's Gemini CLI to search, retrieve, and analyze Elasticsearch data in developer and agentic workflows.
]]></description>
            <link>https://www.elastic.co/search-labs/blog/gemini-cli-extension-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/gemini-cli-extension-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Walter Rafelsberger, Enrico Zimuel]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dfcec4a77b3dc83bf0d974417bf2e211abb1f4f-876x468.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We&#x27;re glad to announce the release of our Elastic extension for Google&#x27;s Gemini CLI, bringing the full power of <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a> directly into your AI development workflow. This extension also offers several recently developed agent skills for interacting with Elasticsearch.</p><p>The extension is available as an open source project <a href="https://github.com/elastic/gemini-cli-elasticsearch">here</a>.</p><h2>What is Gemini CLI, and how do you install it?</h2><p><a href="https://geminicli.com/">Gemini CLI</a> is an open source AI agent that brings Google’s Gemini models directly into the command line. It allows developers to interact with AI from the terminal to perform tasks such as generating code, editing files, running shell commands, and retrieving information from the web.</p><p>Unlike typical chat interfaces, Gemini CLI integrates with your local development environment, meaning that it can understand project context, modify files, run builds or tests, and automate workflows directly within the terminal. This makes it useful for developers, site reliability engineers (SREs), and engineers who want AI-assisted coding and automation without leaving their command-line workflow.</p><p>Gemini CLI can be installed using several package managers. The most common method is via npm:</p><pre><code>npm install -g @google/gemini-cli</code></pre><p>If you want to know alternative installation options, refer to the <a href="https://geminicli.com/docs/get-started/installation/">official installation page</a>.</p><p>After installation, start the CLI by running:</p><pre><code>gemini</code></pre><p>You see a screen, as shown in Figure 1:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dfcec4a77b3dc83bf0d974417bf2e211abb1f4f-876x468.png" style="width: 100%;" /><h2>Configure Elasticsearch</h2><p>We need to have an Elasticsearch instance running. If you want to use the Model Context Protocol (MCP) server, you also need Kibana 9.3+ installed. To use the Elasticsearch Query Language (ES|QL) skill (<code>esql</code>) described below, Kibana is not required.</p><p>You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and will generate an API key to be used for configuring Gemini CLI.</p><p>The API key will be shown as output of the previous command and stored in a <strong>.env</strong> file in the <strong><code>elastic-start-local</code></strong> folder.</p><p>If you’re using on-premises Elasticsearch (for example, using <code>start-local</code>), and you want to use Elastic Agent Builder with MCP, you also need to connect a large language model (LLM). 
You can read <a href="https://www.elastic.co/docs/explore-analyze/ai-features/llm-guides/llm-connectors">this documentation page</a> to understand the different options.</p><p>If you’re using Elastic Cloud (or serverless), you already have a prebuilt LLM connection in place.</p><h2>Install the Elasticsearch extension</h2><p>You can install the Elasticsearch extension for Gemini CLI with the following command:</p><pre><code>gemini extensions install https://github.com/elastic/gemini-cli-elasticsearch</code></pre><p>You can check that the extensions have been installed successfully by opening Gemini and executing the following command:</p><pre><code>/extensions list</code></pre><p>You should see the Elasticsearch extension available.</p><p>If you want to use the MCP integration, you need to have an Elasticsearch 9.3+ version installed. You need your MCP server URL from <a href="https://www.elastic.co/kibana">Kibana</a>:</p><ul><li>Get your MCP server URL from Agents &gt; View all tools &gt; Manage MCP &gt; Copy MCP Server URL.</li><li>The URL will look like this: https://your-kibana-instance/api/agent_builder/mcp</li></ul><p>You need the Elasticsearch endpoint URL. This is typically reported at the top of the Kibana Elasticsearch page. If you’re running Elasticsearch with <code>start-local</code>, you already have the endpoint in the <code>ES_LOCAL_URL</code> key in the <code>start-local</code> .env file.</p><p>You also need an API key. If you’re running Elasticsearch with <code>start-local</code>, you already have the <code>ES_LOCAL_API_KEY</code> in the <code>start-local</code> .env file. Otherwise, you can create an API key using the Kibana interface, as reported <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">here</a>:</p><ul><li>In Kibana: Stack Management &gt; Security &gt; API Keys &gt; Create API key.</li><li>We suggest setting only the read privileges for the API key, enabling the <code>feature_agentBuilder.read</code> privilege as reported <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/permissions#grant-access-with-roles">here</a>.</li><li>Copy the encoded API key value.</li></ul><p>Set the required environment variables in your shell:</p><pre><code>export ELASTIC_URL="your-elasticsearch-url"
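# ELASTIC_URL and ELASTIC_API_KEY can be taken from start-local's .env file
# (ES_LOCAL_URL / ES_LOCAL_API_KEY); the MCP URL is copied from Kibana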
export ELASTIC_MCP_URL="your-elasticsearch-mcp-url"
export ELASTIC_API_KEY="your-encoded-api-key"</code></pre><h2>Install the example dataset</h2><p>You can install the <strong>eCommerce orders </strong>dataset available from Kibana. It includes a single index named <strong><code>kibana_sample_data_ecommerce</code></strong>, containing information for 4,675 orders from an ecommerce website. For each order, we have the following information:</p><ul><li>Customer information (name, ID, birth date, email, and more).</li><li>Order date.</li><li>Order ID.</li><li>Products (list of all the products with price, quantity, ID, category, discount, and other details).</li><li>SKU.</li><li>Total price (taxless, taxed).</li><li>Total quantity.</li><li>Geo information (city, country, continent, location, region).</li></ul><p>To install the sample data, open the <strong>Integrations</strong> page in Kibana (search for “Integration” in the top search bar) and install the <strong>Sample Data</strong>. For more details, refer to the documentation <a href="https://www.elastic.co/docs/explore-analyze/#gs-get-data-into-kibana">here</a>.</p><p>The goal of this article is to show how easy it is to configure Gemini CLI to connect to Elasticsearch and interact with the <strong><code>kibana_sample_data_ecommerce</code></strong> index.</p><h2>How to use the Elasticsearch MCP</h2><p>You can check the connection using the following command in Gemini:</p><pre><code>/mcp list</code></pre><p>You should see the <strong><code>elastic-agent-builder</code></strong> enabled, as shown in Figure 2:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1508423bc1d1bc3c04a1cb01e2d59495a3516ed1-1465x844.png" style="width: 100%;" /><p>Elasticsearch provides a default set of tools. See the description <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools/builtin-tools-reference">here</a>.</p><p>Using these tools, you can interact with Elasticsearch, asking questions like:</p><ul><li><code>Give me the list of all the indexes available in Elasticsearch.</code></li><li><code>How many customers are based in the USA in the kibana_sample_data_ecommerce index of Elasticsearch?</code></li></ul><p>Depending on the question, Gemini will use one or more of the available tools to try to answer it.</p><h2>The /elastic commands</h2><p>In the Elasticsearch extension for Gemini CLI, we also added<strong> <code>/elastic</code></strong> commands.</p><p>If you execute the <strong><code>/help</code></strong> command, you see all the available <code>/elastic</code> options (Figure 3):</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b2a0727ce7a04354878dd048253d3f4d062324b-1983x230.png" style="width: 100%;" /><p>These commands can be useful if you want to directly execute a specific tool of the <code>elastic-agent-builder</code> MCP server. For instance, using the following command, you can get the mapping of the <code>kibana_sample_data_ecommerce</code>:</p><pre><code>/elastic:get-mapping kibana_sample_data_ecommerce</code></pre><p>These commands are essentially shortcuts for executing specific tools, rather than relying on the Gemini model to determine which tool should be invoked.</p><h2>How to use the Elasticsearch skills</h2><p>This extension also comes with an <a href="https://github.com/elastic/gemini-cli-elasticsearch/tree/main/skills/esql">agent skill for ES|QL</a>, the <a href="https://www.elastic.co/docs/explore-analyze/discover/try-esql">Elasticsearch Query Language</a> available in Elasticsearch. 
<a href="https://agentskills.io/home">Agent Skills</a> is an open format that gives AI coding agents, like Gemini CLI, custom instructions for specific tasks. They use a concept called <em>progressive disclosure</em>, meaning that only a brief description of the skill is added to the initial system prompt. When you ask the agent to perform a task, like querying Elasticsearch, it matches the request to the relevant skill and dynamically loads the detailed instructions. This is an efficient way to manage token budgets while providing the AI with exactly the context it needs.</p><p>The <strong><code>esql</code> skill</strong> is designed to let Gemini CLI write and execute ES|QL queries directly against your cluster. ES|QL is a powerful piped query language that makes data exploration, log analysis, and aggregations highly intuitive. With this skill enabled, you don&#x27;t need to look up ES|QL syntax; you can simply ask the Gemini CLI natural language questions about your data, and the agent will handle the rest.</p><p>The executions are performed using simple <a href="https://curl.se/">curl</a> commands run in a terminal. This is possible because Elasticsearch provides a rich set of REST APIs that can be easily used to integrate the system into any architecture.</p><p><strong>What the <code>esql</code> skill offers:</strong></p><ul><li><strong>Index and schema discovery:</strong> The agent can use the skill&#x27;s built-in tools to list available indices and fetch field mappings. For instance, before writing a query for the eCommerce dataset, the agent can run a schema check on <strong><code>kibana_sample_data_ecommerce</code></strong> to understand the available fields, like <strong><code>taxful_total_price</code></strong> or <strong><code>category</code></strong>.</li><li><strong>Seamless natural language translation:</strong> The skill gives the agent more than just a plain reference manual; it provides a specific guide for interpreting user intent. When you type natural language requests, like &quot;Show average response time grouped by service,&quot; the agent uses the skill&#x27;s bundled pattern matching to instantly translate your words into the correct ES|QL aggregations, filters, and commands.</li><li><strong>Self-correct:</strong> If a query fails (for example, due to a type mismatch or syntax error), the skill returns the generated query alongside the exact Elasticsearch error message, allowing the agent to instantly fix the query and try again without you having to intervene.</li></ul><p>Since the <code>esql</code> skill is also available as a tool in the <code>elastic-agent-builder</code> MCP server, we need to disable this server momentarily. 
You can use the following command to disable it:</p><pre><code>/mcp disable elastic-agent-builder</code></pre><p>Then you can simply type a prompt like this into your Gemini CLI:</p><pre><code>Find the top 5 product categories by total sales revenue in the kibana_sample_data_ecommerce index</code></pre><p>The agent will:</p><ul><li>Recognize the need for the <code>esql</code> skill.</li><li>Check the schema of <strong><code>kibana_sample_data_ecommerce</code></strong>.</li><li>Construct an ES|QL query, like: <code>FROM kibana_sample_data_ecommerce | STATS total_revenue = SUM(taxful_total_price) BY category.keyword | SORT total_revenue DESC | LIMIT 5</code>.</li><li>Execute the query against the Elasticsearch API.</li><li>Present the final answer to you right in the terminal.</li></ul><p>Here is an example response from Gemini to the previous prompt:</p><pre><code>───────────────────────────────────────────────────────────
> Find the top 5 product categories by total sales revenue in the kibana_sample_data_ecommerce index   ───────────────────────────────────────────────────────────
✓  Activate Skill "esql": Interact with Elasticsearch using ES|QL and curl. Use when querying, managing indices,  …
Skill esql activated. Resources loaded from /home/user/.gemini/extensions/elasticsearch/skills/esql: 
Showing up to 200 items (files + folders).                                                                     /home/user/.gemini/extensions/elasticsearch/skills/esql/                                                            ├───SKILL.md 
└───references/
├───esql-reference.md 
├───generation-tips.md 
└───query-patterns.md

✦ I will start by exploring the mapping of the kibana_sample_data_ecommerce index to identify the fields for product category and sales revenue. Then, I will construct and execute an ES|QL query to determine the top 5 product categories by total sales revenue.

✓  Shell curl -s "${ELASTIC_URL%/}/kibana_sample_data_ecommerce/_mapping" -H "Authorization: ApiKey $(printenv ELAS…                                                                                                                       {  
   "kibana_sample_data_ecommerce": { 
     "mappings": { 
       "properties": {
         "category": { 
           "type": "text", 
           "fields": {
             "keyword": {
               "type": "keyword"
             }
           }
         }, 
         "currency": {
           "type": "keyword"
         }, 

✦ Okay, I have the mapping. The category is products.category.keyword and sales revenue is taxful_total_price. I'll now execute the ES|QL query to get the top 5 categories by revenue.

✓  Shell curl -s -X POST "${ELASTIC_URL%/}/_query" \  -H "Authorization: ApiKey $(printenv ELASTIC_API_KEY)" \  -H …
{ 
   "took": 62,
   "is_partial": false, 
   "completion_time_in_millis": 1772805644164,
   "documents_found": 4675,
   "values_loaded": 12084, 
   "start_time_in_millis": 1772805644102, 
   "expiration_time_in_millis": 1773237644087,  
   "columns": [ 
     { 
       "name": "total_revenue",    
       "type": "double"  
     }, 
     {   
       "name": "products.category.keyword",

✦ The query for top 5 product categories by revenue was successful. I'll format the returned JSON values into a   user-friendly list.

 ✓  Shell 
                             
 Top 5 product categories by total sales revenue: 
 1. Men's Clothing: 149393.91
 2. Women's Clothing: 135099.91
 3. Women's Shoes: 105479.17
 4. Men's Shoes: 91797.92
 5. Women's Accessories: 60830.31 

✦ I've found the answer; now I'll format it for clarity.

 ✓  Shell 

 Here are the top 5 product categories by total sales revenue:                                                        
                                                                      
 | Rank | Category              | Total Revenue | 
 |------|-----------------------|---------------| 
 | 1    | Men's Clothing        | 149,393.91    |
 | 2    | Women's Clothing      | 135,099.91    |
 | 3    | Women's Shoes         | 105,479.17    |
 | 4    | Men's Shoes           | 91,797.92     | 
 | 5    | Women's Accessories   | 60,830.31     | 

✦ I've formatted and presented the answer. I'm finished.
 ───────────────────────────────────────────────────────────</code></pre><p>It’s interesting to note how the Gemini model generates the final response by showing all the steps it follows. Here, you can clearly see the influence of the skill in the model’s reasoning process. The first time the model recognizes that it needs to use a skill or execute a shell command, it requests permission using the human-in-the-loop approach.</p><p>By handling the heavy lifting of schema discovery, query generation, and execution, the <code>esql</code> skill lets you focus entirely on the answers rather than the mechanics of getting them. You’ll get the data you need, properly formatted and right in your terminal, all without ever writing a single line of syntax or context-switching to a different application.</p><h2>Conclusion</h2><p>In this article, we presented the Elasticsearch extension for Gemini CLI that we recently released. This extension gives you the ability to interact with your Elasticsearch instance using Gemini and the Elasticsearch MCP server provided by Elastic Agent Builder, available starting from version 9.3.0, as well as the <code>/elastic</code> commands.</p><p>The extension also includes an <code>esql</code> skill that converts a user’s request from natural language into an ES|QL query. This skill can be particularly useful when the MCP server cannot be used, since the underlying communication is driven by simple curl commands executed in a terminal. Elasticsearch offers a rich set of REST APIs that can be easily integrated into any project. This is especially useful when developing agentic AI applications.</p><p>For more information about our Gemini CLI extension, visit the project repository <a href="https://github.com/elastic/gemini-cli-elasticsearch">here</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Skills for Elastic: Turn your AI agent into an Elastic expert]]></title>
            <description><![CDATA[Give your AI coding agent the knowledge to query, visualize, secure, and automate with Elastic Agent Skills.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-skills-elastic</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-skills-elastic</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Graham Hudgins, Matt Ryan]]></dc:creator>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09e64953819083168a9ecef0888c7f8bde1a43bd-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Every developer, site reliability engineer (SRE), or analyst who has tried to use an AI coding agent with a specialized platform has hit the same wall. You ask the agent to write a query, configure an alert, or investigate something, and it gets close but not right. Elastic has an advantage here: over a decade of documentation, blog posts, and community answers means AI agents already know Elastic better than most data platforms. But that depth comes with noise. Deprecated APIs sit alongside current ones. Outdated patterns rank as highly as best practices. The agent confidently reproduces an approach that worked three versions ago, because in its training data, it did. The result is a correction tax: users manually feed documentation into context, fix hallucinated syntax, and work around the agent instead of with it. Worse, advanced capabilities go unused entirely, not because users don&#x27;t need them, but because the agent doesn&#x27;t know they exist.</p><p>That&#x27;s why we’re open-sourcing <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills</a>: native platform expertise for Elasticsearch, Kibana, Elastic Observability, and Elastic Security. Drop them into the agent runtime you already use, and turn your agent from a ‘generalist’ that guesses at syntax into a specialist with real expertise, such as using many of the same architectural standards as Elastic’s own engineering teams. This initial technical preview release focuses on skills with maximum compatibility for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, but will evolve quickly to include improved support for older stack releases.</p><p>In addition, Elastic is solving this problem from both sides. For agents on the Elastic platform, <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a> (now generally available) lets you create and chat with AI agents that inherit your data&#x27;s access controls, use built-in search and analysis tools, and work in context alongside your dashboards, alerts, and investigations. We&#x27;re working hard to ensure amazing agentic experiences in the Elastic platform. But not every agent lives inside Elastic. Your team already uses Cursor, Claude Code, or other runtimes, and those agents need to get Elastic right, too. That&#x27;s where Agent Skills come in.</p><h2>Why agents struggle with specialized platforms</h2><p>Large language models (LLMs) are remarkably capable generalists. They can write Python, explain Kubernetes manifests, and refactor React components because their training data is rich with examples. But when it comes to platform-specific work, the kind that involves proprietary query languages, deep API surfaces, and domain-specific best practices, they fall short in predictable ways.</p><p>For Elasticsearch, the gap shows up concretely:</p><ul><li><strong>Elasticsearch Query Language (ES|QL) is new territory.</strong> LLMs are trained heavily on SQL, but ES|QL is a piped query language with different syntax, different functions, and different semantics. Agents frequently write queries that look plausible but don&#x27;t parse. 
They confuse <code>WHERE</code> with <code>| WHERE</code>, invent functions that don&#x27;t exist, and miss the pipe-based composition model entirely.</li><li><strong>API surfaces are wide and deep.</strong> Elasticsearch, Kibana, and Elastic Security expose hundreds of APIs across search, ingestion, alerting, detection rules, case management, dashboards, and more. An agent armed with nothing but general training data has to guess which endpoint to call, what the request body looks like, and how to handle the response. It guesses wrong often enough to erode trust.</li><li><strong>Best practices aren&#x27;t in the training data.</strong> When should you use <code>semantic_text</code> versus a custom embedding pipeline? How should you structure an ingest pipeline for a 10GB CSV? What&#x27;s the right detection rule syntax for a <a href="https://www.elastic.co/docs/solutions/security/detect-and-alert/mitre-attandckr-coverage">MITRE ATT&amp;CK</a> technique? General-purpose agents don&#x27;t have curated, reliably structured Elastic-specific knowledge loaded by default. They&#x27;d have to go find it, and even if they did, raw docs don&#x27;t always encode the judgment calls and best practices that skilled practitioners carry.</li></ul><p>The result: Developers spend more time fixing agent output than they would have spent writing the code themselves. That&#x27;s not the experience anyone signed up for.</p><h2>Agent Skills: Platform knowledge, packaged for agents</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d16ec00d16e70a916c5eef0aaa23fcc735b7186-1067x1280.png" style="width: 100%;" /><p>Agent Skills are self-contained directories of instructions, scripts, and reference material that agent runtimes can load dynamically. When a skill is active, the agent has access to the right context at the right time: query syntax, API patterns, validation logic, worked examples, so it can complete tasks correctly on the first try.</p><p>Each skill follows the open <a href="https://agentskills.io">agentskills.io</a> specification: a folder with a <code>SKILL.md</code> file containing metadata and structured instructions. No proprietary format, no lock-in. Skills work across agent runtimes, including Cursor, Claude Code, GitHub Copilot, Windsurf, Gemini CLI, Cline, Codex, and <a href="https://agentskills.io">many more</a>.</p><h3>What&#x27;s in the initial v0.1.0 release</h3><p>The first set of skills spans five areas of the Elastic Stack:</p><ul><li>Interacting with Elasticsearch APIs (search, indexing, cluster management)</li><li>Building and managing Kibana content such as dashboards, alerts, connectors, and more</li><li>Domain expertise for Elastic Observability</li><li>Domain expertise for Elastic Security</li><li>Making effective agents in Agent Builder</li></ul><h3>Skills are composable</h3><p>Skills aren&#x27;t monolithic. They&#x27;re modular by design. Your agent loads only the skills relevant to the task at hand. Working on an ES|QL query? The ES|QL skill activates. Need to build a dashboard from those results? The dashboards skill picks up. Assessing the health of your application? The service health skill comes into play. Investigating a security alert? The triage skill chains into case management and response skills as the investigation progresses.</p><p>This composability means that you don&#x27;t need a single, massive prompt that tries to cover everything. 
Each skill carries exactly the context its domain requires, nothing more, nothing less.</p><h2>For developers building search and AI applications</h2><p>If you&#x27;re loading data into Elasticsearch, writing queries, or migrating indices, skills reduce the cycle of generating code, hitting errors, and searching docs for what went wrong.</p><p>Ask your agent to load a CSV file, and it uses a streaming ingestion tool that handles backpressure and infers mappings from the data. It’s not a hand-rolled _bulk loop that runs out of memory on the first large file. Ask it to query with ES|QL, and it discovers your actual index names and field schemas, then writes valid piped queries with correct syntax, appropriate aggregations, and version-aware feature selection, not a SQL-flavored guess that requires three rounds of debugging. Ask it to reindex across clusters, and it follows the full operational workflow: creates the destination with explicit mappings, tunes settings for throughput, runs the job async, and restores production settings when it finishes, not a bare _reindex call that skips half the steps an experienced operator would follow.</p><p>Instead of an agent that gives you a plausible starting point you have to fix, you get one that encodes the operational discipline that makes the output actually work.</p><p><strong>Example impacts of using Elastic Agent Skills</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Eval</td><td style="border: 1px solid #ccc; padding: 4px;">What the skill changed</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-audit-query-failed-logins</td><td style="border: 1px solid #ccc; padding: 4px;">Used the audit log query patterns from the skill instead of generic search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-authz-role-mapping-ldap</td><td style="border: 1px solid #ccc; padding: 4px;">Emitted the correct role mapping API call structure</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-basic-query</td><td style="border: 1px solid #ccc; padding: 4px;">Wrote ES|QL pipe syntax over Query DSL</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-error-handling</td><td style="border: 1px solid #ccc; padding: 4px;">Schema first instead of guessing field names</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-schema-discovery</td><td style="border: 1px solid #ccc; padding: 4px;">Never guessed an index name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-ingest-csv-with-infer</td><td style="border: 1px solid #ccc; padding: 4px;">Used --infer-mappings alone, avoided combining with --source-format csv which causes an empty index</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-ingest-json-file</td><td style="border: 1px solid #ccc; padding: 4px;">Used robust ingestion approach capable of handling large files</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-reindex-local-async</td><td style="border: 1px solid #ccc; padding: 4px;">Created destination index first with replicas: 0 and refresh_interval: "-1", then async reindex. Baseline skipped any prep</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-security-403-privileges</td><td style="border: 1px solid #ccc; padding: 4px;">Followed the skill's diagnostic workflow for privilege errors instead of generic advice</td>
                    </tr>
              </tbody>
            </table>
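          <p>To make a row like <code>esql-basic-query</code> concrete: with the skill loaded, the agent reaches for piped ES|QL instead of a JSON Query DSL body. A minimal sketch of that output shape, using hypothetical index and field names:</p>
          <pre><code>FROM logs-web-*
| WHERE status_code >= 500
| STATS error_count = COUNT(*) BY service.name
| SORT error_count DESC
| LIMIT 10</code></pre>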
          <h2>For security teams</h2><p>Security teams repeat the same operational workflows daily: triaging alerts, tuning detection rules, managing cases. Agent Skills encode that procedural knowledge so your AI agent can execute these workflows correctly, calling the right APIs in the right order with the right field names. For a hands-on walkthrough that takes you from zero to a fully populated Elastic Security environment without leaving your IDE, see <a href="https://www.elastic.co/security-labs/agent-skills-elastic-security">Get started with Elastic Security from your AI agent</a>.</p><h2>For observability and operations teams</h2><p>The new Agent Skills for Elastic Observability reduce the operational toil of instrumenting complex systems, managing SLOs, sifting through complex data, and assessing service health. Embedding native Elastic expertise directly into AI agents allows teams to execute complex observability workflows using simple natural language. This empowers SREs and Ops teams to resolve incidents faster and maintain reliable systems more easily. Learn more in <a href="https://www.elastic.co/observability-labs/blog/elastic-agent-skills-observability-workflows">this blog</a>.</p><h2>Open source, open spec, community-driven</h2><p>We&#x27;re releasing Agent Skills under the Apache 2.0 license because we believe that agent knowledge should be open. The <a href="https://agentskills.io">agentskills.io</a> specification that skills follow is an open standard, not an Elastic proprietary format. We want skills to be a community effort, not a walled garden.</p><h2>Part of a bigger picture</h2><p>Agent Skills is one piece of a broader initiative to make Elasticsearch the most agent-friendly data platform available. For agents that live on the Elasticsearch platform, <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Agent Builder</a> goes further by inheriting your data&#x27;s access controls and permissions, providing built-in and custom tools for search and analysis, and letting users interact with agents in context alongside their dashboards, alerts, and investigations. Finally, support for skills is coming soon to Agent Builder, giving developers the flexibility to leverage Elastic Agent Skills, as well as skills from any other source, to enable secure, context-enhanced chat and automation on the Elasticsearch platform.</p><p>For agents that live everywhere else, we&#x27;re investing in the open ecosystem:</p><ul><li><strong>Model Context Protocol (MCP) server expansion:</strong> Extending the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">MCP endpoint</a> in Agent Builder with more tools beyond the current search, ES|QL, and index operations.</li><li><strong>Authentication improvements:</strong> Making it easier for agents to connect securely, with the goal of eliminating manual API key copy-paste.</li><li><strong>LLM-readable documentation:</strong> Publishing <code>llms.txt</code> and <code>AGENTS.md</code> files so agents can discover and understand Elastic APIs on their own.</li><li><strong>A command line interface (CLI) for agent workflows:</strong> Command-line tooling that makes connection management and common operations agent-friendly.</li></ul><p>Skills are the layer you can use today. The rest is coming.</p><h2>Get started</h2><p><strong>Before you get started: </strong>AI coding agents operate with real credentials, real shell access, and often the full permissions of the user running them. 
When those agents are pointed at security workflows, the stakes are higher: you&#x27;re handing an automated system access to detection logic, response actions, and sensitive telemetry. Every organization&#x27;s risk profile is different. Before enabling AI-driven security workflows, <strong>evaluate what data the agent can access, what actions it can take, and what happens if it behaves unexpectedly</strong>.</p><p>Install Elastic Agent Skills into your agent runtime:</p><p><code>npx skills add elastic/agent-skills</code></p><p>This auto-detects your installed agent runtimes and places skills in the correct configuration directory. From there, your agent picks them up automatically.</p><p>You can also browse the <a href="https://github.com/elastic/agent-skills">skills catalog</a> directly and install individual skills manually by copying the skill folder into your agent&#x27;s configuration directory.</p><p>Don&#x27;t have an Elasticsearch cluster yet? Start an <a href="https://cloud.elastic.co/registration">Elastic Cloud free trial</a>. It takes about a minute to get a fully configured environment.</p><p><strong>Explore the project:</strong></p><ul><li><a href="https://github.com/elastic/agent-skills">Agent Skills repository</a></li><li><a href="https://agentskills.io">agentskills.io specification</a></li><li><a href="https://www.elastic.co/docs">Elasticsearch documentation</a></li><li><a href="https://cloud.elastic.co/registration">Elastic Cloud free trial</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch, part 4: The ultimate challenge]]></title>
            <description><![CDATA[Solving and evaluating entity resolution challenges in a highly diverse “ultimate challenge” dataset designed to prevent shortcuts.]]></description>
            <link>https://www.elastic.co/search-labs/blog/entity-resolution-elasticsearch-llm-challenges</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/entity-resolution-elasticsearch-llm-challenges</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/70fb0ff949f6db9ac9b8a28ecb4329ab915ebf46-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’ve now seen intelligent entity resolution implemented in two ways. Both approaches begin the same way: entity preparation and extraction, followed by candidate retrieval with Elasticsearch. From there, we evaluate those candidates using a large language model (LLM), either through prompt-based JSON generation or through function calling, and require the model to provide a transparent explanation for its judgment.</p><p>As we saw in the <a href="https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling">previous post</a>, the consistency provided by function calling is not just a nice optimization; it’s essential. Once we removed structural errors from the evaluation loop, results on standard scenarios (such as those in the tier 4 dataset) improved dramatically.</p><p>Yet there’s an obvious question left to answer:</p><p><em>Does this approach still work when things get genuinely messy?</em></p><p>Real-world entity resolution rarely fails because of simple cases. It fails when names cross languages, cultures, writing systems, time periods, and organizational boundaries. It fails when people are referenced by titles instead of names, when companies change names, when transliterations aren’t consistent, and when context (not spelling) is the only thing tying a mention to a real-world entity.</p><p>So, for the final post in this series, we put the system through what we called <strong>the ultimate challenge</strong>.</p><h2>What makes this the ultimate challenge?</h2><p>In earlier evaluations, we tested the system using increasingly complex datasets. By the time we reached tier 4, discussed in the previous post, we were already dealing with a mix of nicknames, titles, multilingual names, and semantic references. Those tests showed that the architecture itself was sound, but that reliability issues, especially malformed JSON, were suppressing recall.</p><p>With function calling in place, we finally had a stable foundation. That gave us the opportunity to ask a more interesting question:</p><p><em>Can one unified pipeline handle <strong>many different kinds</strong> of entity resolution problems at once?</em></p><p>The ultimate challenge dataset was designed to push precisely on that dimension.</p><p>Instead of focusing on a single difficulty (like nicknames or transliteration), this dataset combines <strong>50+ distinct challenge types</strong>, including:</p><ul><li>Cultural naming conventions.</li><li>Title-based references.</li><li>Business relationships and historical name changes.</li><li>Multilingual and cross-script mentions.</li><li>Compound challenges that mix several of the above.</li></ul><p>Crucially, this isn’t about optimizing for any one narrow use case. 
It’s about testing whether the <em>design pattern</em> holds up when the rules change from entity to entity.</p><h2>The dataset at a glance</h2><p>The ultimate challenge dataset consists of:</p><ul><li><strong>50 entities</strong>, spanning people, organizations, and institutions.</li><li><strong>~60 articles</strong>, with varying structure and linguistic complexity.</li><li><strong>51 distinct challenge categories</strong>, grouped broadly into:<ul><li>Cultural naming conventions.</li><li>Titles and professional context.</li><li>Business and organizational relationships.</li><li>Multilingual and transliteration challenges.</li><li>Combined and edge‑case scenarios.</li></ul></li></ul><p>Earlier in the series, we saw that using generative AI (GenAI) to create datasets can be a mixed blessing. Without it, assembling sufficiently large and diverse test data would be extremely difficult. But left unchecked, the model has a tendency to make things too easy.</p><p>On an early generation pass, for example, we discovered that the model had included phrases like “the Russian president” as explicit aliases for Vladimir Putin. That might seem reasonable today, but it defeats the purpose of testing contextual resolution. What happens if the article is discussing Russia in the 1990s? The system should infer the correct entity from context, not rely on a hard-coded alias.</p><p>For that reason, this dataset was deliberately designed so that <strong>shortcuts don’t work</strong>. Aliases are not explicitly listed when the system is expected to infer meaning. Descriptive phrases are not prelinked to entities. Correct matches often depend on article-level context, not just local text.</p><p><strong>Important note:</strong> Although we demonstrate the system’s capabilities across diverse scenarios, this is still an educational prototype. Production systems handling real-world sanctioned-entity monitoring would require additional validation, compliance checks, audit trails, and specialized handling for sensitive use cases.</p><h2>Why these scenarios are hard</h2><p>Back in the first post in this series, we introduced a simple but ambiguous example: “The new Swift update is here!” The challenge is that “Swift” can resolve to multiple real-world entities, depending on context. That example captures a broader truth: Natural language is inherently ambiguous.</p><p>Entity resolution, therefore, is not just a string-matching problem. Humans routinely rely on shared knowledge, cultural norms, and situational context to resolve references, and we rarely even notice we’re doing it.</p><p>Consider a few common cases:</p><ul><li>A title like “the president” is meaningless without geopolitical and temporal context.</li><li>A company name may refer to a parent, a subsidiary, or a former brand depending on when the article was written.</li><li>A person’s name may appear in different orders, scripts, or transliterations, depending on language and culture.</li><li>The same phrase can legitimately refer to different entities in different contexts, and the system must be able to <em>reject</em> matches just as confidently as it accepts them.</li></ul><p>There is no single rule set that handles all of this cleanly. 
That’s why this prototype separates concerns so aggressively:</p><ul><li>Elasticsearch narrows the candidate space efficiently and transparently.</li><li>The LLM is used only where judgment is required and is forced to explain itself.</li><li>Retrieval and reasoning remain distinct steps.</li></ul><p>This separation becomes even more important as the diversity of challenge types increases.</p><h2>How the system handles diversity without special cases</h2><p>One of the most interesting outcomes of this evaluation is what <em>didn’t</em> change:</p><ul><li>We did <strong>not</strong> add special logic for Japanese names.</li><li>We did <strong>not</strong> add custom rules for Arabic patronymics.</li><li>We did <strong>not</strong> add hard-coded mappings for historical company names.</li></ul><p>Instead, the system relied on the same core ingredients introduced earlier in the series:</p><ul><li>Context-enriched entities indexed for semantic search.</li><li>Hybrid retrieval (exact, alias, and semantic) in Elasticsearch.</li><li>A small, well-defined set of candidate matches.</li><li>LLM judgment constrained by function calling and minimal schemas.</li></ul><p>This suggests that the system’s flexibility comes from <strong>representation and architecture</strong>, not from an ever-growing collection of rules.</p><p>When the system succeeds, it’s because the right candidates are retrieved and the LLM has enough context to explain why a reference does (or does not) map to a specific entity.</p><h2>Results: How did it perform?</h2><p>On the ultimate challenge dataset, the system produced the following overall results:</p><ul><li><strong>Precision:</strong> ~91%</li><li><strong>Recall:</strong> ~86%</li><li><strong>F1 Score:</strong> ~89%</li><li><strong>LLM acceptance rate:</strong> ~72%</li></ul><h3>Performance across challenge types</h3><p>Breaking down results by challenge type reveals strengths and limitations:</p><p><strong>Strongest performance (100% F1 score)</strong> was observed in areas such as:</p><ul><li>Cross-script matching (Cyrillic, Korean, Chinese business entities).</li><li>Hebrew scenarios (patronymics, professional titles, religious titles, transliteration).</li><li>Business hierarchies (aerospace, diversified manufacturing, multidivision corporations).</li><li>Professional titles (academic, military, political, religious).</li><li>Combined Japanese scenarios involving multiple writing systems.</li></ul><p><strong>Strong performance (80–99% F1 score)</strong> included:</p><ul><li>International political figures (98%).</li><li>Historical name changes (90%).</li><li>Complex business hierarchies (89%).</li><li>Japanese company names (93%).</li><li>Cross-script transliteration (86%).</li><li>Arabic patronymics (86%).</li></ul><p><strong>More challenging areas</strong> included:</p><ul><li>Advanced transliteration (Chinese, Korean): 0% F1.</li><li>Certain Japanese scenarios (honorifics, name order, writing system variation): ~67% F1.</li><li>Some Arabic scenarios (company names, institutional references): ~40% F1.</li></ul><p>What’s important here is <em>why</em> the system struggled in these cases. The failures were not due to the overall approach breaking down, but to limitations in specific components, most notably the dense vector model used for semantic search in certain multilingual scenarios.</p><p>Because retrieval and judgment are cleanly separated, improving performance does not require rewriting the system. 
Swapping in a more capable multilingual embedding model, enriching entity context, or refining retrieval strategies would improve results across these categories without changing the core architecture.</p><p>From an architectural standpoint, that’s the real success metric.</p><h2>What this tells us about the design</h2><p>Looking back across the series, a few patterns stand out:</p><ul><li><strong>Preparation matters more than clever matching. </strong>Enriching entities with context up front dramatically reduces ambiguity later.</li><li><strong>LLMs are most valuable as judges, not retrievers. </strong>Asking them to explain <em>why</em> a match makes sense is far more powerful than asking them to search.</li><li><strong>Reliability enables accuracy. </strong>Function calling didn’t just clean up JSON; it unlocked recall that was already latent in the retrieval step.</li><li><strong>Generalization beats specialization. </strong>A small number of well-chosen abstractions handled dozens of challenge types without custom logic.</li></ul><p>This is why the prototype is intentionally Elasticsearch-native and intentionally conservative in how it uses LLMs. The goal isn’t to replace search; it’s to make search explainable in situations where meaning matters.</p><h2>Final thoughts</h2><p>The ultimate challenge wasn’t about chasing perfect metrics; it was about answering a more fundamental question:</p><p><em>Can a transparent, search-first, LLM-assisted architecture handle real-world entity ambiguity without collapsing into rules or black boxes?</em></p><p>For this educational prototype, the answer is yes, with clear caveats around production hardening, compliance, monitoring, and data quality. If you’re building systems that need to justify <em>why</em> an entity match was made, this pattern is worth serious consideration. I hope this series has shown that entity resolution doesn’t have to be mysterious. With the right separation of concerns, it becomes something you can reason about, measure, and improve.</p><p>This work also suggests a broader architectural pattern. What emerges is a slight but important evolution of classic retrieval augmented generation (RAG). Instead of allowing retrieval to feed generation directly, we introduce an explicit evaluation step. The LLM is first used to judge and sanity-check retrieved candidates, and only those approved results are allowed to augment generation. You can think of this as Generation-Augmented Retrieval-Augmented Generation with Evaluation, or GARAGE, because who doesn’t love a good acronym.</p><p>What other use cases could benefit from this pattern? Systems that require trust, transparency, and defensible reasoning are natural candidates. Future work in this area should prove as compelling as the results we’ve seen here, and I’m excited to see where the community takes it next.</p><h2>Next steps: Try it yourself</h2><p>Want to see the ultimate challenge in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=5%20minutes%20ago-,05_ultimate_challenge_v3.ipynb,-Initial%20public%20lab"><strong>Ultimate Challenge notebook</strong></a> for a complete walkthrough, with real implementations, detailed explanations, and hands-on examples.</p><p>The complete entity resolution pipeline demonstrates the core concepts and architecture needed for production use. 
You can use it as a foundation to build systems that monitor news articles, track entity mentions, and answer questions about which entities appear in which articles, all while retaining transparency and explainability.<br/></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The stateless architecture of Elasticsearch Serverless]]></title>
            <description><![CDATA[Exploring the stateless architecture of Elasticsearch Serverless. Learn how the stateful architecture was transformed into stateless for Serverless.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-stateless-architecture</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-stateless-architecture</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Iraklis Psaroudakis]]></dc:creator>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5ce3d804dd4652ce2114ce6a20f53cb41e5826c5-1396x812.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><strong>We&#x27;re thrilled to announce</strong> the publication of our new peer-reviewed paper, <strong><em><a href="https://doi.org/10.1145/3772052.3772245">Serverless Elasticsearch: the Architecture Transformation from Stateful to Stateless</a>,</em></strong> which was accepted and presented at the industrial track of the 2025 Association for Computing Machinery (ACM) Symposium on Cloud Computing (SoCC). The paper gives a concise view of our recent innovations in Elasticsearch. At Elastic, we’re relentlessly focused on the future of search. From optimizing performance to simplifying operations, our teams are always exploring what&#x27;s next. </p><p>This paper isn&#x27;t just an academic exercise. It&#x27;s a foundational exploration of how the core of a search engine could be reimagined for a purely serverless world. We <a href="https://www.elastic.co/search-labs/blog/stateless-your-new-state-of-find-with-elasticsearch">decouple</a> storage from compute: Data lives in a cloud blob store with virtually infinite storage and scalability. That vision is the main driver behind our <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> offering: seamless search over massive datasets, with the economics and operational simplicity of serverless.</p><h2>The challenge: Rethinking stateful search for the cloud</h2><p>For decades, search engines have been powerful, stateful systems. Deploying a production-grade cluster like Elasticsearch has meant:</p><ul><li>Provisioning servers and managing storage.</li><li>Carefully tuning configurations for cost, performance, and reliability.</li><li>Paying for idle capacity when workloads are spiky or unpredictable.</li><li>Significant operational effort to scale up and down.</li></ul><p>Modern cloud platforms have made some of this easier, but the fundamental tension remains:</p><p><em>Can we build a search engine that delivers the power and rich query capabilities of Elasticsearch with the economics and operational simplicity of a serverless architecture?</em></p><p>That question drove our research.</p><h2>Our key contributions</h2><p>The paper presents concrete innovations that make Elasticsearch Serverless possible:</p><ul><li><strong>Object store as single source of truth: </strong>We offload index data, the transaction log (translog), and cluster state to a cloud object store. That eliminates replica shards for durability and makes the object store the sync point between indexing and search.</li><li><strong>&quot;Thin&quot; (stateless) shards: </strong>Shards recover and relocate quickly across nodes without copying large amounts of data. 
Disks are used only for caching, not for persistent storage.</li><li><strong>Batched compound commits (BCC): </strong>We wrap index commits in a custom format, cutting upload costs, while keeping the same read-after-write semantics as Elasticsearch.</li><li><strong>Batched translog uploads: </strong>Translog uploads are batched at the node level, cutting upload costs.</li><li><strong>Smart garbage collection:</strong> We track the usage of BCCs and translogs we’ve uploaded, and we delete them once they’re unused, to reduce storage footprint and retention costs.</li><li><strong>Autoscaling: </strong>We scale automatically with ingestion and search load so clients can call APIs without managing cluster size.</li></ul><p><strong>The bottom line:</strong> In our experiments, Elasticsearch Serverless achieves up to <strong>twice</strong> the indexing throughput of stateful Elasticsearch on comparable hardware and scales linearly with autoscaling to match ingestion load.</p><h2>Visualizing the architecture</h2><p>Figure 1 in the paper gives a clear side-by-side view: stateful Elasticsearch versus the new stateless architecture of Elasticsearch Serverless.</p><p><strong>Stateful Elasticsearch (top):</strong> Familiar data tiers: hot, warm, cold, frozen. Data lives on local disks; primaries and replicas are spread across nodes; colder tiers may use searchable snapshots on an object store.</p><p><strong>Elasticsearch Serverless (bottom):</strong> Just two tiers: <strong>indexing</strong> and <strong>search</strong>. All durable data (Lucene commits, translogs, cluster state) lives in the object store. Indexing nodes write and upload; search nodes read from the object store and a shared cache, with no local persistence of index data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c4ca8c725cc60c06625cee3b5d1045c42367c166-1080x1116.png" style="width: 100%;" /><p>The takeaway: <em>a complete separation between the resources used for indexing and those used for querying.</em></p><h2>A tale of two data flow paths</h2><p>Figures 2 and 3 in the paper contrast how data flows in stateful Elasticsearch versus Elasticsearch Serverless.</p><p><strong>Stateful Elasticsearch </strong>(figure 2):</p><ul><li>Documents go to the primary shard&#x27;s Lucene buffers and translog and then to replica shards.</li><li>After refresh, the documents go to new searchable segments.</li><li>After flush, they’re committed to disk.</li><li>Thus, durability is given by the disk and the replicas.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eed45de365b8475b8f258342d344ad71491d39e4-1088x1022.png" style="width: 100%;" /><p><strong>Elasticsearch Serverless </strong>(figure 3):</p><ul><li>Documents go to Lucene and the translog on an indexing node.</li><li><em>Before</em> acknowledging the client, the translog is uploaded to the object store.</li><li>After refresh, the documents go to new searchable segments and are committed to disk in the indexing nodes.</li><li>After flush, they go into BCCs and are uploaded to the blob object store.</li><li>Search nodes serve queries from the object store (and, for recent data not yet uploaded, directly from the indexing node).</li><li>Thus, durability comes from the object store, not from disk or replicas.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5374213dbac9c84e0d672045f663582b25f57e7a-1090x986.png" style="width: 100%;" /><p>Result: <em>The indexing and search paths are fully 
decoupled</em>.</p><h2>Autoscaling</h2><p>Section 7 of the paper describes the autoscaler. Because data lives in the object store, relocating shards doesn&#x27;t mean copying full segment data; only metadata and, when needed, cache warming. So the cluster can scale up and down <em>much</em> faster than in stateful Elasticsearch.</p><p>How it works:</p><ul><li>The autoscaler is an external component that monitors metrics from Elasticsearch Serverless.</li><li><strong>Indexing tier:</strong> Scale-up is driven by memory usage and ingestion load (including queued work).</li><li><strong>Search tier:</strong> Scale-up is driven by memory, search load, and the user-configurable &quot;search power&quot; (how much of the dataset is cached locally).</li><li>It polls every few seconds and adjusts each tier independently.</li></ul><p>Outcome: <em>automatic, workload-driven scaling so clients can focus on their applications instead of on capacity planning</em>.</p><h2>The experimental results</h2><p>Section 8 of the paper presents our experimental evaluation.</p><p><strong>Microbenchmarks</strong> show the impact of batching: fewer object store operations for both commits and translogs, with some trade-offs.</p><p><strong>Autoscaling experiments:</strong> As we increase the number of indexing clients, throughput scales linearly while P50 and P99 latency stay stable. A real-world example shows bulk response times improving and stabilizing as the indexing tier scales up with demand.</p><p><strong>Head-to-head comparison of stateful Elasticsearch versus Elasticsearch Serverless:</strong></p><ul><li>Elasticsearch Serverless achieves roughly <strong>twice</strong> the indexing throughput of stateful Elasticsearch at the 50th percentile.</li><li>The gain comes largely from using the object store for durability instead of replicating every operation to replica shards.</li><li>Latency stays competitive.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/353ca1e865535e7211d22547a0d9eaa60e47b4aa-1084x578.png" style="width: 100%;" /><p>The takeaway: <em>The stateless design delivers both better peak performance and more efficient, automatic scaling.</em></p><h2>Why this matters for the future of Elastic</h2><p>The stateless architecture isn&#x27;t just a technical achievement; it&#x27;s the foundation for how we want search to work in the cloud.</p><ul><li><strong>Pay-as-you-go:</strong> Customers can index and search over practically limitless data without provisioning clusters, tuning tiers, or managing replicas and snapshots.</li><li><strong>Automatic scaling: </strong>Each tier scales on its own automatically; no capacity planning required.</li><li><strong>Frequent, automated upgrades:</strong> Better security and time-to-value, without the operational cost of rolling upgrades over stateful data.</li></ul><p>This work is a step toward making powerful search more accessible, cost-effective, and scalable for everyone.</p><h2>Read the full paper, and join the conversation</h2><p>We believe in the power of open research and collaboration to move technology forward. We encourage you to dive into the details. 
We provide a <a href="https://cdn.sanity.io/files/me0ej585/search-labs-import-testing/af199341cc57431238d62c7d4dbc31791bfa806e.pdf">preprint of this paper</a> for your information, which details in depth the architecture transformation.</p><h2>Dive deeper: Explore related blog posts</h2><p>While our paper offers a concise overview of the Elasticsearch Serverless architecture, the details and underlying innovations are explored more fully in a collection of in-depth blog posts written by our engineering team. These articles provide the background, nuance, and specific technical deep dives that make the stateless transformation possible.</p><p>We encourage you to delve into the following resources to gain a richer understanding of the components and concepts presented in the paper:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/stateless-your-new-state-of-find-with-elasticsearch">Stateless — your new state of find with Elasticsearch (2022)</a> and <a href="https://www.elastic.co/blog/elastic-serverless-architecture">Serve more with Serverless (2023)</a>. Read the foundational posts introducing the concept of decoupling storage and compute.</li><li><a href="https://www.elastic.co/search-labs/blog/data-safety-stateless-elasticsearch">Stateless: Data safety in a stateless world (2024).</a> Learn how data durability is achieved in the absence of local replicas.</li><li><a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">Autosharding of data streams in Elasticsearch Serverless (2024).</a> Discover the logic behind automatic and dynamic data stream sharding.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-refresh-costs-serverless">How we optimized refresh costs in Elasticsearch Serverless (2024).</a> Understand the specific optimizations applied to reduce the cost of making data searchable.</li><li><a href="https://www.elastic.co/search-labs/blog/thin-indexing-shards-elasticsearch-serverless">Introducing Serverless Thin Indexing Shards (2024).</a> Explore the innovation of &quot;thin&quot; shards that enable rapid relocation and recovery.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-serverless-tier-autoscaling">Search tier autoscaling in Elasticsearch Serverless (2024).</a> Gain insight into the mechanisms driving the automatic scaling of search resources.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-ingest-autoscaling">Ingest autoscaling in Elasticsearch (2024).</a> Learn how the ingestion tier scales automatically to meet fluctuating indexing load.</li><li><a href="https://www.elastic.co/blog/elastic-cloud-serverless-pricing-packaging">Elastic Cloud Serverless pricing and packaging (2025)</a>. Learn how the pricing and packaging was initially structured for Elastic Cloud&#x27;s Serverless offering.</li><li><a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">Elasticsearch vs. OpenSearch: Unraveling the performance gap (2023)</a>. Learn about the performance differences and key optimizations that distinguish Elasticsearch from OpenSearch, as observed in 2023.</li></ul><h2>Acknowledgments</h2><p>We would like to thank all the co-authors of the paper: Iraklis Psaroudakis, Pooya Salehi, Jason Bryan, Francisco Fernández Castaño, Brendan Cully, Ankita Kumar, Henning Andersen, and Thomas Repantis. 
We would also like to thank the Elasticsearch Distributed Systems team for their contributions, as well as the entire Elasticsearch engineering team.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search with Java: LangChain4j Elasticsearch integration]]></title>
            <description><![CDATA[Learn how to use hybrid search in LangChain4j via its Elasticsearch integrations, with a complete Java example.]]></description>
            <link>https://www.elastic.co/search-labs/blog/langchain4j-elasticsearch-hybrid-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/langchain4j-elasticsearch-hybrid-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Laura Trotta]]></dc:creator>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/93d8e0d84fb4cfbf5e51df85df7ec2e600d9dcc7-1088x607.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>In our <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a> on hybrid search with Elasticsearch in LangChain, we explained why hybrid search can help retrieve better results than simple vector search, along with how it works. We recommend reading that article first.</p><p>In addition to Python and JavaScript, the LangChain ecosystem also has a community-driven Java project called <a href="https://github.com/langchain4j/langchain4j">LangChain4j</a>, which will be the focus of this article. We’ll show how powerful hybrid search can be by writing a complete application using LangChain4j, Elasticsearch, and Ollama.</p><h2>Setting up the environment</h2><h3>Running a local Elasticsearch instance</h3><p>Before running the examples, you&#x27;ll need Elasticsearch running locally. The easiest way is using the <a href="https://github.com/elastic/start-local?tab=readme-ov-file"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>After starting, you&#x27;ll have:</p><ul><li>Elasticsearch at http://localhost:9200.</li><li>Kibana at http://localhost:5601.</li></ul><p>Your API key is stored in the .env file (under the elastic-start-local folder) as <code>ES_LOCAL_API_KEY</code>.</p><p>&gt; <strong>Note: This script is for local testing only. Do not use it in production. For production installations, refer to the <a href="https://www.elastic.co/downloads/elasticsearch">official documentation</a> for Elasticsearch.</strong></p><h3>Running a local Ollama instance</h3><p>You’ll also need to connect your application to an embedding model. Although you can choose any provider supported by LangChain4j (check the <a href="https://docs.langchain4j.dev/integrations/language-models/">complete list</a>), for this example we’ll be using Ollama, which can be easily set up locally following the <a href="https://docs.ollama.com/quickstart">quickstart</a>.</p><h2>Let’s start coding</h2><p>The idea for the application is simple: Given a dataset of movies (taken from an IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>), we want to be able to find movies whose descriptions are relevant to our queries. This demo uses a subset of the data, which has been cleaned. You can download the dataset used for this article from our <a href="https://github.com/elastic/hybrid-search-elastic-langchain4j">GitHub repo</a>, along with the full code for this demo.</p><h2>Step 1: Dependencies and environment</h2><p>Open your favorite integrated development environment (IDE), create a new blank project, preferably with a modern Java version (we’re using Java 24) and a Gradle/Maven version to match (in our case, Gradle 9.0).</p><p>We only need three dependencies:</p><pre><code>dependencies {
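    // CSV parsing, the Elasticsearch vector store, and the Ollama embedding model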
    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-csv:2.17.0")
    implementation("dev.langchain4j:langchain4j-elasticsearch:1.11.0-beta19")
    implementation("dev.langchain4j:langchain4j-ollama:1.11.0")
}</code></pre><p>The first one is needed to ingest the data that we’ll embed and query; the other two are the necessary LangChain4j dependencies to connect and manage our Elasticsearch vector store and Ollama embedding model.</p><p>The best way to connect to the external services is to set up environment variables and read them at the start of our main function:</p><pre><code>String elasticsearchServerUrl = System.getenv("ES_LOCAL_URL");
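// the ES_LOCAL_* values come from the .env file generated by the start-local script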
String elasticsearchApiKey = System.getenv("ES_LOCAL_API_KEY");

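// a default local Ollama install listens on http://localhost:11434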
String ollamaUrl = System.getenv("ollama-url");
String ollamaModelName = System.getenv("model-name");</code></pre><h2>Step 2: Ingesting the dataset</h2><p>Since the dataset is a CSV, we’ll be using <a href="https://github.com/FasterXML/jackson-dataformats-text">Jackson dataformat</a>’s <code>jackson-dataformat-csv</code> to easily read the data and map it to a Java class, defined as:</p><pre><code>public record Movie(
    String movie_id,
    String movie_name,
    Integer year,
    String genre,
    String description,
    String director
) {
}</code></pre><p>Now we can create an instance of <code>CsvSchema</code> mapping the CSV structure and read the file into an iterator:</p><pre><code>CsvSchema schema = CsvSchema.builder()                    
    .addColumn("movie_id") // same order as in the csv    
    .addColumn("movie_name")                              
    .addColumn("year")                                    
    .addColumn("genre")                                   
    .addColumn("description")                             
    .addColumn("director")                                
    .setColumnSeparator(',')                              
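    // the first row of the CSV is the header, so skip it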
    .setSkipFirstDataRow(true)                            
    .build();                                             
                                                          
CsvMapper csvMapper = new CsvMapper();                    
                                                          
File initialFile = new File("src/main/resources/scifi_1000.csv");
InputStream csvContentStream = new FileInputStream(initialFile);
                                                          
MappingIterator<Movie> it = csvMapper                     
    .readerFor(Movie.class)                               
    .with(schema)                                         
    .readValues(new InputStreamReader(csvContentStream)); </code></pre><p>Each row needs to be embedded first, and then both the embedded content and the text representation will be ingested by Elasticsearch.</p><p>Let’s start by creating an instance of the Ollama embedding model class:</p><pre><code>EmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
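    // Connects to the local Ollama server and the embedding model read from the environment variables earlier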
    .baseUrl(ollamaUrl)
    .modelName(ollamaModelName)
    .build(); </code></pre><p>And then the Elasticsearch vector store, which needs an instance of the Elasticsearch Java RestClient:</p><pre><code>RestClient restClient = RestClient
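    // Low-level Elasticsearch Java REST client, authenticated with the API key from start-local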
    .builder(HttpHost.create(elasticsearchServerUrl))
    .setDefaultHeaders(new Header[]{
        new BasicHeader("Authorization", "ApiKey " + elasticsearchApiKey)
    })
    .build(); 

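// LangChain4j embedding store backed by the local Elasticsearch instance; it holds both the vectors and the text segments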
EmbeddingStore<TextSegment> embeddingStore = ElasticsearchEmbeddingStore.builder()
    .restClient(restClient)
    .build(); </code></pre><p>For the ingestion loop, the LangChain4j library requires the data to be split into two lists: one for the vector representations and one for the original text. We’ll set up both lists and fill them in the loop:</p><pre><code>List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> embedded = new ArrayList<>();</code></pre><p>Here, <code>Embedding</code> and <code>TextSegment</code> are both library-specific classes.</p><p>We’ll iterate over the movie dataset, use the embedding model to retrieve the vector representation of each movie (a text representation of all its fields merged together), and add the name separately as metadata so that the results are easier to read.</p><pre><code>boolean hasNext = true;

while (hasNext) {
    try {
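        // Read and map the next CSV row; malformed rows throw and are handled in the catch below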
        Movie movie = it.nextValue();
        String text = movie.toString();

        Embedding embedding = embeddingModel.embed(text).content();
        embeddings.add(embedding);

        Metadata metadata = new Metadata();
        metadata.put("movie_name", movie.movie_name());
        embedded.add(new TextSegment(text, metadata));

        hasNext = it.hasNextValue();
    } catch (JsonParseException | InvalidFormatException e) {
        // Skip malformed rows, but still advance the iterator so the loop can't stall on a bad row
        hasNext = it.hasNextValue();
    }
}</code></pre><p>Finally, the vector list and the text list are passed to the vector store’s <code>addAll()</code> method, which handles sending the data to Elasticsearch asynchronously:</p><pre><code>embeddingStore.addAll(embeddings, embedded);</code></pre><h2>Step 3: Querying</h2><p>Our goal is to find movies with time loops in the plot, so our prompt will be:</p><pre><code>String query = "Find movies where the main character is stuck in a time loop and reliving the same day.";</code></pre><p>Let’s try a simple vector search first by creating a content retriever with the default <a href="https://www.elastic.co/docs/solutions/search/vector/knn">k-nearest neighbor (kNN) query</a> configuration, then running the query and printing the results:</p><pre><code>ElasticsearchContentRetriever contentRetrieverVector = ElasticsearchContentRetriever.builder()
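    // Vector-only retrieval: the query is embedded and matched against stored vectors with kNN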
    .restClient(restClient)
    .configuration(ElasticsearchConfigurationKnn.builder().build())
    .maxResults(5)
    .embeddingModel(embeddingModel)
    .build();

List<Content> vectorSearchResult = contentRetrieverVector.retrieve(Query.from(query));

System.out.println("Vector search results:");
vectorSearchResult.forEach(v -> System.out.println(v.textSegment().metadata().getString("movie_name")));</code></pre><p>This outputs:</p><pre><code>Vector search results:
The Witch: Part 1 - The Subversion
Divinity
The Maze Runner
Spider-Man
Spider-Man: Into the Spider-Verse</code></pre><p>Now let’s see how hybrid search performs:</p><pre><code>ElasticsearchContentRetriever contentRetrieverHybrid = ElasticsearchContentRetriever.builder()
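    // Hybrid retrieval: BM25 full-text search + kNN vector search, combined with reciprocal rank fusion (RRF)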
    .restClient(restClient)
    .configuration(ElasticsearchConfigurationHybrid.builder().build())
    .maxResults(5)
    .embeddingModel(embeddingModel)
    .build();

List<Content> hybridSearchResult = contentRetrieverHybrid.retrieve(Query.from(query));

System.out.println("Hybrid search results:");
hybridSearchResult.forEach(v -> System.out.println(v.textSegment().metadata().getString("movie_name")));</code></pre><pre><code>Hybrid search results:
Edge of Tomorrow
The Witch: Part 1 - The Subversion
Boss Level
Divinity
The Maze Runner</code></pre><h2>Why these results?</h2><p>This query (“time loop / reliving the same day”) is a great case where hybrid search tends to shine because the dataset contains literal phrases that BM25 can match and vectors can still capture meaning.</p><ul><li>Vector-only (kNN) embeds the query and tries to find semantically similar plots. With a broad sci‑fi dataset, it can drift into “trapped / altered reality / memory loss / high-stakes sci‑fi” even when there’s no time-loop concept. That’s why results like “The Witch: Part 1 – The Subversion” (amnesia) and “The Maze Runner” (trapped / escape) can appear.</li><li>Hybrid (BM25 + kNN + reciprocal rank fusion [RRF]) rewards documents that match both keywords and meaning. Movies whose descriptions explicitly mention “time loop” or “relive the same day” get a strong lexical boost, so titles like “Edge of Tomorrow” (relive the same day over and over again…) and “Boss Level” (trapped in a time loop that constantly repeats the day…) rise to the top.</li></ul><p>Hybrid search doesn’t guarantee that every result is perfect; it balances lexical and semantic signals, so you may still see some non-time-loop sci‑fi in the tail of the top‑k.</p><p>The main takeaway is that hybrid search helps anchor semantic retrieval with exact textual evidence when the dataset contains those keywords. Check the <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a> for more information on how hybrid search works.</p><h2>Full code example</h2><p>You can find the full demo code on <a href="https://github.com/elastic/hybrid-search-elastic-langchain4j">GitHub</a>.</p><h2>Conclusion</h2><p>In this article, we demonstrated how to use hybrid search in LangChain4j through its Elasticsearch integrations, with a complete Java example. This article is an extension of a <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a>, which presents the LangChain integrations for Python and JavaScript and introduces and explains hybrid search. We’re planning to continue our collaboration with LangChain4j in the future by contributing to the embedding models with our Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference">Inference API</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SearchClaw: Bring Elasticsearch to OpenClaw with composable skills]]></title>
            <description><![CDATA[Give your local AI agent access to Elasticsearch data using OpenClaw, composable skills, and agents, no custom code required.]]></description>
            <link>https://www.elastic.co/search-labs/blog/openclaw-elasticsearch-ai-agents</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/openclaw-elasticsearch-ai-agents</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alex Salgado]]></dc:creator>
            <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ce3248c3cb7a352e3fdafef4ac8116ab998ab4f4-1950x1137.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In recent weeks, <a href="https://openclaw.ai/">OpenClaw</a> has been appearing frequently in AI community discussions, particularly among developers interested in agents, automation, and local runtimes. The project gained traction quickly, which naturally raised a technical question:</p><p><em>What real problem does it solve for engineers?</em></p><p><strong>OpenClaw</strong> is a self-hosted gateway for AI agents: a single runtime that coordinates execution, treats agents as isolated processes, and uses skills (structured instructions in markdown files) as the unit of integration. Conceptually, this isn’t entirely different from what we already do with command line interfaces (CLIs) and scripts, but it’s now formalized around agent-driven workflows.</p><p>This led to a practical exploration within the Elastic Stack:</p><p><em>If we treat OpenClaw as an orchestration runtime, how does it behave when Elasticsearch is the back end? And how straightforward is integration using OpenClaw skills?</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec61e65f54b96b83975b52b2d88305170001d9bd-1999x1445.png" style="width: 100%;" /><p>Let&#x27;s build an integration using composable skills.</p><h2><strong>Solution architecture</strong></h2><p>In this tutorial, we’ll teach OpenClaw how to access and query Elasticsearch data through a custom read-only skill, and we’ll then demonstrate how it composes multiple skills together; for example, combining Elasticsearch queries with real-time weather data to generate dynamic reports.</p><p>Before diving into the hands-on steps, let’s look at what we’re building. The solution is composed of three integrated layers that work together through OpenClaw orchestration.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ca8dc124a7410ba036ddf887eee011c42125cdf3-1270x680.png" style="width: 100%;" /><h3>Layer 1: Storage and search (Elasticsearch)</h3><p>The data layer runs on Elasticsearch via <a href="https://github.com/elastic/start-local"><code>start-local</code></a>, a single command that spins up Elasticsearch and Kibana locally with Docker.</p><p>Two sample indices demonstrate different use cases:</p><ul><li><strong><code>fresh_produce</code>:</strong> 10 products with semantic search (ecommerce scenario)</li><li><strong><code>app-logs-synthetic</code>:</strong> 30 log entries across four services (observability scenario)</li></ul><p>The same read-only skill works with both indices without any reconfiguration; the agent inspects the mapping and adapts its queries accordingly.</p><h3>Layer 2: Orchestration (OpenClaw Gateway)</h3><p>The gateway receives natural language requests and loads the Elasticsearch skill, and the large language model (LLM) decides which queries to construct. The skill is a pure <strong><code>SKILL.md</code></strong> with reference docs, meaning that its operations require no custom code.</p><p>To understand how the gateway organizes this, two core OpenClaw concepts are worth knowing:</p><ul><li><strong>Agents:</strong> Independent AI instances, each with its own configuration, workspace, and set of skills. You can run multiple agents for different purposes.</li><li><strong>Workspace:</strong> A folder that defines an agent’s context:<strong> <code>AGENTS.md</code></strong> (the agent’s permanent briefing), <strong><code>.env</code> </strong>(credentials), and a <strong><code>skills/</code></strong> directory. 
Think of it as the agent’s working environment.</li></ul><h3>Layer 3: Skills (composable capabilities)</h3><p>Skills are structured instructions in markdown files (<code>SKILL.md</code>) that teach the agent how to use specific tools or APIs. They can be global (available to all agents), workspace-specific, or bundled with OpenClaw. The agent selectively loads only the skills relevant to each request.</p><p>This tutorial uses two skills:</p><ul><li><strong><code>Elasticsearch-openclaw</code> (custom, built for this tutorial):</strong> A read-only skill that teaches the agent how to search, filter, aggregate, and explore Elasticsearch indices using curl.</li><li><strong><code>Weather</code> (community skill, used for composition demo):</strong> A skill that fetches current weather conditions from external APIs.</li></ul><p>Later in the tutorial, we&#x27;ll demonstrate how OpenClaw composes both skills in a single request, querying Elasticsearch products based on real-time weather data without any custom integration code.</p><h4>Read-only by design</h4><p>The <code>elasticsearch-openclaw</code> skill is <strong>read-only by design</strong>. It provides patterns for searching, filtering, and aggregating data, but it never writes, updates, or deletes. This minimizes the security footprint when giving AI agents access to your Elasticsearch cluster.</p><p>Even if the agent environment is compromised, your data remains safe from modification or deletion. This is enforced through:</p><ul><li><strong>Skill design:</strong> No write operation patterns in <code>SKILL.md</code> or reference files.</li><li><strong>API key permissions:</strong> The tutorial uses a read-only API key with only <code>read</code> and <code>view_index_metadata</code> privileges.</li><li><strong>Agent instructions:</strong> <code>AGENTS.md</code> explicitly states &quot;You can SEARCH, FILTER, and AGGREGATE data, but you can NEVER write, update, or delete.&quot;</li></ul><p>This security-first approach is why infrastructure setup (index creation, data loading) must be done manually; by design, the agent cannot do it for you.</p><h2><strong>Prerequisites</strong></h2><p>To follow this tutorial, you’ll need:</p><p><strong>Software and tools:</strong></p><ul><li>Docker Desktop installed and running (Docker Engine with Compose V2).</li><li>Elasticsearch running locally via <code>start-local</code>. (We’ll set this up in the next section.)</li><li>Jina API key (free): <a href="https://jina.ai/embeddings">https://jina.ai/embeddings</a>.</li><li>OpenClaw installed: <a href="https://openclaw.ai">https://openclaw.ai</a>.</li></ul><h3><strong>Setting up the environment</strong></h3><p>Start by cloning the starter project, which contains the skill, workspace configuration, and Dev Tools scripts:</p><pre><code>git clone https://github.com/salgado/elasticsearch-openclaw-start-blog
cd elasticsearch-openclaw-start-blog</code></pre><p>The repository contains:</p><pre><code>elasticsearch-openclaw-start-blog/
├── devtools_fresh_produce.md         ← Creates fresh_produce index (10 products)
├── devtools_app_logs_synthetic.md    ← Creates app-logs-synthetic index (30 logs)
└── openclaw-workspace-elastic-blog/
    ├── AGENTS.md                      ← Agent briefing
    ├── .env.example                   ← Credentials template</code></pre><p><em><strong>Note:</strong> The <code>devtools*.md</code> files contain Kibana Dev Tools commands formatted as reference documentation.</em></p><h4>Installing OpenClaw</h4><p>OpenClaw is a self-hosted gateway. This means you maintain full control over execution and data, but you need to prepare your local environment or server.</p><p>I installed OpenClaw on a separate machine, which is why I included the disclaimer below.</p><p><strong>** Security and responsibility disclaimer **</strong></p><p>Since OpenClaw is an early-stage, rapidly evolving open-source project, the community has raised important discussions about potential security vulnerabilities, especially around token handling and third-party script execution.</p><p><strong>Deployment recommendations:</strong></p><ul><li><strong>Isolated environments:</strong> If you’re not an advanced infrastructure security user, we recommend installing OpenClaw strictly in isolated, controlled environments (such as a dedicated virtual machine [VM], a rootless Docker container, or a test machine).</li><li><strong>Do not use in production:</strong> Avoid running the gateway on servers containing sensitive data or with unrestricted access to your corporate network until the project reaches a more stable, audited version.</li><li><strong>Least privilege:</strong> We reinforce the need to use Elasticsearch API keys with restricted permissions (read-only) to mitigate risks, in case the environment is compromised.</li><li><strong>Network segmentation:</strong> Both Elasticsearch and OpenClaw bind to <code>localhost</code> by default. Keep it that way, unless you have a specific reason to expose them.</li><li><strong>Credential rotation:</strong> Rotate API keys periodically. OpenClaw stores credentials locally, so treat the machine’s security as the perimeter.</li><li><strong>Audit logging:</strong> Enable Elasticsearch audit logging to track all API calls made by OpenClaw. This creates a full trail of what the agent accessed and when.</li><li><strong>Keep the installation up to date.</strong></li></ul><p>For a deeper analysis of the security architecture and deployment options, consult the <a href="https://docs.openclaw.ai">official OpenClaw documentation</a>.</p><h4>Runtime installation</h4><p>OpenClaw manages daemons and skill isolation via CLI. Since it’s a recent project that has undergone naming changes, we recommend strictly following the <a href="https://docs.openclaw.ai/install">official documentation</a> to ensure installation compatibility.</p><pre><code># Global gateway installation
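# Review the install script before piping it to bash, and prefer an isolated machine or VM (see the disclaimer above)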
curl -fsSL https://openclaw.ai/install.sh | bash</code></pre><h2><strong>Preparing the Elasticsearch back end</strong></h2><p>Before connecting any agent runtime, we need a working Elasticsearch environment with data to query and a secure, <strong>read-only access layer</strong>. In the next two sections, we’ll spin up Elasticsearch locally using <code>start-local</code>, create an index with <code>semantic_text</code> and Jina v5 embeddings, load sample data, validate that semantic search works, and generate a read-only API key. Once this foundation is in place, the Elasticsearch side is complete and we can focus entirely on teaching the agent how to use it.</p><h3>Part 1: Setting up Elasticsearch locally</h3><p>Start a local Elasticsearch and Kibana instance with a single command:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>Once complete: Elasticsearch at <code>http://localhost:9200</code>, Kibana at <code>http://localhost:5601</code>, and credentials in <code>elastic-start-local/.env</code>.</p><h3>Part 2: Configuring the index in Kibana Dev Tools</h3><p>Open <code>http://localhost:5601</code> → Dev Tools and run <code>devtools_fresh_produce.md</code> in order.</p><ul><li><strong>Step 1:</strong> Replace <code>YOUR_JINA_API_KEY</code> with your actual Jina API key (free).</li><li><strong>Step 2:</strong> Save the encoded field immediately; it cannot be retrieved later.</li></ul><p>The key commands in the Dev Tools file are:</p><p><strong>Create the Jina inference endpoint:</strong></p><pre><code>PUT _inference/text_embedding/jina-embeddings-v5
{
  "service": "jinaai",
  "service_settings": {
    "api_key": "YOUR_JINA_API_KEY",
    "model_id": "jina-embeddings-v5-text-small"
  }
}</code></pre><p><strong>Create the index with <code>semantic_text</code>:</strong></p><pre><code>PUT /fresh_produce
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "description": { "type": "text" },
      "category": { "type": "keyword" },
      "price": { "type": "float" },
      "stock_kg": { "type": "float" },
      "on_sale": { "type": "boolean" },
      "image_url": { "type": "keyword" },
      "semantic_content": {
        "type": "semantic_text",
        "inference_id": "jina-embeddings-v5"
      }
    }
  }
}</code></pre><p>The <code>semantic_text</code> field type handles embedding generation automatically at index time.</p><p><strong>Index sample products</strong> using the bulk API (see <code>devtools_fresh_produce.md</code> for the full dataset of 10 products).</p><p><strong>Validate semantic search:</strong></p><pre><code>GET /fresh_produce/_search
{
  "query": {
    "semantic": {
      "field": "semantic_content",
      "query": "healthy colorful meals"
    }
  },
  "size": 3,
  "_source": ["name", "description", "category"]
}</code></pre><p>The semantic query type handles inference on the query side automatically; no need to specify model IDs or embedding details.</p><p><strong>Create a read-only API key:</strong></p><pre><code>POST /_security/api_key
{
  "name": "openclaw-readonly",
  "role_descriptors": {
    "reader": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["fresh_produce", "app-logs-synthetic"],
          "privileges": ["read", "view_index_metadata"]
        }
      ]
    }
  }
}</code></pre><p>Save the encoded value from the response. This is your API key for the OpenClaw configuration.</p><h2>Connecting to OpenClaw</h2><p>With the Elasticsearch back end ready, we can now wire it into OpenClaw. Several Elasticsearch integrations already exist in the ecosystem, from <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic’s own Model Context Protocol (MCP) server</a> to community-built MCP servers. However, most of these offer full CRUD access or are designed for different agent runtimes. Given that the technology is still in its early stages and security remains a primary concern, I chose to build a dedicated skill: simple, read-only, and purpose-built for OpenClaw. This approach ensures that the agent can search, filter, and aggregate data but never modify it, keeping the blast radius minimal even if the environment is compromised.</p><p>In the next sections, we’ll configure credentials, install the skill, create a dedicated agent, and explore how the workspace ties everything together.</p><h3>Install the skill and create the agent</h3><h4>Step 1: Configure credentials</h4><p>From the cloned repository, configure the credentials by copying the environment template and filling in your Elasticsearch URL and the read-only API key:</p><pre><code>cp openclaw-workspace-elastic-blog/.env.example \
openclaw-workspace-elastic-blog/.env</code></pre><p>Edit the .env file with these two values:</p><pre><code>ELASTICSEARCH_URL: http://localhost:9200 (from start-local)
ELASTICSEARCH_API_KEY: The encoded value from the read-only API key you created in Part 2 (the POST /_security/api_key response)</code></pre><p>Example .env file:</p><pre><code>ELASTICSEARCH_URL=http://localhost:9200
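# The API key below must be the "encoded" value from the POST /_security/api_key response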
ELASTICSEARCH_API_KEY=VnVaRmxLSDRCQxxxxxxxxbGVfa2V5</code></pre><h4>Step 2: Install the skill from ClawHub</h4><p><a href="https://clawhub.ai/">ClawHub</a> is OpenClaw&#x27;s public skill registry. Think of it as npm for AI agent skills. At the time of this writing, ClawHub hosts over 3,200 skills, covering everything from Slack and GitHub integrations to Internet of Things (IoT) device automation. For this tutorial, we created <code>elasticsearch-openclaw</code>, a custom skill focused on read-only queries using <code>semantic_text</code>, aggregations, and observability on Elasticsearch 9.x. It’s published on ClawHub so you can install it directly. As a best practice, only install skills from trusted sources with known provenance; as with any package manager, review the content before granting access to your agent.</p><p><strong>Recommended:</strong> Open the OpenClaw Web UI (http://127.0.0.1:18789/) and ask:</p><pre><code>Install the elasticsearch-openclaw skill from https://clawhub.ai/salgado/elasticsearch-openclaw</code></pre><p>OpenClaw will:</p><ul><li>Fetch the skill from ClawHub.</li><li>Install it in the appropriate directory.</li><li>Confirm when ready to use.</li></ul><h4>Step 3: Create the agent</h4><p>Do this by registering a dedicated agent with its own workspace, and then restart the gateway to load the new configuration:</p><pre><code>openclaw agents add elasticsearch-agent \
  --workspace ~/path/to/elasticsearch-openclaw-start-blog/openclaw-workspace-elastic-blog \
  --non-interactive

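# Restart the gateway so it picks up the new agent and its workspace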
openclaw gateway restart</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/abc051a513b0cc7dff4a7f02493d51e220c72ad4-1999x1095.png" style="width: 100%;" /><h3>Understanding the workspace</h3><p>Now that the agent is running, let’s look at what makes it tick.</p><h4><code>AGENTS.md</code></h4><p>The <code>AGENTS.md</code> file is the agent’s permanent briefing. It defines who the agent is, what it can do, and how it should behave. For our Elasticsearch agent, this file instructs the agent about the available indices, the read-only constraint, and the preferred query patterns.</p><h4>Skills: When they make a difference</h4>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Without skill</td><td style="border: 1px solid #ccc; padding: 4px;">With <code>elasticsearch-openclaw</code> skill</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent has no knowledge of Elasticsearch query syntax.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent knows semantic, full-text, filtered, and aggregation patterns.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent might attempt write operations.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent is instructed to never write, update, or delete.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent guesses field names and types.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent inspects mappings first and then constructs appropriate queries.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Generic curl commands with trial and error.</td><td style="border: 1px solid #ccc; padding: 4px;">Structured query templates with best practices for Elasticsearch 9.x.</td>
                    </tr>
              </tbody>
            </table>
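          <p>To make the contrast above concrete, here’s a minimal sketch of what a read-only Elasticsearch <code>SKILL.md</code> along these lines could look like. This is an illustrative outline, not the published skill; it assumes the <code>ELASTICSEARCH_URL</code> and <code>ELASTICSEARCH_API_KEY</code> variables from the workspace .env file:</p><pre><code># Elasticsearch (read-only)

Search, filter, and aggregate data in Elasticsearch. Never write, update, or delete.

## Workflow

1. Inspect the index mapping first:

   curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" "$ELASTICSEARCH_URL/fresh_produce/_mapping"

2. Then query, for example with semantic search:

   curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" -H "Content-Type: application/json" \
     "$ELASTICSEARCH_URL/fresh_produce/_search" \
     -d '{"query":{"semantic":{"field":"semantic_content","query":"<user intent>"}},"size":5}'</code></pre>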
          <h2><strong>Exploring with the agent</strong></h2><p>With the Elasticsearch back end configured and the OpenClaw agent connected, it’s time to see what the agent can actually do. In the next sections, we’ll test natural language queries, explore observability data, and compose multiple skills together.</p><h3><strong>Testing in OpenClaw</strong></h3><p>Open the OpenClaw web UI and try some natural language queries. The agent will inspect the index mapping, choose the appropriate query type, and return results.</p><p>Type:</p><pre><code>“Find products that would be good for a healthy summer salad.”</code></pre><p>Result:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5b4bbe71ad56af5462bccc1475bd10d5233abd9-1011x557.png" style="width: 100%;" /><p>Other ideas to explore:</p><ul><li><strong>Index exploration:</strong> &gt; “What indices do I have in Elasticsearch? Show me the fields of <code>fresh_produce</code>.”</li><li><strong>Filtered search:</strong> &gt; “Show me all products on sale under $15.”</li><li><strong>Aggregations:</strong> &gt; “What’s the average price by category?”</li></ul><h3>Observability</h3><p>To demonstrate that the skill works beyond a single use case, the repository includes a second index: <code>app-logs-synthetic</code>, with 30 synthetic log entries across four fictional services, created from <code>devtools_app_logs_synthetic.md</code>.</p><h4>Setting up the log data</h4><p>Since the skill is read-only, you need to populate the index first. The <code>devtools_app_logs_synthetic.md</code> file contains <strong>five commands</strong> (three for setup and two for verification):</p><ul><li><strong><code>Create ingest pipeline</code>:</strong> Adds <code>@timestamp</code> to log entries automatically.</li><li><strong><code>Create index mapping</code>:</strong> Defines the <code>app-logs-synthetic</code> structure (classic fields only, no <code>semantic_text</code>).</li><li><strong><code>Bulk insert logs</code>:</strong> Loads 30 synthetic log entries across four services.</li><li><strong><code>Count query</code>:</strong> Verifies that 30 documents were indexed.</li><li><strong><code>Sample search</code>:</strong> A quick test to confirm that data is queryable.</li></ul><h4>How to run:</h4><ol><li>Open Kibana Dev Tools: http://localhost:5601 → Dev Tools.</li><li>Copy each numbered block from the .md file.</li><li>Paste into the Dev Tools console.</li><li>Press <strong><em>Ctrl/Cmd+Enter</em></strong> to execute.</li><li>Wait for a successful response before continuing to the next block.</li></ol><p>This creates the <code>app-logs-synthetic</code> index with sample data ready for querying.</p><p>Try this query in the OpenClaw web UI:</p><pre><code>Show me the distribution of HTTP status codes across all services.</code></pre><p>Result:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d35fc1c0ea6d647f1c85163eb0ab8e268c6c4f89-1002x565.png" style="width: 100%;" /><p>Other ideas to explore:</p><ul><li>“How many 500 errors do I have in <code>app-logs-synthetic</code>? Which services are failing?”</li><li>“Which endpoints have the slowest response times?”</li><li>“What happened with the <code>payment-service</code> in the last 24 hours?”</li></ul><p>This is the same skill, same agent, same setup, just pointed at different data. 
The agent inspects the new index mapping, adapts its queries, and returns relevant results without any reconfiguration.</p><h2><strong>Composing skills in action</strong></h2><p>This is where composable skills truly shine. Start by asking the agent:</p><pre><code>Install the weather skill.</code></pre><p>OpenClaw will search for the weather skill, automatically attempt the installation, and guide you through the process. Just follow the on-screen instructions; no new API key is required for the weather skill. Afterward, try this:</p><pre><code>“Find the products on sale in the fresh_produce index that match today’s weather in São Paulo. Generate a nice HTML report with product cards using the image_url field from each document, price, description, and stock. Save it to ~/Desktop/report.html and open it in the browser.”</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/14fa4303bb2f1eb19530d8844f09c99948b3c752-1965x1079.png" style="width: 100%;" /><p>In a single request, the agent chains multiple skills: the <strong>weather skill</strong> to check current conditions, the <strong>Elasticsearch skill </strong>to run a hybrid search on products that match the context, and its built-in file and browser tools to generate an HTML report and open it. No custom integration code, no glue scripts, just skills composed by the LLM at runtime.</p><p>This is what makes OpenClaw different from a traditional automation framework. You don’t preprogram the workflow. You describe the outcome, and the agent figures out the composition.</p><h2><strong>Conclusion</strong></h2><p>SearchClaw started as a simple experiment and ended up demonstrating what composable, LLM-driven integration looks like in practice. The key takeaway is not the individual tools (all are familiar) but the approach. Instead of writing a specific application with hardcoded queries, we gave the agent capabilities and let it compose solutions dynamically. This is what makes the approach OpenClaw-native: composable, LLM-driven, and local-first.</p><p>As with any early-stage project, OpenClaw should be used thoughtfully, especially regarding security and environment isolation. The read-only skill approach demonstrated here is one way to limit risk while still unlocking the value of your Elasticsearch data.</p><p>The full code is available in the repository and can serve as a starting point for your own integrations: <a href="https://github.com/salgado/elasticsearch-openclaw-start-blog">https://github.com/salgado/elasticsearch-openclaw-start-blog</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building effective database retrieval tools for context engineering]]></title>
            <description><![CDATA[Best practices for writing database retrieval tools for context engineering. Learn how to design and evaluate agent tools for interacting with Elasticsearch data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Leonie Monigatti]]></dc:creator>
            <pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/06d3502988707f6f88f9b18243072cbbed3eb8af-1200x630.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Good search is incredibly difficult. That&#x27;s why we have an ecosystem full of search algorithms, tooling, and even dedicated stacks like Elastic’s, not to mention an entire field called <em>information retrieval</em>. In modern <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">context engineering</a>, AI agents search and retrieve context from many different sources, including local files, the web, or memory files. Giving agents access to tools to interact with data stored in a database allows them to ground their answers in proprietary information or even complete analytical tasks.</p><p>However, if these interfaces aren’t carefully engineered, agents can search the wrong index, generate invalid SQL/Elasticsearch Query Language (ES|QL) queries, or return large amounts of irrelevant data. During the development of <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a>, we’ve seen these failure modes multiple times. While collaborating with dozens of internal teams to build tools for interacting with Elasticsearch data and integrating them to improve our internal processes with agentic workflows, such as our <a href="https://www.elastic.co/search-labs/blog/agent-builder-one-workflow">internal laptop refresh process</a>, we found that the most successful teams carefully design database retrieval tools as curated interfaces to their data.</p><p>In this blog, we share the best practices we follow when building database retrieval tools. The principles we share are based on common patterns we saw while iteratively improving our prebuilt tools and helping internal teams build custom tools.</p><h2>Key challenges of agentic retrieval</h2><p>Coding and search are among the best use cases of agents. Even though coding agents have recently made substantial progress with new concepts, such as filesystem-oriented tools and code-specific embeddings, search agents (specifically for database retrieval) lack reported breakthroughs.</p><p>Agentic use cases are challenging for multiple reasons: Agents can ignore the available tools to accomplish a task; they can call the wrong tools; and they can call the right tools with the wrong parameters. In addition to these general challenges, we believe that database retrieval use cases are challenging for the following three key reasons:</p><ul><li><strong>Identifying the right index of data</strong> requires the large language model (LLM) to understand what it contains. But sometimes, the number of <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">indices</a> can already be so large that even representing the candidates to select from may cause context-length problems.</li><li><strong>Generating efficient queries</strong> that balance retrieving relevant information with minimizing latency and resource usage can be challenging.</li><li><strong>Avoiding context bloat with tool responses </strong>requires the tool response to be optimized for contextual relevance and<strong> </strong>token efficiency. This isn’t always easy, especially when the agent generates the query from scratch. 
Once the context is no longer relevant to answer a user’s query, offloading the data for later reference is another challenge.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6736afbe774da63ef9534a64f65a7de1f7d075c0-1280x720.png" style="width: 100%;" /><p>During the development of Agent Builder, and while integrating it into our own processes, we faced these challenges multiple times. The principles in the following sections are based on common patterns we saw while iteratively improving our built-in and custom tools and the internal workflows built on top of them.</p><h2>Principles for building effective database retrieval tools</h2><p>In this section, we translate our learnings into guiding principles for building effective database retrieval tools: deciding which tools to build, making sure the agent finds the right index to search and calls the right tool with appropriate parameters, optimizing the tool responses, handling errors, and safeguarding the data.</p><h3>Building the right database retrieval tools (“low floor, high ceiling”)</h3><p>When deciding what database retrieval tools to build, we follow the principle of “low floor, high ceiling” for good <a href="https://biilmann.blog/articles/introducing-ax/">agent experience</a>:</p><ul><li><strong>High ceiling: </strong>Tools that don’t limit the agent&#x27;s potential to handle ambiguous user queries in the worst case. In the context of database retrieval, these are <strong>general-purpose</strong> tools that, for example, allow the agent to write full SQL/ES|QL queries from scratch. These come with the trade-off of added reasoning overhead for the agent, resulting in higher latency, higher cost, and lower reliability.</li><li><strong>Low floor:</strong> Tools with high accessibility that the agent can use successfully on the first try with minimal reasoning overhead for repeating user queries. In the context of database retrieval, these are <strong>specialized tools</strong> that, for example, wrap specific queries. These have the benefit of lower latency, lower cost, and higher reliability than general-purpose tools. However, they require engineering effort, and realistically, it may not be possible for engineers to anticipate every possible user query.</li></ul><p>For example, in our experience, a generic search tool is mandatory to allow the agent to handle unique and ambiguous user queries in the worst case. However, we found it necessary to reduce reasoning overhead and increase efficiency by creating specialized tools (for example, <code>get_top_performing_products(category)</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d1b973be4b0a0a25c99c74f02a47e98395a3f7-1200x630.png" style="width: 100%;" /><p>Another lesson we learned is to consider the level of abstraction of a tool. During our preview phase, the agent had access to a large number of atomic general-purpose tools (for example, <code>get_index_mappings</code>, <code>generate_esql</code>, <code>execute_query</code>, and others). In practice, this had two downsides: When faced with a complex, open-ended question, the agent would confuse tools and their expected order, despite guiding instructions. Combining multiple tools in an agentic workflow also requires handing off information via the context window, which fills it up with information that’s only temporarily important. 
To overcome this inefficiency, we wrapped the functionality of multiple atomic tools into one self-contained search tool.</p><h3>Finding the right index</h3><p>While the majority of tools that interact with a database will focus on querying the database, in some use cases, like index selection, the tool will allow the agent to interact with the database’s metadata to decide which indices to search against based on a user’s query and intent.</p><p>Initially, our naive index selection relied on the index names and a sample of their schema definitions. This worked well in our internal testing, but when internal teams experimented with it, we realized real use cases often don’t have index names that are distinct and descriptive but are instead vague (for example, <code>users</code>, <code>logs</code>, <code>flight_travels</code> versus <code>web-logs-2026.01</code>, <code>web-logs-2026.02</code>).</p><p>To overcome this, we started exposing each index’s metadata and schema definitions in the tool. This significantly improved the selection by allowing engineers to add descriptions that translate technical names into natural language on two levels:</p><ul><li><strong>Index-level descriptions: </strong>What data is stored in the index and how documents relate to one another.</li><li><strong>Field-level hints:</strong> Specific guidance on a field&#x27;s format, expected values, or business meaning (for example, <code>”Use this field for exact ID matches only”</code>).</li></ul><p>In another iteration, we decided to add data sampling. For each index, we sample some of the data present in each field to give the agent a clearer understanding of the type of data in the index. While this significantly improved the efficiency of index selection, it comes at the cost of increased tool response data.</p><h3>Calling the right database retrieval tool</h3><p>Guiding an agent to call the right tool is already challenging enough in general. This section discusses what helped us make sure the agent calls a tool at all to ground its response, and that it calls the right database retrieval tool.</p><h4>Naming and namespacing: Standardizing identifiers for reliable selection</h4><p>A tool’s name acts as a skimmable header that agents use to decide which one to investigate further. In practice, this means using descriptive and distinct tool names. Additionally, tool selection is more reliable when tool names are consistent in formatting (for example, standardize on snake_case) and wording. Using action-oriented verbs helps the LLM map the user’s intent to the tool’s purpose, although the exact wording of the tool names is less critical in practice (for example, <code>search</code> versus <code>find</code>).</p><p>Namespacing tools to group related tools under common prefixes or suffixes is similarly helpful. 
In the case of databases, namespacing tools by index or domain helps the LLM understand tool relationships and prevents naming collisions (for example, <code>finance.search_ticker</code> or <code>support.get_ticket_details</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae64cafe7a3aade3985780d7981a853e90387945-1200x630.png" style="width: 100%;" /><h4>Tool description: Instructing the agent on proper usage</h4><p>The description is the most crucial component of any tool definition because it instructs the agent on <em>when</em> and <em>how</em> to use it, especially when tools have similar names (for example, <code>search_logs</code> and <code>find_errors</code>).</p><p>Consider this template for complex tools:</p><ul><li><strong>Core purpose: </strong>A high-level summary of what the tool does.</li><li><strong>Trigger: </strong>When the tool should be used (and when it should not).</li><li><strong>Action:</strong> Which specific data the tool retrieves or modifies, and what type of questions it can answer.</li><li><strong>Limitations: </strong>What important limitations and constraints exist, such as specific query languages or formats.</li><li><strong>Relationships with other tools: </strong>Does one tool affect another tool, or are there any preconditions?</li><li><strong>Examples:</strong> Specific few-shot examples of user queries and how to use the tool for them, such as how to determine the optimal search strategy or when to use which operator.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3e19d6472cbc62fe2be0e4efcb918e52c10d2f8-1200x630.png" style="width: 100%;" /><p>A note on model sensitivity: While top-tier models like Claude 4.5 Sonnet are forgiving, smaller models often need clearer and more elaborate descriptions to select the right tool.</p><h4>Adding reasoning parameters</h4><p>Inspired by the <a href="https://arxiv.org/pdf/2601.18282v1">paper on think-augmented function calling</a>, we added a reasoning parameter. This approach improves the parameter accuracy by providing a scratchpad for the LLM to process its thoughts and facilitates a more transparent user experience.</p><p>This works well for complex tool calls or when a large number of tools are exposed to the agent. However, it can lead to regression in simple scenarios, and the benefits are further diminished for thinking-based LLMs. In our Agent Builder implementation, the reasoning parameter is often optional and stripped before execution and is only used for tool selection and parameter filling.</p><pre><code>"properties": {
   "reasoning": {
      "type": "string",
      "description": "Brief explanation of why you're calling this tool"
   }
}</code></pre><h4>Support: Reinforcing instructions in the agent prompt</h4><p>A common error we observed is that the LLM would sometimes ignore the available tools and instead use its innate knowledge to generate a (hallucinated) response. For example, when asked ”Can you tell me more information about Elasticsearch’s ES|QL language?”, it would assume it was fine to answer by itself instead of calling the tool, which was specifically designed to fetch documentation about Elastic products.</p><p>To mitigate this problem, we added repeated, explicit instructions in the system prompt of the agent itself to guide the agent to find the right balance between using its innate knowledge and grounding its answer in a tool response. Our testing indicates that this is especially effective when multiple tools with similar purposes are exposed to the agent.</p><h4>Forcing tool usage</h4><p>Beyond reinforcing the instructions in the agent prompt, we found it helpful to force tool usage when mandatory by explicitly binding tools using <code>tool_choice: ‘any’</code>.</p><h3>Calling database retrieval tools with appropriate values and writing queries</h3><p>Another challenge is to enable the agent to call a tool with appropriate parameter values. We’ve observed a consistent pattern where a strong definition, the number of parameters, and their complexity play an important role in reducing errors.</p><h4>Parameter definition</h4><p>A strong parameter definition significantly improves the parameter accuracy. General best practices for parameter definitions in agent tools are:</p><ul><li><strong>Unambiguous name:</strong> Clearly identifies the purpose (for example, <code>user_id</code> versus <code>user</code>).</li><li><strong>Strong typing:</strong> Uses integer, string, or enums, among others, for finite sets of valid values.</li><li><strong>Detailed description:</strong> Explains what the parameter means and when and how to use it. Specifies default values for missing values, document formats (for example, for dates), hidden rules (for example, <code>”at least one of agent_id | user_id is required”</code>), and includes small examples.</li></ul><pre><code># Weak parameter description
"properties": {
   "index": {
      "description": "Name of the index",
   }
}

# Strong parameter description
"properties": {
   "index": {
      "type": "string",
      "description": "The specific index, alias, or datastream to search. Defaults to 'main-alias' if unknown."
   }
}</code></pre><h4>Number of parameters</h4><p>Agents struggle to supply appropriate values when a tool takes a large number of parameters, especially mandatory ones. As a general rule of thumb, we try to keep the mandatory parameters below five and the total parameters below 10.</p><h4>Parameter complexity</h4><p>Reducing the complexity of the input parameter when possible leads to fewer mistakes. For example, it requires less reasoning overhead for an LLM to generate a plain search term than to generate full SQL or ES|QL queries from scratch. Especially for repeating user queries, “pre-canning” search queries reduces latency, cost, and error rates (although modern LLMs are good at using well-known languages, such as SQL).</p><p>To follow the principles of “low floor, high ceiling,” we gravitated toward wrapping a specific query inside a tool and letting the agent only provide the search term. Below is an example of input parameters with varying complexity for the same user query, &quot;Find the 5 most relevant &#x27;resolved&#x27; support tickets based on a new problem description to find previous solutions.&quot;</p><pre><code># Complex parameter
search_support_tickets(
    "FROM support_tickets | WHERE status = 'resolved' AND customer_email = ?email | MATCH(issue_title, issue_description, ?problem_description) | KEEP ticket_id, issue_title, resolution | LIMIT 5"
)

# Simple parameter
find_similar_customer_resolved_tickets(problem_description)</code></pre><h4>Model sensitivity</h4><p>Models have a strong impact on parameter accuracy and query generation. Our internal benchmarking showed that switching from Claude 3.7 Sonnet to Claude 4.5 Sonnet reduced the syntax errors of the generated ES|QL queries from ~28% to ~4%.</p><h4>Input validation</h4><p>Although the above techniques increase parameter accuracy, they don’t eliminate the possibility of errors. Instead of trusting the LLM&#x27;s input, we gravitated toward always validating and sanitizing it to ensure queries adhere to the expected schema.</p><h3>Optimizing database retrieval tool responses</h3><p>One common mistake is neglecting the size of the tool’s output. Because the tool’s output is what goes into the agent’s context window, not optimizing it for <strong>contextual relevance</strong> (quality) can distract the agent, and not optimizing it for <strong>token efficiency</strong> (quantity) can increase cost and risk exceeding the LLM’s context window limit. Working alongside internal teams, we’ve identified three dimensions for reviewing the return values:</p><p><strong>The first dimension is length:</strong> The number of search results. A recurring pitfall we’ve observed among teams is the failure to limit search results, which can lead to overflowing the context window. While vector search queries inherently require a limiting parameter, other search methods often do not. We recommend including <code>limit</code> clauses (for example, 10 to 20) in all query types. This prevents returning low-signal results and ensures token efficiency.</p><p><strong>The second dimension is width: </strong>The number of fields per data object. Instead of returning all properties, including cluttering ones (for example, timestamps and internal IDs, among others), curating a set of relevant fields can help improve both the user experience and the agent behavior.</p><p>Consider the following examples:</p><ul><li><strong>Returning identifiers </strong>with a concise set of properties of a data object allows the agent to retrieve full information about a data object later when it needs it (“just-in-time context engineering”), rather than cluttering the context window.</li><li><strong>Returning metadata for citations</strong> (for example, page numbers in large PDF documents) can help build trust for the user.</li><li><strong>Returning the number of search results and status messages</strong> can help the agent reason about the status of the search query.</li></ul><p><strong>The third dimension is depth:</strong> The size of a single field. Consider a case where the documents themselves are large (for example, on the scale of tens of MB). Those can’t just be passed back to the agent in full, as the context length will instantly reach its limit. To mitigate this, we recommend truncating long text fields when an excerpt is sufficient. 
We found that features such as <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/highlighting">Elasticsearch’s highlighting</a> are helpful for this, without the need for another LLM API call.</p><p>But even when working with smaller documents, letting the tool format the data into an easily digestible form for the LLM improved downstream performance, for example, by sanitizing the content (stripping HTML tags) and formatting it for readability (converting tables to Markdown tables and links to “[Title](url)”).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74a92e1102e235ecd7fd9868d9c3a81a59809c9c-1200x630.png" style="width: 100%;" /><p>While using only one of these techniques is often not sufficient for improving the contextual relevance, combining all of them might reduce the retrieval recall (for example, limiting the number of search results might risk not retrieving all of the relevant documents). In practice, this requires evaluating different combinations of these approaches to find the best balance.</p><h3>Handling errors and enabling self-correction</h3><p>We’ve observed that agents can get stuck in infinite loops or hallucinate responses when they encounter an error. Even if an agent follows its instructions perfectly, if a tool doesn’t provide any error message, only returns an error code, or at best provides a short, nondescriptive error message, the agent has no chance to self-correct from an error it doesn’t understand.</p><p>An <strong>informative error message</strong> enables the agent to understand why the error is happening and how to recover from it. For this, engineers need to think about the “not so happy” paths and the anticipated edge cases, such as the following examples:</p><p>If an error occurred because of a wrongly formulated search query, the agent should be able to reason over the failure and reformulate the query. In this example, returning the number of search results and the generated query can help the agent self-correct.</p><p>In general, engineers need to consider whether “zero results” is an expected behavior or an error for a given tool. In cases where an empty result likely indicates an error, both an error message and agent instructions can be helpful.</p><pre><code># Example error message from tool
"No product data found for product ID [XYZ]. 

Review the search query [insert used search query here].

Ask the customer to confirm the product name, and look up the product ID by name to confirm you have the correct ID."

---
# Example agent instruction
"If the product_search tool returns no results, do not state that the product does not exist. 

First, check that the `:` operator was used for multi-value fields.

Second, ask the user to provide and confirm the product ID or provide an alternative identifier like product name."</code></pre><p>When encountering an API failure due to an expired API key, strictly limit retries (for example, a maximum of two or three) to prevent the agent from repeatedly trying a failing deterministic flow.</p><h3>Safeguarding data</h3><p>The primary engineering challenge for production-grade agent applications with different tools lies in identity propagation, specifically the distinct requirements of <em>authentication</em> (verifying who the user is) and <em>authorization</em> (verifying what they can access). While an initial layer (for example, Okta) can provide the base authentication, downstream systems (for example, ServiceNow, Elasticsearch, and others) maintain their own nonuniform authorization architectures with varying levels of granularity.</p><p>The most successful implementations we’ve seen enforce identity verification at every touchpoint within the tool&#x27;s logic. This prevents the agent from accidentally accessing data that the end user isn&#x27;t allowed to see. By verifying identity at every system level, we ensure that the agent respects privacy even when navigating complex, multisystem workflows. Be aware of the downside; this introduces intentional, security-mandated latency.</p><p>Beyond verifying the identity at every touchpoint, managing credentials securely is critical. Exposing sensitive API keys in tool definitions or hardcoding database credentials in YAML configuration files is a high-risk vulnerability. Instead, we recommend that engineers use secure credential management systems.</p><h2>Evaluating database retrieval tools</h2><p>The development of database retrieval tools for agentic systems is an iterative, evaluation-driven process. To evaluate the effectiveness of our database retrieval tools and uncover issues, our internal teams maintain evaluation datasets with realistic user queries and the expected tool calls (for example, ”Am I eligible for a laptop refresh?” expects the <code>check_eligibility</code> tool). We’ve used the following metrics for evaluation of our database retrieval tools and for benchmarking different LLMs for model selection:</p><ul><li><strong>Tool selection accuracy: </strong>How often the correct tool was selected for a specific query type (for example, retrieval, analytical, hybrid, adversarial).</li><li><strong>First-pass success rate: </strong>A critical metric for us was distinguishing between eventual success and immediate success. Does the agent pick the right tool on the first try, or does it require a self-correction loop? (High self-correction indicates poor tool descriptions.)</li><li><strong>Average tool calls per answer:</strong> We track the efficiency of the agent. 
If the average number of tool calls to answer a simple question jumps from 1.5 to 4, it usually indicates that the agent is getting lost or that the tools are too granular.</li><li><strong>Tool-specific recall/precision:</strong> For dedicated database retrieval tools, we measure standard retrieval metrics to ensure that the documents returned are relevant to the arguments passed by the LLM.</li><li><strong>Failure rate:</strong> We strictly monitor the rate of malformed tool calls (for example, missing parameters) to identify which models need more &quot;hand-holding&quot; in the prompt instructions.</li></ul><p>Once live, our teams at Elastic continue to <strong>monitor</strong> agentic health and log telemetry data (for example, every success and failure and the time taken for the agent to complete the task) in Kibana. This allows our ITOps teams to build dashboards to answer meta-questions like &quot;What is the failure rate this week?&quot;, &quot;How many laptop requests came from California?&quot;, and &quot;How many requests were fulfilled?&quot; without building a custom analytics engine.</p><h2>Summary</h2><p>During the iterative, evaluation-driven development process of <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a>, we identified consistent patterns in what makes database retrieval tools effective for context engineering. When implementing tools to search, retrieve, and manipulate data, we follow these core principles:</p><ol><li><strong>Building the right database retrieval tools</strong> by following the “low floor, high ceiling” principle.</li><li><strong>Helping the agent call the right database retrieval tool with appropriate parameter values</strong> through careful and reinforced prompting and interface design.</li><li><strong>Avoiding context flooding</strong> by optimizing the tool response for context relevancy (quality) and token efficiency (quantity).</li></ol><p>However, there remain open challenges, and we’re actively working on improving these techniques:</p><ul><li>Context bloat is a primary hurdle for agent performance because retrieved data remains in the context window as the conversation progresses. A future direction is to dynamically offload large chunks of data, such as tool responses or attachments, to a file store and allow the agent to retrieve them on demand.</li><li>Efficient discovery of large volumes of tools and data attachments will be central to building production-grade agents. We plan to introduce agent skills with the functionality of <a href="https://claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills"><em>progressive disclosure</em></a> (loading information only as needed) and to implement a semantic metadata layer.</li></ul><h3>Acknowledgements</h3><p>Written by Leonie Monigatti with valuable contributions from colleagues across Search Solutions Engineering (Sean Story, Pierre Gayvallet, Abhimanyu Anand) and Enterprise Applications (Sri Kolagani).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build task-aware agents with an expanded model catalog on Elastic Inference Service (EIS)]]></title>
            <description><![CDATA[Elastic Inference Service (EIS) expands its managed model catalog, enabling teams to build production-ready agents with flexible model choice across retrieval, generation, and reasoning, without managing GPUs or infrastructure.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-ai-agents-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-ai-agents-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Anish Mathur, Deepti Dheer, Ranjana Devaji]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8c0ffb8835c49c116da7163edb0cf7f6857817c-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we’re pleased to announce an expanded model catalog for Elastic Inference Service (EIS), making it easy to run fast, high-quality inference on managed GPUs, without setup or hosting complexity.</p><p>EIS already provides access to state-of-the-art large language models (LLMs) that power <a href="https://www.elastic.co/blog/elastic-out-of-box-genai-capabilities">out-of-the-box AI capabilities</a> across <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a> and Elastic AI Assistants, including automatic ingest, threat detection, problem investigation, and root cause analysis. We’re now extending this foundation with a <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis#llm-chat-models">broader catalog</a> of managed models, giving developers more control over how agents reason, retrieve, and act.</p><p>In practice, this reflects a broader shift in how enterprises build AI systems. The idea of a single, all-purpose AI model no longer holds up. Real-world agent workflows require multiple models with different strengths, costs, and performance characteristics. With EIS, teams can either choose and switch models directly in Agent Builder, with zero setup, cost, or hosting overhead, or they can mix and match models in an agent <a href="https://www.elastic.co/blog/elastic-workflows-technical-preview">workflow</a> so each step uses the model best suited to the task.</p><p>Developers can use models from OpenAI, Anthropic, and Google directly in Elasticsearch, selecting different models for different agent steps while Elastic fully manages inference, scaling, and GPU execution for production agents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4eea6b9d6fbe492d75947828bf1c46ae9801387a-1234x892.png" style="width: 100%;" /><h2>An expanded catalog of managed models on EIS</h2><p>The expanded EIS catalog now includes models optimized for different classes of tasks, from lightweight generation to large-context reasoning and embeddings for retrieval.</p><p>For generation, the catalog includes:</p><ul><li>Anthropic Claude Opus 4.5 and 4.6.</li><li>Gemini 2.5 Flash.</li><li>Gemini 2.5 Pro.</li><li>OpenAI GPT-4.1 and GPT-4.1 Mini.</li><li>OpenAI GPT-5.2.</li><li>OpenAI GPT-OSS-120B.</li></ul><p>For retrieval, EIS includes native Jina AI models, <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service"><code>jina-embeddings-v3</code></a> and <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><code>jina-embeddings-v5</code></a>, which provide fast, high-quality embeddings for multilingual retrieval. The service also includes embedding models from Microsoft, OpenAI, Google, and Alibaba.</p><h2>Choosing the right models for agent tasks</h2><p>With EIS, model choice becomes a design decision inside the agent, rather than an operational concern. 
Agents can select models based on the role they play, without changing how inference is deployed or scaled.</p><p>To see how this plays out in practice, consider a few common agent scenarios.</p><h3>Simple informational query</h3><p>Simple interactions, such as answering “What is our holiday policy?”, do not require an expensive frontier model and can be handled by a fast, low-cost option.</p><ul><li>Task: “What is our holiday policy?”</li><li>Pattern: Retrieve and summarize.</li><li>Model choice: Fast, low-cost generation model.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3c4bb30b433c9026d87d28fc16a5afa93594401-1436x1396.png" style="width: 100%;" /><p>This can also be configured <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-post-agent-builder-converse">through the API</a> by selecting the model you want to use:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "What is our holiday policy?",
 "agent_id": "internal-kb-bot",
 "connector_id": "Anthropic-Claude-Sonnet-4-5"
}</code></pre><p>This step relies primarily on retrieval quality. A lightweight model is sufficient to summarize a small set of documents quickly.</p><h3>Moderate capability</h3><p>More complex tasks may benefit from a more capable generation model, without necessarily requiring the most expensive reasoning model available.</p><ul><li>Task: “Compare our holiday policy with new labor laws in France and draft an email.”</li><li>Pattern: Retrieve relevant documents, compare policy details across sources, and generate output such as a draft email.</li><li>Model choice: More capable generation model.</li></ul><a href="https://videos.elastic.co/watch/ajCYrZDZBqSQ1WUhahiShk">Watch video on Vidyard</a><p>Here’s the API example:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "Compare our holiday policy with new labor laws in France and draft an email.",
 "agent_id": "internal-kb-bot",
 "connector_id": "Google-Gemini-2-5-Pro"
}</code></pre><p>This task requires synthesis across multiple sources and structured output but doesn’t need the heaviest frontier reasoning model.</p><h3>Investigation or audit task (high capability)</h3><ul><li>Task: Review a large document set to identify compliance risks.</li><li>Pattern: Multistep reasoning over large context, where the model evaluates information across many documents and synthesizes findings before producing a final judgment.</li><li>Model choice: Frontier or large-context model.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54a500766139b7c39785eee38184a4479ee76e2b-1080x1080.gif" style="width: 100%;" /><p>Try it out using the API:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "What are the compliance risks associated with Example AI products?",
 "connector_id": "Anthropic-Claude-Opus-4-6"
}</code></pre><p>Because the task requires deeper reasoning and consistent evaluation across many inputs, output quality matters more. A high-capability model is therefore appropriate for this step.</p><p>EIS also enables more advanced orchestration patterns. Enterprises increasingly recognize that using a frontier model for every agent step is inefficient.</p><p>With Agent Builder and Elastic Workflows, teams can design agents where each subtask is executed by the most efficient model for the job, based on cost, complexity, and accuracy requirements.</p><h3>Models-as-judge pattern (quality control)</h3><ul><li>Task: Validate an agent’s output using a second model</li><li>Pattern: Generate and evaluate.</li></ul><p>In this <a href="https://www.elastic.co/blog/elastic-workflows-technical-preview">Elastic Workflow</a> example, the agent uses one model to generate a response and a second model to evaluate its quality, adding a validation layer for the result. Elastic Workflows, the automation engine built into Elasticsearch, let developers combine reliable scripted automation with AI-driven steps for tasks that require reasoning.</p><a href="https://videos.elastic.co/watch/FoSG66tYxtgshaWAUHZFHy">Watch video on Vidyard</a><p>The multimodel approach enables new reliability patterns by separating generation from evaluation, allowing one model to produce a response and another to validate it. Today, teams can implement this by pairing a general-purpose generation model with a lighter-weight evaluation model.</p><p>Over time, this pattern naturally lends itself to specialized judging and safeguard models designed specifically for validation, policy checks, and quality control. As these models become available, EIS makes it straightforward to introduce them into agent workflows without changing how inference is deployed or managed.</p><h2>What’s next</h2><p>EIS is actively evolving, with more models on the way. You can track what’s coming next and what we’re currently building on the <a href="https://github.com/orgs/elastic/projects/2066/views/2">Elastic public roadmap</a>.</p><h2>Get started</h2><p>Elastic Inference Service makes it easy to start with default models and evolve toward sophisticated, multimodel agent workflows over time, all within Elasticsearch. Whether you’re building global retrieval augmented generation (RAG) systems, search, or <a href="https://www.elastic.co/search-labs/blog/ai-agentic-workflows-elastic-ai-agent-builder">agentic workflows</a> that need reliable context, Elastic now gives you high-performance models out of the box, along with the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to Elastic Inference Service. Try it now on <a href="https://cloud.elastic.co/serverless-registration">Elastic Cloud Serverless</a> or Elastic Cloud Hosted, or use <a href="https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service">EIS via Cloud Connect</a> with your self-managed cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Does MCP make search obsolete? Not even close]]></title>
            <description><![CDATA[Explore why search engines and indexed search remain the foundation for scalable, accurate, enterprise-grade AI, even in the age of MCP, federated search, and large context windows.]]></description>
            <link>https://www.elastic.co/search-labs/blog/future-of-search-engines-indexed-search-mcp</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/future-of-search-engines-indexed-search-mcp</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Dayananda Srinivas]]></dc:creator>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b8af1c3ca51f2c038406c714eb9a71b696bbc5a-1999x1091.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>With the rise of large language models (LLMs), agent frameworks, and new protocols like Model Context Protocol (MCP), a provocative question is starting to surface:</p><blockquote><strong>Do we still need a search engine at all?</strong></blockquote><p>If agents can call tools on demand and models can reason over massive context windows, why not just fetch data live from every system and let the LLM figure it out?</p><p>It’s a reasonable question. It’s also the wrong conclusion.</p><p>The reality is that MCP and agent tooling don’t eliminate the need for search. They make the quality of search <strong>more critical than ever</strong>. In this blog, we’ll explore why MCP, federated search, and large context windows don’t replace search engines and why indexes remain the foundational layer for scalable, accurate, enterprise-grade AI.</p><h2><strong>What MCP actually is (and what it is not)</strong></h2><p>MCP is a <strong>coordination protocol</strong>. It standardizes how an agent requests information or actions from external systems.</p><p>What MCP <em>doesn’t</em> do:</p><ul><li>Rank results across systems.</li><li>Understand relevance across heterogeneous data.</li><li>Normalize schemas or metadata.</li><li>Data transformations or enrichments at scale.</li><li>Apply consistent security and permissions.</li><li>Optimize for latency, cost, or scale.</li></ul><p>In other words, <strong>MCP tells agents <em>how</em> to ask for data, not <em>which</em> data matters most</strong>.</p><h2><strong>Modern retrieval requires query intelligence, not just data access</strong></h2><p>In modern enterprise search architectures, retrieval quality is determined long before a query reaches an index. Raw queries — especially those generated by agents — may be incomplete, overly literal, schema-driven rather than intent-driven, and at times syntactically invalid.</p><p>This is why mature search platforms introduce a query intelligence layer that performs query rewriting, entity normalization, synonym expansion, and intent disambiguation before retrieval even begins.</p><p>For example, an agent-generated request such as: “Show severity 2 authentication failures from last sprint” may be rewritten to include authentication synonyms (login, SSO, OAuth), normalized severity mappings, and sprint-to-date-range translation. The result is not just more matches — it is more <em>relevant</em> matches.</p><p>In enterprise AI, retrieval is not a single step. It is a controlled pipeline.</p><p>This distinction is crucial because once MCP-based agents start pulling information live from multiple tools, they recreate a familiar pattern under a new name: <strong>federated search</strong>.</p><h2><strong>MCP-based retrieval is federated search in disguise</strong></h2><p>Federated search isn’t new. Enterprises have tried it for decades.</p><p>The model is simple:</p><ul><li>Send the user’s query to multiple systems in parallel (SharePoint, GitHub, Jira, customer relationship management [CRM]).</li><li>Collect the responses.</li><li>Merge and present the results.</li></ul><p>MCP-driven tool calls follow the same pattern, except that the caller is now an agent instead of a user interface.</p><p>And the same problems resurface.</p><h2><strong>Why federated search breaks down at enterprise scale</strong></h2><ul><li><strong>Latency becomes unpredictable:</strong> A federated query is only as fast as its slowest system. 
Enterprise systems can have wildly different response times and rate limits, so federated queries tend to be <strong>slow and jittery</strong>. Agents must wait for multiple round trips before reasoning can even begin. The result is a laggy experience and unpredictable wait times.</li><li><strong>Relevance is fragmented:</strong> Because each system ranks results on its own, there’s no unified relevance model. Federated search <strong>cannot apply a single ranking or semantic understanding across all content</strong>, so results often seem disjointed or incomplete. Agents may retrieve <em>correct</em> information but not the <em>most useful</em> information.</li><li><strong>Context is shallow and incomplete: </strong>Federated systems typically expose only what’s directly accessible through an API call. They rarely surface:<ul><li>Usage signals, like clicks, dwell time, recency of access, popularity, or authority.</li><li>Relationships between documents across different systems to correlate the insights.</li><li>Organizational knowledge beyond a single silo.<br/><br/>This strips agents of the broader context required for high-quality reasoning.<br/></li></ul></li><li><strong>Limited filtering and features:</strong> In a federated setup, you can only filter on fields that every system supports (the “lowest common denominator”). If one system doesn’t support a particular filter or facet, you lose that functionality entirely. This severely limits rich search features, like date ranges, categories, or tags.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25bb778f4da9a3cb4f0d4e10af66221b8af73900-1376x768.jpg" style="width: 100%;" /><h2><strong>The power of an indexed search</strong></h2><p>Search engines achieve millisecond-level retrieval at massive scale by using specialized data structures, including inverted indexes for lexical search and graph-based structures such as Hierarchical Navigable Small World (HNSW) graphs for vector-based retrieval. The approach is to <strong>crawl or ingest every source into search engines</strong>, creating a central store of company knowledge. This brings big advantages:</p><ul><li><strong>Speed by design:</strong> Searching an index is lightning fast. Queries hit inverted indexes and specialized data structures, avoiding the need to poll each backend system.</li><li><strong>Relevance that compounds over time:</strong> Search engines that support <strong>semantic search </strong>are capable of comprehending query intent, and machine learning models can rerank results for enterprise contexts. In one Elastic <a href="https://www.elastic.co/blog/elastic-generative-ai-experiences">experiment</a>, users saw more accurate results when combining vector search with a question-answering (QA) model to extract answers, with better precision than keyword matching alone.</li><li><strong>Advanced features:</strong> Elastic’s <a href="https://www.elastic.co/search-labs/blog/rag-graph-traversal#:~:text=Retrieval,for%20deeper%2C%20more%20contextual%20retrieval">Graph retrieval augmented generation (RAG) solution</a> shows how structuring an index as a knowledge graph can power more contextual retrieval. In other words, indexes aren’t just backward-looking dumps of text; they can also encode relationships and ontologies that let AI connect the dots across documents.</li><li><strong>Permission-aware search:</strong> Enterprise AI cannot compromise on security. 
Indexed search allows:<ul><li><a href="https://www.elastic.co/docs/reference/search-connectors/document-level-security">Document-level security.</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/user-roles#roles">Role-based access control.</a></li><li><a href="https://www.elastic.co/search-labs/blog/rag-and-rbac-integration">Permission-aware retrieval for RAG and agents.</a></li></ul></li></ul><p>Agents see only what users are allowed to see, without leaking data into model prompts or training. Elasticsearch is suitable for the indexed search layer in the diagram below, as it provides the essential components for context engineering.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8785e7806616273d086a90b3540273fb26d045ae-1392x768.jpg" style="width: 100%;" /><h2><strong>Retrieval consistency through search templates and governed execution</strong></h2><p>At scale, retrieval must be predictable, secure, and repeatable. This is where <a href="https://www.elastic.co/docs/solutions/search/search-templates">search templates</a> become critical.</p><p>Search templates act as retrieval contracts between applications, agents, and the search platform. Instead of dynamically constructing queries at runtime, agents invoke pre-defined retrieval patterns that enforce:</p><ul><li>Consistent relevance logic</li><li>Mandatory security filters</li><li>Cost and latency guardrails</li><li>Business-specific ranking rules</li><li>Explicit index and field scope boundaries</li></ul><p>In MCP-driven architectures, this becomes even more important. Agents should not dynamically invent retrieval strategies. Instead, MCP tool calls can map directly to approved search templates, ensuring that every retrieval request adheres to enterprise relevance and governance standards.</p><p>This approach shifts retrieval from ad-hoc query execution to controlled retrieval orchestration.</p><h2><strong>Retrieval is now a multi-layer engineering discipline</strong></h2><p>Modern enterprise retrieval is no longer a simple query-to-index operation. It typically includes multiple coordinated layers:</p><ul><li>Query understanding — rewriting, expansion, entity resolution</li><li>Retrieval strategy selection — hybrid search, vector search, graph retrieval, or synthetic query techniques such as Hypothetical Document Embeddings (HyDE), where the system generates a representative answer or expanded context first and retrieves documents using that richer semantic signal.</li><li>Execution governance — templates, security enforcement, and performance guardrails</li><li>Ranking and re-ranking — blending lexical precision, semantic similarity, and interaction-derived relevance signals such as click-through patterns, dwell time, and document usage frequency.</li></ul><p>When these layers are implemented upstream, agents receive clean, high-confidence context rather than raw, fragmented data.</p><p>This is what makes large-scale agent systems reliable in production environments.</p><h2><strong>Advanced retrieval techniques improve context quality before reasoning begins</strong></h2><p>Modern retrieval systems increasingly use AI-assisted techniques to improve recall and semantic coverage before ranking is applied.</p><p>One example is <a href="https://medium.com/@nirdiamant21/hyde-exploring-hypothetical-document-embeddings-for-ai-retrieval-cc5e5ac085a6">Hypothetical Document Embeddings (HyDE)</a>. 
Instead of embedding only the original query, the system first generates a hypothetical answer or expanded context, embeds that representation, and retrieves documents based on that richer semantic signal.</p><p>This is particularly useful in enterprise environments where:</p><ul><li>Users or agents may not know the exact terminology</li><li>Knowledge is distributed across silos</li><li>Important context is implied rather than explicitly stated</li></ul><p>Techniques like HyDE improve the probability that relevant documents are retrieved even when the original query is underspecified.</p><p>This reinforces a key principle of enterprise AI: better context retrieval produces better reasoning outcomes.</p><h2><strong>Agents aren’t data engineers; they’re reasoning systems</strong></h2><p>They shouldn’t be responsible for stitching together raw data, reconciling schemas, or compensating for poor retrieval.</p><p>This is where a search platform such as <strong>Elasticsearch</strong> becomes foundational.</p><p>By ingesting data once and normalizing it upstream (through pipelines, mappings, enrichment processors, and prebuilt indexes), Elasticsearch resolves schema mismatches, joins signals across sources, and materializes retrieval-ready views of the data. At query time, the agent receives clean, ranked, semantically enriched results rather than fragmented raw records.</p><p>For example, instead of an agent pulling independently from CRM, ticketing, and documentation systems and attempting to reconcile customer IDs, timestamps, and formats in real time, Elasticsearch can pre-index these sources into a unified customer interaction index with hybrid (keyword + vector) search and relevance ranking. The agent then queries a single, coherent interface and immediately reasons over the most relevant context.</p><p>This separation of concerns, that is, <strong>Elasticsearch handling data integration and retrieval, and agents focusing on reasoning, planning, and decision-making</strong>,<strong> </strong>is what makes agent systems scalable, reliable, and production ready.</p><h2><strong>Elastic’s role in the AI stack</strong></h2><p>Elastic sits at the intersection of search and AI by design.</p><ul><li><strong>Connectors and crawlers</strong> ingest data continuously from enterprise systems.</li><li><strong>Semantic and vector search</strong> enable intent-based retrieval.</li><li><strong>Hybrid search</strong> blends lexical precision with semantic understanding.</li><li><strong>RAG workflows</strong> ground LLMs in authoritative, permission-aware data.</li></ul><p>Elastic does not compete with agents or MCP. It <strong>makes them effective</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/193da239544ce858416db845f9fc34c7c0e9b6f9-1920x1080.png" style="width: 100%;" /><h2><strong>Bigger models don’t eliminate retrieval</strong></h2><p>Some have wondered whether huge new LLMs can bypass traditional search, perhaps by letting the model read <em>everything</em> in one go. 
Large context windows feel powerful, but they introduce:</p><ul><li>Higher latency.</li><li>Higher cost.</li><li>Lower precision due to noise.</li><li>A higher propensity for confusion, context clash, and context poisoning.</li></ul><p>RAG wins because it filters first and then reasons. In another <a href="https://www.elastic.co/search-labs/blog/rag-vs-long-context-model-llm#:~:text=,context%20approach%20led%20to%20inaccuracies">Elastic Search Labs experiment</a>, RAG achieved answers in about <strong>1 second</strong>, versus 45 seconds for the raw long-context approach, at <strong>1/1250th</strong> the cost, and with far higher accuracy. In other words, giving an LLM a million tokens of documents is slower, more expensive, and actually <em>less precise</em> than filtering through an index first.</p><h2><strong>Conclusion: MCP changes the interface, not the fundamentals</strong></h2><p>MCP is a meaningful step forward in how agents interact with tools. But it doesn’t replace the need for fast, relevant, governed retrieval.</p><p>In enterprise AI:</p><ul><li>Context quality determines answer quality.</li><li>Indexes create that context.</li><li>Search is the foundation, not the legacy.</li></ul><p>Indexes aren’t obsolete in the era of MCP. They’re <strong>the reason that MCP-based agents can work at all</strong>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch, part 3: Optimizing LLM integration with function calling]]></title>
            <description><![CDATA[Learn how function calling enhances LLM integration, enabling a reliable and cost-efficient entity resolution pipeline in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b2afa90c1b863c716008f3f5bbdd2866fa1c3577-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">part 1</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search">part 2</a> of this series, we built a complete entity resolution pipeline that included preparing entities with context and indexing them for semantic search, extracting entities from articles using hybrid named entity recognition (NER), and matching entities using semantic search and large language model (LLM) judgment. The results were promising, but JSON parsing errors significantly lowered measured accuracy by causing otherwise valid judgments to be discarded. The system wasn’t failing because it made bad judgments; it was failing because it couldn’t reliably express them.</p><p>The root of this problem was our somewhat naive choice to use prompt-based JSON generation in which the LLM generates JSON responses in text format. If we asked the LLM to judge more than a couple of matches at a time, the generated JSON was often ill-formed. To mitigate this, we were forced to reduce the processing batch size, which simply won&#x27;t scale in a production system.</p><p>So the prompt-based JSON generation helped validate our approach to entity resolution, but we need a more systematic and reliable method. OpenAI function calling provides a better path by guaranteeing structure and type safety while reducing errors and costs. We chose OpenAI&#x27;s functions for the educational prototype, but other LLM providers typically provide similar functionality (for example, Claude tools).</p><p><strong>Note:</strong> While we discuss production challenges here, this is still an educational prototype demonstrating optimization techniques. Real production systems would need additional considerations, like monitoring, alerting, fallback strategies, and comprehensive error handling.</p><h2>Key concepts: Function calling, schema design, and cost benefits</h2><p><strong>What is function calling?</strong> <em>Function calling</em> is OpenAI&#x27;s structured output API. With it, we can define schemas for LLM responses, so we always know exactly what we&#x27;re going to get. By enforcing the JSON format rather than trying to define it in the LLM prompt, we should be able to eliminate parsing errors.</p><p><strong>Why is it better than prompt-based JSON?</strong> LLMs generate nondeterministic output. One hopes that they&#x27;ll at least generate content that contains the correct response, but the presentation of that response is unpredictable. With a chatbot, this is often not a problem, but our prototype is trying to programmatically process the output. Computer programs demand consistency, so when the LLM generates what we expect, everything is fine, but as soon as it goes off script, so to speak, the code errors out. We could try to account for the different possibilities, but it would be very difficult to catch everything. We could try to enforce more consistent behavior by adding something like &quot;Always return parsable JSON&quot;. We tried this exact technique in the prototype&#x27;s prompt, but we&#x27;ve seen that prompt-based JSON still goes off the rails pretty quickly, particularly if we try to process a batch of matches.</p><p>Function calling makes the LLM generation controllable and predictable, exactly what we need for entity resolution. 
To aid in the definition of the functions, we’ll also follow minimal schema design principles.</p><p><strong>What are minimal schema design principles?</strong> <em>Minimal schema design</em> means defining only the fields you need, using simple types, and avoiding nested structures when possible. This reduces token usage (smaller schemas mean fewer tokens), improves reliability (simpler schemas are easier for the LLM to follow), and lowers costs (fewer tokens mean lower API costs).</p><p><strong>What are the cost and reliability benefits?</strong> Since fewer errors mean match processing is much more likely to succeed, even with large batch sizes, we don&#x27;t have to retry judging matches. The elimination of retries reduces costs by reducing token usage, but using minimal schemas also keeps our token count down. This all leads to a less expensive and more reliable approach that’s much more suitable to use in production.</p><p>We need to check one more thing, though. While matches may be getting processed without error, are the errorless results actually correct? How does this new approach compare to the promising results we saw with the prompt-based approach?</p><h2>Real-world results: Side-by-side comparison</h2><p>As we did in the previous blog, we ran the function calling approach against the tier 4 dataset, which consists of 206 expected matches across 69 articles. The results demonstrate a dramatic improvement:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt-based</td><td style="border: 1px solid #ccc; padding: 4px;">Function calling</td><td style="border: 1px solid #ccc; padding: 4px;">Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">30.2%</td><td style="border: 1px solid #ccc; padding: 4px;">0.0%</td><td style="border: 1px solid #ccc; padding: 4px;">100% elimination</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Precision</td><td style="border: 1px solid #ccc; padding: 4px;">83.8%</td><td style="border: 1px solid #ccc; padding: 4px;">90.3%</td><td style="border: 1px solid #ccc; padding: 4px;">+6.5pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">62.6%</td><td style="border: 1px solid #ccc; padding: 4px;">90.8%</td><td style="border: 1px solid #ccc; padding: 4px;">+28.2pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">F1 score</td><td style="border: 1px solid #ccc; padding: 4px;">71.7%</td><td style="border: 1px solid #ccc; padding: 4px;">90.6%</td><td style="border: 1px solid #ccc; padding: 4px;">+18.9pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Acceptance rate</td><td style="border: 1px solid #ccc; padding: 4px;">44.8%</td><td style="border: 1px solid #ccc; padding: 4px;">60.2%</td><td style="border: 1px solid #ccc; padding: 4px;">+15.4pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">True positives</td><td style="border: 1px solid #ccc; padding: 4px;">129</td><td style="border: 1px solid #ccc; padding: 4px;">187</td><td style="border: 1px solid #ccc; padding: 4px;">+45.0%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">False negatives</td><td style="border: 1px solid #ccc; padding: 4px;">77</td><td style="border: 1px solid #ccc; padding: 4px;">19</td><td style="border: 1px solid #ccc; padding: 4px;">-75.3%</td>
                    </tr>
              </tbody>
            </table>
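<p>These gains come from changing how the judgment is expressed, not how it&#x27;s made. As a rough sketch, a batch judgment call using OpenAI function calling with a minimal schema might look like the following; the function name, schema fields, and model are illustrative rather than the notebook&#x27;s exact code:</p><pre><code>import json

from openai import OpenAI

client = OpenAI()

# Minimal schema: only the fields we need, simple types, one level of nesting.
judge_tool = {
    "type": "function",
    "function": {
        "name": "judge_entity_matches",
        "description": "Record a yes/no judgment for each candidate entity match.",
        "parameters": {
            "type": "object",
            "properties": {
                "judgments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "candidate_id": {"type": "string"},
                            "is_match": {"type": "boolean"},
                            "confidence": {"type": "number"},
                        },
                        "required": ["candidate_id", "is_match", "confidence"],
                    },
                }
            },
            "required": ["judgments"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any function-calling-capable model works here
    messages=[{"role": "user", "content": "Judge these candidate matches: ..."}],
    tools=[judge_tool],
    # Force the model to answer through the function instead of free text.
    tool_choice={"type": "function", "function": {"name": "judge_entity_matches"}},
)

# The arguments arrive as schema-conformant JSON, so parsing is a single
# json.loads call rather than brittle extraction from free-form text.
judgments = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)["judgments"]</code></pre>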
          <h3>Error elimination: The key differentiator</h3><p>The most striking difference is the <strong>complete elimination of JSON parsing errors</strong>. This resulted in a modest precision improvement and a far more dramatic recall improvement. The precision metric captures how often the matches the system accepts were expected in the golden document. So the prototype was decent at judging matches correctly in the prompt-based approach, but function calling does that even better.</p><p></p><p>Conversely, recall tells us how many of the expected matches were found. When a batch of matches comes back with malformed JSON, the system loses all of those matches. It&#x27;s likely that Elasticsearch sends many of these matches for judgment, but we lose those matches if judgment fails. The significant recall improvement shows that this hypothesis is correct. Elasticsearch identifies the potential matches and function calling verifies which of those matches are correct.</p><p></p><p><strong>Note:</strong> It’s expected that Elasticsearch will find some incorrect matches because we look at the top two or three results from hybrid search. Most of the time, hybrid search returns the correct match as the top result, but having the LLM judge the top few hits ensures that we see how the LLM handles incorrect matches. If we move from the educational prototype to a production system, we’ll likely tune the Elasticsearch queries more carefully so that we only send promising matches to the LLM, further optimizing our LLM costs.</p><h2>What&#x27;s next: The ultimate challenge</h2><p>Now that we&#x27;ve optimized our LLM integration with function calling, we have a complete entity resolution pipeline with improved reliability and cost efficiency. However, can it handle the ultimate challenge? In the next post, we&#x27;ll explore how the system handles diverse entity resolution scenarios across 50 different challenge types, including cultural naming conventions, business relationships, titles, and multilingual variations.</p><h2>Try it yourself</h2><p>Want to see function calling optimization in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=5%20minutes%20ago-,04_function_calling_optimization_v3.ipynb,-Initial%20public%20lab">Function Calling Optimization notebook</a> for a complete walkthrough with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to use function calling for structured output, compare it with prompt-based JSON, and analyze cost and reliability benefits.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach optimization concepts. When building production systems, consider additional factors, like multi-provider support, advanced caching strategies, monitoring and alerting, comprehensive error handling, and compliance requirements that aren&#x27;t covered in this learning-focused prototype.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using subagents and Elastic Agent Builder to bring business context into code planning]]></title>
            <description><![CDATA[Learn about subagents, how to ensure they have the right information, and how to create a specialized subagent that connects Claude Code to your Elasticsearch data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/subagents-with-elastic-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/subagents-with-elastic-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bfeb78b687bd930371364ee7dd0341ae90004349-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://code.claude.com/docs/en/sub-agents">Subagents in Claude Code</a> let you offload specialized tasks to separate context windows, keeping your main conversation focused. In this article, you&#x27;ll learn what subagents are, when to use them, and how to build a retrieval subagent using Elastic Agent Builder that connects your development workflow to business data in Elasticsearch.</p><h2>What are subagents?</h2><p><em>Subagents </em>are specialized assistants that can be called to execute a specific task, using their own context window. They complete a task and give the results to the main agent, preventing it from saving information that isn’t relevant for the rest of the conversation in the context window.</p><p>Their four core principles are:</p><ul><li><strong>Context preservation:</strong> Each subagent uses its own context window.</li><li><strong>Specialized expertise:</strong> Each subagent is designed for a specific task.</li><li><strong>Reusability:</strong> You can reuse a subagent in different sessions and projects.</li><li><strong>Flexible access:</strong> You can limit the subagent access to specific tools.</li></ul><p>Each subagent can have access to Claude Code tools to work with the terminal, such as glob, read, write, grep, or bash, or to access the internet, like search, fetch, or call external tools with Model Context Protocol (MCP) servers.</p><p>A subagent uses the following schema:</p><pre><code>---
name: your-sub-agent-name
description: Description of when this subagent should be invoked
tools: tool1, tool2, tool3  # Optional - inherits all tools if omitted
model: sonnet  # Optional - specify model alias or 'inherit'
permissionMode: default  # Optional - permission mode for the subagent
skills: skill1, skill2  # Optional - skills to auto-load
---

Your subagent's system prompt goes here. This can be multiple paragraphs
and should clearly define the subagent's role, capabilities, and approach
to solving problems.

Include specific instructions, best practices, and any constraints
the subagent should follow.</code></pre><p>You can call subagents implicitly by talking about the task they run, and Claude will call them automatically. For example, you can say, &quot;I want to plan my new functionality.&quot;</p><p>You can also call them explicitly by directly asking Claude Code to use a subagent and telling it, &quot;Use the planning subagent to plan my new functionality.&quot;</p><p>Another important feature is that subagents are stateful, so once you give one a task, it will generate an ID. This way, when you use it again, you can start from scratch or provide the ID to give it context from its previous tasks.</p><p>You can read the <a href="https://code.claude.com/docs/en/sub-agents">full documentation here</a>.</p><h2>When are subagents used?</h2><p>Subagents are useful when you need to delegate tasks that require specialized context but you don&#x27;t want to clutter the main chat window. Considering our example of coding, the most common subtasks include:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Subtask type</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td><td style="border: 1px solid #ccc; padding: 4px;">Typical tools</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Exploration / research</td><td style="border: 1px solid #ccc; padding: 4px;">Searching and analyzing code without modifying it.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Planning</td><td style="border: 1px solid #ccc; padding: 4px;">Running deep analysis to create implementation plans.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob, bash</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Code review</td><td style="border: 1px solid #ccc; padding: 4px;">Reviewing quality, safety, and best practices.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob, bash</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Code modification</td><td style="border: 1px solid #ccc; padding: 4px;">Writing and editing code.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, edit, write, grep, glob</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Testing / debugging</td><td style="border: 1px solid #ccc; padding: 4px;">Running tests and analyzing issues.</td><td style="border: 1px solid #ccc; padding: 4px;">Bash, read, grep, edit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Getting information from external sources (APIs, databases).</td><td style="border: 1px solid #ccc; padding: 4px;">MCP tools, bash</td>
                    </tr>
              </tbody>
            </table>
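<p>To make the schema from earlier concrete, here&#x27;s what a minimal read-only review subagent could look like. The name, description, and system prompt are illustrative and not one of the built-in agents:</p><pre><code>---
name: code-reviewer
description: Reviews code changes for quality, security, and best practices. Invoke after source files have been modified.
tools: Read, Grep, Glob
---

You are a senior code reviewer. Read the files you are pointed at, check for
bugs, security issues, and style problems, and return a concise list of
findings ordered by severity. Do not modify any files.</code></pre><p>Because it only receives Read, Grep, and Glob, this subagent can inspect the codebase but never change it.</p>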
<p>Claude Code includes three built-in agents that showcase these use cases:</p><p></p><ul><li><strong>Explore:</strong> A quick agent for read-only search in the codebase. It&#x27;s great for answering questions like, &quot;Where are the client&#x27;s errors handled?&quot;</li><li><strong>Plan:</strong> Research agent that activates in plan mode to analyze the codebase before proposing changes.</li><li><strong>General-purpose:</strong> The most capable agent for complex tasks that require multiple steps and can include modifications.</li></ul><h2>Context management: Ensuring subagents have the right information</h2><p>One of the most important decisions when designing subagents is how to handle context. There are three key considerations:</p><h3><strong>1. Which context the subagent should get</strong></h3><p>The prompt you give to the subagent must contain all of the necessary information to complete the task since the subagent doesn’t have access to the main chat. You need to be specific:</p><ul><li>Do NOT say, &quot;Review the code.&quot;</li><li>SAY, &quot;Review the changes to src/auth/index.ts, focusing on JWT token validation.&quot;</li></ul><p>Providing the exact file name makes the difference between the subagent reading the file directly with the read tool and running a wide grep search that wastes time and tokens.</p><p>Also consider what not to include. Irrelevant context can distract the subagent or bias results. It’s tempting to ask for multiple things in one pass, but focused tasks yield better results:</p><ul><li>Do NOT say, “Review src/auth/index.ts. Here is also the database schema and our API docs for reference, fix bugs and suggest improvements about the architecture decisions.”</li><li>SAY, “Fix the token refresh bug in src/auth/index.ts that&#x27;s throwing AUTH_TOKEN_EXPIRED unexpectedly.”</li></ul><h3><strong>2. What tools to provide</strong></h3><p>Limit the tools to what’s strictly necessary. This improves security, keeps the subagent focused, and reduces unnecessary tool calls and execution costs.</p><pre><code># For just an analysis agent
tools: Read, Grep, Glob

# For an agent that needs to modify the code
tools: Read, Edit, Write, Grep, Glob</code></pre><p>If you don&#x27;t specify a tools field, the subagent inherits all tools from the main agent, including MCP tools.</p><p>You can learn about all Claude Code tools <a href="https://code.claude.com/docs/en/how-claude-code-works#tools">here</a>.</p><h3><strong>3. How to keep context between calls</strong></h3><p>Subagents can be resumed using their agentId:</p><pre><code># First call
> Use the code-analyzer agent to review the authentication module
[Agent completes the analysis and returns agentId: "abc123"]

# Continue with previous context
> Resume agent abc123 and now analyze the authorization module
[Agent continues with the context from the previous chat]</code></pre><p></p><p>You can ask Claude for the agent ID or find it in <code>~/.claude/projects/{project}/{sessionId}/subagents/</code></p><p>This is especially useful for long research tasks or multistep workflows.</p><p>Another way to keep context consistent is to ask the agent to write a Markdown checklist with what it&#x27;s doing and its current progress. Then you can execute <code>/clear</code> without losing the initial instruction. In that request, you can define the task granularity or details to retain that make sense for your use case.</p><pre><code># Task: Review authentication module

## Progress
- [x] Analyzed src/auth/index.ts
- [x] Found JWT validation issue
- [ ] Review authorization module
- [ ] Check rate limiting

## Findings
- Token refresh has race condition in line 42</code></pre><p>After you clear the conversation, the next agent can pick it up from here. This is very useful when you want an agent to run a script over a list and watch the output record by record.</p><h2>Orchestration patterns</h2><p>It’s important to see subagents as a context optimization mechanism. The way in which you coordinate them determines the efficiency of the whole system. There are different orchestration patterns.</p><h3><strong>Sequential (chaining)</strong></h3><p>Here, a subagent completes a task, and its results feed the next one in a sequence of tasks, similar to traditional Linux piping.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74a3a376600cd1b7cdd2dddddfed2f00ab131eed-896x94.png" style="width: 100%;" /><p>Call example:</p><pre><code>> First use the planning agent to design the feature,
> then use the coding agent to implement it,
> finally use the reviewer agent to check the code</code></pre><h3><strong>Parallel</strong></h3><p>In this pattern, multiple subagents run independent tasks simultaneously. The main Claude Code agent invokes them since <strong>subagents cannot spawn other subagents</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/84eca68d29bf79cf978a8089d3c18972738cd2c1-595x272.png" style="width: 100%;" /><p>This approach reduces the overall execution time for tasks like code review, since several subagents can examine the same code from different angles at once.</p><h3><strong>Hub-and-spoke (delegation)</strong></h3><p>In this approach, the main agent acts as an orchestrator, delegates tasks to specialized agents, and then consolidates the results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/93bb2cc55c435f509b426fbcc090a67c53021684-595x272.png" style="width: 100%;" /><p>This is the pattern we’ll implement in our example: The main Claude Code agent will delegate the gathering of business information to a retrieval agent built with Elastic Agent Builder, while the explore agent looks into local files and the planning agent builds a plan.</p><h2>Why use an agent instead of a single query?</h2><p>Before building our retrieval subagent, it&#x27;s worth understanding when an agent adds value versus when a simple Elasticsearch Query Language (ES|QL) query suffices.</p><p>If you need a single aggregation, like &quot;What&#x27;s our most visited page?&quot;, just run the query directly. The agent adds value when your question requires:</p><ul><li><strong>Multiple queries that build on each other:</strong> The answer from query 1 informs query 2.</li><li><strong>Cross-index reasoning:</strong> Correlating data from different sources.</li><li><strong>Ambiguity resolution:</strong> The agent interprets and follows leads.</li><li><strong>Synthesis:</strong> Combining quantitative data with qualitative knowledge.</li></ul><p>Our example will demonstrate all of these capabilities.</p><h2>Agent Builder as subagent</h2><p>Generating code using AI is very quick, but the challenge is having a good planning phase that sets the boundaries for our coding agent. To help with that, Claude Code includes a subagent that <a href="https://code.claude.com/docs/en/common-workflows#use-plan-mode-for-safe-code-analysis">specializes in planning</a>, performing deep analysis and creating a to-do list for the main agent to execute.</p><p>With this flow, you can plan based on what Claude Code can see both in local files and on the internet. However, there&#x27;s still knowledge available in Elasticsearch that you cannot access via standard tools.</p><p>To access our internal knowledge during the planning phase, we&#x27;ll create a Claude Code subagent by making a retrieval agent using Agent Builder.</p><p>You can configure the agent using the UI or an API. In this example, we&#x27;ll use the latter.</p><h3><strong>Prerequisites</strong></h3><ul><li><a href="https://code.claude.com/docs/en/setup">Claude Code</a> 2.0.76+</li><li>Elasticsearch 9.2</li><li>Elasticsearch <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">API key</a></li></ul><h3><strong>The scenario: Technical debt sprint planning</strong></h3><p>You&#x27;re a tech lead. You have two weeks and two developers. Your <code>TECH_DEBT.md</code> lists 12 items. You can tackle maybe three or four. 
Which ones should you prioritize?</p><p>The complexity is that you need to optimize across multiple dimensions simultaneously:</p><ul><li><strong>User impact:</strong> How many users hit this issue?</li><li><strong>Business impact:</strong> Does it affect paying customers? Enterprise tier?</li><li><strong>Severity:</strong> Errors? Performance? Just ugly code?</li><li><strong>Effort:</strong> Quick win or rabbit hole?</li><li><strong>Dependencies:</strong> Does fixing A unlock fixing B?</li><li><strong>Strategic alignment:</strong> Does it align with Q1 priorities?</li></ul><p>A single query like &quot;What&#x27;s the most important tech debt item?&quot; fails because this requires:</p><ol><li>Reading <code>TECH_DEBT.md</code> to understand what the 12 items even are.</li><li>For EACH item, querying <code>error_logs</code> to get error frequency.</li><li>Cross-referencing with <code>customer_data</code> to see tier breakdown.</li><li>Checking <code>support_tickets</code> to see complaint volume.</li><li>Reading <code>engineering_standards</code> in the knowledge base to see whether any items violate core principles.</li><li>Reading <code>Q1_roadmap</code> to check strategic alignment.</li></ol><p>This is where a retrieval agent can be helpful in orchestrating multiple queries across different indices and synthesizing the results.</p><h2>Steps</h2><h3><strong>Preparing the test dataset</strong></h3><p>We&#x27;ll create four indices: a knowledge base with internal documentation, error logs, support tickets, and customer data.</p><p>You can create the indices, index the data, and create the agent using one of the following:</p><ul><li><strong>Kibana Dev Tools:</strong> Using the Elasticsearch requests provided below.</li><li><strong>Jupyter Notebook:</strong> Using the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/notebook.ipynb">complete notebook</a> written for this article.</li></ul><h2>Create the indices</h2><p>Open Kibana Dev Tools, and run the following requests to create each index with its mapping and bulk data. Here&#x27;s an example showing the customer data index structure and the data to be indexed:</p><pre><code>PUT customer_data
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "customer_tier": { "type": "keyword" },
      "company_name": { "type": "text" },
      "mrr": { "type": "float" },
      "joined_at": { "type": "date" }
    }
  }
}
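# Now index the sample documents. The other three indices (knowledge, error_logs,
# support_tickets) follow the same pattern: PUT the mapping, then POST the bulk data.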

POST customer_data/_bulk
{"index":{}}
{"user_id":"enterprise_user_01","customer_tier":"enterprise","company_name":"Acme Corp","mrr":2500.00,"joined_at":"2023-01-15"}
{"index":{}}
{"user_id":"enterprise_user_02","customer_tier":"enterprise","company_name":"GlobalTech Inc","mrr":4200.00,"joined_at":"2022-08-20"}
{"index":{}}
{"user_id":"enterprise_user_05","customer_tier":"enterprise","company_name":"DataFlow Systems","mrr":3100.00,"joined_at":"2023-06-01"}
{"index":{}}
{"user_id":"user_001","customer_tier":"free","company_name":"","mrr":0,"joined_at":"2024-03-15"}
{"index":{}}
{"user_id":"user_002","customer_tier":"free","company_name":"","mrr":0,"joined_at":"2024-05-20"}
{"index":{}}
{"user_id":"user_045","customer_tier":"pro","company_name":"SmallBiz LLC","mrr":49.00,"joined_at":"2024-01-10"}
{"index":{}}
{"user_id":"user_089","customer_tier":"pro","company_name":"StartupXYZ","mrr":49.00,"joined_at":"2024-02-28"}</code></pre><p>Full requests for all indices:</p><ul><li><strong>Knowledge index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/knowledge.txt">knowledge.txt</a></li><li><strong>Error logs index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/error_logs.txt">error_logs.txt</a></li><li><strong>Support tickets index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/support_tickets.txt">support_tickets.txt</a></li><li><strong>Customer data index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/customer_data.txt">customer_data.txt</a></li></ul><p>The raw JSON files with the dataset are also available:</p><ul><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/knowledge.json">knowledge.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/error_logs.json">error_logs.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/support_tickets.json">support_tickets.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/customer_data.json">customer_data.json</a></li></ul><h2>Local project files</h2><p>Create the following Markdown (MD) files in your project. These files look like this:</p><pre><code># Tech Debt Items

## AUTH-001: Token refresh race condition
- **Module**: src/auth/refresh.ts
- **Symptom**: Users randomly logged out
- **Estimate**: 3 days

## EXPORT-002: CSV export timeout on large datasets
- **Module**: src/export/csv.ts
- **Symptom**: Timeout after 30s for >10k rows
- **Estimate**: 2 days

...</code></pre><p>Full files:</p><ul><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/TECH_DEBT.md">TECH_DEBT.md</a>: Tech debt items list.</li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/REQUIREMENTS.md">REQUIREMENTS.md</a>: FlowDesk Q1 2025 requirements.</li></ul><p>This ties directly to the tech debt items and gives the agent clear priorities to work with when cross-referencing with the Elasticsearch data.</p><h2>Create an agent with Agent Builder</h2><p>We&#x27;ll now create an agent that can run analytics queries with ES|QL to provide app usage information, and that can also search the Knowledge Base (KB) for information in unstructured text format.</p><p>We&#x27;re using the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools#built-in-tools">built-in tools</a> since they cover search and analytics on any index. Agent Builder also supports custom tools for more specialized operations, like scoping an index or adding ES|QL dynamic parameters, but that&#x27;s beyond our scope here.</p><p>You can create the agent using the curl request in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/create_agent.txt">create_agent.txt</a>.</p><pre><code>curl -X POST "https://${KIBANA_URL}/api/agent_builder/agents" \
  -H "Authorization: ApiKey ${API_KEY}" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "tech-debt-advisor",
    "name": "Tech Debt Prioritization Agent",
    "description": "I help prioritize technical debt by analyzing error logs, support tickets, customer impact, and aligning with engineering standards and roadmap priorities.",
    "avatar_color": "#BFDBFF",
    "avatar_symbol": "TD",
    "configuration": {
      "instructions": "This agent helps prioritize technical debt items. Use the following indices:\n\n- knowledge: Engineering standards, policies, and roadmap priorities\n- error_logs: Production error frequency by module\n- support_tickets: Customer complaints and their urgency\n- customer_data: Customer tier information (enterprise, pro, free)\n\nWhen analyzing tech debt:\n1. Check error frequency in error_logs\n2. Cross-reference affected users with customer_data to understand tier impact\n3. Count support tickets and note urgency markers\n4. Check knowledge base for relevant policies and Q1 priorities\n5. Synthesize findings into prioritized recommendations",
      "tools": [
        {
          "tool_ids": [
            "platform.core.search",
            "platform.core.list_indices",
            "platform.core.get_index_mapping",
            "platform.core.get_document_by_id",
            "platform.core.execute_esql",
            "platform.core.generate_esql"
          ]
        }
      ]
    }
  }'</code></pre><p>You’ll get this response if everything went OK:</p><pre><code>{
  "id": "tech-debt-advisor",
  "type": "chat",
  "name": "Tech Debt Prioritization Agent",
  "description": "I help prioritize technical debt by analyzing error logs, support tickets, customer impact, and aligning with engineering standards and roadmap priorities.",
  ...
}</code></pre><p>The agent will be available in Kibana, so you can now chat with it if you want:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c655ec6b9b1cc2fa1ab3cc13d289e7b96a543284-815x784.png" style="width: 100%;" /><h3><strong>Configure the agent as a Claude Code tool</strong></h3><p>The agent we just created will expose an <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">MCP server</a>. Let&#x27;s add the MCP server to Claude Code using the already-generated API key:</p><pre><code>claude mcp add --transport http agentbuilder https://${KIBANA_URL}/api/agent_builder/mcp --header "Authorization: ApiKey ${API_KEY}"</code></pre><p>We can check the connection status using <code>claude mcp get agentbuilder</code>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba5fbc144f9e29151b8628dffd33dc74b12deece-499x177.png" style="width: 100%;" /><h3><strong>Create a subagent that uses the tool</strong></h3><p>Now that Agent Builder is available as a set of MCP tools, we can create a subagent in Claude Code that uses all or some of those tools, in combination with Claude Code&#x27;s own.</p><p>Claude Code recommends using its agent creator tool for this step:</p><p>1. Type <code>/agents</code> in Claude Code.</p><p>2. Choose <strong>Create new agent</strong>.</p><p>3. Select <strong>Project scope</strong> so that it&#x27;s only available for this project. (This is the recommended setting to avoid agent overflow.)</p><p>4. Select <strong>Generate with Claude (recommended)</strong>.</p><p>5. Type in the description: &quot;Agent that analyzes technical debt by querying Elasticsearch for error logs, support tickets, customer data, and engineering knowledge base. Use this agent when you need to prioritize tech debt items based on business impact.&quot;</p><p>6. In “Select tools,” choose <strong>Advanced options</strong> and select the tools we defined when creating the agent.</p><pre><code>Individual Tools:
☒ platform.core.search (agentbuilder)
☒ platform.core.list_indices (agentbuilder)
☒ platform.core.get_index_mapping (agentbuilder)
☒ platform.core.get_document_by_id (agentbuilder)
☒ platform.core.execute_esql (agentbuilder)
☒ platform.core.generate_esql (agentbuilder)</code></pre><p>7. Select <strong>[ Continue ]</strong>.</p><p>Now choose the model. For planning tasks, the recommendation is to use Opus due to its stronger reasoning capabilities, so let&#x27;s select that and continue.</p><p>Finally, choose the background color for our subagent text and confirm.</p><p>Claude automatically names our subagent based on the description (for example, <code>tech-debt-analyzer</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f01ac4c9f30fbcbed7fc69881aae9ff72c4616a0-869x521.png" style="width: 100%;" /><h2>Testing the agent</h2><p>Once the agent has been created, we can test it with a complex prioritization question that requires multistep reasoning:</p><pre><code>> Based on TECH_DEBT.md, which items should we prioritize for our 2-week sprint?
> Use the tech-debt-analyzer agent to check error frequency, customer impact,
> support ticket volume, and alignment with engineering standards.</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d76d972ab5b07e6d35bdf3036cb5ee3c080c7156-749x239.png" style="width: 100%;" /><p>Watch how the agent orchestrates multiple queries:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c49b56c366576406586ba03f694d2bfb09d30895-875x96.png" style="width: 100%;" /><p>It will then give you a comprehensive analysis of the local files combined with Elasticsearch data:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/084c532b4c9e993e53810738ae1da1fd4af1f025-1228x693.png" style="width: 100%;" /><p>This demonstrates why a single query fails and an agent succeeds: It orchestrates five or more queries across different indices, correlates the data, and synthesizes a recommendation that contradicts the naive &quot;fix highest error count&quot; approach.</p><p>By typing <code>/context</code>, we can see how much context each MCP tool definition and our subagent&#x27;s prompt consume. Keep an eye on this overhead when creating subagents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a6f528a9cae7b7fbf17f1b97e13c51c78c1b8b4-666x391.png" style="width: 100%;" /><h2>Start planning</h2><p>We can now start planning using local files, the internet, and our Elasticsearch knowledge as information sources.</p><p>Ask something like:</p><pre><code>"Based on our requirements defined in REQUIREMENTS.md, use the planning agent
to create a detailed implementation plan, prioritizing tasks according to
business impact. Use the tech-debt-analyzer agent to query about internal
company knowledge and make analytical queries about error patterns and
customer impact."</code></pre><p>Note that Claude decides to run the Elasticsearch data analysis and the local documentation reading in parallel, following the hub-and-spoke orchestration pattern.</p><p>After the analysis, you should get a plan that prioritizes based on actual business data rather than on assumptions. This context will make your AI coding experience much more reliable, as you can feed this plan directly to the agent and execute step by step:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cfa5c6c44425d6e73355116e08082a33699915a3-961x873.png" style="width: 100%;" /><p>The more details you provide and the more focused the instructions are, the better the quality of the plan will be. If you have an existing codebase, it will suggest code changes.</p><h2>Conclusion</h2><p>Subagents are a great tool for offloading specific tasks where the main chat only needs the final result (not how we got there), keeping the chat flow focused.</p><p>By choosing the right orchestration pattern (sequential, parallel, or hub-and-spoke) and handling the context properly, we can build efficient and maintainable agent systems.</p><p>Elastic Agent Builder and its MCP feature allow us to access our data using a retrieval subagent to facilitate planning and coding by combining local (files, source code), external (internet), and internal (Elasticsearch) sources. The key insight is that agents add value not for simple queries but when you need multistep reasoning that builds on previous results and synthesizes information from multiple sources.</p><h2>Resources</h2><ul><li><a href="https://code.claude.com/docs/en/sub-agents">Claude Code Subagents</a></li><li><a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a></li><li><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Agent Builder MCP</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Adaptive early termination for HNSW in Elasticsearch]]></title>
            <description><![CDATA[Introducing a new adaptive early termination strategy for HNSW in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hnsw-elasticsearch-adaptive-early-termination</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hnsw-elasticsearch-adaptive-early-termination</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tommaso Teofili]]></dc:creator>
            <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d3186f609dd56dc5ffe33d70fa9e5cfa05b51f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch uses the <a href="https://www.elastic.co/search-labs/blog/hnsw-graph">Hierarchical Navigable Small World</a> (HNSW) algorithm to perform vector search over a proximity graph. HNSW is known to provide a nice trade-off between the quality of k-nearest neighbor (KNN) results and the associated cost.</p><p>In HNSW, search proceeds by iteratively expanding candidate nodes in the graph, maintaining a bounded set of nearest neighbors discovered so far. Each expansion has a cost (vector operations, random seeks to disk, and more), and the marginal benefit of that cost tends to decrease as the search progresses.</p><p>One way to optimize HNSW graph traversal is to stop searching when the marginal likelihood of finding new true neighbors drops too low. For this reason, in <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-dense-vector-hnsw-early-termination">Elasticsearch 9.2</a> we introduced a new <a href="https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination">early termination mechanism</a>. This stops the search process when visiting additional graph nodes fails to provide enough new nearest neighbors for a fixed number of consecutive visits.</p><p>This article guides you through how we improved on that early termination mechanism to make it better suited to different datasets and data distributions.</p><h2><strong>Early termination in HNSW</strong></h2><p>In HNSW, search proceeds by iteratively expanding candidate nodes in the proximity graph, maintaining a bounded set of nearest neighbors discovered so far, until it either has visited the whole graph or meets some early stop criteria.</p><p>Early termination is therefore not just an optimization; it’s <strong>part of the search algorithm itself</strong>. The moment we decide to stop determines the balance between efficiency and recall. In Elasticsearch, there are already a number of ways a query on HNSW can terminate early:</p><ul><li>A fixed maximum number of nodes is visited.</li><li>A fixed timeout is reached.</li></ul><p>While simple and predictable, these rules are largely <strong>agnostic to what the search is actually doing</strong>. Also, they’re mostly there to make sure that the query finishes in a reasonable time for the end user.</p><p>In a <a href="https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination">previous blog post</a>, we introduced the concept of redundancy in HNSW. In short, redundant computations occur when HNSW continues to evaluate new candidate nodes that don’t result in finding more nearest neighbors.</p><h2><strong>Patience: Measuring progress instead of effort</strong></h2><p>The notion of <em>patience</em> reframes early termination around <strong>progress rather than effort</strong>.</p><p>Instead of asking:</p><p>“How many steps have we taken?”</p><p>The new question becomes:</p><p>“How much computation are we willing to waste before we lose hope?”</p><p>During HNSW search, early exploration typically produces peak improvements to the top-k candidate set. During the first steps of the HNSW graph exploration, the set of neighbors is continuously updated as the algorithm keeps discovering nearer and nearer neighbors to the query vector. Over time, these improvements become rarer as the search converges.
<a href="https://cs.uwaterloo.ca/~jimmylin/publications/Teofili_Lin_ECIR2025.pdf">Patience-based termination</a> monitors this pattern and terminates the search once improvements have ceased for a sustained period.</p><p>In practice, while visiting the HNSW graph we also compute the queue saturation ratio as we hop through candidate nodes. This measures the percentage of nearest neighbors that were left unchanged while visiting the most recent graph node (or the inverse of the number of new neighbors introduced during the last iteration). When this ratio stays too high for too many consecutive iterations, we stop visiting the graph.</p><p>Conceptually, patience treats HNSW search as a <strong>diminishing returns process</strong>. When returns flatten out, continuing to explore the graph yields little benefit.</p><p>This framing is powerful because it ties termination directly to <em>observable outcomes</em> rather than to arbitrary fixed limits.</p><p>The benefit of using this smart early termination technique is that HNSW graph explorations tend to visit fewer graph nodes while retaining almost perfect relative recall.</p><p>To visualize this, we can plot the recall per visited node with patience-based early termination (labeled as <em><code>et=static</code></em>) against the default HNSW behavior (labeled as <em><code>et=no</code></em>) on two datasets, FinancialQA and Quora, and two models, JinaV3 and E5-small.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a9d07c5153ea64a2426c82487c36846030692bb9-1600x945.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dac56125c4b16d1b596c9876b6ca9ac7b2dc87fa-1600x944.png" style="width: 100%;" /><h2><strong>Static thresholds and HNSW dynamics</strong></h2><p>In practice, in Elasticsearch this is implemented using <strong>static thresholds</strong>. One is the <strong>saturation threshold</strong>: the saturation ratio that we consider suboptimal. The other is the <strong>patience threshold</strong>: the number of consecutive graph nodes that we allow to be visited while queue saturation remains suboptimal.</p><p>When we introduced this early termination strategy in Elasticsearch 9.2, we opted for conservative defaults, so as to preserve recall as much as possible while still gaining in terms of latency and memory consumption. For this reason, we set the saturation threshold to 100% and the patience threshold to a (bounded) 30% of the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-top-level-parameters:~:text=search%20request%20size.-,num_candidates,-(Optional%2C%20integer)%20The"><em><code>num_candidates</code></em></a> in the KNN query.</p><p>In many scenarios, these settings worked nicely; however, two queries requesting the same number of neighbors might have radically different convergence behaviors. Some queries encounter dense local neighborhoods and saturate quickly; others must traverse long, sparse paths before finding competitive candidates.
The latter proved the most difficult to handle effectively.</p><p>As a result, we sometimes noticed:</p><ul><li>Over-exploration for easy queries.</li><li>Premature termination for hard queries.</li></ul><p>We realized that fixed threshold values encode global assumptions about convergence, and that HNSW could instead adapt to different dynamics.</p><h2><strong>Making HNSW early termination adaptive</strong></h2><p>Adaptive early termination approaches this problem from a different angle. Instead of enforcing predefined stopping thresholds, the algorithm <strong>infers when to stop from the search dynamics themselves</strong>.</p><p>So instead of comparing the queue saturation ratio between two consecutive candidates, we introduce an instant smoothed discovery rate $d_{q,i}$ (how many new neighbors were introduced for a query <em>q</em> in the last visit <em>i</em>), together with the rolling mean $\mu_{q,i}$ and standard deviation $\sigma_{q,i}$ of that discovery rate during the graph visit (computed using <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford&#x27;s_online_algorithm">Welford’s algorithm</a>). These statistics are calculated per query, so they can be used to decide a different degree of patience for each query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1958be7ca4425ade66eaf621ada3533173183598-694x118.png" style="width: 100%;" /><p>The previously static thresholds become adaptive to the discovery rate statistics: the saturation threshold becomes the rolling mean plus the standard deviation, while the patience adapts, scaling inversely with the standard deviation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f7ee4a55c24853b657df26052b275e8bd76cf0f9-654x156.png" style="width: 100%;" /><p>The early exit rules remain the same: saturation happens when the instant discovery rate is lower than the adaptive saturation threshold, and the graph visit stops if the saturation persists for a number of consecutive candidate visits larger than the adaptive patience.</p><p>This way, we obtain a behavior that doesn’t depend on the <em><code>num_candidates</code></em> parameter in the KNN query (which might be set explicitly or left at its default, regardless of early exit) and that better adapts to each query and vector distribution dynamically.</p><p>On FinancialQA and Quora, the adaptive strategy (labeled as <em><code>et=adaptive</code></em>) reports a higher recall per visited node than both the static strategy (<em><code>et=static</code></em>) and the default HNSW behavior (<em><code>et=no</code></em>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2a906997d9a25d74c7038bd9661bc97581e7258e-1600x938.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a114911e232c351dbb814cea20e8b0f1415a717-1600x925.png" style="width: 100%;" /><p>Adaptive early termination is turned on by default in Elasticsearch 9.3 for HNSW dense vector fields (and it can be turned off via the <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-dense-vector-hnsw-early-termination">same index-level setting</a>).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Common Expression Language (CEL): How the CEL input improves data collection in Elastic Agent integrations]]></title>
            <description><![CDATA[Learn how the Common Expression Language differs from other programming languages, how we’ve extended it for Filebeat’s CEL input, and the flexibility it gives you to express data collection logic in Elastic Agent integrations.]]></description>
            <link>https://www.elastic.co/search-labs/blog/common-expression-language-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/common-expression-language-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Berkhout]]></dc:creator>
            <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/985c50bfabee3348494eb4307f0b3375a97a0644-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Agent <a href="https://www.elastic.co/integrations">integrations</a> allow users to ingest data into Elasticsearch from a wide range of sources. They combine collection logic, ingest pipelines, dashboards, and other artifacts into a package that can be installed and managed from the Kibana web interface.</p><p>Integrations configure <a href="https://www.elastic.co/docs/reference/beats/filebeat/configuration-filebeat-options">Filebeat inputs</a> to do the data collection. To collect data from HTTP APIs, we’ve often used the <a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-httpjson">HTTP JSON input</a>. However, even basic listing APIs can differ greatly in the details, and the HTTP JSON input&#x27;s model of YAML-configured transformations can make it awkward and sometimes impossible to express the required collection logic.</p><p>The <a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-cel">Common Expression Language (CEL) input</a> was introduced to allow more flexible interaction with HTTP APIs. <a href="https://cel.dev/">CEL</a> is a language designed to be embedded in applications that require a fast, safe, and extensible way to express conditions and data transformations. The CEL input lets an integration builder write one expression that can read settings, keep track of its own state, make requests, process responses, and ultimately return events ready to ingest.</p><p>In this article, we’ll look at how CEL differs from other programming languages, how we’ve extended it for the CEL input, and the flexibility and power that gives you to express your data collection logic.</p><h2>CEL and how it works in the input</h2><p>CEL is an expression language. It has no statements. When you write CEL, you don’t tell it what to do by writing statements, you tell it what value to produce by writing an expression. Every CEL expression produces a value, and smaller expressions can be combined into a larger expression to produce a result according to more complex rules. Later, we’ll see how to use expressions for things that may be written with statements in other languages.</p><p>CEL is intentionally a non-Turing complete language. It doesn’t allow unbounded loops. Later, we’ll see how you can process lists and maps using macros, but by avoiding unbounded loops, the language guarantees predictable and limited execution time for individual expressions.</p><p>The CEL input is configured with a CEL program (an expression) and some initial state. The state will be provided as input to the program. The program is evaluated to produce an output state. If the output state includes a list of events, those will be removed and published. The rest of the output state will be used as the input for the next evaluation. If the output state includes one or more events and the flag <code>want_more: true</code>, the next evaluation will be performed immediately; otherwise, it will sleep for the rest of the configured interval time before continuing. Here’s a simplified diagram of the input’s control flow:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42671541f97e2dba808fd53969fe12f517917f9a-1600x529.png" style="width: 100%;" /><p>The output of each evaluation will be passed forward as the input to the next evaluation, for as long as the input runs. 
Output data under the key &quot;<code>cursor</code>&quot; will be persisted to disk and reloaded after the input is restarted, but the rest of the state will not be preserved across restarts.</p><p>The CEL language itself has limited functionality and avoids side effects, but it is extensible. The <a href="https://github.com/google/cel-go">cel-go</a> implementation adds some functionality, such as optional syntax and types. The <a href="https://github.com/elastic/mito">Mito</a> library builds on cel-go and adds more functionality, including the ability to make HTTP requests. The CEL input uses Mito’s version of CEL.</p><h2>Working with Mito</h2><p>To build or debug an integration using the CEL input, the most important thing to understand is what output state your CEL program will produce for a given input state. During development, it can be cumbersome to have your CEL program run by the input, surrounded by the full Elastic stack. One way to achieve a faster feedback loop is to use Mito’s command-line tool, which will let you run a CEL program directly and see the output it produces for a given input.</p><p>Mito is written in Go and can be installed as follows:</p><pre><code>go install github.com/elastic/mito/cmd/mito@latest</code></pre><p>When you run a CEL program with Mito, you typically give it two files: a JSON file with the initial input state, and another file with the source code of your CEL program:</p><pre><code>mito -data state.json src.cel</code></pre><p>For easier copying and pasting, the examples in this article are written as single commands that have the shell create temporary files on the fly, by wrapping the content of each file in <code>&lt;(echo &#x27;...content...&#x27;)</code>. In your own development, working with actual files will be easier.</p><h2>Fetching issues data from GitHub</h2><p>The following example includes a full CEL program that will fetch data about issues from the <a href="https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues">GitHub API</a>. Its initial input state has a URL for the API endpoint, and some information about how it should handle pagination. The CEL program uses the data in the input state to generate a request. It will decode the response, produce events from it, and return them as part of its output state.</p><pre><code>mito -data <(echo '
  {
    "url": "https://api.github.com/repos/elastic/integrations/issues",
    "per_page": 3,
    "max_pages": 3
  }
') <(echo '
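  // Read the page number from the persisted cursor, defaulting to 1 on the first run.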
  int(state.?cursor.page.orValue(1)).as(page,
    (
      state.url + "?" + {
        "state": ["all"],
        "sort": ["created"],
        "direction": ["asc"],
        "per_page": [string(state.per_page)],
        "page": [string(page)],
      }.format_query()
    ).as(full_url,
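      // Build a GET request with the headers that the GitHub API expects, then send it.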
      request("GET", full_url).with({
        "Header": {
          "Accept": ["application/vnd.github+json"],
          "X-GitHub-Api-Version": ["2022-11-28"],
        }
      }).do_request().as(resp,
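        // Decode the JSON response body into a list of issues.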
        resp.Body.decode_json().as(data,
          state.with({
            "events": data.map(i, {
              "html_url": i.html_url,
              "title": i.title,
              "created_at": i.created_at,
            }),
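            // Persist the next page number under "cursor" so paging survives restarts;
            // keep fetching immediately (want_more) while pages come back full, up to max_pages.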
            "cursor": { "page": page + 1 },
            "want_more": size(data) == state.per_page && page < state.max_pages,
          })
        )
      )
    )
  )
')</code></pre><p>Its first evaluation produces the following output:</p><pre><code>{
  "cursor": {
    "page": 2
  },
  "events": [
    {
      "created_at": "2018-09-14T09:47:35Z",
      "html_url": "https://github.com/elastic/integrations/issues/3250",
      "title": "Increase support of log formats in haproxy filebeat module"
    },
    {
      "created_at": "2019-02-06T12:37:37Z",
      "html_url": "https://github.com/elastic/integrations/issues/487",
      "title": "ETCD Metricbeat module needs polishing and grooming"
    },
    {
      "created_at": "2019-08-13T11:33:11Z",
      "html_url": "https://github.com/elastic/integrations/pull/1",
      "title": "Initial structure"
    }
  ],
  "max_pages": 3,
  "per_page": 3,
  "url": "https://api.github.com/repos/elastic/integrations/issues",
  "want_more": true
}</code></pre><p>The events will be removed, and when run in the CEL input, they’ll be published for ingestion. The rest of the output will be provided to the next CEL program evaluation as its input state.</p><p>To understand how that CEL program works, we’ll look at some smaller CEL examples and discuss more details of how the CEL input operates.</p><h2>CEL basics</h2><p>In the CEL language, there are no statements; there are only expressions. Every successful CEL expression evaluates to a final value. Here’s one of the smallest CEL expressions you can write, along with its output:</p><pre><code>mito <(echo '
  "hello" + " " + "world"
')</code></pre><pre><code>"hello world"</code></pre><p>Many simple expressions are intuitive. Mathematical operations are only supported on values of the same type (for example, <code>int</code> with <code>int</code>), so convert types as you need (here from <code>int</code> to <code>double</code>):</p><pre><code>mito <(echo '
  double((1 + 2) * (3 + 4)) / 2.0
')</code></pre><pre><code>10.5</code></pre><p>There are no variables in the CEL language, but an expression can be given a name and used in a larger expression with the help of Mito’s <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-As__Macro_-Collections"><code>as</code></a> macro. In this example, the expression <code>(1 + 1)</code> evaluates to the value <code>2</code>, and <code>.as(n, ...)</code> gives that value the name <code>n</code> for use in the expression <code>&quot;one plus one is &quot;+string(n)</code>:</p><pre><code>mito <(echo '
  (1 + 1).as(n, "one plus one is "+string(n))
')</code></pre><pre><code>"one plus one is 2"</code></pre><p>It&#x27;s also possible to accumulate information in a map and use it later in the expression, as demonstrated here using <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-With-Collections"><code>with</code></a>:</p><pre><code>mito <(echo '
  { "key": "value" }.with({ "key2": "value2" }).as(data,
    {
      "data": data,
      "size": size(data),
    }
  )
')</code></pre><pre><code>{
  "data": {
    "key": "value",
    "key2": "value2"
  },
  "size": 2
}</code></pre><p>Look at that example again. Notice that the nested part, <code>({ &quot;data&quot;: data, &quot;size&quot;: size(data), })</code>, gives us the shape of the final value. It’s a map with the keys <code>&quot;data&quot;</code> and <code>&quot;size&quot;</code>. The values for those keys depend on <code>data</code>, which is defined by the outer part of the expression. Reading CEL expressions from the inside out can help to quickly see what they’ll return.</p><p>CEL has no control flow statements, like <code>if</code>, but conditional branching can be done with the ternary operator:</p><pre><code>mito <(echo '
  1 + 1 < 12 ? "few" : "many"
')</code></pre><pre><code>"few"</code></pre><p>Unbounded loops and recursion are not supported, as CEL is not a Turing complete language. That makes execution time predictable and proportional to the size of the input data and the expression complexity.</p><p>Although unbounded loops are not possible in individual CEL expressions, you can process lists and maps using macros like <a href="https://github.com/google/cel-spec/blob/master/doc/langdef.md#macros"><code>map</code></a>:</p><pre><code>mito <(echo '
  [1, 2, 3].map(x, x * 2)
')</code></pre><pre><code>[2, 4, 6]</code></pre><p>In this section, we’ve covered:</p><ul><li>Strings, numbers, lists, and maps.</li><li>String concatenation.</li><li>Mathematical operations.</li><li>Type casting.</li><li>Conditionals.</li><li>Naming sub-expressions.</li><li>Processing collections.</li></ul><p>Next, we’ll look at how to make HTTP requests.</p><h2>Requests</h2><p>Mito extends CEL with the ability to make <a href="https://pkg.go.dev/github.com/elastic/mito/lib#HTTP">HTTP requests</a>:</p><pre><code>mito <(echo '
  get("https://example.com").as(resp, string(resp.Body))
')</code></pre><pre><code>"<!doctype html><html lang=\"en\"><head><title>Example Domain</title>..."</code></pre><p>Requests can be explicitly constructed before they’re executed. That makes it possible to use different HTTP methods and to add headers and a body.</p><p>In this example, we build a URL with the help of <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-Format_Query-HTTP"><code>format_query</code></a>, add a header to the request, and parse the response body with <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-Decode_JSON-JSON"><code>decode_json</code></a>. When given the <code>-log_requests</code> option, Mito will log detailed information in JSON format about each request and response.</p><pre><code>mito -log_requests <(echo '
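  // Construct the request explicitly to set the method and add headers before executing it.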
  request("GET",
    "https://postman-echo.com/get?" + {
        "q": ["query value"]
     }.format_query()
  ).with({
    "Header": { "Accept": ["application/json"] }
  }).do_request().as(resp, {
    "status": resp.StatusCode,
    "data": resp.Body.decode_json(),
  })
')</code></pre><pre><code>{"time":"...","level":"INFO","msg":"HTTP request",...}
{"time":"...","level":"INFO","msg":"HTTP response",...}
{
  "data": {
    "args": {
      "q": "query value"
    },
    "headers": {
      "accept": "application/json",
      "accept-encoding": "gzip, br",
      "host": "postman-echo.com",
      "user-agent": "Go-http-client/2.0",
      "x-forwarded-proto": "https"
    },
    "url": "https://postman-echo.com/get?q=query+value"
  },
  "status": 200
}</code></pre><h2>Managing state and evaluations</h2><p>Now that we’ve covered how to make requests and the CEL basics required to produce our desired output state, let’s take a closer look at what we should put into the output state and how that lets us direct later processing.</p><p>An integration’s CEL program needs to make sure its output state is suitable for use as the input of the next evaluation. Configuration sets the initial state, and that should be repeated in the output with any appropriate changes. An easy way to do that is to use <code>state.with({ ... })</code>, to repeat the state map with some overrides. A common pattern for small programs is to wrap the whole program in <code>state.with()</code>, so that state propagation doesn’t have to be repeated in each branch that generates output data (for example, success, errors).</p><p>When there are state values that are initialized by an evaluation rather than hard-coded in the initial input state, the program will need to check for an existing value before setting the initial one. That’s something that the support for <a href="https://pkg.go.dev/github.com/google/cel-go/cel#OptionalTypes">optional syntax and types</a> can help with. By using a question mark before the field name in a map key, the access becomes optional: It may or may not resolve to a value, but further optional accesses are possible and it’s easy to supply a default if no value is present:<br/></p><pre><code>mito -data <(echo '{}') <(echo '
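  // Optional access: reuse the counter from state if present, otherwise start at 0.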
  int(state.?counter.orValue(0)).as(counter,
    state.with({
      "counter": counter + 1,
      "want_more": counter + 1 < 3,
    })
  )
')</code></pre><pre><code>{ "counter": 1, "want_more": true }
{ "counter": 2, "want_more": true }
{ "counter": 3, "want_more": false }</code></pre><p>In that example, the counter value read from state is cast to <code>int</code> because all numbers are serialized in the state as floating point numbers, in keeping with conventions established by JSON and JavaScript’s <code>Number</code> type. It should also be noted that <code>&quot;want_more&quot;: true</code> is honored here by Mito, but when run in the CEL input, the evaluation would only be repeated if the output also contains events.</p><p>It’s a requirement of CEL programs run by the CEL input that they return an <code>&quot;events&quot;</code> key in their output map. Its value can be a list of event maps, an empty list, or a single event map. The single event case is usually used for errors. The event will be published by the input, but its value will also be logged, and if it sets an <code>error.message</code> value, that will be used to update the integration’s Fleet health status. If your program produces a single non-error event, it’s best to wrap it in a list.</p><p>Take another look at the output of our GitHub issues program from earlier:</p><pre><code>{
  "url": "https://api.github.com/repos/elastic/integrations/issues",
  "per_page": 3,
  "max_pages": 3,
  "cursor": {
    "page": 2
  },
  "events": [
    { ... },
    { ... },
    { ... }
  ],
  "want_more": true
}</code></pre><p>The program effectively managed its state by:</p><ul><li>Repeating initial state values in <code>url</code>, <code>per_page</code>, and <code>max_pages</code>.</li><li>Adding state that should be persisted across restarts in <code>cursor.page</code>.</li><li>Returning events ready to publish in the <code>events</code> list.</li><li>Requesting immediate re-evaluation with <code>want_more: true</code>.</li></ul><p>Now that you understand optional access and state management, as well as CEL basics and HTTP requests, the full GitHub issues program should be readable. Try running it with Mito and experimenting with some changes.</p><h2>Review and resources</h2><p>In this article, we looked at what the CEL language is and how it has been extended in the Mito library for use in the CEL input. We saw the flexibility of CEL in an example program that fetches issues information from the GitHub API, and went through all the details necessary to understand that program, covering access to settings in the initial state, interaction with HTTP APIs, returning events to be ingested, and managing the state for later program executions.</p><p>To learn more and build integrations using the CEL input, there are a number of resources worth exploring:</p><ul><li><a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-cel">CEL input - Filebeat documentation</a></li><li><a href="https://pkg.go.dev/github.com/elastic/mito">Mito documentation</a></li><li><a href="https://cel.dev/">Common Expression Language - cel.dev website</a></li><li><a href="https://www.elastic.co/docs/extend/integrations">Create an Integration - Elastic documentation</a></li></ul><p>And perhaps the most valuable resource for building integrations with the CEL input is the CEL code of existing Elastic integrations, which can be found on GitHub:</p><p><a href="https://github.com/search?q=repo%3Aelastic%2Fintegrations+path%3A**%2Fcel.yml.hbs&amp;type=code"><code>cel.yml.hbs</code> files in the Elastic integrations repository - GitHub</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch & LLMs, Part 2: Matching entities with LLM judgment and semantic search]]></title>
            <description><![CDATA[Using semantic search and transparent LLM judgment for entity resolution in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/473ca4357c7d60f690edbd2a844acda169aca9c3-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In<a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch"> Part 1</a>, we prepared our watch list and extracted entity mentions. Now we’re ready to answer the hard question: Which entity does a mention actually refer to? Let&#x27;s return to the example in the first blog of this series, which set up why we need entity resolution: &quot;The Swift update is here!&quot; Imagine that this headline is accompanied by a bit more context:</p><ol><li>The new Swift update is here! Developers are eager to try out the new features.</li><li>The new Swift update is here! The new album will drop next month.</li></ol><p>With this added context, we should be able to resolve the name &quot;Swift&quot; to the correct entity.</p><p>In the <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">previous post</a>, we set up our watch list and enriched the entities with additional context. Looking at our examples above, we need to have at least the following two entities in the list: Taylor Swift and Swift Programming Language. We also covered how we extract entity mentions from text. Both of these examples would extract &quot;Swift&quot;. With these ingredients in place (the enriched watch list and the extracted entities), we’re finally ready to introduce the star of the show: entity matching.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach entity-matching concepts. Production systems might use different large language models (LLMs), custom matching rules, specialized judgment pipelines, or ensemble approaches combining multiple matching strategies.</p><h2>The problem: Why matching is hard</h2><p>Human language is a remarkable thing. One of its most interesting properties is its endless creativity. We can generate and understand an infinite number of new sentences. Is it any wonder, then, that exact matches in entity resolution are rare? Authors strive to be creative when they can. It would get quite tedious if we had to write and read full names whenever an entity is mentioned. So, while exact matches are easy, the reality is that we need a more sophisticated approach to entity resolution: one that’s robust enough to handle at least some of the boundless creativity of human authors. That’s why we separate the problem into two steps: Use Elasticsearch to retrieve plausible candidates at scale, and then use an LLM to judge whether those candidates truly refer to the same real-world entity.</p><h2>The solution: Three-step matching with transparent LLM judgment</h2><p>We’re in the midst of a paradigm shift in how we use computers. Just as the rise of the internet took us from localized computing to a globally connected network, generative AI (GenAI) is fundamentally changing how content, code, and information are created. In fact, the educational prototype that accompanies this series was almost exclusively &quot;vibe coded&quot; using an LLM with careful prompting by the author. This is not to say that LLMs have reached, or ever will reach, the kind of productivity inherent in human language, but it does mean that we now have a powerful resource to help with entity resolution.</p><p>A common pattern we use with GenAI is retrieval augmented generation (RAG). Here, <em>retrieval</em> means retrieving entity candidates (not generating answers), and the LLM is used strictly for match evaluation and explanation.
While we <em>could</em> ask an LLM to help us with end-to-end entity resolution, that’s a costly approach in terms of both time and money. RAG provides context to the LLM through more efficient retrieval, so the model can concentrate on the judgment step of entity resolution.</p><p>For the retrieval part of RAG, we again turn to Elasticsearch. We first find potential matches using a combination of exact matching, matching against aliases, and hybrid search, which combines keyword and semantic search. Once we find these potential matches, we send them to an LLM for judgment. The LLM acts as the final match evaluator. We also make the LLM explain its reasoning, an important differentiator from other entity resolution systems. Without these explanations, entity resolution is a black box; with them, we can see for ourselves why a match makes sense.</p><h2>Key concepts: Three-step matching, hybrid search, and transparent LLM judgment</h2><p><strong>What is three-step matching?</strong> At the outset of this project, we hypothesized that semantic search would be a crucial part of the system, but not every match requires such sophisticated search. To find matches efficiently, we take a progressive approach to the problem. First, we check for exact matches using keyword search. If we find such a match, our work is done and we can move on. If exact matching fails, we turn to alias matching. In the prototype, alias matching is also done using exact matching with keywords, for simplicity. In production, you might expand this step with normalization, transliteration rules, fuzzy matching, or curated alias tables. If we still haven&#x27;t found a potential match in the first two steps, then it&#x27;s time to bring in semantic search via Elasticsearch&#x27;s hybrid search with reciprocal rank fusion (RRF).</p><p><strong>What is hybrid search?</strong> In Elasticsearch, we can use semantic search to find meaningful matches that take context into account. Elasticsearch is widely used for vector search and hybrid retrieval. Semantic similarity is powerful for meaning, but it’s not a substitute for structured filtering (for example, by time ranges, locations, or identifiers), and it’s often unnecessary when an exact match is available. Elasticsearch made its mark with lexical search, which is great at tasks where semantic search doesn&#x27;t fit. To take full advantage of both approaches, we use lexical search alongside semantic search in a single hybrid query. We then merge the results to find the most likely matches using RRF. In the prototype, the top two results become potential matches that can be sent for LLM judgment.</p><p><strong>Why LLM judgment?</strong> LLM judgments and explanations allow our system to handle ambiguity and context transparently. This is vital for cases like &quot;the president&quot;, which could refer to multiple entities, depending on the context, but it also makes things like nicknames and cultural variations work well in the system. Finally, when we consider mission-critical tasks, like identifying entities from sanctions lists, we need to know why a match was accepted in order to trust the system.
Crucially, the LLM does not search the full corpus; it evaluates only the small set of candidates returned by Elasticsearch.</p><h2>Real-world results: Matching with LLM reasoning</h2><p>A major challenge for any natural language processing task is the creation of a golden document, an &quot;answer key&quot; that tells us what the expected results are. Without this, it&#x27;s next to impossible to judge how well a system performs on a task, but creating such a document can be a laborious process. For the entity resolution prototype, we turned again to GenAI to help set up data we could test against.</p><p>We first defined several challenge types, such as nicknames and transliteration, and then asked the LLM to create a tiered collection of datasets that would get progressively larger and more challenging for the system. The creation of the datasets was less straightforward than one might hope. The LLM had a strong propensity for &quot;cheating&quot; by making it too easy to get the right answer. For example, one of the challenge types focused on semantic context. This type included things like resolving &quot;Russian author&quot; to &quot;Leo Tolstoy&quot;. The LLM incorrectly put &quot;Russian author&quot; as an alias for &quot;Leo Tolstoy&quot;, which negated the need for hybrid search to find the match.</p><p>After several refactorings to fix issues like this, we had five dataset tiers to work with. Tiers 1–4 were progressively larger with more challenge types. Tier 5 was the &quot;ultimate challenge&quot; dataset, made up of the trickiest examples from all challenge types. All of the test data is available in the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/comprehensive_evaluation">comprehensive evaluation directory</a>.</p><p>To evaluate our prompt-based entity resolution approach, we focused our attention on the tier 4 dataset. An important note is that the evaluation was conducted as a controlled experiment so that we could focus on entity match quality. The watch list data was pre-enriched with context, and entities were extracted from the article ahead of time. This ensured that evaluation was focused on matching rather than on extraction accuracy. This isolates match quality; end-to-end performance would additionally depend on extraction recall and enrichment quality.</p><h3>Evaluation dataset</h3><p>The tier 4 evaluation dataset provides a comprehensive test of the system&#x27;s capabilities:[1]</p><ul><li><strong>Watch list entities:</strong> 66 entities across diverse types (people, organizations, locations).</li><li><strong>Test articles:</strong> 69 articles covering real-world entity resolution scenarios.</li><li><strong>Expected matches:</strong> 206 expected entity matches across all articles.</li><li><strong>Challenge types: </strong>15 different challenge types testing various aspects of entity resolution.</li></ul><p>The challenge types included in the dataset are:</p><ul><li><strong>Nicknames:</strong> &quot;Bob Smith&quot; → &quot;Robert Smith&quot; (seven articles).</li><li><strong>Titles and honorifics:</strong> &quot;Dr. 
Sarah Williams&quot; → &quot;Sarah Williams&quot; (five articles).</li><li><strong>Semantic context:</strong> &quot;Russian author&quot; → &quot;Leo Tolstoy&quot; (eight articles).</li><li><strong>Multilingual names:</strong> Handling names in different scripts (six articles).</li><li><strong>Business entities:</strong> Corporate name variations (seven articles).</li><li><strong>Executive references: </strong>&quot;Microsoft CEO&quot; → &quot;Satya Nadella&quot; (five articles).</li><li><strong>Political leaders:</strong> Title-based references (five articles).</li><li><strong>Initials:</strong> &quot;J. Smith&quot; → &quot;John Smith&quot; (three articles).</li><li><strong>Name order variations:</strong> Different name ordering conventions (three articles).</li><li><strong>Truncated names:</strong> Partial name matches (three articles).</li><li><strong>Name splitting:</strong> Names split across text (three articles).</li><li><strong>Missing spaces/hyphens:</strong> Formatting variations (two articles).</li><li><strong>Transliteration:</strong> Cross-script name matching (two articles).</li><li><strong>Combined challenges:</strong> Multiple challenges in one article (six articles).</li><li><strong>Complex business:</strong> Hierarchical business relationships (five articles).</li></ul><p>Let&#x27;s see how prompt-based entity resolution performed.</p><h3>Overall performance</h3><p>The results show that there&#x27;s a lot of promise with LLM-powered match evaluation, but they also reveal a significant reliability issue. Because each candidate pair must be evaluated by the LLM, failures in structured output can suppress acceptance and recall even when retrieval is working well.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Value</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Precision</td><td style="border: 1px solid #ccc; padding: 4px;">83.8%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">62.6%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">F1 score</td><td style="border: 1px solid #ccc; padding: 4px;">71.7%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total matches found</td><td style="border: 1px solid #ccc; padding: 4px;">344</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LLM acceptance rate</td><td style="border: 1px solid #ccc; padding: 4px;">44.8%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">30.2%</td>
                    </tr>
              </tbody>
            </table>
<h3>The error rate problem</h3><p>Recall that the first step we take in the prototype is to create potential match pairs using Elasticsearch. Each of these potential matches needs to be evaluated by the LLM. To process all of those matches efficiently, we batch the LLM calls. This reduces API costs and latency, but it also increases the risk of malformed JSON in the output. As batch size increases, the JSON becomes longer and more complex, making it more likely that the LLM will generate invalid JSON. This is where the 30% error rate stems from. In the evaluation, we used a batch size of five matches per request. Even with this conservative batch size, we still see JSON parsing failures, which skew the evaluation results significantly.</p><h2>What&#x27;s next: Optimizing LLM integration</h2><p>Now that we&#x27;ve matched entities using semantic search and LLM judgment, we have a complete entity resolution pipeline. This approach, however, introduces a new failure mode: the model&#x27;s judgment is correct, but its output isn&#x27;t usable. We can optimize the LLM integration for better reliability and cost efficiency. In the next post, we&#x27;ll explore how to use function calling for structured output, which provides guaranteed structure and type safety while reducing errors and costs.</p><h2>Try it yourself</h2><p>Want to see entity matching in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks">Entity Matching notebook</a> (03_entity_matching_v3.ipynb) for a complete walk-through with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to match entities using three-step search, hybrid search with RRF, and LLM-powered judgment with reasoning.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach the concepts. When building production systems, consider additional factors, like model selection, cost optimization, latency requirements, quality validation, error handling, and monitoring, which aren&#x27;t covered in this learning-focused prototype.</p><h2>Notes</h2><ol><li>These datasets are synthetic and designed for education; they approximate real challenges but are not representative of any single production domain.</li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch vector search is up to 8x faster than OpenSearch]]></title>
            <description><![CDATA[Exploring filtered vector search benchmarks of OpenSearch vs. Elasticsearch and why vector search performance is critical for context-engineered systems.]]></description>
            <link>https://www.elastic.co/search-labs/blog/opensearch-vs-elasticsearch-filtered-vector-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/opensearch-vs-elasticsearch-filtered-vector-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sachin Frayne]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb7ee623925ca6696d643e437ce8efe5fe749079-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<h2>Why search speed matters for AI agents and context engineering</h2><p>Our benchmarks on a 20M document corpus show that Elasticsearch delivers up to 8x higher throughput than OpenSearch for filtered vector search, while also achieving higher Recall@100 across the configurations we tested. Context engineering depends on more than fast vector retrieval. Teams also need strong relevance controls (like hybrid search and filtering), operational simplicity, and predictable performance as workflows iterate. But because agents often run retrieve → reason → retrieve loops many times per request, retrieval latency becomes a multiplier, so improvements here translate directly into better end-to-end responsiveness and lower cost.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5a52a07773f0942c2baa732dacfe782aac0f415-1600x683.png" style="width: 100%;" /><p>For context engineering, retrieval isn’t a one-time step. Agents and applications repeatedly run loops, such as retrieve → reason → retrieve, to refine queries, verify facts, assemble grounded context, and complete tasks. This pattern is common in agentic workflows and iterative retrieval-augmented generation (RAG). Because retrieval may be invoked many times per user request, it adds delay to the response and/or increases infrastructure costs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54ac476ff20a3cf93484298c9ae47612c12fc110-800x417.png" style="width: 100%;" /><h2>Why is vector search performance critical?</h2><p>Imagine a shopping assistant answering the question, “I need a carry-on backpack under $60 that fits a 15-inch laptop, is water resistant, and can arrive by Friday.”</p><p>In production, the assistant rarely issues one vector query and stops. It runs a retrieval loop to build the right context, and each step is typically constrained by filters, like availability, region, shipping promise, brand rules, and policy eligibility.</p><p><strong>Step 1: Interpret intent and translate to constraints.</strong></p><p>The agent turns the request into structured filters and a semantic query, such as:</p><ul><li>Filters: In stock, deliverable to the user’s postcode, delivery by Friday, price under $60, valid listing</li><li>Vector query: “Carry-on backpack 15-inch laptop water resistant”</li></ul><p><strong>Step 2: Retrieve candidates, and then refine.</strong></p><p>It often repeats retrieval with variations to avoid missing good matches:</p><ul><li>“travel backpack carry on laptop sleeve”</li><li>“water resistant commuter backpack 15 inch”</li><li>“lightweight cabin backpack”</li></ul><p>Each query uses the same eligibility filters, because retrieving irrelevant or unavailable items is wasted context.</p><p><strong>Step 3: Expand to confirm details and reduce risk.</strong></p><p>The agent then retrieves again to verify key attributes that affect the final answer:</p><ul><li>Material and water resistance wording</li><li>Dimensions and laptop compartment fit</li><li>Return policy or warranty constraints</li><li>Alternate options if inventory is low</li></ul><p>This is multistep context engineering: Retrieve, reason, retrieve, assemble.</p><h2>Why latency and recall matter for context engineering</h2><p>These interactions can involve dozens of filtered retrieval calls per user session.
That makes per-call latency a direct multiplier on end-to-end response time, and low recall forces extra retries or causes the agent to miss eligible items, degrading answer quality.</p><p>Takeaway: In context-engineered systems, filtered approximate nearest neighbors (ANN) isn’t a single lookup. It’s a repeated operation under constraints, so vector search performance shows up immediately in latency, throughput, and cost, even when the large language model (LLM) is the most visible component.</p><h2>Benchmarking</h2><h3>Results</h3><p>In Graph 2, each dot represents one test configuration. The best results appear toward the top left, meaning higher recall with lower latency. Elasticsearch’s results are consistently closer to the top left than OpenSearch’s, indicating better speed and accuracy under the same workload settings.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c50d1df00968cac18149a2799e6242fbe49b66a0-1600x990.png" style="width: 100%;" /><h4>Some key insights</h4><ul><li><code>s_n_r_value</code>: Shorthand for <code>size_numCandidates_rescoreOversample</code> (k is set equal to numCandidates in these tests); for example, <code>100_500_1</code> means size=100, k=500, numCandidates=500, and rescore oversample=1</li><li>Recall: Measured Recall@100 for that configuration</li><li>Avg latency (ms): Average end-to-end latency per query</li><li>Throughput: Queries per second</li><li>Recall %: Relative recall lift of Elasticsearch versus OpenSearch: (Elasticsearch minus OpenSearch) / OpenSearch</li><li>Latency Xs: OpenSearch average latency divided by Elasticsearch average latency</li><li>Throughput Xs: Elasticsearch throughput divided by OpenSearch throughput</li></ul>
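<p>To make the derived columns concrete, here is a small illustrative Python helper (not part of the published benchmark client) that computes them from one matched pair of rows:</p><pre><code># Python sketch: compute the derived comparison columns for one
# s_n_r_value configuration, given each engine's measured numbers.
def derived_columns(es, os_):
    # es / os_: dicts with 'recall', 'latency_ms', and 'qps'.
    return {
        # Relative recall lift: (Elasticsearch - OpenSearch) / OpenSearch
        "recall_pct": 100 * (es["recall"] - os_["recall"]) / os_["recall"],
        # Latency Xs: OpenSearch latency / Elasticsearch latency
        "latency_xs": os_["latency_ms"] / es["latency_ms"],
        # Throughput Xs: Elasticsearch QPS / OpenSearch QPS
        "throughput_xs": es["qps"] / os_["qps"],
    }

# Example with the 100_250_1 rows from the table below:
es_row = {"recall": 0.7704, "latency_ms": 25.0, "qps": 534.75}
os_row = {"recall": 0.7023, "latency_ms": 57.08, "qps": 279.58}
print(derived_columns(es_row, os_row))
# {'recall_pct': 9.70, 'latency_xs': 2.28, 'throughput_xs': 1.91} (rounded)</code></pre>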
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
<td style="border: 1px solid #ccc; padding: 4px;">Engine</td><td style="border: 1px solid #ccc; padding: 4px;"><code>s_n_r_value</code></td><td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">Avg latency (ms)</td><td style="border: 1px solid #ccc; padding: 4px;">Throughput</td><td style="border: 1px solid #ccc; padding: 4px;">Recall %</td><td style="border: 1px solid #ccc; padding: 4px;">Latency Xs</td><td style="border: 1px solid #ccc; padding: 4px;">Throughput Xs</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_250_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.7704</td><td style="border: 1px solid #ccc; padding: 4px;">25</td><td style="border: 1px solid #ccc; padding: 4px;">534.75</td><td style="border: 1px solid #ccc; padding: 4px;">9.70%</td><td style="border: 1px solid #ccc; padding: 4px;">2.28</td><td style="border: 1px solid #ccc; padding: 4px;">1.91</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_250_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.7023</td><td style="border: 1px solid #ccc; padding: 4px;">57.08</td><td style="border: 1px solid #ccc; padding: 4px;">279.58</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8577</td><td style="border: 1px solid #ccc; padding: 4px;">25.42</td><td style="border: 1px solid #ccc; padding: 4px;">524.14</td><td style="border: 1px solid #ccc; padding: 4px;">7.20%</td><td style="border: 1px solid #ccc; padding: 4px;">2.4</td><td style="border: 1px solid #ccc; padding: 4px;">2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8001</td><td style="border: 1px solid #ccc; padding: 4px;">60.9</td><td style="border: 1px solid #ccc; padding: 4px;">262.12</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_750_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8947</td><td style="border: 1px solid #ccc; padding: 4px;">29.67</td><td style="border: 1px solid #ccc; padding: 4px;">528.09</td><td style="border: 1px solid #ccc; padding: 4px;">5.72%</td><td style="border: 1px solid #ccc; padding: 4px;">2.25</td><td style="border: 1px solid #ccc; padding: 4px;">2.21</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_750_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8463</td><td style="border: 1px solid #ccc; padding: 4px;">66.76</td><td style="border: 1px solid #ccc; padding: 4px;">239.11</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9156</td><td style="border: 1px solid #ccc; padding: 4px;">29.65</td><td style="border: 1px solid #ccc; padding: 4px;">534.5</td><td style="border: 1px solid #ccc; padding: 4px;">4.66%</td><td style="border: 1px solid #ccc; padding: 4px;">2.46</td><td style="border: 1px solid #ccc; padding: 4px;">2.44</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8748</td><td style="border: 1px solid #ccc; padding: 4px;">72.88</td><td style="border: 1px solid #ccc; padding: 4px;">219.01</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9386</td><td style="border: 1px solid #ccc; padding: 4px;">31.84</td><td style="border: 1px solid #ccc; padding: 4px;">497.3</td><td style="border: 1px solid #ccc; padding: 4px;">3.38%</td><td style="border: 1px solid #ccc; padding: 4px;">2.71</td><td style="border: 1px solid #ccc; padding: 4px;">2.68</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9079</td><td style="border: 1px solid #ccc; padding: 4px;">86.16</td><td style="border: 1px solid #ccc; padding: 4px;">185.4</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9507</td><td style="border: 1px solid #ccc; padding: 4px;">34.69</td><td style="border: 1px solid #ccc; padding: 4px;">457.2</td><td style="border: 1px solid #ccc; padding: 4px;">2.57%</td><td style="border: 1px solid #ccc; padding: 4px;">2.98</td><td style="border: 1px solid #ccc; padding: 4px;">2.96</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9269</td><td style="border: 1px solid #ccc; padding: 4px;">103.36</td><td style="border: 1px solid #ccc; padding: 4px;">154.55</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9582</td><td style="border: 1px solid #ccc; padding: 4px;">37.9</td><td style="border: 1px solid #ccc; padding: 4px;">418.43</td><td style="border: 1px solid #ccc; padding: 4px;">1.99%</td><td style="border: 1px solid #ccc; padding: 4px;">3.28</td><td style="border: 1px solid #ccc; padding: 4px;">3.26</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9395</td><td style="border: 1px solid #ccc; padding: 4px;">124.29</td><td style="border: 1px solid #ccc; padding: 4px;">128.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_3000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9636</td><td style="border: 1px solid #ccc; padding: 4px;">41.86</td><td style="border: 1px solid #ccc; padding: 4px;">379.4</td><td style="border: 1px solid #ccc; padding: 4px;">1.62%</td><td style="border: 1px solid #ccc; padding: 4px;">3.46</td><td style="border: 1px solid #ccc; padding: 4px;">3.44</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_3000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9482</td><td style="border: 1px solid #ccc; padding: 4px;">144.67</td><td style="border: 1px solid #ccc; padding: 4px;">110.34</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_4000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9705</td><td style="border: 1px solid #ccc; padding: 4px;">50.28</td><td style="border: 1px solid #ccc; padding: 4px;">316.21</td><td style="border: 1px solid #ccc; padding: 4px;">1.06%</td><td style="border: 1px solid #ccc; padding: 4px;">3.87</td><td style="border: 1px solid #ccc; padding: 4px;">3.85</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_4000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9603</td><td style="border: 1px solid #ccc; padding: 4px;">194.36</td><td style="border: 1px solid #ccc; padding: 4px;">82.22</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_5000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9749</td><td style="border: 1px solid #ccc; padding: 4px;">58.77</td><td style="border: 1px solid #ccc; padding: 4px;">270.91</td><td style="border: 1px solid #ccc; padding: 4px;">0.73%</td><td style="border: 1px solid #ccc; padding: 4px;">4.43</td><td style="border: 1px solid #ccc; padding: 4px;">4.41</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_5000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9678</td><td style="border: 1px solid #ccc; padding: 4px;">260.33</td><td style="border: 1px solid #ccc; padding: 4px;">61.38</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_6000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9781</td><td style="border: 1px solid #ccc; padding: 4px;">66.75</td><td style="border: 1px solid #ccc; padding: 4px;">238.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.52%</td><td style="border: 1px solid #ccc; padding: 4px;">4.91</td><td style="border: 1px solid #ccc; padding: 4px;">4.89</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_6000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.973</td><td style="border: 1px solid #ccc; padding: 4px;">327.44</td><td style="border: 1px solid #ccc; padding: 4px;">48.81</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_7000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9804</td><td style="border: 1px solid #ccc; padding: 4px;">74.64</td><td style="border: 1px solid #ccc; padding: 4px;">213.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.38%</td><td style="border: 1px solid #ccc; padding: 4px;">5.28</td><td style="border: 1px solid #ccc; padding: 4px;">5.27</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_7000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9767</td><td style="border: 1px solid #ccc; padding: 4px;">394.24</td><td style="border: 1px solid #ccc; padding: 4px;">40.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_8000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9823</td><td style="border: 1px solid #ccc; padding: 4px;">82.28</td><td style="border: 1px solid #ccc; padding: 4px;">193.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.27%</td><td style="border: 1px solid #ccc; padding: 4px;">6.86</td><td style="border: 1px solid #ccc; padding: 4px;">6.83</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_8000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9797</td><td style="border: 1px solid #ccc; padding: 4px;">564.14</td><td style="border: 1px solid #ccc; padding: 4px;">28.33</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_9000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9837</td><td style="border: 1px solid #ccc; padding: 4px;">90.08</td><td style="border: 1px solid #ccc; padding: 4px;">176.96</td><td style="border: 1px solid #ccc; padding: 4px;">0.16%</td><td style="border: 1px solid #ccc; padding: 4px;">7.63</td><td style="border: 1px solid #ccc; padding: 4px;">7.61</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_9000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9821</td><td style="border: 1px solid #ccc; padding: 4px;">687.25</td><td style="border: 1px solid #ccc; padding: 4px;">23.25</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_10000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9848</td><td style="border: 1px solid #ccc; padding: 4px;">97.64</td><td style="border: 1px solid #ccc; padding: 4px;">163.31</td><td style="border: 1px solid #ccc; padding: 4px;">0.08%</td><td style="border: 1px solid #ccc; padding: 4px;">8.38</td><td style="border: 1px solid #ccc; padding: 4px;">8.36</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_10000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.984</td><td style="border: 1px solid #ccc; padding: 4px;">818.64</td><td style="border: 1px solid #ccc; padding: 4px;">19.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
              </tbody>
            </table>
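          <p>Because an agent repeats retrieval, per-query latency gaps compound. As a rough illustration (simple arithmetic, not benchmark code), the extra waiting time a slower engine adds to a sequential multi-step loop is:</p><pre><code># Python sketch: extra wall-clock time added by a slower engine across
# a loop of sequential retrieval calls.
def added_wait_seconds(slow_ms, fast_ms, steps=10):
    return steps * (slow_ms - fast_ms) / 1000.0

# Using the 100_2000_1 row above (OpenSearch 103.36 ms vs. Elasticsearch 34.69 ms):
print(added_wait_seconds(103.36, 34.69))  # ≈ 0.69 seconds per 10-step loop</code></pre>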
<p>For example, at <code>100_9000_1</code>, OpenSearch averages 687 milliseconds per retrieval versus 90 milliseconds on Elasticsearch; in a 10-step retrieval loop, that’s about 10 x (687 - 90) ms ≈ 6 seconds of additional waiting time.</p><p>See the <a href="https://github.com/elastic/competitive-benchmarking-studies/tree/main/es-9.3-vs-os-3.5-vector-search/jingra/results/20260220">full results</a>.</p><h3>Methodology</h3><p>We used Python to send the following queries to both engines and to track response timing and other statistics. Bear in mind that the performance of any vector search engine depends on how you tune its core parameters: how many candidates to consider, how aggressively to rescore, and how much context to return. These settings directly affect both recall (the likelihood of finding the right answer) and latency (how fast you get results).</p><p>In our benchmarks, we used the same candidate, rescore, and result-size settings you’d typically tune in an agentic retrieval loop, and we measured how Elasticsearch performs under that workload. We then ran OpenSearch with the same settings as a reference.</p><p>OpenSearch</p><pre><code>GET <INDEX_NAME>/_search
{
  "query": {
    "knn": {
      "<DENSE_VECTOR_FIELD_NAME>": {
        "vector": [...],
        "k": <NUMBER_OF_CANDIDATES>,
        "method_parameters": {
          "ef_search": <NUMBER_OF_CANDIDATES>
        },
        "rescore": {
          "oversample_factor": <OVERSAMPLE>
        },
        "filter": {
          <SOME_FILTER>
        }
      }
    }
  },
  "size": <RESULT_SIZE>,
  "_source": {
    "excludes": [
      "<DENSE_VECTOR_FIELD_NAME>"
    ]
  }
}</code></pre><ul><li><code>&quot;size&quot;: &lt;RESULT_SIZE&gt;</code>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100.</li><li><code>&quot;k&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbor candidates to return.</li><li><code>&quot;ef_search&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of vectors to examine during the search.</li><li><code>&quot;oversample_factor&quot;: &lt;OVERSAMPLE&gt;</code>: Multiplier that determines how many candidate vectors are retrieved before rescoring.</li></ul><p>Elasticsearch</p><pre><code>GET <INDEX_NAME>/_search
{
  "query": {
    "knn": {
      "field": "<DENSE_VECTOR_FIELD_NAME>",
      "query_vector": [...],
      "k": <NUMBER_OF_CANDIDATES>,
      "num_candidates": <NUMBER_OF_CANDIDATES>,
      "rescore_vector": {
        "oversample": <OVERSAMPLE>
      },
      "filter": {
        <SOME_FILTER>
      }
    }
  },
  "size": <RESULT_SIZE>,
  "_source": {
    "excludes": [
      "<DENSE_VECTOR_FIELD_NAME>"
    ]
  }
}</code></pre><ul><li><code>&quot;size&quot;: &lt;RESULT_SIZE&gt;</code>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100.</li><li><code>&quot;k&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbors to return from each shard.</li><li><code>&quot;num_candidates&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbor candidates to consider per shard while doing <code>knn</code> search.</li><li><code>&quot;oversample&quot;: &lt;OVERSAMPLE&gt;</code>: Multiplier that determines how many candidate vectors are retrieved before rescoring.</li></ul><p>Example</p><p>The <code>knn</code> query for configuration <code>100_500_1</code> would be as follows:</p><p>OpenSearch</p><pre><code>GET search_catalog_128/_search
{
  "query": {
    "knn": {
      "search_catalog_embedding": {
        "vector": [...],
        "k": 500,
        "method_parameters": {
          "ef_search": 500
        },
        "rescore": {
          "oversample_factor": 1
        },
        "filter": {
          "term": {
            "valid": true
          }
        }
      }
    }
  },
  "size": 100,
  "_source": {
    "excludes": [
      "search_catalog_embedding"
    ]
  }
}</code></pre><p>Elasticsearch</p><pre><code>GET search_catalog_128/_search
{
  "query": {
    "knn": {
      "field": "search_catalog_embedding",
      "query_vector": [...],
      "k": 500,
      "num_candidates": 500,
      "rescore_vector": {
        "oversample": 1
      },
      "filter": {
        "term": {
          "valid": true
        }
      }
    }
  },
  "size": 100,
  "_source": {
    "excludes": [
      "search_catalog_embedding"
    ]
  }
}</code></pre><p>The full configuration, alongside Terraform scripts, Kubernetes manifests, and the benchmarking code, is available in this <a href="https://github.com/elastic/competitive-benchmarking-studies">repository</a> in the folder <a href="https://github.com/elastic/competitive-benchmarking-studies/tree/main/es-9.3-vs-os-3.5-vector-search">es-9.3-vs-os-3.5-vector-search</a>.</p><h3>Cluster setup</h3><p>We ran our tests on six e2-standard-16 cloud servers, each with 16 vCPUs and 64 GB RAM. On each server, we allocated 15 vCPUs and 56 GB RAM to the Kubernetes pod running the search engine node, with 28 GB reserved for the JVM heap.</p><p>The clusters ran Elasticsearch 9.3.0 and OpenSearch 3.5.0 (Lucene 10.3.2). Because both systems use the same Lucene version in this benchmark, the throughput and latency differences we observe cannot be attributed to Lucene alone and instead reflect differences in how each engine integrates and executes filtered k-nearest neighbor (kNN) retrieval and rescoring. We used a single index with three primary shards and one replica (so 6 shards total, 1 per node).</p><p>We also used a separate server in the same region to run the benchmark client and collect timing statistics.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f81002c9186e4c2d3e92f49d72418fee9860fc5e-761x401.png" style="width: 100%;" /><h3>The dataset</h3><p>For this benchmark, we used a large-scale ecommerce-style catalog embedding dataset with 20 million documents, designed to reflect real-world filtered vector retrieval at scale.</p><p>Each document represents a catalog item and includes:</p><ul><li>A 128-dimensional dense vector embedding used for approximate kNN retrieval.</li><li>Structured metadata fields used for filtering (for example, item validity and availability, plus other catalog constraints), enabling the common production pattern of retrieving the nearest neighbors but only within an eligible subset.</li></ul><p>We chose this dataset because it captures the core performance challenge we see in agentic and RAG-style systems in production: Vector similarity alone is not enough; retrieval is frequently constrained by filters, and the system must maintain high recall while keeping latency low under those constraints. Compared to smaller QA-style datasets, a 20M document corpus also better reflects the scale and candidate pressure that filtered ANN systems face in practice.</p><h2>Conclusion</h2><p>In modern AI architectures, especially those built around context engineering, vector search speed isn’t a minor implementation detail. It’s a multiplier. When agents and workflows iterate through retrieve → reason → retrieve, retrieval performance directly shapes end-to-end latency, throughput, and the quality of the context fed into the model.</p><p>In our benchmarks, Elasticsearch consistently delivered higher recall at lower latency than OpenSearch in scenarios where correctness depends on retrieving the right document, not just a similar vector.
On a controlled dataset, the difference is clear, and in production those gains accumulate across large volumes of retrieval calls, improving responsiveness, increasing capacity headroom, and reducing infrastructure costs.</p><h3>Further reading</h3><ol><li><a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">What is context engineering?</a></li><li><a href="https://www.elastic.co/search-labs/blog/series/context-engineering-hybrid-search-evolution">The evolution of hybrid search and context engineering</a></li><li><a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">The impact of relevance in context engineering for AI agents</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic AutoOps now free for all: What you get with it]]></title>
            <description><![CDATA[Elastic AutoOps is now free for all self-managed clusters. Learn what you get with it and how it compares to Stack Monitoring.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valentin Crettaz, Arnon Stern, Ori Shafir]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2067a9457334a1b4bc14056fad34a304f39d1fab-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<h2><strong>Same AutoOps, regardless of your license</strong></h2><p>We’re making this simple: Starting today, AutoOps is available at no cost for every self-managed Elasticsearch cluster through <a href="https://www.elastic.co/blog/elasticsearch-autoops-on-prem">Elastic Cloud Connect</a>. Whether you’re on Free, Basic, Platinum, or Enterprise, you get the same fully featured product. This isn’t a limited preview or a &quot;lite&quot; version. It’s the same product used by the largest deployments.</p><h2><strong>An investment in the self-managed community</strong></h2><p>Extending AutoOps to every user, including those on the free distribution, reflects Elastic&#x27;s commitment to the success of the entire Elasticsearch community. By providing AutoOps for free, we’re investing in the stability and performance of the hundreds of thousands of clusters that power the community&#x27;s search and analytics workloads.</p><p>Elastic Cloud Connect enables self-managed clusters to consume Elastic Cloud services, such as AutoOps and the <a href="https://www.elastic.co/search-labs/de/blog/cloud-connect-elastic-inference-service">recently announced</a> Elastic Inference Service (EIS), without the operational overhead of maintaining, patching, monitoring, and operating the services locally.</p><h2><strong>How AutoOps for self-managed works and what it provides</strong></h2><p>As clusters grow in complexity and size, you find yourself spending more time chasing configuration tweaks and trying to find the root cause of an issue. Monitoring tools show you metrics and leave it to you (and your favorite large language model (LLM)) to manually correlate them and find the root cause when an issue appears. AutoOps tells you what’s wrong, why, and exactly how to fix it, with real-time issue detection and specific resolution paths.</p><p>AutoOps runs on Elastic Cloud; there’s no infrastructure for you to provision or maintain. You simply run a lightweight agent on-premises to connect your cluster to the AutoOps service; operational metadata (such as node stats, cluster settings, and shard states) is shipped in real time to provide insights and recommendations. Your data never leaves your environment.</p><h2><strong>AutoOps vs. Stack Monitoring for self-managed users</strong></h2><p>Stack Monitoring provides the essential telemetry and basic monitoring for your nodes and indices, showing you trends over time for various metrics and alerting you when thresholds are crossed, yet it often leaves the diagnostic burden on the engineer. AutoOps offers a complete picture of cluster health by correlating all relevant metrics. This provides valuable insights and clear instructions on how to resolve issues when they occur.</p><p><strong>Faster root cause analysis</strong></p><p>Your cluster was humming along, but it suddenly turned red in the middle of the night (and, as usual, nothing had changed in the days before).</p><ul><li><strong>With Stack Monitoring:</strong> A built-in alert will notify you when your cluster health turns red. To find the cause, you need to dig into your cluster logs and turn to Dev Tools to look at your shard allocations to understand why that primary shard couldn’t be allocated. Looking at your alert history, you see another alert informing you that your hot nodes reached 80% disk utilization two days ago.
You can’t find any disk usage charts to learn more about your disk fill rate; you only know that your disk reached 90% utilization in the meantime, and when one of your data streams needed to roll over, a new backing index was created, but no shards could be allocated to any of your nodes.</li><li><strong>With AutoOps:</strong> The system notifies you when your cluster turns red (1). Looking at the timeline, you immediately realize that this happened because you failed to take action on the previous watermark events that AutoOps raised, namely the high watermark events (2) that started happening recently, and the low watermark ones (3) that built up over the past few days. It’s now straightforward to know what to do to get your cluster back to green.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/445dac65e7fb0f061a4bff87d4a1a2cb4ee39140-1600x583.png" style="width: 100%;" /><p><strong>Higher signal-to-noise ratio</strong></p><p>Keeping your Elasticsearch cluster healthy is most probably your main concern. Yet, it’s not uncommon for the health status to sometimes flap between green and yellow (and sometimes red), and the cause is not always worthy of your time.</p><ul><li><strong>With Stack Monitoring:</strong> The built-in “Cluster Health” alert will continue to be raised on each health transition from green to either yellow or red. In some situations, like frequent index creations, this can create a lot of repeated and undesired noise. Also, and more importantly, there’s no distinction between a yellow and a red status.</li><li><strong>With AutoOps:</strong> There are dedicated “Status Red” and “Status Yellow” events, with different severities. The latter can be customized in many different ways to fit your use case, as shown in the screenshot below:<ol><li>Since the cluster health can turn yellow for only a brief duration, you can decide how long to ignore the yellow status before being notified (for example, five minutes in the screenshot below).</li><li>Furthermore, there are a lot of legitimate operations that Elasticsearch does all the time and that make the cluster turn yellow. You can pick any of the operations you don’t want to be notified about when your cluster turns yellow because of them (for example, adding replicas, relocating replicas, or closing or opening an index, among others).</li><li>Finally, and most importantly, if you have several clusters to manage, you don’t need to configure this for all of them separately; you simply decide to which ones this configuration should be applied. Simple, powerful!</li></ol></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e2b2826b4c674e4a5924d1f8a9b855606c735fa9-1542x1518.png" style="width: 100%;" /><p><strong>More insightful correlations and comparisons of node metrics</strong></p><p>When your cluster runs on more than just a handful of nodes, you often need to see how they perform against each other, especially when looking at search and indexing performance.</p><ul><li><strong>With Stack Monitoring:</strong> The Nodes list doesn’t allow you to focus on a specific data tier and doesn’t provide any search or indexing performance metrics that you can sort on. These metrics are available, but only once you drill into a specific node, which doesn’t let you easily compare nodes against each other.</li><li><strong>With AutoOps:</strong> The Nodes view allows you to select nodes from a specific data tier.
It also provides you with over 50 metric visualizations, among them search and indexing performance, which give you exactly the visual cues you need to understand how each node performs relative to the others and whether there are any struggling nodes that require your attention. In the screenshot below, we can see that some nodes are indexing at double the rate of some others and that search latency on some nodes is building up to four times that of the fastest ones.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1eec2db4931ca2992579a5501dd939fc8a552709-1600x1377.png" style="width: 100%;" /><p><strong>Quick overview of the main differences</strong></p><p>Here’s a glimpse at the notable differences between AutoOps and Stack Monitoring, but you can find a more detailed breakdown in our <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops-vs-stack-monitoring#compare-capabilities">official documentation</a>.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Capability</td><td style="border: 1px solid #ccc; padding: 4px;">Stack Monitoring</td><td style="border: 1px solid #ccc; padding: 4px;">AutoOps</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cluster, node, and index metrics</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Real-time dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Multi-cluster overview</td><td style="border: 1px solid #ccc; padding: 4px;">No</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Root cause analysis</td><td style="border: 1px solid #ccc; padding: 4px;">No</td><td style="border: 1px solid #ccc; padding: 4px;">Automated analysis</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Fix recommendations</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">In-context Elasticsearch commands</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Performance tuning insights</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">Yes, based on usage patterns</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Alerts and notifications</td><td style="border: 1px solid #ccc; padding: 4px;">14 built-in alerts and 27 connectors</td><td style="border: 1px solid #ccc; padding: 4px;">100+ customizable alerts and 7 connectors</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Template and mapping analysis</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">Detects mapping misconfigurations</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Setup infrastructure</td><td style="border: 1px solid #ccc; padding: 4px;">Dedicated monitoring cluster</td><td style="border: 1px solid #ccc; padding: 4px;">Five-minute installation. No extra infrastructure and no cost. Only lightweight agent installation is needed.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Other stack components</td><td style="border: 1px solid #ccc; padding: 4px;">Kibana, Logstash, Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
              </tbody>
            </table>
<h2><strong>Start now: Five-minute installation</strong></h2><p>Connecting your cluster takes minutes, regardless of your license type:</p><ol><li>Log in to your free Elastic Cloud account, or <a href="https://cloud.elastic.co/registration?onboarding_service_type=ccm-autoops">sign up</a> for one.</li><li>Choose how to connect your cluster: Elastic Cloud on Kubernetes (ECK), Kubernetes, Docker, or Linux.</li><li>Enter your Elasticsearch cluster endpoint, and run the single command to install and run the lightweight Elastic agent.</li><li>Access AutoOps in your <a href="https://cloud.elastic.co/home">Elastic Cloud account</a>.</li></ol><p>For more details on AutoOps and instructions on connecting your self-managed cluster, read our <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-autoops-as-cloud-connected">product documentation</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bf040d977b515ed6448ae780ede31beff701a4f6-1600x861.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7a5a01c8ec70dd467870aa06ac037a51bf8c86dc-1600x898.png" style="width: 100%;" /><h2><strong>Reach out if you have any questions</strong></h2><p>Feel free to reach out to us to share your questions and ideas via our <a href="https://elasticstack.slack.com/archives/C0A80TKS71D">Slack community</a>, by posting on our <a href="https://discuss.elastic.co/c/elastic-stack/monitoring/103">Discuss forum</a>, or by clicking the “Give Feedback” button on the AutoOps product page. If you’re connecting a paid self-managed Platinum or Enterprise cluster, you can contact support within your Elastic Cloud account.</p><h2><strong>Read more</strong></h2><p>If you’re interested in learning more about AutoOps and what it can do for you, please head to the <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-autoops-as-cloud-connected">official AutoOps documentation</a> and the following Elastic Search Labs articles:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/autoops-elasticsearch-easy-operations">AutoOps makes every Elasticsearch deployment simple(r) to manage</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-autoops-self-managed-elasticsearch">AutoOps: A journey to simplify self-managed Elasticsearch management</a></li><li><a href="https://www.elastic.co/search-labs/blog/slow-search-elasticsearch-query-autoops">Leveraging AutoOps to detect long-running search queries</a></li></ul><p><a href="https://cloud.elastic.co/registration?onboarding_service_type=ccm-autoops"><strong>Start using AutoOps for free</strong></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build a "chat with your website data" agent with Jina Embeddings v5 and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build a working, grounded "chat with your website" experience in under an hour by using Elasticsearch Serverless, Jina Embeddings v5, Elastic Open Web Crawler, and Elastic Agent Builder.]]></description>
            <link>https://www.elastic.co/search-labs/blog/chat-website-data-elasticsearch-jina-ai</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/chat-website-data-elasticsearch-jina-ai</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Ugo Sangiorgi, Lily Adler]]></dc:creator>
            <pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6e3ea84419f1edc60b5e2bfcb7e06693810d03a0-512x268.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Build a “chat with your website” experience in under an hour using <strong>Elasticsearch Serverless</strong>, <strong>Jina Embeddings v5</strong>, <strong>Elastic Open Web Crawler</strong>, and <strong>Elastic Agent Builder</strong>.</p><p>By the end, you’ll have a working agent that can search your crawled pages, cite relevant passages, and answer questions grounded in your content, no custom chunking or embedding pipeline required.</p><p>In this guide, you’ll:</p><ol><li>Start an <strong>Elasticsearch Serverless</strong> project.</li><li>Create an index using the new <code>semantic_text</code> field powered by <strong>Jina Embeddings v5</strong>.</li><li>Crawl any website using <strong><a href="https://github.com/ugosan/elastic-crawler-control">Elastic Crawler Control (a.k.a. Crawly)</a></strong> (an open source UI + API wrapper around the <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html">Elastic Open Web Crawler</a>).</li><li>Chat with that data using the <strong>Elastic Agent Builder</strong> in Kibana.</li></ol><p><strong>What you’ll walk away with:</strong></p><ul><li>A repeatable pattern you can point at <em>any</em> website/docs source.</li><li>Chat that stays grounded in your content.</li></ul><h2>Prerequisites</h2><ul><li>An <strong>Elasticsearch Serverless (Search)</strong> project + an <strong>API key</strong> with write permissions.</li><li><strong>Docker + Docker Compose</strong> (to run the crawler UI).</li><li><code>git</code> (to clone the repo).</li></ul><h2>1. Start an Elasticsearch Serverless project</h2><p>First, we need a serverless project to host our data.</p><p>1. Log in to your <a href="https://cloud.elastic.co">Elastic Cloud Console</a>.</p><p>2. Click <strong>Create project</strong>.</p><p>3. Select <strong>Search</strong> as the project type. (This type is optimized for vector search and retrieval.)</p><p>4. Give it a name (for example, <code>es-labs-jina-guide</code>), and click <strong>Create</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6a138b12f85727872cd8105d954b5ae95f9af428-1600x1135.png" style="width: 100%;" /><p>5. <strong>Important:</strong> Save the <strong>Elasticsearch endpoint</strong> and <strong>API Key</strong> provided when the project is created. You’ll need these for the crawler.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7c43c7f91d141658dbaff852f452d64853d7ddb8-1536x298.png" style="width: 100%;" /><h2>2. Create the index</h2><p>Elasticsearch Serverless supports <code>semantic_text</code>, which handles chunking and embedding generation automatically. We’ll use the <code>.jina-embeddings-v5-text-small</code> model that’s hosted on GPUs on the <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service</a>.</p><p>Create the index with the <code>semantic_text</code> fields. This tells Elasticsearch to automatically vectorize content indexed into those fields using the EIS-hosted endpoint referenced by each field’s <code>inference_id</code>.</p><p>In <strong>Kibana Dev Tools</strong>, run:</p><pre><code>PUT furnirem-website
{
  "mappings": {
    "_meta": {
      "description": "Each document represents a web page with the following schema: 'title' and 'meta_description' provide high-level summaries; 'body' contains the full text content; 'headings' preserves the page hierarchy for semantic weighting. URL metadata is decomposed into 'url_host', 'url_path', and 'url_path_dir1/2/3' to allow for granular filtering by site section (e.g., 'blog' or 'tutorials'). 'links' contains extracted outbound URLs for discovery. Crawl timestamp: 2026-01-26T12:54:16.347907."
    },
    "properties": {
      "body_content": { 
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small" 
          }
        }
      },
      "headings": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small"
          }
        }
      }
    }
  }
}</code></pre><h2>3. Run the Elastic Open Crawler</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6d5796c8c3752d25f0c64f905570812f16332f1a-640x398.gif" style="width: 100%;" /><p><strong>Crawly</strong> is one example of an application built around the functionality that the Open Web Crawler provides.</p><p>The application wraps the Elastic Open Crawler in a FastAPI service that manages crawler processes and persists execution data. A React front end provides the interface for configuring and monitoring crawls.</p><p>Under the hood, the crawler service (see <a href="https://github.com/ugosan/elastic-crawler-control/blob/main/crawler-service/app/crawler.py">crawler.py</a>) spawns JRuby processes via <code>subprocess.Popen</code>, allowing multiple concurrent crawls. Each execution&#x27;s configuration, status, and logs are persisted to disk (for now).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c20b5055d57f6fe2db4eada6c737d95b19a721f4-1250x960.png" style="width: 100%;" /><p>Clone the repository:</p><pre><code>git clone https://github.com/ugosan/elastic-crawler-control</code></pre><p>Create an <strong>env.local</strong> file with your Elasticsearch credentials:</p><pre><code>ES_URL=https://your-elasticsearch-endpoint.es.cloud
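# Values are placeholders: use the endpoint and the write-permission API key saved in step 1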
ES_API_KEY=your_api_key_here</code></pre><p>Start the services:</p><pre><code>docker-compose up</code></pre><p>Access the UI at <strong>http://localhost:16700</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3858d3764fc7fd5ac217545dfba94c83a0096fb2-1600x820.png" style="width: 100%;" /><p>You don’t necessarily need <code>seed_urls </code>unless you want to be specific, so your config can be as simple as below:</p><pre><code>{
  "domains": [
    {
      "url": "https://furnirem.com"
    }
  ],
  "max_crawl_depth": 3,
  "max_unique_url_count": 500,
  "output_index": "furnirem-website"
}</code></pre><p>From there, you can start a crawl on any website and check its progress:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f98187d216397973ccc786226466ff8b931118f0-1600x443.png" style="width: 100%;" /><p>Once it&#x27;s finished, we’re ready to query the content in Elasticsearch directly or use the pages you just crawled for chatting with the website in Agent Builder.</p><h2>4. Chat with data in Kibana</h2><p>Now that the data is indexed and vectorized, we can start chatting with it using the <strong>Elastic Agent Builder</strong>.</p><ol><li>Open <strong>Kibana</strong>, and navigate to <strong>Agents</strong> (under the &quot;Search&quot; section).</li><li><strong>Test the agent</strong>:<ul><li>In the chat window, ask a question, like <em>&quot;What is the difference between sparse and dense vectors?&quot;</em></li></ul></li></ol><p>The agent will search your Jina-embedded data, retrieve the most relevant snippets from the pages you crawled, and generate an answer.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eebb72090181c5b18dadcc3751f3084857fe1dac-1600x1363.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e84a862c6051fa7eda507d4470678ec5dd78eb1-1560x1174.png" style="width: 100%;" /><p>You can also chat with the data directly via the Kibana API:</p><pre><code>POST kbn://api/agent_builder/converse/async
{
  "input": "What is the difference between sparse and dense vectors?",
  "agent_id": "elastic-ai-agent",
  "conversation_id": "<CONVERSATION_ID>"
}</code></pre><p>Use <code>conversation_id</code> to resume an existing conversation with an agent in Elastic Agent Builder. If you don’t provide it on the initial request, the API starts a new conversation and returns a newly generated ID in the streaming response.</p><h2>Summary</h2><p>You now have a working “<strong>chat with your website</strong>” stack: Your site gets crawled, indexed, auto-embedded with <code>semantic_text</code> + Jina v5, and surfaced through an agent in Kibana that answers questions grounded in your pages.</p><p>From here, you can point the same setup at docs, support content, or internal wikis and iterate on relevance in minutes.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-embeddings-v5-text: Compact state-of-the-art text embeddings for search and intelligent applications]]></title>
            <description><![CDATA[Introducing jina-embeddings-v5-text models, including jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, and explaining how to use these multilingual embedding models via Elastic Inference Service (EIS).]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens, Sofia Vasileva]]></dc:creator>
            <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e98dd30ab925b4bb32830228d71a1a51d02a0917-1600x840.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Jina AI and Elastic are releasing <code>jina-embeddings-v5-text</code>, a family of new, high-performance, compact text embedding models with state-of-the-art performance for models of comparable size across all major task types.</p><p>The family includes two models:</p><ul><li><code>jina-embeddings-v5-text-small</code></li><li><code>jina-embeddings-v5-text-nano</code></li></ul><p>These models are the successful result of an innovative new training recipe for embedding models. They both outperform models many times their size, creating savings in memory and computing resources and responding faster to requests.</p><p>The <code>jina-embeddings-v5-text-small</code> model has 677M parameters, supports a 32768-token input context window, and produces 1024-dimension embeddings by default.</p><p><code>jina-embeddings-v5-text-nano</code> weighs in at roughly a third of its sibling&#x27;s size, with 239M parameters and an 8192-token input context window, yielding slender 768-dimension embeddings.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model name</td><td style="border: 1px solid #ccc; padding: 4px;">Total size</td><td style="border: 1px solid #ccc; padding: 4px;">Input context window size</td><td style="border: 1px solid #ccc; padding: 4px;">Embedding size</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-embeddings-v5-text-small</td><td style="border: 1px solid #ccc; padding: 4px;">677M params</td><td style="border: 1px solid #ccc; padding: 4px;">32768 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">1024 dims</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-embeddings-v5-text-nano</td><td style="border: 1px solid #ccc; padding: 4px;">239M params</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">768 dims</td>
                    </tr>
              </tbody>
            </table>
          <p>These two models are the best in class for overall MMTEB (<a href="https://huggingface.co/spaces/mteb/leaderboard">Multilingual MTEB</a>) benchmark performance. Among models with under 500M parameters, <code>jina-embeddings-v5-text-nano</code> is the top performer, despite having less than 250M parameters, and the <code>jina-embeddings-v5-text-small</code> model is the leader among multilingual embedding models with under 750M parameters.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d97126285fa196c3045b1913b7754e4b186c1e4c-1300x1100.png" style="width: 100%;" /><p>These models are available via Elastic Inference Service (EIS), via an online API, and for local hosting. For instructions on how to access <code>jina-embeddings-v5-text</code> models, see the “<strong>Getting started</strong>” section, below.</p><p>Embedding models and semantic indexing dramatically increase the accuracy of search algorithms but also have a variety of other uses for tasks involving semantic similarity and meaning extraction, for example:</p><ul><li>Finding duplicate texts.</li><li>Recognizing paraphrases and translations.</li><li>Topic discovery.</li><li>Recommendation engines.</li><li>Sentiment and intention analysis.</li><li>Spam filtering.</li><li>And many others.</li></ul><h2><strong>Features</strong></h2><p>This new model family has a number of features designed to improve relevance and reduce costs.</p><h3>Task optimization</h3><p>We’ve optimized the <code>jina-embeddings-v5-text</code> models for four broad task types:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Example use cases</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Searching with natural language queries and retrieving the most relevant matches in a collection of documents.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Text matching</td><td style="border: 1px solid #ccc; padding: 4px;">Semantic similarity, deduplication, paraphrase and translation alignment, and more.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Clustering</td><td style="border: 1px solid #ccc; padding: 4px;">Topic discovery, automatic organization of document collections.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Classification</td><td style="border: 1px solid #ccc; padding: 4px;">Document categorization, sentiment and intent detection, similar tasks.</td>
                    </tr>
              </tbody>
            </table>
          <p>Optimizing for one task usually means having to compromise on another, so most embedding models only have competitive performance for one kind of task. But <code>jina-embeddings-v5-text</code> models are able to specialize in all four areas without compromising by training task-specific <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (LoRA) adapters</a>.</p><p>LoRA adapters are a kind of plugin for an AI model that changes its behavior dramatically while only adding slightly to the total size. Instead of having an entire model for each task, each one with hundreds of millions of parameters, the <code>jina-embeddings-v5-text</code> model family lets you use just one model with a compact LoRA adapter for each task. This saves memory, storage space, and inference costs.</p><h3>Truncating embeddings</h3><p>We’ve trained the <code>jina-embeddings-v5-text</code> models using <a href="https://arxiv.org/abs/2205.13147">Matryoshka Representation Learning</a>, which lets you cut your embeddings down to smaller sizes at a minimal cost to their quality.</p><p>By default, <code>jina-embeddings-v5-text-small</code> generates 1024-dimension embedding vectors, each represented by a 16-bit number, making every embedding 2KB in size. For a large collection of documents, this can be a lot of data to store, and the cost of searching a vector database full of embeddings is proportional both to the size of the database and to the number of dimensions each stored vector has.</p><p>But you can just halve the size of the embeddings (throw away 512 of the 1024 dimensions) and take up half the space while doubling search speeds. This has an impact on performance. Throwing away information reduces precision. But as the graph below shows, even getting rid of half of the embedding only reduces performance slightly:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f11969682c81bcf22b5b29a6204ba6ec40471b3d-900x700.png" style="width: 100%;" /><p>As long as your embeddings are at least 256 dimensions, the loss in precision should remain fairly small. Below that level, however, relevance and accuracy deteriorate quickly.</p><p>Truncating embeddings like this empowers users to manage their own trade-offs between accuracy and computing costs. It gives you the tools to get big efficiency gains and large cost savings out of your search AI.</p><h3>Robust quantization</h3><p><em>Quantization </em>is another way of reducing the size of embeddings. Instead of throwing away part of each embedding, quantization reduces the precision of the numbers in the embedding. The <code>jina-embeddings-v5-text</code> models generate embeddings with 16-bit numbers, but we can round those numbers off, reducing their precision and the number of bits needed to store them. In the most extreme case, we can reduce each number to one bit (0 or 1), compressing <code>jina-embeddings-v5-text</code>’s default 1024-dimension embeddings from 2 kilobytes to 128 bytes, a 94% reduction from binary quantization alone. 
However, also like truncation, quantization makes embeddings less accurate.</p><p>We’ve trained the <code>jina-embeddings-v5-text</code> models to work with Elasticsearch’s <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/bbq">Better Binary Quantization</a> by minimizing that loss of accuracy, and benchmark tests of binarized embeddings from these models show performance almost equal to their non-binarized equivalents. Consult <a href="https://arxiv.org/abs/2602.15547">the technical report</a> for detailed ablation studies of binarization performance.</p><h3>Multilingual performance</h3><p>Many embedding models are multilingual because they’ve been trained on materials that include large numbers of languages. But that doesn’t mean that they all perform equally well in all supported languages.</p><p>We identified 211 languages in the MMTEB multilingual benchmark and separated them so we could compare our models to similar models on a language-by-language basis. The image below summarizes our results as a heat map. Each patch is a language (identified by its ISO-639 code), and the greener it is, the better the model performed compared to the average of similar models:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852ac5d0f1977bb0c1124d87f8863a9bb94eb7da-1600x765.png" style="width: 100%;" /><p>Although accuracy varies between languages, the <code>jina-embeddings-v5-text</code> models are state-of-the-art or nearly so across most of the world’s languages.</p><p>For details about multilingual performance, see the <a href="https://arxiv.org/abs/2602.15547"><code>jina-embeddings-v5-text</code> technical report</a>.</p><h2><strong>Jina in Elastic: State-of-the-art native AI for search</strong></h2><p>With <code>jina-embeddings-v5-text</code> models on EIS, you can run high-performance multilingual embedding models natively in Elasticsearch with fully managed, GPU-accelerated inference and no infrastructure to provision or scale. <code>jina-embeddings-v5-text</code> models extend the growing EIS model catalog with compact, multilingual models powered by the latest developments in AI. These models have state-of-the-art performance on information retrieval and standard data analysis benchmarks, and they offer unequaled, globe-spanning multilingual support.</p><p>With two models of vastly different sizes, users can determine which one is best suited for their applications and budgets. Furthermore, with robust embeddings that remain performant when truncated to smaller sizes or quantized to lower precision, <code>jina-embeddings-v5-text </code>models provide opportunities for further concrete savings in storage and computing costs as well as in processing latency.</p><p>With the <code>jina-embeddings-v5-text</code> family, Jina Reranker, and Elastic’s fast vector and BM25 search, users now have access to end-to-end, state-of-the-art <a href="https://www.elastic.co/docs/solutions/search/hybrid-search">hybrid search</a> from Elastic. 
When you need the most relevant results, whether for retrieval augmented generation (RAG) pipelines, search applications, or data analysis, Elastic with Jina search AI models provides solid and cost-effective quality.</p><h2><strong>Getting started</strong></h2><p>The <code>jina-embeddings-v5-text</code> models are fully integrated into <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">EIS</a>, and you can use them by setting the <strong><code>type</code></strong> field to <strong><code>semantic_text</code></strong> when creating your index and specifying the model (<code>.jina-embeddings-v5-text-small</code> or <code>.jina-embeddings-v5-text-nano</code>) in the <strong><code>inference_id</code></strong> field, as in this example:</p><pre><code>PUT multilingual-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v5-text-small"
      }
    }
  }
}
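
# Hedged sketch: embedding dimensions can also be truncated by creating a custom
# inference endpoint (see "Truncating embeddings"). The endpoint name here is
# made up, and the exact service/service_settings keys for EIS may differ;
# check the create inference API docs before relying on this.
PUT _inference/text_embedding/my-jina-small-512
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small",
    "dimensions": 512
  }
}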

# Ingest data about France
POST multilingual-semantic-index/_doc
{
  "content": "The capital of France is Paris"
}

GET multilingual-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "What is the French capital?"
    }
  }
}</code></pre><p>Elasticsearch automatically selects the appropriate LoRA adapter during indexing and retrieval. The embedding dimensions (see the “<strong>Truncating embeddings”</strong> section, above) can be set when <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">creating a custom inference endpoint</a>.</p><p>See the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector">Elasticsearch documentation</a> for more information on using <strong><code>jina-embeddings-v5-text</code> </strong>models.</p><h2><strong>More information</strong></h2><p>To learn more about <code>jina-embeddings-v5-text</code> models, read the <a href="https://jina.ai/news/jina-embeddings-v5-text-distilling-4b-quality-into-sub-1b-multilingual-embeddings/">release notes on the Jina AI blog</a> and the <a href="https://arxiv.org/abs/2602.15547">technical report</a>, with more detailed technical information about performance and Jina AI’s innovative new training procedure. For information about downloading and running these models locally, visit the <a href="https://huggingface.co/collections/jinaai/jina-embeddings-v5-text"><code>jina-embeddings-v5-text</code> collection page</a> on Hugging Face.</p><p>Jina AI models are available under a <a href="https://spdx.org/licenses/CC-BY-NC-4.0">CC-BY-NC-4.0 license</a>, so you are free to download them and try them out, but for commercial use, please contact <a href="https://www.elastic.co/contact?pg=global&amp;plcmt=nav&amp;cta=205352">Elastic sales</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ensuring semantic precision with minimum score]]></title>
            <description><![CDATA[Improve semantic precision by employing minimum score thresholds. The article includes concrete examples for semantic and hybrid search. ]]></description>
            <link>https://www.elastic.co/search-labs/blog/semantic-precision-minimum-score</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/semantic-precision-minimum-score</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mattias Brunnert]]></dc:creator>
            <pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8b3b5910abfe16d48d309341a0027008b16c4340-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Semantic search has opened up a world of opportunities for search relevance. High-quality sparse and dense models, such as ELSER, E5, and Jina Embeddings v4, return relevant results based on the meaning of words, rather than keyword matching. However, semantic search sometimes returns irrelevant results at the tail end or for queries that lack relevant results in the index. This property of sparse and dense models can confuse users or waste precious tokens for large language models (LLMs).</p><p>In this article, you’ll learn how you can use the minimum score parameter to increase the precision of your semantic search results. If you’d like to test the examples provided in this blog post, go to <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/ensuring-semantic-precision-with-minimum-score/ensuring_semantic_precision_with_minimum_score.ipynb">the associated Jupyter notebook</a>.</p><h2>Background: Precision and recall</h2><p>In search relevance, <em>precision </em>and <em>recall </em>are key concepts. Any reader not already familiar is highly encouraged to read up on them. Following is a summary.</p><ul><li><strong>Precision: </strong>The fraction of returned search results that are relevant to the user.</li><li><strong>Recall: </strong>The fraction of all relevant documents in the corpus that are included in the search result set.</li></ul><p>Or, in other words, precision is returning <strong>only </strong>relevant results, and recall is returning <strong>all </strong>relevant results. As you can imagine, these are often competing requirements. Semantic search tends to have very high recall but can struggle with precision. Keep reading to learn how to get around this property.</p><h2>Introducing the minimum score parameter</h2><p>The <code>min_score</code> parameter allows us to improve precision by setting a minimum score, which truncates the result set by removing any matches with a score less than the defined threshold. Following is a simple example:</p><pre><code>GET search-movies/_search
{
  "retriever": {
    "linear": {
      "min_score": 4,
      "retrievers": [
        ...
      ]
    }
  }
}</code></pre><h2>Normalizing the score</h2><p>Setting a minimum score is all well and good; however, not all semantic models return a score suitable for a static threshold. ELSER, for example, returns a score that is unbounded. <a href="https://huggingface.co/intfloat/e5-small#faq">Some</a> dense model scores are densely clustered and only make sense in the context of the specific query.</p><p>For most semantic search cases, we recommend using a normalization approach before applying the <code>min_score</code>. The normalization ensures that the document score is within a defined interval. Elasticsearch retrievers provide two such <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/linear-retriever#linear-retriever-normalizers">normalizers</a>, <code>l2_norm</code> and <code>minmax</code>. The most commonly used is <code>minmax</code>, since it’s easy to understand and works well in many scenarios. Key properties of <code>minmax</code> include:</p><ul><li>Document scores are distributed between 0 and 1.</li><li>The highest scoring document is always scored as 1.</li><li>The lowest scoring document is always scored as 0.<ul><li>This can make it less suitable for keyword search. See the “Hybrid search” sections below for further discussion.</li></ul></li></ul><p>Following is an example of a normalized semantic query with <code>min_score</code>. The rank window size has been increased to 500 so that more candidates are considered and normalized, and <code>size</code> is set to 100 so we can return a longer list of search results.</p><pre><code>GET search-movies/_search
{
  "size": 100,
  "_source": [
    "title", "overview"
  ],
  "retriever": {
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><p>The size has been set to a higher value than normally seen in production. This is so we can inspect the quality of the search results and tune the <code>min_score</code> threshold.</p><h2>Hybrid search using the linear retriever</h2><p>For hybrid search, the simplest approach is to normalize all scores, assign weights, and apply a minimum score. Note that by choosing weights with a sum of 1, you keep the total score within a range of 0–1. This makes it easy to make sense of the final scores and tune <code>min_score</code>. Following is an example:</p><pre><code>GET search-movies/_search
{
  "size": 100,
  "_source": ["title", "overview","keywords"],
  "retriever": {
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "weight": 0.6,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        },
        {
          "weight": 0.4,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "multi_match": {
                  "query": "superhero movie",
                  "fields": ["overview","keywords", "title"],
                  "type": "cross_fields",
                  "minimum_should_match": "2"
                }
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><h2>Hybrid search using RRF</h2><p>With BM25, we often control precision through other means, such as using the <code>AND</code> operator or <code>minimum_should_match</code>. In addition, queries consisting of single, precise, and rare terms naturally produce result sets with few hits, often all of them highly relevant. This can lead to:</p><ul><li>Results further back in the result set getting assigned a low normalized score in the BM25 retriever, even if the absolute BM25 score is close to that of the top-scoring hits.</li><li>When adding a very low BM25 score to the semantic score, the total can be approximated as the semantic score.</li><li>The lack of BM25 score contribution can cause the document to be discarded by the <code>min_score</code> threshold.</li></ul><p>As a solution, we can instead use reciprocal rank fusion (RRF) to combine BM25 and semantic results. RRF gets around the challenge of comparing scores from different search algorithms by instead focusing on the position in each result set. In this scenario, the <code>min_score</code> is only applied to the semantic retriever.</p><pre><code>GET search-movies/_search
{
  "_source": ["title", "overview","keywords"],
  "retriever": {
    "rrf": {
      "rank_window_size": 500,
      "retrievers": [
        {
          "linear": {
            "rank_window_size": 500,
            "min_score": 0.25,
            "retrievers": [
              {
                "normalizer": "minmax",
                "retriever": {
                  "standard": {
                    "query": {
                      "semantic": {
                        "field": "overview_vector",
                        "query": "superhero movie"
                      }
                    }
                  }
                }
              }
            ]
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "superhero movie",
                "fields": ["overview", "keywords","title"],
                "type": "cross_fields",
                "minimum_should_match": "2"
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><h2>Conclusion</h2><p>By using <code>min_score</code>, we’ve shown how we can reduce the number of false positives in our result sets caused by the high recall of semantic search algorithms. To learn more about retrievers, please see this <a href="https://www.elastic.co/search-labs/blog/elasticsearch-retrievers">blog post</a> and the <a href="https://www.elastic.co/docs/solutions/search/retrievers-overview">Elasticsearch documentation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dependency management on Kubernetes]]></title>
            <description><![CDATA[How to streamline dependency management on Kubernetes using Renovate CLI and Argo Workflows.]]></description>
            <link>https://www.elastic.co/search-labs/blog/dependency-management-kubernetes</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/dependency-management-kubernetes</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Nikos Fotiou]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/00519840e6eec7101c1fb096afcae976ee0c454e-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>This is how we built a self-hosted dependency management platform using Kubernetes, Argo Workflows, Argo Events, and Renovate CLI to automate updates, quickly address Common Vulnerabilities and Exposures (CVEs), and efficiently propagate new package versions across thousands of repositories.</p><h2><strong>Dependency management at Elastic</strong></h2><p>At Elastic, we have to manage hundreds or even thousands of repositories, both private and public. When a critical CVE is discovered, we need immediate answers and actions: Which repositories are vulnerable? How quickly can we patch them? Apart from security, productivity questions also arise: How can we quickly propagate the release of a new package version across all the repositories that depend on it without spending too much time on manual tasks?</p><p>The initial trigger for exploring dependency management solutions was the need to establish a secure foundation with automated updates for <a href="https://www.elastic.co/blog/reducing-cves-in-elastic-container-images">reducing CVEs</a>. After carefully evaluating the options, we started with a self-hosted infrastructure, using our own Kubernetes cluster to run Mend Renovate Community Self-Hosted. The idea was to provide a dependency management platform that our users could access in a self-service manner.</p><p>The initial experiment was successful, so more and more teams onboarded onto our platform and used it in their repositories’ everyday lifecycle for updates and CVE patching. This happened so fast that we soon hit the ceiling of our self-hosted installation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e14d9f98d4eccaa08a335d5bd23d88e5debbb344-1600x1103.png" style="width: 100%;" /><h3><strong>The challenge: How can we scale a dependency management platform in a large organization with a significant number of repositories?</strong></h3><p>Our dependency management platform was processing one repository at a time, and the sequential processing model couldn’t keep up with the large number of repositories that we own. We had already identified the root issue: <strong>a single instance</strong> of our dependency management tool was expected to process our large and ever-growing list of repositories. Repositories waited in a queue, sometimes for many hours. More than 50% of our repositories were not even processed daily; in other words, they waited more than 24 hours between scans.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/45ade5bda08f82bed0b3d0d3736cbd6f056e7a4e-1312x816.jpg" style="width: 100%;" /><p>Large repositories created larger bottlenecks, due to their sizable codebases and their multiple open PRs. GitHub webhook events disrupted the sequence. Automerge became unreliable because scan timing was unpredictable. 
We had made a promise to our users about the frequency of scans, and we couldn’t fulfill it.</p><h3><strong>The decision to build in-house: Meeting Elastic&#x27;s unique scale and security needs</strong></h3><p>While we considered commercial options, including <strong>Mend&#x27;s Renovate Enterprise Self-Hosted edition</strong>, internally at Elastic we had a few key initiatives ramping up.</p><p>Our decision to build an in-house platform was driven by the recognition that only a deeply customized solution could meet Elastic&#x27;s specific, nonnegotiable requirements:</p><ol><li><strong>Investing in our internal developer platform:</strong> At the time, we had already started heavily investing in our internal developer platform. We were discussing and designing ways that each one of our services could fit into that. This meant that we wanted to test-drive our own rules and practices for our dependency management platform. On top of that, new guidelines were coming into play, and we wanted to design the platform ahead of them.</li><li><strong>Native integration and workflow customization:</strong> We required straightforward integration with our internal tooling and internal processes. For example, we wanted to centralize configuration as code with our Service Catalog (Backstage). We have specific needs around the usage of Backstage that we wanted to make our platform compatible with. So, although it would be possible to make use of the Renovate Self-Hosted APIs alongside our Backstage automation, this wouldn’t completely cover our internal processes.</li><li><strong>Elastic-specific defense-in-depth security:</strong> Our stringent security compliance required bespoke security mechanisms tailored to our ecosystem. We were working to <a href="https://entro.security/blog/how-elastic-scaled-secrets-nhi-security-elastics-playbook-from-visibility-to-automation/">harden our usage of “non-human identities.”</a> This hardening meant that we authenticate to GitHub through nonstandard means that an off-the-shelf tool wouldn’t support. Our workflow included implementing a parent-child workflow secret encryption pattern and using transient, single-use GitHub tokens. Building in-house was the only practical way to embed these unique security layers and minimize the attack surface across our complex multicloud environment.</li></ol><h2><strong>The solution: Workflow orchestration for dependency management</strong></h2><p>Our solution started from the decision to build on the dependency management tool we already used rather than replace it. It had shown its potential, and its flexibility matters for the varied needs across our organization. We considered different approaches, and what helped us make up our minds was the broad and sometimes specialized set of needs we have to cover. We decided to build a reliable and scalable dependency management platform where each repository is processed on its own, removing bottlenecks and setting us up for growth.</p><p>We designed the platform abiding by three core principles:</p><h3><strong>1. Parallel processing</strong></h3><p>Every repository gets its own dependency management processing environment. No more queues. Our concurrency is only limited by the number of resources we spend. We have also applied smart distributed scheduling to avoid getting rate limited by GitHub.</p><h3><strong>2. 
Self-serviceable</strong></h3><p>We use our Service Catalog (Backstage) to automatically onboard and manage any new repository. We use our own resource definition to give the end user the option to select how often a repository will be processed, how many resources they want to allocate to their schedules, and whether they want to turn processing off or back on for any reason. We plan to add more options that way as our users’ needs evolve and they get more fluent with the new installation.</p><h3><strong>3. Reduced secret scope and namespace isolation</strong></h3><p>For increased security, we supply our dependency management pods with ephemeral GitHub tokens that are generated at the start of each workflow. On top of that, we isolate our workloads in specific namespaces so they can be provided only the necessary secrets. We control what secrets can be accessed by each dependency management workflow using Kubernetes RBAC. We also use encryption to propagate the GitHub token from the parent to the child workflows.</p><p>We rebuilt our platform on Kubernetes: Argo Workflows powers the logic of our processes, and Renovate CLI is set up to scan and process one repository at a time.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5560ed20e2bd9ecdd574a9c835126d12b24c332f-1600x1157.png" style="width: 100%;" /><p><strong>The beauty:</strong> We’re using battle-tested open source projects in an original way, providing new working examples for all of those projects and, at the same time, amplifying development velocity and consolidating CVE reduction for our teams.</p><h2><strong>Dependency management architecture: Four microservices</strong></h2><p>The platform comprises four custom-built components:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2b3d4046c05bb261e45d40c59f864eb51fb9eaa9-1217x1600.png" style="width: 100%;" /><h3><strong>Workflows Operator (Go/Kubebuilder)</strong></h3><p>A Kubernetes operator managing workflow lifecycle through three Custom Resource Definitions (CRDs):</p><ul><li><strong>RepoConfig CRD:</strong> Single source of truth for repository configuration.</li></ul><p>This is how RepoConfig is defined in the operator:</p><pre><code>// RepoConfig is the Schema for the repoconfigs API
type RepoConfig struct {
	metav1.TypeMeta `json:",inline"`

	// metadata is a standard object metadata
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty,omitzero"`

	// spec defines the desired state of RepoConfig
	// +required
	Spec RepoConfigSpec `json:"spec"`

	// status defines the observed state of RepoConfig
	// +optional
	Status RepoConfigStatus `json:"status,omitempty,omitzero"`
}</code></pre><p>And this is what an instance of RepoConfig would look like:</p><pre><code>apiVersion: workflows.elastic.co/v1
kind: RepoConfig
metadata:
  generation: 3
  name: elastic-test-repo
  namespace: dependency-management-operator
spec:
  owner: group:my-team
  renovate:
    config:
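      # resourceGroup selects a CPU/memory preset: SMALL | MEDIUM | LARGE
      # runFrequency controls how often the repo is scanned (a cron expression also works)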
      resourceGroup: SMALL
      runFrequency: 4h
    enabled: true
  repository: elastic/test-repo</code></pre><ul><li><strong>Parent CRD:</strong> Manages CronWorkflows for scheduled scans.</li></ul><p>Inside the reconciliation loop of the parent controller, we make sure that workflow settings are created and kept up to date or even deleted if needed.</p><p>First, it gets some globally configured settings for workflows:</p><pre><code>func (r *ParentReconciler) reconcileSubResources(ctx context.Context, req ctrl.Request, parent *workflowsv1.Parent) error {
	logger := logf.FromContext(ctx)
	logger.Info("Reconcile SubResources for Parent", "name", req.NamespacedName)
	wfSet := workflowsettings.WorkflowSettings{
		RunFrequency:   parent.Spec.RunFrequency,
		ResourceGroups: "parent",
	}</code></pre><p>It then makes sure a mutex configmap is up to date, which prevents workflows for the same repository from running concurrently:</p><pre><code>	cfMngr := resources.NewConfigMapManager(r.Client, r.Scheme, r.OperatorConfig.ParentNamespace)
	err := cfMngr.CreateOrUpdateSyncMutexConfigmap(ctx, fmt.Sprintf("%s%s", r.OperatorConfig.ResourcesPrefix, r.OperatorConfig.SyncMutexCfgMapName), strings.TrimPrefix(parent.Spec.Repository, "elastic/"), r.OperatorConfig.SemaphoreConcurrencyLimit)</code></pre><p>Then it creates a Workflow Manager, the struct that creates or updates the CronWorkflows and the WorkflowTemplates:</p><pre><code>	wfMngr := resources.NewArgoWorkflowManager(r.Client,
		r.Scheme,
		curateResourceName(
			strings.ReplaceAll(parent.Spec.Repository, "/", "-"),
		),
		parent.Namespace,
		"parent-workflow",
		false).
		WithOrganization(r.OperatorConfig.GitHubOrg).
		WithRepoName(parent.Spec.Repository).
		Init(true, true).
		WithPrefix(r.OperatorConfig.ResourcesPrefix).
		WithWfTemplateName(r.OperatorConfig.ParentWorkflowTemplate).
		WithResources(wfSet.GetResourceCategory()).
		WithSchedule(wfSet.GetCronSchedule()).
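		// wfSet carries parent.Spec.RunFrequency, so the CronWorkflow schedule follows the user-selected runFrequency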
		WithImagePullSecrets([]corev1.LocalObjectReference{{
			Name: r.OperatorConfig.WorkflowImagePullSecrets,
		}}).
		AddArgument(true, true, "extra_cli_args").
		SetArgument(true, false, "extra_cli_args", "none").
		AddTemplate(resources.NewParentDAGTemplateInstance()).
		AddTemplate(resources.NewWorkflowsTemplateInstance("check-child-workflows", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddTemplate(resources.NewWorkflowsTemplateInstance("security", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddTemplate(resources.NewWorkflowsTemplateInstance("submit-child-workflow", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector))
	wfMngr.OverWriteCommand("submit-child-workflow", r.OperatorConfig.ChildNamespace)
	wfMngr.OverwriteWfTemplateName("parent-wftmpl")
	wfMngr.AddSynchronization(fmt.Sprintf("%s%s", r.OperatorConfig.ResourcesPrefix, r.OperatorConfig.SyncMutexCfgMapName), "{{workflow.parameters.repo_name}}")
	err = wfMngr.CreateOrUpdateCronWorkflow(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update cron workflow: %w", err)
	}
	err = wfMngr.CreateOrUpdateWorkflowTemplate(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update workflow template: %w", err)
	}
	return nil</code></pre><ul><li><strong>Child CRD:</strong> Manages WorkflowTemplates with per-repository resources.</li></ul><p>The child controller has a similar reconciliation duty to the parent, but this time it’s responsible for workflow templates in the child namespace that will be triggered by the parent workflows.</p><pre><code>func (r *ChildReconciler) reconcileSubResources(ctx context.Context, req ctrl.Request, child *workflowsv1.Child) error {
	logger := logf.FromContext(ctx)
	logger.Info("Reconcile SubResources for Child", "name", req.NamespacedName)
	wfSet := workflowsettings.WorkflowSettings{
		ResourceGroups: child.Spec.ResourceCategory,
	}
	wfMngr := resources.NewArgoWorkflowManager(r.Client,
		r.Scheme,
		curateResourceName(
			strings.ReplaceAll(child.Spec.Repository, "/", "-"),
		),
		child.Namespace,
		"runner",
		true).
		Init(false, true). // only manage workflow template
		WithPrefix(r.OperatorConfig.ResourcesPrefix).
		WithSuffix("-child-wftmpl").
		WithRepoName(child.Spec.Repository).
		WithOrganization(r.OperatorConfig.GitHubOrg).
		WithResources(wfSet.GetResourceCategory()). // will override resources of presets if set
		WithImagePullSecrets([]corev1.LocalObjectReference{{
			Name: r.OperatorConfig.WorkflowImagePullSecrets,
		}}).
		AddTemplate(resources.NewWorkflowsTemplateInstance("runner", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddArgument(false, true, "repo_full_name").
		AddArgument(false, true, "repo_name").
		AddArgument(false, true, "encrypted_token").
		AddArgument(false, true, "extra_cli_args")
	wfMngr.OverWriteCommand("runner", r.OperatorConfig.ChildNamespace)
	err := wfMngr.CreateOrUpdateWorkflowTemplate(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update workflow template: %w", err)
	}
	return nil
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ac70492a1266ba02cb8afbafc5a486cb38a0edc-1600x1290.png" style="width: 100%;" /><p>The multi-controller pattern provides clear separation: RepoConfig Controller handles onboarding/offboarding, Parent Controller manages scheduling, and Child Controller handles execution templates.</p><h3><strong>GitHub Events Gateway (Go)</strong></h3><p>A secure webhook proxy that receives GitHub webhooks, verifies signatures, filters by organization/repository, and routes to Argo Events. We built 10 distinct sensors responding to dependency dashboard interactions, PR events, and package updates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4828625456cbd6efa8020a20f10d23f294f98a02-1306x1600.png" style="width: 100%;" /><p>This gateway enables integration with GitHub Apps by:</p><ul><li>Verifying incoming GitHub webhook signatures for security.</li><li>Forwarding valid events to the Argo Events EventSource with all relevant headers and authentication.</li><li>Configuring an authSecret on the EventSource and providing it as a Bearer header in forwarded requests.</li><li>Providing logging, metrics, and retry logic.</li></ul><p>It performs various validations on each GitHub event request.</p><p>It makes sure some HTTP attributes are present:</p><pre><code>// ValidateRequestMethod checks if the request method is POST.
func ValidateRequestMethod(r *http.Request) error {
	if r.Method != http.MethodPost {
		return fmt.Errorf("method not allowed, only POST is accepted")
	}
	return nil
}

// ValidateRequiredHeaders checks for required GitHub headers.
func ValidateRequiredHeaders(r *http.Request) error {
	eventType := r.Header.Get("X-GitHub-Event")
	deliveryID := r.Header.Get("X-GitHub-Delivery")
	signature := r.Header.Get("X-Hub-Signature-256")
	if eventType == "" || deliveryID == "" || signature == "" {
		return fmt.Errorf("missing required GitHub headers")
	}
	return nil
}
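
// A hedged illustration, not the actual gateway code: the checks defined in
// this post can be chained at the top of the webhook handler, failing fast
// on the first validation error.
func validateRequest(r *http.Request) error {
	if err := ValidateRequestMethod(r); err != nil {
		return err
	}
	if err := ValidateRequiredHeaders(r); err != nil {
		return err
	}
	return ValidateUserAgent(r)
}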

// ValidateUserAgent checks that the User-Agent header starts with GitHub-Hookshot/
func ValidateUserAgent(r *http.Request) error {
	userAgent := r.Header.Get("User-Agent")
	if !strings.HasPrefix(userAgent, "GitHub-Hookshot/") {
		return fmt.Errorf("invalid User-Agent")
	}
	return nil
}</code></pre><p>It also validates the signature of each request and its organization:</p><pre><code>// ValidateSignature verifies the GitHub webhook signature.
func ValidateSignature(r *http.Request, secret string) ([]byte, error) {
	payload, err := GitHub.ValidatePayload(r, []byte(secret))
	if err != nil {
		return nil, fmt.Errorf("invalid GitHub signature: %w", err)
	}
	return payload, nil
}

// ValidateAllowedOwner checks if the organization login is in the allowed organizations list.
func ValidateAllowedOwner(payload []byte, allowedGitHubOrganizations []string) (string, error) {
	var orgLogin string
	var payloadMap map[string]any
	if err := json.Unmarshal(payload, &payloadMap); err == nil {
		if orgObj, ok := payloadMap["organization"].(map[string]any); ok {
			if login, ok := orgObj["login"].(string); ok {
				orgLogin = login
			} else if name, ok := orgObj["name"].(string); ok {
				orgLogin = name
			}
		}
	}
	if !slices.Contains(allowedGitHubOrganizations, orgLogin) {
		return orgLogin, fmt.Errorf("organization login not allowed")
	}
	return orgLogin, nil
}</code></pre><p>Finally, it routes to Argo Events based on event type:</p><pre><code>	// Map eventType to Argo `EventSource` path
	var endpoint string
	switch eventType {
	case "push":
		endpoint = "/push"
	case "issues":
		endpoint = "/issues"
	case "pull_request":
		endpoint = "/pull-requests"
	default:
		slog.Info("Ignoring unhandled event type", "event_type", eventType, "delivery_id", deliveryID)
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
		return
	}
	forwardURL := h.config.ArgoEventSourceForwardURL + endpoint</code></pre><p>On the Argo Events side of things, 10 sensors watch the Argo Events EventBus for new events:</p><pre><code>apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: {{ .Values.sensors.packageUpdateOnDefaultBranch.name }}
  namespace: {{ .Release.Namespace }}
spec:
  eventBusName: {{ .Values.eventBus.name }}</code></pre><p>Then the script applies each sensor’s logic:</p><pre><code>script: |
          local e = event
          if not e or not e.body or not e.body.repository then
            return false
          end

          -- e.g., "refs/heads/main"
          local ref = e.body.ref
          local default_branch = e.body.repository.default_branch
          if not ref or not default_branch then
            return false
          end

          local expected = "refs/heads/" .. default_branch
          if ref ~= expected then
            return false
          end

        {{- if .Values.sensors.packageUpdateOnDefaultBranch.packageFiles }}
          patterns = { {{- range $i, $f := .Values.sensors.packageUpdateOnDefaultBranch.packageFiles }}{{ if $i }}, {{ end }}"{{ $f }}"{{- end }} }
        {{- end }}

          local function anyMatch(path)
            if type(path) ~= "string" then return false end
            for _, pat in ipairs(patterns) do
              -- match filename at repo root, or anywhere under subdirs
              if path:match(pat) or path:match(".+/" .. pat) then
                return true
              end
            end
            return false
          end

          local function filesContainPackage(paths)
            if type(paths) ~= "table" then return false end
            for _, p in ipairs(paths) do
              if anyMatch(p) then return true end
            end
            return false
          end

          -- Inspect all commits (GitHub includes added/modified/removed lists)
          local commits = e.body.commits
          if type(commits) ~= "table" then
            -- Fallback: some payloads include only head_commit
            commits = {}
            if type(e.body.head_commit) == "table" then
              table.insert(commits, e.body.head_commit)
            end
          end

          for _, c in ipairs(commits) do
            if filesContainPackage(c.added) or filesContainPackage(c.modified) or filesContainPackage(c.removed) then
              return true
            end
          end

          return false</code></pre><h3><strong>Backstage Syncer (Go)</strong></h3><p>This polls our Service Catalog (Backstage) for Repository Real Resource Entities, transforms them into RepoConfig CRDs, and keeps the platform in sync with configuration changes. Changes apply within three minutes.</p><pre><code>repoMap := make(map[string]map[string]interface{})
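			// Iterate over the Backstage entities, keep only GitHub repositories, and
			// collect each repository's renovate settings (with defaults) into repoMap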
			for i := range entities {
				entity := &entities[i]
				if entity.Spec.Type != "GitHub-repository" {
					continue
				}

				implRaw, err := json.Marshal(entity.Spec.Implementation)
				if err != nil {
					logger.Error("Failed to marshal implementation", "error", err)
					continue
				}

				var implMap map[string]interface{}
				err = json.Unmarshal(implRaw, &implMap)
				if err != nil {
					logger.Error("Failed to unmarshal implementation map", "error", err)
					continue
				}
				var repoName string
				if specMap, ok := implMap["spec"].(map[string]interface{}); ok {
					if repo, ok := specMap["repository"].(string); ok {
						repoName = repo
					}
				}
				if repoName == "" {
					continue
				}

				var workflowsRaw []byte
				if v, ok := implMap["spec"].(map[string]interface{}); ok {
					if r, ok := v["renovate"]; ok {
						workflowsRaw, _ = json.Marshal(r)
					} else {
						workflowsRaw = []byte(`{}`)
					}
				} else {
					workflowsRaw = []byte(`{}`)
				}

				var workflowsWithDefaults schema.WorkflowsMetadata
				err = json.Unmarshal(workflowsRaw, &workflowsWithDefaults)
				if err != nil {
					logger.Error("Failed to unmarshal workflows config", "error", err)
					continue
				}

				workflowsMap := map[string]interface{}{
					"enabled":        workflowsWithDefaults.Enabled,
					"require_pr":     workflowsWithDefaults.RequirePr,
					"resource_group": string(workflowsWithDefaults.ResourceGroup),
					"run_frequency":  string(workflowsWithDefaults.RunFrequency),
				}
				repoMap[repoName] = map[string]interface{}{
					"renovate": workflowsMap,
					"owner":    entity.Spec.Owner,
				}
			}
			logger.Info("Fetched GitHub Repository data from Backstage", "repository_count", len(repoMap), "status_code", resp.StatusCode)</code></pre><p>Finally, it writes that data into RepoConfig instances.</p><h3><strong>Workflows base (Mixed: JavaScript, Go, Helm)</strong></h3><p>The foundation layer contains Helm charts, JavaScript configs, a Go wrapper for Renovate CLI with encryption support, and a custom APK Indexer for Alpine packages.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/908d19278face3ce1119dbee9146c1264b6e2f30-1600x873.png" style="width: 100%;" /><h2><strong>Self-service configuration</strong></h2><p>Teams configure their repositories declaratively through Backstage:</p><pre><code>spec:
  renovate:
    enabled: true
    config:
      resourceGroup: LARGE      # SMALL | MEDIUM | LARGE  
      runFrequency: "0 */4 * * *"  # Every 4 hours</code></pre><p>Resource groups allocate CPU and memory based on repository size:</p><ul><li><strong>SMALL:</strong> 500m CPU, 1Gi memory.</li><li><strong>MEDIUM:</strong> 1000m CPU, 2Gi memory.</li><li><strong>LARGE:</strong> 2000m CPU, 4Gi memory.</li></ul><p>Configuration is version-controlled, auditable, and applies automatically.</p><h2><strong>The parent-child pattern</strong></h2><p>The execution model uses a parent-child workflow pattern:</p><ul><li><strong>Parent workflow:</strong> Lightweight CronWorkflow running on schedule. Encrypts secrets, determines whether a scan should run, passes configuration to the child.</li><li><strong>Child workflow:</strong> Ephemeral pod where Renovate CLI runs. Allocated resources dynamically, decrypts secrets in isolation, terminates after completion.</li></ul><p>This separation provides security (secrets encrypted at parent level), resource optimization (parents use minimal resources), and scalability (children run in parallel).</p><h2><strong>The results</strong></h2><h3><strong>Performance transformation</strong></h3><ul><li><strong>Before:</strong> One repository at a time, some repositories would not get processed possibly even for a day or more, less than 1,000 scans per day.</li><li><strong>After:</strong> 100+ concurrent scans, usually 8,000 scans and up to 10,000 recorded scans per day, limited only by the amount of resources we’re willing to spend and how we handle GitHub rate limits.</li></ul><h3><strong>Cost efficiency</strong></h3><p>However weird it may sound, running 8,000 pods a day can get you the same result much cheaper than having one long-running pod trying to achieve the same results.</p><p>In the previous setup, we were running a single instance that, on a good day, would perform 500–600 scans. At the same time, due to the fact that different kinds of repositories would be executed on the same pod, we needed to size the pod for the biggest ones. That sizing would be much bigger than our current extra large offering, using 8 CPUs for the pod and 16G of memory.</p><p>To meet the current daily output, the single pod would need to run for 12 days. So comparing the cost of that single pod running for 12 days to 8,000 pods of our “MEDIUM” size running each day, our new design is far more efficient for the same output of scans:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario A (Workflows)</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario B (The long-running single pod)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Setup</td><td style="border: 1px solid #ccc; padding: 4px;">8,000 pods (1 vCPU / 2GB)</td><td style="border: 1px solid #ccc; padding: 4px;">1 pod (8 vCPU / 16 GB)*</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Duration</td><td style="border: 1px solid #ccc; padding: 4px;">10 minutes each</td><td style="border: 1px solid #ccc; padding: 4px;">12 days continuous</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total work time</td><td style="border: 1px solid #ccc; padding: 4px;">1,333 compute hours</td><td style="border: 1px solid #ccc; padding: 4px;">288 compute hours</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total cost</td><td style="border: 1px solid #ccc; padding: 4px;">$65.83</td><td style="border: 1px solid #ccc; padding: 4px;">$113.75</td>
                    </tr>
              </tbody>
            </table>
          <p>However, let’s take into consideration that the default for our workloads is “SMALL,” with the great majority running successfully with 0.5 CPU and 1G RAM, and only a few needing to change to MEDIUM or LARGE. Let’s see what happens if 60% of our workloads run on “SMALL,” 30% on “MEDIUM,” and 10% on “LARGE,” which is closer to the truth.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario A (Mixed swarm)</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario B (The long runner)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Strategy</td><td style="border: 1px solid #ccc; padding: 4px;">8,000 pods (mixed sizes)</td><td style="border: 1px solid #ccc; padding: 4px;">1 pod (8 vCPU / 16 GB)*</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Duration</td><td style="border: 1px solid #ccc; padding: 4px;">10 minutes each</td><td style="border: 1px solid #ccc; padding: 4px;">12 days continuous</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total cost</td><td style="border: 1px solid #ccc; padding: 4px;">$52.66</td><td style="border: 1px solid #ccc; padding: 4px;">$113.75</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Savings</td><td style="border: 1px solid #ccc; padding: 4px;">$61.09 (54% cheaper)</td><td style="border: 1px solid #ccc; padding: 4px;">—</td>
                    </tr>
              </tbody>
            </table>
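          <p>As a sanity check, these figures can be reproduced from the AWS Fargate reference pricing mentioned in the footnote at the end of this post. The sketch below assumes the us-east-1 on-demand rates ($0.04048 per vCPU-hour and $0.004445 per GB-hour); the exact rates and region are our assumption, not a detail from the original comparison:</p><pre><code># Back-of-the-envelope reproduction of both cost tables.
# Assumed AWS Fargate on-demand rates (us-east-1); see the footnote below.
VCPU_RATE = 0.04048   # USD per vCPU-hour
MEM_RATE = 0.004445   # USD per GB-hour

def pod_cost(vcpu, mem_gb, hours):
    """Cost of one pod of the given shape running for `hours`."""
    return hours * (vcpu * VCPU_RATE + mem_gb * MEM_RATE)

SCAN_HOURS = 10 / 60  # each scan runs for ~10 minutes

# Scenario B: one 8 vCPU / 16 GB pod running for 12 days straight.
long_runner = pod_cost(8, 16, 12 * 24)

# Scenario A: 8,000 MEDIUM (1 vCPU / 2 GB) pods, 10 minutes each.
all_medium = 8000 * pod_cost(1, 2, SCAN_HOURS)

# Mixed swarm: 60% SMALL (0.5 vCPU / 1 GB), 30% MEDIUM, 10% LARGE (2 vCPU / 4 GB).
mixed = 8000 * sum(
    share * pod_cost(vcpu, mem, SCAN_HOURS)
    for share, vcpu, mem in [(0.6, 0.5, 1), (0.3, 1, 2), (0.1, 2, 4)]
)

print(f"{long_runner:.2f} {all_medium:.2f} {mixed:.2f}")  # 113.75 65.83 52.66</code></pre><p>Under those assumed rates, the totals match the tables to the cent.</p>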
          <p>We can see that, for the same output, we’re far more cost-efficient in our current setup.</p><h3><strong>Enhanced security</strong></h3><ul><li>Ephemeral GitHub tokens (minutes of exposure versus days).</li><li>Namespace isolation with Role-Based Access Control (RBAC) boundaries.</li><li>Secret encryption at rest in parent workflows.</li><li>Removed direct vault access.</li></ul><h3><strong>Predictable performance</strong></h3><p>With guaranteed scan frequency, we can finally set Service Level Objectives (SLOs). Automerge works reliably. Teams trust the platform to deliver what’s promised.</p><h2><strong>Key architectural decisions</strong></h2><p>Here are some of the milestone design decisions that shaped how the platform looks.</p><ul><li><strong>Why parent-child workflows?</strong></li></ul><p>We adopted this pattern to enforce a <strong>defense-in-depth</strong> strategy. By restricting high-value credentials (such as GitHub App secrets) to a dedicated, locked-down namespace, we use <strong>RBAC</strong> to ensure that ephemeral execution pods cannot arbitrarily access sensitive data. Recent supply chain vulnerabilities (for example, the <strong>&quot;Shai Hulud&quot;</strong> continuous integration/continuous delivery [CI/CD] attacks) have demonstrated the criticality of isolating runtime environments that execute dynamic scripts from the credential store.</p><p>Simultaneously, this decoupling enables <strong>granular resource optimization</strong>. The &quot;parent&quot; workflows act as lightweight orchestrators with a minimal footprint, while the &quot;child&quot; workflows handle the compute-intensive dependency scanning. This separation simplifies <strong>lifecycle management</strong> by allowing us to apply distinct reconciliation logic to each layer, granting users control over execution parameters (child) while retaining administrative control over the scheduling and security infrastructure (parent).</p><ul><li><strong>Why self-serviceable?</strong></li></ul><p>Eliminating our team as a bottleneck for repository configuration was a critical requirement. Our mission was to architect a scalable, <strong>self-service platform</strong> capable of supporting diverse use cases. We recognized that acting as <strong>gatekeepers</strong> for every configuration change was unsustainable, given the sheer volume of repositories. Instead, we adopted a philosophy of enablement: providing the “rails” (infrastructure and <strong>guardrails</strong>) while empowering users to drive the “trains” (execution and customization). We believe this shift toward <strong>team autonomy</strong> significantly enhances productivity by allowing users to tailor the system to their specific operational needs.</p><ul><li><strong>Why Kubernetes Operator pattern?</strong></li></ul><p>As mentioned above, a foundational design principle was to ensure that the platform was fully <strong>self-serviceable</strong>. We required an automated mechanism to capture user intent (such as toggling scans, adjusting scheduling frequency, or tuning runtime resource limits) and instantly propagate those changes to the underlying workflows. Anticipating future requirements, the system also needed to be easily <strong>extensible</strong>.</p><p>To achieve this, we developed a custom <strong>Dependency Management Kubernetes Operator</strong>. By using <strong>CRDs</strong> as the interface for configuration, we established a <strong>Kubernetes-native reconciliation loop</strong>. 
This operator continuously monitors the desired state defined by the user and automatically orchestrates the necessary updates to the workflow infrastructure. This ensures an <strong>event-driven</strong>, seamless operation, where the platform logic handles all complexity behind the scenes.</p><ul><li><strong>Why design a GitHub Events Gateway?</strong></li></ul><p>Adopting an <strong>event-driven architecture (EDA)</strong> was essential for the platform&#x27;s responsiveness. While CronWorkflows provided a reliable baseline schedule, we required the agility to handle <strong>ad hoc executions</strong>, such as users manually triggering scans via the dashboard. To achieve this, we needed a dedicated <strong>ingestion gateway</strong> to validate payload integrity and route requests intelligently.</p><p>We evaluated existing solutions, including the native GitHub EventSource for Argo, but we identified significant risks regarding <strong>operational overhead</strong> and strict <strong>GitHub API quotas</strong> (for example, webhook limits per repository). Consequently, we built a custom gateway to decouple our infrastructure from these limitations.</p><p>Crucially, this gateway served as a strategic <strong>traffic control point</strong> during our migration. It acted as a switch, enabling us to perform a <strong>gradual, granular rollout</strong> (traffic shifting) from the legacy system to the new infrastructure. This ensured that onboarding thousands of repositories was a controlled, risk-free process rather than a “big bang” switchover.</p><h2><strong>Lessons learned</strong></h2><p>Some lessons that we learned go hand-in-hand with the <a href="https://www.elastic.co/about/our-source-code">Elastic Source Code</a>:</p><ol><li><strong>Customer First: </strong>Platforms are built for users, so users’ needs come first. This shapes the platform into efficiently designed infrastructure and applications that reduce user friction, simplify the scaling of the platform, and ease adoption.</li><li><strong>Space, Time: </strong>Sometimes the path of least resistance leads to <strong>shifting sands</strong>. We initially tried to optimize the existing sequential processing model, but this failed to resolve our issues; in fact, it only introduced more complexity and loose ends. The bold decision to <strong>rearchitect</strong> the platform with parallel processing required significant up-front effort. However, it ultimately paved the way for sustainable platform growth and virtually eliminated tedious daily administrative work.</li><li><strong>IT, Depends: </strong>A platform cannot operate in isolation; its success depends on how well it integrates with the broader ecosystem. In our case, integration with <strong>Backstage</strong> was critical, as it serves as the source of truth for seamless service onboarding. Similarly, connecting to <strong>Artifactory</strong> allowed us to manage private package updates efficiently, and the list of essential integrations goes on.</li><li><strong>Progress, SIMPLE Perfection: </strong>Throughout the implementation, we constantly pressure-tested our initial assumptions and adapted to new barriers as they emerged. 
Rather than getting paralyzed by perfectionism, we adopted an <strong>iterative approach</strong>, tackling challenges one by one and adjusting our migration strategy to meet real-world conditions.</li></ol><h2><strong>What’s next</strong></h2><p>Delivering the platform enables more meaningful work that will help us improve its UX and efficiency. Some examples are:<br/></p><ul><li><strong>Increase and guardrail the adoption of auto-merge</strong></li></ul><p>The auto-merge feature significantly accelerates team velocity by eliminating tedious manual tasks. However, we need to make sure that strict <strong>guardrails</strong> are in place to ensure that this increased speed does not come at the expense of security.<br/></p><ul><li><strong>Improve observability around end-user experience</strong></li></ul><p>A critical priority for our roadmap is enhancing observability, not just at the platform level but also specifically from the <strong>end-user’s perspective</strong>. While capturing infrastructure metrics is straightforward, understanding the actual user experience requires deeper insights. We’re working to define core user-centric key performance indicators (KPIs) so our telemetry can detect friction points and performance issues <strong>before</strong> they escalate into user complaints.</p><ul><li><strong>Remove barriers for greater adoption</strong></li></ul><p>Looking ahead, our priority is to identify and remove any barriers hindering platform adoption. Whether this requires developing new integrations or deploying specific feature sets, we’re committed to data-driven planning. We’ve successfully built a platform designed for scale; our focus now shifts to <strong>maximizing its potential</strong>.<br/></p><h2><strong>The bigger picture</strong></h2><p>The dependency management workflows project demonstrates a broader principle: <strong>When you need to scale open source tools beyond their default deployment model, Kubernetes-native patterns provide a path forward</strong>.</p><p>By embracing:</p><ul><li>CRDs for configuration.</li><li>Operators for lifecycle management.</li><li>Event-driven architecture for responsiveness.</li><li>GitOps for deployment.</li></ul><p>we built orchestration that scales independently of the number of repositories it manages. The performance of scanning one repository is the same whether we’re managing 100 or 1,000.</p><p>When a critical CVE is announced, we now have answers in minutes, not hours. That’s the difference between a bottleneck and a competitive advantage.</p><h2><strong>Acknowledgments</strong></h2><p>This platform builds on excellent open source tools:</p><ul><li><strong>Kubebuilder:</strong> The open source framework we used to kick-start our Kubernetes Operators that bootstrap and orchestrate our workflows. [<a href="https://github.com/kubernetes-sigs/kubebuilder">1</a>][<a href="https://book.kubebuilder.io/">2</a>]</li><li><strong>Backstage:</strong> The open source framework on which we’ve built our Service Catalog and which we use as our source of truth. [<a href="https://github.com/backstage/backstage">1</a>][<a href="https://backstage.io/">2</a>]</li><li><strong>Argo Workflows and Argo Events:</strong> The open source suite we used to orchestrate complex processes and add dynamic processing based on events. 
[<a href="https://github.com/argoproj/argo-workflows">1</a>][<a href="https://argo-workflows.readthedocs.io/en/stable/">2</a>][<a href="https://argoproj.github.io/argo-events/">3</a>][<a href="https://github.com/argoproj/argo-events">4</a>]</li><li><strong>Renovate CLI:</strong> The open source dependency management tool processing our repositories. [<a href="https://github.com/renovatebot/renovate">1</a>][<a href="https://docs.renovatebot.com/getting-started/running/">2</a>]</li></ul><p>* The AWS Fargate pricing model was used as a reference for of the cost of a single pod, although our workloads are not running necessarily on AWS and are running on full- blown Kubernetes clusters.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Better text analysis for complex languages with Elasticsearch and neural models]]></title>
            <description><![CDATA[Using neural models and the Elasticsearch inference API to improve search in Hebrew, German, Arabic, and other morphologically complex languages.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-text-analysis-neural-model</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-text-analysis-neural-model</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Noam Schwartz]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a572f9832d8ebc603b70743ac8f2d6e4ea8d2e11-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>If you work with English search, standard text analysis usually just works. You index “running,” the analyzer strips the suffix to store “run,” and a user searching for “run” finds the document. Simple.</p><p>But if you work with languages like Hebrew, Arabic, German, or Polish, you know that standard rule-based analyzers often fail. They either under-analyze (missing relevant matches) or over-analyze (returning garbage results).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2484b1f7ce600fbbf75b76a12a67cdfdf9b6e6ab-800x600.jpg" style="width: 100%;" /><p>For years, we’ve had to rely on complex dictionaries and fragile regex rules. Today, we can do better. By replacing rule-based logic with <strong>neural models for text analysis</strong> (small, efficient language models that understand context), we can drastically improve search quality.</p><p>Here’s how to solve the morphology challenge by using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"><strong>Elasticsearch inference API</strong></a> and a custom model service.</p><h2><strong>The problem: Why rules fail</strong></h2><p>Most standard analyzers are <strong>context-free</strong>. They look at one word at a time and apply a static set of rules.</p><ul><li><strong>Algorithmic analyzers</strong> (like Snowball) strip suffixes based on patterns.</li><li><strong>Dictionary analyzers</strong> (like Hunspell) look up words in a list.</li></ul><p>This approach breaks down when the structure of a word (its root and affixes) changes based on the sentence it lives in.</p><h3><strong>1. The Semitic ambiguity (roots versus prefixes)</strong></h3><p>Semitic languages, like Hebrew and Arabic, are built on root systems and often attach prepositions (such as in, to, or from) directly to the word. This creates ambiguous tokens that rule-based systems cannot solve.</p><ul><li><strong>Word:</strong> <code>בצל</code> (B-Tz-L).</li><li><strong>Context A:</strong> “The soup tastes better with <strong>onion</strong> (<em>batzal</em>).”</li><li><strong>Context B:</strong> “We sat <strong>in the shadow</strong> (<em>ba-tzel</em>) of the tree.”</li></ul><p>In Context A, <code>בצל</code> is a noun (onion). In Context B, it’s a preposition ב (in) attached to the noun <code>צל</code> (shadow).</p><p>A standard analyzer is forced to guess. If it aggressively strips the ב prefix, it turns &quot;onion&quot; into &quot;shadow.&quot; If it’s conservative and leaves it alone, a user searching for &quot;shadow&quot; (<em>tzel</em>) will fail to find documents containing &quot;in the shadow&quot; (<em>batzel</em>). Neural models solve this by reading the sentence to determine whether the ב is part of the root or a separate preposition.</p><h3><strong>2. The compound problem (German, Dutch, and more)</strong></h3><p>Languages like German, Dutch, Swedish, and Finnish concatenate nouns without spaces to form new concepts. This results in a theoretically infinite vocabulary. To search effectively, you must split (decompound) these words.</p><ul><li><strong>Word:</strong> <code>Wachstube</code>.</li><li><strong>Split A:</strong> <code>Wach</code> (guard) + <code>Stube</code> (room) = guardroom.</li><li><strong>Split B:</strong> <code>Wachs</code> (wax) + <code>Tube</code> (tube) = wax tube.</li></ul><p>A dictionary-based decompounder acts blindly. 
If both “Wach” and “Wachs” are in its dictionary, it might pick the wrong split, polluting your index with irrelevant tokens.</p><p>To see this problem in English: A naive algorithm might split “carpet” into “car” + “pet.” Without understanding meaning, rules fail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eefee9dc6206452d362f8f58dc35c793021dcb1e-800x524.jpg" style="width: 100%;" /><h2><strong>The solution: “Neural analyzers” (neural models for text analysis)</strong></h2><p>We don’t need to abandon the inverted index. We just need to feed it better tokens.</p><p>Instead of a regex rule, we use a <strong>neural model</strong> (like BERT or T5) to perform the analysis. Because these models are trained on massive datasets, they understand context. They look at the surrounding words to decide whether <code>בצל</code> means &quot;onion&quot; or &quot;in shadow&quot; or if <code>Wachstube</code> belongs in a military or cosmetic context.</p><h3><strong>Architecture: The inference sidecar</strong></h3><p>We can integrate these Python-based models directly into the Elasticsearch ingestion pipeline using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"><strong>inference API</strong></a>.</p><p><strong>The pattern:</strong></p><ol><li><strong>External model service:</strong> A simple Python service (for example, FastAPI) hosts the model.</li><li><strong><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom">Elasticsearch inference API</a>:</strong> Defines this service as a custom model within Elasticsearch.</li><li><strong>Ingest pipeline:</strong> Sends text to the inference processor, which calls your Python service.</li><li><strong>Index mapping: </strong>Create a <code>whitespace</code> target field for the analyzed text.</li><li><strong>Indexing:</strong> The service returns the cleaned text, which Elasticsearch stores in the target field.</li><li><strong>Search:</strong> Queries are analyzed via the inference API before matching.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5c3055a1594f267c676347da36b1b8b2b187220c-1600x1248.png" style="width: 100%;" /><h2><strong>Implementation guide</strong></h2><p>Let’s build this for <strong>Hebrew</strong> (using <code>DictaBERT</code>) and <strong>German</strong> (using <code>CompoundPiece</code>).</p><p>To follow along, you’ll need:</p><ul><li>Python 3.10+.</li><li>Elasticsearch 8.9.x+.</li></ul><p>Install the Python dependencies:</p><pre><code>pip3 install fastapi uvicorn torch transformers</code></pre><h3><strong>Step 1: External model service</strong></h3><p>To connect Elasticsearch to our neural model, we need a simple API service that:</p><ol><li>Receives text from the Elasticsearch inference API.</li><li>Passes it through the neural model.</li><li>Returns analyzed text in a format Elasticsearch understands.</li></ol><p>This service interfaces Elasticsearch with the neural model. At ingest time, the Elasticsearch pipeline calls this API to analyze and store document fields; at search time, the application calls it to process the user&#x27;s query. You can deploy this on any infrastructure, including EC2, Lambda, or SageMaker.</p><p>The code below loads both models at startup and exposes <code>/analyze/hebrew</code> and <code>/analyze/german</code> endpoints:</p><pre><code>from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Union
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from contextlib import asynccontextmanager
import torch

# Global models (loaded once at startup)
he_model = None
he_tokenizer = None
de_model = None
de_tokenizer = None


@asynccontextmanager
async def lifespan(app: FastAPI):
   """Load models at startup."""
   global he_model, he_tokenizer, de_model, de_tokenizer

   print("Loading Hebrew model (DictaBERT-Lex)...")
   he_tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert-lex")
   he_model = AutoModel.from_pretrained("dicta-il/dictabert-lex", trust_remote_code=True)
   he_model.eval()

   print("Loading German model (CompoundPiece)...")
   de_tokenizer = AutoTokenizer.from_pretrained("benjamin/compoundpiece")
   de_model = AutoModelForSeq2SeqLM.from_pretrained("benjamin/compoundpiece")

   if torch.cuda.is_available():
       he_model.to("cuda")
       de_model.to("cuda")

   print("Models loaded successfully!")
   yield
   print("Shutting down...")


app = FastAPI(
   title="Neural Text Analyzer",
   description="Multi-language text normalization service",
   version="1.0.0",
   lifespan=lifespan
)


class InferenceRequest(BaseModel):
   """ES Inference API sends: {"input": ["text1", "text2"]} or {"input": "text"}"""
   input: Union[str, List[str]]


def format_response(normalized_text: str) -> dict:
   """
   Normalize output to an OpenAI-compatible format for the ES inference API,
   which extracts: $.choices[*].message.content
   You do not need to stick with the OpenAI output format; it is used here
   for consistency, since the endpoint is registered with the completion task type.
   """
   return {
       "choices": [
           {"message": {"content": normalized_text}}
       ]
   }


@app.post("/analyze/hebrew")
async def analyze_hebrew(request: InferenceRequest):
   """Hebrew lemmatization using DictaBERT-Lex."""
   global he_model, he_tokenizer

   if he_model is None:
       raise HTTPException(status_code=503, detail="Model not loaded")

   # Handle input (can be string or list)
   if isinstance(request.input, str):
       texts = [request.input]
   else:
       texts = request.input

   # Run prediction
   with torch.no_grad():
       results = he_model.predict(texts, he_tokenizer)

   # results format: [[[word, lemma], [word, lemma], ...]]
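   # Note: only the first input text (results[0]) is normalized below; loop
   # over all entries in `results` if you send batched inputs.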
   if results and results[0]:
       lemmas = []
       for word, lemma in results[0]:
           if lemma == '[BLANK]':
               lemma = word
           lemmas.append(lemma)
       normalized = " ".join(lemmas)
   else:
       normalized = ""

   return format_response(normalized)


@app.post("/analyze/german")
async def analyze_german(request: InferenceRequest):
   """German decompounding using CompoundPiece (supports 56 languages)."""
   global de_model, de_tokenizer

   if de_model is None:
       raise HTTPException(status_code=503, detail="Model not loaded")

   # Handle input
   if isinstance(request.input, str):
       text = request.input
   else:
       text = request.input[0] if request.input else ""
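    # Note: only the first string of a batched request is decompounded here;
    # extend this to loop over request.input for true batch support.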

   # Format: "de: <word>" for German
   input_text = f"de: {text}"

   inputs = de_tokenizer(input_text, return_tensors="pt")
   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}

   with torch.no_grad():
       outputs = de_model.generate(**inputs, max_length=128)

   # IMPORTANT: decode outputs[0], not outputs
   result = de_tokenizer.decode(outputs[0], skip_special_tokens=True)

   # Clean up: "de: Donau-Dampf-Schiff" -> "Donau Dampf Schiff"
   # Note: model returns "de: " (with space after colon)
   if result.startswith("de: "):
       clean_result = result[4:].replace("-", " ")
   elif result.startswith("de:-"):
       clean_result = result[4:].replace("-", " ")
   elif result.startswith("de:"):
       clean_result = result[3:].replace("-", " ")
   else:
       clean_result = result.replace("-", " ")

   return format_response(clean_result.strip())


@app.get("/health")
async def health():
   return {"status": "healthy"}</code></pre><p>Save the code above to a file (for example, <code>analyzer_service.py</code>), and run:</p><pre><code>python3 -m uvicorn analyzer_service:app --port 8000</code></pre><p>Wait for “<em>Models loaded successfully!</em>” (takes ~30–60 seconds for models to download on first run).</p><p>Test locally:</p><pre><code>#Hebrew
curl -X POST http://localhost:8000/analyze/hebrew \
 -H "Content-Type: application/json" \
 -d '{"input": "הילדים אכלו גלידה בגינה"}'</code></pre><pre><code>#German
curl -X POST http://localhost:8000/analyze/german \
 -H "Content-Type: application/json" \
 -d '{"input": "Donaudampfschifffahrt"}'</code></pre><p>Expected output:</p><pre><code>- Hebrew: `{"choices":[{"message":{"content":"ילד אוכל גלידה גינה"}}]}`
- German: `{"choices":[{"message":{"content":"Donau Dampf Schiff Fahrt"}}]}`</code></pre><h3><strong>Step 2: Configure Elasticsearch inference API</strong></h3><p>We’ll use the<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"> <code>custom</code></a> inference endpoint. This allows us to define exactly how Elasticsearch talks to our Python endpoint.</p><p><strong>Note:</strong> Use <code>response.json_parser</code> to extract the content from our normalized JSON structure. You <strong>do not</strong> need to stick with the OpenAI output format. We’re using it here for consistency reasons, since we’re using the <em>completion</em> task type, which is text to text.</p><h4><strong>Exposing your local service</strong></h4><p>For testing, we’ll use <a href="https://ngrok.com">ngrok</a> to expose the local Python service to the internet. This allows any Elasticsearch deployment (self-managed, Elastic Cloud, or Elastic Cloud Serverless) to reach your service.</p><p>Install and run ngrok:</p><pre><code># Install ngrok (macOS) (Or download from https://ngrok.com/download)
brew install ngrok</code></pre><p>Expose your local service:</p><pre><code>ngrok http 8000</code></pre><p>ngrok will display a forwarding URL like:</p><p>Forwarding <a href="https://abc123.ngrok.io">https://abc123.ngrok.io</a> -&gt; <a href="http://localhost:8000">http://localhost:8000</a></p><p>Copy the HTTPS URL. You’ll use this in the Elasticsearch configuration.</p><p><strong>Configure the inference endpoint</strong></p><pre><code>PUT _inference/completion/hebrew-analyzer
{
  "service": "custom",
  "service_settings": {
    "url": "https://abc123.ngrok.io/analyze/hebrew",
    "headers": {
      "Content-Type": "application/json"
    },
    "request": "{\"input\": ${input}}",
    "response": {
      "json_parser": {
        "completion_result": "$.choices[*].message.content"
      }
    }
  }
}</code></pre><p>Replace <a href="https://abc123.ngrok.io">https://abc123.ngrok.io</a> with your actual ngrok URL.</p><p><strong>Note:</strong> ngrok is used here for fast testing and development. The free tier has request limits, and URLs change on restart. For production, deploy your service to a persistent infrastructure.</p><h4><strong>For production (with API Gateway)</strong></h4><p>In production, deploy your Python service to a secure, persistent endpoint (such as AWS API Gateway + Lambda, EC2, ECS, or any cloud provider). Use <code>secret_parameters</code> to securely store API keys:</p><pre><code>PUT _inference/completion/hebrew-analyzer
{
  "service": "custom",
  "service_settings": {
    "url": "https://your-api-gateway.execute-api.region.amazonaws.com/prod/analyze/hebrew",
    "headers": {
      "x-api-key": "${api_key}",
      "Content-Type": "application/json"
    },
    "secret_parameters": {
      "api_key": "YOUR-API-KEY"
    },
    "request": "{\"input\": ${input}}",
    "response": {
      "json_parser": {
        "completion_result": "$.choices[*].message.content"
      }
    }
  }
}</code></pre><h3><strong>Step 3: Ingest pipeline</strong></h3><p>Create a pipeline that passes the raw text field to our model and stores the result in a new field.</p><pre><code>PUT _ingest/pipeline/hebrew_analysis_pipeline
{
 "description": "Lemmatizes Hebrew text using a custom inference endpoint",
 "processors": [
   {
     "inference": {
       "model_id": "hebrew-analyzer",
       "input_output": {
         "input_field": "content",
         "output_field": "content_analyzed"
       }
     }
   }
 ]
}</code></pre><h3><strong>Step 4: Index mapping</strong></h3><p>This is the most critical step. The output from our neural model is already analyzed. We <strong>do not</strong> want a standard analyzer to mess it up again. We use the <code>whitespace</code> analyzer to simply tokenize the text we received.</p><pre><code>PUT /my-hebrew-index
{
 "mappings": {
   "properties": {
     "content": {
       "type": "text",
       "analyzer": "standard"
     },
     "content_analyzed": {
       "type": "text",
       "analyzer": "whitespace"
     }
   }
 }
}</code></pre><h3><strong>Step 5: Indexing</strong></h3><p><strong>Option A: Single document.</strong></p><pre><code>POST /my-hebrew-index/_doc?pipeline=hebrew_analysis_pipeline
{
"content": "הילדים אכלו גלידה בגינה"
}</code></pre><p><strong>Option B: Reindex existing data.</strong></p><p>If you have existing data in another index, reindex it through the pipeline:</p><pre><code>POST _reindex
{
 "source": {
   "index": "my-old-index"
 },
 "dest": {
   "index": "my-hebrew-index",
   "pipeline": "hebrew_analysis_pipeline"
 }
}</code></pre><p><strong>Option C: Set pipeline as default for index.</strong></p><p>Make all future documents automatically use the pipeline:</p><pre><code>PUT /my-hebrew-index/_settings
{
"index.default_pipeline": "hebrew_analysis_pipeline"
}</code></pre><p>Then index normally (no <code>?pipeline=</code> needed):</p><pre><code>POST /my-hebrew-index/_doc
{
"content": "הילדים אכלו גלידה בגינה"
}</code></pre><h3><strong>Step 6: Search</strong></h3><p>Searching with a neural analyzer in Elasticsearch is a two-step process: analyze the query first using the inference API, then search with the result.</p><p><strong>A. Analyze the query.</strong></p><pre><code> POST _inference/completion/hebrew-analyzer
 {
   "input": "הילדים אכלו גלידה בגינה"
 }</code></pre><p><strong>B. Search with the result.</strong></p><pre><code> GET /my-hebrew-index/_search
 {
   "query": {
     "match": {
       "content_analyzed": "ילד אוכל גלידה גינה"
     }
   }
  }</code></pre><p>In production, wrap these two calls in your application code for a seamless experience.</p><h2><strong>Available models</strong></h2><p>The architecture above works for any language. You simply swap the Python model and adjust the post-processing of the output. Here are verified models for common complex languages:</p><ul><li><strong>Hebrew:</strong> Context-aware lemmatization. Handles prefix ambiguity (ב, ה, ל, and more). <a href="https://huggingface.co/dicta-il/dictabert-lex">dicta-il/dictabert-lex</a>.</li><li><strong>German: </strong>Generative decompounding. Supports 56 languages, including Dutch, Swedish, Finnish, and Turkish. <a href="https://huggingface.co/benjamin/compoundpiece">benjamin/compoundpiece</a>.</li><li><strong>Arabic:</strong> BERT-based disambiguation and lemmatization for Modern Standard Arabic. <a href="https://github.com/CAMeL-Lab/camel_tools">CAMeL Tools</a>.</li><li><strong>Polish:</strong> Case-sensitive lemmatization for Polish inflections. <a href="https://huggingface.co/amu-cai/polemma-large">amu-cai/polemma-large</a>.</li></ul><h2><strong>Conclusion</strong></h2><p>You don’t need to choose between the precision of lexical search and the intelligence of AI. By moving the “smart” part of the process into the analysis phase using the inference API, you fix the root cause of poor search relevance in complex languages.</p><p>The tools are here. The models are open-source. The pipelines are configurable. It’s time to teach our search engines to read.</p><h3><strong>Code</strong></h3><p>All code snippets from this article are available at <a href="https://github.com/noamschwartz/neural-text-analyzer">https://github.com/noamschwartz/neural-text-analyzer</a>.</p><p><strong>References</strong>:</p><ul><li><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom">https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom</a></li><li><a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines</a></li><li><a href="https://ngrok.com">https://ngrok.com</a></li><li><a href="https://huggingface.co/dicta-il/dictabert-lex">https://huggingface.co/dicta-il/dictabert-lex</a></li><li><a href="https://huggingface.co/benjamin/compoundpiece">https://huggingface.co/benjamin/compoundpiece</a></li><li><a href="https://arxiv.org/pdf/2305.14214">https://arxiv.org/pdf/2305.14214</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An open‑source Hebrew analyzer for Elasticsearch lemmatization]]></title>
            <description><![CDATA[An open-source Elasticsearch 9.x analyzer plugin that improves Hebrew search by lemmatizing tokens in the analysis chain for better recall across Hebrew morphology.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-lemmatization-hebrew-analyzer</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-lemmatization-hebrew-analyzer</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Lily Adler]]></dc:creator>
            <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d9f79cec59d89f4e14657db7df846ed3104a2da-1024x565.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Hebrew is morphologically rich: Prefixes, inflections, and clitics make exact-token search brittle. This project provides an open-source Hebrew analyzer plugin for Elasticsearch 9.x that performs neural lemmatization in the analysis chain, using an embedded DictaBERT model executed in-process via ONNX Runtime with an INT8-quantized model.</p><h2><strong>Quick start</strong></h2><p>Download the relevant <a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin/releases">release</a> or build and install (Linux build script generates Elasticsearch‑compatible zip):</p><pre><code>./scripts/build_plugin_linux.sh</code></pre><p>Install in Elasticsearch:</p><pre><code>/path/to/elasticsearch/bin/elasticsearch-plugin install file:///path/to/heb-lemmas-embedded-plugin-<ES_VERSION>.zip</code></pre><p>Test:</p><pre><code>curl -k -X POST "https://localhost:9200/_analyze" \
  -H "Content-Type: application/json" \
  -u "elastic:<password>" \
  -d '{"tokenizer":"whitespace","filter":["heb_lemmas","heb_stopwords"],"text":"הילדים אוכלים את הבננות"}'</code></pre><h2><br/>Why Hebrew search is different</h2><p>Hebrew is morphologically rich: Prefixes, suffixes, inflection, and clitics all collapse into a single surface form. That makes naive tokenization insufficient. Without true lemmatization, search quality suffers; users miss relevant results due to simple variations in form. This project tackles that by embedding a Hebrew lemmatization model inside the analyzer itself, so every token passes through a neural model before indexing and querying.</p><h3>Example</h3><p>Users may search for the lemma “בית” (house), but documents might contain:</p><ul><li>בית (a house)</li><li>בבית (in the house)</li><li>לבית (to the house)</li><li>בבתים (in houses)</li><li>לבתים (to houses)</li></ul><p>Without lemmatization, these become different surface tokens; lemmatization normalizes them toward the same lemma (בית), improving recall:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d3dfd40569b3f4bfc79df639a49ae995e92b0bc1-1600x983.png" style="width: 100%;" /><h2>What this plugin does</h2><p>Rather than relying on rule-based stemming, the analyzer runs a Hebrew lemmatization model as part of the Elasticsearch analysis chain and emits one normalized lemma per token. Because the model is neural, it can use local context within each analyzed segment to choose a lemma in ambiguous cases—while still producing stable tokens that work well for indexing and querying. The analyzer:</p><ul><li>Runs a Hebrew lemmatization model inside Elasticsearch.</li><li>Produces better normalized tokens for Hebrew text.</li><li>Supports stopwords and standard analyzer pipelines.</li></ul><h2>The result: Fast, reliable lemmatization</h2><p>This analyzer is optimized for real‑world throughput:</p><ul><li>ONNX Runtime in‑process inference.</li><li>INT8-quantized model for lower latency and memory footprint.</li><li>Java Foreign Function Interface (FFI) for high‑performance native inference.</li></ul><p>The result: fast, reliable lemmatization with predictable operational behavior.</p><p>To evaluate performance, we ran a benchmark in a Docker container (4 cores, 12 GB RAM) on 1 million large documents (5.7 GB of data) from the Hebrew Wikipedia dataset. You’ll find the results below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric (search)</td><td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Value</td><td style="border: 1px solid #ccc; padding: 4px;">Unit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Min throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">409.75</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Mean throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">490.65</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Median throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">491.85</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">496.13</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">7.02242</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">90th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">10.7338</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">19.0406</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99.9th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">27.165</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">7.02242</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">90th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">10.7338</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">19.0406</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99.9th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">27.165</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">0</td><td style="border: 1px solid #ccc; padding: 4px;">%</td>
                    </tr>
              </tbody>
            </table>
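          <p>The <code>_analyze</code> call in the quick start exercises the token filters directly; to search with them, wire them into an index analyzer. Below is a minimal sketch using the official Python client; the index and analyzer names are illustrative, and only the <code>heb_lemmas</code> and <code>heb_stopwords</code> filter names come from the plugin itself:</p><pre><code>from elasticsearch import Elasticsearch

# Connect the same way as the curl test above (self-signed certs, basic auth).
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "<password>"),
    verify_certs=False,
)

# Create an index whose "content" field is lemmatized at index and query time.
es.indices.create(
    index="hebrew-docs",  # illustrative name
    settings={
        "analysis": {
            "analyzer": {
                "hebrew_lemma": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["heb_lemmas", "heb_stopwords"],
                }
            }
        }
    },
    mappings={"properties": {"content": {"type": "text", "analyzer": "hebrew_lemma"}}},
)</code></pre>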
          <h2>Open source and Elastic‑ready</h2><p>The plugin is fully open source and works on:</p><ul><li>Elastic open‑source distributions.</li><li>Elastic Cloud.</li></ul><p>You can build it yourself or download prebuilt releases and install it like any other plugin.</p><p>To upload the analyzer plugin to Elastic Cloud, navigate to the <strong>Extensions</strong> section within your Elastic Cloud console and proceed with the upload.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e2e8731aeb884e91624f2c8c0998cf8de08a16e3-1402x1600.png" style="width: 100%;" /><h2>Credits</h2><p>This project is a fork of the Korra.ai Hebrew analysis plugin (MIT), which was implemented by <a href="http://Korra.ai">Korra.ai</a> with funding and guidance from the National NLP Program led by MAFAT and the Israel Innovation Authority.</p><p>This fork focuses on Elasticsearch 9.x compatibility and running lemmatization fully in-process via ONNX Runtime, using an INT8‑quantized model and bundled Hebrew stopwords. Lemmatization is powered by DictaBERT <a href="https://huggingface.co/dicta-il/dictabert-lex"><code>dicta-il/dictabert-lex</code></a> (CC‑BY‑4.0).</p><p>Huge thanks to the Dicta team for making high-quality Hebrew natural language processing (NLP) models available to the community.</p><h2>Links</h2><ul><li><a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin">GitHub repo</a></li><li><a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin/releases">Releases</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch 9.3 adds bfloat16 vector support]]></title>
            <description><![CDATA[Exploring the new Elasticsearch element_type: bfloat16, which can halve your vector data storage.]]></description>
            <link>https://www.elastic.co/search-labs/blog/bfloat16-vector-support-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/bfloat16-vector-support-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Simon Cooper]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f80f988d758f53742f6b4cd13b04d0cb27af7a17-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch 9.3 brings with it several important improvements to vector data, including a new <code>element_type: bfloat16</code>. This has the potential to <em>halve</em> your vector data storage, with minimal reductions in recall and runtime performance for most use cases.</p><h2>Storage formats in dense_vector fields</h2><p>Prior to 9.3, <code>dense_vector</code> fields support vectors of single bits, 1-byte integers, and 4-byte floats. We store the original vectors on top of any quantization and/or hierarchical navigable small world (HNSW) graph used for indexing, and the original vectors make up the vast majority of the required disk space of the vector indices. If your vectors are floating point, then the only option versions of Elasticsearch prior to 9.3 provide is to store 4 bytes per vector value: That’s 4kB for a single 1024-dimensional vector.</p><p>There are other floating-point sizes available, of course: <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE-754</a> specifies floating-point sizes of many different lengths, including the 4-byte <code>float32</code> and 8-byte <code>float64</code> used by Java <code>float</code> and <code>double</code> types. It also specifies a <code>float16</code> format, which only uses 2 bytes per value. However, this only has a maximum value of 65,504, compared to the 3.4×10<sup>38</sup> of 4-byte <code>float32</code> values, and the conversion between the two involves several arithmetic operations.</p><p>As an alternative, many machine learning (ML) applications now use <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a>, which is a modification of IEEE-754 <code>float32</code> to only use 2 bytes. It does this by discarding the lowest 2 bytes of the fractional part of the value, leaving the sign and exponent unchanged. This effectively reduces the precision of the floating-point value <em>without</em> a corresponding reduction in range. The conversion from <code>float32</code> to <code>bfloat16</code> is a simple bitwise truncation on the <code>float32</code> value, with a bit of jiggling to account for rounding.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0f8d88cfb8c696b8ef805424d7dc7a242023484-913x394.png" style="width: 100%;" /><h2>bfloat16 in Elasticsearch 9.3</h2><p>Elasticsearch 9.3 now supports storing vector element types as bfloat16. In memory, it will still process every vector value as a 4-byte float32, as Java does not have built-in support for bfloat16. As it writes vector data to disk, it will simply truncate and round each float32 value to a 2-byte bfloat16, and zero-expand each bfloat16 value back to float32 on reading the value into memory.</p><p>This effectively <em>halves</em> your vector index sizes, as it uses 2 bytes per value rather than 4 bytes. There may be a small performance cost during reading and writing data as Elasticsearch performs the necessary conversions, but this is often counterbalanced by a significant reduction in the I/O required, as the OS now has to read half as much data. And, for most datasets, there is a minimal effect on search recall.</p><p>As an example, this is the difference in sizes for bfloat16 on our <code>dense_vector</code> dataset:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d7d2459cead1b8d29d107a1b6bbbe21be1fc7315-1600x990.png" style="width: 100%;" /><p>So, if your input vectors are already at bfloat16 precision, then happy days! 
Elasticsearch accepts raw bfloat16 vectors as float values, and as <a href="https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion">Base64-encoded vectors</a>. The vectors are persisted to disk with the same precision as your original source data, immediately halving your data storage requirements.</p><p>If your input vectors are at 4-byte precision, then you can also use bfloat16 format to halve your index data sizes. Elasticsearch will truncate and round each value to 2-byte precision, throwing away the least significant bits of the fraction. This means that the vector values you get back from Elasticsearch won’t be exactly the same as what you originally indexed, so don’t use bfloat16 if you need to maintain the full 4-byte precision of float32.</p><p>Starting in Elasticsearch 9.3, and on Elasticsearch Serverless, you can specify <code>element_type: bfloat16</code> with all <code>dense_vector</code> index types on any newly created indices. If you wish to use bfloat16 with existing indices, you can reindex into an index with <code>element_type: bfloat16</code> and Elasticsearch will automatically convert your existing float vectors to bfloat16.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI agents that perform actions: Automating IT requests with Agent Builder and Workflows]]></title>
            <description><![CDATA[Using  Elastic Agent Builder and Workflows to create an AI agent that automatically performs IT actions, such as laptop refreshes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-one-workflow</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-one-workflow</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sri Kolagani, Ziyad Akmal]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80e98ed860633a0a20abcc55ad10b2854a4e8df0-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the world of IT operations, context switching is the enemy of productivity. For internal teams, simple requests, like a laptop refresh or employee onboarding, often require navigating multiple portals, filling out rigid forms, and manually updating information technology service management (ITSM) tools like ServiceNow.</p><p>At a recent <strong>DevFest</strong>, we demonstrated how to bridge the gap between natural language requests and structured IT workflows. By combining <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder"><strong>Elastic Agent Builder</strong></a> with <a href="https://www.elastic.co/docs/explore-analyze/workflows"><strong>Elastic Workflows</strong></a>, we can create AI assistants that not only answer questions but also perform complex actions.</p><p>In this post, we’ll dive into the architecture from that talk, specifically looking at how we built an automated &quot;Laptop Refresh&quot; workflow. We’ll demonstrate how to configure an agent that collects user requirements and triggers a server-side automation to interact directly with ServiceNow APIs.</p><p><strong>Watch the full breakdown:</strong> This post is based on our presentation at Google DevFest. You can <a href="https://www.youtube.com/watch?v=OzStbTUZqyw">watch the full session here</a> to see the demo in action.</p><h2><strong>The architecture: From chat to fulfillment</strong></h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eb42459bfae9c2ac95f2012882ce826db5526705-1600x1000.png" style="width: 100%;" /><p><strong>Note:</strong> The technical implementation described in this document is a streamlined version of the full production environment. While the <strong>architecture diagram</strong> provided serves as an accurate structural reference for the actual deployment, the accompanying text and code snippets have been simplified for illustrative purposes and may differ from the final, complex configurations used in the live implementation.</p><p>The goal is to move from a manual, form-heavy process to a conversational interface. Instead of a user navigating a catalog, they simply tell the AI assistant that they’re due for a laptop upgrade.</p><p>As illustrated above, the flow consists of three distinct layers:</p><p><strong>1. Interaction layer (ElasticGPT/Agent Builder):</strong> The user interacts naturally with an interface powered by ElasticGPT. Behind the scenes, Agent Builder processes this conversation, handling intent detection and slot filling, to structure the data and orchestrate interactions with other internal systems.</p><ul><li><strong>Intent detection</strong><ul><li><strong>Mechanism:</strong> System prompt instruction.</li><li><strong>Implementation:</strong> The agent is explicitly told its single purpose in the <code>MISSION</code> statement. 
It doesn’t need to &quot;detect&quot; other intents because it’s scoped strictly to IT provisioning.<ul><li><em><strong>Code reference</strong>:</em> <code>MISSION: You are a specialized agent designed to collect complete employee onboarding information...</code></li></ul></li><li><strong>Constraint:</strong> If a user asks about non-IT topics (for example, &quot;What is the weather?&quot;), the <code>MISSION</code> implies that the agent should pivot back to data collection or decline, depending on the large language model’s (LLM&#x27;s) default safety alignment.</li></ul></li><li><strong>Slot filling (data collection)</strong><ul><li><strong>Mechanism:</strong> Phased conversation flow.</li><li><strong>Implementation:</strong> Instead of asking for all slots at once, the <code>DATA COLLECTION STRATEGY</code> breaks the slots into five logical phases. This prevents the context switching fatigue mentioned above.<ul><li><strong><em>Code reference:</em> </strong><code>PHASE 1: Personal Information, PHASE 2: Employment Details, and so on.</code></li></ul></li><li><strong>Validation:</strong> The prompt enforces immediate validation (for example, <code>Validate inputs immediately</code>), acting as a gatekeeper before moving to the next slot.</li></ul></li></ul><p><strong>2. Automation layer (Workflows):</strong> Once the agent has the data, it triggers a workflow. This workflow handles the logic: checking device eligibility, enforcing policy (for example, &quot;Is the laptop &gt; 3 years old?&quot;), and making API calls.</p><p><strong>3. System of record (ServiceNow):</strong> The workflow reads and writes directly to the ITSM tool to maintain audit trails and initiate fulfillment.</p><h2><strong>Step 1: Configuring the agent</strong></h2><p>The first step is defining the &quot;brain&quot; of the operation using <strong>Agent Builder</strong>. We need an agent that acts strictly within the bounds of IT provisioning. We don&#x27;t want a general chatbot; we want a data collection machine that feels like a helpful colleague.</p><p>We achieve this via a robust <strong>system prompt</strong>. The prompt dictates the agent&#x27;s operating protocol, enforcing a step-by-step data collection strategy.</p><p>Here’s the refined structure of the prompt we used. Notice how it enforces validation and logically groups questions to avoid overwhelming the user:</p><pre><code>MISSION: You are a specialized agent designed to collect complete employee onboarding information for IT equipment provisioning.

OPERATING PROTOCOL:
0. On every new chat, send a welcome message, and directly jump to data collection.

1. DATA COLLECTION STRATEGY:
   - Use a step-by-step approach across 5 clear phases
   - Validate inputs immediately

2. CONVERSATION FLOW:
   PHASE 1: Personal Information (Name, Email, Phone)
   PHASE 2: Employment Details (Job Title, Department, Manager)
   PHASE 3: Location & Shipping (Address, Country)
   PHASE 4: Technical Setup (Laptop Type, Accessories)
   PHASE 5: Confirmation

...

6. SUCCESS COMPLETION:
   After all data is collected and validated, invoke the tool "laptoprefreshworkflow" with the JSON payload.</code></pre><p>For a sample system prompt and instructions, see <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/building-actionable-ai-automating-it-requests-with-agent-builder-and-one-workflow/Try%20it%20yourself%20Agents/service_now_utility_agent.ipynb">this notebook</a>.</p><p>By explicitly instructing the agent to send the data in a specific JSON format at the end of the conversation, we ensure that the input matches exactly what our automation layer expects.</p><h2><strong>Step 2: The automation layer (Workflows)</strong></h2><p>The agent provides the <em>intent</em> and the <em>data</em>, but <strong>Workflows</strong> provides the <em>muscle</em>.</p><p>We define a workflow using a YAML configuration. This workflow acts as the bridge between the AI agent and the ServiceNow REST APIs. It handles authentication, data retrieval, and the ordering process.</p><p>Below is the workflow definition. We’ve refined the code to use secure variable handling for credentials rather than hardcoding them.</p><h3><strong>Workflow inputs</strong></h3><p>First, we define the inputs the workflow expects to receive from the agent:</p><pre><code>YAML
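# The agent's final step invokes this workflow with a JSON payload whose keys
# match the inputs declared below. A hypothetical example (values illustrative,
# not taken from the production deployment):
#   {
#     "userid": "abel.tuter",
#     "preferred-address": "221B Example Street, London",
#     "laptop-choice": "Macbook latest",
#     "laptop-keep-or-return": "return"
#   }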
version: "1"
name: Submit Laptop Refresh Request
enabled: true
triggers:
  - type: manual
inputs:
  - name: userid
    type: string
  - name: preferred-address
    type: string
  - name: laptop-choice
    default: Macbook latest
    type: string
  - name: laptop-keep-or-return
    default: return
    type: string</code></pre><h3><strong>Interacting with ServiceNow</strong></h3><p>The workflow executes a series of HTTP steps. Crucially, we first need to identify the user&#x27;s <em>current</em> asset to link the refresh request correctly.</p><p>1. Fetching computer data</p><p>We query the cmdb_ci_computer table in ServiceNow to find the asset currently assigned to the user.</p><pre><code>YAML
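# Assumption: the ServiceNow Table API returns {"result": [ ... ]}, so later
# steps read the first matching computer as result[0]. Verify the response
# shape against your own instance before relying on it.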
steps:
  - name: snow_get_computer_data
    type: http
    with:
      url: https://elasticdev.service-now.com/api/now/table/cmdb_ci_computer?assigned_to={{ inputs.userid }}
      method: GET
      headers:
        Accept: application/json
        Content-Type: application/json
        # Best Practice: Use secrets for authorization headers
        Authorization: Basic {{ secrets.servicenow_creds }}
      timeout: 30s</code></pre><p>2. Adding to cart</p><p>Once we have the asset details and the user&#x27;s preferences, we don&#x27;t just create a generic ticket. We use the ServiceNow Service Catalog API to programmatically add the specific item to a cart.</p><pre><code>YAML
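  # Note: the URL below is a simplified placeholder. The actual Service
  # Catalog "add to cart" endpoint depends on your instance and the catalog
  # item's sys_id.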
  - name: snow_post_add_item_to_cart
    type: http
    with:
      url: https://elasticdev.service-now.com/example
      method: POST
      headers:
        Accept: application/json
        Content-Type: application/json
        Authorization: Basic {{ secrets.servicenow_creds }}
      body: |
        {
            "sysparm_quantity": 1,
            "variables": {
              "caller_id_common": "{{ inputs.userid }}",
              "current_device": "{{ steps.snow_get_asset.output.data.result.sys_id }}",
              "laptop_keep_or_return": "{{ inputs.laptop-keep-or-return }}",
              "choose_your_laptop": "{{ inputs.laptop-choice }}",
              "shipping_address": "{{ inputs.preferred-address }}"
            }
        }</code></pre><p>3. Indexing the transaction</p><p>Finally, we want to keep a record of this transaction within Elasticsearch for analytics and future reference. We use the elasticsearch.index step to store the request details immediately after submission.</p><pre><code>YAML
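  # Assumption: a "snow_post_submit_order" step (omitted here for brevity;
  # the full YAML linked below includes it) submits the cart and returns the
  # request_id and sys_created_on values referenced in this step.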

  - name: index-submission-record
    type: elasticsearch.index
    with:
      index: laptop-refresh-submission-data
      id: "{{ steps.snow_post_submit_order.output.data.result.request_id }}"
      document:
        request-id: "{{ steps.snow_post_submit_order.output.data.result.request_id }}"
        user-id: "{{ inputs.userid }}"
        configuration-item: "{{ steps.snow_get_computer_data.output.data.result[0].sys_id }}"
        laptop-choice: "{{ inputs.laptop-choice }}"
        timestamp: "{{ steps.snow_post_submit_order.output.data.result.sys_created_on }}"</code></pre><p>For detailed workflow yaml, please refer <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/building-actionable-ai-automating-it-requests-with-agent-builder-and-one-workflow">here</a>.</p><h2><strong>The result</strong></h2><p>By stitching these components together, we create a seamless experience:</p><ol><li><strong>The user</strong> chats naturally with the agent to provide details.</li><li><strong>The agent</strong> structures this unstructured conversation into a JSON object.</li><li><strong>Workflow</strong> receives the JSON, validates the user&#x27;s current hardware via ServiceNow, creates the order, and indexes the result.</li></ol><p>This approach reduces a process that traditionally took users 5–10 minutes of form navigation into a quick conversation, while ensuring that IT operations retains full visibility and control.</p><p>Video demo: </p><a href="https://videos.elastic.co/watch/k7U1zBvuEJSSGvprkFnuGE">Watch video on Vidyard</a><h2><strong>Ready to build?</strong></h2><p>This pattern, using an agent for the interface and using Workflows for the execution, can be applied to almost any ITSM task, from password resets to software provisioning.</p><p>If you’re interested in trying this out, be sure to watch the <a href="https://www.youtube.com/watch?v=OzStbTUZqyw">DevFest talk</a> for the full context, and check out the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic AI Agent Builder documentation</a> to get started building your own agents today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch & LLMs, Part 1: Preparing for intelligent entity matching]]></title>
            <description><![CDATA[Learn what entity resolution is and how to prepare both sides of the entity resolution equation: your watch list and the articles you want to search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9908bbdc324f4187f057d54d7502055ee78f575e-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>&quot;The new Swift update is here!&quot;</p><p>When you see that headline, what do you think of? For a developer, it&#x27;s a call to action, time to dive into new syntax, concurrency models, and bug fixes for the Swift programming language. For a music fan, it&#x27;s a completely different story, a signal that Taylor Swift has just dropped a new album or is making a major announcement.</p><p>Your brain, in a fraction of a second, performs a remarkable feat of natural language processing (NLP). It doesn&#x27;t just read the word &quot;Swift&quot; in isolation; it uses the surrounding context (the headline&#x27;s source, your personal interests, and more) to resolve that single, ambiguous word to a unique, real-world entity.</p><p>In NLP, we call this ability to disambiguate <em>named entity resolution</em>, and it&#x27;s something humans do all the time. Natural language is inherently ambiguous, so we need to be able to map entities like &quot;Bill Gates&quot; to &quot;the founder of Microsoft&quot; and &quot;The Eras Tour&quot; to &quot;Taylor Swift&#x27;s concert tour&quot;. For humans, these connections come easily; for computers, not so much. Think how disappointed a Swiftie would be when they find out the article that their smart assistant recommended is actually about programming.</p><p>This same challenge becomes critical when you&#x27;re monitoring news articles for mentions of specific people or organizations. Imagine you&#x27;re tracking sanctioned entities or monitoring mentions of specific companies. You have a watch list with “Sakura Shipping Group” on it, and you want to know when articles mention the company. Simple enough, right? But what happens when an article refers to “Sakura Shipping” instead of the full legal name? Or uses an abbreviation like “SSG”? Or describes it indirectly as “a major Japanese maritime logistics firm”? Or mentions the company in Japanese, as “さくら海運グループ”? Your simple text matching won’t find these mentions, even though they all refer to the same organization. For compliance and risk monitoring use cases, missing a mention could have serious consequences. You need to catch every variation, every alias, every way an entity might be referred to.</p><p>This is the problem of <em>entity resolution</em>: identifying when different mentions in text refer to the same real-world entity and determining which entity that is. To solve this, we need a system that can handle <em>semantic search</em> (understanding meaning, not just keywords), <em>named entity recognition</em> (NER; extracting entities from text), and fast, scalable matching across millions of documents. That&#x27;s why we built this prototype on Elasticsearch. It provides built-in semantic search capabilities, integrated NER models, and the scalability needed for entity resolution.</p><p>In this series, we present an educational prototype for intelligent entity resolution that deliberately separates <em>retrieval</em> from <em>judgment and explanation</em>. Elasticsearch is used to efficiently narrow the search space by combining keyword, alias, and semantic (hybrid) search.
Once plausible entity candidates are identified, a large language model (LLM) is used to determine whether a candidate refers to the same real-world entity, and the model’s rationale is provided in natural language.</p><p>This division of responsibilities avoids treating LLMs as black-box retrievers, preserves explainability for sensitive use cases, and demonstrates a reusable design pattern for building transparent, Elasticsearch-native systems. We examine why this pattern is particularly effective for entity resolution, where ambiguity is common and explainability matters. The goal is not to present a production-ready solution but to teach the architectural principles behind building transparent entity resolution systems.</p><p><strong>Important note:</strong> This series presents an educational prototype that teaches Elasticsearch-native entity resolution using LLM judgments. We&#x27;ve made some simplifying choices (such as using Wikipedia for entity enrichment) to keep the system accessible for learning. Production systems might use different data sources, additional validation steps, or more sophisticated enrichment pipelines. The goal here is to demonstrate the core concepts and architecture, not to provide a production-ready system.</p><p>This series shows how we can help computers make these necessary connections while working with a 100% Elasticsearch-native architecture. We&#x27;ll explore three major innovations:</p><ul><li>Enhancing entities with contextual information.</li><li>Recognizing basic and complex entities with comprehensive NER.</li><li>Providing transparent reasoning through Elasticsearch candidate matching and LLM-powered explanations.</li></ul><p>We&#x27;ll also evaluate the system and identify an important optimization that improves the overall performance of the educational prototype.</p><p>In this first post of a four-part series, we’ll focus on preparing both sides of the entity resolution equation: your watch list and the articles you want to search.</p><h2>The problem: Why entity resolution requires preparation</h2><p>Entity resolution is hard because we face challenges on both sides of the matching equation. On one side, entities can be mentioned in many different ways. A company might be referred to as &quot;Microsoft&quot;, &quot;Microsoft Corporation&quot;, &quot;MSFT&quot;, or even &quot;the Redmond-based tech giant&quot;, depending on the context and writing style. On the other hand, we need to find these mentions in articles, even when they&#x27;re not obvious, such as when an article uses &quot;the Russian President&quot; or &quot;F.D.R.&quot; instead of a full name.</p><p><strong>Why we can&#x27;t just match names directly:</strong> Without proper preparation, matching becomes unreliable. You might think, &quot;But I can just search for &#x27;Tim Cook&#x27; in the text, right?&quot; Well, yes, if the article always mentions him by that exact name. But what about when it says &quot;Apple CEO&quot; instead? Or &quot;Timothy D. Cook&quot; (his full name)? Your simple text search won&#x27;t find those mentions, even though they all refer to the same person.</p><p>Without entity preparation, we can&#x27;t match &quot;the Russian President&quot; to &quot;Vladimir Putin&quot; because we don&#x27;t know what &quot;the Russian President&quot; means without context. We can&#x27;t match &quot;J.R.R. Tolkien&quot; to &quot;John Ronald Reuel Tolkien&quot; because we don&#x27;t know that they&#x27;re aliases for the same person. 
We can&#x27;t match &quot;Apple CEO&quot; to &quot;Tim Cook&quot; because we can&#x27;t understand the semantic relationship. Without indexing, finding matches means checking every entity in your watch list individually. This doesn&#x27;t scale: With thousands of entities, every match becomes slow and expensive. For sanctioned individuals monitoring, this means missing critical mentions that use aliases or alternate spellings, a failure that could have serious consequences.</p><p><strong>Why we can&#x27;t just search text directly:</strong> Entity extraction is hard for the same reason entity resolution is hard: Entities can be mentioned in many different ways. The same person might be referred to as &quot;J.R.R. Tolkien&quot;, &quot;the author of <em>The Lord of the Rings</em>&quot;, or just &quot;Tolkien&quot;, depending on the context. Without proper extraction, we can&#x27;t find these mentions in the text. We&#x27;d have to manually identify every entity mention, which doesn&#x27;t scale. We&#x27;d miss entities mentioned in nonstandard ways (for example, titles or abbreviations). We also wouldn&#x27;t capture the context around entity mentions, which is crucial for accurate matching.</p><p>The solution is a two-phase system that prepares both your watch list and the articles you want to search.</p><h2>The solution: Two-phase preparation system</h2><p>To solve entity resolution, we need to prepare both sides of the matching equation. First, we enrich and index our watch list entities to enable semantic search. Second, we extract entity mentions from articles using hybrid techniques that capture explicit and implicit references. Together, these phases create the foundation for intelligent entity matching.</p><h3>Phase 1: Preparing your watch list</h3><p>The solution to preparing entities is to <em>enrich</em> them with meaningful contexts. This enables our entity matching system to work effectively. We&#x27;ll explain how context helps in a bit, but let&#x27;s walk through the prototype&#x27;s simple implementation first.</p><p>Our watch list of entities may be provided in multiple formats. The Office of Foreign Assets Control (OFAC) provides sanctions lists that include first and last names, addresses, and identifying information, such as passport numbers, date and place of birth, and nationality information [1]. While this provides a good amount of context, in practice many of these fields are omitted when the values are unknown for the given entity. Some lists may be just a set of names. The most helpful lists for our purposes come out of the box with rich descriptions, as is often the case with commercial or curated datasets.</p><p>The three-component system used in the prototype starts by managing our entities and organizing their metadata. Since entity lists can vary in the amount of information they contain, our prototype is designed to work with whatever it receives. The JSON format supports entities with minimal information (just a name and type) or full information (with aliases, descriptions, metadata, and more). For example, an entity might be as simple as:</p><pre><code>{
  "name": "J.R.R. Tolkien",
  "entity_type": "person"
}</code></pre><p>Or it might include additional context:</p><pre><code>{
  "name": "J.R.R. Tolkien",
  "entity_type": "person",
  "description": "English writer and philologist, author of The Lord of the Rings",
  "aliases": ["John Ronald Reuel Tolkien", "J.R.R. Tolkien", "Tolkien"],
  "priority": "medium"
}</code></pre><p>The system handles both cases gracefully during enrichment. For the prototype, the enrichment process adds context from Wikipedia (specifically, the first paragraph of the entity&#x27;s Wikipedia page) for entities that don&#x27;t already have context [2]. This Wikipedia context helps with semantic matching, but it doesn&#x27;t add other fields, like aliases or full names; those must come from the original dataset. (In production, you might use other approaches for enrichment, including an agentic system that figures out where to find the context information for a given entity. This is beyond the scope of our prototype, but it’s an exciting feature we could add in the future.) Finally, we index the entities in Elasticsearch with semantic search capabilities, creating a searchable index that understands meaning rather than just text.</p><h4><strong>Key concepts: Semantic search and indexing</strong></h4><p><strong>What is semantic search?</strong> <em>Semantics</em> refers to the meanings of words and phrases. Figuring out meaning is usually easy for humans, but it&#x27;s much more challenging for computers to &quot;get&quot; because it requires a depth of understanding that’s difficult to program. Semantic search works by turning this challenge into a math problem, something that computers are very good at [3].</p><p>Think of semantic search like map coordinates for meaning. Just as latitude and longitude tell you where something is on a map, semantic embeddings tell you where something is in &quot;meaning space.&quot; Whereas traditional keyword search requires exact matches, semantic search relies on describing that &quot;location&quot; in a multidimensional vector space. For example, you might have the coordinates for a specific &quot;big red building&quot;. When you search for a &quot;small red building&quot;, semantic search looks in the &quot;neighborhood&quot; for similar concepts in the vector space. Your big red building might appear as a nearest neighbor, but the relevance score will be lower because parts of the meaning don&#x27;t match.</p><p>Getting back to our example, when you search for &quot;Apple CEO&quot;, semantic search can find &quot;Tim Cook&quot; because the semantic embeddings capture the meaning that both refer to the same person, even though they use completely different words. This capability is invaluable when monitoring for sanctioned individuals, as aliases and code names may be used to evade detection.</p><p><strong>Why Elasticsearch for entity indexing?</strong> Elasticsearch has built-in semantic search capabilities using embedding models, like EmbEddings from bidirEctional Encoder rEpresentations (E5) [4]. This means we can create an index that understands meaning, not just text. When we index our enriched entities, Elasticsearch creates semantic embeddings that capture each entity&#x27;s meaning, enabling intelligent matching later.</p><p><strong>What is the <em>mapping schema</em>?</strong> The mapping schema defines how we structure entity data in Elasticsearch. 
Our schema includes several field types optimized for different search strategies, including:</p><ul><li><strong>Keyword fields</strong> (<code>id,</code> <code>name.keyword</code>, <code>aliases.keyword</code>): For exact matching on entity names and aliases.</li><li><strong>Text fields</strong> (<code>name</code>, <code>name_lower</code>, <code>context</code>, <code>aliases</code>): For traditional, case-normalized full-text search with BM25 scoring.</li><li><strong>Semantic text fields</strong> (<code>name_semantic</code>, <code>context_semantic</code>): For vector-based similarity search using the multilingual-e5-small model.</li></ul><p>This hybrid mapping enables multiple search strategies: exact matching for precise names, keyword search for aliases, and semantic search for meaning-based matching. Even better, Elasticsearch supports <em>hybrid search</em>, allowing us to use both keyword and semantic search simultaneously.</p><h4><strong>Before and after entity preparation</strong></h4><p>Before entity preparation, you have a simple list without much context, possibly nothing more than a name: &quot;J.R.R. Tolkien&quot;. That&#x27;s it. You can only match exact text matches, which means you&#x27;ll miss &quot;John Ronald Reuel Tolkien&quot;, &quot;Tolkien&quot;, and any other variations. For sanctioned individuals, this means missing critical mentions that use aliases or alternate spellings.</p><p>After entity preparation, you have a rich, searchable index. &quot;Vladimir Putin&quot; is now enriched with Wikipedia context, and if your original dataset included aliases, like &quot;Путин&quot; or &quot;Vladimir Vladimirovich Putin&quot;, those are indexed as well. The entity also has semantic embeddings that capture its meaning. The Wikipedia context helps semantic search understand that &quot;The Russian President&quot; refers to Vladimir Putin, enabling that match. If &quot;Путин&quot; was provided as an alias in your original dataset, exact matching handles that. Semantic variations work because your semantic embeddings understand meaning. For sanctioned individuals, this comprehensive preparation ensures you catch every mention, regardless of how the name is written or what alternative name is used.</p><h3>Phase 2: Extracting entities from articles</h3><p>Now that we have a searchable watch list, we need to extract entity mentions from articles. This is where article processing comes in.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach entity extraction concepts. Production systems might use different NER models, custom extraction rules, or specialized extraction pipelines tailored to specific domains or languages.</p><p>We extract entities from articles using a hybrid NER approach that combines machine learning with pattern-based extraction. First, we process articles to prepare them for extraction. Then, we extract entities using a hybrid extraction approach that combines NER performed in Elasticsearch (using a deployed XLM-RoBERTa model) with pattern-based extraction to catch entities that NER might miss.</p><p>This hybrid extraction approach provides several benefits. NER automatically finds entity mentions in text, even when they&#x27;re not obvious. Pattern-based extraction catches entities that NER might miss, like titles and compound entities. We preserve the context around each entity mention, which helps with matching decisions later. 
The approach scales well, allowing us to process thousands of articles automatically, not just a few manually.</p><h4><strong>Key concepts: NER, pattern-based extraction, and hybrid extraction approach</strong></h4><p><strong>What is NER?</strong> <em>Named entity recognition</em> is a machine learning technique that identifies named entities in text. When we run NER on an article, it finds mentions like &quot;Microsoft&quot;, &quot;Seattle&quot;, and &quot;Washington&quot; and labels them as organization, location, or person entities.</p><p><strong>Why use NER in Elasticsearch?</strong> Using NER in Elasticsearch maintains our 100% Elasticsearch-native architecture, which simplifies the entity resolution prototype&#x27;s design. Instead of managing separate services for entity extraction and search, everything runs in one system. You can perform NER during document ingestion using inference pipelines, and the extracted entities are immediately available for indexing and searching. This unified approach reduces complexity, eliminates network calls between services, and makes deployment and management easier. The XLM-RoBERTa model is trained to recognize entities in multiple languages, so we can extract entities from articles in different languages without needing separate models for each language. For information on deploying NER models in Elasticsearch, see the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-ner-example">Elasticsearch NER documentation</a>.</p><p><strong>What is pattern-based extraction?</strong> <em>Pattern-based extraction</em> uses rules and patterns to find entities that NER might miss. For example, NER might not recognize &quot;the author of <em>The Lord of the Rings</em>&quot; as an entity mention, but pattern-based extraction can catch titles and roles like &quot;the CEO&quot; or &quot;the President&quot;. However, pattern-based extraction is language-specific. The patterns need to be defined for each language you want to support. This is a significant drawback for multilingual systems, but it&#x27;s acceptable for our educational prototype, which focuses on demonstrating the core concepts. Production systems might use language-specific pattern sets or alternative approaches for multilingual support.</p><p><strong>How do they work together?</strong> The hybrid extraction approach combines both techniques. NER finds obvious entity mentions like &quot;J.R.R. Tolkien&quot;, while pattern-based extraction catches variations that NER might miss, such as &quot;the author of <em>The Lord of the Rings</em>&quot;. Together, they provide comprehensive coverage of entity mentions in text.</p><p>When we extract entities from an article mentioning &quot;the author of <em>The Lord of the Rings</em>&quot;, we get:</p><ul><li>Text: &quot;author of <em>The Lord of the Rings</em>&quot;</li><li>Type: PERSON (from pattern-based extraction)</li><li>Confidence: 0.85</li><li>Context: &quot;The author of <em>The Lord of the Rings</em> published a new edition&quot;</li></ul><h4><strong>Before and after entity extraction</strong></h4><p>With NER-only extraction, we might find &quot;J.R.R. Tolkien&quot; and &quot;<em>The Lord of the Rings</em>&quot; in the article, but we&#x27;d miss &quot;the author of <em>The Lord of the Rings</em>&quot; because NER doesn&#x27;t recognize descriptive phrases as entity mentions.</p><p>With hybrid extraction, we find both &quot;J.R.R. 
Tolkien&quot; (from NER) and &quot;the author of <em>The Lord of the Rings</em>&quot; (from pattern-based extraction). This comprehensive coverage enables better matching later, since we can match both the name and the descriptive phrase to our watch list.</p><h2>What&#x27;s next: Matching entities to our watch list</h2><p>Now that we&#x27;ve prepared both sides of the entity resolution equation, we have everything we need for intelligent matching:</p><ul><li>A searchable watch list enriched with context and indexed for semantic search.</li><li>Extracted entity mentions from articles using hybrid NER.</li></ul><p>Preparation gives us the raw ingredients, but it doesn’t tell us which entity a mention actually refers to. In the next post, we&#x27;ll explore how to match these extracted entities to our watch list using semantic search and LLM-powered judgment that handles ambiguity and context transparently.</p><h2>Try it yourself</h2><p>Want to see the preparation process in action? Check out these notebooks for complete walkthroughs with real implementations, detailed explanations, and hands-on examples:</p><ul><li><strong><a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=3%20minutes%20ago-,01_entity_preparation_v3.ipynb,-Initial%20public%20lab">Entity preparation notebook</a>:</strong> Shows you exactly how to enrich entities with Wikipedia context, create semantic search indexes, and prepare your watch list for intelligent matching.</li><li><strong><a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=3%20minutes%20ago-,02_article_processing_v3.ipynb,-Initial%20public%20lab">Article processing notebook</a>:</strong> Shows you exactly how to extract entities from articles using hybrid NER, handle multilingual content, and process compound entities.</li></ul><p><strong>Remember:</strong> This is an educational prototype designed to teach the concepts. When building production systems, consider additional factors, like data source reliability, validation pipelines, error handling, monitoring, compliance requirements, domain-specific NER models, custom extraction rules, and quality validation that aren&#x27;t covered in this learning-focused prototype.</p><h2>References</h2><ol><li><a href="https://sanctionslist.ofac.treas.gov/Home/">OFAC Sanctions List Search</a></li><li>The datasets used for the prototype also use a special field, &#x27;explicit_context&#x27;, in lieu of getting the context from Wikipedia. We do this to control for the entity preparation step when we&#x27;re testing other components such as entity matching.</li><li><a href="https://www.elastic.co/blog/retrieval-augmented-generation-explained#semantic-search:~:text=what%20we%E2%80%99ll%20do.-,Turning%20semantic%20search%20into%20a%20math%20problem,-Let%E2%80%99s%20begin%20with">The big ideas behind retrieval augmented generation</a></li><li><a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-e5">E5 in Elasticsearch</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From vectors to keywords: Elasticsearch hybrid search in LangChain]]></title>
            <description><![CDATA[Learn how to use hybrid search in LangChain via its Elasticsearch integrations, with complete Python and JavaScript examples.]]></description>
            <link>https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Margaret Gu, Eyo Eshetu]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9159af2b07b88f288e5c7cb719c8dcbe5d3b37d6-1080x608.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Elasticsearch hybrid search is available for LangChain across our <a href="https://github.com/langchain-ai/langchain-elastic">Python</a> and <a href="https://github.com/langchain-ai/langchainjs">JavaScript</a> integrations. Here we’ll discuss what hybrid search is and when it can be useful, and we’ll run through some simple examples to get started.</p><p>We’re also planning to support hybrid search in the community-driven <a href="https://github.com/langchain4j/langchain4j">Java integration</a> very soon.</p><h2><strong>What is hybrid search?</strong></h2><p><em>Hybrid search</em> is an information retrieval approach that combines <em>keyword-based full-text search</em> (lexical matching) with <em>semantic search</em> (vector similarity). Practically, it means a query can match documents because they contain the right terms and/or because they express the right meaning (even if the wording differs). In simple terms, you can think of it like this:</p><ul><li>Lexical retrieval: “Do these documents contain the words I typed (or related words)?”</li><li>Semantic retrieval: “Do these documents mean something similar to what I typed?”</li></ul><p>These two retrieval methods produce scores on different scales, so hybrid search systems typically use a fusion strategy to merge them into one ranking, for example, using <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">reciprocal rank fusion</a> (RRF).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a36692581ec5adb54d3c517e171b6d2f372efd92-1249x514.png" style="width: 100%;" /><p>In the figure above, we show an example: <a href="https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables">BM25</a> (keyword search) returns Docs A, B, and C, while semantic search returns Docs X, A, and B. The RRF algorithm then combines these two result lists into the final ranking: Doc A, Doc B, Doc X, and Doc C. With hybrid search, Doc C is included in the results thanks to BM25.</p><h2><strong>Why hybrid search matters</strong></h2><p>If you’ve built search or retrieval-augmented generation (RAG) features in production, you’ve probably seen the same failure modes show up again and again: </p><ul><li>Keyword search can be too literal. If the user doesn’t use the exact terms that appear in your documents, relevant content gets buried or missed.</li><li>Semantic search can be too fuzzy. It’s great at meaning, but it can also return results that feel related while missing a critical constraint, like a product name, an error code, or a specific phrase the user actually typed.</li></ul><p>Hybrid search exists because real user queries in production environments usually need <em>both</em>.</p><p>Next we’ll dive into how you get started with hybrid search in the LangChain integration for <a href="https://github.com/langchain-ai/langchain-elastic">Python</a> and <a href="https://github.com/langchain-ai/langchainjs">JavaScript</a>. If you want to read more about hybrid search, check out <strong><a href="https://www.elastic.co/what-is/hybrid-search">What is hybrid search?</a> </strong>and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-hybrid-search"><strong>When hybrid search truly shines</strong></a>.</p><h3>Setting up a local Elasticsearch instance</h3><p>Before running the examples, you&#x27;ll need Elasticsearch running locally.
The easiest way is using the <a href="https://github.com/elastic/start-local?tab=readme-ov-file"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>After starting, you&#x27;ll have:</p><ul><li>Elasticsearch at http://localhost:9200.</li><li>Kibana at http://localhost:5601.</li></ul><p>Your API key is stored in the .env file (under the elastic-start-local folder) as <code>ES_LOCAL_API_KEY</code>.</p><h2>Getting started with hybrid search in LangChain (Python and JavaScript)</h2><p>The dataset is a CSV with information on 1,000 science fiction movies, taken from an IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>. This demo uses a subset of the data, which has been cleaned. You can download the dataset used for this article from our <a href="https://gist.github.com/ssh-esh/103fb8220de3b0e045393760c2f36575">GitHub gist</a>, along with the full code for this demo.</p><h3>Step 1: Install what you need.</h3><p>First, you’ll need the LangChain Elasticsearch integration and Ollama for embeddings. (You can also use some other embedding model if you wish.)</p><p><strong>In Python:</strong></p><pre><code>pip install langchain-elasticsearch langchain-ollama</code></pre><p><strong>In JavaScript:</strong></p><pre><code>npm install @langchain/community @langchain/ollama @elastic/elasticsearch csv-parse</code></pre><h3>Step 2: Configure your connection and dataset path.</h3><p><strong>In Python:</strong></p><p>At the top of the script, we set:</p><ul><li>Where Elasticsearch is (<code>ES_LOCAL_URL</code>).</li><li>How to authenticate (<code>ES_LOCAL_API_KEY</code>).</li><li>Which demo index name to use (<code>INDEX_NAME</code>).</li><li>Which CSV file we’ll ingest (<code>scifi_1000.csv</code>).</li></ul><pre><code># Imports used throughout the Python demo script
import csv
import os
from pathlib import Path

from elasticsearch import Elasticsearch
from langchain_elasticsearch import ElasticsearchStore
from langchain_ollama import OllamaEmbeddings

ES_URL = os.getenv("ES_LOCAL_URL", "http://localhost:9200")
ES_API_KEY = os.getenv("ES_LOCAL_API_KEY")
INDEX_NAME = "scifi-movies-hybrid-demo" 
CSV_PATH = Path(__file__).with_name("scifi_1000.csv")</code></pre><p><strong>In JavaScript:</strong></p><p>Notes for JavaScript:</p><ul><li>JavaScript uses <code>process.env</code> instead of <code>os.getenv</code>.</li><li>Path resolution requires <code>fileURLToPath</code> and <code>dirname</code> because ES modules don’t provide <code>__dirname</code>.</li><li>The class is called <code>ElasticVectorSearch</code> (not <code>ElasticsearchStore</code> as in Python).</li></ul><pre><code>import { Client } from "@elastic/elasticsearch";
import { OllamaEmbeddings } from "@langchain/ollama";
import {
  ElasticVectorSearch,
  HybridRetrievalStrategy,
} from "@langchain/community/vectorstores/elasticsearch";
import { parse } from "csv-parse/sync";
import { readFileSync } from "fs";
import { dirname, join } from "path";
import { fileURLToPath } from "url";

const __dirname = dirname(fileURLToPath(import.meta.url));

const ES_URL = process.env.ES_LOCAL_URL || "http://localhost:9200";
const ES_API_KEY = process.env.ES_LOCAL_API_KEY;
const INDEX_NAME = "scifi-movies-hybrid-demo";
const CSV_PATH = join(__dirname, "scifi_1000.csv");</code></pre><p>We can now create the client.</p><p>In Python:</p><pre><code>es = Elasticsearch(ES_URL, api_key=ES_API_KEY)</code></pre><p>In JavaScript:</p><pre><code>const client = new Client({
  node: ES_URL,
  auth: ES_API_KEY ? { apiKey: ES_API_KEY } : undefined,
});</code></pre><h3>Step 3: Ingest the dataset, and then compare vector-only vs. hybrid.</h3><h4>Step 3a: Read the CSV and build what we index.</h4><p>We build three lists:</p><ul><li><code>texts</code>: The actual text that will be embedded + searched.</li><li><code>metadata</code>: Structured fields stored alongside the document.</li><li><code>ids</code>: Stable IDs (so Elasticsearch can dedupe if needed).</li></ul><p><strong>In Python:</strong></p><pre><code># --- Ingest dataset ---
texts: list[str] = []
metadatas: list[dict] = []
ids: list[str] = []

with CSV_PATH.open(newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        movie_id = (row.get("movie_id") or "").strip()
        movie_name = (row.get("movie_name") or "").strip()
        year = (row.get("year") or "").strip()
        genre = (row.get("genre") or "").strip()
        description = (row.get("description") or "").strip()
        director = (row.get("director") or "").strip()

        # This text is both:
        #  - embedded (vector search)
        #  - keyword-matched (BM25 in hybrid mode)
        text = "\n".join(
            [
                f"{movie_name} ({year})" if year else movie_name,
                f"Director: {director}" if director else "Director: (unknown)",
                f"Genres: {genre}" if genre else "Genres: (unknown)",
                f"Description: {description}" if description else "Description: (missing)",
            ]
        )
        texts.append(text)
        metadatas.append(
            {
                "movie_id": movie_id or None,
                "movie_name": movie_name or None,
                "year": year or None,
                "genre": genre or None,
                "director": director or None,
            }
        )
        ids.append(movie_id or movie_name)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>async function main() {
  // --- Ingest dataset ---
  const texts = [];
  const metadatas = [];
  const ids = [];

  const csvContent = readFileSync(CSV_PATH, "utf-8");
  const records = parse(csvContent, {
    columns: true,
    skip_empty_lines: true,
  });

  for (const row of records) {
    const movieId = (row.movie_id || "").trim();
    const movieName = (row.movie_name || "").trim();
    const year = (row.year || "").trim();
    const genre = (row.genre || "").trim();
    const description = (row.description || "").trim();
    const director = (row.director || "").trim();

    // This text is both:
    //  - embedded (vector search)
    //  - keyword-matched (BM25 in hybrid mode)
    const text = [
      year ? `${movieName} (${year})` : movieName,
      director ? `Director: ${director}` : "Director: (unknown)",
      genre ? `Genres: ${genre}` : "Genres: (unknown)",
      description ? `Description: ${description}` : "Description: (missing)",
    ].join("\n");

    texts.push(text);
    metadatas.push({
      movie_id: movieId || null,
      movie_name: movieName || null,
      year: year || null,
      genre: genre || null,
      director: director || null,
    });
    ids.push(movieId || movieName);
  }</code></pre><p><strong>What’s important here:</strong></p><ul><li>We don’t embed only the description. We embed a combined text block (title/year + director + genre + description). That makes results easier to print and sometimes improves retrieval.</li><li>The same text is what the lexical side uses, too (in hybrid mode), because it’s indexed as searchable text.</li></ul><h4>Step 3b: Add texts to Elasticsearch using LangChain.</h4><p>This is the indexing step. Here we embed texts and write them to Elasticsearch.</p><p>For asynchronous applications, please use <a href="https://reference.langchain.com/python/integrations/langchain_elasticsearch/#langchain_elasticsearch._async.vectorstores.AsyncElasticsearchStore"><code>AsyncElasticsearchStore</code></a> with the same API.</p><p>You can find our <a href="https://reference.langchain.com/python/integrations/langchain_elasticsearch/">reference docs</a> for both the sync and async versions of ElasticsearchStore, along with more parameters for advanced fine-tuning RRF.</p><p><strong>In Python:</strong></p><pre><code>print(f"Ingesting {len(texts)} movies into '{INDEX_NAME}' from '{CSV_PATH.name}'...") 
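# Assumes a local Ollama server is running and the "llama3" model has been
# pulled (e.g., `ollama pull llama3`), since OllamaEmbeddings calls it to
# embed each text.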

vector_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    embedding=OllamaEmbeddings(model="llama3"),
    es_url=ES_URL,
    es_api_key=ES_API_KEY,
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=False),
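    # hybrid=False -> vector-only (kNN) retrieval over the embedded text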
)

# This is the indexing step. We embed the texts and add them to Elasticsearch
vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  console.log(
    `Ingesting ${texts.length} movies into '${INDEX_NAME}' from 'scifi_1000.csv'...`
  );

  const embeddings = new OllamaEmbeddings({ model: "llama3" });

  // Vector-only store (no hybrid)
  const vectorStore = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
  });

  // This is the indexing step. We embed the texts and add them to Elasticsearch
  await vectorStore.addDocuments(
    texts.map((text, i) => ({
      pageContent: text,
      metadata: metadatas[i],
    })),
    { ids }
  );</code></pre><h4>Step 3c: Create another store for hybrid search.</h4><p>We create another ElasticsearchStore object pointing at the same index but with different retrieval behavior: <code>hybrid=False</code> is <strong><em>vector-only</em></strong> search and <code>hybrid=True</code> is <strong><em>hybrid search</em></strong> (BM25 + kNN, fused with RRF).</p><p><strong>In Python:</strong></p><pre><code># Since we are using the same INDEX_NAME we can avoid adding texts again 
# This ElasticsearchStore will be used for hybrid search

hybrid_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    embedding=OllamaEmbeddings(model="llama3"),
    es_url=ES_URL,
    es_api_key=ES_API_KEY,
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True),
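    # hybrid=True -> BM25 keyword search + kNN, fused with reciprocal rank fusion (RRF)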
)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  // Since we are using the same INDEX_NAME we can avoid adding texts again
  // This ElasticVectorSearch will be used for hybrid search
  const hybridStore = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
    strategy: new HybridRetrievalStrategy(),
  });

  // With custom RRF parameters
  const hybridStoreCustom = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
    strategy: new HybridRetrievalStrategy({
      rankWindowSize: 100,  // default: 100
      rankConstant: 60,     // default: 60
      textField: "text",    // default: "text"
    }),
  });</code></pre><h4>Step 3d: Run the same query both ways, and print results.</h4><p>As an example, let’s run the query <em>“Find movies where the main character is stuck in a time loop and reliving the same day.&quot; </em>and compare the results from hybrid search and vector search.</p><p><strong>In Python:</strong></p><pre><code>query = "Find movies where the main character is stuck in a time loop and reliving the same day."
k = 5

print(f"\n=== Query: {query} ===")

vec_docs = vector_store.similarity_search(query, k=k)
hyb_docs = hybrid_store.similarity_search(query, k=k)

print("\nVector search (kNN) top results:")
for i, doc in enumerate(vec_docs, start=1):
    print(f"{i}. {(doc.page_content or '').splitlines()[0]}")

print("\nHybrid search (BM25 + kNN + RRF) top results:")
for i, doc in enumerate(hyb_docs, start=1):
    print(f"{i}. {(doc.page_content or '').splitlines()[0]}")</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  const query =
    "Find movies where the main character is stuck in a time loop and reliving the same day.";
  const k = 5;

  console.log(`\n=== Query: ${query} ===`);

  const vecDocs = await vectorStore.similaritySearch(query, k);
  const hybDocs = await hybridStore.similaritySearch(query, k);

  console.log("\nVector search (kNN) top results:");
  vecDocs.forEach((doc, i) => {
    console.log(`${i + 1}. ${(doc.pageContent || "").split("\n")[0]}`);
  });

  console.log("\nHybrid search (BM25 + kNN + RRF) top results:");
  hybDocs.forEach((doc, i) => {
    console.log(`${i + 1}. ${(doc.pageContent || "").split("\n")[0]}`);
  });
}

main().catch(console.error);</code></pre><p><strong>Example output</strong></p><pre><code>Ingesting 1000 movies into 'scifi-movies-hybrid-demo' from 'scifi_1000.csv'...

=== Query: Find movies where the main character is stuck in a time loop and reliving the same day. ===

Vector search (kNN) top results:
1. The Witch: Part 1 - The Subversion (2018)
2. Divinity (2023)
3. The Maze Runner (2014)
4. Spider-Man (2002)
5. Spider-Man: Into the Spider-Verse (2018)

Hybrid search (BM25 + kNN + RRF) top results:
1. Edge of Tomorrow (2014)
2. The Witch: Part 1 - The Subversion (2018)
3. Boss Level (2020)
4. Divinity (2023)
5. The Maze Runner (2014)</code></pre><h2><strong>Why these results? </strong></h2><p>This query (“time loop / reliving the same day”) is a great case where hybrid search tends to shine because the dataset contains literal phrases that BM25 can match and vectors can still capture meaning.</p><ul><li>Vector-only (kNN) embeds the query and tries to find semantically similar plots. Using a broad sci‑fi dataset, this can drift into “trapped / altered reality / memory loss / high-stakes sci‑fi” even when there’s no time-loop concept. That’s why results like “The Witch: Part 1 – The Subversion” (amnesia) and “The Maze Runner” (trapped/escape) can appear.</li><li>Hybrid (BM25 + kNN + RRF) rewards documents that match both keywords and meaning. Movies whose descriptions explicitly mention “time loop” or “relive the same day” get a strong lexical boost, so titles like “Edge of Tomorrow” (relive the same day over and over again…) and “Boss Level” (trapped in a time loop that constantly repeats the day…) rise to the top.</li></ul><p>Hybrid search doesn’t guarantee that every result is perfect. It balances lexical and semantic signals so you may still see some non-time-loop sci‑fi in the tail of the top‑k.</p><p>The main takeaway is that hybrid search helps anchor semantic retrieval with exact textual evidence when the dataset contains those keywords.</p><h2>Full code example</h2><p>You can find our full demo code in Python and JavaScript, as well as the dataset used, hosted on <a href="https://gist.github.com/ssh-esh/103fb8220de3b0e045393760c2f36575">GitHub gist</a>.</p><h2>Conclusion</h2><p>Hybrid search provides a pragmatic and powerful retrieval strategy by combining traditional BM25 keyword search with modern vector similarity into a single, unified ranking. Instead of choosing between lexical precision and semantic understanding, you get the best of both worlds, without adding significant complexity to your application.</p><p>In real-world datasets, this approach consistently yields results that feel more intuitively correct. Exact term matches help anchor results to the user’s explicit intent, while embeddings ensure robustness against paraphrasing, synonyms, and incomplete queries. This balance is especially valuable for noisy, heterogeneous, or user-generated content, where relying on only one retrieval method often falls short.</p><p>In this article, we demonstrated how to use hybrid search in LangChain through its Elasticsearch integrations, with complete examples in both Python and JavaScript. We’re also contributing to other open-source projects, such as <a href="https://github.com/langchain4j/langchain4j/pull/4069">LangChain4j</a>, to extend hybrid search support with Elasticsearch.</p><p>We believe hybrid search will be a key capability for generative AI (GenAI) and agentic AI applications, and we plan to continue collaborating with libraries, frameworks, and programming languages across the ecosystem to make high-quality retrieval more accessible and robust.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to defend your RAG system from context poisoning]]></title>
            <description><![CDATA[How context engineering techniques prevent context poisoning in LLM responses.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-poisoning-llm</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-poisoning-llm</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32e26f6dc2c8fc41d4c1d981eff6487eb8423fa5-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>With the recent release of models like Claude Sonnet 4.5, the Gemini 3 series, and GPT-5 featuring million-token context windows, there’s a growing misconception that context management is becoming simpler. If a large language model (LLM) can process millions of tokens at once, does it really matter what information we provide?</p><p>The reality is precisely the opposite. <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview"><em>Context engineering</em></a>, the practice of managing what information reaches your LLM, is more critical than ever. Large context windows don’t eliminate the need for precision; they amplify it. With more context comes exponentially more opportunities for error, hallucinations, and irrelevant information to contaminate your LLM reasoning process.</p><p>Whether you’re working with retrieval-augmented generation (RAG), tool outputs, or memory systems, effective context engineering isn’t about providing more information but about providing <em>the right</em> information. That’s where Elasticsearch comes in, serving as your context engineering platform.</p><p>In this article, we’ll explore what context poisoning is, how it manifests across different types of memory, and how Elasticsearch’s RAG capabilities provide defense at every stage of the retrieval pipeline, from ingestion to composition, ensuring your LLM receives clean, relevant, and reliable context.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d832932f12723226ecc98e2a9b8d6bfc008829bf-1080x571.png" style="width: 100%;" /><h2>What is context poisoning?</h2><p><em>Context poisoning</em> occurs when compromised, outdated, or irrelevant information enters an LLM’s context window, leading to degraded responses, hallucinations, or perpetuated errors. Once corrupted or incorrect information enters the context window, it propagates into answers. The LLM references it as truth, creating cascading errors across the conversation.</p><p>This poisoning can happen at multiple stages of the LLM lifecycle (such as training), but our focus is on the retrieval and composition stages. Although adversarial attacks, like prompt injection, also pose risks, this article focuses on the operational patterns that teams encounter most frequently in production environments.</p><h3><strong>Operational understanding</strong></h3><p>Context poisoning often happens for reasons like:</p><ul><li><strong>Context rot:</strong> Information becomes outdated but remains in your knowledge base without being updated or deleted.</li><li><strong>Context overflow: </strong>Too much information overwhelms the LLM&#x27;s attention, crowding out the truly relevant context and leading to answers that miss key information.</li><li><strong>Conflicting information:</strong> Multiple sources provide contradictory data, confusing the model.</li><li><strong>Semantic noise: </strong>Semantically similar but contextually irrelevant content dilutes relevance.</li><li><strong>Malicious injection: </strong>Content deliberately inserted by attackers into knowledge bases, including prompt injections or manipulated data.</li></ul><p>Understanding these patterns is the first step toward building robust defenses. Let’s examine each pattern and how Elasticsearch helps you address them.
You can follow along with the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/context-poisoning">supporting notebook</a>.</p><h2>Types of context poisoning</h2><h3><strong>Temporal degradation</strong></h3><p>Over time, information in your knowledge base becomes outdated, and without proper management, stale content continues to be retrieved and presented to your LLM as current truth. This is especially problematic in industries where information changes frequently, like product documentation, pricing, regulations, or news.</p><h4><strong>Impact</strong></h4><p>Your LLM provides outdated advice, references deprecated features, or contradicts current reality, eroding user trust.</p><h4><strong>Solutions: Temporal filtering in hybrid search</strong></h4><p>Elasticsearch’s date-based query capabilities ensure your RAG system prioritizes recent and relevant information through explicit temporal filters.</p><p><strong>Example: Product documentation search with time filtering</strong></p><p>A user asks your chatbot about authentication setup. The authentication flow changed significantly six months ago, so it’s important to return only documents updated within the last six months.</p><h4><strong>Without temporal filtering</strong></h4><pre><code>POST product-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "how to configure OAuth authentication"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "configure OAuth authentication",
                "fields": ["title^2", "content"]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "last_updated", "version", "content_snippet"]
}</code></pre><h4><strong>Response without filtering: Contradictory results</strong></h4><p>The LLM receives three different methods for OAuth configuration: the current security API (9.x), legacy realm settings (7.x), and the deprecated shield plugin (6.x). This contradictory context leads to confused or misleading responses:</p><pre><code>{
  "hits": {
    "total": { "value": 23 },
    "max_score": 24.5,
    "hits": [
      {
        "_id": "doc-oauth-2025",
        "_score": 24.5,
        "_source": {
          "title": "OAuth 2.0 Authentication Setup",
          "last_updated": "2025-10-15",
          "version": "9.x",
          "content_snippet": "To configure OAuth 2.0 authentication in Elasticsearch 9.x, use the new security API..."
        }
      },
      {
        "_id": "doc-oauth-2023",
        "_score": 23.8,
        "_source": {
          "title": "OAuth Authentication Configuration",
          "last_updated": "2023-04-20",
          "version": "7.x",
          "content_snippet": "Configure OAuth using the legacy realm settings in elasticsearch.yml..."
        }
      },
      {
        "_id": "doc-oauth-deprecated",
        "_score": 22.9,
        "_source": {
          "title": "Setting Up OAuth (Deprecated)",
          "last_updated": "2022-11-10",
          "version": "6.x",
          "content_snippet": "Use the shield plugin to configure OAuth authentication..."
        }
      }
    ]
  }
}</code></pre><h4><strong>With temporal filtering</strong></h4><p>Add a filter to restrict results to documents updated within the last six months:</p><pre><code>POST product-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "how to configure OAuth authentication"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "configure OAuth authentication",
                "fields": ["title^2", "content"]
              }
            }
          }
        }
      ],
      "filter": [
        {"range": {"last_updated": {"gte": "now-6M"}}},
        {"term": {"status": "published"}}
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "last_updated", "version", "content_snippet"],
  "size": 5
}</code></pre><h4><strong>What this hybrid search query does</strong></h4><ul><li><strong>Semantic search</strong> (<code>semantic</code>) captures related concepts and context using the <code>content_semantic</code> field.</li><li><strong>Lexical search</strong> (<code>multi_match</code>) matches exact keywords like “OAuth” with field boosting <code>title^2</code>.</li><li><strong>Reciprocal rank fusion (RRF)</strong> combines both result sets with balanced reranking, retrieving the most relevant results.</li><li><strong>Temporal filter</strong> ensures only documents updated within the last six months are retrieved.</li><li><strong>Status filter</strong> restricts results to published documents, excluding drafts or deprecated content.</li></ul><h4><strong>Response with temporal filtering: Consistent results</strong></h4><p>Temporal filtering eliminates the outdated documents, leaving only current documentation for version 9.x. The LLM now receives consistent context and generates confident, accurate responses:</p><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "OAuth 2.0 Authentication Setup",
          "last_updated": "2026-01-15",
          "version": "9.x",
          "content_snippet": "Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Management > Security."
        }
      },
      {
        "_source": {
          "title": "OAuth Provider Configuration",
          "last_updated": "2025-12-20",
          "version": "9.x",
          "content_snippet": "Configure Okta, Azure AD, Auth0 via security API with OIDC auto-discovery."
        }
      }
    ]
  }
}</code></pre><h4><strong>Relative versus absolute time filters</strong></h4><p><strong>Relative filtering </strong>(recommended for most use cases):</p><pre><code>"filter": [
  {
    "range": {
      "last_updated": {
        "gte": "now-1y"
      }
    }
  }
]</code></pre><p><strong>Absolute filtering </strong>(for specific time ranges):</p><pre><code>"filter": [
  {
    "range": {
      "last_updated": {
        "gte": "2025-01-01",
        "lte": "2025-12-31"
      }
    }
  }
]</code></pre><h4><strong>Impact on LLM response quality</strong></h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0c225743cc92df873bd8685b5073fe749f4d2da6-1600x414.png" style="width: 100%;" /><ul><li><strong>Without filtering: </strong>LLM receives contradictory guidance from 2023–2025, producing uncertain responses mixing deprecated and current methods.</li><li><strong>With temporal filtering: </strong>LLM receives only recent documentation, generating confident responses based on current best practices.</li></ul><h3><strong>Information conflicts</strong></h3><p>When your RAG system retrieves documentation for features that behave differently across deployment types, versions, or configurations, conflicting information can confuse the LLM about which guidance applies to the user’s specific context.</p><h4><strong>Impact</strong></h4><p>The LLM spends extra tokens and reasoning trying to determine which information is correct, becoming more prone to errors and hallucinations.</p><h4><strong>Solutions: Hybrid search with metadata boosting</strong></h4><p>Elasticsearch’s <code>bool</code> query with a <code>should</code> clause lets you apply boosts that prioritize documents matching specific metadata, ensuring deployment-specific or version-specific documentation appears first in the context window. For query syntax details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Bool query reference</a>.</p>
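<p>Boosting like this assumes the deployment metadata is indexed as structured fields. Here’s a minimal mapping sketch; the field names mirror the query below:</p><pre><code>PUT platform-docs
{
  "mappings": {
    "properties": {
      "title":             { "type": "text" },
      "content":           { "type": "text" },
      "content_semantic":  { "type": "semantic_text" },
      "content_snippet":   { "type": "text" },
      "deployment_type":   { "type": "keyword" },
      "doc_status":        { "type": "keyword" },
      "feature_supported": { "type": "boolean" }
    }
  }
}</code></pre><h4><strong>Example: Deployment-specific feature documentation</strong></h4><p>A user asks, “How do I configure custom users in serverless?” Your knowledge base contains information about cloud, self-hosted, and managed deployments. With proper metadata prioritization, retrieval surfaces clear signals about feature availability, and the LLM provides correct guidance:</p><pre><code>POST platform-docs/_search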
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "bool": {
                "must": [
                  {
                    "multi_match": {
                      "query": "How do I configure custom users in serverless?",
                      "fields": ["title^2", "content"]
                    }
                  }
                ],
                "should": [
                  {"term": {"deployment_type": {"value": "serverless", "boost": 3.0}}},
                  {"term": {"doc_status": {"value": "current", "boost": 2.0}}}
                ]
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "How do I configure custom users in serverless?"
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "deployment_type", "feature_supported", "content_snippet"],
  "size": 5
}</code></pre><h4><strong>What this query does</strong></h4><ul><li><code>must</code> clause: All documents must match “How do I configure custom users in serverless?”</li><li><code>should</code> clauses with explicit boosting:<ul><li>Documents with <code>deployment_type: "serverless"</code> receive a 3x boost.</li><li>Documents with <code>doc_status: "current"</code> receive a 2x boost.</li></ul></li><li>Semantic search runs in parallel to capture conceptual matches.</li><li>RRF combines lexical (with metadata boosting) and semantic results to get the best of both approaches.</li></ul><h4><strong>Expected response:</strong></h4><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "Authentication in Serverless",
          "deployment_type": "serverless",
          "feature_supported": false,
          "content_snippet": "Custom authentication not available in Serverless. Use SSO with your identity provider."
        }
      },
      {
        "_source": {
          "title": "User Management in Serverless",
          "deployment_type": "serverless",
          "feature_supported": false,
          "content_snippet": "Direct user creation not supported in Serverless. Use your organization's IdP."
        }
      },
      {
        "_source": {
          "title": "Role-Based Access in Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Configure roles in Serverless console. Roles sync with SSO provider groups."
        }
      },
      {
        "_source": {
          "title": "API Keys in Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Create API keys for programmatic Serverless access. Keys inherit user permissions."
        }
      },
      {
        "_source": {
          "title": "SSO Configuration for Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Configure SSO in Serverless via Cloud console with SAML 2.0 or OIDC."
        }
      }
    ]
  }
}</code></pre><h4><strong>How metadata boosting resolves conflicts</strong></h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43a9227fc651ebd6ed6f0058972e0dac502aa7c5-955x1600.png" style="width: 100%;" /><h4><strong>Impact on LLM response quality</strong></h4><ul><li><strong>Without metadata boosting: </strong>The context window receives equal-weight documents from all deployment types. The LLM produces vague responses that hedge between possibilities, failing to clearly state deployment-specific limitations.</li><li><strong>With metadata boosting (3x): </strong>Serverless-specific documentation dominates the top results. The LLM generates direct answers about feature unavailability and provides actionable alternatives while maintaining the cross-deployment context for follow-up questions.</li></ul><h3><strong>Semantic noise</strong></h3><p>Vector similarity search can retrieve documents that are semantically related but contextually irrelevant to the user’s need. This “semantic drift” occurs when embeddings capture surface similarity without understanding the query intent. When your context window fills with irrelevant information, the LLM’s ability to generate precise answers declines.</p><h4><strong>Impact</strong></h4><p>The LLM receives correct information that doesn’t answer the question, wasting the context window and lowering the quality of the provided answer.</p><h4><strong>Solution: Hybrid search</strong></h4><p>Elasticsearch hybrid search combines lexical precision with semantic understanding, using explicit product filters to eliminate cross-product drift while maintaining conceptual recall.</p><h4><strong>Example: Technical documentation search</strong></h4><p>A developer searches for “Elastic Agent configuration,” and your knowledge base contains both the Elastic Agent (Elastic Observability) and the Elastic Agent Builder documentation. Both use the word "agent" prominently, making them semantically similar.</p><p>Let’s search for agent configuration documentation:</p><pre><code>POST elastic-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "agent configuration logs metrics collection",
                "fields": ["title^3", "content", "tags^2"],
                "type": "best_fields"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "configuring agents to collect logs and metrics from hosts"
              }
            }
          }
        }
      ],
      "filter": [
        {"terms": {"product": ["observability", "elastic-agent"]}},
        {"term": {"doc_type": "configuration"}}
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "product", "tags", "url"],
  "size": 5
}</code></pre><p>This hybrid query:</p><ul><li><strong>Lexical component</strong> (<code>multi_match</code>) ensures exact keyword matches for "agent", "configuration", "logs", "metrics", and "collection".</li><li><strong>Field boosting</strong> (<code>title^3</code>, <code>tags^2</code>) prioritizes documents where terms appear in important fields.</li><li><strong>Semantic component</strong> captures conceptual relationships and the intent behind “configuring data collection agents”.</li><li><strong>RRF</strong> merges both result sets with balanced ranking using <code>rank_constant: 20</code>.</li><li><strong>Product filter</strong> (<code>product</code>) restricts results to the Elastic Observability and Elastic Agent domains, eliminating Agent Builder docs entirely.</li><li><strong>Doc type filter</strong> (<code>doc_type: "configuration"</code>) restricts results to configuration documentation, eliminating semantic drift toward unrelated content types.</li></ul><p><strong>Expected response:</strong><br/></p><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "Elastic Agent Input Configuration",
          "product": "elastic-agent",
          "tags": ["inputs", "logs", "metrics", "configuration"],
          "url": "/docs/elastic-agent/inputs"
        }
      },
      {
        "_source": {
          "title": "Configure Elastic Agent for Log and Metric Collection",
          "product": "elastic-agent",
          "tags": ["configuration", "logs", "metrics", "observability"],
          "url": "/docs/elastic-agent/configure"
        }
      },
      {
        "_source": {
          "title": "Agent Policies and Integrations",
          "product": "observability",
          "tags": ["policies", "integrations", "fleet"],
          "url": "/docs/fleet/policies"
        }
      },
      {
        "_source": {
          "title": "Configuring Agent Outputs",
          "product": "elastic-agent",
          "tags": ["outputs", "elasticsearch", "logstash"],
          "url": "/docs/elastic-agent/outputs"
        }
      },
      {
        "_source": {
          "title": "Manage Elastic Agents with Fleet",
          "product": "observability",
          "tags": ["fleet", "agent-management", "deployment"],
          "url": "/docs/fleet/manage-agents"
        }
      }
    ]
  }
}</code></pre><h4><strong>Why hybrid search works</strong></h4>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Search type</td><td style="border: 1px solid #ccc; padding: 4px;">Strengths</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Lexical only</td><td style="border: 1px solid #ccc; padding: 4px;">Precise keyword matching</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Semantic only</td><td style="border: 1px solid #ccc; padding: 4px;">Captures semantic meaning</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hybrid search</td><td style="border: 1px solid #ccc; padding: 4px;">Precision and recall, intent understanding</td>
                    </tr>
              </tbody>
            </table>
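<p>When even hybrid retrieval lets marginally relevant documents through, a semantic reranker can reorder the candidates before they reach the LLM. Here’s a minimal sketch using the <code>text_similarity_reranker</code> retriever; the inference endpoint name (<code>my-rerank-model</code>) is a placeholder for whichever reranking model you have deployed:</p><pre><code>POST elastic-docs/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "semantic": {
              "field": "content_semantic",
              "query": "configuring agents to collect logs and metrics from hosts"
            }
          }
        }
      },
      "field": "content",
      "inference_id": "my-rerank-model",
      "inference_text": "configuring agents to collect logs and metrics from hosts",
      "rank_window_size": 50
    }
  },
  "size": 5
}</code></pre>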
<p><strong>Before and after: LLM response comparison</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/437dbe84121502b6c9b320d29b0df5ac67bd35fa-1600x371.png" style="width: 100%;" /><h2>Elasticsearch RAG best practices</h2><p>Following these best practices optimizes your context engineering and significantly reduces the risk of context poisoning in your RAG systems. By implementing the following strategies, you ensure that every token in your context window contributes to relevant, accurate, and trustworthy LLM responses.</p><ol><li><strong>Choose the right search strategy for your data<br/></strong>Select your search approach based on your data characteristics and query patterns. Choose between lexical, semantic, or hybrid search. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/search-approaches">Search approaches | Elastic Docs</a>.</li><li><strong>Implement temporal awareness<br/></strong>Time-sensitive information requires active management to prevent outdated content from contaminating your context window. Use range queries with relative time filters (like <code>now-6M</code> or <code>now-1y</code>) for content that changes frequently, ensuring your RAG system prioritizes recent content. For more details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-range-query">Range query | Reference</a>.</li><li><strong>Use metadata boosting<br/></strong>When your knowledge base contains similar content across different contexts, such as multiple product versions, deployment types, or user roles, metadata boosting helps prioritize contextually relevant results. For more details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Boolean query | Reference</a>.</li><li><strong>Apply reranking when needed<br/></strong>For complex or high-priority queries where precision is critical, consider adding a reranking model (like the <code>text_similarity_reranker</code> sketched above), which can significantly improve result quality by reordering candidates based on a deeper semantic understanding of the query and documents. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/ranking">Ranking and reranking | Elastic Docs</a>.</li><li><strong>Optimize chunking strategies<br/></strong><em>Chunking</em> is the process of breaking down large text into smaller “chunks.” Your chunking strategy affects both semantic representation and retrieval precision: smaller chunks provide more granularity but can lose context, while larger chunks preserve more context but reduce retrieval precision. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">Understanding chunking strategies in Elasticsearch</a>.</li><li><strong>Filter the data before it reaches the LLM<br/></strong>Vector similarity search can retrieve semantically related but contextually irrelevant documents. Apply explicit filters on product, category, or domain fields to constrain results to the appropriate context before delivering them to the LLM. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/rag-in-production">RAG pipelines in production: Operationalize your GenAI project - Elasticsearch Labs</a>.</li><li><strong>Calibrate your retrieval volume (k)<br/></strong>Finding the &quot;Goldilocks zone&quot; for the number of documents retrieved is essential.
Too few results lead to incomplete answers, while too many can cause the LLM to miss key facts. Balance your token budget against the effective depth of the model&#x27;s context window. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search in Elasticsearch | Elastic Docs</a>.</li><li><strong>Consider summarization for large documents<br/></strong>When retrieved content exceeds your context budget, summarization techniques help retain essential information while reducing token count. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/ai-generated-summaries-using-elastic">Adding AI summaries to your site with Elastic - Elasticsearch Labs</a>.</li><li><strong>Monitor and iterate<br/></strong>Over time, as knowledge bases grow and content evolves, we recommend implementing monitoring to track relevance score distributions, temporal patterns in retrieved results, and user feedback signals. Watch for signs like outdated documents, declining user satisfaction scores, or a growing number of “no relevant results” queries. For more details, refer to <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">Elastic Observability: Streams Data Quality and Failure Store Insights</a>.</li></ol><h2>Conclusion</h2><p>The new era of million-token context windows has not made context management obsolete; it has made context engineering more critical than ever. As context windows grow, so does the potential for poisoning from any source: retrieval, tools, or memory.</p><p>The patterns shown in this article apply beyond just RAG. Temporal filtering, metadata boosting, and hybrid search are foundational techniques that improve context quality, regardless of source.</p><p>By implementing these strategies, you maintain control over what information reaches your LLM, ensuring relevance, accuracy, and trust at scale.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-vlm: Seeing like an AI with vision language models]]></title>
            <description><![CDATA[Learn about vision language models (VLMs), what jina-vlm can do, how to use it, and best practices.]]></description>
            <link>https://www.elastic.co/search-labs/blog/visual-language-model-jina-vlm</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/visual-language-model-jina-vlm</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens]]></dc:creator>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/38c086d709c066acaab063deee25ad23ce488a30-1600x900.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Vision-language models (VLMs) are generative language models that take texts and images as input and generate new text as output. They’re designed for <em>visual question-answering</em> (VQA), an application of multimodal AI that fuses image and natural language processing (NLP) in ways that would have been impossible a few years ago. Jina’s new model, <code>jina-vlm</code>, breaks new ground for VQA with state-of-the-art semantic processing of images like scans, slides, and diagrams, which combine text and visual material to carry information, rather than natural scenes like photographs.</p><p><code>jina-vlm</code> opens up new application areas:</p><ul><li><strong>Accessibility:</strong> Creating descriptions of images for visually impaired people.</li><li><strong>Indexing:</strong> Generating detailed textual information to improve search for images and documents that contain images.</li><li><strong>Querying:</strong> Matching natural language queries to images through direct analysis of image content.</li><li><strong>Visual content analysis:</strong> Automatically inspecting images for classification and analysis.</li><li><strong>Retrieval-augmented generation (RAG):</strong> Using visual information as a knowledge source in RAG systems.</li></ul><p>This article will introduce you to VLMs and show you how to use <code>jina-vlm</code> to bring multimodal AI to your own solutions.</p><h2>Vision-language models</h2><p><code>jina-vlm</code> is a decoder-only language model with additional features and training to answer questions or create general descriptions of images.</p><p>We adapted the <a href="https://huggingface.co/Qwen/Qwen3-1.7B-Base">Qwen3-1.7B-Base</a> open weight language model as the backbone of <code>jina-vlm</code>. What makes it a <em>vision</em> language model is that we’ve connected it to a custom fine-tuned image encoder model, derived from the <a href="https://huggingface.co/google/siglip2-so400m-patch14-384">SigLIP2-So400M/14-384</a> model, which converts images into a form that the language model can understand. We then trained the combined model specifically for VQA and description generation.</p><p>To understand how this works, think about how language models “read” your prompts. Input texts are split into tokens (words or parts of words), which are replaced with embedding vectors from a lookup table and then fed into a neural network. The model never sees the actual letters you typed.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5c711b2e1e0710370fc1979e4a987a6c2a497ba-1600x235.png" style="width: 100%;" /><p>As seen in the image, each unique token is substituted with a number that, in turn, corresponds to an embedding vector stored in the model’s vocabulary. This enables the model to process whatever text you give it, even if the words aren’t in its stored vocabulary.</p><p>This is why language models struggle to count letters in obscure words, like this example with Claude Haiku 4.5:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cc242a0f9f5a8fa9aa14a1511ad47abfe6440b51-628x582.png" style="width: 100%;" /><p>This model correctly recognizes what <em>Qaqortoq</em> is; it knows that the word means a town in Greenland, and it can tell you all about it, but it has no idea which actual letters compose the name.
There are ways to supplement language models to address this shortcoming, so some models (like ChatGPT 5, Gemini, and Claude Sonnet) don’t generally make this kind of mistake.</p><p>However, the same architectural property that causes this shortcoming makes it possible to create multimodal language models.</p><p>The actual input to generative language models is a sequence of semantic embeddings, and there’s no reason those embeddings have to come from texts. They could come from anything, as long as they’re compatible with the model’s training. We can make any language model handle images, provided we can generate vectors for those images in the same embedding space as the tokens.</p><p><code>jina-vlm</code> contains an image embedding model and adapter trained to do just that. The image (with some preprocessing into overlapping patches) is the input to the model, and it outputs a sequence of embeddings in the same semantic embedding space as the language model’s text token embeddings. These embeddings are then treated just like the text input embeddings. The full language model has access to your text and to the image at the same time, and it treats both as the same kind of input.</p><p>Figure 3 is a schematic of the <code>jina-vlm</code> model, showing the dual-track processing of text and images together.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66886ced02dc707434a1f7d19c519681f648eb13-1128x1579.png" style="width: 100%;" /><p>This architecture has some important benefits. Language models learn a lot of implicit information about the world from their training data, and the VLM architecture can use that information to help understand the content of images. Furthermore, some of the bias problems present in other multimodal architectures are absent in this one, notably the <a href="https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/">media “gap” that troubles CLIP-style models</a>.</p><h2>What can jina-vlm do?</h2><p><code>jina-vlm</code> has three overlapping core competencies:</p><ul><li>Describing images and generating captions.</li><li>Understanding informationally rich images.</li><li>Answering questions about the content of images.</li></ul><h2>Describing and captioning pictures</h2><p>Providing good descriptions of images can be a boon for accessibility. There are increasing legal and informal requirements to provide alternative text (alt-text) for digitized images on the internet. Search engine optimization (SEO) strategies in particular often demand high-quality image captions and alt-text. Manually composing such descriptions can be expensive and time-consuming.</p><p><code>jina-vlm</code> can help. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/816b17945624f8965705a06f4ac7e8774ae65714-1280x853.png" style="width: 100%;" /><p>You can input this image to <code>jina-vlm</code> with the text:</p><ul><li><strong>Input Instruction: </strong>Describe this image in detail.</li></ul><p>The result is the following:</p><ul><li><strong>Output: </strong>The image captures a vibrant scene of a white plate filled with a multitude of colorful jelly beans. The jelly beans, in hues of green, yellow, orange, purple, red, and blue, are scattered across the plate, creating a rainbow-like effect. The plate itself is placed on a gray surface, providing a stark contrast to the colorful jelly beans. The image does not contain any discernible text.
The relative positions of the jelly beans vary, with some clustered together and others spread out, giving the image a sense of depth and texture. The overall composition of the image is simple yet striking, with the colorful jelly beans being the central focus.</li></ul><p>This description is a little wordy but quite detailed.</p><p><code>jina-vlm</code> treats the text you give it as a question or instruction, and some prompt engineering may improve the result a lot. For example, using the same picture of jelly beans:</p><ul><li><strong>Input Instruction: </strong>Describe this image. Be concise.</li><li><strong>Output: </strong>A white plate is filled with a variety of colorful jelly beans, including green, yellow, red, purple, and blue, arranged in a scattered manner.</li></ul><p>Or:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A white plate full of colorful jelly beans.</li></ul><p>Asking specifically for a caption tends to produce very short results. Most captions are brief, so <code>jina-vlm</code> is trained to generate brief captions.</p><p>You can also use <code>jina-vlm</code> to identify and list items in images:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6a7a2d832e225905e3e29a6df11c3d56dbf0655d-1764x876.png" style="width: 100%;" /><p>As a relatively small model, <code>jina-vlm</code> has limits to its object recognition. For common categories, it’s very good. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/253083f674293de2a2aea866151c1844f1561d03-1768x752.png" style="width: 100%;" /><p>It’s able to handle content like “red cars”, but ask <code>jina-vlm</code> a question involving detailed categories and it can become unreliable very quickly. For example, using the same image:</p><ul><li><strong>Input Instruction: </strong>What model of car is this?</li><li><strong>Output: </strong>The car in the image is a classic French car, specifically a Citroën DS. This model is known for its distinctive rounded body shape, which was a hallmark of the DS series. The DS was produced from 1955 to 1969 and was a popular choice for its combination of style, performance, and practicality. It was designed to be a luxury car that could be driven on city streets as well as on the roads, making it a versatile vehicle for its time. The DS was also notable for its innovative features, such as the rear spoiler and the front grille with the Citroën emblem.</li></ul><p>This answer is largely hallucinated. The car in the picture is one of the most recognizable cars in history: a Citroën 2CV, manufactured from 1948 to 1990. The Citroën DS was made from 1955 to 1975 and looked like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/79f2d0348b0bae6a9253f649e43d2a030983408a-1600x1066.png" style="width: 100%;" /><p>In short, when pressed for details beyond its capacity, <code>jina-vlm</code> hallucinates without warning.</p><p>It’s especially important to keep this in mind with images of people:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/499a0378f555ad24c745119f1d0110bbf82d5d2c-1772x1108.png" style="width: 100%;" /><p>Tom Cruise is famous enough, and present enough in the training data, for <code>jina-vlm</code> to recognize him.
But, using the same picture:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A man with a green shirt and a big smile.</li></ul><p>This caption never mentions Tom Cruise, even though it’s the most important fact about the picture. Asking for a caption produces very little hallucination but also very little information.</p><p>In contrast, let’s do the same thing with this image:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2ccda0dd49d4caf78dffee2a797c7359e1ecca5f-1772x1076.png" style="width: 100%;" /><p>This is terribly wrong. Ronald Reagan looks nothing like Richard Nixon.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dab9812917a27d064845a5baae2e3d2bdf6b2076-250x313.png" style="width: 100%;" /><p>When you ask <code>jina-vlm</code> to identify persons, it’s likely to hallucinate because it doesn’t have very deep world knowledge. Only people who happen to appear many times in the training data are likely to be correctly identified.</p><p><code>jina-vlm</code> hallucinates much less when asked to write a caption. For example, using the same picture of Richard Nixon:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A man in a suit sits in a chair with a bookshelf behind him.</li></ul><h3>Understanding informational content</h3><p><code>jina-vlm</code> truly shines at understanding informational graphics and synthetic images, like diagrams, graphs, screenshots, scans of printed pages, advertising copy, and presentation slides.
The examples below show only part of <code>jina-vlm</code>’s range.</p><h4>Charts</h4><p><code>jina-vlm</code> can report on the contents of all sorts of data presentations.</p><p><strong>Bar charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dbe91a785ef93e1036f2a7b227e4ffbdfc3d9d07-1742x772.png" style="width: 100%;" /><p><strong>Pie charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/460dd1663f32744297cede944af69b46f45bbb37-1832x872.png" style="width: 100%;" /><p><strong>Area charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/645ee7038dd1e0b6be7485796cf6ea6f2057b841-1828x928.png" style="width: 100%;" /><p><strong>Line graphs:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852da7806adf0aaf0071bab0a045894e4f7e8272-1828x660.png" style="width: 100%;" /><p><strong>Table:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e86d3bc65a4fc285bbc4751edc9cadf52a266a7-1818x948.png" style="width: 100%;" /><h4>Multilingual content</h4><p><code>jina-vlm</code>’s multilingual support means that it can recognize languages and describe informational content in a different language from its presentation:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0473ad1847e931a8f3b33e04c4e70ead94c2e5a8-1824x1104.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3fc53c96bb1ae235b29a01df05c0ffcf6b68bb4-1864x774.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/75376c40ae29ea33519dc79c55a46dc9f9f42438-1870x802.png" style="width: 100%;" /><p>It can also list the languages when there’s more than one:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1acf868a00ddd221a563d33a3d73a353ed816124-1854x992.png" style="width: 100%;" /><p><code>jina-vlm</code> may answer in a different language than you asked for, if the image content is in another language or strongly associated with it:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2778d64e56cde5f438e92a07555a80ae3fe48eb4-1934x710.png" style="width: 100%;" /><p>To fix this, specify the language in the query. With the same image:</p><ul><li><strong>Input Instruction: </strong>Summarize the content of this document in English.</li><li><strong>Output: </strong>The image is a promotional poster for a KFC Thailand event. The poster is predominantly red with white and black text. At the top, there is a logo of KFC, which is a red and white square with a black silhouette of Colonel Sanders. Below the logo, there are three young men, presumably the main participants or hosts of the event. They are dressed in casual attire, with two of them wearing red jackets and one in a white shirt with a black stripe.<br/><br/>The central part of the poster features a large text in white that reads, “เปิดตัวใหม่! ร่วมกับ KFC Thailand พบกับความสนุกสนานที่ไม่เหมือนใคร!” which translates to “Launching New! Join KFC Thailand and experience something unique!” Below this, there is a smaller text that says, “ทุกคนมีส่วนร่วม! วันนี้ 8 พฤศจิกายน 64 เวลา 12:00 น.” indicating that the event is happening on November 8, 2021, at 12:00 PM.<br/><br/>The bottom section of the poster includes social media icons and a Facebook page link, suggesting that the event will be live on Facebook. 
There are also icons of a chicken drumstick and a bucket, which are associated with KFC, reinforcing the brand’s identity.<br/><br/>The overall design of the poster is vibrant and eye-catching, with a clear focus on the event’s theme and the brand’s presence. The use of red and white colors is typical for KFC branding, which is known for its bold and energetic visual style. The poster is designed to attract attention and inform viewers about the upcoming event.</li></ul><h4>Complex information displays</h4><p>We specifically trained <code>jina-vlm</code> to understand complex information displays and to be able to summarize them in natural language.</p><p><strong>Infographics:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b964abf7f3f1176b0546babae3f4ae1d22d99c86-1850x1396.png" style="width: 100%;" /><p><strong>Technical and educational diagrams:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8499b5793faa3c4879e1b1ec8994c76e3b10ad20-1850x838.png" style="width: 100%;" /><p><strong>Maps:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e6c3d47890dbaf9476963e992126c0f497eacda-1854x758.png" style="width: 100%;" /><h2>Screenshots</h2><p><code>jina-vlm</code> has been trained to parse and understand screenshots:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/50d17af1a4c08e4a69c4fb625229264dbfc65ce8-1848x708.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a9273a5be5134185005bdb5efe7666f5494abb2-1864x1104.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b66c570624d61c4e4c9f31791cd63656258b238-1850x810.png" style="width: 100%;" /><p>It can even parse video games:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/335b9482505a7dd2dcd86851f19ae514647368a7-1698x902.png" style="width: 100%;" /><h3>Scans</h3><p><code>jina-vlm</code> excels at materials like scans and images of printed text:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bb057adc4451107d6cc2b3dde90eb11d562ef581-1494x1468.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/687fb7b5c707ade1511ee6d40a7d034cfa7cfefd-1796x1540.png" style="width: 100%;" /><h3>Pre-digital materials</h3><p>Robust support for scans also makes pre-digital materials accessible to digital search technologies:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6b560a39e1e48d0e6755ed7b5273be77f03eca3c-1788x936.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d998c8e1ece6f1fc087b6d815b41ae11e16ee0bd-1498x1094.png" style="width: 100%;" /><h3>Social media</h3><p>Since much of the internet is “<a href="https://doctorow.medium.com/my-mcluhan-lecture-on-enshittification-ea343342b9bc">five giant websites filled with screenshots of text from the other four,”</a> <code>jina-vlm</code> is especially helpful in processing the kind of content that the web is full of:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f901fd079c7c53a6c0517e93b90b28367490710-1856x1044.png" style="width: 100%;" /><p><code>jina-vlm</code> can <em>sometimes</em> even explain memes:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4b55322d02cc13c6e718721763233f26497c64d0-1862x608.png" style="width: 100%;" /><h2>Visual
question-answering</h2><p>“A picture is worth a thousand words.” As cliché as that saying is, multimodal AI can prove it.</p><p>VQA is the domain where <code>jina-vlm</code> excels and the one with the most potential for new and revolutionary applications. <code>jina-vlm</code> answers questions about the content of images just as a language model answers questions about texts. This means it can serve as part of a RAG system that assembles responses from mixtures of images and texts stored in a database.</p><p>The infographic below is very information-rich, and with <code>jina-vlm</code> you can ask natural language questions about its content:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f71a444daedfae06eccbd6555451ea0d53889be-1792x1388.png" style="width: 100%;" /><p>Or, using the same image:</p><ul><li><strong>Input Instruction: </strong>What is the crew module of the SLS called?</li><li><strong>Output: </strong>Orion spacecraft</li></ul><p>This capability makes it possible to query formal schematics and diagrams:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a11f29f186455831f36849695ff69e0425d78f42-1792x838.png" style="width: 100%;" /><p>With the same image:</p><ul><li><strong>Input Instruction: </strong>How many bathrooms are there?</li><li><strong>Output: </strong>4</li></ul><p>If you use photographs or natural scenes, <code>jina-vlm</code> is typically good with high-visibility characteristics:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/61e2eebcd677e7a38e47cfe812b57c2f43022b43-1796x844.png" style="width: 100%;" /><p>But if you’re asking about finer, less visible details, the answers might get spotty:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6377ac8e9156cc0515abd54d7090d5739a575bd7-1786x758.png" style="width: 100%;" /><p>Thank you, Captain Obvious.</p><p><code>jina-vlm</code> is also not well-suited to resolving optical illusions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/26304a5b8b4159e67d80712006f39db3ec90d5af-1794x626.png" style="width: 100%;" /><p>You can ask questions about text in photographs:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f5dc84db1d9a25f68fa914928b1d9effdaf0e821-1630x1068.png" style="width: 100%;" /><p>And <code>jina-vlm</code> is ideally suited to understanding traditional magazine and newspaper advertisements:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c83c55998e7b68e2a802572e8234d1a277e50f6e-1634x1268.png" style="width: 100%;" /><p>Visual question-answering also works cross-lingually:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fad19e8b0008c52954441e98879ee170bb278291-1628x1278.png" style="width: 100%;" /><p>Using the same image:</p><ul><li><strong>Input Instruction: </strong>How much do grapes cost per kilo?</li><li><strong>Output: </strong>The cost of grapes per kilogram is 3.98.</li></ul><p>It answers questions from manual pages:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4b3cb87ad2132671f8c079fc1922088753f58036-1788x1424.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3ef1b1e578d200511ee8805eaedba04391d43c2-1796x1532.png" style="width: 100%;" /><p>It can also answer questions about the content of posters and artistic works:</p><img
src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7771639037d176121702dc6d68307e377928e284-1500x1090.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7e8eb8e43a46d5f5f6ca93c1810bc41f7baada5b-1496x684.png" style="width: 100%;" /><p>We showed in the previous section that <code>jina-vlm</code> is not very good at general face recognition, outside of a few very famous people. However, it can use the text in images to answer questions it can’t answer from just a picture.</p><h2>Best practices</h2><p><code>jina-vlm</code> is a generative language model. It’s not a database. It’s prone to error and hallucination. It’s very important to take that into account in your use case.</p><p>To minimize errors, we offer the following advice:</p><h3>Test your prompts</h3><p>If you depend on getting certain kinds of results consistently, you should do thorough quality testing on your queries. Getting a few good results with one prompt does not guarantee the same quality of result with different images. You can minimize this risk by manually reviewing the results of queries over a broad collection of images that are genuinely representative of what you plan to process.</p><h3>Specify the language</h3><p>Add to your query the language you expect the answer to be in. Like all language models, <code>jina-vlm</code> cannot easily distinguish between your instructions and the data it’s supposed to apply your instructions to. The only way for it to know how to answer is to look at the languages it recognizes in its input or to follow your explicit instructions.</p><h3>Don’t ask leading questions</h3><p>It’s very easy to get <code>jina-vlm</code> to hallucinate by asking leading questions. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/454b1b483a0d555f37dbf575e064fdbb13a3b517-1658x676.png" style="width: 100%;" /><h3>Prompt for short answers</h3><p><code>jina-vlm</code> can be very wordy sometimes. Using words like “briefly” and “concise” reduces the size of the output.</p><p>Compare:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0805c2868fa220c3bd39d6cee6d6092d50c79d06-1830x1042.png" style="width: 100%;" /><p><br/>And with the same image:</p><ul><li><strong>Input Instruction: </strong>What kind of vehicle is this? Be very concise.</li><li><strong>Output: </strong>This is a Massey Ferguson tractor.</li></ul><h3>Only ask questions relevant to the image</h3><p><code>jina-vlm</code> will not always be able to tell that your question is irrelevant to the content and can easily hallucinate answers. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a493d89b6a3aba7040083f4028da9cda0bc37aa5-1778x1272.png" style="width: 100%;" /><p>With some planning, you might be able to reduce this effect by using a decision tree: Ask multiple questions, each one dependent on the answer to the previous one. Using the same Picasso painting above, we would never have asked about the language of the text if we had asked this question first:</p><ul><li><strong>Input Instruction: </strong>Is there any written language in this image? Be very concise.</li><li><strong>Output: </strong>No</li></ul><h2>Think like a language model</h2><p>We trained <code>jina-vlm</code> to attend to the image given to it.
This means that no matter what you ask it, its answer will be a result of the picture it sees. Anthropomorphizing language models is a major hazard for AI users, but it’s easy to see how <code>jina-vlm</code> “thinks” if we ask it questions that call for reflection. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a49ce6aed18dcf023deaaf52c0b2e536e237b2b7-1782x910.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4711f4c8da535107dff421275af6bbbdd602aaf8-1774x916.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/805c2b4598999094b3c36c828813b471316237bf-1772x916.png" style="width: 100%;" /><p>You can ask <code>jina-vlm</code> anything, even questions like, “How are you feeling?” and it will answer in a way that is, if possible, attentive to the image.</p><p><code>jina-vlm</code> has no filter, safety training, or reliable source of facts, and it only knows that what it sees in the image you give it matters. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c1e91515231fd88b707480f102c98576fa7707d6-1778x594.png" style="width: 100%;" /><p>But:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/65576bf161a845447f17f1f42615a6e679767b37-1768x622.png" style="width: 100%;" /><p>This is not a bug; it’s a feature. <code>jina-vlm</code> will rely on what you give it to ground what it says. It’s up to you to decide what to give it.</p><p>As they say,</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66efa7ef107db4948b25d32843699244e3a8bb9c-1770x784.png" style="width: 100%;" /><p>And,</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a542cf0363b432662ef4bf78946c9f75d3681e0e-1782x830.png" style="width: 100%;" /><h2>Getting started with jina-vlm</h2><p><code>jina-vlm</code> is available via a public <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI-compatible API</a> at <a href="https://api-beta-vlm.jina.ai">https://api-beta-vlm.jina.ai</a>. To use this API, get a <a href="https://jina.ai/api-dashboard">free Jina API key at the Jina website</a>. It comes with 10 million free tokens to try out Jina by Elastic’s models. Additional tokens are available for purchase.</p>
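<p>Because the API is OpenAI-compatible, a request is just a standard chat completion with an image attached. Here’s a minimal sketch, assuming the usual <code>/v1/chat/completions</code> path, <code>jina-vlm</code> as the model name, and a hypothetical image URL (check the instructions linked below for the exact details):</p><pre><code>POST https://api-beta-vlm.jina.ai/v1/chat/completions
Authorization: Bearer $JINA_API_KEY
Content-Type: application/json

{
  "model": "jina-vlm",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "image_url",
          "image_url": { "url": "https://example.com/jelly-beans.png" } },
        { "type": "text",
          "text": "Describe this image. Be concise." }
      ]
    }
  ]
}</code></pre><p>You can find <a href="https://jina.ai/news/jina-vlm-small-multilingual-vision-language-model/#getting-started">instructions for accessing the <code>jina-vlm</code> API on the Jina AI website</a>, and there’s a <a href="https://jina.ai/api-dashboard/visual-qa-test">web-based sandbox for trying out images and requests</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de2fa4ded4913a7592665e2944d395ef87c8d768-1072x798.png" style="width: 100%;" /><p>You can also <a href="https://huggingface.co/jinaai/jina-vlm">download the model from Hugging Face</a> and run it locally for noncommercial purposes. For commercial deployment on site, contact <a href="https://www.elastic.co/contact?pg=global&amp;plcmt=nav&amp;cta=205352">Elastic sales</a>.</p><p>For more technical information about the model, including pricing information, visit <a href="https://jina.ai/models/jina-vlm/">the jina-vlm model page</a>. You can also read <a href="https://arxiv.org/abs/2512.04032">the release paper on arXiv.org</a>.</p>]]></content:encoded>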
        </item>
        <item>
            <title><![CDATA[Using Groq with Elasticsearch for intelligent queries]]></title>
            <description><![CDATA[Learn how to use Groq with Elasticsearch to run LLM queries and natural language searches in milliseconds.]]></description>
            <link>https://www.elastic.co/search-labs/blog/groq-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/groq-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mark Puddick]]></dc:creator>
            <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c9661fc360c775ebfe48ccda81160057c49f7088-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>One of the challenges with using large language models (LLMs) in conjunction with Elastic is that we often need fast results. Elastic has no issues providing millisecond response times. However, once we introduce LLM calls, performance can drop to unacceptable levels. This is where hardware inference using Groq can supercharge the speed of your results when combining Elastic with an LLM.</p><p>Groq is a hardware and software company focused on delivering ultra-low-latency, deterministic AI inference at scale. Its core innovation is the Groq Language Processing Unit (LPU) Inference Engine, a purpose-built, custom-designed chip architecture specifically engineered for running LLMs at extremely high speed with predictable performance. The links below give a more detailed overview of the Groq architecture.</p><ul><li><a href="https://groq.com/lpu-architecture">Groq: Introducing the LPU</a></li><li><a href="https://groq.com/blog/artificialanalysis-ai-llm-benchmark-doubles-axis-to-fit-new-groq-lpu-inference-engine-performance-results">Groq: The New LLM Inference Benchmark</a></li></ul><p>Unlike traditional GPU-based systems, Groq&#x27;s inference-specific architecture allows it to process tokens at unprecedented throughput with minimal variance in response time. This directly addresses the memory bandwidth bottlenecks and scheduling overhead that typically slow down traditional LLM calls, ensuring that integrating an LLM with Elastic&#x27;s search results maintains a real-time user experience. Groq provides this industry-leading speed and performance, often at the best price performance, via GroqCloud, an easy-to-use tokens-as-a-service platform.</p><p>Let’s start by looking at a common intelligent query layer request pattern and what improvements we can get from this.</p><h2>Natural language search</h2><p>Since the widespread adoption of LLMs, a common search request is the ability to use natural language for domain-specific search. A naive way of addressing this is to do a simple semantic search within a retrieval-augmented generation (RAG) workflow; however, in most cases, this doesn’t provide the desired results. This is largely due to specific attributes in the question that need to be translated into query terms. To address this, we can ask an LLM to generate a query that we can execute. However, this leaves a lot of room for error. Ultimately, we’ve found that providing a tool with specific parameters for the domain and using that with the LLM gives the best result. There’s more info on this in <a href="https://www.elastic.co/search-labs/blog/mcp-intelligent-search">this blog</a>.</p><p>To define the agent, we’ll use the following prompt:</p><pre><code>You are a helpful banking transaction agent. You help users search and analyze their banking transactions.

Current date: {current_date}

When users ask about transactions, use the appropriate tools:
- Use trans-search for finding specific transactions

For date references:
- "last month" = past 30 days from today
- "this month" = current month from 1st to today
- "last week" = past 7 days
- "this year" = January 1st of current year to today

By default, set the "to" date to today and the "from" date to 1 year ago
Common categories: groceries, dining, gas, shopping, entertainment, utilities, healthcare, transportation, travel, subscriptions, insurance, phone, internet</code></pre><p>As an example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71a1abec1c90a5faace7dc4699b2cb1096da405b-1600x215.png" style="width: 100%;" /><p>This gives us good results, but our search time goes from under 100ms to over 1 second, due to the LLM call.</p><p>To address this, we can use Groq’s hardware inference to run this query in a fraction of the time. To run through the example, you’ll need to <a href="https://console.groq.com/home">sign up for a Groq account</a>.</p><p>Then you can generate an API key from the top-right menu:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6acd9e3155c4008aa903c8f0ad1276a7b1af8bea-970x330.png" style="width: 100%;" /><p>We’ve created the tool <a href="https://github.com/markpudd/transaction_search_mcp">here</a> to be able to execute the search.</p><p></p><p></p><p></p><p>Once you’ve cloned the above repo, you’ll need to update the .env to point to Groq:</p><pre><code>OPENAI_API_KEY=gsk-........

OPENAI_API_BASE=https://api.groq.com/openai/v1

OPENAI_MODEL=openai/gpt-oss-20b</code></pre><p>We’ve used the 20b gpt-oss model, as this will give accurate results. There’s little to no gain in using a bigger model for this type of solution.</p><p>Now, for testing, we can run this from a simple UI with a prompt to use the tool:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fe93bd101f230ff72f43b126ea88037b624bee5-1600x957.png" style="width: 100%;" /><p>To test the timings around this, we’ll run the tool 50 times and get the average response time for the total, the LLM, and Groq. We’ll use ChatGPT-4.1-nano and the Groq OSS-20b model. Here’s the result from this test:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8eb1cca56914a54a19a80a3983106753b5e7367-1568x1324.png" style="width: 100%;" /><p>It’s clear that we drop around a second by using Groq’s hardware inference. We’re also using a smaller model which, for this use case, still provides good results. By dropping this from 1.5 seconds to 250ms, we’ll generally fall inside the Service Level Agreement (SLA) levels for a lot of organizations.</p><h2>Elastic Agent Builder</h2><p>We’ve shown how Groq can accelerate natural language processing (NLP) search with Elastic; it can also accelerate Elastic Agent Builder. Agent Builder was recently released into technical preview and is now capable of connecting to Groq via the Groq endpoint. Agent Builder is available on Elastic 9.1+. We can use the same API key that we used earlier.</p><p>Here’s how you set this up:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/237e1c08060d0598dc398750f636f5da15807163-1600x171.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/065e70fedf794a2d791340ed88785e07e2579dca-1536x596.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea0979dfcbc4de7cd3db748b575dbfd718a9fbb9-1600x214.png" style="width: 100%;" /><p>If you’re using serverless, you need to create a new connector from the stack management connectors page. First, click AI Connector.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba55fbc1e424792de3afe567ad92547e7a4165de-530x700.png" style="width: 100%;" /><p>On the next screen, select Groq as the service:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e628e81738de5ffdeaf5a5e533a7070ffad44354-1232x1246.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43863ceb10f4c0d3f306e660a0a23f514869d5ef-670x414.png" style="width: 100%;" /><p>You can then set up the model you want to use. The supported models are listed on the <a href="https://console.groq.com/docs/models">Groq website</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fdbf14924d209b69da572e3d1aca883403444d5a-654x663.png" style="width: 100%;" /><p>If you need to add your organization ID, this can be added by expanding <strong>More options</strong> under <strong>Settings</strong>.</p><p>If you’re on a hosted version of Elastic, at the time of this writing, you can use the OpenAI-compatible endpoint on Groq to connect to Elastic. 
To do this, select the OpenAI service and use a custom URL that points to the Groq URL, as below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71261d380c60e08280cce54f0dbf67a0cc3a3e74-1584x928.png" style="width: 100%;" /><p>Once you have set up Groq using either of the above methods, go to GenAI Settings and set Groq as your default GenAI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7c367bebb80074ba1a04116d0cc66b991a0a735d-1600x842.png" style="width: 100%;" /><p>Agent Builder will now default to using the Groq connector.</p><p>Let&#x27;s see if we can replicate the NLP search within Agent Builder using Groq.</p><p>In order to create agents, we generally need to have some tools for the agent to use. In Agent Builder, you have the ability to use built-in tools or create your own. A number of built-in tools are documented <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools/builtin-tools-reference">here</a>.</p><p>You can use these tools for your transaction searching. The LLM will use the built-in tools, such as the `index_explorer`, `generate_esql`, and `execute_esql`, which will try to find the relevant index, inspect its structure, and execute a generated Elasticsearch Query Language (ES|QL) query. However, this presents a few challenges:</p><ul><li>The time to run the agent will increase substantially, as there will be multiple reasoning steps and tool executions. Since we’re using Groq to get faster results, this isn’t ideal.</li><li>As the number of steps and the tool usage grows, we’re going to consume substantially more tokens and therefore increase cost.</li></ul><p>To avoid the above issues, we can create a new tool designed specifically to search transactions. At the time of this writing, there are three types of tools we can use:</p><ul><li>ES|QL tools: These allow you to use templated ES|QL to define a query.</li><li>Index search tools: These allow you to provide an index, and the LLM creates the query.</li><li>Model Context Protocol (MCP) tools: These allow you to use external tools via MCP.</li></ul><p>We could use our previously created MCP tools; however, to keep things simple, we’ll use the index search tool. You can set this up as below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f9b85dacf677613e5421cd6b4590f4b57d18e6b6-1505x1600.png" style="width: 100%;" /><p>Once we’ve created the tools, we can create an agent in Agent Builder. 
To do this, we click the <strong>Create agent</strong> button and fill in the fields shown in the screenshot below, using the prompt we used in our original example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74c190822b3ebd974b224d749e5229e9f3552b83-1600x1112.png" style="width: 100%;" /><p>We also want to select the tool we created as part of the agent:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/270fc98a264dacb4a48af60135f9a82bbf6cdc6e-1600x1045.png" style="width: 100%;" /><p>Then we can test in the Agent Builder UI by asking a few different questions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3fde2839943659e599c013cae50929913c81d092-1600x1376.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba2bb5b219c23289dd5b1c4608dcf98379bbf030-1600x1445.png" style="width: 100%;" /><p>We actually get some more functionality via Agent Builder, since it can create extra queries due to the extra built-in tools we selected. The only real disadvantage of this is that it can take longer to answer questions overall, since the LLM has the ability to do more. Again, this is where Groq can help. Let&#x27;s take a look at the performance difference in Agent Builder with Groq.</p><h2>Performance with Groq in Agent Builder</h2><p>A great feature of Agent Builder is that it has MCP and agent-to-agent (A2A) support out of the box. We can use this to do some simple benchmark testing. Using A2A, we can replace the built-in agent in the UI and test harness. This allows us to test Agent Builder with Elastic LLM and a couple of different models in Groq.</p><p>There’s an <a href="https://github.com/markpudd/a2a_chat">updated repo</a> that has the benchmark script in it.</p><p>To test this, we’ll ask the question:</p><p><em>How much did I spend on gas?</em></p><p>The results of the testing are shown below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Groq openai/gpt-oss-120b</td><td style="border: 1px solid #ccc; padding: 4px;">Groq llama-3.3-70b-versatile</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic LLM</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Min (s)</td><td style="border: 1px solid #ccc; padding: 4px;">6.040</td><td style="border: 1px solid #ccc; padding: 4px;">4.433</td><td style="border: 1px solid #ccc; padding: 4px;">15.962</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max (s)</td><td style="border: 1px solid #ccc; padding: 4px;">9.625</td><td style="border: 1px solid #ccc; padding: 4px;">7.986</td><td style="border: 1px solid #ccc; padding: 4px;">24.037</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Mean (s)</td><td style="border: 1px solid #ccc; padding: 4px;">7.862</td><td style="border: 1px solid #ccc; padding: 4px;">6.216</td><td style="border: 1px solid #ccc; padding: 4px;">17.988</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Median (s)</td><td style="border: 1px solid #ccc; padding: 4px;">7.601</td><td style="border: 1px solid #ccc; padding: 4px;">6.264</td><td style="border: 1px solid #ccc; padding: 4px;">17.027</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">StdDev (s)</td><td style="border: 1px solid #ccc; padding: 4px;">1.169</td><td style="border: 1px solid #ccc; padding: 4px;">1.537</td><td style="border: 1px solid #ccc; padding: 4px;">2.541</td>
                    </tr>
              </tbody>
            </table>
          <p>As you can see, the built-in Elastic LLM is not bad, but Groq still outperforms it by almost 3x on average. You’ll also notice that overall response times are substantially slower than with the external app. This is due to the way we’ve set up our tool in Agent Builder just to use the index. As a result, a lot of this time is taken by Agent Builder reasoning (that is, inspecting the index). We could use templated ES|QL tools instead of the index, and this would move the results closer to the external app.</p><h2>Conclusion</h2><p>It’s clear that by using Groq with Elastic we open up a range of new possibilities in which speed is an important factor. This article covers the basic intelligent query example, but there are many other applications, such as image understanding, summarization, and captioning, that become possible with the 10x increase in speed.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ES|QL dense vector search support]]></title>
            <description><![CDATA[Using ES|QL for vector search on your dense_vector data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/dense-vector-search-elasticsearch-query-language</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/dense-vector-search-elasticsearch-query-language</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Carlos Delgado]]></dc:creator>
            <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c61f7d90a3d5c79da51518f56a115bf059224cad-1024x559.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>You can now use Elasticsearch Query Language (ES|QL) for vector search! ES|QL can retrieve, filter, and score <code>dense_vector</code> fields. Use k-nearest neighbors (KNN) queries for fast, approximate nearest neighbors search at scale. Use vector similarity functions for exact search and custom scoring.</p><p>KNN is simpler to use in ES|QL than in the Query DSL. Prefilters and the number of results to retrieve per shard are automatically inferred from the ES|QL query.</p><h2><strong>What is vector search?</strong></h2><p>Modern search is no longer limited to exact keyword matching. Users expect systems to understand <em>meaning</em>, not just text. This is where <strong>vector embeddings</strong> and Elasticsearch’s <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field type come in.</p><p>The easiest way to use vector search in Elasticsearch is to use the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"> field type</a>. It allows you to automatically generate text embeddings, perform semantic search, and handle chunking. However, you may want to use <code>dense_vector</code> instead when:</p><ul><li>You’re already using <code>dense_vector</code> fields.</li><li>You’re using non-textual data, like images, sound, or video.</li><li>You need to generate embeddings separately from ingestion into Elasticsearch.</li><li>You need to do custom or advanced scoring.</li><li>You want to perform exact nearest neighbors search.</li></ul><p>A <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> stores numerical embeddings produced by machine learning models. These embeddings capture semantic similarity: documents with similar meaning have vectors that are close to each other in high-dimensional space.</p><p>With vectors, you can build:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/semantic-search"><code>Semantic text</code> search</a>, for finding documents related to a question.</li><li><a href="https://www.elastic.co/what-is/retrieval-augmented-generation">Retrieval-augmented generation</a> (RAG).</li><li>Recommendation systems.</li></ul><p>ES|QL brings the power of a query-piped experience to Elasticsearch. Adding first-class support for <code>dense_vector</code> fields means you can now <strong>retrieve, filter, score, and search using vectors directly in ES|QL</strong>, alongside your text and non-text data.</p><p>In this post, we’ll walk through how to work with <code>dense_vector</code> fields in ES|QL, from basic inspection to approximate and exact similarity search, and how to use vector search as part of hybrid search strategies.</p><h2><strong>The basics: Retrieving vector data</strong></h2><p>Assume you have an index with a mapping similar to:</p><pre><code>{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "category": { "type": "keyword" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 384,
        "similarity": "cosine"
      }
    }
  }
}</code></pre><p>You can retrieve vector fields just like any other column:</p><pre><code>FROM documents
| KEEP title, content_vector
| LIMIT 5</code></pre><p>Keep in mind that vectors can be large. For exploration and debugging, it may be useful to retrieve vector data, but in production you should avoid returning full vector data unless it&#x27;s really necessary.</p><p>You can use familiar ES|QL constructs to check how many rows have vector information:</p><pre><code>FROM documents
| WHERE content_vector IS NOT NULL
| STATS non_null = COUNT(*)</code></pre><h2><strong>Approximate search using KNN</strong></h2><p>Vector search means finding the most similar vectors to a given query vector.</p><p>For large datasets, the most common approach is <a href="https://www.elastic.co/blog/understanding-ann">approximate nearest neighbor</a> (ANN) search. ANN tries to find the most similar vectors by using data structures that allow for fast computation of similar vectors but doesn’t guarantee that all vectors will be considered.</p><p>ES|QL exposes approximate search via the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code> function</a>:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...])
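// KNN populates the _score metadata column; LIMIT 10 below implicitly sets k to 10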
| SORT _score DESC
| KEEP title, _score
| LIMIT 10</code></pre><p>This simple example:</p><ul><li>Searches over the <code>content_vector</code> field.</li><li>Uses the query vector <code>[0.12, -0.03, 0.98, ...]</code> to search for similar vectors.</li><li>Sorts the results by score, using the <code>METADATA _score</code> attribute that will be populated by the `KNN` function.</li><li>Keeps just the title and score, as the <code>content_vector</code> field doesn&#x27;t need to be returned, and we can avoid loading its contents.</li><li>Retrieves the top 10 elements by using <code>LIMIT</code>. This automatically sets <code>k</code> to 10 in the <code>KNN</code> function.</li></ul><p>The <code>KNN</code> function can be further customized by using options:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...], {"k": 20, "min_candidates": 100, "rescore_oversample": 4.0})
| SORT _score DESC
| KEEP title, _score
| LIMIT 10</code></pre><p>See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code> function</a> named parameters for a complete description of the available parameters.</p><h3><strong>Combining KNN with filters</strong></h3><p>You can narrow down the candidate set for the vector search:</p><pre><code>FROM documents METADATA _score
| WHERE category == "tutorial"
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...])
| SORT _score DESC
| LIMIT 10
| KEEP title, category, _score</code></pre><p>Of course, you can use any other <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where"><code>WHERE</code> clauses</a> that filter the results or include <code>KNN</code> as part of a filter expression:</p><pre><code>FROM documents METADATA _score
| WHERE published_date > NOW() - 1 hour AND LENGTH(category) > 10 AND KNN(content_vector, [0.12, -0.03, 0.98, ...])</code></pre><h3><strong>KNN made simple</strong></h3><p><code>KNN</code> is simpler to use in ES|QL. You won&#x27;t have to specify prefilters or k for your query explicitly.</p><p><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-filtering">Prefilters</a> are the way to ensure that a <code>KNN</code> query returns as many results as expected. Prefilters are applied on the KNN search itself, instead of being applied after the query.</p><p>Keep in mind that <code>KNN</code> returns the top k results it&#x27;s been asked for. If filters are applied after the <code>KNN</code> query, some of the results returned by the query may be filtered. If that happens, we’ll retrieve fewer results than expected.</p><p>The Query DSL <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query"><code>knn</code></a><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query"> query</a> contains a section for specifying prefilters:<br/></p><pre><code>POST my-image-index/_search
{
  "query" : {
    "knn": {
      "field": "content_vector",
      "query_vector": [0.12, -0.03, 0.98, ...],
      "filter" : {
        "term" : { "category" : "tutorial" }
      }
    }
  }
}</code></pre><p>You don&#x27;t need to care about prefilters when using <code>KNN</code> in ES|QL. All filters are applied as prefilters for the <code>KNN</code> function, so there’s no need to specify them as a specific option or command; just use `WHERE` and let ES|QL do it for you!</p><p>KNN also allows specifying the number of results to retrieve per shard; that is, the k parameter. Similar to the Query DSL, k defaults to the <code>LIMIT</code> specified in your query.</p><h2><strong>Exact search using vector similarity functions</strong></h2><p><code>KNN</code> is designed to be fast, and that makes it ideal for large datasets (hundreds of thousands or millions of vectors) and latency-sensitive applications. The trade-off is that results are <em>approximate</em>, though usually very accurate.</p><p>Sometimes you want <em>exact</em> similarity computation instead of approximate search, for example:</p><ul><li>When your dataset is small.</li><li>When the filters used in the query are very restrictive and select a small subset of your dataset.</li></ul><p>ES|QL provides the following <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions">vector similarity functions</a>:</p><ul><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_cosine"><code>V_COSINE</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_dot_product"><code>V_DOT_PRODUCT</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_hamming"><code>V_HAMMING</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_l1_norm"><code>V_L1_NORM</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_l2_norm"><code>V_L2_NORM</code></a></li></ul><p>Using these functions, you can calculate the similarity of your query vector with all the vectors your query retrieves.</p><p>The following query uses the same mapping as our <code>KNN</code> example, above, but does exact search using cosine similarity:</p><pre><code>FROM documents
| EVAL similarity = V_COSINE(content_vector, [0.12, -0.03, 0.98, ...])
| SORT similarity DESC
| KEEP title, similarity
| LIMIT 10</code></pre><p>This query:</p><ul><li>Computes the similarity using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_cosine"><code>V_COSINE</code></a> vector similarity function.</li><li>Sorts on the computed similarity.</li><li>Keeps the top 10 similar results.</li></ul><h2>Semantic search</h2><p>When doing semantic search, you&#x27;ll be trying to match a text query to your vectors. Of course, you can calculate the embeddings yourself first and then supply the query vector directly to your vector search.</p><p>But it would be much simpler to allow Elasticsearch to calculate the embeddings for you by using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-text_embedding"><code>TEXT_EMBEDDING</code></a> function:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, TEXT_EMBEDDING("my semantic query", inference_id))
| SORT _score DESC
| LIMIT 10
| KEEP title, _score</code></pre><p><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-text_embedding"><code>TEXT_EMBEDDING</code></a> uses an already existing <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-inference">inference endpoint</a> to automatically calculate the embeddings and use them in your query.</p><h2>Hybrid search</h2><p>Most searches don&#x27;t rely just on vector data; they need to be combined with lexical search, as well, so we have the best of both worlds:</p><ul><li>Lexical information is great for exact search of words and synonyms and provides a strong signal that users are looking for specific terms.</li><li>Vectors capture meaning and intention, using similar phrases or terms that are not lexically related.</li></ul><p>Combining vector search and lexical search is best done using <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a>:</p><pre><code>FROM documents METADATA _score, _id, _index
| FORK
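// first branch: semantic (KNN); second branch: lexical (MATCH); FUSE merges them with RRF by default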
(WHERE KNN(content_vector, TEXT_EMBEDDING("my query", inference_id)) | SORT _score DESC | LIMIT 10)
(WHERE MATCH(title, "my query") | SORT _score DESC | LIMIT 10)
| FUSE
| SORT _score DESC
| LIMIT 10</code></pre><p>The query above:</p><ul><li>Uses <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> to do two queries:<ul><li>A <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code></a> query over a `dense_vector` field.</li><li>A <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/search-functions#esql-match"><code>MATCH</code></a> query over a text field.</li><li>Both queries sort by score and return the top 10 results for each.</li></ul></li><li>Uses <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> to mix together the query results by using reciprocal rank fusion (RRF) by default.</li></ul><p>This allows total control of the queries you want to perform, how many results to retrieve from each one, and how to combine the results together.</p><p>Check out our <a href="https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql#vector-search">multistage retrieval</a> blog post for more details on how modern search works and how easy it is to implement via ES|QL.</p><h2>Custom scoring</h2><p>Calculating custom scoring is easy using ES|QL! Just use the <code>_score</code> metadata field for calculating your custom score:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, TEXT_EMBEDDING("my semantic query", inference_id))
| EVAL my_custom_score = _score * 1.5 + ...
| SORT my_custom_score DESC
| LIMIT 10</code></pre><p>If you&#x27;re using exact search, you already have an evaluation for the vector similarity that you can fine-tune:</p><pre><code>FROM documents
| EVAL similarity = V_COSINE(content_vector, [0.12, -0.03, 0.98, ...])
| EVAL my_custom_score = similarity * 1.5 + ...
| SORT my_custom_score DESC
| LIMIT 10</code></pre><p>Compared to the Query DSL <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query"><code>script_score</code></a>, this is a much simpler and more iterative approach and fits perfectly into the ES|QL execution flow.</p><h2>Using query parameters</h2><p>When using a query vector, you can specify it directly on the query as in our previous examples. But you may have noticed that we&#x27;re using ellipses (...) to signal that there&#x27;s more data to come.</p><p>Dense vectors are usually high dimensional; they can have hundreds or thousands of dimensions, so copying and pasting your query vector on the query itself can make it difficult to understand or reason about, as you&#x27;ll be seeing thousands of numeric values on your screen.</p><p>Remember that you can use ES|QL query parameters for supplying parameters to your query:</p><pre><code>POST _query
{
  "query": """
FROM documents METADATA _score
| WHERE KNN(content_vector, ?query_vector)
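// ?query_vector is bound from the "params" array below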
| SORT _score DESC
| KEEP title, _score
| LIMIT 10
   """,
 "params": [{"query_vector" : [0.12, -0.03, 0.98, ...]}]
}</code></pre><p>This helps to keep your query and parameters separated, so you can focus on the query logic and not on specific parameters that get in your way.</p><p>Using query parameters for vectors is also more performant, as vectors are parsed faster that way using the request parser instead of the ES|QL parser.</p><h2><strong>Conclusion</strong></h2><p>ES|QL doesn’t just support vector search; it makes it a natural part of how you query your data. It allows you to use a single, powerful syntax for text, vectors, and everything in between, including:</p><ul><li>Vector search, both approximate and exact.</li><li>Semantic search, using text for search over vector data.</li><li>Hybrid search, combining the best of text and vector search.</li><li>Custom vector scoring, using <code>EVAL</code> and ES|QL constructs.</li></ul><p>Vector search in ES|QL is easier than in the Query DSL, by inferring prefilters and parameters, and integrating with the expressive, feature-rich expressions that ES|QL allows.</p><p>Defining KNN as part of a query pipeline for <a href="https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql#vector-search">multistage retrieval</a> is just another piece in the query; you can keep using filters, combine with other text functions for hybrid search, and apply reranking or query completion on top of your vector results.</p><p>We’ll keep adding vector functions for performing vector arithmetic and aggregations over dense vectors, so you can use the full power of ES|QL to manipulate your vector data.</p><p>Happy (vector) searching!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up vector ingestion using Base64-encoded strings]]></title>
            <description><![CDATA[Introducing Base64-encoded strings to speed up vector ingestion in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jim Ferenczi, Benjamin Trent, Ignacio Vera Sequeiros]]></dc:creator>
            <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4c1ebbd7a1071e8e1721a9871cba87f6aed140e9-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’re improving the ingestion speed of vectors in Elasticsearch. Now, in <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and in v9.3, you can send your vectors to Elasticsearch encoded as Base64 strings, which will provide immediate benefits to your ingestion pipeline.</p><p>This change reduces the overhead of parsing vectors in JSON by an order of magnitude, which translates to almost a 100% improvement in indexing throughput for DiskBBQ and around a 20% improvement for hierarchical navigable small world (HNSW) workloads. In this blog, we’ll take a closer look at Base64-encoded strings and the improvements they bring to vector ingestion.</p><h2>What’s the problem?</h2><p>At Elastic, we’re always looking for ways to improve our vector search capabilities, whether that’s enhancing existing storage formats or introducing new ones. Recently, for example, we added a new disk-friendly storage format called <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a> and enabled vector indexing with <a href="https://www.elastic.co/search-labs/blog/elasticsearch-gpu-accelerated-vector-indexing-nvidia">NVIDIA cuVS</a>.</p><p>In both cases, we expected to see major gains in ingestion speed. However, once these changes were fully integrated into Elasticsearch, the improvements weren’t as large as we had hoped. A flamegraph of the ingestion process made the issue clear: JSON parsing had become one of the main bottlenecks.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a286408afc85ff1cd3dd448b8fdf59dd3e11d599-1600x675.png" style="width: 100%;" /><p>Parsing JSON requires walking through every element in the arrays and converting numbers from text format into 32-bit floating-point values, which is very expensive.</p><h3>Why Base64-encoded strings?</h3><p>The most efficient way to parse vectors is directly from their binary representation, where each element uses a 32-bit floating-point value. However, JSON is a text-based format, and the way to include binary data in it is by using <a href="https://en.wikipedia.org/wiki/Base64">Base64</a>-encoded strings. Base64 is just a binary-to-text encoding scheme. So instead of sending vectors as JSON float arrays:</p><pre><code>{
  "emb" : [1.2345678, 2.3456789, 3.4567891]
}</code></pre><p>We can now send vectors encoded as Base64 strings:</p><pre><code>{
  "emb" : "P54GUUAWH5pAXTwI"
}</code></pre><p>Is it worth it? Our benchmarks suggest yes. When parsing 1,000 JSON documents, using Base64-encoded strings instead of float arrays resulted in performance improvements of more than an order of magnitude, at the cost of a small encode/decode overhead (client-side Base64 encoding and a temporary byte array on the server for decoding), in exchange for eliminating expensive per-element numeric parsing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/86e5a926e13b07bb3b0abe80bd4930464e8f6f9b-1200x742.png" style="width: 100%;" /><h3>Give me some ingestion numbers</h3><p>We can see these improvements in practice when running the <a href="https://github.com/elastic/rally-tracks/blob/master/so_vector/README.md"><code>so_vector</code></a> rally track with the different approaches. The actual gains depend on how fast indexing is for each storage format. For <code>bbq_disk</code>, indexing throughput increases by about 100%, while for <code>bbq_hnsw</code>, the improvement is closer to 20%, since indexing is inherently slower there.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4277057ee59cb84d068176b56bb7fa00b66e1cb3-1200x742.png" style="width: 100%;" /><p>Starting with Elasticsearch v9.2, <a href="https://www.elastic.co/search-labs/blog/elasticsearch-exclude-vectors-from-source">vectors are excluded from <code>_source</code></a> by default and are stored internally as 32-bit floating-point values. This behavior also applies to Base64-encoded vectors, making the choice of indexing format completely transparent at search time.</p><h2>Client support</h2><p>Adding a new format for indexing vectors might require changes to ingestion pipelines. To help with this, in v9.3, Elasticsearch official clients can transform vectors with 32-bit floating-point values into Base64-encoded strings and the other way around. You might need to check the client documentation for the specific implementation.</p><p>For example, here’s a snippet for implementing bulk loading using the Python client:</p><pre><code>from elasticsearch.helpers import bulk, pack_dense_vector
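# pack_dense_vector serializes a list of floats into the Base64 string form
# shown above. Conceptually: base64.b64encode(struct.pack(f">{len(v)}f", *v)),
# i.e., big-endian float32 bytes (matching the "P54GUUAWH5pAXTwI" example).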

def get_next_document():
    for doc in dataset:
        yield {
            "_index": "my-index",
            "_source": {
                "title": doc["title"],
                "text": doc["text"],
                "emb": pack_dense_vector(doc["emb"]),
            },
        }

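# helpers.bulk streams the actions to Elasticsearch in chunks; with
# stats_only=True it returns a (successes, errors) count tuple.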
result = bulk(
    client=client,
    chunk_size=chunk_size,
    actions=get_next_document(),  # call the generator to get an iterable of actions
    stats_only=True,
)</code></pre><p>The only difference from a bulk ingest using floats is that the embedding is wrapped with the <code>pack_dense_vector()</code> auxiliary function.</p><h2>Conclusion</h2><p>By switching from JSON float arrays to Base64-encoded vectors, we remove one of the largest remaining bottlenecks in Elasticsearch’s vector ingestion pipeline: numeric parsing. The result is a simple change with outsized impact: up to 2× higher throughput for DiskBBQ workloads and meaningful gains even for slower indexing strategies, like HNSW.</p><p>Because vectors are already stored internally in a binary format and excluded from <code>_source</code> by default, this improvement is completely transparent at search time. With official client support landing in v9.3, adopting Base64 encoding requires only minimal changes to existing ingestion code, while delivering immediate performance benefits.</p><p>If you’re indexing large volumes of embeddings, especially in high-throughput or serverless environments, Base64-encoded vectors are now the fastest and most efficient way to get your data into Elasticsearch. Those interested in the implementation details can follow the related Elasticsearch issues and pull requests: #<a href="https://github.com/elastic/elasticsearch/issues/111281">111281</a> and #<a href="https://github.com/elastic/elasticsearch/issues/135943">135943</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Jina Rerankers bring fast, multilingual reranking to Elastic Inference Service (EIS)]]></title>
            <description><![CDATA[Elastic now offers jina-reranker-v2-base-multilingual and jina-reranker-v3 on EIS, enabling fast multilingual reranking directly in Elasticsearch for higher-precision retrieval, RAG, and agentic workflows without added infrastructure.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-rerankers-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-rerankers-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Brendan Jugan, Ranjana Devaji]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/be373a8af506d463948b9027f1cc2f19356f5499-720x421.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today we’re excited to introduce <a href="https://jina.ai/models/jina-reranker-v2-base-multilingual/"><code>jina-reranker-v2-base-multilingual</code></a> and <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/"><code>jina-reranker-v3</code></a> on Elastic Inference Service (EIS), enabling fast multilingual, high-precision reranking directly in Elasticsearch.</p><p>Jina AI, recently acquired by Elastic, is a leader in open-source multilingual and multimodal models, delivering state-of-the-art <a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">search foundation models</a> for high-quality retrieval and retrieval-augmented generation (RAG). <a href="https://www.elastic.co/blog/elastic-inference-service">EIS</a> makes it easy to run fast, high-quality inference with an expanding catalog of these ready-to-use models on managed GPUs, with no setup or hosting complexity.</p><p>Rerankers add semantic precision by refining the ordering of retrieved results, helping select the best matches for a query. They add relevance improvements without the time and effort involved in reindexing or pipeline disruption, and they’re especially valuable for hybrid and RAG workflows where better context boosts downstream accuracy.</p><p>This follows the recent introduction of <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service"><code>jina-embeddings-v3</code> on EIS</a>, expanding the model catalog for multilingual reranking. Developers can now perform hybrid search using lexical search with BM25F and vector search with multilingual embeddings from <code>jina-embeddings-v3</code>, then rerank results with Jina Rerankers v2 or v3 depending on the use case. This delivers full control over recall tuning natively in Elasticsearch.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1afc549ebe74619e750b3809bca014f87a6dfb73-1600x437.png" style="width: 100%;" /><h2>jina-reranker-v2-base-multilingual</h2><p><code>jina-reranker-v2-base-multilingual</code> is a compact general-purpose reranker with features designed to support function calling and SQL querying.</p><ul><li><strong>Low-latency inference at scale:</strong> A compact 278M parameter model with <a href="https://arxiv.org/abs/2307.08691">Flash Attention 2</a> for low-latency inference, delivering strong multilingual performance that outperforms larger rerankers based on <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard">AIR metrics</a> and other widely used benchmarks.</li><li><strong>Supports agentic use cases:</strong> Accurate multilingual text reranking with additional support for selecting SQL tables and external functions that match text queries, enabling agentic workflows.</li><li><strong>Unbounded candidate support:</strong> v2 handles arbitrarily large candidate lists by scoring documents independently. Scores are compatible across batches, so developers can rerank large result sets incrementally. For example, a pipeline can score 100 candidates at a time, merge the scores, and sort the combined results. 
This makes v2 suitable when pipelines don’t apply strict top-k limits.</li></ul><h2>jina-reranker-v3</h2><p><code>jina-reranker-v3</code> performs multilingual listwise reranking, offering state-of-the-art performance with higher precision for RAG and agent-driven workflows.</p><ul><li><strong>Lightweight, production-friendly architecture:</strong> A ~0.6B parameter listwise reranker optimized for low-latency inference and efficient deployment in production settings.</li><li><strong>Strong multilingual performance:</strong> <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/">Benchmarks</a> show that v3 delivers state-of-the-art multilingual performance while outperforming much larger alternatives and maintaining stable top-k rankings under permutation.</li><li><strong>Cost-efficient, cross-document reranking:</strong> Unlike v2, v3 reranks up to 64 documents together in a single inference call, reasoning over relationships across the full candidate set to improve ordering when results are similar or overlapping. By batching candidates instead of scoring them individually, v3 significantly reduces inference usage, making it a strong fit for RAG and agentic workflows with defined top-k results.</li></ul><p>More models are on the way. EIS continues to expand with models optimized for candidate reranking, retrieval, and agentic reasoning. Next up is <a href="https://jina.ai/news/fair-scoring-for-multimodal-documents-with-jina-reranker-m0/"><strong><code>jina-reranker-m0</code></strong></a> for multimodal reranking, followed closely by frontier models from OpenAI, Google, and Anthropic.</p><h2><strong>Get started</strong></h2><p>You can start using <code>jina-reranker-v2-base-multilingual</code> on EIS with just a few steps.</p><h3><strong>Create embeddings with jina-embeddings-v3</strong></h3><pre><code>POST _inference/text_embedding/.jina-embeddings-v3
{
  "input": [
    "The Atlantic is a vast, deep ocean.",
    "A small puddle formed on the sidewalk."
  ]
}</code></pre><p>The response:</p><pre><code>{
  "text_embedding": [
    {
      "embedding": [
        0.0061287,
        ...
      ]
    },
    {
      "embedding": [
        -0.11765291,
        ...
      ]
    }
  ]
}</code></pre><h3><strong>Rerank with jina-reranker-v2-base-multilingual</strong></h3><p>Perform inference:</p><pre><code>POST _inference/rerank/.jina-reranker-v2-base-multilingual
{
 "input": ["puddle", "ocean", "cup of tea"],
 "query": "a large body of water"
}</code></pre><p>The response:</p><pre><code>{
  "rerank": [
    {
      "index": 1,
      "relevance_score": 0.48755136
    },
    {
      "index": 0,
      "relevance_score": 0.41489884
    },
    {
      "index": 2,
      "relevance_score": 0.07696084
    }
  ]
}</code></pre><p>The response contains the ranked list of inputs sorted by relevance score. In this example, the model identifies &quot;ocean&quot; as the most relevant match for a large body of water, assigning it the highest score while correctly ranking &quot;puddle&quot; and &quot;cup of tea&quot; lower.</p><h3><strong>Rerank with jina-reranker-v3</strong></h3><p>Perform inference:</p><pre><code>POST _inference/rerank/.jina-reranker-v3
{
  "input": ["pebble", "The Swiss Alps", "a steep hill"],
  "query": "mountain range"
}</code></pre><p>The response:</p><pre><code>{
  "rerank": [
    {
      "index": 1,
      "relevance_score": 0.06519848
    },
    {
      "index": 2,
      "relevance_score": -0.05002501
    },
    {
      "index": 0,
      "relevance_score": -0.09782915
    }
  ]
}</code></pre><p>Similar to <code>jina-reranker-v2-base-multilingual</code>, the response provides a prioritized list of the inputs sorted by relevance. In this example, the model identifies &quot;The Swiss Alps&quot; as the most relevant match for &quot;mountain range,&quot; compared to &quot;pebble&quot; and &quot;a steep hill.&quot;</p><p>However, a key difference is that <code>jina-reranker-v3</code> is a listwise reranker. Unlike <code>jina-reranker-v2-base-multilingual</code>, which scores document-query pairs individually, <code>jina-reranker-v3</code> processes all inputs simultaneously, enabling rich cross-document interactions before determining the final ranking.</p><h2><strong>What’s new in EIS</strong></h2><p><strong>EIS via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a></strong> brings EIS to self-managed clusters, allowing developers to access its GPU fleet to prototype and ship RAG, semantic search, and agent workloads without needing to procure GPU capacity on their self-managed clusters. Platform teams gain hybrid flexibility by keeping data and indexing on-prem while scaling GPU inference in Elastic Cloud when needed.</p><h2>What’s next</h2><p><strong><code>semantic_text</code> fields will soon default to <code>jina-embeddings-v3</code> on EIS</strong>, providing built-in inference at ingestion time, making it easier to adopt multilingual search without additional configuration.</p><h2>Try it out</h2><p>With Jina AI models on EIS, you can build multilingual, high-precision retrieval pipelines without managing models, GPUs, or infrastructure. You get fast dense retrieval, accurate reranking, and tight integration with Elasticsearch’s relevance stack, all in one platform.</p><p>Whether you’re building RAG systems, search, or agentic workflows that need reliable context, Elastic now gives you high-performance models out of the box and the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to the Elastic Inference Service. <a href="https://www.elastic.co/cloud/serverless">Try it now on Elastic Cloud Serverless and Elastic Cloud Hosted.</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building automation with Elastic Workflows]]></title>
            <description><![CDATA[A practical introduction to workflow automation in Elastic. Learn what workflows look like, how they work, and how to build one.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-workflows-automation</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-workflows-automation</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tinsae Erkailo, Tal Borenstein, Shahar Glazner]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cc239e947732501cbe0579f15093efa3d9db1cb8-1600x933.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Workflows is an automation engine built into the Elasticsearch platform. You define workflows in YAML; what triggers (starts) them, what steps they take, what actions they perform, and the platform handles execution. A workflow can query Elasticsearch, transform data, branch based on conditions, call external APIs, and integrate with services like Slack, Jira, PagerDuty and more through connectors you&#x27;ve already configured.</p><p>In this blog post, we&#x27;ll cover the core concepts of Workflows and build an example workflow together.</p><h2>Workflows are declarative and defined in YAML</h2><p>Workflows are composable. You define what should happen, and the platform handles execution, error recovery, and logging. Every workflow is defined as YAML and lives inside Kibana.</p><p>A workflow is composed of a few key parts: <em>triggers</em>, <em>inputs</em>, and <em>steps</em>.</p><p><strong>Triggers</strong> determine when a workflow runs. An alert trigger runs when a Kibana alerting rule fires, with full access to the alert context. A scheduled trigger runs on an interval or cron pattern. A manual trigger runs on demand from the UI or API. A workflow can have multiple triggers.</p><p><strong>Inputs</strong> define parameters that can be passed to the workflow at runtime. These let you create reusable workflows that accept different values depending on how they&#x27;re invoked.</p><p><strong>Steps</strong> are the actions a workflow takes. They execute in sequence, and each step can reference outputs from previous steps. Step types include:</p><ul><li><strong>Internal actions</strong> for actions you perform inside Elasticsearch and Kibana, like querying indices, running Elasticsearch Query Language (ES|QL) queries, creating cases, or updating alerts.</li><li><strong>External actions</strong> for actions you perform on external systems, like sending a Slack message or creating a Jira ticket. Use any connector you&#x27;ve configured in Elastic, with the flexibility to hit any API or internal service using HTTP steps.</li><li><strong>Flow control</strong> for defining the logic of your workflow with conditionals, loops, and parallel execution.</li><li><strong>AI</strong> for everything from prompting a large language model (LLM) to enabling agents as workflow steps, unlocking agentic workflow use cases.</li></ul><h2>Getting practical: Your first workflow</h2><p>Let&#x27;s build a workflow that demonstrates core capabilities: working with Elasticsearch indices, conditional logic, and data flow between steps. We&#x27;ll create a simple demo that sets up a National Parks index, loads sample data, and searches it.</p><h3>Enable Workflows</h3><p>Workflows is available in Elastic 9.3 (Technical Preview). Go to <strong>Stack Management → Advanced Settings</strong>, and enable <strong>Elastic Workflows</strong>:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/afc1f7168f3248d1792703973010e37ff05d19af-5344x3044.png" style="width: 100%;" /><h3>Create a workflow</h3><p>Navigate to <strong>Workflows</strong> in Kibana. If this is your first time, you&#x27;ll see the Get Started screen:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ca4bf4505ca26c32d12158a33211bf943a5bf37-5344x3044.png" style="width: 100%;" /><p>Click <strong>Create a new workflow</strong> to open the editor. 
You can also explore <strong>Example workflows</strong> to see the <a href="https://github.com/elastic/workflows">Elastic Workflow Library</a>, a collection of ready-to-use workflows for search, observability, and security use cases.</p><h3>The workflow editor</h3><p>The editor gives you YAML editing with autocomplete and validation. Start typing a step type, and suggestions appear. Use the <strong>quick action menu</strong> (Cmd+K / Ctrl+K) to browse available triggers, steps, and actions by category:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b94d4b4496f2214867bbd1ed1eb82174df426fd6-2538x1340.gif" style="width: 100%;" /><h3>Build the workflow</h3><p>Here&#x27;s our National Parks demo workflow:</p><pre><code>name: National Parks Demo
description: Creates an Elasticsearch index, loads sample national park data, searches for parks, and displays the results.
enabled: true
consts:
  indexName: national-parks

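# Manual trigger: run on demand from the UI or API (alert and scheduled triggers are also available)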
triggers:
  - type: manual

steps:
  - name: get_index
    type: elasticsearch.indices.exists
    with:
      index: '{{ consts.indexName }}'

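  # Branch on the boolean output of the previous step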
  - name: check_if_index_exists
    type: if
    condition: 'steps.get_index.output: true'
    steps:
      - name: index_already_exists
        type: console
        with:
          message: 'index: {{ consts.indexName }} already exists. Will proceed to delete it and re-create'
      - name: delete_index
        type: elasticsearch.indices.delete
        with:
          index: '{{ consts.indexName }}'
    else:
      - name: no_index_found
        type: console
        with:
          message: 'index: {{ consts.indexName }} not found. Will proceed to create'

  - name: create_parks_index
    type: elasticsearch.indices.create
    with:
      index: '{{ consts.indexName }}'
      mappings:
        properties:
          name:
            type: text
          category:
            type: keyword
          description:
            type: text

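  # refresh: wait_for makes the document searchable before the next step runs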
  - name: index_park_data
    type: elasticsearch.index
    with:
      index: '{{ consts.indexName }}'
      id: yellowstone
      document:
        name: Yellowstone National Park
        category: geothermal
        description: "America's first national park, established in 1872, famous for Old Faithful geyser and diverse wildlife including grizzly bears, wolves, and herds of bison and elk."
      refresh: wait_for

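  # Term query on _id; log_results below reads this step's hit count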
  - name: search_park_data
    type: elasticsearch.search
    with:
      index: '{{ consts.indexName }}'
      query:
        term:
          _id: yellowstone

  - name: log_results
    type: console
    with:
      message: 'Found {{ steps.search_park_data.output.hits.total.value }} park with doc id of yellowstone.'</code></pre><p>This workflow demonstrates several capabilities:</p><ul><li><strong>Constants:</strong> Define reusable values, like <code>indexName</code>, that can be referenced throughout.</li><li><strong>Elasticsearch operations:</strong> Check whether an index exists, delete it, create it with mappings, index a document, and search.</li><li><strong>Conditional logic with branches:</strong> If the index exists, log a message and delete it. If not, log that it wasn&#x27;t found. Either way, continue to create the index.</li><li><strong>Data flow:</strong> Each step references outputs from previous steps using <code>steps.&lt;name&gt;.output</code>.</li></ul><p>Notice the <code>{{ }}</code> syntax throughout. This is Liquid templating for passing data between steps. <code>consts</code> contains workflow constants. <code>steps.&lt;name&gt;.output</code> references the output of a previous step.</p><h3>Run the workflow</h3><p>Save the workflow and click the <strong>Play </strong>button next to the <strong>Save </strong>button to execute it.</p><p>The workflow begins executing, and you&#x27;ll see the execution view. Each step appears in the side panel as it runs, showing status and timing:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db89d94db97e51f76af0db084fef55c325b1426c-2538x1318.gif" style="width: 100%;" /><p>Click on any step to see its inputs and outputs. The side panel shows exactly what data the workflow received and produced at each step. This makes debugging straightforward: You see what went in, what came out, and where things went wrong if a step fails.</p><h3>Extending the workflow</h3><p>Let&#x27;s extend this workflow with AI and external notifications. We&#x27;ll add a step that generates a poem about the park using an LLM and then sends it to Slack.</p><p>Add these steps after <code>log_results</code>:</p><pre><code>  - name: generate_poem
    type: ai.prompt
    with:
      prompt: >
        Write a short, fun poem about {{ steps.search_park_data.output.hits.hits[0]._source | json }}.
        Include something about its famous features. No other formatting.

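  # Sends the poem through a preconfigured Slack connector, referenced by connector-id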
  - name: share_poem
    type: slack
    connector-id: my-slack-connector
    with:
      message: |
        🏔️ *Poem of the Day about Yellowstone*
        
        {{ steps.generate_poem.output.content }}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a529292d102b2b2800729458ecffb411b83f6f8f-5344x3008.png" style="width: 100%;" /><p>Now the workflow creates the index, loads data, searches it, generates a poem with AI, and shares it to Slack. The same patterns apply throughout: Add steps, reference their outputs, and let the workflow handle execution.</p><p>This is a simple example, but the same approach scales to real use cases. Replace national parks with security alerts, observability metrics, or any data in Elasticsearch. Replace the poem with an AI summary or assessment. Replace Slack with Jira, PagerDuty, or any connector you&#x27;ve configured.</p><h2>Workflows and Elastic Agent Builder</h2><p>The National Parks example demonstrates the core components of Workflows: triggers, steps, conditional logic, data flow, AI prompts, and external notifications. These components can be combined to build automations that orchestrate outcomes within Elasticsearch, Kibana, external systems, and AI.</p><p></p><p>This covers processes where the steps are known. But what about processes where they aren&#x27;t, where the right action depends on what you find, and what you find depends on where you look?</p><p>This is where <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Agent Builder</a> expands what&#x27;s automatable. An agent grounded in your operational context can do the initial analysis or investigation that an analyst or developer would normally do manually. It can explore, reason across data sources, and surface findings. Then the workflow continues with the structured steps that follow.</p><p>Workflows integrate with Agent Builder, and the integration works both directions.</p><h3>Agents as workflow steps</h3><p>Use the <code>ai.agent</code> step type to invoke an agent from within a workflow:</p><pre><code>- name: analyze
  type: ai.agent
  with:
    agent_id: my-analyst-agent
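    # my-analyst-agent is a placeholder agent ID. The json filter below
    # serializes the output of an earlier step (assumed to be named "search").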
    message: 'Analyze this data and recommend next steps: {{ steps.search.output | json }}'</code></pre><p>The agent uses its configured tools to query indices, correlate data, and reason across results. The workflow waits for findings and continues with the next steps.</p><h3>Workflows as agent tools</h3><p>Workflows can also be exposed to Agent Builder as tools, allowing agents to invoke them when action is required.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d73f329f5166f4c3bea4d697c451341ea3e7423e-5344x3044.png" style="width: 100%;" /><p>This pattern is useful when an agent determines what should be done, but the how should follow a known and repeatable process. Agents are well suited for reasoning, exploration, and judgment, especially when information is incomplete or evolving. Workflows are well suited for executing established procedures that involve multiple steps, external systems, and state-changing actions.</p><p>By exposing a workflow as a tool, you separate decision-making from execution. An agent can conclude that a specific outcome is needed, such as declaring an incident or initiating a response, and then delegate the execution to a workflow that encodes the correct sequence of actions. That workflow handles coordination across systems, applies the same logic every time, and makes the result observable and auditable.</p><p>This also enables reuse and consistency. The same workflow can be invoked by different agents or run manually, while enforcing a shared process and set of controls. Instead of teaching every agent how to carry out complex actions safely and correctly, that responsibility lives in one place.</p><p>Together, this creates a clear contract. Agents decide when an action is needed. Workflows ensure that the action follows the right process, every time.</p><h2>Looking ahead</h2><p>This technical preview establishes Workflows as a core capability of the Elasticsearch platform. The foundation is here: triggers, steps, data flow, AI integration, and the bidirectional connection with Agent Builder.</p><p>What comes next builds on this foundation by expanding how workflows are authored and where they show up. In addition to new step types and broader connector support, Workflows will support multiple authoring modes. These include natural language intent, where AI helps translate what you want to achieve into a working workflow, as well as visual, drag-and-drop builders. As these capabilities are embedded directly into solution-specific experiences across Elastic, they change not just how workflows are built, but how work actually gets done.</p><p><strong>For Developers</strong>, this means AI assistants that go beyond conversation. Agents that can actually do things: query systems, update records, trigger processes, and return results. Reasoning backed by reliable execution.</p><p><strong>In Observability</strong>, this means correlating signals from logs, metrics, and traces. Surfacing likely root causes. Orchestrating remediation steps. Closing the loop between detection and resolution without waiting for someone to intervene.</p><p><strong>In Security</strong>, this means investigation that begins the moment an alert fires. Enriching findings with context from internal and external sources. Coordinating response actions across tools. Updating cases and notifying the right people. Work that used to require manual effort, handled automatically.</p><p>The patterns you learn today apply directly to these experiences. 
Workflows is the automation layer that makes them possible.</p><h2>Get started with Elastic Workflows</h2><p>Elastic Workflows is available now as a technical preview. Get started with an <a href="https://cloud.elastic.co/registration">Elastic Cloud trial</a>, and check out the <a href="https://www.elastic.co/docs/explore-analyze/workflows">documentation</a> or <a href="https://github.com/elastic/workflows">library of example workflows</a> to learn more.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Skip MLOps: Managed cloud inference for self-managed Elasticsearch with EIS via Cloud Connect]]></title>
            <description><![CDATA[Introducing Elastic Inference Service (EIS) via Cloud Connect, which provides a hybrid architecture for self-managed Elasticsearch users and removes MLOps and CPU hardware barriers for semantic search and RAG.]]></description>
            <link>https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jordi Mon Companys, Matt Ryan]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e419160397b56d1eacc101cd69ecc9c8e0b20967-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>If you’re running Elasticsearch on-premises or in a private cloud, you’ve likely faced a familiar standoff when trying to modernize your search experience: You want to implement <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-inference">semantic search</a>. You know that state-of-the-art dense vector models, like jina-embeddings-v3, are the standard for relevance. But when you look at the infrastructure requirements to run these models at scale, the project stalls.</p><p>The problem usually isn&#x27;t the software: Elasticsearch has supported vector search for years. The problem is the hardware.</p><h2><strong>The MLOps bottleneck</strong></h2><p>The process of running model inference to generate the embeddings required for semantic search is computationally expensive. If you’re self-managing your cluster, this presents a difficult set of trade-offs regarding operational complexity and resource flexibility:</p><ul><li><strong>Burn CPU cycles:</strong> You run the models on your existing CPU-bound nodes. This works for small datasets and small models, but as ingestion rates climb, your indexing throughput plummets and your search nodes end up choking on vector generation instead of serving queries.</li><li><strong>Provision GPUs:</strong> You ask your infrastructure team for GPU-accelerated nodes. In many organizations, this triggers a procurement nightmare. GPUs are expensive and scarce, and they introduce a new layer of machine learning operations (MLOps) complexity: driver compatibility, container orchestration, and scaling logic that your team might not have time to manage.</li></ul><p>This creates a gap where self-managed deployments remain stuck on keyword search (BM25) simply because the infrastructure barrier to AI is too high.</p><h2><strong>Introducing a hybrid architecture for inference</strong></h2><p>We built the <strong><a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a> </strong>and made it available via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect"><strong>Cloud Connect</strong></a> to solve this specific hardware constraints problem. It allows self-managed clusters (running on Elastic Cloud on Kubernetes [ECK], Elastic Cloud Enterprise [ECE], or stand-alone) hosted on-premises or in private cloud environments to delegate compute-intensive model inference to Elastic Cloud. It doesn’t require a lift-and-shift migration of your cluster architecture. While the specific text fields you need to vectorize are transmitted to the cloud for processing, you don’t need to permanently relocate your terabytes of business data or re-architect your storage. It allows you to use a <strong>hybrid topology</strong>: Your data nodes, master nodes, and index storage remain in your private environment, but the heavy lifting of generating embeddings is outsourced to Elastic’s managed GPU fleets.</p><h2><strong>How it works: Data stays, intelligence travels</strong></h2><p>It’s important to understand the distinction between <em>storage </em>and <em>inference</em>. When you use EIS via Cloud Connect:</p><ol><li><strong>The handshake:</strong> You generate an API key in Elastic Cloud and paste it into your self-managed Kibana instance. 
This establishes a secure, authenticated bridge.</li><li><strong>The pipeline:</strong> When you index a document using the <code>semantic_text</code> field (or manually configure an inference processor), your local cluster automatically sends only the specific text content to the EIS endpoint, encrypted in transit.</li><li><strong>The inference:</strong> The text is processed in memory on Elastic’s managed GPUs. The resulting vector embedding is returned immediately to your local cluster.</li><li><strong>The storage:</strong> The vector is indexed and stored on your local disks alongside the original source document.</li></ol><p>The raw text payload is ephemeral: It’s processed for inference and discarded. It’s never indexed or permanently stored in the cloud. You get the relevance of a GPU-powered cluster without changing your data residency posture.</p><h2><strong>Scaling semantic search without hardware provisioning</strong></h2><p>Let’s look at a practical scenario. You’re a site reliability engineer (SRE) managing a cluster for a large ecommerce platform. The search team wants to deploy Jina to fix “zero results” queries, but your on-premises nodes are CPU-bound and you have no GPU infrastructure available.</p><p>Here’s how you can use EIS via Cloud Connect to solve this in minutes, not months.</p><p>Step 1: <strong>The handshake</strong></p><p>First, we establish the bridge between your self-managed cluster and Elastic Cloud:</p><ul><li>Navigate to <strong>Stack Management</strong> in your local Kibana.</li><li>Under the <strong>Cloud Connect</strong> section, click <strong>Connect to Elastic Cloud</strong>.</li><li>Authenticate with your Elastic Cloud credentials, and authorize the connection.<ul><li>Result: Your local cluster is now cloud connected, acting as a satellite that can consume Software as a Service (SaaS) services.</li></ul></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7f2f846515f30375340080c6f61b0321f6db5c1-1600x1162.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/329916c4ecd6bdb78d21812751014b887de6148f-1600x1158.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e98f5b7f1aa1d245507163df285b6688745c6ff-1600x1156.png" style="width: 100%;" /><p>Step 2: <strong>Enable the service</strong></p><ul><li>On the Cloud connected services page, locate <strong>Elastic Inference Service</strong>.</li><li>Click <strong>Enable</strong> and wait for the status to switch to <strong>Enabled</strong>.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c171888a47a8e412551b701c59274fd7b3f6d33e-1600x1159.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f678694670f3686ac4b7be8a1850b39421999f48-1600x245.png" style="width: 100%;" /><p>This makes all of these inference endpoints immediately available locally:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4332fb202e9537128d03abf115f6728b925755ce-800x579.gif" style="width: 100%;" /><p>Step 3: <strong>Configure the pipeline</strong></p><p>Now that the service is available, you configure ingestion. Instead of managing a local Jina container, you simply point a <code>semantic_text</code> mapping at the cloud-hosted model. Open the console and try the Jina embeddings model locally:</p><pre><code>PUT /semantic-search-eis
{
  "mappings": {
    "properties": {
      "text": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v3"
      }
    }
  }
}

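# Index a sample document; its text is sent to the EIS endpoint for
# embedding, and the returned vector is stored locally.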
POST /semantic-search-eis/_doc
{
  "text": "Aberdeen Football Club"
}

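# "soccer" appears nowhere in the document; on a semantic_text field,
# the match query searches by meaning rather than by exact keywords.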
GET /semantic-search-eis/_search
{
  "query": {
    "match": {
      "text": "soccer"
    }
  }
}</code></pre><p>The match query will return the semantically correct result for “Aberdeen Football Club” where an exact match for “soccer” would not:</p><pre><code>{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.89421,
    "hits": [
      {
        "_index": "semantic-search-eis",
        "_id": "1",
        "_score": 0.89421,
        "_source": {
          "text": "Aberdeen Football Club"
        }
      }
    ]
  }
}</code></pre><p>Step 4: <strong>Hybrid ingestion</strong></p><p>When you index a document through this setup, the flow is streamlined:</p><ul><li>Your local node accepts the document.</li><li>The text field is securely transmitted to the EIS endpoint.</li><li>Elastic Cloud generates the dense vector embedding on a managed GPU using Jina v3.</li><li>The vector is returned and indexed locally on your self-managed disk.</li></ul><h2><strong>Oh, and one more thing: LLMs via Cloud Connect too!</strong></h2><p>While vector search solves the <em>retrieval</em> part of retrieval-augmented generation (RAG), Cloud Connect also solves the <em>generation</em> side. Typically, giving your self-managed teams access to high-quality large language models (LLMs), like Anthropic’s Claude, involves a separate procurement cycle, managing vendor API keys, and handling secure egress.</p><p>EIS via Cloud Connect removes that friction entirely.</p><ul><li><strong>Preconfigured access:</strong> The moment you enable the service, you get access to preconfigured Anthropic connectors (one for Claude 3.7 and one for Claude 4.5). You don’t need to bring your own API key or sign a separate contract with the model provider.</li><li><strong>Zero setup:</strong> It just works. Because the secure bridge is already established, you can immediately select these models in the Search Playground to test RAG against your local data.</li><li><strong>Integrated workflows:</strong> This also powers our internal AI tools. Your team can instantly start using the Elastic AI Assistant for observability insights or Attack Discovery for security threat analysis, all powered by cloud-hosted LLMs, grounded in your on-premises data.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/361d7f503a0401078b19b5108d40d8a59fa2e607-1600x1160.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8bfd1b8de4a5a9621b414791fe03f979ad7458ec-1600x1164.png" style="width: 100%;" /><h2><strong>The shortcut to AI</strong></h2><p>EIS via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect </a>removes the operational friction of managing GPU drivers, specialized hardware, and complex performance monitoring stacks. By offloading these requirements to Elastic Cloud, you can build RAG applications and semantic search interfaces today, regardless of whether your cluster is running on-premises or in a private cloud VPC.</p><p>The hybrid approach solves the resource utilization problem typical of self-managed AI. You no longer have to over-provision expensive hardware that sits idle during quiet periods, nor do you have to worry about performance bottlenecks when traffic bursts. You simply establish the secure connection via Cloud Connect and consume high-performance inference as a managed API, keeping your data residency intact while your search capabilities scale instantly.</p><p>EIS via Cloud Connect is available immediately for <strong>Elastic Enterprise</strong> self-managed customers on <strong>Elastic Stack 9.3</strong>.</p><ul><li><a href="https://www.elastic.co/search-labs/blog/elastic-autoops-self-managed-elasticsearch"><strong>Read the docs:</strong> Guide to Configuring Cloud Connect</a></li><li><a href="https://cloud.elastic.co/registration"><strong>Start a trial:</strong> Create an Elastic Cloud account</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cookbook for a production-grade generative AI sandbox]]></title>
            <description><![CDATA[Exploring the recipe for a generative AI sandbox, giving developers a secure environment to deploy application prototypes while enabling privacy and innovation.]]></description>
            <link>https://www.elastic.co/search-labs/blog/generative-ai-sandbox-data-privacy</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/generative-ai-sandbox-data-privacy</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean MacKirdy]]></dc:creator>
            <pubDate>Mon, 02 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7ed91ecf91e7a1de7d9ff514a2c285f2cb3f65c-1000x628.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Building generative AI (GenAI) applications is all the rage, and <em>context engineering</em>, that is, providing the prompt structure and data needed for a large language model (LLM) to return specific, relevant answers to a question without filling in the blanks itself, is one of the most popular patterns that has emerged in the past 24 months. One particular subset of context engineering, retrieval-augmented generation (RAG), is being used widely to bring additional context to LLM interactions by using the power of natural language-based search to surface the most relevant results in private datasets based on meaning rather than on keywords.</p><p>As context engineering is exploding, ensuring that rapid prototype projects don’t expose business- or mission-critical data to unauthorized recipients is a significant concern. For audiences interested in technology and policy alike, I&#x27;ve championed the concept of a <em>privacy-first GenAI sandbox</em>, which I’ll simply refer to as a sandbox from here on. In this article, the term <em>sandbox</em> refers to a self-service, secure prototyping space (much like a child&#x27;s sandbox, where the wooden edges prevent sand from escaping), allowing organization members to test their custom context engineering applications safely, without risking exposure of confidential data.</p><h2>Production-grade GenAI sandboxes = enabling privacy <em>and</em> innovation</h2><p>GenAI, from text-generating tools, like ChatGPT, Claude, and Gemini, to image creators, such as Google’s Nano Banana, OpenAI’s DALL-E, and Midjourney, has sparked discussions everywhere: in classrooms, at dinner tables, in regulatory circles, in courts, and in boardrooms over the past two years.</p><p>I’ve had the privilege of sharing Elastic’s approach to context engineering, and particularly RAG, with customers, including developers and C-suite executives, and with contacts of mine, ranging from friends and family to legislators. Think of context engineering as a librarian that looks up and then serves contextual data to augment text, audio, or image GenAI apps with information they don’t have in the data they were trained on for their intended tasks; for example, looking up sports scores and headlines to help a text-generation application answer the question, “What happened in the National Hockey League yesterday?”</p><p>Elasticsearch Labs has fabulous primers on context engineering <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">here</a> and RAG <a href="https://www.elastic.co/search-labs/blog/articles/retrieval-augmented-generation-rag">here</a>, if you’re unfamiliar with the concept and would like to do more reading.</p><p>A <em>privacy-first</em> approach ensures that context engineering supplies the GenAI app with protected, selected, or delicate data, fostering responses that are better informed and more pertinent than what might be generated using solely public information. 
An example of this would be providing a GenAI-powered, interactive text chat experience (chatbot) for university students to obtain financial aid and scholarship information relevant to their personal background, without risking exposure of personally identifiable information (PII), such as their Social Security number or birthdate, to malicious actors extracting information via common vulnerabilities, as per the <a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/">OWASP Top 10</a>, or the LLM itself.</p><p></p><p>The core tenets of the logic behind deploying a sandbox are as follows:</p><ol><li>Users will find a way to incorporate GenAI into their daily workflow, regardless of whether one’s organization provides the tools. Even in organizations where preventing such “shadow IT” is, realistically, impractical or impossible, providing and monitoring access to prevent disclosure of an organization’s sensitive data still remains imperative; a sandbox is just the place to turn such tools loose.</li><li>Providing a sandbox to deploy applications with Application Performance Monitoring (APM) and information security (InfoSec) best practices embedded allows an organization to derive insights into potential use cases for GenAI while also safeguarding privacy, enabling audit and accountability of GenAI use, and establishing centralized cost management.</li><li>An organization’s sandbox should allow either self-service or low-touch deployment of peer-reviewed GenAI applications to permit maximum experimentation with minimum friction by those inclined to develop their own applications. </li><li>If properly implemented and contained within the organization’s controlled perimeter, the sandbox allows leveraging data assets available to the organization without triggering the liabilities that could attach to unauthorized or unintended external sharing or other leakage of protected data such as PII – think California CCPA, or the EU/UK GDPR for instance.</li></ol><p>This article will not focus on building a GenAI app; there are numerous excellent examples here on Elasticsearch Labs. Instead, I’ll be focusing on the <em>recipe</em> necessary for deploying a sandbox that provides the security and availability needed to implement principle #3 above.</p><h3>Foundational ingredients</h3><p>For a sandbox to be considered <em>production grade</em>, the following foundational ingredients should be considered:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c53aaa04503baf654ccee274f012d7c1ddc2f643-1380x1600.png" style="width: 100%;" /><p>Let&#x27;s explore why each ingredient plays a crucial role in our sandbox recipe. As we do, please note that brand-name decisions I’ve listed below are based on personal experience and aren’t an endorsement of one technology or another by Elastic. As with any recipe, these then form my preferred ingredients. You can, of course, substitute in each area to make the recipe to your liking:</p><h4>1. Containerization platform</h4><p>The first ingredient in our sandbox recipe is the selection of a <strong>containerization platform</strong>. These platforms, while conceptually similar to the virtual machines that have been a staple of enterprise IT for the past 15+ years, represent a significant evolution in how applications are packaged and deployed. 
They’re designed for rapid deployment, upgrades without service disruption, and native distribution across both on-premises and cloud computing environments, while also providing increased testability, validation of infrastructure, and auditability. The platform you choose, often managed through <strong>infrastructure as code</strong> (IaC) to ensure reproducibility and consistency, is the foundation that enables agility and scalability for your GenAI applications.</p><p><strong>Key components of a containerization platform</strong></p><p>A robust containerization platform is built on several key components:</p><ul><li><strong>Container runtime:</strong> The software that executes containers and manages their lifecycle. A popular example is <strong>Docker</strong>, which provides the tools to build, share, and run container images.</li><li><strong>Image build infrastructure:</strong> This is the process and tooling used to create container images from your application&#x27;s source code. Tools like <strong>Dockerfiles</strong> provide a clear, repeatable way to define the environment, dependencies, and application code within an image, ensuring consistency across development, testing, and production environments.</li><li><strong>Orchestration engine:</strong> For a production-grade environment, you need a system to automate the deployment, scaling, and management of containers. <strong>Kubernetes</strong> (k8s) is the industry standard for this, providing powerful features for load balancing, self-healing, and service discovery. More on that below in ingredient #2.</li></ul><p><strong>1.1 Infrastructure as code:</strong></p><p>To ensure the reproducibility and maintainability of your sandbox, a <strong>containerization platform</strong> should be managed using <strong>IaC</strong> principles. This means that instead of manually configuring your platform, you define your infrastructure (for example, Kubernetes clusters, networking rules, security policies) in code files (for example, using <strong>Terraform</strong> or <strong>Pulumi</strong>). This approach provides several benefits:</p><ul><li><strong>Version control:</strong> Your infrastructure can be treated like any other code, allowing you to track changes, revert to previous versions, and collaborate with your team using Git.</li><li><strong>Consistency:</strong> IaC dramatically reduces manual errors and ensures that your sandbox environment can be recreated identically in any cloud or on-premises location.</li><li><strong>Automation:</strong> It enables you to automate the entire setup and teardown process, making it easy to create temporary sandboxes for specific projects or testing.</li></ul><h4>2. Hosting and orchestration</h4><p>As we introduced in the &quot;Containerization platform&quot; section, a powerful orchestration engine is needed to manage our containers at scale. For this, k8s is the de facto standard for orchestrating a production-grade sandbox. If you’re unfamiliar, check out the Cloud Native Computing Foundation (CNCF) primer on k8s available <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/">here</a>. Whether running in the cloud or on-premises, Kubernetes provides the robust framework needed to deploy, scale, and manage the lifecycle of containerized applications. 
Major cloud providers, like Google Cloud (Google Kubernetes Engine [GKE]), Amazon Web Services (Elastic Kubernetes Service [EKS]), and Microsoft Azure (Azure Kubernetes Service [AKS]), all offer mature, managed Kubernetes services that handle the underlying complexity, including in particular contractually assured and independently certified compliance with statutory privacy and information security mandates, allowing your teams to focus on building and deploying applications.</p><p>For a GenAI sandbox, Kubernetes is particularly valuable because it can efficiently manage and scale GPU resources, which are often necessary for two key components of the GenAI stack: 1) privately hosted LLMs; and 2) the inference processes that power them (discussed in more detail in ingredients #6 and #7). Its ability to automate deployments and manage resources ensures that rapid prototypers can experiment with different models and applications without needing to become infrastructure experts, all within the secure and isolated area, called a <em>namespace</em> in k8s, that you define. This abstraction is key to the sandbox&#x27;s success, empowering innovation while maintaining centralized control.</p><h4>3. Code repository / image repository</h4><p>A centralized code repository is an essential element of a secure and collaborative GenAI sandbox. It provides a single, controlled environment for developers to store, manage, and version their code, preventing the proliferation of sensitive information across disparate, unsecured locations. By establishing a centralized repository, organizations can enforce security policies, monitor for vulnerabilities, and maintain a clear audit trail of all code changes, which is critical for maintaining data privacy and integrity within the sandbox environment.</p><p>For instance, a service like GitHub, when integrated with your organization&#x27;s identity and access management (IAM) and single sign-on (SSO) solutions (see ingredient #4 below), becomes a powerful tool for enforcing the principle of least privilege. This integration ensures that only authenticated and authorized developers can access specific code repositories. You can create teams and apply granular permissions, restricting access to sensitive projects and preventing unauthorized code modifications. This is especially important in a GenAI context where code might contain proprietary algorithms, sensitive data connectors, or even, in some cases, organization or user-level credentials or other confidential information.</p><p>Furthermore, modern repository platforms offer automated security scanning features. These tools continuously scan code for known vulnerabilities, insecure coding practices, and exposed secrets. If a developer accidentally commits a password or an API key, the system can automatically flag it and notify the security team. This proactive approach to security is essential for preventing data breaches, enforcing legal requirements and contractual commitments of confidentiality, and ensuring the overall integrity of the GenAI applications being developed to deploy in the sandbox. By mandating that all development occurs in a centralized and secured repository, you create a transparent, auditable, and secure foundation for innovation, allowing your developers the freedom to experiment without compromising organizational security.</p><h4>4. Identity and access management</h4><p>IAM is a core component of a secure, privacy-first grounded AI environment. 
It provides the foundation for ensuring that only authorized individuals and services can access sensitive data and powerful AI models. A robust IAM framework enforces the principle of least privilege, granting the minimum level of access necessary for a user or service to perform its function.</p><p><strong>4.1 Single sign-on:</strong></p><p>SSO streamlines user access by allowing users to authenticate once and gain access to multiple applications and services without re-entering their credentials. In a sandbox environment, SSO simplifies the user experience for developers, data scientists, and business users who need to interact with various components of the AI ecosystem, such as data repositories, modeling workbenches, and deployment pipelines. By centralizing authentication, SSO also enhances security by reducing the number of passwords that can be compromised and providing a single point for enforcing authentication policies. Importantly, it also lowers the barrier to entry for less-experienced developers to properly protect the data they are using in the sandbox, in turn preventing the inadvertent disclosure of sensitive information to insiders and outsiders alike.</p><p><strong>4.2 Role-based access control:</strong></p><p>Role-based access control (RBAC) is a method of restricting network access based on the roles of individual users within an organization. In the context of a GenAI sandbox, RBAC is used to define and enforce permissions for different user personas. For example, a data scientist role might have read/write access to specific datasets and the ability to apply machine learning models, while a business analyst role may only have read-only access to the outputs of those models. This ensures a clear separation of duties and prevents unauthorized access to or modification of sensitive data and AI assets.</p><p><strong>4.3 Attribute-based access control:</strong></p><p>Attribute-based access control (ABAC) provides a more granular and dynamic approach to access control than traditional RBAC. ABAC makes access decisions based on a combination of attributes of the user, the resource being accessed, and the environment. For instance, access to a particularly sensitive dataset could be restricted to users who are on the data scientist team (user attribute), accessing a resource tagged as PII (resource attribute), and are doing so from a corporate network during business hours (environment attributes). This level of granular control is critical in a GenAI sandbox for enforcing complex data governance and privacy requirements. We’ll come back to this later, when discussing the search AI datastore.</p><p><strong>4.4 Access auditability:</strong></p><p>A robust IAM framework also ensures that the granting, use, review, and revocation of all access permissions are granularly logged, discoverable, and auditable, so that in case of any suspected or confirmed incident, responders can quickly understand what happened, contain the incident, assess its extent, and comprehensively remedy its consequences. This is not only important for the organization’s own security, but also necessary to comply with any incident reporting and breach notice requirements that could be triggered.</p><h4>5. Secrets management</h4><p>Of all the ingredients in our recipe, secrets management is perhaps the most potent, yet most frequently overlooked. 
Much like a tiny pinch of saffron can dramatically alter a culinary dish, a single mishandled secret can have an outsized and devastating impact on your organization&#x27;s security and reputation. In our context, a <em>secret</em> is any piece of sensitive information needed for our applications to function: API keys for first- or third-party services, database passwords, trust certificates, or tokens for authenticating to LLMs.</p><p>When these secrets are hard-coded into source code or left in plain-text configuration files, they create a massive vulnerability. A leaked API key or an exposed database credential can bypass all other security measures, providing a direct path for attackers to access sensitive data and systems. This is especially critical in a GenAI sandbox, where developers are frequently connecting to various data sources and external model providers. Without a robust secrets management strategy, you’re leaving the keys to your kingdom scattered across your digital landscape, turning your innovative sandbox into a potential source for a major data breach.</p><p>To properly secure these secrets, a dedicated secrets management platform is an essential ingredient. These tools provide a centralized, encrypted vault for storing secrets, with robust access control, auditing, and dynamic rotation capabilities. Whether you choose a self-hosted solution, like HashiCorp Vault, or a managed cloud service, such as Google Cloud&#x27;s Secret Manager or AWS Key Management Service (KMS), the principle is the same: Programmatically inject secrets into your applications at runtime. This practice ensures that secrets are never exposed in your code, keeping your most valuable credentials secure and your sandbox environment protected.</p><p>And this is more than just a best practice: since secrets management technology is readily available and widely used, it forms part of the “state-of-the-art” which certain privacy laws and regulators reference as the benchmark against which an organization’s information security posture must be assessed. Failing to protect an organization’s most valuable secrets with the latest and greatest techniques available is not only a missed opportunity; it is also a potential case of regulatory non-compliance, as enforcement agencies and courts of law often remind us.</p><h4>6. Private LLM deployment(s)</h4><p>Early in the advent of modern GenAI, the primary driver for using managed services, like Azure OpenAI, was the assurance that customer prompts and data would not be used to retrain public models. This was a crucial first step in enterprise adoption. However, as the field has matured, the conversation has shifted. While data privacy remains paramount, the decision to use private LLM instances, whether from major cloud providers or self-hosted, is now equally driven by the need for guaranteed throughput, predictable latency, and fine-grained control over the model&#x27;s operational environment to support production-grade applications.</p><p>This critical ingredient comes in three distinct flavors, each with valid use cases and its own set of trade-offs:</p><p><strong>A. Cloud-hosted SaaS</strong></p><p>This is the most common and accessible approach. Services like OpenAI Enterprise, Azure OpenAI, Google Cloud&#x27;s Vertex AI, and AWS Bedrock provide access to powerful, state-of-the-art models through a managed API.</p><ul><li><strong>Pros:</strong> This flavor offers the fastest time-to-market. 
The cloud provider handles all the underlying infrastructure, scaling, and maintenance, allowing teams to focus purely on application development. It provides a simple, pay-as-you-go model and access to a diverse model library of proprietary and open-source options.</li><li><strong>Cons:</strong> This approach offers the least control over the underlying infrastructure, which can lead to variability in performance during peak demand. It can also be more expensive at very high volumes, and it creates a dependency on the provider&#x27;s roadmap and model availability. It also increases the potential vulnerability surface of the application, with data leaving the customer premises: a challenge for highly regulated and/or sovereignty-minded customers.</li></ul><p><strong>B. Cloud-hosted GPU + containerized LLMs</strong></p><p>This flavor involves running open-source LLMs (like models from Mistral or Meta&#x27;s Llama series) on your own virtualized GPU infrastructure within a cloud provider. This is typically managed using the containerization and Kubernetes orchestration we&#x27;ve already discussed, often with high-performance inference servers like vLLM.</p><ul><li><strong>Pros:</strong> This approach provides a powerful balance of control and flexibility. You gain direct control over resource allocation, model versioning, and the serving configuration, allowing for significant performance tuning. In high-concurrency scenarios, a well-tuned inference server can dramatically increase throughput. For example, benchmarks have shown inference engines like vLLM delivering significantly higher tokens-per-second and lower latency compared to less production-oriented servers under heavy load [<a href="https://developers.redhat.com/articles/2025/09/30/vllm-or-llamacpp-choosing-right-llm-inference-engine-your-use-case">Red Hat, 2025</a>].</li><li><strong>Cons:</strong> This option carries a higher operational burden. Your team is now responsible for managing the GPU instances, container images, and the inference server configuration. It requires deeper technical expertise in machine learning operations (MLOps) and infrastructure management to implement and maintain effectively.</li></ul><p><strong>C. On-premises GPUs + containerized LLMs</strong></p><p>The most controlled, and often most complex, approach involves deploying containerized LLMs on your own dedicated hardware within your own data centers. This setup is functionally similar to the second flavor but removes the reliance on a public cloud provider for the hardware layer.</p><ul><li><strong>Pros:</strong> This flavor offers maximum security, control, and data sovereignty. It’s the only option for organizations that require a completely <em>air-gapped environment</em>, where no data leaves the physical premises. For massive, predictable workloads, it can become more cost-effective in the long run by avoiding cloud data egress fees and per-transaction costs.</li><li><strong>Cons:</strong> The initial capital expenditure for purchasing and maintaining high-end GPU hardware is substantial. It requires a highly specialized team to manage the physical infrastructure, networking, and the entire software stack. This approach is more difficult to scale, as it requires the physical procurement and installation of new hardware.</li></ul><h4>7. Search AI data store</h4><p>If the LLM is the brain of our GenAI application, then the datastore is its heart, pumping relevant, context-rich information to be reasoned upon. 
For a RAG application to be truly effective, it cannot rely on a simple vector database alone. The grounding data is often complex, containing a mix of unstructured text, structured metadata, and a variety of data types. Therefore, the datastore you select must possess a unique set of characteristics to handle this complexity at scale.</p><p>Underpinning this entire process is the creation of <em>vector embeddings</em>, numerical representations of your data relative to the knowledge set of that embedding space. To enable semantic search, your data must first be converted into these numerical representations by an inference model. A flexible datastore should not only store these vectors but also be capable of hosting the inference process itself. Crucially, it should allow you to use your model of choice, whether it&#x27;s a state-of-the-art multilingual model, a fine-tuned model for a specific domain like finance or law, a compact model built for very high-speed results, or even a model that can process images. By managing inference, the platform ensures that your data is consistently and efficiently vectorized, paving the way for the powerful search capabilities that follow.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dc05dfaf1bde3c7a74cf559b8c790a6e8e36be45-1600x900.png" style="width: 100%;" /><p>First, it must master <strong>hybrid search</strong>. The best retrieval systems don&#x27;t force a choice between traditional keyword search, like BM25, which excels at finding specific keywords, and modern vector search, which excels at finding results using semantic meaning (that is, natural language). A truly capable datastore allows you to use both simultaneously in a single query. This ensures you can find documents that match exact product codes or acronyms while also finding documents that are conceptually related, providing the LLM with the most relevant possible context.</p><p>Second, it needs a sophisticated method for <strong>intelligent result reranking</strong>. When you run a hybrid search that combines multiple approaches, you need a way to merge the different result sets into a single, coherent ranking. Techniques like reciprocal rank fusion (RRF) are crucial here, as they intelligently combine the relevance scores from different queries to produce a final list that is more accurate and relevant than any single approach could deliver on its own.</p><p>Finally, a search AI-oriented datastore must be a <strong>unified engine with security built in</strong>. For enterprise RAG, it&#x27;s not enough to just find similar vectors. You must be able to apply security and access controls to data <em>before</em> the search even happens. The aforementioned RBAC and ABAC capabilities allow prefiltering of content at search time, ensuring that the vector search is only performed on data a user is authorized to see. This mitigates risks of accidental or malicious circumvention of your access controls through the sandbox, preserving demonstrable compliance with privacy and confidentiality requirements. This capability, which combines filtering, full-text search, and vector search in a single, scalable platform, is the defining characteristic of a datastore truly ready to power a secure, privacy-first GenAI sandbox.</p><h4>8. APM and security</h4><p>The final ingredient in our recipe ensures the health, security, and performance of the entire sandbox: a unified platform for APM and security information and event management (SIEM). 
A key characteristic of a truly versatile search AI datastore is its ability to power the <em>R</em> in your RAG applications, while also acting as the standards-based repository for all logs, metrics, and traces generated by your infrastructure and applications. By consolidating this operational data into the same powerful datastore, you create a single pane of glass for observability and security.</p><p>This approach provides several critical capabilities. At the infrastructure level, you can monitor the performance and resource utilization of both the k8s clusters hosting your sandbox and the underlying GPUs that power your LLMs, allowing you to proactively identify bottlenecks or failures. At the application layer, APM provides detailed traces to diagnose latency issues or errors within your GenAI prototypes. For security, this centralized datastore becomes your SIEM, correlating login events, application logs, and network traffic to detect anomalous behavior or potential threats within the sandbox.</p><p>Most importantly, this unified platform allows you to gain deep insights into the usage of the GenAI applications themselves. By ingesting and analyzing the application telemetry, which should include the prompts being submitted by users wherever permissible, potentially with PII redacted, you can identify trends, understand what types of questions are being asked, and discover popular use cases. This provides an invaluable feedback loop for improving your RAG applications and demonstrates the power of using a single, scalable datastore to secure, monitor, and optimize your entire GenAI ecosystem.</p><h2>Cooking the recipe</h2><p>With all of the ingredients in place, let’s talk about the steps for assembling them into a production-grade sandbox.</p><p>As with any recipe book, let’s start with a photo of the cooked dish. Here’s a view of what a final architecture might look like:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de028ce2f5acd6f84e59ea67091472128a8a0143-1430x990.png" style="width: 100%;" /><p>The holistic environment depicted here consists of a Kubernetes cluster to host your sandboxed AI applications (with dev/preprod/prod namespaces for a continuous integration and continuous deployment [CI/CD] pipeline), an IAM infrastructure for authentication, a few GenAI applications, a repository for code and container images, and a wrapper of APM and cyber monitoring around the entire sandbox.</p><h3>Recipe step 0: Policy baseline</h3><p>Before you begin mixing any ingredients, every good chef performs their <em>mise en place</em>, that is, setting up their station for success. In our recipe, this means establishing clear policies for how the sandbox will be used. This is the foundational step, where you decide the rules of your kitchen. Will developers be allowed to use internal production data, or production data sanitized with techniques like pseudonymization and differential privacy, or life-like synthetic data, or only public data? Will the sandbox be a completely self-service platform, or a managed service with guardrails? Will application updates require a formal Change Review Board, or is a peer-review process sufficient? These questions are highly specific to each organization’s context and purposes. 
Answering them up front is critical, as these policy decisions will directly influence how you configure every other ingredient in the recipe.</p><h3>Recipe step 1: InfoSec baseline</h3><p>As stated in the “Ingredients” section, IAM is a nonnegotiable part of our recipe. Before letting anyone into the kitchen, you must secure the perimeter and ensure only authorized chefs wearing your approved uniform and compliant protective gear can access the tools and ingredients. This means working directly with your information security organization from day one to build the sandbox on a foundation of strong security principles. Access to your datastore, your code repository, your Kubernetes hosting environment, and the applications themselves must be restricted based on established best practices.</p><p>With your organization’s IAM policies enforced in the environment, a practical authentication flow might look like the one depicted in figure 3.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7b1062f775ed2cf21e9386d1999933ba3f73efc4-1432x1488.png" style="width: 100%;" /><p>As you can see in the figure, no communication can occur between applications in the Kubernetes production namespace without first passing through an OAuth proxy, such as Vouch. This ensures every user is authenticated against a central provider, like Okta, which enforces policies such as two-factor authentication. In this model, critical user context, such as username and IP address, can be passed along with every request, enabling robust auditing and nonrepudiation at the application layer.</p><h3>Recipe step 2: Container configuration baseline</h3><p>Assuming that many of your rapid prototypers are passionate innovators but not necessarily seasoned software engineers or legally trained data compliance experts, it’s critical to provide a baseline configuration to ensure their success and security, without putting them at risk of breaching any rules or policies inadvertently. Think of this step as providing a master recipe card that guarantees consistency. At a minimum, you should provide clear documentation on how to build a container image, deploy it into the Kubernetes cluster, and test that all connectivity is secure.</p><p>Even better, you can create a “Clone This Starter App” template in your code repository. This gives developers a preconfigured, security-blessed starting point, complete with Dockerfiles and pipeline scripts, that they can immediately fork to begin tinkering, dramatically lowering the barrier to entry while enforcing best practices from the outset.</p><p>Additionally, many real life GenAI use cases will inevitably involve some form of PII processing, or can produce outputs that will materially impact individuals such as your employees, your consumers, or your customers’ staff. In such cases, more and more state, federal and international laws require completing various risk assessments before actual work can begin. These assessments can be cumbersome to conduct and are difficult to scale if they are carried out case by case. 
The “Clone This Starter App” approach also helps to prevent such compliance mandates from becoming bottlenecks to innovation, since under most legal mandates, the required assessments can be completed once for your template, and they need not be repeated for any clone that doesn’t exceed your initially defined parameters.</p><h3>Recipe step 3: Deploy user applications</h3><p>With your policies defined, your security baseline established, and your developer templates in place, it’s finally time to serve the dish. Whether you&#x27;ve chosen a self-service or managed deployment model, you can now confidently invite the rapid prototypers in your organization to start creating in the sandbox.</p><p>Because you’ve included APM and security logging (ingredient #8) from the beginning, you have the necessary observability to monitor application performance and user activity. This is where the magic happens: You can now learn from the applications people build, identify powerful new use cases, and gather real-world data to improve the platform, all while safeguarding organizational data. Coincidentally, this approach will also allow you to organically collect the information you might need to put on record, disclose to users, or share with auditors and regulators to demonstrate the transparency, accountability, and explainability of your GenAI application, ticking many compliance boxes as you build (and not after the fact) – a textbook best practice of <em>Privacy by Design</em>.</p><h2>Where do you go from here?</h2><p>We&#x27;ve now walked through the entire cookbook, from selecting your fresh ingredients to following the recipe step by step. Most of the domains we&#x27;ve discussed (containerization, APM, IAM, and more) are culinary specialties in and of themselves.</p><h2>Conclusion</h2><p>This cookbook was designed to provide a clear recipe for building a production-grade GenAI sandbox. By carefully selecting each foundational ingredient, from your containerization platform and Kubernetes orchestration to your search AI datastore and unified APM, you ensure your final dish will be both successful and secure. Following the recipe ensures that this powerful environment is built on a foundation of security and thoughtful policy from day one.</p><p>The goal is to empower your rapid prototypers, not restrict them, and to foster a culture of responsible innovation. By providing a secure, observable, and well-equipped kitchen for experimentation, you get ahead of the curve. This proactive approach enables you to harness the creativity of your entire organization, transforming brilliant ideas into tangible prototypes while preventing the rise of shadow AI. You&#x27;ve cooked the meal; now you can enjoy the innovation it serves.</p><p>If you want to chat about this or anything else related to Elasticsearch, come join us in our <a href="https://discuss.elastic.co/">Discuss forum</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Query rewriting strategies for LLMs and search engines to improve results]]></title>
            <description><![CDATA[Exploring query rewriting strategies and explaining how to use the LLM's output to boost the original query's results and maximize search relevance and recall.]]></description>
            <link>https://www.elastic.co/search-labs/blog/query-rewriting-llm-search-improve</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/query-rewriting-llm-search-improve</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Christina Nasika, Emilia Garcia Casademont]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e56cb99dab7734d506e6b8a226c798047122c1a9-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>One of the biggest challenges in integrating large language models (LLMs) into search pipelines is the complexity of navigating the space of possibilities that they provide. This blog focuses on a small set of concrete <strong>query rewriting (QR) </strong>strategies, using LLM-generated keywords, pseudo-answers, or enriched terms. We specifically focus on how to best use the LLM&#x27;s output to strategically boost the original query&#x27;s results to maximize search relevance and recall.</p><h2>LLMs and search engines: An exploration of query rewriting strategies for search improvement</h2><p>The integration of large language models (LLMs) with search engines represents a significant advancement in the fields of search and AI. This integration can take many forms, depending on the application [1]. On one hand, search engines can enhance LLMs by grounding their generation in freshly retrieved data, a strategy that’s commonly referred to as retrieval-augmented generation (RAG). On the other hand, LLMs can improve search engines by optimizing various stages of the search pipeline.</p><p>The search pipeline typically consists of three main components: indexing, first-stage retrieval, and reranking. LLMs can contribute at each of these stages. During indexing, LLMs can assist with data preparation tasks, such as text cleaning, entity extraction, and document enrichment by adding domain knowledge, synonyms, or clarifications. In the first-stage retrieval, LLMs can optimize user queries by rewriting them in natural language to improve retrieval accuracy or by mapping them to structured queries (for example, domain‑specific language–style [DSL-style] or Elasticsearch Query Language–style [ES|QL-style] queries) [2]. This blog focuses on query optimization strategies for this stage.</p><p>While there is some research on using LLMs as rerankers [3], the literature is less extensive. Technical blogs on the topic suggest that LLMs as rerankers may not always be the optimal choice, though this area remains an active field of exploration [4,5].</p><p>The advancements in LLMs have also unlocked new possibilities beyond the traditional indexing and retrieval stages. For example, LLMs can be used to generate natural language responses grounded in retrieved data (RAG). Users increasingly anticipate coherent, natural language responses to their queries, which are also dependable and guaranteed to be based on retrieval data. This is a significant shift in user expectation, occurring quickly since LLM-integrated search engines became generally available, demonstrating a major improvement in user experience. A language model that can understand intent, retrieve data, and synthesize information is especially valuable in multi-hop scenarios where a query requires combining information from various sources.</p><p>This is even clearer when looking at the application of LLMs in the creation of interactive, conversational search interfaces. These interfaces allow users to submit queries, provide feedback on responses, or introduce clarifications, enabling iterative improvements to the results, while making use of the historical context of the conversation. Taking this a step further, integrating LLMs with autonomous capabilities, such as planning, retrieving, reasoning, and decision-making, can lead to the development of agentic search systems. 
These systems can refine results based on user feedback or self-evaluation, creating a dynamic and intelligent search experience.</p><p>Finally, LLMs are widely used in search tool development, from data generation to serving as backbones for embedding and reranking models [6,7,8]. Synthetic data generation has become a common step in training retrieval models, and LLMs are increasingly being used as judges to generate labels for training and evaluation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8ce0659b6c1b5bf59c1a7d4ab23e672f0970533-1600x656.png" style="width: 100%;" /><h3>Query rewriting and optimization strategies</h3><p>Query rewriting strategies are best understood by categorizing user queries into two main types: <em>retrieval</em> and <em>computational</em>.</p><h4>Retrieval queries</h4><p>Unlike computational queries, the user&#x27;s intent here is information retrieval, not calculation. These are the standard queries handled by retrieval algorithms, like lexical and vector search. For example, for the following query:</p><p>&quot;What is the origin of COVID-19?&quot;</p><p>the targets are texts providing answers or context relevant to the query.</p><h4>Computational queries</h4><p>These queries require calculation, aggregation, or structured filtering to produce an answer. They must be translated from natural language into a structured query language, like Elasticsearch DSL or ES|QL.</p><p>Consider, for example, the query:</p><p>&quot;What was the average amount spent by customers who placed more than five orders in the last month?&quot;</p><p>Assuming that information on the orders and customers can be found in some available index, this query requires more than simple text matching. It involves filtering by a date range, grouping by customer, calculating order counts, keeping only customers with more than five orders, and computing the final average. In this case, the LLM&#x27;s task is to parse the natural language and generate the corresponding structured query to execute these calculations.</p><p>Another example would be:</p><p>&quot;Which universities in Germany have an acceptance rate below 20%, and what is their average tuition fee?&quot;</p><p>Here, no single indexed document may contain that specific information; instead, acceptance rates and tuition fees may appear in separate documents.</p><p>In computational queries, the model is essentially expected to decompose the query into a retrievable informational query and a calculation that can be performed once the retrieved data is available, or to build a structured query that can do both.</p>
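<p>To make the computational case concrete, here is one possible ES|QL rendering of the orders example. This is a minimal sketch, assuming a hypothetical <code>orders</code> index with <code>customer_id</code>, <code>amount</code>, and <code>order_date</code> fields; the exact query an LLM should generate depends on the actual mapping:</p><pre><code>FROM orders
| WHERE order_date >= NOW() - 30 days
| STATS order_count = COUNT(*), total_spent = SUM(amount) BY customer_id
| WHERE order_count > 5
| STATS avg_spent = AVG(total_spent)</code></pre>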
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Primary mechanisms</td><td style="border: 1px solid #ccc; padding: 4px;">Example</td><td style="border: 1px solid #ccc; padding: 4px;">Query rewriting task</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical and vector search. </td><td style="border: 1px solid #ccc; padding: 4px;">"What is the origin of COVID-19?"</td><td style="border: 1px solid #ccc; padding: 4px;">Optimize the query's text to improve lexical or vector matching.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Computational</td><td style="border: 1px solid #ccc; padding: 4px;">Calculation, aggregation, or structured filtering.</td><td style="border: 1px solid #ccc; padding: 4px;">"Which universities in Germany have an acceptance rate below 20%, and what is their average tuition fee?"</td><td style="border: 1px solid #ccc; padding: 4px;">Decompose the query: Retrieve relevant data (university profiles), and then perform a calculation (average tuition fee) on the subset of retrieved documents.</td>
                    </tr>
              </tbody>
            </table>
          <h3>Design methodology: Template-based expansion</h3><p>The main focus of this blog is on retrieval queries. However, unlike standard approaches that simply ask an LLM to rewrite or completely rephrase a query, we adopted a template-based design methodology.</p><p>Improving query-text-to-query-text relevance by completely rephrasing the query is often not straightforward for either vector or lexical search. It introduces the complexity of merging search results when multiple hypotheses are considered, and the LLM can often drift from the original user intent. Instead, we explore expanding the original query through an Elasticsearch Query DSL template + prompt strategy. By using specific prompts, we guide the LLM to output specific textual elements (like a list of entities, synonyms, or a pseudo-answer) rather than giving it free rein. These elements are then plugged into a predefined Elasticsearch Query DSL template (a search &quot;recipe&quot;).</p><p>This approach reduces the scope of the LLM application, making the output more deterministic. In our experiments, the LLM is simply prompted to output some text, which is then inserted into the template.</p><p>To validate this approach, we performed a limited exploration of different Elasticsearch primitives to identify and &quot;freeze&quot; a good-enough search template. This allowed us to test how different prompting strategies affect relevance within that fixed structure, rather than changing the structure itself.</p><p>While this blog focuses on retrieval queries, and on lexical extraction and semantic expansion strategies where the linguistic aspect plays the major role, this methodology is flexible. Specific templates could be designed for other retrieval query use cases, such as handling product codes, since relevance criteria are often context-dependent. However, use cases with queries dependent on complex aggregations or strict filtering should be considered computational queries, which would require query optimization strategies outside the scope of this blog.</p><h3>Query optimization strategies</h3><p>While query optimization predates LLMs, LLMs excel at this task. They can be prompted to apply several rewriting strategies [9], such as:</p><ul><li>Generic query rephrasing.</li><li>Pseudo-answer generation.</li><li>Noise reduction (removing irrelevant text, extracting important entities).</li><li>Entity enrichment (synonyms, abbreviation expansion, or other related terms).</li><li>Fixing typos.</li><li>A combination of the above.</li></ul><p>Most of these techniques depend on the model’s capacity to understand user intent and its knowledge of the corpus characteristics.</p><p>In the following sections, we’ll present our experimentation with query rewriting for informational queries and their application to Elasticsearch. We’ll present our most successful experiments and discuss our unsuccessful ones.</p><h3>Experiments and results</h3><p>All the experiments presented in this blog were run using Anthropic Claude 3.5 Sonnet. Normalized Discounted Cumulative Gain (NDCG) at position 10 and Recall at positions 10 and 50 are the metrics used to evaluate the experiments throughout the blog. 
All <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">NDCG</a> and <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Recall">recall</a> scores were computed using <code>pytrec_eval</code> [10].</p><p>We experimented with several variations of the query rewriting task for informational queries and measured relevance and recall gains on three widely used information retrieval benchmarks: Benchmarking Information Retrieval (BEIR; 15 datasets, English-only) [11], Multilingual Long-Document Retrieval (MLDR; 13 languages) [12], and Multilingual Information Retrieval Across a Continuum of Languages (MIRACL; 18 languages) [13].</p><p>We mainly explored the following tasks:</p><ul><li>Lexical keyword enrichment.</li><li>Pseudo-answer generation.</li><li>Letting the model decide on a method or a combination of methods among keyword extraction, keyword enrichment, and pseudo-answer generation.</li></ul><p>We detail the prompts we used for each case and expand on some attempted variations below.</p><p>It’s worth noting that, out of the datasets we evaluated, only four within BEIR (NQ, Robust04, Quora, and MS MARCO) contain real user queries that can benefit from generic query rewriting fixes, such as misspelling correction or query cleaning. The rest of the datasets are either synthetically generated (MLDR, MIRACL) or human-constructed (most of the BEIR datasets).</p><h4>Lexical keyword enrichment</h4><p>This is the first task we tried; we considered various prompts in an effort to optimize results. We started from the simplest possible version, prompting the LLM to extract relevant keywords without specifying more details.</p><p><strong>Prompt 1.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query. 
Please, extract them and list them in a comma-separated list. Reply with the list only.</code></pre><p>On a second attempt, we tried a prompt with more explicit instructions, prompting the model to provide only the most important keywords and explaining why that is important for our use case. We also introduced the idea of entity enrichment, prompting the model to augment the original query only if it considers it too short or missing essential information.</p><p><strong>Prompt 2.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            <Instructions>
            Think about the query and select only the very central and necessary entities. 
            They will be used as boosters for the lexical search, so make sure to only extract critical ones. We don't want to boost documents that contain entities that might not be related to the specific context of the query. 
            If, and only if, the query is too short and is missing essential information, see if you can extract synonyms or enrich the query with entities that don't exist in the original query but might help the search.
            </Instructions>
            Return the thinking process inside <thinking> tags and the final entities inside <entities> tags.</code></pre><p>Finally, we tried a prompt with even more explicit instructions and details encouraging the model to apply different techniques based on the original query’s length.</p><p><strong>Prompt 3.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            <Instructions>
            Extraction for boosting - query length independent:
            Think about the query and identify the main context or the information that the user is asking for. 
            Select only the very central and necessary entities. They will be used as boosters for the lexical search, so make sure to only extract critical ones. 
            We don't want to boost documents that contain entities that might not be related to the specific context of the query.

            Enrichment - query length dependent:
            If the query is shorter than 5 words, then extract the most important entities, but also evaluate whether you can enrich with synonyms or other entities that might enhance recall.
            If the query is larger than 10 words stick to entities that exist in the query
            If you find any initials, return the initials as well as the full words represented by the initials

            Stem-proofing - query length independent:
            Regardless of the query length, return the entities in different forms, in order to overcome stemming miss-matches, provided that the versions of the word are relevant to the query context e.g.: walk, walked, walking.
            </Instructions>
            Return the thinking process inside <thinking> tags and the final entities, comma separated, inside <entities> tags.
            Make sure you are only returning the most important entities! Don't return entities that are too limiting.
</code></pre><p>We ran lexical search tests with the three prompt variations on a subset of BEIR datasets and compared performance in terms of relevance and recall. The following table lists results averaged over the datasets ArguAna, FiQA-2018, Natural Questions (NQ), SciDocs, SciFact, TREC-COVID, Touché 2020, NFCorpus, and Robust04:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Original query</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 1</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 2</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 3</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td><td style="border: 1px solid #ccc; padding: 4px;">0.345</td><td style="border: 1px solid #ccc; padding: 4px;">0.356</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.454</td><td style="border: 1px solid #ccc; padding: 4px;">0.453</td><td style="border: 1px solid #ccc; padding: 4px;">0.466</td><td style="border: 1px solid #ccc; padding: 4px;">0.455</td>
                    </tr>
              </tbody>
            </table>
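<p>As an aside on evaluation: scores like the ones above are straightforward to reproduce. The snippet below is a minimal sketch of how NDCG@10 and Recall@10/50 can be computed with <code>pytrec_eval</code> [10]; the <code>qrels</code> and <code>run</code> dictionaries are toy placeholders standing in for the real benchmark judgments and retrieval results:</p><pre><code>import pytrec_eval

# qrels: ground-truth relevance judgments per query.
# run: retrieval scores per query (toy placeholders here).
qrels = {"q1": {"doc1": 1, "doc2": 0, "doc3": 1}}
run = {"q1": {"doc1": 13.2, "doc2": 7.5, "doc3": 1.1}}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.10", "recall.10", "recall.50"}
)
per_query = evaluator.evaluate(run)

# Average each metric over all queries, as reported in the tables.
for metric in ("ndcg_cut_10", "recall_10", "recall_50"):
    scores = [measures[metric] for measures in per_query.values()]
    print(metric, sum(scores) / len(scores))</code></pre>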
          <p>We consider a baseline lexical search of the original queries and compare it with experiments where the LLM-output entities are used in lexical search. The results are linearly combined with an Elasticsearch <code>bool</code> query with a <code>must</code> clause wrapping the original query and a <code>should</code> clause for each returned LLM term. (See the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Elasticsearch docs</a> for more information on Boolean queries.) The linear combination can be expressed as:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5ccd419194eab69ea5c214bfb0e109c4f8248a7-1146x306.png" style="width: 100%;" /><p>Here, <em>n</em> is the number of keywords returned for each query, and the score refers to a lexical BM25 score. The query is written in native Elasticsearch Query DSL and is provided in the following section.</p><p>The rescoring, including the contribution of the extracted keywords, was applied to the top 200 retrieved documents.</p><p>Prompt 2 performs best in both relevance and recall within the top 10 results, improving by about 1 point, whereas the other prompts have little effect. We kept prompt 2 for lexical keyword enrichment going forward.</p><p>We visually inspected the returned keywords for some of the tested queries; the following examples illustrate several of the query optimization techniques mentioned in the previous section.</p><p>Typo corrections:</p><ul><li><strong>Query:</strong> when does season 3 of strnger things come out</li><li><strong>Keywords:</strong> Stranger Things, Season 3</li></ul><p>Noise reduction:</p><ul><li><strong>Query:</strong> Evolvability : What Is It and How Do We Get It?</li><li><strong>Keywords:</strong> evolvability</li></ul><p>Keyword expansion/augmentation:</p><p>a)</p><ul><li><strong>Query:</strong> Trust-Aware Review Spam Detection</li><li><strong>Keywords:</strong> Trust-Aware, Review Spam Detection, fake reviews, credibility</li></ul><p>b)</p><ul><li><strong>Query:</strong> Dimensional inconsistencies in code and ROS messages: A study of 5.9M lines of code</li><li><strong>Keywords:</strong> dimensional inconsistencies, code, ROS, Robot Operating System, 5.9M</li></ul><p>c)</p><ul><li><strong>Query:</strong> low-carb diets</li><li><strong>Keywords:</strong> low-carb, diets, ketogenic, keto, carbohydrate</li></ul><h4>Pseudo-answer generation</h4><p>Prompting the LLM to generate pseudo-answers to the original query is another technique that was explored. It has been explored before in the context of query rewriting [9] and is expected to especially benefit factual queries while matching the advantages of keyword augmentation. For this method we used the prompt:</p><p><strong>Prompt 4.</strong></p><pre><code>You are a search engine assistant and you want to generate 5 possible replies for a query. 
The replies will be used to boost the search results, in a combined manner with the original query. Make sure that the generated results respect the preferences of most search (lexical or vector) engines, that is, they should be concise, relevant, and cover different aspects of the query.
Return the rewritten replies inside <replies> tags. Separate the replies by the line break symbol.
</code></pre><p>We explore some examples of LLM outputs for this prompting strategy. For comparison, the entities extracted with prompt 2 are also listed for each query.</p><p>Providing replies to a general knowledge question:</p><p><strong>Query:</strong> What is the best way to practice web development?</p><ul><li><strong>Replies:</strong><ul><li>Learn HTML, CSS, and JavaScript fundamentals</li><li>Build personal projects and create a portfolio</li><li>Contribute to open-source projects on GitHub</li><li>Follow online tutorials and take coding courses</li><li>Practice responsive design and cross-browser compatibility</li></ul></li><li><strong>Entities (Prompt 2):</strong> development, practice, coding projects, tools</li></ul><p>Offering imagined document titles that might contain a reply to a question that cannot be answered based on general knowledge:</p><p><strong>Query:</strong> In your opinion, what is the best classical song of all time?</p><ul><li><strong>Replies:</strong><ul><li>Best classical compositions ranked by music critics</li><li>Most popular and influential classical pieces in history</li><li>Timeless masterpieces from renowned classical composers</li><li>Iconic classical works that have stood the test of time</li><li>Top classical music recommendations for beginners and enthusiasts</li></ul></li><li><strong>Entities (Prompt 2):</strong> classical music, classical composition, symphony, concerto, opera</li></ul><p><strong>Query:</strong> A description of the history and meaning of the 1st amendment</p><ul><li><strong>Replies:</strong><ul><li>First Amendment: origins, historical context, and significance in US Constitution</li><li>Freedom of speech, religion, press, assembly: core principles of 1st Amendment</li><li>Supreme Court interpretations and landmark cases on First Amendment rights</li><li>Evolution of First Amendment protections throughout American history</li><li>Impact and controversies surrounding the First Amendment in modern society</li></ul></li><li><strong>Entities (Prompt 2):</strong> Amendment, Constitution, Bill of Rights</li></ul><p>We observe an interesting combination of LLM output modes for this strategy. Different types of queries are handled differently:</p><ul><li>When the query can be directly replied to based on general knowledge, the LLM returns possible replies.</li><li>If not, the LLM “imagines” what the document the user might want could look like and returns these title-like document descriptions as replies.</li></ul><p>Both methods end up introducing useful and creative new keywords.</p><h4>Letting the model choose a method</h4><p>Finally, experiments were run using a prompt that encouraged the model to select the most appropriate method, or combination of methods, and asked it to explain the rationale behind its choice. We expect the model to be inherently capable of “thinking” through how each query can be improved in a search context:</p><p><strong>Prompt 5.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            There are 3 basic methods to do so:
            1. extract important entities and keywords
            2. create a pseudo answer to the query
            3. expand the initial query with synonyms and related terms
            Given a query, choose a method or a combination of any of them and return the rewritten query inside <rewritten query> tags.
            You can separate the different parts or terms with the line break symbol.
            Also return the rationale behind your choice of methods inside <thinking> tags. Why does this query need this type of method?</code></pre><p>The query rewriting implementation for prompts 4 and 5 closely resembles that of prompt 2. Specifically, the LLM output is incorporated into a Boolean query, where the original query is placed in a <code>must </code>clause and each LLM-generated term is included in a <code>should</code> clause. For prompt 4, an LLM-output term represents a single pseudo-answer, while for prompt 5, it represents a rewrite.</p><p>Overall, the change in performance using different prompts is significant in the context of lexical search, where prompts 4 and 5 outperform prompt 2. In the following table, LKE refers to lexical keyword extraction using prompt 2, PA refers to pseudo-answer generation using prompt 4, and MC stands for model’s choice and refers to prompt 5. The model’s output is used according to equation 1.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0bb953db787dcb03c9221e9b6e7969d3bffdb17f-2002x670.png" style="width: 100%;" /><p>In the final row of the table, the scores are averaged at the benchmark level. It’s computed as an average of the average scores of BEIR, MLDR, and MIRACL benchmarks. The pseudo-answers and model’s choice strategies perform better across metrics, with pseudo-answers being slightly better.</p><p>We further analyze these prompting techniques and obtain more results in the following section, with respect to vector search experiments.</p><p>We also present the results of lexical search when using <em>only</em> the query rewriting terms. In the table below, the original query&#x27;s contribution to the scores is entirely excluded:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4dcdfe2572217ce017c746c55f016f282c61c6e3-2006x670.png" style="width: 100%;" /><p>Excluding the original query scores from the result seems to hurt relevance performance by average. In terms of Recall@50, the pseudo-answer strategy results in improved performance compared to baseline, but the boost is smaller than when the strategy includes the original query.</p><p>Overall, we recommend combining the query rewriting terms with the original query to achieve gains across metrics in lexical search.</p><h4>Large language models versus small language models</h4><p>For the majority of the results discussed in this blog, we utilized Anthropic&#x27;s Claude 3.5 Sonnet LLM. However, we also experimented with a smaller model to assess how inference cost affects performance. We tried LKE with Anthropic’s Claude 3.5 Haiku for a subset of datasets from BEIR (ArguAna, FiQA-2018, Natural Questions [NQ], SciDocs, SciFact, TREC-COVID, Touché 2020, NFCorpus, Robust04).</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Original query </td><td style="border: 1px solid #ccc; padding: 4px;">LKE with Sonnet</td><td style="border: 1px solid #ccc; padding: 4px;">LKE with Haiku</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td><td style="border: 1px solid #ccc; padding: 4px;">0.364</td><td style="border: 1px solid #ccc; padding: 4px;">0.368</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.454</td><td style="border: 1px solid #ccc; padding: 4px;">0.472</td><td style="border: 1px solid #ccc; padding: 4px;">0.475</td>
                    </tr>
              </tbody>
            </table>
          <p>Relevance and recall within the top 10 results are essentially unaffected by switching to the smaller model. While this initial investigation is not exhaustive and requires further study in real-world scenarios that implement query optimization, these first results suggest that small language models (SLMs) are a viable option for this specific use case.</p><p>A comparison between Claude 3.5 Sonnet and Claude 3.5 Haiku is provided below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Number of
parameters</td><td style="border: 1px solid #ccc; padding: 4px;">Context window</td><td style="border: 1px solid #ccc; padding: 4px;">Max output</td><td style="border: 1px solid #ccc; padding: 4px;">Input cost</td><td style="border: 1px solid #ccc; padding: 4px;">Output cost</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Claude 3.5 Sonnet</td><td style="border: 1px solid #ccc; padding: 4px;">Est. ~175B</td><td style="border: 1px solid #ccc; padding: 4px;">200K</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$15.00 / million tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$4.00 / million tokens</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Claude 3.5 Haiku</td><td style="border: 1px solid #ccc; padding: 4px;">Est. ~20B</td><td style="border: 1px solid #ccc; padding: 4px;">200K</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$3.00 / million tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$0.80 / million tokens</td>
                    </tr>
              </tbody>
            </table>
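<p>To illustrate how lightweight this setup is in practice, the sketch below wires keyword extraction into the <code>must</code>/<code>should</code> Boolean structure used throughout these experiments, using the Anthropic Python SDK. The prompt is abridged from prompt 2, and the model IDs and field names are illustrative; swapping Sonnet for Haiku is a one-line change:</p><pre><code>import re
import anthropic

# Abridged version of prompt 2; see the full text above.
SYSTEM_PROMPT = (
    "You are a search engine and you want to extract the relevant keyword, "
    "codewords or proper names from a query. "
    "Return the final entities, comma separated, inside <entities> tags."
)

def extract_keywords(query: str) -> list[str]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # or "claude-3-5-sonnet-20241022"
        max_tokens=512,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": query}],
    )
    text = response.content[0].text
    match = re.search(r"<entities>(.*?)</entities>", text, re.DOTALL)
    return [k.strip() for k in match.group(1).split(",")] if match else []

def build_bool_query(query: str, keywords: list[str]) -> dict:
    # Equation 1: original query in `must`, each LLM keyword in `should`.
    return {
        "query": {
            "bool": {
                "must": {"match": {"text": query}},
                "should": [{"match": {"text": k}} for k in keywords],
            }
        }
    }</code></pre>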
          <h3>Query rewriting in Elasticsearch</h3><p>In enterprise search, where precision and control are paramount, we explored methods that integrate query rewriting with existing search functionality. The focus was on strategies that build upon the original query to target relevance gains without a high implementation cost.</p><p>Elasticsearch features a wide range of search tools that tackle different search scenarios. It supports lexical and vector retrieval, as well as rerankers. We look for optimal ways to integrate query rewriting strategies in Elasticsearch, exploring both base retrievers and hybrid methods.</p><p>In the previous section, we presented results on lexical search and introduced equations 1 and 2. These correspond to the following Elasticsearch Query DSL code:</p><pre><code>{
  "query": {
    "bool": {
      "must":  { "match": { "text": "ORIGINAL QUERY" } },
      "should": [
        { "match": { "text": "QR TERM 1" } },
        { "match": { "text": "QR TERM 2" } },
        { "match": { "text": "QR TERM 3" } }
      ]
    }
  }
}</code></pre><p>QR TERM 1, 2, and 3 stand for query rewriting terms and refer to whatever the LLM output represents: keywords, pseudo-answers, or other types of replies.</p><p>The <code>bool</code> query functions like a linear combination of terms. Crucially, the <code>must</code> clause enforces <em>hard</em> requirements, meaning any document that fails to match this clause is excluded from the results. In contrast, the <code>should</code> clause operates as a score booster: Documents matching it receive a higher final score, but documents that don&#x27;t match are not discarded from the results.</p><p>Through iterative experimentation, we determined the most effective query configuration. Initial attempts included querying solely with terms generated by the LLM or various combinations of the original query and LLM terms. We observed that overreliance on LLM output reduced relevance. The optimal setup, which consistently yielded the best results, required the full inclusion of the original query, with the LLM output used only to selectively boost the ranking of certain documents.</p><h4>Dense vector search as base retriever</h4><p>When moving to vector search, the narrative changes. It’s already well-established in the industry that hybrid search (lexical + vector) improves both relevance and recall by combining the semantic understanding of dense vectors with the exact matching precision of BM25. Our goal here was to determine whether query rewriting applied to a vector retriever covers the same gap that hybrid search fixes or provides additional improvement.</p><p>For experiments where the base retriever was not BM25 but rather a K-Nearest Neighbors (KNN) search using dense vectors, we chose to explore the maximum gains we can achieve in terms of relevance and recall using an optimized, weighted linear combination. We opted to optimize the weight to ensure that the impact of query rewriting strategies on performance is isolated, preventing any potential influence from a poorly chosen linear combination on the results. We run an optimization process expressed in equation 3 using Bayesian optimization (Optuna [14]) on the test set itself.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4d8498261842d6d0942661e8bbfbe8726ddce9c-1186x176.png" style="width: 100%;" /><p>The optimization process was conducted on the same text queries used for evaluation to establish the upper bounds of potential performance improvements. Because we’re tuning only one parameter, the chance of overfitting the test data is minimal. We verify this hypothesis below by running the optimization for some datasets on the train split and observing the difference in performance.</p><p>For vector search evaluation, we use two optimization metrics:</p><ul><li><strong>LINEAR NDCG@10 OPTIMIZED(vector_oq, bm25_qr):</strong> The weight is optimized to achieve the maximum NDCG at the top 10 results.</li><li><strong>LINEAR RECALL@50 OPTIMIZED(vector_oq, bm25_qr):</strong> The weight is optimized to achieve the maximum recall at the top 50 results.</li></ul><p>In these metrics, <strong>oq</strong> stands for the original query, and <strong>qr</strong> stands for query rewriting. 
We include recall at 50 to assess query optimization&#x27;s performance as a first-stage retriever, with the assumption that the search results will subsequently be processed by a reranker.</p><p>To provide a comparison, we also conducted experiments where the BM25 scores of the original query were combined with the vector search scores. These combinations are referred to as:</p><ul><li><strong>LINEAR NDCG@10 OPTIMIZED(vector_oq, bm25_oq)</strong></li><li><strong>LINEAR RECALL@50 OPTIMIZED(vector_oq, bm25_oq)</strong></li></ul><p>For the experiments in the following tables, we used the <code>multilingual-e5-large</code> [15] dense vector model for the BEIR and MIRACL benchmarks, and the <code>Qwen3-Embedding-0.6B</code> [16] model to search long-context documents in the MLDR benchmark.</p><p>First, we verify that performing the optimization on the evaluation split instead of the training split yields results that follow the same trends. The pseudo-answers prompting strategy was used to compute the qr scores.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb329c72a7b1123428ed519f77d008c3084c064e-1736x532.png" style="width: 100%;" /><p>We compare the recall@50 scores when optimizing on the training split versus the evaluation split for MIRACL and MLDR, finding that both produce results in the same range.</p><p>In the following tables, the evaluation split is used for optimization. The final row contains results that are averaged at the benchmark level, as in the previous tables.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9d10f644b474b5e114504e9515eee7457bbfa565-2014x820.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dfe8d5cbe9d9f5d7e8d2aa6f146550465ed4e73b-2014x402.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a88b3aef38fad245f629951b247a16fc8eb93574-1452x352.png" style="width: 100%;" /><p>We omit the Recall@10 results as they are very similar to the NDCG@10 results. These tables show no advantage in hybrid search using QR terms instead of the original query. In terms of relevance, replacing the original query with QR seems to deteriorate results. In terms of recall, some gains are achieved in BEIR and MIRACL, but the averaged score reveals no advantage over a well-tuned hybrid search.</p><p>We further explored hybrid search using reciprocal rank fusion (RRF), relying on the built-in Elasticsearch functionality. Method <strong>RRF(vector_oq, bm25_qr)</strong> refers to the DSL code:</p><pre><code>{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "knn": {
            "field": "dense_vector_field",
            "query_vector": ORIGINAL_QUERY_VECTOR,
            "k": 100,
            "num_candidates": 200
          }
        },
        {
          "standard": {
            "query": {
              "bool": {
                "should": [
                  { "match": { "text": "QR TERM" } },
                  { "match": { "text": "QR TERM" } },
                  { "match": { "text": "QR TERM" } }
                ]
              }
            }
          }
        }
      ],
      "rank_window_size": 500,
      "rank_constant": 20
    }
  },
  "size": 200,
}</code></pre><p>The corresponding baseline run is denoted <strong>RRF(vector_oq, bm25_oq)</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d3359c99afbf0649b9a610689198ff21e2cf0ac3-1398x504.png" style="width: 100%;" /><p>Replacing the original query with LLM-output terms to get lexical search scores deteriorates recall on average in all cases. In terms of relevance, we observe a marginal improvement in BEIR and a notable increase of ~8 points of NDCG@10 in MIRACL. However, relevance in MLDR is so negatively affected that the baseline runs come out higher on average.</p><p>So far, our exploration has not revealed a definitive strategy for enhancing vector search performance through the exposed query rewriting methods. Applying the lessons from the lexical search exploration, we experiment with hybrid search strategies that include both the original query and the query rewriting terms. We focus on a first-stage retrieval scenario and aim to improve Recall@50. In the following experiment, we try hybrid search with RRF, where the lexical scores are computed as an optimized linear combination of BM25 scores of the original query and BM25 scores of LLM-output terms. To establish an upper bound for potential performance improvements, we perform Bayesian optimization on the set of queries using the Optuna library [14].</p><p>We denote this experiment <strong>RRF(vector_oq, LINEAR RECALL50 OPTIMIZED(bm25_oq, bm25_qr))</strong>. The same baseline as in the previous table, <strong>RRF(vector_oq, bm25_oq)</strong>, still provides a useful comparison in this experiment.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41e0eb5cd708c3705f5dc5ac0f7c3c01074556a2-1394x614.png" style="width: 100%;" /><p>This experiment was not designed for relevance optimization; however, the resulting NDCG@10 scores are documented for completeness. An increase of 1–3 percentage points in recall@50 was achieved with this method, with the pseudo-answer prompting strategy being the most consistently beneficial across the benchmarks. This strategy is suitable only for informational queries based on general knowledge or when the LLM possesses the necessary domain expertise. This method was employed to determine the upper limits of potential performance improvements. It&#x27;s important to note that optimizing the weights using the complete test dataset, as was done here, is not feasible in real-world applications.</p><p>PA turns out to be the most successful strategy for BEIR and MIRACL, while LKE gives the highest boost in recall for MLDR.</p><h4>First-stage retriever and reranking</h4><p>To maximize performance in a production setting, query rewriting could be viewed as part of a multistage pipeline. The goal of the first-stage retriever is not to be good at relevance but rather at recall, that is, to ensure the relevant documents make it into the candidate set for the reranker.</p><p>We implemented the following pipeline configuration:</p><ol><li><strong>Base retrieval:</strong> Retrieve the top 200 documents.</li><li><strong>Entity boosting:</strong> Rescore based on LLM-extracted entities (from prompt 2).</li><li><strong>Pruning:</strong> Cut to the top 50 documents.</li><li><strong>Reranking:</strong> Apply <code>jina-reranker-v2</code> to the top 50 documents.</li></ol><p>We compared the final outcomes of this pipeline using different first-stage inputs, including base retrievers alone and base retrievers plus QR. 
For the QR and hybrid baselines, we used the experiments that obtained the highest recall.</p><ul><li>Lexical corresponds to the original query using BM25.</li><li>Lexical + QR corresponds to the experiment using prompt MC.</li><li>Vector corresponds to the original query using vector search.</li><li>Vector + QR corresponds to the experiment LINEAR RECALL50 OPTIMIZED(vector_oq, bm25_qr) for prompt PA.</li><li>Hybrid corresponds to RRF(vector_oq, bm25_oq).</li><li>Hybrid + QR corresponds to the experiment RRF(vector_oq, LINEAR RECALL50 OPTIMIZED(bm25_oq, bm25_qr)) for prompt PA.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8e5bd75acf283e162d90a4778d6e0c2d0def4ec8-1210x434.png" style="width: 100%;" /><p>The last row shows the averaged results across BEIR, MLDR, and MIRACL.</p><p>QR in the lexical and vector settings is applied by combining the scores as shown in equations 1 and 2, optimized for recall. RRF is widely considered a strong standard for maximizing recall in hybrid search. Our results show that an optimized linear combination of the original vector scores and QR scores can actually achieve better recall than the RRF configurations, which cannot be tuned in the same way. This suggests that, with the right weighting, a linear combination can be more effective than rank-based fusion at using LLM-generated terms to improve recall.</p><p>The table below shows the averages across datasets for NDCG@10 before and after reranking and recall@50:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8a0b9b1fe65c767199d212de8084f900f00d99c5-1440x326.png" style="width: 100%;" /><p>Relevance (NDCG@10) improves after reranking is applied, consistently tracking the gains in recall@50.</p><h4>Strategy domain adaptation</h4><p>Unlike open web search, enterprise domains (such as legal, medical, or internal wikis) have specific vocabularies that generic LLMs might miss. Below we discuss several strategies that could be used to tailor the presented QR strategies for specific domains:</p><ul><li>Domain-specific tuning: Further tune the query rewriting instructions (prompts) specifically for niche domains or specific use cases.</li><li>In-context learning: Use few-shot examples retrieved from a knowledge base (for example, append the top <em>k</em> BM25 results from a quick, cheap initial search to the prompt) to ground the rewriting process.</li><li>LLMs + rules hybrid approach: Combine the flexibility of LLMs with deterministic rules for specific domain terms.</li><li>Gated query rewriting: Selectively apply QR only when necessary, employing rules, custom classifiers, or specialized prompts and models to detect whether the query requires optimization for a specific use case.</li><li>Query rewriting for generation: Expanding the query or context not just for retrieval but specifically to improve the quality of the final LLM response.</li></ul><h3>Conclusions</h3><p>The investigation shows how simple LLM-driven query optimization can have a positive impact within the modern search ecosystem.</p><h4>Key takeaways</h4><ul><li>LLMs are a good complement to improve lexical search: Using LLMs to enrich keywords or generate pseudo-answers provides consistent improvements in both relevance and recall for standard lexical retrieval.</li><li>Hybrid search is harder to beat: When using dense vector search or hybrid retrieval, simple query rewriting terms offer marginal gains. 
The best results come from using QR to <em>boost</em> existing hybrid scores rather than replacing them.</li><li>Pseudo-answers improve recall: Generating hypothetical answers (pseudo-answer generation) proved to be the most effective strategy for maximizing recall in multistage pipelines.</li><li>Structured guidance over free-form generation: Guiding the LLM is critical. Rather than allowing the LLM to freely rephrase a query, providing a strict template (like extracting specific entities to fit a DSL clause) ensures that the output adds value without introducing noise. A specific prompt + DSL template combination allows designing for a specific relevance use case (such as lexical extraction versus semantic expansion) and reduces the scope for error.</li><li>Efficiency with small models: The strategies explored here are simple enough to be deployed effectively using SLMs or distilled into compact models, offering a cost-effective solution.</li></ul><p>The following table contains some practical guidelines on how to incorporate the most successful query rewriting techniques into your search pipeline, depending on your particular setting:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Real-world setting

</td><td style="border: 1px solid #ccc; padding: 4px;">QR strategy </td><td style="border: 1px solid #ccc; padding: 4px;">Implementation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Lexical search </td><td style="border: 1px solid #ccc; padding: 4px;">Lexical keyword enrichment (prompt 2)</td><td style="border: 1px solid #ccc; padding: 4px;">Boost search performance without the cost of migrating to vector search:
Use the LLM's output (keywords, entities, synonyms) in an Elasticsearch `should` clause to boost documents that match the expanded terms, while keeping the original query in a `must` clause. Avoid relying solely on QR-generated terms, and use SLMs to reduce costs without compromising QR performance.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Multistage retrieval with reranking</td><td style="border: 1px solid #ccc; padding: 4px;">Pseudo-answer generation (prompt 4)</td><td style="border: 1px solid #ccc; padding: 4px;">In a multistage search system (retriever followed by a reranker), use the pseudo-answers as boosting terms. Use the Elasticsearch `should` +` must` clauses structure as before to retrieve the widest set of potentially relevant documents before passing them to the reranker to score.</td>
                    </tr>
              </tbody>
            </table>
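<p>For reference, the multistage recipe in the second row can be expressed directly in Query DSL with a retriever tree. The sketch below is a minimal, hypothetical example: it assumes a reranking inference endpoint registered as <code>my-reranker</code> and uses placeholder pseudo-answer terms; check the Elasticsearch retrievers documentation for the options available in your version:</p><pre><code>{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "bool": {
              "must": { "match": { "text": "ORIGINAL QUERY" } },
              "should": [
                { "match": { "text": "PSEUDO-ANSWER 1" } },
                { "match": { "text": "PSEUDO-ANSWER 2" } }
              ]
            }
          }
        }
      },
      "field": "text",
      "inference_id": "my-reranker",
      "inference_text": "ORIGINAL QUERY",
      "rank_window_size": 50
    }
  },
  "size": 10
}</code></pre>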
          <p>To reiterate our earlier comments, these solutions will benefit scenarios where most queries are retrieval queries, that is, scenarios where relevance does not depend on specific filtering, aggregations, or other types of structure. However, the same template-based meta-strategy can potentially be adapted to such cases.</p><h4>The value of task‑focused tuning in search pipeline design</h4><p>One of the broader implications of this investigation is the importance of viewing search pipeline architectures as a set of modular, well‑defined stages where lightweight, task‑focused adjustments can meaningfully improve performance, allowing pipeline components to be tuned for specific retrieval goals. Such tuning could involve a variety of strategies: experimenting with how LLMs are prompted to target particular gains (such as maximizing recall versus precision), parametrizing how LLM output is combined with the original query (for example, via a DSL query template), evaluating the impact of different rescoring strategies (such as MMR or <code>match_phrase</code>-based query rescoring) on an initial candidate set (such as the top 200 retrieved documents), and layering these techniques before a more computationally intensive reranking step. Overall, this perspective encourages designing pipelines with clear component boundaries and a small, controllable set of hyperparameters that can be tuned to achieve targeted retrieval outcomes. Furthermore, although our experiments demonstrated measurable gains in a general‑purpose IR setting, we expect these interventions to be even more impactful in scenarios where relevance is narrowly defined, allowing the template‑based approach to improve results in a more controlled way.</p><h4>LLM-driven query optimization in modern search pipelines</h4><p>Simple query rewriting strategies can be well-suited, easy-to-plug-in solutions for targeted performance gains. In environments where LLMs are already in use (for example, RAG, conversational interfaces, or agentic search workflows), the overhead of an extra LLM call for rewriting is absorbed, making latency less of an issue. This allows for significant and targeted improvements in relevance and recall across specific domains or challenging query types.</p><p>All the strategies discussed in this blog consist of a combination of an LLM prompt and an Elasticsearch Query DSL template, and hence they can be naturally integrated into the application layer of a search solution.</p><p>Finally, Elasticsearch has already begun integrating LLM-powered capabilities directly into its search experience, offering tools like <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a> <code>COMPLETION</code>, managed LLMs through the <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>, and lately, the possibility of building a custom query rewriting tool within <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>.</p><p>A detailed table of the results presented can be found <a href="https://docs.google.com/spreadsheets/d/1kd7ToPZFwFjow3OIIwQv_-G34SDZzwYYOgTFJT76vgY/edit?gid=521501384#gid=521501384">here</a>.</p><h2>References</h2><ol><li>Xiong, H., Bian, J., Li, Y., Li, X., Du, M., Wang, S., Yin, D., &amp; Helal, S. (2024). When search engine services meet large language models: Visions and challenges. <a href="https://doi.org/10.48550/arXiv.2407.00128">arXiv</a>.</li><li>Remmey, M. (2024, May 14). 
<em>NL to SQL architecture alternatives</em>. Azure Architecture Blog. <a href="https://techcommunity.microsoft.com/blog/azurearchitectureblog/nl-to-sql-architecture-alternatives/4136387">https://techcommunity.microsoft.com/blog/azurearchitectureblog/nl-to-sql-architecture-alternatives/4136387</a></li><li>Abdallah, A., Piryani, B., Mozafari, J., Ali, M., &amp; Jatowt, A. (2025, August 22). <em>How good are LLM-based rerankers? An empirical analysis of state-of-the-art reranking models</em>. arXiv. <a href="https://arxiv.org/abs/2508.16757">arxiv</a></li><li>Joshi, A., Shi, Z., Goindani, A., &amp; Liu, H. (2025, October 22). <em>The case against LLMs as rerankers</em>. Voyage AI. <a href="https://blog.voyageai.com/2025/10/22/the-case-against-llms-as-rerankers/">https://blog.voyageai.com/2025/10/22/the-case-against-llms-as-rerankers/</a></li><li>Oosterhuis, H., Jagerman, R., Qin, Z., &amp; Wang, X. (2025, July). <em>Optimizing compound retrieval systems</em>. In <em>Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25)</em> (pp. 1–11). ACM. <a href="https://doi.org/10.1145/3726302.3730051">https://doi.org/10.1145/3726302.3730051</a></li><li>Zhang Y, Li M, Long D, Zhang X, Lin H, Yang B, Xie P, Yang A, Liu D, Lin J, Huang F, Zhou J. <em>Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.</em> arXiv preprint arXiv:2506.05176. 2025.<a href="https://arxiv.org/abs/2506.05176"> arXiv</a></li><li>Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. <em>Improving Text Embeddings with Large Language Models.</em> arXiv preprint arXiv:2401.00368. 2024.<a href="https://arxiv.org/pdf/2401.00368"> arXiv</a></li><li>Lee J, Dai Z, Ren X, Chen B, Cer D, Cole JR, et al. <em>Gecko: Versatile Text Embeddings Distilled from Large Language Models.</em> arXiv preprint arXiv:2403.20327. 2024.<a href="https://arxiv.org/abs/2403.20327"> arxiv</a></li><li>Li, Zhicong; Wang, Jiahao; Jiang, Zhishu; Mao, Hangyu; Chen, Zhongxia; Du, Jiazhen; Zhang, Yuanxing; Zhang, Fuzheng; Zhang, Di; Liu, Yong (2024). <em>DMQR-RAG: Diverse Multi-Query Rewriting for RAG</em>. arXiv preprint arXiv:2411.13154. DOI:10.48550/arXiv.2411.13154. (<a href="https://arxiv.org/abs/2411.13154">arxiv</a>)</li><li>Van Gysel C, de Rijke M. <em>pytrec_eval: An extremely fast Python interface to trec_eval</em>. In: Proceedings of the 2018 SIGIR Workshop on Reproducibility in Information Retrieval; 2018.</li><li>Thakur N, Reimers N, Rücklé A, Srivastava A, Gurevych I. <em>BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models</em>. arXiv [cs.IR]. 2021;arXiv:2104.08663. (<a href="https://arxiv.org/abs/2104.08663">arxiv</a>)</li><li>Chen J, Xiao S, Zhang P, Luo K, Lian D, Liu Z. <em>BGE M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation</em>. arXiv [cs.CL]. 2024;arXiv:2402.03216. (<a href="https://arxiv.org/abs/2402.03216">arxiv</a>)</li><li>Zhang X, Thakur N, Ogundepo O, Kamalloo E, Alfonso-Hermelo D, Li X, Liu Q, Rezagholizadeh M, Lin J. <em>MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages</em>. Trans Assoc Comput Linguistics. 2023;11:1114-1131. (<a href="https://aclanthology.org/2023.tacl-1.63/">aclanthology.org</a>)</li><li>Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. 
<em>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining (KDD ’19).</em> 2019:2623-2631. doi:10.1145/3292500.3330701</li><li>Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. <em>Multilingual E5 Text Embeddings: A Technical Report.</em> arXiv preprint arXiv:2402.05672. Published 2024. Accessed November 18, 2025. <a href="https://arxiv.org/abs/2402.05672">arXiv</a></li><li>Zhang Y, Li M, Long D, Zhang X, Lin H, Yang B, Xie P, Yang A, Liu D, Lin J, Huang F, Zhou J. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. <em>arXiv preprint</em> arXiv:2506.05176. Published 2025. Accessed November 18, 2025. <a href="https://arxiv.org/abs/2506.05176">arXiv</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building human-in-the-loop (HITL) AI agents with LangGraph and Elasticsearch]]></title>
            <description><![CDATA[ Learn what human-in-the-loop (HITL) is and how to build an HITL system with LangGraph and Elasticsearch for a flight system.]]></description>
            <link>https://www.elastic.co/search-labs/blog/human-in-the-loop-hitllanggraph-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/human-in-the-loop-hitllanggraph-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/17a2863e110d7415959184a8375d70676695f35c-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll explore how to combine <a href="https://langchain-ai.github.io/langgraph/">LangGraph</a> and Elasticsearch to build a human-in-the-loop (HITL) system. This approach allows AI systems to involve users directly in the decision-making process, making interactions more reliable and context-aware. We’ll implement a practical example using a context-driven scenario to demonstrate how LangGraph workflows can integrate with Elasticsearch to retrieve data, handle user input, and produce refined results.</p><h2>Requirements</h2><ul><li>Node.js version 18 or newer</li><li>OpenAI API key</li><li>Elasticsearch 8.x+ deployment</li></ul><h2>Why use LangGraph for production HITL systems</h2><p>In a <a href="https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch">previous article</a>, we presented LangGraph and its benefits for building a retrieval-augmented generation (RAG) system using large language models (LLMs) and <a href="https://docs.langchain.com/oss/javascript/langgraph/graph-api#conditional-edges">conditional edges</a> to automatically make decisions and display results. Sometimes we don’t want the system to act autonomously end to end, but rather want users to select options and make decisions within the execution loop. This concept is called <strong><em>human in the loop</em></strong>.</p><h2>Human in the loop</h2><p>HITL is an AI concept that allows a real person to interact with AI systems to provide more context, evaluate responses, edit responses, ask for more information, and perform other tasks. This is very useful in low-error-tolerance scenarios, such as compliance, decision-making, and content generation, helping improve the reliability of LLM outputs.</p><p>It&#x27;s important to note that the primary purpose of HITL in agentic systems is validation, not blind trust in the agent&#x27;s approach. HITL interventions should be reactive and triggered only when the system detects missing or ambiguous information. This ensures human involvement remains meaningful and adds value, rather than becoming a mandatory checkpoint that interrupts every workflow unnecessarily.</p><p>A common example is when your coding assistant asks you for permission to execute a certain command on the terminal or shows you its step-by-step thinking process for you to approve before it starts coding.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4df9fe7383f0dc665a1b0866651f4f6a40a14aa8-499x215.png" style="width: 100%;" /><h2>Elasticsearch and LangGraph: How they interact</h2><p><a href="https://www.langchain.com/">LangChain</a> allows us to use Elasticsearch as a <a href="https://js.langchain.com/docs/integrations/vectorstores/elasticsearch/">vector store</a> and to perform queries within LangGraph applications, which is useful for executing full-text or semantic searches, while LangGraph is used to define the specific workflow, tools, and interactions. It also adds HITL as an additional interaction layer with the user.</p><h2>Practical implementation: Human in the loop</h2><p>Let’s imagine a scenario where a lawyer has a question about a case he recently took on. Without the right tools, he would need to manually search through legal articles and precedents, read them in full, and then interpret how they apply to his situation. 
With LangGraph and Elasticsearch, however, we can build a system that searches a database of legal precedents and generates a case analysis that incorporates the specific details and context provided by the lawyer. You can find the full implementation of this use case in the following <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch">repository</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e95bf4ce115cbe9954ee5033776408bed485b8d2-1600x906.png" style="width: 100%;" /><p>The workflow begins when the lawyer submits a legal question. The system performs a vector search in Elasticsearch, retrieves the most relevant precedents, and presents them for the lawyer to choose from, using natural language. After the selection, the LLM generates a draft analysis and checks whether the information is complete. At this point, the workflow can follow two paths: If everything is clear, it proceeds directly to generate a final analysis; if not, it pauses to request clarification from the lawyer. Once the missing context is provided, the system completes the analysis and returns it, taking into consideration the clarifications.</p><p>The following is a graph drawn by LangGraph showing how the app will look at the end of the development. Each node represents a tool or functionality:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43b274d52f3644e3949bea3c8e4312c4c94384c2-307x1000.png" style="width: 100%;" /><h3>Dataset</h3><p>Here’s the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataset.json">dataset</a> that will be used for this example. This dataset contains a collection of legal precedents, each describing a case involving service delays, the court’s reasoning, and the final outcome.</p><pre><code>[
  {
    "pageContent": "Legal precedent: Case B - Service delay not considered breach. A consulting contract used term 'timely delivery' without specific dates. A three-week delay occurred but contract lacked explicit schedule. Court ruled no breach as parties had not defined concrete timeline and delay did not cause demonstrable harm.",
    "metadata": {
      "caseId": "CASE-B-2022",
      "contractType": "consulting agreement",
      "delayPeriod": "three weeks",
      "outcome": "no breach found",
      "reasoning": "no explicit deadline defined, no demonstrable harm",
      "keyTerms": "timely delivery, open terms, schedule definition",
      "title": "Case B: Delay Without Explicit Schedule"
    }
  },
  ...
]</code></pre><h3>Ingestion and index setup</h3><p>The index setup and data ingestion logic are defined in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file, where we declare functions to handle the index creation. This setup is compatible with the LangChain <a href="https://python.langchain.com/docs/integrations/vectorstores/elasticsearch/">vector store interface for Elasticsearch</a>.</p><p><em>Note: The mapping setup is also included in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file.</em></p><h3>Install packages and set up environment variables</h3><p>Let&#x27;s initialize a <a href="https://nodejs.org/en">Node.js</a> project with default settings:</p><pre><code>npm init -y</code></pre><p>Now let&#x27;s install all required dependencies:</p><ul><li><strong>@elastic/elasticsearch: </strong>Elasticsearch client for Node.js. Used to connect, create indices, and run queries.</li><li><strong>@langchain/community:</strong> Provides integrations for community-supported tools, including the ElasticVectorSearch store.</li><li><strong>@langchain/core:</strong> Core building blocks of LangChain, such as chains, prompts, and utilities.</li><li><strong>@langchain/langgraph:</strong> Adds graph-based orchestration, allowing workflows with nodes, edges, and state management.</li><li><strong>@langchain/openai: </strong>Provides access to OpenAI models (LLMs and embeddings) through LangChain.</li><li><strong>dotenv:</strong> Loads environment variables from a .env file into process.env.</li><li><strong>tsx: </strong>A useful tool for executing TypeScript code directly.</li></ul><p>Run the following command in the console to install all of them:</p><pre><code>npm install @elastic/elasticsearch @langchain/community @langchain/core @langchain/langgraph @langchain/openai dotenv --legacy-peer-deps && npm install --save-dev tsx</code></pre><p>Create a .env file to set up the environment variables:</p><pre><code>ELASTICSEARCH_ENDPOINT=
ELASTICSEARCH_API_KEY=
OPENAI_API_KEY=</code></pre><p>We’ll use <a href="https://www.typescriptlang.org/">TypeScript</a> to write the code because it provides a layer of type safety and a better developer experience. Create a TypeScript file named <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a>, and insert the next section&#x27;s code.</p><h3>Package imports</h3><p>In the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a> file, we start by importing the required modules and initializing the environment variable configuration. This includes the core LangGraph components, the OpenAI model integrations, and the Elasticsearch client.</p><p>We also import the following from the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file:</p><ul><li><strong><code>ingestData</code>: </strong>A function that creates the index and ingests the data.</li><li><strong>Document and DocumentMetadata:</strong> Interfaces that define the dataset document structure.</li></ul><h3>Elasticsearch vector store client, embeddings client, and OpenAI client</h3><p>This code will initialize the vector store, the embeddings client, and one OpenAI client:</p><pre><code>const VECTOR_INDEX = "legal-precedents";
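// Model clients and vector store setup; assumes dotenv has already loaded
// ELASTICSEARCH_ENDPOINT and ELASTICSEARCH_API_KEY into process.env (see "Package imports").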

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});

const esClient = new Client({
  node: process.env.ELASTICSEARCH_ENDPOINT,
  auth: {
    apiKey: process.env.ELASTICSEARCH_API_KEY ?? "",
  },
});

const vectorStore = new ElasticVectorSearch(embeddings, {
  client: esClient,
  indexName: VECTOR_INDEX,
});</code></pre><p>The application workflow state schema enables communication between nodes:</p><pre><code>const LegalResearchState = Annotation.Root({
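  // Each field is a typed state channel; nodes return partial updates that LangGraph merges into this shared state.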
  query: Annotation<string>(),
  precedents: Annotation<Document[]>(),
  userChoice: Annotation<string>(),
  selectedPrecedent: Annotation<Document | null>(),
  validation: Annotation<string>(),
  draftAnalysis: Annotation<string>(),
  ambiguityDetected: Annotation<boolean>(),
  userClarification: Annotation<string>(),
  finalAnalysis: Annotation<string>(),
});</code></pre><p>In the state object, we’ll pass the following through the nodes: the user’s query, the legal precedents retrieved, the user’s choice and validation, any ambiguity detected, and the clarification provided. The state also tracks the precedent selected by the user, the draft analysis generated along the way, and the final analysis once all clarifications are completed.</p><h3>Nodes</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f16148cd439491f8e6fd929d66c38fc1066a4693-221x159.png" style="width: 100%;" /><p><strong><code>searchPrecedents</code>: </strong>This node performs a similarity search in the Elasticsearch vector store based on the user’s input. It retrieves up to five matching documents and prints them so they can be reviewed by the user:</p><pre><code>async function searchPrecedents(state: typeof LegalResearchState.State) {
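  // Embeds the query and retrieves the five nearest documents from the vector store.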
  console.log(
    "📚 Searching for relevant legal precedents with query:\n",
    state.query
  );

  const results = await vectorStore.similaritySearch(state.query, 5);
  const precedents = results.map((d) => d as Document);

  console.log(`Found ${precedents.length} relevant precedents:\n`);

  for (let i = 0; i < precedents.length; i++) {
    const p = precedents[i];
    const m = p.metadata;
    console.log(
      `${i + 1}. ${m.title} (${m.caseId})\n` +
        `   Type: ${m.contractType}\n` +
        `   Outcome: ${m.outcome}\n` +
        `   Key reasoning: ${m.reasoning}\n` +
        `   Delay period: ${m.delayPeriod}\n`
    );
  }

  return { precedents };
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/72f7b09029662ae3935e17fde8f932e885c6efdd-273x203.png" style="width: 100%;" /><p><strong><code>precedentSelection</code>:</strong> This node allows the user to select, using natural language, the precedent retrieved by the similarity search that best matches the question. At this point, the application<strong> interrupts the workflow</strong> and waits for user input:</p><pre><code>function precedentSelection(state: typeof LegalResearchState.State) {
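  // interrupt() pauses the graph here; the checkpointer persists state until the
  // client resumes with a Command({ resume: ... }), as shown in the execution loop later.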
  console.log("\n⚖️  HITL #1: Human input needed\n");
  const result = interrupt({
    question: "👨‍⚖️  Which precedent is most similar to your case? ",
  });

  return { userChoice: result as string };
}</code></pre><p><strong><code>selectPrecedent</code>:</strong> This node sends the user input, along with the retrieved documents, to be interpreted so that one of them can be selected. The LLM performs this task by returning a number that represents the document it infers from the user’s natural-language input:</p><pre><code>async function selectPrecedent(state: typeof LegalResearchState.State) {
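  // Structured output constrains the model to return a 1-based index into the retrieved precedents.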
  const precedents = state.precedents || [];
  const userInput = state.userChoice || "";

  const precedentsList = precedents
    .map((p, i) => {
      const m = p.metadata;
      return `${i + 1}. ${m.caseId}: ${m.title} - ${m.outcome}`;
    })
    .join("\n");

  const structuredLlm = llm.withStructuredOutput({
    name: "precedent_selection",
    schema: {
      type: "object",
      properties: {
        selected_number: {
          type: "number",
          description:
            "The precedent number selected by the lawyer (1-based index)",
          minimum: 1,
          maximum: precedents.length,
        },
      },
      required: ["selected_number"],
    },
  });

  const prompt = `
    The lawyer said: "${userInput}"

    Available precedents:
    ${precedentsList}

    Which precedent number (1-${precedents.length}) matches their selection?
  `;

  const response = await structuredLlm.invoke([
    {
      role: "system",
      content:
        "You are an assistant that interprets lawyer's selection and returns the corresponding precedent number.",
    },
    { role: "user", content: prompt },
  ]);

  const selectedIndex = response.selected_number - 1;
  const selectedPrecedent = precedents[selectedIndex] || precedents[0];

  console.log(`✅ Selected: ${selectedPrecedent.metadata.title}\n`);
  return { selectedPrecedent };
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1864ff80880ba7d394c25880e40d7c7f2b4a4765-305x402.png" style="width: 100%;" /><p><strong><code>validatePrecedentSelection</code>: </strong>This node introduces an explicit validation step (yes/no) to confirm that the precedent inferred by the system is indeed the one the lawyer intends to use. This step acts as a safeguard against misinterpretation of the user’s intent, ensuring that downstream legal analysis is based on the correct case:</p><pre><code>function validatePrecedentSelection(state: typeof LegalResearchState.State) {
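  // HITL #1.5: a cheap yes/no confirmation that guards against misread intent.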
  const precedent = state.selectedPrecedent;
  if (!precedent) return {};

  const m = precedent.metadata;
  console.log("\n⚖️  HITL #1.5: Validation needed\n");
  console.log(
    `Selected precedent: ${m.title} (${m.caseId})\n` +
      `Type: ${m.contractType}\n` +
      `Outcome: ${m.outcome}\n`
  );

  const result = interrupt({
    question: "👨‍⚖️  Is this the correct precedent? (yes/no): ",
  });
  const validation =
    typeof result === "string" ? result : (result as any)?.value || "";

  return { validation };
}</code></pre><p><strong><code>processValidation</code>: </strong>This node processes the lawyer’s confirmation response (yes/no) from the previous HITL step and determines the next path in the workflow. It interprets the user input as an explicit approval or rejection of the selected precedent.</p><p>If the precedent isn’t confirmed, the node clears the current selection and redirects the graph back to the <code>precedentSelection</code> phase, allowing the user to choose again. If the precedent is confirmed, the workflow continues without modification:</p><pre><code>function processValidation(state: typeof LegalResearchState.State) {
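  // Normalize the yes/no answer; the conditional edge after this node applies the same check to route the graph.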
  const userInput = (state.validation || "").toLowerCase().trim();
  const isValid = userInput === "yes" || userInput === "y";

  if (!isValid) {
    console.log("❌ Precedent not confirmed. Returning to selection...\n");
    return { selectedPrecedent: null, userChoice: "" };
  }

  console.log("✅ Precedent confirmed.\n");
  return {};
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1c3b726d21b04cc916213da81607389a3b0304fc-305x127.png" style="width: 100%;" /><p><strong><code>createDraft</code>: </strong>This node generates the initial legal analysis based on the precedent selected by the user. It uses an LLM to evaluate how the chosen precedent applies to the lawyer’s question and determines whether the system has enough information to proceed.</p><p>If the precedent can be applied directly, the node produces a draft analysis and, taking the right path, jumps to the final node. If the LLM detects ambiguities, such as undefined contract terms, missing timeline details, or unclear conditions, it returns a flag indicating that clarification is required, along with a list of the specific pieces of information that must be provided. In that case, the ambiguity triggers the left path of the graph:</p><pre><code>async function createDraft(state: typeof LegalResearchState.State) {
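  // The structured schema forces an explicit needs_clarification flag, making ambiguity detection part of the draft itself.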
  console.log("📝 Drafting initial legal analysis...\n");

  const precedent = state.selectedPrecedent;
  if (!precedent) return { draftAnalysis: "" };

  const m = precedent.metadata;

  const structuredLlm = llm.withStructuredOutput({
    name: "draft_analysis",
    schema: {
      type: "object",
      properties: {
        needs_clarification: {
          type: "boolean",
          description:
            "Whether the analysis requires clarification about contract terms or context",
        },
        analysis_text: {
          type: "string",
          description: "The draft legal analysis or the ambiguity explanation",
        },
        missing_information: {
          type: "array",
          items: { type: "string" },
          description:
            "List of specific information needed if clarification is required (empty if no clarification needed)",
        },
      },
      required: ["needs_clarification", "analysis_text", "missing_information"],
    },
  });

  const prompt = `
    Based on this precedent:
    Case: ${m.title}
    Outcome: ${m.outcome}
    Reasoning: ${m.reasoning}
    Key terms: ${m.keyTerms}

    And the lawyer's question: "${state.query}"

    Draft a legal analysis applying this precedent to the question.
    
    If you need more context about the specific contract terms, timeline details, 
    or other critical information to provide accurate analysis, set needs_clarification 
    to true and list what information is missing.
    
    Otherwise, provide the legal analysis directly.
  `;

  const response = await structuredLlm.invoke([
    {
      role: "system",
      content:
        "You are a legal research assistant that analyzes cases and identifies when additional context is needed.",
    },
    { role: "user", content: prompt },
  ]);

  let displayText: string;
  if (response.needs_clarification) {
    const missingInfoList = response.missing_information
      .map((info: string, i: number) => `${i + 1}. ${info}`)
      .join("\n");
    displayText = `AMBIGUITY DETECTED:\n${response.analysis_text}\n\nMissing information:\n${missingInfoList}`;
  } else {
    displayText = `ANALYSIS:\n${response.analysis_text}`;
  }

  console.log(displayText + "\n");

  return {
    draftAnalysis: displayText,
    ambiguityDetected: response.needs_clarification,
  };
}</code></pre><p>The two paths that the graph can take look like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/653083d5a2885b15c250288b353f65a993943c0d-305x379.png" style="width: 100%;" /><p>The left path includes an additional node that handles the clarification.</p><p><strong><code>requestClarification</code>:</strong> This node triggers the second HITL step when the system identifies that the draft analysis lacks essential context. The workflow is interrupted, and the user is asked to clarify the missing contract details detected by the previous node:</p><pre><code>function requestClarification(state: typeof LegalResearchState.State) {
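  // HITL #2: pause again so the user can supply the contract details flagged by createDraft.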
  console.log("\n⚖️  HITL #2: Additional context needed\n");
  const userClarification = interrupt({
    question: "👨‍⚖️  Please provide clarification about your contract terms:",
  });
  return { userClarification };
}</code></pre><p>This intervention exemplifies the validation-driven nature of HITL: The system only pauses to request human input when it has identified specific gaps that could compromise the quality of the analysis. In production systems, this node could be enhanced with validation logic to verify that all required information points are addressed before proceeding, ensuring the analysis is built on complete and accurate context.</p><p><strong><code>generateFinalAnalysis</code>: </strong>This node produces the final legal analysis by combining the selected precedent with the additional context provided by the user, if needed. Using the clarification gathered in the previous HITL step, the LLM synthesizes the precedent’s reasoning, the contract details supplied by the user, and the conditions that determine whether a breach may have occurred.</p><p>The node outputs a complete analysis that integrates legal interpretation and practical recommendations:</p><pre><code>async function generateFinalAnalysis(state: typeof LegalResearchState.State) {
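  // Final synthesis of the precedent metadata, the original question, and any clarification provided.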
  console.log("📋 Generating final legal analysis...\n");

  const precedent = state.selectedPrecedent;
  if (!precedent) return { finalAnalysis: "" };

  const m = precedent.metadata;

  const prompt = `
    Original question: "${state.query}"
    
    Selected precedent: ${m.title}
    Outcome: ${m.outcome}
    Reasoning: ${m.reasoning}
    
    Lawyer's clarification: "${state.userClarification}"
    
    Provide a comprehensive legal analysis integrating:
    1. The selected precedent's reasoning
    2. The lawyer's specific contract context
    3. Conditions for breach vs. no breach
    4. Practical recommendations
  `;

  const response = await llm.invoke([
    {
      role: "system",
      content:
        "You are a legal research assistant providing comprehensive analysis.",
    },
    { role: "user", content: prompt },
  ]);

  const finalAnalysis = response.content as string;

  console.log(
    "\n" +
      "=".repeat(80) +
      "\n" +
      "⚖️  FINAL LEGAL ANALYSIS\n" +
      "=".repeat(80) +
      "\n\n" +
      finalAnalysis +
      "\n\n" +
      "=".repeat(80) +
      "\n"
  );

  return { finalAnalysis };
}</code></pre><p>Building the graph:</p><pre><code>const workflow = new StateGraph(LegalResearchState)
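  // Register all nodes first; the edges below define the control flow, including the reselection loop.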
  .addNode("searchPrecedents", searchPrecedents)
  .addNode("precedentSelection", precedentSelection)
  .addNode("selectPrecedent", selectPrecedent)
  .addNode("validatePrecedentSelection", validatePrecedentSelection)
  .addNode("processValidation", processValidation)
  .addNode("createDraft", createDraft)
  .addNode("requestClarification", requestClarification)
  .addNode("generateFinalAnalysis", generateFinalAnalysis)
  .addEdge("__start__", "searchPrecedents")
  .addEdge("searchPrecedents", "precedentSelection") // HITL #1
  .addEdge("precedentSelection", "selectPrecedent")
  .addEdge("selectPrecedent", "validatePrecedentSelection") // Selection validation
  .addEdge("validatePrecedentSelection", "processValidation")
  .addConditionalEdges(
    "processValidation",
    (state: typeof LegalResearchState.State) => {
      const userInput = (state.validation || "").toLowerCase().trim();
      const isValid = userInput === "yes" || userInput === "y";
      return isValid ? "validated" : "reselect";
    },
    {
      validated: "createDraft",
      reselect: "precedentSelection",
    }
  )
  .addConditionalEdges(
    "createDraft",
    (state: typeof LegalResearchState.State) => {
      // If ambiguity detected, request clarification (HITL #2)
      if (state.ambiguityDetected) return "needsClarification";
      // Otherwise, generate final analysis
      return "final";
    },
    {
      needsClarification: "requestClarification",
      final: "generateFinalAnalysis",
    }
  )
  .addEdge("requestClarification", "generateFinalAnalysis") // HITL #2
  .addEdge("generateFinalAnalysis", "__end__");</code></pre><p>In the graph, we can see the <code>createDraft</code> conditional edge, which defines the condition for choosing the “final” path. As shown, the decision now depends on whether the draft analysis detected ambiguity that requires additional clarification.</p><p>Put all together to be executed:</p><pre><code>await ingestData();

// Compile workflow
const app = workflow.compile({ checkpointer: new MemorySaver() });
const config = { configurable: { thread_id: "hitl-circular-thread" } };

await saveGraphImage(app);

// Execute workflow
const legalQuestion =
  "Does a pattern of repeated delays constitute breach even if each individual delay is minor?";

console.log(`⚖️  LEGAL QUESTION: "${legalQuestion}"\n`);

let currentState = await app.invoke({ query: legalQuestion }, config);

// Handle all interruptions in a loop
while ((currentState as any).__interrupt__?.length > 0) {
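  // __interrupt__ is present only while the graph is paused; each entry carries the payload passed to interrupt().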
  console.log("\n💭 APPLICATION PAUSED WAITING FOR USER INPUT...");

  const interruptQuestion = (currentState as any).__interrupt__[0]?.value
    ?.question;

  // Handling empty responses
  let userChoice = "";
  while (!userChoice.trim()) {
    userChoice = await getUserInput(interruptQuestion || "👤 YOUR CHOICE: ");
    if (!userChoice.trim()) {
      console.log("⚠️  Please provide a response.\n");
    }
  }

  currentState = await app.invoke(
    new Command({ resume: userChoice.trim() }),
    config
  );
}</code></pre><h3>Execute the script</h3><p>With all the code in place, let&#x27;s execute the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a> file by running the following command in the terminal:</p><pre><code>tsx main.ts</code></pre><p>Once the script is executed, the question <strong>“Does a pattern of repeated delays constitute breach even if each individual delay is minor?”</strong> will be sent to Elasticsearch to perform a similarity search, and the results retrieved from the index will be displayed. The app detects that multiple relevant precedents match the query, so it pauses execution and asks the user to help disambiguate which legal precedent is most applicable:</p><pre><code>📚 Searching for relevant legal precedents with query:
 Does a pattern of repeated delays constitute breach even if each individual delay is minor?
Found 5 relevant precedents:

1. Case H: Pattern of Repeated Delays (CASE-H-2021)
   Type: ongoing service agreement
   Outcome: breach found
   Key reasoning: pattern demonstrated failure to perform, cumulative effect
   Delay period: multiple instances

2. Case E: Minor Delay Quality Maintained (CASE-E-2022)
   Type: service agreement
   Outcome: minor breach only
   Key reasoning: delay minimal, quality maintained, termination unjustified
   Delay period: five days

3. Case A: Delay Breach with Operational Impact (CASE-A-2023)
   Type: service agreement
   Outcome: breach found
   Key reasoning: delay affected operations and caused financial harm
   Delay period: two weeks

4. Case B: Delay Without Explicit Schedule (CASE-B-2022)
   Type: consulting agreement
   Outcome: no breach found
   Key reasoning: no explicit deadline defined, no demonstrable harm
   Delay period: three weeks

5. Case C: Justified Delay External Factors (CASE-C-2023)
   Type: construction service
   Outcome: no breach found
   Key reasoning: external factors beyond control, force majeure applied
   Delay period: one month

⚖️  HITL #1: Human input needed

💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Which precedent is most similar to your case? </code></pre><p>The interesting thing about this application is that we can use natural language to choose one option, letting the LLM interpret the user’s input to determine the correct choice. Let’s see what happens if we enter the text <strong>“Case H”</strong>:</p><pre><code>💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Which precedent is most similar to your case? Case H

✅ Selected: Case H: Pattern of Repeated Delays</code></pre><p>After case selection, the application requests a quick validation to confirm that the model selected the correct case. If you answer “no”, the application returns to the case selection step:</p><pre><code>Validation needed

Selected precedent: Case H: Pattern of Repeated Delays (CASE-H-2021)
Type: ongoing service agreement
Outcome: breach found


💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Is this the correct precedent? (yes/no):  yes</code></pre><p>Once confirmed, the application workflow continues drafting the analysis:</p><pre><code>✅ Precedent confirmed.

📝 Drafting initial legal analysis...

AMBIGUITY DETECTED:
Based on Case H, a pattern of repeated delays can indeed constitute a breach of contract, even if each individual delay is minor. The outcome in Case H indicates that the cumulative effect of these minor delays led to a significant failure to perform the contractual obligations adequately. The reasoning emphasizes that consistent performance is critical in fulfilling the terms of a contract. Therefore, if the repeated delays create a situation where the overall performance is hindered, this pattern could be interpreted as a breach. However, the interpretation may depend on the specific terms of the contract at issue, as well as the expectations of performance set forth in that contract.

Missing information:
1. Specific contract terms regarding performance timelines
2. Details on the individual delays (duration, frequency)
3. Context on consequences of delays stated in the contract
4. Other parties' expectations or agreements related to performance


⚖️  HITL #2: Additional context needed


💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Please provide clarification about your contract terms:</code></pre><p>The model takes the user’s case selection and integrates it into the workflow, proceeding with the final analysis once enough context has been provided. In this step, the system also makes use of the previously detected ambiguity: The draft analysis highlighted missing contractual details that could meaningfully affect the legal interpretation. These “missing information” items guide the model in determining what clarifications are essential to resolve uncertainty before producing a reliable final opinion.</p><p>The user must include the requested clarifications in the next input. Let&#x27;s try with <strong>&quot;Contract requires ‘prompt delivery’ without timelines. 8 delays of 2-4 days over 6 months. $50K in losses from 3 missed client deadlines. Vendor notified but pattern continued.&quot;</strong>:</p><pre><code>💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Please provide clarification about your contract terms: Contract requires "prompt delivery" without timelines. 8 delays of 2-4 days over 6 months. $50K in losses from 3 missed client deadlines. Vendor notified but pattern continued.

📋 Generating final legal analysis...

================================================================================
⚖️  FINAL LEGAL ANALYSIS
================================================================================

To analyze the question of whether a pattern of repeated minor delays constitutes a breach of contract, we need to combine insights from the selected precedent, the specifics of the lawyer's contract situation, conditions that typically govern breach versus non-breach, and practical recommendations for the lawyer moving forward.

### 1. Selected Precedent's Reasoning

The precedent case, referred to as Case H, found that a pattern of repeated delays amounted to a breach of contract. The court reasoned that even minor individual delays, when considered cumulatively, demonstrated a failure to perform as stipulated in the contract. The underlying rationale was that the cumulative effect of these minor delays could significantly undermine the purpose of the contract, which typically aims for timely performance and reliable delivery.

### 2. Lawyer's Specific Contract Context

In the lawyer's situation, the contract specified "prompt delivery" but did not provide a strict timeline. The vendor experienced 8 delays ranging from 2 to 4 days over a period of 6 months. These delays culminated in $50,000 in losses due to three missed client deadlines. The vendor was notified regarding these delays; however, the pattern of delays persisted.

Key considerations include:
- **Nature of the Obligations**: While “prompt delivery” does not define a strict timeline, it does imply an expectation for timely performance.
- **Material Impact**: The missed client deadlines indicate that these delays had a material adverse effect on the lawyer's ability to fulfill contractual obligations to third parties, likely triggering damages.

### 3. Conditions for Breach vs. No Breach

**Conditions for Breach**:
- **Pattern and Cumulative Effect**: Similar to the reasoning in Case H, evidence of a habitual pattern of delays can amount to a breach. Even if individual delays are minor, when combined, they may show a lack of diligence or reliability by the vendor.
- **Materiality**: The impact of these delays is crucial. If the cumulative delays adversely affect the contract's purpose or cause significant losses, this reinforces the case for a breach.
- **Notification and Opportunity to Cure**: The fact that the vendor was notified of the delays and failed to rectify the behavior can often be interpreted as a further indication of breach.

**Conditions for No Breach**:
- **Non-Material Delays**: If the delays did not affect the overall contractual performance or client obligations, this may lessen the likelihood of establishing a breach. However, given the risks and losses involved, this seems less relevant in this scenario.
- **Force Majeure or Justifiable Delays**: If the vendor could show that these delays were due to justify circumstances not within their control, it may potentially provide a defense against breach claims.

### 4. Practical Recommendations

1. **Assess Damages**: Document the exact nature of the financial losses incurred due to the missed deadlines to substantiate claims of damages.
  
2. **Gather Evidence**: Collect all communication regarding the delays, including any notifications sent to the vendor about the issues.

3. **Consider Breach of Contract Action**: Based on the precedent and accumulated delays, consider formalized communication to the vendor regarding a breach of contract claim, highlighting both the pattern and the impact of these repeated delays.

4. **Evaluate Remedies**: Depending upon the contract specifics, the lawyer may wish to pursue several remedies, including:
   - **Compensatory Damages**: For the financial losses due to missed deadlines.
   - **Specific Performance**: If timely delivery is critical and can still be enforced.
   - **Contract Termination**: Depending on the severity, terminating the contract and seeking replacements may be warranted.

5. **Negotiate Terms**: If continuing to work with the current vendor is strategic, the lawyer should consider renegotiating terms for performance guarantees or penalties for further delays.

6. **Future Contracts**: In future contracts, consider including explicit timelines and conditions for prompt delivery, as well as specified damages for delays to better safeguard against this issue.

By integrating the legal principles from the precedent with the specific context and conditions outlined, the lawyer can formulate a solid plan to address the repeated delays by the vendor effectively.</code></pre><p>This output shows the final stage of the workflow, where the model integrates the selected precedent (Case H) and the lawyer’s clarifications to generate a complete legal analysis. The system explains why the pattern of delays likely constitutes a breach, outlines the factors that support this interpretation, and provides practical recommendations. Overall, the output demonstrates how the HITL clarifications resolve ambiguity and allow the model to produce a well-founded, context-specific legal opinion.</p><h3>Other real-world scenarios</h3><p>This pattern of combining Elasticsearch, LangGraph, and HITL can be useful in other kinds of apps, including:</p><ul><li><strong>Reviewing tool calls before their execution</strong>; for example, in financial trading, a human approves buy/sell orders before they’re placed.</li><li><strong>Providing additional parameters when needed</strong>; for example, in customer support triage, a human agent selects the correct issue category when the AI finds multiple possible interpretations of the customer’s problem.</li></ul><p>And there are plenty more use cases to discover where HITL will be a game changer.</p><h2>Conclusion</h2><p>With LangGraph and Elasticsearch, we can build agents that make their own decisions and act as linear workflows or follow conditional paths based on context. With human in the loop, the agents can involve the actual user in the decision-making process to fill contextual gaps and request confirmations in systems where fault tolerance is critical.</p><p>The key advantage of this approach is that you can filter a large dataset using Elasticsearch capabilities and then use an LLM, together with the user, to select a single document based on the user’s intent; that is, HITL. The LLM complements Elasticsearch by handling the dynamics of how users express their intent.</p><p>This approach keeps the system fast and token-efficient, as we’re only sending the LLM what’s needed to make the final decision and not the whole dataset. And at the same time, this keeps it very precise at detecting user intent and iterating until the desired option is picked.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Apache Lucene 2025 wrap-up]]></title>
            <description><![CDATA[2025 was a stellar year for Apache Lucene; here are our highlights.]]></description>
            <link>https://www.elastic.co/search-labs/blog/apache-lucene-wrapped-2025</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/apache-lucene-wrapped-2025</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Benjamin Trent, Chris Hegarty]]></dc:creator>
            <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bda54d0af6f391ea71aae7936bc02d53fa47fefa-1600x901.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>It’s hard to believe that Apache Lucene has been around for over a quarter of a century! Yes, more than 25 years of search enabled through Apache Lucene.</p><h2>Community by the numbers</h2><p>For contributions and the community as a whole, 2025 was another strong year, with 1,756 commits and 1,080 pull requests from 134 unique contributors. The community continued to grow this year, increasing the number of contributors by 98 from the previous year. The project management committee (PMC) and committer cohort also increased. Apache Lucene added nine new committers in 2025, one of whom was Elastic’s own Simon Cooper. (Congratulations, <a href="https://github.com/thecoop">thecoop</a>!) The PMC also welcomed two new members. Our little community continues to grow.</p><h2>Our highlights</h2><p>With almost 2,000 commits and eight releases, it’s difficult to summarize all the things that we loved from the year in Apache Lucene. But, not hiding from a challenge, here are some of our highlights.</p><h2>The year of faster queries</h2><p>In many ways, 2025 was the year Apache Lucene embraced better auto-vectorization, manual <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a> optimizations, and overall making things go faster. Lucene community member <a href="https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html">Adrien Grand goes into great detail in this blog</a>, and below are links and summaries for the biggest jumps. As always, a huge shout-out to <a href="https://github.com/mikemccand">Mike McCandless</a> for maintaining the <a href="https://benchmarks.mikemccandless.com/AndHighHigh.html">Apache Lucene Benchmarks</a> for years.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bff3742a2d3fd0e61a75785a4f6c79e1a3dcd94d-1600x613.png" style="width: 100%;" /><ul><li><code>[IN]</code> <a href="https://github.com/apache/lucene/pull/14701">Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion </a></li><li><code>[IS]</code> Irrelevant hardware update, so it’s just noise here</li><li><code>[IY]</code> <a href="https://github.com/apache/lucene/pull/14906">Use branchless way to speedup filterCompetitiveHits</a></li><li><code>[IZ]</code> <a href="https://github.com/apache/lucene/pull/14935">Improve collecting docs stored as bitsets</a></li><li><code>[JA]</code> <a href="https://github.com/apache/lucene/pull/14896">Vectorize filterCompetitiveHits by hand with the Java Panama API</a></li><li><code>[JK]</code> <a href="https://github.com/apache/lucene/pull/15160">Increase the document block size to 256</a></li></ul><p>Discounting the hardware change <code>[IS]</code>, this is more than a 70% increase in query speed in 2025, going from &lt;100 queries per second (qps) to &gt;170 qps.</p><h2>Vector search</h2><p>There were a number of vector search improvements in 2025. Three to highlight are improving filtered <a href="https://github.com/apache/lucene/pull/14160">vector search with ACORN</a>, <a href="https://github.com/apache/lucene/pull/14226">adding optimistic multisegment search</a>, and <a href="https://github.com/apache/lucene/pull/14978">bulk scoring of vectors</a>.</p><p><a href="https://arxiv.org/pdf/2403.04871">ACORN-1</a> is an interesting algorithm for graph-based vector indices. It has the significant benefit of being filter- and algorithm-agnostic. 
Since Apache Lucene uses hierarchical navigable small world (HNSW) for its indexing and users generally want to just filter on anything and everything without additional configuration, it’s a perfect fit. A community <a href="https://github.com/apache/lucene/pull/14085">member originally investigated</a> adding this algorithm. (He has since been hired by Elastic. Hi, <a href="https://github.com/benchaplin">Ben!</a> 👋👋👋) A nice balance was found for Lucene, providing faster filtered vector search without requiring significant configuration work by the user or any additional indexing of information.</p><p>The addition of <a href="https://github.com/apache/lucene/pull/14226">optimistic multisegment vector search</a> to Apache Lucene really highlights how the community comes together to make it all happen. Three different individuals combined powers to debug, benchmark, design, and iterate on this solution. Originally proposed by <a href="https://github.com/msokolov">Michael Sokolov</a> (a vector search superstar in the Lucene community), it immediately caught my attention, too, because it claimed to fix our <a href="https://github.com/apache/lucene/issues/14180">weird concurrency consistency bug</a> without sacrificing performance. With some iteration and benchmarking by fellow community member <a href="https://github.com/dungba88">Dzung Bui</a>, we were able to strike the right balance in speed and recall, improving performance, making multithreaded searches consistent, and implementing a pretty neat algorithm.</p><p>Bulk scoring was born out of a collaboration between community member <a href="https://github.com/mccullocht">Trevor McCulloch</a> and our very own <a href="https://github.com/ChrisHegarty">Chris Hegarty</a> and was introduced as a new scoring interface in <a href="https://github.com/apache/lucene/pull/14978">PR #14978</a>, with an initial float32 implementation following in <a href="https://github.com/apache/lucene/pull/14980">PR #14980</a>. Modern vector search involves thousands or even millions of comparisons as a query vector is evaluated against vectors in the index, often by traversing a graph of nearest neighbors. Traditionally, these comparisons are performed one vector at a time. Bulk scoring inverts that model by passing a batch of vectors, represented as ordinals in the index, to the scorer in a single call. This allows the scorer to prefetch and pipeline work across vectors, amortizing cache misses and reducing per-vector overhead. An exciting aspect of the original design discussion was the possibility of bulk scorers implemented in Rust and C. While Lucene itself remains a Java library, this opens the door to highly optimized, SIMD-friendly, native implementations.</p><p>We didn’t even get to talk about several other improvements that landed in this area, including HNSW optimizations, like more compact <a href="https://github.com/apache/lucene/pull/14932">GroupVarInt graph</a> encoding, bypassing graph construction for <a href="https://github.com/apache/lucene/pull/14963">tiny segments</a>, and continued reductions in memory footprint. On the operations side, Lucene now <a href="https://github.com/apache/lucene/pull/14426">exposes off-heap memory requirements</a>, making it easier to understand and debug native memory usage. While these changes are small(ish) individually, together they help make Lucene’s vector search faster, leaner, and easier to operate in production.</p><h2>Bonus</h2><p>The last highlight is a little out of place. 
It&#x27;s a particularly <a href="https://github.com/apache/lucene/pull/15380">frustrating but satisfying bug fix</a>. I won’t go into deep details here, as it touches on how Lucene does max scoring and bulk scoring, applies filters, and handles all its internal iterator state. This deserves its own blog post. In short, we ran into this bug in production in late September 2025. It was surfacing as an <code>EndOfFileException</code> during a specific query execution. And then, like all fun bugs, it took a week or two of work to reproduce and fully debug. Finally, once we knew exactly what caused the exception to be thrown, we had to dig into why before we could fix it. All in all, a month’s worth of work summarized in one line of code. Bask in its glory:</p><pre><code>- top.doc = top.approximation.advance(filter.doc);
+ // Must use the iterator as `top` might be a two-phase iterator
+ top.doc = top.iterator.advance(filter.doc);</code></pre><h2>Goodbye 2025, hello 2026</h2><p>A big thank you to everyone in the Apache Lucene community who worked tirelessly to improve this venerable search library. We ❤️ you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Discord and Elastic Agent Builder A2A to build a gaming community support bot]]></title>
            <description><![CDATA[Learn how to connect Discord to Elastic Agent Builder's Agent-to-Agent (A2A) server to create a gaming community support bot.]]></description>
            <link>https://www.elastic.co/search-labs/blog/discord-elastic-agent-builder-a2a</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/discord-elastic-agent-builder-a2a</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e90b56e9bcf0ab0b7b906fb6ec2413fdc4e6a00-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we&#x27;ll build a gaming community support bot that connects Discord to Elastic Agent Builder&#x27;s <a href="https://www.elastic.co/docs/solutions/search/agent-builder/a2a-server">Agent-to-Agent (A2A) server</a>. Players will ask questions like, &quot;Who&#x27;s the best Mage?&quot; or &quot;What&#x27;s the current meta?&quot; and get real-time answers powered by Elasticsearch Query Language (ES|QL) analytics and semantic search.</p><p>Many gaming companies already use Elasticsearch for player data and game analytics. In this tutorial, we&#x27;ll access that data through Discord, where many users spend their time.</p><h2>Prerequisites</h2><ul><li>Elasticsearch cluster (9.2 or higher)</li><li>Python 3.9+</li><li>Discord <a href="https://discord.com/developers/docs/quick-start/getting-started#step-1-creating-an-app">bot token</a> and <a href="https://support.discord.com/hc/en-us/articles/204849977-How-do-I-create-a-server">server created</a></li></ul><h2>What we&#x27;re building</h2><p>We&#x27;ll create a Discord bot that uses Elastic Agent Builder&#x27;s A2A server to answer player questions. The bot will have two types of tools:</p><ul><li><strong>ES|QL tools</strong> for analytics: leaderboards, hero stats, meta reports.</li><li><strong>Index search tools</strong> for knowledge: game mechanics, FAQs.</li></ul><p>The architecture looks like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e90b56e9bcf0ab0b7b906fb6ec2413fdc4e6a00-1280x720.png" style="width: 100%;" /><h2>Architecture overview</h2><p><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a> provides an <a href="https://www.elastic.co/docs/solutions/search/agent-builder/a2a-server">A2A server</a> to connect with clients. This server exposes the agent with tools that can query Elasticsearch using <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/esql-tools">ES|QL</a> or <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/index-search-tools">index search</a> tools. Any A2A-compatible client can connect to it.</p><p>The A2A client instantiates a bot that can connect to the Discord server, communicate with Agent Builder, and receive/send messages to the users.</p><h2>Why A2A instead of MCP?</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d723675b2b0bfebfae69548157420115b215572e-1600x1138.png" style="width: 100%;" /><p>Agent Builder also offers the option to expose the agent&#x27;s tools via a Model Context Protocol (MCP) server. The key difference is that with MCP, the client can only access the tools and their descriptions, not the agent itself. Additionally, tool selection logic must be implemented on the MCP client side, since all tools from all agents are exposed together.</p><p>With A2A, the entire agent, including instructions and tools, is available to the client. This gives you more control from the Agent Builder side, letting you manage behavior centrally rather than in each client.</p><p>The right choice depends on where you want that control to live. For this Discord bot, we want to manage the agent&#x27;s behavior from Elastic, so A2A is the better fit.</p><h2>Setting up sample game data</h2><p>Let&#x27;s create gaming data that our bot can query. 
We&#x27;ll set up three indices:</p><ul><li><code>player_stats</code>: Player profiles with wins, kills, rank.</li><li><code>hero_meta</code>: Hero pick rates and win rates by tier.</li><li><code>game_knowledge</code>: FAQs and game mechanics. Uses <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> for meaning-based matching rather than plain keyword matching. Both <strong>title</strong> and <strong>content</strong> are <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to">copied into</a> <code>semantic_field</code> for a hybrid search solution.</li></ul><h2>Create the indices</h2><pre><code>from elasticsearch import Elasticsearch
import os

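# Client authenticated with an API key; assumes ELASTICSEARCH_URL and ELASTIC_API_KEY are set in the environment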
es = Elasticsearch(
    hosts=[os.getenv("ELASTICSEARCH_URL")],
    api_key=os.environ["ELASTIC_API_KEY"]
)

# Player stats index
es.indices.create(
    index="player_stats",
    mappings={
        "properties": {
            "player_id": {"type": "keyword"},
            "username": {"type": "keyword"},
            "hero": {"type": "keyword"},
            "wins": {"type": "integer"},
            "losses": {"type": "integer"},
            "kills": {"type": "integer"},
            "deaths": {"type": "integer"},
            "rank": {"type": "keyword"},
            "last_played": {"type": "date"}
        }
    }
)

# Hero meta index
es.indices.create(
    index="hero_meta",
    mappings={
        "properties": {
            "hero_name": {"type": "keyword"},
            "pick_rate": {"type": "float"},
            "win_rate": {"type": "float"},
            "tier": {"type": "keyword"},
            "patch_version": {"type": "keyword"}
        }
    }
)

# Game knowledge index (for semantic search)
es.indices.create(
    index="game_knowledge",
    mappings={
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "category": {"type": "keyword"},
            "semantic_field": {"type": "semantic_text"} # Semantic search queries this combined field
        }
    }
)</code></pre><h3>Index sample data</h3><pre><code>from datetime import datetime, timedelta
from elasticsearch.helpers import bulk
import random

# Sample player data
players = [
    {"player_id": "p001", "username": "DragonSlayer99", "hero": "Warrior", "wins": 342, "losses": 198, "kills": 4521, "deaths": 2103, "rank": "Diamond"},
    {"player_id": "p002", "username": "ShadowMage", "hero": "Mage", "wins": 567, "losses": 234, "kills": 8932, "deaths": 3421, "rank": "Master"},
    {"player_id": "p003", "username": "HealBot3000", "hero": "Healer", "wins": 423, "losses": 187, "kills": 1234, "deaths": 1876, "rank": "Diamond"},
    {"player_id": "p004", "username": "TankMaster", "hero": "Tank", "wins": 298, "losses": 302, "kills": 2341, "deaths": 1543, "rank": "Platinum"},
    {"player_id": "p005", "username": "AssassinX", "hero": "Assassin", "wins": 789, "losses": 156, "kills": 12453, "deaths": 2987, "rank": "Grandmaster"},
]

for player in players:
    player["last_played"] = datetime.now() - timedelta(hours=random.randint(1, 72))

# Hero meta data
heroes = [
    {"hero_name": "Warrior", "pick_rate": 15.2, "win_rate": 51.3, "tier": "A", "patch_version": "2.4.1"},
    {"hero_name": "Mage", "pick_rate": 22.8, "win_rate": 54.7, "tier": "S", "patch_version": "2.4.1"},
    {"hero_name": "Healer", "pick_rate": 18.5, "win_rate": 52.1, "tier": "A", "patch_version": "2.4.1"},
    {"hero_name": "Tank", "pick_rate": 12.3, "win_rate": 48.9, "tier": "B", "patch_version": "2.4.1"},
    {"hero_name": "Assassin", "pick_rate": 31.2, "win_rate": 49.2, "tier": "A", "patch_version": "2.4.1"},
]

# Game knowledge for semantic search
knowledge = [
    {"title": "How to unlock the Dragon Mount", "content": "Complete the Dragon's Lair dungeon on Nightmare difficulty with all party members alive. The mount has a 15% drop rate.", "category": "mounts"},
    {"title": "Best Mage build for Season 4", "content": "Focus on Intelligence and Critical Chance. Use the Arcane Staff with Frost Runes. Prioritize cooldown reduction for burst damage.", "category": "builds"},
    {"title": "Understanding the ranking system", "content": "Ranks go from Bronze to Grandmaster. You need 100 points to advance. Wins give 25 points, losses subtract 20.", "category": "ranked"},
]

# Bulk index all data
actions = []
for player in players:
    actions.append({"_index": "player_stats", "_source": player})
for hero in heroes:
    actions.append({"_index": "hero_meta", "_source": hero})
for doc in knowledge:
    actions.append({"_index": "game_knowledge", "_source": doc})

success, errors = bulk(es, actions)
print(f"Indexed {success} documents")

es.indices.refresh(index="player_stats,hero_meta,game_knowledge")</code></pre><p>Done! Now we have data to query.</p><h2>Creating ES|QL tools via API</h2><p>Instead of using the UI, we&#x27;ll create tools programmatically using the <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-agent-builder">Agent Builder API</a>. This makes it easier to version control and deploy.</p><p>First, let&#x27;s set up our <a href="https://www.elastic.co/docs/api/doc/kibana/">Kibana API</a> connection:</p><pre><code>import requests

KIBANA_URL = os.environ["KIBANA_URL"]  # e.g., https://your-deployment.kb.us-central1.gcp.cloud.es.io
KIBANA_API_KEY = os.environ["KIBANA_API_KEY"]

headers = {
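    # The kbn-xsrf header is required for Kibana API write requests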
    "kbn-xsrf": "true",
    "Authorization": f"ApiKey {KIBANA_API_KEY}",
    "Content-Type": "application/json"
}</code></pre><h3>Tool 1: Leaderboard</h3><p>This tool shows top players by kills. We use ES|QL for aggregations:</p><pre><code>leaderboard_tool = {
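    # ES|QL tool: the agent matches this description against the user's question, then runs the fixed query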
    "id": "leaderboard",
    "type": "esql",
    "description": "Shows top players ranked by kills. Use when someone asks Who is the best? or Show me top players.",
    "configuration": {
        "query": """FROM player_stats
| STATS total_kills = SUM(kills), total_wins = SUM(wins) BY username, hero, rank
| SORT total_kills DESC
| LIMIT 10""",
        "params": {}
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=leaderboard_tool
)
print(f"Leaderboard tool: {response.status_code}")</code></pre><h3>Tool 2: Hero stats</h3><p>The hero stats tool returns performance stats for a specific hero using a dynamic parameter:</p><pre><code>hero_stats_tool = {
    "id": "hero_stats",
    "type": "esql",
    "description": "Gets win rate, pick rate, and tier for a specific hero. Use when someone asks How good is Mage? or What is the win rate for Warrior?",
    "configuration": {
        "query": """FROM hero_meta
| WHERE hero_name == ?hero
| KEEP hero_name, win_rate, pick_rate, tier, patch_version""",
        "params": {
            "hero": {
                "type": "keyword",
                "description": "The hero name to look up"
            }
        }
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=hero_stats_tool
)
print(f"Hero stats tool: {response.status_code}")</code></pre><p><code>?hero</code> is a dynamic parameter. When the agent calls this tool, it fills in the hero name from the user&#x27;s question, bridging natural language, “Is Mage strong?” to structured queries, like <code>| WHERE hero_name == “Mage”</code>.</p><h3>Tool 3: Meta report</h3><p>This tool shows which heroes are dominating the current patch:</p><pre><code>meta_report_tool = {
    "id": "meta_report",
    "type": "esql",
    "description": "Shows all heroes sorted by tier and win rate. Use when someone asks What is the current meta? or Which heroes are S-tier?",
    "configuration": {
        "query": """FROM hero_meta
| SORT tier ASC, win_rate DESC
| KEEP hero_name, tier, win_rate, pick_rate""",
        "params": {}
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=meta_report_tool
)
print(f"Meta report tool: {response.status_code}")</code></pre><h2>Adding index search tools</h2><p>For unstructured content, like FAQs and game mechanics, we use <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/index-search-tools">index search</a> tools. The <code>pattern</code> parameter specifies which Elasticsearch index to search:</p><pre><code>game_knowledge_tool = {
    "id": "game_knowledge",
    "type": "index_search",
    "description": "Searches game guides, FAQs, and mechanics. Use when someone asks How do I...? or What is...? questions about game content.",
    "configuration": {
        "pattern": "game_knowledge"
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=game_knowledge_tool
)
print(f"Game knowledge tool: {response.status_code}")</code></pre><p>Keep in mind that ES|QL tools are for structured analytics (&quot;Who has the most kills?&quot;), while index search tools are for unstructured knowledge (&quot;How do I unlock the dragon mount?&quot;).</p><h2>Creating the agent</h2><p>Let’s create an agent that uses these tools:</p><pre><code>agent = {
    "id": "gaming_support_bot",
    "name": "Gaming Support Bot",
    "description": "A gaming community support bot that answers player questions about stats, heroes, and game mechanics.",
    "configuration": {
        "tools": [{"tool_ids": ["leaderboard", "hero_stats", "meta_report", "game_knowledge"]}],
        "instructions": """You are a helpful gaming community bot. Answer player questions about:
- Player stats and leaderboards (use leaderboard tool)
- Hero performance and meta (use hero_stats and meta_report tools)
- Game mechanics and guides (use game_knowledge tool)

Be concise and friendly. Format leaderboards clearly with rankings."""
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers=headers,
    json=agent
)
print(f"Agent created: {response.status_code}")</code></pre><p>The agent now has access to all our tools and can choose one on the user&#x27;s question.</p><p>You can test the connectivity by visiting the Kibana <a href="https://www.elastic.co/docs/solutions/search/agent-builder/chat#agent-chat-gui">Agent Chat GUI</a> or by sending an API call. By default, Agent Builder uses the <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a>, so no connector configuration is required:</p><pre><code>test_message = "Show me all heroes sorted by tier"

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/converse",
    headers=headers,
    json={
        "agent_id": "gaming_support_bot",
        "input": test_message
    },
    timeout=60
)

print(f"Status: {response.status_code}")
if response.status_code == 200:
    result = response.json()
    print(f"\nAgent used tools: {[step.get('tool_id') for step in result.get('steps', []) if step.get('type') == 'tool_call']}")
    print(f"\nResponse:\n{result.get('response', {}).get('message', 'No message')}")
else:
    print(f"Error: {response.text}")</code></pre><h3>Setting up a Discord bot</h3><p>If you don&#x27;t have a Discord bot yet, you&#x27;ll need to create one following these steps:</p><p>1. Go to the <a href="https://discord.com/developers/applications">Discord Developer Portal</a>.</p><p>2. Click <strong>New Application</strong>, and give it a name.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7cf03a67865f62744e0b2f971531726b5b53d9-1600x1039.jpg" style="width: 100%;" /><p>3. Go to the <strong>Bot</strong> section, and click <strong>Add Bot</strong>.</p><p>4. Copy the bot token. (You&#x27;ll need this later.)</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7cf03a67865f62744e0b2f971531726b5b53d9-1600x1039.jpg" style="width: 100%;" /><p>5. Under <strong>Privileged Gateway Intents</strong>, enable <strong>Message Content Intent</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9c8edbd72d02d993efef8cf7bceb0a0e20a05da4-1502x1538.png" style="width: 100%;" /><p>6. Go to <strong>OAuth2 &gt; URL Generator</strong>, select <strong>bot</strong> and <strong>application.commands</strong> scopes and <strong>Send Messages</strong> + <strong>Read Message History</strong> permissions.</p><p>7. Use the generated URL to invite the bot to your server.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/572077eef6ee7cde811c55795c245c6b42a95369-1600x1039.jpg" style="width: 100%;" /><h3><strong>Security considerations</strong></h3><p>On the <strong>Discord</strong> side, request only the minimum permissions:</p><ul><li>Send Messages.</li><li>Send Messages in Threads.</li><li>Read Message History.</li><li>Create Polls.</li></ul><p>This limits what the bot can do if compromised.</p><p>On the <strong>Elastic </strong>side, create an API key with restricted privileges. For this bot, you only need read access to the game indices:</p><pre><code>POST /_security/api_key
{
  "name": "gaming-bot-key",
  "role_descriptors": {
    "gaming_bot_role": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["player_stats", "hero_meta", "game_knowledge"],
          "privileges": ["read"]
        }
      ]
    }
  }
}</code></pre><h2>Configuring the Discord A2A client</h2><p>The next step is to connect Discord to Agent Builder. We&#x27;ll use a <a href="https://github.com/llermaly/agentbuilder-a2a-discord">modified version</a> of the <a href="https://github.com/lekkerelou/kagent-a2a-discord"><code>kagent-a2a-discord</code></a> client.</p><h3>Clone and setup</h3><pre><code>git clone https://github.com/llermaly/agentbuilder-a2a-discord
cd agentbuilder-a2a-discord</code></pre><h3>Environment variables</h3><p>Create an <code>.env</code> file in the root directory:</p><pre><code>DISCORD_BOT_TOKEN=your_bot_token
AGENT_BUILDER_URL=https://&lt;kibana_url&gt;/api/agent_builder/a2a/gaming_support_bot
A2A_API_KEY=your_api_key</code></pre><h3>Run with Python</h3><pre><code># Create virtual environment
uv venv

# Install dependencies
uv sync

# Run the bot 
uv run main.py</code></pre><p>The bot should come online in your Discord server within a few seconds.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/91b6644120bd6750322de433271b4cb09282ebb8-921x163.png" style="width: 100%;" /><h2>Testing the bot</h2><p>Let&#x27;s test with different query types.</p><h3>Analytics queries (ES|QL)</h3><p><strong>User:</strong> &quot;Who is the best Mage player?&quot;</p><p><strong>Bot:</strong> Based on the leaderboard data, ShadowMage is currently the best Mage player in the game. They&#x27;ve reached Master rank with an impressive 8,932 kills and 567 wins…</p><p><strong>User:</strong> &quot;What&#x27;s the current meta?&quot;</p><p><strong>Bot:</strong> The current meta is dominated by Mage, which stands alone in S-tier with an impressive 54.7% win rate and 22.8% pick rate in patch 2.4.1.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a014f6baa8a0fb39de91c3cf09dc49a9481db833-1600x677.jpg" style="width: 100%;" /><h3>Semantic queries (index search)</h3><p><strong>User:</strong> &quot;How do I unlock the dragon mount?&quot;</p><p><strong>Bot:</strong> To unlock the Dragon Mount, you need to:</p><ol><li>Complete the Dragon&#x27;s Lair dungeon on Nightmare difficulty.</li><li>Keep all party members alive throughout the entire run.</li><li>Cross your fingers for luck - the mount has a 15% drop rate…</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/296f781c20a724c813d8ca70319b12769bdea130-1600x416.jpg" style="width: 100%;" /><p>The bot automatically picks the right tool based on the question.</p><h2>Bidirectional: Giving the agent actions</h2><p>Beyond answering questions, we can give Agent Builder the ability to trigger Discord actions. With a <a href="https://github.com/llermaly/agentbuilder-a2a-discord/commit/da66c4642dababc06318f3a62e464c200ce3aec4">small modification</a> to the Discord client, we can parse special tags in the agent&#x27;s response and execute Discord commands.</p><p>For example, we added support for a <code>&lt;poll&gt;</code> tag:</p><pre><code>&lt;poll&gt;Should Mage be nerfed?|Yes, too strong|No, it's balanced|Need more data&lt;/poll&gt;</code></pre><p>When the agent includes this in its response, the bot creates a native Discord poll. The agent just needs instructions on when to use it. Run the following to update the agent&#x27;s instructions:</p><pre><code>agent = {
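    # Same agent definition as before; only the instructions change, adding
    # the &lt;poll&gt; tag convention that the Discord client parses.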
    "id": "gaming_support_bot",
    "name": "Gaming Support Bot",
    "description": "A gaming community support bot that answers player questions about stats, heroes, and game mechanics.",
    "configuration": {
        "tools": [{"tool_ids": ["leaderboard", "hero_stats", "meta_report", "game_knowledge"]}],
        "instructions": """You are a helpful gaming community bot. Answer player questions about:
- Player stats and leaderboards (use leaderboard tool)
- Hero performance and meta (use hero_stats and meta_report tools)
- Game mechanics and guides (use game_knowledge tool)

When discussing balance topics, create a poll for community input.
Use: &lt;poll&gt;Question|Option1|Option2|Option3&lt;/poll&gt;

Be concise and friendly. Format leaderboards clearly with rankings."""
    }
}
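# Update the existing agent in place. Note that the PUT goes to the
# agent-specific endpoint, not the collection endpoint used to create it.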

response = requests.put(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers=headers,
    json=agent
)
print(f"Agent created: {response.status_code}")</code></pre><p>Now when a user asks &quot;Is Mage overpowered?&quot; the agent can share the stats and create a poll, turning data insights into community engagement.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c892045115be7484677f534098ef75df30bdc808-1600x1081.jpg" style="width: 100%;" /><p>The same pattern works for other Discord features, like reactions, role mentions, or scheduled announcements. The agent&#x27;s response becomes a control channel for Discord actions.</p><h2>Conclusion</h2><p>In this article, we successfully built a Discord bot using Elastic Agent Builder&#x27;s A2A server. The process covered creating ES|QL tools via API for analytics, such as leaderboards, hero stats, and meta reports, and developing index search tools for semantic search against a knowledge base. Furthermore, we demonstrated the use of dynamic parameters, like <code>?hero</code> for flexible queries, established the connection between Discord and the A2A client, and tested both analytics and semantic query types.</p><p>The Agent Builder provides the A2A server, so you only need to focus on creating tools and connecting clients, while ES|QL handles structured analytics and index search handles unstructured knowledge.</p><h2>Resources</h2><ul><li><a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/discord-elastic-agent-builder-gaming-bot">Try the notebook</a></li><li><a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-post-agent-builder-tools">Agent Builder API docs</a></li><li><a href="https://github.com/llermaly/agentbuilder-a2a-discord">Discord Client</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[All about those chunks, ’bout those chunks, and snippets!]]></title>
            <description><![CDATA[Exploring chunking and snippet extraction for LLMs, highlighting enhancements for identifying the most relevant chunks and snippets to send to models such as rerankers and LLMs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/llm-chunking-snippet-extraction</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/llm-chunking-snippet-extraction</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kathleen DeRusso]]></dc:creator>
            <pubDate>Mon, 26 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/daa2741ce2fd31137fc0b0eb4320f4ee9db0c0e7-1536x1024.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>These days, if you’ve performed semantic searches or context engineering, you’ve probably worked a lot with chunks. If you’re not familiar with chunks, a <em>chunk </em>is a small, meaningful piece of content extracted from a larger document. This <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">blog</a> provides a great foundational overview of chunking, why it’s important, and various chunking strategies.</p><p>For this blog, we want to focus on one specific problem in the topic of chunking, and that is defining the best context to send to a large language model (LLM) or other model. Models have a limited number of tokens they can take in as context, but even within that limit, sending in large amounts of content can lead to relevance degradation caused by factors such as <a href="https://research.trychroma.com/context-rot">context rot</a> or <a href="https://arxiv.org/abs/2307.03172">“lost in the middle”</a> problems, where important information hidden in large blocks of text is overlooked.</p><p>This led to the question: How can we make this better?</p><h2>Reranking in retrievers</h2><p>We started by looking at retrievers, specifically the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/text-similarity-reranker-retriever"><code>text_similarity_reranker</code> retriever</a>. We know that many <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking#semantic-reranking-models">cross-encoder</a> rerankers do not perform well on long documents. This is because these rerankers will truncate long content to the model’s token window, discarding the remaining content. This can actually lead to degraded search relevance, if the most relevant part of the document is cut off before being sent to the reranker!</p><p>We decided to address this by introducing a <code>chunk_rescorer</code> to the <code>text_similarity_reranker</code> retriever. When specified, rather than sending in the entire document to the reranker, we will chunk the document first and evaluate each chunk based on the reranking inference text. We do this by indexing each chunk into a temporary in-memory Lucene index and performing a BM25 <a href="https://www.elastic.co/docs/solutions/search/full-text">text search</a> over these chunks. We return the best chunks for consideration into the reranker.</p><p>The chunk rescorer is simple to use with a small update to the API call:</p><pre><code>GET books/_search
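# chunk_rescorer splits each candidate document into chunks and forwards
# only the best-scoring chunk(s) to the reranker instead of the full text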
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "match": {
              "author": "Tolkien"
            }
          }
        }
      },
      "rank_window_size": 10,
      "field": "semantic_text",
      "inference_text": "are all who wander lost?",
      "chunk_rescorer": {
        "size": 1
      }
    }
  }
}</code></pre><p>When we evaluated the chunk rescorer, we found a significant improvement for many truncating models, including the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-rerank">Elastic Reranker</a> and <a href="https://docs.cohere.com/docs/rerank">Cohere&#x27;s</a> <code>rerank-english-v3.0</code> model. However, when we evaluated against <a href="https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual"><code>jina-reranker-v2-base-multilingual</code></a>, the results were not as impressive because Jina already addresses this long document problem internally.</p><p>We performed evaluations using the <a href="https://huggingface.co/datasets/Shitao/MLDR">Multilingual Long-Document Retrieval</a> (MLDR) English dataset. This dataset contains very long article content that would trigger the document truncation issue in many reranking models. The following table shows our evaluation results with BM25 text search and a <code>rank_window_size</code> of 32:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Reranker model</td><td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">NDCG@10 with chunk rescoring</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-reranker-v2-base-multilingual</td><td style="border: 1px solid #ccc; padding: 4px;">0.771145</td><td style="border: 1px solid #ccc; padding: 4px;">0.764488</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cohere rerank-english-v3.0</td><td style="border: 1px solid #ccc; padding: 4px;">0.592588</td><td style="border: 1px solid #ccc; padding: 4px;">0.707842</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">.rerank-v1-elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">0.478121</td><td style="border: 1px solid #ccc; padding: 4px;">0.751994</td>
                    </tr>
              </tbody>
            </table>
          <p>It’s worth noting that the raw BM25 results without reranking had a <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain</a> (NDCG) score, or relevance score, close to 0.64. (Find additional background in this <a href="https://arxiv.org/pdf/2402.03216">paper</a>.) This means that for rerankers that perform truncation, reranked results for long documents were actually <em>worse</em> than without reranking. Note that this only applies for long documents; shorter documents that fit into the token window would not be affected by this long document problem.</p><p>Of the rerankers we evaluated, Jina was the only reranker to perform well against long documents out of the box, thanks to its sliding window approach.</p><p>We saw better baseline performance but similar overall difference in results when using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> fields with <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-elser">Elastic Learned Sparse EncodeR (ELSER)</a>.</p><p>We felt the results for truncating models were promising enough to release the chunk rescorer as an opt-in feature for models that will benefit from the additional relevance, but we recommend evaluating against specific rerankers before implementing this in production.</p><h2>ES|QL</h2><p>The real power of chunk extraction, however, lies in the <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language</a> (ES|QL). We wanted chunks and snippets to be first class citizens in ES|QL so they could be easily extracted and repurposed for reranking, sending into LLM context, or other purposes.</p><p>We started by introducing the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-chunk"><code>CHUNK</code></a> function in Elasticsearch version 9.2:</p><pre><code>FROM books
| EVAL chunks = CHUNK(description) </code></pre><p><code>CHUNK</code> is a deliberately low-level primitive that takes some string content (a text field, a semantic text field, or any other row content that is a string) and chunks it. You can view and interact with these chunks, and you can also experiment with different chunking settings:</p><pre><code>FROM books
| EVAL chunks = CHUNK(description, {"strategy": "sentence", "max_chunk_size": 25, "sentence_overlap": 0}) </code></pre><p>You can then combine chunk with existing primitives, like <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/mv-functions#esql-mv_slice"><code>MV_SLICE</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/mv_expand"><code>MV_EXPAND</code></a>, to format the way chunks are represented in your row output:</p><pre><code>FROM books
| WHERE MATCH(author, "Tolkien")
| EVAL chunks = CHUNK(semantic_description, {"strategy": "sentence", "max_chunk_size": 25, "sentence_overlap": 0})
| EVAL truncated = MV_SLICE(chunks, 0, 2)
| MV_EXPAND truncated
| KEEP title, truncated</code></pre><p>This is great, but what we really wanted was to get the top matching snippets for a query, so we also introduced <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/search-functions#esql-top_snippets"><code>TOP_SNIPPETS</code></a> in Elasticsearch version 9.3:</p><pre><code>FROM books
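// TOP_SNIPPETS scores passages of the description field against the query
// text and returns the best-matching ones as a multivalued column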
| EVAL snippets = TOP_SNIPPETS(description, "ring")</code></pre><p>We added support to control the number of snippets you want to return and the word size using a sentence-based <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">chunking strategy</a>:</p><pre><code>FROM books
| EVAL snippets = TOP_SNIPPETS(description, "ring", { "num_snippets": 3, "num_words": 25 })
| MV_EXPAND snippets
| KEEP title, snippets</code></pre><p>This fits into the broader story of LLMs when you add in <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/completion"><code>COMPLETION</code></a>. Here is an example of how we envision <code>TOP_SNIPPETS</code> integrating with LLMs:</p><pre><code>FROM books METADATA _score
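// Retrieve the ten best-matching documents, extract the top snippets from
// each, and pass only those snippets (not whole documents) to the LLM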
| WHERE semantic_description:"what are hobbits?"
| SORT _score DESC
| LIMIT 10
| FORK
   ( EVAL snippets = TOP_SNIPPETS(semantic_description, "what are hobbits?", {"num_snippets": 3})
    | COMPLETION CONCAT("Answer what are hobbits", snippets) WITH {"inference_id": "my-completion-endpoint"})</code></pre><p>In this example, we’re performing a semantic search, but for each document we’re identifying the top snippets from that document. We’re sending the highly relevant snippets into the completion command, rather than the entire document. This is a simple example, but you could also use reranking here, and in the future, when multiple forks are available, hybrid search will be supported in the same format.</p><p>We can also utilize snippets in the newest version of <code>RERANK</code>:</p><pre><code>FROM books
| WHERE MATCH(title, "ring")
| EVAL snippets = TOP_SNIPPETS(semantic_description, "what are hobbits?", { "num_snippets": 3, "num_words": 25 })
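// Rerank on the short extracted snippets rather than the full field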
| RERANK "what are hobbits?" ON  snippets WITH { "inference_id" : "my-reranker" }</code></pre><h2>What we’re thinking about next</h2><p>The story isn’t over for chunking and snippet extraction; in fact, it’s only getting started.</p><p>We’re looking at how to best integrate existing <code>semantic_text</code> chunks out of the box into strategies using chunking and snippet extraction. We’re also exploring what other features we need to make snippet extraction a compelling feature to use in products such as Elastic Agent Builder.</p><p>Overall, we’re excited to share these tools and look forward to your feedback as we evolve our strategies for getting the best context for LLMs!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improve search performance with `best_compression`]]></title>
            <description><![CDATA[While `best_compression` is typically seen as a storage-saving feature for Elastic Observability and Elastic Security use cases, this blog demonstrates its effectiveness as a performance-tuning lever for search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/improve-elasticsearch-performance-best-compression</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/improve-elasticsearch-performance-best-compression</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sherry Ger, Ryan Eno]]></dc:creator>
            <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5141a8c2618337207d848ce16b258a86885955b2-1600x1034.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>When tuning Elasticsearch for high-concurrency workloads, the standard approach is to maximize RAM to keep the working set of documents in memory to achieve low search latency. Consequently, <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules"><code>best_compression</code></a> is rarely considered for search workloads, as it is primarily viewed as a storage-saving measure for Elastic Observability and Elastic Security use cases where storage efficiency takes priority.</p><p>In this blog, we demonstrate that when the dataset size significantly exceeds the OS page cache, <code>best_compression</code> improves search performance and resource efficiency by reducing the I/O bottleneck.</p><h2><strong>The setup</strong></h2><p>Our use case is a high-concurrency search application running on <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/ec-change-hardware-profile#ec-profiles-compute-optimized-arm">Elastic Cloud CPU optimized instances</a>.</p><ul><li>Data volume: ~500 million documents</li><li>Infrastructure: 6 Elastic Cloud (Elasticsearch service) instances (each instance: 1.76 TB storage | 60 GB RAM | 31.9 vCPU)</li><li>Memory-to-storage ratio: ~5% of the total dataset fits into RAM</li></ul><h2><strong>The symptoms: high latency</strong></h2><p>We observed that when the number of concurrent requests spiked around 19:00, the search latency deteriorated significantly. As shown in Figure 1 and Figure 2, while traffic peaked around 400 requests per minute per Elasticsearch instance, the average query service time degraded to over 60ms.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f9c6cc1882e7db24336c65c54bbc1d38dcdb7fa3-697x311.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbb705ae2fcd14c81d335bf322346caf3bf33765-996x618.png" style="width: 100%;" /><p>The CPU usage remained relatively low after the initial connection handling, indicating that compute was not the bottleneck.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cec15a28d2d22e9adedd2951bb2334b3717890a1-1494x730.png" style="width: 100%;" /><p>A strong correlation emerged between query volume and page faults. As requests increased, we observed a proportional rise in page faults, peaking around 400k/minute. This indicated that the active dataset could not fit in the page cache.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/511f2f10300a9d10ba3d7a82b9a8c8d567ac5636-1492x678.png" style="width: 100%;" /><p>Simultaneously, the JVM heap usage appeared to be normal and healthy. This ruled out garbage collection issues and confirmed the bottleneck was I/O.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/336bbad638f866304358dba1d06ee987de0f23cf-1490x568.png" style="width: 100%;" /><h2><strong>The diagnosis: I/O bound</strong></h2><p>The system was I/O bound. <a href="https://www.elastic.co/blog/elasticsearch-caching-deep-dive-boosting-query-speed-one-cache-at-a-time">Elasticsearch relies on the OS page cache to serve index data from memory</a>. When the index is too large for the cache, queries trigger expensive disk reads.
While the typical solution is to scale horizontally (add nodes/RAM), we wanted to exhaust efficiency improvements on our existing resources first.</p><h2><strong>The fix</strong></h2><p>By default, Elasticsearch uses <a href="https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)">LZ4</a> compression for its index segments, striking a balance between speed and size. We hypothesized that switching to <code>best_compression</code> (which uses <a href="https://en.wikipedia.org/wiki/Zstd">zstd</a>) would reduce the size of indices. A smaller footprint allows a larger percentage of the index to fit in the page cache, trading a negligible increase in CPU (for decompression) for a reduction in disk I/O.</p><p>To enable <code>best_compression</code>, we reindexed the data with the index setting <code>index.codec: best_compression</code>. Alternatively, the same result could be achieved by closing the index, resetting the index codec to <code>best_compression</code>, and then performing a segment merge.</p><pre><code>POST my-index/_close
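# The index must be closed before the codec can be changed; the force merge
# below rewrites existing segments so they are stored with the new codec.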
PUT my-index/_settings
{
    "codec": "best_compression"
}
  
POST my-index/_open  
POST my-index/_forcemerge?max_num_segments=1</code></pre><h2><strong>The results</strong></h2><p>The results confirmed our hypothesis: improved storage efficiency directly translated into a substantial boost in search performance with no accompanying increase in CPU utilization.</p><p>Applying <code>best_compression</code> reduced the index size by approximately 25%. While less than the reduction seen in repetitive log data, this 25% reduction effectively increased our page cache capacity by the same margin.</p><p>During the next load test (starting at 17:00), the traffic was even higher, peaking at 500 requests per minute per Elasticsearch node.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fc1902f396cb2115c0013155ad07f6eb87389c60-660x309.png" style="width: 100%;" /><p>Despite the higher load, the CPU utilization was lower than in the previous run. The elevated usage in the earlier test was likely due to the overhead of excessive page fault handling and disk I/O management.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/15699ef4c65b3f0a9f8a3e1bae8bb18f7b647025-819x352.png" style="width: 100%;" /><p>Crucially, page faults dropped significantly. Even at higher throughput, faults hovered around &lt;200k per minute, compared to &gt;300k in the baseline test.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f76ca967976d740af88a9359b66041701abb46fc-764x340.png" style="width: 100%;" /><p>Although the page fault results were still less than optimal, query service time was cut by about 50%, hovering below 30ms even under heavier load.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/750ec1c59b8eb5069aed4c066d856ecea82d5bca-620x311.png" style="width: 100%;" /><p></p><h2><strong>The conclusion: best_compression for search</strong></h2><p>For search use cases where data volume exceeds available physical memory, <code>best_compression</code> is a potent performance-tuning lever.</p><p>The conventional solution to cache misses is to scale out to increase RAM. However, by reducing the index footprint, we achieved the same goal: maximizing the document count in the page cache. Our next step is to explore <a href="https://www.elastic.co/blog/space-savings-a-lesser-known-benefit-of-index-sorting-in-elasticsearch"><strong>index sorting</strong></a> to further optimize storage and squeeze even more performance out of our existing resources.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Builder now GA: Ship context-driven agents in minutes]]></title>
            <description><![CDATA[Agent Builder is now GA. Learn how it allows you to quickly develop context-driven AI agents.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Anish Mathur, Evan Castle]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6840eb7dbb884055ab0e965dcfd614fec54936af-2210x1440.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of Agent Builder in Elastic Cloud Serverless and in the upcoming 9.3 release. Agent Builder brings the power of Elasticsearch as a context engineering platform to quickly develop contextual, data-focused AI agents.</p><p>Agents are gaining traction driven by their potential to deliver efficiency gains and better customer experiences. But in practice, providing agents with the right context is difficult, especially when operating over messy, unstructured enterprise data. Developers must manage tools, prompts, state, reasoning logic, models, and crucially retrieve relevant context from business sources to deliver accurate results and actions. Elastic Agent Builder delivers these core components to develop secure, reliable, context-driven agents.</p><h2>Agent Builder core capabilities</h2><p>Agent Builder leverages Elastic’s long-term investments in search relevance and retrieval-augmented generation, and work to make Elasticsearch the best vector database to simplify the development of contextual, data-focused AI agents.</p><p>Agent Builder allows you to:</p><ul><li>Immediately start with a built-in conversational agent that can answer questions, perform analytics and drive investigations over any data in Elasticsearch.</li><li>Quickly go from complex unstructured data to a custom agent with configuration-based development experience.</li><li>Leverage best-in-class, hybrid search relevance through built-in ES|QL or custom tools to improve context quality and agent reliability.</li><li>Execute complex workflows (preview) as reusable tools to enrich data, update records, send messages, and more for rules-based automation.</li><li>Connect to data sources outside of Elasticsearch using workflows and MCP to correlate and combine context for agents.</li><li>Integrate with any agentic or application framework using built-in and custom tools exposed over MCP, and the ability to connect to external MCP (preview), support for A2A, and full API support.</li><li>Extend the capabilities of Agent Builder with integration to third-party solutions like LlamaIndex for complex document processing or Arcade.dev for secure, structured tool access.</li></ul><p>To further extend Agent Builder functionality, we are introducing Elastic Workflows, our new rules-based automation capabilities, now in technical preview. For organizational tasks, agents at times need certainty and reliability of rules-based actions, which are often necessary to implement specific business logic. Elastic Workflows provides agents with a simple, declarative way to orchestrate internal and external systems to take actions, gather and transform data and context. Workflows are fully composable, event-driven and flexible, and can be exposed as tools to an agent via MCP.</p><h2>Go from data to agent in minutes</h2><p>Developing agents can take weeks of upfront work to consolidate separate data stores, build manual pipelines, tune queries, and manage complex orchestration. 
Agent Builder reduces the time to develop agents by removing the need for separate data stores, vector databases, RAG pipelines, search layers, query translators, and tool orchestrators, allowing you to focus on agent logic and application delivery.</p><p>Agent Builder natively integrates Elasticsearch platform primitives to make agent development fast.</p><ul><li>Start with a built-in conversational agent that can immediately chat with and reason with your indexed data.</li><li>Integrate agents into applications, dashboards, or CI/CD systems with interactive access via Kibana, APIs, or MCP and A2A.</li><li>Build with default tooling to understand your data structure, select the appropriate index, generate optimized hybrid, semantic, and structured queries, and create configurable visualizations using ES|QL based on natural language prompts.</li></ul><p>To go deeper, try a complete <a href="https://www.elastic.co/search-labs/blog/ai-agent-builder-elasticsearch">hands-on walkthrough</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b55b63eae40f72952967cc8f3ea4df4cd62d7d70-1080x608.gif" style="width: 100%;" /><h2>Build on Elasticsearch, a complete data platform for context engineering</h2><p>For AI agents, context quality is essential to provide effective reasoning and reduce the risks of hallucination. For many enterprise AI agents, the business data required to perform a task is the most crucial piece of context. As a massively scalable data store, vector database, and leader in relevance, Elasticsearch already offers many strong context-engineering primitives. Context engineering goes beyond simple retrieval-augmented generation by allowing you to tailor and scale how data is fetched, ranked, filtered, and presented to agents, helping reduce noise and ambiguity.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/419b9b6f13739e0a8983249d8ac31478e73dac89-1600x901.png" style="width: 100%;" /><p>Elasticsearch delivers a context engine that combines lexical search, vector search, and structured filtering for retrieval that materially <a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">improves LLM performance</a> by ensuring the model operates on relevant and precise context. This capability is supported by agentic retrieval, along with built-in tools and search logic that automatically select the right indexes and transform natural language into optimized queries for context.</p><p>With Agent Builder, you can ensure agents receive the most useful context first with controls for relevance and ranking, allowing you to fine-tune scoring, ranking, and filtering logic. Elasticsearch lets you control what matters, why it matters, and how it is prioritized, instead of relying on opaque retrieval behavior. This is all underpinned by Elasticsearch as a scalable data platform to store and scale all your data from text, vectors, metadata, logs, and more on one platform, making it easier to manage context for agents.</p><h2>Execute complex workflows as reusable tools</h2><p>While AI agents enable reasoning for complex tasks, much automation depends on reliably executing rules-based actions that enforce specific business logic. Elastic Workflows provides a simple, declarative way to orchestrate internal and external systems to take actions, gather context or data, and integrate them as part of agents. 
Defined in YAML, workflows are fully composable, allowing them to be as simple or as complex as the job requires. This gives agents an efficient way to take action across the Elasticsearch platform and solutions, as well as with third-party applications.</p><p>Integrating a workflow with Agent Builder can be done in three steps (prerequisite: enable workflows with details provided <a href="https://github.com/elastic/workflows">here</a>)</p><p>1. Create and save a new workflow using the simple YAML-based editor with built-in autocomplete and testing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/308888bf3d2fa013f9391a55be6a6fbd458b6dac-1600x998.png" style="width: 100%;" /><p>2. Create a new tool in Agent Builder with type “Workflow” and provide a description to help the agent determine when to use the workflow tool.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c04810d30d226112c3610bd58e208607b213fc3d-1600x945.png" style="width: 100%;" /><p>3. Add the workflow tool to your custom agent.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/724cd4ac93c46efb0d339fd140e5caf138f8150f-1600x948.png" style="width: 100%;" /><p>4. That’s it! Now the agent can call the workflow from within a conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8dfdd726ab89e31c48b79372650ce33946713dca-1600x929.png" style="width: 100%;" /><h2>Your agent, your rules</h2><p>Agent Builder doesn’t lock you into a single development paradigm. Instead, it’s designed to enable open, flexible development approaches for agents with full control of data, relevance, models, interoperability, security and agent design.</p><p>Custom agent definitions let you choose exactly which tools an agent can access, embed custom system prompts, tailor the agent’s instructions, and define security boundaries. Agents remain model-agnostic, allowing you to flexibly configure a preferred LLM, both native and across the broader ecosystem, without being locked into a single provider.</p><p>Build extensible tools that encapsulate domain-specific logic (e.g., specific index filters, ES|QL joins, analytical pipelines), and constrain them for safe use in production. Full API support enables interoperability with other agentic frameworks, with native support for Model Context Protocol (MCP). A2A integration means you can expose your Elastic agents to other frameworks, services, and client apps, reusing the same data and context engineering logic across integrations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e903ba24ffb3f40231e901f63bd494c89cb7757-1600x1004.png" style="width: 100%;" /><p>Agent Builder supports flexible, open development and is designed to integrate easily with popular agent frameworks and platforms. These integrations can be essential to delivering effective agents. As <strong>Sam Partee, Co-Founder at Arcade.dev</strong> describes,</p><p><em>“Agentic systems fail today because connecting AI to tools and data is complex. Elastic Agent Builder with Arcade.dev gives developers a structured, secure way to handle how agents retrieve context, reason, and act, taking agents from demo to production grade.&quot;</em></p><p>Agent Builder also leverages the extensibility of Elasticsearch for handling complex data. As <strong>Jerry Liu, CEO at LlamaIndex </strong>describes,</p><p><em>“Unlocking enterprise context from unstructured data sources is key to building effective agents. 
Elastic Agent Builder combined with LlamaIndex’s complex document processing strengthens the critical context layer, helping teams retrieve, process, and prepare data so agents can reason more accurately and deliver better outcomes.”</em></p><h2>What can you build?</h2><p>Agent Builder is already being used for a variety of use cases. Below are a few examples and reference architectures to get started with agents:</p><ul><li><strong>Automate infrastructure: </strong>In support scenarios, agents have been used to read, think, and chat, but to date, they cannot reach out and touch the infrastructure they may need to manage. Elastic’s engineering team built an agent for <a href="https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure">automated infrastructure management</a> as part of a hackathon. The agent actively investigates issues with application infrastructure and takes automated actions. It uses workflows to optimize configurations, respond to issues and scale resources, all based on an intelligent understanding of infrastructure logs.</li><li><strong>Security threat analysis: </strong>A security vulnerability agent was developed with Elastic Agent Builder, MCP, and Elasticsearch. It automates threat analysis by correlating internal security data with external threat intelligence. The agent performs semantic search over historical incidents and configurations, augments results with live internet data, and applies LLM reasoning to assess environmental relevance, prioritize risks, and produce actionable remediation. See the <a href="https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch">reference architecture</a><strong>.</strong></li><li><strong>Technical customer support: </strong>Agents can perform multiple support tasks, including case summarization, issue deduplication and creation, and deep technical investigation. Agent Builder enables this with multi-step, hybrid search to find only the most relevant related issues, solutions, and procedures, and formulate root cause hypotheses and remediation plans. Agent Builder can simplify the architecture of complex <a href="https://www.elastic.co/blog/generative-ai-customer-support-elastic-support-assistant">support systems</a> and accelerate time to delivery.</li><li><strong>Product and content discovery:</strong> Agent Builder simplifies the process of <a href="https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder">exposing complex product catalogs for conversational experiences</a>, while allowing organizations to maintain flexibility to include their own business logic and requirements.</li><li><strong>Build your own:</strong> Join the <a href="https://elasticsearch.devpost.com/">Agent Builder Hackathon,</a> running from January 22 to February 27, 2026. Work with the community to build context-driven, multi-step AI agents that combine search, workflows, tools, and reasoning to automate real-world tasks.*</li></ul><h2>Start building custom agents now</h2><p>Get started with an <a href="https://cloud.elastic.co/registration?onboarding_token=search&amp;pg=en-enterprise-search-page">Elastic Cloud Trial</a>, and check out the documentation <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">here</a>.
For existing customers, Agent Builder is available in Cloud Serverless and on the Enterprise Tier in Elastic Cloud Hosted and self-managed.</p><p>* <a href="https://elasticsearch.devpost.com/rules">Click here</a> for full terms, conditions, and eligibility requirements for the hackathon.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building voice agents with Elastic Agent Builder]]></title>
            <description><![CDATA[Exploring how voice agents work and how to build one using Elastic Agent Builder and LiveKit.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kenneth Kreindler]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43ceabb9e2c0966261c188bd40e03178d5a91e5c-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>AI has been trapped in a glass box. You type commands, it responds with text, and that’s it. It’s useful but distant, like watching someone move behind a screen. This year, 2026, will be the year when business will shatter this glass and bring AI agents into products, where they really deliver value.</p><p>One of the ways the glass will be shattered is by the adoption of <em>voice agents</em>, which are AI agents that recognize human speech and synthesize computer-generated audio. With the rise of low-latency transcriptions, fast large language models (LLMs), and text-to-speech models that sound human, this has become possible.</p><p>Voice agents also need access to business data to become really valuable. In this blog, we’ll learn how voice agents work and build one for ElasticSport, a fictitious outdoor sports equipment shop, using <a href="https://livekit.io/">LiveKit</a> and <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>. Our voice agent will be context-aware and will work with our data.</p><h2>How it works</h2><p>There are two paradigms in the world of voice agents: The first uses speech-to-speech models, and the second uses a voice pipeline consisting of speech-to-text, LLM, and text-to-speech. Speech-to-speech models have their own benefits, but voice pipelines offer much more customization over the technologies used and how context is managed, along with control over the agent’s behavior. We’ll focus on the voice pipeline model.</p><h3>Key components</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b237501903f9c3a71fe1b7755c3990e40c5495c8-1600x653.png" style="width: 100%;" /><h4>Transcription (speech-to-text)</h4><p>Transcription is the voice pipeline entrypoint. The transcription component takes as input raw audio frames, transcribes speech into text, and outputs that text. The transcribed text is buffered until the system detects that the user&#x27;s speech has ended, at which point the LLM generation is kicked off. Various third-party providers offer low-latency transcriptions. When selecting one, consider latency and transcription accuracy, and ensure that they support streamed transcripts.</p><p></p><p>Examples of third-party APIs: <a href="https://www.assemblyai.com/">AssemblyAI</a>, <a href="https://deepgram.com/product/speech-to-text">Deepgram</a>, <a href="https://platform.openai.com/docs/guides/realtime-transcription">OpenAI</a>, <a href="https://elevenlabs.io/speech-to-text">ElevenLabs</a></p><h4>Turn detection</h4><p>Turn detection is the component of the pipeline that detects when the speaker has finished speaking and the generation should begin. One common way to do this is through a voice activity detection (VAD) model, such as <a href="https://github.com/snakers4/silero-vad">Silero VAD</a>. VAD uses audio energy levels to detect when audio contains speech and when speech has ended. However, VAD by itself cannot identify the difference between a pause and the end of speech. This is why it’s often combined with an end-of-utterance model that predicts whether the speaker has finished speaking, based on the interim transcript or raw audio.</p><p>Examples (Hugging Face): <a href="https://huggingface.co/livekit/turn-detector">livekit/turn-detector</a>, <a href="https://huggingface.co/pipecat-ai/smart-turn-v3">pipecat-ai/smart-turn-v3</a></p><h4>Agent</h4><p>The agent is the core of a voice pipeline. 
It’s responsible for understanding intent, gathering the right context, and formulating a reply in text format. <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>, with its built-in reasoning capabilities, tool library, and workflow integration, makes for an agent that can work on top of your data and interact with external services.</p><h4>LLM (text-to-text)</h4><p>When selecting an LLM for Elastic Agent Builder, there are two main characteristics to consider: LLM reasoning benchmarks and time to first token (TTFT).</p><p>Reasoning benchmarks indicate how well the LLM is able to generate correct responses. Benchmarks to consider include those that evaluate multiturn conversation adherence and general intelligence, such as MT-Bench and the Humanity&#x27;s Last Exam dataset, respectively.</p><p>TTFT benchmarks evaluate how fast the model produces its first output token. There are other types of latency benchmarks, but TTFT is particularly important for voice agents, as audio synthesis can begin as soon as the first token is received, resulting in lower latency between turns and a more natural-feeling conversation.</p><p>Usually, one needs to make a trade-off between these two characteristics because faster models often perform worse on reasoning benchmarks.</p><p>Examples (Hugging Face): <a href="https://huggingface.co/openai/gpt-oss-20b">openai/gpt-oss-20b</a>, <a href="https://huggingface.co/openai/gpt-oss-120b">openai/gpt-oss-120b</a></p><h4>Synthesis (text-to-speech)</h4><p>The final part of the pipeline is the text-to-speech model. This component is responsible for converting the text output from the LLM into audible speech. Similar to the LLM, latency is a characteristic to look out for when selecting a text-to-speech provider. Text-to-speech latency is measured by time to first byte (TTFB). That’s the time it takes for the first audio byte to be received. Lower TTFB also reduces turn latency.</p><p>Examples: <a href="https://elevenlabs.io/text-to-speech-api">ElevenLabs</a>, <a href="https://cartesia.ai/sonic">Cartesia</a>, <a href="https://www.rime.ai/">Rime</a></p><h4>Constructing the voice pipeline</h4><p>Elastic Agent Builder can be integrated into a voice pipeline at several different levels:</p><ol><li>Agent Builder tools only: speech-to-text → LLM (with Agent Builder tools) → text-to-speech</li><li>Agent Builder as an MCP: speech-to-text → LLM (with Agent Builder access via MCP) → text-to-speech</li><li>Agent Builder as the core: speech-to-text → Agent Builder → text-to-speech</li></ol><p>For this project, I chose the “Agent Builder as the core” approach. With this approach, the full functionality of Agent Builder and workflows can be used. The project uses LiveKit to orchestrate speech-to-text, turn detection, and text-to-speech, and it implements a custom LLM node that integrates directly with Agent Builder.</p><h2>Elastic support voice agent</h2><p>We’ll be building a custom support voice agent for a fictitious sports shop called ElasticSport. Customers will be able to call the help line, ask for product recommendations, find product details, check order statuses, and have order information sent to them via text.
To achieve this, we first need to configure a custom agent and create tools for executing Elasticsearch Query Language (ES|QL) queries and workflows.</p><a href="https://videos.elastic.co/watch/jpX6D21URZuh3PtXVV7Ks6">Watch video on Vidyard</a><h3>Configuring the agent</h3><h4>Prompt</h4><p>The prompt instructs the agent what personality it should take and how to respond. Importantly, there are a few voice-specific prompts that ensure the responses are synthesized into audio properly and misunderstandings are recovered from gracefully.</p><pre><code>You are a Sales Assistant at ElasticSport, an outdoor sport shop specialized in hiking and winter equipment. 

[Profile]
- name: Iva
- company: ElasticSport
- role: Sales Assistant
- language: en-GB
- description: ElasticSport virtual sales assistant

[Context]
- Ask clarifying questions to understand the context.
- Use available tools to answer the user's question.
- Use the knowledge base to retrieve general information

[Style]
- Be informative and comprehensive.
- Maintain a professional, friendly and polite tone.
- Mimic human behavior and speech patterns.
- Be concise. Do not over explain initially

[Response Guideline]
- Present dates in spelled-out month date format (e.g., January fifteenth, two thousand and twenty-four).
- Avoid the use of unpronounceable punctuation such as bullet points, tables, emojis.
- Respond in plain text, avoid any formatting.
- Spell out numbers as words for more natural-sounding speech.
- Respond in short and concise sentences. Responses should be 1 or 2 sentences long.

[ERROR RECOVERY]
### Misunderstanding Protocol
1. Acknowledge potential misunderstanding
2. Request specific clarification</code></pre><h4>Workflows</h4><p>We’ll add a small workflow to send an SMS through Twilio’s messaging API. The workflow will be exposed to the custom agent as a tool, resulting in a user experience where the agent can send the caller an SMS while on the call. This allows the caller to, for example, ask, “Can you send more details about <em>X</em> over text?”</p><pre><code>name: send sms
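# Exposed to the agent as a workflow tool: the agent fills in the two
# inputs below (message, phone_number) from the conversation.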
enabled: true
triggers:
  - type: manual
inputs:
  - name: message
    type: string
    description: The message to send to the phone number.

  - name: phone_number
    type: string
    description: The phone number to send the message to.

consts:
  TWILIO_ACCOUNT: "****"
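  # BASIC_AUTH is the raw "AccountSid:AuthToken" pair for Twilio; the HTTP
  # step below pipes it through base64_encode to form the Basic auth header.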
  BASIC_AUTH: "****"
  FROM_PHONE_NUMBER: "****"
steps:
  - name: http_step
    type: http
    with:
      url: https://api.twilio.com/2010-04-01/Accounts/{{consts.TWILIO_ACCOUNT}}/Messages.json
      method: POST
      headers:
        Content-Type: application/x-www-form-urlencoded
        Authorization: Basic {{consts.BASIC_AUTH | base64_encode}}
      body: From={{consts.FROM_PHONE_NUMBER}}&To={{inputs.phone_number}}&Body={{inputs.message}}
      timeout: 30s</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b057e71b0a7c50eb3da47cd4f95e77ec7b4c6126-1600x1245.png" style="width: 100%;" /><h4>ES|QL tools</h4><p>The following tools allow the agent to provide relevant responses that are grounded in real data. The example repo contains a setup script to initialize Kibana with product, order, and knowledge base datasets.</p><ul><li><strong>Product.search</strong></li></ul><p>The product dataset contains 65 fictitious products. This is an example document:</p><pre><code>{
      "sku": "ort3M7k",
      "name": "Ortovox Free Rider 26 Backpack",
      "price": 189,
      "currency": "USD",
      "image": "https://via.placeholder.com/150",
      "description": "The Ortovox Free Rider 26 is a technical freeride backpack with a dedicated safety compartment and diagonal ski carry system. Perfect for backcountry missions.\n\nKey Features:\n- 26L capacity\n- Diagonal ski carry system\n- Safety equipment compartment\n- Helmet holder\n- Hydration system compatible",
      "category": "Accessories",
      "subCategory": "Backpacks",
      "brand": "Ortovox",
      "sizes": ["One Size"],
      "colors": ["Black", "Blue", "Orange"],
      "materials": ["Nylon", "Polyester"]
    }</code></pre><p>The name and description fields are mapped as <code>semantic_text</code>, enabling the LLM to use semantic search via ES|QL to retrieve relevant products. The hybrid search query performs semantic matching across both fields, with a slightly higher weight applied to matches on the name field using a boost.</p><p>The query first retrieves the top 20 results ranked by their initial relevance score. These results are then reranked based on their description field using the <code>.rerank-v1-elasticsearch</code> inference model, and finally trimmed down to the top five most relevant products.</p><pre><code>type: ES|QL
toolId: products.search
description: Use this tool to search through the product catalogue by keywords.
query: |
    FROM products
        METADATA _score
      | WHERE
          MATCH(name, ?query, {"boost": 0.6}) OR
          MATCH(description, ?query, {"boost": 0.4})
      | SORT _score DESC
      | LIMIT 20
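      // Keep the 20 best hits by hybrid score, then let the reranker
      // reorder them on the description field before trimming to five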
      | RERANK ?query
            ON description
            WITH {"inference_id": ".rerank-v1-elasticsearch"}
      | LIMIT 5

parameters:
    query: space separated keywords to search for in catalogue</code></pre><ul><li><strong>Knowledgebase.search</strong></li></ul><p>The knowledgebase datasets contain documents of the following shape, where the title and content fields are stored as semantic text:</p><pre><code>{
        id: "8273645",
        createdAt: "2025-11-14",
        title: "International Orders",
        content: `International orders are processed through our international shipping partner. Below are the countries we ship to and average delivery times.
        Germany: 3-5 working days
        France: 3-5 working days
        Italy: 3-5 working days
        Spain: 3-5 working days
        United Kingdom: 3-5 working days
        United States: 3-5 working days
        Canada: 3-5 working days
        Australia: 3-5 working days
        New Zealand: 3-5 working days
        `
}</code></pre><p>The tool uses a query similar to that of the <code>products.search</code> tool:</p><pre><code>type: "ES|QL"
toolId: knowledgebase.search
description: Use this tool to search the knowledgebase.
query: |
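  // Same hybrid match + rerank pattern as products.search, applied to title and content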
  FROM knowledge_base
    METADATA _score
  | WHERE
      MATCH(title, ?query, {"boost": 0.6}) OR
      MATCH(content, ?query, {"boost": 0.4})
  | SORT _score DESC
  | LIMIT 20
  | RERANK ?query
      ON content
      WITH {"inference_id": ".rerank-v1-elasticsearch"}
  | LIMIT 5

parameters:
  query: space-separated keywords or a natural-language phrase to semantically search for in the knowledge base</code></pre><ul><li><strong>order.search</strong></li></ul><p>The final tool that we’ll add is the one used to retrieve orders by <code>order_id</code>:</p><pre><code>type: "ES|QL"
toolId: order.search
description: Use this tool to retrieve an order by its ID.
query: |
  FROM orders
    METADATA _score
  | WHERE order_id == ?order_id
  | SORT _score DESC
  | LIMIT 1

parameters:
  order_id: "the ID of the order"</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d22bdd540a95b5a9c2bd5f308620835e8e6f7ecb-1600x1361.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d91709a50fb5391876b714885242d998b2b21027-1600x1443.png" style="width: 100%;" /><p>After configuring the agent and attaching these workflows and ES|QL tools to the agent, the agent can be tested inside of Kibana.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8691f41624247a6b1352d158c970031e1426ce5e-1600x1056.png" style="width: 100%;" /><p>Aside from building an ElasticSport support agent, the agent, workflows, and tools can be tailored to other use cases, such as a sales agent that qualifies leads, a servicing agent for home repairs, reservations for a restaurant, or an appointment-scheduling agent.</p><p></p><p>The final part is linking up the agent we just made with LiveKit, text-to-speech, and speech-to-text models. The repo linked at the end of this blog contains a custom Elastic Agent Builder LLM node that can be used with LiveKit. Just replace the <code>AGENT_ID</code> with your own, and link it with your Kibana instance.</p><h2>Getting started</h2><p>Check out the code, and try it for yourself <a href="https://github.com/KDKHD/elastic_agent_builder_livekit">here</a>. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Builder, beyond the chatbox: Introducing Augmented Infrastructure]]></title>
            <description><![CDATA[Learn about Elastic Agent Builder with Augmented Infrastructure, an AI agent that enables augmented operations, augmented development, and augmented synthetics.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Wert, Bill Easton, Gil Raphaelli, Graham Hudgins, Norrie Taylor]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e08daf78f328e826f39d06329f6a5487f75d178d-1272x700.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p><strong>This is not talk. We&#x27;re doing it.</strong></p><p>We’ve all seen the rise of AI agents. They’re fantastic at summarizing text, writing code snippets, and answering questions based on documentation. But for those of us in DevOps and site reliability engineering (SRE), there has been a frustrating limitation. Most agents are trapped in the Call Center paradigm, meaning that they can read, think, and chat, but they cannot reach out and touch the infrastructure they’re supposed to be managing.</p><p>For our latest hackathon project, we decided to blow that limitation up.</p><p>We built <strong>Augmented Infrastructure</strong>: an infrastructure copilot that not only gives you advice but also creates, deploys, monitors, and fixes your live environment.</p><h2><strong>The problem: Copy, reformat, paste</strong></h2><p>Standard agents operate in a vacuum. If your app goes down and costs the company $5 million, a standard agent can read you the runbook on how to fix it. But <em>you</em> still have to do the work. You’re left to copy the code, reformat it for your environment, and paste it into your terminal.</p><p>We wanted an agent that understands the difference between <em>talking</em> about Kubernetes and <em>configuring</em> Kubernetes.</p><h2><strong>The engine: What is Elastic Agent Builder?</strong></h2><p>To build this, we didn&#x27;t start from scratch. We built it on top of <a href="https://www.elastic.co/elasticsearch/agent-builder"><strong>Elastic Agent Builder</strong></a>. For those unfamiliar, Elastic Agent Builder is a framework designed to rapidly develop agents, and it acts as the bridge between a large language model (LLM), which in our demo was Google Gemini, and private data stored in Elasticsearch.</p><p>Agent Builder can be used for conversational AI by grounding it in internal data, like documents or logs. But its most powerful feature is the ability to assign <strong>tools</strong>. These tools allow the LLM to step outside of the chat interface to perform specific tasks. We realized that if we pushed this feature to its limit, we could transform Agent Builder into an automation powerhouse.</p><h2><strong>Making it work: Building the first version</strong></h2><p>When we started on the project, we knew we wanted to make the agents able to change the outside world. We had an idea: What if we built some “runner” software (to run any command the agent could think of on the host)? And then: What if the runners, Elastic Agent Builder, and the user were in a three-way call?</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8dc8317c1301b8eb7b89438529e8d8d17411c95a-1024x559.png" style="width: 100%;" /><p>We started by building out a Python project, Augmented Infrastructure Runners, which was essentially a <code>while(true)</code> loop that queried the Elastic Agent Builder conversations API every second and checked for a special syntax we had created:</p><pre><code>{
	"tool_name": "my_tool",
       "tool_arguments": "\{stringified json arguments\}"
}</code></pre><p>We then updated the prompt to teach it about our new tool calling syntax. Bill is a maintainer of <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, the most popular framework for building Model Context Protocol (MCP) servers in Python. He set to work using the FastMCP client with this new runner software to mount MCP servers and make their tools available to the runner. When the runner saw this syntax, it would run the tool call and <code>POST</code> the results back to the conversation as if the user had sent them. This triggered the LLM to respond to the result, and off we went!</p><p>This was great, but it had two main problems:</p><ol><li>The agent would spew all of this JSON right into the conversation with the user.</li><li>The earliest point in time when messages were visible through the conversations API was when a conversation round was completed (that is, when the LLM replied).</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7b3755aeae17722ff1bb9677712293e9195f96a0-1058x1034.png" style="width: 100%;" /><p>So we set out to figure out how to move this into the background.</p><p>We then switched to giving the agent a tool called <code>call_external_tool</code> with two arguments: the <code>tool_name</code> and the stringified JSON tool arguments. This external tool call would return nothing, but importantly, would be visible in the <code>GET</code> request to the conversations API. We then gave the runners permission to write documents directly to Elasticsearch, which the Elastic Agent Builder agent could retrieve as needed. The agent is always operating in response to a user message, so we need to kick-start the agent with a user message so that it will go look for results and continue processing. So we had the agents insert a small message into the chat to resume the conversation:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/61ff59a57c68ed5fad492d19c0580644113a507d-1600x1321.png" style="width: 100%;" /><p>So now we had external tool calls. However, because of the second problem mentioned above, we had to get rid of that final kick-start part. Otherwise, every external tool call required a full conversation round to retrieve the results!</p><h2><strong>Making it great: Introducing workflows</strong></h2><p>In addition to Elasticsearch Query Language (ES|QL) and index search tool calls, Agent Builder agents can call Elastic workflow-based tools. Elastic workflows provide a flexible, easy-to-manage way to execute arbitrary sequences of actions and logic. For our purposes, all we need the workflow to do is store an external tool request in Elasticsearch and return an ID for polling the results. This results in the following simple workflow definition:</p><pre><code>name: ai-tool-call
enabled: true
triggers:
  - type: manual
inputs:
  - name: runner_id
    type: string
  - name: tool_calls
    type: string

steps:
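  # Persist the tool-call request; runners poll this index for documents with status "unhandled"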
  - name: store_request
    type: elasticsearch.create
    with:
      index: distributed-tool-requests
      id: "{{inputs.runner_id}}_{{ execution.id }}"
      document:
        request_id: "{{ execution.id }}"
        runner_id: "{{inputs.runner_id}}"
        tool_call: "{{inputs.tool_calls}}"
        status: "unhandled"

  - name: output_result
    type: console
    with:
      message: "Called tool, with execution id: {{ execution.id }}. Use this ID to poll the results."</code></pre><p>With that, instead of relying on the tool call request being written into the conversation, the runners can just poll the Elasticsearch <code>distributed-tool-requests</code> index for new external tool requests and report the results back into another Elasticsearch index with the provided <code>execution.id</code>.</p><p>This eliminates the two main issues mentioned above:</p><ol><li>The conversation history isn’t cluttered with the payload for the external tool calls anymore.</li><li>As the runners are polling the Elasticsearch index instead of the conversation history, they aren’t blocked by the conversation round to be completed for the external tool requests to become visible.</li></ol><p>The second point has the great advantage that processing of the external tool calls starts within the agent’s thinking phase (rather than when the conversation round has been completed). This allows us to instruct the LLM in the system prompt to poll for the external tool results until the results are available and eliminates the need for the kick-start message. Overall, this has the nice effect that the conversation feels more natural: The LLM can process multiple external tool requests within a single conversation round (instead of requiring one conversation round per tool request) and, thus, can accomplish more complex user requests in one go.</p><h2><strong>Putting it all together</strong></h2><p>To bridge the gap between the LLM and the server rack, we developed a specific architecture using Agent Builder’s tool capabilities:</p><ol><li><strong>Augmented Infrastructure runners:</strong> We deployed lightweight runners inside the target environments (servers, Kubernetes clusters, cloud accounts). These runners are connected directly to Elastic, using secured endpoints and secrets only available to each of the runners.</li><li><strong>ES|QL retrieval:</strong> The copilot uses Elastic’s <strong>ES|QL</strong> to perform hybrid searches. It doesn&#x27;t just search for knowledge; it searches for <em>capabilities</em>. It queries the connected runners to see which tools are available (for example, <code>list_ec2_instances</code>, <code>install_helm_chart</code>).</li><li><strong>Workflow execution:</strong> Once the agent decides on a course of action, it creates a structured workflow.</li><li><strong>Feedback loop:</strong> The runners execute the command locally and report the results back into Elasticsearch. The copilot reads the result from the index and decides the next step.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/76be256da722c1965971fc506502768bd890f0c4-1290x1076.png" style="width: 100%;" /><h2><strong>The demo: From outage to observability</strong></h2><a href="https://videos.elastic.co/watch/bRgPJfJLeABjPMMhvy1sbB">Watch video on Vidyard</a><p>In the video, we showcased two distinct scenarios demonstrating the power of this architecture.</p><h3><strong>Scenario 1: DevOps rescue</strong></h3><p>We started with a user panicking about a $5-million outage caused by a blind spot in their Kubernetes cluster.</p><ul><li><strong>The request:</strong> &quot;How do I make sure this doesn&#x27;t happen again?&quot;</li><li><strong>The action:</strong> The agent didn&#x27;t just provide a tutorial. 
It identified the cluster, created the necessary namespaces, generated Kubernetes secrets, installed the OpenTelemetry Operator, and instantly provided a link to a live APM dashboard.</li><li><strong>The result:</strong> Full Kubernetes observability and application insights without the user writing a single line of YAML.</li></ul><h3><strong>Scenario 2: Security handoff</strong></h3><p>A fundamental rule of infrastructure security is that you cannot protect what you cannot see. While performing our DevOps rescue, the agent sees an opportunity to improve the security of the environment.</p><p>With an alert kicked off from a previous Elastic Observability–related investigation, we demonstrate how a security practitioner can chat directly with their infrastructure: first, to enumerate the assets and resources in their cloud environment; and second, to deploy the tools necessary to ensure that the environment is secured.</p><ul><li><strong>Discovery:</strong> The copilot enumerated AWS resources for the security practitioner and identified a critical gap: an Amazon Elastic Compute Cloud (EC2) instance and an Amazon Elastic Kubernetes Service (EKS) cluster with public endpoints missing endpoint protection.</li><li><strong>Remediation:</strong> With a simple approval, the copilot deployed <strong>Elastic Security</strong> <strong>extended detection and response (XDR) and cloud detection and response (CDR)</strong> to the vulnerable assets, securing the environment in real time.</li><li><strong>The result:</strong> Protection of deployed AWS assets and resources with complete runtime security.</li></ul><h2><strong>The future: Augmented everything</strong></h2><p>This project proves that Elastic Agent Builder can be the central brain for distributed operations. We aren&#x27;t limited to just infrastructure. Our runner technology can power:</p><ul><li><strong>Augmented synthetics:</strong> Diagnosing TLS errors across global runners.</li><li><strong>Augmented development:</strong> Creating pull requests and implementing CAPTCHAs on frontend services.</li><li><strong>Augmented operations:</strong> Automatically reconfiguring DNS resolvers during an outage.</li></ul><h2><strong>Try it yourself</strong></h2><p>We believe the future of AI isn&#x27;t just about chat support; it&#x27;s about <strong>Augmented Infrastructure</strong>. It’s about having a partner that can deploy, fix, observe, and protect alongside you.</p><p>Check out the code and try it for yourself with distributed runners (<a href="https://github.com/strawgate/augmented-infrastructure">GitHub</a>) plus Elastic Agent Builder on <a href="https://cloud.elastic.co/">Elastic Cloud Serverless</a> today!</p><ul><li>Create a serverless project on Elastic Cloud.</li><li>Deploy the code to a runner.</li><li>Set up the runner.</li><li>Configure your mcp.json.</li><li>Start the runner, which will create your agent and its tools automatically.</li><li>Chat with an agent that can reason, plan, and execute actions on your distributed runners!</li></ul><p><strong>The team: </strong><em>Alex, Bill, Gil, Graham, &amp; Norrie</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring LLM inference and Agent Builder with OpenRouter]]></title>
            <description><![CDATA[Learn how to monitor LLM usage, costs, and performance across Agent Builder and inference pipelines using OpenRouter's OpenTelemetry broadcast and Elastic APM.]]></description>
            <link>https://www.elastic.co/search-labs/blog/llm-monitoring-openrouter-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/llm-monitoring-openrouter-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d9744a1f78863313de34472dff4696ea3cf82930-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>New models are released every week that surpass previous ones in intelligence, speed, or cost. This makes vendor lock-in risky and managing multiple connectors, billing accounts, and APIs unnecessarily complex. Each model behaves differently in terms of token consumption, response latency, and compatibility with specific tool sets.</p><p>In this article, we&#x27;ll build an AI-enriched audio products catalog, connect it to Elastic Agent Builder, and use OpenRouter to access different models while monitoring their performance across the entire workflow, from data ingestion to agent interactions.</p><h2>Prerequisites</h2><ul><li>Elastic Cloud 9.2 or Elastic Cloud Serverless</li><li>Integrations server with APM enabled</li><li>OpenRouter account and <a href="https://openrouter.ai/docs/api/reference/authentication">API Key</a></li><li>Python 3.9+</li></ul><h2>What is OpenRouter?</h2><p><a href="https://openrouter.ai/">OpenRouter</a> is a platform that unifies access to over 500 models from multiple providers through a single account and API. Instead of managing separate accounts for OpenAI, Anthropic, Google, and others, you access all of them through OpenRouter.</p><p>OpenRouter handles load balancing across providers, automatically routing requests to the provider with the best latency and fewest errors. You can also manually select providers or configure fallback chains. OpenRouter is compatible with standard APIs, code assistants, integrated development environments (IDEs), and more.</p><p>One key feature is <a href="https://openrouter.ai/docs/guides/features/broadcast/overview">Broadcast</a>, which sends traces of your model usage to external observability systems. Since OpenRouter supports OpenTelemetry, we can monitor our complete pipeline plus any other OpenRouter usage costs in the Elastic Stack.</p><h2>Architecture overview</h2><p>We&#x27;ll use an audio products catalog for which we generate new fields with AI using an inference ingest pipeline and then create an agent that can answer questions based on the indexed product data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/03f9811dcd883b3c3302079369e690fc4d09350f-1156x491.png" style="width: 100%;" /><p>When ingesting data, the ingest pipeline uses an OpenRouter <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">inference endpoint</a> to generate new attribute fields based on the unstructured descriptions of the products, which triggers OpenRouter to send logs about that inference to Elasticsearch.</p><p>Similarly, when chatting with the <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Agent Builder</a> that uses this data, logs are also sent to Elasticsearch for visualization.</p><p>We’ll use separate OpenRouter API keys for Agent Builder and ingestion:</p><ul><li><code>OPENROUTER_API_KEY</code> for Agent Builder interactions</li><li><code>OPENROUTER_INGESTION_KEY</code> for the inference pipeline</li></ul><p>This allows us to differentiate traffic in monitoring dashboards and attribute costs to specific workflows.</p><h2>Setup</h2><p>First, we need to create an <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/gen-ai-connectors">AI connector</a> for the agent to interact with the large language model (LLM) and an inference endpoint for the ingest pipeline to extract fields from descriptions. 
Both connect to OpenRouter using the same API (but can use different keys for monitoring separation).</p><h3><strong>Create the AI connector</strong></h3><p>The AI connector allows Agent Builder to communicate with LLMs. We configure it to use OpenRouter as the provider:</p><pre><code>import requests
import os

ELASTIC_URL = os.getenv("ELASTIC_URL")
KIBANA_URL = os.environ["KIBANA_URL"]
ELASTIC_API_KEY = os.environ["ELASTIC_API_KEY"]
OPENROUTER_API_KEY = os.environ["OPENROUTER_AGENT_KEY"]
OPENROUTER_INGESTION_KEY = os.environ.get("OPENROUTER_INGESTION_KEY", OPENROUTER_API_KEY)

# Create AI Connector for Agent Builder
connector_payload = {
    "name": "OpenRouter Agent Connector",
    "connector_type_id": ".gen-ai",
    "config": {
        "apiProvider": "Other",
        "apiUrl": "https://openrouter.ai/api/v1/chat/completions",
        "defaultModel": "openai/gpt-5.2",
        "enableNativeFunctionCalling": True
    },
    "secrets": {
        "apiKey": OPENROUTER_API_KEY
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/actions/connector",
    headers={
        "kbn-xsrf": "true",
        "Authorization": f"ApiKey {ELASTIC_API_KEY}",
        "Content-Type": "application/json"
    },
    json=connector_payload
)

connector = response.json()
print(f"Connector created: {connector['id']}")</code></pre><p>We use a reasoning-capable model, like <a href="https://openrouter.ai/openai/gpt-5.2">GPT-5.2</a>, for the agent since it needs to handle complex queries and tool orchestration.</p><h3><strong>Create the inference endpoint</strong></h3><p>The inference endpoint allows Elasticsearch to call LLMs during data processing:</p><pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=[ELASTIC_URL],
    api_key=ELASTIC_API_KEY,
    request_timeout=60  # Higher timeout for inference operations
)

# Create inference endpoint for ingestion
inference_config = {
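    # The "openai" service speaks the OpenAI-compatible API, so it can point at OpenRouter's endpoint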
    "service": "openai",
    "service_settings": {
        "model_id": "openai/gpt-4.1-mini",
        "api_key": OPENROUTER_INGESTION_KEY,
        "url": "https://openrouter.ai/api/v1/chat/completions"
    }
}

response = es.inference.put(
    inference_id="openrouter-inference-endpoint",
    task_type="completion",
    body=inference_config
)

print(f"Inference endpoint created: {response['inference_id']}")</code></pre><p>We use a fast, cheaper model, like <a href="https://openrouter.ai/openai/gpt-4.1-mini">GPT-4.1 Mini</a>, for bulk ingestion tasks that don&#x27;t require advanced reasoning capabilities.</p><h2>Data pipeline</h2><p>Let&#x27;s configure the <a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">ingest pipeline</a>. It will read from the product description field and extract structured categories that Agent Builder can use for filtering and aggregations.</p><p>For example, given this product description:</p><p>&quot;<em>Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.</em>&quot;</p><p>We can extract:</p><ul><li><strong>Category:</strong> Headphones</li><li><strong>Features:</strong> [&quot;wireless&quot;, &quot;noise_cancellation&quot;, &quot;long_battery&quot;]</li><li><strong>Use case:</strong> Travel</li></ul><p>The key is providing possible values to the LLM as an enum so it groups consistently. Otherwise, we might get variations like &quot;noise cancellation&quot;, &quot;ANC&quot;, and &quot;noise-canceling&quot; that are harder to aggregate.</p><pre><code># Define the extraction prompt
EXTRACTION_PROMPT = (
    "Extract audio product information from this description. "
    "Return raw JSON only, no markdown, no explanation. Fields: "
    "category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), "
    "features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), "
    "use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). "
    "Description: "
)

# Create the enrichment pipeline
pipeline_config = {
    "processors": [
        {
            "script": {
                "source": f"ctx.prompt = '{EXTRACTION_PROMPT}' + ctx.description"
            }
        },
        {
            "inference": {
                "model_id": "openrouter-inference-endpoint",
                "input_output": {
                    "input_field": "prompt",
                    "output_field": "ai_response"
                }
            }
        },
        {
            "json": {
                "field": "ai_response",
                "add_to_root": True  # Parses JSON and adds fields to document root
            }
        },
        {
            "remove": {
                "field": ["prompt", "ai_response"]
            }
        }
    ]
}

es.ingest.put_pipeline(
    id="product-enrichment-pipeline",
    body=pipeline_config
)

print("Pipeline created: product-enrichment-pipeline")</code></pre><p>After using OpenAI to extract a JSON with the new attributes, we use the <a href="https://www.elastic.co/search-labs/blog/elasticsearch-parse-json-field-ingest-pipeline">json processor</a> to spread them into new fields.</p><p>Now let&#x27;s index some sample audio products:</p><pre><code># Sample audio product data
products = [
    {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
    },
    {
        "name": "Portable Bluetooth Speaker",
        "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
        "price": 149.99
    },
    {
        "name": "Studio Condenser Microphone",
        "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
        "price": 199.99
    }
]

# Create index with mapping
es.indices.create(
    index="products-enriched",
    body={
        "mappings": {
            "properties": {
                "name": {"type": "text"},
                "description": {"type": "text"},
                "price": {"type": "float"},
                "category": {"type": "keyword"},
                "features": {"type": "keyword"},
                "use_case": {"type": "keyword"}
            }
        }
    },
    ignore=400  # Ignore if already exists
)

# Index products using the enrichment pipeline
for i, product in enumerate(products):
    es.index(
        index="products-enriched",
        id=i,
        body=product,
        pipeline="product-enrichment-pipeline"
    )
    print(f"Indexed: {product['name']}")

# Refresh to make documents searchable
es.indices.refresh(index="products-enriched")</code></pre><h2>Agent Builder</h2><p>Now we can create an Agent Builder agent to use this index and answer both text questions and analytical queries using the new fields we&#x27;ve created:</p><pre><code># Create Agent Builder agent
agent_payload = {
    "id": "audio-product-assistant",
    "name": "Audio Product Assistant",
    "description": "Answers questions about audio product catalog using semantic search and analytics",
    "labels": ["audio"],
    "avatar_color": "#BFDBFF",
    "avatar_symbol": "AU",
    "configuration": {
        "tools": [
            {
                "tool_ids": [
                    "platform.core.search",
                    "platform.core.list_indices",
                    "platform.core.get_index_mapping",
                    "platform.core.execute_esql"
                ]
            }
        ],
        "instructions": """You are an audio product assistant that helps users find and analyze audio equipment.

Use the products-enriched index for all queries. The extracted fields are:
- category: Headphones, Earbuds, Speakers, Microphones, or Accessories
- features: array of product features like wireless, noise_cancellation, long_battery
- use_case: Travel, Office, Home, Fitness, Gaming, or Studio

For analytical questions, use ESQL to aggregate data.
For product searches, use semantic search on the description field."""
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers={
        "kbn-xsrf": "true",
        "Authorization": f"ApiKey {ELASTIC_API_KEY}",
        "Content-Type": "application/json"
    },
    json=agent_payload
)

agent = response.json()
print(f"Agent created: {agent['id']}")</code></pre><p>For tools, we use search for semantic queries and Elasticsearch Query Language (ES|QL) for analytical queries:</p><p>Now you can chat with your agent and ask questions like:</p><ul><li>&quot;What headphones do we have for travel?&quot;</li><li>&quot;Show me products with noise cancellation under $200&quot;</li><li>&quot;What&#x27;s the average price by category?&quot;</li></ul><p>The agent uses the AI-enriched fields to provide better filtering and aggregations.</p><h2>Implementing OpenRouter Broadcast</h2><p>Now let&#x27;s set up inference monitoring. First, we need our OpenTelemetry endpoint URL. Navigate to the APM tutorial in Kibana:</p><pre><code>https://<your_kibana_url>/app/observabilityOnboarding/otel-apm/?category=application</code></pre><p>Collect the URL and authentication token from the OpenTelemetry tab:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0aa268bbee53da9059bbdd12b4ccb5ace0764b1d-1348x1292.jpg" style="width: 100%;" /><p><strong>Important:</strong> Your Kibana server needs to be reachable via the public internet so that it can receive data from OpenRouter.</p><p>In OpenRouter, go to <a href="https://openrouter.ai/settings/broadcast">Broadcast settings</a> and add a new destination for the &quot;OpenTelemetry Collector&quot;:<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/20fc11b904ced2f7c9b26da3a5f809ec2ad3e8c9-752x447.jpg" style="width: 100%;" /><p><strong>Important: </strong>Configure the endpoint with the <code>/v1/traces</code> path and authentication headers:</p><pre><code>Endpoint: https://xxxxx.ingest.us-east-2.aws.elastic-cloud.com:443/v1/traces

Headers: {"Authorization": "Bearer YOUR_APM_SECRET_TOKEN"}</code></pre><p>Press <strong>Test connection</strong>, and you should see a success message.</p><h2>Monitoring in Elastic</h2><p>After using OpenRouter models, you should start seeing documents in Kibana. The indexed documents are in the data stream <code>traces-generic.otel-default</code> with <code>service.name: &quot;openrouter&quot;</code> and include information about:</p><ul><li>Request and response details.</li><li>Token usage (prompt, completion, total).</li><li>Cost (in USD).</li><li>Latency (time to first token, total).</li><li>Model information.</li></ul><p>From now on, the activity of the inference pipeline and Agent Builder in relation to LLM usage will be recorded in OpenRouter and sent to Elastic.</p><h3><strong>Default APM dashboards</strong></h3><p>You can view the default dashboards in Kibana under Observability &gt; Applications &gt; Service Inventory &gt; openrouter:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5d79419b51de2c9b21c383e8f8326b81d47f2f0-1600x1039.png" style="width: 100%;" /><p>The service view shows:</p><ul><li><strong>Latency:</strong> Average response times across all calls.</li><li><strong>Throughput:</strong> Requests per minute.</li><li><strong>Failed transactions:</strong> Error rates.</li><li><strong>Transactions:</strong> Breakdown by operation type.</li></ul><h3><strong>Custom LLM monitoring dashboard</strong></h3><p>For more control over the information displayed, you can create a custom dashboard. We created one that differentiates ingestion from agent chat and measures relevant parameters, like token usage and cost, plus usage outside Elastic, like coding assistants via API key filters:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/753d942ea7721e2e24074e87d3e535a637f9e707-1600x1089.jpg" style="width: 100%;" /><p>The dashboard shows:</p><ul><li>Success rate by workflow type.</li><li>Token usage by model.</li><li>Cost breakdown by API key.</li><li>Latency trends over time.</li><li>Model comparison metrics.</li></ul><p>You can download the dashboard <a href="https://gist.github.com/llermaly/7ca1ae96613c3449291b4e5d16810ba1">here</a> and import it to your Kibana instance using <a href="https://www.elastic.co/docs/explore-analyze/find-and-organize/saved-objects#saved-objects-import-and-export">Saved Objects import</a>.</p><h2>Conclusion</h2><p>OpenRouter lets you move quickly and test multiple models and providers using the same API and billing account, making it convenient to compare models of different types—large parameter, small parameter, commercial, open source, and more.</p><p>Using OpenRouter Broadcast, we can effortlessly monitor the performance of these models during ingestion via ingest pipelines or chat via Agent Builder, plus combine it with OpenRouter usage for other purposes, like coding agents and apps.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Context engineering vs. prompt engineering]]></title>
            <description><![CDATA[Learn how context engineering and prompt engineering differ and why mastering both is essential for building production AI agents and RAG systems.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c91bde403096645057988e83cfe460d82fd0a6f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the early days of web development, web design was a single discipline encompassing everything from visual aesthetics to user workflows. As the industry matured, it split into two different fields: user interface (UI) and user experience (UX). They remain interconnected, but each requires distinct expertise and tools.</p><p>We&#x27;re seeing a similar split in AI. Since ChatGPT&#x27;s launch in November 2022, the industry has focused on improving prompts to optimize large language model (LLM) interactions. As we build more sophisticated AI systems, particularly agents and Model Context Protocol (MCP) tools that need access to external knowledge, two distinct disciplines are emerging: prompt engineering and context engineering. While they work together, they address fundamentally different challenges.</p><p>Prompt engineering focuses on how you communicate with the model. Context engineering focuses on what information the model has access to when it generates responses.</p><h2>What is prompt engineering?</h2><p>A <em>prompt </em>is an input to a generative AI model used to guide its output. Prompts can consist of text, images, sound, or other media.</p><p><em>Prompt engineering</em> is the iterative process of refining how you communicate with the model to achieve better results. It focuses on the words, structure, and techniques you use within a single interaction.</p><p>Here are some examples of prompt engineering techniques:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f38222691de873b792629949afd2a6e61323d573-1600x1192.png" style="width: 100%;" /><p>Common techniques include <em>few-shot prompting</em> (providing examples), <em>chain-of-thought reasoning</em> (asking the model to show its work), and <em>role assignment</em> (giving the model a persona). These techniques address challenges like ambiguity, where a question can be interpreted multiple ways and the model must guess which interpretation the user intended.</p><p>A key challenge in prompt engineering is finding what Anthropic calls &quot;the right altitude&quot; for instructions. At one extreme, engineers hardcode complex, brittle logic in their prompts to anticipate every scenario. This creates fragility and maintenance overhead. At the other extreme, engineers provide vague guidance that fails to give the model concrete signals or that falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior, yet flexible enough to let the model apply good judgment.</p><p>Prompt engineering typically operates at the single-turn level, optimizing how you phrase one query to get the best response. This works well for straightforward interactions but reaches its limits when tasks require external knowledge, persistent state, or multistep reasoning.</p><p>For a deeper exploration of prompt engineering techniques, refer to <a href="https://arxiv.org/pdf/2406.06608">The Prompt Report: A Systematic Survey of Prompting Techniques</a>.</p><h2>What is context engineering?</h2><p><em>Context engineering</em> is the broader discipline of curating and maintaining the optimal set of tokens during LLM inference. 
While prompt engineering asks, &quot;How should I phrase this?&quot; context engineering asks, &quot;What information does the model need access to right now?&quot;</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f42d4980623088cfcf0c26c9acce8577bb5660ed-1600x1326.png" style="width: 100%;" /><p>As the <a href="https://github.com/humanlayer/12-factor-agents">12-Factor Agents framework</a> explains, LLMs are stateless functions that turn inputs into outputs. At any given point, your input to an LLM is essentially, &quot;Here&#x27;s what&#x27;s happened so far. What&#x27;s the next step?&quot; Every interaction becomes context:</p><ul><li>The prompt and instructions you give to the model.</li><li>Documents or external data you retrieve through retrieval-augmented generation (RAG).</li><li>Past state, tool calls, or other history.</li><li>Instructions about structured data output formats.</li></ul><p>The distinction matters because modern AI agents don&#x27;t operate in single turns. An agent running in a loop generates an ever-expanding universe of information: tool outputs, retrieved documents, conversation history, intermediate reasoning. Context engineering is the practice of deciding what from that universe makes it into the model&#x27;s limited working memory at any given moment.</p><p>For a comprehensive exploration of context engineering components and best practices, refer to <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview#what-is-context-engineering?">What is context engineering?</a> and <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">You Know, for Context</a>.</p><h2>Key differences: Prompt engineering vs. context engineering</h2>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Dimension</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt engineering</td><td style="border: 1px solid #ccc; padding: 4px;">Context engineering</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Core question</td><td style="border: 1px solid #ccc; padding: 4px;">"How should I phrase this?"</td><td style="border: 1px solid #ccc; padding: 4px;">"What does the model need to know?"</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scope</td><td style="border: 1px solid #ccc; padding: 4px;">Single query</td><td style="border: 1px solid #ccc; padding: 4px;">System-wide information flow</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Failure mode</td><td style="border: 1px solid #ccc; padding: 4px;">Ambiguity: poorly phrased instructions lead to misinterpretation</td><td style="border: 1px solid #ccc; padding: 4px;">Retrieval problems: wrong documents, stale information, or context overflow</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tools</td><td style="border: 1px solid #ccc; padding: 4px;">Describes desired output</td><td style="border: 1px solid #ccc; padding: 4px;">Selects and sequences tools</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Debugging approach</td><td style="border: 1px solid #ccc; padding: 4px;">Linguistic precision: refine wording, add examples</td><td style="border: 1px solid #ccc; padding: 4px;">Data architecture: tune retrieval, prune context, sequence tools</td>
                    </tr>
              </tbody>
            </table>
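          <p>To make the split concrete, here is a minimal sketch of a single agent turn in Python. The names <code>SYSTEM_INSTRUCTIONS</code>, <code>history</code>, <code>retrieve</code>, and <code>call_llm</code> are hypothetical placeholders, not a real API: prompt engineering shapes the instruction text, while context engineering decides everything else that enters the window.</p><pre><code># Minimal sketch of one agent turn; all names below are hypothetical placeholders.
def next_step(user_message: str) -> str:
    context = [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},  # prompt engineering: the wording
        *history(),               # context engineering: which past turns and tool calls persist
        *retrieve(user_message),  # context engineering: which documents are worth the tokens
        {"role": "user", "content": user_message},
    ]
    # The model sees only what was assembled above: "Here is what has
    # happened so far. What is the next step?"
    return call_llm(context)</code></pre>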
          <h3><strong>Single-turn vs. multi-turn</strong></h3><p>Prompt engineering optimizes one interaction. Context engineering thinks in sequences: What did previous turns establish? What tool outputs carry forward? What should persist three steps from now? As tasks move from simple question-answering to multistep agent workflows, context engineering becomes the dominant challenge.</p><h3><strong>Context window management</strong></h3><p>Context engineering manages a finite resource with <a href="https://www.elastic.co/search-labs/blog/context-engineering-llm-evolution-agentic-ai">three failure modes</a>:</p><ul><li><strong>Too little information</strong> leads to hallucination or bad responses. When the LLM lacks enough context, it cannot establish the semantic grounding needed to generate accurate responses.</li><li><strong>Too much information </strong>causes context overflow. This overwhelms the LLM&#x27;s attention span and lowers relevance across the whole context window, causing the model to struggle to identify which parts matter most.</li><li><strong>Distracting or conflicting information</strong> confuses the model. Larger context windows increase the chances of conflicting or irrelevant information that distracts the LLM.</li></ul><p><strong>Key distinction: </strong>Prompt engineering takes the context window as given. Context engineering actively curates it.</p><h3><strong>Tool orchestration</strong></h3><p>Prompt engineering can request tool use and describe what tools should do. Context engineering decides which tools to make available, what information to pass them, and how their outputs flow back into context.</p><p>One of the most common failure modes is bloated tool sets with overlapping functionality. If a human engineer can&#x27;t definitively say which tool should be used in a given situation, an AI agent can&#x27;t be expected to do better. Context engineering applies a clear principle: Curate the minimal viable set of tools. Each tool should be self-contained, robust to error, and unambiguous in its purpose. Tools should also be token-efficient, returning only the information necessary rather than everything available.</p><h3><strong>Just-in-time context vs. pre-retrieval</strong></h3><p>Traditional RAG systems preprocess and retrieve all potentially relevant data up front, loading it into the prompt before inference. Context engineering increasingly favors just-in-time strategies, like <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic’s Agent Skills</a>, which agents discover and dynamically load into the context.</p><p>Rather than loading everything up front, agents maintain lightweight references (file paths, stored queries, document IDs) and dynamically load data at runtime using tools. This mirrors human cognition: We don&#x27;t memorize entire books but maintain systems like file folders and bookmarks to retrieve information on demand.</p><p>The trade-off is speed versus precision. Pre-retrieval is faster but risks context overflow. Just-in-time retrieval is slower but keeps the context window focused. 
The most effective agents often use a hybrid approach: retrieving essential baseline context up front while enabling further exploration as needed.</p><h2>Practical example: Book recommendation agent</h2><p>To demonstrate how prompt engineering and context engineering work together, we built a book recommendation agent using <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> with a <a href="https://www.kaggle.com/datasets/elvinrustam/books-dataset">dataset</a> of 103,063 books indexed in Elasticsearch.</p><p><strong>Setup:</strong></p><ul><li><strong>Index: </strong><code>books-dataset</code> with 103,063 documents</li><li><strong>Fields: </strong>Title, Authors, Description, Category, Publisher, Price, Published Date</li><li><strong>Tools: </strong>Agent Builder predetermined tools</li><li><strong>Model: </strong><a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a></li></ul><p><strong>Mappings:</strong></p><pre><code>{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "Authors": { "type": "text" },
      "Category": { "type": "text" },
      "Description": { "type": "text" },
      "Price Starting With ($)": { "type": "double" },
      "Published Date": { "type": "date", "format": "iso8601" },
      "Publisher": { "type": "text" },
      "Title": { "type": "text" }
    }
  }
 }</code></pre><p>We tested three scenarios to show different outcomes based on prompt quality and context management.</p><h3><strong>Scenario 1: Prompt engineering failure (ambiguity)</strong></h3><ul><li><strong>User prompt: </strong>“Recommend a good book”</li></ul><p>The agent searched for &quot;highly rated popular books&quot; and returned results about Labrador retrievers and a Paul Reiser comedy book, neither matching typical &quot;good book&quot; expectations.</p><ul><li><strong>Problem: </strong>The agent had to guess what “good” means without any filtering criteria. The LLM interpreted the request based on its assumption of what a “good” book is rather than on the user preferences.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11166e428139d0bd552c145980c893ba65f3c882-879x730.png" style="width: 100%;" /><h3><strong>Scenario 2: Context engineering failure (too much information)</strong></h3><ul><li><strong>User prompt: </strong>“Retrieve all books from the database&quot;</li></ul><p>Elasticsearch Query Language (ES|QL) query generated:</p><pre><code>FROM books-dataset 
| LIMIT 100</code></pre><ul><li><strong>Context retrieved:</strong> 100 random books across all categories (cooking, history, and fiction all mixed together)</li><li><strong>Problem: </strong>Too much unfiltered information. The agent brought in excessive context that made finding a relevant book difficult, and the answer is incomplete.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0b5fc4e5dd2fca32d6601190661b2383cd6ebb4d-937x1118.png" style="width: 100%;" /><h3><strong>Scenario 3: Both disciplines working together</strong></h3><ul><li><strong>User prompt: </strong>“I enjoy science and fantasy fiction like The Lord of the Rings or Foundation. Find books that match these preferences.”</li></ul><p>The agent executed targeted searches, retrieving relevant titles: <em>The Return of the King</em>, <em>Dune: House Corrino</em>, <em>Far Horizons</em> (a collection featuring <em>Foundation</em> and <em>Dune</em> universe stories).</p><ul><li><strong>Search query: </strong>&quot;science fiction and fantasy books similar to The Lord of the Rings or Foundation&quot;<br/></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/19d2415c3f00374317f1116049d03656fcbc0024-944x1118.png" style="width: 100%;" /><h3><strong>Agent reasoning</strong></h3><p>The agent demonstrated context engineering through smart tool usage and focused retrieval:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b74faabcfcfef4b995d056b029637f1851db5132-845x532.jpg" style="width: 100%;" /><p>The agent used <code>platform.core.search</code> to query the <code>books-dataset</code> index with a targeted search: “science fiction fantasy books similar to The Lord of the Rings Foundation”. From 103,063 documents, it retrieved only the most relevant matches.</p><h3><strong>Why it worked</strong></h3><ul><li><strong>Prompt engineering: </strong>Clear genre specification and concrete examples (<em>Lord of the Rings</em>, <em>Foundation</em>) eliminated ambiguity.</li><li><strong>Context engineering: </strong>Focused retrieval bringing only relevant books, maintaining a manageable context window despite the dataset having 103,063 entries.</li></ul><p>The agent used the same tools in all three cases, but input quality determined how effectively those tools retrieved relevant context.</p><h2>Conclusion</h2><p>Prompt engineering and context engineering are distinct but complementary disciplines. What started as a general practice of prompting is splitting into specialized fields requiring different expertise, much like the UI/UX split in web development.</p><p>For straightforward question-answering, prompt engineering skills may be sufficient. But as systems grow more complex, adding retrieval, tools, and multistep reasoning, context engineering becomes the dominant challenge. Teams building production AI systems need both skill sets, and increasingly, they need practitioners who understand how the two disciplines interact.</p><p>To dive deeper into context engineering strategies for AI agents, including hybrid retrieval, semantic chunking, and agentic search patterns, see <a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">The impact of relevance in context engineering for AI agents</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Faster ES|QL stats with Swiss-style hash tables]]></title>
            <description><![CDATA[How Swiss-inspired hashing and SIMD-friendly design deliver consistent, measurable speedups in Elasticsearch Query Language (ES|QL).]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-swiss-hash-stats</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-swiss-hash-stats</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Hegarty, Matthew Alp, Nik Everett]]></dc:creator>
            <pubDate>Mon, 19 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/21036e031070f14faccb2b53b22723de2750c391-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>We recently replaced key parts of Elasticsearch’s hash table implementation with a Swiss-style design and observed up to 2–3x faster build and iteration times on uniform, high-cardinality workloads. The result is lower latency, better throughput, and more predictable performance for Elasticsearch Query Language (ES|QL) stats and analytics operations.</p><h2>Why this matters</h2><p>Most typical analytical workflows eventually boil down to grouping data. Whether it’s computing average bytes per host, counting events per user, or aggregating metrics across dimensions, the core operation is the same — map keys to groups and update running aggregates.</p><p>At a small scale, almost any reasonable hash table works fine. At large scale (hundreds of millions of documents and millions of distinct groups), details start to matter. Load factors, probing strategy, memory layout, and cache behavior can make the difference between linear performance and a wall of cache misses.</p><p>Elasticsearch has supported these workloads for years, but we’re always looking for opportunities to modernize core algorithms. As such, we evaluated a newer approach inspired by Swiss tables and applied it to how ES|QL computes statistics.</p><h2>What are Swiss tables, really?</h2><p>Swiss tables are a family of modern hash tables popularized by Google’s SwissTable and later adopted in Abseil and other libraries.</p><p>Traditional hash tables spend a lot of time chasing pointers or loading keys just to discover that they don’t match. Swiss tables’ defining feature is a tiny cache-resident array, stored separately from the keys and values, called <em>control bytes</em>, that lets them reject most probes without touching key or value memory, dramatically reducing memory traffic.</p><p>Each control byte represents a single slot and, in our case, encodes two things: whether the slot is empty, and a short fingerprint derived from the hash. These control bytes are laid out contiguously in memory, typically in groups of 16, making them ideal for <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">single instruction, multiple data</a> (SIMD) processing.</p><p>Instead of probing one slot at a time, Swiss tables scan an entire control-byte block using vector instructions. In a single operation, the CPU compares the fingerprint of the incoming key against 16 slots and filters out empty entries. Only the few candidates that survive this fast path require loading and comparing the actual keys.</p><p>This design trades a small amount of extra metadata for much better cache locality and far fewer random loads. As the table grows and probe chains lengthen, those properties become increasingly valuable.</p><h2>SIMD at the center</h2><p>The real star of the show is SIMD.</p><p>Control bytes are not just compact; they’re also explicitly designed to be processed with vector instructions. A single SIMD compare can check 16 fingerprints at once, turning what would normally be a loop into a handful of wide operations. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bd418778f0c6144f8f5f18419f6220ac0c935c7a-903x407.png" style="width: 100%;" /><p>In practice, this means:</p><ul><li>Fewer branches.</li><li>Shorter probe chains.</li><li>Fewer loads from key and value memory.</li><li>Much better utilization of the CPU’s execution units.</li></ul><p>Most lookups never make it past the control-byte scan. When they do, the remaining work is focused and predictable. 
This is exactly the kind of workload that modern CPUs are good at.</p><h2>SIMD under the hood</h2><p>For readers who like to peek under the hood, here’s what happens when inserting a new key into the table. We use the Panama Vector API with 128-bit vectors, thus operating on 16 control bytes in parallel.</p><p>The following snippet shows the code generated on an Intel Rocket Lake with AVX-512. While the instructions reflect that environment, the design does not depend on AVX-512. The same high-level vector operations are emitted on other platforms using equivalent instructions (for example, AVX2, SSE, or NEON).</p><pre><code>; Load 16 control bytes from the control block
vmovdqu xmm0, XMMWORD PTR [r9+r10*1+0x10]

; Broadcast the 7-bit fingerprint of the new key across the vector
vpbroadcastb xmm1, r11d

; Compare all 16 control bytes to the new fingerprint
vpcmpeqb k7, xmm0, xmm1
kmovq rbx, k7

; Check if any matches were found
test rbx, rbx
jne <handle_match></code></pre><p>Each instruction has a clear role in the insertion process:</p><ul><li><code>vmovdqu</code>: Loads 16 consecutive control bytes into the 128-bit <code>xmm0</code> register.</li><li><code>vpbroadcastb</code>: Replicates the 7-bit fingerprint of the new key across all lanes of the <code>xmm1</code> register.</li><li><code>vpcmpeqb</code>: Compares each control byte against the broadcasted fingerprint, producing a mask of potential matches.</li><li><code>kmovq</code> + <code>test</code>: Moves the mask to a general-purpose register and quickly checks whether a match exists.</li></ul><p>Finally, we settled on probing groups of 16 control bytes at a time, as benchmarking showed that expanding to 32 or 64 bytes with wider registers provided no measurable performance benefit.</p><h2>Integration in ES|QL</h2><p>Adopting Swiss-style hashing in Elasticsearch was not a simple drop-in replacement. ES|QL has strong requirements around memory accounting, safety, and integration with the rest of the compute engine.</p><p>We integrated the new hash table tightly with Elasticsearch’s memory management, including the page recycler and circuit breaker accounting, ensuring that allocations remain visible and bounded. Elasticsearch&#x27;s aggregations are stored densely and indexed by a group ID, keeping the memory layout compact and fast for iteration, as well as enabling certain performance optimizations by allowing random access.</p><p>For variable-length byte keys, we cache the full hash alongside the group ID. This avoids recomputing expensive hash codes during probing and improves cache locality by keeping related metadata close together. During rehashing, we can rely on the cached hash and control bytes without inspecting the values themselves, keeping resizing costs low.</p><p>One important simplification in our implementation is that entries are never deleted. This removes the need for <em>tombstones</em> (markers to identify previously occupied slots) and allows empty slots to remain truly empty, which further improves probe behavior and keeps control-byte scans efficient.</p><p>The result is a design that fits naturally into Elasticsearch’s execution model while preserving the performance characteristics that make Swiss tables attractive.</p><h2>How does it perform?</h2><p>At small cardinalities, Swiss tables perform roughly on par with the existing implementation. This is expected: When tables are small, cache effects dominate less and there is little probing to optimize.</p><p>As cardinality increases, the picture changes quickly.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24900afc47ab07b0e9933f6117b99d0f4613f794-962x599.png" style="width: 100%;" /><p>The heatmap above plots time improvement factors for different key sizes (8, 32, 64, and 128 bytes) across cardinalities from 1,000 up to 10,000,000 groups. As cardinality grows, the improvement factor steadily increases, reaching up to 2–3x for uniform distributions.</p><p>This trend is exactly what the design predicts. 
Higher cardinality leads to longer probe chains in traditional hash tables, while Swiss-style probing continues to resolve most lookups inside SIMD-friendly control-byte blocks.</p><h2>Cache behavior tells the story</h2><p>To better understand the speedups, we ran the same JMH <a href="https://github.com/elastic/elasticsearch/pull/139343/files#diff-d0e0cc91a7495bf36b2d44eacce95f5185d01879e5f6c38089ac7a89aad17da7"><code>benchmarks</code></a> under Linux <code>perf</code> and captured cache and TLB statistics.</p><p>Compared to the original implementation, the Swiss version issues about 60% fewer cache references overall. Last-level cache loads drop by more than 4x, and LLC load misses fall by over 6x. Since LLC misses often translate directly into main-memory accesses, this reduction alone explains a large portion of the end-to-end improvement.</p><p>Closer to the CPU, we see fewer L1 data cache misses and nearly 6x fewer data TLB misses, pointing to tighter spatial locality and more predictable memory access patterns.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6e49b7609fba83e33692cb9834552b6ca7e42a83-998x499.png" style="width: 100%;" /><p>This is the practical payoff of SIMD-friendly control bytes. Instead of repeatedly loading keys and values from scattered memory locations, most probes are resolved by scanning a compact, cache-resident structure. Less memory touched means fewer misses, and fewer misses mean faster queries.</p><h2>Wrapping up</h2><p>By adopting a Swiss-style hash table design and leaning hard into SIMD-friendly probing, we achieved 2–3x speedups for high-cardinality ES|QL stats workloads, along with more stable and predictable performance.</p><p>This work highlights how modern CPU-aware data structures can unlock substantial gains, even for well-trodden problems like hash tables. There is more room to explore here, such as additional primitive-type specializations and applying the design to other high-cardinality paths like joins, all part of the broader, ongoing effort to modernize Elasticsearch internals.</p><p>If you’re interested in the details or want to follow the work, check out this <a href="https://github.com/elastic/elasticsearch/pull/139343">pull request</a> and <a href="https://github.com/elastic/elasticsearch/issues/138799">meta issue</a> tracking progress on GitHub.</p><p>Happy hashing!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Managing agentic memory with Elasticsearch]]></title>
            <description><![CDATA[Creating more context-aware and efficient agents by managing memories using Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Someshwaran Mohankumar]]></dc:creator>
            <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80907fd072e72d6ec902470b449c9f337957a0d7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the emerging discipline of <strong>context engineering</strong>, giving AI agents the right information at the right time is crucial. One of the most important aspects of context engineering is managing an AI’s <strong>memory</strong>. Much like humans, AI systems rely on both a short-term memory and a long-term memory to recall information. If we want large language model (LLM) agents to carry on logical conversations, remember user preferences, or build on previous results or responses, we need to equip them with effective memory mechanisms.</p><p>After all, everything in the context influences the AI’s responses. <em>Garbage in, garbage out</em> holds true.</p><p>In this article, we’ll introduce what short-term and long-term memory mean for AI agents, specifically:</p><ul><li>The difference between short- and long-term memory.</li><li>How they relate to retrieval-augmented generation (RAG) techniques with vector databases, like Elasticsearch, and why careful memory management is necessary.</li><li>The risks of neglecting memory, including context overflow and context poisoning.</li><li>Best practices, like context pruning, summarizing, and retrieving only what’s relevant, to keep an agent’s memory both useful and safe.</li><li>Finally, we’ll touch on how memory can be shared and propagated in multi-agent systems to enable agents to collaborate without confusion using Elasticsearch.</li></ul><h2>Short-term versus long-term memory in AI agents</h2><p><strong><em>Short-term memory</em></strong> in an AI agent typically refers to the immediate conversational context or state—essentially, the current chat history or recent messages in the active session. This includes the user’s latest query and recent back-and-forth exchanges. It’s very similar to the information a person holds in mind during an ongoing conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4fbcc6f68055b2bccefc4176297a4ca50056dc0d-764x498.png" style="width: 100%;" /><p>AI frameworks often maintain this transient memory as part of the agent’s state (for example, using a checkpointer to store the conversation state as covered by <a href="https://docs.langchain.com/oss/python/langgraph/persistence#checkpoints">this example from LangGraph</a>). Short-term memory is <strong><em>session-scoped</em></strong>; that is, it exists within a single conversation or task and is reset or cleared when that session ends, unless explicitly saved elsewhere. An example of session-bound short-term memory would be the <strong><a href="https://help.openai.com/en/articles/8914046-temporary-chat-faq">temporary chat</a></strong> available in ChatGPT.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/150bdf209cda5ed20b59cddf34e624ad1a8016aa-1100x577.png" style="width: 100%;" /><p><strong><em>Long-term memory</em></strong>, on the other hand, refers to information that persists <strong>across conversations or sessions</strong>. This is the knowledge an agent retains over time: facts it learned earlier, user preferences, or any data we’ve told it to remember permanently.</p><p>Long-term memory is usually implemented by storing and fetching it from an external source, such as a file or vector database that’s outside the immediate context window. Unlike short-term chat history, long-term memory isn’t automatically included in every prompt. 
Instead, depending on the scenario, the agent must <strong>recall</strong> or retrieve it by invoking the relevant tools. In practice, long-term memory might include a user’s profile info, prior answers or analyses the agent produced, or a knowledge base the agent can query.</p><p>For instance, if you have a travel-planner agent, the <em>short-term memory</em> would contain details of the current trip inquiry (dates, destination, budget) and any follow-up questions in that chat; whereas the <em>long-term memory</em> could store the user’s general travel preferences, past itineraries, and other facts shared in previous sessions. When the user returns later, the agent can pull from this long-term store (for example, the user loves beaches and mountains, has an average budget of INR 100,000, has a bucket list of places to visit, and prefers to experience history and culture rather than kid-friendly attractions) so that it doesn’t treat the user as a blank slate each time.</p><p>The short-term memory (chat history) provides immediate context and continuity, while long-term memory provides a broader context that the agent can draw upon when needed. Most advanced AI agent frameworks enable both: They keep track of recent dialogue to maintain context <em>and</em> offer mechanisms to look up or store information in a longer-term repository. Managing short-term memory ensures it stays within the context window, while managing long-term memory helps the agent ground its answers in prior interactions and personas.</p><h2>Memory and RAG in context engineering</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/46635aa11ceff89b8d6a26ac3e22da52407d82f3-1600x900.png" style="width: 100%;" /><p><strong><em>How do we give an AI agent a useful long-term memory in practice?</em></strong></p><p>One prominent approach for long-term memory is <strong><em>semantic memory</em></strong>, often implemented via <strong>retrieval-augmented generation (RAG)</strong>. This involves coupling the LLM with an external knowledge store or vector-enabled datastore, like Elasticsearch. When the LLM needs information beyond what’s in the prompt or its built-in training, it performs semantic retrieval against Elasticsearch and injects the most relevant results into the prompt as context. This way, the model’s effective context includes not only the recent conversation (short-term memory) but also pertinent long-term facts fetched on the fly. The LLM then grounds its answer on both its own reasoning and the retrieved information, effectively combining short-term memory and long-term memory to produce a more accurate, context-aware response.</p><p><strong>Elasticsearch </strong>can be used to implement long-term memory for AI agents. Here’s a high-level example of how context can be retrieved from Elasticsearch for long-term memory.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41ccbc7b5171e8170ac300139a963c0708816ba6-1600x900.png" style="width: 100%;" /><p>This way, the agent “remembers” by searching for relevant data rather than by storing everything in its limited prompt, <strong>which introduces a different set of risks.</strong></p><p><strong>Using RAG with Elasticsearch or any vector store offers multiple benefits:</strong></p><p>First, it <strong>extends the knowledge</strong> of the model beyond its training cutoff. The agent can retrieve up-to-date information or domain-specific data that the LLM might not know. 
This is crucial for questions about recent events or specialized topics.</p><p>Second, retrieving context on demand helps reduce hallucinations, especially since LLMs aren’t trained on the proprietary or highly specialized data behind your niche use case, which makes hallucinations more likely. Instead of the LLM guessing or inventing new information, as it has been incentivized to do through evaluation, as highlighted in a recent OpenAI paper (<a href="https://arxiv.org/pdf/2509.04664">Why Language Models Hallucinate</a>), the model can be grounded by factual references from Elasticsearch. Naturally, preventing misinformation depends on the reliability of the data in the vector store and on retrieving the right data through sound relevance measures.</p><p>Third, RAG allows an agent to work with knowledge bases far larger than anything you could ever fit into a prompt. Instead of pushing entire documents, like long research papers or policy documents, into the context window and risking overload or <a href="https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch#context-poisoning">context poisoning</a> that derails the model’s reasoning with irrelevant information, RAG relies on <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">chunking</a>. Large documents are broken into smaller, semantically meaningful pieces, and the system retrieves only the few chunks most relevant to the query. This way, the model doesn’t need a million-token context to appear knowledgeable; it just needs access to the right chunks of a much larger corpus.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6897356c9f0940e35a63d005e9cd20bc33e5dd7-1600x931.png" style="width: 100%;" /><p>It’s worth noting that as LLM context windows have grown (<a href="https://www.anthropic.com/news/1m-context">some models now support hundreds of thousands or even millions of tokens</a>), a debate arose about whether RAG is “dead.” Why not push all the data into the prompt? If you feel likewise, refer to this wonderful article by my colleagues, Jeffrey Rengifo and Eduard Martin, <a href="https://www.elastic.co/search-labs/blog/rag-vs-long-context-model-llm">Longer context ≠ better: Why RAG still matters</a>. Targeted retrieval avoids the “garbage in, garbage out” problem: The LLM stays focused on the few chunks that matter, rather than running through noise.</p><p>In any case, integrating Elasticsearch or any vector store into an AI agent architecture provides <strong>long-term memory</strong>. The agent stores knowledge externally and pulls it in as memory context when needed. This could be implemented as an <em>architecture</em>, where after each user query, the agent performs a search on Elasticsearch for relevant info and then appends the top results to the prompt before calling the LLM. The response might also be saved back into the long-term store if it contains useful new information (creating a feedback loop of learning). By using such retrieval-based memory, the agent remains informed and up to date, without having to cram everything it knows into every prompt, even if the context window supports <em>one million tokens</em>. This technique is a cornerstone of context engineering, combining the strengths of information retrieval and generative AI. 
(Refer to our <a href="https://github.com/someshwaranM/elastic-context-engineering-short-term-long-term-memory">supporting context engineering app</a>.)</p><pre><code># Initialize chat memory (Note: This is in-memory only, not persistent)
memory = MemorySaver()

# Create a LangGraph agent
langgraph_agent = create_react_agent(model=llm, tools=tools, checkpointer=memory)

...
...
# Only process and display checkpoints if verbose mode is enabled
if args.verbose:
    # List all checkpoints that match a given configuration
    checkpoints = memory.list({"configurable": {"thread_id": "1"}})
    # Process the checkpoints
    process_checkpoints(checkpoints)</code></pre>
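<p>For context, the checkpointer keys every checkpoint by a thread ID supplied at invocation time. A call shaped roughly like the following (a hypothetical snippet, not taken verbatim from the app) is what produces the checkpoints listed next:</p><pre><code># Hypothetical invocation: the thread_id ties all turns of one session together
config = {"configurable": {"thread_id": "1"}}
response = langgraph_agent.invoke(
    {"messages": [("user", "Hi I'm Som, how are you?")]},
    config=config,
)</code></pre>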
<p>Here’s how it stores <strong>checkpoints</strong>:</p><pre><code>Checkpoint:
Timestamp: 2025-12-30T09:19:41.691087+00:00
Checkpoint ID: 1f0e560a-c2fa-69ec-8001-14ee5373f9cf
User: Hi I'm Som, how are you? (Message ID: ad0a8415-5392-4a58-85ad-84154875bbf2)
Agent: Hi Som! I'm doing well, thank you! How about you? (Message ID: 
56d31efb-14e3-4148-806e-24a839799ece)
Agent:  (Message ID: lc_run--019b6e8e-553f-7b52-8796-a8b1fbb206a4-0)

Checkpoint:
Timestamp: 2025-12-30T09:19:40.350507+00:00
Checkpoint ID: 1f0e560a-b631-6a08-8000-7796d108109a
User: Hi I'm Som, how are you? (Message ID: ad0a8415-5392-4a58-85ad-84154875bbf2)
Agent: Hi Som! I'm doing well, thank you! How about you? (Message ID: 
56d31efb-14e3-4148-806e-24a839799ece)

Checkpoint:
Timestamp: 2025-12-30T09:19:40.349027+00:00
Checkpoint ID: 1f0e560a-b62e-6010-bfff-cbebe1d865f6</code></pre><p>For long-term memory, here&#x27;s how we perform semantic search on Elasticsearch to retrieve relevant previous conversations using vector embeddings, after summarizing and indexing the checkpoints into Elasticsearch.</p><pre><code># Enhanced Elasticsearch retrieval with rank_window and verbose display
def retrieve_from_elasticsearch(query: str, k: int = 5, rank_window: Optional[int] = None) -> tuple[List[Dict[str, Any]], str]:
    """
    Retrieve context from Elasticsearch with score-based ranking
    
    Args:
        query: Search query
        k: Number of results to return
        rank_window: Number of candidates to retrieve before ranking (default: args.rank_window)
        
    Returns:
        Tuple of (retrieved_documents, formatted_context_string)
    """
    if not es_client or not es_index_name:
        return [], "Elasticsearch is not available. Cannot search long-term memory."
    
    if rank_window is None:
        rank_window = args.rank_window
    
    try:
        # Check if index exists and has documents
        if not es_client.indices.exists(index=es_index_name):
            return [], "No previous conversations stored in long-term memory yet."
        
        # Get document count
        try:
            doc_count = es_client.count(index=es_index_name)["count"]
            if doc_count == 0:
                return [], "Long-term memory is empty. No previous conversations to search."
        except Exception as e:
            return [], f"Error checking memory: {str(e)}"
        
        # Generate embedding for the query
        try:
            query_embedding = embeddings.embed_query(query)
        except Exception as e:
            return [], f"Error generating embedding: {str(e)}"
        
        # Perform semantic search using kNN with rank_window
        try:
            search_body = {
                "knn": {
                    "field": "vector",
                    "query_vector": query_embedding,
                    "k": k,
                    "num_candidates": rank_window  # Retrieve more candidates, then rank top k
                },
                "_source": ["text", "content", "message_type", "timestamp", "thread_id"],
                "size": k
            }
            
            response = es_client.search(index=es_index_name, body=search_body)
            
            if not response.get("hits") or len(response["hits"]["hits"]) == 0:
                return [], "No relevant previous conversations found in long-term memory."
            
            # Extract documents with scores
            retrieved_docs = []
            for hit in response["hits"]["hits"]:
                source = hit["_source"]
                score = hit["_score"]
                retrieved_docs.append({
                    "content": source.get("content", source.get("text", "")),
                    "message_type": source.get("message_type", "unknown"),
                    "timestamp": source.get("timestamp", "unknown"),
                    "thread_id": source.get("thread_id", "unknown"),
                    "score": score
                })
            
            # Format context string
            context_parts = []
            for i, doc in enumerate(retrieved_docs, 1):
                context_parts.append(doc["content"])
            
            context_string = "\n\n".join(context_parts)
            
            # Verbose display
            if args.verbose:
                rich.print(f"\n[bold yellow]🔍 RETRIEVAL ANALYSIS[/bold yellow]")
                rich.print("="*80)
                rich.print(f"[blue]Query:[/blue] {query}")
                rich.print(f"[blue]Retrieved:[/blue] {len(retrieved_docs)} documents (from {rank_window} candidates)")
                rich.print(f"[blue]Total context length:[/blue] {len(context_string)} characters\n")
                
                for i, doc in enumerate(retrieved_docs, 1):
                    rich.print(f"[cyan]📄 Document {i} | Score: {doc['score']:.4f} | Type: {doc['message_type']}[/cyan]")
                    rich.print(f"[cyan]   Timestamp: {doc['timestamp']} | Thread: {doc['thread_id']}[/cyan]")
                    content_preview = doc['content'][:200] + "..." if len(doc['content']) > 200 else doc['content']
                    rich.print(f"[cyan]   Content: {content_preview}[/cyan]")
                    rich.print("-" * 80)
            
            return retrieved_docs, context_string
            
        except Exception as e:
            return [], f"Error searching memory: {str(e)}"
            
    except Exception as e:
        return [], f"Error accessing long-term memory: {str(e)}"</code></pre><p>Now that we’ve explored how short-term memory and long-term memory are indexed and fetched using LangGraph’s checkpoints in Elasticsearch, let’s take some time to understand why indexing and dumping the complete conversations can be risky.</p><h2>Risks of not managing context memory</h2><p>Since we’re talking so much about context engineering, along with short-term and long-term memory, let’s understand what happens if we don’t manage an agent’s memory and context well.</p><p>Unfortunately, many things can go wrong when an AI’s context grows extremely long or contains bad information. As context windows get larger, <strong>new failure modes</strong> emerge, like:</p><ul><li><strong>Context poisoning</strong></li><li><strong>Context distraction</strong></li><li><strong>Context confusion</strong></li><li><strong>Context clash</strong></li><li><strong>Context leakage and knowledge conflicts</strong></li><li><strong>Hallucinations and misinformation</strong></li></ul><p>Let’s break down these issues and other risks that arise from poor context management:</p><h3>Context poisoning</h3><p><em>Context poisoning</em> refers to when incorrect or harmful information ends up in the context and “poisons” the model’s subsequent outputs. A common example is a hallucination by the model that gets treated as fact and inserted into the conversation history. The model might then build on that error in later responses, compounding the mistake. In iterative agent loops, once false information makes it into the shared context (for example, in a summary of the agent’s working notes), it can be reinforced over and over. </p><p><a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">Researchers at DeepMind, in the release of the Gemini 2.5 report</a> (TL;DR, check <a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html">here</a>), observed this in a long-running <em>Pokémon</em>-playing agent: If the agent hallucinated a wrong game state and that got recorded into its <em>context </em>(its memory of goals), the agent would form <strong>nonsensical strategies</strong> around an impossible goal and get stuck. In other words, a poisoned memory can send the agent down the wrong path indefinitely.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3f2facf5aad67613ad557422e09ec23a66adc0ed-1600x1388.png" style="width: 100%;" /><p>Context poisoning can happen innocently (by mistake) or even maliciously, for instance, via prompt injection attacks where a user or third party sneaks in a hidden instruction or false fact that the agent then remembers and follows.</p><p><strong>Recommended countermeasures:</strong></p><p>Based on insights from <a href="https://www.wiz.io/academy/data-poisoning">Wiz</a>, <a href="https://zerlo.net/en/blog/what-is-llm-data-poisoning">Zerlo</a>, and <a href="https://www.anthropic.com/research/small-samples-poison">Anthropic</a>, countermeasures for context poisoning focus on preventing bad or misleading information from entering an LLM’s prompt, context window, or retrieval pipeline. 
Key steps include:</p><ul><li>Check the context constantly: Monitor the conversation or retrieved text for anything suspicious or harmful, not just the starting prompt.</li><li>Use trusted sources: Score or label documents based on credibility so the system prefers reliable information and ignores low-scored data.</li><li>Spot unusual data: Use tools that detect odd, out-of-place, or manipulated content, and remove it before the model uses it.</li><li>Filter inputs and outputs: Add guardrails so harmful or misleading text can’t easily enter the system or be repeated by the model.</li><li>Keep the model updated with clean data: Regularly refresh the system with verified information to counter any bad data that slipped through.</li><li>Human-in-the-loop: Have people review important outputs or compare them against known, trustworthy sources.</li></ul><p>Simple user habits also help: resetting long chats, sharing only relevant information, breaking complex tasks into smaller steps, and maintaining clean notes outside the model.</p><p>Together, these measures create a layered defense that protects LLMs from context poisoning and keeps outputs accurate and trustworthy.</p><p>Without countermeasures like these, an agent might remember instructions like <em>ignore previous guidelines</em> or trivial facts that an attacker inserted, leading to harmful outputs.</p><h3>Context distraction</h3><p><em>Context distraction</em> is when a context grows so long that the model overfocuses on the context, neglecting what it learned during training. In extreme cases, this resembles <a href="https://en.wikipedia.org/wiki/Catastrophic_interference"><em>catastrophic forgetting</em></a>; that is, the model effectively “forgets” its underlying knowledge and becomes overly attached to the information placed in front of it. Previous studies have shown that LLMs often lose focus when the prompt is extremely long.</p><p>The Gemini 2.5 agent, for example, supported a million-token window, but once its context grew beyond a certain point (on the order of 100,000 tokens in an experiment), it began to <strong>fixate on repeating its past actions</strong> instead of coming up with new solutions. In a sense, the agent became a prisoner of its extensive history. It kept looking at its long log of previous moves (the context) and mimicking them, rather than using its underlying training knowledge to devise fresh and novel strategies.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e08e5b6d2e8ec7e3511d455985eed3d7fa6241e0-1352x636.png" style="width: 100%;" /><p>This is counterproductive. We want the model to use relevant context to help reasoning, not override its ability to think. Notably, even models with huge windows exhibit this <a href="https://research.trychroma.com/context-rot"><em>context rot</em></a>: Their performance degrades nonuniformly as more tokens are added. 
There appears to be an <em>attention budget</em>. Like humans with limited working memory, an LLM has a finite capacity to attend to tokens, and as that budget is stretched, its precision and focus drop.</p><p>As a mitigation, you can prevent context distraction through chunking, engineering in only the right information, regular context summarization, and evaluation and monitoring techniques that score the accuracy of responses.</p><p>These methods keep the model grounded in both relevant context and its underlying training, reducing the risk of distraction and improving overall reasoning quality.</p><h3>Context confusion</h3><p><em>Context confusion</em> is when superfluous content in the context is used by the model to generate a low-quality response. A prime example is giving an agent a large set of tools or API definitions that it might use. If many of those tools are unrelated to the current task, the model may still try to use them inappropriately, simply because they’re present in context. Experiments have found that providing <em>more</em> tools or documents can <em>hurt</em> performance if they’re not all needed. The agent starts making mistakes, like calling the wrong function or referencing irrelevant text. </p><p>In one case, a small <strong>Llama 3.1 8B</strong> model failed a task when given 46 tools to consider but succeeded when given only 19 tools. The extra tools created confusion, even though the context was within length limits. The underlying issue is that any information in the prompt will be <em>attended to</em> by the model. If it doesn’t know to ignore something, that something could influence its output in undesired ways. Irrelevant bits can “steal” some of the model’s attention and lead it astray (for instance, an irrelevant document might cause the agent to answer a different question than asked). Context confusion often manifests as the model producing a low-quality response that integrates unrelated context. Refer to the research paper: <a href="https://arxiv.org/pdf/2411.15399">Less is More: Optimizing Function Calling for LLM Execution on Edge Devices.</a></p><p>It reminds us that more context isn’t always better, especially if it’s not <strong>curated</strong> for relevance.</p><h3>Context clash</h3><p><em>Context clash</em> occurs when <strong>parts of the context contradict each other</strong>, causing internal inconsistencies that derail the model’s reasoning. A clash can happen if the agent accumulates multiple pieces of information that are in conflict. </p><p>For example, imagine an agent that fetched data from two sources: One says <em>Flight A departs at 5 PM</em>, and the other says <em>Flight A departs at 6 PM</em>. If both facts end up in the context, the poor model has no way to know which is correct; it may get confused or produce an incorrect or inconsistent answer.</p><p>Context clash also frequently occurs in multiturn conversations where the model’s <strong>earlier attempts</strong> at answering are still lingering in the context along with later refined information.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/500d7a80dc8db1923f9b5ca84728eed64fa296f7-1316x580.png" style="width: 100%;" /><p>A <a href="https://arxiv.org/pdf/2505.06120">research study</a> by Microsoft and Salesforce shows that if you break a complex query into multiple chatbot turns (adding details gradually), the final accuracy drops significantly, compared to giving all details in a single prompt. Why? 
Because the early turns contain partial or incorrect intermediate answers from the model, and those remain in the context. When the model later tries to answer with all info, its <em>memory</em> still includes those wrong attempts, which conflict with the corrected info and lead it off track. Essentially, the conversation’s context clashes with itself. The model may inadvertently use an outdated piece of context (from an earlier turn) that doesn’t apply after new info is added.</p><p>In agent systems, context clash is especially dangerous because an agent might combine outputs from different tools or subagents. If those outputs disagree, the aggregated context is inconsistent. The agent could then get stuck or produce nonsensical results trying to reconcile the contradictions. Preventing context clash involves ensuring the context is <strong>fresh and consistent</strong>, for instance, clearing or updating any outdated info and not mixing sources that haven’t been vetted for consistency.</p><h3>Context leakage and knowledge conflicts</h3><p>In systems where multiple agents or users share a memory store, there’s a risk of information bleeding over between contexts.</p><p>For example, if two separate users’ data embeddings reside in the same vector database without proper access control, an agent answering User A’s query might accidentally retrieve some of User B’s memory. This <strong><em>cross-context leak</em></strong> can expose private information or just create confusion in responses.</p><p>According to the <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/">OWASP Top 10 for LLM Applications</a>, multitenant vector databases must guard against such leakage:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f0ccf0b2f7bd6715c14aceee2deffb213d50bd9-1600x936.png" style="width: 100%;" /><p>According to <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/">LLM08:2025 Vector and Embedding Weaknesses</a>, one of the common risks is context leakage:</p><blockquote><em>In multi-tenant environments where multiple classes of users or applications share the same vector database, there&#x27;s a risk of context leakage between users or queries. Data federation knowledge conflict errors can occur when data from multiple sources contradict each other. This can also happen when an LLM can’t supersede old knowledge that it has learned while training, with the new data from Retrieval Augmentation.</em></blockquote><p>Another aspect is that an LLM might have trouble overriding its <strong>built-in knowledge</strong> with new info from memory. If the model was trained on some fact and the retrieved context says the opposite, the model can get confused about which to trust. Without proper design, the agent could mix up contexts or fail to update old knowledge with new evidence, leading to stale or incorrect answers.</p><h3><strong>Hallucinations and misinformation</strong></h3><p>While <em>hallucination </em>(the LLM making up plausible-sounding but false information) is a known problem even without long contexts, poor memory management can amplify it. </p><p>If the agent’s memory lacks a crucial fact, the model may just <strong>fill in the gap with a guess</strong>, and if that guess then enters the context (poisoning it), the error persists. 
</p><p>The OWASP LLM security report <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/"><strong>(LLM09:2025 Misinformation)</strong></a> highlights misinformation as a core vulnerability: LLMs can produce confident but fabricated answers, and users may overtrust them. An agent with a bad or outdated long-term memory might confidently cite something that was true last year but is false now, unless its memory is kept up to date. </p><p>Overreliance on the AI’s output (by either the user or the agent itself in a loop) can make this worse. If no one ever checks the info in memory, the agent can accumulate falsehoods. This is why RAG is often used to reduce hallucinations: By retrieving an authoritative source, the model doesn’t have to invent facts. But if your retrieval pulls in the wrong document (say, one that contains misinformation) or if an early hallucination isn’t pruned, the system may propagate that misinformation throughout its actions. </p><p>The bottom line: Failing to manage memory can lead to <strong>incorrect and misleading outputs</strong>, which can be damaging, especially if the stakes are high (for example, bad advice in a finance or medical domain). An agent needs mechanisms to verify or correct its memory content, not just unconditionally trust whatever is in the context.</p><p>In summary, giving an AI agent an infinitely long memory or dumping every possible thing into its context is <em>not</em> a recipe for success.</p><h2>Best practices for memory management in LLM applications</h2><p>To avoid the pitfalls above, developers and researchers devised a number of <strong>best practices for managing context and memory</strong> in AI systems. These practices aim to keep the AI’s working context lean, relevant, and up to date. Here are some of the key strategies, along with examples of how they help.</p><h3>RAG: Use targeted context</h3><p>Much of RAG has already been covered in the earlier section, so this serves as a concise set of practical reminders:</p><ul><li>Use targeted retrieval, not bulk loading: Retrieve only the most relevant chunks instead of pushing entire documents or full conversation histories into the prompt.</li><li>Treat RAG as just-in-time memory recall: Fetch context only when it’s needed, rather than carrying everything forward across turns.</li><li>Prefer relevance-aware retrieval strategies: Approaches like top-k semantic search, Reciprocal Rank Fusion, or tool loadout filtering help reduce noise and improve grounding.</li><li>Larger context windows don’t remove the need for RAG: Two highly relevant paragraphs are almost always more effective than 20 loosely related pages.</li></ul><p>That said, RAG isn’t about adding more context; it’s about adding the right context.</p><h3>Tool loadout</h3><p><em>Tool loadout</em> is about giving a model only the tools it actually needs for a task. The term comes from gaming: You pick a loadout that fits the situation. Too many tools slow you down; the wrong ones cause failure. LLMs behave the same way, according to the research paper <a href="https://arxiv.org/abs/2411.15399">Less is more</a>. Once you pass ~30 tools, descriptions start overlapping and the model gets confused. Past ~100 tools, failure is almost guaranteed. This isn’t a context window problem; it’s context confusion.</p><p>A simple and effective fix is <a href="https://arxiv.org/abs/2505.03275"><strong>RAG-MCP</strong></a>. 
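</p><p>Before unpacking it, here is a rough sketch of the pattern. This is our illustration rather than the paper’s reference code; the index name, field names, and the reuse of the <code>es_client</code> and <code>embeddings</code> objects from the earlier snippets are all assumptions:</p><pre><code># Sketch: RAG-MCP-style tool loadout. Index/field names are hypothetical;
# es_client and embeddings are assumed to be configured as shown earlier.
def select_tools(query: str, k: int = 5) -> list:
    """Return only the k tool descriptions most relevant to this query."""
    search_body = {
        "knn": {
            "field": "description_vector",   # embedding of each tool description
            "query_vector": embeddings.embed_query(query),
            "k": k,
            "num_candidates": 50,            # candidates considered before top-k
        },
        "_source": ["name", "description", "input_schema"],
        "size": k,
    }
    response = es_client.search(index="tool-descriptions", body=search_body)
    return [hit["_source"] for hit in response["hits"]["hits"]]

# Pass only this small, focused loadout to the agent instead of every tool.
relevant_tools = select_tools("What's the weather in Chennai tomorrow?")</code></pre><p>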
Instead of dumping every tool into the prompt, tool descriptions are stored in a vector database and only the most relevant ones are retrieved per request. In practice, this keeps the loadout small and focused, dramatically shortens prompts, and can improve tool selection accuracy by up to 3x.</p><p>Smaller models hit this wall even sooner. The research shows an 8B model failing with dozens of tools but succeeding once the loadout is trimmed. Dynamically selecting tools, sometimes by having an LLM first reason about what it thinks it needs, can boost performance by 44%, while also reducing power usage and latency. The takeaway is that most agents only need a few tools, but as your system grows, tool loadout and RAG-MCP become first-order design decisions.</p><h3>Context pruning: Limit the chat history length</h3><p>If a conversation goes on for many turns, the accumulated chat history can become too large to fit, leading to context overflow or becoming too distracting to the model. </p><p><em>Trimming</em> means programmatically removing or shortening less important parts of the dialogue as it grows. One simple form is to drop the oldest turns of the conversation when you hit a certain limit, keeping only the latest <em>N</em> messages. More sophisticated pruning might remove irrelevant digressions or previous instructions that are no longer needed. The goal is to <strong>keep the context window uncluttered</strong> by old news. </p><p>For example, if the agent solved a subproblem 10 turns ago and we have since moved on, we might delete that portion of the history from the context (assuming it won’t be needed further). Many chat-based implementations do this: They maintain a rolling window of recent messages. </p><p>Trimming can be as simple as “forgetting” the earliest parts of a conversation once they’ve been summarized or are deemed irrelevant. By doing so, we reduce the risk of context overflow errors and also reduce <a href="https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch#context-distraction"><strong>context distraction</strong></a>, so the model won’t see and get sidetracked by old or off-topic content. This approach is very similar to how humans might not remember every word from an hour-long talk but will retain the highlights. </p><p>If context pruning still feels abstract, as the author Drew Breunig highlights <a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html#tool-loadout:~:text=Provence%20is%20fast%2C%20accurate%2C%20simple%20to%20use%2C%20and%20relatively%20small%20%E2%80%93%20only%201.75%20GB.%20You%20can%20call%20it%20in%20a%20few%20lines%2C%20like%20so%3A">here</a>, the Provence model (`<a href="https://huggingface.co/naver/provence-reranker-debertav3-v1">naver/provence-reranker-debertav3-v1</a>`), a lightweight (1.75 GB), efficient, and accurate context pruner for question answering, can make a difference. It can trim large documents down to only the most relevant text for a given query. You can call it at specific intervals.</p><p>Here’s how we invoke the `provence-reranker` model in our code to prune the context:</p><pre><code># Context pruning with Provence
def prune_with_provence(query: str, context: str, threshold: Optional[float] = None) -> str:
    """
    Prune context using Provence reranker model
    
    Args:
        query: User's query/question
        context: Original context to prune
        threshold: Relevance threshold (0-1) for Provence reranker.
                   If None, uses args.pruning_threshold.
                   0.1 = conservative (recommended, no performance drop)
                   0.3-0.5 = moderate to aggressive pruning
    
    Returns:
        Pruned context with only relevant sentences
    """
    if provence_model is None:
        return context
    
    if threshold is None:
        threshold = args.pruning_threshold
    
    try:
        # Use Provence's process method
        provence_output = provence_model.process(
            question=query,
            context=context,
            threshold=threshold,
            always_select_title=False,
            enable_warnings=False
        )
        
        # Extract pruned context from output
        pruned_context = provence_output.get('pruned_context', context)
        reranking_score = provence_output.get('reranking_score', 0.0)
        
        # Log statistics
        original_length = len(context)
        pruned_length = len(pruned_context)
        reduction_pct = ((original_length - pruned_length) / original_length * 100) if original_length > 0 else 0
        
        if args.verbose:
            rich.print(f"[cyan]📊 Pruning stats: {pruned_length}/{original_length} chars ({reduction_pct:.1f}% reduction, threshold={threshold:.2f}, rerank_score={reranking_score:.3f})[/cyan]")
        
        return pruned_context if pruned_context else context
        
    except Exception as e:
        rich.print(f"[yellow]⚠️ Error in Provence pruning: {str(e)}[/yellow]")
        rich.print(f"[yellow]⚠️ Falling back to original context[/yellow]")
        return context</code></pre><p>We use the Provence reranker model (`naver/provence-reranker-debertav3-v1`) to score sentence relevance. Threshold-based filtering keeps sentences above the relevance threshold. Also, we introduce a fallback mechanism, where we fall back to the original context if pruning fails. Finally, statistics logging tracks reduction percentage in verbose mode.</p><h3>Context summarization: Condense older information instead of dropping it entirely</h3><p><em>Summarization</em> is a companion to trimming. When the history or knowledge base becomes too large, you can employ the LLM to generate a brief summary of the important points and use that summary in place of the full content going forward, as we did in our code above.</p><p>For example, if an AI assistant has had a 50-turn conversation, instead of sending all 50 turns to the model on turn 51 (which likely won’t fit), the system might take turns 1–40, have the model summarize them in a paragraph, and then only supply that summary plus the last 10 turns in the next prompt. This way, the model still knows what was discussed without needing every detail. Early chatbot users did this manually by asking, “Can you summarize what we’ve talked about so far?” and then continuing in a new session with the summary. Now it can be automated. Summarization not only saves context window space but can also reduce <strong>context confusion/distraction</strong> by stripping away extra detail and retaining just the salient facts.</p><p>Here’s how we use OpenAI models (you can use any LLM) to condense context while preserving all relevant information, eliminating redundancy and duplication.<br/></p><pre><code># Context summarization
def summarize_context(query: str, context: str) -> str:
    """
    Summarize context using LLM to reduce duplication and focus on relevant information
    
    Args:
        query: User's query/question
        context: Context to summarize
        
    Returns:
        Summarized context
    """
    try:
        summary_prompt = f"""You are an expert at summarizing conversation context.

Your task: Analyze the provided conversation context and produce a condensed summary that fully answers or supports the user's specific question.

The summary must:
1. Preserve every fact, detail, and information that directly relates to the question
2. Eliminate redundancy and duplicate information
3. Maintain chronological flow when relevant
4. Focus on information that helps answer: "{query}"

Context to summarize:
{context}

Provide a concise summary that preserves all relevant information:"""

        summary = llm.invoke(summary_prompt).content
        
        if args.verbose:
            original_length = len(context)
            summary_length = len(summary)
            reduction_pct = ((original_length - summary_length) / original_length * 100) if original_length > 0 else 0
            rich.print(f"[cyan]📝 Summarization stats: {summary_length}/{original_length} chars ({reduction_pct:.1f}% reduction)[/cyan]")
        
        return summary
        
    except Exception as e:
        rich.print(f"[yellow]⚠️ Error in context summarization: {str(e)}[/yellow]")
        rich.print(f"[yellow]⚠️ Falling back to original context[/yellow]")
        return context</code></pre><p>Importantly, when the context is summarized, the model is less likely to get overwhelmed by trivial details or past errors (assuming the summary is accurate). </p><p>However, summarization has to be done carefully. A bad summary might omit a crucial detail or even introduce an error. It’s essentially another prompt to the model (“summarize this”), so it can hallucinate or lose nuance. Best practice is to summarize incrementally and perhaps keep some canonical facts unsummarized.</p><p>Nonetheless, it has proven very useful. <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">In the Gemini agent scenario, </a>summarizing the context every ~100k tokens was a way to counteract the model’s tendency to repeat itself. The summary acts like a compressed memory of the conversation or data. As developers, we can implement this by having an agent periodically call a summarization function (maybe a smaller LLM or a dedicated routine) on the conversation history or a long document. The resulting summary replaces the original content in the prompt. This tactic is widely used to keep contexts within limits and distill the information.</p><h3>Context quarantine: Isolate contexts when possible</h3><p>This is more relevant in complex agent systems or multistep workflows. The idea of context segmentation is to split a big task into smaller, isolated tasks, each with its own context, so that you never accumulate one enormous context that contains everything. Each subagent or subtask works on a piece of the problem with a focused context, and then a higher-level agent (a supervisor or coordinator) integrates the results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f2de68c3339883d7658e633af3948f29f427e6cf-1600x900.png" style="width: 100%;" /><p><a href="https://www.anthropic.com/engineering/multi-agent-research-system">Anthropic’s research strategy uses multiple subagents</a>, each investigating a different aspect of a question, with their own context windows, and a lead agent that reads the distilled results from those subagents. This parallel, modular approach means that no single context window gets too bloated. It also reduces the chance of irrelevant information mixing: each thread stays on topic (no context confusion) and doesn’t carry unnecessary baggage when answering its specific subquestion. In a sense, it’s like running separate threads of thought that only share their outcomes, not their entire thought process.</p><p>In multi-agent systems, this approach is essential. If Agent A is handling task A and Agent B is handling task B, there’s no reason for either agent to consume the other’s full context unless it’s truly required. Instead, agents can exchange only the necessary information. For example, Agent A can pass a consolidated summary of its findings to Agent B via a supervisor agent, while each subagent maintains its own dedicated context thread. This setup doesn’t require human-in-the-loop intervention; it relies on a supervisory agent with the right tools enabled and minimal, controlled context sharing.</p><p>Nonetheless, designing your system so that agents or tools operate with minimal necessary context overlap can greatly enhance clarity and performance. Think of it as <strong>microservices for AI</strong>: each component deals with its own context, and you pass messages between them in a controlled way instead of sharing one monolithic context. These best practices are often used in combination. 
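</p><p>As a rough illustration of how they compose (reusing the <code>summarize_context</code>, <code>prune_with_provence</code>, and <code>retrieve_from_elasticsearch</code> helpers shown earlier; the rolling-window size is our assumption, not a recommendation):</p><pre><code># Illustrative combination of the practices above; not taken from the app.
MAX_HISTORY_TURNS = 10   # context pruning: keep only recent turns verbatim

def build_context(query: str, history: list) -> str:
    """Trim recent history, summarize the rest, and recall long-term memory."""
    recent = history[-MAX_HISTORY_TURNS:]
    older = history[:-MAX_HISTORY_TURNS]

    parts = []
    if older:
        # Context summarization: condense older turns instead of dropping them
        parts.append(summarize_context(query, "\n".join(older)))

    # RAG: recall only the relevant long-term memories from Elasticsearch
    docs, long_term = retrieve_from_elasticsearch(query, k=5)
    if docs:  # skip error/empty messages; only real memories are included
        # Optional second pass: prune retrieved text down to relevant sentences
        parts.append(prune_with_provence(query, long_term))

    parts.extend(recent)
    return "\n\n".join(parts)</code></pre><p>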
Also, this gives you the flexibility to trim trivial history, summarize important older messages or conversations, offload the detailed logs to Elasticsearch for long-term context, and use retrieval to bring back anything relevant when needed.</p><p>As mentioned <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents#:~:text=While%20some%20models,to%20the%20LLM">here</a>, the guiding principle is that context is a limited and precious resource. You want every token in the prompt to earn its keep, meaning it should contribute to the quality of the output. If something in memory is not pulling its weight (or worse, actively causing confusion), then it should be pruned, summarized, or kept out.</p><p>As developers, we can now program the context just like we program code, deciding what information to include, how to format it, and when to omit or update it. By following these practices, we can give LLM agents the much-needed context to perform tasks without falling victim to the failure modes described earlier. The result is agents that remember what they should, forget what they don’t need, and retrieve what they require just in time.</p><h2>Conclusion</h2><p>Memory isn’t something you add to an agent; it’s something you engineer. Short-term memory is the agent’s working scratch pad, and long-term memory is its durable knowledge store. RAG is the bridge between the two, turning a passive datastore, like Elasticsearch, into an active recall mechanism that can ground outputs and keep the agent current.</p><p>But memory is a double-edged sword. The moment you let context grow unchecked, you invite poisoning, distraction, confusion, and clashes, and in shared systems, even data leakage. That’s why the most important memory work isn’t “store more,” it’s “curate better”: Retrieve selectively, prune aggressively, summarize carefully, and avoid mixing unrelated contexts unless the task truly demands it.</p><p>In practice, good context engineering looks like good systems design: smaller, sufficient contexts, controlled interfaces between components, and a clear separation between raw state and the distilled state you actually want the model to see. Done right, you don’t end up with an agent that remembers everything; you end up with an agent that remembers the right things, at the right time, for the right reason.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Higher throughput and lower latency: Elastic Cloud Serverless on AWS gets a significant performance boost]]></title>
            <description><![CDATA[We've upgraded the AWS infrastructure for Elasticsearch Serverless to newer, faster hardware. Learn how this massive performance boost delivers faster queries, better scaling, and lower costs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-aws-performance-boost</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-aws-performance-boost</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Pete Galeotti, Yuvraj Gupta, Rachel Forshee]]></dc:creator>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3aa0b481994d2445ba979d3c79fff64c5ee6676a-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Cloud Serverless is already the definitive solution for developers who want to build efficient search and AI applications without the operational burden of managing infrastructure. Now, we&#x27;re taking the performance of your serverless projects to a whole new level.</p><p>We&#x27;ve completed a major infrastructure upgrade for all <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> projects running on AWS, migrating to newer, faster hardware. This change has been rolled out to every serverless project automatically. It delivers <strong>higher throughput and lower latency</strong> for Elasticsearch, Elastic Observability, and Elastic Security serverless projects on AWS.</p><h2><strong>Key performance benefits for developers</strong></h2><p>The new AWS hardware infrastructure underpins everything you do with Elastic Cloud Serverless, translating to tangible benefits for your applications&#x27; speed and responsiveness.</p><h3><strong>Reduced query latency… increased throughput</strong></h3><p>The improved hardware dramatically enhances the speed of compute resources, which means your search queries are processed faster than ever.</p><ul><li><strong>Search and vector search:</strong> Whether you&#x27;re running traditional full-text queries or using cutting-edge vector search for your <a href="https://www.elastic.co/generative-ai">generative AI and retrieval-augmented generation (RAG) applications</a>, you&#x27;ll see a marked decrease in latency. Internal benchmarking showed a 35% average decrease in search latency.</li><li><strong>Faster indexing:</strong> Data ingestion rates are optimized, allowing you to index massive data volumes and complex documents with increased throughput. This is crucial for applications that require near–real-time data visibility. Internal benchmarking showed a 26% average increase in indexing throughput.</li></ul><h3><strong>Consistent performance under load</strong></h3><p>Elastic Cloud Serverless is designed to autoscale dynamically in real time to meet demand, minimizing latency, regardless of your workload. With this hardware upgrade, that scaling is now more performant and responsive.</p><ul><li><strong>Handling spikes with ease:</strong> Whether you&#x27;re facing a sudden surge in user traffic or a massive batch data ingest, the new infrastructure ensures that your search and indexing resources scale up more efficiently to maintain consistently low latency.</li><li><strong>Optimized compute-storage decoupling:</strong> The serverless architecture separates compute and storage, which allows workloads to scale independently for optimal performance and cost efficiency. The faster hardware enhances the compute layer, maximizing the efficiency of this decoupled design.</li></ul><h2><strong>Under the hood: Internal benchmarking results</strong></h2><p>To quantify the impact of our AWS infrastructure upgrade, the Elastic engineering team conducted comprehensive internal benchmarking against a range of serverless workloads. 
These workloads provided empirical evidence of performance improvements that you can expect across your applications, regardless of your use case.</p><h3><strong>The benchmarking approach</strong></h3><p>We focused our testing on the key metrics that directly affect the developer experience and application responsiveness: response time (that is, latency) and throughput on search and on indexing operations.</p><ul><li><strong>Workloads tested:</strong> The tests included high-concurrency search operations typical of user-facing applications, complex vector search queries, and high-volume data ingestion/indexing for observability and security use cases. In particular, our testing methodology used <a href="https://github.com/elastic/rally-tracks/tree/master">publicly</a> <a href="https://github.com/elastic/rally-tracks/tree/master">available datasets for Rally</a>, Elastic’s benchmarking tool.<ul><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/wikipedia"><code>wikipedia</code></a>: A dataset derived from a snapshot of Wikipedia’s text contents, to measure general-purpose text search performance.</li><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/msmarco-passage-ranking"><code>MSMARCO-Passage-Ranking</code></a>: A dataset derived from Microsoft’s Machine Reading Comprehension (MS MARCO), to measure search performance on sparse vector fields.</li><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/openai_vector"><code>OpenAI_Vector</code></a>: A dataset derived from BEIR’s NQ and enriched with embeddings generated by OpenAI’s <code>text-embedding-ada-002</code> model, to measure search performance on dense vector fields.</li></ul></li><li><strong>Measurement:</strong> We compared performance on the old and new infrastructure, measuring latency at the 99th percentile (P99) to capture the worst-case, tail-latency performance and operations per second. Each track was run five times for each hardware profile to ensure consistency in the results.</li><li><strong>The goal:</strong> Our aim was to validate the infrastructure&#x27;s ability to deliver consistently <strong>faster and more predictable performance</strong> across the board, even during periods of rapid autoscaling.</li></ul><h3><strong>Performance data summary</strong></h3><p>The results confirm significant gains in efficiency and speed. These gains translate directly into lower response times for your users and lower operational costs as a result of the ability to complete the same amount of work with fewer compute resources.</p><p>The following tables detail the quantitative improvements. Higher values are better for throughput; lower values are better for latency.</p><p><strong>Searching benchmark results:</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Benchmark</td><td style="border: 1px solid #ccc; padding: 4px;">Comparison</td><td style="border: 1px solid #ccc; padding: 4px;">Old infra</td><td style="border: 1px solid #ccc; padding: 4px;">New infra</td><td style="border: 1px solid #ccc; padding: 4px;">Differential</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">729</td><td style="border: 1px solid #ccc; padding: 4px;">1107</td><td style="border: 1px solid #ccc; padding: 4px;">+52%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">56</td><td style="border: 1px solid #ccc; padding: 4px;">35</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">31</td><td style="border: 1px solid #ccc; padding: 4px;">+40%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">108</td><td style="border: 1px solid #ccc; padding: 4px;">67</td><td style="border: 1px solid #ccc; padding: 4px;">-38%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">475</td><td style="border: 1px solid #ccc; padding: 4px;">624</td><td style="border: 1px solid #ccc; padding: 4px;">+31%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">35</td><td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
              </tbody>
            </table>
          <p><strong>Indexing benchmark results:</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Benchmark</td><td style="border: 1px solid #ccc; padding: 4px;">Comparison</td><td style="border: 1px solid #ccc; padding: 4px;">Old infra</td><td style="border: 1px solid #ccc; padding: 4px;">New infra</td><td style="border: 1px solid #ccc; padding: 4px;">Differential</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">2845</td><td style="border: 1px solid #ccc; padding: 4px;">3220</td><td style="border: 1px solid #ccc; padding: 4px;">+13%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">1769</td><td style="border: 1px solid #ccc; padding: 4px;">1120</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">7087</td><td style="border: 1px solid #ccc; padding: 4px;">8900</td><td style="border: 1px solid #ccc; padding: 4px;">+26%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">824</td><td style="border: 1px solid #ccc; padding: 4px;">677</td><td style="border: 1px solid #ccc; padding: 4px;">-18%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">2972</td><td style="border: 1px solid #ccc; padding: 4px;">3187</td><td style="border: 1px solid #ccc; padding: 4px;">+7%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">2946</td><td style="border: 1px solid #ccc; padding: 4px;">2944</td><td style="border: 1px solid #ccc; padding: 4px;">0%</td>
                    </tr>
              </tbody>
            </table>
          <h2><strong>The added bonus: Cost reduction</strong></h2><p>While our focus is on delivering low-latency performance, the efficiency of the new hardware also has a direct, positive impact on costs for Elasticsearch projects.</p><p><a href="https://www.elastic.co/pricing/serverless-search">Elasticsearch Serverless pricing</a> is usage-based, meaning that you only pay for the ingest and search resources you consume. Because the newer, faster hardware is more efficient, your workloads will often complete tasks using fewer resources, leading to an inherent cost reduction for most projects. You get a premium performance boost without the premium price tag—the definition of optimized efficiency.</p><h2><strong>What does this mean for you, the developer?</strong></h2><p>This infrastructure upgrade is entirely managed by Elastic, so you don&#x27;t need to lift a finger—no migrations and no configuration changes. The improvement is immediate and automatic across all your AWS-based serverless projects.</p><p>This upgrade empowers you to:</p><ul><li><strong>Build faster applications:</strong> Focus on feature velocity, knowing that your underlying search platform is delivering the speed your users demand.</li><li><strong>Innovate with confidence:</strong> Deploy new search, observability, and security features—including complex AI capabilities, like vector search and relevance ranking—with the assurance that the platform can handle the load at peak performance.</li><li><strong>Simplify your stack:</strong> Use a fully managed service that handles infrastructure management, capacity planning, and scaling, so you can focus on your code and data.<br/></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-embeddings-v3 is now available on Elastic Inference Service]]></title>
            <description><![CDATA[Introducing jina-embeddings-v3 on Elastic Inference Service (EIS) and explaining how to get started.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Ranjana Devaji, Brendan Jugan]]></dc:creator>
            <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bb1c96c697d8d48b764bee487a73a6cae130d0d-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we are excited to introduce <a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/"><code>jina-embeddings-v3</code></a> on Elastic Inference Service (EIS), bringing fast multilingual dense retrieval directly to Elasticsearch. <code>jina-embeddings-v3</code> is the first Jina AI model available on EIS, with many more to come soon.</p><p>Jina AI, <a href="https://www.elastic.co/blog/elastic-jina-ai">which recently joined Elastic via acquisition</a>, is a leader in open-source multilingual and multimodal embeddings, rerankers, and small language models. Jina brings deep expertise in <a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">search foundation models</a> that help developers build high-quality retrieval and RAG systems across text, images, code, and long multilingual content.</p><p><a href="https://www.elastic.co/blog/elastic-inference-service">Elastic Inference Service</a> (EIS) makes it easy for developers to add fast, high-quality, and reliable semantic retrieval for search, RAG, and agentic applications with predictable, consumption-based pricing. EIS offers fully managed, GPU-powered inference with ready-to-use models, no additional setup or hosting complexity required.</p><p><code>jina-embeddings-v3</code> supports high-quality multilingual retrieval with long-context reasoning and task-tuned modes for RAG and agents. It gives developers fast dense embeddings across a broad range of languages without operational overhead.</p><h2>Why jina-embeddings-v3?</h2><p><code>jina-embeddings-v3</code> is a text embedding model that supports 32 languages and a context of up to 8192 tokens, delivering high relevance at lower cost with GPU-powered inference through EIS.</p><h3><strong>Key capabilities</strong></h3><ul><li><strong>Multilinguality</strong>: Closes the language gap and aligns meaning across 32 languages, including Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.</li><li><strong>Parameter-efficiency</strong>: <a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/">Delivers high performance</a> with only 570M parameters, matching much larger LLM-based embedding models at lower cost.</li><li><strong>Dimensionality control</strong>: Defaults to 1024 dimensions, and Matryoshka representation support lets developers dial the embedding size all the way down to 32 dimensions, giving flexibility to balance accuracy, latency, and storage based on your needs.</li><li><strong>Task-specific optimization</strong>: Features task-specific Low-Rank Adaptation (LoRA) adapters, enabling it to generate high-quality embeddings for various tasks, including query-document retrieval, clustering, classification, and text matching.</li></ul><h2><strong>Get started</strong></h2><pre><code>POST _inference/text_embedding/.jina-embeddings-v3
{
  "input": ["Rocky Mountain National Park"],
  "input_type": "ingest"
}</code></pre><p>The response:</p><pre><code>{
  "text_embedding": [
    {
      "embedding": [
        -0.06678891,
        -0.0073341704,
        0.011903269,
        -0.041797,
      ...
      ]
    }
  ]
}</code></pre><h2>What’s next</h2><p>Alongside these new models, EIS continues to evolve to support more users and simplify semantic search across environments.</p><p><strong>Cloud Connect for EIS: </strong><a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a> for EIS will soon bring EIS to self-managed environments, reducing operational overhead and enabling hybrid architectures and scaling where it works best for you.</p><p><strong>semantic_text defaults to jina-embeddings-v3 on EIS: </strong><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> currently uses ELSER as the embeddings model behind the scenes, but will default to the <code>jina-embeddings-v3</code> endpoint on EIS in the near future. This change will provide built-in inference at ingestion time, making it easier to adopt multilingual search without additional configuration.</p><p><strong>More models: </strong>We’re expanding the EIS model catalog to meet the rising inference demands of our customers. In the coming months, we’ll introduce new models that support an even broader set of search and inference workloads. Hot on the heels of <code>jina-embeddings-v3</code>, the next models to follow are <code><a href="https://jina.ai/models/jina-reranker-v2-base-multilingual/">jina-reranker-v2-base-multilingual</a> </code>and <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/"><code>jina-reranker-v3</code></a>. Both Jina AI models greatly improve precision through multilingual reranking for RAG and AI agents.</p><h2>Conclusion</h2><p>With <code>jina-embeddings-v3</code> on EIS, you can build multilingual, high-precision retrieval pipelines without managing models, GPUs, or infrastructure. You get fast dense retrieval and tight integration with Elasticsearch’s relevance stack, all in one platform.</p><p>Whether you are building global RAG systems, search, or agentic workflows that need reliable context, Elastic now gives you a high-performance model out-of-the-box, and the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to the Elastic Inference Service. <a href="https://www.elastic.co/cloud/serverless">Try it now </a>on Elastic Cloud Serverless and Elastic Cloud Hosted.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search and multistage retrieval in ES|QL]]></title>
            <description><![CDATA[Explore the multistage retrieval capabilities of ES|QL, using FORK and FUSE commands to integrate hybrid search with semantic reranking and native LLM completions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Ioana Tagirta, Aurélien Foucret, Carlos Delgado]]></dc:creator>
            <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fceada10b0c09d6a4a372f137bb3040e1ff41fbf-1600x895.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In Elasticsearch 9.2, we’ve introduced the ability to do dense vector search and hybrid search in Elasticsearch Query Language (ES|QL). This continues our investment in making ES|QL the best search language to solve modern search use cases.</p><h2>Multistage retrieval: The challenge of modern search</h2><p>Modern search has evolved beyond simple keyword matching. Today&#x27;s search applications need to understand intent, handle natural language, and combine multiple ranking signals to deliver the best results.</p><p>Retrieval of the most relevant results happens in multiple stages, with each stage gradually refining the result set. In the past, most use cases required only one or two stages of retrieval: an initial query to get results and an optional rescoring phase.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fceada10b0c09d6a4a372f137bb3040e1ff41fbf-1600x895.png" style="width: 100%;" /><p>We start with an initial retrieval, where we cast a wide net to gather results that are relevant to our query. Since we need to sift through all the data, we should use techniques that return results fast, even when we index billions of documents.</p><p>We therefore employ trusted techniques, such as lexical search, which Elasticsearch has supported and optimized since the beginning, or vector search, where Elasticsearch excels in speed and accuracy.</p><p>Lexical search using BM25 is quite fast and best at exact term or phrase matching, while <a href="https://www.elastic.co/docs/solutions/search/vector">vector</a> or <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> is better suited for handling natural language queries. <a href="https://www.elastic.co/what-is/hybrid-search">Hybrid search</a> combines lexical and <a href="https://www.elastic.co/docs/solutions/search/vector">vector search</a> results to bring the best from both. The challenge that hybrid search solves is that vector and lexical search have completely different and incompatible scoring functions, which produce values in different intervals, following different distributions. A vector search score close to 1 can mean a very close match, but it doesn’t mean the same for lexical search. Hybrid search methods, such as <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">reciprocal rank fusion</a> (RRF) and linear combination of scores, assign new scores that blend the original scores from lexical and vector search.</p><p>After hybrid search, we can employ techniques such as <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking">semantic reranking</a> and <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">Learning To Rank</a> (LTR), which use specialized machine learning models to rerank the results.</p><p>With our most relevant results, we can use large language models (LLMs) to further enrich our response or pass the most relevant results as context to LLMs in agentic workflows in tools such as <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a>.</p><p>ES|QL is able to handle all these stages of retrieval. By design, ES|QL is a piped language, where each command transforms the input and sends the output to the next command. Each stage of retrieval is represented by one or more consecutive ES|QL commands. 
In this article, we show how each stage is supported in ES|QL.</p><h2>Vector search</h2><p>In Elasticsearch 9.2, we introduced tech preview support for dense vector search in ES|QL. This is as simple as calling the <code>knn</code> function, which only requires a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field and a query vector:</p><pre><code>FROM books METADATA _score
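// approximate nearest-neighbor search on a dense_vector field; ?query_vector is a query parameter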
| WHERE KNN(description_vector, ?query_vector)
| SORT _score DESC
| LIMIT 100</code></pre><p>This query executes an approximate nearest neighbor search, retrieving 100 documents that are the most similar to the <code>query_vector</code>.</p><h2>Hybrid search: Reciprocal rank fusion</h2><p>In Elasticsearch 9.2, we introduced support for hybrid search using RRF and linear combination of results in ES|QL.</p><p>This allows combining vector search and lexical search results into a single result set.</p><p>To achieve this in ES|QL, we need to use the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> commands. <code>FORK</code> runs multiple branches of execution, and <code>FUSE</code> merges the results and assigns new relevance scores using RRF or linear combination.</p><p>In the following example, we use <code>FORK</code> to run two separate branches, where one is doing a lexical search using the <code>match</code> function, while the other is doing a vector search using the <code>knn</code> function. We then merge the results together using <code>FUSE</code>:</p><pre><code>FROM books METADATA _score, _id, _index
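// each branch retrieves and ranks its own top 100 before the results are fused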
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE // uses RRF by default
| SORT _score DESC</code></pre><p>Let&#x27;s decompose the query to better understand the execution model and first look at the output of the <code>FORK</code> command:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)</code></pre><p>The <code>FORK</code> command outputs the results from both branches and adds a <code>_fork</code> discriminator column:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.86</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.84</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.79</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">4.55</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">4.25</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">4.11</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">3.8</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">4.1</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>As you’ll notice, certain documents appear twice, which is why we then use <code>FUSE</code> to merge rows that represent the same documents and assign new relevance scores. <code>FUSE</code> is executed in two stages:</p><ul><li>For each row, <code>FUSE</code> assigns a new relevance score, depending on the hybrid search algorithm that is being used.</li><li>Rows that represent the same document are merged together, and a new score is computed.</li></ul><p>In our example, we’re using RRF. As a first step, <code>FUSE</code> assigns a new score to each row using the RRF formula:</p><pre><code>score(doc) = 1 / (rank_constant + rank(doc))</code></pre><p>Here, the <code>rank_constant</code> takes a default value of 60 and <code>rank(doc)</code> represents the position of the document in the result set.</p><p>In the first phase, our results become:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 1) = 0.01639</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 2) = 0.01613</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 3) = 0.01587</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 4) = 0.01563</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 5) = 0.01538</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 6) = 0.01515</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 1) = 0.01639</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 2) = 0.01613</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 3) = 0.01587</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 4) = 0.01563</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 5) = 0.01538</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>Then the rows are merged together and a new score is assigned. Since a <code>SORT _score DESC</code> follows the <code>FUSE</code> command, the final results are:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.01639 + 0.01639 = 0.03279</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.01613 + 0.01613 = 0.03226</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.01587 + 0.01563 = 0.0315</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.01538 + 0.01587 = 0.03125</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.01563 + 0.01538 = 0.03101</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.01515</td>
                    </tr>
              </tbody>
            </table>
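          <p>The <code>rank_constant</code> is tunable: smaller values amplify the advantage of top-ranked documents, while larger values flatten the differences between ranks. A minimal sketch, assuming the <code>rank_constant</code> option is accepted in the <code>WITH</code> map of <code>FUSE</code>, just like the <code>LINEAR</code> options shown in the next section:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE RRF WITH { "rank_constant": 20 } // smaller constant favors top-ranked hits
| SORT _score DESC</code></pre>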
          <h2>Hybrid search: Linear combination of scores</h2><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">Reciprocal rank fusion</a> is the simplest way to do hybrid search, but it isn’t the only hybrid search method that we support in ES|QL.</p><p>In the following example, we use <code>FUSE</code> to combine lexical and <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-semantic-text">semantic search</a> results using linear combination of scores:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
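// LINEAR weights and sums the per-branch scores: _score = 0.7 * fork1_score + 0.3 * fork2_score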
| FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 } }
| SORT _score DESC</code></pre><p>Let&#x27;s first decompose the query and take a look at the input of the <code>FUSE</code> command when we only run the <code>FORK</code> command.</p><p>Notice that we use the <code>match</code> function, which is able to not only query lexical fields, such as <code>text</code> or <code>keyword</code>, but also <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> fields.</p><p>The first <code>FORK</code> branch executes a semantic query by querying a <code>semantic_text</code> field, while the second one executes a lexical query:</p><pre><code>FROM books METADATA _score, _id, _index
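// fork1 runs a semantic query on a semantic_text field; fork2 runs a lexical query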
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)</code></pre><p>The output of the <code>FORK</code> command can contain rows with the same <code>_id</code> and <code>_index</code> values representing the same Elasticsearch document:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.86</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.84</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.79</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">4.55</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">4.25</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">4.11</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">3.8</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">4.1</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>In the next step, we use <code>FUSE</code> to merge rows that have the same <code>_id</code> and <code>_index</code> values, and assign new relevance scores.</p><p>The new score is a linear combination of the scores the row had in each <code>FORK</code> branch:</p><pre><code>_score = 0.7 * _score1 + 0.3 * _score2</code></pre><p>Here, <code>_score1</code> and <code>_score2</code> represent the score a document has in the first <code>FORK</code> branch and the second <code>FORK</code> branch, respectively.</p><p>Notice that we also apply custom weights, giving more weight to the semantic score over the lexical one, resulting in this set of documents:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.88 + 0.3 * 4.55 = 1.981</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.88 + 0.3 * 4.25 = 1.891</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.84 + 0.3 * 4.1 = 1.818</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.78 + 0.3 * 4.11 = 1.779</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.86 + 0.3 * 3.8 = 1.742</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.79 + 0.3 * 0 = 0.553</td>
                    </tr>
              </tbody>
            </table>
          <p>One challenge is that semantic and lexical scores can be incompatible for a direct linear combination, since they can follow completely different distributions. To mitigate this, we first need to normalize the scores using a score normalization method such as <code>minmax</code>. This ensures that the scores from each <code>FORK</code> branch are first normalized to values between 0 and 1, before the linear combination formula is applied.</p><p>To achieve this with <code>FUSE</code>, we need to specify the <code>normalizer</code> option:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
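// "minmax" rescales the scores of each fork to [0, 1]: (score - min) / (max - min)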
| FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 }, "normalizer": "minmax" }
| SORT _score DESC</code></pre><h2>Semantic reranking</h2><p>At this stage, after hybrid search, we should be left with the most relevant documents. We can now use semantic reranking to reorder the results using the <code>RERANK</code> command. By default, <code>RERANK</code> uses the latest Elastic <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking">semantic reranking</a> machine learning model, so no additional configuration is needed:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
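// rescore the fused top 100 with the default Elastic semantic reranking model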
| RERANK ?query ON description
| SORT _score DESC</code></pre><p>We now have our best results, sorted by relevance.</p><p>One key feature that sets the <code>RERANK</code> command apart from other products that offer semantic reranking integrations is that it doesn’t require the input to represent a mapped field from an index. <code>RERANK</code> only expects an expression that evaluates to a string value, making it possible to do semantic reranking using multiple fields:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
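// the rerank input is a string expression, so multiple fields can be combined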
| RERANK ?query ON CONCAT(title, "\n", description) 
| SORT _score DESC</code></pre><h2>LLM completions</h2><p>Now we have a set of highly relevant, reranked results.</p><p>At this stage, you might simply decide to return the results to your application, or you might want to further enhance them using LLM completions.</p><p>If you’re using ES|QL as part of a retrieval-augmented generation (RAG) workflow, you can choose to call your favorite LLM directly from ES|QL.<br/>To achieve this, we’ve added a new <code>COMPLETION</code> command that takes a prompt, an inference ID designating which LLM to call, and a column identifier specifying where to output the LLM response.</p><p>In the following example, we’re using <code>COMPLETION</code> to add a new <code>summary</code> column that contains a summary of the <code>description</code> column:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC
| LIMIT 10
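// the LLM response for each row is written to the new "summary" column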
| COMPLETION CONCAT("Summarize the following:\n", description) WITH { "inference_id" : "my_inference_endpoint" } </code></pre><p>Each row now contains a summary:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">summary</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.03279</td><td style="border: 1px solid #ccc; padding: 4px;">Bilbo helps dwarves reclaim Erebor from the dragon Smaug.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.03226</td><td style="border: 1px solid #ccc; padding: 4px;">Frodo begins the quest to destroy the One Ring.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.0315</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship splits; war comes to Rohan; Frodo nears Mordor.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.03125</td><td style="border: 1px solid #ccc; padding: 4px;">Ancient myths and history of Middle-earth's First Age.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.03101</td><td style="border: 1px solid #ccc; padding: 4px;">Sauron is defeated and Aragorn is crowned King.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.01515</td><td style="border: 1px solid #ccc; padding: 4px;">The tragic tale of Túrin Turambar's cursed life.</td>
                    </tr>
              </tbody>
            </table>
          <p>In another use case, you may simply want to answer a question using the proprietary data that you have indexed in Elasticsearch. In this case, the best search results that we’ve computed in the previous stage can be used as context for the prompt:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC
| LIMIT 10
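// STATS VALUES(...) collapses the top 10 rows into a single row with a multivalued "context" column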
| STATS context = VALUES(CONCAT(title, "\n", description))
| COMPLETION CONCAT("Answer the following question ", ?query, "based on:\n", context) WITH { "inference_id" : "my_inference_endpoint" }</code></pre><p>Since the <code>COMPLETION</code> command unlocks the ability to send any prompt to an LLM, the possibilities are endless. Although we’re only showing a few examples, the <code>COMPLETION</code> command can be used in a wide range of scenarios, from security analysts using it to assign scores depending on whether a log event can represent a malicious action or data scientists using it to analyze data, to cases where you just need to<a href="https://www.elastic.co/search-labs/blog/esql-completion-command-llm-fact-generator"> generate Chuck Norris facts based on your data</a>.</p><h2>This is only the beginning</h2><p>In the future, we’ll be expanding ES|QL to improve semantic reranking for long documents, better conditional execution of the ES|QL queries using multiple <code>FORK</code> commands, support sparse vector queries, removing close duplicate results to enhance result diversity, allowing full text search on runtime generated columns, and many other scenarios.</p><p>Additional tutorials and guides:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/esql-for-search">ES|QL for search</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-search-tutorial">ES|QL for search tutorial</a></li><li><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">Semantic_text field type</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> documentation</li><li>ES|QL search functions</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Implementing an agentic reference architecture with Elastic Agent Builder and MCP]]></title>
            <description><![CDATA[Explore an agentic reference architecture with Elastic Agent Builder, MCP, and semantic search to build a security agent for automated threat analysis.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33a57e3f61f9095c99b6d1499175a6edb0d5dfc5-4720x2560.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we will present a reference architecture for using Elasticsearch with AI capabilities through the <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a>, exposing an <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP server</a> to access Agent Builder tools and Elasticsearch data.</p><p>Model Context Protocol (<a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a>) is an open-source standard that enables applications and LLMs to communicate with external systems via <a href="https://modelcontextprotocol.io/specification/2025-06-18/server/tools">MCP tools</a> (programmatic capabilities), and <a href="https://docs.langchain.com/oss/python/langgraph/overview">LangGraph</a> (an extension of <a href="https://docs.langchain.com/oss/javascript/langchain/overview">LangChain</a>) provides the orchestration framework for these agentic workflows.</p><p>We’ll implement an application that can search both internal knowledge (data stored in Elasticsearch) and external sources (the internet) to identify potential and known vulnerabilities related to a specific tool. The application will gather the information and generate a detailed summary of the findings.</p><h2>Requirements</h2><ul><li>Elasticsearch 9.2</li><li>Python 3.1x</li><li><a href="https://platform.openai.com/api-keys">OpenAI API Key</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">Elasticsearch API Key</a></li><li><a href="https://serpapi.com/users/sign_up?plan=free">Serper API Key</a></li></ul><h2>Elastic Agent Builder</h2><p><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a> is a set of AI-powered capabilities for developing and integrating agents that can interact with your Elasticsearch data. It provides a built-in agent that can be used for natural language conversations with your data or instance, and it also supports tool creation, Elastic APIs, A2A, and MCP. In this article, we will focus on using the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP server</a> for external access to the Elastic Agent Builder tools.</p><p>To learn more about Agent Builder features, you can read <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">this article</a>.</p><h3>Agent Builder MCP feature</h3><p>The <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP server</a> is available in the Agent Builder and can be accessed at:</p><pre><code>{KIBANA_URL}/api/agent_builder/mcp
# Or if you are using a custom Kibana space:
{KIBANA_URL}/s/{SPACE_NAME}/api/agent_builder/mcp</code></pre><p>Agent Builder offers <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#built-in-tools">built-in tools</a>, and you can also create your own <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#custom-tools">custom tools</a>.</p><h2>Reference architecture</h2><p>To get a complete overview of the elements used by an agentic application in an end-to-end workflow, let’s look at the following diagram:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c5bbba345340bfe5571b17d53b5896d4a3235eac-4720x2560.png" style="width: 100%;" /><p>Elasticsearch is at the center of this architecture, functioning as a vector store, providing the embeddings generation model, and also serving the MCP server that exposes the data via tools. To better explain the workflow, let’s look at the ingestion layer and the Agent Builder layer separately.</p><h3>Ingestion layer</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/97a0075ae637d64140ec7ff0d167297723675632-3000x1176.png" style="width: 100%;" /><p>Here, the first element is the data that will be stored in Elasticsearch. The data passes through an ingest pipeline, where it is processed by the Elasticsearch ELSER model to generate embeddings, and is then indexed.</p><h3>Elastic Agent Builder layer</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bb396bbd4c3baa3be26f9d9e386f4d5405132ab-2180x2560.png" style="width: 100%;" /><p>In this layer, the Agent Builder plays a central role by exposing the tools needed to interact with the Elasticsearch data. It manages the tools that operate over Elasticsearch indices and makes them available for consumption. <a href="https://docs.langchain.com/oss/python/langchain/overview">LangChain</a> then handles the orchestration via the MCP client.</p><p>This architecture lets Agent Builder act as one of many MCP servers available to the client, so it can be combined with other MCP servers. This way, the MCP client can ask cross-source questions and then combine the answers.</p><h2>Use case: Security vulnerability agent</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1b46b29a8cde4645ebaec1f747be4f6888dd8d39-1600x906.png" style="width: 100%;" /><p>The security vulnerability agent identifies potential risks based on a user’s question by combining three complementary layers:</p><p><strong>First</strong>, it performs a <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> with embeddings over an internal knowledge base of past incidents, configurations, and known vulnerabilities to retrieve relevant historical evidence.</p><p><strong>Second</strong>, it searches the internet for newly published recommendations or threat intelligence that may not yet exist internally.</p><p><strong>Finally</strong>, an LLM correlates and prioritizes both internal and external findings, evaluates their relevance to the user’s specific environment, and produces a clear explanation along with potential mitigation steps.</p><h2>Developing the application</h2><p>The application’s code can be found in the attached <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">notebook</a>.</p><p>You can see the setup for the Python application below:</p><pre><code># imports and environment setup
import os
import json

import requests
from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers

# load environment variables
load_dotenv()

ELASTICSEARCH_ENDPOINT = os.getenv("ELASTICSEARCH_ENDPOINT")
ELASTICSEARCH_API_KEY = os.getenv("ELASTICSEARCH_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SERPER_API_KEY = os.getenv("SERPER_API_KEY")
KIBANA_URL = os.getenv("KIBANA_URL")

INDEX_NAME = "security-vulnerabilities"
KIBANA_HEADERS = {
    "kbn-xsrf": "true",
    "Content-Type": "application/json",
    "Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}",
} # Useful for Agent Builder API calls
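
# Agent Builder MCP endpoint (URL pattern shown above); used later by the MCP client.
# Assumption: default Kibana space — add /s/{SPACE_NAME} if you use a custom space.
MCP_ENDPOINT = f"{KIBANA_URL}/api/agent_builder/mcp"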


es_client = Elasticsearch(ELASTICSEARCH_ENDPOINT, api_key=ELASTICSEARCH_API_KEY)  # Elasticsearch client</code></pre><p>We need to access Agent Builder and create an agent specialized in security queries and a tool that performs semantic search. You need to have <a href="https://www.elastic.co/docs/solutions/search/agent-builder/get-started">Agent Builder <strong>enabled</strong></a> for the next step. Once it’s on, we’ll use the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/kibana-api#tools">tools API</a> to create a tool that will perform a semantic search.</p><pre><code>security_search_tool = {
    "id": "security-semantic-search",
    "type": "index_search",
    "description": "Search internal security documents including incident reports, pentests, internal CVEs, security guidelines, and architecture decisions. Uses semantic search powered by ELSER to find relevant security information even without exact keyword matches. Returns documents with severity assessment and affected systems.",
    "tags": ["security", "semantic", "vulnerabilities"],
    "configuration": {
        "pattern": INDEX_NAME,
    },
}

try:
    response = requests.post(
        f"{KIBANA_URL}/api/agent_builder/tools",
        headers=KIBANA_HEADERS,
        json=security_search_tool,
    )

    if response.status_code == 200:
        print("✅ Security semantic search tool created successfully")
    else:
        print(f"❌ Failed to create tool ({response.status_code}): {response.text}")
except Exception as e:
    print(f"❌ Error creating tool: {e}")</code></pre><p>Configure your tools following the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#best-practices">best practices</a> defined by Elastic for developing Tools. Once created, this tool will be ready to use in the Kibana UI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e5b3282dea07987613d8e8d35c372ca68820e44-1600x381.png" style="width: 100%;" /><p>With the tool created, we can start writing the code for the ingestion workflow:</p><h3>Ingest pipeline</h3><p>To define the data structure, we need to have a <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/dataset.json">dataset</a> prepared for ingestion. Below is a sample document for this example:</p><pre><code>{
    "title": "Incident Report: Node.js Express 4.17 Prototype Pollution RCE",
    "content": "In March 2024, our production Node.js Express 4.17 API gateway experienced a critical prototype pollution vulnerability leading to remote code execution. The attack vector involved manipulating object prototypes through JSON payloads in POST requests. This affected all Express middleware processing user input. Immediate mitigation: upgrade to Express 4.18.2+, implement input validation, use Object.freeze() for critical objects. Related to CVE-2022-24999.",
    "doc_type": "incident_report",
    "severity": "critical",
    "affected_systems": [
      "api-gateway-prod",
      "api-gateway-staging"
    ],
    "date": "2024-03-15"
}</code></pre><p>For this type of document, we will use the following index mappings:</p><pre><code>index_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "doc_type": {"type": "keyword", "copy_to": "semantic_field"},
            "severity": {"type": "keyword", "copy_to": "semantic_field"},
            "affected_systems": {"type": "keyword", "copy_to": "semantic_field"},
            "date": {"type": "date"},
            "semantic_field": {"type": "semantic_text"},
        }
    }
}

if not es_client.indices.exists(index=INDEX_NAME):
    es_client.indices.create(index=INDEX_NAME, body=index_mapping)
    print(f"✅ Index '{INDEX_NAME}' created with semantic_text field for ELSER")
else:
    print(f"ℹ️  Index '{INDEX_NAME}' already exists, skipping creation")</code></pre><p>We are creating a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic_text</a> field to perform semantic search using the information from the fields marked with the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to">copy_to</a> property.</p><p>With that mapping definition, we can ingest the data using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk">bulk API</a>.</p><pre><code>def build_bulk_actions(documents, index_name):
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


try:
    with open("dataset.json", "r") as f:
        security_documents = json.load(f)

    success, failed = helpers.bulk(
        es_client,
        build_bulk_actions(security_documents, INDEX_NAME),
        refresh=True,
    )
    print(f"📥 {success} documents indexed successfully")

except Exception as e:
    print(f"❌ Error during bulk indexing: {str(e)}")</code></pre><h3>LangChain MCP client</h3><p>Here we’re going to create an MCP client using LangChain to consume the Agent Builder tools and build a workflow with LangGraph to orchestrate the client execution. The first step is to <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server#configuring-mcp-clients">connect to the MCP server</a>:</p><pre><code>client = MultiServerMCPClient(
    {
        "agent-builder": {
            "transport": "streamable_http",
            "url": MCP_ENDPOINT,
            "headers": {"Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}"},
        }
    }
)

tools = await client.get_tools()

print(f"📋 MCP Tools available: {[t.name for t in tools]}") # ['platform_core_search',  ... 'security-semantic-search']</code></pre><p>Next, we create an agent that selects the appropriate tool based on the user input:</p><pre><code>reasoning = {"effort": "low"}

llm = ChatOpenAI(
    model="gpt-5.2-2025-12-11", reasoning=reasoning, openai_api_key=OPENAI_API_KEY
) # LLM client 

agent = create_agent(
    llm,
    tools=tools,
    system_prompt="""You are a cybersecurity expert specializing in infrastructure security.

        Your role is to:
        1. Analyze security queries from users
        2. Search internal security documents (incidents, pentests, CVEs, guidelines)
        3. Provide actionable security recommendations
        4. Assess vulnerability severity and impact

        When responding:
        - Always search internal documents first using the agent builder tools
        - Provide specific, technical, and actionable advice
        - Cite relevant internal incidents and documentation
        - Assess severity (critical, high, medium, low)
        - Recommend immediate mitigation steps

        Be concise but comprehensive. Focus on practical security guidance.""",
)</code></pre><p>We’ll use the GPT-5.2 model, which represents OpenAI’s state-of-the-art for agent management tasks. We configure it with low reasoning effort to achieve faster responses compared to the medium or high settings, while still delivering high-quality results by leveraging the full capabilities of the GPT-5 family. You can read more about GPT-5.2 <a href="https://openai.com/index/introducing-gpt-5-2/">here</a>.</p><p>Now that the initial setup is done, the next step is to define a workflow capable of making decisions, running tool calls, and summarizing results.</p><p>For this, we use LangGraph. We won’t cover LangGraph in depth here; <a href="https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch">this article</a> provides a detailed overview of its functionality.</p><p>The following image shows a high-level view of the LangGraph application.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/729295115427ec981a594e873245fa541dd977aa-332x531.png" style="width: 100%;" /><p>We need to define the application state:</p><pre><code>from typing import TypedDict

class AgentState(TypedDict):
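    """Shared state passed between the LangGraph workflow nodes."""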
    query: str
    agent_builder_response: dict
    internet_results: list
    final_response: str
    needs_internet_search: bool</code></pre><p>To better understand how the workflow operates, here is a brief description of each function, followed by a sketch of how the nodes can be wired together. For full implementation details, refer to the accompanying <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">notebook</a>.</p><ul><li><strong>call_agent_builder_semantic_search:</strong> Queries internal documentation using the Agent Builder MCP server and also stores the retrieved messages in the state.</li><li><strong>decide_internet_search:</strong> Analyzes the internal results and determines whether an external search is required.</li><li><strong>perform_internet_search:</strong> Runs an external search using the <a href="https://serper.dev/">Serper</a> API when needed.</li><li><strong>generate_response:</strong> Correlates internal and external findings and produces a final, actionable cybersecurity analysis for the user.</li></ul>
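<p>The following is a minimal sketch of how these functions can be wired into a LangGraph graph. It assumes the four node functions above are defined as in the notebook and that <code>decide_internet_search</code> returns the name of the next node; treat it as an illustration rather than the notebook’s exact implementation:</p><pre><code>from langgraph.graph import StateGraph, START, END

# Build the graph over the shared AgentState
workflow = StateGraph(AgentState)
workflow.add_node("agent_builder_search", call_agent_builder_semantic_search)
workflow.add_node("internet_search", perform_internet_search)
workflow.add_node("generate_response", generate_response)

# Always start with the internal semantic search via Agent Builder
workflow.add_edge(START, "agent_builder_search")

# Route to the internet search only when internal results are insufficient
workflow.add_conditional_edges(
    "agent_builder_search",
    decide_internet_search,  # assumed to return the name of the next node
    {"internet_search": "internet_search", "generate_response": "generate_response"},
)
workflow.add_edge("internet_search", "generate_response")
workflow.add_edge("generate_response", END)

app = workflow.compile()  # later: await app.ainvoke({"query": query})</code></pre>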
<p>With the workflow defined, we can now send a query:</p><pre><code>query = "We are using Node.js with Express 4.17 for our API gateway. Are there known prototype pollution or remote code execution vulnerabilities?"</code></pre><p>In this example, we want to evaluate whether this specific version of Express is affected by known vulnerabilities.</p><h4>Research results</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b18a31e42bcd8f4d86bb605f85d4ff77135b0855-1084x517.png" style="width: 100%;" /><p>See the complete response in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">this file</a>.</p><p>This response clearly correlates internal and internet findings and provides actionable mitigation steps. It successfully highlights the severity of the vulnerability and offers a structured, security-oriented summary.</p><h3>Extensions and future enhancements</h3><p>This architecture is modular and allows us to extend its capabilities by replacing, improving, or adding components. We could add another agent, consumed by the same MCP client. We could also use an automated ingestion workflow with tools such as Logstash, Kafka, or <a href="https://www.elastic.co/docs/reference/search-connectors/self-managed-connectors">Elastic self-managed connectors</a>. Feel free to change the LLM, the MCP client framework, or the embeddings model, or to add more tools depending on your needs.</p><h2>Conclusion</h2><p>This reference architecture shows a practical way to combine Elasticsearch, the Agent Builder, and MCP to build an AI-driven application. Its structure keeps each part independent, which makes the system easy to implement, maintain, and extend.</p><p>You can start with a simple setup (like the security use case in this article) and scale it by adding new tools, data sources, or agents as your needs grow. Overall, it provides a straightforward path for building flexible and reliable agentic workflows on top of Elasticsearch.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Automating log parsing in Streams with ML]]></title>
            <description><![CDATA[Learn how a hybrid ML approach achieved 94% log parsing and 91% log partitioning accuracy through automation experiments with log format fingerprinting in Streams.]]></description>
            <link>https://www.elastic.co/search-labs/blog/log-parsing-partitioning-automation-experiments-streams</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/log-parsing-partitioning-automation-experiments-streams</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Nastia Havriushenko]]></dc:creator>
            <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/965c58f19742361160593c38fcaa8b2f4b0d6cc5-3838x2159.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9108ec303339dd091faa3c363c7cf5c228155f49-3840x2160.png" style="width: 100%;" /><p>Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models (LLMs), with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.</p><p>We are happy to announce that this feature is already available in <a href="http://elastic.co/elasticsearch/streams"><span style="text-decoration:underline">Streams</span></a>!</p><h2>Dataset description</h2><p>We chose the <strong><a href="https://github.com/logpai/loghub">Loghub</a></strong> collection of logs for PoC purposes. For our investigation, we selected representative samples from the following key areas:</p><ul><li>Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.</li><li>Server &amp; web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.</li><li>Operating systems: We included logs from Linux and Windows. These datasets represent the common, semi-structured system-level events that operations teams encounter daily.</li><li>Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.</li><li>Supercomputers: To test performance in high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.</li></ul><p>A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with a microservice architecture.</p><p>Log examples:</p><pre><code>[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed</code></pre><p>In addition, we created a Kubernetes cluster with a typical web application + database setup to mine extra logs in the most common domain.</p><p>Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.</p><h2>Few-shot log parsing with an LLM</h2><p>Our first set of experiments focused on a fundamental question: <strong>Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?</strong></p><p>We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and <a href="https://www.elastic.co/docs/explore-analyze/scripting/grok">Grok</a> formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.</p><h3>High confidence &amp; context awareness</h3><p>Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Besides simple pattern matching, the model showed a capacity for log understanding—it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).</p><h3>The &quot;Goldilocks&quot; dilemma of input samples</h3><p>Our experiments quickly surfaced a significant lack of robustness because of extreme <strong>sensitivity to the input sample</strong>. The model&#x27;s performance fluctuated wildly based on the specific log examples included in the prompt. We observed a log similarity problem where the log sample needs to include <em>just diverse enough</em> logs:</p><ul><li>Too homogeneous (overfitting): If the input logs are too similar, the LLM tends to <strong>overspecify</strong>. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. This results in brittle rules that cover a tiny ratio of logs and extract unusable fields.</li><li>Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, &quot;trash logs&quot; like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.</li></ul><h3>The context window constraint</h3><p>We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model&#x27;s output often deteriorated, becoming &quot;messy&quot; or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.</p><h3>The consistency &amp; standardization gap</h3><p>Even when the model successfully generated rules, we noted slight inconsistencies:</p><ul><li>Service naming variations: The model proposed different names for the same entity (e.g., labeling the source as &quot;Spark,&quot; &quot;Apache Spark,&quot; and &quot;Spark Log Analytics&quot; in different runs).</li><li>Field naming variations: Field names lacked standardization (e.g., <code>id</code> vs. <code>service.id</code> vs. <code>device.id</code>).
We normalized names using the standardized <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">Elastic field naming</a> conventions.</li><li>Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.</li></ul><h2>Log format fingerprint</h2><p>To address the challenge of log similarity, we introduce a high-performance heuristic: <strong>log format fingerprint (LFF)</strong>.</p><p>Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified &quot;fingerprint&quot; that allows us to group related logs.</p><p>The mapping logic is simple to ensure speed and consistency:</p><ol><li>Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.</li><li>Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.</li><li>Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.</li><li>Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.</li></ol><p>Let&#x27;s look at an example of how this mapping allows us to transform the logs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/78fa2887486eb9417804354ee3bf2a4fdb0f6383-846x252.png" style="width: 100%;" /><p>As a result, we obtain the following log masks:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec439a3d3a25002498b97defcff733ea5ebc6b55-826x94.png" style="width: 100%;" /><p>Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (<code>0/0/0 0:0:0 a a.a:</code>) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.</p><p>The third log, however, produces a completely divergent fingerprint (<code>0-0-0...</code>). This allows us to algorithmically separate it from the first group <em>before</em> we ever invoke an LLM.</p>
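<p>For reference, here is a minimal Python sketch of the fingerprint function. It mirrors the chain of REPLACE calls in the ES|QL query shown in the next section and is an illustration, not the exact implementation used in Streams:</p><pre><code>import re

def log_format_fingerprint(message: str) -> str:
    """Map a raw log line to its structural fingerprint."""
    fp = re.sub(r"[ \t\n]+", " ", message)  # whitespace runs -> single space
    fp = re.sub(r"[A-Za-z]+", "a", fp)      # letter runs -> 'a'
    fp = re.sub(r"[0-9]+", "0", fp)         # digit runs -> '0'
    fp = re.sub(r"a( a)+", "a", fp)         # words separated by spaces -> 'a'
    return fp                               # punctuation is preserved

raw = "[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6"
print(log_format_fingerprint(raw))</code></pre>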
<h2>Bonus part: Instant implementation with ES|QL</h2><p>It’s as easy as passing this query in Discover:</p><pre><code>FROM loghub |
EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, "[ \t\n]+", " "), "[A-Za-z]+", "a"), "[0-9]+", "0"), "a( a)+", "a") |
STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources=VALUES(filename), example=TOP(raw_message, 3, "desc") BY SUBSTRING(pattern, 0, 15) |
SORT total_count DESC |
LIMIT 100</code></pre><p><strong>Query breakdown:</strong></p><p><strong>FROM</strong> loghub: Targets our index containing the raw log data.</p><p><strong>EVAL</strong> pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to &#x27;0&#x27;, text to &#x27;a&#x27;, etc.) and save the result in a “pattern” field.</p><p><strong>STATS</strong> [column1 =] expression1, … <strong>BY</strong> SUBSTRING(pattern, 0, 15): The clustering step. We group logs that share the first 15 characters of their pattern and create aggregated fields such as the total log count per group, the list of log data sources, the pattern prefix, and three log examples.</p><p><strong>SORT</strong> total_count DESC | <strong>LIMIT</strong> 100: Surfaces the top 100 most frequent log patterns.</p><p>The query results on LogHub are displayed below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b119498f124376c41d242a099bf9081fd6536be8-1600x394.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4dcfc0a5b7fa753497cc5def5ea3cd54449c0481-1600x719.png" style="width: 100%;" /><p>As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (&gt;90%) and achieved majority clustering in 13 out of 16 sources (&gt;60%)—all without requiring additional cleaning, preprocessing, or fine-tuning.</p><p>Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation">log pattern analysis</a>. It provides immediate insights into log relationships and effectively manages large log clusters.</p><ul><li>Versatility as a primitive</li></ul><p>Thanks to the <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">ES|QL</a> implementation, LFF serves both as a standalone tool for fast data diagnostics/visualisations and as a building block in log analysis pipelines for high-volume use cases.</p><ul><li>Flexibility</li></ul><p>LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.</p><ul><li>Deterministic stability</li></ul><p>Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.</p><ul><li>Performance and memory</li></ul><p>It requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.</p><h2>Combining log format fingerprint with an LLM</h2><p>To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.</p><p>The objective was to demonstrate that LFF acts as an effective compression layer.
We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.</p><h2>Execution pipeline</h2><p>We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3f46ea471760b406a32fc7d4bc74cc03faaced2-3840x1660.png" style="width: 100%;" /><p>1. Two-stage hierarchical clustering</p><ul><li>Subclasses (exact match): Logs are aggregated by identical fingerprints. Every log in one subclass shares the exact same format structure.</li><li>Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.</li><li>Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses when the first N characters of their format fingerprints match. This grouping strategy effectively gathers lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.</li></ul><p>2. Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.</p><ul><li>We select representative logs from <em>each</em> valid subclass within the broader metaclass.</li><li>To manage the edge case of too many subclasses, we apply random down-sampling to fit the target window size.</li></ul><p>3. Rule generation. Finally, we prompt the LLM to generate a regex parsing rule that fits all logs in the provided sample for each metaclass. For our PoC, we used the GPT-4o mini model. A condensed sketch of the clustering and sampling stages is shown below.</p>
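<p>The sketch below condenses the clustering and sampling stages into a few lines of Python, reusing the <code>log_format_fingerprint</code> function from earlier. The helper name and sampling details are ours for illustration; the thresholds follow the description above:</p><pre><code>import random
from collections import defaultdict

def build_llm_samples(logs, prefix_len=5, min_ratio=0.05, window=30):
    # Stage 1a: subclasses — group logs by their exact fingerprint
    subclasses = defaultdict(list)
    for line in logs:
        subclasses[log_format_fingerprint(line)].append(line)

    # Stage 1b: outlier cleaning — drop subclasses below 5% of total volume
    total = len(logs)
    subclasses = {fp: grp for fp, grp in subclasses.items()
                  if len(grp) / total >= min_ratio}

    # Stage 1c: metaclasses — group remaining subclasses by fingerprint prefix
    metaclasses = defaultdict(list)
    for fp, grp in subclasses.items():
        metaclasses[fp[:prefix_len]].append(grp)

    # Stage 2: stratified sampling — representatives from each subclass,
    # down-sampled to the target window size
    samples = {}
    for prefix, groups in metaclasses.items():
        picked = [line for grp in groups
                  for line in random.sample(grp, min(len(grp), window // len(groups) + 1))]
        samples[prefix] = picked[:window]
    return samples  # Stage 3 prompts the LLM once per metaclass sample</code></pre>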
<h2>Experimental results &amp; observations</h2><p>We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/49b2b6a1401dd1f33951da68e5a3fac37d0b5aaa-1600x1506.png" style="width: 100%;" /><p>The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. The diagonal alignment demonstrates the model&#x27;s high fidelity in source attribution, with minimal scattering.</p><h2>Performance benchmark insights</h2><ul><li><strong>Optimal baseline:</strong> a context window of <strong>30–40 log samples</strong> per category proved to be the &quot;sweet spot,&quot; consistently producing robust parsing with both Regex and Grok patterns.</li><li><strong>Input minimisation:</strong> we pushed the input size down to 10 logs per category for Regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.</li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>