<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title><![CDATA[Elasticsearch Labs]]></title>
        <description><![CDATA[Articles and tutorials from the Search team at Elastic]]></description>
        <link>https://www.elastic.co/search-labs</link>
        <image>
            <url>https://www.elastic.co/search-labs/favicon.svg</url>
            <title>Elasticsearch Labs</title>
            <link>https://www.elastic.co/search-labs</link>
        </image>
        <generator>Elasticsearch Labs</generator>
        <lastBuildDate>Thu, 09 Apr 2026 14:26:37 GMT</lastBuildDate>
        <atom:link href="https://www.elastic.co/search-labs/rss/feed" rel="self" type="application/rss+xml"/>
        <copyright><![CDATA[© 2026. Elasticsearch B.V. All Rights Reserved]]></copyright>
        <language><![CDATA[en]]></language>
        <item>
            <title><![CDATA[How to build agentic AI applications with Mastra and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build agentic AI applications using Mastra and Elasticsearch through a practical example.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-agentic-ai-applications-mastra-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-agentic-ai-applications-mastra-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Enrico Zimuel]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ab30baf2f908534840c5d71a46705773807baf54-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>In this article, we’ll walk through how to use the <a href="https://mastra.ai/">Mastra</a> TypeScript framework to build agentic applications that interact with <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>.</p><p>We recently contributed to the <a href="https://github.com/mastra-ai/mastra">mastra-ai/mastra</a> open source project by adding support for Elasticsearch as a vector database. With this new feature, you can use Elasticsearch natively in Mastra to store embeddings. In addition to vectors, Elasticsearch provides a suite of advanced features to address all your context engineering requirements (for example, <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">hybrid search and reranking</a>).</p><p>This article details the creation of an agent that implements a retrieval-augmented generation (RAG) architecture using Elasticsearch. We’ll showcase a demo project where an agentic approach is used to interact with a corpus of sci-fi movie data stored within Elasticsearch. The project is available at <a href="https://github.com/elastic/mastra-elasticsearch-example">elastic/mastra-elasticsearch-example</a>.</p><h2>Mastra</h2><p>Mastra is a TypeScript framework for building agentic AI applications.</p><p>A typical Mastra project structure looks like this:</p><pre><code>src/
├── mastra/
│   ├── agents/
│   │   └── weather-agent.ts
│   ├── tools/
│   │   └── weather-tool.ts
│   ├── workflows/
│   │   └── weather-workflow.ts
│   ├── scorers/
│   │   └── weather-scorer.ts
│   └── index.ts
├── .env.example
├── package.json
└── tsconfig.json</code></pre><p>In Mastra, you can build <a href="https://mastra.ai/docs/agents/overview">agents</a>, <a href="https://mastra.ai/docs/agents/using-tools">tools</a>, <a href="https://mastra.ai/docs/workflows/overview">workflows</a>, and <a href="https://mastra.ai/docs/evals/overview">scorers</a>.</p><p>An <strong>agent</strong> is a class that takes a message as input and produces a response as output. An agent can use tools, large language models (LLMs), and a memory (figure 1).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f6aca2dcc7fcc45d25e06681649be1b2b7eb6781-706x721.png" style="width: 100%;" /><p>An agent&#x27;s <strong>tools</strong> allow it to interact with the &quot;external world,&quot; such as communicating with a web API or performing an internal operation, like querying Elasticsearch. The <strong>memory</strong> component is crucial for storing the history of conversations, including past inputs and outputs. This stored context enables the agent to provide more informed and relevant responses to future questions by using its past interactions.</p><p><strong>Workflows</strong> let you define complex sequences of tasks using clear, structured steps rather than relying on the reasoning of a single agent (figure 2). They give you full control over how tasks are broken down, how data moves between them, and what gets executed when. Workflows run using the built-in execution engine by default or can be deployed to <a href="https://mastra.ai/docs/deployment/workflow-runners">workflow runners</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0dce161cf7891207015dc87532b5b90df1822432-880x252.png" style="width: 100%;" /><p>In Mastra, you can also define scorers, which are automated tests that evaluate agent outputs using model-graded, rule-based, and statistical methods. 
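</p><p>To make this concrete, here is a dependency-free sketch of the idea behind a rule-based scorer (the helper name is made up, and this is not Mastra’s scorer API): it returns the fraction of required keywords found in an agent’s answer.</p><pre><code>// Illustrative rule-based scorer (not Mastra's API): score = fraction of
// required keywords that appear in the agent's output, in the range [0, 1].
function keywordCoverageScore(output: string, required: string[]): number {
  const text = output.toLowerCase();
  const hits = required.filter((k) => text.includes(k.toLowerCase())).length;
  return required.length === 0 ? 1 : hits / required.length;
}

console.log(keywordCoverageScore("Elasticsearch stores vectors", ["elasticsearch", "vectors"])); // 1</code></pre><p>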
Scorers return <em>scores</em>: numerical values (typically between 0 and 1) that quantify how well an output meets your evaluation criteria. These scores enable you to objectively track performance, compare different approaches, and identify areas for improvement in your AI systems. Scorers can be customized with your own prompts and scoring functions.</p><h2>Elasticsearch</h2><p>To run the demo project, you need a running Elasticsearch instance. You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and generate an API key to be used for configuring the Mastra integration.</p><p>The API key is shown in the output of the previous command and stored in a <strong>.env</strong> file in the <code>elastic-start-local</code> folder.</p><h2>Install and configure the demo</h2><p>We created an <a href="https://github.com/elastic/mastra-elasticsearch-example">elastic/mastra-elasticsearch-example</a> repository containing the source code of the demo project. The example in the repository illustrates how to create an agent in Mastra that implements a RAG architecture for retrieving documents from Elasticsearch.</p><p>We provide a sci-fi movie dataset for the demo: 500 movies extracted from the IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>.</p><p>First, install the project dependencies with npm:</p><pre><code>npm install</code></pre><p>Then we need to configure the <strong>.env</strong> file that will contain the settings. 
We can generate this file by copying the structure from the <strong>.env.example</strong> file, using the following command:</p><pre><code>cp .env.example .env</code></pre><p>Now we can edit the <strong>.env</strong> file, adding the missing information:</p><pre><code>OPENAI_API_KEY=
ELASTICSEARCH_URL=
ELASTICSEARCH_API_KEY=
ELASTICSEARCH_INDEX_NAME=scifi-movies</code></pre><p>The name of the Elasticsearch index is <strong><code>scifi-movies</code></strong>. If you want, you can change it using the env variable <code>ELASTICSEARCH_INDEX_NAME</code>.</p><p>We use OpenAI as the embedding service, which means that you need to provide an API key for OpenAI in the <code>OPENAI_API_KEY</code> env variable.</p><p>The embedding model used in the example is <a href="https://developers.openai.com/api/docs/models/text-embedding-3-small">openai/text-embedding-3-small</a>, with an embedding dimension of 1536.</p><p>To generate the final answer, we used the <a href="https://developers.openai.com/api/docs/models/gpt-5-nano">openai/gpt-5-nano</a> model to reduce costs.</p><p>The RAG architecture allows you to use a less powerful (and typically less expensive) final LLM model because the heavy lifting of grounding the answer is done by the retrieval component (Elasticsearch in this case).</p><p>The smaller LLM is only responsible for two main tasks:</p><ul><li><strong>Rephrasing/embedding the query:</strong> Converting the user&#x27;s natural language question into a vector embedding for semantic search.</li><li><strong>Synthesizing the answer:</strong> Taking the highly relevant, retrieved context chunks (documents/movies) and synthesizing them into a coherent, final, human-readable answer, following the provided prompt instructions.</li></ul><p>Since the RAG process <strong>provides the exact factual context</strong> needed for the answer, the final LLM doesn&#x27;t need to be massive or highly complex, and it doesn’t need to possess all the required knowledge within its own parameters (which is where large, expensive models excel). It essentially acts as a sophisticated text summarizer and formatter for the context provided by Elasticsearch, rather than as a full-fledged knowledge base itself. 
This enables the use of models like <code>gpt-5-nano</code> for cost and latency optimization.</p><p>After configuring the <strong>.env</strong> file, you can ingest the movies into Elasticsearch using the following command:</p><pre><code>npx tsx src/utility/store.ts</code></pre><p>You should see output like the following:</p><pre><code>🚀 Starting ingestion of 500 movies from 500_scifi_movies.jsonl...
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 1/500 (0%) | ok:1 | fail:0 | chunks:1 | eta:19m 33s | current:Capricorn One
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 2/500 (0%) | ok:2 | fail:0 | chunks:2 | eta:10m 32s | current:Doghouse
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 3/500 (1%) | ok:3 | fail:0 | chunks:3 | eta:7m 33s | current:Dinocroc
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 4/500 (1%) | ok:4 | fail:0 | chunks:7 | eta:6m 10s | current:Back to the Future           
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 5/500 (1%) | ok:5 | fail:0 | chunks:9 | eta:5m 14s | current:The Projected Man            
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 6/500 (1%) | ok:6 | fail:0 | chunks:11 | eta:4m 41s | current:I, Robot
...
✅ Ingestion complete in 1m 46s. Success: 500, Failed: 0, Chunks: 693.</code></pre><p>The mapping of the scifi-movies index contains the following fields:</p><ul><li><strong>embedding</strong>, a dense_vector field with 1536 dimensions and cosine similarity.</li><li><strong>description</strong>, text containing the description of the movie.</li><li><strong>director</strong>, text containing the name of the director.</li><li><strong>title</strong>, text containing the title of the movie.</li></ul><p>We generated the embeddings using the title + description. Since the title and the description are two separate fields, the concatenation of both ensures that the resulting embedding vector captures both the specific, unique identity (title) and the rich, descriptive context (description) of the movie, leading to more accurate and comprehensive semantic search results. This combined input gives the embedding model a better single representation of the document&#x27;s content for similarity matching.</p><h2>Run the demo</h2><p>You can run the demo with the following command:</p><pre><code>npm run dev</code></pre><p>This command starts a web application at <strong>localhost:4111</strong> where you can access Mastra Studio (figure 3).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1567e309df21a12bcf1dfef4429f82342549956c-1705x1079.png" style="width: 100%;" /><p><a href="https://mastra.ai/docs/getting-started/studio">Mastra Studio</a> offers an interactive UI for building and testing your agents, along with a REST API that exposes your Mastra application as a local service. This lets you start building right away without worrying about integration.</p><p>We provided an <strong>Elasticsearch Agent</strong> that uses the <a href="https://mastra.ai/reference/tools/vector-query-tool">createVectorQueryTool</a> by Mastra as a tool for executing semantic search using Elasticsearch. 
This agent uses the RAG approach to search for relevant documents (that is, movies) to answer the user’s question.</p><p>The agent uses the following prompt:</p><pre><code>You are a helpful assistant that answers questions based on the provided context.
Follow these steps for each response:

1. First, carefully analyze the retrieved context chunks and identify key information.
2. Break down your thinking process about how the retrieved information relates to the query.
3. Draw conclusions based only on the evidence in the retrieved context.
4. If the retrieved chunks don't contain enough information, explicitly state what's missing.

Format your response as:
THOUGHT PROCESS:
- Step 1: [Initial analysis of retrieved chunks]
- Step 2: [Reasoning based on chunks]

FINAL ANSWER:
[Your concise answer based on the retrieved context]

Important: When asked to answer a question, please base your answer only on the context provided in the tool. 
If the context doesn't contain enough information to fully answer the question, please state that explicitly and stop. 
Do not add more information than what is present in the retrieved chunks.
Remember: Explain how you're using the retrieved information to reach your conclusions.</code></pre><p>If you click on the <code>Mastra Studio &gt; Agents</code> menu and select <strong>Elasticsearch Agent</strong>, you can test the agent using a chat system. For instance, you can ask about sci-fi movies with a question such as:</p><p><em>Find 5 movies or TV series about UFOs</em>.</p><p>You’ll notice that the agent will execute the vectorQueryTool. You can click on the invoked tool to inspect its input and output. At the end of execution, the LLM will reply to your question, given the context coming from the scifi-movies index of Elasticsearch (figure 4).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d9998d4f687984de98845dae52d1288166abf448-1344x1071.png" style="width: 100%;" /><p>Mastra executes the following steps internally:</p><ol><li><strong>Vector conversion:</strong> The user&#x27;s question, <em>Find 5 movies or TV series about UFOs,</em> is converted into a vector embedding using OpenAI&#x27;s <code>openai/text-embedding-3-small</code> model.</li><li><strong>Vector search:</strong> This embedding is then used to query Elasticsearch via a vector search.</li><li><strong>Result retrieval:</strong> Elasticsearch returns a set of 10 movies highly relevant to the query (that is, those with vectors closest to the user&#x27;s query vector).</li><li><strong>Answer generation:</strong> The retrieved movies and the original user question are sent to the LLM, specifically <code>openai/gpt-5-nano</code>. The LLM processes this information and generates a final answer, ensuring that the user&#x27;s request for five results is met.</li></ol><h2>The Elasticsearch Agent</h2><p>Here is the source code of the Elasticsearch Agent:</p><pre><code>import { Agent } from "@mastra/core/agent";
import { ElasticSearchVector } from '@mastra/elasticsearch';
import { createVectorQueryTool } from '@mastra/rag';
import { ModelRouterEmbeddingModel } from "@mastra/core/llm";
import { Memory } from "@mastra/memory";

const es_url = process.env.ELASTICSEARCH_URL;
const es_apikey = process.env.ELASTICSEARCH_API_KEY;
const es_index_name = process.env.ELASTICSEARCH_INDEX_NAME;
const prompt = 'insert here the previous prompt';

const esVector = new ElasticSearchVector({
  id: 'elasticsearch-vector',
  url: es_url,
  auth: {
    apiKey: es_apikey
  }
});

const vectorQueryTool = createVectorQueryTool({
  vectorStore: esVector,
  indexName: es_index_name,
  model: new ModelRouterEmbeddingModel("openai/text-embedding-3-small")
});

export const elasticsearchAgent = new Agent({
  id: "elasticsearch-agent",
  name: "Elasticsearch Agent",
  instructions: prompt,
  model: 'openai/gpt-5-nano',
  tools: { vectorQueryTool },
  memory: new Memory(),
});</code></pre><p>The <strong>vectorQueryTool</strong> is the tool that’s invoked to implement the retrieval part of the RAG example. It uses the <a href="https://mastra.ai/reference/vectors/elasticsearch">ElasticSearchVector</a> implementation that Elastic contributed to Mastra.</p><p>The agent is an instance of the <code>Agent</code> class that consumes the vectorQueryTool, the prompt, and a memory. As you can see, the code needed to connect Elasticsearch to an agent is very minimal.</p><h2>Conclusion</h2><p>This article demonstrated the simplicity and power of integrating Elasticsearch with the Mastra framework to build sophisticated agentic AI applications. Specifically, we walked through creating a RAG agent capable of performing semantic search over a corpus of sci-fi movie data indexed in Elasticsearch.</p><p>A key takeaway is the direct contribution by Elastic to the Mastra open source project, providing native support for Elasticsearch as a vector store. This integration significantly lowers the barrier to entry, as evidenced by the <strong>Elasticsearch Agent</strong> source code. Using the <code>ElasticSearchVector</code> and <code>createVectorQueryTool</code>, the complete setup for connecting Elasticsearch to your agent requires only a few lines of configuration code.</p><p>Elasticsearch provides several advanced features to enhance result relevance. For example, <a href="https://www.elastic.co/elasticsearch/hybrid-search">hybrid search</a> significantly boosts accuracy by combining lexical search with vector search. Another interesting feature is reranking with the latest <a href="https://www.elastic.co/search-labs/tutorials/jina-tutorial/jina-reranker-v3">Jina models</a>, which can be applied at the end of hybrid search. 
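</p><p>To give a flavor of what this looks like, here is a sketch of a hybrid query against the demo’s index (field names follow the scifi-movies mapping, the query vector is truncated, and the <code>rrf</code> retriever assumes a recent Elasticsearch version where reciprocal rank fusion is available):</p><pre><code>GET scifi-movies/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": { "query": { "match": { "description": "UFO" } } } },
        { "knn": { "field": "embedding", "query_vector": [0.12, -0.03, ...], "k": 10, "num_candidates": 50 } }
      ]
    }
  }
}</code></pre><p>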
To learn more about these techniques, consult the following articles from Elasticsearch Labs:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch">Elasticsearch hybrid search</a> by Valentin Crettaz</li><li><a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">An introduction to Jina models, their functionality, and uses in Elasticsearch</a> by Scott Martens</li></ul><p>We also encourage you to explore the provided example and begin building your own data-powered agents with Mastra and Elasticsearch. For more information about Mastra, you can have a look at the official documentation <a href="https://mastra.ai/docs">here</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Migrating your OpenShift Elasticsearch 6.x cluster to Elastic Cloud on Kubernetes (ECK)]]></title>
            <description><![CDATA[A step-by-step guide on how to migrate from the legacy OpenShift Elasticsearch Operator (ES 6.x) to the modern Elastic Cloud on Kubernetes (ECK).]]></description>
            <link>https://www.elastic.co/search-labs/blog/openshift-elastic-cloud-kubernetes-migration</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/openshift-elastic-cloud-kubernetes-migration</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Omer Kushmaro, Jamie Parker]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f35bf231e4c8ad4f9a79d4ac4d188bfc2e385a3-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Red Hat&#x27;s OpenShift platform has long been a trusted foundation for enterprise Kubernetes workloads, and for years, its built-in Elasticsearch Operator has made log management simple. But all good things evolve: The OpenShift Elasticsearch Operator reached the end of its supported lifecycle with Red Hat OpenShift Container Platform (OCP) 4.13, and the Elasticsearch 6.x clusters it managed are long out of support. Working closely with Red Hat, we&#x27;ve put together this step-by-step guide to help you move from that legacy setup to <strong>Elastic Cloud on Kubernetes</strong> (ECK), the modern, full-featured operator maintained directly by Elastic. The migration path we&#x27;ve designed respects the OpenShift-native tooling you already rely on, keeps disruption to a minimum, and leaves you on a solid foundation for future upgrades to 8.x, 9.x, and beyond.</p><h2>Why this matters</h2><ul><li><strong>Security and support:</strong> The last Elasticsearch 6.x patch release was on January 13, 2022. ECK lets you upgrade at your own pace, with a supported operator from the creators of Elasticsearch. Remaining on an old Elasticsearch version exposes you to supportability risks or well-known security issues.</li><li><strong>Features you’ve been missing:</strong> Autoscaling, data tiers, machine learning (ML) jobs, searchable snapshots. None of these are available in the legacy operator.</li><li><strong>Future-proof operations:</strong> ECK ships <em>day-and-date</em> with every new Elastic release, so you’re never stuck waiting again.</li></ul><h2>High-level plan</h2>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Phase</td><td style="border: 1px solid #ccc; padding: 4px;">Goal</td><td style="border: 1px solid #ccc; padding: 4px;">Outcome</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0</td><td style="border: 1px solid #ccc; padding: 4px;">Snapshot and sanity-check your 6.x cluster.</td><td style="border: 1px solid #ccc; padding: 4px;">You have a backup in case you need it.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">Install ECK 2.16.1 alongside the Red Hat operator.</td><td style="border: 1px solid #ccc; padding: 4px;">Both operators coexist safely.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">Spin up a fresh, production-ready ES 6.8.23 cluster managed by ECK.</td><td style="border: 1px solid #ccc; padding: 4px;">Empty ECK-managed cluster.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">Restore data into that new cluster.</td><td style="border: 1px solid #ccc; padding: 4px;">All indices now live under ECK.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">Point openshift-logging to the new service and retire the old operator.</td><td style="border: 1px solid #ccc; padding: 4px;">Single source of truth.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5</td><td style="border: 1px solid #ccc; padding: 4px;">Rolling-upgrade Elasticsearch to 7.17.29.</td><td style="border: 1px solid #ccc; padding: 4px;">Latest long-term 7.x.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">6</td><td style="border: 1px solid #ccc; padding: 4px;">Upgrade ECK to 3.3.1.</td><td style="border: 1px solid #ccc; padding: 4px;">Operator on current release.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">7</td><td style="border: 1px solid #ccc; padding: 4px;">Schedule your own jump to 8.x & 9.x.</td><td style="border: 1px solid #ccc; padding: 4px;">You control the timing.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">8</td><td style="border: 1px solid #ccc; padding: 4px;">Clean up.</td><td style="border: 1px solid #ccc; padding: 4px;">Old operator removed.</td>
                    </tr>
              </tbody>
            </table>
          <p>Feel free to bookmark this list. Each milestone is small, reversible, and validated before you move on.</p><h3>0. Preflight checks</h3><p><strong>A. Health first: </strong>Run <code>/_cat/health</code> and make sure you’re green.</p><p><strong>B. Disk watermarks:</strong> Keep at least 20% free before starting a migration.</p><p><strong>C. Final snapshot:</strong> S3, GCS, NFS: It doesn’t matter, as long as you can mount the same repo in the new cluster.</p><ol><li>If you don’t have object storage handy in your environment, you can use <a href="https://access.redhat.com/solutions/6408341">this solution-post by Red Hat</a> to snapshot your data to local storage on the OpenShift cluster.</li></ol><p><strong>D. Review the documentation:</strong> Elastic provides thorough documentation for <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-data-between-elasticsearch-clusters-with-minimal-downtime">migrating data between Elasticsearch clusters</a>.</p><h3>1. Installing ECK 2.16.1 (your “bridge” operator)</h3><p>ECK 2.16.1 is the last release that still accepts <code>spec.version: 6.8.x</code>, which makes it the ideal bridge between past and future Elasticsearch versions.</p><pre><code>helm repo add elastic https://helm.elastic.co
helm repo update
oc create namespace elastic-system 

helm install elastic-operator elastic/eck-operator --version=2.16.1 -n elastic-system --create-namespace</code></pre><p>You can keep the Red Hat operator in place; the two watch <em>different</em> Custom Resource Definitions (CRDs), so they won’t step on each other’s toes.</p><p>Keep in mind that with OpenShift, ECK might display some Transport Layer Security (TLS) errors in its logs as OpenShift tries to connect to its healthcheck webhook endpoint via HTTP, but ECK allows TLS communication only. This is a well-known issue and shouldn’t pose a problem.</p><p>Refer to the <a href="https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/install-using-helm-chart">Elastic documentation</a> if you need a local, namespaced installation.</p><h3>2. Launching a 6.x cluster under ECK</h3><p>Below is a starter Kubernetes manifest that balances resiliency with cost (three hot-tier nodes that also act as masters). Swap storage class names, resources, and snapshot credentials to match your environment.</p><p><strong>Note:</strong> The syntax used below is a bit different from what it would be for newer Elasticsearch versions on ECK.</p><pre><code>apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-logs
  namespace: elastic # Create this namespace prior, or use another namespace
spec:
  version: 6.8.23
  nodeSets:
    - name: hot
      count: 3
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            storageClassName: gp3-csi   # adjust if needed
            resources:
              requests:
                storage: 100Gi # Storage may vary depending on your data volume and retention
      config:
        node.master: true
        node.data: true
        node.ingest: true
        node.attr.data: hot
        cluster.routing.allocation.awareness.attributes: data
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 16Gi
                  cpu: 2
                limits:
                  memory: 16Gi
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic
spec:
  version: 6.8.23
  count: 1
  elasticsearchRef:
    name: es-logs
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: 1Gi
              cpu: 0.5
            limits:
              memory: 4Gi</code></pre><p>Deploy it, watch pods come up, and you’re ready for data.</p><h3>3. Moving the data</h3><p>To move data from one Elasticsearch cluster to another, you can also further consult <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-data-between-elasticsearch-clusters-with-minimal-downtime">this guide</a> in the Elastic documentation. For the purpose of this post, we’re assuming that snapshot and restore are used.</p><p><strong>Snapshot and restore</strong> are quickest:</p><pre><code># on the old cluster, take a snapshot
PUT _snapshot/log-backups
{
  "type": "s3",
  "settings": { ... }
}

PUT _snapshot/log-backups/final-snap-2025-08-07

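# Snapshots run in the background; optionally poll the status API and wait
# for the snapshot to report SUCCESS before restoring (illustrative extra step)
GET _snapshot/log-backups/final-snap-2025-08-07/_status
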
# on the new cluster (readonly!)
PUT _snapshot/log-backups
{
  "type": "s3",
  "settings": {
    "readonly": true,
    ...
  }
}

# Perform the restore operation
POST _snapshot/log-backups/final-snap-2025-08-07/_restore</code></pre><p>Can’t share an object store? Use <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-from-a-self-managed-cluster-with-a-self-signed-certificate-using-remote-reindex"><strong>remote re-index</strong></a> (slower, but works everywhere; note that it does not migrate index templates, component templates, and other cluster state) or pump logs through a one-off Logstash job.</p><h3>4. Configuring ClusterLogging operator</h3><p>First, we’ll need to decommission our Red Hat operator–managed Elasticsearch cluster. We’ll modify our ClusterLogging resource like so:</p><pre><code>oc edit clusterlogging instance -n openshift-logging 
---------
 logStore:
    elasticsearch:
      nodeCount: 0 # scale down node count, previously > 0 
      redundancyPolicy: ZeroRedundancy
    type: elasticsearch
  managementState: Managed # this needs to be kept, as it will manage the fluentd instance for us.
  visualization:
    kibana:
      replicas: 0 # scale down kibana as well 
    type: kibana</code></pre><p>Then we’ll define a ClusterLogForwarder to direct the logs from fluentd to our newly built Elasticsearch 6.x cluster managed by ECK. We’ll need to create a secret with the Elasticsearch credentials:</p><pre><code>oc create secret generic eck-es-credentials \
  -n openshift-logging \
  --from-literal=username=elastic \
  --from-literal=password=$(oc get secret es-logs-es-elastic-user -n elastic -o jsonpath='{.data.elastic}' | base64 -d)</code></pre><p>For configuring TLS (as recommended), you’ll need to create a ConfigMap for ClusterLogForwarder to trust the ECK CA certificates. Further <a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/log-collection-and-forwarding">guidance can be found here</a>. We’ll run the following commands: </p><pre><code>oc -n elastic get secret es-logs-es-http-certs-public \
-o go-template='{{index .data "tls.crt" | base64decode}}' > ca.crt

oc -n openshift-logging create configmap eck-es-ca \
--from-file=ca-bundle.crt=ca.crt</code></pre><p>This creates the CA ConfigMap, which we then reference in the ClusterLogForwarder resource:</p><pre><code>apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: eck-es
      type: elasticsearch
      url: https://es-logs-es-http.elastic.svc:9200
      secret:
        name: eck-es-credentials # this secret needs to be created first
      tls:
        # insecureSkipVerify: true # can be used for lab testing purposes
        ca:
          name: eck-es-ca
  pipelines:
    - name: send-to-eck
      inputRefs:
        - application
        - infrastructure
        - audit
      outputRefs:
        - eck-es</code></pre><p>⚠️ <em>If you’re troubleshooting connectivity issues, you can temporarily set <code>tls.insecureSkipVerify: true</code>, but this shouldn’t be used long term.</em></p><p>Because we’re restoring legacy indices into a fresh ECK-managed cluster, OpenShift Logging will not automatically recreate the legacy index layout or aliases. You must ensure that write aliases exist and point to writable indices. In my case, I needed to verify that the proper aliases were set up:</p><ul><li><code>app-write</code></li><li><code>infra-write</code></li><li><code>audit-write</code></li></ul><p>Each alias points to an index with dynamic mappings (not recommended for production, but it minimizes mapping errors and troubleshooting steps during the migration).</p><pre><code># Forward ES port to local machine
oc -n elastic port-forward svc/es-logs-es-http 9200:9200

PASS="$(oc -n elastic get secret es-logs-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)"

# Make sure the write alias points to the correct backing index
curl -s -k -u "elastic:${PASS}" -XPOST "https://localhost:9200/_aliases" \
  -H 'Content-Type: application/json' \
  -d '{
    "actions": [
      { "add": { "index": "infra-000002", "alias": "infra-write", "is_write_index": true } }
    ]
  }'</code></pre><p>Repeat for <code>app-write</code> and <code>audit-write</code> with their respective backing indices. We should now see data flowing toward our new ECK-managed cluster.</p><h3>5. Rolling upgrade to 7.17.29, and verify</h3><p>Now you can finally leave 6.x behind.</p><p>A. Check <code>_xpack/migration/deprecations?pretty</code> using curl against Elasticsearch, to tackle deprecations. This API will return either warnings or critical issues to address before you upgrade.</p><p>B. Patch the CRD to upgrade it to the latest 7.x version. I’m using 7.17.29.</p><pre><code>oc -n elastic patch elasticsearch es-logs --type=merge -p '{"spec":{"version":"7.17.29"}}'</code></pre><p>C. ECK restarts nodes <em>one at a time</em>. Your cluster should be online throughout.</p><p>D. Give cluster tasks and shard recoveries time to settle before pressing on.</p><p>E. Don’t forget to upgrade Kibana in the same way.</p><pre><code>oc -n elastic patch kibana kibana --type=merge -p '{"spec":{"version":"7.17.29"}}'</code></pre><p>Once complete, check your Elasticsearch version and Kibana version, as well as the health state:</p><pre><code>oc -n elastic get elasticsearch es-logs
oc -n elastic get kibana kibana</code></pre><h3>6. Operator upgrade: ECK 2.16.1 → 3.3.1</h3><p>ECK upgrades are pleasantly boring:</p><pre><code>helm upgrade elastic-operator elastic/eck-operator -n elastic-system --version 3.3.1</code></pre><p>Watch the operator pod roll. Your Elasticsearch cluster keeps running; only the controller restarts.</p><p>Verify that the upgrade is successful by looking at the operator logs and ensuring that no major errors appear:</p><pre><code>oc logs -n elastic-system sts/elastic-operator</code></pre><p>And then verifying the new version of the operator (will now be 3.3.1):</p><pre><code>helm -n elastic-system list</code></pre><h3>7. Your roadmap to 8.x and 9.x (when you’re ready)</h3><p>You’re now on:</p><ul><li><strong>ECK Operator:</strong> 3.3.1</li><li><strong>Elastic Stack:</strong> 7.17.29</li></ul><p>That pair is fully supported and serves as the official launchpad for 8.x. It’s important to first go through the <a href="https://www.elastic.co/guide/en/security/8.19/upgrade-7.17-8x.html">Elastic upgrade documentation</a>.</p><p>We’ll again go through the procedure of checking for any hard-breaking changes between our 7.17.29 and the latest 8 version (8.19.9):</p><pre><code>GET _migration/deprecations?pretty</code></pre><p>It&#x27;s important to look through the result of this query carefully and to go through necessary steps, like re-indexing indices and changing mappings, among others.</p><p>Once you’ve addressed all required changes from 7.17.29 to 8.x:</p><pre><code>oc -n elastic patch elasticsearch es-logs --type=merge -p '{"spec":{"version":"8.19.9"}}'
oc -n elastic patch kibana kibana --type=merge -p '{"spec":{"version":"8.19.9"}}'</code></pre><p>ECK will handle the rest. Just remember to upgrade Beats, Logstash pipelines, and client libraries in lockstep to avoid wire-protocol surprises.</p><p>Repeat the process again to migrate to the <strong>latest 9.x</strong> version.</p><h3>8. Cleanup</h3><ul><li>Remove the Red Hat Elasticsearch operator.</li></ul><p>Now that you’re no longer using the Red Hat Elasticsearch operator, you can remove it from your cluster. You can do that via the following steps:</p><p>A. In the OpenShift Console, go to <strong>Operators</strong> and then to <strong>Installed Operators</strong>.</p><p>B. In the <strong>Filter By Name</strong> field, enter “Elasticsearch” to find the installed Red Hat Elasticsearch operator.</p><p>C. On the <strong>Operator Details</strong> page, select <strong>Uninstall Operator</strong> from the <strong>Actions</strong> list.</p><p>D. On the <strong>Uninstall Operator?</strong> dialog box, select <strong>Uninstall</strong>. This removes the operator, the operator deployments, and the pods. After this step, the operator stops running and will no longer receive updates.</p><p>All of these steps can be found in <a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html-single/operators/index#olm-deleting-operators-from-a-cluster-using-web-console_olm-deleting-operators-from-a-cluster">this link</a> from Red Hat OpenShift documentation.</p><h2>Wrapping up</h2><p>By installing <strong>ECK 2.16.1</strong> as a bridge, snapshot-restoring into a new cluster, and stepping cleanly through 7.x before landing on <strong>ECK 3.3</strong>, you’ve transformed an aging, unsupported logging back end into a modern, secure, first-class Elastic deployment, without surprises or downtime.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Semantic search, now multilingual by default]]></title>
            <description><![CDATA[semantic_text now defaults to jina-embeddings-v5-text on Elastic Inference Service, enabling multilingual semantic search in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/multilingual-semantic-search-jina-embeddings-v5-text</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/multilingual-semantic-search-jina-embeddings-v5-text</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kapil Jadhav, Brendan Jugan, Ranjana Devaji]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/205fc7de02c5d84b99624bffc0488a46df2d3b07-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we’re pleased to announce that <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> now defaults to the <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><code>jina-embeddings-v5-text</code></a> family on <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>, bringing built-in multilingual inference with no additional configuration required.</p><p>EIS provides managed, GPU-accelerated inference tightly integrated with Elasticsearch. With EIS, you don’t need to host, scale, or maintain infrastructure for embedding models.</p><p><a href="https://www.elastic.co/docs/solutions/search/semantic-search">Semantic search</a> retrieves results based on meaning. Text is converted into vector embeddings so queries can match related concepts, even when the exact words differ.</p><p>The <code>semantic_text</code> field type simplifies this entire workflow, with <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">automatic chunking</a>, embedding generation at index time, and seamless querying via the <code>semantic</code> query, without building custom pipelines or managing separate model inference.</p><p>The <code>jina-embeddings-v5-text</code> model family <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">just launched on EIS</a>, giving developers powerful multilingual embeddings accessible as part of the core <code>semantic_text</code> workflow. 
So now your semantic search works across languages out of the box, and global datasets, such as support articles, product descriptions, user reviews, and multilingual websites, work without extra configuration.</p><p>This default opens up broad, globe-spanning semantic retrieval with no operational overhead.</p><h2>jina-embeddings-v5-text</h2><p>The <strong><code>jina-embeddings-v5-text</code></strong> models represent the latest generation of compact, high-performance multilingual embedding models on EIS.</p><ul><li><strong>State-of-the-art multilingual performance:</strong> Top scores on <a href="https://huggingface.co/spaces/mteb/leaderboard">MMTEB benchmarks</a> across hundreds of languages. <code>jina-embeddings-v5-text-nano</code> leads models under 500M parameters, and <code>jina-embeddings-v5-text-small</code> outperforms significantly larger alternatives.</li><li><strong>Multiple task capabilities:</strong> Spanning across retrieval, semantic matching, clustering, and classification.</li><li><strong>Flexible choices to fit your use case:</strong> Two model sizes (<code>small</code>, <code>nano</code>) let you balance speed, cost, and quality.</li><li><strong>Long-context support:</strong> Embed long texts efficiently, ideal for document collections with extended context.</li></ul><h2>Get started</h2><h3>1. Create index</h3><p>Define a <code>semantic_text</code> field with no additional configuration. Embeddings will be generated automatically at index time using the default model. For production workloads, explicitly specify the model to ensure consistent behavior and results.</p><pre><code>PUT /multilingual-reviews
{
  "mappings": {
    "properties": {
      "product": { "type": "keyword" },
      "review": { "type": "semantic_text" },
      "language": { "type": "keyword" }
    }
  }
}</code></pre><h3>2. Index multilingual documents</h3><p>Add product reviews in six different languages. Each document’s review field is automatically embedded at ingest time, with no separate pipeline or preprocessing needed.</p><pre><code>POST /multilingual-reviews/_bulk?refresh=wait_for
{ "index": { "_id": "1" } }
{ "product": "wireless-headphones", "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.", "language": "en" }
{ "index": { "_id": "2" } }
{ "product": "wireless-headphones", "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.", "language": "es" }
{ "index": { "_id": "3" } }
{ "product": "wireless-headphones", "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。", "language": "ja" }
{ "index": { "_id": "4" } }
{ "product": "wireless-headphones", "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.", "language": "fr" }
{ "index": { "_id": "5" } }
{ "product": "wireless-headphones", "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.", "language": "de" }
{ "index": { "_id": "6" } }
{ "product": "wireless-headphones", "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.", "language": "pt" }</code></pre><h3>3. Search across languages with a query in English</h3><pre><code>GET /multilingual-reviews/_search
{
  "query": {
    "match": {
      "review": "comfortable for long flights"
    }
  }
}</code></pre><p>The results show all six reviews ranked by semantic relevance to the English query:</p><pre><code>{
  "took": 83,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.8275735,
    "hits": [
      {
        "_index": "multilingual-reviews",
        "_id": "4",
        "_score": 0.8275735,
        "_source": {
          "product": "wireless-headphones",
          "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.",
          "language": "fr"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "1",
        "_score": 0.7616198,
        "_source": {
          "product": "wireless-headphones",
          "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.",
          "language": "en"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "5",
        "_score": 0.72122526,
        "_source": {
          "product": "wireless-headphones",
          "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.",
          "language": "de"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "2",
        "_score": 0.6867013,
        "_source": {
          "product": "wireless-headphones",
          "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.",
          "language": "es"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "3",
        "_score": 0.66513836,
        "_source": {
          "product": "wireless-headphones",
          "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。",
          "language": "ja"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "6",
        "_score": 0.61658823,
        "_source": {
          "product": "wireless-headphones",
          "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.",
          "language": "pt"
        }
      }
    ]
  }
}</code></pre><p>Notice that the French review ranks first, even above the English one. That&#x27;s because &quot;<em>très confortable pour les longs trajets en avion</em>&quot; (&quot;very comfortable for long trips by plane&quot;) is a closer semantic match to the query than the English review, which splits its focus across noise cancellation, battery life, and flights. This demonstrates the ability of <code>jina-embeddings-v5-text-small</code> to rank by meaning, not language.</p><h3>4. Search across languages with a Japanese query</h3><pre><code>GET /multilingual-reviews/_search
{
  "query": {
    "match": {
      "review": "長時間のフライトに最適"
    }
  }
}</code></pre><p>The results show all six reviews ranked by semantic relevance to the Japanese query (“Ideal for long-haul flights”):</p><pre><code>{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.7556782,
    "hits": [
      {
        "_index": "multilingual-reviews",
        "_id": "4",
        "_score": 0.7556782,
        "_source": {
          "product": "wireless-headphones",
          "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.",
          "language": "fr"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "1",
        "_score": 0.7395687,
        "_source": {
          "product": "wireless-headphones",
          "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.",
          "language": "en"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "5",
        "_score": 0.68835545,
        "_source": {
          "product": "wireless-headphones",
          "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.",
          "language": "de"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "3",
        "_score": 0.6487931,
        "_source": {
          "product": "wireless-headphones",
          "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。",
          "language": "ja"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "6",
        "_score": 0.6241487,
        "_source": {
          "product": "wireless-headphones",
          "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.",
          "language": "pt"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "2",
        "_score": 0.6183049,
        "_source": {
          "product": "wireless-headphones",
          "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.",
          "language": "es"
        }
      }
    ]
  }
}</code></pre><p>The ranking is nearly identical to the English query: French and English still lead because they&#x27;re the most semantically relevant to &quot;perfect for long flights,&quot; regardless of query language. The Japanese review didn&#x27;t get artificially boosted just because the query was in Japanese. It ranks fourth because it discusses wearing comfort, not flights. Semantic relevance takes priority over language matching.</p><h2>Note: For English-only use cases</h2><p>If you prefer a sparse representation or would like to continue to use Elastic Learned Sparse EncodeR (ELSER) for English workloads, ELSER remains available and fully supported as an option for <code>semantic_text</code>.</p><p>You can explicitly choose ELSER by specifying <code>inference_id: &quot;.elser-2-elastic&quot;</code> in your mappings when creating an index.</p><h2>Conclusion: Semantic search without borders</h2><p>With <code>semantic_text</code> now defaulting to the <code>jina-embeddings-v5-text</code> family on Elastic Inference Service, multilingual semantic search becomes the standard developer experience in Elasticsearch. This means developers can build search, retrieval augmented generation (RAG), and AI applications that work across global datasets without stitching pipelines together.</p><p>Create a <code>semantic_text</code> field, index your data, and start searching. All Elastic Cloud trials have access to Elastic Inference Service. Try it now on <a href="https://cloud.elastic.co/serverless-registration">Elastic Cloud Serverless</a> or Elastic Cloud Hosted, or use <a href="https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service">EIS via Cloud Connect</a> with your self-managed cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to compare two Elasticsearch indices and find missing documents]]></title>
            <description><![CDATA[Exploring approaches for comparing two Elasticsearch indices and finding missing documents.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-index-comparison</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-index-comparison</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[David Pilato]]></dc:creator>
            <pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4c7a5087904f28ee8cdb468a2e8cc10e7de38353-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When managing Elasticsearch indices, you may need to verify that all documents present in one index also exist in another, such as after a reindex operation, a migration, or a data pipeline. Elasticsearch doesn&#x27;t provide a built-in &quot;diff&quot; command for this, but the right approach depends on one key question: <strong>Are your document IDs stable between the two indices?</strong></p><h2>The problem</h2><p>Imagine you have two indices, <code>index-a</code> (source) and <code>index-b</code> (target), and you want to find all documents that exist in <code>index-a</code> but are missing from <code>index-b</code>.</p><p>A naive approach, querying both indices and comparing results in memory, won&#x27;t scale. Elasticsearch is designed to handle millions of documents, and loading them all at once isn’t practical.</p><p>There are two scenarios:</p><ol><li><strong>IDs are stable</strong>: Both indices use the same <code>_id</code> for the same document (for example, <code>emp_no</code> as the document ID). This is the easy case.</li><li><strong>IDs are generated</strong>: Documents were ingested through different pipelines that assigned random or sequential IDs. You can&#x27;t compare by <code>_id</code>; you need to match on content.</li></ol><p>Let&#x27;s walk through both.</p><h2>Step 0 — A lighter CLI for Elasticsearch</h2><p>All the examples in this post use <a href="https://github.com/Anaethelion/escli-rs">escli</a>, a small Rust command-line interface (CLI) that wraps the Elasticsearch REST API. It reads your cluster URL and credentials from environment variables, so you don’t have to repeat authentication headers on every command.</p><p>To see why that matters, here&#x27;s a typical <code>_search</code> call with raw <code>curl</code>:</p><pre><code>curl -X GET \
  -H "Authorization: ApiKey $ELASTIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query":{"term":{"user.id":"kimchy"}}}' \
  "$ELASTICSEARCH_URL/my-index-000001/_search"</code></pre><p>With <code>escli</code>, the same request becomes:</p><pre><code>./escli search --index my-index-000001 <<< '{"query":{"term":{"user.id":"kimchy"}}}'</code></pre><p>The credentials live in a <code>.env</code> file that escli sources automatically — no <code>-H &quot;Authorization: ...&quot;</code> on every call, no risk of leaking secrets in shell history. The request body is passed via stdin (<code>&lt;&lt;&lt;</code>), which makes it easy to pipe in multi-line JSON built dynamically with <code>jq</code>.</p><h2>Step 1 — Count documents in both indices</h2><p>Before doing a full scan, get a quick count of each index. If the counts match, the indices are likely in sync, and there’s no need to scan at all.</p><pre><code>./escli count --index index-a
./escli count --index index-b</code></pre><p>The <code>_count</code> API returns:</p><pre><code>{ "count": 1000000 }</code></pre><p>If the counts differ, proceed to the full comparison.</p><h2>Step 2 — When IDs mean something: Use op_type=create</h2><p>If both indices use the same <code>_id</code> for the same document, for example, because you indexed documents using a functional business key like <code>emp_no</code> rather than a generated UUID, you can find and fix missing documents in a single <code>_reindex</code> call.</p><h3>Why functional IDs matter</h3><p>Using a meaningful field as <code>_id</code> (instead of a random UUID) is a best practice when the data has a natural key. It means:</p><ul><li>The same document always gets the same <code>_id</code>, regardless of which pipeline ingested it.</li><li>You can easily update or delete documents by ID.</li><li>You can use <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-index#operation-index-op_type"><code>op_type=create</code></a> to skip documents that already exist in the target.</li><li>No client-side scanning or comparison is needed.</li></ul><h3>The op_type=create trick</h3><p><code>_reindex</code> with <code>op_type=create</code> tries to create each document from the source in the target. If a document with the same <code>_id</code> already exists, Elasticsearch reports it as a <code>version_conflict</code> and moves on. It <strong>doesn’t</strong> overwrite the existing document. Setting <code>conflicts=proceed</code> tells the API to continue instead of aborting on the first conflict.</p><pre><code>./escli reindex <<< '{
  "source": { "index": "index-a" },
  "dest":   { "index": "index-b", "op_type": "create" },
  "conflicts": "proceed"
}'</code></pre><p>The response tells you exactly what happened:</p><pre><code>{
  "total": 1000000,
  "created": 49594,
  "version_conflicts": 950406,
  "failures": []
}</code></pre><ul><li><code>created</code>: Documents that were missing from <code>index-b</code> and have now been added.</li><li><code>version_conflicts</code>: Documents that already existed in <code>index-b</code> and were left untouched.</li></ul><p><strong>No scanning, no client-side comparison, no intermediate file.</strong> Everything happens server-side in about six seconds on a 1M-document dataset.</p><h2>Step 3 — When IDs are not stable: Business-key comparison</h2><p>Sometimes you can&#x27;t rely on <code>_id</code>. A document pipeline that generates IDs at ingestion time will assign a different <code>_id</code> each time the same record is processed. If <code>index-a</code> and <code>index-b</code> were populated by two such pipelines, the same employee record might have <code>_id: &quot;abc123&quot;</code> in one index and <code>_id: &quot;xyz789&quot;</code> in the other, even though the underlying data is identical.</p><p>In this case, you need to match documents by content rather than by ID. The key is to identify a set of fields that together form a unique business key.</p><p>For an employee dataset, a reasonable business key is <code>(first_name, last_name, birth_date)</code>. A document in <code>index-a</code> is &quot;missing&quot; from <code>index-b</code> if no document in <code>index-b</code> has the same combination of those three fields.</p><h3>3a — Scan the source with PIT + search_after</h3><p>Open a <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-open-point-in-time">point in time (PIT)</a> on the source index to get a consistent snapshot, and then paginate through it, fetching only the business-key fields:</p><pre><code>./escli open_point_in_time index-a 5m
# → { "id": "46ToAwMDaWR..." }</code></pre><pre><code>./escli search <<< '{
  "size": 10000,
  "_source": ["first_name", "last_name", "birth_date"],
  "pit": { "id": "46ToAwMDaWR...", "keep_alive": "5m" },
  "sort": [{ "_shard_doc": "asc" }]
}'</code></pre><p>The sort key <code>_shard_doc</code> is the most efficient sort for full-index pagination: it uses the internal Lucene document order with no overhead. Repeat with <code>search_after</code> until the response contains zero hits. Always close the PIT when done:</p><pre><code>./escli close_point_in_time <<< '{"id": "46ToAwMDaWR..."}'</code></pre><h3>3b — Check each page against the target via _msearch</h3><p>For each page of source documents, build one <code>_msearch</code> request with one subquery per document. Each subquery uses a <code>bool/must</code> on the three business-key fields and requests <code>size: 0</code>; we only need to know whether a match exists, we don’t need to retrieve the document itself.</p><pre><code>./escli msearch << 'EOF'
{"index": "index-b"}
{"size":0,"query":{"bool":{"must":[{"term":{"first_name.keyword":"Alice1"}},{"term":{"last_name.keyword":"Smith"}},{"term":{"birth_date":"1985-03-12"}}]}}}
{"index": "index-b"}
{"size":0,"query":{"bool":{"must":[{"term":{"first_name.keyword":"Bob2"}},{"term":{"last_name.keyword":"Jones"}},{"term":{"birth_date":"1990-07-24"}}]}}}
EOF</code></pre><p>The response contains one entry per subquery, in the same order:</p><pre><code>{
  "responses": [
    { "hits": { "total": { "value": 1 } } },
    { "hits": { "total": { "value": 0 } } }
  ]
}</code></pre><p><code>total.value == 0</code> means no document in <code>index-b</code> matches that business key; the document is missing. Collect the corresponding <code>_id</code> from the source page.</p><blockquote><strong>Note on</strong> <strong><code>.keyword</code></strong> <strong>subfields</strong>: <code>term</code> queries require exact (keyword) matching. The <code>first_name</code> and <code>last_name</code> fields must have a <code>.keyword</code> subfield in the index mapping. The demo&#x27;s <code>mapping.json</code> includes this.</blockquote><h3>3c — Speed it up with split-by-date</h3><p>If the business key includes a date field, you can partition the source into date slices and run each slice as an independent job. Each slice opens its own PIT with a <code>range</code> filter on <code>birth_date</code>, runs its own msearch loop, and writes its results to a separate file. The parent script launches all slices in parallel and aggregates the results when they’re all done.</p><p>But depending on your use case, you might want to partition by a different field; for example, if you have a <code>team</code> field, you could run one slice per team. The key is to find a field that allows you to split the data into reasonably even chunks that can be processed in parallel.</p><pre><code>[compare] Launching 5 slices in parallel...

  → Slice 1: 1960-01-01 → 1969-12-31 ✅ — 244408 checked, 12207 missing
  → Slice 2: 1970-01-01 → 1979-12-31 ✅ — 243624 checked, 12212 missing
  → Slice 3: 1980-01-01 → 1989-12-31 ✅ — 243551 checked, 11921 missing
  → Slice 4: 1990-01-01 → 1999-12-31 ✅ — 243895 checked, 11991 missing
  → Slice 5: 2000-01-01 → 2009-12-31 ✅ — 24522 checked, 1263 missing</code></pre><h2>Performance on a 1M dataset</h2><p>To validate the approaches, the demo generates 1,000,000 documents in <code>index-a</code> and deliberately skips ~5% in <code>index-b</code> (49,594 missing documents), and then runs the full compare → reindex cycle.</p><p>Results on a MacBook M3 Pro:</p><p><strong>Comparison</strong> (<code>compare-indices.sh</code>):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <th style="border: 1px solid #ccc; padding: 4px;">Strategy</th><th style="border: 1px solid #ccc; padding: 4px;">Compare</th><th style="border: 1px solid #ccc; padding: 4px;">Reindex</th><th style="border: 1px solid #ccc; padding: 4px;">Total</th><th style="border: 1px solid #ccc; padding: 4px;">How it works</th>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">op_type</td><td style="border: 1px solid #ccc; padding: 4px;">—</td><td style="border: 1px solid #ccc; padding: 4px;">6s</td><td style="border: 1px solid #ccc; padding: 4px;">6s</td><td style="border: 1px solid #ccc; padding: 4px;">Full _reindex server-side, skips existing</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">business-key</td><td style="border: 1px solid #ccc; padding: 4px;">1m 38s</td><td style="border: 1px solid #ccc; padding: 4px;">4s</td><td style="border: 1px solid #ccc; padding: 4px;">1m 42s</td><td style="border: 1px solid #ccc; padding: 4px;">PIT scan + _msearch by business key</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">split-by-date</td><td style="border: 1px solid #ccc; padding: 4px;">32s</td><td style="border: 1px solid #ccc; padding: 4px;">4s</td><td style="border: 1px solid #ccc; padding: 4px;">36s</td><td style="border: 1px solid #ccc; padding: 4px;">Same as business-key, 5 slices in parallel</td>
                    </tr>
              </tbody>
            </table>
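<p>As an aside, the date partitioning behind the <code>split-by-date</code> strategy can be sketched in a few lines of Python (a hypothetical helper for illustration; the demo repository drives this from shell scripts). It splits an inclusive date range into contiguous, non-overlapping slices, each of which becomes the bounds of a <code>range</code> filter on <code>birth_date</code> for one parallel job:</p><pre><code>from datetime import date, timedelta

def date_slices(start, end, n):
    """Split the inclusive range [start, end] into n contiguous,
    non-overlapping (slice_start, slice_end) date ranges."""
    total_days = (end - start).days + 1
    base, extra = divmod(total_days, n)
    slices = []
    cursor = start
    for i in range(n):
        # Spread the remainder days across the first `extra` slices
        length = base + (1 if i < extra else 0)
        slice_end = cursor + timedelta(days=length - 1)
        slices.append((cursor, slice_end))
        cursor = slice_end + timedelta(days=1)
    return slices

# Each pair becomes one independent PIT + _msearch job.
for s, e in date_slices(date(1960, 1, 1), date(2009, 12, 31), 5):
    print(s.isoformat(), "to", e.isoformat())</code></pre>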
          <p>The <code>op_type=create</code> approach is fastest because everything is server-side and requires no client-side scanning. The <code>split-by-date</code> strategy cuts the <code>business-key</code> duration from 1m 38s down to 36s through parallelism: not bad for a comparison across two 1M-document indices.</p><h2>Decision tree</h2><pre><code>Are _id values stable between both indices?
├── Yes → _reindex with op_type=create          (6s, server-side)
└── No  → Do you have a reliable business key?
          ├── Yes, simple scan is fast enough → business-key   (1m 42s)
          └── Yes, and you need more speed    → split-by-date  (36s, parallel)</code></pre><h2>Conclusion</h2><p>Elasticsearch doesn&#x27;t offer a native index diff command, but the right strategy depends on your data model:</p><ul><li><strong>Use functional</strong> <strong><code>_id</code>s</strong> (a natural business key like <code>emp_no</code>) whenever possible. It unlocks the simplest and fastest approach: <code>_reindex</code> with <code>op_type=create</code> finds and fills gaps in one server-side call.</li><li><strong>When IDs are unstable</strong>, match by business key using PIT + <code>_msearch</code>. Partition by a field and run slices in parallel to recover most of the performance. If you find yourself doing this regularly, consider computing a hash of your business key fields and using it as <code>_id</code> at ingestion time. You get the best of both worlds: stable IDs and efficient lookups.</li></ul><p>The complete demo, including dataset generation, comparison scripts, and reindex scripts, is available at <a href="https://github.com/dadoonet/blog-compare-indices/">https://github.com/dadoonet/blog-compare-indices/</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring Kibana dashboard views with Elastic Workflows]]></title>
            <description><![CDATA[Learn how to use Elastic Workflows to collect Kibana dashboard view metrics every 30 minutes and index them into Elasticsearch, so you can build custom analytics and visualizations on top of your own data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/monitor-kibana-dashboard-views-elastic-workflows</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/monitor-kibana-dashboard-views-elastic-workflows</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0ed94ce00962287b5507f45c92ecb60fdcbf2718-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/kibana">Kibana</a> tracks how many times each dashboard is viewed, but that data isn’t exposed natively in any built-in dashboard. In this article, we’ll use <strong>Elastic Workflows</strong> to automatically collect that data every 30 minutes and index it into Elasticsearch, so we can build our own analytics on top of it.</p><p><a href="https://www.elastic.co/docs/explore-analyze/workflows">Elastic Workflows</a> is a built-in automation engine inside Kibana that lets you define multistep processes using a simple YAML configuration. Each workflow can be triggered on a schedule or event or as a tool in <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic Agent Builder</a>, and each step can call Kibana APIs, query Elasticsearch, or transform data.</p><p>We’ll use dashboard view counts as a concrete example, but the same pattern applies to any metric exposed through the Kibana saved objects API.</p><h2>Prerequisites</h2><ul><li><a href="https://www.elastic.co/cloud">Elastic Cloud</a> or <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed">self-managed </a>cluster running 9.3</li><li><a href="https://www.elastic.co/docs/explore-analyze/workflows/get-started#workflows-prerequisites">Workflows enabled</a> (Advanced settings)</li></ul><h2>Step 1: Explore the raw data in <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/console">Dev Tools</a></h2><p>Before building anything, let&#x27;s understand what data we have. Kibana stores most of its configuration and metadata as <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-saved-objects">saved objects</a> in a dedicated internal index. One of the things Kibana tracks this way is dashboard view counts, using a special saved object type called usage counters. 
You can query them directly from Dev Tools:</p><pre><code>GET kbn:/api/saved_objects/_find?type=usage-counter&filter=usage-counter.attributes.domainId:"dashboard"%20and%20usage-counter.attributes.counterType:"viewed"&per_page=10000</code></pre><p>The response looks like this:</p><pre><code>{
  "page": 1,
  "per_page": 10000,
  "total": 1,
  "saved_objects": [
    {
      "type": "usage-counter",
      "id": "dashboard:346f3c64-ebca-484d-9d57-ec600067d596:viewed:server:20260310",
      "attributes": {
        "domainId": "dashboard",
        "counterName": "346f3c64-ebca-484d-9d57-ec600067d596",
        "counterType": "viewed",
        "source": "server",
        "count": 1
      },
      ...
    }
  ]</code></pre><p>The <code>counterName</code> field is the dashboard ID, and <code>count</code> is the cumulative view count for that dashboard on that specific day. Kibana creates one counter object per dashboard per day; you can see the date suffix in the object ID (...viewed:server:20260310). The count grows throughout the day as users open the dashboard.</p><p>Rather than replicating this daily-document model in our index, we’ll create one document per workflow execution. Each document records how many views that dashboard had accumulated for the day at the moment of capture.</p><h2>Step 2: Create the destination index</h2><p>We need an index to store our dashboard view snapshots. The following command creates it with explicit mappings so we can aggregate and visualize later. Run this in Dev Tools:</p><pre><code>PUT dashboard-views
{
  "mappings": {
    "properties": {
      "captured_at": {
        "type": "date"
      },
      "dashboard_id": {
        "type": "keyword"
      },
      "dashboard_name": {
        "type": "keyword"
      },
      "view_count": {
        "type": "integer"
      }
    }
  }
}</code></pre><p>Using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/keyword"><code>keyword</code></a> mappings for IDs and names allows <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregations</a>. Using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/number"><code>integer</code></a> for <code>view_count</code> is a safe default: since Kibana resets the counter daily, reaching the 32-bit limit (more than 2 billion views in a single day) isn’t a realistic concern. The <code>integer</code> type still supports numeric operations such as <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-max-aggregation"><code>max</code></a>, <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-avg-aggregation"><code>avg</code></a>, and <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-min-aggregation"><code>min</code></a>.</p><h2>Step 3: Create the workflow</h2><p>Go to <strong>Stack Management &gt; Workflows &gt; New Workflow</strong>, and paste the following workflow YAML configuration:</p><pre><code>name: dashboard-views-ingestion
triggers:
  - type: scheduled
    with:
      every: 30m

steps:
  - name: fetch_dashboard_views
    type: kibana.request
    with:
      method: GET
      path: >-
        /api/saved_objects/_find?type=usage-counter&per_page=10000&filter=usage-counter.attributes.domainId:"dashboard"%20and%20usage-counter.attributes.counterType:"viewed"

  - name: index_each_dashboard
    type: foreach
    foreach: "{{ steps.fetch_dashboard_views.output.saved_objects }}"
    steps:
      - name: fetch_dashboard_name
        type: kibana.request
        with:
          method: GET
          path: /api/saved_objects/dashboard/{{ foreach.item.attributes.counterName }}
        on-failure:
          continue: true

      - name: index_doc
        type: elasticsearch.request
        with:
          method: POST
          path: /dashboard-views/_doc
          body:
            dashboard_id: "{{ foreach.item.attributes.counterName }}"
            dashboard_name: "{{ steps.fetch_dashboard_name.output.attributes.title }}"
            view_count: "${{ foreach.item.attributes.count | plus: 0 }}"
            captured_at: "{{ execution.startedAt | date: '%Y-%m-%dT%H:%M:%SZ' }}"</code></pre><p>In the next section, let&#x27;s break down the workflow step by step.</p><h3>How the workflow works</h3><h4>Triggers</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5670991d65c64ee833924225c2d375a1be868b13-325x162.png" style="width: 100%;" /><p>The workflow runs on a scheduled trigger every 30 minutes. This gives us time-series data without hammering the API.</p><h4>fetch_dashboard_views</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66eaec147c3d01c524c67cf1c7f663ac56a3259d-812x215.png" style="width: 100%;" /><p>Uses <code>kibana.request</code> to call the Kibana saved objects API. No authentication setup is needed: The workflow engine automatically attaches the correct headers based on the execution context.</p><h4>index_each_dashboard (foreach)</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aad45e8aed8dc81ded6260cd6199ff78dcffe3b4-1892x290.png" style="width: 100%;" /><p>Iterates over the <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-saved-objects"><code>saved_objects</code></a> array returned by the previous step. The current item in each iteration is available as <code>foreach.item</code>. Inside the loop, we run two nested steps for each dashboard.</p><p><strong>1. <code>fetch_dashboard_name</code>:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db21ec5094b743018b9cd66c5052681f14c7d7e3-1999x431.png" style="width: 100%;" /><p>Resolves the human-readable dashboard title by calling <code>GET /api/saved_objects/dashboard/{id}</code>. We add <code>on-failure: continue: true</code> so that if a dashboard was deleted but still has view counters, the loop continues instead of failing the whole execution.</p><p><strong>2. 
<code>index_doc</code>:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f49dd0c9f0817bb1e1e5d9f4a2b05d13ef331054-1999x626.png" style="width: 100%;" /><p>Indexes each document using <code>POST /dashboard-views/_doc</code> (without an explicit ID), which lets Elasticsearch auto-generate IDs. This creates a new document on every run, building a history of view counts over time rather than overwriting the previous snapshot.</p><p>Two things worth noting:</p><ul><li>The <code>captured_at</code> field uses the date filter to format the timestamp as <a href="https://www.iso.org/iso-8601-date-and-time-format.html">ISO 8601</a>. Without it, the value comes out as a JavaScript date string, like <code>Tue Mar 10 2026 05:03:47 GMT+0000</code>, which Elasticsearch won&#x27;t map as a date.</li><li>The <code>view_count</code> uses <code>${{ }}</code> syntax with <code>| plus: 0</code> to preserve the numeric type. Using <code>{{ }}</code> would render it as a string, which would prevent math operations in the dashboard.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6d48c8784d5df6192e8b5175e69dbab5098194bc-919x774.png" style="width: 100%;" /><p><em>The UI allows you to nicely debug each of the workflow steps.</em></p><h2>Step 4: Build the stats dashboard</h2><p>Once the workflow has run a few times and data is collected, create a new dashboard in Kibana using the dashboard-views data view.</p><p>Some panels to start with:</p><ul><li><strong>Top dashboards by views:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/bar-charts"><strong>Bar chart</strong></a> with <code>dashboard_name</code> on the X axis and <code>last_value(view_count)</code> on the Y axis. 
This shows the current daily view count per dashboard.</li><li><strong>Views over time:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/line-charts"><strong>Line chart</strong></a> with <code>captured_at</code> on the X axis and <code>last_value(view_count)</code> on the Y axis, broken down by <code>dashboard_name</code>. Since each run appends a new document, use last value to get the peak count per time bucket rather than summing duplicates.</li><li><strong>Current snapshot:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/tables"><strong>Data table</strong></a> with the latest <code>captured_at</code> to show the most recent view counts across all dashboards.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e245b95f67daf76a2aaf4cb9df2c75ef4cfef582-1462x747.png" style="width: 100%;" /><p>Since each workflow creates a new document, you can filter by time range to analyze activity in specific periods, compare week over week, or build alerts when a dashboard drops below a view threshold.</p><h2><strong>Conclusion</strong></h2><p>Elastic Workflows is a good fit for this kind of periodic data collection because both the source (Kibana API) and the destination (Elasticsearch) are native, which means zero credential management. The workflow engine handles authentication automatically for <code>kibana.request</code> and <code>elasticsearch.request</code> steps, so the only thing you write is the logic.</p><h2><strong>Resources</strong></h2><ul><li><a href="https://www.elastic.co/docs/explore-analyze/workflows">Elastic Workflows</a></li><li><a href="https://www.elastic.co/docs/api/doc/kibana/">Kibana API</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When TSDS meets ILM: Designing time series data streams that don't reject late data]]></title>
            <description><![CDATA[How TSDS time bounds interact with ILM phases; and how to design policies that tolerate late-arriving metrics.]]></description>
            <link>https://www.elastic.co/search-labs/blog/tsds-ilm-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/tsds-ilm-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Bret Wortman]]></dc:creator>
            <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4c46e4a6f746d9c845857e80de036f5d51cd4e7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Recently, I migrated a customer&#x27;s metrics cluster from &quot;everything in the hot tier&quot; to a hot/cold/frozen architecture. It was a change I’d performed dozens of times before. Within minutes, Logstash stopped advancing data entirely.</p><p>Elasticsearch was rejecting late-arriving metrics. Those rejections caused the pipeline to fall behind, resulting in more late data, which triggered even more rejections. Eventually, the pipeline stalled completely.</p><p>We had to restore from snapshot, reindex the data, and redesign the ingestion pipeline to recover.</p><p>The root cause wasn&#x27;t index lifecycle management (ILM) itself. It was time series data streams (TSDS) and how they enforce time‑bound backing indices.</p><p>TSDS can reduce storage requirements for metrics by 40–70%, but the architectural changes that make TSDS efficient also alter how indices behave over time. Those changes matter when designing ILM policies or when your ingestion pipelines may produce late‑arriving data.</p><h2>TL;DR</h2><p>When using TSDS:</p><ul><li>Backing indices only accept documents within a specific time window.</li><li>If late data arrives after an index moves to cold or frozen, Elasticsearch rejects those documents or routes them to the failure store, if configured.</li></ul><p>Design rule:</p><pre><code>warm_min_age > rollover_max_age + maximum_expected_lateness</code></pre><h2>What is a time series data stream?</h2><p>A<em> time series data stream</em> (TSDS) is a specialized data stream optimized for metrics data. Data is routed so that related documents are located within the same shards, optimizing them for query and retrieval. 
Here’s how Elasticsearch does it:</p><p>Each document contains:</p><ul><li>A timestamp.</li><li>Dimension fields identifying the time series.</li><li>Metric fields representing measured values.</li></ul><p>Examples include:</p><ul><li>CPU usage per host.</li><li>Request latency per service.</li><li>Temperature readings per sensor.</li></ul><p><em>Dimensions </em>identify what we want to measure, while <em>metrics </em>represent values that change over time.</p><h3>Dimensions</h3><p>Dimensions describe the measured entity.</p><p>Examples:</p><pre><code>host.name
service.name
container.id</code></pre><p>We define them in mappings with:</p><pre><code>time_series_dimension: true</code></pre><h3>Metrics</h3><p>Metrics represent numeric values and are defined using:</p><pre><code>time_series_metric</code></pre><p>Common metric types:</p><ul><li>Gauge: Values that rise and fall.</li><li>Counter: Values that increase until reset.</li></ul><p>Elastic Agent primarily collects metrics and logs data, so even if you haven’t enabled any TSDS indices by hand, you may still have them in your cluster.</p><h3>The _tsid field</h3><p>Elasticsearch internally generates a <code>_tsid</code> value from dimension fields. This allows documents with identical dimensions to be routed to the same shard, improving:</p><ul><li>Compression.</li><li>Query locality.</li><li>Aggregation performance.</li></ul><h2>The key difference: Time‑bound backing indices</h2><p>Traditional data streams always write to the most recent backing index, called the <em>write index</em>, but TSDS behaves differently.</p><p>Each TSDS backing index has a defined time window and only accepts documents with <code>@timestamp</code> values that fall in that window:</p><pre><code>GET _data_stream/my-metrics-data-stream
{
  "index_mode": "time_series",
  "time_series": {
    "temporal_ranges": [
      {
        "start": "2026-01-15T14:35:50.000Z",
        "end": "2026-03-16T11:34:40.000Z"
      }
    ]
  }
}</code></pre><p>When a document is indexed, Elasticsearch routes it to the backing index responsible for that timestamp, meaning that, unlike traditional indices, a TSDS may write to multiple backing indices simultaneously.</p><p>For example:</p><ul><li>Real‑time data → newest index.</li><li>Late data → earlier index covering that time range.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/31af2bb3b3dc24db8342e791e1db77a44659ba7a-1589x502.png" style="width: 100%;" /><h2>Designing for late‑arriving data</h2><p>Real ingestion pipelines rarely deliver metrics perfectly on time. Metrics can be delayed by network outages, backlogs along the way, batch ingestion, and loss of edge devices, which reconnect and start to catch up.</p><p>Traditional indices quietly absorb those delays. TSDS does not.</p><p>If a document&#x27;s timestamp falls outside the range of writable backing indices, Elasticsearch rejects it, meaning your ILM policy must account for late data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32a103b95b20e31615c214271e27811a7ee315ae-1999x691.png" style="width: 100%;" /><h2>The critical constraint</h2><p>Backing indices must remain writable long enough to accept delayed data.</p><p>In practical terms:</p><pre><code>time_until_readonly > maximum_expected_lateness</code></pre><p>Because ILM measures ages from rollover, the operational rule becomes:</p><pre><code>warm_or_cold_min_age > rollover_max_age + maximum_expected_lateness</code></pre><p></p><p>For example, if metrics may arrive up to six hours late, indices must remain writable at least six hours after rollover.</p><p></p><p>Failing to account for this constraint was exactly what caused the ingestion failure described earlier. 
Late-arriving data was directed to an earlier index, which was already in the cold tier and therefore write-blocked.</p><p></p><h2>Handling rejected documents</h2><p>When TSDS rejects a document, Elasticsearch returns an error, indicating that the timestamp doesn’t fall within the range of writable indices. How your ingestion pipeline handles that error determines whether you lose data or stall ingestion.</p><p>The primary mechanism for handling rejected documents is the failure store.</p><h3>Failure store (recommended in Elasticsearch 9.1+)</h3><p>Elasticsearch 9.1 introduced the failure store, which automatically captures rejected documents. Instead of returning errors to clients, Elasticsearch writes failed documents to a dedicated failure index inside the data stream.</p><p>You can inspect failures using:</p><pre><code>GET metrics-myapp::failures/_search</code></pre><p>Using the failure store prevents ingestion pipelines from choking on rejection errors while preserving failed data for analysis or <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/reindex-tsds">reindexing</a>.</p><h2>Monitoring for rejection issues</h2><p>Late‑arrival problems usually appear first as ingestion anomalies. You may notice them first as:</p><ul><li>Sudden drops in indexing rate.</li><li>Spikes in rejected documents.</li><li>A growing number of failure store entries.</li><li>Mismatches between pipeline input and output counts.</li></ul><p>Alerting on these signals allows operators to detect issues before pipelines stall. Workflows, machine learning jobs, and other mechanisms can be used to automate detection and notification.</p><h2>Migration checklist for TSDS + ILM</h2><p>If you&#x27;re migrating a metrics cluster to TSDS, introducing ILM tiering, or upgrading to an Elasticsearch version where metrics are TSDS by default, review these items first.</p><h3><strong>1. 
Measure ingestion latency</strong></h3><p>Before changing ILM policies, determine:</p><ul><li>Normal ingestion delay.</li><li>Worst-case delay during incidents.</li><li>Delays caused by batch pipelines.</li></ul><p>Your ILM design must accommodate the maximum realistic delay.</p><h3><strong>2. Verify index time windows</strong></h3><p>Inspect your TSDS backing indices:</p><pre><code>GET _data_stream/<your-stream></code></pre><p>Look for:</p><ul><li><code>time_series.start_time</code></li><li><code>time_series.end_time</code></li></ul><p>These bounds determine which indices can accept documents. Understanding these windows can help you determine how late data can be before it’s rejected.</p><h3><strong>3. Size the hot tier for late arrivals</strong></h3><p>Ensure backing indices remain writable long enough for delayed data.</p><p>Operational rule:</p><ul><li><code>warm_min_age &gt; rollover_max_age + maximum_expected_lateness</code></li></ul><p>Remember, indices must remain writable for at least six hours if metrics may arrive six hours late.</p><h3><strong>4. Decide how to handle rejected documents</strong></h3><p>Choose a strategy before enabling TSDS:</p><ul><li>Failure store (recommended in Elasticsearch 9.1+).</li><li>Logstash dead letter queue.</li><li>Fallback index for late arrivals.</li><li>Accepting limited data loss.</li></ul><h3><strong>5. 
Monitor ingestion health</strong></h3><p>Add alerts for:</p><ul><li>Indexing rate drops.</li><li>Rejected documents.</li><li>Failure store growth.</li><li>Pipeline input/output mismatches.</li></ul><p>Late data issues often appear first as ingestion anomalies.</p><h2>Summary</h2><p>Time series data streams provide major storage and performance improvements for metrics workloads, but they introduce an important architectural change: Backing indices are time‑bound, which affects how ILM behaves.</p><p>When using TSDS:</p><ul><li>Indices must remain writable long enough to accept delayed data.</li><li>Ingestion pipelines should handle rejected documents safely.</li></ul><p>The key rule to remember is:</p><pre><code>warm_min_age > rollover_max_age + maximum_expected_lateness</code></pre><p>If you design ILM policies around that constraint, TSDS works extremely well for metrics workloads.</p><p>Ignore it, though, and your ingestion pipeline may discover those time boundaries the hard way.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LINQ to ES|QL: Write C#, query Elasticsearch]]></title>
            <description><![CDATA[Exploring the new LINQ to ES|QL provider in the Elasticsearch .NET client, which allows you to write C# code that’s automatically translated to ES|QL queries.]]></description>
            <link>https://www.elastic.co/search-labs/blog/linq-esql-c-elasticsearch-net-client</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/linq-esql-c-elasticsearch-net-client</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Florian Bernd, Martijn Laarman]]></dc:creator>
            <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e54132e915217063e9ed0ec45059c6cfc38e31dd-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Starting with <strong>v9.3.4</strong> and <strong>v8.19.18</strong>, the Elasticsearch .NET client includes a <a href="https://learn.microsoft.com/en-us/dotnet/csharp/linq/">Language Integrated Query (LINQ) </a>provider that translates C# LINQ expressions into <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">Elasticsearch Query Language (ES|QL)</a> queries at runtime. Instead of writing ES|QL strings by hand, you compose queries using <code>Where</code>, <code>Select</code>, <code>OrderBy</code>, <code>GroupBy</code>, and other standard operators. The provider takes care of translation, parameterization, and result deserialization, including per-row streaming that keeps memory usage constant, regardless of result set size.</p><h2>Your first query</h2><p>Start by defining a plain old CLR object (POCO) that maps to your Elasticsearch index. Property names are resolved to ES|QL column names through standard <code>System.Text.Json</code> attributes, like <code>[JsonPropertyName]</code>, or through a configured <code>JsonNamingPolicy</code>. The same <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/source-serialization">source serialization</a> rules that apply across the rest of the client apply here as well.</p><pre><code>using System.Text.Json.Serialization;

public class Product
{
    [JsonPropertyName("product_id")]
    public string Id { get; set; }

    public string Name { get; set; }

    public string Brand { get; set; }

    [JsonPropertyName("price_usd")]
    public double Price { get; set; }

    [JsonPropertyName("in_stock")]
    public bool InStock { get; set; }
}</code></pre><p>With the type in place, a query looks like this:</p><pre><code>var minPrice = 100.0;
var brand = "TechCorp";

await foreach (var product in client.Esql.QueryAsync<Product>(q => q
    .From("products")
    .Where(p => p.InStock && p.Price >= minPrice && p.Brand == brand)
    .OrderByDescending(p => p.Price)
    .Take(10)))
{
    Console.WriteLine($"{product.Name}: ${product.Price}");
}</code></pre><p>The provider translates this into the following ES|QL:</p><pre><code>FROM products
| WHERE (in_stock == true AND price_usd >= ?minPrice AND brand == ?brand)
| SORT price_usd DESC
| LIMIT 10</code></pre><p>A few details to note:</p><ul><li><strong>Property name resolution:</strong> <code>p.Price</code> becomes <code>price_usd</code> because of the <code>[JsonPropertyName]</code> attribute, and <code>p.Brand</code> becomes <code>brand</code> following the default camelCase naming policy.</li><li><strong>Parameter capturing:</strong> The C# variables <code>minPrice</code> and <code>brand</code> are captured as named parameters (<code>?minPrice</code>, <code>?brand</code>). They’re sent separately from the query string in the JSON payload, which prevents injection and enables server-side query plan caching.</li><li><strong>Streaming:</strong> <code>QueryAsync&lt;T&gt;</code> returns <code>IAsyncEnumerable&lt;T&gt;</code>. Rows are materialized one at a time as they arrive from Elasticsearch.</li></ul><p>You can also inspect the generated query and its parameters without executing it:</p><pre><code>var query = client.Esql.CreateQuery<Product>()
    .Where(p => p.InStock && p.Price >= minPrice && p.Brand == brand)
    .OrderByDescending(p => p.Price)
    .Take(10);

Console.WriteLine(query.ToEsqlString());
// FROM products | WHERE (in_stock == true AND price_usd >= 100 AND brand == "TechCorp") | SORT price_usd DESC | LIMIT 10

Console.WriteLine(query.ToEsqlString(inlineParameters: false));
// FROM products | WHERE (in_stock == true AND price_usd >= ?minPrice AND brand == ?brand) | SORT price_usd DESC | LIMIT 10

var parameters = query.GetParameters();
// { "minPrice": 100.0, "brand": "TechCorp" }</code></pre><h2>How does this work? A quick LINQ refresher</h2><p>The mechanism that makes LINQ providers possible is the distinction between <code>IEnumerable&lt;T&gt;</code> and <code>IQueryable&lt;T&gt;</code>.</p><p>When you call <code>.Where(p =&gt; p.Price &gt; 100)</code> on an <code>IEnumerable&lt;T&gt;</code>, the lambda compiles to a <code>Func&lt;Product, bool&gt;</code>, a regular delegate that the runtime executes in-process. This is LINQ-to-Objects.</p><p>When you call the same method on an <code>IQueryable&lt;T&gt;</code>, the C# compiler wraps the lambda in an <code>Expression&lt;Func&lt;Product, bool&gt;&gt;</code> instead. This is a data structure that represents the <em>structure</em> of the code rather than its executable form. The expression tree can be inspected, analyzed, and translated into another language at runtime.</p><pre><code>// IEnumerable: the lambda is a compiled delegate
IEnumerable<Product> local = products.Where(p => p.Price > 100);

// IQueryable: the lambda is an expression tree, a data structure
IQueryable<Product> remote = queryable.Where(p => p.Price > 100);</code></pre><p>The <code>IQueryProvider</code> interface is the extension point. Any provider can implement <code>CreateQuery&lt;T&gt;</code> and <code>Execute&lt;T&gt;</code> to translate these expression trees into a target language. Entity Framework uses this to emit SQL. The LINQ to ES|QL provider uses it to emit ES|QL.</p><p>The expression tree for the query above looks like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f864cd18a390831f8d28503a29b5835efb1842f7-1000x720.png" style="width: 100%;" /><p><em>Expression tree for the example query.</em></p><p>The tree is nested inside out: <code>Take</code> wraps <code>OrderByDescending</code>, which wraps <code>Where</code>, which wraps <code>From</code>, which wraps the root <code>EsqlQueryable&lt;Product&gt;</code> constant. The <code>Where</code> predicate is itself a subtree of <code>BinaryExpression</code> nodes for the <code>&amp;&amp;</code>, <code>&gt;=</code>, and <code>==</code> operators, with <code>MemberExpression</code> leaves for property accesses and closure captures for the <code>minPrice</code> and <code>brand</code> variables. This is the data structure that the provider walks to produce the final ES|QL.</p><h2>Under the hood: The translation pipeline</h2><p>The path from a LINQ expression to query results follows a six-stage pipeline:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2a2c772b63d720f61fc9a28b2f85668fa2db8d38-1999x1036.png" style="width: 100%;" /><p><em>Translation pipeline overview.</em></p><h3>1. Expression tree capture</h3><p>When you chain <code>.Where()</code>, <code>.OrderBy()</code>, <code>.Take()</code> and other operators on an <code>IQueryable&lt;T&gt;</code>, the standard LINQ infrastructure builds an expression tree. 
<code>EsqlQueryable&lt;T&gt;</code> implements <code>IQueryable&lt;T&gt;</code> and delegates to <code>EsqlQueryProvider</code>.</p><h3>2. Translation</h3><p>When the query is executed (by enumerating, calling <code>ToList()</code>, or using <code>await foreach</code>), the <code>EsqlExpressionVisitor</code> walks the expression tree inside out. It dispatches each LINQ method call to a specialized visitor:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Visitor</td><td style="border: 1px solid #ccc; padding: 4px;">Translates</td><td style="border: 1px solid #ccc; padding: 4px;">Into</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">WhereClauseVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.Where(predicate)</td><td style="border: 1px solid #ccc; padding: 4px;">WHERE condition</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">SelectProjectionVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.Select(selector)</td><td style="border: 1px solid #ccc; padding: 4px;">EVAL + KEEP + RENAME</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">GroupByVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.GroupBy().Select()</td><td style="border: 1px solid #ccc; padding: 4px;">STATS ... BY</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OrderByVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.OrderBy() / .ThenBy()</td><td style="border: 1px solid #ccc; padding: 4px;">SORT field [ASC|DESC]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">EsqlFunctionTranslator</td><td style="border: 1px solid #ccc; padding: 4px;">EsqlFunctions.*, Math.*, string methods</td><td style="border: 1px solid #ccc; padding: 4px;">80+ ES|QL functions</td>
                    </tr>
              </tbody>
            </table>
          <p>During translation, C# variables referenced in expressions are captured as named parameters.</p><h3>3. Query model</h3><p>The visitors don’t produce strings directly. Instead, they produce <code>QueryCommand</code> objects, an immutable intermediate representation. A <code>FromCommand</code>, a <code>WhereCommand</code>, a <code>SortCommand</code>, and a <code>LimitCommand</code>, each representing one ES|QL processing command. These are collected into an <code>EsqlQuery</code> model.</p><p><em></em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2adc349b6cf655b96b7b3e826a134e8a17fe42fd-1999x1036.png" style="width: 100%;" /><p><em>Query model and command pattern.</em></p><p>This intermediate model is decoupled from both the expression tree and the output format. It can be inspected, intercepted (via <code>IEsqlQueryInterceptor</code>), or modified before formatting.</p><h3>4. Formatting</h3><p><code>EsqlFormatter</code> visits each <code>QueryCommand</code> in order and produces the final ES|QL string. Each command becomes one line, separated by the pipe (|) operator that ES|QL uses to chain processing commands. Identifiers containing special characters are automatically escaped with backticks.</p><h3>5. Execution</h3><p>The formatted ES|QL string and captured parameters are sent to Elasticsearch’s <code>/_query</code> endpoint as a JSON payload. The <code>IEsqlQueryExecutor</code> interface abstracts the transport layer, which is where the layered package architecture comes into play.</p><h3>6. Materialization</h3><p><code>EsqlResponseReader</code> streams the JSON response without buffering the entire result set into memory. A <code>ColumnLayout</code> tree, precomputed once per query, maps flat ES|QL column names (like <code>address.street</code>, <code>address.city</code>) to nested POCO properties. 
Each row is assembled into a <code>T</code> instance and yielded one at a time via <code>IEnumerable&lt;T&gt;</code> or <code>IAsyncEnumerable&lt;T&gt;</code>.</p><h2>The layered architecture</h2><p>The LINQ to ES|QL functionality is split across three packages:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41b8aae860ecdc2480edcb1c1d4cc9b03cfb78c9-1999x1036.png" style="width: 100%;" /><p><em>Package architecture.</em><br/><a href="https://www.nuget.org/packages/Elastic.Esql"><strong><code>Elastic.Esql</code></strong></a> is the pure translation engine. It has zero HTTP dependencies and contains the expression visitors, query model, formatter, and response reader. You can use it stand-alone to build and inspect ES|QL queries without an Elasticsearch connection, which is useful for testing, query logging, or building your own execution layer.</p><pre><code>// Translation-only: no Elasticsearch connection needed
var provider = new EsqlQueryProvider();
var query = new EsqlQueryable<Product>(provider)
    .From("products")
    .Where(p => p.InStock)
    .OrderByDescending(p => p.Price);

Console.WriteLine(query.ToEsqlString());
// FROM products | WHERE in_stock == true | SORT price_usd DESC</code></pre><p><a href="https://www.nuget.org/packages/Elastic.Clients.Esql"><strong><code>Elastic.Clients.Esql</code></strong></a> is a lightweight stand-alone ES|QL client. It adds HTTP execution on top of <code>Elastic.Esql</code> via <code>Elastic.Transport</code>. If your application only needs ES|QL and none of the other Elasticsearch APIs, this is the minimal dependency option.</p><p><a href="https://www.nuget.org/packages/Elastic.Clients.Elasticsearch"><strong><code>Elastic.Clients.Elasticsearch</code></strong></a> is the full Elasticsearch .NET client. It also builds on <code>Elastic.Esql</code> and exposes the LINQ provider through the <code>client.Esql</code> namespace. This is the recommended entry point for most applications.</p><p>Both execution-layer packages provide their own implementation of <code>IEsqlQueryExecutor</code>, the strategy interface that bridges translation and transport.</p><p>All three packages are compatible with Native AOT when used with a source-generated <code>JsonSerializerContext</code>. For the full client, see the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/source-serialization#native-aot">Native AOT documentation</a>.</p><h2>Beyond the basics</h2><p>The example above covered filtering, sorting, and pagination. The provider supports a broader set of operations.</p><h3>Aggregations</h3><p><code>GroupBy</code>, combined with aggregate functions in <code>Select</code>, translates to ES|QL <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS ... BY</code></a>:</p><pre><code>var stats = client.Esql.Query<Product, object>(q => q
    .GroupBy(p => p.Brand)
    .Select(g => new
    {
        Brand = g.Key,
        Count = g.Count(),
        AvgPrice = g.Average(p => p.Price),
        MaxPrice = g.Max(p => p.Price)
    }));

// -> FROM products | STATS COUNT(*), AVG(price_usd), MAX(price_usd) BY brand</code></pre><h3>Projections</h3><p><code>Select</code> with anonymous types generates <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval"><code>EVAL</code></a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/keep"><code>KEEP</code></a>, and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/rename"><code>RENAME</code></a> commands:</p><pre><code>var query = client.Esql.CreateQuery<Product>()
    .Select(p => new { ProductName = p.Name, p.Price, p.InStock });

// -> FROM products | KEEP name, price_usd, in_stock | RENAME name AS ProductName</code></pre><h3>Rich function library</h3><p>Over 80 ES|QL functions are available through the <code>EsqlFunctions</code> class, covering date/time, string, math, IP, pattern matching, and scoring. Standard <code>Math.*</code> and <code>string.*</code> methods are also translated:</p><pre><code>.Where(p => p.Name.Contains("Pro"))       // -> WHERE name LIKE "*Pro*"
.Where(p => EsqlFunctions.CidrMatch(      // -> WHERE CIDR_MATCH(ip, "10.0.0.0/8")
    p.IpAddress, "10.0.0.0/8"))</code></pre><h3>LOOKUP JOIN</h3><p>Cross-index lookups translate to ES|QL <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join"><code>LOOKUP JOIN</code></a>:</p><pre><code>var enriched = client.Esql.Query<Product, object>(q => q
    .LookupJoin<Product, CategoryLookup, string, object>(
        "category-lookup-index",
        product => product.Id,
        category => category.CategoryId,
        (product, category) => new { product.Name, category!.CategoryLabel }));</code></pre><h3>Raw ES|QL escape hatch</h3><p>For ES|QL features not yet covered by the LINQ provider, you can append raw fragments:</p><pre><code>var results = client.Esql.Query<Product>(q => q
    .Where(p => p.InStock)
    .RawEsql("| EVAL discounted = price_usd * 0.9"));</code></pre><h3>Server-side async queries</h3><p>For long-running queries, submit them for background processing on the server:</p><pre><code>await using var asyncQuery = await client.Esql.SubmitAsyncQueryAsync<Product>(
    q => q.Where(p => p.InStock),
    asyncQueryOptions: new EsqlAsyncQueryOptions
    {
        WaitForCompletionTimeout = TimeSpan.FromSeconds(5),
        KeepAlive = TimeSpan.FromMinutes(10)
    });

await asyncQuery.WaitForCompletionAsync();
await foreach (var product in asyncQuery.AsAsyncEnumerable())
    Console.WriteLine(product.Name);</code></pre><p>Server-side async queries are especially useful for long-running analytical queries and large-dataset processing that might exceed typical timeout thresholds, or in timeout-sensitive environments with load balancers, API gateways, or proxies that enforce strict HTTP timeouts. Async queries avoid connection drops by decoupling submission from result retrieval.</p><h2>Getting started</h2><p>LINQ to ES|QL is available starting from:</p><ul><li><strong>Elastic.Clients.Elasticsearch v9.3.4</strong> (9.x branch)</li><li><strong>Elastic.Clients.Elasticsearch v8.19.18</strong> (8.x branch)</li></ul><p>Install from NuGet:</p><p><code>dotnet add package Elastic.Clients.Elasticsearch</code></p><p>The entry points are on <code>client.Esql</code>:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Returns</td><td style="border: 1px solid #ccc; padding: 4px;">Use case</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query&lt;T&gt;(...)</td><td style="border: 1px solid #ccc; padding: 4px;">IEnumerable&lt;T&gt;</td><td style="border: 1px solid #ccc; padding: 4px;">Synchronous execution</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">QueryAsync&lt;T&gt;(...)</td><td style="border: 1px solid #ccc; padding: 4px;">IAsyncEnumerable&lt;T&gt;</td><td style="border: 1px solid #ccc; padding: 4px;">Async streaming</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">CreateQuery&lt;T&gt;()</td><td style="border: 1px solid #ccc; padding: 4px;">IEsqlQueryable&lt;T&gt;</td><td style="border: 1px solid #ccc; padding: 4px;">Advanced composition and inspection</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">SubmitAsyncQueryAsync&lt;T&gt;(...)</td><td style="border: 1px solid #ccc; padding: 4px;">EsqlAsyncQuery&lt;T&gt;</td><td style="border: 1px solid #ccc; padding: 4px;">Long-running server-side queries</td>
                    </tr>
              </tbody>
            </table>
          <p>For the full feature reference, including query options, multifield access, nested objects, and multivalue field handling, see the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/linq-to-esql">LINQ to ES|QL documentation</a>.</p><h2>Conclusion</h2><p>LINQ to ES|QL brings the full expressiveness of C# LINQ to Elasticsearch&#x27;s ES|QL query language, letting you write strongly typed, composable queries without handcrafting query strings. With automatic parameter capturing, streaming materialization, and a layered package architecture that scales from stand-alone translation to the full Elasticsearch client, it fits naturally into .NET applications of any size. Install the latest client, point your LINQ expressions at an index, and let the provider handle the rest.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From judgment lists to trained Learning to Rank (LTR) models]]></title>
            <description><![CDATA[Learn how to transform judgment lists into training data for Learning To Rank (LTR), design effective features, and interpret what your model learned.]]></description>
            <link>https://www.elastic.co/search-labs/blog/learning-to-rank-models-judgment-lists</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/learning-to-rank-models-judgment-lists</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b43ac8c8f6dd519b28a6ed2b89328b6e76c71e02-1999x1125.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch">Evaluating search query relevance with judgment lists</a>, we built judgment lists and used the <code>_rank_eval</code> API to measure search quality. Though this approach gave us an objective way to evaluate changes, improving relevance still requires manual query tuning.</p><p>If judgment lists answer the question, <strong>“How good is my ranking?”</strong> Learning To Rank (LTR) answers, <strong>“How do I systematically make it better?”</strong></p><p>In this article, we take the next step: using those judgment lists to train an LTR model with <a href="https://xgboost.ai/">XGBoost</a>, <a href="https://eland.readthedocs.io/en/latest/">Eland</a>, and Elasticsearch. We’ll focus on understanding the process rather than on implementation details. For the complete code, refer to the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">companion notebook</a>.</p><h2>What is LTR?</h2><p><a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">LTR</a> uses machine learning (ML) to build a ranking function for your search engine. Instead of manually tuning query weights, you provide examples of proper rankings (your <a href="https://www.elastic.co/search-labs/blog/judgment-lists">judgment list</a>) and let the model learn what makes documents relevant. 
In Elasticsearch, LTR works as a second-stage reranker that runs after an initial retrieval pass:</p><ul><li><strong>First stage:</strong> A standard query (<a href="https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch">BM25</a>, vector, or <a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch">hybrid</a>) retrieves candidate documents quickly.</li><li><strong>Second stage:</strong> The LTR model reranks the top results using multiple signals it learned to combine.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/926c037f84e211b6fb5a44a0849823151ab0fa0a-1712x1852.png" style="width: 100%;" /><p>For a deeper introduction, see <a href="https://www.elastic.co/search-labs/blog/elasticsearch-learning-to-rank-introduction">Introducing Learning To Rank (LTR)</a> in Elasticsearch.</p><h2>The journey from judgment list to model</h2><p>A judgment list tells us which documents should rank highly for a given query. But the model cannot learn directly from document IDs. It needs numerical signals that explain <em>why</em> certain documents are potentially relevant.</p><p>The process works like this:</p><ol><li><strong>Start with judgments.</strong> Query-document pairs with relevance grades; for example, you define that doc1 is a good match for the search terms “DiCaprio performance.”</li><li><strong>Extract features.</strong> For each query-document pair, compute numerical signals, some about the document alone (for example, popularity), and others about how the query and document interact (for example, BM25 score).</li><li><strong>Train the model.</strong> The model learns which feature patterns predict high grades.</li><li><strong>Deploy.</strong> Upload the trained model to your Elasticsearch cluster.</li><li><strong>Query.</strong> Use the model to rerank search results.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b43ac8c8f6dd519b28a6ed2b89328b6e76c71e02-1999x1125.png" style="width: 100%;" /><p>The key insight is that features must capture what your judgments are measuring. If your judgment list rewards popular thriller movies but your features only include text-matching scores, the model has no way to learn what makes those documents relevant.</p><h2>What are features?</h2><p><a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr#learning-to-rank-feature-extraction"><em>Features</em></a> are numerical values that describe a query-document pair. In Elasticsearch, we define features using queries that return <a href="https://www.elastic.co/search-labs/blog/elasticsearch-scoring-and-explain-api">scores</a>. There are three types:</p><ul><li><strong><em>Query-document features</em></strong> measure how well a query matches a document. Eland provides the <code>QueryFeatureExtractor</code> utility to define these features. The extractor below computes a BM25 relevance score for each query-document pair:</li></ul><pre><code>QueryFeatureExtractor(
    feature_name="title_bm25",
    query={"match": {"title": "{{query}}"}}
)</code></pre><p>This extracts the BM25 score from the title field for each document relative to the query.</p><ul><li><strong><em>Document features</em></strong> are properties of the document that don’t depend on the query. You can extract these using <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query"><code>script_score</code></a> or <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query"><code>function_score</code></a>:</li></ul><pre><code>QueryFeatureExtractor(
    feature_name="popularity",
    query={
        "script_score": {
            "query": {"exists": {"field": "popularity"}},
            "script": {"source": "return doc['popularity'].value;"}
        }
    }
)</code></pre><ul><li><strong><em>Query features</em></strong> describe the query itself, like the number of terms. These are less common but can help the model handle different query types.</li></ul><h2>Designing your feature set</h2><p>Choosing features isn’t random. Each feature should capture a signal that might explain why users prefer certain documents. Let&#x27;s look at the features from the LTR notebook and understand the reasoning:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Feature</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td><td style="border: 1px solid #ccc; padding: 4px;">Purpose</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">Title matches are strong relevance signals. For example, a movie titled Star Wars should rank highly for the query "star wars".</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`actors_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">Some users search by actor name. If they search for "leonardo dicaprio movies", they should get films starring Leonardo DiCaprio.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title_all_terms_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">This is a stricter version of title matching where all query terms must be present. It helps distinguish between exact matches and partial ones.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`actors_all_terms_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">This is the same stricter matching logic as described above but applied specifically to actors.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`popularity`</td><td style="border: 1px solid #ccc; padding: 4px;">Document</td><td style="border: 1px solid #ccc; padding: 4px;">Users generally prefer well-known movies over obscure ones when relevance is similar. A popular Star Wars film should rank above a low-budget parody with "Star Wars" in the title.</td>
                    </tr>
              </tbody>
            </table>
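<p>Assembled in code, the feature set above looks like the following sketch. It mirrors the notebook's extractors as plain Elasticsearch query templates; the <code>"operator": "and"</code> form for the all-terms variants is an assumption, and <code>{{query}}</code> is the placeholder that Eland's <code>QueryFeatureExtractor</code> fills with each judgment-list query:</p>

```python
import json

# Feature definitions mirroring the table above. Each entry is an
# Elasticsearch query template; the template's relevance score becomes the
# feature value for a query-document pair. "{{query}}" is substituted with
# the judgment-list query text at extraction time.
FEATURE_TEMPLATES = {
    # Lenient match: scores when ANY query term matches the field.
    "title_bm25": {"match": {"title": "{{query}}"}},
    "actors_bm25": {"match": {"actors": "{{query}}"}},
    # Strict match: ALL query terms must be present ("operator": "and").
    "title_all_terms_bm25": {
        "match": {"title": {"query": "{{query}}", "operator": "and"}}
    },
    "actors_all_terms_bm25": {
        "match": {"actors": {"query": "{{query}}", "operator": "and"}}
    },
    # Query-independent document feature: the raw popularity value.
    "popularity": {
        "script_score": {
            "query": {"exists": {"field": "popularity"}},
            "script": {"source": "return doc['popularity'].value;"},
        }
    },
}


def render(template: dict, query: str) -> dict:
    """Fill the {{query}} placeholder, yielding a runnable query body."""
    return json.loads(json.dumps(template).replace("{{query}}", query))


print(render(FEATURE_TEMPLATES["title_all_terms_bm25"], "star wars"))
# {'match': {'title': {'query': 'star wars', 'operator': 'and'}}}
```

<p>In the notebook, each template becomes a <code>QueryFeatureExtractor(feature_name=..., query=...)</code>, and the extractors are collected into an <code>LTRModelConfig</code> (the <code>ltr_config</code> referenced by the training and deployment code).</p>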
          <p>Notice the strategy here:</p><ul><li><strong>Multiple signals for the same concept. </strong>We have both <code>title_bm25</code> (lenient) and <code>title_all_terms_bm25</code> (strict). The lenient version scores any document where at least one query term matches the title, and the strict version requires all the terms to be present. For short queries, the lenient match might be enough; whereas for longer, more specific queries, strict matching might be more important. The model can learn when to rely on each.</li><li><strong>Text features plus quality features. </strong>Text matching alone can return irrelevant documents that happen to contain the right words. The popularity feature lets the model boost well-known, quality content when text scores are similar.</li><li><strong>Coverage for different query types.</strong> Some queries target titles (&quot;star wars&quot;), and others target actors (&quot;dicaprio movies&quot;). Having features for both means that the model can handle diverse searches.</li></ul><p>When designing your own features, ask yourself, &quot;What signals would a human use to decide if this document is relevant?&quot; Those are your candidate features.</p><h2>Building the training dataset</h2><p>Once features are defined, we extract them for every query-document pair in our judgment list. The result is a training dataset where each row contains:</p><ul><li>The query identifier.</li><li>The document identifier.</li><li>The relevance grade (from our judgment list).</li><li>All feature values.</li></ul><p>Here’s a simplified example:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`query_id`</td><td style="border: 1px solid #ccc; padding: 4px;">`query`</td><td style="border: 1px solid #ccc; padding: 4px;">`doc_id`</td><td style="border: 1px solid #ccc; padding: 4px;">`grade`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">11</td><td style="border: 1px solid #ccc; padding: 4px;">4</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">12180</td><td style="border: 1px solid #ccc; padding: 4px;">3</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">278427</td><td style="border: 1px solid #ccc; padding: 4px;">1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:2</td><td style="border: 1px solid #ccc; padding: 4px;">tom hanks movies</td><td style="border: 1px solid #ccc; padding: 4px;">857</td><td style="border: 1px solid #ccc; padding: 4px;">4</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:2</td><td style="border: 1px solid #ccc; padding: 4px;">tom hanks movies</td><td style="border: 1px solid #ccc; padding: 4px;">13</td><td style="border: 1px solid #ccc; padding: 4px;">3</td>
                    </tr>
              </tbody>
            </table>
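<p>One detail worth making concrete before training: XGBoost's ranking objective doesn't compare rows across the whole dataset; it needs to know how many consecutive rows belong to each query (its <code>group</code> argument). A minimal sketch using the rows of the simplified table above:</p>

```python
from itertools import groupby

# (query_id, doc_id, grade) rows from the simplified table above,
# already sorted by query_id -- a requirement for grouped ranking training.
rows = [
    ("qid:1", 11, 4),
    ("qid:1", 12180, 3),
    ("qid:1", 278427, 1),
    ("qid:2", 857, 4),
    ("qid:2", 13, 3),
]

# Count consecutive rows per query: this is the shape of the `group`
# argument passed to the ranker's fit() call.
group_sizes = [len(list(docs)) for _, docs in groupby(rows, key=lambda r: r[0])]
print(group_sizes)
# [3, 2] -> three candidates for "star wars", two for "tom hanks movies"
```

<p>The training code below derives the same sizes with <code>value_counts().sort_index()</code>, which assumes the frame's rows are ordered by <code>query_id</code>.</p>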
          <p>A few things to notice:</p><p><strong>NaN values are normal.</strong> The simplified table above omits the feature columns; the full training frame adds one column per feature. When a query doesn’t match a field, the feature returns no score. The movie <em>Star Wars</em> has a high <code>title_bm25</code> but no <code>actors_bm25</code> because the query &quot;star wars&quot; doesn’t match any actor names.</p><p><strong>Queries are grouped during training.</strong> The <code>query_id</code> column tells the model which documents to compare against each other. For &quot;star wars&quot;, it learns that document 11 (grade 4) should rank above document 278427 (grade 1).</p><p>But here’s the important part: The model doesn’t memorize these specific queries. Instead, it learns general patterns, like &quot;documents with high <code>title_bm25</code> AND high <code>popularity</code> tend to have high grades.&quot; When presented with a new query, the model applies these learned patterns to rank the results.</p><p><strong>Features must explain grade differences.</strong> Look at qid:1: The grade 4 document has a higher <code>title_bm25</code> and higher <code>popularity</code> than the grade 1 document. These patterns are what the model learns.</p><h2>Training the LTR model</h2><p>With the training dataset prepared, we train an <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost</a> model with a ranking objective. The model builds decision trees that learn patterns like:</p><ul><li>&quot;If <code>title_bm25</code> &gt; 10 and <code>popularity</code> &gt; 50, predict high relevance.&quot;</li><li>&quot;If <code>title_bm25</code> is missing but <code>actors_bm25</code> &gt; 12, still predict moderate relevance.&quot;</li></ul><p>Here&#x27;s how the training process works in practice:</p><pre><code>from xgboost import XGBRanker
from sklearn.model_selection import GroupShuffleSplit

# Create the ranker model:
ranker = XGBRanker(
    objective="rank:ndcg",
    eval_metric=["ndcg@10"],
    early_stopping_rounds=20,
)

# Shaping training and eval data in the expected format.
X = judgments_with_features[ltr_config.feature_names]
y = judgments_with_features["grade"]
groups = judgments_with_features["query_id"]

# Split the dataset in two parts respectively used for training and evaluation of the model.
group_preserving_splitter = GroupShuffleSplit(n_splits=1, train_size=0.7).split(
    X, y, groups
)
train_idx, eval_idx = next(group_preserving_splitter)

train_features, eval_features = X.loc[train_idx], X.loc[eval_idx]
train_target, eval_target = y.loc[train_idx], y.loc[eval_idx]
train_query_groups, eval_query_groups = groups.loc[train_idx], groups.loc[eval_idx]

# Training the model
ranker.fit(
    X=train_features,
    y=train_target,
    group=train_query_groups.value_counts().sort_index().values,
    eval_set=[(eval_features, eval_target)],
    eval_group=[eval_query_groups.value_counts().sort_index().values],
    verbose=True,
)</code></pre><p>During training, the model tries different combinations of these rules and measures how well the resulting rankings match your judgment grades. It uses a metric called <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain (NDCG)</a> to score itself. A perfect NDCG of 1.0 means that the model&#x27;s ranking exactly matches your judgments. Lower scores mean that some relevant documents are ranking below where they should be.</p><p>The training also uses a technique called <a href="https://xgboosting.com/early-stopping/">early stopping</a>. If the model&#x27;s score stops improving for several rounds, training halts automatically. This prevents the model from memorizing the training data too closely, which would hurt its ability to generalize to new queries.</p><p>The <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">companion notebook</a> contains the complete training code.</p><h2>Understanding what your LTR model learned</h2><p>After training, XGBoost can show you which features the model relied on most. You can generate a feature importance chart using XGBoost&#x27;s built-in visualization:</p><pre><code>from xgboost import plot_importance

plot_importance(ranker, importance_type="weight")</code></pre><p>The <code>importance_type=&quot;weight&quot;</code> parameter shows how often each feature was used in tree splits. Here’s the resulting chart:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/67608c24c50a8775b6e70fa7f522d7b58cefc764-734x460.png" style="width: 100%;" /><p>The F score counts how many times each feature was used to make split decisions across all trees in the model. Higher values mean that the model relied on that feature more often.</p><p>In this example:</p><ul><li><strong><code>popularity</code> (2178):</strong> The most important feature. The model frequently uses popularity to separate relevant from nonrelevant documents.</li><li><strong><code>title_bm25</code> (1642):</strong> Second-most important. Title matches matter a lot for movie searches.</li><li><strong><code>actors_bm25</code> (565):</strong> Moderately important. This is useful for queries that mention actors.</li><li><strong><code>title_all_terms_bm25</code> (211):</strong> Occasionally useful. The stricter matching helps for some queries.</li><li><strong><code>actors_all_terms_bm25</code> (63):</strong> Rarely used. The model found this feature less predictive.</li></ul><p>This chart helps you iterate on your feature set. If a feature that you expected to be important shows near-zero importance, investigate why. Maybe the feature extraction is not working as intended, or maybe that signal doesn’t actually predict relevance in your judgment data.</p><h2>Deploying and using the LTR model</h2><p>Once trained, upload the model to Elasticsearch using Eland:</p><pre><code>MLModel.import_ltr_model(
    es_client=es_client,
    model=ranker,
    model_id="ltr-model-xgboost",
    ltr_model_config=ltr_config,
    es_if_exists="replace",
)</code></pre><p>Once uploaded, the model can be used in a <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/rescorer-retriever">rescorer retriever</a> and combined with other <a href="https://www.elastic.co/docs/solutions/search/retrievers-overview">retrievers</a> in multistage search pipelines:</p><pre><code>GET movies/_search
{
  "retriever": {
    "rescorer": {
      "rescore": {
        "window_size": 50,
        "learning_to_rank": {
          "model_id": "ltr-model-xgboost",
          "params": {
            "query": "star wars"
          }
        }
      },
      "retriever": {
        "standard": {
          "query": {
            "multi_match": {
              "fields": ["title", "overview", "actors", "director", "tags", "characters"],
              "query": "star wars"
            }
          }
        }
      }
    }
  }
}</code></pre><p>Response (simplified):</p><pre><code> "hits": {
    "total": {
      "value": 852,
      "relation": "eq"
    },
    "max_score": 25.165691,
    "hits": [
      {
        "_index": "movies",
        "_id": "11",
        "_score": 25.165691,
        "_source": {
          "title": "Star Wars"
        }
      },
      {
        "_index": "movies",
        "_id": "12180",
        "_score": 25.092865,
        "_source": {
          "title": "Star Wars: The Clone Wars"
        }
      },
      {
        "_index": "movies",
        "_id": "181812",
        "_score": 23.456198,
        "_source": {
          "title": "Star Wars: The Rise of Skywalker"
        }
      },
      {
        "_index": "movies",
        "_id": "140607",
        "_score": 23.320757,
        "_source": {
          "title": "Star Wars: The Force Awakens"
        }
      },
...</code></pre><p>The first-stage query retrieves candidates using BM25. The LTR model then reranks the top 50 results using all the features it learned to weight.</p><p>For comparison, the <code>multi_match</code> query alone places some less relevant results in the top positions, which the LTR rescorer corrects:</p><pre><code>{
  "hits": [
    {
      "_index": "movies",
      "_id": "11",
      "_score": 10.971989,
      "_source": {
        "title": "Star Wars"
      }
    },
    {
      "_index": "movies",
      "_id": "12180",
      "_score": 9.923633,
      "_source": {
        "title": "Star Wars: The Clone Wars"
      }
    },
    {
      "_index": "movies",
      "_id": "1022100",
      "_score": 8.9880295,
      "_source": {
        "title": "Andor: A Disney+ Day Special Look"
      }
    },
    {
      "_index": "movies",
      "_id": "278427",
      "_score": 8.845748,
      "_source": {
        "title": "Family Guy Presents: It's a Trap!"
      }
    },
    ...
  ]
}</code></pre><h2>Conclusion</h2><p>The path from judgment lists to a working LTR model involves three key steps: designing features that capture relevance signals, building a training dataset that pairs those features with your judgment grades, and training a model that learns the patterns.</p><p>Our <a href="https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch">previous article</a> becomes the starting point for this process. Your grades define what &quot;relevant&quot; means and how to measure it, and your features give the model the signals to predict it.</p><p>For the complete implementation with a dataset of 9,750 movies and 384,755 judgment rows, see the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">LTR notebook</a>. For advanced use cases, like personalized search, see <a href="https://www.elastic.co/search-labs/blog/personalized-search-elasticsearch-ltr">Personalized search with LTR</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Elasticsearch runtime fields to ES|QL: Adapting legacy tools to current techniques]]></title>
            <description><![CDATA[Learn how to migrate five common Elasticsearch runtime field patterns to their ES|QL equivalents, with side-by-side code comparisons and guidance on when each approach makes sense.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-runtime-fields-to-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-runtime-fields-to-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/657ec44d182de78e6ddabb6632c6844b5a36774d-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields">runtime fields</a> solve the problem of computing values at query time without <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex">reindexing</a>. But they come with <a href="https://www.elastic.co/docs/reference/scripting-languages/painless/painless">Painless scripting</a> complexity and performance costs that scale with document count. <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language (ES|QL)</a> offers a more powerful alternative with a dedicated execution engine, pipeline processing, and no scripting required. In this article, you’ll learn how to map five common runtime field patterns to their ES|QL equivalents, so you can modernize your queries and understand when each approach makes sense.</p><h2>Prerequisites</h2><ul><li>Elasticsearch 8.15+ (for <code>::</code> cast operator support; core ES|QL features available from 8.11)</li></ul><h2>Runtime fields versus ES|QL</h2><p>Runtime fields were introduced in Elasticsearch 7.11 as a way to define fields at query time. Instead of reindexing data, you could write a Painless script that computes values on the fly:</p><pre><code>PUT my-index/_mapping
{
  "runtime": {
    "full_address": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['address'].value + ':' + doc['port'].value)"
      }
    }
  }
}</code></pre><p>This works, but comes with trade-offs:</p><ul><li><strong>Painless scripting overhead:</strong> Every runtime field requires scripting knowledge, and the <a href="https://www.elastic.co/docs/reference/scripting-languages/painless/painless-language-specification">syntax</a> is Java-like, not query-like.</li><li><strong>Performance cost:</strong> Runtime fields evaluate per document at query time. Elasticsearch classifies them as &quot;expensive queries&quot; that <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields#runtime-compromises">can be rejected</a> by cluster settings.</li><li><strong>Isolated computation:</strong> Each runtime field computes independently. There’s no way to chain transforms or use the output of one field in another within the same query.</li></ul><p>ES|QL changes the equation. It has its own execution engine (not translated to Query DSL), runs queries concurrently across nodes, and provides a complete toolkit for field computation: <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval"><code>EVAL</code></a>, <a href="http://elastic.co/docs/reference/query-languages/esql/commands/grok"><code>GROK</code></a>, <a href="http://elastic.co/docs/reference/query-languages/esql/commands/dissect"><code>DISSECT</code></a>, type casting, and pipeline chaining.</p><p>Let&#x27;s see how each runtime field pattern maps to ES|QL.</p><h2>Setting up the example data</h2><p>All the code snippets in this article can be executed in the Kibana <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/console">Dev Tools console</a>.</p><p>To follow along, create a sample index with data that exercises all five patterns. This simulates a server logs scenario with mixed field types, raw messages, and some intentional data quality issues:</p><pre><code>PUT server-logs
{
  "mappings": {
    "properties": {
      "host": { "type": "keyword" },
      "port": { "type": "keyword" },
      "raw_message": { "type": "text" },
      "response_time": { "type": "keyword" },
      "status_code": { "type": "keyword" },
      "region": { "type": "keyword" }
    }
  }
}</code></pre><p>Now index some sample documents:</p><pre><code>POST _bulk
{ "index": { "_index": "server-logs" } }
{ "host": "web-01", "port": "8080", "raw_message": "2024-01-15 INFO user=alice action=login duration=230ms", "response_time": "145", "status_code": "200", "region": "us-east" }
{ "index": { "_index": "server-logs" } }
{ "host": "web-02", "port": "443", "raw_message": "2024-01-15 ERROR user=bob action=upload duration=1200ms", "response_time": "not_available", "status_code": "500", "region": "eu-west" }
{ "index": { "_index": "server-logs" } }
{ "host": "api-01", "port": "3000", "raw_message": "2024-01-15 WARN user=charlie action=query duration=890ms", "response_time": "890", "status_code": "200", "region": "us-east" }
{ "index": { "_index": "server-logs" } }
{ "host": "api-02", "port": "3000", "raw_message": "2024-01-16 INFO user=diana action=export duration=3400ms", "response_time": "3400", "status_code": "200", "region": "ap-south" }
{ "index": { "_index": "server-logs" } }
{ "host": "web-01", "port": "8080", "raw_message": "2024-01-16 ERROR user=eve action=login duration=50ms", "response_time": "50", "status_code": "401", "region": "US-EAST" }
</code></pre><p>Notice that <code>response_time</code> is stored as a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/keyword">keyword</a> (a common real-world mistake), and the last document has <code>&quot;US-EAST&quot;</code> instead of <code>&quot;us-east&quot;</code> (a data quality issue we’ll fix later).</p><h2>Pattern 1: Field concatenation</h2><p>A common runtime field use case is combining two fields into one. For example, creating a <code>host:port</code> identifier.</p><h3>The runtime field approach</h3><p>You can define the field inline at query time. This avoids modifying the mapping and scopes the field to a single search request, but it still requires Painless scripting:</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "endpoint": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['host'].value + ':' + doc['port'].value)"
      }
    }
  },
  "fields": ["endpoint"],
  "_source": false
}</code></pre><h3>The ES|QL approach</h3><p>You can run ES|QL queries using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-esql-query"><code>_query API</code></a> endpoint:</p><pre><code>POST _query
{
  "query": """
    FROM server-logs
    | EVAL endpoint = CONCAT(host, ":", port)
    | KEEP host, port, endpoint
    | LIMIT 1
  """
}</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "host", "type": "keyword" },
    { "name": "port", "type": "keyword" },
    { "name": "endpoint", "type": "keyword" }
  ],
  "values": [
    ["web-01", "8080", "web-01:8080"]
  ]
}</code></pre><p><code>CONCAT</code> accepts two or more arguments and always returns a <code>keyword</code>.</p><p><em>Note: For brevity, the remaining ES|QL examples in this article show just the query. Wrap them in <code>POST _query { &quot;query&quot;: &quot;...&quot; }</code> to run them in Kibana Dev Tools.</em></p><h4>When to use</h4><p>If you need <code>endpoint</code> to persist across all queries and be available in Kibana dashboards, use a mapping-level runtime field. If you need it for a single search request within Query DSL, use a query-time runtime field. If you need it for ad-hoc analysis or exploratory work, ES|QL is simpler.</p><h2>Pattern 2: Data extraction from unstructured text</h2><p>Extracting structured data from raw log messages is another classic runtime field pattern.</p><h3>The runtime field approach</h3><p>Painless uses Java&#x27;s regex <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html">Matcher</a> class:</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "log_user": {
      "type": "keyword",
      "script": {
        "source": "def matcher = /user=(\\w+)/.matcher(params._source['raw_message']); if (matcher.find()) { emit(matcher.group(1)); }"
      }
    }
  },
  "fields": ["log_user"],
  "_source": false
}</code></pre><p>This is verbose. You need to know <a href="https://www.elastic.co/docs/explore-analyze/scripting/modules-scripting-regular-expressions-tutorial">Painless regex syntax</a>, handle the <code>Matcher</code> object, and call <code>emit()</code> correctly.</p><h3>The ES|QL approach: GROK</h3><p>ES|QL provides two purpose-built commands for text extraction. <code>GROK</code> uses regex-based patterns:</p><pre><code>FROM server-logs
| GROK raw_message "%{NOTSPACE:timestamp_date} %{WORD:log_level} user=%{WORD:user} action=%{WORD:action} duration=%{WORD:duration}"
| KEEP user, log_level, action, duration</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "user", "type": "keyword" },
    { "name": "log_level", "type": "keyword" },
    { "name": "action", "type": "keyword" },
    { "name": "duration", "type": "keyword" }
  ],
  "values": [
    ["alice", "INFO", "login", "230ms"], ...
  ]
}</code></pre><p><code>GROK</code> uses the <code>%{SYNTAX:SEMANTIC}</code> pattern format. It extracts multiple fields in a single and readable command.</p><h3>The ES|QL approach: DISSECT</h3><p>For structured data with consistent delimiters, <code>DISSECT</code> is faster because it doesn’t use regular expressions:</p><pre><code>FROM server-logs
| DISSECT raw_message "%{timestamp_date} %{log_level} user=%{user} action=%{action} duration=%{duration}"
| KEEP user, log_level, action, duration</code></pre><p>The syntax is nearly identical to <code>GROK</code>, but <code>DISSECT</code> works by splitting on delimiters rather than matching regex patterns. This makes it faster for data that follows a consistent format.</p><h4>When to use GROK vs DISSECT</h4><p>Use <code>DISSECT</code> when your data has a predictable structure (same delimiters, same field order). Use <code>GROK</code> when you need regex flexibility, for example when fields may be optional or formats vary.</p><h2>Pattern 3: Dynamic type conversion</h2><p>When a field is mapped as <code>keyword</code> but contains numeric data (a surprisingly common scenario), runtime fields can cast it at query time.</p><h3>The runtime field approach</h3><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "response_time_long": {
      "type": "long",
      "script": {
        "source": """
          def val = doc['response_time'].value;
          if (val != 'not_available') {
            emit(Long.parseLong(val));
          }
        """
      }
    }
  },
  "fields": ["response_time_long"],
  "_source": false
}</code></pre><p>You need to handle parsing exceptions manually. If <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/Long.html#parseLong-java.lang.String-"><code>Long.parseLong</code></a> fails on an unexpected value, the script throws an error.</p><h3>The ES|QL approach</h3><p>ES|QL provides explicit conversion functions and a shorthand cast operator:</p><pre><code>FROM server-logs
| EVAL response_ms = TO_LONG(response_time)
| KEEP host, response_time, response_ms</code></pre><p>Or with the <code>::</code> cast operator (<a href="https://www.elastic.co/search-labs/blog/esql-timeline-of-improvements">available since 8.15</a>):</p><pre><code>FROM server-logs
| EVAL response_ms = response_time::long
| KEEP host, response_time, response_ms</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "host", "type": "keyword" },
    { "name": "response_time", "type": "keyword" },
    { "name": "response_ms", "type": "long" }
  ],
  "values": [
    ["web-01", "145", 145]
  ]
}</code></pre><p>Both produce the same result. The key difference from Painless: <strong>Failed conversions return <code>null</code> instead of throwing exceptions</strong>. The document with <code>&quot;not_available&quot;</code> simply gets <code>null</code> for <code>response_ms</code>, and ES|QL emits a warning.</p><p>Common conversion functions include:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><strong>Function</strong></td><td style="border: 1px solid #ccc; padding: 4px;"><strong>Converts to</strong></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_LONG()</code></td><td style="border: 1px solid #ccc; padding: 4px;">Long integer</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_INTEGER()</code></td><td style="border: 1px solid #ccc; padding: 4px;">Integer</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_DOUBLE()</code></td><td style="border: 1px solid #ccc; padding: 4px;">Double</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_DATETIME()</code></td><td style="border: 1px solid #ccc; padding: 4px;">Date</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_BOOLEAN()</code></td><td style="border: 1px solid #ccc; padding: 4px;">Boolean</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_IP()</code></td><td style="border: 1px solid #ccc; padding: 4px;">IP address</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"><code>TO_VERSION()</code></td><td style="border: 1px solid #ccc; padding: 4px;">Version</td>
                    </tr>
              </tbody>
            </table>
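<p>To make the null-on-failure semantics concrete, here is a small illustrative Python sketch (not Elasticsearch code; the function name <code>to_long</code> is ours) that mimics how an ES|QL-style conversion degrades gracefully on the sample documents' <code>response_time</code> values:</p>

```python
def to_long(value):
    """Mimic ES|QL TO_LONG semantics: return None ("null") on a failed
    conversion instead of raising, the way Long.parseLong would."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# response_time values from the sample documents, including the dirty one
response_times = ["145", "not_available", "890", "3400", "50"]
converted = [to_long(v) for v in response_times]
print(converted)  # [145, None, 890, 3400, 50]

# Aggregations can then skip the nulls, as ES|QL's AVG does
valid = [v for v in converted if v is not None]
print(sum(valid) / len(valid))  # 1121.25
```

<p>The Painless version, by contrast, has to guard every parse by hand or risk a script error failing the whole request.</p>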
          <p>The <code>::</code> operator works with all these types (for example, <code>field::double</code>, <code>field::datetime</code>).</p><h4>When to use</h4><p>ES|QL&#x27;s graceful null handling makes it safer for dirty data. Runtime fields with Painless give you fine-grained control over error handling but require more code. For type conversion specifically, ES|QL is almost always the better choice.</p><h2>Pattern 4: <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/dynamic-field-mapping">Dynamic field</a> handling</h2><p>Runtime fields support <code>&quot;dynamic&quot;: &quot;runtime&quot;</code> in mappings, which prevents <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion">mapping explosion</a> by creating all new fields as runtime fields instead of indexed fields:</p><pre><code>{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}</code></pre><p>Any new field sent to this index becomes a runtime field automatically. This is useful when you ingest semi-structured data with unpredictable field names.</p><h3>Where ES|QL fits</h3><p>ES|QL provides query-time flexibility, but it still needs fields to be visible in the mapping. This is where runtime fields and ES|QL complement each other rather than compete.</p><p>If a field exists in <code>_source</code> but isn’t mapped, ES|QL cannot access it directly. The current workaround is to define a runtime field to make the unmapped field visible:</p><pre><code>PUT dynamic-logs/_mapping
{
  "runtime": {
    "custom_field": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['custom_field'])"
      }
    }
  }
}</code></pre><p>Once defined, ES|QL can query it:</p><pre><code>FROM dynamic-logs
| WHERE custom_field == "some_value"
| KEEP timestamp, custom_field</code></pre><p>This is one scenario where runtime fields remain essential. They act as a bridge, making unmapped data accessible to ES|QL.</p><h2>Pattern 5: Field shadowing for error correction</h2><p>Runtime fields can shadow (override) indexed fields by defining a runtime field with the same name as an existing field. This is useful for correcting data without reindexing.</p><h3>The runtime field approach</h3><p>Remember our data quality issue, where <code>region</code> has inconsistent casing (<code>&quot;US-EAST&quot;</code> versus <code>&quot;us-east&quot;</code>)?</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "region": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['region'].toLowerCase())"
      }
    }
  },
  "fields": ["region"],
  "_source": false
}</code></pre><p>This overrides the indexed <code>region</code> field for all queries. Every search, aggregation, and Kibana visualization will see the lowercase version.</p><h3>The ES|QL approach</h3><p>In ES|QL, you shadow the field by reusing its name in an <code>EVAL</code>:</p><pre><code>FROM server-logs
| EVAL region = TO_LOWER(region)
| KEEP host, port, region</code></pre><p>When you use <code>EVAL</code> with an existing column name, ES|QL drops the original column and replaces it with the computed value. This is the exact equivalent of field shadowing, but scoped to the current query.</p><p>You can also chain multiple corrections in a pipeline:</p><pre><code>FROM server-logs
| EVAL region = TO_LOWER(region)
| EVAL region = CASE(region == "us-east", "US East", region == "eu-west", "EU West", region == "ap-south", "AP South", region)
| KEEP host, region</code></pre><h4>When to use</h4><p>If the correction should apply to all queries and <a href="https://www.elastic.co/kibana/kibana-dashboard">Kibana dashboards</a>, use runtime field shadowing. If you need to correct data for a specific analysis, ES|QL is more flexible since you can apply different transformations in different queries without modifying the mapping.</p><h2>The ES|QL pipeline advantage: Going beyond runtime fields</h2><p>This is where ES|QL fundamentally surpasses runtime fields. Runtime fields are isolated: each one computes independently, and you cannot use the output of one runtime field as input for another in the same query.</p><p>ES|QL pipelines chain transforms. Here’s a single query that combines multiple patterns:</p><pre><code>FROM server-logs
| GROK raw_message "%{NOTSPACE:log_date} %{WORD:log_level} user=%{WORD:user} action=%{WORD:action} duration=%{INT:duration_raw}ms"
| EVAL duration_ms = duration_raw::long
| EVAL region = TO_LOWER(region)
| WHERE log_level == "ERROR" AND duration_ms > 100
| STATS avg_duration = AVG(duration_ms), error_count = COUNT(*) BY region</code></pre><p>This single query:</p><ul><li><strong>Extracts</strong> fields from raw text (<code>GROK</code>).</li><li><strong>Converts</strong> the duration to a number (<code>EVAL</code> with cast).</li><li><strong>Normalizes</strong> region casing (<code>EVAL</code> with <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions/to_lower"><code>TO_LOWER</code></a>).</li><li><strong>Filters</strong> for errors with high duration (<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where"><code>WHERE</code></a>).</li><li><strong>Aggregates</strong> by region (<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS</code></a>).</li></ul><p>To achieve the same result with runtime fields, you would need to define at least three separate runtime fields (for extraction, conversion, and normalization) and then write a Query DSL query with <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/filter-search-results">filters</a> and <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregations</a>. The ES|QL version is a single, readable pipeline.</p><p>You can even use expressions directly inside aggregations:</p><pre><code>FROM server-logs
| EVAL response_ms = response_time::long
| STATS
    avg_response = AVG(response_ms),
    p95_response = PERCENTILE(response_ms, 95),
    slow_count = COUNT(CASE(response_ms > 1000, 1, null))
  BY host</code></pre><h2>Conclusion</h2><p>What we covered:</p><ul><li>ES|QL provides a full toolkit (<code>EVAL</code>, <code>GROK</code>, <code>DISSECT</code>, type casting with <code>::</code>) that replaces most runtime field patterns without any Painless scripting.</li><li>Failed type conversions in ES|QL return <code>null</code> instead of throwing exceptions, making it safer for real-world data.</li><li>Pipeline processing (chaining <code>GROK</code> into <code>EVAL</code> into <code>WHERE</code> into <code>STATS</code>) goes beyond what runtime fields can do in isolation.</li><li>Runtime fields remain valuable for persistent computed fields, field shadowing across all queries, and as a bridge for unmapped data in ES|QL.</li></ul><p>One important caveat: Both runtime fields and ES|QL compute values at query time, which means they pay the cost on every query. If you find yourself applying the same transformation repeatedly (type corrections, field extraction, data normalization), consider using <a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">ingest pipelines</a> to fix the data at index time instead. Ingest pipelines let you parse, enrich, and transform documents before they’re stored, so queries can work with clean, properly typed fields directly. Runtime fields and ES|QL are great for exploration and ad-hoc analysis, but for production workloads, indexing the right data from the start is almost always the better choice.</p><p><strong>The key takeaway: </strong>Runtime fields aren’t deprecated, and they aren’t going away. But for most query-time computation patterns, ES|QL offers a simpler, more powerful, and more performant approach. 
And when the transformation is known up front, an ingest pipeline is the most efficient option of all.</p><h2>Next steps</h2><ul><li><a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL documentation</a></li><li><a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields">Runtime fields reference</a></li><li><a href="https://www.elastic.co/search-labs/blog/esql-timeline-of-improvements">ES|QL timeline of improvements</a></li><li><a href="https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields">Getting started with runtime fields</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-process-data-with-dissect-grok">ES|QL processing data with DISSECT and GROK</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating an Elasticsearch MCP server with TypeScript]]></title>
            <description><![CDATA[Learn how to create an Elasticsearch MCP server with TypeScript and Claude Desktop.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0bb24c05fff391f42070c2883182ea6fe9cb9680-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When working with large knowledge bases in Elasticsearch, finding information is only half the battle. Engineers often need to synthesize results from multiple documents, generate summaries, and trace answers back to their sources. Model Context Protocol (MCP) provides a standardized way to connect Elasticsearch with large language model–powered (LLM-powered) applications to accomplish this. While Elastic offers official solutions, like Elastic Agent Builder (which includes an <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP endpoint</a> among its features), building a custom MCP server gives you full control over search logic, result formatting, and how retrieved content is passed to an LLM for synthesis, summaries, and citations.</p><p>In this article, we’ll explore the benefits of building a custom Elasticsearch MCP server and show how to create one in TypeScript that connects Elasticsearch to LLM-powered applications.</p><h2>Why build a custom Elasticsearch MCP server?</h2><p>Elastic provides some alternatives for <a href="https://www.elastic.co/docs/solutions/search/mcp">MCP servers</a>:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">Elastic Agent Builder MCP server for Elasticsearch 9.2+</a></li><li><a href="https://github.com/elastic/mcp-server-elasticsearch?tab=readme-ov-file#elasticsearch-mcp-server">Elasticsearch MCP server for older versions (Python)</a></li></ul><p>If you need more control over how your MCP server interacts with Elasticsearch, building your own custom server gives you the flexibility to tailor it exactly to your needs. For example, Agent Builder&#x27;s MCP endpoint is limited to Elasticsearch Query Language (ES|QL) queries, while a custom server allows you to use the full Query DSL. 
You also gain control over how results are formatted before being passed to the LLM and can integrate additional processing steps, like the OpenAI-powered summarization we&#x27;ll implement in this tutorial.</p><p>By the end of this article, you’ll have an MCP server in TypeScript that searches for information stored in an Elasticsearch index, summarizes it, and provides citations. We&#x27;ll use Elasticsearch for retrieval, OpenAI&#x27;s <code>gpt-4o-mini</code> model to summarize and generate citations, and Claude Desktop as the MCP client and UI to take in user queries and give responses. The end result is an internal knowledge assistant that helps engineers discover and synthesize best practices across their organization’s technical docs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec5771a874cf9740d4cac6888622cbe8cd6aede7-1999x1133.png" style="width: 100%;" /><h2>Prerequisites</h2><ul><li>Node.js 20+</li><li>Elasticsearch</li><li>OpenAI API key</li><li>Claude Desktop</li></ul><h3>What is MCP?</h3><p><a href="https://www.elastic.co/what-is/mcp">MCP</a> is an open standard, created by <a href="https://www.anthropic.com/news/model-context-protocol">Anthropic</a>, that provides secure, bidirectional connections between LLMs and external systems, like Elasticsearch. You can read more about the current state of MCP in <a href="https://www.elastic.co/search-labs/blog/mcp-current-state">this article</a>.</p><p>The MCP landscape is <a href="https://www.elastic.co/search-labs/blog/mcp-current-state#mcp-project-updates:-transport,-elicitation,-and-structured-tooling">evolving every day</a>, with servers available for a wide range of use cases. On top of that, it’s easy to build your own custom MCP server, as we’ll show in this article.</p><h3>MCP clients</h3><p>There’s a long <a href="https://modelcontextprotocol.io/clients">list of available MCP clients</a>, each with its own characteristics and limitations.
For simplicity and popularity, we’ll use <a href="https://claude.ai/download">Claude Desktop</a> as our MCP client. It will serve as the chat interface where users can ask questions in natural language, and it will automatically invoke the tools exposed by our MCP server to search documents and generate summaries.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66eb0b11473347b6cf2d85718251eeac38d6249d-1999x1491.png" style="width: 100%;" /><h2>Creating an Elasticsearch MCP server</h2><p>Using the <a href="https://github.com/modelcontextprotocol/typescript-sdk">TypeScript SDK</a>, we can easily create a server that understands how to query our Elasticsearch data based on a user query input.</p><p>Here are the steps in this article to integrate the Elasticsearch MCP server with the Claude Desktop client:</p><ol><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#configure-mcp-server-for-elasticsearch">Configure MCP server for Elasticsearch.</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#load-the-mcp-server-into-claude-desktop">Load the MCP server into Claude Desktop.</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#test-it-out">Test it out.</a></li></ol><h3>Configure MCP server for Elasticsearch</h3><p>To begin, let&#x27;s initialize a node application:</p><pre><code>npm init -y</code></pre><p>This will create a <code>package.json</code> file, and with it, we can start installing the necessary dependencies for this application.</p><pre><code>npm install @elastic/elasticsearch @modelcontextprotocol/sdk openai zod && npm install --save-dev ts-node @types/node typescript</code></pre><ul><li><strong>@elastic/elasticsearch</strong> will give us access to the Elasticsearch Node.js library.</li><li><strong>@modelcontextprotocol/sdk</strong> provides the core tools to create and manage an MCP server, register 
tools, and handle communication with MCP clients.</li><li><strong>openai</strong> allows interaction with OpenAI models to generate summaries or natural language responses.</li><li><strong><a href="https://zod.dev/">zod</a></strong> helps define and validate structured schemas for input and output data in each tool.</li></ul><p><code>ts-node</code>, <code>@types/node</code>, and <code>typescript</code> will be used during development to type-check and compile the code.</p><h4>Set up the dataset</h4><p>To provide the data that Claude Desktop can query using our MCP server, we’ll use a mock <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/dataset.json">internal knowledge base dataset</a>. Here’s what a document from this dataset will look like:</p><pre><code>{
    "id": 5,
    "title": "Logging Standards for Microservices",
    "content": "Consistent logging across microservices helps with debugging and tracing. Use structured JSON logs and include request IDs and timestamps. Avoid logging sensitive information. Centralize logs in Elasticsearch or a similar system. Configure log rotation to prevent storage issues and ensure logs are searchable for at least 30 days.",
    "tags": ["logging", "microservices", "standards"]
}</code></pre><p>To ingest the data, we prepared a script that creates an index in Elasticsearch and loads the dataset into it. You can find it <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/setup.ts">here</a>.</p><h4>MCP server</h4><p>Create a file named <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/index.ts"><code>index.ts</code></a> and add the following code to import the dependencies and handle environment variables:</p><pre><code>// index.ts
import { z } from "zod";
import { Client } from "@elastic/elasticsearch";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import OpenAI from "openai";

const ELASTICSEARCH_ENDPOINT =
  process.env.ELASTICSEARCH_ENDPOINT ?? "http://localhost:9200";
const ELASTICSEARCH_API_KEY = process.env.ELASTICSEARCH_API_KEY ?? "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY ?? "";
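// Illustrative addition (not part of the original tutorial): warn early when
// OPENAI_API_KEY is missing, since the summarize_and_cite tool cannot work
// without it. Note that stdout is reserved for the stdio MCP transport, so
// diagnostics must go to stderr via console.error, never console.log.
if (!OPENAI_API_KEY) {
  console.error("Warning: OPENAI_API_KEY is not set; summarization will fail.");
}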
const INDEX = "documents";</code></pre><p>Also, let’s initialize the clients to handle the Elasticsearch and OpenAI calls:</p><pre><code>const openai = new OpenAI({
  apiKey: OPENAI_API_KEY,
});

const _client = new Client({
  node: ELASTICSEARCH_ENDPOINT,
  auth: {
    apiKey: ELASTICSEARCH_API_KEY,
  },
});</code></pre><p>To make our implementation more robust and ensure structured input and output, we&#x27;ll define schemas using <a href="https://zod.dev/"><code>zod</code></a>. This allows us to validate data at runtime, catch errors early, and make the tool responses easier to process programmatically:</p><pre><code>const DocumentSchema = z.object({
  id: z.number(),
  title: z.string(),
  content: z.string(),
  tags: z.array(z.string()),
});

const SearchResultSchema = z.object({
  id: z.number(),
  title: z.string(),
  content: z.string(),
  tags: z.array(z.string()),
  score: z.number(),
});
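
// Illustrative usage (not part of the original tutorial): besides compile-time
// types, zod also validates at runtime. safeParse reports failures instead of
// throwing, so a malformed hit could be logged and skipped, e.g.:
//
//   const check = SearchResultSchema.safeParse(hit._source);
//   if (!check.success) console.error(check.error.issues);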

type Document = z.infer<typeof DocumentSchema>;
type SearchResult = z.infer<typeof SearchResultSchema>;</code></pre><p>Learn more about structured outputs <a href="https://www.elastic.co/search-labs/blog/structured-outputs-elasticsearch-guide">here</a>.</p><p>Now let’s initialize the MCP server:</p><pre><code>const server = new McpServer({
  name: "Elasticsearch RAG MCP",
  description:
    "A RAG server using Elasticsearch. Provides tools for document search, result summarization, and source citation.",
  version: "1.0.0",
});</code></pre><h4>Defining the MCP tools</h4><p>With everything configured, we can start writing the tools that will be exposed by our MCP server. This server exposes two tools:</p><ul><li><strong><code>search_docs</code>: </strong>Searches for documents in Elasticsearch using full-text search.</li><li><strong><code>summarize_and_cite</code>:</strong> Summarizes and synthesizes information from previously retrieved documents to answer a user question. This tool also adds citations referencing the source documents.</li></ul><p>Together, these tools form a simple “retrieve-then-summarize” workflow, where one tool fetches relevant documents and the other uses those documents to generate a summarized, cited response.</p><h4>Tool response format</h4><p>Each tool can accept arbitrary input parameters, but it must respond with the following structure:</p><ul><li><strong>Content:</strong> This is the response of the tool in an unstructured format. This field is usually used to return text, images, audio, links, or embeddings. For this application, it will be used to return formatted text with the information generated by the tools.</li><li><strong>structuredContent: </strong>This is an optional return used to provide the results of each tool in a structured format. This is useful for programmatic purposes. Although it isn&#x27;t used in this MCP server, it can be useful if you want to develop other tools or process the results programmatically.</li></ul><p>With that structure in mind, let’s dive into each tool in detail.</p><h4>Search_docs tool</h4><p>This tool performs a <a href="https://www.elastic.co/docs/solutions/search/full-text">full-text search</a> in the Elasticsearch index to retrieve the most relevant documents based on the user query. It highlights key matches and provides a quick overview with relevance scores.</p><pre><code>server.registerTool(
  "search_docs",
  {
    title: "Search Documents",
    description:
      "Search for documents in Elasticsearch using full-text search. Returns the most relevant documents with their content, title, tags, and relevance score.",
    inputSchema: {
      query: z
        .string()
        .describe("The search query terms to find relevant documents"),
      max_results: z
        .number()
        .optional()
        .default(5)
        .describe("Maximum number of results to return"),
    },
    outputSchema: {
      results: z.array(SearchResultSchema),
      total: z.number(),
    },
  },
  async ({ query, max_results }) => {
    if (!query) {
      return {
        content: [
          {
            type: "text",
            text: "Query parameter is required",
          },
        ],
        isError: true,
      };
    }

    try {
      const response = await _client.search({
        index: INDEX,
        size: max_results,
        query: {
          bool: {
            must: [
              {
                multi_match: {
                  query: query,
                  fields: ["title^2", "content", "tags"],
                  fuzziness: "AUTO",
                },
              },
            ],
            should: [
              {
                match_phrase: {
                  title: {
                    query: query,
                    boost: 2,
                  },
                },
              },
            ],
          },
        },
        highlight: {
          fields: {
            title: {},
            content: {},
          },
        },
      });

      const results: SearchResult[] = response.hits.hits.map((hit: any) => {
        const source = hit._source as Document;

        return {
          id: source.id,
          title: source.title,
          content: source.content,
          tags: source.tags,
          score: hit._score ?? 0,
        };
      });

      const contentText = results
        .map(
          (r, i) =>
            `[${i + 1}] ${r.title} (score: ${r.score.toFixed(
              2,
            )})\n${r.content.substring(0, 200)}...`,
        )
        .join("\n\n");

      // hits.total can be either a number or a { value, relation } object,
      // depending on client/transport options, so normalize it here.
      const totalHits =
        typeof response.hits.total === "number"
          ? response.hits.total
          : (response.hits.total?.value ?? 0);

      return {
        content: [
          {
            type: "text",
            text: `Found ${results.length} relevant documents:\n\n${contentText}`,
          },
        ],
        structuredContent: {
          results: results,
          total: totalHits,
        },
      };
    } catch (error: any) {
      // Log to stderr; stdout is reserved for the stdio MCP transport.
      console.error("Error during search:", error);

      return {
        content: [
          {
            type: "text",
            text: `Error searching documents: ${error.message}`,
          },
        ],
        isError: true,
      };
    }
  }
);</code></pre><p><em>We configure <code><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-fuzzy-query">fuzziness</a>: “AUTO”</code> to have a variable typo tolerance based on the length of the token that’s being analyzed. We also set <code>title^2</code> to increase the score of the documents where the match happens on the title field.</em></p><h4>summarize_and_cite tool</h4><p>This tool generates a summary based on documents retrieved in the previous search. It uses OpenAI’s <code>gpt-4o-mini</code> model to synthesize the most relevant information to answer the user’s question, providing responses derived directly from the search results. In addition to the summary, it also returns citation metadata for the source documents used.</p><pre><code>server.registerTool(
  "summarize_and_cite",
  {
    title: "Summarize and Cite",
    description:
      "Summarize the provided search results to answer a question and return citation metadata for the sources used.",
    inputSchema: {
      results: z
        .array(SearchResultSchema)
        .describe("Array of search results from search_docs"),
      question: z.string().describe("The question to answer"),
      max_length: z
        .number()
        .optional()
        .default(500)
        .describe("Maximum length of the summary in characters"),
      max_docs: z
        .number()
        .optional()
        .default(5)
        .describe("Maximum number of documents to include in the context"),
    },
    outputSchema: {
      summary: z.string(),
      sources_used: z.number(),
      citations: z.array(
        z.object({
          id: z.number(),
          title: z.string(),
          tags: z.array(z.string()),
          relevance_score: z.number(),
        })
      ),
    },
  },
  async ({ results, question, max_length, max_docs }) => {
    if (!results || results.length === 0 || !question) {
      return {
        content: [
          {
            type: "text",
            text: "Both results and question parameters are required, and results must not be empty",
          },
        ],
        isError: true,
      };
    }

    try {
      const used = results.slice(0, max_docs);

      const context = used
        .map(
          (r: SearchResult, i: number) =>
            `[Document ${i + 1}: ${r.title}]\n${r.content}`
        )
        .join("\n\n---\n\n");

      // Generate summary with OpenAI
      const completion = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
          {
            role: "system",
            content:
              "You are a helpful assistant that answers questions based on provided documents. Synthesize information from the documents to answer the user's question accurately and concisely. If the documents don't contain relevant information, say so.",
          },
          {
            role: "user",
            content: `Question: ${question}\n\nRelevant Documents:\n${context}`,
          },
        ],
        // Rough budget: ~4 characters per token on average, capped at 1000 tokens.
        max_tokens: Math.min(Math.ceil(max_length / 4), 1000),
        temperature: 0.3,
      });

      const summaryText =
        completion.choices[0]?.message?.content ?? "No summary generated.";

      const citations = used.map((r: SearchResult) => ({
        id: r.id,
        title: r.title,
        tags: r.tags,
        relevance_score: r.score,
      }));

      const citationText = citations
        .map(
          (c: any, i: number) =>
            `[${i + 1}] ID: ${c.id}, Title: "${c.title}", Tags: ${c.tags.join(
              ", ",
            )}, Score: ${c.relevance_score.toFixed(2)}`,
        )
        .join("\n");

      const combinedText = `Summary:\n\n${summaryText}\n\nSources used (${citations.length}):\n\n${citationText}`;

      return {
        content: [
          {
            type: "text",
            text: combinedText,
          },
        ],
        structuredContent: {
          summary: summaryText,
          sources_used: citations.length,
          citations: citations,
        },
      };
    } catch (error: any) {
      return {
        content: [
          {
            type: "text",
            text: `Error generating summary and citations: ${error.message}`,
          },
        ],
        isError: true,
      };
    }
  }
);</code></pre><p>Finally, we need to start the server using <a href="https://github.com/modelcontextprotocol/typescript-sdk?tab=readme-ov-file#stdio">stdio</a>. This means the MCP client will communicate with our server by reading and writing to its standard input and output streams. stdio is the simplest transport option and works well for local MCP servers launched as subprocesses by the client. Add the following code at the end of the file:</p><pre><code>const transport = new StdioServerTransport();
server.connect(transport);</code></pre><p>Now compile the project using the following command:</p><pre><code>npx tsc index.ts --target ES2022 --module node16 --moduleResolution node16 --outDir ./dist --strict --esModuleInterop</code></pre><p>This will create a <code>dist</code> folder, and inside it, an <code>index.js</code> file.</p><h3>Load the MCP server into Claude Desktop</h3><p>Follow <a href="https://modelcontextprotocol.io/docs/develop/connect-local-servers">this guide</a> to configure the MCP server with Claude Desktop. In the Claude configuration file, we need to set the following values:</p><pre><code>{
  "mcpServers": {
    "elasticsearch-rag-mcp": {
      "command": "node",
      "args": ["/Users/user-name/app-dir/dist/index.js"],
      "env": {
        "ELASTICSEARCH_ENDPOINT": "your-endpoint-here",
        "ELASTICSEARCH_API_KEY": "your-api-key-here",
        "OPENAI_API_KEY": "your-openai-key-here"
      }
    }
  }
}</code></pre><p>The <code>args</code> value should point to the compiled file in the <code>dist</code> folder. You also need to set the environment variables in the configuration file with the exact same names defined in the code.</p><h3>Test it out</h3><p>Before executing each tool, click on <strong>Search and Tools</strong> to make sure that the tools are enabled. Here you can also enable or disable each one:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/172981c2a54adabc70d5819013c3007670935605-1999x1002.png" style="width: 100%;" /><p>Finally, let’s test the MCP server from the Claude Desktop chat and start asking questions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/03654c0f8c53c714f801fba8b25747071179209b-1999x1353.png" style="width: 100%;" /><p>For the question “<strong>Search for documents about authentication methods and role-based access control</strong>”, the <code>search_docs</code> tool is executed and returns the following results:</p><pre><code>Most Relevant Documents:
Access Control and Role Management (highest relevance) - This document covers role-based access control (RBAC) principles, including ensuring users only have necessary permissions, regular auditing of user roles, revoking inactive accounts, and implementing just-in-time access for sensitive operations.
User Authentication with OAuth 2.0 - This document explains OAuth 2.0 authentication, which enables secure delegated access without credential sharing. It covers configuring identity providers, token management with limited scope and lifetime, and secure storage of refresh tokens.
Container Security Guidelines - While primarily about container security, this document touches on access control aspects like running containers as non-root users and avoiding embedded credentials.
Incident Response Playbook - This mentions role assignment during incidents (incident commander, communications lead, etc.), which relates to access control in emergency scenarios.
Logging Standards for Microservices - This document includes guidance on avoiding logging sensitive information, which is relevant to authentication security.</code></pre><p>The response is, “Great! I found 5 relevant documents about authentication methods and role-based access control. Here&#x27;s what was found:”</p><p>The tool call returns the source documents as part of its response payload, which are later used to generate citations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25ff4c7e9563d99752f95540dafdc7fd211a66e3-800x530.gif" style="width: 100%;" /><p>It’s also possible to chain multiple tools in a single interaction. In this case, Claude Desktop analyzes the user&#x27;s question and determines that it needs to first call <code>search_docs</code> to retrieve relevant documents and to then pass those results to <code>summarize_and_cite</code> to generate the final answer, all without requiring separate prompts from the user:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6a8a46acb4236e77058f18bcd2f0737b5882c05-1999x1101.png" style="width: 100%;" /><p>In this case, for the query “<strong>What are the main recommendations to improve authentication and access control across our systems? Include references.</strong>”, we obtained the following results:</p><pre><code>Based on the documentation, here are the main recommendations to improve authentication and access control across your systems:

Key Recommendations

1. Implement Role-Based Access Control (RBAC)
Ensure users have only the permissions necessary for their job functions. This principle of least privilege minimizes security risks by limiting access to what's truly needed. [1]

2. Regular Access Audits
Conduct regular audits of user roles and promptly revoke access for inactive accounts. This helps prevent unauthorized access from dormant or compromised accounts. [1]

3. Just-in-Time (JIT) Access
Implement just-in-time access for sensitive operations, where permissions are granted only when needed and for a limited duration. This reduces the attack surface by minimizing standing privileges. [1]

4. OAuth 2.0 for Secure Authentication
Use OAuth 2.0 to enable secure delegated access without sharing user credentials. This protocol allows third-party applications to access resources on behalf of users without exposing passwords. [2]

5. Token Security and Management
Configure your identity provider to issue tokens with limited scope and lifetime

Store refresh tokens securely
Validate access tokens consistently to maintain security [2]
References

Access Control and Role Management (Tags: security, access-control)
User Authentication with OAuth 2.0 (Tags: authentication, oauth)
These recommendations work together to create a defense-in-depth approach, where multiple security layers protect your systems from unauthorized access.</code></pre><p>As in the previous step, we can see the response from each tool for this question:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cb606d356b2f7d5e4878a5eff71bc881869ac0ee-800x585.gif" style="width: 100%;" /><p><em>Note: If a submenu appears asking whether you approve the use of each tool, select <strong>Always allow</strong> or <strong>Allow once</strong>.</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aea942ba9b0037526ea215bec65690f1a5c3099c-1522x250.png" style="width: 100%;" /><h2>Conclusion</h2><p>MCP servers represent a significant step toward standardizing LLM tools for both local and remote applications. Though full compatibility is still in the works, we’re moving fast in that direction.</p><p>In this article, we learned how to build a custom MCP server in TypeScript that connects Elasticsearch to LLM-powered applications. Our server exposes two tools: <code>search_docs</code> for retrieving relevant documents using Query DSL, and <code>summarize_and_cite</code> for generating summaries with citations via OpenAI models. Claude Desktop serves as the client UI.</p><p>The future of compatibility between different client and server providers looks promising. Next steps include adding more functionality and flexibility to your agent. There’s a practical <a href="https://www.elastic.co/search-labs/blog/llm-functions-elasticsearch-intelligent-query">article</a> on how you can parameterize your queries using search templates to gain precision and flexibility.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Announcing read-only permissions for Kibana dashboards]]></title>
            <description><![CDATA[Introducing read-only dashboards in Kibana, giving dashboard creators granular sharing controls to keep results accurate and protected from unwanted changes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-dashboards-read-only-permissions</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-dashboards-read-only-permissions</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Fabio Busatto, Teresa Alvarez Soler]]></dc:creator>
            <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11e578bc317aea30d2e10ccc0334a532f6af2ef9-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>You’ve been there. You spend an hour building the perfect dashboard to monitor your logs: every chart, every filter, and every label. You share it with your team. A few days later, you open it and something’s off. A colleague tweaked a query. Or someone changed the date range. Maybe they thought they were helping. Now you’re digging through revisions and second-guessing every number. Sound familiar?</p><p>That’s exactly why we built <strong>read-only dashboards</strong>. It’s the control you’ve been asking for. Share dashboards with confidence, without worrying that the next person with edit access will change or break them.</p><p>Note: Read-only permissions are available in Elastic Cloud Serverless and from version 9.3 for Elastic Cloud Hosted and Elastic Self-Managed.</p><h2>When “everyone can edit” gets in the way</h2><p>In Kibana, <em>sharing</em> has usually meant space-level permissions. If someone can create dashboards in a space, they can also edit or delete anyone else’s. That’s great for collaboration until it isn’t. One accidental edit can ripple into wrong decisions, lost trust, and a lot of cleanup.</p><p>We’ve heard the workarounds: <strong>“We put ‘read-only’ in the dashboard name and hope people notice.”</strong> Or: <strong>“We tag them and cross our fingers.”</strong> Hope isn’t a permission model. You needed a real way to lock a dashboard without locking everyone out of the space.</p><h2>What actually goes wrong</h2><p>Deb and Kevin both have edit access to the log monitoring dashboard within the Operations space. Kevin makes some changes to the charts. When Deb comes back, the numbers don’t match what she presented. 
She has to track down what changed (often from memory), fix it, and wonder how many reports went out with bad data.</p><h2>Read-only dashboards: Ownership and control that make sense</h2><p>Read-only dashboards fix this by giving you control to decide whether other users can edit the dashboard. When you share a dashboard, you choose: <strong>edit</strong> (default, same as today) or <strong>view</strong>. In <strong>view </strong>mode, only you (and Kibana admins) can change or delete it. Everyone else can open it, use it, and trust it, but they can’t modify it.</p><h3>What you get</h3><ul><li><strong>Dashboard integrity:</strong> In <strong>view</strong> mode, other users with edit access in the space cannot modify or delete the dashboard. If they try, they’re told it’s locked. Your charts and logic stay as you left them.</li><li><strong>You stay in control:</strong> You’re the owner. You can always edit, refine, and update. Sharing as view-only doesn’t lock you out; it locks in the version everyone else sees.</li><li><strong>Flexible lifecycle:</strong> You can switch a dashboard back to “can edit” anytime. And Kibana admins can still manage all dashboards (for example, if the owner leaves). No dead ends.</li></ul><p>You can share finalized, mission-critical dashboards widely and know they’ll stay consistent. This is available in <strong>all Elastic tiers and offerings</strong>, including Serverless.</p><h3>Who can do what?</h3><p>Quick reference by role:</p><ul><li><strong>Dashboard owner:</strong> You created it; you have full edit access.</li><li><strong>Kibana admin:</strong> Can manage all dashboards.</li><li><strong>User with space edit:</strong> Can create and edit their dashboards; can’t edit or delete view-only dashboards.</li><li><strong>User with space view:</strong> Can only view (and list) dashboards.</li></ul>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Action</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard owner</td><td style="border: 1px solid #ccc; padding: 4px;">Kibana admin</td><td style="border: 1px solid #ccc; padding: 4px;">User with space edit</td><td style="border: 1px solid #ccc; padding: 4px;">User with space view</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">List and view dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Create new dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Modify/delete editable dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Modify/delete read-only dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
              </tbody>
            </table>
          <h2>How to turn on read-only</h2><p>You can set view-only when you save a new dashboard or later from the share menu.</p><h3>When saving a new dashboard</h3><ul><li>Build your dashboard, and click <strong>Save</strong>.</li><li>In the “Save as new dashboard” modal, find <strong>Permissions</strong>.</li><li>Change from <strong>Can edit</strong> to <strong>Can view</strong>.</li><li>Click <strong>Save</strong>. Done. It’s read-only for everyone else.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42a71d1bb55f9d50bd079f53bf45a0e1999b27f7-1214x1306.png" style="width: 100%;" /><h3>For a dashboard you already own</h3><ul><li>Open the dashboard.</li><li>Open the <strong>Share dashboard</strong> menu.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8405938557c879b1d4c262b98cf5a7f66408c04-1246x264.png" style="width: 100%;" /><ul><li>In the sharing modal, find <strong>Permissions</strong> and switch to <strong>Can view</strong>. The change applies immediately; other users in the space can no longer edit or delete it.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/849405bc32701f3ebe0def012d8ae3cf3813ea0a-996x750.png" style="width: 100%;" /><ul><li>You can hover over the <strong>Share</strong> action to see what type of permissions a given dashboard has.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80541ddb1b1bc567b0aeff693944ea8b6871d6a7-1270x320.png" style="width: 100%;" /><h3>Seeing which dashboards are locked</h3><p>On the main Dashboards list, dashboards you can’t edit or delete have a disabled selection checkbox. 
This provides an easy way to spot what’s view-only.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24b2eba08df86174db949c662e7886c5aea1b460-1999x876.png" style="width: 100%;" /><p>In the dashboard, you will also find that the Edit action is disabled and a tooltip will appear, explaining that the dashboard has been set as view-only.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0a2f9da6dc854e876fc6dc2a7c3ef8b313b52ef-1358x330.png" style="width: 100%;" /><h2>Try it</h2><p>Read-only dashboards are available now. Create a dashboard, flip it to <strong>Can view</strong>, and share it. Your team gets a single source of truth, and you get peace of mind. No more “please don’t edit” in the title.</p><p>We’d love to hear how you use read-only dashboards. Share your feedback in our <a href="https://discuss.elastic.co">community forum</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The shell tool is not a silver bullet for context engineering]]></title>
            <description><![CDATA[Learn what context-retrieval tools exist for context engineering, how they work, and their trade-offs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/search-tools-context-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/search-tools-context-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Leonie Monigatti]]></dc:creator>
            <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f91e4d07915ba7bf3b7abf15fac8fab3350f7df2-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>The most important tools an agent has are the search tools it can use to build its own context. Recent posts by <a href="https://www.llamaindex.ai/blog/files-are-all-you-need">LlamaIndex</a> and <a href="https://x.com/hwchase17/status/2011814697889316930">LangChain</a> have sparked a discussion: <em>Are a shell tool and a filesystem all an agent needs for context engineering?</em> Unfortunately, the discussion quickly drifted to the wrong focus: filesystem versus database.</p><p>This post refocuses on the question: <em>What are the right search interfaces an agent needs to build its own context?</em> It first covers the trade-offs between shell tools and dedicated database tools. From there, it offers a practical framework for finding the right interfaces for your agent&#x27;s needs.</p><h2>What does &quot;building context&quot; actually mean for an agent?</h2><p>In early <a href="https://www.elastic.co/what-is/retrieval-augmented-generation">retrieval augmented generation (RAG) pipelines</a>, the developer engineered a fixed retrieval pipeline, and the large language model (LLM) was a passive recipient of the context. This was a fundamental limitation: Context was retrieved on every query, whether or not it was needed, with no check that it actually helped.</p><p>With the shift to agentic RAG, agents now have access to a set of search tools to build their own context. For example, both Claude Code [1] and Cursor [2] let the agent choose between different search tools and even combine them for chained queries, depending on what the task actually requires.</p><h2>What search interfaces exist for context engineering?</h2><p>Context can live in different locations, such as on the web, in a local filesystem, or in a database. An agent can interact with each of these out-of-context data sources through different tools:</p><ul><li><strong>Shell tools</strong> can execute shell commands and have access to the local filesystem. 
Some examples of built-in shell tools are <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/bash-tool">Claude API&#x27;s bash tool</a>, <a href="https://docs.openclaw.ai/tools/exec">OpenClaw&#x27;s exec tool</a>, and <a href="https://docs.langchain.com/oss/python/integrations/tools/bash">LangChain&#x27;s shell tool</a>.</li><li><strong>Dedicated database tools,</strong> such as tools from a Model Context Protocol (MCP) server (for example, the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP server)</a> or custom tools (for example, <code>run_esql(query)</code> or <code>db_list_index()</code>), can query databases.</li><li><strong>Dedicated file search tools</strong> can search and read local (or uploaded) files (without full shell access). Some examples of built-in file search tools are <a href="https://ai.google.dev/gemini-api/docs/file-search">Gemini API’s File Search Tool</a> or <a href="https://developers.openai.com/api/docs/guides/tools-file-search">OpenAI’s File Search Tool</a>.</li><li><strong>Web search tools</strong> can retrieve information from the web.</li><li><strong>Memory tools</strong> store and recall from long-term memory (regardless of how it’s stored).</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/115f20c8ded259e508f51524b2c06bdc702d70ab-1999x1050.png" style="width: 100%;" /><p>As you can see, the shell tool is versatile and can be used to retrieve context from different data sources, including:</p><ul><li><strong>Filesystem:</strong> The agent explores the directory structure (ls, find), searches for relevant content (grep, cat), and repeats until it has built sufficient context.</li><li><strong>Database:</strong> The agent can use database command line interface (CLI) tools (for example, <a href="https://www.elastic.co/docs/reference/query-languages/sql/sql-cli"><code>elasticsearch-sql-cli</code></a>), call HTTP APIs 
via curl, or run scripts, which is especially useful in combination with agent skills: reusable, documented examples injected into the agent&#x27;s context to guide correct tool usage (for example, <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills for Elasticsearch</a>).</li><li><strong>Web: </strong>The agent can execute web searches via a curl command through a search provider’s API.</li></ul><p>However, the shell tool provides direct system access and therefore requires safety measures, such as running in an isolated sandbox environment and logging all executed commands.</p><h2>When to use which search interfaces</h2><p>The right search interface depends on your data, your query patterns, and your use case. This section serves as a practical starting point.</p><h3>Filesystems aren’t making databases obsolete</h3><p>The filesystems-versus-databases discussion is not about the storage layer. For example, LangChain explains that <a href="https://x.com/hwchase17/status/2011814697889316930">its memory system</a> doesn’t actually store memory in a real filesystem. Instead, it stores memory in a database and <em>represents</em> it as a set of files to the agent [3].</p><p>Filesystems are a natural fit for file-native use cases, such as coding agents. They also work well as a temporary scratch pad or working memory and for single-user or single-agent scenarios where concurrency isn&#x27;t a concern. In these cases, a physical filesystem or representing the data as a filesystem gives you flexibility before committing to a purpose-built interface.</p><p>But filesystem storage has real downsides, such as weak concurrency, manual schema enforcement, and a lack of atomic transactions. These become more apparent when your application needs to scale or move to a multi-agent scenario. 
Anyone who ignores these downsides is doomed to <a href="https://dx.tips/oops-database">painfully reinvent worse databases</a> without the decades of engineering behind transaction safety or access control that production databases already provide. Additionally, in most enterprise contexts, you don&#x27;t choose whether to use a database since it&#x27;s already there, storing business-critical data.</p><h3>Shell tool + filesystem</h3><p>A shell tool is the natural starting point for filesystem search. Currently, coding agents are driving a lot of progress in the field. Because they work with code in local files, they’re naturally file-heavy use cases. Therefore, LLMs are fine-tuned in the post-training stage for coding tasks. That’s why many LLMs are not only good at writing code but also at using shell commands and navigating filesystems.</p><p>Using a shell tool with built-in CLIs, like <code>ls</code> and <code>grep</code>, to find files is effective. With grep, a query like &quot;Find all files that import <code>matplotlib</code>&quot; is fast, precise, and cheap. But when the agent needs to handle conceptual queries, such as &quot;How does our app handle failed authentication?&quot;, pattern matching with grep can hit a ceiling quickly. Several alternatives that bring semantic search capabilities to the command line have emerged to fill this gap, including <a href="https://github.com/jina-ai/jina-grep-cli"><code>jina-grep</code></a>.</p><p>However, grep and many of its semantic search alternatives run in O(n) over the corpus. For use cases over codebases, this might be fine. However, if your data grows, latency will become noticeable. In this case, an indexed datastore becomes necessary to maintain performance.</p><h3>Shell tool + database</h3><p>Another way to add more search capabilities, such as semantic or hybrid search, over your data is to store it in a database, as Cursor does, for example. 
Additionally, when data requires complex relational joins or aggregations, a database interface is nonnegotiable.</p><p>When the data lives in a database rather than on the filesystem, a shell tool can serve as a lightweight database interface for certain use cases. If your queries are simple enough for a CLI or a curl call, a dedicated database tool may add unnecessary complexity.</p><p>This approach is also suitable in early exploration stages, when you don&#x27;t yet know what query patterns your agent will actually develop. In this case, Agent Skills can give the agent enough structure to query correctly without committing to a purpose-built tool. However, when the agent requires many iterations to figure out the right way to query the database for repeated tasks, the token overhead of using a shell tool as the interface no longer justifies the simplicity benefit of avoiding an extra tool.</p><h3>Dedicated database tool</h3><p>Especially when repeated query patterns are structured or analytical, dedicated database tools become necessary. A <a href="https://vercel.com/blog/testing-if-bash-is-all-you-need">blog post from Vercel and Braintrust</a> compared agents with different sets of search tools for real-world retrieval tasks over semi-structured data, such as customer support tickets and sales call transcripts (for example, “How many open issues mention &#x27;security&#x27;?&quot; or &quot;Find issues where someone reported a bug and later someone submitted a PR claiming to fix it?&quot;) [4].</p><p>Agents with dedicated database tools used fewer tokens, were faster, and made fewer mistakes than agents with only a shell tool and filesystem. The lesson is that direct database tools are the right choice when the query requires analytical reasoning over semi-structured data.</p><h3>Combining search interfaces</h3><p>No single search interface handles every query well. 
For example, Cursor combines shell tools (for searches via grep) and semantic search tools and lets the agent select the right tool based on the user’s prompt. They report that the agent chooses grep for matching specific symbols or strings, semantic search for conceptual or behavior questions, and both for exploratory tasks.</p><p>The Vercel experiment reports the same: Its hybrid agent with access to both a shell tool and a dedicated database tool achieved the best performance out of all tested agents by first using the dedicated database tools and then verifying the results by grepping through the filesystem. However, this approach uses more tokens and time for reasoning about tool choice and verification.</p><p>The pattern across both examples is the same: Composition beats any single interface, but composition comes at the trade-off of added cost and latency.</p><h2>Practical recommendations for finding the right set of tools</h2><p>The right set of search interfaces is small, purposeful, and specific to your agent&#x27;s actual query patterns. The current best practice is to have an agent with as few tools as possible instead of having an agent with hundreds of MCP tools. This is because the downside of exposing all possible tools up front is that it bloats the context window and confuses the agent about which tool to actually use. For example, Claude Code reportedly only has about 20 tools.</p><p>Instead, the idea of progressive disclosure is to start with a minimal set of tools and let the agent discover additional capabilities only when needed. Research from Anthropic [5] and Cursor [6] has shown that this approach yields a token savings between 47%–85%. 
Claude Code, for example, implements this directly, allowing the agent to incrementally discover how to query an API or a database, without that knowledge consuming context on every LLM call.</p><p>Once you’re familiar with the agent&#x27;s query patterns, you can revisit the set of search tools that the agent has access to by default. A useful way to think about this trade-off is the <a href="https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering#building-the-right-database-retrieval-tools-%5C(%E2%80%9Clow-floor,-high-ceiling%E2%80%9D%5C">&quot;low floor, high ceiling&quot; principle</a> for deciding which tools make the cut. High-ceiling tools don&#x27;t limit the agent&#x27;s potential. For example, a versatile shell tool lets the agent write full database queries, including ambiguous ones, but at the cost of reasoning overhead, higher latency, and lower reliability.</p><p>Low-floor tools are the opposite. They’re specialized tools that wrap specific queries and are immediately accessible to the agent with minimal reasoning overhead, producing lower cost and higher reliability. But they need upfront engineering, can&#x27;t cover every possible query, and can make it harder for the agent to choose the right tool.</p><p>Think of each tool on a spectrum: Low-floor tools are easy for the agent to use correctly but narrow in scope. High-ceiling tools are versatile but demand more reasoning to use well.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d1b973be4b0a0a25c99c74f02a47e98395a3f7-1200x630.png" style="width: 100%;" /><p>Most agents need a mix of different search tools. But each tool needs to earn its addition. We recommend starting with an all-purpose search tool (for example a <code>search_database()</code> tool or a shell tool). 
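<p>To make this spectrum concrete, here’s a minimal sketch (hypothetical names, with an in-memory list standing in for a real database) of the same lookup exposed both as a low-floor tool and as a high-ceiling one:</p>

```python
# Hypothetical sketch: an in-memory list stands in for a real database of
# support tickets; the tool names are illustrative, not a real API.
TICKETS = [
    {"id": 1, "status": "open", "text": "security hole in login"},
    {"id": 2, "status": "closed", "text": "typo on pricing page"},
    {"id": 3, "status": "open", "text": "security review request"},
]

def count_open_mentioning(term: str) -> int:
    """Low-floor tool: wraps one fixed query pattern, so the agent
    can call it correctly with minimal reasoning overhead."""
    return sum(1 for t in TICKETS if t["status"] == "open" and term in t["text"])

def search_database(predicate) -> list:
    """High-ceiling tool: the agent supplies arbitrary filter logic,
    gaining flexibility at the cost of more reasoning per call."""
    return [t for t in TICKETS if predicate(t)]

# The repeated, purpose-built pattern vs. its ad hoc equivalent.
print(count_open_mentioning("security"))                        # -> 2
print(len(search_database(lambda t: "security" in t["text"])))  # -> 2
```

<p>The low-floor wrapper handles the repeated pattern cheaply and reliably; the high-ceiling tool covers everything else at the cost of more reasoning per call.</p>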
Then reuse the command logs you&#x27;re already keeping for security purposes to track what your agent actually does, including tool calls, retries, and number of calls per user query. And, when you see a query pattern repeating or failing, that&#x27;s the signal to build a purpose-built tool for it.</p><h2>Summary</h2><p>The filesystem-versus-database debate is distracting from the actual question that engineers need to be asking: <em>What are the right search interfaces an agent needs to build its own context?</em> The answer is most likely, <em>Not a single one</em>.</p><p>A shell tool is a versatile tool to interact with different out-of-context sources and thus a good starting point. But it’s less efficient and accurate for use cases with structured analytical queries than dedicated database tools.</p><p>The goal is to find the minimal set of search tools that handles your agent&#x27;s actual query patterns well. Start with a shell tool, and log what your agent actually does. When you see a query pattern repeating and failing, it’s time to engineer specialized tools.</p><h2>References</h2><p>1. Thariq (Anthropic). <a href="https://x.com/trq212/status/2027463795355095314">Lessons from Building Claude Code: Seeing like an Agent</a> (2026).</p><p>2. Cursor: Documentation. <a href="https://cursor.com/docs/agent/tools/search">Semantic &amp; agentic search</a> (2026).</p><p>3. Harrison Chase (LangChain). <a href="https://x.com/hwchase17/status/2011814697889316930">How we built Agent Builder’s memory system</a> (2026).</p><p>4. Ankur Goyal (Braintrust) and Andrew Qu (Vercel). <a href="https://vercel.com/blog/testing-if-bash-is-all-you-need">Testing if &quot;bash is all you need&quot;</a> (2026).</p><p>5. Anthropic. <a href="https://www.anthropic.com/engineering/advanced-tool-use">Introducing advanced tool use on the Claude Developer Platform</a> (2025).</p><p>6. Cursor. 
<a href="https://cursor.com/blog/dynamic-context-discovery">Dynamic context discovery</a> (2026).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch replicas for load balancing in Serverless]]></title>
            <description><![CDATA[Learn how Elastic Cloud Serverless automatically adjusts index replicas based on search load, ensuring optimal query performance without manual configuration.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-replicas-load-balancing-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-replicas-load-balancing-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Andrei Dan]]></dc:creator>
            <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c4c1e99b892f026b7aba098973593f8298e2ea6-1280x717.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In Elastic Cloud Serverless, we automatically adjust the number of replicas for your indices based on search load, ensuring optimal query performance without any manual configuration. In this blog, we’ll explain how replicas are scaled, when the system adds or removes them, and what this means for your indices.</p><h2>The party is getting crowded</h2><p>You&#x27;re hosting a pizza party. You&#x27;ve got a few friends helping you serve, each stationed at different spots around the room. You give each friend a pizza, and they start handing out slices to hungry guests as they arrive.</p><p>At first, things run smoothly. A few guests trickle in, your friends serve slices, everyone&#x27;s happy. But then word spreads about your sourdough pizzas. The doorbell keeps ringing. Guests pour in. Soon, there&#x27;s a crowd forming around one of your friends, the one holding the pepperoni pizza, which everyone seems to want.</p><p>Your friend with the pepperoni pizza is overwhelmed. Guests are waiting, getting impatient, and a large queue has formed. Meanwhile, your friend holding the margherita pizza is standing around with barely anyone asking for a slice.</p><p>What do you do?</p><p>You order a couple more pepperoni pizzas and hand them to other friends. Now three friends are holding pepperoni instead of one. The crowd spreads out, and suddenly you can serve three times as many guests at once.</p><p>A few things become clear as you host more parties:</p><ul><li><strong>Not all pizzas are equally popular.</strong> Some are in high demand, others have fewer takers. You don&#x27;t need extra &quot;copies&quot; of the unpopular ones. You need extras of the ones with queues.</li><li><strong>Order more pizzas before the queue gets too long.</strong> If you wait until your friend is completely overwhelmed and guests are leaving angry, you&#x27;ve waited too long. 
Better to get an extra pizza when you see a crowd forming.</li><li><strong>Don&#x27;t throw away pizzas too quickly.</strong> Just because the crowd around the pepperoni thinned out for five minutes doesn&#x27;t mean the rush is over. Maybe they&#x27;re just refilling drinks, or even talking among themselves (is that still a thing?). Keep the extra pizzas ready. If the lull continues for a while, then you can put them away.</li><li><strong>You can only hand out as many pizzas as you have friends who are helping.</strong> If you&#x27;ve only got four friends helping, ten pizzas won’t change the outcome. Only four can be served at once. Match your pizza count to your available hands.</li><li><strong>When a friend leaves, take their pizza.</strong> If one of your friends needs to head out, grab their pizza immediately. You can&#x27;t have pizzas sitting unattended. Hand it to someone else, or put it away.</li></ul><h2>From pizzas to replicas</h2><p>Let&#x27;s map this back to Elasticsearch.</p><p>In our analogy, pizzas are replicas (copies of your index shards), your friends helping serve are search nodes, hungry guests are search queries, and that popular pizza with a crowd around it is a hot index with high search load.</p><p>When search traffic increases on a particular index, we create additional replicas and distribute them across your search nodes. Any replica can serve any query for that index, just like any friend holding pepperoni can hand out pepperoni slices. More replicas means higher throughput: Three replicas can handle three times the queries per second of a single replica.</p><h2>Measuring the hunger</h2><p>Before we decide how many pizzas to order, we need to know how hungry the crowd is.</p><p>Elasticsearch tracks the <strong>search load</strong> for every shard. It&#x27;s a metric that captures how much search activity a shard is handling. 
We aggregate this across all shards of an index to understand the total search demand.</p><p>What matters most is the <strong>relative search load</strong>: What proportion of your project&#x27;s total search traffic is hitting each index? If one index is receiving 60% of all searches while another gets 5%, we know where to add capacity.</p><h2>The math behind the pizzas</h2><p>We calculate the optimal number of replicas following this formula:</p><pre><code>desired_replicas = min(ceil(L × N / (S × X)), N)</code></pre><p>Where:</p><ul><li><strong>L</strong> = the index&#x27;s relative search load (between 0 and 1).</li><li><strong>N</strong> = the number of desired search nodes in your project.</li><li><strong>S</strong> = the number of shards in the index.</li><li><strong>X</strong> = a threshold to avoid hot spots (default: 0.5).</li></ul><p>An example: four search nodes, one index with two primary shards receiving 80% of search traffic:</p><pre><code>desired_replicas = min(ceil(0.8 × 4 / (2 × 0.5)), 4)
                 = min(4, 4)
                 = 4</code></pre><p>This hot index gets four replicas distributed across the search nodes.</p><p>The threshold X (defaulting to 0.5) is important. We don&#x27;t wait until a replica is completely overwhelmed; we scale up when it&#x27;s at half capacity. Hand out the extra pizza when you see the crowd forming, not when guests are already leaving.</p><h2>Scale up fast, scale down slow</h2><p>When search load increases, we add replicas immediately. No reason to make users wait.</p><p>When search load drops, we wait a bit before taking any action. We need to see consistent low demand for about 30 minutes before reducing replicas. (This is to deal with spiky traffic where a quiet moment doesn&#x27;t mean the party is over.)</p><p>This matters because adding a replica has a cost. The new replica copies data and warms its caches before serving queries efficiently. Removing replicas too eagerly means constantly paying this startup cost as traffic naturally fluctuates.</p><h2>Respecting topology bounds</h2><p>Replicas can never exceed the number of search nodes. Having more replicas than nodes provides no benefit (you can only serve as many pizzas as you have friends who are helping to serve slices).</p><p>When nodes are removed from your project, we reduce replicas immediately to match. No waiting for the cooldown, as you can&#x27;t have unassigned replicas. 
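<p>The replica calculation described above, including the cap at the node count, can be sketched in a few lines of Python (a sketch of the published formula, not Elasticsearch’s actual implementation):</p>

```python
import math

def desired_replicas(load: float, nodes: int, shards: int, threshold: float = 0.5) -> int:
    """Scale replicas with relative search load, spread the load across
    shards, and never exceed the number of search nodes."""
    return min(math.ceil(load * nodes / (shards * threshold)), nodes)

# The worked example from this post: 80% of traffic, 4 search nodes, 2 primary shards.
print(desired_replicas(0.8, 4, 2))   # -> 4
# A cold index under the same topology keeps far fewer replicas.
print(desired_replicas(0.05, 4, 2))  # -> 1
```

<p>Lowering the threshold X makes the system add replicas earlier; raising it tolerates hotter replicas before scaling out.</p>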
The moment a friend leaves, we remove their pizza.</p><h2>The bigger Serverless picture</h2><p>Replicas for search load balancing work alongside other autoscaling systems:</p><ul><li><strong>Search autoscaling</strong> adjusts the number of search nodes (how many friends are helping).</li><li><strong>Replicas for search load balancing</strong> distribute traffic by adjusting replica counts per index (how many pizzas of each kind we need).</li><li><strong>Data stream autosharding</strong> optimizes shard counts for writes (how to slice each pizza, covered in the <a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">previous post</a>).</li></ul><p>An important design principle: Replicas for load balancing don&#x27;t directly trigger search autoscaling. Instead, by distributing search requests across more replicas, they enable higher resource utilization across your search nodes. This higher utilization then triggers our existing autoscaling logic to add capacity if needed. Replicas for load balancing enable autoscaling to do its job, making sure your search nodes are actually being used, rather than having all traffic bottlenecked on a single replica while other nodes sit idle.</p><h2>What this means for you</h2><p>You don&#x27;t need to predict which indices will be popular. You don&#x27;t need to manually adjust replicas when traffic patterns change. You don&#x27;t need to wake up at 3 a.m. because a surge overwhelmed your busiest index.</p><p>The system watches where queues are forming and orders more pizzas for those spots. Cold indices don&#x27;t waste resources on unnecessary replicas. Hot indices get the capacity they need. Your budget goes where it matters.</p><h2>Conclusion</h2><p>In the <a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">autosharding post</a>, we made sure your pizzas are sliced right. 
Now, with replicas for search load balancing, we make sure you have enough pizzas, in the right hands, when the hungry crowds arrive.</p><p>Try <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and let us handle the pizza logistics.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Elasticsearch Inference API along with Hugging Face models]]></title>
            <description><![CDATA[Learn how to connect Elasticsearch to Hugging Face models using inference endpoints, and build a multilingual blog recommendation system with semantic search and chat completions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hugging-face-elasticsearch-inference-api</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hugging-face-elasticsearch-inference-api</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1417d6ff033712206c9bd4bcc22074ee3437ce96-1999x1125.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In recent updates, Elasticsearch introduced a native integration to connect to models hosted on the <a href="https://endpoints.huggingface.co/">Hugging Face Inference Service</a>. In this post, we’ll explore how to configure this integration and perform inference through simple API calls using a large language model (LLM). We’ll use <a href="https://huggingface.co/HuggingFaceTB/SmolLM3-3B">SmolLM3-3B</a>, a lightweight general-purpose model with a good balance between resource usage and answer quality.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ddadf1976421a860a7d62087239adb9150d808b-1999x1388.png" style="width: 100%;" /><h2>Prerequisites</h2><ul><li><strong>Elasticsearch 9.3 or Elastic Cloud Serverless: </strong>You can create a cloud deployment following <a href="https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud">these instructions</a>, or you can use the <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart#local-dev-quick-start"><code>start-local</code></a> quickstart instead.</li><li><strong>Python 3.12: </strong>Download Python <a href="https://www.python.org/">here</a>.</li><li><strong>Hugging Face </strong><a href="https://huggingface.co/docs/hub/en/security-tokens">access token</a>.</li></ul><h2>Chat completions using a Hugging Face inference endpoint</h2><p>First, we’ll build a practical example that connects Elasticsearch to a Hugging Face <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">inference endpoint</a> to generate AI-powered recommendations from a collection of blog posts. 
For the app knowledge base, we’ll use a dataset of company blog articles, which contains valuable but often hard-to-navigate information.</p><p>With this endpoint, <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> retrieves the most relevant articles for a given query, and a Hugging Face LLM generates short, contextual recommendations based on those results.</p><p>Let’s take a look at a high-level overview of the information flow we’re going to build:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1dfbc2323438feaaa42e13ab242dd1f7166f74aa-1200x676.png" style="width: 100%;" /><p>In this article, we’ll test <strong>SmolLM3-3B</strong>’s capacity to combine its compact size with strong multilingual reasoning and tool-calling capabilities. Based on a search query, we’ll send all the matching content (in English and Spanish) to the LLM to generate a list of recommended articles with a custom-made description based on the search query and results.</p><p>Here’s what the UI of an article site with an AI recommendation system could look like.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d3b86b212f28ff279f2da67a33e6134039f0e4e-1999x949.png" style="width: 100%;" /><p>You can find the full implementation of this application in the linked <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/notebook.ipynb">notebook</a>.</p><h3>Configuring Elasticsearch inference endpoints</h3><p>To use the Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-hugging-face">Hugging Face inference endpoint</a>, we need two important elements: a Hugging Face API key and a running Hugging Face endpoint URL. It should look like this:</p><pre><code>PUT _inference/chat_completion/hugging-face-smollm3-3b
{
    "service": "hugging_face",
    "service_settings": {
        "api_key": "hugging-face-access-token", 
        "url": "url-endpoint" 
    }
}</code></pre><p>The Hugging Face inference endpoint in Elasticsearch supports different task types: <code>text_embedding</code>, <code>completion</code>, <code>chat_completion</code>, and <code>rerank</code>. In this blog post, we use <code>chat_completion</code> because we need the model to generate conversational recommendations based on the search results and a system prompt. This endpoint allows us to perform chat completions directly from Elasticsearch in a simple way using the Elasticsearch API:</p><pre><code>POST _inference/chat_completion/hugging-face-smollm3-3b/_stream
{
  "messages": [
      { "role": "user", "content": "<user prompt>" }
  ]
}</code></pre><p>This will serve as the core of the application, receiving the prompt and the search results that will pass through the model. With the theory covered, let’s start implementing the application.</p><h4>Setting up ​​inference endpoint on Hugging Face</h4><p>To deploy the Hugging Face model, we’re going to use <a href="https://huggingface.co/inference-endpoints/dedicated">Hugging Face one-click deployments</a>, an easy and fast service for deploying model endpoints. Keep in mind that this is a paid service, and using it may incur additional costs. This step will create the model instance that will be used to generate the recommendations of the articles.</p><p>You can pick a model from the one-click catalog:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b816e9fba1fe172687bf58f5143fb1f838c1077f-549x331.png" style="width: 100%;" /><p>Let’s pick the <strong>SmolLM3-3B</strong> model:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/610d3aba0429f3666c2df3616d513eb6a4397c0c-502x478.png" style="width: 100%;" /><p>From here, grab the Hugging Face endpoint URL:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/025094ddb2cfbd1f0f216a5ec4e119b0f4fa2c42-646x328.png" style="width: 100%;" /><p>As mentioned in the Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-hugging-face">Hugging Face inference endpoints documentation</a>, text generation requires a model that’s compatible with the OpenAI API. For that reason, we need to append the <code>/v1/chat/completions</code> subpath to the Hugging Face endpoint URL. 
The final result will look like this:</p><pre><code>https://j2g31h0futopfkli.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions</code></pre><p>With this in place, we can start coding in a Python notebook.</p><h4>Generating Hugging Face API key</h4><p>Create a <a href="https://huggingface.co/join">Hugging Face account</a>, and obtain an API token by following <a href="https://huggingface.co/docs/hub/en/security-tokens#user-access-tokens">these instructions</a>. You can choose between three token types: <em>fine-grained</em> (recommended for production, as it provides access only to specific resources); <em>read</em> (for read-only access); or <em>write</em> (for read and write access). For this tutorial, a read token is sufficient, since we only need to call the inference endpoint. Save this key for the next step.</p><h4>Setting up Elasticsearch inference endpoint</h4><p>First, let’s declare an Elasticsearch Python client:</p><pre><code>os.environ["ELASTICSEARCH_API_KEY"] = "your-elasticsearch-api-key"
os.environ["ELASTICSEARCH_URL"] = "https://xxxx.us-central1.gcp.cloud.es.io:443"

es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"], api_key=os.environ["ELASTICSEARCH_API_KEY"]
)</code></pre><p>Next, let’s create an Elasticsearch inference endpoint that uses the Hugging Face model. This endpoint will allow us to generate responses based on the blog posts and the prompt passed to the model.</p><pre><code>INFERENCE_ENDPOINT_ID = "smollm3-3b-pnz"

os.environ["HUGGING_FACE_INFERENCE_ENDPOINT_URL"] = (
 "https://j2g31h0futopfkli.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions"
)
os.environ["HUGGING_FACE_API_KEY"] = "hf_xxxxx"

resp = es_client.inference.put(
        task_type="chat_completion",
        inference_id=INFERENCE_ENDPOINT_ID,
        body={
            "service": "hugging_face",
            "service_settings": {
                "api_key": os.environ["HUGGING_FACE_API_KEY"],
                "url": os.environ["HUGGING_FACE_INFERENCE_ENDPOINT_URL"],
            },
        },
    )</code></pre><h3>Dataset</h3><p>The dataset contains the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/dataset.json">blog posts</a> that will be queried, representing a multilingual content set used throughout the workflow:</p><pre><code>// Articles dataset document example: 
{
    "id": "6",
    "title": "Complete guide to the new API: Endpoints and examples",
    "author": "Tomas Hernandez",
    "date": "2025-11-06",
    "category": "tutorial",
    "content": "This guide describes in detail all endpoints of the new API v2. It includes code examples in Python, JavaScript, and cURL for each endpoint. We cover authentication, resource creation, queries, updates, and deletion. We also explain error handling, rate limiting, and best practices. Complete documentation is available on our developer portal."
  }</code></pre><h4>Elasticsearch mappings</h4><p>With the dataset defined, we need to create a data schema that properly fits the blog post structure. The following <a href="https://www.elastic.co/docs/manage-data/data-store/mapping">index mappings</a> will be used to store the data in Elasticsearch:</p><pre><code>INDEX_NAME = "blog-posts"

mapping = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "title": {
                "type": "object",
                "properties": {
                    "original": {
                        "type": "text",
                        "copy_to": "semantic_field",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                    "translated_title": {
                        "type": "text",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                },
            },
            "author": {"type": "keyword", "copy_to": "semantic_field"},
            "category": {"type": "keyword", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "date": {"type": "date"},
            "semantic_field": {"type": "semantic_text"},
        }
    }
}


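# Aside (not in the original notebook): dropping any existing copy of the
# index makes this cell safe to re-run; otherwise the create call below
# raises resource_already_exists_exception on a second run.
if es_client.indices.exists(index=INDEX_NAME):
    es_client.indices.delete(index=INDEX_NAME)
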
es_client.indices.create(index=INDEX_NAME, body=mapping)</code></pre><p>Here, we can see more clearly how the data is structured. We’ll use semantic search to retrieve results based on natural language, along with the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to"><code>copy_to</code></a> property to copy the field contents into the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> field. Additionally, the <code>title</code> field contains two subfields: the <code>original</code> subfield stores the title in either English or Spanish, depending on the original language of the article; and the <code>translated_title</code> subfield is present only for Spanish articles and contains the English translation of the original title.</p><h3>Ingesting data</h3><p>The following code snippet ingests the blog posts dataset into Elasticsearch using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/javascript/bulk_examples">bulk API</a>:</p><pre><code>def build_data(json_file, index_name):
    with open(json_file, "r") as f:
        data = json.load(f)

    for doc in data:
        action = {"_index": index_name, "_source": doc}
        yield action


try:
    success, failed = helpers.bulk(
        es_client,
        build_data("dataset.json", INDEX_NAME),
    )
    print(f"{success} documents indexed successfully")

    if failed:
        print(f"Errors: {failed}")
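
    # Aside (not in the original): refresh the index so the documents just
    # ingested are immediately visible to the searches that follow.
    es_client.indices.refresh(index=INDEX_NAME)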
except Exception as e:
    print(f"Error: {str(e)}")</code></pre><p>Now that we have the articles ingested into Elasticsearch, we need to create a function capable of searching against the <code>semantic_text</code> field:</p><pre><code>def perform_semantic_search(query_text, index_name=INDEX_NAME, size=5):
    try:
        query = {
            "query": {
                "match": {
                    "semantic_field": {
                        "query": query_text,
                    }
                }
            },
            "size": size,
        }

        response = es_client.search(index=index_name, body=query)
        hits = response["hits"]["hits"]

        return hits
    except Exception as e:
        print(f"Semantic search error: {str(e)}")
        return []</code></pre><p>We also need a function that calls the inference endpoint. In this case, we’ll call the endpoint using the <strong><code>chat_completion</code> </strong>task type to get streaming responses:</p><pre><code>def stream_chat_completion(messages: list, inference_id: str = INFERENCE_ENDPOINT_ID):
    url = f"{os.environ['ELASTICSEARCH_URL']}/_inference/chat_completion/{inference_id}/_stream"
    payload = {"messages": messages}
    headers = {
        "Authorization": f"ApiKey {os.environ['ELASTICSEARCH_API_KEY']}",
        "Content-Type": "application/json",
    }

    try:
        response = requests.post(url, json=payload, headers=headers, stream=True)
        response.raise_for_status()

        for line in response.iter_lines(decode_unicode=True):
            if line:
                line = line.strip()

                if line.startswith("event:"):
                    continue

                if line.startswith("data: "):
                    data_content = line[6:]

                    if not data_content.strip() or data_content.strip() == "[DONE]":
                        continue

                    try:
                        chunk_data = json.loads(data_content)

                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
                            choice = chunk_data["choices"][0]
                            if "delta" in choice and "content" in choice["delta"]:
                                content = choice["delta"]["content"]
                                if content:
                                    yield content

                    except json.JSONDecodeError as json_err:
                        print(f"\nJSON decode error: {json_err}")
                        print(f"Problematic data: {data_content}")
                        continue

    except requests.exceptions.RequestException as e:
        yield f"Error: {str(e)}"</code></pre><p>Now we can write a function that combines the semantic search function with the <code>chat_completion</code> inference endpoint to generate the data that will be displayed in the cards:</p><pre><code>def recommend_articles(search_query, index_name=INDEX_NAME, max_articles=5):
    print(f"\n{'='*80}")
    print(f"🔍 Search Query: {search_query}")
    print(f"{'='*80}\n")

    articles = perform_semantic_search(search_query, index_name, size=max_articles)

    if not articles:
        print("❌ No relevant articles found.")
        return None, None, None

    print(f"✅ Found {len(articles)} relevant articles\n")

    # Build context with found articles
    context = "Available blog articles:\n\n"
    for i, article in enumerate(articles, 1):
        source = article.get("_source", article)
        context += f"Article {i}:\n"
        context += f"- Title: {source.get('title', 'N/A')}\n"
        context += f"- Author: {source.get('author', 'N/A')}\n"
        context += f"- Category: {source.get('category', 'N/A')}\n"
        context += f"- Date: {source.get('date', 'N/A')}\n"
        context += f"- Content: {source.get('content', 'N/A')}\n\n"

    system_prompt = """You are an expert content curator that recommends blog articles.

    Write recommendations in a conversational style starting with phrases like:
    - "If you're interested in [topic], this article..."
    - "This post complements your search with..."
    - "For those looking into [topic], this article provides..."


    FORMAT REQUIREMENTS:
    - Return ONLY a JSON array
    - Each element must have EXACTLY these three fields: "article_number", "title", "recommendation"
    - If the original title is in Spanish, use the "translated_title" subfield in the "title" field

    Keep each recommendation concise (2-3 sentences max) and focused on VALUE to the reader.

    EXAMPLE OF CORRECT FORMAT:
    [
        {"article_number": 1, "title": "Article title in English", "recommendation": "If you are interested in [topic], this article provides..."},
        {"article_number": 2, "title": "Article title in English", "recommendation": "For those looking into [topic], this article provides..."}
    ]

    Return ONLY the JSON array following this exact structure."""

    user_prompt = f"""Search query: "{search_query}"

    Generate recommendations for the following articles: {context}
    """

    messages = [
        {"role": "system", "content": "/no_think"},
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    # LLM generation
    print(f"{'='*80}")
    print("🤖 Generating personalized recommendations...\n")

    full_response = ""

    for chunk in stream_chat_completion(messages):
        print(chunk, end="", flush=True)
        full_response += chunk

    return context, articles, full_response</code></pre><p>Finally, we need to extract the information and format it to be printed:</p><pre><code>def display_recommendation_cards(articles, recommendations_text):
    print("\n" + "=" * 100)
    print("📇 RECOMMENDED ARTICLES".center(100))
    print("=" * 100 + "\n")

    # Parse JSON recommendations - clean tags and extract JSON
    recommendations_list = []
    try:

        # Clean up <think> tags
        cleaned_text = re.sub(
            r"<think>.*?</think>", "", recommendations_text, flags=re.DOTALL
        )
        # Remove markdown code fences (```json ... ``` or plain ``` ... ```)
        cleaned_text = re.sub(r"```(?:json)?", "", cleaned_text)
        cleaned_text = cleaned_text.strip()

        parsed = json.loads(cleaned_text)

        # Extract recommendations from list format
        for item in parsed:
            article_number = item.get("article_number")
            title = item.get("title", "")
            rec_text = item.get("recommendation", "")

            if article_number and rec_text:
                recommendations_list.append(
                    {
                        "article_number": article_number,
                        "title": title,
                        "recommendation": rec_text,
                    }
                )
    except json.JSONDecodeError as e:
        print(f"⚠️  Could not parse recommendations as JSON: {e}")
        return

    for i, article in enumerate(articles, 1):
        source = article.get("_source", article)

        # Card border
        print("┌" + "─" * 98 + "┐")

        # Find recommendation and title for this article number
        recommendation = None
        title = None
        for rec in recommendations_list:
            if rec.get("article_number") == i:
                recommendation = rec.get("recommendation")
                title = rec.get("title")
                break

        # Print title (fall back to the indexed title if the LLM skipped this article)
        if title is None:
            title = str(source.get("title", "N/A"))
        title_lines = textwrap.wrap(f"📌 {title}", width=94)
        for line in title_lines:
            print(f"│  {line}".ljust(99) + "│")

        # Card border
        print("├" + "─" * 98 + "┤")

        # Print recommendation
        if recommendation:
            recommendation_lines = textwrap.wrap(recommendation, width=94)
            for line in recommendation_lines:
                print(f"│  {line}".ljust(99) + "│")

        # Card bottom
        print("└" + "─" * 98 + "┘")</code></pre><p>Let’s test this by asking a question about the security blog posts:</p><pre><code>search_query = "Security and vulnerabilities"

context, articles, recommendations = recommend_articles(search_query)

print("\nElasticsearch context:\n", context)

# Display visual cards
display_recommendation_cards(articles, recommendations)</code></pre><p>Here we can see the cards in the console generated by the workflow:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/730d35212594bb3db30447c3ea7e2a92857287b7-1999x1515.png" style="width: 100%;" /><p>You can see the full results, including all hits and the LLM response, in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/results.md">this file</a>.</p><p>We’re asking for articles related to: “Security and vulnerabilities.” This question is used as the search query against the documents stored in Elasticsearch. The retrieved results are then passed to the model, which generates recommendations based on their content. As we can see, the model did a great job generating engaging short text that can motivate the reader to click on it.</p><h2>Conclusion</h2><p>This example shows how Elasticsearch and Hugging Face can be combined to create a fast and efficient centralized system for AI applications. This approach reduces manual effort and provides flexibility, thanks to Hugging Face’s extensive model catalog. Using SmolLM3-3B, in particular, shows how compact, multilingual models can still deliver meaningful reasoning and content generation when paired with semantic search. Together, these tools offer a scalable and effective foundation for building intelligent content analysis and multilingual applications.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast vs. accurate: Measuring the recall of quantized vector search]]></title>
            <description><![CDATA[Explaining how to measure recall for vector search in Elasticsearch with minimal setup.]]></description>
            <link>https://www.elastic.co/search-labs/blog/recall-vector-search-quantization</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/recall-vector-search-quantization</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeff Vestal]]></dc:creator>
            <pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09f03b9239d66c36763cdab3fafcdac207ff6d83-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Everyone wants vector search to be instant. But high-dimensional vectors are heavy. A single 1,024-dimension float-32 vector takes up significant memory, and comparing it against millions of others is computationally expensive.</p><p>To solve this, search engines like Elasticsearch use two main optimization strategies:</p><ol><li><strong>Approximate search (hierarchical navigable small world [HNSW]):</strong> Instead of scanning every document, we build a navigation graph to jump quickly to the likely neighborhood of the answer.</li><li><strong>Quantization:</strong> We compress the vectors (for example, from 32-bit floats to 8-bit integers or even 1-bit binary values) to reduce memory usage and speed up calculations.</li></ol><p>But optimization often comes with a tax: <strong>accuracy</strong>.</p><p>The fear is valid: &quot;If I compress my data and take shortcuts during the search, will I miss the best results?&quot; &quot;Does this optimization degrade the relevance of my search engine?&quot;</p><p>To prove that Elastic’s quantization doesn’t degrade results, we built a repeatable test harness using the <a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14"><strong>DBPedia-14</strong> dataset</a> to calculate exactly how much accuracy (specifically, <strong>recall)</strong> we trade for speed when using default optimizations in Elasticsearch.</p><p>tldr: It’s likely much less than you think. Check out the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/fast_vs_accurate_measuring_the_recall_of_quantized_vector_search/vector_recall_notebook.ipynb">notebook here</a>, and try it yourself</p><h2><strong>The definitions (for the non-experts)</strong></h2><p>Before we look at the code, let’s level-set on some terms.</p><ul><li><strong>Relevance versus recall:</strong> <strong>Relevance</strong> is subjective (did I find good stuff?). <strong>Recall</strong> is mathematical. 
If there are 10 documents in the database that are the <em>perfect</em> mathematical matches for your query, and the search engine finds nine of them, your recall is 90% (or 0.9).</li><li><strong>Exact search (flat):</strong> Sometimes called the &quot;brute force&quot; method. The search engine scans every single document in an index and calculates the distance.<ul><li><em>Pros:</em> 100% perfect recall.</li><li><em>Cons:</em> Computationally expensive and slow at scale.</li></ul></li><li><strong>Approximate search (HNSW):</strong> The &quot;shortcut&quot; method. The search engine builds an <a href="https://www.elastic.co/search-labs/blog/hnsw-graph">HNSW</a> graph. It traverses the graph to find the nearest neighbors.<ul><li><em>Pros:</em> Extremely fast and scalable.</li><li><em>Cons:</em> You might miss a neighbor if the graph traversal stops too early.</li></ul></li></ul><h2><strong>The experiment: Exact versus approximate</strong></h2><p>To test recall, we used the <strong>DBPedia-14</strong> dataset, a large dataset of titles and abstracts across 14 ontology classes, commonly used for training and evaluating text categorization models. Specifically, we’ll focus on the &quot;Film&quot; category. We wanted to compare the optimized production settings against a mathematically perfect ground truth.</p><p>For this experiment, we are using the <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">jina-embeddings-v5-text-small</a> model, a state-of-the-art multilingual model that leads industry benchmarks for text representation. We chose this model because it defines the current standard for high-performance embeddings. By combining Jina v5’s elite accuracy with Elasticsearch’s native quantization, we can demonstrate a search architecture that is both computationally efficient and uncompromising on retrieval quality.</p><p>We set up an index with dual mapping. 
We ingested the same text into two different fields simultaneously:</p><ol><li><strong><code>content.raw</code> </strong>with type: <code>flat</code>. This forces Elasticsearch to perform a brute-force scan of the full Float32 vectors. This returns exact match results and will be used for our baseline.</li><li><strong><code>content</code> </strong>with type <code>semantic_text</code>. With defaults using HNSW + Better Binary Quantization (BBQ). This is the standard, optimized production setting for approximate match.</li></ol><h3><strong>The Recall@10 test</strong></h3><p>For our metric, we used Recall@10.</p><p>We picked 50 random movies and ran the same query against both fields.</p><ul><li>If the <strong>exact (flat)</strong> search says the top 10 neighbors are IDs [1, 2, 3... 10].</li><li>And the <strong>approximate (HNSW)</strong> search returns IDs [1, 2, 3... 9, 99].</li><li>We found nine out of the top 10 correctly. The score is <strong>0.9</strong>.</li></ul><p>Here’s the mapping we used:</p><pre><code># The "Control Group": Forces exact brute-force scan
"raw": {
    "type": "semantic_text",
    "inference_id": ".jina-embeddings-v5-text-small",
    "index_options": {
        "dense_vector": {
            "type": "flat"
        }
    }
}</code></pre><p><strong>The results: The &quot;flat line&quot; of success</strong></p><p>We ran a scale test, reloading the full dataset and testing against index sizes of 1,000 to 40,000 documents.</p><p>Here’s what happened to the recall score:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Documents</td><td style="border: 1px solid #ccc; padding: 4px;">Recall@10 score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1,000</td><td style="border: 1px solid #ccc; padding: 4px;">1.000 (100%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.998 (99.8%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.992 (99.2%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.999 (99.9%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">40,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.992 (99.2%)</td>
                    </tr>
              </tbody>
            </table>
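          <p>As a side note, the Recall@10 comparison described above boils down to a set intersection over the two ranked ID lists. Here is a minimal sketch (a hypothetical helper, not taken from the accompanying notebook):</p><pre><code>def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    # Normalize by k, or by the number of exact hits if fewer than k exist
    return len(exact_top.intersection(approx_top)) / min(k, len(exact_top))</code></pre><p>Averaging this score across the sampled queries gives figures like those in the table above.</p>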
          <p>The results were incredibly stable. Even as we scaled up, the approximate search matched the brute-force exact search <strong>&gt;99% of the time</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a4bfacb1d0cce6fdf6df0e1a9d4fc5d4007a66da-1999x1209.png" style="width: 100%;" /><h2><strong>Why did it work so well?</strong></h2><p>You might expect that compressing vectors to binary values would hurt accuracy more than this. The reason it doesn&#x27;t lies in how Elasticsearch handles the retrieval.</p><p>Most embedding models today output Float32 vectors, which are large. To make search efficient, Elasticsearch uses quantization for high-dimensional vectors. Specifically, since 9.2, it uses <a href="https://www.elastic.co/search-labs/blog/elasticsearch-9-1-bbq-acorn-vector-search">BBQ</a> by default.</p><p>BBQ uses a <strong>rescoring</strong> mechanism:</p><ol><li><strong>Traversal:</strong> The search engine uses the compressed (quantized) vectors to traverse the HNSW graph quickly. Because the vectors are small, it can efficiently over-sample, gathering a larger list of candidates (for example, the top 100 roughly similar docs) without a performance penalty.</li><li><strong>Rescore:</strong> Once it has those candidates, it retrieves the full-precision values for just those few documents to calculate the final, precise ranking.</li></ol><p>This gives you the best of both worlds, the speed of quantization for the heavy lifting, and the precision of floats for the final sort.</p><h2><strong>Can we do better?</strong></h2><p>It’s worth noting that the results we’re seeing here are using default settings and a random sampling of data. Think of this as a high-performance starting point. While Jina v5 is a beast, these recall scores aren&#x27;t a &quot;one size fits all&quot; guarantee for every dataset. 
Every data collection has its own quirks, and while you can definitely tune things further to squeeze out even more performance, you should always benchmark against your own specific data to see where your ceiling is.</p><h2><strong>Conclusion</strong></h2><p>This is a very small-scale test. But the point of the exercise is not to measure the embedding model or BBQ specifically, it’s to demonstrate how you can easily measure the recall of your dataset with minimal setup.</p><p>If you want to run this test on your own data, you can check out the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/fast_vs_accurate_measuring_the_recall_of_quantized_vector_search/vector_recall_notebook.ipynb">notebook here</a> and try it yourself.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Testing Elasticsearch. It just got simpler.]]></title>
            <description><![CDATA[Explaining how Elasticsearch integration tests have become simpler thanks to improvements in Elasticsearch 9.x, the modern Java client, and Testcontainers 2.x.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Piotr Przybyl]]></dc:creator>
            <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3ae41b1f2876d2ad11c8e2b79bbf79955d6902aa-1440x840.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When I first wrote about <a href="https://www.elastic.co/search-labs/blog/series/integration-tests-using-elasticsearch">testing Elasticsearch</a> with Testcontainers for Java, the focus was very pragmatic: if you care about correctness, you should test against a real node; if you care about confidence, your integration tests should resemble production as closely as possible; and if you care about maintainability, your setup shouldn’t turn into a maze of mocks and assumptions.</p><p>That philosophy hasn’t changed.</p><p>What has changed, however, is how little effort it now takes to achieve that goal. With Elasticsearch 9.x, the modern Java client, and Testcontainers 2.x, the experience of writing integration tests feels noticeably smoother, as if a layer of incidental complexity has quietly been removed.</p><p>The example accompanying this article is intentionally modest and can be found <a href="https://github.com/pioorg/elasticsearch9-testcontainers2/blob/main/src/test/java/testing_elasticsearch/ES9TC2DemoTest.java">here</a>.</p><p>It doesn’t attempt to demonstrate sophisticated indexing strategies or elaborate data pipelines; instead, it concentrates on the essentials, because the essentials are precisely where the improvements are most visible.</p><h2>When the tooling stops getting in the way</h2><p>Anyone who has maintained a test suite for a few years will recognize the pattern: You introduce a new library, a transitive dependency pulls something unexpected, and before long, you’re negotiating between versions of testing engines rather than writing tests.</p><p>With Testcontainers 2.x, that negotiation largely disappears. The dependency structure is clearer, the modules are more explicit, and the accidental coupling to older testing frameworks no longer sneaks in behind your back. In practical terms, adding Elasticsearch support to your tests is now as straightforward as declaring:</p><pre><code><dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers-elasticsearch</artifactId>
  <version>2.0.3</version>
  <scope>test</scope>
</dependency></code></pre><p>And, if you’re using JUnit Jupiter integration:</p><pre><code><dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers-junit-jupiter</artifactId>
  <version>2.0.3</version>
  <scope>test</scope>
</dependency></code></pre><p>There are no exclusions to sprinkle in, no legacy engines to silence, and no uneasy feeling that something hidden might surface during the next upgrade. The configuration becomes almost unremarkable, which, in the context of build tooling, is a compliment.</p><h2>A real Elasticsearch node, with security intact</h2><p>In the demo test, we use the official Elasticsearch 9.3.1 Docker image:</p><pre><code>var container =
    new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:9.3.1");

container.start();</code></pre><p>At first glance, this may look similar to older examples, yet the subtle difference lies in what we no longer need to do. <strong>We don’t disable security.</strong> <strong>We don’t bypass SSL.</strong> We don’t simplify the environment just to make the test convenient.</p><p>Instead, once the container is started, we construct a client that uses the REST API and authenticates properly:</p><pre><code>try (var client = ElasticsearchClient.of(c -> c
     .host("https://" + container.getHttpHostAddress())
     .usernameAndPassword("elastic", ElasticsearchContainer.ELASTICSEARCH_DEFAULT_PASSWORD)
     .sslContext(container.createSslContextFromCa())
)) {</code></pre><p>What deserves special mention here is how neat the client construction itself has become. In earlier iterations, creating an Elasticsearch client often meant juggling multiple intermediate objects, configuring transport layers explicitly, wrapping low-level clients, and dedicating some amount of code to what was essentially plumbing. Now, the signal-to-noise ratio is refreshingly high. The builder encapsulates the necessary details, the container provides what the client needs, and the resulting configuration fits comfortably within a few readable lines.</p><p>Just as importantly, the <code>ElasticsearchClient</code> is <code>AutoCloseable</code>, which means it integrates naturally with try-with-resources, ensuring proper cleanup without additional ceremony. The lifecycle is explicit, concise, and self-contained, which is exactly what you want in integration tests that should focus on behavior rather than infrastructure management.</p><p>The container exposes everything required to build a legitimate, secure connection, and the client integrates with it naturally, which means the test environment mirrors production in all the aspects that matter, without imposing additional mental overhead from the developer.</p><p>This alignment between realism and simplicity is, perhaps, one of the most meaningful improvements.</p><h2>Typed APIs change the character of tests</h2><p>The evolution of the Elasticsearch Java client has also reshaped how integration tests read and feel. Where older approaches often involved parsing JSON responses or navigating loosely typed structures, the modern client offers a builder-based, strongly typed API that guides you through valid request shapes at compile time.</p><p>In the demo, we perform a simple cluster health check:</p><pre><code>var health = client.cluster().health();

Assertions.assertEquals("docker-cluster", health.clusterName());
Assertions.assertEquals(HealthStatus.Green, health.status());</code></pre><p>What’s striking here is not the complexity of the operation, but the absence of friction. There’s no manual extraction from maps, no assertions built on untyped string values, and no detour into low-level response handling. The test code looks indistinguishable from application code, which subtly reinforces the idea that integration tests aren’t a special category of code with different rules, but simply another consumer of the same APIs.</p><p>When the boundary between production code and test code becomes thinner, confidence increases almost by default.</p><h2>Reading the test as a story</h2><p>If you take a look at the full test case:</p><pre><code>@Test
void newClientTest() throws IOException {
    try (var container =
             new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:9.3.1")) {
        
        container.start();
        
        try (
            var client = ElasticsearchClient.of(c ->
                c.host("https://" + container.getHttpHostAddress())
                    .usernameAndPassword("elastic", ElasticsearchContainer.ELASTICSEARCH_DEFAULT_PASSWORD)
                    .sslContext(container.createSslContextFromCa()))) {

            HealthResponse health = client.cluster().health();

            Assertions.assertEquals("docker-cluster", health.clusterName());
            Assertions.assertEquals(HealthStatus.Green, health.status());
        }
    }
}</code></pre><p>you’ll notice that it reads less like a configuration script and more like a short narrative:</p><ul><li>We define the container.</li><li>We start the container.</li><li>We build a client.</li><li>We call a real API.</li><li>We assert the outcome.</li></ul><p>The supporting infrastructure fades into the background, leaving the intent of the test clearly visible. That clarity isn’t accidental; it’s the cumulative effect of incremental improvements across Testcontainers and the Elasticsearch client.</p><h2>The advanced patterns still apply</h2><p>None of the more advanced techniques discussed in earlier articles, <a href="https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests-faster">Faster integration tests with real Elasticsearch</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-improve-performance-integration-tests">Advanced integration tests with real Elasticsearch</a>, have become obsolete. Reusing containers to speed up large test suites, customizing cluster settings, preloading indices, or testing role-based access scenarios remain entirely valid and, in many cases, essential.</p><p>What has improved is the baseline experience. The simplest possible integration test, the one that merely needs a real node and a real client, no longer requires defensive configuration or dependency gymnastics. It’s concise, expressive, and production-like by default.</p><h2>Progress without drama</h2><p>There was no dramatic rewrite of the ecosystem, no disruptive migration guide that forced a rethinking of everything. Instead, there has been a steady refinement of APIs and dependencies, each release smoothing a rough edge here and removing a surprise there.</p><p>The result isn’t flashy, yet it’s tangible. Writing integration tests against Elasticsearch now feels less like assembling a test harness and more like exercising a real system in miniature.</p><p>Sometimes progress announces itself loudly. 
Sometimes it arrives quietly, in the form of code that simply reads better and requires less explanation. In this case, it’s the latter, and for those of us who care about clean, reliable integration tests, that’s more than enough.</p><p>And what if we could do something similar with Kibana? Sounds appealing? Stay tuned!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI agent memory: Creating smart agents with Elasticsearch managed memory]]></title>
            <description><![CDATA[ Learn how to create smarter and more efficient AI agents by managing memory using Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ai-agent-memory-management-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ai-agent-memory-management-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly, Jeffrey Rengifo]]></dc:creator>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0172bf1bf25f10001de046531d7dfae816c1f80-1200x676.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll learn about using memory techniques to make agents smarter using Elasticsearch as the database for memories and knowledge.</p><h2>Understanding memory in large language models (LLMs)</h2><p>Here&#x27;s something that trips people up: The conversations with LLMs are completely <a href="https://www.geeksforgeeks.org/computer-networks/difference-between-stateless-and-stateful-protocol/">stateless</a>. Every time you send a message, you need to include the entire chat history to &quot;remind&quot; the model what happened before. The ability to keep track of what was asked and answered within a single conversation session is what we call <strong><em>short-term memory</em></strong>.</p><p>But here&#x27;s where it gets interesting: Nothing stops us from manipulating this chat history beyond simple storage. For example, when we want to persist memories like user preferences across different conversations, we inject those into fresh conversations when needed and call it <strong><em>long-term memory</em></strong>.</p><h2>Why mess with chat history?</h2><p>There are three compelling reasons to go beyond simply appending each new message and response to a growing list that gets sent to the LLM with every request:</p><ul><li><strong>Inject useful context:</strong> Add information about previous interactions, like user preferences, without cluttering the current conversation.</li><li><strong>Summarize and remove data:</strong> Clean up information the model has already used to avoid confusion (<em>context poisoning</em>) and keep the model focused.</li><li><strong>Save tokens:</strong> Remove unnecessary data to prevent filling the context window, enabling longer, more meaningful conversations.</li></ul><p>This opens up some sci-fi possibilities. 
Imagine an agent that selectively remembers things based on its environment or who it&#x27;s talking to, like the TV show <strong><em>Severance</em></strong>, where the main character, Mark, has a chip implanted in his brain that creates two separate identities with distinct memories depending on whether he’s in the office (&quot;innie&quot;) or outside of it (&quot;outie&quot;), switching based on location.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3c08fe6139cae5571288d713e890b7c8ca3f428-1000x1500.png" style="width: 100%;" /><h2>Memory types and selective retrieval in agents: Creating smart agents with Elasticsearch managed memory</h2><p>Not all memories serve the same purpose, and treating them as interchangeable chat history limits how far agents can scale. Modern agent architectures, including frameworks like <a href="https://arxiv.org/abs/2309.02427">Cognitive Architectures for Language Agents (CoALA)</a>, distinguish between <strong>procedural</strong>, <strong>episodic</strong>, and <strong>semantic</strong> memory. Rather than treating all context as a single growing buffer, these architectures recognize that each memory type requires distinct storage, retrieval, and consolidation strategies.</p><h3>Procedural memory: How the agent operates</h3><p><em>Procedural memory</em> defines how an agent behaves, not what it knows or remembers.</p><p>In practice, this includes:</p><ul><li>When to store a memory.</li><li>When to retrieve one.</li><li>How to summarize conversations.</li><li>How to use tools.</li></ul><p>In our system, procedural memory lives primarily in the application code and prompts and isn’t stored in Elasticsearch. 
Instead, procedural memory governs how Elasticsearch is used.</p><p><strong><em>Procedural memory determines how memory is used, not what’s stored.</em></strong></p><h3>Episodic memory: What happened</h3><p><em>Episodic memory</em> captures specific experiences tied to an entity and a context.</p><p>Examples:</p><ul><li>“Peter’s birthday is tomorrow and he wants steak.”</li><li>“Janice has a report due at 9 am.”</li></ul><p>This is the most dynamic and personal form of memory and the one most prone to context pollution if handled incorrectly.</p><p>In our architecture:</p><ul><li>Episodic memories are stored as documents in Elasticsearch.</li><li>Each memory includes metadata (user, role, timestamp, innie or outie).</li><li>Retrieval is selective, based on who’s asking and in what context.</li></ul><p>This is where the innie/outie model applies as an example of episodic memory isolation.</p><h3>Semantic memory: Ground truth</h3><p><em>Semantic memory</em> represents <strong>abstracted, generalized knowledge about the world</strong>, independent of any single interaction or personal context. Unlike episodic memory, which is tied to who said what and when, semantic memory captures what is true in general.</p><p>In our analogy, the knowledge about <strong>Lumon</strong>, the company where Mark works in the show <em>Severance</em>, is world knowledge shared between innies and outies.</p><p>Company handbooks and rules are examples of the knowledge used as semantic memory.</p><p>While episodic memory retrieval prioritizes precision and strong contextual filters (such as identity, role, and time), semantic memory favors high-recall, concept-level retrieval. 
It’s designed to surface generally true information that can ground reasoning, rather than personal experiences tied to a specific situation.</p><p>Let’s move on to the architecture and see how these ideas translate into a memory system for our agent.</p><h2>Prerequisites</h2><ul><li>An Elastic Cloud Hosted (ECH) or self-hosted Elasticsearch 9.1+ instance.</li><li>Python 3.x.</li><li><a href="https://platform.openai.com/docs/api-reference/authentication">OpenAI API Key</a>.</li></ul><p>The full Python notebook for this application can be found <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/smarter-agents-with-memory/notebook.ipynb">here</a>.</p><h2>Why Elasticsearch?</h2><p>Elasticsearch is an ideal solution for storing both knowledge and memory because it&#x27;s a native vector database ready to scale. It gives us everything we need to manage selective memory:</p><ul><li><strong>Vector database</strong> with <a href="https://www.elastic.co/docs/solutions/search/hybrid-semantic-text">hybrid search</a> to find memories by context, not only by keywords.</li><li><strong>Multiple data types</strong>, including text, numbers, dates, and geolocation.</li><li><strong>Metadata filters</strong> for complex queries across different fields.</li><li><a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level"><strong>Document level security</strong></a> to filter memories based on who&#x27;s asking.</li></ul><h3>Why selective memory improves latency and reasoning</h3><p>Selective memory is not only about correctness and isolation; it also has a direct impact on latency and model performance. By narrowing the search space using structured filters (such as memory type, user, or time) before running semantic retrieval, Elasticsearch reduces the number of vectors that need to be scored and the amount of context that must be injected into the LLM. 
This results in faster retrieval, smaller prompts, and more focused attention for the model, which in practice translates into lower latency, lower token usage, and more accurate responses.</p><p>Episodic memory is inherently temporal: Recent experiences are usually more relevant than older ones, and not all memories should be kept with the same level of detail forever. In human cognition, experiences are gradually forgotten, summarized, or consolidated into more abstract knowledge.</p><p>Memory compression is a whole different topic, but you can implement strategies to summarize and store old memories while retrieving the fresh ones entirely.</p><h2>The setup</h2><p>Following the <strong><em>Severance</em></strong> concept, we&#x27;re creating an agent named Mark with two distinct memory sets:</p><ul><li><strong>Innie memories</strong>: Work-related conversations with colleagues.</li><li><strong>Outie memories</strong>: Personal conversations with friends and family.</li></ul><p>When Mark talks to an innie, he shouldn&#x27;t remember conversations with outies, and vice versa.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0172bf1bf25f10001de046531d7dfae816c1f80-1200x676.png" style="width: 100%;" /><h2>Building the memory system</h2><h3>Memory index structure</h3><p>First, we define our memory schema:</p><pre><code>mappings = {
    "properties": {
        "user_id": {"type": "keyword"},
        "memory_type": {"type": "keyword"},
        "created_at": {"type": "date"},
        "memory_text": {
            "type": "text",
            "fields": {
                "semantic": {
                    "type": "semantic_text"
                }
            }
        }
    }
}</code></pre><p><em>Note that we use a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/multi-fields">multi-field</a> for <code>memory_text</code> so we can run both full-text search and <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic search</a> using the <a href="https://www.elastic.co/search-labs/tutorials/search-tutorial/semantic-search/elser-model">Elastic Learned Sparse EncodeR (ELSER) model</a> (default) against the same field content.</em></p><p>This gives us semantic search capabilities while maintaining structured metadata for filtering.</p><h3>Setting up document level security</h3><p>This is the key piece that makes selective memory work. We create two separate roles: one for innies, one for outies, each with query-level filters built in. When a user with the innie role queries the memories index, Elasticsearch automatically applies a filter that only returns memories where <code>memory_type</code> equals &quot;innie&quot;.</p><p><em>You can find more examples of access control <a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level#basic-examples">here</a> and of role management <a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/kibana-role-management">here</a>.</em></p><p>Here&#x27;s the innie role:</p><pre><code>innie_role_descriptor = {
    "indices": [
        {
            "names": ["memories"],
            "privileges": ["read", "write"],
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"memory_type": "innie"}}
                    ]
                }
            }
        }
    ]
}</code></pre><p>We create a similar role for outies, just filtering by <code>&quot;memory_type&quot;: &quot;outie&quot;</code> instead.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e675da210be223200e1d5519b5bc62b6f97b3b1-1999x1046.png" style="width: 100%;" /><p>Then we create users and assign them to these roles. For example:</p><ul><li><strong>Peter (outie):</strong> Can only access memories marked as <code>&quot;outie&quot;</code>.</li><li><strong>Janice (innie):</strong> Can only access memories marked as <code>&quot;innie&quot;</code>.</li></ul><p>When Mark (our agent) receives a query, he uses the credentials of whoever is asking. If Peter asks something, Mark uses Peter&#x27;s credentials, which means Elasticsearch automatically filters to only show outie memories. If Janice asks, only innie memories are visible.</p><p>The application code doesn&#x27;t need to filter anything: user management is completely decoupled from the application logic, and Elasticsearch handles all the security automatically.</p><h3>Creating the agent tools</h3><p>We define three key functions for our agent:</p><ul><li><strong><code>GetKnowledge</code>:</strong> Searches the knowledge base for relevant context (<a href="https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag">traditional retrieval augmented generation [RAG]</a>).</li><li><strong><code>GetMemories</code>:</strong> Retrieves memories using <a href="https://www.elastic.co/what-is/hybrid-search">hybrid search</a> (semantic + keyword):</li></ul><pre><code>def get_memory(query: str):
    es_query = {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "semantic": {
                                    "field": "memory_text.semantic",
                                    "query": query
                                }
                            }
                        }
                    },
                    {
                        "standard": {
                            "query": {
                                "multi_match": {
                                    "query": query,
                                    "fields": ["memory_text"]
                                }
                            }
                        }
                    }
                ],
                "rank_window_size": 50,
                "rank_constant": 20
            }
        }
    }
    
    # user_es_client is authenticated as the requesting user, so document
    # level security filters the returned memories automatically
    response = user_es_client.search(index="memories", body=es_query)
    return response</code></pre><p><em>Notice that we don&#x27;t apply security filters in the query; Elasticsearch handles that automatically based on the user&#x27;s credentials.</em></p><ul><li><strong><code>SetMemory</code>:</strong> Stores new memories (implementation uses LLM to convert conversations into structured memory records).</li></ul><h3>How the agent uses these tools</h3><p>When a user asks Mark a question, the flow works like this:</p><p>1.<strong> User asks:</strong> &quot;What&#x27;s my favorite family destination?&quot;</p><p>2.<strong> LLM decides to use tools:</strong> OpenAI&#x27;s Response API with function calling lets the LLM decide it needs to call <code>GetMemories</code> with the query <code>&quot;favorite family destination&quot;</code>.</p><p>3.<strong> We execute the function:</strong> Our code calls <code>get_memory(&quot;favorite family destination&quot;)</code> using the user&#x27;s credentials (Peter&#x27;s in this case).</p><p>4. <strong>Elasticsearch filters automatically:</strong> Because we&#x27;re using Peter&#x27;s credentials, only outie memories are returned:<br/></p><pre><code>Memories
peter125: (User name is Peter Johnson. His favorite family destination is Disneyland.)</code></pre><p>5. <strong>We send results back to LLM:</strong> The memory gets added to the conversation context.</p><p>6.<strong> LLM generates an answer:</strong> &quot;Your favorite family destination is Disneyland.&quot;</p><p>Here&#x27;s the actual code that handles this loop:</p><pre><code># Initial call with tools available
import json  # for parsing JSON tool-call arguments

response = client.responses.create(
    model="gpt-4.1-mini",
    input=messages,
    tools=tools,
    parallel_tool_calls=True
)

# Execute any tool calls the LLM requested
for tool_call in response.output:
    if tool_call.type == "function_call" and tool_call.name == "GetMemories":
        # Tool-call arguments arrive as a JSON string, so parse them first
        result = get_memory(json.loads(tool_call.arguments)["query"])
        # Add result to messages
        
# Call LLM again with tool results to generate final answer
final_response = client.responses.create(
    model="gpt-4.1-mini",
    input=messages  # Now includes tool results
)</code></pre><p>The key insight: The application doesn&#x27;t decide which memories to retrieve or when. The LLM decides based on the user&#x27;s question, and Elasticsearch ensures that only the right memories are accessible.</p><h2>Testing selective memory</h2><p>Let&#x27;s see it in action:</p><p>Outie conversation (Peter):</p><pre><code>Peter: Hey Mark, my birthday is tomorrow! I'd like to have a steak for dinner.
Mark: That's great! (memory stored)</code></pre><p>Mark stores this as an outie memory associated with Peter. Here&#x27;s what that memory looks like in Elasticsearch:</p><pre><code>{
    "user_id": "peter125",
    "memory_type": "outie",
    "created_at": "2025-10-11T18:02:52.182780",
    "memory_text": "Peter's birthday is tomorrow. He wants steak for dinner."
}</code></pre><p>Innie conversation (Janice):</p><pre><code>Janice: Hey Mark, remember we have to finish the end of year report tomorrow at 9am.
Mark: Thanks for reminding me! (memory stored)</code></pre><p>This creates a separate innie memory:</p><pre><code>{
    "user_id": "janice456",
    "memory_type": "innie", 
    "created_at": "2025-10-11T19:15:33.445821",
    "memory_text": "End of year report deadline tomorrow at 9am with Janice."
}</code></pre><p>Imagine Peter also works at Lumon. A colleague stores a work-related memory about him:</p><pre><code>{
    "user_id": "innie-peter",
    "memory_type": "innie",
    "created_at": "2025-10-11T20:30:00.000000",
    "memory_text": "Peter needs to review the Q4 budget spreadsheet before Friday."
}</code></pre><p>This memory exists in Elasticsearch, but Peter&#x27;s current credentials only grant him the outie role. When he asks Mark about work tasks, this memory is invisible to him; Elasticsearch&#x27;s document level security ensures that it’s never returned.</p><p><em>Note: To allow interaction with these memories, you would need to create a separate user (or assign an additional role) with &quot;innie&quot; access for Peter. This is left as an exercise, but it demonstrates that the same person can have isolated memory contexts, and access is controlled entirely at the security layer.</em></p><h3>Memory isolation test</h3><p>Now Peter starts a new conversation:</p><pre><code>Peter: Hey Mark, do you remember what I want for my birthday?
Mark: Yes! You want steak.

Peter: When do you have to finish the end of year report?
Mark: What are you talking about?</code></pre><p>Perfect! Mark only accesses outie memories when talking to Peter. The agent&#x27;s &quot;brain&quot; is genuinely split, just like in the show.</p><h2>The full implementation</h2><p>The complete working implementation is available in this <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/smarter-agents-with-memory/notebook.ipynb">notebook</a>, where you can:</p><ul><li>Set up the Elasticsearch indices.</li><li>Create roles and users with document level security.</li><li>Build the agent with OpenAI&#x27;s Response API.</li><li>Test the selective memory system.</li></ul><h2>Conclusion</h2><p>Memory isn’t just a place to store past conversations. It’s part of the agent’s architecture. By going beyond raw chat history and separating procedural, episodic, and semantic memory, we can build agents that reason more clearly, scale better, and stay focused over long interactions.</p><p>Selective retrieval reduces context pollution, lowers latency, and improves the quality of the information sent to the LLM. Episodic memory can be filtered by user and time, semantic memory can be used to ground answers in shared knowledge, and procedural memory controls how and when all of this is used.</p><p>Elasticsearch provides the building blocks to implement this in practice through hybrid search, rich metadata, security, and temporal filtering. Just like in <em>Severance</em>, we can create agents with isolated experiences and shared world knowledge. The difference is that here the split is intentional and useful, not a mystery.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Gemini CLI extension for Elasticsearch with tools and skills]]></title>
            <description><![CDATA[Introducing Elastic’s extension for Google's Gemini CLI to search, retrieve, and analyze Elasticsearch data in developer and agentic workflows.
]]></description>
            <link>https://www.elastic.co/search-labs/blog/gemini-cli-extension-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/gemini-cli-extension-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Walter Rafelsberger, Enrico Zimuel]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dfcec4a77b3dc83bf0d974417bf2e211abb1f4f-876x468.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We&#x27;re glad to announce the release of our Elastic extension for Google&#x27;s Gemini CLI, bringing the full power of <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a> directly into your AI development workflow. This extension also offers several recently developed agent skills for interacting with Elasticsearch.</p><p>The extension is available as an open source project <a href="https://github.com/elastic/gemini-cli-elasticsearch">here</a>.</p><h2>What is Gemini CLI, and how do you install it?</h2><p><a href="https://geminicli.com/">Gemini CLI</a> is an open source AI agent that brings Google’s Gemini models directly into the command line. It allows developers to interact with AI from the terminal to perform tasks such as generating code, editing files, running shell commands, and retrieving information from the web.</p><p>Unlike typical chat interfaces, Gemini CLI integrates with your local development environment, meaning that it can understand project context, modify files, run builds or tests, and automate workflows directly within the terminal. This makes it useful for developers, site reliability engineers (SREs), and engineers who want AI-assisted coding and automation without leaving their command-line workflow.</p><p>Gemini CLI can be installed using several package managers. 
The most common method is via npm:</p><pre><code>npm install -g @google/gemini-cli</code></pre><p>For alternative installation options, refer to the <a href="https://geminicli.com/docs/get-started/installation/">official installation page</a>.</p><p>After installation, start the CLI by running:</p><pre><code>gemini</code></pre><p>You&#x27;ll see a screen like the one shown in Figure 1:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dfcec4a77b3dc83bf0d974417bf2e211abb1f4f-876x468.png" style="width: 100%;" /><h2>Configure Elasticsearch</h2><p>We need to have an Elasticsearch instance running. If you want to use the Model Context Protocol (MCP) server, you also need Kibana 9.3+ installed. To use the Elasticsearch Query Language (ES|QL) skill (<code>esql</code>) described below, Kibana is not required.</p><p>You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and generate an API key to be used for configuring Gemini CLI.</p><p>The API key will be shown in the output of the previous command and stored in a <strong>.env</strong> file in the <strong><code>elastic-start-local</code></strong> folder.</p><p>If you’re using on-premises Elasticsearch (for example, using <code>start-local</code>), and you want to use Elastic Agent Builder with MCP, you also need to connect a large language model (LLM). 
You can read <a href="https://www.elastic.co/docs/explore-analyze/ai-features/llm-guides/llm-connectors">this documentation page</a> to understand the different options.</p><p>If you’re using Elastic Cloud (or serverless), you already have a prebuilt LLM connection in place.</p><h2>Install the Elasticsearch extension</h2><p>You can install the Elasticsearch extension for Gemini CLI with the following command:</p><pre><code>gemini extensions install https://github.com/elastic/gemini-cli-elasticsearch</code></pre><p>You can check that the extensions have been installed successfully by opening Gemini and executing the following command:</p><pre><code>/extensions list</code></pre><p>You should see the Elasticsearch extension available.</p><p>If you want to use the MCP integration, you need to have an Elasticsearch 9.3+ version installed. You need your MCP server URL from <a href="https://www.elastic.co/kibana">Kibana</a>:</p><ul><li>Get your MCP server URL from Agents &gt; View all tools &gt; Manage MCP &gt; Copy MCP Server URL.</li><li>The URL will look like this: https://your-kibana-instance/api/agent_builder/mcp</li></ul><p>You need the Elasticsearch endpoint URL. This is typically reported at the top of the Kibana Elasticsearch page. If you’re running Elasticsearch with <code>start-local</code>, you already have the endpoint in the <code>ES_LOCAL_URL</code> key in the <code>start-local</code> .env file.</p><p>You also need an API key. If you’re running Elasticsearch with <code>start-local</code>, you already have the <code>ES_LOCAL_API_KEY</code> in the <code>start-local</code> .env file. 
Otherwise, you can create an API key using the Kibana interface, as reported <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">here</a>:</p><ul><li>In Kibana: Stack Management &gt; Security &gt; API Keys &gt; Create API key.</li><li>We suggest setting only the read privileges for the API key, enabling the <code>feature_agentBuilder.read</code> privilege as reported <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/permissions#grant-access-with-roles">here</a>.</li><li>Copy the encoded API key value.</li></ul><p>Set the required environment variables in your shell:</p><pre><code>export ELASTIC_URL="your-elasticsearch-url"
export ELASTIC_MCP_URL="your-elasticsearch-mcp-url"
export ELASTIC_API_KEY="your-encoded-api-key"</code></pre><h2>Install the example dataset</h2><p>You can install the <strong>eCommerce orders </strong>dataset available from Kibana. It includes a single index named <strong><code>kibana_sample_data_ecommerce</code></strong>, containing information for 4,675 orders from an ecommerce website. For each order, we have the following information:</p><ul><li>Customer information (name, ID, birth date, email, and more).</li><li>Order date.</li><li>Order ID.</li><li>Products (list of all the products with price, quantity, ID, category, discount, and other details).</li><li>SKU.</li><li>Total price (taxless, taxed).</li><li>Total quantity.</li><li>Geo information (city, country, continent, location, region).</li></ul><p>To install the sample data, open the <strong>Integrations</strong> page in Kibana (search for “Integration” in the top search bar) and install the <strong>Sample Data</strong>. For more details, refer to the documentation <a href="https://www.elastic.co/docs/explore-analyze/#gs-get-data-into-kibana">here</a>.</p><p>The goal of this article is to show how easy it is to configure Gemini CLI to connect to Elasticsearch and interact with the <strong><code>kibana_sample_data_ecommerce</code></strong> index.</p><h2>How to use the Elasticsearch MCP</h2><p>You can check the connection using the following command in Gemini:</p><pre><code>/mcp list</code></pre><p>You should see the <strong><code>elastic-agent-builder</code></strong> enabled, as shown in Figure 2:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1508423bc1d1bc3c04a1cb01e2d59495a3516ed1-1465x844.png" style="width: 100%;" /><p>Elasticsearch provides a default set of tools. 
See the description <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools/builtin-tools-reference">here</a>.</p><p>Using these tools, you can interact with Elasticsearch, asking questions like:</p><ul><li><code>Give me the list of all the indexes available in Elasticsearch.</code></li><li><code>How many customers are based in the USA in the kibana_sample_data_ecommerce index of Elasticsearch?</code></li></ul><p>Depending on the question, Gemini will use one or more of the available tools to try to answer it.</p><h2>The /elastic commands</h2><p>In the Elasticsearch extension for Gemini CLI, we also added<strong> <code>/elastic</code></strong> commands.</p><p>If you execute the <strong><code>/help</code></strong> command, you see all the available <code>/elastic</code> options (Figure 3):</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b2a0727ce7a04354878dd048253d3f4d062324b-1983x230.png" style="width: 100%;" /><p>These commands can be useful if you want to directly execute a specific tool of the <code>elastic-agent-builder</code> MCP server. For instance, using the following command, you can get the mapping of the <code>kibana_sample_data_ecommerce</code>:</p><pre><code>/elastic:get-mapping kibana_sample_data_ecommerce</code></pre><p>These commands are essentially shortcuts for executing specific tools, rather than relying on the Gemini model to determine which tool should be invoked.</p><h2>How to use the Elasticsearch skills</h2><p>This extension also comes with an <a href="https://github.com/elastic/gemini-cli-elasticsearch/tree/main/skills/esql">agent skill for ES|QL</a>, the <a href="https://www.elastic.co/docs/explore-analyze/discover/try-esql">Elasticsearch Query Language</a> available in Elasticsearch. <a href="https://agentskills.io/home">Agent Skills</a> is an open format that gives AI coding agents, like Gemini CLI, custom instructions for specific tasks. 
They use a concept called <em>progressive disclosure</em>, meaning that only a brief description of the skill is added to the initial system prompt. When you ask the agent to perform a task, like querying Elasticsearch, it matches the request to the relevant skill and dynamically loads the detailed instructions. This is an efficient way to manage token budgets while providing the AI with exactly the context it needs.</p><p>The <strong><code>esql</code> skill</strong> is designed to let Gemini CLI write and execute ES|QL queries directly against your cluster. ES|QL is a powerful piped query language that makes data exploration, log analysis, and aggregations highly intuitive. With this skill enabled, you don&#x27;t need to look up ES|QL syntax; you can simply ask the Gemini CLI natural language questions about your data, and the agent will handle the rest.</p><p>The executions are performed using simple <a href="https://curl.se/">curl</a> commands run in a terminal. This is possible because Elasticsearch provides a rich set of REST APIs that can be easily used to integrate the system into any architecture.</p><p><strong>What the <code>esql</code> skill offers:</strong></p><ul><li><strong>Index and schema discovery:</strong> The agent can use the skill&#x27;s built-in tools to list available indices and fetch field mappings. For instance, before writing a query for the eCommerce dataset, the agent can run a schema check on <strong><code>kibana_sample_data_ecommerce</code></strong> to understand the available fields, like <strong><code>taxful_total_price</code></strong> or <strong><code>category</code></strong>.</li><li><strong>Seamless natural language translation:</strong> The skill gives the agent more than just a plain reference manual; it provides a specific guide for interpreting user intent. 
When you type natural language requests, like &quot;Show average response time grouped by service,&quot; the agent uses the skill&#x27;s bundled pattern matching to instantly translate your words into the correct ES|QL aggregations, filters, and commands.</li><li><strong>Self-correction:</strong> If a query fails (for example, due to a type mismatch or syntax error), the skill returns the generated query alongside the exact Elasticsearch error message, allowing the agent to instantly fix the query and try again without you having to intervene.</li></ul><p>Since the <code>esql</code> skill is also available as a tool in the <code>elastic-agent-builder</code> MCP server, we need to temporarily disable this server. You can use the following command to disable it:</p><pre><code>/mcp disable elastic-agent-builder</code></pre><p>Then you can simply type a prompt like this into your Gemini CLI:</p><pre><code>Find the top 5 product categories by total sales revenue in the kibana_sample_data_ecommerce index</code></pre><p>The agent will:</p><ul><li>Recognize the need for the <code>esql</code> skill.</li><li>Check the schema of <strong><code>kibana_sample_data_ecommerce</code></strong>.</li><li>Construct an ES|QL query, like: <code>FROM kibana_sample_data_ecommerce | STATS total_revenue = SUM(taxful_total_price) BY category.keyword | SORT total_revenue DESC | LIMIT 5</code>.</li><li>Execute the query against the Elasticsearch API.</li><li>Present the final answer to you right in the terminal.</li></ul><p>Here is an example of Gemini&#x27;s response to the previous prompt:</p><pre><code>───────────────────────────────────────────────────────────
> Find the top 5 product categories by total sales revenue in the kibana_sample_data_ecommerce index
───────────────────────────────────────────────────────────
✓  Activate Skill "esql": Interact with Elasticsearch using ES|QL and curl. Use when querying, managing indices,  …
Skill esql activated. Resources loaded from /home/user/.gemini/extensions/elasticsearch/skills/esql: 
Showing up to 200 items (files + folders).
/home/user/.gemini/extensions/elasticsearch/skills/esql/
├───SKILL.md
└───references/
├───esql-reference.md 
├───generation-tips.md 
└───query-patterns.md

✦ I will start by exploring the mapping of the kibana_sample_data_ecommerce index to identify the fields for product category and sales revenue. Then, I will construct and execute an ES|QL query to determine the top 5 product categories by total sales revenue.

✓  Shell curl -s "${ELASTIC_URL%/}/kibana_sample_data_ecommerce/_mapping" -H "Authorization: ApiKey $(printenv ELAS…
{
   "kibana_sample_data_ecommerce": { 
     "mappings": { 
       "properties": {
         "category": { 
           "type": "text", 
           "fields": {
             "keyword": {
               "type": "keyword"
             }
           }
         }, 
         "currency": {
           "type": "keyword"
         }, 

✦ Okay, I have the mapping. The category is products.category.keyword and sales revenue is taxful_total_price. I'll now execute the ES|QL query to get the top 5 categories by revenue.

✓  Shell curl -s -X POST "${ELASTIC_URL%/}/_query" \  -H "Authorization: ApiKey $(printenv ELASTIC_API_KEY)" \  -H …
{ 
   "took": 62,
   "is_partial": false, 
   "completion_time_in_millis": 1772805644164,
   "documents_found": 4675,
   "values_loaded": 12084, 
   "start_time_in_millis": 1772805644102, 
   "expiration_time_in_millis": 1773237644087,  
   "columns": [ 
     { 
       "name": "total_revenue",    
       "type": "double"  
     }, 
     {   
       "name": "products.category.keyword",

✦ The query for top 5 product categories by revenue was successful. I'll format the returned JSON values into a user-friendly list.

 ✓  Shell 
                             
 Top 5 product categories by total sales revenue: 
 1. Men's Clothing: 149393.91
 2. Women's Clothing: 135099.91
 3. Women's Shoes: 105479.17
 4. Men's Shoes: 91797.92
 5. Women's Accessories: 60830.31 

✦ I've found the answer; now I'll format it for clarity.

 ✓  Shell 

 Here are the top 5 product categories by total sales revenue:                                                        
                                                                      
 | Rank | Category              | Total Revenue | 
 |------|-----------------------|---------------| 
 | 1    | Men's Clothing        | 149,393.91    |
 | 2    | Women's Clothing      | 135,099.91    |
 | 3    | Women's Shoes         | 105,479.17    |
 | 4    | Men's Shoes           | 91,797.92     | 
 | 5    | Women's Accessories   | 60,830.31     | 

✦ I've formatted and presented the answer. I'm finished.
 ───────────────────────────────────────────────────────────</code></pre><p>It’s interesting to note how the Gemini model generates the final response by showing all the steps it follows. Here, you can clearly see the influence of the skill in the model’s reasoning process. The first time the model recognizes that it needs to use a skill or execute a shell command, it requests permission using the human-in-the-loop approach.</p><p>By handling the heavy lifting of schema discovery, query generation, and execution, the <code>esql</code> skill lets you focus entirely on the answers rather than the mechanics of getting them. You’ll get the data you need, properly formatted and right in your terminal, all without ever writing a single line of syntax or context-switching to a different application.</p><h2>Conclusion</h2><p>In this article, we presented the Elasticsearch extension for Gemini CLI that we recently released. This extension gives you the ability to interact with your Elasticsearch instance using Gemini and the Elasticsearch MCP server provided by Elastic Agent Builder, available starting from version 9.3.0, as well as the <code>/elastic</code> commands.</p><p>The extension also includes an <code>esql</code> skill that converts a user’s request from natural language into an ES|QL query. This skill can be particularly useful when the MCP server cannot be used, since the underlying communication is driven by simple curl commands executed in a terminal. Elasticsearch offers a rich set of REST APIs that can be easily integrated into any project. This is especially useful when developing agentic AI applications.</p><p>For more information about our Gemini CLI extension, visit the project repository <a href="https://github.com/elastic/gemini-cli-elasticsearch">here</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Skills for Elastic: Turn your AI agent into an Elastic expert]]></title>
            <description><![CDATA[Give your AI coding agent the knowledge to query, visualize, secure, and automate with Elastic Agent Skills.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-skills-elastic</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-skills-elastic</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Graham Hudgins, Matt Ryan]]></dc:creator>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09e64953819083168a9ecef0888c7f8bde1a43bd-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Every developer, site reliability engineer (SRE), or analyst who has tried to use an AI coding agent with a specialized platform has hit the same wall. You ask the agent to write a query, configure an alert, or investigate something, and it gets close but not right. Elastic has an advantage here: over a decade of documentation, blog posts, and community answers means AI agents already know Elastic better than most data platforms. But that depth comes with noise. Deprecated APIs sit alongside current ones. Outdated patterns rank as highly as best practices. The agent confidently reproduces an approach that worked three versions ago, because in its training data, it did. The result is a correction tax: users manually feed documentation into context, fix hallucinated syntax, and work around the agent instead of with it. Worse, advanced capabilities go unused entirely, not because users don&#x27;t need them, but because the agent doesn&#x27;t know they exist.</p><p>That&#x27;s why we’re open-sourcing <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills</a>: native platform expertise for Elasticsearch, Kibana, Elastic Observability, and Elastic Security. Drop them into the agent runtime you already use, and your agent goes from a generalist that guesses at syntax to an expert that follows many of the same architectural standards as Elastic’s own engineering teams. This initial technical preview release focuses on skills with maximum compatibility for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, but will evolve quickly to include improved support for older stack releases.</p><p>In addition, Elastic is solving this problem from both sides. 
For agents on the Elastic platform, <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a> (now generally available) lets you create and chat with AI agents that inherit your data&#x27;s access controls, use built-in search and analysis tools, and work in context alongside your dashboards, alerts, and investigations. We&#x27;re working hard to ensure amazing Agentic experiences in the Elastic platform. But not every agent lives inside Elastic. Your team already uses Cursor, Claude Code, or other runtimes, and those agents need to get Elastic right, too. That&#x27;s where Agent Skills come in.</p><h2>Why agents struggle with specialized platforms</h2><p>Large language models (LLMs) are remarkably capable generalists. They can write Python, explain Kubernetes manifests, and refactor React components because their training data is rich with examples. But when it comes to platform-specific work, the kind that involves proprietary query languages, deep API surfaces, and domain-specific best practices, they fall short in predictable ways.</p><p>For Elasticsearch, the gap shows up concretely:</p><ul><li><strong>Elasticsearch Query Language (ES|QL) is new territory.</strong> LLMs are trained heavily on SQL, but ES|QL is a piped query language with different syntax, different functions, and different semantics. Agents frequently write queries that look plausible but don&#x27;t parse. They confuse <code>WHERE</code> with <code>| WHERE</code>, invent functions that don&#x27;t exist, and miss the pipe-based composition model entirely.</li><li><strong>API surfaces are wide and deep.</strong> Elasticsearch, Kibana, and Elastic Security expose hundreds of APIs across search, ingestion, alerting, detection rules, case management, dashboards, and more. An agent armed with nothing but general training data has to guess which endpoint to call, what the request body looks like, and how to handle the response. 
It guesses wrong often enough to erode trust.</li><li><strong>Best practices aren&#x27;t in the training data.</strong> When should you use <code>semantic_text</code> versus a custom embedding pipeline? How should you structure an ingest pipeline for a 10GB CSV? What&#x27;s the right detection rule syntax for a <a href="https://www.elastic.co/docs/solutions/security/detect-and-alert/mitre-attandckr-coverage">MITRE ATT&amp;CK</a> technique? General-purpose agents don&#x27;t have curated, reliably structured Elastic-specific knowledge loaded by default. They&#x27;d have to go find it, and even if they did, raw docs don&#x27;t always encode the judgment calls and best practices that skilled practitioners carry.</li></ul><p>The result: Developers spend more time fixing agent output than they would have spent writing the code themselves. That&#x27;s not the experience anyone signed up for.</p><h2>Agent Skills: Platform knowledge, packaged for agents</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d16ec00d16e70a916c5eef0aaa23fcc735b7186-1067x1280.png" style="width: 100%;" /><p>Agent Skills are self-contained directories of instructions, scripts, and reference material that agent runtimes can load dynamically. When a skill is active, the agent has access to the right context at the right time: query syntax, API patterns, validation logic, worked examples, so it can complete tasks correctly on the first try.</p><p>Each skill follows the open <a href="https://agentskills.io">agentskills.io</a> specification: a folder with a <code>SKILL.md</code> file containing metadata and structured instructions. No proprietary format, no lock-in. 
Skills work across agent runtimes, including Cursor, Claude Code, GitHub Copilot, Windsurf, Gemini CLI, Cline, Codex, and <a href="https://agentskills.io">many more</a>.</p><h3>What&#x27;s in the initial v0.1.0 release</h3><p>The first set of skills spans five areas of the Elastic Stack:</p><ul><li>Interacting with Elasticsearch APIs (search, indexing, cluster management)</li><li>Building and managing Kibana content such as dashboards, alerts, connectors, and more</li><li>Domain expertise for Elastic Observability</li><li>Domain expertise for Elastic Security</li><li>Making effective agents in Agent Builder</li></ul><h3>Skills are composable</h3><p>Skills aren&#x27;t monolithic. They&#x27;re modular by design. Your agent loads only the skills relevant to the task at hand. Working on an ES|QL query? The ES|QL skill activates. Need to build a dashboard from those results? The dashboards skill picks up. Assessing the health of your application? The service health skill comes into play. Investigating a security alert? The triage skill chains into case management and response skills as the investigation progresses.</p><p>This composability means that you don&#x27;t need a single, massive prompt that tries to cover everything. Each skill carries exactly the context its domain requires, nothing more, nothing less.</p><h2>For developers building search and AI applications</h2><p>If you&#x27;re loading data into Elasticsearch, writing queries, or migrating indices, skills reduce the cycle of generating code, hitting errors, and searching docs for what went wrong.</p><p>Ask your agent to load a CSV file, and it uses a streaming ingestion tool that handles backpressure and infers mappings from the data. It’s not a hand-rolled _bulk loop that runs out of memory on the first large file. 
Ask it to query with ES|QL, and it discovers your actual index names and field schemas, then writes valid piped queries with correct syntax, appropriate aggregations, and version-aware feature selection, not a SQL-flavored guess that requires three rounds of debugging. Ask it to reindex across clusters, and it follows the full operational workflow: creates the destination with explicit mappings, tunes settings for throughput, runs the job async, and restores production settings when it finishes, not a bare _reindex call that skips half the steps an experienced operator would follow.</p><p>Instead of an agent that gives you a plausible starting point you have to fix, you get one that encodes the operational discipline that makes the output actually work.</p><p><strong>Example impacts of using Elastic Agent Skills</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <th style="border: 1px solid #ccc; padding: 4px; text-align: left;">Eval</th><th style="border: 1px solid #ccc; padding: 4px; text-align: left;">What the skill changed</th>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-audit-query-failed-logins</td><td style="border: 1px solid #ccc; padding: 4px;">Used the audit log query patterns from the skill instead of generic search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-authz-role-mapping-ldap</td><td style="border: 1px solid #ccc; padding: 4px;">Emitted the correct role mapping API call structure</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-basic-query</td><td style="border: 1px solid #ccc; padding: 4px;">Wrote ES|QL pipe syntax over Query DSL</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-error-handling</td><td style="border: 1px solid #ccc; padding: 4px;">Schema first instead of guessing field names</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-schema-discovery</td><td style="border: 1px solid #ccc; padding: 4px;">Never guessed an index name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-ingest-csv-with-infer</td><td style="border: 1px solid #ccc; padding: 4px;">Used --infer-mappings alone; avoided combining it with --source-format csv, which causes an empty index</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-ingest-json-file</td><td style="border: 1px solid #ccc; padding: 4px;">Used robust ingestion approach capable of handling large files</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-reindex-local-async</td><td style="border: 1px solid #ccc; padding: 4px;">Created destination index first with replicas: 0 and refresh_interval: "-1", then async reindex. Baseline skipped any prep</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-security-403-privileges</td><td style="border: 1px solid #ccc; padding: 4px;">Followed the skill's diagnostic workflow for privilege errors instead of generic advice</td>
                    </tr>
              </tbody>
            </table>
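          <p>For a sense of what the <code>esql</code> rows in the table above translate to at the API level, here is a minimal sketch of the kind of request a skill-guided agent might run. The index name, the query, and the <code>ELASTIC_URL</code> / <code>ELASTIC_API_KEY</code> environment variables are illustrative assumptions, but <code>POST /_query</code> is the standard ES|QL query endpoint:</p><pre><code># Sketch only: assumes ELASTIC_URL and ELASTIC_API_KEY point at your cluster
curl -s -X POST "${ELASTIC_URL%/}/_query" \
  -H "Authorization: ApiKey ${ELASTIC_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"query": "FROM kibana_sample_data_ecommerce | STATS total_revenue = SUM(taxful_total_price) BY category.keyword | SORT total_revenue DESC | LIMIT 5"}'</code></pre><p>A schema-first agent would precede a call like this with a mapping lookup (<code>GET kibana_sample_data_ecommerce/_mapping</code>) rather than guessing field names, which is exactly the behavior the <code>esql-schema-discovery</code> and <code>esql-error-handling</code> evals reward.</p>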
          <h2>For security teams</h2><p>Security teams repeat the same operational workflows daily: triaging alerts, tuning detection rules, managing cases. Agent Skills encode that procedural knowledge so your AI agent can execute these workflows correctly, calling the right APIs in the right order with the right field names. For a hands-on walkthrough that takes you from zero to a fully populated Elastic Security environment without leaving your IDE, see <a href="https://www.elastic.co/security-labs/agent-skills-elastic-security">Get started with Elastic Security from your AI agent</a>.</p><h2>For observability and operations teams</h2><p>The new Agent Skills for Elastic Observability reduce the operational toil of instrumenting complex systems, managing SLOs, sifting through complex data, and assessing service health. Embedding native Elastic expertise directly into AI agents allows teams to execute complex observability workflows using simple natural language. This empowers SREs and Ops teams to resolve incidents faster and maintain reliable systems more easily. Learn more in <a href="https://www.elastic.co/observability-labs/blog/elastic-agent-skills-observability-workflows">this blog</a>.</p><h2>Open source, open spec, community-driven</h2><p>We&#x27;re releasing Agent Skills under the Apache 2.0 license because we believe that agent knowledge should be open. The <a href="https://agentskills.io">agentskills.io</a> specification that skills follow is an open standard, not an Elastic proprietary format. We want skills to be a community effort, not a walled garden.</p><h2>Part of a bigger picture</h2><p>Agent Skills is one piece of a broader initiative to make Elasticsearch the most agent-friendly data platform available. 
For agents that live on the Elasticsearch platform, <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Agent Builder</a> goes further by inheriting your data&#x27;s access controls and permissions, providing built-in and custom tools for search and analysis, and letting users interact with agents in context alongside their dashboards, alerts, and investigations. Finally, support for skills is coming soon to Agent Builder, giving developers the flexibility to use Elastic Agent Skills, as well as skills from any other source, to enable secure, context-enhanced chat and automation on the Elasticsearch platform.</p><p>For agents that live everywhere else, we&#x27;re investing in the open ecosystem:</p><ul><li><strong>Model Context Protocol (MCP) server expansion:</strong> Extending the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">MCP endpoint</a> in Agent Builder with more tools beyond the current search, ES|QL, and index operations.</li><li><strong>Authentication improvements:</strong> Making it easier for agents to connect securely, with the goal of eliminating manual API key copy-paste.</li><li><strong>LLM-readable documentation:</strong> Publishing <code>llms.txt</code> and <code>AGENTS.md</code> files so agents can discover and understand Elastic APIs on their own.</li><li><strong>A command line interface (CLI) for agent workflows:</strong> Command-line tooling that makes connection management and common operations agent-friendly.</li></ul><p>Skills are the layer you can use today. The rest is coming.</p><h2>Get started</h2><p><strong>Before you get started: </strong>AI coding agents operate with real credentials, real shell access, and often the full permissions of the user running them. When those agents are pointed at security workflows, the stakes are higher: you&#x27;re handing an automated system access to detection logic, response actions, and sensitive telemetry. 
Every organization&#x27;s risk profile is different. Before enabling AI-driven security workflows, <strong>evaluate what data the agent can access, what actions it can take, and what happens if it behaves unexpectedly</strong>.</p><p>Install Elastic Agent Skills into your agent runtime:</p><p><code>npx skills add elastic/agent-skills</code></p><p>This auto-detects your installed agent runtimes and places skills in the correct configuration directory. From there, your agent picks them up automatically.</p><p>You can also browse the <a href="https://github.com/elastic/agent-skills">skills catalog</a> directly and install individual skills manually by copying the skill folder into your agent&#x27;s configuration directory.</p><p>Don&#x27;t have an Elasticsearch cluster yet? Start an <a href="https://cloud.elastic.co/registration">Elastic Cloud free trial</a>. It takes about a minute to get a fully configured environment.</p><p><strong>Explore the project:</strong></p><ul><li><a href="https://github.com/elastic/agent-skills">Agent Skills repository</a></li><li><a href="https://agentskills.io">agentskills.io specification</a></li><li><a href="https://www.elastic.co/docs">Elasticsearch documentation</a></li><li><a href="https://cloud.elastic.co/registration">Elastic Cloud free trial</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch, part 4: The ultimate challenge]]></title>
            <description><![CDATA[Solving and evaluating entity resolution challenges in a highly diverse “ultimate challenge” dataset designed to prevent shortcuts.]]></description>
            <link>https://www.elastic.co/search-labs/blog/entity-resolution-elasticsearch-llm-challenges</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/entity-resolution-elasticsearch-llm-challenges</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/70fb0ff949f6db9ac9b8a28ecb4329ab915ebf46-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’ve now seen intelligent entity resolution implemented in two ways. Both approaches begin the same way: entity preparation and extraction, followed by candidate retrieval with Elasticsearch. From there, we evaluate those candidates using a large language model (LLM), either through prompt-based JSON generation or through function calling, and require the model to provide a transparent explanation for its judgment.</p><p>As we saw in the <a href="https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling">previous post</a>, the consistency provided by function calling is not just a nice optimization; it’s essential. Once we removed structural errors from the evaluation loop, results on standard scenarios (such as those in the tier 4 dataset) improved dramatically.</p><p>Yet there’s an obvious question left to answer:</p><p><em>Does this approach still work when things get genuinely messy?</em></p><p>Real-world entity resolution rarely fails because of simple cases. It fails when names cross languages, cultures, writing systems, time periods, and organizational boundaries. It fails when people are referenced by titles instead of names, when companies change names, when transliterations aren’t consistent, and when context (not spelling) is the only thing tying a mention to a real-world entity.</p><p>So, for the final post in this series, we put the system through what we called <strong>the ultimate challenge</strong>.</p><h2>What makes this the ultimate challenge?</h2><p>In earlier evaluations, we tested the system using increasingly complex datasets. By the time we reached tier 4, discussed in the previous post, we were already dealing with a mix of nicknames, titles, multilingual names, and semantic references. 
Those tests showed that the architecture itself was sound, but that reliability issues, especially malformed JSON, were suppressing recall.</p><p>With function calling in place, we finally had a stable foundation. That gave us the opportunity to ask a more interesting question:</p><p><em>Can one unified pipeline handle <strong>many different kinds</strong> of entity resolution problems at once?</em></p><p>The ultimate challenge dataset was designed to push precisely on that dimension.</p><p>Instead of focusing on a single difficulty (like nicknames or transliteration), this dataset combines <strong>50+ distinct challenge types</strong>, including:</p><ul><li>Cultural naming conventions.</li><li>Title-based references.</li><li>Business relationships and historical name changes.</li><li>Multilingual and cross-script mentions.</li><li>Compound challenges that mix several of the above.</li></ul><p>Crucially, this isn’t about optimizing for any one narrow use case. It’s about testing whether the <em>design pattern</em> holds up when the rules change from entity to entity.</p><h2>The dataset at a glance</h2><p>The ultimate challenge dataset consists of:</p><ul><li><strong>50 entities</strong>, spanning people, organizations, and institutions.</li><li><strong>~60 articles</strong>, with varying structure and linguistic complexity.</li><li><strong>51 distinct challenge categories</strong>, grouped broadly into:<ul><li>Cultural naming conventions.</li><li>Titles and professional context.</li><li>Business and organizational relationships.</li><li>Multilingual and transliteration challenges.</li><li>Combined and edge‑case scenarios.</li></ul></li></ul><p>Earlier in the series, we saw that using generative AI (GenAI) to create datasets can be a mixed blessing. Without it, assembling sufficiently large and diverse test data would be extremely difficult. 
But left unchecked, the model has a tendency to make things too easy.</p><p>On an early generation pass, for example, we discovered that the model had included phrases like “the Russian president” as explicit aliases for Vladimir Putin. That might seem reasonable today, but it defeats the purpose of testing contextual resolution. What happens if the article is discussing Russia in the 1990s? The system should infer the correct entity from context, not rely on a hard-coded alias.</p><p>For that reason, this dataset was deliberately designed so that <strong>shortcuts don’t work</strong>. Aliases are not explicitly listed when the system is expected to infer meaning. Descriptive phrases are not prelinked to entities. Correct matches often depend on article-level context, not just local text.</p><p><strong>Important note:</strong> Although we demonstrate the system’s capabilities across diverse scenarios, this is still an educational prototype. Production systems handling real-world sanctioned-entity monitoring would require additional validation, compliance checks, audit trails, and specialized handling for sensitive use cases.</p><h2>Why these scenarios are hard</h2><p>Back in the first post in this series, we introduced a simple but ambiguous example: “The new Swift update is here!” The challenge is that “Swift” can resolve to multiple real-world entities, depending on context. That example captures a broader truth: Natural language is inherently ambiguous.</p><p>Entity resolution, therefore, is not just a string-matching problem. 
Humans routinely rely on shared knowledge, cultural norms, and situational context to resolve references, and we rarely even notice we’re doing it.</p><p>Consider a few common cases:</p><ul><li>A title like “the president” is meaningless without geopolitical and temporal context.</li><li>A company name may refer to a parent, a subsidiary, or a former brand depending on when the article was written.</li><li>A person’s name may appear in different orders, scripts, or transliterations, depending on language and culture.</li><li>The same phrase can legitimately refer to different entities in different contexts, and the system must be able to <em>reject</em> matches just as confidently as it accepts them.</li></ul><p>There is no single rule set that handles all of this cleanly. That’s why this prototype separates concerns so aggressively:</p><ul><li>Elasticsearch narrows the candidate space efficiently and transparently.</li><li>The LLM is used only where judgment is required and is forced to explain itself.</li><li>Retrieval and reasoning remain distinct steps.</li></ul><p>This separation becomes even more important as the diversity of challenge types increases.</p><h2>How the system handles diversity without special cases</h2><p>One of the most interesting outcomes of this evaluation is what <em>didn’t</em> change:</p><ul><li>We did <strong>not</strong> add special logic for Japanese names.</li><li>We did <strong>not</strong> add custom rules for Arabic patronymics.</li><li>We did <strong>not</strong> add hard-coded mappings for historical company names.</li></ul><p>Instead, the system relied on the same core ingredients introduced earlier in the series:</p><ul><li>Context-enriched entities indexed for semantic search.</li><li>Hybrid retrieval (exact, alias, and semantic) in Elasticsearch.</li><li>A small, well-defined set of candidate matches.</li><li>LLM judgment constrained by function calling and minimal schemas.</li></ul><p>This suggests that the system’s 
flexibility comes from <strong>representation and architecture</strong>, not from an ever-growing collection of rules.</p><p>When the system succeeds, it’s because the right candidates are retrieved and the LLM has enough context to explain why a reference does (or does not) map to a specific entity.</p><h2>Results: How did it perform?</h2><p>On the ultimate challenge dataset, the system produced the following overall results:</p><ul><li><strong>Precision:</strong> ~91%</li><li><strong>Recall:</strong> ~86%</li><li><strong>F1 Score:</strong> ~89%</li><li><strong>LLM acceptance rate:</strong> ~72%</li></ul><h3>Performance across challenge types</h3><p>Breaking down results by challenge type reveals strengths and limitations:</p><p><strong>Strongest performance (100% F1 score)</strong> was observed in areas such as:</p><ul><li>Cross-script matching (Cyrillic, Korean, Chinese business entities).</li><li>Hebrew scenarios (patronymics, professional titles, religious titles, transliteration).</li><li>Business hierarchies (aerospace, diversified manufacturing, multidivision corporations).</li><li>Professional titles (academic, military, political, religious).</li><li>Combined Japanese scenarios involving multiple writing systems.</li></ul><p><strong>Strong performance (80–99% F1 score)</strong> included:</p><ul><li>International political figures (98%).</li><li>Historical name changes (90%).</li><li>Complex business hierarchies (89%).</li><li>Japanese company names (93%).</li><li>Cross-script transliteration (86%).</li><li>Arabic patronymics (86%).</li></ul><p><strong>More challenging areas</strong> included:</p><ul><li>Advanced transliteration (Chinese, Korean): 0% F1.</li><li>Certain Japanese scenarios (honorifics, name order, writing system variation): ~67% F1.</li><li>Some Arabic scenarios (company names, institutional references): ~40% F1.</li></ul><p>What’s important here is <em>why</em> the system struggled in these cases. 
The failures were not due to the overall approach breaking down, but to limitations in specific components, most notably the dense vector model used for semantic search in certain multilingual scenarios.</p><p>Because retrieval and judgment are cleanly separated, improving performance does not require rewriting the system. Swapping in a more capable multilingual embedding model, enriching entity context, or refining retrieval strategies would improve results across these categories without changing the core architecture.</p><p>From an architectural standpoint, that’s the real success metric.</p><h2>What this tells us about the design</h2><p>Looking back across the series, a few patterns stand out:</p><ul><li><strong>Preparation matters more than clever matching. </strong>Enriching entities with context up front dramatically reduces ambiguity later.</li><li><strong>LLMs are most valuable as judges, not retrievers. </strong>Asking them to explain <em>why</em> a match makes sense is far more powerful than asking them to search.</li><li><strong>Reliability enables accuracy. </strong>Function calling didn’t just clean up JSON; it unlocked recall that was already latent in the retrieval step.</li><li><strong>Generalization beats specialization. </strong>A small number of well-chosen abstractions handled dozens of challenge types without custom logic.</li></ul><p>This is why the prototype is intentionally Elasticsearch-native and intentionally conservative in how it uses LLMs. 
The goal isn’t to replace search; it’s to make search explainable in situations where meaning matters.</p><h2>Final thoughts</h2><p>The ultimate challenge wasn’t about chasing perfect metrics; it was about answering a more fundamental question:</p><p><em>Can a transparent, search-first, LLM-assisted architecture handle real-world entity ambiguity without collapsing into rules or black boxes?</em></p><p>For this educational prototype, the answer is yes, with clear caveats around production hardening, compliance, monitoring, and data quality. If you’re building systems that need to justify <em>why</em> an entity match was made, this pattern is worth serious consideration. I hope this series has shown that entity resolution doesn’t have to be mysterious. With the right separation of concerns, it becomes something you can reason about, measure, and improve.</p><p>This work also suggests a broader architectural pattern. What emerges is a slight but important evolution of classic retrieval augmented generation (RAG). Instead of allowing retrieval to feed generation directly, we introduce an explicit evaluation step. The LLM is first used to judge and sanity-check retrieved candidates, and only those approved results are allowed to augment generation. You can think of this as Generation-Augmented Retrieval-Augmented Generation with Evaluation, or GARAGE, because who doesn’t love a good acronym.</p><p>What other use cases could benefit from this pattern? Systems that require trust, transparency, and defensible reasoning are natural candidates. Future work in this area should prove as compelling as the results we’ve seen here, and I’m excited to see where the community takes it next.</p><h2>Next steps: Try it yourself</h2><p>Want to see the ultimate challenge in action? 
Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks"><strong>Ultimate Challenge notebook</strong></a> for a complete walkthrough, with real implementations, detailed explanations, and hands-on examples.</p><p>The complete entity resolution pipeline demonstrates the core concepts and architecture needed for production use. You can use it as a foundation to build systems that monitor news articles, track entity mentions, and answer questions about which entities appear in which articles, all while retaining transparency and explainability.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The stateless architecture of Elasticsearch Serverless]]></title>
            <description><![CDATA[Exploring the stateless architecture of Elasticsearch Serverless. Learn how the stateful architecture was transformed into stateless for Serverless.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-stateless-architecture</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-stateless-architecture</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Iraklis Psaroudakis]]></dc:creator>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5ce3d804dd4652ce2114ce6a20f53cb41e5826c5-1396x812.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><strong>We&#x27;re thrilled to announce</strong> the publication of our new peer-reviewed paper, <strong><em><a href="https://doi.org/10.1145/3772052.3772245">Serverless Elasticsearch: the Architecture Transformation from Stateful to Stateless</a>,</em></strong> which was accepted and presented at the industrial track of the 2025 Association for Computing Machinery (ACM) Symposium on Cloud Computing (SoCC). The paper gives a concise view of our recent innovations in Elasticsearch. At Elastic, we’re relentlessly focused on the future of search. From optimizing performance to simplifying operations, our teams are always exploring what&#x27;s next. </p><p>This paper isn&#x27;t just an academic exercise. It&#x27;s a foundational exploration of how the core of a search engine could be reimagined for a purely serverless world. We <a href="https://www.elastic.co/search-labs/blog/stateless-your-new-state-of-find-with-elasticsearch">decouple</a> storage from compute: Data lives in a cloud blob store with virtually infinite storage and scalability. That vision is the main driver behind our <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> offering: seamless search over massive datasets, with the economics and operational simplicity of serverless.</p><h2>The challenge: Rethinking stateful search for the cloud</h2><p>For decades, search engines have been powerful, stateful systems. 
Deploying a production-grade cluster like Elasticsearch has meant:</p><ul><li>Provisioning servers and managing storage.</li><li>Carefully tuning configurations for cost, performance, and reliability.</li><li>Paying for idle capacity when workloads are spiky or unpredictable.</li><li>Significant operational effort to scale up and down.</li></ul><p>Modern cloud platforms have made some of this easier, but the fundamental tension remains:</p><p><em>Can we build a search engine that delivers the power and rich query capabilities of Elasticsearch with the economics and operational simplicity of a serverless architecture?</em></p><p>That question drove our research.</p><h2>Our key contributions</h2><p>The paper presents concrete innovations that make Elasticsearch Serverless possible:</p><ul><li><strong>Object store as single source of truth: </strong>We offload index data, the transaction log (translog), and cluster state to a cloud object store. That eliminates replica shards for durability and makes the object store the sync point between indexing and search.</li><li><strong>&quot;Thin&quot; (stateless) shards: </strong>Shards recover and relocate quickly across nodes without copying large amounts of data. 
Disks are used only for caching, not for persistent storage.</li><li><strong>Batched compound commits (BCC): </strong>We wrap index commits in a custom format, cutting upload costs, while keeping the same read-after-write semantics as Elasticsearch.</li><li><strong>Batched translog uploads: </strong>Translog uploads are batched at the node level, cutting upload costs.</li><li><strong>Smart garbage collection:</strong> We track the usage of BCCs and translogs we’ve uploaded, and we delete them once they’re unused, to reduce storage footprint and retention costs.</li><li><strong>Autoscaling: </strong>We scale automatically with ingestion and search load so clients can call APIs without managing cluster size.</li></ul><p><strong>The bottom line:</strong> In our experiments, Elasticsearch Serverless achieves up to <strong>twice</strong> the indexing throughput of stateful Elasticsearch on comparable hardware and scales linearly with autoscaling to match ingestion load.</p><h2>Visualizing the architecture</h2><p>Figure 1 in the paper gives a clear side-by-side view: stateful Elasticsearch versus the new stateless architecture Elasticsearch Serverless.</p><p><strong>Stateful Elasticsearch (top):</strong> Familiar data tiers: hot, warm, cold, frozen. Data lives on local disks; primaries and replicas are spread across nodes; colder tiers may use searchable snapshots on an object store.</p><p><strong>Elasticsearch Serverless</strong> <strong>(bottom):</strong> Just two tiers: <strong>indexing</strong> and <strong>search</strong>. All durable data (Lucene commits, translogs, cluster state) lives in the object store. 
Indexing nodes write and upload; search nodes read from the object store and a shared cache, with no local persistence of index data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c4ca8c725cc60c06625cee3b5d1045c42367c166-1080x1116.png" style="width: 100%;" /><p>The takeaway: <em>a complete separation between the resources used for indexing and those used for querying.</em></p><h2>A tale of two data flow paths</h2><p>Figures 2 and 3 in the paper contrast how data flows in stateful versus Elasticsearch Serverless.</p><p><strong>Stateful Elasticsearch </strong>(figure 2):</p><ul><li>Documents go to the primary shard&#x27;s Lucene buffers and translog and then to replica shards.</li><li>After refresh, the documents go to new searchable segments.</li><li>After flush, they’re committed to disk.</li><li>Thus, durability is given by the disk and the replicas.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eed45de365b8475b8f258342d344ad71491d39e4-1088x1022.png" style="width: 100%;" /><p><strong>Elasticsearch Serverless </strong>(figure 3):</p><ul><li>Documents go to Lucene and the translog on an indexing node.</li><li><em>Before</em> acknowledging the client, the translog is uploaded to the object store.</li><li>After refresh, the documents go to new searchable segments and are committed to disk in the indexing nodes.</li><li>After flush, they go into BCCs and are uploaded to the blob object store.</li><li>Search nodes serve queries from the object store (and, for recent data not yet uploaded, directly from the indexing node).</li><li>Thus, durability comes from the object store, not from disk or replicas.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5374213dbac9c84e0d672045f663582b25f57e7a-1090x986.png" style="width: 100%;" /><p>Result: <em>The indexing and search paths are fully decoupled</em>.</p><h2>Autoscaling</h2><p>Section 7 of the paper describes the 
autoscaler. Because data lives in the object store, relocating shards doesn&#x27;t mean copying full segment data; only metadata and, when needed, cache warming. So the cluster can scale up and down <em>much</em> faster than in stateful Elasticsearch.</p><p>How it works:</p><ul><li>The autoscaler is an external component that monitors metrics from Elasticsearch Serverless.</li><li><strong>Indexing tier:</strong> Scale-up is driven by memory usage and ingestion load (including queued work).</li><li><strong>Search tier:</strong> Scale-up is driven by memory, search load, and the user-configurable &quot;search power&quot; (how much of the dataset is cached locally).</li><li>It polls every few seconds and adjusts each tier independently.</li></ul><p>Outcome: <em>automatic, workload-driven scaling so clients can focus on their applications instead of on capacity planning</em>.</p><h2>The experimental results</h2><p>Section 8 of the paper presents our experimental evaluation.</p><p><strong>Microbenchmarks</strong> show the impact of batching: fewer object store operations for both commits and translogs, with some trade-offs.</p><p><strong>Autoscaling experiments:</strong> As we increase the number of indexing clients, throughput scales linearly while P50 and P99 latency stay stable. 
A real-world example shows bulk response times improving and stabilizing as the indexing tier scales up with demand.</p><p><strong>Head-to-head comparison of stateful Elasticsearch versus Elasticsearch </strong>Serverless<strong>:</strong></p><ul><li>Elasticsearch Serverless achieves roughly <strong>twice</strong> the indexing throughput of stateful Elasticsearch at the 50th percentile.</li><li>The gain comes largely from using the object store for durability instead of replicating every operation to replica shards.</li><li>Latency stays competitive.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/353ca1e865535e7211d22547a0d9eaa60e47b4aa-1084x578.png" style="width: 100%;" /><p>The takeaway: <em>The stateless design delivers both better peak performance and more efficient, automatic scaling.</em></p><h2>Why this matters for the future of Elastic</h2><p>The stateless architecture isn&#x27;t just a technical achievement; it&#x27;s the foundation for how we want search to work in the cloud.</p><ul><li><strong>Pay-as-you-go:</strong> Customers can index and search over practically limitless data without provisioning clusters, tuning tiers, or managing replicas and snapshots.</li><li><strong>Automatic scaling: </strong>Each tier scales on its own automatically; no capacity planning required.</li><li><strong>Frequent, automated upgrades:</strong> Better security and time-to-value, without the operational cost of rolling upgrades over stateful data.</li></ul><p>This work is a step toward making powerful search more accessible, cost-effective, and scalable for everyone.</p><h2>Read the full paper, and join the conversation</h2><p>We believe in the power of open research and collaboration to move technology forward. We encourage you to dive into the details. 
We provide a <a href="https://cdn.sanity.io/files/me0ej585/search-labs-import-testing/af199341cc57431238d62c7d4dbc31791bfa806e.pdf">preprint of this paper</a>, which details the architecture transformation in depth.</p><h2>Dive deeper: Explore related blog posts</h2><p>While our paper offers a concise overview of the Elasticsearch Serverless architecture, the details and underlying innovations are explored more fully in a collection of in-depth blog posts written by our engineering team. These articles provide the background, nuance, and specific technical deep dives that make the stateless transformation possible.</p><p>We encourage you to delve into the following resources to gain a richer understanding of the components and concepts presented in the paper:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/stateless-your-new-state-of-find-with-elasticsearch">Stateless — your new state of find with Elasticsearch (2022)</a> and <a href="https://www.elastic.co/blog/elastic-serverless-architecture">Serve more with Serverless (2023)</a>. 
Read the foundational posts introducing the concept of decoupling storage and compute.</li><li><a href="https://www.elastic.co/search-labs/blog/data-safety-stateless-elasticsearch">Stateless: Data safety in a stateless world (2024).</a> Learn how data durability is achieved in the absence of local replicas.</li><li><a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">Autosharding of data streams in Elasticsearch Serverless (2024).</a> Discover the logic behind automatic and dynamic data stream sharding.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-refresh-costs-serverless">How we optimized refresh costs in Elasticsearch Serverless (2024).</a> Understand the specific optimizations applied to reduce the cost of making data searchable.</li><li><a href="https://www.elastic.co/search-labs/blog/thin-indexing-shards-elasticsearch-serverless">Introducing Serverless Thin Indexing Shards (2024).</a> Explore the innovation of &quot;thin&quot; shards that enable rapid relocation and recovery.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-serverless-tier-autoscaling">Search tier autoscaling in Elasticsearch Serverless (2024).</a> Gain insight into the mechanisms driving the automatic scaling of search resources.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-ingest-autoscaling">Ingest autoscaling in Elasticsearch (2024).</a> Learn how the ingestion tier scales automatically to meet fluctuating indexing load.</li><li><a href="https://www.elastic.co/blog/elastic-cloud-serverless-pricing-packaging">Elastic Cloud Serverless pricing and packaging (2025)</a>. Learn how the pricing and packaging was initially structured for Elastic Cloud&#x27;s Serverless offering.</li><li><a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">Elasticsearch vs. OpenSearch: Unraveling the performance gap (2023)</a>. 
Learn about the performance differences and key optimizations that distinguish Elasticsearch from OpenSearch, as observed in 2023.</li></ul><h2>Acknowledgments</h2><p>We would like to thank all the co-authors of the paper: Iraklis Psaroudakis, Pooya Salehi, Jason Bryan, Francisco Fernández Castaño, Brendan Cully, Ankita Kumar, Henning Andersen, and Thomas Repantis. We would also like to thank the Elasticsearch Distributed Systems team for their contributions, as well as the entire Elasticsearch engineering team.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search with Java: LangChain4j Elasticsearch integration]]></title>
            <description><![CDATA[Learn how to use hybrid search in LangChain4j via its Elasticsearch integrations, with a complete Java example.]]></description>
            <link>https://www.elastic.co/search-labs/blog/langchain4j-elasticsearch-hybrid-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/langchain4j-elasticsearch-hybrid-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Laura Trotta]]></dc:creator>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/93d8e0d84fb4cfbf5e51df85df7ec2e600d9dcc7-1088x607.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In our <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a> on hybrid search with Elasticsearch in LangChain, we explained why hybrid search can help retrieve better results than simple vector search, along with how it works. We recommend reading that article first.</p><p>In addition to Python and JavaScript, the LangChain ecosystem also has a community-driven Java project called <a href="https://github.com/langchain4j/langchain4j">LangChain4j</a>, which will be the focus of this article, showing how powerful hybrid search can be by writing a complete application using LangChain4j, Elasticsearch, and Ollama.</p><h2>Setting up the environment</h2><h3>Running a local Elasticsearch instance</h3><p>Before running the examples, you&#x27;ll need Elasticsearch running locally. The easiest way is using the <a href="https://github.com/elastic/start-local?tab=readme-ov-file"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>After starting, you&#x27;ll have:</p><ul><li>Elasticsearch at http://localhost:9200.</li><li>Kibana at http://localhost:5601.</li></ul><p>Your API key is stored in the .env file (under the elastic-start-local folder) as <code>ES_LOCAL_API_KEY</code>.</p><p>&gt; <strong>Note: This script is for local testing only. Do not use it in production. For production installations, refer to the <a href="https://www.elastic.co/downloads/elasticsearch">official documentation</a> for Elasticsearch.</strong></p><h3>Running a local Ollama instance</h3><p>You’ll also need to connect your application to an embedding model. 
Although you can choose from any provider supported by LangChain4j (check the <a href="https://docs.langchain4j.dev/integrations/language-models/">complete list</a>), for this example we’ll be using Ollama, which can be easily set up locally following the <a href="https://docs.ollama.com/quickstart">quickstart</a>.</p><h2>Let’s start coding</h2><p>The idea for the application is simple: Given a dataset of movies (taken from an IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>), we want to be able to find movies whose descriptions are relevant to our queries. This demo uses a subset of the data, which has been cleaned. You can download the dataset used for this article from our <a href="https://github.com/elastic/hybrid-search-elastic-langchain4j">GitHub repo</a>, along with the full code for this demo.</p><h2>Step 1: Dependencies and environment</h2><p>Open your favorite integrated development environment (IDE) and create a new blank project, preferably with a modern Java version (we’re using Java 24) and a Gradle or Maven version to match (in our case, Gradle 9.0).</p><p>We only need three dependencies:</p><pre><code>dependencies {
    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-csv:2.17.0")
    implementation("dev.langchain4j:langchain4j-elasticsearch:1.11.0-beta19")
    implementation("dev.langchain4j:langchain4j-ollama:1.11.0")
}</code></pre><p>The first one is needed to ingest the data that we’ll embed and query; the other two are the necessary LangChain4j dependencies to connect and manage our Elasticsearch vector store and Ollama embedding model.</p><p>The best way to connect to the external services is to define environment variables and read them at the start of our main function:</p><pre><code>String elasticsearchServerUrl = System.getenv("ES_LOCAL_URL");
String elasticsearchApiKey = System.getenv("ES_LOCAL_API_KEY");

String ollamaUrl = System.getenv("ollama-url");
String ollamaModelName = System.getenv("model-name");</code></pre><h2>Step 2: Ingesting the dataset</h2><p>Since the dataset is a CSV, we’ll be using <a href="https://github.com/FasterXML/jackson-dataformats-text">Jackson dataformat</a>’s <code>jackson-dataformat-csv</code> to easily read the data and map it to a Java class, defined as:</p><pre><code>public record Movie(
    String movie_id,
    String movie_name,
    Integer year,
    String genre,
    String description,
    String director
) {
}</code></pre><p>Now we can create an instance of <code>CsvSchema</code> mapping the CSV structure and read the file into an iterator:</p><pre><code>CsvSchema schema = CsvSchema.builder()                    
    .addColumn("movie_id") // same order as in the csv    
    .addColumn("movie_name")                              
    .addColumn("year")                                    
    .addColumn("genre")                                   
    .addColumn("description")                             
    .addColumn("director")                                
    .setColumnSeparator(',')                              
    .setSkipFirstDataRow(true)                            
    .build();                                             
                                                          
CsvMapper csvMapper = new CsvMapper();                    
                                                          
File initialFile = new File("src/main/resources/scifi_1000.csv");
InputStream csvContentStream = new FileInputStream(initialFile);
                                                          
MappingIterator<Movie> it = csvMapper                     
    .readerFor(Movie.class)                               
    .with(schema)                                         
    .readValues(new InputStreamReader(csvContentStream)); </code></pre><p>Each row needs to be embedded first, and then both the embedded content and the text representation will be ingested by Elasticsearch.</p><p>Let’s start by creating an instance of the Ollama embedding model class:</p><pre><code>EmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
    .baseUrl(ollamaUrl)
    .modelName(ollamaModelName)
    .build(); </code></pre><p>And then the Elasticsearch vector store, which needs an instance of the Elasticsearch Java RestClient:</p><pre><code>RestClient restClient = RestClient
    .builder(HttpHost.create(elasticsearchServerUrl))
    .setDefaultHeaders(new Header[]{
        new BasicHeader("Authorization", "ApiKey " + elasticsearchApiKey)
    })
    .build(); 

EmbeddingStore<TextSegment> embeddingStore = ElasticsearchEmbeddingStore.builder()
    .restClient(restClient)
    .build(); </code></pre><p>For the ingestion loop, the LangChain4j library requires the data to be split into two lists, one for the vector representations and one for the original text, so we’ll set up two lists that the loop will fill:</p><pre><code>List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> embedded = new ArrayList<>();</code></pre><p>Where <code>Embedding</code> and <code>TextSegment</code> are both library-specific classes.</p><p>We’ll iterate over the movie dataset, use the embedding model to retrieve the vector representation of each movie (a text representation of all its fields merged), and add the name separately as metadata so that the results are easier to read.</p><pre><code>boolean hasNext = true;

while (hasNext) {
    try {
        Movie movie = it.nextValue();
        String text = movie.toString();
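        // Note (added for clarity, not in the original demo): Java records get a
        // compiler-generated toString() of the form ClassName[component=value, ...],
        // so `text` here looks like "Movie[movie_id=..., movie_name=..., year=..., ...]".
        // That whole string is what gets embedded and indexed below.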

        Embedding embedding = embeddingModel.embed(text).content();
        embeddings.add(embedding);

        Metadata metadata = new Metadata();
        metadata.put("movie_name", movie.movie_name());
        embedded.add(new TextSegment(text, metadata));

        hasNext = it.hasNextValue();
    } catch (JsonParseException | InvalidFormatException e) {
        // ignore malformed data
    }
}</code></pre><p>Finally, the vector list and text list are passed to the vector store method <code>addAll()</code>, which will handle asynchronously sending the data to the vector store:</p><pre><code>embeddingStore.addAll(embeddings, embedded);</code></pre><h2>Step 3: Querying</h2><p>Our goal is to find movies with time loops in the plot, so our prompt will be:</p><pre><code>String query = "Find movies where the main character is stuck in a time loop and reliving the same day.";</code></pre><p>Let’s try a simple vector search first, by creating a content retriever with a <a href="https://www.elastic.co/docs/solutions/search/vector/knn">k-nearest neighbor (kNN) query</a> default configuration and then running the query and printing the results:</p><pre><code>ElasticsearchContentRetriever contentRetrieverVector = ElasticsearchContentRetriever.builder()
                .restClient(restClient)
                .configuration(ElasticsearchConfigurationKnn.builder().build())
                .maxResults(5)
                .embeddingModel(embeddingModel)
                .build();

List<Content> vectorSearchResult = contentRetrieverVector.retrieve(Query.from(query));

System.out.println("Vector search results:");
vectorSearchResult.forEach(v -> System.out.println(v.textSegment().metadata().getString(
                "movie_name")));</code></pre><p>This outputs:</p><pre><code>Vector search results:
The Witch: Part 1 - The Subversion
Divinity
The Maze Runner
Spider-Man
Spider-Man: Into the Spider-Verse</code></pre><p>Now let’s see how hybrid search performs:</p><pre><code>ElasticsearchContentRetriever contentRetrieverHybrid = ElasticsearchContentRetriever.builder()
    .restClient(restClient)
    .configuration(ElasticsearchConfigurationHybrid.builder().build())
    .maxResults(5)
    .embeddingModel(embeddingModel)
    .build();

List<Content> hybridSearchResult = contentRetrieverHybrid.retrieve(Query.from(query));
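// Illustration (added here; not part of the retriever call above): hybrid mode
// fuses the BM25 ranking and the kNN ranking with reciprocal rank fusion, where
// score(d) = sum over result lists of 1 / (k + rank(d)). With Elasticsearch's
// default rank constant k = 60, a movie ranked 1st lexically and 3rd
// semantically would score:
double rrfExample = 1.0 / (60 + 1) + 1.0 / (60 + 3); // ~0.0323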

System.out.println("Hybrid search results:");
hybridSearchResult.forEach(v -> System.out.println(v.textSegment().metadata().getString(
            "movie_name")));</code></pre><pre><code>Hybrid search results:
Edge of Tomorrow
The Witch: Part 1 - The Subversion
Boss Level
Divinity
The Maze Runner</code></pre><h2>Why these results?</h2><p>This query (“time loop / reliving the same day”) is a great case where hybrid search tends to shine because the dataset contains literal phrases that BM25 can match and vectors can still capture meaning.</p><ul><li>Vector-only (kNN) embeds the query and tries to find semantically similar plots. Using a broad sci‑fi dataset, this can drift into “trapped / altered reality / memory loss / high-stakes sci‑fi” even when there’s no time-loop concept. That’s why results like “The Witch: Part 1 – The Subversion” (amnesia) and “The Maze Runner” (trapped / escape) can appear.</li><li>Hybrid (BM25 + kNN + reciprocal rank fusion [RRF]) rewards documents that match keywords and meaning. Movies whose descriptions explicitly mention “time loop” or “relive the same day” get a strong lexical boost, so titles like “Edge of Tomorrow” (relive the same day over and over again…) and “Boss Level” (trapped in a time loop that constantly repeats the day…) rise to the top.</li></ul><p>Hybrid search doesn’t guarantee that every result is perfect; it balances lexical and semantic signals, so you may still see some non-time-loop sci‑fi in the tail of the top‑k.</p><p>The main takeaway is that hybrid search helps anchor semantic retrieval with exact textual evidence when the dataset contains those keywords. Check the <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a> for more information on how hybrid search works.</p><h2>Full code example</h2><p>You can find the full demo code on <a href="https://github.com/elastic/hybrid-search-elastic-langchain4j">GitHub</a>.</p><h2>Conclusion</h2><p>In this article, we demonstrated how to use hybrid search in LangChain4j through its Elasticsearch integrations, with a complete Java example. 
This article is an extension of a <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a>, which presents the LangChain integrations for Python and JavaScript and introduces and explains hybrid search. We plan to continue our collaboration with LangChain4j by contributing an embedding model integration backed by our Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference">Inference API</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SearchClaw: Bring Elasticsearch to OpenClaw with composable skills]]></title>
            <description><![CDATA[Give your local AI agent access to Elasticsearch data using OpenClaw, composable skills, and agents, no custom code required.]]></description>
            <link>https://www.elastic.co/search-labs/blog/openclaw-elasticsearch-ai-agents</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/openclaw-elasticsearch-ai-agents</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alex Salgado]]></dc:creator>
            <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ce3248c3cb7a352e3fdafef4ac8116ab998ab4f4-1950x1137.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In recent weeks, <a href="https://openclaw.ai/">OpenClaw</a> has been appearing frequently in AI community discussions, particularly among developers interested in agents, automation, and local runtimes. The project gained traction quickly, which naturally raised a technical question:</p><p><em>What real problem does it solve for engineers?</em></p><p><strong>OpenClaw</strong> is a self-hosted gateway for AI agents: a single runtime that coordinates execution, treats agents as isolated processes, and uses skills (structured instructions in markdown files) as the unit of integration. Conceptually, this isn’t entirely different from what we already do with command line interfaces (CLIs) and scripts, but it’s now formalized around agent-driven workflows.</p><p>This led to a practical exploration within the Elastic Stack:</p><p><em>If we treat OpenClaw as an orchestration runtime, how does it behave when Elasticsearch is the back end? And how straightforward is integration using OpenClaw skills?</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec61e65f54b96b83975b52b2d88305170001d9bd-1999x1445.png" style="width: 100%;" /><p>Let&#x27;s build an integration using composable skills.</p><h2><strong>Solution architecture</strong></h2><p>In this tutorial, we’ll teach OpenClaw how to access and query Elasticsearch data through a custom read-only skill, and we’ll then demonstrate how it composes multiple skills together; for example, combining Elasticsearch queries with real-time weather data to generate dynamic reports.</p><p>Before diving into the hands-on steps, let’s look at what we’re building. 
The solution is composed of three integrated layers that work together through OpenClaw orchestration.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ca8dc124a7410ba036ddf887eee011c42125cdf3-1270x680.png" style="width: 100%;" /><h3>Layer 1: Storage and search (Elasticsearch)</h3><p>The data layer runs on Elasticsearch via <a href="https://github.com/elastic/start-local"><code>start-local</code></a>, a single command that spins up Elasticsearch and Kibana locally with Docker.</p><p>Two sample indices demonstrate different use cases:</p><ul><li><strong><code>fresh_produce</code>:</strong> 10 products with semantic search (ecommerce scenario)</li><li><strong><code>app-logs-synthetic</code>:</strong> 30 log entries across four services (observability scenario)</li></ul><p>The same read-only skill works with both indices without any reconfiguration; the agent inspects the mapping and adapts its queries accordingly.</p><h3>Layer 2: Orchestration (OpenClaw Gateway)</h3><p>The gateway receives natural language requests and loads the Elasticsearch skill, and the large language model (LLM) decides which queries to construct. The skill is a pure <strong><code>SKILL.md</code></strong> with reference docs, meaning that its operations require no custom code.</p><p>To understand how the gateway organizes this, two core OpenClaw concepts are worth knowing:</p><ul><li><strong>Agents:</strong> Independent AI instances, each with its own configuration, workspace, and set of skills. You can run multiple agents for different purposes.</li><li><strong>Workspace:</strong> A folder that defines an agent’s context:<strong> <code>AGENTS.md</code></strong> (the agent’s permanent briefing), <strong><code>.env</code> </strong>(credentials), and a <strong><code>skills/</code></strong> directory. 
Think of it as the agent’s working environment.</li></ul><h3>Layer 3: Skills (composable capabilities)</h3><p>Skills are structured instructions in markdown files (<code>SKILL.md</code>) that teach the agent how to use specific tools or APIs. They can be global (available to all agents), workspace-specific, or bundled with OpenClaw. The agent selectively loads only the skills relevant to each request.</p><p>This tutorial uses two skills:</p><ul><li><strong><code>Elasticsearch-openclaw</code> (custom, built for this tutorial):</strong> A read-only skill that teaches the agent how to search, filter, aggregate, and explore Elasticsearch indices using curl.</li><li><strong><code>Weather</code> (community skill, used for composition demo):</strong> A skill that fetches current weather conditions from external APIs.</li></ul><p>Later in the tutorial, we&#x27;ll demonstrate how OpenClaw composes both skills in a single request, querying Elasticsearch products based on real-time weather data without any custom integration code.</p><h4>Read-only by design</h4><p>The <code>elasticsearch-openclaw</code> skill is <strong>read-only by design</strong>. It provides patterns for searching, filtering, and aggregating data, but it never writes, updates, or deletes. This minimizes the security footprint when giving AI agents access to your Elasticsearch cluster.</p><p>Even if the agent environment is compromised, your data remains safe from modification or deletion. 
This is enforced through:</p><ul><li><strong>Skill design:</strong> No write operation patterns in <code>SKILL.md</code> or reference files.</li><li><strong>API key permissions:</strong> The tutorial uses a read-only API key with only <code>read</code> and <code>view_index_metadata</code> privileges.</li><li><strong>Agent instructions:</strong> <code>AGENTS.md</code> explicitly states &quot;You can SEARCH, FILTER, and AGGREGATE data, but you can NEVER write, update, or delete.&quot;</li></ul><p>This security-first approach is why infrastructure setup (index creation, data loading) must be done manually; by design, the agent cannot do it for you.</p><h2><strong>Prerequisites</strong></h2><p>To follow this tutorial, you’ll need:</p><p><strong>Software and tools:</strong></p><ul><li>Docker Desktop installed and running (Docker Engine with Compose V2).</li><li>Elasticsearch running locally via <code>start-local</code>. (We’ll set this up in the next section.)</li><li>Jina API key (free): <a href="https://jina.ai/embeddings">https://jina.ai/embeddings</a>.</li><li>OpenClaw installed: <a href="https://openclaw.ai">https://openclaw.ai</a>.</li></ul><h3><strong>Setting up the environment</strong></h3><p>Start by cloning the starter project, which contains the skill, workspace configuration, and Dev Tools scripts:</p><pre><code>git clone https://github.com/salgado/elasticsearch-openclaw-start-blog
cd elasticsearch-openclaw-start-blog</code></pre><p>The repository contains:</p><pre><code>elasticsearch-openclaw-start-blog/
├── devtools_fresh_produce.md         ← Creates fresh_produce index (10 products)
├── devtools_app_logs_synthetic.md    ← Creates app-logs-synthetic index (30 logs)
└── openclaw-workspace-elastic-blog/
    ├── AGENTS.md                      ← Agent briefing
    ├── .env.example                   ← Credentials template</code></pre><p><em><strong>Note:</strong> The <code>devtools*.md</code> files contain Kibana Dev Tools commands formatted as reference documentation.</em></p><h4>Installing OpenClaw</h4><p>OpenClaw is a self-hosted gateway. This means you maintain full control over execution and data, but you need to prepare your local environment or server.</p><p>I installed OpenClaw on a separate machine, which is why I included the disclaimer below.</p><p><strong>** Security and responsibility disclaimer **</strong></p><p>Since OpenClaw is an early-stage, rapidly evolving open-source project, the community has raised important discussions about potential security vulnerabilities, especially around token handling and third-party script execution.</p><p><strong>Deployment recommendations:</strong></p><ul><li><strong>Isolated environments:</strong> If you’re not an advanced infrastructure security user, we recommend installing OpenClaw strictly in isolated, controlled environments (such as a dedicated virtual machine [VM], a rootless Docker container, or a test machine).</li><li><strong>Do not use in production:</strong> Avoid running the gateway on servers containing sensitive data or with unrestricted access to your corporate network until the project reaches a more stable, audited version.</li><li><strong>Least privilege:</strong> We reinforce the need to use Elasticsearch API keys with restricted permissions (read-only) to mitigate risks, in case the environment is compromised.</li><li><strong>Network segmentation:</strong> Both Elasticsearch and OpenClaw bind to <code>localhost</code> by default. Keep it that way, unless you have a specific reason to expose them.</li><li><strong>Credential rotation:</strong> Rotate API keys periodically. 
OpenClaw stores credentials locally, so treat the machine’s security as the perimeter.</li><li><strong>Audit logging:</strong> Enable Elasticsearch audit logging to track all API calls made by OpenClaw. This creates a full trail of what the agent accessed and when.</li><li><strong>Keep the installation up to date.</strong></li></ul><p>For a deeper analysis of the security architecture and deployment options, consult the <a href="https://docs.openclaw.ai">official OpenClaw documentation</a>.</p><h4>Runtime installation</h4><p>OpenClaw manages daemons and skill isolation via CLI. Since it’s a recent project that has undergone naming changes, we recommend strictly following the <a href="https://docs.openclaw.ai/install">official documentation</a> to ensure installation compatibility.</p><pre><code># Global gateway installation
curl -fsSL https://openclaw.ai/install.sh | bash</code></pre><h2><strong>Preparing the Elasticsearch back end</strong></h2><p>Before connecting any agent runtime, we need a working Elasticsearch environment with data to query and a secure, <strong>read-only access layer</strong>. In the next two sections, we’ll spin up Elasticsearch locally using <code>start-local</code>, create an index with <code>semantic_text</code> and Jina v5 embeddings, load sample data, validate that semantic search works, and generate a read-only API key. Once this foundation is in place, the Elasticsearch side is complete and we can focus entirely on teaching the agent how to use it.</p><h3>Part 1: Setting up Elasticsearch locally</h3><p>Start a local Elasticsearch and Kibana instance with a single command:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>Once complete: Elasticsearch at <code>http://localhost:9200</code>, Kibana at <code>http://localhost:5601</code>, and credentials in <code>elastic-start-local/.env</code>.</p><h3>Part 2: Configuring the index in Kibana Dev Tools</h3><p>Open <code>http://localhost:5601</code> → Dev Tools and run <code>devtools_fresh_produce.md</code> in order.</p><ul><li><strong>Step 1:</strong> Replace <code>YOUR_JINA_API_KEY</code> with your actual Jina API key (free).</li><li><strong>Step 2:</strong> Save the encoded field immediately; it cannot be retrieved later.</li></ul><p>The key commands in the Dev Tools file are:</p><p><strong>Create the Jina inference endpoint:</strong></p><pre><code>PUT _inference/text_embedding/jina-embeddings-v5
{
  "service": "jinaai",
  "service_settings": {
    "api_key": "YOUR_JINA_API_KEY",
    "model_id": "jina-embeddings-v5-text-small"
  }
}</code></pre><p><strong>Create the index with <code>semantic_text</code>:</strong></p><pre><code>PUT /fresh_produce
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "description": { "type": "text" },
      "category": { "type": "keyword" },
      "price": { "type": "float" },
      "stock_kg": { "type": "float" },
      "on_sale": { "type": "boolean" },
      "image_url": { "type": "keyword" },
      "semantic_content": {
        "type": "semantic_text",
        "inference_id": "jina-embeddings-v5"
      }
    }
  }
}</code></pre><p>The <code>semantic_text</code> field type handles embedding generation automatically at index time.</p><p><strong>Index sample products</strong> using the bulk API (see <code>devtools_fresh_produce.md</code> for the full dataset of 10 products).</p><p><strong>Validate semantic search:</strong></p><pre><code>GET /fresh_produce/_search
{
  "query": {
    "semantic": {
      "field": "semantic_content",
      "query": "healthy colorful meals"
    }
  },
  "size": 3,
  "_source": ["name", "description", "category"]
}</code></pre><p>The semantic query type handles inference on the query side automatically; no need to specify model IDs or embedding details.</p><p><strong>Create a read-only API key:</strong></p><pre><code>POST /_security/api_key
{
  "name": "openclaw-readonly",
  "role_descriptors": {
    "reader": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["fresh_produce", "app-logs-synthetic"],
          "privileges": ["read", "view_index_metadata"]
        }
      ]
    }
  }
}</code></pre><p>Save the encoded value from the response. This is your API key for the OpenClaw configuration.</p><h2>Connecting to OpenClaw</h2><p>With the Elasticsearch back end ready, we can now wire it into OpenClaw. Several Elasticsearch integrations already exist in the ecosystem, from <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic’s own Model Context Protocol (MCP) server</a> to community-built MCP servers. However, most of these offer full CRUD access or are designed for different agent runtimes. Given that the technology is still in its early stages and security remains a primary concern, I chose to build a dedicated skill: simple, read-only, and purpose-built for OpenClaw. This approach ensures that the agent can search, filter, and aggregate data but never modify it, keeping the blast radius minimal even if the environment is compromised.</p><p>In the next sections, we’ll configure credentials, install the skill, create a dedicated agent, and explore how the workspace ties everything together.</p><h3>Install the skill and create the agent</h3><h4>Step 1: Configure credentials</h4><p>From the cloned repository, configure the credentials by copying the environment template and filling in your Elasticsearch URL and the read-only API key:</p><pre><code>cp openclaw-workspace-elastic-blog/.env.example \
  openclaw-workspace-elastic-blog/.env</code></pre><p>Edit the .env file with these two values:</p><pre><code>ELASTICSEARCH_URL: http://localhost:9200 (from start-local)
ELASTICSEARCH_API_KEY: The encoded value from the read-only API key you created in Part 2 (the POST /_security/api_key response)</code></pre><p>Example .env file:</p><pre><code>ELASTICSEARCH_URL=http://localhost:9200
ELASTICSEARCH_API_KEY=VnVaRmxLSDRCQxxxxxxxxbGVfa2V5</code></pre><h4>Step 2: Install the skill from ClawHub</h4><p><a href="https://clawhub.ai/">ClawHub</a> is OpenClaw&#x27;s public skill registry. Think of it as npm for AI agent skills. At the time of this writing, ClawHub hosts over 3,200 skills, covering everything from Slack and GitHub integrations to Internet of Things (IoT) device automation. For this tutorial, we created <code>elasticsearch-openclaw</code>, a custom skill focused on read-only queries using <code>semantic_text</code>, aggregations, and observability on Elasticsearch 9.x. It’s published on ClawHub so you can install it directly. As a best practice, only install skills from trusted sources with known provenance; as with any package manager, review the content before granting access to your agent.</p><p>The <code>elasticsearch-openclaw</code> skill is published on ClawHub.</p><p><strong>Recommended:</strong> Open the OpenClaw Web UI (http://127.0.0.1:18789/) and ask:</p><pre><code>Install the elasticsearch-openclaw skill from https://clawhub.ai/salgado/elasticsearch-openclaw</code></pre><p>OpenClaw will:</p><ul><li>Fetch the skill from ClawHub.</li><li>Install it in the appropriate directory.</li><li>Confirm when ready to use.</li></ul><h4>Step 3: Create the agent</h4><p>Do this by registering a dedicated agent with its own workspace, and then restart the gateway to load the new configuration:</p><pre><code>openclaw agents add elasticsearch-agent \
  --workspace ~/path/to/elasticsearch-openclaw-start-blog/openclaw-workspace-elastic-blog \
  --non-interactive

openclaw gateway restart</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/abc051a513b0cc7dff4a7f02493d51e220c72ad4-1999x1095.png" style="width: 100%;" /><h3>Understanding the workspace</h3><p>Now that the agent is running, let’s look at what makes it tick.</p><h4><code>AGENTS.md</code></h4><p>The <code>AGENTS.md</code> file is the agent’s permanent briefing. It defines who the agent is, what it can do, and how it should behave. For our Elasticsearch agent, this file instructs the agent about the available indices, the read-only constraint, and the preferred query patterns.</p><h4>Skills: When they make a difference</h4>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Without skill</td><td style="border: 1px solid #ccc; padding: 4px;">With <code>elasticsearch-openclaw</code> skill</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent has no knowledge of Elasticsearch query syntax.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent knows semantic, full-text, filtered, and aggregation patterns.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent might attempt write operations.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent is instructed to never write, update, or delete.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent guesses field names and types.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent inspects mappings first and then constructs appropriate queries.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Generic curl commands with trial and error.</td><td style="border: 1px solid #ccc; padding: 4px;">Structured query templates with best practices for Elasticsearch 9.x.</td>
                    </tr>
              </tbody>
            </table>
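<p>To make the difference concrete, here is a simplified, hypothetical sketch of what a read-only Elasticsearch skill file might contain. It is not the published <code>elasticsearch-openclaw</code> skill verbatim; the section names and patterns below are illustrative only:</p><pre><code># SKILL.md (illustrative sketch, not the published skill)

## Purpose
Read-only access to Elasticsearch: search, filter, and aggregate.
Never create, update, or delete documents or indices.

## Before querying
1. List indices: GET _cat/indices?format=json
2. Inspect the mapping: GET {index}/_mapping
3. Choose the query type based on the field types you find.

## Query patterns
- Semantic search on semantic_text fields: use the "semantic" query.
- Exact filters on keyword fields: use "term" clauses in a bool filter.
- Metrics and distributions: use "aggs" with "size": 0.

## Authentication
Send the header: Authorization: ApiKey $ELASTICSEARCH_API_KEY</code></pre><p>A workspace-specific skill like this would live in the agent’s <code>skills/</code> directory alongside <code>AGENTS.md</code> and <code>.env</code>.</p>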
          <h2><strong>Exploring with the agent</strong></h2><p>With the Elasticsearch back end configured and the OpenClaw agent connected, it’s time to see what the agent can actually do. In the next sections, we’ll test natural language queries, explore observability data, and compose multiple skills together.</p><h3><strong>Testing in OpenClaw</strong></h3><p>Open the OpenClaw web UI and try some natural language queries. The agent will inspect the index mapping, choose the appropriate query type, and return results.</p><p>Type:</p><pre><code>“Find products that would be good for a healthy summer salad.”</code></pre><p>Result:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5b4bbe71ad56af5462bccc1475bd10d5233abd9-1011x557.png" style="width: 100%;" /><p>Other ideas to explore:</p><ul><li><strong>Index exploration:</strong> &gt; “What indices do I have in Elasticsearch? Show me the fields of <code>fresh_produce</code>.”</li><li><strong>Filtered search:</strong> &gt; “Show me all products on sale under $15.”</li><li><strong>Aggregations:</strong> &gt; “What’s the average price by category?”</li></ul><h3>Observability</h3><p>To demonstrate that the skill works beyond a single use case, the repository includes a second index: <code>app-logs-synthetic</code>, with 30 synthetic log entries across four fictional services, created from <code>devtools_app_logs_synthetic.md</code>.</p><h4>Setting up the log data</h4><p>Since the skill is read-only, you need to populate the index first. 
The <code>devtools_app_logs_synthetic.md</code> file contains <strong>five commands</strong> (three for setup and two for verification):</p><ul><li><strong><code>Create ingest pipeline</code>:</strong> Adds @timestamp to log entries automatically.</li><li><strong><code>Create index mapping</code>:</strong> Defines the <code>app-logs-synthetic</code> structure (classic fields only, no <code>semantic_text</code>).</li><li><strong><code>Bulk insert logs</code>:</strong> Loads 30 synthetic log entries across four services.</li><li><strong><code>Count query</code>:</strong> Verify 30 documents were indexed.</li><li><strong><code>Sample search</code>:</strong> Quick test to confirm that data is queryable.</li></ul><h4>How to run:</h4><ol><li>Open Kibana Dev Tools: http://localhost:5601 → Dev Tools.</li><li>Copy each numbered block from the .md file.</li><li>Paste into the Dev Tools console.</li><li>Press <strong><em>Ctrl/Cmd+Enter</em></strong> to execute.</li><li>Wait for a successful response before continuing to the next block.</li></ol><p>This creates the <code>app-logs-synthetic</code> index with sample data ready for querying.</p><p>Try this query in the OpenClaw web UI:</p><pre><code>Show me the distribution of HTTP status codes across all services.</code></pre><p>Result:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d35fc1c0ea6d647f1c85163eb0ab8e268c6c4f89-1002x565.png" style="width: 100%;" /><p>Other ideas to explore:</p><ul><li>“How many 500 errors do I have in <code>app-logs-synthetic</code>? Which services are failing?”</li><li>“Which endpoints have the slowest response times?”</li><li>“What happened with the <code>payment-service</code> in the last 24 hours?”</li></ul><p>This is the same skill, same agent, same setup, just pointed at different data. 
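</p><p>For illustration, a request like “Show me the distribution of HTTP status codes across all services” typically maps to a terms aggregation. The sketch below is a hedged example, not the agent’s literal output; the field names <code>service</code> and <code>http_status</code> are assumptions, so check the mapping created by <code>devtools_app_logs_synthetic.md</code> before reusing it:</p><pre><code>GET /app-logs-synthetic/_search
{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "status_codes": { "terms": { "field": "http_status" } }
      }
    }
  }
}</code></pre><p>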
The agent inspects the new index mapping, adapts its queries, and returns relevant results without any reconfiguration.</p><h2><strong>Composing skills in action</strong></h2><p>This is where composable skills truly shine. Start by asking the agent:</p><pre><code>Install the weather skill.</code></pre><p>OpenClaw will search for the weather skill, automatically attempt the installation, and guide you through the process. Just follow the on-screen instructions; no new API key is required for the weather skill. Afterward, try this:</p><pre><code>“Find the products on sale in the fresh_produce index that match today’s weather in São Paulo. Generate a nice HTML report with product cards using the image_url field from each document, price, description, and stock. Save it to ~/Desktop/report.html and open it in the browser.”</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/14fa4303bb2f1eb19530d8844f09c99948b3c752-1965x1079.png" style="width: 100%;" /><p>In a single request, the agent chains multiple skills: the <strong>weather skill</strong> to check current conditions, the <strong>Elasticsearch skill </strong>to run a hybrid search on products that match the context, and its built-in file and browser tools to generate an HTML report and open it. No custom integration code, no glue scripts, just skills composed by the LLM at runtime.</p><p>This is what makes OpenClaw different from a traditional automation framework. You don’t preprogram the workflow. You describe the outcome, and the agent figures out the composition.</p><h2><strong>Conclusion</strong></h2><p>SearchClaw started as a simple experiment and ended up demonstrating what composable, LLM-driven integration looks like in practice. The key takeaway is not the individual tools (all are familiar) but the approach. Instead of writing a specific application with hardcoded queries, we gave the agent capabilities and let it compose solutions dynamically. 
This is what makes the approach OpenClaw-native: composable, LLM-driven, and local-first.</p><p>As with any early-stage project, OpenClaw should be used thoughtfully, especially regarding security and environment isolation. The read-only skill approach demonstrated here is one way to limit risk while still unlocking the value of your Elasticsearch data.</p><p>The full code is available in the repository and can serve as a starting point for your own integrations: <a href="https://github.com/salgado/elasticsearch-openclaw-start-blog">https://github.com/salgado/elasticsearch-openclaw-start-blog</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building effective database retrieval tools for context engineering]]></title>
            <description><![CDATA[Best practices for writing database retrieval tools for context engineering. Learn how to design and evaluate agent tools for interacting with Elasticsearch data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Leonie Monigatti]]></dc:creator>
            <pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/06d3502988707f6f88f9b18243072cbbed3eb8af-1200x630.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Good search is incredibly difficult. That&#x27;s why we have an ecosystem full of search algorithms, tooling, and even dedicated stacks like Elastic’s, not to mention an entire field called <em>information retrieval</em>. In modern <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">context engineering</a>, AI agents search and retrieve context from many different sources, including local files, the web, or memory files. Giving agents access to tools to interact with data stored in a database allows them to ground their answers in proprietary information or even complete analytical tasks.</p><p>However, if these interfaces aren’t carefully engineered, agents can search the wrong index, generate invalid SQL/Elasticsearch Query Language (ES|QL) queries, or return large amounts of irrelevant data. During the development of <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a>, we’ve seen these failure modes multiple times. While collaborating with dozens of internal teams to build tools for interacting with Elasticsearch data and integrating them to improve our internal processes with agentic workflows, such as our <a href="https://www.elastic.co/search-labs/blog/agent-builder-one-workflow">internal laptop refresh process</a>, we found that the most successful teams carefully design database retrieval tools as curated interfaces to their data.</p><p>In this blog, we share the best practices we follow when building database retrieval tools. In fact, the principles we share are based on common patterns we saw while iteratively improving our prebuilt tools and helping internal teams build custom tools.</p><h2>Key challenges of agentic retrieval</h2><p>Coding and search are among the best use cases of agents. 
Even though coding agents have recently made substantial progress with new concepts, such as filesystem-oriented tools and code-specific embeddings, search agents (specifically for database retrieval) lack comparable reported breakthroughs.</p><p>Agent use cases are challenging for multiple reasons: They can ignore the available tools to accomplish a task; they can call the wrong tools; and they can call the right tools with the wrong parameters. In addition to these general challenges, we believe that database retrieval use cases are challenging for the following three key reasons:</p><ul><li><strong>Identifying the right index of data</strong> requires the large language model (LLM) to understand what each index contains. But sometimes, the number of <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">indices</a> can be so large that merely listing them for selection can cause context-length problems.</li><li><strong>Generating efficient queries</strong> that balance retrieving relevant information with minimizing latency and resource usage can be challenging.</li><li><strong>Avoiding context bloat with tool responses </strong>requires the tool response to be optimized for contextual relevance and<strong> </strong>token efficiency. This isn’t always easy, especially when the agent generates the query from scratch. Once the context is no longer relevant to answering a user’s query, offloading the data for later reference is another challenge.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6736afbe774da63ef9534a64f65a7de1f7d075c0-1280x720.png" style="width: 100%;" /><p>While developing Agent Builder and integrating it into our own processes, we faced these challenges multiple times. 
In fact, the principles we share in the following sections are based on common patterns we saw while iteratively improving our built-in and custom tools and the internal workflows built on top of them.</p><h2>Principles for building effective database retrieval tools</h2><p>In this section, we translate our learnings into guiding principles for building effective database retrieval tools: deciding which tools to build, making sure the agent finds the right index to search and calls the right tool with appropriate parameters, optimizing the tool responses, handling errors, and safeguarding the data.</p><h3>Building the right database retrieval tools (“low floor, high ceiling”)</h3><p>When deciding which database retrieval tools to build, we follow the principle of “low floor, high ceiling” for good <a href="https://biilmann.blog/articles/introducing-ax/">agent experience</a>:</p><ul><li><strong>High ceiling: </strong>Tools that don’t limit the agent&#x27;s potential to handle ambiguous user queries in the worst case. In the context of database retrieval, these are <strong>general-purpose</strong> tools that, for example, allow the agent to write full SQL/ES|QL queries from scratch. These come with the trade-off of added reasoning overhead for the agent, resulting in higher latency, higher cost, and lower reliability.</li><li><strong>Low floor:</strong> Tools with high accessibility that the agent can use successfully on the first try with minimal reasoning overhead for recurring user queries. In the context of database retrieval, these are <strong>specialized tools</strong> that, for example, wrap specific queries. These have the benefit of lower latency, lower cost, and higher reliability than general-purpose tools. 
However, they require engineering effort, and realistically, it may not be possible for engineers to anticipate every possible user query.</li></ul><p>For example, in our experience, a generic search tool is mandatory to allow the agent to handle unique and ambiguous user queries in the worst case. However, we found it necessary to reduce reasoning overhead and increase efficiency by creating specialized tools (for example, <code>get_top_performing_products(category)</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d1b973be4b0a0a25c99c74f02a47e98395a3f7-1200x630.png" style="width: 100%;" /><p>Another lesson we learned is to consider the level of abstraction of a tool. During our preview phase, the agent had access to a large number of atomic general-purpose tools (for example, <code>get_index_mappings</code>, <code>generate_esql</code>, <code>execute_query</code>, and others). In practice, this had two downsides: When faced with a complex, open-ended question, the agent would confuse tools and their expected order, despite guiding instructions. Combining multiple tools in an agentic workflow also requires hand-off of information via the context window, which leads to filling up the context window with information that’s only temporarily important. To overcome this inefficiency, we wrapped the functionality of multiple atomic tools into one self-contained search tool.</p><h3>Finding the right index</h3><p>While the majority of tools that interact with a database will focus on querying the database, in some use cases, like for index selection, the tool will allow the agent to interact with the database’s metadata to decide which indices to search against based on a user’s query and intent.</p><p>Initially, our naive index selection relied on the index names and a sample of their schema definitions. 
This worked well in our internal testing, but when internal teams experimented with it, we realized real use cases often don’t have index names that are distinct and descriptive but are instead vague (for example, <code>users</code>, <code>logs</code>, <code>flight_travels</code> versus <code>web-logs-2026.01</code>, <code>web-logs-2026.02</code>).</p><p>To overcome this, we started exposing each index’s metadata and schema definitions in the tool. This significantly improved index selection by allowing engineers to add descriptions to translate technical names into natural language on two levels:</p><ul><li><strong>Index-level descriptions: </strong>What data is stored in the index and how documents relate to one another.</li><li><strong>Field-level hints:</strong> Specific guidance on a field&#x27;s format, expected values, or business meaning (for example, <code>"Use this field for exact ID matches only"</code>).</li></ul><p>In another iteration, we decided to add data sampling. For each index, we sample some of the data present in each field to give the agent a clearer understanding of the type of data in the index. While it significantly improved the efficiency of index selection, it comes at the cost of increased tool response size.</p><h3>Calling the right database retrieval tool</h3><p>Guiding an agent to call the right tool is already challenging enough in general. This section discusses what helped us make sure the agent calls a tool to ground its response, as well as calls the right database retrieval tool.</p><h4>Naming and namespacing: Standardizing identifiers for reliable selection</h4><p>A tool’s name acts as a skimmable header that agents use to decide which one to investigate further. In practice, this means using descriptive and distinct tool names. Additionally, tool selection is more reliable when names are consistent in formatting (for example, standardize on snake_case) and wording. 
Using action-oriented verbs helps the LLM map the user’s intent to the tool’s purpose, although the exact wording of the tool names is less critical in practice (for example, <code>search</code> versus <code>find</code>).</p><p>Namespacing tools to group related tools under common prefixes or suffixes is similarly helpful. In the case of databases, namespacing tools by index or domain helps the LLM understand tool relationships and prevents naming collisions (for example, <code>finance.search_ticker</code> or <code>support.get_ticket_details</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae64cafe7a3aade3985780d7981a853e90387945-1200x630.png" style="width: 100%;" /><h4>Tool description: Instructing the agent on proper usage</h4><p>The description is the most crucial component of any tool definition because it instructs the agent on <em>when</em> and <em>how</em> to use it, especially when tools have similar names (for example, <code>search_logs</code> and <code>find_errors</code>).</p><p>Consider this template for complex tools:</p><ul><li><strong>Core purpose: </strong>A high-level summary of what the tool does.</li><li><strong>Trigger: </strong>When the tool should be used (and when it should not).</li><li><strong>Action:</strong> Which specific data the tool retrieves or modifies, and what type of questions it can answer.</li><li><strong>Limitations: </strong>What important limitations and constraints exist, such as specific query languages or formats.</li><li><strong>Relationships with other tools: </strong>Does one tool affect another tool, or are there any preconditions?</li><li><strong>Examples:</strong> Specific few-shot examples of user queries and how to use the tool for them, such as how to determine the optimal search strategy or when to use which operator.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3e19d6472cbc62fe2be0e4efcb918e52c10d2f8-1200x630.png" style="width: 100%;" 
/><p>A note on model sensitivity: While top-tier models like Claude 4.5 Sonnet are forgiving, smaller models often need clearer and more elaborate descriptions to select the right tool.</p><h4>Adding reasoning parameters</h4><p>Inspired by the <a href="https://arxiv.org/pdf/2601.18282v1">paper on think-augmented function calling</a>, we added a reasoning parameter. This approach improves parameter accuracy by providing a scratchpad for the LLM to process its thoughts, and it facilitates a more transparent user experience.</p><p>This works well for complex tool calls or when a large number of tools are exposed to the agent. However, it can lead to regressions in simple scenarios, and the benefits are further diminished for thinking-based LLMs. In our Agent Builder implementation, the reasoning parameter is optional, is used only during tool selection and parameter filling, and is stripped before execution.</p><pre><code>"properties": {
   "reasoning": {
      "type": "string",
      "description": "Brief explanation of why you're calling this tool"
   }
}</code></pre><h4>Support: Reinforcing instructions in the agent prompt</h4><p>A common error we observed is that the LLM would sometimes ignore the available tools and instead use its innate knowledge to generate a (hallucinated) response. For example, when asked “Can you tell me more information about Elasticsearch’s ES|QL language?”, it would assume it was fine to answer by itself instead of calling the tool, which was specifically designed to fetch documentation about Elastic products.</p><p>To mitigate this problem, we added repeated, explicit instructions in the system prompt of the agent itself to guide the agent to find the right balance between using its innate knowledge and grounding its answer in a tool response. Our testing indicates that this is especially effective when multiple tools with similar purposes are exposed to the agent.</p><h4>Forcing tool usage</h4><p>Beyond reinforcing the instructions in the agent prompt, we found it helpful to force tool usage, when it is mandatory, by explicitly binding tools using <code>tool_choice: 'any'</code>.</p><h3>Calling database retrieval tools with appropriate values and writing queries</h3><p>Another challenge is to enable the agent to call a tool with appropriate parameter values. We’ve observed a consistent pattern where a strong definition, the number of parameters, and their complexity play an important role in reducing errors.</p><h4>Parameter definition</h4><p>A strong parameter definition significantly improves parameter accuracy. General best practices for parameter definitions in agent tools are:</p><ul><li><strong>Unambiguous name:</strong> Clearly identifies the purpose (for example, <code>user_id</code> versus <code>user</code>).</li><li><strong>Strong typing:</strong> Uses integer, string, or enums, among others, for finite sets of valid values.</li><li><strong>Detailed description:</strong> Explains what the parameter means and when and how to use it. 
Specifies default values for missing values, documents formats (for example, for dates) and hidden rules (for example, <code>"at least one of agent_id | user_id is required"</code>), and includes small examples.</li></ul><pre><code># Weak parameter description
"properties": {
   "index": {
      "description": "Name of the index"
   }
}

# Strong parameter description
"properties": {
   "index": {
      "type": "string",
      "description": "The specific index, alias, or data stream to search. Defaults to 'main-alias' if unknown."
   }
}</code></pre><h4>Number of parameters</h4><p>Agents struggle to supply appropriate values to a tool with a large number of parameters, especially mandatory ones. As a general rule of thumb, we try to keep the mandatory parameters below five and the total parameters below 10.</p><h4>Parameter complexity</h4><p>Reducing the complexity of the input parameter when possible leads to fewer mistakes. For example, it requires far less reasoning overhead for an LLM to generate a search term than to generate full SQL or ES|QL queries from scratch. Especially for recurring user queries, “pre-canning” search queries reduces latency, cost, and error rates (although modern LLMs are good at using well-known languages, such as SQL).</p><p>To follow the principle of “low floor, high ceiling,” we gravitated toward wrapping a specific query inside a tool and letting the agent only provide the search term. Below is an example of input parameters with varying complexity for the same user query, &quot;Find the 5 most relevant &#x27;resolved&#x27; support tickets based on a new problem description to find previous solutions.&quot;</p><pre><code># Complex parameter
search_support_tickets(
   "FROM support_tickets
    | WHERE status == 'resolved' AND customer_email == ?email
    | WHERE MATCH(issue_title, ?problem_description) OR MATCH(issue_description, ?problem_description)
    | KEEP ticket_id, issue_title, resolution
    | LIMIT 5"
)

# Simple parameter
find_similar_customer_resolved_tickets(problem_description)</code></pre><h4>Model sensitivity</h4><p>Models have a strong impact on parameter accuracy and query generation. Our internal benchmarking showed that switching from Claude 3.7 Sonnet to Claude 4.5 Sonnet reduced the syntax errors of the generated ES|QL queries from ~28% to ~4%.</p><h4>Input validation</h4><p>Although the above techniques increase parameter accuracy, they don’t eliminate the possibility of errors. Instead of trusting the LLM&#x27;s input, we gravitated toward always validating and sanitizing it to ensure queries adhere to the expected schema.</p><h3>Optimizing database retrieval tool responses</h3><p>One common mistake is neglecting the size of the tool’s output. Because the tool’s output is what goes into the agent’s context window, not optimizing it for <strong>contextual relevance</strong> (quality) can distract the agent, and not optimizing it for <strong>token efficiency</strong> (quantity) can increase cost and risk exceeding the LLM’s context window limit. Working alongside internal teams, we’ve identified three dimensions for reviewing the return values:</p><p><strong>The first dimension is length:</strong> The number of search results. A recurring pitfall we’ve observed among teams is the failure to limit search results, which can lead to overflowing the context window. While vector search queries inherently require a limiting parameter, other search methods often do not. We recommend including <code>LIMIT</code> clauses (for example, 10 to 20) in all query types. This prevents returning low-signal results and ensures token efficiency.</p><p><strong>The second dimension is width: </strong>The number of fields per data object. 
Instead of returning all properties, including cluttering ones (for example, timestamps and internal IDs, among others), curating a set of relevant fields can help improve both the user experience and the agent behavior.</p><p>Consider the following examples:</p><ul><li><strong>Returning identifiers </strong>with a concise set of properties of a data object allows the agent to retrieve full information about a data object later when it needs it (“just-in-time context engineering”), rather than cluttering the context window.</li><li><strong>Returning metadata for citations</strong> (for example, page numbers in large PDF documents) can help build trust for the user.</li><li><strong>Returning the number of search results and status messages</strong> can help the agent reason about the status of the search query.</li></ul><p><strong>The third dimension is depth:</strong> The size of a single field. Consider a case where the documents themselves are large (for example, on the scale of tens of megabytes). Those can’t just be passed back to the agent in full, as context length will instantly reach its limit. To mitigate this, we recommend truncating long text fields when an excerpt is sufficient. 
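As a minimal sketch (the function, field names, and limits here are hypothetical, not Agent Builder code), a tool might enforce all three dimensions before returning results to the agent:</p><pre><code># Hypothetical sketch: shaping a tool response along the three dimensions
def shape_tool_response(hits, max_results=10,
                        fields=("ticket_id", "issue_title", "resolution"),
                        max_chars=500):
    results = []
    for hit in hits[:max_results]:                      # length: cap the result count
        doc = {k: hit[k] for k in fields if k in hit}   # width: keep curated fields only
        for key, value in doc.items():                  # depth: truncate long text fields
            if isinstance(value, str) and len(value) > max_chars:
                doc[key] = value[:max_chars] + " [truncated]"
        results.append(doc)
    # Status metadata helps the agent reason about the search outcome
    return {"total_hits": len(hits), "returned": len(results), "results": results}</code></pre><p>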
We found that features such as <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/highlighting">Elasticsearch’s highlighting</a> are helpful for this without the need for another LLM API call.</p><p>But even when working with smaller documents, having the tool shape the data into an easily digestible form for the LLM, such as sanitizing the content (for example, stripping HTML tags) and formatting it for readability (for example, converting tables to Markdown tables or links to “[Title](url)”), improved downstream performance.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74a92e1102e235ecd7fd9868d9c3a81a59809c9c-1200x630.png" style="width: 100%;" /><p>While using only one of these techniques is often not sufficient to improve contextual relevance, combining all of them might reduce retrieval recall (for example, limiting the number of search results might risk not retrieving all of the relevant documents). In practice, this requires evaluating different combinations of these approaches to find the best balance.</p><h3>Handling errors and enabling self-correction</h3><p>We’ve observed that agents can get stuck in infinite loops or hallucinate responses when they encounter an error. Even if an agent follows its instructions perfectly, if a tool doesn’t provide any error message, only returns an error code, or at best provides a short, nondescriptive error message, the agent has no chance to self-correct from an error it doesn’t understand.</p><p>An <strong>informative error message</strong> enables the agent to understand why the error is happening and how to recover from it. For this, engineers need to think about the “not so happy” paths and the anticipated edge cases, such as the following examples:</p><p>If an error occurred because of a wrongly formulated search query, the agent should be able to reason over the failure and reformulate the query. 
In this case, returning the number of search results and the generated query can help the agent self-correct.</p><p>In general, engineers need to consider whether “zero results” is an expected behavior or an error for a given tool. In cases where an empty result likely indicates an error, both an error message and agent instructions can be helpful.</p><pre><code># Example error message from tool
"No product data found for product ID [XYZ]. 

Review the search query [insert used search query here].

Ask the customer to confirm the product name, and look up the product ID by name to confirm you have the correct ID."

---
# Example agent instruction
"If the product_search tool returns no results, do not state that the product does not exist. 

First, check that the `:` operator was used for multi-value fields.

Second, ask the user to provide and confirm the product ID or provide an alternative identifier like product name."</code></pre><p>When encountering an API failure due to an expired API key, strictly limit retries (for example, a maximum of two or three) to prevent the agent from repeatedly trying a failing deterministic flow.</p><h3>Safeguarding data</h3><p>The primary engineering challenge for production-grade agent applications with different tools lies in identity propagation, specifically the distinct requirements of <em>authentication</em> (verifying who the user is) and <em>authorization</em> (verifying what they can access). While an initial layer (for example, Okta) can provide the base authentication, downstream systems (for example, ServiceNow, Elasticsearch, and others) maintain their own nonuniform authorization architectures with varying levels of granularity.</p><p>The most successful implementations we’ve seen enforce identity verification at every touchpoint within the tool&#x27;s logic. This prevents the agent from accidentally accessing data that the end user isn&#x27;t allowed to see. By verifying identity at every system level, we ensure that the agent respects privacy even when navigating complex, multisystem workflows. Be aware of the downside; this introduces intentional, security-mandated latency.</p><p>Beyond verifying the identity at every touchpoint, managing credentials securely is critical. Exposing sensitive API keys in tool definitions or hardcoding database credentials in YAML configuration files is a high-risk vulnerability. Instead, we recommend that engineers use secure credential management systems.</p><h2>Evaluating database retrieval tools</h2><p>The development of database retrieval tools for agentic systems is an iterative, evaluation-driven process. 
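As a minimal sketch (the dataset entries, tool names, and <code>run_agent</code> helper here are hypothetical), such an evaluation loop might score tool selection against expected tool calls:</p><pre><code># Hypothetical sketch: scoring an agent run against an evaluation dataset
dataset = [
    {"query": "Am I eligible for a laptop refresh?", "expected_tool": "check_eligibility"},
    {"query": "Find resolved tickets about login errors", "expected_tool": "search_support_tickets"},
]

def score(run_agent, dataset):
    """run_agent(query) returns the ordered list of tool names the agent called."""
    correct = first_pass = total_calls = 0
    for case in dataset:
        calls = run_agent(case["query"])
        total_calls += len(calls)
        if case["expected_tool"] in calls:
            correct += 1                          # eventual success
            if calls[0] == case["expected_tool"]:
                first_pass += 1                   # success on the first try
    n = len(dataset)
    return {
        "tool_selection_accuracy": correct / n,
        "first_pass_success_rate": first_pass / n,
        "avg_tool_calls_per_answer": total_calls / n,
    }</code></pre><p>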
To evaluate the effectiveness of our database retrieval tools and uncover issues, our internal teams maintain evaluation datasets with realistic user queries and the expected tool calls (for example, ”Am I eligible for a laptop refresh?” expects the <code>check_eligibility</code> tool). We’ve used the following metrics for evaluation of our database retrieval tools and for benchmarking different LLMs for model selection:</p><ul><li><strong>Tool selection accuracy: </strong>How often the correct tool was selected for a specific query type (for example, retrieval, analytical, hybrid, adversarial).</li><li><strong>First-pass success rate: </strong>A critical metric for us was distinguishing between eventual success and immediate success. Does the agent pick the right tool on the first try, or does it require a self-correction loop? (High self-correction indicates poor tool descriptions.)</li><li><strong>Average tool calls per answer:</strong> We track the efficiency of the agent. If the average number of tool calls to answer a simple question jumps from 1.5 to 4, it usually indicates that the agent is getting lost or that the tools are too granular.</li><li><strong>Tool-specific recall/precision:</strong> For dedicated database retrieval tools, we measure standard retrieval metrics to ensure that the documents returned are relevant to the arguments passed by the LLM.</li><li><strong>Failure rate:</strong> We strictly monitor the rate of malformed tool calls (for example, missing parameters) to identify which models need more &quot;hand-holding&quot; in the prompt instructions.</li></ul><p>Once live, our teams at Elastic continue to <strong>monitor</strong> agentic health and log telemetry data (for example, every success and failure and the time taken for the agent to complete the task) in Kibana. 
This allows our ITOps teams to build dashboards to answer meta-questions like &quot;What is the failure rate this week?&quot;, &quot;How many laptop requests came from California?&quot;, and &quot;How many requests were fulfilled?&quot; without building a custom analytics engine.</p><h2>Summary</h2><p>During the iterative, evaluation-driven development process of <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a>, we identified consistent patterns in what makes database retrieval tools effective for context engineering. When implementing tools to search, retrieve, and manipulate data, we follow these core principles:</p><ol><li><strong>Building the right database retrieval tools</strong> by following the “low floor, high ceiling” principle.</li><li><strong>Helping the agent call the right database retrieval tool with appropriate parameter values</strong> through careful and reinforced prompting and interface design.</li><li><strong>Avoiding context flooding</strong> by optimizing the tool response for contextual relevance (quality) and token efficiency (quantity).</li></ol><p>However, there remain open challenges, and we’re actively working on improving these techniques:</p><ul><li>Context bloat is a primary hurdle for agent performance because retrieved data remains in the context window as the conversation progresses. A future direction is to dynamically off-load large chunks of data, such as tool responses or attachments, to a file store and allow the agent to retrieve them on demand.</li><li>Efficient discovery of large volumes of tools and data attachments will be central for building production-grade agents. 
We plan to introduce agent skills with the functionality of <a href="https://claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills"><em>progressive disclosure</em></a> (loading information only as needed) and implementing a semantic metadata layer.</li></ul><h3>Acknowledgements</h3><p>Written by Leonie Monigatti with valuable contributions from colleagues across Search Solutions Engineering (Sean Story, Pierre Gayvallet, Abhimanyu Anand) and Enterprise Applications (Sri Kolagani).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build task-aware agents with an expanded model catalog on Elastic Inference Service (EIS)]]></title>
            <description><![CDATA[Elastic Inference Service (EIS) expands its managed model catalog, enabling teams to build production-ready agents with flexible model choice across retrieval, generation, and reasoning, without managing GPUs or infrastructure.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-ai-agents-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-ai-agents-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Anish Mathur, Deepti Dheer, Ranjana Devaji]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8c0ffb8835c49c116da7163edb0cf7f6857817c-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we’re pleased to announce an expanded model catalog for Elastic Inference Service (EIS), making it easy to run fast, high-quality inference on managed GPUs, without setup or hosting complexity.</p><p>EIS already provides access to state-of-the-art large language models (LLMs) that power <a href="https://www.elastic.co/blog/elastic-out-of-box-genai-capabilities">out-of-the-box AI capabilities</a> across <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a> and Elastic AI Assistants, including automatic ingest, threat detection, problem investigation, and root cause analysis. We’re now extending this foundation with a <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis#llm-chat-models">broader catalog</a> of managed models, giving developers more control over how agents reason, retrieve, and act.</p><p>In practice, this reflects a broader shift in how enterprises build AI systems. The idea of a single, all-purpose AI model no longer holds up. Real-world agent workflows require multiple models with different strengths, costs, and performance characteristics. 
With EIS, teams can either choose and switch models directly in Agent Builder, with zero setup, cost, or hosting overhead, or they can mix and match models in an agent <a href="https://www.elastic.co/blog/elastic-workflows-technical-preview">workflow</a> so each step uses the model best suited to the task.</p><p>Developers can use models from OpenAI, Anthropic, and Google directly in Elasticsearch, selecting different models for different agent steps while Elastic fully manages inference, scaling, and GPU execution for production agents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4eea6b9d6fbe492d75947828bf1c46ae9801387a-1234x892.png" style="width: 100%;" /><h2>An expanded catalog of managed models on EIS</h2><p>The expanded EIS catalog now includes models optimized for different classes of tasks, from lightweight generation to large-context reasoning and embeddings for retrieval.</p><p>For generation, the catalog includes:</p><ul><li>Anthropic Claude Opus 4.5 and 4.6.</li><li>Gemini 2.5 Flash.</li><li>Gemini 2.5 Pro.</li><li>OpenAI GPT-4.1 and GPT-4.1 Mini.</li><li>OpenAI GPT-5.2.</li><li>OpenAI GPT-OSS-120B.</li></ul><p>For retrieval, EIS includes native Jina AI models, <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service"><code>jina-embeddings-v3</code></a> and <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><code>jina-embeddings-v5</code></a>, which provide fast, high-quality embeddings for multilingual retrieval. The service also includes embedding models from Microsoft, OpenAI, Google, and Alibaba.</p><h2>Choosing the right models for agent tasks</h2><p>With EIS, model choice becomes a design decision inside the agent, rather than an operational concern. 
Agents can select models based on the role they play, without changing how inference is deployed or scaled.</p><p>To see how this plays out in practice, consider a few common agent scenarios.</p><h3>Simple informational query</h3><p>Simple interactions, such as answering “What is our holiday policy?,” do not require an expensive frontier model and can be handled by a fast, low-cost option.</p><ul><li>Task: “What is our holiday policy?”</li><li>Pattern: Retrieve and summarize.</li><li>Model choice: Fast, low-cost generation model.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3c4bb30b433c9026d87d28fc16a5afa93594401-1436x1396.png" style="width: 100%;" /><p>This can also be configured <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-post-agent-builder-converse">through the API</a> by selecting the model you want to use:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "What is our holiday policy?",
 "agent_id": "internal-kb-bot",
 "connector_id": "Anthropic-Claude-Sonnet-4-5"
}</code></pre><p>This step relies primarily on retrieval quality. A lightweight model is sufficient to summarize a small set of documents quickly.</p><h3>Moderate capability</h3><p>More complex tasks may benefit from a more capable generation model, without necessarily requiring the most expensive reasoning model available.</p><ul><li>Task: “Compare our holiday policy with new labor laws in France and draft an email.”</li><li>Pattern: Retrieve relevant documents, compare policy details across sources, and generate output such as a draft email.</li><li>Model choice: More capable generation model.</li></ul><a href="https://videos.elastic.co/watch/ajCYrZDZBqSQ1WUhahiShk">Watch video on Vidyard</a><p>Here’s the API example:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "Compare our holiday policy with new labor laws in France and draft an email.",
 "agent_id": "internal-kb-bot",
 "connector_id": "Google-Gemini-2-5-Pro"
}</code></pre><p>This task requires synthesis across multiple sources and structured output but doesn’t need the heaviest frontier reasoning model.</p><h3>Investigation or audit task (high capability)</h3><ul><li>Task: Review a large document set to identify compliance risks.</li><li>Pattern: Multistep reasoning over large context, where the model evaluates information across many documents and synthesizes findings before producing a final judgment.</li><li>Model choice: Frontier or large-context model.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54a500766139b7c39785eee38184a4479ee76e2b-1080x1080.gif" style="width: 100%;" /><p>Try it out using the API:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "What are the compliance risks associated with Example AI products?",
 "connector_id": "Anthropic-Claude-Opus-4-6"
}</code></pre><p>Because the task requires deeper reasoning and consistent evaluation across many inputs, output quality matters more. A high-capability model is therefore appropriate for this step.</p><p>EIS also enables more advanced orchestration patterns. Enterprises increasingly recognize that using a frontier model for every agent step is inefficient.</p><p>With Agent Builder and Elastic Workflows, teams can design agents where each subtask is executed by the most efficient model for the job, based on cost, complexity, and accuracy requirements.</p><h3>Models-as-judge pattern (quality control)</h3><ul><li>Task: Validate an agent’s output using a second model</li><li>Pattern: Generate and evaluate.</li></ul><p>In this <a href="https://www.elastic.co/blog/elastic-workflows-technical-preview">Elastic Workflow</a> example, the agent uses one model to generate a response and a second model to evaluate its quality, adding a validation layer for the result. Elastic Workflows, the automation engine built into Elasticsearch, let developers combine reliable scripted automation with AI-driven steps for tasks that require reasoning.</p><a href="https://videos.elastic.co/watch/FoSG66tYxtgshaWAUHZFHy">Watch video on Vidyard</a><p>The multimodel approach enables new reliability patterns by separating generation from evaluation, allowing one model to produce a response and another to validate it. Today, teams can implement this by pairing a general-purpose generation model with a lighter-weight evaluation model.</p><p>Over time, this pattern naturally lends itself to specialized judging and safeguard models designed specifically for validation, policy checks, and quality control. As these models become available, EIS makes it straightforward to introduce them into agent workflows without changing how inference is deployed or managed.</p><h2>What’s next</h2><p>EIS is actively evolving, with more models on the way. 
You can track what’s coming next and what we’re currently building on the <a href="https://github.com/orgs/elastic/projects/2066/views/2">Elastic public roadmap</a>.</p><h2>Get started</h2><p>Elastic Inference Service makes it easy to start with default models and evolve toward sophisticated, multimodel agent workflows over time, all within Elasticsearch. Whether you’re building global retrieval augmented generation (RAG) systems, search, or <a href="https://www.elastic.co/search-labs/blog/ai-agentic-workflows-elastic-ai-agent-builder">agentic workflows</a> that need reliable context, Elastic now gives you high-performance models out of the box, along with the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to Elastic Inference Service. Try it now on <a href="https://cloud.elastic.co/serverless-registration">Elastic Cloud Serverless</a> or Elastic Cloud Hosted, or use <a href="https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service">EIS via Cloud Connect</a> with your self-managed cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Does MCP make search obsolete? Not even close]]></title>
            <description><![CDATA[Explore why search engines and indexed search remain the foundation for scalable, accurate, enterprise-grade AI, even in the age of MCP, federated search, and large context windows.]]></description>
            <link>https://www.elastic.co/search-labs/blog/future-of-search-engines-indexed-search-mcp</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/future-of-search-engines-indexed-search-mcp</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Dayananda Srinivas]]></dc:creator>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b8af1c3ca51f2c038406c714eb9a71b696bbc5a-1999x1091.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>With the rise of large language models (LLMs), agent frameworks, and new protocols like Model Context Protocol (MCP), a provocative question is starting to surface:</p><blockquote><strong>Do we still need a search engine at all?</strong></blockquote><p>If agents can call tools on demand and models can reason over massive context windows, why not just fetch data live from every system and let the LLM figure it out?</p><p>It’s a reasonable question. It’s also the wrong conclusion.</p><p>The reality is that MCP and agent tooling don’t eliminate the need for search. They make the quality of search <strong>more critical than ever</strong>. In this blog, we’ll explore why MCP, federated search, and large context windows don’t replace search engines and why indexes remain the foundational layer for scalable, accurate, enterprise-grade AI.</p><h2><strong>What MCP actually is (and what it is not)</strong></h2><p>MCP is a <strong>coordination protocol</strong>. It standardizes how an agent requests information or actions from external systems.</p><p>What MCP <em>doesn’t</em> do:</p><ul><li>Rank results across systems.</li><li>Understand relevance across heterogeneous data.</li><li>Normalize schemas or metadata.</li><li>Transform or enrich data at scale.</li><li>Apply consistent security and permissions.</li><li>Optimize for latency, cost, or scale.</li></ul><p>In other words, <strong>MCP tells agents <em>how</em> to ask for data, not <em>which</em> data matters most</strong>.</p><h2><strong>Modern retrieval requires query intelligence, not just data access</strong></h2><p>In modern enterprise search architectures, retrieval quality is determined long before a query reaches an index. 
Raw queries — especially those generated by agents — may be incomplete, overly literal, schema-driven rather than intent-driven, and at times syntactically invalid.</p><p>This is why mature search platforms introduce a query intelligence layer that performs query rewriting, entity normalization, synonym expansion, and intent disambiguation before retrieval even begins.</p><p>For example, an agent-generated request such as: “Show severity 2 authentication failures from last sprint” may be rewritten to include authentication synonyms (login, SSO, OAuth), normalized severity mappings, and sprint-to-date-range translation. The result is not just more matches — it is more <em>relevant</em> matches.</p><p>In enterprise AI, retrieval is not a single step. It is a controlled pipeline.</p><p>This distinction is crucial because once MCP-based agents start pulling information live from multiple tools, they recreate a familiar pattern under a new name: <strong>federated search</strong>.</p><h2><strong>MCP-based retrieval is federated search in disguise</strong></h2><p>Federated search isn’t new. Enterprises have tried it for decades.</p><p>The model is simple:</p><ul><li>Send the user’s query to multiple systems in parallel (SharePoint, GitHub, Jira, customer relationship management [CRM]).</li><li>Collect the responses.</li><li>Merge and present the results.</li></ul><p>MCP-driven tool calls follow the same pattern, except that the caller is now an agent instead of a user interface.</p><p>And the same problems resurface.</p><h2><strong>Why federated search breaks down at enterprise scale</strong></h2><ul><li><strong>Latency becomes unpredictable:</strong> A federated query is only as fast as its slowest system. Enterprise systems can have wildly different response times and rate limits, so federated queries tend to be <strong>slow and jittery</strong>. Agents must wait for multiple round trips before reasoning can even begin. 
The result is a laggy experience and unpredictable wait times.</li><li><strong>Relevance is fragmented:</strong> Because each system ranks results on its own, there’s no unified relevance model. Federated search <strong>cannot apply a single ranking or semantic understanding across all content</strong>, so results often seem disjointed or incomplete. Agents may retrieve <em>correct</em> information but not the <em>most useful</em> information.</li><li><strong>Context is shallow and incomplete: </strong>Federated systems typically expose only what’s directly accessible through an API call. They rarely surface:<ul><li>Usage signals, like clicks, dwell time, recency of access, popularity, or authority.</li><li>Relationships between documents across different systems that help correlate insights.</li><li>Organizational knowledge beyond a single silo.<br/><br/>This strips agents of the broader context required for high-quality reasoning.<br/></li></ul></li><li><strong>Limited filtering and features:</strong> In a federated setup, you can only filter on fields that every system supports (the “lowest common denominator”). If one system doesn’t support a particular filter or facet, you lose that functionality entirely. This severely limits rich search features, like date ranges, categories, or tags.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25bb778f4da9a3cb4f0d4e10af66221b8af73900-1376x768.jpg" style="width: 100%;" /><h2><strong>The power of an indexed search</strong></h2><p>Search engines achieve millisecond-level retrieval at massive scale by using specialized data structures, including inverted indexes for lexical search and approximate nearest neighbor (ANN) structures, such as HNSW graphs, for vector-based retrieval. The approach is to <strong>crawl or ingest every source into search engines</strong>, creating a central repository of company knowledge. This brings big advantages:</p><ul><li><strong>Speed by design:</strong> Searching an index is lightning fast. 
Queries hit inverted indexes and specialized data structures, avoiding the need to poll each backend system.</li><li><strong>Relevance that compounds over time:</strong> Search engines that support <strong>semantic search </strong>are capable of comprehending the intent, and machine learning models can rerank results for enterprise contexts. In one Elastic <a href="https://www.elastic.co/blog/elastic-generative-ai-experiences?">experiment</a>, Elastic users see more accurate results when combining vector search with a question-answering (QA) model to extract answers. It gives better precision than keyword matching.</li><li><strong>Advanced features:</strong> Elastic’s <a href="https://www.elastic.co/search-labs/blog/rag-graph-traversal#:~:text=Retrieval,for%20deeper%2C%20more%20contextual%20retrieval">Graph retrieval augmented generation (RAG) solution</a> shows how structuring an index as a knowledge graph can power more contextual retrieval. In other words, indexes aren’t just backward-looking dumps of text; they can also encode relationships and ontologies that let AI connect the dots across documents.</li><li><strong>Permission-aware search:</strong> Enterprise AI cannot compromise on security. Indexed search allows:<ul><li><a href="https://www.elastic.co/docs/reference/search-connectors/document-level-security">Document-level security.</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/user-roles#roles">Role-based access control.</a></li><li><a href="https://www.elastic.co/search-labs/blog/rag-and-rbac-integration">Permission-aware retrieval for RAG and agents.</a></li></ul></li></ul><p>Agents see only what users are allowed to see, without leaking data into model prompts or training. 
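To make the permission point concrete, document-level security can be expressed as a filter attached to a role, so every search that role runs is silently constrained. The sketch below is illustrative — the role, index, and field names are hypothetical — but the shape follows the Elasticsearch create-role security API:

```python
# Hypothetical role definition with a document-level security (DLS) filter.
# The role/index/field names are assumptions for illustration; the structure
# mirrors the body of PUT _security/role/<role_name>.
SUPPORT_READER_ROLE = {
    "indices": [
        {
            "names": ["customer-interactions"],   # indices the role may read
            "privileges": ["read"],
            # DLS: this query is combined with every search the role executes,
            # so users only ever see documents matching it.
            "query": {"term": {"department": "support"}},
        }
    ]
}
```

An agent acting on behalf of a user inherits these constraints at retrieval time, which is why the filtering never has to happen inside the prompt.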
Elasticsearch is suitable for the indexed search layer in the diagram below, as it provides the essential components for context engineering.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8785e7806616273d086a90b3540273fb26d045ae-1392x768.jpg" style="width: 100%;" /><h2><strong>Retrieval consistency through search templates and governed execution</strong></h2><p>At scale, retrieval must be predictable, secure, and repeatable. This is where <a href="https://www.elastic.co/docs/solutions/search/search-templates">search templates</a> become critical.</p><p>Search templates act as retrieval contracts between applications, agents, and the search platform. Instead of dynamically constructing queries at runtime, agents invoke pre-defined retrieval patterns that enforce:</p><ul><li>Consistent relevance logic</li><li>Mandatory security filters</li><li>Cost and latency guardrails</li><li>Business-specific ranking rules</li><li>Explicit index and field scope boundaries</li></ul><p>In MCP-driven architectures, this becomes even more important. Agents should not dynamically invent retrieval strategies. Instead, MCP tool calls can map directly to approved search templates, ensuring that every retrieval request adheres to enterprise relevance and governance standards.</p><p>This approach shifts retrieval from ad-hoc query execution to controlled retrieval orchestration.</p><h2><strong>Retrieval is now a multi-layer engineering discipline</strong></h2><p>Modern enterprise retrieval is no longer a simple query-to-index operation. 
It typically includes multiple coordinated layers:</p><ul><li>Query understanding — rewriting, expansion, entity resolution</li><li>Retrieval strategy selection — hybrid search, vector search, graph retrieval, or synthetic query techniques such as Hypothetical Document Embeddings (HyDE), where the system generates a representative answer or expanded context first and retrieves documents using that richer semantic signal.</li><li>Execution governance — templates, security enforcement, and performance guardrails</li><li>Ranking and re-ranking — blending lexical precision, semantic similarity, and interaction-derived relevance signals such as click-through patterns, dwell time, and document usage frequency.</li></ul><p>When these layers are implemented upstream, agents receive clean, high-confidence context rather than raw, fragmented data.</p><p>This is what makes large-scale agent systems reliable in production environments.</p><h2><strong>Advanced retrieval techniques improve context quality before reasoning begins</strong></h2><p>Modern retrieval systems increasingly use AI-assisted techniques to improve recall and semantic coverage before ranking is applied.</p><p>One example is <a href="https://medium.com/@nirdiamant21/hyde-exploring-hypothetical-document-embeddings-for-ai-retrieval-cc5e5ac085a6">Hypothetical Document Embeddings (HyDE)</a>. 
Instead of embedding only the original query, the system first generates a hypothetical answer or expanded context, embeds that representation, and retrieves documents based on that richer semantic signal.</p><p>This is particularly useful in enterprise environments where:</p><ul><li>Users or agents may not know the exact terminology</li><li>Knowledge is distributed across silos</li><li>Important context is implied rather than explicitly stated</li></ul><p>Techniques like HyDE improve the probability that relevant documents are retrieved even when the original query is underspecified.</p><p>This reinforces a key principle of enterprise AI: better context retrieval produces better reasoning outcomes.</p><h2><strong>Agents aren’t data engineers; they’re reasoning systems</strong></h2><p>They shouldn’t be responsible for stitching together raw data, reconciling schemas, or compensating for poor retrieval.</p><p>This is where a search platform such as <strong>Elasticsearch</strong> becomes foundational.</p><p>By ingesting data once and normalizing it upstream (through pipelines, mappings, enrichment processors, and prebuilt indexes), Elasticsearch resolves schema mismatches, joins signals across sources, and materializes retrieval-ready views of the data. At query time, the agent receives clean, ranked, semantically enriched results rather than fragmented raw records.</p><p>For example, instead of an agent pulling independently from CRM, ticketing, and documentation systems and attempting to reconcile customer IDs, timestamps, and formats in real time, Elasticsearch can pre-index these sources into a unified customer interaction index with hybrid (keyword + vector) search and relevance ranking. 
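Such a unified index can be served by one hybrid request that combines lexical and vector retrieval. The sketch below is an assumption-laden illustration — the index name, field names, and embedding size are hypothetical — but the request shape (a top-level `query` alongside a top-level `knn` clause) matches the Elasticsearch `_search` API:

```python
# Hypothetical hybrid (keyword + vector) request body for a unified
# "customer-interactions" index; field names are illustrative.
def hybrid_query(text, embedding, k=10):
    return {
        "query": {  # lexical side: keyword match on the text field
            "match": {"content": {"query": text}}
        },
        "knn": {    # semantic side: ANN search on the embedding field
            "field": "content_vector",
            "query_vector": embedding,
            "k": k,
            "num_candidates": 5 * k,  # candidate pool per shard before top-k
        },
        "size": k,
    }

body = hybrid_query("token refresh failures after SSO login", [0.1] * 384)
# The body would then be sent with a client call such as
# es.search(index="customer-interactions", body=body).
```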
The agent then queries a single, coherent interface and immediately reasons over the most relevant context.</p><p>This separation of concerns, that is, <strong>Elasticsearch handling data integration and retrieval, and agents focusing on reasoning, planning, and decision-making</strong>, is what makes agent systems scalable, reliable, and production-ready.</p><h2><strong>Elastic’s role in the AI stack</strong></h2><p>Elastic sits at the intersection of search and AI by design.</p><ul><li><strong>Connectors and crawlers</strong> ingest data continuously from enterprise systems.</li><li><strong>Semantic and vector search</strong> enable intent-based retrieval.</li><li><strong>Hybrid search</strong> blends lexical precision with semantic understanding.</li><li><strong>RAG workflows</strong> ground LLMs in authoritative, permission-aware data.</li></ul><p>Elastic does not compete with agents or MCP. It <strong>makes them effective</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/193da239544ce858416db845f9fc34c7c0e9b6f9-1920x1080.png" style="width: 100%;" /><h2><strong>Bigger models don’t eliminate retrieval</strong></h2><p>Some have wondered whether huge new LLMs can bypass traditional search, perhaps by letting the model read <em>everything</em> in one go. Large context windows feel powerful, but they introduce:</p><ul><li>Higher latency.</li><li>Higher cost.</li><li>Lower precision due to noise.</li><li>A higher propensity for confusion, context clash, and context poisoning.</li></ul><p>RAG wins because it filters first and then reasons. In another <a href="https://www.elastic.co/search-labs/blog/rag-vs-long-context-model-llm#:~:text=,context%20approach%20led%20to%20inaccuracies">Elastic Search Labs experiment</a>, RAG achieved answers in about <strong>1 second</strong>, versus 45 seconds for the long-context approach, at <strong>1/1250th</strong> the cost, and with far higher accuracy. 
In other words, giving an LLM a million tokens of documents is slower, more expensive, and actually <em>less precise</em> than filtering through an index first.</p><h2><strong>Conclusion: MCP changes the interface, not the fundamentals</strong></h2><p>MCP is a meaningful step forward in how agents interact with tools. But it doesn’t replace the need for fast, relevant, governed retrieval.</p><p>In enterprise AI:</p><ul><li>Context quality determines answer quality.</li><li>Indexes create that context.</li><li>Search is the foundation, not the legacy.</li></ul><p>Indexes aren’t obsolete in the era of MCP. They’re <strong>the reason that MCP-based agents can work at all</strong>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch, part 3: Optimizing LLM integration with function calling]]></title>
            <description><![CDATA[Learn how function calling enhances LLM integration, enabling a reliable and cost-efficient entity resolution pipeline in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b2afa90c1b863c716008f3f5bbdd2866fa1c3577-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">part 1</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search">part 2</a> of this series, we built a complete entity resolution pipeline that included preparing entities with context and indexing them for semantic search, extracting entities from articles using hybrid named entity recognition (NER), and matching entities using semantic search and large language model (LLM) judgment. The results were promising, but JSON parsing errors significantly lowered measured accuracy by causing otherwise valid judgments to be discarded. The system wasn’t failing because it made bad judgments; it was failing because it couldn’t reliably express them.</p><p>The root of this problem was our somewhat naive choice to use prompt-based JSON generation in which the LLM generates JSON responses in text format. If we asked the LLM to judge more than a couple of matches at a time, the generated JSON was often ill-formed. To mitigate this, we were forced to reduce the processing batch size, which simply won&#x27;t scale in a production system.</p><p>So the prompt-based JSON generation helped validate our approach to entity resolution, but we need a more systematic and reliable method. OpenAI function calling provides a better path by guaranteeing structure and type safety while reducing errors and costs. We chose OpenAI&#x27;s functions for the educational prototype, but other LLM providers typically provide similar functionality (for example, Claude tools).</p><p><strong>Note:</strong> While we discuss production challenges here, this is still an educational prototype demonstrating optimization techniques. 
Real production systems would need additional considerations, like monitoring, alerting, fallback strategies, and comprehensive error handling.</p><h2>Key concepts: Function calling, schema design, and cost benefits</h2><p><strong>What is function calling?</strong> <em>Function calling</em> is OpenAI&#x27;s structured output API. With it, we can define schemas for LLM responses, so we always know exactly what we&#x27;re going to get. By enforcing the JSON format rather than trying to define it in the LLM prompt, we should be able to eliminate parsing errors.</p><p><strong>Why is it better than prompt-based JSON?</strong> LLMs generate nondeterministic output. One hopes that they&#x27;ll at least generate content that contains the correct response, but the presentation of that response is unpredictable. With a chatbot, this is often not a problem, but our prototype is trying to programmatically process the output. Computer programs demand consistency, so when the LLM generates what we expect, everything is fine, but as soon as it goes off script, so to speak, the code errors out. We could try to account for the different possibilities, but it would be very difficult to catch everything. We could try to enforce more consistent behavior by adding something like &quot;Always return parsable JSON&quot;. We tried this exact technique in the prototype&#x27;s prompt, but we&#x27;ve seen that prompt-based JSON still goes off the rails pretty quickly, particularly if we try to process a batch of matches.</p><p>Function calling makes the LLM generation controllable and predictable, exactly what we need for entity resolution. To aid in the definition of the functions, we’ll also follow minimal schema design principles.</p><p><strong>What are minimal schema design principles?</strong> <em>Minimal schema design</em> means defining only the fields you need, using simple types, and avoiding nested structures when possible. 
This reduces token usage (smaller schemas mean fewer tokens), improves reliability (simpler schemas are easier for the LLM to follow), and lowers costs (fewer tokens mean lower API costs).</p><p><strong>What are the cost and reliability benefits?</strong> Since fewer errors mean that match processing is much more likely to succeed, even with large batch sizes, we don&#x27;t have to retry judging matches. The elimination of retries reduces costs by reducing token usage, but using minimal schemas also keeps our token count down. This all leads to a less expensive and more reliable approach that’s far better suited to production use.</p><p>We need to check one more thing, though. While matches may be getting processed without error, are the errorless results actually correct? How does this new approach compare to the promising results we saw with the prompt-based approach?</p><h2>Real-world results: Side-by-side comparison</h2><p>As we did in the previous blog, we ran the function calling approach against the tier 4 dataset, which consists of 206 expected matches across 69 articles. The results demonstrate a dramatic improvement:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt-based</td><td style="border: 1px solid #ccc; padding: 4px;">Function calling</td><td style="border: 1px solid #ccc; padding: 4px;">Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">30.2%</td><td style="border: 1px solid #ccc; padding: 4px;">0.0%</td><td style="border: 1px solid #ccc; padding: 4px;">100% elimination</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Precision</td><td style="border: 1px solid #ccc; padding: 4px;">83.8%</td><td style="border: 1px solid #ccc; padding: 4px;">90.3%</td><td style="border: 1px solid #ccc; padding: 4px;">+6.5pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">62.6%</td><td style="border: 1px solid #ccc; padding: 4px;">90.8%</td><td style="border: 1px solid #ccc; padding: 4px;">+28.2pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">F1 score</td><td style="border: 1px solid #ccc; padding: 4px;">71.7%</td><td style="border: 1px solid #ccc; padding: 4px;">90.6%</td><td style="border: 1px solid #ccc; padding: 4px;">+18.9pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Acceptance rate</td><td style="border: 1px solid #ccc; padding: 4px;">44.8%</td><td style="border: 1px solid #ccc; padding: 4px;">60.2%</td><td style="border: 1px solid #ccc; padding: 4px;">+15.4pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">True positives</td><td style="border: 1px solid #ccc; padding: 4px;">129</td><td style="border: 1px solid #ccc; padding: 4px;">187</td><td style="border: 1px solid #ccc; padding: 4px;">+45.0%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">False negatives</td><td style="border: 1px solid #ccc; padding: 4px;">77</td><td style="border: 1px solid #ccc; padding: 4px;">19</td><td style="border: 1px solid #ccc; padding: 4px;">-75.3%</td>
                    </tr>
              </tbody>
            </table>
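As a concrete illustration of the approach compared above, here is a minimal sketch of an entity-match judgment tool defined with OpenAI-style function calling. The tool name, schema fields, and the `judge_matches` helper are hypothetical (they are not taken from the notebook); the `tools` structure follows OpenAI's chat completions API.

```python
import json

# Minimal, flat schema: only the fields we need, simple types, shallow nesting.
# Field names here are illustrative, not the notebook's actual schema.
ENTITY_MATCH_TOOL = {
    "type": "function",
    "function": {
        "name": "record_match_judgments",
        "description": "Record a judgment for each candidate entity match.",
        "parameters": {
            "type": "object",
            "properties": {
                "judgments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "candidate_id": {"type": "string"},
                            "is_match": {"type": "boolean"},
                            "confidence": {"type": "number"},
                        },
                        "required": ["candidate_id", "is_match", "confidence"],
                    },
                }
            },
            "required": ["judgments"],
        },
    },
}


def judge_matches(client, model, prompt):
    """Ask the LLM to judge a batch of matches via a forced tool call."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=[ENTITY_MATCH_TOOL],
        # Forcing this tool means the reply arrives as structured arguments,
        # not free-form text that may or may not parse.
        tool_choice={"type": "function",
                     "function": {"name": "record_match_judgments"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)["judgments"]
```

Unlike prompt-based JSON generation, the arguments string is constrained to the declared schema, so batch size no longer drives the parse-error rate.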
          <h3>Error elimination: The key differentiator</h3><p>The most striking difference is the <strong>complete elimination of JSON parsing errors</strong>. This resulted in a modest precision improvement and a far more dramatic recall improvement. The precision metric captures how often the matches the system accepts were expected in the golden document. So the prototype was decent at judging matches correctly in the prompt-based approach, but function calling does that even better.</p><p></p><p>Conversely, recall tells us how many of the expected matches were found. When a batch of matches comes back with malformed JSON, the system loses all of those matches. It&#x27;s likely that Elasticsearch sends many of these matches for judgment, but we lose those matches if judgment fails. The significant recall improvement shows that this hypothesis is correct. Elasticsearch identifies the potential matches and function calling verifies which of those matches are correct.</p><p></p><p><strong>Note:</strong> It’s expected that Elasticsearch will find some incorrect matches because we look at the top two or three results from hybrid search. Most of the time, hybrid search returns the correct match as the top result, but having the LLM judge the top few hits ensures that we see how the LLM handles incorrect matches. If we move from the educational prototype to a production system, we’ll likely tune the Elasticsearch queries more carefully so that we only send promising matches to the LLM, further optimizing our LLM costs.</p><h2>What&#x27;s next: The ultimate challenge</h2><p>Now that we&#x27;ve optimized our LLM integration with function calling, we have a complete entity resolution pipeline with improved reliability and cost efficiency. However, can it handle the ultimate challenge? 
In the next post, we&#x27;ll explore how the system handles diverse entity resolution scenarios across 50 different challenge types, including cultural naming conventions, business relationships, titles, and multilingual variations.</p><h2>Try it yourself</h2><p>Want to see function calling optimization in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=5%20minutes%20ago-,04_function_calling_optimization_v3.ipynb,-Initial%20public%20lab">Function Calling Optimization notebook</a> for a complete walkthrough with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to use function calling for structured output, compare it with prompt-based JSON, and analyze cost and reliability benefits.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach optimization concepts. When building production systems, consider additional factors, like multi-provider support, advanced caching strategies, monitoring and alerting, comprehensive error handling, and compliance requirements that aren&#x27;t covered in this learning-focused prototype.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using subagents and Elastic Agent Builder to bring business context into code planning]]></title>
            <description><![CDATA[Learn about subagents, how to ensure they have the right information, and how to create a specialized subagent that connects Claude Code to your Elasticsearch data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/subagents-with-elastic-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/subagents-with-elastic-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bfeb78b687bd930371364ee7dd0341ae90004349-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://code.claude.com/docs/en/sub-agents">Subagents in Claude Code</a> let you offload specialized tasks to separate context windows, keeping your main conversation focused. In this article, you&#x27;ll learn what subagents are, when to use them, and how to build a retrieval subagent using Elastic Agent Builder that connects your development workflow to business data in Elasticsearch.</p><h2>What are subagents?</h2><p><em>Subagents </em>are specialized assistants that can be called to execute a specific task, using their own context window. They complete a task and give the results to the main agent, preventing it from saving information that isn’t relevant for the rest of the conversation in the context window.</p><p>Their four core principles are:</p><ul><li><strong>Context preservation:</strong> Each subagent uses its own context window.</li><li><strong>Specialized expertise:</strong> Each subagent is designed for a specific task.</li><li><strong>Reusability:</strong> You can reuse a subagent in different sessions and projects.</li><li><strong>Flexible access:</strong> You can limit the subagent access to specific tools.</li></ul><p>Each subagent can have access to Claude Code tools to work with the terminal, such as glob, read, write, grep, or bash, or to access the internet, like search, fetch, or call external tools with Model Context Protocol (MCP) servers.</p><p>A subagent uses the following schema:</p><pre><code>---
name: your-sub-agent-name
description: Description of when this subagent should be invoked
tools: tool1, tool2, tool3  # Optional - inherits all tools if omitted
model: sonnet  # Optional - specify model alias or 'inherit'
permissionMode: default  # Optional - permission mode for the subagent
skills: skill1, skill2  # Optional - skills to auto-load
---

Your subagent's system prompt goes here. This can be multiple paragraphs
and should clearly define the subagent's role, capabilities, and approach
to solve problems.

Include specific instructions, best practices, and any constraints
the subagent should follow.</code></pre><p>You can call subagents implicitly by talking about the task they run, and Claude will call them automatically. For example, you can say, &quot;I want to plan my new functionality.&quot;</p><p>You can also call them explicitly by directly asking Claude Code to use a subagent and telling it, &quot;Use the planning subagent to plan my new functionality.&quot;</p><p>Another important feature is that subagents are stateful, so once you give one a task, it will generate an ID. This way, when you use it again, you can start from scratch or provide the ID to give it context from its previous tasks.</p><p>You can read the <a href="https://code.claude.com/docs/en/sub-agents">full documentation here</a>.</p><h2>When are subagents used?</h2><p>Subagents are useful when you need to delegate tasks that require specialized context but you don&#x27;t want to clutter the main chat window. Considering our example of coding, the most common subtasks include:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Subtask type</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td><td style="border: 1px solid #ccc; padding: 4px;">Typical tools</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Exploration / research</td><td style="border: 1px solid #ccc; padding: 4px;">Searching and analyzing code without modifying it.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Planning</td><td style="border: 1px solid #ccc; padding: 4px;">Running deep analysis to create implementation plans.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob, bash</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Code review</td><td style="border: 1px solid #ccc; padding: 4px;">Reviewing quality, safety, and best practices.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob, bash</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Code modification</td><td style="border: 1px solid #ccc; padding: 4px;">Writing and editing code.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, edit, write, grep, glob</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Testing / debugging</td><td style="border: 1px solid #ccc; padding: 4px;">Running tests and analyzing issues.</td><td style="border: 1px solid #ccc; padding: 4px;">Bash, read, grep, edit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Getting information from external sources (APIs, databases).</td><td style="border: 1px solid #ccc; padding: 4px;">MCP tools, bash</td>
                    </tr>
              </tbody>
            </table>
<p>Claude Code includes three built-in agents that showcase these use cases:</p><ul><li><strong>Explore:</strong> A quick agent for read-only search in the codebase. It&#x27;s great for answering questions like, &quot;Where are the client&#x27;s errors handled?&quot;</li><li><strong>Plan:</strong> Research agent that activates in plan mode to analyze the codebase before proposing changes.</li><li><strong>General-purpose:</strong> The most capable agent for complex tasks that require multiple steps and can include modifications.</li></ul><h2>Context management: Ensuring subagents have the right information</h2><p>One of the most important decisions when designing subagents is how to handle context. There are three key considerations:</p><h3><strong>1. Which context the subagent should get</strong></h3><p>The prompt you give to the subagent must contain all of the necessary information to complete the task since the subagent doesn’t have access to the main chat. You need to be specific:</p><ul><li>Do NOT say, &quot;Review the code.&quot;</li><li>SAY, &quot;Review the changes to src/auth/index.ts, focusing on JWT token validation.&quot;</li></ul><p>Providing the exact file name makes the difference between reading the file directly with the read tool and running a wide grep search that wastes time and tokens.</p><p>Also consider what not to include. Irrelevant context can distract the subagent or bias results. It’s tempting to ask for multiple things in one pass, but focused tasks yield better results:</p><ul><li>Do NOT say, “Review src/auth/index.ts. Here is also the database schema and our API docs for reference, fix bugs and suggest improvements about the architecture decisions.”</li><li>SAY, “Fix the token refresh bug in src/auth/index.ts that&#x27;s throwing AUTH_TOKEN_EXPIRED unexpectedly.”</li></ul><h3><strong>2. What tools to provide</strong></h3><p>Limit the tools to what’s strictly necessary. 
This improves security, keeps the subagent focused, and reduces unnecessary tool calls and execution costs.</p><pre><code># For just an analysis agent
tools: Read, Grep, Glob
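
# (Illustrative addition, not from the official docs) For a research agent
# that may also need to look up documentation on the web
tools: Read, Grep, Glob, WebFetch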

# For an agent that needs to modify the code
tools: Read, Edit, Write, Grep, Glob</code></pre><p>If you don&#x27;t specify a tools field, the subagent inherits all tools from the main agent, including MCP tools.</p><p>You can learn about all Claude Code tools <a href="https://code.claude.com/docs/en/how-claude-code-works#tools">here</a>.</p><h3><strong>3. How to keep context between calls</strong></h3><p>Subagents can be resumed using their agentId:</p><pre><code># First call
> Use the code-analyzer agent to review the authentication module
[Agent completes the analysis and returns agentId: "abc123"]

# Continue with previous context
> Resume agent abc123 and now analyze the authorization module
[Agent continues with the context from the previous chat]</code></pre><p></p><p>You can ask Claude for the agent ID or find it in <code>~/.claude/projects/{project}/{sessionId}/subagents/</code>.</p><p>This is especially useful for long research tasks or multistep workflows.</p><p>Another way to keep context consistent is to ask the agent to write a Markdown checklist with what it&#x27;s doing and its current progress. Then you can execute <code>/clear</code> without losing the initial instruction. In that request, you can define the task granularity and the level of detail to retain that makes sense for your use case.</p><pre><code># Task: Review authentication module

## Progress
- [x] Analyzed src/auth/index.ts
- [x] Found JWT validation issue
- [ ] Review authorization module
- [ ] Check rate limiting

## Findings
- Token refresh has race condition in line 42</code></pre><p>After you clear the conversation, the next agent can pick it up from here. This is very useful when you want an agent to run a script over a list and watch the output record by record.</p><h2>Orchestration patterns</h2><p>It’s important to see subagents as a context optimization mechanism. The way in which you coordinate them determines the efficiency of the whole system. There are different orchestration patterns.</p><h3><strong>Sequential (chaining)</strong></h3><p>Here, a subagent completes a task, and its results feed the next one in a sequence of tasks, similar to traditional Linux piping.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74a3a376600cd1b7cdd2dddddfed2f00ab131eed-896x94.png" style="width: 100%;" /><p>Call example:</p><pre><code>> First use the planning agent to design the feature,
> then use the coding agent to implement it,
> finally use the reviewer agent to check the code</code></pre><h3><strong>Parallel</strong></h3><p>In this pattern, multiple subagents run independent tasks simultaneously. The main Claude Code agent invokes them since <strong>subagents cannot spawn other subagents</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/84eca68d29bf79cf978a8089d3c18972738cd2c1-595x272.png" style="width: 100%;" /><p>This approach reduces overall execution time for tasks like code review, since several agents can examine the same code from different angles at once.</p><h3><strong>Hub-and-spoke (delegation)</strong></h3><p>In this approach, the main agent acts as an orchestrator, delegates tasks to specialized agents, and then consolidates the results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/93bb2cc55c435f509b426fbcc090a67c53021684-595x272.png" style="width: 100%;" /><p>This is the pattern we’ll implement in our example: The main Claude Code agent will delegate the gathering of business information to a retrieval agent built with Elastic Agent Builder, while the explore agent looks into local files and the planning agent builds a plan.</p><h2>Why use an agent instead of a single query?</h2><p>Before building our retrieval subagent, it&#x27;s worth understanding when an agent adds value versus when a simple Elasticsearch Query Language (ES|QL) query suffices.</p><p>If you need a single aggregation, like &quot;What&#x27;s our most visited page?&quot; just run the query directly. 
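</p><p>For example (with hypothetical index and field names), that could be a single ES|QL query:</p><pre><code>FROM web_logs
| STATS visits = COUNT(*) BY page_url
| SORT visits DESC
| LIMIT 1</code></pre><p>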
The agent adds value when your question requires:</p><ul><li><strong>Multiple queries that build on each other:</strong> The answer from query 1 informs query 2.</li><li><strong>Cross-index reasoning:</strong> Correlating data from different sources.</li><li><strong>Ambiguity resolution:</strong> The agent interprets and follows leads.</li><li><strong>Synthesis:</strong> Combining quantitative data with qualitative knowledge.</li></ul><p>Our example will demonstrate all of these capabilities.</p><h2>Agent Builder as subagent</h2><p>Generating code with AI is quick; the hard part is having a good planning phase that sets the boundaries for our coding agent. To help with that, Claude Code provides a subagent that <a href="https://code.claude.com/docs/en/common-workflows#use-plan-mode-for-safe-code-analysis">specializes in planning</a>, performing deep analysis and creating a to-do list for the main agent to execute.</p><p>With this flow, you can plan based on what Claude Code can see both in local files and on the internet. However, there&#x27;s still knowledge available in Elasticsearch that you cannot access via standard tools.</p><p>To access our internal knowledge during the planning phase, we&#x27;ll create a Claude Code subagent by making a retrieval agent using Agent Builder.</p><p>You can configure the agent using the UI or an API. In this example, we&#x27;ll use the latter.</p><h3><strong>Prerequisites</strong></h3><ul><li><a href="https://code.claude.com/docs/en/setup">Claude Code</a> 2.0.76+</li><li>Elasticsearch 9.2</li><li>Elasticsearch <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">API key</a></li></ul><h3><strong>The scenario: Technical debt sprint planning</strong></h3><p>You&#x27;re a tech lead. You have two weeks and two developers. Your <code>TECH_DEBT.md</code> lists 12 items. You can tackle maybe three or four. 
Which ones should you prioritize?</p><p>The complexity is that you need to optimize across multiple dimensions simultaneously:</p><ul><li><strong>User impact:</strong> How many users hit this issue?</li><li><strong>Business impact:</strong> Does it affect paying customers? Enterprise tier?</li><li><strong>Severity:</strong> Errors? Performance? Just ugly code?</li><li><strong>Effort:</strong> Quick win or rabbit hole?</li><li><strong>Dependencies:</strong> Does fixing A unlock fixing B?</li><li><strong>Strategic alignment:</strong> Does it align with Q1 priorities?</li></ul><p>A single query like, &quot;What&#x27;s the most important tech debt item?&quot; fails because this requires:</p><ol><li>Reading <code>TECH_DEBT.md</code> to understand what the 12 items even are.</li><li>For EACH item, querying <code>error_logs</code> to get error frequency.</li><li>Cross-referencing with <code>customer_data</code> to see tier breakdown.</li><li>Checking <code>support_tickets</code> to see complaint volume.</li><li>Reading <code>engineering_standards</code> in the knowledge base to see whether any items violate core principles.</li><li>Reading <code>Q1_roadmap</code> to check strategic alignment.</li><li>Synthesizing all of this into a prioritized recommendation.</li></ol><p>This is where a retrieval agent can be helpful in orchestrating multiple queries across different indices and synthesizing the results.</p><h2>Steps</h2><h3><strong>Preparing the test dataset</strong></h3><p>We&#x27;ll create four indices: a knowledge base with internal documentation, error logs, support tickets, and customer data.</p><p>You can create the indices, index the data, and create the agent using one of the following:</p><ul><li><strong>Kibana Dev Tools:</strong> Using the Elasticsearch requests provided below.</li><li><strong>Jupyter Notebook:</strong> Using the <a 
href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/notebook.ipynb">complete notebook</a> written for this article.</li></ul><h2>Create the indices</h2><p>Open Kibana Dev Tools and run the following requests to create each index with its mapping and bulk data. Here&#x27;s an example showing the <code>customer_data</code> index structure and the data to be indexed:</p><pre><code>PUT customer_data
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "customer_tier": { "type": "keyword" },
      "company_name": { "type": "text" },
      "mrr": { "type": "float" },
      "joined_at": { "type": "date" }
    }
  }
}
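
# Optional sanity check (our addition, not part of the original setup):
# confirm the field types above were applied
GET customer_data/_mapping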

POST customer_data/_bulk
{"index":{}}
{"user_id":"enterprise_user_01","customer_tier":"enterprise","company_name":"Acme Corp","mrr":2500.00,"joined_at":"2023-01-15"}
{"index":{}}
{"user_id":"enterprise_user_02","customer_tier":"enterprise","company_name":"GlobalTech Inc","mrr":4200.00,"joined_at":"2022-08-20"}
{"index":{}}
{"user_id":"enterprise_user_05","customer_tier":"enterprise","company_name":"DataFlow Systems","mrr":3100.00,"joined_at":"2023-06-01"}
{"index":{}}
{"user_id":"user_001","customer_tier":"free","company_name":"","mrr":0,"joined_at":"2024-03-15"}
{"index":{}}
{"user_id":"user_002","customer_tier":"free","company_name":"","mrr":0,"joined_at":"2024-05-20"}
{"index":{}}
{"user_id":"user_045","customer_tier":"pro","company_name":"SmallBiz LLC","mrr":49.00,"joined_at":"2024-01-10"}
{"index":{}}
{"user_id":"user_089","customer_tier":"pro","company_name":"StartupXYZ","mrr":49.00,"joined_at":"2024-02-28"}</code></pre><p>Full requests for all indices:</p><ul><li><strong>Knowledge index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/knowledge.txt">knowledge.txt</a></li><li><strong>Error logs index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/error_logs.txt">error_logs.txt</a></li><li><strong>Support tickets index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/support_tickets.txt">support_tickets.txt</a></li><li><strong>Customer data index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/customer_data.txt">customer_data.txt</a></li></ul><p>The raw JSON files with the dataset are also available:</p><ul><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/knowledge.json">knowledge.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/error_logs.json">error_logs.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/support_tickets.json">support_tickets.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/customer_data.json">customer_data.json</a></li></ul><h2>Local project files</h2><p>Create the following Markdown (MD) files in your project. 
These files look like this:</p><pre><code># Tech Debt Items

## AUTH-001: Token refresh race condition
- **Module**: src/auth/refresh.ts
- **Symptom**: Users randomly logged out
- **Estimate**: 3 days

## EXPORT-002: CSV export timeout on large datasets
- **Module**: src/export/csv.ts
- **Symptom**: Timeout after 30s for >10k rows
- **Estimate**: 2 days

...</code></pre><p>Full files:</p><p></p><ul><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/TECH_DEBT.md">TECH_DEBT.md</a>: Tech debt items list.</li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/REQUIREMENTS.md">REQUIREMENTS.md</a>: FlowDesk Q1 2025 requirements.</li></ul><p>This ties directly to the tech debt items and gives the agent clear priorities to work with when cross-referencing with the Elasticsearch data.</p><h2>Create an agent with Agent Builder</h2><p>We&#x27;ll now create an agent capable of running analytics queries with ES|QL to give us app usage information, while also being able to search the Knowledge Base (KB) for information stored as unstructured text.</p><p>We&#x27;re using the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools#built-in-tools">built-in tools</a> since they cover search and analytics on any index. Agent Builder also supports custom tools for more specialized operations, like scoping an index or adding ES|QL dynamic parameters, but that&#x27;s beyond our scope here.</p><p>You can create the agent using the curl request in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/create_agent.txt">create_agent.txt</a>.</p><pre><code>curl -X POST "https://${KIBANA_URL}/api/agent_builder/agents" \
  -H "Authorization: ApiKey ${API_KEY}" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "tech-debt-advisor",
    "name": "Tech Debt Prioritization Agent",
    "description": "I help prioritize technical debt by analyzing error logs, support tickets, customer impact, and aligning with engineering standards and roadmap priorities.",
    "avatar_color": "#BFDBFF",
    "avatar_symbol": "TD",
    "configuration": {
      "instructions": "This agent helps prioritize technical debt items. Use the following indices:\n\n- knowledge: Engineering standards, policies, and roadmap priorities\n- error_logs: Production error frequency by module\n- support_tickets: Customer complaints and their urgency\n- customer_data: Customer tier information (enterprise, pro, free)\n\nWhen analyzing tech debt:\n1. Check error frequency in error_logs\n2. Cross-reference affected users with customer_data to understand tier impact\n3. Count support tickets and note urgency markers\n4. Check knowledge base for relevant policies and Q1 priorities\n5. Synthesize findings into prioritized recommendations",
      "tools": [
        {
          "tool_ids": [
            "platform.core.search",
            "platform.core.list_indices",
            "platform.core.get_index_mapping",
            "platform.core.get_document_by_id",
            "platform.core.execute_esql",
            "platform.core.generate_esql"
          ]
        }
      ]
    }
  }'</code></pre><p>You’ll get this response if everything went OK:</p><pre><code>{
  "id": "tech-debt-advisor",
  "type": "chat",
  "name": "Tech Debt Prioritization Agent",
  "description": "I help prioritize technical debt by analyzing error logs, support tickets, customer impact, and aligning with engineering standards and roadmap priorities.",
  ...
}</code></pre><p>The agent will be available in Kibana, so you can now chat with it if you want:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c655ec6b9b1cc2fa1ab3cc13d289e7b96a543284-815x784.png" style="width: 100%;" /><h3><strong>Configure the agent as Claude Code tool</strong></h3><p>The agent we just created will expose an <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">MCP server.</a> Let&#x27;s add the MCP server to Claude Code using the already-generated API key:</p><pre><code>claude mcp add --transport http agentbuilder https://${KIBANA_URL}/api/agent_builder/mcp --header "Authorization: ApiKey ${API_KEY}"</code></pre><p>We can check the connection status using <code>claude mcp get agentbuilder</code>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba5fbc144f9e29151b8628dffd33dc74b12deece-499x177.png" style="width: 100%;" /><h3><strong>Create a subagent that uses the tool</strong></h3><p></p><p>Now that we have the Agent Builder available as a set of MCP tools, we can create a subagent in Claude Code that will use all or some of those tools, in combination with Claude Code ones.</p><p></p><p>Claude Code recommends using its agent creator tool for this step:</p><p></p><p>1. Type <code>/agents</code> in Claude Code.</p><p>2. Choose <strong>Create new agent</strong>.</p><p>3. Select <strong>Project scope</strong> so that it&#x27;s only available for this project. (This is the recommended setting to avoid agent overflow.)</p><p>4. Select <strong>Generate with Claude (recommended)</strong>.</p><p>5. Type in the description: &quot;Agent that analyzes technical debt by querying Elasticsearch for error logs, support tickets, customer data, and engineering knowledge base. Use this agent when you need to prioritize tech debt items based on business impact.&quot;</p><p>6. 
In “Select tools,” choose <strong>Advanced options</strong> and select the tools we defined during agent creation.</p><pre><code>Individual Tools:
☒ platform.core.search (agentbuilder)
☒ platform.core.list_indices (agentbuilder)
☒ platform.core.get_index_mapping (agentbuilder)
☒ platform.core.get_document_by_id (agentbuilder)
☒ platform.core.execute_esql (agentbuilder)
☒ platform.core.generate_esql (agentbuilder)</code></pre><p>7. Select <strong>[ Continue ]</strong>.</p><p>Now choose the model. For planning tasks, the recommendation is to use Opus due to its strong reasoning capabilities. So let&#x27;s select that and continue.</p><p>Finally, choose the background color for our subagent text and confirm.</p><p>Claude automatically names our subagent based on the description (for example, <code>tech-debt-analyzer</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f01ac4c9f30fbcbed7fc69881aae9ff72c4616a0-869x521.png" style="width: 100%;" /><h2>Testing the agent</h2><p>Once the agent has been created, we can test it with a complex prioritization question that requires multistep reasoning:</p><pre><code>> Based on TECH_DEBT.md, which items should we prioritize for our 2-week sprint?
> Use the tech-debt-analyzer agent to check error frequency, customer impact,
> support ticket volume, and alignment with engineering standards.</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d76d972ab5b07e6d35bdf3036cb5ee3c080c7156-749x239.png" style="width: 100%;" /><p>Watch how the agent orchestrates multiple queries:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c49b56c366576406586ba03f694d2bfb09d30895-875x96.png" style="width: 100%;" /><p>It will then give you a comprehensive analysis of the local files combined with Elasticsearch data:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/084c532b4c9e993e53810738ae1da1fd4af1f025-1228x693.png" style="width: 100%;" /><p>This demonstrates why a single query fails and an agent succeeds: It orchestrates five or more queries across different indices, correlates the data, and synthesizes a recommendation that contradicts the naive &quot;fix highest error count&quot; approach.</p><p>By typing <code>/context</code>, we can see how much context each MCP tool definition and our subagent&#x27;s prompt consume. Keep an eye on this overhead when creating subagents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a6f528a9cae7b7fbf17f1b97e13c51c78c1b8b4-666x391.png" style="width: 100%;" /><h2>Start planning</h2><p>We can now start planning using local files, the internet, and our Elasticsearch knowledge as information sources.</p><p>Ask something like:</p><pre><code>"Based on our requirements defined in REQUIREMENTS.md, use the planning agent
to create a detailed implementation plan, prioritizing tasks according to
business impact. Use the tech-debt-analyzer agent to query about internal
company knowledge and make analytical queries about error patterns and
customer impact."</code></pre><p>Note that Claude decides to run the Elasticsearch data analysis and the local documentation reading in parallel, following the hub-and-spoke orchestration pattern.</p><p>After the analysis, you should get a plan that prioritizes based on actual business data rather than on assumptions. This context will make your AI coding experience much more reliable, as you can feed this plan directly to the agent and execute step by step:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cfa5c6c44425d6e73355116e08082a33699915a3-961x873.png" style="width: 100%;" /><p>The more details you provide and the more focused the instructions are, the better the quality of the plan will be. If you have an existing codebase, it will suggest the code changes.</p><h2>Conclusion</h2><p>Subagents are a great tool to offload specific tasks where we only need the final result for the main chat (without going through how we got there), keeping the chat flow focused.</p><p>By choosing the right orchestration pattern (sequential, parallel, or hub-and-spoke) and handling the context properly, we can build efficient and maintainable agent systems.</p><p>Elastic Agent Builder and its MCP feature allow us to access our data using a retrieval subagent to facilitate planning and coding by combining local (files, source code), external (internet), and internal (Elasticsearch) sources. The key insight is that agents add value not for simple queries but when you need multistep reasoning that builds on previous results and synthesizes information from multiple sources.</p><h2>Resources</h2><ul><li><a href="https://code.claude.com/docs/en/sub-agents">Claude Code Subagents</a></li><li><a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a></li><li><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Agent Builder MCP</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Adaptive early termination for HNSW in Elasticsearch]]></title>
            <description><![CDATA[Introducing a new adaptive early termination strategy for HNSW in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hnsw-elasticsearch-adaptive-early-termination</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hnsw-elasticsearch-adaptive-early-termination</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tommaso Teofili]]></dc:creator>
            <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d3186f609dd56dc5ffe33d70fa9e5cfa05b51f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch uses the <a href="https://www.elastic.co/search-labs/blog/hnsw-graph">Hierarchical Navigable Small World</a> (HNSW) algorithm to perform vector search over a proximity graph. HNSW is known to provide a nice trade-off between the quality of k-nearest neighbor (KNN) results and the associated cost.</p><p>In HNSW, search proceeds by iteratively expanding candidate nodes in the graph, maintaining a bounded set of nearest neighbors discovered so far. Each expansion has a cost (vector operations, random seeks to disk, and more), and the marginal benefit of that cost tends to decrease as the search progresses.</p><p>One way to optimize HNSW graph traversal is to stop searching when the marginal likelihood of finding new true neighbors becomes too low. For this reason, in <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-dense-vector-hnsw-early-termination">Elasticsearch 9.2</a> we introduced a new <a href="https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination">early termination mechanism</a>. This stops the search when visiting graph nodes fails to provide enough new nearest neighbors for a fixed number of consecutive visits.</p><p>This article walks through how we improved on that early termination mechanism in HNSW to make it better suited to different datasets and data distributions.</p><h2><strong>Early termination in HNSW</strong></h2><p>In HNSW, search proceeds by iteratively expanding candidate nodes in the proximity graph, maintaining a bounded set of nearest neighbors discovered so far, until it either has visited the whole graph or meets some early stop criteria.</p><p>Early termination is therefore not necessarily an optimization; it’s <strong>part of the search algorithm itself</strong>. The moment we decide to stop determines the balance between efficiency and recall. 
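</p><p>The candidate-expansion loop can be sketched in pseudocode (a simplification, not the exact Lucene implementation):</p><pre><code>candidates = heap seeded with the graph entry point, ordered by distance to the query
results    = bounded heap holding the best k neighbors found so far

while candidates is not empty:
    c = pop the nearest candidate
    if results is full and c is farther than the worst result:
        stop                      # no remaining candidate can improve the top k
    for each unvisited neighbor n of c:
        mark n as visited and compute its distance to the query
        push n onto candidates; add n to results if it improves them</code></pre><p>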
In Elasticsearch, there are already a number of ways a query on HNSW can early terminate:</p><ul><li>A fixed maximum number of nodes is visited.</li><li>A fixed timeout is reached.</li></ul><p>While simple and predictable, these rules are largely <strong>agnostic to what the search is actually doing</strong>. Moreover, they’re mostly used to make sure that the query finishes in a reasonable time for the end user.</p><p>In a <a href="https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination">previous blog post</a>, we introduced the concept of redundancy in HNSW. In short, redundant computations occur when HNSW continues to evaluate new candidate nodes that don’t result in finding more nearest neighbors.</p><h2><strong>Patience: Measuring progress instead of effort</strong></h2><p>The notion of <em>patience</em> reframes early termination around <strong>progress rather than effort</strong>.</p><p>Instead of asking:</p><p>“How many steps have we taken?”</p><p>The new question becomes:</p><p>“How much computation are we willing to waste before we lose hope?”</p><p>During HNSW search, early exploration typically produces the biggest improvements to the top-k candidate set. In the first steps of the graph exploration, the set of neighbors is continuously updated as the algorithm keeps discovering neighbors nearer and nearer to the query vector. Over time, these improvements become rarer as the search converges. <a href="https://cs.uwaterloo.ca/~jimmylin/publications/Teofili_Lin_ECIR2025.pdf">Patience-based termination</a> monitors this pattern and terminates the search once improvements have ceased for a sustained period.</p><p>In practice, while visiting the HNSW graph we also compute the queue saturation ratio as we hop through candidate nodes. This measures the percentage of nearest neighbors that were left unchanged while visiting the most recent graph node (or the inverse of the number of new neighbors introduced during the last iteration). 
When such a ratio stays too high for too many consecutive iterations, we stop visiting the graph.</p><p>Conceptually, patience treats HNSW search as a <strong>diminishing returns process</strong>. When returns flatten out, continuing to explore the graph yields little benefit.</p><p>This framing is powerful because it ties termination directly to <em>observable outcomes</em> rather than to arbitrary fixed limits.</p><p>The benefit of using this smart early termination technique is that HNSW graph explorations tend to visit a smaller number of graph nodes while retaining almost perfect relative recall.</p><p>To visualize this, we can plot the recall per visited node that we got with patience-based early termination (labeled as <em><code>et=static</code></em>) compared to the default HNSW behavior (labeled as <em><code>et=no</code></em>) on two datasets (FinancialQA and Quora) and two models (JinaV3 and E5-small).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a9d07c5153ea64a2426c82487c36846030692bb9-1600x945.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dac56125c4b16d1b596c9876b6ca9ac7b2dc87fa-1600x944.png" style="width: 100%;" /><h2><strong>Static thresholds and HNSW dynamics</strong></h2><p>In practice, in Elasticsearch this is implemented using <strong>static thresholds</strong>. One is the <strong>saturation threshold</strong>: the ratio of saturation that we consider suboptimal. The other is the <strong>patience threshold</strong>: the number of consecutive graph nodes that we allow to be visited while the queue saturation remains suboptimal.</p><p>When we introduced this early termination strategy in Elasticsearch 9.2, we decided to opt for conservative defaults, so as to preserve recall as much as possible while still gaining in terms of latency and memory consumption. 
For this reason, we set the saturation threshold to 100% and the patience threshold to a (bounded) 30% of the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-top-level-parameters:~:text=search%20request%20size.-,num_candidates,-(Optional%2C%20integer)%20The"><em><code>num_candidates</code></em></a> parameter in the KNN query.</p><p>In many scenarios, these settings turned out to work nicely; however, two queries requesting the same number of neighbors might have radically different convergence behaviors. Some queries encounter dense local neighborhoods and saturate quickly; others must traverse long, sparse paths before finding competitive candidates. The latter turned out to be the most difficult to handle effectively.</p><p>As a result, we sometimes noticed:</p><ul><li>Over-exploration for easy queries.</li><li>Premature termination for hard queries.</li></ul><p>We realized that fixed threshold values encode global assumptions about convergence, and that we could instead make HNSW adapt to different dynamics.</p><h2><strong>Making HNSW early termination adaptive</strong></h2><p>Adaptive early termination approaches this problem from a different angle. Instead of enforcing predefined stopping thresholds, the algorithm <strong>infers when to stop from the search dynamics themselves</strong>.</p><p>So instead of comparing the queue saturation ratio between two consecutive candidates, we decided to introduce an instant smoothed discovery rate $d_{q,i}$ (how many new neighbors were introduced for a query <em>q</em> in the last visit <em>i</em>) together with the rolling mean $\mu_{q,i}$ and standard deviation $\sigma_{q,i}$ of that discovery rate during the graph visit (using <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford&#x27;s_online_algorithm">Welford’s algorithm</a>). 
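</p><p>As a rough pseudocode sketch (our simplification of the idea, not the exact implementation), the adaptive check works like this:</p><pre><code>for each visited candidate i of query q:
    d = number of new nearest neighbors introduced by visit i
    update running mean mu and standard deviation sigma with d    # Welford
    saturation_threshold = mu + sigma        # adaptive saturation threshold
    patience = base patience, scaled down as sigma grows
    if d is below saturation_threshold:
        saturated_visits += 1
    else:
        saturated_visits = 0
    if saturated_visits exceeds patience:
        stop the graph visit</code></pre><p>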
These statistics about the discovery rate are calculated per query, so that this information can be used to decide a different degree of patience for each query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1958be7ca4425ade66eaf621ada3533173183598-694x118.png" style="width: 100%;" /><p>The previously static thresholds become adaptive to the discovery rate statistics: The saturation threshold becomes the rolling mean plus the standard deviation, while the patience adapts by scaling inversely with the standard deviation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f7ee4a55c24853b657df26052b275e8bd76cf0f9-654x156.png" style="width: 100%;" /><p>The early exit rules remain the same: Saturation happens when the instant discovery rate falls below the adaptive saturation threshold, and the graph visit stops if saturation persists for more consecutive candidate visits than the adaptive patience allows.</p><p>This way, we obtain a behavior that doesn’t depend on the <em><code>num_candidates</code></em> parameter in the KNN query (which might be explicitly set or left at its default, regardless of early exit) and that better adapts to each query and vector distribution dynamically.</p><p>On FinancialQA and Quora, the adaptive strategy (labeled as <em><code>et=adaptive</code></em>) shows a higher recall per visited node than both the static strategy (<em><code>et=static</code></em>) and the default HNSW behavior (<em><code>et=no</code></em>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2a906997d9a25d74c7038bd9661bc97581e7258e-1600x938.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a114911e232c351dbb814cea20e8b0f1415a717-1600x925.png" style="width: 100%;" /><p>Adaptive early termination is turned on by default in Elasticsearch 9.3 for HNSW dense vector fields 
(and it can eventually be turned off via the <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-dense-vector-hnsw-early-termination">same index level setting</a>).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Common Expression Language (CEL): How the CEL input improves data collection in Elastic Agent integrations]]></title>
            <description><![CDATA[Learn how the Common Expression Language differs from other programming languages, how we’ve extended it for Filebeat’s CEL input, and the flexibility it gives you to express data collection logic in Elastic Agent integrations.]]></description>
            <link>https://www.elastic.co/search-labs/blog/common-expression-language-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/common-expression-language-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Berkhout]]></dc:creator>
            <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/985c50bfabee3348494eb4307f0b3375a97a0644-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Agent <a href="https://www.elastic.co/integrations">integrations</a> allow users to ingest data into Elasticsearch from a wide range of sources. They combine collection logic, ingest pipelines, dashboards, and other artifacts into a package that can be installed and managed from the Kibana web interface.</p><p>Integrations configure <a href="https://www.elastic.co/docs/reference/beats/filebeat/configuration-filebeat-options">Filebeat inputs</a> to do the data collection. To collect data from HTTP APIs, we’ve often used the <a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-httpjson">HTTP JSON input</a>. However, even basic listing APIs can differ greatly in the details, and the HTTP JSON input&#x27;s model of YAML-configured transformations can make it awkward and sometimes impossible to express the required collection logic.</p><p>The <a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-cel">Common Expression Language (CEL) input</a> was introduced to allow more flexible interaction with HTTP APIs. <a href="https://cel.dev/">CEL</a> is a language designed to be embedded in applications that require a fast, safe, and extensible way to express conditions and data transformations. The CEL input lets an integration builder write one expression that can read settings, keep track of its own state, make requests, process responses, and ultimately return events ready to ingest.</p><p>In this article, we’ll look at how CEL differs from other programming languages, how we’ve extended it for the CEL input, and the flexibility and power that gives you to express your data collection logic.</p><h2>CEL and how it works in the input</h2><p>CEL is an expression language. It has no statements. When you write CEL, you don’t tell it what to do by writing statements, you tell it what value to produce by writing an expression. 
Every CEL expression produces a value, and smaller expressions can be combined into a larger expression to produce a result according to more complex rules. Later, we’ll see how to use expressions for things that may be written with statements in other languages.</p><p>CEL is intentionally a non-Turing complete language. It doesn’t allow unbounded loops. Later, we’ll see how you can process lists and maps using macros, but by avoiding unbounded loops, the language guarantees predictable and limited execution time for individual expressions.</p><p>The CEL input is configured with a CEL program (an expression) and some initial state. The state will be provided as input to the program. The program is evaluated to produce an output state. If the output state includes a list of events, those will be removed and published. The rest of the output state will be used as the input for the next evaluation. If the output state includes one or more events and the flag <code>want_more: true</code>, the next evaluation will be performed immediately; otherwise, it will sleep for the rest of the configured interval time before continuing. Here’s a simplified diagram of the input’s control flow:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42671541f97e2dba808fd53969fe12f517917f9a-1600x529.png" style="width: 100%;" /><p>The output of each evaluation will be passed forward as the input to the next evaluation, for as long as the input runs. Output data under the key &quot;<code>cursor</code>&quot; will be persisted to disk and reloaded after the input is restarted, but the rest of the state will not be preserved across restarts.</p><p>The CEL language itself has limited functionality and avoids side effects, but it is extensible. The <a href="https://github.com/google/cel-go">cel-go</a> implementation adds some functionality, such as optional syntax and types. 
The <a href="https://github.com/elastic/mito">Mito</a> library builds on cel-go and adds more functionality, including the ability to make HTTP requests. The CEL input uses Mito’s version of CEL.</p><h2>Working with Mito</h2><p>To build or debug an integration using the CEL input, the most important thing to understand is what output state your CEL program will produce for a given input state. During development, it can be cumbersome to have your CEL program run by the input, surrounded by the full Elastic stack. One way to achieve a faster feedback loop is to use Mito’s command-line tool, which will let you run a CEL program directly and see the output it produces for a given input.</p><p>Mito is written in Go and can be installed as follows:</p><pre><code>go install github.com/elastic/mito/cmd/mito@latest</code></pre><p>When you run a CEL program with Mito, you typically give it two files: a JSON file with the initial input state, and another file with the source code of your CEL program:</p><pre><code>mito -data state.json src.cel</code></pre><p>For easier copying and pasting, the examples in this article are written as single commands that have the shell create temporary files on the fly, by wrapping the content of each file in <code>&lt;(echo &#x27;...content...&#x27;)</code>. In your own development, working with actual files will be easier.</p><h2>Fetching issues data from GitHub</h2><p>The following example includes a full CEL program that will fetch data about issues from the <a href="https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues">GitHub API</a>. Its initial input state has a URL for the API endpoint, and some information about how it should handle pagination. The CEL program uses the data in the input state to generate a request. It will decode the response, produce events from it, and return them as part of its output state.</p><pre><code>mito -data <(echo '
  {
    "url": "https://api.github.com/repos/elastic/integrations/issues",
    "per_page": 3,
    "max_pages": 3
  }
') <(echo '
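  // Derive the current page from the persisted cursor (1 on the first run),
  // build the paginated request URL, perform the GET request, and map the
  // decoded JSON response into events plus an updated cursor for the next run.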
  int(state.?cursor.page.orValue(1)).as(page,
    (
      state.url + "?" + {
        "state": ["all"],
        "sort": ["created"],
        "direction": ["asc"],
        "per_page": [string(state.per_page)],
        "page": [string(page)],
      }.format_query()
    ).as(full_url,
      request("GET", full_url).with({
        "Header": {
          "Accept": ["application/vnd.github+json"],
          "X-GitHub-Api-Version": ["2022-11-28"],
        }
      }).do_request().as(resp,
        resp.Body.decode_json().as(data,
          state.with({
            "events": data.map(i, {
              "html_url": i.html_url,
              "title": i.title,
              "created_at": i.created_at,
            }),
            "cursor": { "page": page + 1 },
            "want_more": size(data) == state.per_page && page < state.max_pages,
          })
        )
      )
    )
  )
')</code></pre><p>Its first evaluation produces the following output:</p><pre><code>{
  "cursor": {
    "page": 2
  },
  "events": [
    {
      "created_at": "2018-09-14T09:47:35Z",
      "html_url": "https://github.com/elastic/integrations/issues/3250",
      "title": "Increase support of log formats in haproxy filebeat module"
    },
    {
      "created_at": "2019-02-06T12:37:37Z",
      "html_url": "https://github.com/elastic/integrations/issues/487",
      "title": "ETCD Metricbeat module needs polishing and grooming"
    },
    {
      "created_at": "2019-08-13T11:33:11Z",
      "html_url": "https://github.com/elastic/integrations/pull/1",
      "title": "Initial structure"
    }
  ],
  "max_pages": 3,
  "per_page": 3,
  "url": "https://api.github.com/repos/elastic/integrations/issues",
  "want_more": true
}</code></pre><p>The events will be removed, and when run in the CEL input, they’ll be published for ingestion. The rest of the output will be provided to the next CEL program evaluation as its input state.</p><p>To understand how that CEL program works, we’ll look at some smaller CEL examples and discuss more details of how the CEL input operates.</p><h2>CEL basics</h2><p>In the CEL language, there are no statements; there are only expressions. Every successful CEL expression evaluates to a final value. Here’s one of the smallest CEL expressions you can write, along with its output:</p><pre><code>mito <(echo '
  "hello" + " " + "world"
')</code></pre><pre><code>"hello world"</code></pre><p>Many simple expressions are intuitive. Mathematical operations are only supported on values of the same type (for example, <code>int</code> with <code>int</code>), so convert types as you need (here from <code>int</code> to <code>double</code>):</p><pre><code>mito <(echo '
  double((1 + 2) * (3 + 4)) / 2.0
')</code></pre><pre><code>10.5</code></pre><p>There are no variables in the CEL language, but an expression can be given a name and used in a larger expression with the help of Mito’s <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-As__Macro_-Collections"><code>as</code></a> macro. In this example, the expression <code>(1 + 1)</code> evaluates to the value <code>2</code>, and <code>.as(n, ...)</code> gives that value the name <code>n</code> for use in the expression <code>&quot;one plus one is &quot;+string(n)</code>:</p><pre><code>mito <(echo '
  (1 + 1).as(n, "one plus one is "+string(n))
')</code></pre><pre><code>"one plus one is 2"</code></pre><p>It&#x27;s also possible to accumulate information in a map and use it later in the expression, as demonstrated here using <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-With-Collections"><code>with</code></a>:</p><pre><code>mito <(echo '
  { "key": "value" }.with({ "key2": "value2" }).as(data,
    {
      "data": data,
      "size": size(data),
    }
  )
')</code></pre><pre><code>{
  "data": {
    "key": "value",
    "key2": "value2"
  },
  "size": 2
}</code></pre><p>Look at that example again. Notice that the nested part, <code>({ &quot;data&quot;: data, &quot;size&quot;: size(data), })</code>, gives us the shape of the final value. It’s a map with the keys <code>&quot;data&quot;</code> and <code>&quot;size&quot;</code>. The values for those keys depend on <code>data</code>, which is defined by the outer part of the expression. Reading CEL expressions from the inside out can help to quickly see what they’ll return.</p><p>CEL has no control flow statements, like <code>if</code>, but conditional branching can be done with the ternary operator:</p><pre><code>mito <(echo '
  1 + 1 < 12 ? "few" : "many"
')</code></pre><pre><code>"few"</code></pre><p>Unbounded loops and recursion are not supported, as CEL is not a Turing complete language. That makes execution time predictable and proportional to the size of the input data and the expression complexity.</p><p>Although unbounded loops are not possible in individual CEL expressions, you can process lists and maps using macros like <a href="https://github.com/google/cel-spec/blob/master/doc/langdef.md#macros"><code>map</code></a>:</p><pre><code>mito <(echo '
  [1, 2, 3].map(x, x * 2)
')</code></pre><pre><code>[2, 4, 6]</code></pre><p>In this section, we’ve covered:</p><ul><li>Strings, numbers, lists, and maps.</li><li>String concatenation.</li><li>Mathematical operations.</li><li>Type casting.</li><li>Conditionals.</li><li>Naming sub-expressions.</li><li>Processing collections.</li></ul><p>Next, we’ll look at how to make HTTP requests.</p><h2>Requests</h2><p>Mito extends CEL with the ability to make <a href="https://pkg.go.dev/github.com/elastic/mito/lib#HTTP">HTTP requests</a>:</p><pre><code>mito <(echo '
  get("https://example.com").as(resp, string(resp.Body))
')</code></pre><pre><code>"<!doctype html><html lang=\"en\"><head><title>Example Domain</title>..."</code></pre><p>Requests can be explicitly constructed before they’re executed. That makes it possible to use different HTTP methods and to add headers and a body.</p><p>In this example, we build a URL with the help of <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-Format_Query-HTTP"><code>format_query</code></a>, add a header to the request, and parse the response body with <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-Decode_JSON-JSON"><code>decode_json</code></a>. When given the <code>-log_requests</code> option, Mito will log detailed information in JSON format about each request and response.</p><pre><code>mito -log_requests <(echo '
  request("GET",
    "https://postman-echo.com/get?" + {
        "q": ["query value"]
     }.format_query()
  ).with({
    "Header": { "Accept": ["application/json"] }
  }).do_request().as(resp, {
    "status": resp.StatusCode,
    "data": resp.Body.decode_json(),
  })
')</code></pre><pre><code>{"time":"...","level":"INFO","msg":"HTTP request",...}
{"time":"...","level":"INFO","msg":"HTTP response",...}
{
  "data": {
    "args": {
      "q": "query value"
    },
    "headers": {
      "accept": "application/json",
      "accept-encoding": "gzip, br",
      "host": "postman-echo.com",
      "user-agent": "Go-http-client/2.0",
      "x-forwarded-proto": "https"
    },
    "url": "https://postman-echo.com/get?q=query+value"
  },
  "status": 200
}</code></pre><h2>Managing state and evaluations</h2><p>Now that we’ve covered how to make requests and the CEL basics required to produce our desired output state, let’s take a closer look at what we should put into the output state and how that lets us direct later processing.</p><p>An integration’s CEL program needs to make sure its output state is suitable for use as the input of the next evaluation. Configuration sets the initial state, and that should be repeated in the output with any appropriate changes. An easy way to do that is to use <code>state.with({ ... })</code>, to repeat the state map with some overrides. A common pattern for small programs is to wrap the whole program in <code>state.with()</code>, so that state propagation doesn’t have to be repeated in each branch that generates output data (for example, success, errors).</p><p>When there are state values that are initialized by an evaluation rather than hard-coded in the initial input state, the program will need to check for an existing value before setting the initial one. That’s something that the support for <a href="https://pkg.go.dev/github.com/google/cel-go/cel#OptionalTypes">optional syntax and types</a> can help with. By using a question mark before the field name in a map key, the access becomes optional: It may or may not resolve to a value, but further optional accesses are possible and it’s easy to supply a default if no value is present:<br/></p><pre><code>mito -data <(echo '{}') <(echo '
  int(state.?counter.orValue(0)).as(counter,
    state.with({
      "counter": counter + 1,
      "want_more": counter + 1 < 3,
    })
  )
')</code></pre><pre><code>{ "counter": 1, "want_more": true }
{ "counter": 2, "want_more": true }
{ "counter": 3, "want_more": false }</code></pre><p>In that example, the counter value read from state is cast to <code>int</code> because all numbers are serialized in the state as floating point numbers, in keeping with conventions established by JSON and JavaScript’s <code>Number</code> type. It should also be noted that <code>&quot;want_more&quot;: true</code> is honored here by Mito, but when run in the CEL input, the evaluation would only be repeated if the output also contains events.</p><p>It’s a requirement of CEL programs run by the CEL input that they return an <code>&quot;events&quot;</code> key in their output map. Its value can be a list of event maps, an empty list, or a single event map. The single event case is usually used for errors. The event will be published by the input, but its value will also be logged, and if it sets an <code>error.message</code> value, that will be used to update the integration’s Fleet health status. If your program produces a single non-error event, it’s best to wrap it in a list.</p><p>Take another look at the output of our GitHub issues program from earlier:</p><pre><code>{
  "url": "https://api.github.com/repos/elastic/integrations/issues",
  "per_page": 3,
  "max_pages": 3,
  "cursor": {
    "page": 2
  },
  "events": [
    { ... },
    { ... },
    { ... }
  ],
  "want_more": true
}</code></pre><p>The program effectively managed its state, by:</p><ul><li>Repeating initial state values in <code>url</code>, <code>per_page</code>, and <code>max_pages</code>.</li><li>Adding state that should be persisted across restarts in <code>cursor.page</code>.</li><li>Returning events ready to publish in the <code>events</code> list.</li><li>Requesting immediate re-evaluation with <code>want_more: true</code>.</li></ul><p>Now that you understand optional access and state management, as well as CEL basics and HTTP requests, the full GitHub issues program should be readable. Try running it with Mito and experimenting with some changes.</p><h2>Review and resources</h2><p>In this article, we looked at what the CEL language is and how it has been extended in the Mito library for use in the CEL input. We saw the flexibility of CEL in an example program that fetches issues information from the GitHub API, and went through all the details necessary to understand that program, covering access to settings in the initial state, interaction with HTTP APIs, returning events to be ingested, and managing the state for later program executions.</p><p>To learn more and build integrations using the CEL input, there are a number of resources worth exploring:</p><ul><li><a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-cel">CEL input - Filebeat documentation</a></li><li><a href="https://pkg.go.dev/github.com/elastic/mito">Mito documentation</a></li><li><a href="https://cel.dev/">Common Expression Language - cel.dev website</a></li><li><a href="https://www.elastic.co/docs/extend/integrations">Create an Integration - Elastic documentation</a></li></ul><p>And perhaps the most valuable resource for building integrations with the CEL input is the CEL code of existing Elastic integrations, which can be found on GitHub:</p><p><a href="https://github.com/search?q=repo%3Aelastic%2Fintegrations+path%3A**%2Fcel.yml.hbs&amp;type=code"><code>cel.yml.hbs</code> 
files in the Elastic integrations repository - GitHub</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch & LLMs, Part 2: Matching entities with LLM judgment and semantic search]]></title>
            <description><![CDATA[Using semantic search and transparent LLM judgment for entity resolution in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/473ca4357c7d60f690edbd2a844acda169aca9c3-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">Part 1</a>, we prepared our watch list and extracted entity mentions. Now we’re ready to answer the hard question: Which entity does a mention actually refer to? Let&#x27;s return to the example in the first blog of this series, which set up why we need entity resolution: &quot;The Swift update is here!&quot; Imagine that this headline is accompanied by a bit more context:</p><ol><li>The new Swift update is here! Developers are eager to try out the new features.</li><li>The new Swift update is here! The new album will drop next month.</li></ol><p>With this added context, we should be able to resolve the name &quot;Swift&quot; to the correct entity.</p><p>In the <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">previous post</a>, we set up our watch list and enriched the entities with additional context. Looking at our examples above, we need to have at least the following two entities in the list: Taylor Swift and Swift Programming Language. We also covered how we extract entity mentions from text. Both of these examples would extract &quot;Swift&quot;. With these ingredients in place (the enriched watch list and the extracted entities), we’re finally ready to introduce the star of the show: entity matching.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach entity-matching concepts. Production systems might use different large language models (LLMs), custom matching rules, specialized judgment pipelines, or ensemble approaches combining multiple matching strategies.</p><h2>The problem: Why matching is hard</h2><p>Human language is a remarkable thing. One of its most interesting properties is its endless creativity. We can generate and understand an infinite number of new sentences. Is it any wonder, then, that exact matches in entity resolution are rare? 
Authors strive to be creative when they can. It would get quite tedious if we had to write and read full names whenever an entity is mentioned. So, while exact matches are easy, the reality is that we need a more sophisticated approach to entity resolution: one that’s robust enough to handle at least some of the boundless creativity of human authors. That’s why we separate the problem into two steps: Use Elasticsearch to retrieve plausible candidates at scale, and then use an LLM to judge whether those candidates truly refer to the same real-world entity.</p><h2>The solution: Three-step matching with transparent LLM judgment</h2><p>We’re in the midst of a paradigm shift in how we use computers. Just as the rise of the internet took us from localized computing to a globally connected network, generative AI (GenAI) is fundamentally changing how content, code, and information are created. In fact, the educational prototype that accompanies this series was almost exclusively &quot;vibe coded&quot; using an LLM with careful prompting by the author. This is not to say that LLMs have reached, or ever will reach, the kind of creative productivity inherent in human language, but it does mean that we now have a powerful resource to help with entity resolution.</p><p>A common pattern we use with GenAI is retrieval augmented generation (RAG). Here, <em>retrieval</em> means retrieving entity candidates (not generating answers), and the LLM is used strictly for match evaluation and explanation. While we <em>could</em> ask an LLM to help us with end-to-end entity resolution, that’s a costly approach, both in terms of time and money. RAG makes the LLM’s job easier by providing relevant context through efficient retrieval, letting the model focus on evaluating matches rather than finding them.</p><p>For the retrieval part of RAG, we again turn to Elasticsearch. 
We first find potential matches using a combination of exact matching, matching against aliases, and hybrid search, which combines keyword and semantic search. Once we find these potential matches, we send them to an LLM for judgment. The LLM acts as the final match evaluator. We also make the LLM explain its reasoning, an important differentiator from other entity resolution systems. Without these explanations, entity resolution is a black box; with them, we can see for ourselves why a match makes sense.</p><h2>Key concepts: Three-step matching, hybrid search, and transparent LLM judgment</h2><p><strong>What is three-step matching?</strong> At the outset of this project, we hypothesized that semantic search would be a crucial part of the system, but not every match requires such sophisticated search. In order to find matches efficiently, we take a progressive approach to the problem. First, we check for exact matches using keyword search. If we find such a match, our work is done and we can move on. If exact matching fails, we turn to alias matching. In the prototype, alias matching is also done using exact matching with keywords, for simplicity. In production, you might expand this step with normalization, transliteration rules, fuzzy matching, or curated alias tables. If we still haven&#x27;t found a potential match in the first two steps, then it&#x27;s time to bring in semantic search via Elasticsearch&#x27;s hybrid search with reciprocal rank fusion (RRF).</p><p><strong>What is hybrid search?</strong> In Elasticsearch, we can use semantic search to find meaningful matches that take context into account. Elasticsearch is widely used for vector search and hybrid retrieval. Semantic similarity is powerful for meaning, but it’s not a substitute for structured filtering (for example, by time ranges, locations, or identifiers), and it’s often unnecessary when an exact match is available. 
Elasticsearch made its mark with lexical search, which is great at tasks where semantic search doesn&#x27;t fit. To take full advantage of both approaches, we use lexical search alongside semantic search in a single hybrid query. We then merge the results to find the most likely matches using RRF. In the prototype, the top two results become potential matches that can be sent for LLM judgment.</p><p><strong>Why LLM judgment?</strong> LLM judgments and explanations allow our system to handle ambiguity and context transparently. This is vital for cases like &quot;the president&quot;, which could refer to multiple entities, depending on the context, but it also makes things like nicknames and cultural variations work well in the system. Finally, when we consider mission-critical tasks, like identifying entities from sanctions lists, we need to know why a match was accepted in order to trust the system. Crucially, the LLM does not search the full corpus; it evaluates only the small set of candidates returned by Elasticsearch.</p><h2>Real-world results: Matching with LLM reasoning</h2><p>A major challenge for any natural language processing task is the creation of a golden document, an &quot;answer key&quot; that tells us what the expected results are. Without this, it&#x27;s next to impossible to judge how well a system performs on a task, but creating such a document can be a laborious process. For the entity resolution prototype, we turned again to GenAI to help set up data we could test against.</p><p>We first defined several challenge types, such as nicknames and transliteration, and then asked the LLM to create a tiered collection of datasets that would get progressively larger and more challenging for the system. The creation of the datasets was less straightforward than one might hope. The LLM had a strong propensity for &quot;cheating&quot; by making it too easy to get the right answer. For example, one of the challenge types focused on semantic context. 
This type included things like resolving &quot;Russian author&quot; to &quot;Leo Tolstoy&quot;. The LLM incorrectly put &quot;Russian author&quot; as an alias for &quot;Leo Tolstoy&quot;, which negated the need for hybrid search to find the match.</p><p>After several refactorings to fix issues like this, we had five dataset tiers to work with. Tiers 1–4 were progressively larger with more challenge types. Tier 5 was the &quot;ultimate challenge&quot; dataset, made up of the trickiest examples from all challenge types. All of the test data is available in the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/comprehensive_evaluation">comprehensive evaluation directory</a>.</p><p>To evaluate our prompt-based entity resolution approach, we focused our attention on the tier 4 dataset. An important note is that the evaluation was conducted as a controlled experiment so that we could focus on entity match quality. The watch list data was pre-enriched with context, and entities were extracted from the article ahead of time. This ensured that evaluation was focused on matching rather than on extraction accuracy. 
This isolates match quality; end-to-end performance would additionally depend on extraction recall and enrichment quality.</p><h3>Evaluation dataset</h3><p>The tier 4 evaluation dataset provides a comprehensive test of the system&#x27;s capabilities:[1]</p><ul><li><strong>Watch list entities:</strong> 66 entities across diverse types (people, organizations, locations).</li><li><strong>Test articles:</strong> 69 articles covering real-world entity resolution scenarios.</li><li><strong>Expected matches:</strong> 206 expected entity matches across all articles.</li><li><strong>Challenge types: </strong>15 different challenge types testing various aspects of entity resolution.</li></ul><p>The challenge types included in the dataset are:</p><ul><li><strong>Nicknames:</strong> &quot;Bob Smith&quot; → &quot;Robert Smith&quot; (seven articles).</li><li><strong>Titles and honorifics:</strong> &quot;Dr. Sarah Williams&quot; → &quot;Sarah Williams&quot; (five articles).</li><li><strong>Semantic context:</strong> &quot;Russian author&quot; → &quot;Leo Tolstoy&quot; (eight articles).</li><li><strong>Multilingual names:</strong> Handling names in different scripts (six articles).</li><li><strong>Business entities:</strong> Corporate name variations (seven articles).</li><li><strong>Executive references: </strong>&quot;Microsoft CEO&quot; → &quot;Satya Nadella&quot; (five articles).</li><li><strong>Political leaders:</strong> Title-based references (five articles).</li><li><strong>Initials:</strong> &quot;J. 
Smith&quot; → &quot;John Smith&quot; (three articles).</li><li><strong>Name order variations:</strong> Different name ordering conventions (three articles).</li><li><strong>Truncated names:</strong> Partial name matches (three articles).</li><li><strong>Name splitting:</strong> Names split across text (three articles).</li><li><strong>Missing spaces/hyphens:</strong> Formatting variations (two articles).</li><li><strong>Transliteration:</strong> Cross-script name matching (two articles).</li><li><strong>Combined challenges:</strong> Multiple challenges in one article (six articles).</li><li><strong>Complex business:</strong> Hierarchical business relationships (five articles).</li></ul><p>Let&#x27;s see how prompt-based entity resolution performed.</p><h3>Overall performance</h3><p>The results show that there&#x27;s a lot of promise with LLM-powered match evaluation, but they also reveal a significant reliability issue. Because each candidate pair must be evaluated by the LLM, failures in structured output can suppress acceptance and recall even when retrieval is working well.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <thead>
                    <tr>
                      <th style="border: 1px solid #ccc; padding: 4px;">Metric</th><th style="border: 1px solid #ccc; padding: 4px;">Value</th>
                    </tr>
              </thead>
              <tbody>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Precision</td><td style="border: 1px solid #ccc; padding: 4px;">83.8%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">62.6%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">F1 score</td><td style="border: 1px solid #ccc; padding: 4px;">71.7%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total matches found</td><td style="border: 1px solid #ccc; padding: 4px;">344</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LLM acceptance rate</td><td style="border: 1px solid #ccc; padding: 4px;">44.8%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">30.2%</td>
                    </tr>
              </tbody>
            </table>
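As a quick sanity check, the F1 score in the table above is the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.838  # 83.8% from the table
recall = 0.626     # 62.6% from the table
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.1%}")  # -> 71.7%
```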
<h3>The error rate problem</h3><p>Recall that the first step we take in the prototype is to create potential match pairs using Elasticsearch. Each of these potential matches needs to be evaluated by the LLM. To process all of those matches efficiently, we batch the LLM calls together. This reduces API costs and latency, but it also increases the risk of malformed JSON in the output. As batch size increases, the JSON becomes longer and more complex, making it more likely that the LLM will generate invalid JSON. This is the source of the 30.2% error rate shown above. In the evaluation, we used a batch size of five matches per request. Even with this conservative batch size, we still see JSON parsing failures, which skews the evaluation results significantly.</p><h2>What&#x27;s next: Optimizing LLM integration</h2><p>Now that we&#x27;ve matched entities using semantic search and LLM judgment, we have a complete entity resolution pipeline. However, this approach introduces a new failure mode: the model’s judgment can be correct while its output isn’t usable. We can optimize the LLM integration for better reliability and cost efficiency. In the next post, we&#x27;ll explore how to use function calling for structured output, which provides guaranteed structure and type safety while reducing errors and costs.</p><h2>Try it yourself</h2><p>Want to see entity matching in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks">Entity Matching notebook</a> for a complete walk-through with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to match entities using three-step search, hybrid search with RRF, and LLM-powered judgment with reasoning.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach the concepts. 
When building production systems, consider additional factors, like model selection, cost optimization, latency requirements, quality validation, error handling, and monitoring, which aren&#x27;t covered in this learning-focused prototype.</p><h2>Notes</h2><ol><li>These datasets are synthetic and designed for education; they approximate real challenges but are not representative of any single production domain.</li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch vector search is up to 8x faster than OpenSearch]]></title>
            <description><![CDATA[Exploring filtered vector search benchmarks of OpenSearch vs. Elasticsearch and why vector search performance is critical for context-engineered systems.]]></description>
            <link>https://www.elastic.co/search-labs/blog/opensearch-vs-elasticsearch-filtered-vector-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/opensearch-vs-elasticsearch-filtered-vector-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sachin Frayne]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb7ee623925ca6696d643e437ce8efe5fe749079-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2>Why search speed matters for AI agents and context engineering</h2><p>Our benchmarks on a 20M document corpus show that Elasticsearch delivers up to 8x higher throughput than OpenSearch for filtered vector search, while also achieving higher Recall@100 across the configurations we tested. Context engineering depends on more than fast vector retrieval. Teams also need strong relevance controls, like hybrid search and filtering, operational simplicity, and predictable performance, as workflows iterate. But because agents often run retrieve → reason → retrieve loops many times per request, retrieval latency becomes a multiplier, so improvements here translate directly into better end-to-end responsiveness and lower cost.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5a52a07773f0942c2baa732dacfe782aac0f415-1600x683.png" style="width: 100%;" /><p>For context engineering, retrieval isn’t a one-time step. Agents and applications repeatedly run loops, such as retrieve → reason → retrieve, to refine queries, verify facts, assemble grounded context, and complete tasks. This pattern is common in agentic workflows and iterative retrieval augmented generation (RAG). Because retrieval may be invoked many times per user request, it adds delay to the response and/or increases infrastructure costs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54ac476ff20a3cf93484298c9ae47612c12fc110-800x417.png" style="width: 100%;" /><h2>Why is vector search performance critical?</h2><p>Imagine a shopping assistant answering the question, “I need a carry-on backpack under $60 that fits a 15-inch laptop, is water resistant, and can arrive by Friday.”</p><p>In production, the assistant rarely issues one vector query and stops. 
It runs a retrieval loop to build the right context, and each step is typically constrained by filters, like availability, region, shipping promise, brand rules, and policy eligibility.</p><p><strong>Step 1: Interpret intent and translate to constraints.</strong></p><p>The agent turns the request into structured filters and a semantic query, such as:</p><ul><li>Filters: In stock, deliverable to the user’s postcode, delivery by Friday, price under $60, valid listing</li><li>Vector query: “Carry-on backpack 15-inch laptop water resistant”</li></ul><p><strong>Step 2: Retrieve candidates, and then refine.</strong></p><p>It often repeats retrieval with variations to avoid missing good matches:</p><ul><li>“travel backpack carry on laptop sleeve”</li><li>“water resistant commuter backpack 15 inch”</li><li>“lightweight cabin backpack”</li></ul><p>Each query uses the same eligibility filters, because retrieving irrelevant or unavailable items is wasted context.</p><p><strong>Step 3: Expand to confirm details and reduce risk.</strong></p><p>The agent then retrieves again to verify key attributes that affect the final answer:</p><ul><li>Material and water resistance wording</li><li>Dimensions and laptop compartment fit</li><li>Return policy or warranty constraints</li><li>Alternate options if inventory is low</li></ul><p>This is multistep context engineering: Retrieve, reason, retrieve, assemble.</p><h2>Why latency and recall matter for context engineering</h2><p>These interactions can involve dozens of filtered retrieval calls per user session. That makes per-call latency a direct multiplier on end-to-end response time, and low recall forces extra retries or causes the agent to miss eligible items, degrading answer quality.</p><p>Takeaway: In context-engineered systems, filtered approximate nearest neighbors (ANN) isn’t a single lookup. 
It’s a repeated operation under constraints, so vector search performance shows up immediately in latency, throughput, and cost, even when the large language model (LLM) is the most visible component.</p><h2>Benchmarking</h2><h3>Results</h3><p>In Graph 2, each dot represents one test configuration. The best results appear toward the top left, meaning higher recall with lower latency. Elasticsearch’s results are consistently closer to the top left than OpenSearch’s, indicating better speed and accuracy under the same workload settings.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c50d1df00968cac18149a2799e6242fbe49b66a0-1600x990.png" style="width: 100%;" /><h4>Some key insights</h4><ul><li><code>s_n_r_value</code>: Shorthand for <code>size_numCandidates_rescoreOversample</code>, with k set equal to numCandidates in these tests. For example, <code>100_500_1</code> means size=100, k=500, numCandidates=500, and rescore oversample=1.</li><li>Recall: Measured Recall@100 for that configuration.</li><li>Avg latency (ms): Average end-to-end latency per query.</li><li>Throughput: Queries per second.</li><li>Recall %: Relative recall lift of Elasticsearch versus OpenSearch, computed as (Elasticsearch minus OpenSearch) / OpenSearch.</li><li>Latency Xs: OpenSearch average latency divided by Elasticsearch average latency.</li><li>Throughput Xs: Elasticsearch throughput divided by OpenSearch throughput.</li></ul>
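The derived columns are simple ratios of the raw measurements; as an illustration, recomputing them for the 100_250_1 pair of rows in the table below:

```python
# Reproduce the derived columns from the raw 100_250_1 measurements.
es_recall, es_latency_ms, es_qps = 0.7704, 25.0, 534.75
os_recall, os_latency_ms, os_qps = 0.7023, 57.08, 279.58

recall_lift = (es_recall - os_recall) / os_recall  # "Recall %"
latency_x = os_latency_ms / es_latency_ms          # "Latency Xs"
throughput_x = es_qps / os_qps                     # "Throughput Xs"
print(f"{recall_lift:.2%}, {latency_x:.2f}x, {throughput_x:.2f}x")
# -> 9.70%, 2.28x, 1.91x
```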
            <table style="border-collapse: collapse; margin: 20px 0;">
              <thead>
                    <tr>
                      <th style="border: 1px solid #ccc; padding: 4px;">Engine</th><th style="border: 1px solid #ccc; padding: 4px;"><code>s_n_r_value</code></th><th style="border: 1px solid #ccc; padding: 4px;">Recall</th><th style="border: 1px solid #ccc; padding: 4px;">Avg Latency (ms)</th><th style="border: 1px solid #ccc; padding: 4px;">Throughput</th><th style="border: 1px solid #ccc; padding: 4px;">Recall %</th><th style="border: 1px solid #ccc; padding: 4px;">Latency Xs</th><th style="border: 1px solid #ccc; padding: 4px;">Throughput Xs</th>
                    </tr>
              </thead>
              <tbody>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_250_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.7704</td><td style="border: 1px solid #ccc; padding: 4px;">25</td><td style="border: 1px solid #ccc; padding: 4px;">534.75</td><td style="border: 1px solid #ccc; padding: 4px;">9.70%</td><td style="border: 1px solid #ccc; padding: 4px;">2.28</td><td style="border: 1px solid #ccc; padding: 4px;">1.91</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_250_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.7023</td><td style="border: 1px solid #ccc; padding: 4px;">57.08</td><td style="border: 1px solid #ccc; padding: 4px;">279.58</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8577</td><td style="border: 1px solid #ccc; padding: 4px;">25.42</td><td style="border: 1px solid #ccc; padding: 4px;">524.14</td><td style="border: 1px solid #ccc; padding: 4px;">7.20%</td><td style="border: 1px solid #ccc; padding: 4px;">2.4</td><td style="border: 1px solid #ccc; padding: 4px;">2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8001</td><td style="border: 1px solid #ccc; padding: 4px;">60.9</td><td style="border: 1px solid #ccc; padding: 4px;">262.12</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_750_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8947</td><td style="border: 1px solid #ccc; padding: 4px;">29.67</td><td style="border: 1px solid #ccc; padding: 4px;">528.09</td><td style="border: 1px solid #ccc; padding: 4px;">5.72%</td><td style="border: 1px solid #ccc; padding: 4px;">2.25</td><td style="border: 1px solid #ccc; padding: 4px;">2.21</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_750_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8463</td><td style="border: 1px solid #ccc; padding: 4px;">66.76</td><td style="border: 1px solid #ccc; padding: 4px;">239.11</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9156</td><td style="border: 1px solid #ccc; padding: 4px;">29.65</td><td style="border: 1px solid #ccc; padding: 4px;">534.5</td><td style="border: 1px solid #ccc; padding: 4px;">4.66%</td><td style="border: 1px solid #ccc; padding: 4px;">2.46</td><td style="border: 1px solid #ccc; padding: 4px;">2.44</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8748</td><td style="border: 1px solid #ccc; padding: 4px;">72.88</td><td style="border: 1px solid #ccc; padding: 4px;">219.01</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9386</td><td style="border: 1px solid #ccc; padding: 4px;">31.84</td><td style="border: 1px solid #ccc; padding: 4px;">497.3</td><td style="border: 1px solid #ccc; padding: 4px;">3.38%</td><td style="border: 1px solid #ccc; padding: 4px;">2.71</td><td style="border: 1px solid #ccc; padding: 4px;">2.68</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9079</td><td style="border: 1px solid #ccc; padding: 4px;">86.16</td><td style="border: 1px solid #ccc; padding: 4px;">185.4</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9507</td><td style="border: 1px solid #ccc; padding: 4px;">34.69</td><td style="border: 1px solid #ccc; padding: 4px;">457.2</td><td style="border: 1px solid #ccc; padding: 4px;">2.57%</td><td style="border: 1px solid #ccc; padding: 4px;">2.98</td><td style="border: 1px solid #ccc; padding: 4px;">2.96</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9269</td><td style="border: 1px solid #ccc; padding: 4px;">103.36</td><td style="border: 1px solid #ccc; padding: 4px;">154.55</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9582</td><td style="border: 1px solid #ccc; padding: 4px;">37.9</td><td style="border: 1px solid #ccc; padding: 4px;">418.43</td><td style="border: 1px solid #ccc; padding: 4px;">1.99%</td><td style="border: 1px solid #ccc; padding: 4px;">3.28</td><td style="border: 1px solid #ccc; padding: 4px;">3.26</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9395</td><td style="border: 1px solid #ccc; padding: 4px;">124.29</td><td style="border: 1px solid #ccc; padding: 4px;">128.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_3000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9636</td><td style="border: 1px solid #ccc; padding: 4px;">41.86</td><td style="border: 1px solid #ccc; padding: 4px;">379.4</td><td style="border: 1px solid #ccc; padding: 4px;">1.62%</td><td style="border: 1px solid #ccc; padding: 4px;">3.46</td><td style="border: 1px solid #ccc; padding: 4px;">3.44</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_3000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9482</td><td style="border: 1px solid #ccc; padding: 4px;">144.67</td><td style="border: 1px solid #ccc; padding: 4px;">110.34</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_4000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9705</td><td style="border: 1px solid #ccc; padding: 4px;">50.28</td><td style="border: 1px solid #ccc; padding: 4px;">316.21</td><td style="border: 1px solid #ccc; padding: 4px;">1.06%</td><td style="border: 1px solid #ccc; padding: 4px;">3.87</td><td style="border: 1px solid #ccc; padding: 4px;">3.85</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_4000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9603</td><td style="border: 1px solid #ccc; padding: 4px;">194.36</td><td style="border: 1px solid #ccc; padding: 4px;">82.22</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_5000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9749</td><td style="border: 1px solid #ccc; padding: 4px;">58.77</td><td style="border: 1px solid #ccc; padding: 4px;">270.91</td><td style="border: 1px solid #ccc; padding: 4px;">0.73%</td><td style="border: 1px solid #ccc; padding: 4px;">4.43</td><td style="border: 1px solid #ccc; padding: 4px;">4.41</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_5000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9678</td><td style="border: 1px solid #ccc; padding: 4px;">260.33</td><td style="border: 1px solid #ccc; padding: 4px;">61.38</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_6000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9781</td><td style="border: 1px solid #ccc; padding: 4px;">66.75</td><td style="border: 1px solid #ccc; padding: 4px;">238.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.52%</td><td style="border: 1px solid #ccc; padding: 4px;">4.91</td><td style="border: 1px solid #ccc; padding: 4px;">4.89</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_6000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.973</td><td style="border: 1px solid #ccc; padding: 4px;">327.44</td><td style="border: 1px solid #ccc; padding: 4px;">48.81</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_7000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9804</td><td style="border: 1px solid #ccc; padding: 4px;">74.64</td><td style="border: 1px solid #ccc; padding: 4px;">213.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.38%</td><td style="border: 1px solid #ccc; padding: 4px;">5.28</td><td style="border: 1px solid #ccc; padding: 4px;">5.27</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_7000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9767</td><td style="border: 1px solid #ccc; padding: 4px;">394.24</td><td style="border: 1px solid #ccc; padding: 4px;">40.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_8000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9823</td><td style="border: 1px solid #ccc; padding: 4px;">82.28</td><td style="border: 1px solid #ccc; padding: 4px;">193.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.27%</td><td style="border: 1px solid #ccc; padding: 4px;">6.86</td><td style="border: 1px solid #ccc; padding: 4px;">6.83</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_8000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9797</td><td style="border: 1px solid #ccc; padding: 4px;">564.14</td><td style="border: 1px solid #ccc; padding: 4px;">28.33</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_9000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9837</td><td style="border: 1px solid #ccc; padding: 4px;">90.08</td><td style="border: 1px solid #ccc; padding: 4px;">176.96</td><td style="border: 1px solid #ccc; padding: 4px;">0.16%</td><td style="border: 1px solid #ccc; padding: 4px;">7.63</td><td style="border: 1px solid #ccc; padding: 4px;">7.61</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_9000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9821</td><td style="border: 1px solid #ccc; padding: 4px;">687.25</td><td style="border: 1px solid #ccc; padding: 4px;">23.25</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_10000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9848</td><td style="border: 1px solid #ccc; padding: 4px;">97.64</td><td style="border: 1px solid #ccc; padding: 4px;">163.31</td><td style="border: 1px solid #ccc; padding: 4px;">0.08%</td><td style="border: 1px solid #ccc; padding: 4px;">8.38</td><td style="border: 1px solid #ccc; padding: 4px;">8.36</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_10000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.984</td><td style="border: 1px solid #ccc; padding: 4px;">818.64</td><td style="border: 1px solid #ccc; padding: 4px;">19.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
              </tbody>
            </table>
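The headline comparison falls straight out of the last pair of rows above; recomputing it from the 100_10000_1 measurements:

```python
# The "up to 8x" headline, recomputed from the 100_10000_1 rows.
es_latency_ms, os_latency_ms = 97.64, 818.64  # avg latency per query
es_qps, os_qps = 163.31, 19.53                # throughput (queries/s)

latency_speedup = os_latency_ms / es_latency_ms
throughput_speedup = es_qps / os_qps
print(f"latency: {latency_speedup:.2f}x, throughput: {throughput_speedup:.2f}x")
# -> latency: 8.38x, throughput: 8.36x
```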
          <p>For example, at <code>100_9000_1</code>, OpenSearch averages 687 milliseconds per retrieval versus 90 milliseconds on Elasticsearch, and in a 10-step retrieval loop that’s about 10 x (687 - 90) = six seconds of additional waiting time. </p><p>See the <a href="https://github.com/elastic/competitive-benchmarking-studies/tree/main/es-9.3-vs-os-3.5-vector-search/jingra/results/20260220">full results</a>.</p><h3>Methodology</h3><p>Using Python to send the queries and track the response timing and other statistics, we sent the following queries to the engines. Bear in mind that the performance of any vector search engine depends on how you tune its core parameters: how many candidates to consider, how aggressively to rescore, and how much context to return. These settings directly affect both recall (the likelihood of finding the right answer) and latency (how fast you get results).</p><p>In our benchmarks, we used the same candidate, rescore, and result-size settings you’d typically tune in an agentic retrieval loop, and we measured how Elasticsearch performs under that workload. We then ran OpenSearch with the same settings as a reference.</p><p>OpenSearch</p><pre><code>GET <INDEX_NAME>/_search
{
  "query": {
    "knn": {
      "<DENSE_VECTOR_FIELD_NAME>": {
        "vector": [...],
        "k": <NUMBER_OF_CANDIDATES>,
        "method_parameters": {
          "ef_search": <NUMBER_OF_CANDIDATES>
        },
        "rescore": {
          "oversample_factor": <OVERSAMPLE>
        },
        "filter": {
          <SOME_FILTER>
        }
      }
    }
  },
  "size": <RESULT_SIZE>,
  "_source": {
    "excludes": [
      "<DENSE_VECTOR_FIELD_NAME>"
    ]
  }
}</code></pre><ul><li><code>&quot;size&quot;: &lt;RESULT_SIZE&gt;</code>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100.</li><li><code>&quot;k&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: The number of nearest neighbor candidates.</li><li><code>&quot;ef_search&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: The number of vectors to examine.</li><li><code>&quot;oversample_factor&quot;: &lt;OVERSAMPLE&gt;</code>: How many candidate vectors are retrieved before rescoring.</li></ul><p>Elasticsearch</p><pre><code>GET <INDEX_NAME>/_search
{
  "query": {
    "knn": {
      "field": "<DENSE_VECTOR_FIELD_NAME>",
      "query_vector": [...],
      "k": <NUMBER_OF_CANDIDATES>,
      "num_candidates": <NUMBER_OF_CANDIDATES>,
      "rescore_vector": {
        "oversample": <OVERSAMPLE>
      },
      "filter": {
        <SOME_FILTER>
      }
    }
  },
  "size": <RESULT_SIZE>,
  "_source": {
    "excludes": [
      "<DENSE_VECTOR_FIELD_NAME>"
    ]
  }
}</code></pre><ul><li><code>&quot;size&quot;: &lt;RESULT_SIZE&gt;</code>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100.</li><li><code>&quot;k&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbors to return from each shard.</li><li><code>&quot;num_candidates&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbor candidates to consider per shard while doing <code>knn</code> search.</li><li><code>&quot;oversample&quot;: &lt;OVERSAMPLE&gt;</code>: How many candidate vectors are retrieved before rescoring.</li></ul><p>Example</p><p>The <code>knn</code> query for <code>100_500_1</code> would be as follows:</p><p>OpenSearch</p><pre><code>GET search_catalog_128/_search
{
  "query": {
    "knn": {
      "search_catalog_embedding": {
        "vector": [...],
        "k": 500,
        "method_parameters": {
          "ef_search": 500
        },
        "rescore": {
          "oversample_factor": 1
        },
        "filter": {
          "term": {
            "valid": true
          }
        }
      }
    }
  },
  "size": 100,
  "_source": {
    "excludes": [
      "search_catalog_embedding"
    ]
  }
}</code></pre><p>Elasticsearch</p><pre><code>GET search_catalog_128/_search
{
  "query": {
    "knn": {
      "field": "search_catalog_embedding",
      "query_vector": [...],
      "k": 500,
      "num_candidates": 500,
      "rescore_vector": {
        "oversample": 1
      },
      "filter": {
        "term": {
          "valid": true
        }
      }
    }
  },
  "size": 100,
  "_source": {
    "excludes": [
      "search_catalog_embedding"
    ]
  }
}</code></pre><p>The full configuration, alongside Terraform scripts, Kubernetes manifests and the benchmarking code is available in this <a href="https://github.com/elastic/competitive-benchmarking-studies">repository</a> in the folder <a href="https://github.com/elastic/competitive-benchmarking-studies/tree/main/es-9.3-vs-os-3.5-vector-search">es-9.3-vs-os-3.5-vector-search</a>.</p><h3>Cluster setup</h3><p>We ran our tests on six e2-standard-16 cloud servers, each with 16 vCPUs and 64 GB RAM. On each server, we allocated 15 vCPUs and 56 GB RAM to each Kubernetes pod running the search engine node, with 28 GB reserved for the JVM heap.</p><p>The clusters ran Elasticsearch 9.3.0 and OpenSearch 3.5.0 (Lucene 10.3.2). Because both systems use the same Lucene version in this benchmark, the throughput and latency differences we observe cannot be attributed to Lucene alone and instead reflect differences in how each engine integrates and executes filtered k-nearest neighbor (kNN) retrieval and rescoring. 
We used a single index with three primary shards and one replica (so 6 shards total, 1 per node).</p><p>We also used a separate server in the same region to run the benchmark client and collect timing statistics.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f81002c9186e4c2d3e92f49d72418fee9860fc5e-761x401.png" style="width: 100%;" /><h3>The dataset</h3><p>For this benchmark, we used a large-scale ecommerce-style catalog embedding dataset with 20 million documents, designed to reflect real-world filtered vector retrieval at scale.</p><p>Each document represents a catalog item and includes:</p><ul><li>A 128-dimensional dense vector embedding used for approximate kNN retrieval.</li><li>Structured metadata fields used for filtering (for example, item validity and availability plus other catalog constraints), enabling the common production pattern of retrieving the nearest neighbors only within an eligible subset.</li></ul><p>We chose this dataset because it captures the core performance challenge we see in agentic and RAG-style systems in production: Vector similarity alone is not enough; retrieval is frequently constrained by filters, and the system must maintain high recall while keeping latency low under those constraints. Compared to smaller QA-style datasets, a 20M document corpus also better reflects the scale and candidate pressure that filtered ANN systems face in practice.</p><h2>Conclusion</h2><p>In modern AI architectures, especially those built around context engineering, vector search speed isn’t a minor implementation detail. It’s a multiplier. 
When agents and workflows iterate through retrieve → reason → retrieve, retrieval performance directly shapes end-to-end latency, throughput, and the quality of the context fed into the model.</p><p>In our benchmarks, Elasticsearch consistently delivered higher recall at lower latency than OpenSearch in scenarios where correctness depends on retrieving the right document, not just a similar vector. On a controlled dataset, the difference is clear, and in production those gains accumulate across large volumes of retrieval calls, improving responsiveness, increasing capacity headroom, and reducing infrastructure costs.</p><h3>Further reading</h3><ol><li><a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">What is context engineering?</a></li><li><a href="https://www.elastic.co/search-labs/blog/series/context-engineering-hybrid-search-evolution">The evolution of hybrid search and context engineering</a></li><li><a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">The impact of relevance in context engineering for AI agents</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic AutoOps now free for all: What you get with it]]></title>
            <description><![CDATA[Elastic AutoOps is now free for all self-managed clusters. Learn what you get with it and how it compares to Stack Monitoring.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valentin Crettaz, Arnon Stern, Ori Shafir]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2067a9457334a1b4bc14056fad34a304f39d1fab-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2><strong>Same AutoOps, regardless of your license</strong></h2><p>We’re making this simple: Starting today, AutoOps is available at no cost for every self-managed Elasticsearch cluster through <a href="https://www.elastic.co/blog/elasticsearch-autoops-on-prem">Elastic Cloud Connect</a>. Whether you’re on Free, Basic, Platinum, or Enterprise, you get the same full-featured product. This isn’t a limited preview or a &quot;lite&quot; version. It’s the same product used by the largest deployments.</p><h2><strong>An investment in the self-managed community</strong></h2><p>Extending AutoOps to every user, including those on the free distribution, reflects Elastic&#x27;s commitment to the success of the entire Elasticsearch community. By providing AutoOps for free, we’re investing in the stability and performance of the hundreds of thousands of clusters that power the community&#x27;s search and analytics workloads.</p><p>Elastic Cloud Connect enables self-managed clusters to consume Elastic Cloud services, such as AutoOps and the <a href="https://www.elastic.co/search-labs/de/blog/cloud-connect-elastic-inference-service">recently announced</a> Elastic Inference Service (EIS), without the operational overhead of maintaining, patching, monitoring, and operating the services locally.</p><h2><strong>How AutoOps for self-managed works and what it provides</strong></h2><p>As clusters grow in complexity and size, you find yourself spending more time chasing configuration tweaks and hunting for the root cause of issues. Monitoring tools show you metrics but leave it to you and your favorite large language model (LLM) to correlate them and find the root cause when an issue appears. AutoOps tells you what’s wrong, why, and exactly how to fix it, with real-time issue detection and specific resolution paths.</p><p>AutoOps runs on Elastic Cloud; there’s no infrastructure for you to provision or maintain. 
You simply run a lightweight agent on-premises to connect your cluster to the AutoOps service; operational metadata (such as node stats, cluster settings, and shard states) is shipped in real time to provide insights and recommendations. Your data never leaves your environment.</p><h2><strong>AutoOps vs. Stack Monitoring for self-managed users</strong></h2><p>Stack Monitoring provides the essential telemetry and basic monitoring for your nodes and indices, showing you trends over time for various metrics and alerting you when thresholds are crossed, yet it often leaves the diagnostic burden on the engineer. AutoOps offers a complete picture of cluster health by correlating all relevant metrics. This provides valuable insights and clear instructions on how to resolve issues when they occur.</p><p><strong>Faster root cause analysis</strong></p><p>Your cluster was humming along, but it suddenly turned red in the middle of the night (and, as usual, nothing had changed in the days before).</p><ul><li><strong>With Stack Monitoring:</strong> A built-in alert will notify you when your cluster health turns red. To find the cause, you need to dig into your cluster logs and turn to Dev Tools to look at your shard allocations to understand why that primary shard couldn’t be allocated. Looking at your alerts history, you see another one informing you that your hot nodes reached 80% disk utilization two days ago. You can’t find any disk usage charts to learn more about your disk fill rate; you only know that your disk reached 90% utilization in the meantime, and when one of your data streams needed to roll over, a new backing index was created, but no shards could be allocated to any of your nodes.</li><li><strong>With AutoOps:</strong> The system notifies you when your cluster turns red (1). 
Looking at the timeline, you immediately realize that this happened because you failed to take action on the previous watermark events that AutoOps raised, namely the high watermark events (2) that started happening recently, and the low watermark ones (3) that built up over the past few days. It is now straightforward for you to know what you need to do to get your cluster back to green.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/445dac65e7fb0f061a4bff87d4a1a2cb4ee39140-1600x583.png" style="width: 100%;" /><p><strong>Higher signal to noise ratio</strong></p><p>Keeping your Elasticsearch cluster healthy is most probably your main concern. Yet, it’s not uncommon for the health status to sometimes flap between green and yellow (and sometimes red), and the cause is not always worthy of your time.</p><ul><li><strong>With Stack Monitoring:</strong> The built-in “Cluster Health” alert will continue to be raised on each health transition from green to either yellow or red. In some situations, like frequent index creations, this can create a lot of repeated and undesired noise. Also, and more importantly, there’s no distinction between a yellow and a red status.</li><li><strong>With AutoOps:</strong> There are dedicated “Status Red” and “Status Yellow” events, with different severities. The latter can be customized in many different ways to fit your use case, as shown in the screenshot below:<ol><li>Since the cluster health can turn yellow only for a brief duration, you can decide for how long to ignore the yellow status before being notified (for example, five minutes, in the screenshot below).</li><li>Furthermore, there are a lot of legitimate operations that Elasticsearch does all the time and that make the cluster turn yellow. 
You can pick any of the operations you don’t want to be notified about when your cluster turns yellow because of them (for example, adding replicas, relocating replicas, or closing or opening an index, among others).</li><li>Finally, and most importantly, if you have several clusters to manage, you don’t need to configure this for all of them separately; you simply decide to which ones this configuration should be applied. Simple, powerful!</li></ol></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e2b2826b4c674e4a5924d1f8a9b855606c735fa9-1542x1518.png" style="width: 100%;" /><p><strong>More insightful correlations and comparisons of node metrics</strong></p><p>When your cluster runs on more than just a handful of nodes, you often need to see how they perform against each other, especially when looking at search and indexing performance.</p><ul><li><strong>With Stack Monitoring:</strong> The Nodes list doesn’t allow you to focus on a specific data tier and doesn’t provide any search or indexing performance metrics that you can sort on. These metrics are available, but only once you drill into a specific node, which doesn’t let you easily compare node performance across the cluster.</li><li><strong>With AutoOps:</strong> The Nodes view allows you to select nodes from a specific data tier. It also provides you with over 50 metrics visualizations, among them search and indexing performance, which give you exactly the visual cues you need to understand how the nodes perform relative to one another and whether any struggling nodes require your attention. 
In the screenshot below, we can see that some nodes are indexing at double the rate of some others and that search latency on some nodes has climbed to four times that of the fastest nodes.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1eec2db4931ca2992579a5501dd939fc8a552709-1600x1377.png" style="width: 100%;" /><p><strong>Quick overview of the main differences</strong></p><p>Here’s a glimpse at the notable differences between AutoOps and Stack Monitoring, but you can find a more detailed breakdown in our <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops-vs-stack-monitoring#compare-capabilities">official documentation</a>.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Capability</td><td style="border: 1px solid #ccc; padding: 4px;">Stack Monitoring</td><td style="border: 1px solid #ccc; padding: 4px;">AutoOps</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cluster, node, and index metrics</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Real-time dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Multi-cluster overview</td><td style="border: 1px solid #ccc; padding: 4px;">No</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Root cause analysis</td><td style="border: 1px solid #ccc; padding: 4px;">No</td><td style="border: 1px solid #ccc; padding: 4px;">Automated analysis</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Fix recommendations</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">In-context Elasticsearch commands</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Performance tuning insights</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">Yes, based on usage patterns</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Alerts and notifications</td><td style="border: 1px solid #ccc; padding: 4px;">14 built-in alerts and 27 connectors</td><td style="border: 1px solid #ccc; padding: 4px;">100+ customizable alerts and 7 connectors</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Template and mapping analysis</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">Detects mapping misconfigurations</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Setup infrastructure</td><td style="border: 1px solid #ccc; padding: 4px;">Dedicated monitoring cluster</td><td style="border: 1px solid #ccc; padding: 4px;">Five-minute installation. No extra infrastructure and no cost; only a lightweight agent installation is needed.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Other stack components</td><td style="border: 1px solid #ccc; padding: 4px;">Kibana, Logstash, Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
              </tbody>
            </table>
          <h2><strong>Start now: Five-minute installation</strong></h2><p>Connecting your cluster takes minutes, regardless of your license type:</p><ol><li>Log in to your free Elastic Cloud account, or <a href="https://cloud.elastic.co/registration?onboarding_service_type=ccm-autoops">sign up</a> for one.</li><li>Choose how to connect your cluster: Elastic Cloud on Kubernetes (ECK), Kubernetes, Docker, or Linux.</li><li>Enter your Elasticsearch cluster endpoint, and run the single command to install and run the lightweight Elastic agent.</li><li>Access AutoOps in your <a href="https://cloud.elastic.co/home">Elastic Cloud account</a>.</li></ol><p>For more details on AutoOps and instructions on connecting your self-managed cluster, read our <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-autoops-as-cloud-connected">product documentation</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bf040d977b515ed6448ae780ede31beff701a4f6-1600x861.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7a5a01c8ec70dd467870aa06ac037a51bf8c86dc-1600x898.png" style="width: 100%;" /><h2><strong>Reach out if you have any questions</strong></h2><p>Feel free to reach out to us to share your questions and ideas via our <a href="https://elasticstack.slack.com/archives/C0A80TKS71D">Slack community</a>, by posting on our <a href="https://discuss.elastic.co/c/elastic-stack/monitoring/103">Discuss forum</a>, or by clicking the “Give Feedback” button on the AutoOps product page. 
If you’re connecting a paid self-managed Platinum or Enterprise cluster, you can contact support within your Elastic Cloud account.</p><h2><strong>Read more</strong></h2><p>If you’re interested in learning more about AutoOps and what it can do for you, please head to the <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-autoops-as-cloud-connected">official AutoOps documentation</a> and the following Elastic Search Labs articles:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/autoops-elasticsearch-easy-operations">AutoOps makes every Elasticsearch deployment simple(r) to manage</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-autoops-self-managed-elasticsearch">AutoOps: A journey to simplify self-managed Elasticsearch management</a></li><li><a href="https://www.elastic.co/search-labs/blog/slow-search-elasticsearch-query-autoops">Leveraging AutoOps to detect long-running search queries</a></li></ul><p><a href="https://cloud.elastic.co/registration?onboarding_service_type=ccm-autoops"><strong>Start using AutoOps for free</strong></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build a "chat with your website data" agent with Jina Embeddings v5 and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build a working, grounded "chat with your website" experience in under an hour by using Elasticsearch Serverless, Jina Embeddings v5, Elastic Open Web Crawler, and Elastic Agent Builder.]]></description>
            <link>https://www.elastic.co/search-labs/blog/chat-website-data-elasticsearch-jina-ai</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/chat-website-data-elasticsearch-jina-ai</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Ugo Sangiorgi, Lily Adler]]></dc:creator>
            <pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6e3ea84419f1edc60b5e2bfcb7e06693810d03a0-512x268.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Build a “chat with your website” experience in under an hour using <strong>Elasticsearch Serverless</strong>, <strong>Jina Embeddings v5</strong>, <strong>Elastic Open Web Crawler</strong>, and <strong>Elastic Agent Builder</strong>.</p><p>By the end, you’ll have a working agent that can search your crawled pages, cite relevant passages, and answer questions grounded in your content, no custom chunking or embedding pipeline required.</p><p>In this guide, you’ll:</p><ol><li>Start an <strong>Elasticsearch Serverless</strong> project.</li><li>Create an index using the new <code>semantic_text</code> field powered by <strong>Jina Embeddings v5</strong>.</li><li>Crawl any website using<strong> <a href="https://github.com/ugosan/elastic-crawler-control">Elastic Crawler Control (a.k.a. Crawly)</a></strong> (an open source UI + API wrapper around the <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html">Elastic Open Web Crawler</a>).</li><li>Chat with that data using the <strong>Elastic Agent Builder</strong> in Kibana.</li></ol><p><strong>What you’ll walk away with:</strong></p><ul><li>A repeatable pattern you can point at <em>any</em> website/docs source.</li><li>Chat that stays grounded in your content.</li></ul><h2>Prerequisites</h2><ul><li>An <strong>Elasticsearch Serverless (Search)</strong> project + an <strong>API key</strong> with write permissions.</li><li><strong>Docker + Docker Compose</strong> (to run the crawler UI).</li><li><code>git </code>(to clone the repo).</li></ul><h2>1. Start an Elasticsearch Serverless project</h2><p>First, we need a serverless project to host our data.</p><p>1. Log in to your <a href="https://cloud.elastic.co">Elastic Cloud Console</a>.</p><p>2. Click <strong>Create project</strong>.</p><p>3. Select <strong>Search</strong> as the project type. (This type is optimized for vector search and retrieval.)</p><p>4. 
Give it a name (for example, <code>es-labs-jina-guide</code>), and click <strong>Create</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6a138b12f85727872cd8105d954b5ae95f9af428-1600x1135.png" style="width: 100%;" /><p>5. <strong>Important:</strong> Save the <strong>Elasticsearch endpoint</strong> and <strong>API Key</strong> provided when the project is created. You’ll need these for the crawler.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7c43c7f91d141658dbaff852f452d64853d7ddb8-1536x298.png" style="width: 100%;" /><h2>2. Create the index</h2><p>Elasticsearch Serverless supports <code>semantic_text</code>, which handles chunking and embedding generation automatically. We’ll use the <code>.jina-embeddings-v5-text-small</code> model that’s hosted on GPUs on <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service</a>.</p><p>Create the index with the <code>semantic_text</code> fields shown below. This tells Elasticsearch to automatically vectorize content indexed into those fields using the inference endpoint specified by <code>inference_id</code>.</p><p>In <strong>Kibana Dev Tools</strong>, run:</p><pre><code>PUT furnirem-website
{
  "mappings": {
    "_meta": {
      "description": "Each document represents a web page with the following schema: 'title' and 'meta_description' provide high-level summaries; 'body' contains the full text content; 'headings' preserves the page hierarchy for semantic weighting. URL metadata is decomposed into 'url_host', 'url_path', and 'url_path_dir1/2/3' to allow for granular filtering by site section (e.g., 'blog' or 'tutorials'). 'links' contains extracted outbound URLs for discovery. Crawl timestamp: 2026-01-26T12:54:16.347907."
    },
    "properties": {
      "body_content": { 
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small" 
          }
        }
      },
      "headings": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small"
          }
        }
      }
    }
  }
}</code></pre><h2>3. Run the Elastic Open Crawler</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6d5796c8c3752d25f0c64f905570812f16332f1a-640x398.gif" style="width: 100%;" /><p><strong>Crawly</strong> is one example of how an application can be constructed around the functionalities that the Open Web Crawler provides.</p><p>The application wraps the Elastic Open Crawler in a FastAPI service that manages crawler processes and persists execution data. A React front end provides the interface for configuring and monitoring crawls.</p><p>What happens under the hood is that the crawler service (check <a href="https://github.com/ugosan/elastic-crawler-control/blob/main/crawler-service/app/crawler.py">crawler.py</a>) spawns JRuby processes via <code>subprocess.Popen</code>, allowing multiple concurrent crawls. Each execution&#x27;s configuration, status, and logs are persisted to disk (for now).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c20b5055d57f6fe2db4eada6c737d95b19a721f4-1250x960.png" style="width: 100%;" /><p>Clone the repository:</p><pre><code>git clone https://github.com/ugosan/elastic-crawler-control</code></pre><p>Create an <strong>env.local</strong> file with your Elasticsearch credentials:</p><pre><code>ES_URL=https://your-elasticsearch-endpoint.es.cloud
ES_API_KEY=your_api_key_here</code></pre><p>Start the services:</p><pre><code>docker-compose up</code></pre><p>Access the UI at <strong>http://localhost:16700</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3858d3764fc7fd5ac217545dfba94c83a0096fb2-1600x820.png" style="width: 100%;" /><p>You don’t necessarily need <code>seed_urls </code>unless you want to be specific, so your config can be as simple as below:</p><pre><code>{
  "domains": [
    {
      "url": "https://furnirem.com"
    }
  ],
  "max_crawl_depth": 3,
  "max_unique_url_count": 500,
  "output_index": "furnirem-website"
}</code></pre><p>From there, you can start a crawl on any website and check its progress:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f98187d216397973ccc786226466ff8b931118f0-1600x443.png" style="width: 100%;" /><p>Once it&#x27;s finished, we’re ready to query the content in Elasticsearch directly or use the pages you just crawled for chatting with the website on Agent Builder.</p><h2>4. Chat with data in Kibana</h2><p>Now that the data is indexed and vectorized, we can start chatting with the data using the <strong>Elastic Agent Builder</strong>.</p><ol><li>Open <strong>Kibana</strong>, and navigate to <strong>Agents</strong> (under the &quot;Search&quot; section).</li><li><strong>Test the agent</strong>:<ul><li>In the chat window, ask a question, like,<em>&quot;What is the difference between sparse and dense vectors?&quot;</em></li></ul></li></ol><p>The agent will search your Jina-embedded data, retrieve the relevant snippets from the Search Labs blog posts, and generate an answer.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eebb72090181c5b18dadcc3751f3084857fe1dac-1600x1363.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e84a862c6051fa7eda507d4470678ec5dd78eb1-1560x1174.png" style="width: 100%;" /><p>You can also chat with the data directly via Kibana API:</p><pre><code>POST kbn://api/agent_builder/converse/async
{
  "input": "What is the difference between sparse and dense vectors?",
  "agent_id": "elastic-ai-agent",
  "conversation_id": "<CONVERSATION_ID>"
}</code></pre><p>Use<strong> </strong><code>conversation_id </code>to resume an existing conversation with an agent in Elastic Agent Builder. If you don’t provide it on the initial request, the API starts a new conversation and returns a newly generated ID in the streaming response.</p><h2>Summary</h2><p>You now have a working “<strong>chat with your website</strong>” stack: Your site gets crawled, indexed, auto-embedded with <code>semantic_text </code>+ Jina v5, and surfaced through an agent in Kibana that answers questions grounded in your pages.</p><p>From here, you can point the same setup at docs, support content, or internal wikis and iterate on relevance in minutes.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-embeddings-v5-text: Compact state-of-the-art text embeddings for search and intelligent applications]]></title>
            <description><![CDATA[Introducing jina-embeddings-v5-text models, including jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, and explaining how to use these multilingual embedding models via Elastic Inference Service (EIS).]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens, Sofia Vasileva]]></dc:creator>
            <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e98dd30ab925b4bb32830228d71a1a51d02a0917-1600x840.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Jina AI and Elastic are releasing <code>jina-embeddings-v5-text</code>, a family of new, high-performance, compact text embedding models with state-of-the-art performance for models of comparable size across all major task types.</p><p>The family includes two models:</p><ul><li><code>jina-embeddings-v5-text-small</code></li><li><code>jina-embeddings-v5-text-nano</code></li></ul><p>These models are the result of an innovative new training recipe for embedding models. Both outperform models many times their size, saving memory and computing resources while responding to requests faster.</p><p>The <code>jina-embeddings-v5-text-small</code> model has 677M parameters, supports a 32768-token input context window, and produces 1024-dimension embeddings by default.</p><p><code>jina-embeddings-v5-text-nano</code> weighs in at roughly a third of its sibling&#x27;s size, with 239M parameters and an 8192-token input context window, yielding slender 768-dimension embeddings.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model name</td><td style="border: 1px solid #ccc; padding: 4px;">Total size</td><td style="border: 1px solid #ccc; padding: 4px;">Input context window size</td><td style="border: 1px solid #ccc; padding: 4px;">Embedding size</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-v5-text-small</td><td style="border: 1px solid #ccc; padding: 4px;">
677M params</td><td style="border: 1px solid #ccc; padding: 4px;">
32768 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">1024 dims</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-v5-text-nano</td><td style="border: 1px solid #ccc; padding: 4px;">
239M params</td><td style="border: 1px solid #ccc; padding: 4px;">
8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">768 dims</td>
                    </tr>
              </tbody>
            </table>
          <p>These two models are the best in class for overall MMTEB (<a href="https://huggingface.co/spaces/mteb/leaderboard">Multilingual MTEB</a>) benchmark performance. Among models with under 500M parameters, <code>jina-embeddings-v5-text-nano</code> is the top performer, despite having fewer than 250M parameters, and <code>jina-embeddings-v5-text-small</code> is the leader among multilingual embedding models with under 750M parameters.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d97126285fa196c3045b1913b7754e4b186c1e4c-1300x1100.png" style="width: 100%;" /><p>These models are available via Elastic Inference Service (EIS), via an online API, and for local hosting. For instructions on how to access <code>jina-embeddings-v5-text</code> models, see the “<strong>Getting started</strong>” section, below.</p><p>Embedding models and semantic indexing dramatically increase the accuracy of search algorithms but also have a variety of other uses for tasks involving semantic similarity and meaning extraction, for example:</p><ul><li>Finding duplicate texts.</li><li>Recognizing paraphrases and translations.</li><li>Topic discovery.</li><li>Recommendation engines.</li><li>Sentiment and intention analysis.</li><li>Spam filtering.</li><li>And many others.</li></ul><h2><strong>Features</strong></h2><p>This new model family has a number of features designed to improve relevance and reduce costs.</p><h3>Task optimization</h3><p>We’ve optimized the <code>jina-embeddings-v5-text</code> models for four broad task types:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Example use cases</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Searching with natural language queries and retrieving the most relevant matches in a collection of documents.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Text matching</td><td style="border: 1px solid #ccc; padding: 4px;">Semantic similarity, deduplication, paraphrase and translation alignment, and more.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Clustering</td><td style="border: 1px solid #ccc; padding: 4px;">Topic discovery, automatic organization of document collections.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Classification</td><td style="border: 1px solid #ccc; padding: 4px;">Document categorization, sentiment and intent detection, similar tasks.</td>
                    </tr>
              </tbody>
            </table>
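<p>The retrieval and text matching tasks above both come down to comparing embedding vectors, usually by cosine similarity. The sketch below illustrates the idea with tiny made-up vectors; the document texts and all vector values are invented for illustration (a real model such as <code>jina-embeddings-v5-text-small</code> produces 1024-dimension embeddings). It also previews the Matryoshka-style truncation discussed later in this article: keeping only a prefix of each vector often preserves the ranking.</p>

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy 4-dimensional "embeddings" (invented values, for illustration only).
docs = {
    "superhero film": [0.9, 0.1, 0.0, 0.1],
    "cooking recipe": [0.0, 0.8, 0.6, 0.0],
    "caped crusader saves city": [0.8, 0.2, 0.1, 0.2],
}
query = [1.0, 0.0, 0.0, 0.1]

# Retrieval: rank documents by similarity to the query embedding.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # the semantically closest document comes first

# Matryoshka-style truncation: keep only the first half of each vector.
# With embeddings trained for truncation, the ranking usually survives.
ranked_half = sorted(docs, key=lambda d: cosine(query[:2], docs[d][:2]),
                     reverse=True)
print(ranked == ranked_half)
```

<p>In Elasticsearch, this same comparison is what a semantic or kNN query performs at scale over an index of stored vectors.</p>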
          <p>Optimizing for one task usually means compromising on another, so most embedding models are competitive for only one kind of task. But the <code>jina-embeddings-v5-text</code> models specialize in all four areas without compromise, thanks to task-specific <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (LoRA) adapters</a> trained for each one.</p><p>LoRA adapters are a kind of plugin for an AI model that changes its behavior dramatically while adding only slightly to the total size. Instead of maintaining an entire model for each task, each with hundreds of millions of parameters, the <code>jina-embeddings-v5-text</code> model family lets you use just one model with a compact LoRA adapter per task. This saves memory, storage space, and inference costs.</p><h3>Truncating embeddings</h3><p>We’ve trained the <code>jina-embeddings-v5-text</code> models using <a href="https://arxiv.org/abs/2205.13147">Matryoshka Representation Learning</a>, which lets you cut your embeddings down to smaller sizes at minimal cost to their quality.</p><p>By default, <code>jina-embeddings-v5-text-small</code> generates 1024-dimension embedding vectors, each dimension represented by a 16-bit number, making every embedding 2KB in size. For a large collection of documents, this is a lot of data to store, and the cost of searching a vector database grows with both the size of the database and the number of dimensions in each stored vector.</p><p>But you can simply halve the size of the embeddings (throw away 512 of the 1024 dimensions) to take up half the space and roughly double search speed. This comes at a cost: throwing away information reduces precision. 
But as the graph below shows, even discarding half of the embedding reduces performance only slightly:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f11969682c81bcf22b5b29a6204ba6ec40471b3d-900x700.png" style="width: 100%;" /><p>As long as your embeddings have at least 256 dimensions, the loss in precision should remain fairly small. Below that level, however, relevance and accuracy deteriorate quickly.</p><p>Truncating embeddings like this lets users manage their own trade-offs between accuracy and computing costs. It gives you the tools to get big efficiency gains and large cost savings out of your search AI.</p><h3>Robust quantization</h3><p><em>Quantization</em> is another way of reducing the size of embeddings. Instead of throwing away part of each embedding, quantization reduces the precision of the numbers in the embedding. The <code>jina-embeddings-v5-text</code> models generate embeddings with 16-bit numbers, but we can round those numbers off, reducing their precision and the number of bits needed to store them. In the most extreme case, we can reduce each number to one bit (0 or 1), compressing <code>jina-embeddings-v5-text</code>’s default 1024-dimension embeddings from 2 kilobytes to 128 bytes, a 94% reduction from binary quantization alone. Just like truncation, this produces large savings in memory and computing costs. However, also like truncation, quantization makes embeddings less accurate.</p><p>We’ve trained the <code>jina-embeddings-v5-text</code> models to work with Elasticsearch’s <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/bbq">Better Binary Quantization</a> to minimize that loss of accuracy, and benchmark tests of binarized embeddings from these models show performance almost equal to that of their non-binarized equivalents. 
Consult <a href="https://arxiv.org/abs/2602.15547">the technical report</a> for detailed ablation studies of binarization performance.</p><h3>Multilingual performance</h3><p>Many embedding models are multilingual because they’ve been trained on materials that include large numbers of languages. But that doesn’t mean that they all perform equally well in all supported languages.</p><p>We identified 211 languages in the MMTEB multilingual benchmark and separated them so we could compare our models to similar models on a language-by-language basis. The image below summarizes our results as a heat map. Each patch is a language (identified by its ISO-639 code), and the greener it is, the better the model performed compared to the average of similar models:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852ac5d0f1977bb0c1124d87f8863a9bb94eb7da-1600x765.png" style="width: 100%;" /><p>Although accuracy varies between languages, the <code>jina-embeddings-v5-text</code> models are state-of-the-art or nearly so across most of the world’s languages.</p><p>For details about multilingual performance, see the <a href="https://arxiv.org/abs/2602.15547"><code>jina-embeddings-v5-text</code> technical report</a>.</p><h2><strong>Jina in Elastic: State-of-the-art native AI for search</strong></h2><p>With <code>jina-embeddings-v5-text</code> models on EIS, you can run high-performance multilingual embedding models natively in Elasticsearch with fully managed, GPU-accelerated inference and no infrastructure to provision or scale. <code>jina-embeddings-v5-text</code> models extend the growing EIS model catalog with compact, multilingual models powered by the latest developments in AI. 
These models have state-of-the-art performance on information retrieval and standard data analysis benchmarks, and they offer unequaled, globe-spanning multilingual support.</p><p>With two models of vastly different sizes, users can determine which one is best suited for their applications and budgets. Furthermore, with robust embeddings that remain performant when truncated to smaller sizes or quantized to lower precision, <code>jina-embeddings-v5-text </code>models provide opportunities for further concrete savings in storage and computing costs as well as in processing latency.</p><p>With the <code>jina-embeddings-v5-text</code> family, Jina Reranker, and Elastic’s fast vector and BM25 search, users now have access to end-to-end, state-of-the-art <a href="https://www.elastic.co/docs/solutions/search/hybrid-search">hybrid search</a> from Elastic. When you need the most relevant results, whether for retrieval augmented generation (RAG) pipelines, search applications, or data analysis, Elastic with Jina search AI models provides solid and cost-effective quality.</p><h2><strong>Getting started</strong></h2><p>The <code>jina-embeddings-v5-text</code> models are fully integrated into <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">EIS</a>, and you can use them by setting the <strong><code>type</code> </strong>field to<strong> <code>semantic_text</code></strong> when creating your index and specifying the model (<code>jina-embeddings-v5-text-small</code> or <code>jina-embeddings-v5-text-nano</code>) in the <code>inference_id</code><strong> </strong>field, as in this example:</p><pre><code>PUT multilingual-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v5-text-small"
      }
    }
  }
}

# Ingest data about France
POST multilingual-semantic-index/_doc
{
  "content": "The capital of France is Paris"
}
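
# The models are multilingual, so a semantic query in another
# language should match the same English document:
GET multilingual-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "Quelle est la capitale de la France ?"
    }
  }
}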

GET multilingual-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "What is the French capital?"
    }
  }
}</code></pre><p>Elasticsearch automatically selects the appropriate LoRA adapter during indexing and retrieval. The embedding dimensions (see the “<strong>Truncating embeddings”</strong> section, above) can be set when <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">creating a custom inference endpoint</a>.</p><p>See the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector">Elasticsearch documentation</a> for more information on using <strong><code>jina-embeddings-v5-text</code> </strong>models.</p><h2><strong>More information</strong></h2><p>To learn more about <code>jina-embeddings-v5-text</code> models, read the <a href="https://jina.ai/news/jina-embeddings-v5-text-distilling-4b-quality-into-sub-1b-multilingual-embeddings/">release notes on the Jina AI blog</a> and the <a href="https://arxiv.org/abs/2602.15547">technical report</a>, with more detailed technical information about performance and Jina AI’s innovative new training procedure. For information about downloading and running these models locally, visit the <a href="https://huggingface.co/collections/jinaai/jina-embeddings-v5-text"><code>jina-embeddings-v5-text</code> collection page</a> on Hugging Face.</p><p>Jina AI models are available under a <a href="https://spdx.org/licenses/CC-BY-NC-4.0">CC-BY-NC-4.0 license</a>, so you are free to download them and try them out, but for commercial use, please contact <a href="https://www.elastic.co/contact?pg=global&amp;plcmt=nav&amp;cta=205352">Elastic sales</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ensuring semantic precision with minimum score]]></title>
            <description><![CDATA[Improve semantic precision by employing minimum score thresholds. The article includes concrete examples for semantic and hybrid search. ]]></description>
            <link>https://www.elastic.co/search-labs/blog/semantic-precision-minimum-score</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/semantic-precision-minimum-score</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mattias Brunnert]]></dc:creator>
            <pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8b3b5910abfe16d48d309341a0027008b16c4340-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Semantic search has opened up a world of opportunities for search relevance. High-quality sparse and dense models, such as ELSER, E5, and Jina Embeddings v4, return relevant results based on the meaning of words, rather than keyword matching. However, semantic search sometimes returns irrelevant results at the tail end, or for queries that lack relevant results in the index. This property of sparse and dense models can confuse users or waste precious tokens for large language models (LLMs).</p><p>In this article, you’ll learn how you can use the minimum score parameter to increase the precision of your semantic search results. If you’d like to test the examples provided in this blog post, go to <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/ensuring-semantic-precision-with-minimum-score/ensuring_semantic_precision_with_minimum_score.ipynb">the associated Jupyter notebook</a>.</p><h2>Background: Precision and recall</h2><p>In search relevance, <em>precision </em>and <em>recall </em>are key concepts. Any reader not already familiar is highly encouraged to read up on them. Following is a summary.</p><ul><li><strong>Precision: </strong>The fraction of returned search results that are relevant to the user.</li><li><strong>Recall: </strong>The fraction of all relevant documents in the corpus that are included in the search result set.</li></ul><p>Or, in other words, precision is returning <strong>only </strong>relevant results, and recall is returning <strong>all </strong>relevant results. As you can imagine, these are often competing requirements. Semantic search tends to have very high recall but can struggle with precision. 
Keep reading to learn how to get around this property.</p><h2>Introducing the minimum score parameter</h2><p>The ‘min_score’ parameter allows us to improve precision by setting a minimum score, which will truncate the result set by removing any matches with a score less than the defined threshold. Following is a simple example:</p><pre><code>GET search-movies/_search
{
  "retriever": {
    "linear": {
      "min_score": 4,
      "retrievers": [
        ...
      ]
    }
  }
}</code></pre><h2>Normalizing the score</h2><p>Setting a minimum score is all well and good; however, not all semantic models return a score suitable for a static threshold. ELSER, for example, returns a score that is unbounded. <a href="https://huggingface.co/intfloat/e5-small#faq">Some</a> dense model scores are densely clustered and only make sense in the context of the specific query.</p><p>For most semantic search cases, we recommend using a normalization approach before applying the ‘min_score’. The normalization ensures that the document score is within a defined interval. Elasticsearch retrievers provide two such <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/linear-retriever#linear-retriever-normalizers">normalizers</a>, ‘l2_norm’ and ‘minmax’. The most commonly used is ‘minmax’, since it’s easy to understand and works well in many scenarios. Key properties of ‘minmax’ include:</p><ul><li>Document scores are distributed between 0–1.</li><li>The highest scoring document is always scored as 1.</li><li>The lowest scoring document is always scored as 0.<ul><li>This can make it less suitable for keyword search. See “Hybrid search” section for further discussion.</li></ul></li></ul><p>Following is an example of a normalized semantic query with <code>min_score</code>. Rank window size has been increased to 500 to allow us to return a longer list of search results, starting at 100.</p><pre><code>GET search-movies/_search
{
  "size": 100,
  "_source": [
    "title", "overview"
  ],
  "retriever": {
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><p>The size has been set to a higher value than normally seen in production. This is so we can inspect the quality of search results and tune the results.</p><h2>Hybrid search using the linear retriever</h2><p>For hybrid search, the simplest approach is to normalize all scores, assign weights, and apply a minimum score. Note that by choosing weights with a sum of 1, you keep the total score within a range of 0–1. This makes it easy to make sense of the final scores and tune <code>min_score</code>. Following is an example:</p><pre><code>GET search-movies/_search
{
  "size": 100,
  "_source": ["title", "overview","keywords"],
  "retriever": {
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "weight": 0.6,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        },
        {
          "weight": 0.4,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "multi_match": {
                  "query": "superhero movie",
                  "fields": ["overview","keywords", "title"],
                  "type": "cross_fields",
                  "minimum_should_match": "2"
                }
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><h2>Hybrid search using RRF</h2><p>With BM25, we often control precision through other means, such as using the <code>AND</code> operator or <code>minimum_should_match</code>. In addition, queries consisting of single, precise, rare terms naturally produce result sets with few hits, often all highly relevant. This can lead to:</p><ul><li>Results further back in the result set being assigned a low normalized score in the BM25 retriever, even if their absolute BM25 score is close to that of the top-scoring hits.</li><li>When a very low BM25 score is added to the semantic score, the total can be approximated as the semantic score alone.</li><li>The lack of BM25 score contribution can cause the document to be discarded by the <code>min_score</code> threshold.</li></ul><p>As a solution, we can instead use reciprocal rank fusion (RRF) to combine BM25 and semantic results. RRF gets around the challenge of comparing scores from different search algorithms by focusing instead on each document’s position in each result set. In this scenario, <code>min_score</code> is applied only to the semantic retriever.</p><pre><code>GET search-movies/_search
{
  "_source": ["title", "overview","keywords"],
  "retriever": {
    "rrf": {
      "rank_window_size": 500,
      "retrievers": [
        {
          "linear": {
            "rank_window_size": 500,
            "min_score": 0.25,
            "retrievers": [
              {
                "normalizer": "minmax",
                "retriever": {
                  "standard": {
                    "query": {
                      "semantic": {
                        "field": "overview_vector",
                        "query": "superhero movie"
                      }
                    }
                  }
                }
              }
            ]
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "superhero movie",
                "fields": ["overview", "keywords","title"],
                "type": "cross_fields",
                "minimum_should_match": "2"
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><h2>Conclusion</h2><p>By using <code>min_score</code>, we’ve shown how we can reduce the number of false positives in our result sets caused by the high recall of semantic search algorithms. To learn more about retrievers, please see this <a href="https://www.elastic.co/search-labs/blog/elasticsearch-retrievers">blog post</a> and the <a href="https://www.elastic.co/docs/solutions/search/retrievers-overview">Elasticsearch documentation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dependency management on Kubernetes]]></title>
            <description><![CDATA[How to streamline dependency management on Kubernetes using Renovate CLI and Argo Workflows.]]></description>
            <link>https://www.elastic.co/search-labs/blog/dependency-management-kubernetes</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/dependency-management-kubernetes</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Nikos Fotiou]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/00519840e6eec7101c1fb096afcae976ee0c454e-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>This is how we built a self-hosted dependency management platform using Kubernetes, Argo Workflows, Argo Events, and Renovate CLI to automate updates, quickly address Common Vulnerabilities and Exposures (CVEs), and efficiently propagate new package versions across thousands of repositories.</p><h2><strong>Dependency management at Elastic</strong></h2><p>At Elastic, we have to manage hundreds or even thousands of repositories, both private and public. When a critical CVE is discovered, we need immediate answers and actions: Which repositories are vulnerable? How quickly can we patch them? Apart from security, productivity questions also arise: How can we quickly propagate the release of a new package version across all the repositories that depend on it without spending too much time on manual tasks?</p><p>The initial trigger for exploring dependency management was the need to establish a secure foundation with automated updates for <a href="https://www.elastic.co/blog/reducing-cves-in-elastic-container-images">reducing CVEs</a>. After carefully evaluating dependency management solutions, we started with a self-hosted infrastructure, using our own Kubernetes cluster to run Mend Renovate Community Self-Hosted. The idea was to provide a dependency management platform that our users could access in a self-service manner.</p><p>The initial experiment was successful, and more and more teams started onboarding onto our platform and using it in their repositories’ everyday lifecycle for updates and CVE patching. 
This happened so fast that we soon hit the ceiling of our self-hosted installation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e14d9f98d4eccaa08a335d5bd23d88e5debbb344-1600x1103.png" style="width: 100%;" /><h3><strong>The challenge: How can we scale a dependency management platform in a large organization with a significant number of repositories?</strong></h3><p>Our dependency management platform processed one repository at a time, and that sequential processing model couldn’t keep up with the large number of repositories we own. We had identified the core issue: the assumption that <strong>a single instance</strong> of our dependency management tool could process our large and ever-growing list of repositories. Repositories waited in a queue, sometimes for many hours. More than 50% of our repositories were not even processed daily; in other words, they waited more than 24 hours between scans.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/45ade5bda08f82bed0b3d0d3736cbd6f056e7a4e-1312x816.jpg" style="width: 100%;" /><p>Large repositories created larger bottlenecks, due to their sizable codebases and their multiple open PRs. GitHub webhook events disrupted the sequence. Automerge became unreliable because scan timing was unpredictable. 
We had made a promise to our users about scan frequency, and we couldn’t fulfill it.</p><h3><strong>The decision to build in-house: Meeting Elastic&#x27;s unique scale and security needs</strong></h3><p>While we considered commercial options, including <strong>Mend&#x27;s Renovate Enterprise Self-Hosted edition</strong>, internally at Elastic we had a few key initiatives ramping up.</p><p>Our decision to build an in-house platform was driven by the recognition that only a deeply customized solution could meet Elastic&#x27;s specific, nonnegotiable requirements:</p><ol><li><strong>Investing in our internal developer platform:</strong> At the time, we had already started heavily investing in our internal developer platform and were designing ways for each of our services to fit into it. This meant we wanted to test-drive our own rules and practices for our dependency management platform. On top of that, new guidelines were coming into play, and we wanted to design the platform ahead of them.</li><li><strong>Native integration and workflow customization:</strong> We required straightforward integration with our internal tooling and processes. For example, we wanted to centralize configuration as code with our Service Catalog (Backstage). We have specific needs around our usage of Backstage that we wanted our platform to be compatible with. So, although it would be possible to use the Renovate Self-Hosted APIs alongside our Backstage automation, this wouldn’t completely cover our internal processes.</li><li><strong>Elastic-specific defense-in-depth security:</strong> Our stringent security compliance required bespoke security mechanisms tailored to our ecosystem. 
We were working to <a href="https://entro.security/blog/how-elastic-scaled-secrets-nhi-security-elastics-playbook-from-visibility-to-automation/">harden our usage of “non-human identities.”</a> This hardening meant that our nonstandard means of authenticating to GitHub wouldn’t work with an off-the-shelf tool that didn’t support the internal implementation. Our workflow included implementing a parent-child workflow secret encryption pattern and using transient, single-use GitHub tokens. Building in-house was the only practical way to embed these unique security layers and minimize the attack surface across our complex multicloud environment.</li></ol><h2><strong>The solution: Workflow orchestration for dependency management</strong></h2><p>Our solution started from the decision to build on the dependency management tool we already used rather than replace it. It had shown its potential, and its flexibility is important for the different needs across our organization. We considered several alternatives, and what made up our minds was the large, and sometimes unusual, set of needs we have to cover. We decided to build a reliable and scalable dependency management platform in which each repository is processed on its own, removing bottlenecks and setting us up for growth.</p><p>We designed the platform abiding by three core principles:</p><h3><strong>1. Parallel processing</strong></h3><p>Every repository gets its own dependency management processing environment. No more queues. Our concurrency is limited only by the resources we allocate. We have also applied smart distributed scheduling to avoid being rate limited by GitHub.</p><h3><strong>2. Self-serviceable</strong></h3><p>We use our Service Catalog (Backstage) to automatically onboard and manage any new repository. 
We use our own resource definition to give end users the option to select how often a repository is processed, how many resources to allocate to its schedules, and whether to turn processing off or back on for any reason. We plan to add more options that way as our users’ needs evolve and they get more fluent with the new installation.</p><h3><strong>3. Reduced secret scope and namespace isolation</strong></h3><p>For increased security, we supply our dependency management pods with ephemeral GitHub tokens generated at the start of each workflow. On top of that, we isolate our workloads in specific namespaces so they can be provided only the necessary secrets. We control which secrets each dependency management workflow can access using Kubernetes RBAC. We also use encryption to propagate the GitHub token from the parent to the child workflows.</p><p>We rebuilt our platform on Kubernetes: Argo Workflows powers the logic of our processes, and Renovate CLI scans and processes one repository at a time.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5560ed20e2bd9ecdd574a9c835126d12b24c332f-1600x1157.png" style="width: 100%;" /><p><strong>The beauty:</strong> We’re using battle-tested open source projects in an original way, providing new working examples for all of those projects while amplifying development velocity and consolidating CVE reduction for our teams.</p><h2><strong>Dependency management architecture: Four microservices</strong></h2><p>The platform comprises four custom-built components:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2b3d4046c05bb261e45d40c59f864eb51fb9eaa9-1217x1600.png" style="width: 100%;" /><h3><strong>Workflows Operator (Go/Kubebuilder)</strong></h3><p>A Kubernetes operator managing workflow lifecycle through three Custom Resource Definitions 
(CRDs):</p><ul><li><strong>RepoConfig CRD:</strong> Single source of truth for repository configuration.</li></ul><p>This is how RepoConfig is defined in the operator:</p><pre><code>// RepoConfig is the Schema for the repoconfigs API
type RepoConfig struct {
	metav1.TypeMeta `json:",inline"`

	// metadata is a standard object metadata
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty,omitzero"`

	// spec defines the desired state of RepoConfig
	// +required
	Spec RepoConfigSpec `json:"spec"`

	// status defines the observed state of RepoConfig
	// +optional
	Status RepoConfigStatus `json:"status,omitempty,omitzero"`
}</code></pre><p>And this is what an instance of RepoConfig would look like:</p><pre><code>apiVersion: workflows.elastic.co/v1
kind: RepoConfig
metadata:
  generation: 3
  name: elastic-test-repo
  namespace: dependency-management-operator
spec:
  owner: group:my-team
  renovate:
    config:
      resourceGroup: SMALL
      runFrequency: 4h
    enabled: true
  repository: elastic/test-repo</code></pre><ul><li><strong>Parent CRD:</strong> Manages CronWorkflows for scheduled scans.</li></ul><p>Inside the reconciliation loop of the parent controller, we make sure that workflow settings are created and kept up to date or even deleted if needed.</p><p>First, it gets some globally configured settings for workflows:</p><pre><code>func (r *ParentReconciler) reconcileSubResources(ctx context.Context, req ctrl.Request, parent *workflowsv1.Parent) error {
	logger := logf.FromContext(ctx)
	logger.Info("Reconcile SubResources for Parent", "name", req.NamespacedName)
	wfSet := workflowsettings.WorkflowSettings{
		RunFrequency:   parent.Spec.RunFrequency,
		ResourceGroups: "parent",
	}</code></pre><p>It makes sure a mutex configmap is up to date to prevent similar workflows from running together:</p><pre><code>	cfMngr := resources.NewConfigMapManager(r.Client, r.Scheme, r.OperatorConfig.ParentNamespace)
	err := cfMngr.CreateOrUpdateSyncMutexConfigmap(ctx, fmt.Sprintf("%s%s", r.OperatorConfig.ResourcesPrefix, r.OperatorConfig.SyncMutexCfgMapName), strings.TrimPrefix(parent.Spec.Repository, "elastic/"), r.OperatorConfig.SemaphoreConcurrencyLimit)</code></pre><p>Then it creates a Workflow Manager, the struct that will create or update the CronWorkflows and the Workflow Templates:</p><pre><code>	wfMngr := resources.NewArgoWorkflowManager(r.Client,
		r.Scheme,
		curateResourceName(
			strings.ReplaceAll(parent.Spec.Repository, "/", "-"),
		),
		parent.Namespace,
		"parent-workflow",
		false).
		WithOrganization(r.OperatorConfig.GitHubOrg).
		WithRepoName(parent.Spec.Repository).
		Init(true, true).
		WithPrefix(r.OperatorConfig.ResourcesPrefix).
		WithWfTemplateName(r.OperatorConfig.ParentWorkflowTemplate).
		WithResources(wfSet.GetResourceCategory()).
		WithSchedule(wfSet.GetCronSchedule()).
		WithImagePullSecrets([]corev1.LocalObjectReference{{
			Name: r.OperatorConfig.WorkflowImagePullSecrets,
		}}).
		AddArgument(true, true, "extra_cli_args").
		SetArgument(true, false, "extra_cli_args", "none").
		AddTemplate(resources.NewParentDAGTemplateInstance()).
		AddTemplate(resources.NewWorkflowsTemplateInstance("check-child-workflows", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddTemplate(resources.NewWorkflowsTemplateInstance("security", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddTemplate(resources.NewWorkflowsTemplateInstance("submit-child-workflow", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector))
	wfMngr.OverWriteCommand("submit-child-workflow", r.OperatorConfig.ChildNamespace)
	wfMngr.OverwriteWfTemplateName("parent-wftmpl")
	wfMngr.AddSynchronization(fmt.Sprintf("%s%s", r.OperatorConfig.ResourcesPrefix, r.OperatorConfig.SyncMutexCfgMapName), "{{workflow.parameters.repo_name}}")
	err = wfMngr.CreateOrUpdateCronWorkflow(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update cron workflow: %w", err)
	}
	err = wfMngr.CreateOrUpdateWorkflowTemplate(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update workflow template: %w", err)
	}
	return nil</code></pre><ul><li><strong>Child CRD:</strong> Manages WorkflowTemplates with per-repository resources.</li></ul><p>The child controller has a similar reconciliation duty to the parent, but this time it’s responsible for workflow templates in the child namespace that will be triggered by the parent workflows.</p><pre><code>func (r *ChildReconciler) reconcileSubResources(ctx context.Context, req ctrl.Request, child *workflowsv1.Child) error {
	logger := logf.FromContext(ctx)
	logger.Info("Reconcile SubResources for Child", "name", req.NamespacedName)
	wfSet := workflowsettings.WorkflowSettings{
		ResourceGroups: child.Spec.ResourceCategory,
	}
	wfMngr := resources.NewArgoWorkflowManager(r.Client,
		r.Scheme,
		curateResourceName(
			strings.ReplaceAll(child.Spec.Repository, "/", "-"),
		),
		child.Namespace,
		"runner",
		true).
		Init(false, true). // only manage workflow template
		WithPrefix(r.OperatorConfig.ResourcesPrefix).
		WithSuffix("-child-wftmpl").
		WithRepoName(child.Spec.Repository).
		WithOrganization(r.OperatorConfig.GitHubOrg).
		WithResources(wfSet.GetResourceCategory()). // will override resources of presets if set
		WithImagePullSecrets([]corev1.LocalObjectReference{{
			Name: r.OperatorConfig.WorkflowImagePullSecrets,
		}}).
		AddTemplate(resources.NewWorkflowsTemplateInstance("runner", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddArgument(false, true, "repo_full_name").
		AddArgument(false, true, "repo_name").
		AddArgument(false, true, "encrypted_token").
		AddArgument(false, true, "extra_cli_args")
	wfMngr.OverWriteCommand("runner", r.OperatorConfig.ChildNamespace)
	err := wfMngr.CreateOrUpdateWorkflowTemplate(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update workflow template: %w", err)
	}
	return nil
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ac70492a1266ba02cb8afbafc5a486cb38a0edc-1600x1290.png" style="width: 100%;" /><p>The multi-controller pattern provides clear separation: RepoConfig Controller handles onboarding/offboarding, Parent Controller manages scheduling, and Child Controller handles execution templates.</p><h3><strong>GitHub Events Gateway (Go)</strong></h3><p>A secure webhook proxy that receives GitHub webhooks, verifies signatures, filters by organization/repository, and routes to Argo Events. We built 10 distinct sensors responding to dependency dashboard interactions, PR events, and package updates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4828625456cbd6efa8020a20f10d23f294f98a02-1306x1600.png" style="width: 100%;" /><p>This gateway enables integration with GitHub Apps by:</p><ul><li>Verifying incoming GitHub webhook signatures for security.</li><li>Forwarding valid events to the Argo Events EventSource with all relevant headers and authentication.</li><li>We also configure an authSecret on the EventSource and provide this as a Bearer header in forwarded requests.</li><li>Providing logging, metrics, and retry logic.</li></ul><p>It performs various validations on each GitHub Event request.</p><p>It makes sure some HTTP attributes are present:</p><pre><code>// ValidateRequestMethod checks if the request method is POST.
func ValidateRequestMethod(r *http.Request) error {
	if r.Method != http.MethodPost {
		return fmt.Errorf("method not allowed, only POST is accepted")
	}
	return nil
}

// ValidateRequiredHeaders checks for required GitHub headers.
func ValidateRequiredHeaders(r *http.Request) error {
	eventType := r.Header.Get("X-GitHub-Event")
	deliveryID := r.Header.Get("X-GitHub-Delivery")
	signature := r.Header.Get("X-Hub-Signature-256")
	if eventType == "" || deliveryID == "" || signature == "" {
		return fmt.Errorf("missing required GitHub headers")
	}
	return nil
}

// ValidateUserAgent checks that the User-Agent header starts with GitHub-Hookshot/
func ValidateUserAgent(r *http.Request) error {
	userAgent := r.Header.Get("User-Agent")
	if !strings.HasPrefix(userAgent, "GitHub-Hookshot/") {
		return fmt.Errorf("invalid User-Agent")
	}
	return nil
}</code></pre><p>It also validates the signature of each request and checks that the sending organization is allowed:</p><pre><code>// ValidateSignature verifies the GitHub webhook signature.
func ValidateSignature(r *http.Request, secret string) ([]byte, error) {
	payload, err := github.ValidatePayload(r, []byte(secret))
	if err != nil {
		return nil, fmt.Errorf("invalid GitHub signature: %w", err)
	}
	return payload, nil
}

// ValidateAllowedOwner checks if the organization login is in the allowed organizations list.
func ValidateAllowedOwner(payload []byte, allowedGitHubOrganizations []string) (string, error) {
	var orgLogin string
	var payloadMap map[string]any
	if err := json.Unmarshal(payload, &payloadMap); err == nil {
		if orgObj, ok := payloadMap["organization"].(map[string]any); ok {
			if login, ok := orgObj["login"].(string); ok {
				orgLogin = login
			} else if name, ok := orgObj["name"].(string); ok {
				orgLogin = name
			}
		}
	}
	if !slices.Contains(allowedGitHubOrganizations, orgLogin) {
		return orgLogin, fmt.Errorf("organization login not allowed")
	}
	return orgLogin, nil
}</code></pre><p>Finally, it routes to Argo Events based on event type:</p><pre><code>	// Map eventType to Argo `EventSource` path
	var endpoint string
	switch eventType {
	case "push":
		endpoint = "/push"
	case "issues":
		endpoint = "/issues"
	case "pull_request":
		endpoint = "/pull-requests"
	default:
		slog.Info("Ignoring unhandled event type", "event_type", eventType, "delivery_id", deliveryID)
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
		return
	}
	forwardURL := h.config.ArgoEventSourceForwardURL + endpoint</code></pre><p>On the Argo Events side of things, 10 sensors watch the Argo Events EventBus for new events:</p><pre><code>apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: {{ .Values.sensors.packageUpdateOnDefaultBranch.name }}
  namespace: {{ .Release.Namespace }}
spec:
  eventBusName: {{ .Values.eventBus.name }}</code></pre><p>Then the script applies each sensor’s logic:</p><pre><code>script: |
          local e = event
          if not e or not e.body or not e.body.repository then
            return false
          end

          -- e.g., "refs/heads/main"
          local ref = e.body.ref
          local default_branch = e.body.repository.default_branch
          if not ref or not default_branch then
            return false
          end

          local expected = "refs/heads/" .. default_branch
          if ref ~= expected then
            return false
          end

        {{- if .Values.sensors.packageUpdateOnDefaultBranch.packageFiles }}
          patterns = { {{- range $i, $f := .Values.sensors.packageUpdateOnDefaultBranch.packageFiles }}{{ if $i }}, {{ end }}"{{ $f }}"{{- end }} }
        {{- end }}

          local function anyMatch(path)
            if type(path) ~= "string" then return false end
            for _, pat in ipairs(patterns) do
              -- match filename at repo root, or anywhere under subdirs
              if path:match(pat) or path:match(".+/" .. pat) then
                return true
              end
            end
            return false
          end

          local function filesContainPackage(paths)
            if type(paths) ~= "table" then return false end
            for _, p in ipairs(paths) do
              if anyMatch(p) then return true end
            end
            return false
          end

          -- Inspect all commits (GitHub includes added/modified/removed lists)
          local commits = e.body.commits
          if type(commits) ~= "table" then
            -- Fallback: some payloads include only head_commit
            commits = {}
            if type(e.body.head_commit) == "table" then
              table.insert(commits, e.body.head_commit)
            end
          end

          for _, c in ipairs(commits) do
            if filesContainPackage(c.added) or filesContainPackage(c.modified) or filesContainPackage(c.removed) then
              return true
            end
          end

          return false</code></pre><h3><strong>Backstage Syncer (Go)</strong></h3><p>This polls our Service Catalog (Backstage) for Repository Real Resource Entities, transforms them into RepoConfig CRDs, and keeps the platform in sync with configuration changes. Changes apply within three minutes.</p><pre><code>repoMap := make(map[string]map[string]interface{})
			for i := range entities {
				entity := &entities[i]
				if entity.Spec.Type != "github-repository" {
					continue
				}

				implRaw, err := json.Marshal(entity.Spec.Implementation)
				if err != nil {
					logger.Error("Failed to marshal implementation", "error", err)
					continue
				}

				var implMap map[string]interface{}
				err = json.Unmarshal(implRaw, &implMap)
				if err != nil {
					logger.Error("Failed to unmarshal implementation map", "error", err)
					continue
				}
				var repoName string
				if specMap, ok := implMap["spec"].(map[string]interface{}); ok {
					if repo, ok := specMap["repository"].(string); ok {
						repoName = repo
					}
				}
				if repoName == "" {
					continue
				}

				var workflowsRaw []byte
				if v, ok := implMap["spec"].(map[string]interface{}); ok {
					if r, ok := v["renovate"]; ok {
						workflowsRaw, _ = json.Marshal(r)
					} else {
						workflowsRaw = []byte(`{}`)
					}
				} else {
					workflowsRaw = []byte(`{}`)
				}

				var workflowsWithDefaults schema.WorkflowsMetadata
				err = json.Unmarshal(workflowsRaw, &workflowsWithDefaults)
				if err != nil {
					logger.Error("Failed to unmarshal workflows config", "error", err)
					continue
				}

				workflowsMap := map[string]interface{}{
					"enabled":        workflowsWithDefaults.Enabled,
					"require_pr":     workflowsWithDefaults.RequirePr,
					"resource_group": string(workflowsWithDefaults.ResourceGroup),
					"run_frequency":  string(workflowsWithDefaults.RunFrequency),
				}
				repoMap[repoName] = map[string]interface{}{
					"renovate": workflowsMap,
					"owner":    entity.Spec.Owner,
				}
			}
			logger.Info("Fetched GitHub Repository data from Backstage", "repository_count", len(repoMap), "status_code", resp.StatusCode)</code></pre><p>Finally, it writes that data into RepoConfig instances.</p><h3><strong>Workflows base (Mixed: JavaScript, Go, Helm)</strong></h3><p>The foundation layer contains Helm charts, JavaScript configs, a Go wrapper for Renovate CLI with encryption support, and a custom APK Indexer for Alpine packages.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/908d19278face3ce1119dbee9146c1264b6e2f30-1600x873.png" style="width: 100%;" /><h2><strong>Self-service configuration</strong></h2><p>Teams configure their repositories declaratively through Backstage:</p><pre><code>spec:
  renovate:
    enabled: true
    config:
      resourceGroup: LARGE      # SMALL | MEDIUM | LARGE  
      runFrequency: "0 */4 * * *"  # Every 4 hours</code></pre><p>Resource groups allocate CPU and memory based on repository size:</p><ul><li><strong>SMALL:</strong> 500m CPU, 1Gi memory.</li><li><strong>MEDIUM:</strong> 1000m CPU, 2Gi memory.</li><li><strong>LARGE:</strong> 2000m CPU, 4Gi memory.</li></ul><p>Configuration is version-controlled, auditable, and applies automatically.</p><h2><strong>The parent-child pattern</strong></h2><p>The execution model uses a parent-child workflow pattern:</p><ul><li><strong>Parent workflow:</strong> Lightweight CronWorkflow running on schedule. Encrypts secrets, determines whether a scan should run, passes configuration to the child.</li><li><strong>Child workflow:</strong> Ephemeral pod where Renovate CLI runs. Allocated resources dynamically, decrypts secrets in isolation, terminates after completion.</li></ul><p>This separation provides security (secrets encrypted at parent level), resource optimization (parents use minimal resources), and scalability (children run in parallel).</p><h2><strong>The results</strong></h2><h3><strong>Performance transformation</strong></h3><ul><li><strong>Before:</strong> One repository at a time, some repositories would not get processed possibly even for a day or more, less than 1,000 scans per day.</li><li><strong>After:</strong> 100+ concurrent scans, usually 8,000 scans and up to 10,000 recorded scans per day, limited only by the amount of resources we’re willing to spend and how we handle GitHub rate limits.</li></ul><h3><strong>Cost efficiency</strong></h3><p>However weird it may sound, running 8,000 pods a day can get you the same result much cheaper than having one long-running pod trying to achieve the same results.</p><p>In the previous setup, we were running a single instance that, on a good day, would perform 500–600 scans. At the same time, due to the fact that different kinds of repositories would be executed on the same pod, we needed to size the pod for the biggest ones. 
That sizing would be much bigger than our current extra-large offering: 8 CPUs and 16 GB of memory for the pod.</p><p>To match the current daily output, the single pod would need to run for 12 days. Comparing the cost of that single pod running for 12 days against 8,000 of our “MEDIUM”-size pods running each day, the new design is far more efficient for the same output of scans:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario A (Workflows)</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario B (The long-running single pod)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Setup</td><td style="border: 1px solid #ccc; padding: 4px;">8,000 pods (1 vCPU / 2GB)</td><td style="border: 1px solid #ccc; padding: 4px;">1 pod (8 vCPU / 16 GB)*</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Duration</td><td style="border: 1px solid #ccc; padding: 4px;">10 minutes each</td><td style="border: 1px solid #ccc; padding: 4px;">12 days continuous</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total work time</td><td style="border: 1px solid #ccc; padding: 4px;">1,333 compute hours</td><td style="border: 1px solid #ccc; padding: 4px;">288 compute hours</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total cost</td><td style="border: 1px solid #ccc; padding: 4px;">$65.83</td><td style="border: 1px solid #ccc; padding: 4px;">$113.75</td>
                    </tr>
              </tbody>
            </table>
          <p>However, let’s take into account that the default for our workloads is “SMALL,” with the great majority running successfully on 0.5 CPU and 1 GB of RAM, and only a few needing to move up to “MEDIUM” or “LARGE.” Let’s see what happens if 60% of our workloads run at “SMALL,” 30% at “MEDIUM,” and 10% at “LARGE,” which is closer to the truth.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario A (Mixed swarm)</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario B (The long runner)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Strategy</td><td style="border: 1px solid #ccc; padding: 4px;">8,000 pods (mixed sizes)</td><td style="border: 1px solid #ccc; padding: 4px;">1 pod (8 vCPU / 16 GB)*</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Duration</td><td style="border: 1px solid #ccc; padding: 4px;">10 minutes each</td><td style="border: 1px solid #ccc; padding: 4px;">12 days continuous</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total cost</td><td style="border: 1px solid #ccc; padding: 4px;">$52.66</td><td style="border: 1px solid #ccc; padding: 4px;">$113.75</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Savings</td><td style="border: 1px solid #ccc; padding: 4px;">$61.09 (54% cheaper)</td><td style="border: 1px solid #ccc; padding: 4px;">—</td>
                    </tr>
              </tbody>
            </table>
          <p>We can see that, for the same output, we’re far more cost-efficient in our current setup.</p><h3><strong>Enhanced security</strong></h3><ul><li>Ephemeral GitHub tokens (minutes of exposure versus days).</li><li>Namespace isolation with Role-Based Access Control (RBAC) boundaries.</li><li>Secret encryption at rest in parent workflows.</li><li>Removed direct vault access.</li></ul><h3><strong>Predictable performance</strong></h3><p>With guaranteed scan frequency, we can finally set Service Level Objectives (SLOs). Automerge works reliably. Teams trust the platform to deliver what’s promised.</p><h2><strong>Key architectural decisions</strong></h2><p>Here are some of the milestone design decisions that shaped how the platform looks.</p><ul><li><strong>Why parent-child workflows?</strong></li></ul><p>We adopted this pattern to enforce a <strong>defense-in-depth</strong> strategy. By restricting high-value credentials (such as GitHub App secrets) to a dedicated, locked-down namespace, we use <strong>RBAC</strong> to ensure that ephemeral execution pods cannot arbitrarily access sensitive data. Recent supply chain vulnerabilities (for example, the <strong>&quot;Shai Hulud&quot;</strong> continuous integration/continuous delivery [CI/CD] attacks) have demonstrated the criticality of isolating runtime environments that execute dynamic scripts from the credential store.</p><p>Simultaneously, this decoupling enables <strong>granular resource optimization</strong>. The &quot;parent&quot; workflows act as lightweight orchestrators with a minimal footprint, while the &quot;child&quot; workflows handle the compute-intensive dependency scanning. 
This separation simplifies <strong>lifecycle management</strong> by allowing us to apply distinct reconciliation logic to each layer, granting users control over execution parameters (child) while retaining administrative control over the scheduling and security infrastructure (parent).</p><ul><li><strong>Why self-serviceable?</strong></li></ul><p>Eliminating our team as a bottleneck for repository configuration was a critical requirement. Our mission was to architect a scalable, <strong>self-service platform</strong> capable of supporting diverse use cases. We recognized that acting as <strong>gatekeepers</strong> for every configuration change was unsustainable, given the sheer volume of repositories. Instead, we adopted a philosophy of enablement: providing the “rails” (infrastructure and <strong>guardrails</strong>) while empowering users to drive the “trains” (execution and customization). We believe this shift toward <strong>team autonomy</strong> significantly enhances productivity by allowing users to tailor the system to their specific operational needs.</p><ul><li><strong>Why Kubernetes Operator pattern?</strong></li></ul><p>As mentioned above, a foundational design principle was to ensure that the platform was fully <strong>self-serviceable</strong>. We required an automated mechanism to capture user intent (such as toggling scans, adjusting scheduling frequency, or tuning runtime resource limits) and instantly propagate those changes to the underlying workflows. Anticipating future requirements, the system also needed to be easily <strong>extensible</strong>.</p><p>To achieve this, we developed a custom <strong>Dependency Management Kubernetes Operator</strong>. By using <strong>CRDs</strong> as the interface for configuration, we established a <strong>Kubernetes-native reconciliation loop</strong>. 
This operator continuously monitors the desired state defined by the user and automatically orchestrates the necessary updates to the workflow infrastructure. This ensures an <strong>event-driven</strong>, seamless operation, where the platform logic handles all complexity behind the scenes.</p><ul><li><strong>Why design a GitHub Events Gateway?</strong></li></ul><p>Adopting an <strong>event-driven architecture (EDA)</strong> was essential for the platform&#x27;s responsiveness. While CronWorkflows provided a reliable baseline schedule, we required the agility to handle <strong>ad hoc executions, </strong>such as users manually triggering scans via the dashboard. To achieve this, we needed a dedicated <strong>ingestion gateway</strong> to validate payload integrity and route requests intelligently.</p><p>We evaluated existing solutions, including the native GitHub EventSource for Argo, but we identified significant risks regarding <strong>operational overhead</strong> and strict <strong>GitHub API quotas</strong> (for example, webhook limits per repository). Consequently, we built a custom gateway to decouple our infrastructure from these limitations.</p><p>Crucially, this gateway served as a strategic <strong>traffic control point</strong> during our migration. It acted as a switch, enabling us to perform a <strong>gradual, granular rollout</strong> (traffic shifting) from the legacy system to the new infrastructure. This ensured that onboarding thousands of repositories was a controlled, risk-free process rather than a “big bang” switchover.</p><p></p><h2><strong>Lessons learned</strong></h2><p>Some lessons that we learned go hand-in-hand with the <a href="https://www.elastic.co/about/our-source-code">Elastic Source Code</a>:</p><ol><li><strong>Customer First: </strong>Platforms are built for users. So it’s important to take users’ needs as priority number one. 
This shapes the platform into efficiently designed infrastructure and applications that reduce friction with users, simplify the scaling of the platform, and ease adoption.</li><li><strong>Space, Time: </strong>Sometimes the path of least resistance leads to <strong>shifting sands</strong>. We initially tried to optimize the existing sequential processing model, but this failed to resolve our issues; in fact, it only introduced more complexity and loose ends. The bold decision to <strong>rearchitect</strong> the platform with parallel processing required significant up-front effort. However, it ultimately paved the way for sustainable platform growth and virtually eliminated tedious daily administrative work.</li><li><strong>IT, Depends: </strong>A platform cannot operate in isolation; its success depends on how well it integrates with the broader ecosystem. In our case, integration with <strong>Backstage</strong> was critical, as it serves as the source of truth for seamless service onboarding. Similarly, connecting to <strong>Artifactory</strong> allowed us to manage private package updates efficiently, and the list of essential integrations goes on.</li><li><strong>Progress, SIMPLE Perfection: </strong>Throughout the implementation, we constantly pressure-tested our initial assumptions and adapted to new barriers as they emerged. Rather than getting paralyzed by perfectionism, we adopted an <strong>iterative approach</strong>, tackling challenges one by one and adjusting our migration strategy to meet real-world conditions.</li></ol><h2><strong>What’s next</strong></h2><p>The delivery of the platform frees us up for more meaningful work that will help us improve the UX and efficiency of our platform. Some examples are:<br/></p><ul><li><strong>Increase and guardrail the adoption of auto-merge</strong></li></ul><p>The auto-merge feature significantly accelerates team velocity by eliminating tedious manual tasks. 
However, we need to make sure that strict <strong>guardrails</strong> are in place to ensure that this increased speed does not come at the expense of security.<br/></p><ul><li><strong>Improve observability around end-user experience</strong></li></ul><p>A critical priority for our roadmap is enhancing observability, not just at the platform level but also specifically from the <strong>end-user’s perspective</strong>. While capturing infrastructure metrics is straightforward, understanding the actual user experience requires deeper insights. We’re working to define core user-centric key performance indicators (KPIs) so our telemetry can detect friction points and performance issues <strong>before</strong> they escalate into user complaints.</p><ul><li><strong>Remove barriers for greater adoption</strong></li></ul><p>Looking ahead, our priority is to identify and remove any barriers hindering platform adoption. Whether this requires developing new integrations or deploying specific feature sets, we’re committed to data-driven planning. We’ve successfully built a platform designed for scale; our focus now shifts to <strong>maximizing its potential</strong>.<br/></p><h2><strong>The bigger picture</strong></h2><p>The dependency management workflows project demonstrates a broader principle: <strong>When you need to scale open source tools beyond their default deployment model, Kubernetes-native patterns provide a path forward</strong>.</p><p>By embracing:</p><ul><li>CRDs for configuration.</li><li>Operators for lifecycle management.</li><li>Event-driven architecture for responsiveness.</li><li>GitOps for deployment.</li></ul><p>We built orchestration that scales independently of the number of repositories it manages. The performance of scanning one repository is the same whether we’re managing 100 or 1,000.</p><p>When a critical CVE is announced, we now have answers in minutes, not hours. 
That’s the difference between a bottleneck and a competitive advantage.</p><h2><strong>Acknowledgments</strong></h2><p>This platform builds on excellent open source tools:</p><ul><li><strong>Kubebuilder:</strong> The open source framework we used to kick-start our Kubernetes Operators that bootstrap and orchestrate our workflows. [<a href="https://github.com/kubernetes-sigs/kubebuilder">1</a>][<a href="https://book.kubebuilder.io/">2</a>]</li><li><strong>Backstage:</strong> The open source framework on which we’ve built our Service Catalog and which we use as our source of truth. [<a href="https://github.com/backstage/backstage">1</a>][<a href="https://backstage.io/">2</a>]</li><li><strong>Argo Workflows and Argo Events:</strong> The open source suite we used to orchestrate complex processes and add dynamic processing based on events. [<a href="https://github.com/argoproj/argo-workflows">1</a>][<a href="https://argo-workflows.readthedocs.io/en/stable/">2</a>][<a href="https://argoproj.github.io/argo-events/">3</a>][<a href="https://github.com/argoproj/argo-events">4</a>]</li><li><strong>Renovate CLI:</strong> The open source dependency management tool processing our repositories. [<a href="https://github.com/renovatebot/renovate">1</a>][<a href="https://docs.renovatebot.com/getting-started/running/">2</a>]</li></ul><p>* The AWS Fargate pricing model was used as a reference for the cost of a single pod, although our workloads do not necessarily run on AWS and instead run on full-blown Kubernetes clusters.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Better text analysis for complex languages with Elasticsearch and neural models]]></title>
            <description><![CDATA[Using neural models and the Elasticsearch inference API to improve search in Hebrew, German, Arabic, and other morphologically complex languages.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-text-analysis-neural-model</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-text-analysis-neural-model</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Noam Schwartz]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a572f9832d8ebc603b70743ac8f2d6e4ea8d2e11-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>If you work with English search, standard text analysis usually just works. You index “running,” the analyzer strips the suffix to store “run,” and a user searching for “run” finds the document. Simple.</p><p>But if you work with languages like Hebrew, Arabic, German, or Polish, you know that standard rule-based analyzers often fail. They either under-analyze (missing relevant matches) or overanalyze (returning garbage results).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2484b1f7ce600fbbf75b76a12a67cdfdf9b6e6ab-800x600.jpg" style="width: 100%;" /><p>For years, we’ve had to rely on complex dictionaries and fragile regex rules. Today, we can do better. By replacing rule-based logic with <strong>neural models for text analysis</strong> (small, efficient language models that understand context), we can drastically improve search quality.</p><p>Here’s how to solve the morphology challenge by using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"><strong>Elasticsearch inference API</strong></a> and a custom model service.</p><h2><strong>The problem: Why rules fail</strong></h2><p>Most standard analyzers are <strong>context-free</strong>. They look at one word at a time and apply a static set of rules.</p><ul><li><strong>Algorithmic analyzers</strong> (like Snowball) strip suffixes based on patterns.</li><li><strong>Dictionary analyzers</strong> (like Hunspell) look up words in a list.</li></ul><p>This approach breaks down when the structure of a word (its root and affixes) changes based on the sentence it lives in.</p><h3><strong>1. The semitic ambiguity (roots versus prefixes)</strong></h3><p>Semitic languages, like Hebrew and Arabic, are built on root systems and often attach prepositions (such as, in, to, or from) directly to the word. 
This creates ambiguous tokens that rule-based systems cannot solve.</p><ul><li><strong>Word:</strong> <code>בצל</code> (B-Tz-L).</li><li><strong>Context A:</strong> “The soup tastes better with <strong>onion</strong> (<em>batzal</em>).”</li><li><strong>Context B:</strong> “We sat <strong>in the shadow</strong> (<em>ba-tzel</em>) of the tree.”</li></ul><p>In Context A, <code>בצל</code> is a noun (onion). In Context B, it’s a preposition ב (in) attached to the noun <code>צל</code> (shadow).</p><p>A standard analyzer is forced to guess. If it aggressively strips the ב prefix, it turns &quot;onion&quot; into &quot;shadow.&quot; If it’s conservative and leaves it alone, a user searching for &quot;shadow&quot; (<em>tzel</em>) will fail to find documents containing &quot;in the shadow&quot; (<em>batzel</em>). Neural models solve this by reading the sentence to determine whether the ב is part of the root or a separate preposition.</p><h3><strong>2. The compound problem (German, Dutch, and more)</strong></h3><p>Languages like German, Dutch, Swedish, and Finnish concatenate nouns without spaces to form new concepts. This results in a theoretically infinite vocabulary. To search effectively, you must split (decompound) these words.</p><ul><li><strong>Word:</strong> <code>Wachstube</code>.</li><li><strong>Split A:</strong> <code>Wach</code> (guard) + <code>Stube</code> (room) = guardroom.</li><li><strong>Split B:</strong> <code>Wachs</code> (wax) + <code>Tube</code> (tube) = wax tube.</li></ul><p>A dictionary-based decompounder acts blindly. 
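</p><p>To make that blindness concrete, here is a toy shortest-match decompounder in Python (the dictionary and function are purely illustrative; real decompounders are more sophisticated, but share the same limitation):</p><pre><code>def greedy_decompound(word, dictionary):
    """Split a word into dictionary parts, shortest match first."""
    parts, i = [], 0
    while i < len(word):
        for j in range(i + 1, len(word) + 1):  # try the shortest match first
            if word[i:j].lower() in dictionary:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]  # no split found; keep the word whole
    return parts

dictionary = {"car", "pet", "wach", "wachs", "stube", "tube"}
print(greedy_decompound("carpet", dictionary))     # ['car', 'pet']
print(greedy_decompound("Wachstube", dictionary))  # ['Wach', 'stube']</code></pre><p>The rules happily produce a split, but nothing checks whether the split makes sense in context; that is exactly the gap a neural model fills.</p><p>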
If both “Wach” and “Wachs” are in its dictionary, it might pick the wrong split, polluting your index with irrelevant tokens.</p><p>To see this problem in English: A naive algorithm might split “carpet” into “car” + “pet.” Without understanding meaning, rules fail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eefee9dc6206452d362f8f58dc35c793021dcb1e-800x524.jpg" style="width: 100%;" /><h2><strong>The solution: “Neural analyzers” (neural models for text analysis)</strong></h2><p>We don’t need to abandon the inverted index. We just need to feed it better tokens.</p><p>Instead of a regex rule, we use a <strong>neural model</strong> (like BERT or T5) to perform the analysis. Because these models are trained on massive datasets, they understand context. They look at the surrounding words to decide whether <code>בצל</code> means &quot;onion&quot; or &quot;in shadow&quot; or if <code>Wachstube</code> belongs in a military or cosmetic context.</p><h3><strong>Architecture: The inference sidecar</strong></h3><p>We can integrate these Python-based models directly into the Elasticsearch ingestion pipeline using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"><strong>inference API</strong></a>.</p><p><strong>The pattern:</strong></p><ol><li><strong>External model service:</strong> A simple Python service (for example, FastAPI) hosts the model.</li><li><strong><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom">Elasticsearch inference API</a>:</strong> Defines this service as a custom model within Elasticsearch.</li><li><strong>Ingest pipeline:</strong> Sends text to the inference processor, which calls your Python service.</li><li><strong>Index mapping: </strong>Create a <code>whitespace</code> target field for the analyzed text.</li><li><strong>Indexing:</strong> The service returns the cleaned text, which Elasticsearch stores in the target 
field.</li><li><strong>Search:</strong> Queries are analyzed via the inference API before matching.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5c3055a1594f267c676347da36b1b8b2b187220c-1600x1248.png" style="width: 100%;" /><h2><strong>Implementation guide</strong></h2><p>Let’s build this for <strong>Hebrew</strong> (using <code>DictaBERT</code>) and <strong>German</strong> (using <code>CompoundPiece</code>).</p><p>To follow along, you’ll need:</p><ul><li>Python 3.10+.</li><li>Elasticsearch 8.9.x+.</li></ul><p>Install the Python dependencies:</p><pre><code>pip3 install fastapi uvicorn torch transformers</code></pre><h3><strong>Step 1: External model service</strong></h3><p>To connect Elasticsearch to our neural model, we need a simple API service that:</p><ol><li>Receives text from the Elasticsearch inference API.</li><li>Passes it through the neural model.</li><li>Returns analyzed text in a format Elasticsearch understands.</li></ol><p>This service interfaces Elasticsearch with the neural model. At ingest time, the Elasticsearch pipeline calls this API to analyze and store document fields; at search time, the application calls it to process the user&#x27;s query. You can deploy this on any infrastructure, including EC2, Lambda, or SageMaker.</p><p>The code below loads both models at startup and exposes <code>/analyze/hebrew</code> and <code>/analyze/german</code> endpoints:</p><pre><code>from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Union
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from contextlib import asynccontextmanager
import torch

# Global models (loaded once at startup)
he_model = None
he_tokenizer = None
de_model = None
de_tokenizer = None


@asynccontextmanager
async def lifespan(app: FastAPI):
   """Load models at startup."""
   global he_model, he_tokenizer, de_model, de_tokenizer

   print("Loading Hebrew model (DictaBERT-Lex)...")
   he_tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert-lex")
   he_model = AutoModel.from_pretrained("dicta-il/dictabert-lex", trust_remote_code=True)
   he_model.eval()

   print("Loading German model (CompoundPiece)...")
   de_tokenizer = AutoTokenizer.from_pretrained("benjamin/compoundpiece")
   de_model = AutoModelForSeq2SeqLM.from_pretrained("benjamin/compoundpiece")

   if torch.cuda.is_available():
       he_model.to("cuda")
       de_model.to("cuda")

   print("Models loaded successfully!")
   yield
   print("Shutting down...")


app = FastAPI(
   title="Neural Text Analyzer",
   description="Multi-language text normalization service",
   version="1.0.0",
   lifespan=lifespan
)


class InferenceRequest(BaseModel):
   """ES Inference API sends: {"input": ["text1", "text2"]} or {"input": "text"}"""
   input: Union[str, List[str]]


def format_response(normalized_text: str) -> dict:
   """
   Normalize output to OpenAI-compatible format for ES Inference API.
   ES extracts: $.choices[*].message.content. You do not need to stick
   with the OpenAI output format; it is used here for consistency,
   since the endpoint uses the completion task type.
   """
   return {
       "choices": [
           {"message": {"content": normalized_text}}
       ]
   }


@app.post("/analyze/hebrew")
async def analyze_hebrew(request: InferenceRequest):
   """Hebrew lemmatization using DictaBERT-Lex."""
   global he_model, he_tokenizer

   if he_model is None:
       raise HTTPException(status_code=503, detail="Model not loaded")

   # Handle input (can be string or list)
   if isinstance(request.input, str):
       texts = [request.input]
   else:
       texts = request.input

   # Run prediction
   with torch.no_grad():
       results = he_model.predict(texts, he_tokenizer)

   # results format: [[[word, lemma], [word, lemma], ...]]
   if results and results[0]:
       lemmas = []
       for word, lemma in results[0]:
           if lemma == '[BLANK]':
               lemma = word
           lemmas.append(lemma)
       normalized = " ".join(lemmas)
   else:
       normalized = ""

   return format_response(normalized)


@app.post("/analyze/german")
async def analyze_german(request: InferenceRequest):
   """German decompounding using CompoundPiece (supports 56 languages)."""
   global de_model, de_tokenizer

   if de_model is None:
       raise HTTPException(status_code=503, detail="Model not loaded")

   # Handle input
   if isinstance(request.input, str):
       text = request.input
   else:
       text = request.input[0] if request.input else ""

   # Format: "de: <word>" for German
   input_text = f"de: {text}"

   inputs = de_tokenizer(input_text, return_tensors="pt")
   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}

   with torch.no_grad():
       outputs = de_model.generate(**inputs, max_length=128)

   # IMPORTANT: decode outputs[0], not outputs
   result = de_tokenizer.decode(outputs[0], skip_special_tokens=True)

   # Clean up: "de: Donau-Dampf-Schiff" -> "Donau Dampf Schiff"
   # The model echoes a "de:" language prefix (sometimes followed by a
   # space or hyphen); strip it, then turn compound hyphens into spaces.
   if result.startswith("de:"):
       result = result[3:].lstrip(" -")
   clean_result = result.replace("-", " ")

   return format_response(clean_result.strip())


@app.get("/health")
async def health():
   return {"status": "healthy"}</code></pre><p>Save the code above to a file (for example, <code>analyzer_service.py</code>), and run:</p><pre><code>python3 -m uvicorn analyzer_service:app --port 8000</code></pre><p>Wait for “<em>Models loaded successfully!</em>” (takes ~30–60 seconds for models to download on first run).</p><p>Test locally:</p><pre><code>#Hebrew
curl -X POST http://localhost:8000/analyze/hebrew \
 -H "Content-Type: application/json" \
 -d '{"input": "הילדים אכלו גלידה בגינה"}'</code></pre><pre><code>#German
curl -X POST http://localhost:8000/analyze/german \
 -H "Content-Type: application/json" \
 -d '{"input": "Donaudampfschifffahrt"}'</code></pre><p>Expected output:</p><pre><code>- Hebrew: `{"choices":[{"message":{"content":"ילד אוכל גלידה גינה"}}]}`
- German: `{"choices":[{"message":{"content":"Donau Dampf Schiff Fahrt"}}]}`</code></pre><h3><strong>Step 2: Configure Elasticsearch inference API</strong></h3><p>We’ll use the<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"> <code>custom</code></a> inference endpoint. This allows us to define exactly how Elasticsearch talks to our Python endpoint.</p><p><strong>Note:</strong> Use <code>response.json_parser</code> to extract the content from our normalized JSON structure. You <strong>do not</strong> need to stick with the OpenAI output format. We’re using it here for consistency reasons, since we’re using the <em>completion</em> task type, which is text to text.</p><h4><strong>Exposing your local service</strong></h4><p>For testing, we’ll use <a href="https://ngrok.com">ngrok</a> to expose the local Python service to the internet. This allows any Elasticsearch deployment (self-managed, Elastic Cloud, or Elastic Cloud Serverless) to reach your service.</p><p>Install and run ngrok:</p><pre><code># Install ngrok (macOS) (Or download from https://ngrok.com/download)
brew install ngrok</code></pre><p>Expose your local service:</p><pre><code>ngrok http 8000</code></pre><p>ngrok will display a forwarding URL like:</p><p>Forwarding <a href="https://abc123.ngrok.io">https://abc123.ngrok.io</a> -&gt; <a href="http://localhost:8000">http://localhost:8000</a></p><p>Copy the HTTPS URL. You’ll use this in the Elasticsearch configuration.</p><p><strong>Configure the inference endpoint</strong></p><pre><code> PUT _inference/completion/hebrew-analyzer                           
  {
    "service": "custom",
    "service_settings": {
      "url": "https://abc123.ngrok.io/analyze/hebrew",
      "headers": {
        "Content-Type": "application/json"
      },
      "request": "{\"input\": ${input}}",
      "response": {
        "json_parser": {
          "completion_result": "$.choices[*].message.content"
        }
      }
    }
 }</code></pre><p>Replace <a href="https://abc123.ngrok.io">https://abc123.ngrok.io</a> with your actual ngrok URL.</p><p><strong>Note:</strong> ngrok is used here for fast testing and development. The free tier has request limits, and URLs change on restart. For production, deploy your service to a persistent infrastructure.</p><h4><strong>For production (with API Gateway)</strong></h4><p>In production, deploy your Python service to a secure, persistent endpoint (such as AWS API Gateway + Lambda, EC2, ECS, or any cloud provider). Use <code>secret_parameters</code> to securely store API keys:</p><pre><code> PUT _inference/completion/hebrew-analyzer                        
  {
    "service": "custom",
    "service_settings": {
      "url": "https://your-api-gateway.execute-api.region.amazonaws.com/prod/analyze/hebrew",
      "headers": {
        "x-api-key": "${api_key}",
        "Content-Type": "application/json"
      },
      "secret_parameters": {
        "api_key": "YOUR-API-KEY"
      },
      "request": "{\"input\": ${input}}",
      "response": {
        "json_parser": {
          "completion_result": "$.choices[*].message.content"
        }
      }
    }
 }</code></pre><h3><strong>Step 3: Ingest pipeline</strong></h3><p>Create a pipeline that passes the raw text field to our model and stores the result in a new field.</p><pre><code>PUT _ingest/pipeline/hebrew_analysis_pipeline
{
 "description": "Lemmatizes Hebrew text using a custom inference endpoint",
 "processors": [
   {
     "inference": {
       "model_id": "hebrew-analyzer",
       "input_output": {
         "input_field": "content",
         "output_field": "content_analyzed"
       }
     }
   }
 ]
}</code></pre><h3><strong>Step 4: Index mapping</strong></h3><p>This is the most critical step. The output from our neural model is already analyzed. We <strong>do not</strong> want a standard analyzer to mess it up again. We use the <code>whitespace</code> analyzer to simply tokenize the text we received.</p><pre><code>PUT /my-hebrew-index
{
 "mappings": {
   "properties": {
     "content": {
       "type": "text",
       "analyzer": "standard"
     },
     "content_analyzed": {
       "type": "text",
       "analyzer": "whitespace"
     }
   }
 }
}</code></pre><h3><strong>Step 5: Indexing</strong></h3><p><strong>Option A: Single document.</strong></p><pre><code>POST /my-hebrew-index/_doc?pipeline=hebrew_analysis_pipeline
{
"content": "הילדים אכלו גלידה בגינה"
}</code></pre><p><strong>Option B: Reindex existing data.</strong></p><p>If you have existing data in another index, reindex it through the pipeline:</p><pre><code>POST _reindex
{
 "source": {
   "index": "my-old-index"
 },
 "dest": {
   "index": "my-hebrew-index",
   "pipeline": "hebrew_analysis_pipeline"
 }
}</code></pre><p><strong>Option C: Set pipeline as default for index.</strong></p><p>Make all future documents automatically use the pipeline:</p><pre><code>PUT /my-hebrew-index/_settings
{
"index.default_pipeline": "hebrew_analysis_pipeline"
}</code></pre><p>Then index normally (no <code>?pipeline=</code> needed):</p><pre><code>POST /my-hebrew-index/_doc
{
"content": "הילדים אכלו גלידה בגינה"
}</code></pre><h3><strong>Step 6: Search</strong></h3><p>Search with a neural analyzer in Elasticsearch is a two-step process. First, analyze the query using the inference API; then, search with the result:</p><p><strong>A. Analyze the query.</strong></p><pre><code> POST _inference/completion/hebrew-analyzer
 {
   "input": "הילדים אכלו גלידה בגינה"
 }</code></pre><p><strong>B. Search with the result.</strong></p><pre><code> GET /my-hebrew-index/_search
 {
   "query": {
     "match": {
       "content_analyzed": "ילד אוכל גלידה גינה"
     }
   }
  }</code></pre><p>In production, wrap these two calls in your application code for a seamless experience.</p><h2><strong>Available models</strong></h2><p>The architecture above works for any language. You simply swap the Python model and adjust the post-processing of the output. Here are verified models for common complex languages:</p><ul><li><strong>Hebrew:</strong> Context-aware lemmatization. Handles prefix ambiguity (ב, ה, ל, and more). <a href="https://huggingface.co/dicta-il/dictabert-lex">dicta-il/dictabert-lex</a>.</li><li><strong>German: </strong>Generative decompounding. Supports 56 languages, including Dutch, Swedish, Finnish, and Turkish. <a href="https://huggingface.co/benjamin/compoundpiece">benjamin/compoundpiece</a>.</li><li><strong>Arabic:</strong> BERT-based disambiguation and lemmatization for Modern Standard Arabic. <a href="https://github.com/CAMeL-Lab/camel_tools">CAMeL Tools</a>.</li><li><strong>Polish:</strong> Case-sensitive lemmatization for Polish inflections. <a href="https://huggingface.co/amu-cai/polemma-large">amu-cai/polemma-large</a>.</li></ul><h2><strong>Conclusion</strong></h2><p>You don’t need to choose between the precision of lexical search and the intelligence of AI. By moving the “smart” part of the process into the analysis phase using the inference API, you fix the root cause of poor search relevance in complex languages.</p><p>The tools are here. The models are open source. The pipelines are configurable. 
It’s time to teach our search engines to read.</p><h3><strong>Code</strong></h3><p>All code snippets from this article are available at <a href="https://github.com/noamschwartz/neural-text-analyzer">https://github.com/noamschwartz/neural-text-analyzer</a>.</p><p><strong>References</strong>:</p><ul><li><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom">https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom</a></li><li><a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines</a></li><li><a href="https://ngrok.com">https://ngrok.com</a></li><li><a href="https://huggingface.co/dicta-il/dictabert-lex">https://huggingface.co/dicta-il/dictabert-lex</a></li><li><a href="https://huggingface.co/benjamin/compoundpiece">https://huggingface.co/benjamin/compoundpiece</a></li><li><a href="https://arxiv.org/pdf/2305.14214">https://arxiv.org/pdf/2305.14214</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An open‑source Hebrew analyzer for Elasticsearch lemmatization]]></title>
            <description><![CDATA[An open-source Elasticsearch 9.x analyzer plugin that improves Hebrew search by lemmatizing tokens in the analysis chain for better recall across Hebrew morphology.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-lemmatization-hebrew-analyzer</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-lemmatization-hebrew-analyzer</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Lily Adler]]></dc:creator>
            <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d9f79cec59d89f4e14657db7df846ed3104a2da-1024x565.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Hebrew is morphologically rich: Prefixes, inflections, and clitics make exact-token search brittle. This project provides an open-source Hebrew analyzer plugin for Elasticsearch 9.x that performs neural lemmatization in the analysis chain, using an embedded DictaBERT model executed in-process via ONNX Runtime with an INT8-quantized model.</p><h2><strong>Quick start</strong></h2><p>Download the relevant <a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin/releases">release</a> or build and install (Linux build script generates Elasticsearch‑compatible zip):</p><pre><code>./scripts/build_plugin_linux.sh</code></pre><p>Install in Elasticsearch:</p><pre><code>/path/to/elasticsearch/bin/elasticsearch-plugin install file:///path/to/heb-lemmas-embedded-plugin-<ES_VERSION>.zip</code></pre><p>Test:</p><pre><code>curl -k -X POST "https://localhost:9200/_analyze" \
  -H "Content-Type: application/json" \
  -u "elastic:<password>" \
  -d '{"tokenizer":"whitespace","filter":["heb_lemmas","heb_stopwords"],"text":"הילדים אוכלים את הבננות"}'</code></pre><h2>Why Hebrew search is different</h2><p>Hebrew is morphologically rich: Prefixes, suffixes, inflection, and clitics all collapse into a single surface form. That makes naive tokenization insufficient. Without true lemmatization, search quality suffers; users miss relevant results due to simple variations in form. This project tackles that by embedding a Hebrew lemmatization model inside the analyzer itself, so every token passes through a neural model before indexing and querying.</p><h3>Example</h3><p>Users may search for the lemma “בית” (house), but documents might contain:</p><ul><li>בית (a house)</li><li>בבית (in the house)</li><li>לבית (to the house)</li><li>בבתים (in houses)</li><li>לבתים (to houses)</li></ul><p>Without lemmatization, these become different surface tokens; lemmatization normalizes them toward the same lemma (בית), improving recall:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d3dfd40569b3f4bfc79df639a49ae995e92b0bc1-1600x983.png" style="width: 100%;" /><h2>What this plugin does</h2><p>Rather than relying on rule-based stemming, the analyzer runs a Hebrew lemmatization model as part of the Elasticsearch analysis chain and emits one normalized lemma per token. Because the model is neural, it can use local context within each analyzed segment to choose a lemma in ambiguous cases—while still producing stable tokens that work well for indexing and querying. 
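</p><p>For example, you can wire both filters into a custom analyzer when creating an index (the index and analyzer names below are illustrative; the filter names come from the _analyze test above):</p><pre><code>curl -k -X PUT "https://localhost:9200/hebrew-articles" \
  -H "Content-Type: application/json" \
  -u "elastic:<password>" \
  -d '{
    "settings": {
      "analysis": {
        "analyzer": {
          "hebrew_lemma": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["heb_lemmas", "heb_stopwords"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "content": {"type": "text", "analyzer": "hebrew_lemma"}
      }
    }
  }'</code></pre><p>With this mapping, both indexing and querying on <code>content</code> pass through the lemmatizer automatically.</p><p>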
The analyzer:</p><ul><li>Runs a Hebrew lemmatization model inside Elasticsearch.</li><li>Produces better normalized tokens for Hebrew text.</li><li>Supports stopwords and standard analyzer pipelines.</li></ul><h2>The result: Fast, reliable lemmatization</h2><p>This analyzer is optimized for real‑world throughput:</p><ul><li>ONNX Runtime in‑process inference.</li><li>INT8-quantized model for lower latency and memory footprint.</li><li>Java Foreign Function Interface (FFI) for high‑performance native inference.</li></ul><p>The result: fast, reliable lemmatization with predictable operational behavior.</p><p>To evaluate performance, we ran a benchmark in a Docker container (4 cores, 12 GB RAM) on 1 million large documents (5.7 GB of data) from the Hebrew Wikipedia dataset. You’ll find the results below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric (search)</td><td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Value</td><td style="border: 1px solid #ccc; padding: 4px;">Unit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Min throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">409.75</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Mean throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">490.65</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Median throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">491.85</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">496.13</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">7.02242</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">90th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">10.7338</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">19.0406</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99.9th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">27.165</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">7.02242</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">90th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">10.7338</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">19.0406</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99.9th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">27.165</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">0</td><td style="border: 1px solid #ccc; padding: 4px;">%</td>
                    </tr>
              </tbody>
            </table>
          <h2>Open source and Elastic‑ready</h2><p>The plugin is fully open source and works on:</p><ul><li>Elastic open‑source distributions.</li><li>Elastic Cloud.</li></ul><p>You can build it yourself or download prebuilt releases and install it like any other plugin.</p><p>To upload the analyzer plugin to Elastic Cloud, navigate to the <strong>Extensions</strong> section within your Elastic Cloud console and proceed with the upload.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e2e8731aeb884e91624f2c8c0998cf8de08a16e3-1402x1600.png" style="width: 100%;" /><h2>Credits</h2><p>This project is a fork of the Korra.ai Hebrew analysis plugin (MIT), which was implemented by <a href="http://Korra.ai">Korra.ai</a> with funding and guidance from the National NLP Program led by MAFAT and the Israel Innovation Authority.</p><p>This fork focuses on Elasticsearch 9.x compatibility and running lemmatization fully in-process via ONNX Runtime, using an INT8‑quantized model and bundled Hebrew stopwords. Lemmatization is powered by DictaBERT <a href="https://huggingface.co/dicta-il/dictabert-lex"><code>dicta-il/dictabert-lex</code></a> (CC‑BY‑4.0).</p><p>Huge thanks to the Dicta team for making high-quality Hebrew natural language processing (NLP) models available to the community.</p><h2>Links</h2><ul><li><a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin">GitHub repo</a></li><li><a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin/releases">Releases</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch 9.3 adds bfloat16 vector support]]></title>
            <description><![CDATA[Exploring the new Elasticsearch element_type: bfloat16, which can halve your vector data storage.]]></description>
            <link>https://www.elastic.co/search-labs/blog/bfloat16-vector-support-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/bfloat16-vector-support-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Simon Cooper]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f80f988d758f53742f6b4cd13b04d0cb27af7a17-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch 9.3 brings with it several important improvements to vector data, including a new <code>element_type: bfloat16</code>. This has the potential to <em>halve</em> your vector data storage, with minimal reductions in recall and runtime performance for most use cases.</p><h2>Storage formats in dense_vector fields</h2><p>Prior to 9.3, <code>dense_vector</code> fields support vectors of single bits, 1-byte integers, and 4-byte floats. We store the original vectors on top of any quantization and/or hierarchical navigable small world (HNSW) graph used for indexing, and the original vectors make up the vast majority of the required disk space of the vector indices. If your vectors are floating point, then the only option versions of Elasticsearch prior to 9.3 provide is to store 4 bytes per vector value: That’s 4kB for a single 1024-dimensional vector.</p><p>There are other floating-point sizes available, of course: <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE-754</a> specifies floating-point sizes of many different lengths, including the 4-byte <code>float32</code> and 8-byte <code>float64</code> used by Java <code>float</code> and <code>double</code> types. It also specifies a <code>float16</code> format, which only uses 2 bytes per value. However, this only has a maximum value of 65,504, compared to the 3.4×10<sup>38</sup> of 4-byte <code>float32</code> values, and the conversion between the two involves several arithmetic operations.</p><p>As an alternative, many machine learning (ML) applications now use <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a>, which is a modification of IEEE-754 <code>float32</code> to only use 2 bytes. It does this by discarding the lowest 2 bytes of the fractional part of the value, leaving the sign and exponent unchanged. This effectively reduces the precision of the floating-point value <em>without</em> a corresponding reduction in range. 
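</p><p>In Python, the round trip can be sketched as follows (a sketch only; the round-to-nearest-even rounding used here is an assumption, not necessarily Elasticsearch’s exact implementation):</p><pre><code>import struct

def float32_to_bfloat16_bits(x):
    """Truncate a float32 to its 16-bit bfloat16 pattern, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # round half to even
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bfloat16_bits_to_float32(b):
    """Zero-expand a bfloat16 bit pattern back to a float32."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159)))  # 3.140625</code></pre><p>Round-tripping 3.14159 gives 3.140625: close, but not bit-identical, because only mantissa bits are dropped.</p><p>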
The conversion from <code>float32</code> to <code>bfloat16</code> is a simple bitwise truncation on the <code>float32</code> value, with a bit of jiggling to account for rounding.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0f8d88cfb8c696b8ef805424d7dc7a242023484-913x394.png" style="width: 100%;" /><h2>bfloat16 in Elasticsearch 9.3</h2><p>Elasticsearch 9.3 now supports storing vector element types as bfloat16. In memory, it will still process every vector value as a 4-byte float32, as Java does not have built-in support for bfloat16. As it writes vector data to disk, it will simply truncate and round each float32 value to a 2-byte bfloat16, and zero-expand each bfloat16 value back to float32 on reading the value into memory.</p><p>This effectively <em>halves</em> your vector index sizes, as it uses 2 bytes per value rather than 4 bytes. There may be a small performance cost during reading and writing data as Elasticsearch performs the necessary conversions, but this is often counterbalanced by a significant reduction in the I/O required, as the OS now has to read half as much data. And, for most datasets, there is a minimal effect on search recall.</p><p>As an example, this is the difference in sizes for bfloat16 on our <code>dense_vector</code> dataset:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d7d2459cead1b8d29d107a1b6bbbe21be1fc7315-1600x990.png" style="width: 100%;" /><p>So, if your input vectors are already at bfloat16 precision, then happy days! Elasticsearch accepts raw bfloat16 vectors as float values, and as <a href="https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion">Base64-encoded vectors</a>. The vectors are persisted to disk with the same precision as your original source data, immediately halving your data storage requirements.</p><p>If your input vectors are at 4-byte precision, then you can also use bfloat16 format to halve your index data sizes. 
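</p><p>For example, a <code>dense_vector</code> field using the new element type might be declared like this (the index and field names here are placeholders):</p><pre><code>PUT /my-vector-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "element_type": "bfloat16"
      }
    }
  }
}</code></pre><p>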
Elasticsearch will truncate and round each value to 2-byte precision, throwing away the least significant bits of the fraction. This means that the vector values you get back from Elasticsearch won’t be exactly the same as what you originally indexed, so don’t use bfloat16 if you need to maintain the full 4-byte precision of float32.</p><p>Starting in Elasticsearch 9.3, and on Elasticsearch Serverless, you can specify <code>element_type: bfloat16</code> with all <code>dense_vector</code> index types on any newly created indices. If you wish to use bfloat16 with existing indices, you can reindex into an index with <code>element_type: bfloat16</code> and Elasticsearch will automatically convert your existing float vectors to bfloat16.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI agents that perform actions: Automating IT requests with Agent Builder and Workflows]]></title>
            <description><![CDATA[Using Elastic Agent Builder and Workflows to create an AI agent that automatically performs IT actions, such as laptop refreshes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-one-workflow</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-one-workflow</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sri Kolagani, Ziyad Akmal]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80e98ed860633a0a20abcc55ad10b2854a4e8df0-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the world of IT operations, context switching is the enemy of productivity. For internal teams, simple requests, like a laptop refresh or employee onboarding, often require navigating multiple portals, filling out rigid forms, and manually updating information technology service management (ITSM) tools like ServiceNow.</p><p>At a recent <strong>DevFest</strong>, we demonstrated how to bridge the gap between natural language requests and structured IT workflows. By combining <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder"><strong>Elastic Agent Builder</strong></a> with <a href="https://www.elastic.co/docs/explore-analyze/workflows"><strong>Elastic Workflows</strong></a>, we can create AI assistants that not only answer questions but also perform complex actions.</p><p>In this post, we’ll dive into the architecture from that talk, specifically looking at how we built an automated &quot;Laptop Refresh&quot; workflow. We’ll demonstrate how to configure an agent that collects user requirements and triggers a server-side automation to interact directly with ServiceNow APIs.</p><p><strong>Watch the full breakdown:</strong> This post is based on our presentation at Google DevFest. You can <a href="https://www.youtube.com/watch?v=OzStbTUZqyw">watch the full session here</a> to see the demo in action.</p><h2><strong>The architecture: From chat to fulfillment</strong></h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eb42459bfae9c2ac95f2012882ce826db5526705-1600x1000.png" style="width: 100%;" /><p><strong>Note:</strong> The technical implementation described in this document is a streamlined version of the full production environment. 
While the <strong>architecture diagram</strong> provided serves as an accurate structural reference for the actual deployment, the accompanying text and code snippets have been simplified for illustrative purposes and may differ from the final, complex configurations used in the live implementation.</p><p>The goal is to move from a manual, form-heavy process to a conversational interface. Instead of a user navigating a catalog, they simply tell the AI assistant that they’re due for a laptop upgrade.</p><p>As illustrated above, the flow consists of three distinct layers:</p><p><strong>1. Interaction layer (ElasticGPT/Agent Builder):</strong> The user interacts naturally with an interface powered by ElasticGPT. Behind the scenes, Agent Builder processes this conversation, handling intent detection and slot filling, to structure the data and orchestrate interactions with other internal systems.</p><ul><li><strong>Intent detection</strong><ul><li><strong>Mechanism:</strong> System prompt instruction.</li><li><strong>Implementation:</strong> The agent is explicitly told its single purpose in the <code>MISSION</code> statement. 
It doesn’t need to &quot;detect&quot; other intents because it’s scoped strictly to IT provisioning.<ul><li><em><strong>Code reference</strong>:</em> <code>MISSION: You are a specialized agent designed to collect complete employee onboarding information...</code></li></ul></li><li><strong>Constraint:</strong> If a user asks about non-IT topics (for example, &quot;What is the weather?&quot;), the <code>MISSION</code> implies that the agent should pivot back to data collection or decline, depending on the large language model’s (LLM&#x27;s) default safety alignment.</li></ul></li><li><strong>Slot filling (data collection)</strong><ul><li><strong>Mechanism:</strong> Phased conversation flow.</li><li><strong>Implementation:</strong> Instead of asking for all slots at once, the <code>DATA COLLECTION STRATEGY</code> breaks the slots into five logical phases. This prevents the context-switching fatigue mentioned above.<ul><li><strong><em>Code reference:</em> </strong><code>PHASE 1: Personal Information, PHASE 2: Employment Details, and so on.</code></li></ul></li><li><strong>Validation:</strong> The prompt enforces immediate validation (for example, <code>Validate inputs immediately</code>), acting as a gatekeeper before moving to the next slot.</li></ul></li></ul><p><strong>2. Automation layer (Workflows):</strong> Once the agent has the data, it triggers a workflow. This workflow handles the logic: checking device eligibility, enforcing policy (for example, &quot;Is the laptop &gt; 3 years old?&quot;), and making API calls.</p><p><strong>3. System of record (ServiceNow):</strong> The workflow reads and writes directly to the ITSM tool to maintain audit trails and initiate fulfillment.</p><h2><strong>Step 1: Configuring the agent</strong></h2><p>The first step is defining the &quot;brain&quot; of the operation using <strong>Agent Builder</strong>. We need an agent that acts strictly within the bounds of IT provisioning. 
We don&#x27;t want a general chatbot; we want a data collection machine that feels like a helpful colleague.</p><p>We achieve this via a robust <strong>system prompt</strong>. The prompt dictates the agent&#x27;s operating protocol, enforcing a step-by-step data collection strategy.</p><p>Here’s the refined structure of the prompt we used. Notice how it enforces validation and logically groups questions to avoid overwhelming the user:</p><pre><code>MISSION: You are a specialized agent designed to collect complete employee onboarding information for IT equipment provisioning.

OPERATING PROTOCOL:
0. On every new chat, send a welcome message, and directly jump to data collection.

1. DATA COLLECTION STRATEGY:
   - Use a step-by-step approach across 5 clear phases
   - Validate inputs immediately

2. CONVERSATION FLOW:
   PHASE 1: Personal Information (Name, Email, Phone)
   PHASE 2: Employment Details (Job Title, Department, Manager)
   PHASE 3: Location & Shipping (Address, Country)
   PHASE 4: Technical Setup (Laptop Type, Accessories)
   PHASE 5: Confirmation

...

6. SUCCESS COMPLETION:
   After all data is collected and validated, invoke the tool "laptoprefreshworkflow" with the JSON payload.</code></pre><p>For a sample system prompt or instructions, refer to the example <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/building-actionable-ai-automating-it-requests-with-agent-builder-and-one-workflow/Try%20it%20yourself%20Agents/service_now_utility_agent.ipynb">here</a>.</p><p>By explicitly instructing the agent to send the data in a specific JSON format at the end of the conversation, we ensure that the input matches exactly what our automation layer expects.</p><h2><strong>Step 2: The automation layer (Workflows)</strong></h2><p>The agent provides the <em>intent</em> and the <em>data</em>, but <strong>Workflows</strong> provides the <em>muscle</em>.</p><p>We define a workflow using a YAML configuration. This workflow acts as the bridge between the AI agent and the ServiceNow REST APIs. It handles authentication, data retrieval, and the ordering process.</p><p>Below is the workflow definition. We’ve refined the code to use secure variable handling for credentials rather than hardcoding them.</p><h3><strong>Workflow inputs</strong></h3><p>First, we define the inputs the workflow expects to receive from the agent:</p><pre><code>
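# For reference, the agent's final JSON payload (illustrative values, not taken
# from the production setup) maps one-to-one onto the inputs declared below:
#   {"userid": "u12345", "preferred-address": "123 Example St, Austin, TX",
#    "laptop-choice": "Macbook latest", "laptop-keep-or-return": "return"}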
version: "1"
name: Submit Laptop Refresh Request
enabled: true
triggers:
  - type: manual
inputs:
  - name: userid
    type: string
  - name: preferred-address
    type: string
  - name: laptop-choice
    default: Macbook latest
    type: string
  - name: laptop-keep-or-return
    default: return
    type: string</code></pre><h3><strong>Interacting with ServiceNow</strong></h3><p>The workflow executes a series of HTTP steps. Crucially, we first need to identify the user&#x27;s <em>current</em> asset to link the refresh request correctly.</p><p>1. Fetching computer data</p><p>We query the <code>cmdb_ci_computer</code> table in ServiceNow to find the asset currently assigned to the user.</p><pre><code>
steps:
  - name: snow_get_computer_data
    type: http
    with:
      url: https://elasticdev.service-now.com/api/now/table/cmdb_ci_computer?assigned_to={{ inputs.userid }}
      method: GET
      headers:
        Accept: application/json
        Content-Type: application/json
        # Best Practice: Use secrets for authorization headers
        Authorization: Basic {{ secrets.servicenow_creds }}
      timeout: 30s</code></pre><p>2. Adding to cart</p><p>Once we have the asset details and the user&#x27;s preferences, we don&#x27;t just create a generic ticket. We use the ServiceNow Service Catalog API to programmatically add the specific item to a cart.</p><pre><code>
  - name: snow_post_add_item_to_cart
    type: http
    with:
      url: https://elasticdev.service-now.com/example
      method: POST
      headers:
        Accept: application/json
        Content-Type: application/json
        Authorization: Basic {{ secrets.servicenow_creds }}
      body: |
        {
            "sysparm_quantity": 1,
            "variables": {
              "caller_id_common": "{{ inputs.userid }}",
              "current_device": "{{ steps.snow_get_computer_data.output.data.result[0].sys_id }}",
              "laptop_keep_or_return": "{{ inputs.laptop-keep-or-return }}",
              "choose_your_laptop": "{{ inputs.laptop-choice }}",
              "shipping_address": "{{ inputs.preferred-address }}"
            }
        }</code></pre><p>3. Indexing the transaction</p><p>Finally, we want to keep a record of this transaction within Elasticsearch for analytics and future reference. We use the <code>elasticsearch.index</code> step to store the request details immediately after submission.</p><pre><code>
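# Note: the snow_post_submit_order step (which submits the cart and returns the
# request details referenced below) is omitted from this simplified snippet.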

  - name: index-submission-record
    type: elasticsearch.index
    with:
      index: laptop-refresh-submission-data
      id: "{{ steps.snow_post_submit_order.output.data.result.request_id }}"
      document:
        request-id: "{{ steps.snow_post_submit_order.output.data.result.request_id }}"
        user-id: "{{ inputs.userid }}"
        configuration-item: "{{ steps.snow_get_computer_data.output.data.result[0].sys_id }}"
        laptop-choice: "{{ inputs.laptop-choice }}"
        timestamp: "{{ steps.snow_post_submit_order.output.data.result.sys_created_on }}"</code></pre><p>For detailed workflow yaml, please refer <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/building-actionable-ai-automating-it-requests-with-agent-builder-and-one-workflow">here</a>.</p><h2><strong>The result</strong></h2><p>By stitching these components together, we create a seamless experience:</p><ol><li><strong>The user</strong> chats naturally with the agent to provide details.</li><li><strong>The agent</strong> structures this unstructured conversation into a JSON object.</li><li><strong>Workflow</strong> receives the JSON, validates the user&#x27;s current hardware via ServiceNow, creates the order, and indexes the result.</li></ol><p>This approach reduces a process that traditionally took users 5–10 minutes of form navigation into a quick conversation, while ensuring that IT operations retains full visibility and control.</p><p>Video demo: </p><a href="https://videos.elastic.co/watch/k7U1zBvuEJSSGvprkFnuGE">Watch video on Vidyard</a><h2><strong>Ready to build?</strong></h2><p>This pattern, using an agent for the interface and using Workflows for the execution, can be applied to almost any ITSM task, from password resets to software provisioning.</p><p>If you’re interested in trying this out, be sure to watch the <a href="https://www.youtube.com/watch?v=OzStbTUZqyw">DevFest talk</a> for the full context, and check out the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic AI Agent Builder documentation</a> to get started building your own agents today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch & LLMs, Part 1: Preparing for intelligent entity matching]]></title>
            <description><![CDATA[Learn what entity resolution is and how to prepare both sides of the entity resolution equation: your watch list and the articles you want to search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9908bbdc324f4187f057d54d7502055ee78f575e-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>&quot;The new Swift update is here!&quot;</p><p>When you see that headline, what do you think of? For a developer, it&#x27;s a call to action, time to dive into new syntax, concurrency models, and bug fixes for the Swift programming language. For a music fan, it&#x27;s a completely different story, a signal that Taylor Swift has just dropped a new album or is making a major announcement.</p><p>Your brain, in a fraction of a second, performs a remarkable feat of natural language processing (NLP). It doesn&#x27;t just read the word &quot;Swift&quot; in isolation; it uses the surrounding context (the headline&#x27;s source, your personal interests, and more) to resolve that single, ambiguous word to a unique, real-world entity.</p><p>In NLP, we call this ability to disambiguate <em>entity resolution</em>, and it&#x27;s something humans do all the time. Natural language is inherently ambiguous, so we need to be able to map entities like &quot;Bill Gates&quot; to &quot;the founder of Microsoft&quot; and &quot;The Eras Tour&quot; to &quot;Taylor Swift&#x27;s concert tour&quot;. For humans, these connections come easily; for computers, not so much. Think how disappointed a Swiftie would be when they find out the article that their smart assistant recommended is actually about programming.</p><p>This same challenge becomes critical when you&#x27;re monitoring news articles for mentions of specific people or organizations. Imagine you&#x27;re tracking sanctioned entities or monitoring mentions of specific companies. You have a watch list with “Sakura Shipping Group” on it, and you want to know when articles mention the company. Simple enough, right? But what happens when an article refers to “Sakura Shipping” instead of the full legal name? Or uses an abbreviation like “SSG”? Or describes it indirectly as “a major Japanese maritime logistics firm”? Or mentions the company in Japanese, as “さくら海運グループ”? 
Your simple text matching won’t find these mentions, even though they all refer to the same organization. For compliance and risk monitoring use cases, missing a mention could have serious consequences. You need to catch every variation, every alias, every way an entity might be referred to.</p><p>This is the problem of <em>entity resolution</em>: identifying when different mentions in text refer to the same real-world entity and determining which entity that is. To solve this, we need a system that can handle <em>semantic search</em> (understanding meaning, not just keywords), <em>named entity recognition</em> (extracting entities from text), and fast, scalable matching across millions of documents. That&#x27;s why we built this prototype on Elasticsearch. It provides built-in semantic search capabilities, integrated NER models, and the scalability needed for entity resolution.</p><p>In this series, we present an educational prototype for intelligent entity resolution that deliberately separates <em>retrieval</em> from <em>judgment and explanation</em>. Elasticsearch is used to efficiently narrow the search space by combining keyword, alias, and semantic (hybrid) search. Once plausible entity candidates are identified, a large language model (LLM) is used to determine whether a candidate refers to the same real-world entity, and the model’s rationale is provided in natural language.</p><p>This division of responsibilities avoids treating LLMs as black-box retrievers, preserves explainability for sensitive use cases, and demonstrates a reusable design pattern for building transparent, Elasticsearch-native systems. We examine why this pattern is particularly effective for entity resolution, where ambiguity is common and explainability matters. 
The goal is not to present a production-ready solution but to teach the architectural principles behind building transparent entity resolution systems.</p><p><strong>Important note:</strong> This series presents an educational prototype that teaches Elasticsearch-native entity resolution using LLM judgments. We&#x27;ve made some simplifying choices (such as using Wikipedia for entity enrichment) to keep the system accessible for learning. Production systems might use different data sources, additional validation steps, or more sophisticated enrichment pipelines. The goal here is to demonstrate the core concepts and architecture, not to provide a production-ready system.</p><p>This series shows how we can help computers make these necessary connections while working with a 100% Elasticsearch-native architecture. We&#x27;ll explore three major innovations:</p><ul><li>Enhancing entities with contextual information.</li><li>Recognizing basic and complex entities with comprehensive NER.</li><li>Providing transparent reasoning through Elasticsearch candidate matching and LLM-powered explanations.</li></ul><p>We&#x27;ll also evaluate the system and identify an important optimization that improves the overall performance of the educational prototype.</p><p>In this first post of a four-part series, we’ll focus on preparing both sides of the entity resolution equation: your watch list and the articles you want to search.</p><h2>The problem: Why entity resolution requires preparation</h2><p>Entity resolution is hard because we face challenges on both sides of the matching equation. On one side, entities can be mentioned in many different ways. A company might be referred to as &quot;Microsoft&quot;, &quot;Microsoft Corporation&quot;, &quot;MSFT&quot;, or even &quot;the Redmond-based tech giant&quot;, depending on the context and writing style. 
On the other hand, we need to find these mentions in articles, even when they&#x27;re not obvious, such as when an article uses &quot;the Russian President&quot; or &quot;F.D.R.&quot; instead of a full name.</p><p><strong>Why we can&#x27;t just match names directly:</strong> Without proper preparation, matching becomes unreliable. You might think, &quot;But I can just search for &#x27;Tim Cook&#x27; in the text, right?&quot; Well, yes, if the article always mentions him by that exact name. But what about when it says &quot;Apple CEO&quot; instead? Or &quot;Timothy D. Cook&quot; (his full name)? Your simple text search won&#x27;t find those mentions, even though they all refer to the same person.</p><p>Without entity preparation, we can&#x27;t match &quot;the Russian President&quot; to &quot;Vladimir Putin&quot; because we don&#x27;t know what &quot;the Russian President&quot; means without context. We can&#x27;t match &quot;J.R.R. Tolkien&quot; to &quot;John Ronald Reuel Tolkien&quot; because we don&#x27;t know that they&#x27;re aliases for the same person. We can&#x27;t match &quot;Apple CEO&quot; to &quot;Tim Cook&quot; because we can&#x27;t understand the semantic relationship. Without indexing, finding matches means checking every entity in your watch list individually. This doesn&#x27;t scale: With thousands of entities, every match becomes slow and expensive. For sanctioned individuals monitoring, this means missing critical mentions that use aliases or alternate spellings, a failure that could have serious consequences.</p><p><strong>Why we can&#x27;t just search text directly:</strong> Entity extraction is hard for the same reason entity resolution is hard: Entities can be mentioned in many different ways. The same person might be referred to as &quot;J.R.R. Tolkien&quot;, &quot;the author of <em>The Lord of the Rings</em>&quot;, or just &quot;Tolkien&quot;, depending on the context. Without proper extraction, we can&#x27;t find these mentions in the text. 
We&#x27;d have to manually identify every entity mention, which doesn&#x27;t scale. We&#x27;d miss entities mentioned in nonstandard ways (for example, titles or abbreviations). We also wouldn&#x27;t capture the context around entity mentions, which is crucial for accurate matching.</p><p>The solution is a two-phase system that prepares both your watch list and the articles you want to search.</p><h2>The solution: Two-phase preparation system</h2><p>To solve entity resolution, we need to prepare both sides of the matching equation. First, we enrich and index our watch list entities to enable semantic search. Second, we extract entity mentions from articles using hybrid techniques that capture explicit and implicit references. Together, these phases create the foundation for intelligent entity matching.</p><h3>Phase 1: Preparing your watch list</h3><p>The solution to preparing entities is to <em>enrich</em> them with meaningful contexts. This enables our entity matching system to work effectively. We&#x27;ll explain how context helps in a bit, but let&#x27;s walk through the prototype&#x27;s simple implementation first.</p><p>Our watch list of entities may be provided in multiple formats. The Office of Foreign Assets Control (OFAC) provides sanctions lists that include first and last names, addresses, and identifying information, such as passport numbers, date and place of birth, and nationality information [1]. While this provides a good amount of context, in practice many of these fields are omitted when the values are unknown for the given entity. Some lists may be just a set of names. The most helpful lists for our purposes come out of the box with rich descriptions, as is often the case with commercial or curated datasets.</p><p>The three-component system used in the prototype starts by managing our entities and organizing their metadata. 
Since entity lists can vary in the amount of information they contain, our prototype is designed to work with whatever it receives. The JSON format supports entities with minimal information (just a name and type) or full information (with aliases, descriptions, metadata, and more). For example, an entity might be as simple as:</p><pre><code>{
  "name": "J.R.R. Tolkien",
  "entity_type": "person"
}</code></pre><p>Or it might include additional context:</p><pre><code>{
  "name": "J.R.R. Tolkien",
  "entity_type": "person",
  "description": "English writer and philologist, author of The Lord of the Rings",
  "aliases": ["John Ronald Reuel Tolkien", "J.R.R. Tolkien", "Tolkien"],
  "priority": "medium"
}</code></pre><p>The system handles both cases gracefully during enrichment. For the prototype, the enrichment process adds context from Wikipedia (specifically, the first paragraph of the entity&#x27;s Wikipedia page) for entities that don&#x27;t already have context [2]. This Wikipedia context helps with semantic matching, but it doesn&#x27;t add other fields, like aliases or full names; those must come from the original dataset. (In production, you might use other approaches for enrichment, including an agentic system that figures out where to find the context information for a given entity. This is beyond the scope of our prototype, but it’s an exciting feature we could add in the future.) Finally, we index the entities in Elasticsearch with semantic search capabilities, creating a searchable index that understands meaning rather than just text.</p><h4><strong>Key concepts: Semantic search and indexing</strong></h4><p><strong>What is semantic search?</strong> <em>Semantics</em> refers to the meanings of words and phrases. Figuring out meaning is usually easy for humans, but it&#x27;s much more challenging for computers to &quot;get&quot; because it requires a depth of understanding that’s difficult to program. Semantic search works by turning this challenge into a math problem, something that computers are very good at [3].</p><p>Think of semantic search like map coordinates for meaning. Just as latitude and longitude tell you where something is on a map, semantic embeddings tell you where something is in &quot;meaning space.&quot; Whereas traditional keyword search requires exact matches, semantic search relies on describing that &quot;location&quot; in a multidimensional vector space. For example, you might have the coordinates for a specific &quot;big red building&quot;. When you search for a &quot;small red building&quot;, semantic search looks in the &quot;neighborhood&quot; for similar concepts in the vector space. 
Your big red building might appear as a nearest neighbor, but the relevance score will be lower because parts of the meaning don&#x27;t match.</p><p>Getting back to our example, when you search for &quot;Apple CEO&quot;, semantic search can find &quot;Tim Cook&quot; because the semantic embeddings capture the meaning that both refer to the same person, even though they use completely different words. This capability is invaluable when monitoring for sanctioned individuals, as aliases and code names may be used to evade detection.</p><p><strong>Why Elasticsearch for entity indexing?</strong> Elasticsearch has built-in semantic search capabilities using embedding models, like EmbEddings from bidirEctional Encoder rEpresentations (E5) [4]. This means we can create an index that understands meaning, not just text. When we index our enriched entities, Elasticsearch creates semantic embeddings that capture each entity&#x27;s meaning, enabling intelligent matching later.</p><p><strong>What is the <em>mapping schema</em>?</strong> The mapping schema defines how we structure entity data in Elasticsearch. Our schema includes several field types optimized for different search strategies, including:</p><ul><li><strong>Keyword fields</strong> (<code>id,</code> <code>name.keyword</code>, <code>aliases.keyword</code>): For exact matching on entity names and aliases.</li><li><strong>Text fields</strong> (<code>name</code>, <code>name_lower</code>, <code>context</code>, <code>aliases</code>): For traditional, case-normalized full-text search with BM25 scoring.</li><li><strong>Semantic text fields</strong> (<code>name_semantic</code>, <code>context_semantic</code>): For vector-based similarity search using the multilingual-e5-small model.</li></ul><p>This hybrid mapping enables multiple search strategies: exact matching for precise names, keyword search for aliases, and semantic search for meaning-based matching. 
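</p><p>As a sketch, the mapping for such an index might look like the following (the field names come from the list above; the index name and inference endpoint ID are illustrative assumptions, not taken from the prototype):</p><pre><code>PUT /watchlist-entities
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "name_lower": { "type": "text" },
      "aliases": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "context": { "type": "text" },
      "name_semantic": { "type": "semantic_text", "inference_id": ".multilingual-e5-small-elasticsearch" },
      "context_semantic": { "type": "semantic_text", "inference_id": ".multilingual-e5-small-elasticsearch" }
    }
  }
}</code></pre><p>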
Even better, Elasticsearch supports <em>hybrid search</em>, allowing us to use both keyword and semantic search simultaneously.</p><h4><strong>Before and after entity preparation</strong></h4><p>Before entity preparation, you have a simple list without much context, possibly nothing more than a name: &quot;J.R.R. Tolkien&quot;. That&#x27;s it. You can only match exact text matches, which means you&#x27;ll miss &quot;John Ronald Reuel Tolkien&quot;, &quot;Tolkien&quot;, and any other variations. For sanctioned individuals, this means missing critical mentions that use aliases or alternate spellings.</p><p>After entity preparation, you have a rich, searchable index. &quot;Vladimir Putin&quot; is now enriched with Wikipedia context, and if your original dataset included aliases, like &quot;Путин&quot; or &quot;Vladimir Vladimirovich Putin&quot;, those are indexed as well. The entity also has semantic embeddings that capture its meaning. The Wikipedia context helps semantic search understand that &quot;The Russian President&quot; refers to Vladimir Putin, enabling that match. If &quot;Путин&quot; was provided as an alias in your original dataset, exact matching handles that. Semantic variations work because your semantic embeddings understand meaning. For sanctioned individuals, this comprehensive preparation ensures you catch every mention, regardless of how the name is written or what alternative name is used.</p><h3>Phase 2: Extracting entities from articles</h3><p>Now that we have a searchable watch list, we need to extract entity mentions from articles. This is where article processing comes in.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach entity extraction concepts. 
Production systems might use different NER models, custom extraction rules, or specialized extraction pipelines tailored to specific domains or languages.</p><p>We extract entities from articles using a hybrid NER approach that combines machine learning with pattern-based extraction. First, we process articles to prepare them for extraction. Then, we extract entities using a hybrid extraction approach that combines NER performed in Elasticsearch (using a deployed XLM-RoBERTa model) with pattern-based extraction to catch entities that NER might miss.</p><p>This hybrid extraction approach provides several benefits. NER automatically finds entity mentions in text, even when they&#x27;re not obvious. Pattern-based extraction catches entities that NER might miss, like titles and compound entities. We preserve the context around each entity mention, which helps with matching decisions later. The approach scales well, allowing us to process thousands of articles automatically, not just a few manually.</p><h4><strong>Key concepts: NER, pattern-based extraction, and hybrid extraction approach</strong></h4><p><strong>What is NER?</strong> <em>Named entity recognition</em> is a machine learning technique that identifies named entities in text. When we run NER on an article, it finds mentions like &quot;Microsoft&quot;, &quot;Seattle&quot;, and &quot;Washington&quot; and labels them as organization, location, or person entities.</p><p><strong>Why use NER in Elasticsearch?</strong> Using NER in Elasticsearch maintains our 100% Elasticsearch-native architecture, which simplifies the entity resolution prototype&#x27;s design. Instead of managing separate services for entity extraction and search, everything runs in one system. You can perform NER during document ingestion using inference pipelines, and the extracted entities are immediately available for indexing and searching. 
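As a rough sketch, an ingest pipeline built around the inference processor might look like the following. The model ID, pipeline ID, and field names here are assumptions, not the prototype's exact configuration; substitute the NER model actually deployed in your cluster.

```python
# Sketch of an ingest pipeline that runs NER at index time via the inference
# processor. The model ID, pipeline ID, and field names are assumptions --
# substitute the NER model actually deployed in your cluster.
ner_pipeline = {
    "description": "Extract named entities from article text at ingest time",
    "processors": [
        {
            "inference": {
                "model_id": "my-xlm-roberta-ner-model",  # hypothetical model ID
                "target_field": "ml.ner",                # where predictions land
                "field_map": {"content": "text_field"},  # our field -> model input
            }
        }
    ],
}

# Against a live cluster (not executed here):
# es.ingest.put_pipeline(id="articles-ner", **ner_pipeline)
# es.index(index="articles", pipeline="articles-ner",
#          document={"content": "Microsoft opened an office in Seattle."})

print(len(ner_pipeline["processors"]))
```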
This unified approach reduces complexity, eliminates network calls between services, and makes deployment and management easier. The XLM-RoBERTa model is trained to recognize entities in multiple languages, so we can extract entities from articles in different languages without needing separate models for each language. For information on deploying NER models in Elasticsearch, see the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-ner-example">Elasticsearch NER documentation</a>.</p><p><strong>What is pattern-based extraction?</strong> <em>Pattern-based extraction</em> uses rules and patterns to find entities that NER might miss. For example, NER might not recognize &quot;the author of <em>The Lord of the Rings</em>&quot; as an entity mention, but pattern-based extraction can catch titles and roles like &quot;the CEO&quot; or &quot;the President&quot;. However, pattern-based extraction is language-specific. The patterns need to be defined for each language you want to support. This is a significant drawback for multilingual systems, but it&#x27;s acceptable for our educational prototype, which focuses on demonstrating the core concepts. Production systems might use language-specific pattern sets or alternative approaches for multilingual support.</p><p><strong>How do they work together?</strong> The hybrid extraction approach combines both techniques. NER finds obvious entity mentions like &quot;J.R.R. Tolkien&quot;, while pattern-based extraction catches variations that NER might miss, such as &quot;the author of <em>The Lord of the Rings</em>&quot;. 
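To illustrate the idea, a pattern-based extractor can be as small as a list of regular expressions. These simplified, English-only patterns are assumptions for illustration, not the prototype's actual rule set:

```python
import re

# Simplified, English-only illustration of pattern-based extraction
# (assumed patterns, not the prototype's actual rule set).
PATTERNS = [
    # "the author/director/creator of <Capitalized Title>"
    (r"\b[Tt]he (?:author|director|creator) of "
     r"(?:[A-Z]\w+(?:\s+(?:of|the|and|[A-Z]\w+))*)", "PERSON"),
    # "the CEO/President/Prime Minister (of <Org>)"
    (r"\b[Tt]he (?:CEO|President|Prime Minister)(?: of [A-Z]\w+)?", "PERSON"),
]

def extract_by_pattern(text):
    """Return (mention, entity_type) pairs for every pattern match."""
    mentions = []
    for pattern, entity_type in PATTERNS:
        for match in re.finditer(pattern, text):
            mentions.append((match.group(0), entity_type))
    return mentions

print(extract_by_pattern("The author of The Lord of the Rings published a new edition."))
# -> [('The author of The Lord of the Rings', 'PERSON')]
```

In a real system these pattern mentions would be merged with the NER output, deduplicating overlapping spans.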
Together, they provide comprehensive coverage of entity mentions in text.</p><p>When we extract entities from an article mentioning &quot;the author of <em>The Lord of the Rings</em>&quot;, we get:</p><ul><li>Text: &quot;author of <em>The Lord of the Rings</em>&quot;</li><li>Type: PERSON (from pattern-based extraction)</li><li>Confidence: 0.85</li><li>Context: &quot;The author of <em>The Lord of the Rings</em> published a new edition&quot;</li></ul><h4><strong>Before and after entity extraction</strong></h4><p>With NER-only extraction, we might find &quot;J.R.R. Tolkien&quot; and &quot;<em>The Lord of the Rings</em>&quot; in the article, but we&#x27;d miss &quot;the author of <em>The Lord of the Rings</em>&quot; because NER doesn&#x27;t recognize descriptive phrases as entity mentions.</p><p>With hybrid extraction, we find both &quot;J.R.R. Tolkien&quot; (from NER) and &quot;the author of <em>The Lord of the Rings</em>&quot; (from pattern-based extraction). This comprehensive coverage enables better matching later, since we can match both the name and the descriptive phrase to our watch list.</p><h2>What&#x27;s next: Matching entities to our watch list</h2><p>Now that we&#x27;ve prepared both sides of the entity resolution equation, we have everything we need for intelligent matching:</p><ul><li>A searchable watch list enriched with context and indexed for semantic search.</li><li>Extracted entity mentions from articles using hybrid NER.</li></ul><p>Preparation gives us the raw ingredients, but it doesn’t tell us which entity a mention actually refers to. In the next post, we&#x27;ll explore how to match these extracted entities to our watch list using semantic search and LLM-powered judgment that handles ambiguity and context transparently.</p><h2>Try it yourself</h2><p>Want to see the preparation process in action? 
Check out these notebooks for complete walkthroughs with real implementations, detailed explanations, and hands-on examples:</p><ul><li><strong><a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=3%20minutes%20ago-,01_entity_preparation_v3.ipynb,-Initial%20public%20lab">Entity preparation notebook</a>:</strong> Shows you exactly how to enrich entities with Wikipedia context, create semantic search indexes, and prepare your watch list for intelligent matching.</li><li><strong><a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=3%20minutes%20ago-,02_article_processing_v3.ipynb,-Initial%20public%20lab">Article processing notebook</a>:</strong> Shows you exactly how to extract entities from articles using hybrid NER, handle multilingual content, and process compound entities.</li></ul><p><strong>Remember:</strong> This is an educational prototype designed to teach the concepts. When building production systems, consider additional factors, like data source reliability, validation pipelines, error handling, monitoring, compliance requirements, domain-specific NER models, custom extraction rules, and quality validation that aren&#x27;t covered in this learning-focused prototype.</p><h2>References</h2><ol><li><a href="https://sanctionslist.ofac.treas.gov/Home/">OFAC Sanctions List Search</a></li><li>The datasets used for the prototype also use a special field, &#x27;explicit_context&#x27;, in lieu of getting the context from Wikipedia. 
We do this to control for the entity preparation step when we&#x27;re testing other components such as entity matching.</li><li><a href="https://www.elastic.co/blog/retrieval-augmented-generation-explained#semantic-search:~:text=what%20we%E2%80%99ll%20do.-,Turning%20semantic%20search%20into%20a%20math%20problem,-Let%E2%80%99s%20begin%20with">The big ideas behind retrieval augmented generation</a></li><li><a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-e5">E5 in Elasticsearch</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From vectors to keywords: Elasticsearch hybrid search in LangChain]]></title>
            <description><![CDATA[Learn how to use hybrid search in LangChain via its Elasticsearch integrations, with complete Python and JavaScript examples.]]></description>
            <link>https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Margaret Gu, Eyo Eshetu]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9159af2b07b88f288e5c7cb719c8dcbe5d3b37d6-1080x608.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch hybrid search is available for LangChain across our <a href="https://github.com/langchain-ai/langchain-elastic">Python</a> and <a href="https://github.com/langchain-ai/langchainjs">JavaScript</a> integrations. Here we’ll discuss what hybrid search is and when it can be useful, and run through some simple examples to get started.</p><p>We’re also planning to support hybrid search in the community-driven <a href="https://github.com/langchain4j/langchain4j">Java integration</a> very soon.</p><h2><strong>What is hybrid search?</strong></h2><p><em>Hybrid search</em> is an information retrieval approach that combines <em>keyword-based full-text search</em> (lexical matching) with <em>semantic search</em> (vector similarity). Practically, it means a query can match documents because they contain the right terms and/or because they express the right meaning (even if the wording differs). In simple terms, you can think of it like this:</p><ul><li>Lexical retrieval: “Do these documents contain the words I typed (or related words)?”</li><li>Semantic retrieval: “Do these documents mean something similar to what I typed?”</li></ul><p>These two retrieval methods produce scores on different scales, so hybrid search systems typically use a fusion strategy to merge them into one ranking, for example, using <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">reciprocal rank fusion</a> (RRF).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a36692581ec5adb54d3c517e171b6d2f372efd92-1249x514.png" style="width: 100%;" /><p>In the figure above, we show an example: <a href="https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables">BM25</a> (keyword search) returns Docs A, B, and C, while semantic search returns Docs X, A, and B. 
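The fusion itself is simple enough to sketch in a few lines of Python. Using the standard rank constant k = 60, the two result lists fuse into the ranking shown in the figure:

```python
# Reciprocal rank fusion (RRF): each document scores 1 / (k + rank) per list,
# and the scores are summed. k = 60 is the standard rank constant.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["Doc A", "Doc B", "Doc C"]      # keyword search results
semantic_hits = ["Doc X", "Doc A", "Doc B"]  # semantic search results

print(rrf([bm25_hits, semantic_hits]))
# -> ['Doc A', 'Doc B', 'Doc X', 'Doc C']
```

Documents found by both retrievers (A and B) accumulate score from both lists, which is why they outrank documents found by only one.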
The RRF algorithm then combines these two result lists into the final ranking: Doc A, Doc B, Doc X, and Doc C. With hybrid search, Doc C is included in the results thanks to BM25.</p><h2><strong>Why hybrid search matters</strong></h2><p>If you’ve built search or retrieval-augmented generation (RAG) features in production, you’ve probably seen the same failure modes show up again and again: </p><ul><li>Keyword search can be too literal. If the user doesn’t use the exact terms that appear in your documents, relevant content gets buried or missed.</li><li>Semantic search can be too fuzzy. It’s great at meaning, but it can also return results that feel related while missing a critical constraint, like a product name, an error code, or a specific phrase the user actually typed.</li></ul><p>Hybrid search exists because real user queries in production environments usually need <em>both</em>.</p><p>Next we’ll dive into how you get started with hybrid search in the LangChain integration for <a href="https://github.com/langchain-ai/langchain-elastic">Python</a> and <a href="https://github.com/langchain-ai/langchainjs">JavaScript</a>. If you want to read more about hybrid search, check out <strong><a href="https://www.elastic.co/what-is/hybrid-search">What is hybrid search?</a> </strong>and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-hybrid-search"><strong>When hybrid search truly shines</strong></a>.</p><h3>Setting up a local Elasticsearch instance</h3><p>Before running the examples, you&#x27;ll need Elasticsearch running locally. 
The easiest way is using the <a href="https://github.com/elastic/start-local?tab=readme-ov-file"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>After starting, you&#x27;ll have:</p><ul><li>Elasticsearch at http://localhost:9200.</li><li>Kibana at http://localhost:5601.</li></ul><p>Your API key is stored in the .env file (under the elastic-start-local folder) as <code>ES_LOCAL_API_KEY</code>.</p><h2>Getting started with hybrid search in LangChain (Python and JavaScript)</h2><p>The dataset is a CSV with information on 1,000 science fiction movies, taken from an IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>. This demo uses a subset of the data, which has been cleaned. You can download the dataset used for this article from our <a href="https://gist.github.com/ssh-esh/103fb8220de3b0e045393760c2f36575">GitHub gist</a>, along with the full code for this demo.</p><h3>Step 1: Install what you need.</h3><p>First you’ll need the LangChain Elasticsearch integration and Ollama for embeddings. 
(You can also use some other embedding model if you wish.)</p><p><strong>In Python:</strong></p><pre><code>pip install langchain-elasticsearch langchain-ollama</code></pre><p><strong>In JavaScript:</strong></p><pre><code>npm install @langchain/community @langchain/ollama @elastic/elasticsearch csv-parse</code></pre><h3>Step 2: Configure your connection and dataset path.</h3><p><strong>In Python:</strong></p><p>At the top of the script, we set:</p><ul><li>Where Elasticsearch is <code>(ES_LOCAL_URL)</code>.</li><li>How to authenticate <code>(ES_LOCAL_API_KEY)</code>.</li><li>Which demo index name to use <code>(INDEX_NAME)</code>.</li><li>Which CSV file we’ll ingest <code>(scifi_1000.csv)</code>.</li></ul><pre><code>ES_URL = os.getenv("ES_LOCAL_URL", "http://localhost:9200") 
ES_API_KEY = os.getenv("ES_LOCAL_API_KEY")
INDEX_NAME = "scifi-movies-hybrid-demo" 
CSV_PATH = Path(__file__).with_name("scifi_1000.csv")</code></pre><p><strong>In JavaScript:</strong></p><p>Notes for JavaScript:</p><ul><li>JavaScript uses <code>process.env</code> instead of <code>os.getenv</code>.</li><li>Path resolution requires <code>fileURLToPath</code> and <code>dirname</code> when using ECMAScript (ES) modules.</li><li>The class is called <code>ElasticVectorSearch</code> (not <code>ElasticsearchStore</code> as in Python).</li></ul><pre><code>import { Client } from "@elastic/elasticsearch";
import { OllamaEmbeddings } from "@langchain/ollama";
import {
  ElasticVectorSearch,
  HybridRetrievalStrategy,
} from "@langchain/community/vectorstores/elasticsearch";
import { parse } from "csv-parse/sync";
import { readFileSync } from "fs";
import { dirname, join } from "path";
import { fileURLToPath } from "url";

const __dirname = dirname(fileURLToPath(import.meta.url));

const ES_URL = process.env.ES_LOCAL_URL || "http://localhost:9200";
const ES_API_KEY = process.env.ES_LOCAL_API_KEY;
const INDEX_NAME = "scifi-movies-hybrid-demo";
const CSV_PATH = join(__dirname, "scifi_1000.csv");</code></pre><p>We can now also create the client.</p><p>In Python:</p><pre><code>es = Elasticsearch(ES_URL, api_key=ES_API_KEY)</code></pre><p>In JavaScript:</p><pre><code>const client = new Client({
  node: ES_URL,
  auth: ES_API_KEY ? { apiKey: ES_API_KEY } : undefined,
});</code></pre><h3>Step 3: Ingest the dataset, and then compare vector-only vs. hybrid.</h3><h4>Step 3a: Read the CSV and build what we index.</h4><p>We build three lists:</p><ul><li><code>texts</code>: The actual text that will be embedded + searched.</li><li><code>metadata</code>: Structured fields stored alongside the document.</li><li><code>ids</code>: Stable IDs (so Elasticsearch can dedupe if needed).</li></ul><p><strong>In Python:</strong></p><pre><code># --- Ingest dataset ---
texts: list[str] = []
metadatas: list[dict] = []
ids: list[str] = []

with CSV_PATH.open(newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        movie_id = (row.get("movie_id") or "").strip()
        movie_name = (row.get("movie_name") or "").strip()
        year = (row.get("year") or "").strip()
        genre = (row.get("genre") or "").strip()
        description = (row.get("description") or "").strip()
        director = (row.get("director") or "").strip()

        # This text is both:
        #  - embedded (vector search)
        #  - keyword-matched (BM25 in hybrid mode)
        text = "\n".join(
            [
                f"{movie_name} ({year})" if year else movie_name,
                f"Director: {director}" if director else "Director: (unknown)",
                f"Genres: {genre}" if genre else "Genres: (unknown)",
                f"Description: {description}" if description else "Description: (missing)",
            ]
        )
        texts.append(text)
        metadatas.append(
            {
                "movie_id": movie_id or None,
                "movie_name": movie_name or None,
                "year": year or None,
                "genre": genre or None,
                "director": director or None,
            }
        )
        ids.append(movie_id or movie_name)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>async function main() {
  // --- Ingest dataset ---
  const texts = [];
  const metadatas = [];
  const ids = [];

  const csvContent = readFileSync(CSV_PATH, "utf-8");
  const records = parse(csvContent, {
    columns: true,
    skip_empty_lines: true,
  });

  for (const row of records) {
    const movieId = (row.movie_id || "").trim();
    const movieName = (row.movie_name || "").trim();
    const year = (row.year || "").trim();
    const genre = (row.genre || "").trim();
    const description = (row.description || "").trim();
    const director = (row.director || "").trim();

    // This text is both:
    //  - embedded (vector search)
    //  - keyword-matched (BM25 in hybrid mode)
    const text = [
      year ? `${movieName} (${year})` : movieName,
      director ? `Director: ${director}` : "Director: (unknown)",
      genre ? `Genres: ${genre}` : "Genres: (unknown)",
      description ? `Description: ${description}` : "Description: (missing)",
    ].join("\n");

    texts.push(text);
    metadatas.push({
      movie_id: movieId || null,
      movie_name: movieName || null,
      year: year || null,
      genre: genre || null,
      director: director || null,
    });
    ids.push(movieId || movieName);
  }</code></pre><p><strong>What’s important here:</strong></p><ul><li>We don’t embed only the description. We embed a combined text block (title/year + director + genre + description). That makes results easier to print and sometimes improves retrieval.</li><li>The same text is what the lexical side uses, too (in hybrid mode), because it’s indexed as searchable text.</li></ul><h4>Step 3b: Add texts to Elasticsearch using LangChain.</h4><p>This is the indexing step. Here we embed texts and write them to Elasticsearch.</p><p>For asynchronous applications, please use <a href="https://reference.langchain.com/python/integrations/langchain_elasticsearch/#langchain_elasticsearch._async.vectorstores.AsyncElasticsearchStore"><code>AsyncElasticsearchStore</code></a> with the same API.</p><p>You can find our <a href="https://reference.langchain.com/python/integrations/langchain_elasticsearch/">reference docs</a> for both the sync and async versions of ElasticsearchStore, along with more parameters for advanced fine-tuning RRF.</p><p><strong>In Python:</strong></p><pre><code>print(f"Ingesting {len(texts)} movies into '{INDEX_NAME}' from '{CSV_PATH.name}'...") 

vector_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    embedding=OllamaEmbeddings(model="llama3"),
    es_url=ES_URL,
    es_api_key=ES_API_KEY,
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=False),
)

# This is the indexing step. We embed the texts and add them to Elasticsearch
vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  console.log(
    `Ingesting ${texts.length} movies into '${INDEX_NAME}' from 'scifi_1000.csv'...`
  );

  const embeddings = new OllamaEmbeddings({ model: "llama3" });

  // Vector-only store (no hybrid)
  const vectorStore = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
  });

  // This is the indexing step. We embed the texts and add them to Elasticsearch
  await vectorStore.addDocuments(
    texts.map((text, i) => ({
      pageContent: text,
      metadata: metadatas[i],
    })),
    { ids }
  );</code></pre><h4>Step 3c: Create another store for hybrid search.</h4><p>We create another ElasticsearchStore object pointing at the same index but with different retrieval behavior: <code>hybrid=False</code> is <strong><em>vector-only</em></strong> search and <code>hybrid=True</code> is <strong><em>hybrid search</em></strong> (BM25 + kNN, fused with RRF).</p><p><strong>In Python:</strong></p><pre><code># Since we are using the same INDEX_NAME we can avoid adding texts again 
# This ElasticsearchStore will be used for hybrid search

hybrid_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    embedding=OllamaEmbeddings(model="llama3"),
    es_url=ES_URL,
    es_api_key=ES_API_KEY,
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True),
)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  // Since we are using the same INDEX_NAME we can avoid adding texts again
  // This ElasticVectorSearch will be used for hybrid search
  const hybridStore = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
    strategy: new HybridRetrievalStrategy(),
  });

  // With custom RRF parameters
  const hybridStoreCustom = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
    strategy: new HybridRetrievalStrategy({
      rankWindowSize: 100,  // default: 100
      rankConstant: 60,     // default: 60
      textField: "text",    // default: "text"
    }),
  });</code></pre><h4>Step 3d: Run the same query both ways, and print results.</h4><p>As an example, let’s run the query <em>“Find movies where the main character is stuck in a time loop and reliving the same day.”</em> and compare the results from hybrid search and vector search.</p><p><strong>In Python:</strong></p><pre><code>query = "Find movies where the main character is stuck in a time loop and reliving the same day."
k = 5

print(f"\n=== Query: {query} ===")

vec_docs = vector_store.similarity_search(query, k=k)
hyb_docs = hybrid_store.similarity_search(query, k=k)

print("\nVector search (kNN) top results:")
for i, doc in enumerate(vec_docs, start=1):
    print(f"{i}. {(doc.page_content or '').splitlines()[0]}")

print("\nHybrid search (BM25 + kNN + RRF) top results:")
for i, doc in enumerate(hyb_docs, start=1):
    print(f"{i}. {(doc.page_content or '').splitlines()[0]}")</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  const query =
    "Find movies where the main character is stuck in a time loop and reliving the same day.";
  const k = 5;

  console.log(`\n=== Query: ${query} ===`);

  const vecDocs = await vectorStore.similaritySearch(query, k);
  const hybDocs = await hybridStore.similaritySearch(query, k);

  console.log("\nVector search (kNN) top results:");
  vecDocs.forEach((doc, i) => {
    console.log(`${i + 1}. ${(doc.pageContent || "").split("\n")[0]}`);
  });

  console.log("\nHybrid search (BM25 + kNN + RRF) top results:");
  hybDocs.forEach((doc, i) => {
    console.log(`${i + 1}. ${(doc.pageContent || "").split("\n")[0]}`);
  });
}

main().catch(console.error);</code></pre><p><strong>Example output</strong></p><pre><code>Ingesting 1000 movies into 'scifi-movies-hybrid-demo' from 'scifi_1000.csv'...

=== Query: Find movies where the main character is stuck in a time loop and reliving the same day. ===

Vector search (kNN) top results:
1. The Witch: Part 1 - The Subversion (2018)
2. Divinity (2023)
3. The Maze Runner (2014)
4. Spider-Man (2002)
5. Spider-Man: Into the Spider-Verse (2018)

Hybrid search (BM25 + kNN + RRF) top results:
1. Edge of Tomorrow (2014)
2. The Witch: Part 1 - The Subversion (2018)
3. Boss Level (2020)
4. Divinity (2023)
5. The Maze Runner (2014)</code></pre><h2><strong>Why these results? </strong></h2><p>This query (“time loop / reliving the same day”) is a great case where hybrid search tends to shine because the dataset contains literal phrases that BM25 can match and vectors can still capture meaning.</p><ul><li>Vector-only (kNN) embeds the query and tries to find semantically similar plots. Using a broad sci‑fi dataset, this can drift into “trapped / altered reality / memory loss / high-stakes sci‑fi” even when there’s no time-loop concept. That’s why results like “The Witch: Part 1 – The Subversion” (amnesia) and “The Maze Runner” (trapped/escape) can appear.</li><li>Hybrid (BM25 + kNN + RRF) rewards documents that match both keywords and meaning. Movies whose descriptions explicitly mention “time loop” or “relive the same day” get a strong lexical boost, so titles like “Edge of Tomorrow” (relive the same day over and over again…) and “Boss Level” (trapped in a time loop that constantly repeats the day…) rise to the top.</li></ul><p>Hybrid search doesn’t guarantee that every result is perfect. It balances lexical and semantic signals so you may still see some non-time-loop sci‑fi in the tail of the top‑k.</p><p>The main takeaway is that hybrid search helps anchor semantic retrieval with exact textual evidence when the dataset contains those keywords.</p><h2>Full code example</h2><p>You can find our full demo code in Python and JavaScript, as well as the dataset used, hosted on <a href="https://gist.github.com/ssh-esh/103fb8220de3b0e045393760c2f36575">GitHub gist</a>.</p><h2>Conclusion</h2><p>Hybrid search provides a pragmatic and powerful retrieval strategy by combining traditional BM25 keyword search with modern vector similarity into a single, unified ranking. 
Instead of choosing between lexical precision and semantic understanding, you get the best of both worlds, without adding significant complexity to your application.</p><p>In real-world datasets, this approach consistently yields results that feel more intuitively correct. Exact term matches help anchor results to the user’s explicit intent, while embeddings ensure robustness against paraphrasing, synonyms, and incomplete queries. This balance is especially valuable for noisy, heterogeneous, or user-generated content, where relying on only one retrieval method often falls short.</p><p>In this article, we demonstrated how to use hybrid search in LangChain through its Elasticsearch integrations, with complete examples in both Python and JavaScript. We’re also contributing to other open-source projects, such as <a href="https://github.com/langchain4j/langchain4j/pull/4069">LangChain4j</a>, to extend hybrid search support with Elasticsearch.</p><p>We believe hybrid search will be a key capability for generative AI (GenAI) and agentic AI applications, and we plan to continue collaborating with libraries, frameworks, and programming languages across the ecosystem to make high-quality retrieval more accessible and robust.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to defend your RAG system from context poisoning]]></title>
            <description><![CDATA[How context engineering techniques prevent context poisoning in LLM responses.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-poisoning-llm</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-poisoning-llm</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32e26f6dc2c8fc41d4c1d981eff6487eb8423fa5-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>With the recent release of models like Claude Sonnet 4.5, the Gemini 3 series, and GPT 5 featuring million-token context windows, there’s a growing misconception that context management is becoming simpler. If a large language model (LLM) can process millions of tokens at once, does it really matter what information we provide?</p><p>The reality is precisely the opposite. <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview"><em>Context engineering</em></a>, the practice of managing what information reaches your LLM, is more critical than ever. Large context windows don’t eliminate the need for precision; they amplify it. With more context comes exponentially more opportunities for error, hallucinations, and irrelevant information to contaminate your LLM’s reasoning process.</p><p>Whether you’re using retrieval-augmented generation (RAG), tool outputs, or memory systems, effective context engineering isn’t about providing more information but about providing <em>the right</em> information. That’s where Elasticsearch comes in, serving as your context engineering platform.</p><p>In this article, we’ll explore what context poisoning is, how it manifests across different types of memory, and how Elasticsearch’s RAG capabilities provide defense at every stage of the retrieval pipeline, from ingestion to composition, ensuring your LLM receives clean, relevant, and reliable context.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d832932f12723226ecc98e2a9b8d6bfc008829bf-1080x571.png" style="width: 100%;" /><h2>What is context poisoning?</h2><p><em>Context poisoning</em> occurs when compromised, outdated, or irrelevant information enters an LLM’s context window, leading to degraded responses, hallucinations, or perpetuated errors. Once corrupted or incorrect information enters the context window, it propagates into answers. 
The LLM references it as truth, creating cascading errors across the conversation.</p><p>This poisoning can happen at multiple stages of the LLM lifecycle (like in training), but our focus is on the retrieval and composition stages. Although adversarial attacks, like prompt injection, also pose risks, this article focuses on the operational patterns that teams encounter most frequently in production environments.</p><h3><strong>Operational understanding</strong></h3><p>Context poisoning often happens for reasons like:</p><ul><li><strong>Context rot:</strong> Information becomes outdated but remains in your knowledge base without being updated or deleted.</li><li><strong>Context overflow: </strong>Too much information overwhelms the LLM&#x27;s attention, drowning out the truly important and relevant context and causing relevant information to be missed in answers.</li><li><strong>Conflicting information:</strong> Multiple sources provide contradictory data, confusing the model.</li><li><strong>Semantic noise: </strong>Content that is similar in vector space but contextually irrelevant dilutes relevance.</li><li><strong>Malicious injection: </strong>Content deliberately inserted by attackers into knowledge bases, including prompt injections or manipulated data.</li></ul><p>Understanding these patterns is the first step toward building robust defenses. Let’s examine each pattern and how Elasticsearch helps you address them. You can follow along with the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/context-poisoning">supporting notebook</a>.</p><h2>Types of context poisoning</h2><h3><strong>Temporal degradation</strong></h3><p>Over time, information in your knowledge base becomes outdated, and without proper management, stale content continues to be retrieved and presented to your LLM as current truth. 
This is especially problematic in industries where information changes frequently, like product documentation, pricing, regulations, or news.</p><h4><strong>Impact</strong></h4><p>Your LLM provides outdated advice, references deprecated features, or contradicts current reality, eroding user trust.</p><h4><strong>Solutions: Temporal filtering in hybrid search</strong></h4><p>Elasticsearch’s date-based query capabilities ensure your RAG system prioritizes recent and relevant information through explicit temporal filters.</p><p><strong>Example: Product documentation search with time filtering</strong></p><p>A user asks your chatbot about authentication setup. Six months ago, the authentication flow changed significantly, so it’s important to return only documents updated within the last six months.</p><h3><strong>Without temporal filtering</strong></h3><pre><code>POST product-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "how to configure OAuth authentication"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "configure OAuth authentication",
                "fields": ["title^2", "content"]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "last_updated", "version", "content_snippet"]
}</code></pre><h4><strong>Response without filtering: Contradictory results</strong></h4><p>The LLM receives three different methods for OAuth configuration: the current security API (9.x), legacy realm settings (7.x), and the deprecated shield plugin (6.x). This contradictory context leads to confused or misleading responses:</p><pre><code>{
  "hits": {
    "total": { "value": 23 },
    "max_score": 24.5,
    "hits": [
      {
        "_id": "doc-oauth-2025",
        "_score": 24.5,
        "_source": {
          "title": "OAuth 2.0 Authentication Setup",
          "last_updated": "2025-10-15",
          "version": "9.x",
          "content_snippet": "To configure OAuth 2.0 authentication in Elasticsearch 9.x, use the new security API..."
        }
      },
      {
        "_id": "doc-oauth-2023",
        "_score": 23.8,
        "_source": {
          "title": "OAuth Authentication Configuration",
          "last_updated": "2023-04-20",
          "version": "7.x",
          "content_snippet": "Configure OAuth using the legacy realm settings in elasticsearch.yml..."
        }
      },
      {
        "_id": "doc-oauth-deprecated",
        "_score": 22.9,
        "_source": {
          "title": "Setting Up OAuth (Deprecated)",
          "last_updated": "2022-11-10",
          "version": "6.x",
          "content_snippet": "Use the shield plugin to configure OAuth authentication..."
        }
      }
    ]
  }
}</code></pre><h4><strong>With temporal filtering</strong></h4><p>Add a filter to restrict results to documents updated within the last six months:</p><pre><code>POST product-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "how to configure OAuth authentication"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "configure OAuth authentication",
                "fields": ["title^2", "content"]
              }
            }
          }
        }
      ],
      "filter": [
        {"range": {"last_updated": {"gte": "now-6M"}}},
        {"term": {"status": "published"}}
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "last_updated", "version", "content_snippet"],
  "size": 5
}</code></pre><h4><strong>This hybrid search query</strong></h4><ul><li><strong>Semantic search</strong> (<code>semantic</code>) captures related concepts and context using the <code>content_semantic</code> field.</li><li><strong>Lexical search</strong> (<code>multi_match</code>) matches exact keywords like “OAuth” with field boosting <code>title^2</code>.</li><li><strong>Reciprocal rank fusion (RRF)</strong> combines both result sets with balanced reranking, retrieving the most relevant results.</li><li><strong>Temporal filter</strong> ensures only documents updated within the last six months are retrieved.</li><li><strong>Status filter</strong> restricts results to published documents, excluding drafts or deprecated content.</li></ul><h4><strong>Response with temporal filtering: Consistent results</strong></h4><p>The temporal filtering eliminated outdated documents, leaving only current documentation for version 9.x. The LLM now receives consistent context and generates confident, accurate responses:</p><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "OAuth 2.0 Authentication Setup",
          "last_updated": "2026-01-15",
          "version": "9.x",
          "content_snippet": "Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Management > Security."
        }
      },
      {
        "_source": {
          "title": "OAuth Provider Configuration",
          "last_updated": "2025-12-20",
          "version": "9.x",
          "content_snippet": "Configure Okta, Azure AD, Auth0 via security API with OIDC auto-discovery."
        }
      }
    ]
  }
}</code></pre><h4><strong>Relative versus absolute time filters</strong></h4><p><strong>Relative filtering </strong>(recommended for most use cases):</p><pre><code>"filter": [
  {
    "range": {
      "last_updated": {
        "gte": "now-1y"
      }
    }
  }
]</code></pre><p><strong>Absolute filtering </strong>(for specific time ranges):</p><pre><code>"filter": [
  {
    "range": {
      "last_updated": {
        "gte": "2025-01-01",
        "lte": "2025-12-31"
      }
    }
  }
]</code></pre><h4><strong>Impact on LLM response quality</strong></h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0c225743cc92df873bd8685b5073fe749f4d2da6-1600x414.png" style="width: 100%;" /><ul><li><strong>Without filtering: </strong>LLM receives contradictory guidance from 2023–2025, producing uncertain responses mixing deprecated and current methods.</li><li><strong>With temporal filtering: </strong>LLM receives only recent documentation, generating confident responses based on current best practices.</li></ul><h3><strong>Information conflicts</strong></h3><p>When your RAG system retrieves documentation for features that behave differently across deployment types, versions, or configurations, conflicting information can confuse the LLM about which guidance applies to the user’s specific context.</p><h4><strong>Impact</strong></h4><p>The LLM has to spend more resources and tokens determining which information is correct, becoming more prone to errors and hallucinations.</p><h4><strong>Solutions: Hybrid search with metadata boosting</strong></h4><p>Elasticsearch’s <code>bool</code> query with a <code>should</code> clause allows you to apply boost values that prioritize documents matching specific metadata, ensuring deployment-specific or version-specific documentation appears first in the context window. For query syntax details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Bool query reference</a>.</p><h4><strong>Example: Deployment-specific feature documentation</strong></h4><p>A user asks, “How do I configure custom users in serverless?” Your knowledge base contains information about cloud, self-hosted, and managed deployments. With proper metadata prioritization, the LLM receives clear signals about feature availability and provides correct guidance:</p><pre><code>POST platform-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "bool": {
                "must": [
                  {
                    "multi_match": {
                      "query": "How do I configure custom users in serverless?",
                      "fields": ["title^2", "content"]
                    }
                  }
                ],
                "should": [
                  {"term": {"deployment_type": {"value": "serverless", "boost": 3.0}}},
                  {"term": {"doc_status": {"value": "current", "boost": 2.0}}}
                ]
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "How do I configure custom users in serverless?"
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "deployment_type", "feature_supported", "content_snippet"],
  "size": 5
}</code></pre><h4><strong>What this query does</strong></h4><ul><li><code>must</code> clause: All documents must match “How do I configure custom users in serverless?”</li><li><code>should</code> clauses with explicit boosting:<ul><li>Documents with <code>deployment_type: “serverless”</code> receive 3x boost.</li><li>Documents with <code>doc_status: “current”</code> receive 2x boost.</li></ul></li><li>Semantic search runs in parallel to capture conceptual matches.</li><li>RRF combines lexical (with metadata boosting) and semantic results to get the best of both approaches.</li></ul><h4><strong>Expected response:</strong></h4><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "Authentication in Serverless",
          "deployment_type": "serverless",
          "feature_supported": false,
          "content_snippet": "Custom authentication not available in Serverless. Use SSO with your identity provider."
        }
      },
      {
        "_source": {
          "title": "User Management in Serverless",
          "deployment_type": "serverless",
          "feature_supported": false,
          "content_snippet": "Direct user creation not supported in Serverless. Use your organization's IdP."
        }
      },
      {
        "_source": {
          "title": "Role-Based Access in Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Configure roles in Serverless console. Roles sync with SSO provider groups."
        }
      },
      {
        "_source": {
          "title": "API Keys in Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Create API keys for programmatic Serverless access. Keys inherit user permissions."
        }
      },
      {
        "_source": {
          "title": "SSO Configuration for Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Configure SSO in Serverless via Cloud console with SAML 2.0 or OIDC."
        }
      }
    ]
  }
}</code></pre><h4><strong>How metadata boosting resolves conflicts</strong></h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43a9227fc651ebd6ed6f0058972e0dac502aa7c5-955x1600.png" style="width: 100%;" /><h4><strong>Impact on LLM response quality</strong></h4><ul><li><strong>Without metadata boosting: </strong>The context window receives equal-weight documents from all deployment types. The LLM produces vague responses that hedge between possibilities, failing to clearly state deployment-specific limitations.</li><li><strong>With metadata boosting (3x): </strong>Serverless-specific documentation dominates the top results. The LLM generates direct answers about feature unavailability and provides actionable alternatives while maintaining the cross-deployment context for follow-up questions.</li></ul><h3><strong>Semantic noise</strong></h3><p>Vector similarity search can retrieve documents that are semantically related but contextually irrelevant to the user’s need. This “semantic drift” occurs when embeddings capture surface similarity without understanding query intent. When your context window fills with irrelevant information, the LLM&#x27;s ability to generate precise answers declines.</p><h4><strong>Impact</strong></h4><p>The LLM receives correct information that doesn’t answer the question, wasting the context window and lowering the quality of the provided answer.</p><h4><strong>Solution: Hybrid search</strong></h4><p>Elasticsearch hybrid search combines lexical precision with semantic understanding, using explicit product filters to eliminate cross-product drift while maintaining conceptual recall.</p><h4><strong>Example: Technical documentation search</strong></h4><p>A developer searches for “Elastic Agent configuration,” and your knowledge base contains both the Elastic Agent (Elastic Observability) and the Elastic Agent Builder documentation. 
Both use the word &quot;agent&quot; prominently, making them semantically similar.</p><p>Let’s search for agent configuration documentation:</p><pre><code>POST elastic-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "agent configuration logs metrics collection",
                "fields": ["title^3", "content", "tags^2"],
                "type": "best_fields"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "configuring agents to collect logs and metrics from hosts"
              }
            }
          }
        }
      ],
      "filter": [
        {"terms": {"product": ["observability", "elastic-agent"]}},
        {"term": {"doc_type": "configuration"}}
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "product", "tags", "url"],
  "size": 5
}</code></pre><p>This hybrid query:</p><ul><li><strong>Lexical component</strong> (<code>multi_match</code>) ensures exact keyword matches for &quot;agent&quot;, &quot;configuration&quot;, &quot;logs&quot;, &quot;metrics&quot;, and “collection”.</li><li><strong>Field boosting</strong> (<code>title^3</code>, <code>tags^2</code>) prioritizes documents where terms appear in important fields.</li><li><strong>Semantic component</strong> captures conceptual relationships and the intent about “configuring data collection agents”.</li><li><strong>RRF </strong>merges both result sets with balanced ranking using <code>rank_constant: 20</code>.</li><li><strong>Product filter </strong>restricts results to the &quot;observability&quot; and &quot;elastic-agent&quot; domains, eliminating Agent Builder docs entirely and preventing semantic drift to other domains.</li><li><strong>Document-type filter</strong> (<code>doc_type: &quot;configuration&quot;</code>) restricts results to configuration documentation, excluding conceptual overviews and troubleshooting content.</li></ul><p><strong>Expected response:</strong><br/></p><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "Elastic Agent Input Configuration",
          "product": "elastic-agent",
          "tags": ["inputs", "logs", "metrics", "configuration"],
          "url": "/docs/elastic-agent/inputs"
        }
      },
      {
        "_source": {
          "title": "Configure Elastic Agent for Log and Metric Collection",
          "product": "elastic-agent",
          "tags": ["configuration", "logs", "metrics", "observability"],
          "url": "/docs/elastic-agent/configure"
        }
      },
      {
        "_source": {
          "title": "Agent Policies and Integrations",
          "product": "observability",
          "tags": ["policies", "integrations", "fleet"],
          "url": "/docs/fleet/policies"
        }
      },
      {
        "_source": {
          "title": "Configuring Agent Outputs",
          "product": "elastic-agent",
          "tags": ["outputs", "elasticsearch", "logstash"],
          "url": "/docs/elastic-agent/outputs"
        }
      },
      {
        "_source": {
          "title": "Manage Elastic Agents with Fleet",
          "product": "observability",
          "tags": ["fleet", "agent-management", "deployment"],
          "url": "/docs/fleet/manage-agents"
        }
      }
    ]
  }
}</code></pre><h4><strong>Why hybrid search works</strong></h4>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Search type</td><td style="border: 1px solid #ccc; padding: 4px;">Strengths</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Lexical only</td><td style="border: 1px solid #ccc; padding: 4px;">Precise keyword matching</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Semantic only</td><td style="border: 1px solid #ccc; padding: 4px;">Captures semantic meaning</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hybrid search</td><td style="border: 1px solid #ccc; padding: 4px;">Precision and recall, intent understanding</td>
                    </tr>
              </tbody>
            </table>
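            <p>The fusion step behind every hybrid query in this article is simple enough to sketch directly. The following Python snippet is a minimal illustration of reciprocal rank fusion, where each document scores <code>1 / (rank_constant + rank)</code> per retriever, summed across retrievers. It is not Elasticsearch’s implementation, and the function name and document IDs are made up for illustration:</p><pre><code>def rrf_merge(result_lists, rank_constant=20, rank_window_size=50):
    """Toy reciprocal rank fusion over several ranked lists of doc IDs."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results[:rank_window_size], start=1):
            # Each retriever contributes 1 / (rank_constant + rank) to the doc's score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["current-doc", "legacy-doc"]                 # keyword-match ranking
semantic = ["current-doc", "old-guide", "legacy-doc"]   # vector ranking
print(rrf_merge([lexical, semantic]))
# → ['current-doc', 'legacy-doc', 'old-guide']</code></pre><p>A document ranked near the top by both retrievers outranks one favored by only a single retriever, which is why RRF can merge lexical and semantic results without normalizing their incompatible score scales.</p>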
          <p><strong>Before and after: LLM response comparison</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/437dbe84121502b6c9b320d29b0df5ac67bd35fa-1600x371.png" style="width: 100%;" /><h2>Elasticsearch RAG best practices</h2><p>Following these best practices optimizes your context engineering and significantly reduces the risk of context poisoning in your RAG systems. By implementing the following strategies, you ensure that every token in your context window contributes to relevant, accurate, and trustworthy LLM responses.</p><ol><li><strong>Choose the right search strategy for your data: <br/></strong>Select your search approach based on your data characteristics and query patterns. Choose between lexical, semantic, or hybrid search. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/search-approaches">Search approaches | Elastic Docs</a>.</li><li><strong>Implement temporal awareness<br/></strong>Time-sensitive information requires active management to prevent outdated content from contaminating your context window. Use range queries with relative time filters (like <code>now-6M</code> or <code>now-1y</code>) for content that changes frequently, ensuring your RAG system prioritizes recent content. For more details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-range-query">Range query | Reference</a>. </li><li><strong>Use metadata boosting<br/></strong>When your knowledge base contains similar content across different contexts, such as multiple product versions, deployment types, or user roles, metadata boosting helps prioritize contextually relevant results. 
For more details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Boolean query | Reference</a>.</li><li><strong>Apply reranking when needed<br/></strong>For complex or high-priority queries where precision is critical, consider adding a reranking model, which can significantly improve search result quality by reordering results based on a semantic understanding of queries and documents. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/ranking">Ranking and reranking | Elastic Docs</a>.</li><li><strong>Optimize chunking strategies<br/></strong><em>Chunking</em> is the process of breaking down large text into smaller “chunks.” Document chunking strategy affects both semantic representation and retrieval precision. Smaller chunks provide more granularity but may lose context; larger chunks preserve more context but reduce retrieval precision. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">Understanding chunking strategies in Elasticsearch</a>.</li><li><strong>Filter the data before it reaches the LLM<br/></strong>Vector similarity search can retrieve semantically related but contextually irrelevant documents. Apply explicit filters on product, category, or domain fields to constrain results to the appropriate context before delivering the results to the LLM. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/rag-in-production">RAG pipelines in production: Operationalize your GenAI project - Elasticsearch Labs</a>.</li><li><strong>Calibrate your retrieval volume (k)<br/></strong>Finding the &quot;Goldilocks zone&quot; for the number of documents retrieved is essential. Too few results lead to incomplete answers, while too many can cause the LLM to miss key facts. Balance your token budget against the depth of the model&#x27;s window. 
For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search in Elasticsearch | Elastic Docs</a>.</li><li><strong>Consider summarization for large documents<br/></strong>When retrieved content exceeds your context budget, summarization techniques help retain essential information while reducing token count. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/ai-generated-summaries-using-elastic">Adding AI summaries to your site with Elastic - Elasticsearch Labs</a>.</li><li><strong>Monitor and iterate<br/></strong>Over time, as knowledge bases grow and content evolves, we recommend that you implement monitoring to track relevance score distributions, temporal patterns in retrieved results, and user feedback signals. Watch for signs like outdated documents, declining user satisfaction scores, or a growing number of “no relevant results” queries. For more details, refer to <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">Elastic Observability: Streams Data Quality and Failure Store Insights</a>.</li></ol><h2>Conclusion</h2><p>The new era of million-token context windows has not made context management obsolete; it has made context engineering more critical than ever. As context windows grow, so does the potential for poisoning from any source: retrieval, tools, or memory.</p><p>The patterns shown in this article apply beyond just RAG. Temporal filtering, metadata boosting, and hybrid search are foundational techniques that improve context quality, regardless of source.</p><p>By implementing these strategies, you maintain control over what information reaches your LLM, ensuring relevance, accuracy, and trust at scale.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-vlm: Seeing like an AI with vision language models]]></title>
            <description><![CDATA[Learn about vision language models (VLMs), what jina-vlm can do, how to use it, and best practices.]]></description>
            <link>https://www.elastic.co/search-labs/blog/visual-language-model-jina-vlm</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/visual-language-model-jina-vlm</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens]]></dc:creator>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/38c086d709c066acaab063deee25ad23ce488a30-1600x900.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Vision-language models (VLMs) are generative language models that take texts and images as input and generate new text as outputs. They’re designed for <em>visual question-answering</em> (VQA), an application of multimodal AI that fuses image and natural language processing (NLP) in ways that would have been impossible a few years ago. Jina’s new model, <code>jina-vlm</code>, breaks new ground for VQA with state-of-the-art semantic processing of images like scans, slides, and diagrams that combine text and visual materials to carry information instead of natural scenes like photographs.</p><p><code>jina-vlm</code> opens up new application areas:</p><ul><li><strong>Accessibility:</strong> Creating descriptions of images for visually impaired people.</li><li><strong>Indexing:</strong> Generating detailed textual information to improve search for images and documents that contain images.</li><li><strong>Querying:</strong> Matching natural language queries to images through direct analysis of image content.</li><li><strong>Visual content analysis:</strong> Automatically inspecting images for classification and analysis.</li><li><strong>Retrieval-augmented generation (RAG):</strong> Using visual information as a knowledge source in RAG systems.</li></ul><p>This article will introduce you to VLMs and show you how to use <code>jina-vlm</code> to bring multimodal AI to your own solutions.</p><h2>Vision-language models</h2><p><code>jina-vlm</code> is a decoder-only language model with additional features and training to answer questions or create general descriptions of images.</p><p>We adapted the <a href="https://huggingface.co/Qwen/Qwen3-1.7B-Base">Qwen3-1.7B-Base</a> open weight language model as the backbone of <code>jina-vlm</code>. 
What makes it a <em>vision</em> language model is that we’ve connected it to a custom fine-tuned image encoder model, derived from the <a href="https://huggingface.co/google/siglip2-so400m-patch14-384">SigLIP2-So400M/14-384</a> model, which converts images into a form that the language model can understand, and we trained the combined model specifically for VQA and description generation.</p><p>To understand how this works, think about how language models “read” your prompts. Input texts are split into tokens (words or parts of words), which are replaced with embedding vectors from a lookup table and then fed to a neural network. The model never sees the actual letters you typed.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5c711b2e1e0710370fc1979e4a987a6c2a497ba-1600x235.png" style="width: 100%;" /><p>As seen in the image, each unique token is substituted with a number that, in turn, corresponds to an embedding vector stored in the model’s vocabulary. This enables the model to process whatever text you give it, even if the words aren’t in its stored vocabulary.</p><p>This is why language models struggle to count letters in obscure words, like this example with Claude Haiku 4.5:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cc242a0f9f5a8fa9aa14a1511ad47abfe6440b51-628x582.png" style="width: 100%;" /><p>This model correctly recognizes what <em>Qaqortoq</em> is; it knows that the word names a town in Greenland, and it can tell you all about it, but it has no idea which actual letters compose the name. 
There are ways to supplement language models to address this shortcoming, so some models (like ChatGPT 5, Gemini, and Claude Sonnet) don’t generally make this kind of mistake.</p><p>However, this architectural shortcoming makes it possible to create multimodal language models.</p><p>The actual input to generative language models is a sequence of semantic embeddings, and there’s no reason those embeddings have to come from texts. They could come from anything, as long as they’re compatible with the model’s training. We can make any language model handle images, provided we can generate vectors for those images in the same embedding space as the tokens.</p><p><code>jina-vlm</code> contains an image embedding model and adapter trained to do just that. The image (with some preprocessing into overlapping patches) is the input to the model, and it outputs a sequence of embeddings in the same semantic embedding space as the language model’s text token embeddings. These embeddings are then treated just like the text input embeddings. The full language model has access to your text and to the image at the same time, and it treats both as the same kind of input.</p><p>Figure 3 is a schematic of the <code>jina-vlm</code> model, showing the dual track processing of text and images together.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66886ced02dc707434a1f7d19c519681f648eb13-1128x1579.png" style="width: 100%;" /><p>This architecture has some important benefits. Language models learn a lot of implicit information about the world from their training data, and the VLM architecture can use that information to help understand the content of images. 
Furthermore, some of the bias problems present in other multimodal architectures are absent in this one, notably the <a href="https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/">media “gap” that troubles CLIP-style models</a>.</p><h2>What can jina-vlm do?</h2><p><code>jina-vlm</code> has three overlapping core competencies:</p><ul><li>Describing images and generating captions.</li><li>Understanding informationally rich images.</li><li>Answering questions about the content of images.</li></ul><h2>Describing and captioning pictures</h2><p>Providing good descriptions of images can be a boon for accessibility. There are increasing legal and informal requirements to provide alternative text (alt-text) for digitized images on the internet. Search engine optimization (SEO) strategies in particular often demand high-quality image captions and alt-text. Manually composing such descriptions can be expensive and time consuming.</p><p><code>jina-vlm</code> can help. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/816b17945624f8965705a06f4ac7e8774ae65714-1280x853.png" style="width: 100%;" /><p>You can input this image to <code>jina-vlm</code> with the text:</p><ul><li><strong>Input Instruction: </strong>Describe this image in detail.</li></ul><p>The result is the following:</p><ul><li><strong>Output: </strong>The image captures a vibrant scene of a white plate filled with a multitude of colorful jelly beans. The jelly beans, in hues of green, yellow, orange, purple, red, and blue, are scattered across the plate, creating a rainbow-like effect. The plate itself is placed on a gray surface, providing a stark contrast to the colorful jelly beans. The image does not contain any discernible text. The relative positions of the jelly beans vary, with some clustered together and others spread out, giving the image a sense of depth and texture. 
The overall composition of the image is simple yet striking, with the colorful jelly beans being the central focus.</li></ul><p>This description is a little wordy but quite detailed.</p><p><code>jina-vlm</code> treats the text you give it as a question or instruction, and some prompt engineering may improve the result a lot. For example, using the same picture of jelly beans:</p><ul><li><strong>Input Instruction: </strong>Describe this image. Be concise.</li><li><strong>Output: </strong>A white plate is filled with a variety of colorful jelly beans, including green, yellow, red, purple, and blue, arranged in a scattered manner.</li></ul><p>Or:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A white plate full of colorful jelly beans.</li></ul><p>Asking specifically for a caption tends to produce very short results. Most captions are brief, so <code>jina-vlm</code> is trained to generate brief captions.</p><p>You can also use <code>jina-vlm</code> to identify and list items in images:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6a7a2d832e225905e3e29a6df11c3d56dbf0655d-1764x876.png" style="width: 100%;" /><p>As a relatively small model, <code>jina-vlm</code> has limits to its object recognition. For common categories, it’s very good. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/253083f674293de2a2aea866151c1844f1561d03-1768x752.png" style="width: 100%;" /><p>It’s able to handle content like “red cars”, but ask jina-vlm a question involving detailed categories and it can become unreliable very quickly. For example, using the same image:</p><ul><li><strong>Input Instruction: </strong>What model of car is this?</li><li><strong>Output: </strong>The car in the image is a classic French car, specifically a Citroën DS. This model is known for its distinctive rounded body shape, which was a hallmark of the DS series. 
The DS was produced from 1955 to 1969 and was a popular choice for its combination of style, performance, and practicality. It was designed to be a luxury car that could be driven on city streets as well as on the roads, making it a versatile vehicle for its time. The DS was also notable for its innovative features, such as the rear spoiler and the front grille with the Citroën emblem.</li></ul><p>This answer is largely hallucinated. The car in the picture is one of the most recognizable cars in history: a Citroën 2CV, manufactured from 1948 to 1990. The Citroën DS was made from 1955 to 1975 and looked like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/79f2d0348b0bae6a9253f649e43d2a030983408a-1600x1066.png" style="width: 100%;" /><p>In short, when pressed for details beyond its capacity, <code>jina-vlm</code> hallucinates without warning.</p><p>It’s especially important to keep this in mind with images of people:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/499a0378f555ad24c745119f1d0110bbf82d5d2c-1772x1108.png" style="width: 100%;" /><p>Tom Cruise is famous enough, and present enough in the training data, for <code>jina-vlm</code> to recognize him. But, using the same picture:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A man with a green shirt and a big smile.</li></ul><p>This caption never mentions Tom Cruise, even though it’s the most important fact about the picture. Asking for a caption produces very little hallucination but also very little information.</p><p>In contrast, let’s do the same thing with this image:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2ccda0dd49d4caf78dffee2a797c7359e1ecca5f-1772x1076.png" style="width: 100%;" /><p>This is terribly wrong.
Ronald Reagan looks nothing like Richard Nixon.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dab9812917a27d064845a5baae2e3d2bdf6b2076-250x313.png" style="width: 100%;" /><p>When you ask <code>jina-vlm</code> to identify people, it’s likely to hallucinate because it doesn’t have very deep world knowledge. Only people who happen to appear many times in the training data are likely to be identified correctly.</p><p><code>jina-vlm</code> hallucinates much less when asked to write a caption. For example, using the same picture of Richard Nixon:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A man in a suit sits in a chair with a bookshelf behind him.</li></ul><p>The most effective way to reduce hallucination is to prompt the model to say less. Any use or deployment of <code>jina-vlm</code> should take this into account.</p><h2>Understanding informational content</h2><p><code>jina-vlm</code> truly shines at understanding informational graphics and synthetic images, like diagrams, graphs, screenshots, scans of printed pages, advertising copy, and presentation slides.
The examples below show only part of <code>jina-vlm</code>’s range.</p><h4>Charts</h4><p><code>jina-vlm</code> can report on the contents of all sorts of data presentations.</p><p><strong>Bar charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dbe91a785ef93e1036f2a7b227e4ffbdfc3d9d07-1742x772.png" style="width: 100%;" /><p><strong>Pie charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/460dd1663f32744297cede944af69b46f45bbb37-1832x872.png" style="width: 100%;" /><p><strong>Area charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/645ee7038dd1e0b6be7485796cf6ea6f2057b841-1828x928.png" style="width: 100%;" /><p><strong>Line graphs:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852da7806adf0aaf0071bab0a045894e4f7e8272-1828x660.png" style="width: 100%;" /><p><strong>Table:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e86d3bc65a4fc285bbc4751edc9cadf52a266a7-1818x948.png" style="width: 100%;" /><h4>Multilingual content</h4><p><code>jina-vlm</code>’s multilingual support means that it can recognize languages and describe informational content in a different language from its presentation:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0473ad1847e931a8f3b33e04c4e70ead94c2e5a8-1824x1104.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3fc53c96bb1ae235b29a01df05c0ffcf6b68bb4-1864x774.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/75376c40ae29ea33519dc79c55a46dc9f9f42438-1870x802.png" style="width: 100%;" /><p>It can also list the languages when there’s more than one:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1acf868a00ddd221a563d33a3d73a353ed816124-1854x992.png" style="width: 100%;" 
/><p><code>jina-vlm</code> may answer in a different language than you asked for, if the image content is in another language or strongly associated with it:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2778d64e56cde5f438e92a07555a80ae3fe48eb4-1934x710.png" style="width: 100%;" /><p>To fix this, specify the language in the query. With the same image:</p><ul><li><strong>Input Instruction: </strong>Summarize the content of this document in English.</li><li><strong>Output: </strong>The image is a promotional poster for a KFC Thailand event. The poster is predominantly red with white and black text. At the top, there is a logo of KFC, which is a red and white square with a black silhouette of Colonel Sanders. Below the logo, there are three young men, presumably the main participants or hosts of the event. They are dressed in casual attire, with two of them wearing red jackets and one in a white shirt with a black stripe.<br/><br/>The central part of the poster features a large text in white that reads, “เปิดตัวใหม่! ร่วมกับ KFC Thailand พบกับความสนุกสนานที่ไม่เหมือนใคร!” which translates to “Launching New! Join KFC Thailand and experience something unique!” Below this, there is a smaller text that says, “ทุกคนมีส่วนร่วม! วันนี้ 8 พฤศจิกายน 64 เวลา 12:00 น.” indicating that the event is happening on November 8, 2021, at 12:00 PM.<br/><br/>The bottom section of the poster includes social media icons and a Facebook page link, suggesting that the event will be live on Facebook. There are also icons of a chicken drumstick and a bucket, which are associated with KFC, reinforcing the brand’s identity.<br/><br/>The overall design of the poster is vibrant and eye-catching, with a clear focus on the event’s theme and the brand’s presence. The use of red and white colors is typical for KFC branding, which is known for its bold and energetic visual style. 
The poster is designed to attract attention and inform viewers about the upcoming event.</li></ul><h4>Complex information displays</h4><p>We specifically trained <code>jina-vlm</code> to understand complex information displays and to be able to summarize them in natural language.</p><p><strong>Infographics:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b964abf7f3f1176b0546babae3f4ae1d22d99c86-1850x1396.png" style="width: 100%;" /><p><strong>Technical and educational diagrams:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8499b5793faa3c4879e1b1ec8994c76e3b10ad20-1850x838.png" style="width: 100%;" /><p><strong>Maps:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e6c3d47890dbaf9476963e992126c0f497eacda-1854x758.png" style="width: 100%;" /><h4>Screenshots</h4><p><code>jina-vlm</code> has been trained to parse and understand screenshots:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/50d17af1a4c08e4a69c4fb625229264dbfc65ce8-1848x708.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a9273a5be5134185005bdb5efe7666f5494abb2-1864x1104.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b66c570624d61c4e4c9f31791cd63656258b238-1850x810.png" style="width: 100%;" /><p>It can even parse video games:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/335b9482505a7dd2dcd86851f19ae514647368a7-1698x902.png" style="width: 100%;" /><h4>Scans</h4><p><code>jina-vlm</code> excels at materials like scans and images of printed text:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bb057adc4451107d6cc2b3dde90eb11d562ef581-1494x1468.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/687fb7b5c707ade1511ee6d40a7d034cfa7cfefd-1796x1540.png"
style="width: 100%;" /><h4>Pre-digital materials</h4><p>Robust support for scans also makes pre-digital materials accessible to digital search technologies:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6b560a39e1e48d0e6755ed7b5273be77f03eca3c-1788x936.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d998c8e1ece6f1fc087b6d815b41ae11e16ee0bd-1498x1094.png" style="width: 100%;" /><h4>Social media</h4><p>Since much of the internet is “<a href="https://doctorow.medium.com/my-mcluhan-lecture-on-enshittification-ea343342b9bc">five giant websites filled with screenshots of text from the other four</a>,” <code>jina-vlm</code> is especially helpful in processing the kind of content that the web is full of:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f901fd079c7c53a6c0517e93b90b28367490710-1856x1044.png" style="width: 100%;" /><p><code>jina-vlm</code> can <em>sometimes</em> even explain memes:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4b55322d02cc13c6e718721763233f26497c64d0-1862x608.png" style="width: 100%;" /><h2>Visual question-answering</h2><p>“A picture is worth a thousand words.” As cliché as that saying is, multimodal AI can prove it.</p><p>Visual question-answering (VQA) is the domain where <code>jina-vlm</code> excels and the one with the most potential for new and revolutionary applications. <code>jina-vlm</code> answers questions about the content of images just as a language model answers questions about texts.
This means it can serve as part of a RAG system that assembles responses from mixtures of images and texts stored in a database.</p><p>The infographic below is very information rich, and with <code>jina-vlm</code> you can ask natural language questions about its content:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f71a444daedfae06eccbd6555451ea0d53889be-1792x1388.png" style="width: 100%;" /><p>Or, using the same image:</p><ul><li><strong>Input Instruction: </strong>What is the crew module of the SLS called?</li><li><strong>Output: </strong>Orion spacecraft</li></ul><p>This capability makes it possible to query formal schematics and diagrams:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a11f29f186455831f36849695ff69e0425d78f42-1792x838.png" style="width: 100%;" /><p>With the same image:</p><ul><li><strong>Input Instruction: </strong>How many bathrooms are there?</li><li><strong>Output: </strong>4</li></ul><p>If you use photographs or natural scenes, jina-vlm is typically good with high-visibility characteristics:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/61e2eebcd677e7a38e47cfe812b57c2f43022b43-1796x844.png" style="width: 100%;" /><p>But if you’re asking about finer, less visible details, the answers might get spotty:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6377ac8e9156cc0515abd54d7090d5739a575bd7-1786x758.png" style="width: 100%;" /><p>Thank you, Captain Obvious.</p><p><code>jina-vlm</code> is also not well-suited to resolving optical illusions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/26304a5b8b4159e67d80712006f39db3ec90d5af-1794x626.png" style="width: 100%;" /><p>You can ask questions about text in photographs:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f5dc84db1d9a25f68fa914928b1d9effdaf0e821-1630x1068.png" style="width: 100%;" /><p>And 
<code>jina-vlm</code> is ideally suited to understanding traditional magazine and newspaper advertisements:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c83c55998e7b68e2a802572e8234d1a277e50f6e-1634x1268.png" style="width: 100%;" /><p>Visual question-answering also works cross-lingually:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fad19e8b0008c52954441e98879ee170bb278291-1628x1278.png" style="width: 100%;" /><p>Using the same image:</p><ul><li><strong>Input Instruction: </strong>How much do grapes cost per kilo?</li><li><strong>Output: </strong>The cost of grapes per kilogram is 3.98.</li></ul><p>It answers questions from manual pages:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4b3cb87ad2132671f8c079fc1922088753f58036-1788x1424.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3ef1b1e578d200511ee8805eaedba04391d43c2-1796x1532.png" style="width: 100%;" /><p>It can also answer questions about the content of posters and artistic works:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7771639037d176121702dc6d68307e377928e284-1500x1090.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7e8eb8e43a46d5f5f6ca93c1810bc41f7baada5b-1496x684.png" style="width: 100%;" /><p>We showed in the previous section that <code>jina-vlm</code> is not very good at general face recognition, outside of a few very famous people. However, it can use the text in images to answer questions it can’t answer from just a picture.</p><h2>Best practices</h2><p><code>jina-vlm</code> is a generative language model. It’s not a database. It’s prone to error and hallucination.
It’s very important to take that into account in your use case.</p><p>To minimize errors, we offer the following advice:</p><h3>Test your prompts</h3><p>If you depend on getting certain kinds of results consistently, you should do thorough quality testing on your queries. Getting a few good results with one prompt does not guarantee the same quality of result with different images. You can minimize this risk by manually reviewing the results of queries over a broad collection of images that are genuinely representative of what you plan to process.</p><h3>Specify the language</h3><p>Add to your query the language you expect the answer to be in. Like all language models, <code>jina-vlm</code> cannot easily distinguish between your instructions and the data it’s supposed to apply your instructions to. The only way for it to know how to answer is to look at the languages it recognizes in its input or to follow your explicit instructions.</p><h3>Don’t ask leading questions</h3><p>It’s very easy to get <code>jina-vlm</code> to hallucinate by asking leading questions. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/454b1b483a0d555f37dbf575e064fdbb13a3b517-1658x676.png" style="width: 100%;" /><h3>Prompt for short answers</h3><p><code>jina-vlm</code> can be very wordy sometimes. Using words like “briefly” and “concise” reduces the size of the output.</p><p>Compare:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0805c2868fa220c3bd39d6cee6d6092d50c79d06-1830x1042.png" style="width: 100%;" /><p>And with the same image:</p><ul><li><strong>Input Instruction: </strong>What kind of vehicle is this?
Be very concise.</li><li><strong>Output: </strong>This is a Massey Ferguson tractor.</li></ul><h3>Only ask questions relevant to the image</h3><p><code>jina-vlm</code> will not always be able to tell that your question is irrelevant to the content and can easily hallucinate answers. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a493d89b6a3aba7040083f4028da9cda0bc37aa5-1778x1272.png" style="width: 100%;" /><p>With some planning, you might be able to reduce this effect by using a decision tree: Ask multiple questions, each one dependent on the answer to the previous one. Using the same Picasso painting above, we would never have asked about the language of the text if we had asked this question first:</p><ul><li><strong>Input Instruction: </strong>Is there any written language in this image? Be very concise.</li><li><strong>Output: </strong>No</li></ul><h2>Think like a language model</h2><p>We trained <code>jina-vlm</code> to attend to the image given to it. This means that no matter what you ask it, its answer will be a result of the picture it sees. Anthropomorphizing language models is a major hazard for AI users, but it’s easy to see how <code>jina-vlm</code> “thinks” if we ask it questions that call for reflection. 
For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a49ce6aed18dcf023deaaf52c0b2e536e237b2b7-1782x910.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4711f4c8da535107dff421275af6bbbdd602aaf8-1774x916.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/805c2b4598999094b3c36c828813b471316237bf-1772x916.png" style="width: 100%;" /><p>You can ask <code>jina-vlm</code> anything, even questions like, “How are you feeling?” and it will answer in a way that is, if possible, attentive to the image.</p><p><code>jina-vlm</code> has no filter, safety training, or reliable source of facts, and it only knows that what it sees in the image you give it matters. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c1e91515231fd88b707480f102c98576fa7707d6-1778x594.png" style="width: 100%;" /><p>But:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/65576bf161a845447f17f1f42615a6e679767b37-1768x622.png" style="width: 100%;" /><p>This is not a bug, it’s a feature. <code>jina-vlm</code> will rely on what you give it to ground what it says. It’s up to you to decide what to give it.</p><p>As they say,</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66efa7ef107db4948b25d32843699244e3a8bb9c-1770x784.png" style="width: 100%;" /><p>And,</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a542cf0363b432662ef4bf78946c9f75d3681e0e-1782x830.png" style="width: 100%;" /><h2>Getting started with jina-vlm</h2><p><code>jina-vlm </code>is available via a public <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI-compatible API</a> at <a href="https://api-beta-vlm.jina.ai">https://api-beta-vlm.jina.ai</a>. To use this API, get a <a href="https://jina.ai/api-dashboard">free Jina API key at the Jina website</a>. 
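</p><p>Since the endpoint speaks the OpenAI chat-completions dialect, a request is just a standard multimodal chat payload. The Python sketch below builds one; the model id <code>jina-vlm</code> and the exact message format are assumptions based on the usual OpenAI schema, not confirmed details of this API.</p>

```python
# Sketch: build an OpenAI-style chat payload for one image plus one question.
# The model id and payload shape are assumptions; check the API docs linked below.
def build_vqa_request(image_url: str, instruction: str, model: str = "jina-vlm") -> dict:
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    # Prompting for brevity ("Be concise") tends to reduce hallucination.
                    {"type": "text", "text": instruction},
                ],
            }
        ],
    }

payload = build_vqa_request(
    "https://example.com/jelly-beans.png",  # any publicly reachable image
    "Describe this image. Be concise.",
)
```

<p>You would then POST this payload, with your API key as a bearer token, to the chat-completions route of the host above (on OpenAI-compatible servers this is typically <code>/v1/chat/completions</code>, but verify against the linked instructions).</p><p>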
It comes with 10 million free tokens to try out Jina by Elastic’s models. Additional tokens are available for purchase.</p><p>You can find <a href="https://jina.ai/news/jina-vlm-small-multilingual-vision-language-model/#getting-started">instructions for accessing the <code>jina-vlm</code> API on the Jina AI website</a>, and there’s a <a href="https://jina.ai/api-dashboard/visual-qa-test">web-based sandbox for trying out images and requests</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de2fa4ded4913a7592665e2944d395ef87c8d768-1072x798.png" style="width: 100%;" /><p>You can also <a href="https://huggingface.co/jinaai/jina-vlm">download the model from Hugging Face</a> and run it locally for noncommercial purposes. For commercial deployment on site, contact <a href="https://www.elastic.co/contact?pg=global&amp;plcmt=nav&amp;cta=205352">Elastic sales</a>.</p><p>For more technical information about the model, including pricing information, visit <a href="https://jina.ai/models/jina-vlm/">the jina-vlm model page</a>. You can also read <a href="https://arxiv.org/abs/2512.04032">the release paper on arXiv.org</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Groq with Elasticsearch for intelligent queries]]></title>
            <description><![CDATA[Learn how to use Groq with Elasticsearch to run LLM queries and natural language searches in milliseconds.]]></description>
            <link>https://www.elastic.co/search-labs/blog/groq-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/groq-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mark Puddick]]></dc:creator>
            <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c9661fc360c775ebfe48ccda81160057c49f7088-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>One of the challenges with using large language models (LLMs) in conjunction with Elastic is that we often need fast results. Elastic has no issues providing millisecond response time. However, when we introduce LLM calls with this, we potentially have a big issue with our performance dropping to unacceptable levels. This is where hardware inference using Groq can supercharge the speed of your results when combining Elastic with an LLM.</p><p>Groq is a hardware and software company focused on delivering ultra-low-latency, deterministic AI inference at scale. Its core innovation is the Groq Language Processing Unit (LPU) Inference Engine, a purpose-built, custom-designed chip architecture specifically engineered for running LLMs at extremely high speed with predictable performance. The links below give a more detailed overview of the Groq architecture.</p><ul><li><a href="https://groq.com/lpu-architecture">Groq: Introducing the LPU</a></li><li><a href="https://groq.com/blog/artificialanalysis-ai-llm-benchmark-doubles-axis-to-fit-new-groq-lpu-inference-engine-performance-results">Groq: The New LLM Inference Benchmark</a></li></ul><p>Unlike traditional GPU-based systems, Groq&#x27;s inference-specific architecture allows it to process tokens at unprecedented throughput with minimal variance in response time. This directly addresses the memory bandwidth bottlenecks and scheduling overhead that typically slow down traditional LLM calls, ensuring that integrating an LLM with Elastic&#x27;s search results maintains a real-time user experience. 
Groq provides this industry-leading speed and performance, often at the best price-performance, via GroqCloud, an easy-to-use tokens-as-a-service platform.</p><p>Let’s start by looking at a common intelligent query layer request pattern and what improvements we can get from this.</p><h2>Natural language search</h2><p>Since the widespread adoption of LLMs, a common search requirement is the ability to use natural language for domain-specific search. A naive way of addressing this is to do a simple semantic search within a retrieval-augmented generation (RAG) workflow; however, in most cases, this doesn’t provide the desired results. This is largely due to specific attributes in the question that need to be translated into query terms. To address this, we can ask an LLM to generate a query that we can execute. However, this leaves a lot of room for error. Ultimately, we’ve found that providing a tool with specific parameters for the domain and using that with the LLM gives the best result. There’s more info on this in <a href="https://www.elastic.co/search-labs/blog/mcp-intelligent-search">this blog</a>.</p><p>To define the agent, we’ll use the following prompt:</p><pre><code>You are a helpful banking transaction agent. You help users search and analyze their banking transactions.

Current date: {current_date}

When users ask about transactions, use the appropriate tools:
- Use trans-search for finding specific transactions

For date references:
- "last month" = past 30 days from today
- "this month" = current month from 1st to today
- "last week" = past 7 days
- "this year" = January 1st of current year to today

By default, set the to date to today and the from date to 1 year ago
Common categories: groceries, dining, gas, shopping, entertainment, utilities, healthcare, transportation, travel, subscriptions, insurance, phone, internet</code></pre><p>As an example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71a1abec1c90a5faace7dc4699b2cb1096da405b-1600x215.png" style="width: 100%;" /><p>This gives us good results, but our search time goes from under 100ms to over 1 second, due to the LLM call.</p><p>To address this, we can use Groq’s hardware inference to run this query in a fraction of the time. To run through the example, you’ll need to <a href="https://console.groq.com/home">sign up for a Groq account</a>.</p><p>Then you can generate an API key from the top-right menu:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6acd9e3155c4008aa903c8f0ad1276a7b1af8bea-970x330.png" style="width: 100%;" /><p>We’ve created the tool <a href="https://github.com/markpudd/transaction_search_mcp">here</a> to be able to execute the search.</p><p></p><p></p><p></p><p>Once you’ve cloned the above repo, you’ll need to update the .env to point to Groq:</p><pre><code>OPENAI_API_KEY=gsk-........

OPENAI_API_BASE=https://api.groq.com/openai/v1

OPENAI_MODEL=openai/gpt-oss-20b</code></pre><p>We’ve used the 20b gpt-oss model, as this will give accurate results. There’s little to no gain in using a bigger model for this type of solution.</p><p>Now, for testing, we can run this from a simple UI with a prompt to use the tool:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fe93bd101f230ff72f43b126ea88037b624bee5-1600x957.png" style="width: 100%;" /><p>To test the timings around this, we’ll run the tool 50 times and get the average of the response times for the total, the LLM, and Groq. We’ll use GPT-4.1-nano and the gpt-oss-20b model on Groq. Here’s the result from this test:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8eb1cca56914a54a19a80a3983106753b5e7367-1568x1324.png" style="width: 100%;" /><p>It’s clear that we drop around a second by using Groq’s hardware inference. We’re also using a smaller model which, for this use case, still provides good results. By dropping this from 1.5 seconds to 250ms, we’ll generally fall inside the Service Level Agreement (SLA) levels for a lot of organizations.</p><h2>Elastic Agent Builder</h2><p>We’ve shown how Groq can be used to accelerate natural language search with Elastic; we can also use it to accelerate Elastic Agent Builder. Agent Builder was recently released into technical preview and is now capable of connecting to Groq via the Groq endpoint. Agent Builder is available on Elastic 9.1+.
We can use the same API key that we used earlier.</p><p>Here’s how you set this up:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/237e1c08060d0598dc398750f636f5da15807163-1600x171.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/065e70fedf794a2d791340ed88785e07e2579dca-1536x596.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea0979dfcbc4de7cd3db748b575dbfd718a9fbb9-1600x214.png" style="width: 100%;" /><p>If you’re using serverless, you need to create a new connector from the stack management connectors page. First, click AI Connector.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba55fbc1e424792de3afe567ad92547e7a4165de-530x700.png" style="width: 100%;" /><p>On the next screen, select Groq as the service:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e628e81738de5ffdeaf5a5e533a7070ffad44354-1232x1246.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43863ceb10f4c0d3f306e660a0a23f514869d5ef-670x414.png" style="width: 100%;" /><p>You can then set up the model you want to use. The supported models are listed on the <a href="https://console.groq.com/docs/models">Groq website</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fdbf14924d209b69da572e3d1aca883403444d5a-654x663.png" style="width: 100%;" /><p>If you need to add your organization ID, this can be added by expanding <strong>More options</strong> under <strong>Settings</strong>.</p><p>If you’re on a hosted version of Elastic, at the time of this writing, you can use the OpenAI compatible endpoint on Groq to connect to Elastic. 
To do this, select the OpenAI service and use a custom URL that points to the Groq URL, as below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71261d380c60e08280cce54f0dbf67a0cc3a3e74-1584x928.png" style="width: 100%;" /><p>Once you have set up Groq using either of the above methods, go to GenAI Settings and set Groq as your default GenAI connector.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7c367bebb80074ba1a04116d0cc66b991a0a735d-1600x842.png" style="width: 100%;" /><p>Agent Builder will now default to using the Groq connector.</p><p>Let&#x27;s see whether we can replicate the NLP search within Agent Builder and use Groq.</p><p>In order to create agents, we generally need to have some tools for the agent to use. In Agent Builder, you have the ability to use built-in tools or create your own. A number of built-in tools are documented <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools/builtin-tools-reference">here</a>.</p><p>You can use these tools for your transaction searching. The LLM will use the built-in tools, such as <code>index_explorer</code>, <code>generate_esql</code>, and <code>execute_esql</code>, which try to find the relevant index, inspect the structure, and execute an Elasticsearch Query Language (ES|QL) generated query. However, this presents a few challenges:</p><ul><li>The time to run the agent will increase substantially, as there will be multiple reasoning steps and tool executions. Since we’re using Groq to get faster results, this isn’t ideal.</li><li>As the number of steps and the tool usage grows, we’re going to consume substantially more tokens and therefore increase cost.</li></ul><p>To avoid the above issues, we can create a new tool designed specifically to search transactions.
At the time of this writing, there are three types of tools we can use:</p><ul><li>ES|QL tools: These allow you to use templated ES|QL to define a query.</li><li>Index search tools: These allow you to provide an index, and the LLM creates the query.</li><li>Model Context Protocol (MCP) tools: These allow you to use external tools via MCP.</li></ul><p>We could use our previously created MCP tools; however, to keep things simple, we’ll use the index search tool. You can set this up as below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f9b85dacf677613e5421cd6b4590f4b57d18e6b6-1505x1600.png" style="width: 100%;" /><p>Once we’ve created the tools, we can create an agent in Agent Builder. To do this, we click the <strong>Create agent</strong> button and fill in the screenshot below, using the prompt we used in our original example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74c190822b3ebd974b224d749e5229e9f3552b83-1600x1112.png" style="width: 100%;" /><p>We also want to select the tool we created as part of the agent:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/270fc98a264dacb4a48af60135f9a82bbf6cdc6e-1600x1045.png" style="width: 100%;" /><p>And test in the Agent Builder UI by asking a few different questions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3fde2839943659e599c013cae50929913c81d092-1600x1376.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba2bb5b219c23289dd5b1c4608dcf98379bbf030-1600x1445.png" style="width: 100%;" /><p>We actually get some more functionality via Agent Builder, since it can create extra queries due to the extra built-in tools we selected. The only real disadvantage of this is that it can take longer to answer questions overall, since the LLM has the ability to do more. Again, this is where Groq can help. 
Let&#x27;s take a look at the performance difference in Agent Builder with Groq.</p><h2>Performance with Groq in Agent Builder</h2><p>A great feature of Agent Builder is that it supports MCP and agent-to-agent (A2A) out of the box. We can use this to do some simple benchmark testing. Using A2A, we can replace the agent in our UI and test harness with Agent Builder. This allows us to test Agent Builder with Elastic LLM and a couple of different models in Groq.</p><p>There’s an <a href="https://github.com/markpudd/a2a_chat">updated repo</a> containing the benchmark script.</p><p>To test this, we’ll ask the question:</p><p><em>How much did I spend on gas?</em></p><p>The results of the testing are shown below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Groq openai/gpt-oss-120b</td><td style="border: 1px solid #ccc; padding: 4px;">Groq llama-3.3-70b-versatile</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic LLM</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Min (s)</td><td style="border: 1px solid #ccc; padding: 4px;">6.040</td><td style="border: 1px solid #ccc; padding: 4px;">4.433</td><td style="border: 1px solid #ccc; padding: 4px;">15.962</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max (s)</td><td style="border: 1px solid #ccc; padding: 4px;">9.625</td><td style="border: 1px solid #ccc; padding: 4px;">7.986</td><td style="border: 1px solid #ccc; padding: 4px;">24.037</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Mean (s)</td><td style="border: 1px solid #ccc; padding: 4px;">7.862</td><td style="border: 1px solid #ccc; padding: 4px;">6.216</td><td style="border: 1px solid #ccc; padding: 4px;">17.988</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Median (s)</td><td style="border: 1px solid #ccc; padding: 4px;">7.601</td><td style="border: 1px solid #ccc; padding: 4px;">6.264</td><td style="border: 1px solid #ccc; padding: 4px;">17.027</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">StdDev (s)</td><td style="border: 1px solid #ccc; padding: 4px;">1.169</td><td style="border: 1px solid #ccc; padding: 4px;">1.537</td><td style="border: 1px solid #ccc; padding: 4px;">2.541</td>
                    </tr>
              </tbody>
            </table>
          <p>As you can see, the built-in Elastic LLM is not bad, but Groq still outperforms it by almost 3x on average. You’ll also notice that overall latency is substantially higher than in the external app. This is because we set up our tool in Agent Builder to use only the index, so a lot of the time is spent on Agent Builder reasoning (that is, inspecting the index). Using templated ES|QL tools instead would move the results closer to the external app.</p><h2>Conclusion</h2><p>It’s clear that using Groq with Elastic opens up a range of new possibilities where speed is an important factor. This article covers a basic intelligent query example, but many other applications, such as image understanding, summarization, and captioning, become possible with a 10x increase in speed.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ES|QL dense vector search support]]></title>
            <description><![CDATA[Using ES|QL for vector search on your dense_vector data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/dense-vector-search-elasticsearch-query-language</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/dense-vector-search-elasticsearch-query-language</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Carlos Delgado]]></dc:creator>
            <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c61f7d90a3d5c79da51518f56a115bf059224cad-1024x559.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>You can now use Elasticsearch Query Language (ES|QL) for vector search! ES|QL can retrieve, filter, and score <code>dense_vector</code> fields. Use k-nearest neighbors (KNN) queries for fast, approximate nearest neighbors search at scale. Use vector similarity functions for exact search and custom scoring.</p><p>KNN is simpler to use in ES|QL than in the Query DSL. Prefilters and the number of results to retrieve per shard are automatically inferred from the ES|QL query.</p><h2><strong>What is vector search?</strong></h2><p>Modern search is no longer limited to exact keyword matching. Users expect systems to understand <em>meaning</em>, not just text. This is where <strong>vector embeddings</strong> and Elasticsearch’s <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field type come in.</p><p>The easiest way to use vector search in Elasticsearch is to use the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code> field type</a>. It allows you to automatically generate text embeddings, perform semantic search, and handle chunking. However, you may want to use <code>dense_vector</code> instead when:</p><ul><li>You’re already using <code>dense_vector</code> fields.</li><li>You’re using non-textual data, like images, sound, or video.</li><li>You need to generate embeddings separately from ingestion into Elasticsearch.</li><li>You need to do custom or advanced scoring.</li><li>You want to perform exact nearest neighbors search.</li></ul><p>A <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field stores numerical embeddings produced by machine learning models. 
These embeddings capture semantic similarity: documents with similar meaning have vectors that are close to each other in high-dimensional space.</p><p>With vectors, you can build:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/semantic-search"><code>Semantic text</code> search</a>, for finding documents related to a question.</li><li><a href="https://www.elastic.co/what-is/retrieval-augmented-generation">Retrieval-augmented generation</a> (RAG).</li><li>Recommendation systems.</li></ul><p>ES|QL brings the power of a query-piped experience to Elasticsearch. Adding first-class support for <code>dense_vector</code> fields means you can now <strong>retrieve, filter, score, and search using vectors directly in ES|QL</strong>, alongside your text and non-text data.</p><p>In this post, we’ll walk through how to work with <code>dense_vector</code> fields in ES|QL, from basic inspection to approximate and exact similarity search, and how to use vector search as part of hybrid search strategies.</p><h2><strong>The basics: Retrieving vector data</strong></h2><p>Assume you have an index with a mapping similar to:</p><pre><code>{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "category": { "type": "keyword" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 384,
        "similarity": "cosine"
      }
    }
  }
}</code></pre><p>You can retrieve vector fields just like any other column:</p><pre><code>FROM documents
| KEEP title, content_vector
| LIMIT 5</code></pre><p>Keep in mind that vectors can be large. For exploration and debugging, it may be useful to retrieve vector data, but in production you should avoid returning full vector data unless it&#x27;s really necessary.</p><p>You can use familiar ES|QL constructs to check how many rows have vector information:</p><pre><code>FROM documents
| WHERE content_vector IS NOT NULL
| STATS non_null = COUNT(*)</code></pre><h2><strong>Approximate search using KNN</strong></h2><p>Vector search means finding the most similar vectors to a given query vector.</p><p>For large datasets, the most common approach is <a href="https://www.elastic.co/blog/understanding-ann">approximate nearest neighbor</a> (ANN) search. ANN tries to find the most similar vectors by using data structures that allow for fast computation of similar vectors but doesn’t guarantee that all vectors will be considered.</p><p>ES|QL exposes approximate search via the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code> function</a>:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...])
| SORT _score DESC
| KEEP title, _score
| LIMIT 10</code></pre><p>This simple example:</p><ul><li>Searches over the <code>content_vector</code> field.</li><li>Uses a dense vector query <code>[0.12, -0.03, 0.98, ...]</code> to find vectors similar to it.</li><li>Sorts the results by score, using the <code>METADATA _score</code> attribute that will be populated by the <code>KNN</code> function.</li><li>Keeps just the title and score, as the <code>content_vector</code> field doesn&#x27;t need to be returned, so we avoid loading its contents.</li><li>Retrieves the top 10 elements by using <code>LIMIT</code>. This automatically sets <code>k</code> to 10 in the <code>KNN</code> function.</li></ul><p>The <code>KNN</code> function can be further customized by using options:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...], {"k": 20, "min_candidates": 100, "rescore_oversample": 4.0})
| SORT _score DESC
| KEEP title, _score
| LIMIT 10</code></pre><p>See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code> function</a> named parameters for a complete description of the available parameters.</p><h3><strong>Combining KNN with filters</strong></h3><p>You can narrow down the candidate set for the vector search:</p><pre><code>FROM documents METADATA _score
| WHERE category == "tutorial"
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...])
| SORT _score DESC
| LIMIT 10
| KEEP title, category, _score</code></pre><p>Of course, you can use any other <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where"><code>WHERE</code> clauses</a> that filter the results or include <code>KNN</code> as part of a filter expression:</p><pre><code>FROM documents METADATA _score
| WHERE published_date > NOW() - 1 hour AND LENGTH(category) > 10 AND KNN(content_vector, [0.12, -0.03, 0.98, ...])</code></pre><h3><strong>KNN made simple</strong></h3><p><code>KNN</code> is simpler to use in ES|QL. You won&#x27;t have to specify prefilters or k for your query explicitly.</p><p><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-filtering">Prefilters</a> are the way to ensure that a <code>KNN</code> query returns as many results as expected. Prefilters are applied on the KNN search itself, instead of being applied after the query.</p><p>Keep in mind that <code>KNN</code> returns the top k results it&#x27;s been asked for. If filters are applied after the <code>KNN</code> query, some of the results returned by the query may be filtered. If that happens, we’ll retrieve fewer results than expected.</p><p>The Query DSL <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query"><code>knn</code></a><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query"> query</a> contains a section for specifying prefilters:<br/></p><pre><code>POST my-image-index/_search
{
  "query" : {
    "knn": {
      "field": "content_vector",
      "query_vector": [0.12, -0.03, 0.98, ...],
      "filter" : {
        "term" : { "category" : "tutorial" }
      }
    }
  }
}</code></pre><p>You don&#x27;t need to worry about prefilters when using <code>KNN</code> in ES|QL. All filters are applied as prefilters for the <code>KNN</code> function, so there’s no need to specify them in a separate option or command; just use <code>WHERE</code> and let ES|QL do it for you!</p><p>KNN also allows specifying the number of results to retrieve per shard; that is, the k parameter. Similar to the Query DSL, k defaults to the <code>LIMIT</code> specified in your query.</p><h2><strong>Exact search using vector similarity functions</strong></h2><p><code>KNN</code> is designed to be fast, and that makes it ideal for large datasets (hundreds of thousands or millions of vectors) and latency-sensitive applications. The trade-off is that results are <em>approximate</em>, though usually very accurate.</p><p>Sometimes you want <em>exact</em> similarity computation instead of approximate search, for example:</p><ul><li>When your dataset is small.</li><li>When the filters used in the query are very restrictive and select a small subset of your dataset.</li></ul><p>ES|QL provides the following <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions">vector similarity functions</a>:</p><ul><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_cosine"><code>V_COSINE</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_dot_product"><code>V_DOT_PRODUCT</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_hamming"><code>V_HAMMING</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_l1_norm"><code>V_L1_NORM</code></a></li><li><a 
href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_l2_norm"><code>V_L2_NORM</code></a></li></ul><p>Using these functions, you can calculate the similarity of your query vector with all the vectors your query retrieves.</p><p>The following query uses the same mapping as our <code>KNN</code> example, above, but does exact search using cosine similarity:</p><pre><code>FROM documents
| EVAL similarity = V_COSINE(content_vector, [0.12, -0.03, 0.98, ...])
| SORT similarity DESC
| KEEP title, similarity
| LIMIT 10</code></pre><p>This query:</p><ul><li>Computes the similarity using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_cosine"><code>V_COSINE</code></a> vector similarity function.</li><li>Sorts on the computed similarity.</li><li>Keeps the top 10 similar results.</li></ul><h2>Semantic search</h2><p>When doing semantic search, you&#x27;ll be trying to match a text query to your vectors. Of course, you can calculate the embeddings yourself and then supply the query vector directly to your vector search.</p><p>But it would be much simpler to allow Elasticsearch to calculate the embeddings for you by using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-text_embedding"><code>TEXT_EMBEDDING</code></a> function:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, TEXT_EMBEDDING("my semantic query", inference_id))
| SORT _score DESC
| LIMIT 10
| KEEP title, _score</code></pre><p><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-text_embedding"><code>TEXT_EMBEDDING</code></a> uses an already existing <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-inference">inference endpoint</a> to automatically calculate the embeddings and use them in your query.</p><h2>Hybrid search</h2><p>Most searches don&#x27;t rely just on vector data; they need to be combined with lexical search, as well, so we have the best of both worlds:</p><ul><li>Lexical information is great for exact search of words and synonyms and provides a strong signal that users are looking for specific terms.</li><li>Vectors capture meaning and intention, using similar phrases or terms that are not lexically related.</li></ul><p>Combining vector search and lexical search is best done using <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a>:</p><pre><code>FROM documents METADATA _score, _id, _index
| FORK
(WHERE KNN(content_vector, TEXT_EMBEDDING("my query", inference_id)) | SORT _score DESC | LIMIT 10)
(WHERE MATCH(title, "my query") | SORT _score DESC | LIMIT 10)
| FUSE
| SORT _score DESC
| LIMIT 10</code></pre><p>The query above:</p><ul><li>Uses <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> to do two queries:<ul><li>A <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code></a> query over a `dense_vector` field.</li><li>A <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/search-functions#esql-match"><code>MATCH</code></a> query over a text field.</li><li>Both queries sort by score and return the top 10 results for each.</li></ul></li><li>Uses <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> to mix together the query results by using reciprocal rank fusion (RRF) by default.</li></ul><p>This allows total control of the queries you want to perform, how many results to retrieve from each one, and how to combine the results together.</p><p>Check out our <a href="https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql#vector-search">multistage retrieval</a> blog post for more details on how modern search works and how easy it is to implement via ES|QL.</p><h2>Custom scoring</h2><p>Calculating custom scoring is easy using ES|QL! Just use the <code>_score</code> metadata field for calculating your custom score:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, TEXT_EMBEDDING("my semantic query", inference_id))
| EVAL my_custom_score = _score * 1.5 + ...
| SORT my_custom_score DESC
| LIMIT 10</code></pre><p>If you&#x27;re using exact search, you already have an evaluation for the vector similarity that you can fine-tune:</p><pre><code>FROM documents
| EVAL similarity = V_COSINE(content_vector, [0.12, -0.03, 0.98, ...])
| EVAL my_custom_score = similarity * 1.5 + ...
| SORT my_custom_score DESC
| LIMIT 10</code></pre><p>Compared to the Query DSL <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query"><code>script_score</code></a>, this is a much simpler and more iterative approach and fits perfectly into the ES|QL execution flow.</p><h2>Using query parameters</h2><p>When using a query vector, you can specify it directly on the query as in our previous examples. But you may have noticed that we&#x27;re using ellipses (...) to signal that there&#x27;s more data to come.</p><p>Dense vectors are usually high dimensional; they can have hundreds or thousands of dimensions, so copying and pasting your query vector on the query itself can make it difficult to understand or reason about, as you&#x27;ll be seeing thousands of numeric values on your screen.</p><p>Remember that you can use ES|QL query parameters for supplying parameters to your query:</p><pre><code>POST _query
{
  "query": """
FROM documents
| WHERE KNN(content_vector, ?query_vector)
| SORT _score DESC
| KEEP title, _score
| LIMIT 10
   """,
 "params": [{"query_vector" : [0.12, -0.03, 0.98, ...]}]
}</code></pre><p>This helps to keep your query and parameters separated, so you can focus on the query logic and not on specific parameters that get in your way.</p><p>Using query parameters for vectors is also more performant: vectors are parsed faster by the request parser than by the ES|QL parser.</p><h2><strong>Conclusion</strong></h2><p>ES|QL doesn’t just support vector search; it makes it a natural part of how you query your data. It allows you to use a single, powerful syntax for text, vectors, and everything in between, including:</p><ul><li>Vector search, both approximate and exact.</li><li>Semantic search, using text for search over vector data.</li><li>Hybrid search, combining the best of text and vector search.</li><li>Custom vector scoring, using <code>EVAL</code> and ES|QL constructs.</li></ul><p>Vector search in ES|QL is easier than in the Query DSL: prefilters and parameters are inferred for you, and vector functions integrate with the expressive, feature-rich syntax that ES|QL provides.</p><p>Defining KNN as part of a query pipeline for <a href="https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql#vector-search">multistage retrieval</a> is just another piece in the query; you can keep using filters, combine with other text functions for hybrid search, and apply reranking or query completion on top of your vector results.</p><p>We’ll keep adding vector functions for performing vector arithmetic and aggregations over dense vectors, so you can use the full power of ES|QL to manipulate your vector data.</p><p>Happy (vector) searching!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up vector ingestion using Base64-encoded strings]]></title>
            <description><![CDATA[Introducing Base64-encoded strings to speed up vector ingestion in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jim Ferenczi, Benjamin Trent, Ignacio Vera Sequeiros]]></dc:creator>
            <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4c1ebbd7a1071e8e1721a9871cba87f6aed140e9-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’re improving the ingestion speed of vectors in Elasticsearch. Now, in <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and in v9.3, you can send your vectors to Elasticsearch encoded as Base64 strings, which will provide immediate benefits to your ingestion pipeline.</p><p>This change reduces the overhead of parsing vectors in JSON by an order of magnitude, which translates to almost a 100% improvement in indexing throughput for DiskBBQ and around a 20% improvement for hierarchical navigable small world (HNSW) workloads. In this blog, we’ll take a closer look at Base64-encoded strings and the improvements they bring to vector ingestion.</p><h2>What’s the problem?</h2><p>At Elastic, we’re always looking for ways to improve our vector search capabilities, whether that’s enhancing existing storage formats or introducing new ones. Recently, for example, we added a new disk-friendly storage format called <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a> and enabled vector indexing with <a href="https://www.elastic.co/search-labs/blog/elasticsearch-gpu-accelerated-vector-indexing-nvidia">NVIDIA cuVS</a>.</p><p>In both cases, we expected to see major gains in ingestion speed. However, once these changes were fully integrated into Elasticsearch, the improvements weren’t as large as we had hoped. 
A flamegraph of the ingestion process made the issue clear: JSON parsing had become one of the main bottlenecks.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a286408afc85ff1cd3dd448b8fdf59dd3e11d599-1600x675.png" style="width: 100%;" /><p>Parsing JSON requires walking through every element in the arrays and converting numbers from text format into 32-bit floating-point values, which is very expensive.</p><h3>Why Base64-encoded strings?</h3><p>The most efficient way to parse vectors is directly from their binary representation, where each element uses a 32-bit floating-point value. However, JSON is a text-based format, and the way to include binary data in it is by using <a href="https://en.wikipedia.org/wiki/Base64">Base64</a>-encoded strings. Base64 is just a binary-to-text encoding schema.</p><pre><code>{
  "emb": [1.2345678, 2.3456789, 3.4567891]
}</code></pre><p>We can now send vectors encoded as Base64 strings:</p><pre><code>{
  "emb": "P54GUUAWH5pAXTwI"
}</code></pre><p>Is it worth it? Our benchmarks suggest yes. When parsing 1,000 JSON documents, using Base64-encoded strings instead of float arrays resulted in performance improvements of more than an order of magnitude, accepting a small encode/decode overhead (client-side Base64 encoding and a temporary byte array on the server for decoding) in exchange for eliminating expensive per-element numeric parsing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/86e5a926e13b07bb3b0abe80bd4930464e8f6f9b-1200x742.png" style="width: 100%;" /><h3>Give me some ingestion numbers</h3><p>We can see these improvements in practice when running the <a href="https://github.com/elastic/rally-tracks/blob/master/so_vector/README.md"><code>so_vector</code></a> rally track with the different approaches. The actual gains depend on how fast indexing is for each storage format. For <code>bbq_disk</code>, indexing throughput increases by about 100%, while for <code>bbq_hnsw</code>, the improvement is closer to 20%, since indexing is inherently slower there.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4277057ee59cb84d068176b56bb7fa00b66e1cb3-1200x742.png" style="width: 100%;" /><p>Starting with Elasticsearch v9.2, <a href="https://www.elastic.co/search-labs/blog/elasticsearch-exclude-vectors-from-source">vectors are excluded from <code>_source</code></a> by default and are stored internally as 32-bit floating-point values. This behavior also applies to Base64-encoded vectors, making the choice of indexing format completely transparent at search time.</p><h2>Client support</h2><p>Adding a new format for indexing vectors might require changes to ingestion pipelines. To help with this effort, in v9.3, the official Elasticsearch clients can transform vectors with 32-bit floating-point values into Base64-encoded strings and the other way around. 
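</p><p>Conceptually, the encoding is simple: the vector&#x27;s 32-bit floats are laid out as raw bytes and then Base64-encoded. Here’s a minimal Python sketch (assuming big-endian IEEE-754 byte order, which reproduces the example string above; in practice, prefer the official client helpers rather than rolling your own):</p><pre><code>import base64
import struct

def encode_vector(vec):
    # Pack each element as a big-endian 32-bit float, then Base64-encode.
    return base64.b64encode(struct.pack(f">{len(vec)}f", *vec)).decode("ascii")

print(encode_vector([1.2345678, 2.3456789, 3.4567891]))
# P54GUUAWH5pAXTwI</code></pre><p>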
You might need to check the client documentation for the specific implementation.</p><p>For example, here’s a snippet for implementing bulk loading using the Python client:</p><pre><code>from elasticsearch.helpers import bulk, pack_dense_vector

def get_next_document():
    for doc in dataset:
        yield {
            "_index": "my-index",
            "_source": {
                "title": doc["title"],
                "text": doc["text"],
                "emb": pack_dense_vector(doc["emb"]),
            },
        }

result = bulk(
    client=client,
    chunk_size=chunk_size,
    actions=get_next_document(),
    stats_only=True,
)</code></pre><p>The only difference from a bulk ingest using floats is that the embedding is wrapped with the <code>pack_dense_vector()</code> auxiliary function.</p><h2>Conclusion</h2><p>By switching from JSON float arrays to Base64-encoded vectors, we remove one of the largest remaining bottlenecks in Elasticsearch’s vector ingestion pipeline: numeric parsing. The result is a simple change with outsized impact: up to 2× higher throughput for DiskBBQ workloads and meaningful gains even for slower indexing strategies, like HNSW.</p><p>Because vectors are already stored internally in a binary format and excluded from <code>_source</code> by default, this improvement is completely transparent at search time. With official client support landing in v9.3, adopting Base64 encoding requires only minimal changes to existing ingestion code, while delivering immediate performance benefits.</p><p>If you’re indexing large volumes of embeddings, especially in high-throughput or serverless environments, Base64-encoded vectors are now the fastest and most efficient way to get your data into Elasticsearch.</p><p>Those interested in the implementation details can follow the related Elasticsearch issues and pull requests: #<a href="https://github.com/elastic/elasticsearch/issues/111281">111281</a> and #<a href="https://github.com/elastic/elasticsearch/issues/135943">135943</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Jina Rerankers bring fast, multilingual reranking to Elastic Inference Service (EIS)]]></title>
            <description><![CDATA[Elastic now offers jina-reranker-v2-base-multilingual and jina-reranker-v3 on EIS, enabling fast multilingual reranking directly in Elasticsearch for higher-precision retrieval, RAG, and agentic workflows without added infrastructure.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-rerankers-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-rerankers-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Brendan Jugan, Ranjana Devaji]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/be373a8af506d463948b9027f1cc2f19356f5499-720x421.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today we’re excited to introduce <a href="https://jina.ai/models/jina-reranker-v2-base-multilingual/"><code>jina-reranker-v2-base-multilingual</code></a> and <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/"><code>jina-reranker-v3</code></a> on Elastic Inference Service (EIS), enabling fast multilingual, high-precision reranking directly in Elasticsearch.</p><p>Jina AI, recently acquired by Elastic, is a leader in open-source multilingual and multimodal models, delivering state-of-the-art <a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">search foundation models</a> for high-quality retrieval and retrieval-augmented generation (RAG). <a href="https://www.elastic.co/blog/elastic-inference-service">EIS</a> makes it easy to run fast, high-quality inference with an expanding catalog of these ready-to-use models on managed GPUs, with no setup or hosting complexity.</p><p>Rerankers add semantic precision by refining the ordering of retrieved results, helping select the best matches for a query. They add relevance improvements without the time and effort involved in reindexing or pipeline disruption, and they’re especially valuable for hybrid and RAG workflows where better context boosts downstream accuracy.</p><p>This follows the recent introduction of <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service"><code>jina-embeddings-v3</code> on EIS</a>, expanding the model catalog for multilingual reranking. Developers can now perform hybrid search using lexical search with BM25F and vector search with multilingual embeddings from <code>jina-embeddings-v3</code>, then rerank results with Jina Rerankers v2 or v3 depending on the use case. 
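</p><p>As a sketch of what that reranking step looks like, the inference API takes a query plus a list of candidate documents and returns them with relevance scores (the endpoint ID below is an assumption, following the preconfigured <code>.jina-embeddings-v3</code> EIS naming used later in this post):</p><pre><code>POST _inference/rerank/.jina-reranker-v2-base-multilingual
{
  "query": "What is the capital of the Netherlands?",
  "input": [
    "Amsterdam is the capital of the Netherlands.",
    "Paris is the capital of France.",
    "The Hague is the seat of the Dutch government."
  ]
}</code></pre><p>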
This delivers full control over recall tuning natively in Elasticsearch.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1afc549ebe74619e750b3809bca014f87a6dfb73-1600x437.png" style="width: 100%;" /><h2>jina-reranker-v2-base-multilingual</h2><p><code>jina-reranker-v2-base-multilingual</code> is a compact general-purpose reranker with features designed to support function calling and SQL querying.</p><ul><li><strong>Low-latency inference at scale:</strong> A compact 278M parameter model with <a href="https://arxiv.org/abs/2307.08691">Flash Attention 2</a> for low-latency inference, delivering strong multilingual performance that outperforms larger rerankers based on <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard">AIR metrics</a> and other widely used benchmarks.</li><li><strong>Supports agentic use cases:</strong> Accurate multilingual text reranking with additional support for selecting SQL tables and external functions that match text queries, enabling agentic workflows.</li><li><strong>Unbounded candidate support:</strong> v2 handles arbitrarily large candidate lists by scoring documents independently. Scores are compatible across batches, so developers can rerank large result sets incrementally. For example, a pipeline can score 100 candidates at a time, merge the scores, and sort the combined results. 
This makes v2 suitable when pipelines don’t apply strict top-k limits.</li></ul><h2>jina-reranker-v3</h2><p><code>jina-reranker-v3</code> performs multilingual listwise reranking, offering state-of-the-art performance with higher precision for RAG and agent-driven workflows.</p><ul><li><strong>Lightweight, production-friendly architecture:</strong> A ~0.6B parameter listwise reranker optimized for low-latency inference and efficient deployment in production settings.</li><li><strong>Strong multilingual performance:</strong> <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/">Benchmarks</a> show that v3 delivers state-of-the-art multilingual performance while outperforming much larger alternatives and maintains stable top-k rankings under permutation.</li><li><strong>Cost-efficient, cross-document reranking:</strong> Unlike v2, v3 reranks up to 64 documents together in a single inference call, reasoning over relationships across the full candidate set to improve ordering when results are similar or overlapping. By batching candidates instead of scoring them individually, v3 significantly reduces inference usage, making it a strong fit for RAG and agentic workflows with defined top-k results.</li></ul><p>More models are on the way. EIS continues to expand with models optimized for candidate reranking, retrieval, and agentic reasoning. Next up is <a href="https://jina.ai/news/fair-scoring-for-multimodal-documents-with-jina-reranker-m0/"><strong><code>jina-reranker-m0</code></strong></a> for multimodal reranking, followed closely by frontier models from OpenAI, Google, and Anthropic.</p><h2><strong>Get started</strong></h2><p>You can start using <code>jina-reranker-v2-base-multilingual</code> and <code>jina-reranker-v3</code> on EIS with just a few steps.</p><h3><strong>Create embeddings with jina-embeddings-v3</strong></h3><pre><code>POST _inference/text_embedding/.jina-embeddings-v3
{
  "input": [
    "The Atlantic is a vast, deep ocean.",
    "A small puddle formed on the sidewalk."
  ]
}</code></pre><p>The response:</p><pre><code>{
  "text_embedding": [
    {
      "embedding": [
        0.0061287,
        ...
      ]
    },
    {
      "embedding": [
        -0.11765291,
        ...
      ]
    }
  ]
}</code></pre><h3><strong>Rerank with jina-reranker-v2-base-multilingual</strong></h3><p>Perform inference:</p><pre><code>POST _inference/rerank/.jina-reranker-v2-base-multilingual
{
 "input": ["puddle", "ocean", "cup of tea"],
 "query": "a large body of water"
}</code></pre><p>The response:</p><pre><code>{
  "rerank": [
    {
      "index": 1,
      "relevance_score": 0.48755136
    },
    {
      "index": 0,
      "relevance_score": 0.41489884
    },
    {
      "index": 2,
      "relevance_score": 0.07696084
    }
  ]
}</code></pre><p>The response contains the ranked list of inputs sorted by relevance score. In this example, the model identifies &quot;ocean&quot; as the most relevant match for a large body of water, assigning it the highest score while correctly ranking &quot;puddle&quot; and &quot;cup of tea&quot; lower.</p><h3><strong>Rerank with jina-reranker-v3</strong></h3><p>Perform inference:</p><pre><code>POST _inference/rerank/.jina-reranker-v3
{
  "input": ["pebble", "The Swiss Alps", "a steep hill"],
  "query": "mountain range"
}</code></pre><p>The response:</p><pre><code>{
  "rerank": [
    {
      "index": 1,
      "relevance_score": 0.06519848
    },
    {
      "index": 2,
      "relevance_score": -0.05002501
    },
    {
      "index": 0,
      "relevance_score": -0.09782915
    }
  ]
}</code></pre><p>Similar to <code>jina-reranker-v2-base-multilingual</code>, the response provides a prioritized list of the inputs sorted by relevance. In this example, the model identifies &quot;The Swiss Alps&quot; as the most relevant match for &quot;mountain range,&quot; compared to &quot;pebble&quot; and &quot;a steep hill.&quot;</p><p>However, a key difference is that <code>jina-reranker-v3</code> is a listwise reranker. Unlike <code>jina-reranker-v2-base-multilingual</code>, which scores document-query pairs individually, <code>jina-reranker-v3</code> processes all inputs simultaneously, enabling rich cross-document interactions before determining the final ranking.</p><h2><strong>What’s new in EIS</strong></h2><p><strong>EIS via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a></strong> brings EIS to self-managed clusters, allowing developers to access its GPU fleet to prototype and ship RAG, semantic search, and agent workloads without needing to procure GPU capacity on their self-managed clusters. Platform teams gain hybrid flexibility by keeping data and indexing on-prem while scaling GPU inference in Elastic Cloud when needed.</p><h2>What’s next</h2><p><strong><code>semantic_text</code> fields will soon default to <code>jina-embeddings-v3</code> on EIS</strong>, providing built-in inference at ingestion time, making it easier to adopt multilingual search without additional configuration.</p><h2>Try it out</h2><p>With Jina AI models on EIS, you can build multilingual, high-precision retrieval pipelines without managing models, GPUs, or infrastructure. 
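</p><p>As a client-side illustration of the batch-merge pattern that v2’s independent scoring enables, here is a minimal Python sketch. <code>score_batch</code> is a hypothetical adapter that posts one batch of documents to the <code>_inference/rerank</code> endpoint and returns one relevance score per input:</p>

```python
def rerank_in_batches(query, docs, score_batch, batch_size=100):
    """Rerank an arbitrarily large candidate list with a pairwise
    reranker such as jina-reranker-v2: score fixed-size batches
    independently, then merge and sort the scores globally."""
    scored = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # score_batch is a hypothetical client adapter around the
        # _inference/rerank API; it returns one float per document.
        for offset, score in enumerate(score_batch(query, batch)):
            scored.append((start + offset, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored  # [(original_index, score), ...], best match first
```

<p>Because v2 scores are comparable across calls, merging batch results this way should yield the same ordering as a single large request would; v3 scores are listwise, so this merge pattern applies to v2 only. <p>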
You get fast dense retrieval, accurate reranking, and tight integration with Elasticsearch’s relevance stack, all in one platform.</p><p>Whether you’re building RAG systems, search, or agentic workflows that need reliable context, Elastic now gives you high-performance models out of the box and the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to the Elastic Inference Service. <a href="https://www.elastic.co/cloud/serverless">Try it now on Elastic Cloud Serverless and Elastic Cloud Hosted.</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building automation with Elastic Workflows]]></title>
            <description><![CDATA[A practical introduction to workflow automation in Elastic. Learn what workflows look like, how they work, and how to build one.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-workflows-automation</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-workflows-automation</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tinsae Erkailo, Tal Borenstein, Shahar Glazner]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cc239e947732501cbe0579f15093efa3d9db1cb8-1600x933.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Workflows is an automation engine built into the Elasticsearch platform. You define workflows in YAML: what triggers (starts) them, what steps they take, and what actions they perform. The platform handles execution. A workflow can query Elasticsearch, transform data, branch based on conditions, call external APIs, and integrate with services like Slack, Jira, PagerDuty, and more through connectors you&#x27;ve already configured.</p><p>In this blog post, we&#x27;ll cover the core concepts of Workflows and build an example workflow together.</p><h2>Workflows are declarative and defined in YAML</h2><p>Workflows are composable. You define what should happen, and the platform handles execution, error recovery, and logging. Every workflow is defined as YAML and lives inside Kibana.</p><p>A workflow is composed of a few key parts: <em>triggers</em>, <em>inputs</em>, and <em>steps</em>.</p><p><strong>Triggers</strong> determine when a workflow runs. An alert trigger runs when a Kibana alerting rule fires, with full access to the alert context. A scheduled trigger runs on an interval or cron pattern. A manual trigger runs on demand from the UI or API. A workflow can have multiple triggers.</p><p><strong>Inputs</strong> define parameters that can be passed to the workflow at runtime. These let you create reusable workflows that accept different values depending on how they&#x27;re invoked.</p><p><strong>Steps</strong> are the actions a workflow takes. They execute in sequence, and each step can reference outputs from previous steps. Step types include:</p><ul><li><strong>Internal actions</strong> for actions you perform inside Elasticsearch and Kibana, like querying indices, running Elasticsearch Query Language (ES|QL) queries, creating cases, or updating alerts.</li><li><strong>External actions</strong> for actions you perform on external systems, like sending a Slack message or creating a Jira ticket. 
Use any connector you&#x27;ve configured in Elastic, with the flexibility to hit any API or internal service using HTTP steps.</li><li><strong>Flow control</strong> for defining the logic of your workflow with conditionals, loops, and parallel execution.</li><li><strong>AI</strong> for everything from prompting a large language model (LLM) to enabling agents as workflow steps, unlocking agentic workflow use cases.</li></ul><h2>Getting practical: Your first workflow</h2><p>Let&#x27;s build a workflow that demonstrates core capabilities: working with Elasticsearch indices, conditional logic, and data flow between steps. We&#x27;ll create a simple demo that sets up a National Parks index, loads sample data, and searches it.</p><h3>Enable Workflows</h3><p>Workflows is available in Elastic 9.3 (Technical Preview). Go to <strong>Stack Management → Advanced Settings</strong>, and enable <strong>Elastic Workflows</strong>:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/afc1f7168f3248d1792703973010e37ff05d19af-5344x3044.png" style="width: 100%;" /><h3>Create a workflow</h3><p>Navigate to <strong>Workflows</strong> in Kibana. If this is your first time, you&#x27;ll see the Get Started screen:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ca4bf4505ca26c32d12158a33211bf943a5bf37-5344x3044.png" style="width: 100%;" /><p>Click <strong>Create a new workflow</strong> to open the editor. You can also explore <strong>Example workflows</strong> to see the <a href="https://github.com/elastic/workflows">Elastic Workflow Library</a>, a collection of ready-to-use workflows for search, observability, and security use cases.</p><h3>The workflow editor</h3><p>The editor gives you YAML editing with autocomplete and validation. Start typing a step type, and suggestions appear. 
Use the <strong>quick action menu</strong> (Cmd+K / Ctrl+K) to browse available triggers, steps, and actions by category:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b94d4b4496f2214867bbd1ed1eb82174df426fd6-2538x1340.gif" style="width: 100%;" /><h3>Build the workflow</h3><p>Here&#x27;s our National Parks demo workflow:</p><pre><code>name: National Parks Demo
description: Creates an Elasticsearch index, loads sample national park data, searches for parks, and displays the results.
enabled: true
consts:
  indexName: national-parks

triggers:
  - type: manual

steps:
  - name: get_index
    type: elasticsearch.indices.exists
    with:
      index: '{{ consts.indexName }}'

  - name: check_if_index_exists
    type: if
    condition: 'steps.get_index.output: true'
    steps:
      - name: index_already_exists
        type: console
        with:
          message: 'index: {{ consts.indexName }} already exists. Will proceed to delete it and re-create'
      - name: delete_index
        type: elasticsearch.indices.delete
        with:
          index: '{{ consts.indexName }}'
    else:
      - name: no_index_found
        type: console
        with:
          message: 'index: {{ consts.indexName }} not found. Will proceed to create'

  - name: create_parks_index
    type: elasticsearch.indices.create
    with:
      index: '{{ consts.indexName }}'
      mappings:
        properties:
          name:
            type: text
          category:
            type: keyword
          description:
            type: text

  - name: index_park_data
    type: elasticsearch.index
    with:
      index: '{{ consts.indexName }}'
      id: yellowstone
      document:
        name: Yellowstone National Park
        category: geothermal
        description: "America's first national park, established in 1872, famous for Old Faithful geyser and diverse wildlife including grizzly bears, wolves, and herds of bison and elk."
      refresh: wait_for

  - name: search_park_data
    type: elasticsearch.search
    with:
      index: '{{ consts.indexName }}'
      query:
        term:
          _id: yellowstone

  - name: log_results
    type: console
    with:
      message: 'Found {{ steps.search_park_data.output.hits.total.value }} park with doc id of yellowstone.'</code></pre><p>This workflow demonstrates several capabilities:</p><ul><li><strong>Constants:</strong> Define reusable values, like <code>indexName</code>, that can be referenced throughout.</li><li><strong>Elasticsearch operations:</strong> Check whether an index exists, delete it, create it with mappings, index a document, and search.</li><li><strong>Conditional logic with branches:</strong> If the index exists, log a message and delete it. If not, log that it wasn&#x27;t found. Either way, continue to create the index.</li><li><strong>Data flow:</strong> Each step references outputs from previous steps using <code>steps.&lt;name&gt;.output</code>.</li></ul><p>Notice the <code>{{ }}</code> syntax throughout. This is Liquid templating for passing data between steps. <code>consts</code> contains workflow constants. <code>steps.&lt;name&gt;.output</code> references the output of a previous step.</p><h3>Run the workflow</h3><p>Save the workflow and click the <strong>Play </strong>button next to the <strong>Save </strong>button to execute it.</p><p>The workflow begins executing, and you&#x27;ll see the execution view. Each step appears in the side panel as it runs, showing status and timing:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db89d94db97e51f76af0db084fef55c325b1426c-2538x1318.gif" style="width: 100%;" /><p>Click on any step to see its inputs and outputs. The side panel shows exactly what data the workflow received and produced at each step. This makes debugging straightforward: You see what went in, what came out, and where things went wrong if a step fails.</p><h3>Extending the workflow</h3><p>Let&#x27;s extend this workflow with AI and external notifications. 
We&#x27;ll add a step that generates a poem about the park using an LLM and then sends it to Slack.</p><p>Add these steps after <code>log_results</code>:</p><pre><code>  - name: generate_poem
    type: ai.prompt
    with:
      prompt: >
        Write a short, fun poem about {{ steps.search_park_data.output.hits.hits[0]._source | json }}.
        Include something about its famous features. No other formatting.

  - name: share_poem
    type: slack
    connector-id: my-slack-connector
    with:
      message: |
        🏔️ *Poem of the Day about Yellowstone*
        
        {{ steps.generate_poem.output.content }}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a529292d102b2b2800729458ecffb411b83f6f8f-5344x3008.png" style="width: 100%;" /><p>Now the workflow creates the index, loads data, searches it, generates a poem with AI, and shares it to Slack. The same patterns apply throughout: Add steps, reference their outputs, and let the workflow handle execution.</p><p>This is a simple example, but the same approach scales to real use cases. Replace national parks with security alerts, observability metrics, or any data in Elasticsearch. Replace the poem with an AI summary or assessment. Replace Slack with Jira, PagerDuty, or any connector you&#x27;ve configured.</p><h2>Workflows and Elastic Agent Builder</h2><p>The National Parks example demonstrates the core components of Workflows: triggers, steps, conditional logic, data flow, AI prompts, and external notifications. These components can be combined to build automations that orchestrate outcomes within Elasticsearch, Kibana, external systems, and AI.</p><p></p><p>This covers processes where the steps are known. But what about processes where they aren&#x27;t, where the right action depends on what you find, and what you find depends on where you look?</p><p>This is where <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Agent Builder</a> expands what&#x27;s automatable. An agent grounded in your operational context can do the initial analysis or investigation that an analyst or developer would normally do manually. It can explore, reason across data sources, and surface findings. Then the workflow continues with the structured steps that follow.</p><p>Workflows integrate with Agent Builder, and the integration works both directions.</p><h3>Agents as workflow steps</h3><p>Use the <code>ai.agent</code> step type to invoke an agent from within a workflow:</p><pre><code>- name: analyze
  type: ai.agent
  with:
    agent_id: my-analyst-agent
    message: 'Analyze this data and recommend next steps: {{ steps.search.output | json }}'</code></pre><p>The agent uses its configured tools to query indices, correlate data, and reason across results. The workflow waits for findings and continues with the next steps.</p><h3>Workflows as agent tools</h3><p>Workflows can also be exposed to Agent Builder as tools, allowing agents to invoke them when action is required.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d73f329f5166f4c3bea4d697c451341ea3e7423e-5344x3044.png" style="width: 100%;" /><p>This pattern is useful when an agent determines what should be done, but the how should follow a known and repeatable process. Agents are well suited for reasoning, exploration, and judgment, especially when information is incomplete or evolving. Workflows are well suited for executing established procedures that involve multiple steps, external systems, and state changing actions.</p><p>By exposing a workflow as a tool, you separate decision making from execution. An agent can conclude that a specific outcome is needed, such as declaring an incident or initiating a response, and then delegate the execution to a workflow that encodes the correct sequence of actions. That workflow handles coordination across systems, applies the same logic every time, and makes the result observable and auditable.</p><p>This also enables reuse and consistency. The same workflow can be invoked by different agents or run manually, while enforcing a shared process and set of controls. Instead of teaching every agent how to carry out complex actions safely and correctly, that responsibility lives in one place.</p><p>Together, this creates a clear contract. Agents decide when an action is needed. Workflows ensure that the action follows the right process, every time.<br/></p><h2>Looking ahead</h2><p>This technical preview establishes Workflows as a core capability of the Elasticsearch platform. 
The foundation is here: triggers, steps, data flow, AI integration, and the bidirectional connection with Agent Builder.</p><p>What comes next builds on this foundation by expanding how workflows are authored and where they show up. In addition to new step types and broader connector support, Workflows will support multiple authoring modes. These include natural language intent, where AI helps translate what you want to achieve into a working workflow, as well as visual, drag and drop builders. As these capabilities are embedded directly into solution specific experiences across Elastic, they change not just how workflows are built, but how work actually gets done.</p><p><strong>For Developers</strong>, this means AI assistants that go beyond conversation. Agents that can actually do things: query systems, update records, trigger processes, and return results. Reasoning backed by reliable execution.</p><p><strong>In Observability</strong>, this means correlating signals from logs, metrics, and traces. Surfacing likely root causes. Orchestrating remediation steps. Closing the loop between detection and resolution without waiting for someone to intervene.</p><p><strong>In Security</strong>, this means investigation that begins the moment an alert fires. Enriching findings with context from internal and external sources. Coordinating response actions across tools. Updating cases and notifying the right people. Work that used to require manual effort, handled automatically.</p><p>The patterns you learn today apply directly to these experiences. Workflows is the automation layer that makes them possible.</p><h2>Get started with Elastic Workflows</h2><p>Elastic Workflows is available now as a technical preview. 
Get started with an <a href="https://cloud.elastic.co/registration">Elastic Cloud trial</a>, and check out the <a href="https://www.elastic.co/docs/explore-analyze/workflows">documentation</a> or <a href="https://github.com/elastic/workflows">library of example workflows</a> to learn more.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Skip MLOps: Managed cloud inference for self-managed Elasticsearch with EIS via Cloud Connect]]></title>
            <description><![CDATA[Introducing Elastic Inference Service (EIS) via Cloud Connect, which provides a hybrid architecture for self-managed Elasticsearch users and removes MLOps and CPU hardware barriers for semantic search and RAG.]]></description>
            <link>https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jordi Mon Companys, Matt Ryan]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e419160397b56d1eacc101cd69ecc9c8e0b20967-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>If you’re running Elasticsearch on-premises or in a private cloud, you’ve likely faced a familiar standoff when trying to modernize your search experience: You want to implement <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-inference">semantic search</a>. You know that state-of-the-art dense vector models, like jina-embeddings-v3, are the standard for relevance. But when you look at the infrastructure requirements to run these models at scale, the project stalls.</p><p>The problem usually isn&#x27;t the software: Elasticsearch has supported vector search for years. The problem is the hardware.</p><h2><strong>The MLOps bottleneck</strong></h2><p>The process of running model inference to generate the embeddings required for semantic search is computationally expensive. If you’re self-managing your cluster, this presents a difficult set of trade-offs regarding operational complexity and resource flexibility:</p><ul><li><strong>Burn CPU cycles:</strong> You run the models on your existing CPU-bound nodes. This works for small datasets and small models, but as ingestion rates climb, your indexing throughput plummets and your search nodes end up choking on vector generation instead of serving queries.</li><li><strong>Provision GPUs:</strong> You ask your infrastructure team for GPU-accelerated nodes. In many organizations, this triggers a procurement nightmare. 
GPUs are expensive and scarce, and they introduce a new layer of machine learning operations (MLOps) complexity: driver compatibility, container orchestration, and scaling logic that your team might not have time to manage.</li></ul><p>This creates a gap where self-managed deployments remain stuck on keyword search (BM25) simply because the infrastructure barrier to AI is too high.</p><h2><strong>Introducing a hybrid architecture for inference</strong></h2><p>We built the <strong><a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a> </strong>and made it available via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect"><strong>Cloud Connect</strong></a> to solve this specific hardware constraints problem. It allows self-managed clusters (running on Elastic Cloud on Kubernetes [ECK], Elastic Cloud Enterprise [ECE], or stand-alone) hosted on-premises or in private cloud environments to delegate compute-intensive model inference to Elastic Cloud. It doesn’t require a lift-and-shift migration of your cluster architecture. While the specific text fields you need to vectorize are transmitted to the cloud for processing, you don’t need to permanently relocate your terabytes of business data or re-architect your storage. It allows you to use a <strong>hybrid topology</strong>: Your data nodes, master nodes, and index storage remain in your private environment, but the heavy lifting of generating embeddings is outsourced to Elastic’s managed GPU fleets.</p><h2><strong>How it works: Data stays, intelligence travels</strong></h2><p>It’s important to understand the distinction between <em>storage </em>and <em>inference</em>. When you use EIS via Cloud Connect:</p><ol><li><strong>The handshake:</strong> You generate an API key in Elastic Cloud and paste it into your self-managed Kibana instance. 
This establishes a secure, authenticated bridge.</li><li><strong>The pipeline:</strong> When you index a document using the <code>semantic_text</code> field (or manually configure an inference processor), your local cluster automatically sends only the specific text content to the EIS endpoint, encrypted in transit.</li><li><strong>The inference:</strong> The text is processed in memory on Elastic’s managed GPUs. The resulting vector embedding is returned immediately to your local cluster.</li><li><strong>The storage:</strong> The vector is indexed and stored on your local disks alongside the original source document.</li></ol><p>The raw text payload is ephemeral: It’s processed for inference and discarded. It’s never indexed or permanently stored in the cloud. You get the relevance of a GPU-powered cluster without changing your data residency posture.</p><h2><strong>Scaling semantic search without hardware provisioning</strong></h2><p>Let’s look at a practical scenario. You’re a site reliability engineer (SRE) managing a cluster for a large ecommerce platform. 
The search team wants to deploy Jina embeddings to fix “zero results” queries, but your on-premises nodes are CPU-bound and you have no GPU infrastructure available.</p><p>Here’s how you can use EIS via Cloud Connect to solve this in minutes, not months.</p><p>Step 1: <strong>The handshake</strong></p><p>First, we establish the bridge between your self-managed cluster and Elastic Cloud:</p><ul><li>Navigate to <strong>Stack Management</strong> in your local Kibana.</li><li>Under the <strong>Cloud Connect</strong> section, click <strong>Connect to Elastic Cloud</strong>.</li><li>Authenticate with your Elastic Cloud credentials, and authorize the connection.<ul><li>Result: Your local cluster is now cloud connected, acting as a satellite that can consume Software as a Service (SaaS) services.</li></ul></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7f2f846515f30375340080c6f61b0321f6db5c1-1600x1162.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/329916c4ecd6bdb78d21812751014b887de6148f-1600x1158.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e98f5b7f1aa1d245507163df285b6688745c6ff-1600x1156.png" style="width: 100%;" /><p>Step 2: <strong>Enable the service</strong></p><ul><li>On the Cloud connected services page, locate <strong>Elastic Inference Service</strong>.</li><li>Click <strong>Enable</strong> and wait for the status to switch to <strong>Enabled</strong>.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c171888a47a8e412551b701c59274fd7b3f6d33e-1600x1159.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f678694670f3686ac4b7be8a1850b39421999f48-1600x245.png" style="width: 100%;" /><p>This makes all these inference endpoints immediately available locally:</p><img 
src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4332fb202e9537128d03abf115f6728b925755ce-800x579.gif" style="width: 100%;" /><p>Step 3: <strong>Configure the pipeline</strong></p><p>Now that the service is available, you configure an ingest pipeline. Instead of managing a local Jina container, you simply point to the cloud-hosted model. Open the console, and try the Jina embeddings model locally:</p><pre><code>PUT /semantic-search-eis
{
  "mappings": {
    "properties": {
      "text": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v3"
      }
    }
  }
}

POST /semantic-search-eis/_doc
{
  "text": "Aberdeen Football Club"
}

GET /semantic-search-eis/_search
{
  "query": {
    "match": {
      "text": "soccer"
    }
  }
}</code></pre><p>The match query will return the semantically correct result for “Aberdeen Football Club,” whereas an exact keyword match for “soccer” would not:</p><pre><code>{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.89421,
    "hits": [
      {
        "_index": "semantic-search-eis",
        "_id": "1",
        "_score": 0.89421,
        "_source": {
          "text": "Aberdeen Football Club"
        }
      }
    ]
  }
}</code></pre><p>Step 4: <strong>Hybrid ingestion</strong></p><p>When you index a document through this pipeline, the flow is streamlined:</p><ul><li>Your local node accepts the document.</li><li>The text field is securely transmitted to the EIS endpoint.</li><li>Elastic Cloud generates the dense vector embedding on a managed GPU using Jina v3.</li><li>The vector is returned and indexed locally on your self-managed disk.</li></ul><h2><strong>Oh and one more thing: LLMs via Cloud Connect too!</strong></h2><p>While vector search solves the <em>retrieval</em> part of retrieval-augmented generation (RAG), Cloud Connect also solves the <em>generation</em> side. Typically, giving your self-managed teams access to high-quality large language models (LLMs), like Anthropic’s Claude, involves a separate procurement cycle, managing vendor API keys, and handling secure egress.</p><p>EIS via Cloud Connect removes that friction entirely.</p><ul><li><strong>Preconfigured access:</strong> The moment you enable the service, you get access to a preconfigured Anthropic connector (one for Claude 3.7 and one for Claude 4.5). You don’t need to bring your own API key or sign a separate contract with the model provider.</li><li><strong>Zero setup:</strong> It just works. Because the secure bridge is already established, you can immediately select these models in the Search Playground to test RAG against your local data.</li><li><strong>Integrated workflows:</strong> This also powers our internal AI tools. 
Your team can instantly start using the Elastic AI Assistant for observability insights or Attack Discovery for security threat analysis, all powered by cloud-hosted LLMs, grounded in your on-premises data.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/361d7f503a0401078b19b5108d40d8a59fa2e607-1600x1160.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8bfd1b8de4a5a9621b414791fe03f979ad7458ec-1600x1164.png" style="width: 100%;" /><h2><strong>The shortcut to AI</strong></h2><p>EIS via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a> removes the operational friction of managing GPU drivers, specialized hardware, and complex performance monitoring stacks. By offloading these requirements to Elastic Cloud, you can build RAG applications and semantic search interfaces today, regardless of whether your cluster is running on-premises or in a private cloud VPC.</p><p>The hybrid approach solves the resource utilization problem typical of self-managed AI. You no longer have to over-provision expensive hardware that sits idle during quiet periods, nor do you have to worry about performance bottlenecks when traffic bursts. You simply establish the secure connection via Cloud Connect and consume high-performance inference as a managed API, keeping your data residency intact while your search capabilities scale instantly.</p><p>EIS via Cloud Connect is available immediately for <strong>Elastic Enterprise</strong> self-managed customers on <strong>Elastic Stack 9.3</strong>.</p><ul><li><a href="https://www.elastic.co/search-labs/blog/elastic-autoops-self-managed-elasticsearch"><strong>Read the docs:</strong> Guide to Configuring Cloud Connect</a></li><li><a href="https://cloud.elastic.co/registration"><strong>Start a trial:</strong> Create an Elastic Cloud account</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cookbook for a production-grade generative AI sandbox]]></title>
            <description><![CDATA[Exploring the recipe for a generative AI sandbox, giving developers a secure environment to deploy application prototypes while enabling privacy and innovation.]]></description>
            <link>https://www.elastic.co/search-labs/blog/generative-ai-sandbox-data-privacy</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/generative-ai-sandbox-data-privacy</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean MacKirdy]]></dc:creator>
            <pubDate>Mon, 02 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7ed91ecf91e7a1de7d9ff514a2c285f2cb3f65c-1000x628.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Building generative AI (GenAI) applications is all the rage, and <em>context engineering</em>, that is, providing the prompt structure and data needed for a large language model (LLM) to return specific, relevant answers to a question without filling in the blanks itself, is one of the most popular patterns that have emerged in the past 24 months. One particular subset of context engineering, retrieval-augmented generation (RAG), is being used widely to bring additional context to LLM interactions by using the power of natural language-based search to surface the most relevant results in private datasets based on meaning rather than on keywords.</p><p>As context engineering is exploding, ensuring that rapid prototype projects don’t expose business- or mission-critical data to unauthorized recipients is a significant concern. For audiences interested in technology and policy alike, I&#x27;ve championed the concept of a <em>privacy-first GenAI sandbox</em>, which I’ll simply refer to as a sandbox from here on. 
In this article, the term <em>sandbox</em> refers to a self-service, secure prototyping space (much like a child&#x27;s sandbox, where the wooden edges prevent sand from escaping), allowing organization members to test their custom context engineering applications safely, without risking exposure of confidential data.</p><h2>Production-grade GenAI sandboxes = enabling privacy <em>and</em> innovation</h2><p>GenAI, from text-generating tools, like ChatGPT, Claude, and Gemini, to image creators, such as Google’s Nano Banana, OpenAI’s DALL-E, and Midjourney, has sparked discussions everywhere: in classrooms, at dinner tables, in regulatory circles, in courts, and in boardrooms over the past two years.</p><p>I’ve had the privilege of sharing Elastic’s approach to context engineering, and particularly RAG, with customers, including developers and C-suite executives, and with contacts of mine, ranging from friends and family to legislators. Think of context engineering as a librarian that looks up contextual data that text, audio, or image GenAI apps don’t have in their training data and serves it to them for their intended tasks; for example, looking up sports scores and headlines to help a text-generation application answer the question, “What happened in the National Hockey League yesterday?”</p><p>Elasticsearch Labs has fabulous primers on context engineering <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">here</a> and RAG <a href="https://www.elastic.co/search-labs/blog/articles/retrieval-augmented-generation-rag">here</a>, if you’re unfamiliar with the concept and would like to do more reading.</p><p>A <em>privacy-first</em> approach ensures that context engineering supplies the GenAI app with protected, selected, or delicate data, fostering responses that are better informed and more pertinent than what might be generated using solely public information. 
An example of this would be providing a GenAI-powered, interactive text chat experience (chatbot) for university students to obtain financial aid and scholarship information relevant to their personal background, without risking exposure of personally identifiable information (PII), such as their Social Security number or birthdate, to the LLM itself or to malicious actors extracting information via common vulnerabilities, such as those in the <a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/">OWASP Top 10</a>.</p><p>The core tenets behind deploying a sandbox are as follows:</p><ol><li>Users will find a way to incorporate GenAI into their daily workflow, regardless of whether one’s organization provides the tools. Even in organizations where preventing such “shadow IT” is, realistically, impractical or impossible, providing and monitoring access to prevent disclosure of an organization’s sensitive data still remains imperative; a sandbox is just the place to turn such tools loose.</li><li>Providing a sandbox to deploy applications with Application Performance Monitoring (APM) and information security (InfoSec) best practices embedded allows an organization to derive insights into potential use cases for GenAI while also safeguarding privacy, enabling audit and accountability of GenAI use, and establishing centralized cost management.</li><li>An organization’s sandbox should allow either self-service or low-touch deployment of peer-reviewed GenAI applications to permit maximum experimentation with minimum friction by those inclined to develop their own applications. 
</li><li>If properly implemented and contained within the organization’s controlled perimeter, the sandbox allows leveraging data assets available to the organization without triggering the liabilities that could attach to unauthorized or unintended external sharing or other leakage of protected data such as PII (think of the California CCPA or the EU/UK GDPR, for instance).</li></ol><p>This article will not focus on building a GenAI app; there are numerous excellent examples here on Elasticsearch Labs. Instead, I’ll be focusing on the <em>recipe</em> necessary for deploying a sandbox that provides the security and availability needed to implement principle #3 above.</p><h3>Foundational ingredients</h3><p>For a sandbox to qualify as <em>production grade</em>, it should include the following foundational ingredients:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c53aaa04503baf654ccee274f012d7c1ddc2f643-1380x1600.png" style="width: 100%;" /><p>Let&#x27;s explore why each ingredient plays a crucial role in our sandbox recipe. As we do, please note that brand-name decisions I’ve listed below are based on personal experience and aren’t an endorsement of one technology or another by Elastic. As with any recipe, these are simply my preferred ingredients. You can, of course, substitute in each area to make the recipe to your liking:</p><h4>1. Containerization platform</h4><p>The first ingredient in our sandbox recipe is the selection of a <strong>containerization platform</strong>. These platforms, while conceptually similar to the virtual machines that have been a staple of enterprise IT for the past 15+ years, represent a significant evolution in how applications are packaged and deployed. They’re designed for rapid deployment, upgrades without service disruption, and native distribution across both on-premises and cloud computing environments, while also providing increased testability, validation of infrastructure, and auditability. 
The platform you choose, often managed through <strong>infrastructure as code</strong> (IaC) to ensure reproducibility and consistency, is the foundation that enables agility and scalability for your GenAI applications.</p><p><strong>Key components of a containerization platform</strong></p><p>A robust containerization platform is built on several key components:</p><ul><li><strong>Container runtime:</strong> The software that executes containers and manages their lifecycle. A popular example is <strong>Docker</strong>, which provides the tools to build, share, and run container images.</li><li><strong>Image build infrastructure:</strong> This is the process and tooling used to create container images from your application&#x27;s source code. Tools like <strong>Dockerfiles</strong> provide a clear, repeatable way to define the environment, dependencies, and application code within an image, ensuring consistency across development, testing, and production environments.</li><li><strong>Orchestration engine:</strong> For a production-grade environment, you need a system to automate the deployment, scaling, and management of containers. <strong>Kubernetes</strong> (k8s) is the industry standard for this, providing powerful features for load balancing, self-healing, and service discovery. More on that below in ingredient #2.</li></ul><p><strong>1.1 Infrastructure as code:</strong></p><p>To ensure the reproducibility and maintainability of your sandbox, a <strong>containerization platform</strong> should be managed using <strong>IaC</strong> principles. This means that instead of manually configuring your platform, you define your infrastructure (for example, Kubernetes clusters, networking rules, security policies) in code files (for example, using <strong>Terraform</strong> or <strong>Pulumi</strong>). 
This approach provides several benefits:</p><ul><li><strong>Version control:</strong> Your infrastructure can be treated like any other code, allowing you to track changes, revert to previous versions, and collaborate with your team using Git.</li><li><strong>Consistency:</strong> IaC dramatically reduces manual errors and ensures that your sandbox environment can be recreated identically in any cloud or on-premises location.</li><li><strong>Automation:</strong> It enables you to automate the entire setup and teardown process, making it easy to create temporary sandboxes for specific projects or testing.</li></ul><h4>2. Hosting and orchestration</h4><p>As we introduced in the &quot;Containerization platform&quot; section, a powerful orchestration engine is needed to manage our containers at scale. For this, k8s is the de facto standard for orchestrating a production-grade sandbox. If you’re unfamiliar, check out the Cloud Native Computing Foundation (CNCF) primer on k8s available <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/">here</a>. Whether running in the cloud or on-premises, Kubernetes provides the robust framework needed to deploy, scale, and manage the lifecycle of containerized applications. 
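</p><p>To make the IaC and namespace ideas concrete, here is a minimal, hypothetical Python sketch that generates a Kubernetes namespace manifest plus a resource quota, the kind of isolated, reproducible sandbox boundary discussed above. In practice you would express this in Terraform, Pulumi, or plain YAML; the names and limits below are illustrative assumptions, not a prescribed layout.</p><pre><code>def sandbox_namespace(team: str, cpu_limit: str = "8", memory_limit: str = "32Gi") -> list[dict]:
    """Return Kubernetes manifests for an isolated sandbox namespace.

    Hypothetical helper: generating manifests in code keeps every sandbox
    reproducible and reviewable, like any other versioned source file.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": f"sandbox-{team}", "labels": {"env": "sandbox"}},
    }
    # A quota keeps one team's prototype from starving the rest of the cluster.
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "sandbox-quota", "namespace": f"sandbox-{team}"},
        "spec": {"hard": {"limits.cpu": cpu_limit, "limits.memory": memory_limit}},
    }
    return [namespace, quota]</code></pre><p>Serialized to YAML and applied through your CI pipeline, a function like this makes a new, quota-bounded sandbox one merge request away.</p><p>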
Major cloud providers, like Google Cloud (Google Kubernetes Engine [GKE]), Amazon Web Services (Elastic Kubernetes Service [EKS]), and Microsoft Azure (Azure Kubernetes Service [AKS]), all offer mature, managed Kubernetes services that handle the underlying complexity, including in particular contractually assured and independently certified compliance with statutory privacy and information security mandates, allowing your teams to focus on building and deploying applications.</p><p>For a GenAI sandbox, Kubernetes is particularly valuable because it can efficiently manage and scale GPU resources, which are often necessary for two key components of the GenAI stack: 1) privately hosted LLMs; and 2) the inference processes that power them (discussed in more detail in ingredients #6 and #7). Its ability to automate deployments and manage resources ensures that rapid prototypers can experiment with different models and applications without needing to become infrastructure experts, all within the secure and isolated area, called a <em>namespace</em> in k8s, that you define. This abstraction is key to the sandbox&#x27;s success, empowering innovation while maintaining centralized control.</p><h4>3. Code repository / image repository</h4><p>A centralized code repository is an essential element of a secure and collaborative GenAI sandbox. It provides a single, controlled environment for developers to store, manage, and version their code, preventing the proliferation of sensitive information across disparate, unsecured locations. 
By establishing a centralized repository, organizations can enforce security policies, monitor for vulnerabilities, and maintain a clear audit trail of all code changes, which is critical for maintaining data privacy and integrity within the sandbox environment.</p><p>For instance, a service like GitHub, when integrated with your organization&#x27;s identity and access management (IAM) and single sign-on (SSO) solutions (see ingredient #4 below), becomes a powerful tool for enforcing the principle of least privilege. This integration ensures that only authenticated and authorized developers can access specific code repositories. You can create teams and apply granular permissions, restricting access to sensitive projects and preventing unauthorized code modifications. This is especially important in a GenAI context where code might contain proprietary algorithms, sensitive data connectors, or even, in some cases, organization or user-level credentials or other confidential information.</p><p>Furthermore, modern repository platforms offer automated security scanning features. These tools continuously scan code for known vulnerabilities, insecure coding practices, and exposed secrets. If a developer accidentally commits a password or an API key, the system can automatically flag it and notify the security team. This proactive approach to security is essential for preventing data breaches, enforcing legal requirements and contractual commitments of confidentiality, and ensuring the overall integrity of the GenAI applications being developed to deploy in the sandbox. By mandating that all development occurs in a centralized and secured repository, you create a transparent, auditable, and secure foundation for innovation, allowing your developers the freedom to experiment without compromising organizational security.</p><h4>4. Identity and access management</h4><p>IAM is a core component of a secure, privacy-first grounded AI environment. 
It provides the foundation for ensuring that only authorized individuals and services can access sensitive data and powerful AI models. A robust IAM framework enforces the principle of least privilege, granting the minimum level of access necessary for a user or service to perform its function.</p><p><strong>4.1 Single sign-on:</strong></p><p>SSO streamlines user access by allowing users to authenticate once and gain access to multiple applications and services without re-entering their credentials. In a sandbox environment, SSO simplifies the user experience for developers, data scientists, and business users who need to interact with various components of the AI ecosystem, such as data repositories, modeling workbenches, and deployment pipelines. By centralizing authentication, SSO also enhances security by reducing the number of passwords that can be compromised and providing a single point for enforcing authentication policies. Importantly, it also lowers the barrier to entry for less-experienced developers to properly protect the data they are using in the sandbox, in turn preventing the inadvertent disclosure of sensitive information to insiders and outsiders alike.</p><p><strong>4.2 Role-based access control: </strong></p><p>Role-based access control (RBAC) is a method of restricting network access based on the roles of individual users within an organization. In the context of a GenAI sandbox, RBAC is used to define and enforce permissions for different user personas. For example, a data scientist role might have read/write access to specific datasets and the ability to apply machine learning models, while a business analyst role may only have read-only access to the outputs of those models. 
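</p><p>As an illustrative sketch (the role names and permission strings are assumptions, not a prescribed schema), the role-to-permission mapping just described might look like this:</p><pre><code># Hypothetical role-to-permission table for the sandbox.
ROLE_PERMISSIONS = {
    "data_scientist": {"datasets:read", "datasets:write", "models:run"},
    "business_analyst": {"model_outputs:read"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role is explicitly granted the action."""
    # Unknown roles get an empty permission set: deny by default.
    return action in ROLE_PERMISSIONS.get(role, set())</code></pre><p>A real deployment would source these mappings from your IAM provider rather than hard-coding them, but the principle of least privilege stays the same: deny by default, grant narrowly.</p><p>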
This ensures a clear separation of duties and prevents unauthorized access to or modification of sensitive data and AI assets.</p><p><strong>4.3 Attribute-based access control:</strong></p><p>Attribute-based access control (ABAC) provides a more granular and dynamic approach to access control than traditional RBAC. ABAC makes access decisions based on a combination of attributes of the user, the resource being accessed, and the environment. For instance, access to a particularly sensitive dataset could be restricted to users who are on the data scientist team (user attribute), accessing a resource tagged as PII (resource attribute), and are doing so from a corporate network during business hours (environment attributes). This level of granular control is critical in a GenAI sandbox for enforcing complex data governance and privacy requirements. We’ll come back to this later, when discussing the search AI datastore.</p><p><strong>4.4 Access auditability:</strong></p><p>A robust IAM framework also ensures that the granting, use, review, and revocation of all access permissions are granularly logged, discoverable, and auditable, so that in case of any suspected or confirmed incident, responders can quickly understand what happened, contain the incident, assess its extent, and comprehensively remedy its consequences. This is not only important for the organization’s own security, but also necessary to comply with any incident reporting and breach notice requirements that could be triggered.</p><h4>5. Secrets management</h4><p>Of all the ingredients in our recipe, secrets management is perhaps the most potent, yet most frequently overlooked. Much like a tiny pinch of saffron can dramatically alter a culinary dish, a single mishandled secret can have an outsized and devastating impact on your organization&#x27;s security and reputation. 
In our context, a <em>secret</em> is any piece of sensitive information needed for our applications to function: API keys for first- or third-party services, database passwords, trust certificates, or tokens for authenticating to LLMs.</p><p>When these secrets are hard-coded into source code or left in plain-text configuration files, they create a massive vulnerability. A leaked API key or an exposed database credential can bypass all other security measures, providing a direct path for attackers to access sensitive data and systems. This is especially critical in a GenAI sandbox, where developers are frequently connecting to various data sources and external model providers. Without a robust secrets management strategy, you’re leaving the keys to your kingdom scattered across your digital landscape, turning your innovative sandbox into a potential source for a major data breach.</p><p>To properly secure these secrets, a dedicated secrets management platform is an essential ingredient. These tools provide a centralized, encrypted vault for storing secrets, with robust access control, auditing, and dynamic rotation capabilities. Whether you choose a self-hosted solution, like HashiCorp Vault, or a managed cloud service, such as Google Cloud&#x27;s Secret Manager, or AWS Key Management Service (KMS), the principle is the same: Programmatically inject secrets into your applications at runtime. This practice ensures that secrets are never exposed in your code, keeping your most valuable credentials secure and your sandbox environment protected.</p><p>And this is more than just a best practice: since secret management technology is readily available and widely used, it forms part of the “state-of-the-art” which certain privacy laws and regulators reference as the benchmark against which an organization’s information security posture must be assessed. 
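</p><p>To ground the runtime-injection practice described above, here is a minimal sketch; the environment-variable name is an illustrative assumption, and the secret itself would be populated by your secrets platform (Vault, a cloud secret manager, or similar), never committed to source control:</p><pre><code>import os

def get_secret(name: str) -> str:
    """Read a secret injected into the process environment at runtime.

    The secrets platform populates the container's environment at deploy
    time; nothing sensitive ever appears in the repository.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Secret {name!r} was not injected; refusing to start.")
    return value</code></pre><p>Failing fast when a secret is missing is deliberate: a loud startup error is far cheaper than an application silently falling back to a default credential.</p><p>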
Failing to protect an organization’s most valuable secrets with the latest and greatest techniques available is not only a missed opportunity but also a potential case of regulatory non-compliance, as enforcement agencies and courts of law often remind organizations.</p><h4>6. Private LLM deployment(s)</h4><p>Early in the advent of modern GenAI, the primary driver for using managed services, like Azure OpenAI, was the assurance that customer prompts and data would not be used to retrain public models. This was a crucial first step in enterprise adoption. However, as the field has matured, the conversation has shifted. While data privacy remains paramount, the decision to use private LLM instances, whether from major cloud providers or self-hosted, is now equally driven by the need for guaranteed throughput, predictable latency, and fine-grained control over the model&#x27;s operational environment to support production-grade applications.</p><p>This critical ingredient comes in three distinct flavors, each with valid use cases and its own set of trade-offs:</p><p><strong>A. Cloud-hosted SaaS</strong></p><p>This is the most common and accessible approach. Services like OpenAI Enterprise, Azure OpenAI, Google Cloud&#x27;s Vertex AI, and AWS Bedrock provide access to powerful, state-of-the-art models through a managed API.</p><ul><li><strong>Pros:</strong> This flavor offers the fastest time-to-market. The cloud provider handles all the underlying infrastructure, scaling, and maintenance, allowing teams to focus purely on application development. It provides a simple, pay-as-you-go model and access to a diverse model library of proprietary and open-source options.</li><li><strong>Cons:</strong> This approach offers the least control over the underlying infrastructure, which can lead to variability in performance during peak demand. It can also be more expensive at very high volumes, and it creates a dependency on the provider&#x27;s roadmap and model availability. 
It also increases the potential vulnerability surface of the application, with data leaving the customer premises: a challenge for highly regulated and/or sovereignty-minded customers.</li></ul><p><strong>B. Cloud-hosted GPU + containerized LLMs</strong></p><p>This flavor involves running open-source LLMs (like models from Mistral or Meta&#x27;s Llama series) on your own virtualized GPU infrastructure within a cloud provider. This is typically managed using the containerization and Kubernetes orchestration we&#x27;ve already discussed, often with high-performance inference servers like vLLM.</p><ul><li><strong>Pros:</strong> This approach provides a powerful balance of control and flexibility. You gain direct control over resource allocation, model versioning, and the serving configuration, allowing for significant performance tuning. In high-concurrency scenarios, a well-tuned inference server can dramatically increase throughput. For example, benchmarks have shown inference engines like vLLM delivering significantly higher tokens-per-second and lower latency compared to less production-oriented servers under heavy load [<a href="https://developers.redhat.com/articles/2025/09/30/vllm-or-llamacpp-choosing-right-llm-inference-engine-your-use-case">Red Hat, 2025</a>].</li><li><strong>Cons:</strong> This option carries a higher operational burden. Your team is now responsible for managing the GPU instances, container images, and the inference server configuration. It requires deeper technical expertise in machine learning operations (MLOps) and infrastructure management to implement and maintain effectively.</li></ul><p><strong>C. On-premises GPUs + containerized LLMs</strong></p><p>The most controlled, and often most complex, approach involves deploying containerized LLMs on your own dedicated hardware within your own data centers. 
This setup is functionally similar to the second flavor but removes the reliance on a public cloud provider for the hardware layer.</p><ul><li><strong>Pros:</strong> This flavor offers maximum security, control, and data sovereignty. It’s the only option for organizations that require a completely <em>air-gapped environment</em>, where no data leaves the physical premises. For massive, predictable workloads, it can become more cost-effective in the long run by avoiding cloud data egress fees and per-transaction costs.</li><li><strong>Cons:</strong> The initial capital expenditure for purchasing and maintaining high-end GPU hardware is substantial. It requires a highly specialized team to manage the physical infrastructure, networking, and the entire software stack. This approach is more difficult to scale, as it requires the physical procurement and installation of new hardware.</li></ul><h4>7. Search AI data store</h4><p>If the LLM is the brain of our GenAI application, then the datastore is its heart, pumping relevant, context-rich information to be reasoned upon. For a RAG application to be truly effective, it cannot rely on a simple vector database alone. The grounding data is often complex, containing a mix of unstructured text, structured metadata, and a variety of data types. Therefore, the datastore you select must possess a unique set of characteristics to handle this complexity at scale.</p><p>Underpinning this entire process is the creation of <em>vector embeddings</em>, numerical representations of your data relative to the knowledge set of that embedding space. To enable semantic search, your data must first be converted into these numerical representations by an inference model. A flexible datastore should not only store these vectors but also be capable of hosting the inference process itself. 
Crucially, it should allow you to use your model of choice, whether it&#x27;s a state-of-the-art multilingual model, a fine-tuned model for a specific domain like finance or law, a compact model built for very high-speed results, or even a model that can process images. By managing inference, the platform ensures that your data is consistently and efficiently vectorized, paving the way for the powerful search capabilities that follow.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dc05dfaf1bde3c7a74cf559b8c790a6e8e36be45-1600x900.png" style="width: 100%;" /><p>First, it must master <strong>hybrid search</strong>. The best retrieval systems don&#x27;t force a choice between traditional keyword search, like BM25, which excels at finding specific keywords, and modern vector search, which excels at finding results using semantic meaning (that is, natural language). A truly capable datastore allows you to use both simultaneously in a single query. This ensures you can find documents that match exact product codes or acronyms while also finding documents that are conceptually related, providing the LLM with the most relevant possible context.</p><p>Second, it needs a sophisticated method for <strong>intelligent result reranking</strong>. When you run a hybrid search that combines multiple approaches, you need a way to merge the different result sets into a single, coherent ranking. Techniques like reciprocal rank fusion (RRF) are crucial here, as they intelligently combine the result rankings from different queries to produce a final list that is more accurate and relevant than any single approach could deliver on its own.</p><p>Finally, a search AI-oriented datastore must be a <strong>unified engine with security built in</strong>. For enterprise RAG, it&#x27;s not enough to just find similar vectors. You must be able to apply security and access controls to data <em>before</em> the search even happens. 
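</p><p>The RRF technique mentioned above is simple enough to sketch in full. This is an illustrative pure-Python version of the classic formula (each document scores the sum of 1 / (k + rank) across result lists), not Elasticsearch&#x27;s internal implementation:</p><pre><code>def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists; each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)</code></pre><p>A document ranked well by both the BM25 list and the vector list accumulates score from each, so agreement between retrieval strategies is rewarded even when their raw relevance scores are not directly comparable.</p><p>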
The aforementioned RBAC and ABAC capabilities allow prefiltering of content at search time, ensuring that the vector search is only performed on data a user is authorized to see. This mitigates risks of accidental or malicious circumvention of your access controls through the sandbox, preserving demonstrable compliance with privacy and confidentiality requirements. This capability, which combines filtering, full-text search, and vector search in a single, scalable platform, is the defining characteristic of a datastore truly ready to power a secure, privacy-first GenAI sandbox.</p><h4>8. APM and security</h4><p>The final ingredient in our recipe ensures the health, security, and performance of the entire sandbox: a unified platform for APM and security information and event management (SIEM). A key characteristic of a truly versatile search AI datastore is its ability to power the <em>R</em> in your RAG applications, while also acting as the standards-based repository for all logs, metrics, and traces generated by your infrastructure and applications. By consolidating this operational data into the same powerful datastore, you create a single pane of glass for observability and security.</p><p>This approach provides several critical capabilities. At the infrastructure level, you can monitor the performance and resource utilization of both the k8s clusters hosting your sandbox and the underlying GPUs that power your LLMs, allowing you to proactively identify bottlenecks or failures. At the application layer, APM provides detailed traces to diagnose latency issues or errors within your GenAI prototypes. For security, this centralized datastore becomes your SIEM, correlating login events, application logs, and network traffic to detect anomalous behavior or potential threats within the sandbox.</p><p>Most importantly, this unified platform allows you to gain deep insights into the usage of the GenAI applications themselves. 
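</p><p>One practical safeguard when capturing that usage data is to redact obvious PII before a prompt ever reaches the log pipeline. A hedged sketch follows; the two patterns are illustrative assumptions and far from exhaustive, so production redaction should rely on a vetted PII-detection library:</p><pre><code>import re

# Illustrative patterns only; real-world PII detection needs much more.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(prompt: str) -> str:
    """Replace recognizable PII in a prompt with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED-{label.upper()}]", prompt)
    return prompt</code></pre><p>Running prompts through such a filter at the telemetry-collection edge lets you keep the analytic value of usage data while honoring the privacy-first posture of the sandbox.</p><p>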
By ingesting and analyzing the application telemetry, which should include the prompts being submitted by users wherever permissible, potentially with PII redacted, you can identify trends, understand what types of questions are being asked, and discover popular use cases. This provides an invaluable feedback loop for improving your RAG applications and demonstrates the power of using a single, scalable datastore to secure, monitor, and optimize your entire GenAI ecosystem.</p><h2>Cooking the recipe</h2><p>With all of the ingredients in place, let’s talk about the steps for assembling them into a production-grade sandbox.</p><p>As with any recipe book, let’s start with a photo of the cooked dish. Here’s a view of what a final architecture might look like:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de028ce2f5acd6f84e59ea67091472128a8a0143-1430x990.png" style="width: 100%;" /><p>The holistic environment depicted here consists of a Kubernetes cluster to host your sandboxed AI applications (with dev/preprod/prod namespaces for a continuous integration and continuous deployment [CI/CD] pipeline), an IAM infrastructure for authentication, a few GenAI applications, a repository for code and container images, and a wrapper of APM and cyber monitoring around the entire sandbox.</p><h3>Recipe step 0: Policy baseline</h3><p>Before you begin mixing any ingredients, every good chef performs their <em>mise en place</em>, that is, setting up their station for success. In our recipe, this means establishing clear policies for how the sandbox will be used. This is the foundational step, where you decide the rules of your kitchen. Will developers be allowed to use internal production data, or production data sanitized with techniques like pseudonymization and differential privacy, or life-like synthetic data, or only public data? Will the sandbox be a completely self-service platform, or a managed service with guardrails? 
Will application updates require a formal Change Review Board, or is a peer-review process sufficient? These questions are highly specific to each organization’s context and purposes. Answering them up front is critical, as these policy decisions will directly influence how you configure every other ingredient in the recipe.</p><h3>Recipe step 1: InfoSec baseline</h3><p>As stated in the “Ingredients” section, IAM is a nonnegotiable part of our recipe. Before letting anyone into the kitchen, you must secure the perimeter and ensure only authorized chefs wearing your approved uniform and compliant protective gear can access the tools and ingredients. This means working directly with your information security organization from day one to build the sandbox on a foundation of strong security principles. Access to your datastore, your code repository, your Kubernetes hosting environment, and the applications themselves must be restricted based on established best practices.</p><p>With your organization’s IAM policies enforced in the environment, a practical authentication flow might look like the one depicted in figure 3.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7b1062f775ed2cf21e9386d1999933ba3f73efc4-1432x1488.png" style="width: 100%;" /><p>As you can see in the figure, no communication can occur between applications in the Kubernetes production namespace without first passing through an OAuth proxy, such as Vouch. This ensures every user is authenticated against a central provider, like Okta, which enforces policies such as two-factor authentication. 
In this model, critical user context, such as username and IP address, can be passed along with every request, enabling robust auditing and nonrepudiation at the application layer.</p><h3>Recipe step 2: Container configuration baseline</h3><p>Assuming that many of your rapid prototypers are passionate innovators but not necessarily seasoned software engineers or legally trained data compliance experts, it’s critical to provide a baseline configuration to ensure their success and security, without putting them at risk of inadvertently breaching any rules or policies. Think of this step as providing a master recipe card that guarantees consistency. At a minimum, you should provide clear documentation on how to build a container image, deploy it into the Kubernetes cluster, and test that all connectivity is secure.</p><p>Even better, you can create a “Clone This Starter App” template in your code repository. This gives developers a preconfigured, security-blessed starting point, complete with Dockerfiles and pipeline scripts, that they can immediately fork to begin tinkering, dramatically lowering the barrier to entry while enforcing best practices from the outset.</p><p>Additionally, many real-life GenAI use cases will inevitably involve some form of PII processing, or can produce outputs that will materially impact individuals such as your employees, your consumers, or your customers’ staff. In such cases, a growing number of state, federal, and international laws require completing various risk assessments before actual work can begin. These assessments can be cumbersome to conduct and are difficult to scale if they are carried out case by case. 
The “Clone This Starter App” approach also helps to prevent such compliance mandates from becoming bottlenecks to innovation, since under most legal mandates, the required assessments can be completed once for your template, and they need not be repeated for any clone that doesn’t exceed your initially defined parameters.</p><h3>Recipe step 3: Deploy user applications</h3><p>With your policies defined, your security baseline established, and your developer templates in place, it’s finally time to serve the dish. Whether you&#x27;ve chosen a self-service or managed deployment model, you can now confidently invite the rapid prototypers in your organization to start creating in the sandbox.</p><p>Because you’ve included APM and security logging (ingredient #8) from the beginning, you have the necessary observability to monitor application performance and user activity. This is where the magic happens: You can now learn from the applications people build, identify powerful new use cases, and gather real-world data to improve the platform, all while safeguarding organizational data. Coincidentally, this approach will also allow you to organically collect the information you might need to put on record, disclose to users, or share with auditors and regulators to demonstrate the transparency, accountability and explainability of your GenAI application, ticking many compliance boxes as you build (and not after the fact) – a textbook best practice of <em>Privacy by Design</em>.</p><h2>Where do you go from here?</h2><p>We&#x27;ve now walked through the entire cookbook, from selecting your fresh ingredients to following the recipe step by step. Most of the domains we&#x27;ve discussed (containerization, APM, IAM, and more) are culinary specialties in and of themselves.</p><h2>Conclusion</h2><p>This cookbook was designed to provide a clear recipe for building a production-grade GenAI sandbox. 
By carefully selecting each foundational ingredient, from your containerization platform and Kubernetes orchestration to your search AI datastore and unified APM, you ensure your final dish will be both successful and secure. Following the recipe ensures that this powerful environment is built on a foundation of security and thoughtful policy from day one.</p><p>The goal is to empower your rapid prototypers, not restrict them, and to foster a culture of responsible innovation. By providing a secure, observable, and well-equipped kitchen for experimentation, you get ahead of the curve. This proactive approach enables you to harness the creativity of your entire organization, transforming brilliant ideas into tangible prototypes while preventing the rise of shadow AI. You&#x27;ve cooked the meal; now you can enjoy the innovation it serves.</p><p>If you want to chat about this or anything else related to Elasticsearch, come join us in our <a href="https://discuss.elastic.co/">Discuss forum</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Query rewriting strategies for LLMs and search engines to improve results]]></title>
            <description><![CDATA[Exploring query rewriting strategies and explaining how to use the LLM's output to boost the original query's results and maximize search relevance and recall.]]></description>
            <link>https://www.elastic.co/search-labs/blog/query-rewriting-llm-search-improve</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/query-rewriting-llm-search-improve</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Christina Nasika, Emilia Garcia Casademont]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e56cb99dab7734d506e6b8a226c798047122c1a9-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>One of the biggest challenges in integrating large language models (LLMs) into search pipelines is the complexity of navigating the space of possibilities that they provide. This blog focuses on a small set of concrete <strong>query rewriting (QR) </strong>strategies, using LLM-generated keywords, pseudo-answers, or enriched terms. We specifically focus on how to best use the LLM&#x27;s output to strategically boost the original query&#x27;s results to maximize search relevance and recall.</p><h2>LLMs and search engines: An exploration of query rewriting strategies for search improvement</h2><p>The integration of large language models (LLMs) with search engines represents a significant advancement in the fields of search and AI. This integration can take many forms, depending on the application [1]. On one hand, search engines can enhance LLMs by grounding their generation in freshly retrieved data, a strategy that’s commonly referred to as retrieval-augmented generation (RAG). On the other hand, LLMs can improve search engines by optimizing various stages of the search pipeline.</p><p>The search pipeline typically consists of three main components: indexing, first-stage retrieval, and reranking. LLMs can contribute at each of these stages. During indexing, LLMs can assist with data preparation tasks, such as text cleaning, entity extraction, and document enrichment by adding domain knowledge, synonyms, or clarifications. In the first-stage retrieval, LLMs can optimize user queries by rewriting them in natural language to improve retrieval accuracy or by mapping them to structured queries (for example, domain‑specific language–style [DSL-style] or Elasticsearch Query Language–style [ES|QL-style] queries) [2]. This blog focuses on query optimization strategies for this stage.</p><p>While there is some research on using LLMs as rerankers [3], the literature is less extensive. 
Technical blogs on the topic suggest that LLMs as rerankers may not always be the optimal choice, though this area remains an active field of exploration [4,5].</p><p>The advancements in LLMs have also unlocked new possibilities beyond the traditional indexing and retrieval stages. For example, LLMs can be used to generate natural language responses grounded in retrieved data (RAG). Users increasingly expect coherent, natural language responses to their queries that are dependable and grounded in the retrieved data. This shift in user expectations has happened quickly since LLM-integrated search engines became generally available, and it marks a major improvement in user experience. A language model that can understand intent, retrieve data, and synthesize information is especially valuable in multi-hop scenarios where a query requires combining information from various sources.</p><p>This is even clearer when looking at the application of LLMs in the creation of interactive, conversational search interfaces. These interfaces allow users to submit queries, provide feedback on responses, or introduce clarifications, enabling iterative improvements to the results, while making use of the historical context of the conversation. Taking this a step further, integrating LLMs with autonomous capabilities, such as planning, retrieving, reasoning, and decision-making, can lead to the development of agentic search systems. These systems can refine results based on user feedback or self-evaluation, creating a dynamic and intelligent search experience.</p><p>Finally, LLMs are widely used in search tool development, from data generation to serving as backbones for embedding and reranking models [6,7,8]. 
Synthetic data generation has become a common step in training retrieval models, and LLMs are increasingly being used as judges to generate labels for training and evaluation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8ce0659b6c1b5bf59c1a7d4ab23e672f0970533-1600x656.png" style="width: 100%;" /><h3>Query rewriting and optimization strategies</h3><p>Query rewriting strategies are best understood by categorizing user queries into two main types: <em>retrieval </em>and <em>computational</em>.</p><h4>Retrieval queries</h4><p>Unlike with computational queries, the user&#x27;s intent here is information retrieval, not calculation. These are the standard queries handled by retrieval algorithms, like lexical and vector search. For example, for the following query:</p><p>&quot;What is the origin of COVID-19?&quot;</p><p>the search targets texts that provide answers or context relevant to the query.</p><h4>Computational queries</h4><p>These queries require calculation, aggregation, or structured filtering to produce an answer. They must be translated from natural language into a structured query language, like Elasticsearch DSL or ES|QL.</p><p>For example, consider a query like:</p><p>&quot;What was the average amount spent by customers who placed more than five orders in the last month?&quot;</p><p>Assuming that the information on the orders and customers can be found in some available index, this query requires more than simple text matching. It involves filtering by a date range, grouping by customer, calculating order counts, keeping only customers with more than five orders, and computing the final average. 
In this case, the LLM&#x27;s task is to parse the natural language and generate the corresponding structured query to execute these calculations.</p><p>Another example would be:</p><p>&quot;Which universities in Germany have an acceptance rate below 20%, and what is their average tuition fee?&quot;</p><p>Here, no single indexed document may contain that specific information; instead, there might be documents containing acceptance rate information separately from tuition fee information.</p><p>In computational queries, the model is essentially expected to decompose the query into a retrievable informational query and a calculation that can be performed when the retrieved data is available, or to build a structured query that can do both.</p>
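<p>To make the decomposition concrete, here is a minimal sketch of the kind of structured query an LLM might emit for the orders example above. This is illustrative only: the field names <code>customer_id</code>, <code>amount</code>, and <code>order_date</code> are hypothetical, not taken from any real index.</p>

```python
# Hypothetical orders index: one document per order, with fields
# customer_id (keyword), amount (float), and order_date (date).
# The aggregation filters to the last month, groups orders by customer,
# keeps only customers with more than five orders (bucket_selector),
# and then averages each remaining customer's total spend (avg_bucket).
query = {
    "size": 0,
    "query": {"range": {"order_date": {"gte": "now-1M/d"}}},
    "aggs": {
        "per_customer": {
            "terms": {"field": "customer_id", "size": 10000},
            "aggs": {
                "total_spent": {"sum": {"field": "amount"}},
                "more_than_five_orders": {
                    "bucket_selector": {
                        # _count is the number of orders in each customer bucket
                        "buckets_path": {"order_count": "_count"},
                        "script": "params.order_count > 5",
                    }
                },
            },
        },
        # Sibling pipeline aggregation: average over the surviving buckets.
        "avg_spent": {"avg_bucket": {"buckets_path": "per_customer>total_spent"}},
    },
}
```

<p>The same decomposition could instead retrieve the per-customer buckets and compute the final average client-side; which of these the LLM should produce is exactly the kind of decision a prompt template needs to constrain.</p>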
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Primary mechanisms</td><td style="border: 1px solid #ccc; padding: 4px;">Example</td><td style="border: 1px solid #ccc; padding: 4px;">Query rewriting task</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical and vector search. </td><td style="border: 1px solid #ccc; padding: 4px;">"What is the origin of COVID-19?"</td><td style="border: 1px solid #ccc; padding: 4px;">Optimize the query's text to improve lexical or vector matching.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Computational</td><td style="border: 1px solid #ccc; padding: 4px;">Calculation, aggregation, or structured filtering.</td><td style="border: 1px solid #ccc; padding: 4px;">"Which universities in Germany have an acceptance rate below 20%, and what is their average tuition fee?"</td><td style="border: 1px solid #ccc; padding: 4px;">Decompose the query: Retrieve relevant data (university profiles), and then perform a calculation (average tuition fee) on the subset of retrieved documents.</td>
                    </tr>
              </tbody>
            </table>
          <h3>Design methodology: Template-based expansion</h3><p>The main focus of this blog is on retrieval queries. However, unlike standard approaches that simply ask an LLM to rewrite or completely rephrase a query, we adopted a template-based design methodology.</p><p>Improving query-text-to-query-text relevance by completely rephrasing the query is often not straightforward for either lexical or vector search. It introduces the complexity of merging search results when multiple hypotheses are considered, and the LLM can often drift from the original user intent. Instead, we explore expanding the original query through an Elasticsearch Query DSL template + prompt strategy. By using specific prompts, we guide the LLM to output specific textual elements (like a list of entities, synonyms, or a pseudo-answer) rather than giving it free rein. These elements are then plugged into a predefined Elasticsearch Query DSL template (a search &quot;recipe&quot;).</p><p>This approach reduces the scope of the LLM application, making the output more deterministic. In our experiments, the LLM is simply prompted to output some text, which is then inserted into the template.</p><p>To validate this approach, we performed a limited exploration of different Elasticsearch primitives to identify and &quot;freeze&quot; a good-enough search template. This allowed us to test how different prompting strategies affect relevance within that fixed structure, rather than changing the structure itself.</p><p>While this blog focuses on retrieval queries, and on lexical extraction and semantic expansion strategies where the linguistic aspect plays the major role, this methodology is flexible. Specific templates could be designed for other specific retrieval query use cases, such as handling product codes, since relevance criteria are often context dependent. 
However, use cases with queries dependent on complex aggregations or strict filtering should be considered computational queries, which would require query optimization strategies outside the scope of this blog.</p><h3>Query optimization strategies</h3><p>While query optimization predates LLMs, LLMs excel at this task. They can be prompted to apply several rewriting strategies [9], such as:</p><ul><li>Generic query rephrasing.</li><li>Pseudo-answer generation.</li><li>Noise reduction (removing irrelevant text, extracting important entities).</li><li>Entity enrichment (synonyms, abbreviation expansion, or other related terms).</li><li>Fixing typos.</li><li>A combination of the above.</li></ul><p>Most of these techniques depend on the model’s capacity to understand user intent and its knowledge of the corpus characteristics.</p><p>In the following sections, we’ll present our experimentation with query rewriting for informational queries and its application to Elasticsearch. We’ll cover our most successful experiments and discuss our unsuccessful ones.</p><h3>Experiments and results</h3><p>All the experiments presented in this blog were run using Anthropic Claude 3.5 Sonnet. Normalized Discounted Cumulative Gain (NDCG) at position 10 and Recall at positions 10 and 50 are the metrics used to evaluate the experiments throughout the blog. All <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">NDCG</a> and <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Recall">recall</a> scores were computed using <code>pytrec_eval</code> [10].</p><p>We experimented with several variations of the query rewriting task for informational queries and measured relevance and recall gains for three widely used information retrieval benchmarks: Benchmarking Information Retrieval (BEIR; 15 datasets, English-only) [11], 
Multilingual Document Ranking (MLDR; 13 languages) [12], and Multilingual Information Retrieval Across a Continuum of Languages (MIRACL; 18 languages) [13].</p><p>We mainly explored the following tasks:</p><ul><li>Lexical keyword enrichment.</li><li>Pseudo-answer generation.</li><li>Letting the model decide on a method or a combination of methods among keyword extraction, keyword enrichment, and pseudo-answer generation.</li></ul><p>We detail the prompts we used for each case and expand on some attempted variations below.</p><p>It’s worth noting that, out of the datasets we evaluated, only four within BEIR (NQ, Robust04, Quora, and MS MARCO) contain real user queries that can benefit from generic query rewriting strategy fixes, such as misspellings, corrections, or query cleaning. The rest of the datasets are either synthetically generated (MLDR, MIRACL) or human-constructed (most of the BEIR datasets).</p><h4>Lexical keyword enrichment</h4><p>This is the first task we tried, and we considered various prompts in an effort to optimize results. We started from the simplest possible version, prompting the LLM to extract relevant keywords without specifying more details.</p><p><strong>Prompt 1.</strong> </p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query. 
Please, extract them and list them in a comma-separated list. Reply with the list only.</code></pre><p>On a second attempt, we tried a prompt with more explicit instructions, prompting the model to provide only the most important keywords, and insisting on why that is important for our use case. We also introduce here the idea of entity enrichment, prompting the model to augment the original query only if it considers it to be too small or missing information.</p><p><strong>Prompt 2. </strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            <Instructions>
            Think about the query and select only the very central and necessary entities. 
            They will be used as boosters for the lexical search, so make sure to only extract critical ones. We don't want to boost documents that contain entities that might not be related to the specific context of the query. 
            If, and only if, the query is too short and is missing essential information, see if you can extract synonyms or enrich the query with entities that don't exist in the original query but might help the search.
            </Instructions>
            Return the thinking process inside <thinking> tags and the final entities inside <entities> tags.</code></pre><p>Finally, we tried a prompt with even more explicit instructions and details encouraging the model to apply different techniques based on the original query’s length.</p><p><strong>Prompt 3.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            <Instructions>
            Extraction for boosting -  query length independent:
         Think about the query and identify the main context or the information that the user is asking for. 
         Select only the very central and necessary entities. They will be used as boosters for the lexical search, so make sure to only extract critical ones. 
         We don't want to boost documents that contain entities that might not be related to the specific context of the query.

             Enrichment - query length dependent:
       	If the query is shorter than 5 words, then extract the most important entities, but also evaluate whether you can enrich with synonyms or other entities that might enhance recall.
             If the query is larger than 10 words stick to entities that exist in the query
             If you find any initials, return the initials as well as the full words represented by the initials

            Stem-proofing - query length independent:
         Regardless of the query length, return the entities in different forms, in order to overcome stemming miss-matches, provided that the versions of the word are relevant to the query context e.g.: walk, walked, walking.
            </Instructions>
Return the thinking process inside <thinking> tags and the final entities, comma separated, inside <entities> tags.
          Make sure you are only returning the most important entities! Don't return entities that are too limiting.	
</code></pre><p>We ran lexical search tests on the three prompt variations on a subset of BEIR datasets and compared performance in terms of relevance and recall. The following table lists averaged results over datasets ArguAna, FiQA-2018, Natural Questions (NQ), SciDocs, SciFact, TREC-COVID, Touché 2020, NFCorpus, Robust04:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Original query</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 1</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 2</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 3</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td><td style="border: 1px solid #ccc; padding: 4px;">0.345</td><td style="border: 1px solid #ccc; padding: 4px;">0.356</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.454</td><td style="border: 1px solid #ccc; padding: 4px;">0.453</td><td style="border: 1px solid #ccc; padding: 4px;">0.466</td><td style="border: 1px solid #ccc; padding: 4px;">0.455</td>
                    </tr>
              </tbody>
            </table>
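<p>The scores above come from combining the original query with the LLM-returned terms in a single Boolean query. As a minimal sketch of that "recipe" (the field name <code>text</code> is an assumption for illustration; the exact DSL template used in the experiments may differ), the query shape can be built like this:</p>

```python
# Sketch of the bool-query template: the original query is required
# (must clause), and each LLM-extracted term optionally boosts the
# score (one should clause per term). The field name "text" is assumed.
def build_boosted_query(original_query: str, llm_terms: list[str]) -> dict:
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": original_query}}],
                "should": [{"match": {"text": term}} for term in llm_terms],
            }
        }
    }

# Example with keywords like those returned for the "low-carb diets" query.
q = build_boosted_query(
    "low-carb diets",
    ["low-carb", "diets", "ketogenic", "keto", "carbohydrate"],
)
```

<p>Because the original query sits in a <code>must</code> clause, only documents matching it are returned; the <code>should</code> clauses can only re-order them, which matches the rescoring behavior described below.</p>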
          <p>We consider a baseline lexical search of the original queries and compare with experiments where the LLM-output entities are used in lexical search. The results are linearly combined with an Elasticsearch <code>bool</code> query with a <code>must</code> clause wrapping the original query and a <code>should</code> clause for each returned LLM term. (See the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Elasticsearch docs</a> for more information on Boolean queries.) The linear combination can be expressed as:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5ccd419194eab69ea5c214bfb0e109c4f8248a7-1146x306.png" style="width: 100%;" /><p>Where n is the number of keywords returned for each query. Score in this case refers to a lexical BM25 score. The query is written in native Elasticsearch Query DSL and is provided in the following section.</p><p>The rescoring, including the contribution of the extracted keywords, was applied to the top 200 retrieved documents.</p><p>Prompt 2 performs best, both in terms of relevance and recall considering the top 10 results, improving +1pt, whereas the other prompts have little effect. 
We kept prompt 2 for lexical keyword extraction going forward.</p><p>We visually inspect the returned keywords for some of the tested queries in the following examples, and identify several query optimization techniques that were mentioned in the previous section.</p><p>Typo corrections:</p><ul><li><strong>Query:</strong> when does season 3 of strnger things come out</li><li><strong>Keywords:</strong> Stranger Things, Season 3</li></ul><p>Noise reduction:</p><ul><li><strong>Query:</strong> Evolvability : What Is It and How Do We Get It?</li><li><strong>Keywords:</strong> evolvability</li></ul><p>Keyword expansion/augmentation:</p><p>a)</p><ul><li><strong>Query:</strong> Trust-Aware Review Spam Detection</li><li><strong>Keywords:</strong> Trust-Aware, Review Spam Detection, fake reviews, credibility</li></ul><p>b)</p><ul><li><strong>Query:</strong> Dimensional inconsistencies in code and ROS messages: A study of 5.9M lines of code</li><li><strong>Keywords:</strong> dimensional inconsistencies, code, ROS, Robot Operating System, 5.9M</li></ul><p>c)</p><ul><li><strong>Query:</strong> low-carb diets</li><li><strong>Keywords:</strong> low-carb, diets, ketogenic, keto, carbohydrate</li></ul><h4>Pseudo-answer generation</h4><p>Prompting the LLM to generate pseudo-answers to the original query is another technique we tried. It has been explored before in the context of query rewriting [9] and is expected to especially benefit factual queries while matching the advantages of keyword augmentation. For this method we used the prompt:</p><p><br/><strong>Prompt 4.</strong></p><pre><code>You are a search engine assistant and you want to generate 5 possible replies for a query. 
The replies will be used to boost the search results, in a combined manner with the original query. Make sure that the generated results respect the preferences of most search (lexical or vector) engines, that is, they should be concise, relevant, and cover different aspects of the query.
Return the rewritten replies inside <replies> tags. Separate the replies by the line break symbol.
</code></pre><p>We explore some examples of LLM outputs for this prompting strategy. For comparison, the extracted Entities for Prompt 4 are also listed.<br/>Providing replies to a general knowledge question:</p><p><strong>Query: </strong>What is the best way to practice web development?</p><ul><li><strong>Replies:</strong><ul><li>Learn HTML, CSS, and JavaScript fundamentals</li><li>Build personal projects and create a portfolio</li><li>Contribute to open-source projects on GitHub</li><li>Follow online tutorials and take coding courses</li><li>Practice responsive design and cross-browser compatibility</li></ul></li><li><strong>Entities (Prompt 4): </strong>development, practice, coding projects, tools</li></ul><p>Offering imagined document titles that might contain a reply to a question that cannot be answered based on general knowledge:</p><p><strong>Query: </strong>In your opinion, what is the best classical song of all time?</p><ul><li><strong>Replies:</strong><ul><li>Best classical compositions ranked by music critics</li><li>Most popular and influential classical pieces in history</li><li>Timeless masterpieces from renowned classical composers</li><li>Iconic classical works that have stood the test of time</li><li>Top classical music recommendations for beginners and enthusiasts</li></ul></li><li><strong>Entities (Prompt 4):</strong> classical music, classical composition, symphony, concerto, opera</li></ul><p><strong>Query: </strong>A description of the history and meaning of the 1st amendment</p><ul><li><strong>Replies:</strong><ul><li>First Amendment: origins, historical context, and significance in US Constitution</li><li>Freedom of speech, religion, press, assembly: core principles of 1st Amendment</li><li>Supreme Court interpretations and landmark cases on First Amendment rights</li><li>Evolution of First Amendment protections throughout American history</li><li>Impact and controversies surrounding the First Amendment in modern 
society</li></ul></li><li><strong>Entities (Prompt 4):</strong> Amendment, Constitution, Bill of Rights</li></ul><p>We observe an interesting combination of LLM output modes for this strategy. Different types of queries are handled differently:</p><ul><li>When the query can be directly replied to based on general knowledge, the LLM returns possible replies.</li><li>If not, the LLM “imagines” what the type of document that the user might want could look like and returns these title-like document descriptions as replies.</li></ul><p>Both methods end up introducing useful and creative new keywords.</p><h4>Letting the model choose a method</h4><p>Finally, experiments were run using a prompt that encouraged the model to select the most appropriate method, or combination of methods and asked it to explain its rationale behind the choice. We expect the model to be inherently capable of “thinking” through how each query can be improved in a search context:</p><p><br/><strong>Prompt 5.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            There are 3 basic methods to do so:
            1. extract important entities and keywords
            2. create a pseudo answer to the query
            3. expand the initial query with synonyms and related terms
            Given a query, choose a method or a combination of any of them and return the rewritten query inside <rewritten query> tags.
            You can separate the different parts or terms with the line break symbol.
            Also return the rationale behind your choice of methods inside <thinking> tags. Why does this query need this type of method?</code></pre><p>The query rewriting implementation for prompts 4 and 5 closely resembles that of prompt 2. Specifically, the LLM output is incorporated into a Boolean query, where the original query is placed in a <code>must </code>clause and each LLM-generated term is included in a <code>should</code> clause. For prompt 4, an LLM-output term represents a single pseudo-answer, while for prompt 5, it represents a rewrite.</p><p>Overall, the change in performance using different prompts is significant in the context of lexical search, where prompts 4 and 5 outperform prompt 2. In the following table, LKE refers to lexical keyword extraction using prompt 2, PA refers to pseudo-answer generation using prompt 4, and MC stands for model’s choice and refers to prompt 5. The model’s output is used according to equation 1.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0bb953db787dcb03c9221e9b6e7969d3bffdb17f-2002x670.png" style="width: 100%;" /><p>In the final row of the table, the scores are averaged at the benchmark level. It’s computed as an average of the average scores of BEIR, MLDR, and MIRACL benchmarks. The pseudo-answers and model’s choice strategies perform better across metrics, with pseudo-answers being slightly better.</p><p>We further analyze these prompting techniques and obtain more results in the following section, with respect to vector search experiments.</p><p>We also present the results of lexical search when using <em>only</em> the query rewriting terms. 
In the table below, the original query&#x27;s contribution to the scores is entirely excluded:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4dcdfe2572217ce017c746c55f016f282c61c6e3-2006x670.png" style="width: 100%;" /><p>Excluding the original query scores from the results appears to hurt relevance performance on average. In terms of Recall@50, the pseudo-answer strategy improves performance compared to the baseline, but the boost is smaller than when the strategy includes the original query.</p><p>Overall, we recommend combining the query rewriting terms with the original query to achieve gains across metrics in lexical search.</p><h4>Large language models versus small language models</h4><p>For the majority of the results discussed in this blog, we utilized Anthropic&#x27;s Claude 3.5 Sonnet LLM. However, we also experimented with a smaller model to assess how inference cost affects performance. We tried LKE with Anthropic’s Claude 3.5 Haiku on a subset of BEIR datasets (ArguAna, FiQA-2018, Natural Questions [NQ], SciDocs, SciFact, TREC-COVID, Touché 2020, NFCorpus, Robust04).</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Original query </td><td style="border: 1px solid #ccc; padding: 4px;">LKE with Sonnet</td><td style="border: 1px solid #ccc; padding: 4px;">LKE with Haiku</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td><td style="border: 1px solid #ccc; padding: 4px;">0.364</td><td style="border: 1px solid #ccc; padding: 4px;">0.368</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.454</td><td style="border: 1px solid #ccc; padding: 4px;">0.472</td><td style="border: 1px solid #ccc; padding: 4px;">0.475</td>
                    </tr>
              </tbody>
            </table>
          <p>Relevance and recall within the top 10 results remain unaffected. While this initial investigation is not exhaustive and requires further study in real-world scenarios that implement query optimization, these first results strongly suggest that small language models (SLMs) are likely a viable option for this specific use case.</p><p>A comparison between Claude 3.5 Sonnet and Claude 3.5 Haiku is provided below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Number of
parameters</td><td style="border: 1px solid #ccc; padding: 4px;">Context window</td><td style="border: 1px solid #ccc; padding: 4px;">Max output</td><td style="border: 1px solid #ccc; padding: 4px;">Input cost</td><td style="border: 1px solid #ccc; padding: 4px;">Output cost</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Claude 3.5 Sonnet</td><td style="border: 1px solid #ccc; padding: 4px;">Est. ~175B</td><td style="border: 1px solid #ccc; padding: 4px;">200K</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$3.00 / million tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$15.00 / million tokens</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Claude 3.5 Haiku</td><td style="border: 1px solid #ccc; padding: 4px;">Est. ~20B</td><td style="border: 1px solid #ccc; padding: 4px;">200K</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$0.80 / million tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$4.00 / million tokens</td>
                    </tr>
              </tbody>
            </table>
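<p>To make the cost difference concrete, the per-call economics can be sketched with a short back-of-the-envelope calculation. The per-million-token rates below are Anthropic&#x27;s published 3.5-series prices; the token counts are illustrative assumptions, not measurements from our experiments.</p>

```python
# Back-of-the-envelope cost of one query-rewriting call.
# Prices are per million tokens; token counts are assumed for illustration.
PRICES = {
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3.5-haiku": {"input": 0.80, "output": 4.00},
}

def rewrite_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single rewriting call."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Assume a ~500-token prompt (instructions + query) and a ~100-token rewrite.
sonnet = rewrite_cost("claude-3.5-sonnet", 500, 100)
haiku = rewrite_cost("claude-3.5-haiku", 500, 100)
print(f"Sonnet: ${sonnet:.6f} per query, Haiku: ${haiku:.6f} per query")
```

<p>Under these assumptions, the smaller model is roughly 4x cheaper per rewriting call, which compounds quickly at production query volumes.</p>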
          <h3>Query rewriting in Elasticsearch</h3><p>In enterprise search, where precision and control are paramount, we explored methods that integrate query rewriting with existing search functionality. The focus was on strategies that build upon the original query to target relevance gains without a high implementation cost.</p><p>Elasticsearch features a wide range of search tools that tackle different search scenarios. It supports lexical and vector retrieval, as well as rerankers. We look for optimal ways to integrate query rewriting strategies into Elasticsearch, exploring both base retrievers and hybrid methods.</p><p>In the previous section, we presented results on lexical search and introduced equations 1 and 2. These correspond to the following Elasticsearch Query DSL code:</p><pre><code>{
  "query": {
    "bool": {
      "must":  { "match": { "text": "ORIGINAL QUERY" } },
      "should": [
        { "match": { "text": "QR TERM 1" } },
        { "match": { "text": "QR TERM 2" } },
        { "match": { "text": "QR TERM 3" } }
      ]
    }
  }
}</code></pre><p>QR TERM 1, 2, and 3 stand for query rewriting terms and refer to whatever the LLM output represents: keywords, pseudo-answers, or other types of replies.</p><p>The <code>bool</code> query functions like a linear combination of terms. Crucially, the <code>must</code> clause enforces <em>hard</em> requirements, meaning any document that fails to match this clause is excluded from the results. In contrast, the <code>should</code> clause operates as a score booster: Documents matching it receive a higher final score, but documents that don&#x27;t match are not discarded from the results.</p><p>Through iterative experimentation, we determined the most effective query configuration. Initial attempts included querying solely with terms generated by the LLM or various combinations of the original query and LLM terms. We observed that overreliance on LLM output reduced relevance. The optimal setup, which consistently yielded the best results, required the full inclusion of the original query, with the LLM output used only to selectively boost the ranking of certain documents.</p><h4>Dense vector search as base retriever</h4><p>When moving to vector search, the narrative changes. It’s already well-established in the industry that hybrid search (lexical + vector) improves both relevance and recall by combining the semantic understanding of dense vectors with the exact matching precision of BM25. Our goal here was to determine whether query rewriting applied to a vector retriever closes the same gap that hybrid search does or provides additional improvement.</p><p>For experiments where the base retriever was not BM25 but rather a K-Nearest Neighbors (KNN) search using dense vectors, we chose to explore the maximum gains we can achieve in terms of relevance and recall using an optimized, weighted linear combination.
We opted to optimize the weight to ensure that the impact of query rewriting strategies on performance is isolated, preventing any potential influence from a poorly chosen linear combination on the results. We ran an optimization process expressed in equation 3 using Bayesian optimization (Optuna [14]) on the test set itself.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4d8498261842d6d0942661e8bbfbe8726ddce9c-1186x176.png" style="width: 100%;" /><p>The optimization process was conducted on the same text queries used for evaluation to establish the upper bounds of potential performance improvements. Because we’re tuning only one parameter, the chance of overfitting the test data is minimal. We verify this hypothesis below by running the optimization for some datasets on the train split and observing the difference in terms of performance.</p><p>For vector search evaluation, we use two optimization metrics:</p><ul><li><strong>LINEAR NDCG@10 OPTIMIZED(vector_oq, bm25_qr):</strong> The weight is optimized to achieve the maximum NDCG at the top 10 results.</li><li><strong>LINEAR RECALL@50 OPTIMIZED(vector_oq, bm25_qr):</strong> The weight is optimized to achieve the maximum recall at the top 50 results.</li></ul><p>In these metrics, <strong>oq</strong> stands for the original query, and <strong>qr</strong> stands for query rewriting. We include recall at 50 to assess query optimization&#x27;s performance as a first-stage retriever, with the assumption that the search results will subsequently be processed by a reranker.</p><p>To provide a comparison, we also conducted experiments where the BM25 scores of the original query were combined with the vector search scores.
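<p>The single-weight search of equation 3 is simple enough to sketch in a few lines. The blog used Bayesian optimization via Optuna; because only one parameter is tuned, a plain grid search over the weight illustrates the same idea without the dependency. All scores and relevance labels below are toy placeholders, not data from our experiments.</p>

```python
# Sketch of equation 3: find the weight w maximizing a retrieval metric for
# fused scores  w * vector_score + (1 - w) * bm25_qr_score.
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant documents found in the top k of the ranking.
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def fused_ranking(vector_scores, bm25_scores, w):
    # Rank document ids by the weighted linear combination of the two scores.
    return sorted(
        vector_scores,
        key=lambda d: w * vector_scores[d] + (1 - w) * bm25_scores.get(d, 0.0),
        reverse=True,
    )

# Toy example: one query, three documents, document "b" is relevant.
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
bm25_scores = {"a": 0.1, "b": 0.9, "c": 0.2}
relevant = ["b"]

best_w, best_metric = None, -1.0
for i in range(101):  # grid over w in [0, 1]
    w = i / 100
    m = recall_at_k(fused_ranking(vector_scores, bm25_scores, w), relevant, k=1)
    if m > best_metric:
        best_w, best_metric = w, m
print(best_w, best_metric)
```

<p>In practice, the metric would be averaged over all benchmark queries, and a Bayesian optimizer converges with fewer evaluations than an exhaustive grid.</p>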
These combinations are referred to as:</p><ul><li><strong>LINEAR NDCG@10 OPTIMIZED(vector_oq, bm25_oq)</strong></li><li><strong>LINEAR RECALL@50 OPTIMIZED(vector_oq, bm25_oq)</strong></li></ul><p>For the experiments in the following tables, we used the <code>multilingual-e5-large</code> [15] dense vector model for benchmarks BEIR and MIRACL, and the <code>Qwen3-0.6B-Embedding</code> [16] model to search for long-context documents in the MLDR benchmark.</p><p>First, we verify that performing the optimization on the evaluation split instead of the training split yields results that follow the same trends. The pseudo-answers prompting strategy was used to compute the qr scores.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb329c72a7b1123428ed519f77d008c3084c064e-1736x532.png" style="width: 100%;" /><p>We compare the recall@50 scores when optimizing on the training split versus the evaluation split for MIRACL and MLDR, finding that both produced results in the same range.</p><p>In the following tables, the evaluation split is used for optimization. The final row contains results that are averaged on the benchmark level as in the previous tables.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9d10f644b474b5e114504e9515eee7457bbfa565-2014x820.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dfe8d5cbe9d9f5d7e8d2aa6f146550465ed4e73b-2014x402.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a88b3aef38fad245f629951b247a16fc8eb93574-1452x352.png" style="width: 100%;" /><p>We omit the Recall@10 results as they are very similar to NDCG@10 results. These tables show no advantage in hybrid search using QR terms instead of the original query. In terms of relevance, replacing the original query with QR seems to deteriorate results.
In terms of recall, some gains are achieved in BEIR and MIRACL, but the averaged score reveals no advantage over a well-tuned hybrid search.</p><p>We further explored hybrid search using reciprocal rank fusion (RRF), relying on the built-in Elasticsearch functionality. Method <strong>RRF(vector_oq, bm25_qr)</strong> refers to the following DSL code:</p><pre><code>{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "knn": {
            "field": "dense_vector_field",
            "query_vector": ORIGINAL_QUERY_VECTOR,
            "k": 100,
            "num_candidates": 200
          }
        },
        {
          "standard": {
            "query": {
              "bool": {
                "should": [
                  { "match": { "text": "QR TERM" } },
                  { "match": { "text": "QR TERM" } },
                  { "match": { "text": "QR TERM" } }
                ]
              }
            }
          }
        }
      ],
      "rank_window_size": 500,
      "rank_constant": 20
    }
  },
  "size": 200
}</code></pre><p>The corresponding baseline run is denoted <strong>RRF(vector_oq, bm25_oq)</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d3359c99afbf0649b9a610689198ff21e2cf0ac3-1398x504.png" style="width: 100%;" /><p>Replacing the original query with LLM-output terms to get lexical search scores deteriorates recall on average in all cases. In terms of relevance, we observe marginal improvement in BEIR and a notable increase of ~8 points of NDCG@10 in MIRACL. However, relevance in MLDR is so negatively affected that the baseline runs score higher overall on average.</p><p>So far, our exploration has not revealed a definitive strategy for enhancing vector search performance through the query rewriting methods presented. Considering the learnings from the exploration of lexical search, we experiment with hybrid search strategies that include both the original query and the query rewriting terms. We focus on a first-stage retrieval scenario and aim to improve Recall@50. In the following experiment, we try hybrid search with RRF, where the lexical scores are computed as an optimized linear combination of BM25 scores of the original query and BM25 scores of LLM-output terms. To establish an upper bound for potential performance improvements, we perform Bayesian optimization on the set of queries using the Optuna library [14].</p><p>We denote this experiment <strong>RRF(vector_oq, LINEAR RECALL50 OPTIMIZED(bm25_oq, bm25_qr))</strong>. The same baseline as run in the previous table, <strong>RRF(vector_oq, bm25_oq)</strong>, still provides a useful comparison in this experiment.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41e0eb5cd708c3705f5dc5ac0f7c3c01074556a2-1394x614.png" style="width: 100%;" /><p>This experiment was not designed for relevance optimization; however, the resulting NDCG@10 scores are documented for completeness.
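<p>Reciprocal rank fusion itself, used throughout these RRF experiments, is simple to compute: each retriever contributes 1 / (rank_constant + rank) for every document it returns, and documents are re-sorted by the summed score. A minimal sketch, with toy rankings and the rank constant of 20 used in the DSL snippet:</p>

```python
# Reciprocal rank fusion over any number of rankings (lists of doc ids,
# best first). Each ranking contributes 1 / (rank_constant + rank).
def rrf(rankings, rank_constant=20):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Re-sort all seen documents by their fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["d1", "d2", "d3"]   # e.g. KNN over the original query
lexical_ranking = ["d2", "d3", "d4"]  # e.g. BM25 over the QR terms
print(rrf([vector_ranking, lexical_ranking]))  # → ['d2', 'd3', 'd1', 'd4']
```

<p>Because RRF uses only ranks, not raw scores, it needs no score normalization, but it also exposes no weight to optimize, which is why the linear-combination experiments above tune a weight instead.</p>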
An increase of 1–3 percentage points in recall@50 was achieved with this method, with the prompting strategy that generates pseudo-answers being the most consistently beneficial across the benchmarks. This strategy is suitable only for informational queries based on general knowledge or when the LLM possesses the necessary domain expertise. This method was employed to determine the upper limits of potential performance improvements. It&#x27;s important to note that optimizing the weights using the complete test dataset, as was done, is not feasible in real-world applications.</p><p>PA turns out to be the most successful strategy for BEIR and MIRACL, while LKE gives the highest boost in recall for MLDR.</p><h4>First-stage retriever and reranking</h4><p>To maximize performance in a production setting, query rewriting could be viewed as part of a multistage pipeline. The goal of the first-stage retriever is not to be good at relevance but rather at recall, that is, to ensure the good documents make it into the candidate set for the reranker.</p><p>We implemented the following pipeline configuration:</p><ol><li><strong>Base retrieval:</strong> Retrieve top 200 documents.</li><li><strong>Entity boosting:</strong> Rescore based on LLM-extracted entities (from prompt 2).</li><li><strong>Pruning:</strong> Cut to the top 50 documents.</li><li><strong>Reranking:</strong> Apply <code>jina-reranker-v2</code> to the top 50 documents.</li></ol><p>We compared the final outcomes of this pipeline using different first-stage inputs including base retrievers and base retrievers plus QR.
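<p>The four-stage pipeline above can be sketched as a simple composition of callables. The retrieval, boosting, and reranking functions here are hypothetical stand-ins for the real Elasticsearch queries and the <code>jina-reranker-v2</code> model; only the data flow (200 → rescore → 50 → rerank) comes from the text.</p>

```python
# Sketch of the multistage pipeline: the three callables are placeholders
# for real Elasticsearch calls and a reranker model.
def multistage_search(query, retrieve, entity_boost, rerank):
    candidates = retrieve(query, size=200)        # 1. base retrieval: top 200
    candidates = entity_boost(query, candidates)  # 2. rescore with LLM entities
    candidates = candidates[:50]                  # 3. prune to top 50
    return rerank(query, candidates)              # 4. rerank the candidate set

# Toy stand-ins to show the data flow:
docs = [f"doc{i}" for i in range(300)]
result = multistage_search(
    "example query",
    retrieve=lambda q, size: docs[:size],
    entity_boost=lambda q, cands: sorted(cands),     # placeholder rescoring
    rerank=lambda q, cands: list(reversed(cands)),   # placeholder reranker
)
print(len(result))  # → 50
```

<p>Keeping each stage behind a narrow interface like this is what makes it cheap to swap first-stage inputs (lexical, vector, hybrid, with or without QR) while holding the rest of the pipeline fixed, as done in the comparison below.</p>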
For QR and hybrid baseline, we used the experiments that obtained the highest recall.</p><ul><li>Lexical corresponds to the original query using BM25.</li><li>Lexical + QR corresponds to the experiment using prompt MC.</li><li>Vector corresponds to the original query using vector search.</li><li>Vector + QR corresponds to the experiment LINEAR RECALL50 OPTIMIZED (vector_oq, bm25_qr) for prompt PA.</li><li>Hybrid corresponds to RRF (vector_oq, bm25_oq).</li><li>Hybrid + QR corresponds to the experiment RRF (vector_oq, LINEAR RECALL50 OPTIMIZED (bm25_oq, bm25_qr)) for prompt PA.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8e5bd75acf283e162d90a4778d6e0c2d0def4ec8-1210x434.png" style="width: 100%;" /><p>The last row shows the averaged results across BEIR, MLDR, and MIRACL.</p><p>QR in lexical and vector is applied by combining the scores as shown in equations (1, 2) and optimized for recall. RRF is widely considered a strong standard for maximizing recall in hybrid search. Our results show that an optimized linear combination of the original vector and QR can actually achieve better recall than the RRF configurations, which cannot be optimized. This suggests that, with the right weighting, a linear combination can be more effective at using LLM-generated queries for recall than rank-based fusion, since optimizing RRF is less effective.</p><p>The table below shows the averages across datasets for NDCG@10 before and after reranking and recall@50:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8a0b9b1fe65c767199d212de8084f900f00d99c5-1440x326.png" style="width: 100%;" /><p>Relevance (NDCG@10) improves after reranking is applied, rising consistently alongside recall@50.</p><h4>Strategy domain adaptation</h4><p>Unlike open web search, enterprise domains (such as legal, medical, or internal wikis) have specific vocabularies that generic LLMs might miss.
Below we discuss several strategies that could be used to tailor the presented QR strategies for specific domains:</p><ul><li>Domain-specific tuning: Further tune the query rewriting instructions (prompts) specifically for niche domains or specific use cases.</li><li>In-context learning: Use few-shot examples retrieved from a knowledge base (for example, appending the top <em>k</em> results of a quick, cheap initial BM25 search to the prompt) to ground the rewriting process.</li><li>LLMs + rules hybrid approach: Combine the flexibility of LLMs with deterministic rules for specific domain terms.</li><li>Gated query rewriting: Selectively apply QR only when necessary, employing rules, custom classifiers, or specialized prompts and models to detect whether the query requires optimization for a specific use case.</li><li>Query rewriting for generation: Expand the query or context not just for retrieval but also specifically to improve the quality of the final LLM response generation.</li></ul><h3>Conclusions</h3><p>The investigation shows how simple LLM-driven query optimization can have a positive impact within the modern search ecosystem.</p><h4>Key takeaways</h4><ul><li>LLMs are a good complement to improve lexical search: Using LLMs to enrich keywords or generate pseudo-answers provides consistent improvements in both relevance and recall for standard lexical retrieval.</li><li>Hybrid search is harder to beat: When using dense vector search or hybrid retrieval, simple query rewriting terms offer marginal gains. The best results come from using QR to <em>boost</em> existing hybrid scores rather than replacing them.</li><li>Pseudo-answers improve recall: Generating hypothetical answers (pseudo-answer generation) proved to be the most effective strategy for maximizing recall in multistage pipelines.</li><li>Structured guidance over free-form generation: Guiding the LLM is critical.
Rather than allowing the LLM to freely rephrase a query, providing a strict template (like extracting specific entities to fit a DSL clause) ensures that the output adds value without introducing noise. A specific prompt + DSL template combination allows designing for a specific relevance use case (such as lexical extraction versus semantic expansion) and reduces the scope of error.</li><li>Efficiency with small models: The strategies explored here are simple enough to be deployed effectively using SLMs or distilled into compact models, offering a cost-effective solution.</li></ul><p>The following table contains some practical guidelines on how to incorporate the most successful query rewriting techniques into your search pipeline, depending on your particular setting:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Real-world setting

</td><td style="border: 1px solid #ccc; padding: 4px;">QR strategy </td><td style="border: 1px solid #ccc; padding: 4px;">Implementation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Lexical search </td><td style="border: 1px solid #ccc; padding: 4px;">Lexical keyword enrichment (prompt 2)</td><td style="border: 1px solid #ccc; padding: 4px;">Boost search performance without the cost of migrating to vector search:
Use the LLM's output (keywords, entities, synonyms) in an Elasticsearch `should` clause to boost documents that match the expanded terms, while keeping the original query in a `must` clause. Avoid relying solely on QR-generated terms, and use SLMs to reduce costs without compromising QR performance.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Multistage retrieval with reranking</td><td style="border: 1px solid #ccc; padding: 4px;">Pseudo-answer generation (prompt 4)</td><td style="border: 1px solid #ccc; padding: 4px;">In a multistage search system (retriever followed by a reranker), use the pseudo-answers as boosting terms. Use the Elasticsearch `should` + `must` clause structure as before to retrieve the widest set of potentially relevant documents before passing them to the reranker to score.</td>
                    </tr>
              </tbody>
            </table>
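<p>The must/should pattern recommended in both rows of the table can be sketched as a small query builder. The field name <code>text</code> and the example terms are illustrative assumptions; the returned dict is what you would pass as the <code>query</code> body of an Elasticsearch search request.</p>

```python
# Sketch of the recommended query shape: the original query is required
# (`must`), and each LLM-generated rewriting term only boosts the score
# of matching documents (`should`). Field name and terms are illustrative.
def build_qr_query(original_query, qr_terms, field="text"):
    return {
        "bool": {
            "must": {"match": {field: original_query}},
            "should": [{"match": {field: term}} for term in qr_terms],
        }
    }

query = build_qr_query(
    "effects of vitamin D deficiency",
    ["low vitamin D symptoms", "calcium absorption", "bone density loss"],
)
print(query["bool"]["must"])         # the original query is always required
print(len(query["bool"]["should"]))  # → 3, one boosting clause per QR term
```

<p>Because the QR terms sit only in <code>should</code> clauses, a poor LLM rewrite can never exclude documents that match the original query; at worst it reorders them.</p>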
          <p>To reiterate our earlier comments, these solutions will benefit scenarios where most queries are retrieval queries, that is, scenarios where relevance does not depend on specific filtering, aggregations, or other types of structure. However, the same template meta-strategy can potentially be adapted to such cases.</p><h4>The value of task‑focused tuning in search pipeline design</h4><p>One of the broader implications of this investigation is the importance of viewing search pipeline architectures as a set of modular, well‑defined stages where lightweight, task‑focused adjustments can meaningfully improve performance, allowing pipeline components to be tuned for specific retrieval goals. Such tuning could involve a variety of strategies: experimenting with how LLMs are prompted to target particular gains (such as maximizing recall versus precision); parametrizing how LLM output is combined with the original query (for example, via a Query DSL template); or evaluating the impact of different rescoring strategies (such as MMR or <code>match_phrase</code>-based query rescoring) on an initial candidate set (such as the top 200 retrieved documents). These techniques can be layered before a more computationally intensive reranking step. Overall, this perspective encourages designing pipelines with clear component boundaries and a small, controllable set of hyperparameters that can be tuned to achieve targeted retrieval outcomes. Furthermore, although our experiments demonstrated measurable gains in a general‑purpose IR setting, we expect these interventions to be even more impactful in scenarios where relevance is narrowly defined, allowing the template‑based approach to improve results in a more controlled way.</p><h4>LLM-driven query optimization in modern search pipelines</h4><p>Simple query rewriting strategies can be well-suited, easy-to-plug-in solutions for targeted performance gains.
In environments where LLMs are already in use (for example, RAG, conversational interfaces, or agentic search workflows), the overhead of an extra LLM call for rewriting is absorbed, making latency less of an issue. This allows for significant and targeted improvements in relevance and recall across specific domains or challenging query types.</p><p>All the strategies discussed in this blog consist of a combination of an LLM prompt and an Elasticsearch Query DSL template, and hence they can be naturally integrated into the application layer of a search solution.</p><p>Finally, Elasticsearch has already begun integrating LLM-powered capabilities directly into its search experience, offering tools like <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a> <code>COMPLETION</code>, managed LLMs through the <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>, and lately, the possibility to build a custom query rewriting tool within <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>.</p><p>A detailed table of the results presented can be found <a href="https://docs.google.com/spreadsheets/d/1kd7ToPZFwFjow3OIIwQv_-G34SDZzwYYOgTFJT76vgY/edit?gid=521501384#gid=521501384">here</a>.</p><h2>References</h2><ol><li>Xiong, H., Bian, J., Li, Y., Li, X., Du, M., Wang, S., Yin, D., &amp; Helal, S. (2024). When search engine services meet large language models: Visions and challenges. <a href="https://doi.org/10.48550/arXiv.2407.00128">arXiv</a>.</li><li>Remmey, M. (2024, May 14). <em>NL to SQL architecture alternatives</em>. Azure Architecture Blog. <a href="https://techcommunity.microsoft.com/blog/azurearchitectureblog/nl-to-sql-architecture-alternatives/4136387">https://techcommunity.microsoft.com/blog/azurearchitectureblog/nl-to-sql-architecture-alternatives/4136387</a></li><li>Abdallah, A., Piryani, B., Mozafari, J., Ali, M., &amp; Jatowt, A. 
(2025, August 22). <em>How good are LLM-based rerankers? An empirical analysis of state-of-the-art reranking models</em>. arXiv. <a href="https://arxiv.org/abs/2508.16757">arxiv</a></li><li>Joshi, A., Shi, Z., Goindani, A., &amp; Liu, H. (2025, October 22). <em>The case against LLMs as rerankers</em>. Voyage AI. <a href="https://blog.voyageai.com/2025/10/22/the-case-against-llms-as-rerankers/">https://blog.voyageai.com/2025/10/22/the-case-against-llms-as-rerankers/</a></li><li>Oosterhuis, H., Jagerman, R., Qin, Z., &amp; Wang, X. (2025, July). <em>Optimizing compound retrieval systems</em>. In <em>Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25)</em> (pp. 1–11). ACM. <a href="https://doi.org/10.1145/3726302.3730051">https://doi.org/10.1145/3726302.3730051</a></li><li>Zhang Y, Li M, Long D, Zhang X, Lin H, Yang B, Xie P, Yang A, Liu D, Lin J, Huang F, Zhou J. <em>Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.</em> arXiv preprint arXiv:2506.05176. 2025.<a href="https://arxiv.org/abs/2506.05176"> arXiv</a></li><li>Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. <em>Improving Text Embeddings with Large Language Models.</em> arXiv preprint arXiv:2401.00368. 2024.<a href="https://arxiv.org/pdf/2401.00368"> arXiv</a></li><li>Lee J, Dai Z, Ren X, Chen B, Cer D, Cole JR, et al. <em>Gecko: Versatile Text Embeddings Distilled from Large Language Models.</em> arXiv preprint arXiv:2403.20327. 2024.<a href="https://arxiv.org/abs/2403.20327"> arxiv</a></li><li>Li, Zhicong; Wang, Jiahao; Jiang, Zhishu; Mao, Hangyu; Chen, Zhongxia; Du, Jiazhen; Zhang, Yuanxing; Zhang, Fuzheng; Zhang, Di; Liu, Yong (2024). <em>DMQR-RAG: Diverse Multi-Query Rewriting for RAG</em>. arXiv preprint arXiv:2411.13154. DOI:10.48550/arXiv.2411.13154. (<a href="https://arxiv.org/abs/2411.13154">arxiv</a>)</li><li>Van Gysel C, de Rijke M. 
<em>pytrec_eval: An extremely fast Python interface to trec_eval</em>. In: Proceedings of the 2018 SIGIR Workshop on Reproducibility in Information Retrieval; 2018.</li><li>Thakur N, Reimers N, Rücklé A, Srivastava A, Gurevych I. <em>BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models</em>. arXiv [cs.IR]. 2021;arXiv:2104.08663. (<a href="https://arxiv.org/abs/2104.08663">arxiv</a>)</li><li>Chen J, Xiao S, Zhang P, Luo K, Lian D, Liu Z. <em>BGE M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation</em>. arXiv [cs.CL]. 2024;arXiv:2402.03216. (<a href="https://arxiv.org/abs/2402.03216">arxiv</a>)</li><li>Zhang X, Thakur N, Ogundepo O, Kamalloo E, Alfonso-Hermelo D, Li X, Liu Q, Rezagholizadeh M, Lin J. <em>MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages</em>. Trans Assoc Comput Linguistics. 2023;11:1114-1131. (<a href="https://aclanthology.org/2023.tacl-1.63/">aclanthology.org</a>)</li><li>Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. <em>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining (KDD ’19).</em> 2019:2623-2631. doi:10.1145/3292500.3330701</li><li>Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. <em>Multilingual E5 Text Embeddings: A Technical Report.</em> arXiv preprint arXiv:2402.05672. Published 2024. Accessed November 18, 2025. <a href="https://arxiv.org/abs/2402.05672">arxiv</a></li><li>Zhang Y, Li M, Long D, Zhang X, Lin H, Yang B, Xie P, Yang A, Liu D, Lin J, Huang F, Zhou J. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. <em>arXiv preprint</em> arXiv:2506.05176. Published 2025. Accessed November 18, 2025. <a href="https://arxiv.org/abs/2506.05176">arXiv</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building human-in-the-loop (HITL) AI agents with LangGraph and Elasticsearch]]></title>
            <description><![CDATA[ Learn what human-in-the-loop (HITL) is and how to build an HITL system with LangGraph and Elasticsearch for a flight system.]]></description>
            <link>https://www.elastic.co/search-labs/blog/human-in-the-loop-hitllanggraph-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/human-in-the-loop-hitllanggraph-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/17a2863e110d7415959184a8375d70676695f35c-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll explore how to combine <a href="https://langchain-ai.github.io/langgraph/">LangGraph</a> and Elasticsearch to build a human-in-the-loop (HITL) system. This approach allows AI systems to involve users directly in the decision-making process, making interactions more reliable and context-aware. We’ll implement a practical example using a context-driven scenario to demonstrate how LangGraph workflows can integrate with Elasticsearch to retrieve data, handle user input, and produce refined results.</p><h2>Requirements</h2><ul><li>NodeJS version 18 or newer</li><li>OpenAI API Key</li><li>Elasticsearch 8.x+ deployment</li></ul><h2>Why use LangGraph for production HITL systems</h2><p>In a <a href="https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch">previous article</a>, we presented LangGraph and its benefits for building a retrieval-augmented generation (RAG) system using large language models (LLMs) and <a href="https://docs.langchain.com/oss/javascript/langgraph/graph-api#conditional-edges">conditional edges</a> to automatically make decisions and display results. Sometimes we don’t want the system to act autonomously end to end, but we want users to select options and make decisions within the execution loop. This concept is called <strong><em>human in the loop</em></strong>.</p><h2>Human in the loop</h2><p>HITL is an AI concept that allows a real person to interact with AI systems to provide more context, evaluate responses, edit responses, ask for more information, and perform other tasks. This is very useful in low-error-tolerance scenarios, such as compliance, decision-making, and content generation, helping improve the reliability of LLM outputs.</p><p>It&#x27;s important to note that the primary purpose of HITL in agentic systems is validation, not blind trust in the agent&#x27;s approach. 
HITL interventions should be reactive and triggered only when the system detects missing or ambiguous information. This ensures human involvement remains meaningful and adds value, rather than becoming a mandatory checkpoint that interrupts every workflow unnecessarily.</p><p>A common example is when your coding assistant asks you for permission to execute a certain command on the terminal or shows you the step-by-step thinking process for you to approve before starting coding.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4df9fe7383f0dc665a1b0866651f4f6a40a14aa8-499x215.png" style="width: 100%;" /><h2>Elasticsearch and LangGraph: How they interact</h2><p><a href="https://www.langchain.com/">LangChain</a> allows us to use Elasticsearch as a <a href="https://js.langchain.com/docs/integrations/vectorstores/elasticsearch/">vector store</a> and to perform queries within LangGraph applications, which is useful to execute full-text or semantic searches, while LangGraph is used to define the specific workflow, tools, and interactions. It also adds HITL as an additional interaction layer with the user.</p><h2>Practical implementation: Human in the loop</h2><p>Let’s imagine a case where a lawyer has a question about a case he recently took on. Without the right tools, he would need to manually search through legal articles and precedents, read them in full, and then interpret how they apply to his situation. With LangGraph and Elasticsearch, however, we can build a system that searches a database of legal precedents and generates a case analysis that incorporates the specific details and context provided by the lawyer. 
You can find the full implementation of this use case in the following <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch">repository</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e95bf4ce115cbe9954ee5033776408bed485b8d2-1600x906.png" style="width: 100%;" /><p>The workflow begins when the lawyer submits a legal question. The system performs a vector search in Elasticsearch, retrieves the most relevant precedents, and presents them for the lawyer to choose from, using natural language. After the selection, the LLM generates a draft analysis and checks whether the information is complete. At this point, the workflow can follow two paths: If everything is clear, it proceeds directly to generate a final analysis; if not, it pauses to request clarification from the lawyer. Once the missing context is provided, the system completes the analysis and returns it, taking into consideration the clarifications.</p><p>The following is a graph drawn by LangGraph showing how the app will look at the end of the development. Each node represents a tool or functionality:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43b274d52f3644e3949bea3c8e4312c4c94384c2-307x1000.png" style="width: 100%;" /><h3>Dataset</h3><p>Here’s the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataset.json">dataset</a> that will be used for this example. This dataset contains a collection of legal precedents, each describing a case involving service delays, the court’s reasoning, and the final outcome.</p><pre><code>[
  {
    "pageContent": "Legal precedent: Case B - Service delay not considered breach. A consulting contract used term 'timely delivery' without specific dates. A three-week delay occurred but contract lacked explicit schedule. Court ruled no breach as parties had not defined concrete timeline and delay did not cause demonstrable harm.",
    "metadata": {
      "caseId": "CASE-B-2022",
      "contractType": "consulting agreement",
      "delayPeriod": "three weeks",
      "outcome": "no breach found",
      "reasoning": "no explicit deadline defined, no demonstrable harm",
      "keyTerms": "timely delivery, open terms, schedule definition",
      "title": "Case B: Delay Without Explicit Schedule"
    }
  },
  ...
]</code></pre><h3>Ingestion and index setup</h3><p>The index setup and data ingestion logic are defined in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file, where we declare functions to handle the index creation. This setup is compatible with the LangChain <a href="https://python.langchain.com/docs/integrations/vectorstores/elasticsearch/">vector store interface for Elasticsearch</a>.</p><p><em>Note: The mapping setup is also included in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file.</em></p><h3>Install packages and set up environment variables</h3><p>Let&#x27;s initialize a <a href="https://nodejs.org/en">Node.js</a> project with default settings:</p><pre><code>npm init -y</code></pre><p>Now let&#x27;s install all required dependencies:</p><ul><li><strong>@elastic/elasticsearch: </strong>Elasticsearch client for Node.js. 
Used to connect, create indices, and run queries.</li><li><strong>@langchain/community:</strong> Provides integrations for community-supported tools, including the ElasticVectorSearch store.</li><li><strong>@langchain/core:</strong> Core building blocks of LangChain, such as chains, prompts, and utilities.</li><li><strong>@langchain/langgraph:</strong> Adds graph-based orchestration, allowing workflows with nodes, edges, and state management.</li><li><strong>@langchain/openai: </strong>Provides access to OpenAI models (LLMs and embeddings) through LangChain.</li><li><strong>dotenv:</strong> Loads environment variables from a .env file into process.env.</li><li><strong>tsx: </strong>A tool for running TypeScript files directly, without a separate compilation step.</li></ul><p>Run the following command in the console to install all of them:</p><pre><code>npm install @elastic/elasticsearch @langchain/community @langchain/core @langchain/langgraph @langchain/openai dotenv --legacy-peer-deps && npm install --save-dev tsx</code></pre><p>Create a .env file to set up the environment variables:</p><pre><code>ELASTICSEARCH_ENDPOINT=
ELASTICSEARCH_API_KEY=
OPENAI_API_KEY=</code></pre><p>We’ll use <a href="https://www.typescriptlang.org/">TypeScript</a> to write the code because it provides a layer of type safety and a better developer experience. Create a TypeScript file named <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a>, and insert the next section&#x27;s code.</p><h3>Package imports</h3><p>In the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a> file, we start by importing the required modules and initializing the environment variable configuration. This includes the core LangGraph components, the OpenAI model integrations, and the Elasticsearch client.</p><p>We also import the following from the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file:</p><ul><li><strong><code>ingestData</code>: </strong>A function that creates the index and ingests the data.</li><li><strong>Document and DocumentMetadata:</strong> Interfaces that define the dataset document structure.</li></ul><h3>Elasticsearch vector store client, embeddings client, and OpenAI client</h3><p>This code will initialize the vector store, the embeddings client, and one OpenAI client:</p><pre><code>const VECTOR_INDEX = "legal-precedents";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});

const esClient = new Client({
  node: process.env.ELASTICSEARCH_ENDPOINT,
  auth: {
    apiKey: process.env.ELASTICSEARCH_API_KEY ?? "",
  },
});
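
// Optional sanity check (editorial addition, not part of the original example):
// client.info() is a standard Elasticsearch client call and fails fast if the
// endpoint or API key is wrong.
// const info = await esClient.info();
// console.log(`Connected to cluster: ${info.cluster_name}`);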

const vectorStore = new ElasticVectorSearch(embeddings, {
  client: esClient,
  indexName: VECTOR_INDEX,
});</code></pre><p>The application’s workflow state schema enables communication between nodes:</p><pre><code>const LegalResearchState = Annotation.Root({
  query: Annotation<string>(),
  precedents: Annotation<Document[]>(),
  userChoice: Annotation<string>(),
  selectedPrecedent: Annotation<Document | null>(),
  validation: Annotation<string>(),
  draftAnalysis: Annotation<string>(),
  ambiguityDetected: Annotation<boolean>(),
  userClarification: Annotation<string>(),
  finalAnalysis: Annotation<string>(),
});</code></pre><p>In the state object, we’ll pass the following through the nodes: the user’s query, the legal precedents retrieved, the user’s selection and its validation, and any ambiguity detected. The state also tracks the precedent selected by the user, the draft analysis generated along the way, and the final analysis once all clarifications are completed.</p><h3>Nodes</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f16148cd439491f8e6fd929d66c38fc1066a4693-221x159.png" style="width: 100%;" /><p><strong><code>searchPrecedents</code>: </strong>This node performs a similarity search in the Elasticsearch vector store based on the user’s input. It retrieves up to five matching documents and prints them so they can be reviewed by the user:</p><pre><code>async function searchPrecedents(state: typeof LegalResearchState.State) {
  console.log(
    "📚 Searching for relevant legal precedents with query:\n",
    state.query
  );

  const results = await vectorStore.similaritySearch(state.query, 5);
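  // If you also want relevance scores (an optional variant, not used in this
  // example), the LangChain vector store exposes similaritySearchWithScore,
  // which returns [Document, score] pairs:
  // const scored = await vectorStore.similaritySearchWithScore(state.query, 5);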
  const precedents = results.map((d) => d as Document);

  console.log(`Found ${precedents.length} relevant precedents:\n`);

  for (let i = 0; i < precedents.length; i++) {
    const p = precedents[i];
    const m = p.metadata;
    console.log(
      `${i + 1}. ${m.title} (${m.caseId})\n` +
        `   Type: ${m.contractType}\n` +
        `   Outcome: ${m.outcome}\n` +
        `   Key reasoning: ${m.reasoning}\n` +
        `   Delay period: ${m.delayPeriod}\n`
    );
  }

  return { precedents };
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/72f7b09029662ae3935e17fde8f932e885c6efdd-273x203.png" style="width: 100%;" /><p><strong><code>precedentSelection</code>:</strong> This node allows the user to select, using natural language, the case retrieved by the similarity search that best matches the question. At this point, the application <strong>interrupts the workflow</strong> and waits for user input:</p><pre><code>function precedentSelection(state: typeof LegalResearchState.State) {
  console.log("\n⚖️  HITL #1: Human input needed\n");
  const result = interrupt({
    question: "👨‍⚖️  Which precedent is most similar to your case? ",
  });

  return { userChoice: result as string };
}</code></pre><p><strong><code>selectPrecedent</code>:</strong> This node sends the user input, along with the retrieved documents, to be interpreted so that one of them can be selected. The LLM performs this task by returning a number that represents the document it infers from the user’s natural-language input:</p><pre><code>async function selectPrecedent(state: typeof LegalResearchState.State) {
  const precedents = state.precedents || [];
  const userInput = state.userChoice || "";

  const precedentsList = precedents
    .map((p, i) => {
      const m = p.metadata;
      return `${i + 1}. ${m.caseId}: ${m.title} - ${m.outcome}`;
    })
    .join("\n");

  const structuredLlm = llm.withStructuredOutput({
    name: "precedent_selection",
    schema: {
      type: "object",
      properties: {
        selected_number: {
          type: "number",
          description:
            "The precedent number selected by the lawyer (1-based index)",
          minimum: 1,
          maximum: precedents.length,
        },
      },
      required: ["selected_number"],
    },
  });
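
  // withStructuredOutput() binds the JSON schema above as the model's response
  // format, so invoke() resolves to a parsed object matching the schema, e.g.:
  // const { selected_number } = await structuredLlm.invoke([...]);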

  const prompt = `
    The lawyer said: "${userInput}"

    Available precedents:
    ${precedentsList}

    Which precedent number (1-${precedents.length}) matches their selection?
  `;

  const response = await structuredLlm.invoke([
    {
      role: "system",
      content:
        "You are an assistant that interprets lawyer's selection and returns the corresponding precedent number.",
    },
    { role: "user", content: prompt },
  ]);

  const selectedIndex = response.selected_number - 1;
  const selectedPrecedent = precedents[selectedIndex] || precedents[0];

  console.log(`✅ Selected: ${selectedPrecedent.metadata.title}\n`);
  return { selectedPrecedent };
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1864ff80880ba7d394c25880e40d7c7f2b4a4765-305x402.png" style="width: 100%;" /><p><strong><code>validatePrecedentSelection</code>: </strong>This node introduces an explicit validation step (yes/no) to confirm that the precedent inferred by the system is indeed the one the lawyer intends to use. This step acts as a safeguard against misinterpretation of the user’s intent, ensuring that downstream legal analysis is based on the correct case:</p><pre><code>function validatePrecedentSelection(state: typeof LegalResearchState.State) {
  const precedent = state.selectedPrecedent;
  if (!precedent) return {};

  const m = precedent.metadata;
  console.log("\n⚖️  HITL #1.5: Validation needed\n");
  console.log(
    `Selected precedent: ${m.title} (${m.caseId})\n` +
      `Type: ${m.contractType}\n` +
      `Outcome: ${m.outcome}\n`
  );

  const result = interrupt({
    question: "👨‍⚖️  Is this the correct precedent? (yes/no): ",
  });
  const validation =
    typeof result === "string" ? result : (result as any)?.value || "";

  return { validation };
}</code></pre><p><strong><code>processValidation</code>: </strong>This node processes the lawyer’s confirmation response (yes/no) from the previous HITL step and determines the next path in the workflow. It interprets the user input as an explicit approval or rejection of the selected precedent.</p><p>If the precedent isn’t confirmed, the node clears the current selection and redirects the graph back to the <code>precedentSelection</code> phase, allowing the user to choose again. If the precedent is confirmed, the workflow continues without modification:</p><pre><code>function processValidation(state: typeof LegalResearchState.State) {
  const userInput = (state.validation || "").toLowerCase().trim();
  const isValid = userInput === "yes" || userInput === "y";
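  // Editorial note: only an exact "yes"/"y" passes the check above. A looser
  // (hypothetical) matcher would also accept variants like "Yes.", "yeah",
  // or "yep":
  // const isValid = /^y(es|eah|ep)?[.!]?$/.test(userInput);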

  if (!isValid) {
    console.log("❌ Precedent not confirmed. Returning to selection...\n");
    return { selectedPrecedent: null, userChoice: "" };
  }

  console.log("✅ Precedent confirmed.\n");
  return {};
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1c3b726d21b04cc916213da81607389a3b0304fc-305x127.png" style="width: 100%;" /><p><strong><code>createDraft</code>: </strong>This node generates the initial legal analysis based on the precedent selected by the user. It uses an LLM to evaluate how the chosen precedent applies to the lawyer’s question and determines whether the system has enough information to proceed.</p><p>If the precedent can be applied directly, the node produces a draft analysis and, taking the right path, jumps to the final node. If the LLM detects ambiguities, such as undefined contract terms, missing timeline details, or unclear conditions, it returns a flag indicating that clarification is required, along with a list of the specific pieces of information that must be provided. In that case, the ambiguity triggers the left path of the graph:</p><pre><code>async function createDraft(state: typeof LegalResearchState.State) {
  console.log("📝 Drafting initial legal analysis...\n");

  const precedent = state.selectedPrecedent;
  if (!precedent) return { draftAnalysis: "" };

  const m = precedent.metadata;

  const structuredLlm = llm.withStructuredOutput({
    name: "draft_analysis",
    schema: {
      type: "object",
      properties: {
        needs_clarification: {
          type: "boolean",
          description:
            "Whether the analysis requires clarification about contract terms or context",
        },
        analysis_text: {
          type: "string",
          description: "The draft legal analysis or the ambiguity explanation",
        },
        missing_information: {
          type: "array",
          items: { type: "string" },
          description:
            "List of specific information needed if clarification is required (empty if no clarification needed)",
        },
      },
      required: ["needs_clarification", "analysis_text", "missing_information"],
    },
  });

  const prompt = `
    Based on this precedent:
    Case: ${m.title}
    Outcome: ${m.outcome}
    Reasoning: ${m.reasoning}
    Key terms: ${m.keyTerms}

    And the lawyer's question: "${state.query}"

    Draft a legal analysis applying this precedent to the question.
    
    If you need more context about the specific contract terms, timeline details, 
    or other critical information to provide accurate analysis, set needs_clarification 
    to true and list what information is missing.
    
    Otherwise, provide the legal analysis directly.
  `;

  const response = await structuredLlm.invoke([
    {
      role: "system",
      content:
        "You are a legal research assistant that analyzes cases and identifies when additional context is needed.",
    },
    { role: "user", content: prompt },
  ]);

  let displayText: string;
  if (response.needs_clarification) {
    const missingInfoList = response.missing_information
      .map((info: string, i: number) => `${i + 1}. ${info}`)
      .join("\n");
    displayText = `AMBIGUITY DETECTED:\n${response.analysis_text}\n\nMissing information:\n${missingInfoList}`;
  } else {
    displayText = `ANALYSIS:\n${response.analysis_text}`;
  }

  console.log(displayText + "\n");

  return {
    draftAnalysis: displayText,
    ambiguityDetected: response.needs_clarification,
  };
}</code></pre><p>The two paths that the graph can take look like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/653083d5a2885b15c250288b353f65a993943c0d-305x379.png" style="width: 100%;" /><p>The left path includes an additional node that handles the clarification.</p><p><strong><code>requestClarification</code>:</strong> This node triggers the second HITL step when the system identifies that the draft analysis lacks essential context. The workflow is interrupted, and the user is asked to clarify the missing contract details detected by the previous node:</p><pre><code>function requestClarification(state: typeof LegalResearchState.State) {
  console.log("\n⚖️  HITL #2: Additional context needed\n");
  const userClarification = interrupt({
    question: "👨‍⚖️  Please provide clarification about your contract terms:",
  });
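  // interrupt() pauses the graph at this point; once the app is resumed with
  // new Command({ resume: value }), interrupt() returns that value, which
  // then flows into the state as userClarification.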
  return { userClarification };
}</code></pre><p>This intervention exemplifies the validation-driven nature of HITL: The system only pauses to request human input when it has identified specific gaps that could compromise the quality of the analysis. As an improvement, in production systems, this node could be enhanced with validation logic to verify that all required information points are addressed before proceeding, ensuring the analysis is built on complete and accurate context.</p><p><strong><code>generateFinalAnalysis</code>: </strong>This node produces the final legal analysis by combining the selected precedent with the additional context provided by the user, if needed. Using the clarification gathered in the previous HITL step, the LLM synthesizes the precedent’s reasoning, the contract details supplied by the user, and the conditions that determine whether a breach may have occurred.</p><p>The node outputs a complete analysis that integrates legal interpretation and practical recommendations:</p><pre><code>async function generateFinalAnalysis(state: typeof LegalResearchState.State) {
  console.log("📋 Generating final legal analysis...\n");

  const precedent = state.selectedPrecedent;
  if (!precedent) return { finalAnalysis: "" };

  const m = precedent.metadata;

  const prompt = `
    Original question: "${state.query}"
    
    Selected precedent: ${m.title}
    Outcome: ${m.outcome}
    Reasoning: ${m.reasoning}
    
    Lawyer's clarification: "${state.userClarification}"
    
    Provide a comprehensive legal analysis integrating:
    1. The selected precedent's reasoning
    2. The lawyer's specific contract context
    3. Conditions for breach vs. no breach
    4. Practical recommendations
  `;

  const response = await llm.invoke([
    {
      role: "system",
      content:
        "You are a legal research assistant providing comprehensive analysis.",
    },
    { role: "user", content: prompt },
  ]);

  const finalAnalysis = response.content as string;

  console.log(
    "\n" +
      "=".repeat(80) +
      "\n" +
      "⚖️  FINAL LEGAL ANALYSIS\n" +
      "=".repeat(80) +
      "\n\n" +
      finalAnalysis +
      "\n\n" +
      "=".repeat(80) +
      "\n"
  );

  return { finalAnalysis };
}</code></pre><p>Building the graph:</p><pre><code>const workflow = new StateGraph(LegalResearchState)
  .addNode("searchPrecedents", searchPrecedents)
  .addNode("precedentSelection", precedentSelection)
  .addNode("selectPrecedent", selectPrecedent)
  .addNode("validatePrecedentSelection", validatePrecedentSelection)
  .addNode("processValidation", processValidation)
  .addNode("createDraft", createDraft)
  .addNode("requestClarification", requestClarification)
  .addNode("generateFinalAnalysis", generateFinalAnalysis)
  .addEdge("__start__", "searchPrecedents")
  .addEdge("searchPrecedents", "precedentSelection") // HITL #1
  .addEdge("precedentSelection", "selectPrecedent")
  .addEdge("selectPrecedent", "validatePrecedentSelection") // Selection validation
  .addEdge("validatePrecedentSelection", "processValidation")
  .addConditionalEdges(
    "processValidation",
    (state: typeof LegalResearchState.State) => {
      const userInput = (state.validation || "").toLowerCase().trim();
      const isValid = userInput === "yes" || userInput === "y";
      return isValid ? "validated" : "reselect";
    },
    {
      validated: "createDraft",
      reselect: "precedentSelection",
    }
  )
  .addConditionalEdges(
    "createDraft",
    (state: typeof LegalResearchState.State) => {
      // If ambiguity detected, request clarification (HITL #2)
      if (state.ambiguityDetected) return "needsClarification";
      // Otherwise, generate final analysis
      return "final";
    },
    {
      needsClarification: "requestClarification",
      final: "generateFinalAnalysis",
    }
  )
  .addEdge("requestClarification", "generateFinalAnalysis") // HITL #2
  .addEdge("generateFinalAnalysis", "__end__");</code></pre><p>In the graph, we can see the <code>createDraft</code> conditional edge, which defines the condition for choosing the “final” path. As shown, the decision now depends on whether the draft analysis detected ambiguity that requires additional clarification.</p><p>Now, let’s put everything together so it can be executed:</p><pre><code>await ingestData();

// Compile workflow
const app = workflow.compile({ checkpointer: new MemorySaver() });
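// Note: interrupt() only works when the graph is compiled with a checkpointer.
// MemorySaver keeps the paused graph state in memory under the thread_id
// passed via `config`; a persistent checkpointer would let a resume survive
// process restarts.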
const config = { configurable: { thread_id: "hitl-circular-thread" } };

await saveGraphImage(app);

// Execute workflow
const legalQuestion =
  "Does a pattern of repeated delays constitute breach even if each individual delay is minor?";

console.log(`⚖️  LEGAL QUESTION: "${legalQuestion}"\n`);

let currentState = await app.invoke({ query: legalQuestion }, config);

// Handle all interruptions in a loop
while ((currentState as any).__interrupt__?.length > 0) {
  console.log("\n💭 APPLICATION PAUSED WAITING FOR USER INPUT...");

  const interruptQuestion = (currentState as any).__interrupt__[0]?.value
    ?.question;

  // Handling empty responses
  let userChoice = "";
  while (!userChoice.trim()) {
    userChoice = await getUserInput(interruptQuestion || "👤 YOUR CHOICE: ");
    if (!userChoice.trim()) {
      console.log("⚠️  Please provide a response.\n");
    }
  }

  currentState = await app.invoke(
    new Command({ resume: userChoice.trim() }),
    config
  );
}</code></pre><h3>Execute the script</h3><p>With all the code in place, let&#x27;s execute the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a> file by running the following command in the terminal:</p><pre><code>npx tsx main.ts</code></pre><p>Once the script is executed, the question <strong>“Does a pattern of repeated delays constitute breach even if each individual delay is minor?”</strong> will be sent to Elasticsearch to perform a similarity search, and the results retrieved from the index will be displayed. The app detects that multiple relevant precedents match the query, so it pauses execution and asks the user to help disambiguate which legal precedent is most applicable:</p><pre><code>📚 Searching for relevant legal precedents with query:
 Does a pattern of repeated delays constitute breach even if each individual delay is minor?
Found 5 relevant precedents:

1. Case H: Pattern of Repeated Delays (CASE-H-2021)
   Type: ongoing service agreement
   Outcome: breach found
   Key reasoning: pattern demonstrated failure to perform, cumulative effect
   Delay period: multiple instances

2. Case E: Minor Delay Quality Maintained (CASE-E-2022)
   Type: service agreement
   Outcome: minor breach only
   Key reasoning: delay minimal, quality maintained, termination unjustified
   Delay period: five days

3. Case A: Delay Breach with Operational Impact (CASE-A-2023)
   Type: service agreement
   Outcome: breach found
   Key reasoning: delay affected operations and caused financial harm
   Delay period: two weeks

4. Case B: Delay Without Explicit Schedule (CASE-B-2022)
   Type: consulting agreement
   Outcome: no breach found
   Key reasoning: no explicit deadline defined, no demonstrable harm
   Delay period: three weeks

5. Case C: Justified Delay External Factors (CASE-C-2023)
   Type: construction service
   Outcome: no breach found
   Key reasoning: external factors beyond control, force majeure applied
   Delay period: one month

⚖️  HITL #1: Human input needed

💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Which precedent is most similar to your case? </code></pre><p>The interesting thing about this application is that we can use natural language to choose one option, letting the LLM interpret the user’s input to determine the correct choice. Let’s see what happens if we enter the text <strong>“Case H”</strong>:</p><pre><code>💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Which precedent is most similar to your case? Case H

✅ Selected: Case H: Pattern of Repeated Delays</code></pre><p>After case selection, the application asks for a quick confirmation that the model selected the correct case. If you answer “no”, the application returns to the case selection step:</p><pre><code>Validation needed

Selected precedent: Case H: Pattern of Repeated Delays (CASE-H-2021)
Type: ongoing service agreement
Outcome: breach found


💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Is this the correct precedent? (yes/no):  yes</code></pre><p>Once confirmed, the application workflow continues drafting the analysis:</p><pre><code>✅ Precedent confirmed.

📝 Drafting initial legal analysis...

AMBIGUITY DETECTED:
Based on Case H, a pattern of repeated delays can indeed constitute a breach of contract, even if each individual delay is minor. The outcome in Case H indicates that the cumulative effect of these minor delays led to a significant failure to perform the contractual obligations adequately. The reasoning emphasizes that consistent performance is critical in fulfilling the terms of a contract. Therefore, if the repeated delays create a situation where the overall performance is hindered, this pattern could be interpreted as a breach. However, the interpretation may depend on the specific terms of the contract at issue, as well as the expectations of performance set forth in that contract.

Missing information:
1. Specific contract terms regarding performance timelines
2. Details on the individual delays (duration, frequency)
3. Context on consequences of delays stated in the contract
4. Other parties' expectations or agreements related to performance


⚖️  HITL #2: Additional context needed


💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Please provide clarification about your contract terms:</code></pre><p>The model takes the user’s case selection and integrates it into the workflow, proceeding with the final analysis once enough context has been provided. In this step, the system also makes use of the previously detected ambiguity: The draft analysis highlighted missing contractual details that could meaningfully affect the legal interpretation. These “missing information” items guide the model in determining what clarifications are essential to resolve uncertainty before producing a reliable final opinion.</p><p>The user must include the requested clarifications in the next input. Let&#x27;s try with <strong>&quot;Contract requires ‘prompt delivery’ without timelines. 8 delays of 2-4 days over 6 months. $50K in losses from 3 missed client deadlines. Vendor notified but pattern continued.&quot;</strong>:</p><pre><code>💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Please provide clarification about your contract terms: Contract requires "prompt delivery" without timelines. 8 delays of 2-4 days over 6 months. $50K in losses from 3 missed client deadlines. Vendor notified but pattern continued.

📋 Generating final legal analysis...

================================================================================
⚖️  FINAL LEGAL ANALYSIS
================================================================================

To analyze the question of whether a pattern of repeated minor delays constitutes a breach of contract, we need to combine insights from the selected precedent, the specifics of the lawyer's contract situation, conditions that typically govern breach versus non-breach, and practical recommendations for the lawyer moving forward.

### 1. Selected Precedent's Reasoning

The precedent case, referred to as Case H, found that a pattern of repeated delays amounted to a breach of contract. The court reasoned that even minor individual delays, when considered cumulatively, demonstrated a failure to perform as stipulated in the contract. The underlying rationale was that the cumulative effect of these minor delays could significantly undermine the purpose of the contract, which typically aims for timely performance and reliable delivery.

### 2. Lawyer's Specific Contract Context

In the lawyer's situation, the contract specified "prompt delivery" but did not provide a strict timeline. The vendor experienced 8 delays ranging from 2 to 4 days over a period of 6 months. These delays culminated in $50,000 in losses due to three missed client deadlines. The vendor was notified regarding these delays; however, the pattern of delays persisted.

Key considerations include:
- **Nature of the Obligations**: While “prompt delivery” does not define a strict timeline, it does imply an expectation for timely performance.
- **Material Impact**: The missed client deadlines indicate that these delays had a material adverse effect on the lawyer's ability to fulfill contractual obligations to third parties, likely triggering damages.

### 3. Conditions for Breach vs. No Breach

**Conditions for Breach**:
- **Pattern and Cumulative Effect**: Similar to the reasoning in Case H, evidence of a habitual pattern of delays can amount to a breach. Even if individual delays are minor, when combined, they may show a lack of diligence or reliability by the vendor.
- **Materiality**: The impact of these delays is crucial. If the cumulative delays adversely affect the contract's purpose or cause significant losses, this reinforces the case for a breach.
- **Notification and Opportunity to Cure**: The fact that the vendor was notified of the delays and failed to rectify the behavior can often be interpreted as a further indication of breach.

**Conditions for No Breach**:
- **Non-Material Delays**: If the delays did not affect the overall contractual performance or client obligations, this may lessen the likelihood of establishing a breach. However, given the risks and losses involved, this seems less relevant in this scenario.
- **Force Majeure or Justifiable Delays**: If the vendor could show that these delays were due to justifiable circumstances outside their control, this may provide a defense against breach claims.

### 4. Practical Recommendations

1. **Assess Damages**: Document the exact nature of the financial losses incurred due to the missed deadlines to substantiate claims of damages.
  
2. **Gather Evidence**: Collect all communication regarding the delays, including any notifications sent to the vendor about the issues.

3. **Consider Breach of Contract Action**: Based on the precedent and accumulated delays, consider formalized communication to the vendor regarding a breach of contract claim, highlighting both the pattern and the impact of these repeated delays.

4. **Evaluate Remedies**: Depending upon the contract specifics, the lawyer may wish to pursue several remedies, including:
   - **Compensatory Damages**: For the financial losses due to missed deadlines.
   - **Specific Performance**: If timely delivery is critical and can still be enforced.
   - **Contract Termination**: Depending on the severity, terminating the contract and seeking replacements may be warranted.

5. **Negotiate Terms**: If continuing to work with the current vendor is strategic, the lawyer should consider renegotiating terms for performance guarantees or penalties for further delays.

6. **Future Contracts**: In future contracts, consider including explicit timelines and conditions for prompt delivery, as well as specified damages for delays to better safeguard against this issue.

By integrating the legal principles from the precedent with the specific context and conditions outlined, the lawyer can formulate a solid plan to address the repeated delays by the vendor effectively.</code></pre><p>This output shows the final stage of the workflow, where the model integrates the selected precedent (Case H) and the lawyer’s clarifications to generate a complete legal analysis. The system explains why the pattern of delays likely constitutes a breach, outlines the factors that support this interpretation, and provides practical recommendations. Overall, the output demonstrates how the HITL clarifications resolve ambiguity and allow the model to produce a well-founded, context-specific legal opinion.</p><h3>Other real-world scenarios</h3><p>This kind of application, using Elasticsearch, LangGraph, and HITL, can be useful in many other scenarios, including:</p><ul><li><strong>Reviewing tool calls before execution</strong>; for example, in financial trading, a human approves buy/sell orders before they’re placed.</li><li><strong>Providing additional parameters when needed</strong>; for example, in customer support triage, where a human agent selects the correct issue category when the AI finds multiple possible interpretations of the customer’s problem.</li></ul><p>And there are plenty more use cases to discover where HITL will be a game changer.</p><h2>Conclusion</h2><p>With LangGraph and Elasticsearch, we can build agents that make their own decisions and run as linear workflows or follow conditional paths based on context. With a human in the loop, the agents can involve the actual user in the decision-making process to fill contextual gaps and request confirmations in systems where fault tolerance is critical.</p><p>The key advantage of this approach is that you can filter a large dataset using Elasticsearch capabilities and then use an LLM to select a single document based on the user’s intent; that is, with a human in the loop. 
The LLM complements Elasticsearch by handling the variability in how users express their intent.</p><p>This approach keeps the system fast and token-efficient, since we only send the LLM what it needs to make the final decision, not the whole dataset. At the same time, it stays precise at detecting user intent, iterating until the desired option is picked.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Apache Lucene 2025 wrap-up]]></title>
            <description><![CDATA[2025 was a stellar year for Apache Lucene; here are our highlights.]]></description>
            <link>https://www.elastic.co/search-labs/blog/apache-lucene-wrapped-2025</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/apache-lucene-wrapped-2025</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Benjamin Trent, Chris Hegarty]]></dc:creator>
            <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bda54d0af6f391ea71aae7936bc02d53fa47fefa-1600x901.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>It’s hard to believe that Apache Lucene has been around for over a quarter of a century! Yes, more than 25 years of search enabled through Apache Lucene.</p><h2>Community by the numbers</h2><p>For contributions and the community as a whole, 2025 was another strong year, with 1,756 commits and 1,080 pull requests from 134 unique contributors. The community continued to grow this year, increasing the number of contributors by 98 from the previous year. The project management committee (PMC) and committer cohort also increased. Apache Lucene added nine new committers in 2025, one of whom was Elastic’s own Simon Cooper. (Congratulations, <a href="https://github.com/thecoop">thecoop</a>!) The PMC also welcomed two new members. Our little community continues to grow.</p><h2>Our highlights</h2><p>With almost 2,000 commits and eight releases, it’s difficult to summarize all the things that we loved from the year in Apache Lucene. But, not hiding from a challenge, here are some of our highlights.</p><h2>The year of faster queries</h2><p>In many ways, 2025 was the year Apache Lucene embraced better auto-vectorization, manual <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a> optimizations, and overall making things go faster. Lucene community member <a href="https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html">Adrien Grand goes into great detail in this blog</a>, and below are links and summaries for the biggest jumps. 
As always, a huge shout-out to <a href="https://github.com/mikemccand">Mike McCandless</a> for maintaining the <a href="https://benchmarks.mikemccandless.com/AndHighHigh.html">Apache Lucene Benchmarks for years</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bff3742a2d3fd0e61a75785a4f6c79e1a3dcd94d-1600x613.png" style="width: 100%;" /><ul><li><code>[IN]</code> <a href="https://github.com/apache/lucene/pull/14701">Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion </a></li><li><code>[IS]</code> Irrelevant hardware update, so it’s just noise here</li><li><code>[IY]</code> <a href="https://github.com/apache/lucene/pull/14906">Use branchless way to speedup filterCompetitiveHits</a></li><li><code>[IZ]</code> <a href="https://github.com/apache/lucene/pull/14935">Improve collecting docs stored as bitsets</a></li><li><code>[JA]</code> <a href="https://github.com/apache/lucene/pull/14896">Vectorize filterCompetitiveHits by hand with the Java Panama API</a></li><li><code>[JK]</code> <a href="https://github.com/apache/lucene/pull/15160">Increase the document block size to 256</a></li></ul><p>Discounting the hardware change <code>[IS]</code>, this is almost a 60% increase in query speed in 2025, going from &lt;100 queries per second (qps) to &gt;170 qps.</p><h2>Vector search</h2><p>There were a number of vector search improvements in 2025. Three to highlight are improving filtered <a href="https://github.com/apache/lucene/pull/14160">vector search with ACORN</a>, <a href="https://github.com/apache/lucene/pull/14226">adding optimistic multisegment search</a>, and <a href="https://github.com/apache/lucene/pull/14978">bulk scoring of vectors</a>.</p><p><a href="https://arxiv.org/pdf/2403.04871">ACORN-1</a> is an interesting algorithm for graph-based vector indices. It has the significant benefit of being filter and algorithm-agnostic. 
Since Apache Lucene uses hierarchical navigable small world (HNSW) for its indexing and users generally want to filter on anything and everything without additional configuration, it’s a perfect fit. A community <a href="https://github.com/apache/lucene/pull/14085">member originally investigated</a> adding this algorithm. (He has since been hired by Elastic. Hi, <a href="https://github.com/benchaplin">Ben!</a> 👋👋👋) A nice balance was found for Lucene, providing faster filtered vector search without requiring significant configuration work from the user or any additional indexing of information.</p><p>The addition of <a href="https://github.com/apache/lucene/pull/14226">optimistic multisegment vector search</a> to Apache Lucene really highlights how the community comes together to make it all happen. Three different individuals combined powers to debug, benchmark, design, and iterate on this solution. Originally proposed by <a href="https://github.com/msokolov">Michael Sokolov</a> (a vector search superstar in the Lucene community), it immediately caught my attention, too, because it claimed to fix our <a href="https://github.com/apache/lucene/issues/14180">weird concurrency consistency bug</a> without sacrificing performance. With some iteration and benchmarking by fellow community member <a href="https://github.com/dungba88">Dzung Bui</a>, we were able to strike the right balance in speed and recall, improving performance, making multithreaded searches consistent, and implementing a pretty neat algorithm.</p><p>Bulk scoring was born out of a collaboration between community member <a href="https://github.com/mccullocht">Trevor McCulloch</a> and our very own <a href="https://github.com/ChrisHegarty">Chris Hegarty</a> and was introduced as a new scoring interface in <a href="https://github.com/apache/lucene/pull/14978">PR #14978</a>, with an initial float32 implementation following in <a href="https://github.com/apache/lucene/pull/14980">PR #14980</a>. 
Modern vector search involves thousands or even millions of comparisons as a query vector is evaluated against vectors in the index, often by traversing a graph of nearest neighbors. Traditionally, these comparisons are performed one vector at a time. Bulk scoring inverts that model by passing a batch of vectors, represented as ordinals in the index, to the scorer in a single call. This allows the scorer to prefetch and pipeline work across vectors, amortizing cache misses and reducing per-vector overhead. An exciting aspect of the original design discussion considered bulk scorers implemented in Rust and C. While Lucene itself remains a Java library, this opens the door to highly optimized, SIMD-friendly, native implementations.</p><p>We didn’t even get to talk about several other improvements that landed in this area, including HNSW optimizations, like more compact <a href="https://github.com/apache/lucene/pull/14932">GroupVarInt graph</a> encoding, bypassing graph construction for <a href="https://github.com/apache/lucene/pull/14963">tiny segments</a>, and continued reductions in memory footprint. On the operations side, Lucene now <a href="https://github.com/apache/lucene/pull/14426">exposes off-heap memory requirements</a>, making it easier to understand and debug native memory usage. While these changes are small(ish) individually, together they help make Lucene’s vector search faster, leaner, and easier to operate in production.</p><h2>Bonus</h2><p>The last highlight is a little out of place. It&#x27;s a particularly <a href="https://github.com/apache/lucene/pull/15380">frustrating but satisfying bug fix</a>. I won’t go into deep details here, as it touches on how Lucene does max scoring and bulk scoring, applies filters, and handles all its internal iterator state. This deserves its own blog post. In short, we ran into this bug in production in late September 2025. It was surfacing as an <code>EndOfFileException</code> during a specific query execution. 
And then, like all fun bugs, it took a week or two of work to reproduce and fully debug. Finally, when we knew what exactly caused the exception to throw, we then had to dig into why to fix it. All in all, a month’s worth of work summarized in one line of code. Bask in its glory:</p><pre><code>- top.doc = top.approximation.advance(filter.doc);
+ // Must use the iterator as `top` might be a two-phase iterator
+ top.doc = top.iterator.advance(filter.doc);</code></pre><h2>Goodbye 2025, hello 2026</h2><p>A big thank you to everyone in the Apache Lucene community who worked tirelessly to improve this venerable search library. We ❤️ you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Discord and Elastic Agent Builder A2A to build a gaming community support bot]]></title>
            <description><![CDATA[Learn how to connect Discord to Elastic Agent Builder's Agent-to-Agent (A2A) server to create a gaming community support bot.]]></description>
            <link>https://www.elastic.co/search-labs/blog/discord-elastic-agent-builder-a2a</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/discord-elastic-agent-builder-a2a</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e90b56e9bcf0ab0b7b906fb6ec2413fdc4e6a00-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we&#x27;ll build a gaming community support bot that connects Discord to Elastic Agent Builder&#x27;s <a href="https://www.elastic.co/docs/solutions/search/agent-builder/a2a-server">Agent-to-Agent (A2A) server</a>. Players will ask questions like, &quot;Who&#x27;s the best Mage?&quot; or &quot;What&#x27;s the current meta?&quot; and get real-time answers powered by Elasticsearch Query Language (ES|QL) analytics and semantic search.</p><p>Many gaming companies already use Elasticsearch for player data and game analytics. In this tutorial, we&#x27;ll access that data through Discord, where many users spend their time.</p><h2>Prerequisites</h2><ul><li>Elasticsearch cluster (9.2 or higher)</li><li>Python 3.9+</li><li>Discord <a href="https://discord.com/developers/docs/quick-start/getting-started#step-1-creating-an-app">bot token</a> and <a href="https://support.discord.com/hc/en-us/articles/204849977-How-do-I-create-a-server">server created</a></li></ul><h2>What we&#x27;re building</h2><p>We&#x27;ll create a Discord bot that uses Elastic Agent Builder&#x27;s A2A server to answer player questions. The bot will have two types of tools:</p><ul><li><strong>ES|QL tools</strong> for analytics: leaderboards, hero stats, meta reports.</li><li><strong>Index search tools</strong> for knowledge: game mechanics, FAQs.</li></ul><p>The architecture looks like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e90b56e9bcf0ab0b7b906fb6ec2413fdc4e6a00-1280x720.png" style="width: 100%;" /><h2>Architecture overview</h2><p><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a> provides an <a href="https://www.elastic.co/docs/solutions/search/agent-builder/a2a-server">A2A server</a> to connect with clients. 
This server exposes the agent with tools that can query Elasticsearch using <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/esql-tools">ES|QL</a> or <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/index-search-tools">index search</a> tools. Any A2A-compatible client can connect to it.</p><p>The A2A client instantiates a bot that can connect to the Discord server, communicate with Agent Builder, and receive/send messages to the users.</p><h2>Why A2A instead of MCP?</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d723675b2b0bfebfae69548157420115b215572e-1600x1138.png" style="width: 100%;" /><p>Agent Builder also offers the option to expose the agent&#x27;s tools via a Model Context Protocol (MCP) server. The key difference is that with MCP, the client can only access the tools and their descriptions, not the agent itself. Additionally, tool selection logic must be implemented on the MCP client side, since all tools from all agents are exposed together.</p><p>With A2A, the entire agent, including instructions and tools, is available to the client. This gives you more control from the Agent Builder side, letting you manage behavior centrally rather than in each client.</p><p>The right choice depends on where you want that control to live. For this Discord bot, we want to manage the agent&#x27;s behavior from Elastic, so A2A is the better fit.</p><h2>Setting up sample game data</h2><p>Let&#x27;s create gaming data that our bot can query. We&#x27;ll set up three indices:</p><ul><li><code>player_stats</code>: Player profiles with wins, kills, rank.</li><li><code>hero_meta</code>: Hero pick rates and win rates by tier.</li><li><code>game_knowledge</code>: FAQs and game mechanics. Uses <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> for meaning-based versus keyword matching. 
Both <strong>title</strong> and <strong>content</strong> are <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to">copied into</a> <code>semantic_field</code> for a hybrid search solution.</li></ul><h2>Create the indices</h2><pre><code>from elasticsearch import Elasticsearch
import os

es = Elasticsearch(
    hosts=[os.getenv("ELASTICSEARCH_URL")],
    api_key=os.environ["ELASTIC_API_KEY"]
)

# Player stats index
es.indices.create(
    index="player_stats",
    mappings={
        "properties": {
            "player_id": {"type": "keyword"},
            "username": {"type": "keyword"},
            "hero": {"type": "keyword"},
            "wins": {"type": "integer"},
            "losses": {"type": "integer"},
            "kills": {"type": "integer"},
            "deaths": {"type": "integer"},
            "rank": {"type": "keyword"},
            "last_played": {"type": "date"}
        }
    }
)

# Hero meta index
es.indices.create(
    index="hero_meta",
    mappings={
        "properties": {
            "hero_name": {"type": "keyword"},
            "pick_rate": {"type": "float"},
            "win_rate": {"type": "float"},
            "tier": {"type": "keyword"},
            "patch_version": {"type": "keyword"}
        }
    }
)

# Game knowledge index (for semantic search)
es.indices.create(
    index="game_knowledge",
    mappings={
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "category": {"type": "keyword"},
            "semantic_field": {"type": "semantic_text"} # Semantic search queries this combined field
        }
    }
)</code></pre><h3>Index sample data</h3><pre><code>from datetime import datetime, timedelta
from elasticsearch.helpers import bulk
import random

# Sample player data
players = [
    {"player_id": "p001", "username": "DragonSlayer99", "hero": "Warrior", "wins": 342, "losses": 198, "kills": 4521, "deaths": 2103, "rank": "Diamond"},
    {"player_id": "p002", "username": "ShadowMage", "hero": "Mage", "wins": 567, "losses": 234, "kills": 8932, "deaths": 3421, "rank": "Master"},
    {"player_id": "p003", "username": "HealBot3000", "hero": "Healer", "wins": 423, "losses": 187, "kills": 1234, "deaths": 1876, "rank": "Diamond"},
    {"player_id": "p004", "username": "TankMaster", "hero": "Tank", "wins": 298, "losses": 302, "kills": 2341, "deaths": 1543, "rank": "Platinum"},
    {"player_id": "p005", "username": "AssassinX", "hero": "Assassin", "wins": 789, "losses": 156, "kills": 12453, "deaths": 2987, "rank": "Grandmaster"},
]

for player in players:
    player["last_played"] = datetime.now() - timedelta(hours=random.randint(1, 72))
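```python
# Aside (pure Python, hypothetical helpers, no cluster needed): the kind of
# derived stats our ES|QL tools will later compute server-side, sketched here
# against the sample figures above (AssassinX: 12453 kills, 2987 deaths,
# 789 wins, 156 losses).
def kd_ratio(kills, deaths):
    # Guard against division by zero for deathless players
    return round(kills / max(deaths, 1), 2)

def win_rate_pct(wins, losses):
    return round(100 * wins / (wins + losses), 1)

print(kd_ratio(12453, 2987), win_rate_pct(789, 156))  # 4.17 83.5
```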

# Hero meta data
heroes = [
    {"hero_name": "Warrior", "pick_rate": 15.2, "win_rate": 51.3, "tier": "A", "patch_version": "2.4.1"},
    {"hero_name": "Mage", "pick_rate": 22.8, "win_rate": 54.7, "tier": "S", "patch_version": "2.4.1"},
    {"hero_name": "Healer", "pick_rate": 18.5, "win_rate": 52.1, "tier": "A", "patch_version": "2.4.1"},
    {"hero_name": "Tank", "pick_rate": 12.3, "win_rate": 48.9, "tier": "B", "patch_version": "2.4.1"},
    {"hero_name": "Assassin", "pick_rate": 31.2, "win_rate": 49.2, "tier": "A", "patch_version": "2.4.1"},
]

# Game knowledge for semantic search
knowledge = [
    {"title": "How to unlock the Dragon Mount", "content": "Complete the Dragon's Lair dungeon on Nightmare difficulty with all party members alive. The mount has a 15% drop rate.", "category": "mounts"},
    {"title": "Best Mage build for Season 4", "content": "Focus on Intelligence and Critical Chance. Use the Arcane Staff with Frost Runes. Prioritize cooldown reduction for burst damage.", "category": "builds"},
    {"title": "Understanding the ranking system", "content": "Ranks go from Bronze to Grandmaster. You need 100 points to advance. Wins give 25 points, losses subtract 20.", "category": "ranked"},
]

# Bulk index all data
actions = []
for player in players:
    actions.append({"_index": "player_stats", "_source": player})
for hero in heroes:
    actions.append({"_index": "hero_meta", "_source": hero})
for doc in knowledge:
    actions.append({"_index": "game_knowledge", "_source": doc})

success, errors = bulk(es, actions)
print(f"Indexed {success} documents")

es.indices.refresh(index="player_stats,hero_meta,game_knowledge")</code></pre><p>Done! Now we have data to query.</p><h2>Creating ES|QL tools via API</h2><p>Instead of using the UI, we&#x27;ll create tools programmatically using the <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-agent-builder">Agent Builder API</a>. This makes it easier to version control and deploy.</p><p>First, let&#x27;s set up our <a href="https://www.elastic.co/docs/api/doc/kibana/">Kibana API</a> connection:</p><pre><code>import requests
import os

KIBANA_URL = os.environ["KIBANA_URL"]  # e.g., https://your-deployment.kb.us-central1.gcp.cloud.es.io
KIBANA_API_KEY = os.environ["KIBANA_API_KEY"]

headers = {
    "kbn-xsrf": "true",
    "Authorization": f"ApiKey {KIBANA_API_KEY}",
    "Content-Type": "application/json"
}</code></pre><h3>Tool 1: Leaderboard</h3><p>This tool shows top players by kills. We use ES|QL for aggregations:</p><pre><code>leaderboard_tool = {
    "id": "leaderboard",
    "type": "esql",
    "description": "Shows top players ranked by kills. Use when someone asks Who is the best? or Show me top players.",
    "configuration": {
        "query": """FROM player_stats
| STATS total_kills = SUM(kills), total_wins = SUM(wins) BY username, hero, rank
| SORT total_kills DESC
| LIMIT 10""",
        "params": {}
    }
}
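```python
# Hypothetical convenience, not part of the Agent Builder API: each tool in
# this article is POSTed the same way, so a small wrapper keeps one call site.
# `post` is injectable (e.g. requests.post) so the sketch can be exercised
# without a live Kibana instance.
def create_tool(tool, base_url, headers, post):
    resp = post(f"{base_url}/api/agent_builder/tools", headers=headers, json=tool)
    # A non-2xx status here usually means bad auth or a duplicate tool id
    print(f"{tool['id']} tool: {resp.status_code}")
    return resp

# e.g. create_tool(leaderboard_tool, KIBANA_URL, headers, requests.post)
```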

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=leaderboard_tool
)
print(f"Leaderboard tool: {response.status_code}")</code></pre><h3>Tool 2: Hero stats</h3><p>The hero stats tool returns performance stats for a specific hero using a dynamic parameter:</p><pre><code>hero_stats_tool = {
    "id": "hero_stats",
    "type": "esql",
    "description": "Gets win rate, pick rate, and tier for a specific hero. Use when someone asks How good is Mage? or What is the win rate for Warrior?",
    "configuration": {
        "query": """FROM hero_meta
| WHERE hero_name == ?hero
| KEEP hero_name, win_rate, pick_rate, tier, patch_version""",
        "params": {
            "hero": {
                "type": "keyword",
                "description": "The hero name to look up"
            }
        }
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=hero_stats_tool
)
print(f"Hero stats tool: {response.status_code}")</code></pre><p><code>?hero</code> is a dynamic parameter. When the agent calls this tool, it fills in the hero name from the user&#x27;s question, bridging natural language, “Is Mage strong?” to structured queries, like <code>| WHERE hero_name == “Mage”</code>.</p><h3>Tool 3: Meta report</h3><p>This tool shows which heroes are dominating the current patch:</p><pre><code>meta_report_tool = {
    "id": "meta_report",
    "type": "esql",
    "description": "Shows all heroes sorted by tier and win rate. Use when someone asks What is the current meta? or Which heroes are S-tier?",
    "configuration": {
        "query": """FROM hero_meta
| SORT tier ASC, win_rate DESC
| KEEP hero_name, tier, win_rate, pick_rate""",
        "params": {}
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=meta_report_tool
)
print(f"Meta report tool: {response.status_code}")</code></pre><h2>Adding index search tools</h2><p>For unstructured content, like FAQs and game mechanics, we use <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/index-search-tools">index search</a> tools. The <code>pattern</code> parameter specifies which Elasticsearch index to search:</p><pre><code>game_knowledge_tool = {
    "id": "game_knowledge",
    "type": "index_search",
    "description": "Searches game guides, FAQs, and mechanics. Use when someone asks How do I...? or What is...? questions about game content.",
    "configuration": {
        "pattern": "game_knowledge"
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=game_knowledge_tool
)
print(f"Game knowledge tool: {response.status_code}")</code></pre><p>Keep in mind that ES|QL tools are for structured analytics (&quot;Who has the most kills?&quot;), while index search tools are for unstructured knowledge (&quot;How do I unlock the dragon mount?&quot;).</p><h2>Creating the agent</h2><p>Let’s create an agent that uses these tools:</p><pre><code>agent = {
    "id": "gaming_support_bot",
    "name": "Gaming Support Bot",
    "description": "A gaming community support bot that answers player questions about stats, heroes, and game mechanics.",
    "configuration": {
        "tools": [{"tool_ids": ["leaderboard", "hero_stats", "meta_report", "game_knowledge"]}],
        "instructions": """You are a helpful gaming community bot. Answer player questions about:
- Player stats and leaderboards (use leaderboard tool)
- Hero performance and meta (use hero_stats and meta_report tools)
- Game mechanics and guides (use game_knowledge tool)

Be concise and friendly. Format leaderboards clearly with rankings."""
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers=headers,
    json=agent
)
print(f"Agent created: {response.status_code}")</code></pre><p>The agent now has access to all our tools and can choose the right one based on the user&#x27;s question.</p><p>You can test the connectivity by visiting the Kibana <a href="https://www.elastic.co/docs/solutions/search/agent-builder/chat#agent-chat-gui">Agent Chat GUI</a> or by sending an API call. By default, Agent Builder uses the <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a>, so no connector configuration is required:</p><pre><code>test_message = "Show me all heroes sorted by tier"

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/converse",
    headers=headers,
    json={
        "agent_id": "gaming_support_bot",
        "input": test_message
    },
    timeout=60
)
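```python
# Sketch (pure Python, hypothetical helper): the shape of a converse response
# body — entries in "steps" with type "tool_call" carry the tool_id the agent
# chose, which is what the extraction further down relies on.
def used_tool_ids(result):
    return [s.get("tool_id") for s in result.get("steps", [])
            if s.get("type") == "tool_call"]

example = {"steps": [{"type": "tool_call", "tool_id": "meta_report"},
                     {"type": "reasoning"}]}
print(used_tool_ids(example))  # ['meta_report']
```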

print(f"Status: {response.status_code}")
if response.status_code == 200:
    result = response.json()
    print(f"\nAgent used tools: {[step.get('tool_id') for step in result.get('steps', []) if step.get('type') == 'tool_call']}")
    print(f"\nResponse:\n{result.get('response', {}).get('message', 'No message')}")
else:
    print(f"Error: {response.text}")</code></pre><h3>Setting up a Discord bot</h3><p>If you don&#x27;t have a Discord bot yet, you&#x27;ll need to create one following these steps:</p><p>1. Go to the <a href="https://discord.com/developers/applications">Discord Developer Portal</a>.</p><p>2. Click <strong>New Application</strong>, and give it a name.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7cf03a67865f62744e0b2f971531726b5b53d9-1600x1039.jpg" style="width: 100%;" /><p>3. Go to the <strong>Bot</strong> section, and click <strong>Add Bot</strong>.</p><p>4. Copy the bot token. (You&#x27;ll need this later.)</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7cf03a67865f62744e0b2f971531726b5b53d9-1600x1039.jpg" style="width: 100%;" /><p>5. Under <strong>Privileged Gateway Intents</strong>, enable <strong>Message Content Intent</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9c8edbd72d02d993efef8cf7bceb0a0e20a05da4-1502x1538.png" style="width: 100%;" /><p>6. Go to <strong>OAuth2 &gt; URL Generator</strong>, select <strong>bot</strong> and <strong>application.commands</strong> scopes and <strong>Send Messages</strong> + <strong>Read Message History</strong> permissions.</p><p>7. Use the generated URL to invite the bot to your server.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/572077eef6ee7cde811c55795c245c6b42a95369-1600x1039.jpg" style="width: 100%;" /><h3><strong>Security considerations</strong></h3><p>On the <strong>Discord</strong> side, request only the minimum permissions:</p><ul><li>Send Messages.</li><li>Send Messages in Threads.</li><li>Read Message History.</li><li>Create Polls.</li></ul><p>This limits what the bot can do if compromised.</p><p>On the <strong>Elastic </strong>side, create an API key with restricted privileges. 
For this bot, you only need read access to the game indices:</p><pre><code>POST /_security/api_key
{
  "name": "gaming-bot-key",
  "role_descriptors": {
    "gaming_bot_role": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["player_stats", "hero_meta", "game_knowledge"],
          "privileges": ["read"]
        }
      ]
    }
  }
}</code></pre><h2>Configuring the Discord A2A client</h2><p>The next step is to connect Discord to Agent Builder. We&#x27;ll use a <a href="https://github.com/llermaly/agentbuilder-a2a-discord">modified version</a> of the <a href="https://github.com/lekkerelou/kagent-a2a-discord"><code>kagent-a2a-discord</code></a> project.</p><h3>Clone and setup</h3><pre><code>git clone https://github.com/llermaly/agentbuilder-a2a-discord
cd agentbuilder-a2a-discord</code></pre><h3>Environment variables</h3><p>Create an <code>.env</code> file in the root directory:</p><pre><code>DISCORD_BOT_TOKEN=your_bot_token
AGENT_BUILDER_URL=https://&lt;kibana_url&gt;/api/agent_builder/a2a/gaming_support_bot
A2A_API_KEY=your_api_key</code></pre><h3>Run with Python</h3><pre><code># Create virtual environment
uv venv

# Install dependencies
uv sync

# Run the bot 
uv run main.py</code></pre><p>The bot should come online in your Discord server within a few seconds.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/91b6644120bd6750322de433271b4cb09282ebb8-921x163.png" style="width: 100%;" /><h2>Testing the bot</h2><p>Let&#x27;s test with different query types.</p><h3>Analytics queries (ES|QL)</h3><p><strong>User:</strong> &quot;Who is the best Mage player?&quot;</p><p><strong>Bot:</strong> Based on the leaderboard data, ShadowMage is currently the best Mage player in the game. They&#x27;ve reached Master rank with an impressive 8,932 kills and 567 wins…</p><p><strong>User:</strong> &quot;What&#x27;s the current meta?&quot;</p><p><strong>Bot:</strong> The current meta is dominated by Mage, which stands alone in S-tier with an impressive 54.7% win rate and 22.8% pick rate in patch 2.4.1.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a014f6baa8a0fb39de91c3cf09dc49a9481db833-1600x677.jpg" style="width: 100%;" /><h3>Semantic queries (index search)</h3><p><strong>User:</strong> &quot;How do I unlock the dragon mount?&quot;</p><p><strong>Bot:</strong> To unlock the Dragon Mount, you need to:</p><ol><li>Complete the Dragon&#x27;s Lair dungeon on Nightmare difficulty.</li><li>Keep all party members alive throughout the entire run.</li><li>Cross your fingers for luck - the mount has a 15% drop rate…</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/296f781c20a724c813d8ca70319b12769bdea130-1600x416.jpg" style="width: 100%;" /><p>The bot automatically picks the right tool based on the question.</p><h2>Bidirectional: Giving the agent actions</h2><p>Beyond answering questions, we can give Agent Builder the ability to trigger Discord actions. 
With a <a href="https://github.com/llermaly/agentbuilder-a2a-discord/commit/da66c4642dababc06318f3a62e464c200ce3aec4">small modification</a> to the Discord client, we can parse special tags in the agent&#x27;s response and execute Discord commands.</p><p>For example, we added support for a <code>&lt;poll&gt;</code> tag:</p><pre><code>&lt;poll&gt;Should Mage be nerfed?|Yes, too strong|No, it&#x27;s balanced|Need more data&lt;/poll&gt;</code></pre><p>When the agent includes this in its response, the bot creates a native Discord poll. The agent just needs instructions on when to use it. Run the following to add to the instructions:</p><pre><code>agent = {
    "id": "gaming_support_bot",
    "name": "Gaming Support Bot",
    "description": "A gaming community support bot that answers player questions about stats, heroes, and game mechanics.",
    "configuration": {
        "tools": [{"tool_ids": ["leaderboard", "hero_stats", "meta_report", "game_knowledge"]}],
        "instructions": """You are a helpful gaming community bot. Answer player questions about:
- Player stats and leaderboards (use leaderboard tool)
- Hero performance and meta (use hero_stats and meta_report tools)
- Game mechanics and guides (use game_knowledge tool)

When discussing balance topics, create a poll for community input.
Use: &lt;poll&gt;Question|Option1|Option2|Option3&lt;/poll&gt;

Be concise and friendly. Format leaderboards clearly with rankings."""
    }
}

response = requests.put(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers=headers,
    json=agent
)
print(f"Agent created: {response.status_code}")</code></pre><p>Now when a user asks &quot;Is Mage overpowered?&quot; the agent can share the stats and create a poll, turning data insights into community engagement.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c892045115be7484677f534098ef75df30bdc808-1600x1081.jpg" style="width: 100%;" /><p>The same pattern works for other Discord features, like reactions, role mentions, or scheduled announcements. The agent&#x27;s response becomes a control channel for Discord actions.</p><h2>Conclusion</h2><p>In this article, we successfully built a Discord bot using Elastic Agent Builder&#x27;s A2A server. The process covered creating ES|QL tools via API for analytics, such as leaderboards, hero stats, and meta reports, and developing index search tools for semantic search against a knowledge base. Furthermore, we demonstrated the use of dynamic parameters, like <code>?hero</code> for flexible queries, established the connection between Discord and the A2A client, and tested both analytics and semantic query types.</p><p>The Agent Builder provides the A2A server, so you only need to focus on creating tools and connecting clients, while ES|QL handles structured analytics and index search handles unstructured knowledge.</p><h2>Resources</h2><ul><li><a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/discord-elastic-agent-builder-gaming-bot">Try the notebook</a></li><li><a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-post-agent-builder-tools">Agent Builder API docs</a></li><li><a href="https://github.com/llermaly/agentbuilder-a2a-discord">Discord Client</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[All about those chunks, ’bout those chunks, and snippets!]]></title>
            <description><![CDATA[Exploring chunking and snippet extraction for LLMs, highlighting enhancements for identifying the most relevant chunks and snippets to send to models such as rerankers and LLMs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/llm-chunking-snippet-extraction</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/llm-chunking-snippet-extraction</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kathleen DeRusso]]></dc:creator>
            <pubDate>Mon, 26 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/daa2741ce2fd31137fc0b0eb4320f4ee9db0c0e7-1536x1024.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>These days, if you’ve performed semantic searches or context engineering, you’ve probably worked a lot with chunks. If you’re not familiar with chunks, a <em>chunk </em>is a small, meaningful piece of content extracted from a larger document. This <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">blog</a> provides a great foundational overview of chunking, why it’s important, and various chunking strategies.</p><p>For this blog, we want to focus on one specific problem in the topic of chunking, and that is defining the best context to send to a large language model (LLM) or other model. Models have a limited number of tokens they can take in as context, but even within that limit, sending in large amounts of content can lead to relevance degradation caused by factors such as <a href="https://research.trychroma.com/context-rot">context rot</a> or <a href="https://arxiv.org/abs/2307.03172">“lost in the middle”</a> problems, where important information hidden in large blocks of text is overlooked.</p><p>This led to the question: How can we make this better?</p><h2>Reranking in retrievers</h2><p>We started by looking at retrievers, specifically the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/text-similarity-reranker-retriever"><code>text_similarity_reranker</code> retriever</a>. We know that many <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking#semantic-reranking-models">cross-encoder</a> rerankers do not perform well on long documents. This is because these rerankers will truncate long content to the model’s token window, discarding the remaining content. This can actually lead to degraded search relevance, if the most relevant part of the document is cut off before being sent to the reranker!</p><p>We decided to address this by introducing a <code>chunk_rescorer</code> to the <code>text_similarity_reranker</code> retriever. 
When specified, rather than sending the entire document to the reranker, we chunk the document first and evaluate each chunk against the reranking inference text. We do this by indexing each chunk into a temporary in-memory Lucene index and performing a BM25 <a href="https://www.elastic.co/docs/solutions/search/full-text">text search</a> over these chunks, then pass only the best-matching chunks to the reranker.</p><p>The chunk rescorer is simple to use with a small update to the API call:</p><pre><code>GET books/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "match": {
              "author": "Tolkien"
            }
          }
        }
      },
      "rank_window_size": 10,
      "field": "semantic_text",
      "inference_text": "are all who wander lost?",
      "chunk_rescorer": {
        "size": 1
      }
    }
  }
}</code></pre><p>When we evaluated the chunk rescorer, we found a significant improvement for many truncating models, including the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-rerank">Elastic Reranker</a> and <a href="https://docs.cohere.com/docs/rerank">Cohere&#x27;s</a> <code>rerank-english-v3.0</code> model. However, when we evaluated against <a href="https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual"><code>jina-reranker-v2-base-multilingual</code></a>, the results were not as impressive because Jina already addresses this long document problem internally.</p><p>We performed evaluations using the <a href="https://huggingface.co/datasets/Shitao/MLDR">Multilingual Long-Document Retrieval</a> (MLDR) English dataset. This dataset contains very long articles that trigger the document truncation issue in many reranking models. The following table shows our evaluation results with BM25 text search and a <code>rank_window_size</code> of 32:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Reranker model</td><td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">NDCG@10 with chunk rescoring</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-reranker-v2-base-multilingual</td><td style="border: 1px solid #ccc; padding: 4px;">0.771145</td><td style="border: 1px solid #ccc; padding: 4px;">0.764488</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cohere rerank-english-v3.0</td><td style="border: 1px solid #ccc; padding: 4px;">0.592588</td><td style="border: 1px solid #ccc; padding: 4px;">0.707842</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">.rerank-v1-elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">0.478121</td><td style="border: 1px solid #ccc; padding: 4px;">0.751994</td>
                    </tr>
              </tbody>
            </table>
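<p>For readers less familiar with the metric, here is a minimal, self-contained Python sketch of how an NDCG@10 score like those in the table above is computed from a single query’s ranked results (the graded relevance judgments below are hypothetical, not taken from MLDR):</p>

```python
import math

def dcg_at_k(rels, k=10):
    # Discounted cumulative gain: each graded relevance judgment is
    # discounted by log2(rank + 1), with rank starting at 1
    # (enumerate is 0-based, hence rank + 2 inside the log).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # Normalize by the DCG of the ideal (relevance-descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical judgments (3 = highly relevant, 0 = irrelevant) for the
# documents one reranker returned, in ranked order.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))
```

<p>Per-query scores like this are then averaged across the full query set to produce a single NDCG@10 figure per model.</p>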
          <p>It’s worth noting that the raw BM25 results without reranking had a <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain</a> (NDCG) score, or relevance score, close to 0.64. (Find additional background in this <a href="https://arxiv.org/pdf/2402.03216">paper</a>.) This means that for rerankers that perform truncation, reranked results for long documents were actually <em>worse</em> than without reranking. Note that this only applies for long documents; shorter documents that fit into the token window would not be affected by this long document problem.</p><p>Of the rerankers we evaluated, Jina was the only reranker to perform well against long documents out of the box, thanks to its sliding window approach.</p><p>We saw better baseline performance but similar overall difference in results when using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> fields with <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-elser">Elastic Learned Sparse EncodeR (ELSER)</a>.</p><p>We felt the results for truncating models were promising enough to release the chunk rescorer as an opt-in feature for models that will benefit from the additional relevance, but we recommend evaluating against specific rerankers before implementing this in production.</p><h2>ES|QL</h2><p>The real power of chunk extraction, however, lies in the <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language</a> (ES|QL). 
We wanted chunks and snippets to be first-class citizens in ES|QL so they could be easily extracted and repurposed for reranking, sending into LLM context, or other purposes.</p><p>We started by introducing the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-chunk"><code>CHUNK</code></a> function in Elasticsearch version 9.2:</p><pre><code>FROM books
| EVAL chunks = CHUNK(description) </code></pre><p><code>CHUNK</code> is a deliberately low-level primitive that takes some string content (a text field, a semantic text field, or any other row content that is a string) and chunks it. You can view and interact with these chunks, and you can also experiment with different chunking settings:</p><pre><code>FROM books
| EVAL chunks = CHUNK(description, {"strategy": "sentence", "max_chunk_size": 25, "sentence_overlap": 0}) </code></pre><p>You can then combine chunk with existing primitives, like <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/mv-functions#esql-mv_slice"><code>MV_SLICE</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/mv_expand"><code>MV_EXPAND</code></a>, to format the way chunks are represented in your row output:</p><pre><code>FROM books
| WHERE MATCH(author, "Tolkien")
| EVAL chunks = CHUNK(semantic_description, {"strategy": "sentence", "max_chunk_size": 25, "sentence_overlap": 0})
| EVAL truncated = MV_SLICE(chunks, 0, 2)
| MV_EXPAND truncated
| KEEP title, truncated</code></pre><p>This is great, but what we really wanted was to get the top matching snippets for a query, so we also introduced <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/search-functions#esql-top_snippets"><code>TOP_SNIPPETS</code></a> in Elasticsearch version 9.3:</p><pre><code>FROM books
| EVAL snippets = TOP_SNIPPETS(description, "ring")</code></pre><p>We added support to control the number of snippets you want to return and the word size using a sentence-based <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">chunking strategy</a>:</p><pre><code>FROM books
| EVAL snippets = TOP_SNIPPETS(description, "ring", { "num_snippets": 3, "num_words": 25 })
| MV_EXPAND snippets
| KEEP title, snippets</code></pre><p>This fits into the broader story of LLMs when you add in <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/completion"><code>COMPLETION</code></a>. Here is an example of how we envision <code>TOP_SNIPPETS</code> integrating with LLMs:</p><pre><code>FROM books METADATA _score
| WHERE semantic_description:"what are hobbits?"
| SORT _score DESC
| LIMIT 10
| FORK
   ( EVAL snippets = TOP_SNIPPETS(semantic_description, "what are hobbits?", {"num_snippets": 3})
| COMPLETION CONCAT("Answer what are hobbits", snippets) WITH {"inference_id": "my-completion-endpoint"})</code></pre><p>In this example, we’re performing a semantic search, but for each document we’re identifying the top snippets from that document. We send only these highly relevant snippets into the completion command, rather than the entire document. This is a simple example, but you could also use reranking here, and in the future, when multiple forks are available, hybrid search will be supported in the same format.</p><p>We can also utilize snippets in the newest version of <code>RERANK</code>:</p><pre><code>FROM books
| WHERE MATCH(title, "ring")
| EVAL snippets = TOP_SNIPPETS(semantic_description, "what are hobbits?", { "num_snippets": 3, "num_words": 25 })
| RERANK "what are hobbits?" ON  snippets WITH { "inference_id" : "my-reranker" }</code></pre><h2>What we’re thinking about next</h2><p>The story isn’t over for chunking and snippet extraction; in fact, it’s only getting started.</p><p>We’re looking at how to best integrate existing <code>semantic_text</code> chunks out of the box into strategies using chunking and snippet extraction. We’re also exploring what other features we need to make snippet extraction a compelling feature to use in products such as Elastic Agent Builder.</p><p>Overall, we’re excited to share these tools and look forward to your feedback as we evolve our strategies for getting the best context for LLMs!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improve search performance with `best_compression`]]></title>
            <description><![CDATA[While `best_compression` is typically seen as a storage-saving feature for Elastic Observability and Elastic Security use cases, this blog demonstrates its effectiveness as a performance-tuning lever for search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/improve-elasticsearch-performance-best-compression</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/improve-elasticsearch-performance-best-compression</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sherry Ger, Ryan Eno]]></dc:creator>
            <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5141a8c2618337207d848ce16b258a86885955b2-1600x1034.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When tuning Elasticsearch for high-concurrency workloads, the standard approach is to maximize RAM to keep the working set of documents in memory to achieve low search latency. Consequently, <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules"><code>best_compression</code></a> is rarely considered for search workloads, as it is primarily viewed as a storage-saving measure for Elastic Observability and Elastic Security use cases where storage efficiency takes priority.</p><p>In this blog, we demonstrate that when the dataset size significantly exceeds the OS page cache, <code>best_compression</code> improves search performance and resource efficiency by reducing the I/O bottleneck.</p><h2><strong>The setup</strong></h2><p>Our use case is a high-concurrency search application running on <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/ec-change-hardware-profile#ec-profiles-compute-optimized-arm">Elastic Cloud CPU optimized instances</a>.</p><ul><li>Data volume: ~500 million documents</li><li>Infrastructure: 6 Elastic Cloud (Elasticsearch service) instances (each instance: 1.76 TB storage | 60 GB RAM | 31.9 vCPU)</li><li>Memory-to-storage ratio: ~5% of the total dataset fits into RAM</li></ul><h2><strong>The symptoms: high latency</strong></h2><p>We observed that when the number of concurrent requests spiked around 19:00, the search latency deteriorated significantly. 
As shown in Figure 1 and Figure 2, while traffic peaked around 400 requests per minute per Elasticsearch instance, the average query service time degraded to over 60ms.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f9c6cc1882e7db24336c65c54bbc1d38dcdb7fa3-697x311.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbb705ae2fcd14c81d335bf322346caf3bf33765-996x618.png" style="width: 100%;" /><p>The CPU usage remained relatively low after the initial connection handling, indicating that compute was not the bottleneck.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cec15a28d2d22e9adedd2951bb2334b3717890a1-1494x730.png" style="width: 100%;" /><p>A strong correlation emerged between query volume and page faults. As requests increased, we observed a proportional rise in page faults, peaking around 400k/minute. This indicated that the active dataset could not fit in the page cache.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/511f2f10300a9d10ba3d7a82b9a8c8d567ac5636-1492x678.png" style="width: 100%;" /><p>Simultaneously, the JVM heap usage appeared to be normal and healthy. This ruled out garbage collection issues and confirmed the bottleneck was I/O.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/336bbad638f866304358dba1d06ee987de0f23cf-1490x568.png" style="width: 100%;" /><h2><strong>The diagnosis: I/O bound</strong></h2><p>The system was I/O bound. <a href="https://www.elastic.co/blog/elasticsearch-caching-deep-dive-boosting-query-speed-one-cache-at-a-time">Elasticsearch relies on the OS page cache to serve index data from memory</a>. When the index is too large for the cache, queries trigger expensive disk reads. 
While the typical solution is to scale horizontally (add nodes/RAM), we wanted to exhaust efficiency improvements on our existing resources first.</p><h2><strong>The fix</strong></h2><p>By default, Elasticsearch uses <a href="https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)">LZ4</a> compression for its index segments, striking a balance between speed and size. We hypothesized that switching to <code>best_compression</code> (which uses <a href="https://en.wikipedia.org/wiki/Zstd">zstd</a>) would reduce the size of indices. A smaller footprint allows a larger percentage of the index to fit in the page cache, trading a negligible increase in CPU (for decompression) for a reduction in disk I/O.</p><p>To enable <code>best_compression</code>, we reindexed the data with the index setting <code>index.codec: best_compression</code>. Alternatively, the same result could be achieved by closing the index, resetting the index codec to <code>best_compression</code>, and then performing a segment merge.</p><pre><code>POST my-index/_close
PUT my-index/_settings
{
    "index.codec": "best_compression"
}
  
POST my-index/_open  
POST my-index/_forcemerge?max_num_segments=1</code></pre><h2><strong>The results</strong></h2><p>The results confirmed our hypothesis: improved storage efficiency directly translated into a substantial boost in search performance with no accompanying increase in CPU utilization.</p><p>Applying <code>best_compression</code> reduced the index size by approximately 25%. While less than the reduction seen in repetitive log data, this 25% reduction effectively increased our page cache capacity by the same margin.</p><p>During the next load test (starting at 17:00), the traffic was even higher, peaking at 500 requests per minute per Elasticsearch node.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fc1902f396cb2115c0013155ad07f6eb87389c60-660x309.png" style="width: 100%;" /><p>Despite the higher load, the CPU utilization was lower than in the previous run. The elevated usage in the earlier test was likely due to the overhead of excessive page fault handling and disk I/O management.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/15699ef4c65b3f0a9f8a3e1bae8bb18f7b647025-819x352.png" style="width: 100%;" /><p>Crucially, page faults dropped significantly. 
Even at higher throughput, faults hovered around &lt;200k per minute, compared to &gt;300k in the baseline test.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f76ca967976d740af88a9359b66041701abb46fc-764x340.png" style="width: 100%;" /><p>Although the page fault results were still less than optimal, query service time was cut by about 50%, hovering below 30ms even under heavier load.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/750ec1c59b8eb5069aed4c066d856ecea82d5bca-620x311.png" style="width: 100%;" /><p></p><h2><strong>The conclusion: best_compression for search</strong></h2><p>For search use cases where data volume exceeds available physical memory, <code>best_compression</code> is a potent performance-tuning lever.</p><p>The conventional solution to cache misses is to scale out to increase RAM. However, by reducing the index footprint, we achieved the same goal: maximizing the document count in the page cache. Our next step is to explore <a href="https://www.elastic.co/blog/space-savings-a-lesser-known-benefit-of-index-sorting-in-elasticsearch"><strong>index sorting</strong></a> to further optimize storage and squeeze even more performance out of our existing resources.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Builder now GA: Ship context-driven agents in minutes]]></title>
            <description><![CDATA[Agent Builder is now GA. Learn how it allows you to quickly develop context-driven AI agents.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Anish Mathur, Evan Castle]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6840eb7dbb884055ab0e965dcfd614fec54936af-2210x1440.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of Agent Builder in Elastic Cloud Serverless and in the upcoming 9.3 release. Agent Builder brings the power of Elasticsearch as a context engineering platform to quickly develop contextual, data-focused AI agents.</p><p>Agents are gaining traction driven by their potential to deliver efficiency gains and better customer experiences. But in practice, providing agents with the right context is difficult, especially when operating over messy, unstructured enterprise data. Developers must manage tools, prompts, state, reasoning logic, models, and crucially retrieve relevant context from business sources to deliver accurate results and actions. Elastic Agent Builder delivers these core components to develop secure, reliable, context-driven agents.</p><h2>Agent Builder core capabilities</h2><p>Agent Builder leverages Elastic’s long-term investments in search relevance and retrieval-augmented generation, and work to make Elasticsearch the best vector database to simplify the development of contextual, data-focused AI agents.</p><p>Agent Builder allows you to:</p><ul><li>Immediately start with a built-in conversational agent that can answer questions, perform analytics and drive investigations over any data in Elasticsearch.</li><li>Quickly go from complex unstructured data to a custom agent with configuration-based development experience.</li><li>Leverage best-in-class, hybrid search relevance through built-in ES|QL or custom tools to improve context quality and agent reliability.</li><li>Execute complex workflows (preview) as reusable tools to enrich data, update records, send messages, and more for rules-based automation.</li><li>Connect to data sources outside of Elasticsearch using workflows and MCP to correlate and combine context for agents.</li><li>Integrate with any agentic or application framework using built-in and custom tools exposed over MCP, and the ability to 
connect to external MCP servers (preview), with support for A2A and full API access.</li><li>Extend the capabilities of Agent Builder through integrations with third-party solutions like LlamaIndex for complex document processing or Arcade.dev for secure, structured tool access.</li></ul><p>To further extend Agent Builder functionality, we are introducing Elastic Workflows, our new rules-based automation capabilities, now in technical preview. For organizational tasks, agents at times need the certainty and reliability of rules-based actions, which are often necessary to implement specific business logic. Elastic Workflows provides agents with a simple, declarative way to orchestrate internal and external systems to take actions, gather and transform data and context. Workflows are fully composable, event-driven, and flexible, and can be exposed as tools to an agent via MCP.</p><h2>Go from data to agent in minutes</h2><p>Developing agents can take weeks of upfront work to consolidate separate data stores, build manual pipelines, tune queries, and manage complex orchestration. 
Agent Builder reduces the time to develop agents by removing the need for separate data stores, vector databases, RAG pipelines, search layers, query translators, and tool orchestrators, allowing you to focus on agent logic and application delivery.</p><p>Agent Builder natively integrates Elasticsearch platform primitives to make agent development fast.</p><ul><li>Start with a built-in conversational agent that can immediately chat with and reason with your indexed data.</li><li>Integrate agents into applications, dashboards, or CI/CD systems with interactive access via Kibana, APIs, or MCP and A2A.</li><li>Build with default tooling to understand your data structure, select the appropriate index, generate optimized hybrid, semantic, and structured queries, and create configurable visualizations using ES|QL based on natural language prompts.</li></ul><p>To go deeper, try a complete <a href="https://www.elastic.co/search-labs/blog/ai-agent-builder-elasticsearch">hands-on walkthrough</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b55b63eae40f72952967cc8f3ea4df4cd62d7d70-1080x608.gif" style="width: 100%;" /><h2>Build on Elasticsearch, a complete data platform for context engineering</h2><p>For AI agents, context quality is essential to provide effective reasoning and reduce the risks of hallucination. For many enterprise AI agents, the business data required to perform a task is the most crucial piece of context. As a massively scalable data store, vector database, and leader in relevance, Elasticsearch already offers many strong context-engineering primitives. 
Context engineering goes beyond simple retrieval-augmented generation by allowing you to tailor and scale how data is fetched, ranked, filtered, and presented to agents, helping reduce noise and ambiguity.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/419b9b6f13739e0a8983249d8ac31478e73dac89-1600x901.png" style="width: 100%;" /><p>Elasticsearch delivers a context engine that combines lexical search, vector search, and structured filtering for retrieval that materially <a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">improves LLM performance</a> by ensuring the model operates on relevant and precise context. This capability is supported by agentic retrieval, along with built-in tools and search logic that automatically select the right indexes and transform natural language into optimized queries for context.</p><p>With Agent Builder, you can ensure agents receive the most useful context first with controls for relevance and ranking, allowing you to fine-tune scoring, ranking, and filtering logic. Elasticsearch lets you control what matters, why it matters, and how it is prioritized, instead of relying on opaque retrieval behavior. This is all underpinned by Elasticsearch as a scalable data platform to store and scale all your data from text, vectors, metadata, logs, and more on one platform, making it easier to manage context for agents.</p><h2>Execute complex workflows as reusable tools</h2><p>While AI agents enable reasoning for complex tasks, much automation depends on reliably executing rules-based actions that enforce specific business logic. Elastic Workflows provides a simple, declarative way to orchestrate internal and external systems to take actions, gather context or data, and integrate them as part of agents. Defined in YAML, workflows are fully composable, allowing them to be as simple or as complex as the job requires. 
This gives agents an efficient way to take action across the Elasticsearch platform and solutions, as well as with third-party applications.</p><p>Integrating a workflow with Agent Builder can be done in three steps (prerequisite: enable workflows with details provided <a href="https://github.com/elastic/workflows">here</a>).</p><p>1. Create and save a new workflow using the simple YAML-based editor with built-in autocomplete and testing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/308888bf3d2fa013f9391a55be6a6fbd458b6dac-1600x998.png" style="width: 100%;" /><p>2. Create a new tool in Agent Builder with type “Workflow” and provide a description to help the agent determine when to use the workflow tool.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c04810d30d226112c3610bd58e208607b213fc3d-1600x945.png" style="width: 100%;" /><p>3. Add the workflow tool to your custom agent.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/724cd4ac93c46efb0d339fd140e5caf138f8150f-1600x948.png" style="width: 100%;" /><p>That’s it! Now the agent can call the workflow from within a conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8dfdd726ab89e31c48b79372650ce33946713dca-1600x929.png" style="width: 100%;" /><h2>Your agent, your rules</h2><p>Agent Builder doesn’t lock you into a single development paradigm. Instead, it’s designed to enable open, flexible development approaches for agents with full control of data, relevance, models, interoperability, security, and agent design.</p><p>Custom agent definitions let you choose exactly which tools an agent can access, embed custom system prompts, tailor the agent’s instructions, and define security boundaries. 
Agents remain model-agnostic, allowing you to flexibly configure a preferred LLM, whether native to Elastic or from the broader ecosystem, without being locked into a single provider.</p><p>Build extensible tools that encapsulate domain-specific logic (e.g., specific index filters, ES|QL joins, analytical pipelines), and constrain them for safe use in production. Full API support enables interoperability with other agentic frameworks, with native support for Model Context Protocol (MCP). A2A integration means you can expose your Elastic agents to other frameworks, services, and client apps, reusing the same data and context engineering logic across integrations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e903ba24ffb3f40231e901f63bd494c89cb7757-1600x1004.png" style="width: 100%;" /><p>Agent Builder supports flexible, open development and is designed to integrate easily with popular agent frameworks and platforms. These integrations can be essential to delivering effective agents. As <strong>Sam Partee, Co-Founder at Arcade.dev</strong> describes,</p><p><em>“Agentic systems fail today because connecting AI to tools and data is complex. Elastic Agent Builder with Arcade.dev gives developers a structured, secure way to handle how agents retrieve context, reason, and act, taking agents from demo to production grade.”</em></p><p>Agent Builder also leverages the extensibility of Elasticsearch for handling complex data. As <strong>Jerry Liu, CEO at LlamaIndex</strong> describes,</p><p><em>“Unlocking enterprise context from unstructured data sources is key to building effective agents. Elastic Agent Builder combined with LlamaIndex’s complex document processing strengthens the critical context layer, helping teams retrieve, process, and prepare data so agents can reason more accurately and deliver better outcomes.”</em></p><h2>What can you build?</h2><p>Agent Builder is already being used for a variety of use cases. 
Below are a few examples and reference architectures to get started with agents:</p><ul><li><strong>Automate infrastructure: </strong>In support scenarios, agents have been used to read, think, and chat, but to date, they cannot reach out and touch the infrastructure they may need to manage. Elastic’s engineering team built an agent for <a href="https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure">automated infrastructure management</a> as part of a hackathon. The agent actively investigates issues with application infrastructure and takes automated actions. It uses workflows to optimize configurations, respond to issues and scale resources, all based on an intelligent understanding of infrastructure logs.</li><li><strong>Security threat analysis: </strong>A security vulnerability agent was developed with Elastic Agent Builder, MCP, and Elasticsearch. It automates threat analysis by correlating internal security data with external threat intelligence. The agent performs semantic search over historical incidents and configurations, augments results with live internet data, and applies LLM reasoning to assess environmental relevance, prioritize risks, and produce actionable remediation. See the <a href="https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch">reference architecture</a><strong>.</strong></li><li><strong>Technical customer support: </strong>Agents can perform multiple support tasks, including case summarization, issue deduplication and creation, and deep technical investigation. Agent Builder enables this with multi-step, hybrid search to find only the most relevant related issues, solutions, and procedures, and formulate root cause hypotheses and remediation plans. 
Agent Builder can simplify the architecture of complex <a href="https://www.elastic.co/blog/generative-ai-customer-support-elastic-support-assistant">support systems</a> and accelerate time to delivery.</li><li><strong>Product and content discovery:</strong> Agent Builder simplifies the process of <a href="https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder">exposing complex product catalogs for conversational experiences</a>, while allowing organizations to maintain flexibility to include their own business logic and requirements.</li><li><strong>Build your own:</strong> Join the <a href="https://elasticsearch.devpost.com/">Agent Builder Hackathon</a>, running from January 22 to February 27, 2026. Work with the community to build context-driven, multi-step AI agents that combine search, workflows, tools, and reasoning to automate real-world tasks.*</li></ul><h2>Start building custom agents now</h2><p>Get started with an <a href="https://cloud.elastic.co/registration?onboarding_token=search&amp;pg=en-enterprise-search-page">Elastic Cloud Trial</a>, and check out the documentation <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">here</a>. For existing customers, Agent Builder is available in Cloud Serverless and on the Enterprise Tier in Elastic Cloud Hosted and self-managed.</p><p>* <a href="https://elasticsearch.devpost.com/rules">Click here</a> for full terms, conditions, and eligibility requirements for the hackathon.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building voice agents with Elastic Agent Builder]]></title>
            <description><![CDATA[Exploring how voice agents work and how to build one using Elastic Agent Builder and LiveKit.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kenneth Kreindler]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43ceabb9e2c0966261c188bd40e03178d5a91e5c-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>AI has been trapped in a glass box. You type commands, it responds with text, and that’s it. It’s useful but distant, like watching someone move behind a screen. This year, 2026, will be the year when businesses will shatter this glass and bring AI agents into products, where they really deliver value.</p><p>One of the ways the glass will be shattered is by the adoption of <em>voice agents</em>, which are AI agents that recognize human speech and synthesize computer-generated audio. With the rise of low-latency transcriptions, fast large language models (LLMs), and text-to-speech models that sound human, this has become possible.</p><p>Voice agents also need access to business data to become really valuable. In this blog, we’ll learn how voice agents work and build one for ElasticSport, a fictitious outdoor sports equipment shop, using <a href="https://livekit.io/">LiveKit</a> and <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>. Our voice agent will be context-aware and will work with our data.</p><h2>How it works</h2><p>There are two paradigms in the world of voice agents: The first uses speech-to-speech models, and the second uses a voice pipeline consisting of speech-to-text, LLM, and text-to-speech. Speech-to-speech models have their own benefits, but voice pipelines offer much more customization over the technologies used and how context is managed, along with control over the agent’s behavior. We’ll focus on the voice pipeline model.</p><h3>Key components</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b237501903f9c3a71fe1b7755c3990e40c5495c8-1600x653.png" style="width: 100%;" /><h4>Transcription (speech-to-text)</h4><p>Transcription is the voice pipeline entrypoint. The transcription component takes as input raw audio frames, transcribes speech into text, and outputs that text. 
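As a rough sketch of this stage's contract, a streaming transcriber consumes audio frames and emits interim transcript text. The class below is a toy stand-in, not a real STT client; real providers push interim and final transcript events over a socket, and each "frame" here is pretend-decoded directly so only the interface is visible.

```python
class StreamingTranscriber:
    """Toy stand-in for a streaming speech-to-text client (frames in, text out)."""

    def __init__(self):
        self._parts = []

    def push_frame(self, frame: bytes) -> str:
        # Pretend decoding: a real client would send the frame to the
        # provider and receive interim transcript text asynchronously.
        self._parts.append(frame.decode("utf-8"))
        return " ".join(self._parts)  # interim transcript so far

    def finalize(self) -> str:
        """Return the final transcript once the turn has ended."""
        text = " ".join(self._parts)
        self._parts.clear()
        return text

stt = StreamingTranscriber()
for frame in [b"do", b"you", b"stock", b"crampons"]:
    interim = stt.push_frame(frame)
final = stt.finalize()
```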
The transcribed text is buffered until the system detects that the user&#x27;s speech has ended, at which point the LLM generation is kicked off. Various third-party providers offer low-latency transcriptions. When selecting one, consider latency and transcription accuracy, and ensure that it supports streamed transcripts.</p><p>Examples of third-party APIs: <a href="https://www.assemblyai.com/">AssemblyAI</a>, <a href="https://deepgram.com/product/speech-to-text">Deepgram</a>, <a href="https://platform.openai.com/docs/guides/realtime-transcription">OpenAI</a>, <a href="https://elevenlabs.io/speech-to-text">ElevenLabs</a></p><h4>Turn detection</h4><p>Turn detection is the component of the pipeline that detects when the speaker has finished speaking and the generation should begin. One common way to do this is through a voice activity detection (VAD) model, such as <a href="https://github.com/snakers4/silero-vad">Silero VAD</a>. VAD uses audio energy levels to detect when audio contains speech and when speech has ended. However, VAD by itself cannot identify the difference between a pause and the end of speech. This is why it’s often combined with an end-of-utterance model that predicts whether the speaker has finished speaking, based on the interim transcript or raw audio.</p><p>Examples (Hugging Face): <a href="https://huggingface.co/livekit/turn-detector">livekit/turn-detector</a>, <a href="https://huggingface.co/pipecat-ai/smart-turn-v3">pipecat-ai/smart-turn-v3</a></p><h4>Agent</h4><p>The agent is the core of a voice pipeline. It’s responsible for understanding intent, gathering the right context, and formulating a reply in text format. 
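The energy-based turn detection described above can be illustrated with a deliberately simplified threshold detector. Real VADs such as Silero are learned models, and the energies and thresholds below are invented; the sketch only shows why a short silent gap (a pause) is not treated as the end of a turn.

```python
def detect_end_of_turn(frame_energies, speech_threshold=0.3,
                       silence_frames_needed=5):
    """Return the frame index where the turn ends, or None.

    A frame counts as speech when its energy exceeds the threshold.
    The turn ends after `silence_frames_needed` consecutive silent
    frames *following* some speech; a shorter silent gap is treated
    as a pause, which is exactly the ambiguity that end-of-utterance
    models exist to resolve.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy > speech_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None

# Speech, a short pause (not a turn end), more speech, then real silence.
energies = [0.8, 0.9, 0.1, 0.1, 0.7, 0.6] + [0.05] * 6
end = detect_end_of_turn(energies)
```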
<a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>, with its built-in reasoning capabilities, tool library, and workflow integration, makes for an agent that can work on top of your data and interact with external services.</p><h4>LLM (text-to-text)</h4><p>When selecting an LLM for Elastic Agent Builder, there are two main characteristics to consider: LLM reasoning benchmarks and time to first token (TTFT).</p><p>Reasoning benchmarks indicate how well the LLM is able to generate correct responses. Benchmarks to consider are ones that evaluate multiturn conversation adherence and intelligence benchmarks, such as MT-Bench and the Humanity&#x27;s Last Exam dataset, respectively.</p><p>TTFT benchmarks evaluate how fast the model produces its first output token. There are other types of latency benchmarks, but TTFT is particularly important for voice agents, as audio synthesis can begin as soon as the first token is received, resulting in lower latency between turns and a more natural-feeling conversation.</p><p>Usually, one needs to make a trade-off between these two characteristics because faster models often perform worse on reasoning benchmarks.</p><p>Examples (Hugging Face): <a href="https://huggingface.co/openai/gpt-oss-20b">openai/gpt-oss-20b</a>, <a href="https://huggingface.co/openai/gpt-oss-120b">openai/gpt-oss-120b</a></p><h4>Synthesis (text-to-speech)</h4><p>The final part of the pipeline is the text-to-speech model. This component is responsible for converting the text output from the LLM into audible speech. Similar to the LLM, latency is a characteristic to look out for when selecting a text-to-speech provider. Text-to-speech latency is measured by time to first byte (TTFB). That’s the time it takes for the first audio byte to be received. 
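Both TTFT and TTFB can be measured the same way: timestamp the first item out of a streaming response. A sketch against a stub token stream follows; the stub generator is a stand-in, not a real provider API, and its fixed sleeps only simulate generation delay.

```python
import time

def measure_streaming_latency(stream):
    """Consume a token/byte stream; return (first_item_latency, total_latency, items).

    For an LLM token stream the first-item latency is TTFT; for a
    text-to-speech byte stream it is TTFB.
    """
    start = time.monotonic()
    first = None
    items = []
    for item in stream:
        if first is None:
            first = time.monotonic() - start  # time to first token/byte
        items.append(item)
    total = time.monotonic() - start
    return first, total, items

def stub_llm_stream():
    # Stand-in for a streaming completion; real TTFT is dominated by
    # the model and network, not a fixed sleep.
    for token in ["Sure,", " we", " stock", " hiking", " boots."]:
        time.sleep(0.01)
        yield token

ttft, total, tokens = measure_streaming_latency(stub_llm_stream())
```

Because synthesis can start as soon as the first token arrives, lowering TTFT directly shortens the silence between conversation turns.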
Lower TTFB also reduces turn latency.</p><p>Examples: <a href="https://elevenlabs.io/text-to-speech-api">ElevenLabs</a>, <a href="https://cartesia.ai/sonic">Cartesia</a>, <a href="https://www.rime.ai/">Rime</a></p><h4>Constructing the voice pipeline</h4><p>Elastic Agent Builder can be integrated into a voice pipeline at several different levels:</p><ol><li>Agent Builder tools only: speech-to-text → LLM (with Agent Builder tools) → text-to-speech</li><li>Agent Builder as an MCP: speech-to-text → LLM (with Agent Builder access via MCP) → text-to-speech</li><li>Agent Builder as the core: speech-to-text → Agent Builder → text-to-speech</li></ol><p>For this project, I chose the Agent Builder as the core approach. With this approach, the full functionality of Agent Builder and workflows can be used. The project uses LiveKit to orchestrate speech-to-text, turn detection, and text-to-speech, and it implements a custom LLM node that integrates directly with Agent Builder.</p><h2>Elastic support voice agent</h2><p>We’ll be building a custom support voice agent for a fictitious sports shop called ElasticSport. Customers will be able to call the help line, ask for product recommendations, find product details, check order statuses, and have order information sent to them via text. To achieve this, we first need to configure a custom agent and create tools for executing Elasticsearch Query Language (ES|QL) queries and workflows.</p><a href="https://videos.elastic.co/watch/jpX6D21URZuh3PtXVV7Ks6">Watch video on Vidyard</a><h3>Configuring the agent</h3><h4>Prompt</h4><p>The prompt instructs the agent what personality it should take and how to respond. Importantly, there are a few voice-specific prompts that ensure the responses are synthesized into audio properly and misunderstandings are recovered from gracefully.</p><pre><code>You are a Sales Assistant at ElasticSport, an outdoor sport shop specialized in hiking and winter equipment. 

[Profile]
- name: Iva
- company: ElasticSport
- role: Sales Assistant
- language: en-GB
- description: ElasticSport virtual sales assistant

[Context]
- Ask clarifying questions to understand the context.
- Use available tools to answer the user's question.
- Use the knowledge base to retrieve general information.

[Style]
- Be informative and comprehensive.
- Maintain a professional, friendly and polite tone.
- Mimic human behavior and speech patterns.
- Be concise. Do not over-explain initially.

[Response Guideline]
- Present dates in spelled-out month date format (e.g., January fifteenth, two thousand and twenty-four).
- Avoid the use of unpronounceable punctuation such as bullet points, tables, emojis.
- Respond in plain text, avoid any formatting.
- Spell out numbers as words for more natural-sounding speech.
- Respond in short and concise sentences. Responses should be 1 or 2 sentences long.

[ERROR RECOVERY]
### Misunderstanding Protocol
1. Acknowledge potential misunderstanding
2. Request specific clarification</code></pre><h4>Workflows</h4><p>We’ll add a small workflow to send an SMS through Twilio’s messaging API. The workflow will be exposed to the custom agent as a tool, resulting in a user experience where the agent can send the caller an SMS while on the call. This allows the caller to, for example, ask, “Can you send more details about <em>X</em> over text?”</p><pre><code>name: send sms
enabled: true
triggers:
  - type: manual
inputs:
  - name: message
    type: string
    description: The message to send to the phone number.

  - name: phone_number
    type: string
    description: The phone number to send the message to.

consts:
  TWILIO_ACCOUNT: "****"
  BASIC_AUTH: "****"
  FROM_PHONE_NUMBER: "****"
steps:
  - name: http_step
    type: http
    with:
      url: https://api.twilio.com/2010-04-01/Accounts/{{consts.TWILIO_ACCOUNT}}/Messages.json
      method: POST
      headers:
        Content-Type: application/x-www-form-urlencoded
        Authorization: Basic {{consts.BASIC_AUTH | base64_encode}}
      body: From={{consts.FROM_PHONE_NUMBER}}&To={{inputs.phone_number}}&Body={{inputs.message}}
      timeout: 30s</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b057e71b0a7c50eb3da47cd4f95e77ec7b4c6126-1600x1245.png" style="width: 100%;" /><h4>ES|QL tools</h4><p>The following tools allow the agent to provide relevant responses that are grounded in real data. The example repo contains a setup script to initialize Kibana with product, order, and knowledge base datasets.</p><ul><li><strong>Product.search</strong></li></ul><p>The product dataset contains 65 fictitious products. This is an example document:</p><pre><code>{
      "sku": "ort3M7k",
      "name": "Ortovox Free Rider 26 Backpack",
      "price": 189,
      "currency": "USD",
      "image": "https://via.placeholder.com/150",
      "description": "The Ortovox Free Rider 26 is a technical freeride backpack with a dedicated safety compartment and diagonal ski carry system. Perfect for backcountry missions.\n\nKey Features:\n- 26L capacity\n- Diagonal ski carry system\n- Safety equipment compartment\n- Helmet holder\n- Hydration system compatible",
      "category": "Accessories",
      "subCategory": "Backpacks",
      "brand": "Ortovox",
      "sizes": ["One Size"],
      "colors": ["Black", "Blue", "Orange"],
      "materials": ["Nylon", "Polyester"]
    }</code></pre><p>The name and description fields are mapped as <code>semantic_text</code>, enabling the LLM to use semantic search via ES|QL to retrieve relevant products. The hybrid search query performs semantic matching across both fields, with a slightly higher weight applied to matches on the name field using a boost.</p><p>The query first retrieves the top 20 results ranked by their initial relevance score. These results are then reranked based on their description field using the <code>.rerank-v1-elasticsearch</code> inference model, and finally trimmed down to the top five most relevant products.</p><pre><code>type: ES|QL
toolId: products.search
description: Use this tool to search through the product catalogue by keywords.
query: |
    FROM products
        METADATA _score
      | WHERE
          MATCH(name, ?query, {"boost": 0.6}) OR
            MATCH(description, ?query, {"boost": 0.4})
      | SORT _score DESC
      | LIMIT 20
      | RERANK ?query
            ON description
            WITH {"inference_id": ".rerank-v1-elasticsearch"}
      | LIMIT 5

parameters:
    query: space separated keywords to search for in catalogue</code></pre><ul><li><strong>Knowledgebase.search</strong></li></ul><p>The knowledgebase datasets contain documents of the following shape, where the title and content fields are stored as semantic text:</p><pre><code>{
        id: "8273645",
        createdAt: "2025-11-14",
        title: "International Orders",
        content: `International orders are processed through our international shipping partner. Below are the countries we ship to and average delivery times.
        Germany: 3-5 working days
        France: 3-5 working days
        Italy: 3-5 working days
        Spain: 3-5 working days
        United Kingdom: 3-5 working days
        United States: 3-5 working days
        Canada: 3-5 working days
        Australia: 3-5 working days
        New Zealand: 3-5 working days
        `
}</code></pre><p>And the tool uses a similar query as the <code>product.search</code> tool:</p><pre><code>type: "ES|QL"
toolId: knowledgebase.search
description: Use this tool to search the knowledgebase.
query: |
  FROM knowledge_base
    METADATA _score
  | WHERE
      MATCH(title, ?query, {"boost": 0.6}) OR
      MATCH(content, ?query, {"boost": 0.4})
  | SORT _score DESC
  | LIMIT 20
  | RERANK ?query
      ON content
      WITH {"inference_id": ".rerank-v1-elasticsearch"}
  | LIMIT 5

parameters:
  query: space separated keywords or natural language phrase to semantically search for in the knowledge base</code></pre><ul><li><strong>Orders.search</strong></li></ul><p>The final tool that we’ll add is the one used to retrieve orders by <code>order_id</code>:</p><pre><code>type: "ES|QL"
toolId: order.search
description: Use this tool to retrieve an order by its ID.
query: |
  FROM orders
    METADATA _score
  | WHERE order_id == ?order_id
  | SORT _score DESC
  | LIMIT 1

parameters:
  order_id: "the ID of the order"</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d22bdd540a95b5a9c2bd5f308620835e8e6f7ecb-1600x1361.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d91709a50fb5391876b714885242d998b2b21027-1600x1443.png" style="width: 100%;" /><p>After configuring the agent and attaching these workflows and ES|QL tools to the agent, the agent can be tested inside of Kibana.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8691f41624247a6b1352d158c970031e1426ce5e-1600x1056.png" style="width: 100%;" /><p>Aside from building an ElasticSport support agent, the agent, workflows, and tools can be tailored to other use cases, such as a sales agent that qualifies leads, a servicing agent for home repairs, reservations for a restaurant, or an appointment-scheduling agent.</p><p></p><p>The final part is linking up the agent we just made with LiveKit, text-to-speech, and speech-to-text models. The repo linked at the end of this blog contains a custom Elastic Agent Builder LLM node that can be used with LiveKit. Just replace the <code>AGENT_ID</code> with your own, and link it with your Kibana instance.</p><h2>Getting started</h2><p>Check out the code, and try it for yourself <a href="https://github.com/KDKHD/elastic_agent_builder_livekit">here</a>. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Builder, beyond the chatbox: Introducing Augmented Infrastructure]]></title>
            <description><![CDATA[Learn about Elastic Agent Builder with Augmented Infrastructure, an AI agent that enables augmented operations, augmented development, and augmented synthetics.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Wert, Bill Easton, Gil Raphaelli, Graham Hudgins, Norrie Taylor]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e08daf78f328e826f39d06329f6a5487f75d178d-1272x700.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><strong>This is not talk. We&#x27;re doing it.</strong></p><p>We’ve all seen the rise of AI agents. They’re fantastic at summarizing text, writing code snippets, and answering questions based on documentation. But for those of us in DevOps and site reliability engineering (SRE), there has been a frustrating limitation. Most agents are trapped in the Call Center paradigm, meaning that they can read, think, and chat, but they cannot reach out and touch the infrastructure they’re supposed to be managing.</p><p>For our latest hackathon project, we decided to blow that limitation up.</p><p>We built <strong>Augmented Infrastructure</strong>: an infrastructure copilot that not only gives you advice but also creates, deploys, monitors, and fixes your live environment.</p><h2><strong>The problem: Copy, reformat, paste</strong></h2><p>Standard agents operate in a vacuum. If your app goes down and costs the company $5 million, a standard agent can read you the runbook on how to fix it. But <em>you</em> still have to do the work. You’re left to copy the code, reformat it for your environment, and paste it into your terminal.</p><p>We wanted an agent that understands the difference between <em>talking</em> about Kubernetes and <em>configuring</em> Kubernetes.</p><h2><strong>The engine: What is Elastic Agent Builder?</strong></h2><p>To build this, we didn&#x27;t start from scratch. We built it on top of <a href="https://www.elastic.co/elasticsearch/agent-builder"><strong>Elastic Agent Builder</strong></a>. For those unfamiliar, Elastic Agent Builder is a framework designed to rapidly develop agents, and it acts as the bridge between a large language model (LLM) and private data stored in Elasticsearch; in our demo, we used Google Gemini.</p><p>Agent Builder can be used for conversational AI by grounding it in internal data, like documents or logs. But its most powerful feature is the ability to assign <strong>tools</strong>. 
These tools allow the LLM to step outside of the chat interface to perform specific tasks. We realized that if we pushed this feature to its limit, we could transform Agent Builder into an automation powerhouse.</p><h2><strong>Making it work: Building the first version</strong></h2><p>When we started on the project, we knew we wanted to make the agents be able to change the outside world. We had an idea: What if we built some “runner” software (to run any command the agent could think of on the host)? And then: What if the runners, Elastic Agent Builder, and the user were in a three-way call?</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8dc8317c1301b8eb7b89438529e8d8d17411c95a-1024x559.png" style="width: 100%;" /><p>We started by building out a Python project, Augmented Infrastructure Runners, which was essentially a <code>while(true)</code> loop that queried the Elastic Agent Builder conversations API every second and checked for a special syntax we had created:</p><pre><code>{
  "tool_name": "my_tool",
  "tool_arguments": "{stringified JSON arguments}"
}</code></pre><p>We then updated the prompt to teach it about our new tool calling syntax. Bill is a maintainer of <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, the most popular framework for building Model Context Protocol (MCP) servers in Python. He set out to work using FastMCP client with this new runner software to mount MCP servers and make their tools available to the runner. When the agent saw this, it would run the tool call and it would <code>POST</code> the results back to the conversation as if the user had sent the results. This triggered the LLM to respond to the result, and off we went!</p><p>This was great but it had two main problems:</p><ol><li>The agent would spew all of this JSON right into the conversation with the user.</li><li>The earliest point in time when messages were visible through the conversations API was when a conversation round was completed (that is, when the LLM replied).</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7b3755aeae17722ff1bb9677712293e9195f96a0-1058x1034.png" style="width: 100%;" /><p>So we set out to figure out how to move this into the background.</p><p>We then switched to giving the agent a tool called <code>call_external_tool</code> with two arguments: the <code>tool_name</code> and the stringified JSON tool arguments. This external tool call would return nothing, but importantly, would be visible in the <code>GET</code> request to the conversations API. We then gave the runners permission to write documents directly to Elasticsearch, which the Elastic Agent Builder agent could retrieve as needed. The agent is always operating in response to a user message, so we need to kick-start the agent with a user message so that it will go look for results and continue processing. 
So we had the runners insert a small message into the chat to resume the conversation:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/61ff59a57c68ed5fad492d19c0580644113a507d-1600x1321.png" style="width: 100%;" /><p>So now we had external tool calls. However, because of the second problem mentioned above, we had to get rid of that final kick-start part. Otherwise, every external tool call required a full conversation round to retrieve the results!</p><h2><strong>Making it great: Introducing workflows</strong></h2><p>In addition to Elasticsearch Query Language (ES|QL) and index search tool calls, Agent Builder agents can call Elastic workflow-based tools. Elastic workflows provide a flexible, easy-to-manage way to execute arbitrary sequences of actions and logic. For our purposes, all we need the workflow to do is store an external tool request to Elasticsearch and return an ID to poll the results for. This results in the following simple workflow definition:</p><pre><code>name: ai-tool-call
enabled: true
triggers:
  - type: manual
inputs:
  - name: runner_id
    type: string
  - name: tool_calls
    type: string

steps:
  - name: store_request
    type: elasticsearch.create
    with:
      index: distributed-tool-requests
      id: "{{inputs.runner_id}}_{{ execution.id }}"
      document:
        request_id: "{{ execution.id }}"
        runner_id: "{{inputs.runner_id}}"
        tool_call: "{{inputs.tool_calls}}"
        status: "unhandled"

  - name: output_result
    type: console
    with:
      message: "Called tool, with execution id: {{ execution.id }}. Use this ID to poll the results."</code></pre><p>With that, instead of relying on the tool call request being written into the conversation, the runners can just poll the Elasticsearch <code>distributed-tool-requests</code> index for new external tool requests and report the results back into another Elasticsearch index with the provided <code>execution.id</code>.</p><p>This eliminates the two main issues mentioned above:</p><ol><li>The conversation history isn’t cluttered with the payload for the external tool calls anymore.</li><li>As the runners are polling the Elasticsearch index instead of the conversation history, they aren’t blocked by the conversation round to be completed for the external tool requests to become visible.</li></ol><p>The second point has the great advantage that processing of the external tool calls starts within the agent’s thinking phase (rather than when the conversation round has been completed). This allows us to instruct the LLM in the system prompt to poll for the external tool results until the results are available and eliminates the need for the kick-start message. Overall, this has the nice effect that the conversation feels more natural: The LLM can process multiple external tool requests within a single conversation round (instead of requiring one conversation round per tool request) and, thus, can accomplish more complex user requests in one go.</p><h2><strong>Putting it all together</strong></h2><p>To bridge the gap between the LLM and the server rack, we developed a specific architecture using Agent Builder’s tool capabilities:</p><ol><li><strong>Augmented Infrastructure runners:</strong> We deployed lightweight runners inside the target environments (servers, Kubernetes clusters, cloud accounts). 
These runners are connected directly to Elastic, using secured endpoints and secrets only available to each of the runners.</li><li><strong>ES|QL retrieval:</strong> The copilot uses Elastic’s <strong>ES|QL</strong> to perform hybrid searches. It doesn&#x27;t just search for knowledge; it searches for <em>capabilities</em>. It queries the connected runners to see which tools are available (for example, <code>list_ec2_instances</code>, <code>install_helm_chart</code>).</li><li><strong>Workflow execution:</strong> Once the agent decides on a course of action, it creates a structured workflow.</li><li><strong>Feedback loop:</strong> The runners execute the command locally and report the results back into Elasticsearch. The copilot reads the result from the index and decides the next step.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/76be256da722c1965971fc506502768bd890f0c4-1290x1076.png" style="width: 100%;" /><h2><strong>The demo: From outage to observability</strong></h2><a href="https://videos.elastic.co/watch/bRgPJfJLeABjPMMhvy1sbB">Watch video on Vidyard</a><p>In the video, we showcased two distinct scenarios demonstrating the power of this architecture.</p><h3><strong>Scenario 1: DevOps rescue</strong></h3><p>We started with a user panicking about a $5-million outage caused by a blind spot in their Kubernetes cluster.</p><ul><li><strong>The request:</strong> &quot;How do I make sure this doesn&#x27;t happen again?&quot;</li><li><strong>The action:</strong> The agent didn&#x27;t just provide a tutorial. 
It identified the cluster, created the necessary namespaces, generated Kubernetes secrets, installed the OpenTelemetry Operator, and instantly provided a link to a live APM dashboard.</li><li><strong>The result:</strong> Full Kubernetes observability and application insights without the user writing a single line of YAML.</li></ul><h3><strong>Scenario 2: Security handoff</strong></h3><p>A fundamental rule of infrastructure security is that you cannot protect what you cannot see. While performing our DevOps rescue, the agent sees an opportunity to improve the security of the environment.</p><p>With an alert kicked off from a previous Elastic Observability–related investigation, we demonstrate how a security practitioner can chat directly with their infrastructure: first, to enumerate the assets and resources in their cloud environment; and second, to deploy the tools necessary to ensure that the environment is secured.</p><ul><li><strong>Discovery:</strong> The copilot enumerated AWS resources for the security practitioner and identified a critical gap: an Amazon Elastic Compute Cloud (EC2) instance and an Amazon Elastic Kubernetes Service (EKS) cluster with public endpoints missing endpoint protection.</li><li><strong>Remediation:</strong> With a simple approval, the copilot deployed <strong>Elastic Security</strong> <strong>extended detection and response (XDR) and cloud detection and response (CDR)</strong> to the vulnerable assets, securing the environment in real time.</li><li><strong>The Result:</strong> Protection of deployed AWS assets and resources with complete runtime security.</li></ul><h2><strong>The future: Augmented everything</strong></h2><p>This project proves that Elastic Agent Builder can be the central brain for distributed operations. We aren&#x27;t limited to just infrastructure. 
Our runner technology can power:</p><ul><li><strong>Augmented synthetics:</strong> Diagnosing TLS errors across global runners.</li><li><strong>Augmented development:</strong> Creating pull requests and implementing CAPTCHAs on frontend services.</li><li><strong>Augmented operations:</strong> Automatically reconfiguring DNS resolvers during an outage.</li></ul><h2><strong>Try it yourself</strong></h2><p>We believe the future of AI isn&#x27;t just about chat support; it&#x27;s about <strong>Augmented Infrastructure</strong>. It’s about having a partner that can deploy, fix, observe, and protect alongside you.</p><p>Check out the code and try it for yourself with distributed runners (<a href="https://github.com/strawgate/augmented-infrastructure">GitHub</a>) plus Elastic Agent Builder on <a href="https://cloud.elastic.co/">Elastic Cloud Serverless</a> today!</p><ul><li>Create a serverless project on Elastic Cloud.</li><li>Deploy the code to a runner.</li><li>Set up the runner.</li><li>Configure your mcp.json.</li><li>Start the runner, which will create your agent and its tools automatically.</li><li>Chat with an agent that can reason, plan, and execute actions on your distributed runners!</li></ul><p><strong>The team: </strong><em>Alex, Bill, Gil, Graham, &amp; Norrie</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring LLM inference and Agent Builder with OpenRouter]]></title>
            <description><![CDATA[Learn how to monitor LLM usage, costs, and performance across Agent Builder and inference pipelines using OpenRouter's OpenTelemetry broadcast and Elastic APM.]]></description>
            <link>https://www.elastic.co/search-labs/blog/llm-monitoring-openrouter-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/llm-monitoring-openrouter-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d9744a1f78863313de34472dff4696ea3cf82930-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>New models are released every week that surpass previous ones in intelligence, speed, or cost. This makes vendor lock-in risky and managing multiple connectors, billing accounts, and APIs unnecessarily complex. Each model behaves differently in terms of token consumption, response latency, and compatibility with specific tool sets.</p><p>In this article, we&#x27;ll build an AI-enriched audio products catalog, connect it to Elastic Agent Builder, and use OpenRouter to access different models while monitoring their performance across the entire workflow, from data ingestion to agent interactions.</p><h2>Prerequisites</h2><ul><li>Elastic Cloud 9.2 or Elastic Cloud Serverless</li><li>Integration server with APM enabled</li><li>OpenRouter account and <a href="https://openrouter.ai/docs/api/reference/authentication">API Key</a></li><li>Python 3.9+</li></ul><h2>What is OpenRouter?</h2><p><a href="https://openrouter.ai/">OpenRouter</a> is a platform that unifies access to over 500 models from multiple providers through a single account and API. Instead of managing separate accounts for OpenAI, Anthropic, Google, and others, you access all of them through OpenRouter.</p><p>OpenRouter handles load balancing across providers, automatically routing requests to the provider with the best latency and fewest errors. You can also manually select providers or configure fallback chains. OpenRouter is compatible with standard APIs, code assistants, integrated development environments (IDEs), and more.</p><p>One key feature is <a href="https://openrouter.ai/docs/guides/features/broadcast/overview">Broadcast</a>, which sends traces of your model usage to external observability systems. 
Since OpenRouter supports OpenTelemetry, we can monitor our complete pipeline plus any other OpenRouter usage costs in the Elastic Stack.</p><h2>Architecture overview</h2><p>We&#x27;ll use an audio products catalog for which we generate new fields with AI using an inference ingest pipeline and then create an agent that can answer questions based on the indexed product data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/03f9811dcd883b3c3302079369e690fc4d09350f-1156x491.png" style="width: 100%;" /><p>When ingesting data, the ingest pipeline uses an OpenRouter <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">inference endpoint</a> to generate new attribute fields based on the unstructured descriptions of the products, which triggers OpenRouter to send logs about that inference to Elasticsearch.</p><p>Similarly, when chatting with the <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Agent Builder</a> that uses this data, logs are also sent to Elasticsearch for visualization.</p><p>We’ll use separate OpenRouter API keys for Agent Builder and ingestion:</p><ul><li><code>OPENROUTER_API_KEY</code> for Agent Builder Interactions</li><li><code>OPENROUTER_INGESTION_KEY</code> for the inference pipeline</li></ul><p>This allows us to differentiate traffic in monitoring dashboards and attribute costs to specific workflows.</p><h2>Setup</h2><p>First, we need to create an <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/gen-ai-connectors">AI connector</a> for the agent to interact with the large language model (LLM) and an inference endpoint for the ingest pipeline to extract fields from descriptions. Both connect to OpenRouter using the same API (but can use different keys for monitoring separation).</p><h3><strong>Create the AI connector</strong></h3><p>The AI connector allows Agent Builder to communicate with LLMs. 
We configure it to use OpenRouter as the provider:</p><pre><code>import requests
import os

ELASTIC_URL = os.environ["ELASTIC_URL"]  # required below for the Elasticsearch client
KIBANA_URL = os.environ["KIBANA_URL"]
ELASTIC_API_KEY = os.environ["ELASTIC_API_KEY"]
OPENROUTER_API_KEY = os.environ["OPENROUTER_AGENT_KEY"]
OPENROUTER_INGESTION_KEY = os.environ.get("OPENROUTER_INGESTION_KEY", OPENROUTER_API_KEY)

# Create AI Connector for Agent Builder
connector_payload = {
    "name": "OpenRouter Agent Connector",
    "connector_type_id": ".gen-ai",
    "config": {
        "apiProvider": "Other",
        "apiUrl": "https://openrouter.ai/api/v1/chat/completions",
        "defaultModel": "openai/gpt-5.2",
        "enableNativeFunctionCalling": True
    },
    "secrets": {
        "apiKey": OPENROUTER_API_KEY
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/actions/connector",
    headers={
        "kbn-xsrf": "true",
        "Authorization": f"ApiKey {ELASTIC_API_KEY}",
        "Content-Type": "application/json"
    },
    json=connector_payload
)

connector = response.json()
print(f"Connector created: {connector['id']}")</code></pre><p>We use a reasoning-capable model, like <a href="https://openrouter.ai/openai/gpt-5.2">GPT-5.2</a>, for the agent since it needs to handle complex queries and tool orchestration.</p><h3><strong>Create the inference endpoint</strong></h3><p>The inference endpoint allows Elasticsearch to call LLMs during data processing:</p><pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=[ELASTIC_URL],
    api_key=ELASTIC_API_KEY,
    request_timeout=60  # Higher timeout for inference operations
)

# Create inference endpoint for ingestion
inference_config = {
    "service": "openai",
    "service_settings": {
        "model_id": "openai/gpt-4.1-mini",
        "api_key": OPENROUTER_INGESTION_KEY,
        "url": "https://openrouter.ai/api/v1/chat/completions"
    }
}

response = es.inference.put(
    inference_id="openrouter-inference-endpoint",
    task_type="completion",
    body=inference_config
)

print(f"Inference endpoint created: {response['inference_id']}")</code></pre><p>We use a fast, cheaper model, like <a href="https://openrouter.ai/openai/gpt-4.1-mini">GPT-4.1 Mini</a>, for bulk ingestion tasks that don&#x27;t require advanced reasoning capabilities.</p><h2>Data pipeline</h2><p>Let&#x27;s configure the <a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">ingest pipeline</a>. It will read from the product description field and extract structured categories that Agent Builder can use for filtering and aggregations.</p><p>For example, given this product description:</p><p>&quot;<em>Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.</em>&quot;</p><p>We can extract:</p><ul><li><strong>Category:</strong> Headphones</li><li><strong>Features:</strong> [&quot;wireless&quot;, &quot;noise_cancellation&quot;, &quot;long_battery&quot;]</li><li><strong>Use case:</strong> Travel</li></ul><p>The key is providing possible values to the LLM as an enum so it groups consistently. Otherwise, we might get variations like &quot;noise cancellation&quot;, &quot;ANC&quot;, and &quot;noise-canceling&quot; that are harder to aggregate.</p><pre><code># Define the extraction prompt
EXTRACTION_PROMPT = (
    "Extract audio product information from this description. "
    "Return raw JSON only, no markdown, no explanation. Fields: "
    "category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), "
    "features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), "
    "use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). "
    "Description: "
)

# Create the enrichment pipeline
pipeline_config = {
    "processors": [
        {
            "script": {
                "source": f"ctx.prompt = '{EXTRACTION_PROMPT}' + ctx.description"
            }
        },
        {
            "inference": {
                "model_id": "openrouter-inference-endpoint",
                "input_output": {
                    "input_field": "prompt",
                    "output_field": "ai_response"
                }
            }
        },
        {
            "json": {
                "field": "ai_response",
                "add_to_root": True  # Parses JSON and adds fields to document root
            }
        },
        {
            "remove": {
                "field": ["prompt", "ai_response"]
            }
        }
    ]
}

es.ingest.put_pipeline(
    id="product-enrichment-pipeline",
    body=pipeline_config
)

print("Pipeline created: product-enrichment-pipeline")</code></pre><p>After using OpenAI to extract a JSON with the new attributes, we use the <a href="https://www.elastic.co/search-labs/blog/elasticsearch-parse-json-field-ingest-pipeline">json processor</a> to spread them into new fields.</p><p>Now let&#x27;s index some sample audio products:</p><pre><code># Sample audio product data
products = [
    {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
    },
    {
        "name": "Portable Bluetooth Speaker",
        "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
        "price": 149.99
    },
    {
        "name": "Studio Condenser Microphone",
        "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
        "price": 199.99
    }
]

# Create index with mapping, tolerating "already exists" (400) errors;
# the per-request ignore kwarg was removed in elasticsearch-py 8.x
es.options(ignore_status=400).indices.create(
    index="products-enriched",
    mappings={
        "properties": {
            "name": {"type": "text"},
            "description": {"type": "text"},
            "price": {"type": "float"},
            "category": {"type": "keyword"},
            "features": {"type": "keyword"},
            "use_case": {"type": "keyword"}
        }
    }
)

# Index products using the enrichment pipeline
for i, product in enumerate(products):
    es.index(
        index="products-enriched",
        id=i,
        body=product,
        pipeline="product-enrichment-pipeline"
    )
    print(f"Indexed: {product['name']}")

# Refresh to make documents searchable
es.indices.refresh(index="products-enriched")</code></pre><h2>Agent Builder</h2><p>Now we can create an Agent Builder agent to use this index and answer both text questions and analytical queries using the new fields we&#x27;ve created:</p><pre><code># Create Agent Builder agent
agent_payload = {
    "id": "audio-product-assistant",
    "name": "Audio Product Assistant",
    "description": "Answers questions about audio product catalog using semantic search and analytics",
    "labels": ["audio"],
    "avatar_color": "#BFDBFF",
    "avatar_symbol": "AU",
    "configuration": {
        "tools": [
            {
                "tool_ids": [
                    "platform.core.search",
                    "platform.core.list_indices",
                    "platform.core.get_index_mapping",
                    "platform.core.execute_esql"
                ]
            }
        ],
        "instructions": """You are an audio product assistant that helps users find and analyze audio equipment.

Use the products-enriched index for all queries. The extracted fields are:
- category: Headphones, Earbuds, Speakers, Microphones, or Accessories
- features: array of product features like wireless, noise_cancellation, long_battery
- use_case: Travel, Office, Home, Fitness, Gaming, or Studio

For analytical questions, use ES|QL to aggregate data.
For product searches, use semantic search on the description field."""
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers={
        "kbn-xsrf": "true",
        "Authorization": f"ApiKey {ELASTIC_API_KEY}",
        "Content-Type": "application/json"
    },
    json=agent_payload
)

agent = response.json()
print(f"Agent created: {agent['id']}")</code></pre><p>For tools, we use search for semantic queries and Elasticsearch Query Language (ES|QL) for analytical queries:</p><p>Now you can chat with your agent and ask questions like:</p><ul><li>&quot;What headphones do we have for travel?&quot;</li><li>&quot;Show me products with noise cancellation under $200&quot;</li><li>&quot;What&#x27;s the average price by category?&quot;</li></ul><p>The agent uses the AI-enriched fields to provide better filtering and aggregations.</p><h2>Implementing OpenRouter Broadcast</h2><p>Now let&#x27;s set up inference monitoring. First, we need our OpenTelemetry endpoint URL. Navigate to the APM tutorial in Kibana:</p><pre><code>https://<your_kibana_url>/app/observabilityOnboarding/otel-apm/?category=application</code></pre><p>Collect the URL and authentication token from the OpenTelemetry tab:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0aa268bbee53da9059bbdd12b4ccb5ace0764b1d-1348x1292.jpg" style="width: 100%;" /><p><strong>Important:</strong> Your Kibana server needs to be reachable via the public internet so that it can receive data from OpenRouter.</p><p>In OpenRouter, go to <a href="https://openrouter.ai/settings/broadcast">Broadcast settings</a> and add a new destination for the &quot;OpenTelemetry Collector&quot;:<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/20fc11b904ced2f7c9b26da3a5f809ec2ad3e8c9-752x447.jpg" style="width: 100%;" /><p><strong>Important: </strong>Configure the endpoint with the <code>/v1/traces</code> path and authentication headers:</p><pre><code>Endpoint: https://xxxxx.ingest.us-east-2.aws.elastic-cloud.com:443/v1/traces

Headers: {"Authorization": "Bearer YOUR_APM_SECRET_TOKEN"}</code></pre><p>Press <strong>Test connection</strong>, and you should see a success message.</p><h2>Monitoring in Elastic</h2><p>After using OpenRouter models, you should start seeing documents in Kibana. The indexed documents are in the data stream <code>traces-generic.otel-default</code> with <code>service.name: &quot;openrouter&quot;</code> and include information about:</p><ul><li>Request and response details.</li><li>Token usage (prompt, completion, total).</li><li>Cost (in USD).</li><li>Latency (time to first token, total).</li><li>Model information.</li></ul><p>From now on, the activity of the inference pipeline and Agent Builder in relation to LLM usage will be recorded in OpenRouter and sent to Elastic.</p><h3><strong>Default APM dashboards</strong></h3><p>You can view the default dashboards in Kibana under Observability &gt; Applications &gt; Service Inventory &gt; openrouter:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5d79419b51de2c9b21c383e8f8326b81d47f2f0-1600x1039.png" style="width: 100%;" /><p>The service view shows:</p><ul><li><strong>Latency:</strong> Average response times across all calls.</li><li><strong>Throughput:</strong> Requests per minute.</li><li><strong>Failed transactions:</strong> Error rates.</li><li><strong>Transactions:</strong> Breakdown by operation type.</li></ul><h3><strong>Custom LLM monitoring dashboard</strong></h3><p>For more control over the information displayed, you can create a custom dashboard. 
We created one that differentiates ingestion from agent chat and measures relevant parameters, like token usage and cost, plus usage outside Elastic, like coding assistants via API key filters:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/753d942ea7721e2e24074e87d3e535a637f9e707-1600x1089.jpg" style="width: 100%;" /><p>The dashboard shows:</p><ul><li>Success rate by workflow type.</li><li>Token usage by model.</li><li>Cost breakdown by API key.</li><li>Latency trends over time.</li><li>Model comparison metrics.</li></ul><p>You can download the dashboard <a href="https://gist.github.com/llermaly/7ca1ae96613c3449291b4e5d16810ba1">here</a> and import it to your Kibana instance using <a href="https://www.elastic.co/docs/explore-analyze/find-and-organize/saved-objects#saved-objects-import-and-export">Saved Objects import</a>.</p><h2>Conclusion</h2><p>OpenRouter lets you move quickly and test multiple models and providers using the same API and billing account, making it convenient to compare models of different types—large parameter, small parameter, commercial, open source, and more.</p><p>Using OpenRouter Broadcast, we can effortlessly monitor the performance of these models during ingestion via ingest pipelines or chat via Agent Builder, plus combine it with OpenRouter usage for other purposes, like coding agents and apps.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Context engineering vs. prompt engineering]]></title>
            <description><![CDATA[Learn how context engineering and prompt engineering differ and why mastering both is essential for building production AI agents and RAG systems.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c91bde403096645057988e83cfe460d82fd0a6f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the early days of web development, web design was a single discipline encompassing everything from visual aesthetics to user workflows. As the industry matured, it split into two different fields: user interface (UI) and user experience (UX). They remain interconnected, but each requires distinct expertise and tools.</p><p>We&#x27;re seeing a similar split in AI. Since ChatGPT&#x27;s launch in November 2022, the industry has focused on improving prompts to optimize large language model (LLM) interactions. As we build more sophisticated AI systems, particularly agents and Model Context Protocol (MCP) tools that need access to external knowledge, two distinct disciplines are emerging: prompt engineering and context engineering. While they work together, they address fundamentally different challenges.</p><p>Prompt engineering focuses on how you communicate with the model. Context engineering focuses on what information the model has access to when it generates responses.</p><h2>What is prompt engineering?</h2><p>A <em>prompt </em>is an input to a generative AI model used to guide its output. Prompts can consist of text, images, sound, or other media.</p><p><em>Prompt engineering</em> is the iterative process of refining how you communicate with the model to achieve better results. It focuses on the words, structure, and techniques you use within a single interaction.</p><p>Here are some examples of prompt engineering techniques:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f38222691de873b792629949afd2a6e61323d573-1600x1192.png" style="width: 100%;" /><p>Common techniques include <em>few-shot prompting</em> (providing examples), <em>chain-of-thought reasoning</em> (asking the model to show its work), and <em>role assignment</em> (giving the model a persona). 
These techniques address challenges like ambiguity, where a question can be interpreted multiple ways and the model must guess which interpretation the user intended.</p><p>A key challenge in prompt engineering is finding what Anthropic calls &quot;the right altitude&quot; for instructions. At one extreme, engineers hardcode complex, brittle logic in their prompts to anticipate every scenario. This creates fragility and maintenance overhead. At the other extreme, engineers provide vague guidance that fails to give the model concrete signals or that falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior, yet flexible enough to let the model apply good judgment.</p><p>Prompt engineering typically operates at the single-turn level, optimizing how you phrase one query to get the best response. This works well for straightforward interactions but reaches its limits when tasks require external knowledge, persistent state, or multistep reasoning.</p><p>For a deeper exploration of prompt engineering techniques, refer to <a href="https://arxiv.org/pdf/2406.06608">The Prompt Report: A Systematic Survey of Prompting Techniques</a>.</p><h2>What is context engineering?</h2><p><em>Context engineering</em> is the broader discipline of curating and maintaining the optimal set of tokens during LLM inference. While prompt engineering asks, &quot;How should I phrase this?,&quot; context engineering asks, &quot;What information does the model need access to right now?&quot;</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f42d4980623088cfcf0c26c9acce8577bb5660ed-1600x1326.png" style="width: 100%;" /><p>As the <a href="https://github.com/humanlayer/12-factor-agents">12-Factor Agents framework</a> explains, LLMs are stateless functions that turn inputs into outputs. At any given point, your input to an LLM is essentially, &quot;Here&#x27;s what&#x27;s happened so far. 
What&#x27;s the next step?&quot; Every interaction becomes context:</p><ul><li>The prompt and instructions you give to the model.</li><li>Documents or external data you retrieve through retrieval-augmented generation (RAG).</li><li>Past state, tool calls, or other history.</li><li>Instructions about structured data output formats.</li></ul><p>The distinction matters because modern AI agents don&#x27;t operate in single turns. An agent running in a loop generates an ever-expanding universe of information: tool outputs, retrieved documents, conversation history, intermediate reasoning. Context engineering is the practice of deciding what from that universe makes it into the model&#x27;s limited working memory at any given moment.</p><p>For a comprehensive exploration of context engineering components and best practices, refer to <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview#what-is-context-engineering?">What is context engineering?</a> and <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">You Know, for Context</a>.</p><h2>Key differences: Prompt engineering vs. context engineering</h2>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Dimension</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt engineering</td><td style="border: 1px solid #ccc; padding: 4px;">Context engineering</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Core question</td><td style="border: 1px solid #ccc; padding: 4px;">"How should I phrase this?"</td><td style="border: 1px solid #ccc; padding: 4px;">"What does the model need to know?"</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scope</td><td style="border: 1px solid #ccc; padding: 4px;">Single query</td><td style="border: 1px solid #ccc; padding: 4px;">System-wide information flow</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Failure mode</td><td style="border: 1px solid #ccc; padding: 4px;">Ambiguity: poorly phrased instructions lead to misinterpretation</td><td style="border: 1px solid #ccc; padding: 4px;">Retrieval problems: wrong documents, stale information, or context overflow</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tools</td><td style="border: 1px solid #ccc; padding: 4px;">Describes desired output</td><td style="border: 1px solid #ccc; padding: 4px;">Selects and sequences tools</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Debugging approach</td><td style="border: 1px solid #ccc; padding: 4px;">Linguistic precision: refine wording, add examples</td><td style="border: 1px solid #ccc; padding: 4px;">Data architecture: tune retrieval, prune context, sequence tools</td>
                    </tr>
              </tbody>
            </table>
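To make the table concrete, here is a small sketch of the two disciplines side by side: prompt engineering changes the instruction string itself, while context engineering decides what else enters the window, in what order, and within what budget. All names and the rough 4-characters-per-token estimate are illustrative assumptions, not a prescribed implementation:

```python
def build_prompt(question: str) -> str:
    # Prompt engineering: role, phrasing, and output constraints.
    return (
        "You are a concise technical assistant.\n"
        f"Question: {question}\n"
        "Answer in at most three sentences."
    )

def build_window(prompt: str, history: list, docs: list,
                 budget_tokens: int = 200) -> str:
    # Context engineering: curate what accompanies the prompt,
    # newest history and best-ranked docs first, within a token budget.
    parts = [prompt]
    for piece in list(reversed(history)) + docs:
        used = sum(len(p) for p in parts) // 4  # crude token estimate
        if used + len(piece) // 4 > budget_tokens:
            break  # drop what no longer fits instead of overflowing
        parts.append(piece)
    return "\n---\n".join(parts)

window = build_window(build_prompt("How do I tune retrieval?"),
                      history=["user: hi", "agent: hello"],
                      docs=["[doc] BM25 vs. dense retrieval notes"])
print(window)
```

Notice that the debugging surfaces differ exactly as the table describes: a bad answer from `build_prompt` is fixed with wording, while a bad answer from `build_window` is fixed by changing what is retrieved, pruned, or ordered.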
          <h3><strong>Single-turn vs. multi-turn</strong></h3><p>Prompt engineering optimizes one interaction. Context engineering thinks in sequences: What did previous turns establish? What tool outputs carry forward? What should persist three steps from now? As tasks move from simple question-answering to multistep agent workflows, context engineering becomes the dominant challenge.</p><h3><strong>Context window management</strong></h3><p>Context engineering manages a finite resource with <a href="https://www.elastic.co/search-labs/blog/context-engineering-llm-evolution-agentic-ai">three failure modes</a>:</p><ul><li><strong>Too little information</strong> leads to hallucination or bad responses. When the LLM lacks enough context, it cannot determine the semantic context to generate accurate responses.</li><li><strong>Too much information </strong>causes context overflow. This overwhelms the LLM&#x27;s attention span and lowers relevance across the whole context window, causing the model to struggle identifying which parts matter most.</li><li><strong>Distracting or conflicting information</strong> confuses the model. Larger context windows increase the chances of conflicting or irrelevant information that distracts the LLM answer.</li></ul><p><strong>Key distinction: </strong>Prompt engineering takes the context window as given. Context engineering actively curates it.</p><h3><strong>Tool orchestration</strong></h3><p>Prompt engineering can request tool use and describe what tools should do. Context engineering decides which tools to make available, what information to pass them, and how their outputs flow back into context.</p><p>One of the most common failure modes is bloated tool sets with overlapping functionality. If a human engineer can&#x27;t definitively say which tool should be used in a given situation, an AI agent can&#x27;t be expected to do better. Context engineering applies a clear principle: Curate the minimal viable set of tools. 
Each tool should be self-contained, robust to error, and unambiguous in its purpose. Tools should also be token-efficient, returning only the information necessary rather than everything available.</p><h3><strong>Just-in-time context vs. pre-retrieval</strong></h3><p>Traditional RAG systems preprocess and retrieve all potentially relevant data up front, loading it into the prompt before inference. Context engineering increasingly favors just-in-time strategies, like <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic’s Agent Skills</a>, which agents discover and dynamically load into the context.</p><p>Rather than loading everything up front, agents maintain lightweight references (file paths, stored queries, document IDs) and dynamically load data at runtime using tools. This mirrors human cognition: We don&#x27;t memorize entire books but maintain systems like file folders and bookmarks to retrieve information on demand.</p><p>The trade-off is speed versus precision. Pre-retrieval is faster but risks context overflow. Just-in-time retrieval is slower but keeps the context window focused. 
The most effective agents often use a hybrid approach: retrieving essential baseline context up front while enabling further exploration as needed.</p><h2>Practical example: Book recommendation agent</h2><p>To demonstrate how prompt engineering and context engineering work together, we built a book recommendation agent using <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> with a <a href="https://www.kaggle.com/datasets/elvinrustam/books-dataset">dataset</a> of 103,063 books indexed in Elasticsearch.</p><p><strong>Setup:</strong></p><ul><li><strong>Index: </strong><code>books-dataset</code> with 103,063 documents</li><li><strong>Fields: </strong>Title, Authors, Description, Category, Publisher, Price, Published Date</li><li><strong>Tools: </strong>Agent Builder predetermined tools</li><li><strong>Model: </strong><a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a></li></ul><p><strong>Mappings:</strong></p><pre><code>{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "Authors": { "type": "text" },
      "Category": { "type": "text" },
      "Description": { "type": "text" },
      "Price Starting With ($)": { "type": "double" },
      "Published Date": { "type": "date", "format": "iso8601" },
      "Publisher": { "type": "text" },
      "Title": { "type": "text" }
    }
  }
}</code></pre><p>We tested three scenarios to show different outcomes based on prompt quality and context management.</p><h3><strong>Scenario 1: Prompt engineering failure (ambiguity)</strong></h3><ul><li><strong>User prompt: </strong>“Recommend a good book”</li></ul><p>The agent searched for &quot;highly rated popular books&quot; and returned results about Labrador retrievers and a Paul Reiser comedy book, neither matching typical &quot;good book&quot; expectations.</p><ul><li><strong>Problem: </strong>The agent had to guess what “good” means without any filtering criteria. The LLM interpreted the request based on its assumption of what a “good” book is rather than on the user’s preferences.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11166e428139d0bd552c145980c893ba65f3c882-879x730.png" style="width: 100%;" /><h3><strong>Scenario 2: Context engineering failure (too much information)</strong></h3><ul><li><strong>User prompt: </strong>“Retrieve all books from the database”</li></ul><p>Elasticsearch Query Language (ES|QL) query generated:</p><pre><code>FROM books-dataset 
| LIMIT 100</code></pre><ul><li><strong>Context retrieved:</strong> 100 random books across all categories (cooking, history, and fiction all mixed together)</li><li><strong>Problem: </strong>Too much unfiltered information. The agent pulled in excessive context, which made finding a relevant book difficult, and the answer was incomplete.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0b5fc4e5dd2fca32d6601190661b2383cd6ebb4d-937x1118.png" style="width: 100%;" /><h3><strong>Scenario 3: Both disciplines working together</strong></h3><ul><li><strong>User prompt: </strong>“I enjoy science and fantasy fiction like The Lord of the Rings or Foundation. Find books that match these preferences.”</li></ul><p>The agent executed targeted searches, retrieving relevant titles: <em>The Return of the King</em>, <em>Dune: House Corrino</em>, <em>Far Horizons</em> (a collection featuring <em>Foundation</em> and <em>Dune</em> universe stories).</p><ul><li><strong>Search query: </strong>&quot;science fiction and fantasy books similar to The Lord of the Rings or Foundation&quot;<br/></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/19d2415c3f00374317f1116049d03656fcbc0024-944x1118.png" style="width: 100%;" /><h3><strong>Agent reasoning</strong></h3><p>The agent demonstrated context engineering through smart tool usage and focused retrieval:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b74faabcfcfef4b995d056b029637f1851db5132-845x532.jpg" style="width: 100%;" /><p>The agent used <code>platform.core.search</code> to query the <code>books-dataset</code> index with a targeted search: “science fiction fantasy books similar to The Lord of the Rings Foundation”. 
From 103,063 documents, it retrieved only the most relevant matches.</p><h3><strong>Why it worked</strong></h3><ul><li><strong>Prompt engineering: </strong>Clear genre specification and concrete examples (<em>Lord of the Rings</em>, <em>Foundation</em>) eliminated ambiguity.</li><li><strong>Context engineering: </strong>Focused retrieval brought in only the relevant books, keeping the context window manageable despite the dataset’s 103,063 entries.</li></ul><p>The agent used the same tools in all three cases, but input quality determined how effectively those tools retrieved relevant context.</p><h2>Conclusion</h2><p>Prompt engineering and context engineering are distinct but complementary disciplines. What started as a general practice of prompting is splitting into specialized fields requiring different expertise, much like the UI/UX split in web development.</p><p>For straightforward question-answering, prompt engineering skills may be sufficient. But as systems grow more complex, adding retrieval, tools, and multistep reasoning, context engineering becomes the dominant challenge. Teams building production AI systems need both skill sets, and increasingly, they need practitioners who understand how the two disciplines interact.</p><p>To dive deeper into context engineering strategies for AI agents, including hybrid retrieval, semantic chunking, and agentic search patterns, see <a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">The impact of relevance in context engineering for AI agents</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Faster ES|QL stats with Swiss-style hash tables]]></title>
            <description><![CDATA[How Swiss-inspired hashing and SIMD-friendly design deliver consistent, measurable speedups in Elasticsearch Query Language (ES|QL).]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-swiss-hash-stats</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-swiss-hash-stats</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Hegarty, Matthew Alp, Nik Everett]]></dc:creator>
            <pubDate>Mon, 19 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/21036e031070f14faccb2b53b22723de2750c391-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We recently replaced key parts of Elasticsearch’s hash table implementation with a Swiss-style design and observed up to 2–3x faster build and iteration times on uniform, high-cardinality workloads. The result is lower latency, better throughput, and more predictable performance for Elasticsearch Query Language (ES|QL) stats and analytics operations.</p><h2>Why this matters</h2><p>Most typical analytical workflows eventually boil down to grouping data. Whether it’s computing average bytes per host, counting events per user, or aggregating metrics across dimensions, the core operation is the same — map keys to groups and update running aggregates.</p><p>At a small scale, almost any reasonable hash table works fine. At large scale (hundreds of millions of documents and millions of distinct groups), details start to matter. Load factors, probing strategy, memory layout, and cache behavior can make the difference between linear performance and a wall of cache misses.</p><p>Elasticsearch has supported these workloads for years, but we’re always looking for opportunities to modernize core algorithms. As such, we evaluated a newer approach inspired by Swiss tables and applied it to how ES|QL computes statistics.</p><h2>What are Swiss tables, really?</h2><p>Swiss tables are a family of modern hash tables popularized by Google’s SwissTable and later adopted in Abseil and other libraries.</p><p>Traditional hash tables spend a lot of time chasing pointers or loading keys just to discover that they don’t match. Swiss tables’ defining feature is a tiny cache-resident array, stored separately from the keys and values, called <em>control bytes</em>, which lets them reject most probes early and dramatically reduce memory traffic.</p><p>Each control byte represents a single slot and, in our case, encodes two things: whether the slot is empty, and a short fingerprint derived from the hash. 
These control bytes are laid out contiguously in memory, typically in groups of 16, making them ideal for <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">single instruction, multiple data</a> (SIMD) processing.</p><p>Instead of probing one slot at a time, Swiss tables scan an entire control-byte block using vector instructions. In a single operation, the CPU compares the fingerprint of the incoming key against 16 slots and filters out empty entries. Only the few candidates that survive this fast path require loading and comparing the actual keys.</p><p>This design trades a small amount of extra metadata for much better cache locality and far fewer random loads. As the table grows and probe chains lengthen, those properties become increasingly valuable.</p><h2>SIMD at the center</h2><p>The real star of the show is SIMD.</p><p>Control bytes are not just compact; they’re also explicitly designed to be processed with vector instructions. A single SIMD compare can check 16 fingerprints at once, turning what would normally be a loop into a handful of wide operations. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bd418778f0c6144f8f5f18419f6220ac0c935c7a-903x407.png" style="width: 100%;" /><p>In practice, this means:</p><ul><li>Fewer branches.</li><li>Shorter probe chains.</li><li>Fewer loads from key and value memory.</li><li>Much better utilization of the CPU’s execution units.</li></ul><p>Most lookups never make it past the control-byte scan. When they do, the remaining work is focused and predictable. This is exactly the kind of workload that modern CPUs are good at.</p><h2>SIMD under the hood</h2><p>For readers who like to peek under the hood, here’s what happens when inserting a new key into the table. We use the Panama Vector API with 128-bit vectors, thus operating on 16 control bytes in parallel.</p><p>The following snippet shows the code generated on an Intel Rocket Lake with AVX-512. 
While the instructions reflect that environment, the design does not depend on AVX-512. The same high-level vector operations are emitted on other platforms using equivalent instructions (for example, AVX2, SSE, or NEON).</p><pre><code>; Load 16 control bytes from the control block
vmovdqu xmm0, XMMWORD PTR [r9+r10*1+0x10]

; Broadcast the 7-bit fingerprint of the new key across the vector
vpbroadcastb xmm1, r11d

; Compare all 16 control bytes to the new fingerprint
vpcmpeqb k7, xmm0, xmm1
kmovq rbx, k7

; Check if any matches were found
test rbx, rbx
jne <handle_match></code></pre><p>Each instruction has a clear role in the insertion process:</p><ul><li><code>vmovdqu</code>: Loads 16 consecutive control bytes into the 128-bit <code>xmm0</code> register.</li><li><code>vpbroadcastb</code>: Replicates the 7-bit fingerprint of the new key across all lanes of the <code>xmm1</code> register.</li><li><code>vpcmpeqb</code>: Compares each control byte against the broadcasted fingerprint, producing a mask of potential matches.</li><li><code>kmovq</code> + <code>test</code>: Moves the mask to a general-purpose register and quickly checks whether a match exists.</li></ul><p>Finally, we settled on probing groups of 16 control bytes at a time, as benchmarking showed that expanding to 32 or 64 bytes with wider registers provided no measurable performance benefit.</p><h2>Integration in ES|QL</h2><p>Adopting Swiss-style hashing in Elasticsearch was not a simple drop-in replacement. ES|QL has strong requirements around memory accounting, safety, and integration with the rest of the compute engine.</p><p>We integrated the new hash table tightly with Elasticsearch’s memory management, including the page recycler and circuit breaker accounting, ensuring that allocations remain visible and bounded. Elasticsearch&#x27;s aggregations are stored densely and indexed by a group ID, keeping the memory layout compact and fast for iteration, as well as enabling certain performance optimizations by allowing random access.</p><p>For variable-length byte keys, we cache the full hash alongside the group ID. This avoids recomputing expensive hash codes during probing and improves cache locality by keeping related metadata close together. During rehashing, we can rely on the cached hash and control bytes without inspecting the values themselves, keeping resizing costs low.</p><p>One important simplification in our implementation is that entries are never deleted. 
This removes the need for <em>tombstones</em> (markers to identify previously occupied slots) and allows empty slots to remain truly empty, which further improves probe behavior and keeps control-byte scans efficient.</p><p>The result is a design that fits naturally into Elasticsearch’s execution model while preserving the performance characteristics that make Swiss tables attractive.</p><h2>How does it perform?</h2><p>At small cardinalities, Swiss tables perform roughly on par with the existing implementation. This is expected: When tables are small, cache effects dominate less and there is little probing to optimize.</p><p>As cardinality increases, the picture changes quickly.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24900afc47ab07b0e9933f6117b99d0f4613f794-962x599.png" style="width: 100%;" /><p>The heatmap above plots time improvement factors for different key sizes (8, 32, 64, and 128 bytes) across cardinalities from 1,000 up to 10,000,000 groups. As cardinality grows, the improvement factor steadily increases, reaching up to 2–3x for uniform distributions.</p><p>This trend is exactly what the design predicts. Higher cardinality leads to longer probe chains in traditional hash tables, while Swiss-style probing continues to resolve most lookups inside SIMD-friendly control-byte blocks.</p><h2>Cache behavior tells the story</h2><p>To better understand the speedups, we ran the same JMH <a href="https://github.com/elastic/elasticsearch/pull/139343/files#diff-d0e0cc91a7495bf36b2d44eacce95f5185d01879e5f6c38089ac7a89aad17da7">benchmarks</a> under Linux <code>perf</code> and captured cache and TLB statistics.</p><p>Compared to the original implementation, the Swiss version issues about 60% fewer cache references overall. Last-level cache loads drop by more than 4x, and LLC load misses fall by over 6x. 
Since LLC misses often translate directly into main-memory accesses, this reduction alone explains a large portion of the end-to-end improvement.</p><p>Closer to the CPU, we see fewer L1 data cache misses and nearly 6x fewer data TLB misses, pointing to tighter spatial locality and more predictable memory access patterns.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6e49b7609fba83e33692cb9834552b6ca7e42a83-998x499.png" style="width: 100%;" /><p>This is the practical payoff of SIMD-friendly control bytes. Instead of repeatedly loading keys and values from scattered memory locations, most probes are resolved by scanning a compact, cache-resident structure. Less memory touched means fewer misses, and fewer misses mean faster queries.</p><h2>Wrapping up</h2><p>By adopting a Swiss-style hash table design and leaning hard into SIMD-friendly probing, we achieved 2–3x speedups for high-cardinality ES|QL stats workloads, along with more stable and predictable performance.</p><p>This work highlights how modern CPU-aware data structures can unlock substantial gains, even for well-trodden problems, like hash tables. There is more room to explore here, like additional primitive type specializations and use in other high-cardinality paths, like joins, all of which are just part of the broader and ongoing effort to continually modernize Elasticsearch internals.</p><p>If you’re interested in the details or want to follow the work, check out this <a href="https://github.com/elastic/elasticsearch/pull/139343">pull request</a> and <a href="https://github.com/elastic/elasticsearch/issues/138799">meta issue</a> tracking progress on GitHub.</p><p>Happy hashing!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Managing agentic memory with Elasticsearch]]></title>
            <description><![CDATA[Creating more context-aware and efficient agents by managing memories using Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Someshwaran Mohankumar]]></dc:creator>
            <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80907fd072e72d6ec902470b449c9f337957a0d7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the emerging discipline of <strong>context engineering</strong>, giving AI agents the right information at the right time is crucial. One of the most important aspects of context engineering is managing an AI’s <strong>memory</strong>. Much like humans, AI systems rely on both short-term and long-term memory to recall information. If we want large language model (LLM) agents to carry on coherent conversations, remember user preferences, or build on previous results or responses, we need to equip them with effective memory mechanisms.</p><p>After all, everything in the context influences the AI’s responses. <em>Garbage in, garbage out</em> holds true.</p><p>In this article, we’ll introduce what short-term and long-term memory mean for AI agents, specifically:</p><ul><li>The difference between short- and long-term memory.</li><li>How they relate to retrieval-augmented generation (RAG) techniques with vector databases, like Elasticsearch, and why careful memory management is necessary.</li><li>The risks of neglecting memory, including context overflow and context poisoning.</li><li>Best practices, like context pruning, summarizing, and retrieving only what’s relevant, to keep an agent’s memory both useful and safe.</li><li>Finally, we’ll touch on how memory can be shared and propagated in multi-agent systems to enable agents to collaborate without confusion using Elasticsearch.</li></ul><h2>Short-term versus long-term memory in AI agents</h2><p><strong><em>Short-term memory</em></strong> in an AI agent typically refers to the immediate conversational context or state—essentially, the current chat history or recent messages in the active session. This includes the user’s latest query and recent back-and-forth exchanges. 
It’s very similar to the information a person holds in mind during an ongoing conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4fbcc6f68055b2bccefc4176297a4ca50056dc0d-764x498.png" style="width: 100%;" /><p>AI frameworks often maintain this transient memory as part of the agent’s state (for example, using a checkpointer to store the conversation state as covered by <a href="https://docs.langchain.com/oss/python/langgraph/persistence#checkpoints">this example from LangGraph</a>). Short-term memory is <strong><em>session-scoped</em></strong>; that is, it exists within a single conversation or task and is reset or cleared when that session ends, unless explicitly saved elsewhere. An example of session-bound short-term memory would be the <strong><a href="https://help.openai.com/en/articles/8914046-temporary-chat-faq">temporary chat</a> </strong>available in ChatGPT.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/150bdf209cda5ed20b59cddf34e624ad1a8016aa-1100x577.png" style="width: 100%;" /><p><strong><em>Long-term memory</em></strong>, on the other hand, refers to information that persists <strong>across conversations or sessions</strong>. This is the knowledge an agent retains over time, facts it learned earlier, user preferences, or any data we’ve told it to remember permanently.</p><p>Long-term memory is usually implemented by storing and fetching it from an external source, such as a file or vector database that’s outside the immediate context window. Unlike short-term chat history, long-term memory isn’t automatically included in every prompt. Instead, the agent must <strong>recall</strong> it when the scenario demands, typically by invoking a retrieval tool. 
In practice, long-term memory might include a user’s profile info, prior answers or analyses the agent produced, or a knowledge base the agent can query.</p><p>For instance, if you have a travel-planner agent, the <em>short-term memory</em> would contain details of the current trip inquiry (dates, destination, budget) and any follow-up questions in that chat; whereas the <em>long-term memory</em> could store the user’s general travel preferences, past itineraries, and other facts shared in previous sessions. When the user returns later, the agent can pull from this long-term store (for example, the user loves beaches and mountains, has an average budget of INR 100,000, keeps a bucket list of places to visit, and prefers to experience history and culture rather than kid-friendly attractions) so that it doesn’t treat the user as a blank slate each time.</p><p>The short-term memory (chat history) provides immediate context and continuity, while long-term memory provides a broader context that the agent can draw upon when needed. Most advanced AI agent frameworks enable both: They keep track of recent dialogue to maintain context <em>and</em> offer mechanisms to look up or store information in a longer-term repository. Managing short-term memory ensures it stays within the context window, while managing long-term memory helps the agent ground its answers in prior interactions and user personas.</p><h2>Memory and RAG in context engineering</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/46635aa11ceff89b8d6a26ac3e22da52407d82f3-1600x900.png" style="width: 100%;" /><p><strong><em>How do we give an AI agent a useful long-term memory in practice?</em></strong></p><p>One prominent approach for long-term memory is <strong><em>semantic memory</em></strong>, often implemented via <strong>retrieval-augmented generation (RAG)</strong>. This involves coupling the LLM with an external knowledge store or vector-enabled datastore, like Elasticsearch. 
When the LLM needs information beyond what’s in the prompt or its built-in training, it performs semantic retrieval against Elasticsearch and injects the most relevant results into the prompt as context. This way, the model’s effective context includes not only the recent conversation (short-term memory) but also pertinent long-term facts fetched on the fly. The LLM then grounds its answer on both its own reasoning and the retrieved information, effectively combining short-term memory and long-term memory to produce a more accurate, context-aware response.</p><p><strong>Elasticsearch </strong>can be used to implement long-term memory for AI agents. Here’s a high-level example of how context can be retrieved from Elasticsearch for long-term memory.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41ccbc7b5171e8170ac300139a963c0708816ba6-1600x900.png" style="width: 100%;" /><p>This way, the agent “remembers” by searching for relevant data rather than by storing everything in its limited prompt, <strong>which carries its own risks.</strong></p><p><strong>Using RAG with Elasticsearch or any vector stores offers multiple benefits:</strong></p><p>First, it <strong>extends the knowledge</strong> of the model beyond its training cutoff. The agent can retrieve up-to-date information or domain-specific data that the LLM might not know. This is crucial for questions about recent events or specialized topics.</p><p>Second, retrieving context on demand helps reduce hallucinations. LLMs aren&#x27;t trained on the proprietary or highly specialized data of your niche use case, which makes them especially prone to inventing answers. Instead of the LLM guessing or inventing new information, as it has been incentivized to do through evaluation practices, as highlighted in a recent OpenAI paper (<a href="https://arxiv.org/pdf/2509.04664">Why Language Models Hallucinate</a>), the model can be grounded by factual references from Elasticsearch. 
Naturally, preventing misinformation depends on the reliability of the data in the vector store and on retrieving the most relevant data according to sound relevance measures.</p><p>Third, RAG allows an agent to work with knowledge bases far larger than anything you could ever fit into a prompt. Instead of pushing entire documents, like long research papers or policy documents, into the context window and risking context overflow or <a href="https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch#context-poisoning">context poisoning</a> by irrelevant information, RAG relies on <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">chunking</a>. Large documents are broken into smaller, semantically meaningful pieces, and the system retrieves only the few chunks most relevant to the query. This way, the model doesn’t need a million-token context to appear knowledgeable; it just needs access to the right chunks of a much larger corpus.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6897356c9f0940e35a63d005e9cd20bc33e5dd7-1600x931.png" style="width: 100%;" /><p>It’s worth noting that as LLM context windows have grown (<a href="https://www.anthropic.com/news/1m-context">some models now support hundreds of thousands or even millions of tokens</a>), a debate arose about whether RAG is “dead.” Why not push all the data into the prompt? If you’re wondering the same, refer to this wonderful article by my colleagues, Jeffrey Rengifo and Eduard Martin, <a href="https://www.elastic.co/search-labs/blog/rag-vs-long-context-model-llm">Longer context ≠ better: Why RAG still matters</a>. This avoids the “garbage in, garbage out” problem: The LLM stays focused on the few chunks that matter, rather than running through noise.</p><p>That said, integrating Elasticsearch or any vector store into an AI agent architecture provides <strong>long-term memory</strong>. 
The agent stores knowledge externally and pulls it in as memory context when needed. This could be implemented as an <em>architecture</em> where, after each user query, the agent performs a search on Elasticsearch for relevant info and then appends the top results to the prompt before calling the LLM. The response might also be saved back into the long-term store if it contains useful new information (creating a feedback loop of learning). By using such retrieval-based memory, the agent remains informed and up to date, without having to cram everything it knows into every prompt, even though the context window supports <em>one million tokens</em>. This technique is a cornerstone of context engineering, combining the strengths of information retrieval and generative AI.</p><p>Here’s an example of managing in-memory conversation state using LangGraph&#x27;s checkpoint system for short-term memory during a session. (Refer to our <a href="https://github.com/someshwaranM/elastic-context-engineering-short-term-long-term-memory">supporting context engineering app</a>.)</p><pre><code># Initialize chat memory (Note: This is in-memory only, not persistent)
memory = MemorySaver()

# Create a LangGraph agent
langgraph_agent = create_react_agent(model=llm, tools=tools, checkpointer=memory)
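# Added illustration (not part of the original app): the checkpointer scopes
# short-term memory by thread_id, so calls that share a thread_id continue one
# session, while a new thread_id starts a fresh conversation state.
session_config = {"configurable": {"thread_id": "1"}}
# e.g.: langgraph_agent.invoke({"messages": [("user", "Hi, I'm Som")]}, session_config)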

...
...
# Only process and display checkpoints if verbose mode is enabled
if args.verbose:
    # List all checkpoints that match a given configuration
    checkpoints = memory.list({"configurable": {"thread_id": "1"}})
    # Process the checkpoints
    process_checkpoints(checkpoints)</code></pre><p>Here’s how it stores <strong>checkpoints</strong>:</p><pre><code>Checkpoint:
Timestamp: 2025-12-30T09:19:41.691087+00:00
Checkpoint ID: 1f0e560a-c2fa-69ec-8001-14ee5373f9cf
User: Hi I'm Som, how are you? (Message ID: ad0a8415-5392-4a58-85ad-84154875bbf2)
Agent: Hi Som! I'm doing well, thank you! How about you? (Message ID: 
56d31efb-14e3-4148-806e-24a839799ece)
Agent:  (Message ID: lc_run--019b6e8e-553f-7b52-8796-a8b1fbb206a4-0)

Checkpoint:
Timestamp: 2025-12-30T09:19:40.350507+00:00
Checkpoint ID: 1f0e560a-b631-6a08-8000-7796d108109a
User: Hi I'm Som, how are you? (Message ID: ad0a8415-5392-4a58-85ad-84154875bbf2)
Agent: Hi Som! I'm doing well, thank you! How about you? (Message ID: 
56d31efb-14e3-4148-806e-24a839799ece)

Checkpoint:
Timestamp: 2025-12-30T09:19:40.349027+00:00
Checkpoint ID: 1f0e560a-b62e-6010-bfff-cbebe1d865f6</code></pre><p>For long-term memory, here&#x27;s how we perform semantic search to retrieve relevant previous conversations using vector embeddings, after summarizing and indexing the checkpoints to Elasticsearch.</p><pre><code>Functions: 
retrieve_from_elasticsearch() 

# Enhanced Elasticsearch retrieval with rank_window and verbose display
def retrieve_from_elasticsearch(query: str, k: int = 5, rank_window: int | None = None) -> tuple[List[Dict[str, Any]], str]:
    """
    Retrieve context from Elasticsearch with score-based ranking
    
    Args:
        query: Search query
        k: Number of results to return
        rank_window: Number of candidates to retrieve before ranking (default: args.rank_window)
        
    Returns:
        Tuple of (retrieved_documents, formatted_context_string)
    """
    if not es_client or not es_index_name:
        return [], "Elasticsearch is not available. Cannot search long-term memory."
    
    if rank_window is None:
        rank_window = args.rank_window
    
    try:
        # Check if index exists and has documents
        if not es_client.indices.exists(index=es_index_name):
            return [], "No previous conversations stored in long-term memory yet."
        
        # Get document count
        try:
            doc_count = es_client.count(index=es_index_name)["count"]
            if doc_count == 0:
                return [], "Long-term memory is empty. No previous conversations to search."
        except Exception as e:
            return [], f"Error checking memory: {str(e)}"
        
        # Generate embedding for the query
        try:
            query_embedding = embeddings.embed_query(query)
        except Exception as e:
            return [], f"Error generating embedding: {str(e)}"
        
        # Perform semantic search using kNN with rank_window
        try:
            search_body = {
                "knn": {
                    "field": "vector",
                    "query_vector": query_embedding,
                    "k": k,
                    "num_candidates": rank_window  # Retrieve more candidates, then rank top k
                },
                "_source": ["text", "content", "message_type", "timestamp", "thread_id"],
                "size": k
            }
            
            response = es_client.search(index=es_index_name, body=search_body)
            
            if not response.get("hits") or len(response["hits"]["hits"]) == 0:
                return [], "No relevant previous conversations found in long-term memory."
            
            # Extract documents with scores
            retrieved_docs = []
            for hit in response["hits"]["hits"]:
                source = hit["_source"]
                score = hit["_score"]
                retrieved_docs.append({
                    "content": source.get("content", source.get("text", "")),
                    "message_type": source.get("message_type", "unknown"),
                    "timestamp": source.get("timestamp", "unknown"),
                    "thread_id": source.get("thread_id", "unknown"),
                    "score": score
                })
            
            # Format context string
            context_parts = []
            for i, doc in enumerate(retrieved_docs, 1):
                context_parts.append(doc["content"])
            
            context_string = "\n\n".join(context_parts)
            
            # Verbose display
            if args.verbose:
                rich.print(f"\n[bold yellow]🔍 RETRIEVAL ANALYSIS[/bold yellow]")
                rich.print("="*80)
                rich.print(f"[blue]Query:[/blue] {query}")
                rich.print(f"[blue]Retrieved:[/blue] {len(retrieved_docs)} documents (from {rank_window} candidates)")
                rich.print(f"[blue]Total context length:[/blue] {len(context_string)} characters\n")
                
                for i, doc in enumerate(retrieved_docs, 1):
                    rich.print(f"[cyan]📄 Document {i} | Score: {doc['score']:.4f} | Type: {doc['message_type']}[/cyan]")
                    rich.print(f"[cyan]   Timestamp: {doc['timestamp']} | Thread: {doc['thread_id']}[/cyan]")
                    content_preview = doc['content'][:200] + "..." if len(doc['content']) > 200 else doc['content']
                    rich.print(f"[cyan]   Content: {content_preview}[/cyan]")
                    rich.print("-" * 80)
            
            return retrieved_docs, context_string
            
        except Exception as e:
            return [], f"Error searching memory: {str(e)}"
            
    except Exception as e:
        return [], f"Error accessing long-term memory: {str(e)}"</code></pre><p>Now that we’ve explored how short-term memory and long-term memory are indexed and fetched using LangGraph’s checkpoints in Elasticsearch, let’s take some time to understand why indexing and dumping the complete conversations can be risky.</p><h2>Risks of not managing context memory</h2><p>Since we’re talking so much about context engineering, along with short-term and long-term memory, let’s understand what happens if we don’t manage an agent’s memory and context well.</p><p>Unfortunately, many things can go wrong when an AI’s context grows extremely long or contains bad information. As context windows get larger, <strong>new failure modes</strong> emerge, like:</p><ul><li><strong>Context poisoning</strong></li><li><strong>Context distraction</strong></li><li><strong>Context confusion</strong></li><li><strong>Context clash</strong></li><li><strong>Context leakage and knowledge conflicts</strong></li><li><strong>Hallucinations and misinformation</strong></li></ul><p>Let’s break down these issues and other risks that arise from poor context management:</p><h3>Context poisoning</h3><p><em>Context poisoning</em> refers to incorrect or harmful information ending up in the context and “poisoning” the model’s subsequent outputs. A common example is a hallucination by the model that gets treated as fact and inserted into the conversation history. The model might then build on that error in later responses, compounding the mistake. In iterative agent loops, once false information makes it into the shared context (for example, in a summary of the agent’s working notes), it can be reinforced over and over. 
</p><p><a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">Researchers at DeepMind, in the release of the Gemini 2.5 report</a> (TL;DR, check <a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html">here</a>), observed this in a long-running <em>Pokémon</em>-playing agent: If the agent hallucinated a wrong game state and that got recorded into its <em>context </em>(its memory of goals), the agent would form <strong>nonsensical strategies</strong> around an impossible goal and get stuck. In other words, a poisoned memory can send the agent down the wrong path indefinitely.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3f2facf5aad67613ad557422e09ec23a66adc0ed-1600x1388.png" style="width: 100%;" /><p>Context poisoning can happen innocently (by mistake) or even maliciously, for instance, via prompt injection attacks where a user or third-party sneaks in a hidden instruction or false fact that the agent then remembers and follows.</p><p><strong>Recommended countermeasures:</strong></p><p>Based on insights from <a href="https://www.wiz.io/academy/data-poisoning">Wiz</a>, <a href="https://zerlo.net/en/blog/what-is-llm-data-poisoning">Zerlo</a>, and <a href="https://www.anthropic.com/research/small-samples-poison">Anthropic</a>, countermeasures for context poisoning focus on preventing bad or misleading information from entering an LLM’s prompt, context window, or retrieval pipeline. 
Key steps include:</p><ul><li>Check the context constantly: Monitor the conversation or retrieved text for anything suspicious or harmful, not just the starting prompt.</li><li>Use trusted sources: Score or label documents based on credibility so the system prefers reliable information and ignores low-scored data.</li><li>Spot unusual data: Use tools that detect odd, out-of-place, or manipulated content, and remove it before the model uses it.</li><li>Filter inputs and outputs: Add guardrails so harmful or misleading text can’t easily enter the system or be repeated by the model.</li><li>Keep the model updated with clean data: Regularly refresh the system with verified information to counter any bad data that slipped through.</li><li>Human-in-the-loop: Have people review important outputs or compare them against known, trustworthy sources.</li></ul><p>Simple user habits also help: resetting long chats, sharing only relevant information, breaking complex tasks into smaller steps, and maintaining clean notes outside the model.</p><p>Together, these measures create a layered defense that protects LLMs from context poisoning and keeps outputs accurate and trustworthy.</p><p>Without such countermeasures, an agent might remember instructions like <em>ignore previous guidelines</em> or trivial facts that an attacker inserted, leading to harmful outputs.</p><h3>Context distraction</h3><p><em>Context distraction</em> is when a context grows so long that the model overfocuses on the context, neglecting what it learned during training. In extreme cases, this resembles <a href="https://en.wikipedia.org/wiki/Catastrophic_interference"><em>catastrophic forgetting</em></a>; that is, the model effectively “forgets” its underlying knowledge and becomes overly attached to the information placed in front of it. 
Previous studies have shown that LLMs often lose focus when the prompt is extremely long.</p><p>The Gemini 2.5 agent, for example, supported a million-token window, but once its context grew beyond a certain point (on the order of 100,000 tokens in an experiment), it began to <strong>fixate on repeating its past actions</strong> instead of coming up with new solutions. In a sense, the agent became a prisoner of its extensive history. It kept looking at its long log of previous moves (the context) and mimicking them, rather than using its underlying training knowledge to devise fresh and novel strategies.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e08e5b6d2e8ec7e3511d455985eed3d7fa6241e0-1352x636.png" style="width: 100%;" /><p>This is counterproductive. We want the model to use relevant context to help reasoning, not override its ability to think. Notably, even models with huge windows exhibit this <a href="https://research.trychroma.com/context-rot"><em>context rot</em></a>: Their performance degrades nonuniformly as more tokens are added. There appears to be an <em>attention budget</em>. Like humans with limited working memory, an LLM has a finite capacity to attend to tokens, and as that budget is stretched, its precision and focus drop.</p><p>As a mitigation, you can prevent context distraction using chunking, engineering the right information, regular context summarization, and evaluation and monitoring techniques to measure the accuracy of the response using scoring.</p><p>These methods keep the model grounded in both relevant context and its underlying training, reducing the risk of distraction and improving overall reasoning quality.</p><h3>Context confusion</h3><p><em>Context confusion</em> is when superfluous content in the context is used by the model to generate a low-quality response. A prime example is giving an agent a large set of tools or API definitions that it might use. 
If many of those tools are unrelated to the current task, the model may still try to use them inappropriately, simply because they’re present in context. Experiments have found that providing <em>more</em> tools or documents can <em>hurt</em> performance if they’re not all needed. The agent starts making mistakes, like calling the wrong function or referencing irrelevant text. </p><p>In one case, a small <strong>Llama 3.1 8B</strong> model failed a task when given 46 tools to consider but succeeded when given only 19 tools. The extra tools created confusion, even though the context was within length limits. The underlying issue is that any information in the prompt will be <em>attended to</em> by the model. If it doesn’t know to ignore something, that something could influence its output in undesired ways. Irrelevant bits can “steal” some of the model’s attention and lead it astray (for instance, an irrelevant document might cause the agent to answer a different question than asked). Context confusion often manifests as the model producing a low-quality response that integrates unrelated context. Refer to the research paper: <a href="https://arxiv.org/pdf/2411.15399">Less is More: Optimizing Function Calling for LLM Execution on Edge Devices.</a></p><p>It reminds us that more context isn’t always better, especially if it’s not <strong>curated</strong> for relevance.</p><h3>Context clash</h3><p><em>Context clash</em> occurs when <strong>parts of the context contradict each other</strong>, causing internal inconsistencies that derail the model’s reasoning. A clash can happen if the agent accumulates multiple pieces of information that are in conflict. </p><p>For example, imagine an agent that fetched data from two sources: One says <em>Flight A departs at 5 PM</em>, and the other says <em>Flight A departs at 6 PM</em>. 
If both facts end up in the context, the model has no way to know which is correct; it may get confused or produce an incorrect or inconsistent answer.</p><p>Context clash also frequently occurs in multiturn conversations where the model’s <strong>earlier attempts</strong> at answering are still lingering in the context along with later refined information.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/500d7a80dc8db1923f9b5ca84728eed64fa296f7-1316x580.png" style="width: 100%;" /><p>A <a href="https://arxiv.org/pdf/2505.06120">research study</a> by Microsoft and Salesforce shows that if you break a complex query into multiple chatbot turns (adding details gradually), the final accuracy drops significantly, compared to giving all details in a single prompt. Why? Because the early turns contain partial or incorrect intermediate answers from the model, and those remain in the context. When the model later tries to answer with all info, its <em>memory</em> still includes those wrong attempts, which conflict with the corrected info and lead it off track. Essentially, the conversation’s context clashes with itself. The model may inadvertently use an outdated piece of context (from an earlier turn) that doesn’t apply after new info is added.</p><p>In agent systems, context clash is especially dangerous because an agent might combine outputs from different tools or subagents. If those outputs disagree, the aggregated context is inconsistent. The agent could then get stuck or produce nonsensical results trying to reconcile the contradictions. 
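</p><p>To make this concrete, here’s a minimal sketch of one way to keep conflicting facts out of the prompt: key each fact by subject and retain only the most recently observed value. (The `latest_facts` helper is hypothetical and not part of the demo project.)</p><pre><code># Hypothetical freshness filter: when two sources disagree about the same
# fact, keep only the most recently observed value so the prompt never
# contains both versions.
def latest_facts(observations):
    """observations: list of (subject, value, timestamp) tuples."""
    freshest = {}
    for subject, value, ts in observations:
        if subject not in freshest or ts > freshest[subject][1]:
            freshest[subject] = (value, ts)
    return {s: v for s, (v, _) in freshest.items()}

obs = [
    ("flight_A_departure", "5 PM", 1),  # older source
    ("flight_A_departure", "6 PM", 2),  # newer source wins
]
facts = latest_facts(obs)</code></pre><p>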
Preventing context clash involves ensuring the context is <strong>fresh and consistent</strong>; for instance, clearing or updating any outdated info and not mixing sources that haven’t been vetted for consistency.</p><h3>Context leakage and knowledge conflicts</h3><p>In systems where multiple agents or users share a memory store, there’s a risk of information bleeding over between contexts.</p><p>For example, if two separate users’ data embeddings reside in the same vector database without proper access control, an agent answering User A’s query might accidentally retrieve some of User B’s memory. This <strong><em>cross-context leak</em></strong> can expose private information or just create confusion in responses.</p><p>According to the <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/">OWASP Top 10 for LLM Applications</a>, multitenant vector databases must guard against such leakage:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f0ccf0b2f7bd6715c14aceee2deffb213d50bd9-1600x936.png" style="width: 100%;" /><p>According to <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/">LLM08:2025 Vector and Embedding Weaknesses</a>, one of the common risks is context leakage:</p><blockquote><em>In multi-tenant environments where multiple classes of users or applications share the same vector database, there&#x27;s a risk of context leakage between users or queries. Data federation knowledge conflict errors can occur when data from multiple sources contradict each other. This can also happen when an LLM can’t supersede old knowledge that it has learned while training, with the new data from Retrieval Augmentation.</em></blockquote><p>Another aspect is that an LLM might have trouble overriding its <strong>built-in knowledge</strong> with new info from memory. 
If the model was trained on some fact and the retrieved context says the opposite, the model can get confused about which to trust. Without proper design, the agent could mix up contexts or fail to update old knowledge with new evidence, leading to stale or incorrect answers.</p><h3><strong>Hallucinations and misinformation</strong></h3><p>While <em>hallucination </em>(the LLM making up plausible-sounding but false information) is a known problem even without long contexts, poor memory management can amplify it. </p><p>If the agent’s memory is lacking a crucial fact, the model may just <strong>fill in the gap with a guess</strong>, and if that guess then enters the context (poisoning it), the error persists. </p><p>The OWASP LLM security report <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/"><strong>(LLM09:2025 Misinformation)</strong></a> highlights misinformation as a core vulnerability: LLMs can produce confident but fabricated answers, and users may overtrust them. An agent with a bad or outdated long-term memory might confidently cite something that was true last year but is false now, unless its memory is kept up to date. </p><p>Overreliance on the AI’s output (by either the user or the agent itself in a loop) can make this worse. If no one ever checks the info in memory, the agent can accumulate falsehoods. This is why RAG is often used to reduce hallucinations: By retrieving an authoritative source, the model doesn’t have to invent facts. But if your retrieval pulls in the wrong document (say, one that contains misinformation) or if an early hallucination isn’t pruned, the system may propagate that misinformation throughout its actions. </p><p>The bottom line: Failing to manage memory can lead to <strong>incorrect and misleading outputs</strong>, which can be damaging, especially if the stakes are high (for example, bad advice in a finance or medical domain). 
An agent needs mechanisms to verify or correct its memory content, not just unconditionally trust whatever is in the context.</p><p>In summary, giving an AI agent an infinitely long memory or dumping every possible thing into its context is <em>not</em> a recipe for success.</p><h2>Best practices for memory management in LLM applications</h2><p>To avoid the pitfalls above, developers and researchers have devised a number of <strong>best practices for managing context and memory</strong> in AI systems. These practices aim to keep the AI’s working context lean, relevant, and up to date. Here are some of the key strategies, along with examples of how they help.</p><h3>RAG: Use targeted context</h3><p>Much of RAG has already been covered in the earlier section, so this serves as a concise set of practical reminders:</p><ul><li>Use targeted retrieval, not bulk loading: Retrieve only the most relevant chunks instead of pushing entire documents or full conversation histories into the prompt.</li><li>Treat RAG as just-in-time memory recall: Fetch context only when it’s needed, rather than carrying everything forward across turns.</li><li>Prefer relevance-aware retrieval strategies: Approaches like top-k semantic search, Reciprocal Rank Fusion, or tool loadout filtering help reduce noise and improve grounding.</li><li>Larger context windows don’t remove the need for RAG: Two highly relevant paragraphs are almost always more effective than 20 loosely related pages.</li></ul><p>That said, RAG isn’t about adding more context; it’s about adding the right context.</p><h3>Tool loadout</h3><p><em>Tool loadout</em> is about giving a model only the tools it actually needs for a task. The term comes from gaming: You pick a loadout that fits the situation. Too many tools slow you down; the wrong ones cause failure. LLMs behave the same way, according to the research paper <a href="https://arxiv.org/abs/2411.15399">Less is more</a>. 
Once you pass ~30 tools, descriptions start overlapping and the model gets confused. Past ~100 tools, failure is almost guaranteed. This isn’t a context window problem, it’s context confusion.</p><p>A simple and effective fix is <a href="https://arxiv.org/abs/2505.03275"><strong>RAG-MCP</strong></a>. Instead of dumping every tool into the prompt, tool descriptions are stored in a vector database and only the most relevant ones are retrieved per request. In practice, this keeps the loadout small and focused, dramatically shortens prompts, and can improve tool selection accuracy by up to 3x.</p><p>Smaller models hit this wall even sooner. The research shows an 8B model failing with dozens of tools but succeeding once the loadout is trimmed. Dynamically selecting tools, sometimes with an LLM first, reasoning about what it thinks it needs, can boost performance by 44%, while also reducing power usage and latency. The takeaway is that most agents only need a few tools, but as your system grows, tool loadout and RAG-MCP become first-order design decisions.</p><h3>Context pruning: Limit the chat history length</h3><p>If a conversation goes on for many turns, the accumulated chat history can become too large to fit, leading to context overflow or becoming too distracting to the model. </p><p><em>Trimming</em> means programmatically removing or shortening less important parts of the dialogue as it grows. One simple form is to drop the oldest turns of the conversation when you hit a certain limit, keeping only the latest <em>N</em> messages. More sophisticated pruning might remove irrelevant digressions or previous instructions that are no longer needed. The goal is to <strong>keep the context window uncluttered</strong> by old news. </p><p>For example, if the agent solved a subproblem 10 turns ago and we have since moved on, we might delete that portion of the history from the context (assuming it won’t be needed further). 
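</p><p>As a rough sketch, the rolling-window form of trimming can look like this (a hypothetical `trim_history` helper, not part of the demo project; it keeps the system prompt plus only the latest <em>N</em> messages):</p><pre><code># Hypothetical rolling-window trimmer: keep the system prompt plus the
# last `max_turns` messages, dropping older history once the limit is hit.
def trim_history(messages, max_turns=10):
    """messages: list of {"role": ..., "content": ...} dicts."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(50)]
trimmed = trim_history(history, max_turns=10)  # system prompt + 10 newest turns</code></pre><p>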
Many chat-based implementations do this: They maintain a rolling window of recent messages. </p><p>Trimming can be as simple as “forgetting” the earliest parts of a conversation once they’ve been summarized or are deemed irrelevant. By doing so, we reduce the risk of context overflow errors and also reduce <a href="https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch#context-distraction"><strong>context distraction</strong></a>, so the model won’t see and get sidetracked by old or off-topic content. This approach is very similar to how humans might not remember every word from an hour-long talk but will retain the highlights. </p><p>If context pruning still feels abstract, as author Drew Breunig highlights <a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html#tool-loadout:~:text=Provence%20is%20fast%2C%20accurate%2C%20simple%20to%20use%2C%20and%20relatively%20small%20%E2%80%93%20only%201.75%20GB.%20You%20can%20call%20it%20in%20a%20few%20lines%2C%20like%20so%3A">here</a>, the Provence (`<a href="https://huggingface.co/naver/provence-reranker-debertav3-v1">naver/provence-reranker-debertav3-v1</a>`) model, a lightweight (1.75 GB), efficient, and accurate context pruner for question answering, can make a difference. It can trim large documents down to only the most relevant text for a given query. You can call it at specific intervals.</p><p>Here’s how we invoke the `provence-reranker` model in our code to prune the context:</p><pre><code># Context pruning with Provence
def prune_with_provence(query: str, context: str, threshold: Optional[float] = None) -> str:
    """
    Prune context using Provence reranker model
    
    Args:
        query: User's query/question
        context: Original context to prune
        threshold: Relevance threshold (0-1) for Provence reranker.
                   If None, uses args.pruning_threshold.
                   0.1 = conservative (recommended, no performance drop)
                   0.3-0.5 = moderate to aggressive pruning
    
    Returns:
        Pruned context with only relevant sentences
    """
    if provence_model is None:
        return context
    
    if threshold is None:
        threshold = args.pruning_threshold
    
    try:
        # Use Provence's process method
        provence_output = provence_model.process(
            question=query,
            context=context,
            threshold=threshold,
            always_select_title=False,
            enable_warnings=False
        )
        
        # Extract pruned context from output
        pruned_context = provence_output.get('pruned_context', context)
        reranking_score = provence_output.get('reranking_score', 0.0)
        
        # Log statistics
        original_length = len(context)
        pruned_length = len(pruned_context)
        reduction_pct = ((original_length - pruned_length) / original_length * 100) if original_length > 0 else 0
        
        if args.verbose:
            rich.print(f"[cyan]📊 Pruning stats: {pruned_length}/{original_length} chars ({reduction_pct:.1f}% reduction, threshold={threshold:.2f}, rerank_score={reranking_score:.3f})[/cyan]")
        
        return pruned_context if pruned_context else context
        
    except Exception as e:
        rich.print(f"[yellow]⚠️ Error in Provence pruning: {str(e)}[/yellow]")
        rich.print(f"[yellow]⚠️ Falling back to original context[/yellow]")
        return context</code></pre><p>We use the Provence reranker model (`naver/provence-reranker-debertav3-v1`) to score sentence relevance. Threshold-based filtering keeps sentences above the relevance threshold. Also, we introduce a fallback mechanism, where we return to the original context if pruning fails. Finally, statistics logging tracks reduction percentage in verbose mode.</p><h3>Context summarization: Condense older information instead of dropping it entirely</h3><p><em>Summarization</em> is a companion to trimming. When the history or knowledge base becomes too large, you can employ the LLM to generate a brief summary of the important points and use that summary in place of the full content going forward, as we performed in our code above.</p><p>For example, if an AI assistant has had a 50-turn conversation, instead of sending all 50 turns to the model on turn 51 (which likely won’t fit), the system might take turns 1–40, have the model summarize them in a paragraph, and then only supply that summary plus the last 10 turns in the next prompt. This way, the model still knows what was discussed without needing every detail. Early chatbot users did this manually by asking, “Can you summarize what we’ve talked about so far?” and then continuing in a new session with the summary. Now it can be automated. Summarization not only saves context window space but can also reduce <strong>context confusion/distraction</strong> by stripping away extra detail and retaining just the salient facts.</p><p>Here’s how we use OpenAI models (you can use any LLMs) to condense context while preserving all relevant information, eliminating redundancy and duplication.<br/></p><pre><code># Context summarization
def summarize_context(query: str, context: str) -> str:
    """
    Summarize context using LLM to reduce duplication and focus on relevant information
    
    Args:
        query: User's query/question
        context: Context to summarize
        
    Returns:
        Summarized context
    """
    try:
        summary_prompt = f"""You are an expert at summarizing conversation context.

Your task: Analyze the provided conversation context and produce a condensed summary that fully answers or supports the user's specific question.

The summary must:
1. Preserve every fact, detail, and information that directly relates to the question
2. Eliminate redundancy and duplicate information
3. Maintain chronological flow when relevant
4. Focus on information that helps answer: "{query}"

Context to summarize:
{context}

Provide a concise summary that preserves all relevant information:"""

        summary = llm.invoke(summary_prompt).content
        
        if args.verbose:
            original_length = len(context)
            summary_length = len(summary)
            reduction_pct = ((original_length - summary_length) / original_length * 100) if original_length > 0 else 0
            rich.print(f"[cyan]📝 Summarization stats: {summary_length}/{original_length} chars ({reduction_pct:.1f}% reduction)[/cyan]")
        
        return summary
        
    except Exception as e:
        rich.print(f"[yellow]⚠️ Error in context summarization: {str(e)}[/yellow]")
        rich.print(f"[yellow]⚠️ Falling back to original context[/yellow]")
        return context</code></pre><p>Importantly, when the context is summarized, the model is less likely to get overwhelmed by trivial details or past errors (assuming the summary is accurate). </p><p>However, summarization has to be done carefully. A bad summary might omit a crucial detail or even introduce an error. It’s essentially another prompt to the model (“summarize this”), so it can hallucinate or lose nuance. Best practice is to summarize incrementally and perhaps keep some canonical facts unsummarized.</p><p>Nonetheless, it has proven very useful. <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">In the Gemini agent scenario, </a>summarizing the context every ~100k tokens was a way to counteract the model’s tendency to repeat itself. The summary acts like a compressed memory of the conversation or data. As developers, we can implement this by having an agent periodically call a summarization function (maybe a smaller LLM or a dedicated routine) on the conversation history or a long document. The resulting summary replaces the original content in the prompt. This tactic is widely used to keep contexts within limits and distill the information.</p><h3>Context quarantine: Isolate contexts when possible</h3><p>This is more relevant in complex agent systems or multistep workflows. The idea of context segmentation is to split a big task into smaller, isolated tasks, each with its own context, so that you never accumulate one enormous context that contains everything. 
Each subagent or subtask works on a piece of the problem with a focused context, and then a higher-level agent (a supervisor or coordinator) integrates the results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f2de68c3339883d7658e633af3948f29f427e6cf-1600x900.png" style="width: 100%;" /><p><a href="https://www.anthropic.com/engineering/multi-agent-research-system">Anthropic’s research strategy uses multiple subagents</a>, each investigating a different aspect of a question, with their own context windows, and a lead agent that reads the distilled results from those subagents. This parallel, modular approach means that no single context window gets too bloated. It also reduces the chance of irrelevant information mixing: each thread stays on topic (no context confusion), and it doesn’t carry unnecessary baggage when answering its specific subquestion. In a sense, it’s like running separate threads of thought that only share their outcomes, not their entire thought process.</p><p>In multi-agent systems, this approach is essential. If Agent A is handling task A and Agent B is handling task B, there’s no reason for either agent to consume the other’s full context unless it’s truly required. Instead, agents can exchange only the necessary information. For example, Agent A can pass a consolidated summary of its findings to Agent B via a supervisor agent, while each subagent maintains its own dedicated context thread. This setup doesn’t require human-in-the-loop intervention; it relies on a supervisory agent with the right tools enabled and minimal, controlled context sharing.</p><p>Nonetheless, designing your system so that agents or tools operate with minimal necessary context overlap can greatly enhance clarity and performance. 
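</p><p>The quarantine pattern can be sketched in a few lines (a hypothetical structure with a stubbed summarize step, not the demo code): each subagent keeps a private context and hands back only a distilled result, which is all the supervisor ever sees.</p><pre><code># Hypothetical context-quarantine pattern: each subagent works in its own
# isolated context; only a short summary escapes to the supervisor.
def run_subagent(task, summarize=lambda ctx: ctx[-1]):
    context = []                       # private to this subagent
    context.append(f"working on: {task}")
    context.append(f"result for {task}")
    return summarize(context)          # only the distilled result escapes

def supervisor(subtasks):
    # The supervisor's context holds summaries, never full subagent logs.
    summaries = [run_subagent(t) for t in subtasks]
    return " | ".join(summaries)

combined = supervisor(["flights", "hotels"])</code></pre><p>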
Think of it as <strong>microservices for AI</strong>: each component deals with its own context, and you pass messages between them in a controlled way, instead of one monolithic context.</p><p>These best practices are often used in combination. Also, this gives you the flexibility to trim trivial history, summarize important older messages or conversations, offload the detailed logs to Elasticsearch for long-term context, and use retrieval to bring back anything relevant when needed.</p><p>As mentioned <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents#:~:text=While%20some%20models,to%20the%20LLM">here</a>, the guiding principle is that context is a limited and precious resource. You want every token in the prompt to earn its keep, meaning it should contribute to the quality of the output. If something in memory is not pulling its weight (or worse, actively causing confusion), then it should be pruned, summarized, or kept out.</p><p>As developers, we can now program the context just like we program code, deciding what information to include, how to format it, and when to omit or update it. By following these practices, we can give LLM agents the much-needed context to perform tasks without falling victim to the failure modes described earlier. The result is agents that remember what they should, forget what they don’t need, and retrieve what they require just in time.</p><h2>Conclusion</h2><p>Memory isn’t something you add to an agent; it’s something you engineer. Short-term memory is the agent’s working scratch pad, and long-term memory is its durable knowledge store. RAG is the bridge between the two, turning a passive datastore, like Elasticsearch, into an active recall mechanism that can ground outputs and keep the agent current.</p><p>But memory is a double-edged sword. The moment you let context grow unchecked, you invite poisoning, distraction, confusion, and clashes, and in shared systems, even data leakage. 
That’s why the most important memory work isn’t “store more,” it’s “curate better”: retrieve selectively, prune aggressively, summarize carefully, and avoid mixing unrelated contexts unless the task truly demands it.</p><p>In practice, good context engineering looks like good systems design: smaller, sufficient contexts, controlled interfaces between components, and a clear separation between the raw state and the distilled state you actually want the model to see. Done right, you don’t end up with an agent that remembers everything; you end up with an agent that remembers the right things, at the right time, for the right reason.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Higher throughput and lower latency: Elastic Cloud Serverless on AWS gets a significant performance boost]]></title>
            <description><![CDATA[We've upgraded the AWS infrastructure for Elasticsearch Serverless to newer, faster hardware. Learn how this massive performance boost delivers faster queries, better scaling, and lower costs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-aws-performance-boost</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-aws-performance-boost</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Pete Galeotti, Yuvraj Gupta, Rachel Forshee]]></dc:creator>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3aa0b481994d2445ba979d3c79fff64c5ee6676a-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Cloud Serverless is already the definitive solution for developers who want to build efficient search and AI applications without the operational burden of managing infrastructure. Now, we&#x27;re taking the performance of your serverless projects to a whole new level.</p><p>We&#x27;ve completed a major infrastructure upgrade for all <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> projects running on AWS, migrating to newer, faster hardware. This change has been rolled out to every serverless project automatically. It delivers <strong>higher throughput and lower latency</strong> for Elasticsearch, Elastic Observability, and Elastic Security serverless projects on AWS.</p><h2><strong>Key performance benefits for developers</strong></h2><p>The new AWS hardware infrastructure underpins everything you do with Elastic Cloud Serverless, translating to tangible benefits for your applications&#x27; speed and responsiveness.</p><h3><strong>Reduced query latency… increased throughput</strong></h3><p>The improved hardware dramatically enhances the speed of compute resources, which means your search queries are processed faster than ever.</p><ul><li><strong>Search and vector search:</strong> Whether you&#x27;re running traditional full-text queries or using cutting-edge vector search for your <a href="https://www.elastic.co/generative-ai">generative AI and retrieval-augmented generation (RAG) applications</a>, you&#x27;ll see a marked decrease in latency. Internal benchmarking showed a 35% average decrease in search latency.</li><li><strong>Faster indexing:</strong> Data ingestion rates are optimized, allowing you to index massive data volumes and complex documents with increased throughput. This is crucial for applications that require near–real-time data visibility. 
Internal benchmarking showed a 26% average increase in indexing throughput.</li></ul><h3><strong>Consistent performance under load</strong></h3><p>Elastic Cloud Serverless is designed to autoscale dynamically in real time to meet demand, minimizing latency, regardless of your workload. With this hardware upgrade, that scaling is now more performant and responsive.</p><ul><li><strong>Handling spikes with ease:</strong> Whether you&#x27;re facing a sudden surge in user traffic or a massive batch data ingest, the new infrastructure ensures that your search and indexing resources scale up more efficiently to maintain consistently low latency.</li><li><strong>Optimized compute-storage decoupling:</strong> The serverless architecture separates compute and storage, which allows workloads to scale independently for optimal performance and cost efficiency. The faster hardware enhances the compute layer, maximizing the efficiency of this decoupled design.</li></ul><h2><strong>Under the hood: Internal benchmarking results</strong></h2><p>To quantify the impact of our AWS infrastructure upgrade, the Elastic engineering team conducted comprehensive internal benchmarking against a range of serverless workloads. These workloads provided empirical evidence of performance improvements that you can expect across your applications, regardless of your use case.</p><h3><strong>The benchmarking approach</strong></h3><p>We focused our testing on the key metrics that directly affect the developer experience and application responsiveness: response time (that is, latency) and throughput on search and on indexing operations.</p><ul><li><strong>Workloads tested:</strong> The tests included high-concurrency search operations typical of user-facing applications, complex vector search queries, and high-volume data ingestion/indexing for observability and security use cases. 
In particular, our testing methodology used <a href="https://github.com/elastic/rally-tracks/tree/master">publicly available datasets for Rally</a>, Elastic’s benchmarking tool.<ul><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/wikipedia"><code>wikipedia</code></a>: A dataset derived from a snapshot of Wikipedia’s text contents, to measure general-purpose text search performance.</li><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/msmarco-passage-ranking"><code>MSMARCO-Passage-Ranking</code></a>: A dataset derived from Microsoft’s Machine Reading Comprehension (MS MARCO), to measure search performance on sparse vector fields.</li><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/openai_vector"><code>OpenAI_Vector</code></a>: A dataset derived from BEIR’s NQ and enriched with embeddings generated by OpenAI’s <code>text-embedding-ada-002</code> model, to measure search performance on dense vector fields.</li></ul></li><li><strong>Measurement:</strong> We compared performance on the old and new infrastructure, measuring throughput in operations per second and latency at the 99th percentile (P99) to capture worst-case, tail-latency performance. Each track was run five times for each hardware profile to ensure consistency in the results.</li><li><strong>The goal:</strong> Our aim was to validate the infrastructure&#x27;s ability to deliver consistently <strong>faster and more predictable performance</strong> across the board, even during periods of rapid autoscaling.</li></ul><h3><strong>Performance data summary</strong></h3><p>The results confirm significant gains in efficiency and speed. These gains translate directly into lower response times for your users and lower operational costs as a result of the ability to complete the same amount of work with fewer compute resources.</p><p>The following tables detail the quantitative improvements. 
Higher values are better for throughput; lower values are better for latency.</p><p><strong>Searching benchmark results:</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Benchmark</td><td style="border: 1px solid #ccc; padding: 4px;">Comparison</td><td style="border: 1px solid #ccc; padding: 4px;">Old infra</td><td style="border: 1px solid #ccc; padding: 4px;">New infra</td><td style="border: 1px solid #ccc; padding: 4px;">Differential</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">729</td><td style="border: 1px solid #ccc; padding: 4px;">1107</td><td style="border: 1px solid #ccc; padding: 4px;">+52%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">56</td><td style="border: 1px solid #ccc; padding: 4px;">35</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">31</td><td style="border: 1px solid #ccc; padding: 4px;">+40%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">108</td><td style="border: 1px solid #ccc; padding: 4px;">67</td><td style="border: 1px solid #ccc; padding: 4px;">-38%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">475</td><td style="border: 1px solid #ccc; padding: 4px;">624</td><td style="border: 1px solid #ccc; padding: 4px;">+31%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">35</td><td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
              </tbody>
            </table>
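          <p>A note on the latency metric in these tables: P99 is the value below which 99% of measured operations fall, so it captures the slow tail that averages and medians hide. The following is a minimal, illustrative Python sketch of a nearest-rank percentile (not Rally’s actual implementation):</p>

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample >= pct% of all samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 100 simulated query latencies (ms): 98 fast ones and 2 slow outliers
latencies = [20] * 98 + [400, 500]

p50 = percentile(latencies, 50)  # median: 20 ms
p99 = percentile(latencies, 99)  # tail: 400 ms
```

          <p>Comparing the two values shows why benchmarks report P99: the outliers that dominate user-perceived slowness barely move the median.</p>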
          <p><strong>Indexing benchmark results:</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Benchmark</td><td style="border: 1px solid #ccc; padding: 4px;">Comparison</td><td style="border: 1px solid #ccc; padding: 4px;">Old infra</td><td style="border: 1px solid #ccc; padding: 4px;">New infra</td><td style="border: 1px solid #ccc; padding: 4px;">Differential</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">2845</td><td style="border: 1px solid #ccc; padding: 4px;">3220</td><td style="border: 1px solid #ccc; padding: 4px;">+13%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">1769</td><td style="border: 1px solid #ccc; padding: 4px;">1120</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">7087</td><td style="border: 1px solid #ccc; padding: 4px;">8900</td><td style="border: 1px solid #ccc; padding: 4px;">+26%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">824</td><td style="border: 1px solid #ccc; padding: 4px;">677</td><td style="border: 1px solid #ccc; padding: 4px;">-18%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">2972</td><td style="border: 1px solid #ccc; padding: 4px;">3187</td><td style="border: 1px solid #ccc; padding: 4px;">+7%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">2946</td><td style="border: 1px solid #ccc; padding: 4px;">2944</td><td style="border: 1px solid #ccc; padding: 4px;">0%</td>
                    </tr>
              </tbody>
            </table>
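          <p>The differentials in the two tables above are simply the percent change between the old-infrastructure and new-infrastructure columns. A quick sketch of the arithmetic, using rows whose published figures round cleanly (a few other rows appear to be truncated rather than rounded):</p>

```python
def differential(old, new):
    """Percent change from old to new infrastructure, as a whole percent."""
    return round((new - old) / old * 100)

# Values taken from the tables above
search_wikipedia = differential(729, 1107)   # wikipedia search throughput: +52%
search_openai = differential(475, 624)       # OpenAI_Vector search throughput: +31%
indexing_msmarco = differential(7087, 8900)  # MSMARCO indexing throughput: +26%
latency_msmarco = differential(108, 67)      # MSMARCO search P99 latency: -38%
```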
          <h2><strong>The added bonus: Cost reduction</strong></h2><p>While our focus is on delivering low-latency performance, the efficiency of the new hardware also has a direct, positive impact on costs for Elasticsearch projects.</p><p><a href="https://www.elastic.co/pricing/serverless-search">Elasticsearch Serverless pricing</a> is usage-based, meaning that you only pay for the ingest and search resources you consume. Because the newer, faster hardware is more efficient, your workloads will often complete tasks using fewer resources, leading to an inherent cost reduction for most projects. You get a premium performance boost without the premium price tag—the definition of optimized efficiency.</p><h2><strong>What does this mean for you, the developer?</strong></h2><p>This infrastructure upgrade is entirely managed by Elastic, so you don&#x27;t need to lift a finger—no migrations and no configuration changes. The improvement is immediate and automatic across all your AWS-based serverless projects.</p><p>This upgrade empowers you to:</p><ul><li><strong>Build faster applications:</strong> Focus on feature velocity, knowing that your underlying search platform is delivering the speed your users demand.</li><li><strong>Innovate with confidence:</strong> Deploy new search, observability, and security features—including complex AI capabilities, like vector search and relevance ranking—with the assurance that the platform can handle the load at peak performance.</li><li><strong>Simplify your stack:</strong> Use a fully managed service that handles infrastructure management, capacity planning, and scaling, so you can focus on your code and data.<br/></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-embeddings-v3 is now available on Elastic Inference Service]]></title>
            <description><![CDATA[Introducing jina-embeddings-v3 on Elastic Inference Service (EIS) and explaining how to get started.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Ranjana Devaji, Brendan Jugan]]></dc:creator>
            <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bb1c96c697d8d48b764bee487a73a6cae130d0d-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we are excited to introduce <a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/"><code>jina-embeddings-v3</code></a> on Elastic Inference Service (EIS), enabling fast multilingual dense retrieval directly in Elasticsearch. It is the first Jina AI model available on EIS, with many more to come soon.</p><p>Jina AI, <a href="https://www.elastic.co/blog/elastic-jina-ai">which recently joined Elastic via acquisition</a>, is a leader in open-source multilingual and multimodal embeddings, rerankers, and small language models. Jina brings deep expertise in <a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">search foundation models</a> that help developers build high-quality retrieval and RAG systems across text, images, code, and long multilingual content.</p><p><a href="https://www.elastic.co/blog/elastic-inference-service">Elastic Inference Service</a> (EIS) makes it easy for developers to add fast, high-quality, and reliable semantic retrieval for search, RAG, and agentic applications with predictable, consumption-based pricing. EIS offers fully managed, GPU-powered inference with ready-to-use models, with no additional setup or hosting complexity required.</p><p><code>jina-embeddings-v3</code> supports high-quality multilingual retrieval with long-context reasoning and task-tuned modes for RAG and agents. 
It provides developers with fast dense embeddings across a broad range of languages without the operational overhead.</p><h2>Why jina-embeddings-v3?</h2><p><code>jina-embeddings-v3</code> is a text embedding model that supports 32 languages and context lengths of up to 8192 tokens, delivering high relevance at lower cost through GPU-powered inference on EIS.</p><h3><strong>Key capabilities</strong></h3><ul><li><strong>Multilinguality</strong>: Closes the language gap and aligns meaning across 32 languages, including Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.</li><li><strong>Parameter efficiency</strong>: <a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/">Delivers higher performance</a> with only 570M parameters, achieving performance comparable to much larger LLM-based embeddings at lower cost.</li><li><strong>Dimensionality control</strong>: Defaults to 1024 dimensions, and with Matryoshka representation support, it lets developers dial the embedding size all the way down to 32 dimensions, giving you the flexibility to balance accuracy, latency, and storage based on your needs.</li><li><strong>Task-specific optimization</strong>: Features task-specific Low-Rank Adaptation (LoRA) adapters, enabling it to generate high-quality embeddings for various tasks, including query-document retrieval, clustering, classification, and text matching.</li></ul><h2><strong>Get started</strong></h2><p>Generate an embedding by calling the preconfigured inference endpoint:</p><pre><code>POST _inference/text_embedding/.jina-embeddings-v3
{
  "input": ["Rocky Mountain National Park"],
  "input_type": "ingest"
}</code></pre><p>The response:</p><pre><code>{
  "text_embedding": [
    {
      "embedding": [
        -0.06678891,
        -0.0073341704,
        0.011903269,
        -0.041797,
      ...
      ]
    }
  ]
}</code></pre><h2>What’s next</h2><p>Alongside these new models, EIS continues to evolve to support more users and simplify semantic search across environments.</p><p><strong>Cloud Connect for EIS: </strong><a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a> for EIS will soon bring EIS to self-managed environments, reducing operational overhead and enabling hybrid architectures and scaling where it works best for you.</p><p><strong>semantic_text defaults to jina-embeddings-v3 on EIS: </strong><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> currently uses ELSER as the embeddings model behind the scenes, but will default to the <code>jina-embeddings-v3</code> endpoint on EIS in the near future. This change will provide built-in inference at ingestion time, making it easier to adopt multilingual search without additional configuration.</p><p><strong>More models: </strong>We’re expanding the EIS model catalog to meet the rising inference demands of our customers. In the coming months, we’ll introduce new models that support an even broader set of search and inference workloads. Hot on the heels of <code>jina-embeddings-v3</code>, the next models to follow are <code><a href="https://jina.ai/models/jina-reranker-v2-base-multilingual/">jina-reranker-v2-base-multilingual</a> </code>and <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/"><code>jina-reranker-v3</code></a>. Both Jina AI models greatly improve precision through multilingual reranking for RAG and AI agents.</p><h2>Conclusion</h2><p>With <code>jina-embeddings-v3</code> on EIS, you can build multilingual, high-precision retrieval pipelines without managing models, GPUs, or infrastructure. 
You get fast dense retrieval and tight integration with Elasticsearch’s relevance stack, all in one platform.</p><p>Whether you are building global RAG systems, search, or agentic workflows that need reliable context, Elastic now gives you a high-performance model out of the box and the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to the Elastic Inference Service. <a href="https://www.elastic.co/cloud/serverless">Try it now</a> on Elastic Cloud Serverless and Elastic Cloud Hosted.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search and multistage retrieval in ES|QL]]></title>
            <description><![CDATA[Explore the multistage retrieval capabilities of ES|QL, using FORK and FUSE commands to integrate hybrid search with semantic reranking and native LLM completions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Ioana Tagirta, Aurélien Foucret, Carlos Delgado]]></dc:creator>
            <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fceada10b0c09d6a4a372f137bb3040e1ff41fbf-1600x895.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In Elasticsearch 9.2, we’ve introduced the ability to do dense vector search and hybrid search in Elasticsearch Query Language (ES|QL). This continues our investment in making ES|QL the best search language to solve modern search use cases.</p><h2>Multistage retrieval: The challenge of modern search</h2><p>Modern search has evolved beyond simple keyword matching. Today&#x27;s search applications need to understand intent, handle natural language, and combine multiple ranking signals to deliver the best results.</p><p>Retrieval of the most relevant results happens in multiple stages, with each stage gradually refining the result set. This wasn’t the case in the past, where most use cases would require one or two stages of retrieval: an initial query to get results and a potential rescoring phase.	</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fceada10b0c09d6a4a372f137bb3040e1ff41fbf-1600x895.png" style="width: 100%;" /><p>We start with an initial retrieval, where we cast a wide net to gather results that are relevant to our query. Since we need to sieve through all the data, we should use techniques that return results fast, even when we index billions of documents.</p><p>We therefore employ trusted techniques, such as lexical search that Elasticsearch has supported and optimized since the beginning, or vector search, where Elasticsearch excels in speed and accuracy.</p><p>Lexical search using BM25 is quite fast and best at exact term matching or phrase matching, and <a href="https://www.elastic.co/docs/solutions/search/vector">vector</a> or <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> is better suited for handling natural language queries. <a href="https://www.elastic.co/what-is/hybrid-search">Hybrid search</a> combines lexical and <a href="https://www.elastic.co/docs/solutions/search/vector">vector search</a> results to bring the best from both. 
The challenge that hybrid search solves is that vector and lexical search have completely different and incompatible scoring functions which produce values in different intervals, following different distributions. A vector search score close to 1 can mean a very close match, but it doesn’t mean the same for lexical search. Hybrid search methods, such as <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">reciprocal rank fusion</a> (RRF) and linear combination of scores, assign new scores that blend the original scores from lexical and vector search.</p><p>After hybrid search, we can employ techniques such as <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking">semantic reranking</a> and <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">Learning To Rank</a> (LTR), which use specialized machine learning models to rerank the result.</p><p>With our most relevant results, we can use large language models (LLMs) to further enrich our response or pass the most relevant results as context to LLMs in agentic workflows in tools such as <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a>.</p><p>ES|QL is able to handle all these stages of retrieval. By design, ES|QL is a piped language, where each command transforms the input and sends the output to the next command. Each stage of retrieval is represented by one or more consecutive ES|QL commands. In this article, we show how each stage is supported in ES|QL.</p><h2>Vector search</h2><p>In Elasticsearch 9.2, we introduced tech preview support for dense vector search in ES|QL. This is as simple as calling the <code>knn</code> function, which only requires a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field and a query vector:</p><pre><code>FROM books METADATA _score
| WHERE KNN(description_vector, ?query_vector)
| SORT _score DESC
| LIMIT 100</code></pre><p>This query executes an approximate nearest neighbor search, retrieving 100 documents that are the most similar to the <code>query_vector</code>.</p><h2>Hybrid search: Reciprocal rank fusion</h2><p>In Elasticsearch 9.2, we introduced support for hybrid search using RRF and linear combination of results in ES|QL.</p><p>This allows combining vector search and lexical search results into a single result set.</p><p>To achieve this in ES|QL, we need to use the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> commands. <code>FORK</code> runs multiple branches of execution, and <code>FUSE</code> merges the results and assigns new relevance scores using RRF or linear combination.</p><p>In the following example, we use <code>FORK</code> to run two separate branches, where one is doing a lexical search using the <code>match</code> function, while the other is doing a vector search using the <code>knn</code> function. We then merge the results together using <code>FUSE</code>:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE // uses RRF by default
| SORT _score DESC</code></pre><p>Let&#x27;s decompose the query to better understand the execution model and first look at the output of the <code>FORK</code> command:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)</code></pre><p>The <code>FORK</code> command outputs the results from both branches and adds a <code>_fork</code> discriminator column:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.86</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.84</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.79</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">4.55</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">4.25</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">4.11</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">4.1</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">3.8</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>As you’ll notice, certain documents appear twice, which is why we then use <code>FUSE</code> to merge rows that represent the same documents and assign new relevance scores. <code>FUSE</code> is executed in two stages:</p><ul><li>For each row, <code>FUSE</code> assigns a new relevance score, depending on the hybrid search algorithm that is being used.</li><li>Rows that represent the same document are merged together, and a new score is computed.</li></ul><p>In our example, we’re using RRF. As a first step, <code>FUSE</code> assigns a new score to each row using the RRF formula:</p><pre><code>score(doc) = 1 / (rank_constant + rank(doc))</code></pre><p>Where the <code>rank_constant</code> takes a default value of 60 and <code>rank(doc)</code> represents the position of the document in the result set.</p><p>In the first phase, our results become:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 1) = 0.01639</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 2) = 0.01613</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 3) = 0.01587</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 4) = 0.01563</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 5) = 0.01538</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 6) = 0.01515</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 1) = 0.01639</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 2) = 0.01613</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 3) = 0.01587</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 4) = 0.01563</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 5) = 0.01538</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>Then the rows are merged together and a new score is assigned. Since a <code>SORT _score DESC</code> follows the <code>FUSE</code> command, the final results are:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.01639 + 0.01639 = 0.03279</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.01613 + 0.01613 = 0.03226</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.01587 + 0.01563 = 0.0315</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.01538 + 0.01587 = 0.03125</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.01563 + 0.01538 = 0.03101</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.01515</td>
                    </tr>
              </tbody>
            </table>
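<p>To make the two <code>FUSE</code> stages concrete, here is a small standalone Python sketch (not part of the article’s ES|QL examples) that reproduces the RRF scores in the tables above; the document IDs are the ones shown, listed per fork in descending score order:</p>

```python
# Standalone sketch: FUSE's default RRF scoring, reproduced in plain Python.
# Each fork lists document IDs already sorted by _score descending.
RANK_CONSTANT = 60  # FUSE's default rank_constant

fork1 = ["4001", "3999", "4005", "4006", "4123", "4144"]
fork2 = ["4001", "3999", "4123", "4005", "4006"]

scores = {}
for fork in (fork1, fork2):
    for rank, doc_id in enumerate(fork, start=1):
        # Stage 1: per-row RRF score; stage 2: rows for the same doc are summed.
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (RANK_CONSTANT + rank)

# Equivalent of the trailing SORT _score DESC
fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
# "The Hobbit" (4001) ends up first with 1/61 + 1/61 ≈ 0.03279
```

<p>Running this yields the same ordering as the final table above: 4001, 3999, 4005, 4123, 4006, 4144.</p>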
          <h2>Hybrid search: Linear combination of scores</h2><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">Reciprocal rank fusion</a> is the simplest way to do hybrid search, but it isn’t the only hybrid search method that we support in ES|QL.</p><p>In the following example, we use <code>FUSE</code> to combine lexical and <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-semantic-text">semantic search</a> results using linear combination of scores:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 } }
| SORT _score DESC</code></pre><p>Let&#x27;s first decompose the query and take a look at the input of the <code>FUSE</code> command when we only run the <code>FORK</code> command.</p><p>Notice that we use the <code>match</code> function, which can query not only lexical fields, such as <code>text</code> or <code>keyword</code>, but also <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> fields.</p><p>The first <code>FORK</code> branch executes a semantic query by querying a <code>semantic_text</code> field, while the second one executes a lexical query:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)</code></pre><p>The output of the <code>FORK</code> command can contain rows with the same <code>_id</code> and <code>_index</code> values representing the same Elasticsearch document:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.86</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.84</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.79</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">4.55</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">4.25</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">4.11</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">4.1</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">3.8</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>In the next step, we use <code>FUSE</code> to merge rows that have the same <code>_id</code> and <code>_index</code> values, and assign new relevance scores.</p><p>The new score is a linear combination of the scores the row had in each <code>FORK</code> branch:</p><pre><code>_score = 0.7 * _score1 + 0.3 * _score2</code></pre><p>Here, <code>_score1</code> and <code>_score2</code> represent the score a document has in the first <code>FORK</code> branch and the second <code>FORK</code> branch, respectively.</p><p>Notice that we also apply custom weights, giving more weight to the semantic score over the lexical one, resulting in this set of documents:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.88 + 0.3 * 4.55 = 1.981</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.88 + 0.3 * 4.25 = 1.891</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.84 + 0.3 * 4.1 = 1.818</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.78 + 0.3 * 4.11 = 1.779</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.86 + 0.3 * 3.8 = 1.742</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.79 + 0.3 * 0 = 0.553</td>
                    </tr>
              </tbody>
            </table>
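<p>As a sanity check, the same arithmetic can be reproduced in a few lines of Python (a standalone sketch; the per-fork scores are copied from the fork output above, and a document missing from a fork contributes 0 to the combination):</p>

```python
# Standalone sketch: FUSE LINEAR with weights 0.7 (fork1, semantic) and 0.3 (fork2, lexical).
semantic = {"4001": 0.88, "3999": 0.88, "4005": 0.86, "4006": 0.84, "4123": 0.78, "4144": 0.79}
lexical = {"4001": 4.55, "3999": 4.25, "4123": 4.11, "4005": 3.8, "4006": 4.1}

fused = {
    doc_id: 0.7 * semantic.get(doc_id, 0.0) + 0.3 * lexical.get(doc_id, 0.0)
    for doc_id in set(semantic) | set(lexical)
}
# e.g. "The Hobbit" (4001): 0.7 * 0.88 + 0.3 * 4.55 = 1.981
```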
          <p>One challenge is that the semantic and lexical scores can be incompatible for a linear combination, since they may follow completely different distributions. To mitigate this, we first need to normalize the scores using a score normalization method such as <code>minmax</code>, which rescales the scores from each <code>FORK</code> branch to values between 0 and 1 before the linear combination formula is applied.</p><p>To achieve this with <code>FUSE</code>, we need to specify the <code>normalizer</code> option:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 }, "normalizer": "minmax" }
| SORT _score DESC</code></pre><h2>Semantic reranking</h2><p>At this stage, after hybrid search, we should be left with the most relevant documents. We can now use semantic reranking to reorder the results using the <code>RERANK</code> command. By default, <code>RERANK</code> uses the latest Elastic <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking">semantic reranking</a> machine learning model, so no additional configuration is needed:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC</code></pre><p>We now have our best results, sorted by relevance.</p><p>One key feature that sets the <code>RERANK</code> command apart from other products that offer semantic reranking integrations is that it doesn’t require the input to represent a mapped field from an index. <code>RERANK</code> only expects an expression that evaluates to a string value, making it possible to do semantic reranking using multiple fields:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON CONCAT(title, "\n", description) 
| SORT _score DESC</code></pre><h2>LLM completions</h2><p>Now we have a set of highly relevant, reranked results.</p><p>At this stage, you might simply decide to return the results to your application, or you might want to further enhance them using LLM completions.</p><p>If you’re using ES|QL as part of a retrieval-augmented generation (RAG) workflow, you can choose to call your favorite LLM directly from ES|QL.<br/>To achieve this, we’ve added a new <code>COMPLETION</code> command that takes in a prompt, a completion inference ID that designates which LLM to call, and a column identifier to specify where to output the LLM response.</p><p>In the following example, we’re using <code>COMPLETION</code> to add a new column that contains a summary of the <code>description</code> column:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC
| LIMIT 10
| COMPLETION CONCAT("Summarize the following:\n", description) WITH { "inference_id" : "my_inference_endpoint" } </code></pre><p>Each row now contains a summary:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">summary</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.03279</td><td style="border: 1px solid #ccc; padding: 4px;">Bilbo helps dwarves reclaim Erebor from the dragon Smaug.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.03226</td><td style="border: 1px solid #ccc; padding: 4px;">Frodo begins the quest to destroy the One Ring.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.0315</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship splits; war comes to Rohan; Frodo nears Mordor.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.03125</td><td style="border: 1px solid #ccc; padding: 4px;">Ancient myths and history of Middle-earth's First Age.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.03101</td><td style="border: 1px solid #ccc; padding: 4px;">Sauron is defeated and Aragorn is crowned King.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.01515</td><td style="border: 1px solid #ccc; padding: 4px;">The tragic tale of Túrin Turambar's cursed life.</td>
                    </tr>
              </tbody>
            </table>
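<p>These pipelines are plain ES|QL strings, so they can also be issued from an application through the ES|QL query API with named parameters. The sketch below is illustrative rather than production code: the question text is made up, and the commented-out client call assumes a locally running deployment and the official Python client:</p>

```python
# Sketch: building a parameterized ES|QL RAG pipeline to send from an application.
# Inside the ES|QL string, "\n" must reach Elasticsearch literally, hence the "\\n" escape.
query = """
FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 10
| COMPLETION CONCAT("Summarize the following:\\n", description) WITH { "inference_id" : "my_inference_endpoint" }
"""
# Named parameters fill the ?query placeholders.
params = [{"query": "epic quests in Middle-earth"}]

# With the official Python client (uncomment against a real deployment):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200", api_key="...")
# resp = es.esql.query(query=query, params=params)
# for row in resp["values"]:
#     print(row)
```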
          <p>In another use case, you may simply want to answer a question using the proprietary data that you have indexed in Elasticsearch. In this case, the best search results that we’ve computed in the previous stage can be used as context for the prompt:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC
| LIMIT 10
| STATS context = VALUES(CONCAT(title, "\n", description))
| COMPLETION CONCAT("Answer the following question ", ?query, " based on:\n", context) WITH { "inference_id" : "my_inference_endpoint" }</code></pre><p>Since the <code>COMPLETION</code> command unlocks the ability to send any prompt to an LLM, the possibilities are endless. Although we’re only showing a few examples, the <code>COMPLETION</code> command can be used in a wide range of scenarios: security analysts can use it to score whether a log event may represent a malicious action, data scientists can use it to analyze data, and you can even use it to <a href="https://www.elastic.co/search-labs/blog/esql-completion-command-llm-fact-generator">generate Chuck Norris facts based on your data</a>.</p><h2>This is only the beginning</h2><p>In the future, we’ll be expanding ES|QL to improve semantic reranking for long documents, enable better conditional execution of queries using multiple <code>FORK</code> commands, support sparse vector queries, remove near-duplicate results to enhance result diversity, allow full-text search on runtime-generated columns, and cover many other scenarios.</p><p>Additional tutorials and guides:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/esql-for-search">ES|QL for search</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-search-tutorial">ES|QL for search tutorial</a></li><li><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">Semantic_text field type</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> documentation</li><li>ES|QL search functions</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Implementing an agentic reference architecture with Elastic Agent Builder and MCP]]></title>
            <description><![CDATA[Explore an agentic reference architecture with Elastic Agent Builder, MCP, and semantic search to build a security agent for automated threat analysis.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33a57e3f61f9095c99b6d1499175a6edb0d5dfc5-4720x2560.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we will present a reference architecture for using Elasticsearch with AI capabilities through the <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a>, exposing an <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP server</a> to access Agent Builder tools and Elasticsearch data.</p><p>Model Context Protocol (<a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a>) is an open-source standard that enables applications and LLMs to communicate with external systems via <a href="https://modelcontextprotocol.io/specification/2025-06-18/server/tools">MCP tools</a> (programmatic capabilities), and <a href="https://docs.langchain.com/oss/python/langgraph/overview">LangGraph</a> (an extension of <a href="https://docs.langchain.com/oss/javascript/langchain/overview">LangChain</a>) provides the orchestration framework for these agentic workflows.</p><p>We’ll implement an application that can search both internal knowledge (data stored in Elasticsearch) and external sources (on the internet) to identify potential and known vulnerabilities related to a specific tool. The application will gather the information and generate a detailed summary of the findings.</p><h2>Requirements</h2><ul><li>Elasticsearch 9.2</li><li>Python 3.10 or later</li><li><a href="https://platform.openai.com/api-keys">OpenAI API Key</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">Elasticsearch API Key</a></li><li><a href="https://serpapi.com/users/sign_up?plan=free">Serper API Key</a></li></ul><h2>Elastic Agent Builder</h2><p><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a> is a set of AI-powered capabilities for developing and integrating agents that can interact with your Elasticsearch data. 
It provides a built-in agent that can be used for natural language conversations with your data or your Elasticsearch instance, and it also supports tool creation, Elastic APIs, A2A, and MCP. In this article, we will focus on using the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP server</a> for external access to the Elastic Agent Builder tools.</p><p>To learn more about Agent Builder features, you can read <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">this article</a>.</p><h3>Agent Builder MCP feature</h3><p>The <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP server</a> is available in the Agent Builder and can be accessed at:</p><pre><code>{KIBANA_URL}/api/agent_builder/mcp
# Or if you are using a custom Kibana space:
{KIBANA_URL}/s/{SPACE_NAME}/api/agent_builder/mcp</code></pre><p>The Agent Builder offers <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#built-in-tools">Built-in tools</a>, and you can also create your <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#custom-tools">custom tools</a>.</p><h2>Reference architecture</h2><p>To get a complete overview of the elements used by an agentic application in an end-to-end workflow, let’s look at the following diagram:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c5bbba345340bfe5571b17d53b5896d4a3235eac-4720x2560.png" style="width: 100%;" /><p>Elasticsearch is at the center of this architecture, functioning as a vector store, providing the embeddings generation model, and also serving the MCP server to access the data via tools. To better explain the workflow, let’s look at the ingestion and the Agent Builder layer separately.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/97a0075ae637d64140ec7ff0d167297723675632-3000x1176.png" style="width: 100%;" /><p>Here, the first element is the data that will be stored in Elasticsearch. The data passes through an ingest pipeline, where it is processed by the Elasticsearch ELSER model to generate embeddings and then stored in Elasticsearch.</p><h3>Elastic Agent Builder layer</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bb396bbd4c3baa3be26f9d9e386f4d5405132ab-2180x2560.png" style="width: 100%;" /><p>On this layer, the Agent Builder plays a central role by exposing the tools needed to interact with the Elasticsearch data. It manages the tools that operate over Elasticsearch indices and makes them available for consumption. 
Then <a href="https://docs.langchain.com/oss/python/langchain/overview">LangChain</a> handles the orchestration via the MCP client.</p><p>This architecture allows Agent Builder to work as one of many MCP servers available to the client, so that Elastic Agent Builder can be combined with other MCP servers. This way, the MCP client can ask cross-source questions and then combine the answers.</p><h2>Use case: Security vulnerability agent</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1b46b29a8cde4645ebaec1f747be4f6888dd8d39-1600x906.png" style="width: 100%;" /><p>The security vulnerability agent identifies potential risks based on a user’s question by combining three complementary layers:</p><p><strong>First</strong>, it performs a <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> with embeddings over an internal knowledge base of past incidents, configurations, and known vulnerabilities to retrieve relevant historical evidence.</p><p><strong>Second</strong>, it searches the internet for newly published recommendations or threat intelligence that may not yet exist internally.</p><p><strong>Finally</strong>, an LLM correlates and prioritizes both internal and external findings, evaluates their relevance to the user’s specific environment, and produces a clear explanation along with potential mitigation steps.</p><h2>Developing the application</h2><p>The application’s code can be found in the attached <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">notebook</a>.</p><p>You can see the setup for the Python application below:</p><pre><code># load environment variables
import json
import os

import requests
from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers

load_dotenv()

ELASTICSEARCH_ENDPOINT = os.getenv("ELASTICSEARCH_ENDPOINT")
ELASTICSEARCH_API_KEY = os.getenv("ELASTICSEARCH_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SERPER_API_KEY = os.getenv("SERPER_API_KEY")
KIBANA_URL = os.getenv("KIBANA_URL")
# Agent Builder's MCP endpoint (default Kibana space), used by the MCP client later
MCP_ENDPOINT = f"{KIBANA_URL}/api/agent_builder/mcp"

INDEX_NAME = "security-vulnerabilities"
KIBANA_HEADERS = {
    "kbn-xsrf": "true",
    "Content-Type": "application/json",
    "Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}",
} # Useful for Agent Builder API calls


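# Optional sanity check (an addition, not part of the original notebook):
# report any required connection settings that are missing before the
# clients are created, instead of failing later with an opaque error.
_required = ["ELASTICSEARCH_ENDPOINT", "ELASTICSEARCH_API_KEY", "KIBANA_URL"]
_missing = [name for name in _required if not os.getenv(name)]
if _missing:
    print(f"⚠️ Missing environment variables: {_missing}")
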
es_client = Elasticsearch(ELASTICSEARCH_ENDPOINT, api_key=ELASTICSEARCH_API_KEY) # Elasticsearch client</code></pre><p>We need to access Agent Builder and create one agent specialized in security queries and one tool to perform semantic search. You need to have the<a href="https://www.elastic.co/docs/solutions/search/agent-builder/get-started"> Agent Builder <strong>enabled</strong></a> for the next step. Once it’s on, we’ll use the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/kibana-api#tools">tools API</a> to create a tool that will perform a semantic search.</p><pre><code>security_search_tool = {
    "id": "security-semantic-search",
    "type": "index_search",
    "description": "Search internal security documents including incident reports, pentests, internal CVEs, security guidelines, and architecture decisions. Uses semantic search powered by ELSER to find relevant security information even without exact keyword matches. Returns documents with severity assessment and affected systems.",
    "tags": ["security", "semantic", "vulnerabilities"],
    "configuration": {
        "pattern": INDEX_NAME,
    },
}

try:
    response = requests.post(
        f"{KIBANA_URL}/api/agent_builder/tools",
        headers=KIBANA_HEADERS,
        json=security_search_tool,
    )

    if response.status_code == 200:
        print("✅ Security semantic search tool created successfully")
    else:
        print(f"❌ Failed to create tool ({response.status_code}): {response.text}")
except Exception as e:
    print(f"❌ Error creating tool: {e}")</code></pre><p>Configure your tools following the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#best-practices">best practices</a> defined by Elastic for developing tools. Once created, this tool will be ready to use in the Kibana UI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e5b3282dea07987613d8e8d35c372ca68820e44-1600x381.png" style="width: 100%;" /><p>With the tool created, we can start writing the code for the ingestion workflow.</p><h3>Ingest pipeline</h3><p>To define the data structure, we need to have a <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/dataset.json">dataset</a> prepared for ingestion. Below is a sample document for this example:</p><pre><code>{
    "title": "Incident Report: Node.js Express 4.17 Prototype Pollution RCE",
    "content": "In March 2024, our production Node.js Express 4.17 API gateway experienced a critical prototype pollution vulnerability leading to remote code execution. The attack vector involved manipulating object prototypes through JSON payloads in POST requests. This affected all Express middleware processing user input. Immediate mitigation: upgrade to Express 4.18.2+, implement input validation, use Object.freeze() for critical objects. Related to CVE-2022-24999.",
    "doc_type": "incident_report",
    "severity": "critical",
    "affected_systems": [
      "api-gateway-prod",
      "api-gateway-staging"
    ],
    "date": "2024-03-15"
}</code></pre><p>For this type of document, we will use the following index mappings:</p><pre><code>index_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "doc_type": {"type": "keyword", "copy_to": "semantic_field"},
            "severity": {"type": "keyword", "copy_to": "semantic_field"},
            "affected_systems": {"type": "keyword", "copy_to": "semantic_field"},
            "date": {"type": "date"},
            "semantic_field": {"type": "semantic_text"},
        }
    }
}

if not es_client.indices.exists(index=INDEX_NAME):
    es_client.indices.create(index=INDEX_NAME, body=index_mapping)
    print(f"✅ Index '{INDEX_NAME}' created with semantic_text field for ELSER")
else:
    print(f"ℹ️  Index '{INDEX_NAME}' already exists, skipping creation")</code></pre><p>We are creating a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic_text</a> field to perform semantic search using the information from the fields marked with the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to">copy_to</a> property.</p><p>With that mapping definition, we can ingest the data using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk">bulk API</a>.</p><pre><code>def build_bulk_actions(documents, index_name):
    for doc in documents:
        yield {"_index": index_name, "_source": doc}

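# For reference, each action yielded by the generator above is a plain dict of
# the following shape (hypothetical document shown); helpers.bulk streams these
# actions to the _bulk API without materializing the whole dataset in memory:
_example_action = {
    "_index": "security-vulnerabilities",
    "_source": {"title": "Incident Report: demo entry", "severity": "low"},
}
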

try:
    with open("dataset.json", "r") as f:
        security_documents = json.load(f)

    success, failed = helpers.bulk(
        es_client,
        build_bulk_actions(security_documents, INDEX_NAME),
        refresh=True,
    )
    print(f"📥 {success} documents indexed successfully")

except Exception as e:
    print(f"❌ Error during bulk indexing: {str(e)}")</code></pre><h3>LangChain MCP client</h3><p>Here we’re going to create an MCP client using LangChain to consume the Agent Builder tools and build a workflow with LangGraph to orchestrate the client execution. The first step is to <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server#configuring-mcp-clients">connect to the MCP server</a>:</p><pre><code>client = MultiServerMCPClient(
    {
        "agent-builder": {
            "transport": "streamable_http",
            "url": MCP_ENDPOINT,
            "headers": {"Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}"},
        }
    }
)

tools = await client.get_tools()

print(f"📋 MCP Tools available: {[t.name for t in tools]}") # ['platform_core_search',  ... 'security-semantic-search']</code></pre><p>Next, we create an agent that selects the appropriate tool based on the user input:</p><pre><code>reasoning = {"effort": "low"}

llm = ChatOpenAI(
    model="gpt-5.2-2025-12-11", reasoning=reasoning, openai_api_key=OPENAI_API_KEY
) # LLM client 

agent = create_agent(
    llm,
    tools=tools,
    system_prompt="""You are a cybersecurity expert specializing in infrastructure security.

        Your role is to:
        1. Analyze security queries from users
        2. Search internal security documents (incidents, pentests, CVEs, guidelines)
        3. Provide actionable security recommendations
        4. Assess vulnerability severity and impact

        When responding:
        - Always search internal documents first using the agent builder tools
        - Provide specific, technical, and actionable advice
        - Cite relevant internal incidents and documentation
        - Assess severity (critical, high, medium, low)
        - Recommend immediate mitigation steps

        Be concise but comprehensive. Focus on practical security guidance.""",
)</code></pre><p>We’ll use the GPT-5.2 model, which represents OpenAI’s state-of-the-art for agent management tasks. We configure it with low reasoning effort to achieve faster responses compared to the medium or high settings, while still delivering high-quality results by leveraging the full capabilities of the GPT-5 family. You can read more about the GPT 5.2 <a href="https://openai.com/index/introducing-gpt-5-2/">here</a>.</p><p>Now that the initial setup is done, the next step is to define a workflow capable of making decisions, running tool calls, and summarizing results.</p><p>For this, we use LangGraph. We won’t cover LangGraph in depth here; <a href="https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch">this article</a> provides a detailed overview of its functionality.</p><p>The following image shows a high-level view of the LangGraph application.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/729295115427ec981a594e873245fa541dd977aa-332x531.png" style="width: 100%;" /><p>We need to define the application state:</p><pre><code>class AgentState(TypedDict):
    query: str
    agent_builder_response: dict
    internet_results: list
    final_response: str
    needs_internet_search: bool</code></pre><p>To better understand how the workflow operates, here is a brief description of each function. For full implementation details, refer to the accompanying <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">notebook</a>.</p><ul><li><strong>call_agent_builder_semantic_search:</strong> Queries internal documentation using the Agent Builder MCP server and also stores the retrieved messages in the state.</li><li><strong>decide_internet_search:</strong> Analyzes the internal results and determines whether an external search is required.</li><li><strong>perform_internet_search: </strong>Runs an external search using the <a href="https://serper.dev/">Serper</a> API when needed.</li><li><strong>generate_response:</strong> Correlates internal and external findings and produces a final, actionable cybersecurity analysis for the user.</li></ul><p>With the workflow defined, we can now send a query:</p><pre><code>query = "We are using Node.js with Express 4.17 for our API gateway. Are there known prototype pollution or remote code execution vulnerabilities?"</code></pre><p>In this example, we want to evaluate whether this specific version of Express is affected by known vulnerabilities.</p><h4>Research results</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b18a31e42bcd8f4d86bb605f85d4ff77135b0855-1084x517.png" style="width: 100%;" /><p>See the complete response in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">this file</a>.</p><p>This response clearly correlates internal and internet findings and provides actionable mitigation steps. 
It successfully highlights the severity of the vulnerability and offers a structured, security-oriented summary.</p><h3>Extensions and future enhancements</h3><p>This architecture is modular and allows us to extend its capabilities by replacing, improving, or adding components to the existing list. We could add another agent, consumed by the same MCP client. We can also use an automated ingestion workflow with tools such as Logstash, Kafka, or <a href="https://www.elastic.co/docs/reference/search-connectors/self-managed-connectors">Elastic self-managed connectors.</a> Feel free to change the LLM, the MCP client framework, or the embeddings model or add more tools depending on your needs.</p><h2>Conclusion</h2><p>This reference architecture shows a practical way to combine Elasticsearch, the Agent Builder, and MCP to build an AI-driven application. Its structure keeps each part independent, which makes the system easy to implement, maintain, and extend.</p><p>You can start with a simple setup (like the security use case in this article) and scale it by adding new tools, data sources, or agents as your needs grow. Overall, it provides a straightforward path for building flexible and reliable agentic workflows on top of Elasticsearch.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Automating log parsing in Streams with ML]]></title>
            <description><![CDATA[Learn how a hybrid ML approach achieved 94% log parsing and 91% log partitioning accuracy through automation experiments with log format fingerprinting in Streams.]]></description>
            <link>https://www.elastic.co/search-labs/blog/log-parsing-partitioning-automation-experiments-streams</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/log-parsing-partitioning-automation-experiments-streams</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Nastia Havriushenko]]></dc:creator>
            <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/965c58f19742361160593c38fcaa8b2f4b0d6cc5-3838x2159.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9108ec303339dd091faa3c363c7cf5c228155f49-3840x2160.png" style="width: 100%;" /><p>Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models (LLMs), with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.</p><p>We are happy to announce that this feature is already available in <a href="http://elastic.co/elasticsearch/streams"><span style="text-decoration:underline">Streams</span></a>!</p><h2>Dataset description</h2><p>We chose a <strong><a href="https://github.com/logpai/loghub">Loghub</a> </strong>collection of logs for PoC purposes. For our investigation, we selected representative samples from the following key areas:</p><ul><li>Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.</li><li>Server &amp; web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.</li><li>Operating systems: We included logs from Linux and Windows. 
These datasets represent the common, semi-structured system-level events that operations teams encounter daily.</li><li>Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.</li><li>Supercomputers: To test performance on high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.</li></ul><p>A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with microservice architecture.</p><p>Log examples:</p><pre><code>[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed</code></pre><p>In addition, we created a Kubernetes cluster with a typical web application + database set up to mine extra logs in the most common domain.</p><p>Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.</p><h2>Few-shot log parsing with an LLM</h2><p>Our first set of experiments focused on a fundamental question: <strong>Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?</strong></p><p>We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and <a href="https://www.elastic.co/docs/explore-analyze/scripting/grok">Grok</a> formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.</p><h3>High confidence &amp; context awareness</h3><p>Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Besides simple pattern matching, the model showed a capacity for log understanding —it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).</p><h3>The &quot;Goldilocks&quot; dilemma of input samples</h3><p>Our experiments quickly surfaced a significant lack of robustness because of extreme<strong> sensitivity to the input sample.</strong> The model&#x27;s performance fluctuates wildly based on the specific log examples included in the prompt. We observed a log similarity problem where the log sample needs to include <em>just diverse enough </em>logs:</p><ul><li>Too homogeneous (overfitting)<strong>:</strong> If the input logs are too similar, the LLM tends to <strong>overspecify</strong>. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. 
This results in brittle rules that cover a tiny ratio of logs and extract unusable fields.</li><li>Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, &quot;trash logs&quot; like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.</li></ul><h3>The context window constraint</h3><p>We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model&#x27;s output often deteriorated, becoming &quot;messy&quot; or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.</p><h3>The consistency &amp; standardization gap</h3><p>Even when the model successfully generated rules, we noted slight inconsistencies:</p><ul><li>Service naming variations: The model proposes different names for the same entity (e.g., labeling the source as &quot;Spark,&quot; &quot;Apache Spark,&quot; and &quot;Spark Log Analytics&quot; in different runs).</li><li>Field naming variations: Field names lacked standardization (e.g., <code>id</code> vs. <code>service.id</code> vs. <code>device.id</code>). 
We normalized names using a standardized <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">Elastic field naming</a>.</li><li>Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.</li></ul><h2>Log format fingerprint</h2><p>To address the challenge of log similarity, we introduce a high-performance heuristic: <strong>log format fingerprint (LFF)</strong>.</p><p>Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified &quot;fingerprint&quot; that allows us to group related logs.</p><p>The mapping logic is simple to ensure speed and consistency:</p><ol><li>Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.</li><li>Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.</li><li>Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.</li><li>Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.</li></ol><p>We introduce the log mapping approach. 
Let&#x27;s look at an example of how this mapping allows us to transform the logs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/78fa2887486eb9417804354ee3bf2a4fdb0f6383-846x252.png" style="width: 100%;" /><p>As a result, we obtain the following log masks:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec439a3d3a25002498b97defcff733ea5ebc6b55-826x94.png" style="width: 100%;" /><p>Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (<code>0/0/0 0:0:0 a a.a:</code>) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.</p><p>The third log, however, produces a completely divergent fingerprint (<code>0-0-0...</code>). This allows us to algorithmically separate it from the first group <em>before</em> we ever invoke an LLM.</p><h2>Bonus part: Instant implementation with ES|QL</h2><p>It’s as easy as passing this query in Discover.</p><pre><code>FROM loghub |
EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, "[ \t\n]+", " "), "[A-Za-z]+", "a"), "[0-9]+", "0"), "a( a)+", "a") |
STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources=VALUES(filename), example=TOP(raw_message, 3, "desc") BY SUBSTRING(pattern, 0, 15) |
SORT total_count DESC |
LIMIT 100</code></pre><p><strong>Query breakdown:</strong></p><p><strong>FROM</strong> loghub: Targets our index containing the raw log data.</p><p><strong>EVAL</strong> pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to &#x27;0&#x27;, text to &#x27;a&#x27;, etc.) and save the result in a “pattern” field.</p><p><strong>STATS </strong>[column1 =] expression1, …<strong> BY </strong>SUBSTRING(pattern, 0, 15):</p><p>This is a clustering step. We group logs that share the first 15 characters of their pattern and create aggregated fields such as the total log count per group, the list of log data sources, the pattern prefix, and three log examples.</p><p><strong>SORT</strong> total_count DESC | <strong>LIMIT</strong> 100: Surfaces the top 100 most frequent log patterns.</p><p>The query results on LogHub are displayed below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b119498f124376c41d242a099bf9081fd6536be8-1600x394.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4dcfc0a5b7fa753497cc5def5ea3cd54449c0481-1600x719.png" style="width: 100%;" /><p>As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (&gt;90%) and achieved majority clustering in 13 out of 16 sources (&gt;60%), all without requiring additional cleaning, preprocessing, or fine-tuning.</p><p>Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation">log pattern analysis</a>. 
It provides immediate insights into log relationships and effectively manages large log clusters.</p><ul><li>Versatility as a primitive</li></ul><p>Thanks to its <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">ES|QL</a> implementation, LFF serves both as a standalone tool for fast data diagnostics and visualizations and as a building block in log analysis pipelines for high-volume use cases.</p><ul><li>Flexibility</li></ul><p>LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.</p><ul><li>Deterministic stability</li></ul><p>Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.</p><ul><li>Performance and memory</li></ul><p>It requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.</p><h2>Combining log format fingerprint with an LLM</h2><p>To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.</p><p>The objective was to demonstrate that LFF acts as an effective compression layer. We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.</p><h2>Execution pipeline</h2><p>We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3f46ea471760b406a32fc7d4bc74cc03faaced2-3840x1660.png" style="width: 100%;" /><p>1. Two-stage hierarchical clustering</p><ul><li>Subclasses (exact match): Logs are aggregated by identical fingerprints. 
Every log in one subclass shares the exact same format structure.</li><li>Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.</li><li>Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses when the first N characters of their format fingerprints match. This grouping strategy effectively splits lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.</li></ul><p>2. Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.</p><ul><li>We select representative logs from <em>each</em> valid subclass within the broader metaclass.</li><li>To manage an edge case of too numerous subclasses, we apply random down-sampling to fit the target window size.</li></ul><p>3. Rule generation. Finally, we prompt the LLM to generate a regex parsing rule that fits all logs in the provided sample for each metaclass. For our PoC, we used the GPT-4o mini model.</p><h2>Experimental results &amp; observations</h2><p>We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/49b2b6a1401dd1f33951da68e5a3fac37d0b5aaa-1600x1506.png" style="width: 100%;" /><p>The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. 
The diagonal alignment demonstrates the model&#x27;s high fidelity in source attribution, with minimal scattering.</p><h2>Performance benchmark insights</h2><ul><li><strong>Optimal baseline:</strong> A context window of <strong>30–40 log samples</strong> per category proved to be the &quot;sweet spot,&quot; consistently producing robust parsing with both regex and Grok patterns.</li><li><strong>Input minimization:</strong> We pushed the input size down to 10 logs per category for regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An introduction to Jina models, their functionality, and uses in Elasticsearch]]></title>
            <description><![CDATA[Explore Jina multimodal embeddings, Reranker v3, and semantic embedding models, and how to use them natively in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens]]></dc:creator>
            <pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/407b4c862b51ebdfc7f26db4e25950a65caf1673-656x442.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Jina by Elastic provides search foundation models for applications and business process automation. These models provide core functionality for bringing AI to Elasticsearch applications and innovative AI projects.</p><p>Jina models fall into three broad categories designed to support information processing, organization, and retrieval:</p><ul><li>Semantic embedding models</li><li>Reranking models</li><li>Small generative language models</li></ul><h2>Semantic embedding models</h2><p>The idea behind semantic embeddings is that an AI model can learn to represent aspects of the meaning of its inputs in terms of the geometry of high-dimensional spaces.</p><p>You can think of a semantic embedding as a point (technically a <em>vector</em>) in a high-dimensional space. An embedding model is a neural network that takes some digital data as input (potentially anything, but most often a text or an image) and outputs the location of a corresponding high-dimensional point as a set of numerical coordinates. 
If the model is good at its job, the closer two digital objects are in meaning, the smaller the distance between their semantic embeddings.</p><p>To understand how this is important for search applications, imagine one embedding for the word “dog” and one for the word “cat” as points in space:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/802cf9bbcb82180d3fc91009f9f62027eee8f031-615x615.png" style="width: 100%;" /><p>A good embedding model should generate an embedding for the word “feline” that’s much closer to “cat” than to “dog,” and “canine” should have an embedding much closer to “dog” than to “cat,” because those words mean almost the same thing:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bce49daf9a31b8fb7ce1c6ef7ae4e8117a4e8b33-615x615.png" style="width: 100%;" /><p>If a model is multilingual, we would expect the same thing for translations of “cat” and “dog”:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ce4d030385324526cbd7539140e0e634d939371c-615x615.png" style="width: 100%;" /><p>Embedding models translate similarity or dissimilarity in meaning between things into spatial relationships between embeddings. The pictures above have just two dimensions so you can see them on a screen, but embedding models produce vectors with dozens to thousands of dimensions. This makes it possible for them to encode subtleties of meaning for whole texts, assigning a point in a space that has hundreds or thousands of dimensions for documents of thousands of words or more.</p><h2>Multimodal embeddings</h2><p>Multimodal models extend the concept of semantic embeddings to things other than texts, especially to images. 
We would expect an embedding for a picture to be near to an embedding of a faithful description of the picture:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1ac6aef5b1423e5fe4853e8a547a74e66b0885c2-615x615.png" style="width: 100%;" /><p>Semantic embeddings have a lot of uses. Among other things, you can use them to build efficient classifiers, to do data clustering, and to accomplish a variety of tasks, like data deduplication and investigating data diversity, both of which are important for big data applications that involve working with too much data to manage by hand.</p><p>The biggest direct use of embeddings is in information retrieval. Elasticsearch can store retrieval objects with embeddings as keys. Queries are converted into embedding vectors, and a search returns the stored objects whose keys are the nearest to the query embedding.</p><p>Where traditional <em>vector-based retrieval</em> (sometimes called <em>sparse vector retrieval</em>) uses vectors based on words or metadata in documents and queries, <em>embedding-based retrieval</em> (also known as <em>dense vector retrieval</em>) uses AI-assessed meanings rather than words. This makes them generally much more flexible and more accurate than traditional search methods.</p><h2>Matryoshka representation learning</h2><p>The number of dimensions an embedding has, and the precision of the numbers in it have significant performance impacts. Very high-dimensional spaces and extremely high-precision numbers can represent highly detailed and complex information, but demand larger AI models that are more expensive to train and to run. The vectors they generate require more storage space, and it takes more computing cycles to calculate the distances between them. 
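The embedding-based retrieval just described boils down to a nearest-neighbor search over stored vectors. A minimal sketch, assuming the embeddings have already been computed (the toy 3-D vectors and document names below are illustrative stand-ins, not real model output):

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction (same meaning), near 0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-D "embeddings" standing in for real model output
store = {
    "doc-cat": [0.9, 0.1, 0.0],
    "doc-dog": [0.7, 0.6, 0.0],
    "doc-car": [0.0, 0.1, 0.9],
}

query = [0.85, 0.2, 0.05]  # pretend this encodes the query "feline"

# Retrieval = rank stored documents by similarity to the query embedding
ranked = sorted(store, key=lambda k: cosine_similarity(query, store[k]), reverse=True)
print(ranked)  # ['doc-cat', 'doc-dog', 'doc-car']
```

Production systems replace this exhaustive scan with an approximate nearest-neighbor index (such as the HNSW-based vector search in Elasticsearch), but the ranking principle is the same.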
Using semantic embedding models involves making important trade-offs between precision and resource consumption.</p><p>To maximize flexibility for users, Jina models are trained with a technique called <a href="https://arxiv.org/abs/2205.13147">Matryoshka Representation Learning</a>. This causes models to front-load the most important semantic distinctions into the first dimensions of the embedding vector so you can just cut off the higher dimensions and still get good performance.</p><p>In practice, this means that users of Jina models can choose how many dimensions they want their embeddings to have. Choosing fewer dimensions reduces precision, but the degradation in performance is minor. On most tasks, performance metrics for Jina models decline 1–2% every time you reduce the embedding size by 50%, down to about a 95% reduction in size.</p><h2>Asymmetric retrieval</h2><p>Semantic similarity is usually measured symmetrically. The value you get when comparing “cat” to “dog” is the same as the value you’d get comparing “dog” to “cat.” But when you use embeddings for information retrieval, they work better if you break the symmetry and encode queries differently from the way you encode retrieval objects.</p><p>This is because of the way we train embedding models. Training data contains instances of the same elements, like words, in many different contexts, and models learn semantics by comparing the contextual similarities and differences between elements.</p><p>So, for example, we might find that the word “animal” doesn’t appear in very many of the same contexts as “cat” or “dog,” and therefore the embedding for “animal” might not be particularly close to “cat” or “dog”:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a33163405af6c71ee7f4ba8ebc86af39e295a69-615x615.png" style="width: 100%;" /><p>This makes it less likely that a query for “animal” will retrieve documents about cats and dogs — the opposite of our goal. 
So instead, we encode “animal” differently when it’s a query than when it’s a target for retrieval:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/363992d4f0affba7937c0c8a9f82c9a531fcd3ba-615x615.png" style="width: 100%;" /><p><em>Asymmetric retrieval</em> means using a different model for queries or specially training an embedding model to encode things one way when they’re stored for retrieval and to encode queries another way.</p><h2>Multivector embeddings</h2><p>Single embeddings are good for information retrieval because they fit the basic framework of an indexed database: We store objects for retrieval with a single embedding vector as their retrieval key. When users query the document store, their queries are translated into embedding vectors and the documents whose keys are closest to the query embedding (in the high-dimensional embedding space) are retrieved as candidate matches.</p><p>Multivector embeddings work a little differently. Instead of generating a fixed-length vector to represent a query and a whole stored object, they produce a sequence of embeddings representing smaller parts of them. The parts are typically tokens or words for texts and are image tiles for visual data. 
These embeddings reflect the meaning of the part in its context.</p><p>For example, consider these sentences:</p><ul><li>She had a heart of gold.</li><li>She had a change of heart.</li><li>She had a heart attack.</li></ul><p>Superficially, they look very similar, but a multivector model would likely generate very different embeddings for each instance of “heart,” representing how each means something else in the context of the whole sentence:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a33e60c8d8ee3d312bca8375ca2a8b0a0cd40ba9-615x615.png" style="width: 100%;" /><p>Comparing two objects via their multivector embeddings often involves measuring their chamfer distance: comparing each part of one multivector embedding to each part of another one and summing the minimum distances between them. Other systems, including the Jina Rerankers described below, input them to an AI model trained specifically to evaluate their similarity. Both approaches typically have higher precision than just comparing single-vector embeddings because multivector embeddings contain much more detailed information than single-vector ones.</p><p>However, multivector embeddings aren’t well-suited to indexing. They’re often used in reranking tasks, as described for the <code>jina-colbert-v2</code> model in the next section.</p><h2>Jina embedding models</h2><h3>Jina embeddings v4</h3><p><a href="https://jina.ai/news/jina-embeddings-v4-universal-embeddings-for-multimodal-multilingual-retrieval/"><strong>jina-embeddings-v4</strong></a> is a 3.8 billion (3.8x10⁹) parameter multilingual and multimodal embedding model that supports images and texts in a variety of widely used languages. It uses a novel architecture to take advantage of visual knowledge and language knowledge to improve performance on both tasks, enabling it to excel at image retrieval and especially at <a href="https://huggingface.co/tasks/visual-document-retrieval">visual document retrieval</a>. 
This means it handles images like charts, slides, maps, screenshots, page scans, and diagrams — common kinds of images, often with important embedded text, which fall outside the scope of computer vision models trained on pictures of real-world scenes.</p><p>We’ve optimized this model for several different tasks using compact <a href="https://huggingface.co/docs/peft/en/package_reference/lora">Low-Rank Adaptation (LoRA) adapters</a>. This allows us to train a single model to specialize at multiple tasks, without compromising performance on any of them, at a minimal added cost in memory or processing.</p><p>Main features include:</p><ul><li>State-of-the-art performance in visual document retrieval, along with multilingual text and regular image performance that surpasses significantly larger models.</li><li>Support for large input context size: 32,768 tokens is roughly equivalent to 80 pages of double-spaced English text, and 20 megapixels is equivalent to an image of 4,500 x 4,500 pixels.</li><li>User-selected embedding sizes, from a maximum of 2048 dimensions down to 128 dimensions. We found empirically that performance degrades dramatically below that threshold.</li><li>Support for both single embeddings and multivector embeddings. For texts, multivector output consists of one 128-dimensional embedding for each input token. For images, it produces one 128-dimensional embedding for each 28x28 pixel tile needed to cover the image.</li><li>Optimization for asymmetric retrieval via a pair of LoRA adapters trained specifically for the purpose.</li><li>A LoRA adapter optimized for semantic similarity calculation.</li><li>Special support for computer programming languages and IT frameworks, also via a LoRA adapter.</li></ul><p>We developed <code>jina-embeddings-v4</code> to serve as a general, multipurpose tool for a broad array of common search, natural language understanding, and AI analysis tasks. 
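The user-selected embedding sizes listed above rely on the Matryoshka property: because the most important distinctions are front-loaded, you can keep only the first k dimensions of the full vector and re-normalize before comparing. A minimal sketch (the 8-dimensional vector is a made-up stand-in for real model output):

```python
import math

def truncate_embedding(vec, k):
    # Matryoshka-trained models front-load meaning, so keeping the
    # first k dimensions and re-normalizing preserves most performance.
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.61, -0.32, 0.18, 0.09, -0.05, 0.04, -0.02, 0.01]  # made-up model output
small = truncate_embedding(full, 4)

print(len(small))                           # 4
print(round(sum(x * x for x in small), 6))  # 1.0 (unit length again)
```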
It’s a relatively small model given its capabilities but still takes significant resources to deploy and is best suited to use via a cloud API or in a high-volume environment.</p><h3>Jina embeddings v3</h3><p><a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/"><strong>jina-embeddings-v3</strong></a> is a compact, high-performance, multilingual, text-only embedding model with under 600 million parameters. It supports up to 8192 tokens of text input and outputs single-vector embeddings with user-chosen sizes from a default of 1024 dimensions down to 64.</p><p>We’ve trained <code>jina-embeddings-v3</code> for a variety of text tasks — not just information retrieval and semantic similarity but also classification tasks, like sentiment analysis and content moderation, as well as clustering tasks, like news aggregation and recommendation. Like <code>jina-embeddings-v4</code>, this model provides LoRA adapters specialized for the following categories of usage:</p><ul><li>Asymmetric retrieval</li><li>Semantic similarity</li><li>Classification</li><li>Clustering</li></ul><p><code>jina-embeddings-v3</code> is a much smaller model than <code>jina-embeddings-v4</code> with a significantly reduced input context size, and it costs less to operate. Nonetheless, it has very competitive performance, albeit only for texts, and is a better choice for many use cases.</p><h3>Jina code embeddings</h3><p>Jina’s specialized code embedding models — <a href="https://jina.ai/models/jina-code-embeddings-1.5b"><strong>jina-code-embeddings (0.5b and 1.5b)</strong></a> — support 15 programming languages and frameworks, as well as English language texts relating to computing and information technology. They’re compact models with a half-billion (0.5x10⁹) and one-and-a-half-billion (1.5x10⁹) parameters, respectively. 
Both models support input context sizes of up to 32,768 tokens and let users select their output embedding sizes, from 896 down to 64 dimensions for the smaller model and 1536 down to 128 for the larger.</p><p>These models support asymmetric retrieval for five task-specific specializations, using <a href="https://arxiv.org/abs/2101.00190">prefix tuning</a> rather than LoRA adapters:</p><ul><li><strong>Code to code.</strong> Retrieve similar code across programming languages. This is used for code alignment, code deduplication, and support for porting and refactoring.</li><li><strong>Natural language to code.</strong> Retrieve code to match natural language queries, comments, descriptions, and documentation.</li><li><strong>Code to natural language. </strong>Match code to documentation or other natural language texts.</li><li><strong>Code-to-code completion.</strong> Suggest relevant code to complete or enhance existing code.</li><li><strong>Technical Q&amp;A.</strong> Identify natural language answers to questions about information technologies, ideally suited for technical support use cases.</li></ul><p>These models provide superior performance for tasks involving computer documentation and programming materials at a relatively small computational cost. They’re well suited to integration into development environments and code assistants.</p><h3>Jina ColBERT v2</h3><p><a href="https://jina.ai/models/jina-colbert-v2"><strong>jina-colbert-v2</strong></a> is a 560 million parameter multivector text-embedding model. It’s multilingual, trained using materials in 89 languages, and supports variable embedding sizes and asymmetric retrieval.</p><p>As previously noted, multivector embeddings are poorly suited to indexing but are very useful for increasing the precision of results of other search strategies. Using <code>jina-colbert-v2</code><strong>,</strong> you can calculate multivector embeddings in advance and then use them to rerank retrieval candidates at query time. 
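This late-interaction reranking is commonly scored with the MaxSim (chamfer-style) sum described earlier: each query-token embedding is matched to its best document-token embedding, and those maxima are added up. A hedged sketch with toy 2-D token vectors (real models use 64- to 128-dimensional token embeddings):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    # For each query-token embedding, take its best (highest dot-product)
    # match among the document-token embeddings, then sum those maxima.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-D token embeddings (illustrative only)
query = [[1.0, 0.0], [0.6, 0.8]]
doc_a = [[0.9, 0.1], [0.5, 0.85], [0.0, 1.0]]  # covers both query tokens well
doc_b = [[0.0, 1.0], [-0.7, 0.7]]              # matches only the second token

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
print(max(scores, key=scores.get))  # doc_a
```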
This approach is less precise than using one of the reranking models in the next section but is much more efficient because it just involves comparing stored multivector embeddings instead of invoking the whole AI model for every query and candidate match. It’s ideally suited for use cases where the latency and computational overhead of using reranking models is too great or where the number of candidates to compare is too large for reranking models.</p><p>This model outputs a sequence of embeddings, one per input token, and users can select token embeddings of 128, 96, or 64 dimensions. Candidate text matches are limited to 8,192 tokens. Queries are encoded asymmetrically, so users must specify whether a text is a query or candidate match and must limit queries to 32 tokens.</p><h3>Jina CLIP v2</h3><p><a href="https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/"><strong>jina-clip-v2</strong></a> is a 900 million parameter multimodal embedding model, trained so that texts and images produce embeddings that are close together if the text describes the content of the image. Its primary use is for retrieving images based on textual queries, but it’s also a high-performance text-only model, reducing user costs because you don’t need separate models for text-to-text and text-to-image retrieval.</p><p>This model supports a text input context of 8,192 tokens, and images are scaled to 512x512 pixels before generating embeddings.</p><p>Contrastive language–image pretraining (CLIP) architectures are easy to train and operate and can produce very compact models, but they have some fundamental limitations. They can’t use knowledge from one medium to improve their performance in another. 
So, although it might know that the words “dog” and “cat” are closer to each other in meaning than either one is to “car,” it won’t necessarily know that a picture of a dog and a picture of a cat are more related than either one is to a picture of a car.</p><p>They also suffer from what is called the <em>modality gap</em>: An embedding of a text about dogs is likely to be closer to an embedding of a text about cats than to an embedding of a picture of dogs. Because of this limitation, we advise using CLIP as a text-to-image retrieval model or as a text-only model, but not mixing the two in a single query.</p><h2>Reranking models</h2><p>Reranking models take one or more candidate matches, along with a query as input to the model, and compare them directly, producing much higher precision matches.</p><p>In principle, you could use a reranker directly for information retrieval by comparing each query to each stored document, but this would be very computationally expensive and is impractical for any but the smallest collections. As a result, rerankers tend to be used to evaluate relatively short lists of candidate matches found by some other means, like embeddings-based search or other retrieval algorithms. Reranking models are ideally suited to hybrid and federated search schemes, where performing a search might mean that queries get sent to separate search systems with distinct data sets, each one returning different results. They work very well at merging diverse results into a single high-quality result.</p><p>Embeddings-based search can be a large commitment, involving reindexing all your stored data and changing user expectations about the results. 
Adding a reranker to an existing search scheme can add many of the benefits of AI without re-engineering your entire search solution.</p><h2>Jina reranker models</h2><h3>Jina Reranker m0</h3><p><a href="https://jina.ai/models/jina-reranker-m0/"><strong>jina-reranker-m0</strong></a> is a 2.4 billion (2.4x10⁹) parameter multimodal reranker that supports textual queries and candidate matches consisting of texts and/or images. It’s the leading model for visual document retrieval, making it an ideal solution for stores of PDF, scans of text, screenshots, and other computer-generated or modified imagery containing text or other semistructured information, as well as mixed data consisting of text documents and images.</p><p>This model takes a single query and a candidate match and returns a score. When the same query is used with different candidates, the scores are comparable and can be used to rank them. It supports a total input size of up to 10,240 tokens, including the query text and the candidate text or image. Every 28x28 pixel tile needed to cover an image counts as a token for calculating input size.</p><h3>Jina Reranker v3</h3><p><a href="https://jina.ai/models/jina-reranker-v3/"><strong>jina-reranker-v3</strong></a> is a 600 million parameter text reranker with state-of-the-art performance for models of comparable size. Unlike <code>jina-reranker-m0</code>, it takes a single query and a list of up to 64 candidate matches and returns the ranking order. It has an input context of 131,000 tokens, including the query and all text candidates.</p><h3>Jina Reranker v2</h3><p><a href="https://jina.ai/models/jina-reranker-v2"><strong>jina-reranker-v2-base-multilingual</strong></a> is a very compact general-purpose reranker with additional features designed to support function-calling and SQL querying. 
Weighing in at under 300 million parameters, it provides fast, efficient, and accurate multilingual text reranking with additional support for selecting SQL tables and external functions that match text queries, making it suitable for agentic use cases.</p><h2>Small generative language models</h2><p>Generative language models are models like OpenAI’s ChatGPT, Google Gemini, and Claude from Anthropic that take text or multimedia inputs and respond with text outputs. There’s no well-defined line that separates <em>large</em> language models (LLMs) from <em>small</em> language models (SLMs), but the practical problems of developing, operating, and using top-of-the-line LLMs are well-known. The best-known ones are not publicly distributed, so we can only estimate their size, but ChatGPT, Gemini, and Claude are expected to be in the 1–3 trillion (1–3x10¹²) parameter range.</p><p>Running these models, even if they’re openly available, is far beyond the scope of conventional hardware, requiring the most advanced chips arranged in vast parallel arrays. You can access LLMs via paid APIs, but this incurs significant costs, has a large latency, and is difficult to align with demands for data protection, digital sovereignty, and cloud repatriation. Additionally, costs related to training and customizing models of that size can be considerable.</p><p>Consequently, a great deal of research has gone into developing smaller models that might lack all the capabilities of the largest LLMs but can perform specific kinds of tasks just as well at a reduced cost. Enterprises generally deploy software to address specific problems, and AI software is no different, so SLM-based solutions are often preferable to LLM ones. 
They can typically run on commodity hardware, are faster and consume less energy to run, and are much easier to customize.</p><p>Jina’s SLM offerings are growing as we focus on how we can best bring AI into practical search solutions.</p><h2>Jina SLMs</h2><h3>ReaderLM v2</h3><p><a href="https://jina.ai/models/ReaderLM-v2"><strong>ReaderLM-v2</strong></a> is a generative language model that converts HTML into Markdown or into JSON, according to user-provided JSON schemas and natural language instructions.</p><p>Data preprocessing and normalization is an essential part of developing good search solutions for digital data, but real-world data, especially web-derived information, is often chaotic, and simple conversion strategies frequently prove to be very brittle. Instead, <code>ReaderLM-v2</code> offers an intelligent AI model solution that can understand the chaos of a DOM-tree dump of a web page and robustly identify useful elements.</p><p>At 1.5 billion (1.5x10⁹) parameters, it’s three orders of magnitude more compact than cutting-edge LLMs but performs on par with them at this one narrow task.</p><h3>Jina VLM</h3><p><a href="https://jina.ai/models/jina-vlm"><strong>jina-vlm</strong></a> is a 2.4 billion (2.4x10⁹) parameter generative language model trained to answer natural language questions about images. 
It has very strong support for visual document analysis, that is, answering questions about scans, screenshots, slides, diagrams, and similar non-natural image data.</p><p>For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/adfa1420d079ca4fd5582eef4349b1265b378e76-950x500.png" style="width: 100%;" /><p>It’s also very good at reading text in images:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea3956e7ad86f8e171841cab2c28c8b3498da1d4-1002x500.png" style="width: 100%;" /><p>But where <code>jina-vlm</code> really excels is understanding the content of informational and man-made images:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f68606f9d2d99e2cd616d4ff81db3574dc4e26a5-1020x700.png" style="width: 100%;" /><p>Or:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/297e85e7e78f296388a02301e1e08fed70827423-1000x500.png" style="width: 100%;" /><p><code>jina-vlm</code> is well suited for automatic caption generation, product descriptions, image alt text, and accessibility applications for vision-impaired people. It also creates possibilities for retrieval‑augmented generation (RAG) systems to use visual information and for AI agents to process images without human assistance.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to build an agent knowledge base with LangChain and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build an agent knowledge base and test its ability to query sources of information based on context, use WebSearch for out-of-scope queries, and refine recommendations based on user intention.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-knowledge-base-langchain-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-knowledge-base-langchain-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Han Xiang Choong]]></dc:creator>
            <pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/525531c53146c9ce308b8f39273354fa54b2d31d-3840x2161.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/77d1d1c77310edcff019e3cef66ab04d96017116-2000x1257.png" style="width: 100%;" /><p>In industry use cases, there are two primary modes of interacting with large language models (LLMs). Direct querying, i.e., conversing with an LLM on an ad-hoc basis, is useful for getting assistance on tasks like summarization, proofreading, information extraction, and non-domain-specific querying.</p><p>For specific business applications, such as in customer relationship management, maintenance of IT systems, and investigative work, to name only a few examples, direct LLM usage is insufficient. Private, enterprise-specific information, or information about niche interests and topics, or even from specific documents and written sources, tends to be lacking from LLM training datasets.<br/>In addition, real-world data is constantly changing, and enterprise contexts are constantly evolving. LLMs also tend to require reinforcement of factual accuracy. All these factors limit the utility of using LLMs directly for enterprise use cases, especially those requiring up-to-date factual information about specific technical or business topics.</p><p><a href="https://www.elastic.co/what-is/retrieval-augmented-generation">Retrieval Augmented Generation (RAG)</a>, the use of <a href="https://www.elastic.co/what-is/vector-database">searchable data stores</a> to retrieve information sources relevant to the context and intention of a user query, was popularized as a way to address this deficiency. 
A large amount of work has been done to <a href="https://www.elastic.co/search-labs/blog/build-rag-workflow-langgraph-elasticsearch">implement</a>, <a href="https://www.elastic.co/search-labs/blog/rag-retrieval-elasticsearch-deepeval">assess</a>, and <a href="https://www.elastic.co/search-labs/blog/building-multimodal-rag-system">improve</a> the quality of RAG applications, and RAG has enjoyed widespread adoption in enterprise use cases for productivity enhancement and workflow automation. However, RAG does not leverage the decision-making capacity of large language models.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a01c35927bf359190acb06368fa5f0135d4d3bc5-3596x2004.png" style="width: 100%;" /><p>The agentic model revolves around the LLM being able to take specific actions in response to a user input. These actions may involve the use of tools to augment the LLM&#x27;s existing capabilities. In this sense, RAG functions as a long-term memory store that the LLM agent may choose to use to augment and reinforce answers to user queries. Where the traditional RAG model involves the LLM querying one or more knowledge bases, an agentic implementation allows an LLM to choose from a set of knowledge bases. This allows for more flexible question-answering behavior, and can improve accuracy, as information from irrelevant knowledge bases is omitted, reducing potential sources of noise. 
We might call such a system an &quot;agent knowledge base.&quot; Let&#x27;s take a look at how to implement such a system using Elasticsearch.</p><h2>Designing an agent knowledge base</h2><p>All code may be found <a href="https://github.com/pioorg/agentic_search_example_UI">in the GitHub repo</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/525531c53146c9ce308b8f39273354fa54b2d31d-3840x2161.png" style="width: 100%;" /><p>I recently became interested in scuba diving after trying it and realizing it could cure my persistent <a href="https://en.wikipedia.org/wiki/Thalassophobia">thalassophobia</a>, so I decided to set up an agentic knowledge base for diving specifically. It draws on three sources of information:</p><ul><li>The US Navy Dive Manual - Containing a wealth of technical detail about diving operations and equipment.</li><li>Diving Safety Manual - Containing general guidelines and procedures aimed at recreational divers. </li><li>The Google Custom Search API - Capable of searching the web for any information not contained within the two manuals.</li></ul><p>The intention was that this <code>Diving Assistant</code> would be a one-stop shop for diving-related knowledge, which would be capable of responding to any query, even those out of scope of the knowledge bases ingested. The LLM would recognize the motivation behind a user query, and select the source of information most likely to be relevant. I decided to use <a href="https://www.langchain.com/">LangChain</a> as the agentic wrapper, and built a <a href="https://streamlit.io/">Streamlit UI</a> around it.</p><h2>Setting up the endpoints</h2><p>I start by creating a <code>.env</code> file and populating it with the following variables:</p><pre><code>ELASTIC_ENDPOINT=<ELASTIC CLOUD ENDPOINT>
ELASTIC_API_KEY=<ELASTIC CLOUD API KEY>

# Enable custom search API
# https://developers.google.com/custom-search/v1/introduction/?apix=true
GCP_API_KEY=<GCP API KEY>
GCP_PSE_ID=<GCP PSE ID>

AZURE_OPENAI_SYSTEM_PROMPT="You are a helpful assistant. Be as concise and efficient as possible. Convey maximum meaning in fewest words possible."

AZURE_OPENAI_ENDPOINT=<AZURE ENDPOINT>
AZURE_OPENAI_API_VERSION=<AZURE API VERSION>
AZURE_OPENAI_API_KEY=<AZURE API KEY>
AZURE_OPENAI_MODEL="gpt-4o-mini"</code></pre><p>This project uses a GPT-4o-Mini model deployed on Azure OpenAI, as well as the Google Custom Search API, and an Elastic Cloud deployment to hold my data. I also add a custom system prompt encouraging the LLM to avoid wordiness as much as possible.</p><h2>Ingestion and processing</h2><p>The US Navy Dive Manual and Diving Safety Manual are in PDF format, so the next step was to ingest them into an Elastic Cloud deployment. I set up this Python script using Elastic&#x27;s <code>bulk</code> API to upload documents to Elastic Cloud:</p><pre><code>import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from elasticsearch import Elasticsearch, helpers # elasticsearch==8.14.0
from tqdm import tqdm # tqdm==4.66.4
from llama_index.core import SimpleDirectoryReader

def bulk_upload_to_elasticsearch(data, index_name, es, batch_size=500, max_workers=10):
    ''' 
    data: [ {document} ]
        document: {
                    "id_": str
                    ...
                  }
    index_name: str 
    es: Elasticsearch 
    batch_size: int 
    max_workers: int
    '''
    total_documents = len(data)
    success_bar = tqdm(total=total_documents, desc="Successful uploads", colour="green")
    failed_bar = tqdm(total=total_documents, desc="Failed uploads", colour="red")

    def create_action(doc):
        '''
        Define upload action from source documents
        '''
        return {
            "_index": index_name,
            "_id": doc["id_"],
            "body": doc["text"]
        }

    def read_and_create_batches(data):
        ''' 
        Yield document batches
        '''
        batch = []
        for doc in data:
            batch.append(create_action(doc))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    def upload_batch(batch):
        ''' 
        Make bulk call for batch
        '''
        try:
            success, failed = helpers.bulk(es, batch, raise_on_error=False, request_timeout=45)
            if isinstance(failed, list):
                failed = len(failed)
            return success, failed
        except Exception as e:
            print(f"Error during bulk upload: {str(e)}")
            return 0, len(batch)
    '''
    Parallel execution of batch upload
    '''
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_batch = {executor.submit(upload_batch, batch): batch for batch in read_and_create_batches(data)}
        for future in as_completed(future_to_batch):
            success, failed = future.result()
            success_bar.update(success)
            failed_bar.update(failed)

    '''
    Close progress bars and collect totals
    '''
    total_uploaded = success_bar.n
    total_failed = failed_bar.n
    success_bar.close()
    failed_bar.close()

    return total_uploaded, total_failed

# This is connecting to ES Cloud via credentials stored in .env 
# May have to change this to suit your env. 
try:
    es_endpoint = os.environ.get("ELASTIC_ENDPOINT")
    es_client = Elasticsearch(
        es_endpoint,
        api_key=os.environ.get("ELASTIC_API_KEY")
    )
except Exception as e:
    print(f"Error creating Elasticsearch client: {str(e)}")
    es_client = None

print(es_client.ping() if es_client else "Elasticsearch client not initialized")</code></pre><p>After downloading the US Navy Dive Manual PDF and storing it in its own folder, I use <a href="https://www.llamaindex.ai">LlamaIndex&#x27;s</a> <code>SimpleDirectoryReader</code> to load the PDF data, then trigger a bulk upload:</p><pre><code>reader = SimpleDirectoryReader(input_dir="./data")
documents = reader.load_data()
bulk_upload_to_elasticsearch([i.to_dict() for i in list(documents)], 
                            "us_navy_dive_manual_raw", 
                            es_client, batch_size=16, max_workers=10)</code></pre><p>This sends all the text content to Elastic Cloud, with each page of the PDF as a separate document, to an index called <code>us_navy_dive_manual_raw</code>. No further processing is done, so uploading all 991 pages takes less than a second. The next step is semantic embedding within Elastic Cloud.</p><h2>Semantic data embedding and chunking</h2><p>In my Elastic Cloud DevTools console, I first deploy the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-elser">ELSER v2</a> model using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference">Elastic inference API</a>:</p><pre><code>PUT _inference/sparse_embedding/elser_v2
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 8,
    "model_id": ".elser_model_2_linux-x86_64"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}</code></pre><p>I then define a simple pipeline. Each document stores the text of a page from the dive manual in the <code>body</code> field, so I copy the contents of <code>body</code> to a field called <code>semantic_content</code>.</p><pre><code>PUT _ingest/pipeline/diving_pipeline
{
  "processors": [
    {
      "set": {
        "field": "semantic_content",

        "copy_from": "body",
        "if": "ctx.body != null"
      }
    }
  ]
}</code></pre><p>I then create a new index called <code>us_navy_dive_manual</code>, and set <code>semantic_content</code> as a <code>semantic_text</code> field:</p><pre><code>PUT us_navy_dive_manual
{
  "mappings": {
    "properties": {
      "semantic_content": {
        "type": "semantic_text",
        "inference_id": "elser_v2"
      }
    }
  }
}</code></pre><p>I then trigger a reindex job. Now the data will flow from <code>us_navy_dive_manual_raw</code>, to be chunked and embedded using ELSER, and be reindexed into <code>us_navy_dive_manual</code> ready for use.</p><pre><code>POST _reindex?slices=auto&wait_for_completion=false
{
  "source": {
    "index": "us_navy_dive_manual_raw",
    "size": 4
  },
  "dest": {
    "index": "us_navy_dive_manual",
    "pipeline": "diving_pipeline"
  },
  "conflicts": "proceed"
}</code></pre><p>I repeat this process for the Diving Safety Manual, and with that, data ingestion is complete.</p><h2>Tooling for agentic search</h2><p>This agent is relatively simple, so I make use of <a href="https://python.langchain.com/docs/how_to/agent_executor/">LangChain&#x27;s AgentExecutor</a>, which creates an agent and bundles it with a set of tools. Complex decision-making flows can be achieved using the <a href="https://python.langchain.com/docs/concepts/architecture/#langgraph">LangGraph implementation</a>, which we will use in a future blog. We will focus on the parts related to the agents, so for details on the actual Streamlit UI, please check out <a href="https://github.com/pioorg/agentic_search_example_UI">the GitHub repo</a>.</p><p>I create two tools for my agent to use. The first is an <code>ElasticSearcher</code> class, which performs a semantic search over an Elastic index, then returns the top 10 results as text.</p><pre><code>class ElasticSearcher:
    def __init__(self):
        self.client = Elasticsearch(
            os.environ.get("ELASTIC_ENDPOINT"),
            api_key=os.environ.get("ELASTIC_API_KEY")
        )

    def search(self, query, index="us_navy_dive_manual", size=10):
        response = self.client.search(
            index=index,
            body={
                "query": {
                    "semantic": {
                        "field": "semantic_content",
                        "query": query
                    }
                }     
            },
            size=size
        )
        return "\n".join([hit["_source"].get("body", "No Body") 
                            for hit in response["hits"]["hits"]])</code></pre><p>The second tool is the <code>Googler</code> class, which calls the Google Custom Search API to perform a general web search.</p><pre><code>class Googler:
    def __init__(self):
        self.service = build('customsearch', 'v1', developerKey=os.getenv("GCP_API_KEY"))

    def scrape(self, url):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                for script in soup(["script", "style"]):
                    script.decompose()
                text = soup.get_text()
                lines = (line.strip() for line in text.splitlines())
                chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
                return '\n'.join(chunk for chunk in chunks if chunk)[:5000]
            return None
        except Exception:
            return None

    def search(self, query, n=5):
        results = self.service.cse().list(q=query, cx=os.getenv("GCP_PSE_ID"), num=n).execute()
        scraped_data = []
        for item in results.get('items', []):
            url = item['link']
            title = item['title']
            content = self.scrape(url) or item['snippet']
            scraped_data.append(f"Page: {title}\nURL: {url}\n\n{content}\n")
        return "\n".join(scraped_data)</code></pre><p>I then create a set of tools for the agent to use. The description of each tool is an important part of the prompt engineering, as the agent will refer to it primarily when choosing which tool to use for its response to a user query.</p><pre><code>tools = [
    Tool(
        name="WebSearch",
        func=lambda q: googler.search(q, n=3),
        description="Search the web for information. Use for current events or general knowledge or to complement with additional information."
    ),
    Tool(
        name="NavyDiveManual",
        func=lambda q: elastic.search(q, index="us_navy_dive_manual"),
        description="Search the Operations Dive Manual. Use for diving procedures, advanced or technical operational planning, resourcing, and technical information."
    ),
    Tool(
        name="DivingSafetyManual",
        func=lambda q: elastic.search(q, index="diving_safety_manual"),
        description="Search the Diving Safety Manual. Use for generic diving safety protocols and best practices."
    )
]</code></pre><p>Next, I define an LLM using the <code>AzureChatOpenAI</code> abstraction:</p><pre><code>llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    deployment_name=os.getenv("AZURE_OPENAI_MODEL"),
    streaming=False
)</code></pre><p>And also create a custom prompt for the LLM, telling it how to make use of the tools and their outputs.</p><pre><code>prompt = PromptTemplate.from_template("""Answer the following questions as best you can. You have access to the following tools:
{tools}
You should use multiple tools in conjunction to promote completeness of information.
Be comprehensive in your answer.
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Question: {input}
{agent_scratchpad}""")</code></pre><p>Finally, I define the agent, passing it the LLM, prompt, and toolset, and integrate it into the rest of the UI.</p><pre><code>agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False)</code></pre><p>And with that, we are ready to test out our agent knowledge base.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8adfb985eb98a3aed652bd2376b0525b228f3b3b-1432x806.png" style="width: 100%;" /><h3>Test 1: Retrieve knowledge from a specific knowledge base</h3><p>First, we should test that the knowledge base component is working as intended. Let&#x27;s start by asking a detailed technical question. The agent should choose the right knowledge base, retrieve some information, and use it to craft a response.</p><pre><code>List of recommended equipment for ice/cold water diving operations</code></pre><p>The UI displays the agent&#x27;s thought process, and the first thought it has is to consult the Navy dive manual:</p><pre><code>Thought: To provide a comprehensive list of recommended equipment for ice 
or cold water diving operations, I should consult specialized diving 
manuals that cover operational procedures and safety protocols. 
This will ensure that I include all necessary items for safety and 
efficiency in such challenging environments. 

Action: NavyDiveManual

Action Input: "recommended equipment for ice cold water 
diving operations"</code></pre><p>The agent uses the NavyDiveManual tool to perform a semantic search of the <code>us_navy_dive_manual</code> index. The result is information like this:</p><pre><code>CHAPTER 11 — Ice and Cold Water Diving Operations   11-9
A detailed reconnaissance of the dive site will provide the 
planner with information that is helpful in deciding what ancillary equipment is 
required. Diving under ice will require special accessory equipment such as a line 
with lights/strobes for under  water navigation, ice-cutting tools, platforms, engine 
protection kits, and stainless steel ice screws, quick draw, and carabineers.
The method of cutting the hole through the ice depends on ice thickness and availability 
of equipment. Normally, two or more of the following tools are used: hand 
ice chipper, ice handsaw, ice auger, chain saw, thermal ice cutter or blasting equipment. 
In addition, equipment to lift the ice block, remove the slush, and mark ...</code></pre><p>Which the LLM then uses to craft a response:</p><pre><code>The recommended equipment for ice and cold water diving operations includes:

Thermal Protection:

* Wet suit or Variable Volume Dry Suit (VVDS)
* Heavy insulating socks (if using a wet suit)
* Diving Equipment:
* Approved cold water open-circuit SCUBA or closed-circuit UBA</code></pre><p>The response is exactly what we wanted. Great!</p><h3>Test 2: Iteratively refine knowledge collection</h3><p>The next test is to see whether the agent can select multiple sources of knowledge to refine its responses.<br/>Let&#x27;s ask a more open-ended question:</p><pre><code>What are the risk factors when snorkeling?</code></pre><p>The first thought the agent has involves searching Google for information specific to risk factors and dangers:</p><pre><code>Thought: To provide a comprehensive answer to the risk factors associated with snorkeling, 
I should gather information on general safety protocols and specific risks from both diving 
and snorkeling perspectives. I will start by searching for general snorkeling risks and 
then look into specific safety guidelines related to diving and snorkeling. 

Action: WebSearch  
Action Input: "risk factors when snorkeling"</code></pre><p>This returns web results such as:<br/></p><pre><code>Factors Contributing to Snorkel Drowning in Hawai'i - PMC
URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC8899085/
Causes of the extraordinarily high and increasing incidence of snorkeler drownings in Hawai'i 
have remained unexplained for years. Defining the mechanisms and factors predisposing to 
drowning while snorkeling is needed to provide recommendations to substantially mitigate 
the incidence of this form of preventable death. The mechanisms of drowning are described 
and insight into the predisposing factors are explored in this study...</code></pre><p>The agent then chooses to consult the <code>diving_safety_manual</code>, and takes the appropriate action to retrieve safety protocols and guidelines.</p><pre><code>Action: DivingSafetyManual  
Action Input: "snorkeling safety protocols"</code></pre><p>Which returns relevant information from the <code>DivingSafetyManual</code> tool:</p><pre><code>Pre-Snorkel briefing       
Immediately before snorkelling commences, and using site diagrams or 
photos as appropriate, the snorkeller supervisor  should emphasise 
the following points: 
• On-site environmental conditions  
• Boundaries of the snorkelling area 
• Times when snorkelli ng is being conducted 
• Specific site hazards 
• Location of lookout 
• Emergency and recall signals 
• Buddy and guided group  procedures</code></pre><p>Finally, the LLM provides a comprehensive answer, containing both risk factors and safety guidelines:</p><pre><code>The risk factors when snorkeling include:
Environmental Conditions: Strong currents, waves, and marine life can pose 
immediate threats. Snorkelers should be aware of local weather and water 
conditions before entering.
Physical Fitness: Snorkeling is a strenuous activity that can be 
challenging for individuals with pre-existing medical conditions 
such as heart disease, respiratory issues (e.g., asthma), or those 
who are not strong swimmers.
...
To mitigate these risks, snorkelers should engage in pre-snorkel briefings, 
wear personal flotation devices, ensure proper equipment checks, and 
maintain awareness of their environment and physical limits.</code></pre><h3>Test 3: Searching out of scope</h3><p>The final test is to see what happens when the user asks a query that is certainly out of scope of the prepared knowledge bases. Let&#x27;s ask about a specific diving product:</p><pre><code>Tell me about the Apeks XTX50 regulator</code></pre><p>The agent&#x27;s first thought is to check the external web for relevant information:</p><pre><code>Thought: I need to gather information about the Apeks XTX50 regulator, 
including its features, specifications, and performance in diving conditions. 
I will begin with a web search for comprehensive details.  
Action: WebSearch  
Action Input: "Apeks XTX50 regulator review features specifications"</code></pre><p>It discovers a wealth of information from commercial and diving hobbyist sites, and returns both technical specifications and user feedback:</p><pre><code>The Apeks XTX50 is a high-performance scuba diving regulator designed for 
divers who seek reliability and advanced features at a moderate price point. 
Here are the key features and specifications:
Features:
First Stage:
Utilizes the DST (Diver Adjustable) first stage with an environmentally 
sealed design to prevent water ingress. Over-balanced diaphragm technology 
provides superior performance as divers descend, with medium pressure gas 
increasing faster than ambient pressure.
...
Divers appreciate the XTX50's reliability and comfort, making it a popular 
choice for both recreational and technical diving.
The regulator is lauded for its performance in colder conditions, with 
many users noting its effectiveness during deep dives.
Some divers have raised concerns about the availability of parts for maintenance, 
as Apeks does not sell parts directly to consumers.
... </code></pre><p>The agent did not choose to search the US Navy Dive Manual, which is concerned with operational planning, or the Diving Safety Manual.</p><h2>Conclusion</h2><p>In a traditional RAG implementation, we might have chosen to force the LLM to search and use information from all three data sources simultaneously, but this would have negatively impacted accuracy by introducing noise from irrelevant information. With the agentic knowledge base, we see the LLM making targeted searches to specific sources of knowledge based on user intention and context. The agent is able to refine the knowledge collected by building upon initial searches and combining the results with information collected from other sources.</p><p>The agent can also handle questions outside the scope of its prepared data and exclude knowledge bases not pertinent to the query: significant enhancements over the traditional RAG model.</p><p>This agentic knowledge base concept provides a graceful way to combine many different sources into a coherent and comprehensive system, and the next steps would be to expand the range of actions and the diversity of information that can be referred to. Introducing workflows for fact-checking and cross-referencing would be a boon to overall reliability, and tools for specialized capabilities like calculation would be a very interesting direction to explore.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[NeurIPS 2025 highlights: From model merging to deep learning for code]]></title>
            <description><![CDATA[Explore our NeurIPS 2025 highlights on model merging, task vectors, and VLM dynamics, plus our DL4C workshop presentation on Jina code embeddings.]]></description>
            <link>https://www.elastic.co/search-labs/blog/neurips-2025-model-merging-task-vectors-code-embeddings</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/neurips-2025-model-merging-task-vectors-code-embeddings</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens, Michael Günther]]></dc:creator>
            <pubDate>Tue, 30 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/757b108276c1fc09dc23bd1b1ea594f0c1ac7eab-1216x628.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Over the last decade, <a href="https://neurips.cc/">NeurIPS</a> has become one of the premier academic conferences for AI and machine learning, where the most important papers are presented and where researchers in this community meet and network.</p><p>Jina by Elastic engineers Michael Günther and Florian Hönicke attended <a href="https://neurips.cc/Conferences/2025">this year’s conference in San Diego</a> with Daria Kryvosheieva. They presented her internship project, the <a href="https://jina.ai/news/jina-code-embeddings-sota-code-retrieval-at-0-5b-and-1-5b/">jina-code-embeddings</a> models, at the <a href="https://dl4c.github.io/">Deep Learning for Code</a> (DL4C) workshop.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ef2cd707fe6136590e44c3269724129bd888ff9e-1131x1600.jpg" style="width: 100%;" /><p>Coding agents and automated coding are very popular research areas and were prominent topics at this year’s NeurIPS, with more than 60 papers and hundreds of participants at the DL4C workshop. AI models that can generate code are not just important to software developers. 
They also enable AI agents to execute code to solve problems and interact with databases and other applications, such as by writing their own SQL queries, creating SVG and HTML on the fly for display, and more.</p><p>There’s a lot of interest in AI applications for the IT industry, including <a href="https://www.elastic.co/elasticsearch/streams">Streams</a>, <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">which is using AI to interpret system logs</a>.</p><p>Jina’s contribution to the field is a very compact, high-performance embedding model dedicated to retrieving code and computer documentation from knowledge bases and repositories, with applications to integrated development environments (IDEs), code assistants, and IT-centric retrieval-augmented generation (RAG) applications.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43f9cbd30d857127da50a25687530f1505aa45ec-1200x1600.jpg" style="width: 100%;" /><p>Overall, the conference found a good balance between theoretical work and applied research.</p><p>San Diego in early December was mild and pleasant, and the city has an easygoing atmosphere. People lingered outside between sessions and, in the evening, the cafés and bars were full of people with conference badges.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/757b108276c1fc09dc23bd1b1ea594f0c1ac7eab-1216x628.png" style="width: 100%;" /><p>We learned a lot at NeurIPS 2025 and enjoyed the trip to a city much warmer than Berlin at this time of year. In this post, we briefly share what we found most valuable at the conference.</p><h2>Model merging: Theory, practice, and applications</h2><p><a href="https://neurips.cc/virtual/2025/loc/san-diego/109593">This tutorial</a>, from Marco Ciccone, Malikeh Ehghaghi, and Colin Raffel, was particularly interesting. 
Over the last few years, model merging has become a widely used technique for making AI models more robust when fine-tuned for specialized applications. In the simplest case, it involves combining two or more fine-tuned models, derived from the same base model, by averaging their weights, as in the image below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bd908570745066704480c4e790d2b90d63b865b8-374x513.png" style="width: 100%;" /><p>As simplistic as this sounds, it usually works and leads to models that perform better (or at least not much worse) on both fine-tuned tasks, as well as retaining the performance of the base model on nonspecialized tasks.</p><p>The tutorial provided an overview of recent advances in this very active research area, especially developments in more sophisticated merging methods beyond simple weight averaging. Notably:</p><ul><li><a href="https://arxiv.org/abs/2306.01708">TIES-Merging</a>, which tries to mitigate merging conflicts between weights by, among other things, selecting subsets of the weights.</li><li><a href="https://arxiv.org/abs/2111.09832">Fisher Merging</a> and <a href="https://arxiv.org/abs/2212.09849">RegMean</a>, which involve using activation information to improve outcomes from model merging.</li></ul><p>There was also a summary of model development techniques deployed at the largest AI labs, like Google DeepMind and Cohere, which both appear to rely on model merging, ensuring continuing interest and development in this area.</p><h2>Interesting research</h2><p>We also attended oral presentations and poster sessions, and several struck us as particularly valuable.</p><h3>Large language diffusion models</h3><p><a href="https://arxiv.org/abs/2502.09992">Diffusion models</a> work very differently from most language models. Language models are generally trained using autoregressive methods: Given some length of text, they receive training to produce the next token. 
Diffusion language models, in contrast, are trained on texts with some tokens masked out, and they learn to fill them in. They generate text nonlinearly, passing over it multiple times and adding tokens in no particular order, instead of generating one word after the other. Diffusion was originally applied very successfully to image generation but has only recently been widely applied to text.</p><p>This research applies the diffusion approach to a relatively large transformer-based language model (8 billion parameters), using both pretraining and supervised fine-tuning. During pretraining, the model learns to fill in randomly masked text (with up to 100% of tokens masked). During supervised fine-tuning, the prompt is never masked, so it can learn to generate text from instructions.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b9e62c2ac3fcf3e1fae8ca4271a785eb01795e86-1600x448.png" style="width: 100%;" /><p>The resulting model shows comparable performance to autoregressively trained models across many tasks, while excelling in some domains, particularly math-related tasks. This is a very promising direction for language modeling research, and we’re curious to see whether diffusion models will become more prominent for training language models and whether they’re applied to embedding models, as well.</p><h3>Activation-informed merging of large language models</h3><p><a href="https://arxiv.org/abs/2502.02421">This paper</a> proposes another technique for improving model merging. 
The intuition behind this work is identifying and preserving the most important weights of the base model when merging one or more fine-tuned models.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e9ee9082ad00fe49ef99153fb3f970ea96c106d-1600x1200.jpg" style="width: 100%;" /><p>It uses a calibration dataset to obtain the average activations of all layers in the model and identifies the most critical weights by calculating the influence of each weight on model activation levels. It then uses this to determine which weights should not be dramatically changed during merging.</p><p>This approach is compatible with using other model-merging techniques. The authors show significant improvements when using this method in combination with various other merging methods.</p><h3>SuperCLIP: CLIP with simple classification supervision</h3><p>A well-known problem with image-text models trained using Contrastive Language Image Pretraining (CLIP) is that they aren’t good at capturing fine-grained textual information, due to architectural limitations and to the nature of the web-scraped data typically used to train vision models. During the development of the <a href="https://huggingface.co/jinaai/jina-clip-v2">Jina-CLIP models</a>, we also identified that CLIP models are generally bad at understanding more complex texts because they’re trained on short texts. We compensated by adding longer texts to our training data.</p><p><a href="https://openreview.net/pdf?id=EeIEvZlmVg">This paper</a> proposes an alternative solution: adding a novel classification loss component to the ordinary CLIP loss.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4de5edda8386571d57a1404d561159569663ee2e-1600x1179.jpg" style="width: 100%;" /><p>It relies on added layers during training that use output image tile embeddings to predict the text tokens in its description. 
Training optimizes for both this objective and the CLIP loss at the same time.</p><h3>Datasets, documents, and repetitions: The practicalities of unequal data quality</h3><p><a href="https://arxiv.org/abs/2503.07879">This paper</a> addresses issues in training data quality for large language models (LLMs). Typically, such models are trained with large datasets that routinely contain duplicated items. However, paradoxically, deduplication often produces worse results.</p><p>The authors propose an explanation for this confusing finding and offer some elements of a solution.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74ffd21afb1a9cfdd69ba6bedca145ab5c0e3e23-1600x1200.jpg" style="width: 100%;" /><p>Their principal findings are:</p><ul><li>Large models suffer more when training data is duplicated than small models do.</li><li>Duplicating high-quality documents improves training outcomes or at least does less to reduce them than low-quality ones do.</li><li>High-quality documents are more likely to appear multiple times in real-world training datasets.</li></ul><p>These last two points in particular explain the paradox of deduplication.</p><h3>Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?</h3><p><a href="https://arxiv.org/abs/2504.13837">Reinforcement Learning</a> with Verifiable Rewards (RLVR) is a method of fine-tuning LLMs for reasoning using reinforcement learning in a way that doesn’t require human labeling because the solutions to training problems are automatically verifiable. This contrasts with Reinforcement Learning from Human Feedback (RLHF), where direct human supervision is required during training. 
For example, this can mean training models to solve math problems or perform coding tasks where the output can be independently tested by machines, that is, checking the solution to a math problem automatically or running unit tests to show that a block of code works correctly.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9aec18d80e9e30eaa2cd50c66c2611f1cdfe713a-694x708.png" style="width: 100%;" /><p>The authors’ test methodology is to assess the number of correct answers for a set of problems, given a varying number of attempts to answer. Each attempt is done by sampling an answer during the generation process. They compare the model after different numbers of RLVR training epochs. They show that their training substantially improves the accuracy of the model’s answers when given only one or a few chances, but not when given many. This suggests that they’ve increased the probability of the right answer, but they haven’t really improved the reasoning capabilities of the model.</p><h2>Conclusion</h2><p>Measured by conference activity, research in AI looks like it’s still undergoing explosive growth, with no end in sight. Academic work remains very relevant and is especially important for AI developers who don’t have billions of dollars to rent data centers for research.</p><p>However, this explosive growth also makes it more and more difficult to follow everything that’s going on.</p><p>Here at Elastic, and especially on the Jina team, we’re always excited about what comes next for AI, and we do our best to stay on top of new developments and emergent directions for research. We hope this article gives you a taste of that excitement and a glimpse into the kind of work going on in search AI today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating reliable agents with structured outputs in Elasticsearch]]></title>
            <description><![CDATA[Explore what structured outputs are and how to leverage them in Elasticsearch to ground agents in the most relevant context for data contracts.]]></description>
            <link>https://www.elastic.co/search-labs/blog/structured-outputs-elasticsearch-guide</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/structured-outputs-elasticsearch-guide</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[JD Armada]]></dc:creator>
            <pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae1416f77bafde7687466a1912a9f6514c7501e9-1308x736.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’re quickly moving from simple chatbots to agents that can take real, consequential actions on your systems. To make sure these agents are dependable, we can’t rely purely on the free-form text anymore. The ability to generate predictable, machine-readable outputs has become an important layer in building reliable AI agents. Structured outputs are also a key layer in context engineering, which is a set of strategies that ensure LLMs are grounded in the most relevant information for their task. Together, these patterns help turn LLMs from simple conversational tools into reliable components that you can safely integrate into larger systems. In this piece, we’ll walk through what structured outputs are and how they can be leveraged to provide reliable output that meets key contracts. If you’re new to context engineering, check out our article <a href="https://search-labs-redesign.vercel.app/search-labs/blog/context-engineering-overview">here</a>.</p><h2>Structured outputs</h2><p>Structured outputs are LLM responses that conform to predefined schemas or data structures, instead of free-form text. Rather than receiving unpredictable responses, developers can specify exactly how a response should be formatted.</p><p>In the example below, if you give an LLM access to your indices within Elasticsearch and ask it to, “analyze this Elasticsearch index”, it’ll respond with a narrative explanation that’s likely to change each time you ask it the same prompt. 
With structured outputs, you can request a response with specific fields like <code>indexName</code>, <code>documentCount</code>, <code>healthStatus</code>, etc., each with defined types and validation rules.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae1416f77bafde7687466a1912a9f6514c7501e9-1308x736.png" style="width: 100%;" /><p>The structured format on the right can be immediately validated against a schema, without adding the extra step of parsing the text.</p><p>The largest model providers are quickly taking note of how important structured outputs are becoming, with <a href="https://blog.google/technology/developers/gemini-api-structured-outputs/">Google</a>, <a href="https://openai.com/index/introducing-structured-outputs-in-the-api/">OpenAI</a>, and <a href="https://platform.claude.com/docs/en/build-with-claude/structured-outputs">Anthropic</a> releasing support for structured outputs in each of their respective APIs. OpenAI goes further, releasing new models that are trained to better understand and adhere to complex schemas. Below is OpenAI’s evaluation of how well their models follow complex JSON schemas.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/96ead1dba23dddddd2ff213252372cf27e2fc3cc-1592x1060.png" style="width: 100%;" /><p>In this visual, their model <code>gpt-4o-2024-08-06</code> with structured outputs has a perfect score, while their previous model <code>gpt-4-0613</code> without structured outputs scores below 40%.</p><h2>How this affects multi-agent systems</h2><p>Imagine a scenario where a system of agents passed around unstructured, free-form data. Each agent would need custom parsing logic to understand the responses from other agents, which not only bloats your token usage but will surely break in practice. This is compounded by the fact that LLM outputs are probabilistic and therefore unpredictable. 
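To make the contract idea concrete, here is a minimal sketch of the kind of typed message two agents could exchange. The <code>IndexReport</code> class and its fields are invented for illustration (echoing the <code>indexName</code>/<code>documentCount</code>/<code>healthStatus</code> example above); dedicated schema libraries that generalize this pattern are covered later in this piece.

```python
from dataclasses import dataclass


@dataclass
class IndexReport:
    """Hypothetical contract for a message passed between two agents."""

    index_name: str
    document_count: int
    health_status: str

    def __post_init__(self):
        # A receiving agent can rely on these checks instead of parsing prose.
        if not isinstance(self.document_count, int) or self.document_count < 0:
            raise ValueError("document_count must be a non-negative integer")
        if self.health_status not in {"green", "yellow", "red"}:
            raise ValueError(f"invalid health_status: {self.health_status!r}")


# A well-formed message passes validation...
report = IndexReport("logs-2025.12", 120_000, "green")

# ...while a malformed one fails immediately instead of propagating bad data.
try:
    IndexReport("logs-2025.12", 120_000, "purple")
except ValueError as err:
    print(err)  # invalid health_status: 'purple'
```

The point is not this particular implementation but the shape of the guarantee: the receiver never has to guess what fields exist or what types they hold.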
How can we trust this type of system to take real actions on our systems? Structured outputs define a contract between agents, replacing ambiguity with reliable, predictable behavior.</p><h2>The importance of standardization for AI agents and MCP</h2><p><a href="https://www.openapis.org/">OpenAPI</a> revolutionized REST API development by giving developers a shared, predictable way to describe endpoints, parameters, and responses. Structured outputs bring that same idea to AI agents by providing contracts that standardize how data is exchanged between agents and systems.</p><p>These contracts ensure that:</p><ul><li><strong>Downstream systems can parse responses reliably: </strong>When an agent needs to perform an action like updating a database, calling an API, or triggering a workflow, the receiving system must be able to trust the shape and integrity of the data.</li><li><strong>Type safety is maintained</strong>: Structured outputs enable compile-time or runtime validation, catching errors before they propagate throughout the system and turn into bigger problems.</li><li><strong>Integration is predictable:</strong> With defined schemas, integrating agents into existing infrastructure follows patterns developers already know from traditional API development.</li><li><strong>Multi-agent systems can understand each other:</strong> When multiple agents need to collaborate, structured outputs provide a common language for exchanging information.</li></ul><p>The <a href="https://modelcontextprotocol.io/docs/getting-started/intro">Model Context Protocol</a> (MCP) extends this by standardizing how agents exchange context between models, tools, and applications. 
When agents communicate via MCP, structured outputs ensure that the context being shared maintains its structure across systems.</p><p>MCP is responsible for the transport and the lifecycle of context, while structured outputs define the shape and constraints of the data within the context.</p><p>Together, they enable:</p><ul><li>Composable agents that can be reused or replaced</li><li>Clear contracts between models, tools, and applications</li><li>More reliable automation, especially when we need agents to trigger real-world actions</li><li>Scalable multi-agent architectures</li></ul><p>Other emerging protocols like the <a href="https://github.com/a2aproject/A2A">agent-to-agent</a> (A2A) protocol also emphasize schemas and contracts to enable reliable communication directly between agents.</p><p>MCP and other protocols like A2A, alongside structured outputs, bring to AI agents what OpenAPI brought to microservices: a shared contract that turns ad-hoc integrations into reliable systems.</p><h2>Technologies for creating schemas</h2><p>Now, how do we actually implement structured outputs? Luckily, popular ecosystems like Python and JavaScript already have mature schema and validation libraries that make implementing structured outputs easier. You can use these tools today to control the shape of data your LLM returns, validate it at runtime, or reject it if the model hallucinates. In this section, we’ll look at the most common tools developers reach for and what happens under the hood.</p><h3>Zod and the JavaScript ecosystem</h3><p>In the JavaScript and TypeScript space, <a href="https://zod.dev/">Zod</a> has become the go-to library for schema definition and validation, due to its efficiency, ease of use, and integration with popular AI orchestration frameworks like Vercel’s <a href="https://ai-sdk.dev/">AI SDK</a> and <a href="https://mastra.ai/">Mastra</a>. 
To see this in action, we’ll look at my colleague <a href="https://search-labs-redesign.vercel.app/search-labs/blog/context-engineering-overview#:~:text=LLM%20structured%20outputs">Carly’s example</a>. Carly used Zod alongside the AI SDK to create a schema that forces the LLM to return itinerary data in a validated, type-safe format.</p><p>The schema below does 3 things:</p><ol><li>Ensures that the LLM returns valid JSON.</li><li>Ensures that the data has the correct types, constraints, and nesting.</li><li>Generates application-ready data that doesn’t need extra processing.</li></ol><pre><code>import { generateObject } from 'ai';
import { z } from 'zod';
const { object } = await generateObject({
 model: 'openai/gpt-4.1',
 schemaName: 'Travel Itinerary',
 schemaDescription: 'Sample travel itinerary for a trip',
 schema: z.object({
   title: z.string(),
   location: z.string(),
   hotel: z.object({name: z.string(), roomType: z.string(), amount: z.number(), checkin: z.iso.date(), checkout: z.iso.date()}),
   flights: z.array(z.object({carrier: z.string(), flightNo: z.string().max(8), origin: z.string(), destination: z.string(), date: z.iso.datetime()})),
   excursions: z.array(z.object({ name: z.string(), amount: z.number(), date: z.iso.datetime()}))
 }),
 prompt: 'Generate a travel itinerary based on the specified location',
});</code></pre><p>Let’s take a closer look at the important parts of this schema.</p><p><strong>Trip information</strong></p><pre><code>schema: z.object({
  title: z.string(),
  location: z.string(),</code></pre><p>These fields are simple strings, but the takeaway is that the LLM is not free to make up the structure: <code>title</code> and <code>location</code> must be included, and they must be strings, or the response will be rejected.</p><p><strong>Hotel details</strong></p><pre><code>hotel: z.object({ 
  name: z.string(), 
  roomType: z.string(), 
  amount: z.number(), 
  checkin: z.iso.date(), 
  checkout: z.iso.date() 
}),</code></pre><p>Take note that <code>amount</code> is defined as a <code>number</code> type and dates use ISO formats, which means the output can be used right away for calculations, sorting, or storage without any extra parsing.</p><p><strong>Flight information</strong></p><pre><code>flights: z.array(z.object({ 
  carrier: z.string(), 
  flightNo: z.string().max(8), 
  origin: z.string(), 
  destination: z.string(), 
  date: z.iso.datetime() })),</code></pre><p>Flights are an array of objects because trips usually involve multiple legs. We cap <code>flightNo</code> to 8 characters and use <code>datetime()</code> instead of <code>date()</code> to include departure times.</p><p>When we run this, the model should generate a JSON object that looks like:</p><pre><code>{ 
  "title": "Desert Adventure", 
  "location": "Palm Springs, California", 
  "hotel": { 
    "name": "The Madison", 
    "roomType": "Suite", 
    "amount": 250, 
    "checkin": "2025-12-15", 
    "checkout": "2025-12-20" 
 }, 
  "flights": [ 
    { 
      "carrier": "SouthWest Airlines", 
      "flightNo": "SW123", 
      "origin": "SFO", 
      "destination": "PSP", 
      "date": "2025-12-15T18:00:00Z" 
    } 
 ], 
  "excursions": [ 
    { 
      "name": "ATV Desert Tour", 
      "amount": 50, 
      "date": "2025-12-16T10:00:00Z" 
    } 
 ] 
}</code></pre><p>If the model spits out invalid JSON, breaks a specified constraint, or doesn’t include a required field, the request fails immediately instead of silently pushing through bad data.</p><h3>Pydantic and the Python ecosystem</h3><p>For Python developers, <a href="https://docs.pydantic.dev/latest/">Pydantic</a> plays a similar role to Zod in JavaScript/TypeScript, giving you runtime validation and strongly typed structured outputs.</p><p>Let’s use the same travel itinerary example, but this time we’ll use Pydantic models and LangChain’s support for structured outputs.</p><pre><code>from datetime import date, datetime
from decimal import Decimal
from typing import List

from pydantic import BaseModel, Field, ConfigDict, condecimal, constr
from langchain_openai import ChatOpenAI


class Hotel(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    name: str
    room_type: str = Field(..., alias="roomType")
    amount: condecimal(max_digits=10, decimal_places=2, ge=0) 
    checkin: date
    checkout: date


class Flight(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    carrier: str
    flight_no: constr(max_length=8) = Field(..., alias="flightNo")
    origin: str
    destination: str
    date: datetime


class Excursion(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    name: str
    amount: condecimal(max_digits=10, decimal_places=2, ge=0)
    date: datetime


class TravelItinerary(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    title: str
    location: str
    hotel: Hotel
    flights: List[Flight]
    excursions: List[Excursion]


llm = ChatOpenAI(model="gpt-4.1", temperature=0)
structured_llm = llm.with_structured_output(TravelItinerary)
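# Note: with_structured_output() derives a JSON schema from the Pydantic
# model (roughly TravelItinerary.model_json_schema()) and passes it to the
# provider's structured-output API, so each response is validated against it.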

itinerary: TravelItinerary = structured_llm.invoke(
    "Generate a travel itinerary based on the specified location."
)</code></pre><p>This approach feels pretty similar to the Zod example, where you define a schema once and then rely on the framework to handle validation for you at runtime. The main difference is that Pydantic gives you back actual Python objects instead of plain validated JSON. What you get back is a <code>TravelItinerary</code> instance with nested models and properly typed fields, which tends to fit more cleanly with Python-based agent pipelines.</p><p>When we run this, the model generates structured data that maps onto the Pydantic models, and we should get back an object like:</p><pre><code>{
  "title": "Palm Springs Getaway",
  "location": "Palm Springs, California",
  "hotel": {
    "name": "The Madison",
    "roomType": "Suite",
    "amount": 250.00,
    "checkin": "2025-12-15",
    "checkout": "2025-12-20"
  },
  "flights": [
    {
      "carrier": "SouthWest Airlines",
      "flightNo": "SW123",
      "origin": "SFO",
      "destination": "PSP",
      "date": "2025-12-15T18:00:00Z"
    }
  ],
  "excursions": [
    {
      "name": "ATV Desert Tour",
      "amount": 50.00,
      "date": "2025-12-16T10:00:00Z"
    }
  ]
}</code></pre><p>This JSON should be identical to what we generated using Zod. Under the hood, this JSON is automatically converted to a <code>TravelItinerary</code> object with nested <code>Hotel</code>, <code>Flight</code>, and <code>Excursion</code> instances. Again, if the model spits out invalid data, breaks a constraint, or doesn’t include a required field, validation fails right away.</p><h3>Under the hood: JSON schemas</h3><p>At the API level, all of these approaches essentially convert to JSON schema. Libraries like Zod and Pydantic exist to make defining these schemas intuitive and developer-friendly.</p><p>Working directly with raw JSON schemas can still be useful when you need language-agnostic contracts shared across teams or services, but the tradeoff is that you lose native types, composability, and much of the developer experience that the libraries provide.</p><h2>Combining Elasticsearch with structured outputs</h2><p>Controlling what the LLM outputs is still only half the battle. Next, we need to know how to make these outputs useful in real systems. Elasticsearch is a natural fit here because it’s designed to work equally well with both structured and unstructured data. This mirrors modern agent architectures where unstructured data provides rich context to power reasoning and retrieval, and structured outputs act as contracts that applications can rely on. Elasticsearch is central to this loop.</p><p>Here’s an example of how Elasticsearch fits into this approach:</p><p><strong>1. Unstructured inputs</strong></p><p>User queries, documents, chat history, logs, or tool traces, for example, are ingested into an Elasticsearch index. To capture both exact text matching and semantic meaning, we’ll use a mix of text fields and vector embeddings as we index this data.</p><p><strong>2. 
Elasticsearch as the context engine</strong></p><p>When an AI agent needs relevant context, it can query Elasticsearch using these different types of search:</p><ul><li><strong>Semantic/vector search: </strong>To search by the underlying meaning of a word.</li><li><strong>Keyword/text search: </strong>For exact matches and filters.</li><li><strong>Geospatial search: </strong>To search by location.</li><li><strong>Hybrid search: </strong>To search using a mix of the above.</li></ul><p><strong>3. LLM reasoning</strong></p><p>The retrieved context is passed back to the LLM to ground its response in the most relevant data instead of relying solely on its training data.</p><p><strong>4. Structured output generation</strong></p><p>The model is restricted by the schema we created using either Zod or Pydantic, and produces a validated JSON object instead of free-form text.</p><p><strong>5. Structured indexing</strong></p><p>The validated output is indexed back into Elasticsearch using explicit mappings, making it easier to query, aggregate, and analyze.</p><p><strong>6. Reuse and automate</strong></p><p>Now that we’ve added structure to the data, it becomes easy to query, filter, aggregate, or use it as input for downstream systems and workflows.</p><p>This loop lets agents use Elasticsearch both as the retrieval layer and as a memory store that enables context-driven reasoning, automation, and long-term learning.</p><h2>Limitations</h2><p>Marius Schroder’s <a href="https://medium.com/medialesson/structured-prompting-in-real-projects-checklist-best-practices-c39fa789856b">structured prompting article</a> mentions some limitations with structured prompting that can also be applied to structured outputs.</p><p>He mentions that:</p><ul><li><strong>A schema can guarantee format but not correctness:</strong> The model can still output garbage where the JSON is structurally valid but filled with the wrong data. 
For example, an itinerary schema might require a valid ISO date, a price with a numeric type, and a flight number under 8 characters. The model could still return a flight that’s on February 30th (an impossible date), or assign a 10 dollar price to a five-star hotel. In this instance, the structure is valid but the facts are wrong, illustrating that schemas validate the shape of data but not the truth.</li><li><strong>Complex or deeply nested schemas can still be a point of failure:</strong> You can still run into parsing failures or token limits; if the output is large enough, the model might cut off parts of the response.</li><li><strong>Not great for creative scenarios: </strong>In this case, free-form text might be the better choice, especially if you don’t want to handcuff the LLM too much in creative tasks.</li></ul><h2>Conclusion</h2><p>This article dives into the importance of providing structured outputs in multi-agent systems, the most common tools developers reach for, and how Elasticsearch can be a natural extension. If you want to learn more, be sure to check out these resources below.</p><h2>Resources</h2><ol><li><a href="https://search-labs-redesign.vercel.app/search-labs/blog/context-engineering-overview">What is context engineering? 
| Carly Richmond</a></li><li><a href="https://docs.langchain.com/oss/python/langchain/structured-output">LangChain: Structured output</a></li><li><a href="https://towardsdatascience.com/hands-on-with-anthropics-new-structured-output-capabilities/">A Hands-On Guide to Anthropic’s New Structured Output Capabilities | Thomas Reed</a></li><li><a href="https://openai.com/index/introducing-structured-outputs-in-the-api/">OpenAI: Introducing structured outputs in the API</a></li><li><a href="https://platform.openai.com/docs/guides/structured-outputs">OpenAI: Structured outputs</a></li><li><a href="https://medium.com/medialesson/structured-prompting-in-real-projects-checklist-best-practices-c39fa789856b">Structured Prompting in real projects — checklist &amp; best practices | Marius Schroder</a></li><li><a href="https://blog.google/technology/developers/gemini-api-structured-outputs/">Improving Structured Outputs in the Gemini API</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Comparing dense vector search performance with the Profile API in Elasticsearch]]></title>
            <description><![CDATA[Learn how to use the Profile API in Elasticsearch to compare dense vector configurations and tune kNN performance with visual data from Kibana.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-profile-api-dense-vector-search-comparison</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-profile-api-dense-vector-search-comparison</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Dávila]]></dc:creator>
            <pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/599a25998d0211836e2962ff3422552d77312b2e-956x556.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-profile">Profile API</a> is a debugging tool that Elasticsearch provides to measure query execution performance. This API breaks down query execution, showing how much time was spent on each phase of the search process. This visibility comes in handy when identifying bottlenecks and comparing different query and index configurations.</p><p>In this blog, we will explore how the Profile API can help us compare different approaches to vector search in Elasticsearch, understanding execution times and how the total response time is spent across different actions. This showcases how search profiling can drive settings selection, giving us an example of how each configuration behaves with a particular use case.</p><h2>Profile API implementation</h2><h3>Profiler API</h3><p>To enable search profiling in Elasticsearch, we add a <code>"profile": true</code> parameter to the search request. This instructs Elasticsearch to collect timing information on the query execution without affecting the actual search results.</p><p>For example, a simple text query using profiling:</p><pre><code>GET wikipedia-brute-force-1shard/_search
{
  "size": 0,
  "profile": true,
  "query": {
    "match": {
      "text": "semantic search"
    }
  }
}</code></pre><p>The main parts of the response are:</p><pre><code>"profile": {
    "shards": [
      {
        "id": "[OGMMYXQqRseu_8fR0yD4Qg][wikipedia-brute-force-1shard][0]",
        "node_id": "OGMMYXQqRseu_8fR0yD4Qg",
        "shard_id": 0,
        "index": "wikipedia-brute-force-1shard",
        "cluster": "(local)",
        "searches": [
          {
            "query": [
              {...
            //detailed timing of the query tree executed by Lucene on a particular shard.
               ...
              }
            ],
            "rewrite_time": //All queries in Lucene undergo a "rewriting" process that allows Lucene to perform optimizations, such as removing redundant clauses
            ,
            "collector": [
              {...
             // shows high-level execution details about coordinating the traversal, scoring, and collection of matching documents
               ...
              }
            ]
          }
        ],
        "aggregations": [
             //detailed timing of the aggregation tree executed by a particular shard
        ]
      }
    ]
  }</code></pre><h3>Kibana profiler</h3><p>In the <a href="https://www.elastic.co/blog/dev-tools-console-kibana">DevTools app</a> in Kibana, we can find a <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/search-profiler">search profiler</a> feature that makes reading the metrics a lot easier. The search profiler in Kibana uses the same Profile API seen above but provides a friendlier visual representation of the profiler output.</p><p>You can see how the total query time is being spent:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4ee5c39ca890d5f6ce31f5cdb609004ef568b492-817x279.png" style="width: 100%;" /><p>And you can see details on each part of the query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8adf87395648fe8ffa9afe844d22c8b73bea5d3b-530x592.png" style="width: 100%;" /><p>The profiler feature makes it quick to compare different queries and index configurations.</p><p><strong>When to use the Profile API directly</strong></p><ul><li><strong>Automation</strong>: Scripts, monitoring tools, CI/CD pipelines</li><li><strong>Programmatic analysis</strong>: Custom parsing and processing of results</li><li><strong>Application integration</strong>: Profile directly from your code</li><li><strong>No Kibana access</strong>: Environments without a Kibana instance or remote servers</li><li><strong>Batch processing</strong>: Profile multiple queries systematically</li></ul><p><strong>When to use the search Profiler in Kibana</strong></p><ul><li><strong>Interactive debugging</strong>: Quick iteration and experimentation</li><li><strong>Visual analysis</strong>: Spot bottlenecks through color coding and hierarchy views</li><li><strong>Collaboration</strong>: Share visual results with other people</li><li><strong>Ad-hoc investigation</strong>: One-off performance checks without coding</li></ul><h3>Basic profiling KNN example</h3><p>For a simple KNN search, we can 
use:</p><pre><code>GET wikipedia-brute-force-1shard/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "knn": {
            "field": "embedding",
            "query_vector": [...],
            "k": 10,
            "num_candidates": 1500
          }
        },
        {
          "match": {
            "text": "country"
          }
        }
      ],
      "filter": {
        "term": {
          "category": "medium"
        }
      }
    }
  },
  "size": 10,
  "_source": [],
  "profile": true
}</code></pre><h3>Main KNN search metrics in Elasticsearch</h3><p>We can find <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-profile#profiling-knn-search">KNN metrics</a> in the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-profile#profiling-dfs-statistics">dfs section</a> of the profile. It shows the execution time for query, rewrite, and collector phases; it also shows the number of vector operations executed in the query.</p><h4>Vector search time (rewrite_time)</h4><p>This is the core metric for vector similarity computation time. In the profile object, it&#x27;s found at:</p><pre><code>"dfs": {
  "knn": [{
    "rewrite_time": 198703  // nanoseconds
  }]
}</code></pre><p>Unlike traditional Elasticsearch queries, <strong>kNN search performs the bulk of its computational work during the query rewrite phase</strong>. This is a fundamental architectural difference.</p><p>The <strong>rewrite_time</strong> value represents the cumulative time spent on <strong>vector similarity calculations</strong>, <strong>HNSW graph traversal</strong>, and <strong>candidate evaluation</strong>.</p><h4>Vector operations count</h4><p>Found in the same KNN section:</p><pre><code>"vector_operations_count": 15000</code></pre><p>This metric tells you <strong>how many actual vector similarity calculations were performed</strong> during the kNN search.</p><h3>Understanding the count</h3><p>In our query with <code>num_candidates: 1500</code>, the vector operations count represents:</p><ul><li><strong>Approximate search efficiency</strong>: The number of vectors actually compared during HNSW (Hierarchical Navigable Small World) graph traversal</li><li><strong>Search accuracy trade-off</strong>: Higher counts mean a more thorough search, but longer execution time</li></ul><h4>Query processing time (time_in_nanos)</h4><p>After finding vector candidates, Elasticsearch processes the actual query on this reduced set:</p><pre><code>"query": [{ 
  "type": "BooleanQuery",
  "description": "+DenseVectorQuery.Floats +text:country #category:medium",
  "time_in_nanos": 5064686,
  "children": [
    {
      "type": "Floats",
      "description": "DenseVectorQuery.Floats",
      "time_in_nanos": 566195
    },
    {
      "type": "TermQuery",
      "description": "text:country",
      "time_in_nanos": 667083
    },
    {
      "type": "TermQuery",
      "description": "category:medium",
      "time_in_nanos": 2725249
    }
  ]
}]</code></pre><p>The <code>time_in_nanos</code> metric <strong>covers the query phase</strong>: the computational work of finding and scoring relevant documents. This total time is broken down into children, and each child query represents a clause in our Boolean query:</p><p><strong>DenseVectorQuery</strong></p><ul><li>Processing kNN results: Scoring the candidate documents identified by kNN</li><li>Not computing vectors: Vector similarities were already computed in DFS phase</li><li>Fast because: Operating only on the pre-filtered candidate set (10-1500 docs, not millions)</li></ul><p><strong>TermQuery: text:country</strong></p><ul><li>Inverted index lookup: Finding documents containing &quot;country&quot;</li><li>Posting list traversal: Iterating through matching documents</li><li>Term frequency scoring: Computing BM25 scores for matched terms</li></ul><p><strong>TermQuery: category:medium</strong></p><ul><li>Filter application: Identifying documents with category=&quot;medium&quot;</li><li>No scoring needed: Filters don&#x27;t contribute to score (notice <code>score_count: 0</code>)</li></ul><h4>Collection time</h4><p>The time spent collecting and ranking results:</p><pre><code>"collector": [{
  "name": "QueryPhaseCollector",
  "reason": "search_query_phase", 
  "time_in_nanos": 270704,  // ~271 microseconds
  "children": [
    {
      "name": "TopScoreDocCollector",
      "reason": "search_top_hits",
      "time_in_nanos": 215204  // ~215 microseconds
    }
  ]
}]</code></pre><p>The <code>time_in_nanos</code> for collectors breaks down into:</p><p><strong>TopScoreDocCollector</strong></p><ul><li>Collects top hits from the query results.</li></ul><h4>Understanding collection in Elasticsearch&#x27;s architecture</h4><p>In Elasticsearch, a query is distributed among all relevant shards, where it is executed individually. The collection phase operates across Elasticsearch&#x27;s distributed shard architecture like this:</p><p><strong>Per-Shard Collection</strong>: Each shard collects its top-scoring documents using the <strong>TopScoreDocCollector</strong>. This happens in parallel across all shards that hold relevant data.</p><p><strong>Result Ranking and Merging</strong>: The coordinating node (the node that receives your query) then receives the top results from each shard and merges these partial results together by score to find the global top N results.</p><p>So for our example:</p><p><strong>QueryPhaseCollector</strong> (270μs): The time spent on the query phase collection within a single shard.</p><p><strong>TopScoreDocCollector</strong> (215μs): The actual time spent collecting and ranking top hits from that shard.</p><p>Note that these times represent the collection phase on a <em>single shard</em> in the profile output. For multi-shard indices, this process happens in parallel on each shard, and the coordinating node adds additional overhead for merging and global ranking, but this merge time is not included in the per-shard collector times shown in the Profile API.</p><h2>Experiment setup</h2><p>The script consists of running 50 queries per experiment using the Profiler under four experiment setups. 
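In essence, each experiment boils down to a loop like the following sketch: run the same profiled query many times and average the timings. The client setup, index name, and query body here are assumptions for illustration, not code copied from the repo.

```python
import statistics


def profile_query(es, index, query, runs=50):
    """Run the same profiled query repeatedly and average the timings."""
    totals, vector_times = [], []
    for _ in range(runs):
        resp = es.search(index=index, query=query, size=10, profile=True)
        totals.append(resp["took"])  # total query time, already in ms
        # kNN similarity work is reported in the DFS/rewrite phase, in ns
        shard = resp["profile"]["shards"][0]
        vector_times.append(shard["dfs"]["knn"][0]["rewrite_time"] / 1e6)
    return {
        "avg_total_ms": statistics.mean(totals),
        "avg_vector_ms": statistics.mean(vector_times),
    }
```

Averaging over many runs matters because individual <code>took</code> values are noisy; caching effects, for instance, show up precisely as a drop in these averages.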
The experiments measure query processing, fetch, collection, and vector search execution times across multiple index configurations with different vector indexing strategies, quantization techniques, and infrastructure setups:</p><ul><li><strong><em>Experiment 1</em></strong>: Comparing query performance on a flat dense vector vs a HNSW quantized dense vector.</li><li><strong><em>Experiment 2</em></strong>: Understanding the effect of oversharding in vector search.</li><li><strong><em>Experiment 3: </em></strong>Understanding how Elastic boosts the performance of a vector query with filters by applying them before the more expensive KNN algorithm.</li><li><strong><em>Experiment 4:</em></strong> Comparing the performance of a cold query vs a cached query.</li></ul><h3>Getting started</h3><p><strong>Prerequisites</strong></p><ul><li>Python 3.x</li><li>An Elasticsearch deployment</li><li>Libraries<ul><li>Elasticsearch</li><li>Pandas</li><li>Numpy</li><li>Matplotlib</li><li>Datasets (HuggingFace library)</li></ul></li></ul><p>To reproduce this experiment, you can follow these steps:</p><p>1. Clone the repository </p><pre><code>git clone https://github.com/Alex1795/profiler_experiments_blog.git</code></pre><p>2. Install required libraries:</p><pre><code>pip install -r requirements.txt</code></pre><p>3. Run the upload script. Make sure to have the following environment variables set beforehand</p><ol><li>ES_HOST</li><li>API_KEY</li></ol><p>Example configuration:</p><pre><code>ES_HOST="<your_deployment_url>"
API_KEY="<your_api_key>"</code></pre><p>To run the upload script, use:</p><pre><code>python data_upload.py</code></pre><p>This might take several minutes; it is streaming the data from Hugging Face.</p><p>4. Once the data is indexed in Elastic, you can run the experiments using:</p><pre><code>python profiler_experiments.py</code></pre><h3>Dataset selection</h3><p>For this analysis, we will be using pre-generated embeddings from the <a href="https://huggingface.co/datasets/wikimedia/wikipedia">wikimedia/wikipedia dataset</a>, created using the <a href="https://huggingface.co/Qwen/Qwen3-Embedding-4B">Qwen/Qwen3-Embedding-4B model</a>. We can find these embeddings already generated on <a href="https://huggingface.co/datasets/maknee/wikipedia_qwen_4b">Hugging Face</a>.</p><p>The model produces 2560-dimensional embeddings that capture the semantic relationships in the Wikipedia articles. This makes this dataset an adequate candidate for testing vector search performance with different index configurations. We will take 50,000 datapoints (documents) from the dataset.</p><p>All the documents will be used in 4 indices with 4 different configurations for the <code>dense_vector</code> field.</p><h3>Profiler data extraction</h3><p>The heart of the experiments is the <a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/profiler_experiments.py#L116">extract_profile_data</a> method. This function gets these metrics from the response:</p>
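As an illustrative, condensed sketch of that extraction (the real method in the linked repo is more thorough; the field paths follow the profile responses shown earlier):

```python
def extract_profile_data(response):
    """Condense a profiled search response into a flat dict of metrics."""
    data = {
        "total_time_ms": response["took"],  # top-level 'took' is already in ms
        "vector_search_time_ms": 0.0,
        "vector_ops_count": 0,
        "query_time_ms": 0.0,
    }
    for shard in response["profile"]["shards"]:
        # kNN timings live in the dfs section, in nanoseconds
        for knn in shard.get("dfs", {}).get("knn", []):
            data["vector_search_time_ms"] += knn["rewrite_time"] / 1e6
            data["vector_ops_count"] += knn["vector_operations_count"]
        # query-phase execution time, aggregated across shards
        for search in shard.get("searches", []):
            data["query_time_ms"] += search["query"][0]["time_in_nanos"] / 1e6
    return data
```

The table below summarizes where each of these metrics comes from in the raw profile.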
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Original field in the Search Profile</td><td style="border: 1px solid #ccc; padding: 4px;">Extracted metric</td><td style="border: 1px solid #ccc; padding: 4px;">Comment</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">response['took']</td><td style="border: 1px solid #ccc; padding: 4px;">total_time_ms</td><td style="border: 1px solid #ccc; padding: 4px;">The total time the query took to execute, populated directly from the top-level 'took' key.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">shard['dfs']['knn'][0]['rewrite_time']</td><td style="border: 1px solid #ccc; padding: 4px;">vector_search_time_ms</td><td style="border: 1px solid #ccc; padding: 4px;">The total time spent on vector search operations across all shards, aggregated and converted from nanoseconds to milliseconds.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">shard['dfs']['knn'][0]['vector_operations_count']</td><td style="border: 1px solid #ccc; padding: 4px;">vector_ops_count</td><td style="border: 1px solid #ccc; padding: 4px;">The total number of vector operations performed during the search, aggregated across all shards.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">shard['searches'][0]['query'][0]['time_in_nanos']</td><td style="border: 1px solid #ccc; padding: 4px;">query_time_ms</td><td style="border: 1px solid #ccc; padding: 4px;">The total time spent on query execution across all shards, aggregated and converted from nanoseconds to milliseconds.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">shard['searches'][0]['collector'][0]['time_in_nanos']</td><td style="border: 1px solid #ccc; padding: 4px;">collect_time_ms</td><td style="border: 1px solid #ccc; padding: 4px;">The total time spent on collecting and ranking results across all shards, aggregated and converted from nanoseconds to milliseconds.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">shard['fetch']['time_in_nanos']</td><td style="border: 1px solid #ccc; padding: 4px;">fetch_time_ms</td><td style="border: 1px solid #ccc; padding: 4px;">The total time spent on retrieving documents across all shards, aggregated and converted from nanoseconds to milliseconds.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">len(response['profile']['shards'])</td><td style="border: 1px solid #ccc; padding: 4px;">shard_count</td><td style="border: 1px solid #ccc; padding: 4px;">The total number of shards the query was executed on.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">(Calculated)</td><td style="border: 1px solid #ccc; padding: 4px;">other_time_ms</td><td style="border: 1px solid #ccc; padding: 4px;">The remaining time after accounting for vector search, query, collect, and fetch times, representing overhead such as network latency.</td>
                    </tr>
              </tbody>
            </table>
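<p>As a rough sketch, the aggregation described in the table can be implemented as follows. This is hypothetical code, not the repository's exact implementation; the field paths follow the Profile API response structure, and all nanosecond values are converted to milliseconds:</p>

```python
# Hypothetical sketch of extract_profile_data: walks a profiled search
# response and aggregates the metrics listed in the table above.
def extract_profile_data(response):
    shards = response["profile"]["shards"]
    metrics = {
        "total_time_ms": response["took"],  # top-level 'took' key
        "shard_count": len(shards),
        "vector_search_time_ms": 0.0,
        "vector_ops_count": 0,
        "query_time_ms": 0.0,
        "collect_time_ms": 0.0,
        "fetch_time_ms": 0.0,
    }
    for shard in shards:
        # kNN rewrite time and vector operation count come from the dfs phase
        for knn in shard.get("dfs", {}).get("knn", []):
            metrics["vector_search_time_ms"] += knn.get("rewrite_time", 0) / 1e6
            metrics["vector_ops_count"] += knn.get("vector_operations_count", 0)
        # Query and collector times are reported per shard in nanoseconds
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                metrics["query_time_ms"] += query["time_in_nanos"] / 1e6
            for collector in search.get("collector", []):
                metrics["collect_time_ms"] += collector["time_in_nanos"] / 1e6
        if "fetch" in shard:
            metrics["fetch_time_ms"] += shard["fetch"]["time_in_nanos"] / 1e6
    # Whatever remains is overhead, e.g. coordination and network latency
    metrics["other_time_ms"] = metrics["total_time_ms"] - (
        metrics["vector_search_time_ms"]
        + metrics["query_time_ms"]
        + metrics["collect_time_ms"]
        + metrics["fetch_time_ms"]
    )
    return metrics
```

<p>The function only assumes a response produced with <code>profile: true</code> and a <code>knn</code> section in the request; on responses without a dfs/knn phase, the vector metrics simply stay at zero.</p>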
<h3>Indices configuration</h3><p>Each index will have 4 fields:</p><ul><li>text <strong>(text type)</strong>: The original text used to generate the embedding</li><li>embedding <strong>(dense_vector type)</strong>: 2560-dimensional embedding with a different configuration for each index</li><li>category <strong>(keyword type)</strong>: A classification of the text by length: short, medium, or long</li><li>text_length <strong>(integer type)</strong>: Word count of the text</li></ul><p><a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/data_upload.py#L9-L36">wikipedia-brute-force-1shard</a></p><p>Relevant settings:</p><ul><li>Embedding type: float</li><li>Number of shards: 1</li></ul><p><a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/data_upload.py#L39-L66">wikipedia-brute-force-3shards</a></p><p>Relevant settings:</p><ul><li>Embedding type: float</li><li>Number of shards: 3</li></ul><p><a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/data_upload.py#L69-L98">wikipedia-float32-hnsw</a></p><p>Relevant settings:</p><ul><li>Embedding type: float, indexed with HNSW</li><li>m=16 (the number of neighbors each node will be connected to in the HNSW graph)</li><li>ef_construction=200 (the number of candidates to track while assembling the list of nearest neighbors for each new node)</li></ul><p>To learn more about parameters for the dense vector field, see <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector#dense-vector-params">Parameters for dense vector fields</a>.</p><h2>Experiment execution</h2><h3>Experiment 1: Flat vs. int8 HNSW dense vector</h3><p><strong>Objective:</strong> Compare the performance of a flat dense vector against an int8-quantized dense vector indexed with HNSW.</p><p><strong>Indices to use:</strong></p><ul><li>wikipedia-brute-force-1shard</li><li>wikipedia-int8-hnsw</li></ul><p><strong>Hypothesis: </strong>The HNSW index will have significantly lower query latency, especially on 
larger datasets, as int8 quantization reduces memory usage by 75% and the HNSW graph avoids comparing the query vector with every vector in the dataset.</p><p><strong>Kibana Search Profiler results:</strong></p><ul><li>wikipedia-brute-force-1shard</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/af55b18fd9fdd06744e4ebcb31df5d05ce593868-878x223.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea589d6b609ddc2bb288b9dde88de08dde115b0b-444x499.png" style="width: 100%;" /><ul><li>wikipedia-int8-hnsw</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a9ba30bf7a6233a7c772da71852999639a8af17a-882x226.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b2a635575ae3e1df3ee9958f05d8da90b60eea1f-524x500.png" style="width: 100%;" /><p><strong>Experiment results:</strong></p><pre><code>=== Experiment 1: Flat vs. HNSW dense vector ===

Testing Flat (float32) (wikipedia-brute-force-1shard)...
  Average total time (ES): 528.67ms
  Average vector search time: 517.52ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 7.37ms
  Average wall clock time: 853.63ms
  Vector operations: 50000

Testing HNSW (int8) (wikipedia-int8-hnsw)...
  Average total time (ES): 12.67ms
  Average vector search time: 3.66ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 7.47ms
  Average wall clock time: 140.74ms
Vector operations: 2352</code></pre><p>We can see from the metrics that the float approach performed <strong>50,000</strong> vector operations, meaning it compared the query vector with every vector in the dataset, resulting in a <strong>~140× increase in vector search time</strong> compared with the HNSW index.</p><p>The graph below shows that even though the other metrics are similar, vector search takes much longer with a flat float-type dense vector. That said, it is worth noting that int8 quantization reduces recall compared with a non-quantized vector.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f14eaf8be209ff27d78f0e0409e9ee423aba8519-1600x1091.png" style="width: 100%;" /><h3>Experiment 2: Impact of over-sharding on brute force search</h3><p><strong>Objective</strong>: Understand how excessive sharding on a single-node Elasticsearch deployment negatively impacts vector search query performance.</p><p><strong>Indices to use:</strong></p><ul><li>wikipedia-brute-force-1shard: The single-shard baseline.</li><li>wikipedia-brute-force-3shards: The multi-shard version.</li></ul><p><strong>Hypothesis</strong>: On a single-node deployment, increasing the number of shards will degrade query performance rather than improve it. The 3-shard index will exhibit higher total query latency compared to the 1-shard index. 
More generally, this applies whenever the sharding strategy is inadequate for the underlying infrastructure.</p><p><strong>Kibana Search Profiler results:</strong></p><ul><li>wikipedia-brute-force-1shard</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f08308e6249074cc6eb53bfb411ce0ea960ee850-876x247.png" style="width: 100%;" /><ul><li>wikipedia-brute-force-3shards</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/10caeb04aef97979f5bb730ca14942fd87e8c1ef-877x439.png" style="width: 100%;" /><p>Notice that the cumulative time shown here is more than three times higher, because the search runs in 3 separate shards.</p><p><strong>Experiment results:</strong></p><pre><code>=== Experiment 2: Impact of Sharding on Brute Force Search ===

Testing 1 Shard (wikipedia-brute-force-1shard)...
  Shards: 1
  Average total time (ES): 40.00ms
  Average vector search time: 27.15ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 8.50ms
  Average wall clock time: 204.40ms
  Vector operations: 50000

Testing 3 Shards (wikipedia-brute-force-3shards)...
  Shards: 3
  Average total time (ES): 67.77ms
  Average vector search time: 55.36ms
  Average query time: 0.02ms
  Average collect time: 0.03ms
  Average fetch time: 8.70ms
  Average wall clock time: 338.77ms
Vector operations: 50000</code></pre><p>We can see that even when executing the exact same number of vector operations, having <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#create-a-sharding-strategy">too many shards</a> for this specific dataset added more vector search time, making the query slower overall. This demonstrates that our sharding strategy must go hand in hand with our cluster architecture.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6f84faecbcc4c55acc9b8026d5a7cbdc787e4541-1600x903.png" style="width: 100%;" /><h3>Experiment 3: Combined filter and vector search</h3><p><strong>Objective</strong>: Demonstrate how Elasticsearch efficiently handles pre-filtering before a vector search.</p><p><strong>Indices to use:</strong></p><ul><li>wikipedia-brute-force-1shard</li></ul><p><strong>Note:</strong> This experiment is only applicable to hosted deployments, since we can&#x27;t control the number of shards on serverless. <strong>It will be automatically skipped in a serverless project.</strong></p><p><strong>Setup</strong>: Construct a query that combines a KNN query for a vector search with a filter.</p><p><strong>Hypothesis</strong>: When a filter is applied, Elasticsearch first prunes the documents that don&#x27;t match the filter before performing the expensive vector search on the matching documents. The Profile API will show that the number of documents searched by the vector search operation is significantly lower than the total number of documents in the index, leading to a faster query. We will run the query with 4 configurations:</p><ul><li><a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/profiler_experiments.py#L362">No filter</a></li></ul><pre><code>"knn": {
        "field": "embedding",
        "query_vector":[...],
        "k": k,
        "num_candidates": num_candidates,
        "filter":[] // no filters
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9f87da872084856af951838cd8956cf1f1967a72-885x243.png" style="width: 100%;" /><ul><li>Term filter on the<a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/profiler_experiments.py#L363"> category field</a></li></ul><pre><code>"knn": {
        "field": "embedding",
        "query_vector":[...],
        "k": k,
        "num_candidates": num_candidates,
        "filter":[
          {
            "term":{
              "category": "short" // term filter on category
               }
          }
         ]
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d1a86e141843cf17128df11c28f4a7c74212f204-883x232.png" style="width: 100%;" /><ul><li>Range filter on the<a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/profiler_experiments.py#L364"> text_length field</a></li></ul><pre><code>"knn": {
        "field": "embedding",
        "query_vector":[...],
        "k": k,
        "num_candidates": num_candidates,
        "filter":[
          {
            "range":{
              "text_length": { // range filter on text_length
                "gte": 1000,
                "lte": 2000
              }
            }
          }
         ]
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a1f30c326816467154cc6e4471738329c3157dfb-887x235.png" style="width: 100%;" /><ul><li>A <a href="https://github.com/Alex1795/profiler_experiments_blog/blob/main/profiler_experiments.py#L365-L368">combined filter</a>: term filter on the category field + range filter on the text_length field</li></ul><pre><code>"knn": {
        "field": "embedding",
        "query_vector":[...],
        "k": k,
        "num_candidates": num_candidates,
        "filter":[ // the two previous filters combined in the same query
          {
            "range":{
              "text_length": {
                "gte": 1000,
                "lte": 2000
              }
            }
          },
          {
            "term":{
              "category": "short"
               }
          }
         ]
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2b8fa7d928376c47552dc0ef32cda0c5716e95b0-891x243.png" style="width: 100%;" /><p><strong>Results:</strong></p><pre><code>=== Experiment 3: Combined Filter and Vector Search ===

Testing No Filter...
  Total hits: 10.0
  Average total time (ES): 50.80ms
  Average vector search time: 42.37ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 7.07ms
  Average wall clock time: 287.01ms
  Vector operations: 50000

Testing Category Filter...
  Total hits: 10.0
  Average total time (ES): 8.00ms
  Average vector search time: 0.78ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 6.11ms
  Average wall clock time: 134.40ms
  Vector operations: 198

Testing Text Length Filter...
  Total hits: 10.0
  Average total time (ES): 18.40ms
  Average vector search time: 9.93ms
  Average query time: 0.01ms
  Average collect time: 0.02ms
  Average fetch time: 7.15ms
  Average wall clock time: 144.74ms
  Vector operations: 10387

Testing Combined Filters...
  Total hits: 1.0
  Average total time (ES): 2.20ms
  Average vector search time: 0.68ms
  Average query time: 0.00ms
  Average collect time: 0.01ms
  Average fetch time: 0.59ms
  Average wall clock time: 127.28ms
Vector operations: 1</code></pre><p>We can see that applying filters adds fetch time to our search, but in exchange, it reduces the vector search time dramatically because it executes fewer vector operations. This shows how Elastic applies filtering before the vector search to improve performance and avoid wasting resources on scoring documents that would have been filtered out anyway.</p><p>Even though the results are capped at a maximum (k=10), many more vector operations are executed under the hood if we don&#x27;t filter out documents first. This effect is most pronounced with a flat dense vector, but even with quantized vectors, we can still reduce execution time by applying filters before the vector search.</p><p>In the graph, we can see that the query time increased slightly with the filters, while the vector search time dropped sharply, lowering the total time. Adding more filters reduced the total time even further, so applying them is clearly worthwhile.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c1b1a6a095112946397bee88c82ab721c8834ffb-1500x800.png" style="width: 100%;" /><p>The results highlight how filtering improves efficiency and is a key benefit of using a hybrid search engine like Elasticsearch.</p><h3>Experiment 4: Compare cold vs. cached query performance</h3><p><strong>Objective:</strong> Demonstrate how Elasticsearch&#x27;s caching mechanisms significantly improve query performance when the same vector search is executed multiple times.</p><p><strong>Indices to use:</strong></p><ul><li>wikipedia-float32-hnsw</li></ul><p><strong>Setup</strong>:</p><p>First, clear the Elasticsearch cache.</p><p>Then, execute the same vector search query twice:</p><ul><li>Cold query: First execution after cache clearing</li></ul><img 
src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/56357cca942a419da41bbcc69341dae23d076c1d-883x199.png" style="width: 100%;" /><ul><li>Cached query: Second execution with caches populated</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ed272973eac31aa2d12d913bd2a1b5474509deee-889x248.png" style="width: 100%;" /><p><strong>Hypothesis: </strong>The cached (warm) query will execute significantly faster than the cold query. The Profile API will show reduced times across all query phases, with the most dramatic improvements in vector search operations and data retrieval phases.</p><p><strong>Results:</strong></p><pre><code>=== Experiment 4: Cache Performance (Cold vs Warm Queries) ===

Testing Cold Query (First Run)...
  Clearing caches...
  Runs executed: 1
  Average total time (ES): 490.00ms
  Average vector search time: 474.77ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 13.48ms
  Average wall clock time: 728.77ms
  ↳ This represents cold start performance

Testing Warm Query (Cached)...
  Runs executed: 5
  Average total time (ES): 14.60ms
  Average vector search time: 6.99ms
  Average query time: 0.01ms
  Average collect time: 0.01ms
  Average fetch time: 3.96ms
  Average wall clock time: 144.35ms</code></pre><p>This experiment shows the impact of Elasticsearch&#x27;s cache on vector search performance. Elastic keeps the embedding data in memory, so it executes faster. On the other hand, if the data isn’t in memory and Elastic has to read from disk often, searches become slower.</p><p>In this case, the cold query, executed after clearing all caches, took <strong>490ms total time</strong> with vector search operations consuming 474.77ms. This shows the &quot;first-time&quot; cost of loading index segments and vector data structures into memory. In contrast, the warm queries averaged just <strong>14.6ms total time</strong> with vector search dropping to 6.99ms, demonstrating a remarkable <strong>33x overall speedup and 68x improvement in vector search operations.</strong></p><p>In the graph, we can see the huge difference between the cached and cold queries. This result highlights why vector search systems benefit from an initial warm-up period.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0ee321e4b23bfae87575e50c65f8ec094dd27fbc-1600x862.png" style="width: 100%;" /><h2>Conclusion</h2><p>Search profiling can let us look into the execution of our queries and, by extension, compare them. This opens the door to comprehensive analysis that can drive design decisions. 
In our case, we could compare different dense vector configurations and derive actionable insights.</p><p>Specifically, we used the profiler to confirm in practice that:</p><ol><li>A quantized dense vector performs queries much faster than a non-quantized one</li><li>Having an appropriate sharding strategy can lead to better performance</li><li>Combining vector search with filters is a powerful tool to improve query performance</li></ol><p>Caching can impact performance meaningfully, so for production systems, it might be a good idea to start with a warm-up process using common queries.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Influencing BM25 ranking with multiplicative boosting in Elasticsearch]]></title>
            <description><![CDATA[Learn why additive boosting methods can destabilize BM25 rankings and how multiplicative scoring provides controlled, scalable ranking influence in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/bm25-ranking-multiplicative-boosting-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/bm25-ranking-multiplicative-boosting-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt]]></dc:creator>
            <pubDate>Mon, 22 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6150bb846170d9be926a19260846a161ed377a5f-1098x542.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> is one of the most widely used scoring models in Elasticsearch for text-based search. In many e-commerce implementations, it forms a major component of how product relevance is determined because it provides a well-understood, interpretable score that reflects how closely an item matches a shopper’s query. In addition to this text relevance, merchandising and search teams often need to influence the ranking with business metrics such as margin, stock levels, popularity, personalization, or campaign strategy, in a way that doesn’t destabilize the underlying text relevance.</p><p>The most intuitive levers for doing this are boosted <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">should</a> clauses or <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/rank-feature">rank_feature</a> fields. These may initially appear effective, but both approaches degrade and may even fail, as query patterns shift or catalog composition changes. Their shared limitation is that they introduce additive adjustments into a scoring system whose scale varies substantially across queries. A boost like “+2” might overwhelm the base BM25 score in one query while barely registering in another. In other words, additive methods may create brittle, unpredictable ranking behavior.</p><p>In contrast, <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">function_score</a> with multiplicative boosting provides a stable and mathematically proportional way to shape BM25 scores without distorting their underlying structure. 
Your application logic determines what merits uplift; <code>function_score</code> expresses that intent in a predictable and explainable way that preserves the geometry (high-level relative ordering) of the BM25 relevance signal, nudging rankings in controlled ways rather than overwhelming the core text relevance.</p><p>This article builds on two earlier pieces that demonstrated practical uses of multiplicative boosting: (1) <a href="https://www.elastic.co/search-labs/blog/function-score-query-boosting-profit-popularity-elasticsearch">Boosting e-commerce search by profit and popularity with the function score query in Elasticsearch</a>, and (2) <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-relevance-cohort-aware-ranking-elasticsearch">How to improve e-commerce search relevance with personalized cohort-aware ranking</a>. Here we step back from those examples to examine the architectural principle that underlies them: why multiplicative boosting via <code>function_score</code> is one of the most reliable and scalable ways to influence BM25-based ranking in Elasticsearch.</p><h2>Why it&#x27;s important to preserve base BM25 rankings</h2><p>In many Elasticsearch-based applications, including e-commerce, BM25 remains a central component of how text relevance is assessed. It provides a signal that is interpretable and transparent for teams who need to understand why a product ranked where it did. These properties make BM25 particularly attractive in environments where explainability and operational predictability matter.</p><p>Because of this, most teams want to shape, rather than replace, the rankings produced by BM25. For example, they may want to allow higher-margin items to surface slightly more often, reduce exposure for low-stock products without hiding them, or highlight items aligned with a particular user segment. 
Ideally, this shaping should preserve the geometry of the rankings produced by the BM25 algorithm.</p><p>The difficulty arises when teams try to achieve these goals using mechanisms that add separate scoring streams on top of the base BM25 ranking. These additive adjustments are not always comparable to BM25’s scale and behave inconsistently as queries, data distributions, and catalog composition evolve. Over time, the ranking becomes brittle, unintuitive, and difficult to tune. A reliable influence mechanism must work with BM25’s scoring geometry rather than overpowering it.</p><p>The <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">function_score</a> query with multiplicative boosting provides this property. It allows teams to apply business influence in a proportional, explainable way while keeping BM25’s underlying structure intact.</p><h2>Why many approaches to influencing ranking degrade (or break) BM25</h2><p>Teams often begin with mechanisms that look straightforward: boosted <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">should</a> clauses, <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/rank-feature">rank_feature</a> fields, or custom <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query">script_score</a> logic. These tools can be effective in their intended use cases, which is why they seem like natural levers for adding business influence. But when they are used to shape or influence BM25-based text relevance, they may create unstable, opaque, or brittle ranking behavior.</p><p>The underlying issue is that these approaches introduce independent additive scoring contributions into a system whose base BM25 values vary widely across queries, fields, and data sets. 
Without respecting that variability, the influence becomes unpredictable.</p><p>Below are the three most common patterns and why they fail in practice.</p><h3>1. Additive boosts via should clauses</h3><p>A boosted <code>should</code> clause feels intuitive: “Promote items that match this business rule.” But under the hood, the behavior is fundamentally additive.</p><p>Consider a query of the form:</p><pre><code>GET products/_search
{
  "query": {
    "bool": {
      "must": [ { "match": { "description": "running shoes" }}],
      "should": [ { "term": { "brand": { "value": "nike", "boost": 1 }}}]
    }
  }
}</code></pre><p>This kind of query results in the following behavior:</p><pre><code>final_score = base_BM25 + should_BM25</code></pre><p>The problem is that <code>base_BM25</code> and <code>should_BM25</code> do not scale together. As your dataset changes, or as different queries are issued, the magnitude of BM25 can shift dramatically. For example, the base BM25 scores for three products might be 12, 8, 4 in one context, and 0.12, 0.08, 0.04 in another. Such a change might happen after a catalog update or a modification to the query structure.</p><p>A boosted <code>should</code> clause adds its own BM25-style contribution to the final score. In this situation, an additive contribution (i.e. should_BM25 = +2) behaves inconsistently:</p><ul><li>When base_BM25 is small (0.12), +2 dominates the score — roughly an 18× increase.</li><li>When base_BM25 is large (12), the same +2 barely shifts the document —&nbsp; only about a 17% increase.</li></ul><p>This instability means that the combined <code>must</code> score and <code>should</code> score have no stable meaning across queries or catalogs. A rule that slightly promotes a brand for one query can dominate the ranking for another, or become irrelevant in a third. This is not a tuning issue; it is a structural property of additive scoring.</p><h3>2. Using rank_feature for business influence</h3><p>The <code>rank_feature</code> family is extremely useful for representing numeric qualities such as recency or popularity. It is fast, compressed, and operationally simple. However, when it is used to influence text relevance (BM25), it runs into the same structural limitation described in the previous section.</p><p>A <code>rank_feature</code> clause produces its own scoring contribution, which is then added to the BM25 score:</p><pre><code>final_score = base_BM25 + feature_score</code></pre><p>Just as with boosted <code>should</code> clauses, the two components do not scale together. 
BM25 values vary substantially across queries depending on term rarity and catalog statistics, while the <code>feature_score</code> follows the scale of the underlying business attribute being boosted (for example, popularity or recency), which typically bears no relationship to the scale of BM25. As a result, the two scoring streams drift apart as your corpus or query patterns evolve.</p><p>The consequence is the same as the should-clause problem discussed above:</p><ul><li>The feature score can dominate BM25 in one query and be negligible in another.</li><li>Tuning becomes fragile because you are calibrating two independent scales — BM25, which varies with query term statistics, and the feature score, which varies with the business attribute’s own distribution.</li></ul><p>Although <code>rank_feature</code> remains an excellent mechanism for representing raw numeric attributes, it is not well-suited for proportional influence on BM25, where the goal is not to add a second score but to gently shape the existing one.</p><h3>3. Custom scoring with script_score</h3><p>When boosted clauses or <code>rank_feature</code> fields become difficult to tune, teams often turn to <code>script_score</code> as a last resort. It provides complete freedom to manipulate the score, including adding, subtracting, multiplying, or replacing the BM25 value according to any business rule. A <code>script_score</code> query replaces Elasticsearch’s scoring pipeline with custom logic. Instead of shaping the BM25 score, the script builds a separate scoring mechanism whose behavior depends entirely on the code inside the script. While this can be powerful, it introduces three challenges that become more significant as the system grows.</p><p><strong>1. Opacity</strong></p><p>Scoring logic is hidden inside a script rather than expressed declaratively. 
When ranking behavior changes unexpectedly, it is difficult to understand whether the issue is the script itself, a data shift, or an interaction with BM25. Merchandisers and relevance engineers lose the ability to reason about why a document moved up or down.</p><p><strong>2. Performance and operational cost</strong></p><p>Script scoring bypasses many of Elasticsearch’s optimizations and caching pathways. Each document that matches the initial query must execute the script, often leading to higher CPU usage and unpredictable latency.</p><p><strong>3. Fragility when combined with BM25</strong></p><p>Because <code>script_score</code> allows arbitrary computations, it is easy to drift into scoring behaviors that no longer resemble BM25 or that fail to preserve its relative structure. As the dataset evolves or query patterns shift, the custom logic may interact with BM25 in unanticipated ways. A script that behaved reasonably early in development can produce surprising or unstable results once the catalog grows or data distributions change. Because <code>script_score</code> allows arbitrary math, two engineers working on different parts of the system may unintentionally encode competing scoring models, making ranking difficult to reason about as the organization scales.</p><h2>How function_score provides predictable influence on BM25</h2><p>BM25 already captures how well a document matches a query. It reflects text relevance, term rarity, document length, and the statistical shape of the corpus. When teams introduce business signals including margin, stock levels, popularity, personalization, or merchandising strategy, the goal is not to replace this relevance. The goal is to <em>influence it</em>.</p><p>This distinction is subtle but crucial. 
Most business requirements are proportional in nature:</p><ul><li>Promote higher-margin items modestly</li><li>Reduce exposure for low-stock products, but don’t hide them</li><li>Give this user segment a slight uplift for matching products</li><li>Boost for popularity, but not so much that textual relevance is lost</li></ul><p>These are naturally expressed as <em>percentage adjustments</em> rather than as fixed additive values. A merchandiser is rarely asking for “+2 points of score”; they are asking for “a little more visibility,” irrespective of the absolute numeric scale of the BM25 score. Mathematically, this means that the desired transformation is:</p><pre><code>final_score = BM25 × boost_factor</code></pre><p>Where <em>boost_factor</em> might be 1.05, 1.2, or 1.5, depending on the signal. Multiplicative boosting does not attempt to reinvent scoring; it simply adjusts the BM25 output by a proportional factor. A multiplicative adjustment has three properties that align well with real-world ranking control:</p><ol><li>The boost remains proportional. In other words, a 20% uplift is always a 20% uplift—whether BM25 is 0.12 or 12. The magnitude of the boost does not depend on the underlying BM25 scale.</li><li>BM25 retains its role as the primary signal. The multiplicative shaping nudges the ordering without overriding it. Strong textual matches still win; business logic influences but does not dominate.</li><li>Because the operation is multiplicative, not additive, changing the query or updating the corpus does not require re-tuning numeric constants. The boost has the same meaning everywhere.</li></ol><p>Elasticsearch’s <code>function_score</code> query provides an elegant mechanism for expressing this pattern. 
By using:</p><ul><li><strong>score_mode: “sum”</strong> to assemble a boost factor (building the multiplier), and</li><li><strong>boost_mode: “multiply”</strong> to apply the boost (multiplier) to BM25</li></ul><p>You can express business intent in a way that remains stable and explainable as your data and query patterns evolve. Instead of adding a second score beside BM25, <code>function_score</code> transforms BM25 itself—shaping it gently, predictably, and in line with how merchandisers and product owners think about ranking adjustments.</p><h2>Examples in practice: How multiplicative boosting behaves in real e-commerce queries</h2><p>To illustrate how multiplicative boosting works in real-world ranking scenarios, it helps to look at a small, concrete example. The goal here is not to demonstrate tuning or production-scale scoring, but rather to show how <code>function_score</code> influences BM25 in predictable, proportional ways that align with business intent.</p><p>Consider a simple catalog with three basketball shoes from three different brands: Nike, Adidas, and Reebok. The product descriptions are intentionally crafted so the BM25 scores exhibit natural differences based on query specificity and field length—just as they would in a real catalog.</p><h3>Example dataset</h3><p>For the following examples, we use a small, straightforward sample dataset with the following characteristics.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Brand</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">nike</td><td style="border: 1px solid #ccc; padding: 4px;">“Nike basketball shoes”</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">adidas</td><td style="border: 1px solid #ccc; padding: 4px;">“New Adidas basketball shoes”</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">reebok</td><td style="border: 1px solid #ccc; padding: 4px;">“Reebok basketball shoes”</td>
                    </tr>
              </tbody>
            </table>
          <p>We can create an index with the above products with the following commands from Kibana Dev Tools:</p><pre><code>PUT products
{
  "mappings": {
    "properties": {
      "brand":       { "type": "keyword" },
      "description": { "type": "text" }
    }
  }
}

POST products/_bulk
{ "index": { "_id": "nike-001" } }
{ "brand": "nike",    "description": "Nike basketball shoes" }
{ "index": { "_id": "adi-001" } }
{ "brand": "adidas",  "description": "New Adidas basketball shoes" }
{ "index": { "_id": "ree-001" } }
{ "brand": "reebok", "description": "Reebok basketball shoes" }</code></pre><p>With this dataset, we now evaluate three queries:</p><ul><li>A baseline “basketball shoes” search</li><li>The same query with a 50% promotion for Adidas and a 25% promotion for Nike</li><li>A specific “Reebok basketball shoes” query while the Adidas and Nike promotions are still active</li></ul><p>Each scenario highlights a different property of multiplicative boosting.</p><h3>1. Baseline ranking: No promotion</h3><pre><code>GET products/_search
{
  "size": 3,
  "_source": ["brand", "description"],
  "query": {
    "match": { "description": "basketball shoes" }
  }
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/050f44956c112ea9916d34946e9296480354f7d3-2322x1324.png" style="width: 100%;" /><p>This query returns the following results, where Nike and Reebok are ranked above Adidas:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Rank</td><td style="border: 1px solid #ccc; padding: 4px;">Brand</td><td style="border: 1px solid #ccc; padding: 4px;">Score (BM25)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1/2 (tie)</td><td style="border: 1px solid #ccc; padding: 4px;">nike</td><td style="border: 1px solid #ccc; padding: 4px;">0.27845407</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1/2 (tie)</td><td style="border: 1px solid #ccc; padding: 4px;">reebok</td><td style="border: 1px solid #ccc; padding: 4px;">0.27845407</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">adidas</td><td style="border: 1px solid #ccc; padding: 4px;">0.24686474</td>
                    </tr>
              </tbody>
            </table>
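<p>Why do Nike and Reebok tie while Adidas trails? BM25’s length normalization penalizes the Adidas document for its one extra token (“New”). As a rough illustration, the baseline scores can be reproduced in Python. This is a sketch under assumptions — Lucene’s default <code>k1=1.2</code> and <code>b=0.75</code> and naive lowercase whitespace tokenization — not Elasticsearch’s actual implementation:</p>

```python
import math

K1, B = 1.2, 0.75  # Lucene/Elasticsearch BM25 defaults (assumed)

def bm25_score(query, doc, corpus):
    """BM25 score of one document, with naive lowercase whitespace tokenization."""
    docs = [d.lower().split() for d in corpus]
    terms = doc.lower().split()
    avgdl = sum(len(d) for d in docs) / len(docs)   # average document length
    score = 0.0
    for term in query.lower().split():
        tf = terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)      # document frequency
        idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
        norm = K1 * (1 - B + B * len(terms) / avgdl)  # length normalization
        score += idf * tf * (K1 + 1) / (tf + norm)
    return score

catalog = {
    "nike":   "Nike basketball shoes",
    "adidas": "New Adidas basketball shoes",
    "reebok": "Reebok basketball shoes",
}
docs = list(catalog.values())
scores = {brand: bm25_score("basketball shoes", text, docs)
          for brand, text in catalog.items()}
# scores ≈ {"nike": 0.2785, "adidas": 0.2469, "reebok": 0.2785}
```

<p>The extra token raises the Adidas description’s length relative to the average, shrinking its per-term contribution, while the Nike and Reebok documents are identical in length and term statistics — hence the exact tie.</p>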
          <h3>2. Adding 50% Adidas uplift and 25% Nike uplift with function_score</h3><p>If marketing launches a campaign where Adidas basketball shoes should receive a 50% uplift and Nike a 25% uplift, then the application layer could construct its queries to include those uplifts as follows:</p><pre><code>GET products/_search
{
  "size": 3,
  "_source": ["brand", "description"],
  "query": {
    "function_score": {
      "query": {
        "match": { "description": "basketball shoes" }
      },
      "functions": [
        {
          "filter": { "term": { "brand": "adidas" } },
          "weight": 0.5
        },
        {
          "filter": { "term": { "brand": "nike" } },
          "weight": 0.25
        },
        {
          "weight": 1.0
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}</code></pre><h3>How the multiplier is constructed</h3><ul><li>Base weight = 1.0</li><li>Adidas gets an additional +0.5</li><li>So Adidas’s multiplier = 1.5</li><li>Nike gets an additional +0.25</li><li>So Nike’s multiplier = 1.25</li><li>All other brands (including Reebok) get the base weight multiplier = 1.0</li></ul><h3>Apply multiplier:</h3><p>Final score = BM25 × multiplier</p>
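<p>The multiplier construction can be worked through in a few lines of Python, using the BM25 values returned by the baseline query; the <code>final_score</code> helper is hypothetical, mirroring what <code>score_mode: "sum"</code> and <code>boost_mode: "multiply"</code> compute:</p>

```python
# BM25 scores from the baseline "basketball shoes" query
bm25 = {"adidas": 0.24686474, "nike": 0.27845407, "reebok": 0.27845407}

# Per-brand uplifts from the filtered functions; every document
# also matches the unfiltered {"weight": 1.0} catch-all.
uplift = {"adidas": 0.5, "nike": 0.25}

def final_score(brand):
    multiplier = 1.0 + uplift.get(brand, 0.0)  # score_mode: "sum"
    return bm25[brand] * multiplier            # boost_mode: "multiply"

ranking = sorted(bm25, key=final_score, reverse=True)
# ranking == ["adidas", "nike", "reebok"]
```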
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Product</td><td style="border: 1px solid #ccc; padding: 4px;">BM25</td><td style="border: 1px solid #ccc; padding: 4px;">Multiplier</td><td style="border: 1px solid #ccc; padding: 4px;">Final score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Adidas</td><td style="border: 1px solid #ccc; padding: 4px;">0.24686474</td><td style="border: 1px solid #ccc; padding: 4px;">1.5</td><td style="border: 1px solid #ccc; padding: 4px;">0.37029710</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Nike</td><td style="border: 1px solid #ccc; padding: 4px;">0.27845407</td><td style="border: 1px solid #ccc; padding: 4px;">1.25</td><td style="border: 1px solid #ccc; padding: 4px;">0.34806758</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Reebok</td><td style="border: 1px solid #ccc; padding: 4px;">0.27845407</td><td style="border: 1px solid #ccc; padding: 4px;">1.0</td><td style="border: 1px solid #ccc; padding: 4px;">0.27845407</td>
                    </tr>
              </tbody>
            </table>
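<p>It is instructive to contrast this with an additive boost. The constant below (+0.12) is purely illustrative: it roughly mimics a 50% uplift at this catalog’s BM25 scale (around 0.25), but at a larger BM25 score the same constant does almost nothing, while the multiplier keeps its meaning:</p>

```python
def relative_uplift(original, boosted):
    """Uplift expressed as a fraction of the original score."""
    return boosted / original - 1.0

for score in (0.25, 12.0):  # a low and a high BM25 score
    add = relative_uplift(score, score + 0.12)  # additive boost of +0.12
    mul = relative_uplift(score, score * 1.5)   # multiplicative boost of x1.5
    print(f"BM25={score}: additive {add:.0%}, multiplicative {mul:.0%}")
# BM25=0.25: additive 48%, multiplicative 50%
# BM25=12.0: additive 1%, multiplicative 50%
```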
          <h3>Result</h3><p>Adidas moves to the top, Nike follows, and Reebok is at the bottom with no change in its score. This is exactly the behavior that multiplicative boosting is designed to produce:</p><ul><li>Adidas and Nike both gain visibility, but in proportion to their configured uplifts.</li><li>The relative differences in BM25 still matter; we are reshaping the ranking, not replacing it.</li><li>The ordering changes primarily where BM25 scores are close.</li></ul><p>With additive boosts, the same “50% versus 25%” business intent would have to be approximated with numeric constants on an arbitrary BM25 scale, and the effect would vary drastically across queries.</p><h2>3. Specific intent still wins: “Reebok basketball shoes”</h2><p>Now run a highly specific branded query for “Reebok basketball shoes”, with the same Adidas (50%) and Nike (25%) promotions still active:</p><pre><code>GET products/_search
{
  "size": 3,
  "_source": ["brand", "description"],
  "query": {
    "function_score": {
      "query": {
        "match": { "description": "Reebok basketball shoes" }
      },
      "functions": [
        {
          "filter": { "term": { "brand": "adidas" } },
          "weight": 0.5
        },
        {
          "filter": { "term": { "brand": "nike" } },
          "weight": 0.25
        },
        {
          "weight": 1.0
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}</code></pre><p>The response shows the following results:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Rank</td><td style="border: 1px solid #ccc; padding: 4px;">Brand</td><td style="border: 1px solid #ccc; padding: 4px;">Final score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">reebok</td><td style="border: 1px solid #ccc; padding: 4px;">1.3011196</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">adidas</td><td style="border: 1px solid #ccc; padding: 4px;">0.3702971</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">nike</td><td style="border: 1px solid #ccc; padding: 4px;">0.34806758</td>
                    </tr>
              </tbody>
            </table>
          <h3>Result</h3><p>Reebok wins overwhelmingly because BM25 correctly detects strong intent for “Reebok basketball shoes”. Adidas and Nike still receive their 50% and 25% promotions, respectively, but those multipliers are nowhere near enough to override the BM25 score.</p><p>This is exactly the behavior that multiplicative boosting is designed to produce:</p><ul><li>When BM25 scores are close, boosts can shift the relative ordering.</li><li>When BM25 scores differ significantly (as they do here, due to strong text matching), the same boosts have little practical effect.</li></ul><p>Promotions influence the ranking, but they do not override the core text relevance signal.</p><h2>What this example demonstrates</h2><p>These real queries illustrate the key properties of multiplicative boosting:</p><ol><li>The influence is proportional, not arbitrary. A percentage-based uplift has the same proportional effect regardless of the underlying BM25 scale.</li><li>Text relevance remains in control. Strong brand-intent queries still surface the correct product.</li><li>The system behaves intuitively. Merchandisers see exactly the ranking changes they expect.</li><li>The math is stable across queries. The same promotion works correctly whether the match is broad or highly specific.</li><li>Application logic stays clean. The business layer decides the uplift; Elasticsearch applies it predictably.</li></ol><p>Multiplicative boosting through <code>function_score</code> preserves relevance in a predictable and controllable way, while enabling business impact.</p><h2>Application logic remains the author of influence</h2><p>There is a clear separation between deciding what should be boosted and applying that boost in Elasticsearch. 
<code>function_score</code> handles the second task, but the first belongs firmly to application logic.</p><p>Your application logic is where decisions are made about:</p><ul><li>Which margin thresholds matter for your business</li><li>Whether popularity should rise or fall based on seasonality</li><li>How to interpret customer behavior or cohort membership</li><li>How to encode campaign rules</li><li>When to surface or suppress certain product groups</li></ul><p>These are <em>business</em> decisions, not scoring decisions. Elasticsearch does not infer whether a user is budget-focused or luxury-oriented, whether a promotion is active, or whether low stock requires a visibility adjustment. Those determinations occur upstream, in the part of the system that has access to user context, session features, analytics, and business configuration. After application logic produces clear numeric signals, such as weights, uplift factors, thresholds, and cohort tags, a <code>function_score</code> query provides a reliable way to express those signals as controlled multipliers on BM25.</p><p>This creates a clean architectural contract:</p><ul><li>Application logic decides <em>what</em> should be influenced.</li><li>BM25 provides the core text relevance.</li><li><code>function_score</code> applies influence in a mathematically stable way.</li></ul><p>Because business logic lives outside the index, teams can adjust or experiment with uplift strategies without reindexing or restructuring documents.</p><h2>Conclusion</h2><p>E-commerce search must balance core text relevance with business considerations such as profitability, stock position, customer intent, seasonality, and personalization. BM25 provides a stable and interpretable foundation for text relevance, but influencing that score requires care. 
Business signals should shape the ranking, not overpower it.</p><p>However, the most commonly used levers such as boosted <code>should</code> clauses, <code>rank_feature</code> fields, and ad-hoc script scoring often behave unpredictably. These approaches can appear effective in early development, but their limitations emerge as soon as the catalog evolves or new query patterns arrive. Additive boosts fluctuate wildly because their impact depends entirely on the underlying scale of BM25, which varies dramatically across queries. A boost that produces a subtle nudge in one situation can dominate the ordering in another. Script scoring introduces its own challenges: opaque logic, reduced performance, and scoring behavior that becomes harder to understand or maintain over time.</p><p>Multiplicative boosting with <code>function_score</code> avoids these pitfalls by transforming BM25 proportionally rather than competing with it. Instead of adding a second, independent score component, it applies a controlled multiplier to BM25 itself. This produces the kind of predictable adjustments that merchandisers actually intend. For example, it allows slight promotions for high-margin items, modest reductions for low-stock products, or gentle uplifts for relevant user cohorts.</p><p>Equally important, the architecture remains clean. Application logic determines which business signals matter, and <code>function_score</code> applies them in a consistent, explainable way. 
Business teams can evolve strategy without destabilizing relevance, and engineering teams can refine relevance without disturbing business rules.</p><p>This principle is the foundation of the previous blogs that demonstrated how to influence e-commerce rankings: (1) <a href="https://www.elastic.co/search-labs/blog/function-score-query-boosting-profit-popularity-elasticsearch">Boosting e-commerce search by profit and popularity with the function score query in Elasticsearch</a>, and (2) <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-relevance-cohort-aware-ranking-elasticsearch">How to improve e-commerce search relevance with personalized cohort-aware ranking</a>. Both approaches rely on the idea that business signals should guide BM25, not override it. Multiplicative boosting through <code>function_score</code> provides a practical, transparent, and scalable method for achieving that balance in real-world e-commerce search.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch Serverless pricing demystified: VCUs and ECUs explained]]></title>
            <description><![CDATA[Learn how Elasticsearch Serverless pricing works for Elastic’s fully-managed deployment offering. We explain VCUs (Search, Ingest, ML) and ECUs, detailing how consumption is based on actual allocated resources, workload complexity, and Search Power.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-pricing-vcus-ecus</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-pricing-vcus-ecus</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sander Philipse, Pete Galeotti]]></dc:creator>
            <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/46f1e3c09e17cb8aa2a1cca64624bf533e55fe1d-1746x1096.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><em>Navigating Elasticsearch Serverless pricing is simple... you pay for the resources you use. Getting a handle on VCUs, ECUs, and the factors that drive your consumption is key to making informed decisions about your usage. In this blog, we&#x27;ll break down exactly how Elasticsearch Serverless pricing works so you can plan, monitor, and optimize your spend.</em></p><p>When we built Elasticsearch Serverless, we had to decide how to bill our users. While a charge per query may have been easier to reason about from a consumption perspective, it would be a lot harder to reason about from a resource perspective. Instead, we implemented a simple pricing scheme comprising three dimensions for compute: search, ingest, and machine learning VCUs. This means we charge you for the actual resources we allocate to fulfill your requested workloads.</p><h2>VCU, ECU, and other terms</h2><p>Let&#x27;s start by defining a few terms that will keep coming back throughout this post.</p><h3>VCU</h3><p>A VCU is a <a href="https://www.elastic.co/docs/deploy-manage/cloud-organization/billing/elasticsearch-billing-dimensions#elasticsearch-billing-information-about-the-vcu-types-search-ingest-and-ml">Virtual Compute Unit</a>, representing a fraction of RAM, CPU, and local disk for caching. We separate compute by the workloads they support, so we have three flavors of VCU:</p><ol><li>Search VCU</li><li>Ingest VCU</li><li>Machine Learning (ML) VCU</li></ol><p>VCUs are charged by the hour.</p><h3>Regional pricing</h3><p>We have different prices for different regions and different cloud providers. You can find a full list of prices <a href="https://cloud.elastic.co/cloud-pricing-table?productType=serverless">on this page</a>.</p><h3>ECU</h3><p>An ECU is an <a href="https://www.elastic.co/docs/deploy-manage/cloud-organization/billing/ecu">Elastic Consumption Unit</a>, which is the unit we bill you in. The nominal value of an ECU is $1.00 USD. 
All of the different components of consumption are charged at a specific rate of ECUs per time unit. For example, one gigabyte of storage might cost 0.047 ECU per month, so 100 GB of storage will cost you 4.7 ECU = $4.70 for one month. Similarly, if your search workload consumed 10 VCU-hours in a day and the Search VCU rate in your region is 0.09 ECU per VCU-hour, your cost for that day would be 0.9 ECU = $0.90.</p><h3>Interactive Dataset Size</h3><p>The amount of data in your project has a direct influence on your costs. We make the distinction of “interactive dataset” primarily for time-series data, as this relates to the amount of data in the Boost Window. For non-time-series data, this is simply the amount of data in the project.</p><h2>Project settings</h2><p>We have three <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/project-settings">project settings</a> that allow you to control your project&#x27;s usage.</p><h3>Search power</h3><p>Search Power controls the speed of searches against your data. With Search Power, you can improve search performance by adding more resources for querying, or you can reduce provisioned resources to cut costs. Choose from three Search Power settings:</p><p><strong>On-demand</strong>: Autoscales based on data and search load, with a lower minimum baseline for resource use. This flexibility results in more variable query latency and reduced maximum throughput.</p><p><strong>Performant</strong>: Delivers consistently low latency and autoscales to accommodate moderately high query throughput.</p><p><strong>High-availability</strong>: Optimized for high-throughput scenarios, autoscaling to maintain query latency even at very high query volumes.</p><h3>Boost window</h3><p>For time series use cases, the boost window is the number of days of data that constitutes your interactive dataset size. 
The interactive dataset is the portion of your data that we keep cached, and that we use to determine how to scale the Search tier for your project. By default, the boost window is seven days.</p><h3>Data retention</h3><p>You can set the number of days of data that are retained in your project, which will affect the amount of storage we need. You can do this on a per-data stream basis in your project.</p><h2>Price components</h2><p>Serverless Elasticsearch contains a few different pricing components. For most use cases, the components you will care most about are Search, Ingest, and ML VCUs, as well as the Elastic Inference Service&#x27;s token consumption.</p><h3>Search VCUs</h3><p>Search VCU consumption is the most complex part of pricing. We make this simple for you by automatically determining the right amount of VCUs that are needed to fulfill your workloads. For more details on how our autoscaling logic works, see <a href="https://www.elastic.co/search-labs/blog/elasticsearch-serverless-tier-autoscaling">our earlier blog on the topic</a>.</p><h4>Search VCU inputs</h4><p>Search VCUs are allocated based on a few factors, but mainly, we can boil it down to three inputs: the interactive dataset size, the search load on the system, and Search Power.</p><p>For traditional search use cases, the interactive dataset size will generally be your entire dataset. For time series use cases, it will be the portion of your dataset that fits inside the Boost Window.</p><p>Search load measures the amount of load being placed on the system by currently active searches. The main contributing factors are the number of searches per second, the complexity of the searches (the more that needs to be computed, the higher the load), and the size of the dataset that needs to be searched to fulfill the result. 
If we can get you the right number of results by scanning 10% of the dataset, then the load will be much lower than if we need to scan the full dataset.</p><p>Finally, <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/project-settings">Search Power</a> influences the number of VCUs we allocate. Each Search Power setting defines the baseline capacity of the search tier.</p><p>In short: the larger the dataset size and the higher the search load, the more VCUs we need to fulfill your search requests. Search Power allows you to tune to what extent we will scale up and down.</p><h4>Minimum VCUs</h4><p>Elasticsearch Serverless is designed to align infrastructure costs directly with your application&#x27;s demand. </p><p>For smaller workloads, the search infrastructure can scale down to zero VCUs during periods of inactivity. If the system detects fifteen minutes of total inactivity, the associated hardware resources are deprovisioned. This makes the platform highly cost-effective for development environments, bursty workloads, or applications with intermittent usage. Note that inactivity means actual inactivity: no user-initiated searches whatsoever. As soon as we need to serve a search of any kind, we need to allocate hardware resources to execute that search.</p><p>As your interactive dataset grows, the system eventually reaches a storage threshold where a baseline level of resources is required to maintain data availability and indexing readiness. A minimum VCU allocation is maintained to ensure your data remains &quot;warm&quot; and queryable, even if no active searches are occurring.</p><h4>VCU consumption is not linear</h4><p>Because our hardware is allocated in steps, consumption of VCUs does not necessarily scale linearly with workload size. 
Each scaling step can contain a wide range of workloads, and if your workload is at the bottom of that range, it may have a lot of room to grow before we need to jump to the next scaling step.</p><p>This can make estimating based on a non-representative workload hard. For example, you may be consuming 2 VCUs per hour on a small workload. It&#x27;s entirely possible that you could increase your workload size by a factor of 100 and still fit in that 2 VCU per hour load before we need to start increasing the amount of VCUs we allocate to serve your workload.</p><p>We know this makes estimating your cost a little harder, and we are working on ways to make that easier for you. If you need more help estimating your likely price, you can always talk to our customer team and get more personalized assistance.</p><h2>Ingest VCUs</h2><p>Ingest VCUs are much simpler than Search VCUs.</p><h4>Ingest VCU Inputs</h4><p>Ingest VCUs have essentially three inputs: the number of indices, the ingest rate, and the ingest complexity. We need to allocate a little bit of memory for every index in your system, which is why the number of indices matters. Read indices in data streams do not count for this calculation.</p><p>The faster you ingest, the more CPU we will need to process that ingestion. And the more complex your ingest requests, the more CPU we will need. Some factors that make ingest requests more expensive to execute are complicated field mappings or a lot of post-processing.</p><h4>Minimum Ingest VCUs</h4><p>We do not have a minimum number of VCUs we allocate to your ingest. If you do not ingest data, we do not need to allocate any VCUs to processing ingestion. There is an exception for a large number of indices (think: thousands of indices), where we do need to keep some resources allocated to be responsive when indexing requests come in.</p><h4>VCU consumption is not linear</h4><p>As with Search VCUs, we allocate Ingest VCUs based on step functions. 
Each step can contain a wide range of workloads: it&#x27;s entirely possible that if you have a minimal amount of ingest, you could increase your ingest rate by a factor of 100 and still fit in the same step, thus not actually increasing your cost.</p><h2>AI workloads</h2><p>When running machine learning tasks in Serverless, we give you three options:</p><ol><li>You use our Elastic Inference Service (EIS) to run your inference and completion workloads. We take care of everything, and you are charged per token.</li><li>You use traditional Elasticsearch Machine Learning capabilities to run your workloads. These use our Trained Models capabilities. We will scale up and down based on your machine learning workload requirements.</li><li>You do it yourself, outside of our systems, and just bring your vectors or other inference results to store and search in Elasticsearch.</li></ol><h4>EIS</h4><p>The pricing for EIS is <a href="https://cloud.elastic.co/cloud-pricing-table?productType=serverless">quite straightforward</a>: you get charged a rate per one million consumed tokens. Token consumption is generally easy to predict for inference workloads. For LLM-based tasks, particularly agentic ones, this can be more complex, and some experimentation and trial runs may be useful to determine how many tokens your workloads typically consume.</p><h4>ML VCUs</h4><p>Machine Learning VCUs work on one simple input: machine learning workloads. The more inference you require, the more VCUs we will consume. Once you stop performing inference, we will scale down. We will keep a trained model in memory for about 24 hours after you last used it so that we can be responsive, which means that the minimal amount of VCU required to keep that model available will remain up for 24 hours before scaling down entirely.</p><p>We generally recommend our customers use EIS instead of our Machine Learning nodes for inference, particularly if your usage is periodic. 
By switching to EIS, you will not have to wait for machine learning nodes to spin up, and we won&#x27;t charge you for unused ML node time before scaling down. EIS charges on a per token basis.</p><h2>Storage</h2><p>We charge storage per gigabyte per month. Storage does serve as an input into other parts of our system, particularly Search VCUs (see Search VCU above), but the pricing for storage itself is <a href="https://cloud.elastic.co/cloud-pricing-table?productType=serverless">quite straightforward</a>.</p><h2>Data Out (egress)</h2><p>We charge you for the data you take out of the system.</p><p>To minimize your egress costs, we recommend a few optimizations on your queries:</p><ol><li>Do not return vectors in your query responses. We <a href="https://www.elastic.co/search-labs/blog/elasticsearch-exclude-vectors-from-source">do this by default</a> for indices created after October 2025. You can always return vectors in your responses explicitly if necessary.</li><li>Return only the fields needed for your application. You can <a href="https://www.elastic.co/search-labs/blog/displaying-fields-in-an-elasticsearch-index">do this</a> by using the <code>fields</code> and <code>_source</code> parameters.</li></ol><h2>Support</h2><p>We charge <a href="https://www.elastic.co/pricing/serverless-search">support</a> as a percentage of your total ECU usage. We currently have four levels of support:</p><ol><li>Limited support</li><li>Base support</li><li>Enhanced support</li><li>Premium support</li></ol><h2>Project subtype profiles</h2><p>We currently offer two project subtypes for Serverless Elasticsearch, referred to as “General Purpose” and “Vector Optimized”. All Serverless Elasticsearch projects created through the cloud console UI will be created using the “General Purpose” option. 
You may create a “Vector Optimized” project by calling the API directly with the <code>optimized_for</code> parameter (see <a href="https://www.elastic.co/docs/api/doc/elastic-cloud-serverless/operation/operation-createelasticsearchproject">documentation</a> for all options).</p><p>The difference between the two options is the allocation of resources. We allocate approximately four times more resources (aka VCUs) to the “Vector Optimized” profile, which will result in your costs being up to four times higher. This is why we recommend starting on the “General Purpose” profile and only using the “Vector Optimized” profile when your use case demands the use of uncompressed dense vectors with high dimensionality, and quantization and <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a> will not serve your needs.</p><p>When Serverless Elasticsearch was envisioned years ago, we thought that vector workloads would require far more resources to remain performant. However, with innovations like <code>semantic_text</code>, <code>sparse_vector</code> models, and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-9-1-bbq-acorn-vector-search">Better Binary Quantization</a> (BBQ), we’ve found that many vector workloads perform well on the “General Purpose” profile at a fraction of the cost. Therefore, don’t let the “Vector Optimized” label fool you… you can get excellent price <em>and</em> performance for vector workloads on the “General Purpose” profile.</p><h2>Monitoring costs</h2><p>We recognize that keeping track of your costs, especially when you are new to Elasticsearch Serverless, is important to you. 
We built a few tools just for this purpose, and continue to improve them for even greater visibility.</p><h2>Cloud console billing usage</h2><p>The <a href="https://www.elastic.co/docs/deploy-manage/cloud-organization/billing/view-billing-history">Elastic Cloud Console</a> provides billing details for your cloud account, across all cloud-based resources, including Elasticsearch Serverless. There, you can find a breakdown of all the price components described in this article. Filters allow you to zoom in on specific time periods and resources.</p><p>To further monitor your costs, you can also configure custom <a href="https://www.elastic.co/docs/deploy-manage/cloud-organization/billing/manage-billing-notifications">budget alerts </a>from the Budgets and notifications tab under the Billing and subscriptions page.</p><h2>AutoOps monitoring</h2><p>We’re bringing <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/autoops-for-serverless">AutoOps to Serverless</a>! One of the key value propositions of Elasticsearch Serverless is that we ensure everything runs smoothly, but that also means you have limited observability into the infrastructure. AutoOps for Serverless gives users visibility into what is driving usage, and, therefore, costs.</p><p>AutoOps is rolled out in new Serverless regions regularly, and we&#x27;re always working to add new monitoring tools. Make sure to check out the <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/ec-autoops-regions#autoops-for-serverless-full-regions">region coverage</a> and future planned monitoring tools.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a local RAG personal knowledge assistant with LocalAI and Elasticsearch]]></title>
            <description><![CDATA[Learn how to create a private, offline local RAG personal knowledge assistant that can summarize meetings and internal reports using e5-small for embeddings and dolphin3.0-qwen2.5-0.5b for completions in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/local-rag-personal-knowlege-assistant-localai-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/local-rag-personal-knowlege-assistant-localai-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9441c88cb30945142184d9cfadc2a2ea4b9b709b-1600x906.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Building a Retrieval Augmented Generation (RAG) system locally is possible with components that run entirely on a mid-range laptop. Elasticsearch provides vector database infrastructure, while <a href="https://localai.io/">LocalAI</a> makes it simple to run small, efficient language models without requiring a powerful GPU or external services. By combining these tools, we can enable private, fast, and offline access to company or personal data.</p><p>The goal is to build a full RAG system: Embeddings for retrieval and LLM for answers generated locally, while using as few resources as possible without affecting latency and the quality of the answers.</p><h2>Prerequisites</h2><ul><li>Docker</li><li>Python 3.10+</li></ul><h2>Use case: Personal knowledge assistant</h2><p>The goal is to unlock insights from local files through a simple assistant. In this example, we&#x27;ll focus on internal documentation from a CRM migration project, which includes meeting transcripts, progress reports, and planning notes. Everything will run on the same machine; Elasticsearch will handle storage and semantic search, while a local LLM will produce answers and summaries based on the retrieved documents.</p><h2>Why do this locally?</h2><p>Deciding to go local and use this stack of tools in particular presents multiple advantages, such as:</p><ul><li><strong>Privacy:</strong> Since you are using a local LLM, you have full discretion over the information you pass to it. While some cloud-based LLMs offer enterprise tiers that disable tracking or data retention, this is not guaranteed across all providers or plans.</li><li><strong>Flexibility: </strong>Tools like LocalAI offer a <a href="https://localai.io/gallery.html">wide range of models</a> and make it easy to replace them as needed, whether for evaluating new models, running tests, handling security-related updates, or switching between models to support different types of tasks. 
On the other hand, using Elasticsearch as the vector database offers integration with many <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-model-ref#ml-nlp-model-ref-text-embedding">third-party embedding models</a>.</li><li><strong>Cost: </strong>With this approach, you don’t need to pay for any cloud-based service for embeddings or LLM usage, which makes it more affordable.</li><li><strong>Independence from the internet: </strong>Another advantage of a local solution is that you can work completely offline, which also makes it suitable for isolated or air-gapped environments where network access is intentionally limited due to strict security or compliance requirements.</li><li><strong>Speed:</strong> Depending on the chosen model and your hardware, this can potentially be faster than a web service.</li></ul><h2>Setting up the core: Minimal Elasticsearch instance</h2><p>To install Elasticsearch locally, we will use <a href="https://github.com/elastic/start-local">start-local</a>, which allows you to install Elasticsearch with just one command using Docker under the hood.</p><p>Since we will not be using Kibana, we will <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-only-elasticsearch">install only Elasticsearch</a> with the <code>--esonly</code> flag:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh -s -- --esonly</code></pre><p>If everything goes well, you’ll see a message like this:</p><pre><code>🎉 Congrats, Elasticsearch is installed and running in Docker!
🔌 Elasticsearch API endpoint: http://localhost:9200
🔑 API key: MDk0NVRwa0IxRGRjVmVKSGl2ZFc6dUFaNkZTUHVXRzEzVjdrejQzSUNxZw==</code></pre><p><em><strong>NOTE:</strong> If you forgot your credentials, you can find them at <code>…/elastic-start-local/.env</code></em></p><p>You can check if the Elasticsearch instance is running using the command <code>docker ps</code></p><pre><code>docker ps</code></pre><p>Response:</p><pre><code>CONTAINER ID   IMAGE                                                       COMMAND                  CREATED          STATUS                    PORTS                                NAMES
061fbfbb67bb   docker.elastic.co/elasticsearch/elasticsearch:9.1.3-arm64   "/bin/tini -- /usr/l…"   11 minutes ago   Up 11 minutes (healthy)   127.0.0.1:9200->9200/tcp, 9300/tcp   es-local-dev</code></pre><p>We can now send requests to this Elasticsearch instance, for example:</p><pre><code>curl -X GET "http://localhost:9200/" \
  -H "Authorization: ApiKey MDk0NVRwa0IxRGRjVmVKSGl2ZFc6dUFaNkZTUHVXRzEzVjdrejQzSUNxZw=="</code></pre><p>Response:</p><pre><code>{
  "name" : "061fbfbb67bb",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "IC_VY7cyQw6F_wJbH5Ik3A",
  "version" : {
    "number" : "9.1.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "0c781091a2f57de895a73a1391ff8426c0153c8d",
    "build_date" : "2025-08-24T22:05:04.526302670Z",
    "build_snapshot" : false,
    "lucene_version" : "10.2.2",
    "minimum_wire_compatibility_version" : "8.19.0",
    "minimum_index_compatibility_version" : "8.0.0"
  },
  "tagline" : "You Know, for Search"
}</code></pre><p>This local instance will store our CRM migration notes and reports so they can later be searched semantically.</p><h2>Adding AI: Choosing the right local models</h2><p>So, now we will choose two models to make it work:</p><ul><li><strong>Embeddings model:</strong> For embeddings, we will use the multilingual model <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-e5">multilingual-e5-small</a>. It is available pre-configured in Elasticsearch but needs to be deployed before use.</li><li><strong>Completion model:</strong> For chatting, generating responses, and interacting with the data, we need to choose a model with the best size-to-performance ratio. For that, I prepared the following table comparing some small-sized models:</li></ul>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Parameters</td><td style="border: 1px solid #ccc; padding: 4px;">Size in memory (Approx)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">llama-smoltalk-3.2-1b-instruct</td><td style="border: 1px solid #ccc; padding: 4px;">1B</td><td style="border: 1px solid #ccc; padding: 4px;">500 MB</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">dolphin3.0-qwen2.5-0.5b</td><td style="border: 1px solid #ccc; padding: 4px;">0.5B</td><td style="border: 1px solid #ccc; padding: 4px;">200 MB</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">fastllama-3.2-1b-instruct</td><td style="border: 1px solid #ccc; padding: 4px;">1B</td><td style="border: 1px solid #ccc; padding: 4px;">550 MB</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">smollm2-1.7b-instruct</td><td style="border: 1px solid #ccc; padding: 4px;">1.7B</td><td style="border: 1px solid #ccc; padding: 4px;">1.0 GB</td>
                    </tr>
              </tbody>
            </table>
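The table above can be turned into a quick shortlisting step. The following sketch is illustrative only: the sizes are the approximate in-memory figures quoted in the table, not measured values, and the memory budget is an assumption you should replace with your own.

```python
# Approximate in-memory sizes (MB) from the comparison table above.
MODELS_MB = {
    "llama-smoltalk-3.2-1b-instruct": 500,
    "dolphin3.0-qwen2.5-0.5b": 200,
    "fastllama-3.2-1b-instruct": 550,
    "smollm2-1.7b-instruct": 1000,
}


def models_within_budget(budget_mb):
    """Return the models that fit the memory budget, smallest first."""
    fitting = [(mb, name) for name, mb in MODELS_MB.items() if mb <= budget_mb]
    return [name for _, name in sorted(fitting)]


# With ~600 MB to spare for the LLM, dolphin3.0-qwen2.5-0.5b is the smallest fit.
print(models_within_budget(600))
```

Run it against the free memory you can actually dedicate to the LLM to see which models are realistic candidates on your machine.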
          <p>The final decision depends on your needs and your machine, and for this example we will use the <a href="https://huggingface.co/dphn/Dolphin3.0-Qwen2.5-0.5B"><strong>dolphin3.0-qwen2.5-0.5b</strong></a> model because it is a small model with powerful capabilities in a RAG system, and it provides the best size-to-parameters ratio in the table. The other options also work well for this use case, but given its size, <strong>dolphin3.0-qwen2.5-0.5b</strong> is our choice.</p><p>The balance of CPU and memory usage is important because our assistant needs to summarize meetings and reports in a reasonable time using mid-range laptop resources.</p><p>To download the <strong>dolphin3.0-qwen2.5-0.5b</strong>, we will use <a href="https://localai.io/">LocalAI</a>, which is an easy-to-use solution to run models locally. You can <a href="https://localai.io/docs/getting-started/models/">install LocalAI on your machine</a>, but we will use Docker to isolate the LocalAI service and models. Follow <a href="https://localai.io/basics/container/">these instructions</a> to install the official LocalAI Docker image.</p><h3>LocalAI REST API</h3><p>One of the main features of LocalAI is its ability to serve models through HTTP requests in an <a href="https://platform.openai.com/docs/api-reference/chat/create">OpenAI API-compatible format</a>. This feature will be useful in later steps.</p><p>The LocalAI service will be accessible at port 8080, which is where we will send the HTTP requests. Let’s send a request to download the <strong>dolphin3.0-qwen2.5-0.5b</strong> model:</p><pre><code>curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "dolphin3.0-qwen2.5-0.5b"}'

# Response: 
{"uuid":"d5212e97-bf1d-11f0-ba2a-22b2311545e6","status":"http://localhost:8080/models/jobs/d5212e97-bf1d-11f0-ba2a-22b2311545e6"}</code></pre><p>We can check the download status using the ID generated in the previous step:</p><pre><code>curl -s http://localhost:8080/models/jobs/d5212e97-bf1d-11f0-ba2a-22b2311545e6

# Response: 

{
  "deletion": false,
  "file_name": "/models/dolphin-3.0-qwen2.5-0.5b-instruct.Q4_K_M.gguf.partial",
  "error": null,
  "processed": false,
  "message": "processing",
  "progress": 9.860515383462856,
  "file_size": "391.2 MiB",
  "downloaded_size": "37.2 MiB",
  "gallery_element_name": ""
}</code></pre><p>The <code>progress</code> field represents the percentage of the current download; we need to wait for it to complete. Once it’s completed, we can run a quick test to make sure everything is working:</p><pre><code>curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dolphin3.0-qwen2.5-0.5b",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ],
    "stream": false
  }'</code></pre><p>See the LLM result <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/local-rag-with-lightweight-elasticsearch/app-logs/why-is-the-sky-blue.md">here</a>.</p><h2>Showing the workflow: Project data to answers</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9441c88cb30945142184d9cfadc2a2ea4b9b709b-1600x906.png" style="width: 100%;" /><p>Now that we have an embeddings model and a general-purpose LLM model, it’s time to combine them with Elasticsearch and build a tool that can help us explore our data. For this walkthrough, we prepared a folder with relevant data for our mock CRM-migration project, including reports and meeting transcripts. In a real application, this ingestion step would typically be automated through a deployment pipeline or a background process, but here we will trigger it manually for simplicity.</p><h3>Data overview</h3><p>The full dataset is accessible in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/local-rag-with-lightweight-elasticsearch/Dataset">this GitHub repository</a>.</p><pre><code>/CRM migration data
|-- meeting_QA-team_wednesday.txt
|-- meeting_development-team_monday.txt
|-- meeting_management-sync_friday.txt
|-- report_QA-team.txt
|-- report_development-team.txt</code></pre><p>To see what the data looks like, let&#x27;s examine a couple of examples:</p><p><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/local-rag-with-lightweight-elasticsearch/Dataset/meeting_development-team_monday.txt">meeting_development-team_monday.txt</a>:</p><pre><code>MEETING TRANSCRIPT - DEVELOPMENT TEAM

Date: Monday, September 16, 2025

Time: 09:00 AM - 10:15 AM

Participants: Alice (Tech Lead), John (Senior Developer), Sarah (Backend Developer), Mike (DevOps Engineer)

[09:02] Alice:Let's review the search API deployed last week. Any issues?

[09:03] Sarah:API works but performance degrades with 1,000+ queries per minute. Response times jump from 200ms to 3 seconds.

...</code></pre><p><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/local-rag-with-lightweight-elasticsearch/Dataset/report_development-team.txt">report_development-team.txt</a>:</p><pre><code>WEEKLY REPORT - DEVELOPMENT TEAM
Week of September 16-20, 2025
Prepared by: Alice Thompson, Tech Lead

=== EXECUTIVE SUMMARY ===
Development team completed critical infrastructure components but identified performance bottlenecks requiring attention before production deployment.

=== KEY ACCOMPLISHMENTS ===
- Database schema and indexes completed for CRM
...</code></pre><h3>Elasticsearch setup</h3><p>Now we need a data structure and an inference endpoint in Elasticsearch to store and embed the data.</p><p>First, let’s create an inference endpoint using the <code>.multilingual-e5-small</code> model:</p><pre><code>def setup_inference_endpoint():
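    # Creates a text_embedding inference endpoint backed by the built-in
    # .multilingual-e5-small model; Elasticsearch downloads and deploys the
    # model automatically the first time the endpoint is used.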
    inference_id = "e5-small-model"
    try:
        es_client.inference.put(
            inference_id=inference_id,
            task_type="text_embedding",
            body={
                "service": "elasticsearch",
                "service_settings": {
                    "num_allocations": 1,
                    "num_threads": 1,
                    "model_id": ".multilingual-e5-small",
                },
            },
        )

        print(f"✅ Inference endpoint '{inference_id}' created successfully")
    except Exception as e:
        print(f"❌ Error creating inference endpoint: {str(e)}")</code></pre><p>The response should be this:</p><pre><code>{"inference_id":"e5-small-model","task_type":"text_embedding","service":"elasticsearch","service_settings":{"num_allocations":1,"num_threads":1,"model_id":".multilingual-e5-small"},"chunking_settings":{"strategy":"sentence","max_chunk_size":250,"sentence_overlap":1}}</code></pre><p>This will automatically download the model and create the inference endpoint for our embeddings during ingestion and query time. If you need to install the embeddings model in an air-gapped environment, you can follow <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-e5#air-gapped-install-e5">these instructions</a>.</p><p>Now, let’s create the mappings for the data. We will create 3 fields: <code>file_title</code> to store the file name, <code>file_content</code> to store the file content of each document, and <code>semantic</code> to store the embeddings and plain text content of both fields (<code>file_title</code> and <code>file_content</code>):</p><pre><code>def setup_index():
    try:
        if es_client.indices.exists(index=INDEX_NAME):
            print(f"✅ Index '{INDEX_NAME}' already exists")
            return False

        print(f"Creating index '{INDEX_NAME}'...")
        es_client.indices.create(
            index=INDEX_NAME,
            body={
                "mappings": {
                    "properties": {
                        "file_title": {"type": "text", "copy_to": "semantic_field"},
                        "file_content": {"type": "text", "copy_to": "semantic_field"},
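                        # copy_to funnels both plain-text fields into
                        # semantic_field, so a single semantic_text field (and
                        # one set of embeddings) covers title and content.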
                        "semantic_field": {
                            "type": "semantic_text",
                            "inference_id": "e5-small-model",
                        },
                    }
                }
            },
        )
        print(f"Index '{INDEX_NAME}' created successfully")
        return True
    except Exception as e:
        print(f"Error creating index: {str(e)}")
        exit(1)</code></pre><p>Elasticsearch response:</p><pre><code>{"acknowledged":true,"shards_acknowledged":true,"index":"team-data"}</code></pre><p>With this setup, each file from the CRM migration project gets indexed and becomes searchable.</p><h3>Python script</h3><p>To centralize Elasticsearch, data, and LLMs, we will create a simple Python script to ingest the data, make search requests to Elasticsearch, and send prompts to the LLM. This approach allows us to customize the workflow, change prompts and models, and automate processes.</p><p>Let’s create a <a href="https://docs.python.org/3/library/venv.html">venv</a> environment to handle the dependencies required to execute the script:</p><pre><code>python -m venv venv && source venv/bin/activate</code></pre><p>Now we need to install the <code>elasticsearch</code> client to interact with our locally running Elasticsearch instance, <code>requests</code> to handle HTTP requests, and <code>openai</code> to talk to LocalAI’s OpenAI-compatible API:</p><pre><code>pip install elasticsearch requests openai</code></pre><p>After installation, create a Python file named <code>script.py</code> and let&#x27;s start scripting:</p><pre><code>import os
import time

from elasticsearch import Elasticsearch, helpers
from openai import OpenAI

ES_URL = "http://localhost:9200"
ES_API_KEY = "NDdDQWM1b0JPSDBFTV9JQzA0WVo6eHFXcWFJQmFYNzBwS1RjUllpRUNHZw=="
INDEX_NAME = "team-data"
LOCAL_AI_URL = "http://localhost:8080/v1"
DATASET_FOLDER = "./Dataset"


es_client = Elasticsearch(ES_URL, api_key=ES_API_KEY)
ai_client = OpenAI(base_url=LOCAL_AI_URL, api_key="sk-x") # The API key needs to have a value to work </code></pre><p>In the code above, we import the necessary packages, set up some relevant variables, and instantiate the Elasticsearch Python client and the OpenAI client to handle AI requests. There’s no need for a real OpenAI API key to make it work; you can use any value there.</p><p>Using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk">bulk API</a>, we create two functions to ingest the data directly from the folder into Elasticsearch: <code>index_documents</code> and <code>load_documents</code>. To execute semantic queries, we&#x27;ll use the <code>semantic_search</code> function:</p><pre><code>def load_documents(dataset_folder, index_name):
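    # Generator yielding one bulk-API action per .txt file in the dataset folder.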
    for filename in os.listdir(dataset_folder):
        if filename.endswith(".txt"):
            filepath = os.path.join(dataset_folder, filename)

            with open(filepath, "r", encoding="utf-8") as file: # UTF-8 encoding ensures proper handling of special characters and international text
                content = file.read()

            yield {
                "_index": index_name,
                "_source": {"file_title": filename, "file_content": content},
            }


def index_documents():
    try:
        if es_client.indices.exists(index=INDEX_NAME) is False:
            print(f"❌ Error: Index '{INDEX_NAME}' does not exist. ")
            exit(1)

        success, _ = helpers.bulk(es_client, load_documents(DATASET_FOLDER, INDEX_NAME))

        print(f"✅ Indexed {success} documents successfully")
        return success
    except Exception as e:
        print(f"❌ Error indexing documents: {str(e)}")
        exit(1)


def semantic_search(query, size=3):
    start_time = time.time()
    search_body = {
        "query": {"semantic": {"field": "semantic_field", "query": query}},
        "size": size,
    }

    response = es_client.search(index=INDEX_NAME, body=search_body)
    search_latency = (time.time() - start_time) * 1000  # ms

    return response["hits"]["hits"], search_latency</code></pre><p>The <code>query_local_ai</code> function handles the request to LocalAI models.</p><pre><code>def query_local_ai(prompt, model):
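    # Single-turn chat completion against LocalAI via the OpenAI client;
    # returns (response_text, latency_ms, tokens_per_second).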
    start_time = time.time()

    try:
        response = ai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )

        ai_latency = (time.time() - start_time) * 1000  # ms

        # Extract response text
        response_text = response.choices[0].message.content

        # Calculate tokens per second if usage info is available
        tokens_per_second = 0
        if hasattr(response, "usage") and response.usage:
            total_tokens = response.usage.completion_tokens
            if ai_latency > 0:
                tokens_per_second = (total_tokens / ai_latency) * 1000  # tokens/second

        return response_text, ai_latency, tokens_per_second
    except Exception as e:
        ai_latency = (time.time() - start_time) * 1000
        
        return f"Error: {str(e)}", ai_latency, 0</code></pre><p>We will pass the Elasticsearch-retrieved data with a prompt to the <code>query_local_ai</code> function:</p><pre><code>if __name__ == "__main__":
    print("🚀 Setting up infrastructure...")

    # Setup inference endpoint and index
    setup_inference_endpoint()
    is_created = setup_index()

    if is_created:  # Index was just created, need to index documents
        print("\n📥 Indexing documents...")
        success = index_documents()

        if success == 0:  # if indexing failed, exit
            print("❌ Documents indexing failed. Exiting.")
            exit(1)

        time.sleep(1)  # Wait for indexing to complete
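        # A more deterministic alternative (assuming default refresh settings):
        # explicitly refresh the index so new documents are searchable at once.
        # es_client.indices.refresh(index=INDEX_NAME)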

    query = "Can you summarize the performance issues in the API?"

    print(f"🔍 Search: '{query}'")
    search_results, search_latency = semantic_search(query)

    context = ""
    citations = []
    for idx, hit in enumerate(search_results, 1):
        source = hit["_source"]
        context += f"[{idx}] File: {source['file_title']}\n"
        context += f"Content: {source['file_content']}\n\n"
        citations.append(f"[{idx}] {source['file_title']}")

    prompt = f"""Based on the following documents, answer the user's question. 
        You MUST cite your sources using the format [1], [2], etc. when referencing information from the documents.

        Documents:
        {context}

        User Question: {query}
    """

    ai_model = "dolphin3.0-qwen2.5-0.5b"

    print(f"🤖 Asking to model: {ai_model}")
    response, ai_latency, tokens_per_second = query_local_ai(prompt, ai_model)

    print(f"\n💡 Question: {query}\n📝 Answer: {response}")
    for citation in citations:
        print(f"  {citation}")

    print(f"\n🔍 Search Latency: {search_latency:.0f}ms")
    print(f"🤖 AI Latency: {ai_latency:.0f}ms | {tokens_per_second:.1f} tokens/s")</code></pre><p>Finally, we can see the complete script workflow: first, we ingest the documents using <code>index_documents</code>; then we retrieve Elasticsearch data using <code>semantic_search</code>, and with those results, we send a request to the <strong>dolphin3.0-qwen2.5-0.5b</strong> model to generate the LLM response with our requirements (including citation generation) by calling the <code>query_local_ai</code> function. The latency and tokens per second will be measured and printed at the end of the script. In this workflow, the query “<strong><em>Can you summarize the performance issues in the API?</em></strong>” serves as the user’s natural language request that guides both the search and the final LLM response.</p><p>Run the following command to execute the script:</p><pre><code>python script.py</code></pre><p>Response:</p><pre><code>📝 Answer: 
The performance issues in the API can be summarized as follows:

1. **Search API degrades with 1,000+ queries per minute**: The search API has degraded, with performance degrades to a time of 3 seconds compared to the target of 200ms.

2. **Complex queries are slow and no caching layer**: Complex queries take longer to execute as expected.

3. **CPU limits during spikes**: The CPU usage is at 100% at the peak hour and limits to 70% at the peak hour.

📚 Citations:
  [1] report_development-team.txt
  [2] meeting_development-team_monday.txt
  [3] meeting_management-sync_friday.txt</code></pre><p>See the complete answer <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/local-rag-with-lightweight-elasticsearch/app-logs/results.md">here</a>.</p><p>The model’s answer is satisfactory: it highlights the key performance issues in a concise way and correctly points out underlying causes, offering a solid basis for further diagnosis.</p><h3>Latency</h3><p>As shown in the application results above, we have the following latency:</p><pre><code>📥 Indexing documents...
🔍 Search: 'Can you summarize the  performance issues in the API?'
🤖 Asking to model...

...

🔍 Search Latency: 14ms
🤖 AI Latency: 16044ms | 9.5 tokens/s</code></pre><p>Adding up both measurements, we can see that the entire flow takes about 16 seconds to get a response, producing <strong><em>9.5 tokens per second.</em></strong></p><h2>Hardware usage</h2><p>The last step is to analyze the resource consumption of the entire environment. We describe it based on the Docker environment configuration shown in the following screenshot:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0816451544f69a8cd9eaf22884eb9d2802e75e10-1600x1196.png" style="width: 100%;" /><p>With 8GB of RAM, we have enough memory to run both the LocalAI container and the Elasticsearch container. This configuration is representative of a mid-range laptop setup, which helps us better approximate realistic inference performance.</p><h3>Resource consumption</h3><p>Using the Docker <a href="https://hub.docker.com/extensions/artifision/live-charts-docker-extension">Live Charts</a> extension, we can see the resource consumption of both containers working together while generating responses:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1b1070e9b733024f27b9375d2b0469e64913b781-1600x727.png" style="width: 100%;" /><p>Consumption per container is as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/84ffc25a741d92a3ff35a4f2fa1686986e6ec4e1-1600x1124.png" style="width: 100%;" /><p>When it starts running, Elasticsearch uses about 0.5 cores for indexing data. On the LocalAI side, <strong>dolphin3.0-qwen2.5-0.5b</strong> consumes 100% of the 6 available cores when producing the answer. 
When analyzing memory consumption, the stack uses approximately 2.1 GB in total: 1.9 GB for Elasticsearch and 200 MB for LocalAI (client and model).</p><h3>Alternative model with higher resource requirements: smollm2-1.7b-instruct</h3><p>To see the flexibility of this approach, let&#x27;s change the model by just switching the variable <code>ai_model</code> to <code>ai_model = &quot;smollm2-1.7b-instruct&quot;</code> in the code. This model requires significantly more memory due to its larger parameter count, which lowers the tokens-per-second rate and increases the overall latency when generating a response.</p><pre><code>🤖 Asking to model: smollm2-1.7b-instruct

💡 Question: Can you summarize the performance issues in the API?
📝 Answer: The development team identified two key technical challenges for the API:

1.  The search API degrades at 1,000+ queries per minute, causing average execution times to jump from 200ms to 3 seconds.
2.  The root cause is complex database queries without a caching layer, leading to poor query performance.

📚 Citations:
  [1] report_development-team.txt
  [2] meeting_development-team_monday.txt
  [3] meeting_management-sync_friday.txt

🔍 Search Latency: 16ms
🤖 AI Latency: 47561ms | 4.8 tokens/s</code></pre><p>As expected, being a heavier model, <strong>smollm2-1.7b-instruct</strong> produces fewer tokens per second (4.8) for the same question and takes significantly more time (around 30 seconds longer).</p><p>The response looks good and detailed. It’s similar to the one generated by the <strong>dolphin3.0-qwen2.5-0.5b</strong> model but takes longer to generate and consumes more resources, as this model uses approximately 1 GB of memory.</p><h3>Alternative balance model: llama-smoltalk-3.2-1b-instruct</h3><p>Now let’s try again by changing <code>ai_model</code> to <strong>llama-smoltalk-3.2-1b-instruct</strong>.</p><pre><code>🤖 Asking to model: llama-smoltalk-3.2-1b-instruct
💡 Question: Can you summarize the performance issues in the API?

📝 Answer: Based on the documents, it appears that the main performance issue with the API is related to the search query optimization. The API degrades to around 1,000+ queries per minute (QP/min) when there are 12 of 18 API endpoints integrated with authentication. This issue is caused by complex queries without a caching layer, leading to performance degrades and slow response times.

However, there is also a smaller issue with the "Search" API, where it degrades to around 3+ seconds after 1.2 seconds execution time. This is likely due to multi-filter searches and the need for a caching layer to improve performance.

To address these issues, the team is working on implementing a caching layer (Sarah) and optimizing bool queries and adding calculated index fields (John) to improve query efficiency. They are also working on setting up auto-scaling for the database (Mike) to ensure that it can handle increased traffic.

A meeting was held to discuss these issues and a plan for improvement was agreed upon. The team will work together to implement a caching layer and optimize the queries, and the team will work with product team to ensure that the migration is completed on time and does not impact the October migration date.

📚 Citations:
  [1] report_development-team.txt
  [2] meeting_development-team_monday.txt
  [3] meeting_management-sync_friday.txt

🔍 Search Latency: 12ms
🤖 AI Latency: 21019ms | 5.8 tokens/s</code></pre><p>Analyzing the results, <strong>llama-smoltalk-3.2-1b-instruct</strong> delivers responses similar to the other models, varying slightly in format and length. However, this comes at a higher cost compared to the lighter model (about 5 seconds slower and roughly 4 fewer tokens per second). It also consumes more memory than the <strong>dolphin3.0-qwen2.5-0.5b</strong> model (around 500 MB more in total). This makes it reliable for accurate summarization tasks but less efficient for fast or interactive scenarios.</p><h3>Comparison table</h3><p>To get a better view of each model’s consumption, let’s compare the results in a table:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Memory Usage</td><td style="border: 1px solid #ccc; padding: 4px;">Latency</td><td style="border: 1px solid #ccc; padding: 4px;">Tokens/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">dolphin3.0-qwen2.5-0.5b</td><td style="border: 1px solid #ccc; padding: 4px;">~200 MB</td><td style="border: 1px solid #ccc; padding: 4px;">16,044 ms</td><td style="border: 1px solid #ccc; padding: 4px;">9.5 tokens/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">smollm2-1.7b-instruct</td><td style="border: 1px solid #ccc; padding: 4px;">~1 GB</td><td style="border: 1px solid #ccc; padding: 4px;">47,561 ms</td><td style="border: 1px solid #ccc; padding: 4px;">4.8 tokens/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">llama-smoltalk-3.2-1b-instruct</td><td style="border: 1px solid #ccc; padding: 4px;">~700 MB</td><td style="border: 1px solid #ccc; padding: 4px;">21,019 ms</td><td style="border: 1px solid #ccc; padding: 4px;">5.8 tokens/s</td>
                    </tr>
              </tbody>
            </table>
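As a rough cross-check of the figures above, tokens per second is simply the completion length divided by the AI latency. Here is a minimal Python sketch; the token counts are back-calculated approximations for illustration only (not values reported by the demo — in practice they would come from the API response's usage data):

```python
def tokens_per_second(completion_tokens: int, latency_ms: int) -> float:
    """Throughput as reported in the demo output: tokens / seconds."""
    return completion_tokens / (latency_ms / 1000)

# Approximate completion sizes implied by the measurements above.
runs = {
    "dolphin3.0-qwen2.5-0.5b": (152, 16_044),         # ~9.5 tokens/s
    "smollm2-1.7b-instruct": (228, 47_561),           # ~4.8 tokens/s
    "llama-smoltalk-3.2-1b-instruct": (122, 21_019),  # ~5.8 tokens/s
}
for model, (tokens, latency_ms) in runs.items():
    print(f"{model}: {tokens_per_second(tokens, latency_ms):.1f} tokens/s")
```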
          <h2>Conclusion</h2><p>Combining e5-small for embeddings and <strong>dolphin3.0-qwen2.5-0.5b</strong> for completions, we could set up an efficient and fully functional RAG application on a mid-range laptop, with all data kept private. As we saw in the latency section from the first test we ran using the dolphin model, the part of the flow that takes the longest is the LLM inference step (16 s), while Elasticsearch vector retrieval was fast (81 ms).</p><p><strong>dolphin3.0-qwen2.5-0.5b</strong> was the best candidate as an LLM to generate answers. Other models like <strong>llama-smoltalk-3.2-1b-instruct</strong> are indeed fast and reliable, but they tend to be heavier. They require more resources, producing fewer tokens per second in exchange for slightly better quality in the responses.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Boosting e-commerce search by profit and popularity with the function score query in Elasticsearch]]></title>
            <description><![CDATA[Discover how to optimize e-commerce search by blending BM25 relevance with profit margin and popularity signals in Elasticsearch using the function_score query.]]></description>
            <link>https://www.elastic.co/search-labs/blog/function-score-query-boosting-profit-popularity-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/function-score-query-boosting-profit-popularity-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt]]></dc:creator>
            <pubDate>Wed, 17 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eb3d344298e81f4da59005ea41dd779edd424412-1000x628.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2>Overview</h2><p>In this article, you will learn how to combine BM25 relevance with real business metrics like profit margin and popularity using Elasticsearch’s function_score query. This step-by-step guide shows how to control scaling with logarithmic boosts and allows full explainability for each ranking calculation.</p><h2>Introduction</h2><p>In many use cases, search results focus on lexical (keyword) and semantic (meaning-based) analysis to find the content that most accurately and authoritatively answers a user’s query. However, e-commerce search is a bit more complex.</p><p>Results must reflect the shopper’s intent and incorporate business objectives such as profit margin, product popularity, or other factors that don’t always directly align with purely lexical or semantic matching.</p><p>While text relevance ensures customer satisfaction, ranking by profitability and popularity turns search into a business optimization engine.</p><p>In order to demonstrate how business signals can be incorporated into search results, in this post we’ll explore:</p><ol><li>How to boost product rankings by margin (profitability 0% to 200% in the demo data below).</li><li>How to extend that same logic to include popularity (number of sales).</li></ol><p>Once you understand how to boost by margin and popularity, extending search to incorporate other signals is straightforward.</p><h2>Setup</h2><p>Below is a small dataset you can paste directly into Dev Tools to follow along.</p><pre><code>POST _bulk
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "MCC-HOME-500", "description": "McCain Home Chips 500g - High Margin", "margin": 200, "popularity": 100 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "MCC-HOME-1000", "description": "McCain Home Chips 1kg", "margin": 100, "popularity": 640 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "MCC-HOME-1500", "description": "McCain Home Chips 1.5kg", "margin": 50, "popularity": 10000 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "BIR-CHIPS-450", "description": "BirdsEye Crispy Chips 450g", "margin": 9, "popularity": 880 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "BIR-CHIPS-900", "description": "BirdsEye Crispy Chips 900g", "margin": 12, "popularity": 720 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "TRE-MINT-33", "description": "Trebor Peppermint 33g", "margin": 5, "popularity": 1100 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "TRE-MINT-4X38", "description": "Trebor Peppermint 4x38g", "margin": 8, "popularity": 680 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "TIC-MINT-16", "description": "TicTac Mint 16g", "margin": 3.5, "popularity": 980 }
{ "index": { "_index": "blog_food_products" } }
{ "product_id": "TIC-MINT-6X16", "description": "TicTac Mint 6x16g", "margin": 7, "popularity": 640 }</code></pre><p>Each document represents a product with:</p><ul><li>margin: profit margin (percent)</li><li>popularity: relative sales volume (e.g. weekly average, or last week’s sum)</li></ul><h2>Ranking without margin</h2><p>We can see how baseline results look by executing a simple query for “McCain chips” that does not take into consideration margin, as follows:</p><pre><code>POST blog_food_products/_search
{
  "size": 5,
  "_source": ["description", "margin"],
  "query": {
    "match": {
      "description": "McCain Chips"
    }
  }
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/050f44956c112ea9916d34946e9296480354f7d3-2322x1324.png" style="width: 100%;" /><p>Which returns the following results:</p><pre><code>{
   "_index": "blog_food_products",
   "_id": "GKO3MJoBBtzDfCS5JfQM",
   "_score": 1.6089411,
   "_source": {
     "description": "McCain Home Chips 1kg",
     "margin": 100
   }
 },
 {
   "_index": "blog_food_products",
   "_id": "GaO3MJoBBtzDfCS5JfQM",
   "_score": 1.6089411,
   "_source": {
     "description": "McCain Home Chips 1.5kg",
     "margin": 50
   }
 },
 {
   "_index": "blog_food_products",
   "_id": "F6O3MJoBBtzDfCS5JfQM",
   "_score": 1.3280699,
   "_source": {
     "description": "McCain Home Chips 500g - High Margin",
     "margin": 200
   }
 },
 {
   "_index": "blog_food_products",
   "_id": "GqO3MJoBBtzDfCS5JfQM",
   "_score": 0.5837885,
   "_source": {
     "description": "BirdsEye Crispy Chips 450g",
     "margin": 9
   }
 },
 {
   "_index": "blog_food_products",
   "_id": "G6O3MJoBBtzDfCS5JfQM",
   "_score": 0.5837885,
   "_source": {
     "description": "BirdsEye Crispy Chips 900g",
     "margin": 12
   }
}</code></pre><p>As you can see from the above results, the high-margin version of the chips is 3rd because the ordering does not consider margin.</p><h2>Ranking by margin</h2><p>Without any additional context, all sizes of “McCain chips” look equally relevant — but from a business perspective, it is possible that the higher-margin items should rank higher.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Product</td><td style="border: 1px solid #ccc; padding: 4px;">Margin (%)</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">McCain Home Chips 500g – High Margin</td><td style="border: 1px solid #ccc; padding: 4px;">200%</td><td style="border: 1px solid #ccc; padding: 4px;">small pack</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">McCain Home Chips 1kg</td><td style="border: 1px solid #ccc; padding: 4px;">100%</td><td style="border: 1px solid #ccc; padding: 4px;">mid pack</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">McCain Home Chips 1.5kg</td><td style="border: 1px solid #ccc; padding: 4px;">50%</td><td style="border: 1px solid #ccc; padding: 4px;">family pack</td>
                    </tr>
              </tbody>
            </table>
          <p>We’ll use Elasticsearch’s<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query"> function_score</a> query to apply a margin-based boost.</p><pre><code>POST blog_food_products/_search
{
 "size": 5,
 "explain": false,                       // set to true only when tuning
 "_source": ["description", "margin"],
 "query": {
   "function_score": {
 
 
     /* ───────────────────────────────────────────────
      * Base query
      * Replace with your actual BM25 or semantic query.
      * ─────────────────────────────────────────────── */
     "query": {
       "match": {
          "description": "McCain Chips"
        }
     },
 
 
     /* ───────────────────────────────────────────────
      * Margin-driven boost
      * ------------------------------------------------
      * Elasticsearch computes (for the ln1p modifier):
      *
      *   log_margin  = ln(1 + margin * factor)  
      *   boost       = 1 + log_margin            // +1 baseline via explicit { "weight": 1 }
      *   final_score = BM25 * boost
      *
      * Picking `factor` to cap around 2× at max_margin ≈ 200:
      *
      *   1 + ln(1 + factor * 200)  ≈  2
      *   ln(1 + 200*factor)        =  1
      *   1 + 200*factor            =  e
      *   factor                    =  (e - 1) / 200  ≈ 0.00859
      *
      * You can keep a little headroom, e.g. use 0.0085.
      * ─────────────────────────────────────────────── */
     "functions": [
       {
         "filter": { "range": { "margin": { "gt": 0 } } },
         "field_value_factor": {
           "field"   : "margin",
           "modifier": "ln1p",     // natural log of (1 + margin * factor)
           "factor"  : 0.0085,     // ≈ (e - 1) / 200
           "missing" : 0
         }
       },
       { "weight": 1 }              // explicit neutral baseline (keeps zero/small margins neutral)
     ],
 
     "score_mode": "sum",           // boost = 1 + ln(1 + margin*factor)  (sum of the two functions)
     "boost_mode": "multiply"       // final_score = BM25 × boost
     // "max_boost": 2.0                // optional: clamp hard ceiling
   }
 }
}</code></pre><p>The above query results in the following, which reflect the impact of the margin boosting on the score. Notice that, as we intended, the high-margin McCain Chips have been boosted to the 1st position in the results.</p><pre><code>{
  "_index": "blog_food_products",
  "_id": "F6O3MJoBBtzDfCS5JfQM",
  "_score": 2.6471777,
  "_source": {
    "description": "McCain Home Chips 500g - High Margin",
    "margin": 200
  }
},
{
  "_index": "blog_food_products",
  "_id": "GKO3MJoBBtzDfCS5JfQM",
  "_score": 2.5987387,
  "_source": {
    "description": "McCain Home Chips 1kg",
    "margin": 100
  }
},
{
  "_index": "blog_food_products",
  "_id": "GaO3MJoBBtzDfCS5JfQM",
  "_score": 2.1787827,
  "_source": {
    "description": "McCain Home Chips 1.5kg",
    "margin": 50
  }
},
{
  "_index": "blog_food_products",
  "_id": "G6O3MJoBBtzDfCS5JfQM",
  "_score": 0.64049,
  "_source": {
    "description": "BirdsEye Crispy Chips 900g",
    "margin": 12
  }
},
{
  "_index": "blog_food_products",
  "_id": "GqO3MJoBBtzDfCS5JfQM",
  "_score": 0.62682253,
  "_source": {
    "description": "BirdsEye Crispy Chips 450g",
    "margin": 9
  }
}</code></pre><h2>Understanding the formula</h2><p>The <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">function_score query</a> allows us to apply a smooth, interpretable boost based on margin without overwhelming BM25’s lexical relevance.</p><p>Here’s how it works:</p><ul><li><strong>margin_boost</strong> = ln(1 + margin × factor)</li><li><strong>boost</strong> = 1 + margin_boost</li><li><strong>final_score</strong> = BM25 × boost</li></ul><p>Where the query is specified with the following fields:</p><ul><li><strong>field_value_factor</strong> – uses a document field to influence scoring without scripting overhead.</li><li><strong>modifier: “ln1p” </strong>– computes ln(1 + margin × factor)<ul><li>Note: ln1p(x) is shorthand for ln(1 + x).</li></ul></li><li><strong>factor</strong> – controls scale; 0.0085 caps boosts near 2× at margin=200.</li><li><strong>weight: 1 </strong>– ensures a minimum boost of 1 for neutral items.</li><li><strong>score_mode: “sum” </strong>– adds constant 1 (from that standalone “weight” : 1) and the margin_boost together.</li><li><strong>boost_mode: “multiply”</strong> – multiplies BM25 by the computed boost.</li></ul><h2>Why was that formula chosen?</h2><p>The logarithmic (ln1p) scaling behaves well across real-world data:</p><ul><li>It grows fast at small margins (rewarding incremental gains).</li><li>It flattens at high margins (preventing runaway scores).</li><li>It’s continuous and interpretable — no thresholds or discontinuities.</li></ul>
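To make the curve tangible, here is a short Python sketch that reproduces these boost values outside Elasticsearch. This is a sketch only: `margin_boost` is our own helper name, the 0.0085 factor matches the query above, and small rounding differences versus the table are expected:

```python
import math

def margin_boost(margin: float, factor: float = 0.0085) -> float:
    """Mirrors field_value_factor (modifier=ln1p) summed with the
    standalone {"weight": 1} function: boost = 1 + ln(1 + margin * factor)."""
    return 1.0 + math.log1p(margin * factor)

for m in (0, 5, 50, 100, 200):
    print(f"margin={m:>3}  boost=x{margin_boost(m):.2f}")
```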
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Margin</td><td style="border: 1px solid #ccc; padding: 4px;">ln(1 + margin × 0.0085)</td><td style="border: 1px solid #ccc; padding: 4px;">Boost (≈1+ln1p)</td><td style="border: 1px solid #ccc; padding: 4px;">Boost Multiplier</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5</td><td style="border: 1px solid #ccc; padding: 4px;">0.042</td><td style="border: 1px solid #ccc; padding: 4px;">1.04</td><td style="border: 1px solid #ccc; padding: 4px;">×1.04</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50</td><td style="border: 1px solid #ccc; padding: 4px;">0.35</td><td style="border: 1px solid #ccc; padding: 4px;">1.35</td><td style="border: 1px solid #ccc; padding: 4px;">×1.35</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">100</td><td style="border: 1px solid #ccc; padding: 4px;">0.63</td><td style="border: 1px solid #ccc; padding: 4px;">1.63</td><td style="border: 1px solid #ccc; padding: 4px;">×1.63</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">200</td><td style="border: 1px solid #ccc; padding: 4px;">0.99</td><td style="border: 1px solid #ccc; padding: 4px;">1.99</td><td style="border: 1px solid #ccc; padding: 4px;">×1.99</td>
                    </tr>
              </tbody>
            </table>
          <h2>Ranking by margin and popularity</h2><p>We can extend the same logic to add a popularity boost. Here, we tune the popularity factor so that the boost increases by roughly +1.0 at a popularity of 10,000. (These thresholds depend on your dataset’s scale.)</p><pre><code>POST blog_food_products/_search
{
 "size": 5,
 "_source": ["product_id", "description", "margin", "popularity"],
 "query": {
   "function_score": {
     "query": {
       "match": {
          "description": "McCain Chips"
        }
     },
     "functions": [
       {
         // calculate margin_boost
         "filter": { "range": { "margin": { "gt": 0 } } },
         "field_value_factor": {
           "field":   "margin",
           "modifier":"ln1p",           // ln(1 + margin * margin_f)
           "factor":  0.008591,         // ≈ (e - 1) / 200
           "missing": 0
         },
         "weight": 1                 // full impact from margin
       },
       {
         // calculate popularity_boost
         "filter": { "range": { "popularity": { "gt": 0 } } },
         "field_value_factor": {
           "field":   "popularity",
           "modifier":"ln1p",           // ln(1 + popularity * popularity_f)
           "factor":  0.0001718,        // ≈ (e - 1) / 10,000
           "missing": 0
         },
         "weight": 0.5                 // popularity counts for half the impact of margin
       },
       {
         "weight": 1                   // ensures minimum boost of 1
       }                 
     ],
     "score_mode": "sum",               // boost = 1 + margin_boost + 0.5×popularity_boost
     "boost_mode": "multiply"           // final_score = BM25 * boost
     // "max_boost": 4.0                // optional: clamp hard ceiling
   }
 }
}</code></pre><p>This returns the most popular product in 1st place even though it does not have the highest margin: the popularity boost has pushed McCain Home Chips 1.5kg to the top of the results.</p><pre><code>{
  "_index": "blog_food_products",
  "_id": "IqPBMJoBBtzDfCS5CvRg",
  "_score": 2.988299,
  "_source": {
    "product_id": "MCC-HOME-1500",
    "description": "McCain Home Chips 1.5kg",
    "margin": 50,
    "popularity": 10000
  }
},
{
  "_index": "blog_food_products",
  "_id": "IaPBMJoBBtzDfCS5CvRg",
  "_score": 2.6905532,
  "_source": {
    "product_id": "MCC-HOME-1000",
    "description": "McCain Home Chips 1kg",
    "margin": 100,
    "popularity": 640
  }
},
{
  "_index": "blog_food_products",
  "_id": "IKPBMJoBBtzDfCS5CvRg",
  "_score": 2.667411,
  "_source": {
    "product_id": "MCC-HOME-500",
    "description": "McCain Home Chips 500g - High Margin",
    "margin": 200,
    "popularity": 100
  }
},
{
  "_index": "blog_food_products",
  "_id": "JKPBMJoBBtzDfCS5CvRg",
  "_score": 0.67510986,
  "_source": {
    "product_id": "BIR-CHIPS-900",
    "description": "BirdsEye Crispy Chips 900g",
    "margin": 12,
    "popularity": 720
  }
},
{
  "_index": "blog_food_products",
  "_id": "I6PBMJoBBtzDfCS5CvRg",
  "_score": 0.66836256,
  "_source": {
    "product_id": "BIR-CHIPS-450",
    "description": "BirdsEye Crispy Chips 450g",
    "margin": 9,
    "popularity": 880
  }
}</code></pre><h2>What the resulting boosts look like</h2><p>The “factors” are tuned to add +1.0 to the boost at the assumed maximums. They are calculated to satisfy the following formulas:</p><pre><code>ln(1 + 200 × margin_f) = 1.0
i.e. margin_f ≈ 0.008591
ln(1 + 10,000 × popularity_f) = 1.0
i.e. popularity_f ≈ 0.0001718</code></pre><p>Then:</p><pre><code>margin_boost = ln(1 + margin × margin_f)
popularity_boost = ln(1 + popularity × popularity_f)
boost = 1 + margin_boost + 0.5 × popularity_boost
final_score = BM25 × boost</code></pre><p>Each cell in the table below represents the total BM25 multiplier for various margin and popularity values.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/adaa1c58de96c1e319e1605e1d51253dd6b5af7f-1024x572.png" style="width: 100%;" /><p>How to read the table:</p><ul><li>The first column (popularity = 0) isolates the margin effect.</li><li>Moving right, popularity increases the boost — but since its weight is 0.5, its contribution to the summed boost is halved.</li><li>Even at extreme values (popularity = 100,000), the boost flattens due to logarithmic scaling.</li></ul><h2>Tuning</h2><p>If you find popularity can spike very high (e.g., 100k+) and you don’t want boosts above some ceiling, you can:</p><ul><li>Lower the popularity factor further, or</li><li>Add “max_boost”: &lt;cap&gt; to function_score, or</li><li>Split weights, e.g. “weight”: 0.25 on popularity and “weight”: 1 on margin (still with score_mode: “sum”), if you want one to dominate less.</li></ul><h2>Using rank_feature for similar use cases</h2><p>At first glance, <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-rank-feature-query">rank_feature</a> and <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/rank-features">rank_features</a> look like a natural choice for incorporating numeric signals such as popularity, recency, or even profit margin. They are fast, compressed, and easy to operationalize — which is why many teams reach for them first.</p><p>However, they are not a good fit for this type of scoring model, for the following reasons:</p><h3>1. 
Rank-feature contributions are strictly additive</h3><p>The score takes the form:</p><pre><code>final_score = BM25 + feature_boost</code></pre><p>This means the effect of the boost changes dramatically depending on the scale of the BM25 score.</p><ul><li>When BM25 is small, the boost dominates the ranking.</li><li>When BM25 is large, the identical boost becomes negligible.</li></ul><p>We need consistent, proportional behavior instead.</p><h3>2. Impossible to express “percentage-based” or multiplicative logic</h3><p>This article’s model requires expressing things like:</p><ul><li>“Popularity increases relevance by ~20%.”</li><li>“Margin strengthens relevance but never overrides it.”</li></ul><p><code>rank_feature</code> cannot do this. It does not support multiplicative shaping of the BM25 score.</p><h3>3. Combining multiple signals becomes unstable and hard to tune</h3><p>If you try to combine margin, popularity, availability, or other business metrics via rank_features, each feature adds another independent additive term. These interact in opaque ways, making tuning brittle and unpredictable.</p><h3>4. Bottom line</h3><p><code>rank_feature</code> is great for simple additive numeric boosts. 
It is not suitable when you need:</p><ul><li>stable behavior across queries</li><li>proportional / multiplicative effects</li><li>explainable blending of multiple signals</li></ul><p>For this reason, the article uses <code>function_score</code> instead, because it provides explicit, controlled scoring that behaves consistently regardless of BM25 scale.</p><h2>Wrapping up</h2><p>Elastic’s <code>function_score</code> query makes it simple to transform search ranking from content relevance into business-aware optimization.</p><p>By combining BM25 relevance with economic signals like margin and popularity, you can:</p><ul><li>Align search with real business outcomes.</li><li>Tune scaling via a single parameter (<code>factor</code>).</li><li>Maintain full explainability through <code>_explain</code>.</li></ul><p>Once this foundation is in place, you can extend it to stock levels (reducing the ranking of low-stock products), recency (prioritizing new products), or other business-critical signals you want to take into consideration. Each new signal simply adds to the boost, which is then multiplied by the base BM25 relevance score.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reducing Elasticsearch frozen tier costs with Deepfreeze S3 Glacier archival]]></title>
            <description><![CDATA[Learn how to leverage Deepfreeze in Elasticsearch to automate searchable snapshot repository rotation, retaining historical data and aging it into lower cost S3 Glacier tiers after index deletion.]]></description>
            <link>https://www.elastic.co/search-labs/blog/s3-glacier-archiving-elasticsearch-deepfreeze</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/s3-glacier-archiving-elasticsearch-deepfreeze</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Bret Wortman]]></dc:creator>
            <pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/abe4a7118c7964c73e8a8e38339ab4c0355d27d3-854x480.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2>The problem: Your old data is expensive (and getting more so)</h2><p>As legal requirements change and data retention periods grow longer, many Elasticsearch customers have started to wonder: <strong>How can I retain my historical data without storage costs breaking the bank?</strong></p><p>You&#x27;ve got your ILM policies humming along nicely. Fresh data comes in hot, ages into warm or cold, gets frozen as a searchable snapshot, and eventually—after 180 days, 10 years, or longer, depending on your compliance requirements—gets deleted.</p><p>When the delete action runs, it typically deletes the searchable snapshot with it. Sure, you might have regular snapshots elsewhere, but those aren&#x27;t searchable without a full restore. If someone comes to you six months later asking, &quot;Hey, can we check what happened in Q2 2023?&quot; you&#x27;re looking at a lengthy restore operation, a manual and time-consuming re-ingestion, or an awkward conversation about data retention policies.</p><p>The obvious solution is to just... not delete things, right? Keep those frozen indices around forever! But that creates its own problems:</p><ol><li><strong>Cost creep</strong>: Frozen tier storage isn&#x27;t free, and it adds up</li><li><strong>Cluster clutter</strong>: Managing hundreds or thousands of ancient frozen indices gets messy</li><li><strong>Wasted resources</strong>: You&#x27;re paying for data to be &quot;searchable&quot; when you haven&#x27;t searched it in months</li></ol><p>What you really want is something in between: a way to keep the data <em>accessible</em> without drowning in S3 storage costs.</p><h2>The solution: Deepfreeze makes old data cheap (but not gone)</h2><p>Deepfreeze is a new repository management solution from Elastic. It allows you to retain searchable snapshots for deleted indices, move them to a cheaper S3 storage tier, and easily restore them if the saved data is needed again. 
All of this is possible because <strong>Elasticsearch lets you delete an index while keeping its searchable snapshot intact.</strong></p><p>When you configure your ILM delete action, there&#x27;s an often-overlooked option available in the delete phase:</p><pre><code>{
  "delete": {
    "delete_searchable_snapshot": false
  }
}</code></pre><p>Set that to <code>false</code>, and when the index gets deleted, the snapshot repository keeps the snapshot files in S3. The index is gone from your cluster (no more rent!), but the underlying data is still sitting in your S3 bucket, waiting patiently.</p><p>But here&#x27;s where it gets interesting: if you just leave those snapshots in the same repository, they&#x27;re still &quot;managed&quot; by Elasticsearch, which means they must remain in a standard access tier. AWS S3 Intelligent-Tiering won&#x27;t move them to cheaper tiers because, from S3&#x27;s perspective, Elasticsearch still actively manages that bucket.<br/><strong>Deepfreeze solves this by rotating snapshot repositories.</strong></p><p>Think of it like this: instead of one big filing cabinet where you keep adding folders forever, you start a new filing cabinet every month. Old cabinets get closed, labeled, and moved to cheaper storage. If you need something from an old cabinet, you can always bring it back out and open it up.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/abe4a7118c7964c73e8a8e38339ab4c0355d27d3-854x480.png" style="width: 100%;" /><p>Note: Deepfreeze currently supports AWS. Azure and GCP are on the roadmap for future development.</p><h2>How Deepfreeze works: Four steps to data freedom</h2><p>Deepfreeze is a Python automation tool that runs periodically (typically via cron) to manage this rotation process. Here&#x27;s what happens when you execute its monthly <code>rotate</code> action:</p><h3>Step 1: Create a new S3 bucket</h3><p>Deepfreeze creates a fresh S3 bucket with a one-up name (like <code>Repository-000002</code>). 
This bucket is configured with your preferred storage class—by default, this is Standard, but Intelligent Tiering is an option.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/052ded4b9f437b627fad432b66a35fd6898c621c-854x480.png" style="width: 100%;" /><p><em>Please note that I use the term “bucket” here for ease of discussion. In reality, Deepfreeze creates new base paths inside a bucket to avoid any potential issues with AWS bucket creation limits. For simplicity, I refer to this combination of a bucket and a path as a bucket.</em></p><h3>Step 2: Mount the bucket as an Elasticsearch repository</h3><p>The new S3 bucket gets registered as a snapshot repository in your Elasticsearch cluster. Your ILM policies will now start using this repository for new frozen indices.</p><h3>Step 3: Update ILM policies to use the new repository</h3><p>Here&#x27;s where the magic happens. Deepfreeze automatically:</p><ul><li>Scans all your ILM policies</li><li>Finds any reference to the old repository in their <code>searchable_snapshot</code> actions</li><li>Copies the ILM policies to new ones referencing the new repository</li><li>Updates index templates to use the new policies</li></ul><p>You don&#x27;t have to edit dozens of policies manually. Deepfreeze handles it all.</p><h3>Step 4: Unmount old repositories (but keep recent ones)</h3><p>Deepfreeze maintains a sliding window of mounted repositories—by default, it keeps the last 6 months accessible. 
Older repositories get unmounted from Elasticsearch, but <strong>the S3 buckets remain intact</strong>.</p><p>Once unmounted, those buckets are no longer under active Elasticsearch management, and Deepfreeze can now move them to Glacier.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/95ed8ed1a8322a2d14ac6b7513ac488ab875817e-854x480.png" style="width: 100%;" /><h2>The benefits: Why you should care</h2><h3>Cost optimization</h3><p>Elasticsearch&#x27;s frozen tier storage is cheaper than hot or warm storage, but it still incurs standard S3 storage costs. The S3 Glacier tier can be <strong>95% cheaper</strong> than standard S3 storage. Over months and years, this adds up to real money.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/27c7439107c34b6b2772d06714aeb5fa89c265e3-1316x866.png" style="width: 100%;" /><h3>Data retention without the guilt</h3><p>Compliance requirements often mandate keeping data for years. Deepfreeze lets you meet those requirements without constantly explaining to finance why your storage bill keeps growing.</p><h3>Quick access when needed</h3><p>Because the searchable snapshots are still in their native format, you can remount them using Elasticsearch&#x27;s standard <code>_mount</code> API. No lengthy restore process, no reformatting—just point to the old repository and mount the index.</p><p>Deepfreeze makes this easier still by keeping track of the datetime range of every repo it processes, which makes commands like the following possible. The system does all the work of thawing data in the specified range: requesting the buckets from S3, mounting the repos once the data is available, and even mounting the indices that cover the specified dates:</p><pre><code>$ deepfreeze thaw --start-date 2024-10-12 --end-date 2024-12-01</code></pre><p>After the duration has passed (30 days, by default), AWS automatically re-freezes the data to Glacier. 
Deepfreeze checks for this and updates its metadata, repositories, and indices during normal operations.</p><p>(AWS doesn’t actually move the data from Glacier to Standard. Instead, it copies the desired data back to Standard and then, when the duration has expired, it deletes it.)</p><h3>Full automation</h3><p>Run Deepfreeze on a schedule (the first of every month works for a monthly rotation) and forget about it. It handles all the tedious policy updates, bucket creation, and repository management automatically. Run it from cron or a .service file. Examples are available <a href="https://github.com/elastic/deepfreeze/tree/main/examples">on GitHub</a>.</p><h3>Reduced cluster clutter</h3><p>Your Elasticsearch cluster only needs to track the repositories that are actively mounted. Ancient indices don&#x27;t show up in your cluster state, monitoring dashboards, or backup routines. Everything stays cleaner.</p><h2>Configuration: Making Deepfreeze work for you</h2><p>Deepfreeze is flexible. You can configure it via environment variables, command-line arguments, or both. For example, you can change the default number of mounted repositories by setting the DEEPFREEZE_KEEP environment variable. The default value is 6. Increasing this value gives you access to more data.</p><pre><code>DEEPFREEZE_KEEP=10</code></pre><p>The ideal setting for the <code>DEEPFREEZE_KEEP</code> value depends on the likelihood that older data is needed. 
If you rarely look at data older than a year, set the keep value to 12 to ensure your cluster has a full year of data; Deepfreeze will push anything older to Glacier storage.</p><p>You can also change the AWS Storage Class and use <code>intelligent_tiering</code> (Auto-tiering based on access) instead of the default standard (Full-price, always accessible):</p><pre><code>DEEPFREEZE_STORAGE_CLASS=intelligent_tiering</code></pre><p><strong>Options for each action are in the <a href="https://github.com/elastic/deepfreeze/blob/main/README.md">GitHub README</a>.</strong></p><h2>Real-world use case</h2><p>Whether you’re looking to ingest application logs, security events, IoT sensor data, or something entirely different, cost savings will depend on your ingest rate and retention periods.</p><p>For example, suppose you&#x27;re ingesting 175GB of application logs per day (roughly 60TB per year), but typically only need the last 6 months searchable for troubleshooting, while compliance requires 7 years of retention:</p><p><strong>Without Deepfreeze:</strong></p><ul><li>Keep 7 years × 60TB/year = 420TB in frozen tier</li><li>Cost: ~$9,660/month</li></ul><p><strong>With Deepfreeze:</strong></p><ul><li>Keep 6 months × 60TB/year = 30TB in frozen tier: $690/month</li><li>Keep 6.5 years × 60TB/year = 390TB in S3 Archive: $1,560/month</li><li><strong>Total: $2,250/month (77% savings)</strong></li></ul><h2>Technical deep dive: under the hood</h2><p>For those who want to understand what&#x27;s really happening, here&#x27;s a peek at the implementation.</p><h3>The Deepfreeze process flow</h3><pre><code>START
  │
  ├─► Initialize Elasticsearch client (SSL/TLS, auth)
  │
  ├─► Calculate next repository suffix
  │
  ├─► Validate: At least one existing repo with prefix exists
  │
  ├─► Validate: New repo name doesn't already exist
  │
  ├─► Create new S3 bucket path via boto3
  │   └─► Configure: storage class, ACL, region
  │
  ├─► Register new repository in Elasticsearch
  │   └─► Type: s3
  │   └─► Settings: bucket, base_path, storage_class
  │
  ├─► Update all ILM policies
  │   ├─► Fetch all policies from cluster
  │   ├─► For each policy using the last repo:
  │   │   └─► Find searchable_snapshot actions
  │   │   └─► Replace old repo name with new
  │   │   └─► Submit updated policy
  │   └─► Log policy update count
  │
  ├─► Unmount old repositories
  │   ├─► List all repos matching prefix
  │   ├─► Exclude any thawed repos from consideration
  │   ├─► Sort by suffix (oldest first)
  │   ├─► While (total count > KEEP):
  │   │   └─► DELETE oldest repository (S3 bucket stays!)
  │   └─► Log unmounted repo count
  │
END</code></pre><h3>Key implementation details</h3><p><strong>Smart repository discovery:</strong> The tool discovers repositories by prefix matching, ensuring it only manages repositories it created:</p><pre><code>def get_repos(self) -> list:
    """Get all repositories matching our prefix"""
    all_repos = self.client.snapshot.get_repository(name='*')
    return [r for r in all_repos if r.startswith(self.repo_name_prefix)]</code></pre><p><strong>Automatic ILM policy updates:</strong> No manual policy editing required. Deepfreeze walks the policy structure and updates repository references:</p><pre><code># Simplified conceptual example
for policy_name, policy in ilm_policies.items():
    # The phases object is a dict keyed by phase name ("hot", "frozen", ...),
    # so iterate over the phase bodies, not the keys
    for phase in policy['phases'].values():
        actions = phase.get('actions', {})
        if 'searchable_snapshot' in actions:
            actions['searchable_snapshot']['snapshot_repository'] = new_repo
    es.ilm.put_policy(name=policy_name, policy=policy)</code></pre><h2>Important considerations</h2><h3>The S3 Intelligent-Tiering question</h3><p>There&#x27;s an interesting architectural question worth discussing: How does AWS S3 Intelligent-Tiering classify the temperature of unmounted searchable snapshots?</p><p>The idea is that once unmounted from Elasticsearch, the S3 bucket is no longer actively managed, so access patterns drop to zero, and Intelligent-Tiering <em>should</em> move the data through Archive Access tiers down to Deep Archive Access (or even Glacier Instant Retrieval).</p><p>Because this relies on everything playing nice and leaving the data untouched and unaccessed for at least 90 days, Deepfreeze defaults to putting repositories in Standard and moving them to Glacier as soon as they’ve been unmounted, rather than waiting for Intelligent-Tiering to decide it’s time. We ensure that the data moves to Glacier as quickly as possible, but not before.</p><h3>Initial setup requirements</h3><p>Before running Deepfreeze, you need:</p><p>1. <strong>ILM policies configured to retain searchable snapshots:</strong></p><pre><code>"delete": {
  "delete_searchable_snapshot": false
}</code></pre><p>2.<strong> At least one existing repository</strong> with your chosen prefix (deepfreeze validates this)</p><p>3. <strong>AWS credentials</strong> configured for S3 access (via environment, IAM role, or credentials file)</p><p>4.<strong> Elasticsearch authentication</strong> with privileges to manage repositories and ILM policies</p><p>The setup command runs a comprehensive set of pre-checks to ensure the conditions are right before it starts.</p><h3>Post-setup tasks</h3><p>Deepfreeze just creates the environment for managing repositories; it doesn’t help your data get into those repos. We don’t know your business and what data you want preserved for posterity. After the Deepfreeze setup runs, ensure you have at least one ILM policy that uses this repo with snapshot deletion disabled. You will also need to ensure that you have an index template that associates this ILM policy with an index or data stream you want to preserve.</p><h2>Getting started</h2><h3>Installation</h3><pre><code># Clone the repository
git clone https://github.com/elastic/deepfreeze.git
cd deepfreeze

# Install in development mode
pip install -e .

# Or install dependencies directly
pip install -r requirements.txt</code></pre><h3>Basic usage</h3><pre><code># Set up environment variables (or use .env file)
export DEEPFREEZE_ELASTICSEARCH=https://es.example.com:9200
export DEEPFREEZE_CA=/path/to/http_ca.crt  # For self-signed certificates
export DEEPFREEZE_USERNAME=elastic
export DEEPFREEZE_KEEP=6
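# Optional: override the default standard storage class
# (see the Configuration section above)
export DEEPFREEZE_STORAGE_CLASS=intelligent_tiering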

# Use a config.yml file instead
cp /path/to/deepfreeze/packages/deepfreeze-cli/config.yml.example ~/.deepfreeze/config.yml
# ...and then edit to set your values

# Run deepfreeze setup
deepfreeze setup
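
# After setup succeeds, schedule the monthly rotation, e.g. via cron
# (example schedule; adapt to your environment):
#   0 3 1 * * deepfreeze rotate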

# With command-line options
deepfreeze setup --repo_name_prefix dftest \
                 --bucket_name_prefix myorg_dftest \
                 --base_path_prefix df_snapshots</code></pre><h3>Recommended workflow</h3><ol><li><strong>Start small:</strong> Test with a single index or low-priority data</li><li><strong>Monitor costs:</strong> Watch your AWS bill for the first few months to verify savings</li><li><strong>Verify tiering:</strong> Check S3 metrics to ensure Intelligent-Tiering is working as expected if chosen</li><li><strong>Automate gradually:</strong> Once confident, add to cron and expand to more indices</li></ol><h2>The bottom line</h2><p>Deepfreeze solves a real problem that many Elasticsearch operators face: <strong>how do you keep historical data accessible without going broke?</strong></p><p>By automating the rotation of snapshot repositories and letting AWS S3&#x27;s native tiering capabilities do the heavy lifting, you get:</p><ul><li>✅ Massive cost savings on long-term data retention</li><li>✅ Full compliance with data retention policies</li><li>✅ Quick access to recent historical data (6 months by default)</li><li>✅ On-demand remounting of older data when needed</li><li>✅ Clean, automated workflow that runs itself</li></ul><p>Deepfreeze works with your existing ILM policies and requires minimal configuration. It&#x27;s not a redesign of your entire data management strategy—it&#x27;s an optimization that pays for itself almost immediately.</p><p>There <em>are</em> costs associated with retrieving data from Glacier, and it does take time (6 hours for Standard retrieval as of this writing). Still, if you anticipate frequently needing your historical data, Deepfreeze may not be a great fit, and keeping your data in frozen tiers might be a better solution.</p><p>If you&#x27;re running Elasticsearch at scale and storing more than a few months of historical data, Deepfreeze is worth a serious look. 
Your CFO will thank you, and you&#x27;ll sleep better knowing that you can still fulfill a six-month-old audit request.</p><h2>Resources &amp; next steps</h2><ul><li><strong>Repository:</strong> <a href="http://github.com/elastic/deepfreeze">github.com/elastic/deepfreeze</a></li><li><strong>Documentation:</strong> See <code>README.md</code> and the <code>docs/</code> directory in the repository</li><li><strong>Dependencies:</strong> Python 3.8+, others as described in <code>pyproject.toml</code></li></ul><p>Have questions or want to share your Deepfreeze success story? Open an issue on GitHub or reach out!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Getting started with Elastic Agent Builder and Strands Agents SDK]]></title>
            <description><![CDATA[Learn how to create an agent with Elastic Agent Builder and then explore how to use the agent via the A2A protocol orchestrated with the Strands Agents SDK.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-a2a-strands-agents-guide</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-a2a-strands-agents-guide</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jonathan Simon]]></dc:creator>
            <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/31b6a5c1c30dacbb4d5e58d1c566071e7143a0c8-1600x879.gif?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Have an idea for an AI agent? It probably involves doing something with data because if an agent is going to initiate a useful action, it needs to make a decision, and it needs the right data to make the right decision.</p><p>Elastic Agent Builder makes it easy to build data-connected AI agents. We’ll show you how in this blog post. Let’s walk through all the steps required to create an Agent with an MCP tool that accesses data stored in Elastic. Then we&#x27;ll use the Strands Agents SDK and its Agent2Agent (A2A) capabilities to operate the agent. The <a href="https://strandsagents.com/">Strands Agents SDK</a> is a multi-agent AI development platform that you can use to build agentic apps with just enough code to ensure you get the outcome that you want.</p><p>Let&#x27;s build an AI agent that plays the game of RPS+, which is a take on the classic game of Rock Paper Scissors with an added twist; it gives the game players a couple of extra choices.</p><h2>Prerequisites</h2><p>Here’s what’s required to follow along with the steps in this blog post:</p><ul><li>A text editor running on your local computer<ul><li><a href="https://code.visualstudio.com/download">Visual Studio Code</a> is what we’ll be using for the example instructions in this blog post</li></ul></li><li><a href="https://www.python.org/downloads/">Python 3.10 or greater</a> running on your local computer</li></ul><h2>Create a Serverless Project</h2><p>The first thing we need is an Elasticsearch Serverless project, which includes Elastic Agent Builder.</p><p>Go to <a href="http://cloud.elastic.co/">cloud.elastic.co</a> and create a new Elasticsearch Serverless project.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/31b6a5c1c30dacbb4d5e58d1c566071e7143a0c8-1600x879.gif" style="width: 100%;" /><h2>Create an index and add data</h2><p>Next, we’re going to add some data to our Elasticsearch project. 
Open Developer Tools, where we can run commands to create a new index and insert some data into it. Select Developer Tools from the top-level navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f97d5af077504463155655a9e27c171a7f974f71-1600x879.jpg" style="width: 100%;" /><p>Copy and paste the following PUT command into the request input area of the Developer Tools console. This statement creates an Elasticsearch index named “game-docs”.</p><pre><code>PUT /game-docs
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { 
        "type": "text"
      },
      "filename": { "type": "keyword" },
      "last_modified": { "type": "date" }
    }
  }
}</code></pre><p>Click the <strong>Send request</strong> button that appears on the right-hand side of the statement in Developer Tools. You should see a notification confirming that the <em>game-docs</em> index was created in the response area of Developer Tools.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/be0555a1930e4d4f58b7ed8b669c9b702532ed17-1600x880.jpg" style="width: 100%;" /><p>An index named <em>game-docs</em> is a great place to store the data for the game we’re making. Let’s put a document named <em>rps+-md</em> into this index that contains all the data our game requires. Copy and paste the following PUT command into the Developer Tools console.</p><pre><code>PUT /game-docs/_doc/rps+-md
{
  "title": "Rock Paper Scissors +",
  "content": "
# Game Name
RPS+

# Starting Prompt
Let's play RPS+ !
---
What do you choose?

# Game Objects
1. Rock 🪨 👊
2. Paper 📜 🖐
3. Scissors ✄ ✌️
4. Light ☼ 👍
5. Dark Energy ☄ 🫱

# Judgement of Victory
* Rock beats Scissors
  * because rocks break scissors
* Paper beats Rock
  * because paper covers rock
* Scissors beat Paper
  * because scissors cut paper
* Rock beats Light
  * because you can build a rock structure to block out light
* Paper beats Light
  * because knowledge stored in files and paper books helps us understand light
* Light beats Dark Energy
  * because light enables humans to lighten up and laugh in the face of dark energy as it causes the eventual heat death of the universe
* Light beats Scissors
  * because light is needed to use scissors safely
* Dark Energy beats Rock
  * because dark energy rocks more than rocks. It rocks rocks and everything else in its expansion of the universe
* Dark Energy beats Paper
  * because humans, with their knowledge stored in files and paper books, can't explain dark energy 
* Scissors beat Dark Energy
  * because a human running with scissors is darker than dark energy

# Invalid Input
I was hoping for a worthy opponent
  - but alas it appears that time has passed
  - but alas there's little time for your todo list when [todo:fix this] is so vast

# Cancel Game
The future belongs to the bold. Goodbye..
",
  "filename": "RPS+.md",
  "last_modified": "2025-11-25T12:00:00Z"
}</code></pre><p>Click the <strong>Send request</strong> button next to the statement to run it and add the <em>rps+-md</em> document to the game-docs index.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c01d8a4602de5c33337457591a388a4a4e3fad3-1600x879.jpg" style="width: 100%;" /><p>We should now have some data to query, and with Agent Builder, that’s simpler than ever.</p><p>Select <strong>Agents</strong> from the top-level navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/82dbd2e9a439cabd5a5eea3d0ce005b87df0c3ea-1600x879.jpg" style="width: 100%;" /><p>Then all you have to do is ask the default Elastic AI Agent, “What data do I have?”</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f7a2f39c9d1486bdf02d9e88a732b540ac2e2cd1-1600x872.gif" style="width: 100%;" /><p>The Elastic AI Agent evaluates the data and returns a concise explanation of the data we have.</p><h2>Create a tool</h2><p>Okay we’ve now got some data in Elastic, let’s put it to use. Agent Builder includes built-in support for creating <a href="https://modelcontextprotocol.io/">MCP</a> tools that help agents access the data they need to have the right context for their task. 
Let’s create a simple tool that will retrieve our game data.</p><p>Click the Agent Builder actions menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0e327c202674dda33bcc0e494d2b588fa8b32e4f-1600x879.png" style="width: 100%;" /><p>Select <strong>View all tools </strong>from the menu options.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1ebf58650e9fb56750d3f0b1700fab50b44f9bdf-1600x879.png" style="width: 100%;" /><p>Click <strong>+ New Tool.</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6c03a7f28b99ac2d805f34f39948979893316a00-1600x879.png" style="width: 100%;" /><p>In the <strong>Create Tool</strong> form, select the <strong><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql">ES|QL</a> </strong>as the tool <strong>Type</strong> and enter the following values.</p><p>For <strong>Tool ID</strong>:</p><pre><code>example.get_game_docs</code></pre><p>For <strong>Description</strong>:</p><pre><code>Get RPS+ doc from Elasticsearch game-docs index.</code></pre><p>For <strong>Configuration, </strong>enter the following query into the <strong>ES|QL Query </strong>text area:</p><pre><code>FROM game-docs | WHERE filename == "RPS+.md"</code></pre><p>Your completed <strong>Create tool</strong> form should look like this. Click <strong>Save</strong> to create the tool.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b6c93e344600f319b9d2c3030020cf2d171ac1c4-1600x1312.png" style="width: 100%;" /><p>We’ve got a new tool hanging on the tool rack. Tools shouldn’t just be perpetually hanging on a rack; they should be put to a worthy use. Let’s create an agent that can use our new custom tool.</p><h2>Create an Agent and assign it a tool</h2><p>Creating an agent is refreshingly simple with Agent Builder. You just have to enter the agent instructions with a few details, and that’s all you need. 
Let’s create an agent now.</p><p>Click <strong>Manage agents.</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/53934b93db07187e251d4b321cb9ca647e2fd51b-1600x858.png" style="width: 100%;" /><p>Click<strong> + New agent.</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fae3ad8f31e71a6dfd044e1daa025a4e280b4e68-1600x490.png" style="width: 100%;" /><p>Enter the following information into the <strong>New Agent</strong> form.</p><p>For <strong>Agent ID, </strong>enter the text below:</p><pre><code>rps_plus_agent</code></pre><p>In the <strong>Custom Instructions </strong>text area, enter the following instructions:</p><pre><code>When prompted, if the prompt contains an integer, then select the corresponding numbered item in the list of "Game Objects" from your documents. Otherwise select a random game object. This is your chosen game object for a single round of the game.

# General Game Rules
* 2 players
    - the user: the person playing the game
    - you: the agent playing the game and serving as the game master
* Each player chooses a game object which will be compared and cause them to tie, win or lose.

# Start the game
1. This is the way each new game always starts. You make the first line of your response only the name of your chosen game object. 

2. The remainder of your response should be the "Starting Prompt" text from your documents and generate a list of "Game Objects" for the person playing the game to choose a game object from.  

# End of Game: The game ends in one of the following three outcomes:
1. Invalid Input: If the player responds with an invalid game object choice, respond with variations of the "Invalid Input" text from your documents and then end the game.

2. Tie: The game ends in a tie if the user chooses the same game object as your game object choice.

3. Win or Lose: The game winner is decided based on the "Judgement of Victory" conditions from your documents. Compare the user's game object choice and your game object choice and determine who chose the winning game object.

# Game conclusion
Respond with a declaration of the winner of the game by outputting the corresponding text in the "Judgement of Victory" section of your documents.</code></pre><p>For the <strong>Display name, </strong>enter the text below:</p><pre><code>RPS+ Agent</code></pre><p>For the <strong>Display description, </strong>enter the text below:</p><pre><code>An agent that plays the game RPS+</code></pre><p>Give the agent the custom tool we created previously by clicking the <strong>Tools</strong> tab.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1778f64bc3a1b4004998dc3668ef7f666788e193-1600x1390.png" style="width: 100%;" /><p>Select only the <em>example.get_game_docs</em> tool that we created previously.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7d734cd80161bcc058817482eb330ffcf1cb567b-1600x1363.png" style="width: 100%;" /><p>Click <strong>Save</strong> to create the new agent.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c0014faf605ce50c03679ed0d073bd9f3ae7234d-1600x468.png" style="width: 100%;" /><p>Let’s test out our new agent. There’s a handy link to start a chat with any agent from the list of agents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b7d6943ad90a4f68691207caf66b81742e712145-1600x560.png" style="width: 100%;" /><p>Just enter “start game” and the game will start. It works!</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/984d008e4cc3f08cc1f101720673b0f7347c066c-1600x874.gif" style="width: 100%;" /><p>You can see that the agent displays its game object choice at the top of its response. This is useful in that we can see the agent’s choice and confirm the game is functioning as expected. However, knowing your opponent&#x27;s choice before you choose doesn’t make for a very fun game of Rock, Paper, Scissors. 
To polish and hone the game to its final form, we can use an agent orchestration platform that can control agents with code.</p><p>The Strands Agents SDK enters the chat.</p><h2>Strands Agents SDK</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c195bba6ff2754f5d8fda174a0c1d247bc283710-456x156.png" style="width: 100%;" /><p>If you’re curious about trying out new agent development frameworks, then the <a href="https://strandsagents.com/latest/">Strands Agents SDK</a> is worth a try. The <a href="https://aws.amazon.com/blogs/opensource/introducing-strands-agents-an-open-source-ai-agents-sdk/">Strands Agents SDK was released by AWS (May 2025)</a> as an open source <a href="https://github.com/strands-agents/sdk-python">Python</a> implementation, and there’s now a <a href="https://dev.to/aws/strands-agents-now-speaks-typescript-a-side-by-side-guide-12b3">TypeScript</a> version as well.</p><h2>Getting started with the Strands Agents SDK in Python</h2><p>Fire up your coding engines, we’re now going to race through the process of cloning and running an example app that uses Strands Agents to control the <em>RPS+ Agent</em> via the A2A protocol. 
Let’s create a fine-tuned version of the RPS+ game so that the agent’s choice is revealed after you make your choice, because after all, it’s the guessing and surprise outcome that makes games like Rock Paper Scissors fun.</p><p>On your local computer, open <a href="https://code.visualstudio.com/download">Visual Studio Code</a> and open a new terminal.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2339cc37c89a3524f2b2a21684bc61dae958e1cf-915x460.jpg" style="width: 100%;" /><p>In the newly opened terminal, run the following command to clone the Elasticsearch Labs repository:</p><pre><code>git clone https://github.com/elastic/elasticsearch-labs</code></pre><p>Run the following <em>cd </em>command to change the directory into the elasticsearch-labs directory:</p><pre><code>cd elasticsearch-labs</code></pre><p>Next, run the following command to open the repository in Visual Studio Code:</p><pre><code>code .</code></pre><p>In the Visual Studio File Explorer, expand the <em>supporting-blog-content</em> and <em>agent-builder-a2a-strands-agents</em> folders and then open the file <em>elastic_agent_builder_a2a_rps+.py.</em> Here’s what the file looks like opened in Visual Studio Code:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d153b19e0e016c701576edb99ccab5af7c554f34-1484x1530.jpg" style="width: 100%;" /><p>Here’s the contents of <em>elastic_agent_builder_a2a_rps+.py </em>that you should see in your text editor:</p><pre><code>import asyncio
from dotenv import load_dotenv
from uuid import uuid4
import httpx
import os
import random
from a2a.client import A2ACardResolver, ClientConfig, ClientFactory
from a2a.types import Message, Part, Role, TextPart

DEFAULT_TIMEOUT = 60  # set request timeout to 1 minute


def create_message(*, role: Role = Role.user, text: str, context_id=None) -> Message:
    return Message(
        kind="message",
        role=role,  # use the caller-supplied role instead of hardcoding "user"
        parts=[Part(TextPart(kind="text", text=text))],
        message_id=uuid4().hex,
        context_id=context_id,
    )


async def main():
    load_dotenv()
    a2a_agent_host = os.getenv("ES_AGENT_URL")
    a2a_agent_key = os.getenv("ES_API_KEY")
    custom_headers = {"Authorization": f"ApiKey {a2a_agent_key}"}

    async with httpx.AsyncClient(
        timeout=DEFAULT_TIMEOUT, headers=custom_headers
    ) as httpx_client:
        # Get agent card
        resolver = A2ACardResolver(httpx_client=httpx_client, base_url=a2a_agent_host)
        agent_card = await resolver.get_agent_card(
            relative_card_path="/rps_plus_agent.json"
        )
        # Create client using factory
        config = ClientConfig(
            httpx_client=httpx_client,
            streaming=True,
        )
        factory = ClientFactory(config)
        client = factory.create(agent_card)
        # Use the client to communicate with the agent
        print("\nSending 'start game' message to Elastic A2A agent...")
        random_game_object = random.randint(1, 5)
        msg = create_message(text=f"start with game object {random_game_object}")
        async for event in client.send_message(msg):
            if isinstance(event, Message):
                context_id = event.context_id
                response_complete = event.parts[0].root.text
                # Get agent choice from the first line of the response
                parsed_response = response_complete.split("\n", 1)
                agent_choice = parsed_response[0]
                print(parsed_response[1])
        # User choice sent for game results from the agent
        prompt = input("Your Choice  : ")
        msg = create_message(text=prompt, context_id=context_id)
        async for event in client.send_message(msg):
            if isinstance(event, Message):
                print(f"Agent Choice : {agent_choice}")
                print(event.parts[0].root.text)


if __name__ == "__main__":
    asyncio.run(main())</code></pre><p>Let&#x27;s review what&#x27;s going on in this code. Starting with the <em><code>main()</code></em> method, the code begins by accessing the environment variables for the agent URL and the API Key. Then we use those values to create an <code><em>httpx</em> client</code> that we can use to get the agent card for the agent. The client then uses the agent card details to send a &quot;start game&quot; request to the agent. One interesting thing to note here is that we include a <code>random_game_object</code> value as part of the <code>&quot;start game&quot;</code> request. This value is a random number generated with the <em>random</em> module of Python&#x27;s standard library. The reason for doing this is that it turns out that the powerful LLMs (that make AI agents possible) are not great at randomness. No problem, Python to the rescue.</p><p>Continuing with the code, once the agent responds to the &quot;start game&quot; request, the code strips out the agent&#x27;s game object selection and saves it in the <em>agent_choice</em> variable. The remainder of the response is displayed as text to the end user. The user is then prompted for input of their game object choice, which is sent to the agent. The code then displays the agent’s game object choice along with the agent’s final determination of the game outcome.</p><h2>Setting your agent URL and API Key as environment variables</h2><p>Since the example app will be running on your local computer, in order to communicate with our Agent Builder agent, we need to provide the Strands Agents SDK with an A2A URL and API Key for the agent. 
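Based on the environment variable names the script reads (<code>ES_AGENT_URL</code> and <code>ES_API_KEY</code>), the finished <em>.env</em> file will contain two entries shaped roughly like this (placeholder values only):</p><pre><code>ES_AGENT_URL=https://your-project.kb.us-east-1.aws.elastic.cloud/api/agent_builder/a2a
ES_API_KEY=your-api-key</code></pre><p>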
The example app uses a file named <em>.env</em> to store these values.</p><p>Make a copy of the <em>env.example</em> file and name the new file <em>.env</em>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25ead5f15a17dedb777132a082097cffb06cae4d-1600x843.jpg" style="width: 100%;" /><p>Jump back to Elastic Agent Builder, where we can get both of the values we need.</p><p>Select <strong>View all tools</strong> from the Agent Builder action menu at the top right of the page.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9c4f4e4a3bd76e11e0a182fa007a2f6aec7777b4-1600x880.jpg" style="width: 100%;" /><p>Click the <strong>MCP Server</strong> dropdown at the top of the Tools page and select <strong>Copy MCP Server URL.</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6cde0de678bb6f81bef8a59deffb110ad6c6ce26-1600x882.jpg" style="width: 100%;" /><p>Paste the <strong>MCP Server URL</strong> into the <em>.env</em> file as a replacement for the <strong>&lt;YOUR-ELASTIC-AGENT-BUILDER-URL&gt; </strong>placeholder value. Then make one update to the URL: replace the trailing “mcp” with “a2a”, because the <a href="https://a2a-protocol.org/">A2A protocol</a> is what the Strands Agents SDK will use to communicate with the agent running in Elastic Agent Builder.</p><p>The edited URL should look something like this:</p><pre><code>https://rps-game-project-12345a.kb.us-east-1.aws.elastic.cloud/api/agent_builder/a2a</code></pre><p>The other value we need to get while we’re here in Elastic Cloud is an API key. 
Click <strong>Elasticsearch </strong>in the top-level navigation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/651676b9be65178cdad50b5d24f26441c0bf3f97-1600x549.jpg" style="width: 100%;" /><p>Click the <strong>Copy API Key</strong> button to copy the API key.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/17f1e2ed5c7682630c71e75b0b09ffb1d9036210-1600x879.jpg" style="width: 100%;" /><p>Now, back in Visual Studio Code, paste the API key into the <em>.env</em> file to replace the <strong>&lt;YOUR-ELASTIC-API-KEY&gt; </strong>placeholder text. Your <em>.env</em> file should look something like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a357947e07f29c8c03382e00c7baedf04a399297-1600x286.jpg" style="width: 100%;" /><h2>Run the example app</h2><p>Open a new Terminal in Visual Studio Code.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33e1174c68ea1ed47c7fe62ab6a6da657c606f56-1413x711.jpg" style="width: 100%;" /><p>Start by running the following <em>cd</em> command in the terminal:</p><pre><code>cd elasticsearch-labs/supporting-blog-content/agent-builder-a2a-strands-agents</code></pre><p>Run the following command to create a Python virtual environment.</p><pre><code>python -m venv .venv</code></pre><p>Depending on your local computer’s operating system, run the following command to activate the virtual environment.</p><ul><li>macOS/Linux</li></ul><pre><code>source .venv/bin/activate</code></pre><ul><li>Windows</li></ul><pre><code>.venv\Scripts\activate</code></pre><p>The example app uses the Strands Agents SDK, and we are now at the point in this walk-through where we need to install it. Run the following command to install the Strands Agents SDK along with all of its required Python library dependencies.</p><pre><code>pip install -r requirements.txt</code></pre><p>Time to clear the launch pad and begin the countdown. 
We’re ready to launch this app. Stand back. Let’s run it using the following command:</p><pre><code>python elastic_agent_builder_a2a_rps+.py</code></pre><p>You should be challenged with a game of RPS+. Well done and best of luck!</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/041df81fbf1776f09e1243af0a435c4c0af6aca1-1600x948.gif" style="width: 100%;" /><h2>Build your AI apps with relevant context</h2><p>Building an AI Agent is now a skill in your toolbox. And you’ve seen how easy it is to use Elastic Agent Builder agents via A2A in agent development frameworks like Strands Agents SDK. <a href="https://cloud.elastic.co/registration?utm_source=agentic-ai-category&amp;utm_medium=search-labs&amp;utm_campaign=agent-builder">Try Elastic</a> for building AI agents connected to the relevant context in your custom data.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing Elasticsearch support in the Google MCP Toolbox for Databases]]></title>
            <description><![CDATA[Explore how Elasticsearch support is now available in the Google MCP Toolbox for Databases and leverage ES|QL tools to securely integrate your index with any MCP client.]]></description>
            <link>https://www.elastic.co/search-labs/blog/google-mcp-toolbox-elasticsearch-support</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/google-mcp-toolbox-elasticsearch-support</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Enrico Zimuel, Laurent Saint-Félix]]></dc:creator>
            <pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/425a48691f436ed47c9bdfaf5d561ac122b2c472-1062x668.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll walk through how to use the Google MCP Toolbox with <a href="https://github.com/elastic/elasticsearch">Elasticsearch</a> to build a simple tool for extracting information from an Elasticsearch index.</p><p>We recently contributed to the <a href="https://github.com/googleapis/genai-toolbox">Google MCP Toolbox for Databases</a> open-source project by adding support for Elasticsearch as a database.</p><p>With this new feature, you can now use the Google MCP Toolbox to connect to Elasticsearch and directly “converse” with your data.</p><h2>Elasticsearch</h2><p>We need to have an Elasticsearch instance running. You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local">start-local</a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and generate an API key to be used for configuring Google MCP Toolbox.</p><p>The API key will be shown as output of the previous command and stored in a .env file in the elastic-start-local folder.</p><h2>Install the example dataset</h2><p>After the installation, you can log in to Kibana using the username <em>elastic</em> and the password generated by the start-local script (stored in a .env file).</p><p>You can install the <strong>eCommerce orders </strong>data set available from Kibana. It includes a single index named <strong>kibana_sample_data_ecommerce</strong> containing information about 4,675 orders from an ecommerce website. 
For each order, we have the following information:</p><ul><li>Customer information (name, ID, birth date, email, etc.)</li><li>Order date</li><li>Order ID</li><li>Products (list of all the products with price, quantity, ID, category, discount, etc.)</li><li>SKU</li><li>Total price (taxless, taxed)</li><li>Total quantity</li><li>Geo information (city, country, continent, location, region)</li></ul><p>To install the sample data, open the <strong>Integrations</strong> page in Kibana (search for “Integration” in the search top bar) and install the “Sample Data”. For more details, refer to the documentation here: <a href="https://www.elastic.co/docs/explore-analyze/#gs-get-data-into-kibana">https://www.elastic.co/docs/explore-analyze/#gs-get-data-into-kibana</a>.</p><p>The goal of this article is to show how easy it is to configure Google MCP Toolbox to connect to Elasticsearch and interact with the <strong>kibana_sample_data_ecommerce</strong> index using natural language.</p><h2>Google MCP Toolbox</h2><p>The Google MCP Toolbox is an open-source MCP server designed to make it easy for applications and AI agents to interact with databases securely and efficiently. Previously known as the “GenAI Toolbox for Databases,” the project was renamed after adopting full compatibility with the <a href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol</a> (MCP). Its purpose is to remove the heavy lifting traditionally required when connecting agents to databases by handling connection pooling, authentication, observability, and other operational concerns behind the scenes.</p><p>At its core, the Toolbox allows developers to define reusable, high-level tools that encapsulate database interactions. These tools can then be invoked by any MCP-compatible client—such as an AI agent—without requiring the client to implement low-level SQL queries or manage database connections. 
This approach dramatically reduces the amount of boilerplate code needed to build database-aware agents, making it possible to integrate advanced data operations in only a few lines of application logic. Once a tool is defined, it can be shared across multiple agents, frameworks, or languages (Figure 1).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/137cea290bb70ad5da21853f9a6358cef4cf7451-1248x1056.png" style="width: 100%;" /><p>A major advantage of using the Toolbox is the built-in security model. Authentication flows such as OAuth2 and OIDC are supported natively, allowing developers to avoid handling or storing sensitive database credentials in agents. The platform also provides observability features—including metrics and tracing—via OpenTelemetry, which is essential for debugging, monitoring, and production deployments. Altogether, MCP Toolbox serves as a unified, secure, and extensible interface for interacting with your data from any MCP-enabled system.</p><h2>How to install MCP Toolbox</h2><p>You can install the MCP Toolbox server on Linux using the following command:</p><pre><code>export VERSION=0.21.0
curl -L -o toolbox https://storage.googleapis.com/genai-toolbox/v$VERSION/linux/amd64/toolbox
chmod +x toolbox</code></pre><p>If you want to install it on macOS or Windows, you can follow the instructions detailed <a href="https://googleapis.github.io/genai-toolbox/getting-started/introduction/#installing-the-server">here</a>.</p><h2>Configure Toolbox for Elasticsearch</h2><p>To configure the MCP Toolbox for Elasticsearch, we need to create a <strong>tools.yaml</strong> file, as follows:</p><pre><code>sources:
  my-cluster:
    kind: elasticsearch
    addresses:
      - http://localhost:9200
    apikey: <insert-here-api-key>

tools:
  customer-orders:
    kind: elasticsearch-esql
    source: my-cluster
    description: Get the orders made by a customer identified by name.
    query: |
      FROM kibana_sample_data_ecommerce | WHERE MATCH(customer_full_name, ?name, {"operator": "AND"})
    parameters:
      - name: name
        type: string
        description: The customer name.

toolsets:
  elasticsearch-tools:
    - customer-orders</code></pre><p>You need to replace the <strong>&lt;insert-here-api-key&gt;</strong> value with a valid Elasticsearch API key. If you are running Elasticsearch locally using start-local, you can find the API key in the .env file generated by start-local, under the <strong>ES_LOCAL_API_KEY</strong> variable. If you are using Elastic Cloud, you can generate an API key by following the procedure described <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys">here</a>.</p><p>The <strong>customer-orders</strong> tool defined above uses the following ES|QL query for Elasticsearch:</p><pre><code>FROM kibana_sample_data_ecommerce | WHERE MATCH(customer_full_name, ?name, {"operator": "AND"})</code></pre><p>If you are not familiar with ES|QL, it is a query language developed by Elastic, similar to SQL, that can be used to search across one or more indices. You can read more about ES|QL in the official documentation <a href="https://www.elastic.co/docs/reference/query-languages/esql">here</a>.</p><p>The query above searches for all orders stored in the <strong>kibana_sample_data_ecommerce</strong> index that match the specified customer’s name, using the <strong>?name</strong> parameter (the question mark denotes a parameter). The <code>{"operator": "AND"}</code> option requires every term of the name to match.</p><p>The customer’s name is defined in the earlier YAML configuration using the type string and the description “The customer name.”</p><p>This tool can be used to answer questions about a customer’s orders—for example: <em>How many orders did customer Foo place in October 2025?</em></p><p>The descriptions of the tools and their parameters are essential for extracting the relevant information from the user’s natural-language request. This extraction is performed using the <strong>function-calling</strong> capability of a Large Language Model (LLM). 
In practice, an LLM can determine which function (tool) needs to be executed to obtain the necessary information, along with the appropriate parameters for that function.</p><p>For more information about function calls, we suggest reading the <a href="https://www.elastic.co/search-labs/blog/function-calling-with-elastic">OpenAI function calling with Elasticsearch</a> article by Ashish Tiwari.</p><h2>Run the Toolbox server</h2><p>You can run the MCP Toolbox using the previous tools.yaml file with the following command:</p><pre><code>./toolbox --tools-file tools.yaml --ui</code></pre><p>The <strong>--ui</strong> parameter runs a web application at <a href="http://127.0.0.1:5000/ui">http://127.0.0.1:5000/ui</a> (Figure 2).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3caf2339b632bafd5847af1ed8b33b518a25b8a2-1600x314.png" style="width: 100%;" /><p>You can select <strong>Tools</strong> &gt; <strong>customer-orders</strong>, insert a customer name in the parameter <strong>name</strong> (e.g. Gwen Sanders), and click the <strong>Run Tool</strong> button. You should see a JSON response as reported in Figure 3.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b167e0142afb8919d9cedf6d0fa431d33d0e55f8-1600x933.png" style="width: 100%;" /><p>The setup is complete, and the MCP Toolbox can execute the <strong>customer-orders</strong> tool to communicate with Elasticsearch, running the ES|QL query.</p><h2>Using the MCP Toolbox with Gemini CLI</h2><p>We can use any MCP client to communicate with the MCP Toolbox for Databases. For instance, we can use <a href="https://github.com/google-gemini/gemini-cli">Gemini CLI</a>, a command-line interface for Gemini. 
You can install Gemini CLI following the instructions reported <a href="https://geminicli.com/docs/get-started/installation/">here</a>.</p><p>Gemini CLI offers a pre-configured extension for MCP Toolbox, available at <a href="https://github.com/gemini-cli-extensions/mcp-toolbox">gemini-cli-extensions/mcp-toolbox</a>. You can install this extension by running the following command:</p><pre><code>gemini extensions install https://github.com/gemini-cli-extensions/mcp-toolbox</code></pre><p>After the installation, you need to go into the directory where you stored the tools.yaml configuration file for MCP Toolbox and execute Gemini CLI as follows (this step is required for the Gemini CLI to be automatically configured with MCP Toolbox):</p><pre><code>gemini</code></pre><p>You should see output as reported in Figure 4.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0f22df6d3da13c1dc50dcb560414fa7c630eb9a7-1434x341.png" style="width: 100%;" /><p>You can check if the MCP Toolbox is connected using the following command:</p><pre><code>/mcp list</code></pre><p>You should see <strong>mcp_toolbox</strong> with the <strong>customer-orders</strong> tool listed (Figure 5).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/97edbc40de9e44f469f6f3a09427532be167de0e-493x155.png" style="width: 100%;" /><p>If the MCP Toolbox is connected to the Gemini CLI, we can now try asking some questions, such as: “<em>Give me the orders for the customer Gwen Sanders</em>.” The Gemini CLI will then request permission to execute the customer-orders tool from the mcp_toolbox server (see Figure 6).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c052f3b5e49436903b804280c0065f67ee02444b-1432x284.png" style="width: 100%;" /><p>After the confirmation, Gemini CLI will execute the request to the MCP Toolbox, getting a JSON response as a result and using it to format the response (Figure 7).</p><img 
src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ea5128f1705951c2757af6da4b456d394d4a080-1432x734.png" style="width: 100%;" /><p>The response from Gemini CLI will report that Gwen Sanders placed just one order of 2 products, for a total price of 132 euros.</p><h2>MCP Toolbox SDKs</h2><p>Google MCP Toolbox also offers SDKs to access all of its functionality from programs written in Go, Python, and JavaScript.</p><p>For instance, the Python SDK is available on GitHub at the following page: <a href="https://github.com/googleapis/mcp-toolbox-sdk-python">https://github.com/googleapis/mcp-toolbox-sdk-python</a>.</p><p>We need to create a simple agent to connect to the MCP Toolbox. We need to install the following packages:</p><pre><code>pip install toolbox-core
pip install google-adk</code></pre><p>And create a new agent project using the following command:</p><pre><code>adk create my_agent</code></pre><p>This will create a new directory named <strong>my_agent</strong> with a file <strong>agent.py</strong>.</p><p>Update <strong>my_agent/agent.py</strong> with the following content to connect to Toolbox:</p><pre><code>from google.adk import Agent
from google.adk.apps import App
from toolbox_core import ToolboxSyncClient

client = ToolboxSyncClient("http://127.0.0.1:5000")

root_agent = Agent(
    name='root_agent',
    model='gemini-2.5-flash',
    instruction="You are a helpful AI assistant designed to search information about a dataset of ecommerce orders.",
    tools=client.load_toolset(),
)

app = App(root_agent=root_agent, name="my_agent")</code></pre><p>Create a <strong>.env</strong> file with your Google API key:</p><pre><code>echo 'GOOGLE_API_KEY="YOUR_API_KEY"' > my_agent/.env</code></pre><p>Finally, we can run the agent and observe the results. To execute the agent, you can run the following command:</p><pre><code>adk run my_agent</code></pre><p>Or, you can serve it via a web interface:</p><pre><code>adk web --port 8000</code></pre><p>In both cases, you can interact with the MCP Toolbox using a Q&amp;A interface. For instance, you can ask the previous question: <em>Give me the orders of the customer Gwen Sanders</em>.</p><p>For more information about the different SDKs, you can refer to <a href="https://googleapis.github.io/genai-toolbox/sdks/">this documentation page</a>.</p><h2>Conclusion</h2><p>In this article, we demonstrated the Elasticsearch integration for the Google MCP Toolbox for Databases. Using a simple YAML configuration file, we can define a set of tools that translate natural-language questions into Elasticsearch queries using the ES|QL language.</p><p>We showed how to interact with the kibana_sample_data_ecommerce dataset, which contains orders from an e-commerce website. With this configuration file, we can simply run the MCP Toolbox server and connect to it from any MCP client.</p><p>Finally, we demonstrated how to use the Gemini CLI as a client to connect to the MCP Toolbox for Databases and query the e-commerce data stored in Elasticsearch. We executed a natural-language query to retrieve information about orders for a specific customer identified by name.</p><p>As the MCP ecosystem continues to grow, this pattern—lightweight tool definitions backed by secure, production-ready infrastructure—creates new opportunities for building increasingly capable, data-aware agents with minimal effort. 
Whether you&#x27;re experimenting locally with Elastic’s sample datasets or integrating search capabilities into a larger application, MCP Toolbox provides a reliable, extensible foundation for interacting with your Elasticsearch data using natural language.</p><p>For more information about the development of agentic AI applications, you can read the <a href="https://search-labs-redesign.vercel.app/search-labs/blog/ai-agentic-workflows-elastic-ai-agent-builder">Building AI Agentic workflows with Elasticsearch</a> article by Anish Mathur and Dana Juratoni.</p><p>For more information about the Google MCP Toolbox, you can visit <a href="https://googleapis.github.io/genai-toolbox/getting-started/introduction/">https://googleapis.github.io/genai-toolbox/getting-started/introduction/</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Evaluating search query relevance with judgment lists]]></title>
            <description><![CDATA[Explore how to build judgment lists to objectively evaluate search query relevance and improve performance metrics such as recall, for scalable search testing in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jhon Guzmán]]></dc:creator>
            <pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25478d0ffb228afd5d65d82312998ec1c299c565-700x490.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Developers working on search engines often encounter the same issue: the business team is not satisfied with one particular search because the documents they expect to be at the top of the search results appear third or fourth on the list of results.</p><p>However, when you fix this one issue, you accidentally break other queries since you couldn’t test all cases manually. But how can you or your QA team test if a change in one query has a ripple effect in other queries? Or even more importantly, how can you be sure that your changes actually improved a query?</p><h2>Towards a systematic evaluation</h2><p>Here is where judgment lists come in useful. Instead of depending on manual and subjective testing any time you make a change, you can define a fixed set of queries that are relevant for your business case, together with their relevant results.</p><p>This set becomes your baseline. Every time you implement a change, you use it to evaluate if your search actually improved or not.</p><p>The value of this approach is that it:</p><ul><li><strong>Removes uncertainty</strong>: you no longer need to wonder if your changes impact other queries; the data will tell you.</li><li><strong>Stops manual testing</strong>: once the judgment sets are recorded, the test is automatic.</li><li><strong>Supports changes</strong>: You can show clear metrics that support the benefits of a change.</li></ul><h2>How to start building your judgment list</h2><p>One of the easiest ways to start is to take a representative query and manually select the relevant documents. There are two ways to do this list:</p><ul><li><strong>Binary Judgments:</strong> Each document associated with a query gets a <strong>simple tag</strong>: <em>relevant</em> (usually with a score of “1”) and not-relevant (“0”).</li><li><strong>Graded Judgments:</strong> Here, each document gets a score with different levels. 
For example: setting a 0 to 4 scale, similar to a <a href="https://en.wikipedia.org/wiki/Likert_scale">Likert scale</a>, where 0 = “not at all relevant” and 4 = “totally relevant,” with variations like “relevant,” “somewhat relevant,” etc.</li></ul><p>Binary judgments work well when the search intent has clear limits: Should this document be in the results or not?</p><p>Graded judgments are more useful when there are grey areas: some results are better than others, so you can get “very good,” “good,” and “useless” results and use metrics that value the order of the results and the user’s feedback. However, graded scales also introduce drawbacks: different reviewers may use the scoring levels differently, which makes the judgments less consistent. And because graded metrics give more weight to higher scores, even a small change (like rating something a 3 instead of a 4) can create a much bigger shift in the metric than the reviewer intended. This added subjectivity makes graded judgments noisier and harder to manage over time.</p><h2>Do I need to classify the documents myself?</h2><p>Not necessarily, since there are different ways to create your judgment list, each with its own advantages and disadvantages:</p><ul><li><strong>Explicit Judgments:</strong> Here, SMEs go over each query/document pair and manually decide whether (or how) relevant it is. Though this provides quality and control, it is less scalable.</li><li><strong>Implicit Judgments:</strong> With this method, you infer the relevant documents based on real-user behavior like clicks, bounce rate, and purchases, among others. This approach allows you to gather data automatically, but it might be biased. 
For example, users tend to click top results more often, even if they are not relevant.</li><li><strong>AI-Generated Judgments:</strong> This last option uses models (like LLMs) to automatically evaluate queries and documents, often referred to as <a href="https://en.wikipedia.org/wiki/LLM-as-a-Judge">LLM juries</a>. It’s fast and easy to scale, but the quality of the data depends on the quality of the model you’re using and how well the LLM’s training data aligns with your business interests. As with human grades, LLM juries can introduce their own biases or inconsistencies, so it’s important to validate their output against a smaller set of trusted judgments. LLMs are probabilistic by nature, so it is not uncommon to see a model give different grades to the same result even when the <a href="https://www.ibm.com/think/topics/llm-temperature">temperature</a> parameter is set to 0.</li></ul><p>Below are some recommendations to choose the best method for creating your judgment set:</p><ul><li>Decide how critical the features that only users can properly judge (like price, brand, language, style, and product details) are for your use case. If those are critical, you need <strong>explicit judgments</strong> for at least some part of your <em>judgment list</em>.</li><li>Use <strong>implicit judgments</strong> when your search engine already has enough traffic that you can use clicks, conversions, and dwell-time metrics to detect usage trends. You should still interpret these carefully, contrasting them with your explicit judgment sets to prevent any bias (e.g., users tend to click top-ranked results more often, even if lower-ranked results are more relevant).</li></ul><p>To address this, position debiasing techniques adjust or reweight click data to better reflect true user interest. 
Some approaches include:</p><ul><li><strong>Results shuffling</strong>: Change the order of search results for a subset of users to estimate how position affects clicks.</li><li><strong>Click models</strong>: These include the <a href="https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_Dynamic_Bayesian_Network_Click_Model_for_web_search_ranking">Dynamic Bayesian Network (<strong>DBN</strong>)</a> and the <a href="https://rsrikant.com/papers/kdd10.pdf">User Browsing Model (<strong>UBM</strong>)</a>. These statistical models estimate the probability that a click reflects real interest rather than just position, using patterns like scrolling, dwell time, click sequence, and returning to the results page.</li></ul><h2>Example: Movie rating app</h2><h3>Prerequisites</h3><p>To run this example, you need a running Elasticsearch 8.x cluster, either <a href="https://www.elastic.co/downloads/elasticsearch">locally</a> or on <a href="https://www.elastic.co/cloud/cloud-trial-overview">Elastic Cloud</a> (Hosted or Serverless), and access to the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis">REST API</a> or Kibana.</p><p>Think about an app in which users can upload their opinions about movies and also search for movies to watch. Since the texts are written by the users themselves, they can contain typos and many variations in expression, so it’s essential that the search engine is able to interpret that diversity and provide helpful results for the users.</p><p>To be able to iterate on queries without impacting the overall search behavior, the business team in your company created the following binary judgment set, based on the most frequent searches:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query</td><td style="border: 1px solid #ccc; padding: 4px;">DocID</td><td style="border: 1px solid #ccc; padding: 4px;">Text</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">DiCaprio performance</td><td style="border: 1px solid #ccc; padding: 4px;">doc1</td><td style="border: 1px solid #ccc; padding: 4px;">DiCaprio's performance in The Revenant was breathtaking.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">DiCaprio performance</td><td style="border: 1px solid #ccc; padding: 4px;">doc2</td><td style="border: 1px solid #ccc; padding: 4px;">Inception shows Leonardo DiCaprio in one of his most iconic roles.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">DiCaprio performance</td><td style="border: 1px solid #ccc; padding: 4px;">doc3</td><td style="border: 1px solid #ccc; padding: 4px;">Brad Pitt delivers a solid performance in this crime thriller.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">DiCaprio performance</td><td style="border: 1px solid #ccc; padding: 4px;">doc4</td><td style="border: 1px solid #ccc; padding: 4px;">An action-packed adventure with stunning visual effects.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">sad movies that make you cry</td><td style="border: 1px solid #ccc; padding: 4px;">doc5</td><td style="border: 1px solid #ccc; padding: 4px;">A heartbreaking story of love and loss that made me cry for hours.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">sad movies that make you cry</td><td style="border: 1px solid #ccc; padding: 4px;">doc6</td><td style="border: 1px solid #ccc; padding: 4px;">One of the saddest movies ever made — bring tissues!</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">sad movies that make you cry</td><td style="border: 1px solid #ccc; padding: 4px;">doc7</td><td style="border: 1px solid #ccc; padding: 4px;">A lighthearted comedy that will make you laugh</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">sad movies that make you cry</td><td style="border: 1px solid #ccc; padding: 4px;">doc8</td><td style="border: 1px solid #ccc; padding: 4px;">A science-fiction epic full of action and excitement.</td>
                    </tr>
              </tbody>
            </table>
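<p>To make the idea concrete, the judgment list above can be encoded as a plain data structure and used to compute metrics by hand. The sketch below is illustrative only: the doc IDs and ratings mirror the table, and the <code>recall</code> function is a simplified reimplementation of the metric discussed later in this article, not part of any Elasticsearch client.</p>

```python
# Hypothetical encoding of the judgment list above: for each query,
# the documents a human rated as relevant (1) or irrelevant (0).
judgment_list = {
    "DiCaprio performance": {"doc1": 1, "doc2": 1, "doc3": 0, "doc4": 0},
    "sad movies that make you cry": {"doc5": 1, "doc6": 1, "doc7": 0, "doc8": 0},
}

def recall(retrieved_ids, ratings, threshold=1):
    """Fraction of the relevant set that shows up in the retrieved results."""
    relevant = {doc for doc, rating in ratings.items() if rating >= threshold}
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

# If a query retrieved doc1 and doc2, every relevant document was found:
print(recall(["doc1", "doc2"], judgment_list["DiCaprio performance"]))  # 1.0
```

<p>This is essentially the calculation that the Ranking Evaluation API performs for us server-side, against live queries and at scale.</p>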
          <p>Creating the index:</p><pre><code>PUT movies
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      }
    }
  }
}</code></pre><p>Indexing the documents with a bulk request:</p><pre><code>POST /movies/_bulk
{ "index": { "_id": "doc1" } }
{ "text": "DiCaprio performance in The Revenant was breathtaking." }
{ "index": { "_id": "doc2" } }
{ "text": "Inception shows Leonardo DiCaprio in one of his most iconic roles." }
{ "index": { "_id": "doc3" } }
{ "text": "Brad Pitt delivers a solid performance in this crime thriller." }
{ "index": { "_id": "doc4" } }
{ "text": "An action-packed adventure with stunning visual effects." }
{ "index": { "_id": "doc5" } }
{ "text": "A heartbreaking story of love and loss that made me cry for hours." }
{ "index": { "_id": "doc6" } }
{ "text": "One of the saddest movies ever made -- bring tissues!" }
{ "index": { "_id": "doc7" } }
{ "text": "A lighthearted comedy that will make you laugh." }
{ "index": { "_id": "doc8" } }
{ "text": "A science-fiction epic full of action and excitement." }</code></pre><p>Below is the Elasticsearch query the app is using:</p><pre><code>GET movies/_search
{
 "query": {
   "match": {
     "text": {
       "query": "DiCaprio performance",
       "minimum_should_match": "100%"
     }
   }
 }
}</code></pre><h3>From judgment to metrics</h3><p>By themselves, judgment lists do not provide much information; they are only an expectation of the results from our queries. Where they really shine is when we use them to calculate objective metrics to measure our search performance.</p><p>Some of the most popular metrics include:</p><ul><li><strong><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-precision">Precision</a>: </strong>Measures the proportion of results that are truly relevant within all search results.</li><li><strong><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-recall">Recall</a>: </strong>Measures the proportion of all relevant documents that the search engine retrieved within the top k results.</li><li><strong><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#_discounted_cumulative_gain_dcg">Discounted Cumulative Gain (DCG)</a>: </strong>Measures the quality of the results’ ranking, considering that the most relevant results should be at the top.</li><li><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#_mean_reciprocal_rank"><strong>Mean Reciprocal Rank (MRR):</strong></a> Measures the position of the first relevant result. The higher it is in the list, the higher its score.</li></ul><p>Using the same movie rating app as an example, we’ll calculate the recall metric to see if there’s any information that is being left out of our queries.</p><p>In Elasticsearch, we can use the <em>judgment lists</em> to calculate metrics via the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval">Ranking Evaluation API</a>. 
This API receives as input the judgment list, the query, and the metric you want to evaluate, and returns a value, which is a comparison of the query result with the judgment list.</p><p>Let’s run the judgment list for the two queries that we have:</p><pre><code>POST /movies/_rank_eval
{
 "requests": [
   {
     "id": "dicaprio-performance",
     "request": {
       "query": {
         "match": {
           "text": {
             "query": "DiCaprio performance",
             "minimum_should_match": "100%"
           }
         }
       }
     },
     "ratings": [
       {
         "_index": "movies",
         "_id": "doc1",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc2",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc3",
         "rating": 0
       },
       {
         "_index": "movies",
         "_id": "doc4",
         "rating": 0
       }
     ]
   },
   {
     "id": "sad-movies",
     "request": {
       "query": {
         "match": {
           "text": {
             "query": "sad movies that make you cry",
             "minimum_should_match": "100%"
           }
         }
       }
     },
     "ratings": [
       {
         "_index": "movies",
         "_id": "doc5",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc6",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc7",
         "rating": 0
       },
       {
         "_index": "movies",
         "_id": "doc8",
         "rating": 0
       }
     ]
   }
 ],
 "metric": {
   "recall": {
     "k": 10,
     "relevant_rating_threshold": 1
     }
 }
}</code></pre><p>We’ll use two requests to _rank_eval: one for the DiCaprio query and another for sad movies. Each request includes a query and its judgment list (ratings). We don’t need to grade all documents, since any document not included in the ratings is simply treated as unrated. To do the calculations, recall only considers the “relevant set”: the documents rated as relevant.</p><p>In this case, the DiCaprio query has a recall of 1, while the sad movies query got 0. This means that for the first query, we were able to get all relevant results, while in the second query, we did not get any. The average recall is therefore 0.5.</p><pre><code>{
 "metric_score": 0.5,
 "details": {
   "dicaprio-performance": {
     "metric_score": 1,
     "unrated_docs": [],
     "hits": [
       {
         "hit": {
           "_index": "movies",
           "_id": "doc1",
           "_score": 2.4826927
         },
         "rating": 1
       },
       {
         "hit": {
           "_index": "movies",
           "_id": "doc2",
           "_score": 2.0780432
         },
         "rating": 1
       }
     ],
     "metric_details": {
       "recall": {
         "relevant_docs_retrieved": 2,
         "relevant_docs": 2
       }
     }
   },
   "sad-movies": {
     "metric_score": 0,
     "unrated_docs": [],
     "hits": [],
     "metric_details": {
       "recall": {
         "relevant_docs_retrieved": 0,
         "relevant_docs": 2
       }
     }
   }
 },
 "failures": {}
}</code></pre><p>Maybe we’re being too strict with the <strong>minimum_should_match</strong> parameter: by demanding that 100% of the words in the query are found in the documents, we’re probably leaving relevant results out. Let’s remove the <strong>minimum_should_match</strong> parameter so that a document is considered a match if at least one word of the query is found in it.</p><pre><code>POST /movies/_rank_eval
{
 "requests": [
   {
     "id": "dicaprio-performance",
     "request": {
       "query": {
         "match": {
           "text": {
             "query": "DiCaprio performance"
           }
         }
       }
     },
     "ratings": [
       {
         "_index": "movies",
         "_id": "doc1",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc2",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc3",
         "rating": 0
       },
       {
         "_index": "movies",
         "_id": "doc4",
         "rating": 0
       }
     ]
   },
   {
     "id": "sad-movies",
     "request": {
       "query": {
         "match": {
           "text": {
             "query": "sad movies that make you cry"
           }
         }
       }
     },
     "ratings": [
       {
         "_index": "movies",
         "_id": "doc5",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc6",
         "rating": 1
       },
       {
         "_index": "movies",
         "_id": "doc7",
         "rating": 0
       },
       {
         "_index": "movies",
         "_id": "doc8",
         "rating": 0
       }
     ]
   }
 ],
 "metric": {
   "recall": {
     "k": 10,
     "relevant_rating_threshold": 1
     }
 }
}</code></pre><p>As you can see, by removing the <strong>minimum_should_match</strong> parameter from both queries, we now get a recall of 1 for each of them.</p><pre><code>{
  "metric_score": 1,
  "details": {
    "dicaprio-performance": {
      "metric_score": 1,
      "unrated_docs": [],
      "hits": [
        {
          "hit": {
            "_index": "movies",
            "_id": "doc1",
            "_score": 2.0661702
          },
          "rating": 1
        },
        {
          "hit": {
            "_index": "movies",
            "_id": "doc3",
            "_score": 0.732218
          },
          "rating": 0
        },
        {
          "hit": {
            "_index": "movies",
            "_id": "doc2",
            "_score": 0.6271719
          },
          "rating": 1
        }
      ],
      "metric_details": {
        "recall": {
          "relevant_docs_retrieved": 2,
          "relevant_docs": 2
        }
      }
    },
    "sad-movies": {
      "metric_score": 1,
      "unrated_docs": [],
      "hits": [
        {
          "hit": {
            "_index": "movies",
            "_id": "doc7",
            "_score": 2.1307156
          },
          "rating": 0
        },
        {
          "hit": {
            "_index": "movies",
            "_id": "doc5",
            "_score": 1.3160692
          },
          "rating": 1
        },
        {
          "hit": {
            "_index": "movies",
            "_id": "doc6",
            "_score": 1.190063
          },
          "rating": 1
        }
      ],
      "metric_details": {
        "recall": {
          "relevant_docs_retrieved": 2,
          "relevant_docs": 2
        }
      }
    }
  },
  "failures": {}
}</code></pre><p>In summary, removing the <strong>minimum_should_match: 100%</strong> clause allows us to get perfect recall for both queries.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24d055da4348c63827ba7046fe8cafb6f47cadd8-546x628.png" style="width: 100%;" /><p>We did it! Right?</p><p>Not so fast!</p><p>By improving recall, we open the door to a wider range of results. However, each adjustment implies a trade-off. This is why it’s important to define complete test cases and to use different metrics to evaluate your changes.</p><p>Using judgment lists and metrics prevents you from going in blind when making changes, since you now have data to back them up. Validation is no longer manual and repetitive, and you can test your changes in more than just one use case. Additionally, A/B testing allows you to test live which configuration works best for your users and business case, thus closing the loop between technical metrics and real-world metrics.</p><h2>Final recommendations for using judgment lists</h2><p>Working with judgment lists is not only about measuring but also about creating a framework that allows you to iterate with confidence. To achieve this, you can follow these recommendations:</p><ol><li><strong>Start small, but start</strong>. You don’t need to have 10,000 queries with 50 judgments each. You only need to identify the 5–10 most critical queries for your business case and define which documents you expect to see at the top of the results. This already gives you a base. You typically want to start with the top queries plus the queries with no results. You can also start testing with an easy-to-configure metric like Precision and then work your way up in complexity.</li><li><strong>Validate with users.</strong> Complement the numbers with A/B testing in production. 
This way, you’ll know if changes that look good in the metrics are also generating a real impact.</li><li><strong>Keep the list alive.</strong> Your business case will evolve, and so will your critical queries. Update your judgment lists periodically to reflect new needs.</li><li><strong>Make it part of the flow.</strong> Integrate judgment lists into your development pipelines. Make sure each configuration change, synonym update, or text analysis change is automatically validated against your base list.</li><li><strong>Connect technical knowledge with strategy.</strong> Don’t stop at measuring technical metrics like precision or recall. Use your evaluation results to inform business outcomes.</li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to improve e-commerce search relevance with personalized cohort-aware ranking]]></title>
            <description><![CDATA[Improve e-commerce search relevance with explainable, cohort-aware ranking in Elasticsearch. Learn how multiplicative boosting delivers stable, predictable personalization at query time.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-relevance-cohort-aware-ranking-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-relevance-cohort-aware-ranking-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e88ddfa1b453327fe75211500b850b49ad3776f5-1172x844.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2>Overview</h2><p>In this article, we explore how to make Elasticsearch search results more relevant for different e-commerce user segments using an explainable, multiplicative boosting strategy — without any machine learning post-processing.</p><h2>Introduction: Why personalization matters</h2><p>Elasticsearch is very good at ranking results by textual relevance (BM25) and by semantic relevance (vectors). In e-commerce, that is necessary but not sufficient. Two people can type the same query and reasonably expect different results:</p><ul><li>A luxury shopper searching for “red lipstick” expects prestige brands near the top.</li><li>A budget shopper wants affordable options promoted.</li><li>A gift buyer may prefer popular bundles.</li></ul><p>The goal is to adjust ranking so that, for a given query, products that align with the user’s segment rise modestly in the list, without destroying the underlying relevance. This article shows how to add cohort-aware personalization on top of Elasticsearch’s relevance using only <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">function_score</a>, a keyword field, and small multiplicative boosts.</p><h2>Multiplicative boosting for cohort personalization</h2><p>The core challenge in cohort personalization is stability. You want a product that is relevant to the query to remain relevant, with a controlled, explainable uplift when it matches the user’s segment. What often goes wrong is that personalization signals are added to the score in a way that either:</p><ul><li>overwhelms BM25 on some queries, or</li><li>has almost no effect on others.</li></ul><p>This happens because most boosting approaches use additive scoring. 
However, BM25 scales can vary dramatically across queries and datasets, so a fixed additive adjustment (e.g., “add +2.0 for a cohort match”) is sometimes a massive change to the BM25 score, and other times is negligible. Instead, what we want is a guarantee that if a product is a good match for the query, and it aligns with the user’s cohort, then its score is increased by a controlled percentage regardless of the absolute BM25 scale. We can achieve this with a multiplicative pattern:</p><pre><code>final_score = BM25 × (1 + cohort_overlap × weight_per_cohort)</code></pre><p>This article shows how to implement this pattern using Elasticsearch’s function_score query, a cohorts field on the product, and a list of user cohorts passed at query time.</p><h2>Modeling cohorts in your product catalog</h2><p>The simplest way to enable cohort-aware ranking is to treat cohorts as tags. For example, a product might carry tags such as:</p><ul><li>Lipstick: [&quot;female&quot;, &quot;beauty&quot;, &quot;luxury&quot;]</li><li>Men’s deodorant: [&quot;male&quot;, &quot;personal_care&quot;, &quot;sport&quot;]</li><li>Glitter gloss: [&quot;female&quot;, &quot;beauty&quot;, &quot;youth&quot;, &quot;party&quot;].</li></ul><p>A user or session carries a set of tags inferred from behavior and profile:</p><ul><li>High-income female luxury shopper: [&quot;female&quot;, &quot;beauty&quot;, &quot;luxury&quot;]</li><li>Budget-oriented female shopper: [&quot;female&quot;, &quot;beauty&quot;, &quot;budget&quot;]</li></ul><p>Cohort overlap is the count of shared cohort tags between the user/session and the product. No weighting, no semantic similarity — just a simple intersection. For example, if the user cohorts are [“female”, “beauty”, “budget”] and a lipstick has [“female”, “beauty”, “luxury”], the overlap is 2. 
If a men’s deodorant has [“male”, “personal_care”, “sport”], the overlap with that same user is 0.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e88ddfa1b453327fe75211500b850b49ad3776f5-1172x844.png" style="width: 100%;" /><p>The intuition is that (a) BM25 ranks documents depending on how relevant they are to the user’s query, and (b) cohort overlap boosts products based on how well each product aligns with the user&#x27;s segment. To accomplish this, we transform the cohort overlap between the user cohorts and the product cohorts into a multiplicative boost that scales BM25.</p><p>To avoid field explosion, we keep all cohort tags in a single keyword field, such as follows:</p><pre><code>{
  "product_id": "LIP-001",
  "description": "Premium cherry red lipstick with velvet finish",
  "cohorts": ["female", "beauty", "luxury"]
}</code></pre><p>This is easy for merchandisers to understand, avoids hundreds of boolean fields like is_female or is_luxury, and works efficiently with term filters.</p><h2>Why additive boosts don’t work</h2><p>One subtle but important point is that even a standard boolean query is additive. When Elasticsearch scores a document, the base BM25 score from the main query (typically in a <code>must</code>) and every matching <code>should</code> clause each contribute additively to the final score. This means “additive boosting” is not just an explicit boost setting; it’s fundamental to how boolean scoring works.</p><p>Personalization built on additive logic behaves inconsistently because BM25 scales differ per query and dataset. For example, the base BM25 scores for three products might be 12, 8, 4 in one instance, and might be 0.12, 0.08, 0.04 after updating your dataset or modifying your query. In this case, an additive boost (e.g., +2.0) becomes a dominating force when the base BM25 scores are small (a +2.0 boost on a score of 0.12 is about 18× higher) or a rounding error when the base BM25 scores are large (a +2.0 boost on a score of 12 is only about 1.17× higher). This creates inconsistent, unpredictable ranking behavior.</p><h2>Why multiplicative boosting is the right shape</h2><p>If we apply a multiplicative boost, the shape is consistent:</p><pre><code>final_score = BM25 × boost
boost = 1 + overlap × weight_per_cohort</code></pre><p>With weight_per_cohort = 0.1, an overlap of 2 gives a boost of 1.2 (20% increase), an overlap of 1 gives a boost of 1.1 (10% increase), and an overlap of 0 gives a boost of 1.0 (no change). This means that a product that is more aligned with the user cohort gets a predictable percentage uplift, regardless of whether its BM25 score is 0.01 or 10.0. BM25 remains the primary signal; cohort alignment gently reshapes the ranking.</p><h2>How function_score gives us multiplicative behavior</h2><p>To convert cohort overlap into a controlled percentage boost, we need a way to take the normal BM25 score and scale it up by a factor such as 1.1, 1.2, or 1.3. Elasticsearch does not support multiplying a score directly inside a standard query, but <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">function_score</a> provides exactly this capability: it lets us compute an additional score component and combine it with the base score using a chosen strategy, which is &quot;multiply&quot; for this use case.</p><p>Elasticsearch’s function_score lets us implement multiplicative cohort boosting in four steps. First, each cohort match contributes a small weight (e.g., 0.1). Second, we include a baseline weight of 1.0 so that the final multiplier never drops below 1. Third, we sum all cohort contributions using score_mode: &quot;sum&quot; to produce a boost factor that represents (1 + overlap × weight). Finally, we combine this boost factor with the BM25 score using boost_mode: &quot;multiply&quot;, which gives us the exact multiplicative behavior we want.</p><p>The calculation below shows how the final score is derived, where BM25 is the base relevance; n is the number of matched cohorts; w is weight_per_cohort (e.g., 0.1); and the additive baseline = 1.0:</p><pre><code>sum_score = baseline + n × w
final_score = BM25 × sum_score</code></pre><p>So, with 2 overlapping cohorts and w = 0.1:</p><pre><code>sum_score = 1.0 + 2 × 0.1 = 1.2
final_score = BM25 × 1.2</code></pre><p>This is exactly the multiplicative behavior we want.</p><h2>Putting it together: index, data, and baseline ranking</h2><p>Create a simple index:</p><pre><code>PUT product_catalog
{
  "mappings": {
    "properties": {
      "product_id": {
        "type": "keyword"
      },
      "description": {
        "type": "text"
      },
      "cohorts": {
        "type": "keyword"
      }
    }
  }
}</code></pre><p>Index a few products:</p><pre><code>POST _bulk
{ "index": { "_index": "product_catalog", "_id": "LIP-001" }}
{ "product_id": "LIP-001", "description": "Premium cherry red lipstick with velvet finish", "cohorts": ["female", "beauty", "luxury"] }
{ "index": { "_index": "product_catalog", "_id": "LIP-002" }}
{ "product_id": "LIP-002", "description": "Affordable matte red lipstick for everyday wear", "cohorts": ["female", "beauty", "budget"] }
{ "index": { "_index": "product_catalog", "_id": "LIP-003" }}
{ "product_id": "LIP-003", "description": "Glitter red gloss for parties and festivals", "cohorts": ["female", "beauty", "youth", "party"] }</code></pre><p>A baseline query for “red lipstick” might look like:<br/></p><pre><code>POST product_catalog/_search
{
  "size": 5,
  "_source": ["product_id", "description"],
  "query": {
    "multi_match": {
      "query": "red lipstick",
      "fields": ["description"]
    }
  }
}</code></pre><p>This returns a pure BM25 ranking (without any cohort boosting). In this example, the scores of LIP-001 and LIP-002 will be very close (or identical), because they match the same query terms with similar frequencies and have comparable lengths.</p><p>The relative ranking is what matters; the exact numeric scores may differ depending on shard configuration, analyzer differences, or Elasticsearch version.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Product ID</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td><td style="border: 1px solid #ccc; padding: 4px;">BM25 score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-001</td><td style="border: 1px solid #ccc; padding: 4px;">Premium cherry red lipstick with velvet finish</td><td style="border: 1px solid #ccc; padding: 4px;">0.603535</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-002</td><td style="border: 1px solid #ccc; padding: 4px;">Affordable matte red lipstick for everyday wear</td><td style="border: 1px solid #ccc; padding: 4px;">0.603535</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-003</td><td style="border: 1px solid #ccc; padding: 4px;">Glitter red gloss for parties and festivals</td><td style="border: 1px solid #ccc; padding: 4px;">0.13353139</td>
                    </tr>
              </tbody>
            </table>
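<p>Before wiring the boost into a query, the multiplicative formula can be sanity-checked in plain Python. This is an illustrative sketch (the function name and structure are ours, not an Elasticsearch API); the BM25 score is taken from the baseline table above:</p>

```python
# final_score = BM25 x (1 + overlap x weight_per_cohort)
def boosted_score(bm25, product_cohorts, user_cohorts, weight_per_cohort=0.1):
    """Scale a BM25 score by the cohort-overlap multiplier."""
    overlap = len(set(product_cohorts) & set(user_cohorts))
    return bm25 * (1 + overlap * weight_per_cohort)

user = ["female", "beauty", "luxury"]  # example user cohorts

# LIP-001 shares all three cohorts with this user -> 1.3x uplift (~0.7845955)
print(boosted_score(0.603535, ["female", "beauty", "luxury"], user))
# LIP-002 shares only "female" and "beauty" -> 1.2x uplift (~0.724242)
print(boosted_score(0.603535, ["female", "beauty", "budget"], user))
```

<p>Elasticsearch computes the same factor at query time via function_score, as the following persona examples show.</p>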
          <img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/513bfba3467fb9966ed00b6e798889eeb690fe84-1788x1306.png" style="width: 100%;" /><h3>Persona A: high-income luxury shopper</h3><p>Suppose we know that Persona A fits into the following cohorts:</p><pre><code>["female", "beauty", "luxury"]</code></pre><p>We translate that into a set of cohort filters, each with a small weight, plus a baseline factor:</p><pre><code>GET product_catalog/_search
{
  "explain": true,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "red lipstick",
          "fields": ["description"]
        }
      },
      "functions": [
        { "filter": { "term": { "cohorts": "female" }},  "weight": 0.1 },
        { "filter": { "term": { "cohorts": "beauty" }},  "weight": 0.1 },
        { "filter": { "term": { "cohorts": "luxury" }},  "weight": 0.1 },
        { "weight": 1.0 }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}</code></pre><p>For this persona, LIP-001 (“Premium cherry red lipstick with velvet finish”) matches “female”, “beauty”, and “luxury”, which means that the cohort overlap is 3 and therefore the boost factor is 1.3. On the other hand, LIP-002 and LIP-003 match only “female” and “beauty”, which results in a boost factor of 1.2.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Product ID</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td><td style="border: 1px solid #ccc; padding: 4px;">Base BM25 score</td><td style="border: 1px solid #ccc; padding: 4px;">Boost factor</td><td style="border: 1px solid #ccc; padding: 4px;">New score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-001</td><td style="border: 1px solid #ccc; padding: 4px;">Premium cherry red lipstick with velvet finish</td><td style="border: 1px solid #ccc; padding: 4px;">0.603535</td><td style="border: 1px solid #ccc; padding: 4px;">1.3x (30%)</td><td style="border: 1px solid #ccc; padding: 4px;">0.7845955</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-002</td><td style="border: 1px solid #ccc; padding: 4px;">Affordable matte red lipstick for everyday wear</td><td style="border: 1px solid #ccc; padding: 4px;">0.603535</td><td style="border: 1px solid #ccc; padding: 4px;">1.2x (20%)</td><td style="border: 1px solid #ccc; padding: 4px;">0.724242</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-003</td><td style="border: 1px solid #ccc; padding: 4px;">Glitter red gloss for parties and festivals</td><td style="border: 1px solid #ccc; padding: 4px;">0.13353139</td><td style="border: 1px solid #ccc; padding: 4px;">1.2x (20%)</td><td style="border: 1px solid #ccc; padding: 4px;">0.16023767</td>
                    </tr>
              </tbody>
            </table>
          <p>As desired for this luxury user, the luxury lipstick (LIP-001) receives the strongest uplift and will tend to rise above similar alternatives in the results.</p><h3>Persona B: budget-oriented shopper</h3><p>A budget-conscious shopper might belong to the following cohorts:</p><p>[&quot;female&quot;, &quot;beauty&quot;, &quot;budget&quot;]</p><p>The query for this user is nearly identical to the previous query, except for the cohort values which now reflect “budget” rather than “luxury”:</p><pre><code>GET product_catalog/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "red lipstick",
          "fields": ["description"]
        }
      },
      "functions": [
        { "filter": { "term": { "cohorts": "female" }},  "weight": 0.1 },
        { "filter": { "term": { "cohorts": "beauty" }},  "weight": 0.1 },
        { "filter": { "term": { "cohorts": "budget" }},  "weight": 0.1 },
        { "weight": 1.0 }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}</code></pre><p>For this persona, LIP-002 (“Affordable matte red lipstick for everyday wear”) matches “female”, “beauty”, and “budget”, which means that the cohort overlap is 3 and therefore the boost factor is 1.3. On the other hand, LIP-001 and LIP-003 match only “female” and “beauty”, which results in a boost factor of 1.2.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Product ID</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td><td style="border: 1px solid #ccc; padding: 4px;">Base BM25 score</td><td style="border: 1px solid #ccc; padding: 4px;">Boost factor</td><td style="border: 1px solid #ccc; padding: 4px;">New score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-002</td><td style="border: 1px solid #ccc; padding: 4px;">Affordable matte red lipstick for everyday wear</td><td style="border: 1px solid #ccc; padding: 4px;">0.603535</td><td style="border: 1px solid #ccc; padding: 4px;">1.3x (30%)</td><td style="border: 1px solid #ccc; padding: 4px;">0.7845955</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-001</td><td style="border: 1px solid #ccc; padding: 4px;">Premium cherry red lipstick with velvet finish</td><td style="border: 1px solid #ccc; padding: 4px;">0.603535</td><td style="border: 1px solid #ccc; padding: 4px;">1.2x (20%)</td><td style="border: 1px solid #ccc; padding: 4px;">0.724242</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LIP-003</td><td style="border: 1px solid #ccc; padding: 4px;">Glitter red gloss for parties and festivals</td><td style="border: 1px solid #ccc; padding: 4px;">0.13353139</td><td style="border: 1px solid #ccc; padding: 4px;">1.2x (20%)</td><td style="border: 1px solid #ccc; padding: 4px;">0.16023767</td>
                    </tr>
              </tbody>
            </table>
          <p>As desired for this budget user, the budget lipstick (LIP-002) receives the strongest uplift and will tend to rise above similar alternatives in the results.</p><h2>How to build the cohort filter dynamically (Python example)</h2><p>You will normally inject the cohort filters at query time based on the user/session profile. For example:</p><pre><code>user_cohorts = ["female", "beauty"]

# one term filter per user cohort; each match adds a small weight
functions = [
    { "filter": { "term": { "cohorts": cohort } }, "weight": 0.1 }
    for cohort in user_cohorts
]
# add the baseline multiplier so the boost factor never drops below 1.0
functions.append({ "weight": 1.0 })

# plug the functions into the function_score query shown earlier
query = { "function_score": {
    "query": { "multi_match": { "query": "red lipstick", "fields": ["description"] } },
    "functions": functions, "score_mode": "sum", "boost_mode": "multiply" } }</code></pre><p>Using term filters on a keyword field is fast, shard-cache friendly, and fully visible in the _explain API, which shows exactly which filters fired and which weights were applied.</p><h2>How cohort assignment works</h2><p>Cohort assignment is intentionally left outside Elasticsearch, and is outside the scope of this article. However, sources could include:</p><ul><li>browsing events (“has viewed lipstick” → beauty)</li><li>gender inference (from preferences or marketing profile)</li><li>device characteristics (mobile shopper)</li><li>location (“urban buyer”)</li><li>historical purchases</li><li>marketing segments</li><li>personalization cookies</li></ul><p>All of these are input signals, but the scoring mechanism in Elasticsearch remains the same. Elasticsearch does not need to know how you inferred the segments. This separation of concerns keeps Elasticsearch focused on ranking, while your application or data science layer owns the logic for inferring segments.</p><h3>How to choose the right boost weight</h3><p>In our examples, we used 0.1 per cohort. This value is tunable. Staying between 0.05 and 0.20 will likely provide good results. You should A/B test weights based on:</p><ul><li>catalog diversity</li><li>number of cohort tags per product</li><li>variability in BM25</li><li>business goals (revenue vs. discovery vs. 
personalization)</li></ul><h3>Limit the number of cohorts assigned to each product</h3><p>Giving a product 20 cohort tags leads to:</p><ul><li>Noise in the signals</li><li>Gaming by merchandisers (“tag everything as luxury”)</li><li>Loss of explainability</li><li>Over-boosting</li></ul><p>As a starting point (to be confirmed by your own testing), we recommend:</p><ul><li>Approximately 5 cohorts per product.</li><li>Optionally, an offline validation step (ingest pipeline, CI script, or index-time check) that warns or blocks when more than 5 tags are assigned.</li></ul><h3>Customized cohort boosting per user</h3><p>So far, our examples assume every cohort contributes equally. In reality, some users strongly prefer certain segments, and you may know that certain cohorts are especially important for an individual user. For example:</p><ul><li>A user who almost always buys luxury brands.</li><li>A user who consistently picks budget options.</li></ul><p>You can encode this by assigning different weights per cohort instead of a flat 0.1. For example, if your application has detected a “super-luxury” shopper, then you could modify the function scoring as follows:</p><pre><code>"functions": [
  { "filter": { "term": { "cohorts": "female" }},  "weight": 0.1 },
  { "filter": { "term": { "cohorts": "beauty" }},  "weight": 0.1 },
  { "filter": { "term": { "cohorts": "luxury" }},  "weight": 0.2 },
  { "weight": 1.0 }
]</code></pre><p>In the above example, “female” and “beauty” each add <code>+0.1</code>, while “luxury” adds <code>+0.2</code>. A product matching all three cohorts would get:</p><pre><code>boost = 1.0 + 0.1 + 0.1 + 0.2 = 1.4</code></pre><p>This remains fully explainable, and you can document the configuration (“luxury is 2× as important as other cohorts for this user”). Additionally, the explain API will show exactly how those numbers contributed to the final score.</p><h2>Conclusion</h2><p>This Elasticsearch-native approach to cohort personalization uses only lightweight metadata and standard query constructs, while preserving explainability, stability, and business control over the relevance model. It delivers precise, predictable relevance and ensures that business goals never compromise the quality of the search results.</p><h2>Implementation summary</h2><p>If you want to adopt this pattern in production, the high-level steps are:</p><ul><li>Add a single keyword field (cohorts) to each product containing 3–5 cohort tags.</li><li>Compute user/session cohorts in your application logic (from browsing, purchase history, CRM, etc.) 
and pass them with the query.</li><li>Inject dynamic function_score filters into your query with one per user cohort, and each with a small weight (e.g., 0.1), plus a baseline weight (1.0).</li><li>Wrap your existing BM25 query in function_score with score_mode: &quot;sum&quot; and boost_mode: &quot;multiply&quot; to apply multiplicative boosting.</li><li>Tune per-cohort weights (typically 0.05–0.20) based on A/B experiments, ensuring BM25 remains the dominant signal.</li></ul><p>These steps let you layer cohort personalization cleanly on top of your existing search relevance, without scripts, ML models, or major architecture changes.</p><h2>What’s next?</h2><p>This pattern is a powerful example of how to build sophisticated relevance rules directly into your queries, ensuring speed and reliability.</p><ol><li><strong>Implement custom personalization faster:</strong> If you&#x27;re ready to deploy and optimize this advanced cohort personalization strategy, or to tackle other complex relevance challenges, our team can help you build, tune, and operationalize your Elasticsearch solution quickly. Contact <a href="https://www.elastic.co/consulting">Elastic Services</a> for help implementing this and other advanced search techniques.</li><li><strong>Join the discussion:</strong> For general questions about advanced relevance techniques and implementation, join the <a href="https://discuss.elastic.co/">broader Elastic Stack community</a> for search discussions.</li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AutoOps in action: Investigating Elasticsearch cluster performance on ECK]]></title>
            <description><![CDATA[Explore how Elastic's InfoSec team implemented AutoOps in a multi-cluster ECK environment, cutting cluster performance investigation time from 30+ minutes to five minutes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/autoops-eck-cluster-performance-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/autoops-eck-cluster-performance-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Aram Favela]]></dc:creator>
            <pubDate>Tue, 09 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d267a73cce0714da8a42a2acc9726b996c72a786-1404x664.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>At Elastic, the InfoSec Security Engineering team is responsible for deploying and managing InfoSec&#x27;s ECK clusters. In the Elastic on Elastic series, we highlight InfoSec’s role as Customer Zero. By running the latest stack versions and features, we aim to provide practical advice and an inside look at how we operate. In this segment, we cover installing AutoOps in our multi-cluster <a href="https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s">Elastic Cloud on Kubernetes</a> (ECK) environment and highlight how it delivered value immediately.</p><p><a href="https://www.elastic.co/platform/autoops">AutoOps for Elasticsearch</a> streamlines cluster operations by providing performance recommendations, resource utilization insights, real-time issue detection, and guided remediation. With the recent <a href="https://www.elastic.co/blog/elasticsearch-autoops-on-prem">announcement</a> of AutoOps for self-managed (on‑premises) clusters, we were eager to deploy it and see how it could help us monitor and maintain the health of our <a href="https://www.elastic.co/blog/elastic-on-elastic-data-collected-to-the-infosec-siem">large</a> multi-cluster ECK environment.</p><h2>Installing AutoOps for ECK</h2><p>To ensure scalability and consistency, the Security Engineering team aims to manage all infrastructure as code (IaC). Staying in line with this approach, we created a Helm chart to deploy the AutoOps agent across our ECK environment. For additional information on our base ECK and Helm approach, please refer to a previous <a href="https://www.elastic.co/blog/elastic-on-elastic-how-infosec-deploys-infrastructure-and-stays-up-to-date-with-eck">blog</a> post in this series.</p><h3>AutoOps chart</h3><pre><code>cloud-connected-autoops/
├─ Chart.yaml
├─ values.yaml
└─ templates/
   └─ deployment.yaml</code></pre><p>Chart.yaml</p><pre><code>apiVersion: v1
description: Autoops cloud connected Agent chart
name: cloud-connected-autoops
version: 0.1.0</code></pre><p>values.yaml</p><pre><code>version: 9.2.0</code></pre><p>deployment.yaml<br/>This is derived from the official AutoOps <a href="https://github.com/elastic/autoops-install/blob/main/k8s/autoops_agent_deployment.yaml">repository</a>; however, we templatized the image version to simplify future upgrades and enable compatibility with our existing automation:</p><pre><code> image: >-
   docker.elastic.co/elastic-agent/elastic-otel-collector-wolfi:{{ .Values.version }}</code></pre><p>With the AutoOps Helm chart defined, we can now install AutoOps in our environment using <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a>. This feature allows us to use Elastic Cloud services within our self-managed ECK clusters without the overhead of installing and maintaining additional infrastructure.</p><p>The <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-connect-self-managed-to-autoops">documentation</a> outlines the required steps for setting up an Elastic Cloud Connect account and generating the necessary connection details for each cluster. Given our ECK environment, we utilized the Kubernetes installation method.</p><p>With the connection configured, we only needed to enable the Elastic Agent by adding a simple boolean flag to the <code>values.yml</code> for each ECK cluster:</p><pre><code>cloud-connected-autoops:
  enabled: true</code></pre><p>Once the installation is complete, each connected cluster should be listed on the AutoOps overview page:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f1ad7bb4247230dacff1c9c52d4d0f7d32764b4b-5110x1326.png" style="width: 100%;" /><h2>AutoOps in action</h2><p>We currently utilize Stack Monitoring to monitor cluster health and use <a href="https://www.elastic.co/blog/elastic-on-elastic-how-infosec-deploys-infrastructure-and-stays-up-to-date-with-eck">default</a> rules to send us alerts. While we plan to migrate to AutoOps alerting in the near future, our existing alerting still relies on Stack Monitoring.</p><p>The opportunity to test AutoOps presented itself when we received a thread pool write rejections alert in our alerting Slack channel:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ca35fc6061e946a5d18385fa7c5b8bf0a995f160-1380x120.png" style="width: 100%;" /><p>After receiving the standard alert, we checked AutoOps for additional context. AutoOps flagged the same issue but provided extra crucial detail, including the exact index causing the write rejections:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/935ee55eb8d685660023d7f9b1bf5e0ffbdf9a43-1600x902.png" style="width: 100%;" /><p>Next, we navigated to the shard Analyzer page, which visualizes sharding hotspots. The visualization made it clear that this particular node had the highest ingestion rate and was managing two write indices. The graph also indicated other nodes that were underutilized:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24f074624e926e8245e333fa3c1b929b3a60502c-1600x1101.png" style="width: 100%;" /><p>To alleviate the load on <code>es-data-3</code>, we moved the smaller write index to <code>es-data-5</code> using a cluster reroute command:</p><pre><code>POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": ".ds-logs-osquery_manager.result-workstation-2025.09.15-000632",
        "shard": 0, 
        "from_node": "es-data-3",  
        "to_node": "es-data-5"
      }
    }
  ]
}</code></pre><p>After executing the reroute, the write thread pool for <code>es-data-3</code> decreased immediately:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d2b1432bc99138081fb5b5cbe37852c0c680a4be-1176x1116.png" style="width: 100%;" /><p>Without AutoOps, investigating Elasticsearch performance would have required manually querying individual node metrics, shard allocations, and thread pools, often through multiple point in time API calls. AutoOps centralizes this data, continuously collects it in real time, and visualizes trends over time. <strong>As a result, we were able to investigate and resolve an issue that would have previously taken thirty minutes or more in just five minutes.</strong></p><p>While this example only scratches the surface of AutoOps’ capabilities, our hands-on experience shows that its detailed metrics and visualizations make investigating and resolving Elasticsearch performance issues easier and more efficient than previously possible.</p><p>To learn more about deploying AutoOps for Self Managed clusters, check out the <a href="https://www.elastic.co/blog/elasticsearch-autoops-on-prem">announcement</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How excessive replica counts can degrade performance, and what to do about it]]></title>
            <description><![CDATA[Learn about the impact of high replica counts in Elasticsearch, and how to ensure cluster stability by right-sizing your replicas.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-replica-counts-right-sizing</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-replica-counts-right-sizing</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt]]></dc:creator>
            <pubDate>Mon, 08 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/44f0dae40fe8a30287a6aa4fd23b7448a4282abe-998x562.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Replicas are essential to Elasticsearch: they provide high availability and help scale out search workloads. But like any distributed system, too much redundancy can become counterproductive. Excessive replica counts magnify write load, increase shard overhead, exhaust filesystem cache, elevate heap pressure, and can destabilize a cluster.</p><p>This article explains why excessive replica counts can cause severe performance degradation, how to diagnose the symptoms, and how right-sizing replica counts restored stability in a real large-scale customer deployment.</p><h2>The role of replicas in Elasticsearch</h2><p>Replicas in Elasticsearch serve two primary purposes:</p><ul><li>High availability: If a node fails, replicas ensure data remains available.</li><li>Search scalability: Replicas allow Elasticsearch to distribute search load across multiple nodes.</li></ul><p>However, each replica is a full physical copy of its primary shard, and every write must be applied to every replica. Replicas provide resilience, but they can also consume CPU, heap, filesystem cache, disk I/O, cluster state bandwidth, and recovery bandwidth. 
Replicas are powerful, but they are not free.</p><h2>When high replica counts can make sense</h2><p>There is a narrow set of scenarios where high replica counts genuinely improve performance:</p><ul><li>The cluster contains a small amount of extremely hot data where the <a href="https://en.wikipedia.org/wiki/Working_set">working set</a> fits into RAM on every node.</li><li>The cluster is intentionally overprovisioned.</li><li>The data is infrequently written or updated.</li></ul><p>In this scenario, replicas help utilize all nodes effectively, maximizing CPU utilization and cache efficiency.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/530f99871fc9f0df34855886764877b792a6fe4d-2512x794.png" style="width: 100%;" /><p><strong>Diagram 1: High replica counts benefit a single, small, hot data set (one or more indices) that fits into RAM on all nodes. All queries hit cached data and throughput scales efficiently.</strong></p><h2>The reality in large, multi-index clusters</h2><p>Most production environments contain many indices, diverse workloads, variable shard sizes, and mixed read/write patterns. In these settings, high replica counts introduce compounding issues that can severely degrade performance.</p><h3>Cache thrashing and memory pressure</h3><p>Every shard copy competes for limited filesystem cache. With excessive replicas:</p><ul><li>The working set grows beyond RAM capacity</li><li>Nodes are forced to read from disk for routine queries</li><li>Useful cached pages are constantly evicted, which causes “cache churn”. 
Cache hit rates collapse</li><li>Latency becomes unpredictable</li></ul><p>When multiple indices compete for the same finite memory, the cost of serving a single query increases dramatically because the shard data needed for that query is not in RAM.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ff888f4f3da9001543e250e16c3b357f3a25ca4-2556x794.png" style="width: 100%;" /><p><strong>Diagram 2: In clusters with many indices and excessive replicas, shards compete for limited RAM and heap, causing frequent cache evictions and memory pressure.</strong></p><p>Note: This diagram is conceptual. In practice, nodes hold interleaved fragments of many shards in filesystem cache, but the underlying principle remains the same as what is illustrated in the diagram.</p><h3>Write amplification</h3><p>If an index has 5 replicas, a single document write becomes 6 independent writes, each with its own merge cycles, segment management, and I/O cost. This directly increases:</p><ul><li>Disk utilization</li><li>Indexing latency</li><li>Merge pressure</li><li>Threadpool saturation</li><li>Backpressure and retry load</li></ul><p>Indexing throughput may become unsustainable with high replica counts. 
The diagram illustrates how an update to a single index with 5 replicas results in a write operation on every node hosting a shard copy, in this example all six nodes in the cluster.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5d4f2ab7ca7a0506dcfbbbe65db155acd2b6188e-2610x746.png" style="width: 100%;" /><p><strong>Diagram 3: Every write operation is multiplied by the number of replicas, dramatically increasing disk I/O and risking saturation in write-heavy environments.</strong></p><h3>Increased shard overhead</h3><p>More replicas mean more:</p><ul><li>Shards</li><li>Segment files</li><li>File descriptors</li><li>Cluster state updates</li><li>Memory reserved for per-shard data structures</li></ul><p>This expands JVM heap usage and increases GC frequency.</p><h2>Diagnosing excessive replication: key symptoms</h2><p>Clusters suffering from excessive replicas often exhibit the following operational symptoms:</p><ul><li>Frequent page faults and swapping: Working set cannot fit in RAM, leading to constant cache misses.</li><li>Excessive garbage collection (GC): High heap usage and long GC pauses due to too many shards.</li><li>Elevated disk I/O: Write amplification and cache churn drive up disk operations.</li><li>Unassigned shards and node instability: Resource exhaustion can cause nodes to leave and shards to be reallocated.</li><li>Search latency spikes: Queries frequently miss cache and hit disk, causing unpredictable response times.</li></ul><p>If you observe these symptoms, review your replica counts and sharding strategy.</p><h2>The solution: right-size your replicas</h2><h3>Best practices</h3><ul><li>Set replicas based on failure tolerance, not guesswork. For most clusters, 1 replica is sufficient (2 if spanning 3 <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/availability-and-resilience">availability zones</a>).</li><li>Monitor cache hit rates and heap usage. 
Ensure your hot working set fits in RAM; otherwise, reduce replica count or re-architect your sharding strategy.</li></ul><p>Using the earlier six-node example, reducing replicas from 5 to 1 dramatically reduces cache contention, improves cache locality, and lowers write amplification as shown in the following diagram.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/22af96ce232cccac3be687643bee4d88b3aa2ac7-2556x802.png" style="width: 100%;" /><p><strong>Diagram 4: Reducing replicas from 5 to 1 dramatically reduces the data hosted on each node and the overall memory contention.</strong></p><h3>Impact of reducing replicas</h3><p>A large enterprise customer experienced severe and persistent cluster instability. Symptoms included:</p><ul><li>High latency</li><li>Nodes repeatedly leaving the cluster</li><li>Excessive disk I/O</li><li>Frequent GC interruptions</li><li>Search throughput collapse</li></ul><p>Upon escalation, the root cause was quickly identified: The 20-node cluster had 12 replicas configured across numerous indices. After reducing replica counts to a sane baseline (typically 1) and rebalancing shards:</p><ul><li>Search latency normalized almost immediately</li><li>Disk I/O dropped dramatically</li><li>GC returned to normal levels</li><li>Nodes stabilized with no further drop-outs</li></ul><p>Right-sizing replicas was the key intervention.</p><h2>Common misconception: Will fewer replicas overload my nodes?</h2><p>A common concern is that reducing the number of replicas will concentrate search traffic on fewer nodes, creating hotspots or bottlenecks. In reality, Elasticsearch distributes queries across all available shard copies (primaries and replicas) for each index. 
Reducing replicas does not change the total query volume handled by the cluster; it changes the memory dynamics on each node.</p><p>With fewer replicas, each node holds fewer shards, making it far more likely that the data required for a query is already resident in RAM. The overall QPS per node remains comparable, but the cost per query drops dramatically because far fewer lookups result in (expensive) disk I/O.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4f4bcec203ee59cc90ad1204ccf0baf56e953c4f-2576x1358.png" style="width: 100%;" /><p><strong>Diagram 5: Same query load, improved cache hits: Before and after reducing replicas</strong></p><h2>Recommendations</h2><ul><li>Audit your cluster: Review replica counts across all indices to ensure that you are really benefiting from the number of replicas you have assigned.</li><li>Avoid “one-size-fits-all” settings: Tune replicas and primaries per index based on workload.</li><li>Educate your team: Replicas are a tool, not a universal solution. Understand the trade-offs.</li><li>Adjust incrementally: The number of replicas can be modified at any time. Test changes in a controlled environment and monitor performance before and after adjustments.</li></ul><h2>Conclusion</h2><p>Replicas are essential for resilience and search scalability, but in many use cases high replica counts can silently undermine Elasticsearch cluster performance.</p><p>Excessive replicas amplify writes, increase shard overhead, fragment system memory and cache behavior, and destabilize large, multi-index workloads.</p><p>If your cluster exhibits unexplained latency, GC pressure, or instability, start by auditing replica settings. 
In Elasticsearch performance engineering, more is not always better—often, less is faster and more reliable.</p><h2>Further reading</h2><ul><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">Elasticsearch: Tune for Search Speed</a></li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html">Elasticsearch: Sizing and Scaling</a></li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-shard-allocation.html">Elasticsearch: Shard and Replica Allocation</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build a financial AI search workflow using LangGraph.js and Elasticsearch]]></title>
            <description><![CDATA[Learn how to use LangGraph.js with Elasticsearch to build an AI-powered financial search workflow that turns natural language queries into dynamic, conditional filters for investment and market analysis.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/12b6057d84c6356267cd178a3c6c1a5c61123ece-2000x1256.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Building AI search applications often involves coordinating multiple tasks, data retrieval, and data extraction into a seamless workflow. LangGraph simplifies this process by letting developers orchestrate AI agents using a node-based structure. In this article, we are going to build a financial solution using <a href="https://langchain-ai.github.io/langgraphjs/">LangGraph.js</a>.</p><h2>What is LangGraph</h2><p><a href="https://langchain-ai.github.io/langgraphjs/">LangGraph</a> is a framework for building AI agents and orchestrating them in a workflow to create AI-assisted applications. LangGraph has a node architecture where we can declare functions that represent tasks and assign them as nodes of the workflow. The result of multiple nodes interacting together is a graph. LangGraph is part of the broader <a href="https://js.langchain.com/docs/introduction/">LangChain</a> ecosystem, which provides tools for building modular and composable AI systems.</p><p>For a better understanding of why LangGraph is useful, let&#x27;s use it to solve a concrete problem.</p><h2>Overview of the solution</h2><p>In a venture capital firm, investors have access to a large database with many filtering options, but combining criteria is hard and slow. As a result, relevant startups may never surface as investment candidates, analysts spend hours identifying the best ones, and opportunities can be lost.</p><p>With LangGraph and Elasticsearch, we can perform filtered searches using natural language, eliminating the need for users to manually build complex requests with dozens of filters. 
To make it more flexible, the workflow automatically chooses between two query types based on the user&#x27;s input:</p><ul><li><strong>Investment-focused queries</strong>: These target financial and funding aspects of startups, such as <a href="https://www.investopedia.com/articles/personal-finance/102015/series-b-c-funding-what-it-all-means-and-how-it-works.asp">funding rounds</a>, valuation, or <a href="https://www.investopedia.com/terms/r/revenue.asp">revenue</a>. <em>Example:</em> “Find startups with Series A or Series B funding between $8M–$25M and monthly revenue above $500K.”</li><li><strong>Market-focused queries</strong>: These concentrate on <a href="https://en.wikipedia.org/wiki/Vertical_market">industry verticals</a>, <a href="https://en.wikipedia.org/wiki/Target_market">geographic markets</a>, or <a href="https://www.investopedia.com/terms/b/businessmodel.asp">business models</a>, helping identify opportunities in specific sectors or regions. <em>Example:</em> “Find fintech and healthcare startups in San Francisco, New York, or Boston.”</li></ul><p>To keep the queries robust, we will make the LLM build <a href="https://www.elastic.co/docs/solutions/search/search-templates">search templates</a> instead of full <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl">DSL queries</a>. This way, you always get the query shape you want, and the LLM only has to fill in the blanks rather than build the entire query each time.</p><h2>What you need to get started</h2><ul><li>An Elasticsearch API key</li><li>An OpenAI API key</li><li>Node 18 or newer</li></ul><h2>Step-by-step instructions</h2><p>In this section, let’s see how the app will look. 
For that, we will use <a href="https://www.typescriptlang.org/">TypeScript</a>, a superset of JavaScript that adds static types to make the code more reliable, easier to maintain, and safer by catching errors early while remaining fully compatible with existing JavaScript.</p><p>The nodes&#x27; flow will look as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b47d7f238c4964a6febc0de7fe5e68b186f539c3-363x555.png" style="width: 100%;" /><p>The image above is generated by LangGraph and represents the workflow that defines the execution order and conditional logic between nodes:</p><ul><li><strong>decideStrategy: </strong>Uses an LLM to analyze the user’s query and decide between two specialized search strategies, investment-focused or market-focused.</li><li><strong>prepareInvestmentSearch: </strong>Extracts filter values from the query and builds a predefined template emphasizing financial and funding-related parameters.</li><li><strong>prepareMarketSearch</strong>: Extracts filter values as well, but dynamically builds parameters emphasizing market, industry, and geographic context.</li><li><strong>executeSearch: </strong>Sends the constructed query to Elasticsearch using a search template and retrieves the matching startup documents.</li><li><strong>visualizeResults: </strong>Formats the final results into a clear, readable summary showing key startup attributes such as funding, industry, and revenue.</li></ul><p>This flow includes a <a href="https://langchain-ai.github.io/langgraphjs/how-tos/branching/?h=conditional#how-to-create-branches-for-parallel-node-execution">conditional branching</a>, working as an “if” statement, that determines whether to use the investment or market search path based on the user’s input. 
This decision logic, driven by the LLM, makes the workflow adaptive and context-aware, a mechanism we’ll explore in more detail in the next sections.</p><h3>LangGraph State</h3><p>Before seeing each node individually, we need to understand how the nodes communicate and share data. For that, LangGraph lets us define a workflow state: a shared object that is passed between nodes.</p><p>The state acts as a shared container that stores intermediate data throughout the workflow: it begins with the user’s natural language query, then keeps the selected search strategy, the prepared parameters for Elasticsearch, the retrieved search results, and finally the formatted output.</p><p>This structure allows every node to read and update the state, ensuring a consistent flow of information from the user input to the final visualization.</p><pre><code>const VCState = Annotation.Root({
  input: Annotation<string>(), // User's natural language query
  searchStrategy: Annotation<string>(), // Search strategy chosen by LLM
  searchParams: Annotation<any>(), // Prepared search parameters
  results: Annotation<any[]>(), // Search results
  final: Annotation<string>(), // Final formatted response
});</code></pre><h3>Set up the application</h3><p>All the code in this section can be found in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch">elasticsearch-labs repository</a>.</p><p>Open a terminal in the folder where the app will be located and initialize a Node.js application with the command:</p><pre><code>npm init -y</code></pre><p>Now we can install the necessary dependencies for this project:</p><pre><code>npm install @elastic/elasticsearch @langchain/langgraph @langchain/openai @langchain/core dotenv zod && npm install --save-dev @types/node tsx typescript</code></pre><ul><li><strong><code>@elastic/elasticsearch</code></strong>: Helps us handle Elasticsearch requests such as data ingestion and retrieval.</li><li><strong><code>@langchain/langgraph</code></strong>: JS dependency to provide all LangGraph tools.</li><li><strong><code>@langchain/openai</code></strong>: OpenAI LLM client for LangChain.</li><li><strong><code>@langchain/core</code></strong>: Provides the fundamental building blocks for LangChain apps, including prompt templates.</li><li><strong><code>dotenv</code></strong>: Necessary dependency to use environment variables in JavaScript.</li><li><strong><code>zod</code></strong>: Schema library used to type and validate data.</li></ul><p><code>@types/node</code>, <code>tsx</code>, and <code>typescript</code> allow us to write and run TypeScript code.</p><p>Now create the following files:</p><ul><li><code>elasticsearchSetup.ts</code>: Will create the index mappings, load the data set from a JSON file, and ingest the data to Elasticsearch.</li><li><code>main.ts</code>: Will include the LangGraph application.</li><li><code>.env</code>: File to store the environment variables.</li></ul><p>In the <code>.env</code> file, let’s add the following environment variables:</p><pre><code>ELASTICSEARCH_ENDPOINT="your-endpoint-here"
ELASTICSEARCH_API_KEY="your-key-here"
OPENAI_API_KEY="your-key-here"</code></pre><p>The OpenAI API key will not be used directly in our code; it is consumed internally by the <code>@langchain/openai</code> library.</p><p>All the logic for creating the mappings and search templates and for ingesting the dataset can be found in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/elasticsearchSetup.ts"><code>elasticsearchSetup.ts</code></a> file. In the next steps, we will focus on the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/main.ts"><code>main.ts</code></a> file. You can also check <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/dataset.json"><code>dataset.json</code></a> to better understand what the data looks like.</p><h3>LangGraph app</h3><p>In the <code>main.ts</code> file, let’s import the dependencies needed for the LangGraph application. This file also contains the node functions and the state declaration; the graph itself will be declared in a <code>main</code> method in a later step. The <code>elasticsearchSetup.ts</code> file provides the Elasticsearch helpers we will use within the nodes.</p><pre><code>import { writeFileSync } from "node:fs";
import { StateGraph, Annotation, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";
import {
  esClient,
  ingestDocuments,
  createSearchTemplates,
  INDEX_NAME,
  INVESTMENT_FOCUSED_TEMPLATE,
  MARKET_FOCUSED_TEMPLATE,
  createIndex,
} from "./elasticsearchSetup.js";
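
// Note: `esClient` is exported from elasticsearchSetup.ts. A minimal sketch of how
// such a client is typically created from the .env variables (hypothetical — see the
// repository file for the actual implementation):
//
//   import "dotenv/config"; // loads the .env file
//   import { Client } from "@elastic/elasticsearch";
//   export const esClient = new Client({
//     node: process.env.ELASTICSEARCH_ENDPOINT,
//     auth: { apiKey: process.env.ELASTICSEARCH_API_KEY! },
//   });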

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });</code></pre><p>As mentioned before, the LLM client will be used to generate the Elasticsearch search template parameters based on the user&#x27;s question.</p><pre><code>async function saveGraphImage(app: any): Promise<void> {
  try {
    const drawableGraph = app.getGraph();
    const image = await drawableGraph.drawMermaidPng();
    const arrayBuffer = await image.arrayBuffer();

    const filePath = "./workflow_graph.png";
    writeFileSync(filePath, new Uint8Array(arrayBuffer));
    console.log(`📊 Workflow graph saved as: ${filePath}`);
  } catch (error: any) {
    console.log("⚠️  Could not save graph image:", error.message);
  }
}</code></pre><p>The method above generates the graph image in PNG format, using the <a href="https://mermaid.ink/">Mermaid.ink API</a> behind the scenes. This is useful if you want a styled visualization of how the app’s nodes interact.</p><h3>LangGraph nodes</h3><p>Now let’s look at each node in detail:</p><h3>decideSearchStrategy node</h3><p>The <code>decideSearchStrategy</code> node analyzes the user input and determines whether to perform an investment-focused or market-focused search. It uses an LLM with a structured output schema (defined with Zod) to classify the query type. Before making the decision, it retrieves the available filters from the index using an aggregation, ensuring the model has up-to-date context about industries, locations, and funding data.</p><p>To extract the possible filter values and send them to the LLM, let’s use an <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregation</a> query to retrieve them directly from the Elasticsearch index. This logic lives in a method called <code>getAvailableFilters</code>:</p><pre><code>async function getAvailableFilters() {
  try {
    const response = await esClient.search({
      index: INDEX_NAME,
      size: 0,
      aggs: {
        industries: {
          terms: { field: "industry", size: 100 },
        },
        locations: {
          terms: { field: "location", size: 100 },
        },
        funding_stages: {
          terms: { field: "funding_stage", size: 20 },
        },
        business_models: {
          terms: { field: "business_model", size: 10 },
        },
        lead_investors: {
          terms: { field: "lead_investor", size: 100 },
        },
        funding_amount_stats: {
          stats: { field: "funding_amount" },
        },
      },
    });

    return response.aggregations;
  } catch (error) {
    console.error("❌ Error getting available filters:", error);
    return {};
  }
}</code></pre><p>With the aggregation query above, we have the following results:</p><pre><code>{
  "industries": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "logistics",
        "doc_count": 5
      },
      ...
    ]
  },
  "locations": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "San Francisco, CA",
        "doc_count": 4
      },
      {
        "key": "New York, NY",
        "doc_count": 3
      },
      ...
    ]
  },
  "funding_stages": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Series A",
        "doc_count": 8
      },
      ...
    ]
  },
  "business_models": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "B2B",
        "doc_count": 13
      },
      ...
    ]
  },
  "lead_investors": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Battery Ventures",
        "doc_count": 1
      },
      {
        "key": "Benchmark Capital",
        "doc_count": 1
      },
      ...
    ]
  },
  "funding_amount_stats": {
    "count": 20,
    "min": 4500000,
    "max": 35000000,
    "avg": 14075000,
    "sum": 281500000
  }
}</code></pre><p>See all the results <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/responses/aggregationsResponse.json">here</a>.</p><p>For both strategies, we are going to use hybrid search to capture both the structured part of the question (filters) and the more subjective part (semantics). Both queries are defined as <a href="https://www.elastic.co/docs/solutions/search/search-templates">search templates</a>; here is the investment-focused one:</p><pre><code>await esClient.putScript({
      id: INVESTMENT_FOCUSED_TEMPLATE,
      script: {
        lang: "mustache",
        source: `{
          "size": 5,
          "retriever": {
            "rrf": {
              "retrievers": [
                {
                  "standard": {
                    "query": {
                      "semantic": {
                        "field": "semantic_field",
                        "query": "{{query_text}}"
                      }
                    }
                  }
                },
                {
                  "standard": {
                    "query": {
                      "bool": {
                        "filter": [
                          {"terms": {"funding_stage": {{#join}}{{#toJson}}funding_stage{{/toJson}}{{/join}}}},
                          {"range": {"funding_amount": {"gte": {{funding_amount_gte}}{{#funding_amount_lte}},"lte": {{funding_amount_lte}}{{/funding_amount_lte}}}}},
                          {"terms": {"lead_investor": {{#join}}{{#toJson}}lead_investor{{/toJson}}{{/join}}}},
                          {"range": {"monthly_revenue": {"gte": {{monthly_revenue_gte}}{{#monthly_revenue_lte}},"lte": {{monthly_revenue_lte}}{{/monthly_revenue_lte}}}}}
                        ]
                      }
                    }
                  }
                }
              ],
              "rank_window_size": 100,
              "rank_constant": 20
            }
          }
        }`,
      },
    });</code></pre><p>You can see both queries in detail in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/elasticsearchSetup.ts#L119"><code>elasticsearchSetup.ts</code></a> file. The following node decides which of the two queries will be used:</p><pre><code>// Node 1: Decide search strategy using LLM
async function decideSearchStrategy(state: typeof VCState.State) {
  // Zod schema for specialized search strategy decision
  const SearchDecisionSchema = z.object({
    search_type: z
      .enum(["investment_focused", "market_focused"])
      .describe("Type of specialized search strategy to use"),
    reasoning: z
      .string()
      .describe("Brief explanation of why this search strategy was chosen"),
  });

  const decisionLLM = llm.withStructuredOutput(SearchDecisionSchema);

  // Get dynamic filters from Elasticsearch
  const availableFilters = await getAvailableFilters();

  const prompt = `Query: "${state.input}"
    Available filters: ${JSON.stringify(availableFilters, null, 2)}

    Choose between two specialized search strategies:
    
    - investment_focused: For queries about funding stages, funding amounts, monthly revenue, lead investors, financial performance
    
    - market_focused: For queries about industries, locations, business models, market segments, geographic markets
    
    Analyze the query intent and choose the most appropriate strategy.
  `;

  try {
    const result = await decisionLLM.invoke(prompt);
    console.log(
      `🤔 Search strategy: ${result.search_type} - ${result.reasoning}`
    );

    return {
      searchStrategy: result.search_type,
    };
  } catch (error: any) {
    console.error("❌ Error in decideSearchStrategy:", error.message);
    return {
      searchStrategy: "investment_focused",
    };
  }
}</code></pre><h3>prepareInvestmentSearch and prepareMarketSearch nodes</h3><p>Both nodes use a shared helper function, <code>extractFilterValues</code>, which leverages the LLM to identify relevant filters mentioned in the user’s input, such as industry, location, funding stage, and business model. The extracted values are then used to fill in the parameters of our <a href="https://www.elastic.co/docs/solutions/search/search-templates">search template</a>.</p><pre><code>// Extract all possible filter values from user input
async function extractFilterValues(input: string) {
  const FilterValuesSchema = z.object({
    // Investment-focused filters
    funding_stage: z
      .array(z.string())
      .default([])
      .describe("Funding stage values mentioned in query"),
    funding_amount_gte: z
      .number()
      .default(0)
      .describe("Minimum funding amount in USD"),
    funding_amount_lte: z
      .number()
      .default(100000000)
      .describe("Maximum funding amount in USD"),
    lead_investor: z
      .array(z.string())
      .default([])
      .describe("Lead investor values mentioned in query"),
    monthly_revenue_gte: z
      .number()
      .default(0)
      .describe("Minimum monthly revenue in USD"),
    monthly_revenue_lte: z
      .number()
      .default(10000000)
      .describe("Maximum monthly revenue in USD"),
    industry: z
      .array(z.string())
      .default([])
      .describe("Industry values mentioned in query"),
    location: z
      .array(z.string())
      .default([])
      .describe("Location values mentioned in query"),
    business_model: z
      .array(z.string())
      .default([])
      .describe("Business model values mentioned in query"),
  });

  const extractorLLM = llm.withStructuredOutput(FilterValuesSchema);
  const availableFilters = await getAvailableFilters();

  const extractPrompt = `Extract ALL relevant filter values from: "${input}"
    Available options: ${JSON.stringify(availableFilters, null, 2)}
    Extract only values explicitly mentioned in the query. Leave fields empty if not mentioned.`;

  return await extractorLLM.invoke(extractPrompt);
}</code></pre><p>Depending on the detected intent, the workflow selects one of two paths:</p><p><strong>prepareInvestmentSearch:</strong> builds financially oriented search parameters, including funding stage, funding amount, lead investor, and revenue information. You can find the entire query template in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/elasticsearchSetup.ts"><code>elasticsearchSetup.ts</code></a> file:</p><pre><code>// Node 2A: Prepare Investment-Focused Search Parameters 
async function prepareInvestmentSearch(state: typeof VCState.State) {
  console.log(
    "💰 Preparing INVESTMENT-FOCUSED search parameters with financial emphasis..."
  );

  try {
    // Extract all filter values from input
    const values = await extractFilterValues(state.input);

    let searchParams: any = {
      template_id: INVESTMENT_FOCUSED_TEMPLATE,
      query_text: state.input,
      ...values,
    };

    return { searchParams };
  } catch (error) {
    console.error("❌ Error preparing investment-focused params:", error);
    return {
      searchParams: {},
    };
  }
}</code></pre><p><strong>prepareMarketSearch:</strong> creates market-driven parameters focused on industries, geographies, and business models. See the entire query in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/langgraph-js-elasticsearch/elasticsearchSetup.ts"><code>elasticsearchSetup.ts</code></a> file:</p><pre><code>// Node 2B: Prepare Market-Focused Search Parameters
async function prepareMarketSearch(state: typeof VCState.State) {
  console.log(
    "🔍 Preparing MARKET-FOCUSED search parameters with market emphasis..."
  );

  try {
    // Extract all filter values from input
    const values = await extractFilterValues(state.input);

    let searchParams: any = {
      template_id: MARKET_FOCUSED_TEMPLATE,
      query_text: state.input,
      ...values,
    };

    return { searchParams };
  } catch (error) {
    console.error("❌ Error preparing market-focused params:", error);
    return { searchParams: {} }; // keep the state shape consistent with the investment path
  }
}</code></pre><h3>executeSearch node</h3><p>This node takes the generated search parameters from the state and first calls the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-render-search-template">_render API</a> to print the resolved query for debugging purposes; it then runs the search template to retrieve the results.</p><pre><code>// Node 3: Execute Search
async function executeSearch(state: typeof VCState.State) {
  const { searchParams } = state;

  try {
    // getting formed query from template for debugging
    const renderedTemplate = await esClient.renderSearchTemplate({
      id: searchParams.template_id,
      params: searchParams,
    });

    console.log(
      "📋 Complete query:",
      JSON.stringify(renderedTemplate.template_output, null, 2)
    );

    const results = await esClient.searchTemplate({
      index: INDEX_NAME,
      id: searchParams.template_id,
      params: searchParams,
    });

    return {
      results: results.hits.hits.map((hit: any) => hit._source),
    };
  } catch (error: any) {
    console.error(`❌ ${state.searchStrategy} search error:`, error);
    return { results: [] };
  }
}</code></pre><h3>visualizeResults node</h3><p>Finally, this node displays the Elasticsearch results.</p><pre><code>// Node 4: Visualize results
async function visualizeResults(state: typeof VCState.State) {
  const results = state.results || [];

  let formattedResults = `🎯 Found ${results.length} startups matching your criteria:\n\n`;

  results.forEach((startup: any, index: number) => {
    formattedResults += `${index + 1}. **${startup.company_name}**\n`;
    formattedResults += `   📍 ${startup.location} | 🏢 ${startup.industry} | 💼 ${startup.business_model}\n`;
    formattedResults += `   💰 ${startup.funding_stage} - $${(
      startup.funding_amount / 1000000
    ).toFixed(1)}M\n`;
    formattedResults += `   👥 ${startup.employee_count} employees | 📈 $${(
      startup.monthly_revenue / 1000
    ).toFixed(0)}K MRR\n`;
    formattedResults += `   🏦 Lead: ${startup.lead_investor}\n`;
    formattedResults += `   📝 ${startup.description}\n\n`;
  });

  return {
    final: formattedResults,
  };
}</code></pre><p>Programmatically, the entire graph looks like this:</p><pre><code>  const workflow = new StateGraph(VCState)
    // Register nodes - these are the processing functions
    .addNode("decideStrategy", decideSearchStrategy)
    .addNode("prepareInvestment", prepareInvestmentSearch)
    .addNode("prepareMarket", prepareMarketSearch)
    .addNode("executeSearch", executeSearch)
    .addNode("visualizeResults", visualizeResults)
    // Define execution flow with conditional branching
    .addEdge(START, "decideStrategy") // Start with strategy decision
    .addConditionalEdges(
      "decideStrategy",
      (state: typeof VCState.State) => state.searchStrategy, // Conditional function
      {
        investment_focused: "prepareInvestment", // If investment focused -> RRF template preparation
        market_focused: "prepareMarket", // If market focused -> dynamic query preparation
      }
    )
    .addEdge("prepareInvestment", "executeSearch") // Investment prep -> execute
    .addEdge("prepareMarket", "executeSearch") // Market prep -> execute
    .addEdge("executeSearch", "visualizeResults") // Execute -> visualize
    .addEdge("visualizeResults", END); // End workflow</code></pre><p>As you can see, we have a conditional edge where the app decides which “path” (node) runs next. This feature is useful when workflows need branching logic, such as choosing between multiple tools or including a human-in-the-loop step.</p><p>With the core LangGraph features covered, we can now put everything together in a <code>main</code> method, where we declare the graph with all of its elements under the <code>workflow</code> variable:</p><pre><code>async function main() {
  await createIndex();
  await createSearchTemplates();
  await ingestDocuments();

  // Create the workflow graph with shared state
  const workflow = new StateGraph(VCState)
    // Register nodes - these are the processing functions
    .addNode("decideStrategy", decideSearchStrategy)
    .addNode("prepareInvestment", prepareInvestmentSearch)
    .addNode("prepareMarket", prepareMarketSearch)
    .addNode("executeSearch", executeSearch)
    .addNode("visualizeResults", visualizeResults)
    // Define execution flow with conditional branching
    .addEdge(START, "decideStrategy") // Start with strategy decision
    .addConditionalEdges(
      "decideStrategy",
      (state: typeof VCState.State) => state.searchStrategy, // Conditional function
      {
        investment_focused: "prepareInvestment", // If investment focused -> RRF template preparation
        market_focused: "prepareMarket", // If market focused -> dynamic query preparation
      }
    )
    .addEdge("prepareInvestment", "executeSearch") // Investment prep -> execute
    .addEdge("prepareMarket", "executeSearch") // Market prep -> execute
    .addEdge("executeSearch", "visualizeResults") // Execute -> visualize
    .addEdge("visualizeResults", END); // End workflow


  const app = workflow.compile();

  await saveGraphImage(app);

  const query =
    "Find startups with Series A or Series B funding between $8M-$25M and monthly revenue above $500K";

  const marketResult = await app.invoke({ input: query });
  console.log(marketResult.final);
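
  // The same compiled app also handles market-oriented questions; the conditional
  // edge then routes execution through prepareMarketSearch instead, e.g.:
  //   const marketQuery =
  //     "Find fintech and healthcare startups in San Francisco, New York, or Boston";
  //   console.log((await app.invoke({ input: marketQuery })).final);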
}</code></pre><p>The query variable simulates the user input entered in a hypothetical search bar:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8d76909eb2bc1bb62f3ca9a8b3e4b85fcec2893-1600x164.png" style="width: 100%;" /><p>From the natural language phrase “Find startups with Series A or Series B funding between $8M-$25M and monthly revenue above $500K”, all the filters will be extracted.</p><p>Finally, invoke the main method:</p><pre><code>main().catch(console.error);</code></pre><h3>Results</h3><pre><code>🔍 Checking if index exists...
🏗️ Creating index...
✅ Index created successfully!
Ingesting documents...
✅ Documents ingested successfully!
✅ Investment-focused template created successfully!
✅ Market-focused template created successfully!

📊 Workflow graph saved as: ./workflow_graph.png

🔍 Query: "Find startups with Series A or Series B funding between $8M-$25M and monthly revenue above $500K"

🤔 Search strategy: investment_focused - The query specifically seeks profitable fintech startups with defined funding amounts and high monthly revenue, which aligns closely with financial performance metrics and investment-related criteria.

💰 Preparing INVESTMENT-FOCUSED search parameters with financial emphasis...

📋 Complete query: {
  "size": 5,
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_field",
                "query": "Find startups with Series A or Series B funding between $8M-$25M and monthly revenue above $500K"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "bool": {
                "filter": [
                  {
                    "terms": {
                      "funding_stage": [
                        "Series A",
                        "Series B"
                      ]
                    }
                  },
                  {
                    "range": {
                      "funding_amount": {
                        "gte": 8000000,
                        "lte": 25000000
                      }
                    }
                  },
                  {
                    "terms": {
                      "lead_investor": []
                    }
                  },
                  {
                    "range": {
                      "monthly_revenue": {
                        "gte": 500000,
                        "lte": 0
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ],
      "rank_window_size": 100,
      "rank_constant": 20
    }
  }
}
🎯 Found 5 startups matching your criteria:

1. **TechFlow**
   📍 San Francisco, CA | 🏢 logistics | 💼 B2B
   💰 Series A - $8.0M
   👥 45 employees | 📈 $500K MRR
   🏦 Lead: Sequoia Capital
   📝 TechFlow optimizes supply chain operations using AI-powered route optimization and real-time tracking. Founded in 2023, shows remarkable growth with $500K monthly revenue.

2. **DataViz**
   📍 New York, NY | 🏢 enterprise software | 💼 B2B
   💰 Series A - $10.0M
   👥 42 employees | 📈 $450K MRR
   🏦 Lead: Battery Ventures
   📝 DataViz creates intuitive data visualization tools for enterprise customers. No-code platform allows business users to create dashboards without technical expertise.

3. **FinanceAI**
   📍 San Francisco, CA | 🏢 fintech | 💼 B2C
   💰 Series C - $25.0M
   👥 120 employees | 📈 $1200K MRR
   🏦 Lead: Tiger Global Management
   📝 FinanceAI provides AI-powered investment advisory services to retail investors. Uses machine learning to analyze market trends with over 100,000 active users.

4. **UrbanMobility**
   📍 New York, NY | 🏢 logistics | 💼 B2B2C
   💰 Series B - $15.0M
   👥 78 employees | 📈 $750K MRR
   🏦 Lead: Kleiner Perkins
   📝 UrbanMobility revolutionizes urban transportation through autonomous delivery drones and smart logistics hubs. Partners with major retailers for same-day delivery across Manhattan and Brooklyn.

5. **HealthTech Solutions**
   📍 Boston, MA | 🏢 healthcare | 💼 B2B
   💰 Series B - $18.0M
   👥 95 employees | 📈 $900K MRR
   🏦 Lead: General Catalyst
   📝 HealthTech Solutions develops medical devices and software for remote patient monitoring. Comprehensive telehealth platform reducing hospital readmissions by 30%.

✨  Done in 18.80s.</code></pre><p>For this input, the application chooses the <strong>investment-focused</strong> path. In the output, we can see the Elasticsearch query generated by the LangGraph workflow, with the values and ranges extracted from the user input applied, and finally the results formatted by the <code>visualizeResults</code> node.</p><p>Now let&#x27;s test the <strong>market-focused</strong> node using the query “Find fintech and healthcare startups in San Francisco, New York, or Boston”:</p><pre><code>...

🔍 Query: Find fintech and healthcare startups in San Francisco, New York, or Boston

🤔 Search strategy: market_focused - The query is focused on finding fintech startups in San Francisco that are disrupting traditional banking and payment systems, which pertains to specific industries (fintech) and locations (San Francisco). Thus, a market-focused strategy is more appropriate.

🔍 Preparing MARKET-FOCUSED search parameters with market emphasis...

📋 Complete query: {
  "size": 5,
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_field",
                "query": "Find fintech and healthcare startups in San Francisco, New York, or Boston"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "bool": {
                "filter": [
                  {
                    "terms": {
                      "industry": [
                        "fintech",
                        "healthcare"
                      ]
                    }
                  },
                  {
                    "terms": {
                      "location": [
                        "San Francisco, CA",
                        "New York, NY",
                        "Boston, MA"
                      ]
                    }
                  },
                  {
                    "terms": {
                      "business_model": []
                    }
                  }
                ]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 10
    }
  }
}
🎯 Found 5 startups matching your criteria:

1. **FinanceAI**
   📍 San Francisco, CA | 🏢 fintech | 💼 B2C
   💰 Series C - $25.0M
   👥 120 employees | 📈 $1200K MRR
   🏦 Lead: Tiger Global Management
   📝 FinanceAI provides AI-powered investment advisory services to retail investors. Uses machine learning to analyze market trends with over 100,000 active users.

2. **CryptoWallet**
   📍 Miami, FL | 🏢 fintech | 💼 B2C
   💰 Series B - $16.0M
   👥 73 employees | 📈 $820K MRR
   🏦 Lead: Coinbase Ventures
   📝 CryptoWallet provides secure digital wallet solutions for cryptocurrency trading and storage. Multi-chain support with enterprise-grade security features.

...

✨  Done in 7.41s.</code></pre><h2>Learnings</h2><p>While building this example, I learned the following:</p><ul><li>We must show the LLM the exact filter values available in the index; otherwise, we rely on users typing them exactly. For low-cardinality fields this approach is fine, but for high-cardinality fields we need some mechanism to narrow the candidate values first.</li><li>Using search templates makes the results much more consistent than letting the LLM write the Elasticsearch query from scratch, and it&#x27;s also faster.</li><li>Conditional edges are a powerful mechanism to build applications with multiple variants and branching paths.</li><li>Structured output is extremely useful when generating information with LLMs because it enforces predictable, type-safe responses. This improves reliability and reduces prompt misinterpretations.</li><li>Combining semantic and structured search through hybrid retrieval produces better and more relevant results, balancing precision and context understanding.</li></ul><h2>Conclusion</h2><p>In this example, we combined LangGraph.js with Elasticsearch to create a dynamic workflow capable of interpreting natural language queries and deciding between investment-focused and market-focused search strategies. This approach reduces the complexity of crafting manual queries while improving flexibility and accuracy for venture capital analysts.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improving Kibana dashboard interactivity with variable controls]]></title>
            <description><![CDATA[Discover how to use variable controls in Kibana 8.18+ to filter individual visualizations, adjust time intervals, and group by different fields in Kibana dashboards.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-dashboard-interactivity-variable-controls-overview</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-dashboard-interactivity-variable-controls-overview</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Teresa Alvarez Soler]]></dc:creator>
            <pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/59aa3adffc8c759e42b961ef7d63719ce232893a-1348x830.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’re excited to share that <strong>variable controls are now available in Kibana dashboards</strong> starting from version 8.18 and all the 9.x series! This feature has been one of the most consistently requested additions from dashboard users—and it’s finally here 🎉 Over the past few months, we’ve continued expanding and refining <a href="https://www.elastic.co/docs/explore-analyze/dashboards/add-controls#add-variable-control">variable controls</a>, making it the perfect moment to give them their own dedicated blog post.</p><h2>What are variable controls?</h2><p>If you’ve worked with Kibana dashboards before, you probably know our classic dashboard controls—those handy dropdowns that show values from your data so you can filter things with a couple of clicks.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b82c1b25a0b38661e5ce4552f763be487d5074aa-1600x701.png" style="width: 100%;" /><p>Variable controls look similar on the surface, but they come with a clever twist: instead of automatically filtering every panel on your dashboard, they can be plugged directly into <a href="https://www.elastic.co/docs/explore-analyze/visualize/esorql">ES|QL queries inside individual visualizations</a>.</p><p>That means <em>you</em> get to decide where each control applies. Even better, you can use them for all sorts of creative tricks—like tweaking time intervals, switching breakdown fields, or changing visualization parameters on the fly. Basically, they give your dashboards a truly interactive experience, allowing you to get your insights faster and more easily.</p><h2>Use cases for variable controls</h2><p>Alright, variable controls sound useful —but what can you actually do with them? Here are some examples of how they level up your dashboards:</p><h3>Filter selected visualizations</h3><p>Want to filter <em>some</em> visualizations, but leave others untouched? Variable controls let you do exactly that. 
Pick the panels you want to respond to and wire them up in the ES|QL queries behind your visualizations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/efa367363830b03bc67028aceafe78c4b44e578f-1440x562.gif" style="width: 100%;" /><h3>Select different time intervals</h3><p>Give your users the power to switch between “5 minutes”, “1 hour”, “1 day”, or whatever time buckets make sense. Build a variable control with predefined intervals and connect it to your time series query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/62aa9f4e728036f8c70213b76b1cf131f36f5b4d-1440x606.gif" style="width: 100%;" /><h3>Change functions</h3><p>Instead of creating multiple charts for each operation, let dashboard users choose if they want to see max, avg, different percentiles, or any other aggregator.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f6a2b4c73dc35efe462c2924a153d7b3fa3a7922-1436x606.gif" style="width: 100%;" /><h3>Group by different fields</h3><p>Sometimes you need to break down the data by different dimensions during an investigation. With variable controls, you can define multiple “group by” fields and let dashboard users pick the one that helps them uncover their insights.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fe8745a6fddccadba0666686b8ebc67fdaf64158-1438x606.gif" style="width: 100%;" /><h2>How can you create them?</h2><p>The easiest (and probably most delightful) way to create a variable control is directly from the <strong>ES|QL query editor</strong> in your visualization. 
Just start typing your query, use the autocomplete menu, and Kibana will helpfully scaffold the control for you.</p><p>But if you prefer to start from the variable itself, you can also go to: <strong>Add panel → Controls → Variable control</strong> and add the variable to your visualizations after creating the control.</p><h3>Example 1: Filtering control with multi-value selection</h3><p>1. Pick a visualization that is powered by an ES|QL query and click “Create control” within the WHERE clause.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/46cb6f2a6775aee152d42eb5ee85170f1bdf26cb-1600x668.png" style="width: 100%;" /><p>2. You will automatically get redirected to the variable creation flyout, where the “Values from a query” type will be selected for you, and the name of the variable is already pre-populated. Remember that the name of a control always needs to start with “?...” to work in the visualization query.</p><p>You will usually need a query like this one to get the values from a field and update them according to the time range selected in the dashboard:</p><pre><code>FROM <datasource_name>
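// ?_tstart and ?_tend are bound to the dashboard's time picker,
// so the control's dropdown values track the selected time range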
| WHERE @timestamp <= ?_tend AND @timestamp > ?_tstart
| STATS BY <field_name></code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a2a72d4e3159923c6207908da9b4172e27cd5f81-1600x716.png" style="width: 100%;" /><p>3. When saving the control, you will see it appearing at the top of the dashboard, and your visualization query will be updated with the variable control name.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5fc434c8951889e9769652b675191711d126a685-1600x653.png" style="width: 100%;" /><p>4. If you want to add <a href="https://www.elastic.co/docs/explore-analyze/dashboards/add-controls#esql-multi-values-controls">multi-value selection</a> to the control, you need to use the <code>MV_CONTAINS</code> function in the query and select “Allow multiple selections” during the control creation in step 2 (available from 9.3).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1f237cea0a37cb25a7917a2a683707a269adae8e-1600x670.png" style="width: 100%;" /><h3>Example 2: Time interval control</h3><p>If you’re building a time series, you can easily add a variable control for your date histogram interval:</p><p>1. When writing an ES|QL query for your time series, click “Create control”. When building a variable for intervals, it’s better to use <code>TBUCKET</code> instead of <code>BUCKET</code> so it accepts more readable intervals like “1 hour”, “1 day”, etc. There will also be an auto option for <code>TBUCKET</code> soon so it can automatically adapt to time ranges.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0ad53d790ff9bdd42db5e77477318319f423534-1600x664.png" style="width: 100%;" /><p>2. Define the intervals to populate the options in the dropdown menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f3bd83f530cfa4698c1a3b1ae60d08d0414043b5-1600x757.png" style="width: 100%;" /><p>3. 
Select different intervals in the dropdown menu and see how your visualization changes.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0f928d9c70929f64926e065059188d140cd48943-1600x671.png" style="width: 100%;" /><h3>Example 3: variables for functions</h3><ol><li>Build a variable using the “Static values” type of control and add function names to your dropdown values. It’s important to use a name for your variable that starts with “??...” to replace functions.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/531444237b7e152d3c8a6f3ca7e464f954f9e856-1600x663.png" style="width: 100%;" /><p>2. Include the variable name in your ES|QL query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9858442abb26d8d266d464852871b139fde63b89-1600x665.png" style="width: 100%;" /><h3>Example 4: variables for fields</h3><ol><li>You can use the “Static values” type of control and write down the names of the fields you want. It’s important to use a variable name that starts with “??...” to make it work for fields.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33534c3df2fae024b25c28b4aed5d742e54202a2-1600x710.png" style="width: 100%;" /><p>2. Reference the variable where you want in the visualization query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71cdf3e9df72c59d957628a3aa6e4aa9bd60d6d5-1600x676.png" style="width: 100%;" /><h2>Variable controls in Discover</h2><p>Variable controls aren’t just a dashboard feature—they’re also available directly in the ES|QL editor in Discover. 
You can build controls for a faster data exploration experience in Discover, then bring them to a dashboard, or vice versa.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fdddeec902d0bc746caed9276d01d7d48793dd85-1600x709.png" style="width: 100%;" /><h2>Technical details</h2><p>By now, you’ve probably noticed that variable controls come with a few rules—like which parts of a query they can reference and the naming prefixes you need to use (“?...” for values and “??...” for fields or functions). That’s because variables aren’t just simple string replacements happening on the client. They’re actually first-class citizens in the query language itself (known as <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/esql-tools#parameter-types">parameters in ES|QL</a>).</p><p>This design brings some big advantages. For one, Kibana can understand the context of each variable, which allows us to automatically generate and pre-fill its configuration for you. It’s also much more secure: because the language strictly validates variable inputs, it prevents malicious injections and gracefully errors out if something looks off. Plus, it improves performance and stability by shifting complex validation and error handling to the server instead of the client. A note on performance: it’s best practice to back variables with fast queries, since controls load before the dashboard renders and a slow query can slow down the entire dashboard.</p><p>Of course, this architecture also comes with a few <a href="https://www.elastic.co/docs/solutions/search/agent-builder/limitations-known-issues#esql-limitations">limitations</a>—for now. Variables don’t yet support an “Any” option for filtering, and they can’t currently be used with certain operators like <code>LIKE</code> or <code>FROM</code> (for switching data sources). The good news? 
We’re actively working on adding these capabilities.</p><h2>What the future holds for controls</h2><p>We’re not stopping here! Some of the improvements on our radar include:</p><p>✨ The ability to place controls anywhere on the dashboard</p><p>✨ Chaining your controls—meaning the output of one control becomes the input for the next</p><p>✨ Better selection options like “Any” selection for variables</p><p>✨ New control types (search-type control and variables for your datasources)</p><p>✨ And more quality-of-life improvements you’ve been asking for like pre-filtering normal controls</p><p>If you have ideas or feedback, we’d love to hear from you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Up to 12x Faster Vector Indexing in Elasticsearch with NVIDIA cuVS: GPU-acceleration Chapter 2]]></title>
            <description><![CDATA[Discover how Elasticsearch achieves nearly 12x higher indexing throughput with GPU-accelerated vector indexing and NVIDIA cuVS.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-gpu-accelerated-vector-indexing-nvidia</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-gpu-accelerated-vector-indexing-nvidia</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Hegarty, Hemant Malik, Corey Nolet, Manas Singh, Mithun Radhakrishnan, Mayya Sharipova, Lorenzo Dematte, Ben Frederickson]]></dc:creator>
            <pubDate>Wed, 03 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/08f7469a4daaf67b7c5999585aae179b6680c78d-896x746.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Earlier this year, Elastic announced the <a href="https://ir.elastic.co/news/news-details/2025/Elastic-Brings-Enterprise-Data-to-NVIDIA-AI-Factories/default.aspx">collaboration</a> with NVIDIA to bring GPU acceleration to Elasticsearch, integrating with <a href="https://developer.nvidia.com/cuvs">NVIDIA cuVS</a>—as detailed in a <a href="https://www.nvidia.com/en-us/on-demand/session/gtc25-S71286/">session at NVIDIA GTC</a> and various <a href="https://www.elastic.co/search-labs/blog/gpu-accelerated-vector-search-elasticsearch-nvidia">blogs</a>. This post is an update on the co-engineering effort with the NVIDIA vector search team.</p><h2>Recap</h2><p>First, let’s bring you up to speed. Elasticsearch has established itself as a powerful vector database, offering a rich set of features and strong performance for large-scale similarity search. With capabilities such as scalar quantization, Better Binary Quantization (<a href="https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch">BBQ</a>), <a href="https://www.elastic.co/blog/accelerating-vector-search-simd-instructions">SIMD</a> vector operations, and more disk-efficient algorithms like <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a>, it already provides efficient and flexible options for managing vector workloads.</p><p>By integrating NVIDIA cuVS as a callable module for vector search tasks, we aim to deliver significant gains in vector indexing performance and efficiency to better support large-scale vector workloads.</p><h2>The challenge</h2><p>One of the toughest challenges in building a high-performance vector database is constructing the vector index - the <a href="https://arxiv.org/abs/1603.09320">HNSW</a> graph. Index building quickly becomes dominated by millions or even billions of arithmetic operations as every vector is compared against many others. 
In addition, index lifecycle operations, such as compaction and merges, can further increase the overall compute overhead of indexing. As data volumes and associated vector embeddings grow exponentially, accelerated computing GPUs, built for massive parallelism and high-throughput math, are ideally positioned to handle these workloads.</p><h2>Enter the Elasticsearch-GPU Plugin</h2><p><a href="https://developer.nvidia.com/cuvs">NVIDIA cuVS</a> is an open-source CUDA-X library for GPU-accelerated vector search and data clustering that enables fast index building and embedding retrieval for AI and recommendation workloads.</p><p>Elasticsearch uses cuVS through <a href="https://mvnrepository.com/artifact/com.nvidia.cuvs/cuvs-java">cuvs-java</a>, an open-source library developed by the community and maintained by NVIDIA. The cuvs-java library is lightweight and builds on the <a href="https://docs.rapids.ai/api/cuvs/nightly/c_api/">cuVS C API</a> using <a href="https://openjdk.org/projects/panama/">Panama</a> Foreign Function to expose cuVS features in an idiomatic Java way, while remaining modern and performant.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f6daa1eb07f704a6707d9e6b7ccb81d0abaa8c9-566x419.png" style="width: 100%;" /><p>The cuvs-java library is integrated into a <a href="https://github.com/elastic/elasticsearch/pull/135545">new Elasticsearch plugin</a>; therefore, vector indexing on the GPU can occur on the same Elasticsearch node and process, without the need to provision any external code or hardware. During index building, if the cuVS library is installed and a GPU is present and configured, Elasticsearch will use the GPU to accelerate the vector indexing process. The vectors are given to the GPU, which constructs a <a href="https://arxiv.org/abs/2308.15136">CAGRA</a> graph. This graph is then converted to the HNSW format, making it immediately available for vector search on the CPU. 
The final format of the built graph is the same as what would be built on the CPU; this allows Elasticsearch to leverage GPUs for high-throughput vector indexing when the underlying hardware supports it, while freeing CPU power for other tasks (concurrent search, data processing, etc.).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3ea255bd9bfd7983f78143c5eba999d2149d72be-671x356.png" style="width: 100%;" /><h2>Index build acceleration</h2><p>As part of integrating GPU acceleration into Elasticsearch, several enhancements were made to cuvs-java, focusing on efficient data input/output and function invocation. A key enhancement is the use of <a href="https://github.com/rapidsai/cuvs/blob/2cf5fa7666d703dccbe655f8214656b0952bb69b/java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSMatrix.java">cuVSMatrix</a> to transparently model vectors, whether they reside on the Java heap, off-heap, or in GPU memory. This enables data to move efficiently between memory and the GPU, avoiding unnecessary copies of potentially billions of vectors.</p><p>Thanks to this underlying zero-copy abstraction, both transferring to GPU memory and retrieving the graph can occur directly. During indexing, vectors are first buffered in memory on the Java heap, then sent to the GPU to construct the CAGRA graph. The graph is subsequently retrieved from the GPU, converted into HNSW format, and persisted to disk.</p><p>At merge time, the vectors are already stored on disk, bypassing the Java heap entirely. Index files are memory-mapped, and data is transferred directly into GPU memory. The design also easily accommodates different bit-widths, such as float32 or int8, and naturally extends to other quantization schemes.</p><h2>Drumroll…so, how does it perform?</h2><p>Before we get into the numbers, a bit of context is helpful. Segment merging in Elasticsearch typically runs automatically in the background during indexing, which makes it difficult to benchmark in isolation. 
To obtain reproducible results, we used force-merge to explicitly trigger segment merging in a controlled experiment. Since force-merge performs the same underlying merge operations as background merging, its performance serves as a useful indicator of expected improvements, even though the exact gains may differ in real-world indexing workloads.</p><p>Now, let’s see the numbers.</p><p>Our initial benchmark results are very promising. We ran the benchmark on an AWS <code>g6.4xlarge</code> instance with locally attached NVMe storage. A single node of Elasticsearch was configured to use the default, optimal number of indexing threads (8 - one for each physical core), and to disable <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/merge">merge throttling</a> (which is less applicable with fast NVMe disks).</p><p>For the dataset, we used 2.6 million vectors with 1,536 dimensions from the <a href="https://github.com/elastic/rally-tracks/blob/master/openai_vector/README.md">OpenAI Rally vector track</a>, encoded as <a href="https://github.com/elastic/elasticsearch/pull/137072">base64 strings</a>, and indexed as float32 <em>hnsw</em>. In all scenarios, the constructed graphs achieve recall levels of up to 95%. Here’s what we found:</p><ul><li><strong>Indexing Throughput:</strong> By moving graph construction to the GPU during in-memory buffer flushes, we increase throughput by ~12x.</li><li><strong>Force-merge:</strong> After indexing completes, the GPU continues to accelerate segment merging, speeding up the force-merge phase by ~7x.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f60ea9ee5996e456f393ffd195ee7eada6e5a7c2-948x387.png" style="width: 100%;" /><ul><li><strong>CPU usage:</strong> Offloading graph construction to the GPU significantly reduces both average and peak CPU utilization. 
The graphs below illustrate CPU usage during indexing and merging, highlighting how much lower it is when these operations run on the GPU. Lower CPU utilization during GPU indexing frees up CPU cycles that can be redirected to improve search performance.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e680a5fc41700a877f3d8b2e5ce18ebd3f37a0b-1600x562.png" style="width: 100%;" /><ul><li><strong>Recall:</strong> Accuracy remains effectively the same between CPU and GPU runs, with the GPU-built graph reaching marginally higher recall.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/48a2b7758606bd321712b7d8378cd2640e652a4e-1384x544.png" style="width: 100%;" /><h2>Comparing along another dimension: Price</h2><p>The earlier comparison intentionally used identical hardware, with the only difference being whether the GPU was used during indexing. That setup is useful for isolating raw compute effects, but we can also look at the comparison from a cost perspective.</p><p>At roughly the same hourly price as the GPU-accelerated configuration, one can provision a CPU-only setup with approximately twice the comparable CPU and memory resources: 32 vCPUs (AMD EPYC) and 64 GB of RAM, allowing us to double the number of indexing threads to 16.</p><p>To keep the comparison fair and consistent, we ran this CPU-only experiment on an AWS g6.8xlarge instance, with the GPU explicitly disabled. This allowed us to hold all other hardware characteristics constant while evaluating the cost–performance trade-off of GPU acceleration versus CPU-only indexing.</p><p>The more powerful CPU instance does show improved performance compared to the benchmarks in the above section, as you would expect. 
However, when we compare this more powerful CPU instance against the original GPU-accelerated results, the GPU still delivers substantial performance gains: <strong>~5x</strong> improvement in indexing throughput, and <strong>~6x </strong>in force merge, all while building graphs that achieve recall levels of up to <strong>95%.</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8ffa58cae3ad175ef2932a351aeef4c34a1407b9-948x394.png" style="width: 100%;" /><h2>Conclusion</h2><p>In end-to-end scenarios, GPU acceleration with NVIDIA cuVS delivers nearly a 12x improvement in indexing throughput and a 7x decrease in force-merge latency, with significantly lower CPU utilization. This shows that vector indexing and merge workloads benefit significantly from GPU acceleration. On a cost-adjusted comparison, GPU acceleration continues to yield substantial performance gains, with approximately 5x higher indexing throughput and 6x faster force-merge operations.</p><p>GPU-accelerated vector indexing is currently planned for Tech Preview in Elasticsearch 9.3, which is scheduled to be released early in 2026.</p><p>Stay tuned for more.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ES|QL in 9.2: Smart Lookup Joins and time-series support]]></title>
            <description><![CDATA[Explore three separate updates to ES|QL in Elasticsearch 9.2: an enhanced LOOKUP JOIN for more expressive data correlation, the new TS command for time-series analysis, and the flexible INLINE STATS command for aggregation.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-elasticsearch-9-2-multi-field-joins-ts-command</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-elasticsearch-9-2-multi-field-joins-ts-command</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tyler Perkins, Kostas Krikellas, Julian Kiryakov]]></dc:creator>
            <pubDate>Tue, 02 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/415894e21e7758c907d6e60d4efc94230349beef-2012x1164.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch 9.2, released in October, is packed with significant advancements that make analyzing your data faster, more flexible, and more accessible than ever before. At the heart of this release are important enhancements to ES|QL, our piped query language, designed to bring even more value directly to end users.</p><p>Here&#x27;s a look at the features in Elasticsearch 9.2 that will transform your data analysis workflows with ES|QL.</p><h2>Revolutionizing data correlation: A smarter, faster, and more flexible Lookup Join</h2><p>The <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join">LOOKUP JOIN</a> command in ES|QL has undergone a significant transformation in Elasticsearch 9.2, becoming dramatically more efficient and versatile. LOOKUP JOIN combines data from your ES|QL query results table with matching records from a specified lookup mode index. It adds fields from the lookup index as new columns to your results table based on matching values in the join field. Previously, joining data was limited to a single field and simple equality. Not anymore! These enhancements empower you to tackle complex data correlation scenarios with ease.</p><p><strong>Key enhancements to Lookup Join include:</strong></p><ul><li><strong>Multi-Field Joins:</strong> Easily join on multiple fields. For example, to join <code>application_logs</code> with <code>service_registry</code> on <code>service_name</code>, <code>environment</code> and <code>version:</code></li></ul><pre><code>FROM application_logs
| LOOKUP JOIN service_registry ON service_name, environment, version</code></pre><ul><li><strong>Unleashing complex join predicates with expressions (tech preview):</strong></li></ul><p>No longer are you confined to simple equality. LOOKUP JOIN now allows you to specify <strong>multiple criteria</strong> for correlation and incorporate a range of <strong>binary operators,</strong> including ==, !=, &lt;, &gt;, &lt;=, and &gt;=. This means you can create highly nuanced join conditions, enabling you to ask much more sophisticated questions of your data.</p><p>Example 1: Finding application metrics with per-service SLA threshold</p><pre><code>FROM application_metrics
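// a metrics row matches a thresholds row only when both ON conditions below hold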
| LOOKUP JOIN sla_thresholds
      ON service_name == sla_service AND response_time > sla_response_time</code></pre><p>Example 2: This query calculates the amount due, based on regional pricing policies that change over time. It joins three datasets based on complex date-range and equality conditions to calculate a final <code>due_amount</code>. The second lookup join uses the <code>measurement_date</code> field from the <code>meter_readings</code> index and <code>region_id</code> field from the <code>customers</code> index to join to the <code>pricing_policies</code> index and find the correct pricing policy for the particular <code>region</code> and <code>measurement_date</code>.</p><pre><code>FROM meter_readings
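// the first join adds customer fields (including region_id), keyed by meter_id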
| LOOKUP JOIN customers
      ON meter_id
| LOOKUP JOIN pricing_policies
      ON
        region_id == region AND
          measurement_date >= policy_begin_date AND
          measurement_date < policy_end_date
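// the range conditions select the policy whose validity window covers measurement_date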
| EVAL due_amount = (kwh_consumed * rate_per_kwh + base_charge) * (1 + tax_rate)
| EVAL period = policy_name
| KEEP customer_name, period, due_amount, measurement_date, kwh_consumed,
    rate_per_kwh, base_charge, tax_rate
| SORT measurement_date</code></pre><ul><li><strong>Massive performance gains for filtered joins: </strong></li></ul><p>We&#x27;ve improved performance for &quot;expanding joins&quot; that are filtered using lookup table conditions. Expanding joins produce multiple matches per input row, which can create large intermediate result sets. This gets worse when many of those rows are discarded by a subsequent filter. In 9.2, we optimize these joins by filtering out unnecessary rows when a filter is applied to lookup data, avoiding processing rows that would be discarded. In some scenarios, these joins can be up to <strong>1000 times faster</strong>!</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/002c014ee29b1aaf9ddeb8c554bb76efe3ed180c-1572x954.png" style="width: 100%;" /><p>This optimization is crucial when dealing with &quot;expanding joins,&quot; where a lookup might initially generate many potential matches. By intelligently pushing down filters, only the relevant data is processed, drastically reducing query execution time and enabling real-time analysis on massive datasets. This means you get your insights much faster, even with very large or complex join operations.</p><p><strong>Lookup Join Cross-Cluster Search (CCS) compatibility:</strong></p><p>When Lookup Join went GA in 8.19 and 9.1, it lacked Cross-Cluster Search (CCS) support. For organizations operating across multiple clusters, LOOKUP JOIN now seamlessly integrates with CCS in 9.2. Simply place your lookup index on all the remote clusters where you want to perform a join, and ES|QL will automatically leverage these remote lookup indices to join with your remote data. 
This simplifies distributed data analysis and ensures consistent enrichment across your entire Elasticsearch deployment.</p><p>These improvements mean you can correlate diverse datasets with unprecedented precision, speed, and ease, uncovering deeper, more actionable insights without complex workarounds or pre-processing steps.</p><h2>Enrich your data with ease: Kibana Discover UX for Lookup Indices</h2><p>Data enrichment should be simple, not a hurdle. We&#x27;ve introduced a fantastic new user experience in Kibana&#x27;s Discover for creating and managing lookup indices.</p><p><strong>Intuitive Workflow:</strong> Discover’s comprehensive autocomplete will guide you through the process, suggesting lookup indices and join fields in the ES|QL editor, making it incredibly easy to connect your uploaded data with existing indices. Type the name of a lookup index that doesn’t exist, and get direct access to the Lookup editor with one click to create the index. Type the name of an existing lookup index, and we’ll suggest an option to edit it:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d36fd6ffd6bc0bf8d31067f6266445c68d15c71c-1400x184.png" style="width: 100%;" /><p><strong>In-line management (CRUD):</strong> Keep your reference datasets current with in-line editing capabilities (Create, Read, Update, Delete) directly in Discover.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0e4aa66540b1f725c24ccb0519d978415073bb6-1453x842.png" style="width: 100%;" /><p><strong>Effortless file upload: </strong>You can now directly upload files, such as CSVs, within Discover and instantly use them in your <code>LOOKUP JOIN</code>&#x27;s. 
No more jumping around to different areas of Kibana!</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0f5dc2c712af4c4cada50292a7c8b836eb02aa67-1600x748.png" style="width: 100%;" /><p>Whether you&#x27;re mapping user IDs to names, adding business metadata, or joining static reference files, this feature democratizes data enrichment, putting the power of joins directly into the hands of every user – fast, simple, and all in one place.</p><h2>Preserve your context: Introducing INLINE STATS (tech preview)</h2><p>Aggregating data is crucial, but sometimes you need to see the aggregates <em>alongside</em> your original data. We&#x27;re excited to introduce <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/inlinestats-by">INLINE STATS</a> as a <strong>Tech Preview</strong> feature.</p><p>Unlike the <code>STATS</code> command, which replaces your input fields with aggregated output, <code>INLINE STATS</code> preserves all your original input fields and simply adds the new aggregated fields. This empowers you to perform further operations on your original input fields <em>after</em> aggregation, providing a more continuous and flexible analysis workflow.</p><p>For example, to calculate the average flight distance while keeping individual flight rows:</p><pre><code>FROM kibana_sample_data_flights
 | KEEP Carrier, Dest, DistanceMiles
 | INLINE STATS avgDist = ROUND(AVG(DistanceMiles))
       BY Dest
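 // every original column survives, so avgDist can be compared against DistanceMiles below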
 | WHERE DistanceMiles > avgDist</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6eea943035e0ab371270084c504a06bb89f8b82b-1496x290.png" style="width: 100%;" /><p>In this query, <code>avgDist</code> is added to each row with the corresponding <code>Dest</code>(ination) that we grouped by, and then, because we still have the flight info columns, we’re able to filter the results to the flights with a distance greater than the average.</p><h2>Time-series support in ES|QL (tech preview)</h2><p>Elasticsearch uses <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">time-series data streams</a> for storing metrics. We’re adding support for time series aggregations in ES|QL, through the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code></a> source command. This is available in Elastic Cloud serverless and 9.2 basic as a tech preview.</p><p>Time series analysis is largely based on aggregation queries that summarize metric values over time buckets, sliced by one or more filtering dimensions. Most aggregation queries rely on two-step processing, with (a) an inner aggregation function summarizing values per time series and (b) an outer aggregation function, combining the results from (a) across time series.</p><p>The <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code></a> source command, combined with <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS</code></a>, provides a concise yet effective way for expressing such queries over time series. More concretely, consider the following example for calculating the total rate of requests per host and hour:</p><pre><code>TS my_metrics
| WHERE @timestamp > NOW() - 1 day
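// RATE is evaluated per individual time series first; SUM then combines
// those partial results per host and hour bucket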
| STATS SUM(RATE(requests))
      BY host, TBUCKET(1h)</code></pre><p>In this case, the time series aggregation function <code>RATE</code> is first evaluated per time series and hour. The produced partial aggregates are then combined using <code>SUM</code> to calculate the final aggregate values per host and hour.</p><p>You can check out the list of available time series aggregation functions <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">here</a>. <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-metric">counter</a> rate is now supported, arguably the most important aggregation function for processing counters.</p><p>The <code>TS</code> source command is designed to be combined with <code>STATS</code>, with execution tuned to efficiently support time series aggregations. For example, the data is sorted before going into the <code>STATS</code>. Processing commands that may enrich or alter the time series data or its order, such as <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> or <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/inlinestats-by"><code>INLINE STATS</code></a>, are currently not allowed between <code>TS</code> and <code>STATS</code>. This limitation might be lifted in the future.</p><p>The <code>STATS</code> tabular output can be further processed with any applicable command. For example, the following query calculates the ratio of the average <code>cpu_usage</code> per host and hour to the maximum value per host:</p><pre><code>TS my_metrics
| STATS avg_usage = AVG(AVG_OVER_TIME(cpu_usage))
      BY host, time_bucket = TBUCKET(1h)
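// INLINE STATS adds max_avg_usage as a column per host while keeping every row,
// so each (host, time_bucket) row can be compared against its host-level maximum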
| INLINE STATS max_avg_usage = MAX(avg_usage)
      BY host
| EVAL ratio = avg_usage / max_avg_usage
| KEEP host, time_bucket, ratio
| SORT host, time_bucket DESC</code></pre><p>Time series data is stored on our underlying columnar storage engine that is powered by Lucene doc values. The TS command adds vectorized query execution through the ES|QL compute engine. Query performance is often improved by more than an order of magnitude, compared to equivalent <a href="https://www.elastic.co/docs/reference/query-languages/querydsl">DSL</a> queries, and is on par with established, metrics-specific systems. We’ll be providing a detailed architectural and performance analysis in the future, so stay tuned.</p><h2>Expanding your toolkit: New ES|QL functions</h2><p>To further enhance ES|QL&#x27;s utility and versatility, we&#x27;ve added a suite of new <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-functions-operators">functions</a>:</p><p><strong>String manipulation: </strong><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-contains">CONTAINS</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/mv-functions#esql-mv_contains">MV_CONTAINS</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-url_encode">URL_ENCODE</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-url_encode_component">URL_ENCODE_COMPONENT</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-url_decode">URL_DECODE</a> for more robust text and URL processing.</p><p><strong>Time series &amp; geospatial:</strong> <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/grouping-functions#esql-tbucket">TBUCKET</a> for flexible time bucketing, TO_DENSE_VECTOR for vector operations, and a comprehensive set of <a 
href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/spatial-functions">geospatial functions</a> like <code>ST_GEOHASH</code>, <code>ST_GEOTILE</code>, <code>ST_GEOHEX</code>, <code>TO_GEOHASH</code>, <code>TO_GEOTILE</code>, <code>TO_GEOHEX</code> for advanced location-based analysis.</p><p><strong>Date formatting:</strong> <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/date-time-functions#esql-day_name">DAY_NAME</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/date-time-functions#esql-month_name">MONTH_NAME</a> for more readable date representations.</p><p>These functions provide you with a richer set of tools to manipulate and analyze your data directly within ES|QL.</p><h2>Under the hood: More performance and efficiency</h2><p>Beyond the highlighted features, Elasticsearch 9.2 includes numerous performance optimizations across ES|QL. We sped up <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where#like-and-rlike">RLIKE (LIST)</a> with pushdown in cases where it replaces multiple similar RLIKE queries: the individual patterns are merged into a single automaton that is applied once, instead of evaluating several automatons separately. We also have faster loading of keyword fields with index sorts, as well as general query optimizations. These improvements ensure that your ES|QL queries run more efficiently than ever before.</p><h2>Get started today!</h2><p>Elasticsearch 9.2 represents a significant leap forward for ES|QL, bringing unprecedented power and flexibility to your data analysis workflows. We encourage you to explore these new features and experience the difference they make.</p><p>For a comprehensive list of all changes and enhancements in Elasticsearch 9.2, please consult the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/9.2/release-notes-9.2.0.html">official release notes</a>. 
Happy querying!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a ChatGPT connector with Elasticsearch to query GitHub issues]]></title>
            <description><![CDATA[Learn how to build a custom ChatGPT connector and deploy an Elasticsearch MCP server that uses hybrid search to query internal GitHub issues.]]></description>
            <link>https://www.elastic.co/search-labs/blog/chatgpt-connector-mcp-server-github-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/chatgpt-connector-mcp-server-github-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Mon, 01 Dec 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/77a6fc1eb263a0eb16aac64f2ecaca5f4ac12ec2-966x568.gif?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Recently, OpenAI announced the <a href="https://help.openai.com/en/articles/11487775-connectors-in-chatgpt">custom connectors</a> feature for ChatGPT on Pro/Business/Enterprise and Edu plans. In addition to out-of-the-box connectors that tap data in Gmail, GitHub, Dropbox, and other services, it is possible to create custom connectors using MCP servers.</p><p>Custom connectors give you the ability to combine your existing ChatGPT connectors with additional sources of data like Elasticsearch to get comprehensive answers.</p><p>In this article, we’ll build an <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a> server that connects ChatGPT to an Elasticsearch index containing information on internal GitHub issues and pull requests. This enables natural language queries to be answered using your Elasticsearch data.</p><p>We’ll deploy the MCP server using the <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a> framework on Google Colab with ngrok to get a public URL that ChatGPT can connect to, eliminating the need for complex infrastructure setup.</p><p>For a comprehensive overview of MCP and its ecosystem, refer to <a href="https://www.elastic.co/search-labs/blog/mcp-current-state">The Current State of MCP</a>.</p><h2>Prerequisites</h2><p>Before starting, you will need:</p><ul><li>Elasticsearch cluster (8.X or higher)</li><li>Elasticsearch API key with read access to your index</li><li>Google account (for Google Colab)</li><li>Ngrok account (free tier works)</li><li>ChatGPT account with Pro/Enterprise/Business or Edu plan</li></ul><h2>Understanding ChatGPT MCP connector requirements</h2><p>ChatGPT MCP connectors require implementing two tools: <code>search</code> and <code>fetch</code>. 
For more details, refer to <a href="https://platform.openai.com/docs/mcp#create-an-mcp-server">OpenAI Docs</a>.</p><h3><a href="https://platform.openai.com/docs/mcp#search-tool">Search tool</a></h3><p>Returns a list of relevant results from your Elasticsearch index based on a user query.</p><h4>What it receives:</h4><ul><li>A single string with the user’s natural language query.</li><li>Example: “Find issues related to Elasticsearch migration.”</li></ul><h4>What it returns:</h4><ul><li>An object with a <code>results</code> key containing an array of result objects. Each result includes:<ul><li><code>id</code> - Unique document identifier</li><li><code>title</code> - Issue or PR title</li><li><code>url</code> - Link to the issue/PR</li></ul></li></ul><h4>In our implementation:</h4><pre><code>return {
    "results": [
        {
            "id": "PR-612",
            "title": "Fix memory leak in WebSocket notification service",
            "url": "https://internal-git.techcorp.com/pulls/612"
        },
        # ... more results
    ]
}</code></pre><h3><a href="https://platform.openai.com/docs/mcp#fetch-tool">Fetch tool</a></h3><p>Retrieves the complete content of a specific document.</p><h4>What it receives:</h4><ul><li>A single string with the Elasticsearch document ID from the search result</li><li>Example: “Get me the details of PR-578.”</li></ul><h4>What it returns:</h4><ul><li>A complete document object with:<ul><li><code>id</code> - Unique Document identifier</li><li><code>title</code> - Issue or PR title</li><li><code>text</code> - Complete issue/PR description and details</li><li><code>url</code> - Link to the issue/PR</li><li><code>type</code> - Document type (issue, pull_request)</li><li><code>status</code> - Current status (open, in_progress, resolved)</li><li><code>priority</code> - Priority level (low, medium, high, critical)</li><li><code>assignee</code> - Person assigned to the issue/PR</li><li><code>created_date</code> - When it was created</li><li><code>resolved_date</code> - When it was resolved (if applicable)</li><li><code>labels</code> - Tags associated with the document</li><li><code>related_pr</code> - Related pull request ID</li></ul></li></ul><pre><code>return {
    "id": "PR-578",
    "title": "Security hotfix: Patch SQL injection vulnerabilities",
    "text": "Description: CRITICAL SECURITY FIX for ISSUE-1889. Patches SQL...",
    "url": "https://internal-git.techcorp.com/pulls/578",
    "type": "pull_request",
    "status": "closed",
    "priority": "critical",
    "assignee": "sarah_dev",
    "created_date": "2025-09-19",
    "resolved_date": "2025-09-19",
    "labels": "security, hotfix, sql",
    "related_pr": null
}</code></pre><p><strong>Note</strong>: This example uses a flat structure where all fields are at the root level. OpenAI requirements are flexible and also support nested metadata objects.</p><h2>GitHub issues and PRs dataset</h2><p>For this tutorial, we are going to use an internal GitHub dataset containing issues and pull requests. This represents a scenario where you want to query private, internal data through ChatGPT.</p><p>The dataset can be found <a href="https://gist.github.com/TomasMurua/4e7bbdf7a7ebbdffaa663c43578d934a">here</a>, and we will index the data using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk">bulk API</a>.</p><p>This dataset includes:</p><ul><li>Issues with descriptions, status, priority, and assignees</li><li>Pull requests with code changes, reviews, and deployment info</li><li>Relationships between issues and PRs (e.g., PR-578 fixes ISSUE-1889)</li><li>Labels, dates, and other metadata</li></ul><h3>Index mappings</h3><p>The index uses the following <a href="https://www.elastic.co/docs/manage-data/data-store/mapping">mappings</a> to support hybrid search with <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-elser">ELSER</a>. The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">text_semantic</a> field is used for semantic search, while other fields enable keyword search.</p><pre><code>{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "text_semantic": {
        "type": "semantic_text",
        "inference_id": ".elser-2-elasticsearch"
      },
      "url": {
        "type": "keyword"
      },
      "type": {
        "type": "keyword"
      },
      "status": {
        "type": "keyword"
      },
      "priority": {
        "type": "keyword"
      },
      "assignee": {
        "type": "keyword"
      },
      "created_date": {
        "type": "date",
        "format": "iso8601"
      },
      "resolved_date": {
        "type": "date",
        "format": "iso8601"
      },
      "labels": {
        "type": "keyword"
      },
      "related_pr": {
        "type": "keyword"
      }
    }
  }
}</code></pre><h2>Build the MCP server</h2><p>Our MCP server implements two tools following OpenAI specifications, using hybrid search to combine semantic and text matching for better results.</p><h3>Search tool</h3><p>Uses hybrid search with <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">RRF</a> (Reciprocal Rank Fusion), combining semantic search with text matching:</p><pre><code>@mcp.tool()
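# Note: assumes the FastMCP server and Elasticsearch client were created
# earlier in the notebook, e.g. (illustrative names, matching this article):
#   mcp = FastMCP("elasticsearch-mcp")
#   es_client = Elasticsearch(ELASTICSEARCH_URL, api_key=ELASTICSEARCH_API_KEY)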
async def search(query: str) -> Dict[str, List[Dict[str, Any]]]:
        """
        Search for internal issues and PRs using hybrid search (semantic + text with RRF).
        Returns list with id, title, and url per OpenAI spec.
        """
        if not query or not query.strip():
            return {"results": []}

        logger.info(f"Searching for: '{query}'")

        try:
            # Hybrid search with RRF (Reciprocal Rank Fusion)
            response = es_client.search(
                index=ELASTICSEARCH_INDEX,
                size=10,
                source=["id", "title", "url", "type", "priority"],
                retriever={
                    "rrf": {
                        "retrievers": [
                            {
                                # Semantic search with ELSER
                                "standard": {
                                    "query": {
                                        "semantic": {
                                            "field": "text_semantic",
                                            "query": query
                                        }
                                    }
                                }
                            },
                            {
                                # Text search (BM25) for keyword matching
                                "standard": {
                                    "query": {
                                        "multi_match": {
                                            "query": query,
                                            "fields": [
                                                "title^3",
                                                "text^2",
                                                "assignee^2",
                                                "type",
                                                "labels",
                                                "priority"
                                            ],
                                            "type": "best_fields",
                                            "fuzziness": "AUTO"
                                        }
                                    }
                                }
                            }
                        ],
                        "rank_window_size": 50,
                        "rank_constant": 60
                    }
                }
            )

            results = []
            if response and 'hits' in response:
                for hit in response['hits']['hits']:
                    source = hit['_source']
                    results.append({
                        "id": source.get('id', hit['_id']),
                        "title": source.get('title', 'Unknown'),
                        "url": source.get('url', '')
                    })

            logger.info(f"Found {len(results)} results")
            return {"results": results}

        except Exception as e:
            logger.error(f"Search error: {e}")
            raise ValueError(f"Search failed: {str(e)}")</code></pre><h3>Key points:</h3><ul><li><strong>Hybrid search with RRF:</strong> Combines semantic search (ELSER) and text search (BM25) for better results.</li><li><strong>Multi-match query:</strong> <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-multi-match-query">Searches across multiple fields</a> with boosting (title^3, text^2, assignee^2). The caret symbol (^) multiplies relevance scores, prioritizing matches in titles over content.</li><li><strong>Fuzzy matching:</strong> <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/common-options#fuzziness"><code>fuzziness: AUTO</code></a> handles typos and misspellings by allowing approximate matches.</li><li><strong>RRF parameter tuning:</strong><ul><li><code>rank_window_size: 50</code> - Specifies how many top results from each retriever (semantic and text) are considered before merging.</li><li><code>rank_constant: 60</code> - This value determines how much influence documents in individual result sets have over the final ranked result.</li></ul></li><li><strong>Returns only the required fields:</strong> <code>id</code>, <code>title</code>, <code>url</code> per OpenAI specification, and avoids exposing additional fields unnecessarily.</li></ul><h3>Fetch tool</h3><p>Retrieves the full document by its ID, if it exists:</p><pre><code>@mcp.tool()
async def fetch(id: str) -> Dict[str, Any]:
        """
        Retrieve complete issue/PR details by ID.
        Returns id, title, text, url.
        """
        if not id:
            raise ValueError("ID is required")

        logger.info(f"Fetching: {id}")

        try:
            # Search by the 'id' field (not _id) since IDs are stored as a field
            # Pass query/size as keyword arguments (the 8.x Python client
            # deprecates the body= parameter)
            response = es_client.search(
                index=ELASTICSEARCH_INDEX,
                size=1,
                query={
                    "term": {
                        "id": id  # Search by your custom 'id' field
                    }
                }
            )

            if not response or not response['hits']['hits']:
                raise ValueError(f"Document with id '{id}' not found")

            hit = response['hits']['hits'][0]
            source = hit['_source']

            result = {
                "id": source.get('id', id),
                "title": source.get('title', 'Unknown'),
                "text": source.get('text', ''),
                "url": source.get('url', ''),
                "type": source.get('type', ''),
                "status": source.get('status', ''),
                "priority": source.get('priority', ''),
                "assignee": source.get('assignee', ''),
                "created_date": source.get('created_date', ''),
                "resolved_date": source.get('resolved_date', ''),
                "labels": source.get('labels', ''),
                "related_pr": source.get('related_pr', '')
            }

            logger.info(f"Fetched: {result['title']}")
            return result

        except Exception as e:
            logger.error(f"Fetch error: {e}")
            raise ValueError(f"Failed to fetch '{id}': {str(e)}")</code></pre><h3>Key points:</h3><ul><li><strong>Search by document ID field:</strong> Uses term query on the custom <code>id</code> field</li><li><strong>Returns complete document:</strong> Includes full <code>text</code> field with all content</li><li><strong>Flat structure:</strong> All fields at the root level, matching Elasticsearch&#x27;s document structure.</li></ul><h2>Deploy on Google Colab</h2><p>We&#x27;ll use Google Colab to run our MCP server and ngrok to expose it publicly so ChatGPT can connect to it.</p><h3>Step 1: Open the Google Colab notebook</h3><p>Access our pre-configured notebook <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/elasticsearch-chatgpt-connector">Elasticsearch MCP for ChatGPT</a>.</p><h3>Step 2: Configure your credentials</h3><p>You’ll need three pieces of information:</p><ul><li><strong>Elasticsearch URL:</strong> Your <a href="https://www.elastic.co/docs/deploy-manage/deploy/cloud-enterprise/connect-elasticsearch">Elasticsearch cluster URL</a>.</li><li><strong>Elasticsearch API Key:</strong> <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">API key</a> with read access to your index.</li><li><strong>Ngrok Auth Token:</strong> Free token from <a href="https://ngrok.com/">ngrok</a>. We will use ngrok to expose the MCP URL to the internet so ChatGPT can connect to it.</li></ul><h4>Getting your ngrok token</h4><ol><li>Sign up for a free account at <a href="https://ngrok.com/">ngrok</a></li><li>Go to your <a href="https://dashboard.ngrok.com/">ngrok dashboard</a></li><li>Copy your auth token</li></ol><h4>Adding secrets to Google Colab</h4><p>In the Google Colab notebook:</p><ol><li>Click the <strong>Key icon </strong>in the left sidebar to open <strong>Secrets</strong>.</li><li>Add these three secrets:</li></ol><pre><code>ELASTICSEARCH_URL=https://your-cluster.elastic.com:443
ELASTICSEARCH_API_KEY=your-api-key
NGROK_TOKEN=your-ngrok-token</code></pre><p>3.&nbsp;&nbsp; Enable notebook access for each secret</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5dd6ac19fe816a562c6351fdb0f11369da0e877-609x321.jpg" style="width: 100%;" /><h3>Step 3: Run the notebook</h3><ol><li>Click <strong>Runtime</strong> and then <strong>Run all</strong> to execute all cells</li><li>Wait for the server to start (about 30 seconds)</li><li>Look for the output showing your public ngrok URL</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f185100e8869624bc9e1c7b2b4eb32785e2d89e7-1189x283.png" style="width: 100%;" /><p>4.&nbsp; The output will show something like:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e02e625e91ed9136454e4401b184575fb03a336e-1052x465.jpg" style="width: 100%;" /><h2>Connect to ChatGPT</h2><p>Now we’ll connect the MCP server to your ChatGPT account.</p><ol><li>Open ChatGPT and go to <strong>Settings</strong>.</li><li>Navigate to <strong>Connectors. </strong>If you&#x27;re using a Pro account, you need to turn on <a href="https://platform.openai.com/docs/guides/developer-mode">developer mode</a> in connectors.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32c02192912fc0e7e5a52e9399077ba7ae3b4901-739x715.png" style="width: 100%;" /><p><em>If you&#x27;re using ChatGPT enterprise or business, you need to publish the connector to your workplace.</em></p><p>3.&nbsp; Click on <strong>Create.</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/15c53e5ccc381108a9dc0052cca05bf0fc97679a-755x683.png" style="width: 100%;" /><p><em><strong>Note</strong>: In Business, Enterprise, and Edu workspaces, only workspace owners, admins, and users with the respective setting enabled (for Enterprise/Edu) can add custom connectors. 
Users with a regular member role do not have the ability to add custom connectors themselves.</em></p><p><em>Once a connector is added and enabled by an owner or admin user, it becomes available for all members of the workspace to use.</em></p><p>4.&nbsp; Enter the required information and your ngrok URL ending in <code>/sse/</code>. Note the “/” after “sse.” It won’t work without it:</p><ul><li><strong>Name:</strong> Elasticsearch MCP</li><li><strong>Description: </strong>Custom MCP for searching and fetching GitHub internal information.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/162a85705cc8ac48a3f2f665551d513e0719f93d-479x684.png" style="width: 100%;" /><p>5. Press <strong>Create</strong> to save the custom MCP.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/97eb5fb0a32b86bfadfb35561f698616f217c049-913x629.png" style="width: 100%;" /><p>The connection is instant if your server is running. No additional authentication is needed, as the Elasticsearch API key is configured in your server.</p><h2>Test the MCP server</h2><p>Before asking questions, you need to select which connector ChatGPT should use.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/77a6fc1eb263a0eb16aac64f2ecaca5f4ac12ec2-966x568.gif" style="width: 100%;" /><h3>Prompt 1: Search for issues</h3><p>Ask: “<strong>Find issues related to Elasticsearch migration”</strong> and confirm the tool call when prompted.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cfd781acbff8cd7c8095bbe29224f8b26d581f77-650x375.png" style="width: 100%;" /><p>ChatGPT will call the <code>search</code> tool with your query. You can see it looking for available tools and preparing to call the Elasticsearch tool; it confirms with the user before taking any action.</p><h4>Tool call request:</h4><pre><code>{
  "query": "Elasticsearch migration issues"
}</code></pre><h4>Tool response:</h4><pre><code>{
  "results": [
    {
      "id": "PR-598",
      "title": "Elasticsearch 8.x migration - Application code changes",
      "url": "https://internal-git.techcorp.com/pulls/598"
    },
    {
      "id": "ISSUE-1712",
      "title": "Migrate from Elasticsearch 7.x to 8.x",
      "url": "https://internal-git.techcorp.com/issues/1712"
    },
    {
      "id": "RFC-045",
      "title": "Design Proposal: Microservices Migration Architecture",
      "url": "https://internal-git.techcorp.com/rfcs/045"
    }
    // ... 7 more results
  ]
}</code></pre><p>ChatGPT processes the results and presents them in a natural, conversational format.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9d5b6cff85c7e54ccc2584b8ae96d45495fae8c1-923x1352.png" style="width: 100%;" /><h3>Behind the scenes</h3><h4>Prompt: “Find issues related to Elasticsearch migration”</h4><p>1. ChatGPT calls <code>search(“Elasticsearch migration”)</code></p><p>2. Elasticsearch performs a hybrid search</p><ul><li><strong>Semantic search</strong> understands concepts like “upgrade” and “<em>version compatibility.”</em></li><li><strong>Text search</strong> finds exact matches for “<em>Elasticsearch</em>” and “migration.”</li><li><strong>RRF</strong> combines and ranks results from both approaches</li></ul><p>3. Returns the top 10 matching documents with <code>id</code>, <code>title</code>, <code>url</code></p><p>4. ChatGPT identifies “<em>ISSUE-1712: migrate from Elasticsearch 7.x to 8.x</em>” as the most relevant result</p><h3>Prompt 2: Get complete details</h3><p>Ask: <strong><em>“Get me details of ISSUE-1889”</em></strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5c0db5245535ce67a36056e61e135bddc97ce496-934x629.png" style="width: 100%;" /><p>ChatGPT recognizes you want detailed information about a specific issue and calls the <code>fetch</code> tool, confirming with the user before taking any action.</p><h4>Tool call request:</h4><pre><code>{
  "id": "ISSUE-1889"
}</code></pre><h4>Tool response:</h4><pre><code>{
  "id": "ISSUE-1889",
  "title": "SQL injection vulnerability in search endpoint",
  "text": "Description: Security audit identified SQL injection vulnerability in /api/v1/search endpoint. User input from query parameter is not properly sanitized before being used in raw SQL query. Severity: HIGH - Immediate action required Affected Code: - File: services/search/query_builder.py - Line: 145-152 - Issue: String concatenation used instead of parameterized queries Investigation: - @security_team_alice: Confirmed exploitable with UNION-based injection - @sarah_dev: Checking all other endpoints for similar patterns - @john_backend: Found 3 more instances in legacy codebase Remediation: - Rewrite using SQLAlchemy ORM or parameterized queries - Add input validation and sanitization - Implement WAF rules as additional layer - Security regression tests Comments: - @tech_lead_mike: Stop all other work, this is P0 - @sarah_dev: PR-578 ready with fixes for all 4 vulnerable endpoints - @alex_devops: Deployed hotfix to production 2025-09-19 at 14:30 UTC - @security_team_alice: Verified fix, conducting full pentest next week Resolution: All vulnerable endpoints patched. Added pre-commit hooks to catch raw SQL queries. Security training scheduled for team.",
  "url": "https://internal-git.techcorp.com/issues/1889",
  "type": "issue",
  "status": "closed",
  "priority": "critical",
  "assignee": "sarah_dev",
  "created_date": "2025-09-18",
  "resolved_date": "2025-09-19",
  "labels": "security, vulnerability, bug, sql",
  "related_pr": "PR-578"
}</code></pre><p>ChatGPT synthesizes the information and presents it clearly.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/410f19f213e94fc4e3c47eeef6e04b69e0c86159-602x462.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54d8ffa117628a1e3afc317c3ab75d4f7731d7ab-767x1600.png" style="width: 100%;" /><h3>Behind the scenes</h3><h4>Prompt: “Get me the details of ISSUE-1889”</h4><ol><li>ChatGPT calls <code>fetch(“ISSUE-1889”)</code></li><li>Elasticsearch retrieves the complete document</li><li>Returns a full document with all fields at the root level</li><li>ChatGPT synthesizes the information and responds with proper citations.</li></ol><h2>Conclusion</h2><p>In this article, we built a custom MCP server that connects ChatGPT to Elasticsearch using dedicated <strong>search</strong> and <strong>fetch</strong> MCP tools, enabling natural language queries over private data.</p><p>This MCP pattern works for any Elasticsearch index: documentation, products, logs, or any other data you want to query through natural language.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Developing an agentic RAG assistant using LangChain and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build an agentic rag news assistant using LangChain and Elasticsearch that answers queries about articles with adaptive routing.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agentic-rag-news-assistant-langchain-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agentic-rag-news-assistant-langchain-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kirti Sodhi]]></dc:creator>
            <pubDate>Fri, 28 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/59dc0077f5dab00561d9f1b1e7dbf8ec3456259e-1600x1047.heif?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>This blog post delves into agentic RAG workflows, explaining their key features and common design patterns. It further demonstrates how to implement these workflows through a hands-on example that uses Elasticsearch as the vector store and LangChain to construct the agentic RAG framework. Finally, the article briefly discusses best practices and challenges associated with designing and implementing such architectures. You can follow along to create a simple agentic RAG pipeline with this <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/agentic-rag/agent_rag_news_assistant.ipynb">Jupyter notebook</a>.</p><h2>Introduction to agentic RAG</h2><p>Retrieval Augmented Generation (<a href="https://www.elastic.co/docs/solutions/search/rag">RAG</a>) has become a cornerstone in LLM-based applications, enabling models to provide well-grounded answers by retrieving relevant context based on user queries. RAG systems enhance the accuracy and context of LLM responses by drawing on external information from APIs or data stores, instead of being limited to pre-trained LLM knowledge. AI agents, on the other hand, operate autonomously, making decisions and taking actions to achieve their designated objectives.</p><p>Agentic RAG is a framework that unifies the strengths of both retrieval-augmented generation and agentic reasoning. It integrates RAG into the agent’s decision-making process, enabling the system to dynamically choose data sources, refine queries for better context retrieval, generate more accurate responses, and apply a feedback loop to continuously improve output quality.</p><h2>Key features of agentic RAG</h2><p>The agentic RAG framework marks a major advancement over traditional RAG systems. 
Instead of following a fixed retrieval process, it leverages dynamic agents capable of planning, executing, and optimizing results in real time.</p><p>Let’s look at some of the key features that distinguish agentic RAG pipelines:</p><ul><li><strong>Dynamic decision making</strong>: Agentic RAG uses a reasoning mechanism to understand the user’s intent and route each query to the most relevant data source, producing accurate and context-aware responses.</li><li><strong>Comprehensive query analysis:</strong> Agentic RAG deeply analyzes user queries, including sub-questions and their overall intent. It assesses query complexity and dynamically selects the most relevant data sources to retrieve information, ensuring accurate and complete responses.</li><li><strong>Multi-stage collaboration</strong>: This framework enables multi-stage collaboration through a network of specialized agents. Each agent handles a specific part of a larger objective, working sequentially or simultaneously to achieve a cohesive outcome.</li><li><strong>Self-evaluation mechanisms</strong>: The agentic RAG pipeline uses self-reflection to evaluate retrieved documents and generated responses. It can check if the retrieved information fully addresses the query and then review the output for accuracy, completeness, and factual consistency.</li><li><strong>Integration with external tools</strong>: This workflow can interact with external APIs, databases, and real-time information sources, incorporating up-to-date information and adapting dynamically to evolving data.</li></ul><h2>Workflow patterns of agentic RAG</h2><p>The workflow patterns define how agentic AI structures, manages, and orchestrates LLM-based applications in a reliable and efficient manner. 
Several frameworks and platforms, such as <a href="https://www.langchain.com/">LangChain</a>, <a href="https://www.langchain.com/langgraph">LangGraph</a>, <a href="https://www.crewai.com/">CrewAI</a>, and <a href="https://www.llamaindex.ai/">LlamaIndex</a>, can be used to implement these agentic workflows.</p><ol><li><strong>Sequential retrieval chain</strong>: Sequential workflows divide complex tasks into simple, ordered steps. Each step improves the input for the next one, leading to better results. For example, when creating a customer profile, one agent might pull basic details from a CRM, another retrieves purchase history from a transaction database, and a final agent combines this information to generate a complete profile for recommendations or reports.</li><li><strong>Routing retrieval chain</strong>: In this workflow pattern, a router agent analyzes the input and directs it to the most appropriate process or data source. This approach is particularly effective when multiple distinct data sources exist with minimal overlap. For instance, in a customer service system, the router agent categorizes incoming requests, such as technical issues, refunds, or complaints, and routes them to the appropriate department for efficient handling.</li><li><strong>Parallel retrieval chain</strong>: In this workflow pattern, multiple independent subtasks are executed concurrently, and their outputs are later aggregated to generate a final response. This approach significantly reduces processing time and increases workflow efficiency. For example, in a customer service parallel workflow, one agent retrieves similar past requests, and another consults relevant knowledge base articles. An aggregator then combines these outputs to generate a comprehensive resolution.</li><li><strong>Orchestrator worker chain</strong>: This workflow shares similarities with parallelization due to its utilization of independent subtasks. 
However, a key distinction lies in the integration of an orchestrator agent. This agent is responsible for analyzing user queries, dynamically segmenting them into subtasks during runtime, and identifying the appropriate processes or tools required to formulate an accurate response.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ece6fc2403f211556c93e99d5227bfb7053b0c31-1600x1047.png" style="width: 100%;" /><h2>Building an agentic RAG pipeline from scratch</h2><p>To illustrate the principles of agentic RAG, let&#x27;s design a workflow using LangChain and Elasticsearch. This workflow adopts a routing-based architecture, where multiple agents collaborate to analyze queries, retrieve relevant information, evaluate results, and generate coherent responses. You could refer to this <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/agentic-rag/agent_rag_news_assistant.ipynb">Jupyter notebook</a> to follow along with this example.</p><p>The workflow starts with the router agent, which analyzes the user&#x27;s query to select the optimal retrieval method, i.e., either a <code>vectorstore</code>, <code>websearch</code>, or <code>composite</code> approach. The vectorstore handles traditional RAG-based document retrieval, the websearch fetches the most recent information not stored in the vectorstore, and the composite approach combines both when information from multiple sources is needed.</p><p>If the documents are deemed suitable, the summarization agent generates a clear and contextually appropriate response. However, if the documents are insufficient or irrelevant, the query rewriting agent reformulates the query to improve the search. 
This revised query then reinitiates the routing process, allowing the system to refine its search and enhance the final output.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ed8701a7f15558fbf2e967a884b3e770eccb826b-1256x1092.png" style="width: 100%;" /><h3>Prerequisites</h3><p>This workflow relies on the following core components to execute the example effectively:</p><ul><li>Python 3.10</li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/agentic-rag/agent_rag_news_assistant.ipynb">Jupyter notebook</a></li><li>Azure OpenAI</li><li>Elasticsearch</li><li>LangChain</li></ul><p>Before proceeding, you will be prompted to configure the following set of required environment variables for this example.</p><pre><code>AZURE_OPENAI_ENDPOINT="Add your azure openai endpoint"
AZURE_OPENAI_KEY="Add your azure openai key"
AZURE_OPENAI_DEPLOYMENT="gpt-4.1"
AZURE_OPENAI_API_VERSION="Add your azure openai api version"

ES_ENDPOINT="Add your Elasticsearch ENDPOINT"
ES_API_KEY="Add your Elasticsearch API KEY"</code></pre><h3>Data sources</h3><p>This workflow is illustrated using a subset of the AG News dataset. The dataset comprises news articles across diverse categories, such as International, Sports, Business, and Science/Technology.</p><pre><code>from datasets import load_dataset
from langchain_core.documents import Document

dataset = load_dataset("ag_news", split="train[:1000]")
docs = [
    Document(
        page_content=sample["text"],
        metadata={"category": sample["label"]}
    )
    for sample in dataset
]</code></pre><p>The <a href="https://python.langchain.com/docs/integrations/vectorstores/elasticsearch/">ElasticsearchStore module</a> from the <code>langchain_elasticsearch</code> package serves as our vector store. For retrieval, we implement the SparseVectorStrategy, employing <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-elser">ELSER</a>, Elastic&#x27;s proprietary embedding model. It is essential to confirm that the ELSER model is correctly installed and deployed in your Elasticsearch environment before initializing the vector store.</p><pre><code>from langchain_elasticsearch import ElasticsearchStore, SparseVectorStrategy

elastic_vectorstore = ElasticsearchStore.from_documents(
    docs,
    es_url=ES_ENDPOINT,
    es_api_key=ES_API_KEY,
    index_name=index_name,
    strategy=SparseVectorStrategy(model_id=".elser_model_2"),
)

elastic_vectorstore.client.indices.refresh(index=index_name)</code></pre><p>The web search functionality is implemented using <a href="https://python.langchain.com/api_reference/community/tools/langchain_community.tools.ddg_search.tool.DuckDuckGoSearchRun.html">DuckDuckGoSearchRun</a> from the LangChain community tools, which allows the system to retrieve live information from the web efficiently. You can also consider using other search APIs that may provide more relevant results. This tool was chosen as it allows searches without requiring an API key.</p><pre><code>duckduckgo = DuckDuckGoSearchRun(description= "A custom DuckDuckGo search tool for finding latest news stories.", verbose=True)
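
# composite_retriever (defined further below) also calls a vectorstore_retriever
# helper whose definition lives in the notebook. A minimal sketch, assuming the
# elastic_vectorstore created above and an illustrative top-k of 5:
def vectorstore_retriever(query):
    return elastic_vectorstore.similarity_search(query, k=5)
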
def websearch_retriever(query):
    results = duckduckgo.run(query)
    return results</code></pre><p>The composite retriever is designed for queries that require a combination of sources. It provides a comprehensive and contextually accurate response by both retrieving real-time data from the web and consulting historical news from the vector store.</p><pre><code>def composite_retriever(query):
    related_docs = vectorstore_retriever(query)
    # websearch_retriever returns a single string, so append it as one result;
    # using += here would extend the list character by character
    related_docs.append(websearch_retriever(query))
    return related_docs</code></pre><h3>Setting up the agents</h3><p>In the next step, the LLM agents are defined to provide reasoning and decision-making capabilities within this workflow. The LLM chains we will create include: <code>router_chain</code>, <code>grade_docs_chain</code>, <code>rewrite_query_chain</code>, and <code>summarize_chain</code>.</p><p>The router agent uses an LLM assistant to determine the most appropriate data source for a given query at run-time. The grading agent evaluates the retrieved documents for relevance. If the documents are deemed relevant, they are passed to the summary agent to generate a summary. Otherwise, the rewrite query agent reformulates the query and sends it back to the routing process for another retrieval attempt. You can find the instructions for all the agents under the LLM chains section of the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/agentic-rag/agent_rag_news_assistant.ipynb">notebook</a>.</p><pre><code>from typing import Literal

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSequence

class RouteQuery(BaseModel):
    datasource: Literal["vectorstore", "websearch", "composite"] = Field(
        ...,
        description="Choose to route the query to web search, vectorstore or composite."
    )

router_prompt = ChatPromptTemplate.from_template("""You are an assistant that decides the best data source for questions based on news articles.
Choose one of the following options:
- 'vectorstore': for general, background, or historical news articles.
- 'websearch': for recent discoveries, 'latest', 'current', or '2025' type queries.
- 'composite': when the question needs both historical and current knowledge on news articles.

Question: {query}

Return one word: 'vectorstore', 'websearch', or 'composite'.
""")
router_structured = llm.with_structured_output(RouteQuery)
router_chain: RunnableSequence = router_prompt | router_structured</code></pre><p>The <code>llm.with_structured_output</code> constrains the output of the model to follow a predefined schema defined by the BaseModel under the <code>RouteQuery</code> class, ensuring consistency of the results. The second line composes a <code>RunnableSequence</code> by connecting <code>router_prompt</code> with <code>router_structured</code>, forming a pipeline in which the input prompt is processed by the language model to produce structured, schema-compliant results.</p><h3>Define graph nodes</h3><p>This part involves defining the states of the graph, which represent the data that flows between different components of the system. A clear specification of these states ensures that each node in the workflow knows what information it can access and update.</p><pre><code>class RAGState(TypedDict):
    query: str
    docs: List[Document]
    router: str
    summary: str
    self_reflection: bool
    retry_count: int  # TypedDict fields can't take defaults; state.get("retry_count", 0) supplies one</code></pre><p>Once the states are defined, the next step is to define the nodes of the graph. Nodes are the functional units of the graph that perform specific operations on the data. There are seven different nodes in our pipeline.</p><pre><code>def router(state: RAGState):
   router = router_chain.invoke({'query': state["query"]})
   logger.info(f"Router selected the datasource: {router.datasource}")
   logger.info(f"User query: {state['query']}")
   return {"router": router.datasource}

def vectorstore(state: RAGState):
   return {"docs": vectorstore_retriever(state["query"])}

def websearch(state: RAGState):
   return {"docs": websearch_retriever(state["query"])}

def composite(state: RAGState):
   return {"docs": composite_retriever(state["query"])}

def self_reflection(state: RAGState):
   evaluation = grade_docs_chain.invoke(
       {"query": state["query"], "docs": state["docs"]}
   )
   if evaluation.binary_score:
       logger.info(f"Self-reflection passed -- binary_score={evaluation.binary_score}")
   else:
       logger.info(f"Self-reflection failed -- binary_score={evaluation.binary_score}")

   return {
       "self_reflection": evaluation.binary_score,
   }

def query_rewriter(state: RAGState):
   retry_count = state.get("retry_count", 0) + 1
   new_query = rewrite_query_chain.invoke({"query": state["query"]})
   logger.info(f"Query rewritten: {new_query}, retry_count: {retry_count}")
   return {
       "query": new_query,
       "retry_count": retry_count,
   }

def summarize(state: RAGState):
   summary = summarize_chain.run(
       query=state["query"],
       docs=state["docs"],
   )
   return {"summary": summary}</code></pre><p>The <code>query_rewriter</code> node serves two purposes in the workflow. First, it rewrites the user query using the <code>rewrite_query_chain</code> to improve retrieval when the documents evaluated by the self-reflection agent are deemed insufficient or irrelevant. Second, it acts as a counter that tracks how many times the query has been rewritten.</p><p>Each time the node is invoked, it increments the <code>retry_count</code> stored in the workflow state. This mechanism prevents the workflow from entering an infinite loop. If the <code>retry_count</code> exceeds a predefined threshold, the system can fall back to an error state, a default response, or any other predefined condition you choose.</p><h3>Compiling the graph</h3><p>The last step is to define the edges of the graph and add any necessary conditions before compiling it. Every graph must start from a designated starting node, which serves as the entry point for the workflow. Edges in the graph represent the flow of data between nodes and can be of two types:</p><ul><li>Straight edges: These define a direct, unconditional flow from one node to another. Whenever the first node completes its task, the workflow automatically proceeds to the next node along the straight edge.</li><li>Conditional edges: These allow the workflow to branch based on the current state or the results of a node’s computation. The next node is selected dynamically depending on conditions such as evaluation results, routing decisions, or retry counts.</li></ul><pre><code>from langgraph.graph import StateGraph, START, END

# Build the graph over the shared RAGState and register each node defined
# above before wiring the edges
graph = StateGraph(RAGState)
for name, node in [("router", router), ("vectorstore", vectorstore),
                   ("websearch", websearch), ("composite", composite),
                   ("self_reflection", self_reflection),
                   ("query_rewriter", query_rewriter), ("summarize", summarize)]:
    graph.add_node(name, node)

graph.add_edge(START, "router")

def after_router(state: RAGState):
   route = state.get("router", None)
   if route == "vectorstore":
       return "vectorstore"
   elif route == "websearch":
       return "websearch"
   else:
       return "composite"

def after_self_reflection(state: RAGState):
   if state["self_reflection"]:
       return "summarize"
   return "query_rewriter"

def after_query_rewriter(state: RAGState):
   if state['retry_count'] <= 3:
       return "router"
   raise RuntimeError("Maximum retries (3) reached -- evaluation failed.")

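# An illustrative variation (not in the notebook): instead of raising, route to
# a dedicated fallback node that returns a default answer once the retry budget
# is exhausted. The "fallback" node name here is hypothetical.
def after_query_rewriter_with_fallback(state):
    return "router" if state.get("retry_count", 0) <= 3 else "fallback"
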
graph.add_conditional_edges(
   "router",
   after_router,
   {
       "vectorstore": "vectorstore",
       "websearch": "websearch",
       "composite": "composite"
   }
)

graph.add_edge("vectorstore", "self_reflection")
graph.add_edge("websearch", "self_reflection")
graph.add_edge("composite", "self_reflection")
graph.add_conditional_edges(
   "self_reflection",
   after_self_reflection,
   {
       "summarize": "summarize",
       "query_rewriter": "query_rewriter"
   }
)
graph.add_conditional_edges("query_rewriter", after_query_rewriter, {"router": "router"})
graph.add_edge("summarize", END)
agent = graph.compile()</code></pre><p>With that, your first agentic RAG pipeline is ready and can be tested using the compiled agent.</p><pre><code>result = agent.invoke({"query": query1})
logger.info(f"\nFinal Summary:\n: {result['summary']}")</code></pre><h3>Testing the agentic RAG pipeline</h3><p>We will now test this pipeline using three distinct types of queries as below. Note that results can differ, and the examples shown below illustrate just one potential outcome.</p><pre><code>query1="What are the latest AI models released this month?"
query2="What technological innovations are discussed in Sci/Tech news?"
query3="Compare a Sci/Tech article from the dataset with a current web article about AI trends."</code></pre><p>For the first query, the router selects <code>websearch</code> as the data source. The query fails the self-reflection evaluation and is subsequently redirected to the query rewriting stage, as shown in the output.</p><pre><code>INFO     | __main__:router:11 - Router selected the datasource: websearch
INFO     | __main__:router:12 - User query: What are the latest AI models released this month?
Latest Singapore news, including the city state's relationships with Malaysia and Mahathir, China and Xi Jinping, and the rest of Southeast Asia. 3 days ago · The latest military news, insights and analysis from China. All the latest news, opinions and analysis on Hong Kong, China, Asia and around the world Latest news, in-depth features and opinion on Malaysia, covering politics, economy, society and the Asean member-nation's relationships with China, Singapore, and other Southeast Asian ... Oct 12, 2025 · Brics (an acronym for Brazil, Russia, India, China and South Africa) refers to an association of 10 leading emerging markets. The other member states are Egypt, Ethiopia, ...
INFO     | __main__:self_reflection:31 - Self-reflection failed -- binary_score=False
INFO     | __main__:query_rewriter:40 - Query rewritten: query='Which AI models have been officially released in June 2024?', retry_count: 1
INFO     | __main__:router:11 - Router selected the datasource: websearch
INFO     | __main__:router:12 - User query: query='Which AI models have been officially released in June 2024?'
Dream Machine is a text-to-video model created by Luma Labs and launched in June 2024 . It generates video output based on user prompts or still images. Dream Machine has been noted for its ability to realistically capture motion... Released in June 2023. In June 2024 , Baidu announced Ernie 4.0 Turbo. In April 2025, Ernie 4.5 Turbo and X1 Turbo were released . These models are optimized for faster response times and lower operational costs.[28][29]. The meaning of QUERY is question, inquiry. How to use query in a sentence. Synonym Discussion of Query. QUERY definition: 1. a question, often expressing doubt about something or looking for an answer from an authority.... Learn more. Query definition: a question; an inquiry.. See examples of QUERY used in a sentence.
INFO     | __main__:self_reflection:29 - Self-reflection passed -- binary_score=True
INFO     | __main__:<module>:2 - 
Final Summary:
: In June 2024, two AI models were officially released: Dream Machine, a text-to-video model launched by Luma Labs, and Ernie 4.0 Turbo, announced by Baidu, which is optimized for faster response times and lower operational costs.</code></pre><p>Next, we examine an example where <code>vectorstore</code> retrieval is used, demonstrated with the second query.</p><pre><code>INFO     | __main__:router:11 - Router selected the datasource: vectorstore
INFO     | __main__:router:12 - User query: What technological innovations are discussed in Sci/Tech news?
INFO     | __main__:self_reflection:29 - Self-reflection passed -- binary_score=True
INFO     | __main__:<module>:2 - 
Final Summary:
: Recent Sci/Tech news highlights several technological innovations: NASA is collaborating with Silicon Valley firms to build a powerful Linux-based supercomputer to support theoretical research and shuttle engineering; new chromatin transfer techniques have enabled the cloning of cats; cybersecurity advancements are being discussed in relation to protecting personal technology; Princeton University scientists assert that existing technologies can be used immediately to stabilize global warming; and a set of GameBoy micro-games has been recognized for innovation in game design.</code></pre><p>The final query is directed to composite retrieval, which utilizes both the vectorstore and web search.</p><pre><code>INFO     | __main__:router:11 - Router selected the datasource: composite
INFO     | __main__:router:12 - User query: Compare a Sci/Tech article from the dataset with a current web article about AI trends.
Atlas currently only available on macOS, built on Chromium with planned features like ad-blocking still in development. OpenAI's Atlas browser launched with bold promises of AI -powered web browsing, but early real-world testing reveals a different story. Career-long data are updated to end-of-2024 and single recent year data pertain to citations received during calendar year 2024. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. In this article I list 45 AI tools across 21 different categories. After exploring all the available options in each category, I've carefully selected the best tools based on my personal experience. Reading a complex technical article ? Simply highlight confusing terminology and ask "what's this?" to receive instant explanations. compare browsers. Comparison showing traditional browser navigation versus OpenAI Atlas AI -powered workflows. After putting Gemini, ChatGPT, Grok, and DeepSeek through rigorous testing in October 2025, it's clear that there isn't one AI that reigns supreme across all categories.
INFO     | __main__:self_reflection:29 - Self-reflection passed -- binary_score=True
INFO     | __main__:<module>:2 - 
Final Summary:
: A Sci/Tech article from the dataset highlights NASA's development of robust artificial intelligence software for planetary rovers, aiming to make them more self-reliant and capable of decision-making during missions. In contrast, a current web article about AI trends focuses on the proliferation of AI-powered tools across various categories, including browsers like OpenAI Atlas, and compares leading models such as Gemini, ChatGPT, Grok, and DeepSeek, noting that no single AI currently excels in all areas. While the NASA article emphasizes specialized AI applications for autonomous robotics in space exploration, the current trends article showcases the broadening impact of AI across consumer and professional technologies, with ongoing competition and rapid innovation among major AI platforms.</code></pre><p>In the above workflow, agentic RAG intelligently determines which data source to use when retrieving information for a user query, thereby improving the accuracy and relevance of the response. You can create additional examples to test the agent and review the outputs to see if they yield any interesting results.</p><h2>Best practices for building agentic RAG workflows</h2><p>Now that we understand how agentic RAG works, let’s look at some best practices for building these workflows. Following these guidelines will help keep the system efficient and easy to maintain.</p><ul><li><strong>Prepare for fallbacks</strong>: Plan fallback strategies in advance for scenarios where any step of the workflow fails. These may include returning default answers, triggering error states, or using alternative tools. This ensures that the system handles failures gracefully without breaking the overall workflow.</li><li><strong>Implement comprehensive logging</strong>: Try implementing logging at each stage of the workflow, such as retries, generated outputs, routing choices, and query rewrites. 
These logs help to improve transparency, make debugging easier, and help refine prompts, agent behavior, and retrieval strategies over time.</li><li><strong>Select the appropriate workflow pattern</strong>: Examine your use case and select the workflow pattern that best suits your needs. Use sequential workflows for step-by-step reasoning, parallel workflows for independent data sources, and orchestrator-worker patterns for multi-tool or complex queries.</li><li><strong>Incorporate evaluation strategies</strong>: Integrate evaluation mechanisms at different stages of the workflow. These can include self-reflection agents, grading retrieved documents, or automated quality checks. Evaluation helps verify that retrieved documents are relevant, responses are accurate, and all parts of a complex query are addressed.</li></ul><h2>Challenges</h2><p>While agentic RAG systems offer significant advantages in terms of adaptability, precision, and dynamic reasoning, they also come with certain challenges that must be addressed during their design and implementation stages. Some of the key challenges include:</p><ul><li><strong>Complex workflows</strong>: As more agents and decision points are added, the overall workflow becomes increasingly complex. This can lead to higher chances of errors or failures at runtime. Whenever possible, prioritize streamlined workflows by eliminating redundant agents and unnecessary decision points.</li><li><strong>Scalability</strong>: It can be challenging to scale agentic RAG systems to handle large datasets and high query volumes. Incorporate efficient indexing, caching, and distributed processing strategies to maintain performance at scale.</li><li><strong>Orchestration and computational overhead</strong>: The execution of workflows with multiple agents requires advanced orchestration. 
This includes careful scheduling, dependency management, and agent coordination to prevent bottlenecks and conflicts, all of which add to the overall system complexity.</li><li><strong>Evaluation complexity</strong>: The evaluation of these workflows presents inherent challenges, as each stage requires a distinct assessment strategy. For instance, the RAG stage must be evaluated for the relevance and completeness of retrieved documents, while generated summaries need to be checked for quality and accuracy. Likewise, the effectiveness of query reformulation requires a separate evaluation logic to determine whether the rewritten query improves retrieval outcomes.</li></ul><h2>Conclusion</h2><p>In this blog post, we introduced the concept of agentic RAG and highlighted how it enhances the traditional RAG framework by incorporating autonomous capabilities from agentic AI. We explored the core features of agentic RAG and demonstrated these features through a hands-on example, building a news assistant using Elasticsearch as the vector store and LangChain to create the agentic framework.</p><p>Additionally, we discussed best practices and key challenges to consider when designing and implementing an agentic RAG pipeline. These insights are intended to guide developers in creating robust, scalable, and efficient agentic systems that effectively combine retrieval, reasoning, and decision-making.</p><h2>What’s next</h2><p>The workflow we built is simple, leaving ample room for improvements and experimentation. We can enhance this by experimenting with various embedding models and refining retrieval strategies. Additionally, integrating a re-ranking agent to prioritize retrieved documents could be beneficial. Another area for exploration involves developing evaluation strategies for agentic frameworks, specifically identifying common and reusable approaches applicable across different types of frameworks. 
Finally, we could experiment with these frameworks on larger and more complex datasets.</p><p>In the meantime, if you have similar experiments to share, we’d love to hear about them! Feel free to provide feedback or connect with us through our <a href="https://ela.st/slack">community Slack channel</a> or <a href="https://discuss.elastic.co/c/security">discussion forums</a>.</p><h2>Resources</h2><ul><li><a href="https://arxiv.org/abs/2310.11511">Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection</a></li><li><a href="https://arxiv.org/abs/2501.09136">Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search without the headache: simplifying hybrid search with retrievers]]></title>
            <description><![CDATA[Explore how to simplify hybrid search in Elasticsearch with a multi-field query format for linear and RRF retrievers, and create queries with no previous knowledge about your Elasticsearch index.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hybrid-search-multi-field-query-retrievers-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hybrid-search-multi-field-query-retrievers-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mike Pellegrini]]></dc:creator>
            <pubDate>Thu, 27 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4494f98ae3958bbdbc6171df9677fc4d65ec5640-1536x1024.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/what-is/hybrid-search">Hybrid search</a> is widely recognized as a powerful search approach, combining the precision and speed of <a href="https://www.elastic.co/search-labs/blog/lexical-and-semantic-search-with-elasticsearch#lexical-search---sparse-retrieval">lexical search</a> with the natural language capabilities of <a href="https://www.elastic.co/what-is/semantic-search">semantic search</a>. However, applying it in practice can be tricky, often requiring deep knowledge about your index and the construction of verbose queries with non-trivial configurations. In this blog, we will explore how the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers#multi-field-query-format">multi-field query format for linear and RRF retrievers</a> makes hybrid search simpler and more approachable, removing common headaches and enabling you to leverage its full power with greater ease. We will also review how the multi-field query format enables you to perform hybrid search queries with no previous knowledge about your index.</p><h2>The score range problem</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c8b49153c47d2cdc233c0d2e440db04711d48ca5-1600x1600.jpg" style="width: 100%;" /><p>To set the stage, let’s review one of the primary reasons hybrid search can be hard: varying score ranges. Our old friend <a href="https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25">BM25</a> produces unbounded scores. In other words, BM25 can generate scores ranging from close to 0 to (theoretically) infinity. In contrast, queries against <code>dense_vector</code> fields will produce scores bounded between 0 and 1. 
Exacerbating this problem, <code>semantic_text</code> obfuscates the field type used to index embeddings, so unless you have detailed knowledge about your index and inference endpoint configuration, it can be hard to tell what the score range of your query will be. This presents a problem when trying to interleave lexical and semantic search results, as the lexical results may take precedence over the semantic ones even if the semantic results are more relevant. The generally accepted solution for this problem is to normalize the scores prior to interleaving results. Elasticsearch has two tools for this: the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/linear-retriever">linear</a> and <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/rrf-retriever">RRF</a> retrievers.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/264481268c8b6ac259e3c257b85431b513f16672-1077x586.png" style="width: 100%;" /><p>The <strong>RRF</strong> retriever applies the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">RRF algorithm</a>, using document rank as a measure of relevance and discarding the score. Since the score is not considered, score range mismatches are not an issue.</p><p>The <strong>linear</strong> retriever uses a linear combination to determine a document’s final score. This involves taking each component query’s score for the document, normalizing it, and summing them to generate the total score. Mathematically, the operation can be expressed as:</p><pre><code>Total Score = 𝚺(N(Sx))</code></pre><p>Where <code>N</code> is the normalization function, and <code>Sx</code> is the score for query X. The normalization function is key here, as it transforms each query’s score to use the same range.
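To make this concrete, here is a small plain-Python sketch (with made-up scores) of min-max normalization followed by the linear combination:</p><pre><code>def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

bm25_scores = [12.3, 7.1, 2.4]     # unbounded lexical (BM25) scores
dense_scores = [0.93, 0.81, 0.55]  # vector scores already in [0, 1]

# After normalization, both score sets share the [0, 1] range,
# so summing them per document is meaningful
totals = [b + d for b, d in zip(minmax(bm25_scores), minmax(dense_scores))]</code></pre><p>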
You can learn more about the linear retriever <a href="https://www.elastic.co/search-labs/blog/linear-retriever-hybrid-search">here</a>.</p><h2>Breaking it down</h2><p>Users can implement effective hybrid search with these tools, but it requires some knowledge about your index. Let’s take a look at an example with the linear retriever, where we will query an index with two fields:</p><pre><code>PUT linear_retriever_example
{
  "mappings": {
    "properties": {
      "semantic_text_field": { <1>
        "type": "semantic_text",
        "inference_id": ".multilingual-e5-small-elasticsearch"
      },
      "text_field": { <2>
        "type": "text"
      }
    }
  }
}</code></pre><p>1. <code>semantic_text_field</code> is a <code>semantic_text</code> field that uses <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-e5">E5</a>, a text embedding model</p><p>2. <code>text_field</code> is a standard <code>text</code> field</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "match": { <1>
                  "semantic_text_field": "foo"
                }
              }
            }
          },
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "match": {
                  "text_field": "foo"
                }
              }
            }
          },
          "normalizer": "minmax"
        }
      ]
    }
  }
}</code></pre><p>1. We use a <code>match</code> query on our <code>semantic_text</code> field, which we <a href="https://www.elastic.co/search-labs/blog/semantic-search-match-knn-sparse-vector#we-made-match-happen-in-semantic-search!">added support for in Elasticsearch 8.18/9.0</a></p><p><br/>When constructing the query, we need to keep in mind that <code>semantic_text_field</code> uses a text embedding model, so any queries on it will generate a score between 0 and 1. We also need to know that <code>text_field</code> is a standard <code>text</code> field, and therefore queries on it will generate an unbounded score. To create a result set with proper relevance, we need to use a retriever that will normalize the query scores before combining them. In this example, we use the linear retriever with <code>minmax</code> normalization, which normalizes each query’s score to a value between 0 and 1.</p><p>The query construction in this example is fairly straightforward because only two fields are involved. However, it can get complicated very quickly as more fields, and of varying types, are added. This demonstrates how writing an effective hybrid search query often requires deeper knowledge of the index being queried, so that component query scores are properly normalized prior to combination. This poses a barrier to the wider adoption of hybrid search.</p><h3>Query grouping</h3><p>Let’s extend the example: What if we wanted to query one <code>text</code> field and two <code>semantic_text</code> fields? We could construct a query like this:</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "semantic_text_field_1",
                  "query": "foo"
                }
              }
            }
          },
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "semantic_text_field_2",
                  "query": "foo"
                }
              }
            }
          },
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "match": {
                  "text_field": "foo"
                }
              }
            }
          },
          "normalizer": "minmax"
        }
      ]
    }
  }
}</code></pre><p>That seems good on its face, but there’s a potential problem. Now the <code>semantic_text</code> field matches make up ⅔ of the total score:</p><pre><code>Total Score = N(semantic_text_field_1 score) + N(semantic_text_field_2 score) + N(text_field score)</code></pre><p>This probably isn’t what you want because it creates an unbalanced score. The effects may not be that noticeable in an example like this one with only 3 fields, but it becomes problematic when more fields are queried. For instance, most indices contain far more lexical fields than semantic (i.e. <code>dense_vector</code>, <code>sparse_vector</code>, or <code>semantic_text</code>). What if we were querying an index with 9 lexical fields and 1 semantic field using the pattern above? The lexical matches would make up 90% of the score, blunting the effectiveness of semantic search.</p><p>A common way to address this is to group queries into lexical and semantic categories and weight the two evenly. This prevents either category from dominating the total score.</p><p>Let’s put that into practice. What would this grouped queries approach look like for this example when using the linear retriever?</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "linear": {
              "retrievers": [
                {
                  "retriever": {
                    "standard": {
                      "query": {
                        "semantic": {
                          "field": "semantic_text_field_1",
                          "query": "foo"
                        }
                      }
                    }
                  },
                  "normalizer": "minmax"
                },
                {
                  "retriever": {
                    "standard": {
                      "query": {
                        "semantic": {
                          "field": "semantic_text_field_2",
                          "query": "foo"
                        }
                      }
                    }
                  },
                  "normalizer": "minmax"
                }
              ]
            }
          },
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "match": {
                  "text_field": "foo"
                }
              }
            }
          },
          "normalizer": "minmax"
        }
      ]
    }
  }
}</code></pre><p>Wow, this is getting verbose! You may have even needed to scroll up and down multiple times to examine the whole query! Here, we use two levels of normalization to create the query groups. Mathematically, it can be expressed as:</p><pre><code>Total Score = N(N(semantic_text_field_1 score) + N(semantic_text_field_2 score)) + N(text_field score)</code></pre><p>This second level of normalization ensures that the queries against the <code>semantic_text</code> fields and <code>text</code> field are weighted evenly. Note that we omit the second-level normalization for <code>text_field</code> in this example since there is only one lexical field, sparing you from <em>even more</em> verbosity.</p><p>This query structure is already unwieldy, and we’re only querying three fields. It becomes increasingly unmanageable, even for seasoned search practitioners, as you query more fields.</p><h2>The multi-field query format</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dd04e1686076a574b717c1460acfe4eb79299208-1600x1600.jpg" style="width: 100%;" /><p>We added the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers#multi-field-query-format">multi-field query format</a> for the linear and RRF retrievers in Elasticsearch 8.19, 9.1, and <a href="https://www.elastic.co/cloud/serverless">serverless</a> to simplify all of this. You can now perform the same query as above with just:</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "fields": [ "semantic_text_field_1", "semantic_text_field_2", "text_field" ],
      "query": "foo",
      "normalizer": "minmax"
    }
  }
}</code></pre><p>Which shrinks the query from 55 lines to just 9! Elasticsearch automatically uses the index mappings to:</p><ul><li>Determine the type of each field queried</li><li>Group each field into a lexical or semantic category</li><li>Weight each category evenly in the final score</li></ul><p>This allows anyone to execute an effective hybrid search query without needing to know details about the index or the inference endpoints used.</p><p>When using RRF, you can omit the <code>normalizer</code>, since rank is used as a proxy for relevance:</p><pre><code>GET rrf_retriever_example/_search
{
  "retriever": {
    "rrf": {
      "fields": [ "semantic_text_field_1", "semantic_text_field_2", "text_field" ],
      "query": "foo"
    }
  }
}</code></pre><h2>Per-field boosting</h2><p>When using the linear retriever, you can apply a per-field boost to adjust the importance of matches in certain fields. For example, let’s say you’re querying four fields: two <code>semantic_text</code> fields and two <code>text</code> fields:</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "fields": [ "semantic_text_field_1", "semantic_text_field_2", "text_field_1", "text_field_2" ],
      "query": "foo",
      "normalizer": "minmax"
    }
  }
}</code></pre><p>By default, each field is weighted equally in its group (lexical or semantic). The score breakdown looks like:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba31c76189e3a1e5b1638437ccf0528aafec2598-1600x549.png" style="width: 100%;" /><p>In other words, each field is 25% of the total score.</p><p>We can use the <code>field^boost</code> syntax to add a per-field boost to any field. Let’s apply a boost of 2 to <code>semantic_text_field_1</code> and <code>text_field_1</code>:</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "fields": [ "semantic_text_field_1^2", "semantic_text_field_2", "text_field_1^2", "text_field_2" ],
      "query": "foo",
      "normalizer": "minmax"
    }
  }
}</code></pre><p>Now the score breakdown looks like:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de20e555d52f914bf483a048d056f54f4fece757-1600x549.png" style="width: 100%;" /><p>Each query group is still weighted equally, but now the field weight within the groups has changed:</p><ul><li><code>semantic_text_field_1</code> is 66% of the semantic query group score, 33% of the total score</li><li><code>text_field_1</code> is 66% of the lexical query group score, 33% of the total score</li></ul>
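To make the arithmetic concrete, here is a small Python sketch (illustrative only, not Elasticsearch's implementation) of how a boost of 2 on one field in each group produces the shares listed above:

```python
# Illustrative arithmetic (not Elasticsearch's implementation) for how
# per-field boosts redistribute weight inside each query group while the
# lexical and semantic groups stay evenly weighted overall.

def in_group_shares(boosts):
    """Turn per-field boosts into fractions of that group's score."""
    total = sum(boosts.values())
    return {field: boost / total for field, boost in boosts.items()}

# Boosts from the example above: ^2 on one field in each group
semantic_group = in_group_shares({
    "semantic_text_field_1": 2.0,
    "semantic_text_field_2": 1.0,
})
lexical_group = in_group_shares({
    "text_field_1": 2.0,
    "text_field_2": 1.0,
})

# Each group contributes half of the total score, so a field's share of
# the total is its in-group share divided by the number of groups (2):
share_of_total = {
    field: share / 2
    for group in (semantic_group, lexical_group)
    for field, share in group.items()
}
# semantic_text_field_1: 2/3 of the semantic group, 1/3 of the total
```

Because each group's contribution is normalized, boosting a field only changes how the group's share is split among its fields, not the total score range.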
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ℹ️ Note that the total score range will not change when a per-field boost is applied. 
This is an intended side-effect of score normalization, which ensures that lexical and semantic query scores remain directly comparable with each other.
</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ℹ️ Per-field boosting can also be used with the RRF retriever in Elasticsearch 9.2+</td>
                    </tr>
              </tbody>
            </table>
          <h3>Wildcard resolution</h3><p>You can use the <code>*</code> wildcard in the <code>fields</code> parameter to match multiple fields. Continuing the example above, this query is functionally equivalent to querying <code>semantic_text_field_1</code>, <code>semantic_text_field_2</code>, and <code>text_field_1</code> explicitly:</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "fields": [ "semantic_text_field_*", "*_field_1" ],
      "query": "foo",
      "normalizer": "minmax"
    }
  }
}</code></pre><p>It’s interesting to note that the <code>*_field_1</code> pattern matches both <code>text_field_1</code> and <code>semantic_text_field_1</code>. This is handled automatically; the query will execute as if each of the fields were explicitly queried. It’s also fine that the <code>semantic_text_field_1</code> matches both patterns; all field name matches are de-duplicated before query execution.</p><p>You can use the wildcard in a variety of ways:</p><ul><li>Prefix matching (ex: <code>*_text_field</code>)</li><li>Inline matching (ex: <code>semantic_*_field</code>)</li><li>Suffix matching (ex: <code>semantic_text_field_*</code>)</li></ul><p>You can also use multiple wildcards to apply a combination of the above, such as <code>*_text_field_*</code>.</p><h3>Default query fields</h3><p>The multi-field query format also allows you to query an index you know nothing about. If you omit the <code>fields</code> parameter, it will query all fields specified by the <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules">index.query.default_field index setting</a>:</p><pre><code>GET linear_retriever_example/_search
{
  "retriever": {
    "linear": {
      "query": "foo",
      "normalizer": "minmax"
    }
  }
}</code></pre><p>By default, <code>index.query.default_field</code> is set to <code>*</code>. This wildcard will resolve to every field type in the index that supports term queries, which is most. The exceptions are:</p><ul><li><code>dense_vector</code> fields</li><li><code>rank_vector</code> fields</li><li>Geometry fields: <code>geo_point</code>, <code>shape</code></li></ul><p>This functionality is especially useful when you want to perform a hybrid search query on an index provided by a third party. The multi-field query format allows you to execute an appropriate query in a simple way. Just exclude the <code>fields</code> parameter, and all applicable fields will be queried.</p><h2>Conclusion</h2><p>The score range problem can make effective hybrid search a headache to implement, particularly when there’s limited insight into the index being queried or the inference endpoints in use. The multi-field query format for the linear and RRF retrievers alleviates this pain by packaging an automated, query-grouping-based hybrid search approach into a simple and approachable API. Additional functionality, such as per-field boosting, wildcard resolution, and default query fields, extends the functionality to cover many use cases.</p><h2>Try out the multi-field query format today</h2><p>You can check out the linear and RRF retrievers with the multi-field query format in fully managed Elasticsearch <a href="https://www.elastic.co/cloud/serverless">Serverless</a> projects with a <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/create-serverless-project">free trial</a>. It’s also available in stack versions starting from 8.19 &amp; 9.1.</p><p>Get started in minutes on your local environment with a single command:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building an AI agent for HR with Elastic Agent Builder and GPT-OSS]]></title>
            <description><![CDATA[Discover how to build an AI agent that can answer natural language queries about your employee HR data using Elastic Agent Builder and GPT-OSS.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-an-ai-agent-hr-elastic-agent-builder-gpt-oss</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-an-ai-agent-hr-elastic-agent-builder-gpt-oss</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/05d2d0513fff67d975f9223d75108aa9f50646bc-1600x914.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2>Introduction</h2><p>This article will show you how to build an AI agent for HR using <a href="https://openai.com/index/introducing-gpt-oss/">GPT-OSS</a> and Elastic Agent Builder. The agent can answer your questions without sending data to OpenAI, Anthropic, or any external service.</p><p>We’ll use LM Studio to serve GPT-OSS locally and connect it to Elastic Agent Builder.</p><p>By the end of this article, you’ll have a custom AI agent that can answer natural language questions about your employee data while maintaining full control over your information and model.</p><h2>Prerequisites</h2><p>For this article, you need:</p><ul><li>An <a href="https://www.elastic.co/cloud">Elastic Cloud</a> hosted (9.2), serverless, or <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart">local</a> deployment</li><li>A machine with 32GB of RAM recommended (minimum 16GB for GPT-OSS 20B)</li><li><a href="https://lmstudio.ai/">LM Studio</a> installed</li><li><a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a> installed</li></ul><h2>Why use GPT-OSS?</h2><p>With a local LLM, you can deploy it in your own infrastructure and fine-tune it to fit your own needs.
All this while maintaining control over the data that you share with the model, and of course, you don’t have to pay a license fee to an external provider.</p><p>OpenAI <a href="https://openai.com/index/introducing-gpt-oss/">released GPT-OSS</a> on August 5, 2025, as part of their commitment to the open model ecosystem.</p><p>The 20B parameter model offers:</p><ul><li><strong>Tool use capabilities</strong></li><li><strong>Efficient inference</strong></li><li><strong>OpenAI SDK compatible</strong></li><li><strong>Compatible with agentic workflows</strong></li></ul><p>Benchmark comparison:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/29160e3345352088e8213297630882f252b00c47-1600x680.png" style="width: 100%;" /><h2>Solution architecture</h2><p>The architecture runs entirely on your local machine. Elastic (running in Docker) communicates directly with your local LLM through LM Studio, and the Elastic Agent Builder uses this connection to create custom AI agents that can query your employee data.</p><p>For more details, refer to this <a href="https://www.elastic.co/docs/solutions/observability/connect-to-own-local-llm">documentation</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a4a886750ff25fa8bb7aefc7448161e52cf73ed3-1600x896.png" style="width: 100%;" /><h2>Building an AI agent for HR: Steps</h2><p>We’ll divide the implementation into 5 steps:</p><ol><li>Configure LM studio with a local model</li><li>Deploy Local Elastic with Docker</li><li>Create the OpenAI connector in Elastic</li><li>Upload employee data to Elasticsearch</li><li>Build and test your AI Agent</li></ol><h2>Step 1: Configure LM Studio with GPT-OSS 20B</h2><p>LM Studio is a user-friendly application that allows you to run large language models locally on your computer. It provides an OpenAI-compatible API server, making it easy to integrate with tools like Elastic without a complex setup process. 
For more details, refer to the <a href="https://lmstudio.ai/docs/app">LM Studio Docs</a>.</p><p>First, download and install <a href="https://lmstudio.ai/">LM Studio</a> from the official website. Once installed, open the application.</p><h3>In the LM Studio interface:</h3><ol><li>Go to the search tab and search for “GPT-OSS”</li><li>Select the <code>openai/gpt-oss-20b</code> from OpenAI</li><li>Click download</li></ol><p>The size of this model should be approximately <strong>12.10GB</strong>. The download may take a few minutes, depending on your internet connection.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5d01bc4dcb377b5259fc6b521fe2425a31b90ca4-1312x872.png" style="width: 100%;" /><h4>Once the model is downloaded:</h4><ol><li>Go to the local server tab</li><li>Select the openai/gpt-oss-20b</li><li>Use the default port 1234</li><li>On the right panel, go to <strong>Load </strong>and set the Context Length to <strong>40K</strong> or higher</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e546033f916381647b876815b2c1f1ae2a08365f-326x337.png" style="width: 100%;" /><p>5. Click start server</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/28ee78a3caa84d14e04db3d42f30acbe4d4d005a-1312x872.png" style="width: 100%;" /><p>You should see this if the server is running.</p><pre><code>[LM STUDIO SERVER] Success! HTTP server listening on port 1234
[LM STUDIO SERVER] Supported endpoints:
[LM STUDIO SERVER] ->	GET  http://localhost:1234/v1/models
[LM STUDIO SERVER] ->	POST http://localhost:1234/v1/responses
[LM STUDIO SERVER] ->	POST http://localhost:1234/v1/chat/completions
[LM STUDIO SERVER] ->	POST http://localhost:1234/v1/completions
[LM STUDIO SERVER] ->	POST http://localhost:1234/v1/embeddings
Server started.</code></pre><h2>Step 2: Deploy Local Elastic with Docker</h2><p>Now we’ll set up Elasticsearch and Kibana locally using Docker. Elastic provides a convenient script that handles the entire setup process. For more details refer to the <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart">official documentation</a>.</p><h3>Run the start-local script</h3><p>Execute the following command in your terminal:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This script will:</p><ul><li>Download and configure Elasticsearch and Kibana</li><li>Start both services using Docker Compose</li><li>Automatically activate a 30-day Platinum trial license</li></ul><h3>Expected output</h3><p>Just wait for the following message and save the password and API key shown; you’ll need them to access Kibana:</p><pre><code>🎉 Congrats, Elasticsearch and Kibana are installed and running in Docker!
🌐 Open your browser at http://localhost:5601
   Username: elastic
   Password: KSUlOMNr
🔌 Elasticsearch API endpoint: http://localhost:9200
🔑 API key: cnJGX0pwb0JhOG00cmNJVklUNXg6cnNJdXZWMnM4bncwMllpQlFlUTlWdw==
Learn more at https://github.com/elastic/start-local</code></pre><h3>Access Kibana</h3><p>Open your browser and navigate to:</p><pre><code>http://localhost:5601</code></pre><p>Log in using the credentials obtained in the terminal output.</p><h3>Enable Agent Builder</h3><p>Once logged in to Kibana, navigate to <strong>Management </strong>&gt;<strong> AI </strong>&gt;<strong> Agent Builder </strong>and activate the Agent Builder.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/92e104cb846c20d875865ded8a3d37f5c7daae9b-1491x1528.png" style="width: 100%;" /><h2>Step 3: Create the OpenAI connector in Elastic</h2><p>Now we’ll configure Elastic to use your local LLM.</p><h3>Access Connectors</h3><ol><li>In Kibana</li><li>Go to <strong>Project Settings</strong> &gt; <strong>Management</strong></li><li>Under <strong>Alerts and Insights</strong>, select <strong>Connectors</strong></li><li>Click Create Connector</li></ol><h3>Configure the connector</h3><p>Select <strong>OpenAI</strong> from the list of connectors. 
LM Studio exposes an OpenAI-compatible API, which makes this connector work with it.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5ac87042e086c7a2bd47a8039e646ec831f0dcc6-923x974.png" style="width: 100%;" /><p>Fill in the fields with these values:</p><ul><li><strong>Connector name: </strong>LM Studio - GPT-OSS 20B</li><li><strong>Select an OpenAI provider: </strong>Other (OpenAI Compatible Service)</li><li><strong>URL: </strong><code>http://host.docker.internal:1234/v1/chat/completions</code></li><li><strong>Default model: </strong>openai/gpt-oss-20b</li><li><strong>API Key:</strong> testkey-123 (any text works, because the LM Studio server doesn&#x27;t require authentication)</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2084ac32fcf1fb810c8b54ecab1c85a1e3e8905b-672x1302.png" style="width: 100%;" /><p>To finish the configuration, click <strong>Save &amp; test</strong>.</p><p><strong>Important:</strong> Toggle on “<strong>Enable native function calling</strong>”; this is required for the Agent Builder to work properly. If you don’t enable this, you’ll get a <strong><code>No tool calls found in the response</code></strong> error.</p><h3>Test the connection</h3><p>Elastic should automatically test the connection. If everything is configured correctly, you’ll see a success message like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f567d767f1969c4730c1daa92f651789dc3742ac-1042x812.png" style="width: 100%;" /><p>Response:</p><pre><code>{
  "status": "ok",
  "data": {
    "id": "chatcmpl-flj9h0hy4wcx4bfson00an",
    "object": "chat.completion",
    "created": 1761189456,
    "model": "openai/gpt-oss-20b",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "Hello! 👋 How can I assist you today?",
          "reasoning": "Just greet.",
          "tool_calls": []
        },
        "logprobs": null,
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "prompt_tokens": 69,
      "completion_tokens": 23,
      "total_tokens": 92
    },
    "stats": {},
    "system_fingerprint": "openai/gpt-oss-20b"
  },
  "actionId": "ee1c3aaf-bad0-4ada-8149-118f52dad757"
}</code></pre><h2>Step 4: Upload employee data to Elasticsearch</h2><p>Now we’ll upload the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/gpt-oss-with-elasticsearch/hr-employees-bulk.json">HR employee dataset</a> to demonstrate how the agent works with sensitive data. I generated a fictional dataset with this structure.</p><h3>Dataset structure</h3><pre><code>{
  "employee_id": "0f4dce68-2a09-4cb1-b2af-6bcb4821539b",
  "full_name": "Daffi Stiebler",
  "email": "lscutchings0@huffingtonpost.com",
  "date_of_birth": "1975-06-20T15:39:36Z",
  "hire_date": "2025-07-28T00:10:45Z",
  "job_title": "Physical Therapy Assistant",
  "department": "HR",
  "salary": "108455",
  "performance_rating": "Needs Improvement",
  "years_of_experience": 2,
  "skills": "Java",
  "education_level": "Master's Degree",
  "manager": "Carl MacGibbon",
  "emergency_contact": "Leigha Scutchings",
  "home_address": "5571 6th Park"
}</code></pre><h3>Create the index with mappings</h3><p>First, create the index with proper mappings. Note that we’re copying key fields into a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic_text</a> field; this enables semantic search capabilities for our index.</p><pre><code>PUT hr-employees
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "employee_id": {
        "type": "keyword"
      },
      "full_name": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "email": {
        "type": "keyword"
      },
      "date_of_birth": {
        "type": "date",
        "format": "iso8601"
      },
      "hire_date": {
        "type": "date",
        "format": "iso8601"
      },
      "job_title": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "department": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "salary": {
        "type": "double"
      },
      "performance_rating": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "years_of_experience": {
        "type": "long"
      },
      "skills": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "education_level": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "manager": {
        "type": "text",
        "copy_to": "employee_semantic"
      },
      "emergency_contact": {
        "type": "keyword"
      },
      "home_address": {
        "type": "keyword"
      },
      "employee_semantic": {
        "type": "semantic_text"
      }
    }
  }
}</code></pre><h3>Index with Bulk API</h3><p>Copy and paste the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/gpt-oss-with-elasticsearch/hr-employees-bulk.json">dataset</a> into your Dev Tools in Kibana and execute it:</p><pre><code>POST hr-employees/_bulk
{"index": {}}
{"employee_id": "57728b91-e5d7-4fa8-954a-2384040d3886", "full_name": "Filide Gane", "email": "vhallahan1@booking.com", "job_title": "Business Systems Development Analyst", "department": "Marketing", "salary": "$52330.27", "performance_rating": "Meets Expectations", "years_of_experience": 12, "skills": "Java", "education_level": "Bachelor's Degree", "date_of_birth": "2000-02-07T16:49:32Z", "hire_date": "2023-11-07T13:03:16Z", "manager": "Freedman Kings", "emergency_contact": "Vilhelmina Hallahan", "home_address": "75 Dennis Junction"}
{"index": {}}
{"employee_id": "...", ...}</code></pre><h3>Verify the data</h3><p>Run a query to verify:</p><pre><code>GET hr-employees/_search</code></pre><h2>Step 5: Build and test your AI agent</h2><p>With everything configured, it’s time to build a custom AI agent using Elastic Agent Builder. For more details, refer to the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/get-started">Elastic documentation</a>.</p><h3>Add the connector</h3><p>Before we can create our new agent, we have to set Agent Builder to use our custom connector called <code>LM Studio - GPT-OSS 20B</code>, because the default one is the <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a>. For that, go to <strong>Project Settings</strong> &gt; <strong>Management</strong> &gt; <strong>GenAI Settings</strong>, select the connector we created, and click <strong>Save</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11e830c3e2fb4c298b020c928fa5422f3397ba08-1600x1152.png" style="width: 100%;" /><h3>Access Agent Builder</h3><ol><li>Go to <strong>Agents</strong></li><li>Click on <strong>Create a new agent</strong></li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c1e60541563650163f972ac9088dc1ed1de759a7-1600x1054.png" style="width: 100%;" /><h3>Configure the agent</h3><p>To create a new agent, the required fields are the <strong>Agent ID</strong>, <strong>Display Name</strong>, and <strong>Display Instructions</strong>.</p><p>There are also more customization options, like the Custom Instructions that guide how your agent behaves and interacts with your tools (similar to a system prompt, but for our custom agent).
Labels help organize your agents, and you can also set the avatar color and symbol.</p><p>The values I chose for this agent, based on the dataset, are:<br/><br/><strong>Agent ID:</strong> <code>hr_assistant</code></p><p><strong>Custom instructions:</strong></p><pre><code>You are an HR Analytics Assistant that helps answer questions about employee data.
When responding to queries:
- Provide clear, concise answers
- Include relevant employee details (name, department, salary, skills)
- Format monetary values with currency symbols
- Be professional and maintain data confidentiality</code></pre><p><br/>Labels: <code>Human Resources</code> and <code>GPT-OSS</code></p><p>Display name: <code>HR Analytics Assistant</code></p><p>Display description:</p><pre><code>A specialized AI assistant for Human Resources that helps analyze employee data, compensation, performance metrics, and talent management. Ask questions about employees, departments, salaries, or performance analytics.</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f94bb2bf08497e5e756ca76b30a3a51f42927756-1424x1217.png" style="width: 100%;" /><p>With all the data in place, we can click <strong>Save</strong> to create our new agent.</p><h3>Test the agent</h3><p>Now you can ask natural language questions about your employee data, and GPT-OSS 20B will understand the intent and generate an appropriate response.</p><h4>Prompt:</h4><pre><code>Which employee is the one with the highest salary in the hr-employees index?</code></pre><h4>Answer:</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/94ad19f80b96304028a59f60beca51dfc9aecc8a-899x631.png" style="width: 100%;" /><p>The agent’s process was:</p><p>1. Understand your question using the GPT-OSS connector</p><p>2. Generate the appropriate Elasticsearch query (using the built-in tools or custom <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a>)</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6f2961d0d1b97475f6dda300acee84da540938e6-844x466.png" style="width: 100%;" /><p>3. Retrieve matching employee records</p><p>4. Present results in natural language with proper formatting</p><p>Unlike traditional lexical search, the agent powered by GPT-OSS understands intent and context, making it easier to find information without knowing exact field names or query syntax.
For more details on the agent&#x27;s thinking process, refer to this <a href="https://www.elastic.co/search-labs/blog/ai-agent-builder-experiments-performance">article</a>.</p><h2>Conclusion</h2><p>In this article, we built a custom AI agent using Elastic’s Agent Builder to connect to the OpenAI GPT-OSS model running locally. By deploying both Elastic and the LLM on your local machine, this architecture allows you to leverage generative AI capabilities while maintaining full control over your data, all without sending information to external services.</p><p>We used GPT-OSS 20B as an experiment, but the officially recommended models for Elastic Agent Builder are referenced <a href="https://www.elastic.co/docs/solutions/search/agent-builder/models#recommended-models">here</a>. If you need more advanced reasoning capabilities, there&#x27;s also the <a href="https://huggingface.co/openai/gpt-oss-120b">120B parameter variant</a> that performs better for complex scenarios, though it requires a higher-spec machine to run locally. For more details, refer to the <a href="https://openai.com/open-models/">official OpenAI documentation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Top Elastic Agent Builder projects and learnings from Cal Hacks 12.0]]></title>
            <description><![CDATA[Explore the top Elastic Agent Builder projects from Cal Hacks 12.0 and dive into our technical takeaways on Serverless, ES|QL, and agent architectures.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-projects-learnings-cal-hacks-12-0</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-projects-learnings-cal-hacks-12-0</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[JD Armada]]></dc:creator>
            <pubDate>Tue, 25 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8ba034a6f19b50521f541b8131756a8acdb52975-1280x960.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<a href="https://www.youtube.com/watch/GUI0WXNZduo">Watch video on YouTube</a><p>A few weeks ago, we had the incredible opportunity to sponsor <a href="https://cal-hacks-12-0.devpost.com/">Cal Hacks 12.0</a>, one of the largest in-person hackathons with over 2000 participants coming from all over the world. We offered a dedicated prize track for the best use of Elastic Agent Builder on Serverless, and the response was phenomenal. In just 36 hours, we received 29 submissions that used Agent Builder in creative ways, from building wildfire intelligence tools to StackOverflow validators.</p><p>Beyond the impressive projects, the experience at Cal Hacks 12.0 also gave us something equally valuable: quick, unfiltered feedback from developers encountering our Stack for the first time. Hackathons are unique pressure tests with tight timelines, zero prior familiarity, and unpredictable obstacles (like the infamous WiFi outages). They expose exactly where the developer experience shines and where it still needs work. This matters even more now, as developers interact with the Elastic Stack in new ways, increasingly through LLM-driven workflows. In this blog post, we’ll dive deeper into what the participants built with Agent Builder and what we learned in the process.</p><h2>The winning projects</h2><h3>First Place: AgentOverflow</h3><p>Stack Overflow rebuilt for the LLM and agent era.</p><p>Read more about AgentOverflow <a href="https://devpost.com/software/agentoverflow">here</a>.</p><p>AgentOverflow addresses a problem most AI developers encounter: LLMs hallucinate, chat histories disappear, and developers waste time re-solving the same issues.</p><p>AgentOverflow captures, validates, and resurfaces real problem-solution pairs, so developers can break the hallucination spiral and ship faster.</p><h4>How it works:</h4><p><strong>1. 
Share JSON - the “Solution Schema.”</strong></p><p>One click from a Claude share will scrape, extract, and assemble a Share Solution JSON, which is a structured format containing:</p><ul><li>Problem</li><li>Context</li><li>Code</li><li>Tags</li><li>Verified solution steps.</li></ul><p>A validator (LAVA) checks and enforces structure, the user adds a line of extra context, then it’s stored and indexed within Elasticsearch.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/45a3e96f4474050a855419628c2a7338bb12c706-1600x877.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2d65cb154d8ee32fc96ff17dfa5b0bf2636e3777-1600x1002.png" style="width: 100%;" /><p><strong>2. Find Solution</strong></p><p>When you get stuck, click on <code>Find Solution</code> and AgentOverflow will scrape your current conversation, use it to build a query, and run a hybrid Elasticsearch search to surface:</p><ul><li>Ranked, community-validated fixes</li><li>The exact prompts that originally solved the problem</li></ul><p>This allows developers to copy, paste, and unblock their current session quickly.</p><p><strong>3. MCP - context injection for LLMs</strong></p><p>By connecting to the stored structured solutions within Elasticsearch through MCP (Model Context Protocol), LLMs are fed high signal context (code, logs, configs, prior fixes) at runtime without extra noise.</p><p>AgentOverflow uses Agent Builder with Elasticsearch as a structured memory layer that injects relevant context into LLMs. 
This transforms them from passive chatbots into context-aware problem solvers.</p><h3>Runner-up: MarketMind</h3><p>A real-time interpretable view of market energy, powered by six Elastic Agents.</p><p>Read more about MarketMind <a href="https://devpost.com/software/marketmind-b6cy2q">here</a>.</p><p>MarketMind earned its spot by giving novice traders a platform that converts fragmented market data into clear, real-time signals. Instead of juggling price action, fundamentals, sentiment, and volatility across different tools, MarketMind consolidates all this information into one platform, helping traders gain actionable insights. This project also used some complex ES|QL queries when building its agents.</p><h4>How it works:</h4><p><strong>1. Collect real-time market data</strong></p><p>MarketMind pulls price action, fundamentals, sentiment, volatility, and risk metrics from Yahoo Finance. This data is ingested and organized into multiple Elasticsearch indices.</p><p><strong>2. Six specialized agents analyze the market</strong></p><p>Each agent, built with Agent Builder, focuses on a different layer of the market. They read from an Elasticsearch index, compute their own domain-specific metrics, and generate a standardized JSON output with scores and reasoning.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7d9716cca06a047a2b3584378b5c7e592a785ba1-1284x878.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5aac6a833347c0d2e596c02049ec4b4d3aae5cd7-794x764.png" style="width: 100%;" /><p><strong>3. Aggregate signals into a unified “market energy” model</strong></p><p>The combined outputs appear as glowing pulses around each stock, illustrating whether momentum is building, risk is rising, or sentiment is shifting.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/46b3da8e3d528c5dd4e2829416c5446098acb3aa-744x718.png" style="width: 100%;" /><p><strong>4. 
Visualize insights</strong></p><p>The frontend was built with React and <a href="https://github.com/vercel/next.js">Next.js</a>, using TypeScript, SVG physics-based visuals, and <a href="https://github.com/chartjs">Chart.js</a> for live candlestick charts. This turns raw analysis into real-time actionable feedback.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3f000c043117b77ed4127202be5a49c12e3682ba-1600x930.png" style="width: 100%;" /><h2>Other interesting projects:</h2><p>Here are some other strong contenders that used Elastic in different parts of their stack:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/76c49a853426844f475cd6b2a74999e60af20e8c-926x1080.png" style="width: 100%;" /><p>Find the full list of projects that were submitted to our track <a href="https://cal-hacks-12-0.devpost.com/submissions/search?utf8=%E2%9C%93&amp;prize_filter%5Bprizes%5D%5B%5D=91882">here</a>.</p><h2>What we learned from developers</h2><ul><li><strong>Agent Builder is user-friendly:</strong></li></ul><p>Most teams had never used Elastic before and were still able to build agents quickly with little support. We held a workshop for those who needed more guidance, but most were able to ingest their data and build an agent to perform actions on that data.</p><ul><li><strong>LLMs excel at <code>kNN</code> queries, but still need guidance in generating ES|QL:</strong></li></ul><p>Asking ChatGPT-5 to generate ES|QL queries returned incorrect information, often mixing ES|QL and SQL. Feeding the LLM the docs in a markdown file seemed to be a workable fix.</p><ul><li><strong>Snapshot-only ES|QL functions leaked into docs:</strong></li></ul><p>The upcoming <code>FIRST</code> and <code>LAST</code> aggregation functions had unintentionally slipped into our ES|QL docs. Because we fed those docs to ChatGPT, the model dutifully used these functions, even though they aren&#x27;t available in Serverless yet. 
Thanks to the feedback from the group, engineering quickly opened and merged a fix to remove the functions from the published docs (<a href="https://github.com/elastic/elasticsearch/pull/137341">PR #137341</a>).</p><ul><li><strong>Missing Serverless-specific guidance:</strong></li></ul><p>A team tried enabling <code>LOOKUP JOIN</code> on an index that wasn’t created in lookup mode. The error message sent them chasing commands that don’t exist on Serverless. We relayed this to the product team, who immediately opened a fix for a Serverless-specific, actionable message. Longer term, the vision is to hide the reindexing complexity entirely (<a href="https://github.com/elastic/elasticsearch-serverless/issues/4838">Issue #4838</a>).</p><ul><li><strong>Value of in-person events:</strong></li></ul><p>Online hackathons are great, but nothing matches the rapid feedback loop you get when you&#x27;re debugging shoulder-to-shoulder with builders. We watched teams integrate Agent Builder across different use cases, spotted where the developer experience with ES|QL could be improved, and fixed issues much quicker than trying to do so over asynchronous channels.</p><h2>Conclusion</h2><p>Cal Hacks 12.0 gave us more than a weekend of cool demos; it also gave us insight into how new developers are interacting with the Elastic Stack. In just 36 hours, we saw teams pick up Agent Builder, ingest data into Elasticsearch, design multi-agent systems, and test our features in a variety of ways. The event also reminded us why in-person events matter. The rapid feedback loops, real conversations, and hands-on debugging helped us understand current developer needs. We’re excited to bring back what we learned to the engineering team. We’ll see you at the next hackathon.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating an LLM agent newsroom with A2A protocol and MCP in Elasticsearch: Part II]]></title>
            <description><![CDATA[Discover how to build a specialized hybrid LLM agent newsroom using A2A Protocol for agent collaboration and MCP for tool access in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/a2a-protocol-mcp-llm-agent-workflow-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/a2a-protocol-mcp-llm-agent-workflow-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Justin Castilla]]></dc:creator>
            <pubDate>Mon, 24 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f87330e5d4ca961593b3cfb861ca850a4cc34186-1519x1173.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<h2>A2A and MCP: the code in action</h2><p>This is the companion piece to the article “Creating an LLM Agent newsroom with A2A protocol and MCP in Elasticsearch!”, which explained how implementing both A2A and MCP architectures within the same agent lets you reap the unique strengths of each framework. A <a href="https://github.com/justincastilla/elastic-newsroom">repository</a> is available should you wish to run the demo on your own.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7196f004089127f83547b2e5dc3f663205cfcdce-1162x1600.png" style="width: 100%;" /><p>Let&#x27;s walk through how our newsroom agents collaborate using both A2A and MCP to produce a news article.</p><h3>Step 1: Story assignment</h3><p>The <strong>News Chief</strong> (acting as the client) assigns a story:</p><pre><code>{
  "message_type": "task_request",
  "sender": "news_chief",
  "receiver": "reporter_agent",
  "payload": {
    "task_id": "story_renewable_energy_2024",
    "assignment": {
      "topic": "Renewable Energy Adoption in Europe",
      "angle": "Policy changes driving solar and wind expansion",
      "target_length": 1200,
      "deadline": "2025-09-30T18:00:00Z"
    }
  }
}</code></pre><h3>Step 2: Reporter requests research</h3><p>The <strong>Reporter Agent</strong> recognizes it needs background information and delegates to the <strong>Researcher Agent</strong> via A2A:</p><pre><code>{
  "message_type": "task_request",
  "sender": "reporter_agent",
  "receiver": "researcher_agent",
  "payload": {
    "task_id": "research_eu_renewable_2024",
    "parent_task_id": "story_renewable_energy_2024",
    "capability": "fact_gathering",
    "parameters": {
      "queries": [
        "EU renewable energy capacity 2024",
        "Solar installations growth Europe",
        "Wind energy policy changes 2024"
      ],
      "depth": "comprehensive"
    }
  }
}</code></pre><h3>Step 3: Reporter requests historical context from Archive Agent</h3><p>The <strong>Reporter Agent</strong> recognizes that historical context would strengthen the story. It delegates to the <strong>Archive Agent</strong> (powered by <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic&#x27;s A2A Agent</a>) via A2A to search the newsroom&#x27;s Elasticsearch-powered article archive:</p><pre><code>{
  "message_type": "task_request",
  "sender": "reporter_agent",
  "receiver": "archive_agent",
  "payload": {
    "task_id": "archive_search_renewable_2024",
    "parent_task_id": "story_renewable_energy_2024",
    "capability": "search_archive",
    "parameters": {
      "query": "European renewable energy policy changes and adoption trends over past 5 years",
      "focus_areas": ["solar", "wind", "policy", "Germany", "France"],
      "time_range": "2019-2024",
      "result_count": 10
    }
  }
}</code></pre><h3>Step 4: Archive Agent uses Elastic A2A Agent with MCP</h3><p>The <strong>Archive Agent</strong> uses Elastic&#x27;s A2A Agent, which in turn uses MCP to access Elasticsearch tools. This demonstrates the hybrid architecture where A2A enables agent collaboration while MCP provides tool access:</p><pre><code># Archive Agent using Elastic A2A Agent
async def search_historical_articles(self, query_params):
    # The Archive Agent sends a request to Elastic's A2A Agent
    elastic_response = await self.a2a_client.send_request(
        agent="elastic_agent",
        capability="search_and_analyze",
        parameters={
            "natural_language_query": query_params["query"],
            "index_pattern": "newsroom-articles-*",
            "filters": {
                "topics": query_params["focus_areas"],
                "date_range": query_params["time_range"]
            },
            "analysis_type": "trend_analysis"
        }
    )
    
    # Elastic's A2A Agent internally uses MCP tools:
    # - platform.core.search (to find relevant articles)
    # - platform.core.generate_esql (to analyze trends)
    # - platform.core.index_explorer (to identify relevant indices)
    
    return elastic_response</code></pre><p>The <strong>Archive Agent</strong> receives comprehensive historical data from Elastic&#x27;s A2A Agent and returns it to the Reporter:</p><pre><code>{
  "message_type": "task_response",
  "sender": "archive_agent",
  "receiver": "reporter_agent",
  "payload": {
    "task_id": "archive_search_renewable_2024",
    "status": "completed",
    "archive_data": {
      "historical_articles": [
        {
          "title": "Germany's Energiewende: Five Years of Solar Growth",
          "published": "2022-06-15",
          "key_points": [
            "Germany added 7 GW annually 2020-2022",
            "Policy subsidies drove 60% of growth"
          ],
          "relevance_score": 0.94
        },
        {
          "title": "France Balances Nuclear and Renewables",
          "published": "2023-03-20",
          "key_points": [
            "France increased renewable target to 40% by 2030",
            "Solar capacity doubled 2021-2023"
          ],
          "relevance_score": 0.89
        }
      ],
      "trend_analysis": {
        "coverage_frequency": "EU renewable stories increased 150% since 2019",
        "emerging_themes": ["policy incentives", "grid modernization", "battery storage"],
        "coverage_gaps": ["Small member states", "offshore wind permitting"]
      },
      "total_articles_found": 47,
      "search_confidence": 0.91
    }
  }
}</code></pre><p>This step demonstrates how Elastic&#x27;s A2A Agent integrates into the newsroom workflow. The Archive Agent (a newsroom-specific agent) coordinates with Elastic&#x27;s A2A Agent (a third-party specialist) to leverage Elasticsearch&#x27;s powerful search and analytics capabilities. Elastic&#x27;s agent uses MCP internally to access Elasticsearch tools, showing the clean separation between agent coordination (A2A) and tool access (MCP).</p><h3>Step 5: Researcher uses MCP servers</h3><p>The <strong>Researcher Agent</strong> accesses multiple MCP servers to gather information:</p><pre><code># Researcher Agent using MCP to access tools
async def gather_facts(self, queries):
    # Use News API MCP Server
    news_data = await self.mcp_client.invoke_tool(
        server="news_api_mcp",
        tool="search_articles",
        parameters={
            "query": queries[0],
            "date_range": "2024-01-01 to 2024-09-30",
            "sources": ["reuters", "bloomberg", "euractiv"]
        }
    )
    
    # Use Statistics MCP Server
    stats_data = await self.mcp_client.invoke_tool(
        server="eurostat_mcp",
        tool="get_energy_statistics",
        parameters={
            "metrics": ["solar_capacity_mw", "wind_capacity_mw"],
            "countries": ["EU27"],
            "year": 2024
        }
    )
    
    # Use Fact-Checking MCP Server
    verified_facts = await self.mcp_client.invoke_tool(
        server="factcheck_mcp",
        tool="verify_claims",
        parameters={
            "claims": [
                "EU solar capacity increased 40% in 2024",
                "Germany leads in wind installations"
            ]
        }
    )
    
    return self._synthesize_research(news_data, stats_data, verified_facts)</code></pre><h3>Step 6: Researcher returns data to Reporter</h3><p>The <strong>Researcher Agent</strong> sends comprehensive research back via A2A:</p><pre><code>{
  "message_type": "task_response",
  "sender": "researcher_agent",
  "receiver": "reporter_agent",
  "payload": {
    "task_id": "research_eu_renewable_2024",
    "status": "completed",
    "research_data": {
      "key_facts": [
        {
          "claim": "EU solar capacity reached 260 GW in 2024",
          "source": "Eurostat Q3 2024 Report",
          "confidence": 0.95,
          "verified": true
        },
        {
          "claim": "Germany installed 12 GW of wind capacity in 2024",
          "source": "German Federal Network Agency",
          "confidence": 0.92,
          "verified": true
        }
      ],
      "statistics": {
        "solar_growth_rate": "35%",
        "wind_growth_rate": "28%"
      },
      "sources_count": 15
    }
  }
}</code></pre><h3>Step 7: Reporter writes article</h3><p>The <strong>Reporter Agent</strong> uses the research data and its own LLM capabilities to write the article. During writing, the Reporter uses the MCP servers for style and templates:</p><pre><code># Reporter Agent writing with MCP assistance
async def write_article(self, research_data, assignment):
    # Get style guidelines via MCP
    style_guide = await self.mcp_client.get_resource(
        server="newsroom_mcp",
        resource="style://editorial/ap_style_guide"
    )
    
    # Get article template via MCP
    template = await self.mcp_client.get_resource(
        server="newsroom_mcp",
        resource="template://articles/news_story"
    )
    
    # Generate article using LLM + research + style
    draft = await self.llm.generate(
        prompt=f"""
        Write a news article following these guidelines:
        {style_guide}
        
        Using this template:
        {template}
        
        Based on this research:
        {research_data}
        
        Assignment: {assignment}
        """
    )
    
    # Self-evaluate confidence in claims
    confidence_check = await self._evaluate_confidence(draft)
    
    return draft, confidence_check</code></pre><h3>Step 8: Low confidence triggers re-research</h3><p>The <strong>Reporter Agent</strong> evaluates its draft and finds that one claim has low confidence. It sends another request to the <strong>Researcher Agent</strong>:</p><pre><code>{
  "message_type": "collaboration_request",
  "sender": "reporter_agent",
  "receiver": "researcher_agent",
  "payload": {
    "request_type": "fact_verification",
    "claims": [
      {
        "text": "France's nuclear phase-down contributed to 15% increase in renewable capacity",
        "context": "Discussing policy drivers for renewable growth",
        "current_confidence": 0.45,
        "required_confidence": 0.80
      }
    ],
    "urgency": "high"
  }
}</code></pre><p>The <strong>Researcher</strong> verifies the claim using fact-checking MCP servers and returns updated information:</p><pre><code>{
  "message_type": "collaboration_response",
  "sender": "researcher_agent",
  "receiver": "reporter_agent",
  "payload": {
    "verified_claims": [
      {
        "original_claim": "France's nuclear phase-down contributed to 15% increase...",
        "verified_claim": "France's renewable capacity increased 18% in 2024, partially offsetting reduced nuclear output",
        "confidence": 0.88,
        "corrections": "Percentage was 18%, not 15%; nuclear phase-down is gradual, not primary driver",
        "sources": ["RTE France", "French Energy Ministry Report 2024"]
      }
    ]
  }
}</code></pre><h3>Step 9: Reporter revises and submits to Editor</h3><p>The <strong>Reporter</strong> incorporates the verified facts and sends the completed draft to the <strong>Editor Agent</strong> via A2A:</p><pre><code>{
  "message_type": "task_request",
  "sender": "reporter_agent",
  "receiver": "editor_agent",
  "payload": {
    "task_id": "edit_renewable_story",
    "parent_task_id": "story_renewable_energy_2024",
    "content": {
      "headline": "Europe's Renewable Revolution: Solar and Wind Surge 30% in 2024",
      "body": "[Full article text...]",
      "word_count": 1185,
      "sources": [/* array of sources */]
    },
    "editing_requirements": {
      "check_style": true,
      "check_facts": true,
      "check_seo": true
    }
  }
}</code></pre><h3>Step 10: Editor reviews using MCP tools</h3><p>The <strong>Editor Agent</strong> uses multiple MCP servers to review the article:</p><pre><code># Editor Agent using MCP for quality checks
async def review_article(self, content):
    # Grammar and style check
    grammar_issues = await self.mcp_client.invoke_tool(
        server="grammarly_mcp",
        tool="check_document",
        parameters={"text": content["body"]}
    )
    
    # SEO optimization check
    seo_analysis = await self.mcp_client.invoke_tool(
        server="seo_mcp",
        tool="analyze_content",
        parameters={
            "headline": content["headline"],
            "body": content["body"],
            "target_keywords": ["renewable energy", "Europe", "solar", "wind"]
        }
    )
    
    # Plagiarism check
    originality = await self.mcp_client.invoke_tool(
        server="plagiarism_mcp",
        tool="check_originality",
        parameters={"text": content["body"]}
    )
    
    # Generate editorial feedback
    feedback = await self._generate_feedback(
        grammar_issues, 
        seo_analysis, 
        originality
    )
    
    return feedback</code></pre><p>The <strong>Editor</strong> approves the article and sends it forward:</p><pre><code>{
  "message_type": "task_response",
  "sender": "editor_agent",
  "receiver": "reporter_agent",
  "payload": {
    "status": "approved",
    "quality_score": 9.2,
    "minor_edits": [
      "Changed 'surge' to 'increased' in paragraph 3 for AP style consistency",
      "Added Oxford comma in list of countries"
    ],
    "approved_content": "[Final edited article]"
  }
}</code></pre><h3>Step 11: Publisher publishes via CI/CD</h3><p>Finally, the <strong>Publisher Agent</strong> (the <code>printer_agent</code> in the A2A messages) publishes the approved article using the MCP servers for the CMS and CI/CD pipeline:</p><pre><code># Publisher Agent publishing via MCP
async def publish_article(self, content, metadata):
    # Upload to CMS via MCP
    cms_result = await self.mcp_client.invoke_tool(
        server="wordpress_mcp",
        tool="create_post",
        parameters={
            "title": content["headline"],
            "body": content["body"],
            "status": "draft",
            "categories": metadata["categories"],
            "tags": metadata["tags"],
            "featured_image_url": metadata["image_url"]
        }
    )
    
    post_id = cms_result["post_id"]
    
    # Trigger CI/CD deployment via MCP
    deploy_result = await self.mcp_client.invoke_tool(
        server="cicd_mcp",
        tool="trigger_deployment",
        parameters={
            "pipeline": "publish_article",
            "environment": "production",
            "post_id": post_id,
            "schedule": "immediate"
        }
    )
    
    # Track analytics
    await self.mcp_client.invoke_tool(
        server="analytics_mcp",
        tool="register_publication",
        parameters={
            "post_id": post_id,
            "publish_time": datetime.now().isoformat(),
            "story_id": metadata["story_id"]
        }
    )
    
    return {
        "status": "published",
        "post_id": post_id,
        "url": f"https://newsroom.example.com/articles/{post_id}",
        "deployment_id": deploy_result["deployment_id"]
    }</code></pre><p>The <strong>Publisher</strong> confirms publication via A2A:</p><pre><code>{
  "message_type": "task_complete",
  "sender": "printer_agent",
  "receiver": "news_chief",
  "payload": {
    "task_id": "story_renewable_energy_2024",
    "status": "published",
    "publication": {
      "url": "https://newsroom.example.com/articles/renewable-europe-2024",
      "published_at": "2025-09-30T17:45:00Z",
      "post_id": "12345"
    },
    "workflow_metrics": {
      "total_time_minutes": 45,
      "agents_involved": ["reporter", "researcher", "archive", "editor", "printer"],
      "iterations": 2,
      "mcp_calls": 12
    }
  }
}</code></pre><p>Here is the complete sequence of the A2A workflow in the accompanying repository, using the same agents described above.</p>
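<p>Notice that every A2A message in the walkthrough shares the same envelope: <code>message_type</code>, <code>sender</code>, <code>receiver</code>, and <code>payload</code>. As a rough illustration of how an agent process could validate and dispatch these envelopes (a minimal sketch with hypothetical names, not code from the demo repository), consider:</p><pre><code># Hypothetical sketch: a minimal dispatcher for the A2A-style envelope
# used throughout this walkthrough. Not part of the demo repository.

REQUIRED_FIELDS = {"message_type", "sender", "receiver", "payload"}

def route_message(message, handlers):
    """Validate an A2A envelope and hand it to the receiver's handler."""
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        raise ValueError(f"Malformed A2A message, missing: {sorted(missing)}")
    handler = handlers.get(message["receiver"])
    if handler is None:
        raise LookupError(f"No agent registered as {message['receiver']}")
    return handler(message["message_type"], message["payload"])

# A stub Reporter Agent acknowledging the Step 1 assignment
handlers = {
    "reporter_agent": lambda message_type, payload: {
        "message_type": "task_response",
        "sender": "reporter_agent",
        "receiver": "news_chief",
        "payload": {"task_id": payload["task_id"], "status": "accepted"},
    }
}

reply = route_message(
    {
        "message_type": "task_request",
        "sender": "news_chief",
        "receiver": "reporter_agent",
        "payload": {"task_id": "story_renewable_energy_2024", "assignment": {}},
    },
    handlers,
)
print(reply["payload"]["status"])  # accepted</code></pre><p>Keeping the envelope validation in one place like this is what lets agents such as the Reporter and Archivist stay focused on their capabilities rather than on transport details.</p>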
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">#</td><td style="border: 1px solid #ccc; padding: 4px;">From</td><td style="border: 1px solid #ccc; padding: 4px;">To</td><td style="border: 1px solid #ccc; padding: 4px;">Action</td><td style="border: 1px solid #ccc; padding: 4px;">Protocol</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">User</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Assign Story</td><td style="border: 1px solid #ccc; padding: 4px;">HTTP POST</td><td style="border: 1px solid #ccc; padding: 4px;">User submits story topic and angle</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Create Story</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Creates story record with unique ID</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Delegate Assignment</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Sends story assignment via A2A protocol</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Accept Assignment</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Stores assignment internally</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">MCP Server</td><td style="border: 1px solid #ccc; padding: 4px;">Generate Outline</td><td style="border: 1px solid #ccc; padding: 4px;">MCP/HTTP</td><td style="border: 1px solid #ccc; padding: 4px;">Creates article outline and research questions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">6a</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Researcher</td><td style="border: 1px solid #ccc; padding: 4px;">Request Research</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Sends questions (parallel with 6b)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">6b</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Archivist</td><td style="border: 1px solid #ccc; padding: 4px;">Search Archive</td><td style="border: 1px solid #ccc; padding: 4px;">A2A JSONRPC</td><td style="border: 1px solid #ccc; padding: 4px;">Searches historical articles (parallel with 6a)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">7</td><td style="border: 1px solid #ccc; padding: 4px;">Researcher</td><td style="border: 1px solid #ccc; padding: 4px;">MCP Server</td><td style="border: 1px solid #ccc; padding: 4px;">Research Questions</td><td style="border: 1px solid #ccc; padding: 4px;">MCP/HTTP</td><td style="border: 1px solid #ccc; padding: 4px;">Uses Anthropic via MCP to answer questions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">8</td><td style="border: 1px solid #ccc; padding: 4px;">Researcher</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Return Research</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Returns research answers</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">Archivist</td><td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">Search Index</td><td style="border: 1px solid #ccc; padding: 4px;">ES REST API</td><td style="border: 1px solid #ccc; padding: 4px;">Queries news_archive index</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10</td><td style="border: 1px solid #ccc; padding: 4px;">Archivist</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Return Archive</td><td style="border: 1px solid #ccc; padding: 4px;">A2A JSONRPC</td><td style="border: 1px solid #ccc; padding: 4px;">Returns historical search results</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">11</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">MCP Server</td><td style="border: 1px solid #ccc; padding: 4px;">Generate Article</td><td style="border: 1px solid #ccc; padding: 4px;">MCP/HTTP</td><td style="border: 1px solid #ccc; padding: 4px;">Creates article with research/archive context</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">12</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Store Draft</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Saves draft internally</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">13</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Submit Draft</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Submits completed draft</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">14</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Update Story</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Stores draft, updates status to "draft_submitted"</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">15</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Editor</td><td style="border: 1px solid #ccc; padding: 4px;">Review Draft</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Auto-routes to Editor for review</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">16</td><td style="border: 1px solid #ccc; padding: 4px;">Editor</td><td style="border: 1px solid #ccc; padding: 4px;">MCP Server</td><td style="border: 1px solid #ccc; padding: 4px;">Review Article</td><td style="border: 1px solid #ccc; padding: 4px;">MCP/HTTP</td><td style="border: 1px solid #ccc; padding: 4px;">Analyzes content using Anthropic via MCP</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">17</td><td style="border: 1px solid #ccc; padding: 4px;">Editor</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Return Review</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Sends editorial feedback and suggestions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">18</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Store Review</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Stores editor feedback</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">19</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Apply Edits</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Routes review feedback to Reporter</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">MCP Server</td><td style="border: 1px solid #ccc; padding: 4px;">Apply Edits</td><td style="border: 1px solid #ccc; padding: 4px;">MCP/HTTP</td><td style="border: 1px solid #ccc; padding: 4px;">Revises article based on feedback</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">21</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Update Draft</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Updates draft with revisions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">Reporter</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Return Revised</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Returns revised article</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">23</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Update Story</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Stores revised draft, status to "revised"</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">24</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Publisher</td><td style="border: 1px solid #ccc; padding: 4px;">Publish Article</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Auto-routes to Publisher</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">25</td><td style="border: 1px solid #ccc; padding: 4px;">Publisher</td><td style="border: 1px solid #ccc; padding: 4px;">MCP Server</td><td style="border: 1px solid #ccc; padding: 4px;">Generate Tags</td><td style="border: 1px solid #ccc; padding: 4px;">MCP/HTTP</td><td style="border: 1px solid #ccc; padding: 4px;">Creates tags and categories</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">26</td><td style="border: 1px solid #ccc; padding: 4px;">Publisher</td><td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">Index Article</td><td style="border: 1px solid #ccc; padding: 4px;">ES REST API</td><td style="border: 1px solid #ccc; padding: 4px;">Indexes article to news_archive index</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">27</td><td style="border: 1px solid #ccc; padding: 4px;">Publisher</td><td style="border: 1px solid #ccc; padding: 4px;">Filesystem</td><td style="border: 1px solid #ccc; padding: 4px;">Save Markdown</td><td style="border: 1px solid #ccc; padding: 4px;">File I/O</td><td style="border: 1px solid #ccc; padding: 4px;">Saves article as .md file in /articles</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">28</td><td style="border: 1px solid #ccc; padding: 4px;">Publisher</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Confirm Publication</td><td style="border: 1px solid #ccc; padding: 4px;">A2A</td><td style="border: 1px solid #ccc; padding: 4px;">Returns success status</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">29</td><td style="border: 1px solid #ccc; padding: 4px;">News Chief</td><td style="border: 1px solid #ccc; padding: 4px;">Internal</td><td style="border: 1px solid #ccc; padding: 4px;">Update Story</td><td style="border: 1px solid #ccc; padding: 4px;">-</td><td style="border: 1px solid #ccc; padding: 4px;">Updates story status to "published"</td>
                    </tr>
              </tbody>
            </table>
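<p>Each A2A step in the table above travels as a JSON-RPC 2.0 request over HTTP. As a rough sketch of what one of those legs looks like on the wire, the following Python builds a <code>message/send</code> envelope like the assignment the News Chief sends in step 3. The method name and message shape follow the A2A specification at a high level, but treat the field names here as illustrative and check the spec for the exact schema.</p>

```python
import json
import uuid

def build_a2a_send(text: str) -> dict:
    """Build a JSON-RPC 2.0 envelope for an A2A message/send call.

    Sketch only: field names follow the A2A spec at a high level;
    consult the specification for the authoritative schema.
    """
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),  # request id, echoed back in the response
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": str(uuid.uuid4()),
            }
        },
    }

payload = build_a2a_send("Cover tonight's city council vote.")
print(json.dumps(payload, indent=2))
```

<p>Delivering the assignment is then a single HTTP POST of this payload to the receiving agent&#x27;s A2A endpoint, with the reply coming back as a JSON-RPC result.</p>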
          <h2>Conclusion</h2><p>Both A2A and MCP have important roles to play in the modern augmented-LLM infrastructure paradigm. A2A offers flexibility for complex multi-agent systems but potentially less portability and higher operational complexity. MCP offers a standardized approach for tool integration that&#x27;s simpler to implement and maintain, though it&#x27;s not designed to handle multi-agent orchestration.</p><p>The choice isn&#x27;t binary. As demonstrated through our newsroom example, the most sophisticated and effective LLM-backed systems often combine both approaches: agents coordinate and specialize through the A2A protocol while accessing their tools and resources through MCP servers. This hybrid architecture provides the organizational benefits of multi-agent systems alongside the standardization and ecosystem advantages of MCP. This suggests that there may not need to be a choice at all: simply use both as a standard approach.</p><p>It&#x27;s up to you as a developer or architect to test and determine the best mixture of both solutions to create the right outcome for your specific use case.
Understanding the strengths, limitations, and appropriate applications of each approach will enable you to build more effective, maintainable, and scalable AI systems.</p><p>Whether you&#x27;re building a digital newsroom, a customer service platform, a research assistant, or any other LLM-powered application, carefully considering your coordination needs (A2A) and tool access requirements (MCP) will set you on the path to success.</p><h2>Additional resources</h2><ul><li><strong>Elasticsearch Agent Builder: </strong><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">https://www.elastic.co/docs/solutions/search/elastic-agent-builder</a></li><li><strong>A2A Specification</strong>: <a href="https://a2a-protocol.org/latest/specification/">https://a2a-protocol.org/latest/specification/</a></li><li><strong>A2A and MCP Integration</strong>: <a href="https://a2a-protocol.org/latest/topics/a2a-and-mcp/">https://a2a-protocol.org/latest/topics/a2a-and-mcp/</a></li><li><strong>Model Context Protocol</strong>: <a href="https://modelcontextprotocol.io/">https://modelcontextprotocol.io</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Getting started with Elastic Agent Builder and Microsoft Agent Framework]]></title>
            <description><![CDATA[Walk through the complete process of creating an agent with Elastic Agent Builder and then explore how to use the agent via the A2A protocol orchestrated with the Microsoft Agent Framework.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-a2a-with-agent-framework</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-a2a-with-agent-framework</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jonathan Simon]]></dc:creator>
            <pubDate>Fri, 21 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f77903e5bcaa0f52bed80d5c8ea23e7c538561d6-1703x1027.gif?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic <a href="https://www.elastic.co/blog/whats-new-elastic-9-2-0">9.2</a> was recently released and includes a new feature called <a href="https://www.elastic.co/elasticsearch/agent-builder">Agent Builder</a>. It enables developers to quickly create AI agents and tools powered by data stored in Elasticsearch. Any tools or agents you create in Agent Builder can be used immediately within your own custom AI apps.</p><p>In this blog post, we’ll walk through all the steps to use Elastic Agent Builder to create an agent. Then we’ll run an example Python app that uses the Microsoft Agent Framework to orchestrate your Elastic agent.</p><h2>Create an Elastic Serverless project</h2><p>To use Agent Builder, you need an Elastic deployment or an Elastic serverless project, so let’s begin by creating one. Go to <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> and create a new Elasticsearch Serverless project.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/05da6b45ca88b70181028f394bdcc2c289ca68da-1677x952.gif" style="width: 100%;" /><h2>Create an index and add data</h2><p>Now that we’ve got an Elastic project, let’s create an index, which is what Elasticsearch uses to store data. Open Developer Tools in Elastic Cloud, where we can run a command to create an index.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ac31098e0d557c7758f180d497b86c90ff50cf66-1976x1099.png" style="width: 100%;" /><p>Copy the following PUT command, which creates an index named <em>my-docs</em> containing a mixture of field types, with the content field using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic search</a>.</p><pre><code>PUT /my-docs
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { 
        "type": "semantic_text"
      },
      "filename": { "type": "keyword" },
      "last_modified": { "type": "date" }
    }
  }
}</code></pre><p>Paste the PUT command into the input area of the Developer Tools console. Hover your mouse over the command in the console and then click the <strong>Run</strong> button to execute the command.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7f426fd9a5ad6909f81af1246fe84726b7d6596-1980x1103.png" style="width: 100%;" /><p>The next step is to add some data to the <em>my-docs</em> index that you just created. Copy and paste the following command into the Developer Tools console.</p><pre><code>PUT /my-docs/_doc/greetings-md
{
  "title": "Greetings",
  "content": "
# Greetings

## Basic Greeting
Hello!

## Helloworld Greeting
Hello World! 🌎

## Not Greeting
I'm only a greeting agent. 🤷

",
  "filename": "greetings.md",
  "last_modified": "2025-11-04T12:00:00Z"
}</code></pre><p>Click the command’s <strong>Run </strong>button to execute the command, which will add a document to the <em><code>my-docs</code></em> index.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a8224c4379c88cfb720cb110d13b1c3c27291fc3-1999x1247.png" style="width: 100%;" /><p>As you can see, the command above adds a document named <em>greetings.md</em> that includes several different types of greeting responses.</p><p>Now that we’ve got some data in an Elastic index, let’s confirm what data we have to work with. Using the power of the built-in Elastic AI Agent that is enabled by default in Agent Builder, you can now have a chat about your data. Select <strong>Agents</strong> in the navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f2ab858f9cb73c40b6ca70c8da60f6d7417db74-1970x1266.png" style="width: 100%;" /><p>Then simply ask, “What data do I have?”</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4311fb41a114e932ceacb4d6bb535263cad57479-1708x938.gif" style="width: 100%;" /><p>The default Elastic AI Agent provides a nice summary of the data currently stored in Elastic.</p><h2>Create a tool</h2><p>The next step on this walkthrough journey is to create an agent that can utilize the data stored in Elastic.</p><p>As you’ve seen, the default agent in Elastic Agent Builder is already useful for chatting with your data, but to really give agents custom powers, they need access to tools via the <a href="https://modelcontextprotocol.io/docs/getting-started/intro">Model Context Protocol</a> (MCP). Agent Builder has fully featured tool creation and management functionality that you can use to quickly create custom MCP tools that are hosted right in the same scalable Elastic project as your data.</p><p>Let’s create a tool in Elastic Agent Builder that can access the data now stored in Elastic.
Click <strong>+ New</strong> to start a new chat.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f5af8fdb8738b07eceba130e49fdf97478d65646-1636x414.png" style="width: 100%;" /><p>Then click on <strong>Manage tools</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b8a29be0d6c8fa07deb2c523585c3a6bc67153b0-1999x992.png" style="width: 100%;" /><p>Click the <strong>+ New tool</strong> button.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/92a91e94cfa071b66761aaa81e48f7b3962cdaea-1970x1128.png" style="width: 100%;" /><p>In the <strong>Create Tool</strong> form, select <strong>ES|QL</strong> as the tool <strong>Type</strong> and enter the following values.</p><p>For <strong>Tool ID</strong>:</p><pre><code>example.get_greetings</code></pre><p>For <strong>Description</strong>:</p><pre><code>Get greetings doc from Elasticsearch my-docs index.</code></pre><p>For <strong>Configuration</strong>, enter the following query into the <strong>ES|QL Query</strong> text area:</p><pre><code>FROM my-docs | WHERE filename == "greetings.md"</code></pre><p>Your completed <strong>Create a new tool</strong> form should look like the following. Click <strong>Save</strong> to create the tool.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/51698c74174fd6963101eebb7ebe720902209eb5-1406x1271.png" style="width: 100%;" /><h2>Create an Agent and assign it a tool</h2><p>Ah! There’s that feeling of having a new tool and being ready to use it. Agents need tools to give them special abilities beyond what general LLMs can provide, and we’ve now got a brand new tool. Let’s create an agent that can put our tool to good use.
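</p><p>If you want to sanity-check the tool’s ES|QL query outside of Agent Builder, you can run the same query against the cluster’s ES|QL endpoint (<code>POST /_query</code>). The sketch below only builds the request; the <code>ES_URL</code> and <code>ES_API_KEY</code> names are placeholders for this example, not variables used elsewhere in this post.</p>

```python
import os

# Placeholder settings for this sketch; point them at your own cluster.
ES_URL = os.getenv("ES_URL", "http://localhost:9200")
ES_API_KEY = os.getenv("ES_API_KEY", "")

def esql_request(query: str) -> dict:
    """Describe the POST /_query request that runs an ES|QL query."""
    return {
        "url": f"{ES_URL}/_query",
        "headers": {
            "Authorization": f"ApiKey {ES_API_KEY}",
            "Content-Type": "application/json",
        },
        "json": {"query": query},
    }

req = esql_request('FROM my-docs | WHERE filename == "greetings.md"')
# With a reachable cluster you could send it with any HTTP client, e.g.:
#   httpx.post(req["url"], headers=req["headers"], json=req["json"])
```

<p>A successful response returns <code>columns</code> and <code>values</code> arrays containing the matched document, the same rows the tool will hand to your agent.</p><p>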
Select <strong>Agents</strong> in the navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f6770cbf2047fed5a5827bfa9f24a3489a1f7deb-1400x500.png" style="width: 100%;" /><p>Click <strong>Create a new agent</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8fe9bbb5296118c3c044aa943a08f9ec78c82173-1400x763.png" style="width: 100%;" /><p>Based on the name of the tool and the data it’s accessing, you’ve probably already guessed that we’re going to be creating a greeting agent, and you’re right! Let’s create a Hello World agent right now.</p><p>In the <strong>New Agent</strong> form, enter the following values.</p><p>For <strong>Agent ID</strong>, enter the text:</p><pre><code>helloworld_agent</code></pre><p>In the <strong>Custom Instructions</strong> text area, enter the following instructions:</p><pre><code>If the prompt contains greeting text like "Hi" or "Hello" then respond with only the Basic Greeting text from your documents.

If the prompt contains the text "Hello World" then respond with only the Hello World text from your documents.

In all other cases where the prompt does not contain greeting words, then respond with only the Not Greeting text from your documents.</code></pre><p>For <strong>Display name</strong>, enter the text:</p><pre><code>HelloWorld Agent</code></pre><p>For the <strong>Display description</strong>, enter the text:</p><pre><code>An agent that responds to greetings.</code></pre><p>Your completed <strong>New Agent</strong> form should look like the following. The next step is to assign the agent the tool we created in the previous step. Click the <strong>Tools</strong> tab.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e189613957fa86016710764e665a2cd11e98d401-1400x1303.png" style="width: 100%;" /><p>Select only the <em><code>example.get_greetings</code></em> tool that we created previously. Unselect all the other available tools. This will configure the agent being created to only have access to the tool we’ve created.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e1f95fa27afe30c402da5ffde385774e1bc8b5a-1999x1550.png" style="width: 100%;" /><p>Click <strong>Save</strong> to create the agent.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/20e7e03a8e749597f612e0f0087ff2541b3cb930-1999x545.png" style="width: 100%;" /><p>You’ll be taken to the Agents list, where you can see that the new HelloWorld Agent has been created. We can quickly test out our new agent right inside Agent Builder. Select <strong>Agents</strong> in the navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e16a5ab26003c2e5b5fc8516427bd2f9b9737a12-1999x704.png" style="width: 100%;" /><p>Select the <strong>HelloWorld Agent</strong> from the Agent Chat agent selector.
Enter the prompt “hello world” and you should get back the Hello World text from the <em><code>greetings.md</code></em> document stored in the <em><code>my-docs</code></em> Elastic index.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2cc98878450a7d3b26ebbfbcc5f9b841b8971c41-1191x566.gif" style="width: 100%;" /><p>Well done. Now that we know our agent is working as expected, let’s explore the immediate development benefit that you get with tools and agents created in Agent Builder. Any tools you create in Agent Builder are usable via MCP by any agent-building platform that supports MCP. Also, any agents you create in Agent Builder are available for use in any agent-building platform that supports the <a href="https://a2a-protocol.org">Agent2Agent</a> (A2A) protocol.</p><h2>Microsoft Agent Framework</h2><p>If you’re interested in trying out new agent development tools, there’s a recently announced open-source development kit called the <a href="https://learn.microsoft.com/en-us/agent-framework/overview/agent-framework-overview">Microsoft Agent Framework</a> that you should definitely try out for yourself. The Agent Framework lets you use the A2A protocol to orchestrate agentic apps that combine multiple agents running on different hosts, enabling solutions that aren’t possible with a general-purpose LLM alone. The Agent Framework is available in Python and C#. Let’s see how we can use the Python-based Agent Framework to call the custom Elastic agent we just created.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6428e470f3323c2a88c20e126969939a7b616a83-1844x414.png" style="width: 100%;" /><h2>Getting started with the Agent Framework in Python</h2><p>Let’s run some code!
On your local computer, open <a href="https://code.visualstudio.com/download">Visual Studio Code</a> and open a new terminal.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11f2ea916bf277e39c98701c8d31e251fcdf6a8b-956x571.png" style="width: 100%;" /><p>In the open terminal, clone the Elastic Search Labs source code repository, which contains the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/agent-builder-a2a-agent-framework">Elastic Agent Builder A2A example app</a>.</p><pre><code>git clone https://github.com/elastic/elasticsearch-labs</code></pre><p>In the terminal, use <code>cd</code> to change into the <code>elasticsearch-labs</code> directory.</p><pre><code>cd elasticsearch-labs</code></pre><p>In the terminal, enter the following command to open the current folder in the Visual Studio Code editor.</p><pre><code>code .</code></pre><p>In the Visual Studio Code File Explorer, expand the <code>supporting-blog-content</code> and <code>agent-builder-a2a-agent-framework</code> folders and then open the file named <em>elastic_agent_builder_a2a.py</em> in the text editor.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae1f173f953e6d805cdc2b5ab756c229b3b31793-1428x1044.png" style="width: 100%;" /><p>Here’s the contents of <em>elastic_agent_builder_a2a.py</em> that you should see in your text editor:</p><pre><code>import asyncio
from dotenv import load_dotenv
import httpx
import os
from a2a.client import A2ACardResolver
from agent_framework.a2a import A2AAgent


async def main():
    load_dotenv()
    a2a_agent_host = os.getenv("ES_AGENT_URL")
    a2a_agent_key = os.getenv("ES_API_KEY")

    print(f"Connection to Elastic A2A agent at: {a2a_agent_host}")

    custom_headers = {"Authorization": f"ApiKey {a2a_agent_key}"}

    async with httpx.AsyncClient(timeout=60.0, headers=custom_headers) as http_client:
        # Resolve the A2A Agent Card
        resolver = A2ACardResolver(httpx_client=http_client, base_url=a2a_agent_host)
        agent_card = await resolver.get_agent_card(
            relative_card_path="/helloworld_agent.json"
        )
        print(f"Found Agent: {agent_card.name} - {agent_card.description}")

        # Use the Agent
        agent = A2AAgent(
            name=agent_card.name,
            description=agent_card.description,
            agent_card=agent_card,
            url=a2a_agent_host,
            http_client=http_client,
        )
        prompt = input("Enter Greeting >>> ")
        print("\nSending message to Elastic A2A agent...")
        response = await agent.run(prompt)
        print("\nAgent Response:")
        for message in response.messages:
            print(message.text)


if __name__ == "__main__":
    asyncio.run(main())</code></pre><p>The code within the main() function demonstrates how to control your Elastic Agent Builder agent using the Agent Framework. It creates an <code>http_client</code> using a URL and API key for the agent, which you’ll provide from your Elastic project. Then the Agent Framework’s A2ACardResolver is called with that <code>http_client</code> to get your agent’s A2A agent card, using the <code>relative_card_path</code> of “<code>/helloworld_agent.json</code>”, which corresponds to your agent’s <strong>Agent ID</strong>, “helloworld_agent”. The code then uses the Agent Framework to invoke your agent with the A2A agent card. The final part of the main() function prompts the app user for a greeting and then sends that input as a prompt to your agent. Based on the instructions and tools specified when you created your agent, the agent’s response is displayed to the app user.</p><h2>Setting your agent URL and API Key as environment variables</h2><p>Make a copy of the file <em>env.example</em> and name the new file <em>.env</em>. Edit the newly created <em>.env</em> file to set the environment variables to values copied from your Elastic project.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6420978cf4e3edd5ff148dca47554995da3e3f22-1428x603.png" style="width: 100%;" /><p>First, we’ll replace <strong>&lt;YOUR-ELASTIC-AGENT-BUILDER-URL&gt;</strong> with the Agent URL path that you can copy from your Elastic project’s Agent Builder - Tools page.
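</p><p>For reference, the file defines just the two variables the code reads; it should look something like this before you fill in the placeholder values:</p>

```shell
# .env — replace both placeholders with values from your Elastic project
ES_AGENT_URL=<YOUR-ELASTIC-AGENT-BUILDER-URL>
ES_API_KEY=<YOUR-ELASTIC-API-KEY>
```

<p>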
Back in Elastic Agent Builder, click <strong>Agents</strong> in the navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/530ebafbc6327f24cb1a94b7c94d205339db6d28-1191x321.png" style="width: 100%;" /><p>Select <strong>Manage tools</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b8a29be0d6c8fa07deb2c523585c3a6bc67153b0-1999x992.png" style="width: 100%;" /><p>Click the <strong>MCP Server</strong> dropdown at the top of the Tools page. Select <strong>Copy MCP Server URL</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a9d5bf730b9541013db2b72601202d6a76e15f1-1977x1002.png" style="width: 100%;" /><p>Back in Visual Studio Code, within the <em>.env</em> file, find where the placeholder text “<strong>&lt;YOUR-ELASTIC-AGENT-BUILDER-URL&gt;</strong>” appears and paste in the copied <strong>MCP Server URL</strong> to replace the placeholder text. Now edit the pasted <strong>MCP Server URL</strong>. Delete the text “mcp” at the end of the URL and replace it with the text “a2a”. The edited URL should look something like this:</p><pre><code>https://example-project-a123.kb.westus2.azure.elastic.cloud/api/agent_builder/a2a</code></pre><p>The next placeholder text to replace in the <em>.env</em> file is <strong>&lt;YOUR-ELASTIC-API-KEY&gt;</strong>. We’ll replace it with an actual API Key from your Elastic project.
Back in your Elastic project, click <strong>Elasticsearch</strong> in the navigation menu.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11696609b354e75aa8110987b0d476634ac6b322-1965x663.png" style="width: 100%;" /><p>Click <strong>Create API key</strong> to create a new API key.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/67b15b2db4e43d6abe3ec42f3e8692f953aa4731-1995x1038.png" style="width: 100%;" /><p>Enter a <strong>Name</strong> for the API key and click <strong>Create API key</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e49445a77dc9730dd967f2aa8f8f11f4911eb41-1999x1076.png" style="width: 100%;" /><p>Click the <strong>copy</strong> button to copy the API key.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f5b6993bea36493cf5d80867df607a71c2c24bcf-1971x1038.png" style="width: 100%;" /><p>Back in Visual Studio Code, within the <em>.env</em> file, find where the placeholder text “<strong>&lt;YOUR-ELASTIC-API-KEY&gt;</strong>” appears and paste in the copied API Key value to replace the placeholder text.</p><p>Now we can save the changes we’ve made to the <em>.env</em> file. The edited file should look something like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/91636964285003a3dda01ce43214b74c33492393-1428x601.png" style="width: 100%;" /><h2>Run the example app</h2><p>It’s time to run the code. To do so, open a new terminal in Visual Studio Code.
Click the <strong>Terminal</strong> top level menu and select <strong>New Terminal</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11f2ea916bf277e39c98701c8d31e251fcdf6a8b-956x571.png" style="width: 100%;" /><p>In the new terminal, <code>cd</code> to change directory to the directory containing the <code>agent-builder-a2a-agent-framework</code> example app.</p><pre><code>cd elasticsearch-labs/supporting-blog-content/agent-builder-a2a-agent-framework</code></pre><p>In the terminal, create a Python virtual environment by running the following command.</p><pre><code>python -m venv .venv</code></pre><p>Activate the virtual environment by running the following command (based on your operating system) in the terminal window:</p><ul><li>If you’re running macOS or Linux, the command to activate the virtual environment is:</li></ul><pre><code>source .venv/bin/activate</code></pre><ul><li>If you’re on Windows, the command to activate the virtual environment is:</li></ul><pre><code>.venv\Scripts\activate</code></pre><p>The code in the <em>elastic_agent_builder_a2a.py</em> file is powered by the Microsoft Agent Framework, and we still need to install it, so let&#x27;s do that now. Run the following <em>pip</em> command to install the Python-based Agent Framework along with its necessary Python packages:</p><pre><code>pip install -r requirements.txt</code></pre><p>Hurray! Everything is now in its right place. It’s time for the good-feeling fireworks…let’s run it. Run the example code by entering the following command into the terminal:</p><pre><code>python elastic_agent_builder_a2a.py</code></pre><p>You should see the agent framework connect to the Elastic Agent. When prompted for a greeting, enter “hello world”. You should see the HelloWorld Agent’s response → Hello World! 
🌎</p><p>Top-notch work!</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f77903e5bcaa0f52bed80d5c8ea23e7c538561d6-1703x1027.gif" style="width: 100%;" /><p>Building agents and connecting them to tools in Agent Builder gets you immediate operability with the latest agent development platforms like the Microsoft Agent Framework. You now know how to create an Elastic agent and put it to use as a scalable relevant data source, ready to provide custom context to all the AI apps you’ll be building next.</p><p>Try <a href="https://cloud.elastic.co/registration?utm_source=agentic-ai-category&amp;utm_medium=search-labs&amp;utm_campaign=agent-builder">Elastic</a> for free and build some agents today!</p><p><br/></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[You Know, for Context - Part III: The power of hybrid search in context engineering]]></title>
            <description><![CDATA[Discover how to use context engineering and hybrid search to improve AI output accuracy with aggregations, RBAC, and non-content signals.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-agentic-ai-accuracy</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-agentic-ai-accuracy</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Woody Walton]]></dc:creator>
            <pubDate>Thu, 20 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b82ff25242e4229cc20b218d9cc91c60cfd680bc-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>We’ve discussed both hybrid search (<a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">Part I</a>) and context engineering (<a href="https://www.elastic.co/search-labs/blog/context-engineering-llm-evolution-agentic-ai">Part II</a>); now, let’s dive into how they work together for the greatest effect in supplying targeted context to RAG and agentic AI operations.</p><h2>Search isn’t dead, it’s just moved</h2><p>So we’ve had this shift from primarily searching for context through a text box and using the information (the context) returned to construct the answers ourselves, to now using natural language to tell an agent what we want and letting it automatically research and compile the answer for us. Many in the tech world are pointing to this shift and proclaiming “search is dead” (well, the SEO and ad-words world is <a href="https://www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/">definitely changing</a>: <a href="https://www.wired.com/story/goodbye-seo-hello-geo-brandlight-openai/">GEO</a> anyone?), but search is still absolutely critical to agentic operations — it’s just largely performed out of sight via tools now.</p><p>Previously, humans were the main arbiters of subjective relevance: each user has their own reasons for running the search, and their personal experience colors the relative accuracy of the results. If we are to trust that agents can come to the same conclusion (or better) as we would have, we need to ensure the contextual information they have access to is as close to our subjective intent as possible. 
We have to engineer the context we provide LLMs towards that goal!</p><h2>Generating context with hybrid search retrieval</h2><p>Just a reminder from way back in Part I that Elastic’s hybrid search combines the strengths of traditional keyword-based search (syntax flexibility, keyword precision, and relevance scoring) with the semantic understanding of vector similarity search, and offers multiple reranking techniques. This synergy (a truer usage of that word has never been found!) allows for highly relevant results, with queries that can be much more nuanced in how they target content. It’s not just that you can apply subjective relevance as <em>one</em> of your retrieval stages; it’s really that the first-stage retrieval can include relevance scoring along with all of those other modes at once.</p><h3>Superior accuracy &amp; efficiency</h3><p>Using a data platform that can provide distributed search, retrieval, and reranking as your primary context retrieval engine makes a lot of sense. You’re able to use advanced query syntax to add the missing component of subjective intent, and filter out content that might distract from or muddy the value of the contextual information returned. You can select from any of the individual syntax options available, or combine modalities into a single search that targets each type of data in the manner it understands best, and then combine/re-order them with reranking. You can have the response filtered to only include the fields/values you want, keeping extraneous data at bay. In service to agents, that targeting flexibility lets you build tools that are extremely accurate in how they retrieve context.</p><h3>Context refinement (aggregations and non-content signals)</h3><p>Aggregations can be especially useful in shaping the content a tool delivers to the context window. 
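As a small illustration of that shaping step, here’s a self-contained Python sketch: the response dictionary below is hand-built in the standard shape Elasticsearch returns for a nested terms + avg aggregation, but the index, field, and bucket values are invented. The helper collapses the aggregation result into a few lines of LLM-friendly context instead of passing raw documents:

```python
# Sketch of "context refinement" via aggregations: instead of handing an LLM
# thousands of raw documents, hand it a few lines of aggregated insights.
# The dict mimics an Elasticsearch terms + avg aggregation response shape;
# all names and numbers are illustrative.

agg_response = {
    "aggregations": {
        "by_category": {
            "buckets": [
                {"key": "laptops", "doc_count": 41230,
                 "avg_price": {"value": 1249.50}},
                {"key": "monitors", "doc_count": 18764,
                 "avg_price": {"value": 389.99}},
            ]
        }
    }
}

def summarize_for_context(response: dict, agg_name: str) -> str:
    """Collapse an aggregation response into a compact, LLM-friendly summary."""
    lines = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        lines.append(
            f"- {bucket['key']}: {bucket['doc_count']} docs, "
            f"avg price ${bucket['avg_price']['value']:.2f}"
        )
    return "Result shape by category:\n" + "\n".join(lines)

print(summarize_for_context(agg_response, "by_category"))
```

A tool that returns this summary string, rather than the matching documents themselves, hands the LLM numerical facts it can reason over in a handful of tokens.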
Aggregations naturally provide numerical facts about the shape of the contextual data returned, which makes it easier for LLMs to reason over them accurately. Because aggregations can be hierarchically nested, they’re an easy way to add multi-level detail for the LLM to generate a more nuanced understanding. Aggregations can also help with managing the context window size — you can easily reduce a query result of 100k documents to a few hundred tokens of aggregated insights.</p><p>Non-content signals are the inherent indicators in your data that tell you the bigger picture of what you’re looking at; they’re the additional characteristics of the results, things like popularity, freshness, geo-location, categories, host diversity, or price bands. These bits of information can be useful for informing the agent in how it weighs the importance of the context it has received. Some simple examples might help illustrate this best:</p><ul><li><strong>Boosting recently published and popular content</strong> - Imagine you have a knowledge base of articles. You want to find articles relevant to a user&#x27;s query, but you also want to boost articles that are both recent and have been found helpful by other users (e.g., have a high &quot;likes&quot; count). In this scenario, we can use a hybrid search to find relevant articles and then rerank them based on a combination of their publication date and popularity.</li><li><strong>E-commerce search with sales and stock adjustment</strong> - In an e-commerce setting, you want to show customers products that match their search term, but you also want to promote products that are selling well and are in stock. You might also want to down-rank products with low stock to avoid customer frustration.</li><li><strong>Prioritizing high-severity issues in a bug tracker</strong> - For a software development team, when searching for issues, it&#x27;s crucial to surface high-severity, high-priority, and recently updated issues first. 
You can use non-content signals like ‘criticality’ and ‘most-discussed’ to weigh different factors independently, ensuring that the most critical and actively discussed issues rise to the top.</li></ul><p>These example queries and more can be found in the accompanying Elasticsearch Labs <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/you-know-for-context/">content page</a>.</p><h3>Security enforcement</h3><p>A critical advantage of leveraging a search-powered speed layer like Elastic for context engineering is its built-in security framework. Elastic&#x27;s platform ensures that context delivered to agentic and generative AI operations respects and protects sensitive privately held information through granular role-based access control (RBAC) and attribute-based access control (ABAC). This means that not only are queries handled with efficiency, but also that the results are filtered according to the specific permissions of the agent or the user initiating the request.</p><p>Agents run as the authenticated user, so security is implicitly applied through the security features built into the platform:</p><ul><li><strong>Fine-grained permissions:</strong> Define access at the document, field, or even term level, ensuring that AI agents only receive data they are authorized to see.</li><li><strong>Role-based access control (RBAC):</strong> Assign roles to agents or users, granting access to specific datasets or functionalities based on their defined responsibilities.</li><li><strong>Attribute-based access control (ABAC):</strong> Implement dynamic access policies based on attributes of the data, the user, or the environment, allowing for highly adaptable and context-aware security.</li><li><strong>Document-level security (DLS) and field-level security (FLS):</strong> These capabilities ensure that even within a retrieved document, only authorized portions are visible, preventing sensitive information from being 
exposed.</li><li><strong>Integration with enterprise security:</strong> Seamlessly integrate with existing identity management systems (like LDAP, SAML, OIDC) to enforce consistent security policies across the entire organization.</li></ul><p>By integrating these security measures directly into the context retrieval mechanism, Elastic acts as a secure gatekeeper, ensuring that AI agents operate within defined data boundaries, preventing unauthorized data exposure, and maintaining compliance with data privacy regulations. This is paramount for building trust in agentic AI systems that handle confidential or proprietary information.</p><p>As an added bonus, by using a unified data speed layer over your enterprise data sources, you alleviate the unexpected ad hoc query loads on those repositories that agentic tools would create. You get a single location to search everything in near real-time, and one place to apply security and governance controls.</p><h2>Hybrid search-based tools</h2><p>There are some core features (with <a href="https://www.elastic.co/blog/whats-new-elastic-9-2-0">more coming all the time</a>) of the Elastic platform that turbo boost the pursuit of context engineering. The main thing here is that the platform offers a multitude of ways to achieve things, with the flexibility to adapt, change, and expand methods as the AI ecosystem advances.</p><h3>Introducing Agent Builder</h3><p>Elastic <a href="https://www.elastic.co/elasticsearch/agent-builder">Agent Builder</a> is our first foray into the realm of agentic AI tools built to chat with the data you’re already storing in Elastic. Agent Builder offers a chat interface that enables users to create and manage their own agents and tools within Kibana. It comes with built-in MCP and A2A servers, programmatic APIs, and a set of pre-built system tools for querying and exploring Elasticsearch indices, and for generating ES|QL queries from natural language. 
Agent Builder allows you to create custom tools that target and sculpt the contextual data returned to the agent through expressive <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a> query syntax.</p><p>How does ES|QL perform hybrid search, you ask? The core capability is accomplished through the combination of the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic_text</a> field type and the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork">FORK</a>/<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse">FUSE</a> commands (FUSE uses <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">RRF</a> by default to merge results from each fork). Here’s a simple example for a fictitious product search:</p><pre><code>FROM products METADATA _id, _index, _score
| FORK
    ( WHERE match(description, "high performance gaming laptop") | EVAL search_type = "bm25" )
    ( WHERE match(description_semantic, "high performance gaming laptop") | EVAL search_type = "semantic" )
| FUSE
| LIMIT 20
| KEEP product_name, description, _score, search_type</code></pre><p>The <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval">EVAL</a> clause included with each of the FORK branches in the example above isn’t strictly necessary; it’s only included to demonstrate how you could track which search modality a given result was returned from.</p><h3>Search templates</h3><p>Let’s say you want to point your own external agentic tools to your Elastic deployment. And instead of ES|QL, you want to use multi-stage retrievers or re-use existing DSL syntax you’ve developed, and also want to be able to control the inputs the query accepts, the syntax used to execute the search, and the fields returned in the output. <a href="https://www.elastic.co/docs/solutions/search/search-templates">Search templates</a> allow users to define reusable structures for common search patterns, improving efficiency and consistency in retrieving data. This is particularly beneficial for agentic tools interacting with search APIs, as they help standardize boilerplate code and enable faster iteration on search logic. And if you ever need to adjust any of those factors, you just update the search template and voilà, the changes are implemented. If you’re looking for an example of search templates in action with agentic tools, take a look at the Elasticsearch Labs blog ‘<a href="https://www.elastic.co/search-labs/blog/mcp-intelligent-search">MCP for intelligent search</a>’, which utilizes a search template behind a tool call from an external MCP server.</p><h3>Integrated workflows (FTW!)</h3><p>One of the most difficult things to navigate in our new agentic AI world is the non-deterministic nature of semi-autonomous, self-directed “reasoning” agents. Context engineering is a critical discipline for agentic AI: it’s the set of techniques that help narrow the possible conclusions our agent can generate down to what we know of ground truth. 
Even with a highly accurate and relevant context window (when we get outside the realm of numerical facts), we’re still missing that bit of reassurance that the agent’s response is fully repeatable and dependable.</p><p>When you run the same request to an agent multiple times, the answers might be <em>essentially</em> the same with <em>just that little bit</em> of difference in the response. That’s usually fine for simple queries, maybe barely noticeable, and we can try to shape the output with context engineering techniques. But as the tasks we ask of our agents become more complex, there’s more of a chance that one or more of the sub-tasks could introduce a variance that slightly changes the end result. It’ll likely get worse as we begin to rely more on agent-to-agent communications, and those variances will become cumulative. This points again to the idea that the tools our agents interact with need to be very flexible and tuneable to precisely target contextual data, and that they should respond in an expected output format. It also indicates that for many use cases we have a need to direct agent and tool interactions — this is where workflows enter into the picture!</p><p>Elastic will soon have completely customizable workflows built into the core of the platform. These workflows will be able to operate with agents and tools in a bi-directional manner, so workflows will be able to call agents and tools, and agents and tools will be able to call workflows. Having these capabilities fully integrated into the same search AI platform where all of your data lives will be transformational; the potential of workflows is extremely exciting! Soon, coming very soon!</p><h3>Elastic as the unified memory bank</h3><p>By virtue of being a distributed data platform that’s made for near real-time search, Elastic naturally performs the long-term memory functions for agentic AI systems. 
With the built-in Agent Builder chat experience, we also have tracking and management of the short-term memory and chat history. And because the entire platform is API-first, it’s extremely easy to utilize Elastic as the platform to persist a tool’s contextual output that might overwhelm the agent’s context window (and to be able to refer to it later); this technique is sometimes called “<a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents#:~:text=Agents%20can%20assemble%20understanding%20layer%20by%20layer%2C%20maintaining%20only%20what%27s%20necessary%20in%20working%20memory%20and%20leveraging%20note%2Dtaking%20strategies%20for%20additional%20persistence">note-taking</a>” in context engineering circles.</p><p>Having short-term and long-term memory both on the same search platform leads to a lot of intrinsic benefits: imagine being able to use chat histories and persisted contextual responses as part of the semantic influencers to future chat interactions, or to perform threat analytics, or to create persisted data products that are automatically generated from frequently repeated tool calls… The possibilities are endless!</p><h2>Conclusion</h2><p>The emergence of large language models has changed the way we’re able to match content and the methods we use to interrogate our data. We’re rapidly shifting away from our current world, where humans perform the research, contextual consideration, and logical reasoning to answer their own questions, to one where those steps are largely automated through agentic AI. In order for us to trust the generated answers we receive, we need assurance that the agent has considered <em>all</em> of the <em>most relevant</em> information (including the factor of subjective relevance) in generating its response. 
Our primary method for making agentic AI trustworthy is grounding the tools that retrieve additional context through RAG and context engineering techniques, but how those tools perform the <em>initial retrieval</em> can be critical to the accuracy of the response.</p><p>The Elastic Search AI platform provides the flexibility and advantage of hybrid search, along with several built-in features that help agentic AI in terms of accuracy, performance, and scalability; in other words, Elastic makes a fantastic platform for several aspects of context engineering! In standardizing context retrieval via a search platform, we simplify agentic tool operations on several fronts — and similar to the adage “slow down to go faster,” simplicity at the context generation layer means faster and more trustworthy agentic AI.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ Implementing agentic search with observability to autotune relevance in Elasticsearch]]></title>
            <description><![CDATA[Learn how to implement agentic search in Elasticsearch to automatically train a relevance model from logged user data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agentic-search-relevance-autotuning-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agentic-search-relevance-autotuning-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Craig Angulo, James Williams]]></dc:creator>
            <pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/925bef05321b84a8794c2f4ba15ce29a64c6a223-842x422.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Every search bar is a broken promise. Users type natural language queries like &quot;beachfront paradise in Hawaii with a chef&#x27;s kitchen&quot; but get back irrelevant results from lexical search. We&#x27;ve solved this with agentic search and relevance autotuning: a reference architecture that combines natural language understanding with self-improving search that learns from every user interaction. No data scientists needed. No manual tuning. Just search that gets smarter automatically. In this post, I&#x27;ll show you exactly how to build it.</p><p>Who does this help:</p><ul><li><strong>Insurance companies</strong>: Customers find the right coverage without understanding policy jargon</li><li><strong>SaaS companies</strong>: Developers find documentation using natural language</li><li><strong>E-commerce</strong>: Buyers discover products using conversational queries</li><li><strong>Any business with search</strong>: Turn your search bar from a cost center into a competitive advantage</li></ul><h2>Objective</h2><p>In this blog, we walk through the steps you will need to take to enable an agentic search solution leveraging any LLM you want and the Elastic platform. This search solution will automatically train a Learn-to-Rank model based on user interactions. 
We will be leveraging a dataset of properties to make a <code>home-search-agent</code> that allows users to ask natural language queries like “Show me houses for sale in Hawaii that have 3 bedrooms, a pool, and cost under 1 million” and have rich search results returned to them.</p><h2>Architecture overview</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3d4e92f5ac68e7e44f72cf331bf6dfbe95ec650b-1600x725.png" style="width: 100%;" /><ol><li><strong>User asks naturally</strong> → &quot;Show me houses for sale in Hawaii that have 3 bedrooms&quot;</li><li><strong>Agent understands</strong> → Translates to optimized Elasticsearch query</li><li><strong>Results delivered</strong> → Relevant results, no keyword gymnastics</li><li><strong>System learns</strong> → Every interaction teaches the model what users actually want</li></ol><h2>Learn-to-Rank (LTR) implementation</h2><p>The most difficult part of training Learn-to-Rank algorithms is creating judgment lists; this is now handled automatically with Elastic Logging. 
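Conceptually, that automation collapses logged interactions into graded relevance judgments. The following is an illustrative sketch, not the repo’s actual code: the event schema and grading thresholds are invented to show the idea.

```python
# Sketch: turn logged search interactions into an LTR judgment list.
# Event schema and grade values are invented for illustration.
from collections import defaultdict

def build_judgment_list(events):
    """Collapse log events into graded (query, doc) relevance judgments."""
    signals = defaultdict(lambda: {"impression": 0, "click": 0, "follow_up": 0})
    for e in events:
        signals[(e["query"], e["doc_id"])][e["event"]] += 1

    judgments = []
    for (query, doc_id), s in sorted(signals.items()):
        grade = 0              # shown but ignored: irrelevant
        if s["follow_up"]:
            grade = 3          # user asked for more detail: strong relevance
        elif s["click"]:
            grade = 2          # clicked result: moderate relevance
        judgments.append({"query": query, "doc_id": doc_id, "grade": grade})
    return judgments

# Simulated log events for two queries
events = [
    {"query": "affordable home", "doc_id": "p1", "event": "impression"},
    {"query": "affordable home", "doc_id": "p1", "event": "click"},
    {"query": "affordable home", "doc_id": "p2", "event": "impression"},
    {"query": "luxury home", "doc_id": "p9", "event": "click"},
    {"query": "luxury home", "doc_id": "p9", "event": "follow_up"},
]
```

In a real pipeline, labels like these, joined with property and engagement features, become the training set for the XGBoost reranker.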
As searches are run with your home-search-agent, the queries, results, and follow-up questions get logged for training.</p><ul><li>This example looks at 48 features, ranging from property attributes (pool, number of rooms) to engagement signals (click-through rate, follow-up questions)</li><li>Once enough conversations have been logged, an XGBoost-based reranking model will be trained on the data and deployed back to Elasticsearch.</li><li>The feedback loop: search → log events → train model → deploy model → improved search</li></ul><h2>Setup and deployment</h2><p>A detailed step-by-step is provided in the <a href="https://github.com/jwilliams-elastic/agentic-search-o11y-autotune">GitHub README</a>, along with a technical deep dive.</p><ol><li>Ensure you have Node.js, Git, and Python &gt;=3.10, &lt;3.13</li><li>Clone the provided code repository: <a href="https://github.com/jwilliams-elastic/agentic-search-o11y-autotune">https://github.com/jwilliams-elastic/agentic-search-o11y-autotune</a></li><li>Set up a virtual environment with the requirements.txt provided</li><li>Create an Elastic Serverless project and copy down the following:<ol><li>ELASTIC_URL</li><li>ELASTIC_API_KEY</li></ol></li><li>Create a .env file and provide the credentials from Step 4 (additionally, you will provide your LLM’s API_KEY)</li><li>Open a terminal in your virtual environment and run <code>npm run dev</code>. 
This will spin up your Mastra server and provide you with a URL like <a href="http://localhost:4111/workflows">http://localhost:4111/</a></li></ol><h2>What was deployed and how it is used</h2><ul><li>Multiple search templates were created during the deployment workflow</li><li>Each of these search templates provides a different configuration, from V1 to V4</li><li>The home-search-agent has a provided prompt that allows the LLM to fill in search templates</li><li>The LLM is able to convert natural language queries and fill in search templates via an Elasticsearch tool the agent can access</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c0a78ce82c320697a1616e94d290447af281cae3-1080x450.png" style="width: 100%;" /><h2>Demo workflow</h2><ol><li>Open <a href="http://localhost:4111/workflows">http://localhost:4111/workflows</a></li><li>Run <code>elastic-setup-workflow</code> (the .env file has default values, but you can override them in the Mastra UI)</li><li>Run <code>search-autotune-workflow</code> (the LOW and HIGH options generate different simulated search engagement behavior: HIGH = Luxury, LOW = Affordable)</li><li>Open <a href="http://localhost:4111/agents">http://localhost:4111/agents</a> and run the &quot;Home Search Agent&quot;</li><li>Show the difference between LTR and no-LTR LLM judgment with a query like &quot;affordable home&quot;, &quot;luxury home&quot; and &quot;6 bed, 6 bath single family home near orlando fl with garage and pool under 5M with designer finishes throughout&quot;</li><li>You can trigger engagement by asking for more detail for a specific result (e.g., tell me more about result #20 in v4 results)</li><li>Open the &quot;Agentic Search Analytics&quot; dashboard to see KPIs like CTR, average click position, and search template usage.</li></ol><h2>Agents in action</h2><p>With your Elastic Agents running on the Mastra Framework, you can now choose between the <code>home-search-agent</code>, which allows you to run property searches, 
and the <code>home-search-agent-ltr-comparison</code>, which allows you to see the direct benefits of training on personalized data.</p><h3>The home-search-agent</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a10e3466d36b135de3c388217ac5b9da17e5ff6b-1600x1016.png" style="width: 100%;" /><h3>The home-search-agent-ltr-comparison</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f03dce0c7fe436618c7175874b7dc30897ecd788-1600x1202.png" style="width: 100%;" /><h2>Why now?</h2><p>As users increasingly expect ChatGPT-like search experiences, traditional search engines lose customers to poor relevance and complicated manual tuning. This architecture democratizes access by enabling non-technical users to find what they need without crafting complex queries, while LTR models continuously adapt to individual behavior patterns to increase relevance. Automated retraining keeps results current as trends evolve, and the built-in observability dashboard reveals usage patterns and gaps in your offerings. These insights directly inform feature development and keep you aligned with actual user needs rather than assumptions. Ready to transform your search engine&#x27;s relevance? Contact us at <a href="http://elastic.co/contact">elastic.co/contact</a> to find out how your search can begin working for you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[You Know, for Context - Part II: Agentic AI and the need for context engineering]]></title>
            <description><![CDATA[Learn how the evolution of LLMs towards agentic AI  increases the need for context engineering to solve RAG context limits and memory management.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-engineering-llm-evolution-agentic-ai</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-engineering-llm-evolution-agentic-ai</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Woody Walton]]></dc:creator>
            <pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/79c0378b68d74d9e018c35ee2c1fd17daeee9f2c-1080x608.webp?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>With that (fairly extensive) <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">background</a> on the ways LLMs have changed the underlying processes of information retrieval, let’s see how they’ve also changed the way we query for data.</p><h2>A new way of interacting with data</h2><p>Generative AI (genAI) and agentic AI do things differently than traditional search. Whereas the way we used to begin researching information was a search (“let me Google that…”), the initiating action for both genAI and agents is usually natural language entered into a chat interface. The chat interface is a discussion with an LLM that uses its semantic understanding to turn our question into a distilled answer, a summarized response seemingly coming from an oracle that has a broad knowledge of all kinds of information. What really sells it is the LLM’s ability to generate coherent, thoughtful sentences that string together the bits of knowledge it surfaces — even when it’s inaccurate or totally hallucinated, there’s a <a href="https://en.wikipedia.org/wiki/Truthiness">truthiness</a> to it.</p><p>That old search bar we’ve been so used to interacting with can be thought of as the RAG engine we used when <strong><em>we ourselves</em></strong> were the reasoning agent. Now, even Internet search engines are turning our well-worn “hunt and peck” lexical search experience into AI-driven overviews that answer the query with a summary of the results, helping users avoid the need to click through and evaluate individual results themselves.</p><h2>Generative AI &amp; RAG</h2><p>Generative AI tries to use its semantic understanding of the world to parse the subjective intention stated through a chat request, and then uses its inference abilities to create an expert answer on the fly. 
There are several parts to a generative AI interaction: the user’s input/query, previous conversations in the chat session that can be used as additional context, and the instructional prompt that tells the LLM how to reason and what procedures to follow in constructing the response. Prompts have evolved from simple &quot;explain this to me like I am a five-year-old&quot; type of guidance to complete breakdowns for how to process requests. These breakdowns often include distinct sections describing details of the AI’s persona/role, pre-generation reasoning/internal thought process, objective criteria, constraints, output format, and audience, as well as examples to help demonstrate the expected results.</p><p>In addition to the user’s query and the system prompt, retrieval augmented generation (RAG) provides additional contextual information in what’s called a “context window.” RAG has been a critical addition to the architecture; it’s what we use to inform the LLM about the missing pieces in its semantic understanding of the world.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b9671d5d07d4caefde372bb3188000754a91eed-1470x746.png" style="width: 100%;" /><p>Context windows can be kind of <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html">persnickety</a> in terms of what, where, and how much you give them. Which context gets selected is very important, of course, but the signal-to-noise ratio of the provided context also matters, as well as the length of the window.</p><h3>Too little information</h3><p>Providing too little information in a query, prompt, or context window can lead to hallucinations because the LLM can’t accurately determine the correct semantic context to generate a response from. There are also issues with the vector similarity of document chunk sizes — a short, simple question may not semantically align with the rich, detailed documents found in our vectorized knowledge bases. 
Query expansion techniques such as <a href="https://medium.com/data-science/how-to-use-hyde-for-better-llm-rag-retrieval-a0aa5d0e23e8">Hypothetical Document Embeddings (HyDE)</a> have been developed that use LLMs to generate a hypothetical answer that is richer and more expressive than the short query. The danger here, of course, is that the hypothetical document is itself a hallucination that takes the LLM even further astray from the correct context.</p><h3>Too much information</h3><p>Just like it does to us humans, too much information in a context window can overwhelm and confuse an LLM about what the important parts are supposed to be. Context overflow (or “<a href="https://research.trychroma.com/context-rot">context rot</a>”) affects the quality and performance of generative AI operations; it greatly impacts the LLM’s “attention budget” (its working memory) and dilutes relevance across many competing tokens. The concept of “context rot” also includes the observation that LLMs tend to have a <a href="https://alexandrabarr.beehiiv.com/p/context-windows">positional bias</a> — they prefer the content at the beginning or end of a context window over content in the middle section.</p><h3>Distracting or conflicting information</h3><p>The larger a context window gets, the more chance there is that it might include superfluous or conflicting information that can serve to distract the LLM from selecting and processing the correct context. In some ways, it becomes a problem of garbage in/garbage out: just dumping a set of document results into a context window gives the LLM a lot of information to chew on (potentially too much), but depending on how the context was selected, there is a greater possibility for conflicting or irrelevant information seeping in.</p><h2>Agentic AI</h2><p>I told you there was a lot of ground to cover, but we did it — we’re finally talking about agentic AI topics! 
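</p><p>One quick retrieval aside before we dive in: the HyDE technique described earlier is easy to sketch. The snippet below is purely illustrative — in it, <code>generate_hypothetical_answer</code> and <code>embed</code> are mocks standing in for real LLM and embedding-model calls, and the toy letter-frequency embedding is not a real model.</p>

```python
# HyDE sketch: expand a short query into a hypothetical document, then
# search with the *document's* embedding instead of the query's.
# Both helper functions are mocks standing in for real model calls.

def generate_hypothetical_answer(query: str) -> str:
    # Stand-in for an LLM call that drafts a rich, document-like answer.
    return (f"Answer to '{query}': a detailed, document-style passage "
            "that reads more like the chunks stored in the index.")

def embed(text: str) -> list[float]:
    # Stand-in for an embedding model: a unit-length letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def hyde_search(query: str, corpus: list[str]) -> str:
    hypothetical = generate_hypothetical_answer(query)
    qvec = embed(hypothetical)
    # Vectors are unit-length, so the dot product is cosine similarity.
    def score(doc: str) -> float:
        return sum(a * b for a, b in zip(qvec, embed(doc)))
    return max(corpus, key=score)
```

<p>The essence of the pattern is the middle step: ranking happens against the embedding of the generated document rather than the raw query, which is also why a hallucinated “hypothetical” can drag retrieval off course. </p><p>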
Agentic AI is a very exciting new usage of LLM chat interfaces that expands on generative AI’s (can we call it “legacy” already?) ability to synthesize responses based on its own knowledge and contextual information you provide. As generative AI became more mature, we realized there was a certain level of tasking and automation we could have LLMs perform, initially relegated to tedious low-risk activities that can easily be checked/validated by a human. Over a short period of time, that initial scope grew: an LLM chat window can now be the spark that sends an AI agent off to autonomously plan, execute, and iteratively evaluate and adapt its plan to achieve its specified goal. Agents have access to their LLMs’ own reasoning, the chat history and thinking memory (such as it is), and they also have specific tools made available that they can utilize towards that goal. We’re also now seeing architectures that allow a top-level agent to function as the orchestrator of multiple <a href="https://www.philschmid.de/the-rise-of-subagents">sub-agents</a>, each with their own logic chains, instruction sets, context, and tools.</p><p>Agents are the entry point to a mostly automated workflow: they’re self-directed in that they are able to chat with a user and then use ‘logic’ to determine what tools they have available to help answer the user’s question. Tools are usually considered passive as compared to agents, and are built to do one type of task. The <em>types</em> of tasks a tool could perform are kind of limitless (which is really exciting!), but a primary task tools perform is to gather contextual information for an agent to consider in executing its workflow.</p><p>As a technology, agentic AI is still in its infancy and prone to the LLM equivalent of attention deficit disorder — it easily forgets what it was asked to do, and often runs off to do other things that weren’t part of the brief at all. 
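</p><p>That agent-and-tools loop can be sketched in a few lines. Everything here is a toy: the registry, the tool names, and the keyword-match “logic” are invented for illustration, and a real agent would let the LLM pick the tool and its arguments.</p>

```python
# Minimal agent loop: register tools, "reason" about which one fits,
# call it for context, and hand the context to the LLM (mocked here).
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    # Decorator that registers a function as a callable tool.
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("search_docs")
def search_docs(query: str) -> str:
    # Stand-in for a RAG lookup against a knowledge base.
    return f"[doc snippet relevant to: {query}]"

@tool("get_weather")
def get_weather(query: str) -> str:
    # Stand-in for an external API call.
    return "[weather report: sunny]"

def run_agent(user_query: str) -> str:
    # 1. "Reason" about which tool fits (a trivial keyword match here).
    name = "get_weather" if "weather" in user_query.lower() else "search_docs"
    # 2. Call the tool to gather contextual information.
    context = TOOLS[name](user_query)
    # 3. A real agent would now pass the query plus context back to the LLM.
    return f"Answer using {name}: {context}"
```

<p>In practice, the “which tool?” step is itself an LLM call, with the registered tool descriptions included in the prompt or exposed over a protocol like MCP. </p><p>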
Underneath the apparent magic, the “reasoning” abilities of LLMs are still based on predicting the next most likely token in a sequence. For reasoning (or someday, artificial general intelligence (AGI)) to become reliable and trustworthy, we need to be able to verify that, when given the correct, most up-to-date information, LLMs will reason the way we expect them to (and perhaps give us that little extra bit more that we might not have thought of ourselves). For that to happen, agentic architectures will need the ability to communicate clearly (protocols), to adhere to the workflows and constraints we give them (guardrails), to remember where they are in a task (state), to manage their available memory space, and to validate that their responses are accurate and meet the task criteria.</p><h2>Talk to me in a language I can understand</h2><p>As is common in new areas of development (especially so in the world of LLMs), there were initially quite a few approaches for agent-to-tool communications, but they quickly converged on the <a href="https://modelcontextprotocol.io/docs/getting-started/intro">Model Context Protocol (MCP)</a> as the de facto standard. The definition of Model Context Protocol is truly in the name: it’s the <strong>protocol</strong> a <strong>model</strong> uses to request and receive <strong>contextual</strong> information. MCP acts as a universal adapter for LLM agents to connect to external tools and data sources; it simplifies and standardizes the APIs so that different LLM frameworks and tools can easily interoperate. That makes MCP a kind of pivot point between the orchestration logic and system prompts given to an agent to perform autonomously in the service of its goals, and the operations sent to tools to perform in a more isolated fashion (isolated at least with regard to the initiating agent).</p><p>This ecosystem is all so new that every direction of expansion feels like a new frontier. 
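</p><p>To make the “universal adapter” point concrete, here is roughly what MCP traffic looks like on the wire. MCP is built on JSON-RPC 2.0, and the method names (<code>tools/list</code>, <code>tools/call</code>) come from the MCP specification, but the tool name and arguments below are hypothetical.</p>

```python
# Sketch of MCP's JSON-RPC 2.0 message shapes. The method names follow
# the MCP specification; "search_knowledge_base" and its arguments are
# made up for illustration.
import json

# The client first asks the server which tools it exposes.
list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# It then invokes one of those tools with structured arguments.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_knowledge_base",  # hypothetical tool name
        "arguments": {"query": "context engineering", "top_k": 3},
    },
}

wire = json.dumps(call_request)
```

<p>Because every server speaks the same protocol, the agent-side code that builds these messages doesn’t change when a new data source is plugged in; only the advertised tool list does. </p><p>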
We have similar protocols for agent-to-agent interactions (<a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">Agent2Agent (A2A)</a>, natch!) as well as other projects for improving agent reasoning memory (<a href="https://venturebeat.com/ai/new-memory-framework-builds-ai-agents-that-can-handle-the-real-worlds">ReasoningBank</a>), for selecting the best MCP server for the job at hand (<a href="https://arxiv.org/abs/2505.03275">RAG-MCP</a>), and for using semantic analysis such as zero-shot classification and pattern detection on input and output as <a href="https://openai.github.io/openai-guardrails-python/">Guardrails</a> to control what an agent is allowed to operate on.</p><p>Have you noticed that the underlying intent of each of these projects is to improve the quality and control of the information returned to an agent/genAI context window? While the agentic AI ecosystem continues to develop the ability to handle that contextual information better (to control, manage, and operate on it), there will always be the need to retrieve the <em>most relevant</em> contextual information as the grist for the agent to mill on.</p><h2>Welcome to context engineering!</h2><p>If you’re familiar with generative AI terms, you’ve probably heard of ‘prompt engineering’; at this point, it’s almost a pseudo-science of its own. Prompt engineering is used to find the best and most efficient ways of proactively describing the behaviors you want the LLM to use in generating its response. 
‘<a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">Context engineering</a>’ extends ‘prompt engineering’ techniques beyond the agent side to also cover available context sources and systems on the tools side of the MCP protocol, and includes the broad topics of context management, processing, and generation:</p><ul><li><strong>Context management </strong>- Related to maintaining state and context efficiency across long-running and/or more complex agentic workflows. Iterative planning, tracking, and orchestration of tasks and tool calling to accomplish the agent’s goals. Due to the limited “attention budget” agents have to work within, context management is largely concerned with techniques that help refine the context window to capture both the fullest scope and the most important bits of context (its precision versus recall!). Techniques include compression, summarization, and persisting context from previous steps or tool calls to make room in working memory for additional context in subsequent steps.</li><li><strong>Context processing </strong>- The logical and hopefully mostly programmatic steps to integrate, normalize, or refine context acquired from disparate sources so that the agent can reason across all the context in a somewhat uniform manner. The underlying work is to make context from all sources (prompts, RAG, memory, etc.) consumable by the agent as efficiently as possible. 
</li><li><strong>Context generation </strong>- If context processing is about making retrieved context usable to the agent, then context generation gives the agent the means to request and receive additional contextual information at will, though within constraints.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4a8240e1eb078b2294b8d981b9caa8593589cac4-1600x900.png" style="width: 100%;" /><p>The various ephemera of LLM chat applications map directly (and sometimes in overlapping ways) to those high-level functions of context engineering:</p><ul><li><strong>Instructions / system prompt</strong> - Prompts are the scaffolding for how the generative (or agentic) AI activity will direct its thinking towards accomplishing the user’s goal. Prompts are context in their own right; they aren’t just tonal instructions — they also frequently include task execution logic and rules for things like “thinking step by step” or “take a deep breath” before responding to validate that the answer fully addresses the user’s request. Recent testing has shown markup languages are very effective at framing the different parts of a prompt, but care should also be taken to calibrate the instructions to a sweet spot between too vague and too specific; we want to give enough instruction for the LLM to find the right context, but not be so prescriptive that it misses unexpected insights.</li><li><strong>Short-term memory</strong> (state/history) - Short-term memory is essentially the chat session interactions between the user and the LLM. These are useful in refining context in live sessions, and can be saved for future retrieval and continuation. </li><li><strong>Long-term memory</strong> - Long-term memory should consist of information that is useful across multiple sessions. 
And it’s not just domain-specific knowledge bases accessed through RAG; recent research uses the outcomes of previous agentic/generative AI requests as material to learn from and reference in current agentic interactions. Some of the most interesting innovations in the long-term memory space are related to adjusting how state is <a href="https://steve-yegge.medium.com/introducing-beads-a-coding-agent-memory-system-637d7d92514a">stored and linked to</a> so that agents can pick up where they left off. </li><li><strong>Structured output</strong> - Cognition requires effort, so it’s probably no surprise that even with reasoning capabilities, LLMs (just like humans) want to expend less effort when thinking. In the absence of a defined API or protocol, having a map (a schema) for how to read data returned from a tool call is extremely helpful. The inclusion of <a href="https://platform.openai.com/docs/guides/structured-outputs?lang=javascript">Structured Outputs</a> as part of the agentic framework helps to make these machine-to-machine interactions faster and more reliable, with less thinking-driven parsing needed.</li><li><strong>Available tools</strong> - Tools can do all sorts of things, from gathering additional information (e.g., issuing RAG queries to enterprise data repositories, or through online APIs) to performing automated actions on behalf of the agent (like booking a hotel room based on the criteria of the request from the agent). Tools could also be sub-agents with their own agentic processing chains. 
</li><li><strong>Retrieval Augmented Generation (RAG)</strong> - I really like the description of RAG as “dynamic knowledge integration.” As described earlier, RAG is the technique for providing the additional information the LLM didn’t have access to when it was trained, or it’s a reiteration of the ideas we think are most important to get the right answer — the one that’s most relevant to our subjective query.</li></ul><h2>Phenomenal cosmic power, itty bitty living space!</h2><p>Agentic AI has so many fascinating and exciting new realms to explore! There are still lots of the old traditional data retrieval and processing problems to solve, but also brand new classes of challenges that are only now being exposed to the light of day in the new age of LLMs. Many of the immediate issues we’re grappling with today are related to context engineering, about getting LLMs the additional contextual information they need without overwhelming their limited working memory space.</p><p>The flexibility of semi-autonomous agents that have access to an array of tools (and other agents) gives rise to so many new ideas for implementing AI, it’s hard to fathom the different ways we might put the pieces together. Most of the current research falls into the field of context engineering and is focused on building memory management structures that can handle and track larger amounts of context — that’s because the deep-thinking problems we really want LLMs to solve present increased complexity and longer-running, multi-phased thinking steps where remembering is extremely important.</p><p>A lot of the ongoing experimentation in the field is trying to find the optimal task management and tool configurations to feed the agentic maw. Each tool call in an agent’s reasoning chain incurs cumulative cost, both in terms of compute to perform that tool’s function as well as the impact to the limited context window. 
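</p><p>To make that cumulative cost concrete, here is a toy token budget where each tool result either fits or forces older context out. Counting tokens by whitespace split is a stand-in for a real tokenizer, and blunt eviction stands in for the compression/summarization step.</p>

```python
# Toy context-window budget: every added chunk consumes tokens, and when
# the budget is exceeded the oldest context is dropped (a real system
# would summarize rather than discard outright).

class ContextWindow:
    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.chunks: list[str] = []

    @staticmethod
    def count(text: str) -> int:
        # Whitespace split as a crude stand-in for a tokenizer.
        return len(text.split())

    def used(self) -> int:
        return sum(self.count(c) for c in self.chunks)

    def add(self, text: str) -> None:
        # Evict the oldest context until the new chunk fits -- the lossy
        # step where important earlier context can silently disappear.
        while self.chunks and self.used() + self.count(text) > self.budget:
            self.chunks.pop(0)
        self.chunks.append(text)

window = ContextWindow(budget_tokens=10)
window.add("tool one returned these five tokens")  # 6 tokens
window.add("tool two returned six more tokens")    # 6 tokens: evicts the first
```

<p>Every tool call’s output lands in a structure like this, which is why succinct tool results matter so much. </p><p>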
Some of the latest techniques to manage context for LLM agents have caused unintended knock-on effects like “<a href="https://venturebeat.com/ai/ace-prevents-context-collapse-with-evolving-playbooks-for-self-improving-ai">context collapse</a>,” where compressing/summarizing accumulated context for long-running tasks gets <em>too</em> lossy. The desired outcome is tools that return succinct and accurate context, without extraneous information bleeding into the precious context window memory space.</p><h3>So many/too many possibilities</h3><p>We want separation of duties with the flexibility to reuse tools/components, so it makes complete sense to create dedicated agentic tools for connecting to specific data sources — each tool can specialize in querying one type of repository, one type of data stream, or even one use case. But beware: in the drive to save time/money/prove something is possible, there’s going to be a strong temptation to use LLMs as a federation tool… Try not to; we’ve been <a href="https://www.elastic.co/pdf/elastic-distributed-not-federated-search.pdf">down that road</a> before! Federated query acts like a “universal translator” that converts an incoming query into the syntax that the remote repository understands, and then has to somehow rationalize the results from multiple sources into a coherent response. Federation as a technique <em>works</em> <em>okay</em> at small scales, but at large scales, and especially when data is multimodal, federation tries to bridge gaps that are just too wide.</p><p>In the agentic world, the agent would be the federator and the tools (through MCP) would be the manually defined connections to disparate resources. Using dedicated tools to reach out across unconnected data sources might seem like a powerful new way to dynamically unite different data streams on a per-query basis, but using tools to ask the same question of multiple sources will likely end up causing more issues than it solves. 
Each of those data sources is likely a different type of repository underneath, each with its own capabilities for retrieving, ranking, and securing the data within it. Those variances or “impedance mismatches” between repositories add to the processing load, of course. They also potentially introduce conflicting information or signals, where something as seemingly innocuous as a scoring misalignment could wildly throw off the importance given to a bit of returned context, and affect the relevance of the generated response in the end.</p><h3>Context switching is hard for computers, too</h3><p>When you send an agent out on a mission, often its first task is to find all the relevant data it has access to. Just as it is with humans, if each data source the agent connects to replies with dissimilar and disaggregated responses, there will be cognitive load (though not exactly the same kind) associated with extracting the salient contextual bits from the retrieved content. That takes time/compute, and each little bit adds up in the agentic logic chain. This leads to the conclusion that, just like what’s being discussed for <a href="https://blog.cloudflare.com/code-mode/">MCP</a>, most agentic tools should instead behave more like APIs — isolated functions with known inputs and outputs, tuned to support the needs of different kinds of agents. Heck, we’re even realizing that <a href="https://arxiv.org/html/2501.12372v5">LLMs need context for context</a> — they do much better at connecting the semantic dots, especially when it’s a task like translating natural language into structured syntax, when they have a schema to refer to (RTFM indeed!).</p><h2>7th inning stretch!</h2><p>Now we’ve covered the <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">impact LLMs have had on retrieving and querying for data</a>, as well as how the chat window is maturing into the agentic AI experience. 
Let’s put the two topics together and see how we can use our newfangled search and retrieval capabilities to improve our results in context engineering. Onwards to <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-agentic-ai-accuracy">Part III: The power of hybrid search in context engineering</a>!</p>]]></content:encoded>
        </item>
    </channel>
</rss>