<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title><![CDATA[Elasticsearch Labs]]></title>
        <description><![CDATA[Articles and tutorials from the Search team at Elastic]]></description>
        <link>https://www.elastic.co/search-labs</link>
        <image>
            <url>https://www.elastic.co/search-labs/favicon.svg</url>
            <title>Elasticsearch Labs</title>
            <link>https://www.elastic.co/search-labs</link>
        </image>
        <generator>Elasticsearch Labs</generator>
        <lastBuildDate>Sat, 09 May 2026 18:06:47 GMT</lastBuildDate>
        <atom:link href="https://www.elastic.co/search-labs/rss/feed" rel="self" type="application/rss+xml"/>
        <copyright><![CDATA[© 2026. Elasticsearch B.V. All Rights Reserved]]></copyright>
        <language><![CDATA[en]]></language>
        <item>
            <title><![CDATA[From averages to any percentile: Elasticsearch ships native exponential histogram support in ES|QL]]></title>
            <description><![CDATA[Query any percentile at any time. Elasticsearch natively stores OTel exponential histograms and lets you analyze distributions in ES|QL without fixed buckets or lossy conversions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/otel-histogram-metrics-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/otel-histogram-metrics-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jonas Kunz]]></dc:creator>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f44560ece4dcc46e6a01826b597e094169e99691-848x477.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch adds native support for OpenTelemetry exponential histograms in ES|QL. Unlike fixed-bucket histograms, exponential histograms dynamically adapt to your data — giving you accurate percentile estimates (median, p99, any percentile you want) at query time with guaranteed error bounds. No more pre-defining buckets, no more lossy conversions. </p><p>Just send your OTel metrics to the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/tsds-ingest-otlp">Elasticsearch OTLP/HTTP endpoint</a> and they&#x27;re stored using the new <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/exponential-histogram">exponential_histogram</a> type and queryable immediately. Already have historical data stored in the classic histogram type? A simple <code>::exponential_histogram</code> cast in your ES|QL queries handles the migration transparently. Already using <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/downsampling-time-series-data-stream">downsampling</a>? Both histogram field types are now fully supported.</p><h2>Histogram metrics</h2><p>When dealing with metrics (in OpenTelemetry or Prometheus, for instance), counters and gauges are the most common metric types. Gauges allow you to monitor values that rise or fall (e.g., CPU utilization). Counters allow you to, well, count things, such as the total number of HTTP requests your service is handling. Counters normally just increase in value, with a few exceptions when they reset, like when a server reboots.</p><p>Beyond counting requests, you can additionally collect a counter measuring the total sum of your HTTP response times, which allows you to derive the average response time by dividing that sum by the total number of requests. However, average response times provide limited insights into the collected data and the system behavior. The best insights are gained by analyzing the collected metric distribution, e.g., through median and percentile calculations. This is where counters fall short.</p><p>In the past, workarounds have been applied: classic Prometheus-style histograms, for example, attempt to capture the distribution using a set of counters. By defining fixed buckets (e.g., one for response times in the range <code>[0s, 1s)</code>, one for <code>[1s, 4s)</code>, and so on) and associating a counter with each, we can at least estimate percentiles broadly. However, the key problem here is that we have to know the distribution of our data up front to properly define these buckets.</p><p>To address this, the OpenTelemetry community has come up with a better solution: exponential histograms. Exponential histograms assign collected values to buckets, just like classic Prometheus-style histograms. The key differentiator is that these buckets vary dynamically based on the collected values. The name &quot;exponential&quot; comes from the fact that the bucket sizes increase exponentially: we use small buckets for small values and wider buckets for larger values. You can find an excellent introduction in the <a href="https://opentelemetry.io/blog/2022/exponential-histograms/">OpenTelemetry blog post on exponential histograms</a>.</p><p>Note that in addition to classic histograms, Prometheus also added <a href="https://prometheus.io/docs/specs/native_histograms/">native histograms</a>, which directly map to OTel <a href="https://prometheus.io/docs/specs/native_histograms/#opentelemetry-interoperability">exponential histograms</a>. 
Native histograms have their own <a href="https://prometheus.io/docs/specs/native_histograms/#promql">PromQL syntax</a>. We are actively working on adding support for that syntax to the <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-supports-promql">Elasticsearch PromQL implementation</a>, so that you can directly query exponential histograms using PromQL.</p><h2>Demo setup</h2><p>Let&#x27;s start by collecting some histogram metrics to show how they can be stored and analyzed in Elasticsearch using ES|QL.</p><p>We&#x27;ll focus on a JVM metric: garbage collection durations. OpenTelemetry defines the <a href="https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/#metric-jvmgcduration">jvm.gc.duration</a> metric, which is histogram-typed. The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java agent</a> natively supports collecting this metric.</p><p>We&#x27;ll spin up a JVM running a <a href="https://renaissance.dev/">Renaissance benchmark</a> to put it under stress. We&#x27;ll start that JVM with the vanilla OpenTelemetry Java agent attached and have it send the metrics directly to Elasticsearch.</p><p>You can find the ready-to-run Docker Compose file <a href="https://github.com/JonasKunz/es-histogram-demo">here</a>. You&#x27;ll just need to insert your <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/tsds-ingest-otlp">Elasticsearch OTLP/HTTP endpoint</a> and API key in the <code>docker-compose.yml</code>:</p><pre><code>OTEL_EXPORTER_OTLP_ENDPOINT: https://<elasticsearch url>/_otlp
OTEL_EXPORTER_OTLP_HEADERS: "Authorization=ApiKey <base64 API key>"</code></pre><p>Note that you don&#x27;t have to use this demo setup. We even encourage you to try it with your own application. Here are the other important OpenTelemetry agent settings the demo already sets, which you should include too if you&#x27;re bringing your own app:</p><pre><code>OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE: delta
OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION: BASE2_EXPONENTIAL_BUCKET_HISTOGRAM
OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_ENABLED: "true"</code></pre><p>Let&#x27;s step through them:</p><ul><li><em>Temporality preference</em>: OpenTelemetry supports both cumulative and delta-based histograms. Cumulative means that the histogram is only cleared after an application restart, while delta clears it after each export. At the time of writing, Elasticsearch only supports delta temporality for histograms. We are actively working on supporting cumulative histograms as well.</li><li><em>Default Histogram Aggregation</em>: By default, OpenTelemetry exports histograms in the Prometheus-style fixed bucket format. Since we want to reap the benefits of exponential histograms, we tell the agent to use them instead.</li><li><em>Runtime Telemetry enabled</em>: This tells the agent to actually collect the detailed JVM metrics, which include <code>jvm.gc.duration</code>.</li></ul><p>Now we are ready to go! We&#x27;ll let the application run in the background and switch over to Kibana to analyze the GC metric.</p><h2>Querying with ES|QL</h2><p>Now let&#x27;s open up Kibana and navigate to &quot;Discover&quot;. There we&#x27;ll switch to <a href="https://www.elastic.co/docs/explore-analyze/discover/try-esql">ES|QL mode</a> and start querying the collected data:</p><pre><code>TS metrics-* | STATS COUNT(jvm.gc.duration)</code></pre><p>In response, we see the metric panel shown below. If you don&#x27;t see any data, make sure to double-check the Kibana <a href="https://www.elastic.co/docs/explore-analyze/query-filter/filtering#set-time-filter">time range filter</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b863b2e272ac6584ac193661a6c4419abffdd243-729x190.png" style="width: 100%;" /><p>This number represents the total number of garbage collection operations that happened in our test application during the selected time frame.</p><p>Similarly, we can query the total time spent on those garbage collection operations:</p><pre><code>TS metrics-* | STATS SUM(jvm.gc.duration)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eda37c5fa244a42258bb452d18f5cbab3ff76eaf-717x190.png" style="width: 100%;" /><p>So we have roughly 270k garbage collections, which in total took 713 seconds. Given these two numbers, we can now compute the average if we are still fluent in primary school-level math. Even if not, you can just let ES|QL do that for you:</p><pre><code>TS metrics-* | STATS AVG(jvm.gc.duration)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fc5d601604b05378beeb4e6e94b613f95fe2fbc-712x188.png" style="width: 100%;" /><p>Now we know that the average garbage collection operation took about 3 milliseconds. However, Java experts might know that there are different kinds of garbage collections happening, which can have significantly different pause times. Fortunately, the OpenTelemetry metric comes with attributes, which allow us to slice the data accordingly:</p><pre><code>TS metrics-* | STATS AVG(jvm.gc.duration) BY jvm.gc.action</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/535d44cdb2ec3ed2ac9afe6b259d7b69c9167bbd-989x476.png" style="width: 100%;" /><p>As expected, major garbage collections take a lot more time per collection than minor ones, at least on average. So far, we have done nothing you couldn&#x27;t also achieve by just using counters. Let&#x27;s now use histograms to understand the actual distribution of the GC latency. 
We&#x27;ll look at the data over time (by grouping using <code>TBUCKET</code>) and focus on the major garbage collections:</p><pre><code>TS metrics-* 
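// keep only the major garbage collections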
| WHERE jvm.gc.action == "end of major GC"
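// MIN/MEDIAN/PERCENTILE/MAX are all estimated from the stored exponential histograms at query time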
| STATS MAX(jvm.gc.duration),
        PERCENTILE(jvm.gc.duration, 99),
        MEDIAN(jvm.gc.duration),
        MIN(jvm.gc.duration)
 BY TBUCKET(100)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ce62a4498d5a6e3fcc3bc85ea78f45a18f6e7576-1016x477.png" style="width: 100%;" /><p>The graph now shows us the minimum, maximum, median and 99th percentile for major garbage collections. Note that we aren&#x27;t bound to only querying the median and the 99th percentile. We can query any percentile we&#x27;d like to see, as these are estimated at query time from the raw exponential histograms.</p><h2>A note on backwards compatibility</h2><p>So far, we have seen how you can use the new shiny toy in Elasticsearch and ES|QL: exponential histograms. However, they have only just reached general availability (GA) in the 9.4 release — so what about your historical data?</p><p>Before exponential histograms were added, Elasticsearch was already capable of storing OpenTelemetry histograms in the <code>histogram</code> field type. To do so, we converted them to <a href="https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf">T-Digest</a>, a different data structure that this field type supports. T-Digest provides good accuracy for extreme percentiles (e.g., 99th percentile) at the cost of accuracy for percentiles in the middle of the distribution, such as the median. In contrast, exponential histograms provide a guaranteed upper bound on the relative error for every percentile. As conversions always introduce errors, we are happy to now have native support for exponential histograms, allowing you to collect and analyze your metrics end-to-end without unnecessary conversions.</p><p>So what should you do if you have historical data and want to keep querying it? Thanks to <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-multi-index#esql-multi-index-union-types">ES|QL union types</a>, the answer is actually easy: you just have to add a <code>::exponential_histogram</code> cast to the histogram metrics in your queries:</p><pre><code>TS metrics-* | STATS AVG(jvm.gc.duration::exponential_histogram)</code></pre><p>When this query encounters <code>histogram</code> fields, it will attempt to convert them to exponential histograms. When operating on <code>exponential_histogram</code> fields, the <code>::exponential_histogram</code> cast has no effect. Note that this also works with mixed data sets: if your backing indices use both types, the query will just do the right thing.</p><p>So if you are building queries or dashboards that you expect to run on pre-9.4 ingested data, we recommend that you simply add <code>::exponential_histogram</code> casts.</p><h2>Wrapping up</h2><p>Native support for OpenTelemetry exponential histograms in Elasticsearch gives you better metric fidelity and more flexible analysis in ES|QL. In this blog post, we have shown how to easily ingest histogram metrics, how to analyze them with various ES|QL aggregations, and what difference exponential histograms make.</p><p><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/exponential-histogram">Exponential histograms</a> are <strong>generally available</strong> in Elasticsearch basic starting with the 9.4.0 release. They will be available in Elastic Cloud <a href="https://www.elastic.co/cloud/serverless">Serverless</a> a few weeks after the 9.4.0 release, once <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP</a> (the managed observability OTLP intake) switches to use the Elasticsearch OTLP endpoint. 
We&#x27;ll update this blog post and add a note to the Elastic Cloud Serverless release notes when that happens.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What’s new in Kibana Dashboards: flexible controls, improved defaults and a visual refresh]]></title>
            <description><![CDATA[Kibana 9.4 ships a batch of dashboard improvements — from better defaults to layout controls to a visual refresh. Here's what changed and why.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-dashboards-improvements</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-dashboards-improvements</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Teresa Alvarez Soler]]></dc:creator>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fea993e3578d858b5c4f95244df052b1caeb8409-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Building a time series in Kibana Dashboards used to take 14 clicks. In 9.4, it takes three. Beyond the headline features in this release — a <a href="https://docs.google.com/document/d/1wy3MXxm9bKD4ppP1Uan5VHmIygTqJk6AJQAqkqzDWlY/edit?usp=sharing">brand new API</a> and <a href="https://docs.google.com/document/d/1VzE8S0zQVQXXYUBiV3t-tAa80dFfWuoXY-Nzd2XLmT4/edit?usp=sharing">natural-language dashboard creation with the AI Chat</a> — we are also shipping a collection of focused changes: improved defaults, freely placed controls, a cleaner visual design. Together, they add up to dashboards that are faster to set up and easier to scan.</p><h2>Place controls anywhere in the dashboard</h2><p>Most dashboarding tools lock filters into a fixed bar at the top of the page. Kibana 9.4 breaks that constraint: <strong>controls are now regular panels</strong> you can drag wherever you need them. <strong>Pin them</strong> to the top so they stay visible while you scroll, or place them inside a <strong>collapsible section</strong> where they automatically <strong>scope their filters to that section only</strong> — so a single dashboard can serve multiple use cases without one filter resetting all the panels.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f0dc157853f6ca8ef3c66c163b80407cf91acf75-1672x814.gif" style="width: 100%;" /><h2>Fewer clicks to a readable chart</h2><p>Building a time series in Kibana Dashboards used to take 14 clicks. Now, it takes three. The new defaults give you a <strong>line chart</strong> with a <strong>bottom legend</strong> and cleaner <strong>x-axis labels</strong> — the way you&#x27;d configure them by hand. A right-hand legend still makes sense past ten series, but for fewer, the bottom list avoids the empty margin a side legend leaves behind. For top-value breakdowns, the default jumps <strong>from three series to nine</strong> — matching how many distinct colors the eye can parse at once — and the <strong>colors</strong> themselves are darker and more separated, so series stand out at a glance.</p><p><em>Before (bar chart by default, legend to the right, redundant x-axis label):</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b4d0b0b181ed4007918762a549b0a95dc9abe5c7-876x340.png" style="width: 100%;" /><p><em>Now (line chart by default, list legend at the bottom, higher-contrast color palette for line readability):</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de06a5c7879d4df8f4095e489f98d9ecc24b23b4-867x340.png" style="width: 100%;" /><h2>A calmer look for dense dashboards</h2><p><strong>Dense dashboards</strong> in most tools still look like a wall of boxes. We redesigned Kibana&#x27;s dashboard visuals to feel <strong>flatter</strong> and <strong>more cohesive</strong> — more like a single page than a grid of disconnected panels. Some of these changes are highlighted in the picture below, compared to what the same dashboard looks like in 9.3.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/36c04a1a68aed706ca9eba38aa073e6dacef7cb6-1448x1556.png" style="width: 100%;" /><h2>Other improvements</h2><h3>Collapse filter pills</h3><p>Another common filtering approach is using <strong>filter pills</strong> at the top of the dashboard. 
When you stack many pills, they used to consume <strong>vertical space</strong>; they now <strong>scroll</strong> automatically after a threshold, and you can <strong>collapse</strong> the row to reclaim the space.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3d92a22759326688bce759c26eb4cc81efc28c80-1428x660.gif" style="width: 100%;" /><h3>Closing the gap for ES|QL panels</h3><p>Elasticsearch Query Language (<strong>ES|QL</strong>) panels are closer to parity with the rest of the dashboard: <strong>drilldowns</strong> between dashboards work for ES|QL visualizations, and <strong>click-to-filter</strong> from the chart keeps improving. Upcoming releases aim to close remaining gaps, such as annotations and saving ES|QL visualizations to the library.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/319a69d3174d5b6bdaa6279304471fcc359e98a3-1501x789.png" style="width: 100%;" /><h3>Panel size visibility</h3><p>Before 9.4, it was hard to match panel height and width because there was no readout for panel size. <strong>Panel sizing</strong> is now shown while you resize, so you can align dimensions across panels for a more harmonious layout.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4cb6ae2a09c451b1bc0c4101e6bd613b6c298888-1726x856.gif" style="width: 100%;" /><h3>Easier section reordering</h3><p><strong>Collapsible sections</strong> picked up usability fixes, too: You can reorder sections while they’re expanded, and you can drag a section from its header without hunting for the small drag handle.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db6b114fd84243d64c12746d90c4e337b62038d5-1724x856.gif" style="width: 100%;" /><h3>ES|QL support for Vega visualizations</h3><p><a href="https://www.elastic.co/search-labs/blog/kibana-vega-building-advanced-visualizations">Vega</a> lets you build custom visualizations beyond what the built-in chart types offer — radar charts, chord diagrams, or anything the grammar supports. The hardest part has always been the data query: writing verbose Elasticsearch Query DSL with nested aggregations and format paths just to feed data into the chart. <a href="https://www.elastic.co/docs/explore-analyze/visualize/custom-visualizations-with-vega#vega-esql-queries">Vega now accepts ES|QL</a> as a data source, so you can get the data with a single readable query. Less time wiring up the data, more time on the visualization itself.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/63d1c19f61f7dead135fe9cad3fd8411590fd037-1606x1033.png" style="width: 100%;" /><h2>What’s next</h2><p>This is just the start. Upcoming releases will keep pushing on smarter defaults and closing the gap between ES|QL and data-view visualizations so the two feel interchangeable.</p><p>Got a pain point or a feature request? Hit <strong>Submit feedback</strong> in the top menu — we&#x27;re listening.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/60c0861ade03a764ddb4d3c4a86af31a7dea02ed-1490x634.png" style="width: 100%;" /><h2>How to try it</h2><p>If you use <strong>Elastic Cloud Serverless</strong>, you may already have these changes. Otherwise, <strong>upgrade to 9.4</strong> and then <strong>create a dashboard</strong> or <strong>open an existing one</strong>. 
Many updates apply <strong>automatically to new visualizations</strong>, while <strong>layout</strong> and <strong>panel</strong> options appear in <strong>edit</strong> mode. If you aren’t on <strong>Elastic Cloud</strong> yet, <a href="https://cloud.elastic.co/registration"><strong>start a trial</strong></a> and explore the latest <strong>Kibana</strong> Dashboards there.</p><p><em>The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[30x faster than Prometheus: How we rebuilt Elasticsearch as a leading columnar metrics datastore]]></title>
            <description><![CDATA[Elasticsearch now stores OTel metrics at 3.75 bytes per data point and queries them up to 30x faster than Prometheus. Here's how we rebuilt TSDS and ES|QL.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-columnar-metrics-engine-30x-faster-prometheus</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-columnar-metrics-engine-30x-faster-prometheus</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kostas Krikellas, Martijn Van Groningen, Nhat Nguyen, Felix Barnsteiner]]></dc:creator>
            <pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9c0a835962fc8db27d382b654db787045fb3e4da-1376x768.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch now stores OTel metrics at <strong>3.75 bytes per data point</strong> — down from 25 bytes a year ago — and queries them up to <strong>30x</strong> faster and with up to <strong>2.5x</strong> better storage efficiency, compared to <strong>Prometheus</strong>, <strong>Mimir</strong> and <strong>ClickHouse</strong>. These gains came from rebuilding <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">TSDS</a> storage and the ES|QL compute engine into a <strong>fully columnar metrics engine</strong>, with native OTel ingestion added as part of the effort — all while keeping Elasticsearch&#x27;s ability to store and query logs, traces, and any other data alongside metrics.</p><p>Elasticsearch has supported storing metrics in time-series data streams (<a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">TSDS</a>) since <strong>version 8.7</strong>. This offering mainly focused on storage gains as explained in an earlier <a href="https://www.elastic.co/search-labs/blog/time-series-data-elasticsearch-storage-wins">blog post</a>. Still, performance was not on par with specialized systems for storing and querying metrics, in terms of storage efficiency, indexing throughput and query latency.</p><p>In the past year, we revisited the storage layer, optimized ingestion for OTel metrics and extended the ES|QL compute engine with vectorized processing for time series data. These efforts led to substantial performance wins across the board, compared to earlier versions of TSDS:</p><ol><li>Up to <strong>6.6x</strong> improvement in storage efficiency, reaching 3.75 bytes per data point in OTel metrics</li><li>Up to <strong>50%</strong> improvement in indexing throughput for OTel data</li><li>Up to <strong>160x</strong> improvement in query latency, including blazing fast counter rate evaluation and window support in time series aggregations</li></ol><p>Elasticsearch has thus become a <strong>leading columnar metrics engine</strong>, matching or exceeding the competition (like <strong>Prometheus</strong>, <strong>Mimir</strong>, and <strong>ClickHouse</strong>) in indexing throughput and exceeding it by up to <strong>2.5x</strong> in storage efficiency and <strong>30x</strong> in query performance. All while maintaining the ability to store logs and other data and fully use the rich querying capabilities of ES|QL (e.g. <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/inlinestats-by">inline stats</a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join">lookup join</a>) — capabilities that PromQL-based systems lack.</p>
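<p>As a quick sketch of what that combination unlocks — assuming a hypothetical lookup index <code>host_owners</code> keyed on <code>host.name</code> — a time series aggregation can be enriched with metadata in the same query:</p><pre><code>TS metrics-*
| STATS AVG(RATE(system.cpu.time)) BY host.name, TBUCKET(1h)
| LOOKUP JOIN host_owners ON host.name</code></pre>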
<p>Elasticsearch can thus serve as a unified storage and query engine for all user data, with no compromises for metrics and observability applications.</p><h2>How TSDS is organized</h2><p>TSDS has the following properties that help improve the performance of time-series codecs and produce correct results when aggregating data points per time series:</p><ul><li>The <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-metric">metric</a> name and the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-dimension">dimension</a> names and values are used to calculate the <code>_tsid</code>, a unique identifier per time series.</li><li>TSDS are sorted in <code>[_tsid ascending, timestamp descending]</code> order. Each time series is thus stored in sequence on disk, with newer data points appearing first. Since the <code>_tsid</code> is calculated over dimension values, the latter are also clustered on disk.</li><li>Shard routing is based on <code>_tsid</code>, with each <code>_tsid</code> value appearing in one shard only.</li><li>Backing indices are <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-bound-tsds">time-bound</a>, with no overlap over time between them.</li></ul><p>The rest of this post explains how we use these properties to improve storage, indexing, and query performance.</p><h2>Storage optimizations</h2><p>TSDS <a href="https://www.elastic.co/search-labs/blog/time-series-data-elasticsearch-storage-wins">already</a> achieved a very competitive storage footprint, reaching <strong>0.9 bytes per data point</strong>, when many metrics sharing the same dimension values can be combined in a single doc. However, when most data points have a unique set of dimensions (which is typical for OTel or Prometheus metrics), docs end up containing a single data point. In this setup, storage required 25 bytes per data point, with dedicated metrics stores requiring less than 10 bytes per data point.</p><p>To further reduce the storage footprint, we applied a series of optimizations over the past year:</p><h3>Replace inverted indices and BKD trees with doc value skippers</h3><p>Elasticsearch creates inverted indices (for text values) or BKD trees (for numeric values) by default for all non-metric fields, i.e. for <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams#backing-indices">@timestamp</a> and dimensions. These indices improve performance for queries including filters on these fields, but have a significant impact on storage — effectively doubling the footprint for each field. Moreover, they are also processed during <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/merge">segment merging</a>, increasing the CPU, memory and storage overhead and slowing down the system — especially in high ingest throughput scenarios, as is often the case with metrics.</p><p>Lucene has been extended with <a href="https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/index/DocValuesSkipper.html">doc value skippers</a>, a form of hierarchical sparse indices that store the minimum and maximum value of blocks of documents. Range queries can check these min and max values and ignore blocks that don&#x27;t fall into the requested range. Skippers work particularly well on sorted fields. Since TSDS are sorted by <code>[_tsid, timestamp desc]</code>, dimension values also get clustered on disk. 
It&#x27;s therefore possible to replace indices on <code>@timestamp</code> and dimension fields with doc value skippers that <strong>amplify the columnar layout</strong> — each field stored in its own files, with no duplicate tracking of each doc for indexing purposes.</p><p>Doc value skippers have negligible storage overhead — replacing indices with them led to a reduction of <strong>10 bytes</strong> out of the initial 25 bytes per data point in OTel. Moreover, they work very well in practice when queries include filters on time ranges or dimension values (including prefixes and regex) — there was no noticeable regression in query performance in our benchmarks when they replaced separate indices. Doc value skippers have been enabled for TSDS by default since <strong>version 9.3</strong>.</p><h3>Enable synthetic IDs</h3><p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-id-field"><code>_id</code></a> metadata field was another big contributor to the storage footprint. TSDS had already been extended to trim the doc values once they were no longer needed for replication, but the inverted index was kept around to efficiently support the id-based APIs (<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-get">Get</a>, <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-delete">Delete</a>, <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-update">Update</a>).</p><p>The ID value for TSDS is synthesized by combining the <code>_tsid</code> and <code>@timestamp</code> values that uniquely identify each data point. Since these fields are configured with doc value skippers, it&#x27;s possible to replace the inverted index on <code>_id</code> with (a) reconstructing the <code>_tsid</code> and <code>@timestamp</code> values from the <code>_id</code> value, and (b) checking for matches using the corresponding doc value skippers. Care has to be taken to avoid expensive checks for duplicate IDs during metric ingestion, with segment-level Bloom filters keeping the overhead at bay.</p><p>Supporting synthetic IDs in metrics is a first for Elasticsearch. It led to a reduction of <strong>5 bytes</strong> out of the initial 25 bytes per data point for OTel metrics, with no loss of functionality. Synthetic IDs are enabled for TSDS by default in <strong>version 9.4</strong>. We plan to extend their uses in logs and other applications after further evaluation.</p><h3>Trim sequence numbers</h3><p>Sequence numbers are used as part of replication, but also to provide strong consistency semantics on doc modification operations through <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/optimistic-concurrency-control">Optimistic Concurrency Control</a> (OCC). While such semantics are applicable to certain scenarios, they don&#x27;t fit metrics, where concurrent updates are very rare and there is no practical need to guard against concurrent operations on data points with matching IDs. We therefore decided to <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-disable-sequence-numbers">disable the use of sequence numbers</a> and OCC support in all TSDS APIs in <strong>version 9.4</strong>. This leads to a substantial storage reduction of <strong>4 bytes</strong> out of the initial 25 bytes per data point for OTel data, as there&#x27;s no inverted index and sequence numbers get trimmed once no longer needed for replication. 
<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/update-by-query-api">Update</a> and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-delete-by-query">delete</a> by query operations are still supported, albeit with weaker consistency semantics.</p><p>If OCC is still deemed important for a particular metrics application, the old behavior can be restored by setting <code>index.disable_sequence_numbers: false</code> in the index template of the involved TSDS.</p><h3>Use large numeric codec blocks</h3><p>TSDS already uses an advanced codec, as explained in an earlier <a href="https://www.elastic.co/search-labs/blog/time-series-data-elasticsearch-storage-wins#specialized-codecs">article</a>. The codec works very well in most cases, but has poor performance in case of repeated sequences of keywords and numbers, leading to an inflated storage footprint for dimensions containing IP and MAC addresses. We identified that the existing logic for identifying repeated sequences requires larger codec blocks to work well, especially as the sequence length increases. After experimentation, the numeric block size was increased from 128 to 512 elements in <strong>version 9.3</strong>, leading to a reduction of <strong>2 bytes</strong> out of the initial 25 bytes per data point for an OTel dataset containing IP and MAC addresses as dimensions. We&#x27;re also working on a more configurable codec layout that will allow more flexibility with block sizes and other parameters, based on field type and cardinality.</p><h2>Indexing throughput</h2><p>Elasticsearch has support for bulk ingestion of documents. This entrypoint has long been optimized for leniency, ensuring that all docs get accepted. This flexibility, however, incurs additional processing cost during indexing. Metric applications proved good candidates for using different approaches to reduce this overhead, as explained below.</p><h3>Introduce OTLP protobuf entrypoint</h3><p>OTel metrics and Prometheus have established protocols for metrics ingestion, using protocol buffers. In the past, a translation step was required to convert collected protobuf messages to bulk requests that Elasticsearch can consume.</p><p>Elasticsearch was recently extended with endpoints accepting messages from OTel metrics collectors and over Prometheus remote write. Parsing and processing these (binary) messages is cheaper, compared to json parsing, while hash operation over dimensions for <code>_tsid</code> calculations get reused and amortized across more data points within a single protobuf message. Furthermore, <code>_tsid</code>s get evaluated once per doc in the coordinator nodes and propagated to data nodes for indexing, thus deduplicating an expensive step per indexed doc. These improvements led to up to a 20% speedup in indexing throughput for OTel metrics. The OTLP entrypoint was added in version 9.2 (tech preview) and reached GA in <strong>version 9.3</strong>. We&#x27;ve added similar entrypoints for <a href="https://www.elastic.co/observability-labs/blog/prometheus-remote-write-elasticsearch">Prometheus remote write</a> in <strong>version 9.4</strong> (tech preview) and are actively working to cover OTel Logs and Traces.</p><h3>Reduce indexing CPU with doc value skippers</h3><p>In addition to a substantial storage footprint, inverted indices require a lot of cpu to build and reconstruct during segment merging. 
The use of doc value skippers in their place also helps reduce CPU load at ingestion and thus improves indexing throughput by 10%, a welcome bonus on top of the aforementioned storage wins.</p><h3>Synthetic recovery source</h3><p>The original <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-source-field">source</a> of a document, as provided at index time, is never stored for metrics. Still, Elasticsearch needed to temporarily store it for replication purposes. That changed in <strong>version 9.1</strong>, where the source gets synthesized on demand instead. This is known as synthetic recovery source and reduces disk I/O by 50%, with a significant impact on metrics indexing performance. Check out this <a href="https://www.elastic.co/search-labs/blog/elastic-logsdb-tsds-enhancements">article</a> for more details.</p><h2>Query execution</h2><p>Replacing inverted indices with doc value skippers leads to a pure columnar storage layout for TSDS, with metric and dimension fields stored as Lucene doc values, each field encoded and compressed in its own file. Combined with the introduction of the <a href="https://www.elastic.co/blog/elasticsearch-query-language-esql#dedicated-query-engine">ES|QL compute engine</a> that uses vectorized execution internally, it became possible to introduce a fully columnar storage and query processing engine for metrics in Elasticsearch. We pushed this idea to the extreme and implemented a <strong>columnar metrics processing engine</strong> that comfortably outperforms dedicated metrics engines and other columnar stores in query performance.</p><h3>Time series integration in compute engine</h3><p>Time series processing is largely based on applying aggregation functions per time series (or <code>_tsid</code>), such as a <a href="https://opentelemetry.io/docs/specs/otel/metrics/data-model/#gauge">gauge</a> average or a <a href="https://opentelemetry.io/docs/specs/otel/metrics/data-model/#sums">counter</a> rate. These partial results are then reduced by a secondary function to produce results for the grouping dimensions, e.g. per host and process. Observability dashboards are built on top of this execution model, providing summary views of how metrics evolve over time and allowing for quick deep-dives by filtering on dimension values and time ranges.</p><p>To support this execution model, we introduced the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts#description">TS source command</a>, providing a simple yet powerful syntax for executing such queries that combine an inner aggregation function per time series with an outer aggregation over the partial results of the former. For instance, the following query calculates the hourly sum of the rate of search requests per host over the last day:</p><pre><code>TS metrics
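  // TRANGE(1d) keeps only the last day of data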
  | WHERE TRANGE(1d)
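  // inner RATE runs per time series; outer SUM combines the partial results per host and hour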
  | STATS SUM(RATE(search_requests)) BY TBUCKET(1h), host</code></pre><p>To execute this query, the compute engine is aware of how data is stored and applies the inner aggregation function per <code>_tsid</code> value. Since data are sorted by <code>_tsid</code>, time series aggregation functions process metric values as they get fetched, until the <code>_tsid</code> changes or the timestamp belongs to the next time bucket. This leads to vectorized execution of these functions over the fetched columns of metric values, while dimension values are only fetched (once) when the <code>_tsid</code> changes. The evaluation of the secondary aggregation function is also efficient, with partial aggregates stored in arrays of primitive values that get populated when <code>_tsid</code> values change.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8a6987ff92c80ac813167c9aa655d2ec4dd1bd50-1024x559.png" style="width: 100%;" /><p>The compute engine has inherent support for parallel query evaluation, taking full advantage of the available processing cores. Time series aggregations fully use this feature and process data points in parallel as applicable, reducing response times through improved CPU utilization.</p><p>Time series processing in ES|QL was introduced in version 9.2 as tech preview and reaches GA in <strong>version 9.4</strong>. We expect all metrics applications to adopt it and benefit from the much-improved query performance.</p><h3>Zero-copy data decoding and loading</h3><p>Vectorized processing of time series data delivered immediate performance wins (<strong>8x</strong> for some queries), compared to aggregations through the <code>/_search</code> API, but the performance was still inferior compared to competing metrics stores. Benchmarking and profiling showed that there were too many array copies within the compute engine, between data decoding and evaluation of aggregation functions. To that end, the following optimizations were introduced:</p><ul><li>The codec for TSDS was extended to decode on-disk data directly into primitive arrays inside blocks that the compute engine uses to evaluate time series aggregations. No additional copies required, as the compute engine can bulk-read these blocks and process their arrays, one column at a time.</li><li>Blocks containing the same value repeated N times are represented as constant blocks storing just the value and the count — a form of in-memory run-length encoding — as opposed to an array of length N. Filtering and aggregation operations were extended to efficiently consume these blocks. This reduced memory pressure and CPU overhead for the <code>_tsid</code> and dimension fields, as their values get clustered due to index sorting.</li><li>Documents with null values for the aggregated metric fields are filtered out at the Lucene level, before they get decoded and copied into blocks.</li><li>All filters and regular expressions on the timestamp and dimension fields get pushed down to Lucene, which uses doc value skippers to efficiently filter out non-matching docs.</li></ul><p>Combined, these optimizations led to query execution speedups exceeding <strong>10x</strong> (totaling 80x when combined with the 8x speedup from vectorized execution). 
They shipped with the introduction of the TS source command in <strong>version 9.2</strong> and have been fine-tuned ever since.</p><h3>Optimized counter rate evaluation</h3><p>While most time series aggregations can be trivially parallelized and evaluated, rate evaluation of cumulative counters is tricky as it requires processing all data points in order to detect counter resets (e.g. when a host restarts). To address this, the compute engine uses the <code>_tsid</code> prefix to shard time series across threads. Care has been taken to assign in-order ranges of <code>_tsid</code> values to each thread, as opposed to hash-partitioning <code>_tsid</code>s, so that each thread can scan on-disk data in order, still making use of efficient decoding and zero-copying into blocks. The performance wins are impressive, with rate evaluation performance far exceeding dedicated metrics stores as we shall see in the next section.</p><p>Another interesting problem for cumulative counters is how to properly calculate counter increases for the entire time bucket when there are no data points at the bucket boundary timestamps. Metrics systems often use extrapolation, extending the first and last data points of each time bucket to the boundaries, or calculate the delta between the last data point of adjacent buckets. We posit we can do better, by interpolating between the last data point of each bucket and the first of the next, to get an estimate of the value on each boundary. The delta is then calculated over the interpolated values of the lower and upper boundary of each time bucket. For example, if a counter reports the value 100 one second before a bucket boundary and 120 one second after it, the boundary value is interpolated to 110.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b09313b3359360dc3d6117c436dff05ca6de3653-1360x1520.png" style="width: 100%;" /><h3>Sliding window support</h3><p>Elasticsearch has long supported aggregations bucketed by time, but it was not possible to extend the window of processed data beyond each time bucket. Using windows larger than the time bucket, e.g. a window of 5 minutes for per-minute bucketing, helps smooth out spikes and observe the underlying trend per time series with reduced noise:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66c96d577a9a76f7d5a0cc7df3d121f38fe018bd-1999x1014.png" style="width: 100%;" /><p>All <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">time-series aggregation functions</a> have been extended with window support, as an optional argument. If the window is a multiple of the time bucket (e.g. a 1h window with <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/grouping-functions/tbucket">TBUCKET</a><code>(5m)</code>), the compute engine first aggregates data points over intervals matching the time bucket span, and then combines these partial results per window span. This 2-phase approach eliminates repeated scans of data points and makes optimal reuse of intermediate results, improving response times. Window support was introduced as tech preview in version 9.3 and reaches GA in <strong>version 9.4</strong>.</p>
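<p>As a concrete sketch, reusing the optional window argument that also appears in the benchmark queries below, a 1-hour window evaluated over 5-minute buckets looks like this:</p><pre><code>TS metrics-*
| STATS AVG(AVG_OVER_TIME(system.cpu.load_average.1m, 1h)) BY host.name, TBUCKET(5m)</code></pre>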
<h3>Efficient datetime rounding</h3><p>Queries on time-series data commonly include time bucketing. While data points can be trivially assigned to sub-hour time buckets, larger buckets run into issues like time zones, daylight saving time, and variable days per month. Elasticsearch has elaborate logic for datetime rounding that takes these peculiarities into account, but it carries a relatively high CPU cost when processing time series data.</p><p>To mitigate this, the compute engine has been extended to identify cases where simpler logic can be employed to assign data points to time buckets. For instance, it can identify when the buckets are sub-hour or when time zones and daylight saving time don&#x27;t affect a particular query, and switches to simple modulo operations for datetime rounding (e.g., <code>bucket_start = timestamp - (timestamp % bucket_span)</code>). This led to a further <strong>30%</strong> improvement in response times for certain queries. This change was introduced in <strong>version 9.4</strong>.</p><h2>Performance evaluation</h2><p>To evaluate the performance of our offering and track how it evolves and improves over time, we focused on OTel metrics since (a) OpenTelemetry is the industry standard for collecting metrics, with universal adoption by all cloud providers and (b) they lead to a storage layout with 1 metric per doc, a setup that traditionally hurt performance for Elasticsearch.</p><p>We rely on <a href="https://github.com/elastic/metricsgenreceiver">Metricsgenreceiver</a> to generate datasets. This tool is inspired by <a href="https://github.com/timescale/TSBS">TSBS</a> and produces data simulating the data points collected by the OTel <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver#readme">hostmetricsreceiver</a>. We used two datasets:</p><ol><li>A low-cardinality setup, with 100 hosts sending metrics every 1s, containing 14k time series in total</li><li>A high-cardinality setup, with 10k hosts sending metrics every 10s, containing 1.4M time series in total</li></ol><p>We benchmarked on single-node deployments on EC2, using <a href="https://aws.amazon.com/ec2/instance-types/c6i/">c6i.4xlarge</a> and <a href="https://aws.amazon.com/ec2/instance-types/c8g/">c8g.8xlarge</a> machines for the low- and high-cardinality datasets respectively.</p><p>For competitive comparison, we used Prometheus (v3.11.1), Mimir (v3.0.6) and ClickHouse (v26.3.9.8-lts). Prometheus and Mimir have proper time series processing, e.g. for counter rate, whereas ClickHouse <a href="https://clickhouse.com/docs/use-cases/time-series/analysis-functions">lacks such support</a> and only provides approximate values at best (for instance, it can&#x27;t track counter resets consistently). We still report response times for ClickHouse to showcase that, with Elasticsearch now optimized for columnar query processing, it can exceed competing columnar engines even when they don&#x27;t process the data per time series as expected.</p><p>We strived to use the default configuration for every system (including Elasticsearch), without tweaking them to optimize performance for the particular workload. This reflects the user experience when systems are deployed by new users, without much experience (or time) to tune them before receiving metrics traffic and setting up dashboards. We focused on single-node runs to keep noise low and accommodate all systems (Prometheus doesn&#x27;t offer a multi-node setup out of the box). Elasticsearch performance demonstrably scales well with the number of nodes; we plan to share scalability results in a future post.</p><h3>Storage efficiency and indexing throughput</h3><p>Our efforts to improve storage efficiency paid big dividends: the storage footprint for OTel metrics dropped <strong>from 25 to 3.75</strong> bytes per data point in a year. 
Such an improvement, on top of an offering already optimized for time series, is really impressive and very rare in the industry.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8b3d185e252f427b4753aa93f173f1fe0930acbf-1999x364.png" style="width: 100%;" /><p>The competitive picture looks favorable at this point, with Elasticsearch:</p><ul><li>Slightly outperforming Mimir in storage efficiency and indexing throughput</li><li>Outperforming Prometheus by 2.5x in storage efficiency and by a small margin in indexing throughput</li><li>Outperforming ClickHouse by 2x in storage efficiency and by 40% in indexing throughput</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a4075e617142de780400f0c318b9d1fa99e4e7a8-1187x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/068fb9a3c1c0d8b23b3d34ae2f8ffe968dd71d30-1221x851.png" style="width: 100%;" /><h3>Query performance</h3><p>The novel columnar engine for metrics processing proves very efficient in practice. We used a mix of queries based on gauge averages and counter rates — the most common operations, which require different optimization approaches. The queried interval was 4 hours of data, covering all time series per metric.</p><p>ClickHouse doesn&#x27;t support time series aggregations, so the results have limited value and are not directly comparable to Prometheus or Mimir, which natively support time series processing. We used the published <a href="https://clickhouse.com/docs/use-cases/time-series/analysis-functions">guidelines</a> to adjust each query to get similar results to the extent possible. The point is to show how our columnar engine compares to generic columnar stores.</p><p>Here is a summary of the results:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Mimir</td><td style="border: 1px solid #ccc; padding: 4px;">Prometheus</td><td style="border: 1px solid #ccc; padding: 4px;">ClickHouse †</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Gauge average</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 8x faster</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Counter rate</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 30x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 3.5x faster</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Prefix filter on host name</td><td style="border: 1px solid #ccc; padding: 4px;">up to 5x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 5x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 3x faster</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Gauge average with window</td><td style="border: 1px solid #ccc; padding: 4px;">up to 25x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 25x faster</td><td style="border: 1px solid #ccc; padding: 4px;">up to 4x faster</td>
                    </tr>
              </tbody>
            </table>
          <p>†ClickHouse lacks native support for time series aggregations and counter reset detection.</p><h4>Gauge average</h4><p>We compared the performance of evaluating the per-host hourly average of the per-time-series average memory utilization, using the following queries:</p><pre><code># PromQL
avg by (host.name) (avg_over_time(system.memory.utilization[1h]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
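// inner AVG_OVER_TIME averages each time series within the hour; outer AVG combines those averages per host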
| STATS AVG(AVG_OVER_TIME(system.memory.utilization)) BY host.name, TBUCKET(1h)</code></pre><p>Elasticsearch comfortably outperforms the other systems by up to 30x, on both the low- and high-cardinality datasets.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0a78beb0bd1e5b37cb7001f384fa13d4a44dbcbf-1210x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cedb803077cc660b9b85b4cf0f123be8d4c932d7-1227x851.png" style="width: 100%;" /><h4>Counter rate</h4><p>We next compared the performance of evaluating the per-host hourly average of the CPU counter rate, using the following queries:</p><pre><code># PromQL
avg by (host.name) (rate(system.cpu.time[1h]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
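// RATE detects counter resets per time series before the per-host AVG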
| STATS AVG(RATE(system.cpu.time)) BY host.name, TBUCKET(1h)</code></pre><p>Despite processing data points per time series in order, counter rate performance matches that of the gauge average (even though the involved time series have 6.6x more docs than in the query above). Elasticsearch maintains its wide advantage compared to the other systems and outperforms Mimir and Prometheus by 30x in the low cardinality dataset and by 16x in the high cardinality one.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d31b3fe1c85854a9daa8814a318b9bf86007a823-1227x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/82735acdb0228686ca5d8b3da56da6fc6256eb35-1243x851.png" style="width: 100%;" /><p>It&#x27;s really impressive that, for the high cardinality dataset, Elasticsearch is able to process 4 hours of data for half a million time series in less than 2 seconds, while the other systems take more than 30 seconds, leading to unresponsive dashboards for such queries. ClickHouse is also slower, despite having no logic to detect counter resets and extrapolate/interpolate deltas across buckets.</p><h4>Prefix filter on host name</h4><p>We next compared the performance of filtering on host names based on their prefix, using the following queries:</p><pre><code># PromQL
avg by (host_name)
  (avg_over_time(system_cpu_load_average_1m{host_name=~"host-.*"}[5m]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
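// trailing wildcard makes LIKE a prefix filter on host.name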
| WHERE host.name LIKE "host-*"
| STATS AVG(AVG_OVER_TIME(system.cpu.load_average.1m)) BY host.name, TBUCKET(5m)</code></pre><p>Elasticsearch manages to maintain an advantage of up to 5x compared to the other systems, despite replacing the inverted index on <code>host.name</code> with a doc value skipper.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a28e59f33996df9cfee82aafcf4f30f9f51c045-1205x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0765ee99190d3f77635293e2258250530deb945-1226x851.png" style="width: 100%;" /><h4>Gauge average with window</h4><p>We compared the performance of time series aggregations with a window of 90 minutes and time buckets of 30 minutes, using the following queries:</p><pre><code># PromQL
# query API parameter: step=30m
avg by (host_name) (avg_over_time(system_memory_utilization[90m]))</code></pre><pre><code># ES|QL
TS metrics-hostmetricsreceiver.otel-default
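// 90m lookback window, evaluated per 30m time bucket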
| STATS AVG(AVG_OVER_TIME(system.memory.utilization, 90m))
    BY host.name, TBUCKET(30m)</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b1cb772391b5c16355df75b8be791957d1a0570a-1211x851.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/87b09774ac09d5fc38b8e534c7643e80a9f811b5-1235x851.png" style="width: 100%;" /><p>Elasticsearch maintains an advantage that reaches 25x for the low cardinality dataset and 8x for the high cardinality one. ClickHouse is outperformed by close to 4x, demonstrating the efficiency of our approach for windowed query operations.</p><h2>What&#x27;s next for Elasticsearch metrics</h2><p>Elasticsearch has been extended with metrics storage and processing capabilities that outperform Prometheus, Mimir, and ClickHouse. We&#x27;re making fast progress on supporting <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-supports-promql">PromQL</a> and <a href="https://www.elastic.co/observability-labs/blog/prometheus-remote-write-elasticsearch-architecture">Prometheus remote write</a>, both also available as tech preview in <strong>version 9.4</strong>. These extensions enable users familiar with Prometheus and related systems to switch their applications to Elasticsearch, with no need to migrate existing dashboards. Since the Prometheus integration reuses the storage and query engine presented in this article, the same performance wins are expected for Prometheus as well. Furthermore, collected metrics can be queried with PromQL and ES|QL, side-by-side or in ES|QL query pipelines, extending analytics capabilities far beyond what has so far been possible with Prometheus-based systems.</p><p>The improvements in storage efficiency, indexing throughput and query performance are already impressive, but we&#x27;re not done. We&#x27;ll be introducing more refinements to the codec for time series data, further reducing bytes per data point. Batch processing of ingested metrics will be further improved, reducing synchronization overhead and redundant processing layers that are not needed for well-formatted collected metrics. We&#x27;re also planning to make wider use of doc value skippers, storing pre-computed aggregates like sum and count per block of values, to shortcut data point loading and processing where applicable, as well as to use more CPU-friendly partitioning and grouping operations.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Multi-tier search with Elastic for ecommerce search governance: Fixing poor recall]]></title>
            <description><![CDATA[Learn how to build a multi-tier retrieval strategy used to execute ecommerce governed search plans and improve recall management. We'll cover how to orchestrate semantic matching while maintaining stable results, facets, and pagination.]]></description>
            <link>https://www.elastic.co/search-labs/blog/multi-tier-search-ecommerce-governance</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/multi-tier-search-ecommerce-governance</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5d64716981e76396b401fd069d0a635b6929ba94-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>A common issue in ecommerce search is poor recall. This occurs when a system lacks a governed fallback strategy. The solution is a multi-tier execution model. This post describes a multi-tier retrieval strategy used to execute governed search plans. It explains how to orchestrate strict, relaxed, and semantic matching while maintaining stable results, facets, and pagination.</p><h2><strong>From policy logic to retrieval architecture</strong></h2><p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance">Part 4</a> provided a technical deep dive into the governed control plane and its implementation using the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/percolator">Elasticsearch percolator</a>. Once the logic layer has identified which policies to apply, the system must address the retrieval strategy used to execute the search.</p><p>Managing the transition from precision to recall is a critical function of any ecommerce search engine. For example, a basic search implementation often defaults to broad keyword matching. If a shopper searches for &quot;organic Pink Lady apples&quot;, this can lead to irrelevant results, such as apple-scented dish soap, apple juice, or organic pink grapefruit, appearing at the top of the list simply because they share a common term. While these items are technically matches, they fail to satisfy the user&#x27;s intent and typically lead to high bounce rates. However, a &quot;No results&quot; page is equally detrimental to conversion. This conflict is resolved by implementing a <strong>three-tier execution model</strong>, which uses the governed control plane to orchestrate a principled fallback strategy.</p><h2><strong>The three-tier execution model</strong></h2><p>This architecture executes up to three retrieval tiers in a sequence, each with a specific matching logic.</p><h3><strong>Highest tier: Strict matching</strong></h3><p><em>Strict matching</em> is a lexical match that requires that <strong>all</strong> query terms appear in the product metadata.</p><ul><li><strong>The logic:</strong> A search for &quot;organic navel oranges&quot; returns only products containing all three terms.</li><li><strong>Application:</strong> This tier provides the highest precision. When a customer types a precise product name, such as &quot;organic navel oranges&quot;, they’re typically seeking that exact item rather than an alternative.</li></ul><h3><strong>Mid-tier: Relaxed matching</strong></h3><p>If the strict tier fails to return sufficient results, the system expands the search parameters.</p><ul><li><strong>The logic:</strong> This tier allows for a subset of terms to lexically match, using <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-minimum-should-match">Elasticsearch&#x27;s minimum_should_match</a> logic.</li><li><strong>Application:</strong> Relaxed matching maintains lexical grounding. A search for &quot;organic navel oranges&quot; might surface &quot;navel oranges&quot; (missing the &quot;organic&quot; term) or &quot;organic oranges&quot; (missing the &quot;navel&quot; term). 
These represent intuitive, keyword-based alternatives for the shopper.</li></ul><h3><strong>Lowest tier: Semantic matching</strong></h3><ul><li><strong>The logic:</strong> This tier uses vector/semantic embeddings (such as Elastic Learned Sparse EncodeR [ELSER], E5, or Jina) to retrieve conceptually related products, regardless of direct keyword overlap.</li><li><strong>Application:</strong> A search for &quot;organic navel oranges&quot; might surface &quot;mandarins&quot; or &quot;clementines&quot;. This serves as the final retrieval tier, intended to provide relevant options when literal keyword matches are unavailable.</li></ul><p>To see this multi-tier orchestration in action and how the Engine steps down from lexical to semantic matching, watch the video: <a href="https://youtu.be/k02NHvIAHsk?si=tJKwmc4ds3zjcRPF">Eliminating Zero-Result Pages: PRISM’s Multi-Tier Search Fallback</a>.</p><h2><strong>Tier orchestration: The &quot;bucket filling&quot; logic</strong></h2><p>While the governed control plane provides the logic and the queries for each tier, the application layer is responsible for the execution. The application executes these tiers sequentially and excludes lower tiers once the accumulated result count on the first page reaches or exceeds 10 items (or whatever number of results you want to display on the first page). This threshold ensures a full first page of results while prioritizing the most accurate retrieval method.</p><h3>Scenario 1: High-intent search (&quot;oranges&quot;)</h3><p>The first tier returns 15 hits. Since 15 is more than 10, the current result set is locked to only strict matches (which can be paged through) and subsequent tiers are not executed.</p><pre><code>Strict tier:   [##########]##### (15 found: exact matches)
Relaxed tier:  [          ]      (Tier bypassed)
Semantic tier: [          ]      (Tier bypassed)</code></pre><h3>Scenario 2: Specific but limited results (&quot;organic blood oranges&quot;)</h3><p>The strict tier finds only four items. Since this is less than 10, the system triggers the relaxed tier, which finds 12 more relevant products. The combined total (16) meets the threshold of 10, so the current result set is locked to the strict and relaxed tiers. Subsequent paging will only surface results from these two tiers (preventing lower-quality semantic hits from appearing on later pages).</p><pre><code>Strict tier:   [####      ]       (4 found)
Relaxed tier:  [    ######]###### (12 found)
Semantic tier: [          ]       (Tier bypassed)</code></pre><h3>Scenario 3: Abstract or intent-based search (&quot;high vitamin C snacks&quot;)</h3><p>Keyword matches are limited (only five hits between tiers 1 and 2). The system triggers the semantic tier to find conceptually relevant items, such as kiwis, guavas, or red peppers, to fill the result set. The result set for this query includes products from all tiers.</p><pre><code>Strict tier:   [##        ]             (2 found)
Relaxed tier:  [  ###     ]             (3 found)
Semantic tier: [     #####]######################...</code></pre><p>This orchestration optimizes for latency, as the computational cost of the semantic tier is only incurred when the keyword-based tiers are insufficient. Additionally, this allows fast-responding keyword results to be displayed while semantic results are integrated shortly after, maintaining a responsive user interface.</p><h2><strong>Determining intent via tier activation</strong></h2><p>The logic used to fill the first page serves a critical secondary purpose: It acts as a diagnostic for user intent. The application uses the logic returned by the governed control plane to determine which tiers remain active for the current result set and paging.</p><p>If the strict and relaxed tiers together yield fewer than 10 results, the query is likely exploratory or abstract. In this case, activating the semantic tier is a benefit. Because the query is diagnosed as exploratory, the system allows the shopper to page through the entire depth of the semantic results. This provides access to conceptually related alternatives that lexical matching would have missed, which is appropriate for an abstract search.</p><p>Conversely, if the strict tier returns a robust set of results (for example, 30 hits), it confirms that the system has found high-precision matches. The user can page through those 30 hits and will likely find what they’re looking for. In this scenario, there’s no need to provide additional, less relevant exploratory hits. By disabling lower tiers for these high-precision queries, we ensure that a shopper deep diving into specific results isn’t distracted by irrelevant semantic fallback as they paginate through the current result set.</p><h2><strong>Governance across tiers</strong></h2><p>A critical component of this architecture is that policies apply globally across all tiers. If a user has a &quot;vegan&quot; preference profile, the governed control plane injects that constraint into the strict, relaxed, and semantic queries. This ensures that even when the system uses semantic fallback to return &quot;mandarins&quot; for an orange search, the results remain compliant with the user&#x27;s broader dietary preferences or business constraints.</p><h2><strong>The problem of facet instability</strong></h2><p>A challenge with multi-tier search is maintaining consistent faceted navigation (sidebar filters). If a search for &quot;chocolate&quot; yields 12 strict results, the sidebar filters might show &quot;dark&quot; and &quot;milk&quot;. If a user selects &quot;dark&quot; and the result count drops, a naive system might trigger the semantic tier to fill the page, which could suddenly introduce &quot;red wine&quot; into the filters due to a semantic relationship.</p><p>The governed control plane identifies which tiers contributed to the initial search and locks the facets to those tiers. This prevents the sidebar from changing unexpectedly during a filtered session, ensuring a stable user experience.</p><h2><strong>The pagination challenge: Seamless multi-tier paging</strong></h2><p>Pagination in a tiered system requires precise state management. As established, the first page determines the scope of the <strong>current result set</strong>. If the first page required semantic results, the user can page through all available results from all three tiers. 
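On the other hand, if the first page was satisfied by high-intent keyword matches, the semantic tier is not retrieved for that specific result set.</p><p>To make the orchestration concrete, here is a minimal Python sketch of the bucket-filling and tier-locking logic described in the preceding sections. The <code>run_tier</code> helper, the tier names, and the page size are illustrative stand-ins rather than the actual implementation:</p><pre><code>PAGE_SIZE = 10  # results needed to fill the first page

def fill_first_page(run_tier, tiers=("strict", "relaxed", "semantic")):
    """Execute tiers in order until the first page is full.

    run_tier(name) is assumed to return the hits for that tier, with all
    governance policies already injected into its query. Returns the first
    page plus the locked tier list that subsequent page requests must reuse
    to keep results, facets, and pagination stable.
    """
    hits, active_tiers = [], []
    for tier in tiers:
        hits.extend(run_tier(tier))
        active_tiers.append(tier)
        if len(hits) >= PAGE_SIZE:
            break  # threshold met: lower tiers stay excluded for this result set
    return hits[:PAGE_SIZE], active_tiers</code></pre><p>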
The governed control plane manages this through:</p><ul><li><strong>Tier locking:</strong> The response includes an array identifying the contributing tiers. The front end sends this array back on subsequent requests to keep the tier composition consistent across all pages.</li><li><strong>Dynamic offset calculation:</strong> The back end calculates an offset based on the requested page and the total products returned in preceding tiers. <strong>Example:</strong> If the first page returned seven strict matches and three relaxed matches, a request for page 2 (starting at index 10) would execute a relaxed tier query with an offset of three.</li><li><strong>ID exclusion for lower tiers:</strong> The system retrieves IDs from the higher tiers (which, by definition, will always be fewer than the page size threshold) and explicitly excludes them from lower-tier results using an ID-only query (which avoids the overhead of a full fetch phase for excluded items).</li></ul><h2><strong>Summary</strong></h2><p>The multi-tier approach ensures search results are precise when data is available and helpful when it is not. By providing a governed fallback sequence for the application to execute, the architecture maintains high relevance while eliminating &quot;no results&quot; scenarios.</p><h2><strong>What&#x27;s next in this series</strong></h2><p>The next posts in this series extend the governed control plane into new territory. Part 6 explores personalization (using purchase history boosting and cohort-aware policies), and Part 7 demonstrates per-query economic optimization. Stay tuned!</p><h2><strong>Put governed ecommerce search into practice</strong></h2><p>The search architecture described in this post, where retrieval tiers, economic weights, and governance constraints compose into a single request, was designed and built by Elastic Services Engineering as part of our repeatable ecommerce search accelerators.</p><p>To learn more about applying these patterns to your business, <strong><a href="https://www.elastic.co/contact">Contact Elastic Professional Services</a>.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Three indices walk into a FROM clause: ES|QL subqueries in Elasticsearch]]></title>
            <description><![CDATA[ES|QL subqueries give each data source its own pipeline and filters, eliminating CASE chains, restoring predicate pushdown, and making multi-index queries extensible by design.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-subquery-from</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-subquery-from</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tyler Perkins]]></dc:creator>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2eb755dd2b2b69b8e0e8867a0da85940eb744176-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language</a> (ES|QL) now has <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-subquery">subqueries in <code>FROM</code></a>. Three indices, different schemas, one query; each source gets its own pipeline with its own filters and transforms. No more <code>CASE</code> chains. No more client-side stitching. Add a fourth source? Add a fourth branch; zero changes to the existing three.</p><h2>The problem: Heterogeneous data, one query</h2><p>Consider a production incident investigation. Errors are spread across three microservices: an API gateway, a payments service, and an auth service, each with different field names and different conventions. Before subqueries, combining them in a single ES|QL query meant cramming everything into one <code>FROM</code> with <code>CASE</code> chains:</p><pre><code>FROM svc-gateway-*, svc-payments-*, svc-auth-* METADATA _index
| WHERE http.response.status_code >= 500
    OR transaction.status IN ("failed", "timeout")
    OR (event.action == "login" AND event.outcome == "failure")
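    // one disjunct per source: the OR prevents pushing any filter down to a single index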
| EVAL
    service = CASE(
      _index LIKE "svc-gateway*", "gateway",
      ... /* one branch per source */),
    error_detail = CASE(
      _index LIKE "svc-gateway*", CONCAT("HTTP ", http.response.status_code::string),
      ... /* one branch per source */)
| KEEP @timestamp, service, error_detail, source.ip
| SORT @timestamp DESC</code></pre><p>This is brittle and slow. The disjunctive <code>OR</code> prevents predicate pushdown; every index scans every condition. Every <code>CASE</code> chain grows with every source. Copy it into five dashboards and three alert rules, and you have eight places to update when anything changes.</p><h2>The fix: Independent pipelines</h2><p>Subqueries replace the monolithic <code>FROM</code> + <code>CASE</code> pattern. Each data source gets its own complete pipeline:</p><pre><code>FROM
  (FROM svc-gateway-*
   | WHERE http.response.status_code >= 500
   | EVAL service = "gateway",
         error_detail = CONCAT("HTTP ", http.response.status_code::string)
   | KEEP @timestamp, service, error_detail, source.ip),
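  // payments branch: same output columns, different source fields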
  (FROM svc-payments-*
   | WHERE transaction.status IN ("failed", "timeout")
   | EVAL service = "payments",
         error_detail = transaction.status
   | KEEP @timestamp, service, error_detail, source.ip),
  (FROM svc-auth-*
   | WHERE event.action == "login" AND event.outcome == "failure"
   | EVAL service = "auth",
         error_detail = CONCAT(event.action, " ", event.outcome)
   | KEEP @timestamp, service, error_detail, source.ip)
| SORT @timestamp DESC
| LIMIT 20</code></pre><p>The gateway branch only scans for HTTP 500s. The payments branch only looks at transaction statuses. The auth branch only checks login failures. Because each branch has its own <code>WHERE</code>, the optimizer pushes filters independently into each index, restoring the predicate pushdown that a single <code>FROM</code> with <code>OR</code> conditions prevents. Fields that exist in one branch but not another are filled with <code>null</code>.</p><p>Adding a fourth service means adding a fourth branch. Existing branches don&#x27;t change.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f210b98824cb34c514b6cdb2d44b2bc81a51bb7b-1999x1084.png" style="width: 100%;" /><h2>Save it as a view</h2><p>This is where subqueries and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views">logical views</a> combine. Wrap the subquery above in a named view, with one API call:</p><pre><code>PUT _query/view/error_triage
{
  "query": "FROM (FROM svc-gateway-* | WHERE ...) , (FROM svc-payments-* | WHERE ...) , (FROM svc-auth-* | WHERE ...)"
}</code></pre><p>Now consumers just write <code>FROM error_triage | STATS error_count = COUNT(*) BY service</code>. Three indices, three pipelines, one name. If you have 10 dashboards and five alert rules consuming this pattern, that&#x27;s 15 copies of the same logic today; with a view, it&#x27;s one definition and zero consumer-side edits when you add a fourth service. See <a href="https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views">Elasticsearch ES|QL Views</a> for the full views deep dive.</p><h2>What you can do inside a branch</h2><p>Each branch supports the full ES|QL pipeline: <code>WHERE</code>, <code>EVAL</code>, <code>STATS</code>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join"><code>LOOKUP JOIN</code></a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/enrich"><code>ENRICH</code></a>, and more. See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-subquery">subquery documentation</a> for the complete list.</p><h2>Aggregate different metrics, and then combine</h2><p>Each branch can compute its own summary before results are merged. This is useful when different indices track the same concept under different field names:</p><pre><code>FROM
  (FROM svc-gateway-*
   | STATS avg_latency = AVG(http.response.time_ms) BY hour = BUCKET(@timestamp, 1 hour)
   | EVAL service = "gateway"),
  (FROM svc-payments-*
   | STATS avg_latency = AVG(transaction.duration_ms) BY hour = BUCKET(@timestamp, 1 hour)
   | EVAL service = "payments")
| SORT hour DESC, service</code></pre><p>Both branches produce <code>avg_latency</code> and <code>hour</code>, but each computes <code>avg_latency</code> from a different source field. The combined result is a single table you can chart or alert on, without normalizing field names at ingest time. This pattern is impossible with a single <code>FROM</code>; you can&#x27;t compute different aggregations per index without subqueries.</p><h2>Subqueries vs. FORK</h2><p>ES|QL also has <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> (now generally available), which creates parallel execution branches from the same input. The distinction:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5728fa2ce08df32b67ef9368c4d75d04b74ab95-1999x992.png" style="width: 100%;" /><p>Different indices → subqueries. Same data, different analyses → FORK.</p><h2>How this compares</h2><p>If you&#x27;re coming from other query languages, here&#x27;s how ES|QL subqueries stack up at the time of writing:</p><p><strong>Splunk SPL/SPL2</strong> has <code>append</code> and <code>multisearch</code> in classic SPL, and SPL2 adds a <a href="https://help.splunk.com/en/splunk-cloud-platform/search/spl2-search-reference/union-command/union-command-examples">union command</a> that merges events from multiple datasets (the closest analog to ES|QL subqueries). Federated Search extends this across remote Splunk deployments (analogous to CCS). The differences are in how the engine handles each branch: ES|QL subqueries give each branch independent predicate pushdown, meaning filters are pushed into each index&#x27;s shard-level structures separately. SPL2 <code>union</code> merges datasets but optimization across branches is limited to what the search scheduler can parallelize. Wrapping ES|QL subqueries in a <a href="https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views">view</a> gives you engine-level encapsulation with role-based access control (RBAC); Splunk&#x27;s equivalent is saved searches and macros, which are text substitutions expanded at parse time.</p><p><strong>SQL databases</strong> have <code>UNION ALL</code>, which is the closest analog. The difference is that SQL <code>UNION ALL</code> typically requires matching column counts and types at parse time. ES|QL subqueries are more forgiving; columns that exist in one branch but not another get null-padded automatically, which matters when your sources have different schemas (the norm in observability data). SQL views solve the reuse problem similarly, but ES|QL views are cluster-level objects, not database-scoped; they work across <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-cross-clusters.html">cross-cluster search</a> boundaries.</p><p><strong>Grafana / Datadog / other dashboarding tools</strong> handle multisource composition at the visualization layer: Run separate queries, merge in the panel. This works for display but breaks for alerting, downstream queries, and anything that needs a single result set programmatically. ES|QL subqueries push the composition into the engine, so alerts, views, and API consumers all get the same unified result.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Capability</td><td style="border: 1px solid #ccc; padding: 4px;">Splunk SPL/SPL2</td><td style="border: 1px solid #ccc; padding: 4px;">SQL UNION ALL</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard-layer merge</td><td style="border: 1px solid #ccc; padding: 4px;">ES|QL subqueries</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Independent filters per source</td><td style="border: 1px solid #ccc; padding: 4px;">SPL2 <code>union</code> merges datasets; optimization is scheduler-level</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">N/A (separate queries)</td><td style="border: 1px solid #ccc; padding: 4px;">Yes; parallel with pushdown</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Schema mismatch handling</td><td style="border: 1px solid #ccc; padding: 4px;">Manual field normalization</td><td style="border: 1px solid #ccc; padding: 4px;">Strict column matching</td><td style="border: 1px solid #ccc; padding: 4px;">Manual in panel config</td><td style="border: 1px solid #ccc; padding: 4px;">Automatic null-padding</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Engine-level reuse</td><td style="border: 1px solid #ccc; padding: 4px;">Text macros (parse-time expansion)</td><td style="border: 1px solid #ccc; padding: 4px;">Database-scoped views</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard variables</td><td style="border: 1px solid #ccc; padding: 4px;">Cluster-level views with RBAC</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Works for alerts + API</td><td style="border: 1px solid #ccc; padding: 4px;">Limited (summary indexing)</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">No; display only</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Add a source</td><td style="border: 1px solid #ccc; padding: 4px;">Edit every macro/saved search</td><td style="border: 1px solid #ccc; padding: 4px;">Add a UNION branch</td><td style="border: 1px solid #ccc; padding: 4px;">Add a panel query</td><td style="border: 1px solid #ccc; padding: 4px;">Add a branch; existing branches unchanged</td>
                    </tr>
              </tbody>
            </table>
          <h2>Current constraints</h2><p>In the Tech Preview release, subqueries are non-correlated; branches run independently and can&#x27;t reference the outer query. They&#x27;re supported in <code>FROM</code> only (not <code>TS</code>), and <code>FORK</code> can&#x27;t be used inside or after subqueries. See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-subquery">subquery documentation</a> for details.</p><h2>What&#x27;s next for subqueries</h2><p><a href="https://github.com/elastic/roadmap/issues/60"><code>WHERE</code> subqueries</a> — <code>WHERE field IN (FROM other_index | ...)</code> and other correlated forms — will extend the composition model from <code>FROM</code> into filtering. This brings the familiar SQL pattern of nested filtering to ES|QL.</p><h2>Try it</h2><p>Subqueries in <code>FROM</code> are available as a Tech Preview. Try them in <a href="https://www.elastic.co/kibana">Kibana</a> Dev Tools or Discover. We&#x27;d love your feedback; file a <a href="https://github.com/elastic/elasticsearch/issues">GitHub issue</a> with the <code>ES|QL</code> label.</p><p><em>ES|QL subqueries in FROM are a Tech Preview feature. Tech Preview features are subject to change and are not covered by the support SLA of GA features. The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch's BBQ vs. TurboQuant: 10–40× faster on CPU and lower ranking noise]]></title>
            <description><![CDATA[A head-to-head look at Elasticsearch BBQ and TurboQuant, including throughput, ranking accuracy, and why uniform quantization wins for CPU vector search with up to 40× faster comparisons and smaller ranking noise.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-bbq-osq-vs-turbo</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-bbq-osq-vs-turbo</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Thomas Veasey]]></dc:creator>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/81cf4cbba360511e166ea59f2cc747cdd69b7dcf-1840x892.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In CPU vector search, Elasticsearch&#x27;s <a href="https://www.elastic.co/search-labs/blog/scalar-quantization-optimization">Optimized Scalar Quantization (OSQ)</a> (the algorithm behind <a href="https://www.elastic.co/search-labs/blog/optimized-scalar-quantization-elasticsearch">Better Binary Quantization (BBQ)</a>) beats <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a> where production systems care most: throughput, ranking accuracy, and storage efficiency. In our tests on Apple M2 Max, OSQ&#x27;s symmetric kernels are 10-40x faster, and on shifted embeddings its 1-bit document encoding beats TurboQuant at 4 bits on ranking accuracy while using far less storage. TurboQuant still wins on raw reconstruction MSE, but that advantage comes mostly from the Hadamard rotation and does not translate into better CPU search behavior.</p><h4>A brief quantization primer</h4><p>Vector search indexes often store millions or billions of embedding vectors, each hundreds or thousands of floats wide. Scalar quantization compresses each float coordinate independently to a small integer, typically 1, 2, or 4 bits, reducing storage by 8-32× and enabling fast integer-arithmetic scoring.<br/>Why does per-coordinate quantization work at all? Because the expected squared error decomposes as a sum of independent per-coordinate terms: $\mathbb{E}[\|x - q(x)\|^2] = \sum_i \mathbb{E}[(x_i - q_i(x_i))^2]$, by linearity of expectation. Each coordinate can be quantized independently regardless of the joint distribution. The remaining question is how to make each per-coordinate quantizer <em>good</em>.</p><h4>BBQ and Optimized Scalar Quantization</h4><p><a href="https://www.elastic.co/search-labs/blog/optimized-scalar-quantization-elasticsearch">Better Binary Quantization (BBQ)</a>, and its underlying algorithm <a href="https://www.elastic.co/search-labs/blog/scalar-quantization-optimization">Optimized Scalar Quantization (OSQ)</a>, have been part of Elasticsearch for multiple releases. OSQ is an evolution of several techniques to make scalar quantization more accurate, specifically for vector search.</p><p>Each vector&#x27;s components are mapped to a uniform grid over an interval $[a, b]$. The interval is initialized from the vector&#x27;s statistics (assuming approximately normal residuals), using the <a href="https://www.elastic.co/search-labs/blog/scalar-quantization-optimization#initializing-the-quantization-interval">exact same optimization objective</a> as TurboQuant but with an additional constraint on centroid positions. They are then refined by coordinate descent to minimize an anisotropic loss $L = (1-\lambda)(x \cdot e)^2/\|x\|^2 + \lambda\|e\|^2$, where $e$ is the quantization error vector. With the production default $\lambda = 0.1$, this deliberately sacrifices some MSE to concentrate accuracy along the query direction. This is the direction that matters for ranking.</p><p>Before quantization, the segment (or cluster in the case of Inverted Vector File (IVF) indices) centroid $c$ is subtracted from every vector. This removes the dominant shared component that would otherwise waste the quantizer&#x27;s dynamic range. Both the symmetric and asymmetric dot-product paths also center the query using the same segment centroid, so the only quantized inner product is between centered residuals. 
The correction terms, $\langle c, x \rangle$ for each vector and $\|c\|^2$, depend on only a single vector each, and can be precomputed exactly. Centering therefore adds no per-pair cost.</p><p>Documents are quantized at 1-bit (32× compression), queries at 4-bit (cheap since there are only a handful per search). The storage constraint binds on documents, not queries, so spending more bits on the query side recovers float-query accuracy while keeping per-document footprint minimal.</p><p>A block-diagonal <a href="https://en.wikipedia.org/wiki/Orthogonal_matrix">orthogonal</a> preconditioner equalizes coordinate variances and normalizes their distribution before quantization. This is the same goal as a full <a href="https://en.wikipedia.org/wiki/Hadamard_transform">Hadamard rotation</a> used by TurboQuant, but with no power-of-2 padding overhead.</p><p>Because the grid is uniform, quantized dot products decompose into integer dot products with scalar corrections. This enables NEON/SVE and SSE/AVX popcount and multiply-accumulate pipelines: bit-plane decomposition for 1-bit and 2-bit, nibble multiply for 4-bit, and a RaBitQ-style mixed 4×1 kernel that decomposes to four 1-bit kernels.</p><p>For a deeper dive into the sparse rotation and what it brings to robustness, see the <a href="https://www.elastic.co/search-labs/blog/robust-optimized-scalar-quantization">Robust Optimized Scalar Quantization</a> blog. For a full walkthrough of optimized scalar quantization, see this <a href="https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch">OSQ deep dive</a>.</p><h4>TurboQuant</h4><p><a href="https://arxiv.org/pdf/2504.19874">TurboQuant</a> (Google, ICLR 2026) takes a slightly different path from the same starting observation: that concentrated, predictable per-coordinate distributions are easy to quantize well.</p><p>Rather than adapting the quantizer per vector, TurboQuant normalizes the vector and applies a shared randomized Hadamard rotation to the entire dataset. This general idea was first proposed and formalized by <a href="https://arxiv.org/pdf/2405.12497">RaBitQ</a>, which showed that, for its quantization scheme, the random rotation yields worst-case error bounds that hold for any fixed unit vector. The idea of implementing the rotation via a Hadamard transform was suggested by <a href="https://weaviate.io/blog/8-bit-rotational-quantization">Weaviate for rotational quantization</a>. After normalization and rotation, each coordinate&#x27;s distribution almost always converges to $\mathcal{N}(0, 1/d)$ in high dimensions, regardless of the original data. TurboQuant builds on this foundation: with the distribution pinned down, it solves for the optimal <a href="https://en.wikipedia.org/wiki/Lloyd%27s_algorithm">Lloyd-Max</a> scalar quantizer, a 1-D $k$-means problem on the known density. The resulting non-uniform centroids bunch up where the density is highest (near zero) and spread out in the tails. This achieves provably near-optimal MSE: within ~2.7× of the information-theoretic lower bound in general, and as tight as 1.45× at 1-bit.</p><p>For inner products, MSE-optimal quantizers introduce a multiplicative bias (which is most severe at 1-bit: $2/\pi \approx 0.64$). 
TurboQuant corrects this with a two-stage design ($Q_\text{prod}$): spend $b-1$ bits on the MSE quantizer, then use the remaining 1 bit for a <a href="https://arxiv.org/pdf/2406.03482">Quantized Johnson-Lindenstrauss</a> (QJL) sketch of the residual, yielding a provably unbiased inner-product estimator.</p><p>The paper&#x27;s nearest-neighbour experiments were conducted on GPU (NVIDIA A100), where the lookup-table access pattern maps naturally onto shared memory.</p><h4>The key design divergence: integer arithmetic vs. lookup tables</h4><p>The difference between uniform and non-uniform centroids may seem minor, but it creates a large computational gap.</p><p>OSQ&#x27;s uniform grid means each quantized coordinate is an integer whose arithmetic meaning is preserved. The dot product of two quantized vectors decomposes into an integer dot product, directly exploitable by SIMD: <code>vpdpbusd</code> on x86, multiply-accumulate and <code>vcnt</code> (popcount) on ARM NEON. The pipeline is branch-free and the data access pattern is sequential.</p><p>TurboQuant&#x27;s non-uniform centroids break this. Each coordinate pair requires looking up a centroid value from a shared codebook, and the access pattern is data-dependent, with each index selecting a different table entry. On NEON, which lacks a float gather instruction, this means scalar loads to build each vector register before the Fused Multiply-Add (FMA). Precomputing per-coordinate product tables ($d \times 2^b$ entries, amortized over all documents) doesn&#x27;t help either: the FMA is relatively cheap on modern cores, so the bottleneck remains the data-dependent gather, not the arithmetic. Our benchmarks confirm this: precomputed ADC tables are no faster (and sometimes slower due to the larger working set) than inline centroid lookup.</p><h4>Terminology used in the comparisons</h4><p>The results sections below refer to several OSQ scoring configurations. All use uniform-grid quantization with scalar correction terms to recover the exact dot product up to quantization noise.</p><p><em>Symmetric $n$-bit</em> quantizes both query and document at $n$ bits per coordinate.</p><p><em>Asymmetric</em> keeps the query as a full float vector and quantizes only the document. The dot product is a float-times-integer sum. This is more expensive per pair than symmetric, but avoids any query quantization noise. TurboQuant&#x27;s scoring is always asymmetric (float query dotted against quantized document via centroid lookup).</p><p><em>1-4</em> is the production configuration for OSQ: documents at 1-bit (32× compression), queries at 4-bit. This exploits the asymmetry of search: there is one query but millions of documents, so query storage is free but document storage is the binding constraint.</p><p><em>Centered</em> means the segment centroid $c$ has been subtracted from all vectors (and the query) before quantization, with the exact correction recovered from precomputed scalar terms. Centering focuses the quantizer&#x27;s dynamic range on the information-bearing residual rather than the shared mean.</p><p>$\lambda$ controls the anisotropic loss tradeoff: $\lambda = 1$ minimizes pure MSE, $\lambda = 0.1$ (production default) sacrifices some MSE to concentrate accuracy along the query direction, the direction that determines ranking.</p><h4>How do they compare in practice?</h4><p>The following results were obtained on an Apple M2 Max. 
The code to reproduce all these results is available <a href="https://github.com/elastic/osq_vs_turboquant">here</a>.</p><h4>Head-to-head: MSE</h4><p>On reconstruction MSE, the metric it was designed to optimize, TurboQuant outperforms plain OSQ at every bit-width.</p><p>Relative MSE ($\|x - \hat{x}\|^2 / \|x\|^2$) on $d = 768$ Gaussian vectors (1,000 vectors, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Bits</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ (λ=0.1)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ (λ=1)</td><td style="border: 1px solid #ccc; padding: 4px;">TurboQuant</td><td style="border: 1px solid #ccc; padding: 4px;">TQ vs OSQ ($\lambda=1$)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">0.512</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.307</td><td style="border: 1px solid #ccc; padding: 4px;">1.18×</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">0.138</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.092</td><td style="border: 1px solid #ccc; padding: 4px;">1.28×</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">0.038</td><td style="border: 1px solid #ccc; padding: 4px;">0.037</td><td style="border: 1px solid #ccc; padding: 4px;">0.026</td><td style="border: 1px solid #ccc; padding: 4px;">1.42×</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.007</td><td style="border: 1px solid #ccc; padding: 4px;">1.61×</td>
                    </tr>
              </tbody>
            </table>
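          <p>To make the metric concrete, here is a toy Python sketch (not the production OSQ code) that measures relative MSE for a naive per-vector uniform grid; the interval choice is illustrative, whereas OSQ optimizes it per vector:</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d, n, bits = 768, 1000, 1
x = rng.standard_normal((n, d))

# Toy interval: mean +/- one standard deviation per vector. OSQ instead
# refines [a, b] per vector by coordinate descent on its anisotropic loss.
mu = x.mean(axis=1, keepdims=True)
sd = x.std(axis=1, keepdims=True)
a, b = mu - sd, mu + sd

levels = (1 << bits) - 1  # number of grid steps (2^bits points)
codes = np.clip(np.round((x - a) / (b - a) * levels), 0, levels)
x_hat = a + codes / levels * (b - a)  # dequantized reconstruction

rel_mse = np.sum((x - x_hat) ** 2) / np.sum(x ** 2)
print(f"toy relative MSE @ {bits}-bit: {rel_mse:.3f}")</code></pre><p>Run as-is, this toy should land noticeably above the 1-bit OSQ figures in the table; that gap is roughly what the interval refinement buys.</p>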
          <p>The $\lambda$ columns reveal that OSQ&#x27;s production setting ($\lambda = 0.1$) deliberately sacrifices MSE for dot-product accuracy. With $\lambda = 1$ (pure MSE), the gap narrows to just 1.18× at 1-bit.</p><p>But where does TurboQuant&#x27;s remaining MSE advantage actually come from: the Lloyd-Max centroids, or the Hadamard rotation? We can answer this directly by applying the same randomized Hadamard rotation to OSQ (zero-pad 768→1024, random sign flips, Walsh-Hadamard butterfly, quantize in rotated space, invert). Theory predicts the MSE improves by a factor of $d/d&#x27; = 768/1024 = 0.75$:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Bits</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ (λ=1)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ + Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">TurboQuant</td><td style="border: 1px solid #ccc; padding: 4px;">Ratio (OSQ / OSQ + Hadamard)</td><td style="border: 1px solid #ccc; padding: 4px;">Theory (d'/d)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.306</td><td style="border: 1px solid #ccc; padding: 4px;">0.307</td><td style="border: 1px solid #ccc; padding: 4px;">1.19</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.092</td><td style="border: 1px solid #ccc; padding: 4px;">0.092</td><td style="border: 1px solid #ccc; padding: 4px;">1.28</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">0.037</td><td style="border: 1px solid #ccc; padding: 4px;">0.028</td><td style="border: 1px solid #ccc; padding: 4px;">0.026</td><td style="border: 1px solid #ccc; padding: 4px;">1.31</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.009</td><td style="border: 1px solid #ccc; padding: 4px;">0.007</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td><td style="border: 1px solid #ccc; padding: 4px;">1.33</td>
                    </tr>
              </tbody>
            </table>
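          <p>For reference, the randomized Hadamard rotation behind the &quot;OSQ + Hadamard&quot; rows can be sketched in a few lines of Python. Orthonormal scaling is assumed, so the transform is its own inverse up to the sign flips:</p><pre><code>import numpy as np

def fwht(v):
    # Iterative fast Walsh-Hadamard transform; len(v) must be a power of 2.
    v = v.copy()
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            v[i:i + h] = a + v[i + h:i + 2 * h]
            v[i + h:i + 2 * h] = a - v[i + h:i + 2 * h]
        h *= 2
    return v / np.sqrt(n)  # orthonormal scaling preserves norms

rng = np.random.default_rng(0)
d, d_pad = 768, 1024  # zero-pad to the next power of 2
signs = rng.choice([-1.0, 1.0], size=d_pad)  # shared random sign flips

def rotate(x):
    padded = np.zeros(d_pad)
    padded[:d] = x
    return fwht(signs * padded)  # quantize in this rotated space</code></pre>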
          <p>OSQ + Hadamard matches TurboQuant almost exactly at 1-bit (0.306 vs 0.307) and 2-bit (0.092 vs 0.092). TurboQuant&#x27;s MSE advantage is the rotation, not the centroids. At 3–4 bits the Lloyd-Max placement contributes a modest ~1.1× edge, real but small.</p><p>How the improvement ratio from applying the Hadamard transformation to OSQ converges is itself informative: at 4-bit it hits the theoretical 1.33 exactly, but at 1-bit it&#x27;s only 1.19. The shortfall quantifies the value of OSQ&#x27;s data-dependent interval refinement: it already captures ~40% of the dimension expansion and component equalization benefit that Hadamard provides. The coordinate descent is doing some of the same work as the rotation, adapting to each vector rather than relying on a data-oblivious transform. However, the real advantage, as we discuss below, is that this formulation allows us to concentrate accuracy along the query direction.</p><p>This raises a natural question: how does OSQ&#x27;s block-diagonal sparse preconditioner compare to the full Hadamard rotation in practice?</p><h4>Head-to-head: sparse preconditioner vs Hadamard</h4><p>OSQ&#x27;s <a href="https://www.elastic.co/search-labs/blog/robust-optimized-scalar-quantization">sparse preconditioner</a> applies a block-diagonal random orthogonal transformation: dimensions are randomly permuted into blocks (64×64 in production), and each block is multiplied by an independent random orthogonal matrix. This equalizes coordinate distributions within each block. The Hadamard rotation achieves the same goal globally but requires zero-padding to the next power of 2.</p><p>We test on anisotropic Gaussian data ($d = 768$, $\sigma_i$ ramping from 1 to 5 across coordinates), a challenging distribution where some coordinates carry far more variance than others.</p><p>Transform latency ($d = 768$, 1,100 vectors, ARM NEON, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">ns/vec</td><td style="border: 1px solid #ccc; padding: 4px;">Effective dim</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 32×32</td><td style="border: 1px solid #ccc; padding: 4px;">1,811</td><td style="border: 1px solid #ccc; padding: 4px;">768</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 64×64</td><td style="border: 1px solid #ccc; padding: 4px;">4,887</td><td style="border: 1px solid #ccc; padding: 4px;">768</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Full dense</td><td style="border: 1px solid #ccc; padding: 4px;">244,752</td><td style="border: 1px solid #ccc; padding: 4px;">768</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">1,556</td><td style="border: 1px solid #ccc; padding: 4px;">1,024</td>
                    </tr>
              </tbody>
            </table>
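          <p>Concretely, the block-diagonal transform benchmarked above can be sketched as follows (random orthogonal blocks obtained via QR here; the production implementation differs in details):</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d, block = 768, 64
perm = rng.permutation(d)  # random coordinate permutation

# One independent random orthogonal matrix per block (QR of a Gaussian).
rotations = [np.linalg.qr(rng.standard_normal((block, block)))[0]
             for _ in range(d // block)]

def precondition(x):
    # Permute, then rotate each block independently: O(d * block) per vector,
    # no power-of-2 padding, effective dimension stays at d.
    y = x[perm]
    return np.concatenate(
        [q @ y[i * block:(i + 1) * block] for i, q in enumerate(rotations)]
    )</code></pre>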
          <p>Hadamard is the fastest non-trivial option thanks to $O(d \log d)$ butterflies vs $O(d \times b)$ for block size $b$, though all block-diagonal variants are fast enough to be negligible in practice: even the 64×64 block, at 4.9 μs, is tiny compared to typical search latencies. The full dense $d \times d$ rotation is impractical at 244 μs/vec but serves as a theoretical reference. Note that the block-diagonal transform works for arbitrary dimensions: no power-of-2 padding is required, and the effective dimension stays at $d$.</p><p>MSE (relative MSE, $\lambda = 1$, anisotropic data, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">1 bit</td><td style="border: 1px solid #ccc; padding: 4px;">2 bit</td><td style="border: 1px solid #ccc; padding: 4px;">4 bit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">No transform</td><td style="border: 1px solid #ccc; padding: 4px;">0.443</td><td style="border: 1px solid #ccc; padding: 4px;">0.157</td><td style="border: 1px solid #ccc; padding: 4px;">0.0182</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 32×32</td><td style="border: 1px solid #ccc; padding: 4px;">0.368</td><td style="border: 1px solid #ccc; padding: 4px;">0.121</td><td style="border: 1px solid #ccc; padding: 4px;">0.0120</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 64×64</td><td style="border: 1px solid #ccc; padding: 4px;">0.365</td><td style="border: 1px solid #ccc; padding: 4px;">0.119</td><td style="border: 1px solid #ccc; padding: 4px;">0.0117</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Full dense</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.0113</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">0.362</td><td style="border: 1px solid #ccc; padding: 4px;">0.118</td><td style="border: 1px solid #ccc; padding: 4px;">0.0112</td>
                    </tr>
              </tbody>
            </table>
          <p>Even 32×32 blocks recover most of the gap from no-transform (0.443) to full rotation (0.362), 93% at 1-bit. Block 64×64 closes the gap further. On isotropic data (not shown), all methods produce identical MSE (~0.362 at 1-bit), confirming there is nothing to equalize when coordinates already have equal variance.</p><p>Dot-product accuracy (1-4 centered, raw relative dot-product error; note these are raw RMSE including multiplicative bias, which is appropriate for comparing preconditioner variants against each other since the bias structure is similar, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Anisotropic</td><td style="border: 1px solid #ccc; padding: 4px;">Isotropic</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">No transform</td><td style="border: 1px solid #ccc; padding: 4px;">0.690</td><td style="border: 1px solid #ccc; padding: 4px;">0.722</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 32×32</td><td style="border: 1px solid #ccc; padding: 4px;">0.606</td><td style="border: 1px solid #ccc; padding: 4px;">0.724</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Block 64×64</td><td style="border: 1px solid #ccc; padding: 4px;">0.602</td><td style="border: 1px solid #ccc; padding: 4px;">0.720</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Full dense</td><td style="border: 1px solid #ccc; padding: 4px;">0.595</td><td style="border: 1px solid #ccc; padding: 4px;">0.723</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hadamard</td><td style="border: 1px solid #ccc; padding: 4px;">0.566</td><td style="border: 1px solid #ccc; padding: 4px;">0.629</td>
                    </tr>
              </tbody>
            </table>
          <p>On isotropic data the block-diagonal methods and the full dense rotation produce the same dot-product error as no transform since there is nothing to fix. Hadamard is the outlier, improving from 0.723 to 0.629. But this improvement is not from better preconditioning: the full dense rotation, which is an equally good random orthogonal transform, shows no improvement at all. The difference is the padding. Hadamard operates in 1024 dimensions, so 1-bit documents store 1024 bits instead of 768. This is 33% more storage. The improvement ratio (0.723 / 0.629 = 1.15) matches $\sqrt{d&#x27;/d} = \sqrt{1024/768} = 1.155$ almost exactly, confirming that the entire dot-product advantage is attributable to the extra bits, not the rotation.</p><p>On anisotropic data, the block-diagonal rotation does help dot-product accuracy (0.690 → 0.602 for block 64), which is the real value from coordinate equalization. Hadamard goes further (0.566), but the incremental improvement over a full dense rotation at the same dimension (0.595 → 0.566) is again consistent with the padding benefit.</p><p>The practical implication: for CPU-based search where storage efficiency matters, the block-diagonal preconditioner delivers the same MSE improvement as Hadamard at the same effective bit rate, works for any dimension without padding, and the dot-product gap we see in our experiments is a padding artifact, not a preconditioning advantage.</p><h4>Head-to-head: dot-product accuracy</h4><p>MSE measures reconstruction quality, but search engines rank by dot products. These are different objectives, and the gap between them is where OSQ&#x27;s design choices pay off.</p><p>We measure relative dot-product error: $\sqrt{\sum(q \cdot x - \hat{q} \cdot \hat{x})^2 / \sum(q \cdot x)^2}$, varying the angle between query and document. The small-angle regime (0°–20°) matters most: real transformer embeddings occupy a narrow cone rather than spreading uniformly on the sphere (<a href="https://arxiv.org/pdf/1909.00512">Ethayarajh 2019</a>). Furthermore, near-parallel vectors, corresponding to the nearest neighbours of a query in the dataset, are where ranking accuracy is critical.</p><p>Our production configuration is 1-bit documents, 4-bit queries, centroid centering, with integer scoring.</p><p>Raw dot-product error conflates two distinct components: a multiplicative bias (a global scale factor that preserves ranking order) and noise (random per-pair deviations that can swap rankings). For search, only the noise matters: a biased estimator that consistently scales all scores by the same factor produces the same ranking as the exact scores. TurboQuant&#x27;s MSE quantizer at 1-bit has a well-known multiplicative bias of $2/\pi \approx 0.64$, meaning raw dot-product errors of ~0.36 are almost entirely this ranking-irrelevant scale factor. To give a fair comparison, we report the debiased RMSE after fitting and removing the best multiplicative scale: $\alpha = \sum(\hat{d} \cdot d) / \sum(d^2)$, then measuring $\sqrt{\sum(\hat{d}/\alpha - d)^2 / \sum d^2}$.</p><p>Zero-mean corpus ($d = 768$, 500 vectors, 5 queries per vector, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Angle</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ asymmetric (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ 1-4 (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @1-bit (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @4-bit (debiased)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0035</td><td style="border: 1px solid #ccc; padding: 4px;">0.0067</td><td style="border: 1px solid #ccc; padding: 4px;">0.0083</td><td style="border: 1px solid #ccc; padding: 4px;">0.0052</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0042</td><td style="border: 1px solid #ccc; padding: 4px;">0.0060</td><td style="border: 1px solid #ccc; padding: 4px;">0.0085</td><td style="border: 1px solid #ccc; padding: 4px;">0.0052</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0057</td><td style="border: 1px solid #ccc; padding: 4px;">0.0074</td><td style="border: 1px solid #ccc; padding: 4px;">0.0091</td><td style="border: 1px solid #ccc; padding: 4px;">0.0052</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20°</td><td style="border: 1px solid #ccc; padding: 4px;">0.010</td><td style="border: 1px solid #ccc; padding: 4px;">0.011</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.0053</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">45°</td><td style="border: 1px solid #ccc; padding: 4px;">0.027</td><td style="border: 1px solid #ccc; padding: 4px;">0.029</td><td style="border: 1px solid #ccc; padding: 4px;">0.025</td><td style="border: 1px solid #ccc; padding: 4px;">0.0060</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">60°</td><td style="border: 1px solid #ccc; padding: 4px;">0.048</td><td style="border: 1px solid #ccc; padding: 4px;">0.049</td><td style="border: 1px solid #ccc; padding: 4px;">0.042</td><td style="border: 1px solid #ccc; padding: 4px;">0.0074</td>
                    </tr>
              </tbody>
            </table>
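          <p>For concreteness, here is a minimal NumPy sketch of the debiasing step described above (illustrative variable names only; this is the reported metric, not the benchmark harness): fit the single multiplicative scale the estimates carry relative to the exact dot products, divide it out, and report the residual relative RMSE.</p><pre><code>import numpy as np

def debiased_rmse(d_exact, d_est):
    """Ranking-relevant noise: remove the best-fit multiplicative scale first.

    d_exact: exact dot products q . x          (1-D array)
    d_est:   quantized estimates q_hat . x_hat (1-D array)
    """
    # Least-squares fit of the ranking-irrelevant scale factor alpha.
    alpha = np.dot(d_est, d_exact) / np.dot(d_exact, d_exact)
    # Relative RMSE of the rescaled estimates -- the "debiased" numbers in the tables.
    return np.sqrt(np.sum((d_est / alpha - d_exact) ** 2) / np.sum(d_exact ** 2))</code></pre>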
          <p>Shifted corpus (shift = 2.0, modeling real embedding bias, lower is better):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Angle</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ asymmetric (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ 1-4 (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @1-bit (debiased)</td><td style="border: 1px solid #ccc; padding: 4px;">TQ @4-bit (debiased)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0008</td><td style="border: 1px solid #ccc; padding: 4px;">0.0013</td><td style="border: 1px solid #ccc; padding: 4px;">0.0073</td><td style="border: 1px solid #ccc; padding: 4px;">0.0054</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0013</td><td style="border: 1px solid #ccc; padding: 4px;">0.0015</td><td style="border: 1px solid #ccc; padding: 4px;">0.0076</td><td style="border: 1px solid #ccc; padding: 4px;">0.0054</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0021</td><td style="border: 1px solid #ccc; padding: 4px;">0.0023</td><td style="border: 1px solid #ccc; padding: 4px;">0.0084</td><td style="border: 1px solid #ccc; padding: 4px;">0.0054</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20°</td><td style="border: 1px solid #ccc; padding: 4px;">0.0041</td><td style="border: 1px solid #ccc; padding: 4px;">0.0043</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.0055</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">45°</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.012</td><td style="border: 1px solid #ccc; padding: 4px;">0.025</td><td style="border: 1px solid #ccc; padding: 4px;">0.0064</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">60°</td><td style="border: 1px solid #ccc; padding: 4px;">0.022</td><td style="border: 1px solid #ccc; padding: 4px;">0.022</td><td style="border: 1px solid #ccc; padding: 4px;">0.043</td><td style="border: 1px solid #ccc; padding: 4px;">0.0078</td>
                    </tr>
              </tbody>
            </table>
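          <p>Before unpacking these numbers, a toy NumPy illustration of why centering matters on shifted data. This is a deliberately simplified 1-bit quantizer (sign bits with a per-vector scale), not the OSQ implementation; the variable names and setup are illustrative only.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d, n, shift = 768, 500, 2.0
X = rng.normal(size=(n, d)) + shift   # shifted corpus, as in the table above
q = rng.normal(size=d) + shift        # query from the same distribution
centroid = X.mean(axis=0)

def one_bit(v):
    # Toy 1-bit quantizer: sign bits with a per-vector scale (mean magnitude).
    return np.sign(v) * np.abs(v).mean(axis=-1, keepdims=True)

exact = X @ q
for name, est in [
    ("raw", one_bit(X) @ q),                               # quantize raw vectors
    ("centered", (one_bit(X - centroid) + centroid) @ q),  # quantize residuals
]:
    alpha = np.dot(est, exact) / np.dot(exact, exact)
    noise = np.sqrt(np.sum((est / alpha - exact) ** 2) / np.sum(exact ** 2))
    print(f"{name:>8}: debiased noise {noise:.4f}")</code></pre><p>Quantizing the raw vectors spends the single bit mostly re-encoding the shared offset; subtracting the centroid first spends it on the informative residual, which is the effect the shifted-corpus numbers reflect.</p>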
          <p>On zero-mean data, the raw error numbers (omitted for brevity; TQ @1-bit&#x27;s is ~0.363, almost entirely due to the $2/\pi$ multiplicative bias) are misleading; only the debiased ranking noise matters. The asymmetric column (float query, 1-bit document) is the most directly comparable to TQ since both quantize only the document: at 0° OSQ achieves 2.4× lower noise (0.0035 vs 0.0083). This is the payoff of the anisotropic loss ($\lambda = 0.1$), which concentrates accuracy along the query direction at the expense of off-axis components. Quantizing the query to 4 bits gives back some of this advantage (0.0035 → 0.0067), showing that query quantization is now the dominant noise source at small angles. Even so, OSQ symmetric still beats TQ @1-bit by 1.2–1.4× through 10°. The tradeoff is visible at wider angles where TQ @1-bit has <em>lower</em> noise than OSQ (0.042 vs 0.049 at 60°): the Hadamard rotation distributes information uniformly across all directions, while OSQ deliberately favors the directions that matter for search.</p><p>What about $Q_\text{prod}$? TurboQuant&#x27;s inner-product variant was designed to address exactly this bias, spending $b-1$ bits on the MSE quantizer and 1 bit on a QJL sketch of the residual to produce a provably unbiased estimator. At 1-bit $Q_\text{prod}$ is not viable (0 bits for MSE), so the minimum is 2-bit. But for ranking, the cure is worse than the disease: $Q_\text{prod}$ trades ranking-irrelevant bias for ranking-relevant noise. At 60°, $Q_\text{prod}$&#x27;s debiased noise is consistently higher than MSE-only at the same total bit width (0.031 vs 0.025 at 2-bit, 0.011 vs 0.007 at 4-bit) because each bit spent on QJL correction would have been better spent on quantization. Since search cares only about ranking, MSE-only is the better choice. The bias is harmless and the extra quantization bit reduces the noise that actually matters.</p><p>The picture changes on shifted data, where centroid centering gives OSQ a decisive advantage. At 0° the debiased noise drops to 0.0008, which is 9× lower than TQ @1-bit&#x27;s 0.0073, and 7× lower than TQ @4-bit&#x27;s 0.0054. Centering removes the dominant shared component before quantization, letting the quantizer focus its bits on the information-bearing residual. TurboQuant&#x27;s data-oblivious rotation cannot exploit this structure. The advantage persists through 20° (OSQ 0.0041 vs TQ @1-bit 0.012) and only narrows at wide angles (60°: OSQ 0.022 vs TQ @1-bit 0.043), where OSQ remains competitive.</p><p>On shifted data, OSQ at 1-bit per document (debiased noise 0.001) beats TurboQuant at 4-bit per document (debiased noise 0.006): better ranking accuracy at less than one-fifth the storage (768 bits vs 4,096 bits, since TQ pads 768→1024 for the Hadamard transform). This is the payoff of the data-dependent design: centering and anisotropic interval refinement extract structure that a data-oblivious rotation cannot.</p><p>TQ @4-bit MSE is consistently the lowest-noise option on zero-mean data (debiased 0.005–0.008 across all angles), but at 5× the storage cost per document. On shifted data it is actually substantially worse than OSQ symmetric at angles up to 20°.</p><h4>Head-to-head: throughput</h4><p>Throughput is where the uniform grid constraint really shines. Here are the throughput figures on $d = 768$, 10k documents, Apple M2 Max, 100 repetitions:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Bits</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ asymmetric</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ symmetric</td><td style="border: 1px solid #ccc; padding: 4px;">OSQ 1-4</td><td style="border: 1px solid #ccc; padding: 4px;">TurboQuant</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">67 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">7 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">—</td><td style="border: 1px solid #ccc; padding: 4px;">275 ns/doc</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">132 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">14 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">—</td><td style="border: 1px solid #ccc; padding: 4px;">293 ns/doc</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">94 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">22 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">14 ns/doc</td><td style="border: 1px solid #ccc; padding: 4px;">216 ns/doc</td>
                    </tr>
              </tbody>
            </table>
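          <p>The kernel walk-through below leans on one identity: a quantized dot product decomposes into weighted AND+popcount passes over packed bit-planes. As a minimal sanity check of that arithmetic (pure Python with NumPy; the helper names are ours, and the production kernels use NEON intrinsics as described next):</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
d = 768
x4 = rng.integers(0, 16, size=d)  # 4-bit query indices
y1 = rng.integers(0, 2, size=d)   # 1-bit document bits

def pack(bits):
    # Pack a 0/1 vector into bytes: for d = 768 this is one 96-byte plane.
    return np.packbits(bits.astype(np.uint8))

def popcount_and(a, b):
    # popcount(a AND b) over packed planes -- one pass of the kernel.
    return int(np.unpackbits(np.bitwise_and(a, b)).sum())

# Mixed 4x1 decomposition: sum(idx4 * idx1) over all coordinates equals
# 8*pc(plane3 AND bits1) + 4*pc(plane2 AND bits1)
#   + 2*pc(plane1 AND bits1) + pc(plane0 AND bits1)
planes = [pack((x4 // 2**k) % 2) for k in range(4)]  # query bit-planes
bits1 = pack(y1)                                     # packed document bits
score = sum(2**k * popcount_and(planes[k], bits1) for k in range(4))

assert score == int(np.dot(x4, y1))  # matches the exact integer dot product</code></pre>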
          <p>OSQ&#x27;s symmetric kernels are 10–40× faster than TurboQuant!</p><p>We made a fair effort to optimize both implementations to use ARM NEON instructions effectively, but do not claim these are optimal. The key techniques:</p><p>The 1-bit kernel reduces to <code>popcount(a AND b)</code> via NEON&#x27;s <code>vcntq_u8</code>, processing 32 bytes per iteration with dual accumulators for pipeline parallelism. For $d = 768$ the entire packed vector is 96 bytes; a single pass yields 7 ns/doc.</p><p>The 2-bit kernel decomposes each 2-bit index into two bit-planes (precomputed at quantize time), reducing the dot product to 4 AND+popcount passes over the same 96-byte planes: $\sum(2x_1+x_0)(2y_1+y_0) = 4 \cdot \text{pc}(x_1 \wedge y_1) + 2 \cdot \text{pc}(x_1 \wedge y_0) + 2 \cdot \text{pc}(x_0 \wedge y_1) + \text{pc}(x_0 \wedge y_0)$. At 14 ns/doc this is 2× the 1-bit time rather than the naive 4× because all four plane pairs share the same data loads: each 96-byte plane is read once and reused across passes.</p><p>The 4-bit kernel uses direct NEON nibble multiply with <code>vandq/vshrq</code> to split packed bytes into lo/hi nibbles, multiply, and accumulate via <code>vpaddlq_u8</code> widening adds. At 22 ns/doc, this is faster than the 16-popcount bit-plane alternative ($4^2 = 16$ plane combinations).</p><p>A mixed 4×1 kernel is the production workhorse. It precomputes the 4-bit query&#x27;s 4 bit-planes at quantize time (each 96 bytes in the same 1-bit packed layout as the document). Per-document scoring is then 4 AND+popcount passes, i.e., the RaBitQ decomposition: $\sum \text{idx4}_i \times \text{idx1}_i = 8 \cdot \text{pc}(\text{plane3} \wedge \text{bits1}) + 4 \cdot \text{pc}(\text{plane2} \wedge \text{bits1}) + \ldots$ At 14 ns/doc this is 21.3× faster than TurboQuant&#x27;s 1-bit path at 3/4 the document storage.</p><p>TurboQuant&#x27;s bottleneck is the data-dependent gather: each coordinate requires a scalar load from the centroid table to build a NEON float vector. The arithmetic (FMA) is essentially free in comparison.</p><h4>Conclusion</h4><p>TurboQuant is a theoretically elegant construction that builds directly on the OSQ formulation. The provable MSE bound, the unbiased inner-product estimator, and the clean data-oblivious design are real contributions. For applications requiring calibrated scores (not just rankings), or running on GPU hardware where gather operations are cheap, TurboQuant&#x27;s architecture is well-motivated. The calibration-free design is also a natural fit for settings where quantization must happen on the fly with zero training overhead; KV cache compression during LLM inference is a prime example. There, every vector is quantized once as it enters the cache and discarded after the forward pass, so there is no opportunity to amortize a per-vector coordinate descent. A fixed codebook derived from the known post-rotation distribution is exactly the right tool: rotate, snap, store.</p><p>But for CPU-based vector search, the setting where Elasticsearch and most operational systems execute queries, the empirical picture is clear across all three axes:</p><p><strong>MSE:</strong> TurboQuant&#x27;s advantage comes from the Hadamard rotation, not the Lloyd-Max centroids. OSQ with the same rotation matches TurboQuant at 1–2 bits and comes within 1.1× at 3–4 bits. 
OSQ&#x27;s sparse preconditioner already provides this benefit without padding overhead.</p><p><strong>Dot-product accuracy:</strong> After removing ranking-irrelevant multiplicative bias (including TQ&#x27;s $2/\pi$ scale factor at 1-bit), OSQ has 1.2–1.4× lower ranking noise than TQ @1-bit at small angles on zero-mean data even with a quantized query and without the 25% pad, thanks to the anisotropic loss concentrating accuracy along the query direction. On shifted data, the regime that matters in practice because embeddings typically have a non-zero mean, centering amplifies the advantage further: debiased noise of 0.0008 at 0° vs TQ @1-bit&#x27;s 0.0073 and even TQ @4-bit&#x27;s 0.0054. OSQ at 1-bit beats TurboQuant at 4-bit on ranking accuracy at less than 1/5 the storage. TurboQuant&#x27;s $Q_\text{prod}$ variant addresses bias explicitly but trades it for higher noise, making MSE-only the better choice for search.</p><p><strong>Throughput:</strong> 10–40× faster symmetric scoring, with the mixed 4×1 kernel at 14 ns/doc versus TurboQuant&#x27;s 293 ns/doc using NEON intrinsics. This reflects a fundamental architectural divide between integer arithmetic and lookup-table gather, not a constant factor that disappears with batching.</p><p>The uniform grid, far from being a compromise, turns out to be the right trade: it sacrifices a theoretical MSE margin that almost vanishes under equivalent rotation, and in return unlocks the integer-arithmetic pipeline that makes sub-millisecond search at scale practical.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch ES|QL views: One query to rule twelve dashboards]]></title>
            <description><![CDATA[With ES|QL views, you only need one query for multiple dashboards. Define it once and let Elasticsearch keep everything in sync.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-esql-logical-views</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tyler Perkins]]></dc:creator>
            <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852bff427ac62b79974d88e27ce9670dc132bc46-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch Query Language (ES|QL) now has <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-views">logical views</a>. Define a query once, and reference it by name in <code>FROM</code>, like an index. Twelve dashboards, one definition, zero copy-paste. Update the view, and every consumer gets the change automatically.</p><p>Views don&#x27;t store data; they re-execute on every read, so results always reflect the current data and the current definition. If you&#x27;ve used views in SQL databases, this will feel familiar. The difference: ES|QL views are engine-level virtual indices stored at the Elasticsearch cluster level, not saved query text that gets expanded client-side. They appear in <a href="https://www.elastic.co/kibana">Kibana</a> autocomplete, support <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-cross-clusters.html">cross-cluster search</a> (CCS), and are governed by dedicated role-based access control (RBAC) privileges.</p><h2>A simple view</h2><p>A view can wrap any ES|QL query. Start with a straightforward filter — HTTP 500 errors from the API gateway:</p><pre><code>PUT _query/view/error_triage
{
  "query": """
    FROM svc-gateway-*
    | WHERE http.response.status_code >= 500
    | KEEP @timestamp, http.response.status_code, url.path, source.ip
  """
}</code></pre><p>Now anyone can write <code>FROM error_triage</code> without knowing the index pattern or filter condition:</p><pre><code>FROM error_triage
| STATS error_count = COUNT(*) BY url.path
| SORT error_count DESC</code></pre><p>The query is defined once. Consumers reference a name.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/68ff5a52b0f3ed3dfaa07d2af6e7f08a8c9c0f55-1999x702.png" style="width: 100%;" /><p>Views support full create, read, list, update, and delete (CRUD) via the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-esql-put-view"><code>_query/view REST API</code></a>.</p><h2>Update propagation</h2><p>Say the team decides <code>error_triage</code> should also capture client errors, not just 500s. Update the definition in place:</p><pre><code>PUT _query/view/error_triage
{
  "query": """
    FROM svc-gateway-*
    | WHERE http.response.status_code >= 400
    | KEEP @timestamp, http.response.status_code, url.path, source.ip
  """
}</code></pre><p>Every dashboard panel, alert rule, and ad-hoc query using <code>FROM error_triage</code> immediately reflects the broader filter. No saved objects to hunt down. No stale copies. Change once, update everywhere.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aad2b22fd4d45f8db1140ae93429c6b9ca345031-1999x440.png" style="width: 100%;" /><h2>Nested views</h2><p>Views can reference other views, enabling layered abstractions. Create views for suspicious IPs and threat intelligence, and then compose them:</p><pre><code>PUT _query/view/suspicious_ips
{
  "query": """
    FROM svc-auth-*
    | WHERE event.action == "login" AND event.outcome == "failure"
    | STATS attempts    = COUNT(*),
            first_seen  = FIRST(@timestamp, @timestamp),
            latest_user = LAST(user.name, @timestamp)
        BY source.ip
    | WHERE attempts > 3
  """
}

PUT _query/view/known_threats
{
  "query": """
    FROM threat-intel
  """
}

PUT _query/view/security_overview
{
  "query": """
    FROM suspicious_ips, known_threats
  """
}
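
// Views support full CRUD via the _query/view API; assuming the read
// endpoint mirrors the PUT path above, you can fetch a stored definition
// back at any time:
GET _query/view/security_overview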

FROM security_overview
| WHERE source.ip IS NOT NULL
| EVAL is_known_threat = threat.category IS NOT NULL
| KEEP source.ip, attempts, threat.category, threat.severity, is_known_threat
| SORT is_known_threat DESC, attempts DESC</code></pre><p>Security teams query <code>FROM security_overview</code> without knowing the underlying data model. They&#x27;re also shielded from any changes made to <code>suspicious_ips</code> by its owner; the abstraction boundary is real, not syntactic.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fd8f17e0a5bd80a73737f0fe08f9e3480d70ad68-1822x642.png" style="width: 100%;" /><h2>Multisource views with subqueries</h2><p>A view can wrap any ES|QL query, including multisource compositions, using <a href="https://www.elastic.co/search-labs/blog/esql-subquery-from"><code>subqueries in FROM</code></a>. Each subquery branch queries one service independently (its own filters, its own field normalization), and the results combine automatically:</p><pre><code>PUT _query/view/all_errors
{
  "query": """
    FROM
      (FROM svc-gateway-*
       | WHERE http.response.status_code >= 500
       | EVAL service = "gateway",
              error_detail = CONCAT("HTTP ", http.response.status_code::string)
       | KEEP @timestamp, service, error_detail, source.ip),
      (FROM svc-payments-*
       | WHERE transaction.status IN ("failed", "timeout")
       | EVAL service = "payments", error_detail = transaction.status
       | KEEP @timestamp, service, error_detail, source.ip)
  """
}</code></pre><p>Consumers just write:</p><pre><code>FROM all_errors
| STATS error_count = COUNT(*) BY service
| SORT error_count DESC</code></pre><p>Two indices, two independent pipelines, one name. To add a third service later, add a third branch; existing branches don&#x27;t change, and every downstream dashboard and alert reflects the update automatically. For a deep dive on subquery syntax and what you can do inside each branch, see <a href="https://www.elastic.co/search-labs/blog/esql-subquery-from">Three Indices Walk Into a FROM Clause</a>.</p><h2>How views work under the hood</h2><p>When you write <code>FROM view_name</code>, ES|QL resolves the view&#x27;s stored query and executes it inline. Views are re-executed on every read, so results always reflect the current data and the current definition.</p><p>Views share a namespace with indices, aliases, and data streams. A view cannot have the same name as any of these (enforced at creation time). This keeps <code>FROM my_name</code> unambiguous regardless of whether the name resolves to a view, an index, or an alias.</p><h2>Security model</h2><p>Views are governed by four dedicated RBAC privileges: <code>create_view</code>, <code>read_view_metadata</code>, <code>delete_view</code>, and <code>manage_view</code>. Elasticsearch checks the privileges of the user running the query (invoker security), not the user who defined the view. The user querying a view needs permissions on both the view and its underlying indices.</p><h2>Kibana integration</h2><p>Views appear in Discover&#x27;s ES|QL editor autocomplete alongside indices. ES|QL-based dashboard panels work with views transparently. In the initial Tech Preview release, view management is API-only. A Kibana UI for creating and managing views is planned.</p><h2>Cross-cluster search</h2><p>A view&#x27;s definition can reference remote indices using <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-cross-clusters.html">CCS syntax</a>:</p><pre><code>PUT _query/view/cross_cluster_errors
{
  "query": """
    FROM cluster-west:logs-*, cluster-east:logs-*
    | WHERE log.level IN ("error", "crit")
  """
}</code></pre><p>Consumers query <code>FROM cross_cluster_errors</code> without knowing which clusters are involved.</p><h2>Current constraints</h2><p>In the Tech Preview release, view management is API-only and SET directives can&#x27;t appear inside view definitions; the caller applies them when querying. Subquery-based views can&#x27;t be nested inside other multisource <code>FROM</code> expressions. See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/views">views documentation</a> for the full list.</p><h2>What&#x27;s next for views</h2><p>Views today are always fresh; they re-execute on read. <a href="https://github.com/elastic/roadmap/issues/49">Materialized views</a> flip that tradeoff: Pre-compute once, read instantly. Think pre-aggregated rollup views for Service Level Agreement (SLA) dashboards that load in milliseconds instead of scanning raw data on every refresh. A Kibana CRUD UI for views, including a &quot;Save as View&quot; workflow in Discover, is also planned.</p><h2>Try it</h2><p>Logical views are available as a Tech Preview. Try them in <a href="https://www.elastic.co/kibana">Kibana</a> Dev Tools or Discover. We&#x27;d love your feedback; file a <a href="https://github.com/elastic/elasticsearch/issues">GitHub issue</a> with the <code>ES|QL</code> label.</p><p><em>ES|QL logical views are a Tech Preview feature. Tech Preview features are subject to change and are not covered by the support SLA of GA features. The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic Agent Builder: How we taught AI agents to manage their own context]]></title>
            <description><![CDATA[Agent Builder in Elasticsearch 9.4 ships dynamically loaded skills, a conversation context store, selective compaction, and external connectors to cut token costs by 40% and let agents handle their own context management.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-agent-builder-ai-agents-context-management</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-agent-builder-ai-agents-context-management</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Anish Mathur, Deepti Dheer, Evan Castle]]></dc:creator>
            <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae1daed5e4424866a30e20884493f859f65a9fd2-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Every developer building agents has to answer one question: what should the agent know, and when? The usual starting point is simple. Write a system prompt, connect a few tools, and the agent works. But as the scope grows, you add more instructions, sources and tools, and runs get longer. Eventually the context window fills up, fidelity drops and token costs climb. Managing context is a first-class concern. This post shares how we built context handling into agents so they can decide what to fetch, what to summarize, what to drop, and how memory should carry across steps.</p><p>Agent Builder in 9.4 makes context the agent&#x27;s problem, not yours. Skills provide reusable instructions that load on demand, so only what the current task needs is in context. Large result sets go into a conversation context store rather than sitting in the prompt. For long-running tasks, context gets compacted selectively so the agent doesn&#x27;t drift. Token and turn counts are monitored as the conversation runs. And connectors handle reaching enterprise data where it lives.</p><p>The goal of all of it is the same. Load the context you need, when you need it. Internally we&#x27;ve seen this cut token costs by up to 40%, and the agent&#x27;s context stays reliable across datasets when it would have degraded before.</p><h2>Getting agents to know what you know</h2><p>Three context problems show up again and again: managing bloated prompts, executing complex actions, and controlling enterprise data sources.</p><p>Firstly, prompts get bloated because every instruction has to live in them. Skills fix that by loading on demand and with fewer input tokens. <strong>Skills </strong>provide structured descriptions of how an agent should operate and act in a specific task. Agent Builder ships with built-in skills for common data analysis patterns, but the real value is that users can also build custom skills. A security team can encode its triage playbook as a skill. An SRE team can describe how they want root cause analysis to proceed. A developer can encode their API design conventions and error handling patterns. Skills are reusable and shareable across agents, which means a pattern that works for one team&#x27;s deployment doesn&#x27;t have to be reinvented by the next.</p><p>In practice, this looks like: a team lead defines a &quot;Summarize this incident&quot; skill with the process they care about, the severity classification their org uses, and the output format their runbook expects. Anyone on the team invokes it by typing in the chat input and selecting it from autocomplete. Skills follow the <a href="https://github.com/elastic/agent-skills">Agent Skills open format</a>, so you can pull them from a shared library, write your own, or craft skills using an agent of your choice.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ad073b40ea280077e71225df2b5747c9a5061f3-1280x766.gif" style="width: 100%;" /><p>In internal testing, we found that removing instructions from the agent prompt and placing them in dynamically loaded skills showed a 21 to 39% reduction in input token usage across test datasets. The key architectural improvement is that skills and their associated tools are loaded only when the agent needs them. 
All other skills stay as lightweight stubs with just a name and description, consuming almost no context.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2479d95570de548aad7d47f036da50e00e13ff56-1328x1176.png" style="width: 100%;" /><p><strong>Chat with your data and act on it too (dashboards, workflows, queries and more): </strong>Agentic tasks don&#x27;t stay simple for long. Agent Builder now has contextual awareness of objects in Kibana. With agentic dashboard creation, a user can describe what they want to see in plain language, and the agent generates a dashboard with panels, visualizations, queries, and everything that&#x27;s needed. Users can refine it conversationally: &quot;break that out by region,&quot; &quot;add a filter for the last 7 days,&quot; &quot;swap the bar chart for a line chart.&quot;</p><p>Dashboards, alerts, and rules also work as inputs. Once a dashboard exists, it can be pulled into the agent&#x27;s context. This unlocks the &quot;act&quot; side of agents. Once a dashboard or alert is in context, the agent can modify it, extend it, or create new ones. The agent can reason about what the data shows, suggest follow-up analyses, or modify the dashboard based on what it sees. It&#x27;s a feedback loop: the user describes intent, the agent produces a visual artifact, and both the user and the agent can reason over that artifact together.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0a931862b202e7293c44b1e9fbcde199480a5187-1280x766.gif" style="width: 100%;" /><p>For business analysts and operations teams, this collapses the gap between &quot;I have a question about the data&quot; and &quot;I have a dashboard I can share with my team,&quot; reducing hours of manual work to a few minutes of conversation.</p><p>Lastly, using enterprise data for context creates governance problems you didn&#x27;t ask for. <strong>Connectors</strong> close the loop for data that lives outside Elastic. We added prebuilt OAuth-based connectors for sources like Google Drive, Salesforce, and Slack. The design principle here is worth calling out: data stays at the source. The agent searches data via the connector with the user&#x27;s own permissions enforced. Agents do not accumulate copies of enterprise data in new locations just to be able to answer questions about it.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/67d8d0c028611faa04d84ce5dbab8078fbf30fa3-1280x767.gif" style="width: 100%;" /><p>This matters more than it might seem. Enterprise data governance isn&#x27;t just a compliance checkbox; it&#x27;s load-bearing infrastructure that most teams don&#x27;t notice until it fails. When agents start routing around it, accumulating copies in vector stores and context windows, you&#x27;ve quietly created a new class of data sprawl that your security team didn&#x27;t sign off on, and your audit logs don&#x27;t capture. The connector approach eliminates this risk by constraint: if the data never moves, it can&#x27;t end up somewhere it shouldn&#x27;t. The user&#x27;s permissions travel with every query because the query goes to the source, not to a cached copy. You get agents that are genuinely useful on enterprise data.</p><h2>Ensure agents don&#x27;t outgrow the context window</h2><p>Giving agents too much context creates a new problem. 
A security analyst investigating a complex threat might pull in dozens of alerts, correlate across multiple indices, and go back and forth with the agent for twenty or thirty turns. At some point, you&#x27;re pushing past what the context window can hold and degrading the quality of the model&#x27;s responses. Each retrieval call also adds latency to the user&#x27;s request and pushes infrastructure costs higher, and a single user interaction can trigger dozens of these calls.</p><p><strong>We built a context store for retrieval results</strong>. As the agent retrieves data from indexes, the results can grow large and crowd the context window. We introduced a temporary store that holds the results of a query in an in-memory &quot;file store&quot; and only pulls the results into the active context when needed. This allows conversations to extend and deal with multiple related data sets without blowing out the context. We are also optimizing the retrieval results themselves, applying top snippets retrieval, which demonstrated a 27 to 34% reduction in token usage.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1ca179ece56a059ebc05b4fc44b48d25c6bc2679-1292x1278.png" style="width: 100%;" /><p>We also added intelligent context compaction for longer interactions: as a conversation progresses, the agent manages what stays in the active context and what gets compressed into a summary that can be retrieved if needed. This isn&#x27;t a simple truncation; it&#x27;s selective compaction that preserves the information most likely to matter for the next turn.</p><p>This enables agents to handle larger result sets, more complex queries, and longer conversations without the token cost scaling linearly with every turn. With context compaction, the context window remains bounded even for chats with 30 or more turns, rather than quickly ballooning to max size.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0541cdd4b21eaf45b92fc90e747f49473aefe65d-983x502.png" style="width: 100%;" /><p>For teams running multi-step investigations or summarizations, this is the difference between an agent that stays coherent through turn thirty and one that starts contradicting itself at turn twelve.</p><p><strong>Monitoring: </strong>In 9.4, we also shipped monitoring for agents to track token usage, with an API available to monitor conversation turns and tool calls. This matters because agents aren&#x27;t static. Their behavior shifts based on the context they receive and the tools they call, and without visibility into those patterns, optimizing cost and performance is guesswork.</p><h2>Agentic consumption model</h2><p>To support these new capabilities, we&#x27;re introducing an agent pricing model that directly aligns cost with the value users gain from their agents as they scale. Agent Builder usage will be measured by Executions. Executions are free for the first 1,000 each month in Elasticsearch and 10,000 in Elastic Security and Observability projects.</p><p>An Agent Builder execution represents a completed round of interaction with the agent. In most cases, sending a chat message and receiving a successful response from the agent counts as one execution. For messages that demand significant processing, it will be calculated as multiple executions based on the total number of input tokens required, grouped into 50,000 input token units. For example, a deep investigative task that requires 130,000 input tokens will be billed as 3 executions. 
This model ensures your consumption aligns with the value your agents deliver and becomes more cost-effective as your agents achieve greater context efficiency.</p><h2>Where are we going with agents</h2><p>Agents that can optimize context over operational data need the same kind of careful context engineering that we&#x27;ve spent years applying to search relevance. Getting the right information in front of the model at the right time and at the right level of detail is the new retrieval problem. These capabilities are foundational to making agents more reliable, scalable, and cost-efficient as they grow.</p><p>Get started with an <a href="https://cloud.elastic.co/registration?onboarding_token=search&amp;pg=en-enterprise-search-page">Elastic Cloud Trial</a>, and check out the documentation <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">here</a>. For existing customers, Agent Builder is available in Cloud Serverless and on the Enterprise Tier in Elastic Cloud Hosted and self-managed.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch percolator for ecommerce search governance: translating ambiguous queries into controlled retrieval strategies]]></title>
            <description><![CDATA[Learn how to use the Elasticsearch percolator to implement search governance. In this blog, we outline the patterns needed to create a governed policy engine in production and create a controlled retrieval strategy.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e485cdd52d78419ff0ac30a4192b953f6d70c61-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>This post is a technical deep dive into the Elasticsearch implementation of the control plane architecture described in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>, showing how to build it using the Elasticsearch percolator. It outlines the patterns used to implement a deterministic, governed policy engine in production.</p><h2><strong>From architecture to implementation</strong></h2><p><a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> described the control plane architecture: reverse matching as a lookup primitive, policy documents that separate match from action, and cascading transformations that compose multiple policies into a single execution plan. This post goes hands-on with the Elasticsearch feature that powers the policy lookup: the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-percolate-query">percolator query</a>.</p><p>The percolator is a natural fit for governance because it inverts the direction of search in exactly the way a control plane needs. This post walks through the implementation step by step, starting with a clear explanation of what the percolator does and why it matters, and then moving through index design, policy storage, query-time evaluation, and multi-policy composition.</p><h2><strong>How normal search works</strong></h2><p>In an ecommerce system, you may have hundreds of thousands or millions of product documents containing fields such as <code>title</code>, <code>category</code>, and <code>price</code>. When a user searches for matching documents, you&#x27;re asking Elasticsearch to compare the user’s search string against one or more stored fields in these product documents. Elasticsearch&#x27;s default analyzer, <a href="https://www.elastic.co/docs/reference/text-analysis/analysis-standard-analyzer">the standard analyzer</a>, lowercases text and splits it into tokens. A search for “oranges” matches “Oranges” because of lowercasing. With a language-aware analyzer that includes stemming, it also matches “orange” because both forms reduce to the same stem. For example, the following <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query">match query</a> returns documents that have “orange” or “oranges” in their <code>“title”</code> field.</p><pre><code>POST products/_search
{
  "query": {
    "match": {
      "title": "oranges"
    }
  }
}</code></pre><p>So for the above query, Elasticsearch returns the product documents whose <code>title</code> field matches &quot;oranges&quot;, which could include results such as &quot;Orange Fruit Spread&quot;, &quot;Orange Juice&quot;, &quot;Juicy oranges&quot;, &quot;Orange Marmalade&quot;, and so on. The key point to remember is that Elasticsearch is commonly used to compare a search string against documents and to return the documents that match the search string.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba758f25616f2106d245ce0d47926c174766e028-642x318.png" style="width: 100%;" /><h2><strong>The governance problem: Finding relevant policies before searching for products</strong></h2><p>As established in <a href="https://www.elastic.co/search-labs/blog/series/governed-search-patterns">Parts 1 through 3</a>, a governed search system does not send the user&#x27;s search string directly to the product catalog. First, it checks whether any policies apply to that search string.</p><p>A merchandiser has decided that when someone searches for exactly &quot;oranges&quot;, results should be restricted to the Fruits category, eliminating orange juice, orange marmalade, and orange soda. That business decision is stored as a policy. When a user types &quot;oranges&quot;, the control plane needs to find that policy, read its instructions, and modify the search against the product catalog accordingly. In order to do this, the control plane needs to figure out which stored policies are relevant for this search string.</p><p>An enterprise deployment might have hundreds or thousands of such policies. Checking them one by one with if/else logic is the application-layer anti-pattern described in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">Part 2</a>. What we need is a way to store all of those policies in an index and instantly find the ones that match a given search string. This is where the percolator comes in.</p><h2><strong>Flipping the direction: The percolator</strong></h2><p>We previously mentioned that in a normal search, Elasticsearch is commonly used to compare a search string against documents and to return the documents that contain that search string.</p><p>The percolator inverts this. With a percolator, you have an index where each document stores a query pattern, and then an incoming search string is checked against these stored queries to determine which of the stored query patterns match.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1e6348531d1c0be57b385f51d248488cf58489ff-642x279.png" style="width: 100%;" /><p>For governance, the &quot;stored query patterns&quot; are policies. Each policy contains a pattern that describes the kind of search string it should match. For example, does the search string exactly match &quot;oranges&quot;, or does the search string contain &quot;olive oil&quot;? The incoming string is the user&#x27;s search text, which arrives at query time and needs to be checked against all stored policy patterns. This is covered in a <a href="https://youtu.be/Ap5K2Y00Xjc?t=246">related PRISM video at 4:09</a>.</p><h2>Step by step: How a search for &quot;oranges&quot; finds its policy</h2><h3>The policy</h3><p>A merchandiser has authored a policy that matches if a user searches for exactly &quot;oranges&quot; without any other words. 
Once the percolator matches, the remainder of the document includes the rules that the control plane will use to build the Product query; in this example, one of the rules is to restrict (filter) results to the Fruits category.</p><pre><code>{
  "percolator": {
    "match_phrase": { "query": "START oranges END" }
  },
  "rule_type": "filter",
  "rule_args": {
    "filters": [
      {
        "field": "categories",
        "values": ["Fruits"],
        "mode": "hard_filter",
        "on_conflict": "soft_boost",
        "on_conflict_boost_weight": 1.0
      }
    ]
  },
  "priority": 0,
  "enabled": true
}</code></pre><p>The <code>percolator</code> field contains the pattern that defines when this policy should fire. In this case, it matches the phrase <code>&quot;START oranges END&quot;</code>. The <code>rule_type</code> and <code>rule_args</code> fields define what the policy should do when it fires. The <code>START</code> and <code>END</code> tokens are boundary markers, which we will explain shortly.</p><p>You can see how a policy is authored in the PRISM Studio UI at <a href="https://youtu.be/Ap5K2Y00Xjc?t=172">2:52 of the related PRISM video</a>.</p><h3>The user searches</h3><p>A shopper types &quot;oranges&quot; into the search bar.</p><h3>The control plane checks for matching policies</h3><p>Before searching the product catalog, the control plane intercepts the user search string, wraps it in boundary markers, and sends it to the percolator:</p><pre><code>POST policies/_search
{
  "query": {
    "percolate": {
      "field": "percolator",
      "document": {
        "query": "START oranges END"
      }
    }
  }
}</code></pre><p>The string <code>&quot;START oranges END&quot;</code> is checked against all stored policy patterns. Internally, Elasticsearch runs the stored policy patterns against this string and returns the ones that match. That&#x27;s the percolator. No if/else chains. No sequential evaluation. The index handles the matching.</p><h3>The control plane applies the policy</h3><p>The control plane reads the matched policies&#x27; actions. The above policy instructs the control plane to restrict results to the Fruits category. The control plane builds the final Elasticsearch query against the product catalog as follows:</p><pre><code>POST products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "oranges" } }
      ],
      "filter": [
        { "terms": { "categories": ["Fruits"] } }
      ]
    }
  }
}</code></pre><p>The user searched for &quot;oranges&quot;. The product catalog receives a query for &quot;oranges&quot; constrained to the Fruits category. Because of this constraint, orange juice, orange marmalade, and orange soda are excluded.</p><h3>Why &quot;orange marmalade&quot; does NOT trigger the oranges policy</h3><p>Suppose a different user searches for &quot;orange marmalade&quot;. The control plane wraps the string and percolates: <code>&quot;START orange marmalade END&quot;</code>. The oranges policy&#x27;s pattern is <code>match_phrase: &quot;START oranges END&quot;</code>. The oranges policy does not match and therefore the policy isn&#x27;t applied, and the results aren&#x27;t constrained to the Fruits category.</p><p>This is the purpose of the <code>START</code> and <code>END</code> boundary markers. Without them, a policy that matches on the word &quot;oranges&quot; could accidentally fire on a query like &quot;orange marmalade&quot;. By wrapping the user&#x27;s search string with <code>START</code> and <code>END</code> and including those markers in the policy&#x27;s pattern, we ensure that the policy only fires when &quot;oranges&quot; is the complete search string, without any other words. This matches both the shopper&#x27;s and the merchandiser&#x27;s intent.</p><h2>A second policy: &quot;olive oil&quot; on the stemmed field</h2><p>Not every policy needs an exact string match. The &quot;olive oil&quot; policy matches on a stemmed field, so it fires regardless of minor word-form variations:</p><pre><code>{
  "percolator": {
    "bool": {
      "should": [
        { "match_phrase": { "query.stemmed": "START olive oil END" } }
      ]
    }
  },
  "rule_type": "filter",
  "rule_args": {
    "filters": [
      {
        "field": "categories",
        "values": ["Olive oils"],
        "mode": "hard_filter",
        "on_conflict": "soft_boost",
        "on_conflict_boost_weight": 1.0
      }
    ]
  },
  "priority": 300,
  "enabled": true
}</code></pre><p>This policy&#x27;s pattern matches against <code>query.stemmed</code> instead of <code>query</code>. When the user&#x27;s search string arrives, it’s stored in both a <code>query</code> field (the exact text) and a <code>query.stemmed</code> field (analyzed with a stemming analyzer that reduces words to their stems, so &quot;olives&quot; and &quot;olive&quot; both reduce to the same stem, as do &quot;oils&quot; and &quot;oil&quot;). The policy&#x27;s pattern is checked against the stemmed version of the string, so it fires regardless of minor word-form variations.</p><p>The <code>START</code> and <code>END</code> boundary markers work on the stemmed field, as well, ensuring this policy only fires when &quot;olive oil&quot; is the entire search string, not when it appears as part of something longer.</p><p>The rest of this post covers the implementation details that make this production-ready: the index mapping that supports both matching modes, how highlights drive phrase removal and consumed phrase tracking, and how multiple conflicting policies compose into a single execution plan.</p><h2><strong>The policy index mapping</strong></h2><p>The policy index needs a percolator field to hold stored query patterns and a text field that mirrors the structure of the incoming search string the percolator will match against. The mapping below is simplified for clarity. A production deployment is more complex, using custom analyzers to handle boundary markers, variable pattern matching (for example, recognizing that &quot;under $4&quot; contains a currency value), and other kinds of analysis.</p><pre><code>PUT policies
{
  "settings": {
    "analysis": {
      "analyzer": {
        // A minimal "stemming" analyzer so this simplified example runs as-is;
        // production deployments use richer custom analyzers (see above).
        "stemming": {
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "percolator": {
        "type": "percolator"
      },
      "query": {
        "type": "text",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "stemming"
          }
        }
      },
      "rule_type": { "type": "keyword" },
      "rule_args": { "type": "object", "enabled": false },
      "priority": { "type": "integer" },
      "enabled": { "type": "boolean" }
    }
  }
}</code></pre><p>The index is named <code>policies</code> because each document represents a complete governed policy as defined in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">Part 2</a>. This includes match criteria, action, priority, and metadata. The <code>rule_type</code> and <code>rule_args</code> fields contain the action component of the policy, which contains the instructions that the control plane will use to compose the query for execution against the product catalog.</p><p>The <code>query</code> field is the string that the percolator matches against. It has two variants: an exact version and a stemmed version. When the user&#x27;s search string arrives, it&#x27;s placed into this field in the temporary in-memory index. Policies that match on <code>query</code> see the exact string; policies that match on <code>query.stemmed</code> see the stemmed version.</p><h2><strong>Percolating with highlights, filtering, and sorting</strong></h2><p>The simple examples above showed minimal percolation requests. In practice, the control plane adds highlighting, filters disabled policies, and sorts by priority:</p><pre><code>POST policies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "percolate": {
            "field": "percolator",
            "document": {
              "query": "START olive oil END"
            }
          }
        },
        {
          "term": { "enabled": true }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "query": {
        "matched_fields": ["query.stemmed"]
      }
    }
  },
  "sort": [
    { "priority": { "order": "desc" } }
  ]
}</code></pre><p>The highlight configuration uses <code>&quot;query&quot;</code> as the field key with <code>&quot;query.stemmed&quot;</code> in <code>matched_fields</code>. This tells Elasticsearch&#x27;s unified <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/highlighting">highlighter</a> to return highlights on the parent <code>query</code> field but to also consider matches from the <code>query.stemmed</code> subfield when determining which tokens to highlight. This is what allows a policy that matches on the stemmed field to still produce accurate highlight spans on the original text, which the control plane needs for phrase removal and consumed phrase tracking.</p><p>The <code>enabled: true</code> filter ensures that disabled policies are skipped. The <code>sort</code> on priority ensures that higher-priority policies are returned first, so the control plane can process them in the correct order for cascading transformations. The <code>highlight</code> field is the most important addition; it tells us exactly which words in the user&#x27;s search string triggered each match.</p><p>The response for an &quot;olive oil&quot; search may look as follows:</p><pre><code>{
  "hits": {
    "hits": [
      {
        "_id": "en_2c3021c8",
        "_source": {
          "rule_type": "filter",
          "rule_args": {
            "filters": [
              {
                "field": "categories",
                "values": ["Olive oils"],
                "mode": "hard_filter",
                "on_conflict": "soft_boost",
                "on_conflict_boost_weight": 1.0
              }
            ]
          },
          "priority": 300
        },
        "highlight": {
          "query": ["<em>START olive oil END</em>"]
        }
      }
    ]
  }
}</code></pre><h2><strong>Why highlights matter</strong></h2><p>Notice the highlight in the response: <code>&quot;&lt;em&gt;START olive oil END&lt;/em&gt;&quot;</code>. Elasticsearch is telling us exactly which words in the user&#x27;s search string caused the policy to match. This isn’t cosmetic. The highlight metadata drives two critical downstream behaviors:</p><p><strong>Phrase removal.</strong> Some policies need to remove the matched text from the search string before constructing the product catalog query. For example, a policy that matches on &quot;cheap&quot; removes that word and converts it into a price filter instead. The highlight identifies exactly which span of the search string the policy matched, so the system knows what to remove.</p><p><strong>Consumed phrase tracking.</strong> As described in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>, when multiple policies match the same search string, a higher-priority policy might remove words that a lower-priority policy also matched on. By comparing each policy&#x27;s highlight against the current (evolving) search string, the system can detect that a phrase has been consumed and skip the lower-priority policy. This prevents double-processing and ensures deterministic behavior.</p><p>You can learn more about how highlighting works in <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/how-es-highlighters-work-internally">this article</a>.</p><h2><strong>From percolation to execution plan</strong></h2><p>The percolator returns a set of matching policies. But as <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> described, the lookup is only half the story. The other half is composing those matches into a coherent execution plan. Here’s what that looks like for a concrete query.</p><h3><strong>Worked example: &quot;Cheap chocolate&quot; during a Christmas campaign</strong></h3><p>Suppose the system has two active policies: the &quot;Cheap chocolate&quot; policy (priority 210) and the &quot;Christmas chocolates&quot; policy (priority 300), both described in detail in <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>.</p><p><strong>Step 1: Percolate.</strong> The user searches for &quot;cheap chocolate.&quot; The control plane wraps the search string as <code>&quot;START cheap chocolate END&quot;</code> and sends it to the percolator. Two policies match: The &quot;Cheap chocolate&quot; policy&#x27;s pattern matches on the phrase &quot;cheap chocolate&quot;; and the &quot;Christmas chocolates&quot; policy&#x27;s pattern matches on &quot;chocolate&quot; via the stemmed field.</p><p><strong>Step 2: Sort by priority.</strong> The percolator returns both policies, sorted by priority in descending order. 
<p><strong>Step 3: Apply the cascading transformation.</strong> This is the <code>initial state → [Policy A] → state&#x27; → [Policy B] → state&#x27;&#x27; → execution plan</code> model from <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a>.</p><p>The “Christmas chocolates” policy (priority 300) applies first:</p><ul><li>Adds a category hard filter: &quot;Christmas foods and drinks,&quot; &quot;Christmas sweets&quot;.</li><li>Adds a price filter: less than $7.</li><li>Adds a category soft boost: &quot;Advent calendars&quot; (3x).</li></ul><p>The “Cheap chocolate” policy (priority 210) applies next against the modified state:</p><ul><li>Attempts to add a category hard filter: &quot;Chocolates,&quot; &quot;Milk chocolates&quot;; but the Christmas policy already set this field with <code>on_conflict: override</code>, so the Cheap chocolate categories are dropped.</li><li>Attempts to add a price filter of $2. The Christmas policy set <code>on_conflict: restrict</code> for price, and $2 is more restrictive than $7, so $2 wins.</li><li>Removes &quot;cheap&quot; from the search string.</li></ul><p><strong>Step 4: Build the Elasticsearch query.</strong> The control plane assembles the execution plan into a single Elasticsearch query against the product catalog:</p><pre><code>POST products/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "chocolate" } }
          ],
          "filter": [
            { "terms": { "categories": ["Christmas foods and drinks", "Christmas sweets"] } },
            { "range": { "price": { "lt": 2 } } }
          ]
        }
      },
      "functions": [
        {
          "weight": 1
        },
        {
          "filter": { "terms": { "categories": ["Advent calendars"] } },
          "weight": 3
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}</code></pre><p>The original search string was &quot;cheap chocolate&quot;. The query that reaches the product catalog is a governed, intent-aware retrieval plan: The word &quot;cheap&quot; has been consumed and converted into a price constraint, results are restricted to Christmas seasonal categories, Advent calendar products receive a ranking boost, and the price ceiling reflects the more restrictive value from the lower-priority policy. Every transformation is deterministic, traceable, and explainable.</p><p>For a quick overview of how these multipliers interact with the base BM25 score, see <a href="https://youtu.be/Ap5K2Y00Xjc?t=525">8:45 in the related PRISM video</a>, where we briefly discuss multiplicative boosts.</p><h2><strong>Why this scales</strong></h2><p>The percolator is efficient for this use case because of the asymmetry: An enterprise ecommerce system might have millions of products but only hundreds or thousands of governance policies. The percolator is checking one incoming search string against that set of stored policy patterns, not scanning the full product catalog. The cost is proportional to the number of policies, and Elasticsearch applies internal optimizations (indexing terms from stored query patterns, short-circuiting Boolean logic) to keep matching fast.</p><p>Adding a new policy is just indexing a new document. Disabling one is a field update. No code changes, no deploys, no restarts.</p><h2><strong>From lookup to governed retrieval</strong></h2><p>The percolator provides the fast reverse-matching primitive that makes the control plane architecture from <a href="http://elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">Part 3</a> practical at scale. Policies are data: stored, indexed, and efficiently matched against incoming search strings. The control plane composes matching policies into a governed execution plan through the cascading transformation and per-field conflict resolution described in Part 3. And the retrieval engine executes the governed execution plan against the product catalog.</p><p>The result is a system where a merchandiser can author a new policy without touching application code, test it against representative queries, promote it to production, and immediately see the effect. The percolator makes the policy lookup fast; the control plane makes the policy composition deterministic; and the governed workflow makes the whole process safe.</p><h2><strong>What&#x27;s next in this series</strong></h2><p>The next post in this series extends the governed control plane into new territory. It introduces a <strong>multi-tier search architecture</strong>, explaining how to orchestrate strict, relaxed, and semantic retrieval while maintaining stable pagination and facets.</p><h2><strong>Put governed ecommerce search into practice</strong></h2><p>The percolator-based control plane described in this post, from index mappings and boundary markers to highlight-driven phrase tracking and cascading policy composition, was built by Elastic Services Engineering as part of our repeatable ecommerce search accelerators. Every query example and policy structure shown here comes from a working system validated against enterprise-scale product catalogs.</p><p>If you want to implement a governed, policy-driven control plane on Elasticsearch, Elastic Services can get you there faster.
Contact <a href="https://www.elastic.co/consulting">Elastic Professional Services</a>.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to measure and improve Elasticsearch search recall: from 0.43 to 0.75 with hybrid search]]></title>
            <description><![CDATA[Learn how to measure and improve search recall in Elasticsearch by combining BM25 lexical search with Jina AI vector embeddings, using the rank_eval API to validate the improvement with real numbers.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-relevance-tuning-improve-recall</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-relevance-tuning-improve-recall</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/492c9b5432a2b9e40cebb3b60f0df019a8c7bf6d-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/solutions/search/full-text">Lexical search</a> using the <a href="https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch">BM25 ranking algorithm</a> is cheap, fast, and very effective for a wide range of queries. But it has a blind spot: queries that don&#x27;t share tokens with your documents. In this article, you’ll measure exactly where BM25 falls short. We&#x27;ll use Elasticsearch&#x27;s <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval">ranking evaluation API</a> (<code>rank_eval</code>) and close that gap by adding <a href="https://www.elastic.co/search-labs/es/blog/jina-embeddings-v3-elastic-inference-service">Jina AI embeddings</a> through <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service</a> (EIS). You’ll see the recall score go from <code>0.43</code> to <code>0.75</code> and understand why.</p><h2>What is recall?</h2><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-recall">Recall</a> measures, on a scale from <code>0</code> to <code>1</code>, how many of the documents that your users actually want appear somewhere in your search results. If a query should surface three products and your search returns only two of them in the top 10, <code>recall@10 = 0.67</code> for that query. It’s a set-based metric: It doesn’t care about the position of the relevant documents within those <em>k</em> results. A relevant document in position 10 counts the same as one in position 1. Having a high recall means that you’re not losing relevant results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b13af2a5d0ca055535d8bfe3dfe4b3d1093ee6da-1457x796.png" style="width: 100%;" /><p>The diagram shows two sets: all relevant documents (left) and what BM25 actually retrieved (top 10, right). Only the intersection counts toward recall: <code>prod_1</code> and <code>prod_2</code> were found, while <code>prod_3</code>, <code>prod_4</code>, and <code>prod_6</code> were missed entirely. Result: <code>Recall@10 = 2/5 = <strong>0.40</strong></code>.</p><h2>Prerequisites</h2><p>Let&#x27;s get down to business to better understand how recall works. This demonstration uses Python. You can follow along in the companion notebook (<a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/relevance-tuning-improving-recall-adding-vectors/notebook.ipynb">notebook.ipynb</a>), where every code block is a cell ready to run.</p><p>The code provided uses the following:</p><ul><li>Elasticsearch 9.3+</li><li>Python 3.10+</li></ul><pre><code>pip install elasticsearch pandas plotly python-dotenv</code></pre><ul><li>A <code>.env</code> file with your Elasticsearch credentials</li></ul><pre><code>ELASTICSEARCH_URL=https://your-cluster-url
ELASTICSEARCH_API_KEY=your-api-key</code></pre><h2>The dataset</h2><p>We’ll use a product catalog of 1,000 products spanning categories such as footwear, electronics, and tools.</p><p>Each document has four fields:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Field</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title`</td><td style="border: 1px solid #ccc; padding: 4px;">text</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`description`</td><td style="border: 1px solid #ccc; padding: 4px;">text</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`brand`</td><td style="border: 1px solid #ccc; padding: 4px;">keyword</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`category`</td><td style="border: 1px solid #ccc; padding: 4px;">keyword</td>
                    </tr>
              </tbody>
            </table>
          <p>The dataset is loaded from <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/relevance-tuning-improving-recall-adding-vectors/dataset.csv"><code>dataset.csv</code></a>.</p><h2>The power and limits of lexical search</h2><p>BM25 is the default ranking algorithm in Elasticsearch and most search engines. It scores documents by how often your query terms appear in them, adjusted for document length and the frequency of those terms across the entire index. You get <a href="https://www.elastic.co/docs/reference/text-analysis/analyzer-reference">analyzers</a> on top: lowercase normalization, stemming, and stopword removal. A query for &quot;running shoes&quot; will match &quot;Running Shoes&quot; and likely &quot;run&quot; as well.</p><p>This works well for a large class of queries:</p><ul><li>&quot;running shoes&quot; immediately matches products with those exact tokens in the title.</li><li>&quot;bluetooth speaker&quot; surfaces portable audio products because the tokens appear verbatim.</li></ul><p>The results are deterministic and explainable: A document ranks highly because the query terms appear in it. Debugging relevance is straightforward.</p><h3>Where it breaks</h3><p>Now let’s try these queries against the same catalog:</p><ul><li><strong>&quot;skincare routine&quot;:</strong> The word &quot;routine&quot; doesn’t appear in any product title. BM25 can partially match on &quot;skincare,&quot; but face serums, body oils, and moisturizers are described using terms like &quot;vitamin C,&quot; &quot;retinol,&quot; or &quot;brightening,&quot; none of which overlap with the query. Products that form a complete skincare routine are scattered across the index with no shared token to anchor them.</li></ul><pre><code>ID: B06XX6DS3P, Score: 9.0552, Title: Replenix Retinol Smooth + Tighten Body Lotion - Collagen-Boosting, Regenerating Anti-Aging Body Cream, Reduces Appearance of Stretch Marks, 6.7 oz.

ID: B08XMPKJ1L, Score: 5.2699, Title: Bio-Oil Skincare Body Oil (Natural) Serum for Scars and Stretchmarks, Face and Body Moisturizer Hydrates Skin, with Organic Jojoba Oil and Vitamin E, For All Skin Types, 6.7 oz

ID: B01CY764KQ, Score: 5.0057, Title: Nike Up Or Down Men Deodorant - Pack of 2 | Long-Lasting Fragrance, Body Spray Combo for Men | Deodorant for Active Living | Nike Men's Deo Set | Ultimate Odor Protection | Grooming Essentials | Signature Nike Scent | High-Performance Men's Deodorant</code></pre><ul><li><strong>&quot;pet travel accessories&quot;:</strong> This is a use-case grouping, not a product category. A dog sling carrier, a pet car seat, and a travel crate are all relevant, but their descriptions talk about portability, safety, and comfort rather than &quot;travel accessories.&quot; BM25 matches &quot;pet&quot; broadly but has no signal to distinguish travel-specific products from the rest of the pet catalog.</li></ul><pre><code>ID: B0BVV7BKTW, Score: 7.4371, Title: Large Foldable Travel Duffel Bag with Shoes Compartment

ID: B07TNPHYNV, Score: 6.6455, Title: 40 Pieces Christmas Bronze Jingle Bells Craft Small Bells

ID: B08R8FRW53, Score: 6.6335, Title: CUBY Dog and Cat Sling Carrier

ID: B08QMCQYGM, Score: 6.5259, Title: YTFGGY Whiteboard Pinstripe Tape 6 Rolls 1/8"

ID: B0CP3LQSWM, Score: 6.2994, Title: Portable Dog Water Bottle 32 Oz</code></pre><p>This is a <strong>recall problem</strong>. The relevant documents exist in your index. BM25 just cannot find them because the user&#x27;s words and the document&#x27;s words do not match closely enough.</p><p>Adding synonyms helps for known cases. But you cannot enumerate every way a user might express an intent. That is where vectors come in.</p><h2>Why you should measure recall</h2><p>Before fixing a problem, you need to quantify it.</p><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-recall"><strong>Recall@k</strong></a> measures how many of the documents that your users actually want appear somewhere in the top k results. Formally:</p><pre><code>Recall@k = (relevant documents found in top k) / (total relevant documents)</code></pre><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/search-rank-eval#k-precision"><strong>Precision@k</strong></a> measures how many of the top k results are actually relevant:</p><pre><code>Precision@k = (relevant documents in top k) / k</code></pre><p>High precision means that the results you do return are good. In ecommerce, missing a relevant product (low recall) is often worse than showing a slightly imperfect result (lower precision), because a hidden product is a lost sale.</p><p>Elasticsearch&#x27;s <code>rank_eval</code> API lets you measure both systematically. You provide a list of queries, each with a set of rated documents, and Elasticsearch computes the metrics for you across all queries.</p><h2>Setting up the evaluation</h2><p>The <code>rank_eval</code> API needs a <strong>ratings dataset</strong>: a mapping of queries to the documents that are relevant for each one, along with a relevance grade (0 = not relevant, 1 = relevant, 2 = highly relevant).</p><p>In the notebook, this is the <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr#learning-to-rank-judgement-list">judgments list</a>:</p><pre><code>judgments = [
    # Query 1: "running shoes" BM25 handles well (tokens appear in product titles) 
    {"query_id": "q1", "doc_id": "B09NQJFRW6", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B08JMD4LMM", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B08VRJ6F2Q", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B07S8NRRWR", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B01HD620I8", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B07DX86321", "grade": 2, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B0968YVLQ8", "grade": 1, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B093QJ39ZS", "grade": 1, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B096FGSC39", "grade": 1, "query": "running shoes"},
    {"query_id": "q1", "doc_id": "B01GVQWVV2", "grade": 1, "query": "running shoes"},

    # Query 2: "skincare routine" intent-based, "routine" never appears in product titles
    {"query_id": "q2", "doc_id": "B08XMPKJ1L", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B0BN3WQB92", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B0BT7B7P5T", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B00NPA2WEY", "grade": 2, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B06XX6DS3P", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B07PDRD1KT", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B074J7869B", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B08JV31QW4", "grade": 1, "query": "skincare routine"},
    {"query_id": "q2", "doc_id": "B00K3TVJMQ", "grade": 1, "query": "skincare routine"},

    # Query 3: "study desk setup" intent-based, products are desks/stands/organizers
    {"query_id": "q3", "doc_id": "B08CS35J2T", "grade": 2, "query": "study desk setup"},
    {"query_id": "q3", "doc_id": "B09B3LFDXJ", "grade": 2, "query": "study desk setup"},
    {"query_id": "q3", "doc_id": "B07W58LMND", "grade": 1, "query": "study desk setup"},
    {"query_id": "q3", "doc_id": "B0CHYDX91L", "grade": 1, "query": "study desk setup"},

    # Query 4: "pet travel accessories" use-case grouping, products are carriers/crates/seats
    {"query_id": "q4", "doc_id": "B08R8FRW53", "grade": 2, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B01MYUYX33", "grade": 2, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B003C5RKE4", "grade": 2, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B09GF8GBF6", "grade": 1, "query": "pet travel accessories"},
    {"query_id": "q4", "doc_id": "B0CP3LQSWM", "grade": 1, "query": "pet travel accessories"},
]</code></pre><p>The mix is intentional: <code>q1</code> is a query that BM25 handles well (exact tokens in product titles), while <code>q2</code>, <code>q3</code>, and <code>q4</code> are intent-based queries, where what the user wants is expressed as a concept rather than as specific product keywords.</p><h2>Measuring BM25 baseline recall</h2><p>First, set up the Elasticsearch client. (The index itself is created and populated in the vector search section below; both evaluations run against that same index.)</p><pre><code>import os
import json
import pandas as pd
import plotly.graph_objects as go
from elasticsearch import Elasticsearch, helpers
from dotenv import load_dotenv

load_dotenv()

es = Elasticsearch(
    os.getenv("ELASTICSEARCH_URL"),
    api_key=os.getenv("ELASTICSEARCH_API_KEY")
)

INDEX_NAME = "ecommerce-products"</code></pre><p>Now build the <code>rank_eval</code> request for BM25. Each request in the list combines a query with its ratings:</p><pre><code>judgments_df = pd.DataFrame(judgments)

bm25_requests = []
for query_id, query_text in (
    judgments_df[["query_id", "query"]].drop_duplicates().values
):
    relevant_docs = judgments_df[judgments_df["query_id"] == query_id]
    ratings = [
        {"_index": INDEX_NAME, "_id": row["doc_id"], "rating": row["grade"]}
        for _, row in relevant_docs.iterrows()
    ]

    bm25_requests.append({
        "id": query_id,
        "request": {
            "query": {
                "multi_match": {
                    "query": query_text,
                    "fields": ["title", "description"]
                }
            }
        },
        "ratings": ratings,
    })

bm25_eval = {
    "requests": bm25_requests,
    "metric": {"recall": {"k": 10, "relevant_rating_threshold": 1}},
}

bm25_result = es.rank_eval(index=INDEX_NAME, body=bm25_eval)
print("BM25 Recall@10:", bm25_result.body["metric_score"])</code></pre><p>Result:</p><pre><code>BM25 Recall@10: 0.43</code></pre><p><code>0.43</code> means that across all four queries, BM25 finds only 43% of the documents it should find. The shortfall is concentrated in the intent-based queries: &quot;skincare routine&quot; misses face serums and body oils because &quot;routine&quot; never appears in product titles, and &quot;pet travel accessories&quot; retrieves off-topic pet products while missing carriers and crates described in terms of portability and safety rather than &quot;travel accessories.&quot;</p><p>This is our baseline. Now we have a number to beat.</p><h2>Adding vector search with Jina embeddings</h2><p><a href="https://www.elastic.co/docs/solutions/search/vector"><code>Vector search</code></a> encodes documents and queries as high-dimensional vectors, a type of vector made up of hundreds or thousands of numerical values, each encoding a specific feature of the data it represents. Documents with similar meaning end up close together in vector space, even if they share no words. &quot;Gym equipment&quot; and &quot;dumbbell set&quot; will be nearby because the concepts are related. I chose Elasticsearch as my vector database because it supports hybrid search, giving me both semantic understanding and keyword precision out of the box.</p><p><a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">EIS</a> includes out-of-the-box support for embedding models through its <a href="https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference">inference API</a>.</p><h3>Step 1: Using Jina embeddings v5 as an inference endpoint</h3><pre><code>INFERENCE_ENDPOINT_ID = ".jina-embeddings-v5-text-small"</code></pre><p>If your cluster has GPU resources (available in Elastic Cloud and Elasticsearch 9.3+), the embeddings are generated on GPU, which is significantly faster than CPU inference and removes the performance trade-off that historically made vectors expensive at scale.</p><p>Why Jina embeddings specifically? <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">jina-embeddings-v5-text</a> is a multilingual model (119+ languages) with a 32,000-token context window and support for task-specific <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (LoRA) adapters</a>. It works well for short product descriptions out of the box. Read more about <code>jina-embeddings-v5-text</code> model <a href="https://huggingface.co/jinaai/jina-embeddings-v5-text-small">here</a>.</p><h3>Step 2: Create the index with a semantic field</h3><pre><code>index_mappings = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "description": {"type": "text", "copy_to": "semantic_field"},
            "brand": {"type": "keyword"},
            "category": {"type": "keyword"},
            "semantic_field": {
                "type": "semantic_text",
                "inference_id": INFERENCE_ENDPOINT_ID,
            },
        }
    }
}

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, body=index_mappings)
    print(f"Created index: {INDEX_NAME}")</code></pre><p>The <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-semantic-text"><code>semantic_text</code></a> field type is the key here. It’s a higher-level abstraction over <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a>: You point it at an inference endpoint, and Elasticsearch takes care of generating embeddings automatically.</p><p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to"><code>copy_to</code></a> property on <code>title</code> and <code>description</code> means content from both fields flows into <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_field</code></a> for embedding, so a single vector captures the full product representation.</p><h3>Step 3: Index the products</h3><pre><code>def bulk_index(products, index_name):
    actions = []
    for product in products:
        doc_id = product.get("_id")
        source = {k: v for k, v in product.items() if k != "_id"}
        action = {"_index": index_name, "_source": source}
        if doc_id:
            action["_id"] = doc_id
        actions.append(action)

    success, failed = helpers.bulk(es, actions, raise_on_error=False)
    if failed:
        for error in failed:
            print(f"Error: {error}")
    else:
        print(f"Successfully indexed {success} documents")

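# "products" is the list of product dicts loaded from dataset.csv (see the notebook).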
bulk_index(products, INDEX_NAME)</code></pre><p>At index time, Elasticsearch calls the inference endpoint for each document and stores the resulting embedding in <code>semantic_field</code>. No extra code on your side.</p><h2>Hybrid search: Combining BM25 and vectors with RRF</h2><p>Adding vectors improves recall, but using vectors alone risks losing precision on exact-match queries; &quot;running shoes&quot; should still rank verbatim matches first. Hybrid search retains the lexical component specifically to preserve that precision.</p><p>Hybrid search with <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">Reciprocal Rank Fusion</a> (RRF) keeps the best of both:</p><ul><li>BM25 handles exact and near-exact queries with high precision.</li><li>Semantic search handles intent-based and multilingual queries with high recall.</li><li>RRF combines the two ranked lists into a single ranking.</li></ul><p>The RRF formula assigns each document a score based on its rank in each result list:</p><pre><code>score = sum(1 / (rank_constant + rank))</code></pre><p>A document that ranks highly in both lists gets a higher combined score. The <code>rank_constant</code> controls how much weight lower-ranked documents receive.</p><pre><code>hybrid_requests = []
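# Sanity check of the RRF formula above: a document ranked 2nd by BM25 and
# 1st by the semantic retriever, with rank_constant = 5, would score
#   1 / (5 + 2) + 1 / (5 + 1) ≈ 0.31
# Below, each rank_eval request swaps the single query for an rrf retriever.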

for query_id, query_text in (
    judgments_df[["query_id", "query"]].drop_duplicates().values
):
    relevant_docs = judgments_df[judgments_df["query_id"] == query_id]
    ratings = [
        {"_index": INDEX_NAME, "_id": row["doc_id"], "rating": row["grade"]}
        for _, row in relevant_docs.iterrows()
    ]

    hybrid_requests.append({
        "id": query_id,
        "request": {
            "retriever": {
                "rrf": {
                    "retrievers": [
                        {
                            "standard": {
                                "query": {
                                    "multi_match": {
                                        "query": query_text,
                                        "fields": ["title", "description"],
                                    }
                                }
                            }
                        },
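                        # Semantic leg: vector retrieval via the semantic_text field.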
                        {
                            "standard": {
                                "query": {
                                    "match": {
                                        "semantic_field": {"query": query_text}
                                    }
                                }
                            }
                        },
                    ],
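                    # Fuse the top 50 from each list; a smaller rank_constant
                    # weights top-ranked documents more heavily.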
                    "rank_window_size": 50,
                    "rank_constant": 5,
                }
            }
        },
        "ratings": ratings,
    })

hybrid_eval = {
    "requests": hybrid_requests,
    "metric": {"recall": {"k": 10, "relevant_rating_threshold": 1}},
}

hybrid_result = es.rank_eval(index=INDEX_NAME, body=hybrid_eval)
print("Hybrid Recall@10:", hybrid_result.body["metric_score"])</code></pre><p>Result:</p><pre><code>Hybrid Recall@10: 0.75</code></pre><p>Hybrid improves substantially over BM25 (<code>0.43</code>) and preserves precision for exact-match queries like &quot;running shoes.&quot;</p><h2>Results: Before and after</h2><p>Here’s the full comparison across all three approaches:</p><pre><code>methods = {
    "BM25 (Lexical)": bm25_requests,
    "Hybrid (BM25 + Vectors)": hybrid_requests,
}

recall_metric = {"recall": {"k": 10, "relevant_rating_threshold": 1}}

comparison_data = []
for method_name, requests in methods.items():
    result = es.rank_eval(
        index=INDEX_NAME,
        body={"requests": requests, "metric": recall_metric}
    )
    comparison_data.append({
        "method": method_name,
        "recall@10": result.body["metric_score"]
    })

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))</code></pre><p>Result:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">BM25 (Lexical)</td><td style="border: 1px solid #ccc; padding: 4px;">0.43</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hybrid (BM25 + Vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">0.75</td>
                    </tr>
              </tbody>
            </table>
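          <p>The averaged <code>metric_score</code> values above hide per-query variation. Each <code>rank_eval</code> response also reports a score for every individual query in its <code>details</code> object; a minimal way to inspect them, reusing the <code>hybrid_result</code> from earlier:</p><pre><code># Per-query breakdown: "details" is keyed by request id, and each
# entry carries that query's own metric_score.
for query_id, detail in hybrid_result.body["details"].items():
    print(query_id, round(detail["metric_score"], 2))</code></pre>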
          <img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e49f6c10516b0a48a0ad75962c6590ee07311407-700x500.png" style="width: 100%;" /><p>Breaking it down by query:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/40e36dcb7b34cbf4649c512bcb60cef60f1778a6-700x500.png" style="width: 100%;" /><h2>Conclusion</h2><p>Throughout this post, we saw that BM25 lexical search is reliable when users type exact queries, but it loses recall when they search by intent rather than keywords. Using <code>rank_eval</code>, we established a reproducible baseline to measure that gap with real numbers. From there, we added a <code>semantic_text</code> field powered by Jina embeddings and ran the evaluation again. The result: Hybrid search improved recall from <code>0.43</code> to <code>0.75</code> while preserving precision on exact-match queries, though the actual margin will depend on your query mix.</p><p>The pattern scales beyond this example: Collect judgments from your users&#x27; actual queries, run <code>rank_eval</code> as a baseline, add <code>semantic_text</code>, and measure again. You&#x27;ll know exactly what improved and by how much.</p><h2>Next steps</h2><ul><li>Dive deeper into recall and vector search: <a href="https://www.elastic.co/search-labs/blog/recall-vector-search-quantization">Recall and vector search quantization</a> by Jeff Vestal</li><li>Add reranking for even better precision on the top results</li><li>Explore <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html">Elasticsearch hybrid search documentation</a></li><li>Read more about the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html"><code>rank_eval</code> API</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a control plane to govern ecommerce search]]></title>
            <description><![CDATA[How to build a governed control plane for ecommerce that composes conflicting search policies into a single execution plan (without code changes).]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e2cd5244ef8b9a05af6337a4825252f321a9a43-1377x768.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval">Part 1</a> and <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">part 2</a> of this series established why ecommerce search needs a <em>governance layer</em>, a decision layer between the user&#x27;s query and the retrieval engine that classifies intent, enforces constraints, and routes to the correct retrieval strategy (for example, BM25, semantic, hybrid). This post shows how to build that layer using a simple architectural primitive where query interpretation policies are stored as documents and retrieved at query time via fast reverse matching. Because new retrieval policies (for example, “boost brand X” or “only show category Y”) don’t require code changes, the result is a routing layer that stays stable while policies evolve and that keeps the retrieval engines safe in high-stakes environments. If you want to see the end result of this architecture before reading further, check out this video: <a href="https://www.youtube.com/watch?v=e1GuL9CYWAk">Fixing Search Relevance in Seconds: Introducing PRISM</a>.</p><h2>Why query interpretation is often a challenge</h2><p>Storing policies as code (if/else blocks in the application layer) produces tens of thousands of lines of brittle logic that lacks any indexing for efficient policy retrieval at query time. Iteration is slow (a single query behavior change may require a six-week deployment cycle), accountability is unclear (why did results change?), and business users cannot modify search behavior without engineering involvement. This is shown on the left side in the following image:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e2cd5244ef8b9a05af6337a4825252f321a9a43-1377x768.png" style="width: 100%;" /><p>Storing policies as data in an Elasticsearch index is shown on the right side of the above image. This approach solves all of the issues associated with hard-coded query resolution logic. However, for this to work, you need a way to quickly determine which policies match the user query and how conflicts should be resolved. This is where the governed control plane comes in.</p><h2>The control plane pattern</h2><p>A governed control plane sits between the raw user query and an Elasticsearch retrieval. 
It receives user text as its input, and its output is an execution plan that includes filters, boosts, and retrieval routing decisions.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5562da5de521f3c83ed55a13e9be87ca7fa70109-546x489.png" style="width: 100%;" /><p>A control plane pipeline consists of:</p><ol><li><strong>User query: </strong>A user enters a string of what they’re looking for, such as “oranges” or “gift for grandpa”.</li><li><strong>Policy lookup: </strong>Match the user query against the policy index.</li><li><strong>Return matching policies:</strong> Policies that match the user query are returned from the policy index.</li><li><strong>Policy application: </strong>The control plane analyzes these returned policies and composes matched policies into a single coherent execution plan that includes filters, boosts, overrides, and guardrails and that applies the appropriate retrieval method (for example, lexical versus semantic versus hybrid).</li><li><strong>Execute:</strong> The modified <em>intent-aware</em> Elasticsearch query is passed to the application to be executed against a product catalog index.</li><li><strong>Explain (optional):</strong> In addition to creating a query that provides business and intent-aligned results, the control plane provides an optional explainability payload to show which policies were triggered and how they were combined.</li></ol><p>Finding which policies should be applied for a user’s search string requires a fast reverse-matching primitive, which we solve with the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-percolate-query">percolator query</a>. After retrieving relevant policies, combining multiple matched policies into a unified execution plan requires a judgment framework: priorities, conflict strategies, consumed phrase tracking, and cascading transformations that apply policies in sequence rather than independently. Additionally, the most appropriate retrieval technology needs to be selected (for example, <a href="https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25">BM25</a> for “oranges” versus <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> for “gift for grandpa”).</p><h2>Policy lookup: Checking the query before searching for products</h2><p>When a shopper types a query, a search system with a governed control plane doesn’t send that query directly to be executed against the product catalog. First, the query is checked against a set of stored policies and modified to reflect the intent of the query and business priorities.</p><h3>Policy structure</h3><p>Each policy is a simple document that defines two things:</p><ul><li><strong>Match criteria:</strong> What query text should cause this policy to fire. This could be an exact phrase, a single word, a pattern, or a combination.</li><li><strong>Action:</strong> What to do when the policy fires. This could be applying a category filter, excluding products, extracting a price constraint, or changing the retrieval strategy.</li></ul><p>The system finds all matching policies, composes them into an execution plan, and only then runs the product search. 
Taken together, policies act like a knowledgeable store associate who understands what you’re looking for and walks you to the right aisle.</p><h3>The policy pattern</h3><p>The first articles in this series introduced examples of policies in action: constraining &quot;oranges&quot; to the produce category, treating &quot;without peanuts&quot; as an exclusion, and routing &quot;gift for grandpa&quot; to semantic retrieval. The key architectural point is that in each case, the query is checked against stored policies before the product search begins. The policies determine what constraints to apply, which text to modify, and which retrieval strategy to use. The query against the product catalog comes after the policies have been applied and a new rewritten query has been created.</p><h3>Why this is fast</h3><p>An enterprise ecommerce system might have millions of products but only hundreds or thousands of policies. The policy lookup step is searching against a small curated index, not the full product catalog, and is therefore fast. And because policies are stored as data in their own index, a merchandiser adding a new policy doesn&#x27;t touch the application code, and an engineer optimizing the product search doesn&#x27;t touch the policy index. The two concerns evolve independently.</p><p>The examples above describe what happens conceptually. Under the hood, the policy lookup is implemented using the Elasticsearch <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-percolate-query">percolator query</a> type, which is purpose-built for this kind of pattern: matching incoming text against a set of stored queries. <a href="https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance">Part 4</a> in this series provides a hands-on deep dive into the percolator implementation, including index mappings, boundary markers, and highlight-driven phrase tracking. With the lookup mechanism covered in depth in Part 4, let&#x27;s turn to what a policy document actually contains and how the control plane composes multiple policies into a single execution plan.</p><h2>Example policies</h2><p>Now that we&#x27;ve seen what policies do conceptually, let&#x27;s look at what they actually contain. The two policies below have been designed to intentionally conflict, which will demonstrate the conflict resolution system described in subsequent sections.</p><h3>Cheap chocolate</h3><p>The policy shown below detects if a user has submitted a search containing the phrase “cheap chocolate”. If so, results are restricted to the “Chocolates” and “Milk chocolates” categories. This policy also applies a price filter of $2. Also, notice that this policy has a priority of 210; we’ll come back to this when we discuss conflict resolution in more detail.</p><p>The filter mode and conflict strategy settings shown here (hard_filter, soft_boost, restrict, override) are explained in detail in the conflict resolution section below.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbcd66b20fc3aa861b5880ca67daf8e809698717-1002x890.png" style="width: 100%;" /><p>When the above policy is activated, a search for “cheap chocolate” respects the price filter of $2 and restricts results to the “Chocolates” and “Milk chocolates” categories. 
Example results are shown below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3f373af9a985864315d7639440a416e45a882a1b-1133x1146.png" style="width: 100%;" /><h3>Christmas chocolate</h3><p>The policy shown below is an example of a policy that one could imagine applying at Christmas. This example restricts results to “Christmas foods and drinks” and “Christmas sweets”, boosts any products that are also in the “Advent calendars” category, and applies a price filter of less than $7 to focus on affordable seasonal items. Additionally, notice that this policy has a priority of 300. We’ll come back to this when we discuss conflict resolution in more detail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f1179342d0e05cf78266d142b046021a3694368-1007x941.png" style="width: 100%;" /><p>When the above policy is activated without any conflicting policies, a search for “chocolate” respects the $7 price filter, restricts results to the “Christmas foods and drinks” and “Christmas sweets” categories, and boosts any products tagged as “Advent calendars”. Example results are shown below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/662b0e40cb3a9291c17816c33169e9ff5b68f98d-1129x1085.png" style="width: 100%;" /><h2>Combining matched policies</h2><p>The policy lookup described above is half the story. The other half is what happens when multiple policies match the same query.</p><p>In any nontrivial deployment, a single query will routinely trigger several policies at once. &quot;Cheap chocolate&quot; will match both of the policies that we demonstrated above. Each policy is correct in isolation. The challenge is composing them into a single, coherent execution plan without contradictions, without double-counting, and without one policy silently undoing the work of another.</p><p>This isn’t a lookup problem; it’s a judgment problem. The system must decide:</p><ul><li><strong>Order of application:</strong> If a negation policy removes &quot;without peanuts&quot; from the query, does the price policy still see the original text or the modified text?</li><li><strong>Filter conflicts:</strong> If two policies set different price ceilings, which one wins? Is the loser silently dropped, or does it degrade gracefully into a soft boost?</li><li><strong>Phrase ownership:</strong> If two policies both matched on the same word and the first one already consumed it, should the second one still fire?</li></ul><p>A naive implementation (apply all matched policies independently, merge the results) breaks as soon as policies interact. The architecture needs an explicit model for how policies compose. The next two sections describe that model: a priority and conflict resolution framework, and a cascading transformation model that makes policy interaction deterministic.</p><p>The key insight is that policy application isn’t a set of independent operations; it’s a cascading transformation. Each policy receives the rewrite state produced by all higher-priority policies and transforms it further:</p><p><code>initial state → [Policy A] → state&#x27; → [Policy B] → state&#x27;&#x27; → ... → execution plan</code></p><p>The state carries the rewritten query text, accumulated filters, current intent, and any synonym expansions. A high-priority policy can remove text from the query, and every subsequent policy sees the modified query, not the original. Context accumulates.
Order matters.</p><h2>Precedence and conflict resolution: Determinism matters</h2><p>The specific conflict strategies are a design choice. Different organizations may resolve conflicts differently, depending on their business requirements. The following approach illustrates the kind of judgment framework a control plane needs. The important thing is not these specific strategies but that the system has explicit, deterministic strategies rather than letting conflicts resolve through unpredictable interactions.</p><h3>Priority ordering</h3><p>Policies are sorted by priority (highest first). When multiple policies match the same query, they’re applied in priority order. If two policies try to set the same filter field, the higher-priority policy&#x27;s declared strategy for that field takes precedence. If multiple triggered policies have the same priority, the policy with the highest ID is given precedence (as if it were assigned a higher priority); this choice ensures deterministic behavior when conflicts arise.</p><h3>Per-field resolution, not per policy</h3><p>A critical design principle: Conflict resolution operates per field (for example, brand, category, or description), not per policy. When two policies produce filters that overlap on specific fields, only those specific fields are affected by the conflict resolution strategy, and the resolution strategy is defined by the highest-priority matching policy. Non-conflicting fields from both policies survive intact.</p><p>This matters because the alternative of a per-policy approach would force the system to either accept or reject an entire policy when only one of its fields conflicts.</p><p>Per-field resolution preserves the maximum amount of useful constraint information.</p><h3>Three settings per filter field</h3><p>Each filter field in a policy has three independent settings:</p><p><strong>Filter mode:</strong> How the filter is applied when there’s no conflict.</p><ul><li><code>hard_filter</code> (default): Applied as an <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query#score-bool-filter">Elasticsearch <code>bool.filter</code></a> clause. This is useful for excluding unrelated products entirely. For example, restricting a search for &quot;oranges&quot; to the produce category eliminates hits such as orange juice and orange marmalade. Non-matching documents are completely excluded from results.</li><li><code>soft_boost</code>: Applied as an <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query">Elasticsearch <code>function_score</code></a> weight with a configurable <code>boost_weight</code>. Documents that match get a ranking boost, but non-matching documents aren’t excluded. This is useful for something like boosting a brand without excluding other brands.</li></ul><p><strong>Conflict strategy:</strong> What happens when a lower-priority policy sets the same field:</p><ul><li><code>override</code>: The higher-priority policy&#x27;s value wins; the lower-priority value is dropped entirely. Valid for all field types.</li><li><code>restrict</code>: Take the more restrictive numeric value (for example, the lower ceiling for <code>price__max</code>, the higher floor for <code>price__min</code>). Valid for numeric range fields only.</li><li><code>merge</code>: Combine both values into a union.
Valid for non-numeric fields only.</li><li><code>soft_boost</code>: Convert the conflicting filter to a <code>function_score</code> weight with a configurable <code>boost_weight</code> instead of a hard filter. For more details on function_score boosting, see <a href="https://www.elastic.co/search-labs/blog/bm25-ranking-multiplicative-boosting-elasticsearch">Influencing BM25 ranking with multiplicative boosting in Elasticsearch</a>. This is only valid for non-negation fields.</li></ul><p><strong>Value:</strong> The actual filter value (for example, a categories list, a price threshold).</p><p><strong>Strategies by field type: </strong>Not all strategies make sense for all field types. For instance, an exclusion is inherently binary, so it cannot be soft-boosted. The following table shows which strategies are available for each field type:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Field type</td><td style="border: 1px solid #ccc; padding: 4px;">Available strategies</td><td style="border: 1px solid #ccc; padding: 4px;">Default</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Negation fields (__not, __match__not)</td><td style="border: 1px solid #ccc; padding: 4px;">override, merge</td><td style="border: 1px solid #ccc; padding: 4px;">override</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Numeric range fields (__max, __min, __gt, __lt)</td><td style="border: 1px solid #ccc; padding: 4px;">restrict, override, soft_boost</td><td style="border: 1px solid #ccc; padding: 4px;">restrict</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">All other fields (keyword, text)</td><td style="border: 1px solid #ccc; padding: 4px;">soft_boost, override, merge</td><td style="border: 1px solid #ccc; padding: 4px;">soft_boost</td>
                    </tr>
              </tbody>
            </table>
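          <p>To make the per-field model concrete, here’s a minimal Python sketch of how the non-boost strategies from the table might resolve a single conflicting field. The function and argument names are illustrative assumptions, not the production implementation:</p><pre><code>def resolve_field(field, winner_value, loser_value, strategy):
    """Resolve one conflicting filter field; "winner" is the value set by
    the higher-priority policy, whose declared strategy applies."""
    if strategy == "override":
        return winner_value                    # lower-priority value is dropped
    if strategy == "restrict":                 # numeric range fields only
        if field.endswith(("__max", "__lt")):  # ceilings: the lower bound wins
            return min(winner_value, loser_value)
        return max(winner_value, loser_value)  # floors: the higher bound wins
    if strategy == "merge":                    # non-numeric fields only
        return sorted(set(winner_value) | set(loser_value))
    # "soft_boost" instead converts the conflicting filters into
    # function_score weights; it is omitted here for brevity.
    raise ValueError(f"unsupported strategy: {strategy}")</code></pre><p>For example, <code>resolve_field("price__max", 7, 2, "restrict")</code> returns <code>2</code>, which is exactly what happens in the worked example below.</p>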
          <p>Negation fields cannot be soft-boosted because exclusions are binary. Converting &quot;never show canned foods&quot; to &quot;slightly prefer not-canned-foods&quot; fundamentally changes the semantics; a product from &quot;canned foods&quot; would still appear, just ranked slightly lower, which defeats the purpose of the exclusion.</p><h2>A concrete example: Searching for &quot;cheap chocolate&quot; during a Christmas campaign</h2><p>Suppose a merchandiser has created the two chocolate policies that we previously demonstrated: a lower-priority one for cheap chocolate and a higher-priority one that will be enabled during Christmas. If both policies are enabled, how they’re combined depends on the filter mode and conflict strategy of the higher-precedence policy:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0405e193522172bde283180df96ed3651178fafc-529x447.png" style="width: 100%;" /><p>This shows two conflicts, one on categories and one on price. It’s worth noting that the query that will be executed after this transformation has the following characteristics:</p><ul><li>Only products from the “Christmas foods and drinks” and “Christmas sweets” categories will be shown.</li><li>Within those categories, if the products are also tagged as being in the “Advent calendars” category, they’ll be boosted by 3x.</li><li>A price filter for $2 is applied, which came from the lower-priority policy (because the higher-priority policy specified “Restrict” on conflict).</li><li>The word “cheap” is removed, only returning products matching “chocolate”.</li></ul><p>With both of these policies enabled, “cheap chocolate” returns results similar to the image shown below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/01bbab1c5bd3d0fd37e39c25973d60141f9796e9-1126x1123.png" style="width: 100%;" /><h3>Relaxing constraints</h3><p>Perhaps the retailer doesn’t want to exclude products in the “Chocolates” and “Milk chocolates” categories during Christmas. The settings on the Christmas policy might have overreached and inadvertently removed categories applied by the “cheap chocolate” policy. This is an example of why it can be preferable to combine lower-priority policies with conflicting higher-priority ones rather than drop them. For example, we could modify the Christmas chocolates promotion so that instead of “Override” on conflict, we do a soft boost. The change to that policy would be as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/45e88311014d67933ca8cf8381d8f91de090e2b4-1090x103.png" style="width: 100%;" /><p>After this modification, the query rewriter transformation pipeline execution for “cheap chocolate” looks as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/396b360e48327421c2c38bcf4a039fb1a6d5a8e0-519x445.png" style="width: 100%;" /><p>With the soft boost on conflict, the conflicting filters are converted into soft boosts rather than being dropped.
The query that will be executed on the product catalog after this transformation has the following characteristics:</p><ul><li>Because “On conflict” is specified as “Soft boost” on the higher-priority policy, the conflicts will be converted to boosts as follows:<ul><li>Products from the “Christmas foods and drinks” and “Christmas sweets” categories will have a boost of 1x applied to them.</li><li>Products from the “Chocolates” and “Milk chocolates” categories will have a boost of 3x applied to them.</li></ul></li><li>As in the previous example, if the products are also tagged as being in the “Advent calendars” category, they’ll be boosted by 3x.</li><li>As in the previous example, a price filter for $2 is applied.</li><li>The word “cheap” is removed, only returning products matching “chocolate”.</li></ul><p>With relaxed filtering, results look as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7a68c54d878dadfe8b1821dd3860b7b60f9ce45f-1126x1123.png" style="width: 100%;" /><h3>Overriding price from a high-priority policy</h3><p>Or perhaps the retailer wants to allow slightly more expensive chocolates to be shown during Christmas by increasing the effective maximum price to $7. To ensure that the max price from the Christmas chocolates policy is not overridden if someone searches for “cheap chocolates”, we can set the conflict mode on the price to “override” rather than “restrict”, as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c2621e6513281f545b84eb77362f2b93e1c46a1f-996x70.png" style="width: 100%;" /><p>With this override, the query for “cheap chocolate” ignores the maximum price defined in the “cheap chocolate” policy and only applies the price specified in the “Christmas chocolates” policy, as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d50da7900beb3c08439e9fd79cbe2ddd98196441-511x389.png" style="width: 100%;" /><p>This is similar to the previous example, with the difference being that the max price is set to the $7 value from the higher-priority policy because that policy specified “Override” on conflict. With the Christmas price filter taking precedence, the results look as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/635ee6353ba84727486e7e053764788fb26b6f44-1134x1079.png" style="width: 100%;" /><p>These three variations (override on categories, soft_boost, and override on price) demonstrate a key property of the system: A merchandiser can change how two policies interact by modifying a setting on a single field within a single policy, without deploying any code. The conflict strategy is the lever that controls business behavior.</p><h2>Consumed phrase tracking</h2><p>There’s a subtler form of conflict: two policies that match on the same phrase. If a higher-priority policy removes &quot;without peanuts&quot; from the query, a lower-priority policy that also matched on &quot;without&quot; has nothing left to act on. The system detects that the matched phrase is no longer present in the rewritten query and skips the lower-priority policy.</p><p>Intent policies are exempt from consumed phrase tracking: They set the retrieval strategy based on the original query match, regardless of what text has been removed by higher-priority policies.</p><p>Priority ordering, per-field conflict resolution, and consumed phrase tracking together give the control plane a deterministic composition model.
With that foundation in place, the system can make a routing decision that would be risky without it.</p><h2>Governance makes retrieval strategy safe</h2><p>An important insight about routing to the correct retrieval method (text, semantic, or hybrid) is that it executes after governance. If your policies have already enforced a “produce category” filter, then semantic retrieval becomes far less risky because the candidate set is constrained. A semantic search over 500 products is a very different proposition from a semantic search over 500,000 SKUs. Governance narrows the blast radius before retrieval begins.</p><p>For example, without governance, a semantic query for “Fruit high in vitamin C under $4”, in addition to fruits, might return vitamin bottles, carrots, and green peppers. The control plane ensures that these undesired results aren’t even considered as part of the semantic expansion.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dccd5b8a94bfa81f68e3d1c4ad8929ce8cc4e5e-990x378.png" style="width: 100%;" /><p>With that constraint in place, the control plane applies pragmatic routing logic:</p><ul><li><strong>Lexical</strong> for navigational and head queries where deterministic precision matters.</li><li><strong>Semantic</strong> for descriptive discovery queries where concept matching helps.</li><li><strong>Hybrid</strong> selectively, when constraints have already been enforced and the business accepts broader recall.</li></ul><h2>From architecture to implementation</h2><p>The governed control plane translates business intent into deterministic, composable execution plans, without embedding that logic in application code. Policies are data: matched at query time, resolved through explicit per-field conflict strategies, and applied as cascading transformations that produce explainable results. Elastic Services Engineering has built and deployed this architecture for enterprise ecommerce teams, using repeatable patterns and accelerators that compress the path from concept to production. You can see a demo of our implementation of a control plane on YouTube: <a href="https://www.youtube.com/watch?v=e1GuL9CYWAk">Fixing Search Relevance in Seconds: Introducing PRISM</a>.</p><h3><strong>What&#x27;s next in this series</strong></h3><p>The next post goes hands-on with the implementation: how the Elasticsearch percolator powers the policy lookup, including index mappings, boundary markers, highlight-driven phrase tracking, and concrete query examples.</p><h2>Put governed ecommerce search into practice</h2><p>The control plane architecture described in this post (per-field conflict resolution, cascading policy transformations, and governance-constrained retrieval routing) was designed and built by Elastic Services Engineering. Every pattern, screenshot, and transformation pipeline shown in this series comes from a working system validated against enterprise-scale product catalogs.</p><p>If you want to implement a governed, policy-driven control plane on Elasticsearch, <a href="https://www.elastic.co/consulting">Elastic Services</a> can get you there faster.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch ES|QL query builder for JavaScript and TypeScript: Fluent, type-safe query construction]]></title>
            <description><![CDATA[Exploring the Elasticsearch ES|QL query builder for JavaScript and TypeScript and explaining how to build ES|QL queries with practical examples.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-query-builder-javascript-typescript</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-query-builder-javascript-typescript</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Margaret Gu]]></dc:creator>
            <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8cc9dc5b2bcae4c589b402d62a5b7c8c6d63fb7-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We&#x27;re pleased to announce that the Elasticsearch Query Language (ES|QL) query builder is now available for JavaScript and TypeScript. It&#x27;s a fluent, type-safe library that lets you construct ES|QL queries with method chaining, automatic value escaping, and full integrated development environment (IDE) support; no more raw string concatenation.</p><p>Learn how to get started with practical examples you can use right away.</p><h2>Elasticsearch ES|QL query builder for JavaScript and TypeScript</h2><p>If you&#x27;ve ever built an ES|QL query in JavaScript, you&#x27;ve probably written something like this:</p><pre><code>const query = `FROM logs-*
| WHERE status_code >= ${minStatus}
  AND host.name == "${hostname}"
  AND @timestamp >= "${startDate}"
| STATS error_count = COUNT(*) BY status_code
| SORT error_count DESC
| LIMIT 10`</code></pre><p>It looks fine until <strong><code>hostname</code></strong> is<strong> <code>O&#x27;Brien&#x27;s server</code></strong> and the whole thing blows up with a parse error. Or until a user passes <strong><code>&quot;; DROP INDEX logs</code></strong> into a search field and you realize you&#x27;ve been building queries with raw string concatenation this entire time.</p><p>There&#x27;s a better way. The ES|QL query builder for JavaScript and TypeScript lets you write queries like this instead:</p><pre><code>import { ESQL, E, f } from '@elastic/elasticsearch-esql-dsl'

const query = ESQL.from('logs-*')
  .where(E('status_code').gte(minStatus))
  .where(E('host.name').eq(hostname))
  .where(E('@timestamp').gte(startDate))
  .stats({ error_count: f.count() })
  .by('status_code')
  .sort(E('error_count').desc())
  .limit(10)</code></pre><p>Values are escaped automatically. You get autocomplete in your editor. And you can see exactly what the query does, without mentally parsing a template literal.</p><p>ES|QL query builders are already available across Elastic&#x27;s language clients, including Python, Ruby, and others. This article focuses on the JavaScript and TypeScript version, walking through practical examples you can start using today.</p><h2>Getting started</h2><p>Install the package:</p><pre><code>npm install @elastic/elasticsearch-esql-dsl</code></pre><p>Here’s a minimal query:</p><pre><code>import { ESQL, E } from '@elastic/elasticsearch-esql-dsl'

const query = ESQL.from('employees')
  .where(E('still_hired').eq(true))
  .sort(E('last_name').asc())
  .limit(10)

console.log(query.render())</code></pre><p>This renders:</p><pre><code>FROM employees
| WHERE still_hired == true
| SORT last_name ASC
| LIMIT 10</code></pre><p>To run it against Elasticsearch:</p><pre><code>import { Client } from '@elastic/elasticsearch'

const client = new Client({ node: 'http://localhost:9200' })
const response = await client.esql.query({ query: query.render() })</code></pre><p>That’s it. No string interpolation, no manual escaping.</p><h2><strong>Building a real query, step by step</strong></h2><p>Let&#x27;s walk through a realistic scenario: You&#x27;re building a dashboard that analyzes web server error logs. We&#x27;ll start simple and layer on features.</p><h3><strong>Step 1: Filter error logs</strong></h3><pre><code>import { ESQL, E } from '@elastic/elasticsearch-esql-dsl'

const errors = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .limit(100)</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| LIMIT 100</code></pre><h3><strong>Step 2: Add a computed column</strong></h3><p>Your timestamps are in milliseconds, but you want response time in seconds:</p><pre><code>const errors = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .eval({ response_secs: E('response_time_ms').div(1000) })
  .limit(100)</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| EVAL response_secs = response_time_ms / 1000
| LIMIT 100</code></pre><h3><strong>Step 3: Aggregate errors by status code</strong></h3><pre><code>import { f } from '@elastic/elasticsearch-esql-dsl'

const errorBreakdown = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .stats({
    error_count: f.count(),
    avg_response: f.avg('response_time_ms'),
  })
  .by('status_code')
  .sort(E('error_count').desc())</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| STATS error_count = COUNT(*), avg_response = AVG(response_time_ms) BY status_code
| SORT error_count DESC</code></pre><p>The <strong><code>f</code></strong> namespace gives you access to 150+ ES|QL function wrappers: aggregations, string functions, date functions, math, geo, and more. They all return chainable expressions, so you can use them anywhere you&#x27;d use <strong><code>E()</code></strong>.</p><h3><strong>Step 4: Use date functions for time-based analysis</strong></h3><pre><code>const hourlyErrors = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .eval({ hour: f.dateTrunc('@timestamp', '1 hour') })
  .stats({ error_count: f.count() })
  .by('hour')
  .sort(E('hour'))</code></pre><pre><code>FROM logs-*
| WHERE status_code >= 400
| EVAL hour = DATE_TRUNC(@timestamp, "1 hour")
| STATS error_count = COUNT(*) BY hour
| SORT hour</code></pre><h3><strong>Step 5: Branch queries safely</strong></h3><p>Every method returns a new query object. The original is never mutated. This means you can build a base query and branch it for different views:</p><pre><code>const base = ESQL.from('logs-*')
  .where(E('status_code').gte(400))
  .where(E('@timestamp').gte('2026-01-01T00:00:00Z'))
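// Every builder method returns a new object rather than mutating; the three views below all reuse base.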

const byStatus = base
  .stats({ count: f.count() })
  .by('status_code')
  .sort(E('count').desc())

const byHost = base
  .stats({ count: f.count() })
  .by('host.name')
  .sort(E('count').desc())
  .limit(20)

const recent = base
  .sort(E('@timestamp').desc())
  .keep('@timestamp', 'status_code', 'url.path', 'message')
  .limit(50)</code></pre><p>Three different queries, one shared base. Change the filter on <strong><code>base</code>,</strong> and all three update. This is especially useful for dashboards where multiple panels query the same dataset with different aggregations.</p><h2><strong>Three ways to write expressions</strong></h2><p>The domain‑specific language (DSL) gives you flexibility in how you write conditions. Here&#x27;s the same WHERE clause written three different ways:</p><p><strong>Raw strings:</strong> When you&#x27;re writing a quick one-off:</p><pre><code>.where('status_code >= 400 AND host.name == "web-01"')</code></pre><p><strong>The <code>E()</code> expression builder: </strong>When you want type safety and autocomplete:</p><pre><code>import { and_ } from '@elastic/elasticsearch-esql-dsl'

.where(and_(
  E('status_code').gte(400),
  E('host.name').eq('web-01')
))</code></pre><p><strong>The <code>esql</code> template tag: </strong>When you want safe interpolation of dynamic values:</p><pre><code>import { esql } from '@elastic/elasticsearch-esql-dsl'

const minStatus = 400
const host = 'web-01'
.where(esql`status_code >= ${minStatus} AND host.name == ${host}`)</code></pre><p>All three produce the same ES|QL. Pick whichever fits your situation: raw strings for simple cases, <strong><code>E()</code></strong> when building expressions programmatically, and the template tag when mixing literal ES|QL with dynamic values.</p><h2><strong>Keeping queries safe</strong></h2><p>If any part of your query comes from user input, you need to think about injection. ES|QL supports parameter binding, and the DSL makes it straightforward:</p><pre><code>function searchLogs(userQuery: string) {
  const query = ESQL.from('logs-*')
    .where(E('message').eq(E('?')))
    .limit(100)
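  // The rendered query keeps a bare ? placeholder; the user's value is sent separately via params.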

  return client.esql.query({
    query: query.render(),
    params: [userQuery],
  })
}</code></pre><p>The <strong><code>?</code></strong> placeholder is replaced server-side by Elasticsearch, so the user&#x27;s input never touches the query string. No escaping, no injection risk.</p><h2><strong>Beyond the basics</strong></h2><p>Once you&#x27;re comfortable with the core commands, the DSL supports every advanced ES|QL feature:</p><p><strong>Hybrid search with FORK and FUSE:</strong></p><pre><code>const results = ESQL.from('articles')
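  // Each branch is a complete sub-pipeline; fuse('RRF') merges the branch results via reciprocal rank fusion.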
  .fork(
    ESQL.branch()
      .where(f.match('title', 'elasticsearch'))
      .sort(E('_score').desc())
      .limit(50),
    ESQL.branch()
      .where(f.knn('embedding', 10))
      .sort(E('_score').desc())
      .limit(50),
  )
  .fuse('RRF')
  .limit(10)</code></pre><p><strong>Data enrichment:</strong></p><pre><code>const enriched = ESQL.from('logs-*')
  .enrich('ip_lookup')
  .on('client.ip')
  .with('geo.city', 'geo.country')</code></pre><p><strong>Conditional aggregation:</strong></p><pre><code>const stats = ESQL.from('employees')
  .stats({
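    // Aggregations can carry their own WHERE filters, each computed over its own row subset.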
    eng_avg: f.avg('salary').where(E('dept').eq('Engineering')),
    sales_avg: f.avg('salary').where(E('dept').eq('Sales')),
    total: f.count(),
  })</code></pre><p><strong>AI/machine learning (ML) integration:</strong></p><pre><code>const summarized = ESQL.from('docs')
  .completion('Summarize this document')
  .with({ inferenceId: 'my-llm' })</code></pre><p>For the full list of commands and functions, check out the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/javascript-dsl">ES|QL query builder documentation</a>.</p><h2><strong>What&#x27;s next</strong></h2><p>This is the initial release of <strong><code>@elastic/elasticsearch-esql-dsl</code></strong>. You can find the package on <a href="https://www.npmjs.com/package/@elastic/elasticsearch-esql-dsl">npm</a>, explore the source on <a href="https://github.com/elastic/elasticsearch-dsl-js">GitHub</a>, and read the full documentation in the repository. If you run into issues or have feature requests, open an issue; we&#x27;re actively developing this and want to build what JavaScript and TypeScript developers actually need.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How cross-project search (CPS) works in Elasticsearch Serverless]]></title>
            <description><![CDATA[Elastic Cloud Serverless cross-project search (CPS) treats index expressions as cross-project by default. This post explains how TransportSearchAction scopes projects, resolves index expressions, skips projects with no matches, and validates index resolution against allow_no_indices and ignore_unavailable.]]></description>
            <link>https://www.elastic.co/search-labs/blog/cross-project-search-elasticsearch-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/cross-project-search-elasticsearch-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Matteo Piergiovanni, Pawan Kartik]]></dc:creator>
            <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c8bbf6c6b27b8db8a13313f5f3db2cad948f840b-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>With the introduction of cross-project search (CPS) in our Elastic Cloud Serverless offering, we wanted to provide our customers with a streamlined search experience. Our vision is for our users to not worry about which project the data is stored in but instead to focus on the information they want an answer for.</p><p>Users familiar with <a href="https://www.elastic.co/docs/explore-analyze/cross-cluster-search">cross-cluster search</a> (CCS) know that to include a remote cluster in a search you must specifically reference it with a cluster alias prefix, such as <code>remote:metrics*</code> (or <code>*:metrics*</code> to target all remotes). With CPS, &quot;bare&quot; index names or patterns such as <code>metrics*</code> in a search index expression implicitly reference all instances of that index that can be found on the origin project and on all of its linked projects.</p><p>In this post, we pull back the curtain on how the Elasticsearch TransportSearchAction, the code that underlies the _search and _async_search APIs, determines which indices, aliases, and data streams to search on which projects when running a cross-project search.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7c9134b4509dab0dbe26def7e058c40562cfd5-2400x1350.webp" style="width: 100%;" /><h2>Analyzing index expressions</h2><p>One important consequence of the new CPS model is that the Elasticsearch query parameters <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#operation-search-allow_no_indices"><code>allow_no_indices</code></a> and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#operation-search-ignore_unavailable"><code>ignore_unavailable</code></a> require different handling than before. In general, these parameters control whether a search should throw an error if a concrete index name (for example, &quot;logs&quot;) cannot be found or isn’t accessible (<code>ignore_unavailable=false</code>), or if a wildcard pattern (for example, &quot;logs*&quot;) doesn’t match anything or there are no indices at all to search (<code>allow_no_indices=false</code>).</p><p>In CCS, those parameters are analyzed on each cluster separately: Each cluster can just consult which indices, aliases, or data streams exist locally. But in CPS, we need to account for whether each resource (index, alias, or data stream) referenced by an index expression matches on <em>any</em> project (origin or linked), rather than <em>every</em> project.</p><p>To illustrate, suppose a user has one linked project (<code>linked1</code>) and they issue a cross-project query with index expression <code>logs*,metrics-1</code>, along with <code>allow_no_indices=false</code> and <code>ignore_unavailable=false</code>. As long as we find one resource (index, alias, or data stream) that matches <code>logs*</code> and one that matches <code>metrics-1</code> on any project, the search can proceed. If <code>metrics-1</code>, for example, is found on <code>linked1</code> but not the origin project, that suffices to pass the <code>ignore_unavailable=false</code> constraint. Only if it’s found nowhere would we throw an IndexNotFoundException.</p><p>To handle this, the _search API in cross-project search mode needs to gather information from each linked project before kicking off the actual search.</p><p>The serverless node that receives the _search REST request is considered to be the origin project, and it acts as the overall search coordinator. 
To fully analyze and process an index expression before kicking off the actual search, the search coordinator needs to:</p><ol><li>Determine which projects are in scope for the search.</li><li>Determine which indices, aliases, or data streams should be searched in each individual project.</li><li>Do a final reconciliation step over all the information gathered: 1) Have we found all the indices required to proceed (as defined by the allow_no_indices and ignore_unavailable settings)? And 2) are there any projects that should be skipped because they have no matching indices?</li></ol><p>To illustrate, we’ll follow a CPS request against the _search API from start to finish. Suppose that the incoming user request is to search <code>metrics*</code>, that the origin project has two indices that match the expression, namely <code>metrics-ES</code> and <code>metrics-KB</code>, and that the origin project is linked to projects <code>P1</code> and <code>P2</code>, where <code>P1</code> has index <code>metrics-ES</code> and <code>P2</code> has no index, alias, or data stream that matches <code>metrics*</code>.</p><h2>Determining projects in scope</h2><p>The search coordinator starts by grabbing the full list of linked projects from the Elasticsearch cluster state. It then determines, based on the <a href="https://www.elastic.co/docs/explore-analyze/cross-project-search#security">Universal Identity and Access Management</a> (UIAM) credentials provided with the query, which of those projects the user has access to, and keeps only that subset.</p><p>If a <code>project_routing</code> parameter is present on the request, that can further limit which projects are in scope for the query. For instance, <code>&quot;project_routing&quot;:&quot;_alias:_origin&quot;</code> would cull the list to just the origin project, while <code>&quot;project_routing&quot;:&quot;_alias:P*&quot;</code> would include the linked projects P1 and P2 and remove the origin project from the list of projects in scope for the query.</p><p>The index expression itself can also limit which projects to access. Qualified index expressions, like <code>&quot;P*:metrics*&quot;</code>, would indicate that we should search only on projects whose alias starts with &quot;P&quot;. For our case, the &quot;bare&quot; or unqualified index expression <code>&quot;metrics*&quot;</code> means &quot;search on all projects that are in scope,&quot; where &quot;in scope&quot; is modified by security access and project_routing.</p><h2>Determining which indices each project has</h2><p>Once we know which projects to search, we need to determine which matching indices each one has. To support this, the search coordinator creates a <code>ResolvedIndexExpressions</code> data structure that tracks which indices on each project should be included in the search.</p><p>For each index expression provided by the user, that data structure tracks:</p><ul><li>Original expression: The index expression, as provided by the user.</li><li>Local resolution: The local expressions that will replace the original, together with the resolution result.</li><li>Remote expressions: A set of remote expressions, one for each project the query can target.</li></ul><p>On the search coordinator that received the request, we’ve so far been able to determine which projects are in scope (origin, P1, and P2) and which, if any, matching indices exist on origin. 
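</p><p>To make that bookkeeping concrete, here is a minimal TypeScript sketch of the structure it implies; the type and field names are illustrative only, not the actual Elasticsearch classes, and the resolution states are described a little further below:</p><pre><code>// Illustrative sketch only; not the real Elasticsearch internals.
type Resolution = 'SUCCESS' | 'NOT_VISIBLE' | 'UNAUTHORIZED' | 'NONE'

interface LocalResolution {
  result: Resolution
  indices: string[]      // concrete indices, aliases, or data streams matched locally
}

interface RemoteExpression {
  projectAlias: string   // e.g., "P1"
  expression: string     // e.g., "metrics*"
  result?: Resolution    // filled in after contacting the linked project
  indices?: string[]
}

interface ResolvedIndexExpression {
  original: string       // the expression exactly as the user provided it
  local: LocalResolution
  remotes: RemoteExpression[]
}</code></pre><p>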
So, for our example search against <code>metrics*</code>, we’ll have the following structure at that point in time:</p><ul><li>Original expression: <code>metrics*</code>.</li><li>Local resolution: <code>SUCCESS</code>, <code>&lt;metrics-ES, metrics-KB&gt;</code>.</li><li>Remote expressions: <code>&lt;P1:metrics*, P2:metrics*&gt;</code>.</li></ul><p>The remote expressions are left unresolved until we later contact the P1 and P2 projects to fill in that information.</p><p>The index resolution can be any of the following:</p><ul><li><code>SUCCESS</code>: Local resolution completed successfully.</li><li><code>NOT_VISIBLE</code>: Indicates that a non-wildcard expression was resolved to nothing, either because the index doesn’t exist or is closed.</li><li><code>UNAUTHORIZED</code>: Indicates that the expression could be resolved to a concrete index, but the requesting user isn’t authorized to access it.</li><li><code>NONE</code>: No local resolution was attempted, typically because the expression is remote-only (for example, <code>P1:index</code>).</li></ul><h2>Checking indices on the linked projects</h2><p>In the search API, how we check for indices on the linked projects depends on whether the cross-project search is being run with <a href="https://www.elastic.co/docs/explore-analyze/cross-cluster-search#ccs-min-roundtrips-true">minimize round trips</a> or not. Most searches in CPS are set internally to run with minimize_roundtrips=true, so we’ll focus on that pathway.</p><p>In CCS, where index expression analysis is done locally on each cluster, we just send the same request to every cluster and, with minimize_roundtrips=true, each remote cluster sends back an entire SearchResponse that the primary search coordinator collects and eventually merges with all the other responses it receives.</p><p>By contrast, for CPS, an additional phase was introduced to the search flow, in which we contact each linked project to assess which indices are present and which ones the user has access to. This is an additional round trip that uses <code>ResolveIndexAction</code>, the class that implements the functionality of the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#operation-search-ignore_unavailable"><code>_resolve/index</code></a> endpoint. Upon receiving the <code>ResolveIndex</code> responses from all the linked projects, we can fill in the ResolvedIndexExpressions data structure on the primary search coordinator. For this case, since <code>P1</code> has index <code>metrics-ES</code> and <code>P2</code> has no matching indices, the updated data structure would be:</p><ul><li>Original expression: <code>metrics*</code>.</li><li>Local resolution: <code>SUCCESS</code>, <code>&lt;metrics-ES, metrics-KB&gt;</code>.</li><li>Remote expressions: <code>&lt;P1: SUCCESS &lt;metrics-ES&gt;, P2: NOT_VISIBLE&gt;</code>.</li></ul><h2>CPS validator</h2><p>Once we have all the linked project information, we can run the validation to honor the <code>IndicesOptions</code> specified by the caller.</p><p>If the user had specified a qualified index expression (for example, original=<code>P1:metrics*</code>), the CPS validator needs to make sure that project <code>P1</code> has at least one index matching <code>metrics*</code>; otherwise, a 404 index not found exception would be returned to the user. 
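In other words, qualified expressions imply that an index, alias, or data stream matching that name must be present on all projects specified by the qualifier. On the other hand, for “bare” (unqualified) CPS index expressions (for example, original=<code>metrics*</code>), we only need to check whether the original expression matches anywhere: on the origin project or on any linked project.</p><p>Here is a minimal TypeScript sketch of that validation rule, reusing the illustrative <code>ResolvedIndexExpression</code> shape from above (again, not the actual Elasticsearch implementation):</p><pre><code>// Illustrative sketch only. A qualified expression must match on every project
// named by its qualifier; a bare expression must match somewhere (origin or linked).
function validateExpression(expr: ResolvedIndexExpression, qualified: boolean): void {
  const localHit = expr.local.result === 'SUCCESS' && expr.local.indices.length > 0
  const remoteHits = expr.remotes.filter(
    (r) => r.result === 'SUCCESS' && (r.indices?.length ?? 0) > 0,
  )
  if (qualified) {
    if (remoteHits.length !== expr.remotes.length) {
      throw new Error(`index_not_found_exception: ${expr.original}`) // surfaces as a 404
    }
  } else if (!localHit && remoteHits.length === 0) {
    throw new Error(`index_not_found_exception: ${expr.original}`) // surfaces as a 404
  }
}</code></pre><p>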
For the example we’re following in this post, the <code>ResolvedIndexExpressions</code> structure shows that at least one index, alias, or data stream matching <code>metrics*</code> was found, so the search can proceed.</p><p>However, we can also see that the P2 project has no matching indices. In that case, we can remove it from the rest of the query. So now, for the actual query, we’ll:</p><ul><li>On the origin project, search <code>metrics-ES, metrics-KB</code>.</li><li>On the <code>P1</code> project, search <code>metrics-ES</code>.</li><li>Skip the <code>P2</code> project. Note: Since it wasn’t included in the query at all, it won’t show up in the <a href="https://www.elastic.co/docs/explore-analyze/cross-project-search#cps-examples">_cluster/details</a> section of the SearchResponse.</li></ul><h2>Conclusions</h2><p>We’ve described how the TransportSearchAction class in Elasticsearch supports some key new features of cross-project search. That class supports a large number of Elasticsearch REST APIs, such as _search, _async_search, _msearch, _eql, _sql, and _count, all of which use the mechanisms described in this document.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic-caveman: Cutting AI response tokens by 64% without losing the best of Elastic]]></title>
            <description><![CDATA[Learn how to use elastic-caveman to cut AI response tokens while keeping the Elastic agentic brilliance.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-caveman-ai-token-reduction</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-caveman-ai-token-reduction</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sri Kolagani]]></dc:creator>
            <pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b162c28623b9070fd1980bcd891b9dd1e868f2f0-720x421.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When querying Elasticsearch through an AI assistant, you need facts: index names, field mappings, Elasticsearch Query Language (ES|QL) queries, case IDs, sentiment scores. But current large language model (LLM) interfaces wrap every response in conversational padding:</p><p><em>&quot;Of course! I&#x27;d be happy to help you...&quot;</em></p><p><em>&quot;This should give you a good overview...&quot;</em></p><p><em>&quot;Feel free to let me know if you need anything else!&quot;</em></p><p>This isn&#x27;t just annoying; it&#x27;s expensive. Every token costs money and adds latency. For production Elasticsearch queries, that overhead compounds fast. In this post, we introduce elastic-caveman and share the results of a controlled experiment across eight live Model Context Protocol (MCP) scenarios against an Elasticsearch cluster. The findings: 63.6% average token reduction, 817 tokens saved, and zero loss of technical accuracy.</p><h2><strong>Enter elastic-caveman</strong></h2><p><a href="https://github.com/srikolag/elastic-caveman">elastic-caveman</a> tests a simple hypothesis: Strip AI responses to pure signal, and measure the impact. The approach:</p><ul><li>Normal mode: Full conversational AI with greetings, explanations, and sign-offs.</li><li>Caveman mode: Raw data with minimal structural labels only.</li></ul><p>We tested both modes against a live Elasticsearch instance using MCP with real support ticket and Salesforce case data across eight production scenarios.</p><h2><strong>Results: 64% token reduction, zero accuracy loss</strong></h2><p>Here&#x27;s what we found across eight live MCP tool calls: elastic-caveman cut response size dramatically without compromising quality or functionality.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Result</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scenarios tested</td><td style="border: 1px solid #ccc; padding: 4px;">8</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Success rate</td><td style="border: 1px solid #ccc; padding: 4px;">88%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Token reduction</td><td style="border: 1px solid #ccc; padding: 4px;">63.6% average</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total normal tokens</td><td style="border: 1px solid #ccc; padding: 4px;">1,284</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total Caveman tokens</td><td style="border: 1px solid #ccc; padding: 4px;">467</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tokens saved</td><td style="border: 1px solid #ccc; padding: 4px;">817</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max reduction (single scenario)</td><td style="border: 1px solid #ccc; padding: 4px;">91.5%</td>
                    </tr>
              </tbody>
            </table>
          <p><strong>Key preservations (0% loss):</strong></p><ul><li>Technical accuracy</li><li>API paths</li><li>ES|QL syntax</li><li>Field names</li></ul><p>The critical finding: Every field name, case ID, ES|QL query, account name, and sentiment score was preserved exactly. Not approximately. Exactly.</p><h2><strong>Real examples: Before and after</strong></h2><p><strong>Example 1. List indices: 87% reduction</strong></p><p>User: <em>Show me my indices</em></p><p><strong>Normal mode (107 tokens):</strong></p><pre><code>Of course! I'd be happy to help you see your indices. Here's a comprehensive list of all the indices in your Elasticsearch cluster. Each entry shows the index name along with relevant metadata. This should give you a good overview of what's stored in your cluster:

- salesforce-cases
- support-tickets

These are all the indices I found. If you'd like to dive deeper into any specific index, just let me know and I'll be glad to assist!</code></pre><p><strong>Caveman mode (14 tokens):</strong></p><pre><code>Indices: salesforce-cases, support-tickets</code></pre><p><strong>Saved: 93 tokens (86.9%)</strong></p><p><strong>Example 2. Generate ES|QL query: 75% reduction</strong></p><p>User: <em>Show me open critical tickets grouped by product area</em></p><p><strong>Normal mode (208 tokens):</strong></p><pre><code>Sure thing! I'd be happy to help you write that ES|QL query. What you're looking for is a query that will filter for open tickets with critical priority and group them by product area. Here's how you can accomplish this...</code></pre><p><em>[followed by the actual query, plus 150+ tokens of step-by-step explanation]</em></p><p><strong>Caveman mode (52 tokens):</strong></p><pre><code>FROM support-tickets
WHERE status = "Open" AND priority = "Critical"
| STATS count = COUNT(*) BY product_area
| SORT count DESC
| LIMIT 100

Filters open+critical, groups by product_area, counts, sorts desc.</code></pre><p><strong>Saved: 156 tokens (75.0%). ES|QL syntax is character-for-character identical in both modes.</strong></p><p><strong>Example 3. Search recent support tickets: 35% reduction</strong></p><p>User: <em>Show me 5 recent support tickets</em></p><p><strong>Caveman mode (143 tokens):</strong></p><pre><code>5 recent tickets, newest first:
Case     | Priority | Status    | Account               | Subject                                    | Days | Sentiment
CASE-0012 | Critical | Open      | Pinnacle Financial    | Regulatory reporting: incorrect tax calcs  | 2    | -0.94
CASE-0021 | Critical | Escalated | Riverstone Capital    | Security audit blocking renewal            | 2    | -0.93
CASE-0046 | Medium   | Open      | Blackrock Partners    | Pen test findings need vendor response     | 2    | -0.70
CASE-0006 | High     | Open      | United Oil & Gas      | Unauthorized access -- urgent               | 3    | -0.91
CASE-0017 | High     | Open      | Riverstone Capital    | Renewal paperwork stalled                  | 4    | -0.86

All negative sentiment. Two critical cases from April 7. All case IDs, priorities, account names, and sentiment scores preserved verbatim.</code></pre><h2><strong>What gets removed vs. what stays</strong></h2><p>When we clean up the output, we strip out conversational filler like “Of course! I’d be happy to help you…”, “This should give you a good overview…”, and “Would you like me to help you prioritize these?” We keep every piece of factual content:</p><ul><li>ES|QL snippets, like FROM support-tickets WHERE status = &quot;Open&quot;.</li><li>Field names, like sentiment_score, product_area, and resolution_hours.</li><li>Index names, like support-tickets and salesforce-cases.</li><li>Case IDs, like CASE-0012 and CASE-0002.</li><li>Account names, like Pinnacle Financial and United Oil Gas Corp.</li><li>Numeric values, like a sentiment_score of -0.94, 47 duplicates, 18 days, and 27.0 average hours.</li></ul><p>The result is output tightly focused on query syntax, entities, and numbers, discarding only the polite scaffolding.</p><p>Results varied by operation type:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Token reduction</td><td style="border: 1px solid #ccc; padding: 4px;">Why</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metadata listings</td><td style="border: 1px solid #ccc; padding: 4px;">85–92%</td><td style="border: 1px solid #ccc; padding: 4px;">Small payload, maximum filler in normal mode</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ES|QL generation</td><td style="border: 1px solid #ccc; padding: 4px;">70–75%</td><td style="border: 1px solid #ccc; padding: 4px;">Query is identical; explanation is eliminated</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Data-heavy searches</td><td style="border: 1px solid #ccc; padding: 4px;">35–40%</td><td style="border: 1px solid #ccc; padding: 4px;">Actual data dominates, leaving less room for fluff</td>
                    </tr>
              </tbody>
            </table>
          <h2><strong>Complete evaluation breakdown</strong></h2><p>Token savings by query type across all eight scenarios against live MCP data:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scenario</td><td style="border: 1px solid #ccc; padding: 4px;">Normal tokens</td><td style="border: 1px solid #ccc; padding: 4px;">Caveman tokens</td><td style="border: 1px solid #ccc; padding: 4px;">Reduction</td><td style="border: 1px solid #ccc; padding: 4px;">Tokens saved</td><td style="border: 1px solid #ccc; padding: 4px;">MCP tool</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T1: List all streams</td><td style="border: 1px solid #ccc; padding: 4px;">118</td><td style="border: 1px solid #ccc; padding: 4px;">10</td><td style="border: 1px solid #ccc; padding: 4px;">91.5%</td><td style="border: 1px solid #ccc; padding: 4px;">108</td><td style="border: 1px solid #ccc; padding: 4px;">platform.streams.list_streams</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T2: List indices</td><td style="border: 1px solid #ccc; padding: 4px;">107</td><td style="border: 1px solid #ccc; padding: 4px;">14</td><td style="border: 1px solid #ccc; padding: 4px;">86.9%</td><td style="border: 1px solid #ccc; padding: 4px;">93</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.list_indices</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T3: Get index mapping</td><td style="border: 1px solid #ccc; padding: 4px;">143</td><td style="border: 1px solid #ccc; padding: 4px;">40</td><td style="border: 1px solid #ccc; padding: 4px;">72.0%</td><td style="border: 1px solid #ccc; padding: 4px;">103</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.get_index_mapping</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T4: Generate ES|QL query</td><td style="border: 1px solid #ccc; padding: 4px;">208</td><td style="border: 1px solid #ccc; padding: 4px;">52</td><td style="border: 1px solid #ccc; padding: 4px;">75.0%</td><td style="border: 1px solid #ccc; padding: 4px;">156</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.generate_esql</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T5: Execute ES|QL aggregation</td><td style="border: 1px solid #ccc; padding: 4px;">149</td><td style="border: 1px solid #ccc; padding: 4px;">44</td><td style="border: 1px solid #ccc; padding: 4px;">70.5%</td><td style="border: 1px solid #ccc; padding: 4px;">105</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.execute_esql</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T6: Search recent tickets</td><td style="border: 1px solid #ccc; padding: 4px;">221</td><td style="border: 1px solid #ccc; padding: 4px;">143</td><td style="border: 1px solid #ccc; padding: 4px;">35.3%</td><td style="border: 1px solid #ccc; padding: 4px;">78</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T7: Search escalated cases</td><td style="border: 1px solid #ccc; padding: 4px;">198</td><td style="border: 1px solid #ccc; padding: 4px;">128</td><td style="border: 1px solid #ccc; padding: 4px;">35.4%</td><td style="border: 1px solid #ccc; padding: 4px;">70</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">T8: ES|QL stats by priority</td><td style="border: 1px solid #ccc; padding: 4px;">140</td><td style="border: 1px solid #ccc; padding: 4px;">36</td><td style="border: 1px solid #ccc; padding: 4px;">74.3%</td><td style="border: 1px solid #ccc; padding: 4px;">104</td><td style="border: 1px solid #ccc; padding: 4px;">platform.core.execute_esql</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">TOTALS</td><td style="border: 1px solid #ccc; padding: 4px;">1,284</td><td style="border: 1px solid #ccc; padding: 4px;">467</td><td style="border: 1px solid #ccc; padding: 4px;">63.6%</td><td style="border: 1px solid #ccc; padding: 4px;">817</td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
              </tbody>
            </table>
          <p>Technical accuracy verification:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Accuracy check</td><td style="border: 1px solid #ccc; padding: 4px;">Result</td><td style="border: 1px solid #ccc; padding: 4px;">Details</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ES|QL syntax preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">FROM, WHERE, STATS, SORT, LIMIT identical</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Field names preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">account_id, sentiment_score, product_area verbatim</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Index names preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">support-tickets, salesforce-cases unchanged</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Case IDs preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">CASE-0012, CASE-0002 exact</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Account names preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Pinnacle Financial, United Oil Gas Corp exact</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Numeric values preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Sentiment scores -0.94, -0.88; days open 18, 7 exact</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Priority/status labels</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Critical, Escalated, Open verbatim</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Null values preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">null for low priority resolution hours retained</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error messages preserved</td><td style="border: 1px solid #ccc; padding: 4px;">PASS</td><td style="border: 1px solid #ccc; padding: 4px;">Tool validation errors quoted verbatim</td>
                    </tr>
              </tbody>
            </table>
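          <p>The token numbers above are straightforward to reproduce for your own prompts. Here is a rough TypeScript sketch; it uses a crude characters-per-token approximation rather than the tokenizer used in this evaluation, so treat its output as an estimate only:</p><pre><code>// Rough estimate only: ~4 characters per token is a common rule of thumb.
// Swap in a real tokenizer for exact counts.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

function compareModes(normal: string, caveman: string): string {
  const n = estimateTokens(normal)
  const c = estimateTokens(caveman)
  const pct = (((n - c) / n) * 100).toFixed(1)
  return `normal=${n} caveman=${c} saved=${n - c} (${pct}%)`
}

console.log(
  compareModes(
    "Of course! I'd be happy to help you see your indices...",
    'Indices: salesforce-cases, support-tickets',
  ),
)</code></pre>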
          <p><em>Zero information loss. 64% fewer tokens.</em></p><h2><strong>Why this matters for Elastic users</strong></h2><p>For teams building AI assistants on Elasticsearch, 64% token reduction means 64% savings on output costs at scale, faster streaming responses, and more context window space for actual data rather than filler. When you&#x27;re debugging an ES|QL query at 2 a.m., you don&#x27;t need an AI telling you it&#x27;s delighted to help; you just need the query response!</p><h2><strong>The bigger picture: Rethinking AI interfaces</strong></h2><p>This experiment reveals something fundamental: Conversational AI interfaces optimize for the wrong metric. They optimize for sounding human when users often just want accurate data, fast.</p><p>For technical workflows, especially data queries, there&#x27;s a strong case for mode-switching:</p><ul><li>Conversational mode: When exploring or learning.</li><li>Caveman mode: When you know what you want and need it now.</li></ul><p>The Elastic MCP server makes this possible by returning structured, accurate responses that work in both modes without modification.</p><h2><strong>How elastic-caveman works</strong></h2><p>elastic-caveman is an Agent Skill, that is, a markdown file with YAML front matter that any compatible AI agent reads and follows. No runtime. No binary. No API calls. Just instructions that reshape how your agent talks when working with Elasticsearch.</p><p>Install with:</p><pre><code>npx skills add srikolag/elastic-caveman</code></pre><p><strong>Supported agents: </strong>Claude Code, Cursor, Codex, Windsurf, GitHub Copilot, Gemini CLI, Roo</p><p><strong>Trigger with: </strong><code>/elastic-caveman</code></p><p><strong>Disable with: </strong>&quot;normal mode&quot; or &quot;verbose&quot;</p><h2><strong>Live in action</strong></h2><p>We tested elastic-caveman with the Claude model to measure its impact on token usage and cost:</p><ul><li><strong>With elastic-caveman:</strong> Token usage was 368 tokens (in) and 1.6k tokens (out), resulting in a cost of <strong>$0.11</strong>.</li><li><strong>Without elastic-caveman:</strong> Token usage was 367 tokens (in) and 1.8k tokens (out), resulting in a cost of <strong>$0.12</strong>.</li></ul><blockquote>Prompt: <em>Get me the critical support tickets from the support-tickets index in kibana for Pinnacle Financial</em></blockquote><p>This test demonstrates the efficiency of elastic-caveman.</p><a href="https://videos.elastic.co/watch/pRHKth67Mv5pGgqSvCdXiF">Watch video on Vidyard</a><h2><strong>What&#x27;s next</strong></h2><p>Caveman mode is just the beginning. Consider dynamic mode switching: Flip between concise and conversational mid-session. Or a hybrid approach: Lean on success, explanatory on errors. Or custom verbosity levels for teams that want something in between. 
The goal isn&#x27;t to make AI assistants robotic; it&#x27;s to give users control over the signal-to-noise ratio.</p><h2><strong>Try it yourself</strong></h2><p>Test caveman mode with your Elasticsearch data:</p><ol><li>Set up the Elastic MCP server.</li><li>Install elastic-caveman.</li><li>Run queries in both normal and caveman modes.</li><li>Compare token counts and accuracy.</li></ol><p>Full evaluation methodology and scripts are available in the <a href="https://github.com/srikolag/elastic-caveman">GitHub repo</a>.</p><h2><strong>The bottom line</strong></h2><p>Across eight real scenarios with live Elasticsearch data, elastic-caveman delivered 64% average token reduction with zero accuracy loss and 100% preservation of ES|QL syntax, field names, and technical values. Sometimes the best AI response isn&#x27;t the chattiest one. Sometimes you just need the data, and with elastic-caveman, you can get it 64% faster.</p><p>Want to optimize your Elasticsearch AI workflows? Check out <a href="https://www.elastic.co/search-labs">Elasticsearch Labs</a> for more tutorials, integrations, and research on building with Elasticsearch and AI. Ready to try it yourself? <a href="https://cloud.elastic.co/registration">Start building</a> with Elasticsearch today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Your Elasticsearch field mappings are lying to you — METRICS_INFO and TS_INFO tell the truth]]></title>
            <description><![CDATA[Two new Elasticsearch ES|QL commands that turn time series discovery into a single-line query: METRICS_INFO and TS_INFO tell you which metrics and series actually exist in your data, not just what the mapping declares.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-metrics-info-ts-info-time-series-catalog</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-metrics-info-ts-info-time-series-catalog</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Dmitry Leontyev]]></dc:creator>
            <pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/96a2ffa07e5b1b89999b3c3d466c113a0a7afc8b-1376x768.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Two new Elasticsearch ES|QL processing commands, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/metrics-info"><code>METRICS_INFO</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts-info"><code>TS_INFO</code></a>, tell you which metrics and time series actually have data for your current query context, not just what the mapping declares. Field mappings enumerate every field ever written; these commands return what&#x27;s actively ingested, typed, and queryable right now, scoped to your time window and filters. A single-line query against 1.84 billion documents and 1.4 million time series returns in ~4 seconds. Available GA in Elasticsearch 9.4 and Elastic Cloud Serverless.</p><h2>Why time series discovery matters</h2><p>Elasticsearch uses <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">time series data streams</a> (TSDS) to efficiently store metrics. TSDS is backed by a fully columnar store, and metrics stored in TSDS in Elasticsearch 9.4 require up to 17x less storage compared to using a standard index. Starting with Elasticsearch 9.2, we&#x27;ve also added <a href="https://www.elastic.co/search-labs/blog/esql-elasticsearch-9-2-multi-field-joins-ts-command#time-series-support-in-es%7Cql-(tech-preview)">time-series support in Elasticsearch Query Language (ES|QL) as a fully supported capability</a> when querying data stored in TSDS.</p><p>If you operate TSDS in Elasticsearch, you already know the pattern: <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-dimension">dimensions</a> identify a series, <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds#time-series-metric">metrics</a> carry typed values like <code>gauge</code> or <code>counter</code>, and the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code></a> <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts">source command</a> in ES|QL enables <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">time series aggregation functions</a> such as <code>RATE</code> and <code>AVG_OVER_TIME</code>.</p><p>What that pipeline can&#x27;t tell you (but you need to know just as often) is <strong>which metrics and time series actually exist right now, for the slice of data you care about</strong>. Field mappings enumerate every field that was ever declared; they don&#x27;t show what&#x27;s actively being ingested in a specific cluster, environment, or time window. That gap shows up across very different workflows:</p><ul><li><strong>Dashboard building.</strong> Metric and dimension pickers should reflect what the cluster currently holds for the user&#x27;s filters, not every field that has ever been mapped. Otherwise, dropdowns stay cluttered with stale options and panels render empty.</li><li><strong>Onboarding to an unfamiliar TSDS.</strong> A new cluster, a new integration, a customer&#x27;s data. 
A quick list of the metrics being ingested, with their types, units, and applicable dimensions, replaces hours of mapping spelunking and ad hoc probe queries.</li><li><strong>Data quality investigations.</strong> Mapping drift (the same metric declared <code>gauge</code> in one backing index and <code>counter</code> in another) and dimension-cardinality explosions both surface immediately in the catalog output.</li><li><strong>Query validation.</strong> Before running an expensive <code>TS ... | STATS</code> aggregation, confirm that the metric and dimensions you&#x27;re about to use really have data in your window.</li></ul><p>Kibana already relies on this internally. The <a href="https://www.elastic.co/docs/solutions/observability/infra-and-hosts/discover-metrics">dynamic metrics catalog</a> in the observability experience appends <code>METRICS_INFO</code> to the user&#x27;s active <code>TS</code> query so the UI only offers metrics that truly exist for the current filters, rather than every field in the mapping.</p><h2>The problem: Mappings are an inventory of fields, not time series</h2><p>Operations teams routinely need answers to questions that mapping APIs alone cannot answer:</p><ul><li>Which metrics actually have data <strong>in this environment</strong>, <strong>for this cluster</strong>, <strong>in this time range</strong>?</li><li>How are those metrics typed, and which dimensions apply when building or validating a query?</li><li>How many distinct time series exist per metric?</li></ul><p>Until now, answering these questions meant piecing together mapping APIs, ad hoc queries, and guesswork. <code>METRICS_INFO</code> and <code>TS_INFO</code> turn those questions into single-line ES|QL queries that fit naturally into the same pipeline you use for <code>STATS</code>:</p><pre><code>TS k8s
| WHERE cluster == "prod"
| METRICS_INFO
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
              </tbody>
            </table>
          <h2>How these commands integrate with ES|QL pipelined queries</h2><p>Both commands are <strong>processing commands</strong>. Once you run one, the table is replaced: Downstream commands, like <code>KEEP</code>, <code>WHERE</code>, or <code>STATS</code>, operate on <strong>metadata rows</strong>, not the original time series documents.</p><p>A few rules to keep in mind:</p><ul><li>They apply only after a <strong><code>TS</code></strong> source. Using them after <code>FROM</code> or without a preceding <code>TS</code> source produces an error.</li><li>They must appear <strong>before</strong> any <code>STATS</code>, <code>SORT</code>, or <code>LIMIT</code> that runs on the <strong>time series rows</strong> returned by <code>TS</code>. For example, <code>TS ... | STATS ... | METRICS_INFO</code> is invalid; <code>TS ... | METRICS_INFO | STATS ...</code> is valid because <code>STATS</code> then runs on the metadata table.</li><li>You can filter and aggregate <em>after</em> <code>METRICS_INFO</code> or <code>TS_INFO</code> on the metadata columns with the usual processing commands.</li><li>You can include filters <strong>before</strong> them, for example, narrowing by <code>@timestamp</code> or dimensions, so that the produced metadata reflects series that match your query context, not the entire index.</li></ul><p>Conceptually, the pipeline looks like this:</p><pre><code>TS + filters  →  METRICS_INFO or TS_INFO  →  KEEP / WHERE on metadata  →  STATS / SORT / LIMIT</code></pre><p>This design means you can scope a catalog to exactly the slice of data you care about and then post-process the result with more ES|QL commands as desired.</p><h2>How to use METRICS_INFO and TS_INFO in practice</h2><p><strong><code>METRICS_INFO</code></strong> retrieves information about the metrics available in your time series data streams, together with applicable dimensions and other metadata, all scoped to the current <code>TS</code> query. <strong><code>TS_INFO</code></strong> does the same for <strong>individual time series</strong>. Each row is one metric plus the dimension values that identify one series.</p><p>Each command offers a different view of time series metadata: <strong><code>METRICS_INFO</code></strong> collapses what you see into <strong>one row per distinct metric signature</strong>: the metric name plus how it&#x27;s declared (type, unit, field type, which dimension fields apply) as observed across backing indices. <strong><code>TS_INFO</code></strong> emits <strong>one row per metric and time series</strong>, with a <code>dimensions</code> column that holds the concrete label set for each series, formatted as a JSON object (for instance, <code>{&quot;job&quot;:&quot;elasticsearch&quot;,&quot;instance&quot;:&quot;instance_1&quot;}</code>).</p><p>If the same logical metric name shows up with incompatible metadata in different places, you get multiple rows or multi-valued cells. That&#x27;s a useful signal when you&#x27;re tracking down mapping drift.</p><p>Both commands expose the same core columns; only <code>TS_INFO</code> adds <code>dimensions</code>.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Column</td><td style="border: 1px solid #ccc; padding: 4px;">Meaning</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">Name of the metric.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">Data stream(s) that contain this metric; multi-valued when it spans multiple data streams.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">Unit declared in the mapping (e.g., bytes); multi-valued when definitions differ across backing indices; may be null.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">Types such as gauge or counter; multi-valued when definitions differ across backing indices.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch field type (long, double, ...); multi-valued when definitions differ across backing indices.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td><td style="border: 1px solid #ccc; padding: 4px;">Dimension field names for this metric (multi-valued): the union of dimension keys across all time series for that metric.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">dimensions</td><td style="border: 1px solid #ccc; padding: 4px;">TS_INFO only. JSON-encoded dimension key/value pairs that identify one time series.</td>
                    </tr>
              </tbody>
            </table>
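          <p>Before diving into examples, here is a minimal sketch of the ordering rule above: the invalid form is shown as a comment, and only the second form runs, because <code>STATS</code> then operates on the metadata table that <code>METRICS_INFO</code> produced rather than on raw time series rows:</p><pre><code>// Invalid: METRICS_INFO cannot follow STATS, which has already
// aggregated away the time series rows
// TS k8s | STATS docs = COUNT(*) | METRICS_INFO

// Valid: STATS aggregates the metadata rows from METRICS_INFO
TS k8s
| METRICS_INFO
| STATS metrics = COUNT_DISTINCT(metric_name)</code></pre>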
          <p>Start with a catalog of names and types. The smallest useful query is a <code>TS</code> source, <code>METRICS_INFO</code>, and a sort so the table is easy to scan:</p><pre><code>TS k8s
| METRICS_INFO
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
              </tbody>
            </table>
          <p>You can post-process the result as usual in ES|QL. For instance, you can trim columns or filter on metadata before aggregating:</p><pre><code>TS k8s
| WHERE cluster == "prod" AND TRANGE(1d)
| METRICS_INFO
| KEEP metric_name, metric_type
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td>
                    </tr>
              </tbody>
            </table>
          <p>To find <strong>how many</strong> distinct metric names match a pattern (not <strong>which</strong> series), combine <code>METRICS_INFO</code> with <code>STATS</code>:</p><pre><code>TS k8s
| METRICS_INFO
| WHERE metric_name LIKE "network.total*"
| STATS matching_metrics = COUNT_DISTINCT(metric_name)</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">matching_metrics</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td>
                    </tr>
              </tbody>
            </table>
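          <p>The same post-processing works with ES|QL&#x27;s multi-value functions. As a minimal sketch using the standard <code>MV_COUNT</code> function, this surfaces the mapping drift discussed earlier: metrics whose type is declared differently across backing indices come back with a multi-valued <code>metric_type</code>:</p><pre><code>TS k8s
| METRICS_INFO
| WHERE MV_COUNT(metric_type) > 1
| KEEP metric_name, data_stream, metric_type</code></pre>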
          <p>Document predicates <strong>before</strong> the catalog command narrow the catalog to time series with data samples that actually exist in your window. The metrics listed are those with matching data, not every field that has ever been mapped:</p><pre><code>TS k8s
| WHERE cluster == "prod" AND TRANGE(1d)
| METRICS_INFO
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td>
                    </tr>
              </tbody>
            </table>
          <p>Run the <strong>same</strong> scoped pipeline, but swap the middle command for <code>TS_INFO</code>, and the question shifts from “which metrics match” to “which time series identities match”. Each row is one metric plus one combination of dimension values; sort on <code>metric_name</code> and <code>dimensions</code> so related series group together:</p><pre><code>TS k8s
| WHERE cluster == "prod" AND TRANGE(1d)
| TS_INFO
| SORT metric_name, dimensions</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">metric_name</td><td style="border: 1px solid #ccc; padding: 4px;">data_stream</td><td style="border: 1px solid #ccc; padding: 4px;">unit</td><td style="border: 1px solid #ccc; padding: 4px;">metric_type</td><td style="border: 1px solid #ccc; padding: 4px;">field_type</td><td style="border: 1px solid #ccc; padding: 4px;">dimension_fields</td><td style="border: 1px solid #ccc; padding: 4px;">dimensions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">packets</td><td style="border: 1px solid #ccc; padding: 4px;">gauge</td><td style="border: 1px solid #ccc; padding: 4px;">integer</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">bytes</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">long</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"one","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"three","region":"[eu, us]"}</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td><td style="border: 1px solid #ccc; padding: 4px;">k8s</td><td style="border: 1px solid #ccc; padding: 4px;">usd</td><td style="border: 1px solid #ccc; padding: 4px;">counter</td><td style="border: 1px solid #ccc; padding: 4px;">double</td><td style="border: 1px solid #ccc; padding: 4px;">[cluster, pod, region]</td><td style="border: 1px solid #ccc; padding: 4px;">{"cluster":"prod","pod":"two","region":"[eu, us]"}</td>
                    </tr>
              </tbody>
            </table>
          <p>That extra column can be used to deduce metric cardinality. Each <code>TS_INFO</code> row is one time series for a given metric, so grouping with <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS</code></a> counts <strong>how many distinct time series exist per metric</strong>:</p><pre><code>TS k8s
| TS_INFO
| STATS series_count = COUNT(*) BY metric_name
| SORT metric_name</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">series_count</td><td style="border: 1px solid #ccc; padding: 4px;">metric_name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td>
                    </tr>
              </tbody>
            </table>
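          <p>Because each <code>dimensions</code> cell is a JSON-encoded string, you can also narrow the series list with ordinary string matching before counting. A sketch, assuming the column can be matched as text and a <code>pod</code> dimension as in the tables above:</p><pre><code>TS k8s
| TS_INFO
| WHERE dimensions LIKE """*"pod":"one"*"""
| STATS series_count = COUNT(*) BY metric_name</code></pre>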
          <p><strong>Choosing between them:</strong> Use <strong><code>METRICS_INFO</code></strong> when you want a compact inventory of metric names and types in the filtered <code>TS</code> context. Use <strong><code>TS_INFO</code></strong> when you need label combinations or per-metric series counts. In practice, skim with <code>METRICS_INFO</code> and then switch to <code>TS_INFO</code> when the answer depends on <strong>which</strong> dimensions apply, not just <strong>what</strong> metrics exist.</p><h2>Under the hood: How the commands are executed</h2><p>Both <code>METRICS_INFO</code> and <code>TS_INFO</code> run inside the same distributed ES|QL execution that powers any <code>TS</code> query. In addition to standard features, like shard-level parallelism, Lucene filter pushdown, and coordinator-side merging, special care has been taken during implementation so that the cost scales with the number of matching time series, not the number of documents. Here&#x27;s how each output row gets produced:</p><p><strong>1. The <code>TS</code> command defines the scope.</strong> <code>TS</code> resolves your data stream pattern to its TSDS <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-bound-tsds">backing indices</a> and turns any filters you place <strong>before</strong> the catalog command, such as a time range on <code>@timestamp</code> or dimension predicates in <code>WHERE</code>, into a Lucene query that runs on every shard that can match. Shards in backing indices outside the time window are pruned up front and never touched.</p><p><strong>2. Each shard iterates over matching documents and tracks one per series.</strong> A TSDS index is physically sorted by <code>_tsid</code> first, then by <code>@timestamp</code> (descending). That sort matters here: All documents belonging to the same time series sit next to each other on disk, so as a shard processes documents in order, it only needs to keep the first document it sees for each new <code>_tsid</code> and skip the rest. The result is one representative document per time series that has at least one document matching your filters.</p><p><strong>3. The mapping tells us what each field is.</strong> The backing index mapping is the source of truth for the metadata that describes each field:</p><ul><li>Fields declared with <code>time_series_metric</code> are metrics, and the mapping carries each metric&#x27;s <code>metric_type</code>, <code>field_type</code>, and (if declared) <code>meta.unit</code>.</li><li>Fields declared with <code>time_series_dimension</code> are dimensions; their paths determine which keys can appear in <code>dimension_fields</code> and <code>dimensions</code>.</li></ul><p><strong>4. Synthetic source fills in the actual dimension and metric presence.</strong> For the one representative document per series, the shard reads a subset of <code>_source</code> containing only the dimension (and metric) paths the mapping declares. TSDS uses <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-source-field#synthetic-source">synthetic <code>_source</code></a>, so that subset is reconstructed primarily from doc values; no stored <code>_source</code> is needed.
From that reconstructed sliver of JSON, the shard learns two things:</p><ul><li>The dimension key/value pairs for this series (the <code>dimensions</code> JSON for <code>TS_INFO</code>, and the set of dimension keys that feed <code>dimension_fields</code> for both commands).</li><li>Which metric fields actually have data for this series in this backing index.</li></ul><p><strong>5. Partial aggregation happens inside each shard.</strong> Shards don&#x27;t ship raw per-series rows upstream. They partially aggregate first, which is a big part of why catalog queries stay cheap.</p><p><strong>6. The coordinator merges across shards and data streams.</strong> Each data node first reduces its own shards&#x27; partial results and streams them to the coordinator, which applies the same merge logic one more time.</p><p><strong>7. The rest of the pipeline runs as usual.</strong> Everything after the catalog command (<code>KEEP</code>, <code>WHERE</code>, <code>STATS</code>, <code>SORT</code>, <code>LIMIT</code>) runs against this consolidated metadata table on the coordinator, exactly like any other ES|QL stage.</p><p>The net effect is that catalog queries do just enough work to identify one representative document per series, read a small reconstructed slice of that document, classify its fields against the mapping, and fold the results into a handful of metadata rows. Because the output cardinality is bounded by the number of matching series (for <code>TS_INFO</code>) or by the number of distinct metric signatures (for <code>METRICS_INFO</code>), not by the number of documents in the window, these commands stay responsive even against long retention windows and high-ingest data streams.</p><p>Against the full high-cardinality TSDB benchmark corpus without a time range filter (1.84 B documents / 1.4 M time series / 2.77 TB uncompressed), on a single-node Elasticsearch cluster (AWS c8gd.8xlarge, 24 cores, 24 GiB heap, NVMe SSD, 3 primary shards, force-merged), <code>METRICS_INFO</code> returns in ~4 seconds.</p><h2>Beyond ad hoc queries</h2><p>These commands also support product workflows inside Kibana. The UI appends <code>METRICS_INFO</code> to a user&#x27;s <code>TS</code> query (when the query doesn&#x27;t already include <code>STATS</code>) <a href="https://www.elastic.co/docs/solutions/observability/infra-and-hosts/discover-metrics">to build a metric catalog</a> aligned with the user&#x27;s filters, rather than relying solely on mappings.</p><p>These new commands are also the foundation of Prometheus-compatible metadata APIs we&#x27;re adding to Elasticsearch, which Prometheus-based tools can use. Stay tuned for a dedicated blog post that goes into more detail.</p><h2>Data quality</h2><p>A multi-valued <code>unit</code>, <code>metric_type</code>, or <code>field_type</code> in the <code>METRICS_INFO</code> output is a concise warning that backing indices aren&#x27;t aligned on a metric&#x27;s definition. <code>TS_INFO</code> makes it easier to see whether an explosion in cardinality comes from a handful of metrics or from dimension cardinality you should account for in alerts and aggregations. For example, ranking metrics by series count surfaces outliers at a glance:</p><pre><code>TS k8s
| TS_INFO
| STATS series_count = COUNT(*) BY metric_name
| SORT series_count DESC</code></pre>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">series_count</td><td style="border: 1px solid #ccc; padding: 4px;">metric_name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">12000</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.rx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.eth0.tx</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_bytes_in</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">9</td><td style="border: 1px solid #ccc; padding: 4px;">network.total_cost</td>
                    </tr>
              </tbody>
            </table>
          <p>When a single metric dwarfs the rest, as <code>network.eth0.rx</code> does above, the explosion is concentrated in <strong>a handful of metrics</strong>, and inspecting that metric&#x27;s <code>dimensions</code> pinpoints which label is growing. Comparable counts across metrics instead point at <strong>shared dimension cardinality</strong>, such as a newly introduced <code>pod</code> or <code>instance</code> value propagating through every time series.</p><h2>Availability</h2><p><strong><code>METRICS_INFO</code></strong> and <strong><code>TS_INFO</code></strong> are <strong>generally available</strong> in <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and, starting with the 9.4.0 release, in Elasticsearch at the Basic license tier.</p><p>For command pages (syntax, restrictions, and examples), see <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/metrics-info"><code>METRICS_INFO</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts-info"><code>TS_INFO</code></a>.</p><p>For background on TSDS and the <code>TS</code> command itself, start with the official documentation on <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">time series data streams</a> and the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code> source command</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stop guessing which query is burning your cluster: Query activity in Kibana]]></title>
            <description><![CDATA[Pinpoint long-running Elasticsearch searches from Kibana: live tasks, origin context, and cancel when the cluster allows without living in low-level APIs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-query-activity-long-running-searches</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-query-activity-long-running-searches</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valentin Crettaz]]></dc:creator>
            <pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3515b0a19116ad940210e4fe43026c8e6350b0d7-1536x1024.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Query activity in <a href="https://www.elastic.co/docs/deploy-manage/monitor/query-activity">Kibana</a> gives you a live list of <strong>in-flight search tasks</strong> in Elasticsearch (ES|QL, DSL, SQL, EQL), so <strong>in about a minute</strong> you can move from “something’s slow” to which query, how long it’s been running, and whether you can cancel it. It builds on the Tasks Management API, but it’s the operational view you can show in a war room. If you’re on Elastic Cloud Serverless, it’s available now; on Elastic Cloud Hosted and self-managed, it ships with Kibana 9.4.</p><h2>When “something is slow” finally has an answer</h2><p>Query activity is already available in your Elastic Cloud Serverless project today. For Elastic Cloud Hosted and Elastic Self-Managed clusters, it ships with Kibana 9.4 and is available across all deployments and clusters on that version. Query activity is the Kibana view on top of the Tasks Management API in Elasticsearch. It’s purpose-built for search-related tasks in any query language, including Elasticsearch Query Language (ES|QL), DSL, SQL, and Event Query Language (EQL).</p><p>It always starts the same way. Someone pings you on a Friday: <em>Discover feels stuck.</em> <em>The exec dashboard won’t load.</em> <em>Did we change something?</em> You open your monitoring tabs, squint at CPU, maybe tail a log, and you’re still guessing. Is it one giant ES|QL pipeline? A dashboard nobody remembers? A background rule doing honest work at the worst possible time? The cluster isn’t mysterious on purpose. In-flight work is simply invisible unless you enjoy living in Dev Tools and reconstructing life stories from task IDs and snippets of JSON.</p><p>We built Query activity for everyone who has ever muttered, <em>Just tell me what’s running.</em> It’s a new screen in Kibana that lists active search work in ES|QL, DSL, SQL, or EQL. It shows the queries that are consuming your cluster right now, with enough context to move from <strong>panic to diagnosis</strong> without a scavenger hunt.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33ab7fc484605af4caa2c21913364159a2f23014-1999x775.png" style="width: 100%;" /><h2>The play you know and the one-minute rewrite</h2><p>If you’ve operated Elasticsearch for more than a week, you’ve lived the old script. In <strong>Act I</strong>, someone says the cluster feels slow. In <strong>Act II</strong>, you scatter across shards, heap, slow logs, and sticky-note task IDs. Hours pass, and you still cannot name the query. In <strong>Act III</strong>, maybe you find the culprit before dinner, or maybe next month Act I opens again with the same villain in a fake mustache.</p><p>Query activity replaces that meandering Act II with one default sequence. It’s the same story, compressed from symptom to evidence to origin to action in about a minute. Paste this into your runbook or send it to your on-call channel. It’s the whole innovation in practice.</p><p>1. <strong>Open Query activity</strong> as soon as Act I hits. On Elastic Cloud Hosted and Elastic Self-Managed clusters, go to <strong>Stack Management</strong> and then <strong>Cluster performance</strong>. On Elastic Cloud Serverless, go to <strong>Admin and Settings</strong> and then <strong>Project performance</strong>. Do this before you fork into guesswork.</p><p>2. 
<strong>Refresh the list of queries</strong> once so you’re looking at <strong>right now</strong>, not five minutes ago.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbfe0ba1e09916331c148ca791f1f7d9b0b40985-1999x331.png" style="width: 100%;" /><p>3. <strong>Surface the pressure.</strong> Sort by run time, or tighten the “Run time” filter, until the expensive work floats to the top.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b7d487d2ce4496f7fe306d4c7e2527ac60699d05-764x448.png" style="width: 100%;" /><p>4. <strong>Open the flyout</strong> on the worst offender. You’ll see duration, query type, index breadth, and full query text. That’s your evidence without opening Dev Tools.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7e62154a314a53e705b0faf9057a941e108fd610-1326x1582.png" style="width: 100%;" /><p>5.<strong> Name the owner.</strong> Use <code>trace.id</code> to jump into <strong>Discover</strong> and filter on the audit or query logs for that trace, or use <code>X-Opaque-Id</code> to figure out which dashboard, saved search, or rule this query originates from.</p><p>6. <strong>Resolve Act III.</strong> Let the query finish, fix the upstream, or <strong>cancel</strong> when it’s appropriate and Elasticsearch says the task is cancelable.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3830fdbe460fb947323f35f0626a0fd76beacf41-1434x383.png" style="width: 100%;" /><p>That’s one pass through what used to be three acts. You get attribution instead of folklore and decisions instead of theater.</p><h2>Query activity deep dive</h2><p>The one-minute sequence above is the habit. What follows is the machinery: the concrete controls and signals in Kibana that make that rewrite possible. You get what’s executing, how long it’s been running, where it came from, and what to do next, without stitching clues across tabs.</p><p>Under the hood, this view is powered by Elasticsearch’s Tasks Management API for long-running search tasks. It’s translated into an operational UI that’s built for speed. You can find the outlier quickly, inspect rich details, and act with confidence.</p><p>Here’s how the UI backs each beat of the runbook.</p><p>The main view is a filterable list of running queries. It includes a search bar so you can match anything in the table, including task ID. You also get filters for run time, query language, and source (for example Discover, Dashboard, and similar surfaces). You stay in control of what “noisy” means.</p><p>Refresh is manual on purpose. The table does not auto-refresh. You click <strong>Refresh </strong>when you’re ready, and the UI shows when the last refresh happened. You shouldn’t have to wonder whether the list is stale.</p><p>When you click a task ID, a detail flyout opens. It shows start time, run time, how many indices the query touches, and full query text. When <code>X-Opaque-Id</code> is present, it can help you <a href="https://www.elastic.co/docs/troubleshoot/kibana/trace-elasticsearch-query-to-the-origin-in-kibana">trace an Elasticsearch query to its origin in Kibana</a> so you can turn “mystery load” into “that dashboard, that version.” Previous and next navigation let you walk the queue without jumping back to the list. When <code>trace.id</code> is available, you can open Discover with that trace prefiltered. 
That helps when the incident channel is already busy.</p><p>Where work is cancelable, you can request cancellation from the list or the flyout. There’s a deliberate confirmation step. After you confirm, the cancel control shows a spinner until Elasticsearch reports that the task actually stopped. The goal is oops-proof, not oops-fast.</p><p>Viewing and managing active query work requires appropriate cluster privileges. The UI states clearly when something is missing. For example, users without <code>cluster:manage</code> may not be able to take destructive actions. Users without <code>cluster:monitoring</code> may not see task details. You shouldn’t get a blank screen that feels like the stack is playing hide-and-seek.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4c5a8f7aa5187029ed705f4a7affa9424f33c983-1999x1027.png" style="width: 100%;" /><p>If you’ve been following our broader story around query observability, this is the live side of the house. It’s what is happening now, in the product, with controls you can use. Over time, pair it with historical views, such as query logs and <a href="https://www.elastic.co/search-labs/blog/slow-search-elasticsearch-query-autoops">AutoOps long-running search tasks insights</a> when you need to ask whether this has happened before. When you need to answer what’s eating your cluster in this minute, start with the new Query activity UI in Kibana.</p><h2>Who this is for (and who becomes the hero)</h2><p>Cluster and platform admins get the obvious win: <strong>faster incident response</strong> and less time translating APIs into narratives for stakeholders.</p><p>Centers of excellence and internal search champions get something equally valuable: a teachable moment you can screenshot. This is the query pattern that blew up shared capacity. This is what “interactive” versus “background” pressure looks like when everyone is busy.</p><p>Anyone on the hook for Service Level Agreements (SLAs) gets a cleaner bridge between user experience (“the app is slow”) and search reality (“these three requests are still running, and one of them is huge”).</p><p>You don’t have to be the person who wrote the query to be the person who explains the cluster calmly. That’s the whole point.</p><p>Not every task is cancelable, and deep tuning work still has its place. Query activity doesn’t fix your queries for you. It surfaces in seconds which ones might need attention, and it gives you faster evidence, clearer attribution, and better decisions before you reach for heavier tools.</p><h2>Where to find it</h2><p>You’ll find Query activity in the performance area of each deployment model. In Elastic Cloud Hosted and Elastic Self-Managed clusters, open <code>Stack Management</code> and then <code>Cluster performance</code>. In Serverless projects, open <code>Admin and Settings</code> and then <code>Project performance</code>.</p><p>Threshold hygiene: Open <code>Stack Management</code> and then <code>Advanced Settings</code>. The <code>running_queries:minRunningTime</code> setting defaults to <code>100 ms</code>. Only tasks that have been running longer than that appear. That way you can dial through noise without drowning in instantaneous work.</p><h2>What to do next</h2><p>Walk through the six-step sequence once when the cluster is calm. When Act I hits, you won’t be learning a new UI under pressure. Then repeat it during the next slow moment. 
The gap between assuming and seeing is the whole product story.</p><p>If you aren’t on Elastic Cloud yet, you can still get hands-on with the stack at <a href="https://cloud.elastic.co/registration">cloud.elastic.co/registration</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerating merchandising improvements with a governed control plane]]></title>
            <description><![CDATA[Search behavior changes shouldn't require an engineering ticket. Learn how a governed control plane lets business teams update search policies in hours, without deployments, without risk.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dc802d2ca828ba41d6ff2a0ea1ba67eb0e3bcd10-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval">Part 1</a> in this blog series established why ecommerce search needs a governance layer between the user&#x27;s query and the retrieval engine that classifies intent, enforces business constraints, and routes to the appropriate retrieval strategy. The natural next questions are: Who operates that layer, and how fast can they move?</p><p>This post answers those questions. A governed control plane doesn&#x27;t just improve search relevance; it changes the operating model. It moves search behavior changes from engineering deployment cycles to business-driven workflows, without sacrificing safety or accountability.</p><h2>The scenario that exposes the operating model</h2><p>Imagine that you’re in the weeks leading up to Christmas, and your merchandising team has identified three urgent changes that must immediately be made to search behavior:</p><ul><li><strong>Campaign launch.</strong> Due to an ordering error, there’s an oversupply of in-house branded turkeys. Therefore any query for &quot;turkey&quot; must boost the in-house brand.</li><li><strong>Product recall.</strong> A supplier has recalled a product line. Those products must stop appearing in results for any query that would otherwise surface them.</li><li><strong>Seasonal reinterpretation.</strong> Queries for &quot;stocking&quot; are returning women&#x27;s hosiery and tights. During the holiday season, &quot;stocking&quot; should resolve to Christmas stockings and stocking stuffers. Once the season ends, the policy can be reverted in minutes.</li></ul><p>Under the traditional operating model, where search logic is embedded in application code, each of these changes requires an engineering ticket, a code change, a review cycle, a staging deployment, and a production release. In organizations with conservative release processes, that&#x27;s a timeline measured in weeks, not hours or minutes. The Christmas shopping window closes before engineering can ship the necessary modifications.</p><p>The bottleneck isn’t the retrieval engine; it’s the operating model. The core challenge is that business intent cannot be translated into search behavior without engineering acting as a constant intermediary, turning every strategic pivot into a technical ticket.</p><h2>The anti-pattern: Search logic in application code</h2><p><a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval">Part 1</a> described how search logic embedded in application code can turn into a &quot;spaghetti&quot; implementation, which creates operational friction. Here’s what that friction looks like at scale. What starts as a few targeted overrides, a filter here, a boost there, grows over time into tens of thousands of lines of if/else branching, regex patterns, and conditional query modifications. This creates problems beyond just technical debt:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fc4d7ea5545512667552af429023fcd7fb316e82-1408x768.png" style="width: 100%;" /><p>This model introduces four systemic frictions that hinder both organizational speed and system scalability:</p><p><strong>Coupling.</strong> Business strategy changes daily. Application infrastructure should remain highly stable.
When both live in the same codebase, a merchandiser&#x27;s request to boost a seasonal product becomes a deployment risk, and a scoring function refactor can silently break a campaign.</p><p><strong>Latency (organizational and computational).</strong> A single query behavior change can require a six-week deployment cycle: ticket, investigation, code change, review, staging, release. Furthermore, the application layer lacks any indexing mechanism to efficiently determine which policies apply to a given query, so policy evaluation often adds meaningful latency at query time as the system walks through sequential if/else checks.</p><p><strong>Accountability gaps.</strong> When results change unexpectedly, nobody can quickly answer <em>why</em>. Was it a synonym update? A scoring change? A new filter added three releases ago? When business logic is distributed across thousands of lines of application code, shipped by different teams across different releases, tracing a relevance change back to its root cause becomes an archaeology project.</p><p><strong>Misallocated engineering.</strong> This model turns skilled software engineers into full-time relevance mechanics. Instead of building platform capabilities, they spend their cycles translating merchandising requests into code changes and debugging interactions and conflicts between hard-coded business policies.</p><h2>The paradigm shift: Policies as data</h2><p>The solution is to decouple business policies from application code entirely. Instead of hard-coding query modifications in middleware, store governed policies as structured documents, each one expressing a discrete business intent, and evaluate them at query time in a dedicated governed control plane layer.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/13d47992fb1d1f3f3887f3800f5ddc83742e9c9c-1408x768.png" style="width: 100%;" /><p>A policy is a first-class data object. It has match criteria (when should this policy fire?), an action (what should it do?), a priority (how does it interact with other policies?), and metadata (a title and a description). The control plane evaluates matching policies, resolves conflicts deterministically, and produces an execution plan including constraints, boosts, and routing decisions that Elasticsearch executes against a product catalog.</p><p>For each additional search requirement, the application code doesn&#x27;t change. The retrieval engine doesn&#x27;t change. What changes is that business decisions are no longer encoded in code. They live in a policy index as data that can be updated without a deployment.</p><p>This changes your org chart, not just your query.</p><h2>Policies vs. triggers vs. rules</h2><p>A note on terminology used in this series: a <em>policy</em> refers to this complete governed document, including a trigger (match criteria), rule (action), priority, enabled/disabled, and metadata. A <em>trigger</em> refers to the matching criteria that determines when this policy fires, and a <em>rule</em> refers specifically to the action inside the policy, such as applying a filter or changing the retrieval strategy.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/281e0fb915a7723a5619b5bd08f855abb4e2c530-966x1412.png" style="width: 100%;" /><h2>The workflow: Author → Test → Promote</h2><p>Moving policies out of code and into data opens the door for business-driven search management. But enabling non-technical teams to alter search behavior requires strict operational guardrails. 
The goal is fast and safe iteration with governance.</p><p>To empower non-technical teams to modify search behavior with confidence, we suggest a workflow built around three core stages, Author, Test, and Promote, supported by a review gate before promotion and an instant disable path after it. Let’s examine each component in detail.</p><p><strong>Author.</strong> A merchandiser creates a policy using structured fields: what the policy should match, what action it should take, and at what priority. The interface guides the business user through what’s expressible.</p><p><a href="https://www.elastic.co/consulting">Elastic Services</a> has built and deployed a governed framework for enterprise ecommerce customers, which has an admin UI that looks as follows:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c8cd24f24320f5661800e34686719bc7d6c78e2-1005x959.png" style="width: 100%;" /><p><strong>Test.</strong> The policy is validated in a non-production environment where the merchandiser can run representative queries and verify that the policy produces the expected behavior, including how it interacts with other active policies. Because the control plane infrastructure is identical across environments, what works in the test environment will work in production.</p><p><strong>Review.</strong> Before a policy is promoted to production, it passes through review. Depending on the organization&#x27;s risk tolerance, this might be a peer review from another merchandiser, an approval from a search lead, or an automated validation that checks for conflicts with existing policies.</p><p><strong>Promote.</strong> Once approved, the policy is promoted to the production policy index. It takes effect on the next query: no code deployment, no engineering release, no staging build. The entire promotion is a data operation: the same JSON document, moved to a different index.</p><p><strong>Disable.</strong> If a production policy produces unexpected behavior, it can be disabled immediately without engineering involvement. Disabling removes the policy from query evaluation instantly, without affecting any other policy in the system.</p><p>This is the &quot;zero-deploy&quot; promise. It doesn&#x27;t mean &quot;no process.&quot; It means the process operates on <em>policy data</em>, not application code. This distinction compresses the change cycle from weeks to hours or minutes.</p>
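<p>To make &quot;policies as data&quot; tangible, here is a hypothetical sketch of what such a policy document might look like once authored. The field names are illustrative, not a published schema, but they mirror the anatomy described above: a trigger, a rule, a priority, an enabled flag, and metadata:</p><pre><code>{
  "title": "Holiday stocking reinterpretation",
  "description": "During the holiday season, resolve 'stocking' to Christmas stockings.",
  "enabled": true,
  "priority": 100,
  "trigger": { "query_terms": ["stocking", "stockings"] },
  "rule": { "action": "boost", "field": "category", "value": "christmas-stockings" }
}</code></pre><p>Under this sketch, promoting the policy amounts to indexing the document into the production policy index, and rolling it back is a one-field update to <code>enabled</code>.</p>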
<h2>Why &quot;zero-deploy&quot; matters for revenue-critical queries</h2><p>The economics of ecommerce search are asymmetric. A small number of high-volume queries (&quot;milk,&quot; &quot;bread,&quot; &quot;oranges,&quot; &quot;diapers&quot;) drive a disproportionate share of revenue. When one of these queries returns unexpected results, the cost is immediate and measurable: Conversion drops, customer complaints spike, and the merchandising team opens an urgent ticket.</p><p>Under the traditional model, the response cycle is:</p><ol><li>The merchant notices the problem.</li><li>The merchandiser files a ticket with engineering.</li><li>Engineering investigates, identifies the cause, and writes a fix.</li><li>The fix goes through code review, staging, and release.</li><li>Production is updated.</li></ol><p>Depending on the organization, steps 2 through 5 may take weeks. For a revenue-critical query during a peak sales period, that latency costs money.</p><p>Under a governed control plane, the response cycle compresses:</p><ol><li>The merchant notices the problem.</li><li>The merchandiser drafts a policy fix (or modifies an existing policy).</li><li>The policy goes through review and is published.</li><li>The fix is live.</li></ol><p>The difference isn&#x27;t just speed. It&#x27;s ownership. The person closest to the business context (the merchandiser who understands why &quot;oranges&quot; should resolve to produce, not beverages) is the person making the change. Engineering is freed from the daily merchandising loop to focus on the platform. This shift also unlocks something that&#x27;s nearly impossible under the traditional model: attributing search performance changes to specific business decisions.</p><h2>Measurability: Which policy moved conversion</h2><p>When policies are discrete, versioned documents stored in an Elasticsearch index, each one becomes independently deployable, and its impact can therefore be measured on its own. You can answer questions that are nearly impossible to answer when business logic is scattered across application code:</p><ul><li>Did the &quot;cheap laptops&quot; price threshold policy improve conversion for that query class, or did it suppress it?</li><li>What was the click-through rate impact of the holiday campaign boost?</li><li>When we rolled back the &quot;oranges&quot; category constraint last Thursday, what happened to add-to-cart rates?</li></ul><p>This turns search governance into a data-driven discipline. Instead of vague &quot;relevance tuning,&quot; where a release contains a dozen changes and nobody can attribute the outcome, you get measurable, attributable impact per policy. Merchandisers can iterate with evidence. Engineers can evaluate whether a policy schema change produced the expected downstream effect. Leadership can see which policies are driving revenue and which are inert.</p><h2>What this means for each role</h2><h3>For merchandisers and business users</h3><p>Search behavior becomes something you can directly influence through structured policies without understanding Elasticsearch syntax or scoring algorithms. You can see what policies are triggered for a given query to understand why it produces specific results, and make changes within hours instead of weeks. The same policy mechanism also supports sponsored product placement: A merchandiser can create a boost policy that elevates a product or brand and flags it for a &#x27;Sponsored&#x27; indicator in the UI, without requiring engineering involvement or additional infrastructure.</p><h3>For search engineers</h3><p>The control plane separates two concerns that are currently entangled: retrieval optimization and business logic. Instead of maintaining tens of thousands of lines of application code that encodes business decisions, you maintain the retrieval engine and the control plane infrastructure. When a merchandiser needs a new campaign boost, they don&#x27;t need engineering to write it.</p><p>This doesn&#x27;t eliminate engineering involvement. Engineers design the policy schema, maintain the control plane, set guardrails on what policies can express, add new capabilities as required, and handle edge cases that fall outside the policy framework.
But the day-to-day operational cadence of modifying query behavior shifts to the people who own the business context.</p><h3>For site reliability engineers and platform teams</h3><p>Because policies are structured documents rather than application code, they fit naturally into existing operational workflows. Policies can be stored in version control, reviewed through pull requests, and deployed through the same continuous integration and continuous deployment (CI/CD) pipelines the team already uses. Conflicts between policies are detected and resolved deterministically at query time through the control plane&#x27;s priority system, not through unpredictable interactions between code branches shipped in different releases.</p><p>When something does go wrong, diagnosing the cause is straightforward: Policies are discrete, named, and individually toggleable. A problematic policy can be disabled or deleted immediately without affecting anything else in the system. Compare that to debugging a relevance regression caused by an interaction between a synonym update, a scoring function change, and a new analyzer, all shipped in the same release with no clear attribution.</p><h2>Beyond manual authoring: Large language model–assisted (LLM-assisted) policy suggestions</h2><p>The policies described so far are authored by humans (a merchandiser identifying a gap and drafting a fix). But the same governed workflow supports a second mode: LLM-assisted policy suggestion.</p><p>An LLM can run offline or in the background, analyzing query logs, identifying patterns where search results underperform, such as queries with high exit rates, low click-through, or frequent reformulations. An LLM can then suggest new policies that enter the same Author → Test → Promote pipeline, where a human evaluates each one before it reaches production.</p><h2>Governance is the enabler, not the constraint</h2><p>It might seem counterintuitive: Adding a governance layer makes the system <em>faster</em> to change, not slower. This is the same pattern that works in other domains. CI/CD pipelines don&#x27;t slow down software delivery; they make it safe to ship frequently. Access control doesn&#x27;t slow down collaboration; it makes it safe to share broadly.</p><p>A governed control plane works the same way. The reason a query behavior change takes six weeks isn&#x27;t that the code change is complex; it&#x27;s that nobody is confident enough to ship it faster, because the blast radius is unclear and the rollback path is uncertain.</p><p>Governance provides that confidence. When every policy is explicit, every conflict is resolved deterministically, and every change can be instantly disabled and then rolled back (because policies are structured JSON documents that can be version controlled using existing workflows), the cost of iteration drops dramatically. Business teams move at the speed of the market. Engineering focuses on the platform.</p><h2>From operating model to architecture</h2><p>The shift from business logic in code to business policies as data is more than a technical refactoring; it&#x27;s an organizational change that puts relevance ownership with the teams closest to the business context. 
But it raises an architectural question: How do you evaluate policies at query time without adding latency or turning the control plane itself into a new form of spaghetti?</p><p>The next post will dig into exactly that: the design pattern that enables fast, deterministic policy evaluation at query time.</p><h2>Put governed ecommerce search into practice</h2><p>The workflow described here, merchandisers authoring, testing, and promoting search policies without engineering deployments, is already available. Elastic Services Engineering designed and built it, and Elastic Services has the skills to deploy it for enterprise ecommerce teams.</p><p>If your organization is ready to move from deployment-gated relevance tuning to business-editable search with governance and auditability, we can accelerate your implementation. Contact <a href="https://www.elastic.co/consulting">Elastic Professional Services</a>.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Preconditioning Vectors: Making Elasticsearch VectorDB Better Binary Quantization work for every vector]]></title>
            <description><![CDATA[Modern quantization techniques can hurt recall when using older models or embeddings that aren’t normally distributed. Learn how preconditioning fixes these vectors through random orthogonal projection, making BBQ more effective and recovering recall.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-bbq-preconditioning-vectors</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-bbq-preconditioning-vectors</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[John Wagster]]></dc:creator>
            <pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f5809208179f0040ba699b2c8731533ef845bc7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch as a vector database offers comprehensive quantization techniques like <a href="https://www.elastic.co/search-labs/blog/optimized-scalar-quantization-elasticsearch">Better Binary Quantization (BBQ)</a>. BBQ and other modern quantization techniques compress vectors down to as little as a single bit per dimension, reducing memory use while retaining impressively accurate distance approximation. For vectors generated from deep learning models, such as Cohere models, this works really well; however, for other kinds of vectors, such as image data or histogram features, recall can be impacted heavily. Preconditioning fixes this by applying a random orthogonal rotation to your vectors before quantization, redistributing variance evenly across dimensions so every bit captures meaningful signal, in some cases improving recall by almost 75%.</p><p>Here, we’ll provide some intuition about the problem and how preconditioning solves it.</p><h2>The problem</h2><p>BBQ quantizes each dimension of a vector independently: Values above the mean become 1, values below it become 0. This works well when every dimension carries roughly the same amount of information. Transformer-based embeddings tend to have this property naturally: their dimensions are learned representations that distribute variance evenly.</p><p>But there are lots of real-world vectors that aren’t like this. Consider a 784-dimension vector representing a grayscale image, like in the Fashion-MNIST dataset. Some pixels near the center of the image, where the clothing actually appears, vary a lot across the dataset. However, other pixels, such as those near the corners, are mostly one color and barely vary at all. When BBQ quantizes these vectors, the high-variance dimensions lose precision because a single bit can&#x27;t capture their range, while the low-variance dimensions become useless. The resulting quantized vectors are poor approximations of the originals, and recall suffers.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d7230d6d1067eb8b40f1f3e2e514388dc846473f-767x789.png" style="width: 100%;" /><p>A representation of Fashion-MNIST images. (credit: <a href="https://www.geeksforgeeks.org/deep-learning/how-to-load-fashion-mnist-dataset-using-pytorch/">geeksforgeeks.org</a>)</p><h2>Preconditioning</h2><p>To fix the problem, we want to spread the information more evenly across dimensions so that each bit captures roughly the same amount of information.</p><p>Preconditioning applies a linear transformation to every vector before quantization. The transformation is an orthogonal rotation that reshuffles how information is distributed across dimensions without changing the distances between vectors. If you want to dig into the math, take a look at this in-depth analysis on <a href="https://www.elastic.co/search-labs/blog/robust-optimized-scalar-quantization">optimized scalar quantization (OSQ) with preconditioners</a>.</p><p>Here’s a graphic to help illustrate how preconditioning can help when applying quantization. This simplified two-dimensional diagram illustrates the idea that the orthogonal rotation helps to increase the spread, or range, of information that was previously quite compressed. 
While this two-dimensional animation is not an exact representation of preconditioning, it gives a good intuition for what roughly happens in higher dimensions, where buckets of dimensions are transformed independently and a random projection can greatly improve the distribution. Imagine that the y-axis represents a corner pixel of our Fashion-MNIST images, which is primarily one shade with very low variance, and that the x-axis represents a center pixel where clothing appears, with very high variance. Without preconditioning, quantizing vectors to a single representative point is not a particularly good discriminator.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/40a3b20218bfd440f9d8f4aca991b31b83df759e-1080x608.gif" style="width: 100%;" /><h3>Let’s look at the data</h3><p>Today, preconditioning is supported in <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a>. Here’s a benchmark showing the impact when visiting different percentages of the total vector dataset.</p><p>Fashion-MNIST Recall (784 dimensions, 60K docs, 5x oversample, k: 10)</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Vectors visited</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline recall</td><td style="border: 1px solid #ccc; padding: 4px;">Preconditioned recall</td><td style="border: 1px solid #ccc; padding: 4px;">% Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.45</td><td style="border: 1px solid #ccc; padding: 4px;">0.77</td><td style="border: 1px solid #ccc; padding: 4px;">71%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3%</td><td style="border: 1px solid #ccc; padding: 4px;">0.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.77</td><td style="border: 1px solid #ccc; padding: 4px;">57%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.50</td><td style="border: 1px solid #ccc; padding: 4px;">0.87</td><td style="border: 1px solid #ccc; padding: 4px;">74%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10%</td><td style="border: 1px solid #ccc; padding: 4px;">0.55</td><td style="border: 1px solid #ccc; padding: 4px;">0.91</td><td style="border: 1px solid #ccc; padding: 4px;">65%</td>
                    </tr>
              </tbody>
            </table>
          <p>GIST (960 dimensions, 1M docs, 5x oversample, k: 10)</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Vectors visited</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline recall</td><td style="border: 1px solid #ccc; padding: 4px;">Preconditioned recall</td><td style="border: 1px solid #ccc; padding: 4px;">% Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.1%</td><td style="border: 1px solid #ccc; padding: 4px;">0.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.69</td><td style="border: 1px solid #ccc; padding: 4px;">41%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.2%</td><td style="border: 1px solid #ccc; padding: 4px;">0.70</td><td style="border: 1px solid #ccc; padding: 4px;">0.77</td><td style="border: 1px solid #ccc; padding: 4px;">10%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.3%</td><td style="border: 1px solid #ccc; padding: 4px;">0.73</td><td style="border: 1px solid #ccc; padding: 4px;">0.85</td><td style="border: 1px solid #ccc; padding: 4px;">16%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">13%</td>
                    </tr>
              </tbody>
            </table>
          <p>SIFT (128 dimensions, 1M documents, 5x oversample, k: 10)</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Vectors visited</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline recall</td><td style="border: 1px solid #ccc; padding: 4px;">Preconditioned recall</td><td style="border: 1px solid #ccc; padding: 4px;">% Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0.5%</td><td style="border: 1px solid #ccc; padding: 4px;">0.48</td><td style="border: 1px solid #ccc; padding: 4px;">0.60</td><td style="border: 1px solid #ccc; padding: 4px;">25%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1%</td><td style="border: 1px solid #ccc; padding: 4px;">0.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.71</td><td style="border: 1px solid #ccc; padding: 4px;">20%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3%</td><td style="border: 1px solid #ccc; padding: 4px;">0.71</td><td style="border: 1px solid #ccc; padding: 4px;">0.87</td><td style="border: 1px solid #ccc; padding: 4px;">23%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">7%</td><td style="border: 1px solid #ccc; padding: 4px;">0.72</td><td style="border: 1px solid #ccc; padding: 4px;">0.90</td><td style="border: 1px solid #ccc; padding: 4px;">25%</td>
                    </tr>
              </tbody>
            </table>
          <p>That’s a nice boost in recall; however, this boost comes with a cost. Applying it to all embeddings blindly is inefficient: for datasets that don’t need to be preconditioned, it causes ~2–4% overhead in query latencies with no improvement in recall, plus upwards of 20% additional overhead at index time. For production use cases where you initially see low recall, you may want to evaluate the impact of preconditioning with your specific model and dataset.</p><h2>Here’s the how</h2><p>Preconditioning is available for the <code>bbq_disk</code> index type. Simply set <code>precondition</code> to <code>true</code> in the <code>index_options</code>, like so:</p><pre><code>{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 784,
        "index_options": {
          "type": "bbq_disk",
          "precondition": true
        }
      }
    }
  }
}</code></pre><p>Take a look at the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector#dense-vector-params">dense vector mapping</a> docs for more details.</p><h2>Conclusion</h2><p>BBQ is highly effective for deep learning embeddings, but it can be less effective with embeddings that have uneven variance across dimensions, as can occur in feature-engineered vectors. Preconditioning redistributes that variance so quantization can be more effective. On some datasets, like Fashion-MNIST, we see as much as a 74% improvement in recall!</p><p>For now, we’ve made preconditioning optional. Hopefully, you feel more capable of knowing when it may be beneficial so you try it out yourself. In the future, we plan to iterate on performance and automatically detect when to apply preconditioning.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reindexing data streams due to mapping conflicts]]></title>
            <description><![CDATA[Learn how to fix Elasticsearch mapping conflicts by reindexing data streams. This blog explains the reindexing process and how to ensure new data is correctly mapped.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-mapping-conflicts-reindex-data-streams</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-mapping-conflicts-reindex-data-streams</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Lisa Larribas]]></dc:creator>
            <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f2573aa3d29b3a628e4fce606c803add2641501-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When mapping conflicts arise in fields, whether they’re Elastic Common Schema–standard (ECS-standard) or specific to the data source, reindexing your data using Dev Tools becomes necessary. These conflicts can negatively impact any downstream function following ingestion, potentially causing inaccurate results or preventing the use of the complete dataset in features like visualizations, dashboards, the Security app, and aggregations. This blog post details the steps for this reindexing process.</p><p>This blog&#x27;s content was developed and verified using Elastic versions 9.2.8 and 8.19.14, along with Filestream Integration versions 2.3.0 and 1.2.0.</p><p><strong>Important note:</strong> Depending on your environment, some steps may require specific modifications. Furthermore, be aware that dynamic templates were removed from the <code>@package</code> component template starting with Filestream Integration version 2.3.3.</p><p>Before starting the reindexing process, it’s important to consider the current storage allocation in your environment. The steps outlined below involve creating a copy of the existing backing index, which will temporarily reside in the <a href="https://www.elastic.co/docs/manage-data/lifecycle/data-tiers">hot tier</a>.</p><p><strong><span style="text-decoration:underline">Elasticsearch data tiers</span></strong></p><ul><li><strong>Hot: </strong>The hot tier is the Elasticsearch entry point for time series data, storing the most recent, frequently searched data. Hot tier nodes require fast reads and writes, necessitating more resources and faster storage (SSDs). This tier is mandatory, and new data stream indices are automatically allocated here.</li><li><strong>Warm: </strong>Time series data can move to the warm tier once it’s being queried less frequently than the recently indexed data in the hot tier. The warm tier typically holds data from recent weeks. Updates are still allowed but are likely infrequent. Nodes in the warm tier generally don’t need to be as fast as those in the hot tier. For resiliency, indices in the warm tier should be configured to use one or more replicas.</li><li><strong>Cold: </strong>Data that’s infrequently searched can move from the warm to the cold tier. The cold tier, while still searchable, prioritizes lower storage costs over search speed. Alternatively, the cold tier can store regular indices with replicas instead of searchable snapshots, allowing use of less expensive hardware for older data without reducing disk space requirements compared to the warm tier.</li><li><strong>Frozen: </strong>Data that’s queried infrequently or no longer queried moves from the cold to the frozen tier for its remaining lifecycle. This tier uses a snapshot repository and partially mounted indices to store and load data, reducing local storage and costs while still allowing search. Searches on the frozen tier are generally slower than on the cold tier because Elasticsearch may need to fetch frozen data from the snapshot repository. We recommend dedicated frozen tier nodes.</li></ul><h2>Prerequisites: Determine which fields have conflicts</h2><p>To determine which fields have mapping conflicts, navigate to <strong>Stack Management -&gt; Data Views -&gt; logs-*</strong> (the logs-* data view is the highest level of the hierarchy for data with the <em>logs-</em> prefix). If there are any conflicts, a yellow box will state that. 
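</p><p>For a quick check outside the UI, the field capabilities API reports every type a field is mapped to across the matching indices and, for a conflicting field, which indices use which type. For example, for the <code>log.offset</code> field examined later in this post:</p><pre><code>GET logs-*/_field_caps?fields=log.offset</code></pre><p>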
You may either click <strong>View conflicts</strong> or, under the <strong>Field type</strong> box next to the <strong>Search </strong>box, select <strong>conflict</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7d41594682b601a30a9544b8db678f118b0146ab-2048x720.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41ad800daa6fc244a1123ba7538820bff5de6788-747x182.png" style="width: 100%;" /><p>Clicking the yellow <strong>Conflict</strong> button will reveal which indices are associated with which mapping types.</p><p>This situation (where the field is mapped as both a <code>keyword</code> and a <code>long</code>) typically occurs because data was ingested before a specific mapping type was defined in the <a href="https://www.elastic.co/docs/manage-data/data-store/templates#component-templates">component template</a> for the relevant <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams">data stream</a>. In such cases, Elasticsearch attempts to set the mapping based on its dynamic templates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9973c1935aa52292c1ace09a8e9c0b31ad99e7a2-2048x1085.png" style="width: 100%;" /><p>To determine which mapping is appropriate for the field: if the field is an ECS field, verify it against the <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">ECS field reference</a>. If the field in question is not an ECS field, its value must be reviewed to determine the correct mapping.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c77d0a1520bd1ad17e7ffa1480ecf5e224953e1-418x360.png" style="width: 100%;" /><p>If a field, such as <code>log.offset</code> in this example, isn’t documented in the ECS, the next steps are to investigate the field&#x27;s value, determine which conflicting mapping type has the most backing indices, and examine the component templates of the other indices.</p><p>Typically, the mapping type associated with the highest number of indices is the correct one, but we recommend you verify the value of the field in question to validate this. To confirm the validity of a mapping type (for example, <code>long</code>), you must also verify that the field&#x27;s value is appropriate for that type. This verification can be done by using <strong>Discover </strong>to search for the field in question. Reviewing other data streams that contain the same field can provide additional confirmation.</p><p>To review the values present for the field with the mapping issue, navigate back to the yellow <strong>Conflict</strong> button mentioned earlier, click it, highlight one of the backing indices, and paste it into a <strong>Discover</strong> session. Your Kibana Query Language (KQL) statement should look like the one in the following screenshot, including the <strong><code>_index</code>:</strong> field delimiter.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/781f63b34a9abd427ceb896484da29af446e3326-2048x1063.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b7b0cb847c1694ab605c61a538722f5be004ec86-2048x909.png" style="width: 100%;" /><h2>Prepare the new backing index custom component template</h2><p>To address the mapping conflict in the data stream, first examine the relevant <code>@package</code> component template. 
You can find this under <strong>Stack Management -&gt; Index Management -&gt; Component Template</strong>. Search for the data stream and select the corresponding <code>@package</code> link. This template contains mappings for the fields out of the box and, while it isn’t common to have a mapping mismatch, it’s possible for the more appropriate type to be overlooked.</p><p>Review the template to confirm it contains the necessary field nesting and mapping for the field in question. For example, if the template incorrectly lists <code>log.offset</code> as a <code>keyword</code>, this is the source of the issue.</p><p><strong>Important:</strong> Because modifying <code>@package</code>/managed templates isn’t recommended, you must use or create an <code>@custom</code> component template to correct the mapping type (for example, for <code>log.offset</code>) for all future data.</p><ul><li>We don’t recommend modifying the <code>@package</code>/managed templates, since when you update the integration to a more recent version, any changes you make to the <code>@package</code> template will be overwritten. This is why we recommend using the <code>@custom</code> templates.</li><li>If a data stream is experiencing mapping conflicts, you need to add any missing field (ECS and non-ECS) nestings or mappings to the data stream&#x27;s <code>@custom</code> component template. Create this template if it doesn&#x27;t exist yet, and make sure to specify the correct mapping type for the field.</li><li>If you have multiple conflicts in your data view, apply all the necessary missing mappings for the data stream simultaneously so that the reindex is performed once versus multiple times. Having entries for proper data typing in the <code>@custom</code> component template will ensure any future data ingestion will follow the same mapping guideline.</li></ul><p>To create the <code>@custom</code> component template (or verify it’s in use and populated), navigate to <strong>Index Templates</strong>, type in the name of the data stream in question, and click the appropriate <code>@custom</code> template being used by the data stream. If the template is not yet created, a yellow box will appear, allowing you to create the template through the UI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bea7cb172227bebc28146e3f2f016e112f34cba5-2048x720.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8f63a2e396efbe7f1e62dc08a137a22700be484-2048x296.png" style="width: 100%;" /><p>The screenshot below shows the next page once <strong>Create component template</strong> is selected. Leave the defaults as is on the first page and click <strong>Mappings</strong> or <strong>Next</strong> until you reach the <strong>Mappings</strong> page.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/822f1d864302aa4be438c13756b8372f43fa1b0d-2048x1275.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/affa2f1214af516a5a6b571ab813628ed7649275-2048x1235.png" style="width: 100%;" /><p>For a new incoming field to be mapped explicitly, or for a field with a mapping conflict to be corrected when the data stream rolls over (per the configuration set in the index lifecycle policy), the template needs an entry for the field in question.</p><p>The below will set the mapping for the <code>log.offset</code> field in the <code>@custom</code> component template for the filestream data stream. 
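</p><p>For those who prefer Dev Tools over the UI, a roughly equivalent request is sketched below; the <code>logs-filestream.generic@custom</code> template name is assumed from this example&#x27;s data stream, so verify the name your integration actually uses:</p><pre><code>PUT _component_template/logs-filestream.generic@custom
{
  "template": {
    "mappings": {
      "properties": {
        "log": {
          "properties": {
            "offset": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}</code></pre><p>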
Repeat these steps to add any custom fields, or to update fields from the <code>@package</code> template with the appropriate mappings, as needed for this dataset. In this example, when setting offset to <code>Long</code>, the field type will be <code>Numeric</code> and the Numeric type will be <code>Long</code>. Click <strong>Add field</strong>, and then click outside the field panel to continue.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/affa2f1214af516a5a6b571ab813628ed7649275-2048x1235.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea2975ee4e40ac0e10c4170d2a23125101f7f8da-2048x1136.png" style="width: 100%;" /><p>Once all needed fields have been added, click through to review, and select <strong>Create component template</strong> when ready. All new data being ingested from this step forward will have <code>log.offset</code> set to <code>long</code>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bdf8b8290b0c064c9d88990194b15232ffe85709-2048x1027.png" style="width: 100%;" /><h2>Creating the new backing index structure</h2><p>The new backing index needs to have the existing mappings from the data stream’s component template, as well as the ECS <code>ecs@mappings</code> component template. The <code>ecs@mappings</code> component template is applied after the data stream’s component template as a catchall for additional mappings that potentially weren’t captured in the previous component templates.</p><p>Navigate to the browser tab for the data stream&#x27;s <code>@package</code> mappings. (Go to <strong>Stack Management -&gt; Index Management -&gt; Component Template -&gt; <code>logs-filestream.generic@package</code> -&gt; Manage -&gt; Edit</strong>.) Once there, click on the <strong>Review</strong> section, then <strong>Request</strong>, and finally the <strong>Copy</strong> button on the right. Copying the JSON contents of the component template ensures the remaining field mappings and settings are retained while we update the <code>log.offset</code> field mapping. The JSON will form the backing structure for the newly reindexed backing index.</p><p><strong>Important: </strong>If the template’s JSON were not copied and the reindex continued anyway, the <code>log.offset</code> conflict would be resolved, but new conflicts with the integration would appear because the integrity of the current mappings was not upheld, creating double work to resolve the original issue.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/402b0431b0e19374e9b28a4374ed51dfa5fa44ba-2048x897.png" style="width: 100%;" /><p>Open a second browser tab, navigate to Dev Tools, and paste the copied content. Now, to clean up what was pasted:</p><p><strong>Modifications to the request</strong></p><p><strong>1. Index name:</strong> Replace <code>_component_template/logs-filestream.generic@package</code> with the name of the backing index you intend to reindex, appending <code>-1</code> to the end. For example, use <code>PUT &lt;backing index to reindex&gt;-1</code>.</p><ul><li>The appended <code>-1</code> signifies a reindex and won’t conflict with the default ILM rollover settings, which are based on the index&#x27;s creation date.</li></ul><p><strong>2. 
Settings:</strong> Remove the line <code>&quot;template&quot;</code> (line 3), as well as the very last closing brace for the entire JSON payload. Line 3 should then start with <code>&quot;settings&quot;: {</code>.</p><ul><li>Replace the inner contents of the settings section with <code>&quot;index.codec&quot;: &quot;best_compression&quot;</code>. This action will apply Elastic&#x27;s best compression to the index upon creation.</li><li>Add in <code>&quot;index.lifecycle.name&quot;: &quot;logs&quot;</code>, as well as a line for <code>&quot;index.lifecycle.rollover_alias&quot;: &quot;&quot;</code>.<ol><li>The <code>&quot;index.lifecycle.name&quot;: &quot;logs&quot;</code> entry will apply the logs ILM policy to the new backing index. Modify the ILM policy name if you aren’t using logs.</li><li>The <code>&quot;index.lifecycle.rollover_alias&quot;: &quot;&quot;</code> is blank, since this backing index won’t be rolled over, yet the setting is required to avoid ILM rollover errors into the next ILM phase after hot.</li></ol></li></ul><p><strong>3. Structure:</strong> The request should now include both a <code>settings</code> section and a <code>mappings</code> section. Inside <code>&quot;mappings&quot;: {</code>, you should find <code>&quot;dynamic_templates&quot;</code> and a <code>&quot;properties&quot;</code> section containing hard-coded fields and their mappings.</p><p><strong>4. Dynamic templates modification: </strong>The current dynamic templates section contains entries for fields that may be overwritten when the <code>ecs@mappings</code> dynamic templates are added next, causing redundancy and extra lines that aren’t needed.</p><ul><li>Remove all sections in <code>&quot;dynamic_templates&quot;</code> except for the second section titled <code>&quot;_embedded_ecs-data_stream_to_constant&quot;: {</code>.</li><li>Repeat the same process described above for gathering the <code>@package</code> component template’s contents, but this time gather the dynamic mappings for the <code>ecs@mappings</code> component template.<ul><li>It may be easier to copy the entire contents of the mappings from the UI for the <code>ecs@mappings</code> component template, paste into the working Dev Tools <code>dynamic_templates</code> section, and remove duplicate and unnecessary lines where appropriate. Include these dynamic template setting contents after the <code>&quot;_embedded_ecs-data_stream_to_constant&quot;: {</code> entry. The <code>dynamic_templates</code> section should look very similar to the below sample contents in Dev Tools.</li></ul></li><li><strong>If the <code>dynamic_templates</code> section is not included or is removed altogether</strong>, only the <code>&quot;properties&quot;</code> section under <code>&quot;mappings&quot;</code> would remain, and other fields (review the screenshot below) would get double mappings of <code>text</code> and <code>keyword</code> instead of the appropriate mappings they would have had with the <code>dynamic_templates</code> section included. This will also create issues in the data view by having the fields be double mapped (if not already mapped this way) and will cause additional mapping conflicts.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24e972cd0fc8eadf943b21cfdd80a5d435e705aa-2048x994.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aac13fbe882516e5ed5b5b1b5271c0ae34e80b04-1890x2048.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c43d6f61c8ece4de2d51657f239a0c34ced07cdb-1928x1452.png" style="width: 100%;" /><p><strong>5. Metadata removal:</strong> Delete the last section labeled <code>&quot;_meta&quot;</code>, as well as the section labeled <code>&quot;version&quot;</code>, if present.</p><p><strong>6. Formatting:</strong> Auto-indent the remaining sections, and adjust or remove any unnecessary curly braces that would prevent a successful execution.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ebe1573647500de75315e7655256a0db9604c40d-2048x1402.png" style="width: 100%;" /><p><strong>7. Mapping change:</strong> Navigate to the <code>&quot;properties&quot;</code> section, find <code>&quot;log&quot;</code>, and then locate <code>&quot;offset&quot;</code> nested underneath. Change the type from <code>keyword</code> to <code>long</code>, and remove the line entry (comma included) labeled <code>&quot;ignore_above&quot;: 1024,</code>. If more than one entry was added to the <code>@custom</code> component template created earlier, include them here.</p><p>Your Dev Tools console view should now be similar to the example provided below.</p><pre><code>PUT .ds-logs-filestream.generic-default-2026.04.14-000001-1
{
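  // Shell for the reindexed copy of the conflicted backing index; note the appended -1 in the index name above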
  "settings": {
    "index.codec": "best_compression",
    "index.lifecycle.name": "logs",
    "index.lifecycle.rollover_alias": ""
  },
  "mappings": {
    "dynamic_templates": [
      {
        "_embedded_ecs-data_stream_to_constant": {
          "path_match": "data_stream.*",
          "mapping": {
            "type": "constant_keyword"
          }
        }
      },
      {
        "ecs_timestamp": {
          "mapping": {
            "ignore_malformed": false,
            "type": "date"
          },
          "match": "@timestamp"
        }
      },
      {
        "ecs_message_match_only_text": {
          "path_match": [
            "message",
            "*.message"
          ],
          "mapping": {
            "type": "match_only_text"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_non_indexed_keyword": {
          "path_match": [
            "*event.original"
          ],
          "mapping": {
            "index": false,
            "type": "keyword",
            "doc_values": false
          }
        }
      },
      {
        "ecs_non_indexed_long": {
          "path_match": [
            "*.x509.public_key_exponent"
          ],
          "mapping": {
            "index": false,
            "type": "long",
            "doc_values": false
          }
        }
      },
      {
        "ecs_ip": {
          "path_match": [
            "ip",
            "*.ip",
            "*_ip"
          ],
          "mapping": {
            "type": "ip"
          },
          "match_mapping_type": "string"
        }
      },
      {
        "ecs_wildcard": {
          "path_match": [
            "*.io.text",
            "*.message_id",
            "*registry.data.strings",
            "*url.path"
          ],
          "mapping": {
            "type": "wildcard"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_path_match_wildcard_and_match_only_text": {
          "path_match": [
            "*.body.content",
            "*url.full",
            "*url.original"
          ],
          "mapping": {
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            },
            "type": "wildcard"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_match_wildcard_and_match_only_text": {
          "mapping": {
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            },
            "type": "wildcard"
          },
          "unmatch_mapping_type": "object",
          "match": [
            "*command_line",
            "*stack_trace"
          ]
        }
      },
      {
        "ecs_path_match_keyword_and_match_only_text": {
          "path_match": [
            "*.title",
            "*.executable",
            "*.name",
            "*.working_directory",
            "*.full_name",
            "*file.path",
            "*file.target_path",
            "*os.full",
            "*email.subject",
            "*vulnerability.description",
            "*user_agent.original"
          ],
          "mapping": {
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            },
            "type": "keyword"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_date": {
          "path_match": [
            "*.timestamp",
            "*_timestamp",
            "*.not_after",
            "*.not_before",
            "*.accessed",
            "created",
            "*.created",
            "*.installed",
            "*.creation_date",
            "*.ctime",
            "*.mtime",
            "ingested",
            "*.ingested",
            "*.start",
            "*.end",
            "*.indicator.first_seen",
            "*.indicator.last_seen",
            "*.indicator.modified_at",
            "*threat.enrichments.matched.occurred"
          ],
          "mapping": {
            "type": "date"
          },
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_path_match_float": {
          "path_match": [
            "*.score.*",
            "*_score*"
          ],
          "mapping": {
            "type": "float"
          },
          "path_unmatch": "*.version",
          "unmatch_mapping_type": "object"
        }
      },
      {
        "ecs_usage_double_scaled_float": {
          "path_match": "*.usage",
          "mapping": {
            "scaling_factor": 1000,
            "type": "scaled_float"
          },
          "match_mapping_type": [
            "double",
            "long",
            "string"
          ]
        }
      },
      {
        "ecs_geo_point": {
          "path_match": [
            "*.geo.location"
          ],
          "mapping": {
            "type": "geo_point"
          }
        }
      },
      {
        "ecs_flattened": {
          "path_match": [
            "*structured_data",
            "*exports",
            "*imports"
          ],
          "mapping": {
            "type": "flattened"
          },
          "match_mapping_type": "object"
        }
      },
      {
        "all_strings_to_keywords": {
          "mapping": {
            "ignore_above": 1024,
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ],
    "properties": {
      "input": {
        "properties": {
          "type": {
            "ignore_above": 1024,
            "type": "keyword"
          }
        }
      },
      "@timestamp": {
        "ignore_malformed": false,
        "type": "date"
      },
      "ecs": {
        "properties": {
          "version": {
            "ignore_above": 1024,
            "type": "keyword"
          }
        }
      },
      "log": {
        "properties": {
          "file": {
            "properties": {
              "inode": {
                "ignore_above": 1024,
                "type": "keyword"
              },
              "path": {
                "ignore_above": 1024,
                "type": "keyword"
              },
              "device_id": {
                "ignore_above": 1024,
                "type": "keyword"
              },
              "fingerprint": {
                "index": false,
                "type": "keyword"
              }
            }
          },
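          // Corrected mapping: previously keyword with "ignore_above": 1024, now long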
          "offset": {
            "type": "long"
          },
          "level": {
            "ignore_above": 1024,
            "type": "keyword"
          }
        }
      },
      "data_stream": {
        "properties": {
          "namespace": {
            "type": "constant_keyword"
          },
          "type": {
            "type": "constant_keyword"
          },
          "dataset": {
            "type": "constant_keyword"
          }
        }
      },
      "event": {
        "properties": {
          "original": {
            "index": false,
            "type": "keyword",
            "doc_values": false
          },
          "module": {
            "type": "constant_keyword",
            "value": "filestream"
          },
          "dataset": {
            "type": "constant_keyword",
            "value": "filestream.generic"
          }
        }
      },
      "message": {
        "type": "match_only_text"
      },
      "tags": {
        "ignore_above": 1024,
        "type": "keyword"
      }
    }
  }
}</code></pre><p>After your console resembles the example (with any additional custom fields included and custom values specific to your environment), execute the command to create the shell of the new backing index, pausing to resolve any errors that arise.</p><h2>Begin reindex process</h2><p>With the shell of the new backing index successfully created, the next step is to reindex and resolve the mapping conflicts.</p><p><strong>Important:</strong> If the backing index that has the mapping conflict is the most recent index and is the current write index (for example, the ending number for the backing index is -000001), the data stream needs to be rolled over. Rolling over the data stream is needed since the current write index, which is having documents fed into it, is a live backing index and cannot be modified.</p><p>With the correct field mapping now applied to the newer write index via the previously created <code>@custom</code> component template, all new documents will reflect this change.</p><p>This is performed by executing the following: </p><pre><code>POST <full data stream name>/_rollover</code></pre><p>For example: </p><pre><code>POST logs-filestream.generic-default/_rollover</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/22abc1a2f6de0420aa0d56ac498894111df7f4fd-2048x330.png" style="width: 100%;" /><p>Reindexing involves copying the data from an existing backing index to a new one within the same naming convention, typically to apply necessary changes. These modifications could include updates to a component template or the addition of a new ingest pipeline for the data to be processed through.</p><p>Next, the data will be copied from the backing index that has the incorrect mappings into a new backing index. The original backing index has been rolled over, meaning no new documents can be added to it. The new backing index will follow the same naming convention, which preserves data visibility and integrity while applying the correct ILM policy, but will include a <code>-1</code> suffix to indicate that it has been reindexed.</p><p>Adjust the index names as needed and paste the following code into the console. By including <code>wait_for_completion=false</code>, you can track the progress of document copying, which helps estimate the remaining reindexing time. Without this setting, you cannot track the status using the <code>GET _tasks</code> command below and will only be able to check the document count in the newer backing index using <code>GET &lt;backing index name&gt;-1/_count</code>.</p><p><strong>Important: </strong>If issues arise during the reindex process, don’t rerun the reindex command; doing so will restart the process and create duplicate records in the index ending with <code>-1</code>. If a restart is necessary, first delete the index with the trailing <code>-1</code>, and then execute the preceding <code>PUT</code> command to recreate the new backing index shell.</p><pre><code>POST _reindex?wait_for_completion=false
{
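  // wait_for_completion=false returns a task ID for tracking progress with GET _tasks/<task ID>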
  "source": {
    "index": "<source backing index>"
  },
  "dest": {
    "index": "<new backing index>-1"
  }
}

</code></pre><p>For example:</p><pre><code>
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": ".ds-logs-filestream.generic-default-2026.04.13-000001"
  },
  "dest": {
    "index": ".ds-logs-filestream.generic-default-2026.04.13-000001-1"
  }
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/22f9b1f762802ecd20faa7c7c1f76c9d1444aba5-2048x530.png" style="width: 100%;" /><p>Upon execution, the response will include a task ID. You can monitor the reindex progress using this ID with the command: <code>GET _tasks/&lt;task ID&gt;</code>.</p><p>The duration of the reindex depends on the volume of data in the original index. The completion can be tracked by looking for <code>&quot;completed&quot;: true</code> when executing the <code>GET</code> command, which should yield a similar output.</p><p><code>GET _tasks/&lt;task ID&gt;</code></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dbf0fb0a560a78236440b8c3de68cdf5c83e6d7a-2048x824.png" style="width: 100%;" /><p>With the reindexing process finished, the next step is to verify that the mappings for the new backing index, and for the specific field in question, are correct.</p><pre><code>GET <backing index>-1/_mapping</code></pre><p>For example:</p><pre><code>GET .ds-logs-filestream.generic-default-2026.04.13-000001-1/_mapping</code></pre><p>You can verify that the mapping for <code>log.offset</code> is as shown below. To confirm that other fields have only a single mapping entry (not both <code>text</code> and <code>keyword</code>), compare them to a field that was not part of the dynamic template section in the preceding <code>PUT</code> command.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db5c12c0a651e804a916d517e6e260e49a8b835a-2048x1121.png" style="width: 100%;" /><p>If the backing index that’s being reindexed has a large number of documents, it’s helpful to check the status of those documents being copied to the new backing index; this can be done with the following two Dev Tools commands, comparing the counts.</p><p><code>GET .ds-logs-filestream.generic-default-2026.04.14-000001/_count</code></p><p><code>GET .ds-logs-filestream.generic-default-2026.04.14-000001-1/_count</code></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0e49ac79edb0abf9fe99d0e6fd35e96d0e3e0e5-2048x880.png" style="width: 100%;" /><p>Once the counts are verified to match and the correct mappings are present, update the data stream to include the new backing index; otherwise, it would be left orphaned in index management, where the ILM policy would never act on it.</p><ul><li>If successful, the response will be an acknowledgment of <code>true</code>.</li></ul><pre><code>POST _data_stream/_modify
{
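  // Attach the reindexed "-1" index to the data stream so ILM manages it like the other backing indices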
  "actions": [
    {
      "add_backing_index": {
        "data_stream": "logs-filestream.generic-default",
        "index": ".ds-logs-filestream.generic-default-2026.04.14-000001-1"
      }
    }
  ]
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0aa3233377c0175258d37eaa661d56cf9f310d5e-2048x1288.png" style="width: 100%;" /><p>Verify the new backing index is added with the following command, making sure the <code>ilm_policy</code> is correct:</p><pre><code>GET _data_stream/logs-filestream.generic-default</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/af8b75cf260f6f088c28a78da86ad31527e0bfd5-2048x839.png" style="width: 100%;" /><p>Check the ILM status of the backing index next with the following command:</p><ul><li>It’s normal to see that the index is in hot, as it was created very recently (review line 8 or 10).</li></ul><pre><code>GET .ds-logs-filestream.generic-default-2026.04.14-000001-1/_ilm/explain</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cd181a31001c7a3ee2b0599a7388909ce5b50baf-2048x972.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/20f6e7632804f173533e655f0292c3c540f26597-2048x894.png" style="width: 100%;" /><p>Execute the following to transition the backing index from the hot tier to the next tier defined after the hot phase in this data stream’s ILM policy. The specific values for <code>phase</code>, <code>action</code>, and <code>name</code> in the <code>current_step</code> below can be referenced from lines 11, 13, and 15, respectively, in the provided screenshot above.</p><p>The <code>next_step</code> value indicates the subsequent ILM phase or data tier to which the index will transition.</p><p>For example:</p><pre><code>POST _ilm/move/.ds-logs-filestream.generic-default-2026.04.14-000001-1
{
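  // current_step must match the phase, action, and step name reported by _ilm/explain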
  "current_step": {
    "phase": "hot",
    "action": "rollover", 
    "name": "check-rollover-ready"
  },
  "next_step": {
    "phase": "warm" 
  }
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ae60b900ce1d0b46ce26ec301901bc8a9ef750c-2048x1249.png" style="width: 100%;" /><ul><li>It isn’t necessary, but as a safety measure, you may execute the <code>_ilm/explain</code> command again to ensure the backing index has moved to the next phase and is no longer in hot.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/58a460cf2ec443223ea68ba7e7166a7cf9d8c97a-2048x915.png" style="width: 100%;" /><p>Once the following conditions are met, you can safely delete the original backing index that had mapping conflicts:</p><ol><li>A new backing index has been successfully created.</li><li>Documents have been moved to the new index, and the document counts match.</li><li>Mappings have been corrected (both data stream specific and ECS).</li><li>The data stream incorporates the new backing index.</li><li>The ILM policy has been applied and has moved the index out of the hot phase.</li></ol><p><strong>Important:</strong> As an additional check before deleting the original index, you can review the <strong>Data Views</strong> page. Select <code>logs-*</code> and verify that the reindexed backing index (which ends in <code>-1</code>) now appears in the <strong><code>long</code></strong> section. The original backing index should still be present under <strong><code>keyword</code></strong>. If the reindexed backing index is not in the <strong><code>long</code></strong> section, go back and review the preceding steps and make any necessary corrections.</p><p>For example:</p><pre><code>DELETE .ds-logs-filestream.generic-default-2026.04.14-000001</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09d661b20a44929b4736a43eaa3df84180b25f30-2048x1295.png" style="width: 100%;" /><p>After resolving the conflicts, return to the <strong>Data Views</strong> page and select <code>logs-*</code>. If the conflict was solely related to <code>log.offset</code>, you should no longer see any conflicts listed. If there were other conflicts, the original backing index should no longer appear in the conflict list; instead, the new backing index should now be listed in the <code>long</code> section.</p><p>You can also verify in <strong>Discover</strong> that the <code>log.offset</code> field now displays the appropriate icons.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1c3bb7029c99aa4bc6b0931f39f5648654b35ccd-2048x1204.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0af1b1aa3a031c207aa5eb083696dd081d941e67-2048x1001.png" style="width: 100%;" /><p>Continue this process, repeating the above steps for every backing index that has a mapping conflict until all are successfully resolved.</p><p>References:</p><ul><li><a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">ECS field reference</a></li><li><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex">Reindex documents</a></li></ul><h2>Final thoughts</h2><p>By following the steps in this blog, you will resolve mapping conflicts and ensure that all new data is correctly mapped. This is achieved by linking the necessary component templates to your data source. This workflow not only fixes the immediate issues but also establishes a secure and repeatable process for managing schema changes as your data and requirements evolve.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How we built Elasticsearch simdvec to make vector search one of the fastest in the world]]></title>
            <description><![CDATA[How we built Elasticsearch simdvec, the hand-tuned SIMD kernel library behind every vector search query in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-vector-search-simdvec-engine</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-vector-search-simdvec-engine</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Hegarty, Lorenzo Dematte, Simon Cooper]]></dc:creator>
            <pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/df9646091bafbbf0a6dfd212ff8a6bd1e8589708-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch simdvec is the engine behind every vector distance computation in Elasticsearch. It provides hand-tuned AVX-512 and NEON kernels for every vector type Elasticsearch supports. Its bulk scoring architecture hides memory latency through explicit prefetching on x86 and interleaved loading on ARM, outperforming libraries like FAISS and jvector by up to 4x when data exceeds CPU cache. In this post, we explain why we built it, what’s inside, and how it makes Elasticsearch vector search one of the fastest in the world.</p><h2>How we built Elasticsearch simdvec</h2><p>Every vector search query in Elasticsearch, whether <a href="https://arxiv.org/abs/1603.09320">Hierarchical Navigable Small World (HNSW)</a> traversal, inverted file (IVF) scan, or reranking pass, reduces to the same problem: computing distances between vectors, millions of times per query. Elasticsearch supports a wide range of data types and quantization strategies, from float32 to int8, bfloat16, binary, and Better Binary Quantization (BBQ). Each comes with different trade-offs between memory, throughput, and recall. Behind all of it is a single engine: simdvec.</p><p>We built simdvec to make every distance computation as fast as the hardware allows. In this post, we explain why we built it, what’s inside, and where it delivers the most impact.</p><h3>Built like a race car</h3><p>As Formula 1 enthusiasts, with one of us having previously worked with the Ferrari Formula 1 Team, we see a clear parallel. A Formula 1 car is designed with a single purpose: to achieve the best lap time. Engine power, aerodynamics, and chassis design only matter insofar as they contribute to that outcome. The same is true of a vector database, where indexing throughput, query latency, and recall define success.</p><p>While the end result is what matters, reaching the highest levels of performance requires each component to be at its best. It can’t just be <em>good enough</em>, it has to be the <em>best </em>in its category. Simdvec is built with that mindset, focusing on a critical part of the system: the engine. It’s a purpose-built, <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">single instruction multiple data</a> (SIMD) optimized kernel library that provides hand-tuned native C++ distance functions, called from Java via the <a href="https://openjdk.org/projects/panama/">Panama</a> foreign function interface (FFI). It supports bulk scoring, cache line prefetching, and all vector types and layouts used in Elasticsearch.</p><p>That’s the engine behind every query.</p><h3>Why we built our own</h3><p>We started in 2023 with the Panama Vector API in Apache Lucene. It worked well for float32 dot products, but Elasticsearch&#x27;s needs quickly outgrew what it could provide. Elasticsearch supports a wide range of quantized vector types: int8, int4, bfloat16, single-bit, and asymmetric BBQ. Each has different SIMD strategies, packing layouts, and accumulator requirements. Beyond type coverage, Elasticsearch&#x27;s scoring paths demand more than single-pair throughput: HNSW needs to score several graph neighbors in one pass, IVF needs bulk scoring of thousands of candidates with prefetching, and disk-based scoring needs to work directly on mmap&#x27;d memory without copying. 
We looked at what was available, and nothing covered the full set.</p><p>So we built simdvec: hand-tuned native C++ kernels called from Java via FFI, with bulk scoring, prefetching, and support for every vector type Elasticsearch uses. By owning the library, we control the full stack. When we add a new quantization type like BBQ, it gets a tuned SIMD kernel wired all the way through the system. We don&#x27;t wait for an upstream library to support it, and we don&#x27;t compromise on performance for any type. Every vector query in Elasticsearch, whether HNSW, IVF, reranking, or hybrid, runs on this engine, built around the operations and types we actually use.</p><p>Simdvec has separate native libraries for x86 and ARM, each with multiple instruction set architecture (ISA) tiers selected at startup. The call overhead from Java via FFI is very low at <a href="https://github.com/ldematte/simsimd-benchmarks/blob/main/COMPARISON.md#ffm-downcall-overhead-measurements">single-digit nanoseconds</a>.</p><h3>The landscape</h3><p>We&#x27;re not the only ones building SIMD-optimized vector distance kernels. The ecosystem is rich, and we wanted to understand how simdvec performs. Not to rank projects, but to provide context and explain where Elasticsearch&#x27;s engine sits. We selected three projects as reference points, each representing a different approach:</p><ul><li><strong>jvector:</strong> A Java approximate nearest neighbor (ANN) library that uses the Panama Vector API for vectorized distance computation, with optional native C acceleration on x86.</li><li><strong>FAISS:</strong> A widely deployed open source vector search framework, with hand-tuned AVX2/AVX-512 kernels.</li><li><strong>NumKong</strong> (formerly SimSIMD): A comprehensive suite of over 2,000 hand-tuned SIMD kernels spanning distance functions, matrix operations, and geospatial computation.</li></ul><p>Each project serves a different purpose and makes different trade-offs. We include reference numbers from them to give context for simdvec&#x27;s performance on the specific operations that Elasticsearch needs.</p><h3>How we measure</h3><p>The simdvec and <a href="https://github.com/ChrisHegarty/jvector-kernel-benchmarks">jvector benchmarks</a> are written in Java with JMH, the standard JVM microbenchmark harness, with FFI overhead included. For <a href="https://github.com/ldematte/simsimd-benchmarks">NumKong benchmarks</a> and <a href="https://github.com/ChrisHegarty/faiss-kernel-benchmarks">FAISS benchmarks</a>, we wrote small C/C++ harnesses using Google Benchmark, which is the standard C++ microbenchmark framework. Both frameworks report nanoseconds per operation with warmup and iteration calibration. We verified via hardware performance counters that all libraries are using SIMD on both platforms. All the benchmark code is publicly available in the linked GitHub repositories (and, in the case of simdvec, in the <a href="https://github.com/elastic/elasticsearch">elasticsearch</a> repository).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aee116772df161cf86b7668f575ac34c733a23c5-1580x238.png" style="width: 100%;" /><p><strong>Software:</strong> JDK 25.0.2, JMH 1.37, GCC 14, Google Benchmark (latest).</p><h2>One vector at a time</h2><p>The most fundamental operation in vector search is computing the distance between two vectors. 
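In its simplest scalar form, that operation is just a multiply-add loop. The sketch below is TypeScript for illustration only (the production kernels are hand-tuned C++ SIMD; this shows the operation, not the implementation):</p><pre><code>// A dot-product similarity in plain scalar form. simdvec&#x27;s kernels compute
// exactly this, but many lanes at a time in SIMD registers.
function dotProduct(query: Float32Array, doc: Float32Array): number {
  let acc = 0;
  for (let i = 0; i &lt; query.length; i++) {
    acc += query[i] * doc[i]; // one multiply-add per dimension
  }
  return acc; // for normalized vectors, higher means more similar
}</code></pre><p>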
Every HNSW neighbor evaluation, every IVF candidate score, every reranking comparison reduces to this inner loop.</p><p>We measured single-pair throughput at 1024 dimensions on both platforms, starting with float32, the baseline type and the one where the ecosystem is most competitive. We compare simdvec against FAISS and jvector; we excluded NumKong as it uses float64 accumulators for float32, making it 3.2x-5.3x slower (depending on platform), prioritizing numerical precision over throughput. To keep the comparison like-for-like, we benchmark NumKong on int8 instead, where it uses the same accumulator strategy as simdvec.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/352cfa0fc18f123140f746d843404f80127fb1b7-1500x675.png" style="width: 100%;" /><p>On x86, FAISS AVX-512 is the fastest single-pair kernel at 23 ns. Simdvec AVX-512 follows at 28 ns, a gap that reflects the FFI call overhead. Both use 512-bit FMA with multi-accumulator unrolling. At the AVX2 level, the two are much closer, 36 ns and 39 ns respectively, both constrained by the 256-bit register and memory load widths. jvector lands at 44 ns using the Java Panama Vector API. Panama generates good SIMD code, but hand-tuned C++ intrinsics retain an edge.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/95e44d9767f21ec4bccaed835e3c99aa180431ee-1500x495.png" style="width: 100%;" /><p>On ARM, simdvec leads at 70 ns, well ahead of jvector at 110 ns and FAISS at 156 ns. Simdvec has hand-tuned NEON kernels for aarch64. Jvector has no native ARM code and relies on Panama. FAISS relies on compiler auto-vectorization rather than explicit NEON intrinsics, which accounts for the wider gap. This reflects a practical advantage of owning the kernel library: when Elasticsearch expanded to Graviton, we added purpose-built NEON kernels. Neither jvector nor FAISS has prioritized ARM native code to the same degree.</p><p>But Elasticsearch doesn&#x27;t only score float32. <strong>Int8</strong> quantization reduces memory by 4x, bfloat16 by 2x, and BBQ by 32x. Each type needs its own SIMD strategy, and simdvec provides hand-tuned native kernels for all of them.</p><p>Of the libraries we compared, only NumKong has comparable kernels for int8. We measured int8 dot product, squared Euclidean, and cosine at 1024 dimensions.</p><p><strong>Int8 single-pair scoring (1024 dimensions, ns/vec op – lower is better)</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0b852fd5f51d57bd2488886600472ee65ab64da-1594x378.png" style="width: 100%;" /><p>On both architectures, NumKong is as fast or faster at small-to-medium dimensions, where the difference is largely due to lower call overhead (direct C call vs Java FFI). At larger dimensions, simdvec&#x27;s more efficient kernel implementation (which uses cascade unrolling) amortizes the call cost: as dimension increases, <a href="https://github.com/ldematte/simsimd-benchmarks/blob/main/COMPARISON.md#single-pair-i8-nsop-2">this gap closes and eventually reverses</a>. The crossover sits at dimensions between 768 and 1536, depending on function and architecture.</p><p>Despite the slightly higher overhead of Java FFI, simdvec is on par with highly optimized C/C++ libraries. Not only is it the only library with optimized kernels for both float32 <em>and</em> int8; it also leads on ARM, trails FAISS only slightly on x86 (for float32), and stays very close to NumKong on both architectures (for int8). 
And for bfloat16, int4, binary, and BBQ, while alternatives exist, simdvec distinguishes itself through hand-tuned SIMD tailored to each type&#x27;s data layout.</p><p>But a production search engine doesn’t score one vector at a time; it scores thousands per query. The next question is what happens at that scale.</p><h3>Thousands at a time</h3><p>Single-pair performance is only part of the picture. What matters in practice is how systems behave under load. A single HNSW query may score hundreds of graph neighbors. An IVF scan may score thousands of posting list entries. A reranking pass may score tens of thousands of candidates. Single-pair throughput matters, but what matters more is how fast you can score many vectors, and how gracefully performance degrades as the working set spills out of CPU caches.</p><p>Simdvec provides bulk scoring for every data type. These aren&#x27;t just loops over single-pair kernels; they use multi-accumulator inner loops that load the query vector once per dimension stride and share it across multiple document vectors, with explicit cache-line prefetching for the next batch. Neither jvector nor FAISS offers an equivalent (at the time of writing). Jvector has no bulk API, so callers score one pair at a time in a loop. FAISS exposes <code>fvec_inner_products_ny</code>, which, at the time of writing, is implemented as a loop over its single-pair distance function with no query amortization or prefetching.</p><p><strong>Float32.</strong> To measure the impact at the kernel level, we scored a single query against increasing numbers of 1024-dimension float32 document vectors using random access patterns that simulate HNSW-like scattered graph neighbor lookups. The three dataset sizes, 32, 625, and 32,500 vectors, are chosen so the working set exceeds L1, L2, and L3 cache, respectively.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1d90267be1c63ac82b8ba588617ebedb0be0d1b6-1334x558.png" style="width: 100%;" /><p>When the data fits in cache, simdvec is the fastest on both platforms, but the margins are modest since kernel arithmetic dominates. The real separation appears as the working set grows beyond L3. On x86, simdvec scores at 95 ns per vector, while FAISS needs 165 ns and jvector 412 ns. On ARM, the pattern is the same: simdvec holds at 162 ns, while FAISS climbs to 347 ns, and jvector to 476 ns. The prefetching and query amortization in simdvec keep memory latency hidden in a way that a simple loop over single-pair kernels cannot match, and the advantage widens precisely where real search workloads operate, deep in main memory.</p><p><strong>Int8.</strong> The same pattern holds for quantized types. We measured int8 dot product bulk scoring at 1024 dimensions with dataset sizes chosen to exceed the same L1, L2, and L3 cache boundaries, comparing simdvec&#x27;s bulk scoring against NumKong single-pair scoring in a loop.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9ee97d98b40d13b19370b76b33e67ea66bfb3250-1580x338.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/601f50a03fa3e6263cbac9700281dfe3e511de60-1580x338.png" style="width: 100%;" /><p>On x86, simdvec is between 1.2x and 1.9x faster, driven by the combination of explicit prefetching and batch processing. On ARM, simdvec wins again (1.7x to 1.9x faster) across all dataset sizes. 
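</p><p>The shape of that bulk loop is worth making explicit. This is an illustrative TypeScript sketch (the real kernels are C++ intrinsics, and the prefetch instructions they issue have no JavaScript equivalent):</p><pre><code>// Bulk scoring sketch: one query scored against four documents per pass.
// Each query element is loaded once and shared across four documents, and
// the four independent accumulators give the CPU parallel work to hide latency.
function bulkDot4(
  query: Float32Array,
  d0: Float32Array, d1: Float32Array, d2: Float32Array, d3: Float32Array
): [number, number, number, number] {
  let a0 = 0, a1 = 0, a2 = 0, a3 = 0;
  for (let i = 0; i &lt; query.length; i++) {
    const q = query[i]; // amortized: one query load feeds four multiply-adds
    a0 += q * d0[i];
    a1 += q * d1[i];
    a2 += q * d2[i];
    a3 += q * d3[i];
    // the native kernels also prefetch the next batch&#x27;s cache lines here
  }
  return [a0, a1, a2, a3];
}</code></pre><p>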
The advantage comes from batch processing four vectors at a time, providing memory-level parallelism via an interleaved access pattern. In both cases, the most striking result is what happens at the largest dataset size, where it matters the most.</p><p>Results for squared distance and cosine show a similar pattern, with speedups of 1.4x to 1.8x for ARM, and of 1.3x to 3.0x for x86 (details <a href="https://github.com/ldematte/simsimd-benchmarks/blob/main/COMPARISON.md">here</a>).</p><h3>When memory matters</h3><p>Production vector indices typically don&#x27;t fit in CPU cache. A 10M-vector int8 index at 1024 dimensions is 10GB. Scoring candidates means streaming data from DRAM, and that&#x27;s where bulk scoring architecture makes the difference.</p><p>We used hardware performance counters to measure what happens inside the CPU during bulk scoring and found that hiding memory latency requires two fundamentally different strategies, one per architecture.</p><p><strong>On x86, explicit prefetching eliminates cache misses. </strong>The bulk kernel processes vectors sequentially, one fully computed before the next, while issuing prefetch instructions for the next batch. Future data is pulled into L1 before the CPU needs it.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a39548f8060c2a4a5154521a4047dd92d8cd77be-1580x309.png" style="width: 100%;" /><p>On ARM, the same sequential approach performed poorly, even with prefetching. Instead, <strong>the bulk kernel interleaves loads</strong> from four vectors at every stride position, giving the out-of-order engine four independent memory streams. The CPU is not fetching data any faster, but rather waiting less by always having something else to compute while memory requests are in flight. Detailed analysis can be found in <a href="https://github.com/elastic/elasticsearch/issues/145412">this GitHub issue</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/832de3bc0d556a493b7bf3f250196018acfc1585-1580x238.png" style="width: 100%;" /><p>The numbers tell two different stories:</p><ol><li>On x86, prefetching turns 139K cache misses into 19K, and instructions per cycle (IPC) more than doubles. The bulk advantage grows with dataset size, from 1.2x in L2 to 2.8x beyond L3, because prefetching hides progressively more expensive DRAM round trips.</li><li>On ARM, cache misses barely change. What changes is utilization: Backend stalls drop 40% because the interleaved access pattern keeps the pipeline fed. This advantage stays a consistent 1.8x regardless of dataset size, because memory-level parallelism applies whether data comes from cache or DRAM.</li></ol><p>Two architectures, two strategies, one result: At production scale, simdvec keeps the CPU pipeline busy even when vectors are scattered across main memory.</p><h2>What this means for Elasticsearch users</h2><p>These kernel-level capabilities compound. A single vector search query may compute millions of distance operations: HNSW graph traversal, candidate scoring, reranking. Across thousands of concurrent queries, nanoseconds per operation translate directly to query latency and cluster throughput. 
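</p><p>A quick back-of-the-envelope using the beyond-L3 float32 numbers above makes that concrete: scoring one million candidates costs roughly 95 ms of pure distance compute with simdvec on x86, versus roughly 165 ms with FAISS and over 400 ms with jvector. At that scale, per-operation nanoseconds turn into visible differences in end-to-end latency.</p><p>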
Whether you use float32, int8, bfloat16, or BBQ, and whether your index is in memory or on disk, simdvec is the engine underneath: every one of those operations runs through it, tuned down to the last nanosecond.</p><p>The key takeaway is that at production scale, vector search performance isn’t primarily determined by raw SIMD throughput. It’s dominated by how efficiently the system hides memory latency while sustaining compute across millions of small operations.</p><p>The simdvec kernels improve with almost every Elasticsearch release. When new quantization types and hardware platforms emerge, they get tuned kernels from day one. And existing types continue to get faster as we refine the implementations that are already shipping.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Jina embeddings v3 now available on Gemini Enterprise Agent Platform Model Garden]]></title>
            <description><![CDATA[Jina search foundation model, jina-embeddings-v3, is now self-deployable on Gemini Enterprise Agent Platform Model Garden, with more to follow. Run jina-embeddings-v3 on a single L4 GPU inside your own VPC.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v3-gemini-enterprise-model-garden</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v3-gemini-enterprise-model-garden</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sa Zhang]]></dc:creator>
            <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42f72633f1e5453dbfd47730b5f776429f9f633e-721x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today we’re launching <code>jina-embeddings-v3</code>, the first Jina search foundation model to be available on <a href="https://console.cloud.google.com/vertex-ai/publishers/jinaai/model-garden/jina-embeddings-v3">Gemini Enterprise Agent Platform Model Garden</a> as a self-deployable partner model. <em>Self-deployment</em> means the model runs on GPU instances inside your Google Cloud project and Virtual Private Cloud (VPC). No external API calls, no per-token metering, no rate limits.</p><p>With this integration, Elasticsearch users gain a new deployment option that keeps data inside their security perimeter, delivers predictable infrastructure costs, and runs natively on Google Cloud. At the same time, the broader Google Cloud ecosystem gains access to Jina&#x27;s purpose-built, state-of-the-art search and retrieval models.</p><p>This is the first stage of a broader rollout. Together with the models coming next, the lineup will form a complete retrieval stack: Embed your data, embed queries, retrieve and rerank candidates, and extend search to images with multimodal embeddings, all on infrastructure you control. You can start today with <code>jina-embeddings-v3</code>, the model already powering production search pipelines across the Elasticsearch ecosystem via Elastic Inference Service (EIS).</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td><td style="border: 1px solid #ccc; padding: 4px;">Parameters</td><td style="border: 1px solid #ccc; padding: 4px;">Key capability</td><td style="border: 1px solid #ccc; padding: 4px;">Status on Model Garden</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-embeddings-v3`</td><td style="border: 1px solid #ccc; padding: 4px;">Text embedding</td><td style="border: 1px solid #ccc; padding: 4px;">572M</td><td style="border: 1px solid #ccc; padding: 4px;">Proven multilingual workhorse, 8K context, 1024 dim output, truncatable to 32</td><td style="border: 1px solid #ccc; padding: 4px;">Available now</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-embeddings-v5-text-small`</td><td style="border: 1px solid #ccc; padding: 4px;">Text embedding</td><td style="border: 1px solid #ccc; padding: 4px;">677M</td><td style="border: 1px solid #ccc; padding: 4px;">State-of-the-art sub-1B multilingual, 32K context, 1024 dim output, truncatable to 32</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-embeddings-v5-text-nano`</td><td style="border: 1px solid #ccc; padding: 4px;">Text embedding</td><td style="border: 1px solid #ccc; padding: 4px;">239M</td><td style="border: 1px solid #ccc; padding: 4px;">Best-in-class under 500M params, 8K context, 768 dim output, truncatable to 32</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-reranker-v3`</td><td style="border: 1px solid #ccc; padding: 4px;">Reranker</td><td style="border: 1px solid #ccc; padding: 4px;">600M</td><td style="border: 1px solid #ccc; padding: 4px;">Listwise reranker, 131K context, up to 64 documents</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`jina-clip-v2`</td><td style="border: 1px solid #ccc; padding: 4px;">Multimodal embedding</td><td style="border: 1px solid #ccc; padding: 4px;">900M</td><td style="border: 1px solid #ccc; padding: 4px;">Text + image in shared space, 89 languages, and 8K text context, 512×512 images</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
              </tbody>
            </table>
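          <p>The &quot;truncatable to 32&quot; column refers to Matryoshka representation learning: you keep a prefix of the embedding and renormalize it, trading a little quality for much smaller vectors. A minimal TypeScript sketch of the idea (illustrative only, not Jina&#x27;s code):</p><pre><code>// Matryoshka truncation: keep the first k dimensions, then renormalize.
function truncate(embedding: Float32Array, k: number): Float32Array {
  const head = embedding.slice(0, k);
  const norm = Math.hypot(...head); // L2 norm of the kept prefix
  return head.map((v) =&gt; v / norm);
}

// Example: shrink a 1024-dim jina-embeddings-v3 vector to 256 dims,
// cutting vector storage by 4x with only a modest quality loss.
// const compact = truncate(fullEmbedding, 256);</code></pre>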
          <p>Every model runs on a single NVIDIA L4 (24 GB), the most cost-efficient GPU tier on Google Cloud. Most other embedding models on Google Cloud Model Garden require an A100 80 GB or H100, roughly three times the per-hour instance cost before you even start counting tokens.</p><p>No additional commercial license is required when deployed through Vertex AI.</p><h2><strong>Why Model Garden?</strong></h2><p>Why deploy through Model Garden instead of hitting an API? It comes down to three things: control, cost, and context.</p><h3>Your data never leaves the house</h3><p>The biggest draw for most developers is the self-deploy architecture. When you deploy a Jina model through Model Garden, the weights run on GPU instances inside your own Google Cloud project and your own VPC. This is a game-changer for anyone working in industries with data security concerns, like finance or healthcare. Because there are no external API calls, your sensitive data stays within your security perimeter.</p><h3>Scaling with prediction</h3><p>Instead of paying every time you embed a sentence or rerank a document, you pay a flat hourly instance cost. And because every Jina model can run on a single NVIDIA L4, the most affordable GPU tier on Google Cloud, the barrier to entry is low. Whether you process a thousand requests or a billion, your infrastructure bill stays predictable. This is a setup that actually rewards you for growing your traffic rather than taxing you for it.</p><h3>Everything under one roof</h3><p>If your data is already sitting in Elasticsearch on Google Cloud, BigQuery, or Cloud Storage, it makes sense to keep your inference engines nearby. By deploying through Model Garden, Jina search foundation models inherit all the enterprise features you are already using: identity and access management (IAM) for access control, unified billing on your existing Google Cloud invoice, and the ability to plug into Vertex AI Pipelines for machine learning operations (MLOps) workflows.</p><p>While the Jina AI Cloud API and Elastic Cloud offer the fastest path for bursty traffic or existing search workflows, Model Garden is ideal for enterprise applications requiring strict data security and predictable costs at scale. Elastic wants to meet you where you are.</p><h2><strong>Jina AI models</strong></h2><h3><strong>jina-embeddings-v3</strong></h3><p>Our proven multilingual embedding model with 572M parameters and 8K token context. Scores 65.5 on Massive Text Embedding Benchmark (MTEB) English. Supports five task-specific Low-Rank Adaptation (LoRA) adapters (retrieval query/passage, text-matching, classification, clustering) and Matryoshka truncation from 1024 to 64 dimensions. Already widely adopted across the Elasticsearch ecosystem via EIS.</p><p>We’re leading with v3 because many production systems already depend on it. If you’re migrating a v3-based pipeline to Google Cloud, you can now run the same model natively without changing your embedding dimensions or reindexing.</p><h3><strong>jina-embeddings-v5-text (small and nano)</strong></h3><p>Our fifth-generation text embedding models, released February 2026, achieve top-tier performance, competing with models many times their size.</p><p><code>v5-text-small</code> (677M) scores 67.0 on the Multilingual MTEB (MMTEB) benchmark suite, encompassing 131 tasks of nine task types, and 71.7 on the MTEB English benchmark. It’s the strongest sub-1B multilingual embedding model on the MTEB Leaderboard.</p><p><code>v5-text-nano</code> (239M) scores 65.5 on MMTEB. 
No other model under 500M parameters reaches this level. At less than half the size of most comparable models, it’s the natural choice for edge and latency-sensitive deployments.</p><p>Both models support:</p><ul><li><strong>Four task-specific LoRA adapters:</strong> Retrieval, text-matching, classification, clustering. Select the appropriate adapter via the <code>task</code> parameter at inference time.</li><li><strong>Matryoshka dimension truncation:</strong> Reduce embedding dimensions from 1024 (or 768 for nano) down to 32. Quality loss is minimal at moderate truncation (for example, 256 dims). Halving dimensions roughly halves storage.</li><li><strong>Binary quantization:</strong> Compress 1024-dim embeddings from 2KB to 128 bytes with binarization. Special training keeps the quality loss from this compression minimal.</li><li><strong>Multilingual: </strong>119 languages (small) and 93 (nano).</li></ul><h3><strong>jina-reranker-v3</strong></h3><p>A 0.6B parameter multilingual listwise reranker built using a <em>last but not late interaction</em> architecture. The query and up to 64 candidate matches are entered into a single 131K-token context window, and the model performs cross-document comparison before scoring. Jina Reranker v3 achieves 61.94 nDCG@10 on BEIR, outperforming larger rerankers while being 6× smaller. This is fundamentally different from pointwise rerankers that score each document in isolation, and it produces better results, especially for passage retrieval from single documents.</p><h3><strong>jina-clip-v2</strong></h3><p>A 0.9B multimodal, multilingual embedding model that maps text and images into a shared 1024-dimensional space. It supports:</p><ul><li><strong>89 languages</strong> for text-image retrieval.</li><li><strong>512×512 image resolution.</strong></li><li><strong>8K token text input.</strong></li><li><strong>Matryoshka truncation</strong> from 1024 to 64 dimensions for both modalities.</li></ul><p>Highly competitive on image-to-text benchmarks, including multilingual tasks.</p><h2><strong>Getting started</strong></h2><p>Jina Embeddings v3 is live on Model Garden today. Here’s how to get it running.</p><p>You need a Google Cloud project with the Vertex AI API enabled and enough GPU quota for at least one g2-standard-8 instance (NVIDIA L4). If you’re new to Google Cloud, <a href="https://cloud.google.com/vertex-ai/docs/start/cloud-environment">start with the setup guide.</a></p><p><a href="https://console.cloud.google.com/vertex-ai/publishers/jinaai/model-garden/jina-embeddings-v3">The Model Garden page for Jina Embeddings v3</a> walks you through the full flow: Upload the model, create an endpoint, pick your machine type, and deploy. Open it in your own project, and follow the guided steps. A100 and H100 machines are also available where region and quota allow, but L4 is all you need to start.</p><p>From click to first embedding, the whole process takes a few minutes.</p><h2><strong>What comes next</strong></h2><p>Jina Embeddings v3 is the starting point. In the coming weeks, we’ll bring the rest of the Jina retrieval stack to Model Garden: v5 text embeddings (small and nano), jina-reranker-v3, and jina-clip-v2 for multimodal search. All will run on a single L4 GPU with the same self-deploy model.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic Security, Observability, and Search now offer interactive UI in your AI tools]]></title>
            <description><![CDATA[Elastic is the first security vendor to ship an interactive UI in AI Conversations, built on the MCP App standard that Anthropic and OpenAI co-authored. It launches alongside two other Elastic MCP Apps for Observability and Search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/mcp-apps-elastic</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/mcp-apps-elastic</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[David Elgut, Graham Hudgins, Jesse Miller]]></dc:creator>
            <pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/55cea79fd5c554656e8605525a61d4ad3ae0ee1b-2048x1143.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>AI agents are only as good as the surfaces they return. A text-only tool result flattens anything that is naturally visual or interactive: a dashboard, an alert list, an investigation graph, a distributed trace. The conversation becomes a place where you ask and the agent answers, but the moment the work gets dense, you leave for a separate tab, a separate product, a separate authentication boundary.</p><p><a href="https://modelcontextprotocol.io/extensions/apps/overview">MCP Apps</a> change the shape of that answer. A tool can now return an interactive UI alongside its text summary, and the host (Claude Desktop, Claude.ai, VS Code Copilot, Cursor) renders it inline in the conversation. The model keeps the compact text for reasoning. The human gets a live, clickable interface right next to the chat.</p><p>Three properties make this a different kind of integration than &quot;a webhook that returns a URL&quot;:</p><ul><li><strong>Context preservation.</strong> The UI lives inside the conversation. No tab switching, no hand-offs.</li><li><strong>Bidirectional data flow.</strong> The UI can call tools on the MCP server for fresh data, and the host can push new results from the agent back into the UI. No separate API layer or authentication plumbing.</li><li><strong>Sandboxed trust boundary.</strong> MCP Apps run in a host-controlled iframe. They cannot access the parent page, read cookies, or escape their container.</li></ul><p>Security operations run on triage, investigation graphs, and Attack Discovery, where an AI agent correlates hundreds of alerts into a handful of attack chains. Observability means distributed traces and time-series drill-downs. Building in Kibana means a dashboard grid. Flatten any of that to text and you lose the thing that makes it useful. We built MCP Apps for all three and are open-sourcing them together, so the same conversation can move from a triage queue to a dependency graph to a live dashboard without ever leaving the chat.</p><p>Each of the three reference apps is one MCP server serving many interactive views, not a bundle of separate products. The security app alone surfaces six dashboards that share the same server shell, the same tool-visibility model, and the same host bridge. The pattern is small; the surface area is where the value compounds.</p><h2>Elastic Security MCP App</h2><h3>Why it matters for the SOC</h3><p>When an agent tells a SOC analyst, &quot;<em>There are 47 alerts on host-314, here&#x27;s a summary,</em>&quot; it hasn&#x27;t done any work. It&#x27;s just pointed at where the work starts. The actual work lives in the alert list, the process tree, the investigation graph, and the case file. You can&#x27;t do it from a paragraph of text.</p><p>The security MCP App returns the workflow itself. The analyst prompts the agent, and the agent returns an interactive dashboard in the chat where the analyst can drill into alerts, run threat hunts, correlate attack chains, and open cases, all without losing the thread of the conversation. And because the findings, queries, and cases all land back in Elasticsearch, the same investigation is waiting in Kibana where the analyst can pick back up after the conversation has been closed.</p><h3>Six interactive dashboards</h3><p>The <a href="https://github.com/elastic/example-mcp-app-security">Elastic Security MCP App</a> ships six interactive elements, one per major SOC workflow. 
Each one is a React UI that renders inline when the agent calls the corresponding tool:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tool</td><td style="border: 1px solid #ccc; padding: 4px;">What it does</td><td style="border: 1px solid #ccc; padding: 4px;">Interactive UI</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Alert triage</td><td style="border: 1px solid #ccc; padding: 4px;">Fetch, filter, and classify security alerts</td><td style="border: 1px solid #ccc; padding: 4px;">Severity grouping, AI verdict cards, process tree, network events</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Attack Discovery</td><td style="border: 1px solid #ccc; padding: 4px;">AI-correlated attack chain analysis with on-demand generation</td><td style="border: 1px solid #ccc; padding: 4px;">Attack narrative cards with confidence scoring, entity risk, MITRE mapping</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Case management</td><td style="border: 1px solid #ccc; padding: 4px;">Create, search, and manage investigation cases</td><td style="border: 1px solid #ccc; padding: 4px;">Case list with alerts, observables, comments tabs, AI actions</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Detection rules</td><td style="border: 1px solid #ccc; padding: 4px;">Browse, tune, and manage detection rules</td><td style="border: 1px solid #ccc; padding: 4px;">Rule browser with KQL search, query validation, noisy-rule analysis</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Threat hunt</td><td style="border: 1px solid #ccc; padding: 4px;">ES|QL workbench with entity investigation</td><td style="border: 1px solid #ccc; padding: 4px;">Query editor, clickable entities and investigation graph</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Sample data</td><td style="border: 1px solid #ccc; padding: 4px;">Generate ECS security events for common attack scenarios</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario picker with four pre-built attack chains</td>
                    </tr>
              </tbody>
            </table>
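          <p>All six follow the same return shape. The sketch below is hypothetical TypeScript (our names, not the app&#x27;s actual code): a tool hands the model a compact text part to reason over and points the host at a UI resource to render inline.</p><pre><code>// Hypothetical shape of an MCP App tool result (illustrative only).
interface ToolResult {
  content: Array&lt;{ type: "text"; text: string }&gt;; // what the model reasons over
  uiResource: string; // what the host renders inline, e.g. "ui://security/alert-triage"
}

// Stub standing in for the app&#x27;s real data fetch (hypothetical).
async function fetchAlerts(host: string, window: string): Promise&lt;unknown[]&gt; {
  return []; // real app: alerts API / ES|QL call scoped to host and window
}

async function alertTriage(host: string, window: string): Promise&lt;ToolResult&gt; {
  const alerts = await fetchAlerts(host, window);
  return {
    content: [{ type: "text", text: `${alerts.length} alerts on ${host}, grouped by severity.` }],
    uiResource: "ui://security/alert-triage",
  };
}</code></pre>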
          <p>Each tool returns a compact text summary that the model can reason over, alongside the interactive UI the analyst acts on. The UI can also fetch fresh data behind the scenes through the MCP host bridge. The full tool model and bridge API live in the <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/architecture.md">repo&#x27;s architecture doc</a>.</p><p>The app also ships with <a href="https://github.com/elastic/example-mcp-app-security/tree/main/skills">Claude Desktop skills</a>, <code>SKILL.md</code> files that teach the agent when and how to use each tool. Download pre-built skill zips from the <a href="https://github.com/elastic/example-mcp-app-security/releases/latest">latest release</a>.</p><h3>From Alert to Case</h3><p>Four skills cover the core SOC loop. Each one picks up a prompt, calls a tool, and returns an interactive dashboard alongside a text summary that the model reasons over. An analyst&#x27;s day usually starts with an alert queue.</p><p><strong>Triage alerts. </strong>Ask the agent to triage by host, rule, user, or time window. The Alert Triage skill returns a dashboard of AI verdicts above the raw alert list, with one verdict per detection rule classifying that rule&#x27;s activity as benign, suspicious, or malicious, each with a confidence score and a recommended action. Click any alert to open a detailed view with a process tree, network events, related alerts, and MITRE ATT&amp;CK tags. No need to context switch between the AI conversation and your alerts dashboard inside Kibana; everything happens in real time inside your conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0239ea916a828ee0bfd0cc046af17330568fe09-2048x1313.png" style="width: 100%;" /><p><strong>Hunt for threats. </strong>Ask the agent to hunt across your indices. The Threat Hunt skill returns an ES|QL workbench with the query pre-populated and auto-executed, with every entity in the results clickable for drill-down. The model writes a short read-out below the table: what&#x27;s unusual, what&#x27;s connected, what&#x27;s worth a closer look. It then offers the next pivot: either go deeper into the threat hunt, or start a new skill within the MCP app that complements the work done so far. A natural way to tie this together is launching an Attack Discovery to gather more context on the alerts you’ve examined and the threats you’ve hunted so far.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9f0bde828bcedb24e2b45995cfa306f03e57d300-2048x1313.png" style="width: 100%;" /><p><strong>Run Attack Discovery.</strong> The Attack Discovery skill triggers the Attack Discovery API and returns a ranked list of findings. Each finding is a set of related alerts stitched into one attack chain, with MITRE tactics, a risk score, a confidence label, and the impacted hosts and users surfaced up front. The agent&#x27;s summary lands below the findings in the same rank order, and the conversation now holds everything needed to act: hunt queries, triage decisions, correlated chains, all staged for the next step.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e52af22dbaa070192170280d519dfb219ad807f5-2048x1313.png" style="width: 100%;" /><p><strong>Open cases without leaving the chat.</strong> Approve findings in bulk or ask the agent to open cases for specific alerts. 
The Case Management skill creates one case per approved finding (source alerts attached, MITRE tactics inherited from the attack chain) and renders the live case list inline. Click a case for its detail view, which includes a row of AI action buttons: Summarize case, Suggest next steps, Extract IOCs, and Generate timeline. Each one drops a structured prompt back into the chat, so the agent picks up the case context without needing a reintroduction. The agent&#x27;s summary sits below the case list and covers the full IR queue, including the cases just opened and earlier findings that still need one.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2057310b450c2917c5434c713f2b34732067429a-2048x1283.png" style="width: 100%;" /><p>Every step in this walkthrough runs the same loop: a prompt comes in, the skill picks it up, the tool returns a compact text summary for the model to reason over, alongside an interactive UI that the analyst acts on. Chain the skills together, and they compose into an end-to-end SOC flow — hunt, triage, correlate, open cases, and drive the next pivot, all with the model carrying the session context across every step. Invoke any one on its own and it&#x27;s still the full dashboard, pointed at whatever slice of your data you name. Either way, the work accumulates inside the conversation; no tab switching, no copy-paste, no hand-offs.</p><p>Two more skills round out the app: a detection-rule browser for tuning noisy rules, and a sample-data generator for spinning up realistic ECS events against a fresh cluster. A follow-up post will go deep on all six, including the investigation graph, the attack-flow canvas, and an end-to-end walkthrough.<br/></p><blockquote><em>&quot;The MCP App for Elastic Security bridges the gap between automated detection and manual hunting. By bringing our security data directly into a single interface within Claude Desktop, we surfaced &#x27;silent&#x27; threats in under an hour, risks that didn&#x27;t trigger standard alerts but required immediate action. It&#x27;s a force multiplier for our analysts.&quot; Mandy Andress, Chief Information Security Officer (CISO), Elastic.</em></blockquote><h3>How it works</h3><p>Each MCP App is a small Node.js server whose tools return both a compact text summary for the model and a React UI that the host renders inline. Because it&#x27;s built on the open MCP App spec, the same server runs on any compatible host; see the <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/architecture.md">repo&#x27;s architecture doc</a> for the full design.</p><h3>Try it</h3><p>Requires Elasticsearch 9.x with Security enabled, plus Kibana for cases, rules, and Attack Discovery. The fastest path is the one-click <code>.mcpb</code> bundle from the <a href="https://github.com/elastic/example-mcp-app-security/releases/latest">latest release</a>: double-click it in Claude Desktop, and you&#x27;ll be prompted for your Elasticsearch URL and API key. 
Setup guides for <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-cursor.md">Cursor</a>, <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-vscode.md">VS Code</a>, <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-claude-code.md">Claude Code</a>, <a href="https://github.com/elastic/example-mcp-app-security/blob/main/docs/setup-claude-ai.md">Claude.ai</a>, and building from source are in the <a href="https://github.com/elastic/example-mcp-app-security">repo</a>.</p><a href="https://videos.elastic.co/watch/voDxV9RWWpY9HeBYTWTS93">Watch video on Vidyard</a><h2>Elastic Search MCP App: Dashboards built from conversation</h2><p>Every Kibana user knows the dashboard detour: leave what you&#x27;re working on, open Kibana, pick an index, pick fields, pick a visualization, tweak, and save. That&#x27;s five context switches before a single chart is on screen.</p><p>The new <a href="https://github.com/elastic/example-mcp-dashbuilder">example-mcp-dashbuilder</a> reference app collapses that into a prompt. Ask the agent to &quot;<strong>build me a dashboard with revenue metrics, order trends, and category breakdowns</strong>&quot; and the dashboard comes back inside the conversation without any tab-switching required.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b096b493726488f53040b08b5c8bc53ca0e44f9d-2048x1182.png" style="width: 100%;" /><p>Behind that prompt, the agent explores your Elasticsearch data via ES|QL and selects chart types to match the data: bars for comparisons, lines for trends, metric cards for KPIs, and heatmaps for two-dimensional patterns. It lays panels out on Kibana&#x27;s 48-column grid using the Elastic UI Borealis theme, and the result is fully interactive: you can drag, resize, and group panels into collapsible sections right in the chat. When the dashboard looks right, a single tool call exports it to Kibana, preserving ES|QL queries and custom colors. You can also import existing Kibana dashboards back into the chat for AI-assisted editing.</p><p>The principle is the same one behind the Security app: when the artifact is the product, returning it inside the conversation closes the loop between describing what you want and seeing it.</p><p>Under the hood, it follows the same MCP App pattern. A Node.js server registers a <code>view_dashboard</code> model-facing tool alongside a set of app-only tools the UI calls directly (data fetching, layout persistence, time-field detection, export/import). The dashboard view itself is a single self-contained HTML file bundled with <code>vite-plugin-singlefile</code> and served as an MCP App resource. Builders forking the repo get the same server shell and host bridge they see in the Security app, pointed at a different job. The example-mcp-dashbuilder README has the full architecture and chart-type reference.</p><a href="https://videos.elastic.co/watch/wep6PxPiEDyyvsYTKq6v5B">Watch video on Vidyard</a><h2>Elastic Observability MCP App</h2><p>The third reference app, <a href="https://github.com/elastic/example-mcp-app-observability">Elastic Observability MCP App</a>, tackles the SRE version of the same shape problem. When something breaks in production, the answer the on-call engineer needs is not a chart, it is a diagnosis stitched together from K8s metrics, APM topology, ML anomalies, and risk assessment. 
The shape of the answer is a causal story: what failed, why, what depends on it, and what to do next.</p><p>Six tools support the observability investigation workflow:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tool</td><td style="border: 1px solid #ccc; padding: 4px;">Dependency</td><td style="border: 1px solid #ccc; padding: 4px;">What It Does</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">observe</td><td style="border: 1px solid #ccc; padding: 4px;">Universal</td><td style="border: 1px solid #ccc; padding: 4px;">Transient ES|QL + ML-anomaly access primitive — run a query once, live-sample a metric, or block until a threshold or anomaly fires.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">manage-alerts</td><td style="border: 1px solid #ccc; padding: 4px;">Universal</td><td style="border: 1px solid #ccc; padding: 4px;">Create, list, get, and delete Kibana custom-threshold alerting rules. Omit the Kibana URL to run read-only.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ml-anomalies</td><td style="border: 1px solid #ccc; padding: 4px;">ML jobs</td><td style="border: 1px solid #ccc; padding: 4px;">Query ML anomaly records and open an inline anomaly-explainer view.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">apm-health-summary</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Cluster-level health rollup from APM telemetry; layers in K8s and ML context when available.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">apm-service-dependencies</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Service dependency graph — upstream/downstream, protocols, call volume.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">k8s-blast-radius</td><td style="border: 1px solid #ccc; padding: 4px;">Kubernetes metrics</td><td style="border: 1px solid #ccc; padding: 4px;">Node-outage impact — full outage, degraded, unaffected, reschedule feasibility.</td>
                    </tr>
              </tbody>
            </table>
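          <p>Before walking through the views, it helps to see how an agent might shape calls to <code>observe</code>, the primitive the other workflows lean on. The argument names below are illustrative TypeScript, sketched from the table above rather than the app&#x27;s actual schema:</p><pre><code>// Hypothetical argument shapes for observe&#x27;s three modes (illustrative only).
type ObserveArgs =
  | { mode: "query"; esql: string }                        // run once, return a table
  | { mode: "sample"; esql: string; durationSec: number }  // live-sample a metric
  | { mode: "watch"; esql: string; belowThreshold: number; timeoutSec: number }; // block until it fires

const calls: ObserveArgs[] = [
  // One-shot: what is CPU doing right now?
  { mode: "query", esql: "FROM metrics-* | STATS AVG(system.cpu.total.norm.pct)" },
  // Live sample: follow a latency metric for 60 seconds
  { mode: "sample", esql: "FROM traces-apm* | STATS AVG(transaction.duration.us)", durationSec: 60 },
  // Watch: fire when free memory drops below 80MB, give up after 10 minutes
  { mode: "watch", esql: "FROM metrics-* | STATS MIN(system.memory.free)", belowThreshold: 8e7, timeoutSec: 600 },
];</code></pre>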
          <h3><strong>Cluster health rollup</strong></h3><p>Ask &quot;what&#x27;s broken?&quot; or &quot;give me a status report&quot; and get a one-shot orientation: overall health badge, degraded services with reasons, top pod memory consumers, anomaly severity breakdown, and service throughput — all in one inline view. This is the starting point when something feels off, but you don&#x27;t know where to look. The view adapts based on what your deployment supports. APM gives you service health. Kubernetes metrics add pod and node context. ML jobs layer in anomalies.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f31be8bbcffd54623fcec4e51fd914dae666c026-1510x1810.png" style="width: 100%;" /><h3><strong>Service dependency graph</strong></h3><p>Ask &quot;what calls checkout?&quot; or &quot;show me the topology&quot; and get a layered dependency graph — upstream callers, downstream dependencies, protocols, call volume, and latency per edge. Let’s ask Claude to “Show me the service dependencies of the frontend”:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2afec7dbd2153bc3a9f05068f323911af766a258-2048x1184.png" style="width: 100%;" /><p>Zoom, pan, and hover to get all the details you need to understand the complex service relationships:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ff99d2fd0a54c22c9f7b975e9d8d7073c4311e7a-2048x970.png" style="width: 100%;" /><h3><strong>Assess risk with a blast radius</strong></h3><p>Ask &quot;what happens if my k8s node goes down?&quot; and get a radial impact diagram: the target node at center, full-outage deployments in red, degraded in amber, unaffected in gray. A floating summary card shows pods at risk and rescheduling feasibility. Single-replica deployments are flagged as single points of failure. </p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7736bef9609e45a3f0c7f9f3090a314b0834315d-1272x1838.png" style="width: 100%;" /><h3><strong>Observe</strong></h3><p>The agent&#x27;s primary access primitive for Elastic — one tool, three modes for three different needs. Say &quot;what&#x27;s CPU doing right now?&quot; and it runs an ES|QL query once and returns a table. Say &quot;show me frontend latency for the next 60 seconds&quot; and it live-samples the metric, updating the chart in-place. Say &quot;tell me when memory drops below 80MB&quot; or &quot;watch for anything unusual for the next 10 minutes&quot; and it blocks until the condition fires or the window expires. The view adapts to the mode: a results table for one-shot queries, a live trend chart with current/peak/baseline stats for sampling and threshold conditions, and a severity-scored trigger card for anomaly mode.</p><h3>How it works</h3><p>Same MCP App pattern as the Security and Search apps: a Node.js server, six model-facing tools wired to six single-file view resources. Tools are grouped by deployment backend (Universal, APM-dependent, K8s-dependent, ML-dependent), so the agent and the user both know up front which tools apply to a given deployment instead of discovering capability gaps at call time. 
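</p><p>A hedged sketch of that grouping (illustrative TypeScript; the names are ours, not the app&#x27;s):</p><pre><code>// Register only the tools whose backend dependency is actually present.
type Backend = "universal" | "apm" | "k8s" | "ml";

const toolBackends: Record&lt;string, Backend&gt; = {
  "observe": "universal",
  "manage-alerts": "universal",
  "ml-anomalies": "ml",
  "apm-health-summary": "apm",
  "apm-service-dependencies": "apm",
  "k8s-blast-radius": "k8s",
};

function applicableTools(detected: Set&lt;Backend&gt;): string[] {
  detected.add("universal"); // universal tools always apply
  return Object.entries(toolBackends)
    .filter(([, backend]) =&gt; detected.has(backend))
    .map(([name]) =&gt; name);
}

// A deployment with APM but no K8s metrics or ML jobs:
// applicableTools(new Set(["apm"]))
// -&gt; ["observe", "manage-alerts", "apm-health-summary", "apm-service-dependencies"]</code></pre><p>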
The MCP App also includes an example Agent Builder workflow, <code>k8s-crashloop-investigation-otel</code>, which can trigger on a Kubernetes alert and return a structured root cause summary before you&#x27;ve opened a single dashboard.</p><a href="https://videos.elastic.co/watch/xcF3nKDziVgphc5bPAkv9R">Watch video on Vidyard</a><h2>The Agentic Stack, Interactive</h2><p>Three properties about this pattern are worth stating directly. First, the tool result is no longer the end of the work; it is the start of it: the conversation returns an interface you can act on, not a summary you have to act from. Second, the same agent, the same model context, and the same conversation thread can now move across Security, Search, and Observability surfaces without leaving the conversation. Third, this only works because Elasticsearch and Kibana already expose the APIs. The MCP App is a thin interactive layer over the product capabilities we already ship.</p><p>Attack Discovery already powers the correlated findings view inside this app. Inside the stack, the same agentic pattern goes further: Elastic Workflows automate the deterministic steps (enrich entities, create cases, isolate hosts), while Agent Builder reasons over the data and invokes those workflows as tools. The MCP App brings that same security surface into the external conversation; Workflows and Agent Builder deepen it inside the stack. Different entry points, same Elastic APIs underneath.</p><p><strong>Try it:</strong></p><ul><li><strong>Security:</strong> <a href="https://github.com/elastic/example-mcp-app-security">example-mcp-app-security</a></li><li><strong>Search and dashboards:</strong> <a href="https://github.com/elastic/example-mcp-dashbuilder">example-mcp-dashbuilder</a></li><li><strong>Observability: </strong><a href="https://github.com/elastic/example-mcp-app-observability">example-mcp-app-observability</a></li></ul><p>Don&#x27;t have an Elasticsearch cluster yet? Start a free <a href="https://cloud.elastic.co/registration">Elastic Cloud trial</a>. For more on the building blocks behind the security app, see the related Security Labs posts on <a href="https://www.elastic.co/security-labs/from-alert-fatigue-to-agentic-response">Elastic Workflows and Agent Builder</a>, <a href="https://www.elastic.co/security-labs/agent-skills-elastic-security">Agent Skills</a>, and <a href="https://www.elastic.co/security-labs/speeding-apt-attack-discovery-confirmation-with-attack-discovery-workflows-and-agent-builder">Attack Discovery</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing unified API keys for Elastic Cloud Serverless and Elasticsearch]]></title>
            <description><![CDATA[Learn how Elastic unified control plane and data plane authentication in Serverless with a globally distributed IAM architecture. Use one API key for Cloud and Elasticsearch APIs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-cloud-api-keys-unified-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-cloud-api-keys-unified-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[ Alex Chalkias]]></dc:creator>
            <pubDate>Mon, 20 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/864e229f00eb2018084f13dd7f0e390e18383ed4-1980x1188.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Imagine you&#x27;re a site reliability engineer (SRE) responsible for a growing fleet of Elastic Cloud Serverless projects: Elastic Observability for your production infrastructure, Elastic Security for your security operations center (SOC) team, and Elasticsearch for your customer-facing application. Each project has its own Elasticsearch API key. Your continuous integration and continuous delivery (CI/CD) pipeline needs a separate Cloud API key to provision and manage those projects. Rotation day arrives every quarter: You walk through each project, mint fresh keys, update your Terraform state, redeploy your pipelines, and hope nothing falls through the cracks. When an incident hits at 2 a.m. and you need to revoke access fast, you&#x27;re cross-referencing a spreadsheet of credentials to figure out which key belongs to which project and which service.</p><p>Today, that story gets a lot simpler. <strong>Elastic Cloud API keys</strong> can now be used to authenticate directly against <strong>Elasticsearch</strong> and <strong>Kibana</strong> APIs on <strong>Elastic Cloud Serverless</strong>. You can now use a single credential to manage your organization&#x27;s resources <em>and</em> run data operations, like Elasticsearch Query Language (ES|QL) queries, data ingestion, and alerting.</p><p>Let’s look at why we built this, how we engineered a globally distributed identity layer to make it possible, and how it lays the foundation for cross-project search.</p><h2>The secret management burden</h2><p>Building reliable CI/CD pipelines, GitOps workflows, or Terraform automation around data platforms comes with a hidden cost: secret sprawl.</p><p>In the previous model, developers faced a disjointed authentication story:</p><ul><li><strong>Control plane (Elastic Cloud API keys):</strong> Organization-scoped keys used to create projects, invite users, and manage billing via the <a href="https://www.elastic.co/docs/api/doc/cloud/">Elastic Cloud API</a>.</li><li><strong>Data plane (Elasticsearch API keys):</strong> Project-scoped keys created <em>inside</em> a specific Serverless project to interact with <a href="https://www.elastic.co/docs/api/doc/elasticsearch-serverless/">Elasticsearch</a> and <a href="https://www.elastic.co/docs/api/doc/serverless">Kibana</a> APIs.</li></ul><p>This meant that your deployment script had to authenticate to Elastic Cloud, provision a Serverless project, extract a newly minted Elasticsearch API key from that specific project, and then inject <em>that</em> second key into the downstream application or automation tool, resulting in complex pipelines, fragmented audit logs, and a higher risk of credential leaks.</p><h2>Unified authentication in Elastic Cloud Serverless</h2><p>With this release, the split is gone for Serverless projects. You can now create an Elastic Cloud API key that’s explicitly authorized for <strong>Cloud, Elasticsearch, and Kibana APIs</strong>.</p><ul><li><strong>Before:</strong> An Elastic Cloud API key was strictly a control plane token. It could create projects, manage billing, and invite users, but it had a hard boundary; it couldn’t be used to call the Elasticsearch or Kibana APIs inside those projects. You always needed a second, project-specific key for data operations.</li><li><strong>Now:</strong> By opting into <strong>Cloud, Elasticsearch, and Kibana API</strong> access when creating an Elastic Cloud API key, the hard boundary is removed for Serverless. 
That API key becomes a truly unified credential. It retains its ability to manage your organization&#x27;s infrastructure, while simultaneously gaining native access to query, ingest, and analyze data across any authorized Serverless project.</li></ul><p>By unifying this under a single Elastic Cloud API key, you gain a single identity that can be scoped, audited, rotated, and revoked as one unit. Every API call, whether it provisions a new project or runs an ES|QL query, appears under the same credential in your audit logs, giving you a single trail to follow during incident investigations or compliance reviews. Credential rotation becomes a one-step operation instead of a coordinated update across separate control-plane and data-plane secrets. And because role assignments are per-project, a single key can span several projects, managing ingestion in your observability project and running queries in your security project, without juggling separate credentials for each.</p><p>Importantly, <em>unified</em> does not mean <em>all-powerful</em>. By using the <code>role_assignments</code> payload, you can scope a unified key strictly to a single project and a specific role (such as read-only), ensuring the blast radius remains completely contained if a credential is ever exposed. If a developer leaves or an application is decommissioned, you can revoke a single key from the Elastic Cloud Console, immediately terminating access across both the control plane and all associated Elasticsearch projects.</p><p><em>(Note: For Elastic Cloud Hosted/managed deployments, Cloud API keys still only manage the control plane. Support for extending this to hosted stack APIs is planned for a future release.)</em></p><h2>Automating your workflows</h2><p>Getting started is simple. You can configure this entirely through the Elastic Cloud console or automate it using the <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys">Elastic Cloud API</a>.</p><p>The UI process remains the same, but now you can select <strong>Cloud, Elasticsearch, and Kibana API</strong> access under the project role assignment.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6da7949fc93a52b080ba4d5c0aebbd750273a43a-2700x2699.webp" style="width: 100%;" /><p>Here’s how you create a unified key programmatically using the Elastic Cloud API. Notice the <code>application_roles</code> array, as this is what grants the key native access to the Elasticsearch data plane:</p><pre><code>curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: ApiKey $EC_API_KEY" \
  "https://api.elastic-cloud.com/api/v1/users/auth/keys" \
  -d '{
    "description": "unified-automation-key",
    "expiration": "90d",
    "role_assignments": {
      "project": {
        "elasticsearch": [
          {
            "role_id": "elasticsearch-admin",
            "organization_id": "YOUR_ORG_ID",
            "all": false,
            "project_ids": ["YOUR_PROJECT_ID"],
            "application_roles": ["admin"]
          }
        ]
      }
    }
  }'</code></pre><p>Once created, you simply pass this exact same key in the <code>Authorization: ApiKey</code> header to both <code>api.elastic-cloud.com</code> and your specific Serverless Elasticsearch endpoints.</p><h2>Under the hood: Building a distributed identity layer</h2><p>Making a Cloud API key work across both the control plane and the data plane isn&#x27;t as simple as passing a token. It requires solving a fundamental distributed systems challenge.</p><p>Historically, Cloud API keys lived in a centralized global security cluster. That works fine for control plane operations where a higher latency is acceptable. However, Elasticsearch data requests require ultra-low latency. We cannot afford a round trip across the globe to a central control plane to validate every single search query or ingest request.</p><p>To solve this, we introduced a new authentication architecture backed by a globally distributed datastore. The following sequence diagram shows a client sending an Elasticsearch query using an Elastic Cloud API key, illustrating how authentication happens entirely within the local region, without a round trip to the global control plane. Elasticsearch delegates authentication to the Regional IAM Service, which validates the key and resolves its role assignments against a local replica of the globally distributed database. Once authorized, Elasticsearch executes the query and returns results to the client.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3e38d7a862b9981523c5393c441b92eae13aeb90-2401x1351.webp" style="width: 100%;" /><h3>Globally distributed persistence</h3><p>Instead of relying solely on a centralized security cluster, Elastic Cloud API keys and their associated role definitions are now persisted in a globally distributed, highly available database. This database synchronizes identity and access management (IAM) data across the global control plane and the regional data planes where your Serverless projects actually run.</p><h3>Local validation with regional IAM</h3><p>When your client sends a request to Elasticsearch using an Elastic Cloud API key, the request doesn&#x27;t go back to the global control plane. Instead, it gets routed to the new regional IAM service. It validates the key against the local database replica, ensuring that authentication happens with near-zero latency and is completely insulated from global control plane outages.</p><h3>Dynamic role mapping</h3><p>Authentication is only half the battle; the system also needs to authorize the request. The regional IAM service instantly translates your Cloud-level role assignments (for example, <code>application_roles</code>) into native Elasticsearch privileges. Elasticsearch can then authorize and execute the request locally, without ever needing a local <code>.security</code> index.</p><h2>The foundation for Cross-Project Search</h2><p>This distributed identity architecture is a foundational building block for the future of the Elastic platform.</p><p>Because identity and access are now unified and globally synchronized, we have the framework required to securely pass your identity between different projects. This enables the upcoming <strong>Cross-Project Search (CPS)</strong> capabilities for Serverless.</p><p>With CPS, you&#x27;ll be able to query data spanning multiple remote Serverless projects, such as combining security and observability workloads, as easily as if they were a single dataset. 
By relying on unified API keys, the system can automatically evaluate your permissions across all projects simultaneously without requiring you to configure complex trust relationships, certificates, or duplicate credentials on every target project.</p><h2>Learn more</h2><p>Ready to simplify your stack?</p><ul><li>Read the <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys">Elastic Cloud API keys documentation</a> to learn how to assign stack access.</li><li>Check out the <a href="https://www.elastic.co/docs/api/doc/cloud/operation/operation-create-api-key">Create API key (Elastic Cloud API)</a> reference to automate key generation.</li><li>Review <a href="https://www.elastic.co/docs/deploy-manage/api-keys">Elastic API keys</a> for a full comparison of key types across the Elastic platform.</li></ul><p>Start or continue building in <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> today.</p><h2>Disclaimer</h2><p>The release and timing of any features or functionality described in this post remain at Elastic&#x27;s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast approximate Elasticsearch ES|QL - part II]]></title>
            <description><![CDATA[Explaining the approach we use to obtain fast approximate Elasticsearch ES|QL queries and the testing we did of error estimation.]]></description>
            <link>https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Thomas Veasey, Jan Kuipers]]></dc:creator>
            <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c0cd6d42f5d061aac15767539209a7c443ed1acd-1000x600.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>As we discussed in our <a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1">previous blog</a>, we’re introducing fast approximate <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql">ES|QL</a> <code>STATS</code> queries, which will be available in version 9.4 of Elasticsearch and the Elastic Stack. This feature allows users to estimate an expensive analytics query, often orders of magnitude faster than running the full query, by relaxing the constraint that it returns the exact value. We believe this has many uses; for example, we’re planning to integrate it into Kibana to obtain fast chart previews where possible.</p><p>In order for you to be able to trust our estimates, we provide error estimates. Furthermore, since there are edge cases in error estimation, we certify when the estimated value and error are trustworthy. In this blog post, we’ll dive into the theory for approximating and estimating the error in such queries, as well as discuss the testing we’ve done.</p><h3>Background</h3><p>In order to estimate ES|QL <code>STATS</code> queries efficiently, we make use of a property that’s shared by many statistics: Their estimates computed from a large number of independent samples from a dataset approach their true value. In the case of an index with some field $X=\{x_i\},$ we can think of the true value of a statistic as its value computed for a random variable with uniform discrete distribution on $X$. In the following we denote this quantity $\theta$; it can be things like <code>AVG</code>, <code>MEDIAN</code>, and so on. If we make $n$ independent draws from $X$, denoted $S$, such that each value is selected with probability $\frac{1}{|X|}$, we have $n$ independent copies of this random variable. The property we rely on means that a sample statistic value $\hat{\theta}$ computed from $S$ approaches $\theta$ as $n$ becomes large. For example, if $\theta$ is the mean of some metric values then $\hat{\theta} = \frac{1}{|S|}\sum_{x\in S} x \rightarrow \frac{1}{|X|}\sum_{x\in X} x=\theta$ as $|S|=n$ becomes large. Indeed, for many statistics the limiting error distribution is known to be normal. Furthermore, it only depends on the distribution of $X$, the size of the sample $n$ and the type of the aggregation $\theta$. This means supported <code>STATS</code> queries can be approximated with fixed accuracy independent of the index size $|X|$.</p><p>It is easy to pick values at random from a Lucene index: create a filter that takes exponentially distributed jumps through the dataset, where the expected jump size is controlled by the desired sample probability. The AND of this filter and any other Lucene query can be performed extremely efficiently, since AND’ing filter queries is one of the things for which it is well optimized. In our other post, we discussed some real-world query examples to give a sense of the speedup we obtain for different levels of accuracy.</p><p>So far, we&#x27;ve only discussed obtaining an estimate of a query. While such a point estimator can be useful, without knowing anything about its error those uses are limited. We found that ES|QL has existing capabilities that make it relatively easy to incorporate cheap, flexible, and accurate error estimation at the same time. We&#x27;ll discuss this next.</p><h3>Error estimates</h3><p>We view providing an accurate understanding of the uncertainty in our estimates as crucial for users to be able to trust the approximation. 
While having the option to quickly estimate an ES|QL query alone can be useful in certain situations, we wanted to provide a richer feature that allows clients to make intelligent choices. For example, if an approximate query is being used to preview a chart and the error is only a couple of pixels, there’s little point in running another expensive query to redraw it.</p><p>The way we&#x27;ve chosen to represent error is by a confidence interval: the $\alpha$-central confidence interval, to be precise. This can be expressed in terms of the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a>, $F$, of the statistic being estimated. Specifically, it&#x27;s the interval that contains the true value of the statistic with probability $\alpha$, whose endpoints are $F^{-1}\left(\frac{1-\alpha}{2}\right)$ and $F^{-1}\left(\frac{1+\alpha}{2}\right)$. Confidence interval calculations are surprisingly subtle. There are also important constraints for our use case that make standard approaches undesirable. Next, we’ll take a look in more detail at the motivation and the design for the approach we’ve adopted.</p><p>A key requirement of the whole project is to dramatically accelerate expensive analytics queries. It’s therefore vital that the overhead of estimating uncertainty isn’t too large compared to estimating the query result itself. We also want the feature to be as general as possible, but “isolated” within the language. In other words, ES|QL is a flexible language, and we want estimation to work with as much of it as possible. At the same time, we don’t want to introduce a cross-cutting feature that incurs development costs on every new feature we ship.</p><p>With these considerations in mind, we chose to estimate confidence intervals by partitioning the sample set and computing the query output on each subsample. This is reminiscent of the bootstrap; however, since we ensure that each partition receives a disjoint random subset of the sample data, we know that they comprise true estimates of the statistic distribution. To achieve the best possible estimate of the statistic itself, we still compute its value on the full sample. For example, to estimate the mean and its distribution, the process can be expressed as follows:</p><pre><code>FROM data | SAMPLE probability
          | EVAL bucketId = RANDOM(B) // B is the number of buckets
          | STATS avg     = AVG(x),
                  avg_0   = AVG(x) WHERE bucketId==0,
                  (...)
                  avg_B-1 = AVG(x) WHERE bucketId==B-1
            BY grouping
          | EVAL confidence_interval = CONFIDENCE_INTERVAL(avg, avg_0, ..., avg_B-1)
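          // (sketch of the internal rewrite: the engine adds the bucket columns and
          // this final reduction automatically; none of it appears in the user's query)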
</code></pre><p>This introduces a complication to account for the discrepancy between the count of values used to estimate a query statistic and the count used to sample its distribution. This is a downside; however, there are some significant advantages.</p><p>Most of the work in analytic queries resides in computing the aggregate statistics: post-processing after a <code>STATS</code> reduction acts on a far smaller table, and the cost is often relatively small. In this scheme, every row in the input data to the <code>STATS</code> command is processed exactly twice compared to just estimating the statistic. Therefore, roughly speaking, we pay a fixed overhead that&#x27;s the same order of magnitude as the cost of estimating the query in order to estimate its uncertainty. Since we often achieve multiple orders of magnitude speedup on the exact query, this is acceptable.</p><p>Because this process uses a plain old table, with extra columns for the distribution samples, we can pass the whole table through any ES|QL pipeline and compute confidence intervals on the final results. For example, if we include <code>EVAL square_avg = avg * avg</code> in the pipeline above, we&#x27;d have exactly the same <code>square_avg</code>, <code>square_avg_0</code>, …, <code>square_avg_B-1</code> extra values. At the end of the pipeline, we have samples from the distribution of the original statistics and all quantities that are computed using them. Therefore, we can apply our standard confidence interval machinery to reduce the table and convert samples into confidence intervals for derived quantities as well. This whole process is essentially transparent to the rest of the ES|QL language, and as we showed above, can be achieved by query rewriting.</p><h3>The confidence interval calculation</h3><p>We have independent samples of the statistic distribution $\{\hat{\theta}_i\}$. However, they&#x27;re computed with fewer values than our estimate $\bar{\theta}$. We also have a relatively small number of distribution samples, to avoid the count discrepancy being too large, and so we don’t inflate the table too much. We therefore prefer a parametric approach for estimating confidence intervals.</p><p>The errors in the statistics for which we support estimation tend to normal distributions in the limit that they&#x27;re computed from many values. So a natural choice, the standard interval, is to estimate the mean and standard deviation from the samples and report the corresponding normal confidence intervals $\left[ m+\sigma \Phi^{-1}\left(\frac{1-\alpha}{2}\right), m+\sigma \Phi^{-1}\left(\frac{1+\alpha}{2}\right) \right]$. Here, $\Phi$ denotes the standard normal distribution function; for a 90% interval, for instance, this is just $m \pm 1.645\,\sigma$. For heavy-tailed data and statistical functions that are sensitive to outliers, such as <code>STD_DEV</code>, convergence to normality can be slow, resulting in poorly calibrated intervals.</p><p>Briefly, in order to assess the quality of the intervals, one can examine their calibration. Specifically, one computes a quantity called the <a href="https://en.wikipedia.org/wiki/Coverage_probability">coverage</a>. For a central confidence interval, it should contain the true statistic value roughly $\alpha n$ times in $n$ trials. In fact, since we seek the central confidence interval, we can make the stronger statement that the true value should be above, or below, the confidence interval endpoints in roughly $\frac{1-\alpha}{2} n$ out of $n$ trials. The empirical coverage is this fraction computed for a large number of trials. 
It allows us to compare alternative approaches by simulation. We return to this when we report our test results.</p><p>In order to obtain better confidence intervals, we tried a couple of different approaches: the <a href="https://en.wikipedia.org/wiki/Cornish%E2%80%93Fisher_expansion">Cornish-Fisher</a> correction of quantiles and an adaptation of <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Deriving_confidence_intervals_from_the_bootstrap_distribution">bias-corrected accelerated</a> (BCa) confidence intervals. Simulation showed BCa provided more robust calibration across a range of confidences, so this is the approach we selected. The basic idea, which was introduced by Efron, is to assume that there exists a monotonic transformation of the underlying statistic $g=g(\theta)$ which, when applied to a distribution sample, normalizes its distribution:</p><div class="katex">\hat{\phi}=\phi+(1+a \phi) \left(N(0,1) - z_0 \right)\qquad(1)</div><p>Here, $\hat{\phi}=g(\hat{\theta})$, $\phi=g(\theta)$ and $N(0,1)$ is the standard normal random variable. This is clearly a relaxation of the assumption that the statistic itself is normally distributed, which is used to derive the standard interval. In fact, this family includes many distributions, since $g$ is only constrained to be monotonic. (You can think of $1+a\phi=1+a g(\theta)$ as a first-order Taylor expansion of the case that the variance is an arbitrary function of the true parameter value. This further relaxes the assumption that the normalizing transformation also stabilizes the variance.) The nice thing about this ansatz is that $g$ never needs to be explicitly computed, and there exist standard approaches for estimating the parameters $a$ and $z_0$ from the distribution samples.</p><p>To handle $z_0$, one simply arranges for the estimate to land at the median of the transformed distribution. If we assume the cumulative distribution function in $\theta$ space is $F_{\theta}$ then $z_0=\Phi^{-1}(F_{\theta}(\bar{\theta}))$, where $\bar{\theta}$ is the estimated statistic value, and as before $\Phi$ is the standard normal distribution function. Typically, $F_{\theta}$ is approximated by the empirical distribution function, computed indirectly by bootstrap. However, somewhat surprisingly, extensive simulation showed that we obtained better calibrated intervals using a normal approximation to our sample values, i.e. $F_{\theta}(\bar{\theta})=\Phi \left( \frac{\bar{\theta}-\hat{m}}{\hat{\sigma}}\right)$ with $\hat{m}$ and $\hat{\sigma}$ the samples&#x27; empirical mean and standard deviation, respectively.</p><p>To complete the procedure, one can rearrange (1) to derive $\alpha_{\theta}$ quantiles for $\theta$ as follows:</p><div class="katex">\alpha_{\theta}=\Phi\left( z_0 + \frac{z_0+z_{\alpha}}{1 - a (z_0+z_{\alpha})}\right)</div><p>where $z_{\alpha}$ is the standard normal z-score for quantile $\alpha$. Typically, one uses the inverse empirical cumulative density estimate of $F_{\theta}^{-1}$ to convert quantiles back to a confidence interval. However, because we have a mismatch between the count of values used to compute distribution samples and the query estimate, we need to do some sort of scaling. 
Exploring options by simulation, we again found it best to use a normal approximation, $F_{\theta}^{-1}(\alpha_{\theta})=\hat{m}+\frac{\hat{\sigma}}{\sqrt{s}}\Phi^{-1}(\alpha_{\theta})=\hat{m}+\frac{\hat{\sigma}}{\sqrt{s}}\left( z_0 + \frac{z_0+z_{\alpha}}{1 - a (z_0+z_{\alpha})}\right)$, where $s$ is the number of distribution samples we use. This is just applying the usual scaling of the standard deviation by $\frac{1}{\sqrt{\text{sample size}}}$.</p><p>Efron showed that in the case that $\hat{\theta}$ is distributed as $f_{\theta}$, i.e. that it depends only on the true value $\theta$, then the acceleration $a$ can be estimated without any knowledge of $g$. In particular, $a=\frac{1}{6}\text{SKEW}\left(\frac{\partial}{\partial \theta} \log f_{\theta}\right)$, one sixth of the skewness of the score function. By assumption, our statistics tend to normal distributions with mean $\theta$. Since skew is translation and scale invariant, this gives that $a \approx \frac{1}{6}\text{SKEW}(\{\hat{\theta}\})$, i.e. one sixth of the skew of our distribution samples. One thing this glosses over is the dependence of skew, and therefore acceleration, on sample size. We know it tends to zero as the count increases; in fact, skew asymptotes to zero as $\frac{1}{\sqrt{\text{sample size}}}$, so we adjust the acceleration to be $\frac{1}{6\sqrt{s}}\text{SKEW}(\{\hat{\theta}\})$ to account for the count mismatch between the samples $\{\hat{\theta}\}$ and the estimate $\bar{\theta}$.</p><p>Although we significantly improve the calibration of confidence intervals by using a better methodology, we still see issues for some of the supported <code>STATS</code> functions in the case that the underlying distribution has very heavy tails. Therefore, we introduce some additional guard rails, which we discuss next.</p><h3>Guard rails</h3><p>To avoid the user having to understand too much about edge cases, we provide additional safeguards that surface when we&#x27;ve been unable to confirm that the distribution samples behave as we expect. This typically happens when the statistic isn’t computed from a sufficient number of values given the metric distribution. It&#x27;s exacerbated by very skewed metric data and certain aggregation functions, such as <code>STD_DEV</code>, which are sensitive to outliers.</p><p>We have some global constraints on the minimum count of values used to estimate a statistic for which we&#x27;ll certify it. For example, if any bucket is empty, then we can’t rely on the distribution samples. This is because ES|QL allows mixing approximate statistics that treat empty buckets differently. For example, consider the following query:</p><pre><code>SET approximation=true;
FROM data | STATS avg = AVG(x), sum = SUM(y) | EVAL mix = avg + sum
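// for an empty bucket, treating it as zero biases the avg estimate, while
// ignoring it biases the sum; hence results with empty buckets aren't certified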
</code></pre><p>There is no self-contained way of correctly assigning a value to <code>mix</code> for empty buckets, since summing requires that we treat them as zero, in which case we bias our estimate of <code>avg</code>. Alternatively, ignoring empty buckets introduces bias in the <code>sum</code>. There is also a global minimum count of values for which we’ve verified our certification method is sufficiently reliable; this is 10.</p><p>We explored a variety of additional tests to certify the results. These were based both on tests of the underlying data distribution, specifically <a href="https://en.wikipedia.org/wiki/Heavy-tailed_distribution#Hill.27s_tail-index_estimator">Hill’s estimator</a>, and on the statistic’s distribution properties. If the true distribution of the statistic is sufficiently normal, then our estimate and confidence interval calculation behave as we expect: The interval is well calibrated and the interval width is representative of the actual error. Therefore, in the end, we chose to use a test based on the p-values for the distribution samples’ <a href="https://en.wikipedia.org/wiki/Skewness">skewness</a> and <a href="https://en.wikipedia.org/wiki/Kurtosis">kurtosis</a> versus a normal distribution null hypothesis. To certify a result, we require that the two-tailed p-values are greater than 0.05 for both tests. As we show below, we found this test was well aligned with our actual needs: to distinguish results for which the estimate and its confidence interval are more and less reliable.</p><p>There&#x27;s a simple trick we can use to boost the accuracy of the test: Create multiple independent distribution samples and use a vote. Given a test to certify results with a failure rate $f$, the probability of $k$ failures in $t$ independent tests is $\frac{t!}{k!(t-k)!}f^k(1-f)^{t-k}$ when the null hypothesis, that the estimate is trustworthy, is true. For example, for a majority vote with $f=0.05$ and $t=3$, the significance of the test is $1-(3\times 0.95^2\times 0.05+0.95^3)=0.007$, i.e. we fail to certify fewer than 1% of trustworthy results. Note that we can compute multiple trials relatively easily using different seeds for the <code>RANDOM</code> bucket identifier.</p>
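<p>As a quick sanity check of that arithmetic, the same computation can be written directly in ES|QL (a minimal sketch; a 3-test majority vote fails only when two or three of the tests fail):</p><pre><code>// probability that a majority of t=3 independent tests fail when each fails
// with probability f=0.05, i.e. 1 - P(0 failures) - P(1 failure)
ROW f = 0.05
| EVAL p_majority_fails = 1 - (POW(1 - f, 3) + 3 * f * POW(1 - f, 2))
// returns approximately 0.007</code></pre>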
<p>This additional check allows us to certify that we trust our estimates and their errors. We surface this information in the approximate query results. When we can’t certify results, they won’t necessarily be inaccurate, but they should be treated with more caution.</p><h3>Testing</h3><p>The two main aims of the testing we discuss here were to understand the calibration of the confidence intervals and to see how well they characterize the statistics&#x27; estimation errors. The count function is particularly well behaved (its error distribution is binomial), so the majority of our testing focused on metric aggregations. We study smooth distributions but make sure we cover a range of tail behaviors. The presence of outliers is the key factor that reduces the accuracy of estimated statistics. For example, if an outlier isn’t sampled at all, it can significantly affect the value of some statistics.</p><p>We explored a range of light-tailed distributions, such as uniform and normal, and skewed and heavy-tailed distributions, such as exponential, log-normal, Cauchy, and Pareto. For each family of distribution, we used multiple parameterizations, focusing primarily on varying the scale parameter. In total, we had 24 distinct data distributions. Figure 1 shows some example sample distributions from this set. Note that we’ve truncated the charts to remove extreme outliers, which are present for both the Cauchy and log-normal distributions.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb088c17f9755c0d1b3173fb917f0af2c0f83847-1712x950.png" style="width: 100%;" /><p>For each data distribution, we evaluated 14 different sample sizes, ranging from 1,000 to 500,000. Then, for each sample set, we evaluated <code>AVG</code>, <code>COUNT</code>, <code>MEDIAN_ABSOLUTE_DEVIATION</code>, <code>MEDIAN</code>, <code>PERCENTILE([25, 75, 90, 95, 99])</code>, <code>SUM</code> and <code>STD_DEV</code> at two levels of confidence, 50% and 90%. In total, we have around 7,500 distinct experiments. For each experiment, we assessed the interval calibration by performing 100 runs and counting the number of times the true statistic lands in the confidence interval. This gives us a binomially distributed estimate for the true confidence interval coverage. The variation we expect in the estimated coverage changes slightly with the level of confidence; for example, at 50% we expect to see values mainly between 0.44 and 0.56, and for 90% we expect to see values mainly between 0.86 and 0.94 using 100 trials.</p>
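<p>These bands correspond to roughly ±1.28 standard errors of a binomially distributed proportion estimated from 100 trials; the arithmetic can be checked with a quick ES|QL sketch (our reading of the quoted numbers):</p><pre><code>// standard error of a binomial proportion over n trials, and a ±1.28 se band
ROW p = 0.5, n = 100.0
| EVAL se = SQRT(p * (1 - p) / n), lower = p - 1.28 * se, upper = p + 1.28 * se
// p = 0.5 gives roughly [0.44, 0.56]; rerunning with p = 0.9 gives [0.86, 0.94]</code></pre>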
<p>Figure 2 shows <a href="https://en.wikipedia.org/wiki/Box_plot">box plots</a> for the empirical coverage for the two confidence levels computed from all experiments. In all cases, the confidence intervals are reasonably well calibrated. Extreme percentiles are biased for small sample sizes, which leads to the increased outlier counts in those cases. As a rule of thumb, you’d want roughly $\frac{10}{p(1-p)}$ samples to ensure that you have enough samples in the appropriate tail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/02a5375025e811ba18c4e823e1d984261bbf6f42-631x763.png" style="width: 100%;" /><p>Next, we examine the degree to which the confidence intervals capture the typical size of the estimate error. To do this, we examine the distribution of the ratio of the estimated statistics&#x27; error and half the confidence interval width for all certified results. The higher the confidence, the wider the interval, so different confidence levels shift the mean of this distribution. Figure 3 shows this distribution computed for the 90% confidence interval. As expected, the distribution is roughly normal, albeit with a tail of some larger errors. We see that in all cases the confidence interval width gives the order of magnitude of the estimated statistics&#x27; actual errors.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2d4ab88a07910edac7e8406ae4942694751f0090-1000x600.png" style="width: 100%;" /><p>We’ve shown that certified results are nearly always reliable; however, we’d also like some insight into the proportion of results that we fail to certify but that are actually reliable, to confirm that the test aligns with our objective. We use <em>reliable</em> here in the fairly strong sense that the confidence interval is well calibrated. Specifically, for the 50% and 90% confidence intervals, we count the proportion of uncertified results for which the confidence interval empirical calibration has an acceptable margin of error, given the number of trials used to estimate it. Using this procedure, the false positive rate across all experiments is around 1%. This agrees well with the failure rate we expect by chance, given our test parameters, and confirms the assumption underlying the test.</p><p>Finally, to better understand the difference between certified and uncertified results, Figure 4 shows the distribution of the ratio of the estimated statistics&#x27; errors and half the 90% confidence interval for the reliable and unreliable results separately. Note that we truncated the range for uncertified intervals.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c0cd6d42f5d061aac15767539209a7c443ed1acd-1000x600.png" style="width: 100%;" /><h3>Wrapping up</h3><p>In this post, we presented the background behind our approach for quickly estimating ES|QL queries and providing an indication of their errors. To do this, we developed an effective confidence interval mechanism. Our approach also allows us to estimate confidence intervals for quantities derived from sampled statistics via other pipeline operations. Quantifying the error comes with a relatively small overhead compared to just estimating the query. Finally, we developed a statistical test to certify results we return. Values that aren’t certified can still be accurate, but we’re less confident in them.</p><p>As well as testing the feature on a range of real-world use cases, which we discuss in <a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1">our companion post</a>, we tested the error estimation by extensive simulation across a range of data characteristics, sample sizes, aggregation functions, and confidence levels. This showed confidence intervals are well calibrated, and the interval itself provides a good approximation of the actual error we observe in the estimates. Finally, we showed that we were able to certify intervals with a low false negative rate.</p><p>We’re planning to integrate this feature into other stack capabilities in the future, so stay tuned.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast approximate Elasticsearch ES|QL - part I]]></title>
            <description><![CDATA[Introducing the work we've done on a fast approximate querying mode for Elasticsearch ES|QL. In many cases, it allows us to achieve orders of magnitude latency reductions while providing accurate estimates.]]></description>
            <link>https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-1</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jan Kuipers, Thomas Veasey]]></dc:creator>
            <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/17afc59be8a46957a341faec1f44c9cb0a221894-1918x1176.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Analytics workloads typically involve summarizing large volumes of data into a much smaller number of statistics. The Elasticsearch Query Language (ES|QL) implements this capability using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by">STATS command</a>. This allows you to select various aggregation functions and apply them to the previous query results, as well as grouping the results by one or more ES|QL expressions. This is a flexible operation that, coupled with ES|QL querying capabilities, allows one to perform <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a> on data stored in collections of Elasticsearch indices.</p><p>One of the key requirements for a pleasant user experience is that these operations are performed quickly. Large language model–based (LLM) agents also introduce new <a href="https://arxiv.org/pdf/2509.00997">higher bandwidth and speculative query patterns</a> that can potentially benefit from different optimization strategies.</p><p>In this two-part blog series, we discuss an optimization approach we’re introducing to ES|QL in version 9.4 of Elasticsearch and the Elastic Stack, which exploits a relaxation of the problem. Rather than trying to get exact values for aggregates, we allow ourselves to return approximate values, together with some characterization of their error. A key benefit of approximation is that it breaks the dependency between performance and dataset size: The accuracy with which one can approximate a query doesn’t depend on the original dataset size but, principally, its data characteristics and the query itself. As we’ll see later, this allows us to achieve some dramatic performance improvements.</p><p>In our<a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2"> next blog post</a>, we will discuss the theory behind our approach and the validation we’ve done of its statistical properties. Here, we introduce the syntax and give a sense of how it’s achieved using standard ES|QL and query rewriting. You can explore its performance on a subset of the popular <a href="https://github.com/ClickHouse/ClickBench">ClickBench</a> benchmark. Finally, we discuss some limitations and gotchas that are worth understanding when you use query approximation.</p><h3>Syntax and behavior</h3><p>So how do you actually use it?</p><pre><code>SET approximation=true;
// The query you want to approximate
FROM data | commands | STATS x=agg(...) | commands</code></pre><p>That’s it. You simply introduce the new line <code>SET approximation=true;</code> and write your <code>STATS</code> query pipeline as usual. Below, we discuss some advanced configuration options and some limitations around the <code>agg(...)</code> and <code>commands</code>. However, essentially, we choose defaults so that this will typically provide useful approximations while achieving significant speedups.</p><p>With this change, you’ll see some differences in the query results. Let’s look at a concrete example to illustrate this. Suppose the raw query is as follows:</p><pre><code>FROM sales | WHERE @timestamp >= NOW()-1w
           | STATS count = COUNT() BY item_category
           | SORT count DESC
           | LIMIT 5</code></pre><p>The results might look something like this:</p><pre><code>item_category        | count
---------------------+------
Household Essentials | 5165
Kitchen              | 2132
Storage              | 1121
Home Decor           | 877
Furniture            | 357</code></pre><p>Approximating this query introduces some extra columns for each quantity that&#x27;s estimated:</p><pre><code>item_category        | count | _approximation_confidence_interval(count) | _approximation_certified(count)
---------------------+-------+-------------------------------------------+--------------------------------
Household Essentials | 5150  | [5100, 5250]                              | true
Kitchen              | 2150  | [2100, 2200]                              | true
Storage              | 1120  | [1100, 1150]                              | true
Home Decor           | 880   | [860, 900]                                | true
Furniture            | 330   | [310, 350]                                | true</code></pre><p>The <code>count</code> column now contains an estimate, and you’ll see it’s somewhat different from the exact values above. The <code>_approximation_confidence_interval(count)</code> column defaults to the central 90% confidence interval for the <code>count</code> estimate, and the <code>_approximation_certified(count)</code> column indicates if we’re highly confident that the results and their confidence interval are trustworthy. In outline, the <em>confidence interval</em> is an interval that we expect has a high probability (0.9) of containing the true value of the quantity being estimated. The <em>certified column</em> indicates that the distribution of the approximation is behaving as we expect. When the result isn’t certified, it’s often still accurate, but our test of the properties of its distribution hasn’t been able to confirm this. These quantities are discussed in more detail in our second post.</p><h3>Implementation</h3><p>An approximate query is rewritten before query execution using random sampling and extrapolation. Let’s take a look at the query of the previous section. The part of the rewritten query responsible for obtaining the best estimate looks like:</p><pre><code>FROM sales | SAMPLE probability
           | WHERE @timestamp >= NOW()-1w
           | STATS count = TO_LONG(COUNT() / probability) BY item_category
           | SORT count DESC
           | LIMIT 5</code></pre><p>The query samples a fraction of the data, and therefore the final count has to be extrapolated by scaling up with the inverse of the sample probability. Extrapolation clearly depends on the underlying aggregation function, and we handle this appropriately for all functions we support.</p>
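<p>To make the extrapolation concrete, here’s a sketch of how different aggregation types would be handled (our illustration of the idea rather than the engine’s exact rewrite; <code>amount</code> is a stand-in field): additive aggregates such as <code>COUNT</code> and <code>SUM</code> are scaled by the inverse sample probability, while ratio-style aggregates such as <code>AVG</code> are already unbiased on the sample.</p><pre><code>FROM sales | SAMPLE probability
           | STATS c = TO_LONG(COUNT() / probability), // additive: scale up
                   s = SUM(amount) / probability,      // additive: scale up
                   a = AVG(amount)                     // ratio: no scaling needed
             BY item_category</code></pre>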
<p>To obtain the sample probability, we&#x27;re setting a fixed <code>number_of_rows</code> to be processed by the <code>STATS</code> command. In this case, the probability is calculated as follows:</p><pre><code>FROM sales | WHERE @timestamp >= NOW()-1w
           | STATS total_row_count = COUNT()
           | EVAL probability = number_of_rows / total_row_count</code></pre><p>This query runs before the final approximate query is executed.</p><p>As well as this best estimate, we also need to compute confidence intervals and a statistical test used to certify that the value distribution is behaving as we expect. The intervals are computed using a variant of the <a href="https://blogs.sas.com/content/iml/2017/07/12/bootstrap-bca-interval.html">bias-corrected and accelerated bootstrap confidence interval</a> (BCa) method. Therefore, the data needs to be partitioned into B buckets, which are used in turn to compute the intervals. Omitting some implementation details, this approximate query looks like:</p><pre><code>FROM sales | SAMPLE p
           | WHERE @timestamp >= NOW()-1w
           | EVAL bucketId = RANDOM(B) // B is the number of buckets
           | STATS count     = TO_LONG(COUNT() / p),
                    count_0   = TO_LONG(COUNT() / p) WHERE bucketId==0,
                   (...)
                   count_B-1 = TO_LONG(COUNT() / p) WHERE bucketId==B-1  
             BY item_category
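           // guard rail: partitions estimated from too few sampled rows are
           // dropped rather than reported (see "Limitations and best practices")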
           | WHERE count >= 10
           | SORT count DESC
           | LIMIT 5
           | EVAL ci = CONFIDENCE_INTERVAL(count, count_0, ..., count_B-1),
                  certified = CERTIFIED(count, count_0, ..., count_B-1)
           | DROP bucketId, count_0, ..., count_B-1</code></pre><p>To certify the estimate and confidence interval, there should be enough data, and the distribution of the bucket values should tend to normality.</p><p>Some queries can be efficiently computed using only summary statistics maintained in the index. To handle these correctly, where sampling is both slower and inaccurate, we updated the physical query planner, since detecting this case requires information that’s only available where the data resides. When the planner detects this is possible, it simply executes the query as normal. Such queries are typically fast anyway, and there’s no real side effect, so you don’t need to worry about this when using approximation; however, you’ll see that confidence intervals for such queries always have zero length, indicating the results are exact.</p><h3>Results</h3><p>To explore the performance improvements, we use <a href="https://github.com/ClickHouse/ClickBench">ClickBench</a>. This is a benchmark for analytics workloads for database management systems (DBMS). It comprises approximately 100 million rows, with a focus on clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. The benchmark also defines 43 queries that are typical of ad-hoc analytics and real-time dashboards.</p><p>Some of the queries aren’t suitable for approximation. For example, we don’t support approximating the unique count of a categorical value or computing the minimum and maximum of a metric value. We also don’t care about queries targeting search alone, for which Elasticsearch has excellent performance in any case. We therefore exclude these types of query from our evaluation. Finally, we also want to test a few additional aggregation functions, such as percentiles, which are not well represented in the original query set, so we add some variants of the original metric queries to this end.</p><p>Queries in the benchmark are written using standard SQL and so need porting to use ES|QL syntax. This translation is fairly straightforward. Here’s an example:</p><pre><code>SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits</code></pre><p>becomes:</p><pre><code>FROM hits | STATS s = SUM(AdvEngineID),
                  c = COUNT(*),
                  a = AVG(ResolutionWidth)</code></pre><p>when rewritten in ES|QL.</p><p>For running all benchmarks, we use an Elastic Cloud Hosted instance with 870GB disk, 29GB RAM, and 4 vCPUs; in effect, an Amazon Elastic Compute Cloud (EC2) i3.xlarge instance. In the following results, we simply compare ES|QL with and without query approximation. Extensive results on a range of different hardware setups and datastores can be found <a href="https://benchmark.clickhouse.com/">here</a>. Even with significantly constrained test hardware (matching the vCPUs of the smallest setup), our approximation approach achieves competitive results against much larger systems.</p><p>We run each query and its approximation five times in a random order, clearing the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-clear-cache">query cache</a> between each run. We report the average run time over all five runs. While clearing the cache should be sufficient to avoid most of the advantage of running second, we wanted to avoid any possible accidental prewarming effects, which is why we alternate.</p><p>The results break down into four categories:</p><ol><li>Queries which are rewritten to use index summary statistics (three queries).</li><li>Queries that perform well (13 queries).</li><li>Queries with high cardinality partitioning (seven queries).</li><li>Queries with restrictive filters (12 queries).</li></ol><p>Roughly speaking, for these four categories, approximate querying is: equivalent (1); faster and accurate (2); faster but unreliable (3); and slightly slower (4), compared to exact querying, respectively.</p><p>For category 1, the planner automatically detects that we’re able to perform the query using summary statistics, and we end up executing the queries in the same way. To do this, we need information that’s only available on the data nodes, so we perform the rewrite only after we&#x27;ve estimated the sample probability. Because we&#x27;re able to do this very efficiently, the overhead is small (around 10–15%). In both cases, the results are exact.</p><p>Queries in category 2 run on average 23$\times$ faster if estimating the values and computing confidence intervals, and 72$\times$ faster if just estimating the values, which you can select as follows: <code>SET approximation={&quot;confidence_level&quot;:null}</code>. These headline figures hide quite some variation in the impact of approximation on performance. The table below shows some queries sampled from the range of speedups we see:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate with CI / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate without CI / ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">1725</td><td style="border: 1px solid #ccc; padding: 4px;">145</td><td style="border: 1px solid #ccc; padding: 4px;">15</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10</td><td style="border: 1px solid #ccc; padding: 4px;">4340</td><td style="border: 1px solid #ccc; padding: 4px;">1721</td><td style="border: 1px solid #ccc; padding: 4px;">56</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">13</td><td style="border: 1px solid #ccc; padding: 4px;">32912</td><td style="border: 1px solid #ccc; padding: 4px;">6106</td><td style="border: 1px solid #ccc; padding: 4px;">3821</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">21</td><td style="border: 1px solid #ccc; padding: 4px;">46739</td><td style="border: 1px solid #ccc; padding: 4px;">3284</td><td style="border: 1px solid #ccc; padding: 4px;">2139</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">252505</td><td style="border: 1px solid #ccc; padding: 4px;">6478</td><td style="border: 1px solid #ccc; padding: 4px;">5019</td>
                    </tr>
              </tbody>
            </table>
          <p>Here are the corresponding queries:</p><pre><code>// Query 3
FROM hits | STATS s = SUM(AdvEngineID),
                  c = COUNT(*),
                  a = AVG(ResolutionWidth)

// Query 10
FROM hits | STATS s = SUM(AdvEngineID),
                  c = COUNT(*),
                  a = AVG(ResolutionWidth) BY RegionID
          | SORT c DESC
          | LIMIT 10

// Query 13
FROM hits | WHERE SearchPhrase != ""
          | STATS c = COUNT(*) BY SearchPhrase
          | SORT c DESC
          | LIMIT 10

// Query 21
FROM hits | WHERE URL != ""
          | STATS l = AVG(LENGTH(URL)), c = COUNT(*) BY CounterID
          | WHERE c > 100000
          | SORT l DESC
          | LIMIT 25

// Query 22
FROM hits | WHERE Referer != ""
          | GROK Referer """%{URIPROTO}://(?:www\.)?%{URIHOST:k}"""
          | WHERE k IS NOT NULL
          | STATS l = AVG(LENGTH(Referer)), c = COUNT(*) BY k
          | WHERE c > 100000
          | SORT l DESC
          | LIMIT 25</code></pre><p>We&#x27;ll return to the accuracy of the approximation in the next blog post, but to give a sense of this, we plot below the exact and approximate values for a sample run for query 13:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9ab83c13f42f88253a242d78339356f4a7c48700-2094x1358.png" style="width: 100%;" /><p>For category 3, we get an average speedup of $11\times$. However, the results of queries in this category can miss some partitions and often have large estimation errors. Approximation can still be valuable for such queries, particularly in the context of agentic workflows, but requires larger sample sizes than our default if accuracy is important. As we discuss in the next section, we provide an API to explicitly control the sample size. If the source dataset is sufficiently large, this can be increased and approximation will still yield significant performance improvements. The table below shows a couple of query examples for this category:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query</td><td style="border: 1px solid #ccc; padding: 4px;">Baseline / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate with CI / ms</td><td style="border: 1px solid #ccc; padding: 4px;">Approximate without CI / ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">15</td><td style="border: 1px solid #ccc; padding: 4px;">8256</td><td style="border: 1px solid #ccc; padding: 4px;">1187</td><td style="border: 1px solid #ccc; padding: 4px;">124</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">17</td><td style="border: 1px solid #ccc; padding: 4px;">70641</td><td style="border: 1px solid #ccc; padding: 4px;">2109</td><td style="border: 1px solid #ccc; padding: 4px;">982</td>
                    </tr>
              </tbody>
            </table>
          <p>Here are the corresponding queries:</p><pre><code>// Query 15
FROM hits | STATS c = COUNT(*) BY UserID, SearchPhrase
          | SORT c DESC
          | LIMIT 10

// Query 17
FROM hits | EVAL m = DATE_EXTRACT("minute_of_hour", EventTime)
          | STATS c = COUNT(*) BY UserID, m, SearchPhrase 
          | SORT c DESC
          | LIMIT 10</code></pre><p>Finally, category 4 queries use selective filters and end up being executed exactly, but they run slightly slower because of the work done in the query rewrite stage. Typically, all these queries run fast anyway, so the absolute slowdown is small. On average, they run approximately 14% (around 370ms) slower than running without sampling for our test setup.</p><h3>Limitations and best practices</h3><p>It’s worth explicitly mentioning some limitations. In particular, the following queries are not currently supported:</p><ol><li>Queries using the <code>TS</code> source command.</li><li>Queries using the <code>FORK</code> or <code>JOIN</code> processing command.</li><li>Pipelines which use two or more <code>STATS</code> commands.</li><li>The <code>ABSENT</code>, <code>PRESENT</code>, <code>DISTINCT_COUNT</code>, <code>MIN</code>, <code>MAX</code>, <code>TOP</code>, <code>ST_CENTROID_AGG</code> and <code>ST_EXTENT_AGG</code> aggregation functions.</li></ol><p>We plan to lift some of these restrictions in future releases, such as approximating queries using <code>TS</code>, <code>FORK</code> and <code>JOIN</code>; however, some are intrinsic. For example, while there’s prior art for estimating the <a href="https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution">minimum and maximum</a> of a metric dataset or the count of unique values of a categorical dataset (see, for example, <a href="https://arxiv.org/pdf/2202.02800">this</a> paper), they require making certain distributional assumptions, either explicitly or implicitly. In summary, we view trying to automatically provide estimates of these statistics as being too open to accidental misuse.</p><p>For the expert user, we provide another route: ES|QL supports using the <code>SAMPLE</code> command directly. This allows one to obtain “point estimates” of any query, albeit with no attempt to correct for the impact of sampling or quantify error. For example:</p><pre><code>FROM data | SAMPLE 0.01 | STATS DISTINCT_COUNT(value)</code></pre><p>computes the unique count of the <code>value</code> field on a sample of roughly 1/100th of the dataset. The sample probability can be adjusted to get a sense of how the estimate is asymptoting, or more sophisticated estimation procedures can use <code>STATS COUNT() BY value</code> to estimate the frequency profile of the data.</p>
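<p>For example, one could probe how such a point estimate settles as the sample grows (a sketch, run as three separate queries; <code>value</code> stands in for your field):</p><pre><code>// rerun with increasing sample probabilities and watch where the estimate levels off
FROM data | SAMPLE 0.001 | STATS dc = DISTINCT_COUNT(value)
FROM data | SAMPLE 0.01  | STATS dc = DISTINCT_COUNT(value)
FROM data | SAMPLE 0.1   | STATS dc = DISTINCT_COUNT(value)</code></pre>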
<p>There are a couple of cases that are more problematic for sampling. If a very restrictive filter is applied in the query, then sampling is of little value, since few rows match anyway. In such cases, the rewrite phase discovers that we’d have to sample too large a proportion of the rows to estimate the query, and we revert to running the query without sampling; its result is exact. However, the search procedure to determine the fraction of rows to sample comes with some overhead. One therefore pays a penalty, albeit less than the original query cost, for no benefit. If you know in advance that the query is expected to match relatively few rows, it&#x27;s best to run it without approximation.</p><p>The second case only applies when computing <code>STATS</code> partitioned by some expression. If the cardinality of this expression is very high, then even if many rows are searched, individual statistics may be computed from a small number of rows. Some cases are more problematic than others. Sorting by ascending count, that is, finding the rarest partitions, can be impossible to estimate in a single query: The sample is dominated by heavy hitters, so finding the rare partitions could require sampling most of the dataset. For this particular case, heavy hitting partitions can be estimated first and sometimes efficiently excluded by updating the query. In general, infrequent partitions may be lost in the sampling process, and their statistics&#x27; estimation errors can be high. It’s worth noting that we won’t attempt to estimate any statistic for which we have fewer than 10 samples; such statistics are simply dropped from the result set. In the case of a very high cardinality <code>BY</code> clause, for example, a field whose value is unique for every row, this means the query can return no results. If you find approximate query results are too inaccurate, you have the option to increase the sample size, which by default is 1,000,000 for <code>STATS</code> with grouping and 100,000 otherwise. Currently, this needs to be done manually, and we provide the following API for this:</p><pre><code>SET approximation={"rows":12345678};
FROM data | commands | STATS x=agg(...) | commands</code></pre><p>Occasionally, functions significantly alter the distribution characteristics of the quantities they act on. A contrived example is the following:</p><pre><code>FROM data | STATS sl = SUM(length) | EVAL csl = COS(sl)</code></pre><p>If the variation in the estimate <code>sl</code> is much larger than $2\pi,$ we expect the distribution of <code>csl</code> to be mainly flat in the interval $[-1,1]$ with peaks near both endpoints. In this particular case, it’s not clear that the central confidence interval is a particularly useful concept, since the modes of the distribution lie outside almost all central confidence intervals. In any case, just observing the samples of <code>csl</code>, our standard confidence interval machinery won’t reliably characterize this distribution, and it will underestimate the variability of <code>csl</code>. However, our statistical test should detect this problem, and the result won’t be certified.</p><p>Finally, we note that Elasticsearch implements some query optimization strategies that ideally <a href="https://github.com/elastic/elasticsearch/issues/138151">need to account for the fact that sampling is taking place</a>. These rewrite the query at the Lucene level, and the preprocessing involved in this rewrite can be relatively expensive. Accelerating an expensive string matching operation by first building a suitable data structure makes sense if the query needs to process every row, but if it processes only a small fraction of them, the trade-off is different. This is something we plan to enhance in the future.</p><h3>Conclusions</h3><p>In this blog post, we introduced a new form of query optimization we’re bringing to ES|QL that enables dramatically faster querying by relaxing the constraint that the results are exact. We found on ClickBench that we were able to accurately estimate query values and their confidence intervals up to 100 times faster, and values alone up to 250 times faster, than we can compute them exactly. Furthermore, we expect this advantage to grow as the dataset size increases, because the approximation accuracy is independent of the dataset size. This feature works with many features of the ES|QL language and is enabled by simply prepending <code>SET approximation=true;</code> to the query to estimate.</p><p>As well as providing a point estimate, we also estimate confidence intervals and indicate whether we think that the underlying assumptions used to compute these are satisfied. This allows us to certify results when they are reliable. We explain the theory behind this feature and discuss the testing of its accuracy in our <a href="https://www.elastic.co/search-labs/blog/fast-approximate-esql-part-2">next post</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is your ML job's datafeed losing a race it cannot win?]]></title>
            <description><![CDATA[Learn how switching from scroll-based to aggregation-based datafeeds optimizes machine learning jobs for large-scale deployments.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-machine-leaning-jobs-aggregation-datafeeds</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-machine-leaning-jobs-aggregation-datafeeds</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valeriy Khakhutskyy]]></dc:creator>
            <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/023e3e6cb25891f789129d496c181113cc570f1f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>On almost every large Elastic deployment I’ve worked with, there’s an Elastic Security or Elastic Observability anomaly detection (AD) job that looks healthy but is perpetually behind. Six hours behind. Twelve. And the gap never closes.</p><p>The datafeed isn’t broken. It’s doing exactly what it was built to do: reading every raw document, across every shard, every run. On a large cluster with cross-cluster search (CCS) and a broad index pattern, like <code>logs-*</code>, that means scanning billions of documents per bucket. There’s no hardware that makes that sustainable. The datafeed will always be chasing live data and never reaching it.</p><p>The fix is to switch from the default <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-scroll"><strong>scroll-based</strong></a> datafeed configuration to an <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/ml-configuring-aggregation"><strong>aggregation-based</strong> datafeed configuration</a>: Let the data nodes summarize locally, and ship only compact bucket results to the ML node. Same detections, a fraction of the load. The speedup can be dramatic. More than you might expect. The numbers are in the next section. The explanation for <em>why</em> the gap is so large is at the end of the post, for those who want to understand the mechanics.</p><p>One catch worth knowing now: Switching requires creating a new job. The old model doesn’t transfer; weeks of learned baseline are lost. <strong>The right time to make this switch is before the job has been running for months, not after.</strong> That’s the main reason to read this before you deploy.</p><h2><strong>How much faster? Scroll vs. aggregation datafeeds for ML jobs</strong></h2><p>We ran the same job two ways on production data: first scroll-based, and then aggregation-based. The job covered 13 months of history, monitoring 836,000 log events per hour in 15-minute buckets across multiple clusters.</p><p>Training on historical data with scroll-based configuration: <strong>five days of wall-clock time</strong>, 7.9 million sequential requests, and 3.5 TB transferred; with aggregations: <strong>2.3 minutes</strong>, 23 requests, and 34 MB (a 3,374× speedup). Think of it this way: If you start the scroll backfill at 9 a.m. Monday, it will finish Saturday morning. The aggregation version is done by 9:02 a.m.</p><p>On live data, the difference is less dramatic but still meaningful: around <strong>20×</strong> fewer requests per tick. That adds up quickly when the datafeed runs every few minutes around the clock.</p><h2><strong>Before you start</strong></h2><p>Three things worth knowing before diving into the configuration.</p><p><strong>This isn&#x27;t wizard territory.</strong> The standard Kibana job wizards (Single Metric, Multi-Metric, Population) don&#x27;t expose aggregation configuration. To create an aggregation-based job, you need either the Elasticsearch API or Kibana&#x27;s Advanced Job Wizard, with JSON edited by hand. The worked example below shows the most practical path: Configure the job in the Multi-Metric Wizard, and then click <strong>Convert to advanced job</strong> before creating it. 
That gets you a prefilled JSON starting point instead of a blank editor.</p><p><strong>The configuration is unforgiving and mostly silent about it.</strong> There&#x27;s no schema validation that catches a misnamed aggregation key or a <code>fixed_interval</code> that doesn&#x27;t match <code>bucket_span</code>. The job will run, anomalies will fire, and nothing will indicate that the results are based on the wrong data. This is why the five-step pattern exists and why the <strong>Preview</strong> tab is worth using every time: Catching a misconfiguration before the job trains is a 30-second check; catching it a week later is a much worse afternoon.</p><p><strong>The Single Metric Viewer has a known limitation with aggregated jobs.</strong> That viewer reconstructs the &quot;actual&quot; data curve by re-querying the index, but it can&#x27;t reproduce an arbitrary, user-defined aggregation, so the actual-value line is typically missing or approximate. The Anomaly Explorer is unaffected: Anomaly scores, swim lanes, and influencer attribution all work normally. Just don&#x27;t rely on the Single Metric Viewer&#x27;s chart for visual validation of what the model saw.</p><h2><strong>What we can and can’t aggregate</strong></h2><p>Almost every <a href="https://www.elastic.co/docs/reference/machine-learning/machine-learning-functions">ML function</a> works with aggregated datafeeds, but the right aggregation pattern depends on the function.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Function</td><td style="border: 1px solid #ccc; padding: 4px;">Pattern</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`count`, `mean`, `high_mean`, `low_mean`, `sum`, `max`, `min`, `varp`</td><td style="border: 1px solid #ccc; padding: 4px;">Standard: `date_histogram` → `terms` → metric aggregation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`time_of_day`, `time_of_week`</td><td style="border: 1px solid #ccc; padding: 4px;">Minimal: plain `date_histogram`, no `terms` or metric needed</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`rare`, `freq_rare`, `info_content`</td><td style="border: 1px solid #ccc; padding: 4px;">Composite: top-level composite with `date_histogram` as a source</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`categorization`</td><td style="border: 1px solid #ccc; padding: 4px;">`terms` on the `.keyword` subfield of the categorization field</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`lat_long`</td><td style="border: 1px solid #ccc; padding: 4px;">Scroll only</td>
                    </tr>
              </tbody>
            </table>
<p><code>lat_long</code> is the one genuine exception. The configuration is accepted, but <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-geocentroid-aggregation"><code>geo_centroid</code></a> computes the arithmetic mean of all coordinates in a bucket: If the same entity appears in New York and London within the same bucket, the centroid ends up in the Atlantic Ocean, which probably doesn&#x27;t make sense for the use case. Keep <code>lat_long</code> jobs on scroll-based datafeeds.</p><p>The five-step pattern in the next section covers the standard case. We’ll walk through the remaining patterns at the end of the post.</p><h2><strong>The standard five-step pattern: Scroll-based to aggregation datafeed</strong></h2><p>Converting any scroll-based job to an aggregation-based datafeed follows the same five steps. Once you understand the pattern, applying it to any compatible job takes about 10 minutes.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/77cd16165133374a04dbcf71210ea8d36f66b54f-1999x924.png" style="width: 100%;" /><p><strong>Step 1: Add <code>summary_count_field_name: <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-doc-count-field">&quot;doc_count&quot;</a></code> to the analysis config.</strong> This tells the ML engine that incoming data is pre-summarized. Without it, the engine treats each aggregated bucket as a single raw document and produces wrong anomaly scores.</p><p><strong>Step 2: Choose the bucket wrapper topology.</strong> For most functions (<code>count</code>, <code>mean</code>, <code>sum</code>, <code>max</code>, <code>min</code>, <code>varp</code>, <code>time_of_day</code>, <code>time_of_week</code>, and <code>categorization</code>) use a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-datehistogram-aggregation"><code>date_histogram</code></a> at the top level whose <code>fixed_interval</code> matches your <code>bucket_span</code> exactly to ensure accurate analysis. For <code>rare</code>, <code>freq_rare</code>, and <code>info_content</code>, use a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation">composite</a> at the top level with a <code>date_histogram</code> as one of its sources. This routes the datafeed to the composite extractor, which paginates through all field-value combinations rather than truncating to a top-N.</p><p><strong>Step 3: Add a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-max-aggregation"><code>max</code></a> aggregation on <code>@timestamp</code>.</strong> The ML engine needs this to determine the precise end time of each bucket. In the standard topology (Step 2, <code>date_histogram</code> outer), it goes inside the histogram’s <code>aggregations</code>. In the composite topology, it sits as a sibling of the <code>composite</code> aggregation.</p><p><strong>Step 4: Map each analysis field to a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-terms-aggregation"><code>terms</code></a> aggregation</strong>, named exactly after the corresponding field in the analysis config. One categorical field → a single nested <code>terms</code>. Two or more categorical fields → a <code>composite</code> aggregation nested inside the <code>date_histogram</code>, with one <code>terms</code> source per field.
For categorization jobs, use a <code>terms</code> aggregation on the <code>.keyword</code> subfield of the <code>categorization_field_name</code>. The naming rule is strict: The aggregation key must exactly match the field name in the analysis config; the ML engine uses the aggregation name, not the <code>field</code> parameter, to look up values. A mismatch produces silently wrong results; no error, just a job that appears to run while missing everything meaningful.</p><p><strong>Step 5: Map each detector’s metric field</strong> to its Elasticsearch aggregation equivalent:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">ML function</td><td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch aggregation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`mean` / `high_mean` / `low_mean`</td><td style="border: 1px solid #ccc; padding: 4px;">`avg`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`sum`</td><td style="border: 1px solid #ccc; padding: 4px;">`sum`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`max`</td><td style="border: 1px solid #ccc; padding: 4px;">`max`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`min`</td><td style="border: 1px solid #ccc; padding: 4px;">`min`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`varp`</td><td style="border: 1px solid #ccc; padding: 4px;">`extended_stats`</td>
                    </tr>
              </tbody>
            </table>
          <p>For <code>count</code>, <code>rare</code>, <code>freq_rare</code>, <code>info_content</code>, <code>time_of_day</code>, <code>time_of_week</code>, and categorization jobs, the ML engine works from <code>doc_count</code> alone; no metric aggregation is needed, and this step can be skipped.</p><h2><strong>Step-by-step example: Building an aggregation-based ML job in Kibana</strong></h2><p>Let’s build this end to end using Kibana’s sample web logs. If you haven’t loaded them yet, go to the Kibana home page and click <strong>Integrations → Sample data → Sample web logs → Add data</strong>. This gives us a data view called <code>Kibana Sample Data Logs</code> and an index called <code>kibana_sample_data_logs</code> with fields including <code>@timestamp</code>, <code>bytes</code> (response size), and <code>geo.dest</code> (destination country).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/75b692b9f38017cd7e4e221d2e89a14f75d3b9dc-1999x1905.png" style="width: 100%;" /><p>We’ll build a job that detects unusually large response sizes: <code>high_mean of bytes</code>, partitioned by destination country (<code>geo.dest</code>), with a 1-hour bucket span.</p><h3><strong>Creating the job with the Multi-Metric Wizard</strong></h3><p>This is how most jobs get created in practice. Navigate to <strong>Machine Learning → Anomaly Detection → Manage Jobs → Create job</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5bb2f2770a76fd535db22b97fc4f72471c43ca7-1999x587.png" style="width: 100%;" /><p>Select the “Kibana Sample Data Logs” data view, and set the time range to cover the full sample dataset. On the job type screen, choose <strong>Multi-metric</strong>.</p><p>In the Multi-Metric Wizard, configure the detector:</p><ul><li><strong>High mean</strong> of <code>bytes</code>.</li><li><strong>Split data by</strong> <code>geo.dest</code>.</li><li><strong>Bucket span:</strong> <code>1h</code>.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3055c91c881e4011521ba0c17cc36c6138595ee-1999x1540.png" style="width: 100%;" /><p>Give the job an ID, and leave everything else at its defaults, but <strong>don’t click Create yet</strong>. On this last configuration step, click on <strong>Preview JSON</strong> and look at the datafeed section. What you’ll see is a plain scroll-based datafeed with no aggregations, just an index pattern and a <code>match_all</code> query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32a1650958525b03a6052d480152933341acdd41-1999x1392.png" style="width: 100%;" /><p>This is the default every wizard produces. On a small cluster, it works fine. On a large cluster with CCS and a broad index pattern, this datafeed will scan every raw document on every run and never catch up with live data.</p><p>Instead of clicking <strong>Create</strong>, click <strong>Convert to advanced job</strong>. This keeps everything you just configured (the detector, the partition field, the bucket span) and drops you directly into the Advanced Wizard, where we can apply the five-step pattern.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/388d279820a639c08b119753f064b3a948ace8c6-1999x1591.png" style="width: 100%;" /><h3><strong>Analysis configuration</strong></h3><p>The conversion prefills the detector, partition field, and bucket span. 
The only change needed here is <strong>Step 1</strong> of the pattern: Open the <strong>Edit JSON</strong> view, and add <code>summary_count_field_name</code> to tell the ML engine that incoming data will be pre-summarized:</p><pre><code>{
  "bucket_span": "1h",
  "summary_count_field_name": "doc_count", // Step 1
  "detectors": [
    {
      "function": "high_mean",
      "field_name": "bytes",
      "partition_field_name": "geo.dest"
    }
  ],
  "influencers": ["geo.dest"]
}</code></pre><h3><strong>Datafeed configuration</strong></h3><p>Switch to the <strong>Datafeed</strong> tab. This is where Steps 2 through 5 of the pattern come together. Remove <code>scroll_size</code> if it’s present, and then enter the aggregations:</p><pre><code>{
  "buckets": {
    "date_histogram": {               // Step 2: bucket wrapper, interval = bucket_span
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "aggregations": {
      "@timestamp": {                 // Step 3: max timestamp anchor
        "max": { "field": "@timestamp" }
      },
      "geo.dest": {                   // Step 4: partition field, name must match exactly
        "terms": {
          "field": "geo.dest",
          "size": 1000
        },
        "aggregations": {
          "bytes": {                  // Step 5: metric field → avg aggregation
            "avg": { "field": "bytes" }
          }
        }
      }
    }
  }
}</code></pre><p>A few notes on this config:</p><ul><li><strong>Step 2:</strong> The <code>date_histogram</code> uses <code>fixed_interval</code>: <code>&quot;1h&quot;</code>, matching <code>bucket_span</code> exactly. A mismatch produces incorrect bucket timing.</li><li><strong>Step 3:</strong> The <code>max</code> aggregation on <code>@timestamp</code> must be named <code>@timestamp</code> and placed inside the histogram’s <code>aggregations</code>; without it, the ML node can’t determine the precise end of each bucket.</li><li><strong>Step 4:</strong> The <code>terms</code> aggregation for the partition field must be named <strong>exactly</strong> after the partition field: <code>geo.dest</code>, not <code>geo.dest_grouping</code> or any alias. The ML engine uses the aggregation name, not the <code>field</code> parameter, to identify which partition value each bucket belongs to. A mismatch silently drops the partition field from results entirely.</li><li><strong>Step 5:</strong> The metric aggregation key <code>bytes</code> matches <code>field_name</code> in the detector exactly. Any mismatch here produces silently wrong anomaly scores.</li></ul><h3><strong>Validate with the preview</strong></h3><p>Before we create the job, let’s use the <strong>Preview</strong> tab. This runs the aggregation against real data and shows exactly what the ML node will receive, a very useful sanity check before committing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/638eeb2ae5b854195dec0e468887300b9afd2c58-1999x1254.png" style="width: 100%;" /><p>Three things to verify in the preview output: <code>doc_count</code> should be present on every bucket and greater than 1. The <code>bytes</code> values should look like average response sizes: numbers in the hundreds to hundreds of thousands for web traffic. And each row should correspond to a distinct (<code>timestamp</code>, <code>geo.dest</code>) pair. If anything looks off, fix it in the JSON editor and rerun the preview.</p><h2><strong>Adding influencer fields</strong></h2><p>In the example above, <code>geo.dest</code> is the partition field. The ML model learns a separate baseline for each destination country, and anomalies are reported per country. But you might also want <code>machine.os</code> to appear as an <strong>influencer</strong> in anomaly results: When the detector fires, you want to see “this looks anomalous for <code>geo.dest: CN</code> and <code>machine.os: win</code> is a contributing factor.” <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/ml-ad-run-jobs#ml-ad-influencers">Influencers</a> don’t drive anomaly detection; they provide context for the anomalies that are found.</p><p>To support an influencer alongside a partition field, the analysis config gains an <code>influencers</code> array:</p><pre><code>{
  "bucket_span": "1h",
  "summary_count_field_name": "doc_count",
  "detectors": [
    {
      "function": "high_mean",
      "field_name": "bytes",
      "partition_field_name": "geo.dest"
    }
  ],
  "influencers": ["geo.dest", "machine.os"]
}</code></pre><p>And now the datafeed needs to aggregate on both fields simultaneously. One <code>terms</code> nested inside another <code>terms</code> won’t work; a nested <code>terms</code> surfaces only the top-N values of the inner field per outer bucket, so you’d silently lose combinations. Instead, use a <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation">composite aggregation</a> with one <code>terms</code> source per field, nested inside the <code>date_histogram</code>:</p><pre><code>{
  "buckets": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "aggregations": {
      "@timestamp": {
        "max": { "field": "@timestamp" }
      },
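      // composite nested inside the date_histogram: one bucket per (geo.dest, machine.os) pair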
      "group_by_fields": {
        "composite": {
          "size": 1000,
          "sources": [
            { "geo.dest":   { "terms": { "field": "geo.dest" } } },
            { "machine.os": { "terms": { "field": "machine.os" } } }
          ]
        },
        "aggregations": {
          "bytes": {
            "avg": { "field": "bytes" }
          }
        }
      }
    }
  }
}</code></pre><p><code>composite</code> generates one bucket per unique (<code>geo.dest</code>, <code>machine.os</code>) combination. The ML node sees every pair and can correctly attribute which operating system was contributing when a country’s response sizes spiked. Use the preview to confirm distinct pairs appear. If you only see a handful of rows where you’d expect many, the <code>size</code> parameter on the composite may need to be raised.</p><p>Note that this <code>composite</code> is nested <em>inside</em> the <code>date_histogram</code>, a different structure from the top-level composite used for <code>rare</code>, <code>freq_rare</code>, and <code>info_content</code> below. The distinction matters: Composite nested inside <code>date_histogram</code> routes the datafeed to the standard extractor; composite at the top level routes it to the composite extractor, which paginates through all value combinations across time.</p><h2><strong>Categorization</strong></h2><p>Categorization works with aggregated datafeeds: <code>summary_count_field_name</code> and <code>categorization_field_name</code> can coexist in the same job. The five-step pattern applies directly. Step 2 uses the standard <code>date_histogram</code> topology. Step 4 has one adjustment: Instead of a partition field, we aggregate the text field itself using a <code>terms</code> aggregation on its <code>.keyword</code> subfield, named to match <code>categorization_field_name</code> exactly. Step 5 is skipped. The <code>count</code> detector works from <code>doc_count</code> alone.<br/><strong>Analysis config:</strong></p><pre><code>{
  "bucket_span": "1h",
  "summary_count_field_name": "doc_count",
  "categorization_field_name": "message",
  "detectors": [
    {
      "function": "count",
      "by_field_name": "mlcategory"
    }
  ],
  "influencers": ["mlcategory"]
}</code></pre><p><strong>Datafeed aggregations:</strong></p><pre><code>{
  "buckets": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "aggregations": {
      "@timestamp": {
        "max": { "field": "@timestamp" }
      },
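      // key must match categorization_field_name ("message") exactly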
      "message": {
        "terms": {
          "field": "message.keyword",
          "size": 1000
        }
      }
    }
  }
}</code></pre><p>The datafeed sends one bucket per unique <code>message.keyword</code> value with a <code>doc_count</code> for each. The ML node receives those strings, runs categorization on them, assigning an <code>mlcategory</code> to each, and the <code>count</code> detector tracks how many documents fall into each category per bucket. The naming rule from Step 4 applies: The <code>terms</code> aggregation must be named <code>message</code>, matching <code>categorization_field_name</code> in the analysis config exactly.</p><p>One thing to watch: Keyword fields have a default <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/ignore-above"><code>ignore_above: 256</code></a> limit. Log messages longer than 256 characters won’t be indexed as <code>.keyword</code> and will be silently excluded from the aggregation. If your log messages are long, check the field mapping before using this approach. You may need to raise the limit in your index template.</p><h2><strong>The minimal pattern for <code>time_of_day</code> and <code>time_of_week</code></strong></h2><p><a href="https://www.elastic.co/docs/reference/machine-learning/ml-time-functions"><code>time_of_day</code> and <code>time_of_week</code></a> are the easiest functions to aggregate: They only need a timestamp and a document count. The C++ process extracts the time component from the bucket timestamp and builds a cyclical model of normal activity; <code>doc_count</code> tells it how many events fell in each bucket. No <code>terms</code> sources, no metric aggregation, no composite.<br/><strong>Analysis config:</strong></p><pre><code>{
  "bucket_span": "15m",
  "summary_count_field_name": "doc_count",
  "detectors": [
    { "function": "time_of_day" }
  ]
}</code></pre><p><strong>Datafeed aggregations:</strong></p><pre><code>{
  "time": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "15m"
    },
    "aggregations": {
      "@timestamp": { "max": { "field": "@timestamp" } }
    }
  }
}</code></pre><p>A plain <code>date_histogram</code> is enough; no composite needed. This makes <code>time_of_day</code> and <code>time_of_week</code> particularly CCS-friendly: one request per time chunk, minimal data over the wire. Use the same structure for <code>time_of_week</code>; only the function name changes.</p><p>If you want to add a <code>partition_field_name</code> (for example, to model time-of-day patterns per service), add a <code>terms</code> aggregation inside the histogram’s aggregations following the standard Step 4 pattern.</p><h2><strong>The composite pattern for <code>rare</code>, <code>freq_rare</code>, and <code>info_content</code></strong></h2><p><a href="https://www.elastic.co/docs/reference/machine-learning/ml-rare-functions"><code>rare</code>, <code>freq_rare</code></a>, and <a href="https://www.elastic.co/docs/reference/machine-learning/ml-info-functions"><code>info_content</code></a> all need the composite extractor, the one that paginates through all unique value combinations rather than truncating to top-N. The five-step pattern applies here with a different topology in Step 2: <code>composite</code> goes at the top level (not <code>date_histogram</code>), with <code>date_histogram</code> as a source inside it. Step 3 places the <code>max</code> <code>@timestamp</code> aggregation as a sibling of the <code>composite</code>, and Step 5 is skipped since all three functions work from <code>doc_count</code> alone.</p><p>The datafeed structure is the same for all three functions: a composite at the top level, a <code>date_histogram</code> as one of its sources, and one <code>terms</code> source per analysis field. The only thing that varies is which fields you include as <code>terms</code> sources: <code>rare</code> needs one source for <code>by_field_name</code>; <code>freq_rare</code> needs sources for both <code>by_field_name</code> and <code>over_field_name</code>; <code>info_content</code> needs a source for <code>field_name</code> plus any <code>by_field_name</code> or <code>over_field_name</code> fields. None of the three require a metric aggregation.</p><pre><code>{
  "buckets": {
    "composite": {
      "size": 10000,
      "sources": [
        { "@timestamp":   { "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" } } },
        { "by_field":     { "terms": { "field": "by_field" } } },
        { "over_field":   { "terms": { "field": "over_field" } } }
      ]
    },
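    // Step 3: max timestamp sits as a sibling of the composite, not inside it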
    "aggregations": {
      "@timestamp": { "max": { "field": "@timestamp" } }
    }
  }
}</code></pre><p>A few notes:</p><ul><li>The composite aggregation must be the top-level aggregation, not nested inside a <code>date_histogram</code>. This is what routes the datafeed to the composite extractor.</li><li>The <code>date_histogram</code> is a source inside the composite, not the outer wrapper. Its <code>fixed_interval</code> must divide evenly into <code>bucket_span</code>.</li><li>The <code>max</code> aggregation on <code>@timestamp</code> sits as a sibling of the <code>composite</code> (inside <code>aggregations</code>), not nested inside it.</li><li><code>composite.size</code> controls the page size per round trip. Setting it high (10000) reduces round trips, which matters with CCS latency. With three sources and high-cardinality fields, the total combination count can be large; the extractor paginates automatically.</li></ul><h2><strong>Why aggregation-based datafeeds outperform scroll at scale</strong></h2><p>The gap is structural, not incidental. A scroll-based datafeed reads raw documents one page at a time: Every 1,000 documents is one request, and each waits for the previous one to complete before issuing the next. The number of requests is therefore proportional to the total document count in the time range being backfilled. At 836,000 events per hour over 13 months, that&#x27;s roughly 7.9 billion events, or 7.9 million sequential round trips. Each round trip crosses the CCS boundary, waits for shard responses, and transfers matching documents in full. There’s no parallelism: The datafeed holds a <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-scroll">scroll context</a> open on the remote cluster and processes one page at a time.</p><p>An aggregation-based datafeed works differently. The data nodes summarize data locally, grouping by time bucket and categorical fields, and ship only the bucket results to the ML node. The number of requests is proportional to field cardinalities, not document count. In our example, two influencer fields with six unique combinations produce six result rows per time bucket; the datafeed pages through those in a handful of requests regardless of how many raw events fall in each bucket. Double the ingestion rate and the scroll request count doubles; the aggregation request count stays the same. This is why the gap widens at scale: The more data you have, the worse scroll looks by comparison, and the better aggregations look.</p><p>On live data, the picture is different because each real-time tick covers only one fresh bucket: Scroll issues however many pages fit in that bucket&#x27;s worth of data, while aggregations issue one request. The 20× figure for live data reflects that ratio at 836,000 events per hour with a 15-minute bucket span. The practical threshold where aggregations stop being optional is when <code>(ingestion rate × bucket span) &gt; scroll_size</code>; once a single bucket contains more than <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection/anomaly-detection-scale#set-scroll-size">one scroll page</a> of documents, the datafeed can&#x27;t keep pace with live data regardless of hardware. Below that threshold, scroll is fine and aggregations are a nice-to-have. Above it, aggregations are the only sustainable option.</p><p>Scroll-based datafeeds are the right default, and the wizards make the right call for most deployments. 
At scale (more shards, broader index patterns, CCS across tiers), switching to an aggregation-based datafeed is the natural next step: The data nodes summarize where the data lives, the ML node processes compact results, and the detections stay the same. The one cost to know up front is model state: Switching requires a new job, so the earlier you make the move, the less you give up.</p><p>If you hit a case not covered here, an aggregation type that doesn’t map cleanly or a composite that behaves unexpectedly, the <a href="https://discuss.elastic.co/">Elastic Discuss forums</a> are a good place to continue.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How big is too big? Elasticsearch sizing best practices]]></title>
            <description><![CDATA[There’s no hard size limit in Elasticsearch, but there are clear signals you've outgrown your setup. Learn how to size shards, manage node limits, choose storage by tier, and use AutoOps to catch problems before they happen.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-node-shard-size-best-practices</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-node-shard-size-best-practices</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2576886b8938e8265c6fafc9dcee6a4442f6a2fa-1408x768.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>There’s no hard size limit in Elasticsearch. Production clusters run at petabyte scale. But &quot;too big&quot; shows up in three ways: queries slow past your Service Level Agreement (SLA), a node hits its shard ceiling, or storage costs spiral because everything lives on the same expensive tier. This guide walks through each signal, the numbers that matter, and what to do about them.</p><h2>The three limits that actually matter</h2><p>There’s no hard storage ceiling at the node level. Elastic has demonstrated <a href="https://www.elastic.co/blog/querying-a-petabyte-of-cloud-storage-in-10-minutes">a single node querying 1 PiB of data</a>. In earlier versions, the per-shard overhead was high enough that the old rule of thumb was <a href="https://github.com/elastic/elasticsearch/pull/86223">no more than 20 shards per GB of heap</a>. Exceeding that limit meant garbage collection pressure, slow cluster state updates, and unstable nodes. Over 7.x and 8.x, a <a href="https://github.com/elastic/elasticsearch/issues/77466">series of optimizations</a> (more compact metadata serialization, efficient caching, off-heap data structures, and compressed cluster state) reduced per-shard overhead to the point where that rule was retired in 8.3, replaced by <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-3-0">field-density-based sizing</a>.</p><p>What defines your actual ceiling is workload type. Cold nodes at 20 TB with 31 GB of heap handle audit and retention workloads comfortably, because the access pattern is infrequent and aggregation-based. The same spec on a high-concurrency document search workload would struggle.</p><p>The three things to watch operationally:</p><ul><li><strong>Shard size:</strong> Individual shards that are too large slow queries and recovery.</li><li><strong>Shards per node:</strong> Every node has a ceiling, and <a href="https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management">index lifecycle management</a> (ILM) creates shards automatically whether you track them or not.</li><li><strong>Storage tier mismatch:</strong> Keeping data on expensive fast storage longer than needed.</li></ul><h2>Shard size</h2><p>Target between 10 GB and 50 GB per shard. The <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards">official guidance</a> sets the ILM rollover trigger at 50 GB per primary shard, with 10 GB as the suggested floor. Keep each shard under 200 million documents.</p><p>Shards that are too small create unnecessary overhead: more metadata for master nodes, more heap consumed, more network traffic. Shards that are too large slow query execution and make recovery after node failure slow, since Elasticsearch recovers one shard at a time.</p><p>One rule you can stop using: The &quot;20 shards per GB of heap&quot; guideline was deprecated in Elasticsearch 8.3. The replacement is simpler: Watch the 1,000 shard-per-node limit below, and keep shard sizes in the <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-size-recommendation">10–50 GB range (or 200M documents)</a>.</p><p><strong>How to monitor:</strong></p><pre><code># size per shard
GET _cat/shards?h=index,store&v</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4820733cc108ee5a78c7acafcbdf882f09ef661-1290x150.png" style="width: 100%;" /><h2>The shard budget</h2><p>Each non-frozen data node supports up to <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-count-per-node-recommendation">1,000 shards</a>. ILM creates shards on your behalf. If your policy rolls over daily with five primary shards and one replica, that’s 10 shards per day. One node fills up in about 100 days without you changing anything.</p><p>Options when you’re approaching the ceiling:</p><ul><li><strong>Wider rollover intervals:</strong> Roll weekly or monthly if shards are not reaching 50 GB before the time trigger fires.</li><li><strong>Fewer shards per index:</strong> For smaller daily volumes, one or two primary shards is often enough. See <a href="https://www.elastic.co/search-labs/blog/elasticsearch-increase-primary-shard-count">how to increase primary shard count</a> if you need to rebalance an existing index.</li><li><strong>More nodes:</strong> If volume genuinely requires daily rollover at full shard count, distribute across more nodes.</li></ul><p>For master nodes, plan for <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards#shard-count-recommendation">1 GB of heap per 3,000 indices</a>.</p><p><strong>How to monitor:</strong></p><pre><code># shards per node
GET _cat/allocation?h=node,shards&v</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a0060038cbe02e0dc35c2a24112c82f50e2da6ee-474x98.png" style="width: 100%;" /><h2>Storage</h2><p>The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">search speed guide</a> recommends allocating at least half of system memory to the OS filesystem cache and using directly attached storage. Remote storage generally performs worse. The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">indexing speed guide</a> echoes this, recommending RAID 0 across multiple local SSDs for write-heavy workloads.</p><p>For hot data: Don’t use network-attached storage (NAS). NAS adds latency on every read, and some NAS systems don’t correctly implement POSIX filesystem semantics, which can cause data corruption. Use local SSDs.</p><p>What works at each tier:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tier</td><td style="border: 1px solid #ccc; padding: 4px;">Storage</td><td style="border: 1px solid #ccc; padding: 4px;">Why</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hot</td><td style="border: 1px solid #ccc; padding: 4px;">Local SSD (DAS)</td><td style="border: 1px solid #ccc; padding: 4px;">High I/O, low latency, safe filesystem semantics</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Warm</td><td style="border: 1px solid #ccc; padding: 4px;">HDD acceptable</td><td style="border: 1px solid #ccc; padding: 4px;">Lower query pressure, no active indexing</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cold</td><td style="border: 1px solid #ccc; padding: 4px;">Searchable snapshots</td><td style="border: 1px solid #ccc; padding: 4px;">No replicas needed, ~50% storage savings</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Frozen</td><td style="border: 1px solid #ccc; padding: 4px;">Searchable snapshots</td><td style="border: 1px solid #ccc; padding: 4px;">Up to 20x reduction vs warm (Enterprise license)</td>
                    </tr>
              </tbody>
            </table>
          <p><strong>How to monitor:</strong></p><pre><code># disk usage per node and role
GET _cat/allocation?h=node,node.role,disk.used,disk.avail,disk.percent&v</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8fab2b7d44f4b5b11b89920e9b5ba49eeb48906c-1100x106.png" style="width: 100%;" /><p><em>On Elastic Cloud, skip this section. You select a <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/ec-change-hardware-profile">hardware profile</a> per tier, and Elastic handles storage provisioning.</em></p><h2>Data tiers and ILM</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f8e5dbbc0bbef7cb35abc0f0e82367024cb68730-1999x301.png" style="width: 100%;" /><p>Index lifecycle management moves data through <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html">tiers</a> automatically: hot, warm, cold, frozen, delete. The further data moves from hot, the cheaper the storage.</p><p>The cold and frozen tiers use <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/searchable-snapshots.html">searchable snapshots</a>:</p><ul><li><strong>Cold</strong> (fully mounted): Performance comparable to a regular index, no replicas required, roughly 50% cheaper than warm.</li><li><strong>Frozen</strong> (partially mounted): Up to 20x storage reduction compared to warm, slower queries, requires an Enterprise license.</li></ul><p>The cost difference at scale is significant. A <a href="https://www.elastic.co/search-labs/blog/searchable-snapshots-benchmark">Search Labs benchmark</a> measured 90 TB of data: all-hot cost $28,222 per month. A hot+frozen architecture brought that to $3,290 per month.</p><p>A typical ILM policy for time-series data with a 14-day hot window:</p><pre><code>{
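  // hot (rollover at 50 GB) → warm at 14d → cold at 30d → frozen at 90d → delete at 365d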
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "14d",
        "actions": {
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_repository"
          }
        }
      },
      "frozen": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_repository"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}</code></pre><p>Adjust <code>min_age</code> values to match your query patterns. Data queried weekly can move to cold sooner than data queried daily.</p><h2>AutoOps</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0dd56a74b148bcd95d1cb47eaaa45d99239455a9-1999x670.png" style="width: 100%;" /><p>As of February 2026, <a href="https://www.elastic.co/blog/autoops-free">AutoOps is free for all Elasticsearch users</a> regardless of license tier. On Elastic Cloud, it’s already enabled. For Elastic Self-Managed, Elastic Cloud Enterprise (ECE), and Elastic Cloud on Kubernetes (ECK) deployments, a lightweight Elastic Agent connects your cluster in about five minutes via <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops">Cloud Connect</a>. Internet connectivity is required; air-gapped deployments are not supported.</p><p>AutoOps samples hundreds of metrics every 10 seconds and surfaces issues with root cause analysis and remediation commands. It does not apply fixes automatically.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/16ec78498b40709c38eab2fa78128d11d5a59a6c-1340x546.png" style="width: 100%;" /><p>For large deployments it detects:</p><ul><li>Shards growing past the recommended size range.</li><li>Indices without ILM policies that have grown too large.</li><li>Shard imbalance across nodes.</li><li>Disk watermark violations before they cause allocation failures.</li><li>Indexing rejections and ingestion bottlenecks.</li><li>Slow queries and circuit breaker trips from large aggregations.</li></ul><p>It ships with 100+ customizable alerts and routes notifications to PagerDuty, Slack, Teams, or any webhook.</p><h2>Conclusion</h2><p>Watch shard size (10–50 GB), track your per-node shard budget as ILM rolls indices over, put hot data on local SSDs, and use cold and frozen tiers for data that is rarely queried.</p><p>On Elastic Cloud, hardware profiles and AutoOps handle most of this for you. For self-managed deployments, this is your checklist, and AutoOps via Cloud Connect is your early warning system. 
If you’re unsure how much data your nodes can handle for your specific workload, use <a href="https://github.com/elastic/rally">Rally</a> to benchmark against your own data before committing to a hardware spec.</p><h2>Sources</h2><ul><li><a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/size-shards">Size your shards</a></li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html">Data tiers</a></li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-shards-and-replicas-guide">Elasticsearch shards and replicas guide</a></li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-shards-reduce-change-count">How to reduce shard count</a></li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-increase-primary-shard-count">How to increase primary shard count</a></li><li><a href="https://www.elastic.co/search-labs/blog/optimize-elasticsearch-disk-space-and-usage">Optimize disk space and usage</a></li><li><a href="https://www.elastic.co/search-labs/blog/searchable-snapshots-benchmark">Searchable snapshots benchmark</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops">AutoOps documentation</a></li><li><a href="https://github.com/elastic/rally">Rally</a>: Elastic&#x27;s benchmarking framework for testing cluster sizing against your own data</li><li><a href="https://www.elastic.co/virtual-events/optimizing-storage-efficiency-elasticsearch">Optimizing storage efficiency in Elasticsearch</a> webinar by Christian Dahlqvist and Alan Woodward</li><li><a href="https://www.elastic.co/virtual-events/using-rally-to-get-elasticsearch-cluster-size-right">Using Rally to get your cluster size right</a>: webinar by Christian Dahlqvist and Daniel Mitterdorfer on benchmarking methodology</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[New Elasticsearch ES|QL plugin for IntelliJ IDEA]]></title>
            <description><![CDATA[Build and run Elasticsearch ES|QL queries in your IDE with the new plugin for IntelliJ IDEA.]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-plugin-intellij-idea</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-plugin-intellij-idea</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Laura Trotta]]></dc:creator>
            <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/91366de35a1b66860ce0d126c8a83e5b25b678f0-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language (ES|QL)</a> is Elasticsearch’s piped query language, designed for intuitive data querying and manipulation. Refer to our <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">getting started guide</a> to learn more.</p><p>The Elasticsearch Java client <a href="https://www.elastic.co/search-labs/blog/esql-queries-to-java-objects">supports ES|QL queries</a> through the DSL, but currently it treats queries as simple strings, with no dedicated helper; and while <a href="https://www.elastic.co/kibana">Kibana</a> offers an excellent <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql-kibana">UI to build ES|QL queries</a>, we’re aware that sometimes having everything needed to write applications in the integrated development environment (IDE) offers a better experience. So, until the Java client extends its type support to ES|QL, we wrote an IntelliJ IDEA plugin that autocompletes, syntax checks, shows documentation, and executes ES|QL queries.</p><p>The plugin currently supports Java, Kotlin, and plain text files, in case the Java Virtual Machine (JVM) isn’t your thing.</p><p>Check it out on the <a href="https://plugins.jetbrains.com/plugin/28898-elasticsearch-es-ql">JetBrains Marketplace page</a> and in the <a href="https://github.com/elastic/esql-idea-plugin">GitHub repository</a> for more information.</p><h2>Prerequisites</h2><ul><li>IDE: IntelliJ IDEA version &gt;= 253 (Community or Ultimate)</li></ul><h2>Usage</h2><p>Install the plugin in IntelliJ IDEA as you would any other plugin: either from the <a href="https://plugins.jetbrains.com/plugin/28898-elasticsearch-es-ql">JetBrains Marketplace</a> or by going to Settings -&gt; Plugins -&gt; Marketplace and searching for “esql”.</p><p>The following examples are written using Java, but Kotlin is also supported and the usage is pretty much the same.</p><p>Create a text block string, write “ES|QL” in a simple comment above it, and you’re done.</p><pre><code>// ES|QL
String query = """
""";</code></pre><p>If you see the Elastic logo appearing on the left:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/620d47b0c241271ab9bf727c37d3ab5f4137ca44-417x55.png" style="width: 100%;" /><p>then everything is working, and you’re ready to write your queries.</p><p>Why text blocks and not simple strings? The ES|QL syntax accepts quotes in various contexts, and escaping them would trigger other errors in the syntax checker, so we decided on text blocks to keep things simple.</p><p>It’s even simpler for txt files, as you can just add the comment and start writing the query right below:</p><pre><code>// ES|QL</code></pre><h3>Connecting to a server instance</h3><p>The plugin can be connected to an Elasticsearch server instance to fetch indices and field names, which will then be added to the autocompletion options. Look for the Elastic logo on the bottom left of of the screen (or wherever you keep your tools), and configure your connection to any server instance:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9f5109542c921ad523458b9156551bf1fca7d41a-418x269.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fe37a53d97f67cbdd8623793bce768e8e2f9ced-577x298.png" style="width: 100%;" /><h3>Autocomplete</h3><p>Start typing while in the text block to automatically open the autocompletion popup, which will return a list of acceptable commands/values to continue writing the query correctly. If you want to manually trigger autocompletion, <code>ctlr+space</code> is the IDE’s shortcut to use:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/987a927ab0e682bb1f9d07c934dd4254a769db20-584x252.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f8e379ea1cca345b38d0b7a0c2a873db6de624f-584x252.png" style="width: 100%;" /><h3>Syntax check</h3><p>The plugin will highlight errors in queries, explaining what to fix:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/15e91eab6c00fb49cfa5c1c6e270beeba534afc3-812x252.png" style="width: 100%;" /><h3>Documentation</h3><p>Hovering with the cursor over commands will display documentation describing what the command can be used for and its correct syntax:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c78295103c2188a3615edb2e001acbaf17523656-1072x627.png" style="width: 100%;" /><h3>Running the query</h3><p>Once connected to a server instance, you can run queries by clicking on the green button beside the Elastic icon: The results will be displayed in the tool window:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bff17bee7269dfb6f30140c9cd78dc203990f21c-1130x441.png" style="width: 100%;" /><p>Or if you’re writing an application, you can use the Java client like so:</p><pre><code>// ES|QL
String query = """
    FROM my-index
    | SORT year DESC
    | LIMIT 10
    """;

// the try-with-resources block closes the client automatically
try (ElasticsearchClient client = ElasticsearchClient.of(e -> e
        .host(serverUrl)
        .apiKey(apiKey))) {

    client.esql().query(QueryRequest.of(qr -> qr.query(query)));
}</code></pre><p>Check our previous <a href="https://www.elastic.co/search-labs/blog/esql-queries-to-java-objects">ES|QL Java Client article</a> for a complete example of mapping ES|QL results to Java objects.</p><h2>How does it work?</h2><p>There’s no AI involved; the plugin is based on the ES|QL <a href="https://www.antlr.org/">ANTLR</a> grammar for autocompletion and syntax checking, and it uses the <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch docs</a> to show documentation.</p><h2>Conclusion</h2><p>The plugin is still experimental, so feel free to report any bugs or feature requests on the <a href="https://github.com/elastic/esql-idea-plugin">GitHub repository</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unsupervised document clustering with Elasticsearch + Jina embeddings]]></title>
            <description><![CDATA[A practical, reproducible approach to unsupervised document clustering with Elasticsearch and Jina embeddings.]]></description>
            <link>https://www.elastic.co/search-labs/blog/unsupervised-document-clustering-elasticsearch-jina-embeddings</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/unsupervised-document-clustering-elasticsearch-jina-embeddings</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Matthew Adams]]></dc:creator>
            <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/662c00694c3e0c2fb2128098bdb6813df9e86a72-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Vector search starts with a query, but what if you don&#x27;t have one?</p><p>Organizations accumulate large document collections (support tickets, legal filings, news feeds, research papers) and need to understand what&#x27;s in them before they can ask the right questions. Without labels or training data, manually reviewing thousands of documents is impractical. Traditional search doesn&#x27;t help when you don&#x27;t know what to search for.</p><p>This post walks through an Elasticsearch-native approach to unsupervised document clustering and temporal story tracking that addresses this discovery problem. By the end, you can trace story arcs like this across days:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/100a60a7fb85da8ab3813fd071a82c93f2c3f318-1300x650.png" style="width: 100%;" /><p><strong>What you&#x27;ll discover:</strong></p><ul><li>Why <strong>clustering embeddings</strong> (not retrieval embeddings) matter when you want topic discovery without a query.</li><li>How density-probed centroid classification groups documents by topic using Elasticsearch k-nearest neighbor (kNN) and batched <code>msearch</code>.</li><li>How <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-significanttext-aggregation"><code>significant_text</code></a> can auto-label clusters so themes are readable without training a model.</li><li>How temporal story chains link daily clusters to show how themes evolve from day to day.</li></ul><p>The pipeline uses ~8,500 February 2025 articles from BBC News and The Guardian as a test corpus. News is convenient because it has clear temporal behavior, but the pattern applies anywhere document discovery matters: legal review, compliance monitoring, research synthesis, customer support triage.</p><p><strong>Stack:</strong></p><ul><li><a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><strong>Jina v5</strong></a> <strong>clustering embeddings:</strong> Task-specific Low-Rank Adaptation (LoRA) adapters for topic grouping. <a href="https://www.elastic.co/blog/elastic-jina-ai">Jina has joined Elastic</a>, and its models are available natively through <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>.</li><li><strong>Elasticsearch:</strong> Scalable <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN</a>, <code>significant_text</code> labeling, and vector storage.</li><li><a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction"><strong>DiskBBQ:</strong></a> A disk-based vector index format that combines <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/bbq">Better Binary Quantization (BBQ)</a> with hierarchical k-means partitioning for approximate nearest neighbors (ANN) acceleration. This index partitioning is internal to vector search and separate from the density-probed clustering algorithm used in this post. 
<code>bbq_disk</code> stores quantized vectors on disk and keeps only partition metadata in heap, dramatically reducing resource requirements compared to <code>bbq_hnsw</code> while maintaining high recall.</li><li><strong>Global clustering + daily temporal linking:</strong> Discovery and story evolution.</li></ul><p><strong>What you&#x27;ll need:</strong></p><ul><li>An Elasticsearch deployment (Elastic Cloud, Elasticsearch Serverless, or Elastic Self-Managed 8.18+/9.0+): <code>bbq_disk</code> requires 8.18 or later. The optional diversify retriever section requires 9.3+ or serverless.</li><li>A <a href="https://jina.ai/embeddings/">Jina API key</a>: The free tier includes 10 million tokens, which covers the core clustering pipeline (~4.25 million tokens). The optional retrieval-versus-clustering comparison uses a second embedding pass.</li><li>A <a href="https://bonobo.capi.gutools.co.uk/register/developer">Guardian API key</a> (free).</li></ul><h2>Setup</h2><p>Install required packages:</p><pre><code>pip install elasticsearch pandas numpy plotly umap-learn python-dotenv pydantic-settings datasets requests</code></pre><p>Optional (only if you run the scraping helpers from the companion repo):</p><pre><code>pip install beautifulsoup4</code></pre><p>Then configure API keys in a <code>.env</code> file at the project root:</p><pre><code>ELASTIC_CLOUD_ID=your-cloud-id        # or ELASTIC_HOST=https://...
ELASTIC_API_KEY=your-api-key
JINA_API_KEY=your-jina-key
GUARDIAN_API_KEY=your-guardian-key</code></pre><p>This notebook calls <code>load_dotenv(override=True)</code>, so local <code>.env</code> values take precedence.</p><pre><code>Connected to Elasticsearch</code></pre><h2>Part 1: Discovery clustering - Why clustering embeddings?</h2><p>Most vector search uses <strong>retrieval embeddings</strong> trained to match a <em>query</em> to relevant <em>documents</em>. That&#x27;s perfect for search, but not for discovery. When you want to find what topics exist in a corpus without any query at all, you need embeddings that group similar documents together.</p><p>Jina v5 solves this with <strong>task-specific Low-Rank Adaptation (LoRA) adapters</strong>. LoRA adds small low-rank updates to targeted internal layers while keeping most base-model weights frozen, so the model behavior shifts toward a specific task without full retraining. The same base model produces different embeddings depending on the <code>task</code> parameter:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Trained for</td><td style="border: 1px solid #ccc; padding: 4px;">Use case</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">retrieval.passage</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document matching</td><td style="border: 1px solid #ccc; padding: 4px;">Search, retrieval augmented generation (RAG)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">clustering</td><td style="border: 1px solid #ccc; padding: 4px;">Topic grouping (optimized for tight clusters)</td><td style="border: 1px solid #ccc; padding: 4px;">Discovery, categorization</td>
                    </tr>
              </tbody>
            </table>
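          <p>As a concrete (and purely illustrative) sketch of switching adapters, the snippet below embeds the same texts twice, changing only the <code>task</code> value. The endpoint URL and response shape are assumptions based on Jina&#x27;s public embeddings API, not code taken from this walkthrough:</p><pre><code># Sketch: embed the same documents with both adapters to compare geometries.
# Assumes JINA_API_KEY is set and `docs` is a list of article texts.
import os
import requests

def embed(texts, task):
    resp = requests.post(
        "https://api.jina.ai/v1/embeddings",  # assumed public Jina endpoint
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={"model": "jina-embeddings-v5-text-small", "input": texts, "task": task},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

clustering_vecs = embed(docs, "clustering")        # tight topical groups
retrieval_vecs = embed(docs, "retrieval.passage")  # query-document matching</code></pre>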
          <p>The clustering adapter is trained to make documents about the same topic <em>closer</em> in embedding space and documents about different topics <em>further apart</em>. The visual comparison below makes the difference concrete.</p><h3>Retrieval vs. clustering: A visual comparison</h3><p>To see the difference, the same 480-document sample is embedded with both task types. Clustering is performed in the original 1024-dimensional embedding space; Uniform Manifold Approximation and Projection (UMAP) is used only to project those embeddings into 2D for visualization. UMAP preserves local neighborhood structure, making it useful for comparing cluster separation.</p><p>In the projection below, look for tighter, more separated color groups in the clustering panel.</p><pre><code>    Full dataset: 8,495 articles
    Sources: guardian: 5749, bbc: 2746
    Date range: 2025-02-01 to 2025-02-28


    Sample: 480 docs across 8 sections
    section
    Film              60
    World news        60
    Australia news    60
    Opinion           60
    Football          60
    US news           60
    Sport             60
    Business          60


    Clustering embeddings: 480
    Retrieval embeddings:  480


    UMAP projection complete</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bcf7a744900560c1c6c63a2dc3af2f9bfd33e11-1100x500.png" style="width: 100%;" /><p><em>Retrieval embeddings (left) spread topics broadly; clustering embeddings (right) produce tighter, more separated groups from the same documents.</em></p><p>The clustering embeddings produce tighter, more visually distinct groups. The retrieval embeddings spread topics out more evenly, which is ideal for search (fine-grained similarity), but for discovery, tight topical clusters are what matters.</p><p>This is why <code>task=&quot;clustering&quot;</code> is used for the rest of this walkthrough.</p><h3>Loading the dataset</h3><p>The corpus combines two news sources for February 2025:</p><ul><li><strong>BBC News</strong> via the <a href="https://huggingface.co/datasets/RealTimeData/bbc_news_alltime">RealTimeData/bbc_news_alltime</a> HuggingFace dataset.</li><li><strong>The Guardian</strong> via the <a href="https://open-platform.theguardian.com/">Guardian Open Platform API</a>.</li></ul><p>Having multiple sources helps validate that clustering finds <em>topics</em> rather than <em>source-specific style</em>.</p><pre><code>    Total articles:  8,495
    
    Source breakdown:
    source
    guardian    5749
    bbc         2746
    
    Date range: 2025-02-01 → 2025-02-28
    Days covered: 28
    
    Sample article:
      Source:  guardian
      Title:   Carbon monoxide poisoning ruled out in death of Gene Hackman and wife, police sa
      Section: Film
      Text:    Authorities have ruled out that Gene Hackman and his wife, Betsy Arakawa, died from carbon monoxide poisoning earlier this week in their home in Santa Fe, New Mexico. The Santa Fe county sheriff, Adan...</code></pre><h3>Embedding with the clustering task</h3><p>The Jina v5 API is called with <code>task=&quot;clustering&quot;</code> for all documents. Embeddings are cached to disk, so subsequent runs skip the API entirely.</p><p>The API call is straightforward. The <code>task</code> parameter is the key difference from typical embedding usage:</p><pre><code>payload = {
    "model": "jina-embeddings-v5-text-small",
    "input": texts,
    "task": "clustering",  # ← This selects the clustering LoRA adapter
}</code></pre><p>The timing below reflects a cache hit. First run against the API takes longer, depending on corpus size.</p><pre><code>    Embeddings ready: 8,495 vectors of dimension 1024
    Time: 0.6s</code></pre><h3>Indexing into a single Elasticsearch index</h3><p>For discovery clustering, the full month goes into one index (<code>docs-clustering-all</code>). Daily partitioning comes later for temporal story linking.</p><p>The index mapping uses <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/bbq"><code>bbq_disk</code></a> for the vector field:</p><pre><code>{
  "embedding": {
    "type": "dense_vector",
    "dims": 1024,
    "index": true,
    "similarity": "cosine",
    "index_options": {
      "type": "bbq_disk"        // hierarchical k-means partitioning for ANN index lookup; separate from this post's clustering algorithm
    }
  }
}</code></pre><p>A 1024-dimensional float32 vector is 4 KB. <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction"><code>bbq_disk</code></a> uses hierarchical k-means to partition vectors into small clusters, binary-quantizes them, and stores the full-precision vectors on disk for rescoring. Only partition metadata lives in heap, so memory requirements stay low even for large corpora. For workloads that can afford more heap, <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/bbq"><code>bbq_hnsw</code></a> builds a Hierarchical Navigable Small World (HNSW) graph for faster lookups at higher resource cost.</p><p>The <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field type supports multiple quantization strategies: <code>bbq_disk</code> and <code>bbq_hnsw</code> are the best fits for high-dimensional embeddings like the 1024-dim vectors used here.</p><pre><code>    Indexed 8,495 documents into docs-clustering-all
    Time: 57.5s</code></pre><h3>Clustering: Density-probed centroid classification</h3><p>Traditional clustering algorithms like HDBSCAN assume you can hold the full N×d vector matrix in memory and run repeated full-pass updates. For 8,495 documents at 1024 dimensions, that&#x27;s manageable (~35 MB), but the approach doesn&#x27;t scale to millions of documents without additional infrastructure.</p><p>This algorithm is conceptually similar to KMeans++ initialization with Voronoi assignment and a noise floor, but it uses Elasticsearch <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search</a> as the compute primitive, keeping almost all work server-side:</p><ol><li><strong>Sample 5% of documents</strong> as density probes (random sample, minimum 50).</li><li><strong>Probe density via batched</strong> <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-msearch"><strong><code>msearch</code></strong></a> <strong>kNN</strong>. Each probe fires a kNN query and records the mean similarity of its neighbours. High mean similarity = dense region of embedding space. <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-msearch"><code>msearch</code></a> sends multiple search requests in a single HTTP call, which is critical here: Density probing generates hundreds of kNN queries, and batching them avoids per-request overhead.</li><li><strong>Select high-density seeds with diversification</strong>: Candidates above median density are sorted by density descending and greedily accepted only when their cosine similarity to every existing seed is below a separation threshold. This is the only client-side compute (~0.01s for 8k docs).</li><li><strong>Classify all docs against centroids via</strong> <strong><code>msearch</code></strong> <strong>kNN</strong>: Each seed acts as a centroid; a kNN search retrieves nearby documents above a similarity threshold. Each document is assigned to whichever centroid returned it with the highest score. Small clusters are dissolved to noise.</li></ol><p>Elasticsearch handles the heavy lifting: <code>msearch</code> for density probes, <code>msearch</code> for classification, and <code>significant_text</code> for labeling. For this corpus (8,495 docs), the 5% density-probe sample launches 425 kNN probe queries, which <code>msearch</code> batches into nine HTTP calls (at batch size 50), avoiding one-request-per-probe overhead. Combined with <code>bbq_disk</code> ANN lookup, this keeps the clustering stage fast and scalable. The kNN queries use a minimal <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/approximate-knn-search"><code>num_candidates</code></a> value for speed during the clustering pass; production search queries should use higher <code>num_candidates</code> values to improve recall at the cost of latency.</p><p>Clusters have natural sizes determined by the embedding space density around each centroid, not by a hard <code>k</code> cap. Dense topic regions produce larger clusters; niche topics produce smaller ones.</p><h4>Why not KMeans or HDBSCAN?</h4><p>KMeans assumes spherical clusters and requires the full N×d matrix in memory. For corpora that fit in memory, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html">HDBSCAN</a> is a strong alternative. 
It handles arbitrary cluster shapes and has well-understood density semantics.</p><p>The density-probed centroid approach targets a different niche: corpora where you want storage, retrieval, and clustering in one system, or where scale makes client-side matrix operations impractical. It uses Elasticsearch kNN as the compute primitive, handles arbitrary cluster sizes, and keeps nearly all computation server-side.</p><pre><code>    Clustered global index in 31.6s
      Total clusters: 82
      Total noise:    2420 (28.5%)
      Density probes: 425 kNN queries via 9 _msearch HTTP calls</code></pre><h4>Understanding the noise rate</h4><p>The ~28% noise rate is by design, not a failure mode. Documents that don&#x27;t fit any dense cluster at the configured <code>similarity_threshold</code> are left unassigned rather than forced into a poor match. This acts as a quality gate: Opinion columns, short articles, and one-off stories naturally resist clustering because they lack the thematic density that defines a coherent group.</p><p>The threshold is tunable: Lowering <code>similarity_threshold</code> produces more aggressive clustering (more documents assigned, but looser clusters), while raising it tightens clusters and increases the noise fraction. For this corpus of mixed news content, ~30% noise is a reasonable operating point. Production deployments should tune the threshold against domain-specific quality criteria.</p><h3>Automatic labels with significant_text</h3><p>Now each cluster needs a human-readable label. Elasticsearch&#x27;s <code>significant_text</code> aggregation finds terms that appear unusually often in a foreground set (the cluster) compared to a background set (the full corpus).</p><p>Under the hood, it uses a statistical heuristic (JLH score by default) that balances absolute and relative frequency shifts; no machine learning or large language model (LLM) calls are involved. A cluster about UK politics might surface terms like <code>starmer</code>, <code>labour</code>, <code>downing</code> because those terms are disproportionately common in that cluster compared to the overall news corpus.</p><p>For this global pass, labels are computed directly against <code>docs-clustering-all</code>, so both foreground and background are drawn from the full month. In Part 2, labeling uses the daily index pattern (<code>docs-clustering-*</code>), a wildcard that lets queries span all matching indices simultaneously, to give <code>significant_text</code> a broader background for better contrast.</p><p>A minimal query shape looks like this:</p><pre><code>{
  "size": 0,
  "query": { "term": { "cluster_id": "72" } },
  "aggs": {
    "label_terms": {
      "significant_text": {
        "field": "text",
        "size": 5,
        "filter_duplicate_text": true
      }
    }
  }
}</code></pre><p><code>significant_text</code> also serves as a quality gate: Clusters that produce no significant terms have no distinguishing vocabulary. They&#x27;re incoherent groupings that should be dissolved back to noise rather than given a misleading label.</p><p>A lightweight deterministic cleanup step removes noisy label terms (numeric tokens, generic words) and falls back to a representative headline when needed. This keeps labels Elasticsearch-native while improving readability.</p><pre><code>    Sample cluster labels:
      cluster   3  (200 docs)  arsenal | mikel | villa
      cluster   1  (198 docs)  volodymyr | ukrainian | kyiv
      cluster   0  (196 docs)  hostages | hamas | israeli
      cluster   4  (187 docs)  scrum | rugby | borthwick
      cluster  52  (185 docs)  fossil | renewable | renewables
      cluster  10  (156 docs)  labour | gwynne | mps
      cluster  40  (151 docs)  novel | novels | literary
      cluster  11  (149 docs)  mewis | sarina | wiegman
      cluster  44  (143 docs)  flooding | rainfall | rain
      cluster  13  (131 docs)  doge | musk | elon
      cluster  12  (128 docs)  murder | insp | knockholt
      cluster   5  (124 docs)  putin | backstop | starmer


    Reassigned 35 docs from incoherent clusters to noise
    Total docs: 8,495
    Clustered:  6,040 (71.1%)
    Noise:      2,455 (28.9%)</code></pre><h3>Visualizing the clusters</h3><p>The visualizations below show what the global clustering pass discovered: a date-wise breakdown of clustered versus noise documents, a UMAP projection of the full month, and a source-mix chart confirming that clusters reflect topics rather than sources.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/99099f5adaa945ae4097c50b0d7151c7dd28872e-1000x400.png" style="width: 100%;" /><p>Daily distribution of clustered versus noise documents across February 2025.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/477c6c7177942955a942f85f5c881da50e517915-1100x700.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4f4302556c8974c416842452cf33bca06e90b966-1100x700.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/327f96a79e382ef30614cb0570aa7fccd822b8f8-1100x700.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/acfb1524a10e24d6ff2412e7c3ec0f2b3ac75193-900x600.png" style="width: 100%;" /><p>Each colored island in the UMAP represents a cluster: a group of articles about the same topic discovered purely from embedding similarity. The gray noise points are articles that didn&#x27;t fit cleanly into any cluster (often short articles, opinion pieces, or one-off stories).</p><p>The source breakdown chart confirms that clusters contain articles from <strong>both</strong> BBC News and The Guardian. The clustering is finding <em>topics</em>, not <em>sources</em>: exactly what unsupervised discovery should produce.</p><h3>Exploring cluster breadth with the diversify retriever</h3><p>Plain kNN returns the documents most similar to a cluster&#x27;s centroid (the dense core). But real clusters cover subtopics. The <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/diversify-retriever"><strong>diversify retriever</strong></a> uses Maximal Marginal Relevance (MMR) to surface documents that are relevant to the centroid but also <em>different from each other</em>.</p><p>The key parameter is <strong>λ (lambda)</strong>:</p><ul><li>λ = 1.0 → pure relevance (same as plain kNN).</li><li>λ = 0.0 → pure diversity (maximally spread results).</li><li>λ = 0.5 → balanced: relevant to the topic, but covering different angles.</li></ul><p>A minimal retriever request shape looks like this:</p><pre><code>{
  "size": 8,
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "embedding",
      "lambda": 0.5,
      "query_vector": "<cluster-centroid-vector>",
      "retriever": {
        "knn": {
          "field": "embedding",
          "query_vector": "<cluster-centroid-vector>",
          "k": 50,
          "num_candidates": 100
        }
      }
    }
  }
}</code></pre><p>The <code>type</code>, <code>field</code>, and <code>query_vector</code> parameters are required at the diversify level: <code>field</code> tells MMR which dense_vector field to use for inter-result similarity, and <code>query_vector</code> provides the reference point for relevance scoring.</p><p>This lets you answer: &quot;What does this cluster actually cover?&quot; rather than just &quot;What&#x27;s at its center?&quot;</p><pre><code>    Exploring cluster 52 (185 docs)
    Label: fossil | renewable | renewables
    Centroid computed (dim=1024)


    ========================================================================
    Plain kNN (closest to centroid)
    ========================================================================
      1. [0.9738] Green campaigners fear ministers are poised to award billions of pounds in fresh subsidies to Drax power station, despite strong concerns...
      2. [0.9710] Thirteen more oil and gas licences could be cancelled as ministers decide new guidance for fossil fuel extraction after a landmark court...
      3. [0.9699] Experts have accused the fossil fuel industry of seeking special treatment after lobbyists argued greenhouse gas emissions from oilfields...
      4. [0.9681] Burning wood is a terrible way of producing electricity . Chopping down trees destroys habitats for wildlife, and growing new trees cannot...
      5. [0.9649] Keir Starmer will do huge damage to the global fight against climate change if he gives in to political pressure and allows the development...
      6. [0.9641] Labour will next week be confronted with stark policy choices that threaten to expose the fault lines between the Treasury and the...
      7. [0.9638] The Drax power station near Selby in north Yorkshire burns imported wood pellets  The government has agreed a new funding arrangement with...
      8. [0.9581] If you care about the world we are handing on to future generations, the news on Thursday morning was dramatic. This January was the...
    
    ========================================================================
    Diversify retriever (MMR, lambda=0.5)
    ========================================================================
      1. [0.9738] Green campaigners fear ministers are poised to award billions of pounds in fresh subsidies to Drax power station, despite strong concerns...
      2. [0.9434] Oil and gas interests have waged a coordinated campaign to kill pro-electrification policies that ban gas connections in new buildings ,...
      3. [0.9303] It was interesting to read that new licences for oil and gas production in the North Sea are being delayed by legal action ( Thirteen more...
      4. [0.9139] The US energy secretary, Chris Wright, has said he “would love to see Australia get in the game of supplying uranium and maybe going down...
      5. [0.9077] Rachel Reeves was facing criticism on Saturday night as it was confirmed that a report she cited as evidence that a third ­runway at...
      6. [0.8996] When Margaret Thatcher opened the Hadley Centre for Climate Change in 1990 journalists suggested she was attempting to appear to be doing...
      7. [0.8993] The vast majority of governments are likely to miss a looming deadline to file vital plans that will determine whether or not the world has...
      8. [0.8987] European imports of seaborne gas shipments fell by a fifth last year to their lowest level since the pandemic, according to a new report,...
    
    Overlap: 1/8 documents appear in both result sets
    
    Avg pairwise similarity (lower = more diverse):
      Plain kNN:          0.9057
      Diversify retriever: 0.6965</code></pre><p>The plain kNN results cluster around one angle of the topic: the documents most similar to the centroid and to each other. The diversify retriever surfaces different facets of the same cluster: subtopics, different sources, and varied perspectives.</p><p>The diversity metric confirms this quantitatively: the average pairwise similarity is lower for the diversify retriever results, meaning that the returned documents cover more ground.</p><p>This is useful for:</p><ul><li><strong>Understanding what a cluster actually covers</strong>, not just its center but also its edges.</li><li><strong>Generating summaries</strong>. Diverse representative docs give an LLM better material.</li><li><strong>Finding representative examples</strong> for human review or downstream labeling.</li><li><strong>Quality checks</strong>. If the diverse results look incoherent, the cluster may need splitting.</li></ul><h2>Part 2: Temporal story chains</h2><h3>Tracking stories across days</h3><p>Part 1 clustered the full month globally for topic discovery. For temporal flow, the same density-probed centroid classification runs independently per day on <strong>daily indices</strong>, and then clusters are linked across adjacent days. Note that the daily clusters are independent of the global clusters from Part 1; each day produces its own cluster assignments and labels tuned to that day&#x27;s content.</p><h4><strong>The linking approach: sample-and-query</strong></h4><p>For each cluster on day A:</p><ol><li>Sample a few representative documents.</li><li>Run kNN against day B&#x27;s index.</li><li>Count how many hits land in each day B cluster.</li><li>If the hit fraction exceeds a threshold (kNN fraction ≥ 0.4), record a link.</li></ol><p>This is fast (only a few docs per cluster are queried, not all of them) and uses Elasticsearch&#x27;s native kNN, no external tools needed.</p><pre><code>Preparing daily indices for temporal linkage...


Indexed 8,495 docs into 28 daily indices


Temporal links found: 808 in 145.4s

Strongest links:
  2025.02.01 'league | arsenal | premier' -> 2025.02.02 'league | season | striker'  (100%)
  2025.02.03 'league | striker | loan' -> 2025.02.04 'league | striker | season'  (100%)
  2025.02.03 'score | operator | gedling' -> 2025.02.04 'league | striker | season'  (100%)
  2025.02.12 'playoff | leg | bayern' -> 2025.02.13 'league | players | injury'  (100%)
  2025.02.14 'league | injury | football' -> 2025.02.15 'league | premier | football'  (100%)
  2025.02.18 'russia | ukraine | talks' -> 2025.02.19 'saudi | russia | arabia'  (100%)
  2025.02.18 'football | league | bayern' -> 2025.02.19 'league | manchester | players'  (100%)
  2025.02.21 'league | premier | manchester' -> 2025.02.22 'game | players | defeat'  (100%)
  2025.02.21 'rugby | calcutta | brilliant' -> 2025.02.22 'game | players | defeat'  (100%)
  2025.02.26 'metals | kyiv | ukrainian' -> 2025.02.27 'ukraine | russia | talks'  (100%)</code></pre><p>A kNN fraction of 100% means every sampled document from the source cluster landed in the same target cluster, the strongest possible cross-day link. Most links above are football-related, which makes sense: Premier League coverage runs daily with high topical consistency.</p><p>The <code>score | operator | gedling</code> → <code>league | striker | season</code> link is an example of a niche local football cluster (Gedling is a non-league club) being absorbed into the broader Premier League cluster on the next day, a natural effect of daily reclustering at different granularity.</p><h3>Building story chains</h3><p>A story chain is a sequence of linked clusters across consecutive days.</p><p>Individual pairwise links tell you that Monday&#x27;s &quot;UK politics&quot; cluster connects to Tuesday&#x27;s. Chains reveal the full arc: a story that starts Monday, evolves through the week, and fades by Friday.</p><p>Chains are built greedily from links with a kNN fraction ≥ 0.4, meaning that at least 40% of sampled documents from the source cluster landed in a single target cluster. Starting from the earliest cluster, the algorithm always follows the strongest outgoing link.<br/></p><pre><code>    Strong links (kNN fraction >= 0.4): 244
    Story chains spanning 3+ days: 18
      Chain 1: 'ukrainian | kyiv | eastern' (19 days: Feb 3 → Feb 21)
      Chain 2: 'playing | opposition' (19 days: Feb 10 → Feb 28)
      Chain 3: 'tadhg | maro | cadan' (10 days: Feb 1 → Feb 10)
      Chain 4: 'invade | china | putin' (8 days: Feb 21 → Feb 28)
      Chain 5: 'elected | labour | leader' (7 days: Feb 12 → Feb 18)
      Chain 6: 'film | swift | awards' (6 days: Feb 2 → Feb 7)
      Chain 7: 'amendment | termination | reporting' (6 days: Feb 12 → Feb 17)
      Chain 8: 'officers | scene | police' (5 days: Feb 1 → Feb 5)</code></pre><p>The longest chain tracks Ukraine–Russia coverage for 19 consecutive days, unsurprising given the sustained geopolitical intensity in February 2025. The second-longest follows Premier League football across 19 days of the month. Shorter chains capture award season (film/awards, six days), Six Nations rugby (10 days), and UK political leadership coverage (seven days). Each chain represents a story arc that the algorithm discovered purely from embedding similarity across daily indices.</p><h3>Sankey: Visualizing story flow</h3><p>A Sankey diagram is a flow visualization where link width represents connection strength. Here, each vertical band is a day, each node is a daily cluster (sized by document count), and each colored path traces one story chain across time. Link width encodes kNN overlap strength: Thicker links mean more sampled documents landed in the target cluster. Colors are consistent per chain, so a single color path from left to right reads as one story&#x27;s progression.</p><p>For example, the Ukraine–Russia chain (visible as one of the longer paths) flows continuously from early February through the third week, with consistently thick links indicating strong topical continuity across days.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/100a60a7fb85da8ab3813fd071a82c93f2c3f318-1300x650.png" style="width: 100%;" /><p><em>Temporal story chains flowing across February 2025. Each colored path is a story persisting across days; link width indicates kNN overlap strength.</em></p><h2>What this approach delivers</h2><p>This walkthrough covered a complete unsupervised document clustering pipeline built on Elasticsearch:</p><ol><li><strong>Clustering embeddings</strong>: Jina v5&#x27;s task-specific adapters produce embeddings optimized for topic grouping, not just query-document matching.</li><li><strong>Global discovery clustering</strong>: Clustering the full month in one index maximizes cross-day topical discovery.</li><li><strong>Density-probed centroid classification</strong>: Sample 5%, probe density via <code>msearch</code> kNN, select diverse high-density seeds, classify all docs against centroids. Elasticsearch handles the heavy compute; only seed selection runs client-side (~0.01s).</li><li><a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-significanttext-aggregation"><strong><code>significant_text</code></strong></a> <strong>labeling</strong>: Significance testing produces meaningful cluster labels without any ML model or manual annotation. 
Clusters that produce no significant terms are incoherent and get demoted to noise — a built-in quality gate.</li><li><strong>Temporal story linking</strong>: Daily indices and sample-and-query cross-index kNN trace how stories evolve over time.</li></ol><p><strong>Key takeaways:</strong></p><ul><li>The embedding task type matters: Clustering embeddings produce measurably tighter topical groups.</li><li>Elasticsearch can serve as both the storage layer <em>and</em> the clustering engine via <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search</a>.</li><li>Density-probed centroid classification keeps nearly all compute server-side and produces clusters with natural sizes determined by embedding space density.</li><li><code>significant_text</code> is fast, interpretable, and effective for both auto-labeling and quality gating.</li></ul><p><strong>When this approach is useful:</strong></p><ul><li>You have timestamped text and want topic discovery without labeled training data.</li><li>You want one stack for storage, vector search, labeling, and temporal linkage.</li></ul><p><strong>Extensions to explore:</strong></p><ul><li>Multi-period clustering (weekly, monthly rollups).</li><li>Real-time ingestion with incremental cluster assignment.</li><li>LLM-generated cluster summaries using the significant_text terms as seeds.</li><li>At larger scale, sampled KMeans centroids can serve as warm-start seeds for density-based clustering, reducing the probe phase cost.</li></ul><h2>Try it yourself</h2><p>Swap in your own timestamped document corpus; any collection of text with dates works with this pipeline. The full notebook and supporting code are available in the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/unsupervised-document-clustering-elasticsearch-jina-embeddings">companion repository</a>.</p><ul><li><a href="https://cloud.elastic.co/registration?onboarding_token=vectorsearch&amp;cta=cloud-registration&amp;tech=trial&amp;plcmt=article%20content&amp;pg=search-labs"><strong>Start a free Elastic Cloud trial</strong></a>: Spin up a managed cluster with <code>bbq_disk</code> support in minutes.</li><li><a href="https://www.elastic.co/elasticsearch/serverless"><strong>Try Elasticsearch Serverless</strong></a>: No cluster management, scales automatically, and supports everything in this walkthrough.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why ecommerce search needs governance]]></title>
            <description><![CDATA[Learn why ecommerce search falls short without governance and how a control layer ensures predictable and intent-driven results, thus improving retrieval.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ecommerce-search-governance-improve-retrieval</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Marquardt, Honza Král, Taylor Roy]]></dc:creator>
            <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3793259b01a5653a7520393a2f006610de0d21e7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Ecommerce retailers need to handle various fundamentally different query types within the same system. A shopper searching for “oranges” expects the fruit, not products containing the word “orange”, such as orange juice or orange marmalade, and not semantically related citrus products. A shopper searching for a “gift for grandpa who has a sweet tooth” needs semantic discovery, not literal keyword matching.</p><p><em>Lexical retrieval</em> (text matching), <em>semantic retrieval</em> (matching concepts), and <em>hybrid retrieval</em> (combining lexical and semantic signals) don’t solve these issues on their own. Lexical retrieval may return anything containing the word “oranges”, while pure semantic retrieval on a high-intent query like “oranges” may broaden toward related items, such as lemons or grapefruits. Hybrid retrieval blends these lexical and semantic signals, but it still doesn’t determine if this query should be treated as navigational, which constraints should be enforced, or which business policies should apply. The gap isn’t the retrieval technology itself; it’s the absence of a governance layer that understands what kind of query this is and what constraints should be enforced before retrieval begins.</p><p>In this blog, we explore ecommerce search governance, why it matters, and how a control layer ensures predictable, accurate retrieval.</p><h2>What governance means in ecommerce search</h2><p><em>Governance</em>, in this context, means introducing a decision layer between the user&#x27;s query and the retrieval engine. This layer performs the following functions:</p><ul><li>Classifies query intent: Is this navigation (&quot;oranges&quot;) or discovery (&quot;gift for grandpa&quot;)?</li><li>Applies business constraints: What category boundaries, eligibility rules, availability constraints, or merchandising policies apply?</li><li>Routes to the appropriate strategy: Should this use lexical retrieval, semantic retrieval, or hybrid?</li></ul><p>A governance layer determines which retrieval approach should be used for each query, which constraints must be enforced, and which business policies should apply before retrieval begins. It’s important not to conflate governance with hybrid retrieval: hybrid is one retrieval strategy that combines lexical and semantic signals, while governance is the upstream decision layer that determines whether lexical, semantic, or hybrid should be used.</p><h2>The status quo: The application layer &quot;spaghetti&quot; implementation</h2><p>Currently, many retailers attempt to solve this by adding logic directly into the application layer. 
This often results in <em>spaghetti code</em>: thousands of lines of hard-coded if-then statements, regexes, and complex search templates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f532b099ee103458e15563a711dae92952f8df02-1024x765.png" style="width: 100%;" /><p>This approach can produce the desired search results, as shown above; however, it creates significant operational friction:</p><ul><li><strong>Engineering dependency:</strong> Business users and merchandisers cannot modify search behavior without engineering tickets and long deployment cycles that often span several weeks.</li><li><strong>Fragmentation:</strong> Search logic becomes scattered between application code and search templates, and is difficult to explain or audit, making it risky to evolve.</li></ul><p>Even when teams recognize the need for routing, the debate often focuses on the wrong question: which retrieval method to pick.</p><h2>The false choice: Lexical vs. semantic vs. hybrid</h2><p>Search teams often frame the challenge as a retrieval strategy choice: lexical/BM25 versus semantic/vectors versus hybrid. That framing is understandable (retrieval methods matter), but it misses the most common failure mode in real deployments, which is that using a single retrieval approach for all queries will give suboptimal results.</p><p>Commerce search is a mix of fundamentally different intents:</p><ul><li><strong>Deterministic, high-intent navigation</strong> (&quot;oranges&quot;, “milk”, “chocolate without peanuts”, “cheap olive oil”).</li><li><strong>Exploratory discovery</strong> (&quot;jacket for hiking in the mountains&quot;, &quot;gift for a 12-year-old who likes robotics&quot;).</li><li><strong>Operational constraints</strong> (availability, size, price, color).</li><li><strong>Merchandising and campaigns</strong> (boost, bury, seasonal campaigns).</li></ul><p>When the system routes all of these through the same retrieval strategy, the results are often systematically wrong in predictable ways because the operating model lacks governance. When teams don&#x27;t recognize this as a governance gap, they respond with the only lever they have: more tuning.</p><h2>Why &quot;relevance tuning&quot; can become cyclical</h2><p>Without a routing layer, “relevance” often turns into a never-ending backlog:</p><ul><li>Why is this query showing accessories above the core product?</li><li>Why did this head query suddenly start surfacing related items?</li><li>Why did results change after we added synonyms, adjusted analyzers, or enabled hybrid?</li><li>Why does the business team need an engineering release to fix a single query?</li></ul><p>Teams respond with more tuning: more synonyms, more boosts, more reranking experiments, more exceptions in application code. This can work for a while, but it often produces brittle behavior because the system still lacks an explicit decision layer for determining query type and enforcing the right constraints before retrieval.</p><h2>The anatomy of ecommerce intent: Head and tail</h2><p>In this section, we use “head” and “tail” as practical shorthand for common navigational and exploratory query patterns in ecommerce. 
In the real world, many queries contain aspects of both:</p><h3>Head queries (deterministic intent)</h3><p>These are direct, navigational queries where the user knows exactly what they want:</p><ul><li>Single-item intent (&quot;oranges&quot;, &quot;milk&quot;, &quot;bread&quot;).</li><li>Exact brands or product families (&quot;iPhone 15 Pro&quot;, &quot;Diet Coke&quot;).</li><li>SKUs, model numbers, sizes (&quot;ABC123&quot;, &quot;air max 270&quot;).</li></ul><p>For these queries, lexical retrieval can handle token correspondence (matching words), but the business also expects to respect constraints, return predictable rankings, and have controllable outcomes. A merchandiser needs to ensure that a query resolves within the correct category boundaries, respects eligibility, and surfaces specific business priorities.</p><p>Governance is required to enforce the intended resolution. For example, “oranges” should map to the produce category, not to orange juice, orange marmalade, or orange soda.</p><h3>Tail queries (exploratory discovery)</h3><p>These are descriptive, intent-rich queries where shoppers are exploring:</p><ul><li>&quot;Gift for grandpa who has a sweet tooth&quot;</li><li>&quot;Jacket for hiking in the mountains&quot;</li><li>&quot;Shoes for standing all day&quot;</li></ul><p>Lexical retrieval often struggles here. Semantic retrieval excels because it can connect the query concept to the product, even when wording does not match. But semantic retrieval alone is rarely sufficient either. Real queries often require constraints to be enforced, regardless of which retrieval method is used.</p><h2>Constraints are orthogonal to retrieval method</h2><p>Applying constraints to semantic retrieval doesn’t mean <em>hybrid search</em>. These are orthogonal concepts. Constraints, such as filters and boosts in Elasticsearch, can be applied to any lexical, semantic, or hybrid retrieval. The challenge is deciding how the query should be interpreted, which constraints must be enforced, and which retrieval strategy should be used.</p><p>Below are some examples of queries that combine retrieval with hard constraints:</p><ul><li><strong>Oranges:</strong> Lexical retrieval for “oranges” plus a category constraint, such as “Fruits” or “Produce”, eliminating orange marmalade, orange juice, and orange soda.</li><li><strong>Fruits high in vitamin C under $4:</strong> Semantic retrieval for nutritional intent plus constraints limiting results to the fruit category and products under $4.</li><li><strong>Comfortable shoes for work:</strong> Semantic retrieval for contextual intent plus a category constraint limiting results to shoes.</li></ul><p>These queries can&#x27;t be handled by a single approach:</p><ul><li><strong>Pure lexical retrieval</strong> is often insufficient here because phrases like “high in vitamin C” or “comfortable” may not exist as clean, structured attributes. They may need to be inferred from product descriptions, reviews, or specifications.</li><li><strong>Pure semantic retrieval</strong> is also not always sufficient because, without explicit constraints, a query like “fruits high in vitamin C” might broaden toward vitamin supplements, fruit-flavored drinks, or high-vitamin vegetables outside the intended category and price range.</li></ul><p>A governance layer determines whether a query needs lexical retrieval, semantic understanding, constraint enforcement, or some combination of these. 
Without this layer, ecommerce teams may end up:</p><ul><li><strong>Over-constraining:</strong> Using lexical retrieval for semantic requests (for example, &quot;gift for grandpa&quot;).</li><li><strong>Under-constraining:</strong> Using semantic queries for high-intent head queries (for example, “oranges”).</li></ul><p>The governance challenge is to build a system that can make the right judgment call for each class of query.</p><h2>What happens without governance</h2><p>The most common failure mode is straightforward: Teams take the raw user query and pass it directly into a single retrieval strategy (lexical, semantic, or hybrid), without an intermediate governance layer.</p><h3>Lexical retrieval misses intended resolution</h3><p>When a user searches for “oranges”, a lexical retrieval strategy may return anything containing that token: orange juice, orange marmalade, or orange soda. The system matched the term correctly, but without governance it may not resolve the intended shopping context (the fruit).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/99abc7a46f9c56a26a68d0a089d7ab830b9b5568-1560x814.png" style="width: 100%;" /><h3>Semantic retrieval broadens beyond intended constraints</h3><p>When a user searches for “oranges”, a semantic system may retrieve conceptually related items across nearby product concepts. The system may correctly understand the broader domain (fruit or produce), but without explicit governance it can still over-broaden beyond the user’s intended constraint (specifically oranges).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c9de86363ecbed499fe48259f47b3c5b2c26bc43-1568x796.png" style="width: 100%;" /><h3>The gap is governance</h3><p>What’s required is an upstream decision layer that determines query intent and enforces the right constraints before retrieval begins. This fixes issues such as the following:</p><ul><li>Similar or related items appearing alongside what the user actually wanted.</li><li>Blurred category boundaries (&quot;beverages&quot; versus &quot;produce&quot;).</li><li>Inability to implement seasonal boosts or campaigns.</li><li>Unpredictable and unexplainable results.</li></ul><h2>Intent understanding and routing: The necessary control plane</h2><p>A governed search system introduces a lightweight control plane in front of retrieval (prior to executing a query in Elasticsearch). The control plane will be discussed in detail in parts <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-control-plane-architecture">3</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-percolator-search-governance">4</a> of this blog series; for now, we discuss only what it can do, not how it works:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/88c3d0f9731a128d73a765dcdffed897308110a6-2680x766.png" style="width: 100%;" /><p>A control plane can detect intent, apply business policies, and select the appropriate retrieval strategy, as follows:</p><p><strong>1. 
Detect intent signals</strong></p><ul><li>Is this query likely navigation versus discovery?</li><li>Is it a known head query (milk, bread, bananas)?</li><li>Is there a known product, brand, or category interpretation (for example, “oranges” should resolve to produce)?</li><li>Is the query an SKU-like pattern?</li><li>Does the query fall under an active campaign or seasonal policy (for example, during Christmas, boost turkey-related results)?</li><li>Does the query imply constraints (category, attributes, exclusions, price/size/color)?</li></ul><p><strong>2. Apply governance and business policies</strong></p><ul><li>Enforce deterministic constraints first (category/attribute/negation/availability).</li><li>Apply active merchandising policies (boost/bury/pin/override).</li><li>Resolve conflicts with precedence rules (for example, campaign overrides versus global policies).</li></ul><p><strong>3. Route to the appropriate retrieval strategy</strong></p><ul><li>Lexical (fast, deterministic) for navigational/high-intent head queries.</li><li>Semantic retrieval for true discovery queries.</li><li>Hybrid where combined lexical and semantic signals add value under explicit business constraints.</li></ul><p>In practice, the output of the control plane is not simply “use hybrid” or “use semantic.” It’s a governed retrieval plan: an interpretation of the shopper’s intent, the constraints and policies that should apply, and the retrieval strategy that should be executed. A few simple examples make this concrete:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Shopper query</td><td style="border: 1px solid #ccc; padding: 4px;">Governed interpretation</td><td style="border: 1px solid #ccc; padding: 4px;">Example retrieval plan</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">“chocolate without peanuts”</td><td style="border: 1px solid #ccc; padding: 4px;">Product-oriented query with a hard exclusion constraint</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical retrieval for chocolate plus an exclusion filter for products containing peanuts</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">“cheap olive oil”</td><td style="border: 1px solid #ccc; padding: 4px;">Product/category query with a price constraint</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical retrieval for olive oil plus a price filter capped at the retailer’s threshold for cheap</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">“fruit high in vitamin C under $4”</td><td style="border: 1px solid #ccc; padding: 4px;">Discovery query requiring semantic understanding plus hard constraints</td><td style="border: 1px solid #ccc; padding: 4px;">Semantic retrieval for nutritional intent, constrained to the fruit category and filtered to products priced under $4</td>
                    </tr>
              </tbody>
            </table>
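          <p>To make the idea of a governed retrieval plan concrete, here is a purely illustrative sketch of how the first plan above (&quot;chocolate without peanuts&quot;) could render as an Elasticsearch query. The index and field names (<code>products</code>, <code>title</code>, <code>allergens</code>) are hypothetical, not a prescribed schema:</p><pre><code># Illustrative only: lexical retrieval plus a hard exclusion constraint,
# expressed as a bool query via the Python client. Schema is hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")

resp = es.search(
    index="products",
    query={
        "bool": {
            "must": [{"match": {"title": "chocolate"}}],       # lexical retrieval
            "must_not": [{"term": {"allergens": "peanuts"}}],  # hard exclusion
        }
    },
)</code></pre><p>The point is not the query itself but where it comes from: the control plane, rather than hand-written application code, decides that this query gets a lexical match plus an exclusion filter.</p>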
          <p>A control plane selects the right policy and retrieval strategy for each query consistently, predictably, and at scale. This makes advanced retrieval methods more predictable in production because intent-aligned constraints are enforced first and routing decisions are explicit rather than implicit.</p><h2>How this relates to other approaches</h2><p>Some teams use improved embedding models to better capture product semantics, which can materially improve semantic retrieval quality. Others use reranking approaches, such as <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">Learning To Rank (LTR)</a>, to optimize result ordering based on engagement or business signals after retrieval. Both are valuable and often complementary. Better embeddings improve similarity matching. Reranking improves ordering among retrieved candidates.</p><p>Governance addresses a different layer of the problem: It sits upstream of retrieval. It decides which retrieval strategy to use (for example, lexical, semantic, or hybrid), what deterministic constraints are required, and which queries should combine multiple business policies.</p><h2>What a governed control plane enables</h2><p>Once a governance layer is in place, the operating model changes fundamentally. Revenue-critical queries become predictable. Business teams can update search behavior without waiting on engineering release cycles. And advanced retrieval methods, like semantic and hybrid, can be adopted incrementally, behind routing and guardrails, instead of as a global on/off switch.</p><p>The <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">next post</a> in this series explores what that operating model looks like in practice and why it may matter as much as the retrieval technology underneath it.</p><p>If a merchandiser has to open a Jira ticket and wait for a deploy to fix a revenue-critical query, the bottleneck isn&#x27;t the engine; it&#x27;s the operating model. Modern ecommerce search needs a way to translate business intent into controlled, auditable search behavior quickly and safely, while still using advanced retrieval where it adds measurable value.</p><h2>What’s next in this series</h2><p>The patterns explored in this series operate upstream of retrieval: translating business intent into the right query strategy before query generation begins. In the <a href="https://www.elastic.co/search-labs/blog/ecommerce-search-governance-zero-deploy">next post</a>, we shift from the technical problem to the operational one: what happens when business teams can change search behavior without an engineering deployment, and why governance makes that safe.</p><h2>Put governed ecommerce search into practice</h2><p>Engineering bottlenecks, brittle application-layer logic, and unpredictable search results are problems that Elastic Services can help you solve in enterprise ecommerce services engagements. The governed control plane architecture described in this series was built by Elastic Services Engineering.</p><p>If your team is spending engineering cycles translating merchandising requests into code changes, or if your search relevance backlog never seems to shrink, we can help you assess your current architecture and build a path to governed, business-editable search. 
Contact <a href="https://www.elastic.co/consulting">Elastic Services</a>.</p><h2>Join the discussion</h2><p>Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader <a href="https://discuss.elastic.co/">Elastic community conversation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to build agentic AI applications with Mastra and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build agentic AI applications using Mastra and Elasticsearch through a practical example.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-agentic-ai-applications-mastra-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-agentic-ai-applications-mastra-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Enrico Zimuel]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ab30baf2f908534840c5d71a46705773807baf54-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll walk through how to use the <a href="https://mastra.ai/">Mastra</a> TypeScript framework to build agentic applications that interact with <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>.</p><p>We recently contributed to the <a href="https://github.com/mastra-ai/mastra">mastra-ai/mastra</a> open source project by adding support for Elasticsearch as a vector database. With this new feature, you can use Elasticsearch natively in Mastra to store embeddings. In addition to vectors, Elasticsearch provides a suite of advanced features to address all your context engineering requirements (for example, <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">hybrid search and reranking</a>).</p><p>This article details the creation of an agent to implement a retrieval augmented generation (RAG) architecture using Elasticsearch. We’ll showcase a demo project where an agentic approach is used to interact with a corpus of sci-fi movie data stored within Elasticsearch. The project is available at <a href="https://github.com/elastic/mastra-elasticsearch-example">elastic/mastra-elasticsearch-example</a>.</p><h2>Mastra</h2><p>Mastra is a TypeScript framework for creating agentic AI applications.</p><p>A project structure in Mastra looks as follows:</p><pre><code>src/
├── mastra/
│   ├── agents/
│   │   └── weather-agent.ts
│   ├── tools/
│   │   └── weather-tool.ts
│   ├── workflows/
│   │   └── weather-workflow.ts
│   ├── scorers/
│   │   └── weather-scorer.ts
│   └── index.ts
├── .env.example
├── package.json
└── tsconfig.json</code></pre><p>In Mastra, you can build <a href="https://mastra.ai/docs/agents/overview">agents</a>, <a href="https://mastra.ai/docs/agents/using-tools">tools</a>, <a href="https://mastra.ai/docs/workflows/overview">workflows</a>, and <a href="https://mastra.ai/docs/evals/overview">scorers</a>.</p><p>An <strong>agent</strong> is a class that accepts a message as input and produces a response as output. An agent can use tools, large language models (LLMs), and a memory (figure 1).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f6aca2dcc7fcc45d25e06681649be1b2b7eb6781-706x721.png" style="width: 100%;" /><p>An agent&#x27;s <strong>tools</strong> allow it to interact with the &quot;external world,&quot; such as communicating with a web API or performing an internal operation, like querying Elasticsearch. The <strong>memory</strong> component is crucial for storing the history of conversations, including past inputs and outputs. This stored context enables the agent to provide more informed and relevant responses to future questions by using its past interactions.</p><p><strong>Workflows</strong> let you define complex sequences of tasks using clear, structured steps rather than relying on the reasoning of a single agent (figure 2). They give you full control over how tasks are broken down, how data moves between them, and what gets executed when. Workflows run using the built-in execution engine by default or can be deployed to <a href="https://mastra.ai/docs/deployment/workflow-runners">workflow runners</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0dce161cf7891207015dc87532b5b90df1822432-880x252.png" style="width: 100%;" /><p>In Mastra, you can also define scorers, which are automated tests that evaluate agent outputs using model-graded, rule-based, and statistical methods. Scorers return <em>scores</em>: numerical values (typically between 0 and 1) that quantify how well an output meets your evaluation criteria. These scores enable you to objectively track performance, compare different approaches, and identify areas for improvement in your AI systems. Scorers can be customized with your own prompts and scoring functions.</p><h2>Elasticsearch</h2><p>To run the demo project, we need a running Elasticsearch instance. You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and generate an API key to be used for configuring the Mastra integration.</p><p>The API key will be shown as output of the previous command and stored in a <strong>.env</strong> file in the elastic-start-local folder.</p><h2>Install and configure the demo</h2><p>We created an <a href="https://github.com/elastic/mastra-elasticsearch-example">elastic/mastra-elasticsearch-example</a> repository containing the source code of the demo project. The example reported in the repository illustrates how to create an agent in Mastra that implements a RAG architecture for retrieving documents from Elasticsearch.</p><p>We provided a dataset of sci-fi movies for the demo. 
We extracted 500 movies from the IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>.</p><p>The first step is to install the dependencies of the project with npm, using the following command:</p><pre><code>npm install</code></pre><p>Then we need to configure the <strong>.env</strong> file that will contain the settings. We can generate this file by copying the structure from the <strong>.env.example</strong> file, using the following command:</p><pre><code>cp .env.example .env</code></pre><p>Now we can edit the <strong>.env</strong> file, adding the missing information:</p><pre><code>OPENAI_API_KEY=
ELASTICSEARCH_URL=
ELASTICSEARCH_API_KEY=
ELASTICSEARCH_INDEX_NAME=scifi-movies</code></pre><p>The name of the Elasticsearch index defaults to <strong><code>scifi-movies</code></strong>; you can change it through the <code>ELASTICSEARCH_INDEX_NAME</code> env variable.</p><p>We used OpenAI as the embedding service, which means that you need to provide an API key for OpenAI in the <code>OPENAI_API_KEY</code> env variable.</p><p>The embedding model used in the example is <a href="https://developers.openai.com/api/docs/models/text-embedding-3-small">openai/text-embedding-3-small</a>, with an embedding dimension of 1536.</p><p>To generate the final answer, we used the <a href="https://developers.openai.com/api/docs/models/gpt-5-nano">openai/gpt-5-nano</a> model to reduce costs.</p><p>The RAG architecture allows you to use a less powerful (and typically less expensive) final LLM because the heavy lifting of grounding the answer is done by the retrieval component (Elasticsearch in this case).</p><p>The smaller LLM is only responsible for two main tasks:</p><ul><li><strong>Rephrasing/embedding the query:</strong> Converting the user&#x27;s natural language question into a vector embedding for semantic search.</li><li><strong>Synthesizing the answer:</strong> Taking the highly relevant, retrieved context chunks (documents/movies) and synthesizing them into a coherent, final, human-readable answer, following the provided prompt instructions.</li></ul><p>Since the RAG process <strong>provides the exact factual context</strong> needed for the answer, the final LLM doesn&#x27;t need to be massive or highly complex, nor does it need to possess all the required knowledge within its own parameters (which is where large, expensive models excel). It essentially acts as a sophisticated text summarizer and formatter for the context provided by Elasticsearch, rather than as a full-fledged knowledge base itself. This enables the use of models like <code>gpt-5-nano</code> for cost and latency optimization.</p><p>After configuring the .env file, you can ingest the movies into Elasticsearch using the following command:</p><pre><code>npx tsx src/utility/store.ts</code></pre><p>You should see output like the following:</p><pre><code>🚀 Starting ingestion of 500 movies from 500_scifi_movies.jsonl...
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 1/500 (0%) | ok:1 | fail:0 | chunks:1 | eta:19m 33s | current:Capricorn One
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 2/500 (0%) | ok:2 | fail:0 | chunks:2 | eta:10m 32s | current:Doghouse
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 3/500 (1%) | ok:3 | fail:0 | chunks:3 | eta:7m 33s | current:Dinocroc
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 4/500 (1%) | ok:4 | fail:0 | chunks:7 | eta:6m 10s | current:Back to the Future           
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 5/500 (1%) | ok:5 | fail:0 | chunks:9 | eta:5m 14s | current:The Projected Man            
Ingesting ░░░░░░░░░░░░░░░░░░░░░░░░ 6/500 (1%) | ok:6 | fail:0 | chunks:11 | eta:4m 41s | current:I, Robot
...
✅ Ingestion complete in 1m 46s. Success: 500, Failed: 0, Chunks: 693.</code></pre><p>The mapping of the scifi-movies index contains the following fields:</p><ul><li><strong>embedding</strong>, a <code>dense_vector</code> with 1536 dimensions and cosine similarity.</li><li><strong>description</strong>, text containing the description of the movie.</li><li><strong>director</strong>, text containing the name of the director.</li><li><strong>title</strong>, text containing the title of the movie.</li></ul><p>We generated the embeddings from the concatenation of the title and the description. Combining the two fields ensures that the resulting embedding vector captures both the specific, unique identity (title) and the rich, descriptive context (description) of the movie, giving the embedding model a better single representation of the document&#x27;s content and leading to more accurate semantic search results.</p><h2>Run the demo</h2><p>You can run the demo with the following command:</p><pre><code>npm run dev</code></pre><p>This command starts a web application at <strong>localhost:4111</strong>, where you can access Mastra Studio (figure 3).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1567e309df21a12bcf1dfef4429f82342549956c-1705x1079.png" style="width: 100%;" /><p><a href="https://mastra.ai/docs/getting-started/studio">Mastra Studio</a> offers an interactive UI for building and testing your agents, along with a REST API that exposes your Mastra application as a local service. This lets you start building right away without worrying about integration.</p><p>We provided an <strong>Elasticsearch Agent</strong> that uses Mastra&#x27;s <a href="https://mastra.ai/reference/tools/vector-query-tool">createVectorQueryTool</a> to execute semantic search against Elasticsearch. This agent uses the RAG approach to search for relevant documents (that is, movies) to answer the user’s question.</p><p>This agent uses the following prompt:</p><pre><code>You are a helpful assistant that answers questions based on the provided context.
Follow these steps for each response:

1. First, carefully analyze the retrieved context chunks and identify key information.
2. Break down your thinking process about how the retrieved information relates to the query.
3. Draw conclusions based only on the evidence in the retrieved context.
4. If the retrieved chunks don't contain enough information, explicitly state what's missing.

Format your response as:
THOUGHT PROCESS:
- Step 1: [Initial analysis of retrieved chunks]
- Step 2: [Reasoning based on chunks]

FINAL ANSWER:
[Your concise answer based on the retrieved context]

Important: When asked to answer a question, please base your answer only on the context provided in the tool. 
If the context doesn't contain enough information to fully answer the question, please state that explicitly and stop.
Do not add more information than what is present in the retrieved chunks.
Remember: Explain how you're using the retrieved information to reach your conclusions.</code></pre><p>If you click on the <code>Mastra Studio &gt; Agents</code> menu and select <strong>Elasticsearch Agent</strong>, you can test the agent using a chat system. For instance, you can ask for information about sci-fi movies with a question such as:</p><p><em>Find 5 movies or TV series about UFOs</em>.</p><p>You’ll notice that the agent will execute the vectorQueryTool. You can click on the invoked tool to inspect its input and output. At the end of execution, the LLM will reply to your question, using the context retrieved from the scifi-movies index in Elasticsearch (figure 4).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d9998d4f687984de98845dae52d1288166abf448-1344x1071.png" style="width: 100%;" /><p>Mastra executes the following steps internally:</p><ol><li><strong>Vector conversion:</strong> The user&#x27;s question, <em>Find 5 movies or TV series about UFOs,</em> is converted into a vector embedding using OpenAI&#x27;s <code>openai/text-embedding-3-small</code> model.</li><li><strong>Vector search:</strong> This embedding is then used to query Elasticsearch via a vector search.</li><li><strong>Result retrieval:</strong> Elasticsearch returns a set of 10 movies highly relevant to the query (that is, those with vectors closest to the user&#x27;s query vector).</li><li><strong>Answer generation:</strong> The retrieved movies and the original user question are sent to the LLM, specifically <code>openai/gpt-5-nano</code>. The LLM processes this information and generates a final answer, ensuring that the user&#x27;s request for five results is met.</li></ol><h2>The Elasticsearch Agent</h2><p>Here is the source code of the Elasticsearch Agent.</p><pre><code>import { Agent } from "@mastra/core/agent";
import { ElasticSearchVector } from '@mastra/elasticsearch';
import { createVectorQueryTool } from '@mastra/rag';
import { ModelRouterEmbeddingModel } from "@mastra/core/llm";
import { Memory } from "@mastra/memory";

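// Connection settings come from the .env file configured earlier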
const es_url = process.env.ELASTICSEARCH_URL;
const es_apikey = process.env.ELASTICSEARCH_API_KEY;
const es_index_name = process.env.ELASTICSEARCH_INDEX_NAME;
const prompt = 'paste here the full prompt shown in the previous section';

const esVector = new ElasticSearchVector({
  id: 'elasticsearch-vector',
  url: es_url,
  auth: {
    apiKey: es_apikey
  }
});

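// Retrieval tool for the RAG flow: it embeds the user's question and runs a vector search on the index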
const vectorQueryTool = createVectorQueryTool({
  vectorStore: esVector,
  indexName: es_index_name,
  model: new ModelRouterEmbeddingModel("openai/text-embedding-3-small")
});

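// The agent wires together the prompt, the LLM, the retrieval tool, and conversation memory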
export const elasticsearchAgent = new Agent({
  id: "elasticsearch-agent",
  name: "Elasticsearch Agent",
  instructions: prompt,
  model: 'openai/gpt-5-nano',
  tools: { vectorQueryTool },
  memory: new Memory(),
});</code></pre><p>The <strong>vectorQueryTool</strong> is the tool that’s invoked to implement the retrieval part of the RAG example. It uses the <a href="https://mastra.ai/reference/vectors/elasticsearch">ElasticSearchVector</a> implementation that Elastic contributed to Mastra.</p><p>The agent is an instance of the Agent class that consumes the vectorQueryTool, the prompt, and a memory. As you can see, the code needed to connect Elasticsearch to an agent is very minimal.</p><h2>Conclusion</h2><p>This article demonstrated the simplicity and power of integrating Elasticsearch with the Mastra framework to build sophisticated agentic AI applications. Specifically, we walked through creating a RAG agent capable of performing semantic search over a corpus of sci-fi movie data indexed in Elasticsearch.</p><p>A key takeaway is the direct contribution by Elastic to the Mastra open source project, providing native support for Elasticsearch as a vector store. This integration significantly lowers the barrier to entry, as evidenced by the <strong>Elasticsearch Agent</strong> source code. Using the <code>ElasticSearchVector</code> and <code>createVectorQueryTool</code>, the complete setup for connecting Elasticsearch to your agent requires only a few lines of configuration code.</p><p>Elasticsearch provides several advanced features to enhance result relevance. For example, <a href="https://www.elastic.co/elasticsearch/hybrid-search">hybrid search</a> significantly boosts accuracy by combining lexical search with vector search. Another interesting feature is reranking using the latest <a href="https://www.elastic.co/search-labs/tutorials/jina-tutorial/jina-reranker-v3">Jina models</a>, which can be applied at the end of hybrid search. To learn more about these techniques, consult the following articles from Elasticsearch Labs:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch">Elasticsearch hybrid search</a> by Valentin Crettaz</li><li><a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">An introduction to Jina models, their functionality, and uses in Elasticsearch</a> by Scott Martens</li></ul><p>We also encourage you to explore the provided example and begin building your own data-powered agents with Mastra and Elasticsearch. For more information about Mastra, you can have a look at the official documentation <a href="https://mastra.ai/docs">here</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Migrating your OpenShift Elasticsearch 6.x cluster to Elastic Cloud on Kubernetes (ECK)]]></title>
            <description><![CDATA[A step-by-step guide on how to migrate from the legacy OpenShift Elasticsearch Operator (ES 6.x) to the modern Elastic Cloud on Kubernetes (ECK).]]></description>
            <link>https://www.elastic.co/search-labs/blog/openshift-elastic-cloud-kubernetes-migration</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/openshift-elastic-cloud-kubernetes-migration</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Omer Kushmaro, Jamie Parker]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2f35bf231e4c8ad4f9a79d4ac4d188bfc2e385a3-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Red Hat&#x27;s OpenShift platform has long been a trusted foundation for enterprise Kubernetes workloads, and for years, its built-in Elasticsearch Operator has made log management simple. But all good things evolve: The OpenShift Elasticsearch Operator reached the end of its supported lifecycle with Red Hat OpenShift Container Platform (OCP) 4.13, and the Elasticsearch 6.x clusters it managed are long out of support. Working closely with Red Hat, we&#x27;ve put together this step-by-step guide to help you move from that legacy setup to <strong>Elastic Cloud on Kubernetes</strong> (ECK), the modern, full-featured operator maintained directly by Elastic. The migration path we&#x27;ve designed respects the OpenShift-native tooling you already rely on, keeps disruption to a minimum, and leaves you on a solid foundation for future upgrades to 8.x, 9.x, and beyond.</p><h2>Why this matters</h2><ul><li><strong>Security and support:</strong> The last Elasticsearch 6.x patch release was on January 13, 2022. ECK lets you upgrade at your own pace, with a supported operator from the creators of Elasticsearch. Remaining on an old Elasticsearch version exposes you to supportability risks and well-known security issues.</li><li><strong>Features you’ve been missing:</strong> Autoscaling, data tiers, machine learning (ML) jobs, searchable snapshots. None of these are available in the legacy operator.</li><li><strong>Future-proof operations:</strong> ECK ships <em>day-and-date</em> with every new Elastic release, so you’re never stuck waiting again.</li></ul><h2>High-level plan</h2>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Phase</td><td style="border: 1px solid #ccc; padding: 4px;">Goal</td><td style="border: 1px solid #ccc; padding: 4px;">Outcome</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">0</td><td style="border: 1px solid #ccc; padding: 4px;">Snapshot and sanity-check your 6.x cluster.</td><td style="border: 1px solid #ccc; padding: 4px;">You have a backup in case you need it.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1</td><td style="border: 1px solid #ccc; padding: 4px;">Install ECK 2.16.1 alongside the Red Hat operator.</td><td style="border: 1px solid #ccc; padding: 4px;">Both operators coexist safely.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">2</td><td style="border: 1px solid #ccc; padding: 4px;">Spin up a fresh, production-ready ES 6.8.23 cluster managed by ECK.</td><td style="border: 1px solid #ccc; padding: 4px;">Empty ECK-managed cluster.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3</td><td style="border: 1px solid #ccc; padding: 4px;">Restore data into that new cluster.</td><td style="border: 1px solid #ccc; padding: 4px;">All indices now live under ECK.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4</td><td style="border: 1px solid #ccc; padding: 4px;">Point openshift-logging to the new service and retire the old operator.</td><td style="border: 1px solid #ccc; padding: 4px;">Single source of truth.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5</td><td style="border: 1px solid #ccc; padding: 4px;">Rolling-upgrade Elasticsearch to 7.17.28.</td><td style="border: 1px solid #ccc; padding: 4px;">Latest long-term 7.x.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">6</td><td style="border: 1px solid #ccc; padding: 4px;">Upgrade ECK to 3.3.1.</td><td style="border: 1px solid #ccc; padding: 4px;">Operator on current release.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">7</td><td style="border: 1px solid #ccc; padding: 4px;">Schedule your own jump to 8.x & 9.x.</td><td style="border: 1px solid #ccc; padding: 4px;">You control the timing.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">8</td><td style="border: 1px solid #ccc; padding: 4px;">Clean up</td><td style="border: 1px solid #ccc; padding: 4px;">Removing old operator</td>
                    </tr>
              </tbody>
            </table>
          <p>Feel free to bookmark this list. Each milestone is small, reversible, and validated before you move on.</p><h3>0. Preflight checks</h3><p><strong>A. Health first: </strong>Run <code>/_cat/health</code> and make sure you’re green.</p><p><strong>B. Disk watermarks:</strong> Keep at least 20% free before starting a migration.</p><p><strong>C. Final snapshot:</strong> S3, GCS, or NFS: it doesn’t matter, as long as you can mount the same repository in the new cluster.</p><ol><li>If you don’t have object storage handy in your environment, you can use <a href="https://access.redhat.com/solutions/6408341">this solution post by Red Hat</a> to snapshot your data to local storage on the OpenShift cluster.</li></ol><p><strong>D. Review the documentation:</strong> Elastic provides thorough documentation for <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-data-between-elasticsearch-clusters-with-minimal-downtime">migrating data between Elasticsearch clusters</a>.</p><h3>1. Installing ECK 2.16.1 (your “bridge” operator)</h3><p>ECK 2.16.1 is the last release that still accepts <code>spec.version: 6.8.x</code>, which makes it the ideal bridge between past and future Elasticsearch versions.</p><pre><code>helm repo add elastic https://helm.elastic.co
helm repo update
oc create namespace elastic-system 

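# ECK 2.16.1 is the last operator release that accepts spec.version 6.8.x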
helm install elastic-operator elastic/eck-operator --version=2.16.1 -n elastic-system --create-namespace</code></pre><p>You can keep the Red Hat operator in place; the two watch <em>different</em> Custom Resource Definitions (CRDs), so they won’t step on each other’s toes.</p><p>Keep in mind that with OpenShift, ECK might display some Transport Layer Security (TLS) errors in its logs as OpenShift tries to connect to its healthcheck webhook endpoint via HTTP, but ECK allows TLS communication only. This is a well-known issue and shouldn’t pose a problem.</p><p>You can refer to the <a href="https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/install-using-helm-chart">Elastic documentation</a> if you need to make a local namespaced installation.</p><h3>2. Launching a 6.x cluster under ECK</h3><p>Below is a starter Kubernetes manifest that balances resiliency with cost (three hot-tier nodes that combine the master, data, and ingest roles). Swap storage class names, resources, and snapshot credentials to match your environment.</p><p><strong>Note:</strong> The syntax used below is a bit different from what it would be for newer Elasticsearch versions on ECK.</p><pre><code>apiVersion: elasticsearch.k8s.elastic.co/v1
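# Note: node.master/node.data/node.ingest below is the 6.x role syntax; newer versions use node.roles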
kind: Elasticsearch
metadata:
  name: es-logs
  namespace: elastic # Create this namespace prior, or use another namespace
spec:
  version: 6.8.23
  nodeSets:
    - name: hot
      count: 3
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            storageClassName: gp3-csi   # adjust if needed
            resources:
              requests:
                storage: 100Gi # adjust to your expected data volume
      config:
        node.master: true
        node.data: true
        node.ingest: true
        node.attr.data: hot
        cluster.routing.allocation.awareness.attributes: data
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 16Gi
                  cpu: 2
                limits:
                  memory: 16Gi
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic
spec:
  version: 6.8.23
  count: 1
  elasticsearchRef:
    name: es-logs
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: 1Gi
              cpu: 0.5
            limits:
              memory: 4Gi</code></pre><p>Deploy it, watch pods come up, and you’re ready for data.</p><h3>3. Moving the data</h3><p>To move data from one Elasticsearch cluster to another, you can also further consult <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-data-between-elasticsearch-clusters-with-minimal-downtime">this guide</a> in the Elastic documentation. For the purpose of this post, we’re assuming that snapshot and restore are used.</p><p><strong>Snapshot and restore</strong> are quickest:</p><pre><code># on the old cluster, take a snapshot
PUT _snapshot/log-backups
{
  "type": "s3",
  "settings": { ... }
}

PUT _snapshot/log-backups/final-snap-2025-08-07

# on the new cluster (readonly!)
PUT _snapshot/log-backups
{
  "type": "s3",
  "settings": {
    "readonly": true,
    ...
  }
}

# Perform the restore operation
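# (tip: append ?wait_for_completion=true to wait until the restore finishes)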
POST _snapshot/log-backups/final-snap-2025-08-07/_restore</code></pre><p>Can’t share an object store? Use <a href="https://www.elastic.co/docs/manage-data/migrate/migrate-from-a-self-managed-cluster-with-a-self-signed-certificate-using-remote-reindex"><strong>remote re-index</strong></a> (slower, but works everywhere, though it won’t migrate index templates, component templates, and other cluster state) or pump logs through a one-off Logstash job.</p><h3>4. Configuring the ClusterLogging operator</h3><p>First, we’ll need to decommission our Red Hat operator–managed Elasticsearch cluster. We’ll modify our ClusterLogging like so:</p><pre><code>oc edit clusterlogging instance -n openshift-logging 
---------
 logStore:
    elasticsearch:
      nodeCount: 0 # scale down node count, previously > 0 
      redundancyPolicy: ZeroRedundancy
    type: elasticsearch
  managementState: Managed # this needs to be kept, as it will manage the fluentd instance for us.
  visualization:
    kibana:
      replicas: 0 # scale down kibana as well 
    type: kibana</code></pre><p>Then we’ll define a ClusterLogForwarder to direct the logs from fluentd to our newly built Elasticsearch 6.x cluster managed by ECK. We’ll need to create a secret with the Elasticsearch credentials:</p><pre><code>oc create secret generic eck-es-credentials \
  -n openshift-logging \
  --from-literal=username=elastic \
  --from-literal=password=$(oc get secret es-logs-es-elastic-user -n elastic -o jsonpath='{.data.elastic}' | base64 -d)</code></pre><p>For configuring TLS (as recommended), you’ll need to create a ConfigMap for ClusterLogForwarder to trust the ECK CA certificate. Further <a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/log-collection-and-forwarding">guidance can be found here</a>. We’ll run the following commands:</p><pre><code>oc -n elastic get secret es-logs-es-http-certs-public \
-o go-template='{{index .data "tls.crt" | base64decode}}' > ca.crt

oc -n openshift-logging create configmap eck-es-ca \
--from-file=ca-bundle.crt=ca.crt</code></pre><p>This creates the CA ConfigMap, which we’ll then reference in the ClusterLogForwarder CRD:</p><pre><code>apiVersion: logging.openshift.io/v1
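# Forward collected logs from fluentd to the ECK-managed Elasticsearch cluster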
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: eck-es
      type: elasticsearch
      url: https://es-logs-es-http.elastic.svc:9200
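      # ECK exposes the cluster at <cluster-name>-es-http.<namespace>.svc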
      secret:
        name: eck-es-credentials # this secret needs to be created first
      tls:
        # insecureSkipVerify: true # can be used for lab testing purposes
        ca:
          name: eck-es-ca
  pipelines:
    - name: send-to-eck
      inputRefs:
        - application
        - infrastructure
        - audit
      outputRefs:
        - eck-es</code></pre><p>⚠️ <em>If you’re troubleshooting connectivity issues, you can temporarily set <code>tls.insecureSkipVerify: true</code>, but this shouldn’t be used long term.</em></p><p>Because we’re restoring legacy indices into a fresh ECK-managed cluster, OpenShift Logging will not automatically recreate the legacy index layout or aliases. You must ensure that write aliases exist and point to writable indices. In my case, I needed to verify that the following aliases existed:</p><ul><li><code>app-write</code></li><li><code>infra-write</code></li><li><code>audit-write</code></li></ul><p>Each alias points to an index with dynamic mappings (not recommended for production, but it minimizes errors and troubleshooting steps during the migration). </p><pre><code># Forward ES port to local machine
oc -n elastic port-forward svc/es-logs-es-http 9200:9200

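# Grab the elastic superuser password from the ECK-managed secret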
PASS="$(oc -n elastic get secret es-logs-es-elastic-user -o jsonpath='{.data.elastic}' | base64 -d)"

# Make sure the write alias points to the correct backing index
curl -s -k -u "elastic:${PASS}" -XPOST "https://localhost:9200/_aliases" \
  -H 'Content-Type: application/json' \
  -d '{
    "actions": [
      { "add": { "index": "infra-000002", "alias": "infra-write", "is_write_index": true } }
    ]
  }'</code></pre><p>Repeat for <code>app-write</code> and <code>audit-write</code> with their respective backing indices. Data should now start flowing into our new ECK-managed cluster.</p><h3>5. Rolling upgrade to 7.17.29, and verify</h3><p>Now you can finally leave 6.x behind.</p><p>A. Check <code>_xpack/migration/deprecations?pretty</code> using curl against Elasticsearch to surface deprecations. This API returns warnings and critical issues to address before you upgrade.</p><p>B. Patch the CRD to upgrade it to the latest 7.x version. I’m using 7.17.29.</p><pre><code>oc -n elastic patch elasticsearch es-logs --type=merge -p '{"spec":{"version":"7.17.29"}}'</code></pre><p>C. ECK restarts nodes <em>one at a time</em>. Your cluster should be online throughout.</p><p>D. Give cluster tasks and shard recoveries time to settle before pressing on.</p><p>E. Don’t forget to upgrade Kibana in the same way.</p><pre><code>oc -n elastic patch kibana kibana --type=merge -p '{"spec":{"version":"7.17.29"}}'</code></pre><p>Once complete, check your Elasticsearch version and Kibana version, as well as the health state:</p><pre><code>oc -n elastic get elasticsearch es-logs
oc -n elastic get kibana kibana</code></pre><h3>6. Operator upgrade: ECK 2.16.1 → 3.3.1</h3><p>ECK upgrades are pleasantly boring:</p><pre><code>helm upgrade elastic-operator elastic/eck-operator -n elastic-system --version 3.3.1</code></pre><p>Watch the operator pod roll. Your Elasticsearch cluster keeps running; only the controller restarts.</p><p>Verify that the upgrade is successful by looking at the operator logs and ensuring that no major errors appear:</p><pre><code>oc logs -n elastic-system sts/elastic-operator</code></pre><p>Then verify the new version of the operator (it will now be 3.3.1):</p><pre><code>helm -n elastic-system list</code></pre><h3>7. Your roadmap to 8.x and 9.x (when you’re ready)</h3><p>You’re now on:</p><ul><li><strong>ECK Operator:</strong> 3.3.1</li><li><strong>Elastic Stack:</strong> 7.17.29</li></ul><p>That pair is fully supported and serves as the official launchpad for 8.x. It’s important to first go through the <a href="https://www.elastic.co/guide/en/security/8.19/upgrade-7.17-8x.html">Elastic upgrade documentation</a>.</p><p>We’ll again go through the procedure of checking for any hard-breaking changes between 7.17.29 and the latest 8.x version (8.19.9):</p><pre><code>GET _migration/deprecations?pretty</code></pre><p>It&#x27;s important to look through the result of this query carefully and to complete the necessary steps, such as reindexing indices and changing mappings.</p><p>Once you’ve addressed all required changes from 7.17.29 to 8.x:</p><pre><code>oc -n elastic patch elasticsearch es-logs --type=merge -p '{"spec":{"version":"8.19.9"}}'
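# upgrade Kibana in lockstep so both components run the same version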
oc -n elastic patch kibana kibana --type=merge -p '{"spec":{"version":"8.19.9"}}'</code></pre><p>ECK will handle the rest. Just remember to upgrade Beats, Logstash pipelines, and client libraries in lockstep to avoid wire-protocol surprises.</p><p>Repeat the process to migrate to the <strong>latest 9.x</strong> version.</p><h3>8. Cleanup</h3><p>Now that you’re no longer using the Red Hat Elasticsearch operator, you can remove it from your cluster. You can do that via the following steps:</p><p>A. In the OpenShift Console, go to <strong>Operators</strong> and then to <strong>Installed Operators</strong>.</p><p>B. In the <strong>Filter By Name</strong> field, enter “Elasticsearch” to find the installed Red Hat Elasticsearch operator.</p><p>C. On the <strong>Operator Details</strong> page, select <strong>Uninstall Operator</strong> from the <strong>Actions</strong> list.</p><p>D. On the <strong>Uninstall Operator?</strong> dialog box, select <strong>Uninstall</strong>. This removes the operator, the operator deployments, and the pods. After this step, the operator stops running and will no longer receive updates.</p><p>All of these steps can be found in <a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html-single/operators/index#olm-deleting-operators-from-a-cluster-using-web-console_olm-deleting-operators-from-a-cluster">this link</a> from Red Hat OpenShift documentation.</p><h2>Wrapping up</h2><p>By installing <strong>ECK 2.16.1</strong> as a bridge, snapshot-restoring into a new cluster, and stepping cleanly through 7.x before landing on <strong>ECK 3.3.1</strong>, you’ve transformed an aging, unsupported logging back end into a modern, secure, first-class Elastic deployment, without surprises or downtime.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Semantic search, now multilingual by default]]></title>
            <description><![CDATA[semantic_text now defaults to jina-embeddings-v5-text on Elastic Inference Service, enabling multilingual semantic search in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/multilingual-semantic-search-jina-embeddings-v5-text</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/multilingual-semantic-search-jina-embeddings-v5-text</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kapil Jadhav, Brendan Jugan, Ranjana Devaji]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/205fc7de02c5d84b99624bffc0488a46df2d3b07-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we’re pleased to announce that <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> now defaults to the <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><code>jina-embeddings-v5-text</code></a> family on <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>, bringing built-in multilingual inference with no additional configuration required.</p><p>EIS provides managed, GPU-accelerated inference tightly integrated with Elasticsearch. With EIS, you don’t need to host, scale, or maintain infrastructure for embedding models.</p><p><a href="https://www.elastic.co/docs/solutions/search/semantic-search">Semantic search</a> retrieves results based on meaning. Text is converted into vector embeddings so queries can match related concepts, even when the exact words differ.</p><p>The <code>semantic_text</code> field type simplifies this entire workflow, with <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">automatic chunking</a>, embedding generation at index time, and seamless querying via the <code>semantic</code> query, without building custom pipelines or managing separate model inference.</p><p>The <code>jina-embeddings-v5-text</code> model family <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">just launched on EIS</a>, giving developers powerful multilingual embeddings accessible as part of the core <code>semantic_text</code> workflow. So now your semantic search works across languages out of the box, and global datasets, such as support articles, product descriptions, user reviews, and multilingual websites, work without extra configuration.</p><p>This default opens up broad, globe-spanning semantic retrieval with no operational overhead.</p><h2>jina-embeddings-v5-text</h2><p>The <strong><code>jina-embeddings-v5-text</code></strong> models represent the latest generation of compact, high-performance multilingual embedding models on EIS.</p><ul><li><strong>State-of-the-art multilingual performance:</strong> Top scores on <a href="https://huggingface.co/spaces/mteb/leaderboard">MMTEB benchmarks</a> across hundreds of languages. <code>jina-embeddings-v5-text-nano</code> leads models under 500M parameters, and <code>jina-embeddings-v5-text-small</code> outperforms significantly larger alternatives.</li><li><strong>Multiple task capabilities:</strong> Spanning across retrieval, semantic matching, clustering, and classification.</li><li><strong>Flexible choices to fit your use case:</strong> Two model sizes (<code>small</code>, <code>nano</code>) let you balance speed, cost, and quality.</li><li><strong>Long-context support:</strong> Embed long texts efficiently, ideal for document collections with extended context.</li></ul><h2>Get started</h2><h3>1. Create index</h3><p>Define a <code>semantic_text</code> field with no additional configuration. Embeddings will be generated automatically at index time using the default model. For production workloads, explicitly specify the model to ensure consistent behavior and results.</p><pre><code>PUT /multilingual-reviews
{
  "mappings": {
    "properties": {
      "product": { "type": "keyword" },
      "review": { "type": "semantic_text" },
      "language": { "type": "keyword" }
    }
  }
}</code></pre><h3>2. Index multilingual documents</h3><p>Add product reviews in six different languages. Each document’s review field is automatically embedded at ingest time, with no separate pipeline or preprocessing needed.</p><pre><code>POST /multilingual-reviews/_bulk?refresh=wait_for
{ "index": { "_id": "1" } }
{ "product": "wireless-headphones", "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.", "language": "en" }
{ "index": { "_id": "2" } }
{ "product": "wireless-headphones", "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.", "language": "es" }
{ "index": { "_id": "3" } }
{ "product": "wireless-headphones", "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。", "language": "ja" }
{ "index": { "_id": "4" } }
{ "product": "wireless-headphones", "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.", "language": "fr" }
{ "index": { "_id": "5" } }
{ "product": "wireless-headphones", "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.", "language": "de" }
{ "index": { "_id": "6" } }
{ "product": "wireless-headphones", "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.", "language": "pt" }</code></pre><h3>3. Search across languages with a query in English</h3><pre><code>GET /multilingual-reviews/_search
{
  "query": {
    "match": {
      "review": "comfortable for long flights"
    }
  }
}</code></pre><p>The results show all six reviews ranked by semantic relevance to the English query:</p><pre><code>{
  "took": 83,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.8275735,
    "hits": [
      {
        "_index": "multilingual-reviews",
        "_id": "4",
        "_score": 0.8275735,
        "_source": {
          "product": "wireless-headphones",
          "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.",
          "language": "fr"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "1",
        "_score": 0.7616198,
        "_source": {
          "product": "wireless-headphones",
          "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.",
          "language": "en"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "5",
        "_score": 0.72122526,
        "_source": {
          "product": "wireless-headphones",
          "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.",
          "language": "de"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "2",
        "_score": 0.6867013,
        "_source": {
          "product": "wireless-headphones",
          "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.",
          "language": "es"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "3",
        "_score": 0.66513836,
        "_source": {
          "product": "wireless-headphones",
          "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。",
          "language": "ja"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "6",
        "_score": 0.61658823,
        "_source": {
          "product": "wireless-headphones",
          "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.",
          "language": "pt"
        }
      }
    ]
  }
}</code></pre><p>Notice that the French review ranks first, even above the English one. That&#x27;s because &quot;<em>très confortable pour les longs trajets en avion</em>&quot; (&quot;very comfortable for long trips by plane&quot;) is a closer semantic match to the query than the English review, which splits its focus across noise cancellation, battery life, and flights. This demonstrates the <code>jina-embeddings-v5-text-small</code> model&#x27;s ability to rank by meaning, not language.</p><h3>4. Search across languages with a Japanese query</h3><pre><code>GET /multilingual-reviews/_search
{
  "query": {
    "match": {
      "review": "長時間のフライトに最適"
    }
  }
}</code></pre><p>The results show all six reviews ranked by semantic relevance to the Japanese query (“Ideal for long-haul flights”):</p><pre><code>{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.7556782,
    "hits": [
      {
        "_index": "multilingual-reviews",
        "_id": "4",
        "_score": 0.7556782,
        "_source": {
          "product": "wireless-headphones",
          "review": "Réduction de bruit excellente et très confortable pour les longs trajets en avion.",
          "language": "fr"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "1",
        "_score": 0.7395687,
        "_source": {
          "product": "wireless-headphones",
          "review": "Amazing noise cancellation and the battery lasts all day. Perfect for long flights.",
          "language": "en"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "5",
        "_score": 0.68835545,
        "_source": {
          "product": "wireless-headphones",
          "review": "Hervorragende Geräuschunterdrückung. Ideal für Pendler und Vielflieger.",
          "language": "de"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "3",
        "_score": 0.6487931,
        "_source": {
          "product": "wireless-headphones",
          "review": "ノイズキャンセリングが素晴らしく、長時間つけていても耳が痛くなりません。",
          "language": "ja"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "6",
        "_score": 0.6241487,
        "_source": {
          "product": "wireless-headphones",
          "review": "O cancelamento de ruído é excelente e a bateria dura o dia todo.",
          "language": "pt"
        }
      },
      {
        "_index": "multilingual-reviews",
        "_id": "2",
        "_score": 0.6183049,
        "_source": {
          "product": "wireless-headphones",
          "review": "La cancelación de ruido es impresionante. Muy cómodos incluso después de horas de uso.",
          "language": "es"
        }
      }
    ]
  }
}</code></pre><p>The ranking is nearly identical to the English query: French and English still lead because they&#x27;re the most semantically relevant to &quot;perfect for long flights,&quot; regardless of query language. The Japanese review didn&#x27;t get artificially boosted just because the query was in Japanese. It ranks fourth because it discusses wearing comfort, not flights. Semantic relevance takes priority over language matching.</p><h2>Note: For English-only use cases</h2><p>If you prefer a sparse representation or would like to continue to use Elastic Learned Sparse EncodeR (ELSER) for English workloads, ELSER remains available and fully supported as an option for <code>semantic_text</code>.</p><p>You can explicitly choose ELSER by specifying <code>inference_id: &quot;.elser-2-elastic&quot;</code> in your mappings when creating an index.</p><h2>Conclusion: Semantic search without borders</h2><p>With <code>semantic_text</code> now defaulting to the <code>jina-embeddings-v5-text</code> family on Elastic Inference Service, multilingual semantic search becomes the standard developer experience in Elasticsearch. This means developers can build search, retrieval augmented generation (RAG), and AI applications that work across global datasets without stitching pipelines together.</p><p>Create a <code>semantic_text</code> field, index your data, and start searching. All Elastic Cloud trials have access to Elastic Inference Service. Try it now on <a href="https://cloud.elastic.co/serverless-registration">Elastic Cloud Serverless</a> or Elastic Cloud Hosted, or use <a href="https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service">EIS via Cloud Connect</a> with your self-managed cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to compare two Elasticsearch indices and find missing documents]]></title>
            <description><![CDATA[Exploring approaches for comparing two Elasticsearch indices and finding missing documents.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-index-comparison</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-index-comparison</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[David Pilato]]></dc:creator>
            <pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/099abf465250360ab741a5aa13931fa8884ded34-1376x768.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When managing Elasticsearch indices, you may need to verify that all documents present in one index also exist in another, such as after a reindex operation, a migration, or a data pipeline. Elasticsearch doesn&#x27;t provide a built-in &quot;diff&quot; command for this, but the right approach depends on one key question: <strong>Are your document IDs stable between the two indices?</strong></p><h2>The problem</h2><p>Imagine you have two indices, <code>index-a</code> (source) and <code>index-b</code> (target), and you want to find all documents that exist in <code>index-a</code> but are missing from <code>index-b</code>.</p><p>A naive approach, querying both indices and comparing results in memory, won&#x27;t scale. Elasticsearch is designed to handle millions of documents, and loading them all at once isn’t practical.</p><p>There are two scenarios:</p><ol><li><strong>IDs are stable</strong>: Both indices use the same <code>_id</code> for the same document (for example, <code>emp_no</code> as the document ID). This is the easy case.</li><li><strong>IDs are generated</strong>: Documents were ingested through different pipelines that assigned random or sequential IDs. You can&#x27;t compare by <code>_id</code>; you need to match on content.</li></ol><p>Let&#x27;s walk through both.</p><h2>Step 0 — A lighter CLI for Elasticsearch</h2><p>All the examples in this post use <a href="https://github.com/Anaethelion/escli-rs">escli</a>, a small Rust command line interface (CLI) that wraps the Elasticsearch REST API. It reads your cluster URL and credentials from environment variables, so you don’t have to repeat authentication headers on every command.</p><p>To see why that matters, here&#x27;s a typical <code>_search</code> call with raw <code>curl</code>:</p><pre><code>curl -X GET \
  -H "Authorization: ApiKey $ELASTIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query":{"term":{"user.id":"kimchy"}}}' \
  "$ELASTICSEARCH_URL/my-index-000001/_search"</code></pre><p>With <code>escli</code>, the same request becomes:</p><pre><code>./escli search --index my-index-000001 <<< '{"query":{"term":{"user.id":"kimchy"}}}'</code></pre><p>The credentials live in a <code>.env</code> file that escli sources automatically — no <code>-H &quot;Authorization: ...&quot;</code> on every call, no risk of leaking secrets in shell history. The request body is passed via stdin (<code>&lt;&lt;&lt;</code>), which makes it easy to pipe in multi-line JSON built dynamically with <code>jq</code>.</p><h2>Step 1 — Count documents in both indices</h2><p>Before doing a full scan, get a quick count of each index. If the counts match, the indices are likely in sync, and there’s no need to scan at all.</p><pre><code>./escli count --index index-a
./escli count --index index-b</code></pre><p>The <code>_count</code> API returns:</p><pre><code>{ "count": 1000000 }</code></pre><p>If the counts differ, proceed to the full comparison.</p><h2>Step 2 — When IDs mean something: Use op_type=create</h2><p>If both indices use the same <code>_id</code> for the same document, for example, because you indexed documents using a functional business key like <code>emp_no</code> rather than a generated UUID, you can find and fix missing documents in a single <code>_reindex</code> call.</p><h3>Why functional IDs matter</h3><p>Using a meaningful field as <code>_id</code> (instead of a random UUID) is a best practice when the data has a natural key. It means:</p><ul><li>The same document always gets the same <code>_id</code>, regardless of which pipeline ingested it.</li><li>You can easily update or delete documents by ID.</li><li>You can use <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-index#operation-index-op_type"><code>op_type=create</code></a> to skip documents that already exist in the target.</li><li>No client-side scanning or comparison is needed.</li></ul><h3>The op_type=create trick</h3><p><code>_reindex</code> with <code>op_type=create</code> tries to create each document from the source in the target. If a document with the same <code>_id</code> already exists, Elasticsearch reports it as a <code>version_conflict</code> and moves on. It <strong>doesn’t</strong> overwrite the existing document. Setting <code>conflicts=proceed</code> tells the API to continue instead of aborting on the first conflict.</p><pre><code>./escli reindex <<< '{
  "source": { "index": "index-a" },
  "dest":   { "index": "index-b", "op_type": "create" },
  "conflicts": "proceed"
}'</code></pre><p>The response tells you exactly what happened:</p><pre><code>{
  "total": 1000000,
  "created": 49594,
  "version_conflicts": 950406,
  "failures": []
}</code></pre><ul><li><code>created</code>: Documents that were missing from <code>index-b</code> and have now been added.</li><li><code>version_conflicts</code>: Documents that already existed in <code>index-b</code> and were left untouched.</li></ul><p><strong>No scanning, no client-side comparison, no intermediate file.</strong> Everything happens server-side in about six seconds on a 1M-document dataset.</p><h2>Step 3 — When IDs are not stable: Business-key comparison</h2><p>Sometimes you can&#x27;t rely on <code>_id</code>. A document pipeline that generates IDs at ingestion time will assign a different <code>_id</code> each time the same record is processed. If <code>index-a</code> and <code>index-b</code> were populated by two such pipelines, the same employee record might have <code>_id: &quot;abc123&quot;</code> in one index and <code>_id: &quot;xyz789&quot;</code> in the other, even though the underlying data is identical.</p><p>In this case, you need to match documents by content rather than by ID. The key is to identify a set of fields that together form a unique business key.</p><p>For an employee dataset, a reasonable business key is <code>(first_name, last_name, birth_date)</code>. A document in <code>index-a</code> is &quot;missing&quot; from <code>index-b</code> if no document in <code>index-b</code> has the same combination of those three fields.</p><h3>3a — Scan the source with PIT + search_after</h3><p>Open a <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-open-point-in-time">point in time (PIT)</a> on the source index to get a consistent snapshot, and then paginate through it, fetching only the business-key fields:</p><pre><code>./escli open_point_in_time index-a 5m
# → { "id": "46ToAwMDaWR..." }</code></pre><pre><code>./escli search <<< '{
  "size": 10000,
  "_source": ["first_name", "last_name", "birth_date"],
  "pit": { "id": "46ToAwMDaWR...", "keep_alive": "5m" },
  "sort": [{ "_shard_doc": "asc" }]
}'</code></pre><p>The sort key <code>_shard_doc</code> is the most efficient sort for full-index pagination: it uses the internal Lucene document order with no overhead. Repeat with <code>search_after</code> until the response contains zero hits. Always close the PIT when done:</p><pre><code>./escli close_point_in_time <<< '{"id": "46ToAwMDaWR..."}'</code></pre><h3>3b — Check each page against the target via _msearch</h3><p>For each page of source documents, build one <code>_msearch</code> request with one subquery per document. Each subquery uses a <code>bool/must</code> on the three business-key fields and requests <code>size: 0</code>; we only need to know whether a match exists, not to retrieve the document itself.</p><pre><code>./escli msearch << 'EOF'
{"index": "index-b"}
{"size":0,"query":{"bool":{"must":[{"term":{"first_name.keyword":"Alice1"}},{"term":{"last_name.keyword":"Smith"}},{"term":{"birth_date":"1985-03-12"}}]}}}
{"index": "index-b"}
{"size":0,"query":{"bool":{"must":[{"term":{"first_name.keyword":"Bob2"}},{"term":{"last_name.keyword":"Jones"}},{"term":{"birth_date":"1990-07-24"}}]}}}
EOF</code></pre><p>The response contains one entry per subquery, in the same order:</p><pre><code>{
  "responses": [
    { "hits": { "total": { "value": 1 } } },
    { "hits": { "total": { "value": 0 } } }
  ]
}</code></pre><p><code>total.value == 0</code> means no document in <code>index-b</code> matches that business key; the document is missing. Collect the corresponding <code>_id</code> from the source page.</p><blockquote><strong>Note on</strong> <strong><code>.keyword</code></strong> <strong>subfields</strong>: <code>term</code> queries require exact (keyword) matching. The <code>first_name</code> and <code>last_name</code> fields must have a <code>.keyword</code> subfield in the index mapping. The demo&#x27;s <code>mapping.json</code> includes this.</blockquote><h3>3c — Speed it up with split-by-date</h3><p>If the business key includes a date field, you can partition the source into date slices and run each slice as an independent job. Each slice opens its own PIT with a <code>range</code> filter on <code>birth_date</code>, runs its own msearch loop, and writes its results to a separate file. The parent script launches all slices in parallel and aggregates the results when they’re all done.</p><p>But depending on your use case, you might want to partition by a different field; for example, if you have a <code>team</code> field, you could run one slice per team. The key is to find a field that allows you to split the data into reasonably even chunks that can be processed in parallel.</p><pre><code>[compare] Launching 5 slices in parallel...

  → Slice 1: 1960-01-01 → 1969-12-31 ✅ — 244408 checked, 12207 missing
  → Slice 2: 1970-01-01 → 1979-12-31 ✅ — 243624 checked, 12212 missing
  → Slice 3: 1980-01-01 → 1989-12-31 ✅ — 243551 checked, 11921 missing
  → Slice 4: 1990-01-01 → 1999-12-31 ✅ — 243895 checked, 11991 missing
  → Slice 5: 2000-01-01 → 2009-12-31 ✅ — 24522 checked, 1263 missing</code></pre><h2>Performance on a 1M dataset</h2><p>To validate the approaches, the demo generates 1,000,000 documents in <code>index-a</code> and deliberately skips ~5% in <code>index-b</code> (49,594 missing documents), and then runs the full compare → reindex cycle.</p><p>Results on a MacBook M3 Pro:</p><p><strong>Comparison</strong> (<code>compare-indices.sh</code>):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Strategy</td><td style="border: 1px solid #ccc; padding: 4px;">Compare</td><td style="border: 1px solid #ccc; padding: 4px;">Reindex</td><td style="border: 1px solid #ccc; padding: 4px;">Total</td><td style="border: 1px solid #ccc; padding: 4px;">How it works</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">op_type</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">6s</td><td style="border: 1px solid #ccc; padding: 4px;">6s</td><td style="border: 1px solid #ccc; padding: 4px;">Full _reindex server-side, skips existing</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">business-key</td><td style="border: 1px solid #ccc; padding: 4px;">1m 38s</td><td style="border: 1px solid #ccc; padding: 4px;">4s</td><td style="border: 1px solid #ccc; padding: 4px;">1m 42s</td><td style="border: 1px solid #ccc; padding: 4px;">PIT scan + _msearch by business key</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">split-by-date</td><td style="border: 1px solid #ccc; padding: 4px;">32s</td><td style="border: 1px solid #ccc; padding: 4px;">4s</td><td style="border: 1px solid #ccc; padding: 4px;">36s</td><td style="border: 1px solid #ccc; padding: 4px;">Same as business-key, 5 slices in parallel</td>
                    </tr>
              </tbody>
            </table>
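          <p>For reference, the <code>op_type</code> row in the table boils down to one server-side call. Here is a minimal sketch with the Python client (the demo itself uses shell scripts; the index names come from the demo, while the client setup and local cluster URL are assumptions):</p><pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Copy index-a into index-b. With op_type "create", every document whose _id
# already exists in index-b becomes a version conflict instead of an overwrite.
resp = es.reindex(
    source={"index": "index-a"},
    dest={"index": "index-b", "op_type": "create"},
    conflicts="proceed",  # skip conflicts so only the missing documents are created
    wait_for_completion=True,
)
print(f"created={resp['created']}, conflicts={resp['version_conflicts']}")</code></pre>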
          <p>The <code>op_type=create</code> approach is fastest because everything is server-side and requires no client-side scanning. The <code>split-by-date</code> strategy cuts the total <code>business-key</code> time from 1m 42s down to 36s through parallelism: not bad for a comparison across two 1M-document indices.</p><h2>Decision tree</h2><pre><code>Are _id values stable between both indices?
├── Yes → _reindex with op_type=create          (6s, server-side)
└── No  → Do you have a reliable business key?
          ├── Yes, simple scan is fast enough → business-key   (1m 42s)
          └── Yes, and you need more speed    → split-by-date  (36s, parallel)</code></pre><h2>Conclusion</h2><p>Elasticsearch doesn&#x27;t offer a native index diff command, but the right strategy depends on your data model:</p><ul><li><strong>Use functional</strong> <strong><code>_id</code>s</strong> (a natural business key like <code>emp_no</code>) whenever possible. It unlocks the simplest and fastest approach: <code>_reindex</code> with <code>op_type=create</code> finds and fills gaps in one server-side call.</li><li><strong>When IDs are unstable</strong>, match by business key using PIT + <code>_msearch</code>. Partition by a field and run slices in parallel to recover most of the performance. If you find yourself doing this regularly, consider computing a hash of your business key fields and using it as <code>_id</code> at ingestion time. You get the best of both worlds: stable IDs and efficient lookups.</li></ul><p>The complete demo, including dataset generation, comparison scripts, and reindex scripts, is available at <a href="https://github.com/dadoonet/blog-compare-indices/">https://github.com/dadoonet/blog-compare-indices/</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring Kibana dashboard views with Elastic Workflows]]></title>
            <description><![CDATA[Learn how to use Elastic Workflows to collect Kibana dashboard view metrics every 30 minutes and index them into Elasticsearch, so you can build custom analytics and visualizations on top of your own data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/monitor-kibana-dashboard-views-elastic-workflows</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/monitor-kibana-dashboard-views-elastic-workflows</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0ed94ce00962287b5507f45c92ecb60fdcbf2718-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/kibana">Kibana</a> tracks how many times each dashboard is viewed, but that data isn’t exposed natively in any built-in dashboard. In this article, we’ll use <strong>Elastic Workflows</strong> to automatically collect that data every 30 minutes and index it into Elasticsearch, so we can build our own analytics on top of it.</p><p><a href="https://www.elastic.co/docs/explore-analyze/workflows">Elastic Workflows</a> is a built-in automation engine inside Kibana that lets you define multistep processes using a simple YAML configuration. Each workflow can be triggered on a schedule or event or as a tool in <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic Agent Builder</a>, and each step can call Kibana APIs, query Elasticsearch, or transform data.</p><p>We’ll use dashboard view counts as a concrete example, but the same pattern applies to any metric exposed through the Kibana saved objects API.</p><h2>Prerequisites</h2><ul><li><a href="https://www.elastic.co/cloud">Elastic Cloud</a> or <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed">self-managed </a>cluster running 9.3</li><li><a href="https://www.elastic.co/docs/explore-analyze/workflows/get-started#workflows-prerequisites">Workflows enabled</a> (Advanced settings)</li></ul><h2>Step 1: Explore the raw data in <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/console">Dev Tools</a></h2><p>Before building anything, let&#x27;s understand what data we have. Kibana stores most of its configuration and metadata as <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-saved-objects">saved objects</a> in a dedicated internal index. One of the things Kibana tracks this way is dashboard view counts, using a special saved object type called usage counters. You can query them directly from Dev Tools:</p><pre><code>GET kbn:/api/saved_objects/_find?type=usage-counter&filter=usage-counter.attributes.domainId:"dashboard"%20and%20usage-counter.attributes.counterType:"viewed"&per_page=10000</code></pre><p>The response looks like this:</p><pre><code>{
  "page": 1,
  "per_page": 10000,
  "total": 1,
  "saved_objects": [
    {
      "type": "usage-counter",
      "id": "dashboard:346f3c64-ebca-484d-9d57-ec600067d596:viewed:server:20260310",
      "attributes": {
        "domainId": "dashboard",
        "counterName": "346f3c64-ebca-484d-9d57-ec600067d596",
        "counterType": "viewed",
        "source": "server",
        "count": 1
      },
      ...
    }
  ]</code></pre><p>The <code>counterName</code> field is the dashboard ID, and <code>count</code> is the cumulative view count for that dashboard on that specific day. Kibana creates one counter object per dashboard per day; you can see the date suffix in the object ID (...viewed:server:20260310). The count grows throughout the day as users open the dashboard.</p><p>Rather than replicating this daily-document model in our index, we’ll create one document per workflow execution. Each document records how many views that dashboard had accumulated for the day at the moment of capture.</p><h2>Step 2: Create the destination index</h2><p>We need an index to store our dashboard view snapshots. The following command creates it with explicit mappings so we can aggregate and visualize later. Run this in Dev Tools:</p><pre><code>PUT dashboard-views
{
  "mappings": {
    "properties": {
      "captured_at": {
        "type": "date"
      },
      "dashboard_id": {
        "type": "keyword"
      },
      "dashboard_name": {
        "type": "keyword"
      },
      "view_count": {
        "type": "integer"
      }
    }
  }
}</code></pre><p>Using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/keyword"><code>keyword</code></a> mappings for IDs and names allows <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregations</a>. Using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/number"><code>integer</code></a> for <code>view_count</code> is a safe default: since Kibana resets the counter daily, reaching the 32-bit limit (more than 2 billion views in a single day) isn’t a realistic concern. The type still supports numeric operations such as <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-max-aggregation"><code>max</code></a>, <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-avg-aggregation"><code>avg</code></a>, and <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-metrics-min-aggregation"><code>min</code></a>.</p><h2>Step 3: Create the workflow</h2><p>Go to <strong>Stack Management &gt; Workflows &gt; New Workflow</strong>, and paste the following workflow YAML configuration:</p><pre><code>name: dashboard-views-ingestion
triggers:
  - type: scheduled
    with:
      every: 30m

steps:
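  # Step 1: pull every dashboard "viewed" usage counter from the Kibana saved objects API.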
  - name: fetch_dashboard_views
    type: kibana.request
    with:
      method: GET
      path: >-
        /api/saved_objects/_find?type=usage-counter&per_page=10000&filter=usage-counter.attributes.domainId:"dashboard"%20and%20usage-counter.attributes.counterType:"viewed"

  - name: index_each_dashboard
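    # Step 2: for each counter, resolve the dashboard title, then index one snapshot document.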
    type: foreach
    foreach: "{{ steps.fetch_dashboard_views.output.saved_objects }}"
    steps:
      - name: fetch_dashboard_name
        type: kibana.request
        with:
          method: GET
          path: /api/saved_objects/dashboard/{{ foreach.item.attributes.counterName }}
        on-failure:
          continue: true

      - name: index_doc
        type: elasticsearch.request
        with:
          method: POST
          path: /dashboard-views/_doc
          body:
            dashboard_id: "{{ foreach.item.attributes.counterName }}"
            dashboard_name: "{{ steps.fetch_dashboard_name.output.attributes.title }}"
            view_count: "${{ foreach.item.attributes.count | plus: 0 }}"
            captured_at: "{{ execution.startedAt | date: '%Y-%m-%dT%H:%M:%SZ' }}"</code></pre><p>In the next section, let&#x27;s break down the workflow step by step.</p><h3>How the workflow works</h3><h4>Triggers</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5670991d65c64ee833924225c2d375a1be868b13-325x162.png" style="width: 100%;" /><p>The workflow runs on a scheduled trigger every 30 minutes. This gives us time-series data without hammering the API.</p><h4>fetch_dashboard_views</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66eaec147c3d01c524c67cf1c7f663ac56a3259d-812x215.png" style="width: 100%;" /><p>Uses <code>kibana.request</code> to call the Kibana saved objects API. No authentication setup is needed: The workflow engine automatically attaches the correct headers based on the execution context.</p><h4>index_each_dashboard (foreach)</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aad45e8aed8dc81ded6260cd6199ff78dcffe3b4-1892x290.png" style="width: 100%;" /><p>Iterates over the <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-saved-objects"><code>saved_objects</code></a> array returned by the previous step. The current item in each iteration is available as <code>foreach.item</code>. Inside the loop, we run two nested steps for each dashboard.</p><p><strong>1. <code>fetch_dashboard_name</code>:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db21ec5094b743018b9cd66c5052681f14c7d7e3-1999x431.png" style="width: 100%;" /><p>Resolves the human-readable dashboard title by calling <code>GET /api/saved_objects/dashboard/{id}</code>. We add <code>on-failure: continue: true</code> so that if a dashboard was deleted but still has view counters, the loop continues instead of failing the whole execution.</p><p><strong>2. <code>index_doc</code>:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f49dd0c9f0817bb1e1e5d9f4a2b05d13ef331054-1999x626.png" style="width: 100%;" /><p>Indexes each document using <code>POST /dashboard-views/_doc</code> (without an explicit ID), which lets Elasticsearch auto-generate IDs. This creates a new document on every run, building a history of view counts over time rather than overwriting the previous snapshot.</p><p>Two things worth noting:</p><ul><li>The <code>captured_at</code> field uses the date filter to format the timestamp as <a href="https://www.iso.org/iso-8601-date-and-time-format.html">ISO 8601</a>. Without it, the value comes out as a JavaScript date string, like <code>Tue Mar 10 2026 05:03:47 GMT+0000</code>, which Elasticsearch won&#x27;t map as a date.</li><li>The <code>view_count</code> uses <code>${{ }}</code> syntax with <code>| plus: 0</code> to preserve the numeric type. 
Using <code>{{ }}</code> would render it as a string, which would prevent math operations in the dashboard.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6d48c8784d5df6192e8b5175e69dbab5098194bc-919x774.png" style="width: 100%;" /><p><em>The UI allows you to nicely debug each of the workflow steps.</em></p><h2>Step 4: Build the stats dashboard</h2><p>Once the workflow has run a few times and data is collected, create a new dashboard in Kibana using the dashboard-views data view.</p><p>Some panels to start with:</p><ul><li><strong>Top dashboards by views:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/bar-charts"><strong>Bar chart</strong></a> with <code>dashboard_name</code> on the X axis and <code>last_value(view_count)</code> on the Y axis. This shows the current daily view count per dashboard.</li><li><strong>Views over time:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/line-charts"><strong>Line chart</strong></a> with <code>captured_at</code> on the X axis and <code>last_value(view_count)</code> on the Y axis, broken down by <code>dashboard_name</code>. Since each run appends a new document, use last value to get the peak count per time bucket rather than summing duplicates.</li><li><strong>Current snapshot:</strong> Use a <a href="https://www.elastic.co/docs/explore-analyze/visualize/charts/tables"><strong>Data table</strong></a> with the latest <code>captured_at</code> to show the most recent view counts across all dashboards.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e245b95f67daf76a2aaf4cb9df2c75ef4cfef582-1462x747.png" style="width: 100%;" /><p>Since each workflow creates a new document, you can filter by time range to analyze activity in specific periods, compare week over week, or build alerts when a dashboard drops below a view threshold.</p><h2><strong>Conclusion</strong></h2><p>Elastic Workflows is a good fit for this kind of periodic data collection because both the source (Kibana API) and the destination (Elasticsearch) are native, which means zero credential management. The workflow engine handles authentication automatically for <code>kibana.request</code> and <code>elasticsearch.request</code> steps, so the only thing you write is the logic.</p><h2><strong>Resources</strong></h2><ul><li><a href="https://www.elastic.co/docs/explore-analyze/workflows">Elastic Workflows</a></li><li><a href="https://www.elastic.co/docs/api/doc/kibana/">Kibana API</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When TSDS meets ILM: Designing time series data streams that don't reject late data]]></title>
            <description><![CDATA[How TSDS time bounds interact with ILM phases; and how to design policies that tolerate late-arriving metrics.]]></description>
            <link>https://www.elastic.co/search-labs/blog/tsds-ilm-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/tsds-ilm-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Bret Wortman]]></dc:creator>
            <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4c46e4a6f746d9c845857e80de036f5d51cd4e7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Recently, I migrated a customer&#x27;s metrics cluster from &quot;everything in the hot tier&quot; to a hot/cold/frozen architecture. It was a change I’d performed dozens of times before. Within minutes, Logstash stopped advancing data entirely.</p><p>Elasticsearch was rejecting late-arriving metrics. Those rejections caused the pipeline to fall behind, resulting in more late data, which triggered even more rejections. Eventually, the pipeline stalled completely.</p><p>We had to restore from snapshot, reindex the data, and redesign the ingestion pipeline to recover.</p><p>The root cause wasn&#x27;t index lifecycle management (ILM) itself. It was time series data streams (TSDS) and how they enforce time‑bound backing indices.</p><p>TSDS can reduce storage requirements for metrics by 40–70%, but the architectural changes that make TSDS efficient also alter how indices behave over time. Those changes matter when designing ILM policies or when your ingestion pipelines may produce late‑arriving data.</p><h2>TL;DR</h2><p>When using TSDS:</p><ul><li>Backing indices only accept documents within a specific time window.</li><li>If late data arrives after an index moves to cold or frozen, Elasticsearch rejects those documents or routes them to the failure store, if configured.</li></ul><p>Design rule:</p><pre><code>warm_min_age > rollover_max_age + maximum_expected_lateness</code></pre><h2>What is a time series data stream?</h2><p>A<em> time series data stream</em> (TSDS) is a specialized data stream optimized for metrics data. Data is routed so that related documents are located within the same shards, optimizing them for query and retrieval. Here’s how Elasticsearch does it:</p><p>Each document contains:</p><ul><li>A timestamp.</li><li>Dimension fields identifying the time series.</li><li>Metric fields representing measured values.</li></ul><p>Examples include:</p><ul><li>CPU usage per host.</li><li>Request latency per service.</li><li>Temperature readings per sensor.</li></ul><p><em>Dimensions </em>identify what we want to measure, while <em>metrics </em>represent values that change over time.</p><h3>Dimensions</h3><p>Dimensions describe the measured entity.</p><p>Examples:</p><pre><code>host.name
service.name
container.id</code></pre><p>We define them in mappings with:</p><pre><code>time_series_dimension: true</code></pre><h3>Metrics</h3><p>Metrics represent numeric values and are defined using:</p><pre><code>time_series_metric</code></pre><p>Common metric types:</p><ul><li>Gauge: Values that rise and fall.</li><li>Counter: Values that increase until reset.</li></ul><p>Elastic Agent primarily collects metrics and logs data, so even if you haven’t enabled any TSDS indices by hand, you may still have them in your cluster.</p><h3>The _tsid field</h3><p>Elasticsearch internally generates a <code>_tsid</code> value from dimension fields. This allows documents with identical dimensions to be routed to the same shard, improving:</p><ul><li>Compression.</li><li>Query locality.</li><li>Aggregation performance.</li></ul><h2>The key difference: Time‑bound backing indices</h2><p>Traditional data streams always write to the most recent backing index, called the <em>write index</em>, but TSDS behaves differently.</p><p>Each TSDS backing index has a defined time window and only accepts documents with <code>@timestamp</code> values that fall in that window:</p><pre><code>GET _data_stream/my-metrics-data-stream
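# Abridged response: backing indices only accept @timestamp values inside these bounds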
{
  "index_mode": "time_series",
  "time_series": {
    "temporal_ranges": [
      {
        "start": "2026-01-15T14:35:50.000Z",
        "end": "2026-03-16T11:34:40.000Z"
      }
    ]
  }
}</code></pre><p>When a document is indexed, Elasticsearch routes it to the backing index responsible for that timestamp, meaning that, unlike traditional indices, a TSDS may write to multiple backing indices simultaneously.</p><p>For example:</p><ul><li>Real‑time data → newest index.</li><li>Late data → earlier index covering that time range.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/31af2bb3b3dc24db8342e791e1db77a44659ba7a-1589x502.png" style="width: 100%;" /><h2>Designing for late‑arriving data</h2><p>Real ingestion pipelines rarely deliver metrics perfectly on time. Metrics can be delayed by network outages, backlogs along the way, batch ingestion, and edge devices that drop offline, then reconnect and start catching up.</p><p>Traditional indices quietly absorb those delays. TSDS does not.</p><p>If a document&#x27;s timestamp falls outside the range of writable backing indices, Elasticsearch rejects it, meaning your ILM policy must account for late data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32a103b95b20e31615c214271e27811a7ee315ae-1999x691.png" style="width: 100%;" /><h2>The critical constraint</h2><p>Backing indices must remain writable long enough to accept delayed data.</p><p>In practical terms:</p><pre><code>time_until_readonly > maximum_expected_lateness</code></pre><p>Because ILM measures ages from rollover, the operational rule becomes:</p><pre><code>warm_or_cold_min_age > rollover_max_age + maximum_expected_lateness</code></pre><p>For example, if metrics may arrive up to six hours late, indices must remain writable at least six hours after rollover.</p><p>Failing to account for this constraint was exactly what caused the ingestion failure described earlier. Late-arriving data was directed to an earlier index, which was already in the cold tier and therefore write-blocked.</p><h2>Handling rejected documents</h2><p>When TSDS rejects a document, Elasticsearch returns an error, indicating that the timestamp doesn’t fall within the range of writable indices. How your ingestion pipeline handles that error determines whether you lose data or stall ingestion.</p><p>The primary mechanism for handling rejected documents is the failure store.</p><h3>Failure store (recommended in Elasticsearch 9.1+)</h3><p>Elasticsearch 9.1 introduced the failure store, which automatically captures rejected documents. Instead of returning errors to clients, Elasticsearch writes failed documents to a dedicated failure index inside the data stream.</p><p>You can inspect failures using:</p><pre><code>GET metrics-myapp::failures/_search</code></pre><p>Using the failure store prevents ingestion pipelines from choking on rejection errors while preserving failed data for analysis or <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/reindex-tsds">reindexing</a>.</p><h2>Monitoring for rejection issues</h2><p>Late‑arrival problems usually surface as ingestion anomalies. You may notice:</p><ul><li>Sudden drops in indexing rate.</li><li>Spikes in rejected documents.</li><li>A growing number of failure store entries.</li><li>Mismatches between pipeline input and output counts.</li></ul><p>Alerting on these signals allows operators to detect issues before pipelines stall. 
Workflows, machine learning jobs, and other mechanisms can be used to automate detection and notification.</p><h2>Migration checklist for TSDS + ILM</h2><p>If you&#x27;re migrating a metrics cluster to TSDS, introducing ILM tiering, or upgrading to an Elasticsearch version where metrics are TSDS by default, review these items first.</p><h3><strong>1. Measure ingestion latency</strong></h3><p>Before changing ILM policies, determine:</p><ul><li>Normal ingestion delay.</li><li>Worst-case delay during incidents.</li><li>Delays caused by batch pipelines.</li></ul><p>Your ILM design must accommodate the maximum realistic delay.</p><h3><strong>2. Verify index time windows</strong></h3><p>Inspect your TSDS backing indices:</p><pre><code>GET _data_stream/<your-stream></code></pre><p>Look for:</p><ul><li><code>time_series.start_time</code></li><li><code>time_series.end_time</code></li></ul><p>These bounds determine which indices can accept documents. Understanding these windows can help you determine how late data can be before it’s rejected.</p><h3><strong>3. Size the hot tier for late arrivals</strong></h3><p>Ensure backing indices remain writable long enough for delayed data.</p><p>Operational rule:</p><ul><li><code>warm_min_age &gt; rollover_max_age + maximum_expected_lateness</code></li></ul><p>Remember, indices must remain writable for at least six hours if metrics may arrive six hours late.</p><h3><strong>4. Decide how to handle rejected documents</strong></h3><p>Choose a strategy before enabling TSDS:</p><ul><li>Failure store (recommended in Elasticsearch 9.1+).</li><li>Logstash dead letter queue.</li><li>Fallback index for late arrivals.</li><li>Accepting limited data loss.</li></ul><h3><strong>5. Monitor ingestion health</strong></h3><p>Add alerts for:</p><ul><li>Indexing rate drops.</li><li>Rejected documents.</li><li>Failure store growth.</li><li>Pipeline input/output mismatches.</li></ul><p>Late data issues often appear first as ingestion anomalies.</p><h2>Summary</h2><p>Time series data streams provide major storage and performance improvements for metrics workloads, but they introduce an important architectural change: Backing indices are time‑bound, which affects how ILM behaves.</p><p>When using TSDS:</p><ul><li>Indices must remain writable long enough to accept delayed data.</li><li>Ingestion pipelines should handle rejected documents safely.</li></ul><p>The key rule to remember is:</p><pre><code>warm_min_age > rollover_max_age + maximum_expected_lateness</code></pre><p>If you design ILM policies around that constraint, TSDS works extremely well for metrics workloads.</p><p>Ignore it, though, and your ingestion pipeline may discover those time boundaries the hard way.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LINQ to Elasticsearch ES|QL: Write C#, query Elasticsearch]]></title>
            <description><![CDATA[Exploring the new LINQ to Elasticsearch ES|QL provider in the Elasticsearch .NET client, which allows you to write C# code that’s automatically translated to ES|QL queries.]]></description>
            <link>https://www.elastic.co/search-labs/blog/linq-esql-c-elasticsearch-net-client</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/linq-esql-c-elasticsearch-net-client</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Florian Bernd, Martijn Laarman]]></dc:creator>
            <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e54132e915217063e9ed0ec45059c6cfc38e31dd-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Starting with <strong>v9.3.4</strong> and <strong>v8.19.18</strong>, the Elasticsearch .NET client includes a <a href="https://learn.microsoft.com/en-us/dotnet/csharp/linq/">Language Integrated Query (LINQ) </a>provider that translates C# LINQ expressions into <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">Elasticsearch Query Language (ES|QL)</a> queries at runtime. Instead of writing ES|QL strings by hand, you compose queries using <code>Where</code>, <code>Select</code>, <code>OrderBy</code>, <code>GroupBy</code>, and other standard operators. The provider takes care of translation, parameterization, and result deserialization, including per-row streaming that keeps memory usage constant, regardless of result set size.</p><h2>Your first query</h2><p>Start by defining a plain old CLR object (POCO) that maps to your Elasticsearch index. Property names are resolved to ES|QL column names through standard <code>System.Text.Json</code> attributes, like <code>[JsonPropertyName]</code>, or through a configured <code>JsonNamingPolicy</code>. The same <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/source-serialization">source serialization</a> rules that apply across the rest of the client apply here as well.</p><pre><code>using System.Text.Json.Serialization;

public class Product
{
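    // [JsonPropertyName] overrides the default camelCase naming policy for this property.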
    [JsonPropertyName("product_id")]
    public string Id { get; set; }

    public string Name { get; set; }

    public string Brand { get; set; }

    [JsonPropertyName("price_usd")]
    public double Price { get; set; }

    [JsonPropertyName("in_stock")]
    public bool InStock { get; set; }
}</code></pre><p>With the type in place, a query looks like this:</p><pre><code>var minPrice = 100.0;
var brand = "TechCorp";

await foreach (var product in client.Esql.QueryAsync<Product>(q => q
    .From("products")
    .Where(p => p.InStock && p.Price >= minPrice && p.Brand == brand)
    .OrderByDescending(p => p.Price)
    .Take(10)))
{
    Console.WriteLine($"{product.Name}: ${product.Price}");
}</code></pre><p>The provider translates this into the following ES|QL:</p><pre><code>FROM products
| WHERE (in_stock == true AND price_usd >= ?minPrice AND brand == ?brand)
| SORT price_usd DESC
| LIMIT 10</code></pre><p>A few details to note:</p><ul><li><strong>Property name resolution:</strong> <code>p.Price</code> becomes <code>price_usd</code> because of the <code>[JsonPropertyName]</code> attribute, and <code>p.Brand</code> becomes <code>brand</code> following the default camelCase naming policy.</li><li><strong>Parameter capturing:</strong> The C# variables <code>minPrice</code> and <code>brand</code> are captured as named parameters (<code>?minPrice</code>, <code>?brand</code>). They’re sent separately from the query string in the JSON payload, which prevents injection and enables server-side query plan caching.</li><li><strong>Streaming:</strong> <code>QueryAsync&lt;T&gt;</code> returns <code>IAsyncEnumerable&lt;T&gt;</code>. Rows are materialized one at a time as they arrive from Elasticsearch.</li></ul><p>You can also inspect the generated query and its parameters without executing it:</p><pre><code>var query = client.Esql.CreateQuery<Product>()
    .Where(p => p.InStock && p.Price >= minPrice && p.Brand == brand)
    .OrderByDescending(p => p.Price)
    .Take(10);

Console.WriteLine(query.ToEsqlString());
// FROM products | WHERE (in_stock == true AND price_usd >= 100 AND brand == "TechCorp") | SORT price_usd DESC | LIMIT 10

Console.WriteLine(query.ToEsqlString(inlineParameters: false));
// FROM products | WHERE (in_stock == true AND price_usd >= ?minPrice AND brand == ?brand) | SORT price_usd DESC | LIMIT 10

var parameters = query.GetParameters();
// { "minPrice": 100.0, "brand": "TechCorp" }</code></pre><h2>How does this work? A quick LINQ refresher</h2><p>The mechanism that makes LINQ providers possible is the distinction between <code>IEnumerable&lt;T&gt;</code> and <code>IQueryable&lt;T&gt;</code>.</p><p>When you call <code>.Where(p =&gt; p.Price &gt; 100)</code> on an <code>IEnumerable&lt;T&gt;</code>, the lambda compiles to a <code>Func&lt;Product, bool&gt;</code>, a regular delegate that the runtime executes in-process. This is LINQ-to-Objects.</p><p>When you call the same method on an <code>IQueryable&lt;T&gt;</code>, the C# compiler wraps the lambda in an <code>Expression&lt;Func&lt;Product, bool&gt;&gt;</code> instead. This is a data structure that represents the <em>structure</em> of the code rather than its executable form. The expression tree can be inspected, analyzed, and translated into another language at runtime.</p><pre><code>// IEnumerable: the lambda is a compiled delegate
IEnumerable<Product> local = products.Where(p => p.Price > 100);

// IQueryable: the lambda is an expression tree, a data structure
IQueryable<Product> remote = queryable.Where(p => p.Price > 100);</code></pre><p>The <code>IQueryProvider</code> interface is the extension point. Any provider can implement <code>CreateQuery&lt;T&gt;</code> and <code>Execute&lt;T&gt;</code> to translate these expression trees into a target language. Entity Framework uses this to emit SQL. The LINQ to ES|QL provider uses it to emit ES|QL.</p><p>The expression tree for the query above looks like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f864cd18a390831f8d28503a29b5835efb1842f7-1000x720.png" style="width: 100%;" /><p><em>Expression tree for the example query.</em></p><p>The tree is nested inside out: <code>Take</code> wraps <code>OrderByDescending</code>, which wraps <code>Where</code>, which wraps <code>From</code>, which wraps the root <code>EsqlQueryable&lt;Product&gt;</code> constant. The <code>Where</code> predicate is itself a subtree of <code>BinaryExpression</code> nodes for the <code>&amp;&amp;</code>, <code>&gt;=</code>, and <code>==</code> operators, with <code>MemberExpression</code> leaves for property accesses and closure captures for the <code>minPrice</code> and <code>brand</code> variables. This is the data structure that the provider walks to produce the final ES|QL.</p><h2>Under the hood: The translation pipeline</h2><p>The path from a LINQ expression to query results follows a six-stage pipeline:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2a2c772b63d720f61fc9a28b2f85668fa2db8d38-1999x1036.png" style="width: 100%;" /><p><em>Translation pipeline overview.</em></p><h3>1. Expression tree capture</h3><p>When you chain <code>.Where()</code>, <code>.OrderBy()</code>, <code>.Take()</code> and other operators on an <code>IQueryable&lt;T&gt;</code>, the standard LINQ infrastructure builds an expression tree. <code>EsqlQueryable&lt;T&gt;</code> implements <code>IQueryable&lt;T&gt;</code> and delegates to <code>EsqlQueryProvider</code>.</p><h3>2. Translation</h3><p>When the query is executed (by enumerating, calling <code>ToList()</code>, or using <code>await foreach)</code>, the <code>EsqlExpressionVisitor</code> walks the expression tree inside out. It dispatches each LINQ method call to a specialized visitor:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Visitor</td><td style="border: 1px solid #ccc; padding: 4px;">Translates</td><td style="border: 1px solid #ccc; padding: 4px;">Into</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">WhereClauseVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.Where(predicate)</td><td style="border: 1px solid #ccc; padding: 4px;">WHERE condition</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">SelectProjectionVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.Select(selector)</td><td style="border: 1px solid #ccc; padding: 4px;">EVAL + KEEP + RENAME</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">GroupByVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.GroupBy().Select()</td><td style="border: 1px solid #ccc; padding: 4px;">STATS ... BY</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OrderByVisitor</td><td style="border: 1px solid #ccc; padding: 4px;">.OrderBy() / .ThenBy()</td><td style="border: 1px solid #ccc; padding: 4px;">SORT field [ASC\|DESC]</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">EsqlFunctionTranslator</td><td style="border: 1px solid #ccc; padding: 4px;">EsqlFunctions.*, Math.*, string methods</td><td style="border: 1px solid #ccc; padding: 4px;">80+ ES|QL functions</td>
                    </tr>
              </tbody>
            </table>
          <p>During translation, C# variables referenced in expressions are captured as named parameters.</p><h3>3. Query model</h3><p>The visitors don’t produce strings directly. Instead, they produce <code>QueryCommand</code> objects, an immutable intermediate representation: for the example query, a <code>FromCommand</code>, a <code>WhereCommand</code>, a <code>SortCommand</code>, and a <code>LimitCommand</code>, each representing one ES|QL processing command. These are collected into an <code>EsqlQuery</code> model.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2adc349b6cf655b96b7b3e826a134e8a17fe42fd-1999x1036.png" style="width: 100%;" /><p><em>Query model and command pattern.</em></p><p>This intermediate model is decoupled from both the expression tree and the output format. It can be inspected, intercepted (via <code>IEsqlQueryInterceptor</code>), or modified before formatting.</p><h3>4. Formatting</h3><p><code>EsqlFormatter</code> visits each <code>QueryCommand</code> in order and produces the final ES|QL string. Each command becomes one line, separated by the pipe (|) operator that ES|QL uses to chain processing commands. Identifiers containing special characters are automatically escaped with backticks.</p><h3>5. Execution</h3><p>The formatted ES|QL string and captured parameters are sent to Elasticsearch’s <code>/_query</code> endpoint as a JSON payload. The <code>IEsqlQueryExecutor</code> interface abstracts the transport layer, which is where the layered package architecture comes into play.</p><h3>6. Materialization</h3><p><code>EsqlResponseReader</code> streams the JSON response without buffering the entire result set into memory. A <code>ColumnLayout</code> tree, precomputed once per query, maps flat ES|QL column names (like <code>address.street</code>, <code>address.city</code>) to nested POCO properties. Each row is assembled into a <code>T</code> instance and yielded one at a time via <code>IEnumerable&lt;T&gt;</code> or <code>IAsyncEnumerable&lt;T&gt;</code>.</p><h2>The layered architecture</h2><p>The LINQ to ES|QL functionality is split across three packages:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41b8aae860ecdc2480edcb1c1d4cc9b03cfb78c9-1999x1036.png" style="width: 100%;" /><p><em>Package architecture.</em><br/><a href="https://www.nuget.org/packages/Elastic.Esql"><strong><code>Elastic.Esql</code></strong></a> is the pure translation engine. It has zero HTTP dependencies and contains the expression visitors, query model, formatter, and response reader. You can use it standalone to build and inspect ES|QL queries without an Elasticsearch connection, which is useful for testing, query logging, or building your own execution layer.</p><pre><code>// Translation-only: no Elasticsearch connection needed
var provider = new EsqlQueryProvider();
var query = new EsqlQueryable<Product>(provider)
    .From("products")
    .Where(p => p.InStock)
    .OrderByDescending(p => p.Price);

Console.WriteLine(query.ToEsqlString());
// FROM products | WHERE in_stock == true | SORT price_usd DESC</code></pre><p><a href="https://www.nuget.org/packages/Elastic.Clients.Esql"><strong><code>Elastic.Clients.Esql</code></strong></a> is a lightweight stand-alone ES|QL client. It adds HTTP execution on top of <code>Elastic.Esql</code> via <code>Elastic.Transport</code>. If your application only needs ES|QL and none of the other Elasticsearch APIs, this is the minimal dependency option.</p><p><a href="https://www.nuget.org/packages/Elastic.Clients.Elasticsearch"><strong><code>Elastic.Clients.Elasticsearch</code></strong></a> is the full Elasticsearch .NET client. It also builds on <code>Elastic.Esql</code> and exposes the LINQ provider through the <code>client.Esql</code> namespace. This is the recommended entry point for most applications.</p><p>Both execution-layer packages provide their own implementation of <code>IEsqlQueryExecutor</code>, the strategy interface that bridges translation and transport.</p><p>All three packages are compatible with Native AOT when used with a source-generated <code>JsonSerializerContext</code>. For the full client, see the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/source-serialization#native-aot">Native AOT documentation</a>.</p><h2>Beyond the basics</h2><p>The example above covered filtering, sorting, and pagination. The provider supports a broader set of operations.</p><h3>Aggregations</h3><p><code>GroupBy</code>, combined with aggregate functions in <code>Select</code>, translates to ES|QL <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS ... BY</code></a>:</p><pre><code>var stats = client.Esql.Query<Product, object>(q => q
    .GroupBy(p => p.Brand)
    .Select(g => new
    {
        Brand = g.Key,
        Count = g.Count(),
        AvgPrice = g.Average(p => p.Price),
        MaxPrice = g.Max(p => p.Price)
    }));

// -> FROM products | STATS COUNT(*), AVG(price_usd), MAX(price_usd) BY brand</code></pre><h3>Projections</h3><p><code>Select</code>, with anonymous types generates <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval"><code>EVAL</code></a>, <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/keep"><code>KEEP</code></a>, and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/rename"><code>RENAME</code></a> commands:</p><pre><code>var query = client.Esql.CreateQuery<Product>()
    .Select(p => new { ProductName = p.Name, p.Price, p.InStock });

// -> FROM products | KEEP name, price_usd, in_stock | RENAME name AS ProductName</code></pre><h3>Rich function library</h3><p>Over 80 ES|QL functions are available through the <code>EsqlFunctions</code> class, covering date/time, string, math, IP, pattern matching, and scoring. Standard <code>Math.*</code> and <code>string.*</code> methods are also translated:</p><pre><code>.Where(p => p.Name.Contains("Pro"))       // -> WHERE name LIKE "*Pro*"
.Where(p => EsqlFunctions.CidrMatch(      // -> WHERE CIDR_MATCH(ip, "10.0.0.0/8")
    p.IpAddress, "10.0.0.0/8"))</code></pre><h3>LOOKUP JOIN</h3><p>Cross-index lookups translate to ES|QL <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/lookup-join"><code>LOOKUP JOIN</code></a>:</p><pre><code>var enriched = client.Esql.Query<Product, object>(q => q
    .LookupJoin<Product, CategoryLookup, string, object>(
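        // arguments: lookup index, local join key, lookup join key, result projection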
        "category-lookup-index",
        product => product.Id,
        category => category.CategoryId,
        (product, category) => new { product.Name, category!.CategoryLabel }));</code></pre><h3>Raw ES|QL escape hatch</h3><p>For ES|QL features not yet covered by the LINQ provider, you can append raw fragments:</p><pre><code>var results = client.Esql.Query<Product>(q => q
    .Where(p => p.InStock)
    .RawEsql("| EVAL discounted = price_usd * 0.9"));</code></pre><h3>Server-side async queries</h3><p>For long-running queries, submit them for background processing on the server:</p><pre><code>await using var asyncQuery = await client.Esql.SubmitAsyncQueryAsync<Product>(
    q => q.Where(p => p.InStock),
    asyncQueryOptions: new EsqlAsyncQueryOptions
    {
        WaitForCompletionTimeout = TimeSpan.FromSeconds(5),
        KeepAlive = TimeSpan.FromMinutes(10)
    });

await asyncQuery.WaitForCompletionAsync();
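// Rows stream back one at a time; results stay available for the KeepAlive window.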
await foreach (var product in asyncQuery.AsAsyncEnumerable())
    Console.WriteLine(product.Name);</code></pre><p>Server-side async queries are especially useful for long-running analytical queries or large-dataset processing that might exceed typical timeout thresholds, and for timeout-sensitive environments with load balancers, API gateways, or proxies that enforce strict HTTP timeouts. Async queries avoid connection drops by decoupling submission from result retrieval.</p><h2>Getting started</h2><p>LINQ to ES|QL is available starting from:</p><ul><li><strong>Elastic.Clients.Elasticsearch v9.3.4</strong> (9.x branch)</li><li><strong>Elastic.Clients.Elasticsearch v8.19.18</strong> (8.x branch)</li></ul><p>Install from NuGet:</p><p><code>dotnet add package Elastic.Clients.Elasticsearch</code></p><p>The entry points are on <code>client.Esql</code>:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Method</td><td style="border: 1px solid #ccc; padding: 4px;">Returns</td><td style="border: 1px solid #ccc; padding: 4px;">Use case</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query<T>(...)</td><td style="border: 1px solid #ccc; padding: 4px;">IEnumerable<T></td><td style="border: 1px solid #ccc; padding: 4px;">Synchronous execution</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">QueryAsync<T>(...)</td><td style="border: 1px solid #ccc; padding: 4px;">IAsyncEnumerable<T></td><td style="border: 1px solid #ccc; padding: 4px;">Async streaming</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">CreateQuery<T>()</td><td style="border: 1px solid #ccc; padding: 4px;">IEsqlQueryable<T></td><td style="border: 1px solid #ccc; padding: 4px;">Advanced composition and inspection</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">SubmitAsyncQueryAsync<T>(...)</td><td style="border: 1px solid #ccc; padding: 4px;">EsqlAsyncQuery<T></td><td style="border: 1px solid #ccc; padding: 4px;">Long-running server-side queries</td>
                    </tr>
              </tbody>
            </table>
          <p>For the full feature reference, including query options, multifield access, nested objects, and multivalue field handling, see the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/dotnet/linq-to-esql">LINQ to ES|QL documentation</a>.</p><h2>Conclusion</h2><p>LINQ to ES|QL brings the full expressiveness of C# LINQ to Elasticsearch&#x27;s ES|QL query language, letting you write strongly typed, composable queries without handcrafting query strings. With automatic parameter capturing, streaming materialization, and a layered package architecture that scales from stand-alone translation to the full Elasticsearch client, it fits naturally into .NET applications of any size. Install the latest client, point your LINQ expressions at an index, and let the provider handle the rest.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From judgment lists to trained Learning to Rank (LTR) models]]></title>
            <description><![CDATA[Learn how to transform judgment lists into training data for Learning To Rank (LTR), design effective features, and interpret what your model learned.]]></description>
            <link>https://www.elastic.co/search-labs/blog/learning-to-rank-models-judgment-lists</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/learning-to-rank-models-judgment-lists</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b43ac8c8f6dd519b28a6ed2b89328b6e76c71e02-1999x1125.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch">Evaluating search query relevance with judgment lists</a>, we built lists and used the <code>_rank_eval</code> API to measure search quality. Though this approach gave us an objective way to evaluate changes, improving relevance still requires manual query tuning.</p><p>If judgment lists answer the question, <strong>“How good is my ranking?”</strong> Learning To Rank (LTR) answers, <strong>“How do I systematically make it better?”</strong></p><p>In this article, we take the next step: using those judgment lists to train an LTR model with <a href="https://xgboost.ai/">XGBoost</a>, <a href="https://eland.readthedocs.io/en/latest/">Eland</a>, and Elasticsearch. We’ll focus on understanding the process rather than on implementation details. For the complete code, refer to the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">companion notebook</a>.</p><h2>What is LTR?</h2><p><a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">LTR</a> uses machine learning (ML) to build a ranking function for your search engine. Instead of manually tuning query weights, you provide examples of proper rankings (your <a href="https://www.elastic.co/search-labs/blog/judgment-lists">judgment list</a>) and let the model learn what makes documents relevant. In Elasticsearch, LTR works as a second-stage reranker that runs after the initial retrieval:</p><ul><li><strong>First stage:</strong> A standard query (<a href="https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch">BM25</a>, vector, or <a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch">hybrid</a>) retrieves candidate documents quickly.</li><li><strong>Second stage:</strong> The LTR model reranks the top results using multiple signals it learned to combine.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/926c037f84e211b6fb5a44a0849823151ab0fa0a-1712x1852.png" style="width: 100%;" /><p>For a deeper introduction, see <a href="https://www.elastic.co/search-labs/blog/elasticsearch-learning-to-rank-introduction">Introducing Learning To Rank (LTR)</a> in Elasticsearch.</p><h2>The journey from judgment list to model</h2><p>A judgment list tells us which documents should rank highly for a given query. But the model cannot learn directly from document IDs. It needs numerical signals that explain <em>why</em> certain documents are potentially relevant.</p><p>The process works like this:</p><ol><li><strong>Start with judgments.</strong> Query-document pairs with relevance grades; for example, you grade doc1 as a good match for the query “DiCaprio performance”.</li><li><strong>Extract features.</strong> For each query-document pair, compute numerical signals, some about the document alone (for example, popularity), and others about how the query and document interact (for example, BM25 score).</li><li><strong>Train the model.</strong> The model learns which feature patterns predict high grades.</li><li><strong>Deploy. 
</strong>Push the trained model to your Elasticsearch cluster.</li><li><strong>Query.</strong> Use the model to rerank search results.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b43ac8c8f6dd519b28a6ed2b89328b6e76c71e02-1999x1125.png" style="width: 100%;" /><p>The key insight is that features must capture what your judgments are measuring. If your judgment list rewards popular thriller movies but your features only include text-matching scores, the model has no way to learn what makes those documents relevant.</p><h2>What are features?</h2><p><a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr#learning-to-rank-feature-extraction"><em>Features</em></a> are numerical values that describe a query-document pair. In Elasticsearch, we define features using queries that return <a href="https://www.elastic.co/search-labs/blog/elasticsearch-scoring-and-explain-api">scores</a>. There are three types:</p><ul><li><strong><em>Query-document features</em></strong> measure how well a query matches a document. Eland provides the <code>QueryFeatureExtractor</code> utility to define these features; the extractor below computes the BM25 relevance score for each query-document pair:</li></ul><pre><code>QueryFeatureExtractor(
    feature_name="title_bm25",
    query={"match": {"title": "{{query}}"}}
)</code></pre><p>This extracts the BM25 score from the title field for each document relative to the query.</p><ul><li><strong><em>Document features</em></strong> are properties of the document that don’t depend on the query. You can extract these using <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query"><code>script_score</code></a> or <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-function-score-query"><code>function_score</code></a>:</li></ul><pre><code>QueryFeatureExtractor(
    feature_name="popularity",
    query={
        "script_score": {
            "query": {"exists": {"field": "popularity"}},
            "script": {"source": "return doc['popularity'].value;"}
        }
    }
)</code></pre><ul><li><strong><em>Query features</em></strong> describe the query itself, like the number of terms. These are less common but can help the model handle different query types.</li></ul><h2>Designing your feature set</h2><p>Choosing features isn’t random. Each feature should capture a signal that might explain why users prefer certain documents. Let&#x27;s look at the features from the LTR notebook and understand the reasoning:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Feature</td><td style="border: 1px solid #ccc; padding: 4px;">Type</td><td style="border: 1px solid #ccc; padding: 4px;">Purpose</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">Title matches are strong relevance signals. For example, a movie titled Star Wars should rank highly for the query "star wars".</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`actors_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">Some users search by actor name. If they search for "leonardo dicaprio movies", they should get films starring Leonardo DiCaprio.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`title_all_terms_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">This is a stricter version of title matching where all query terms must be present. It helps distinguish between exact matches and partial ones.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`actors_all_terms_bm25`</td><td style="border: 1px solid #ccc; padding: 4px;">Query-document</td><td style="border: 1px solid #ccc; padding: 4px;">This is the same stricter matching logic as described above but applied specifically to actors.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`popularity`</td><td style="border: 1px solid #ccc; padding: 4px;">Document</td><td style="border: 1px solid #ccc; padding: 4px;">Users generally prefer well-known movies over obscure ones when relevance is similar. A popular Star Wars film should rank above a low-budget parody with "Star Wars" in the title.</td>
                    </tr>
              </tbody>
            </table>
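          <p>In code, this feature set becomes a list of extractors bundled into one configuration. The sketch below follows the companion notebook and uses Eland&#x27;s LTR helpers; treat the exact signatures as the notebook defines them:</p><pre><code>from eland.ml.ltr import LTRModelConfig, QueryFeatureExtractor

# One extractor per row of the feature table above; "{{query}}" is the
# placeholder that gets filled with the user's search terms at extraction time.
ltr_config = LTRModelConfig(
    feature_extractors=[
        QueryFeatureExtractor(
            feature_name="title_bm25",
            query={"match": {"title": "{{query}}"}},
        ),
        QueryFeatureExtractor(
            feature_name="title_all_terms_bm25",
            # "operator": "AND" is what makes this the strict variant:
            # every query term must match the title.
            query={"match": {"title": {"query": "{{query}}", "operator": "AND"}}},
        ),
        # ... actors_bm25, actors_all_terms_bm25, and popularity follow the same pattern
    ]
)</code></pre>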
          <p>Notice the strategy behind this feature set:</p><ul><li><strong>Multiple signals for the same concept. </strong>We have both <code>title_bm25</code> (lenient) and <code>title_all_terms_bm25</code> (strict). The lenient version scores any document where at least one query term matches the title, and the strict version requires all the terms to be present. For short queries, the lenient match might be enough, whereas for longer, more specific queries, strict matching might be more important. The model can learn when to rely on each.</li><li><strong>Text features plus quality features. </strong>Text matching alone can return irrelevant documents that happen to contain the right words. The popularity feature lets the model boost well-known, quality content when text scores are similar.</li><li><strong>Coverage for different query types.</strong> Some queries target titles (&quot;star wars&quot;), and others target actors (&quot;dicaprio movies&quot;). Having features for both means that the model can handle diverse searches.</li></ul><p>When designing your own features, ask yourself, &quot;What signals would a human use to decide if this document is relevant?&quot; Those are your candidate features.</p><h2>Building the training dataset</h2><p>Once features are defined, we extract them for every query-document pair in our judgment list. The result is a training dataset where each row contains:</p><ul><li>The query identifier.</li><li>The document identifier.</li><li>The relevance grade (from our judgment list).</li><li>All feature values.</li></ul><p>Here’s a simplified example (the feature value columns are omitted for brevity):</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`query_id`</td><td style="border: 1px solid #ccc; padding: 4px;">`query`</td><td style="border: 1px solid #ccc; padding: 4px;">`doc_id`</td><td style="border: 1px solid #ccc; padding: 4px;">`grade`</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">11</td><td style="border: 1px solid #ccc; padding: 4px;">4</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">12180</td><td style="border: 1px solid #ccc; padding: 4px;">3</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:1</td><td style="border: 1px solid #ccc; padding: 4px;">star wars</td><td style="border: 1px solid #ccc; padding: 4px;">278427</td><td style="border: 1px solid #ccc; padding: 4px;">1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:2</td><td style="border: 1px solid #ccc; padding: 4px;">tom hanks movies</td><td style="border: 1px solid #ccc; padding: 4px;">857</td><td style="border: 1px solid #ccc; padding: 4px;">4</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">qid:2</td><td style="border: 1px solid #ccc; padding: 4px;">tom hanks movies</td><td style="border: 1px solid #ccc; padding: 4px;">13</td><td style="border: 1px solid #ccc; padding: 4px;">3</td>
                    </tr>
              </tbody>
            </table>
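          <p>The feature columns are omitted from the table above for brevity. To populate them, eland's <code>FeatureLogger</code> can replay each judged query against the index and record the extracted scores. Here's a minimal sketch, assuming the <code>es_client</code>, index name, and <code>ltr_config</code> from the companion notebook:</p><pre><code>from eland.ml.ltr import FeatureLogger

# Replays the configured feature extractors for one query against the
# judged documents and returns the feature values for each document.
feature_logger = FeatureLogger(es_client, "movies", ltr_config)

doc_features = feature_logger.extract_features(
    query_params={"query": "star wars"},
    doc_ids=["11", "12180", "278427"],
)
# A feature that doesn't match (e.g., actors_bm25 for this query)
# comes back empty and ends up as NaN in the training dataset.</code></pre>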
          <p>A few things to notice:</p><p><strong>NaN values are normal.</strong> When a query doesn’t match a field, the feature returns no score. The movie <em>Star Wars</em> has a high <code>title_bm25</code> but no <code>actors_bm25</code> because the query &quot;star wars&quot; doesn’t match any actor names.</p><p><strong>Queries are grouped during training.</strong> The <code>query_id</code> column tells the model which documents to compare against each other. For &quot;star wars&quot;, it learns that document 11 (grade 4) should rank above document 278427 (grade 1).</p><p>But here’s the important part: The model doesn’t memorize these specific queries. Instead, it learns general patterns, like &quot;documents with high <code>title_bm25</code> AND high <code>popularity</code> tend to have high grades.&quot; When presented with a new query, the model applies these learned patterns to rank the results.</p><p><strong>Features must explain grade differences.</strong> Look at qid:1: The grade 4 document has a higher <code>title_bm25</code> and higher <code>popularity</code> than the grade 1 document. These patterns are what the model learns.</p><h2>Training the LTR model</h2><p>With the training dataset prepared, we train an <a href="https://xgboost.readthedocs.io/en/stable/">XGBoost</a> model with a ranking objective. The model builds decision trees that learn patterns like:</p><ul><li>&quot;If <code>title_bm25</code> &gt; 10 and <code>popularity</code> &gt; 50, predict high relevance.&quot;</li><li>&quot;If <code>title_bm25</code> is missing but <code>actors_bm25</code> &gt; 12, still predict moderate relevance.&quot;</li></ul><p>Here&#x27;s how the training process works in practice:</p><pre><code>from xgboost import XGBRanker
from sklearn.model_selection import GroupShuffleSplit

# Create the ranker model:
ranker = XGBRanker(
    objective="rank:ndcg",
    eval_metric=["ndcg@10"],
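    # Stop training when ndcg@10 on the eval set stops improving for 20 rounds.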
    early_stopping_rounds=20,
)

# Shape the training and eval data into the expected format.
X = judgments_with_features[ltr_config.feature_names]
y = judgments_with_features["grade"]
groups = judgments_with_features["query_id"]

# Split the dataset into two parts, used for training and evaluation respectively.
group_preserving_splitter = GroupShuffleSplit(n_splits=1, train_size=0.7).split(
    X, y, groups
)
train_idx, eval_idx = next(group_preserving_splitter)

train_features, eval_features = X.loc[train_idx], X.loc[eval_idx]
train_target, eval_target = y.loc[train_idx], y.loc[eval_idx]
train_query_groups, eval_query_groups = groups.loc[train_idx], groups.loc[eval_idx]

# Training the model
ranker.fit(
    X=train_features,
    y=train_target,
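    # XGBRanker's `group` expects one size per query, in the same order as
    # the rows of X, so rows must be contiguous and sorted by query_id.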
    group=train_query_groups.value_counts().sort_index().values,
    eval_set=[(eval_features, eval_target)],
    eval_group=[eval_query_groups.value_counts().sort_index().values],
    verbose=True,
)</code></pre><p>During training, the model tries different combinations of these rules and measures how well the resulting rankings match your judgment grades. It uses a metric called <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain (NDCG)</a> to score itself. A perfect NDCG of 1.0 means that the model&#x27;s ranking exactly matches your judgments. Lower scores mean that some relevant documents are ranking below where they should be.</p><p>The training also uses a technique called <a href="https://xgboosting.com/early-stopping/">early stopping</a>. If the model&#x27;s score stops improving for several rounds, training halts automatically. This prevents the model from memorizing the training data too closely, which would hurt its ability to generalize to new queries.</p><p>The <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">companion notebook</a> contains the complete training code.</p><h2>Understanding what your LTR model learned</h2><p>After training, XGBoost can show you which features the model relied on most. You can generate a feature importance chart using XGBoost&#x27;s built-in visualization:</p><pre><code>from xgboost import plot_importance

plot_importance(ranker, importance_type="weight")</code></pre><p>The <code>importance_type=&quot;weight&quot;</code> parameter shows how often each feature was used in tree splits. Here’s the resulting chart:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/67608c24c50a8775b6e70fa7f522d7b58cefc764-734x460.png" style="width: 100%;" /><p>The F score counts how many times each feature was used to make split decisions across all trees in the model. Higher values mean that the model relied on that feature more often.</p><p>In this example:</p><ul><li><strong><code>popularity</code> (2178):</strong> The most important feature. The model frequently uses popularity to separate relevant from nonrelevant documents.</li><li><strong><code>title_bm25</code> (1642):</strong> Second-most important. Title matches matter a lot for movie searches.</li><li><strong><code>actors_bm25</code> (565):</strong> Moderately important. This is useful for queries that mention actors.</li><li><strong><code>title_all_terms_bm25</code> (211):</strong> Occasionally useful. The stricter matching helps for some queries.</li><li><strong><code>actors_all_terms_bm25</code> (63):</strong> Rarely used. The model found this feature less predictive.</li></ul><p>This chart helps you iterate on your feature set. If a feature that you expected to be important shows near-zero importance, investigate why. Maybe the feature extraction is not working as intended, or maybe that signal doesn’t actually predict relevance in your judgment data.</p><h2>Deploying and using the LTR model</h2><p>Once trained, upload the model to Elasticsearch using Eland:</p><pre><code>MLModel.import_ltr_model(
    es_client=es_client,
    model=ranker,
    model_id="ltr-model-xgboost",
    ltr_model_config=ltr_config,
    es_if_exists="replace",
)</code></pre><p>Once uploaded, the model can be used as a <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/rescorer-retriever">rescorer retriever</a> to be combined with other <a href="https://www.elastic.co/docs/solutions/search/retrievers-overview">retrievers</a> for multistage search pipelines:</p><pre><code>GET movies/_search
{
  "retriever": {
    "rescorer": {
      "rescore": {
        "window_size": 50,
        "learning_to_rank": {
          "model_id": "ltr-model-xgboost",
          "params": {
            "query": "star wars"
          }
        }
      },
      "retriever": {
        "standard": {
          "query": {
            "multi_match": {
              "fields": ["title", "overview", "actors", "director", "tags", "characters"],
              "query": "star wars"
            }
          }
        }
      }
    }
  }
}</code></pre><p>Response (simplified):</p><pre><code> "hits": {
    "total": {
      "value": 852,
      "relation": "eq"
    },
    "max_score": 25.165691,
    "hits": [
      {
        "_index": "movies",
        "_id": "11",
        "_score": 25.165691,
        "_source": {
          "title": "Star Wars"
        }
      },
      {
        "_index": "movies",
        "_id": "12180",
        "_score": 25.092865,
        "_source": {
          "title": "Star Wars: The Clone Wars"
        }
      },
      {
        "_index": "movies",
        "_id": "181812",
        "_score": 23.456198,
        "_source": {
          "title": "Star Wars: The Rise of Skywalker"
        }
      },
      {
        "_index": "movies",
        "_id": "140607",
        "_score": 23.320757,
        "_source": {
          "title": "Star Wars: The Force Awakens"
        }
      },
...</code></pre><p>The first-stage query retrieves candidates using BM25. The LTR model then reranks the top 50 results using all the features it learned to weight.</p><p>For comparison, the <code>multi_match</code> query alone returns some less relevant results in the top positions, which the LTR rescorer corrects:</p><pre><code>{
  "hits": [
    {
      "_index": "movies",
      "_id": "11",
      "_score": 10.971989,
      "_source": {
        "title": "Star Wars"
      }
    },
    {
      "_index": "movies",
      "_id": "12180",
      "_score": 9.923633,
      "_source": {
        "title": "Star Wars: The Clone Wars"
      }
    },
    {
      "_index": "movies",
      "_id": "1022100",
      "_score": 8.9880295,
      "_source": {
        "title": "Andor: A Disney+ Day Special Look"
      }
    },
    {
      "_index": "movies",
      "_id": "278427",
      "_score": 8.845748,
      "_source": {
        "title": "Family Guy Presents: It's a Trap!"
      }
    },
    ...
  ]
}</code></pre><h2>Conclusion</h2><p>The path from judgment lists to a working LTR model involves three key steps: designing features that capture relevance signals, building a training dataset that pairs those features with your judgment grades, and training a model that learns the patterns.</p><p>Our <a href="https://www.elastic.co/search-labs/blog/judgment-lists-search-query-relevance-elasticsearch">previous article</a> becomes the starting point for this process. Your grades define what &quot;relevant&quot; means and how to measure it, and your features give the model the signals to predict it.</p><p>For the complete implementation with a dataset of 9,750 movies and 384,755 judgment rows, see the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/08-learning-to-rank.ipynb">LTR notebook</a>. For advanced use cases, like personalized search, see <a href="https://www.elastic.co/search-labs/blog/personalized-search-elasticsearch-ltr">Personalized search with LTR</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Elasticsearch runtime fields to ES|QL: Adapting legacy tools to current techniques]]></title>
            <description><![CDATA[Learn how to migrate five common Elasticsearch runtime field patterns to their ES|QL equivalents, with side-by-side code comparisons and guidance on when each approach makes sense.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-runtime-fields-to-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-runtime-fields-to-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/657ec44d182de78e6ddabb6632c6844b5a36774d-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields">runtime fields</a> solve the problem of computing values at query time without <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex">reindexing</a>. But they come with <a href="https://www.elastic.co/docs/reference/scripting-languages/painless/painless">Painless scripting</a> complexity and performance costs that scale with document count. <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language (ES|QL)</a> offers a more powerful alternative with a dedicated execution engine, pipeline processing, and no scripting required. In this article, you’ll learn how to map five common runtime field patterns to their ES|QL equivalents, so you can modernize your queries and understand when each approach makes sense.</p><h2>Prerequisites</h2><ul><li>Elasticsearch 8.15+ (for <code>::</code> cast operator support; core ES|QL features available from 8.11)</li></ul><h2>Runtime fields versus ES|QL</h2><p>Runtime fields were introduced in Elasticsearch 7.11 as a way to define fields at query time. Instead of reindexing data, you could write a Painless script that computes values on the fly:</p><pre><code>PUT my-index/_mapping
{
  "runtime": {
    "full_address": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['address'].value + ':' + doc['port'].value)"
      }
    }
  }
}</code></pre><p>This works, but comes with trade-offs:</p><ul><li><strong>Painless scripting overhead:</strong> Every runtime field requires scripting knowledge, and the <a href="https://www.elastic.co/docs/reference/scripting-languages/painless/painless-language-specification">syntax</a> is Java-like, not query-like.</li><li><strong>Performance cost:</strong> Runtime fields evaluate per document at query time. Elasticsearch classifies them as &quot;expensive queries&quot; that <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields#runtime-compromises">can be rejected</a> by cluster settings.</li><li><strong>Isolated computation:</strong> Each runtime field computes independently. There’s no way to chain transforms or use the output of one field in another within the same query.</li></ul><p>ES|QL changes the equation. It has its own execution engine (not translated to Query DSL), runs queries concurrently across nodes, and provides a complete toolkit for field computation: <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/eval"><code>EVAL</code></a>, <a href="http://elastic.co/docs/reference/query-languages/esql/commands/grok"><code>GROK</code></a>, <a href="http://elastic.co/docs/reference/query-languages/esql/commands/dissect"><code>DISSECT</code></a>, type casting, and pipeline chaining.</p><p>Let&#x27;s see how each runtime field pattern maps to ES|QL.</p><h2>Setting up the example data</h2><p>All the code snippets in this article can be executed in the Kibana <a href="https://www.elastic.co/docs/explore-analyze/query-filter/tools/console">Dev Tools console</a>.</p><p>To follow along, create a sample index with data that exercises all five patterns. This simulates a server logs scenario with mixed field types, raw messages, and some intentional data quality issues:</p><pre><code>PUT server-logs
{
  "mappings": {
    "properties": {
      "host": { "type": "keyword" },
      "port": { "type": "keyword" },
      "raw_message": { "type": "text" },
      "response_time": { "type": "keyword" },
      "status_code": { "type": "keyword" },
      "region": { "type": "keyword" }
    }
  }
}</code></pre><p>Now index some sample documents:</p><pre><code>POST _bulk
{ "index": { "_index": "server-logs" } }
{ "host": "web-01", "port": "8080", "raw_message": "2024-01-15 INFO user=alice action=login duration=230ms", "response_time": "145", "status_code": "200", "region": "us-east" }
{ "index": { "_index": "server-logs" } }
{ "host": "web-02", "port": "443", "raw_message": "2024-01-15 ERROR user=bob action=upload duration=1200ms", "response_time": "not_available", "status_code": "500", "region": "eu-west" }
{ "index": { "_index": "server-logs" } }
{ "host": "api-01", "port": "3000", "raw_message": "2024-01-15 WARN user=charlie action=query duration=890ms", "response_time": "890", "status_code": "200", "region": "us-east" }
{ "index": { "_index": "server-logs" } }
{ "host": "api-02", "port": "3000", "raw_message": "2024-01-16 INFO user=diana action=export duration=3400ms", "response_time": "3400", "status_code": "200", "region": "ap-south" }
{ "index": { "_index": "server-logs" } }
{ "host": "web-01", "port": "8080", "raw_message": "2024-01-16 ERROR user=eve action=login duration=50ms", "response_time": "50", "status_code": "401", "region": "US-EAST" }
</code></pre><p>Notice that <code>response_time</code> is stored as a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/keyword">keyword</a> (a common real-world mistake), and the last document has <code>&quot;US-EAST&quot;</code> instead of <code>&quot;us-east&quot;</code> (a data quality issue we’ll fix later).</p><h2>Pattern 1: Field concatenation</h2><p>A common runtime field use case is combining two fields into one. For example, creating a <code>host:port</code> identifier.</p><h3>The runtime field approach</h3><p>You can define it inline at query time. Query-time approach avoids modifying the mapping, but you still need Painless scripting, scoping it to a single search request:</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "endpoint": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['host'].value + ':' + doc['port'].value)"
      }
    }
  },
  "fields": ["endpoint"],
  "_source": false
}</code></pre><h3>The ES|QL approach</h3><p>You can run ES|QL queries using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-esql-query"><code>_query API</code></a> endpoint:</p><pre><code>POST _query
{
  "query": """
    FROM server-logs
    | EVAL endpoint = CONCAT(host, ":", port)
    | KEEP host, port, endpoint
    | LIMIT 1
  """
}</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "host", "type": "keyword" },
    { "name": "port", "type": "keyword" },
    { "name": "endpoint", "type": "keyword" }
  ],
  "values": [
    ["web-01", "8080", "web-01:8080"]
  ]
}</code></pre><p><code>CONCAT</code> accepts two or more arguments and always returns a <code>keyword</code>.</p><p><em>Note: For brevity, the remaining ES|QL examples in this article show just the query. Wrap them in <code>POST _query { &quot;query&quot;: &quot;...&quot; }</code> to run them in Kibana Dev Tools.</em></p><h4>When to use</h4><p>If you need <code>endpoint</code> to persist across all queries and be available in Kibana dashboards, use a mapping-level runtime field. If you need it for a single search request within Query DSL, use a query-time runtime field. If you need it for ad-hoc analysis or exploratory work, ES|QL is simpler.</p><h2>Pattern 2: Data extraction from unstructured text</h2><p>Extracting structured data from raw log messages is another classic runtime field pattern.</p><h3>The runtime field approach</h3><p>Painless uses Java&#x27;s regex <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html">Matcher</a> class:</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "log_user": {
      "type": "keyword",
      "script": {
        "source": "def matcher = /user=(\\w+)/.matcher(params._source['raw_message']); if (matcher.find()) { emit(matcher.group(1)); }"
      }
    }
  },
  "fields": ["log_user"],
  "_source": false
}</code></pre><p>This is verbose. You need to know <a href="https://www.elastic.co/docs/explore-analyze/scripting/modules-scripting-regular-expressions-tutorial">Painless regex syntax</a>, handle the <code>Matcher</code> object, and call <code>emit()</code> correctly.</p><h3>The ES|QL approach: GROK</h3><p>ES|QL provides two purpose-built commands for text extraction. <code>GROK</code> uses regex-based patterns:</p><pre><code>FROM server-logs
| GROK raw_message "%{NOTSPACE:timestamp_date} %{WORD:log_level} user=%{WORD:user} action=%{WORD:action} duration=%{WORD:duration}"
| KEEP user, log_level, action, duration</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "user", "type": "keyword" },
    { "name": "log_level", "type": "keyword" },
    { "name": "action", "type": "keyword" },
    { "name": "duration", "type": "keyword" }
  ],
  "values": [
    ["alice", "INFO", "login", "230ms"], ...
  ]
}</code></pre><p><code>GROK</code> uses the <code>%{SYNTAX:SEMANTIC}</code> pattern format. It extracts multiple fields in a single and readable command.</p><h3>The ES|QL approach: DISSECT</h3><p>For structured data with consistent delimiters, <code>DISSECT</code> is faster because it doesn’t use regular expressions:</p><pre><code>FROM server-logs
| DISSECT raw_message "%{timestamp_date} %{log_level} user=%{user} action=%{action} duration=%{duration}"
| KEEP user, log_level, action, duration</code></pre><p>The syntax is nearly identical to <code>GROK</code>, but <code>DISSECT</code> works by splitting on delimiters rather than matching regex patterns. This makes it faster for data that follows a consistent format.</p><h4>When to use GROK vs DISSECT</h4><p>Use <code>DISSECT</code> when your data has a predictable structure (same delimiters, same field order). Use <code>GROK</code> when you need regex flexibility, for example when fields may be optional or formats vary.</p><h2>Pattern 3: Dynamic type conversion</h2><p>When a field is mapped as <code>keyword</code> but contains numeric data (a surprisingly common scenario), runtime fields can cast it at query time.</p><h3>The runtime field approach</h3><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "response_time_long": {
      "type": "long",
      "script": {
        "source": """
          def val = doc['response_time'].value;
          if (val != 'not_available') {
            emit(Long.parseLong(val));
          }
        """
      }
    }
  },
  "fields": ["response_time_long"],
  "_source": false
}</code></pre><p>You need to handle parsing exceptions manually. If <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/Long.html#parseLong-java.lang.String-"><code>Long.parseLong</code></a> fails on an unexpected value, the script throws an error.</p><h3>The ES|QL approach</h3><p>ES|QL provides explicit conversion functions and a shorthand cast operator:</p><pre><code>FROM server-logs
| EVAL response_ms = TO_LONG(response_time)
| KEEP host, response_time, response_ms</code></pre><p>Or with the <code>::</code> cast operator (<a href="https://www.elastic.co/search-labs/blog/esql-timeline-of-improvements">available since 8.15</a>):</p><pre><code>FROM server-logs
| EVAL response_ms = response_time::long
| KEEP host, response_time, response_ms</code></pre><p>Response:</p><pre><code>{
  "columns": [
    { "name": "host", "type": "keyword" },
    { "name": "response_time", "type": "keyword" },
    { "name": "response_ms", "type": "long" }
  ],
  "values": [
    ["web-01", "145", 145]
  ]
}</code></pre><p>Both produce the same result. The key difference from Painless: <strong>Failed conversions return <code>null</code> instead of throwing exceptions</strong>. The document with <code>&quot;not_available&quot;</code> simply gets <code>null</code> for <code>response_ms</code>, and ES|QL emits a warning.</p><p>Common conversion functions include:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Function</td><td style="border: 1px solid #ccc; padding: 4px;">Converts to</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_LONG()`</td><td style="border: 1px solid #ccc; padding: 4px;">Long integer</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_INTEGER()`</td><td style="border: 1px solid #ccc; padding: 4px;">Integer</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_DOUBLE()`</td><td style="border: 1px solid #ccc; padding: 4px;">Double</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_DATETIME()`</td><td style="border: 1px solid #ccc; padding: 4px;">Date</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_BOOLEAN()`</td><td style="border: 1px solid #ccc; padding: 4px;">Boolean</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_IP()`</td><td style="border: 1px solid #ccc; padding: 4px;">IP address</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`TO_VERSION()`</td><td style="border: 1px solid #ccc; padding: 4px;">Version</td>
                    </tr>
              </tbody>
            </table>
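          <p>As a further illustration on the sample index, the other numeric <code>keyword</code> fields convert the same way, with either a function or the cast operator:</p><pre><code>FROM server-logs
| EVAL port_num = TO_INTEGER(port), status = status_code::integer
| KEEP host, port_num, status</code></pre>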
          <p>The <code>::</code> operator works with all these types (for example, <code>field::double</code>, <code>field::datetime</code>).</p><h4>When to use</h4><p>ES|QL&#x27;s graceful null handling makes it safer for dirty data. Runtime fields with Painless give you fine-grained control over error handling but require more code. For type conversion specifically, ES|QL is almost always the better choice.</p><h2>Pattern 4: <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/dynamic-field-mapping">Dynamic field</a> handling</h2><p>Runtime fields support <code>&quot;dynamic&quot;: &quot;runtime&quot;</code> in mappings, which prevents <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion">mapping explosion</a> by creating all new fields as runtime fields instead of indexed fields:</p><pre><code>{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}</code></pre><p>Any new field sent to this index becomes a runtime field automatically. This is useful when you ingest semi-structured data with unpredictable field names.</p><h3>Where ES|QL fits</h3><p>ES|QL provides query-time flexibility, but it still needs fields to be visible in the mapping. This is where runtime fields and ES|QL complement each other rather than compete.</p><p>If a field exists in <code>_source</code> but isn’t mapped, ES|QL cannot access it directly. The current workaround is to define a runtime field to make the unmapped field visible:</p><pre><code>PUT dynamic-logs/_mapping
{
  "runtime": {
    "custom_field": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['custom_field'])"
      }
    }
  }
}</code></pre><p>Once defined, ES|QL can query it:</p><pre><code>FROM dynamic-logs
| WHERE custom_field == "some_value"
| KEEP timestamp, custom_field</code></pre><p>This is one scenario where runtime fields remain essential. They act as a bridge, making unmapped data accessible to ES|QL.</p><h2>Pattern 5: Field shadowing for error correction</h2><p>Runtime fields can shadow (override) indexed fields by defining a runtime field with the same name as an existing field. This is useful for correcting data without reindexing.</p><h3>The runtime field approach</h3><p>Remember our data quality issue, where <code>region</code> has inconsistent casing (<code>&quot;US-EAST&quot;</code> versus <code>&quot;us-east&quot;</code>)?</p><pre><code>GET server-logs/_search
{
  "runtime_mappings": {
    "region": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['region'].toLowerCase())"
      }
    }
  },
  "fields": ["region"],
  "_source": false
}</code></pre><p>This overrides the indexed <code>region</code> field for all queries. Every search, aggregation, and Kibana visualization will see the lowercase version.</p><h3>The ES|QL approach</h3><p>In ES|QL, the same correction is a one-line <code>EVAL</code>:</p><pre><code>FROM server-logs
| EVAL region = TO_LOWER(region)
| KEEP host, port, region</code></pre><p>When you use <code>EVAL</code> with an existing column name, ES|QL drops the original column and replaces it with the computed value. This is the exact equivalent of field shadowing, but scoped to the current query.</p><p>You can also chain multiple corrections in a pipeline:</p><pre><code>FROM server-logs
| EVAL region = TO_LOWER(region)
| EVAL region = CASE(region == "us-east", "US East", region == "eu-west", "EU West", region == "ap-south", "AP South", region)
| KEEP host, region</code></pre><h4>When to use</h4><p>If the correction should apply to all queries and <a href="https://www.elastic.co/kibana/kibana-dashboard">Kibana dashboards</a>, use runtime field shadowing. If you need to correct data for a specific analysis, ES|QL is more flexible since you can apply different transformations in different queries without modifying the mapping.</p><h2>The ES|QL pipeline advantage: Going beyond runtime fields</h2><p>This is where ES|QL fundamentally surpasses runtime fields. Runtime fields are isolated: each one computes independently, and you cannot use the output of one runtime field as input for another in the same query.</p><p>ES|QL pipelines chain transforms. Here’s a single query that combines multiple patterns:</p><pre><code>FROM server-logs
| GROK raw_message "%{NOTSPACE:log_date} %{WORD:log_level} user=%{WORD:user} action=%{WORD:action} duration=%{INT:duration_raw}ms"
| EVAL duration_ms = duration_raw::long
| EVAL region = TO_LOWER(region)
| WHERE log_level == "ERROR" AND duration_ms > 100
| STATS avg_duration = AVG(duration_ms), error_count = COUNT(*) BY region</code></pre><p>This single query:</p><ul><li><strong>Extracts</strong> fields from raw text (<code>GROK</code>).</li><li><strong>Converts</strong> the duration to a number (<code>EVAL</code> with cast).</li><li><strong>Normalizes</strong> region casing (<code>EVAL</code> with <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions/to_lower"><code>TO_LOWER</code></a>).</li><li><strong>Filters</strong> for errors with high duration (<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where"><code>WHERE</code></a>).</li><li><strong>Aggregates</strong> by region (<a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/stats-by"><code>STATS</code></a>).</li></ul><p>To achieve the same result with runtime fields, you would need to define at least three separate runtime fields (for extraction, conversion, and normalization) and then write a Query DSL query with <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/filter-search-results">filters</a> and <a href="https://www.elastic.co/docs/explore-analyze/query-filter/aggregations">aggregations</a>. The ES|QL version is a single, readable pipeline.</p><p>You can even use expressions directly inside aggregations:</p><pre><code>FROM server-logs
| EVAL response_ms = response_time::long
| STATS
    avg_response = AVG(response_ms),
    p95_response = PERCENTILE(response_ms, 95),
    slow_count = COUNT(CASE(response_ms > 1000, 1, null))
  BY host</code></pre><h2>Conclusion</h2><p>What we covered:</p><ul><li>ES|QL provides a full toolkit (<code>EVAL</code>, <code>GROK</code>, <code>DISSECT</code>, type casting with <code>::</code>) that replaces most runtime field patterns without any Painless scripting.</li><li>Failed type conversions in ES|QL return <code>null</code> instead of throwing exceptions, making it safer for real-world data.</li><li>Pipeline processing (chaining <code>GROK</code> into <code>EVAL</code> into <code>WHERE</code> into <code>STATS</code>) goes beyond what runtime fields can do in isolation.</li><li>Runtime fields remain valuable for persistent computed fields, field shadowing across all queries, and as a bridge for unmapped data in ES|QL.</li></ul><p>One important caveat: Both runtime fields and ES|QL compute values at query time, which means they pay the cost on every query. If you find yourself applying the same transformation repeatedly (type corrections, field extraction, data normalization), consider using <a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">ingest pipelines</a> to fix the data at index time instead. Ingest pipelines let you parse, enrich, and transform documents before they’re stored, so queries can work with clean, properly typed fields directly. Runtime fields and ES|QL are great for exploration and ad-hoc analysis, but for production workloads, indexing the right data from the start is almost always the better choice.</p><p><strong>The key takeaway: </strong>Runtime fields aren’t deprecated, and they aren’t going away. But for most query-time computation patterns, ES|QL offers a simpler, more powerful, and more performant approach. And when the transformation is known up front, an ingest pipeline is the most efficient option of all.</p><h2>Next steps</h2><ul><li><a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL documentation</a></li><li><a href="https://www.elastic.co/docs/manage-data/data-store/mapping/runtime-fields">Runtime fields reference</a></li><li><a href="https://www.elastic.co/search-labs/blog/esql-timeline-of-improvements">ES|QL timeline of improvements</a></li><li><a href="https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields">Getting started with runtime fields</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-process-data-with-dissect-grok">ES|QL processing data with DISSECT and GROK</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating an Elasticsearch MCP server with TypeScript]]></title>
            <description><![CDATA[Learn how to create an Elasticsearch MCP server with TypeScript and Claude Desktop.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0bb24c05fff391f42070c2883182ea6fe9cb9680-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When working with large knowledge bases in Elasticsearch, finding information is only half the battle. Engineers often need to synthesize results from multiple documents, generate summaries, and trace answers back to their sources. Model Context Protocol (MCP) provides a standardized way to connect Elasticsearch with large language model–powered (LLM-powered) applications to accomplish this. While Elastic offers official solutions, like Elastic Agent Builder (which includes an <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP endpoint</a> among its features), building a custom MCP server gives you full control over search logic, result formatting, and how retrieved content is passed to an LLM for synthesis, summaries, and citations.</p><p>In this article, we’ll explore the benefits of building a custom Elasticsearch MCP server and show how to create one in TypeScript that connects Elasticsearch to LLM-powered applications.</p><h2>Why build a custom Elasticsearch MCP server?</h2><p>Elastic provides some alternatives for <a href="https://www.elastic.co/docs/solutions/search/mcp">MCP servers</a>:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">Elastic Agent Builder MCP server for Elasticsearch 9.2+</a></li><li><a href="https://github.com/elastic/mcp-server-elasticsearch?tab=readme-ov-file#elasticsearch-mcp-server">Elasticsearch MCP server for older versions (Python)</a></li></ul><p>If you need more control over how your MCP server interacts with Elasticsearch, building your own custom server gives you the flexibility to tailor it exactly to your needs. For example, Agent Builder&#x27;s MCP endpoint is limited to Elasticsearch Query Language (ES|QL) queries, while a custom server allows you to use the full Query DSL. You also gain control over how results are formatted before being passed to the LLM and can integrate additional processing steps, like the OpenAI-powered summarization we&#x27;ll implement in this tutorial.</p><p>By the end of this article, you’ll have an MCP server in TypeScript that searches for information stored in an Elasticsearch index, summarizes it, and provides citations. We&#x27;ll use Elasticsearch for retrieval, OpenAI&#x27;s <code>gpt-4o-mini</code> model to summarize and generate citations, and Claude Desktop as the MCP client and UI to take in user queries and give responses. The end result is an internal knowledge assistant that helps engineers discover and synthesize best practices across their organization’s technical docs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec5771a874cf9740d4cac6888622cbe8cd6aede7-1999x1133.png" style="width: 100%;" /><h2>Prerequisites:</h2><ul><li>Node.js 20 +</li><li>Elasticsearch</li><li>OpenAI API key</li><li>Claude Desktop</li></ul><h3>What is MCP?</h3><p><a href="https://www.elastic.co/what-is/mcp">MCP</a> is an open standard, created by <a href="https://www.anthropic.com/news/model-context-protocol">Anthropic</a>, that provides secure, bidirectional connections between LLMs and external systems, like Elasticsearch. 
You can read more about the current state of MCP in <a href="https://www.elastic.co/search-labs/blog/mcp-current-state">this article</a>.</p><p>The MCP landscape is <a href="https://www.elastic.co/search-labs/blog/mcp-current-state#mcp-project-updates:-transport,-elicitation,-and-structured-tooling">evolving every day</a>, with servers available for a wide range of use cases. On top of that, it’s easy to build your own custom MCP server, as we’ll show in this article.</p><h3>MCP clients</h3><p>There’s a long <a href="https://modelcontextprotocol.io/clients">list of available MCP clients</a>, each with its own characteristics and limitations. For simplicity and popularity, we’ll use <a href="https://claude.ai/download">Claude Desktop</a> as our MCP client. It will serve as the chat interface where users can ask questions in natural language, and it will automatically invoke the tools exposed by our MCP server to search documents and generate summaries.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66eb0b11473347b6cf2d85718251eeac38d6249d-1999x1491.png" style="width: 100%;" /><h2>Creating an Elasticsearch MCP server</h2><p>Using the <a href="https://github.com/modelcontextprotocol/typescript-sdk">TypeScript SDK</a>, we can easily create a server that understands how to query our Elasticsearch data based on a user query input.</p><p>Here are the steps in this article to integrate the Elasticsearch MCP server with the Claude Desktop client:</p><ol><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#configure-mcp-server-for-elasticsearch">Configure MCP server for Elasticsearch.</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#load-the-mcp-server-into-claude-desktop">Load the MCP server into Claude Desktop.</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-mcp-server-typescript-claude#test-it-out">Test it out.</a></li></ol><h3>Configure MCP server for Elasticsearch</h3><p>To begin, let&#x27;s initialize a node application:</p><pre><code>npm init -y</code></pre><p>This will create a <code>package.json</code> file, and with it, we can start installing the necessary dependencies for this application.</p><pre><code>npm install @elastic/elasticsearch @modelcontextprotocol/sdk openai zod && npm install --save-dev ts-node @types/node typescript</code></pre><ul><li><strong>@elastic/elasticsearch</strong> will give us access to the Elasticsearch Node.js library.</li><li><strong>@modelcontextprotocol/sdk</strong> provides the core tools to create and manage an MCP server, register tools, and handle communication with MCP clients.</li><li><strong>openai</strong> allows interaction with OpenAI models to generate summaries or natural language responses.</li><li><strong><a href="https://zod.dev/">zod</a> </strong>helps define and validate structured schemas for input and output data in each tool.</li></ul><p><code>ts-node</code>, <code>@types/node</code>, and <code>typescript</code> will be used during development to type the code and compile the scripts.</p><h4>Set up the dataset</h4><p>To provide the data that Claude Desktop can query using our MCP server, we’ll use a mock <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/dataset.json">internal knowledge base dataset</a>. Here’s what a document from this dataset will look like:</p><pre><code>{
    "id": 5,
    "title": "Logging Standards for Microservices",
    "content": "Consistent logging across microservices helps with debugging and tracing. Use structured JSON logs and include request IDs and timestamps. Avoid logging sensitive information. Centralize logs in Elasticsearch or a similar system. Configure log rotation to prevent storage issues and ensure logs are searchable for at least 30 days.",
    "tags": ["logging", "microservices", "standards"]
}</code></pre><p>To ingest the data, we prepared a script that creates an index in Elasticsearch and loads the dataset into it. You can find it <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/setup.ts">here</a>.</p><h4>MCP server</h4><p>Create a file named <a href="https://github.com/Delacrobix/typescript-elasticsearch-mcp/blob/main/index.ts"><code>index.ts</code></a> and add the following code to import the dependencies and handle environment variables:</p><pre><code>// index.ts
import { z } from "zod";
import { Client } from "@elastic/elasticsearch";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import OpenAI from "openai";

const ELASTICSEARCH_ENDPOINT =
  process.env.ELASTICSEARCH_ENDPOINT ?? "http://localhost:9200";
const ELASTICSEARCH_API_KEY = process.env.ELASTICSEARCH_API_KEY ?? "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY ?? "";
const INDEX = "documents";</code></pre><p>Also, let’s initialize the clients to handle the Elasticsearch and OpenAI calls:</p><pre><code>const openai = new OpenAI({
  apiKey: OPENAI_API_KEY,
});

const _client = new Client({
  node: ELASTICSEARCH_ENDPOINT,
  auth: {
    apiKey: ELASTICSEARCH_API_KEY,
  },
});</code></pre><p>To make our implementation more robust and ensure structured input and output, we&#x27;ll define schemas using <a href="https://zod.dev/"><code>zod</code></a>. This allows us to validate data at runtime, catch errors early, and make the tool responses easier to process programmatically:</p><pre><code>const DocumentSchema = z.object({
  id: z.number(),
  title: z.string(),
  content: z.string(),
  tags: z.array(z.string()),
});

const SearchResultSchema = z.object({
  id: z.number(),
  title: z.string(),
  content: z.string(),
  tags: z.array(z.string()),
  score: z.number(),
});

type Document = z.infer<typeof DocumentSchema>;
type SearchResult = z.infer<typeof SearchResultSchema>;</code></pre><p>Learn more about structured outputs <a href="https://www.elastic.co/search-labs/blog/structured-outputs-elasticsearch-guide">here</a>.</p><p>Now let’s initialize the MCP server:</p><pre><code>const server = new McpServer({
  name: "Elasticsearch RAG MCP",
  description:
    "A RAG server using Elasticsearch. Provides tools for document search, result summarization, and source citation.",
  version: "1.0.0",
});</code></pre><h4>Defining the MCP tools</h4><p>With everything configured, we can start writing the tools that will be exposed by our MCP server. This server exposes two tools:</p><ul><li><strong><code>search_docs</code>: </strong>Searches for documents in Elasticsearch using full-text search.</li><li><strong><code>summarize_and_cite</code>:</strong> Summarizes and synthesizes information from previously retrieved documents to answer a user question. This tool also adds citations referencing the source documents.</li></ul><p>Together, these tools form a simple “retrieve-then-summarize” workflow, where one tool fetches relevant documents and the other uses those documents to generate a summarized, cited response.</p><h4>Tool response format</h4><p>Each tool can accept arbitrary input parameters, but it must respond with the following structure:</p><ul><li><strong>Content:</strong> This is the response of the tool in an unstructured format. This field is usually used to return text, images, audio, links, or embeddings. For this application, it will be used to return formatted text with the information generated by the tools.</li><li><strong>structuredContent: </strong>This is an optional return used to provide the results of each tool in a structured format. This is useful for programmatic purposes. Although it isn&#x27;t used in this MCP server, it can be useful if you want to develop other tools or process the results programmatically.</li></ul><p>With that structure in mind, let’s dive into each tool in detail.</p><h4>Search_docs tool</h4><p>This tool performs a <a href="https://www.elastic.co/docs/solutions/search/full-text">full-text search</a> in the Elasticsearch index to retrieve the most relevant documents based on the user query. It highlights key matches and provides a quick overview with relevance scores.</p><pre><code>server.registerTool(
  "search_docs",
  {
    title: "Search Documents",
    description:
      "Search for documents in Elasticsearch using full-text search. Returns the most relevant documents with their content, title, tags, and relevance score.",
    inputSchema: {
      query: z
        .string()
        .describe("The search query terms to find relevant documents"),
      max_results: z
        .number()
        .optional()
        .default(5)
        .describe("Maximum number of results to return"),
    },
    outputSchema: {
      results: z.array(SearchResultSchema),
      total: z.number(),
    },
  },
  async ({ query, max_results }) => {
    if (!query) {
      return {
        content: [
          {
            type: "text",
            text: "Query parameter is required",
          },
        ],
        isError: true,
      };
    }

    try {
      const response = await _client.search({
        index: INDEX,
        size: max_results,
        query: {
          bool: {
            must: [
              {
                multi_match: {
                  query: query,
                  fields: ["title^2", "content", "tags"],
                  fuzziness: "AUTO",
                },
              },
            ],
            should: [
              {
                match_phrase: {
                  title: {
                    query: query,
                    boost: 2,
                  },
                },
              },
            ],
          },
        },
        highlight: {
          fields: {
            title: {},
            content: {},
          },
        },
      });

      const results: SearchResult[] = response.hits.hits.map((hit: any) => {
        const source = hit._source as Document;

        return {
          id: source.id,
          title: source.title,
          content: source.content,
          tags: source.tags,
          score: hit._score ?? 0,
        };
      });

      const contentText = results
        .map(
          (r, i) =>
            `[${i + 1}] ${r.title} (score: ${r.score.toFixed(
              2,
            )})\n${r.content.substring(0, 200)}...`,
        )
        .join("\n\n");

      const totalHits =
        typeof response.hits.total === "number"
          ? response.hits.total
          : (response.hits.total?.value ?? 0);

      return {
        content: [
          {
            type: "text",
            text: `Found ${results.length} relevant documents:\n\n${contentText}`,
          },
        ],
        structuredContent: {
          results: results,
          total: totalHits,
        },
      };
    } catch (error: any) {
      console.log("Error during search:", error);

      return {
        content: [
          {
            type: "text",
            text: `Error searching documents: ${error.message}`,
          },
        ],
        isError: true,
      };
    }
  }
);</code></pre><p><em>We configure <code><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-fuzzy-query">fuzziness</a>: “AUTO”</code> to have a variable typo tolerance based on the length of the token that’s being analyzed. We also set <code>title^2</code> to increase the score of the documents where the match happens on the title field.</em></p><h4>summarize_and_cite tool</h4><p>This tool generates a summary based on documents retrieved in the previous search. It uses OpenAI’s <code>gpt-4o-mini</code> model to synthesize the most relevant information to answer the user’s question, providing responses derived directly from the search results. In addition to the summary, it also returns citation metadata for the source documents used.</p><pre><code>server.registerTool(
  "summarize_and_cite",
  {
    title: "Summarize and Cite",
    description:
      "Summarize the provided search results to answer a question and return citation metadata for the sources used.",
    inputSchema: {
      results: z
        .array(SearchResultSchema)
        .describe("Array of search results from search_docs"),
      question: z.string().describe("The question to answer"),
      max_length: z
        .number()
        .optional()
        .default(500)
        .describe("Maximum length of the summary in characters"),
      max_docs: z
        .number()
        .optional()
        .default(5)
        .describe("Maximum number of documents to include in the context"),
    },
    outputSchema: {
      summary: z.string(),
      sources_used: z.number(),
      citations: z.array(
        z.object({
          id: z.number(),
          title: z.string(),
          tags: z.array(z.string()),
          relevance_score: z.number(),
        })
      ),
    },
  },
  async ({ results, question, max_length, max_docs }) => {
    if (!results || results.length === 0 || !question) {
      return {
        content: [
          {
            type: "text",
            text: "Both results and question parameters are required, and results must not be empty",
          },
        ],
        isError: true,
      };
    }

    try {
      const used = results.slice(0, max_docs);

      const context = used
        .map(
          (r: SearchResult, i: number) =>
            `[Document ${i + 1}: ${r.title}]\n${r.content}`
        )
        .join("\n\n---\n\n");

      // Generate summary with OpenAI
      const completion = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
          {
            role: "system",
            content:
              "You are a helpful assistant that answers questions based on provided documents. Synthesize information from the documents to answer the user's question accurately and concisely. If the documents don't contain relevant information, say so.",
          },
          {
            role: "user",
            content: `Question: ${question}\n\nRelevant Documents:\n${context}`,
          },
        ],
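        // Rough token budget: max_length is in characters, and ~4 characters
        // per token is a common rule of thumb; capped at 1,000 tokens.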
        max_tokens: Math.min(Math.ceil(max_length / 4), 1000),
        temperature: 0.3,
      });

      const summaryText =
        completion.choices[0]?.message?.content ?? "No summary generated.";

      const citations = used.map((r: SearchResult) => ({
        id: r.id,
        title: r.title,
        tags: r.tags,
        relevance_score: r.score,
      }));

      const citationText = citations
        .map(
          (c: any, i: number) =>
            `[${i + 1}] ID: ${c.id}, Title: "${c.title}", Tags: ${c.tags.join(
              ", ",
            )}, Score: ${c.relevance_score.toFixed(2)}`,
        )
        .join("\n");

      const combinedText = `Summary:\n\n${summaryText}\n\nSources used (${citations.length}):\n\n${citationText}`;

      return {
        content: [
          {
            type: "text",
            text: combinedText,
          },
        ],
        structuredContent: {
          summary: summaryText,
          sources_used: citations.length,
          citations: citations,
        },
      };
    } catch (error: any) {
      return {
        content: [
          {
            type: "text",
            text: `Error generating summary and citations: ${error.message}`,
          },
        ],
        isError: true,
      };
    }
  }
);</code></pre><p>Finally, we need to start the server using <a href="https://github.com/modelcontextprotocol/typescript-sdk?tab=readme-ov-file#stdio">stdio</a>. This means the MCP client will communicate with our server by reading and writing to its standard input and output streams. stdio is the simplest transport option and works well for local MCP servers launched as subprocesses by the client. Add the following code at the end of the file:</p><pre><code>const transport = new StdioServerTransport();
server.connect(transport);</code></pre><p>Now compile the project using the following command:</p><pre><code>npx tsc index.ts --target ES2022 --module node16 --moduleResolution node16 --outDir ./dist --strict --esModuleInterop</code></pre><p>This will create a <code>dist</code> folder, and inside it, an <code>index.js</code> file.</p><h3>Load the MCP server into Claude Desktop</h3><p>Follow <a href="https://modelcontextprotocol.io/docs/develop/connect-local-servers">this guide</a> to configure the MCP server with Claude Desktop. In the Claude configuration file, we need to set the following values:</p><pre><code>{
  "mcpServers": {
    "elasticsearch-rag-mcp": {
      "command": "node",
      "args": [   "/Users/user-name/app-dir/dist/index.js"
      ],
      "env": {
        "ELASTICSEARCH_ENDPOINT": "your-endpoint-here",
        "ELASTICSEARCH_API_KEY": "your-api-key-here",
        "OPENAI_API_KEY": "your-openai-key-here"
      }
    }
  }
}</code></pre><p>The <code>args</code> value should point to the compiled file in the <code>dist</code> folder. You also need to set the environment variables in the configuration file with the exact same names defined in the code.</p><h3>Test it out</h3><p>Before executing each tool, click on <strong>Search and Tools</strong> to make sure that the tools are enabled. Here you can also enable or disable each one:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/172981c2a54adabc70d5819013c3007670935605-1999x1002.png" style="width: 100%;" /><p>Finally, let’s test the MCP server from the Claude Desktop chat and start asking questions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/03654c0f8c53c714f801fba8b25747071179209b-1999x1353.png" style="width: 100%;" /><p>For the question “<strong>Search for documents about authentication methods and role-based access control</strong>”, the <code>search_docs</code> tool is executed and returns the following results:</p><pre><code>Most Relevant Documents:
Access Control and Role Management (highest relevance) - This document covers role-based access control (RBAC) principles, including ensuring users only have necessary permissions, regular auditing of user roles, revoking inactive accounts, and implementing just-in-time access for sensitive operations.
User Authentication with OAuth 2.0 - This document explains OAuth 2.0 authentication, which enables secure delegated access without credential sharing. It covers configuring identity providers, token management with limited scope and lifetime, and secure storage of refresh tokens.
Container Security Guidelines - While primarily about container security, this document touches on access control aspects like running containers as non-root users and avoiding embedded credentials.
Incident Response Playbook - This mentions role assignment during incidents (incident commander, communications lead, etc.), which relates to access control in emergency scenarios.
Logging Standards for Microservices - This document includes guidance on avoiding logging sensitive information, which is relevant to authentication security.</code></pre><p>The response is, “Great! I found 5 relevant documents about authentication methods and role-based access control. Here&#x27;s what was found:”</p><p>The tool call returns the source documents as part of its response payload, which are later used to generate citations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25ff4c7e9563d99752f95540dafdc7fd211a66e3-800x530.gif" style="width: 100%;" /><p>It’s also possible to chain multiple tools in a single interaction. In this case, Claude Desktop analyzes the user&#x27;s question and determines that it needs to first call <code>search_docs</code> to retrieve relevant documents and to then pass those results to <code>summarize_and_cite</code> to generate the final answer, all without requiring separate prompts from the user:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6a8a46acb4236e77058f18bcd2f0737b5882c05-1999x1101.png" style="width: 100%;" /><p>In this case, for the query “<strong>What are the main recommendations to improve authentication and access control across our systems? Include references.</strong>”, we obtained the following results:</p><pre><code>Based on the documentation, here are the main recommendations to improve authentication and access control across your systems:

Key Recommendations

1. Implement Role-Based Access Control (RBAC)
Ensure users have only the permissions necessary for their job functions. This principle of least privilege minimizes security risks by limiting access to what's truly needed. [1]

2. Regular Access Audits
Conduct regular audits of user roles and promptly revoke access for inactive accounts. This helps prevent unauthorized access from dormant or compromised accounts. [1]

3. Just-in-Time (JIT) Access
Implement just-in-time access for sensitive operations, where permissions are granted only when needed and for a limited duration. This reduces the attack surface by minimizing standing privileges. [1]

4. OAuth 2.0 for Secure Authentication
Use OAuth 2.0 to enable secure delegated access without sharing user credentials. This protocol allows third-party applications to access resources on behalf of users without exposing passwords. [2]

5. Token Security and Management
- Configure your identity provider to issue tokens with limited scope and lifetime
- Store refresh tokens securely
- Validate access tokens consistently to maintain security [2]

References

Access Control and Role Management (Tags: security, access-control)
User Authentication with OAuth 2.0 (Tags: authentication, oauth)
These recommendations work together to create a defense-in-depth approach, where multiple security layers protect your systems from unauthorized access.</code></pre><p>As in the previous step, we can see the response from each tool for this question:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cb606d356b2f7d5e4878a5eff71bc881869ac0ee-800x585.gif" style="width: 100%;" /><p><em>Note: If a submenu appears asking whether you approve the use of each tool, select <strong>Always allow</strong> or <strong>Allow once</strong>.</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/aea942ba9b0037526ea215bec65690f1a5c3099c-1522x250.png" style="width: 100%;" /><h2>Conclusion</h2><p>MCP servers represent a significant step toward standardizing LLM tools for both local and remote applications. Though full compatibility is still in the works, we’re moving fast in that direction.</p><p>In this article, we learned how to build a custom MCP server in TypeScript that connects Elasticsearch to LLM-powered applications. Our server exposes two tools: <code>search_docs</code>, which retrieves relevant documents using Query DSL, and <code>summarize_and_cite</code>, which generates summaries with citations via OpenAI models, all with Claude Desktop serving as the client UI.</p><p>The future of compatibility between different client and server providers looks promising. Next steps include adding more functionality and flexibility to your agent. There’s a practical <a href="https://www.elastic.co/search-labs/blog/llm-functions-elasticsearch-intelligent-query">article</a> on how you can parameterize your queries using search templates to gain precision and flexibility.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Announcing read-only permissions for Kibana dashboards]]></title>
            <description><![CDATA[Introducing read-only dashboards in Kibana, giving dashboard creators granular sharing controls to keep results accurate and protected from unwanted changes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/kibana-dashboards-read-only-permissions</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/kibana-dashboards-read-only-permissions</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Fabio Busatto, Teresa Alvarez Soler]]></dc:creator>
            <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11e578bc317aea30d2e10ccc0334a532f6af2ef9-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>You’ve been there. You spend an hour building the perfect dashboard to monitor your logs: every chart, every filter, and every label. You share it with your team. A few days later, you open it and something’s off. A colleague tweaked a query. Or someone changed the date range. Maybe they thought they were helping. Now you’re digging through revisions and second-guessing every number. Sound familiar?</p><p>That’s exactly why we built <strong>read-only dashboards</strong>. It’s the control you’ve been asking for. Share dashboards with confidence, without worrying that the next person with edit access will change or break them.</p><p>Note: Read-only permissions are available in Elastic Cloud Serverless and from version 9.3 for Elastic Cloud Hosted and Elastic Self-Managed.</p><h2>When “everyone can edit” gets in the way</h2><p>In Kibana, <em>sharing</em> has usually meant space-level permissions. If someone can create dashboards in a space, they can also edit or delete anyone else’s. That’s great for collaboration until it isn’t. One accidental edit can ripple into wrong decisions, lost trust, and a lot of cleanup.</p><p>We’ve heard the workarounds: <strong>“We put ‘read-only’ in the dashboard name and hope people notice.”</strong> Or: <strong>“We tag them and cross our fingers.”</strong> Hope isn’t a permission model. You needed a real way to lock a dashboard without locking everyone out of the space.</p><h2>What actually goes wrong</h2><p>Deb and Kevin both have edit access to the log monitoring dashboard within the Operations space. Kevin makes some changes to the charts. When Deb comes back, the numbers don’t match what she presented. She has to track down what changed (often from memory), fix it, and wonder how many reports went out with bad data.</p><h2>Read-only dashboards: Ownership and control that make sense</h2><p>Read-only dashboards fix this by giving you control to decide whether other users can edit the dashboard. When you share a dashboard, you choose: <strong>edit</strong> (default, same as today) or <strong>view</strong>. In <strong>view</strong> mode, only you (and Kibana admins) can change or delete it. Everyone else can open it, use it, and trust it, but they can’t modify it.</p><h3>What you get</h3><ul><li><strong>Dashboard integrity:</strong> In <strong>view</strong> mode, other users with edit access in the space cannot modify or delete the dashboard. If they try, they’re told it’s locked. Your charts and logic stay as you left them.</li><li><strong>You stay in control:</strong> You’re the owner. You can always edit, refine, and update. Sharing as view-only doesn’t lock you out; it locks in the version everyone else sees.</li><li><strong>Flexible lifecycle:</strong> You can switch a dashboard back to “can edit” anytime. And Kibana admins can still manage all dashboards (for example, if the owner leaves). No dead ends.</li></ul><p>You can share finalized, mission-critical dashboards widely and know they’ll stay consistent. 
This is available in <strong>all Elastic tiers and offerings</strong>, including Serverless.</p><h3>Who can do what?</h3><p>Quick reference by role:</p><ul><li><strong>Dashboard owner:</strong> You created it; you have full edit access.</li><li><strong>Kibana admin:</strong> Can manage all dashboards.</li><li><strong>User with space edit:</strong> Can create and edit their dashboards; can’t edit or delete view-only dashboards.</li><li><strong>User with space view:</strong> Can only view (and list) dashboards.</li></ul>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Action</td><td style="border: 1px solid #ccc; padding: 4px;">Dashboard owner</td><td style="border: 1px solid #ccc; padding: 4px;">Kibana admin</td><td style="border: 1px solid #ccc; padding: 4px;">User with space edit</td><td style="border: 1px solid #ccc; padding: 4px;">User with space view</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">List and view dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Create new dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Modify/delete editable dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Modify/delete read-only dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✔</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td><td style="border: 1px solid #ccc; padding: 4px;">✘</td>
                    </tr>
              </tbody>
            </table>
          <h2>How to turn on read-only</h2><p>You can set view-only when you save a new dashboard or later from the share menu.</p><h3>When saving a new dashboard</h3><ul><li>Build your dashboard, and click <strong>Save</strong>.</li><li>In the “Save as new dashboard” modal, find <strong>Permissions</strong>.</li><li>Change from <strong>Can edit</strong> to <strong>Can view</strong>.</li><li>Click <strong>Save</strong>. Done. It’s read-only for everyone else.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42a71d1bb55f9d50bd079f53bf45a0e1999b27f7-1214x1306.png" style="width: 100%;" /><h3>For a dashboard you already own</h3><ul><li>Open the dashboard.</li><li>Open the <strong>Share dashboard</strong> menu.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8405938557c879b1d4c262b98cf5a7f66408c04-1246x264.png" style="width: 100%;" /><ul><li>In the sharing modal, find <strong>Permissions</strong> and switch to <strong>Can view</strong>. The change applies immediately; other users in the space can no longer edit or delete it.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/849405bc32701f3ebe0def012d8ae3cf3813ea0a-996x750.png" style="width: 100%;" /><ul><li>You can mouse over the <strong>Share</strong> action to see what type of permissions a given dashboard has.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80541ddb1b1bc567b0aeff693944ea8b6871d6a7-1270x320.png" style="width: 100%;" /><h3>Seeing which dashboards are locked</h3><p>On the main Dashboards list, dashboards you can’t edit or delete have a disabled selection checkbox. This provides an easy way to spot what’s view-only.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24b2eba08df86174db949c662e7886c5aea1b460-1999x876.png" style="width: 100%;" /><p>In the dashboard, you will also find that the Edit action is disabled and a tooltip will appear, explaining that the dashboard has been set as view-only.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0a2f9da6dc854e876fc6dc2a7c3ef8b313b52ef-1358x330.png" style="width: 100%;" /><h2>Try it</h2><p>Read-only dashboards are available now. Create a dashboard, flip it to <strong>Can view</strong>, and share it. Your team gets a single source of truth, and you get peace of mind. No more “please don’t edit” in the title.</p><p>We’d love to hear how you use read-only dashboards. Share your feedback in our <a href="https://discuss.elastic.co">community forum</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The shell tool is not a silver bullet for context engineering]]></title>
            <description><![CDATA[Learn what context-retrieval tools exist for context engineering, how they work, and their trade-offs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/search-tools-context-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/search-tools-context-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Leonie Monigatti]]></dc:creator>
            <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f91e4d07915ba7bf3b7abf15fac8fab3350f7df2-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>The most important tools an agent has are the search tools it can use to build its own context. Recent posts by <a href="https://www.llamaindex.ai/blog/files-are-all-you-need">LlamaIndex</a> and <a href="https://x.com/hwchase17/status/2011814697889316930">LangChain</a> have sparked a discussion: <em>Are a shell tool and a filesystem all an agent needs for context engineering? </em>Unfortunately, the discussion quickly drifted to the wrong focus: filesystem versus database.</p><p>This post refocuses on the question, <em>What are the right search interfaces an agent needs to build its own context?</em> It first covers the trade-offs between shell tools and dedicated database tools. From there, it offers a practical framework for finding the right interfaces for your agent&#x27;s needs.</p><h2>What does &quot;building context&quot; actually mean for an agent?</h2><p>In early <a href="https://www.elastic.co/what-is/retrieval-augmented-generation">retrieval augmented generation (RAG) pipelines</a>, the developer engineered a fixed retrieval pipeline, and the large language model (LLM) was a passive recipient of the context. This was a fundamental limitation: Context was retrieved on every query, whether or not it was needed, with no check that it actually helped.</p><p>With the shift to agentic RAG, agents now have access to a set of search tools to build their own context. For example, both Claude Code [1] and Cursor [2] let the agent choose between different search tools and even combine them for chained queries, depending on what the task actually requires.</p><h2>What search interfaces exist for context engineering?</h2><p>Context can live in different locations, such as on the web, in a local filesystem, or in a database. An agent can interact with each of these out-of-context data sources through different tools:</p><ul><li><strong>Shell tools</strong> can execute shell commands and have access to the local filesystem. Some examples of built-in shell tools are <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/bash-tool">Claude API&#x27;s bash tool</a>, <a href="https://docs.openclaw.ai/tools/exec">OpenClaw&#x27;s exec tool</a>, and <a href="https://docs.langchain.com/oss/python/integrations/tools/bash">LangChain&#x27;s shell tool</a>.</li><li><strong>Dedicated database tools,</strong> such as tools from a Model Context Protocol (MCP) server (for example, the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP server</a>) or custom tools (for example, <code>run_esql(query)</code> or <code>db_list_index()</code>), can query databases.</li><li><strong>Dedicated file search tools</strong> can search and read local (or uploaded) files (without full shell access). 
Some examples of built-in file search tools are <a href="https://ai.google.dev/gemini-api/docs/file-search">Gemini API’s File Search Tool</a> or <a href="https://developers.openai.com/api/docs/guides/tools-file-search">OpenAI’s File Search Tool</a>.</li><li><strong>Web search tools</strong> can retrieve information from the web.</li><li><strong>Memory tools</strong> store and recall from long-term memory (regardless of how it’s stored).</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/115f20c8ded259e508f51524b2c06bdc702d70ab-1999x1050.png" style="width: 100%;" /><p>As you can see, the shell tool is versatile and can be used to retrieve context from different data sources, including:</p><ul><li><strong>Filesystem:</strong> The agent explores the directory structure (ls, find), searches for relevant content (grep, cat), and repeats until it has built sufficient context.</li><li><strong>Database:</strong> The agent can use database command line interface (CLI) tools (for example, <a href="https://www.elastic.co/docs/reference/query-languages/sql/sql-cli"><code>elasticsearch-sql-cli</code></a>), call HTTP APIs via curl, or run scripts, which is especially useful in combination with agent skills, which are reusable, documented examples injected into the agent&#x27;s context to guide correct tool usage (for example, <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills for Elasticsearch</a>).</li><li><strong>Web:</strong> The agent can execute web searches via a curl command through a search provider’s API.</li></ul><p>However, the shell tool provides direct system access and therefore requires safety measures, such as running in an isolated sandbox environment and logging all executed commands.</p><h2>When to use which search interfaces</h2><p>The right search interface depends on your data, your query patterns, and your use case. This section serves as a practical starting point.</p><h3>Filesystems aren’t making databases obsolete</h3><p>The filesystems-versus-databases discussion is not about the storage layer. For example, LangChain explains that <a href="https://x.com/hwchase17/status/2011814697889316930">its memory system</a> doesn’t actually store memory in a real filesystem. Instead, it stores memory in a database and <em>represents</em> it as a set of files to the agent [3].</p><p>Filesystems are a natural fit for file-native use cases, such as coding agents. They also work well as a temporary scratch pad or working memory and for single-user or single-agent scenarios where concurrency isn&#x27;t a concern. In these cases, a physical filesystem or representing the data as a filesystem gives you flexibility before committing to a purpose-built interface.</p><p>But filesystem storage has real downsides, such as weak concurrency, manual schema enforcement, and a lack of atomic transactions. These become more apparent when your application needs to scale or move to a multi-agent scenario. Anyone who ignores these downsides is doomed to <a href="https://dx.tips/oops-database">painfully reinvent worse databases</a> without the decades of engineering behind transaction safety or access control that production databases already provide. Additionally, in most enterprise contexts, you don&#x27;t choose whether to use a database since it&#x27;s already there, storing business-critical data.</p><h3>Shell tool + filesystem</h3><p>A shell tool is the natural starting point for filesystem search. 
Currently, coding agents are driving a lot of progress in the field. Because they work with code in local files, they’re naturally file-heavy use cases. Therefore, LLMs are fine-tuned in the post-training stage for coding tasks. That’s why many LLMs are not only good at writing code but also at using shell commands and navigating filesystems.</p><p>Using a shell tool with built-in CLIs, like <code>ls</code> and <code>grep</code>, to find files is effective. With grep, a query like &quot;Find all files that import <code>matplotlib</code>&quot; is fast, precise, and cheap. But when the agent needs to handle conceptual queries, such as &quot;How does our app handle failed authentication?&quot;, pattern matching with grep can hit a ceiling quickly. Several alternatives that bring semantic search capabilities to the command line have emerged to fill this gap, including <a href="https://github.com/jina-ai/jina-grep-cli"><code>jina-grep</code></a>.</p><p>However, grep and many of its semantic search alternatives run in O(n) over the corpus. For use cases over codebases, this might be fine. However, if your data grows, latency will become noticeable. In this case, an indexed datastore becomes necessary to maintain performance.</p><h3>Shell tool + database</h3><p>Another way to add more search capabilities, such as semantic or hybrid search, over your data is to store it in a database, as Cursor does, for example. Additionally, when data requires complex relational joins or aggregations, a database interface is nonnegotiable.</p><p>When the data lives in a database rather than on the filesystem, a shell tool can serve as a lightweight database interface for certain use cases. If your queries are simple enough for a CLI or a curl call, a dedicated database tool may add unnecessary complexity.</p><p>This approach is also suitable in early exploration stages, when you don&#x27;t yet know what query patterns your agent will actually develop. In this case, Agent Skills can give the agent enough structure to query correctly without committing to a purpose-built tool. However, when the agent requires many iterations to figure out the right way to query the database for repeated tasks, the token overhead of using a shell tool as the interface no longer justifies the simplicity benefit of avoiding an extra tool.</p><h3>Dedicated database tool</h3><p>Especially when repeated query patterns are structured or analytical, dedicated database tools become necessary. A <a href="https://vercel.com/blog/testing-if-bash-is-all-you-need">blog post from Vercel and Braintrust</a> compared agents with different sets of search tools for real-world retrieval tasks over semi-structured data, such as customer support tickets and sales call transcripts (for example, “How many open issues mention &#x27;security&#x27;?&quot; or &quot;Find issues where someone reported a bug and later someone submitted a PR claiming to fix it?&quot;) [4].</p><p>Agents with dedicated database tools used fewer tokens, were faster, and made fewer mistakes than agents with only a shell tool and filesystem. The lesson is that direct database tools are the right choice when the query requires analytical reasoning over semi-structured data.</p><h3>Combining search interfaces</h3><p>No single search interface handles every query well. For example, Cursor combines shell tools (for searches via grep) and semantic search tools and lets the agent select the right tool based on the user’s prompt. 
They report that the agent chooses grep for matching specific symbols or strings, semantic search for conceptual or behavior questions, and both for exploratory tasks.</p><p>The Vercel experiment reports the same: Its hybrid agent with access to both a shell tool and a dedicated database tool achieved the best performance out of all tested agents by first using the dedicated database tools and then verifying the results by grepping through the filesystem. However, this approach uses more tokens and time for reasoning about tool choice and verification.</p><p>The pattern across both examples is the same: Composition beats any single interface, but composition comes at the trade-off of added cost and latency.</p><h2>Practical recommendations for finding the right set of tools</h2><p>The right set of search interfaces is small, purposeful, and specific to your agent&#x27;s actual query patterns. The current best practice is to have an agent with as few tools as possible instead of having an agent with hundreds of MCP tools. This is because the downside of exposing all possible tools up front is that it bloats the context window and confuses the agent about which tool to actually use. For example, Claude Code reportedly only has about 20 tools.</p><p>Instead, the idea of progressive disclosure is to start with a minimal set of tools and let the agent discover additional capabilities only when needed. Research from Anthropic [5] and Cursor [6] has shown that this approach yields token savings between 47% and 85%. Claude Code, for example, implements this directly, allowing the agent to incrementally discover how to query an API or a database, without that knowledge consuming context on every LLM call.</p><p>Once you’re familiar with the agent&#x27;s query patterns, you can revisit the set of search tools that the agent has access to by default. A useful way to think about this trade-off is the <a href="https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering#building-the-right-database-retrieval-tools-%5C(%E2%80%9Clow-floor,-high-ceiling%E2%80%9D%5C">&quot;low floor, high ceiling&quot; principle</a> for deciding which tools make the cut. High-ceiling tools don&#x27;t limit the agent&#x27;s potential. For example, a versatile shell tool lets the agent write full database queries, including ambiguous ones, but at the cost of reasoning overhead, higher latency, and lower reliability.</p><p>Low-floor tools are the opposite. They’re specialized tools that wrap specific queries and are immediately accessible to the agent with minimal reasoning overhead, producing lower cost and higher reliability. But they need upfront engineering, can&#x27;t cover every possible query, and can make it harder for the agent to choose the right tool.</p><p>Think of each tool on a spectrum: Low-floor tools are easy for the agent to use correctly but narrow in scope. High-ceiling tools are versatile but demand more reasoning to use well.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d1b973be4b0a0a25c99c74f02a47e98395a3f7-1200x630.png" style="width: 100%;" /><p>Most agents need a mix of different search tools. But each tool needs to earn its addition. We recommend starting with an all-purpose search tool (for example, a <code>search_database()</code> tool or a shell tool). Then reuse the command logs you&#x27;re already keeping for security purposes to track what your agent actually does, including tool calls, retries, and number of calls per user query. 
And, when you see a query pattern repeating or failing, that&#x27;s the signal to build a purpose-built tool for it.</p><h2>Summary</h2><p>The filesystem-versus-database debate is distracting from the actual question that engineers need to be asking: <em>What are the right search interfaces an agent needs to build its own context?</em> The answer is most likely, <em>Not a single one</em>.</p><p>A shell tool is a versatile tool to interact with different out-of-context sources and thus a good starting point. But it’s less efficient and accurate for use cases with structured analytical queries than dedicated database tools.</p><p>The goal is to find the minimal set of search tools that handles your agent&#x27;s actual query patterns well. Start with a shell tool, and log what your agent actually does. When you see a query pattern repeating and failing, it’s time to engineer specialized tools.</p><h2>References</h2><p>1. Thariq (Anthropic). <a href="https://x.com/trq212/status/2027463795355095314">Lessons from Building Claude Code: Seeing like an Agent</a> (2026).</p><p>2. Cursor: Documentation. <a href="https://cursor.com/docs/agent/tools/search">Semantic &amp; agentic search</a> (2026).</p><p>3. Harrison Chase (LangChain). <a href="https://x.com/hwchase17/status/2011814697889316930">How we built Agent Builder’s memory system</a> (2026).</p><p>4. Ankur Goyal (Braintrust) and Andrew Qu (Vercel). <a href="https://vercel.com/blog/testing-if-bash-is-all-you-need">Testing if &quot;bash is all you need&quot;</a> (2026).</p><p>5. Anthropic. <a href="https://www.anthropic.com/engineering/advanced-tool-use">Introducing advanced tool use on the Claude Developer Platform</a> (2025).</p><p>6. Cursor. <a href="https://cursor.com/blog/dynamic-context-discovery">Dynamic context discovery</a> (2026).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch replicas for load balancing in Serverless]]></title>
            <description><![CDATA[Learn how Elastic Cloud Serverless automatically adjusts index replicas based on search load, ensuring optimal query performance without manual configuration.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-replicas-load-balancing-serverless</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-replicas-load-balancing-serverless</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Andrei Dan]]></dc:creator>
            <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c4c1e99b892f026b7aba098973593f8298e2ea6-1280x717.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In Elastic Cloud Serverless, we automatically adjust the number of replicas for your indices based on search load, ensuring optimal query performance without any manual configuration. In this blog, we’ll explain how replicas are scaled, when the system adds or removes them, and what this means for your indices.</p><h2>The party is getting crowded</h2><p>You&#x27;re hosting a pizza party. You&#x27;ve got a few friends helping you serve, each stationed at different spots around the room. You give each friend a pizza, and they start handing out slices to hungry guests as they arrive.</p><p>At first, things run smoothly. A few guests trickle in, your friends serve slices, everyone&#x27;s happy. But then word spreads about your sourdough pizzas. The doorbell keeps ringing. Guests pour in. Soon, there&#x27;s a crowd forming around one of your friends, the one holding the pepperoni pizza, which everyone seems to want.</p><p>Your friend with the pepperoni pizza is overwhelmed. Guests are waiting, getting impatient, and a large queue has formed. Meanwhile, your friend holding the margherita pizza is standing around with barely anyone asking for a slice.</p><p>What do you do?</p><p>You order a couple more pepperoni pizzas and hand them to other friends. Now three friends are holding pepperoni instead of one. The crowd spreads out, and suddenly you can serve three times as many guests at once.</p><p>A few things become clear as you host more parties:</p><ul><li><strong>Not all pizzas are equally popular.</strong> Some are in high demand, others have fewer takers. You don&#x27;t need extra &quot;copies&quot; of the unpopular ones. You need extras of the ones with queues.</li><li><strong>Order more pizzas before the queue gets too long.</strong> If you wait until your friend is completely overwhelmed and guests are leaving angry, you&#x27;ve waited too long. Better to get an extra pizza when you see a crowd forming.</li><li><strong>Don&#x27;t throw away pizzas too quickly.</strong> Just because the crowd around the pepperoni thinned out for five minutes doesn&#x27;t mean the rush is over. Maybe they&#x27;re just refilling drinks, or even talking among themselves (is that still a thing?). Keep the extra pizzas ready. If the lull continues for a while, then you can put them away.</li><li><strong>You can only hand out as many pizzas as you have friends who are helping.</strong> If you&#x27;ve only got four friends helping, ten pizzas won’t change the outcome. Only four can be served at once. Match your pizza count to your available hands.</li><li><strong>When a friend leaves, take their pizza.</strong> If one of your friends needs to head out, grab their pizza immediately. You can&#x27;t have pizzas sitting unattended. Hand it to someone else, or put it away.</li></ul><h2>From pizzas to replicas</h2><p>Let&#x27;s map this back to Elasticsearch.</p><p>In our analogy, pizzas are replicas (copies of your index shards), your friends helping serve are search nodes, hungry guests are search queries, and that popular pizza with a crowd around it is a hot index with high search load.</p><p>When search traffic increases on a particular index, we create additional replicas and distribute them across your search nodes. Any replica can serve any query for that index, just like any friend holding pepperoni can hand out pepperoni slices. 
More replicas means higher throughput: Three replicas can handle three times the queries per second of a single replica.</p><h2>Measuring the hunger</h2><p>Before we decide how many pizzas to order, we need to know how hungry the crowd is.</p><p>Elasticsearch tracks the <strong>search load</strong> for every shard. It&#x27;s a metric that captures how much search activity a shard is handling. We aggregate this across all shards of an index to understand the total search demand.</p><p>What matters most is the <strong>relative search load</strong>: What proportion of your project&#x27;s total search traffic is hitting each index? If one index is receiving 60% of all searches while another gets 5%, we know where to add capacity.</p><h2>The math behind the pizzas</h2><p>We calculate the optimal number of replicas following this formula:</p><pre><code>desired_replicas = min(ceil(L × N / (S × X)), N)</code></pre><p>Where:</p><ul><li><strong>L</strong> = the index&#x27;s relative search load (between 0 and 1).</li><li><strong>N</strong> = the number of desired search nodes in your project.</li><li><strong>S</strong> = the number of shards in the index.</li><li><strong>X</strong> = a threshold to avoid hot spots (default: 0.5).</li></ul><p>An example: four search nodes, one index with two primary shards receiving 80% of search traffic:</p><pre><code>desired_replicas = min(ceil(0.8 × 4 / (2 × 0.5)), 4)
                 = min(ceil(3.2), 4)
                 = min(4, 4)
                 = 4</code></pre><p>This hot index gets four replicas distributed across the search nodes.</p><p>The threshold X (defaulting to 0.5) is important. We don&#x27;t wait until a replica is completely overwhelmed; we scale up when it&#x27;s at half capacity. Hand out the extra pizza when you see the crowd forming, not when guests are already leaving.</p><h2>Scale up fast, scale down slow</h2><p>When search load increases, we add replicas immediately. No reason to make users wait.</p><p>When search load drops, we wait a bit before taking any action. We need to see consistent low demand for about 30 minutes before reducing replicas. (This is to deal with spiky traffic where a quiet moment doesn&#x27;t mean the party is over.)</p><p>This matters because adding a replica has a cost. The new replica copies data and warms its caches before serving queries efficiently. Removing replicas too eagerly means constantly paying this startup cost as traffic naturally fluctuates.</p><h2>Respecting topology bounds</h2><p>Replicas can never exceed the number of search nodes. Having more replicas than nodes provides no benefit (you can only serve as many pizzas as you have friends who are helping to serve slices).</p><p>When nodes are removed from your project, we reduce replicas immediately to match. No waiting for the cooldown, as you can&#x27;t have unassigned replicas. The moment a friend leaves, we remove their pizza.</p><h2>The bigger Serverless picture</h2><p>Replicas for search load balancing work alongside other autoscaling systems:</p><ul><li><strong>Search autoscaling</strong> adjusts the number of search nodes (how many friends are helping).</li><li><strong>Replicas for search load balancing</strong> distribute traffic by adjusting replica counts per index (how many pizzas of each kind we need).</li><li><strong>Data stream autosharding</strong> optimizes shard counts for writes (how to slice each pizza, covered in the <a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">previous post</a>).</li></ul><p>An important design principle: Replicas for load balancing don&#x27;t directly trigger search autoscaling. Instead, by distributing search requests across more replicas, they increase resource utilization across your search nodes. This higher utilization then triggers our existing autoscaling logic to add capacity if needed. Replicas for load balancing enable autoscaling to do its job, making sure your search nodes are actually being used, rather than having all traffic bottlenecked on a single replica while other nodes sit idle.</p><h2>What this means for you</h2><p>You don&#x27;t need to predict which indices will be popular. You don&#x27;t need to manually adjust replicas when traffic patterns change. You don&#x27;t need to wake up at 3 a.m. because a surge overwhelmed your busiest index.</p><p>The system watches where queues are forming and orders more pizzas for those spots. Cold indices don&#x27;t waste resources on unnecessary replicas. Hot indices get the capacity they need. Your budget goes where it matters.</p><h2>Conclusion</h2><p>In the <a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">autosharding post</a>, we made sure your pizzas are sliced right. 
Now, with replicas for search load balancing, we make sure you have enough pizzas, in the right hands, when the hungry crowds arrive.</p><p>Try <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and let us handle the pizza logistics.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Elasticsearch Inference API along with Hugging Face models]]></title>
            <description><![CDATA[Learn how to connect Elasticsearch to Hugging Face models using inference endpoints, and build a multilingual blog recommendation system with semantic search and chat completions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hugging-face-elasticsearch-inference-api</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hugging-face-elasticsearch-inference-api</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1417d6ff033712206c9bd4bcc22074ee3437ce96-1999x1125.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>In recent updates, Elasticsearch introduced a native integration to connect to models hosted on the <a href="https://endpoints.huggingface.co/">Hugging Face Inference Service</a>. In this post, we’ll explore how to configure this integration and perform inference through simple API calls using a large language model (LLM). We’ll use <a href="https://huggingface.co/HuggingFaceTB/SmolLM3-3B">SmolLM3-3B</a>, a lightweight general-purpose model with a good balance between resource usage and answer quality.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ddadf1976421a860a7d62087239adb9150d808b-1999x1388.png" style="width: 100%;" /><h2>Prerequisites</h2><ul><li><strong>Elasticsearch 9.3 or Elastic Cloud Serverless: </strong>You can create a cloud deployment following <a href="https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud">these instructions</a>, or you can use the <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart#local-dev-quick-start"><code>start-local</code></a> quickstart instead.</li><li><strong>Python 3.12: </strong>Download Python <a href="https://www.python.org/">here</a>.</li><li><strong>Hugging Face </strong><a href="https://huggingface.co/docs/hub/en/security-tokens">access token</a>.</li></ul><h2>Chat completions using a Hugging Face inference endpoint</h2><p>First, we’ll build a practical example that connects Elasticsearch to a Hugging Face <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">inference endpoint</a> to generate AI-powered recommendations from a collection of blog posts. For the app knowledge base, we’ll use a dataset of company blog articles, which contains valuable but often hard-to-navigate information.</p><p>With this endpoint, <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> retrieves the most relevant articles for a given query, and a Hugging Face LLM generates short, contextual recommendations based on those results.</p><p>Let’s take a look at a high-level overview of the information flow we’re going to build:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1dfbc2323438feaaa42e13ab242dd1f7166f74aa-1200x676.png" style="width: 100%;" /><p>In this article, we’ll test <strong>SmolLM3-3B</strong>’s capacity to combine its compact size with strong multilingual reasoning and tool-calling capabilities. 
Based on a search query, we’ll send all the matching content (in English and Spanish) to the LLM to generate a list of recommended articles with a custom-made description based on the search query and results.</p><p>Here’s what the UI of an article site with an AI recommendation generation system could look like.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d3b86b212f28ff279f2da67a33e6134039f0e4e-1999x949.png" style="width: 100%;" /><p>You can find the full implementation of this application in the linked <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/notebook.ipynb">notebook</a>.</p><h3>Configuring Elasticsearch inference endpoints</h3><p>To use the Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-hugging-face">Hugging Face inference endpoint</a>, we need two important elements: a Hugging Face API key and a running Hugging Face endpoint URL. It should look like this:</p><pre><code>PUT _inference/chat_completion/hugging-face-smollm3-3b
{
    "service": "hugging_face",
    "service_settings": {
        "api_key": "hugging-face-access-token", 
        "url": "url-endpoint" 
    }
}</code></pre><p>The Hugging Face inference endpoint in Elasticsearch supports different task types: <code>text_embedding</code>, <code>completion</code>, <code>chat_completion</code>, and <code>rerank</code>. In this blog post, we use <code>chat_completion</code> because we need the model to generate conversational recommendations based on the search results and a system prompt. This endpoint allows us to perform chat completions directly from Elasticsearch in a simple way using the Elasticsearch API:</p><pre><code>POST _inference/chat_completion/hugging-face-smollm3-3b/_stream
{
  "messages": [
      { "role": "user", "content": "<user prompt>" }
  ]
}</code></pre><p>This will serve as the core of the application, receiving the prompt and the search results that will pass through the model. With the theory covered, let’s start implementing the application.</p><h4>Setting up an inference endpoint on Hugging Face</h4><p>To deploy the Hugging Face model, we’re going to use <a href="https://huggingface.co/inference-endpoints/dedicated">Hugging Face one-click deployments</a>, an easy and fast service for deploying model endpoints. Keep in mind that this is a paid service, and using it may incur additional costs. This step will create the model instance that will be used to generate the article recommendations.</p><p>You can pick a model from the one-click catalog:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b816e9fba1fe172687bf58f5143fb1f838c1077f-549x331.png" style="width: 100%;" /><p>Let’s pick the <strong>SmolLM3-3B</strong> model:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/610d3aba0429f3666c2df3616d513eb6a4397c0c-502x478.png" style="width: 100%;" /><p>From here, grab the Hugging Face endpoint URL:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/025094ddb2cfbd1f0f216a5ec4e119b0f4fa2c42-646x328.png" style="width: 100%;" /><p>As mentioned in the Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-hugging-face">Hugging Face inference endpoints documentation</a>, text generation requires a model that’s compatible with the OpenAI API. For that reason, we need to append the <code>/v1/chat/completions</code> subpath to the Hugging Face endpoint URL. The final result will look like this:</p><pre><code>https://j2g31h0futopfkli.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions</code></pre><p>With this in place, we can start coding in a Python notebook.</p><h4>Generating a Hugging Face API key</h4><p>Create a <a href="https://huggingface.co/join">Hugging Face account</a>, and obtain an API token by following <a href="https://huggingface.co/docs/hub/en/security-tokens#user-access-tokens">these instructions</a>. You can choose between three token types: <em>fine-grained</em> (recommended for production, as it provides access only to specific resources); <em>read</em> (for read-only access); or <em>write</em> (for read and write access). For this tutorial, a read token is sufficient, since we only need to call the inference endpoint. Save this key for the next step.</p><h4>Setting up the Elasticsearch inference endpoint</h4><p>First, let’s declare an Elasticsearch Python client:</p><pre><code>import os

from elasticsearch import Elasticsearch

os.environ["ELASTICSEARCH_API_KEY"] = "your-elasticsearch-api-key"
os.environ["ELASTICSEARCH_URL"] = "https://xxxx.us-central1.gcp.cloud.es.io:443"

es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"], api_key=os.environ["ELASTICSEARCH_API_KEY"]
)</code></pre><p>Next, let’s create an Elasticsearch inference endpoint that uses the Hugging Face model. This endpoint will allow us to generate responses based on the blog posts and the prompt passed to the model.</p><pre><code>INFERENCE_ENDPOINT_ID = "smollm3-3b-pnz"

os.environ["HUGGING_FACE_INFERENCE_ENDPOINT_URL"] = (
 "https://j2g31h0futopfkli.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions"
)
os.environ["HUGGING_FACE_API_KEY"] = "hf_xxxxx"

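# Register the Hugging Face endpoint as an Elasticsearch inference endpoint;
# the chat_completion task type matches the _inference path used later.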
resp = es_client.inference.put(
        task_type="chat_completion",
        inference_id=INFERENCE_ENDPOINT_ID,
        body={
            "service": "hugging_face",
            "service_settings": {
                "api_key": os.environ["HUGGING_FACE_API_KEY"],
                "url": os.environ["HUGGING_FACE_INFERENCE_ENDPOINT_URL"],
            },
        },
    )</code></pre><h3>Dataset</h3><p>The dataset contains the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/dataset.json">blog posts</a> that will be queried, representing a multilingual content set used throughout the workflow:</p><pre><code>// Articles dataset document example: 
{
    "id": "6",
    "title": "Complete guide to the new API: Endpoints and examples",
    "author": "Tomas Hernandez",
    "date": "2025-11-06",
    "category": "tutorial",
    "content": "This guide describes in detail all endpoints of the new API v2. It includes code examples in Python, JavaScript, and cURL for each endpoint. We cover authentication, resource creation, queries, updates, and deletion. We also explain error handling, rate limiting, and best practices. Complete documentation is available on our developer portal."
  }</code></pre><h4>Elasticsearch mappings</h4><p>With the dataset defined, we need to create a data schema that properly fits the blog post structure. The following <a href="https://www.elastic.co/docs/manage-data/data-store/mapping">index mappings</a> will be used to store the data in Elasticsearch:</p><pre><code>INDEX_NAME = "blog-posts"

mapping = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "title": {
                "type": "object",
                "properties": {
                    "original": {
                        "type": "text",
                        "copy_to": "semantic_field",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                    "translated_title": {
                        "type": "text",
                        "fields": {"keyword": {"type": "keyword"}},
                    },
                },
            },
            "author": {"type": "keyword", "copy_to": "semantic_field"},
            "category": {"type": "keyword", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "date": {"type": "date"},
            "semantic_field": {"type": "semantic_text"},
        }
    }
}


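# Create the index; copy_to funnels title, author, category, and content into
# semantic_field, which is indexed as a semantic_text field for semantic search.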
es_client.indices.create(index=INDEX_NAME, body=mapping)</code></pre><p>Here, we can see more clearly how the data is structured. We’ll use semantic search to retrieve results based on natural language, along with the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to"><code>copy_to</code></a> property to copy the field contents into the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> field. Additionally, the <code>title</code> field contains two subfields: the <code>original</code> subfield stores the title in either English or Spanish, depending on the original language of the article; and the <code>translated_title</code> subfield is present only for Spanish articles and contains the English translation of the original title.</p><h3>Ingesting data</h3><p>The following code snippet ingests the blog posts dataset into Elasticsearch using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/javascript/bulk_examples">bulk API</a>:</p><pre><code>import json

from elasticsearch import helpers


def build_data(json_file, index_name):
    with open(json_file, "r") as f:
        data = json.load(f)

    for doc in data:
        action = {"_index": index_name, "_source": doc}
        yield action


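# helpers.bulk consumes the generator lazily, indexing documents in batches
# without loading the whole dataset into memory.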
try:
    success, failed = helpers.bulk(
        es_client,
        build_data("dataset.json", INDEX_NAME),
    )
    print(f"{success} documents indexed successfully")

    if failed:
        print(f"Errors: {failed}")
except Exception as e:
    print(f"Error: {str(e)}")</code></pre><p>Now that we have the articles ingested into Elasticsearch, we need to create a function capable of searching against the <code>semantic_text</code> field:</p><pre><code>def perform_semantic_search(query_text, index_name=INDEX_NAME, size=5):
    try:
        query = {
            "query": {
                "match": {
                    "semantic_field": {
                        "query": query_text,
                    }
                }
            },
            "size": size,
        }

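        # A match query against a semantic_text field automatically runs
        # semantic search using the field's associated inference endpoint.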
        response = es_client.search(index=index_name, body=query)
        hits = response["hits"]["hits"]

        return hits
    except Exception as e:
        print(f"Semantic search error: {str(e)}")
        return []</code></pre><p>We also need a function that calls the inference endpoint. In this case, we’ll call the endpoint using the <strong><code>chat_completion</code></strong> task type to get streaming responses:</p><pre><code>import requests

# Assumption: reuse the endpoint and API key set via os.environ earlier,
# aliased to plain names for the REST calls below
ELASTICSEARCH_URL = os.environ["ELASTICSEARCH_URL"]
ELASTICSEARCH_API_KEY = os.environ["ELASTICSEARCH_API_KEY"]


def stream_chat_completion(messages: list, inference_id: str = INFERENCE_ENDPOINT_ID):
    url = f"{ELASTICSEARCH_URL}/_inference/chat_completion/{inference_id}/_stream"
    payload = {"messages": messages}
    headers = {
        "Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}",
        "Content-Type": "application/json",
    }

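    # The _stream endpoint returns server-sent events; each "data: " line
    # carries an OpenAI-style chunk holding the next content delta.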
    try:
        response = requests.post(url, json=payload, headers=headers, stream=True)
        response.raise_for_status()

        for line in response.iter_lines(decode_unicode=True):
            if line:
                line = line.strip()

                if line.startswith("event:"):
                    continue

                if line.startswith("data: "):
                    data_content = line[6:]

                    if not data_content.strip() or data_content.strip() == "[DONE]":
                        continue

                    try:
                        chunk_data = json.loads(data_content)

                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
                            choice = chunk_data["choices"][0]
                            if "delta" in choice and "content" in choice["delta"]:
                                content = choice["delta"]["content"]
                                if content:
                                    yield content

                    except json.JSONDecodeError as json_err:
                        print(f"\nJSON decode error: {json_err}")
                        print(f"Problematic data: {data_content}")
                        continue

    except requests.exceptions.RequestException as e:
        yield f"Error: {str(e)}"</code></pre><p>Now we can write a function that calls the semantic search function, along with the <code>chat_completions</code> inference endpoint and the recommendations endpoint, to generate the data that will be allocated in the cards:</p><pre><code>def recommend_articles(search_query, index_name=INDEX_NAME, max_articles=5):
    print(f"\n{'='*80}")
    print(f"🔍 Search Query: {search_query}")
    print(f"{'='*80}\n")

    articles = perform_semantic_search(search_query, index_name, size=max_articles)

    if not articles:
        print("❌ No relevant articles found.")
        return None, None, None  # keep the arity consistent with the success path

    print(f"✅ Found {len(articles)} relevant articles\n")

    # Build context with found articles
    context = "Available blog articles:\n\n"
    for i, article in enumerate(articles, 1):
        source = article.get("_source", article)
        context += f"Article {i}:\n"
        context += f"- Title: {source.get('title', 'N/A')}\n"
        context += f"- Author: {source.get('author', 'N/A')}\n"
        context += f"- Category: {source.get('category', 'N/A')}\n"
        context += f"- Date: {source.get('date', 'N/A')}\n"
        context += f"- Content: {source.get('content', 'N/A')}\n\n"

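    # The system prompt pins the output to a strict JSON array so the
    # recommendations can be parsed into cards later.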
    system_prompt = """You are an expert content curator that recommends blog articles.

    Write recommendations in a conversational style starting with phrases like:
    - "If you're interested in [topic], this article..."
    - "This post complements your search with..."
    - "For those looking into [topic], this article provides..."


    FORMAT REQUIREMENTS:
    - Return ONLY a JSON array
    - Each element must have EXACTLY these three fields: "article_number", "title", "recommendation"
    - If the original title is in Spanish, use the "translated_title" subfield in the "title" field

    Keep each recommendation concise (2-3 sentences max) and focused on VALUE to the reader.

    EXAMPLE OF CORRECT FORMAT:
    [
        {"article_number": 1, "title": "Article title in english", "recommendation": "If you are interested in [topic], this article provides..."},
        {"article_number": 2, "title": "Article title in english", "recommendation": " for those looking into [topic], this article provides..."}
    ]

    Return ONLY the JSON array following this exact structure."""

    user_prompt = f"""Search query: "{search_query}"

    Generate recommendations for the following articles: {context}
    """

    messages = [
        {"role": "system", "content": "/no_think"},
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    # LLM generation
    print(f"{'='*80}")
    print("🤖 Generating personalized recommendations...\n")

    full_response = ""

    for chunk in stream_chat_completion(messages):
        print(chunk, end="", flush=True)
        full_response += chunk

    return context, articles, full_response</code></pre><p>Finally, we need to extract the information and format it to be printed:</p><pre><code>def display_recommendation_cards(articles, recommendations_text):
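    """Parse the LLM's JSON recommendations and render one ASCII card per article."""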
    print("\n" + "=" * 100)
    print("📇 RECOMMENDED ARTICLES".center(100))
    print("=" * 100 + "\n")

    # Parse JSON recommendations - clean tags and extract JSON
    recommendations_list = []
    try:

        # Clean up <think> tags
        cleaned_text = re.sub(
            r"<think>.*?</think>", "", recommendations_text, flags=re.DOTALL
        )
        # Remove markdown code fences (```json ... ``` or ``` ... ```)
        cleaned_text = re.sub(r"```(?:json)?", "", cleaned_text)
        cleaned_text = cleaned_text.strip()

        parsed = json.loads(cleaned_text)

        # Extract recommendations from list format
        for item in parsed:
            article_number = item.get("article_number")
            title = item.get("title", "")
            rec_text = item.get("recommendation", "")

            if article_number and rec_text:
                recommendations_list.append(
                    {
                        "article_number": article_number,
                        "title": title,
                        "recommendation": rec_text,
                    }
                )
    except json.JSONDecodeError as e:
        print(f"⚠️  Could not parse recommendations as JSON: {e}")
        return

    for i, article in enumerate(articles, 1):
        source = article.get("_source", article)

        # Card border
        print("┌" + "─" * 98 + "┐")

        # Find recommendation and title for this article number
        recommendation = None
        title = None
        for rec in recommendations_list:
            if rec.get("article_number") == i:
                recommendation = rec.get("recommendation")
                title = rec.get("title")
                break

        # Print title (fall back to the indexed title if the LLM omitted one)
        if not title:
            title = source.get("title", "N/A")
        title_lines = textwrap.wrap(f"📌 {title}", width=94)
        for line in title_lines:
            print(f"│  {line}".ljust(99) + "│")

        # Card border
        print("├" + "─" * 98 + "┤")

        # Print recommendation
        if recommendation:
            recommendation_lines = textwrap.wrap(recommendation, width=94)
            for line in recommendation_lines:
                print(f"│  {line}".ljust(99) + "│")

        # Card bottom
        print("└" + "─" * 98 + "┘")</code></pre><p>Let’s test this by asking a question about the security blog posts:</p><pre><code>search_query = "Security and vulnerabilities"

context, articles, recommendations = recommend_articles(search_query)

print("\nElasticsearch context:\n", context)

# Display visual cards
display_recommendation_cards(articles, recommendations)</code></pre><p>Here we can see the cards in the console generated by the workflow:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/730d35212594bb3db30447c3ea7e2a92857287b7-1999x1515.png" style="width: 100%;" /><p>You can see the full results, including all hits and the LLM response, in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-inference-api-and-hugging-face/results.md">this file</a>.</p><p>We’re asking for articles related to: “Security and vulnerabilities.” This question is used as the search query against the documents stored in Elasticsearch. The retrieved results are then passed to the model, which generates recommendations based on their content. As we can see, the model did a great job generating engaging short text that can motivate the reader to click on it.</p><h2>Conclusion</h2><p>This example shows how Elasticsearch and Hugging Face can be combined to create a fast and efficient centralized system for AI applications. This approach reduces manual effort and provides flexibility, thanks to Hugging Face’s extensive model catalog. Using SmolLM3-3B, in particular, shows how compact, multilingual models can still deliver meaningful reasoning and content generation when paired with semantic search. Together, these tools offer a scalable and effective foundation for building intelligent content analysis and multilingual applications.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fast vs. accurate: Measuring the recall of quantized vector search]]></title>
            <description><![CDATA[Explaining how to measure recall for vector search in Elasticsearch with minimal setup.]]></description>
            <link>https://www.elastic.co/search-labs/blog/recall-vector-search-quantization</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/recall-vector-search-quantization</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeff Vestal]]></dc:creator>
            <pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09f03b9239d66c36763cdab3fafcdac207ff6d83-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Everyone wants vector search to be instant. But high-dimensional vectors are heavy. A single 1,024-dimension float-32 vector takes up significant memory, and comparing it against millions of others is computationally expensive.</p><p>To solve this, search engines like Elasticsearch use two main optimization strategies:</p><ol><li><strong>Approximate search (hierarchical navigable small world [HNSW]):</strong> Instead of scanning every document, we build a navigation graph to jump quickly to the likely neighborhood of the answer.</li><li><strong>Quantization:</strong> We compress the vectors (for example, from 32-bit floats to 8-bit integers or even 1-bit binary values) to reduce memory usage and speed up calculations.</li></ol><p>But optimization often comes with a tax: <strong>accuracy</strong>.</p><p>The fear is valid: &quot;If I compress my data and take shortcuts during the search, will I miss the best results?&quot; &quot;Does this optimization degrade the relevance of my search engine?&quot;</p><p>To prove that Elastic’s quantization doesn’t degrade results, we built a repeatable test harness using the <a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14"><strong>DBPedia-14</strong> dataset</a> to calculate exactly how much accuracy (specifically, <strong>recall)</strong> we trade for speed when using default optimizations in Elasticsearch.</p><p>tldr: It’s likely much less than you think. Check out the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/fast_vs_accurate_measuring_the_recall_of_quantized_vector_search/vector_recall_notebook.ipynb">notebook here</a>, and try it yourself</p><h2><strong>The definitions (for the non-experts)</strong></h2><p>Before we look at the code, let’s level-set on some terms.</p><ul><li><strong>Relevance versus recall:</strong> <strong>Relevance</strong> is subjective (did I find good stuff?). <strong>Recall</strong> is mathematical. If there are 10 documents in the database that are the <em>perfect</em> mathematical matches for your query, and the search engine finds nine of them, your recall is 90% (or 0.9).</li><li><strong>Exact search (flat):</strong> Sometimes called the &quot;brute force&quot; method. The search engine scans every single document in an index and calculates the distance.<ul><li><em>Pros:</em> 100% perfect recall.</li><li><em>Cons:</em> Computationally expensive and slow at scale.</li></ul></li><li><strong>Approximate search (HNSW):</strong> The &quot;shortcut&quot; method. The search engine builds an <a href="https://www.elastic.co/search-labs/blog/hnsw-graph">HNSW</a> graph. It traverses the graph to find the nearest neighbors.<ul><li><em>Pros:</em> Extremely fast and scalable.</li><li><em>Cons:</em> You might miss a neighbor if the graph traversal stops too early.</li></ul></li></ul><h2><strong>The experiment: Exact versus approximate</strong></h2><p>To test recall, we used the <strong>DBPedia-14</strong> dataset, a large dataset of titles and abstracts across 14 ontology classes, commonly used for training and evaluating text categorization models. Specifically, we’ll focus on the &quot;Film&quot; category. 
We wanted to compare the optimized production settings against a mathematically perfect ground truth.</p><p>For this experiment, we are using the <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text">jina-embeddings-v5-text-small</a> model, a state-of-the-art multilingual model that leads industry benchmarks for text representation. We chose this model because it defines the current standard for high-performance embeddings. By combining Jina v5’s elite accuracy with Elasticsearch’s native quantization, we can demonstrate a search architecture that is both computationally efficient and uncompromising on retrieval quality.</p><p>We set up an index with a dual mapping, ingesting the same text into two different fields simultaneously:</p><ol><li><strong><code>content.raw</code></strong>: a <code>semantic_text</code> field whose dense vector index type is set to <code>flat</code>. This forces Elasticsearch to perform a brute-force scan of the full Float32 vectors. It returns exact results and serves as our baseline.</li><li><strong><code>content</code></strong>: a <code>semantic_text</code> field with the defaults, using HNSW + Better Binary Quantization (BBQ). This is the standard, optimized production setting for approximate matching.</li></ol><h3><strong>The Recall@10 test</strong></h3><p>For our metric, we used Recall@10.</p><p>We picked 50 random movies and ran the same query against both fields.</p><ul><li>If the <strong>exact (flat)</strong> search says the top 10 neighbors are IDs [1, 2, 3... 10].</li><li>And the <strong>approximate (HNSW)</strong> search returns IDs [1, 2, 3... 9, 99].</li><li>We found nine out of the top 10 correctly. The score is <strong>0.9</strong>.</li></ul><p>Here’s the mapping we used:</p><pre><code># The "Control Group": Forces exact brute-force scan
"raw": {
    "type": "semantic_text",
    "inference_id": ".jina-embeddings-v5-text-small",
    "index_options": {
        "dense_vector": {
            "type": "flat"
        }
    }
}</code></pre><p><strong>The results: The &quot;flat line&quot; of success</strong></p><p>We ran a scale test, reloading the full dataset and testing against index sizes of 1,000 to 40,000 documents.</p><p>Here’s what happened to the recall score:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Documents</td><td style="border: 1px solid #ccc; padding: 4px;">Recall@10 score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">1,000</td><td style="border: 1px solid #ccc; padding: 4px;">1.000 (100%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">5,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.998 (100%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">10,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.992 (99.4%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">20,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.999 (99.0%)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">40,000</td><td style="border: 1px solid #ccc; padding: 4px;">0.992 (98.8%)</td>
                    </tr>
              </tbody>
            </table>
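          <p>For reference, the Recall@10 scores above reduce to a simple set overlap between the exact and approximate top-10 ID lists. Here’s a minimal sketch of that computation (illustrative; not copied verbatim from the notebook):</p><pre><code>def recall_at_k(exact_ids, approx_ids, k=10):
    # Fraction of the exact top-k neighbors that the approximate search also found
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top.intersection(approx_top)) / k</code></pre>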
          <p>The results were incredibly stable. Even as we scaled up, the approximate search matched the brute-force exact search <strong>&gt;99% of the time</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a4bfacb1d0cce6fdf6df0e1a9d4fc5d4007a66da-1999x1209.png" style="width: 100%;" /><h2><strong>Why did it work so well?</strong></h2><p>You might expect that compressing vectors to binary values would hurt accuracy more than this. The reason it doesn&#x27;t lies in how Elasticsearch handles the retrieval.</p><p>Most embedding models today output Float32 vectors, which are large. To make search efficient, Elasticsearch uses quantization for high-dimensional vectors. Specifically, since 9.2, it uses <a href="https://www.elastic.co/search-labs/blog/elasticsearch-9-1-bbq-acorn-vector-search">BBQ</a> by default.</p><p>BBQ uses a <strong>rescoring</strong> mechanism:</p><ol><li><strong>Traversal:</strong> The search engine uses the compressed (quantized) vectors to traverse the HNSW graph quickly. Because the vectors are small, it can efficiently over-sample, gathering a larger list of candidates (for example, the top 100 roughly similar docs) without a performance penalty.</li><li><strong>Rescore:</strong> Once it has those candidates, it retrieves the full-precision values for just those few documents to calculate the final, precise ranking.</li></ol><p>This gives you the best of both worlds, the speed of quantization for the heavy lifting, and the precision of floats for the final sort.</p><h2><strong>Can we do better?</strong></h2><p>It’s worth noting that the results we’re seeing here are using default settings and a random sampling of data. Think of this as a high-performance starting point. While Jina v5 is a beast, these recall scores aren&#x27;t a &quot;one size fits all&quot; guarantee for every dataset. Every data collection has its own quirks, and while you can definitely tune things further to squeeze out even more performance, you should always benchmark against your own specific data to see where your ceiling is.</p><h2><strong>Conclusion</strong></h2><p>This is a very small-scale test. But the point of the exercise is not to measure the embedding model or BBQ specifically, it’s to demonstrate how you can easily measure the recall of your dataset with minimal setup.</p><p>If you want to run this test on your own data, you can check out the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/fast_vs_accurate_measuring_the_recall_of_quantized_vector_search/vector_recall_notebook.ipynb">notebook here</a> and try it yourself.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Testing Elasticsearch. It just got simpler.]]></title>
            <description><![CDATA[Explaining how Elasticsearch integration tests have become simpler thanks to improvements in Elasticsearch 9.x, the modern Java client, and Testcontainers 2.x.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Piotr Przybyl]]></dc:creator>
            <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3ae41b1f2876d2ad11c8e2b79bbf79955d6902aa-1440x840.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>When I first wrote about <a href="https://www.elastic.co/search-labs/blog/series/integration-tests-using-elasticsearch">testing Elasticsearch</a> with Testcontainers for Java, the focus was very pragmatic: if you care about correctness, you should test against a real node; if you care about confidence, your integration tests should resemble production as closely as possible; and if you care about maintainability, your setup shouldn’t turn into a maze of mocks and assumptions.</p><p>That philosophy hasn’t changed.</p><p>What has changed, however, is how little effort it now takes to achieve that goal. With Elasticsearch 9.x, the modern Java client, and Testcontainers 2.x, the experience of writing integration tests feels noticeably smoother, as if a layer of incidental complexity has quietly been removed.</p><p>The example accompanying this article is intentionally modest and can be found <a href="https://github.com/pioorg/elasticsearch9-testcontainers2/blob/main/src/test/java/testing_elasticsearch/ES9TC2DemoTest.java">here</a>.</p><p>It doesn’t attempt to demonstrate sophisticated indexing strategies or elaborate data pipelines; instead, it concentrates on the essentials, because the essentials are precisely where the improvements are most visible.</p><h2>When the tooling stops getting in the way</h2><p>Anyone who has maintained a test suite for a few years will recognize the pattern: You introduce a new library, a transitive dependency pulls something unexpected, and before long, you’re negotiating between versions of testing engines rather than writing tests.</p><p>With Testcontainers 2.x, that negotiation largely disappears. The dependency structure is clearer, the modules are more explicit, and the accidental coupling to older testing frameworks no longer sneaks in behind your back. In practical terms, adding Elasticsearch support to your tests is now as straightforward as declaring:</p><pre><code><dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers-elasticsearch</artifactId>
  <version>2.0.3</version>
  <scope>test</scope>
</dependency></code></pre><p>And, if you’re using JUnit Jupiter integration:</p><pre><code><dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers-junit-jupiter</artifactId>
  <version>2.0.3</version>
  <scope>test</scope>
</dependency></code></pre><p>There are no exclusions to sprinkle in, no legacy engines to silence, and no uneasy feeling that something hidden might surface during the next upgrade. The configuration becomes almost unremarkable, which, in the context of build tooling, is a compliment.</p><h2>A real Elasticsearch node, with security intact</h2><p>In the demo test, we use the official Elasticsearch 9.3.1 Docker image:</p><pre><code>var container =
    new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:9.3.1");

container.start();</code></pre><p>At first glance, this may look similar to older examples, yet the subtle difference lies in what we no longer need to do. <strong>We don’t disable security.</strong> <strong>We don’t bypass SSL.</strong> We don’t simplify the environment just to make the test convenient.</p><p>Instead, once the container is started, we construct a client that uses the REST API and authenticates properly:</p><pre><code>try (var client = ElasticsearchClient.of(c -> c
     .host("https://" + container.getHttpHostAddress())
     .usernameAndPassword("elastic", ElasticsearchContainer.ELASTICSEARCH_DEFAULT_PASSWORD)
     .sslContext(container.createSslContextFromCa())
)) {</code></pre><p>What deserves special mention here is how neat the client construction itself has become. In earlier iterations, creating an Elasticsearch client often meant juggling multiple intermediate objects, configuring transport layers explicitly, wrapping low-level clients, and dedicating some amount of code to what was essentially plumbing. Now, the signal-to-noise ratio is refreshingly high. The builder encapsulates the necessary details, the container provides what the client needs, and the resulting configuration fits comfortably within a few readable lines.</p><p>Just as importantly, the <code>ElasticsearchClient</code> is <code>AutoCloseable</code>, which means it integrates naturally with try-with-resources, ensuring proper cleanup without additional ceremony. The lifecycle is explicit, concise, and self-contained, which is exactly what you want in integration tests that should focus on behavior rather than infrastructure management.</p><p>The container exposes everything required to build a legitimate, secure connection, and the client integrates with it naturally, which means the test environment mirrors production in all the aspects that matter, without imposing additional mental overhead on the developer.</p><p>This alignment between realism and simplicity is, perhaps, one of the most meaningful improvements.</p><h2>Typed APIs change the character of tests</h2><p>The evolution of the Elasticsearch Java client has also reshaped how integration tests read and feel. Where older approaches often involved parsing JSON responses or navigating loosely typed structures, the modern client offers a builder-based, strongly typed API that guides you through valid request shapes at compile time.</p><p>In the demo, we perform a simple cluster health check:</p><pre><code>var health = client.cluster().health();

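// Typed response accessors: no JSON parsing, no string-keyed maps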
Assertions.assertEquals("docker-cluster", health.clusterName());
Assertions.assertEquals(HealthStatus.Green, health.status());</code></pre><p>What’s striking here is not the complexity of the operation, but the absence of friction. There’s no manual extraction from maps, no assertions built on untyped string values, and no detour into low-level response handling. The test code looks indistinguishable from application code, which subtly reinforces the idea that integration tests aren’t a special category of code with different rules, but simply another consumer of the same APIs.</p><p>When the boundary between production code and test code becomes thinner, confidence increases almost by default.</p><h2>Reading the test as a story</h2><p>If you take a look at the full test case:</p><pre><code>@Test
void newClientTest() throws IOException {
    try (var container =
             new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:9.3.1")) {
        
        container.start();
        
        try (
            var client = ElasticsearchClient.of(c ->
                c.host("https://" + container.getHttpHostAddress())
                    .usernameAndPassword("elastic", ElasticsearchContainer.ELASTICSEARCH_DEFAULT_PASSWORD)
                    .sslContext(container.createSslContextFromCa()))) {

            HealthResponse health = client.cluster().health();

            Assertions.assertEquals("docker-cluster", health.clusterName());
            Assertions.assertEquals(HealthStatus.Green, health.status());
        }
    }
}</code></pre><p>you’ll notice that it reads less like a configuration script and more like a short narrative:</p><ul><li>We define the container.</li><li>We start the container.</li><li>We build a client.</li><li>We call a real API.</li><li>We assert the outcome.</li></ul><p>The supporting infrastructure fades into the background, leaving the intent of the test clearly visible. That clarity isn’t accidental; it’s the cumulative effect of incremental improvements across Testcontainers and the Elasticsearch client.</p><h2>The advanced patterns still apply</h2><p>None of the more advanced techniques discussed in earlier articles, <a href="https://www.elastic.co/search-labs/blog/elasticsearch-integration-tests-faster">Faster integration tests with real Elasticsearch</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-improve-performance-integration-tests">Advanced integration tests with real Elasticsearch</a>, have become obsolete. Reusing containers to speed up large test suites, customizing cluster settings, preloading indices, or testing role-based access scenarios remain entirely valid and, in many cases, essential.</p><p>What has improved is the baseline experience. The simplest possible integration test, the one that merely needs a real node and a real client, no longer requires defensive configuration or dependency gymnastics. It’s concise, expressive, and production-like by default.</p><h2>Progress without drama</h2><p>There was no dramatic rewrite of the ecosystem, no disruptive migration guide that forced a rethinking of everything. Instead, there has been a steady refinement of APIs and dependencies, each release smoothing a rough edge here and removing a surprise there.</p><p>The result isn’t flashy, yet it’s tangible. Writing integration tests against Elasticsearch now feels less like assembling a test harness and more like exercising a real system in miniature.</p><p>Sometimes progress announces itself loudly. Sometimes it arrives quietly, in the form of code that simply reads better and requires less explanation. In this case, it’s the latter, and for those of us who care about clean, reliable integration tests, that’s more than enough.</p><p>And what if we could do something similar with Kibana? Sounds appealing? Stay tuned!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI agent memory: Creating smart agents with Elasticsearch managed memory]]></title>
            <description><![CDATA[ Learn how to create smarter and more efficient AI agents by managing memory using Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/ai-agent-memory-management-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/ai-agent-memory-management-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly, Jeffrey Rengifo]]></dc:creator>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0172bf1bf25f10001de046531d7dfae816c1f80-1200x676.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll learn about using memory techniques to make agents smarter using Elasticsearch as the database for memories and knowledge.</p><h2>Understanding memory in large language models (LLMs)</h2><p>Here&#x27;s something that trips people up: The conversations with LLMs are completely <a href="https://www.geeksforgeeks.org/computer-networks/difference-between-stateless-and-stateful-protocol/">stateless</a>. Every time you send a message, you need to include the entire chat history to &quot;remind&quot; the model what happened before. The ability to keep track of what was asked and answered within a single conversation session is what we call <strong><em>short-term memory</em></strong>.</p><p>But here&#x27;s where it gets interesting: Nothing stops us from manipulating this chat history beyond simple storage. For example, when we want to persist memories like user preferences across different conversations, we inject those into fresh conversations when needed and call it <strong><em>long-term memory</em></strong>.</p><h2>Why mess with chat history?</h2><p>There are three compelling reasons to go beyond simply appending each new message and response to a growing list that gets sent to the LLM with every request:</p><ul><li><strong>Inject useful context:</strong> Add information about previous interactions, like user preferences, without cluttering the current conversation.</li><li><strong>Summarize and remove data:</strong> Clean up information the model has already used to avoid confusion (<em>context poisoning</em>) and keep the model focused.</li><li><strong>Save tokens:</strong> Remove unnecessary data to prevent filling the context window, enabling longer, more meaningful conversations.</li></ul><p>This opens up some sci-fi possibilities. Imagine an agent that selectively remembers things based on its environment or who it&#x27;s talking to, like the TV show <strong><em>Severance</em></strong>, where the main character, Mark, has a chip implanted in his brain that creates two separate identities with distinct memories depending on whether he’s in the office (&quot;innie&quot;) or outside of it (&quot;outie&quot;), switching based on location.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3c08fe6139cae5571288d713e890b7c8ca3f428-1000x1500.png" style="width: 100%;" /><h2>Memory types and selective retrieval in agents: Creating smart agents with Elasticsearch managed memory</h2><p>Not all memories serve the same purpose, and treating them as interchangeable chat history limits how far agents can scale. Modern agent architectures, including frameworks like <a href="https://arxiv.org/abs/2309.02427">Cognitive Architectures for Language Agents (CoALA)</a>, distinguish between <strong>procedural</strong>, <strong>episodic</strong>, and <strong>semantic</strong> memory. Rather than treating all context as a single growing buffer, these architectures recognize that each memory type requires distinct storage, retrieval, and consolidation strategies.</p><h3>Procedural memory: How the agent operates</h3><p><em>Procedural memory</em> defines how an agent behaves, not what it knows or remembers.</p><p>In practice, this includes:</p><ul><li>When to store a memory.</li><li>When to retrieve one.</li><li>How to summarize conversations.</li><li>How to use tools.</li></ul><p>In our system, procedural memory lives primarily in the application code and prompts and isn’t stored in Elasticsearch. 
Instead, Elasticsearch is used by procedural memory.</p><p><strong><em>Procedural memory determines how memory is used, not what’s stored.</em></strong></p><h3>Episodic memory: What happened</h3><p><em>Episodic memory</em> captures specific experiences tied to an entity and a context.</p><p>Examples:</p><ul><li>“Peter’s birthday is tomorrow and he wants steak.”</li><li>“Janice has a report due at 9 am.”</li></ul><p>This is the most dynamic and personal form of memory and the one most prone to context pollution if handled incorrectly.</p><p>In our architecture:</p><ul><li>Episodic memories are stored as documents in Elasticsearch.</li><li>Each memory includes metadata (user, role, timestamp, innie or outie).</li><li>Retrieval is selective, based on who’s asking and in what context.</li></ul><p>This is where the innie/outie model applies as an example of episodic memory isolation.</p><h3>Semantic memory: Ground truth</h3><p><em>Semantic memory</em> represents<strong> abstracted, generalized knowledge about the world</strong>, independent of any single interaction or personal context. Unlike episodic memory, which is tied to who said what and when, semantic memory captures what is true in general.</p><p>In our analogy, the knowledge about <strong>Lumon</strong>, which is the company where Mark works in the show <em>Severance</em>, is world truth shared between innies and outies.</p><p>Things like company handbooks and rules are part of the knowledge being used as semantic memory.</p><p>While episodic memory retrieval prioritizes precision and strong contextual filters (such as identity, role, and time), semantic memory favors high-recall, concept-level retrieval. It’s designed to surface generally true information that can ground reasoning, rather than personal experiences tied to a specific situation.</p><p>Let’s move to architecture and see how these ideas translate into a memory system for our agent.</p><h2>Prerequisites</h2><ul><li>Elasticsearch Elastic Cloud Hosted (ECH) or self-hosted 9.1+ instance.</li><li>Python 3.x.</li><li><a href="https://platform.openai.com/docs/api-reference/authentication">OpenAI API Key</a>.</li></ul><p>The full Python notebook for this application can be found <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/smarter-agents-with-memory/notebook.ipynb">here</a>.</p><h2>Why Elasticsearch?</h2><p>Elasticsearch is an ideal solution for storing both knowledge and memory because it&#x27;s a native vector database ready to scale. It gives us everything we need to manage selective memory:</p><ul><li><strong>Vector database</strong> with <a href="https://www.elastic.co/docs/solutions/search/hybrid-semantic-text">hybrid search</a> to find memories by context, not only by keywords.</li><li><strong>Multiple data types</strong>, including text, numbers, dates, and geolocation.</li><li><strong>Metadata filters</strong> for complex queries across different fields.</li><li><a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level"><strong>Document level security</strong></a> to filter memories based on who&#x27;s asking.</li></ul><h3>Why selective memory improves latency and reasoning</h3><p>Selective memory is not only about correctness and isolation; it also has a direct impact on latency and model performance. 
By narrowing the search space using structured filters (such as memory type, user, or time) before running semantic retrieval, Elasticsearch reduces the number of vectors that need to be scored and the amount of context that must be injected into the LLM. This results in faster retrieval, smaller prompts, and more focused attention for the model, which in practice translates into lower latency, lower token usage, and more accurate responses.</p><p>Episodic memory is inherently temporal: Recent experiences are usually more relevant than older ones, and not all memories should be kept with the same level of detail forever. In human cognition, experiences are gradually forgotten, summarized, or consolidated into more abstract knowledge.</p><p>Memory compression is a whole different topic, but you can implement strategies to summarize and store old memories while retrieving the fresh ones entirely.</p><h2>The setup</h2><p>Following the <strong><em>Severance</em></strong> concept, we&#x27;re creating an agent named Mark with two distinct memory sets:</p><ul><li><strong>Innie memories</strong>: Work-related conversations with colleagues.</li><li><strong>Outie memories</strong>: Personal conversations with friends and family.</li></ul><p>When Mark talks to an innie, he shouldn&#x27;t remember conversations with outies, and vice versa.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e0172bf1bf25f10001de046531d7dfae816c1f80-1200x676.png" style="width: 100%;" /><h2>Building the memory system</h2><h3>Memory index structure</h3><p>First, we define our memory schema:</p><pre><code>mappings = {
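    # memory_text is a multi-field: full-text (BM25) search on the base field,
    # plus a semantic_text subfield (memory_text.semantic) for semantic search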
    "properties": {
        "user_id": {"type": "keyword"},
        "memory_type": {"type": "keyword"},
        "created_at": {"type": "date"},
        "memory_text": {
            "type": "text",
            "fields": {
                "semantic": {
                    "type": "semantic_text"
                }
            }
        }
    }
}</code></pre><p><em>Note that we use <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/multi-fields">multi-field</a> for <code>memory_text </code>so we can do both full-text search, and <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic search</a> using the <a href="https://www.elastic.co/search-labs/tutorials/search-tutorial/semantic-search/elser-model">Elastic Learned Sparse EncodeR (ELSER) model</a> (default) against the same field content.</em></p><p>This gives us semantic search capabilities while maintaining structured metadata for filtering.</p><h3>Setting up document level security</h3><p>This is the key piece that makes selective memory work. We create two separate roles: one for innies, one for outies, each with query-level filters built in. When a user with the innie role queries the memories index, Elasticsearch automatically applies a filter that only returns memories where <code>memory_type</code> equals &quot;innie&quot;.</p><p><em>You can find more illustrative examples about access control <a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level#basic-examples">here</a> and about role management <a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/kibana-role-management">here</a>.</em></p><p>Here&#x27;s the innie role:</p><pre><code>innie_role_descriptor = {
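    # Document level security: every query from a user holding this role is
    # implicitly AND-ed with this filter, so only "innie" memories are returned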
    "indices": [
        {
            "names": ["memories"],
            "privileges": ["read", "write"],
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"memory_type": "innie"}}
                    ]
                }
            }
        }
    ]
}</code></pre><p>We create a similar role for outies, just filtering by <code>&quot;memory_type&quot;: &quot;outie&quot;</code> instead.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e675da210be223200e1d5519b5bc62b6f97b3b1-1999x1046.png" style="width: 100%;" /><p>Then we create users and assign them to these roles. For example:</p><ul><li><strong>Peter (outie):</strong> Can only access memories marked as <code>&quot;outie&quot;</code>.</li><li><strong>Janice (innie):</strong> Can only access memories marked as <code>&quot;innie&quot;</code>.</li></ul><p>When Mark (our agent) receives a query, he uses the credentials of whoever is asking. If Peter asks something, Mark uses Peter&#x27;s credentials, which means Elasticsearch automatically filters to only show outie memories. If Janice asks, only innie memories are visible.</p><p>The application code doesn&#x27;t need to apply any filtering; user management is completely decoupled from the application logic. Elasticsearch handles all the security automatically.</p><h3>Creating the agent tools</h3><p>We define three key functions for our agent:</p><ul><li><strong><code>GetKnowledge</code>:</strong> Searches the knowledge base for relevant context (<a href="https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag">traditional retrieval augmented generation [RAG]</a>).</li><li><strong><code>GetMemories</code>:</strong> Retrieves memories using <a href="https://www.elastic.co/what-is/hybrid-search">hybrid search</a> (semantic + keyword):</li></ul><pre><code>def get_memory(query: str):
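    """Hybrid retrieval: blend a semantic retriever and a BM25 multi_match via RRF.

    No security filter is applied here; document level security on the calling
    user's role decides which memories Elasticsearch may return.
    """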
    es_query = {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "semantic": {
                                    "field": "semantic_field",
                                    "query": query
                                }
                            }
                        }
                    },
                    {
                        "standard": {
                            "query": {
                                "multi_match": {
                                    "query": query,
                                    "fields": ["memory_text"]
                                }
                            }
                        }
                    }
                ],
                "rank_window_size": 50,
                "rank_constant": 20
            }
        }
    }
    
    response = user_es_client.search(index="memories", body=es_query)
    return response</code></pre><p><em>Notice that we don&#x27;t apply security filters in the query; Elasticsearch handles that automatically based on the user&#x27;s credentials.</em></p><ul><li><strong><code>SetMemory</code>:</strong> Stores new memories (implementation uses LLM to convert conversations into structured memory records).</li></ul><h3>How the agent uses these tools</h3><p>When a user asks Mark a question, the flow works like this:</p><p>1.<strong> User asks:</strong> &quot;What&#x27;s my favorite family destination?&quot;</p><p>2.<strong> LLM decides to use tools:</strong> OpenAI&#x27;s Response API with function calling lets the LLM decide it needs to call <code>GetMemories</code> with the query <code>&quot;favorite family destination&quot;</code>.</p><p>3.<strong> We execute the function:</strong> Our code calls <code>get_memory(&quot;favorite family destination&quot;)</code> using the user&#x27;s credentials (Peter&#x27;s in this case).</p><p>4. <strong>Elasticsearch filters automatically:</strong> Because we&#x27;re using Peter&#x27;s credentials, only outie memories are returned:<br/></p><pre><code>Memories
peter125: (User name is Peter Johnson. His favorite family destination is Disneyland.)</code></pre><p>5. <strong>We send results back to LLM:</strong> The memory gets added to the conversation context.</p><p>6.<strong> LLM generates an answer:</strong> &quot;Your favorite family destination is Disneyland.&quot;</p><p>Here&#x27;s the actual code that handles this loop:</p><pre><code># Initial call with tools available
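# "tools" holds the function schemas for GetKnowledge, GetMemories, and SetMemory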
response = client.responses.create(
    model="gpt-4.1-mini",
    input=messages,
    tools=tools,
    parallel_tool_calls=True
)

# Execute any tool calls the LLM requested
for tool_call in response.output:
    if tool_call.name == "GetMemories":
        result = get_memory(tool_call.arguments["query"])
        # Add result to messages
        
# Call LLM again with tool results to generate final answer
final_response = client.responses.create(
    model="gpt-4.1-mini",
    input=messages  # Now includes tool results
)</code></pre><p>The key insight: The application doesn&#x27;t decide which memories to retrieve or when. The LLM decides based on the user&#x27;s question, and Elasticsearch ensures that only the right memories are accessible.</p><h2>Testing selective memory</h2><p>Let&#x27;s see it in action:</p><p>Outie conversation (Peter):</p><pre><code>Peter: Hey Mark, my birthday is tomorrow! I'd like to have a steak for dinner.
Mark: That's great! (memory stored)</code></pre><p>Mark stores this as an outie memory associated with Peter. Here&#x27;s what that memory looks like in Elasticsearch:</p><pre><code>{
    "user_id": "peter125",
    "memory_type": "outie",
    "created_at": "2025-10-11T18:02:52.182780",
    "memory_text": "Peter's birthday is tomorrow. He wants steak for dinner."
}</code></pre><p>Innie conversation (Janice):</p><pre><code>Janice: Hey Mark, remember we have to finish the end of year report tomorrow at 9am.
Mark: Thanks for reminding me! (memory stored)</code></pre><p>This creates a separate innie memory:</p><pre><code>{
    "user_id": "janice456",
    "memory_type": "innie", 
    "created_at": "2025-10-11T19:15:33.445821",
    "memory_text": "End of year report deadline tomorrow at 9am with Janice."
}</code></pre><p>Imagine Peter also works at Lumon. A colleague stores a work-related memory about him:</p><pre><code>{
    "user_id": "innie-peter",
    "memory_type": "innie",
    "created_at": "2025-10-11T20:30:00.000000",
    "memory_text": "Peter needs to review the Q4 budget spreadsheet before Friday."
}</code></pre><p>This memory exists in Elasticsearch, but Peter&#x27;s current credentials only grant him the outie role. When he asks Mark about work tasks, this memory is invisible to him; Elasticsearch&#x27;s document level security ensures that it’s never returned.</p><p><em>Note: To allow interaction with these memories, you would need to create a separate user (or assign an additional role) with &quot;innie&quot; access for Peter. This is left as an exercise, but it demonstrates that the same person can have isolated memory contexts, and access is controlled entirely at the security layer.</em></p><h3>Memory isolation test</h3><p>Now Peter starts a new conversation:</p><pre><code>Peter: Hey Mark, do you remember what I want for my birthday?
Mark: Yes! You want steak.

Peter: When do you have to finish the end of year report?
Mark: What are you talking about?</code></pre><p>Perfect! Mark only accesses outie memories when talking to Peter. The agent&#x27;s &quot;brain&quot; is genuinely split, just like in the show.</p><h2>The full implementation</h2><p>The complete working implementation is available in this <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/smarter-agents-with-memory/notebook.ipynb">notebook</a>, where you can:</p><ul><li>Set up the Elasticsearch indices.</li><li>Create roles and users with document level security.</li><li>Build the agent with OpenAI&#x27;s Response API.</li><li>Test the selective memory system.</li></ul><h2>Conclusion</h2><p>Memory isn’t just a place to store past conversations. It’s part of the agent’s architecture. By going beyond raw chat history and separating procedural, episodic, and semantic memory, we can build agents that reason more clearly, scale better, and stay focused over long interactions.</p><p>Selective retrieval reduces context pollution, lowers latency, and improves the quality of the information sent to the LLM. Episodic memory can be filtered by user and time, semantic memory can be used to ground answers in shared knowledge, and procedural memory controls how and when all of this is used.</p><p>Elasticsearch provides the building blocks to implement this in practice through hybrid search, rich metadata, security, and temporal filtering. Just like in <em>Severance</em>, we can create agents with isolated experiences and shared world knowledge. The difference is that here the split is intentional and useful, not a mystery.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Gemini CLI extension for Elasticsearch with tools and skills]]></title>
            <description><![CDATA[Introducing Elastic’s extension for Google's Gemini CLI to search, retrieve, and analyze Elasticsearch data in developer and agentic workflows.
]]></description>
            <link>https://www.elastic.co/search-labs/blog/gemini-cli-extension-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/gemini-cli-extension-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Walter Rafelsberger, Enrico Zimuel]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dfcec4a77b3dc83bf0d974417bf2e211abb1f4f-876x468.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We&#x27;re glad to announce the release of our Elastic extension for Google&#x27;s Gemini CLI, bringing the full power of <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a> directly into your AI development workflow. This extension also offers several recently developed agent skills for interacting with Elasticsearch.</p><p>The extension is available as an open source project <a href="https://github.com/elastic/gemini-cli-elasticsearch">here</a>.</p><h2>What is Gemini CLI, and how do you install it?</h2><p><a href="https://geminicli.com/">Gemini CLI</a> is an open source AI agent that brings Google’s Gemini models directly into the command line. It allows developers to interact with AI from the terminal to perform tasks such as generating code, editing files, running shell commands, and retrieving information from the web.</p><p>Unlike typical chat interfaces, Gemini CLI integrates with your local development environment, meaning that it can understand project context, modify files, run builds or tests, and automate workflows directly within the terminal. This makes it useful for developers, site reliability engineers (SREs), and engineers who want AI-assisted coding and automation without leaving their command-line workflow.</p><p>Gemini CLI can be installed using several package managers. The most common method is via npm:</p><pre><code>npm install -g @google/gemini-cli</code></pre><p>If you want to know alternative installation options, refer to the <a href="https://geminicli.com/docs/get-started/installation/">official installation page</a>.</p><p>After installation, start the CLI by running:</p><pre><code>gemini</code></pre><p>You see a screen, as shown in Figure 1:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6dfcec4a77b3dc83bf0d974417bf2e211abb1f4f-876x468.png" style="width: 100%;" /><h2>Configure Elasticsearch</h2><p>We need to have an Elasticsearch instance running. If you want to use the Model Context Protocol (MCP) server, you also need Kibana 9.3+ installed. To use the Elasticsearch Query Language (ES|QL) skill (<code>esql</code>) described below, Kibana is not required.</p><p>You can activate a free trial on <a href="https://www.elastic.co/cloud">Elastic Cloud</a> or install it locally using the <a href="https://github.com/elastic/start-local"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>This will install Elasticsearch and Kibana on your computer and will generate an API key to be used for configuring Gemini CLI.</p><p>The API key will be shown as output of the previous command and stored in a <strong>.env</strong> file in the <strong><code>elastic-start-local</code></strong> folder.</p><p>If you’re using on-premises Elasticsearch (for example, using <code>start-local</code>), and you want to use Elastic Agent Builder with MCP, you also need to connect a large language model (LLM). 
You can read <a href="https://www.elastic.co/docs/explore-analyze/ai-features/llm-guides/llm-connectors">this documentation page</a> to understand the different options.</p><p>If you’re using Elastic Cloud (or serverless), you already have a prebuilt LLM connection in place.</p><h2>Install the Elasticsearch extension</h2><p>You can install the Elasticsearch extension for Gemini CLI with the following command:</p><pre><code>gemini extensions install https://github.com/elastic/gemini-cli-elasticsearch</code></pre><p>You can check that the extensions have been installed successfully by opening Gemini and executing the following command:</p><pre><code>/extensions list</code></pre><p>You should see the Elasticsearch extension available.</p><p>If you want to use the MCP integration, you need to have an Elasticsearch 9.3+ version installed. You need your MCP server URL from <a href="https://www.elastic.co/kibana">Kibana</a>:</p><ul><li>Get your MCP server URL from Agents &gt; View all tools &gt; Manage MCP &gt; Copy MCP Server URL.</li><li>The URL will look like this: https://your-kibana-instance/api/agent_builder/mcp</li></ul><p>You need the Elasticsearch endpoint URL. This is typically reported at the top of the Kibana Elasticsearch page. If you’re running Elasticsearch with <code>start-local</code>, you already have the endpoint in the <code>ES_LOCAL_URL</code> key in the <code>start-local</code> .env file.</p><p>You also need an API key. If you’re running Elasticsearch with <code>start-local</code>, you already have the <code>ES_LOCAL_API_KEY</code> in the <code>start-local</code> .env file. Otherwise, you can create an API key using the Kibana interface, as reported <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">here</a>:</p><ul><li>In Kibana: Stack Management &gt; Security &gt; API Keys &gt; Create API key.</li><li>We suggest setting only the read privileges for the API key, enabling the <code>feature_agentBuilder.read</code> privilege as reported <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/permissions#grant-access-with-roles">here</a>.</li><li>Copy the encoded API key value.</li></ul><p>Set the required environment variables in your shell:</p><pre><code>export ELASTIC_URL="your-elasticsearch-url"
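# ELASTIC_URL and ELASTIC_API_KEY can be taken from start-local's .env file
# (ES_LOCAL_URL / ES_LOCAL_API_KEY); the MCP URL is copied from Kibana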
export ELASTIC_MCP_URL="your-elasticsearch-mcp-url"
export ELASTIC_API_KEY="your-encoded-api-key"</code></pre><h2>Install the example dataset</h2><p>You can install the <strong>eCommerce orders </strong>dataset available from Kibana. It includes a single index named <strong><code>kibana_sample_data_ecommerce</code></strong>, containing information for 4,675 orders from an ecommerce website. For each order, we have the following information:</p><ul><li>Customer information (name, ID, birth date, email, and more).</li><li>Order date.</li><li>Order ID.</li><li>Products (list of all the products with price, quantity, ID, category, discount, and other details).</li><li>SKU.</li><li>Total price (taxless, taxed).</li><li>Total quantity.</li><li>Geo information (city, country, continent, location, region).</li></ul><p>To install the sample data, open the <strong>Integrations</strong> page in Kibana (search for “Integration” in the top search bar) and install the <strong>Sample Data</strong>. For more details, refer to the documentation <a href="https://www.elastic.co/docs/explore-analyze/#gs-get-data-into-kibana">here</a>.</p><p>The goal of this article is to show how easy it is to configure Gemini CLI to connect to Elasticsearch and interact with the <strong><code>kibana_sample_data_ecommerce</code></strong> index.</p><h2>How to use the Elasticsearch MCP</h2><p>You can check the connection using the following command in Gemini:</p><pre><code>/mcp list</code></pre><p>You should see the <strong><code>elastic-agent-builder</code></strong> enabled, as shown in Figure 2:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1508423bc1d1bc3c04a1cb01e2d59495a3516ed1-1465x844.png" style="width: 100%;" /><p>Elasticsearch provides a default set of tools. See the description <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools/builtin-tools-reference">here</a>.</p><p>Using these tools, you can interact with Elasticsearch, asking questions like:</p><ul><li><code>Give me the list of all the indexes available in Elasticsearch.</code></li><li><code>How many customers are based in the USA in the kibana_sample_data_ecommerce index of Elasticsearch?</code></li></ul><p>Depending on the question, Gemini will use one or more of the available tools to try to answer it.</p><h2>The /elastic commands</h2><p>In the Elasticsearch extension for Gemini CLI, we also added<strong> <code>/elastic</code></strong> commands.</p><p>If you execute the <strong><code>/help</code></strong> command, you see all the available <code>/elastic</code> options (Figure 3):</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b2a0727ce7a04354878dd048253d3f4d062324b-1983x230.png" style="width: 100%;" /><p>These commands can be useful if you want to directly execute a specific tool of the <code>elastic-agent-builder</code> MCP server. For instance, using the following command, you can get the mapping of the <code>kibana_sample_data_ecommerce</code>:</p><pre><code>/elastic:get-mapping kibana_sample_data_ecommerce</code></pre><p>These commands are essentially shortcuts for executing specific tools, rather than relying on the Gemini model to determine which tool should be invoked.</p><h2>How to use the Elasticsearch skills</h2><p>This extension also comes with an <a href="https://github.com/elastic/gemini-cli-elasticsearch/tree/main/skills/esql">agent skill for ES|QL</a>, the <a href="https://www.elastic.co/docs/explore-analyze/discover/try-esql">Elasticsearch Query Language</a> available in Elasticsearch. 
<a href="https://agentskills.io/home">Agent Skills</a> is an open format that gives AI coding agents, like Gemini CLI, custom instructions for specific tasks. They use a concept called <em>progressive disclosure</em>, meaning that only a brief description of the skill is added to the initial system prompt. When you ask the agent to perform a task, like querying Elasticsearch, it matches the request to the relevant skill and dynamically loads the detailed instructions. This is an efficient way to manage token budgets while providing the AI with exactly the context it needs.</p><p>The <strong><code>esql</code> skill</strong> is designed to let Gemini CLI write and execute ES|QL queries directly against your cluster. ES|QL is a powerful piped query language that makes data exploration, log analysis, and aggregations highly intuitive. With this skill enabled, you don&#x27;t need to look up ES|QL syntax; you can simply ask the Gemini CLI natural language questions about your data, and the agent will handle the rest.</p><p>The executions are performed using simple <a href="https://curl.se/">curl</a> commands run in a terminal. This is possible because Elasticsearch provides a rich set of REST APIs that can be easily used to integrate the system into any architecture.</p><p><strong>What the <code>esql</code> skill offers:</strong></p><ul><li><strong>Index and schema discovery:</strong> The agent can use the skill&#x27;s built-in tools to list available indices and fetch field mappings. For instance, before writing a query for the eCommerce dataset, the agent can run a schema check on <strong><code>kibana_sample_data_ecommerce</code></strong> to understand the available fields, like <strong><code>taxful_total_price</code></strong> or <strong><code>category</code></strong>.</li><li><strong>Seamless natural language translation:</strong> The skill gives the agent more than just a plain reference manual; it provides a specific guide for interpreting user intent. When you type natural language requests, like &quot;Show average response time grouped by service,&quot; the agent uses the skill&#x27;s bundled pattern matching to instantly translate your words into the correct ES|QL aggregations, filters, and commands.</li><li><strong>Self-correct:</strong> If a query fails (for example, due to a type mismatch or syntax error), the skill returns the generated query alongside the exact Elasticsearch error message, allowing the agent to instantly fix the query and try again without you having to intervene.</li></ul><p>Since the <code>esql</code> skill is also available as a tool in the <code>elastic-agent-builder</code> MCP server, we need to disable this server momentarily. 
You can use the following command to disable it:</p><pre><code>/mcp disable elastic-agent-builder</code></pre><p>Then you can simply type a prompt like this into your Gemini CLI:</p><pre><code>Find the top 5 product categories by total sales revenue in the kibana_sample_data_ecommerce index</code></pre><p>The agent will:</p><ul><li>Recognize the need for the <code>esql</code> skill.</li><li>Check the schema of <strong><code>kibana_sample_data_ecommerce</code></strong>.</li><li>Construct an ES|QL query, like: <code>FROM kibana_sample_data_ecommerce | STATS total_revenue = SUM(taxful_total_price) BY category.keyword | SORT total_revenue DESC | LIMIT 5</code>.</li><li>Execute the query against the Elasticsearch API.</li><li>Present the final answer to you right in the terminal.</li></ul><p>Here is an example response from Gemini to the previous prompt:</p><pre><code>───────────────────────────────────────────────────────────
> Find the top 5 product categories by total sales revenue in the kibana_sample_data_ecommerce index   ───────────────────────────────────────────────────────────
✓  Activate Skill "esql": Interact with Elasticsearch using ES|QL and curl. Use when querying, managing indices,  …
Skill esql activated. Resources loaded from /home/user/.gemini/extensions/elasticsearch/skills/esql: 
Showing up to 200 items (files + folders).                                                                     /home/user/.gemini/extensions/elasticsearch/skills/esql/                                                            ├───SKILL.md 
└───references/
├───esql-reference.md 
├───generation-tips.md 
└───query-patterns.md

✦ I will start by exploring the mapping of the kibana_sample_data_ecommerce index to identify the fields for product category and sales revenue. Then, I will construct and execute an ES|QL query to determine the top 5 product categories by total sales revenue.

✓  Shell curl -s "${ELASTIC_URL%/}/kibana_sample_data_ecommerce/_mapping" -H "Authorization: ApiKey $(printenv ELAS…                                                                                                                       {  
   "kibana_sample_data_ecommerce": { 
     "mappings": { 
       "properties": {
         "category": { 
           "type": "text", 
           "fields": {
             "keyword": {
               "type": "keyword"
             }
           }
         }, 
         "currency": {
           "type": "keyword"
         }, 

✦ Okay, I have the mapping. The category is products.category.keyword and sales revenue is taxful_total_price. I'll now execute the ES|QL query to get the top 5 categories by revenue.

✓  Shell curl -s -X POST "${ELASTIC_URL%/}/_query" \  -H "Authorization: ApiKey $(printenv ELASTIC_API_KEY)" \  -H …
{ 
   "took": 62,
   "is_partial": false, 
   "completion_time_in_millis": 1772805644164,
   "documents_found": 4675,
   "values_loaded": 12084, 
   "start_time_in_millis": 1772805644102, 
   "expiration_time_in_millis": 1773237644087,  
   "columns": [ 
     { 
       "name": "total_revenue",    
       "type": "double"  
     }, 
     {   
       "name": "products.category.keyword",

✦ The query for top 5 product categories by revenue was successful. I'll format the returned JSON values into a   user-friendly list.

 ✓  Shell 
                             
 Top 5 product categories by total sales revenue: 
 1. Men's Clothing: 149393.91
 2. Women's Clothing: 135099.91
 3. Women's Shoes: 105479.17
 4. Men's Shoes: 91797.92
 5. Women's Accessories: 60830.31 

✦ I've found the answer; now I'll format it for clarity.

 ✓  Shell 

 Here are the top 5 product categories by total sales revenue:                                                        
                                                                      
 | Rank | Category              | Total Revenue | 
 |------|-----------------------|---------------| 
 | 1    | Men's Clothing        | 149,393.91    |
 | 2    | Women's Clothing      | 135,099.91    |
 | 3    | Women's Shoes         | 105,479.17    |
 | 4    | Men's Shoes           | 91,797.92     | 
 | 5    | Women's Accessories   | 60,830.31     | 

✦ I've formatted and presented the answer. I'm finished.
 ───────────────────────────────────────────────────────────</code></pre><p>It’s interesting to note how the Gemini model generates the final response by showing all the steps it follows. Here, you can clearly see the influence of the skill in the model’s reasoning process. The first time the model recognizes that it needs to use a skill or execute a shell command, it requests permission using the human-in-the-loop approach.</p><p>By handling the heavy lifting of schema discovery, query generation, and execution, the <code>esql</code> skill lets you focus entirely on the answers rather than the mechanics of getting them. You’ll get the data you need, properly formatted and right in your terminal, all without ever writing a single line of syntax or context-switching to a different application.</p><h2>Conclusion</h2><p>In this article, we presented the Elasticsearch extension for Gemini CLI that we recently released. This extension gives you the ability to interact with your Elasticsearch instance using Gemini and the Elasticsearch MCP server provided by Elastic Agent Builder, available starting from version 9.3.0, as well as the <code>/elastic</code> commands.</p><p>The extension also includes an <code>esql</code> skill that converts a user’s request from natural language into an ES|QL query. This skill can be particularly useful when the MCP server cannot be used, since the underlying communication is driven by simple curl commands executed in a terminal. Elasticsearch offers a rich set of REST APIs that can be easily integrated into any project. This is especially useful when developing agentic AI applications.</p><p>For more information about our Gemini CLI extension, visit the project repository <a href="https://github.com/elastic/gemini-cli-elasticsearch">here</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Skills for Elastic: Turn your AI agent into an Elastic expert]]></title>
            <description><![CDATA[Give your AI coding agent the knowledge to query, visualize, secure, and automate with Elastic Agent Skills.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-skills-elastic</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-skills-elastic</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Graham Hudgins, Matt Ryan]]></dc:creator>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/09e64953819083168a9ecef0888c7f8bde1a43bd-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Every developer, site reliability engineer (SRE), or analyst who has tried to use an AI coding agent with a specialized platform has hit the same wall. You ask the agent to write a query, configure an alert, or investigate something, and it gets close but not right. Elastic has an advantage here: over a decade of documentation, blog posts, and community answers means AI agents already know Elastic better than most data platforms. But that depth comes with noise. Deprecated APIs sit alongside current ones. Outdated patterns rank as highly as best practices. The agent confidently reproduces an approach that worked three versions ago, because in its training data, it did. The result is a correction tax: users manually feed documentation into context, fix hallucinated syntax, and work around the agent instead of with it. Worse, advanced capabilities go unused entirely, not because users don&#x27;t need them, but because the agent doesn&#x27;t know they exist.</p><p>That&#x27;s why we’re open-sourcing <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills</a>: native platform expertise for Elasticsearch, Kibana, Elastic Observability, and Elastic Security. Drop them into the agent runtime you already use, and turn your agent from a ‘generalist’ that guesses at syntax into a specialist with real expertise, such as using many of the same architectural standards as Elastic’s own engineering teams. This initial technical preview release focuses on skills with maximum compatibility for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, but will evolve quickly to include improved support for older stack releases.</p><p>In addition, Elastic is solving this problem from both sides. For agents on the Elastic platform, <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a> (now generally available) lets you create and chat with AI agents that inherit your data&#x27;s access controls, use built-in search and analysis tools, and work in context alongside your dashboards, alerts, and investigations. We&#x27;re working hard to ensure amazing agentic experiences in the Elastic platform. But not every agent lives inside Elastic. Your team already uses Cursor, Claude Code, or other runtimes, and those agents need to get Elastic right, too. That&#x27;s where Agent Skills come in.</p><h2>Why agents struggle with specialized platforms</h2><p>Large language models (LLMs) are remarkably capable generalists. They can write Python, explain Kubernetes manifests, and refactor React components because their training data is rich with examples. But when it comes to platform-specific work, the kind that involves proprietary query languages, deep API surfaces, and domain-specific best practices, they fall short in predictable ways.</p><p>For Elasticsearch, the gap shows up concretely:</p><ul><li><strong>Elasticsearch Query Language (ES|QL) is new territory.</strong> LLMs are trained heavily on SQL, but ES|QL is a piped query language with different syntax, different functions, and different semantics. Agents frequently write queries that look plausible but don&#x27;t parse. 
They confuse <code>WHERE</code> with <code>| WHERE</code>, invent functions that don&#x27;t exist, and miss the pipe-based composition model entirely.</li><li><strong>API surfaces are wide and deep.</strong> Elasticsearch, Kibana, and Elastic Security expose hundreds of APIs across search, ingestion, alerting, detection rules, case management, dashboards, and more. An agent armed with nothing but general training data has to guess which endpoint to call, what the request body looks like, and how to handle the response. It guesses wrong often enough to erode trust.</li><li><strong>Best practices aren&#x27;t in the training data.</strong> When should you use <code>semantic_text</code> versus a custom embedding pipeline? How should you structure an ingest pipeline for a 10GB CSV? What&#x27;s the right detection rule syntax for a <a href="https://www.elastic.co/docs/solutions/security/detect-and-alert/mitre-attandckr-coverage">MITRE ATT&amp;CK</a> technique? General-purpose agents don&#x27;t have curated, reliably structured Elastic-specific knowledge loaded by default. They&#x27;d have to go find it, and even if they did, raw docs don&#x27;t always encode the judgment calls and best practices that skilled practitioners carry.</li></ul><p>The result: Developers spend more time fixing agent output than they would have spent writing the code themselves. That&#x27;s not the experience anyone signed up for.</p><h2>Agent Skills: Platform knowledge, packaged for agents</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d16ec00d16e70a916c5eef0aaa23fcc735b7186-1067x1280.png" style="width: 100%;" /><p>Agent Skills are self-contained directories of instructions, scripts, and reference material that agent runtimes can load dynamically. When a skill is active, the agent has access to the right context at the right time: query syntax, API patterns, validation logic, worked examples, so it can complete tasks correctly on the first try.</p><p>Each skill follows the open <a href="https://agentskills.io">agentskills.io</a> specification: a folder with a <code>SKILL.md</code> file containing metadata and structured instructions. No proprietary format, no lock-in. Skills work across agent runtimes, including Cursor, Claude Code, GitHub Copilot, Windsurf, Gemini CLI, Cline, Codex, and <a href="https://agentskills.io">many more</a>.</p><h3>What&#x27;s in the initial v0.1.0 release</h3><p>The first set of skills spans five areas of the Elastic Stack:</p><ul><li>Interacting with Elasticsearch APIs (search, indexing, cluster management)</li><li>Building and managing Kibana content such as dashboards, alerts, connectors, and more</li><li>Domain expertise for Elastic Observability</li><li>Domain expertise for Elastic Security</li><li>Making effective agents in Agent Builder</li></ul><h3>Skills are composable</h3><p>Skills aren&#x27;t monolithic. They&#x27;re modular by design. Your agent loads only the skills relevant to the task at hand. Working on an ES|QL query? The ES|QL skill activates. Need to build a dashboard from those results? The dashboards skill picks up. Assessing the health of your application? The service health skill comes into play. Investigating a security alert? The triage skill chains into case management and response skills as the investigation progresses.</p><p>This composability means that you don&#x27;t need a single, massive prompt that tries to cover everything. 
Each skill carries exactly the context its domain requires, nothing more, nothing less.</p><h2>For developers building search and AI applications</h2><p>If you&#x27;re loading data into Elasticsearch, writing queries, or migrating indices, skills reduce the cycle of generating code, hitting errors, and searching docs for what went wrong.</p><p>Ask your agent to load a CSV file, and it uses a streaming ingestion tool that handles backpressure and infers mappings from the data. It’s not a hand-rolled _bulk loop that runs out of memory on the first large file. Ask it to query with ES|QL, and it discovers your actual index names and field schemas, then writes valid piped queries with correct syntax, appropriate aggregations, and version-aware feature selection, not a SQL-flavored guess that requires three rounds of debugging. Ask it to reindex across clusters, and it follows the full operational workflow: creates the destination with explicit mappings, tunes settings for throughput, runs the job async, and restores production settings when it finishes, not a bare _reindex call that skips half the steps an experienced operator would follow.</p><p>Instead of an agent that gives you a plausible starting point you have to fix, you get one that encodes the operational discipline that makes the output actually work.</p><p><strong>Example impacts of using Elastic Agent Skills</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Eval</td><td style="border: 1px solid #ccc; padding: 4px;">What the skill changed</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-audit-query-failed-logins</td><td style="border: 1px solid #ccc; padding: 4px;">Used the audit log query patterns from the skill instead of generic search</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-authz-role-mapping-ldap</td><td style="border: 1px solid #ccc; padding: 4px;">Emitted the correct role mapping API call structure</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-basic-query</td><td style="border: 1px solid #ccc; padding: 4px;">Wrote ES|QL pipe syntax over Query DSL</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-error-handling</td><td style="border: 1px solid #ccc; padding: 4px;">Schema first instead of guessing field names</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">esql-schema-discovery</td><td style="border: 1px solid #ccc; padding: 4px;">Never guessed an index name</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-ingest-csv-with-infer</td><td style="border: 1px solid #ccc; padding: 4px;">Used --infer-mappings alone, avoided combining with --source-format csv which causes an empty index</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-ingest-json-file</td><td style="border: 1px solid #ccc; padding: 4px;">Used robust ingestion approach capable of handling large files</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-reindex-local-async</td><td style="border: 1px solid #ccc; padding: 4px;">Created destination index first with replicas: 0 and refresh_interval: "-1", then async reindex. Baseline skipped any prep</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">es-security-403-privileges</td><td style="border: 1px solid #ccc; padding: 4px;">Followed the skill's diagnostic workflow for privilege errors instead of generic advice</td>
                    </tr>
              </tbody>
            </table>
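          <p>To make a row like <code>esql-basic-query</code> concrete: with the skill loaded, the agent reaches for piped ES|QL instead of a JSON Query DSL body. A minimal sketch of that output shape, using hypothetical index and field names:</p>
          <pre><code>FROM logs-web-*
| WHERE status_code >= 500
| STATS error_count = COUNT(*) BY service.name
| SORT error_count DESC
| LIMIT 10</code></pre>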
          <h2>For security teams</h2><p>Security teams repeat the same operational workflows daily: triaging alerts, tuning detection rules, managing cases. Agent Skills encode that procedural knowledge so your AI agent can execute these workflows correctly, calling the right APIs in the right order with the right field names. For a hands-on walkthrough that takes you from zero to a fully populated Elastic Security environment without leaving your IDE, see <a href="https://www.elastic.co/security-labs/agent-skills-elastic-security">Get started with Elastic Security from your AI agent</a>.</p><h2>For observability and operations teams</h2><p>The new Agent Skills for Elastic Observability reduce the operational toil of instrumenting complex systems, managing SLOs, sifting through complex data, and assessing service health. Embedding native Elastic expertise directly into AI agents allows teams to execute complex observability workflows using simple natural language. This empowers SREs and Ops teams to resolve incidents faster and maintain reliable systems more easily. Learn more in <a href="https://www.elastic.co/observability-labs/blog/elastic-agent-skills-observability-workflows">this blog</a>.</p><h2>Open source, open spec, community-driven</h2><p>We&#x27;re releasing Agent Skills under the Apache 2.0 license because we believe that agent knowledge should be open. The <a href="https://agentskills.io">agentskills.io</a> specification that skills follow is an open standard, not an Elastic proprietary format. We want skills to be a community effort, not a walled garden.</p><h2>Part of a bigger picture</h2><p>Agent Skills is one piece of a broader initiative to make Elasticsearch the most agent-friendly data platform available. For agents that live on the Elasticsearch platform, <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Agent Builder</a> goes further by inheriting your data&#x27;s access controls and permissions, providing built-in and custom tools for search and analysis, and letting users interact with agents in context alongside their dashboards, alerts, and investigations. Finally, support for skills is coming soon to Agent Builder, giving developers the flexibility to leverage Elastic Agent Skills, as well as skills from any other source, to enable secure, context-enhanced chat and automation on the Elasticsearch platform.</p><p>For agents that live everywhere else, we&#x27;re investing in the open ecosystem:</p><ul><li><strong>Model Context Protocol (MCP) server expansion:</strong> Extending the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">MCP endpoint</a> in Agent Builder with more tools beyond the current search, ES|QL, and index operations.</li><li><strong>Authentication improvements:</strong> Making it easier for agents to connect securely, with the goal of eliminating manual API key copy-paste.</li><li><strong>LLM-readable documentation:</strong> Publishing <code>llms.txt</code> and <code>AGENTS.md</code> files so agents can discover and understand Elastic APIs on their own.</li><li><strong>A command line interface (CLI) for agent workflows:</strong> Command-line tooling that makes connection management and common operations agent-friendly.</li></ul><p>Skills are the layer you can use today. The rest is coming.</p><h2>Get started</h2><p><strong>Before you get started: </strong>AI coding agents operate with real credentials, real shell access, and often the full permissions of the user running them. 
When those agents are pointed at security workflows, the stakes are higher: you&#x27;re handing an automated system access to detection logic, response actions, and sensitive telemetry. Every organization&#x27;s risk profile is different. Before enabling AI-driven security workflows, <strong>evaluate what data the agent can access, what actions it can take, and what happens if it behaves unexpectedly</strong>.</p><p>Install Elastic Agent Skills into your agent runtime:</p><p><code>npx skills add elastic/agent-skills</code></p><p>This auto-detects your installed agent runtimes and places skills in the correct configuration directory. From there, your agent picks them up automatically.</p><p>You can also browse the <a href="https://github.com/elastic/agent-skills">skills catalog</a> directly and install individual skills manually by copying the skill folder into your agent&#x27;s configuration directory.</p><p>Don&#x27;t have an Elasticsearch cluster yet? Start an <a href="https://cloud.elastic.co/registration">Elastic Cloud free trial</a>. It takes about a minute to get a fully configured environment.</p><p><strong>Explore the project:</strong></p><ul><li><a href="https://github.com/elastic/agent-skills">Agent Skills repository</a></li><li><a href="https://agentskills.io">agentskills.io specification</a></li><li><a href="https://www.elastic.co/docs">Elasticsearch documentation</a></li><li><a href="https://cloud.elastic.co/registration">Elastic Cloud free trial</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch, part 4: The ultimate challenge]]></title>
            <description><![CDATA[Solving and evaluating entity resolution challenges in a highly diverse “ultimate challenge” dataset designed to prevent shortcuts.]]></description>
            <link>https://www.elastic.co/search-labs/blog/entity-resolution-elasticsearch-llm-challenges</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/entity-resolution-elasticsearch-llm-challenges</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/70fb0ff949f6db9ac9b8a28ecb4329ab915ebf46-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’ve now seen intelligent entity resolution implemented in two ways. Both approaches begin the same way: entity preparation and extraction, followed by candidate retrieval with Elasticsearch. From there, we evaluate those candidates using a large language model (LLM), either through prompt-based JSON generation or through function calling, and require the model to provide a transparent explanation for its judgment.</p><p>As we saw in the <a href="https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling">previous post</a>, the consistency provided by function calling is not just a nice optimization; it’s essential. Once we removed structural errors from the evaluation loop, results on standard scenarios (such as those in the tier 4 dataset) improved dramatically.</p><p>Yet there’s an obvious question left to answer:</p><p><em>Does this approach still work when things get genuinely messy?</em></p><p>Real-world entity resolution rarely fails because of simple cases. It fails when names cross languages, cultures, writing systems, time periods, and organizational boundaries. It fails when people are referenced by titles instead of names, when companies change names, when transliterations aren’t consistent, and when context (not spelling) is the only thing tying a mention to a real-world entity.</p><p>So, for the final post in this series, we put the system through what we called <strong>the ultimate challenge</strong>.</p><h2>What makes this the ultimate challenge?</h2><p>In earlier evaluations, we tested the system using increasingly complex datasets. By the time we reached tier 4, discussed in the previous post, we were already dealing with a mix of nicknames, titles, multilingual names, and semantic references. Those tests showed that the architecture itself was sound, but that reliability issues, especially malformed JSON, were suppressing recall.</p><p>With function calling in place, we finally had a stable foundation. That gave us the opportunity to ask a more interesting question:</p><p><em>Can one unified pipeline handle <strong>many different kinds</strong> of entity resolution problems at once?</em></p><p>The ultimate challenge dataset was designed to push precisely on that dimension.</p><p>Instead of focusing on a single difficulty (like nicknames or transliteration), this dataset combines <strong>50+ distinct challenge types</strong>, including:</p><ul><li>Cultural naming conventions.</li><li>Title-based references.</li><li>Business relationships and historical name changes.</li><li>Multilingual and cross-script mentions.</li><li>Compound challenges that mix several of the above.</li></ul><p>Crucially, this isn’t about optimizing for any one narrow use case. 
It’s about testing whether the <em>design pattern</em> holds up when the rules change from entity to entity.</p><h2>The dataset at a glance</h2><p>The ultimate challenge dataset consists of:</p><ul><li><strong>50 entities</strong>, spanning people, organizations, and institutions.</li><li><strong>~60 articles</strong>, with varying structure and linguistic complexity.</li><li><strong>51 distinct challenge categories</strong>, grouped broadly into:<ul><li>Cultural naming conventions.</li><li>Titles and professional context.</li><li>Business and organizational relationships.</li><li>Multilingual and transliteration challenges.</li><li>Combined and edge‑case scenarios.</li></ul></li></ul><p>Earlier in the series, we saw that using generative AI (GenAI) to create datasets can be a mixed blessing. Without it, assembling sufficiently large and diverse test data would be extremely difficult. But left unchecked, the model has a tendency to make things too easy.</p><p>On an early generation pass, for example, we discovered that the model had included phrases like “the Russian president” as explicit aliases for Vladimir Putin. That might seem reasonable today, but it defeats the purpose of testing contextual resolution. What happens if the article is discussing Russia in the 1990s? The system should infer the correct entity from context, not rely on a hard-coded alias.</p><p>For that reason, this dataset was deliberately designed so that <strong>shortcuts don’t work</strong>. Aliases are not explicitly listed when the system is expected to infer meaning. Descriptive phrases are not prelinked to entities. Correct matches often depend on article-level context, not just local text.</p><p><strong>Important note:</strong> Although we demonstrate the system’s capabilities across diverse scenarios, this is still an educational prototype. Production systems handling real-world sanctioned-entity monitoring would require additional validation, compliance checks, audit trails, and specialized handling for sensitive use cases.</p><h2>Why these scenarios are hard</h2><p>Back in the first post in this series, we introduced a simple but ambiguous example: “The new Swift update is here!” The challenge is that “Swift” can resolve to multiple real-world entities, depending on context. That example captures a broader truth: Natural language is inherently ambiguous.</p><p>Entity resolution, therefore, is not just a string-matching problem. Humans routinely rely on shared knowledge, cultural norms, and situational context to resolve references, and we rarely even notice we’re doing it.</p><p>Consider a few common cases:</p><ul><li>A title like “the president” is meaningless without geopolitical and temporal context.</li><li>A company name may refer to a parent, a subsidiary, or a former brand depending on when the article was written.</li><li>A person’s name may appear in different orders, scripts, or transliterations, depending on language and culture.</li><li>The same phrase can legitimately refer to different entities in different contexts, and the system must be able to <em>reject</em> matches just as confidently as it accepts them.</li></ul><p>There is no single rule set that handles all of this cleanly. 
That’s why this prototype separates concerns so aggressively:</p><ul><li>Elasticsearch narrows the candidate space efficiently and transparently.</li><li>The LLM is used only where judgment is required and is forced to explain itself.</li><li>Retrieval and reasoning remain distinct steps.</li></ul><p>This separation becomes even more important as the diversity of challenge types increases.</p><h2>How the system handles diversity without special cases</h2><p>One of the most interesting outcomes of this evaluation is what <em>didn’t</em> change:</p><ul><li>We did <strong>not</strong> add special logic for Japanese names.</li><li>We did <strong>not</strong> add custom rules for Arabic patronymics.</li><li>We did <strong>not</strong> add hard-coded mappings for historical company names.</li></ul><p>Instead, the system relied on the same core ingredients introduced earlier in the series:</p><ul><li>Context-enriched entities indexed for semantic search.</li><li>Hybrid retrieval (exact, alias, and semantic) in Elasticsearch.</li><li>A small, well-defined set of candidate matches.</li><li>LLM judgment constrained by function calling and minimal schemas.</li></ul><p>This suggests that the system’s flexibility comes from <strong>representation and architecture</strong>, not from an ever-growing collection of rules.</p><p>When the system succeeds, it’s because the right candidates are retrieved and the LLM has enough context to explain why a reference does (or does not) map to a specific entity.</p><h2>Results: How did it perform?</h2><p>On the ultimate challenge dataset, the system produced the following overall results:</p><ul><li><strong>Precision:</strong> ~91%</li><li><strong>Recall:</strong> ~86%</li><li><strong>F1 Score:</strong> ~89%</li><li><strong>LLM acceptance rate:</strong> ~72%</li></ul><h3>Performance across challenge types</h3><p>Breaking down results by challenge type reveals strengths and limitations:</p><p><strong>Strongest performance (100% F1 score)</strong> was observed in areas such as:</p><ul><li>Cross-script matching (Cyrillic, Korean, Chinese business entities).</li><li>Hebrew scenarios (patronymics, professional titles, religious titles, transliteration).</li><li>Business hierarchies (aerospace, diversified manufacturing, multidivision corporations).</li><li>Professional titles (academic, military, political, religious).</li><li>Combined Japanese scenarios involving multiple writing systems.</li></ul><p><strong>Strong performance (80–99% F1 score)</strong> included:</p><ul><li>International political figures (98%).</li><li>Historical name changes (90%).</li><li>Complex business hierarchies (89%).</li><li>Japanese company names (93%).</li><li>Cross-script transliteration (86%).</li><li>Arabic patronymics (86%).</li></ul><p><strong>More challenging areas</strong> included:</p><ul><li>Advanced transliteration (Chinese, Korean): 0% F1.</li><li>Certain Japanese scenarios (honorifics, name order, writing system variation): ~67% F1.</li><li>Some Arabic scenarios (company names, institutional references): ~40% F1.</li></ul><p>What’s important here is <em>why</em> the system struggled in these cases. The failures were not due to the overall approach breaking down, but to limitations in specific components, most notably the dense vector model used for semantic search in certain multilingual scenarios.</p><p>Because retrieval and judgment are cleanly separated, improving performance does not require rewriting the system. 
Swapping in a more capable multilingual embedding model, enriching entity context, or refining retrieval strategies would improve results across these categories without changing the core architecture.</p><p>From an architectural standpoint, that’s the real success metric.</p><h2>What this tells us about the design</h2><p>Looking back across the series, a few patterns stand out:</p><ul><li><strong>Preparation matters more than clever matching. </strong>Enriching entities with context up front dramatically reduces ambiguity later.</li><li><strong>LLMs are most valuable as judges, not retrievers. </strong>Asking them to explain <em>why</em> a match makes sense is far more powerful than asking them to search.</li><li><strong>Reliability enables accuracy. </strong>Function calling didn’t just clean up JSON; it unlocked recall that was already latent in the retrieval step.</li><li><strong>Generalization beats specialization. </strong>A small number of well-chosen abstractions handled dozens of challenge types without custom logic.</li></ul><p>This is why the prototype is intentionally Elasticsearch-native and intentionally conservative in how it uses LLMs. The goal isn’t to replace search; it’s to make search explainable in situations where meaning matters.</p><h2>Final thoughts</h2><p>The ultimate challenge wasn’t about chasing perfect metrics; it was about answering a more fundamental question:</p><p><em>Can a transparent, search-first, LLM-assisted architecture handle real-world entity ambiguity without collapsing into rules or black boxes?</em></p><p>For this educational prototype, the answer is yes, with clear caveats around production hardening, compliance, monitoring, and data quality. If you’re building systems that need to justify <em>why</em> an entity match was made, this pattern is worth serious consideration. I hope this series has shown that entity resolution doesn’t have to be mysterious. With the right separation of concerns, it becomes something you can reason about, measure, and improve.</p><p>This work also suggests a broader architectural pattern. What emerges is a slight but important evolution of classic retrieval augmented generation (RAG). Instead of allowing retrieval to feed generation directly, we introduce an explicit evaluation step. The LLM is first used to judge and sanity-check retrieved candidates, and only those approved results are allowed to augment generation. You can think of this as Generation-Augmented Retrieval-Augmented Generation with Evaluation, or GARAGE, because who doesn’t love a good acronym.</p><p>What other use cases could benefit from this pattern? Systems that require trust, transparency, and defensible reasoning are natural candidates. Future work in this area should prove as compelling as the results we’ve seen here, and I’m excited to see where the community takes it next.</p><h2>Next steps: Try it yourself</h2><p>Want to see the ultimate challenge in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=5%20minutes%20ago-,05_ultimate_challenge_v3.ipynb,-Initial%20public%20lab"><strong>Ultimate Challenge notebook</strong></a> for a complete walkthrough, with real implementations, detailed explanations, and hands-on examples.</p><p>The complete entity resolution pipeline demonstrates the core concepts and architecture needed for production use. 
You can use it as a foundation to build systems that monitor news articles, track entity mentions, and answer questions about which entities appear in which articles, all while retaining transparency and explainability.<br/></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The stateless architecture of Elasticsearch Serverless]]></title>
            <description><![CDATA[Exploring the stateless architecture of Elasticsearch Serverless. Learn how the stateful architecture was transformed into stateless for Serverless.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-stateless-architecture</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-stateless-architecture</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Iraklis Psaroudakis]]></dc:creator>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5ce3d804dd4652ce2114ce6a20f53cb41e5826c5-1396x812.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><strong>We&#x27;re thrilled to announce</strong> the publication of our new peer-reviewed paper, <strong><em><a href="https://doi.org/10.1145/3772052.3772245">Serverless Elasticsearch: the Architecture Transformation from Stateful to Stateless</a>,</em></strong> which was accepted and presented at the industrial track of the 2025 Association for Computing Machinery (ACM) Symposium on Cloud Computing (SoCC). The paper gives a concise view of our recent innovations in Elasticsearch. At Elastic, we’re relentlessly focused on the future of search. From optimizing performance to simplifying operations, our teams are always exploring what&#x27;s next. </p><p>This paper isn&#x27;t just an academic exercise. It&#x27;s a foundational exploration of how the core of a search engine could be reimagined for a purely serverless world. We <a href="https://www.elastic.co/search-labs/blog/stateless-your-new-state-of-find-with-elasticsearch">decouple</a> storage from compute: Data lives in a cloud blob store with virtually infinite storage and scalability. That vision is the main driver behind our <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> offering: seamless search over massive datasets, with the economics and operational simplicity of serverless.</p><h2>The challenge: Rethinking stateful search for the cloud</h2><p>For decades, search engines have been powerful, stateful systems. Deploying a production-grade cluster like Elasticsearch has meant:</p><ul><li>Provisioning servers and managing storage.</li><li>Carefully tuning configurations for cost, performance, and reliability.</li><li>Paying for idle capacity when workloads are spiky or unpredictable.</li><li>Significant operational effort to scale up and down.</li></ul><p>Modern cloud platforms have made some of this easier, but the fundamental tension remains:</p><p><em>Can we build a search engine that delivers the power and rich query capabilities of Elasticsearch with the economics and operational simplicity of a serverless architecture?</em></p><p>That question drove our research.</p><h2>Our key contributions</h2><p>The paper presents concrete innovations that make Elasticsearch Serverless possible:</p><ul><li><strong>Object store as single source of truth: </strong>We offload index data, the transaction log (translog), and cluster state to a cloud object store. That eliminates replica shards for durability and makes the object store the sync point between indexing and search.</li><li><strong>&quot;Thin&quot; (stateless) shards: </strong>Shards recover and relocate quickly across nodes without copying large amounts of data. 
Disks are used only for caching, not for persistent storage.</li><li><strong>Batched compound commits (BCC): </strong>We wrap index commits in a custom format, cutting upload costs, while keeping the same read-after-write semantics as Elasticsearch.</li><li><strong>Batched translog uploads: </strong>Translog uploads are batched at the node level, cutting upload costs.</li><li><strong>Smart garbage collection:</strong> We track the usage of BCCs and translogs we’ve uploaded, and we delete them once they’re unused, to reduce storage footprint and retention costs.</li><li><strong>Autoscaling: </strong>We scale automatically with ingestion and search load so clients can call APIs without managing cluster size.</li></ul><p><strong>The bottom line:</strong> In our experiments, Elasticsearch Serverless achieves up to <strong>twice</strong> the indexing throughput of stateful Elasticsearch on comparable hardware and scales linearly with autoscaling to match ingestion load.</p><h2>Visualizing the architecture</h2><p>Figure 1 in the paper gives a clear side-by-side view: stateful Elasticsearch versus the new stateless architecture of Elasticsearch Serverless.</p><p><strong>Stateful Elasticsearch (top):</strong> Familiar data tiers: hot, warm, cold, frozen. Data lives on local disks; primaries and replicas are spread across nodes; colder tiers may use searchable snapshots on an object store.</p><p><strong>Elasticsearch Serverless (bottom):</strong> Just two tiers: <strong>indexing</strong> and <strong>search</strong>. All durable data (Lucene commits, translogs, cluster state) lives in the object store. Indexing nodes write and upload; search nodes read from the object store and a shared cache, with no local persistence of index data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c4ca8c725cc60c06625cee3b5d1045c42367c166-1080x1116.png" style="width: 100%;" /><p>The takeaway: <em>a complete separation between the resources used for indexing and those used for querying.</em></p><h2>A tale of two data flow paths</h2><p>Figures 2 and 3 in the paper contrast how data flows in stateful Elasticsearch versus Elasticsearch Serverless.</p><p><strong>Stateful Elasticsearch </strong>(figure 2):</p><ul><li>Documents go to the primary shard&#x27;s Lucene buffers and translog and then to replica shards.</li><li>After refresh, the documents go to new searchable segments.</li><li>After flush, they’re committed to disk.</li><li>Thus, durability is given by the disk and the replicas.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eed45de365b8475b8f258342d344ad71491d39e4-1088x1022.png" style="width: 100%;" /><p><strong>Elasticsearch Serverless </strong>(figure 3):</p><ul><li>Documents go to Lucene and the translog on an indexing node.</li><li><em>Before</em> acknowledging the client, the translog is uploaded to the object store.</li><li>After refresh, the documents go to new searchable segments and are committed to disk in the indexing nodes.</li><li>After flush, they go into BCCs and are uploaded to the blob object store.</li><li>Search nodes serve queries from the object store (and, for recent data not yet uploaded, directly from the indexing node).</li><li>Thus, durability comes from the object store, not from disk or replicas.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5374213dbac9c84e0d672045f663582b25f57e7a-1090x986.png" style="width: 100%;" /><p>Result: <em>The indexing and search paths are fully 
decoupled</em>.</p><h2>Autoscaling</h2><p>Section 7 of the paper describes the autoscaler. Because data lives in the object store, relocating shards doesn&#x27;t mean copying full segment data; only metadata and, when needed, cache warming. So the cluster can scale up and down <em>much</em> faster than in stateful Elasticsearch.</p><p>How it works:</p><ul><li>The autoscaler is an external component that monitors metrics from Elasticsearch Serverless.</li><li><strong>Indexing tier:</strong> Scale-up is driven by memory usage and ingestion load (including queued work).</li><li><strong>Search tier:</strong> Scale-up is driven by memory, search load, and the user-configurable &quot;search power&quot; (how much of the dataset is cached locally).</li><li>It polls every few seconds and adjusts each tier independently.</li></ul><p>Outcome: <em>automatic, workload-driven scaling so clients can focus on their applications instead of on capacity planning</em>.</p><h2>The experimental results</h2><p>Section 8 of the paper presents our experimental evaluation.</p><p><strong>Microbenchmarks</strong> show the impact of batching: fewer object store operations for both commits and translogs, with some trade-offs.</p><p><strong>Autoscaling experiments:</strong> As we increase the number of indexing clients, throughput scales linearly while P50 and P99 latency stay stable. A real-world example shows bulk response times improving and stabilizing as the indexing tier scales up with demand.</p><p><strong>Head-to-head comparison of stateful Elasticsearch versus Elasticsearch Serverless:</strong></p><ul><li>Elasticsearch Serverless achieves roughly <strong>twice</strong> the indexing throughput of stateful Elasticsearch at the 50th percentile.</li><li>The gain comes largely from using the object store for durability instead of replicating every operation to replica shards.</li><li>Latency stays competitive.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/353ca1e865535e7211d22547a0d9eaa60e47b4aa-1084x578.png" style="width: 100%;" /><p>The takeaway: <em>The stateless design delivers both better peak performance and more efficient, automatic scaling.</em></p><h2>Why this matters for the future of Elastic</h2><p>The stateless architecture isn&#x27;t just a technical achievement; it&#x27;s the foundation for how we want search to work in the cloud.</p><ul><li><strong>Pay-as-you-go:</strong> Customers can index and search over practically limitless data without provisioning clusters, tuning tiers, or managing replicas and snapshots.</li><li><strong>Automatic scaling: </strong>Each tier scales on its own automatically; no capacity planning required.</li><li><strong>Frequent, automated upgrades:</strong> Better security and time-to-value, without the operational cost of rolling upgrades over stateful data.</li></ul><p>This work is a step toward making powerful search more accessible, cost-effective, and scalable for everyone.</p><h2>Read the full paper, and join the conversation</h2><p>We believe in the power of open research and collaboration to move technology forward. We encourage you to dive into the details. 
We provide a <a href="https://cdn.sanity.io/files/me0ej585/search-labs-import-testing/af199341cc57431238d62c7d4dbc31791bfa806e.pdf">preprint of this paper</a> for your information, which details in depth the architecture transformation.</p><h2>Dive deeper: Explore related blog posts</h2><p>While our paper offers a concise overview of the Elasticsearch Serverless architecture, the details and underlying innovations are explored more fully in a collection of in-depth blog posts written by our engineering team. These articles provide the background, nuance, and specific technical deep dives that make the stateless transformation possible.</p><p>We encourage you to delve into the following resources to gain a richer understanding of the components and concepts presented in the paper:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/stateless-your-new-state-of-find-with-elasticsearch">Stateless — your new state of find with Elasticsearch (2022)</a> and <a href="https://www.elastic.co/blog/elastic-serverless-architecture">Serve more with Serverless (2023)</a>. Read the foundational posts introducing the concept of decoupling storage and compute.</li><li><a href="https://www.elastic.co/search-labs/blog/data-safety-stateless-elasticsearch">Stateless: Data safety in a stateless world (2024).</a> Learn how data durability is achieved in the absence of local replicas.</li><li><a href="https://www.elastic.co/search-labs/blog/datastream-autosharding-serverless">Autosharding of data streams in Elasticsearch Serverless (2024).</a> Discover the logic behind automatic and dynamic data stream sharding.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-refresh-costs-serverless">How we optimized refresh costs in Elasticsearch Serverless (2024).</a> Understand the specific optimizations applied to reduce the cost of making data searchable.</li><li><a href="https://www.elastic.co/search-labs/blog/thin-indexing-shards-elasticsearch-serverless">Introducing Serverless Thin Indexing Shards (2024).</a> Explore the innovation of &quot;thin&quot; shards that enable rapid relocation and recovery.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-serverless-tier-autoscaling">Search tier autoscaling in Elasticsearch Serverless (2024).</a> Gain insight into the mechanisms driving the automatic scaling of search resources.</li><li><a href="https://www.elastic.co/search-labs/blog/elasticsearch-ingest-autoscaling">Ingest autoscaling in Elasticsearch (2024).</a> Learn how the ingestion tier scales automatically to meet fluctuating indexing load.</li><li><a href="https://www.elastic.co/blog/elastic-cloud-serverless-pricing-packaging">Elastic Cloud Serverless pricing and packaging (2025)</a>. Learn how the pricing and packaging was initially structured for Elastic Cloud&#x27;s Serverless offering.</li><li><a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">Elasticsearch vs. OpenSearch: Unraveling the performance gap (2023)</a>. Learn about the performance differences and key optimizations that distinguish Elasticsearch from OpenSearch, as observed in 2023.</li></ul><h2>Acknowledgments</h2><p>We would like to thank all the co-authors of the paper: Iraklis Psaroudakis, Pooya Salehi, Jason Bryan, Francisco Fernández Castaño, Brendan Cully, Ankita Kumar, Henning Andersen, and Thomas Repantis. 
We would also like to thank the Elasticsearch Distributed Systems team for their contributions, as well as the entire Elasticsearch engineering team.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search with Java: LangChain4j Elasticsearch integration]]></title>
            <description><![CDATA[Learn how to use hybrid search in LangChain4j via its Elasticsearch integrations, with a complete Java example.]]></description>
            <link>https://www.elastic.co/search-labs/blog/langchain4j-elasticsearch-hybrid-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/langchain4j-elasticsearch-hybrid-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Laura Trotta]]></dc:creator>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/93d8e0d84fb4cfbf5e51df85df7ec2e600d9dcc7-1088x607.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>In our <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a> on hybrid search with Elasticsearch in LangChain, we explained why hybrid search can help retrieve better results than simple vector search, along with how it works. We recommend reading that article first.</p><p>In addition to Python and JavaScript, the LangChain ecosystem also has a community-driven Java project called <a href="https://github.com/langchain4j/langchain4j">LangChain4j</a>, which will be the focus of this article. We’ll show how powerful hybrid search can be by writing a complete application using LangChain4j, Elasticsearch, and Ollama.</p><h2>Setting up the environment</h2><h3>Running a local Elasticsearch instance</h3><p>Before running the examples, you&#x27;ll need Elasticsearch running locally. The easiest way is using the <a href="https://github.com/elastic/start-local?tab=readme-ov-file"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>After starting, you&#x27;ll have:</p><ul><li>Elasticsearch at http://localhost:9200.</li><li>Kibana at http://localhost:5601.</li></ul><p>Your API key is stored in the .env file (under the elastic-start-local folder) as <code>ES_LOCAL_API_KEY</code>.</p><p>&gt; <strong>Note: This script is for local testing only. Do not use it in production. For production installations, refer to the <a href="https://www.elastic.co/downloads/elasticsearch">official documentation</a> for Elasticsearch.</strong></p><h3>Running a local Ollama instance</h3><p>You’ll also need to connect your application to an embedding model. Although you can choose any provider supported by LangChain4j (check the <a href="https://docs.langchain4j.dev/integrations/language-models/">complete list</a>), for this example we’ll be using Ollama, which can be easily set up locally following the <a href="https://docs.ollama.com/quickstart">quickstart</a>.</p><h2>Let’s start coding</h2><p>The idea for the application is simple: Given a dataset of movies (taken from an IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>), we want to be able to find movies whose descriptions are relevant to our queries. This demo uses a subset of the data, which has been cleaned. You can download the dataset used for this article from our <a href="https://github.com/elastic/hybrid-search-elastic-langchain4j">GitHub repo</a>, along with the full code for this demo.</p><h2>Step 1: Dependencies and environment</h2><p>Open your favorite integrated development environment (IDE), create a new blank project, preferably with a modern Java version (we’re using Java 24) and a Gradle/Maven version to match (in our case, Gradle 9.0).</p><p>We only need three dependencies:</p><pre><code>dependencies {
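    // CSV parsing, the Elasticsearch vector store, and the Ollama embedding model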
    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-csv:2.17.0")
    implementation("dev.langchain4j:langchain4j-elasticsearch:1.11.0-beta19")
    implementation("dev.langchain4j:langchain4j-ollama:1.11.0")
}</code></pre><p>The first one is needed to ingest the data that we’ll embed and query; the other two are the necessary LangChain4j dependencies to connect and manage our Elasticsearch vector store and Ollama embedding model.</p><p>The best way to connect to the external services is to set up environment variables and read them at the start of our main function:</p><pre><code>String elasticsearchServerUrl = System.getenv("ES_LOCAL_URL");
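// the ES_LOCAL_* values come from the .env file generated by the start-local script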
String elasticsearchApiKey = System.getenv("ES_LOCAL_API_KEY");

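// a default local Ollama install listens on http://localhost:11434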
String ollamaUrl = System.getenv("ollama-url");
String ollamaModelName = System.getenv("model-name");</code></pre><h2>Step 2: Ingesting the dataset</h2><p>Since the dataset is a CSV, we’ll be using <a href="https://github.com/FasterXML/jackson-dataformats-text">Jackson dataformat</a>’s <code>jackson-dataformat-csv</code> to easily read the data and map it to a Java class, defined as:</p><pre><code>public record Movie(
    String movie_id,
    String movie_name,
    Integer year,
    String genre,
    String description,
    String director
) {
}</code></pre><p>Now we can create an instance of <code>CsvSchema</code> mapping the CSV structure and read the file into an iterator:</p><pre><code>CsvSchema schema = CsvSchema.builder()                    
    .addColumn("movie_id") // same order as in the csv    
    .addColumn("movie_name")                              
    .addColumn("year")                                    
    .addColumn("genre")                                   
    .addColumn("description")                             
    .addColumn("director")                                
    .setColumnSeparator(',')                              
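    // the first row of the CSV is the header, so skip it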
    .setSkipFirstDataRow(true)                            
    .build();                                             
                                                          
CsvMapper csvMapper = new CsvMapper();                    
                                                          
File initialFile = new File("src/main/resources/scifi_1000.csv");
InputStream csvContentStream = new FileInputStream(initialFile);
                                                          
MappingIterator<Movie> it = csvMapper                     
    .readerFor(Movie.class)                               
    .with(schema)                                         
    .readValues(new InputStreamReader(csvContentStream)); </code></pre><p>Each row needs to be embedded first, and then both the embedded content and the text representation will be ingested by Elasticsearch.</p><p>Let’s start by creating an instance of the Ollama embedding model class:</p><pre><code>EmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
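    // Connects to the local Ollama server and the embedding model read from the environment variables earlier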
    .baseUrl(ollamaUrl)
    .modelName(ollamaModelName)
    .build(); </code></pre><p>And then the Elasticsearch vector store, which needs an instance of the Elasticsearch Java RestClient:</p><pre><code>RestClient restClient = RestClient
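    // Low-level Elasticsearch Java REST client, authenticated with the API key from start-local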
    .builder(HttpHost.create(elasticsearchServerUrl))
    .setDefaultHeaders(new Header[]{
        new BasicHeader("Authorization", "ApiKey " + elasticsearchApiKey)
    })
    .build(); 

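// LangChain4j embedding store backed by the local Elasticsearch instance; it holds both the vectors and the text segments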
EmbeddingStore<TextSegment> embeddingStore = ElasticsearchEmbeddingStore.builder()
    .restClient(restClient)
    .build(); </code></pre><p>For the ingestion loop, the LangChain4j library requires the data to be split into two lists: one for the vector representations and one for the original text. We’ll set up both lists and fill them in the loop:</p><pre><code>List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> embedded = new ArrayList<>();</code></pre><p>Here, <code>Embedding</code> and <code>TextSegment</code> are both library-specific classes.</p><p>We’ll iterate over the movie dataset, use the embedding model to retrieve the vector representation of each movie (a text representation of all its fields merged together), and add the name separately as metadata so that the results are easier to read.</p><pre><code>boolean hasNext = true;

while (hasNext) {
    try {
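        // Read and map the next CSV row; malformed rows throw and are handled in the catch below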
        Movie movie = it.nextValue();
        String text = movie.toString();

        Embedding embedding = embeddingModel.embed(text).content();
        embeddings.add(embedding);

        Metadata metadata = new Metadata();
        metadata.put("movie_name", movie.movie_name());
        embedded.add(new TextSegment(text, metadata));

        hasNext = it.hasNextValue();
    } catch (JsonParseException | InvalidFormatException e) {
        // Skip malformed rows, but still advance the iterator so the loop can't stall on a bad row
        hasNext = it.hasNextValue();
    }
}</code></pre><p>Finally, the vector list and the text list are passed to the vector store’s <code>addAll()</code> method, which handles sending the data to Elasticsearch asynchronously:</p><pre><code>embeddingStore.addAll(embeddings, embedded);</code></pre><h2>Step 3: Querying</h2><p>Our goal is to find movies with time loops in the plot, so our prompt will be:</p><pre><code>String query = "Find movies where the main character is stuck in a time loop and reliving the same day.";</code></pre><p>Let’s try a simple vector search first by creating a content retriever with the default <a href="https://www.elastic.co/docs/solutions/search/vector/knn">k-nearest neighbor (kNN) query</a> configuration, then running the query and printing the results:</p><pre><code>ElasticsearchContentRetriever contentRetrieverVector = ElasticsearchContentRetriever.builder()
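    // Vector-only retrieval: the query is embedded and matched against stored vectors with kNN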
    .restClient(restClient)
    .configuration(ElasticsearchConfigurationKnn.builder().build())
    .maxResults(5)
    .embeddingModel(embeddingModel)
    .build();

List<Content> vectorSearchResult = contentRetrieverVector.retrieve(Query.from(query));

System.out.println("Vector search results:");
vectorSearchResult.forEach(v -> System.out.println(v.textSegment().metadata().getString("movie_name")));</code></pre><p>This outputs:</p><pre><code>Vector search results:
The Witch: Part 1 - The Subversion
Divinity
The Maze Runner
Spider-Man
Spider-Man: Into the Spider-Verse</code></pre><p>Now let’s see how hybrid search performs:</p><pre><code>ElasticsearchContentRetriever contentRetrieverHybrid = ElasticsearchContentRetriever.builder()
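    // Hybrid retrieval: BM25 full-text search + kNN vector search, combined with reciprocal rank fusion (RRF)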
    .restClient(restClient)
    .configuration(ElasticsearchConfigurationHybrid.builder().build())
    .maxResults(5)
    .embeddingModel(embeddingModel)
    .build();

List<Content> hybridSearchResult = contentRetrieverHybrid.retrieve(Query.from(query));

System.out.println("Hybrid search results:");
hybridSearchResult.forEach(v -> System.out.println(v.textSegment().metadata().getString("movie_name")));</code></pre><pre><code>Hybrid search results:
Edge of Tomorrow
The Witch: Part 1 - The Subversion
Boss Level
Divinity
The Maze Runner</code></pre><h2>Why these results?</h2><p>This query (“time loop / reliving the same day”) is a great case where hybrid search tends to shine because the dataset contains literal phrases that BM25 can match and vectors can still capture meaning.</p><ul><li>Vector-only (kNN) embeds the query and tries to find semantically similar plots. With a broad sci‑fi dataset, it can drift into “trapped / altered reality / memory loss / high-stakes sci‑fi” even when there’s no time-loop concept. That’s why results like “The Witch: Part 1 – The Subversion” (amnesia) and “The Maze Runner” (trapped / escape) can appear.</li><li>Hybrid (BM25 + kNN + reciprocal rank fusion [RRF]) rewards documents that match both keywords and meaning. Movies whose descriptions explicitly mention “time loop” or “relive the same day” get a strong lexical boost, so titles like “Edge of Tomorrow” (relive the same day over and over again…) and “Boss Level” (trapped in a time loop that constantly repeats the day…) rise to the top.</li></ul><p>Hybrid search doesn’t guarantee that every result is perfect; it balances lexical and semantic signals, so you may still see some non-time-loop sci‑fi in the tail of the top‑k.</p><p>The main takeaway is that hybrid search helps anchor semantic retrieval with exact textual evidence when the dataset contains those keywords. Check the <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a> for more information on how hybrid search works.</p><h2>Full code example</h2><p>You can find the full demo code on <a href="https://github.com/elastic/hybrid-search-elastic-langchain4j">GitHub</a>.</p><h2>Conclusion</h2><p>In this article, we demonstrated how to use hybrid search in LangChain4j through its Elasticsearch integrations, with a complete Java example. This article is an extension of a <a href="https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search">previous article</a>, which presents the LangChain integrations for Python and JavaScript and introduces and explains hybrid search. We’re planning to continue our collaboration with LangChain4j in the future by contributing to the embedding models with our Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference">Inference API</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SearchClaw: Bring Elasticsearch to OpenClaw with composable skills]]></title>
            <description><![CDATA[Give your local AI agent access to Elasticsearch data using OpenClaw, composable skills, and agents, no custom code required.]]></description>
            <link>https://www.elastic.co/search-labs/blog/openclaw-elasticsearch-ai-agents</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/openclaw-elasticsearch-ai-agents</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alex Salgado]]></dc:creator>
            <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ce3248c3cb7a352e3fdafef4ac8116ab998ab4f4-1950x1137.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In recent weeks, <a href="https://openclaw.ai/">OpenClaw</a> has been appearing frequently in AI community discussions, particularly among developers interested in agents, automation, and local runtimes. The project gained traction quickly, which naturally raised a technical question:</p><p><em>What real problem does it solve for engineers?</em></p><p><strong>OpenClaw</strong> is a self-hosted gateway for AI agents: a single runtime that coordinates execution, treats agents as isolated processes, and uses skills (structured instructions in markdown files) as the unit of integration. Conceptually, this isn’t entirely different from what we already do with command line interfaces (CLIs) and scripts, but it’s now formalized around agent-driven workflows.</p><p>This led to a practical exploration within the Elastic Stack:</p><p><em>If we treat OpenClaw as an orchestration runtime, how does it behave when Elasticsearch is the back end? And how straightforward is integration using OpenClaw skills?</em></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec61e65f54b96b83975b52b2d88305170001d9bd-1999x1445.png" style="width: 100%;" /><p>Let&#x27;s build an integration using composable skills.</p><h2><strong>Solution architecture</strong></h2><p>In this tutorial, we’ll teach OpenClaw how to access and query Elasticsearch data through a custom read-only skill, and we’ll then demonstrate how it composes multiple skills together; for example, combining Elasticsearch queries with real-time weather data to generate dynamic reports.</p><p>Before diving into the hands-on steps, let’s look at what we’re building. The solution is composed of three integrated layers that work together through OpenClaw orchestration.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ca8dc124a7410ba036ddf887eee011c42125cdf3-1270x680.png" style="width: 100%;" /><h3>Layer 1: Storage and search (Elasticsearch)</h3><p>The data layer runs on Elasticsearch via <a href="https://github.com/elastic/start-local"><code>start-local</code></a>, a single command that spins up Elasticsearch and Kibana locally with Docker.</p><p>Two sample indices demonstrate different use cases:</p><ul><li><strong><code>fresh_produce</code>:</strong> 10 products with semantic search (ecommerce scenario)</li><li><strong><code>app-logs-synthetic</code>:</strong> 30 log entries across four services (observability scenario)</li></ul><p>The same read-only skill works with both indices without any reconfiguration; the agent inspects the mapping and adapts its queries accordingly.</p><h3>Layer 2: Orchestration (OpenClaw Gateway)</h3><p>The gateway receives natural language requests and loads the Elasticsearch skill, and the large language model (LLM) decides which queries to construct. The skill is a pure <strong><code>SKILL.md</code></strong> with reference docs, meaning that its operations require no custom code.</p><p>To understand how the gateway organizes this, two core OpenClaw concepts are worth knowing:</p><ul><li><strong>Agents:</strong> Independent AI instances, each with its own configuration, workspace, and set of skills. You can run multiple agents for different purposes.</li><li><strong>Workspace:</strong> A folder that defines an agent’s context:<strong> <code>AGENTS.md</code></strong> (the agent’s permanent briefing), <strong><code>.env</code> </strong>(credentials), and a <strong><code>skills/</code></strong> directory. 
Think of it as the agent’s working environment.</li></ul><h3>Layer 3: Skills (composable capabilities)</h3><p>Skills are structured instructions in markdown files (<code>SKILL.md</code>) that teach the agent how to use specific tools or APIs. They can be global (available to all agents), workspace-specific, or bundled with OpenClaw. The agent selectively loads only the skills relevant to each request.</p><p>This tutorial uses two skills:</p><ul><li><strong><code>Elasticsearch-openclaw</code> (custom, built for this tutorial):</strong> A read-only skill that teaches the agent how to search, filter, aggregate, and explore Elasticsearch indices using curl.</li><li><strong><code>Weather</code> (community skill, used for composition demo):</strong> A skill that fetches current weather conditions from external APIs.</li></ul><p>Later in the tutorial, we&#x27;ll demonstrate how OpenClaw composes both skills in a single request, querying Elasticsearch products based on real-time weather data without any custom integration code.</p><h4>Read-only by design</h4><p>The <code>elasticsearch-openclaw</code> skill is <strong>read-only by design</strong>. It provides patterns for searching, filtering, and aggregating data, but it never writes, updates, or deletes. This minimizes the security footprint when giving AI agents access to your Elasticsearch cluster.</p><p>Even if the agent environment is compromised, your data remains safe from modification or deletion. This is enforced through:</p><ul><li><strong>Skill design:</strong> No write operation patterns in <code>SKILL.md</code> or reference files.</li><li><strong>API key permissions:</strong> The tutorial uses a read-only API key with only <code>read</code> and <code>view_index_metadata</code> privileges.</li><li><strong>Agent instructions:</strong> <code>AGENTS.md</code> explicitly states &quot;You can SEARCH, FILTER, and AGGREGATE data, but you can NEVER write, update, or delete.&quot;</li></ul><p>This security-first approach is why infrastructure setup (index creation, data loading) must be done manually; by design, the agent cannot do it for you.</p><h2><strong>Prerequisites</strong></h2><p>To follow this tutorial, you’ll need:</p><p><strong>Software and tools:</strong></p><ul><li>Docker Desktop installed and running (Docker Engine with Compose V2).</li><li>Elasticsearch running locally via <code>start-local</code>. (We’ll set this up in the next section.)</li><li>Jina API key (free): <a href="https://jina.ai/embeddings">https://jina.ai/embeddings</a>.</li><li>OpenClaw installed: <a href="https://openclaw.ai">https://openclaw.ai</a>.</li></ul><h3><strong>Setting up the environment</strong></h3><p>Start by cloning the starter project, which contains the skill, workspace configuration, and Dev Tools scripts:</p><pre><code>git clone https://github.com/salgado/elasticsearch-openclaw-start-blog
cd elasticsearch-openclaw-start-blog</code></pre><p>The repository contains:</p><pre><code>elasticsearch-openclaw-start-blog/
├── devtools_fresh_produce.md         ← Creates fresh_produce index (10 products)
├── devtools_app_logs_synthetic.md    ← Creates app-logs-synthetic index (30 logs)
└── openclaw-workspace-elastic-blog/
    ├── AGENTS.md                      ← Agent briefing
    ├── .env.example                   ← Credentials template</code></pre><p><em><strong>Note:</strong> The <code>devtools*.md</code> files contain Kibana Dev Tools commands formatted as reference documentation.</em></p><h4>Installing OpenClaw</h4><p>OpenClaw is a self-hosted gateway. This means you maintain full control over execution and data, but you need to prepare your local environment or server.</p><p>I installed OpenClaw on a separate machine, which is why I included the disclaimer below.</p><p><strong>** Security and responsibility disclaimer **</strong></p><p>Since OpenClaw is an early-stage, rapidly evolving open-source project, the community has raised important discussions about potential security vulnerabilities, especially around token handling and third-party script execution.</p><p><strong>Deployment recommendations:</strong></p><ul><li><strong>Isolated environments:</strong> If you’re not an advanced infrastructure security user, we recommend installing OpenClaw strictly in isolated, controlled environments (such as a dedicated virtual machine [VM], a rootless Docker container, or a test machine).</li><li><strong>Do not use in production:</strong> Avoid running the gateway on servers containing sensitive data or with unrestricted access to your corporate network until the project reaches a more stable, audited version.</li><li><strong>Least privilege:</strong> We reinforce the need to use Elasticsearch API keys with restricted permissions (read-only) to mitigate risks, in case the environment is compromised.</li><li><strong>Network segmentation:</strong> Both Elasticsearch and OpenClaw bind to <code>localhost</code> by default. Keep it that way, unless you have a specific reason to expose them.</li><li><strong>Credential rotation:</strong> Rotate API keys periodically. OpenClaw stores credentials locally, so treat the machine’s security as the perimeter.</li><li><strong>Audit logging:</strong> Enable Elasticsearch audit logging to track all API calls made by OpenClaw. This creates a full trail of what the agent accessed and when.</li><li><strong>Keep the installation up to date.</strong></li></ul><p>For a deeper analysis of the security architecture and deployment options, consult the <a href="https://docs.openclaw.ai">official OpenClaw documentation</a>.</p><h4>Runtime installation</h4><p>OpenClaw manages daemons and skill isolation via CLI. Since it’s a recent project that has undergone naming changes, we recommend strictly following the <a href="https://docs.openclaw.ai/install">official documentation</a> to ensure installation compatibility.</p><pre><code># Global gateway installation
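# Review the install script before piping it to bash, and prefer an isolated machine or VM (see the disclaimer above)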
curl -fsSL https://openclaw.ai/install.sh | bash</code></pre><h2><strong>Preparing the Elasticsearch back end</strong></h2><p>Before connecting any agent runtime, we need a working Elasticsearch environment with data to query and a secure, <strong>read-only access layer</strong>. In the next two sections, we’ll spin up Elasticsearch locally using <code>start-local</code>, create an index with <code>semantic_text</code> and Jina v5 embeddings, load sample data, validate that semantic search works, and generate a read-only API key. Once this foundation is in place, the Elasticsearch side is complete and we can focus entirely on teaching the agent how to use it.</p><h3>Part 1: Setting up Elasticsearch locally</h3><p>Start a local Elasticsearch and Kibana instance with a single command:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>Once complete: Elasticsearch at <code>http://localhost:9200</code>, Kibana at <code>http://localhost:5601</code>, and credentials in <code>elastic-start-local/.env</code>.</p><h3>Part 2: Configuring the index in Kibana Dev Tools</h3><p>Open <code>http://localhost:5601</code> → Dev Tools and run <code>devtools_fresh_produce.md</code> in order.</p><ul><li><strong>Step 1:</strong> Replace <code>YOUR_JINA_API_KEY</code> with your actual Jina API key (free).</li><li><strong>Step 2:</strong> Save the encoded field immediately; it cannot be retrieved later.</li></ul><p>The key commands in the Dev Tools file are:</p><p><strong>Create the Jina inference endpoint:</strong></p><pre><code>PUT _inference/text_embedding/jina-embeddings-v5
{
  "service": "jinaai",
  "service_settings": {
    "api_key": "YOUR_JINA_API_KEY",
    "model_id": "jina-embeddings-v5-text-small"
  }
}</code></pre><p><strong>Create the index with <code>semantic_text</code>:</strong></p><pre><code>PUT /fresh_produce
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "description": { "type": "text" },
      "category": { "type": "keyword" },
      "price": { "type": "float" },
      "stock_kg": { "type": "float" },
      "on_sale": { "type": "boolean" },
      "image_url": { "type": "keyword" },
      "semantic_content": {
        "type": "semantic_text",
        "inference_id": "jina-embeddings-v5"
      }
    }
  }
}</code></pre><p>The <code>semantic_text</code> field type handles embedding generation automatically at index time.</p><p><strong>Index sample products</strong> using the bulk API (see <code>devtools_fresh_produce.md</code> for the full dataset of 10 products).</p><p><strong>Validate semantic search:</strong></p><pre><code>GET /fresh_produce/_search
{
  "query": {
    "semantic": {
      "field": "semantic_content",
      "query": "healthy colorful meals"
    }
  },
  "size": 3,
  "_source": ["name", "description", "category"]
}</code></pre><p>The semantic query type handles inference on the query side automatically; no need to specify model IDs or embedding details.</p><p><strong>Create a read-only API key:</strong></p><pre><code>POST /_security/api_key
{
  "name": "openclaw-readonly",
  "role_descriptors": {
    "reader": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["fresh_produce", "app-logs-synthetic"],
          "privileges": ["read", "view_index_metadata"]
        }
      ]
    }
  }
}</code></pre><p>Save the encoded value from the response. This is your API key for the OpenClaw configuration.</p><h2>Connecting to OpenClaw</h2><p>With the Elasticsearch back end ready, we can now wire it into OpenClaw. Several Elasticsearch integrations already exist in the ecosystem, from <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic’s own Model Context Protocol (MCP) server</a> to community-built MCP servers. However, most of these offer full CRUD access or are designed for different agent runtimes. Given that the technology is still in its early stages and security remains a primary concern, I chose to build a dedicated skill: simple, read-only, and purpose-built for OpenClaw. This approach ensures that the agent can search, filter, and aggregate data but never modify it, keeping the blast radius minimal even if the environment is compromised.</p><p>In the next sections, we’ll configure credentials, install the skill, create a dedicated agent, and explore how the workspace ties everything together.</p><h3>Install the skill and create the agent</h3><h4>Step 1: Configure credentials</h4><p>From the cloned repository, configure the credentials by copying the environment template and filling in your Elasticsearch URL and the read-only API key:</p><pre><code>cp openclaw-workspace-elastic-blog/.env.example \
openclaw-workspace-elastic-blog/.env</code></pre><p>Edit the .env file with these two values:</p><pre><code>ELASTICSEARCH_URL: http://localhost:9200 (from start-local)
ELASTICSEARCH_API_KEY: The encoded value from the read-only API key you created in Part 2 (the POST /_security/api_key response)</code></pre><p>Example .env file:</p><pre><code>ELASTICSEARCH_URL=http://localhost:9200
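# The API key below must be the "encoded" value from the POST /_security/api_key response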
ELASTICSEARCH_API_KEY=VnVaRmxLSDRCQxxxxxxxxbGVfa2V5</code></pre><h4>Step 2: Install the skill from ClawHub</h4><p><a href="https://clawhub.ai/">ClawHub</a> is OpenClaw&#x27;s public skill registry. Think of it as npm for AI agent skills. At the time of this writing, ClawHub hosts over 3,200 skills, covering everything from Slack and GitHub integrations to Internet of Things (IoT) device automation. For this tutorial, we created <code>elasticsearch-openclaw</code>, a custom skill focused on read-only queries using <code>semantic_text</code>, aggregations, and observability on Elasticsearch 9.x. It’s published on ClawHub so you can install it directly. As a best practice, only install skills from trusted sources with known provenance; as with any package manager, review the content before granting access to your agent.</p><p><strong>Recommended:</strong> Open the OpenClaw Web UI (http://127.0.0.1:18789/) and ask:</p><pre><code>Install the elasticsearch-openclaw skill from https://clawhub.ai/salgado/elasticsearch-openclaw</code></pre><p>OpenClaw will:</p><ul><li>Fetch the skill from ClawHub.</li><li>Install it in the appropriate directory.</li><li>Confirm when ready to use.</li></ul><h4>Step 3: Create the agent</h4><p>Do this by registering a dedicated agent with its own workspace, and then restart the gateway to load the new configuration:</p><pre><code>openclaw agents add elasticsearch-agent \
  --workspace ~/path/to/elasticsearch-openclaw-start-blog/openclaw-workspace-elastic-blog \
  --non-interactive

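# Restart the gateway so it picks up the new agent and its workspace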
openclaw gateway restart</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/abc051a513b0cc7dff4a7f02493d51e220c72ad4-1999x1095.png" style="width: 100%;" /><h3>Understanding the workspace</h3><p>Now that the agent is running, let’s look at what makes it tick.</p><h4><code>AGENTS.md</code></h4><p>The <code>AGENTS.md</code> file is the agent’s permanent briefing. It defines who the agent is, what it can do, and how it should behave. For our Elasticsearch agent, this file instructs the agent about the available indices, the read-only constraint, and the preferred query patterns.</p><h4>Skills: When they make a difference</h4>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Without skill</td><td style="border: 1px solid #ccc; padding: 4px;">With <code>elasticsearch-openclaw</code> skill</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent has no knowledge of Elasticsearch query syntax.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent knows semantic, full-text, filtered, and aggregation patterns.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent might attempt write operations.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent is instructed to never write, update, or delete.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Agent guesses field names and types.</td><td style="border: 1px solid #ccc; padding: 4px;">Agent inspects mappings first and then constructs appropriate queries.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Generic curl commands with trial and error.</td><td style="border: 1px solid #ccc; padding: 4px;">Structured query templates with best practices for Elasticsearch 9.x.</td>
                    </tr>
              </tbody>
            </table>
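          <p>To make the contrast above concrete, here’s a minimal sketch of what a read-only Elasticsearch <code>SKILL.md</code> along these lines could look like. This is an illustrative outline, not the published skill; it assumes the <code>ELASTICSEARCH_URL</code> and <code>ELASTICSEARCH_API_KEY</code> variables from the workspace .env file:</p><pre><code># Elasticsearch (read-only)

Search, filter, and aggregate data in Elasticsearch. Never write, update, or delete.

## Workflow

1. Inspect the index mapping first:

   curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" "$ELASTICSEARCH_URL/fresh_produce/_mapping"

2. Then query, for example with semantic search:

   curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" -H "Content-Type: application/json" \
     "$ELASTICSEARCH_URL/fresh_produce/_search" \
     -d '{"query":{"semantic":{"field":"semantic_content","query":"<user intent>"}},"size":5}'</code></pre>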
          <h2><strong>Exploring with the agent</strong></h2><p>With the Elasticsearch back end configured and the OpenClaw agent connected, it’s time to see what the agent can actually do. In the next sections, we’ll test natural language queries, explore observability data, and compose multiple skills together.</p><h3><strong>Testing in OpenClaw</strong></h3><p>Open the OpenClaw web UI and try some natural language queries. The agent will inspect the index mapping, choose the appropriate query type, and return results.</p><p>Type:</p><pre><code>“Find products that would be good for a healthy summer salad.”</code></pre><p>Result:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5b4bbe71ad56af5462bccc1475bd10d5233abd9-1011x557.png" style="width: 100%;" /><p>Other ideas to explore:</p><ul><li><strong>Index exploration:</strong> &gt; “What indices do I have in Elasticsearch? Show me the fields of <code>fresh_produce</code>.”</li><li><strong>Filtered search:</strong> &gt; “Show me all products on sale under $15.”</li><li><strong>Aggregations:</strong> &gt; “What’s the average price by category?”</li></ul><h3>Observability</h3><p>To demonstrate that the skill works beyond a single use case, the repository includes a second index: <code>app-logs-synthetic</code>, with 30 synthetic log entries across four fictional services, created from <code>devtools_app_logs_synthetic.md</code>.</p><h4>Setting up the log data</h4><p>Since the skill is read-only, you need to populate the index first. The <code>devtools_app_logs_synthetic.md</code> file contains <strong>five commands</strong> (three for setup and two for verification):</p><ul><li><strong><code>Create ingest pipeline</code>:</strong> Adds <code>@timestamp</code> to log entries automatically.</li><li><strong><code>Create index mapping</code>:</strong> Defines the <code>app-logs-synthetic</code> structure (classic fields only, no <code>semantic_text</code>).</li><li><strong><code>Bulk insert logs</code>:</strong> Loads 30 synthetic log entries across four services.</li><li><strong><code>Count query</code>:</strong> Verifies that 30 documents were indexed.</li><li><strong><code>Sample search</code>:</strong> A quick test to confirm that data is queryable.</li></ul><h4>How to run:</h4><ol><li>Open Kibana Dev Tools: http://localhost:5601 → Dev Tools.</li><li>Copy each numbered block from the .md file.</li><li>Paste into the Dev Tools console.</li><li>Press <strong><em>Ctrl/Cmd+Enter</em></strong> to execute.</li><li>Wait for a successful response before continuing to the next block.</li></ol><p>This creates the <code>app-logs-synthetic</code> index with sample data ready for querying.</p><p>Try this query in the OpenClaw web UI:</p><pre><code>Show me the distribution of HTTP status codes across all services.</code></pre><p>Result:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d35fc1c0ea6d647f1c85163eb0ab8e268c6c4f89-1002x565.png" style="width: 100%;" /><p>Other ideas to explore:</p><ul><li>“How many 500 errors do I have in <code>app-logs-synthetic</code>? Which services are failing?”</li><li>“Which endpoints have the slowest response times?”</li><li>“What happened with the <code>payment-service</code> in the last 24 hours?”</li></ul><p>This is the same skill, same agent, same setup, just pointed at different data. 
The agent inspects the new index mapping, adapts its queries, and returns relevant results without any reconfiguration.</p><h2><strong>Composing skills in action</strong></h2><p>This is where composable skills truly shine. Start by asking the agent:</p><pre><code>Install the weather skill.</code></pre><p>OpenClaw will search for the weather skill, automatically attempt the installation, and guide you through the process. Just follow the on-screen instructions; no new API key is required for the weather skill. Afterward, try this:</p><pre><code>“Find the products on sale in the fresh_produce index that match today’s weather in São Paulo. Generate a nice HTML report with product cards using the image_url field from each document, price, description, and stock. Save it to ~/Desktop/report.html and open it in the browser.”</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/14fa4303bb2f1eb19530d8844f09c99948b3c752-1965x1079.png" style="width: 100%;" /><p>In a single request, the agent chains multiple skills: the <strong>weather skill</strong> to check current conditions, the <strong>Elasticsearch skill </strong>to run a hybrid search on products that match the context, and its built-in file and browser tools to generate an HTML report and open it. No custom integration code, no glue scripts, just skills composed by the LLM at runtime.</p><p>This is what makes OpenClaw different from a traditional automation framework. You don’t preprogram the workflow. You describe the outcome, and the agent figures out the composition.</p><h2><strong>Conclusion</strong></h2><p>SearchClaw started as a simple experiment and ended up demonstrating what composable, LLM-driven integration looks like in practice. The key takeaway is not the individual tools (all are familiar) but the approach. Instead of writing a specific application with hardcoded queries, we gave the agent capabilities and let it compose solutions dynamically. This is what makes the approach OpenClaw-native: composable, LLM-driven, and local-first.</p><p>As with any early-stage project, OpenClaw should be used thoughtfully, especially regarding security and environment isolation. The read-only skill approach demonstrated here is one way to limit risk while still unlocking the value of your Elasticsearch data.</p><p>The full code is available in the repository and can serve as a starting point for your own integrations: <a href="https://github.com/salgado/elasticsearch-openclaw-start-blog">https://github.com/salgado/elasticsearch-openclaw-start-blog</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building effective database retrieval tools for context engineering]]></title>
            <description><![CDATA[Best practices for writing database retrieval tools for context engineering. Learn how to design and evaluate agent tools for interacting with Elasticsearch data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/database-retrieval-tools-context-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Leonie Monigatti]]></dc:creator>
            <pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/06d3502988707f6f88f9b18243072cbbed3eb8af-1200x630.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Good search is incredibly difficult. That&#x27;s why we have an ecosystem full of search algorithms, tooling, and even dedicated stacks like Elastic’s, not to mention an entire field called <em>information retrieval</em>. In modern <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">context engineering</a>, AI agents search and retrieve context from many different sources, including local files, the web, or memory files. Giving agents access to tools to interact with data stored in a database allows them to ground their answers in proprietary information or even complete analytical tasks.</p><p>However, if these interfaces aren’t carefully engineered, agents can search the wrong index, generate invalid SQL/Elasticsearch Query Language (ES|QL) queries, or return large amounts of irrelevant data. During the development of <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a>, we’ve seen these failure modes multiple times. While collaborating with dozens of internal teams to build tools for interacting with Elasticsearch data and integrating them to improve our internal processes with agentic workflows, such as our <a href="https://www.elastic.co/search-labs/blog/agent-builder-one-workflow">internal laptop refresh process</a>, we found that the most successful teams carefully design database retrieval tools as curated interfaces to their data.</p><p>In this blog, we share the best practices we follow when building database retrieval tools. The principles we share are based on common patterns we saw while iteratively improving our prebuilt tools and helping internal teams build custom tools.</p><h2>Key challenges of agentic retrieval</h2><p>Coding and search are among the best use cases of agents. Even though coding agents have recently made substantial progress with new concepts, such as filesystem-oriented tools and code-specific embeddings, search agents (specifically for database retrieval) lack reported breakthroughs.</p><p>Agentic use cases are challenging for multiple reasons: Agents can ignore the available tools to accomplish a task; they can call the wrong tools; and they can call the right tools with the wrong parameters. In addition to these general challenges, we believe that database retrieval use cases are challenging for the following three key reasons:</p><ul><li><strong>Identifying the right index of data</strong> requires the large language model (LLM) to understand what it contains. But sometimes, the number of <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">indices</a> can already be so large that even representing the candidates to select from may cause context-length problems.</li><li><strong>Generating efficient queries</strong> that balance retrieving relevant information with minimizing latency and resource usage can be challenging.</li><li><strong>Avoiding context bloat with tool responses </strong>requires the tool response to be optimized for contextual relevance and<strong> </strong>token efficiency. This isn’t always easy, especially when the agent generates the query from scratch. 
Once the context is no longer relevant to answer a user’s query, offloading the data for later reference is another challenge.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6736afbe774da63ef9534a64f65a7de1f7d075c0-1280x720.png" style="width: 100%;" /><p>During the development of Agent Builder, and while integrating it into our own processes, we faced these challenges multiple times. The principles in the following sections are based on common patterns we saw while iteratively improving our built-in and custom tools and the internal workflows built on top of them.</p><h2>Principles for building effective database retrieval tools</h2><p>In this section, we translate our learnings into guiding principles for building effective database retrieval tools: deciding which tools to build, making sure the agent finds the right index to search and calls the right tool with appropriate parameters, optimizing the tool responses, handling errors, and safeguarding the data.</p><h3>Building the right database retrieval tools (“low floor, high ceiling”)</h3><p>When deciding what database retrieval tools to build, we follow the principle of “low floor, high ceiling” for good <a href="https://biilmann.blog/articles/introducing-ax/">agent experience</a>:</p><ul><li><strong>High ceiling: </strong>Tools that don’t limit the agent&#x27;s potential to handle ambiguous user queries in the worst case. In the context of database retrieval, these are <strong>general-purpose</strong> tools that, for example, allow the agent to write full SQL/ES|QL queries from scratch. These come with the trade-off of added reasoning overhead for the agent, resulting in higher latency, higher cost, and lower reliability.</li><li><strong>Low floor:</strong> Tools with high accessibility that the agent can use successfully on the first try with minimal reasoning overhead for repeating user queries. In the context of database retrieval, these are <strong>specialized tools</strong> that, for example, wrap specific queries. These have the benefit of lower latency, lower cost, and higher reliability than general-purpose tools. However, they require engineering effort, and realistically, it may not be possible for engineers to anticipate every possible user query.</li></ul><p>For example, in our experience, a generic search tool is mandatory to allow the agent to handle unique and ambiguous user queries in the worst case. However, we found it necessary to reduce reasoning overhead and increase efficiency by creating specialized tools (for example, <code>get_top_performing_products(category)</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d1b973be4b0a0a25c99c74f02a47e98395a3f7-1200x630.png" style="width: 100%;" /><p>Another lesson we learned is to consider the level of abstraction of a tool. During our preview phase, the agent had access to a large number of atomic general-purpose tools (for example, <code>get_index_mappings</code>, <code>generate_esql</code>, <code>execute_query</code>, and others). In practice, this had two downsides: When faced with a complex, open-ended question, the agent would confuse tools and their expected order, despite guiding instructions. Combining multiple tools in an agentic workflow also requires handing off information via the context window, which fills it up with information that’s only temporarily important. 
To overcome this inefficiency, we wrapped the functionality of multiple atomic tools into one self-contained search tool.</p><h3>Finding the right index</h3><p>While the majority of tools that interact with a database will focus on querying the database, in some use cases, like index selection, the tool will allow the agent to interact with the database’s metadata to decide which indices to search against based on a user’s query and intent.</p><p>Initially, our naive index selection relied on the index names and a sample of their schema definitions. This worked well in our internal testing, but when internal teams experimented with it, we realized real use cases often don’t have index names that are distinct and descriptive but are instead vague (for example, <code>users</code>, <code>logs</code>, <code>flight_travels</code> versus <code>web-logs-2026.01</code>, <code>web-logs-2026.02</code>).</p><p>To overcome this, we started exposing each index’s metadata and schema definitions in the tool. This significantly improved the selection by allowing engineers to add descriptions that translate technical names into natural language on two levels:</p><ul><li><strong>Index-level descriptions: </strong>What data is stored in the index and how documents relate to one another.</li><li><strong>Field-level hints:</strong> Specific guidance on a field&#x27;s format, expected values, or business meaning (for example, <code>”Use this field for exact ID matches only”</code>).</li></ul><p>In another iteration, we decided to add data sampling. For each index, we sample some of the data present in each field to give the agent a clearer understanding of the type of data in the index. While this significantly improved the efficiency of index selection, it comes at the cost of increased tool response data.</p><h3>Calling the right database retrieval tool</h3><p>Guiding an agent to call the right tool is already challenging enough in general. This section discusses what helped us make sure the agent calls a tool at all to ground its response, and that it calls the right database retrieval tool.</p><h4>Naming and namespacing: Standardizing identifiers for reliable selection</h4><p>A tool’s name acts as a skimmable header that agents use to decide which one to investigate further. In practice, this means using descriptive and distinct tool names. Additionally, tool selection is more reliable when tool names are consistent in formatting (for example, standardize on snake_case) and wording. Using action-oriented verbs helps the LLM map the user’s intent to the tool’s purpose, although the exact wording of the tool names is less critical in practice (for example, <code>search</code> versus <code>find</code>).</p><p>Namespacing tools to group related tools under common prefixes or suffixes is similarly helpful. 
In the case of databases, namespacing tools by index or domain helps the LLM understand tool relationships and prevents naming collisions (for example, <code>finance.search_ticker</code> or <code>support.get_ticket_details</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ae64cafe7a3aade3985780d7981a853e90387945-1200x630.png" style="width: 100%;" /><h4>Tool description: Instructing the agent on proper usage</h4><p>The description is the most crucial component of any tool definition because it instructs the agent on <em>when</em> and <em>how</em> to use it, especially when tools have similar names (for example, <code>search_logs</code> and <code>find_errors</code>).</p><p>Consider this template for complex tools:</p><ul><li><strong>Core purpose: </strong>A high-level summary of what the tool does.</li><li><strong>Trigger: </strong>When the tool should be used (and when it should not).</li><li><strong>Action:</strong> Which specific data the tool retrieves or modifies, and what type of questions it can answer.</li><li><strong>Limitations: </strong>What important limitations and constraints exist, such as specific query languages or formats.</li><li><strong>Relationships with other tools: </strong>Does one tool affect another tool, or are there any preconditions?</li><li><strong>Examples:</strong> Specific few-shot examples of user queries and how to use the tool for them, such as how to determine the optimal search strategy or when to use which operator.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3e19d6472cbc62fe2be0e4efcb918e52c10d2f8-1200x630.png" style="width: 100%;" /><p>A note on model sensitivity: While top-tier models like Claude 4.5 Sonnet are forgiving, smaller models often need clearer and more elaborate descriptions to select the right tool.</p><h4>Adding reasoning parameters</h4><p>Inspired by the <a href="https://arxiv.org/pdf/2601.18282v1">paper on think-augmented function calling</a>, we added a reasoning parameter. This approach improves the parameter accuracy by providing a scratchpad for the LLM to process its thoughts and facilitates a more transparent user experience.</p><p>This works well for complex tool calls or when a large number of tools are exposed to the agent. However, it can lead to regression in simple scenarios, and the benefits are further diminished for thinking-based LLMs. In our Agent Builder implementation, the reasoning parameter is often optional and stripped before execution and is only used for tool selection and parameter filling.</p><pre><code>"properties": {
   "reasoning": {
      "type": "string",
      "description": "Brief explanation of why you're calling this tool"
   }
}</code></pre><h4>Support: Reinforcing instructions in the agent prompt</h4><p>A common error we observed is that the LLM would sometimes ignore the available tools and instead use its innate knowledge to generate a (hallucinated) response. For example, when asked ”Can you tell me more information about Elasticsearch’s ES|QL language?”, it would assume it was fine to answer by itself instead of calling the tool, which was specifically designed to fetch documentation about Elastic products.</p><p>To mitigate this problem, we added repeated, explicit instructions in the system prompt of the agent itself to guide the agent to find the right balance between using its innate knowledge and grounding its answer in a tool response. Our testing indicates that this is especially effective when multiple tools with similar purposes are exposed to the agent.</p><h4>Forcing tool usage</h4><p>Beyond reinforcing the instructions in the agent prompt, we found it helpful to force tool usage when mandatory by explicitly binding tools using <code>tool_choice: ‘any’</code>.</p><h3>Calling database retrieval tools with appropriate values and writing queries</h3><p>Another challenge is to enable the agent to call a tool with appropriate parameter values. We’ve observed a consistent pattern where a strong definition, the number of parameters, and their complexity play an important role in reducing errors.</p><h4>Parameter definition</h4><p>A strong parameter definition significantly improves the parameter accuracy. General best practices for parameter definitions in agent tools are:</p><ul><li><strong>Unambiguous name:</strong> Clearly identifies the purpose (for example, <code>user_id</code> versus <code>user</code>).</li><li><strong>Strong typing:</strong> Uses integer, string, or enums, among others, for finite sets of valid values.</li><li><strong>Detailed description:</strong> Explains what the parameter means and when and how to use it. Specifies default values for missing values, document formats (for example, for dates), hidden rules (for example, <code>”at least one of agent_id | user_id is required”</code>), and includes small examples.</li></ul><pre><code># Weak parameter description
"properties": {
   "index": {
      "description": "Name of the index",
   }
}

# Strong parameter description
"properties": {
   "index": {
      "type": "string",
      "description": "The specific index, alias, or datastream to search. Defaults to 'main-alias' if unknown."
   }
}</code></pre><h4>Number of parameters</h4><p>Agents struggle to supply appropriate values when a tool takes a large number of parameters, especially mandatory ones. As a general rule of thumb, we try to keep the mandatory parameters below five and the total parameters below 10.</p><h4>Parameter complexity</h4><p>Reducing the complexity of the input parameter when possible leads to fewer mistakes. For example, it requires less reasoning overhead for an LLM to generate a plain search term than to generate full SQL or ES|QL queries from scratch. Especially for repeating user queries, “pre-canning” search queries reduces latency, cost, and error rates (although modern LLMs are good at using well-known languages, such as SQL).</p><p>To follow the principles of “low floor, high ceiling,” we gravitated toward wrapping a specific query inside a tool and letting the agent only provide the search term. Below is an example of input parameters with varying complexity for the same user query, &quot;Find the 5 most relevant &#x27;resolved&#x27; support tickets based on a new problem description to find previous solutions.&quot;</p><pre><code># Complex parameter
search_support_tickets(
    "FROM support_tickets | WHERE status = 'resolved' AND customer_email = ?email | MATCH(issue_title, issue_description, ?problem_description) | KEEP ticket_id, issue_title, resolution | LIMIT 5"
)

# Simple parameter
find_similar_customer_resolved_tickets(problem_description)</code></pre><h4>Model sensitivity</h4><p>Models have a strong impact on parameter accuracy and query generation. Our internal benchmarking showed that switching from Claude 3.7 Sonnet to Claude 4.5 Sonnet reduced the syntax errors of the generated ES|QL queries from ~28% to ~4%.</p><h4>Input validation</h4><p>Although the above techniques increase parameter accuracy, they don’t eliminate the possibility of errors. Instead of trusting the LLM&#x27;s input, we gravitated toward always validating and sanitizing it to ensure queries adhere to the expected schema.</p><h3>Optimizing database retrieval tool responses</h3><p>One common mistake is neglecting the size of the tool’s output. Because the tool’s output is what goes into the agent’s context window, not optimizing it for <strong>contextual relevance</strong> (quality) can distract the agent, and not optimizing it for <strong>token efficiency</strong> (quantity) can increase cost and risk exceeding the LLM’s context window limit. Working alongside internal teams, we’ve identified three dimensions for reviewing the return values:</p><p><strong>The first dimension is length:</strong> The number of search results. A recurring pitfall we’ve observed among teams is the failure to limit search results, which can lead to overflowing the context window. While vector search queries inherently require a limiting parameter, other search methods often do not. We recommend including <code>limit</code> clauses (for example, 10 to 20) in all query types. This prevents returning low-signal results and ensures token efficiency.</p><p><strong>The second dimension is width: </strong>The number of fields per data object. Instead of returning all properties, including cluttering ones (for example, timestamps and internal IDs, among others), curating a set of relevant fields can help improve both the user experience and the agent behavior.</p><p>Consider the following examples:</p><ul><li><strong>Returning identifiers </strong>with a concise set of properties of a data object allows the agent to retrieve full information about a data object later when it needs it (“just-in-time context engineering”), rather than cluttering the context window.</li><li><strong>Returning metadata for citations</strong> (for example, page numbers in large PDF documents) can help build trust for the user.</li><li><strong>Returning the number of search results and status messages</strong> can help the agent reason about the status of the search query.</li></ul><p><strong>The third dimension is depth:</strong> The size of a single field. Consider a case where the documents themselves are large (for example, on the scale of tens of MB). Those can’t just be passed back to the agent in full, as the context length will instantly reach its limit. To mitigate this, we recommend truncating long text fields when an excerpt is sufficient. 
We found that features such as <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/highlighting">Elasticsearch’s highlighting</a> are helpful for this, without the need for another LLM API call.</p><p>But even when working with smaller documents, letting the tool format the data into an easily digestible form for the LLM improved downstream performance, for example, by sanitizing the content (stripping HTML tags) and formatting it for readability (converting tables to Markdown tables and links to “[Title](url)”).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74a92e1102e235ecd7fd9868d9c3a81a59809c9c-1200x630.png" style="width: 100%;" /><p>While using only one of these techniques is often not sufficient for improving the contextual relevance, combining all of them might reduce the retrieval recall (for example, limiting the number of search results might risk not retrieving all of the relevant documents). In practice, this requires evaluating different combinations of these approaches to find the best balance.</p><h3>Handling errors and enabling self-correction</h3><p>We’ve observed that agents can get stuck in infinite loops or hallucinate responses when they encounter an error. Even if an agent follows its instructions perfectly, if a tool doesn’t provide any error message, only returns an error code, or at best provides a short, nondescriptive error message, the agent has no chance to self-correct from an error it doesn’t understand.</p><p>An <strong>informative error message</strong> enables the agent to understand why the error is happening and how to recover from it. For this, engineers need to think about the “not so happy” paths and the anticipated edge cases, such as the following examples:</p><p>If an error occurred because of a wrongly formulated search query, the agent should be able to reason over the failure and reformulate the query. In this example, returning the number of search results and the generated query can help the agent self-correct.</p><p>In general, engineers need to consider whether “zero results” is an expected behavior or an error for a given tool. In cases where an empty result likely indicates an error, both an error message and agent instructions can be helpful.</p><pre><code># Example error message from tool
"No product data found for product ID [XYZ]. 

Review the search query [insert used search query here].

Ask the customer to confirm the product name, and look up the product ID by name to confirm you have the correct ID."

---
# Example agent instruction
"If the product_search tool returns no results, do not state that the product does not exist. 

First, check that the `:` operator was used for multi-value fields.

Second, ask the user to provide and confirm the product ID or provide an alternative identifier like product name."</code></pre><p>When encountering an API failure due to an expired API key, strictly limit retries (for example, a maximum of two or three) to prevent the agent from repeatedly trying a failing deterministic flow.</p><h3>Safeguarding data</h3><p>The primary engineering challenge for production-grade agent applications with different tools lies in identity propagation, specifically the distinct requirements of <em>authentication</em> (verifying who the user is) and <em>authorization</em> (verifying what they can access). While an initial layer (for example, Okta) can provide the base authentication, downstream systems (for example, ServiceNow, Elasticsearch, and others) maintain their own nonuniform authorization architectures with varying levels of granularity.</p><p>The most successful implementations we’ve seen enforce identity verification at every touchpoint within the tool&#x27;s logic. This prevents the agent from accidentally accessing data that the end user isn&#x27;t allowed to see. By verifying identity at every system level, we ensure that the agent respects privacy even when navigating complex, multisystem workflows. Be aware of the downside; this introduces intentional, security-mandated latency.</p><p>Beyond verifying the identity at every touchpoint, managing credentials securely is critical. Exposing sensitive API keys in tool definitions or hardcoding database credentials in YAML configuration files is a high-risk vulnerability. Instead, we recommend that engineers use secure credential management systems.</p><h2>Evaluating database retrieval tools</h2><p>The development of database retrieval tools for agentic systems is an iterative, evaluation-driven process. To evaluate the effectiveness of our database retrieval tools and uncover issues, our internal teams maintain evaluation datasets with realistic user queries and the expected tool calls (for example, ”Am I eligible for a laptop refresh?” expects the <code>check_eligibility</code> tool). We’ve used the following metrics for evaluation of our database retrieval tools and for benchmarking different LLMs for model selection:</p><ul><li><strong>Tool selection accuracy: </strong>How often the correct tool was selected for a specific query type (for example, retrieval, analytical, hybrid, adversarial).</li><li><strong>First-pass success rate: </strong>A critical metric for us was distinguishing between eventual success and immediate success. Does the agent pick the right tool on the first try, or does it require a self-correction loop? (High self-correction indicates poor tool descriptions.)</li><li><strong>Average tool calls per answer:</strong> We track the efficiency of the agent. 
If the average number of tool calls to answer a simple question jumps from 1.5 to 4, it usually indicates that the agent is getting lost or that the tools are too granular.</li><li><strong>Tool-specific recall/precision:</strong> For dedicated database retrieval tools, we measure standard retrieval metrics to ensure that the documents returned are relevant to the arguments passed by the LLM.</li><li><strong>Failure rate:</strong> We strictly monitor the rate of malformed tool calls (for example, missing parameters) to identify which models need more &quot;hand-holding&quot; in the prompt instructions.</li></ul><p>Once live, our teams at Elastic continue to <strong>monitor</strong> agentic health and log telemetry data (for example, every success and failure and the time taken for the agent to complete the task) in Kibana. This allows our ITOps teams to build dashboards to answer meta-questions like &quot;What is the failure rate this week?&quot;, &quot;How many laptop requests came from California?&quot;, and &quot;How many requests were fulfilled?&quot; without building a custom analytics engine.</p><h2>Summary</h2><p>During the iterative, evaluation-driven development process of <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a>, we identified consistent patterns in what makes database retrieval tools effective for context engineering. When implementing tools to search, retrieve, and manipulate data, we follow these core principles:</p><ol><li><strong>Building the right database retrieval tools</strong> by following the “low floor, high ceiling” principle.</li><li><strong>Helping the agent call the right database retrieval tool with appropriate parameter values</strong> through careful and reinforced prompting and interface design.</li><li><strong>Avoiding context flooding</strong> by optimizing the tool response for context relevancy (quality) and token efficiency (quantity).</li></ol><p>However, there remain open challenges, and we’re actively working on improving these techniques:</p><ul><li>Context bloat is a primary hurdle for agent performance because retrieved data remains in the context window as the conversation progresses. A future direction is to dynamically offload large chunks of data, such as tool responses or attachments, to a file store and allow the agent to retrieve them on demand.</li><li>Efficient discovery of large volumes of tools and data attachments will be central to building production-grade agents. We plan to introduce agent skills with the functionality of <a href="https://claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills"><em>progressive disclosure</em></a> (loading information only as needed) and to implement a semantic metadata layer.</li></ul><h3>Acknowledgements</h3><p>Written by Leonie Monigatti with valuable contributions from colleagues across Search Solutions Engineering (Sean Story, Pierre Gayvallet, Abhimanyu Anand) and Enterprise Applications (Sri Kolagani).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build task-aware agents with an expanded model catalog on Elastic Inference Service (EIS)]]></title>
            <description><![CDATA[Elastic Inference Service (EIS) expands its managed model catalog, enabling teams to build production-ready agents with flexible model choice across retrieval, generation, and reasoning, without managing GPUs or infrastructure.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-ai-agents-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-ai-agents-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Anish Mathur, Deepti Dheer, Ranjana Devaji]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8c0ffb8835c49c116da7163edb0cf7f6857817c-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we’re pleased to announce an expanded model catalog for Elastic Inference Service (EIS), making it easy to run fast, high-quality inference on managed GPUs, without setup or hosting complexity.</p><p>EIS already provides access to state-of-the-art large language models (LLMs) that power <a href="https://www.elastic.co/blog/elastic-out-of-box-genai-capabilities">out-of-the-box AI capabilities</a> across <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Elastic Agent Builder</a> and Elastic AI Assistants, including automatic ingest, threat detection, problem investigation, and root cause analysis. We’re now extending this foundation with a <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis#llm-chat-models">broader catalog</a> of managed models, giving developers more control over how agents reason, retrieve, and act.</p><p>In practice, this reflects a broader shift in how enterprises build AI systems. The idea of a single, all-purpose AI model no longer holds up. Real-world agent workflows require multiple models with different strengths, costs, and performance characteristics. With EIS, teams can either choose and switch models directly in Agent Builder, with zero setup, cost, or hosting overhead, or they can mix and match models in an agent <a href="https://www.elastic.co/blog/elastic-workflows-technical-preview">workflow</a> so each step uses the model best suited to the task.</p><p>Developers can use models from OpenAI, Anthropic, and Google directly in Elasticsearch, selecting different models for different agent steps while Elastic fully manages inference, scaling, and GPU execution for production agents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4eea6b9d6fbe492d75947828bf1c46ae9801387a-1234x892.png" style="width: 100%;" /><h2>An expanded catalog of managed models on EIS</h2><p>The expanded EIS catalog now includes models optimized for different classes of tasks, from lightweight generation to large-context reasoning and embeddings for retrieval.</p><p>For generation, the catalog includes:</p><ul><li>Anthropic Claude Opus 4.5 and 4.6.</li><li>Gemini 2.5 Flash.</li><li>Gemini 2.5 Pro.</li><li>OpenAI GPT-4.1 and GPT-4.1 Mini.</li><li>OpenAI GPT-5.2.</li><li>OpenAI GPT-OSS-120B.</li></ul><p>For retrieval, EIS includes native Jina AI models, <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service"><code>jina-embeddings-v3</code></a> and <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text"><code>jina-embeddings-v5</code></a>, which provide fast, high-quality embeddings for multilingual retrieval. The service also includes embedding models from Microsoft, OpenAI, Google, and Alibaba.</p><h2>Choosing the right models for agent tasks</h2><p>With EIS, model choice becomes a design decision inside the agent, rather than an operational concern. 
Agents can select models based on the role they play, without changing how inference is deployed or scaled.</p><p>To see how this plays out in practice, consider a few common agent scenarios.</p><h3>Simple informational query</h3><p>Simple interactions, such as answering “What is our holiday policy?”, do not require an expensive frontier model and can be handled by a fast, low-cost option.</p><ul><li>Task: “What is our holiday policy?”</li><li>Pattern: Retrieve and summarize.</li><li>Model choice: Fast, low-cost generation model.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3c4bb30b433c9026d87d28fc16a5afa93594401-1436x1396.png" style="width: 100%;" /><p>This can also be configured <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-post-agent-builder-converse">through the API</a> by selecting the model you want to use:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "What is our holiday policy?",
 "agent_id": "internal-kb-bot",
 "connector_id": "Anthropic-Claude-Sonnet-4-5"
}</code></pre><p>This step relies primarily on retrieval quality. A lightweight model is sufficient to summarize a small set of documents quickly.</p><h3>Moderate capability</h3><p>More complex tasks may benefit from a more capable generation model, without necessarily requiring the most expensive reasoning model available.</p><ul><li>Task: “Compare our holiday policy with new labor laws in France and draft an email.”</li><li>Pattern: Retrieve relevant documents, compare policy details across sources, and generate output such as a draft email.</li><li>Model choice: More capable generation model.</li></ul><a href="https://videos.elastic.co/watch/ajCYrZDZBqSQ1WUhahiShk">Watch video on Vidyard</a><p>Here’s the API example:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "Compare our holiday policy with new labor laws in France and draft an email.",
 "agent_id": "internal-kb-bot",
 "connector_id": "Google-Gemini-2-5-Pro"
}</code></pre><p>This task requires synthesis across multiple sources and structured output but doesn’t need the heaviest frontier reasoning model.</p><h3>Investigation or audit task (high capability)</h3><ul><li>Task: Review a large document set to identify compliance risks.</li><li>Pattern: Multistep reasoning over large context, where the model evaluates information across many documents and synthesizes findings before producing a final judgment.</li><li>Model choice: Frontier or large-context model.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54a500766139b7c39785eee38184a4479ee76e2b-1080x1080.gif" style="width: 100%;" /><p>Try it out using the API:</p><pre><code>POST kbn://api/agent_builder/converse
{
 "input": "What are the compliance risks associated with Example AI products?",
 "connector_id": "Anthropic-Claude-Opus-4-6"
}</code></pre><p>Because the task requires deeper reasoning and consistent evaluation across many inputs, output quality matters more. A high-capability model is therefore appropriate for this step.</p><p>EIS also enables more advanced orchestration patterns. Enterprises increasingly recognize that using a frontier model for every agent step is inefficient.</p><p>With Agent Builder and Elastic Workflows, teams can design agents where each subtask is executed by the most efficient model for the job, based on cost, complexity, and accuracy requirements.</p><h3>Models-as-judge pattern (quality control)</h3><ul><li>Task: Validate an agent’s output using a second model</li><li>Pattern: Generate and evaluate.</li></ul><p>In this <a href="https://www.elastic.co/blog/elastic-workflows-technical-preview">Elastic Workflow</a> example, the agent uses one model to generate a response and a second model to evaluate its quality, adding a validation layer for the result. Elastic Workflows, the automation engine built into Elasticsearch, let developers combine reliable scripted automation with AI-driven steps for tasks that require reasoning.</p><a href="https://videos.elastic.co/watch/FoSG66tYxtgshaWAUHZFHy">Watch video on Vidyard</a><p>The multimodel approach enables new reliability patterns by separating generation from evaluation, allowing one model to produce a response and another to validate it. Today, teams can implement this by pairing a general-purpose generation model with a lighter-weight evaluation model.</p><p>Over time, this pattern naturally lends itself to specialized judging and safeguard models designed specifically for validation, policy checks, and quality control. As these models become available, EIS makes it straightforward to introduce them into agent workflows without changing how inference is deployed or managed.</p><h2>What’s next</h2><p>EIS is actively evolving, with more models on the way. You can track what’s coming next and what we’re currently building on the <a href="https://github.com/orgs/elastic/projects/2066/views/2">Elastic public roadmap</a>.</p><h2>Get started</h2><p>Elastic Inference Service makes it easy to start with default models and evolve toward sophisticated, multimodel agent workflows over time, all within Elasticsearch. Whether you’re building global retrieval augmented generation (RAG) systems, search, or <a href="https://www.elastic.co/search-labs/blog/ai-agentic-workflows-elastic-ai-agent-builder">agentic workflows</a> that need reliable context, Elastic now gives you high-performance models out of the box, along with the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to Elastic Inference Service. Try it now on <a href="https://cloud.elastic.co/serverless-registration">Elastic Cloud Serverless</a> or Elastic Cloud Hosted, or use <a href="https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service">EIS via Cloud Connect</a> with your self-managed cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Does MCP make search obsolete? Not even close]]></title>
            <description><![CDATA[Explore why search engines and indexed search remain the foundation for scalable, accurate, enterprise-grade AI, even in the age of MCP, federated search, and large context windows.]]></description>
            <link>https://www.elastic.co/search-labs/blog/future-of-search-engines-indexed-search-mcp</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/future-of-search-engines-indexed-search-mcp</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Dayananda Srinivas]]></dc:creator>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b8af1c3ca51f2c038406c714eb9a71b696bbc5a-1999x1091.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>With the rise of large language models (LLMs), agent frameworks, and new protocols like Model Context Protocol (MCP), a provocative question is starting to surface:</p><blockquote><strong>Do we still need a search engine at all?</strong></blockquote><p>If agents can call tools on demand and models can reason over massive context windows, why not just fetch data live from every system and let the LLM figure it out?</p><p>It’s a reasonable question. It’s also the wrong conclusion.</p><p>The reality is that MCP and agent tooling don’t eliminate the need for search. They make the quality of search <strong>more critical than ever</strong>. In this blog, we’ll explore why MCP, federated search, and large context windows don’t replace search engines and why indexes remain the foundational layer for scalable, accurate, enterprise-grade AI.</p><h2><strong>What MCP actually is (and what it is not)</strong></h2><p>MCP is a <strong>coordination protocol</strong>. It standardizes how an agent requests information or actions from external systems.</p><p>What MCP <em>doesn’t</em> do:</p><ul><li>Rank results across systems.</li><li>Understand relevance across heterogeneous data.</li><li>Normalize schemas or metadata.</li><li>Data transformations or enrichments at scale.</li><li>Apply consistent security and permissions.</li><li>Optimize for latency, cost, or scale.</li></ul><p>In other words, <strong>MCP tells agents <em>how</em> to ask for data, not <em>which</em> data matters most</strong>.</p><h2><strong>Modern retrieval requires query intelligence, not just data access</strong></h2><p>In modern enterprise search architectures, retrieval quality is determined long before a query reaches an index. Raw queries — especially those generated by agents — may be incomplete, overly literal, schema-driven rather than intent-driven, and at times syntactically invalid.</p><p>This is why mature search platforms introduce a query intelligence layer that performs query rewriting, entity normalization, synonym expansion, and intent disambiguation before retrieval even begins.</p><p>For example, an agent-generated request such as: “Show severity 2 authentication failures from last sprint” may be rewritten to include authentication synonyms (login, SSO, OAuth), normalized severity mappings, and sprint-to-date-range translation. The result is not just more matches — it is more <em>relevant</em> matches.</p><p>In enterprise AI, retrieval is not a single step. It is a controlled pipeline.</p><p>This distinction is crucial because once MCP-based agents start pulling information live from multiple tools, they recreate a familiar pattern under a new name: <strong>federated search</strong>.</p><h2><strong>MCP-based retrieval is federated search in disguise</strong></h2><p>Federated search isn’t new. Enterprises have tried it for decades.</p><p>The model is simple:</p><ul><li>Send the user’s query to multiple systems in parallel (SharePoint, GitHub, Jira, customer relationship management [CRM]).</li><li>Collect the responses.</li><li>Merge and present the results.</li></ul><p>MCP-driven tool calls follow the same pattern, except that the caller is now an agent instead of a user interface.</p><p>And the same problems resurface.</p><h2><strong>Why federated search breaks down at enterprise scale</strong></h2><ul><li><strong>Latency becomes unpredictable:</strong> A federated query is only as fast as its slowest system. 
Enterprise systems can have wildly different response times and rate limits, so federated queries tend to be <strong>slow and jittery</strong>. Agents must wait for multiple round trips before reasoning can even begin. The result is a laggy experience and unpredictable wait times.</li><li><strong>Relevance is fragmented:</strong> Because each system ranks results on its own, there’s no unified relevance model. Federated search <strong>cannot apply a single ranking or semantic understanding across all content</strong>, so results often seem disjointed or incomplete. Agents may retrieve <em>correct</em> information but not the <em>most useful</em> information.</li><li><strong>Context is shallow and incomplete: </strong>Federated systems typically expose only what’s directly accessible through an API call. They rarely surface:<ul><li>Usage signals, like clicks, dwell time, recency of access, popularity, or authority.</li><li>Relationships between documents across different systems to correlate the insights.</li><li>Organizational knowledge beyond a single silo.<br/><br/>This strips agents of the broader context required for high-quality reasoning.<br/></li></ul></li><li><strong>Limited filtering and features:</strong> In a federated setup, you can only filter on fields that every system supports (the “lowest common denominator”). If one system doesn’t support a particular filter or facet, you lose that functionality entirely. This severely limits rich search features, like date ranges, categories, or tags.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/25bb778f4da9a3cb4f0d4e10af66221b8af73900-1376x768.jpg" style="width: 100%;" /><h2><strong>The power of an indexed search</strong></h2><p>Search engines achieve millisecond-level retrieval at massive scale by using specialized data structures, including inverted indexes for lexical search and graph-based structures such as Hierarchical Navigable Small World (HNSW) graphs for vector-based retrieval. The approach is to <strong>crawl or ingest every source into search engines</strong>, creating a central store of company knowledge. This brings big advantages:</p><ul><li><strong>Speed by design:</strong> Searching an index is lightning fast. Queries hit inverted indexes and specialized data structures, avoiding the need to poll each backend system.</li><li><strong>Relevance that compounds over time:</strong> Search engines that support <strong>semantic search </strong>are capable of comprehending query intent, and machine learning models can rerank results for enterprise contexts. In one Elastic <a href="https://www.elastic.co/blog/elastic-generative-ai-experiences">experiment</a>, users saw more accurate results when combining vector search with a question-answering (QA) model to extract answers, with better precision than keyword matching alone.</li><li><strong>Advanced features:</strong> Elastic’s <a href="https://www.elastic.co/search-labs/blog/rag-graph-traversal#:~:text=Retrieval,for%20deeper%2C%20more%20contextual%20retrieval">Graph retrieval augmented generation (RAG) solution</a> shows how structuring an index as a knowledge graph can power more contextual retrieval. In other words, indexes aren’t just backward-looking dumps of text; they can also encode relationships and ontologies that let AI connect the dots across documents.</li><li><strong>Permission-aware search:</strong> Enterprise AI cannot compromise on security. 
Indexed search allows:<ul><li><a href="https://www.elastic.co/docs/reference/search-connectors/document-level-security">Document-level security.</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/users-roles/cluster-or-deployment-auth/user-roles#roles">Role-based access control.</a></li><li><a href="https://www.elastic.co/search-labs/blog/rag-and-rbac-integration">Permission-aware retrieval for RAG and agents.</a></li></ul></li></ul><p>Agents see only what users are allowed to see, without leaking data into model prompts or training. Elasticsearch is suitable for the indexed search layer in the diagram below, as it provides the essential components for context engineering.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8785e7806616273d086a90b3540273fb26d045ae-1392x768.jpg" style="width: 100%;" /><h2><strong>Retrieval consistency through search templates and governed execution</strong></h2><p>At scale, retrieval must be predictable, secure, and repeatable. This is where <a href="https://www.elastic.co/docs/solutions/search/search-templates">search templates</a> become critical.</p><p>Search templates act as retrieval contracts between applications, agents, and the search platform. Instead of dynamically constructing queries at runtime, agents invoke pre-defined retrieval patterns that enforce:</p><ul><li>Consistent relevance logic</li><li>Mandatory security filters</li><li>Cost and latency guardrails</li><li>Business-specific ranking rules</li><li>Explicit index and field scope boundaries</li></ul><p>In MCP-driven architectures, this becomes even more important. Agents should not dynamically invent retrieval strategies. Instead, MCP tool calls can map directly to approved search templates, ensuring that every retrieval request adheres to enterprise relevance and governance standards.</p><p>This approach shifts retrieval from ad-hoc query execution to controlled retrieval orchestration.</p><h2><strong>Retrieval is now a multi-layer engineering discipline</strong></h2><p>Modern enterprise retrieval is no longer a simple query-to-index operation. It typically includes multiple coordinated layers:</p><ul><li>Query understanding — rewriting, expansion, entity resolution</li><li>Retrieval strategy selection — hybrid search, vector search, graph retrieval, or synthetic query techniques such as Hypothetical Document Embeddings (HyDE), where the system generates a representative answer or expanded context first and retrieves documents using that richer semantic signal.</li><li>Execution governance — templates, security enforcement, and performance guardrails</li><li>Ranking and re-ranking — blending lexical precision, semantic similarity, and interaction-derived relevance signals such as click-through patterns, dwell time, and document usage frequency.</li></ul><p>When these layers are implemented upstream, agents receive clean, high-confidence context rather than raw, fragmented data.</p><p>This is what makes large-scale agent systems reliable in production environments.</p><h2><strong>Advanced retrieval techniques improve context quality before reasoning begins</strong></h2><p>Modern retrieval systems increasingly use AI-assisted techniques to improve recall and semantic coverage before ranking is applied.</p><p>One example is <a href="https://medium.com/@nirdiamant21/hyde-exploring-hypothetical-document-embeddings-for-ai-retrieval-cc5e5ac085a6">Hypothetical Document Embeddings (HyDE)</a>. 
Instead of embedding only the original query, the system first generates a hypothetical answer or expanded context, embeds that representation, and retrieves documents based on that richer semantic signal.</p><p>This is particularly useful in enterprise environments where:</p><ul><li>Users or agents may not know the exact terminology</li><li>Knowledge is distributed across silos</li><li>Important context is implied rather than explicitly stated</li></ul><p>Techniques like HyDE improve the probability that relevant documents are retrieved even when the original query is underspecified.</p><p>This reinforces a key principle of enterprise AI: better context retrieval produces better reasoning outcomes.</p><h2><strong>Agents aren’t data engineers; they’re reasoning systems</strong></h2><p>They shouldn’t be responsible for stitching together raw data, reconciling schemas, or compensating for poor retrieval.</p><p>This is where a search platform such as <strong>Elasticsearch</strong> becomes foundational.</p><p>By ingesting data once and normalizing it upstream (through pipelines, mappings, enrichment processors, and prebuilt indexes), Elasticsearch resolves schema mismatches, joins signals across sources, and materializes retrieval-ready views of the data. At query time, the agent receives clean, ranked, semantically enriched results rather than fragmented raw records.</p><p>For example, instead of an agent pulling independently from CRM, ticketing, and documentation systems and attempting to reconcile customer IDs, timestamps, and formats in real time, Elasticsearch can pre-index these sources into a unified customer interaction index with hybrid (keyword + vector) search and relevance ranking. The agent then queries a single, coherent interface and immediately reasons over the most relevant context.</p><p>This separation of concerns, that is, <strong>Elasticsearch handling data integration and retrieval, and agents focusing on reasoning, planning, and decision-making</strong>,<strong> </strong>is what makes agent systems scalable, reliable, and production ready.</p><h2><strong>Elastic’s role in the AI stack</strong></h2><p>Elastic sits at the intersection of search and AI by design.</p><ul><li><strong>Connectors and crawlers</strong> ingest data continuously from enterprise systems.</li><li><strong>Semantic and vector search</strong> enable intent-based retrieval.</li><li><strong>Hybrid search</strong> blends lexical precision with semantic understanding.</li><li><strong>RAG workflows</strong> ground LLMs in authoritative, permission-aware data.</li></ul><p>Elastic does not compete with agents or MCP. It <strong>makes them effective</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/193da239544ce858416db845f9fc34c7c0e9b6f9-1920x1080.png" style="width: 100%;" /><h2><strong>Bigger models don’t eliminate retrieval</strong></h2><p>Some have wondered whether huge new LLMs can bypass traditional search, perhaps by letting the model read <em>everything</em> in one go. 
Large context windows feel powerful, but they introduce:</p><ul><li>Higher latency.</li><li>Higher cost.</li><li>Lower precision due to noise.</li><li>A higher propensity for confusion, context clash, and context poisoning.</li></ul><p>RAG wins because it filters first and then reasons. In another <a href="https://www.elastic.co/search-labs/blog/rag-vs-long-context-model-llm#:~:text=,context%20approach%20led%20to%20inaccuracies">Elastic Search Labs experiment</a>, RAG achieved answers in about <strong>1 second</strong>, versus 45 seconds for the raw long-context approach, at <strong>1/1250th</strong> the cost, and with far higher accuracy. In other words, giving an LLM a million tokens of documents is slower, more expensive, and actually <em>less precise</em> than filtering through an index first.</p><h2><strong>Conclusion: MCP changes the interface, not the fundamentals</strong></h2><p>MCP is a meaningful step forward in how agents interact with tools. But it doesn’t replace the need for fast, relevant, governed retrieval.</p><p>In enterprise AI:</p><ul><li>Context quality determines answer quality.</li><li>Indexes create that context.</li><li>Search is the foundation, not the legacy.</li></ul><p>Indexes aren’t obsolete in the era of MCP. They’re <strong>the reason that MCP-based agents can work at all</strong>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch, part 3: Optimizing LLM integration with function calling]]></title>
            <description><![CDATA[Learn how function calling enhances LLM integration, enabling a reliable and cost-efficient entity resolution pipeline in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-function-calling</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b2afa90c1b863c716008f3f5bbdd2866fa1c3577-720x420.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">part 1</a> and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search">part 2</a> of this series, we built a complete entity resolution pipeline that included preparing entities with context and indexing them for semantic search, extracting entities from articles using hybrid named entity recognition (NER), and matching entities using semantic search and large language model (LLM) judgment. The results were promising, but JSON parsing errors significantly lowered measured accuracy by causing otherwise valid judgments to be discarded. The system wasn’t failing because it made bad judgments; it was failing because it couldn’t reliably express them.</p><p>The root of this problem was our somewhat naive choice to use prompt-based JSON generation in which the LLM generates JSON responses in text format. If we asked the LLM to judge more than a couple of matches at a time, the generated JSON was often ill-formed. To mitigate this, we were forced to reduce the processing batch size, which simply won&#x27;t scale in a production system.</p><p>So the prompt-based JSON generation helped validate our approach to entity resolution, but we need a more systematic and reliable method. OpenAI function calling provides a better path by guaranteeing structure and type safety while reducing errors and costs. We chose OpenAI&#x27;s functions for the educational prototype, but other LLM providers typically provide similar functionality (for example, Claude tools).</p><p><strong>Note:</strong> While we discuss production challenges here, this is still an educational prototype demonstrating optimization techniques. Real production systems would need additional considerations, like monitoring, alerting, fallback strategies, and comprehensive error handling.</p><h2>Key concepts: Function calling, schema design, and cost benefits</h2><p><strong>What is function calling?</strong> <em>Function calling</em> is OpenAI&#x27;s structured output API. With it, we can define schemas for LLM responses, so we always know exactly what we&#x27;re going to get. By enforcing the JSON format rather than trying to define it in the LLM prompt, we should be able to eliminate parsing errors.</p><p><strong>Why is it better than prompt-based JSON?</strong> LLMs generate nondeterministic output. One hopes that they&#x27;ll at least generate content that contains the correct response, but the presentation of that response is unpredictable. With a chatbot, this is often not a problem, but our prototype is trying to programmatically process the output. Computer programs demand consistency, so when the LLM generates what we expect, everything is fine, but as soon as it goes off script, so to speak, the code errors out. We could try to account for the different possibilities, but it would be very difficult to catch everything. We could try to enforce more consistent behavior by adding something like &quot;Always return parsable JSON&quot;. We tried this exact technique in the prototype&#x27;s prompt, but we&#x27;ve seen that prompt-based JSON still goes off the rails pretty quickly, particularly if we try to process a batch of matches.</p><p>Function calling makes the LLM generation controllable and predictable, exactly what we need for entity resolution. 
To aid in the definition of the functions, we’ll also follow minimal schema design principles.</p><p><strong>What are minimal schema design principles?</strong> <em>Minimal schema design</em> means defining only the fields you need, using simple types, and avoiding nested structures when possible. This reduces token usage (smaller schemas mean fewer tokens), improves reliability (simpler schemas are easier for the LLM to follow), and lowers costs (fewer tokens mean lower API costs).</p><p><strong>What are the cost and reliability benefits?</strong> Since fewer errors mean match processing is much more likely to succeed, even with large batch sizes, we don&#x27;t have to retry judging matches. The elimination of retries reduces costs by reducing token usage, but using minimal schemas also keeps our token count down. This all leads to a less expensive and more reliable approach that’s much more suitable to use in production.</p><p>We need to check one more thing, though. While matches may be getting processed without error, are the errorless results actually correct? How does this new approach compare to the promising results we saw with the prompt-based approach?</p><h2>Real-world results: Side-by-side comparison</h2><p>As we did in the previous blog, we ran the function calling approach against the tier 4 dataset, which consists of 206 expected matches across 69 articles. The results demonstrate a dramatic improvement:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt-based</td><td style="border: 1px solid #ccc; padding: 4px;">Function calling</td><td style="border: 1px solid #ccc; padding: 4px;">Improvement</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">30.2%</td><td style="border: 1px solid #ccc; padding: 4px;">0.0%</td><td style="border: 1px solid #ccc; padding: 4px;">100% elimination</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Precision</td><td style="border: 1px solid #ccc; padding: 4px;">83.8%</td><td style="border: 1px solid #ccc; padding: 4px;">90.3%</td><td style="border: 1px solid #ccc; padding: 4px;">+6.5pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">62.6%</td><td style="border: 1px solid #ccc; padding: 4px;">90.8%</td><td style="border: 1px solid #ccc; padding: 4px;">+28.2pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">F1 score</td><td style="border: 1px solid #ccc; padding: 4px;">71.7%</td><td style="border: 1px solid #ccc; padding: 4px;">90.6%</td><td style="border: 1px solid #ccc; padding: 4px;">+18.9pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Acceptance rate</td><td style="border: 1px solid #ccc; padding: 4px;">44.8%</td><td style="border: 1px solid #ccc; padding: 4px;">60.2%</td><td style="border: 1px solid #ccc; padding: 4px;">+15.4pp</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">True positives</td><td style="border: 1px solid #ccc; padding: 4px;">129</td><td style="border: 1px solid #ccc; padding: 4px;">187</td><td style="border: 1px solid #ccc; padding: 4px;">+45.0%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">False negatives</td><td style="border: 1px solid #ccc; padding: 4px;">77</td><td style="border: 1px solid #ccc; padding: 4px;">19</td><td style="border: 1px solid #ccc; padding: 4px;">-75.3%</td>
                    </tr>
              </tbody>
            </table>
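<p>These gains come from changing how the judgment is expressed, not how it&#x27;s made. As a rough sketch, a batch judgment call using OpenAI function calling with a minimal schema might look like the following; the function name, schema fields, and model are illustrative rather than the notebook&#x27;s exact code:</p><pre><code>import json

from openai import OpenAI

client = OpenAI()

# Minimal schema: only the fields we need, simple types, one level of nesting.
judge_tool = {
    "type": "function",
    "function": {
        "name": "judge_entity_matches",
        "description": "Record a yes/no judgment for each candidate entity match.",
        "parameters": {
            "type": "object",
            "properties": {
                "judgments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "candidate_id": {"type": "string"},
                            "is_match": {"type": "boolean"},
                            "confidence": {"type": "number"},
                        },
                        "required": ["candidate_id", "is_match", "confidence"],
                    },
                }
            },
            "required": ["judgments"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any function-calling-capable model works here
    messages=[{"role": "user", "content": "Judge these candidate matches: ..."}],
    tools=[judge_tool],
    # Force the model to answer through the function instead of free text.
    tool_choice={"type": "function", "function": {"name": "judge_entity_matches"}},
)

# The arguments arrive as schema-conformant JSON, so parsing is a single
# json.loads call rather than brittle extraction from free-form text.
judgments = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)["judgments"]</code></pre>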
          <h3>Error elimination: The key differentiator</h3><p>The most striking difference is the <strong>complete elimination of JSON parsing errors</strong>. This resulted in a modest precision improvement and a far more dramatic recall improvement. The precision metric captures how often the matches the system accepts were expected in the golden document. So the prototype was decent at judging matches correctly in the prompt-based approach, but function calling does that even better.</p><p></p><p>Conversely, recall tells us how many of the expected matches were found. When a batch of matches comes back with malformed JSON, the system loses all of those matches. It&#x27;s likely that Elasticsearch sends many of these matches for judgment, but we lose those matches if judgment fails. The significant recall improvement shows that this hypothesis is correct. Elasticsearch identifies the potential matches and function calling verifies which of those matches are correct.</p><p></p><p><strong>Note:</strong> It’s expected that Elasticsearch will find some incorrect matches because we look at the top two or three results from hybrid search. Most of the time, hybrid search returns the correct match as the top result, but having the LLM judge the top few hits ensures that we see how the LLM handles incorrect matches. If we move from the educational prototype to a production system, we’ll likely tune the Elasticsearch queries more carefully so that we only send promising matches to the LLM, further optimizing our LLM costs.</p><h2>What&#x27;s next: The ultimate challenge</h2><p>Now that we&#x27;ve optimized our LLM integration with function calling, we have a complete entity resolution pipeline with improved reliability and cost efficiency. However, can it handle the ultimate challenge? In the next post, we&#x27;ll explore how the system handles diverse entity resolution scenarios across 50 different challenge types, including cultural naming conventions, business relationships, titles, and multilingual variations.</p><h2>Try it yourself</h2><p>Want to see function calling optimization in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=5%20minutes%20ago-,04_function_calling_optimization_v3.ipynb,-Initial%20public%20lab">Function Calling Optimization notebook</a> for a complete walkthrough with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to use function calling for structured output, compare it with prompt-based JSON, and analyze cost and reliability benefits.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach optimization concepts. When building production systems, consider additional factors, like multi-provider support, advanced caching strategies, monitoring and alerting, comprehensive error handling, and compliance requirements that aren&#x27;t covered in this learning-focused prototype.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using subagents and Elastic Agent Builder to bring business context into code planning]]></title>
            <description><![CDATA[Learn about subagents, how to ensure they have the right information, and how to create a specialized subagent that connects Claude Code to your Elasticsearch data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/subagents-with-elastic-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/subagents-with-elastic-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Gustavo Llermaly]]></dc:creator>
            <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bfeb78b687bd930371364ee7dd0341ae90004349-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p><a href="https://code.claude.com/docs/en/sub-agents">Subagents in Claude Code</a> let you offload specialized tasks to separate context windows, keeping your main conversation focused. In this article, you&#x27;ll learn what subagents are, when to use them, and how to build a retrieval subagent using Elastic Agent Builder that connects your development workflow to business data in Elasticsearch.</p><h2>What are subagents?</h2><p><em>Subagents </em>are specialized assistants that can be called to execute a specific task, using their own context window. They complete a task and give the results to the main agent, preventing it from saving information that isn’t relevant for the rest of the conversation in the context window.</p><p>Their four core principles are:</p><ul><li><strong>Context preservation:</strong> Each subagent uses its own context window.</li><li><strong>Specialized expertise:</strong> Each subagent is designed for a specific task.</li><li><strong>Reusability:</strong> You can reuse a subagent in different sessions and projects.</li><li><strong>Flexible access:</strong> You can limit the subagent access to specific tools.</li></ul><p>Each subagent can have access to Claude Code tools to work with the terminal, such as glob, read, write, grep, or bash, or to access the internet, like search, fetch, or call external tools with Model Context Protocol (MCP) servers.</p><p>A subagent uses the following schema:</p><pre><code>---
name: your-sub-agent-name
description: Description of when this subagent should be invoked
tools: tool1, tool2, tool3  # Optional - inherits all tools if omitted
model: sonnet  # Optional - specify model alias or 'inherit'
permissionMode: default  # Optional - permission mode for the subagent
skills: skill1, skill2  # Optional - skills to auto-load
---

Your subagent's system prompt goes here. This can be multiple paragraphs
and should clearly define the subagent's role, capabilities, and approach
to solving problems.

Include specific instructions, best practices, and any constraints
the subagent should follow.</code></pre><p>You can call subagents implicitly by talking about the task they run, and Claude will call them automatically. For example, you can say, &quot;I want to plan my new functionality.&quot;</p><p>You can also call them explicitly by directly asking Claude Code to use a subagent and telling it, &quot;Use the planning subagent to plan my new functionality.&quot;</p><p>Another important feature is that subagents are stateful, so once you give one a task, it will generate an ID. This way, when you use it again, you can start from scratch or provide the ID to give it context from its previous tasks.</p><p>You can read the <a href="https://code.claude.com/docs/en/sub-agents">full documentation here</a>.</p><h2>When are subagents used?</h2><p>Subagents are useful when you need to delegate tasks that require specialized context but you don&#x27;t want to clutter the main chat window. Considering our example of coding, the most common subtasks include:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Subtask type</td><td style="border: 1px solid #ccc; padding: 4px;">Description</td><td style="border: 1px solid #ccc; padding: 4px;">Typical tools</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Exploration / research</td><td style="border: 1px solid #ccc; padding: 4px;">Searching and analyzing code without modifying it.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Planning</td><td style="border: 1px solid #ccc; padding: 4px;">Running deep analysis to create implementation plans.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob, bash</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Code review</td><td style="border: 1px solid #ccc; padding: 4px;">Reviewing quality, safety, and best practices.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, grep, glob, bash</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Code modification</td><td style="border: 1px solid #ccc; padding: 4px;">Writing and editing code.</td><td style="border: 1px solid #ccc; padding: 4px;">Read, edit, write, grep, glob</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Testing / debugging</td><td style="border: 1px solid #ccc; padding: 4px;">Running tests and analyzing issues.</td><td style="border: 1px solid #ccc; padding: 4px;">Bash, read, grep, edit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Getting information from external sources (APIs, databases).</td><td style="border: 1px solid #ccc; padding: 4px;">MCP tools, bash</td>
                    </tr>
              </tbody>
            </table>
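<p>To make the schema from earlier concrete, here&#x27;s what a minimal read-only review subagent could look like. The name, description, and system prompt are illustrative and not one of the built-in agents:</p><pre><code>---
name: code-reviewer
description: Reviews code changes for quality, security, and best practices. Invoke after source files have been modified.
tools: Read, Grep, Glob
---

You are a senior code reviewer. Read the files you are pointed at, check for
bugs, security issues, and style problems, and return a concise list of
findings ordered by severity. Do not modify any files.</code></pre><p>Because it only receives Read, Grep, and Glob, this subagent can inspect the codebase but never change it.</p>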
<p>Claude Code includes three built-in agents that showcase these use cases:</p><p></p><ul><li><strong>Explore:</strong> A quick agent for read-only search in the codebase. It&#x27;s great for answering questions like, &quot;Where are the client&#x27;s errors handled?&quot;</li><li><strong>Plan:</strong> Research agent that activates in plan mode to analyze the codebase before proposing changes.</li><li><strong>General-purpose:</strong> The most capable agent for complex tasks that require multiple steps and can include modifications.</li></ul><h2>Context management: Ensuring subagents have the right information</h2><p>One of the most important decisions when designing subagents is how to handle context. There are three key considerations:</p><h3><strong>1. Which context the subagent should get</strong></h3><p>The prompt you give to the subagent must contain all of the necessary information to complete the task since the subagent doesn’t have access to the main chat. You need to be specific:</p><ul><li>Do NOT say, &quot;Review the code.&quot;</li><li>SAY, &quot;Review the changes to src/auth/index.ts, focusing on JWT token validation.&quot;</li></ul><p>Providing the exact file name makes the difference between the subagent reading the file directly with the read tool and running a wide grep search that wastes time and tokens.</p><p>Also consider what not to include. Irrelevant context can distract the subagent or bias results. It’s tempting to ask for multiple things in one pass, but focused tasks yield better results:</p><ul><li>Do NOT say, “Review src/auth/index.ts. Here is also the database schema and our API docs for reference, fix bugs and suggest improvements about the architecture decisions.”</li><li>SAY, “Fix the token refresh bug in src/auth/index.ts that&#x27;s throwing AUTH_TOKEN_EXPIRED unexpectedly.”</li></ul><h3><strong>2. What tools to provide</strong></h3><p>Limit the tools to what’s strictly necessary. This improves security, keeps the subagent focused, and reduces unnecessary tool calls and execution costs.</p><pre><code># For just an analysis agent
tools: Read, Grep, Glob

# For an agent that needs to modify the code
tools: Read, Edit, Write, Grep, Glob</code></pre><p>If you don&#x27;t specify a tools field, the subagent inherits all tools from the main agent, including MCP tools.</p><p>You can learn about all Claude Code tools <a href="https://code.claude.com/docs/en/how-claude-code-works#tools">here</a>.</p><h3><strong>3. How to keep context between calls</strong></h3><p>Subagents can be resumed using their agentId:</p><pre><code># First call
> Use the code-analyzer agent to review the authentication module
[Agent completes the analysis and returns agentId: "abc123"]

# Continue with previous context
> Resume agent abc123 and now analyze the authorization module
[Agent continues with the context from the previous chat]</code></pre><p></p><p>You can ask Claude for the agent ID or find it in <code>~/.claude/projects/{project}/{sessionId}/subagents/</code></p><p>This is especially useful for long research tasks or multistep workflows.</p><p>Another way to keep context consistent is to ask the agent to write a Markdown checklist with what it&#x27;s doing and its current progress. Then you can execute <code>/clear</code> without losing the initial instruction. In that request, you can define the task granularity or details to retain that make sense for your use case.</p><pre><code># Task: Review authentication module

## Progress
- [x] Analyzed src/auth/index.ts
- [x] Found JWT validation issue
- [ ] Review authorization module
- [ ] Check rate limiting

## Findings
- Token refresh has race condition in line 42</code></pre><p>After you clear the conversation, the next agent can pick it up from here. This is very useful when you want an agent to run a script over a list and watch the output record by record.</p><h2>Orchestration patterns</h2><p>It’s important to see subagents as a context optimization mechanism. The way in which you coordinate them determines the efficiency of the whole system. There are different orchestration patterns.</p><h3><strong>Sequential (chaining)</strong></h3><p>Here, a subagent completes a task, and its results feed the next one in a sequence of tasks, similar to traditional Linux piping.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74a3a376600cd1b7cdd2dddddfed2f00ab131eed-896x94.png" style="width: 100%;" /><p>Call example:</p><pre><code>> First use the planning agent to design the feature,
> then use the coding agent to implement it,
> finally use the reviewer agent to check the code</code></pre><h3><strong>Parallel</strong></h3><p>In this pattern, multiple subagents run independent tasks simultaneously. The main Claude Code agent invokes them since <strong>subagents cannot spawn other subagents</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/84eca68d29bf79cf978a8089d3c18972738cd2c1-595x272.png" style="width: 100%;" /><p>This approach reduces the overall execution time for tasks like code review, since several subagents can examine the same code from different angles at once.</p><h3><strong>Hub-and-spoke (delegation)</strong></h3><p>In this approach, the main agent acts as an orchestrator, delegates tasks to specialized agents, and then consolidates the results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/93bb2cc55c435f509b426fbcc090a67c53021684-595x272.png" style="width: 100%;" /><p>This is the pattern we’ll implement in our example: The main Claude Code agent will delegate the gathering of business information to a retrieval agent built with Elastic Agent Builder, while the explore agent looks into local files and the planning agent builds a plan.</p><h2>Why use an agent instead of a single query?</h2><p>Before building our retrieval subagent, it&#x27;s worth understanding when an agent adds value versus when a simple Elasticsearch Query Language (ES|QL) query suffices.</p><p>If you need a single aggregation, like &quot;What&#x27;s our most visited page?&quot;, just run the query directly. The agent adds value when your question requires:</p><ul><li><strong>Multiple queries that build on each other:</strong> The answer from query 1 informs query 2.</li><li><strong>Cross-index reasoning:</strong> Correlating data from different sources.</li><li><strong>Ambiguity resolution:</strong> The agent interprets and follows leads.</li><li><strong>Synthesis:</strong> Combining quantitative data with qualitative knowledge.</li></ul><p>Our example will demonstrate all of these capabilities.</p><h2>Agent Builder as subagent</h2><p>Generating code using AI is very quick, but the challenge is having a good planning phase that sets the boundaries for our coding agent. To help with that, Claude Code includes a subagent that <a href="https://code.claude.com/docs/en/common-workflows#use-plan-mode-for-safe-code-analysis">specializes in planning</a>, performing deep analysis and creating a to-do list for the main agent to execute.</p><p>With this flow, you can plan based on what Claude Code can see both in local files and on the internet. However, there&#x27;s still knowledge available in Elasticsearch that you cannot access via standard tools.</p><p>To access our internal knowledge during the planning phase, we&#x27;ll create a Claude Code subagent by making a retrieval agent using Agent Builder.</p><p>You can configure the agent using the UI or an API. In this example, we&#x27;ll use the latter.</p><h3><strong>Prerequisites</strong></h3><ul><li><a href="https://code.claude.com/docs/en/setup">Claude Code</a> 2.0.76+</li><li>Elasticsearch 9.2</li><li>Elasticsearch <a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">API key</a></li></ul><h3><strong>The scenario: Technical debt sprint planning</strong></h3><p>You&#x27;re a tech lead. You have two weeks and two developers. Your <code>TECH_DEBT.md</code> lists 12 items. You can tackle maybe three or four. 
Which ones should you prioritize?</p><p>The complexity is that you need to optimize across multiple dimensions simultaneously:</p><ul><li><strong>User impact:</strong> How many users hit this issue?</li><li><strong>Business impact:</strong> Does it affect paying customers? Enterprise tier?</li><li><strong>Severity:</strong> Errors? Performance? Just ugly code?</li><li><strong>Effort:</strong> Quick win or rabbit hole?</li><li><strong>Dependencies:</strong> Does fixing A unlock fixing B?</li><li><strong>Strategic alignment:</strong> Does it align with Q1 priorities?</li></ul><p>A single query like &quot;What&#x27;s the most important tech debt item?&quot; fails because this requires:</p><ol><li>Reading <code>TECH_DEBT.md</code> to understand what the 12 items even are.</li><li>For EACH item, querying <code>error_logs</code> to get error frequency.</li><li>Cross-referencing with <code>customer_data</code> to see tier breakdown.</li><li>Checking <code>support_tickets</code> to see complaint volume.</li><li>Reading <code>engineering_standards</code> in the knowledge base to see whether any items violate core principles.</li><li>Reading <code>Q1_roadmap</code> to check strategic alignment.</li></ol><p>This is where a retrieval agent can be helpful in orchestrating multiple queries across different indices and synthesizing the results.</p><h2>Steps</h2><h3><strong>Preparing the test dataset</strong></h3><p>We&#x27;ll create four indices: a knowledge base with internal documentation, error logs, support tickets, and customer data.</p><p>You can create the indices, index the data, and create the agent using one of the following:</p><ul><li><strong>Kibana Dev Tools:</strong> Using the Elasticsearch requests provided below.</li><li><strong>Jupyter Notebook:</strong> Using the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/notebook.ipynb">complete notebook</a> written for this article.</li></ul><h2>Create the indices</h2><p>Open Kibana Dev Tools, and run the following requests to create each index with its mapping and bulk data. Here&#x27;s an example showing the customer data index structure and the data to be indexed:</p><pre><code>PUT customer_data
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "customer_tier": { "type": "keyword" },
      "company_name": { "type": "text" },
      "mrr": { "type": "float" },
      "joined_at": { "type": "date" }
    }
  }
}
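# Now index the sample documents. The other three indices (knowledge, error_logs,
# support_tickets) follow the same pattern: PUT the mapping, then POST the bulk data.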

POST customer_data/_bulk
{"index":{}}
{"user_id":"enterprise_user_01","customer_tier":"enterprise","company_name":"Acme Corp","mrr":2500.00,"joined_at":"2023-01-15"}
{"index":{}}
{"user_id":"enterprise_user_02","customer_tier":"enterprise","company_name":"GlobalTech Inc","mrr":4200.00,"joined_at":"2022-08-20"}
{"index":{}}
{"user_id":"enterprise_user_05","customer_tier":"enterprise","company_name":"DataFlow Systems","mrr":3100.00,"joined_at":"2023-06-01"}
{"index":{}}
{"user_id":"user_001","customer_tier":"free","company_name":"","mrr":0,"joined_at":"2024-03-15"}
{"index":{}}
{"user_id":"user_002","customer_tier":"free","company_name":"","mrr":0,"joined_at":"2024-05-20"}
{"index":{}}
{"user_id":"user_045","customer_tier":"pro","company_name":"SmallBiz LLC","mrr":49.00,"joined_at":"2024-01-10"}
{"index":{}}
{"user_id":"user_089","customer_tier":"pro","company_name":"StartupXYZ","mrr":49.00,"joined_at":"2024-02-28"}</code></pre><p>Full requests for all indices:</p><ul><li><strong>Knowledge index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/knowledge.txt">knowledge.txt</a></li><li><strong>Error logs index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/error_logs.txt">error_logs.txt</a></li><li><strong>Support tickets index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/support_tickets.txt">support_tickets.txt</a></li><li><strong>Customer data index:</strong> <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/customer_data.txt">customer_data.txt</a></li></ul><p>The raw JSON files with the dataset are also available:</p><ul><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/knowledge.json">knowledge.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/error_logs.json">error_logs.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/support_tickets.json">support_tickets.json</a></li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/dataset/customer_data.json">customer_data.json</a></li></ul><h2>Local project files</h2><p>Create the following Markdown (MD) files in your project. These files look like this:</p><pre><code># Tech Debt Items

## AUTH-001: Token refresh race condition
- **Module**: src/auth/refresh.ts
- **Symptom**: Users randomly logged out
- **Estimate**: 3 days

## EXPORT-002: CSV export timeout on large datasets
- **Module**: src/export/csv.ts
- **Symptom**: Timeout after 30s for >10k rows
- **Estimate**: 2 days

...</code></pre><p>Full files:</p><ul><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/TECH_DEBT.md">TECH_DEBT.md</a>: Tech debt items list.</li><li><a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/REQUIREMENTS.md">REQUIREMENTS.md</a>: FlowDesk Q1 2025 requirements.</li></ul><p>This ties directly to the tech debt items and gives the agent clear priorities to work with when cross-referencing with the Elasticsearch data.</p><h2>Create an agent with Agent Builder</h2><p>We&#x27;ll now create an agent that can run analytics queries with ES|QL to provide app usage information, and that can also search the Knowledge Base (KB) for information in unstructured text format.</p><p>We&#x27;re using the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools#built-in-tools">built-in tools</a> since they cover search and analytics on any index. Agent Builder also supports custom tools for more specialized operations, like scoping an index or adding ES|QL dynamic parameters, but that&#x27;s beyond our scope here.</p><p>You can create the agent using the curl request in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/subagents-with-elastic-agent-builder/elasticsearch_requests/create_agent.txt">create_agent.txt</a>.</p><pre><code>curl -X POST "https://${KIBANA_URL}/api/agent_builder/agents" \
  -H "Authorization: ApiKey ${API_KEY}" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "tech-debt-advisor",
    "name": "Tech Debt Prioritization Agent",
    "description": "I help prioritize technical debt by analyzing error logs, support tickets, customer impact, and aligning with engineering standards and roadmap priorities.",
    "avatar_color": "#BFDBFF",
    "avatar_symbol": "TD",
    "configuration": {
      "instructions": "This agent helps prioritize technical debt items. Use the following indices:\n\n- knowledge: Engineering standards, policies, and roadmap priorities\n- error_logs: Production error frequency by module\n- support_tickets: Customer complaints and their urgency\n- customer_data: Customer tier information (enterprise, pro, free)\n\nWhen analyzing tech debt:\n1. Check error frequency in error_logs\n2. Cross-reference affected users with customer_data to understand tier impact\n3. Count support tickets and note urgency markers\n4. Check knowledge base for relevant policies and Q1 priorities\n5. Synthesize findings into prioritized recommendations",
      "tools": [
        {
          "tool_ids": [
            "platform.core.search",
            "platform.core.list_indices",
            "platform.core.get_index_mapping",
            "platform.core.get_document_by_id",
            "platform.core.execute_esql",
            "platform.core.generate_esql"
          ]
        }
      ]
    }
  }'</code></pre><p>You’ll get this response if everything went OK:</p><pre><code>{
  "id": "tech-debt-advisor",
  "type": "chat",
  "name": "Tech Debt Prioritization Agent",
  "description": "I help prioritize technical debt by analyzing error logs, support tickets, customer impact, and aligning with engineering standards and roadmap priorities.",
  ...
}</code></pre><p>The agent will be available in Kibana, so you can now chat with it if you want:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c655ec6b9b1cc2fa1ab3cc13d289e7b96a543284-815x784.png" style="width: 100%;" /><h3><strong>Configure the agent as a Claude Code tool</strong></h3><p>The agent we just created will expose an <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">MCP server</a>. Let&#x27;s add the MCP server to Claude Code using the already-generated API key:</p><pre><code>claude mcp add --transport http agentbuilder https://${KIBANA_URL}/api/agent_builder/mcp --header "Authorization: ApiKey ${API_KEY}"</code></pre><p>We can check the connection status using <code>claude mcp get agentbuilder</code>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba5fbc144f9e29151b8628dffd33dc74b12deece-499x177.png" style="width: 100%;" /><h3><strong>Create a subagent that uses the tool</strong></h3><p>Now that Agent Builder is available as a set of MCP tools, we can create a subagent in Claude Code that uses all or some of those tools, in combination with Claude Code&#x27;s own.</p><p>Claude Code recommends using its agent creator tool for this step:</p><p>1. Type <code>/agents</code> in Claude Code.</p><p>2. Choose <strong>Create new agent</strong>.</p><p>3. Select <strong>Project scope</strong> so that it&#x27;s only available for this project. (This is the recommended setting to avoid agent overflow.)</p><p>4. Select <strong>Generate with Claude (recommended)</strong>.</p><p>5. Type in the description: &quot;Agent that analyzes technical debt by querying Elasticsearch for error logs, support tickets, customer data, and engineering knowledge base. Use this agent when you need to prioritize tech debt items based on business impact.&quot;</p><p>6. In “Select tools,” choose <strong>Advanced options</strong> and select the tools we defined when creating the agent.</p><pre><code>Individual Tools:
☒ platform.core.search (agentbuilder)
☒ platform.core.list_indices (agentbuilder)
☒ platform.core.get_index_mapping (agentbuilder)
☒ platform.core.get_document_by_id (agentbuilder)
☒ platform.core.execute_esql (agentbuilder)
☒ platform.core.generate_esql (agentbuilder)</code></pre><p>7. Select <strong>[ Continue ]</strong>.</p><p>Now choose the model. For planning tasks, the recommendation is to use Opus due to its stronger reasoning capabilities, so let&#x27;s select that and continue.</p><p>Finally, choose the background color for our subagent text and confirm.</p><p>Claude automatically names our subagent based on the description (for example, <code>tech-debt-analyzer</code>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f01ac4c9f30fbcbed7fc69881aae9ff72c4616a0-869x521.png" style="width: 100%;" /><h2>Testing the agent</h2><p>Once the agent has been created, we can test it with a complex prioritization question that requires multistep reasoning:</p><pre><code>> Based on TECH_DEBT.md, which items should we prioritize for our 2-week sprint?
> Use the tech-debt-analyzer agent to check error frequency, customer impact,
> support ticket volume, and alignment with engineering standards.</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d76d972ab5b07e6d35bdf3036cb5ee3c080c7156-749x239.png" style="width: 100%;" /><p>Watch how the agent orchestrates multiple queries:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c49b56c366576406586ba03f694d2bfb09d30895-875x96.png" style="width: 100%;" /><p>It will then give you a comprehensive analysis of the local files combined with Elasticsearch data:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/084c532b4c9e993e53810738ae1da1fd4af1f025-1228x693.png" style="width: 100%;" /><p>This demonstrates why a single query fails and an agent succeeds: It orchestrates five or more queries across different indices, correlates the data, and synthesizes a recommendation that contradicts the naive &quot;fix highest error count&quot; approach.</p><p>By typing <code>/context</code>, we can see how much context each MCP tool definition and our subagent&#x27;s prompt consume. Keep an eye on this overhead when creating subagents.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a6f528a9cae7b7fbf17f1b97e13c51c78c1b8b4-666x391.png" style="width: 100%;" /><h2>Start planning</h2><p>We can now start planning using local files, the internet, and our Elasticsearch knowledge as information sources.</p><p>Ask something like:</p><pre><code>"Based on our requirements defined in REQUIREMENTS.md, use the planning agent
to create a detailed implementation plan, prioritizing tasks according to
business impact. Use the tech-debt-analyzer agent to query about internal
company knowledge and make analytical queries about error patterns and
customer impact."</code></pre><p>Note that Claude decides to run the Elasticsearch data analysis and the local documentation reading in parallel, following the hub-and-spoke orchestration pattern.</p><p>After the analysis, you should get a plan that prioritizes based on actual business data rather than on assumptions. This context will make your AI coding experience much more reliable, as you can feed this plan directly to the agent and execute step by step:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cfa5c6c44425d6e73355116e08082a33699915a3-961x873.png" style="width: 100%;" /><p>The more details you provide and the more focused the instructions are, the better the quality of the plan will be. If you have an existing codebase, it will suggest code changes.</p><h2>Conclusion</h2><p>Subagents are a great tool for offloading specific tasks where the main chat only needs the final result (not how we got there), keeping the chat flow focused.</p><p>By choosing the right orchestration pattern (sequential, parallel, or hub-and-spoke) and handling the context properly, we can build efficient and maintainable agent systems.</p><p>Elastic Agent Builder and its MCP feature allow us to access our data using a retrieval subagent to facilitate planning and coding by combining local (files, source code), external (internet), and internal (Elasticsearch) sources. The key insight is that agents add value not for simple queries but when you need multistep reasoning that builds on previous results and synthesizes information from multiple sources.</p><h2>Resources</h2><ul><li><a href="https://code.claude.com/docs/en/sub-agents">Claude Code Subagents</a></li><li><a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a></li><li><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Agent Builder MCP</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Adaptive early termination for HNSW in Elasticsearch]]></title>
            <description><![CDATA[Introducing a new adaptive early termination strategy for HNSW in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hnsw-elasticsearch-adaptive-early-termination</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hnsw-elasticsearch-adaptive-early-termination</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tommaso Teofili]]></dc:creator>
            <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6d3186f609dd56dc5ffe33d70fa9e5cfa05b51f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch uses the <a href="https://www.elastic.co/search-labs/blog/hnsw-graph">Hierarchical Navigable Small World</a> (HNSW) algorithm to perform vector search over a proximity graph. HNSW is known to provide a nice trade-off between the quality of k-nearest neighbor (KNN) results and the associated cost.</p><p>In HNSW, search proceeds by iteratively expanding candidate nodes in the graph, maintaining a bounded set of nearest neighbors discovered so far. Each expansion has a cost (vector operations, random seeks to disk, and more), and the marginal benefit of that cost tends to decrease as the search progresses.</p><p>One way to optimize HNSW graph traversal is to stop searching when the marginal likelihood of finding new true neighbors drops too low. For this reason, in <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-dense-vector-hnsw-early-termination">Elasticsearch 9.2</a> we introduced a new <a href="https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination">early termination mechanism</a>. This stops the search process when visiting additional graph nodes fails to provide enough new nearest neighbors for a fixed number of consecutive visits.</p><p>This article guides you through how we improved on that early termination mechanism to make it better suited to different datasets and data distributions.</p><h2><strong>Early termination in HNSW</strong></h2><p>In HNSW, search proceeds by iteratively expanding candidate nodes in the proximity graph, maintaining a bounded set of nearest neighbors discovered so far, until it either has visited the whole graph or meets some early stop criteria.</p><p>Early termination is therefore not just an optimization; it’s <strong>part of the search algorithm itself</strong>. The moment we decide to stop determines the balance between efficiency and recall. In Elasticsearch, there are already a number of ways a query on HNSW can terminate early:</p><ul><li>A fixed maximum number of nodes is visited.</li><li>A fixed timeout is reached.</li></ul><p>While simple and predictable, these rules are largely <strong>agnostic to what the search is actually doing</strong>. Also, they’re mostly there to make sure that the query finishes in a reasonable time for the end user.</p><p>In a <a href="https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination">previous blog post</a>, we introduced the concept of redundancy in HNSW. In short, redundant computations occur when HNSW continues to evaluate new candidate nodes that don’t result in finding more nearest neighbors.</p><h2><strong>Patience: Measuring progress instead of effort</strong></h2><p>The notion of <em>patience</em> reframes early termination around <strong>progress rather than effort</strong>.</p><p>Instead of asking:</p><p>“How many steps have we taken?”</p><p>The new question becomes:</p><p>“How much computation are we willing to waste before we lose hope?”</p><p>During HNSW search, early exploration typically produces peak improvements to the top-k candidate set. During the first steps of the HNSW graph exploration, the set of neighbors is continuously updated as the algorithm keeps discovering nearer and nearer neighbors to the query vector. Over time, these improvements become rarer as the search converges.
<a href="https://cs.uwaterloo.ca/~jimmylin/publications/Teofili_Lin_ECIR2025.pdf">Patience-based termination</a> monitors this pattern and terminates the search once improvements have ceased for a sustained period.</p><p>In practice, while visiting the HNSW graph we also compute the queue saturation ratio as we hop through candidate nodes. This measures the percentage of nearest neighbors that were left unchanged while visiting the most recent graph node (or the inverse of the number of new neighbors introduced during the last iteration). When this ratio stays too high for too many consecutive iterations, we stop visiting the graph.</p><p>Conceptually, patience treats HNSW search as a <strong>diminishing returns process</strong>. When returns flatten out, continuing to explore the graph yields little benefit.</p><p>This framing is powerful because it ties termination directly to <em>observable outcomes</em> rather than to arbitrary fixed limits.</p><p>The benefit of using this smart early termination technique is that HNSW graph explorations tend to visit fewer graph nodes while retaining almost perfect relative recall.</p><p>To visualize this, we can plot the recall per visited node with patience-based early termination (labeled as <em><code>et=static</code></em>) against the default HNSW behavior (labeled as <em><code>et=no</code></em>) on two datasets, FinancialQA and Quora, and two models, JinaV3 and E5-small.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a9d07c5153ea64a2426c82487c36846030692bb9-1600x945.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dac56125c4b16d1b596c9876b6ca9ac7b2dc87fa-1600x944.png" style="width: 100%;" /><h2><strong>Static thresholds and HNSW dynamics</strong></h2><p>In practice, in Elasticsearch this is implemented using <strong>static thresholds</strong>. One is the <strong>saturation threshold</strong>: the saturation ratio that we consider suboptimal. The other is the <strong>patience threshold</strong>: the number of consecutive graph nodes that we allow to be visited while queue saturation remains suboptimal.</p><p>When we introduced this early termination strategy in Elasticsearch 9.2, we opted for conservative defaults, so as to preserve recall as much as possible while still gaining in terms of latency and memory consumption. For this reason, we set the saturation threshold to 100% and the patience threshold to a (bounded) 30% of the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-top-level-parameters:~:text=search%20request%20size.-,num_candidates,-(Optional%2C%20integer)%20The"><em><code>num_candidates</code></em></a> in the KNN query.</p><p>In many scenarios, these settings worked nicely; however, two queries requesting the same number of neighbors might have radically different convergence behaviors. Some queries encounter dense local neighborhoods and saturate quickly; others must traverse long, sparse paths before finding competitive candidates.
The latter proved the most difficult to handle effectively.</p><p>As a result, we sometimes noticed:</p><ul><li>Over-exploration for easy queries.</li><li>Premature termination for hard queries.</li></ul><p>We realized that fixed threshold values encode global assumptions about convergence, and that HNSW could instead adapt to different dynamics.</p><h2><strong>Making HNSW early termination adaptive</strong></h2><p>Adaptive early termination approaches this problem from a different angle. Instead of enforcing predefined stopping thresholds, the algorithm <strong>infers when to stop from the search dynamics themselves</strong>.</p><p>So instead of comparing the queue saturation ratio between two consecutive candidates, we introduce an instant smoothed discovery rate $d_{q,i}$ (how many new neighbors were introduced for a query <em>q</em> in the last visit <em>i</em>), together with the rolling mean $\mu_{q,i}$ and standard deviation $\sigma_{q,i}$ of that discovery rate during the graph visit (computed using <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford&#x27;s_online_algorithm">Welford’s algorithm</a>). These statistics are calculated per query, so they can be used to decide a different degree of patience for each query.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1958be7ca4425ade66eaf621ada3533173183598-694x118.png" style="width: 100%;" /><p>The previously static thresholds become adaptive to the discovery rate statistics: the saturation threshold becomes the rolling mean plus the standard deviation, while the patience adapts, scaling inversely with the standard deviation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f7ee4a55c24853b657df26052b275e8bd76cf0f9-654x156.png" style="width: 100%;" /><p>The early exit rules remain the same: saturation happens when the instant discovery rate is lower than the adaptive saturation threshold, and the graph visit stops if the saturation persists for a number of consecutive candidate visits larger than the adaptive patience.</p><p>This way, we obtain a behavior that doesn’t depend on the <em><code>num_candidates</code></em> parameter in the KNN query (which might be set explicitly or left at its default, regardless of early exit) and that better adapts to each query and vector distribution dynamically.</p><p>On FinancialQA and Quora, the adaptive strategy (labeled as <em><code>et=adaptive</code></em>) reports a higher recall per visited node than both the static strategy (<em><code>et=static</code></em>) and the default HNSW behavior (<em><code>et=no</code></em>).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2a906997d9a25d74c7038bd9661bc97581e7258e-1600x938.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a114911e232c351dbb814cea20e8b0f1415a717-1600x925.png" style="width: 100%;" /><p>Adaptive early termination is turned on by default in Elasticsearch 9.3 for HNSW dense vector fields (and it can be turned off via the <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-dense-vector-hnsw-early-termination">same index-level setting</a>).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Common Expression Language (CEL): How the CEL input improves data collection in Elastic Agent integrations]]></title>
            <description><![CDATA[Learn how the Common Expression Language differs from other programming languages, how we’ve extended it for Filebeat’s CEL input, and the flexibility it gives you to express data collection logic in Elastic Agent integrations.]]></description>
            <link>https://www.elastic.co/search-labs/blog/common-expression-language-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/common-expression-language-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Berkhout]]></dc:creator>
            <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/985c50bfabee3348494eb4307f0b3375a97a0644-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Agent <a href="https://www.elastic.co/integrations">integrations</a> allow users to ingest data into Elasticsearch from a wide range of sources. They combine collection logic, ingest pipelines, dashboards, and other artifacts into a package that can be installed and managed from the Kibana web interface.</p><p>Integrations configure <a href="https://www.elastic.co/docs/reference/beats/filebeat/configuration-filebeat-options">Filebeat inputs</a> to do the data collection. To collect data from HTTP APIs, we’ve often used the <a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-httpjson">HTTP JSON input</a>. However, even basic listing APIs can differ greatly in the details, and the HTTP JSON input&#x27;s model of YAML-configured transformations can make it awkward and sometimes impossible to express the required collection logic.</p><p>The <a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-cel">Common Expression Language (CEL) input</a> was introduced to allow more flexible interaction with HTTP APIs. <a href="https://cel.dev/">CEL</a> is a language designed to be embedded in applications that require a fast, safe, and extensible way to express conditions and data transformations. The CEL input lets an integration builder write one expression that can read settings, keep track of its own state, make requests, process responses, and ultimately return events ready to ingest.</p><p>In this article, we’ll look at how CEL differs from other programming languages, how we’ve extended it for the CEL input, and the flexibility and power that gives you to express your data collection logic.</p><h2>CEL and how it works in the input</h2><p>CEL is an expression language. It has no statements. When you write CEL, you don’t tell it what to do by writing statements, you tell it what value to produce by writing an expression. Every CEL expression produces a value, and smaller expressions can be combined into a larger expression to produce a result according to more complex rules. Later, we’ll see how to use expressions for things that may be written with statements in other languages.</p><p>CEL is intentionally a non-Turing complete language. It doesn’t allow unbounded loops. Later, we’ll see how you can process lists and maps using macros, but by avoiding unbounded loops, the language guarantees predictable and limited execution time for individual expressions.</p><p>The CEL input is configured with a CEL program (an expression) and some initial state. The state will be provided as input to the program. The program is evaluated to produce an output state. If the output state includes a list of events, those will be removed and published. The rest of the output state will be used as the input for the next evaluation. If the output state includes one or more events and the flag <code>want_more: true</code>, the next evaluation will be performed immediately; otherwise, it will sleep for the rest of the configured interval time before continuing. Here’s a simplified diagram of the input’s control flow:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/42671541f97e2dba808fd53969fe12f517917f9a-1600x529.png" style="width: 100%;" /><p>The output of each evaluation will be passed forward as the input to the next evaluation, for as long as the input runs. 
Output data under the key &quot;<code>cursor</code>&quot; will be persisted to disk and reloaded after the input is restarted, but the rest of the state will not be preserved across restarts.</p><p>The CEL language itself has limited functionality and avoids side effects, but it is extensible. The <a href="https://github.com/google/cel-go">cel-go</a> implementation adds some functionality, such as optional syntax and types. The <a href="https://github.com/elastic/mito">Mito</a> library builds on cel-go and adds more functionality, including the ability to make HTTP requests. The CEL input uses Mito’s version of CEL.</p><h2>Working with Mito</h2><p>To build or debug an integration using the CEL input, the most important thing to understand is what output state your CEL program will produce for a given input state. During development, it can be cumbersome to have your CEL program run by the input, surrounded by the full Elastic stack. One way to achieve a faster feedback loop is to use Mito’s command-line tool, which will let you run a CEL program directly and see the output it produces for a given input.</p><p>Mito is written in Go and can be installed as follows:</p><pre><code>go install github.com/elastic/mito/cmd/mito@latest</code></pre><p>When you run a CEL program with Mito, you typically give it two files: a JSON file with the initial input state, and another file with the source code of your CEL program:</p><pre><code>mito -data state.json src.cel</code></pre><p>For easier copying and pasting, the examples in this article are written as single commands that have the shell create temporary files on the fly, by wrapping the content of each file in <code>&lt;(echo &#x27;...content...&#x27;)</code>. In your own development, working with actual files will be easier.</p><h2>Fetching issues data from GitHub</h2><p>The following example includes a full CEL program that will fetch data about issues from the <a href="https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues">GitHub API</a>. Its initial input state has a URL for the API endpoint, and some information about how it should handle pagination. The CEL program uses the data in the input state to generate a request. It will decode the response, produce events from it, and return them as part of its output state.</p><pre><code>mito -data <(echo '
  {
    "url": "https://api.github.com/repos/elastic/integrations/issues",
    "per_page": 3,
    "max_pages": 3
  }
') <(echo '
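  // Read the page number from the persisted cursor, defaulting to 1 on the first run.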
  int(state.?cursor.page.orValue(1)).as(page,
    (
      state.url + "?" + {
        "state": ["all"],
        "sort": ["created"],
        "direction": ["asc"],
        "per_page": [string(state.per_page)],
        "page": [string(page)],
      }.format_query()
    ).as(full_url,
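      // Build a GET request with the headers that the GitHub API expects, then send it.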
      request("GET", full_url).with({
        "Header": {
          "Accept": ["application/vnd.github+json"],
          "X-GitHub-Api-Version": ["2022-11-28"],
        }
      }).do_request().as(resp,
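        // Decode the JSON response body into a list of issues.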
        resp.Body.decode_json().as(data,
          state.with({
            "events": data.map(i, {
              "html_url": i.html_url,
              "title": i.title,
              "created_at": i.created_at,
            }),
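            // Persist the next page number under "cursor" so paging survives restarts;
            // keep fetching immediately (want_more) while pages come back full, up to max_pages.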
            "cursor": { "page": page + 1 },
            "want_more": size(data) == state.per_page && page < state.max_pages,
          })
        )
      )
    )
  )
')</code></pre><p>Its first evaluation produces the following output:</p><pre><code>{
  "cursor": {
    "page": 2
  },
  "events": [
    {
      "created_at": "2018-09-14T09:47:35Z",
      "html_url": "https://github.com/elastic/integrations/issues/3250",
      "title": "Increase support of log formats in haproxy filebeat module"
    },
    {
      "created_at": "2019-02-06T12:37:37Z",
      "html_url": "https://github.com/elastic/integrations/issues/487",
      "title": "ETCD Metricbeat module needs polishing and grooming"
    },
    {
      "created_at": "2019-08-13T11:33:11Z",
      "html_url": "https://github.com/elastic/integrations/pull/1",
      "title": "Initial structure"
    }
  ],
  "max_pages": 3,
  "per_page": 3,
  "url": "https://api.github.com/repos/elastic/integrations/issues",
  "want_more": true
}</code></pre><p>The events will be removed, and when run in the CEL input, they’ll be published for ingestion. The rest of the output will be provided to the next CEL program evaluation as its input state.</p><p>To understand how that CEL program works, we’ll look at some smaller CEL examples and discuss more details of how the CEL input operates.</p><h2>CEL basics</h2><p>In the CEL language, there are no statements; there are only expressions. Every successful CEL expression evaluates to a final value. Here’s one of the smallest CEL expressions you can write, along with its output:</p><pre><code>mito <(echo '
  "hello" + " " + "world"
')</code></pre><pre><code>"hello world"</code></pre><p>Many simple expressions are intuitive. Mathematical operations are only supported on values of the same type (for example, <code>int</code> with <code>int</code>), so convert types as you need (here from <code>int</code> to <code>double</code>):</p><pre><code>mito <(echo '
  double((1 + 2) * (3 + 4)) / 2.0
')</code></pre><pre><code>10.5</code></pre><p>There are no variables in the CEL language, but an expression can be given a name and used in a larger expression with the help of Mito’s <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-As__Macro_-Collections"><code>as</code></a> macro. In this example, the expression <code>(1 + 1)</code> evaluates to the value <code>2</code>, and <code>.as(n, ...)</code> gives that value the name <code>n</code> for use in the expression <code>&quot;one plus one is &quot;+string(n)</code>:</p><pre><code>mito <(echo '
  (1 + 1).as(n, "one plus one is "+string(n))
')</code></pre><pre><code>"one plus one is 2"</code></pre><p>It&#x27;s also possible to accumulate information in a map and use it later in the expression, as demonstrated here using <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-With-Collections"><code>with</code></a>:</p><pre><code>mito <(echo '
  { "key": "value" }.with({ "key2": "value2" }).as(data,
    {
      "data": data,
      "size": size(data),
    }
  )
')</code></pre><pre><code>{
  "data": {
    "key": "value",
    "key2": "value2"
  },
  "size": 2
}</code></pre><p>Look at that example again. Notice that the nested part, <code>({ &quot;data&quot;: data, &quot;size&quot;: size(data), })</code>, gives us the shape of the final value. It’s a map with the keys <code>&quot;data&quot;</code> and <code>&quot;size&quot;</code>. The values for those keys depend on <code>data</code>, which is defined by the outer part of the expression. Reading CEL expressions from the inside out can help to quickly see what they’ll return.</p><p>CEL has no control flow statements, like <code>if</code>, but conditional branching can be done with the ternary operator:</p><pre><code>mito <(echo '
  1 + 1 < 12 ? "few" : "many"
')</code></pre><pre><code>"few"</code></pre><p>Unbounded loops and recursion are not supported, as CEL is not a Turing complete language. That makes execution time predictable and proportional to the size of the input data and the expression complexity.</p><p>Although unbounded loops are not possible in individual CEL expressions, you can process lists and maps using macros like <a href="https://github.com/google/cel-spec/blob/master/doc/langdef.md#macros"><code>map</code></a>:</p><pre><code>mito <(echo '
  [1, 2, 3].map(x, x * 2)
')</code></pre><pre><code>[2, 4, 6]</code></pre><p>In this section, we’ve covered:</p><ul><li>Strings, numbers, lists, and maps.</li><li>String concatenation.</li><li>Mathematical operations.</li><li>Type casting.</li><li>Conditionals.</li><li>Naming sub-expressions.</li><li>Processing collections.</li></ul><p>Next, we’ll look at how to make HTTP requests.</p><h2>Requests</h2><p>Mito extends CEL with the ability to make <a href="https://pkg.go.dev/github.com/elastic/mito/lib#HTTP">HTTP requests</a>:</p><pre><code>mito <(echo '
  get("https://example.com").as(resp, string(resp.Body))
')</code></pre><pre><code>"<!doctype html><html lang=\"en\"><head><title>Example Domain</title>..."</code></pre><p>Requests can be explicitly constructed before they’re executed. That makes it possible to use different HTTP methods and to add headers and a body.</p><p>In this example, we build a URL with the help of <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-Format_Query-HTTP"><code>format_query</code></a>, add a header to the request, and parse the response body with <a href="https://pkg.go.dev/github.com/elastic/mito/lib#hdr-Decode_JSON-JSON"><code>decode_json</code></a>. When given the <code>-log_requests</code> option, Mito will log detailed information in JSON format about each request and response.</p><pre><code>mito -log_requests <(echo '
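  // Construct the request explicitly to set the method and add headers before executing it.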
  request("GET",
    "https://postman-echo.com/get?" + {
        "q": ["query value"]
     }.format_query()
  ).with({
    "Header": { "Accept": ["application/json"] }
  }).do_request().as(resp, {
    "status": resp.StatusCode,
    "data": resp.Body.decode_json(),
  })
')</code></pre><pre><code>{"time":"...","level":"INFO","msg":"HTTP request",...}
{"time":"...","level":"INFO","msg":"HTTP response",...}
{
  "data": {
    "args": {
      "q": "query value"
    },
    "headers": {
      "accept": "application/json",
      "accept-encoding": "gzip, br",
      "host": "postman-echo.com",
      "user-agent": "Go-http-client/2.0",
      "x-forwarded-proto": "https"
    },
    "url": "https://postman-echo.com/get?q=query+value"
  },
  "status": 200
}</code></pre><h2>Managing state and evaluations</h2><p>Now that we’ve covered how to make requests and the CEL basics required to produce our desired output state, let’s take a closer look at what we should put into the output state and how that lets us direct later processing.</p><p>An integration’s CEL program needs to make sure its output state is suitable for use as the input of the next evaluation. Configuration sets the initial state, and that should be repeated in the output with any appropriate changes. An easy way to do that is to use <code>state.with({ ... })</code>, to repeat the state map with some overrides. A common pattern for small programs is to wrap the whole program in <code>state.with()</code>, so that state propagation doesn’t have to be repeated in each branch that generates output data (for example, success, errors).</p><p>When there are state values that are initialized by an evaluation rather than hard-coded in the initial input state, the program will need to check for an existing value before setting the initial one. That’s something that the support for <a href="https://pkg.go.dev/github.com/google/cel-go/cel#OptionalTypes">optional syntax and types</a> can help with. By using a question mark before the field name in a map key, the access becomes optional: It may or may not resolve to a value, but further optional accesses are possible and it’s easy to supply a default if no value is present:<br/></p><pre><code>mito -data <(echo '{}') <(echo '
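  // Optional access: reuse the counter from state if present, otherwise start at 0.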
  int(state.?counter.orValue(0)).as(counter,
    state.with({
      "counter": counter + 1,
      "want_more": counter + 1 < 3,
    })
  )
')</code></pre><pre><code>{ "counter": 1, "want_more": true }
{ "counter": 2, "want_more": true }
{ "counter": 3, "want_more": false }</code></pre><p>In that example, the counter value read from state is cast to <code>int</code> because all numbers are serialized in the state as floating point numbers, in keeping with conventions established by JSON and JavaScript’s <code>Number</code> type. It should also be noted that <code>&quot;want_more&quot;: true</code> is honored here by Mito, but when run in the CEL input, the evaluation would only be repeated if the output also contains events.</p><p>It’s a requirement of CEL programs run by the CEL input that they return an <code>&quot;events&quot;</code> key in their output map. Its value can be a list of event maps, an empty list, or a single event map. The single event case is usually used for errors. The event will be published by the input, but its value will also be logged, and if it sets an <code>error.message</code> value, that will be used to update the integration’s Fleet health status. If your program produces a single non-error event, it’s best to wrap it in a list.</p><p>Take another look at the output of our GitHub issues program from earlier:</p><pre><code>{
  "url": "https://api.github.com/repos/elastic/integrations/issues",
  "per_page": 3,
  "max_pages": 3,
  "cursor": {
    "page": 2
  },
  "events": [
    { ... },
    { ... },
    { ... }
  ],
  "want_more": true
}</code></pre><p>The program effectively managed its state by:</p><ul><li>Repeating initial state values in <code>url</code>, <code>per_page</code>, and <code>max_pages</code>.</li><li>Adding state that should be persisted across restarts in <code>cursor.page</code>.</li><li>Returning events ready to publish in the <code>events</code> list.</li><li>Requesting immediate re-evaluation with <code>want_more: true</code>.</li></ul><p>Now that you understand optional access and state management, as well as CEL basics and HTTP requests, the full GitHub issues program should be readable. Try running it with Mito and experimenting with some changes.</p><h2>Review and resources</h2><p>In this article, we looked at what the CEL language is and how it has been extended in the Mito library for use in the CEL input. We saw the flexibility of CEL in an example program that fetches issues information from the GitHub API, and went through all the details necessary to understand that program, covering access to settings in the initial state, interaction with HTTP APIs, returning events to be ingested, and managing the state for later program executions.</p><p>To learn more and build integrations using the CEL input, there are a number of resources worth exploring:</p><ul><li><a href="https://www.elastic.co/docs/reference/beats/filebeat/filebeat-input-cel">CEL input - Filebeat documentation</a></li><li><a href="https://pkg.go.dev/github.com/elastic/mito">Mito documentation</a></li><li><a href="https://cel.dev/">Common Expression Language - cel.dev website</a></li><li><a href="https://www.elastic.co/docs/extend/integrations">Create an Integration - Elastic documentation</a></li></ul><p>And perhaps the most valuable resource for building integrations with the CEL input is the CEL code of existing Elastic integrations, which can be found on GitHub:</p><p><a href="https://github.com/search?q=repo%3Aelastic%2Fintegrations+path%3A**%2Fcel.yml.hbs&amp;type=code"><code>cel.yml.hbs</code> files in the Elastic integrations repository - GitHub</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch & LLMs, Part 2: Matching entities with LLM judgment and semantic search]]></title>
            <description><![CDATA[Using semantic search and transparent LLM judgment for entity resolution in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-entity-resolution-llm-semantic-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/473ca4357c7d60f690edbd2a844acda169aca9c3-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In<a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch"> Part 1</a>, we prepared our watch list and extracted entity mentions. Now we’re ready to answer the hard question: Which entity does a mention actually refer to? Let&#x27;s return to the example in the first blog of this series, which set up why we need entity resolution: &quot;The Swift update is here!&quot; Imagine that this headline is accompanied by a bit more context:</p><ol><li>The new Swift update is here! Developers are eager to try out the new features.</li><li>The new Swift update is here! The new album will drop next month.</li></ol><p>With this added context, we should be able to resolve the name &quot;Swift&quot; to the correct entity.</p><p>In the <a href="https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch">previous post</a>, we set up our watch list and enriched the entities with additional context. Looking at our examples above, we need to have at least the following two entities in the list: Taylor Swift and Swift Programming Language. We also covered how we extract entity mentions from text. Both of these examples would extract &quot;Swift&quot;. With these ingredients in place (the enriched watch list and the extracted entities), we’re finally ready to introduce the star of the show: entity matching.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach entity-matching concepts. Production systems might use different large language models (LLMs), custom matching rules, specialized judgment pipelines, or ensemble approaches combining multiple matching strategies.</p><h2>The problem: Why matching is hard</h2><p>Human language is a remarkable thing. One of its most interesting properties is its endless creativity. We can generate and understand an infinite number of new sentences. Is it any wonder, then, that exact matches in entity resolution are rare? Authors strive to be creative when they can. It would get quite tedious if we had to write and read full names whenever an entity is mentioned. So, while exact matches are easy, the reality is that we need a more sophisticated approach to entity resolution: one that’s robust enough to handle at least some of the boundless creativity of human authors. That’s why we separate the problem into two steps: Use Elasticsearch to retrieve plausible candidates at scale, and then use an LLM to judge whether those candidates truly refer to the same real-world entity.</p><h2>The solution: Three-step matching with transparent LLM judgment</h2><p>We’re in the midst of a paradigm shift in how we use computers. Just as the rise of the internet took us from localized computing to a globally connected network, generative AI (GenAI) is fundamentally changing how content, code, and information are created. In fact, the educational prototype that accompanies this series was almost exclusively &quot;vibe coded&quot; using an LLM with careful prompting by the author. This is not to say that LLMs have reached, or ever will reach, the kind of productivity inherent in human language, but it does mean that we now have a powerful resource to help with entity resolution.</p><p>A common pattern we use with GenAI is retrieval augmented generation (RAG). Here, <em>retrieval</em> means retrieving entity candidates (not generating answers), and the LLM is used strictly for match evaluation and explanation.
While we <em>could</em> ask an LLM to help us with end-to-end entity resolution, that’s a costly approach in terms of both time and money. RAG provides context to the LLM through more efficient retrieval, so the model can concentrate on the judgment step of entity resolution.</p><p>For the retrieval part of RAG, we again turn to Elasticsearch. We first find potential matches using a combination of exact matching, matching against aliases, and hybrid search, which combines keyword and semantic search. Once we find these potential matches, we send them to an LLM for judgment. The LLM acts as the final match evaluator. We also make the LLM explain its reasoning, an important differentiator from other entity resolution systems. Without these explanations, entity resolution is a black box; with them, we can see for ourselves why a match makes sense.</p><h2>Key concepts: Three-step matching, hybrid search, and transparent LLM judgment</h2><p><strong>What is three-step matching?</strong> At the outset of this project, we hypothesized that semantic search would be a crucial part of the system, but not every match requires such sophisticated search. To find matches efficiently, we take a progressive approach to the problem. First, we check for exact matches using keyword search. If we find such a match, our work is done and we can move on. If exact matching fails, we turn to alias matching. In the prototype, alias matching is also done using exact matching with keywords, for simplicity. In production, you might expand this step with normalization, transliteration rules, fuzzy matching, or curated alias tables. If we still haven&#x27;t found a potential match in the first two steps, then it&#x27;s time to bring in semantic search via Elasticsearch&#x27;s hybrid search with reciprocal rank fusion (RRF).</p><p><strong>What is hybrid search?</strong> In Elasticsearch, we can use semantic search to find meaningful matches that take context into account. Elasticsearch is widely used for vector search and hybrid retrieval. Semantic similarity is powerful for meaning, but it’s not a substitute for structured filtering (for example, by time ranges, locations, or identifiers), and it’s often unnecessary when an exact match is available. Elasticsearch made its mark with lexical search, which is great at tasks where semantic search doesn&#x27;t fit. To take full advantage of both approaches, we use lexical search alongside semantic search in a single hybrid query. We then merge the results to find the most likely matches using RRF. In the prototype, the top two results become potential matches that can be sent for LLM judgment.</p><p><strong>Why LLM judgment?</strong> LLM judgments and explanations allow our system to handle ambiguity and context transparently. This is vital for cases like &quot;the president&quot;, which could refer to multiple entities, depending on the context, but it also makes things like nicknames and cultural variations work well in the system. Finally, when we consider mission-critical tasks, like identifying entities from sanctions lists, we need to know why a match was accepted in order to trust the system.
Crucially, the LLM does not search the full corpus; it evaluates only the small set of candidates returned by Elasticsearch.</p><h2>Real-world results: Matching with LLM reasoning</h2><p>A major challenge for any natural language processing task is the creation of a golden document, an &quot;answer key&quot; that tells us what the expected results are. Without this, it&#x27;s next to impossible to judge how well a system performs on a task, but creating such a document can be a laborious process. For the entity resolution prototype, we turned again to GenAI to help set up data we could test against.</p><p>We first defined several challenge types, such as nicknames and transliteration, and then asked the LLM to create a tiered collection of datasets that would get progressively larger and more challenging for the system. The creation of the datasets was less straightforward than one might hope. The LLM had a strong propensity for &quot;cheating&quot; by making it too easy to get the right answer. For example, one of the challenge types focused on semantic context. This type included things like resolving &quot;Russian author&quot; to &quot;Leo Tolstoy&quot;. The LLM incorrectly put &quot;Russian author&quot; as an alias for &quot;Leo Tolstoy&quot;, which negated the need for hybrid search to find the match.</p><p>After several refactorings to fix issues like this, we had five dataset tiers to work with. Tiers 1–4 were progressively larger with more challenge types. Tier 5 was the &quot;ultimate challenge&quot; dataset, made up of the trickiest examples from all challenge types. All of the test data is available in the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/comprehensive_evaluation">comprehensive evaluation directory</a>.</p><p>To evaluate our prompt-based entity resolution approach, we focused our attention on the tier 4 dataset. An important note is that the evaluation was conducted as a controlled experiment so that we could focus on entity match quality. The watch list data was pre-enriched with context, and entities were extracted from the article ahead of time. This ensured that evaluation was focused on matching rather than on extraction accuracy. This isolates match quality; end-to-end performance would additionally depend on extraction recall and enrichment quality.</p><h3>Evaluation dataset</h3><p>The tier 4 evaluation dataset provides a comprehensive test of the system&#x27;s capabilities:[1]</p><ul><li><strong>Watch list entities:</strong> 66 entities across diverse types (people, organizations, locations).</li><li><strong>Test articles:</strong> 69 articles covering real-world entity resolution scenarios.</li><li><strong>Expected matches:</strong> 206 expected entity matches across all articles.</li><li><strong>Challenge types: </strong>15 different challenge types testing various aspects of entity resolution.</li></ul><p>The challenge types included in the dataset are:</p><ul><li><strong>Nicknames:</strong> &quot;Bob Smith&quot; → &quot;Robert Smith&quot; (seven articles).</li><li><strong>Titles and honorifics:</strong> &quot;Dr. 
Sarah Williams&quot; → &quot;Sarah Williams&quot; (five articles).</li><li><strong>Semantic context:</strong> &quot;Russian author&quot; → &quot;Leo Tolstoy&quot; (eight articles).</li><li><strong>Multilingual names:</strong> Handling names in different scripts (six articles).</li><li><strong>Business entities:</strong> Corporate name variations (seven articles).</li><li><strong>Executive references: </strong>&quot;Microsoft CEO&quot; → &quot;Satya Nadella&quot; (five articles).</li><li><strong>Political leaders:</strong> Title-based references (five articles).</li><li><strong>Initials:</strong> &quot;J. Smith&quot; → &quot;John Smith&quot; (three articles).</li><li><strong>Name order variations:</strong> Different name ordering conventions (three articles).</li><li><strong>Truncated names:</strong> Partial name matches (three articles).</li><li><strong>Name splitting:</strong> Names split across text (three articles).</li><li><strong>Missing spaces/hyphens:</strong> Formatting variations (two articles).</li><li><strong>Transliteration:</strong> Cross-script name matching (two articles).</li><li><strong>Combined challenges:</strong> Multiple challenges in one article (six articles).</li><li><strong>Complex business:</strong> Hierarchical business relationships (five articles).</li></ul><p>Let&#x27;s see how prompt-based entity resolution performed.</p><h3>Overall performance</h3><p>The results show that there&#x27;s a lot of promise with LLM-powered match evaluation, but they also reveal a significant reliability issue. Because each candidate pair must be evaluated by the LLM, failures in structured output can suppress acceptance and recall even when retrieval is working well.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Value</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Precision</td><td style="border: 1px solid #ccc; padding: 4px;">83.8%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">62.6%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">F1 score</td><td style="border: 1px solid #ccc; padding: 4px;">71.7%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total matches found</td><td style="border: 1px solid #ccc; padding: 4px;">344</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">LLM acceptance rate</td><td style="border: 1px solid #ccc; padding: 4px;">44.8%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">30.2%</td>
                    </tr>
              </tbody>
            </table>
<h3>The error rate problem</h3><p>Recall that the first step we take in the prototype is to create potential match pairs using Elasticsearch. Each of these potential matches needs to be evaluated by the LLM. To process all of those matches efficiently, we batch the LLM calls. This reduces API costs and latency, but it also increases the risk of malformed JSON in the output. As batch size increases, the JSON becomes longer and more complex, making it more likely that the LLM will generate invalid JSON. This is where the 30% error rate stems from. In the evaluation, we used a batch size of five matches per request. Even with this conservative batch size, we still see JSON parsing failures, which skew the evaluation results significantly.</p><h2>What&#x27;s next: Optimizing LLM integration</h2><p>Now that we&#x27;ve matched entities using semantic search and LLM judgment, we have a complete entity resolution pipeline. This approach, however, introduces a new failure mode: the model&#x27;s judgment is correct, but its output isn&#x27;t usable. We can optimize the LLM integration for better reliability and cost efficiency. In the next post, we&#x27;ll explore how to use function calling for structured output, which provides guaranteed structure and type safety while reducing errors and costs.</p><h2>Try it yourself</h2><p>Want to see entity matching in action? Check out the <a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks">Entity Matching notebook</a> (03_entity_matching_v3.ipynb) for a complete walk-through with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to match entities using three-step search, hybrid search with RRF, and LLM-powered judgment with reasoning.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach the concepts. When building production systems, consider additional factors, like model selection, cost optimization, latency requirements, quality validation, error handling, and monitoring, which aren&#x27;t covered in this learning-focused prototype.</p><h2>Notes</h2><ol><li>These datasets are synthetic and designed for education; they approximate real challenges but are not representative of any single production domain.</li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch vector search is up to 8x faster than OpenSearch]]></title>
            <description><![CDATA[Exploring filtered vector search benchmarks of OpenSearch vs. Elasticsearch and why vector search performance is critical for context-engineered systems.]]></description>
            <link>https://www.elastic.co/search-labs/blog/opensearch-vs-elasticsearch-filtered-vector-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/opensearch-vs-elasticsearch-filtered-vector-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sachin Frayne]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb7ee623925ca6696d643e437ce8efe5fe749079-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<h2>Why search speed matters for AI agents and context engineering</h2><p>Our benchmarks on a 20M document corpus show that Elasticsearch delivers up to 8x higher throughput than OpenSearch for filtered vector search, while also achieving higher Recall@100 across the configurations we tested. Context engineering depends on more than fast vector retrieval. Teams also need strong relevance controls (like hybrid search and filtering), operational simplicity, and predictable performance as workflows iterate. But because agents often run retrieve → reason → retrieve loops many times per request, retrieval latency becomes a multiplier, so improvements here translate directly into better end-to-end responsiveness and lower cost.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5a52a07773f0942c2baa732dacfe782aac0f415-1600x683.png" style="width: 100%;" /><p>For context engineering, retrieval isn’t a one-time step. Agents and applications repeatedly run loops, such as retrieve → reason → retrieve, to refine queries, verify facts, assemble grounded context, and complete tasks. This pattern is common in agentic workflows and iterative retrieval-augmented generation (RAG). Because retrieval may be invoked many times per user request, it adds delay to the response and/or increases infrastructure costs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/54ac476ff20a3cf93484298c9ae47612c12fc110-800x417.png" style="width: 100%;" /><h2>Why is vector search performance critical?</h2><p>Imagine a shopping assistant answering the question, “I need a carry-on backpack under $60 that fits a 15-inch laptop, is water resistant, and can arrive by Friday.”</p><p>In production, the assistant rarely issues one vector query and stops. It runs a retrieval loop to build the right context, and each step is typically constrained by filters, like availability, region, shipping promise, brand rules, and policy eligibility.</p><p><strong>Step 1: Interpret intent and translate to constraints.</strong></p><p>The agent turns the request into structured filters and a semantic query, such as:</p><ul><li>Filters: In stock, deliverable to the user’s postcode, delivery by Friday, price under $60, valid listing</li><li>Vector query: “Carry-on backpack 15-inch laptop water resistant”</li></ul><p><strong>Step 2: Retrieve candidates, and then refine.</strong></p><p>It often repeats retrieval with variations to avoid missing good matches:</p><ul><li>“travel backpack carry on laptop sleeve”</li><li>“water resistant commuter backpack 15 inch”</li><li>“lightweight cabin backpack”</li></ul><p>Each query uses the same eligibility filters, because retrieving irrelevant or unavailable items is wasted context.</p><p><strong>Step 3: Expand to confirm details and reduce risk.</strong></p><p>The agent then retrieves again to verify key attributes that affect the final answer:</p><ul><li>Material and water resistance wording</li><li>Dimensions and laptop compartment fit</li><li>Return policy or warranty constraints</li><li>Alternate options if inventory is low</li></ul><p>This is multistep context engineering: Retrieve, reason, retrieve, assemble.</p><h2>Why latency and recall matter for context engineering</h2><p>These interactions can involve dozens of filtered retrieval calls per user session.
That makes per-call latency a direct multiplier on end-to-end response time, and low recall forces extra retries or causes the agent to miss eligible items, degrading answer quality.</p><p>Takeaway: In context-engineered systems, filtered approximate nearest neighbors (ANN) isn’t a single lookup. It’s a repeated operation under constraints, so vector search performance shows up immediately in latency, throughput, and cost, even when the large language model (LLM) is the most visible component.</p><h2>Benchmarking</h2><h3>Results</h3><p>In Graph 2, each dot represents one test configuration. The best results appear toward the top left, meaning higher recall with lower latency. Elasticsearch’s results are consistently closer to the top left than OpenSearch’s, indicating better speed and accuracy under the same workload settings.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c50d1df00968cac18149a2799e6242fbe49b66a0-1600x990.png" style="width: 100%;" /><h4>Some key insights</h4><ul><li><code>s_n_r_value</code>: Shorthand for <code>size_numCandidates_rescoreOversample</code> (k is set equal to numCandidates in these tests); for example, <code>100_500_1</code> means size=100, k=500, numCandidates=500, and rescore oversample=1</li><li>Recall: Measured Recall@100 for that configuration</li><li>Avg latency (ms): Average end-to-end latency per query</li><li>Throughput: Queries per second</li><li>Recall %: Relative recall lift of Elasticsearch versus OpenSearch: (Elasticsearch minus OpenSearch) / OpenSearch</li><li>Latency Xs: OpenSearch average latency divided by Elasticsearch average latency</li><li>Throughput Xs: Elasticsearch throughput divided by OpenSearch throughput</li></ul>
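<p>To make the derived columns concrete, here is a small illustrative Python helper (not part of the published benchmark client) that computes them from one matched pair of rows:</p><pre><code># Python sketch: compute the derived comparison columns for one
# s_n_r_value configuration, given each engine's measured numbers.
def derived_columns(es, os_):
    # es / os_: dicts with 'recall', 'latency_ms', and 'qps'.
    return {
        # Relative recall lift: (Elasticsearch - OpenSearch) / OpenSearch
        "recall_pct": 100 * (es["recall"] - os_["recall"]) / os_["recall"],
        # Latency Xs: OpenSearch latency / Elasticsearch latency
        "latency_xs": os_["latency_ms"] / es["latency_ms"],
        # Throughput Xs: Elasticsearch QPS / OpenSearch QPS
        "throughput_xs": es["qps"] / os_["qps"],
    }

# Example with the 100_250_1 rows from the table below:
es_row = {"recall": 0.7704, "latency_ms": 25.0, "qps": 534.75}
os_row = {"recall": 0.7023, "latency_ms": 57.08, "qps": 279.58}
print(derived_columns(es_row, os_row))
# {'recall_pct': 9.70, 'latency_xs': 2.28, 'throughput_xs': 1.91} (rounded)</code></pre>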
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
<td style="border: 1px solid #ccc; padding: 4px;">Engine</td><td style="border: 1px solid #ccc; padding: 4px;"><code>s_n_r_value</code></td><td style="border: 1px solid #ccc; padding: 4px;">Recall</td><td style="border: 1px solid #ccc; padding: 4px;">Avg latency (ms)</td><td style="border: 1px solid #ccc; padding: 4px;">Throughput</td><td style="border: 1px solid #ccc; padding: 4px;">Recall %</td><td style="border: 1px solid #ccc; padding: 4px;">Latency Xs</td><td style="border: 1px solid #ccc; padding: 4px;">Throughput Xs</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_250_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.7704</td><td style="border: 1px solid #ccc; padding: 4px;">25</td><td style="border: 1px solid #ccc; padding: 4px;">534.75</td><td style="border: 1px solid #ccc; padding: 4px;">9.70%</td><td style="border: 1px solid #ccc; padding: 4px;">2.28</td><td style="border: 1px solid #ccc; padding: 4px;">1.91</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_250_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.7023</td><td style="border: 1px solid #ccc; padding: 4px;">57.08</td><td style="border: 1px solid #ccc; padding: 4px;">279.58</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8577</td><td style="border: 1px solid #ccc; padding: 4px;">25.42</td><td style="border: 1px solid #ccc; padding: 4px;">524.14</td><td style="border: 1px solid #ccc; padding: 4px;">7.20%</td><td style="border: 1px solid #ccc; padding: 4px;">2.4</td><td style="border: 1px solid #ccc; padding: 4px;">2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8001</td><td style="border: 1px solid #ccc; padding: 4px;">60.9</td><td style="border: 1px solid #ccc; padding: 4px;">262.12</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_750_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8947</td><td style="border: 1px solid #ccc; padding: 4px;">29.67</td><td style="border: 1px solid #ccc; padding: 4px;">528.09</td><td style="border: 1px solid #ccc; padding: 4px;">5.72%</td><td style="border: 1px solid #ccc; padding: 4px;">2.25</td><td style="border: 1px solid #ccc; padding: 4px;">2.21</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_750_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8463</td><td style="border: 1px solid #ccc; padding: 4px;">66.76</td><td style="border: 1px solid #ccc; padding: 4px;">239.11</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9156</td><td style="border: 1px solid #ccc; padding: 4px;">29.65</td><td style="border: 1px solid #ccc; padding: 4px;">534.5</td><td style="border: 1px solid #ccc; padding: 4px;">4.66%</td><td style="border: 1px solid #ccc; padding: 4px;">2.46</td><td style="border: 1px solid #ccc; padding: 4px;">2.44</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.8748</td><td style="border: 1px solid #ccc; padding: 4px;">72.88</td><td style="border: 1px solid #ccc; padding: 4px;">219.01</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9386</td><td style="border: 1px solid #ccc; padding: 4px;">31.84</td><td style="border: 1px solid #ccc; padding: 4px;">497.3</td><td style="border: 1px solid #ccc; padding: 4px;">3.38%</td><td style="border: 1px solid #ccc; padding: 4px;">2.71</td><td style="border: 1px solid #ccc; padding: 4px;">2.68</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_1500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9079</td><td style="border: 1px solid #ccc; padding: 4px;">86.16</td><td style="border: 1px solid #ccc; padding: 4px;">185.4</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9507</td><td style="border: 1px solid #ccc; padding: 4px;">34.69</td><td style="border: 1px solid #ccc; padding: 4px;">457.2</td><td style="border: 1px solid #ccc; padding: 4px;">2.57%</td><td style="border: 1px solid #ccc; padding: 4px;">2.98</td><td style="border: 1px solid #ccc; padding: 4px;">2.96</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9269</td><td style="border: 1px solid #ccc; padding: 4px;">103.36</td><td style="border: 1px solid #ccc; padding: 4px;">154.55</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9582</td><td style="border: 1px solid #ccc; padding: 4px;">37.9</td><td style="border: 1px solid #ccc; padding: 4px;">418.43</td><td style="border: 1px solid #ccc; padding: 4px;">1.99%</td><td style="border: 1px solid #ccc; padding: 4px;">3.28</td><td style="border: 1px solid #ccc; padding: 4px;">3.26</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_2500_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9395</td><td style="border: 1px solid #ccc; padding: 4px;">124.29</td><td style="border: 1px solid #ccc; padding: 4px;">128.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_3000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9636</td><td style="border: 1px solid #ccc; padding: 4px;">41.86</td><td style="border: 1px solid #ccc; padding: 4px;">379.4</td><td style="border: 1px solid #ccc; padding: 4px;">1.62%</td><td style="border: 1px solid #ccc; padding: 4px;">3.46</td><td style="border: 1px solid #ccc; padding: 4px;">3.44</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_3000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9482</td><td style="border: 1px solid #ccc; padding: 4px;">144.67</td><td style="border: 1px solid #ccc; padding: 4px;">110.34</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_4000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9705</td><td style="border: 1px solid #ccc; padding: 4px;">50.28</td><td style="border: 1px solid #ccc; padding: 4px;">316.21</td><td style="border: 1px solid #ccc; padding: 4px;">1.06%</td><td style="border: 1px solid #ccc; padding: 4px;">3.87</td><td style="border: 1px solid #ccc; padding: 4px;">3.85</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_4000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9603</td><td style="border: 1px solid #ccc; padding: 4px;">194.36</td><td style="border: 1px solid #ccc; padding: 4px;">82.22</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_5000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9749</td><td style="border: 1px solid #ccc; padding: 4px;">58.77</td><td style="border: 1px solid #ccc; padding: 4px;">270.91</td><td style="border: 1px solid #ccc; padding: 4px;">0.73%</td><td style="border: 1px solid #ccc; padding: 4px;">4.43</td><td style="border: 1px solid #ccc; padding: 4px;">4.41</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_5000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9678</td><td style="border: 1px solid #ccc; padding: 4px;">260.33</td><td style="border: 1px solid #ccc; padding: 4px;">61.38</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_6000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9781</td><td style="border: 1px solid #ccc; padding: 4px;">66.75</td><td style="border: 1px solid #ccc; padding: 4px;">238.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.52%</td><td style="border: 1px solid #ccc; padding: 4px;">4.91</td><td style="border: 1px solid #ccc; padding: 4px;">4.89</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_6000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.973</td><td style="border: 1px solid #ccc; padding: 4px;">327.44</td><td style="border: 1px solid #ccc; padding: 4px;">48.81</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_7000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9804</td><td style="border: 1px solid #ccc; padding: 4px;">74.64</td><td style="border: 1px solid #ccc; padding: 4px;">213.49</td><td style="border: 1px solid #ccc; padding: 4px;">0.38%</td><td style="border: 1px solid #ccc; padding: 4px;">5.28</td><td style="border: 1px solid #ccc; padding: 4px;">5.27</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_7000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9767</td><td style="border: 1px solid #ccc; padding: 4px;">394.24</td><td style="border: 1px solid #ccc; padding: 4px;">40.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_8000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9823</td><td style="border: 1px solid #ccc; padding: 4px;">82.28</td><td style="border: 1px solid #ccc; padding: 4px;">193.59</td><td style="border: 1px solid #ccc; padding: 4px;">0.27%</td><td style="border: 1px solid #ccc; padding: 4px;">6.86</td><td style="border: 1px solid #ccc; padding: 4px;">6.83</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_8000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9797</td><td style="border: 1px solid #ccc; padding: 4px;">564.14</td><td style="border: 1px solid #ccc; padding: 4px;">28.33</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_9000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9837</td><td style="border: 1px solid #ccc; padding: 4px;">90.08</td><td style="border: 1px solid #ccc; padding: 4px;">176.96</td><td style="border: 1px solid #ccc; padding: 4px;">0.16%</td><td style="border: 1px solid #ccc; padding: 4px;">7.63</td><td style="border: 1px solid #ccc; padding: 4px;">7.61</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_9000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9821</td><td style="border: 1px solid #ccc; padding: 4px;">687.25</td><td style="border: 1px solid #ccc; padding: 4px;">23.25</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_10000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.9848</td><td style="border: 1px solid #ccc; padding: 4px;">97.64</td><td style="border: 1px solid #ccc; padding: 4px;">163.31</td><td style="border: 1px solid #ccc; padding: 4px;">0.08%</td><td style="border: 1px solid #ccc; padding: 4px;">8.38</td><td style="border: 1px solid #ccc; padding: 4px;">8.36</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">OpenSearch</td><td style="border: 1px solid #ccc; padding: 4px;">100_10000_1</td><td style="border: 1px solid #ccc; padding: 4px;">0.984</td><td style="border: 1px solid #ccc; padding: 4px;">818.64</td><td style="border: 1px solid #ccc; padding: 4px;">19.53</td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;"></td>
                    </tr>
              </tbody>
            </table>
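          <p>Because an agent repeats retrieval, per-query latency gaps compound. As a rough illustration (simple arithmetic, not benchmark code), the extra waiting time a slower engine adds to a sequential multi-step loop is:</p><pre><code># Python sketch: extra wall-clock time added by a slower engine across
# a loop of sequential retrieval calls.
def added_wait_seconds(slow_ms, fast_ms, steps=10):
    return steps * (slow_ms - fast_ms) / 1000.0

# Using the 100_2000_1 row above (OpenSearch 103.36 ms vs. Elasticsearch 34.69 ms):
print(added_wait_seconds(103.36, 34.69))  # ≈ 0.69 seconds per 10-step loop</code></pre>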
<p>For example, at <code>100_9000_1</code>, OpenSearch averages 687 milliseconds per retrieval versus 90 milliseconds on Elasticsearch; in a 10-step retrieval loop, that’s about 10 x (687 - 90) ms ≈ 6 seconds of additional waiting time.</p><p>See the <a href="https://github.com/elastic/competitive-benchmarking-studies/tree/main/es-9.3-vs-os-3.5-vector-search/jingra/results/20260220">full results</a>.</p><h3>Methodology</h3><p>We used Python to send the following queries to both engines and to track response timing and other statistics. Bear in mind that the performance of any vector search engine depends on how you tune its core parameters: how many candidates to consider, how aggressively to rescore, and how much context to return. These settings directly affect both recall (the likelihood of finding the right answer) and latency (how fast you get results).</p><p>In our benchmarks, we used the same candidate, rescore, and result-size settings you’d typically tune in an agentic retrieval loop, and we measured how Elasticsearch performs under that workload. We then ran OpenSearch with the same settings as a reference.</p><p>OpenSearch</p><pre><code>GET <INDEX_NAME>/_search
{
  "query": {
    "knn": {
      "<DENSE_VECTOR_FIELD_NAME>": {
        "vector": [...],
        "k": <NUMBER_OF_CANDIDATES>,
        "method_parameters": {
          "ef_search": <NUMBER_OF_CANDIDATES>
        },
        "rescore": {
          "oversample_factor": <OVERSAMPLE>
        },
        "filter": {
          <SOME_FILTER>
        }
      }
    }
  },
  "size": <RESULT_SIZE>,
  "_source": {
    "excludes": [
      "<DENSE_VECTOR_FIELD_NAME>"
    ]
  }
}</code></pre><ul><li><code>&quot;size&quot;: &lt;RESULT_SIZE&gt;</code>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100.</li><li><code>&quot;k&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbor candidates to return.</li><li><code>&quot;ef_search&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of vectors to examine during the search.</li><li><code>&quot;oversample_factor&quot;: &lt;OVERSAMPLE&gt;</code>: Multiplier that determines how many candidate vectors are retrieved before rescoring.</li></ul><p>Elasticsearch</p><pre><code>GET <INDEX_NAME>/_search
{
  "query": {
    "knn": {
      "field": "<DENSE_VECTOR_FIELD_NAME>",
      "query_vector": [...],
      "k": <NUMBER_OF_CANDIDATES>,
      "num_candidates": <NUMBER_OF_CANDIDATES>,
      "rescore_vector": {
        "oversample": <OVERSAMPLE>
      },
      "filter": {
        <SOME_FILTER>
      }
    }
  },
  "size": <RESULT_SIZE>,
  "_source": {
    "excludes": [
      "<DENSE_VECTOR_FIELD_NAME>"
    ]
  }
}</code></pre><ul><li><code>&quot;size&quot;: &lt;RESULT_SIZE&gt;</code>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100.</li><li><code>&quot;k&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbors to return from each shard.</li><li><code>&quot;num_candidates&quot;: &lt;NUMBER_OF_CANDIDATES&gt;</code>: Number of nearest neighbor candidates to consider per shard while doing <code>knn</code> search.</li><li><code>&quot;oversample&quot;: &lt;OVERSAMPLE&gt;</code>: Multiplier that determines how many candidate vectors are retrieved before rescoring.</li></ul><p>Example</p><p>The <code>knn</code> query for configuration <code>100_500_1</code> would be as follows:</p><p>OpenSearch</p><pre><code>GET search_catalog_128/_search
{
  "query": {
    "knn": {
      "search_catalog_embedding": {
        "vector": [...],
        "k": 500,
        "method_parameters": {
          "ef_search": 500
        },
        "rescore": {
          "oversample_factor": 1
        },
        "filter": {
          "term": {
            "valid": true
          }
        }
      }
    }
  },
  "size": 100,
  "_source": {
    "excludes": [
      "search_catalog_embedding"
    ]
  }
}</code></pre><p>Elasticsearch</p><pre><code>GET search_catalog_128/_search
{
  "query": {
    "knn": {
      "field": "search_catalog_embedding",
      "query_vector": [...],
      "k": 500,
      "num_candidates": 500,
      "rescore_vector": {
        "oversample": 1
      },
      "filter": {
        "term": {
          "valid": true
        }
      }
    }
  },
  "size": 100,
  "_source": {
    "excludes": [
      "search_catalog_embedding"
    ]
  }
}</code></pre><p>The full configuration, alongside Terraform scripts, Kubernetes manifests, and the benchmarking code, is available in this <a href="https://github.com/elastic/competitive-benchmarking-studies">repository</a> in the folder <a href="https://github.com/elastic/competitive-benchmarking-studies/tree/main/es-9.3-vs-os-3.5-vector-search">es-9.3-vs-os-3.5-vector-search</a>.</p><h3>Cluster setup</h3><p>We ran our tests on six e2-standard-16 cloud servers, each with 16 vCPUs and 64 GB RAM. On each server, we allocated 15 vCPUs and 56 GB RAM to the Kubernetes pod running the search engine node, with 28 GB reserved for the JVM heap.</p><p>The clusters ran Elasticsearch 9.3.0 and OpenSearch 3.5.0 (Lucene 10.3.2). Because both systems use the same Lucene version in this benchmark, the throughput and latency differences we observe cannot be attributed to Lucene alone and instead reflect differences in how each engine integrates and executes filtered k-nearest neighbor (kNN) retrieval and rescoring. We used a single index with three primary shards and one replica (so 6 shards total, 1 per node).</p><p>We also used a separate server in the same region to run the benchmark client and collect timing statistics.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f81002c9186e4c2d3e92f49d72418fee9860fc5e-761x401.png" style="width: 100%;" /><h3>The dataset</h3><p>For this benchmark, we used a large-scale ecommerce-style catalog embedding dataset with 20 million documents, designed to reflect real-world filtered vector retrieval at scale.</p><p>Each document represents a catalog item and includes:</p><ul><li>A 128-dimensional dense vector embedding used for approximate kNN retrieval.</li><li>Structured metadata fields used for filtering (for example, item validity and availability, plus other catalog constraints), enabling the common production pattern of retrieving the nearest neighbors but only within an eligible subset.</li></ul><p>We chose this dataset because it captures the core performance challenge we see in agentic and RAG-style systems in production: Vector similarity alone is not enough; retrieval is frequently constrained by filters, and the system must maintain high recall while keeping latency low under those constraints. Compared to smaller QA-style datasets, a 20M document corpus also better reflects the scale and candidate pressure that filtered ANN systems face in practice.</p><h2>Conclusion</h2><p>In modern AI architectures, especially those built around context engineering, vector search speed isn’t a minor implementation detail. It’s a multiplier. When agents and workflows iterate through retrieve → reason → retrieve, retrieval performance directly shapes end-to-end latency, throughput, and the quality of the context fed into the model.</p><p>In our benchmarks, Elasticsearch consistently delivered higher recall at lower latency than OpenSearch in scenarios where correctness depends on retrieving the right document, not just a similar vector.
On a controlled dataset, the difference is clear, and in production those gains accumulate across large volumes of retrieval calls, improving responsiveness, increasing capacity headroom, and reducing infrastructure costs.</p><h3>Further reading</h3><ol><li><a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">What is context engineering?</a></li><li><a href="https://www.elastic.co/search-labs/blog/series/context-engineering-hybrid-search-evolution">The evolution of hybrid search and context engineering</a></li><li><a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">The impact of relevance in context engineering for AI agents</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elastic AutoOps now free for all: What you get with it]]></title>
            <description><![CDATA[Elastic AutoOps is now free for all self-managed clusters. Learn what you get with it and how it compares to Stack Monitoring.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Valentin Crettaz, Arnon Stern, Ori Shafir]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2067a9457334a1b4bc14056fad34a304f39d1fab-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<h2><strong>Same AutoOps, regardless of your license</strong></h2><p>We’re making this simple: Starting today, AutoOps is available at no cost for every self-managed Elasticsearch cluster through <a href="https://www.elastic.co/blog/elasticsearch-autoops-on-prem">Elastic Cloud Connect</a>. Whether you’re on Free, Basic, Platinum, or Enterprise, you get the same fully featured product. This isn’t a limited preview or a &quot;lite&quot; version. It’s the same product used by the largest deployments.</p><h2><strong>An investment in the self-managed community</strong></h2><p>Extending AutoOps to every user, including those on the free distribution, reflects Elastic&#x27;s commitment to the success of the entire Elasticsearch community. By providing AutoOps for free, we’re investing in the stability and performance of the hundreds of thousands of clusters that power the community&#x27;s search and analytics workloads.</p><p>Elastic Cloud Connect enables self-managed clusters to consume Elastic Cloud services, such as AutoOps and the <a href="https://www.elastic.co/search-labs/de/blog/cloud-connect-elastic-inference-service">recently announced</a> Elastic Inference Service (EIS), without the operational overhead of maintaining, patching, monitoring, and operating the services locally.</p><h2><strong>How AutoOps for self-managed works and what it provides</strong></h2><p>As clusters grow in complexity and size, you find yourself spending more time chasing configuration tweaks and trying to find the root cause of an issue. Monitoring tools show you metrics and leave it to you (and your favorite large language model (LLM)) to manually correlate them and find the root cause when an issue appears. AutoOps tells you what’s wrong, why, and exactly how to fix it, with real-time issue detection and specific resolution paths.</p><p>AutoOps runs on Elastic Cloud; there’s no infrastructure for you to provision or maintain. You simply run a lightweight agent on-premises to connect your cluster to the AutoOps service; operational metadata (such as node stats, cluster settings, and shard states) is shipped in real time to provide insights and recommendations. Your data never leaves your environment.</p><h2><strong>AutoOps vs. Stack Monitoring for self-managed users</strong></h2><p>Stack Monitoring provides the essential telemetry and basic monitoring for your nodes and indices, showing you trends over time for various metrics and alerting you when thresholds are crossed, yet it often leaves the diagnostic burden on the engineer. AutoOps offers a complete picture of cluster health by correlating all relevant metrics. This provides valuable insights and clear instructions on how to resolve issues when they occur.</p><p><strong>Faster root cause analysis</strong></p><p>Your cluster was humming along, but it suddenly turned red in the middle of the night (and, as usual, nothing had changed in the days before).</p><ul><li><strong>With Stack Monitoring:</strong> A built-in alert will notify you when your cluster health turns red. To find the cause, you need to dig into your cluster logs and turn to Dev Tools to look at your shard allocations to understand why that primary shard couldn’t be allocated. Looking at your alert history, you see another alert informing you that your hot nodes reached 80% disk utilization two days ago.
You can’t find any disk usage charts to learn more about your disk fill rate; you only know that your disk reached 90% utilization in the meantime, and when one of your data streams needed to roll over, a new backing index was created, but no shards could be allocated to any of your nodes.</li><li><strong>With AutoOps:</strong> The system notifies you when your cluster turns red (1). Looking at the timeline, you immediately realize that this happened because you failed to take action on the previous watermark events that AutoOps raised, namely the high watermark events (2) that started happening recently, and the low watermark ones (3) that built up over the past few days. It’s now straightforward to know what to do to get your cluster back to green.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/445dac65e7fb0f061a4bff87d4a1a2cb4ee39140-1600x583.png" style="width: 100%;" /><p><strong>Higher signal-to-noise ratio</strong></p><p>Keeping your Elasticsearch cluster healthy is most probably your main concern. Yet, it’s not uncommon for the health status to sometimes flap between green and yellow (and sometimes red), and the cause is not always worthy of your time.</p><ul><li><strong>With Stack Monitoring:</strong> The built-in “Cluster Health” alert will continue to be raised on each health transition from green to either yellow or red. In some situations, like frequent index creations, this can create a lot of repeated and undesired noise. Also, and more importantly, there’s no distinction between a yellow and a red status.</li><li><strong>With AutoOps:</strong> There are dedicated “Status Red” and “Status Yellow” events, with different severities. The latter can be customized in many different ways to fit your use case, as shown in the screenshot below:<ol><li>Since the cluster health can turn yellow for only a brief duration, you can decide how long to ignore the yellow status before being notified (for example, five minutes in the screenshot below).</li><li>Furthermore, there are a lot of legitimate operations that Elasticsearch does all the time and that make the cluster turn yellow. You can pick any of the operations you don’t want to be notified about when your cluster turns yellow because of them (for example, adding replicas, relocating replicas, or closing or opening an index, among others).</li><li>Finally, and most importantly, if you have several clusters to manage, you don’t need to configure this for all of them separately; you simply decide to which ones this configuration should be applied. Simple, powerful!</li></ol></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e2b2826b4c674e4a5924d1f8a9b855606c735fa9-1542x1518.png" style="width: 100%;" /><p><strong>More insightful correlations and comparisons of node metrics</strong></p><p>When your cluster runs on more than just a handful of nodes, you often need to see how they perform against each other, especially when looking at search and indexing performance.</p><ul><li><strong>With Stack Monitoring:</strong> The Nodes list doesn’t allow you to focus on a specific data tier and doesn’t provide any search or indexing performance metrics that you can sort on. These metrics are available, but only once you drill into a specific node, which doesn’t let you easily compare nodes against each other.</li><li><strong>With AutoOps:</strong> The Nodes view allows you to select nodes from a specific data tier.
It also provides you with over 50 metric visualizations, among them search and indexing performance, which give you exactly the visual cues you need to understand how each node performs relative to the others and whether there are any struggling nodes that require your attention. In the screenshot below, we can see that some nodes are indexing at double the rate of some others and that search latency on some nodes is building up to four times that of the fastest ones.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1eec2db4931ca2992579a5501dd939fc8a552709-1600x1377.png" style="width: 100%;" /><p><strong>Quick overview of the main differences</strong></p><p>Here’s a glimpse at the notable differences between AutoOps and Stack Monitoring, but you can find a more detailed breakdown in our <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops-vs-stack-monitoring#compare-capabilities">official documentation</a>.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Capability</td><td style="border: 1px solid #ccc; padding: 4px;">Stack Monitoring</td><td style="border: 1px solid #ccc; padding: 4px;">AutoOps</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cluster, node, and index metrics</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Real-time dashboards</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Multi-cluster overview</td><td style="border: 1px solid #ccc; padding: 4px;">No</td><td style="border: 1px solid #ccc; padding: 4px;">Yes</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Root cause analysis</td><td style="border: 1px solid #ccc; padding: 4px;">No</td><td style="border: 1px solid #ccc; padding: 4px;">Automated analysis</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Fix recommendations</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">In-context Elasticsearch commands</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Performance tuning insights</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">Yes, based on usage patterns</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Alerts and notifications</td><td style="border: 1px solid #ccc; padding: 4px;">14 built-in alerts and 27 connectors</td><td style="border: 1px solid #ccc; padding: 4px;">100+ customizable alerts and 7 connectors</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Template and mapping analysis</td><td style="border: 1px solid #ccc; padding: 4px;">None</td><td style="border: 1px solid #ccc; padding: 4px;">Detects mapping misconfigurations</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Setup infrastructure</td><td style="border: 1px solid #ccc; padding: 4px;">Dedicated monitoring cluster</td><td style="border: 1px solid #ccc; padding: 4px;">Five-minute installation. No extra infrastructure and no cost. Only lightweight agent installation is needed.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Other stack components</td><td style="border: 1px solid #ccc; padding: 4px;">Kibana, Logstash, Elastic APM</td><td style="border: 1px solid #ccc; padding: 4px;">Coming soon</td>
                    </tr>
              </tbody>
            </table>
<h2><strong>Start now: Five-minute installation</strong></h2><p>Connecting your cluster takes minutes, regardless of your license type:</p><ol><li>Log in to your free Elastic Cloud account, or <a href="https://cloud.elastic.co/registration?onboarding_service_type=ccm-autoops">sign up</a> for one.</li><li>Choose how to connect your cluster: Elastic Cloud on Kubernetes (ECK), Kubernetes, Docker, or Linux.</li><li>Enter your Elasticsearch cluster endpoint, and run the single command to install and run the lightweight Elastic agent.</li><li>Access AutoOps in your <a href="https://cloud.elastic.co/home">Elastic Cloud account</a>.</li></ol><p>For more details on AutoOps and instructions on connecting your self-managed cluster, read our <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-autoops-as-cloud-connected">product documentation</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bf040d977b515ed6448ae780ede31beff701a4f6-1600x861.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7a5a01c8ec70dd467870aa06ac037a51bf8c86dc-1600x898.png" style="width: 100%;" /><h2><strong>Reach out if you have any questions</strong></h2><p>Feel free to reach out to us to share your questions and ideas via our <a href="https://elasticstack.slack.com/archives/C0A80TKS71D">Slack community</a>, by posting on our <a href="https://discuss.elastic.co/c/elastic-stack/monitoring/103">Discuss forum</a>, or by clicking the “Give Feedback” button on the AutoOps product page. If you’re connecting a paid self-managed Platinum or Enterprise cluster, you can contact support within your Elastic Cloud account.</p><h2><strong>Read more</strong></h2><p>If you’re interested in learning more about AutoOps and what it can do for you, please head to the <a href="https://www.elastic.co/docs/deploy-manage/monitor/autoops/cc-autoops-as-cloud-connected">official AutoOps documentation</a> and the following Elastic Search Labs articles:</p><ul><li><a href="https://www.elastic.co/search-labs/blog/autoops-elasticsearch-easy-operations">AutoOps makes every Elasticsearch deployment simple(r) to manage</a></li><li><a href="https://www.elastic.co/search-labs/blog/elastic-autoops-self-managed-elasticsearch">AutoOps: A journey to simplify self-managed Elasticsearch management</a></li><li><a href="https://www.elastic.co/search-labs/blog/slow-search-elasticsearch-query-autoops">Leveraging AutoOps to detect long-running search queries</a></li></ul><p><a href="https://cloud.elastic.co/registration?onboarding_service_type=ccm-autoops"><strong>Start using AutoOps for free</strong></a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build a "chat with your website data" agent with Jina Embeddings v5 and Elasticsearch]]></title>
            <description><![CDATA[Learn how to build a working, grounded "chat with your website" experience in under an hour by using Elasticsearch Serverless, Jina Embeddings v5, Elastic Open Web Crawler, and Elastic Agent Builder.]]></description>
            <link>https://www.elastic.co/search-labs/blog/chat-website-data-elasticsearch-jina-ai</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/chat-website-data-elasticsearch-jina-ai</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Ugo Sangiorgi, Lily Adler]]></dc:creator>
            <pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6e3ea84419f1edc60b5e2bfcb7e06693810d03a0-512x268.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Build a “chat with your website” experience in under an hour using <strong>Elasticsearch Serverless</strong>, <strong>Jina Embeddings v5</strong>, <strong>Elastic Open Web Crawler</strong>, and <strong>Elastic Agent Builder</strong>.</p><p>By the end, you’ll have a working agent that can search your crawled pages, cite relevant passages, and answer questions grounded in your content, no custom chunking or embedding pipeline required.</p><p>In this guide, you’ll:</p><ol><li>Start an <strong>Elasticsearch Serverless</strong> project.</li><li>Create an index using the new <code>semantic_text</code> field powered by <strong>Jina Embeddings v5</strong>.</li><li>Crawl any website using <strong><a href="https://github.com/ugosan/elastic-crawler-control">Elastic Crawler Control (a.k.a. Crawly)</a></strong> (an open source UI + API wrapper around the <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html">Elastic Open Web Crawler</a>).</li><li>Chat with that data using the <strong>Elastic Agent Builder</strong> in Kibana.</li></ol><p><strong>What you’ll walk away with:</strong></p><ul><li>A repeatable pattern you can point at <em>any</em> website/docs source.</li><li>Chat that stays grounded in your content.</li></ul><h2>Prerequisites</h2><ul><li>An <strong>Elasticsearch Serverless (Search)</strong> project + an <strong>API key</strong> with write permissions.</li><li><strong>Docker + Docker Compose</strong> (to run the crawler UI).</li><li><code>git</code> (to clone the repo).</li></ul><h2>1. Start an Elasticsearch Serverless project</h2><p>First, we need a serverless project to host our data.</p><p>1. Log in to your <a href="https://cloud.elastic.co">Elastic Cloud Console</a>.</p><p>2. Click <strong>Create project</strong>.</p><p>3. Select <strong>Search</strong> as the project type. (This type is optimized for vector search and retrieval.)</p><p>4. Give it a name (for example, <code>es-labs-jina-guide</code>), and click <strong>Create</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6a138b12f85727872cd8105d954b5ae95f9af428-1600x1135.png" style="width: 100%;" /><p>5. <strong>Important:</strong> Save the <strong>Elasticsearch endpoint</strong> and <strong>API Key</strong> provided when the project is created. You’ll need these for the crawler.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7c43c7f91d141658dbaff852f452d64853d7ddb8-1536x298.png" style="width: 100%;" /><h2>2. Create the index</h2><p>Elasticsearch Serverless supports <code>semantic_text</code>, which handles chunking and embedding generation automatically. We’ll use the <code>.jina-embeddings-v5-text-small</code> model that’s hosted on GPUs on the <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service</a>.</p><p>Create the index with the <code>semantic_text</code> fields. This tells Elasticsearch to automatically vectorize content indexed into those fields using the EIS-hosted endpoint referenced by each field’s <code>inference_id</code>.</p><p>In <strong>Kibana Dev Tools</strong>, run:</p><pre><code>PUT furnirem-website
{
  "mappings": {
    "_meta": {
      "description": "Each document represents a web page with the following schema: 'title' and 'meta_description' provide high-level summaries; 'body' contains the full text content; 'headings' preserves the page hierarchy for semantic weighting. URL metadata is decomposed into 'url_host', 'url_path', and 'url_path_dir1/2/3' to allow for granular filtering by site section (e.g., 'blog' or 'tutorials'). 'links' contains extracted outbound URLs for discovery. Crawl timestamp: 2026-01-26T12:54:16.347907."
    },
    "properties": {
      "body_content": { 
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small" 
          }
        }
      },
      "headings": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "semantic_multilingual": {
            "type": "semantic_text",
            "inference_id": ".jina-embeddings-v5-text-small"
          }
        }
      }
    }
  }
}</code></pre><h2>3. Run the Elastic Open Crawler</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6d5796c8c3752d25f0c64f905570812f16332f1a-640x398.gif" style="width: 100%;" /><p><strong>Crawly</strong> is one example of an application built around the functionality that the Open Web Crawler provides.</p><p>The application wraps the Elastic Open Crawler in a FastAPI service that manages crawler processes and persists execution data. A React front end provides the interface for configuring and monitoring crawls.</p><p>Under the hood, the crawler service (see <a href="https://github.com/ugosan/elastic-crawler-control/blob/main/crawler-service/app/crawler.py">crawler.py</a>) spawns JRuby processes via <code>subprocess.Popen</code>, allowing multiple concurrent crawls. Each execution&#x27;s configuration, status, and logs are persisted to disk (for now).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c20b5055d57f6fe2db4eada6c737d95b19a721f4-1250x960.png" style="width: 100%;" /><p>Clone the repository:</p><pre><code>git clone https://github.com/ugosan/elastic-crawler-control</code></pre><p>Create an <strong>env.local</strong> file with your Elasticsearch credentials:</p><pre><code>ES_URL=https://your-elasticsearch-endpoint.es.cloud
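# Values are placeholders: use the endpoint and the write-permission API key saved in step 1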
ES_API_KEY=your_api_key_here</code></pre><p>Start the services:</p><pre><code>docker-compose up</code></pre><p>Access the UI at <strong>http://localhost:16700</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3858d3764fc7fd5ac217545dfba94c83a0096fb2-1600x820.png" style="width: 100%;" /><p>You don’t necessarily need <code>seed_urls </code>unless you want to be specific, so your config can be as simple as below:</p><pre><code>{
  "domains": [
    {
      "url": "https://furnirem.com"
    }
  ],
  "max_crawl_depth": 3,
  "max_unique_url_count": 500,
  "output_index": "furnirem-website"
}</code></pre><p>From there, you can start a crawl on any website and check its progress:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f98187d216397973ccc786226466ff8b931118f0-1600x443.png" style="width: 100%;" /><p>Once it&#x27;s finished, we’re ready to query the content in Elasticsearch directly or use the pages you just crawled for chatting with the website in Agent Builder.</p><h2>4. Chat with data in Kibana</h2><p>Now that the data is indexed and vectorized, we can start chatting with it using the <strong>Elastic Agent Builder</strong>.</p><ol><li>Open <strong>Kibana</strong>, and navigate to <strong>Agents</strong> (under the &quot;Search&quot; section).</li><li><strong>Test the agent</strong>:<ul><li>In the chat window, ask a question, like <em>&quot;What is the difference between sparse and dense vectors?&quot;</em></li></ul></li></ol><p>The agent will search your Jina-embedded data, retrieve the most relevant snippets from the pages you crawled, and generate an answer.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eebb72090181c5b18dadcc3751f3084857fe1dac-1600x1363.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e84a862c6051fa7eda507d4470678ec5dd78eb1-1560x1174.png" style="width: 100%;" /><p>You can also chat with the data directly via the Kibana API:</p><pre><code>POST kbn://api/agent_builder/converse/async
{
  "input": "What is the difference between sparse and dense vectors?",
  "agent_id": "elastic-ai-agent",
  "conversation_id": "<CONVERSATION_ID>"
}</code></pre><p>Use <code>conversation_id</code> to resume an existing conversation with an agent in Elastic Agent Builder. If you don’t provide it on the initial request, the API starts a new conversation and returns a newly generated ID in the streaming response.</p><h2>Summary</h2><p>You now have a working “<strong>chat with your website</strong>” stack: Your site gets crawled, indexed, auto-embedded with <code>semantic_text</code> + Jina v5, and surfaced through an agent in Kibana that answers questions grounded in your pages.</p><p>From here, you can point the same setup at docs, support content, or internal wikis and iterate on relevance in minutes.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-embeddings-v5-text: Compact state-of-the-art text embeddings for search and intelligent applications]]></title>
            <description><![CDATA[Introducing jina-embeddings-v5-text models, including jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, and explaining how to use these multilingual embedding models via Elastic Inference Service (EIS).]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens, Sofia Vasileva]]></dc:creator>
            <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e98dd30ab925b4bb32830228d71a1a51d02a0917-1600x840.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Jina AI and Elastic are releasing <code>jina-embeddings-v5-text</code>, a family of new, high-performance, compact text embedding models with state-of-the-art performance for models of comparable size across all major task types.</p><p>The family includes two models:</p><ul><li><code>jina-embeddings-v5-text-small</code></li><li><code>jina-embeddings-v5-text-nano</code></li></ul><p>These models are the successful result of an innovative new training recipe for embedding models. They both outperform models many times their size, creating savings in memory and computing resources and responding faster to requests.</p><p>The <code>jina-embeddings-v5-text-small</code> model has 677M parameters, supports a 32768-token input context window, and produces 1024-dimension embeddings by default.</p><p><code>jina-embeddings-v5-text-nano</code> weighs in at roughly a third of its sibling&#x27;s size, with 239M parameters and an 8192-token input context window, yielding slender 768-dimension embeddings.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model name</td><td style="border: 1px solid #ccc; padding: 4px;">Total size</td><td style="border: 1px solid #ccc; padding: 4px;">Input context window size</td><td style="border: 1px solid #ccc; padding: 4px;">Embedding size</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-embeddings-v5-text-small</td><td style="border: 1px solid #ccc; padding: 4px;">677M params</td><td style="border: 1px solid #ccc; padding: 4px;">32768 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">1024 dims</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-embeddings-v5-text-nano</td><td style="border: 1px solid #ccc; padding: 4px;">239M params</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">768 dims</td>
                    </tr>
              </tbody>
            </table>
          <p>These two models are the best in class for overall MMTEB (<a href="https://huggingface.co/spaces/mteb/leaderboard">Multilingual MTEB</a>) benchmark performance. Among models with under 500M parameters, <code>jina-embeddings-v5-text-nano</code> is the top performer, despite having less than 250M parameters, and the <code>jina-embeddings-v5-text-small</code> model is the leader among multilingual embedding models with under 750M parameters.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d97126285fa196c3045b1913b7754e4b186c1e4c-1300x1100.png" style="width: 100%;" /><p>These models are available via Elastic Inference Service (EIS), via an online API, and for local hosting. For instructions on how to access <code>jina-embeddings-v5-text</code> models, see the “<strong>Getting started</strong>” section, below.</p><p>Embedding models and semantic indexing dramatically increase the accuracy of search algorithms but also have a variety of other uses for tasks involving semantic similarity and meaning extraction, for example:</p><ul><li>Finding duplicate texts.</li><li>Recognizing paraphrases and translations.</li><li>Topic discovery.</li><li>Recommendation engines.</li><li>Sentiment and intention analysis.</li><li>Spam filtering.</li><li>And many others.</li></ul><h2><strong>Features</strong></h2><p>This new model family has a number of features designed to improve relevance and reduce costs.</p><h3>Task optimization</h3><p>We’ve optimized the <code>jina-embeddings-v5-text</code> models for four broad task types:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Example use cases</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Searching with natural language queries and retrieving the most relevant matches in a collection of documents.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Text matching</td><td style="border: 1px solid #ccc; padding: 4px;">Semantic similarity, deduplication, paraphrase and translation alignment, and more.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Clustering</td><td style="border: 1px solid #ccc; padding: 4px;">Topic discovery, automatic organization of document collections.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Classification</td><td style="border: 1px solid #ccc; padding: 4px;">Document categorization, sentiment and intent detection, similar tasks.</td>
                    </tr>
              </tbody>
            </table>
          <p>Optimizing for one task usually means having to compromise on another, so most embedding models only have competitive performance for one kind of task. But <code>jina-embeddings-v5-text</code> models are able to specialize in all four areas without compromising by training task-specific <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (LoRA) adapters</a>.</p><p>LoRA adapters are a kind of plugin for an AI model that changes its behavior dramatically while only adding slightly to the total size. Instead of having an entire model for each task, each one with hundreds of millions of parameters, the <code>jina-embeddings-v5-text</code> model family lets you use just one model with a compact LoRA adapter for each task. This saves memory, storage space, and inference costs.</p><h3>Truncating embeddings</h3><p>We’ve trained the <code>jina-embeddings-v5-text</code> models using <a href="https://arxiv.org/abs/2205.13147">Matryoshka Representation Learning</a>, which lets you cut your embeddings down to smaller sizes at a minimal cost to their quality.</p><p>By default, <code>jina-embeddings-v5-text-small</code> generates 1024-dimension embedding vectors, each represented by a 16-bit number, making every embedding 2KB in size. For a large collection of documents, this can be a lot of data to store, and the cost of searching a vector database full of embeddings is proportional both to the size of the database and to the number of dimensions each stored vector has.</p><p>But you can just halve the size of the embeddings (throw away 512 of the 1024 dimensions) and take up half the space while doubling search speeds. This has an impact on performance. Throwing away information reduces precision. But as the graph below shows, even getting rid of half of the embedding only reduces performance slightly:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f11969682c81bcf22b5b29a6204ba6ec40471b3d-900x700.png" style="width: 100%;" /><p>As long as your embeddings are at least 256 dimensions, the loss in precision should remain fairly small. Below that level, however, relevance and accuracy deteriorate quickly.</p><p>Truncating embeddings like this empowers users to manage their own trade-offs between accuracy and computing costs. It gives you the tools to get big efficiency gains and large cost savings out of your search AI.</p><h3>Robust quantization</h3><p><em>Quantization </em>is another way of reducing the size of embeddings. Instead of throwing away part of each embedding, quantization reduces the precision of the numbers in the embedding. The <code>jina-embeddings-v5-text</code> models generate embeddings with 16-bit numbers, but we can round those numbers off, reducing their precision and the number of bits needed to store them. In the most extreme case, we can reduce each number to one bit (0 or 1), compressing <code>jina-embeddings-v5-text</code>’s default 1024-dimension embeddings from 2 kilobytes to 128 bytes, a 94% reduction from binary quantization alone. 
However, also like truncation, quantization makes embeddings less accurate.</p><p>We’ve trained the <code>jina-embeddings-v5-text</code> models to work with Elasticsearch’s <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/bbq">Better Binary Quantization</a> by minimizing that loss of accuracy, and benchmark tests of binarized embeddings from these models show performance almost equal to their non-binarized equivalents. Consult <a href="https://arxiv.org/abs/2602.15547">the technical report</a> for detailed ablation studies of binarization performance.</p><h3>Multilingual performance</h3><p>Many embedding models are multilingual because they’ve been trained on materials that include large numbers of languages. But that doesn’t mean that they all perform equally well in all supported languages.</p><p>We identified 211 languages in the MMTEB multilingual benchmark and separated them so we could compare our models to similar models on a language-by-language basis. The image below summarizes our results as a heat map. Each patch is a language (identified by its ISO-639 code), and the greener it is, the better the model performed compared to the average of similar models:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852ac5d0f1977bb0c1124d87f8863a9bb94eb7da-1600x765.png" style="width: 100%;" /><p>Although accuracy varies between languages, the <code>jina-embeddings-v5-text</code> models are state-of-the-art or nearly so across most of the world’s languages.</p><p>For details about multilingual performance, see the <a href="https://arxiv.org/abs/2602.15547"><code>jina-embeddings-v5-text</code> technical report</a>.</p><h2><strong>Jina in Elastic: State-of-the-art native AI for search</strong></h2><p>With <code>jina-embeddings-v5-text</code> models on EIS, you can run high-performance multilingual embedding models natively in Elasticsearch with fully managed, GPU-accelerated inference and no infrastructure to provision or scale. <code>jina-embeddings-v5-text</code> models extend the growing EIS model catalog with compact, multilingual models powered by the latest developments in AI. These models have state-of-the-art performance on information retrieval and standard data analysis benchmarks, and they offer unequaled, globe-spanning multilingual support.</p><p>With two models of vastly different sizes, users can determine which one is best suited for their applications and budgets. Furthermore, with robust embeddings that remain performant when truncated to smaller sizes or quantized to lower precision, <code>jina-embeddings-v5-text </code>models provide opportunities for further concrete savings in storage and computing costs as well as in processing latency.</p><p>With the <code>jina-embeddings-v5-text</code> family, Jina Reranker, and Elastic’s fast vector and BM25 search, users now have access to end-to-end, state-of-the-art <a href="https://www.elastic.co/docs/solutions/search/hybrid-search">hybrid search</a> from Elastic. 
When you need the most relevant results, whether for retrieval augmented generation (RAG) pipelines, search applications, or data analysis, Elastic with Jina search AI models provides solid and cost-effective quality.</p><h2><strong>Getting started</strong></h2><p>The <code>jina-embeddings-v5-text</code> models are fully integrated into <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">EIS</a>, and you can use them by setting the <strong><code>type</code></strong> field to <strong><code>semantic_text</code></strong> when creating your index and specifying the model (<code>.jina-embeddings-v5-text-small</code> or <code>.jina-embeddings-v5-text-nano</code>) in the <strong><code>inference_id</code></strong> field, as in this example:</p><pre><code>PUT multilingual-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v5-text-small"
      }
    }
  }
}
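
# Hedged sketch: embedding dimensions can also be truncated by creating a custom
# inference endpoint (see "Truncating embeddings"). The endpoint name here is
# made up, and the exact service/service_settings keys for EIS may differ;
# check the create inference API docs before relying on this.
PUT _inference/text_embedding/my-jina-small-512
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small",
    "dimensions": 512
  }
}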

# Ingest data about France
POST multilingual-semantic-index/_doc
{
  "content": "The capital of France is Paris"
}

GET multilingual-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "What is the French capital?"
    }
  }
}</code></pre><p>Elasticsearch automatically selects the appropriate LoRA adapter during indexing and retrieval. The embedding dimensions (see the “<strong>Truncating embeddings”</strong> section, above) can be set when <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">creating a custom inference endpoint</a>.</p><p>See the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector">Elasticsearch documentation</a> for more information on using <strong><code>jina-embeddings-v5-text</code> </strong>models.</p><h2><strong>More information</strong></h2><p>To learn more about <code>jina-embeddings-v5-text</code> models, read the <a href="https://jina.ai/news/jina-embeddings-v5-text-distilling-4b-quality-into-sub-1b-multilingual-embeddings/">release notes on the Jina AI blog</a> and the <a href="https://arxiv.org/abs/2602.15547">technical report</a>, with more detailed technical information about performance and Jina AI’s innovative new training procedure. For information about downloading and running these models locally, visit the <a href="https://huggingface.co/collections/jinaai/jina-embeddings-v5-text"><code>jina-embeddings-v5-text</code> collection page</a> on Hugging Face.</p><p>Jina AI models are available under a <a href="https://spdx.org/licenses/CC-BY-NC-4.0">CC-BY-NC-4.0 license</a>, so you are free to download them and try them out, but for commercial use, please contact <a href="https://www.elastic.co/contact?pg=global&amp;plcmt=nav&amp;cta=205352">Elastic sales</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ensuring semantic precision with minimum score]]></title>
            <description><![CDATA[Improve semantic precision by employing minimum score thresholds. The article includes concrete examples for semantic and hybrid search. ]]></description>
            <link>https://www.elastic.co/search-labs/blog/semantic-precision-minimum-score</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/semantic-precision-minimum-score</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mattias Brunnert]]></dc:creator>
            <pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8b3b5910abfe16d48d309341a0027008b16c4340-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Semantic search has opened up a world of opportunities for search relevance. High-quality sparse and dense models, such as ELSER, E5, and Jina Embeddings v4, return relevant results based on the meaning of words, rather than keyword matching. However, semantic search sometimes returns irrelevant results at the tail end or for queries that lack relevant results in the index. This property of sparse and dense models can confuse users or waste precious tokens for large language models (LLMs).</p><p>In this article, you’ll learn how you can use the minimum score parameter to increase the precision of your semantic search results. If you’d like to test the examples provided in this blog post, go to <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/ensuring-semantic-precision-with-minimum-score/ensuring_semantic_precision_with_minimum_score.ipynb">the associated Jupyter notebook</a>.</p><h2>Background: Precision and recall</h2><p>In search relevance, <em>precision </em>and <em>recall </em>are key concepts. Any reader not already familiar is highly encouraged to read up on them. Following is a summary.</p><ul><li><strong>Precision: </strong>The fraction of returned search results that are relevant to the user.</li><li><strong>Recall: </strong>The fraction of all relevant documents in the corpus that are included in the search result set.</li></ul><p>Or, in other words, precision is returning <strong>only </strong>relevant results, and recall is returning <strong>all </strong>relevant results. As you can imagine, these are often competing requirements. Semantic search tends to have very high recall but can struggle with precision. Keep reading to learn how to get around this property.</p><h2>Introducing the minimum score parameter</h2><p>The <code>min_score</code> parameter allows us to improve precision by setting a minimum score, which truncates the result set by removing any matches with a score less than the defined threshold. Following is a simple example:</p><pre><code>GET search-movies/_search
{
  "retriever": {
    "linear": {
      "min_score": 4,
      "retrievers": [
        ...
      ]
    }
  }
}</code></pre><h2>Normalizing the score</h2><p>Setting a minimum score is all well and good; however, not all semantic models return a score suitable for a static threshold. ELSER, for example, returns a score that is unbounded. <a href="https://huggingface.co/intfloat/e5-small#faq">Some</a> dense model scores are densely clustered and only make sense in the context of the specific query.</p><p>For most semantic search cases, we recommend using a normalization approach before applying the <code>min_score</code>. The normalization ensures that the document score is within a defined interval. Elasticsearch retrievers provide two such <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/linear-retriever#linear-retriever-normalizers">normalizers</a>, <code>l2_norm</code> and <code>minmax</code>. The most commonly used is <code>minmax</code>, since it’s easy to understand and works well in many scenarios. Key properties of <code>minmax</code> include:</p><ul><li>Document scores are distributed between 0 and 1.</li><li>The highest scoring document is always scored as 1.</li><li>The lowest scoring document is always scored as 0.<ul><li>This can make it less suitable for keyword search. See the “Hybrid search” sections below for further discussion.</li></ul></li></ul><p>Following is an example of a normalized semantic query with <code>min_score</code>. The rank window size has been increased to 500 so that more candidates are considered and normalized, and <code>size</code> is set to 100 so we can return a longer list of search results.</p><pre><code>GET search-movies/_search
{
  "size": 100,
  "_source": [
    "title", "overview"
  ],
  "retriever": {
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><p>The size has been set to a higher value than normally seen in production. This is so we can inspect the quality of the search results and tune the <code>min_score</code> threshold.</p><h2>Hybrid search using the linear retriever</h2><p>For hybrid search, the simplest approach is to normalize all scores, assign weights, and apply a minimum score. Note that by choosing weights with a sum of 1, you keep the total score within a range of 0–1. This makes it easy to make sense of the final scores and tune <code>min_score</code>. Following is an example:</p><pre><code>GET search-movies/_search
{
  "size": 100,
  "_source": ["title", "overview","keywords"],
  "retriever": {
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "weight": 0.6,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        },
        {
          "weight": 0.4,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "multi_match": {
                  "query": "superhero movie",
                  "fields": ["overview","keywords", "title"],
                  "type": "cross_fields",
                  "minimum_should_match": "2"
                }
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><h2>Hybrid search using RRF</h2><p>With BM25, we often control precision through other means, such as using the <code>AND</code> operator or <code>minimum_should_match</code>. In addition, queries consisting of single, precise, and rare terms naturally produce result sets with few hits, often all of them highly relevant. This can lead to:</p><ul><li>Results further back in the result set getting assigned a low normalized score in the BM25 retriever, even if the absolute BM25 score is close to that of the top-scoring hits.</li><li>When adding a very low BM25 score to the semantic score, the total can be approximated as the semantic score.</li><li>The lack of BM25 score contribution can cause the document to be discarded by the <code>min_score</code> threshold.</li></ul><p>As a solution, we can instead use reciprocal rank fusion (RRF) to combine BM25 and semantic results. RRF gets around the challenge of comparing scores from different search algorithms by instead focusing on the position in each result set. In this scenario, the <code>min_score</code> is only applied to the semantic retriever.</p><pre><code>GET search-movies/_search
{
  "_source": ["title", "overview","keywords"],
  "retriever": {
    "rrf": {
      "rank_window_size": 500,
      "retrievers": [
        {
          "linear": {
            "rank_window_size": 500,
            "min_score": 0.25,
            "retrievers": [
              {
                "normalizer": "minmax",
                "retriever": {
                  "standard": {
                    "query": {
                      "semantic": {
                        "field": "overview_vector",
                        "query": "superhero movie"
                      }
                    }
                  }
                }
              }
            ]
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "superhero movie",
                "fields": ["overview", "keywords","title"],
                "type": "cross_fields",
                "minimum_should_match": "2"
              }
            }
          }
        }
      ]
    }
  }
}</code></pre><h2>Conclusion</h2><p>By using <code>min_score</code>, we’ve shown how we can reduce the number of false positives in our result sets caused by the high recall of semantic search algorithms. To learn more about retrievers, please see this <a href="https://www.elastic.co/search-labs/blog/elasticsearch-retrievers">blog post</a> and the <a href="https://www.elastic.co/docs/solutions/search/retrievers-overview">Elasticsearch documentation</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dependency management on Kubernetes]]></title>
            <description><![CDATA[How to streamline dependency management on Kubernetes using Renovate CLI and Argo Workflows.]]></description>
            <link>https://www.elastic.co/search-labs/blog/dependency-management-kubernetes</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/dependency-management-kubernetes</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Nikos Fotiou]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/00519840e6eec7101c1fb096afcae976ee0c454e-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>This is how we built a self-hosted dependency management platform using Kubernetes, Argo Workflows, Argo Events, and Renovate CLI to automate updates, quickly address Common Vulnerabilities and Exposures (CVEs), and efficiently propagate new package versions across thousands of repositories.</p><h2><strong>Dependency management at Elastic</strong></h2><p>At Elastic, we have to manage hundreds or even thousands of repositories, both private and public. When a critical CVE is discovered, we need immediate answers and actions: Which repositories are vulnerable? How quickly can we patch them? Apart from security, productivity questions also arise: How can we quickly propagate the release of a new package version across all the repositories that depend on it without spending too much time on manual tasks?</p><p>The initial trigger for exploring dependency management solutions was the need to establish a secure foundation with automated updates for <a href="https://www.elastic.co/blog/reducing-cves-in-elastic-container-images">reducing CVEs</a>. After carefully evaluating the options, we started with a self-hosted infrastructure, using our own Kubernetes cluster to run Mend Renovate Community Self-Hosted. The idea was to provide a dependency management platform that our users could access in a self-service manner.</p><p>The initial experiment was successful, so more and more teams onboarded onto our platform and used it in their repositories’ everyday lifecycle for updates and CVE patching. This happened so fast that we soon hit the ceiling of our self-hosted installation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e14d9f98d4eccaa08a335d5bd23d88e5debbb344-1600x1103.png" style="width: 100%;" /><h3><strong>The challenge: How can we scale a dependency management platform in a large organization with a significant number of repositories?</strong></h3><p>Our dependency management platform was processing one repository at a time, and the sequential processing model couldn’t keep up with the large number of repositories that we own. We had already identified the root issue: <strong>a single instance</strong> of our dependency management tool was expected to process our large and ever-growing list of repositories. Repositories waited in a queue, sometimes for many hours. More than 50% of our repositories were not even processed daily; in other words, they waited more than 24 hours between scans.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/45ade5bda08f82bed0b3d0d3736cbd6f056e7a4e-1312x816.jpg" style="width: 100%;" /><p>Large repositories created larger bottlenecks, due to their sizable codebases and their multiple open PRs. GitHub webhook events disrupted the sequence. Automerge became unreliable because scan timing was unpredictable. 
We had made a promise to our users about the frequency of scans, and we couldn’t fulfill it.</p><h3><strong>The decision to build in-house: Meeting Elastic&#x27;s unique scale and security needs</strong></h3><p>While we considered commercial options, including <strong>Mend&#x27;s Renovate Enterprise Self-Hosted edition</strong>, internally at Elastic we had a few key initiatives ramping up.</p><p>Our decision to build an in-house platform was driven by the recognition that only a deeply customized solution could meet Elastic&#x27;s specific, nonnegotiable requirements:</p><ol><li><strong>Investing in our internal developer platform:</strong> At the time, we had already started heavily investing in our internal developer platform. We were discussing and designing ways that each one of our services could fit into that. This meant that we wanted to test-drive our own rules and practices for our dependency management platform. On top of that, new guidelines were coming into play, and we wanted to design the platform ahead of them.</li><li><strong>Native integration and workflow customization:</strong> We required straightforward integration with our internal tooling and internal processes. For example, we wanted to centralize configuration as code with our Service Catalog (Backstage). We have specific needs around the usage of Backstage that we wanted to make our platform compatible with. So, although it would be possible to make use of the Renovate Self-Hosted APIs alongside our Backstage automation, this wouldn’t completely cover our internal processes.</li><li><strong>Elastic-specific defense-in-depth security:</strong> Our stringent security compliance required bespoke security mechanisms tailored to our ecosystem. We were working to <a href="https://entro.security/blog/how-elastic-scaled-secrets-nhi-security-elastics-playbook-from-visibility-to-automation/">harden our usage of “non-human identities.”</a> This hardening meant that we authenticate to GitHub through nonstandard means that an off-the-shelf tool wouldn’t support. Our workflow included implementing a parent-child workflow secret encryption pattern and using transient, single-use GitHub tokens. Building in-house was the only practical way to embed these unique security layers and minimize the attack surface across our complex multicloud environment.</li></ol><h2><strong>The solution: Workflow orchestration for dependency management</strong></h2><p>Our solution started from the decision to build on the dependency management tool we already used rather than replace it. It had shown its potential, and its flexibility matters for the varied needs across our organization. We considered different approaches, and what helped us make up our minds was the broad and sometimes specialized set of needs we have to cover. We decided to build a reliable and scalable dependency management platform where each repository is processed on its own, removing bottlenecks and setting us up for growth.</p><p>We designed the platform abiding by three core principles:</p><h3><strong>1. Parallel processing</strong></h3><p>Every repository gets its own dependency management processing environment. No more queues. Our concurrency is only limited by the number of resources we spend. We have also applied smart distributed scheduling to avoid getting rate limited by GitHub.</p><h3><strong>2. 
Self-serviceable</strong></h3><p>We use our Service Catalog (Backstage) to automatically onboard and manage any new repository. We use our own resource definition to give the end user the option to select how often a repository will be processed, how many resources they want to allocate to their schedules, and whether they want to turn processing off or back on for any reason. We plan to add more options that way as our users’ needs evolve and they get more fluent with the new installation.</p><h3><strong>3. Reduced secret scope and namespace isolation</strong></h3><p>For increased security, we supply our dependency management pods with ephemeral GitHub tokens that are generated at the start of each workflow. On top of that, we isolate our workloads in specific namespaces so they can be provided only the necessary secrets. We control what secrets can be accessed by each dependency management workflow using Kubernetes RBAC. We also use encryption to propagate the GitHub token from the parent to the child workflows.</p><p>We rebuilt our platform on Kubernetes: Argo Workflows powers the logic of our processes, and Renovate CLI is set up to scan and process one repository at a time.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5560ed20e2bd9ecdd574a9c835126d12b24c332f-1600x1157.png" style="width: 100%;" /><p><strong>The beauty:</strong> We’re using battle-tested open source projects in an original way, providing new working examples for all of those projects and, at the same time, amplifying development velocity and consolidating CVE reduction for our teams.</p><h2><strong>Dependency management architecture: Four microservices</strong></h2><p>The platform comprises four custom-built components:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2b3d4046c05bb261e45d40c59f864eb51fb9eaa9-1217x1600.png" style="width: 100%;" /><h3><strong>Workflows Operator (Go/Kubebuilder)</strong></h3><p>A Kubernetes operator managing workflow lifecycle through three Custom Resource Definitions (CRDs):</p><ul><li><strong>RepoConfig CRD:</strong> Single source of truth for repository configuration.</li></ul><p>This is how RepoConfig is defined in the operator:</p><pre><code>// RepoConfig is the Schema for the repoconfigs API
type RepoConfig struct {
	metav1.TypeMeta `json:",inline"`

	// metadata is a standard object metadata
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty,omitzero"`

	// spec defines the desired state of RepoConfig
	// +required
	Spec RepoConfigSpec `json:"spec"`

	// status defines the observed state of RepoConfig
	// +optional
	Status RepoConfigStatus `json:"status,omitempty,omitzero"`
}</code></pre><p>And this is what an instance of RepoConfig would look like:</p><pre><code>apiVersion: workflows.elastic.co/v1
kind: RepoConfig
metadata:
  generation: 3
  name: elastic-test-repo
  namespace: dependency-management-operator
spec:
  owner: group:my-team
  renovate:
    config:
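      # resourceGroup selects a CPU/memory preset: SMALL | MEDIUM | LARGE
      # runFrequency controls how often the repo is scanned (a cron expression also works)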
      resourceGroup: SMALL
      runFrequency: 4h
    enabled: true
  repository: elastic/test-repo</code></pre><ul><li><strong>Parent CRD:</strong> Manages CronWorkflows for scheduled scans.</li></ul><p>Inside the reconciliation loop of the parent controller, we make sure that workflow settings are created and kept up to date or even deleted if needed.</p><p>First, it gets some globally configured settings for workflows:</p><pre><code>func (r *ParentReconciler) reconcileSubResources(ctx context.Context, req ctrl.Request, parent *workflowsv1.Parent) error {
	logger := logf.FromContext(ctx)
	logger.Info("Reconcile SubResources for Parent", "name", req.NamespacedName)
	wfSet := workflowsettings.WorkflowSettings{
		RunFrequency:   parent.Spec.RunFrequency,
		ResourceGroups: "parent",
	}</code></pre><p>It then makes sure a mutex configmap is up to date, which prevents workflows for the same repository from running concurrently:</p><pre><code>	cfMngr := resources.NewConfigMapManager(r.Client, r.Scheme, r.OperatorConfig.ParentNamespace)
	err := cfMngr.CreateOrUpdateSyncMutexConfigmap(ctx, fmt.Sprintf("%s%s", r.OperatorConfig.ResourcesPrefix, r.OperatorConfig.SyncMutexCfgMapName), strings.TrimPrefix(parent.Spec.Repository, "elastic/"), r.OperatorConfig.SemaphoreConcurrencyLimit)</code></pre><p>Then it creates a Workflow Manager, the struct that creates or updates the CronWorkflows and the WorkflowTemplates:</p><pre><code>	wfMngr := resources.NewArgoWorkflowManager(r.Client,
		r.Scheme,
		curateResourceName(
			strings.ReplaceAll(parent.Spec.Repository, "/", "-"),
		),
		parent.Namespace,
		"parent-workflow",
		false).
		WithOrganization(r.OperatorConfig.GitHubOrg).
		WithRepoName(parent.Spec.Repository).
		Init(true, true).
		WithPrefix(r.OperatorConfig.ResourcesPrefix).
		WithWfTemplateName(r.OperatorConfig.ParentWorkflowTemplate).
		WithResources(wfSet.GetResourceCategory()).
		WithSchedule(wfSet.GetCronSchedule()).
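		// wfSet carries parent.Spec.RunFrequency, so the CronWorkflow schedule follows the user-selected runFrequency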
		WithImagePullSecrets([]corev1.LocalObjectReference{{
			Name: r.OperatorConfig.WorkflowImagePullSecrets,
		}}).
		AddArgument(true, true, "extra_cli_args").
		SetArgument(true, false, "extra_cli_args", "none").
		AddTemplate(resources.NewParentDAGTemplateInstance()).
		AddTemplate(resources.NewWorkflowsTemplateInstance("check-child-workflows", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddTemplate(resources.NewWorkflowsTemplateInstance("security", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddTemplate(resources.NewWorkflowsTemplateInstance("submit-child-workflow", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector))
	wfMngr.OverWriteCommand("submit-child-workflow", r.OperatorConfig.ChildNamespace)
	wfMngr.OverwriteWfTemplateName("parent-wftmpl")
	wfMngr.AddSynchronization(fmt.Sprintf("%s%s", r.OperatorConfig.ResourcesPrefix, r.OperatorConfig.SyncMutexCfgMapName), "{{workflow.parameters.repo_name}}")
	err = wfMngr.CreateOrUpdateCronWorkflow(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update cron workflow: %w", err)
	}
	err = wfMngr.CreateOrUpdateWorkflowTemplate(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update workflow template: %w", err)
	}
	return nil</code></pre><ul><li><strong>Child CRD:</strong> Manages WorkflowTemplates with per-repository resources.</li></ul><p>The child controller has a similar reconciliation duty to the parent, but this time it’s responsible for workflow templates in the child namespace that will be triggered by the parent workflows.</p><pre><code>func (r *ChildReconciler) reconcileSubResources(ctx context.Context, req ctrl.Request, child *workflowsv1.Child) error {
	logger := logf.FromContext(ctx)
	logger.Info("Reconcile SubResources for Child", "name", req.NamespacedName)
	wfSet := workflowsettings.WorkflowSettings{
		ResourceGroups: child.Spec.ResourceCategory,
	}
	wfMngr := resources.NewArgoWorkflowManager(r.Client,
		r.Scheme,
		curateResourceName(
			strings.ReplaceAll(child.Spec.Repository, "/", "-"),
		),
		child.Namespace,
		"runner",
		true).
		Init(false, true). // only manage workflow template
		WithPrefix(r.OperatorConfig.ResourcesPrefix).
		WithSuffix("-child-wftmpl").
		WithRepoName(child.Spec.Repository).
		WithOrganization(r.OperatorConfig.GitHubOrg).
		WithResources(wfSet.GetResourceCategory()). // will override resources of presets if set
		WithImagePullSecrets([]corev1.LocalObjectReference{{
			Name: r.OperatorConfig.WorkflowImagePullSecrets,
		}}).
		AddTemplate(resources.NewWorkflowsTemplateInstance("runner", r.OperatorConfig.WorkflowImagePullPolicy, r.OperatorConfig.WorkflowNodeSelector)).
		AddArgument(false, true, "repo_full_name").
		AddArgument(false, true, "repo_name").
		AddArgument(false, true, "encrypted_token").
		AddArgument(false, true, "extra_cli_args")
	wfMngr.OverWriteCommand("runner", r.OperatorConfig.ChildNamespace)
	err := wfMngr.CreateOrUpdateWorkflowTemplate(ctx)
	if err != nil {
		return fmt.Errorf("failed to create or update workflow template: %w", err)
	}
	return nil
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ac70492a1266ba02cb8afbafc5a486cb38a0edc-1600x1290.png" style="width: 100%;" /><p>The multi-controller pattern provides clear separation: RepoConfig Controller handles onboarding/offboarding, Parent Controller manages scheduling, and Child Controller handles execution templates.</p><h3><strong>GitHub Events Gateway (Go)</strong></h3><p>A secure webhook proxy that receives GitHub webhooks, verifies signatures, filters by organization/repository, and routes to Argo Events. We built 10 distinct sensors responding to dependency dashboard interactions, PR events, and package updates.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4828625456cbd6efa8020a20f10d23f294f98a02-1306x1600.png" style="width: 100%;" /><p>This gateway enables integration with GitHub Apps by:</p><ul><li>Verifying incoming GitHub webhook signatures for security.</li><li>Forwarding valid events to the Argo Events EventSource with all relevant headers and authentication.</li><li>Configuring an authSecret on the EventSource and providing it as a Bearer header in forwarded requests.</li><li>Providing logging, metrics, and retry logic.</li></ul><p>It performs various validations on each GitHub event request.</p><p>It makes sure some HTTP attributes are present:</p><pre><code>// ValidateRequestMethod checks if the request method is POST.
func ValidateRequestMethod(r *http.Request) error {
	if r.Method != http.MethodPost {
		return fmt.Errorf("method not allowed, only POST is accepted")
	}
	return nil
}

// ValidateRequiredHeaders checks for required GitHub headers.
func ValidateRequiredHeaders(r *http.Request) error {
	eventType := r.Header.Get("X-GitHub-Event")
	deliveryID := r.Header.Get("X-GitHub-Delivery")
	signature := r.Header.Get("X-Hub-Signature-256")
	if eventType == "" || deliveryID == "" || signature == "" {
		return fmt.Errorf("missing required GitHub headers")
	}
	return nil
}
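
// A hedged illustration, not the actual gateway code: the checks defined in
// this post can be chained at the top of the webhook handler, failing fast
// on the first validation error.
func validateRequest(r *http.Request) error {
	if err := ValidateRequestMethod(r); err != nil {
		return err
	}
	if err := ValidateRequiredHeaders(r); err != nil {
		return err
	}
	return ValidateUserAgent(r)
}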

// ValidateUserAgent checks that the User-Agent header starts with GitHub-Hookshot/
func ValidateUserAgent(r *http.Request) error {
	userAgent := r.Header.Get("User-Agent")
	if !strings.HasPrefix(userAgent, "GitHub-Hookshot/") {
		return fmt.Errorf("invalid User-Agent")
	}
	return nil
}</code></pre><p>It also validates the signature of each request and its organization:</p><pre><code>// ValidateSignature verifies the GitHub webhook signature.
func ValidateSignature(r *http.Request, secret string) ([]byte, error) {
	payload, err := GitHub.ValidatePayload(r, []byte(secret))
	if err != nil {
		return nil, fmt.Errorf("invalid GitHub signature: %w", err)
	}
	return payload, nil
}

// ValidateAllowedOwner checks if the organization login is in the allowed organizations list.
func ValidateAllowedOwner(payload []byte, allowedGitHubOrganizations []string) (string, error) {
	var orgLogin string
	var payloadMap map[string]any
	if err := json.Unmarshal(payload, &payloadMap); err == nil {
		if orgObj, ok := payloadMap["organization"].(map[string]any); ok {
			if login, ok := orgObj["login"].(string); ok {
				orgLogin = login
			} else if name, ok := orgObj["name"].(string); ok {
				orgLogin = name
			}
		}
	}
	if !slices.Contains(allowedGitHubOrganizations, orgLogin) {
		return orgLogin, fmt.Errorf("organization login not allowed")
	}
	return orgLogin, nil
}</code></pre><p>Finally, it routes to Argo Events based on event type:</p><pre><code>	// Map eventType to Argo `EventSource` path
	var endpoint string
	switch eventType {
	case "push":
		endpoint = "/push"
	case "issues":
		endpoint = "/issues"
	case "pull_request":
		endpoint = "/pull-requests"
	default:
		slog.Info("Ignoring unhandled event type", "event_type", eventType, "delivery_id", deliveryID)
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
		return
	}
	forwardURL := h.config.ArgoEventSourceForwardURL + endpoint</code></pre><p>On the Argo Events side of things, 10 sensors watch the Argo Events EventBus for new events:</p><pre><code>apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: {{ .Values.sensors.packageUpdateOnDefaultBranch.name }}
  namespace: {{ .Release.Namespace }}
spec:
  eventBusName: {{ .Values.eventBus.name }}</code></pre><p>Then the script applies each sensor’s logic:</p><pre><code>script: |
          local e = event
          if not e or not e.body or not e.body.repository then
            return false
          end

          -- e.g., "refs/heads/main"
          local ref = e.body.ref
          local default_branch = e.body.repository.default_branch
          if not ref or not default_branch then
            return false
          end

          local expected = "refs/heads/" .. default_branch
          if ref ~= expected then
            return false
          end

        {{- if .Values.sensors.packageUpdateOnDefaultBranch.packageFiles }}
          patterns = { {{- range $i, $f := .Values.sensors.packageUpdateOnDefaultBranch.packageFiles }}{{ if $i }}, {{ end }}"{{ $f }}"{{- end }} }
        {{- end }}

          local function anyMatch(path)
            if type(path) ~= "string" then return false end
            for _, pat in ipairs(patterns) do
              -- match filename at repo root, or anywhere under subdirs
              if path:match(pat) or path:match(".+/" .. pat) then
                return true
              end
            end
            return false
          end

          local function filesContainPackage(paths)
            if type(paths) ~= "table" then return false end
            for _, p in ipairs(paths) do
              if anyMatch(p) then return true end
            end
            return false
          end

          -- Inspect all commits (GitHub includes added/modified/removed lists)
          local commits = e.body.commits
          if type(commits) ~= "table" then
            -- Fallback: some payloads include only head_commit
            commits = {}
            if type(e.body.head_commit) == "table" then
              table.insert(commits, e.body.head_commit)
            end
          end

          for _, c in ipairs(commits) do
            if filesContainPackage(c.added) or filesContainPackage(c.modified) or filesContainPackage(c.removed) then
              return true
            end
          end

          return false</code></pre><h3><strong>Backstage Syncer (Go)</strong></h3><p>This polls our Service Catalog (Backstage) for Repository Real Resource Entities, transforms them into RepoConfig CRDs, and keeps the platform in sync with configuration changes. Changes apply within three minutes.</p><pre><code>repoMap := make(map[string]map[string]interface{})
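			// Iterate over the Backstage entities, keep only GitHub repositories, and
			// collect each repository's renovate settings (with defaults) into repoMap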
			for i := range entities {
				entity := &entities[i]
				if entity.Spec.Type != "GitHub-repository" {
					continue
				}

				implRaw, err := json.Marshal(entity.Spec.Implementation)
				if err != nil {
					logger.Error("Failed to marshal implementation", "error", err)
					continue
				}

				var implMap map[string]interface{}
				err = json.Unmarshal(implRaw, &implMap)
				if err != nil {
					logger.Error("Failed to unmarshal implementation map", "error", err)
					continue
				}
				var repoName string
				if specMap, ok := implMap["spec"].(map[string]interface{}); ok {
					if repo, ok := specMap["repository"].(string); ok {
						repoName = repo
					}
				}
				if repoName == "" {
					continue
				}

				var workflowsRaw []byte
				if v, ok := implMap["spec"].(map[string]interface{}); ok {
					if r, ok := v["renovate"]; ok {
						workflowsRaw, _ = json.Marshal(r)
					} else {
						workflowsRaw = []byte(`{}`)
					}
				} else {
					workflowsRaw = []byte(`{}`)
				}

				var workflowsWithDefaults schema.WorkflowsMetadata
				err = json.Unmarshal(workflowsRaw, &workflowsWithDefaults)
				if err != nil {
					logger.Error("Failed to unmarshal workflows config", "error", err)
					continue
				}

				workflowsMap := map[string]interface{}{
					"enabled":        workflowsWithDefaults.Enabled,
					"require_pr":     workflowsWithDefaults.RequirePr,
					"resource_group": string(workflowsWithDefaults.ResourceGroup),
					"run_frequency":  string(workflowsWithDefaults.RunFrequency),
				}
				repoMap[repoName] = map[string]interface{}{
					"renovate": workflowsMap,
					"owner":    entity.Spec.Owner,
				}
			}
			logger.Info("Fetched GitHub Repository data from Backstage", "repository_count", len(repoMap), "status_code", resp.StatusCode)</code></pre><p>Finally, it writes that data into RepoConfig instances.</p><h3><strong>Workflows base (Mixed: JavaScript, Go, Helm)</strong></h3><p>The foundation layer contains Helm charts, JavaScript configs, a Go wrapper for Renovate CLI with encryption support, and a custom APK Indexer for Alpine packages.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/908d19278face3ce1119dbee9146c1264b6e2f30-1600x873.png" style="width: 100%;" /><h2><strong>Self-service configuration</strong></h2><p>Teams configure their repositories declaratively through Backstage:</p><pre><code>spec:
  renovate:
    enabled: true
    config:
      resourceGroup: LARGE      # SMALL | MEDIUM | LARGE  
      runFrequency: "0 */4 * * *"  # Every 4 hours</code></pre><p>Resource groups allocate CPU and memory based on repository size:</p><ul><li><strong>SMALL:</strong> 500m CPU, 1Gi memory.</li><li><strong>MEDIUM:</strong> 1000m CPU, 2Gi memory.</li><li><strong>LARGE:</strong> 2000m CPU, 4Gi memory.</li></ul><p>Configuration is version-controlled, auditable, and applies automatically.</p><h2><strong>The parent-child pattern</strong></h2><p>The execution model uses a parent-child workflow pattern:</p><ul><li><strong>Parent workflow:</strong> Lightweight CronWorkflow running on schedule. Encrypts secrets, determines whether a scan should run, passes configuration to the child.</li><li><strong>Child workflow:</strong> Ephemeral pod where Renovate CLI runs. Allocated resources dynamically, decrypts secrets in isolation, terminates after completion.</li></ul><p>This separation provides security (secrets encrypted at parent level), resource optimization (parents use minimal resources), and scalability (children run in parallel).</p><h2><strong>The results</strong></h2><h3><strong>Performance transformation</strong></h3><ul><li><strong>Before:</strong> One repository at a time, some repositories would not get processed possibly even for a day or more, less than 1,000 scans per day.</li><li><strong>After:</strong> 100+ concurrent scans, usually 8,000 scans and up to 10,000 recorded scans per day, limited only by the amount of resources we’re willing to spend and how we handle GitHub rate limits.</li></ul><h3><strong>Cost efficiency</strong></h3><p>However weird it may sound, running 8,000 pods a day can get you the same result much cheaper than having one long-running pod trying to achieve the same results.</p><p>In the previous setup, we were running a single instance that, on a good day, would perform 500–600 scans. At the same time, due to the fact that different kinds of repositories would be executed on the same pod, we needed to size the pod for the biggest ones. That sizing would be much bigger than our current extra large offering, using 8 CPUs for the pod and 16G of memory.</p><p>To meet the current daily output, the single pod would need to run for 12 days. So comparing the cost of that single pod running for 12 days to 8,000 pods of our “MEDIUM” size running each day, our new design is far more efficient for the same output of scans:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario A (Workflows)</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario B (The long-running single pod)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Setup</td><td style="border: 1px solid #ccc; padding: 4px;">8,000 pods (1 vCPU / 2GB)</td><td style="border: 1px solid #ccc; padding: 4px;">1 pod (8 vCPU / 16 GB)*</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Duration</td><td style="border: 1px solid #ccc; padding: 4px;">10 minutes each</td><td style="border: 1px solid #ccc; padding: 4px;">12 days continuous</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total work time</td><td style="border: 1px solid #ccc; padding: 4px;">1,333 compute hours</td><td style="border: 1px solid #ccc; padding: 4px;">288 compute hours</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total cost</td><td style="border: 1px solid #ccc; padding: 4px;">$65.83</td><td style="border: 1px solid #ccc; padding: 4px;">$113.75</td>
                    </tr>
              </tbody>
            </table>
          <p>However, let’s take into consideration that the default for our workloads is “SMALL,” with the great majority running successfully with 0.5 CPU and 1G RAM, and only a few needing to change to MEDIUM or LARGE. Let’s see what happens if 60% of our workloads run on “SMALL,” 30% on “MEDIUM,” and 10% on “LARGE,” which is closer to the truth.</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario A (Mixed swarm)</td><td style="border: 1px solid #ccc; padding: 4px;">Scenario B (The long runner)</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Strategy</td><td style="border: 1px solid #ccc; padding: 4px;">8,000 pods (mixed sizes)</td><td style="border: 1px solid #ccc; padding: 4px;">1 pod (8 vCPU / 16 GB)*</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Duration</td><td style="border: 1px solid #ccc; padding: 4px;">10 minutes each</td><td style="border: 1px solid #ccc; padding: 4px;">12 days continuous</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Total cost</td><td style="border: 1px solid #ccc; padding: 4px;">$52.66</td><td style="border: 1px solid #ccc; padding: 4px;">$113.75</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Savings</td><td style="border: 1px solid #ccc; padding: 4px;">$61.09 (54% cheaper)</td><td style="border: 1px solid #ccc; padding: 4px;">—</td>
                    </tr>
              </tbody>
            </table>
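          <p>As a sanity check, these figures can be reproduced from the AWS Fargate reference pricing mentioned in the footnote at the end of this post. The sketch below assumes the us-east-1 on-demand rates ($0.04048 per vCPU-hour and $0.004445 per GB-hour); the exact rates and region are our assumption, not a detail from the original comparison:</p><pre><code># Back-of-the-envelope reproduction of both cost tables.
# Assumed AWS Fargate on-demand rates (us-east-1); see the footnote below.
VCPU_RATE = 0.04048   # USD per vCPU-hour
MEM_RATE = 0.004445   # USD per GB-hour

def pod_cost(vcpu, mem_gb, hours):
    """Cost of one pod of the given shape running for `hours`."""
    return hours * (vcpu * VCPU_RATE + mem_gb * MEM_RATE)

SCAN_HOURS = 10 / 60  # each scan runs for ~10 minutes

# Scenario B: one 8 vCPU / 16 GB pod running for 12 days straight.
long_runner = pod_cost(8, 16, 12 * 24)

# Scenario A: 8,000 MEDIUM (1 vCPU / 2 GB) pods, 10 minutes each.
all_medium = 8000 * pod_cost(1, 2, SCAN_HOURS)

# Mixed swarm: 60% SMALL (0.5 vCPU / 1 GB), 30% MEDIUM, 10% LARGE (2 vCPU / 4 GB).
mixed = 8000 * sum(
    share * pod_cost(vcpu, mem, SCAN_HOURS)
    for share, vcpu, mem in [(0.6, 0.5, 1), (0.3, 1, 2), (0.1, 2, 4)]
)

print(f"{long_runner:.2f} {all_medium:.2f} {mixed:.2f}")  # 113.75 65.83 52.66</code></pre><p>Under those assumed rates, the totals match the tables to the cent.</p>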
          <p>We can see that, for the same output, we’re far more cost-efficient in our current setup.</p><h3><strong>Enhanced security</strong></h3><ul><li>Ephemeral GitHub tokens (minutes of exposure versus days).</li><li>Namespace isolation with Role-Based Access Control (RBAC) boundaries.</li><li>Secret encryption at rest in parent workflows.</li><li>Removed direct vault access.</li></ul><h3><strong>Predictable performance</strong></h3><p>With guaranteed scan frequency, we can finally set Service Level Objectives (SLOs). Automerge works reliably. Teams trust the platform to deliver what’s promised.</p><h2><strong>Key architectural decisions</strong></h2><p>Here are some of the milestone design decisions that shaped how the platform looks.</p><ul><li><strong>Why parent-child workflows?</strong></li></ul><p>We adopted this pattern to enforce a <strong>defense-in-depth</strong> strategy. By restricting high-value credentials (such as GitHub App secrets) to a dedicated, locked-down namespace, we use <strong>RBAC</strong> to ensure that ephemeral execution pods cannot arbitrarily access sensitive data. Recent supply chain vulnerabilities (for example, the <strong>&quot;Shai Hulud&quot;</strong> continuous integration/continuous delivery [CI/CD] attacks) have demonstrated the criticality of isolating runtime environments that execute dynamic scripts from the credential store.</p><p>Simultaneously, this decoupling enables <strong>granular resource optimization</strong>. The &quot;parent&quot; workflows act as lightweight orchestrators with a minimal footprint, while the &quot;child&quot; workflows handle the compute-intensive dependency scanning. This separation simplifies <strong>lifecycle management</strong> by allowing us to apply distinct reconciliation logic to each layer, granting users control over execution parameters (child) while retaining administrative control over the scheduling and security infrastructure (parent).</p><ul><li><strong>Why self-serviceable?</strong></li></ul><p>Eliminating our team as a bottleneck for repository configuration was a critical requirement. Our mission was to architect a scalable, <strong>self-service platform</strong> capable of supporting diverse use cases. We recognized that acting as <strong>gatekeepers</strong> for every configuration change was unsustainable, given the sheer volume of repositories. Instead, we adopted a philosophy of enablement: providing the “rails” (infrastructure and <strong>guardrails</strong>) while empowering users to drive the “trains” (execution and customization). We believe this shift toward <strong>team autonomy</strong> significantly enhances productivity by allowing users to tailor the system to their specific operational needs.</p><ul><li><strong>Why Kubernetes Operator pattern?</strong></li></ul><p>As mentioned above, a foundational design principle was to ensure that the platform was fully <strong>self-serviceable</strong>. We required an automated mechanism to capture user intent (such as toggling scans, adjusting scheduling frequency, or tuning runtime resource limits) and instantly propagate those changes to the underlying workflows. Anticipating future requirements, the system also needed to be easily <strong>extensible</strong>.</p><p>To achieve this, we developed a custom <strong>Dependency Management Kubernetes Operator</strong>. By using <strong>CRDs</strong> as the interface for configuration, we established a <strong>Kubernetes-native reconciliation loop</strong>. 
This operator continuously monitors the desired state defined by the user and automatically orchestrates the necessary updates to the workflow infrastructure. This ensures an <strong>event-driven</strong>, seamless operation, where the platform logic handles all complexity behind the scenes.</p><ul><li><strong>Why design a GitHub Events Gateway?</strong></li></ul><p>Adopting an <strong>event-driven architecture (EDA)</strong> was essential for the platform&#x27;s responsiveness. While CronWorkflows provided a reliable baseline schedule, we required the agility to handle <strong>ad hoc executions</strong>, such as users manually triggering scans via the dashboard. To achieve this, we needed a dedicated <strong>ingestion gateway</strong> to validate payload integrity and route requests intelligently.</p><p>We evaluated existing solutions, including the native GitHub EventSource for Argo, but we identified significant risks regarding <strong>operational overhead</strong> and strict <strong>GitHub API quotas</strong> (for example, webhook limits per repository). Consequently, we built a custom gateway to decouple our infrastructure from these limitations.</p><p>Crucially, this gateway served as a strategic <strong>traffic control point</strong> during our migration. It acted as a switch, enabling us to perform a <strong>gradual, granular rollout</strong> (traffic shifting) from the legacy system to the new infrastructure. This ensured that onboarding thousands of repositories was a controlled, risk-free process rather than a “big bang” switchover.</p><h2><strong>Lessons learned</strong></h2><p>Some lessons that we learned go hand-in-hand with the <a href="https://www.elastic.co/about/our-source-code">Elastic Source Code</a>:</p><ol><li><strong>Customer First: </strong>Platforms are built for users, so users’ needs come first. This shapes the platform into efficiently designed infrastructure and applications that reduce user friction, simplify the scaling of the platform, and ease adoption.</li><li><strong>Space, Time: </strong>Sometimes the path of least resistance leads to <strong>shifting sands</strong>. We initially tried to optimize the existing sequential processing model, but this failed to resolve our issues; in fact, it only introduced more complexity and loose ends. The bold decision to <strong>rearchitect</strong> the platform with parallel processing required significant up-front effort. However, it ultimately paved the way for sustainable platform growth and virtually eliminated tedious daily administrative work.</li><li><strong>IT, Depends: </strong>A platform cannot operate in isolation; its success depends on how well it integrates with the broader ecosystem. In our case, integration with <strong>Backstage</strong> was critical, as it serves as the source of truth for seamless service onboarding. Similarly, connecting to <strong>Artifactory</strong> allowed us to manage private package updates efficiently, and the list of essential integrations goes on.</li><li><strong>Progress, SIMPLE Perfection: </strong>Throughout the implementation, we constantly pressure-tested our initial assumptions and adapted to new barriers as they emerged. 
Rather than getting paralyzed by perfectionism, we adopted an <strong>iterative approach</strong>, tackling challenges one by one and adjusting our migration strategy to meet real-world conditions.</li></ol><h2><strong>What’s next</strong></h2><p>Delivering the platform enables more meaningful work that will help us improve its UX and efficiency. Some examples are:<br/></p><ul><li><strong>Increase and guardrail the adoption of auto-merge</strong></li></ul><p>The auto-merge feature significantly accelerates team velocity by eliminating tedious manual tasks. However, we need to make sure that strict <strong>guardrails</strong> are in place to ensure that this increased speed does not come at the expense of security.<br/></p><ul><li><strong>Improve observability around end-user experience</strong></li></ul><p>A critical priority for our roadmap is enhancing observability, not just at the platform level but also specifically from the <strong>end-user’s perspective</strong>. While capturing infrastructure metrics is straightforward, understanding the actual user experience requires deeper insights. We’re working to define core user-centric key performance indicators (KPIs) so our telemetry can detect friction points and performance issues <strong>before</strong> they escalate into user complaints.</p><ul><li><strong>Remove barriers for greater adoption</strong></li></ul><p>Looking ahead, our priority is to identify and remove any barriers hindering platform adoption. Whether this requires developing new integrations or deploying specific feature sets, we’re committed to data-driven planning. We’ve successfully built a platform designed for scale; our focus now shifts to <strong>maximizing its potential</strong>.<br/></p><h2><strong>The bigger picture</strong></h2><p>The dependency management workflows project demonstrates a broader principle: <strong>When you need to scale open source tools beyond their default deployment model, Kubernetes-native patterns provide a path forward</strong>.</p><p>By embracing:</p><ul><li>CRDs for configuration.</li><li>Operators for lifecycle management.</li><li>Event-driven architecture for responsiveness.</li><li>GitOps for deployment.</li></ul><p>we built orchestration that scales independently of the number of repositories it manages. The performance of scanning one repository is the same whether we’re managing 100 or 1,000.</p><p>When a critical CVE is announced, we now have answers in minutes, not hours. That’s the difference between a bottleneck and a competitive advantage.</p><h2><strong>Acknowledgments</strong></h2><p>This platform builds on excellent open source tools:</p><ul><li><strong>Kubebuilder:</strong> The open source framework we used to kick-start our Kubernetes Operators that bootstrap and orchestrate our workflows. [<a href="https://github.com/kubernetes-sigs/kubebuilder">1</a>][<a href="https://book.kubebuilder.io/">2</a>]</li><li><strong>Backstage:</strong> The open source framework on which we’ve built our Service Catalog and which we use as our source of truth. [<a href="https://github.com/backstage/backstage">1</a>][<a href="https://backstage.io/">2</a>]</li><li><strong>Argo Workflows and Argo Events:</strong> The open source suite we used to orchestrate complex processes and add dynamic processing based on events. 
[<a href="https://github.com/argoproj/argo-workflows">1</a>][<a href="https://argo-workflows.readthedocs.io/en/stable/">2</a>][<a href="https://argoproj.github.io/argo-events/">3</a>][<a href="https://github.com/argoproj/argo-events">4</a>]</li><li><strong>Renovate CLI:</strong> The open source dependency management tool processing our repositories. [<a href="https://github.com/renovatebot/renovate">1</a>][<a href="https://docs.renovatebot.com/getting-started/running/">2</a>]</li></ul><p>* The AWS Fargate pricing model was used as a reference for of the cost of a single pod, although our workloads are not running necessarily on AWS and are running on full- blown Kubernetes clusters.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Better text analysis for complex languages with Elasticsearch and neural models]]></title>
            <description><![CDATA[Using neural models and the Elasticsearch inference API to improve search in Hebrew, German, Arabic, and other morphologically complex languages.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-text-analysis-neural-model</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-text-analysis-neural-model</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Noam Schwartz]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a572f9832d8ebc603b70743ac8f2d6e4ea8d2e11-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>If you work with English search, standard text analysis usually just works. You index “running,” the analyzer strips the suffix to store “run,” and a user searching for “run” finds the document. Simple.</p><p>But if you work with languages like Hebrew, Arabic, German, or Polish, you know that standard rule-based analyzers often fail. They either under-analyze (missing relevant matches) or over-analyze (returning garbage results).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2484b1f7ce600fbbf75b76a12a67cdfdf9b6e6ab-800x600.jpg" style="width: 100%;" /><p>For years, we’ve had to rely on complex dictionaries and fragile regex rules. Today, we can do better. By replacing rule-based logic with <strong>neural models for text analysis</strong> (small, efficient language models that understand context), we can drastically improve search quality.</p><p>Here’s how to solve the morphology challenge by using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"><strong>Elasticsearch inference API</strong></a> and a custom model service.</p><h2><strong>The problem: Why rules fail</strong></h2><p>Most standard analyzers are <strong>context-free</strong>. They look at one word at a time and apply a static set of rules.</p><ul><li><strong>Algorithmic analyzers</strong> (like Snowball) strip suffixes based on patterns.</li><li><strong>Dictionary analyzers</strong> (like Hunspell) look up words in a list.</li></ul><p>This approach breaks down when the structure of a word (its root and affixes) changes based on the sentence it lives in.</p><h3><strong>1. The Semitic ambiguity (roots versus prefixes)</strong></h3><p>Semitic languages, like Hebrew and Arabic, are built on root systems and often attach prepositions (such as in, to, or from) directly to the word. This creates ambiguous tokens that rule-based systems cannot solve.</p><ul><li><strong>Word:</strong> <code>בצל</code> (B-Tz-L).</li><li><strong>Context A:</strong> “The soup tastes better with <strong>onion</strong> (<em>batzal</em>).”</li><li><strong>Context B:</strong> “We sat <strong>in the shadow</strong> (<em>ba-tzel</em>) of the tree.”</li></ul><p>In Context A, <code>בצל</code> is a noun (onion). In Context B, it’s a preposition ב (in) attached to the noun <code>צל</code> (shadow).</p><p>A standard analyzer is forced to guess. If it aggressively strips the ב prefix, it turns &quot;onion&quot; into &quot;shadow.&quot; If it’s conservative and leaves it alone, a user searching for &quot;shadow&quot; (<em>tzel</em>) will fail to find documents containing &quot;in the shadow&quot; (<em>batzel</em>). Neural models solve this by reading the sentence to determine whether the ב is part of the root or a separate preposition.</p><h3><strong>2. The compound problem (German, Dutch, and more)</strong></h3><p>Languages like German, Dutch, Swedish, and Finnish concatenate nouns without spaces to form new concepts. This results in a theoretically infinite vocabulary. To search effectively, you must split (decompound) these words.</p><ul><li><strong>Word:</strong> <code>Wachstube</code>.</li><li><strong>Split A:</strong> <code>Wach</code> (guard) + <code>Stube</code> (room) = guardroom.</li><li><strong>Split B:</strong> <code>Wachs</code> (wax) + <code>Tube</code> (tube) = wax tube.</li></ul><p>A dictionary-based decompounder acts blindly. 
If both “Wach” and “Wachs” are in its dictionary, it might pick the wrong split, polluting your index with irrelevant tokens.</p><p>To see this problem in English: A naive algorithm might split “carpet” into “car” + “pet.” Without understanding meaning, rules fail.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eefee9dc6206452d362f8f58dc35c793021dcb1e-800x524.jpg" style="width: 100%;" /><h2><strong>The solution: “Neural analyzers” (neural models for text analysis)</strong></h2><p>We don’t need to abandon the inverted index. We just need to feed it better tokens.</p><p>Instead of a regex rule, we use a <strong>neural model</strong> (like BERT or T5) to perform the analysis. Because these models are trained on massive datasets, they understand context. They look at the surrounding words to decide whether <code>בצל</code> means &quot;onion&quot; or &quot;in shadow&quot; or if <code>Wachstube</code> belongs in a military or cosmetic context.</p><h3><strong>Architecture: The inference sidecar</strong></h3><p>We can integrate these Python-based models directly into the Elasticsearch ingestion pipeline using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"><strong>inference API</strong></a>.</p><p><strong>The pattern:</strong></p><ol><li><strong>External model service:</strong> A simple Python service (for example, FastAPI) hosts the model.</li><li><strong><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom">Elasticsearch inference API</a>:</strong> Defines this service as a custom model within Elasticsearch.</li><li><strong>Ingest pipeline:</strong> Sends text to the inference processor, which calls your Python service.</li><li><strong>Index mapping: </strong>Create a <code>whitespace</code> target field for the analyzed text.</li><li><strong>Indexing:</strong> The service returns the cleaned text, which Elasticsearch stores in the target field.</li><li><strong>Search:</strong> Queries are analyzed via the inference API before matching.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5c3055a1594f267c676347da36b1b8b2b187220c-1600x1248.png" style="width: 100%;" /><h2><strong>Implementation guide</strong></h2><p>Let’s build this for <strong>Hebrew</strong> (using <code>DictaBERT</code>) and <strong>German</strong> (using <code>CompoundPiece</code>).</p><p>To follow along, you’ll need:</p><ul><li>Python 3.10+.</li><li>Elasticsearch 8.9.x+.</li></ul><p>Install the Python dependencies:</p><pre><code>pip3 install fastapi uvicorn torch transformers</code></pre><h3><strong>Step 1: External model service</strong></h3><p>To connect Elasticsearch to our neural model, we need a simple API service that:</p><ol><li>Receives text from the Elasticsearch inference API.</li><li>Passes it through the neural model.</li><li>Returns analyzed text in a format Elasticsearch understands.</li></ol><p>This service interfaces Elasticsearch with the neural model. At ingest time, the Elasticsearch pipeline calls this API to analyze and store document fields; at search time, the application calls it to process the user&#x27;s query. You can deploy this on any infrastructure, including EC2, Lambda, or SageMaker.</p><p>The code below loads both models at startup and exposes <code>/analyze/hebrew</code> and <code>/analyze/german</code> endpoints:</p><pre><code>from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Union
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from contextlib import asynccontextmanager
import torch

# Global models (loaded once at startup)
he_model = None
he_tokenizer = None
de_model = None
de_tokenizer = None


@asynccontextmanager
async def lifespan(app: FastAPI):
   """Load models at startup."""
   global he_model, he_tokenizer, de_model, de_tokenizer

   print("Loading Hebrew model (DictaBERT-Lex)...")
   he_tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert-lex")
   he_model = AutoModel.from_pretrained("dicta-il/dictabert-lex", trust_remote_code=True)
   he_model.eval()

   print("Loading German model (CompoundPiece)...")
   de_tokenizer = AutoTokenizer.from_pretrained("benjamin/compoundpiece")
   de_model = AutoModelForSeq2SeqLM.from_pretrained("benjamin/compoundpiece")

   if torch.cuda.is_available():
       he_model.to("cuda")
       de_model.to("cuda")

   print("Models loaded successfully!")
   yield
   print("Shutting down...")


app = FastAPI(
   title="Neural Text Analyzer",
   description="Multi-language text normalization service",
   version="1.0.0",
   lifespan=lifespan
)


class InferenceRequest(BaseModel):
   """ES Inference API sends: {"input": ["text1", "text2"]} or {"input": "text"}"""
   input: Union[str, List[str]]


def format_response(normalized_text: str) -> dict:
   """
   Normalize output to an OpenAI-compatible format for the ES inference API,
   which extracts: $.choices[*].message.content
   You do not need to stick with the OpenAI output format; it is used here
   for consistency, since the endpoint is registered with the completion task type.
   """
   return {
       "choices": [
           {"message": {"content": normalized_text}}
       ]
   }


@app.post("/analyze/hebrew")
async def analyze_hebrew(request: InferenceRequest):
   """Hebrew lemmatization using DictaBERT-Lex."""
   global he_model, he_tokenizer

   if he_model is None:
       raise HTTPException(status_code=503, detail="Model not loaded")

   # Handle input (can be string or list)
   if isinstance(request.input, str):
       texts = [request.input]
   else:
       texts = request.input

   # Run prediction
   with torch.no_grad():
       results = he_model.predict(texts, he_tokenizer)

   # results format: [[[word, lemma], [word, lemma], ...]]
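   # Note: only the first input text (results[0]) is normalized below; loop
   # over all entries in `results` if you send batched inputs.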
   if results and results[0]:
       lemmas = []
       for word, lemma in results[0]:
           if lemma == '[BLANK]':
               lemma = word
           lemmas.append(lemma)
       normalized = " ".join(lemmas)
   else:
       normalized = ""

   return format_response(normalized)


@app.post("/analyze/german")
async def analyze_german(request: InferenceRequest):
   """German decompounding using CompoundPiece (supports 56 languages)."""
   global de_model, de_tokenizer

   if de_model is None:
       raise HTTPException(status_code=503, detail="Model not loaded")

   # Handle input
   if isinstance(request.input, str):
       text = request.input
   else:
       text = request.input[0] if request.input else ""
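    # Note: only the first string of a batched request is decompounded here;
    # extend this to loop over request.input for true batch support.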

   # Format: "de: <word>" for German
   input_text = f"de: {text}"

   inputs = de_tokenizer(input_text, return_tensors="pt")
   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}

   with torch.no_grad():
       outputs = de_model.generate(**inputs, max_length=128)

   # IMPORTANT: decode outputs[0], not outputs
   result = de_tokenizer.decode(outputs[0], skip_special_tokens=True)

   # Clean up: "de: Donau-Dampf-Schiff" -> "Donau Dampf Schiff"
   # Note: model returns "de: " (with space after colon)
   if result.startswith("de: "):
       clean_result = result[4:].replace("-", " ")
   elif result.startswith("de:-"):
       clean_result = result[4:].replace("-", " ")
   elif result.startswith("de:"):
       clean_result = result[3:].replace("-", " ")
   else:
       clean_result = result.replace("-", " ")

   return format_response(clean_result.strip())


@app.get("/health")
async def health():
   return {"status": "healthy"}</code></pre><p>Save the code above to a file (for example, <code>analyzer_service.py</code>), and run:</p><pre><code>python3 -m uvicorn analyzer_service:app --port 8000</code></pre><p>Wait for “<em>Models loaded successfully!</em>” (takes ~30–60 seconds for models to download on first run).</p><p>Test locally:</p><pre><code>#Hebrew
curl -X POST http://localhost:8000/analyze/hebrew \
 -H "Content-Type: application/json" \
 -d '{"input": "הילדים אכלו גלידה בגינה"}'</code></pre><pre><code>#German
curl -X POST http://localhost:8000/analyze/german \
 -H "Content-Type: application/json" \
 -d '{"input": "Donaudampfschifffahrt"}'</code></pre><p>Expected output:</p><pre><code>- Hebrew: `{"choices":[{"message":{"content":"ילד אוכל גלידה גינה"}}]}`
- German: `{"choices":[{"message":{"content":"Donau Dampf Schiff Fahrt"}}]}`</code></pre><h3><strong>Step 2: Configure Elasticsearch inference API</strong></h3><p>We’ll use the<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom"> <code>custom</code></a> inference endpoint. This allows us to define exactly how Elasticsearch talks to our Python endpoint.</p><p><strong>Note:</strong> Use <code>response.json_parser</code> to extract the content from our normalized JSON structure. You <strong>do not</strong> need to stick with the OpenAI output format. We’re using it here for consistency reasons, since we’re using the <em>completion</em> task type, which is text to text.</p><h4><strong>Exposing your local service</strong></h4><p>For testing, we’ll use <a href="https://ngrok.com">ngrok</a> to expose the local Python service to the internet. This allows any Elasticsearch deployment (self-managed, Elastic Cloud, or Elastic Cloud Serverless) to reach your service.</p><p>Install and run ngrok:</p><pre><code># Install ngrok (macOS) (Or download from https://ngrok.com/download)
brew install ngrok</code></pre><p>Expose your local service:</p><pre><code>ngrok http 8000</code></pre><p>ngrok will display a forwarding URL like:</p><p>Forwarding <a href="https://abc123.ngrok.io">https://abc123.ngrok.io</a> -&gt; <a href="http://localhost:8000">http://localhost:8000</a></p><p>Copy the HTTPS URL. You’ll use this in the Elasticsearch configuration.</p><p><strong>Configure the inference endpoint</strong></p><pre><code>PUT _inference/completion/hebrew-analyzer
{
  "service": "custom",
  "service_settings": {
    "url": "https://abc123.ngrok.io/analyze/hebrew",
    "headers": {
      "Content-Type": "application/json"
    },
    "request": "{\"input\": ${input}}",
    "response": {
      "json_parser": {
        "completion_result": "$.choices[*].message.content"
      }
    }
  }
}</code></pre><p>Replace <a href="https://abc123.ngrok.io">https://abc123.ngrok.io</a> with your actual ngrok URL.</p><p><strong>Note:</strong> ngrok is used here for fast testing and development. The free tier has request limits, and URLs change on restart. For production, deploy your service to a persistent infrastructure.</p><h4><strong>For production (with API Gateway)</strong></h4><p>In production, deploy your Python service to a secure, persistent endpoint (such as AWS API Gateway + Lambda, EC2, ECS, or any cloud provider). Use <code>secret_parameters</code> to securely store API keys:</p><pre><code>PUT _inference/completion/hebrew-analyzer
{
  "service": "custom",
  "service_settings": {
    "url": "https://your-api-gateway.execute-api.region.amazonaws.com/prod/analyze/hebrew",
    "headers": {
      "x-api-key": "${api_key}",
      "Content-Type": "application/json"
    },
    "secret_parameters": {
      "api_key": "YOUR-API-KEY"
    },
    "request": "{\"input\": ${input}}",
    "response": {
      "json_parser": {
        "completion_result": "$.choices[*].message.content"
      }
    }
  }
}</code></pre><h3><strong>Step 3: Ingest pipeline</strong></h3><p>Create a pipeline that passes the raw text field to our model and stores the result in a new field.</p><pre><code>PUT _ingest/pipeline/hebrew_analysis_pipeline
{
 "description": "Lemmatizes Hebrew text using a custom inference endpoint",
 "processors": [
   {
     "inference": {
       "model_id": "hebrew-analyzer",
       "input_output": {
         "input_field": "content",
         "output_field": "content_analyzed"
       }
     }
   }
 ]
}</code></pre><h3><strong>Step 4: Index mapping</strong></h3><p>This is the most critical step. The output from our neural model is already analyzed. We <strong>do not</strong> want a standard analyzer to mess it up again. We use the <code>whitespace</code> analyzer to simply tokenize the text we received.</p><pre><code>PUT /my-hebrew-index
{
 "mappings": {
   "properties": {
     "content": {
       "type": "text",
       "analyzer": "standard"
     },
     "content_analyzed": {
       "type": "text",
       "analyzer": "whitespace"
     }
   }
 }
}</code></pre><h3><strong>Step 5: Indexing</strong></h3><p><strong>Option A: Single document.</strong></p><pre><code>POST /my-hebrew-index/_doc?pipeline=hebrew_analysis_pipeline
{
"content": "הילדים אכלו גלידה בגינה"
}</code></pre><p><strong>Option B: Reindex existing data.</strong></p><p>If you have existing data in another index, reindex it through the pipeline:</p><pre><code>POST _reindex
{
 "source": {
   "index": "my-old-index"
 },
 "dest": {
   "index": "my-hebrew-index",
   "pipeline": "hebrew_analysis_pipeline"
 }
}</code></pre><p><strong>Option C: Set pipeline as default for index.</strong></p><p>Make all future documents automatically use the pipeline:</p><pre><code>PUT /my-hebrew-index/_settings
{
"index.default_pipeline": "hebrew_analysis_pipeline"
}</code></pre><p>Then index normally (no <code>?pipeline=</code> needed):</p><pre><code>POST /my-hebrew-index/_doc
{
"content": "הילדים אכלו גלידה בגינה"
}</code></pre><h3><strong>Step 6: Search</strong></h3><p>Searching with a neural analyzer in Elasticsearch is a two-step process: analyze the query first using the inference API, then search with the result.</p><p><strong>A. Analyze the query.</strong></p><pre><code> POST _inference/completion/hebrew-analyzer
 {
   "input": "הילדים אכלו גלידה בגינה"
 }</code></pre><p><strong>B. Search with the result.</strong></p><pre><code> GET /my-hebrew-index/_search
 {
   "query": {
     "match": {
       "content_analyzed": "ילד אוכל גלידה גינה"
     }
   }
  }</code></pre><p>In production, wrap these two calls in your application code for a seamless experience.</p><h2><strong>Available models</strong></h2><p>The architecture above works for any language. You simply swap the Python model and adjust the post-processing of the output. Here are verified models for common complex languages:</p><ul><li><strong>Hebrew:</strong> Context-aware lemmatization. Handles prefix ambiguity (ב, ה, ל, and more). <a href="https://huggingface.co/dicta-il/dictabert-lex">dicta-il/dictabert-lex</a>.</li><li><strong>German: </strong>Generative decompounding. Supports 56 languages, including Dutch, Swedish, Finnish, and Turkish. <a href="https://huggingface.co/benjamin/compoundpiece">benjamin/compoundpiece</a>.</li><li><strong>Arabic:</strong> BERT-based disambiguation and lemmatization for Modern Standard Arabic. <a href="https://github.com/CAMeL-Lab/camel_tools">CAMeL Tools</a>.</li><li><strong>Polish:</strong> Case-sensitive lemmatization for Polish inflections. <a href="https://huggingface.co/amu-cai/polemma-large">amu-cai/polemma-large</a>.</li></ul><h2><strong>Conclusion</strong></h2><p>You don’t need to choose between the precision of lexical search and the intelligence of AI. By moving the “smart” part of the process into the analysis phase using the inference API, you fix the root cause of poor search relevance in complex languages.</p><p>The tools are here. The models are open-source. The pipelines are configurable. It’s time to teach our search engines to read.</p><h3><strong>Code</strong></h3><p>All code snippets from this article are available at <a href="https://github.com/noamschwartz/neural-text-analyzer">https://github.com/noamschwartz/neural-text-analyzer</a>.</p><p><strong>References</strong>:</p><ul><li><a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom">https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom</a></li><li><a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines</a></li><li><a href="https://ngrok.com">https://ngrok.com</a></li><li><a href="https://huggingface.co/dicta-il/dictabert-lex">https://huggingface.co/dicta-il/dictabert-lex</a></li><li><a href="https://huggingface.co/benjamin/compoundpiece">https://huggingface.co/benjamin/compoundpiece</a></li><li><a href="https://arxiv.org/pdf/2305.14214">https://arxiv.org/pdf/2305.14214</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An open‑source Hebrew analyzer for Elasticsearch lemmatization]]></title>
            <description><![CDATA[An open-source Elasticsearch 9.x analyzer plugin that improves Hebrew search by lemmatizing tokens in the analysis chain for better recall across Hebrew morphology.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-lemmatization-hebrew-analyzer</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-lemmatization-hebrew-analyzer</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Lily Adler]]></dc:creator>
            <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8d9f79cec59d89f4e14657db7df846ed3104a2da-1024x565.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Hebrew is morphologically rich: Prefixes, inflections, and clitics make exact-token search brittle. This project provides an open-source Hebrew analyzer plugin for Elasticsearch 9.x that performs neural lemmatization in the analysis chain, using an embedded DictaBERT model executed in-process via ONNX Runtime with an INT8-quantized model.</p><h2><strong>Quick start</strong></h2><p>Download the relevant <a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin/releases">release</a> or build and install (Linux build script generates Elasticsearch‑compatible zip):</p><pre><code>./scripts/build_plugin_linux.sh</code></pre><p>Install in Elasticsearch:</p><pre><code>/path/to/elasticsearch/bin/elasticsearch-plugin install file:///path/to/heb-lemmas-embedded-plugin-<ES_VERSION>.zip</code></pre><p>Test:</p><pre><code>curl -k -X POST "https://localhost:9200/_analyze" \
  -H "Content-Type: application/json" \
  -u "elastic:<password>" \
  -d '{"tokenizer":"whitespace","filter":["heb_lemmas","heb_stopwords"],"text":"הילדים אוכלים את הבננות"}'</code></pre><h2><br/>Why Hebrew search is different</h2><p>Hebrew is morphologically rich: Prefixes, suffixes, inflection, and clitics all collapse into a single surface form. That makes naive tokenization insufficient. Without true lemmatization, search quality suffers; users miss relevant results due to simple variations in form. This project tackles that by embedding a Hebrew lemmatization model inside the analyzer itself, so every token passes through a neural model before indexing and querying.</p><h3>Example</h3><p>Users may search for the lemma “בית” (house), but documents might contain:</p><ul><li>בית (a house)</li><li>בבית (in the house)</li><li>לבית (to the house)</li><li>בבתים (in houses)</li><li>לבתים (to houses)</li></ul><p>Without lemmatization, these become different surface tokens; lemmatization normalizes them toward the same lemma (בית), improving recall:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d3dfd40569b3f4bfc79df639a49ae995e92b0bc1-1600x983.png" style="width: 100%;" /><h2>What this plugin does</h2><p>Rather than relying on rule-based stemming, the analyzer runs a Hebrew lemmatization model as part of the Elasticsearch analysis chain and emits one normalized lemma per token. Because the model is neural, it can use local context within each analyzed segment to choose a lemma in ambiguous cases—while still producing stable tokens that work well for indexing and querying. The analyzer:</p><ul><li>Runs a Hebrew lemmatization model inside Elasticsearch.</li><li>Produces better normalized tokens for Hebrew text.</li><li>Supports stopwords and standard analyzer pipelines.</li></ul><h2>The result: Fast, reliable lemmatization</h2><p>This analyzer is optimized for real‑world throughput:</p><ul><li>ONNX Runtime in‑process inference.</li><li>INT8-quantized model for lower latency and memory footprint.</li><li>Java Foreign Function Interface (FFI) for high‑performance native inference.</li></ul><p>The result: fast, reliable lemmatization with predictable operational behavior.</p><p>To evaluate performance, we ran a benchmark in a Docker container (4 cores, 12 GB RAM) on 1 million large documents (5.7 GB of data) from the Hebrew Wikipedia dataset. You’ll find the results below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric (search)</td><td style="border: 1px solid #ccc; padding: 4px;">Task</td><td style="border: 1px solid #ccc; padding: 4px;">Value</td><td style="border: 1px solid #ccc; padding: 4px;">Unit</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Min throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">409.75</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Mean throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">490.65</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Median throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">491.85</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max throughput</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">496.13</td><td style="border: 1px solid #ccc; padding: 4px;">ops/s</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">7.02242</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">90th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">10.7338</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">19.0406</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99.9th percentile latency</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">27.165</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">50th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">7.02242</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">90th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">10.7338</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">19.0406</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">99.9th percentile service time</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">27.165</td><td style="border: 1px solid #ccc; padding: 4px;">ms</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Error rate</td><td style="border: 1px solid #ccc; padding: 4px;">hebrew-query-search</td><td style="border: 1px solid #ccc; padding: 4px;">0</td><td style="border: 1px solid #ccc; padding: 4px;">%</td>
                    </tr>
              </tbody>
            </table>
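          <p>The <code>_analyze</code> call in the quick start exercises the token filters directly; to search with them, wire them into an index analyzer. Below is a minimal sketch using the official Python client; the index and analyzer names are illustrative, and only the <code>heb_lemmas</code> and <code>heb_stopwords</code> filter names come from the plugin itself:</p><pre><code>from elasticsearch import Elasticsearch

# Connect the same way as the curl test above (self-signed certs, basic auth).
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "<password>"),
    verify_certs=False,
)

# Create an index whose "content" field is lemmatized at index and query time.
es.indices.create(
    index="hebrew-docs",  # illustrative name
    settings={
        "analysis": {
            "analyzer": {
                "hebrew_lemma": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["heb_lemmas", "heb_stopwords"],
                }
            }
        }
    },
    mappings={"properties": {"content": {"type": "text", "analyzer": "hebrew_lemma"}}},
)</code></pre>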
          <h2>Open source and Elastic‑ready</h2><p>The plugin is fully open source and works on:</p><ul><li>Elastic open‑source distributions.</li><li>Elastic Cloud.</li></ul><p>You can build it yourself or download prebuilt releases and install it like any other plugin.</p><p>To upload the analyzer plugin to Elastic Cloud, navigate to the <strong>Extensions</strong> section within your Elastic Cloud console and proceed with the upload.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e2e8731aeb884e91624f2c8c0998cf8de08a16e3-1402x1600.png" style="width: 100%;" /><h2>Credits</h2><p>This project is a fork of the Korra.ai Hebrew analysis plugin (MIT), which was implemented by <a href="http://Korra.ai">Korra.ai</a> with funding and guidance from the National NLP Program led by MAFAT and the Israel Innovation Authority.</p><p>This fork focuses on Elasticsearch 9.x compatibility and running lemmatization fully in-process via ONNX Runtime, using an INT8‑quantized model and bundled Hebrew stopwords. Lemmatization is powered by DictaBERT <a href="https://huggingface.co/dicta-il/dictabert-lex"><code>dicta-il/dictabert-lex</code></a> (CC‑BY‑4.0).</p><p>Huge thanks to the Dicta team for making high-quality Hebrew natural language processing (NLP) models available to the community.</p><h2>Links</h2><ul><li><a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin">GitHub repo</a></li><li><a href="https://github.com/liladler/elasticsearch-analysis-hebrew-plugin/releases">Releases</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch 9.3 adds bfloat16 vector support]]></title>
            <description><![CDATA[Exploring the new Elasticsearch element_type: bfloat16, which can halve your vector data storage.]]></description>
            <link>https://www.elastic.co/search-labs/blog/bfloat16-vector-support-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/bfloat16-vector-support-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Simon Cooper]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f80f988d758f53742f6b4cd13b04d0cb27af7a17-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elasticsearch 9.3 brings with it several important improvements to vector data, including a new <code>element_type: bfloat16</code>. This has the potential to <em>halve</em> your vector data storage, with minimal reductions in recall and runtime performance for most use cases.</p><h2>Storage formats in dense_vector fields</h2><p>Prior to 9.3, <code>dense_vector</code> fields support vectors of single bits, 1-byte integers, and 4-byte floats. We store the original vectors on top of any quantization and/or hierarchical navigable small world (HNSW) graph used for indexing, and the original vectors make up the vast majority of the required disk space of the vector indices. If your vectors are floating point, then the only option versions of Elasticsearch prior to 9.3 provide is to store 4 bytes per vector value: That’s 4kB for a single 1024-dimensional vector.</p><p>There are other floating-point sizes available, of course: <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE-754</a> specifies floating-point sizes of many different lengths, including the 4-byte <code>float32</code> and 8-byte <code>float64</code> used by Java <code>float</code> and <code>double</code> types. It also specifies a <code>float16</code> format, which only uses 2 bytes per value. However, this only has a maximum value of 65,504, compared to the 3.4×10<sup>38</sup> of 4-byte <code>float32</code> values, and the conversion between the two involves several arithmetic operations.</p><p>As an alternative, many machine learning (ML) applications now use <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a>, which is a modification of IEEE-754 <code>float32</code> to only use 2 bytes. It does this by discarding the lowest 2 bytes of the fractional part of the value, leaving the sign and exponent unchanged. This effectively reduces the precision of the floating-point value <em>without</em> a corresponding reduction in range. The conversion from <code>float32</code> to <code>bfloat16</code> is a simple bitwise truncation on the <code>float32</code> value, with a bit of jiggling to account for rounding.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b0f8d88cfb8c696b8ef805424d7dc7a242023484-913x394.png" style="width: 100%;" /><h2>bfloat16 in Elasticsearch 9.3</h2><p>Elasticsearch 9.3 now supports storing vector element types as bfloat16. In memory, it will still process every vector value as a 4-byte float32, as Java does not have built-in support for bfloat16. As it writes vector data to disk, it will simply truncate and round each float32 value to a 2-byte bfloat16, and zero-expand each bfloat16 value back to float32 on reading the value into memory.</p><p>This effectively <em>halves</em> your vector index sizes, as it uses 2 bytes per value rather than 4 bytes. There may be a small performance cost during reading and writing data as Elasticsearch performs the necessary conversions, but this is often counterbalanced by a significant reduction in the I/O required, as the OS now has to read half as much data. And, for most datasets, there is a minimal effect on search recall.</p><p>As an example, this is the difference in sizes for bfloat16 on our <code>dense_vector</code> dataset:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d7d2459cead1b8d29d107a1b6bbbe21be1fc7315-1600x990.png" style="width: 100%;" /><p>So, if your input vectors are already at bfloat16 precision, then happy days! 
Elasticsearch accepts raw bfloat16 vectors as float values, and as <a href="https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion">Base64-encoded vectors</a>. The vectors are persisted to disk with the same precision as your original source data, immediately halving your data storage requirements.</p><p>If your input vectors are at 4-byte precision, then you can also use bfloat16 format to halve your index data sizes. Elasticsearch will truncate and round each value to 2-byte precision, throwing away the least significant bits of the fraction. This means that the vector values you get back from Elasticsearch won’t be exactly the same as what you originally indexed, so don’t use bfloat16 if you need to maintain the full 4-byte precision of float32.</p><p>Starting in Elasticsearch 9.3, and on Elasticsearch Serverless, you can specify <code>element_type: bfloat16</code> with all <code>dense_vector</code> index types on any newly created indices. If you wish to use bfloat16 with existing indices, you can reindex into an index with <code>element_type: bfloat16</code> and Elasticsearch will automatically convert your existing float vectors to bfloat16.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI agents that perform actions: Automating IT requests with Agent Builder and Workflows]]></title>
            <description><![CDATA[Using  Elastic Agent Builder and Workflows to create an AI agent that automatically performs IT actions, such as laptop refreshes.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-one-workflow</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-one-workflow</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sri Kolagani, Ziyad Akmal]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80e98ed860633a0a20abcc55ad10b2854a4e8df0-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the world of IT operations, context switching is the enemy of productivity. For internal teams, simple requests, like a laptop refresh or employee onboarding, often require navigating multiple portals, filling out rigid forms, and manually updating information technology service management (ITSM) tools like ServiceNow.</p><p>At a recent <strong>DevFest</strong>, we demonstrated how to bridge the gap between natural language requests and structured IT workflows. By combining <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder"><strong>Elastic Agent Builder</strong></a> with <a href="https://www.elastic.co/docs/explore-analyze/workflows"><strong>Elastic Workflows</strong></a>, we can create AI assistants that not only answer questions but also perform complex actions.</p><p>In this post, we’ll dive into the architecture from that talk, specifically looking at how we built an automated &quot;Laptop Refresh&quot; workflow. We’ll demonstrate how to configure an agent that collects user requirements and triggers a server-side automation to interact directly with ServiceNow APIs.</p><p><strong>Watch the full breakdown:</strong> This post is based on our presentation at Google DevFest. You can <a href="https://www.youtube.com/watch?v=OzStbTUZqyw">watch the full session here</a> to see the demo in action.</p><h2><strong>The architecture: From chat to fulfillment</strong></h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/eb42459bfae9c2ac95f2012882ce826db5526705-1600x1000.png" style="width: 100%;" /><p><strong>Note:</strong> The technical implementation described in this document is a streamlined version of the full production environment. While the <strong>architecture diagram</strong> provided serves as an accurate structural reference for the actual deployment, the accompanying text and code snippets have been simplified for illustrative purposes and may differ from the final, complex configurations used in the live implementation.</p><p>The goal is to move from a manual, form-heavy process to a conversational interface. Instead of a user navigating a catalog, they simply tell the AI assistant that they’re due for a laptop upgrade.</p><p>As illustrated above, the flow consists of three distinct layers:</p><p><strong>1. Interaction layer (ElasticGPT/Agent Builder):</strong> The user interacts naturally with an interface powered by ElasticGPT. Behind the scenes, Agent Builder processes this conversation, handling intent detection and slot filling, to structure the data and orchestrate interactions with other internal systems.</p><ul><li><strong>Intent detection</strong><ul><li><strong>Mechanism:</strong> System prompt instruction.</li><li><strong>Implementation:</strong> The agent is explicitly told its single purpose in the <code>MISSION</code> statement. 
It doesn’t need to &quot;detect&quot; other intents because it’s scoped strictly to IT provisioning.<ul><li><em><strong>Code reference</strong>:</em> <code>MISSION: You are a specialized agent designed to collect complete employee onboarding information...</code></li></ul></li><li><strong>Constraint:</strong> If a user asks about non-IT topics (for example, &quot;What is the weather?&quot;), the <code>MISSION</code> implies that the agent should pivot back to data collection or decline, depending on the large language model’s (LLM&#x27;s) default safety alignment.</li></ul></li><li><strong>Slot filling (data collection)</strong><ul><li><strong>Mechanism:</strong> Phased conversation flow.</li><li><strong>Implementation:</strong> Instead of asking for all slots at once, the <code>DATA COLLECTION STRATEGY</code> breaks the slots into five logical phases. This prevents the context switching fatigue mentioned above.<ul><li><strong><em>Code reference:</em> </strong><code>PHASE 1: Personal Information, PHASE 2: Employment Details, and so on.</code></li></ul></li><li><strong>Validation:</strong> The prompt enforces immediate validation (for example, <code>Validate inputs immediately</code>), acting as a gatekeeper before moving to the next slot.</li></ul></li></ul><p><strong>2. Automation layer (Workflows):</strong> Once the agent has the data, it triggers a workflow. This workflow handles the logic: checking device eligibility, enforcing policy (for example, &quot;Is the laptop &gt; 3 years old?&quot;), and making API calls.</p><p><strong>3. System of record (ServiceNow):</strong> The workflow reads and writes directly to the ITSM tool to maintain audit trails and initiate fulfillment.</p><h2><strong>Step 1: Configuring the agent</strong></h2><p>The first step is defining the &quot;brain&quot; of the operation using <strong>Agent Builder</strong>. We need an agent that acts strictly within the bounds of IT provisioning. We don&#x27;t want a general chatbot; we want a data collection machine that feels like a helpful colleague.</p><p>We achieve this via a robust <strong>system prompt</strong>. The prompt dictates the agent&#x27;s operating protocol, enforcing a step-by-step data collection strategy.</p><p>Here’s the refined structure of the prompt we used. Notice how it enforces validation and logically groups questions to avoid overwhelming the user:</p><pre><code>MISSION: You are a specialized agent designed to collect complete employee onboarding information for IT equipment provisioning.

OPERATING PROTOCOL:
0. On every new chat, send a welcome message, and directly jump to data collection.

1. DATA COLLECTION STRATEGY:
   - Use a step-by-step approach across 5 clear phases
   - Validate inputs immediately

2. CONVERSATION FLOW:
   PHASE 1: Personal Information (Name, Email, Phone)
   PHASE 2: Employment Details (Job Title, Department, Manager)
   PHASE 3: Location & Shipping (Address, Country)
   PHASE 4: Technical Setup (Laptop Type, Accessories)
   PHASE 5: Confirmation

...

6. SUCCESS COMPLETION:
   After all data is collected and validated, invoke the tool "laptoprefreshworkflow" with the JSON payload.</code></pre><p>For a sample system prompt and instructions, see <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/building-actionable-ai-automating-it-requests-with-agent-builder-and-one-workflow/Try%20it%20yourself%20Agents/service_now_utility_agent.ipynb">this notebook</a>.</p><p>By explicitly instructing the agent to send the data in a specific JSON format at the end of the conversation, we ensure that the input matches exactly what our automation layer expects.</p><h2><strong>Step 2: The automation layer (Workflows)</strong></h2><p>The agent provides the <em>intent</em> and the <em>data</em>, but <strong>Workflows</strong> provides the <em>muscle</em>.</p><p>We define a workflow using a YAML configuration. This workflow acts as the bridge between the AI agent and the ServiceNow REST APIs. It handles authentication, data retrieval, and the ordering process.</p><p>Below is the workflow definition. We’ve refined the code to use secure variable handling for credentials rather than hardcoding them.</p><h3><strong>Workflow inputs</strong></h3><p>First, we define the inputs the workflow expects to receive from the agent:</p><pre><code>YAML
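# The agent's final step invokes this workflow with a JSON payload whose keys
# match the inputs declared below. A hypothetical example (values illustrative,
# not taken from the production deployment):
#   {
#     "userid": "abel.tuter",
#     "preferred-address": "221B Example Street, London",
#     "laptop-choice": "Macbook latest",
#     "laptop-keep-or-return": "return"
#   }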
version: "1"
name: Submit Laptop Refresh Request
enabled: true
triggers:
  - type: manual
inputs:
  - name: userid
    type: string
  - name: preferred-address
    type: string
  - name: laptop-choice
    default: Macbook latest
    type: string
  - name: laptop-keep-or-return
    default: return
    type: string</code></pre><h3><strong>Interacting with ServiceNow</strong></h3><p>The workflow executes a series of HTTP steps. Crucially, we first need to identify the user&#x27;s <em>current</em> asset to link the refresh request correctly.</p><p>1. Fetching computer data</p><p>We query the cmdb_ci_computer table in ServiceNow to find the asset currently assigned to the user.</p><pre><code>YAML
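# Assumption: the ServiceNow Table API returns {"result": [ ... ]}, so later
# steps read the first matching computer as result[0]. Verify the response
# shape against your own instance before relying on it.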
steps:
  - name: snow_get_computer_data
    type: http
    with:
      url: https://elasticdev.service-now.com/api/now/table/cmdb_ci_computer?assigned_to={{ inputs.userid }}
      method: GET
      headers:
        Accept: application/json
        Content-Type: application/json
        # Best Practice: Use secrets for authorization headers
        Authorization: Basic {{ secrets.servicenow_creds }}
      timeout: 30s</code></pre><p>2. Adding to cart</p><p>Once we have the asset details and the user&#x27;s preferences, we don&#x27;t just create a generic ticket. We use the ServiceNow Service Catalog API to programmatically add the specific item to a cart.</p><pre><code>YAML
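  # Note: the URL below is a simplified placeholder. The actual Service
  # Catalog "add to cart" endpoint depends on your instance and the catalog
  # item's sys_id.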
  - name: snow_post_add_item_to_cart
    type: http
    with:
      url: https://elasticdev.service-now.com/example
      method: POST
      headers:
        Accept: application/json
        Content-Type: application/json
        Authorization: Basic {{ secrets.servicenow_creds }}
      body: |
        {
            "sysparm_quantity": 1,
            "variables": {
              "caller_id_common": "{{ inputs.userid }}",
              "current_device": "{{ steps.snow_get_asset.output.data.result.sys_id }}",
              "laptop_keep_or_return": "{{ inputs.laptop-keep-or-return }}",
              "choose_your_laptop": "{{ inputs.laptop-choice }}",
              "shipping_address": "{{ inputs.preferred-address }}"
            }
        }</code></pre><p>3. Indexing the transaction</p><p>Finally, we want to keep a record of this transaction within Elasticsearch for analytics and future reference. We use the elasticsearch.index step to store the request details immediately after submission.</p><pre><code>YAML
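  # Assumption: a "snow_post_submit_order" step (omitted here for brevity;
  # the full YAML linked below includes it) submits the cart and returns the
  # request_id and sys_created_on values referenced in this step.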

  - name: index-submission-record
    type: elasticsearch.index
    with:
      index: laptop-refresh-submission-data
      id: "{{ steps.snow_post_submit_order.output.data.result.request_id }}"
      document:
        request-id: "{{ steps.snow_post_submit_order.output.data.result.request_id }}"
        user-id: "{{ inputs.userid }}"
        configuration-item: "{{ steps.snow_get_computer_data.output.data.result[0].sys_id }}"
        laptop-choice: "{{ inputs.laptop-choice }}"
        timestamp: "{{ steps.snow_post_submit_order.output.data.result.sys_created_on }}"</code></pre><p>For detailed workflow yaml, please refer <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/building-actionable-ai-automating-it-requests-with-agent-builder-and-one-workflow">here</a>.</p><h2><strong>The result</strong></h2><p>By stitching these components together, we create a seamless experience:</p><ol><li><strong>The user</strong> chats naturally with the agent to provide details.</li><li><strong>The agent</strong> structures this unstructured conversation into a JSON object.</li><li><strong>Workflow</strong> receives the JSON, validates the user&#x27;s current hardware via ServiceNow, creates the order, and indexes the result.</li></ol><p>This approach reduces a process that traditionally took users 5–10 minutes of form navigation into a quick conversation, while ensuring that IT operations retains full visibility and control.</p><p>Video demo: </p><a href="https://videos.elastic.co/watch/k7U1zBvuEJSSGvprkFnuGE">Watch video on Vidyard</a><h2><strong>Ready to build?</strong></h2><p>This pattern, using an agent for the interface and using Workflows for the execution, can be applied to almost any ITSM task, from password resets to software provisioning.</p><p>If you’re interested in trying this out, be sure to watch the <a href="https://www.youtube.com/watch?v=OzStbTUZqyw">DevFest talk</a> for the full context, and check out the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic AI Agent Builder documentation</a> to get started building your own agents today.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Entity resolution with Elasticsearch & LLMs, Part 1: Preparing for intelligent entity matching]]></title>
            <description><![CDATA[Learn what entity resolution is and how to prepare both sides of the entity resolution equation: your watch list and the articles you want to search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/entity-resolution-llm-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jessica Moszkowicz]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9908bbdc324f4187f057d54d7502055ee78f575e-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>&quot;The new Swift update is here!&quot;</p><p>When you see that headline, what do you think of? For a developer, it&#x27;s a call to action, time to dive into new syntax, concurrency models, and bug fixes for the Swift programming language. For a music fan, it&#x27;s a completely different story, a signal that Taylor Swift has just dropped a new album or is making a major announcement.</p><p>Your brain, in a fraction of a second, performs a remarkable feat of natural language processing (NLP). It doesn&#x27;t just read the word &quot;Swift&quot; in isolation; it uses the surrounding context (the headline&#x27;s source, your personal interests, and more) to resolve that single, ambiguous word to a unique, real-world entity.</p><p>In NLP, we call this ability to disambiguate <em>named entity resolution</em>, and it&#x27;s something humans do all the time. Natural language is inherently ambiguous, so we need to be able to map entities like &quot;Bill Gates&quot; to &quot;the founder of Microsoft&quot; and &quot;The Eras Tour&quot; to &quot;Taylor Swift&#x27;s concert tour&quot;. For humans, these connections come easily; for computers, not so much. Think how disappointed a Swiftie would be when they find out the article that their smart assistant recommended is actually about programming.</p><p>This same challenge becomes critical when you&#x27;re monitoring news articles for mentions of specific people or organizations. Imagine you&#x27;re tracking sanctioned entities or monitoring mentions of specific companies. You have a watch list with “Sakura Shipping Group” on it, and you want to know when articles mention the company. Simple enough, right? But what happens when an article refers to “Sakura Shipping” instead of the full legal name? Or uses an abbreviation like “SSG”? Or describes it indirectly as “a major Japanese maritime logistics firm”? Or mentions the company in Japanese, as “さくら海運グループ”? Your simple text matching won’t find these mentions, even though they all refer to the same organization. For compliance and risk monitoring use cases, missing a mention could have serious consequences. You need to catch every variation, every alias, every way an entity might be referred to.</p><p>This is the problem of <em>entity resolution</em>: identifying when different mentions in text refer to the same real-world entity and determining which entity that is. To solve this, we need a system that can handle <em>semantic search</em> (understanding meaning, not just keywords), <em>named entity recognition</em> (NER; extracting entities from text), and fast, scalable matching across millions of documents. That&#x27;s why we built this prototype on Elasticsearch. It provides built-in semantic search capabilities, integrated NER models, and the scalability needed for entity resolution.</p><p>In this series, we present an educational prototype for intelligent entity resolution that deliberately separates <em>retrieval</em> from <em>judgment and explanation</em>. Elasticsearch is used to efficiently narrow the search space by combining keyword, alias, and semantic (hybrid) search.
Once plausible entity candidates are identified, a large language model (LLM) is used to determine whether a candidate refers to the same real-world entity, and the model’s rationale is provided in natural language.</p><p>This division of responsibilities avoids treating LLMs as black-box retrievers, preserves explainability for sensitive use cases, and demonstrates a reusable design pattern for building transparent, Elasticsearch-native systems. We examine why this pattern is particularly effective for entity resolution, where ambiguity is common and explainability matters. The goal is not to present a production-ready solution but to teach the architectural principles behind building transparent entity resolution systems.</p><p><strong>Important note:</strong> This series presents an educational prototype that teaches Elasticsearch-native entity resolution using LLM judgments. We&#x27;ve made some simplifying choices (such as using Wikipedia for entity enrichment) to keep the system accessible for learning. Production systems might use different data sources, additional validation steps, or more sophisticated enrichment pipelines. The goal here is to demonstrate the core concepts and architecture, not to provide a production-ready system.</p><p>This series shows how we can help computers make these necessary connections while working with a 100% Elasticsearch-native architecture. We&#x27;ll explore three major innovations:</p><ul><li>Enhancing entities with contextual information.</li><li>Recognizing basic and complex entities with comprehensive NER.</li><li>Providing transparent reasoning through Elasticsearch candidate matching and LLM-powered explanations.</li></ul><p>We&#x27;ll also evaluate the system and identify an important optimization that improves the overall performance of the educational prototype.</p><p>In this first post of a four-part series, we’ll focus on preparing both sides of the entity resolution equation: your watch list and the articles you want to search.</p><h2>The problem: Why entity resolution requires preparation</h2><p>Entity resolution is hard because we face challenges on both sides of the matching equation. On one side, entities can be mentioned in many different ways. A company might be referred to as &quot;Microsoft&quot;, &quot;Microsoft Corporation&quot;, &quot;MSFT&quot;, or even &quot;the Redmond-based tech giant&quot;, depending on the context and writing style. On the other hand, we need to find these mentions in articles, even when they&#x27;re not obvious, such as when an article uses &quot;the Russian President&quot; or &quot;F.D.R.&quot; instead of a full name.</p><p><strong>Why we can&#x27;t just match names directly:</strong> Without proper preparation, matching becomes unreliable. You might think, &quot;But I can just search for &#x27;Tim Cook&#x27; in the text, right?&quot; Well, yes, if the article always mentions him by that exact name. But what about when it says &quot;Apple CEO&quot; instead? Or &quot;Timothy D. Cook&quot; (his full name)? Your simple text search won&#x27;t find those mentions, even though they all refer to the same person.</p><p>Without entity preparation, we can&#x27;t match &quot;the Russian President&quot; to &quot;Vladimir Putin&quot; because we don&#x27;t know what &quot;the Russian President&quot; means without context. We can&#x27;t match &quot;J.R.R. Tolkien&quot; to &quot;John Ronald Reuel Tolkien&quot; because we don&#x27;t know that they&#x27;re aliases for the same person. 
We can&#x27;t match &quot;Apple CEO&quot; to &quot;Tim Cook&quot; because we can&#x27;t understand the semantic relationship. Without indexing, finding matches means checking every entity in your watch list individually. This doesn&#x27;t scale: With thousands of entities, every match becomes slow and expensive. For sanctioned individuals monitoring, this means missing critical mentions that use aliases or alternate spellings, a failure that could have serious consequences.</p><p><strong>Why we can&#x27;t just search text directly:</strong> Entity extraction is hard for the same reason entity resolution is hard: Entities can be mentioned in many different ways. The same person might be referred to as &quot;J.R.R. Tolkien&quot;, &quot;the author of <em>The Lord of the Rings</em>&quot;, or just &quot;Tolkien&quot;, depending on the context. Without proper extraction, we can&#x27;t find these mentions in the text. We&#x27;d have to manually identify every entity mention, which doesn&#x27;t scale. We&#x27;d miss entities mentioned in nonstandard ways (for example, titles or abbreviations). We also wouldn&#x27;t capture the context around entity mentions, which is crucial for accurate matching.</p><p>The solution is a two-phase system that prepares both your watch list and the articles you want to search.</p><h2>The solution: Two-phase preparation system</h2><p>To solve entity resolution, we need to prepare both sides of the matching equation. First, we enrich and index our watch list entities to enable semantic search. Second, we extract entity mentions from articles using hybrid techniques that capture explicit and implicit references. Together, these phases create the foundation for intelligent entity matching.</p><h3>Phase 1: Preparing your watch list</h3><p>The solution to preparing entities is to <em>enrich</em> them with meaningful contexts. This enables our entity matching system to work effectively. We&#x27;ll explain how context helps in a bit, but let&#x27;s walk through the prototype&#x27;s simple implementation first.</p><p>Our watch list of entities may be provided in multiple formats. The Office of Foreign Assets Control (OFAC) provides sanctions lists that include first and last names, addresses, and identifying information, such as passport numbers, date and place of birth, and nationality information [1]. While this provides a good amount of context, in practice many of these fields are omitted when the values are unknown for the given entity. Some lists may be just a set of names. The most helpful lists for our purposes come out of the box with rich descriptions, as is often the case with commercial or curated datasets.</p><p>The three-component system used in the prototype starts by managing our entities and organizing their metadata. Since entity lists can vary in the amount of information they contain, our prototype is designed to work with whatever it receives. The JSON format supports entities with minimal information (just a name and type) or full information (with aliases, descriptions, metadata, and more). For example, an entity might be as simple as:</p><pre><code>{
  "name": "J.R.R. Tolkien",
  "entity_type": "person"
}</code></pre><p>Or it might include additional context:</p><pre><code>{
  "name": "J.R.R. Tolkien",
  "entity_type": "person",
  "description": "English writer and philologist, author of The Lord of the Rings",
  "aliases": ["John Ronald Reuel Tolkien", "J.R.R. Tolkien", "Tolkien"],
  "priority": "medium"
}</code></pre><p>The system handles both cases gracefully during enrichment. For the prototype, the enrichment process adds context from Wikipedia (specifically, the first paragraph of the entity&#x27;s Wikipedia page) for entities that don&#x27;t already have context [2]. This Wikipedia context helps with semantic matching, but it doesn&#x27;t add other fields, like aliases or full names; those must come from the original dataset. (In production, you might use other approaches for enrichment, including an agentic system that figures out where to find the context information for a given entity. This is beyond the scope of our prototype, but it’s an exciting feature we could add in the future.) Finally, we index the entities in Elasticsearch with semantic search capabilities, creating a searchable index that understands meaning rather than just text.</p><h4><strong>Key concepts: Semantic search and indexing</strong></h4><p><strong>What is semantic search?</strong> <em>Semantics</em> refers to the meanings of words and phrases. Figuring out meaning is usually easy for humans, but it&#x27;s much more challenging for computers to &quot;get&quot; because it requires a depth of understanding that’s difficult to program. Semantic search works by turning this challenge into a math problem, something that computers are very good at [3].</p><p>Think of semantic search like map coordinates for meaning. Just as latitude and longitude tell you where something is on a map, semantic embeddings tell you where something is in &quot;meaning space.&quot; Whereas traditional keyword search requires exact matches, semantic search relies on describing that &quot;location&quot; in a multidimensional vector space. For example, you might have the coordinates for a specific &quot;big red building&quot;. When you search for a &quot;small red building&quot;, semantic search looks in the &quot;neighborhood&quot; for similar concepts in the vector space. Your big red building might appear as a nearest neighbor, but the relevance score will be lower because parts of the meaning don&#x27;t match.</p><p>Getting back to our example, when you search for &quot;Apple CEO&quot;, semantic search can find &quot;Tim Cook&quot; because the semantic embeddings capture the meaning that both refer to the same person, even though they use completely different words. This capability is invaluable when monitoring for sanctioned individuals, as aliases and code names may be used to evade detection.</p><p><strong>Why Elasticsearch for entity indexing?</strong> Elasticsearch has built-in semantic search capabilities using embedding models, like EmbEddings from bidirEctional Encoder rEpresentations (E5) [4]. This means we can create an index that understands meaning, not just text. When we index our enriched entities, Elasticsearch creates semantic embeddings that capture each entity&#x27;s meaning, enabling intelligent matching later.</p><p><strong>What is the <em>mapping schema</em>?</strong> The mapping schema defines how we structure entity data in Elasticsearch. 
Our schema includes several field types optimized for different search strategies, including:</p><ul><li><strong>Keyword fields</strong> (<code>id,</code> <code>name.keyword</code>, <code>aliases.keyword</code>): For exact matching on entity names and aliases.</li><li><strong>Text fields</strong> (<code>name</code>, <code>name_lower</code>, <code>context</code>, <code>aliases</code>): For traditional, case-normalized full-text search with BM25 scoring.</li><li><strong>Semantic text fields</strong> (<code>name_semantic</code>, <code>context_semantic</code>): For vector-based similarity search using the multilingual-e5-small model.</li></ul><p>This hybrid mapping enables multiple search strategies: exact matching for precise names, keyword search for aliases, and semantic search for meaning-based matching. Even better, Elasticsearch supports <em>hybrid search</em>, allowing us to use both keyword and semantic search simultaneously.</p><h4><strong>Before and after entity preparation</strong></h4><p>Before entity preparation, you have a simple list without much context, possibly nothing more than a name: &quot;J.R.R. Tolkien&quot;. That&#x27;s it. You can only match exact text matches, which means you&#x27;ll miss &quot;John Ronald Reuel Tolkien&quot;, &quot;Tolkien&quot;, and any other variations. For sanctioned individuals, this means missing critical mentions that use aliases or alternate spellings.</p><p>After entity preparation, you have a rich, searchable index. &quot;Vladimir Putin&quot; is now enriched with Wikipedia context, and if your original dataset included aliases, like &quot;Путин&quot; or &quot;Vladimir Vladimirovich Putin&quot;, those are indexed as well. The entity also has semantic embeddings that capture its meaning. The Wikipedia context helps semantic search understand that &quot;The Russian President&quot; refers to Vladimir Putin, enabling that match. If &quot;Путин&quot; was provided as an alias in your original dataset, exact matching handles that. Semantic variations work because your semantic embeddings understand meaning. For sanctioned individuals, this comprehensive preparation ensures you catch every mention, regardless of how the name is written or what alternative name is used.</p><h3>Phase 2: Extracting entities from articles</h3><p>Now that we have a searchable watch list, we need to extract entity mentions from articles. This is where article processing comes in.</p><p><strong>Remember:</strong> This is an educational prototype designed to teach entity extraction concepts. Production systems might use different NER models, custom extraction rules, or specialized extraction pipelines tailored to specific domains or languages.</p><p>We extract entities from articles using a hybrid NER approach that combines machine learning with pattern-based extraction. First, we process articles to prepare them for extraction. Then, we extract entities using a hybrid extraction approach that combines NER performed in Elasticsearch (using a deployed XLM-RoBERTa model) with pattern-based extraction to catch entities that NER might miss.</p><p>This hybrid extraction approach provides several benefits. NER automatically finds entity mentions in text, even when they&#x27;re not obvious. Pattern-based extraction catches entities that NER might miss, like titles and compound entities. We preserve the context around each entity mention, which helps with matching decisions later. 
The approach scales well, allowing us to process thousands of articles automatically, not just a few manually.</p><h4><strong>Key concepts: NER, pattern-based extraction, and hybrid extraction approach</strong></h4><p><strong>What is NER?</strong> <em>Named entity recognition</em> is a machine learning technique that identifies named entities in text. When we run NER on an article, it finds mentions like &quot;Microsoft&quot;, &quot;Seattle&quot;, and &quot;Washington&quot; and labels them as organization, location, or person entities.</p><p><strong>Why use NER in Elasticsearch?</strong> Using NER in Elasticsearch maintains our 100% Elasticsearch-native architecture, which simplifies the entity resolution prototype&#x27;s design. Instead of managing separate services for entity extraction and search, everything runs in one system. You can perform NER during document ingestion using inference pipelines, and the extracted entities are immediately available for indexing and searching. This unified approach reduces complexity, eliminates network calls between services, and makes deployment and management easier. The XLM-RoBERTa model is trained to recognize entities in multiple languages, so we can extract entities from articles in different languages without needing separate models for each language. For information on deploying NER models in Elasticsearch, see the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-ner-example">Elasticsearch NER documentation</a>.</p><p><strong>What is pattern-based extraction?</strong> <em>Pattern-based extraction</em> uses rules and patterns to find entities that NER might miss. For example, NER might not recognize &quot;the author of <em>The Lord of the Rings</em>&quot; as an entity mention, but pattern-based extraction can catch titles and roles like &quot;the CEO&quot; or &quot;the President&quot;. However, pattern-based extraction is language-specific. The patterns need to be defined for each language you want to support. This is a significant drawback for multilingual systems, but it&#x27;s acceptable for our educational prototype, which focuses on demonstrating the core concepts. Production systems might use language-specific pattern sets or alternative approaches for multilingual support.</p><p><strong>How do they work together?</strong> The hybrid extraction approach combines both techniques. NER finds obvious entity mentions like &quot;J.R.R. Tolkien&quot;, while pattern-based extraction catches variations that NER might miss, such as &quot;the author of <em>The Lord of the Rings</em>&quot;. Together, they provide comprehensive coverage of entity mentions in text.</p><p>When we extract entities from an article mentioning &quot;the author of <em>The Lord of the Rings</em>&quot;, we get:</p><ul><li>Text: &quot;author of <em>The Lord of the Rings</em>&quot;</li><li>Type: PERSON (from pattern-based extraction)</li><li>Confidence: 0.85</li><li>Context: &quot;The author of <em>The Lord of the Rings</em> published a new edition&quot;</li></ul><h4><strong>Before and after entity extraction</strong></h4><p>With NER-only extraction, we might find &quot;J.R.R. Tolkien&quot; and &quot;<em>The Lord of the Rings</em>&quot; in the article, but we&#x27;d miss &quot;the author of <em>The Lord of the Rings</em>&quot; because NER doesn&#x27;t recognize descriptive phrases as entity mentions.</p><p>With hybrid extraction, we find both &quot;J.R.R. 
Tolkien&quot; (from NER) and &quot;the author of <em>The Lord of the Rings</em>&quot; (from pattern-based extraction). This comprehensive coverage enables better matching later, since we can match both the name and the descriptive phrase to our watch list.</p><h2>What&#x27;s next: Matching entities to our watch list</h2><p>Now that we&#x27;ve prepared both sides of the entity resolution equation, we have everything we need for intelligent matching:</p><ul><li>A searchable watch list enriched with context and indexed for semantic search.</li><li>Extracted entity mentions from articles using hybrid NER.</li></ul><p>Preparation gives us the raw ingredients, but it doesn’t tell us which entity a mention actually refers to. In the next post, we&#x27;ll explore how to match these extracted entities to our watch list using semantic search and LLM-powered judgment that handles ambiguity and context transparently.</p><h2>Try it yourself</h2><p>Want to see the preparation process in action? Check out these notebooks for complete walkthroughs with real implementations, detailed explanations, and hands-on examples:</p><ul><li><strong><a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=3%20minutes%20ago-,01_entity_preparation_v3.ipynb,-Initial%20public%20lab">Entity preparation notebook</a>:</strong> Shows you exactly how to enrich entities with Wikipedia context, create semantic search indexes, and prepare your watch list for intelligent matching.</li><li><strong><a href="https://github.com/jesslm/entity-resolution-lab-public/tree/main/notebooks#:~:text=3%20minutes%20ago-,02_article_processing_v3.ipynb,-Initial%20public%20lab">Article processing notebook</a>:</strong> Shows you exactly how to extract entities from articles using hybrid NER, handle multilingual content, and process compound entities.</li></ul><p><strong>Remember:</strong> This is an educational prototype designed to teach the concepts. When building production systems, consider additional factors, like data source reliability, validation pipelines, error handling, monitoring, compliance requirements, domain-specific NER models, custom extraction rules, and quality validation that aren&#x27;t covered in this learning-focused prototype.</p><h2>References</h2><ol><li><a href="https://sanctionslist.ofac.treas.gov/Home/">OFAC Sanctions List Search</a></li><li>The datasets used for the prototype also use a special field, &#x27;explicit_context&#x27;, in lieu of getting the context from Wikipedia. We do this to control for the entity preparation step when we&#x27;re testing other components such as entity matching.</li><li><a href="https://www.elastic.co/blog/retrieval-augmented-generation-explained#semantic-search:~:text=what%20we%E2%80%99ll%20do.-,Turning%20semantic%20search%20into%20a%20math%20problem,-Let%E2%80%99s%20begin%20with">The big ideas behind retrieval augmented generation</a></li><li><a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-e5">E5 in Elasticsearch</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From vectors to keywords: Elasticsearch hybrid search in LangChain]]></title>
            <description><![CDATA[Learn how to use hybrid search in LangChain via its Elasticsearch integrations, with complete Python and JavaScript examples.]]></description>
            <link>https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/langchain-elasticsearch-hybrid-search</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Margaret Gu, Eyo Eshetu]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9159af2b07b88f288e5c7cb719c8dcbe5d3b37d6-1080x608.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>Elasticsearch hybrid search is available for LangChain across our <a href="https://github.com/langchain-ai/langchain-elastic">Python</a> and <a href="https://github.com/langchain-ai/langchainjs">JavaScript</a> integrations. Here we’ll discuss what hybrid search is and when it can be useful, and we’ll run through some simple examples to get started.</p><p>We’re also planning to support hybrid search in the community-driven <a href="https://github.com/langchain4j/langchain4j">Java integration</a> very soon.</p><h2><strong>What is hybrid search?</strong></h2><p><em>Hybrid search</em> is an information retrieval approach that combines <em>keyword-based full-text search</em> (lexical matching) with <em>semantic search</em> (vector similarity). Practically, it means a query can match documents because they contain the right terms and/or because they express the right meaning (even if the wording differs). In simple terms, you can think of it like this:</p><ul><li>Lexical retrieval: “Do these documents contain the words I typed (or related words)?”</li><li>Semantic retrieval: “Do these documents mean something similar to what I typed?”</li></ul><p>These two retrieval methods produce scores on different scales, so hybrid search systems typically use a fusion strategy to merge them into one ranking, for example, using <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">reciprocal rank fusion</a> (RRF).</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a36692581ec5adb54d3c517e171b6d2f372efd92-1249x514.png" style="width: 100%;" /><p>In the figure above, we show an example: <a href="https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables">BM25</a> (keyword search) returns Docs A, B, and C, while semantic search returns Docs X, A, and B. The RRF algorithm then combines these two result lists into the final ranking: Doc A, Doc B, Doc X, and Doc C. With hybrid search, Doc C is included in the results thanks to BM25.</p><h2><strong>Why hybrid search matters</strong></h2><p>If you’ve built search or retrieval-augmented generation (RAG) features in production, you’ve probably seen the same failure modes show up again and again: </p><ul><li>Keyword search can be too literal. If the user doesn’t use the exact terms that appear in your documents, relevant content gets buried or missed.</li><li>Semantic search can be too fuzzy. It’s great at meaning, but it can also return results that feel related while missing a critical constraint, like a product name, an error code, or a specific phrase the user actually typed.</li></ul><p>Hybrid search exists because real user queries in production environments usually need <em>both</em>.</p><p>Next we’ll dive into how you get started with hybrid search in the LangChain integration for <a href="https://github.com/langchain-ai/langchain-elastic">Python</a> and <a href="https://github.com/langchain-ai/langchainjs">JavaScript</a>. If you want to read more about hybrid search, check out <strong><a href="https://www.elastic.co/what-is/hybrid-search">What is hybrid search?</a> </strong>and <a href="https://www.elastic.co/search-labs/blog/elasticsearch-hybrid-search"><strong>When hybrid search truly shines</strong></a>.</p><h3>Setting up a local Elasticsearch instance</h3><p>Before running the examples, you&#x27;ll need Elasticsearch running locally.
The easiest way is using the <a href="https://github.com/elastic/start-local?tab=readme-ov-file"><code>start-local</code></a> script:</p><pre><code>curl -fsSL https://elastic.co/start-local | sh</code></pre><p>After starting, you&#x27;ll have:</p><ul><li>Elasticsearch at http://localhost:9200.</li><li>Kibana at http://localhost:5601.</li></ul><p>Your API key is stored in the .env file (under the elastic-start-local folder) as <code>ES_LOCAL_API_KEY</code>.</p><h2>Getting started with hybrid search in LangChain (Python and JavaScript)</h2><p>The dataset is a CSV with information on 1,000 science fiction movies, taken from an IMDb dataset on <a href="https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre/versions/2?select=scifi.csv">Kaggle</a>. This demo uses a subset of the data, which has been cleaned. You can download the dataset used for this article from our <a href="https://gist.github.com/ssh-esh/103fb8220de3b0e045393760c2f36575">GitHub gist</a>, along with the full code for this demo.</p><h3>Step 1: Install what you need.</h3><p>First, you’ll need the LangChain Elasticsearch integration and Ollama for embeddings. (You can also use some other embedding model if you wish.)</p><p><strong>In Python:</strong></p><pre><code>pip install langchain-elasticsearch langchain-ollama</code></pre><p><strong>In JavaScript:</strong></p><pre><code>npm install @langchain/community @langchain/ollama @elastic/elasticsearch csv-parse</code></pre><h3>Step 2: Configure your connection and dataset path.</h3><p><strong>In Python:</strong></p><p>At the top of the script, we set:</p><ul><li>Where Elasticsearch is (<code>ES_LOCAL_URL</code>).</li><li>How to authenticate (<code>ES_LOCAL_API_KEY</code>).</li><li>Which demo index name to use (<code>INDEX_NAME</code>).</li><li>Which CSV file we’ll ingest (<code>scifi_1000.csv</code>).</li></ul><pre><code># Imports used throughout the Python demo script
import csv
import os
from pathlib import Path

from elasticsearch import Elasticsearch
from langchain_elasticsearch import ElasticsearchStore
from langchain_ollama import OllamaEmbeddings

ES_URL = os.getenv("ES_LOCAL_URL", "http://localhost:9200")
ES_API_KEY = os.getenv("ES_LOCAL_API_KEY")
INDEX_NAME = "scifi-movies-hybrid-demo" 
CSV_PATH = Path(__file__).with_name("scifi_1000.csv")</code></pre><p><strong>In JavaScript:</strong></p><p>Notes for JavaScript:</p><ul><li>JavaScript uses <code>process.env</code> instead of <code>os.getenv</code>.</li><li>Path resolution requires <code>fileURLToPath</code> and <code>dirname</code> because ES modules don’t provide <code>__dirname</code>.</li><li>The class is called <code>ElasticVectorSearch</code> (not <code>ElasticsearchStore</code> as in Python).</li></ul><pre><code>import { Client } from "@elastic/elasticsearch";
import { OllamaEmbeddings } from "@langchain/ollama";
import {
  ElasticVectorSearch,
  HybridRetrievalStrategy,
} from "@langchain/community/vectorstores/elasticsearch";
import { parse } from "csv-parse/sync";
import { readFileSync } from "fs";
import { dirname, join } from "path";
import { fileURLToPath } from "url";

const __dirname = dirname(fileURLToPath(import.meta.url));

const ES_URL = process.env.ES_LOCAL_URL || "http://localhost:9200";
const ES_API_KEY = process.env.ES_LOCAL_API_KEY;
const INDEX_NAME = "scifi-movies-hybrid-demo";
const CSV_PATH = join(__dirname, "scifi_1000.csv");</code></pre><p>We can now create the client.</p><p>In Python:</p><pre><code>es = Elasticsearch(ES_URL, api_key=ES_API_KEY)</code></pre><p>In JavaScript:</p><pre><code>const client = new Client({
  node: ES_URL,
  auth: ES_API_KEY ? { apiKey: ES_API_KEY } : undefined,
});</code></pre><h3>Step 3: Ingest the dataset, and then compare vector-only vs. hybrid.</h3><h4>Step 3a: Read the CSV and build what we index.</h4><p>We build three lists:</p><ul><li><code>texts</code>: The actual text that will be embedded + searched.</li><li><code>metadata</code>: Structured fields stored alongside the document.</li><li><code>ids</code>: Stable IDs (so Elasticsearch can dedupe if needed).</li></ul><p><strong>In Python:</strong></p><pre><code># --- Ingest dataset ---
texts: list[str] = []
metadatas: list[dict] = []
ids: list[str] = []

with CSV_PATH.open(newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        movie_id = (row.get("movie_id") or "").strip()
        movie_name = (row.get("movie_name") or "").strip()
        year = (row.get("year") or "").strip()
        genre = (row.get("genre") or "").strip()
        description = (row.get("description") or "").strip()
        director = (row.get("director") or "").strip()

        # This text is both:
        #  - embedded (vector search)
        #  - keyword-matched (BM25 in hybrid mode)
        text = "\n".join(
            [
                f"{movie_name} ({year})" if year else movie_name,
                f"Director: {director}" if director else "Director: (unknown)",
                f"Genres: {genre}" if genre else "Genres: (unknown)",
                f"Description: {description}" if description else "Description: (missing)",
            ]
        )
        texts.append(text)
        metadatas.append(
            {
                "movie_id": movie_id or None,
                "movie_name": movie_name or None,
                "year": year or None,
                "genre": genre or None,
                "director": director or None,
            }
        )
        ids.append(movie_id or movie_name)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>async function main() {
  // --- Ingest dataset ---
  const texts = [];
  const metadatas = [];
  const ids = [];

  const csvContent = readFileSync(CSV_PATH, "utf-8");
  const records = parse(csvContent, {
    columns: true,
    skip_empty_lines: true,
  });

  for (const row of records) {
    const movieId = (row.movie_id || "").trim();
    const movieName = (row.movie_name || "").trim();
    const year = (row.year || "").trim();
    const genre = (row.genre || "").trim();
    const description = (row.description || "").trim();
    const director = (row.director || "").trim();

    // This text is both:
    //  - embedded (vector search)
    //  - keyword-matched (BM25 in hybrid mode)
    const text = [
      year ? `${movieName} (${year})` : movieName,
      director ? `Director: ${director}` : "Director: (unknown)",
      genre ? `Genres: ${genre}` : "Genres: (unknown)",
      description ? `Description: ${description}` : "Description: (missing)",
    ].join("\n");

    texts.push(text);
    metadatas.push({
      movie_id: movieId || null,
      movie_name: movieName || null,
      year: year || null,
      genre: genre || null,
      director: director || null,
    });
    ids.push(movieId || movieName);
  }</code></pre><p><strong>What’s important here:</strong></p><ul><li>We don’t embed only the description. We embed a combined text block (title/year + director + genre + description). That makes results easier to print and sometimes improves retrieval.</li><li>The same text is what the lexical side uses, too (in hybrid mode), because it’s indexed as searchable text.</li></ul><h4>Step 3b: Add texts to Elasticsearch using LangChain.</h4><p>This is the indexing step. Here we embed texts and write them to Elasticsearch.</p><p>For asynchronous applications, please use <a href="https://reference.langchain.com/python/integrations/langchain_elasticsearch/#langchain_elasticsearch._async.vectorstores.AsyncElasticsearchStore"><code>AsyncElasticsearchStore</code></a> with the same API.</p><p>You can find our <a href="https://reference.langchain.com/python/integrations/langchain_elasticsearch/">reference docs</a> for both the sync and async versions of ElasticsearchStore, along with more parameters for advanced fine-tuning RRF.</p><p><strong>In Python:</strong></p><pre><code>print(f"Ingesting {len(texts)} movies into '{INDEX_NAME}' from '{CSV_PATH.name}'...") 
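# Assumes a local Ollama server is running and the "llama3" model has been
# pulled (e.g., `ollama pull llama3`), since OllamaEmbeddings calls it to
# embed each text.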

vector_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    embedding=OllamaEmbeddings(model="llama3"),
    es_url=ES_URL,
    es_api_key=ES_API_KEY,
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=False),
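    # hybrid=False -> vector-only (kNN) retrieval over the embedded text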
)

# This is the indexing step. We embed the texts and add them to Elasticsearch
vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  console.log(
    `Ingesting ${texts.length} movies into '${INDEX_NAME}' from 'scifi_1000.csv'...`
  );

  const embeddings = new OllamaEmbeddings({ model: "llama3" });

  // Vector-only store (no hybrid)
  const vectorStore = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
  });

  // This is the indexing step. We embed the texts and add them to Elasticsearch
  await vectorStore.addDocuments(
    texts.map((text, i) => ({
      pageContent: text,
      metadata: metadatas[i],
    })),
    { ids }
  );</code></pre><h4>Step 3c: Create another store for hybrid search.</h4><p>We create another ElasticsearchStore object pointing at the same index but with different retrieval behavior: <code>hybrid=False</code> is <strong><em>vector-only</em></strong> search and <code>hybrid=True</code> is <strong><em>hybrid search</em></strong> (BM25 + kNN, fused with RRF).</p><p><strong>In Python:</strong></p><pre><code># Since we are using the same INDEX_NAME we can avoid adding texts again 
# This ElasticsearchStore will be used for hybrid search

hybrid_store = ElasticsearchStore(
    index_name=INDEX_NAME,
    embedding=OllamaEmbeddings(model="llama3"),
    es_url=ES_URL,
    es_api_key=ES_API_KEY,
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True),
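    # hybrid=True -> BM25 keyword search + kNN, fused with reciprocal rank fusion (RRF)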
)</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  // Since we are using the same INDEX_NAME we can avoid adding texts again
  // This ElasticVectorSearch will be used for hybrid search
  const hybridStore = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
    strategy: new HybridRetrievalStrategy(),
  });

  // With custom RRF parameters
  const hybridStoreCustom = new ElasticVectorSearch(embeddings, {
    client,
    indexName: INDEX_NAME,
    strategy: new HybridRetrievalStrategy({
      rankWindowSize: 100,  // default: 100
      rankConstant: 60,     // default: 60
      textField: "text",    // default: "text"
    }),
  });</code></pre><h4>Step 3d: Run the same query both ways, and print results.</h4><p>As an example, let’s run the query <em>“Find movies where the main character is stuck in a time loop and reliving the same day.&quot; </em>and compare the results from hybrid search and vector search.</p><p><strong>In Python:</strong></p><pre><code>query = "Find movies where the main character is stuck in a time loop and reliving the same day."
k = 5

print(f"\n=== Query: {query} ===")

vec_docs = vector_store.similarity_search(query, k=k)
hyb_docs = hybrid_store.similarity_search(query, k=k)

print("\nVector search (kNN) top results:")
for i, doc in enumerate(vec_docs, start=1):
    print(f"{i}. {(doc.page_content or '').splitlines()[0]}")

print("\nHybrid search (BM25 + kNN + RRF) top results:")
for i, doc in enumerate(hyb_docs, start=1):
    print(f"{i}. {(doc.page_content or '').splitlines()[0]}")</code></pre><p><strong>In JavaScript:</strong></p><pre><code>  const query =
    "Find movies where the main character is stuck in a time loop and reliving the same day.";
  const k = 5;

  console.log(`\n=== Query: ${query} ===`);

  const vecDocs = await vectorStore.similaritySearch(query, k);
  const hybDocs = await hybridStore.similaritySearch(query, k);

  console.log("\nVector search (kNN) top results:");
  vecDocs.forEach((doc, i) => {
    console.log(`${i + 1}. ${(doc.pageContent || "").split("\n")[0]}`);
  });

  console.log("\nHybrid search (BM25 + kNN + RRF) top results:");
  hybDocs.forEach((doc, i) => {
    console.log(`${i + 1}. ${(doc.pageContent || "").split("\n")[0]}`);
  });
}

main().catch(console.error);</code></pre><p><strong>Example output</strong></p><pre><code>Ingesting 1000 movies into 'scifi-movies-hybrid-demo' from 'scifi_1000.csv'...

=== Query: Find movies where the main character is stuck in a time loop and reliving the same day. ===

Vector search (kNN) top results:
1. The Witch: Part 1 - The Subversion (2018)
2. Divinity (2023)
3. The Maze Runner (2014)
4. Spider-Man (2002)
5. Spider-Man: Into the Spider-Verse (2018)

Hybrid search (BM25 + kNN + RRF) top results:
1. Edge of Tomorrow (2014)
2. The Witch: Part 1 - The Subversion (2018)
3. Boss Level (2020)
4. Divinity (2023)
5. The Maze Runner (2014)</code></pre><h2><strong>Why these results? </strong></h2><p>This query (“time loop / reliving the same day”) is a great case where hybrid search tends to shine because the dataset contains literal phrases that BM25 can match and vectors can still capture meaning.</p><ul><li>Vector-only (kNN) embeds the query and tries to find semantically similar plots. Using a broad sci‑fi dataset, this can drift into “trapped / altered reality / memory loss / high-stakes sci‑fi” even when there’s no time-loop concept. That’s why results like “The Witch: Part 1 – The Subversion” (amnesia) and “The Maze Runner” (trapped/escape) can appear.</li><li>Hybrid (BM25 + kNN + RRF) rewards documents that match both keywords and meaning. Movies whose descriptions explicitly mention “time loop” or “relive the same day” get a strong lexical boost, so titles like “Edge of Tomorrow” (relive the same day over and over again…) and “Boss Level” (trapped in a time loop that constantly repeats the day…) rise to the top.</li></ul><p>Hybrid search doesn’t guarantee that every result is perfect. It balances lexical and semantic signals so you may still see some non-time-loop sci‑fi in the tail of the top‑k.</p><p>The main takeaway is that hybrid search helps anchor semantic retrieval with exact textual evidence when the dataset contains those keywords.</p><h2>Full code example</h2><p>You can find our full demo code in Python and JavaScript, as well as the dataset used, hosted on <a href="https://gist.github.com/ssh-esh/103fb8220de3b0e045393760c2f36575">GitHub gist</a>.</p><h2>Conclusion</h2><p>Hybrid search provides a pragmatic and powerful retrieval strategy by combining traditional BM25 keyword search with modern vector similarity into a single, unified ranking. Instead of choosing between lexical precision and semantic understanding, you get the best of both worlds, without adding significant complexity to your application.</p><p>In real-world datasets, this approach consistently yields results that feel more intuitively correct. Exact term matches help anchor results to the user’s explicit intent, while embeddings ensure robustness against paraphrasing, synonyms, and incomplete queries. This balance is especially valuable for noisy, heterogeneous, or user-generated content, where relying on only one retrieval method often falls short.</p><p>In this article, we demonstrated how to use hybrid search in LangChain through its Elasticsearch integrations, with complete examples in both Python and JavaScript. We’re also contributing to other open-source projects, such as <a href="https://github.com/langchain4j/langchain4j/pull/4069">LangChain4j</a>, to extend hybrid search support with Elasticsearch.</p><p>We believe hybrid search will be a key capability for generative AI (GenAI) and agentic AI applications, and we plan to continue collaborating with libraries, frameworks, and programming languages across the ecosystem to make high-quality retrieval more accessible and robust.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to defend your RAG system from context poisoning]]></title>
            <description><![CDATA[How context engineering techniques prevent context poisoning in LLM responses.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-poisoning-llm</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-poisoning-llm</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/32e26f6dc2c8fc41d4c1d981eff6487eb8423fa5-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>With the recent release of models like Claude Sonnet 4.5, the Gemini 3 series, and GPT-5 featuring million-token context windows, there’s a growing misconception that context management is becoming simpler. If a large language model (LLM) can process millions of tokens at once, does it really matter what information we provide?</p><p>The reality is precisely the opposite. <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview"><em>Context engineering</em></a>, the practice of managing what information reaches your LLM, is more critical than ever. Large context windows don’t eliminate the need for precision; they amplify it. With more context comes exponentially more opportunities for error, hallucinations, and irrelevant information to contaminate your LLM reasoning process.</p><p>Whether you’re working with retrieval-augmented generation (RAG), tool outputs, or memory systems, effective context engineering isn’t about providing more information but about providing <em>the right</em> information. That’s where Elasticsearch comes in, serving as your context engineering platform.</p><p>In this article, we’ll explore what context poisoning is, how it manifests across different types of memory, and how Elasticsearch’s RAG capabilities provide defense at every stage of the retrieval pipeline, from ingestion to composition, ensuring your LLM receives clean, relevant, and reliable context.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d832932f12723226ecc98e2a9b8d6bfc008829bf-1080x571.png" style="width: 100%;" /><h2>What is context poisoning?</h2><p><em>Context poisoning</em> occurs when compromised, outdated, or irrelevant information enters an LLM’s context window, leading to degraded responses, hallucinations, or perpetuated errors. Once corrupted or incorrect information enters the context window, it propagates into answers. The LLM references it as truth, creating cascading errors across the conversation.</p><p>This poisoning can happen at multiple stages of the LLM lifecycle (such as training), but our focus is on the retrieval and composition stages. Although adversarial attacks, like prompt injection, also pose risks, this article focuses on the operational patterns that teams encounter most frequently in production environments.</p><h3><strong>Operational understanding</strong></h3><p>Context poisoning often happens for reasons like:</p><ul><li><strong>Context rot:</strong> Information becomes outdated but remains in your knowledge base without being updated or deleted.</li><li><strong>Context overflow: </strong>Too much information overwhelms the LLM&#x27;s attention, crowding out the truly relevant context and leading to answers that miss key information.</li><li><strong>Conflicting information:</strong> Multiple sources provide contradictory data, confusing the model.</li><li><strong>Semantic noise: </strong>Semantically similar but contextually irrelevant content dilutes relevance.</li><li><strong>Malicious injection: </strong>Content deliberately inserted by attackers into knowledge bases, including prompt injections or manipulated data.</li></ul><p>Understanding these patterns is the first step toward building robust defenses. Let’s examine each pattern and how Elasticsearch helps you address them.
You can follow along with the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/context-poisoning">supporting notebook</a>.</p><h2>Types of context poisoning</h2><h3><strong>Temporal degradation</strong></h3><p>Over time, information in your knowledge base becomes outdated, and without proper management, stale content continues to be retrieved and presented to your LLM as current truth. This is especially problematic in industries where information changes frequently, like product documentation, pricing, regulations, or news.</p><h4><strong>Impact</strong></h4><p>Your LLM provides outdated advice, references deprecated features, or contradicts current reality, eroding user trust.</p><h4><strong>Solutions: Temporal filtering in hybrid search</strong></h4><p>Elasticsearch’s date-based query capabilities ensure your RAG system prioritizes recent and relevant information through explicit temporal filters.</p><p><strong>Example: Product documentation search with time filtering</strong></p><p>A user asks your chatbot about authentication setup. The authentication flow changed significantly six months ago, so it’s important to return only documents updated within the last six months.</p><h4><strong>Without temporal filtering</strong></h4><pre><code>POST product-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "how to configure OAuth authentication"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "configure OAuth authentication",
                "fields": ["title^2", "content"]
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "last_updated", "version", "content_snippet"]
}</code></pre><h4><strong>Response without filtering: Contradictory results</strong></h4><p>The LLM receives three different methods for OAuth configuration: the current security API (9.x), legacy realm settings (7.x), and the deprecated shield plugin (6.x). This contradictory context leads to confused or misleading responses:</p><pre><code>{
  "hits": {
    "total": { "value": 23 },
    "max_score": 24.5,
    "hits": [
      {
        "_id": "doc-oauth-2025",
        "_score": 24.5,
        "_source": {
          "title": "OAuth 2.0 Authentication Setup",
          "last_updated": "2025-10-15",
          "version": "9.x",
          "content_snippet": "To configure OAuth 2.0 authentication in Elasticsearch 9.x, use the new security API..."
        }
      },
      {
        "_id": "doc-oauth-2023",
        "_score": 23.8,
        "_source": {
          "title": "OAuth Authentication Configuration",
          "last_updated": "2023-04-20",
          "version": "7.x",
          "content_snippet": "Configure OAuth using the legacy realm settings in elasticsearch.yml..."
        }
      },
      {
        "_id": "doc-oauth-deprecated",
        "_score": 22.9,
        "_source": {
          "title": "Setting Up OAuth (Deprecated)",
          "last_updated": "2022-11-10",
          "version": "6.x",
          "content_snippet": "Use the shield plugin to configure OAuth authentication..."
        }
      }
    ]
  }
}</code></pre><h4><strong>With temporal filtering</strong></h4><p>Add a filter to restrict results to documents updated within the last six months:</p><pre><code>POST product-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "how to configure OAuth authentication"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "configure OAuth authentication",
                "fields": ["title^2", "content"]
              }
            }
          }
        }
      ],
      "filter": [
        {"range": {"last_updated": {"gte": "now-6M"}}},
        {"term": {"status": "published"}}
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "last_updated", "version", "content_snippet"],
  "size": 5
}</code></pre><h4><strong>What this hybrid search query does</strong></h4><ul><li><strong>Semantic search</strong> (<code>semantic</code>) captures related concepts and context using the <code>content_semantic</code> field.</li><li><strong>Lexical search</strong> (<code>multi_match</code>) matches exact keywords like “OAuth” with field boosting <code>title^2</code>.</li><li><strong>Reciprocal rank fusion (RRF)</strong> combines both result sets with balanced reranking, retrieving the most relevant results.</li><li><strong>Temporal filter</strong> ensures only documents updated within the last six months are retrieved.</li><li><strong>Status filter</strong> restricts results to published documents, excluding drafts or deprecated content.</li></ul><h4><strong>Response with temporal filtering: Consistent results</strong></h4><p>Temporal filtering eliminates the outdated documents, leaving only current documentation for version 9.x. The LLM now receives consistent context and generates confident, accurate responses:</p><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "OAuth 2.0 Authentication Setup",
          "last_updated": "2026-01-15",
          "version": "9.x",
          "content_snippet": "Configure OAuth 2.0 in Elasticsearch 9.x using the security API via Stack Management > Security."
        }
      },
      {
        "_source": {
          "title": "OAuth Provider Configuration",
          "last_updated": "2025-12-20",
          "version": "9.x",
          "content_snippet": "Configure Okta, Azure AD, Auth0 via security API with OIDC auto-discovery."
        }
      }
    ]
  }
}</code></pre><h4><strong>Relative versus absolute time filters</strong></h4><p><strong>Relative filtering </strong>(recommended for most use cases):</p><pre><code>"filter": [
  {
    "range": {
      "last_updated": {
        "gte": "now-1y"
      }
    }
  }
]</code></pre><p><strong>Absolute filtering </strong>(for specific time ranges):</p><pre><code>"filter": [
  {
    "range": {
      "last_updated": {
        "gte": "2025-01-01",
        "lte": "2025-12-31"
      }
    }
  }
]</code></pre><h4><strong>Impact on LLM response quality</strong></h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0c225743cc92df873bd8685b5073fe749f4d2da6-1600x414.png" style="width: 100%;" /><ul><li><strong>Without filtering: </strong>LLM receives contradictory guidance from 2023–2025, producing uncertain responses mixing deprecated and current methods.</li><li><strong>With temporal filtering: </strong>LLM receives only recent documentation, generating confident responses based on current best practices.</li></ul><h3><strong>Information conflicts</strong></h3><p>When your RAG system retrieves documentation for features that behave differently across deployment types, versions, or configurations, conflicting information can confuse the LLM about which guidance applies to the user’s specific context.</p><h4><strong>Impact</strong></h4><p>The LLM spends extra tokens and reasoning trying to determine which information is correct, becoming more prone to errors and hallucinations.</p><h4><strong>Solutions: Hybrid search with metadata boosting</strong></h4><p>Elasticsearch’s <code>bool</code> query with a <code>should</code> clause lets you apply boosts that prioritize documents matching specific metadata, ensuring deployment-specific or version-specific documentation appears first in the context window. For query syntax details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Bool query reference</a>.</p>
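<p>Boosting like this assumes the deployment metadata is indexed as structured fields. Here’s a minimal mapping sketch; the field names mirror the query below:</p><pre><code>PUT platform-docs
{
  "mappings": {
    "properties": {
      "title":             { "type": "text" },
      "content":           { "type": "text" },
      "content_semantic":  { "type": "semantic_text" },
      "content_snippet":   { "type": "text" },
      "deployment_type":   { "type": "keyword" },
      "doc_status":        { "type": "keyword" },
      "feature_supported": { "type": "boolean" }
    }
  }
}</code></pre><h4><strong>Example: Deployment-specific feature documentation</strong></h4><p>A user asks, “How do I configure custom users in serverless?” Your knowledge base contains information about cloud, self-hosted, and managed deployments. With proper metadata prioritization, retrieval surfaces clear signals about feature availability, and the LLM provides correct guidance:</p><pre><code>POST platform-docs/_search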
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "bool": {
                "must": [
                  {
                    "multi_match": {
                      "query": "How do I configure custom users in serverless?",
                      "fields": ["title^2", "content"]
                    }
                  }
                ],
                "should": [
                  {"term": {"deployment_type": {"value": "serverless", "boost": 3.0}}},
                  {"term": {"doc_status": {"value": "current", "boost": 2.0}}}
                ]
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "How do I configure custom users in serverless?"
              }
            }
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "deployment_type", "feature_supported", "content_snippet"],
  "size": 5
}</code></pre><h4><strong>What this query does</strong></h4><ul><li><code>must</code> clause: All documents must match “How do I configure custom users in serverless?”</li><li><code>should</code> clauses with explicit boosting:<ul><li>Documents with <code>deployment_type: "serverless"</code> receive a 3x boost.</li><li>Documents with <code>doc_status: "current"</code> receive a 2x boost.</li></ul></li><li>Semantic search runs in parallel to capture conceptual matches.</li><li>RRF combines lexical (with metadata boosting) and semantic results to get the best of both approaches.</li></ul><h4><strong>Expected response:</strong></h4><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "Authentication in Serverless",
          "deployment_type": "serverless",
          "feature_supported": false,
          "content_snippet": "Custom authentication not available in Serverless. Use SSO with your identity provider."
        }
      },
      {
        "_source": {
          "title": "User Management in Serverless",
          "deployment_type": "serverless",
          "feature_supported": false,
          "content_snippet": "Direct user creation not supported in Serverless. Use your organization's IdP."
        }
      },
      {
        "_source": {
          "title": "Role-Based Access in Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Configure roles in Serverless console. Roles sync with SSO provider groups."
        }
      },
      {
        "_source": {
          "title": "API Keys in Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Create API keys for programmatic Serverless access. Keys inherit user permissions."
        }
      },
      {
        "_source": {
          "title": "SSO Configuration for Serverless",
          "deployment_type": "serverless",
          "feature_supported": true,
          "content_snippet": "Configure SSO in Serverless via Cloud console with SAML 2.0 or OIDC."
        }
      }
    ]
  }
}</code></pre><h4><strong>How metadata boosting resolves conflicts</strong></h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43a9227fc651ebd6ed6f0058972e0dac502aa7c5-955x1600.png" style="width: 100%;" /><h4><strong>Impact on LLM response quality</strong></h4><ul><li><strong>Without metadata boosting: </strong>The context window receives equal-weight documents from all deployment types. The LLM produces vague responses that hedge between possibilities, failing to clearly state deployment-specific limitations.</li><li><strong>With metadata boosting (3x): </strong>Serverless-specific documentation dominates the top results. The LLM generates direct answers about feature unavailability and provides actionable alternatives while maintaining the cross-deployment context for follow-up questions.</li></ul><h3><strong>Semantic noise</strong></h3><p>Vector similarity search can retrieve documents that are semantically related but contextually irrelevant to the user’s need. This “semantic drift” occurs when embeddings capture surface similarity without understanding the query intent. When your context window fills with irrelevant information, the LLM’s ability to generate precise answers declines.</p><h4><strong>Impact</strong></h4><p>The LLM receives correct information that doesn’t answer the question, wasting the context window and lowering the quality of the provided answer.</p><h4><strong>Solution: Hybrid search</strong></h4><p>Elasticsearch hybrid search combines lexical precision with semantic understanding, using explicit product filters to eliminate cross-product drift while maintaining conceptual recall.</p><h4><strong>Example: Technical documentation search</strong></h4><p>A developer searches for “Elastic Agent configuration,” and your knowledge base contains both the Elastic Agent (Elastic Observability) and the Elastic Agent Builder documentation. Both use the word "agent" prominently, making them semantically similar.</p><p>Let’s search for agent configuration documentation:</p><pre><code>POST elastic-docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "agent configuration logs metrics collection",
                "fields": ["title^3", "content", "tags^2"],
                "type": "best_fields"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "content_semantic",
                "query": "configuring agents to collect logs and metrics from hosts"
              }
            }
          }
        }
      ],
      "filter": [
        {"terms": {"product": ["observability", "elastic-agent"]}},
        {"term": {"doc_type": "configuration"}}
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  },
  "_source": ["title", "product", "tags", "url"],
  "size": 5
}</code></pre><p>This hybrid query:</p><ul><li><strong>Lexical component</strong> (<code>multi_match</code>) ensures exact keyword matches for "agent", "configuration", "logs", "metrics", and "collection".</li><li><strong>Field boosting</strong> (<code>title^3</code>, <code>tags^2</code>) prioritizes documents where terms appear in important fields.</li><li><strong>Semantic component</strong> captures conceptual relationships and the intent behind “configuring data collection agents”.</li><li><strong>RRF</strong> merges both result sets with balanced ranking using <code>rank_constant: 20</code>.</li><li><strong>Product filter</strong> (<code>product</code>) restricts results to the Elastic Observability and Elastic Agent domains, eliminating Agent Builder docs entirely.</li><li><strong>Doc type filter</strong> (<code>doc_type: "configuration"</code>) restricts results to configuration documentation, eliminating semantic drift toward unrelated content types.</li></ul><p><strong>Expected response:</strong><br/></p><pre><code>{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "Elastic Agent Input Configuration",
          "product": "elastic-agent",
          "tags": ["inputs", "logs", "metrics", "configuration"],
          "url": "/docs/elastic-agent/inputs"
        }
      },
      {
        "_source": {
          "title": "Configure Elastic Agent for Log and Metric Collection",
          "product": "elastic-agent",
          "tags": ["configuration", "logs", "metrics", "observability"],
          "url": "/docs/elastic-agent/configure"
        }
      },
      {
        "_source": {
          "title": "Agent Policies and Integrations",
          "product": "observability",
          "tags": ["policies", "integrations", "fleet"],
          "url": "/docs/fleet/policies"
        }
      },
      {
        "_source": {
          "title": "Configuring Agent Outputs",
          "product": "elastic-agent",
          "tags": ["outputs", "elasticsearch", "logstash"],
          "url": "/docs/elastic-agent/outputs"
        }
      },
      {
        "_source": {
          "title": "Manage Elastic Agents with Fleet",
          "product": "observability",
          "tags": ["fleet", "agent-management", "deployment"],
          "url": "/docs/fleet/manage-agents"
        }
      }
    ]
  }
}</code></pre><h4><strong>Why hybrid search works</strong></h4>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Search type</td><td style="border: 1px solid #ccc; padding: 4px;">Strengths</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Lexical only</td><td style="border: 1px solid #ccc; padding: 4px;">Precise keyword matching</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Semantic only</td><td style="border: 1px solid #ccc; padding: 4px;">Captures semantic meaning</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Hybrid search</td><td style="border: 1px solid #ccc; padding: 4px;">Precision and recall, intent understanding</td>
                    </tr>
              </tbody>
            </table>
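<p>When even hybrid retrieval lets marginally relevant documents through, a semantic reranker can reorder the candidates before they reach the LLM. Here’s a minimal sketch using the <code>text_similarity_reranker</code> retriever; the inference endpoint name (<code>my-rerank-model</code>) is a placeholder for whichever reranking model you have deployed:</p><pre><code>POST elastic-docs/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "semantic": {
              "field": "content_semantic",
              "query": "configuring agents to collect logs and metrics from hosts"
            }
          }
        }
      },
      "field": "content",
      "inference_id": "my-rerank-model",
      "inference_text": "configuring agents to collect logs and metrics from hosts",
      "rank_window_size": 50
    }
  },
  "size": 5
}</code></pre>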
<p><strong>Before and after: LLM response comparison</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/437dbe84121502b6c9b320d29b0df5ac67bd35fa-1600x371.png" style="width: 100%;" /><h2>Elasticsearch RAG best practices</h2><p>Following these best practices optimizes your context engineering and significantly reduces the risk of context poisoning in your RAG systems. By implementing the following strategies, you ensure that every token in your context window contributes to relevant, accurate, and trustworthy LLM responses.</p><ol><li><strong>Choose the right search strategy for your data<br/></strong>Select your search approach based on your data characteristics and query patterns. Choose between lexical, semantic, or hybrid search. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/search-approaches">Search approaches | Elastic Docs</a>.</li><li><strong>Implement temporal awareness<br/></strong>Time-sensitive information requires active management to prevent outdated content from contaminating your context window. Use range queries with relative time filters (like <code>now-6M</code> or <code>now-1y</code>) for content that changes frequently, ensuring your RAG system prioritizes recent content. For more details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-range-query">Range query | Reference</a>.</li><li><strong>Use metadata boosting<br/></strong>When your knowledge base contains similar content across different contexts, such as multiple product versions, deployment types, or user roles, metadata boosting helps prioritize contextually relevant results. For more details, refer to <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Boolean query | Reference</a>.</li><li><strong>Apply reranking when needed<br/></strong>For complex or high-priority queries where precision is critical, consider adding a reranking model (like the <code>text_similarity_reranker</code> sketched above), which can significantly improve result quality by reordering candidates based on a deeper semantic understanding of the query and documents. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/ranking">Ranking and reranking | Elastic Docs</a>.</li><li><strong>Optimize chunking strategies<br/></strong><em>Chunking</em> is the process of breaking down large text into smaller “chunks.” Your chunking strategy affects both semantic representation and retrieval precision: smaller chunks provide more granularity but can lose context, while larger chunks preserve more context but reduce retrieval precision. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">Understanding chunking strategies in Elasticsearch</a>.</li><li><strong>Filter the data before it reaches the LLM<br/></strong>Vector similarity search can retrieve semantically related but contextually irrelevant documents. Apply explicit filters on product, category, or domain fields to constrain results to the appropriate context before delivering them to the LLM. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/rag-in-production">RAG pipelines in production: Operationalize your GenAI project - Elasticsearch Labs</a>.</li><li><strong>Calibrate your retrieval volume (k)<br/></strong>Finding the &quot;Goldilocks zone&quot; for the number of documents retrieved is essential.
Too few results lead to incomplete answers, while too many can cause the LLM to miss key facts. Balance your token budget against the effective depth of the model&#x27;s context window. For more details, refer to <a href="https://www.elastic.co/docs/solutions/search/vector/knn">kNN search in Elasticsearch | Elastic Docs</a>.</li><li><strong>Consider summarization for large documents<br/></strong>When retrieved content exceeds your context budget, summarization techniques help retain essential information while reducing token count. For more details, refer to <a href="https://www.elastic.co/search-labs/blog/ai-generated-summaries-using-elastic">Adding AI summaries to your site with Elastic - Elasticsearch Labs</a>.</li><li><strong>Monitor and iterate<br/></strong>Over time, as knowledge bases grow and content evolves, we recommend implementing monitoring to track relevance score distributions, temporal patterns in retrieved results, and user feedback signals. Watch for signs like outdated documents, declining user satisfaction scores, or a growing number of “no relevant results” queries. For more details, refer to <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">Elastic Observability: Streams Data Quality and Failure Store Insights</a>.</li></ol><h2>Conclusion</h2><p>The new era of million-token context windows has not made context management obsolete; it has made context engineering more critical than ever. As context windows grow, so does the potential for poisoning from any source: retrieval, tools, or memory.</p><p>The patterns shown in this article apply beyond just RAG. Temporal filtering, metadata boosting, and hybrid search are foundational techniques that improve context quality, regardless of source.</p><p>By implementing these strategies, you maintain control over what information reaches your LLM, ensuring relevance, accuracy, and trust at scale.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-vlm: Seeing like an AI with vision language models]]></title>
            <description><![CDATA[Learn about vision language models (VLMs), what jina-vlm can do, how to use it, and best practices.]]></description>
            <link>https://www.elastic.co/search-labs/blog/visual-language-model-jina-vlm</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/visual-language-model-jina-vlm</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Scott Martens]]></dc:creator>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/38c086d709c066acaab063deee25ad23ce488a30-1600x900.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Vision-language models (VLMs) are generative language models that take texts and images as input and generate new text as output. They’re designed for <em>visual question-answering</em> (VQA), an application of multimodal AI that fuses image and natural language processing (NLP) in ways that would have been impossible a few years ago. Jina’s new model, <code>jina-vlm</code>, breaks new ground for VQA with state-of-the-art semantic processing of images like scans, slides, and diagrams, which combine text and visual material to carry information, rather than natural scenes like photographs.</p><p><code>jina-vlm</code> opens up new application areas:</p><ul><li><strong>Accessibility:</strong> Creating descriptions of images for visually impaired people.</li><li><strong>Indexing:</strong> Generating detailed textual information to improve search for images and documents that contain images.</li><li><strong>Querying:</strong> Matching natural language queries to images through direct analysis of image content.</li><li><strong>Visual content analysis:</strong> Automatically inspecting images for classification and analysis.</li><li><strong>Retrieval-augmented generation (RAG):</strong> Using visual information as a knowledge source in RAG systems.</li></ul><p>This article will introduce you to VLMs and show you how to use <code>jina-vlm</code> to bring multimodal AI to your own solutions.</p><h2>Vision-language models</h2><p><code>jina-vlm</code> is a decoder-only language model with additional features and training to answer questions or create general descriptions of images.</p><p>We adapted the <a href="https://huggingface.co/Qwen/Qwen3-1.7B-Base">Qwen3-1.7B-Base</a> open weight language model as the backbone of <code>jina-vlm</code>. What makes it a <em>vision</em> language model is that we’ve connected it to a custom fine-tuned image encoder model, derived from the <a href="https://huggingface.co/google/siglip2-so400m-patch14-384">SigLIP2-So400M/14-384</a> model, which converts images into a form that the language model can understand. We then trained the combined model specifically for VQA and description generation.</p><p>To understand how this works, think about how language models “read” your prompts. Input texts are split into tokens (words or parts of words), which are replaced with embedding vectors from a lookup table and then fed into a neural network. The model never sees the actual letters you typed.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5c711b2e1e0710370fc1979e4a987a6c2a497ba-1600x235.png" style="width: 100%;" /><p>As seen in the image, each unique token is substituted with a number that, in turn, corresponds to an embedding vector stored in the model’s vocabulary. This enables the model to process whatever text you give it, even if the words aren’t in its stored vocabulary.</p><p>This is why language models struggle to count letters in obscure words, like this example with Claude Haiku 4.5:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cc242a0f9f5a8fa9aa14a1511ad47abfe6440b51-628x582.png" style="width: 100%;" /><p>This model correctly recognizes what <em>Qaqortoq</em> is; it knows that the word means a town in Greenland, and it can tell you all about it, but it has no idea which actual letters compose the name.
There are ways to supplement language models to address this shortcoming, so some models (like ChatGPT 5, Gemini, and Claude Sonnet) don’t generally make this kind of mistake.</p><p>However, the same architectural property that causes this shortcoming makes it possible to create multimodal language models.</p><p>The actual input to generative language models is a sequence of semantic embeddings, and there’s no reason those embeddings have to come from texts. They could come from anything, as long as they’re compatible with the model’s training. We can make any language model handle images, provided we can generate vectors for those images in the same embedding space as the tokens.</p><p><code>jina-vlm</code> contains an image embedding model and adapter trained to do just that. The image (with some preprocessing into overlapping patches) is the input to the model, and it outputs a sequence of embeddings in the same semantic embedding space as the language model’s text token embeddings. These embeddings are then treated just like the text input embeddings. The full language model has access to your text and to the image at the same time, and it treats both as the same kind of input.</p><p>Figure 3 is a schematic of the <code>jina-vlm</code> model, showing the dual-track processing of text and images together.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66886ced02dc707434a1f7d19c519681f648eb13-1128x1579.png" style="width: 100%;" /><p>This architecture has some important benefits. Language models learn a lot of implicit information about the world from their training data, and the VLM architecture can use that information to help understand the content of images. Furthermore, some of the bias problems present in other multimodal architectures are absent in this one, notably the <a href="https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/">media “gap” that troubles CLIP-style models</a>.</p><h2>What can jina-vlm do?</h2><p><code>jina-vlm</code> has three overlapping core competencies:</p><ul><li>Describing images and generating captions.</li><li>Understanding informationally rich images.</li><li>Answering questions about the content of images.</li></ul><h2>Describing and captioning pictures</h2><p>Providing good descriptions of images can be a boon for accessibility. There are increasing legal and informal requirements to provide alternative text (alt-text) for digitized images on the internet. Search engine optimization (SEO) strategies in particular often demand high-quality image captions and alt-text. Manually composing such descriptions can be expensive and time-consuming.</p><p><code>jina-vlm</code> can help. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/816b17945624f8965705a06f4ac7e8774ae65714-1280x853.png" style="width: 100%;" /><p>You can input this image to <code>jina-vlm</code> with the text:</p><ul><li><strong>Input Instruction: </strong>Describe this image in detail.</li></ul><p>The result is the following:</p><ul><li><strong>Output: </strong>The image captures a vibrant scene of a white plate filled with a multitude of colorful jelly beans. The jelly beans, in hues of green, yellow, orange, purple, red, and blue, are scattered across the plate, creating a rainbow-like effect. The plate itself is placed on a gray surface, providing a stark contrast to the colorful jelly beans. The image does not contain any discernible text.
The relative positions of the jelly beans vary, with some clustered together and others spread out, giving the image a sense of depth and texture. The overall composition of the image is simple yet striking, with the colorful jelly beans being the central focus.</li></ul><p>This description is a little wordy but quite detailed.</p><p><code>jina-vlm</code> treats the text you give it as a question or instruction, and some prompt engineering may improve the result a lot. For example, using the same picture of jelly beans:</p><ul><li><strong>Input Instruction: </strong>Describe this image. Be concise.</li><li><strong>Output: </strong>A white plate is filled with a variety of colorful jelly beans, including green, yellow, red, purple, and blue, arranged in a scattered manner.</li></ul><p>Or:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A white plate full of colorful jelly beans.</li></ul><p>Asking specifically for a caption tends to produce very short results. Most captions are brief, so <code>jina-vlm</code> is trained to generate brief captions.</p><p>You can also use <code>jina-vlm</code> to identify and list items in images:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6a7a2d832e225905e3e29a6df11c3d56dbf0655d-1764x876.png" style="width: 100%;" /><p>As a relatively small model, <code>jina-vlm</code> has limits to its object recognition. For common categories, it’s very good. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/253083f674293de2a2aea866151c1844f1561d03-1768x752.png" style="width: 100%;" /><p>It’s able to handle content like “red cars”, but ask <code>jina-vlm</code> a question involving detailed categories and it can become unreliable very quickly. For example, using the same image:</p><ul><li><strong>Input Instruction: </strong>What model of car is this?</li><li><strong>Output: </strong>The car in the image is a classic French car, specifically a Citroën DS. This model is known for its distinctive rounded body shape, which was a hallmark of the DS series. The DS was produced from 1955 to 1969 and was a popular choice for its combination of style, performance, and practicality. It was designed to be a luxury car that could be driven on city streets as well as on the roads, making it a versatile vehicle for its time. The DS was also notable for its innovative features, such as the rear spoiler and the front grille with the Citroën emblem.</li></ul><p>This answer is largely hallucinated. The car in the picture is one of the most recognizable cars in history: a Citroën 2CV, manufactured from 1948 to 1990. The Citroën DS was made from 1955 to 1975 and looked like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/79f2d0348b0bae6a9253f649e43d2a030983408a-1600x1066.png" style="width: 100%;" /><p>In short, when pressed for details beyond its capacity, <code>jina-vlm</code> hallucinates without warning.</p><p>It’s especially important to keep this in mind with images of people:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/499a0378f555ad24c745119f1d0110bbf82d5d2c-1772x1108.png" style="width: 100%;" /><p>Tom Cruise is famous enough, and present enough in the training data, for <code>jina-vlm</code> to recognize him.
But, using the same picture:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A man with a green shirt and a big smile.</li></ul><p>This caption never mentions Tom Cruise, even though it’s the most important fact about the picture. Asking for a caption produces very little hallucination but also very little information.</p><p>In contrast, let’s do the same thing with this image:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2ccda0dd49d4caf78dffee2a797c7359e1ecca5f-1772x1076.png" style="width: 100%;" /><p>This is terribly wrong. Ronald Reagan looks nothing like Richard Nixon.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dab9812917a27d064845a5baae2e3d2bdf6b2076-250x313.png" style="width: 100%;" /><p>When you ask <code>jina-vlm</code> to identify persons, it’s likely to hallucinate because it doesn’t have very deep world knowledge. Only people who happen to appear many times in the training data are likely to be correctly identified.</p><p><code>jina-vlm</code> hallucinates much less when asked to write a caption. For example, using the same picture of Richard Nixon:</p><ul><li><strong>Input Instruction: </strong>Provide a caption for this picture.</li><li><strong>Output: </strong>A man in a suit sits in a chair with a bookshelf behind him.</li></ul><h3>Understanding informational content</h3><p><code>jina-vlm</code> truly shines at understanding informational graphics and synthetic images, like diagrams, graphs, screenshots, scans of printed pages, advertising copy, and presentation slides.
The examples below show only part of <code>jina-vlm</code>’s range.</p><h4>Charts</h4><p><code>jina-vlm</code> can report on the contents of all sorts of data presentations.</p><p><strong>Bar charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dbe91a785ef93e1036f2a7b227e4ffbdfc3d9d07-1742x772.png" style="width: 100%;" /><p><strong>Pie charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/460dd1663f32744297cede944af69b46f45bbb37-1832x872.png" style="width: 100%;" /><p><strong>Area charts:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/645ee7038dd1e0b6be7485796cf6ea6f2057b841-1828x928.png" style="width: 100%;" /><p><strong>Line graphs:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/852da7806adf0aaf0071bab0a045894e4f7e8272-1828x660.png" style="width: 100%;" /><p><strong>Table:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e86d3bc65a4fc285bbc4751edc9cadf52a266a7-1818x948.png" style="width: 100%;" /><h4>Multilingual content</h4><p><code>jina-vlm</code>’s multilingual support means that it can recognize languages and describe informational content in a different language from its presentation:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0473ad1847e931a8f3b33e04c4e70ead94c2e5a8-1824x1104.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3fc53c96bb1ae235b29a01df05c0ffcf6b68bb4-1864x774.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/75376c40ae29ea33519dc79c55a46dc9f9f42438-1870x802.png" style="width: 100%;" /><p>It can also list the languages when there’s more than one:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1acf868a00ddd221a563d33a3d73a353ed816124-1854x992.png" style="width: 100%;" /><p><code>jina-vlm</code> may answer in a different language than you asked for, if the image content is in another language or strongly associated with it:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/2778d64e56cde5f438e92a07555a80ae3fe48eb4-1934x710.png" style="width: 100%;" /><p>To fix this, specify the language in the query. With the same image:</p><ul><li><strong>Input Instruction: </strong>Summarize the content of this document in English.</li><li><strong>Output: </strong>The image is a promotional poster for a KFC Thailand event. The poster is predominantly red with white and black text. At the top, there is a logo of KFC, which is a red and white square with a black silhouette of Colonel Sanders. Below the logo, there are three young men, presumably the main participants or hosts of the event. They are dressed in casual attire, with two of them wearing red jackets and one in a white shirt with a black stripe.<br/><br/>The central part of the poster features a large text in white that reads, “เปิดตัวใหม่! ร่วมกับ KFC Thailand พบกับความสนุกสนานที่ไม่เหมือนใคร!” which translates to “Launching New! Join KFC Thailand and experience something unique!” Below this, there is a smaller text that says, “ทุกคนมีส่วนร่วม! วันนี้ 8 พฤศจิกายน 64 เวลา 12:00 น.” indicating that the event is happening on November 8, 2021, at 12:00 PM.<br/><br/>The bottom section of the poster includes social media icons and a Facebook page link, suggesting that the event will be live on Facebook. 
There are also icons of a chicken drumstick and a bucket, which are associated with KFC, reinforcing the brand’s identity.<br/><br/>The overall design of the poster is vibrant and eye-catching, with a clear focus on the event’s theme and the brand’s presence. The use of red and white colors is typical for KFC branding, which is known for its bold and energetic visual style. The poster is designed to attract attention and inform viewers about the upcoming event.</li></ul><h4>Complex information displays</h4><p>We specifically trained <code>jina-vlm</code> to understand complex information displays and to be able to summarize them in natural language.</p><p><strong>Infographics:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b964abf7f3f1176b0546babae3f4ae1d22d99c86-1850x1396.png" style="width: 100%;" /><p><strong>Technical and educational diagrams:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8499b5793faa3c4879e1b1ec8994c76e3b10ad20-1850x838.png" style="width: 100%;" /><p><strong>Maps:</strong></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e6c3d47890dbaf9476963e992126c0f497eacda-1854x758.png" style="width: 100%;" /><h2>Screenshots</h2><p><code>jina-vlm</code> has been trained to parse and understand screenshots:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/50d17af1a4c08e4a69c4fb625229264dbfc65ce8-1848x708.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3a9273a5be5134185005bdb5efe7666f5494abb2-1864x1104.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5b66c570624d61c4e4c9f31791cd63656258b238-1850x810.png" style="width: 100%;" /><p>It can even parse video games:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/335b9482505a7dd2dcd86851f19ae514647368a7-1698x902.png" style="width: 100%;" /><h3>Scans</h3><p><code>jina-vlm</code> excels at materials like scans and images of printed text:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bb057adc4451107d6cc2b3dde90eb11d562ef581-1494x1468.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/687fb7b5c707ade1511ee6d40a7d034cfa7cfefd-1796x1540.png" style="width: 100%;" /><h3>Pre-digital materials</h3><p>Robust support for scans also makes pre-digital materials accessible to digital search technologies:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6b560a39e1e48d0e6755ed7b5273be77f03eca3c-1788x936.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d998c8e1ece6f1fc087b6d815b41ae11e16ee0bd-1498x1094.png" style="width: 100%;" /><h3>Social media</h3><p>Since much of the internet is “<a href="https://doctorow.medium.com/my-mcluhan-lecture-on-enshittification-ea343342b9bc">five giant websites filled with screenshots of text from the other four,”</a> <code>jina-vlm</code> is especially helpful in processing the kind of content that the web is full of:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f901fd079c7c53a6c0517e93b90b28367490710-1856x1044.png" style="width: 100%;" /><p><code>jina-vlm</code> can <em>sometimes</em> even explain memes:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4b55322d02cc13c6e718721763233f26497c64d0-1862x608.png" style="width: 100%;" /><h2>Visual
question-answering</h2><p>“A picture is worth a thousand words.” As cliché as that saying is, multimodal AI can prove it.</p><p>VQA is the domain where <code>jina-vlm</code> excels and the one with the most potential for new and revolutionary applications. <code>jina-vlm</code> answers questions about the content of images just as a language model answers questions about texts. This means it can serve as part of a RAG system that assembles responses from mixtures of images and texts stored in a database.</p><p>The infographic below is very information-rich, and with <code>jina-vlm</code> you can ask natural language questions about its content:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5f71a444daedfae06eccbd6555451ea0d53889be-1792x1388.png" style="width: 100%;" /><p>Or, using the same image:</p><ul><li><strong>Input Instruction: </strong>What is the crew module of the SLS called?</li><li><strong>Output: </strong>Orion spacecraft</li></ul><p>This capability makes it possible to query formal schematics and diagrams:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a11f29f186455831f36849695ff69e0425d78f42-1792x838.png" style="width: 100%;" /><p>With the same image:</p><ul><li><strong>Input Instruction: </strong>How many bathrooms are there?</li><li><strong>Output: </strong>4</li></ul><p>If you use photographs or natural scenes, <code>jina-vlm</code> is typically good with high-visibility characteristics:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/61e2eebcd677e7a38e47cfe812b57c2f43022b43-1796x844.png" style="width: 100%;" /><p>But if you’re asking about finer, less visible details, the answers might get spotty:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6377ac8e9156cc0515abd54d7090d5739a575bd7-1786x758.png" style="width: 100%;" /><p>Thank you, Captain Obvious.</p><p><code>jina-vlm</code> is also not well-suited to resolving optical illusions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/26304a5b8b4159e67d80712006f39db3ec90d5af-1794x626.png" style="width: 100%;" /><p>You can ask questions about text in photographs:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f5dc84db1d9a25f68fa914928b1d9effdaf0e821-1630x1068.png" style="width: 100%;" /><p>And <code>jina-vlm</code> is ideally suited to understanding traditional magazine and newspaper advertisements:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c83c55998e7b68e2a802572e8234d1a277e50f6e-1634x1268.png" style="width: 100%;" /><p>Visual question-answering also works cross-lingually:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fad19e8b0008c52954441e98879ee170bb278291-1628x1278.png" style="width: 100%;" /><p>Using the same image:</p><ul><li><strong>Input Instruction: </strong>How much do grapes cost per kilo?</li><li><strong>Output: </strong>The cost of grapes per kilogram is 3.98.</li></ul><p>It answers questions from manual pages:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4b3cb87ad2132671f8c079fc1922088753f58036-1788x1424.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e3ef1b1e578d200511ee8805eaedba04391d43c2-1796x1532.png" style="width: 100%;" /><p>It can also answer questions about the content of posters and artistic works:</p><img
src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7771639037d176121702dc6d68307e377928e284-1500x1090.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7e8eb8e43a46d5f5f6ca93c1810bc41f7baada5b-1496x684.png" style="width: 100%;" /><p>We showed in the previous section that <code>jina-vlm</code> is not very good at general face recognition, outside of a few very famous people. However, it can use the text in images to answer questions it can’t answer from just a picture.</p><h2>Best practices</h2><p><code>jina-vlm</code> is a generative language model. It’s not a database. It’s prone to error and hallucination. It’s very important to take that into account in your use case.</p><p>To minimize errors, we offer the following advice:</p><h3>Test your prompts</h3><p>If you depend on getting certain kinds of results consistently, you should do thorough quality testing on your queries. Getting a few good results with one prompt does not guarantee the same quality of result with different images. You can minimize this risk by manually reviewing the results of queries over a broad collection of images that are genuinely representative of what you plan to process.</p><h3>Specify the language</h3><p>Add to your query the language you expect the answer to be in. Like all language models, <code>jina-vlm</code> cannot easily distinguish between your instructions and the data it’s supposed to apply your instructions to. The only way for it to know how to answer is to look at the languages it recognizes in its input or to follow your explicit instructions.</p><h3>Don’t ask leading questions</h3><p>It’s very easy to get <code>jina-vlm</code> to hallucinate by asking leading questions. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/454b1b483a0d555f37dbf575e064fdbb13a3b517-1658x676.png" style="width: 100%;" /><h3>Prompt for short answers</h3><p><code>jina-vlm</code> can be very wordy sometimes. Using words like “briefly” and “concise” reduces the size of the output.</p><p>Compare:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0805c2868fa220c3bd39d6cee6d6092d50c79d06-1830x1042.png" style="width: 100%;" /><p><br/>And with the same image:</p><ul><li><strong>Input Instruction: </strong>What kind of vehicle is this? Be very concise.</li><li><strong>Output: </strong>This is a Massey Ferguson tractor.</li></ul><h3>Only ask questions relevant to the image</h3><p><code>jina-vlm</code> will not always be able to tell that your question is irrelevant to the content and can easily hallucinate answers. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a493d89b6a3aba7040083f4028da9cda0bc37aa5-1778x1272.png" style="width: 100%;" /><p>With some planning, you might be able to reduce this effect by using a decision tree: Ask multiple questions, each one dependent on the answer to the previous one. Using the same Picasso painting above, we would never have asked about the language of the text if we had asked this question first:</p><ul><li><strong>Input Instruction: </strong>Is there any written language in this image? Be very concise.</li><li><strong>Output: </strong>No</li></ul><h2>Think like a language model</h2><p>We trained <code>jina-vlm</code> to attend to the image given to it.
This means that no matter what you ask it, its answer will be a result of the picture it sees. Anthropomorphizing language models is a major hazard for AI users, but it’s easy to see how <code>jina-vlm</code> “thinks” if we ask it questions that call for reflection. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a49ce6aed18dcf023deaaf52c0b2e536e237b2b7-1782x910.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4711f4c8da535107dff421275af6bbbdd602aaf8-1774x916.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/805c2b4598999094b3c36c828813b471316237bf-1772x916.png" style="width: 100%;" /><p>You can ask <code>jina-vlm</code> anything, even questions like, “How are you feeling?” and it will answer in a way that is, if possible, attentive to the image.</p><p><code>jina-vlm</code> has no filter, safety training, or reliable source of facts, and it only knows that what it sees in the image you give it matters. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c1e91515231fd88b707480f102c98576fa7707d6-1778x594.png" style="width: 100%;" /><p>But:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/65576bf161a845447f17f1f42615a6e679767b37-1768x622.png" style="width: 100%;" /><p>This is not a bug; it’s a feature. <code>jina-vlm</code> will rely on what you give it to ground what it says. It’s up to you to decide what to give it.</p><p>As they say,</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/66efa7ef107db4948b25d32843699244e3a8bb9c-1770x784.png" style="width: 100%;" /><p>And,</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a542cf0363b432662ef4bf78946c9f75d3681e0e-1782x830.png" style="width: 100%;" /><h2>Getting started with jina-vlm</h2><p><code>jina-vlm</code> is available via a public <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI-compatible API</a> at <a href="https://api-beta-vlm.jina.ai">https://api-beta-vlm.jina.ai</a>. To use this API, get a <a href="https://jina.ai/api-dashboard">free Jina API key at the Jina website</a>. It comes with 10 million free tokens to try out Jina by Elastic’s models. Additional tokens are available for purchase.</p>
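<p>Because the API is OpenAI-compatible, a request is just a standard chat completion with an image attached. Here’s a minimal sketch, assuming the usual <code>/v1/chat/completions</code> path, <code>jina-vlm</code> as the model name, and a hypothetical image URL (check the instructions linked below for the exact details):</p><pre><code>POST https://api-beta-vlm.jina.ai/v1/chat/completions
Authorization: Bearer $JINA_API_KEY
Content-Type: application/json

{
  "model": "jina-vlm",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "image_url",
          "image_url": { "url": "https://example.com/jelly-beans.png" } },
        { "type": "text",
          "text": "Describe this image. Be concise." }
      ]
    }
  ]
}</code></pre><p>You can find <a href="https://jina.ai/news/jina-vlm-small-multilingual-vision-language-model/#getting-started">instructions for accessing the <code>jina-vlm</code> API on the Jina AI website</a>, and there’s a <a href="https://jina.ai/api-dashboard/visual-qa-test">web-based sandbox for trying out images and requests</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de2fa4ded4913a7592665e2944d395ef87c8d768-1072x798.png" style="width: 100%;" /><p>You can also <a href="https://huggingface.co/jinaai/jina-vlm">download the model from Hugging Face</a> and run it locally for noncommercial purposes. For commercial deployment on site, contact <a href="https://www.elastic.co/contact?pg=global&amp;plcmt=nav&amp;cta=205352">Elastic sales</a>.</p><p>For more technical information about the model, including pricing information, visit <a href="https://jina.ai/models/jina-vlm/">the jina-vlm model page</a>. You can also read <a href="https://arxiv.org/abs/2512.04032">the release paper on arXiv.org</a>.</p>]]></content:encoded>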
        </item>
        <item>
            <title><![CDATA[Using Groq with Elasticsearch for intelligent queries]]></title>
            <description><![CDATA[Learn how to use Groq with Elasticsearch to run LLM queries and natural language searches in milliseconds.]]></description>
            <link>https://www.elastic.co/search-labs/blog/groq-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/groq-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Mark Puddick]]></dc:creator>
            <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c9661fc360c775ebfe48ccda81160057c49f7088-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>One of the challenges with using large language models (LLMs) in conjunction with Elastic is that we often need fast results. Elastic has no issues providing millisecond response times. However, once we introduce LLM calls, performance can drop to unacceptable levels. This is where hardware inference using Groq can supercharge the speed of your results when combining Elastic with an LLM.</p><p>Groq is a hardware and software company focused on delivering ultra-low-latency, deterministic AI inference at scale. Its core innovation is the Groq Language Processing Unit (LPU) Inference Engine, a purpose-built, custom-designed chip architecture specifically engineered for running LLMs at extremely high speed with predictable performance. The links below give a more detailed overview of the Groq architecture.</p><ul><li><a href="https://groq.com/lpu-architecture">Groq: Introducing the LPU</a></li><li><a href="https://groq.com/blog/artificialanalysis-ai-llm-benchmark-doubles-axis-to-fit-new-groq-lpu-inference-engine-performance-results">Groq: The New LLM Inference Benchmark</a></li></ul><p>Unlike traditional GPU-based systems, Groq&#x27;s inference-specific architecture allows it to process tokens at unprecedented throughput with minimal variance in response time. This directly addresses the memory bandwidth bottlenecks and scheduling overhead that typically slow down traditional LLM calls, ensuring that integrating an LLM with Elastic&#x27;s search results maintains a real-time user experience. Groq provides this industry-leading speed and performance, often at the best price performance, via GroqCloud, an easy-to-use tokens-as-a-service platform.</p><p>Let’s start by looking at a common intelligent query layer request pattern and what improvements we can get from this.</p><h2>Natural language search</h2><p>Since the widespread adoption of LLMs, a common search request is the ability to use natural language for domain-specific search. A naive way of addressing this is to do a simple semantic search within a retrieval-augmented generation (RAG) workflow; however, in most cases, this doesn’t provide the desired results. This is largely due to specific attributes in the question that need to be translated into query terms. To address this, we can ask an LLM to generate a query that we can execute. However, this leaves a lot of room for error. Ultimately, we’ve found that providing a tool with specific parameters for the domain and using that with the LLM gives the best result. There’s more info on this in <a href="https://www.elastic.co/search-labs/blog/mcp-intelligent-search">this blog</a>.</p><p>To define the agent, we’ll use the following prompt:</p><pre><code>You are a helpful banking transaction agent. You help users search and analyze their banking transactions.

Current date: {current_date}

When users ask about transactions, use the appropriate tools:
- Use trans-search for finding specific transactions

For date references:
- "last month" = past 30 days from today
- "this month" = current month from 1st to today
- "last week" = past 7 days
- "this year" = January 1st of current year to today

By default, set the "to" date to today and the "from" date to 1 year ago
Common categories: groceries, dining, gas, shopping, entertainment, utilities, healthcare, transportation, travel, subscriptions, insurance, phone, internet</code></pre><p>As an example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71a1abec1c90a5faace7dc4699b2cb1096da405b-1600x215.png" style="width: 100%;" /><p>This gives us good results, but our search time goes from under 100ms to over 1 second, due to the LLM call.</p><p>To address this, we can use Groq’s hardware inference to run this query in a fraction of the time. To run through the example, you’ll need to <a href="https://console.groq.com/home">sign up for a Groq account</a>.</p><p>Then you can generate an API key from the top-right menu:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6acd9e3155c4008aa903c8f0ad1276a7b1af8bea-970x330.png" style="width: 100%;" /><p>We’ve created the tool <a href="https://github.com/markpudd/transaction_search_mcp">here</a> to be able to execute the search.</p><p></p><p></p><p></p><p>Once you’ve cloned the above repo, you’ll need to update the .env to point to Groq:</p><pre><code>OPENAI_API_KEY=gsk-........

OPENAI_API_BASE=https://api.groq.com/openai/v1

OPENAI_MODEL=openai/gpt-oss-20b</code></pre><p>We’ve used the 20b gpt-oss model, as this will give accurate results. There’s little to no gain in using a bigger model for this type of solution.</p><p>Now, for testing, we can run this from a simple UI with a prompt to use the tool:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9fe93bd101f230ff72f43b126ea88037b624bee5-1600x957.png" style="width: 100%;" /><p>To test the timings around this, we’ll run the tool 50 times and get the average response time for the total, the LLM, and Groq. We’ll use ChatGPT-4.1-nano and the Groq OSS-20b model. Here’s the result from this test:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d8eb1cca56914a54a19a80a3983106753b5e7367-1568x1324.png" style="width: 100%;" /><p>It’s clear that we drop around a second by using Groq’s hardware inference. We’re also using a smaller model which, for this use case, still provides good results. By dropping this from 1.5 seconds to 250ms, we’ll generally fall inside the Service Level Agreement (SLA) levels for a lot of organizations.</p><h2>Elastic Agent Builder</h2><p>We’ve shown how Groq can accelerate natural language processing (NLP) search with Elastic; it can also accelerate Elastic Agent Builder. Agent Builder was recently released into technical preview and is now capable of connecting to Groq via the Groq endpoint. Agent Builder is available on Elastic 9.1+. We can use the same API key that we used earlier.</p><p>Here’s how you set this up:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/237e1c08060d0598dc398750f636f5da15807163-1600x171.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/065e70fedf794a2d791340ed88785e07e2579dca-1536x596.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ea0979dfcbc4de7cd3db748b575dbfd718a9fbb9-1600x214.png" style="width: 100%;" /><p>If you’re using serverless, you need to create a new connector from the stack management connectors page. First, click AI Connector.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba55fbc1e424792de3afe567ad92547e7a4165de-530x700.png" style="width: 100%;" /><p>On the next screen, select Groq as the service:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e628e81738de5ffdeaf5a5e533a7070ffad44354-1232x1246.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43863ceb10f4c0d3f306e660a0a23f514869d5ef-670x414.png" style="width: 100%;" /><p>You can then set up the model you want to use. The supported models are listed on the <a href="https://console.groq.com/docs/models">Groq website</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fdbf14924d209b69da572e3d1aca883403444d5a-654x663.png" style="width: 100%;" /><p>If you need to add your organization ID, this can be added by expanding <strong>More options</strong> under <strong>Settings</strong>.</p><p>If you’re on a hosted version of Elastic, at the time of this writing, you can use the OpenAI-compatible endpoint on Groq to connect to Elastic. 
To do this, select the OpenAI service and use a custom URL that points to the Groq URL, as below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/71261d380c60e08280cce54f0dbf67a0cc3a3e74-1584x928.png" style="width: 100%;" /><p>Once you have set up Groq using either of the above methods, go to GenAI Settings and set Groq as your default GenAI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7c367bebb80074ba1a04116d0cc66b991a0a735d-1600x842.png" style="width: 100%;" /><p>Agent Builder will now default to using the Groq connector.</p><p>Let&#x27;s see if we can replicate the NLP search within Agent Builder using Groq.</p><p>In order to create agents, we generally need to have some tools for the agent to use. In Agent Builder, you have the ability to use built-in tools or create your own. A number of built-in tools are documented <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/tools/builtin-tools-reference">here</a>.</p><p>You can use these tools for your transaction searching. The LLM will use the built-in tools, such as the `index_explorer`, `generate_esql`, and `execute_esql`, which will try to find the relevant index, inspect its structure, and execute a generated Elasticsearch Query Language (ES|QL) query. However, this presents a few challenges:</p><ul><li>The time to run the agent will increase substantially, as there will be multiple reasoning steps and tool executions. Since we’re using Groq to get faster results, this isn’t ideal.</li><li>As the number of steps and the tool usage grows, we’re going to consume substantially more tokens and therefore increase cost.</li></ul><p>To avoid the above issues, we can create a new tool designed specifically to search transactions. At the time of this writing, there are three types of tools we can use:</p><ul><li>ES|QL tools: These allow you to use templated ES|QL to define a query.</li><li>Index search tools: These allow you to provide an index, and the LLM creates the query.</li><li>Model Context Protocol (MCP) tools: These allow you to use external tools via MCP.</li></ul><p>We could use our previously created MCP tools; however, to keep things simple, we’ll use the index search tool. You can set this up as below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f9b85dacf677613e5421cd6b4590f4b57d18e6b6-1505x1600.png" style="width: 100%;" /><p>Once we’ve created the tools, we can create an agent in Agent Builder. 
To do this, we click the <strong>Create agent</strong> button and fill in the fields shown in the screenshot below, using the prompt we used in our original example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/74c190822b3ebd974b224d749e5229e9f3552b83-1600x1112.png" style="width: 100%;" /><p>We also want to select the tool we created as part of the agent:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/270fc98a264dacb4a48af60135f9a82bbf6cdc6e-1600x1045.png" style="width: 100%;" /><p>Then we can test in the Agent Builder UI by asking a few different questions:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3fde2839943659e599c013cae50929913c81d092-1600x1376.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ba2bb5b219c23289dd5b1c4608dcf98379bbf030-1600x1445.png" style="width: 100%;" /><p>We actually get some more functionality via Agent Builder, since it can create extra queries due to the extra built-in tools we selected. The only real disadvantage of this is that it can take longer to answer questions overall, since the LLM has the ability to do more. Again, this is where Groq can help. Let&#x27;s take a look at the performance difference in Agent Builder with Groq.</p><h2>Performance with Groq in Agent Builder</h2><p>A great feature of Agent Builder is that it has MCP and agent-to-agent (A2A) support out of the box. We can use this to do some simple benchmark testing. Using A2A, we can replace the built-in agent in the UI and test harness. This allows us to test Agent Builder with Elastic LLM and a couple of different models in Groq.</p><p>There’s an <a href="https://github.com/markpudd/a2a_chat">updated repo</a> that has the benchmark script in it.</p><p>To test this, we’ll ask the question:</p><p><em>How much did I spend on gas?</em></p><p>The results of the testing are shown below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Metric</td><td style="border: 1px solid #ccc; padding: 4px;">Groq openai/gpt-oss-120b</td><td style="border: 1px solid #ccc; padding: 4px;">Groq llama-3.3-70b-versatile</td><td style="border: 1px solid #ccc; padding: 4px;">Elastic LLM</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Min (s)</td><td style="border: 1px solid #ccc; padding: 4px;">6.040</td><td style="border: 1px solid #ccc; padding: 4px;">4.433</td><td style="border: 1px solid #ccc; padding: 4px;">15.962</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Max (s)</td><td style="border: 1px solid #ccc; padding: 4px;">9.625</td><td style="border: 1px solid #ccc; padding: 4px;">7.986</td><td style="border: 1px solid #ccc; padding: 4px;">24.037</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Mean (s)</td><td style="border: 1px solid #ccc; padding: 4px;">7.862</td><td style="border: 1px solid #ccc; padding: 4px;">6.216</td><td style="border: 1px solid #ccc; padding: 4px;">17.988</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Median (s)</td><td style="border: 1px solid #ccc; padding: 4px;">7.601</td><td style="border: 1px solid #ccc; padding: 4px;">6.264</td><td style="border: 1px solid #ccc; padding: 4px;">17.027</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">StdDev (s)</td><td style="border: 1px solid #ccc; padding: 4px;">1.169</td><td style="border: 1px solid #ccc; padding: 4px;">1.537</td><td style="border: 1px solid #ccc; padding: 4px;">2.541</td>
                    </tr>
              </tbody>
            </table>
          <p>As you can see, the built-in Elastic LLM is not bad, but Groq still outperforms it by almost 3x on average. You’ll also notice that overall response times are substantially slower than with the external app. This is due to the way we’ve set up our tool in Agent Builder just to use the index. As a result, a lot of this time is taken by Agent Builder reasoning (that is, inspecting the index). We could use templated ES|QL tools instead of the index, and this would move the results closer to the external app.</p><h2>Conclusion</h2><p>It’s clear that by using Groq with Elastic we open up a range of new possibilities in which speed is an important factor. This article covers the basic intelligent query example, but there are many other applications, such as image understanding, summarization, and captioning, that become possible with the 10x increase in speed.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ES|QL dense vector search support]]></title>
            <description><![CDATA[Using ES|QL for vector search on your dense_vector data.]]></description>
            <link>https://www.elastic.co/search-labs/blog/dense-vector-search-elasticsearch-query-language</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/dense-vector-search-elasticsearch-query-language</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Carlos Delgado]]></dc:creator>
            <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c61f7d90a3d5c79da51518f56a115bf059224cad-1024x559.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>You can now use Elasticsearch Query Language (ES|QL) for vector search! ES|QL can retrieve, filter, and score <code>dense_vector</code> fields. Use k-nearest neighbors (KNN) queries for fast, approximate nearest neighbors search at scale. Use vector similarity functions for exact search and custom scoring.</p><p>KNN is simpler to use in ES|QL than in the Query DSL. Prefilters and the number of results to retrieve per shard are automatically inferred from the ES|QL query.</p><h2><strong>What is vector search?</strong></h2><p>Modern search is no longer limited to exact keyword matching. Users expect systems to understand <em>meaning</em>, not just text. This is where <strong>vector embeddings</strong> and Elasticsearch’s <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field type come in.</p><p>The easiest way to use vector search in Elasticsearch is to use the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"> field type</a>. It allows you to automatically generate text embeddings, perform semantic search, and handle chunking. However, you may want to use <code>dense_vector</code> instead when:</p><ul><li>You’re already using <code>dense_vector</code> fields.</li><li>You’re using non-textual data, like images, sound, or video.</li><li>You need to generate embeddings separately from ingestion into Elasticsearch.</li><li>You need to do custom or advanced scoring.</li><li>You want to perform exact nearest neighbors search.</li></ul><p>A <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> stores numerical embeddings produced by machine learning models. These embeddings capture semantic similarity: documents with similar meaning have vectors that are close to each other in high-dimensional space.</p><p>With vectors, you can build:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/semantic-search"><code>Semantic text</code> search</a>, for finding documents related to a question.</li><li><a href="https://www.elastic.co/what-is/retrieval-augmented-generation">Retrieval-augmented generation</a> (RAG).</li><li>Recommendation systems.</li></ul><p>ES|QL brings the power of a query-piped experience to Elasticsearch. Adding first-class support for <code>dense_vector</code> fields means you can now <strong>retrieve, filter, score, and search using vectors directly in ES|QL</strong>, alongside your text and non-text data.</p><p>In this post, we’ll walk through how to work with <code>dense_vector</code> fields in ES|QL, from basic inspection to approximate and exact similarity search, and how to use vector search as part of hybrid search strategies.</p><h2><strong>The basics: Retrieving vector data</strong></h2><p>Assume you have an index with a mapping similar to:</p><pre><code>{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "category": { "type": "keyword" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 384,
        "similarity": "cosine"
      }
    }
  }
}</code></pre><p>You can retrieve vector fields just like any other column:</p><pre><code>FROM documents
| KEEP title, content_vector
| LIMIT 5</code></pre><p>Keep in mind that vectors can be large. For exploration and debugging, it may be useful to retrieve vector data, but in production you should avoid returning full vector data unless it&#x27;s really necessary.</p><p>You can use familiar ES|QL constructs to check how many rows have vector information:</p><pre><code>FROM documents
| WHERE content_vector IS NOT NULL
| STATS non_null = COUNT(*)</code></pre><h2><strong>Approximate search using KNN</strong></h2><p>Vector search means finding the most similar vectors to a given query vector.</p><p>For large datasets, the most common approach is <a href="https://www.elastic.co/blog/understanding-ann">approximate nearest neighbor</a> (ANN) search. ANN tries to find the most similar vectors by using data structures that allow for fast computation of similar vectors but doesn’t guarantee that all vectors will be considered.</p><p>ES|QL exposes approximate search via the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code> function</a>:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...])
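// KNN populates the _score metadata column; LIMIT 10 below implicitly sets k to 10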
| SORT _score DESC
| KEEP title, _score
| LIMIT 10</code></pre><p>This simple example:</p><ul><li>Searches over the <code>content_vector</code> field.</li><li>Uses the query vector <code>[0.12, -0.03, 0.98, ...]</code> to search for similar vectors.</li><li>Sorts the results by score, using the <code>METADATA _score</code> attribute that will be populated by the `KNN` function.</li><li>Keeps just the title and score, as the <code>content_vector</code> field doesn&#x27;t need to be returned, and we can avoid loading its contents.</li><li>Retrieves the top 10 elements by using <code>LIMIT</code>. This automatically sets <code>k</code> to 10 in the <code>KNN</code> function.</li></ul><p>The <code>KNN</code> function can be further customized by using options:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...], {"k": 20, "min_candidates": 100, "rescore_oversample": 4.0})
| SORT _score DESC
| KEEP title, _score
| LIMIT 10</code></pre><p>See the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code> function</a> named parameters for a complete description of the available parameters.</p><h3><strong>Combining KNN with filters</strong></h3><p>You can narrow down the candidate set for the vector search:</p><pre><code>FROM documents METADATA _score
| WHERE category == "tutorial"
| WHERE KNN(content_vector, [0.12, -0.03, 0.98, ...])
| SORT _score DESC
| LIMIT 10
| KEEP title, category, _score</code></pre><p>Of course, you can use any other <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/where"><code>WHERE</code> clauses</a> that filter the results or include <code>KNN</code> as part of a filter expression:</p><pre><code>FROM documents METADATA _score
| WHERE published_date > NOW() - 1 hour AND LENGTH(category) > 10 AND KNN(content_vector, [0.12, -0.03, 0.98, ...])</code></pre><h3><strong>KNN made simple</strong></h3><p><code>KNN</code> is simpler to use in ES|QL. You won&#x27;t have to specify prefilters or k for your query explicitly.</p><p><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-filtering">Prefilters</a> are the way to ensure that a <code>KNN</code> query returns as many results as expected. Prefilters are applied on the KNN search itself, instead of being applied after the query.</p><p>Keep in mind that <code>KNN</code> returns the top k results it&#x27;s been asked for. If filters are applied after the <code>KNN</code> query, some of the results returned by the query may be filtered. If that happens, we’ll retrieve fewer results than expected.</p><p>The Query DSL <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query"><code>knn</code></a><a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query"> query</a> contains a section for specifying prefilters:<br/></p><pre><code>POST my-image-index/_search
{
  "query" : {
    "knn": {
      "field": "content_vector",
      "query_vector": [0.12, -0.03, 0.98, ...],
      "filter" : {
        "term" : { "category" : "tutorial" }
      }
    }
  }
}</code></pre><p>You don&#x27;t need to care about prefilters when using <code>KNN</code> in ES|QL. All filters are applied as prefilters for the <code>KNN</code> function, so there’s no need to specify them as a specific option or command; just use `WHERE` and let ES|QL do it for you!</p><p>KNN also allows specifying the number of results to retrieve per shard; that is, the k parameter. Similar to the Query DSL, k defaults to the <code>LIMIT</code> specified in your query.</p><h2><strong>Exact search using vector similarity functions</strong></h2><p><code>KNN</code> is designed to be fast, and that makes it ideal for large datasets (hundreds of thousands or millions of vectors) and latency-sensitive applications. The trade-off is that results are <em>approximate</em>, though usually very accurate.</p><p>Sometimes you want <em>exact</em> similarity computation instead of approximate search, for example:</p><ul><li>When your dataset is small.</li><li>When the filters used in the query are very restrictive and select a small subset of your dataset.</li></ul><p>ES|QL provides the following <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions">vector similarity functions</a>:</p><ul><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_cosine"><code>V_COSINE</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_dot_product"><code>V_DOT_PRODUCT</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_hamming"><code>V_HAMMING</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_l1_norm"><code>V_L1_NORM</code></a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_l2_norm"><code>V_L2_NORM</code></a></li></ul><p>Using these functions, you can calculate the similarity of your query vector with all the vectors your query retrieves.</p><p>The following query uses the same mapping as our <code>KNN</code> example, above, but does exact search using cosine similarity:</p><pre><code>FROM documents
| EVAL similarity = V_COSINE(content_vector, [0.12, -0.03, 0.98, ...])
| SORT similarity DESC
| KEEP title, similarity
| LIMIT 10</code></pre><p>This query:</p><ul><li>Computes the similarity using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-v_cosine"><code>V_COSINE</code></a> vector similarity function.</li><li>Sorts on the computed similarity.</li><li>Keeps the top 10 similar results.</li></ul><h2>Semantic search</h2><p>When doing semantic search, you&#x27;ll be trying to match a text query to your vectors. Of course, you can calculate the embeddings yourself first and then supply the query vector directly to your vector search.</p><p>But it would be much simpler to allow Elasticsearch to calculate the embeddings for you by using the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-text_embedding"><code>TEXT_EMBEDDING</code></a> function:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, TEXT_EMBEDDING("my semantic query", inference_id))
| SORT _score DESC
| LIMIT 10
| KEEP title, _score</code></pre><p><a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-text_embedding"><code>TEXT_EMBEDDING</code></a> uses an already existing <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-inference">inference endpoint</a> to automatically calculate the embeddings and use them in your query.</p><h2>Hybrid search</h2><p>Most searches don&#x27;t rely just on vector data; they need to be combined with lexical search, as well, so we have the best of both worlds:</p><ul><li>Lexical information is great for exact search of words and synonyms and provides a strong signal that users are looking for specific terms.</li><li>Vectors capture meaning and intention, using similar phrases or terms that are not lexically related.</li></ul><p>Combining vector search and lexical search is best done using <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a>:</p><pre><code>FROM documents METADATA _score, _id, _index
| FORK
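// first branch: semantic (KNN); second branch: lexical (MATCH); FUSE merges them with RRF by default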
(WHERE KNN(content_vector, TEXT_EMBEDDING("my query", inference_id)) | SORT _score DESC | LIMIT 10)
(WHERE MATCH(title, "my query") | SORT _score DESC | LIMIT 10)
| FUSE
| SORT _score DESC
| LIMIT 10</code></pre><p>The query above:</p><ul><li>Uses <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> to do two queries:<ul><li>A <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/dense-vector-functions#esql-knn"><code>KNN</code></a> query over a `dense_vector` field.</li><li>A <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/search-functions#esql-match"><code>MATCH</code></a> query over a text field.</li><li>Both queries sort by score and return the top 10 results for each.</li></ul></li><li>Uses <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> to mix together the query results by using reciprocal rank fusion (RRF) by default.</li></ul><p>This allows total control of the queries you want to perform, how many results to retrieve from each one, and how to combine the results together.</p><p>Check out our <a href="https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql#vector-search">multistage retrieval</a> blog post for more details on how modern search works and how easy it is to implement via ES|QL.</p><h2>Custom scoring</h2><p>Calculating custom scoring is easy using ES|QL! Just use the <code>_score</code> metadata field for calculating your custom score:</p><pre><code>FROM documents METADATA _score
| WHERE KNN(content_vector, TEXT_EMBEDDING("my semantic query", inference_id))
| EVAL my_custom_score = _score * 1.5 + ...
| SORT my_custom_score DESC
| LIMIT 10</code></pre><p>If you&#x27;re using exact search, you already have an evaluation for the vector similarity that you can fine-tune:</p><pre><code>FROM documents
| EVAL similarity = V_COSINE(content_vector, [0.12, -0.03, 0.98, ...])
| EVAL my_custom_score = similarity * 1.5 + ...
| SORT my_custom_score DESC
| LIMIT 10</code></pre><p>Compared to the Query DSL <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-script-score-query"><code>script_score</code></a>, this is a much simpler and more iterative approach and fits perfectly into the ES|QL execution flow.</p><h2>Using query parameters</h2><p>When using a query vector, you can specify it directly on the query as in our previous examples. But you may have noticed that we&#x27;re using ellipses (...) to signal that there&#x27;s more data to come.</p><p>Dense vectors are usually high dimensional; they can have hundreds or thousands of dimensions, so copying and pasting your query vector on the query itself can make it difficult to understand or reason about, as you&#x27;ll be seeing thousands of numeric values on your screen.</p><p>Remember that you can use ES|QL query parameters for supplying parameters to your query:</p><pre><code>POST _query
{
  "query": """
FROM documents METADATA _score
| WHERE KNN(content_vector, ?query_vector)
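// ?query_vector is bound from the "params" array below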
| SORT _score DESC
| KEEP title, _score
| LIMIT 10
   """,
 "params": [{"query_vector" : [0.12, -0.03, 0.98, ...]}]
}</code></pre><p>This helps to keep your query and parameters separated, so you can focus on the query logic and not on specific parameters that get in your way.</p><p>Using query parameters for vectors is also more performant, as vectors are parsed faster that way using the request parser instead of the ES|QL parser.</p><h2><strong>Conclusion</strong></h2><p>ES|QL doesn’t just support vector search; it makes it a natural part of how you query your data. It allows you to use a single, powerful syntax for text, vectors, and everything in between, including:</p><ul><li>Vector search, both approximate and exact.</li><li>Semantic search, using text for search over vector data.</li><li>Hybrid search, combining the best of text and vector search.</li><li>Custom vector scoring, using <code>EVAL</code> and ES|QL constructs.</li></ul><p>Vector search in ES|QL is easier than in the Query DSL, by inferring prefilters and parameters, and integrating with the expressive, feature-rich expressions that ES|QL allows.</p><p>Defining KNN as part of a query pipeline for <a href="https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql#vector-search">multistage retrieval</a> is just another piece in the query; you can keep using filters, combine with other text functions for hybrid search, and apply reranking or query completion on top of your vector results.</p><p>We’ll keep adding vector functions for performing vector arithmetic and aggregations over dense vectors, so you can use the full power of ES|QL to manipulate your vector data.</p><p>Happy (vector) searching!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up vector ingestion using Base64-encoded strings]]></title>
            <description><![CDATA[Introducing Base64-encoded strings to speed up vector ingestion in Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/base64-encoded-strings-vector-ingestion</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jim Ferenczi, Benjamin Trent, Ignacio Vera Sequeiros]]></dc:creator>
            <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4c1ebbd7a1071e8e1721a9871cba87f6aed140e9-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We’re improving the ingestion speed of vectors in Elasticsearch. Now, in <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and in v9.3, you can send your vectors to Elasticsearch encoded as Base64 strings, which will provide immediate benefits to your ingestion pipeline.</p><p>This change reduces the overhead of parsing vectors in JSON by an order of magnitude, which translates to almost a 100% improvement in indexing throughput for DiskBBQ and around a 20% improvement for hierarchical navigable small world (HNSW) workloads. In this blog, we’ll take a closer look at Base64-encoded strings and the improvements they bring to vector ingestion.</p><h2>What’s the problem?</h2><p>At Elastic, we’re always looking for ways to improve our vector search capabilities, whether that’s enhancing existing storage formats or introducing new ones. Recently, for example, we added a new disk-friendly storage format called <a href="https://www.elastic.co/search-labs/blog/diskbbq-elasticsearch-introduction">DiskBBQ</a> and enabled vector indexing with <a href="https://www.elastic.co/search-labs/blog/elasticsearch-gpu-accelerated-vector-indexing-nvidia">NVIDIA cuVS</a>.</p><p>In both cases, we expected to see major gains in ingestion speed. However, once these changes were fully integrated into Elasticsearch, the improvements weren’t as large as we had hoped. A flamegraph of the ingestion process made the issue clear: JSON parsing had become one of the main bottlenecks.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a286408afc85ff1cd3dd448b8fdf59dd3e11d599-1600x675.png" style="width: 100%;" /><p>Parsing JSON requires walking through every element in the arrays and converting numbers from text format into 32-bit floating-point values, which is very expensive.</p><h3>Why Base64-encoded strings?</h3><p>The most efficient way to parse vectors is directly from their binary representation, where each element uses a 32-bit floating-point value. However, JSON is a text-based format, and the way to include binary data in it is by using <a href="https://en.wikipedia.org/wiki/Base64">Base64</a>-encoded strings. Base64 is just a binary-to-text encoding scheme. So instead of sending vectors as JSON float arrays:</p><pre><code>{
  "emb" : [1.2345678, 2.3456789, 3.4567891]
}</code></pre><p>We can now send vectors encoded as Base64 strings:</p><pre><code>{
  "emb" : "P54GUUAWH5pAXTwI"
}</code></pre><p>Is it worth it? Our benchmarks suggest yes. When parsing 1,000 JSON documents, using Base64-encoded strings instead of float arrays resulted in performance improvements of more than an order of magnitude, at the cost of a small encode/decode overhead (client-side Base64 encoding and a temporary byte array on the server for decoding), in exchange for eliminating expensive per-element numeric parsing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/86e5a926e13b07bb3b0abe80bd4930464e8f6f9b-1200x742.png" style="width: 100%;" /><h3>Give me some ingestion numbers</h3><p>We can see these improvements in practice when running the <a href="https://github.com/elastic/rally-tracks/blob/master/so_vector/README.md"><code>so_vector</code></a> rally track with the different approaches. The actual gains depend on how fast indexing is for each storage format. For <code>bbq_disk</code>, indexing throughput increases by about 100%, while for <code>bbq_hnsw</code>, the improvement is closer to 20%, since indexing is inherently slower there.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4277057ee59cb84d068176b56bb7fa00b66e1cb3-1200x742.png" style="width: 100%;" /><p>Starting with Elasticsearch v9.2, <a href="https://www.elastic.co/search-labs/blog/elasticsearch-exclude-vectors-from-source">vectors are excluded from <code>_source</code></a> by default and are stored internally as 32-bit floating-point values. This behavior also applies to Base64-encoded vectors, making the choice of indexing format completely transparent at search time.</p><h2>Client support</h2><p>Adding a new format for indexing vectors might require changes to ingestion pipelines. To help with this, in v9.3, Elasticsearch official clients can transform vectors with 32-bit floating-point values into Base64-encoded strings and the other way around. You might need to check the client documentation for the specific implementation.</p><p>For example, here’s a snippet for implementing bulk loading using the Python client:</p><pre><code>from elasticsearch.helpers import bulk, pack_dense_vector
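# pack_dense_vector serializes a list of floats into the Base64 string form
# shown above. Conceptually: base64.b64encode(struct.pack(f">{len(v)}f", *v)),
# i.e., big-endian float32 bytes (matching the "P54GUUAWH5pAXTwI" example).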

def get_next_document():
    for doc in dataset:
        yield {
            "_index": "my-index",
            "_source": {
                "title": doc["title"],
                "text": doc["text"],
                "emb": pack_dense_vector(doc["emb"]),
            },
        }

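# helpers.bulk streams the actions to Elasticsearch in chunks; with
# stats_only=True it returns a (successes, errors) count tuple.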
result = bulk(
    client=client,
    chunk_size=chunk_size,
    actions=get_next_document(),  # call the generator to get an iterable of actions
    stats_only=True,
)</code></pre><p>The only difference from a bulk ingest using floats is that the embedding is wrapped with the <code>pack_dense_vector()</code> auxiliary function.</p><h2>Conclusion</h2><p>By switching from JSON float arrays to Base64-encoded vectors, we remove one of the largest remaining bottlenecks in Elasticsearch’s vector ingestion pipeline: numeric parsing. The result is a simple change with outsized impact: up to 2× higher throughput for DiskBBQ workloads and meaningful gains even for slower indexing strategies, like HNSW.</p><p>Because vectors are already stored internally in a binary format and excluded from <code>_source</code> by default, this improvement is completely transparent at search time. With official client support landing in v9.3, adopting Base64 encoding requires only minimal changes to existing ingestion code, while delivering immediate performance benefits.</p><p>If you’re indexing large volumes of embeddings, especially in high-throughput or serverless environments, Base64-encoded vectors are now the fastest and most efficient way to get your data into Elasticsearch. Those interested in the implementation details can follow the related Elasticsearch issues and pull requests: #<a href="https://github.com/elastic/elasticsearch/issues/111281">111281</a> and #<a href="https://github.com/elastic/elasticsearch/issues/135943">135943</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Jina Rerankers bring fast, multilingual reranking to Elastic Inference Service (EIS)]]></title>
            <description><![CDATA[Elastic now offers jina-reranker-v2-base-multilingual and jina-reranker-v3 on EIS, enabling fast multilingual reranking directly in Elasticsearch for higher-precision retrieval, RAG, and agentic workflows without added infrastructure.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-rerankers-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-rerankers-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Brendan Jugan, Ranjana Devaji]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/be373a8af506d463948b9027f1cc2f19356f5499-720x421.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today we’re excited to introduce <a href="https://jina.ai/models/jina-reranker-v2-base-multilingual/"><code>jina-reranker-v2-base-multilingual</code></a> and <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/"><code>jina-reranker-v3</code></a> on Elastic Inference Service (EIS), enabling fast multilingual, high-precision reranking directly in Elasticsearch.</p><p>Jina AI, recently acquired by Elastic, is a leader in open-source multilingual and multimodal models, delivering state-of-the-art <a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">search foundation models</a> for high-quality retrieval and retrieval-augmented generation (RAG). <a href="https://www.elastic.co/blog/elastic-inference-service">EIS</a> makes it easy to run fast, high-quality inference with an expanding catalog of these ready-to-use models on managed GPUs, with no setup or hosting complexity.</p><p>Rerankers add semantic precision by refining the ordering of retrieved results, helping select the best matches for a query. They add relevance improvements without the time and effort involved in reindexing or pipeline disruption, and they’re especially valuable for hybrid and RAG workflows where better context boosts downstream accuracy.</p><p>This follows the recent introduction of <a href="https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service"><code>jina-embeddings-v3</code> on EIS</a>, expanding the model catalog for multilingual reranking. Developers can now perform hybrid search using lexical search with BM25F and vector search with multilingual embeddings from <code>jina-embeddings-v3</code>, then rerank results with Jina Rerankers v2 or v3 depending on the use case. This delivers full control over recall tuning natively in Elasticsearch.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1afc549ebe74619e750b3809bca014f87a6dfb73-1600x437.png" style="width: 100%;" /><h2>jina-reranker-v2-base-multilingual</h2><p><code>jina-reranker-v2-base-multilingual</code> is a compact general-purpose reranker with features designed to support function calling and SQL querying.</p><ul><li><strong>Low-latency inference at scale:</strong> A compact 278M parameter model with <a href="https://arxiv.org/abs/2307.08691">Flash Attention 2</a> for low-latency inference, delivering strong multilingual performance that outperforms larger rerankers based on <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard">AIR metrics</a> and other widely used benchmarks.</li><li><strong>Supports agentic use cases:</strong> Accurate multilingual text reranking with additional support for selecting SQL tables and external functions that match text queries, enabling agentic workflows.</li><li><strong>Unbounded candidate support:</strong> v2 handles arbitrarily large candidate lists by scoring documents independently. Scores are compatible across batches, so developers can rerank large result sets incrementally. For example, a pipeline can score 100 candidates at a time, merge the scores, and sort the combined results. 
This makes v2 suitable when pipelines don’t apply strict top-k limits.</li></ul><h2>jina-reranker-v3</h2><p><code>jina-reranker-v3</code> performs multilingual listwise reranking, offering state-of-the-art performance with higher precision for RAG and agent-driven workflows.</p><ul><li><strong>Lightweight, production-friendly architecture:</strong> A ~0.6B parameter listwise reranker optimized for low-latency inference and efficient deployment in production settings.</li><li><strong>Strong multilingual performance:</strong> <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/">Benchmarks</a> show that v3 delivers state-of-the-art multilingual performance while outperforming much larger alternatives and maintaining stable top-k rankings under permutation.</li><li><strong>Cost-efficient, cross-document reranking:</strong> Unlike v2, v3 reranks up to 64 documents together in a single inference call, reasoning over relationships across the full candidate set to improve ordering when results are similar or overlapping. By batching candidates instead of scoring them individually, v3 significantly reduces inference usage, making it a strong fit for RAG and agentic workflows with defined top-k results.</li></ul><p>More models are on the way. EIS continues to expand with models optimized for candidate reranking, retrieval, and agentic reasoning. Next up is <a href="https://jina.ai/news/fair-scoring-for-multimodal-documents-with-jina-reranker-m0/"><strong><code>jina-reranker-m0</code></strong></a> for multimodal reranking, followed closely by frontier models from OpenAI, Google, and Anthropic.</p><h2><strong>Get started</strong></h2><p>You can start using <code>jina-reranker-v2-base-multilingual</code> on EIS with just a few steps.</p><h3><strong>Create embeddings with jina-embeddings-v3</strong></h3><pre><code>POST _inference/text_embedding/.jina-embeddings-v3
{
  "input": [
    "The Atlantic is a vast, deep ocean.",
    "A small puddle formed on the sidewalk."
  ]
}</code></pre><p>The response:</p><pre><code>{
  "text_embedding": [
    {
      "embedding": [
        0.0061287,
        ...
      ]
    },
    {
      "embedding": [
        -0.11765291,
        ...
      ]
    }
  ]
}</code></pre><h3><strong>Rerank with jina-reranker-v2-base-multilingual</strong></h3><p>Perform inference:</p><pre><code>POST _inference/rerank/.jina-reranker-v2-base-multilingual
{
 "input": ["puddle", "ocean", "cup of tea"],
 "query": "a large body of water"
}</code></pre><p>The response:</p><pre><code>{
  "rerank": [
    {
      "index": 1,
      "relevance_score": 0.48755136
    },
    {
      "index": 0,
      "relevance_score": 0.41489884
    },
    {
      "index": 2,
      "relevance_score": 0.07696084
    }
  ]
}</code></pre><p>The response contains the ranked list of inputs sorted by relevance score. In this example, the model identifies &quot;ocean&quot; as the most relevant match for a large body of water, assigning it the highest score while correctly ranking &quot;puddle&quot; and &quot;cup of tea&quot; lower.</p><h3><strong>Rerank with jina-reranker-v3</strong></h3><p>Perform inference:</p><pre><code>POST _inference/rerank/.jina-reranker-v3
{
  "input": ["pebble", "The Swiss Alps", "a steep hill"],
  "query": "mountain range"
}</code></pre><p>The response:</p><pre><code>{
  "rerank": [
    {
      "index": 1,
      "relevance_score": 0.06519848
    },
    {
      "index": 2,
      "relevance_score": -0.05002501
    },
    {
      "index": 0,
      "relevance_score": -0.09782915
    }
  ]
}</code></pre><p>Similar to <code>jina-reranker-v2-base-multilingual</code>, the response provides a prioritized list of the inputs sorted by relevance. In this example, the model identifies &quot;The Swiss Alps&quot; as the most relevant match for &quot;mountain range,&quot; compared to &quot;pebble&quot; and &quot;a steep hill.&quot;</p><p>However, a key difference is that <code>jina-reranker-v3</code> is a listwise reranker. Unlike <code>jina-reranker-v2-base-multilingual</code>, which scores document-query pairs individually, <code>jina-reranker-v3</code> processes all inputs simultaneously, enabling rich cross-document interactions before determining the final ranking.</p><h2><strong>What’s new in EIS</strong></h2><p><strong>EIS via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a></strong> brings EIS to self-managed clusters, allowing developers to access its GPU fleet to prototype and ship RAG, semantic search, and agent workloads without needing to procure GPU capacity on their self-managed clusters. Platform teams gain hybrid flexibility by keeping data and indexing on-prem while scaling GPU inference in Elastic Cloud when needed.</p><h2>What’s next</h2><p><strong><code>semantic_text</code> fields will soon default to <code>jina-embeddings-v3</code> on EIS</strong>, providing built-in inference at ingestion time, making it easier to adopt multilingual search without additional configuration.</p><h2>Try it out</h2><p>With Jina AI models on EIS, you can build multilingual, high-precision retrieval pipelines without managing models, GPUs, or infrastructure. You get fast dense retrieval, accurate reranking, and tight integration with Elasticsearch’s relevance stack, all in one platform.</p><p>Whether you’re building RAG systems, search, or agentic workflows that need reliable context, Elastic now gives you high-performance models out of the box and the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to the Elastic Inference Service. <a href="https://www.elastic.co/cloud/serverless">Try it now on Elastic Cloud Serverless and Elastic Cloud Hosted.</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building automation with Elastic Workflows]]></title>
            <description><![CDATA[A practical introduction to workflow automation in Elastic. Learn what workflows look like, how they work, and how to build one.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elastic-workflows-automation</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elastic-workflows-automation</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tinsae Erkailo, Tal Borenstein, Shahar Glazner]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cc239e947732501cbe0579f15093efa3d9db1cb8-1600x933.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Workflows is an automation engine built into the Elasticsearch platform. You define workflows in YAML; what triggers (starts) them, what steps they take, what actions they perform, and the platform handles execution. A workflow can query Elasticsearch, transform data, branch based on conditions, call external APIs, and integrate with services like Slack, Jira, PagerDuty and more through connectors you&#x27;ve already configured.</p><p>In this blog post, we&#x27;ll cover the core concepts of Workflows and build an example workflow together.</p><h2>Workflows are declarative and defined in YAML</h2><p>Workflows are composable. You define what should happen, and the platform handles execution, error recovery, and logging. Every workflow is defined as YAML and lives inside Kibana.</p><p>A workflow is composed of a few key parts: <em>triggers</em>, <em>inputs</em>, and <em>steps</em>.</p><p><strong>Triggers</strong> determine when a workflow runs. An alert trigger runs when a Kibana alerting rule fires, with full access to the alert context. A scheduled trigger runs on an interval or cron pattern. A manual trigger runs on demand from the UI or API. A workflow can have multiple triggers.</p><p><strong>Inputs</strong> define parameters that can be passed to the workflow at runtime. These let you create reusable workflows that accept different values depending on how they&#x27;re invoked.</p><p><strong>Steps</strong> are the actions a workflow takes. They execute in sequence, and each step can reference outputs from previous steps. Step types include:</p><ul><li><strong>Internal actions</strong> for actions you perform inside Elasticsearch and Kibana, like querying indices, running Elasticsearch Query Language (ES|QL) queries, creating cases, or updating alerts.</li><li><strong>External actions</strong> for actions you perform on external systems, like sending a Slack message or creating a Jira ticket. Use any connector you&#x27;ve configured in Elastic, with the flexibility to hit any API or internal service using HTTP steps.</li><li><strong>Flow control</strong> for defining the logic of your workflow with conditionals, loops, and parallel execution.</li><li><strong>AI</strong> for everything from prompting a large language model (LLM) to enabling agents as workflow steps, unlocking agentic workflow use cases.</li></ul><h2>Getting practical: Your first workflow</h2><p>Let&#x27;s build a workflow that demonstrates core capabilities: working with Elasticsearch indices, conditional logic, and data flow between steps. We&#x27;ll create a simple demo that sets up a National Parks index, loads sample data, and searches it.</p><h3>Enable Workflows</h3><p>Workflows is available in Elastic 9.3 (Technical Preview). Go to <strong>Stack Management → Advanced Settings</strong>, and enable <strong>Elastic Workflows</strong>:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/afc1f7168f3248d1792703973010e37ff05d19af-5344x3044.png" style="width: 100%;" /><h3>Create a workflow</h3><p>Navigate to <strong>Workflows</strong> in Kibana. If this is your first time, you&#x27;ll see the Get Started screen:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7ca4bf4505ca26c32d12158a33211bf943a5bf37-5344x3044.png" style="width: 100%;" /><p>Click <strong>Create a new workflow</strong> to open the editor. 
You can also explore <strong>Example workflows</strong> to see the <a href="https://github.com/elastic/workflows">Elastic Workflow Library</a>, a collection of ready-to-use workflows for search, observability, and security use cases.</p><h3>The workflow editor</h3><p>The editor gives you YAML editing with autocomplete and validation. Start typing a step type, and suggestions appear. Use the <strong>quick action menu</strong> (Cmd+K / Ctrl+K) to browse available triggers, steps, and actions by category:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b94d4b4496f2214867bbd1ed1eb82174df426fd6-2538x1340.gif" style="width: 100%;" /><h3>Build the workflow</h3><p>Here&#x27;s our National Parks demo workflow:</p><pre><code>name: National Parks Demo
description: Creates an Elasticsearch index, loads sample national park data, searches for parks, and displays the results.
enabled: true
consts:
  indexName: national-parks

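# Manual trigger: run on demand from the UI or API (alert and scheduled triggers are also available)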
triggers:
  - type: manual

steps:
  - name: get_index
    type: elasticsearch.indices.exists
    with:
      index: '{{ consts.indexName }}'

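  # Branch on the boolean output of the previous step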
  - name: check_if_index_exists
    type: if
    condition: 'steps.get_index.output: true'
    steps:
      - name: index_already_exists
        type: console
        with:
          message: 'index: {{ consts.indexName }} already exists. Will proceed to delete it and re-create'
      - name: delete_index
        type: elasticsearch.indices.delete
        with:
          index: '{{ consts.indexName }}'
    else:
      - name: no_index_found
        type: console
        with:
          message: 'index: {{ consts.indexName }} not found. Will proceed to create'

  - name: create_parks_index
    type: elasticsearch.indices.create
    with:
      index: '{{ consts.indexName }}'
      mappings:
        properties:
          name:
            type: text
          category:
            type: keyword
          description:
            type: text

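  # refresh: wait_for makes the document searchable before the next step runs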
  - name: index_park_data
    type: elasticsearch.index
    with:
      index: '{{ consts.indexName }}'
      id: yellowstone
      document:
        name: Yellowstone National Park
        category: geothermal
        description: "America's first national park, established in 1872, famous for Old Faithful geyser and diverse wildlife including grizzly bears, wolves, and herds of bison and elk."
      refresh: wait_for

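  # Term query on _id; log_results below reads this step's hit count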
  - name: search_park_data
    type: elasticsearch.search
    with:
      index: '{{ consts.indexName }}'
      query:
        term:
          _id: yellowstone

  - name: log_results
    type: console
    with:
      message: 'Found {{ steps.search_park_data.output.hits.total.value }} park with doc id of yellowstone.'</code></pre><p>This workflow demonstrates several capabilities:</p><ul><li><strong>Constants:</strong> Define reusable values, like <code>indexName</code>, that can be referenced throughout.</li><li><strong>Elasticsearch operations:</strong> Check whether an index exists, delete it, create it with mappings, index a document, and search.</li><li><strong>Conditional logic with branches:</strong> If the index exists, log a message and delete it. If not, log that it wasn&#x27;t found. Either way, continue to create the index.</li><li><strong>Data flow:</strong> Each step references outputs from previous steps using <code>steps.&lt;name&gt;.output</code>.</li></ul><p>Notice the <code>{{ }}</code> syntax throughout. This is Liquid templating for passing data between steps. <code>consts</code> contains workflow constants. <code>steps.&lt;name&gt;.output</code> references the output of a previous step.</p><h3>Run the workflow</h3><p>Save the workflow and click the <strong>Play </strong>button next to the <strong>Save </strong>button to execute it.</p><p>The workflow begins executing, and you&#x27;ll see the execution view. Each step appears in the side panel as it runs, showing status and timing:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/db89d94db97e51f76af0db084fef55c325b1426c-2538x1318.gif" style="width: 100%;" /><p>Click on any step to see its inputs and outputs. The side panel shows exactly what data the workflow received and produced at each step. This makes debugging straightforward: You see what went in, what came out, and where things went wrong if a step fails.</p><h3>Extending the workflow</h3><p>Let&#x27;s extend this workflow with AI and external notifications. We&#x27;ll add a step that generates a poem about the park using an LLM and then sends it to Slack.</p><p>Add these steps after <code>log_results</code>:</p><pre><code>  - name: generate_poem
    type: ai.prompt
    with:
      prompt: >
        Write a short, fun poem about {{ steps.search_park_data.output.hits.hits[0]._source | json }}.
        Include something about its famous features. No other formatting.

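  # Sends the poem through a preconfigured Slack connector, referenced by connector-id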
  - name: share_poem
    type: slack
    connector-id: my-slack-connector
    with:
      message: |
        🏔️ *Poem of the Day about Yellowstone*
        
        {{ steps.generate_poem.output.content }}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a529292d102b2b2800729458ecffb411b83f6f8f-5344x3008.png" style="width: 100%;" /><p>Now the workflow creates the index, loads data, searches it, generates a poem with AI, and shares it to Slack. The same patterns apply throughout: Add steps, reference their outputs, and let the workflow handle execution.</p><p>This is a simple example, but the same approach scales to real use cases. Replace national parks with security alerts, observability metrics, or any data in Elasticsearch. Replace the poem with an AI summary or assessment. Replace Slack with Jira, PagerDuty, or any connector you&#x27;ve configured.</p><h2>Workflows and Elastic Agent Builder</h2><p>The National Parks example demonstrates the core components of Workflows: triggers, steps, conditional logic, data flow, AI prompts, and external notifications. These components can be combined to build automations that orchestrate outcomes within Elasticsearch, Kibana, external systems, and AI.</p><p></p><p>This covers processes where the steps are known. But what about processes where they aren&#x27;t, where the right action depends on what you find, and what you find depends on where you look?</p><p>This is where <a href="https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga">Agent Builder</a> expands what&#x27;s automatable. An agent grounded in your operational context can do the initial analysis or investigation that an analyst or developer would normally do manually. It can explore, reason across data sources, and surface findings. Then the workflow continues with the structured steps that follow.</p><p>Workflows integrate with Agent Builder, and the integration works both directions.</p><h3>Agents as workflow steps</h3><p>Use the <code>ai.agent</code> step type to invoke an agent from within a workflow:</p><pre><code>- name: analyze
  type: ai.agent
  with:
    agent_id: my-analyst-agent
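    # my-analyst-agent is a placeholder agent ID. The json filter below
    # serializes the output of an earlier step (assumed to be named "search").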
    message: 'Analyze this data and recommend next steps: {{ steps.search.output | json }}'</code></pre><p>The agent uses its configured tools to query indices, correlate data, and reason across results. The workflow waits for findings and continues with the next steps.</p><h3>Workflows as agent tools</h3><p>Workflows can also be exposed to Agent Builder as tools, allowing agents to invoke them when action is required.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d73f329f5166f4c3bea4d697c451341ea3e7423e-5344x3044.png" style="width: 100%;" /><p>This pattern is useful when an agent determines what should be done, but the how should follow a known and repeatable process. Agents are well suited for reasoning, exploration, and judgment, especially when information is incomplete or evolving. Workflows are well suited for executing established procedures that involve multiple steps, external systems, and state-changing actions.</p><p>By exposing a workflow as a tool, you separate decision-making from execution. An agent can conclude that a specific outcome is needed, such as declaring an incident or initiating a response, and then delegate the execution to a workflow that encodes the correct sequence of actions. That workflow handles coordination across systems, applies the same logic every time, and makes the result observable and auditable.</p><p>This also enables reuse and consistency. The same workflow can be invoked by different agents or run manually, while enforcing a shared process and set of controls. Instead of teaching every agent how to carry out complex actions safely and correctly, that responsibility lives in one place.</p><p>Together, this creates a clear contract. Agents decide when an action is needed. Workflows ensure that the action follows the right process, every time.</p><h2>Looking ahead</h2><p>This technical preview establishes Workflows as a core capability of the Elasticsearch platform. The foundation is here: triggers, steps, data flow, AI integration, and the bidirectional connection with Agent Builder.</p><p>What comes next builds on this foundation by expanding how workflows are authored and where they show up. In addition to new step types and broader connector support, Workflows will support multiple authoring modes. These include natural language intent, where AI helps translate what you want to achieve into a working workflow, as well as visual, drag-and-drop builders. As these capabilities are embedded directly into solution-specific experiences across Elastic, they change not just how workflows are built, but how work actually gets done.</p><p><strong>For Developers</strong>, this means AI assistants that go beyond conversation. Agents that can actually do things: query systems, update records, trigger processes, and return results. Reasoning backed by reliable execution.</p><p><strong>In Observability</strong>, this means correlating signals from logs, metrics, and traces. Surfacing likely root causes. Orchestrating remediation steps. Closing the loop between detection and resolution without waiting for someone to intervene.</p><p><strong>In Security</strong>, this means investigation that begins the moment an alert fires. Enriching findings with context from internal and external sources. Coordinating response actions across tools. Updating cases and notifying the right people. Work that used to require manual effort, handled automatically.</p><p>The patterns you learn today apply directly to these experiences. 
Workflows is the automation layer that makes them possible.</p><h2>Get started with Elastic Workflows</h2><p>Elastic Workflows is available now as a technical preview. Get started with an <a href="https://cloud.elastic.co/registration">Elastic Cloud trial</a>, and check out the <a href="https://www.elastic.co/docs/explore-analyze/workflows">documentation</a> or <a href="https://github.com/elastic/workflows">library of example workflows</a> to learn more.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Skip MLOps: Managed cloud inference for self-managed Elasticsearch with EIS via Cloud Connect]]></title>
            <description><![CDATA[Introducing Elastic Inference Service (EIS) via Cloud Connect, which provides a hybrid architecture for self-managed Elasticsearch users and removes MLOps and CPU hardware barriers for semantic search and RAG.]]></description>
            <link>https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/cloud-connect-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jordi Mon Companys, Matt Ryan]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e419160397b56d1eacc101cd69ecc9c8e0b20967-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>If you’re running Elasticsearch on-premises or in a private cloud, you’ve likely faced a familiar standoff when trying to modernize your search experience: You want to implement <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-inference">semantic search</a>. You know that state-of-the-art dense vector models, like jina-embeddings-v3, are the standard for relevance. But when you look at the infrastructure requirements to run these models at scale, the project stalls.</p><p>The problem usually isn&#x27;t the software: Elasticsearch has supported vector search for years. The problem is the hardware.</p><h2><strong>The MLOps bottleneck</strong></h2><p>The process of running model inference to generate the embeddings required for semantic search is computationally expensive. If you’re self-managing your cluster, this presents a difficult set of trade-offs regarding operational complexity and resource flexibility:</p><ul><li><strong>Burn CPU cycles:</strong> You run the models on your existing CPU-bound nodes. This works for small datasets and small models, but as ingestion rates climb, your indexing throughput plummets and your search nodes end up choking on vector generation instead of serving queries.</li><li><strong>Provision GPUs:</strong> You ask your infrastructure team for GPU-accelerated nodes. In many organizations, this triggers a procurement nightmare. GPUs are expensive and scarce, and they introduce a new layer of machine learning operations (MLOps) complexity: driver compatibility, container orchestration, and scaling logic that your team might not have time to manage.</li></ul><p>This creates a gap where self-managed deployments remain stuck on keyword search (BM25) simply because the infrastructure barrier to AI is too high.</p><h2><strong>Introducing a hybrid architecture for inference</strong></h2><p>We built the <strong><a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a> </strong>and made it available via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect"><strong>Cloud Connect</strong></a> to solve this specific hardware constraints problem. It allows self-managed clusters (running on Elastic Cloud on Kubernetes [ECK], Elastic Cloud Enterprise [ECE], or stand-alone) hosted on-premises or in private cloud environments to delegate compute-intensive model inference to Elastic Cloud. It doesn’t require a lift-and-shift migration of your cluster architecture. While the specific text fields you need to vectorize are transmitted to the cloud for processing, you don’t need to permanently relocate your terabytes of business data or re-architect your storage. It allows you to use a <strong>hybrid topology</strong>: Your data nodes, master nodes, and index storage remain in your private environment, but the heavy lifting of generating embeddings is outsourced to Elastic’s managed GPU fleets.</p><h2><strong>How it works: Data stays, intelligence travels</strong></h2><p>It’s important to understand the distinction between <em>storage </em>and <em>inference</em>. When you use EIS via Cloud Connect:</p><ol><li><strong>The handshake:</strong> You generate an API key in Elastic Cloud and paste it into your self-managed Kibana instance. 
This establishes a secure, authenticated bridge.</li><li><strong>The pipeline:</strong> When you index a document using the <code>semantic_text</code> field (or manually configure an inference processor), your local cluster automatically sends only the specific text content to the EIS endpoint, encrypted in transit.</li><li><strong>The inference:</strong> The text is processed in memory on Elastic’s managed GPUs. The resulting vector embedding is returned immediately to your local cluster.</li><li><strong>The storage:</strong> The vector is indexed and stored on your local disks alongside the original source document.</li></ol><p>The raw text payload is ephemeral: It’s processed for inference and discarded. It’s never indexed or permanently stored in the cloud. You get the relevance of a GPU-powered cluster without changing your data residency posture.</p><h2><strong>Scaling semantic search without hardware provisioning</strong></h2><p>Let’s look at a practical scenario. You’re a site reliability engineer (SRE) managing a cluster for a large ecommerce platform. The search team wants to deploy Jina to fix “zero results” queries, but your on-premises nodes are CPU-bound and you have no GPU infrastructure available.</p><p>Here’s how you can use EIS via Cloud Connect to solve this in minutes, not months.</p><p>Step 1: <strong>The handshake</strong></p><p>First, we establish the bridge between your self-managed cluster and Elastic Cloud:</p><ul><li>Navigate to <strong>Stack Management</strong> in your local Kibana.</li><li>Under the <strong>Cloud Connect</strong> section, click <strong>Connect to Elastic Cloud</strong>.</li><li>Authenticate with your Elastic Cloud credentials, and authorize the connection.<ul><li>Result: Your local cluster is now cloud connected, acting as a satellite that can consume Software as a Service (SaaS) services.</li></ul></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7f2f846515f30375340080c6f61b0321f6db5c1-1600x1162.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/329916c4ecd6bdb78d21812751014b887de6148f-1600x1158.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9e98f5b7f1aa1d245507163df285b6688745c6ff-1600x1156.png" style="width: 100%;" /><p>Step 2: <strong>Enable the service</strong></p><ul><li>On the Cloud connected services page, locate <strong>Elastic Inference Service</strong>.</li><li>Click <strong>Enable</strong> and wait for the status to switch to <strong>Enabled</strong>.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c171888a47a8e412551b701c59274fd7b3f6d33e-1600x1159.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f678694670f3686ac4b7be8a1850b39421999f48-1600x245.png" style="width: 100%;" /><p>This makes all of these inference endpoints immediately available locally:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4332fb202e9537128d03abf115f6728b925755ce-800x579.gif" style="width: 100%;" /><p>Step 3: <strong>Configure the pipeline</strong></p><p>Now that the service is available, you configure ingestion. Instead of managing a local Jina container, you simply point a <code>semantic_text</code> mapping at the cloud-hosted model. Open the console and try the Jina embeddings model locally:</p><pre><code>PUT /semantic-search-eis
{
  "mappings": {
    "properties": {
      "text": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v3"
      }
    }
  }
}

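# Index a sample document; its text is sent to the EIS endpoint for
# embedding, and the returned vector is stored locally.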
POST /semantic-search-eis/_doc
{
  "text": "Aberdeen Football Club"
}

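# "soccer" appears nowhere in the document; on a semantic_text field,
# the match query searches by meaning rather than by exact keywords.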
GET /semantic-search-eis/_search
{
  "query": {
    "match": {
      "text": "soccer"
    }
  }
}</code></pre><p>The match query will return the semantically correct result for “Aberdeen Football Club” where an exact match for “soccer” would not:</p><pre><code>{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.89421,
    "hits": [
      {
        "_index": "semantic-search-eis",
        "_id": "1",
        "_score": 0.89421,
        "_source": {
          "text": "Aberdeen Football Club"
        }
      }
    ]
  }
}</code></pre><p>Step 4: <strong>Hybrid ingestion</strong></p><p>When you index a document through this setup, the flow is streamlined:</p><ul><li>Your local node accepts the document.</li><li>The text field is securely transmitted to the EIS endpoint.</li><li>Elastic Cloud generates the dense vector embedding on a managed GPU using Jina v3.</li><li>The vector is returned and indexed locally on your self-managed disk.</li></ul><h2><strong>Oh, and one more thing: LLMs via Cloud Connect too!</strong></h2><p>While vector search solves the <em>retrieval</em> part of retrieval-augmented generation (RAG), Cloud Connect also solves the <em>generation</em> side. Typically, giving your self-managed teams access to high-quality large language models (LLMs), like Anthropic’s Claude, involves a separate procurement cycle, managing vendor API keys, and handling secure egress.</p><p>EIS via Cloud Connect removes that friction entirely.</p><ul><li><strong>Preconfigured access:</strong> The moment you enable the service, you get access to preconfigured Anthropic connectors (one for Claude 3.7 and one for Claude 4.5). You don’t need to bring your own API key or sign a separate contract with the model provider.</li><li><strong>Zero setup:</strong> It just works. Because the secure bridge is already established, you can immediately select these models in the Search Playground to test RAG against your local data.</li><li><strong>Integrated workflows:</strong> This also powers our internal AI tools. Your team can instantly start using the Elastic AI Assistant for observability insights or Attack Discovery for security threat analysis, all powered by cloud-hosted LLMs, grounded in your on-premises data.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/361d7f503a0401078b19b5108d40d8a59fa2e607-1600x1160.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8bfd1b8de4a5a9621b414791fe03f979ad7458ec-1600x1164.png" style="width: 100%;" /><h2><strong>The shortcut to AI</strong></h2><p>EIS via <a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect </a>removes the operational friction of managing GPU drivers, specialized hardware, and complex performance monitoring stacks. By offloading these requirements to Elastic Cloud, you can build RAG applications and semantic search interfaces today, regardless of whether your cluster is running on-premises or in a private cloud VPC.</p><p>The hybrid approach solves the resource utilization problem typical of self-managed AI. You no longer have to over-provision expensive hardware that sits idle during quiet periods, nor do you have to worry about performance bottlenecks when traffic bursts. You simply establish the secure connection via Cloud Connect and consume high-performance inference as a managed API, keeping your data residency intact while your search capabilities scale instantly.</p><p>EIS via Cloud Connect is available immediately for <strong>Elastic Enterprise</strong> self-managed customers on <strong>Elastic Stack 9.3</strong>.</p><ul><li><a href="https://www.elastic.co/search-labs/blog/elastic-autoops-self-managed-elasticsearch"><strong>Read the docs:</strong> Guide to Configuring Cloud Connect</a></li><li><a href="https://cloud.elastic.co/registration"><strong>Start a trial:</strong> Create an Elastic Cloud account</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cookbook for a production-grade generative AI sandbox]]></title>
            <description><![CDATA[Exploring the recipe for a generative AI sandbox, giving developers a secure environment to deploy application prototypes while enabling privacy and innovation.]]></description>
            <link>https://www.elastic.co/search-labs/blog/generative-ai-sandbox-data-privacy</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/generative-ai-sandbox-data-privacy</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean MacKirdy]]></dc:creator>
            <pubDate>Mon, 02 Feb 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e7ed91ecf91e7a1de7d9ff514a2c285f2cb3f65c-1000x628.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Building generative AI (GenAI) applications is all the rage, and <em>context engineering</em>, that is, providing the prompt structure and data needed for a large language model (LLM) to return specific, relevant answers to a question without filling in the blanks itself, is one of the most popular patterns that has emerged in the past 24 months. One particular subset of context engineering, retrieval-augmented generation (RAG), is being used widely to bring additional context to LLM interactions by using the power of natural language-based search to surface the most relevant results in private datasets based on meaning rather than on keywords.</p><p>As context engineering is exploding, ensuring that rapid prototype projects don’t expose business- or mission-critical data to unauthorized recipients is a significant concern. For audiences interested in technology and policy alike, I&#x27;ve championed the concept of a <em>privacy-first GenAI sandbox</em>, which I’ll simply refer to as a sandbox from here on. In this article, the term <em>sandbox</em> refers to a self-service, secure prototyping space (much like a child&#x27;s sandbox, where the wooden edges prevent sand from escaping), allowing organization members to test their custom context engineering applications safely, without risking exposure of confidential data.</p><h2>Production-grade GenAI sandboxes = enabling privacy <em>and</em> innovation</h2><p>GenAI, from text-generating tools, like ChatGPT, Claude, and Gemini, to image creators, such as Google’s Nano Banana, OpenAI’s DALL-E, and Midjourney, has sparked discussions everywhere: in classrooms, at dinner tables, in regulatory circles, in courts, and in boardrooms over the past two years.</p><p>I’ve had the privilege of sharing Elastic’s approach to context engineering, and particularly RAG, with customers, including developers and C-suite executives, and with contacts of mine, ranging from friends and family to legislators. Think of context engineering as a librarian that looks up and then serves contextual data to augment text, audio, or image GenAI apps with information they don’t have in the data they were trained on for their intended tasks; for example, looking up sports scores and headlines to help a text-generation application answer the question, “What happened in the National Hockey League yesterday?”</p><p>Elasticsearch Labs has fabulous primers on context engineering <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview">here</a> and RAG <a href="https://www.elastic.co/search-labs/blog/articles/retrieval-augmented-generation-rag">here</a>, if you’re unfamiliar with the concept and would like to do more reading.</p><p>A <em>privacy-first</em> approach ensures that context engineering supplies the GenAI app with protected, selected, or delicate data, fostering responses that are better informed and more pertinent than what might be generated using solely public information. 
An example of this would be providing a GenAI-powered, interactive text chat experience (chatbot) for university students to obtain financial aid and scholarship information relevant to their personal background, without risking exposure of personally identifiable information (PII), such as their Social Security number or birthdate, to malicious actors extracting information via common vulnerabilities, as per the <a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/">OWASP Top 10</a>, or the LLM itself.</p><p></p><p>The core tenets of the logic behind deploying a sandbox are as follows:</p><ol><li>Users will find a way to incorporate GenAI into their daily workflow, regardless of whether one’s organization provides the tools. Even in organizations where preventing such “shadow IT” is, realistically, impractical or impossible, providing and monitoring access to prevent disclosure of an organization’s sensitive data still remains imperative; a sandbox is just the place to turn such tools loose.</li><li>Providing a sandbox to deploy applications with Application Performance Monitoring (APM) and information security (InfoSec) best practices embedded allows an organization to derive insights into potential use cases for GenAI while also safeguarding privacy, enabling audit and accountability of GenAI use, and establishing centralized cost management.</li><li>An organization’s sandbox should allow either self-service or low-touch deployment of peer-reviewed GenAI applications to permit maximum experimentation with minimum friction by those inclined to develop their own applications. </li><li>If properly implemented and contained within the organization’s controlled perimeter, the sandbox allows leveraging data assets available to the organization without triggering the liabilities that could attach to unauthorized or unintended external sharing or other leakage of protected data such as PII – think California CCPA, or the EU/UK GDPR for instance.</li></ol><p>This article will not focus on building a GenAI app; there are numerous excellent examples here on Elasticsearch Labs. Instead, I’ll be focusing on the <em>recipe</em> necessary for deploying a sandbox that provides the security and availability needed to implement principle #3 above.</p><h3>Foundational ingredients</h3><p>For a sandbox to be considered <em>production grade</em>, the following foundational ingredients should be considered:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c53aaa04503baf654ccee274f012d7c1ddc2f643-1380x1600.png" style="width: 100%;" /><p>Let&#x27;s explore why each ingredient plays a crucial role in our sandbox recipe. As we do, please note that brand-name decisions I’ve listed below are based on personal experience and aren’t an endorsement of one technology or another by Elastic. As with any recipe, these then form my preferred ingredients. You can, of course, substitute in each area to make the recipe to your liking:</p><h4>1. Containerization platform</h4><p>The first ingredient in our sandbox recipe is the selection of a <strong>containerization platform</strong>. These platforms, while conceptually similar to the virtual machines that have been a staple of enterprise IT for the past 15+ years, represent a significant evolution in how applications are packaged and deployed. 
They’re designed for rapid deployment, upgrades without service disruption, and native distribution across both on-premises and cloud computing environments, while also providing increased testability, validation of infrastructure, and auditability. The platform you choose, often managed through <strong>infrastructure as code</strong> (IaC) to ensure reproducibility and consistency, is the foundation that enables agility and scalability for your GenAI applications.</p><p><strong>Key components of a containerization platform</strong></p><p>A robust containerization platform is built on several key components:</p><ul><li><strong>Container runtime:</strong> The software that executes containers and manages their lifecycle. A popular example is <strong>Docker</strong>, which provides the tools to build, share, and run container images.</li><li><strong>Image build infrastructure:</strong> This is the process and tooling used to create container images from your application&#x27;s source code. Tools like <strong>Dockerfiles</strong> provide a clear, repeatable way to define the environment, dependencies, and application code within an image, ensuring consistency across development, testing, and production environments.</li><li><strong>Orchestration engine:</strong> For a production-grade environment, you need a system to automate the deployment, scaling, and management of containers. <strong>Kubernetes</strong> (k8s) is the industry standard for this, providing powerful features for load balancing, self-healing, and service discovery. More on that below in ingredient #2.</li></ul><p><strong>1.1 Infrastructure as code:</strong></p><p>To ensure the reproducibility and maintainability of your sandbox, a <strong>containerization platform</strong> should be managed using <strong>IaC</strong> principles. This means that instead of manually configuring your platform, you define your infrastructure (for example, Kubernetes clusters, networking rules, security policies) in code files (for example, using <strong>Terraform</strong> or <strong>Pulumi</strong>). This approach provides several benefits:</p><ul><li><strong>Version control:</strong> Your infrastructure can be treated like any other code, allowing you to track changes, revert to previous versions, and collaborate with your team using Git.</li><li><strong>Consistency:</strong> IaC dramatically reduces manual errors and ensures that your sandbox environment can be recreated identically in any cloud or on-premises location.</li><li><strong>Automation:</strong> It enables you to automate the entire setup and teardown process, making it easy to create temporary sandboxes for specific projects or testing.</li></ul><h4>2. Hosting and orchestration</h4><p>As we introduced in the &quot;Containerization platform&quot; section, a powerful orchestration engine is needed to manage our containers at scale. For this, k8s is the de facto standard for orchestrating a production-grade sandbox. If you’re unfamiliar, check out the Cloud Native Computing Foundation (CNCF) primer on k8s available <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/">here</a>. Whether running in the cloud or on-premises, Kubernetes provides the robust framework needed to deploy, scale, and manage the lifecycle of containerized applications. 
Major cloud providers, like Google Cloud (Google Kubernetes Engine [GKE]), Amazon Web Services (Elastic Kubernetes Service [EKS]), and Microsoft Azure (Azure Kubernetes Service [AKS]), all offer mature, managed Kubernetes services that handle the underlying complexity, including in particular contractually assured and independently certified compliance with statutory privacy and information security mandates, allowing your teams to focus on building and deploying applications.</p><p>For a GenAI sandbox, Kubernetes is particularly valuable because it can efficiently manage and scale GPU resources, which are often necessary for two key components of the GenAI stack: 1) privately hosted LLMs; and 2) the inference processes that power them (discussed in more detail in ingredients #6 and #7). Its ability to automate deployments and manage resources ensures that rapid prototypers can experiment with different models and applications without needing to become infrastructure experts, all within the secure and isolated area, called a <em>namespace</em> in k8s, that you define. This abstraction is key to the sandbox&#x27;s success, empowering innovation while maintaining centralized control.</p><h4>3. Code repository / image repository</h4><p>A centralized code repository is an essential element of a secure and collaborative GenAI sandbox. It provides a single, controlled environment for developers to store, manage, and version their code, preventing the proliferation of sensitive information across disparate, unsecured locations. By establishing a centralized repository, organizations can enforce security policies, monitor for vulnerabilities, and maintain a clear audit trail of all code changes, which is critical for maintaining data privacy and integrity within the sandbox environment.</p><p>For instance, a service like GitHub, when integrated with your organization&#x27;s identity and access management (IAM) and single sign-on (SSO) solutions (see ingredient #4 below), becomes a powerful tool for enforcing the principle of least privilege. This integration ensures that only authenticated and authorized developers can access specific code repositories. You can create teams and apply granular permissions, restricting access to sensitive projects and preventing unauthorized code modifications. This is especially important in a GenAI context where code might contain proprietary algorithms, sensitive data connectors, or even, in some cases, organization or user-level credentials or other confidential information.</p><p>Furthermore, modern repository platforms offer automated security scanning features. These tools continuously scan code for known vulnerabilities, insecure coding practices, and exposed secrets. If a developer accidentally commits a password or an API key, the system can automatically flag it and notify the security team. This proactive approach to security is essential for preventing data breaches, enforcing legal requirements and contractual commitments of confidentiality, and ensuring the overall integrity of the GenAI applications being developed to deploy in the sandbox. By mandating that all development occurs in a centralized and secured repository, you create a transparent, auditable, and secure foundation for innovation, allowing your developers the freedom to experiment without compromising organizational security.</p><h4>4. Identity and access management</h4><p>IAM is a core component of a secure, privacy-first grounded AI environment. 
It provides the foundation for ensuring that only authorized individuals and services can access sensitive data and powerful AI models. A robust IAM framework enforces the principle of least privilege, granting the minimum level of access necessary for a user or service to perform its function.</p><p><strong>4.1 Single sign-on:</strong></p><p>SSO streamlines user access by allowing users to authenticate once and gain access to multiple applications and services without re-entering their credentials. In a sandbox environment, SSO simplifies the user experience for developers, data scientists, and business users who need to interact with various components of the AI ecosystem, such as data repositories, modeling workbenches, and deployment pipelines. By centralizing authentication, SSO also enhances security by reducing the number of passwords that can be compromised and providing a single point for enforcing authentication policies. Importantly, it also lowers the barrier to entry for less-experienced developers to properly protect the data they are using in the sandbox, in turn preventing the inadvertent disclosure of sensitive information to insiders and outsiders alike.</p><p><strong>4.2 Role-based access control:</strong></p><p>Role-based access control (RBAC) is a method of restricting network access based on the roles of individual users within an organization. In the context of a GenAI sandbox, RBAC is used to define and enforce permissions for different user personas. For example, a data scientist role might have read/write access to specific datasets and the ability to apply machine learning models, while a business analyst role may only have read-only access to the outputs of those models. This ensures a clear separation of duties and prevents unauthorized access to or modification of sensitive data and AI assets.</p><p><strong>4.3 Attribute-based access control:</strong></p><p>Attribute-based access control (ABAC) provides a more granular and dynamic approach to access control than traditional RBAC. ABAC makes access decisions based on a combination of attributes of the user, the resource being accessed, and the environment. For instance, access to a particularly sensitive dataset could be restricted to users who are on the data scientist team (user attribute), accessing a resource tagged as PII (resource attribute), and are doing so from a corporate network during business hours (environment attributes). This level of granular control is critical in a GenAI sandbox for enforcing complex data governance and privacy requirements. We’ll come back to this later, when discussing the search AI datastore.</p><p><strong>4.4 Access auditability:</strong></p><p>A robust IAM framework also ensures that the granting, use, review, and revocation of all access permissions are granularly logged, discoverable, and auditable, so that in case of any suspected or confirmed incident, responders can quickly understand what happened, contain the incident, assess its extent, and comprehensively remedy its consequences. This is not only important for the organization’s own security, but also necessary to comply with any incident reporting and breach notice requirements that could be triggered.</p><h4>5. Secrets management</h4><p>Of all the ingredients in our recipe, secrets management is perhaps the most potent, yet most frequently overlooked. 
Much like a tiny pinch of saffron can dramatically alter a culinary dish, a single mishandled secret can have an outsized and devastating impact on your organization&#x27;s security and reputation. In our context, a <em>secret</em> is any piece of sensitive information needed for our applications to function: API keys for first- or third-party services, database passwords, trust certificates, or tokens for authenticating to LLMs.</p><p>When these secrets are hard-coded into source code or left in plain-text configuration files, they create a massive vulnerability. A leaked API key or an exposed database credential can bypass all other security measures, providing a direct path for attackers to access sensitive data and systems. This is especially critical in a GenAI sandbox, where developers are frequently connecting to various data sources and external model providers. Without a robust secrets management strategy, you’re leaving the keys to your kingdom scattered across your digital landscape, turning your innovative sandbox into a potential source for a major data breach.</p><p>To properly secure these secrets, a dedicated secrets management platform is an essential ingredient. These tools provide a centralized, encrypted vault for storing secrets, with robust access control, auditing, and dynamic rotation capabilities. Whether you choose a self-hosted solution, like HashiCorp Vault, or a managed cloud service, such as Google Cloud&#x27;s Secret Manager or AWS Key Management Service (KMS), the principle is the same: Programmatically inject secrets into your applications at runtime. This practice ensures that secrets are never exposed in your code, keeping your most valuable credentials secure and your sandbox environment protected.</p><p>And this is more than just a best practice: since secrets management technology is readily available and widely used, it forms part of the “state-of-the-art” which certain privacy laws and regulators reference as the benchmark against which an organization’s information security posture must be assessed. Failing to protect an organization’s most valuable secrets with the latest and greatest techniques available is not only a missed opportunity; it is also a potential case of regulatory non-compliance, as enforcement agencies and courts of law often remind us.</p><h4>6. Private LLM deployment(s)</h4><p>Early in the advent of modern GenAI, the primary driver for using managed services, like Azure OpenAI, was the assurance that customer prompts and data would not be used to retrain public models. This was a crucial first step in enterprise adoption. However, as the field has matured, the conversation has shifted. While data privacy remains paramount, the decision to use private LLM instances, whether from major cloud providers or self-hosted, is now equally driven by the need for guaranteed throughput, predictable latency, and fine-grained control over the model&#x27;s operational environment to support production-grade applications.</p><p>This critical ingredient comes in three distinct flavors, each with valid use cases and its own set of trade-offs:</p><p><strong>A. Cloud-hosted SaaS</strong></p><p>This is the most common and accessible approach. Services like OpenAI Enterprise, Azure OpenAI, Google Cloud&#x27;s Vertex AI, and AWS Bedrock provide access to powerful, state-of-the-art models through a managed API.</p><ul><li><strong>Pros:</strong> This flavor offers the fastest time-to-market. 
The cloud provider handles all the underlying infrastructure, scaling, and maintenance, allowing teams to focus purely on application development. It provides a simple, pay-as-you-go model and access to a diverse model library of proprietary and open-source options.</li><li><strong>Cons:</strong> This approach offers the least control over the underlying infrastructure, which can lead to variability in performance during peak demand. It can also be more expensive at very high volumes, and it creates a dependency on the provider&#x27;s roadmap and model availability. It also increases the potential vulnerability surface of the application, with data leaving the customer premises: a challenge for highly regulated and/or sovereignty-minded customers.</li></ul><p><strong>B. Cloud-hosted GPU + containerized LLMs</strong></p><p>This flavor involves running open-source LLMs (like models from Mistral or Meta&#x27;s Llama series) on your own virtualized GPU infrastructure within a cloud provider. This is typically managed using the containerization and Kubernetes orchestration we&#x27;ve already discussed, often with high-performance inference servers like vLLM.</p><ul><li><strong>Pros:</strong> This approach provides a powerful balance of control and flexibility. You gain direct control over resource allocation, model versioning, and the serving configuration, allowing for significant performance tuning. In high-concurrency scenarios, a well-tuned inference server can dramatically increase throughput. For example, benchmarks have shown inference engines like vLLM delivering significantly higher tokens-per-second and lower latency compared to less production-oriented servers under heavy load [<a href="https://developers.redhat.com/articles/2025/09/30/vllm-or-llamacpp-choosing-right-llm-inference-engine-your-use-case">Red Hat, 2025</a>].</li><li><strong>Cons:</strong> This option carries a higher operational burden. Your team is now responsible for managing the GPU instances, container images, and the inference server configuration. It requires deeper technical expertise in machine learning operations (MLOps) and infrastructure management to implement and maintain effectively.</li></ul><p><strong>C. On-premises GPUs + containerized LLMs</strong></p><p>The most controlled, and often most complex, approach involves deploying containerized LLMs on your own dedicated hardware within your own data centers. This setup is functionally similar to the second flavor but removes the reliance on a public cloud provider for the hardware layer.</p><ul><li><strong>Pros:</strong> This flavor offers maximum security, control, and data sovereignty. It’s the only option for organizations that require a completely <em>air-gapped environment</em>, where no data leaves the physical premises. For massive, predictable workloads, it can become more cost-effective in the long run by avoiding cloud data egress fees and per-transaction costs.</li><li><strong>Cons:</strong> The initial capital expenditure for purchasing and maintaining high-end GPU hardware is substantial. It requires a highly specialized team to manage the physical infrastructure, networking, and the entire software stack. This approach is more difficult to scale, as it requires the physical procurement and installation of new hardware.</li></ul><h4>7. Search AI data store</h4><p>If the LLM is the brain of our GenAI application, then the datastore is its heart, pumping relevant, context-rich information to be reasoned upon. 
For a RAG application to be truly effective, it cannot rely on a simple vector database alone. The grounding data is often complex, containing a mix of unstructured text, structured metadata, and a variety of data types. Therefore, the datastore you select must possess a unique set of characteristics to handle this complexity at scale.</p><p>Underpinning this entire process is the creation of <em>vector embeddings</em>, numerical representations of your data relative to the knowledge set of that embedding space. To enable semantic search, your data must first be converted into these numerical representations by an inference model. A flexible datastore should not only store these vectors but also be capable of hosting the inference process itself. Crucially, it should allow you to use your model of choice, whether it&#x27;s a state-of-the-art multilingual model, a fine-tuned model for a specific domain like finance or law, a compact model built for very high-speed results, or even a model that can process images. By managing inference, the platform ensures that your data is consistently and efficiently vectorized, paving the way for the powerful search capabilities that follow.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dc05dfaf1bde3c7a74cf559b8c790a6e8e36be45-1600x900.png" style="width: 100%;" /><p>First, it must master <strong>hybrid search</strong>. The best retrieval systems don&#x27;t force a choice between traditional keyword search, like BM25, which excels at finding specific keywords, and modern vector search, which excels at finding results using semantic meaning (that is, natural language). A truly capable datastore allows you to use both simultaneously in a single query. This ensures you can find documents that match exact product codes or acronyms while also finding documents that are conceptually related, providing the LLM with the most relevant possible context.</p><p>Second, it needs a sophisticated method for <strong>intelligent result reranking</strong>. When you run a hybrid search that combines multiple approaches, you need a way to merge the different result sets into a single, coherent ranking. Techniques like reciprocal rank fusion (RRF) are crucial here, as they intelligently combine the relevance scores from different queries to produce a final list that is more accurate and relevant than any single approach could deliver on its own.</p><p>Finally, a search AI-oriented datastore must be a <strong>unified engine with security built in</strong>. For enterprise RAG, it&#x27;s not enough to just find similar vectors. You must be able to apply security and access controls to data <em>before</em> the search even happens. The aforementioned RBAC and ABAC capabilities allow prefiltering of content at search time, ensuring that the vector search is only performed on data a user is authorized to see. This mitigates risks of accidental or malicious circumvention of your access controls through the sandbox, preserving demonstrable compliance with privacy and confidentiality requirements. This capability, which combines filtering, full-text search, and vector search in a single, scalable platform, is the defining characteristic of a datastore truly ready to power a secure, privacy-first GenAI sandbox.</p><h4>8. APM and security</h4><p>The final ingredient in our recipe ensures the health, security, and performance of the entire sandbox: a unified platform for APM and security information and event management (SIEM). 
A key characteristic of a truly versatile search AI datastore is its ability to power the <em>R</em> in your RAG applications, while also acting as the standards-based repository for all logs, metrics, and traces generated by your infrastructure and applications. By consolidating this operational data into the same powerful datastore, you create a single pane of glass for observability and security.</p><p>This approach provides several critical capabilities. At the infrastructure level, you can monitor the performance and resource utilization of both the k8s clusters hosting your sandbox and the underlying GPUs that power your LLMs, allowing you to proactively identify bottlenecks or failures. At the application layer, APM provides detailed traces to diagnose latency issues or errors within your GenAI prototypes. For security, this centralized datastore becomes your SIEM, correlating login events, application logs, and network traffic to detect anomalous behavior or potential threats within the sandbox.</p><p>Most importantly, this unified platform allows you to gain deep insights into the usage of the GenAI applications themselves. By ingesting and analyzing the application telemetry, which should include the prompts being submitted by users wherever permissible, potentially with PII redacted, you can identify trends, understand what types of questions are being asked, and discover popular use cases. This provides an invaluable feedback loop for improving your RAG applications and demonstrates the power of using a single, scalable datastore to secure, monitor, and optimize your entire GenAI ecosystem.</p><h2>Cooking the recipe</h2><p>With all of the ingredients in place, let’s talk about the steps for assembling them into a production-grade sandbox.</p><p>As with any recipe book, let’s start with a photo of the cooked dish. Here’s a view of what a final architecture might look like:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/de028ce2f5acd6f84e59ea67091472128a8a0143-1430x990.png" style="width: 100%;" /><p>The holistic environment depicted here consists of a Kubernetes cluster to host your sandboxed AI applications (with dev/preprod/prod namespaces for a continuous integration and continuous deployment [CI/CD] pipeline), an IAM infrastructure for authentication, a few GenAI applications, a repository for code and container images, and a wrapper of APM and cyber monitoring around the entire sandbox.</p><h3>Recipe step 0: Policy baseline</h3><p>Before you begin mixing any ingredients, every good chef performs their <em>mise en place</em>, that is, setting up their station for success. In our recipe, this means establishing clear policies for how the sandbox will be used. This is the foundational step, where you decide the rules of your kitchen. Will developers be allowed to use internal production data, or production data sanitized with techniques like pseudonymization and differential privacy, or life-like synthetic data, or only public data? Will the sandbox be a completely self-service platform, or a managed service with guardrails? Will application updates require a formal Change Review Board, or is a peer-review process sufficient? These questions are highly specific to each organization’s context and purposes. 
Answering them up front is critical, as these policy decisions will directly influence how you configure every other ingredient in the recipe.</p><h3>Recipe step 1: InfoSec baseline</h3><p>As stated in the “Ingredients” section, IAM is a nonnegotiable part of our recipe. Before letting anyone into the kitchen, you must secure the perimeter and ensure only authorized chefs wearing your approved uniform and compliant protective gear can access the tools and ingredients. This means working directly with your information security organization from day one to build the sandbox on a foundation of strong security principles. Access to your datastore, your code repository, your Kubernetes hosting environment, and the applications themselves must be restricted based on established best practices.</p><p>With your organization’s IAM policies enforced in the environment, a practical authentication flow might look like the one depicted in figure 3.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7b1062f775ed2cf21e9386d1999933ba3f73efc4-1432x1488.png" style="width: 100%;" /><p>As you can see in the figure, no communication can occur between applications in the Kubernetes production namespace without first passing through an OAuth proxy, such as Vouch. This ensures every user is authenticated against a central provider, like Okta, which enforces policies such as two-factor authentication. In this model, critical user context, such as username and IP address, can be passed along with every request, enabling robust auditing and nonrepudiation at the application layer.</p><h3>Recipe step 2: Container configuration baseline</h3><p>Assuming that many of your rapid prototypers are passionate innovators but not necessarily seasoned software engineers or legally trained data compliance experts, it’s critical to provide a baseline configuration to ensure their success and security, without putting them at risk of breaching any rules or policies inadvertently. Think of this step as providing a master recipe card that guarantees consistency. At a minimum, you should provide clear documentation on how to build a container image, deploy it into the Kubernetes cluster, and test that all connectivity is secure.</p><p>Even better, you can create a “Clone This Starter App” template in your code repository. This gives developers a preconfigured, security-blessed starting point, complete with Dockerfiles and pipeline scripts, that they can immediately fork to begin tinkering, dramatically lowering the barrier to entry while enforcing best practices from the outset.</p><p>Additionally, many real life GenAI use cases will inevitably involve some form of PII processing, or can produce outputs that will materially impact individuals such as your employees, your consumers, or your customers’ staff. In such cases, more and more state, federal and international laws require completing various risk assessments before actual work can begin. These assessments can be cumbersome to conduct and are difficult to scale if they are carried out case by case. 
The “Clone This Starter App” approach also helps to prevent such compliance mandates from becoming bottlenecks to innovation, since under most legal mandates, the required assessments can be completed once for your template, and they need not be repeated for any clone that doesn’t exceed your initially defined parameters.</p><h3>Recipe step 3: Deploy user applications</h3><p>With your policies defined, your security baseline established, and your developer templates in place, it’s finally time to serve the dish. Whether you&#x27;ve chosen a self-service or managed deployment model, you can now confidently invite the rapid prototypers in your organization to start creating in the sandbox.</p><p>Because you’ve included APM and security logging (ingredient #8) from the beginning, you have the necessary observability to monitor application performance and user activity. This is where the magic happens: You can now learn from the applications people build, identify powerful new use cases, and gather real-world data to improve the platform, all while safeguarding organizational data. Coincidentally, this approach will also allow you to organically collect the information you might need to put on record, disclose to users, or share with auditors and regulators to demonstrate the transparency, accountability, and explainability of your GenAI application, ticking many compliance boxes as you build (and not after the fact) – a textbook best practice of <em>Privacy by Design</em>.</p><h2>Where do you go from here?</h2><p>We&#x27;ve now walked through the entire cookbook, from selecting your fresh ingredients to following the recipe step by step. Most of the domains we&#x27;ve discussed (containerization, APM, IAM, and more) are culinary specialties in and of themselves.</p><h2>Conclusion</h2><p>This cookbook was designed to provide a clear recipe for building a production-grade GenAI sandbox. By carefully selecting each foundational ingredient, from your containerization platform and Kubernetes orchestration to your search AI datastore and unified APM, you ensure your final dish will be both successful and secure. Following the recipe ensures that this powerful environment is built on a foundation of security and thoughtful policy from day one.</p><p>The goal is to empower your rapid prototypers, not restrict them, and to foster a culture of responsible innovation. By providing a secure, observable, and well-equipped kitchen for experimentation, you get ahead of the curve. This proactive approach enables you to harness the creativity of your entire organization, transforming brilliant ideas into tangible prototypes while preventing the rise of shadow AI. You&#x27;ve cooked the meal; now you can enjoy the innovation it serves.</p><p>If you want to chat about this or anything else related to Elasticsearch, come join us in our <a href="https://discuss.elastic.co/">Discuss forum</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Query rewriting strategies for LLMs and search engines to improve results]]></title>
            <description><![CDATA[Exploring query rewriting strategies and explaining how to use the LLM's output to boost the original query's results and maximize search relevance and recall.]]></description>
            <link>https://www.elastic.co/search-labs/blog/query-rewriting-llm-search-improve</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/query-rewriting-llm-search-improve</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Christina Nasika, Emilia Garcia Casademont]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e56cb99dab7734d506e6b8a226c798047122c1a9-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>One of the biggest challenges in integrating large language models (LLMs) into search pipelines is the complexity of navigating the space of possibilities that they provide. This blog focuses on a small set of concrete <strong>query rewriting (QR) </strong>strategies, using LLM-generated keywords, pseudo-answers, or enriched terms. We specifically focus on how to best use the LLM&#x27;s output to strategically boost the original query&#x27;s results to maximize search relevance and recall.</p><h2>LLMs and search engines: An exploration of query rewriting strategies for search improvement</h2><p>The integration of large language models (LLMs) with search engines represents a significant advancement in the fields of search and AI. This integration can take many forms, depending on the application [1]. On one hand, search engines can enhance LLMs by grounding their generation in freshly retrieved data, a strategy that’s commonly referred to as retrieval-augmented generation (RAG). On the other hand, LLMs can improve search engines by optimizing various stages of the search pipeline.</p><p>The search pipeline typically consists of three main components: indexing, first-stage retrieval, and reranking. LLMs can contribute at each of these stages. During indexing, LLMs can assist with data preparation tasks, such as text cleaning, entity extraction, and document enrichment by adding domain knowledge, synonyms, or clarifications. In the first-stage retrieval, LLMs can optimize user queries by rewriting them in natural language to improve retrieval accuracy or by mapping them to structured queries (for example, domain‑specific language–style [DSL-style] or Elasticsearch Query Language–style [ES|QL-style] queries) [2]. This blog focuses on query optimization strategies for this stage.</p><p>While there is some research on using LLMs as rerankers [3], the literature is less extensive. Technical blogs on the topic suggest that LLMs as rerankers may not always be the optimal choice, though this area remains an active field of exploration [4,5].</p><p>The advancements in LLMs have also unlocked new possibilities beyond the traditional indexing and retrieval stages. For example, LLMs can be used to generate natural language responses grounded in retrieved data (RAG). Users increasingly anticipate coherent, natural language responses to their queries, which are also dependable and guaranteed to be based on retrieval data. This is a significant shift in user expectation, occurring quickly since LLM-integrated search engines became generally available, demonstrating a major improvement in user experience. A language model that can understand intent, retrieve data, and synthesize information is especially valuable in multi-hop scenarios where a query requires combining information from various sources.</p><p>This is even clearer when looking at the application of LLMs in the creation of interactive, conversational search interfaces. These interfaces allow users to submit queries, provide feedback on responses, or introduce clarifications, enabling iterative improvements to the results, while making use of the historical context of the conversation. Taking this a step further, integrating LLMs with autonomous capabilities, such as planning, retrieving, reasoning, and decision-making, can lead to the development of agentic search systems. 
These systems can refine results based on user feedback or self-evaluation, creating a dynamic and intelligent search experience.</p><p>Finally, LLMs are widely used in search tool development, from data generation to serving as backbones for embedding and reranking models [6,7,8]. Synthetic data generation has become a common step in training retrieval models, and LLMs are increasingly being used as judges to generate labels for training and evaluation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e8ce0659b6c1b5bf59c1a7d4ab23e672f0970533-1600x656.png" style="width: 100%;" /><h3>Query rewriting and optimization strategies</h3><p>Query rewriting strategies are best understood by categorizing user queries into two main types: <em>retrieval</em> and <em>computational</em>.</p><h4>Retrieval queries</h4><p>Unlike computational queries, the user&#x27;s intent here is information retrieval, not calculation. These are the standard queries handled by retrieval algorithms, like lexical and vector search. For example, for the following query:</p><p>&quot;What is the origin of COVID-19?&quot;</p><p>the targets are texts providing answers or context relevant to the query.</p><h4>Computational queries</h4><p>These queries require calculation, aggregation, or structured filtering to produce an answer. They must be translated from natural language into a structured query language, like Elasticsearch DSL or ES|QL.</p><p>Consider, for example, the query:</p><p>&quot;What was the average amount spent by customers who placed more than five orders in the last month?&quot;</p><p>Assuming that information on the orders and customers can be found in some available index, this query requires more than simple text matching. It involves filtering by a date range, grouping by customer, calculating order counts, keeping only customers with more than five orders, and computing the final average. In this case, the LLM&#x27;s task is to parse the natural language and generate the corresponding structured query to execute these calculations.</p><p>Another example would be:</p><p>&quot;Which universities in Germany have an acceptance rate below 20%, and what is their average tuition fee?&quot;</p><p>Here, no single indexed document may contain that specific information; instead, acceptance rates and tuition fees may appear in separate documents.</p><p>In computational queries, the model is essentially expected to decompose the query into a retrievable informational query and a calculation that can be performed once the retrieved data is available, or to build a structured query that can do both.</p>
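<p>To make the computational case concrete, here is one possible ES|QL rendering of the orders example. This is a minimal sketch, assuming a hypothetical <code>orders</code> index with <code>customer_id</code>, <code>amount</code>, and <code>order_date</code> fields; the exact query an LLM should generate depends on the actual mapping:</p><pre><code>FROM orders
| WHERE order_date >= NOW() - 30 days
| STATS order_count = COUNT(*), total_spent = SUM(amount) BY customer_id
| WHERE order_count > 5
| STATS avg_spent = AVG(total_spent)</code></pre>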
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Query type</td><td style="border: 1px solid #ccc; padding: 4px;">Primary mechanisms</td><td style="border: 1px solid #ccc; padding: 4px;">Example</td><td style="border: 1px solid #ccc; padding: 4px;">Query rewriting task</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Retrieval</td><td style="border: 1px solid #ccc; padding: 4px;">Lexical and vector search. </td><td style="border: 1px solid #ccc; padding: 4px;">"What is the origin of COVID-19?"</td><td style="border: 1px solid #ccc; padding: 4px;">Optimize the query's text to improve lexical or vector matching.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Computational</td><td style="border: 1px solid #ccc; padding: 4px;">Calculation, aggregation, or structured filtering.</td><td style="border: 1px solid #ccc; padding: 4px;">"Which universities in Germany have an acceptance rate below 20%, and what is their average tuition fee?"</td><td style="border: 1px solid #ccc; padding: 4px;">Decompose the query: Retrieve relevant data (university profiles), and then perform a calculation (average tuition fee) on the subset of retrieved documents.</td>
                    </tr>
              </tbody>
            </table>
          <h3>Design methodology: Template-based expansion</h3><p>The main focus of this blog is on retrieval queries. However, unlike standard approaches that simply ask an LLM to rewrite or completely rephrase a query, we adopted a template-based design methodology.</p><p>Improving query-text-to-query-text relevance by completely rephrasing the query is often not straightforward for either vector or lexical search. It introduces the complexity of merging search results when multiple hypotheses are considered, and the LLM can often drift from the original user intent. Instead, we explore expanding the original query through an Elasticsearch Query DSL template + prompt strategy. By using specific prompts, we guide the LLM to output specific textual elements (like a list of entities, synonyms, or a pseudo-answer) rather than giving it free rein. These elements are then plugged into a predefined Elasticsearch Query DSL template (a search &quot;recipe&quot;).</p><p>This approach reduces the scope of the LLM application, making the output more deterministic. In our experiments, the LLM is simply prompted to output some text, which is then inserted into the template.</p><p>To validate this approach, we performed a limited exploration of different Elasticsearch primitives to identify and &quot;freeze&quot; a good-enough search template. This allowed us to test how different prompting strategies affect relevance within that fixed structure, rather than changing the structure itself.</p><p>While this blog focuses on retrieval queries, and on lexical extraction and semantic expansion strategies where the linguistic aspect plays the major role, this methodology is flexible. Specific templates could be designed for other retrieval query use cases, such as handling product codes, since relevance criteria are often context-dependent. However, use cases with queries dependent on complex aggregations or strict filtering should be considered computational queries, which would require query optimization strategies outside the scope of this blog.</p><h3>Query optimization strategies</h3><p>While query optimization predates LLMs, LLMs excel at this task. They can be prompted to apply several rewriting strategies [9], such as:</p><ul><li>Generic query rephrasing.</li><li>Pseudo-answer generation.</li><li>Noise reduction (removing irrelevant text, extracting important entities).</li><li>Entity enrichment (synonyms, abbreviation expansion, or other related terms).</li><li>Fixing typos.</li><li>A combination of the above.</li></ul><p>Most of these techniques depend on the model’s capacity to understand user intent and its knowledge of the corpus characteristics.</p><p>In the following sections, we’ll present our experimentation with query rewriting for informational queries and their application to Elasticsearch. We’ll present our most successful experiments and discuss our unsuccessful ones.</p><h3>Experiments and results</h3><p>All the experiments presented in this blog were run using Anthropic Claude 3.5 Sonnet. Normalized Discounted Cumulative Gain (NDCG) at position 10 and Recall at positions 10 and 50 are the metrics used to evaluate the experiments throughout the blog. 
All <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">NDCG</a> and <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Recall">recall</a> scores were computed using <code>pytrec_eval</code> [10].</p><p>We experimented with several variations of the query rewriting task for informational queries and measured relevance and recall gains on three widely used information retrieval benchmarks: Benchmarking Information Retrieval (BEIR; 15 datasets, English-only) [11], Multilingual Long-Document Retrieval (MLDR; 13 languages) [12], and Multilingual Information Retrieval Across a Continuum of Languages (MIRACL; 18 languages) [13].</p><p>We mainly explored the following tasks:</p><ul><li>Lexical keyword enrichment.</li><li>Pseudo-answer generation.</li><li>Letting the model decide on a method or a combination of methods among keyword extraction, keyword enrichment, and pseudo-answer generation.</li></ul><p>We detail the prompts we used for each case and expand on some attempted variations below.</p><p>It’s worth noting that, out of the datasets we evaluated, only four within BEIR (NQ, Robust04, Quora, and MS MARCO) contain real user queries that can benefit from generic query rewriting fixes, such as misspelling correction or query cleaning. The rest of the datasets are either synthetically generated (MLDR, MIRACL) or human-constructed (most of the BEIR datasets).</p><h4>Lexical keyword enrichment</h4><p>This is the first task we tried; we considered various prompts in an effort to optimize results. We started from the simplest possible version, prompting the LLM to extract relevant keywords without specifying more details.</p><p><strong>Prompt 1.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query. 
Please, extract them and list them in a comma-separated list. Reply with the list only.</code></pre><p>On a second attempt, we tried a prompt with more explicit instructions, prompting the model to provide only the most important keywords and explaining why that is important for our use case. We also introduced the idea of entity enrichment, prompting the model to augment the original query only if it considers it too short or missing essential information.</p><p><strong>Prompt 2.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            <Instructions>
            Think about the query and select only the very central and necessary entities. 
            They will be used as boosters for the lexical search, so make sure to only extract critical ones. We don't want to boost documents that contain entities that might not be related to the specific context of the query. 
            If, and only if, the query is too short and is missing essential information, see if you can extract synonyms or enrich the query with entities that don't exist in the original query but might help the search.
            </Instructions>
            Return the thinking process inside <thinking> tags and the final entities inside <entities> tags.</code></pre><p>Finally, we tried a prompt with even more explicit instructions and details encouraging the model to apply different techniques based on the original query’s length.</p><p><strong>Prompt 3.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            <Instructions>
            Extraction for boosting - query length independent:
            Think about the query and identify the main context or the information that the user is asking for. 
            Select only the very central and necessary entities. They will be used as boosters for the lexical search, so make sure to only extract critical ones. 
            We don't want to boost documents that contain entities that might not be related to the specific context of the query.

            Enrichment - query length dependent:
            If the query is shorter than 5 words, then extract the most important entities, but also evaluate whether you can enrich with synonyms or other entities that might enhance recall.
            If the query is larger than 10 words stick to entities that exist in the query
            If you find any initials, return the initials as well as the full words represented by the initials

            Stem-proofing - query length independent:
            Regardless of the query length, return the entities in different forms, in order to overcome stemming miss-matches, provided that the versions of the word are relevant to the query context e.g.: walk, walked, walking.
            </Instructions>
            Return the thinking process inside <thinking> tags and the final entities, comma separated, inside <entities> tags.
            Make sure you are only returning the most important entities! Don't return entities that are too limiting.
</code></pre><p>We ran lexical search tests with the three prompt variations on a subset of BEIR datasets and compared performance in terms of relevance and recall. The following table lists results averaged over the datasets ArguAna, FiQA-2018, Natural Questions (NQ), SciDocs, SciFact, TREC-COVID, Touché 2020, NFCorpus, and Robust04:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Original query</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 1</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 2</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt 3</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td><td style="border: 1px solid #ccc; padding: 4px;">0.345</td><td style="border: 1px solid #ccc; padding: 4px;">0.356</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.454</td><td style="border: 1px solid #ccc; padding: 4px;">0.453</td><td style="border: 1px solid #ccc; padding: 4px;">0.466</td><td style="border: 1px solid #ccc; padding: 4px;">0.455</td>
                    </tr>
              </tbody>
            </table>
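<p>As an aside on evaluation: scores like the ones above are straightforward to reproduce. The snippet below is a minimal sketch of how NDCG@10 and Recall@10/50 can be computed with <code>pytrec_eval</code> [10]; the <code>qrels</code> and <code>run</code> dictionaries are toy placeholders standing in for the real benchmark judgments and retrieval results:</p><pre><code>import pytrec_eval

# qrels: ground-truth relevance judgments per query.
# run: retrieval scores per query (toy placeholders here).
qrels = {"q1": {"doc1": 1, "doc2": 0, "doc3": 1}}
run = {"q1": {"doc1": 13.2, "doc2": 7.5, "doc3": 1.1}}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.10", "recall.10", "recall.50"}
)
per_query = evaluator.evaluate(run)

# Average each metric over all queries, as reported in the tables.
for metric in ("ndcg_cut_10", "recall_10", "recall_50"):
    scores = [measures[metric] for measures in per_query.values()]
    print(metric, sum(scores) / len(scores))</code></pre>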
          <p>We consider a baseline lexical search of the original queries and compare it with experiments where the LLM-output entities are used in lexical search. The results are linearly combined with an Elasticsearch <code>bool</code> query with a <code>must</code> clause wrapping the original query and a <code>should</code> clause for each returned LLM term. (See the <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query">Elasticsearch docs</a> for more information on Boolean queries.) The linear combination can be expressed as:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b5ccd419194eab69ea5c214bfb0e109c4f8248a7-1146x306.png" style="width: 100%;" /><p>Here, <em>n</em> is the number of keywords returned for each query, and the score refers to a lexical BM25 score. The query is written in native Elasticsearch Query DSL and is provided in the following section.</p><p>The rescoring, including the contribution of the extracted keywords, was applied to the top 200 retrieved documents.</p><p>Prompt 2 performs best in both relevance and recall within the top 10 results, improving by about 1 point, whereas the other prompts have little effect. We kept prompt 2 for lexical keyword enrichment going forward.</p><p>We visually inspected the returned keywords for some of the tested queries; the following examples illustrate several of the query optimization techniques mentioned in the previous section.</p><p>Typo corrections:</p><ul><li><strong>Query:</strong> when does season 3 of strnger things come out</li><li><strong>Keywords:</strong> Stranger Things, Season 3</li></ul><p>Noise reduction:</p><ul><li><strong>Query:</strong> Evolvability : What Is It and How Do We Get It?</li><li><strong>Keywords:</strong> evolvability</li></ul><p>Keyword expansion/augmentation:</p><p>a)</p><ul><li><strong>Query:</strong> Trust-Aware Review Spam Detection</li><li><strong>Keywords:</strong> Trust-Aware, Review Spam Detection, fake reviews, credibility</li></ul><p>b)</p><ul><li><strong>Query:</strong> Dimensional inconsistencies in code and ROS messages: A study of 5.9M lines of code</li><li><strong>Keywords:</strong> dimensional inconsistencies, code, ROS, Robot Operating System, 5.9M</li></ul><p>c)</p><ul><li><strong>Query:</strong> low-carb diets</li><li><strong>Keywords:</strong> low-carb, diets, ketogenic, keto, carbohydrate</li></ul><h4>Pseudo-answer generation</h4><p>Prompting the LLM to generate pseudo-answers to the original query is another technique that was explored. It has been explored before in the context of query rewriting [9] and is expected to especially benefit factual queries while matching the advantages of keyword augmentation. For this method we used the prompt:</p><p><strong>Prompt 4.</strong></p><pre><code>You are a search engine assistant and you want to generate 5 possible replies for a query. 
The replies will be used to boost the search results, in a combined manner with the original query. Make sure that the generated results respect the preferences of most search (lexical or vector) engines, that is, they should be concise, relevant, and cover different aspects of the query.
Return the rewritten replies inside <replies> tags. Separate the replies by the line break symbol.
</code></pre><p>We explore some examples of LLM outputs for this prompting strategy. For comparison, the entities extracted with prompt 2 are also listed for each query.</p><p>Providing replies to a general knowledge question:</p><p><strong>Query:</strong> What is the best way to practice web development?</p><ul><li><strong>Replies:</strong><ul><li>Learn HTML, CSS, and JavaScript fundamentals</li><li>Build personal projects and create a portfolio</li><li>Contribute to open-source projects on GitHub</li><li>Follow online tutorials and take coding courses</li><li>Practice responsive design and cross-browser compatibility</li></ul></li><li><strong>Entities (Prompt 2):</strong> development, practice, coding projects, tools</li></ul><p>Offering imagined document titles that might contain a reply to a question that cannot be answered based on general knowledge:</p><p><strong>Query:</strong> In your opinion, what is the best classical song of all time?</p><ul><li><strong>Replies:</strong><ul><li>Best classical compositions ranked by music critics</li><li>Most popular and influential classical pieces in history</li><li>Timeless masterpieces from renowned classical composers</li><li>Iconic classical works that have stood the test of time</li><li>Top classical music recommendations for beginners and enthusiasts</li></ul></li><li><strong>Entities (Prompt 2):</strong> classical music, classical composition, symphony, concerto, opera</li></ul><p><strong>Query:</strong> A description of the history and meaning of the 1st amendment</p><ul><li><strong>Replies:</strong><ul><li>First Amendment: origins, historical context, and significance in US Constitution</li><li>Freedom of speech, religion, press, assembly: core principles of 1st Amendment</li><li>Supreme Court interpretations and landmark cases on First Amendment rights</li><li>Evolution of First Amendment protections throughout American history</li><li>Impact and controversies surrounding the First Amendment in modern society</li></ul></li><li><strong>Entities (Prompt 2):</strong> Amendment, Constitution, Bill of Rights</li></ul><p>We observe an interesting combination of LLM output modes for this strategy. Different types of queries are handled differently:</p><ul><li>When the query can be directly replied to based on general knowledge, the LLM returns possible replies.</li><li>If not, the LLM “imagines” what the document the user might want could look like and returns these title-like document descriptions as replies.</li></ul><p>Both methods end up introducing useful and creative new keywords.</p><h4>Letting the model choose a method</h4><p>Finally, experiments were run using a prompt that encouraged the model to select the most appropriate method, or combination of methods, and asked it to explain the rationale behind its choice. We expect the model to be inherently capable of “thinking” through how each query can be improved in a search context:</p><p><strong>Prompt 5.</strong></p><pre><code>You are a search engine and you want to extract the relevant keyword, codewords or proper names from a query.
            There are 3 basic methods to do so:
            1. extract important entities and keywords
            2. create a pseudo answer to the query
            3. expand the initial query with synonyms and related terms
            Given a query, choose a method or a combination of any of them and return the rewritten query inside <rewritten query> tags.
            You can separate the different parts or terms with the line break symbol.
            Also return the rationale behind your choice of methods inside <thinking> tags. Why does this query need this type of method?</code></pre><p>The query rewriting implementation for prompts 4 and 5 closely resembles that of prompt 2. Specifically, the LLM output is incorporated into a Boolean query, where the original query is placed in a <code>must </code>clause and each LLM-generated term is included in a <code>should</code> clause. For prompt 4, an LLM-output term represents a single pseudo-answer, while for prompt 5, it represents a rewrite.</p><p>Overall, the change in performance using different prompts is significant in the context of lexical search, where prompts 4 and 5 outperform prompt 2. In the following table, LKE refers to lexical keyword extraction using prompt 2, PA refers to pseudo-answer generation using prompt 4, and MC stands for model’s choice and refers to prompt 5. The model’s output is used according to equation 1.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0bb953db787dcb03c9221e9b6e7969d3bffdb17f-2002x670.png" style="width: 100%;" /><p>In the final row of the table, the scores are averaged at the benchmark level. It’s computed as an average of the average scores of BEIR, MLDR, and MIRACL benchmarks. The pseudo-answers and model’s choice strategies perform better across metrics, with pseudo-answers being slightly better.</p><p>We further analyze these prompting techniques and obtain more results in the following section, with respect to vector search experiments.</p><p>We also present the results of lexical search when using <em>only</em> the query rewriting terms. In the table below, the original query&#x27;s contribution to the scores is entirely excluded:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4dcdfe2572217ce017c746c55f016f282c61c6e3-2006x670.png" style="width: 100%;" /><p>Excluding the original query scores from the result seems to hurt relevance performance by average. In terms of Recall@50, the pseudo-answer strategy results in improved performance compared to baseline, but the boost is smaller than when the strategy includes the original query.</p><p>Overall, we recommend combining the query rewriting terms with the original query to achieve gains across metrics in lexical search.</p><h4>Large language models versus small language models</h4><p>For the majority of the results discussed in this blog, we utilized Anthropic&#x27;s Claude 3.5 Sonnet LLM. However, we also experimented with a smaller model to assess how inference cost affects performance. We tried LKE with Anthropic’s Claude 3.5 Haiku for a subset of datasets from BEIR (ArguAna, FiQA-2018, Natural Questions [NQ], SciDocs, SciFact, TREC-COVID, Touché 2020, NFCorpus, Robust04).</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;"></td><td style="border: 1px solid #ccc; padding: 4px;">Original query </td><td style="border: 1px solid #ccc; padding: 4px;">LKE with Sonnet</td><td style="border: 1px solid #ccc; padding: 4px;">LKE with Haiku</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.346</td><td style="border: 1px solid #ccc; padding: 4px;">0.364</td><td style="border: 1px solid #ccc; padding: 4px;">0.368</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Recall@10</td><td style="border: 1px solid #ccc; padding: 4px;">0.454</td><td style="border: 1px solid #ccc; padding: 4px;">0.472</td><td style="border: 1px solid #ccc; padding: 4px;">0.475</td>
                    </tr>
              </tbody>
            </table>
          <p>Relevance and recall within the top 10 results are essentially unaffected by switching to the smaller model. While this initial investigation is not exhaustive and requires further study in real-world scenarios that implement query optimization, these first results suggest that small language models (SLMs) are a viable option for this specific use case.</p><p>A comparison between Claude 3.5 Sonnet and Claude 3.5 Haiku is provided below:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Model</td><td style="border: 1px solid #ccc; padding: 4px;">Number of
parameters</td><td style="border: 1px solid #ccc; padding: 4px;">Context window</td><td style="border: 1px solid #ccc; padding: 4px;">Max output</td><td style="border: 1px solid #ccc; padding: 4px;">Input cost</td><td style="border: 1px solid #ccc; padding: 4px;">Output cost</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Claude 3.5 Sonnet</td><td style="border: 1px solid #ccc; padding: 4px;">Est. ~175B</td><td style="border: 1px solid #ccc; padding: 4px;">200K</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$15.00 / million tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$4.00 / million tokens</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Claude 3.5 Haiku</td><td style="border: 1px solid #ccc; padding: 4px;">Est. ~20B</td><td style="border: 1px solid #ccc; padding: 4px;">200K</td><td style="border: 1px solid #ccc; padding: 4px;">8192 tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$3.00 / million tokens</td><td style="border: 1px solid #ccc; padding: 4px;">$0.80 / million tokens</td>
                    </tr>
              </tbody>
            </table>
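<p>To illustrate how lightweight this setup is in practice, the sketch below wires keyword extraction into the <code>must</code>/<code>should</code> Boolean structure used throughout these experiments, using the Anthropic Python SDK. The prompt is abridged from prompt 2, and the model IDs and field names are illustrative; swapping Sonnet for Haiku is a one-line change:</p><pre><code>import re
import anthropic

# Abridged version of prompt 2; see the full text above.
SYSTEM_PROMPT = (
    "You are a search engine and you want to extract the relevant keyword, "
    "codewords or proper names from a query. "
    "Return the final entities, comma separated, inside <entities> tags."
)

def extract_keywords(query: str) -> list[str]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # or "claude-3-5-sonnet-20241022"
        max_tokens=512,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": query}],
    )
    text = response.content[0].text
    match = re.search(r"<entities>(.*?)</entities>", text, re.DOTALL)
    return [k.strip() for k in match.group(1).split(",")] if match else []

def build_bool_query(query: str, keywords: list[str]) -> dict:
    # Equation 1: original query in `must`, each LLM keyword in `should`.
    return {
        "query": {
            "bool": {
                "must": {"match": {"text": query}},
                "should": [{"match": {"text": k}} for k in keywords],
            }
        }
    }</code></pre>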
          <h3>Query rewriting in Elasticsearch</h3><p>In enterprise search, where precision and control are paramount, we explored methods that integrate query rewriting with existing search functionality. The focus was on strategies that build upon the original query to target relevance gains without a high implementation cost.</p><p>Elasticsearch features a wide range of search tools that tackle different search scenarios. It supports lexical and vector retrieval, as well as rerankers. We look for optimal ways to integrate query rewriting strategies in Elasticsearch, exploring both base retrievers and hybrid methods.</p><p>In the previous section, we presented results on lexical search and introduced equations 1 and 2. These correspond to the following Elasticsearch Query DSL code:</p><pre><code>{
  "query": {
    "bool": {
      "must":  { "match": { "text": "ORIGINAL QUERY" } },
      "should": [
        { "match": { "text": "QR TERM 1" } },
        { "match": { "text": "QR TERM 2" } },
        { "match": { "text": "QR TERM 3" } }
      ]
    }
  }
}</code></pre><p>QR TERM 1, 2, and 3 stand for query rewriting terms and refer to whatever the LLM output represents: keywords, pseudo-answers, or other types of replies.</p><p>The <code>bool</code> query functions like a linear combination of terms. Crucially, the <code>must</code> clause enforces <em>hard</em> requirements, meaning any document that fails to match this clause is excluded from the results. In contrast, the <code>should</code> clause operates as a score booster: Documents matching it receive a higher final score, but documents that don&#x27;t match are not discarded from the results.</p><p>Through iterative experimentation, we determined the most effective query configuration. Initial attempts included querying solely with terms generated by the LLM or various combinations of the original query and LLM terms. We observed that overreliance on LLM output reduced relevance. The optimal setup, which consistently yielded the best results, required the full inclusion of the original query, with the LLM output used only to selectively boost the ranking of certain documents.</p><h4>Dense vector search as base retriever</h4><p>When moving to vector search, the narrative changes. It’s already well-established in the industry that hybrid search (lexical + vector) improves both relevance and recall by combining the semantic understanding of dense vectors with the exact matching precision of BM25. Our goal here was to determine whether query rewriting applied to a vector retriever covers the same gap that hybrid search fixes or provides additional improvement.</p><p>For experiments where the base retriever was not BM25 but rather a K-Nearest Neighbors (KNN) search using dense vectors, we chose to explore the maximum gains we can achieve in terms of relevance and recall using an optimized, weighted linear combination. We opted to optimize the weight to ensure that the impact of query rewriting strategies on performance is isolated, preventing any potential influence from a poorly chosen linear combination on the results. We run an optimization process expressed in equation 3 using Bayesian optimization (Optuna [14]) on the test set itself.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e4d8498261842d6d0942661e8bbfbe8726ddce9c-1186x176.png" style="width: 100%;" /><p>The optimization process was conducted on the same text queries used for evaluation to establish the upper bounds of potential performance improvements. Because we’re tuning only one parameter, the chance of overfitting the test data is minimal. We verify this hypothesis below by running the optimization for some datasets on the train split and observing the difference in performance.</p><p>For vector search evaluation, we use two optimization metrics:</p><ul><li><strong>LINEAR NDCG@10 OPTIMIZED(vector_oq, bm25_qr):</strong> The weight is optimized to achieve the maximum NDCG at the top 10 results.</li><li><strong>LINEAR RECALL@50 OPTIMIZED(vector_oq, bm25_qr):</strong> The weight is optimized to achieve the maximum recall at the top 50 results.</li></ul><p>In these metrics, <strong>oq</strong> stands for the original query, and <strong>qr</strong> stands for query rewriting. 
We include recall at 50 to assess query optimization&#x27;s performance as a first-stage retriever, with the assumption that the search results will subsequently be processed by a reranker.</p><p>To provide a comparison, we also conducted experiments where the BM25 scores of the original query were combined with the vector search scores. These combinations are referred to as:</p><ul><li><strong>LINEAR NDCG@10 OPTIMIZED(vector_oq, bm25_oq)</strong></li><li><strong>LINEAR RECALL@50 OPTIMIZED(vector_oq, bm25_oq)</strong></li></ul><p>For the experiments in the following tables, we used the <code>multilingual-e5-large</code> [15] dense vector model for the BEIR and MIRACL benchmarks, and the <code>Qwen3-Embedding-0.6B</code> [16] model to search long-context documents in the MLDR benchmark.</p><p>First, we verify that performing the optimization on the evaluation split instead of the training split yields results that follow the same trends. The pseudo-answers prompting strategy was used to compute the qr scores.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fb329c72a7b1123428ed519f77d008c3084c064e-1736x532.png" style="width: 100%;" /><p>We compare the recall@50 scores when optimizing on the training split versus the evaluation split for MIRACL and MLDR, finding that both produce results in the same range.</p><p>In the following tables, the evaluation split is used for optimization. The final row contains results that are averaged at the benchmark level, as in the previous tables.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9d10f644b474b5e114504e9515eee7457bbfa565-2014x820.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/dfe8d5cbe9d9f5d7e8d2aa6f146550465ed4e73b-2014x402.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a88b3aef38fad245f629951b247a16fc8eb93574-1452x352.png" style="width: 100%;" /><p>We omit the Recall@10 results as they are very similar to the NDCG@10 results. These tables show no advantage in hybrid search using QR terms instead of the original query. In terms of relevance, replacing the original query with QR seems to deteriorate results. In terms of recall, some gains are achieved in BEIR and MIRACL, but the averaged score reveals no advantage over a well-tuned hybrid search.</p><p>We further explored hybrid search using reciprocal rank fusion (RRF), relying on the built-in Elasticsearch functionality. Method <strong>RRF(vector_oq, bm25_qr)</strong> refers to the DSL code:</p><pre><code>{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "knn": {
            "field": "dense_vector_field",
            "query_vector": ORIGINAL_QUERY_VECTOR,
            "k": 100,
            "num_candidates": 200
          }
        },
        {
          "standard": {
            "query": {
              "bool": {
                "should": [
                  { "match": { "text": "QR TERM" } },
                  { "match": { "text": "QR TERM" } },
                  { "match": { "text": "QR TERM" } }
                ]
              }
            }
          }
        }
      ],
      "rank_window_size": 500,
      "rank_constant": 20
    }
  },
  "size": 200,
}</code></pre><p>The corresponding baseline run is denoted <strong>RRF(vector_oq, bm25_oq)</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d3359c99afbf0649b9a610689198ff21e2cf0ac3-1398x504.png" style="width: 100%;" /><p>Replacing the original query with LLM-output terms to get lexical search scores deteriorates recall on average in all cases. In terms of relevance, we observe a marginal improvement in BEIR and a notable increase of ~8 points of NDCG@10 in MIRACL. However, relevance in MLDR is so negatively affected that the baseline runs come out higher on average.</p><p>So far, our exploration has not revealed a definitive strategy for enhancing vector search performance through the exposed query rewriting methods. Applying the lessons from the lexical search exploration, we experiment with hybrid search strategies that include both the original query and the query rewriting terms. We focus on a first-stage retrieval scenario and aim to improve Recall@50. In the following experiment, we try hybrid search with RRF, where the lexical scores are computed as an optimized linear combination of BM25 scores of the original query and BM25 scores of LLM-output terms. To establish an upper bound for potential performance improvements, we perform Bayesian optimization on the set of queries using the Optuna library [14].</p><p>We denote this experiment <strong>RRF(vector_oq, LINEAR RECALL50 OPTIMIZED(bm25_oq, bm25_qr))</strong>. The same baseline as in the previous table, <strong>RRF(vector_oq, bm25_oq)</strong>, still provides a useful comparison in this experiment.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41e0eb5cd708c3705f5dc5ac0f7c3c01074556a2-1394x614.png" style="width: 100%;" /><p>This experiment was not designed for relevance optimization; however, the resulting NDCG@10 scores are documented for completeness. An increase of 1–3 percentage points in recall@50 was achieved with this method, with the pseudo-answer prompting strategy being the most consistently beneficial across the benchmarks. This strategy is suitable only for informational queries based on general knowledge or when the LLM possesses the necessary domain expertise. This method was employed to determine the upper limits of potential performance improvements. It&#x27;s important to note that optimizing the weights using the complete test dataset, as was done here, is not feasible in real-world applications.</p><p>PA turns out to be the most successful strategy for BEIR and MIRACL, while LKE gives the highest boost in recall for MLDR.</p><h4>First-stage retriever and reranking</h4><p>To maximize performance in a production setting, query rewriting could be viewed as part of a multistage pipeline. The goal of the first-stage retriever is not to be good at relevance but rather at recall, that is, to ensure the relevant documents make it into the candidate set for the reranker.</p><p>We implemented the following pipeline configuration:</p><ol><li><strong>Base retrieval:</strong> Retrieve the top 200 documents.</li><li><strong>Entity boosting:</strong> Rescore based on LLM-extracted entities (from prompt 2).</li><li><strong>Pruning:</strong> Cut to the top 50 documents.</li><li><strong>Reranking:</strong> Apply <code>jina-reranker-v2</code> to the top 50 documents.</li></ol><p>We compared the final outcomes of this pipeline using different first-stage inputs, including base retrievers alone and base retrievers plus QR. 
For the QR and hybrid baselines, we used the experiments that obtained the highest recall.</p><ul><li>Lexical corresponds to the original query using BM25.</li><li>Lexical + QR corresponds to the experiment using prompt MC.</li><li>Vector corresponds to the original query using vector search.</li><li>Vector + QR corresponds to the experiment LINEAR RECALL50 OPTIMIZED(vector_oq, bm25_qr) for prompt PA.</li><li>Hybrid corresponds to RRF(vector_oq, bm25_oq).</li><li>Hybrid + QR corresponds to the experiment RRF(vector_oq, LINEAR RECALL50 OPTIMIZED(bm25_oq, bm25_qr)) for prompt PA.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8e5bd75acf283e162d90a4778d6e0c2d0def4ec8-1210x434.png" style="width: 100%;" /><p>The last row shows the averaged results across BEIR, MLDR, and MIRACL.</p><p>QR in the lexical and vector settings is applied by combining the scores as shown in equations 1 and 2, optimized for recall. RRF is widely considered a strong standard for maximizing recall in hybrid search. Our results show that an optimized linear combination of the original vector scores and QR scores can actually achieve better recall than the RRF configurations, which cannot be tuned in the same way. This suggests that, with the right weighting, a linear combination can be more effective than rank-based fusion at using LLM-generated terms to improve recall.</p><p>The table below shows the averages across datasets for NDCG@10 before and after reranking and recall@50:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8a0b9b1fe65c767199d212de8084f900f00d99c5-1440x326.png" style="width: 100%;" /><p>Relevance (NDCG@10) improves after reranking is applied, consistently tracking the gains in recall@50.</p><h4>Strategy domain adaptation</h4><p>Unlike open web search, enterprise domains (such as legal, medical, or internal wikis) have specific vocabularies that generic LLMs might miss. Below we discuss several strategies that could be used to tailor the presented QR strategies for specific domains:</p><ul><li>Domain-specific tuning: Further tune the query rewriting instructions (prompts) specifically for niche domains or specific use cases.</li><li>In-context learning: Use few-shot examples retrieved from a knowledge base (for example, append the top <em>k</em> BM25 results from a quick, cheap initial search to the prompt) to ground the rewriting process.</li><li>LLMs + rules hybrid approach: Combine the flexibility of LLMs with deterministic rules for specific domain terms.</li><li>Gated query rewriting: Selectively apply QR only when necessary, employing rules, custom classifiers, or specialized prompts and models to detect whether the query requires optimization for a specific use case.</li><li>Query rewriting for generation: Expanding the query or context not just for retrieval but specifically to improve the quality of the final LLM response.</li></ul><h3>Conclusions</h3><p>The investigation shows how simple LLM-driven query optimization can have a positive impact within the modern search ecosystem.</p><h4>Key takeaways</h4><ul><li>LLMs are a good complement to improve lexical search: Using LLMs to enrich keywords or generate pseudo-answers provides consistent improvements in both relevance and recall for standard lexical retrieval.</li><li>Hybrid search is harder to beat: When using dense vector search or hybrid retrieval, simple query rewriting terms offer marginal gains. 
The best results come from using QR to <em>boost</em> existing hybrid scores rather than replacing them.</li><li>Pseudo-answers improve recall: Generating hypothetical answers (pseudo-answer generation) proved to be the most effective strategy for maximizing recall in multistage pipelines.</li><li>Structured guidance over free-form generation: Guiding the LLM is critical. Rather than allowing the LLM to freely rephrase a query, providing a strict template (like extracting specific entities to fit a DSL clause) ensures that the output adds value without introducing noise. A specific prompt + DSL template combination allows designing for a specific relevance use case (such as lexical extraction versus semantic expansion) and reduces the scope for error.</li><li>Efficiency with small models: The strategies explored here are simple enough to be deployed effectively using SLMs or distilled into compact models, offering a cost-effective solution.</li></ul><p>The following table contains some practical guidelines on how to incorporate the most successful query rewriting techniques into your search pipeline, depending on your particular setting:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Real-world setting

</td><td style="border: 1px solid #ccc; padding: 4px;">QR strategy </td><td style="border: 1px solid #ccc; padding: 4px;">Implementation</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Lexical search </td><td style="border: 1px solid #ccc; padding: 4px;">Lexical keyword enrichment (prompt 2)</td><td style="border: 1px solid #ccc; padding: 4px;">Boost search performance without the cost of migrating to vector search:
Use the LLM's output (keywords, entities, synonyms) in an Elasticsearch `should` clause to boost documents that match the expanded terms, while keeping the original query in a `must` clause. Avoid relying solely on QR-generated terms, and use SLMs to reduce costs without compromising QR performance.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Multistage retrieval with reranking</td><td style="border: 1px solid #ccc; padding: 4px;">Pseudo-answer generation (prompt 4)</td><td style="border: 1px solid #ccc; padding: 4px;">In a multistage search system (retriever followed by a reranker), use the pseudo-answers as boosting terms. Use the Elasticsearch `should` +` must` clauses structure as before to retrieve the widest set of potentially relevant documents before passing them to the reranker to score.</td>
                    </tr>
              </tbody>
            </table>
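<p>For reference, the multistage recipe in the second row can be expressed directly in Query DSL with a retriever tree. The sketch below is a minimal, hypothetical example: it assumes a reranking inference endpoint registered as <code>my-reranker</code> and uses placeholder pseudo-answer terms; check the Elasticsearch retrievers documentation for the options available in your version:</p><pre><code>{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "bool": {
              "must": { "match": { "text": "ORIGINAL QUERY" } },
              "should": [
                { "match": { "text": "PSEUDO-ANSWER 1" } },
                { "match": { "text": "PSEUDO-ANSWER 2" } }
              ]
            }
          }
        }
      },
      "field": "text",
      "inference_id": "my-reranker",
      "inference_text": "ORIGINAL QUERY",
      "rank_window_size": 50
    }
  },
  "size": 10
}</code></pre>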
          <p>To reiterate our earlier comments, these solutions will benefit scenarios where most queries are retrieval queries, that is, scenarios where relevance does not depend on specific filtering, aggregations, or other types of structure. However, the same template-based meta-strategy can potentially be adapted to such cases.</p><h4>The value of task‑focused tuning in search pipeline design</h4><p>One of the broader implications of this investigation is the importance of viewing search pipeline architectures as a set of modular, well‑defined stages where lightweight, task‑focused adjustments can meaningfully improve performance, allowing pipeline components to be tuned for specific retrieval goals. Such tuning could involve a variety of strategies: experimenting with how LLMs are prompted to target particular gains (such as maximizing recall versus precision), parametrizing how LLM output is combined with the original query (for example, via a DSL query template), evaluating the impact of different rescoring strategies (such as MMR or <code>match_phrase</code>-based query rescoring) on an initial candidate set (such as the top 200 retrieved documents), and layering these techniques before a more computationally intensive reranking step. Overall, this perspective encourages designing pipelines with clear component boundaries and a small, controllable set of hyperparameters that can be tuned to achieve targeted retrieval outcomes. Furthermore, although our experiments demonstrated measurable gains in a general‑purpose IR setting, we expect these interventions to be even more impactful in scenarios where relevance is narrowly defined, allowing the template‑based approach to improve results in a more controlled way.</p><h4>LLM-driven query optimization in modern search pipelines</h4><p>Simple query rewriting strategies can be well-suited, easy-to-plug-in solutions for targeted performance gains. In environments where LLMs are already in use (for example, RAG, conversational interfaces, or agentic search workflows), the overhead of an extra LLM call for rewriting is absorbed, making latency less of an issue. This allows for significant and targeted improvements in relevance and recall across specific domains or challenging query types.</p><p>All the strategies discussed in this blog consist of a combination of an LLM prompt and an Elasticsearch Query DSL template, and hence they can be naturally integrated into the application layer of a search solution.</p><p>Finally, Elasticsearch has already begun integrating LLM-powered capabilities directly into its search experience, offering tools like <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a> <code>COMPLETION</code>, managed LLMs through the <a href="https://www.elastic.co/docs/explore-analyze/elastic-inference/eis">Elastic Inference Service (EIS)</a>, and lately, the possibility of building a custom query rewriting tool within <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>.</p><p>A detailed table of the results presented can be found <a href="https://docs.google.com/spreadsheets/d/1kd7ToPZFwFjow3OIIwQv_-G34SDZzwYYOgTFJT76vgY/edit?gid=521501384#gid=521501384">here</a>.</p><h2>References</h2><ol><li>Xiong, H., Bian, J., Li, Y., Li, X., Du, M., Wang, S., Yin, D., &amp; Helal, S. (2024). When search engine services meet large language models: Visions and challenges. <a href="https://doi.org/10.48550/arXiv.2407.00128">arXiv</a>.</li><li>Remmey, M. (2024, May 14). 
<em>NL to SQL architecture alternatives</em>. Azure Architecture Blog. <a href="https://techcommunity.microsoft.com/blog/azurearchitectureblog/nl-to-sql-architecture-alternatives/4136387">https://techcommunity.microsoft.com/blog/azurearchitectureblog/nl-to-sql-architecture-alternatives/4136387</a></li><li>Abdallah, A., Piryani, B., Mozafari, J., Ali, M., &amp; Jatowt, A. (2025, August 22). <em>How good are LLM-based rerankers? An empirical analysis of state-of-the-art reranking models</em>. arXiv. <a href="https://arxiv.org/abs/2508.16757">arxiv</a></li><li>Joshi, A., Shi, Z., Goindani, A., &amp; Liu, H. (2025, October 22). <em>The case against LLMs as rerankers</em>. Voyage AI. <a href="https://blog.voyageai.com/2025/10/22/the-case-against-llms-as-rerankers/">https://blog.voyageai.com/2025/10/22/the-case-against-llms-as-rerankers/</a></li><li>Oosterhuis, H., Jagerman, R., Qin, Z., &amp; Wang, X. (2025, July). <em>Optimizing compound retrieval systems</em>. In <em>Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25)</em> (pp. 1–11). ACM. <a href="https://doi.org/10.1145/3726302.3730051">https://doi.org/10.1145/3726302.3730051</a></li><li>Zhang Y, Li M, Long D, Zhang X, Lin H, Yang B, Xie P, Yang A, Liu D, Lin J, Huang F, Zhou J. <em>Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.</em> arXiv preprint arXiv:2506.05176. 2025.<a href="https://arxiv.org/abs/2506.05176"> arXiv</a></li><li>Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. <em>Improving Text Embeddings with Large Language Models.</em> arXiv preprint arXiv:2401.00368. 2024.<a href="https://arxiv.org/pdf/2401.00368"> arXiv</a></li><li>Lee J, Dai Z, Ren X, Chen B, Cer D, Cole JR, et al. <em>Gecko: Versatile Text Embeddings Distilled from Large Language Models.</em> arXiv preprint arXiv:2403.20327. 2024.<a href="https://arxiv.org/abs/2403.20327"> arxiv</a></li><li>Li, Zhicong; Wang, Jiahao; Jiang, Zhishu; Mao, Hangyu; Chen, Zhongxia; Du, Jiazhen; Zhang, Yuanxing; Zhang, Fuzheng; Zhang, Di; Liu, Yong (2024). <em>DMQR-RAG: Diverse Multi-Query Rewriting for RAG</em>. arXiv preprint arXiv:2411.13154. DOI:10.48550/arXiv.2411.13154. (<a href="https://arxiv.org/abs/2411.13154">arxiv</a>)</li><li>Van Gysel C, de Rijke M. <em>pytrec_eval: An extremely fast Python interface to trec_eval</em>. In: Proceedings of the 2018 SIGIR Workshop on Reproducibility in Information Retrieval; 2018.</li><li>Thakur N, Reimers N, Rücklé A, Srivastava A, Gurevych I. <em>BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models</em>. arXiv [cs.IR]. 2021;arXiv:2104.08663. (<a href="https://arxiv.org/abs/2104.08663">arxiv</a>)</li><li>Chen J, Xiao S, Zhang P, Luo K, Lian D, Liu Z. <em>BGE M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation</em>. arXiv [cs.CL]. 2024;arXiv:2402.03216. (<a href="https://arxiv.org/abs/2402.03216">arxiv</a>)</li><li>Zhang X, Thakur N, Ogundepo O, Kamalloo E, Alfonso-Hermelo D, Li X, Liu Q, Rezagholizadeh M, Lin J. <em>MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages</em>. Trans Assoc Comput Linguistics. 2023;11:1114-1131. (<a href="https://aclanthology.org/2023.tacl-1.63/">aclanthology.org</a>)</li><li>Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. 
<em>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining (KDD ’19).</em> 2019:2623-2631. doi:10.1145/3292500.3330701</li><li>Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. <em>Multilingual E5 Text Embeddings: A Technical Report.</em> arXiv preprint arXiv:2402.05672. Published 2024. Accessed November 18, 2025. <a href="https://arxiv.org/abs/2402.05672">arXiv</a></li><li>Zhang Y, Li M, Long D, Zhang X, Lin H, Yang B, Xie P, Yang A, Liu D, Lin J, Huang F, Zhou J. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. <em>arXiv preprint</em> arXiv:2506.05176. Published 2025. Accessed November 18, 2025. <a href="https://arxiv.org/abs/2506.05176">arXiv</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building human-in-the-loop (HITL) AI agents with LangGraph and Elasticsearch]]></title>
            <description><![CDATA[ Learn what human-in-the-loop (HITL) is and how to build an HITL system with LangGraph and Elasticsearch for a flight system.]]></description>
            <link>https://www.elastic.co/search-labs/blog/human-in-the-loop-hitllanggraph-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/human-in-the-loop-hitllanggraph-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/17a2863e110d7415959184a8375d70676695f35c-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we’ll explore how to combine <a href="https://langchain-ai.github.io/langgraph/">LangGraph</a> and Elasticsearch to build a human-in-the-loop (HITL) system. This approach allows AI systems to involve users directly in the decision-making process, making interactions more reliable and context-aware. We’ll implement a practical example using a context-driven scenario to demonstrate how LangGraph workflows can integrate with Elasticsearch to retrieve data, handle user input, and produce refined results.</p><h2>Requirements</h2><ul><li>Node.js version 18 or newer</li><li>OpenAI API key</li><li>Elasticsearch 8.x+ deployment</li></ul><h2>Why use LangGraph for production HITL systems</h2><p>In a <a href="https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch">previous article</a>, we presented LangGraph and its benefits for building a retrieval-augmented generation (RAG) system using large language models (LLMs) and <a href="https://docs.langchain.com/oss/javascript/langgraph/graph-api#conditional-edges">conditional edges</a> to automatically make decisions and display results. Sometimes we don’t want the system to act autonomously end to end, but rather want users to select options and make decisions within the execution loop. This concept is called <strong><em>human in the loop</em></strong>.</p><h2>Human in the loop</h2><p>HITL is an AI concept that allows a real person to interact with AI systems to provide more context, evaluate responses, edit responses, ask for more information, and perform other tasks. This is very useful in low-error-tolerance scenarios, such as compliance, decision-making, and content generation, helping improve the reliability of LLM outputs.</p><p>It&#x27;s important to note that the primary purpose of HITL in agentic systems is validation, not blind trust in the agent&#x27;s approach. HITL interventions should be reactive and triggered only when the system detects missing or ambiguous information. This ensures human involvement remains meaningful and adds value, rather than becoming a mandatory checkpoint that interrupts every workflow unnecessarily.</p><p>A common example is when your coding assistant asks you for permission to execute a certain command on the terminal or shows you its step-by-step thinking process for you to approve before it starts coding.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4df9fe7383f0dc665a1b0866651f4f6a40a14aa8-499x215.png" style="width: 100%;" /><h2>Elasticsearch and LangGraph: How they interact</h2><p><a href="https://www.langchain.com/">LangChain</a> allows us to use Elasticsearch as a <a href="https://js.langchain.com/docs/integrations/vectorstores/elasticsearch/">vector store</a> and to perform queries within LangGraph applications, which is useful for executing full-text or semantic searches, while LangGraph is used to define the specific workflow, tools, and interactions. It also adds HITL as an additional interaction layer with the user.</p><h2>Practical implementation: Human in the loop</h2><p>Let’s imagine a scenario where a lawyer has a question about a case he recently took on. Without the right tools, he would need to manually search through legal articles and precedents, read them in full, and then interpret how they apply to his situation. 
With LangGraph and Elasticsearch, however, we can build a system that searches a database of legal precedents and generates a case analysis that incorporates the specific details and context provided by the lawyer. You can find the full implementation of this use case in the following <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch">repository</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e95bf4ce115cbe9954ee5033776408bed485b8d2-1600x906.png" style="width: 100%;" /><p>The workflow begins when the lawyer submits a legal question. The system performs a vector search in Elasticsearch, retrieves the most relevant precedents, and presents them for the lawyer to choose from, using natural language. After the selection, the LLM generates a draft analysis and checks whether the information is complete. At this point, the workflow can follow two paths: If everything is clear, it proceeds directly to generate a final analysis; if not, it pauses to request clarification from the lawyer. Once the missing context is provided, the system completes the analysis and returns it, taking into consideration the clarifications.</p><p>The following is a graph drawn by LangGraph showing how the app will look at the end of the development. Each node represents a tool or functionality:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43b274d52f3644e3949bea3c8e4312c4c94384c2-307x1000.png" style="width: 100%;" /><h3>Dataset</h3><p>Here’s the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataset.json">dataset</a> that will be used for this example. This dataset contains a collection of legal precedents, each describing a case involving service delays, the court’s reasoning, and the final outcome.</p><pre><code>[
  {
    "pageContent": "Legal precedent: Case B - Service delay not considered breach. A consulting contract used term 'timely delivery' without specific dates. A three-week delay occurred but contract lacked explicit schedule. Court ruled no breach as parties had not defined concrete timeline and delay did not cause demonstrable harm.",
    "metadata": {
      "caseId": "CASE-B-2022",
      "contractType": "consulting agreement",
      "delayPeriod": "three weeks",
      "outcome": "no breach found",
      "reasoning": "no explicit deadline defined, no demonstrable harm",
      "keyTerms": "timely delivery, open terms, schedule definition",
      "title": "Case B: Delay Without Explicit Schedule"
    }
  },
  ...
]</code></pre><h3>Ingestion and index setup</h3><p>The index setup and data ingestion logic are defined in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file, where we declare functions to handle the index creation. This setup is compatible with the LangChain <a href="https://python.langchain.com/docs/integrations/vectorstores/elasticsearch/">vector store interface for Elasticsearch</a>.</p><p><em>Note: The mapping setup is also included in the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file.</em></p><h3>Install packages and set up environment variables</h3><p>Let&#x27;s initialize a <a href="https://nodejs.org/en">Node.js</a> project with default settings:</p><pre><code>npm init -y</code></pre><p>Now let&#x27;s install all required dependencies:</p><ul><li><strong>@elastic/elasticsearch: </strong>Elasticsearch client for Node.js. Used to connect, create indices, and run queries.</li><li><strong>@langchain/community:</strong> Provides integrations for community-supported tools, including the ElasticVectorSearch store.</li><li><strong>@langchain/core:</strong> Core building blocks of LangChain, such as chains, prompts, and utilities.</li><li><strong>@langchain/langgraph:</strong> Adds graph-based orchestration, allowing workflows with nodes, edges, and state management.</li><li><strong>@langchain/openai: </strong>Provides access to OpenAI models (LLMs and embeddings) through LangChain.</li><li><strong>dotenv:</strong> Loads environment variables from a .env file into process.env.</li><li><strong>tsx: </strong>A useful tool for executing TypeScript code directly.</li></ul><p>Run the following command in the console to install all of them:</p><pre><code>npm install @elastic/elasticsearch @langchain/community @langchain/core @langchain/langgraph @langchain/openai dotenv --legacy-peer-deps && npm install --save-dev tsx</code></pre><p>Create a .env file to set up the environment variables:</p><pre><code>ELASTICSEARCH_ENDPOINT=
ELASTICSEARCH_API_KEY=
OPENAI_API_KEY=</code></pre><p>We’ll use <a href="https://www.typescriptlang.org/">TypeScript</a> to write the code because it provides a layer of type safety and a better developer experience. Create a TypeScript file named <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a>, and insert the next section&#x27;s code.</p><h3>Package imports</h3><p>In the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a> file, we start by importing the required modules and initializing the environment variable configuration. This includes the core LangGraph components, the OpenAI model integrations, and the Elasticsearch client.</p><p>We also import the following from the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/dataIngestion.ts">dataIngestion.ts</a> file:</p><ul><li><strong><code>ingestData</code>: </strong>A function that creates the index and ingests the data.</li><li><strong>Document and DocumentMetadata:</strong> Interfaces that define the dataset document structure.</li></ul><h3>Elasticsearch vector store client, embeddings client, and OpenAI client</h3><p>This code will initialize the vector store, the embeddings client, and one OpenAI client:</p><pre><code>const VECTOR_INDEX = "legal-precedents";
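// Model clients and vector store setup; assumes dotenv has already loaded
// ELASTICSEARCH_ENDPOINT and ELASTICSEARCH_API_KEY into process.env (see "Package imports").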

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});

const esClient = new Client({
  node: process.env.ELASTICSEARCH_ENDPOINT,
  auth: {
    apiKey: process.env.ELASTICSEARCH_API_KEY ?? "",
  },
});

const vectorStore = new ElasticVectorSearch(embeddings, {
  client: esClient,
  indexName: VECTOR_INDEX,
});</code></pre><p>The application workflow state schema enables communication between nodes:</p><pre><code>const LegalResearchState = Annotation.Root({
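  // Each field is a typed state channel; nodes return partial updates that LangGraph merges into this shared state.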
  query: Annotation<string>(),
  precedents: Annotation<Document[]>(),
  userChoice: Annotation<string>(),
  selectedPrecedent: Annotation<Document | null>(),
  validation: Annotation<string>(),
  draftAnalysis: Annotation<string>(),
  ambiguityDetected: Annotation<boolean>(),
  userClarification: Annotation<string>(),
  finalAnalysis: Annotation<string>(),
});</code></pre><p>In the state object, we’ll pass the following through the nodes: the user’s query, the legal precedents retrieved, the user’s choice and validation, any ambiguity detected, and the clarification provided. The state also tracks the precedent selected by the user, the draft analysis generated along the way, and the final analysis once all clarifications are completed.</p><h3>Nodes</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f16148cd439491f8e6fd929d66c38fc1066a4693-221x159.png" style="width: 100%;" /><p><strong><code>searchPrecedents</code>: </strong>This node performs a similarity search in the Elasticsearch vector store based on the user’s input. It retrieves up to five matching documents and prints them so they can be reviewed by the user:</p><pre><code>async function searchPrecedents(state: typeof LegalResearchState.State) {
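  // Embeds the query and retrieves the five nearest documents from the vector store.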
  console.log(
    "📚 Searching for relevant legal precedents with query:\n",
    state.query
  );

  const results = await vectorStore.similaritySearch(state.query, 5);
  const precedents = results.map((d) => d as Document);

  console.log(`Found ${precedents.length} relevant precedents:\n`);

  for (let i = 0; i < precedents.length; i++) {
    const p = precedents[i];
    const m = p.metadata;
    console.log(
      `${i + 1}. ${m.title} (${m.caseId})\n` +
        `   Type: ${m.contractType}\n` +
        `   Outcome: ${m.outcome}\n` +
        `   Key reasoning: ${m.reasoning}\n` +
        `   Delay period: ${m.delayPeriod}\n`
    );
  }

  return { precedents };
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/72f7b09029662ae3935e17fde8f932e885c6efdd-273x203.png" style="width: 100%;" /><p><strong><code>precedentSelection</code>:</strong> This node allows the user to select, using natural language, the precedent retrieved by the similarity search that best matches the question. At this point, the application<strong> interrupts the workflow</strong> and waits for user input:</p><pre><code>function precedentSelection(state: typeof LegalResearchState.State) {
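  // interrupt() pauses the graph here; the checkpointer persists state until the
  // client resumes with a Command({ resume: ... }), as shown in the execution loop later.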
  console.log("\n⚖️  HITL #1: Human input needed\n");
  const result = interrupt({
    question: "👨‍⚖️  Which precedent is most similar to your case? ",
  });

  return { userChoice: result as string };
}</code></pre><p><strong><code>selectPrecedent</code>:</strong> This node sends the user input, along with the retrieved documents, to be interpreted so that one of them can be selected. The LLM performs this task by returning a number that represents the document it infers from the user’s natural-language input:</p><pre><code>async function selectPrecedent(state: typeof LegalResearchState.State) {
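  // Structured output constrains the model to return a 1-based index into the retrieved precedents.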
  const precedents = state.precedents || [];
  const userInput = state.userChoice || "";

  const precedentsList = precedents
    .map((p, i) => {
      const m = p.metadata;
      return `${i + 1}. ${m.caseId}: ${m.title} - ${m.outcome}`;
    })
    .join("\n");

  const structuredLlm = llm.withStructuredOutput({
    name: "precedent_selection",
    schema: {
      type: "object",
      properties: {
        selected_number: {
          type: "number",
          description:
            "The precedent number selected by the lawyer (1-based index)",
          minimum: 1,
          maximum: precedents.length,
        },
      },
      required: ["selected_number"],
    },
  });

  const prompt = `
    The lawyer said: "${userInput}"

    Available precedents:
    ${precedentsList}

    Which precedent number (1-${precedents.length}) matches their selection?
  `;

  const response = await structuredLlm.invoke([
    {
      role: "system",
      content:
        "You are an assistant that interprets lawyer's selection and returns the corresponding precedent number.",
    },
    { role: "user", content: prompt },
  ]);

  const selectedIndex = response.selected_number - 1;
  const selectedPrecedent = precedents[selectedIndex] || precedents[0];

  console.log(`✅ Selected: ${selectedPrecedent.metadata.title}\n`);
  return { selectedPrecedent };
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1864ff80880ba7d394c25880e40d7c7f2b4a4765-305x402.png" style="width: 100%;" /><p><strong><code>validatePrecedentSelection</code>: </strong>This node introduces an explicit validation step (yes/no) to confirm that the precedent inferred by the system is indeed the one the lawyer intends to use. This step acts as a safeguard against misinterpretation of the user’s intent, ensuring that downstream legal analysis is based on the correct case:</p><pre><code>function validatePrecedentSelection(state: typeof LegalResearchState.State) {
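  // HITL #1.5: a cheap yes/no confirmation that guards against misread intent.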
  const precedent = state.selectedPrecedent;
  if (!precedent) return {};

  const m = precedent.metadata;
  console.log("\n⚖️  HITL #1.5: Validation needed\n");
  console.log(
    `Selected precedent: ${m.title} (${m.caseId})\n` +
      `Type: ${m.contractType}\n` +
      `Outcome: ${m.outcome}\n`
  );

  const result = interrupt({
    question: "👨‍⚖️  Is this the correct precedent? (yes/no): ",
  });
  const validation =
    typeof result === "string" ? result : (result as any)?.value || "";

  return { validation };
}</code></pre><p><strong><code>processValidation</code>: </strong>This node processes the lawyer’s confirmation response (yes/no) from the previous HITL step and determines the next path in the workflow. It interprets the user input as an explicit approval or rejection of the selected precedent.</p><p>If the precedent isn’t confirmed, the node clears the current selection and redirects the graph back to the <code>precedentSelection</code> phase, allowing the user to choose again. If the precedent is confirmed, the workflow continues without modification:</p><pre><code>function processValidation(state: typeof LegalResearchState.State) {
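  // Normalize the yes/no answer; the conditional edge after this node applies the same check to route the graph.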
  const userInput = (state.validation || "").toLowerCase().trim();
  const isValid = userInput === "yes" || userInput === "y";

  if (!isValid) {
    console.log("❌ Precedent not confirmed. Returning to selection...\n");
    return { selectedPrecedent: null, userChoice: "" };
  }

  console.log("✅ Precedent confirmed.\n");
  return {};
}</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1c3b726d21b04cc916213da81607389a3b0304fc-305x127.png" style="width: 100%;" /><p><strong><code>createDraft</code>: </strong>This node generates the initial legal analysis based on the precedent selected by the user. It uses an LLM to evaluate how the chosen precedent applies to the lawyer’s question and determines whether the system has enough information to proceed.</p><p>If the precedent can be applied directly, the node produces a draft analysis and, taking the right path, jumps to the final node. If the LLM detects ambiguities, such as undefined contract terms, missing timeline details, or unclear conditions, it returns a flag indicating that clarification is required, along with a list of the specific pieces of information that must be provided. In that case, the ambiguity triggers the left path of the graph:</p><pre><code>async function createDraft(state: typeof LegalResearchState.State) {
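  // The structured schema forces an explicit needs_clarification flag, making ambiguity detection part of the draft itself.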
  console.log("📝 Drafting initial legal analysis...\n");

  const precedent = state.selectedPrecedent;
  if (!precedent) return { draftAnalysis: "" };

  const m = precedent.metadata;

  const structuredLlm = llm.withStructuredOutput({
    name: "draft_analysis",
    schema: {
      type: "object",
      properties: {
        needs_clarification: {
          type: "boolean",
          description:
            "Whether the analysis requires clarification about contract terms or context",
        },
        analysis_text: {
          type: "string",
          description: "The draft legal analysis or the ambiguity explanation",
        },
        missing_information: {
          type: "array",
          items: { type: "string" },
          description:
            "List of specific information needed if clarification is required (empty if no clarification needed)",
        },
      },
      required: ["needs_clarification", "analysis_text", "missing_information"],
    },
  });

  const prompt = `
    Based on this precedent:
    Case: ${m.title}
    Outcome: ${m.outcome}
    Reasoning: ${m.reasoning}
    Key terms: ${m.keyTerms}

    And the lawyer's question: "${state.query}"

    Draft a legal analysis applying this precedent to the question.
    
    If you need more context about the specific contract terms, timeline details, 
    or other critical information to provide accurate analysis, set needs_clarification 
    to true and list what information is missing.
    
    Otherwise, provide the legal analysis directly.
  `;

  const response = await structuredLlm.invoke([
    {
      role: "system",
      content:
        "You are a legal research assistant that analyzes cases and identifies when additional context is needed.",
    },
    { role: "user", content: prompt },
  ]);

  let displayText: string;
  if (response.needs_clarification) {
    const missingInfoList = response.missing_information
      .map((info: string, i: number) => `${i + 1}. ${info}`)
      .join("\n");
    displayText = `AMBIGUITY DETECTED:\n${response.analysis_text}\n\nMissing information:\n${missingInfoList}`;
  } else {
    displayText = `ANALYSIS:\n${response.analysis_text}`;
  }

  console.log(displayText + "\n");

  return {
    draftAnalysis: displayText,
    ambiguityDetected: response.needs_clarification,
  };
}</code></pre><p>The two paths that the graph can take look like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/653083d5a2885b15c250288b353f65a993943c0d-305x379.png" style="width: 100%;" /><p>The left path includes an additional node that handles the clarification.</p><p><strong><code>requestClarification</code>:</strong> This node triggers the second HITL step when the system identifies that the draft analysis lacks essential context. The workflow is interrupted, and the user is asked to clarify the missing contract details detected by the previous node:</p><pre><code>function requestClarification(state: typeof LegalResearchState.State) {
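  // HITL #2: pause again so the user can supply the contract details flagged by createDraft.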
  console.log("\n⚖️  HITL #2: Additional context needed\n");
  const userClarification = interrupt({
    question: "👨‍⚖️  Please provide clarification about your contract terms:",
  });
  return { userClarification };
}</code></pre><p>This intervention exemplifies the validation-driven nature of HITL: The system only pauses to request human input when it has identified specific gaps that could compromise the quality of the analysis. In production systems, this node could be enhanced with validation logic to verify that all required information points are addressed before proceeding, ensuring the analysis is built on complete and accurate context.</p><p><strong><code>generateFinalAnalysis</code>: </strong>This node produces the final legal analysis by combining the selected precedent with the additional context provided by the user, if needed. Using the clarification gathered in the previous HITL step, the LLM synthesizes the precedent’s reasoning, the contract details supplied by the user, and the conditions that determine whether a breach may have occurred.</p><p>The node outputs a complete analysis that integrates legal interpretation and practical recommendations:</p><pre><code>async function generateFinalAnalysis(state: typeof LegalResearchState.State) {
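  // Final synthesis of the precedent metadata, the original question, and any clarification provided.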
  console.log("📋 Generating final legal analysis...\n");

  const precedent = state.selectedPrecedent;
  if (!precedent) return { finalAnalysis: "" };

  const m = precedent.metadata;

  const prompt = `
    Original question: "${state.query}"
    
    Selected precedent: ${m.title}
    Outcome: ${m.outcome}
    Reasoning: ${m.reasoning}
    
    Lawyer's clarification: "${state.userClarification}"
    
    Provide a comprehensive legal analysis integrating:
    1. The selected precedent's reasoning
    2. The lawyer's specific contract context
    3. Conditions for breach vs. no breach
    4. Practical recommendations
  `;

  const response = await llm.invoke([
    {
      role: "system",
      content:
        "You are a legal research assistant providing comprehensive analysis.",
    },
    { role: "user", content: prompt },
  ]);

  const finalAnalysis = response.content as string;

  console.log(
    "\n" +
      "=".repeat(80) +
      "\n" +
      "⚖️  FINAL LEGAL ANALYSIS\n" +
      "=".repeat(80) +
      "\n\n" +
      finalAnalysis +
      "\n\n" +
      "=".repeat(80) +
      "\n"
  );

  return { finalAnalysis };
}</code></pre><p>Building the graph:</p><pre><code>const workflow = new StateGraph(LegalResearchState)
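  // Register all nodes first; the edges below define the control flow, including the reselection loop.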
  .addNode("searchPrecedents", searchPrecedents)
  .addNode("precedentSelection", precedentSelection)
  .addNode("selectPrecedent", selectPrecedent)
  .addNode("validatePrecedentSelection", validatePrecedentSelection)
  .addNode("processValidation", processValidation)
  .addNode("createDraft", createDraft)
  .addNode("requestClarification", requestClarification)
  .addNode("generateFinalAnalysis", generateFinalAnalysis)
  .addEdge("__start__", "searchPrecedents")
  .addEdge("searchPrecedents", "precedentSelection") // HITL #1
  .addEdge("precedentSelection", "selectPrecedent")
  .addEdge("selectPrecedent", "validatePrecedentSelection") // Selection validation
  .addEdge("validatePrecedentSelection", "processValidation")
  .addConditionalEdges(
    "processValidation",
    (state: typeof LegalResearchState.State) => {
      const userInput = (state.validation || "").toLowerCase().trim();
      const isValid = userInput === "yes" || userInput === "y";
      return isValid ? "validated" : "reselect";
    },
    {
      validated: "createDraft",
      reselect: "precedentSelection",
    }
  )
  .addConditionalEdges(
    "createDraft",
    (state: typeof LegalResearchState.State) => {
      // If ambiguity detected, request clarification (HITL #2)
      if (state.ambiguityDetected) return "needsClarification";
      // Otherwise, generate final analysis
      return "final";
    },
    {
      needsClarification: "requestClarification",
      final: "generateFinalAnalysis",
    }
  )
  .addEdge("requestClarification", "generateFinalAnalysis") // HITL #2
  .addEdge("generateFinalAnalysis", "__end__");</code></pre><p>In the graph, we can see the <code>createDraft</code> conditional edge, which defines the condition for choosing the “final” path. As shown, the decision now depends on whether the draft analysis detected ambiguity that requires additional clarification.</p><p>Put all together to be executed:</p><pre><code>await ingestData();

// Compile workflow
const app = workflow.compile({ checkpointer: new MemorySaver() });
const config = { configurable: { thread_id: "hitl-circular-thread" } };

await saveGraphImage(app);

// Execute workflow
const legalQuestion =
  "Does a pattern of repeated delays constitute breach even if each individual delay is minor?";

console.log(`⚖️  LEGAL QUESTION: "${legalQuestion}"\n`);

let currentState = await app.invoke({ query: legalQuestion }, config);

// Handle all interruptions in a loop
while ((currentState as any).__interrupt__?.length > 0) {
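  // __interrupt__ is present only while the graph is paused; each entry carries the payload passed to interrupt().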
  console.log("\n💭 APPLICATION PAUSED WAITING FOR USER INPUT...");

  const interruptQuestion = (currentState as any).__interrupt__[0]?.value
    ?.question;

  // Handling empty responses
  let userChoice = "";
  while (!userChoice.trim()) {
    userChoice = await getUserInput(interruptQuestion || "👤 YOUR CHOICE: ");
    if (!userChoice.trim()) {
      console.log("⚠️  Please provide a response.\n");
    }
  }

  currentState = await app.invoke(
    new Command({ resume: userChoice.trim() }),
    config
  );
}</code></pre><h3>Execute the script</h3><p>With all the code in place, let&#x27;s execute the <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/human-in-the-loop-with-langgraph-and-elasticsearch/main.ts">main.ts</a> file by running the following command in the terminal:</p><pre><code>tsx main.ts</code></pre><p>Once the script is executed, the question <strong>“Does a pattern of repeated delays constitute breach even if each individual delay is minor?”</strong> will be sent to Elasticsearch to perform a similarity search, and the results retrieved from the index will be displayed. The app detects that multiple relevant precedents match the query, so it pauses execution and asks the user to help disambiguate which legal precedent is most applicable:</p><pre><code>📚 Searching for relevant legal precedents with query:
 Does a pattern of repeated delays constitute breach even if each individual delay is minor?
Found 5 relevant precedents:

1. Case H: Pattern of Repeated Delays (CASE-H-2021)
   Type: ongoing service agreement
   Outcome: breach found
   Key reasoning: pattern demonstrated failure to perform, cumulative effect
   Delay period: multiple instances

2. Case E: Minor Delay Quality Maintained (CASE-E-2022)
   Type: service agreement
   Outcome: minor breach only
   Key reasoning: delay minimal, quality maintained, termination unjustified
   Delay period: five days

3. Case A: Delay Breach with Operational Impact (CASE-A-2023)
   Type: service agreement
   Outcome: breach found
   Key reasoning: delay affected operations and caused financial harm
   Delay period: two weeks

4. Case B: Delay Without Explicit Schedule (CASE-B-2022)
   Type: consulting agreement
   Outcome: no breach found
   Key reasoning: no explicit deadline defined, no demonstrable harm
   Delay period: three weeks

5. Case C: Justified Delay External Factors (CASE-C-2023)
   Type: construction service
   Outcome: no breach found
   Key reasoning: external factors beyond control, force majeure applied
   Delay period: one month

⚖️  HITL #1: Human input needed

💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Which precedent is most similar to your case? </code></pre><p>The interesting thing about this application is that we can use natural language to choose one option, letting the LLM interpret the user’s input to determine the correct choice. Let’s see what happens if we enter the text <strong>“Case H”</strong>:</p><pre><code>💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Which precedent is most similar to your case? Case H

✅ Selected: Case H: Pattern of Repeated Delays</code></pre><p>After case selection, the application requests a quick validation to confirm that the model selected the correct case. If you answer “no”, the application returns to the case selection step:</p><pre><code>Validation needed

Selected precedent: Case H: Pattern of Repeated Delays (CASE-H-2021)
Type: ongoing service agreement
Outcome: breach found


💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Is this the correct precedent? (yes/no):  yes</code></pre><p>Once confirmed, the application workflow continues drafting the analysis:</p><pre><code>✅ Precedent confirmed.

📝 Drafting initial legal analysis...

AMBIGUITY DETECTED:
Based on Case H, a pattern of repeated delays can indeed constitute a breach of contract, even if each individual delay is minor. The outcome in Case H indicates that the cumulative effect of these minor delays led to a significant failure to perform the contractual obligations adequately. The reasoning emphasizes that consistent performance is critical in fulfilling the terms of a contract. Therefore, if the repeated delays create a situation where the overall performance is hindered, this pattern could be interpreted as a breach. However, the interpretation may depend on the specific terms of the contract at issue, as well as the expectations of performance set forth in that contract.

Missing information:
1. Specific contract terms regarding performance timelines
2. Details on the individual delays (duration, frequency)
3. Context on consequences of delays stated in the contract
4. Other parties' expectations or agreements related to performance


⚖️  HITL #2: Additional context needed


💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Please provide clarification about your contract terms:</code></pre><p>The model takes the user’s case selection and integrates it into the workflow, proceeding with the final analysis once enough context has been provided. In this step, the system also makes use of the previously detected ambiguity: The draft analysis highlighted missing contractual details that could meaningfully affect the legal interpretation. These “missing information” items guide the model in determining what clarifications are essential to resolve uncertainty before producing a reliable final opinion.</p><p>The user must include the requested clarifications in the next input. Let&#x27;s try with <strong>&quot;Contract requires ‘prompt delivery’ without timelines. 8 delays of 2-4 days over 6 months. $50K in losses from 3 missed client deadlines. Vendor notified but pattern continued.&quot;</strong>:</p><pre><code>💭 APPLICATION PAUSED WAITING FOR USER INPUT...
👨‍⚖️  Please provide clarification about your contract terms: Contract requires "prompt delivery" without timelines. 8 delays of 2-4 days over 6 months. $50K in losses from 3 missed client deadlines. Vendor notified but pattern continued.

📋 Generating final legal analysis...

================================================================================
⚖️  FINAL LEGAL ANALYSIS
================================================================================

To analyze the question of whether a pattern of repeated minor delays constitutes a breach of contract, we need to combine insights from the selected precedent, the specifics of the lawyer's contract situation, conditions that typically govern breach versus non-breach, and practical recommendations for the lawyer moving forward.

### 1. Selected Precedent's Reasoning

The precedent case, referred to as Case H, found that a pattern of repeated delays amounted to a breach of contract. The court reasoned that even minor individual delays, when considered cumulatively, demonstrated a failure to perform as stipulated in the contract. The underlying rationale was that the cumulative effect of these minor delays could significantly undermine the purpose of the contract, which typically aims for timely performance and reliable delivery.

### 2. Lawyer's Specific Contract Context

In the lawyer's situation, the contract specified "prompt delivery" but did not provide a strict timeline. The vendor experienced 8 delays ranging from 2 to 4 days over a period of 6 months. These delays culminated in $50,000 in losses due to three missed client deadlines. The vendor was notified regarding these delays; however, the pattern of delays persisted.

Key considerations include:
- **Nature of the Obligations**: While “prompt delivery” does not define a strict timeline, it does imply an expectation for timely performance.
- **Material Impact**: The missed client deadlines indicate that these delays had a material adverse effect on the lawyer's ability to fulfill contractual obligations to third parties, likely triggering damages.

### 3. Conditions for Breach vs. No Breach

**Conditions for Breach**:
- **Pattern and Cumulative Effect**: Similar to the reasoning in Case H, evidence of a habitual pattern of delays can amount to a breach. Even if individual delays are minor, when combined, they may show a lack of diligence or reliability by the vendor.
- **Materiality**: The impact of these delays is crucial. If the cumulative delays adversely affect the contract's purpose or cause significant losses, this reinforces the case for a breach.
- **Notification and Opportunity to Cure**: The fact that the vendor was notified of the delays and failed to rectify the behavior can often be interpreted as a further indication of breach.

**Conditions for No Breach**:
- **Non-Material Delays**: If the delays did not affect the overall contractual performance or client obligations, this may lessen the likelihood of establishing a breach. However, given the risks and losses involved, this seems less relevant in this scenario.
- **Force Majeure or Justifiable Delays**: If the vendor could show that these delays were due to justify circumstances not within their control, it may potentially provide a defense against breach claims.

### 4. Practical Recommendations

1. **Assess Damages**: Document the exact nature of the financial losses incurred due to the missed deadlines to substantiate claims of damages.
  
2. **Gather Evidence**: Collect all communication regarding the delays, including any notifications sent to the vendor about the issues.

3. **Consider Breach of Contract Action**: Based on the precedent and accumulated delays, consider formalized communication to the vendor regarding a breach of contract claim, highlighting both the pattern and the impact of these repeated delays.

4. **Evaluate Remedies**: Depending upon the contract specifics, the lawyer may wish to pursue several remedies, including:
   - **Compensatory Damages**: For the financial losses due to missed deadlines.
   - **Specific Performance**: If timely delivery is critical and can still be enforced.
   - **Contract Termination**: Depending on the severity, terminating the contract and seeking replacements may be warranted.

5. **Negotiate Terms**: If continuing to work with the current vendor is strategic, the lawyer should consider renegotiating terms for performance guarantees or penalties for further delays.

6. **Future Contracts**: In future contracts, consider including explicit timelines and conditions for prompt delivery, as well as specified damages for delays to better safeguard against this issue.

By integrating the legal principles from the precedent with the specific context and conditions outlined, the lawyer can formulate a solid plan to address the repeated delays by the vendor effectively.</code></pre><p>This output shows the final stage of the workflow, where the model integrates the selected precedent (Case H) and the lawyer’s clarifications to generate a complete legal analysis. The system explains why the pattern of delays likely constitutes a breach, outlines the factors that support this interpretation, and provides practical recommendations. Overall, the output demonstrates how the HITL clarifications resolve ambiguity and allow the model to produce a well-founded, context-specific legal opinion.</p><h3>Other real-world scenarios</h3><p>This pattern of combining Elasticsearch, LangGraph, and HITL can be useful in other kinds of apps, including:</p><ul><li><strong>Reviewing tool calls before their execution</strong>; for example, in financial trading, a human approves buy/sell orders before they’re placed.</li><li><strong>Providing additional parameters when needed</strong>; for example, in customer support triage, a human agent selects the correct issue category when the AI finds multiple possible interpretations of the customer’s problem.</li></ul><p>And there are plenty more use cases to discover where HITL will be a game changer.</p><h2>Conclusion</h2><p>With LangGraph and Elasticsearch, we can build agents that make their own decisions and act as linear workflows or follow conditional paths based on context. With human in the loop, the agents can involve the actual user in the decision-making process to fill contextual gaps and request confirmations in systems where fault tolerance is critical.</p><p>The key advantage of this approach is that you can filter a large dataset using Elasticsearch capabilities and then use an LLM, together with the user, to select a single document based on the user’s intent; that is, HITL. The LLM complements Elasticsearch by handling the dynamics of how users express their intent.</p><p>This approach keeps the system fast and token-efficient, as we’re only sending the LLM what’s needed to make the final decision and not the whole dataset. And at the same time, this keeps it very precise at detecting user intent and iterating until the desired option is picked.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Apache Lucene 2025 wrap-up]]></title>
            <description><![CDATA[2025 was a stellar year for Apache Lucene; here are our highlights.]]></description>
            <link>https://www.elastic.co/search-labs/blog/apache-lucene-wrapped-2025</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/apache-lucene-wrapped-2025</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Benjamin Trent, Chris Hegarty]]></dc:creator>
            <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bda54d0af6f391ea71aae7936bc02d53fa47fefa-1600x901.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>It’s hard to believe that Apache Lucene has been around for over a quarter of a century! Yes, more than 25 years of search enabled through Apache Lucene.</p><h2>Community by the numbers</h2><p>For contributions and the community as a whole, 2025 was another strong year, with 1,756 commits and 1,080 pull requests from 134 unique contributors. The community continued to grow this year, increasing the number of contributors by 98 from the previous year. The project management committee (PMC) and committer cohort also increased. Apache Lucene added nine new committers in 2025, one of whom was Elastic’s own Simon Cooper. (Congratulations, <a href="https://github.com/thecoop">thecoop</a>!) The PMC also welcomed two new members. Our little community continues to grow.</p><h2>Our highlights</h2><p>With almost 2,000 commits and eight releases, it’s difficult to summarize all the things that we loved from the year in Apache Lucene. But, not hiding from a challenge, here are some of our highlights.</p><h2>The year of faster queries</h2><p>In many ways, 2025 was the year Apache Lucene embraced better auto-vectorization, manual <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a> optimizations, and overall making things go faster. Lucene community member <a href="https://jpountz.github.io/2025/08/28/compiled-vs-vectorized-search-engine-edition.html">Adrien Grand goes into great detail in this blog</a>, and below are links and summaries for the biggest jumps. As always, a huge shout-out to <a href="https://github.com/mikemccand">Mike McCandless</a> for maintaining the <a href="https://benchmarks.mikemccandless.com/AndHighHigh.html">Apache Lucene Benchmarks</a> for years.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bff3742a2d3fd0e61a75785a4f6c79e1a3dcd94d-1600x613.png" style="width: 100%;" /><ul><li><code>[IN]</code> <a href="https://github.com/apache/lucene/pull/14701">Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion </a></li><li><code>[IS]</code> Irrelevant hardware update, so it’s just noise here</li><li><code>[IY]</code> <a href="https://github.com/apache/lucene/pull/14906">Use branchless way to speedup filterCompetitiveHits</a></li><li><code>[IZ]</code> <a href="https://github.com/apache/lucene/pull/14935">Improve collecting docs stored as bitsets</a></li><li><code>[JA]</code> <a href="https://github.com/apache/lucene/pull/14896">Vectorize filterCompetitiveHits by hand with the Java Panama API</a></li><li><code>[JK]</code> <a href="https://github.com/apache/lucene/pull/15160">Increase the document block size to 256</a></li></ul><p>Discounting the hardware change <code>[IS]</code>, this is more than a 70% increase in query speed in 2025, going from &lt;100 queries per second (qps) to &gt;170 qps.</p><h2>Vector search</h2><p>There were a number of vector search improvements in 2025. Three to highlight are improving filtered <a href="https://github.com/apache/lucene/pull/14160">vector search with ACORN</a>, <a href="https://github.com/apache/lucene/pull/14226">adding optimistic multisegment search</a>, and <a href="https://github.com/apache/lucene/pull/14978">bulk scoring of vectors</a>.</p><p><a href="https://arxiv.org/pdf/2403.04871">ACORN-1</a> is an interesting algorithm for graph-based vector indices. It has the significant benefit of being filter- and algorithm-agnostic. 
Since Apache Lucene uses hierarchical navigable small world (HNSW) for its indexing and users generally want to just filter on anything and everything without additional configuration, it’s a perfect fit. A community <a href="https://github.com/apache/lucene/pull/14085">member originally investigated</a> adding this algorithm. (He has since been hired by Elastic. Hi, <a href="https://github.com/benchaplin">Ben!</a> 👋👋👋) A nice balance was found for Lucene, providing faster filtered vector search without requiring significant configuration work by the user or any additional indexing of information.</p><p>The addition of <a href="https://github.com/apache/lucene/pull/14226">optimistic multisegment vector search</a> to Apache Lucene really highlights how the community comes together to make it all happen. Three different individuals combined powers to debug, benchmark, design, and iterate on this solution. Originally proposed by <a href="https://github.com/msokolov">Michael Sokolov</a> (a vector search superstar in the Lucene community), it immediately caught my attention, too, because it claimed to fix our <a href="https://github.com/apache/lucene/issues/14180">weird concurrency consistency bug</a> without sacrificing performance. With some iteration and benchmarking by fellow community member <a href="https://github.com/dungba88">Dzung Bui</a>, we were able to strike the right balance in speed and recall, improving performance, making multithreaded searches consistent, and implementing a pretty neat algorithm.</p><p>Bulk scoring was born out of a collaboration between community member <a href="https://github.com/mccullocht">Trevor McCulloch</a> and our very own <a href="https://github.com/ChrisHegarty">Chris Hegarty</a> and was introduced as a new scoring interface in <a href="https://github.com/apache/lucene/pull/14978">PR #14978</a>, with an initial float32 implementation following in <a href="https://github.com/apache/lucene/pull/14980">PR #14980</a>. Modern vector search involves thousands or even millions of comparisons as a query vector is evaluated against vectors in the index, often by traversing a graph of nearest neighbors. Traditionally, these comparisons are performed one vector at a time. Bulk scoring inverts that model by passing a batch of vectors, represented as ordinals in the index, to the scorer in a single call. This allows the scorer to prefetch and pipeline work across vectors, amortizing cache misses and reducing per-vector overhead. An exciting aspect of the original design discussion was the possibility of bulk scorers implemented in Rust and C. While Lucene itself remains a Java library, this opens the door to highly optimized, SIMD-friendly, native implementations.</p><p>We didn’t even get to talk about several other improvements that landed in this area, including HNSW optimizations, like more compact <a href="https://github.com/apache/lucene/pull/14932">GroupVarInt graph</a> encoding, bypassing graph construction for <a href="https://github.com/apache/lucene/pull/14963">tiny segments</a>, and continued reductions in memory footprint. On the operations side, Lucene now <a href="https://github.com/apache/lucene/pull/14426">exposes off-heap memory requirements</a>, making it easier to understand and debug native memory usage. While these changes are small(ish) individually, together they help make Lucene’s vector search faster, leaner, and easier to operate in production.</p><h2>Bonus</h2><p>The last highlight is a little out of place. 
It&#x27;s a particularly <a href="https://github.com/apache/lucene/pull/15380">frustrating but satisfying bug fix</a>. I won’t go into deep details here, as it touches on how Lucene does max scoring and bulk scoring, applies filters, and handles all its internal iterator state. This deserves its own blog post. In short, we ran into this bug in production in late September 2025. It was surfacing as an <code>EndOfFileException</code> during a specific query execution. And then, like all fun bugs, it took a week or two of work to reproduce and fully debug. Finally, once we knew exactly what caused the exception to be thrown, we had to dig into why before we could fix it. All in all, a month’s worth of work summarized in one line of code. Bask in its glory:</p><pre><code>- top.doc = top.approximation.advance(filter.doc);
+ // Must use the iterator as `top` might be a two-phase iterator
+ top.doc = top.iterator.advance(filter.doc);</code></pre><h2>Goodbye 2025, hello 2026</h2><p>A big thank you to everyone in the Apache Lucene community who worked tirelessly to improve this venerable search library. We ❤️ you.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Discord and Elastic Agent Builder A2A to build a gaming community support bot]]></title>
            <description><![CDATA[Learn how to connect Discord to Elastic Agent Builder's Agent-to-Agent (A2A) server to create a gaming community support bot.]]></description>
            <link>https://www.elastic.co/search-labs/blog/discord-elastic-agent-builder-a2a</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/discord-elastic-agent-builder-a2a</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e90b56e9bcf0ab0b7b906fb6ec2413fdc4e6a00-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we&#x27;ll build a gaming community support bot that connects Discord to Elastic Agent Builder&#x27;s <a href="https://www.elastic.co/docs/solutions/search/agent-builder/a2a-server">Agent-to-Agent (A2A) server</a>. Players will ask questions like, &quot;Who&#x27;s the best Mage?&quot; or &quot;What&#x27;s the current meta?&quot; and get real-time answers powered by Elasticsearch Query Language (ES|QL) analytics and semantic search.</p><p>Many gaming companies already use Elasticsearch for player data and game analytics. In this tutorial, we&#x27;ll access that data through Discord, where many users spend their time.</p><h2>Prerequisites</h2><ul><li>Elasticsearch cluster (9.2 or higher)</li><li>Python 3.9+</li><li>Discord <a href="https://discord.com/developers/docs/quick-start/getting-started#step-1-creating-an-app">bot token</a> and <a href="https://support.discord.com/hc/en-us/articles/204849977-How-do-I-create-a-server">server created</a></li></ul><h2>What we&#x27;re building</h2><p>We&#x27;ll create a Discord bot that uses Elastic Agent Builder&#x27;s A2A server to answer player questions. The bot will have two types of tools:</p><ul><li><strong>ES|QL tools</strong> for analytics: leaderboards, hero stats, meta reports.</li><li><strong>Index search tools</strong> for knowledge: game mechanics, FAQs.</li></ul><p>The architecture looks like this:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4e90b56e9bcf0ab0b7b906fb6ec2413fdc4e6a00-1280x720.png" style="width: 100%;" /><h2>Architecture overview</h2><p><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a> provides an <a href="https://www.elastic.co/docs/solutions/search/agent-builder/a2a-server">A2A server</a> to connect with clients. This server exposes the agent with tools that can query Elasticsearch using <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/esql-tools">ES|QL</a> or <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/index-search-tools">index search</a> tools. Any A2A-compatible client can connect to it.</p><p>The A2A client instantiates a bot that can connect to the Discord server, communicate with Agent Builder, and receive/send messages to the users.</p><h2>Why A2A instead of MCP?</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d723675b2b0bfebfae69548157420115b215572e-1600x1138.png" style="width: 100%;" /><p>Agent Builder also offers the option to expose the agent&#x27;s tools via a Model Context Protocol (MCP) server. The key difference is that with MCP, the client can only access the tools and their descriptions, not the agent itself. Additionally, tool selection logic must be implemented on the MCP client side, since all tools from all agents are exposed together.</p><p>With A2A, the entire agent, including instructions and tools, is available to the client. This gives you more control from the Agent Builder side, letting you manage behavior centrally rather than in each client.</p><p>The right choice depends on where you want that control to live. For this Discord bot, we want to manage the agent&#x27;s behavior from Elastic, so A2A is the better fit.</p><h2>Setting up sample game data</h2><p>Let&#x27;s create gaming data that our bot can query. 
We&#x27;ll set up three indices:</p><ul><li><code>player_stats</code>: Player profiles with wins, kills, rank.</li><li><code>hero_meta</code>: Hero pick rates and win rates by tier.</li><li><code>game_knowledge</code>: FAQs and game mechanics. Uses <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> for meaning-based matching rather than plain keyword matching. Both <strong>title</strong> and <strong>content</strong> are <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to">copied into</a> <code>semantic_field</code> for a hybrid search solution.</li></ul><h2>Create the indices</h2><pre><code>from elasticsearch import Elasticsearch
import os

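# Client authenticated with an API key; assumes ELASTICSEARCH_URL and ELASTIC_API_KEY are set in the environment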
es = Elasticsearch(
    hosts=[os.getenv("ELASTICSEARCH_URL")],
    api_key=os.environ["ELASTIC_API_KEY"]
)

# Player stats index
es.indices.create(
    index="player_stats",
    mappings={
        "properties": {
            "player_id": {"type": "keyword"},
            "username": {"type": "keyword"},
            "hero": {"type": "keyword"},
            "wins": {"type": "integer"},
            "losses": {"type": "integer"},
            "kills": {"type": "integer"},
            "deaths": {"type": "integer"},
            "rank": {"type": "keyword"},
            "last_played": {"type": "date"}
        }
    }
)

# Hero meta index
es.indices.create(
    index="hero_meta",
    mappings={
        "properties": {
            "hero_name": {"type": "keyword"},
            "pick_rate": {"type": "float"},
            "win_rate": {"type": "float"},
            "tier": {"type": "keyword"},
            "patch_version": {"type": "keyword"}
        }
    }
)

# Game knowledge index (for semantic search)
es.indices.create(
    index="game_knowledge",
    mappings={
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "category": {"type": "keyword"},
            "semantic_field": {"type": "semantic_text"} # Semantic search queries this combined field
        }
    }
)</code></pre><h3>Index sample data</h3><pre><code>from datetime import datetime, timedelta
from elasticsearch.helpers import bulk
import random

# Sample player data
players = [
    {"player_id": "p001", "username": "DragonSlayer99", "hero": "Warrior", "wins": 342, "losses": 198, "kills": 4521, "deaths": 2103, "rank": "Diamond"},
    {"player_id": "p002", "username": "ShadowMage", "hero": "Mage", "wins": 567, "losses": 234, "kills": 8932, "deaths": 3421, "rank": "Master"},
    {"player_id": "p003", "username": "HealBot3000", "hero": "Healer", "wins": 423, "losses": 187, "kills": 1234, "deaths": 1876, "rank": "Diamond"},
    {"player_id": "p004", "username": "TankMaster", "hero": "Tank", "wins": 298, "losses": 302, "kills": 2341, "deaths": 1543, "rank": "Platinum"},
    {"player_id": "p005", "username": "AssassinX", "hero": "Assassin", "wins": 789, "losses": 156, "kills": 12453, "deaths": 2987, "rank": "Grandmaster"},
]

for player in players:
    player["last_played"] = datetime.now() - timedelta(hours=random.randint(1, 72))

# Hero meta data
heroes = [
    {"hero_name": "Warrior", "pick_rate": 15.2, "win_rate": 51.3, "tier": "A", "patch_version": "2.4.1"},
    {"hero_name": "Mage", "pick_rate": 22.8, "win_rate": 54.7, "tier": "S", "patch_version": "2.4.1"},
    {"hero_name": "Healer", "pick_rate": 18.5, "win_rate": 52.1, "tier": "A", "patch_version": "2.4.1"},
    {"hero_name": "Tank", "pick_rate": 12.3, "win_rate": 48.9, "tier": "B", "patch_version": "2.4.1"},
    {"hero_name": "Assassin", "pick_rate": 31.2, "win_rate": 49.2, "tier": "A", "patch_version": "2.4.1"},
]

# Game knowledge for semantic search
knowledge = [
    {"title": "How to unlock the Dragon Mount", "content": "Complete the Dragon's Lair dungeon on Nightmare difficulty with all party members alive. The mount has a 15% drop rate.", "category": "mounts"},
    {"title": "Best Mage build for Season 4", "content": "Focus on Intelligence and Critical Chance. Use the Arcane Staff with Frost Runes. Prioritize cooldown reduction for burst damage.", "category": "builds"},
    {"title": "Understanding the ranking system", "content": "Ranks go from Bronze to Grandmaster. You need 100 points to advance. Wins give 25 points, losses subtract 20.", "category": "ranked"},
]

# Bulk index all data
actions = []
for player in players:
    actions.append({"_index": "player_stats", "_source": player})
for hero in heroes:
    actions.append({"_index": "hero_meta", "_source": hero})
for doc in knowledge:
    actions.append({"_index": "game_knowledge", "_source": doc})

success, errors = bulk(es, actions)
print(f"Indexed {success} documents")

es.indices.refresh(index="player_stats,hero_meta,game_knowledge")</code></pre><p>Done! Now we have data to query.</p><h2>Creating ES|QL tools via API</h2><p>Instead of using the UI, we&#x27;ll create tools programmatically using the <a href="https://www.elastic.co/docs/api/doc/kibana/group/endpoint-agent-builder">Agent Builder API</a>. This makes it easier to version control and deploy.</p><p>First, let&#x27;s set up our <a href="https://www.elastic.co/docs/api/doc/kibana/">Kibana API</a> connection:</p><pre><code>import requests

KIBANA_URL = os.environ["KIBANA_URL"]  # e.g., https://your-deployment.kb.us-central1.gcp.cloud.es.io
KIBANA_API_KEY = os.environ["KIBANA_API_KEY"]

headers = {
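    # The kbn-xsrf header is required for Kibana API write requests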
    "kbn-xsrf": "true",
    "Authorization": f"ApiKey {KIBANA_API_KEY}",
    "Content-Type": "application/json"
}</code></pre><h3>Tool 1: Leaderboard</h3><p>This tool shows top players by kills. We use ES|QL for aggregations:</p><pre><code>leaderboard_tool = {
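    # ES|QL tool: the agent matches this description against the user's question, then runs the fixed query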
    "id": "leaderboard",
    "type": "esql",
    "description": "Shows top players ranked by kills. Use when someone asks Who is the best? or Show me top players.",
    "configuration": {
        "query": """FROM player_stats
| STATS total_kills = SUM(kills), total_wins = SUM(wins) BY username, hero, rank
| SORT total_kills DESC
| LIMIT 10""",
        "params": {}
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=leaderboard_tool
)
print(f"Leaderboard tool: {response.status_code}")</code></pre><h3>Tool 2: Hero stats</h3><p>The hero stats tool returns performance stats for a specific hero using a dynamic parameter:</p><pre><code>hero_stats_tool = {
    "id": "hero_stats",
    "type": "esql",
    "description": "Gets win rate, pick rate, and tier for a specific hero. Use when someone asks How good is Mage? or What is the win rate for Warrior?",
    "configuration": {
        "query": """FROM hero_meta
| WHERE hero_name == ?hero
| KEEP hero_name, win_rate, pick_rate, tier, patch_version""",
        "params": {
            "hero": {
                "type": "keyword",
                "description": "The hero name to look up"
            }
        }
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=hero_stats_tool
)
print(f"Hero stats tool: {response.status_code}")</code></pre><p><code>?hero</code> is a dynamic parameter. When the agent calls this tool, it fills in the hero name from the user&#x27;s question, bridging natural language, “Is Mage strong?” to structured queries, like <code>| WHERE hero_name == “Mage”</code>.</p><h3>Tool 3: Meta report</h3><p>This tool shows which heroes are dominating the current patch:</p><pre><code>meta_report_tool = {
    "id": "meta_report",
    "type": "esql",
    "description": "Shows all heroes sorted by tier and win rate. Use when someone asks What is the current meta? or Which heroes are S-tier?",
    "configuration": {
        "query": """FROM hero_meta
| SORT tier ASC, win_rate DESC
| KEEP hero_name, tier, win_rate, pick_rate""",
        "params": {}
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=meta_report_tool
)
print(f"Meta report tool: {response.status_code}")</code></pre><h2>Adding index search tools</h2><p>For unstructured content, like FAQs and game mechanics, we use <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools/index-search-tools">index search</a> tools. The <code>pattern</code> parameter specifies which Elasticsearch index to search:</p><pre><code>game_knowledge_tool = {
    "id": "game_knowledge",
    "type": "index_search",
    "description": "Searches game guides, FAQs, and mechanics. Use when someone asks How do I...? or What is...? questions about game content.",
    "configuration": {
        "pattern": "game_knowledge"
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/tools",
    headers=headers,
    json=game_knowledge_tool
)
print(f"Game knowledge tool: {response.status_code}")</code></pre><p>Keep in mind that ES|QL tools are for structured analytics (&quot;Who has the most kills?&quot;), while index search tools are for unstructured knowledge (&quot;How do I unlock the dragon mount?&quot;).</p><h2>Creating the agent</h2><p>Let’s create an agent that uses these tools:</p><pre><code>agent = {
    "id": "gaming_support_bot",
    "name": "Gaming Support Bot",
    "description": "A gaming community support bot that answers player questions about stats, heroes, and game mechanics.",
    "configuration": {
        "tools": [{"tool_ids": ["leaderboard", "hero_stats", "meta_report", "game_knowledge"]}],
        "instructions": """You are a helpful gaming community bot. Answer player questions about:
- Player stats and leaderboards (use leaderboard tool)
- Hero performance and meta (use hero_stats and meta_report tools)
- Game mechanics and guides (use game_knowledge tool)

Be concise and friendly. Format leaderboards clearly with rankings."""
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers=headers,
    json=agent
)
print(f"Agent created: {response.status_code}")</code></pre><p>The agent now has access to all our tools and can choose one on the user&#x27;s question.</p><p>You can test the connectivity by visiting the Kibana <a href="https://www.elastic.co/docs/solutions/search/agent-builder/chat#agent-chat-gui">Agent Chat GUI</a> or by sending an API call. By default, Agent Builder uses the <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a>, so no connector configuration is required:</p><pre><code>test_message = "Show me all heroes sorted by tier"

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/converse",
    headers=headers,
    json={
        "agent_id": "gaming_support_bot",
        "input": test_message
    },
    timeout=60
)

print(f"Status: {response.status_code}")
if response.status_code == 200:
    result = response.json()
    print(f"\nAgent used tools: {[step.get('tool_id') for step in result.get('steps', []) if step.get('type') == 'tool_call']}")
    print(f"\nResponse:\n{result.get('response', {}).get('message', 'No message')}")
else:
    print(f"Error: {response.text}")</code></pre><h3>Setting up a Discord bot</h3><p>If you don&#x27;t have a Discord bot yet, you&#x27;ll need to create one following these steps:</p><p>1. Go to the <a href="https://discord.com/developers/applications">Discord Developer Portal</a>.</p><p>2. Click <strong>New Application</strong>, and give it a name.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7cf03a67865f62744e0b2f971531726b5b53d9-1600x1039.jpg" style="width: 100%;" /><p>3. Go to the <strong>Bot</strong> section, and click <strong>Add Bot</strong>.</p><p>4. Copy the bot token. (You&#x27;ll need this later.)</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9a7cf03a67865f62744e0b2f971531726b5b53d9-1600x1039.jpg" style="width: 100%;" /><p>5. Under <strong>Privileged Gateway Intents</strong>, enable <strong>Message Content Intent</strong>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9c8edbd72d02d993efef8cf7bceb0a0e20a05da4-1502x1538.png" style="width: 100%;" /><p>6. Go to <strong>OAuth2 &gt; URL Generator</strong>, select <strong>bot</strong> and <strong>application.commands</strong> scopes and <strong>Send Messages</strong> + <strong>Read Message History</strong> permissions.</p><p>7. Use the generated URL to invite the bot to your server.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/572077eef6ee7cde811c55795c245c6b42a95369-1600x1039.jpg" style="width: 100%;" /><h3><strong>Security considerations</strong></h3><p>On the <strong>Discord</strong> side, request only the minimum permissions:</p><ul><li>Send Messages.</li><li>Send Messages in Threads.</li><li>Read Message History.</li><li>Create Polls.</li></ul><p>This limits what the bot can do if compromised.</p><p>On the <strong>Elastic </strong>side, create an API key with restricted privileges. For this bot, you only need read access to the game indices:</p><pre><code>POST /_security/api_key
{
  "name": "gaming-bot-key",
  "role_descriptors": {
    "gaming_bot_role": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["player_stats", "hero_meta", "game_knowledge"],
          "privileges": ["read"]
        }
      ]
    }
  }
}</code></pre><h2>Configuring the Discord A2A client</h2><p>The next step is to connect Discord to Agent Builder. We&#x27;ll use a <a href="https://github.com/llermaly/agentbuilder-a2a-discord">modified version</a> of the <a href="https://github.com/lekkerelou/kagent-a2a-discord"><code>kagent-a2a-discord</code></a> client.</p><h3>Clone and setup</h3><pre><code>git clone https://github.com/llermaly/agentbuilder-a2a-discord
cd agentbuilder-a2a-discord</code></pre><h3>Environment variables</h3><p>Create an <code>.env</code> file in the root directory:</p><pre><code>DISCORD_BOT_TOKEN=your_bot_token
AGENT_BUILDER_URL=https://&lt;kibana_url&gt;/api/agent_builder/a2a/gaming_support_bot
A2A_API_KEY=your_api_key</code></pre><h3>Run with Python</h3><pre><code># Create virtual environment
uv venv

# Install dependencies
uv sync

# Run the bot 
uv run main.py</code></pre><p>The bot should come online in your Discord server within a few seconds.<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/91b6644120bd6750322de433271b4cb09282ebb8-921x163.png" style="width: 100%;" /><h2>Testing the bot</h2><p>Let&#x27;s test with different query types.</p><h3>Analytics queries (ES|QL)</h3><p><strong>User:</strong> &quot;Who is the best Mage player?&quot;</p><p><strong>Bot:</strong> Based on the leaderboard data, ShadowMage is currently the best Mage player in the game. They&#x27;ve reached Master rank with an impressive 8,932 kills and 567 wins…</p><p><strong>User:</strong> &quot;What&#x27;s the current meta?&quot;</p><p><strong>Bot:</strong> The current meta is dominated by Mage, which stands alone in S-tier with an impressive 54.7% win rate and 22.8% pick rate in patch 2.4.1.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/a014f6baa8a0fb39de91c3cf09dc49a9481db833-1600x677.jpg" style="width: 100%;" /><h3>Semantic queries (index search)</h3><p><strong>User:</strong> &quot;How do I unlock the dragon mount?&quot;</p><p><strong>Bot:</strong> To unlock the Dragon Mount, you need to:</p><ol><li>Complete the Dragon&#x27;s Lair dungeon on Nightmare difficulty.</li><li>Keep all party members alive throughout the entire run.</li><li>Cross your fingers for luck - the mount has a 15% drop rate…</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/296f781c20a724c813d8ca70319b12769bdea130-1600x416.jpg" style="width: 100%;" /><p>The bot automatically picks the right tool based on the question.</p><h2>Bidirectional: Giving the agent actions</h2><p>Beyond answering questions, we can give Agent Builder the ability to trigger Discord actions. With a <a href="https://github.com/llermaly/agentbuilder-a2a-discord/commit/da66c4642dababc06318f3a62e464c200ce3aec4">small modification</a> to the Discord client, we can parse special tags in the agent&#x27;s response and execute Discord commands.</p><p>For example, we added support for a <code>&lt;poll&gt;</code> tag:</p><pre><code>&lt;poll&gt;Should Mage be nerfed?|Yes, too strong|No, it's balanced|Need more data&lt;/poll&gt;</code></pre><p>When the agent includes this in its response, the bot creates a native Discord poll. The agent just needs instructions on when to use it. Run the following to update the agent&#x27;s instructions:</p><pre><code>agent = {
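    # Same agent definition as before; only the instructions change, adding
    # the &lt;poll&gt; tag convention that the Discord client parses.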
    "id": "gaming_support_bot",
    "name": "Gaming Support Bot",
    "description": "A gaming community support bot that answers player questions about stats, heroes, and game mechanics.",
    "configuration": {
        "tools": [{"tool_ids": ["leaderboard", "hero_stats", "meta_report", "game_knowledge"]}],
        "instructions": """You are a helpful gaming community bot. Answer player questions about:
- Player stats and leaderboards (use leaderboard tool)
- Hero performance and meta (use hero_stats and meta_report tools)
- Game mechanics and guides (use game_knowledge tool)

When discussing balance topics, create a poll for community input.
Use: &lt;poll&gt;Question|Option1|Option2|Option3&lt;/poll&gt;

Be concise and friendly. Format leaderboards clearly with rankings."""
    }
}
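# Update the existing agent in place. Note that the PUT goes to the
# agent-specific endpoint, not the collection endpoint used to create it.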

response = requests.put(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers=headers,
    json=agent
)
print(f"Agent created: {response.status_code}")</code></pre><p>Now when a user asks &quot;Is Mage overpowered?&quot; the agent can share the stats and create a poll, turning data insights into community engagement.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c892045115be7484677f534098ef75df30bdc808-1600x1081.jpg" style="width: 100%;" /><p>The same pattern works for other Discord features, like reactions, role mentions, or scheduled announcements. The agent&#x27;s response becomes a control channel for Discord actions.</p><h2>Conclusion</h2><p>In this article, we successfully built a Discord bot using Elastic Agent Builder&#x27;s A2A server. The process covered creating ES|QL tools via API for analytics, such as leaderboards, hero stats, and meta reports, and developing index search tools for semantic search against a knowledge base. Furthermore, we demonstrated the use of dynamic parameters, like <code>?hero</code> for flexible queries, established the connection between Discord and the A2A client, and tested both analytics and semantic query types.</p><p>The Agent Builder provides the A2A server, so you only need to focus on creating tools and connecting clients, while ES|QL handles structured analytics and index search handles unstructured knowledge.</p><h2>Resources</h2><ul><li><a href="https://github.com/elastic/elasticsearch-labs/tree/main/supporting-blog-content/discord-elastic-agent-builder-gaming-bot">Try the notebook</a></li><li><a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-post-agent-builder-tools">Agent Builder API docs</a></li><li><a href="https://github.com/llermaly/agentbuilder-a2a-discord">Discord Client</a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[All about those chunks, ’bout those chunks, and snippets!]]></title>
            <description><![CDATA[Exploring chunking and snippet extraction for LLMs, highlighting enhancements for identifying the most relevant chunks and snippets to send to models such as rerankers and LLMs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/llm-chunking-snippet-extraction</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/llm-chunking-snippet-extraction</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kathleen DeRusso]]></dc:creator>
            <pubDate>Mon, 26 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/daa2741ce2fd31137fc0b0eb4320f4ee9db0c0e7-1536x1024.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>These days, if you’ve performed semantic searches or context engineering, you’ve probably worked a lot with chunks. If you’re not familiar with chunks, a <em>chunk </em>is a small, meaningful piece of content extracted from a larger document. This <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">blog</a> provides a great foundational overview of chunking, why it’s important, and various chunking strategies.</p><p>For this blog, we want to focus on one specific problem in the topic of chunking, and that is defining the best context to send to a large language model (LLM) or other model. Models have a limited number of tokens they can take in as context, but even within that limit, sending in large amounts of content can lead to relevance degradation caused by factors such as <a href="https://research.trychroma.com/context-rot">context rot</a> or <a href="https://arxiv.org/abs/2307.03172">“lost in the middle”</a> problems, where important information hidden in large blocks of text is overlooked.</p><p>This led to the question: How can we make this better?</p><h2>Reranking in retrievers</h2><p>We started by looking at retrievers, specifically the <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers/text-similarity-reranker-retriever"><code>text_similarity_reranker</code> retriever</a>. We know that many <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking#semantic-reranking-models">cross-encoder</a> rerankers do not perform well on long documents. This is because these rerankers will truncate long content to the model’s token window, discarding the remaining content. This can actually lead to degraded search relevance, if the most relevant part of the document is cut off before being sent to the reranker!</p><p>We decided to address this by introducing a <code>chunk_rescorer</code> to the <code>text_similarity_reranker</code> retriever. When specified, rather than sending in the entire document to the reranker, we will chunk the document first and evaluate each chunk based on the reranking inference text. We do this by indexing each chunk into a temporary in-memory Lucene index and performing a BM25 <a href="https://www.elastic.co/docs/solutions/search/full-text">text search</a> over these chunks. We return the best chunks for consideration into the reranker.</p><p>The chunk rescorer is simple to use with a small update to the API call:</p><pre><code>GET books/_search
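# chunk_rescorer splits each candidate document into chunks and forwards
# only the best-scoring chunk(s) to the reranker instead of the full text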
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "match": {
              "author": "Tolkien"
            }
          }
        }
      },
      "rank_window_size": 10,
      "field": "semantic_text",
      "inference_text": "are all who wander lost?",
      "chunk_rescorer": {
        "size": 1
      }
    }
  }
}</code></pre><p>When we evaluated the chunk rescorer, we found a significant improvement for many truncating models, including the <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-rerank">Elastic Reranker</a> and <a href="https://docs.cohere.com/docs/rerank">Cohere&#x27;s</a> <code>rerank-english-v3.0</code> model. However, when we evaluated against <a href="https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual"><code>jina-reranker-v2-base-multilingual</code></a>, the results were not as impressive because Jina already addresses this long document problem internally.</p><p>We performed evaluations using the <a href="https://huggingface.co/datasets/Shitao/MLDR">Multilingual Long-Document Retrieval</a> (MLDR) English dataset. This dataset contains very long article content that would trigger the document truncation issue in many reranking models. The following table shows our evaluation results with BM25 text search and a <code>rank_window_size</code> of 32:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Reranker model</td><td style="border: 1px solid #ccc; padding: 4px;">NDCG@10</td><td style="border: 1px solid #ccc; padding: 4px;">NDCG@10 with chunk rescoring</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">jina-reranker-v2-base-multilingual</td><td style="border: 1px solid #ccc; padding: 4px;">0.771145</td><td style="border: 1px solid #ccc; padding: 4px;">0.764488</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Cohere rerank-english-v3.0</td><td style="border: 1px solid #ccc; padding: 4px;">0.592588</td><td style="border: 1px solid #ccc; padding: 4px;">0.707842</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">.rerank-v1-elasticsearch</td><td style="border: 1px solid #ccc; padding: 4px;">0.478121</td><td style="border: 1px solid #ccc; padding: 4px;">0.751994</td>
                    </tr>
              </tbody>
            </table>
          <p>It’s worth noting that the raw BM25 results without reranking had a <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">Normalized Discounted Cumulative Gain</a> (NDCG) score, or relevance score, close to 0.64. (Find additional background in this <a href="https://arxiv.org/pdf/2402.03216">paper</a>.) This means that for rerankers that perform truncation, reranked results for long documents were actually <em>worse</em> than without reranking. Note that this only applies for long documents; shorter documents that fit into the token window would not be affected by this long document problem.</p><p>Of the rerankers we evaluated, Jina was the only reranker to perform well against long documents out of the box, thanks to its sliding window approach.</p><p>We saw better baseline performance but similar overall difference in results when using <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> fields with <a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-elser">Elastic Learned Sparse EncodeR (ELSER)</a>.</p><p>We felt the results for truncating models were promising enough to release the chunk rescorer as an opt-in feature for models that will benefit from the additional relevance, but we recommend evaluating against specific rerankers before implementing this in production.</p><h2>ES|QL</h2><p>The real power of chunk extraction, however, lies in the <a href="https://www.elastic.co/docs/reference/query-languages/esql">Elasticsearch Query Language</a> (ES|QL). We wanted chunks and snippets to be first class citizens in ES|QL so they could be easily extracted and repurposed for reranking, sending into LLM context, or other purposes.</p><p>We started by introducing the <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/string-functions#esql-chunk"><code>CHUNK</code></a> function in Elasticsearch version 9.2:</p><pre><code>FROM books
| EVAL chunks = CHUNK(description) </code></pre><p><code>CHUNK</code> is a deliberately low-level primitive that takes some string content (a text field, a semantic text field, or any other row content that is a string) and chunks it. You can view and interact with these chunks, and you can also experiment with different chunking settings:</p><pre><code>FROM books
| EVAL chunks = CHUNK(description, {"strategy": "sentence", "max_chunk_size": 25, "sentence_overlap": 0}) </code></pre><p>You can then combine chunk with existing primitives, like <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/mv-functions#esql-mv_slice"><code>MV_SLICE</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/mv_expand"><code>MV_EXPAND</code></a>, to format the way chunks are represented in your row output:</p><pre><code>FROM books
| WHERE MATCH(author, "Tolkien")
| EVAL chunks = CHUNK(semantic_description, {"strategy": "sentence", "max_chunk_size": 25, "sentence_overlap": 0})
| EVAL truncated = MV_SLICE(chunks, 0, 2)
| MV_EXPAND truncated
| KEEP title, truncated</code></pre><p>This is great, but what we really wanted was to get the top matching snippets for a query, so we also introduced <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/search-functions#esql-top_snippets"><code>TOP_SNIPPETS</code></a> in Elasticsearch version 9.3:</p><pre><code>FROM books
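// TOP_SNIPPETS scores passages of the description field against the query
// text and returns the best-matching ones as a multivalued column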
| EVAL snippets = TOP_SNIPPETS(description, "ring")</code></pre><p>We added support to control the number of snippets you want to return and the word size using a sentence-based <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">chunking strategy</a>:</p><pre><code>FROM books
| EVAL snippets = TOP_SNIPPETS(description, "ring", { "num_snippets": 3, "num_words": 25 })
| MV_EXPAND snippets
| KEEP title, snippets</code></pre><p>This fits into the broader story of LLMs when you add in <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/completion"><code>COMPLETION</code></a>. Here is an example of how we envision <code>TOP_SNIPPETS</code> integrating with LLMs:</p><pre><code>FROM books METADATA _score
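// Retrieve the ten best-matching documents, extract the top snippets from
// each, and pass only those snippets (not whole documents) to the LLM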
| WHERE semantic_description:"what are hobbits?"
| SORT _score DESC
| LIMIT 10
| FORK
   ( EVAL snippets = TOP_SNIPPETS(semantic_description, "what are hobbits?", {"num_snippets": 3})
    | COMPLETION CONCAT("Answer what are hobbits", snippets) WITH {"inference_id": "my-completion-endpoint"})</code></pre><p>In this example, we’re performing a semantic search, but for each document we’re identifying the top snippets from that document. We’re sending the highly relevant snippets into the completion command, rather than the entire document. This is a simple example, but you could also use reranking here, and in the future, when multiple forks are available, hybrid search will be supported in the same format.</p><p>We can also utilize snippets in the newest version of <code>RERANK</code>:</p><pre><code>FROM books
| WHERE MATCH(title, "ring")
| EVAL snippets = TOP_SNIPPETS(semantic_description, "what are hobbits?", { "num_snippets": 3, "num_words": 25 })
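// Rerank on the short extracted snippets rather than the full field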
| RERANK "what are hobbits?" ON  snippets WITH { "inference_id" : "my-reranker" }</code></pre><h2>What we’re thinking about next</h2><p>The story isn’t over for chunking and snippet extraction; in fact, it’s only getting started.</p><p>We’re looking at how to best integrate existing <code>semantic_text</code> chunks out of the box into strategies using chunking and snippet extraction. We’re also exploring what other features we need to make snippet extraction a compelling feature to use in products such as Elastic Agent Builder.</p><p>Overall, we’re excited to share these tools and look forward to your feedback as we evolve our strategies for getting the best context for LLMs!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Improve search performance with `best_compression`]]></title>
            <description><![CDATA[While `best_compression` is typically seen as a storage-saving feature for Elastic Observability and Elastic Security use cases, this blog demonstrates its effectiveness as a performance-tuning lever for search.]]></description>
            <link>https://www.elastic.co/search-labs/blog/improve-elasticsearch-performance-best-compression</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/improve-elasticsearch-performance-best-compression</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sherry Ger, Ryan Eno]]></dc:creator>
            <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5141a8c2618337207d848ce16b258a86885955b2-1600x1034.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>When tuning Elasticsearch for high-concurrency workloads, the standard approach is to maximize RAM to keep the working set of documents in memory to achieve low search latency. Consequently, <a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules"><code>best_compression</code></a> is rarely considered for search workloads, as it is primarily viewed as a storage-saving measure for Elastic Observability and Elastic Security use cases where storage efficiency takes priority.</p><p>In this blog, we demonstrate that when the dataset size significantly exceeds the OS page cache, <code>best_compression</code> improves search performance and resource efficiency by reducing the I/O bottleneck.</p><h2><strong>The setup</strong></h2><p>Our use case is a high-concurrency search application running on <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/ec-change-hardware-profile#ec-profiles-compute-optimized-arm">Elastic Cloud CPU optimized instances</a>.</p><ul><li>Data volume: ~500 million documents</li><li>Infrastructure: 6 Elastic Cloud (Elasticsearch service) instances (each instance: 1.76 TB storage | 60 GB RAM | 31.9 vCPU)</li><li>Memory-to-storage ratio: ~5% of the total dataset fits into RAM</li></ul><h2><strong>The symptoms: high latency</strong></h2><p>We observed that when the number of concurrent requests spiked around 19:00, the search latency deteriorated significantly. As shown in Figure 1 and Figure 2, while traffic peaked around 400 requests per minute per Elasticsearch instance, the average query service time degraded to over 60ms.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f9c6cc1882e7db24336c65c54bbc1d38dcdb7fa3-697x311.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bbb705ae2fcd14c81d335bf322346caf3bf33765-996x618.png" style="width: 100%;" /><p>The CPU usage remained relatively low after the initial connection handling, indicating that compute was not the bottleneck.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/cec15a28d2d22e9adedd2951bb2334b3717890a1-1494x730.png" style="width: 100%;" /><p>A strong correlation emerged between query volume and page faults. As requests increased, we observed a proportional rise in page faults, peaking around 400k/minute. This indicated that the active dataset could not fit in the page cache.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/511f2f10300a9d10ba3d7a82b9a8c8d567ac5636-1492x678.png" style="width: 100%;" /><p>Simultaneously, the JVM heap usage appeared to be normal and healthy. This ruled out garbage collection issues and confirmed the bottleneck was I/O.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/336bbad638f866304358dba1d06ee987de0f23cf-1490x568.png" style="width: 100%;" /><h2><strong>The diagnosis: I/O bound</strong></h2><p>The system was I/O bound. <a href="https://www.elastic.co/blog/elasticsearch-caching-deep-dive-boosting-query-speed-one-cache-at-a-time">Elasticsearch relies on the OS page cache to serve index data from memory</a>. When the index is too large for the cache, queries trigger expensive disk reads.
While the typical solution is to scale horizontally (add nodes/RAM), we wanted to exhaust efficiency improvements on our existing resources first.</p><h2><strong>The fix</strong></h2><p>By default, Elasticsearch uses <a href="https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)">LZ4</a> compression for its index segments, striking a balance between speed and size. We hypothesized that switching to <code>best_compression</code> (which uses <a href="https://en.wikipedia.org/wiki/Zstd">zstd</a>) would reduce the size of indices. A smaller footprint allows a larger percentage of the index to fit in the page cache, trading a negligible increase in CPU (for decompression) for a reduction in disk I/O.</p><p>To enable <code>best_compression</code>, we reindexed the data with the index setting <code>index.codec: best_compression</code>. Alternatively, the same result could be achieved by closing the index, resetting the index codec to <code>best_compression</code>, and then performing a segment merge.</p><pre><code>POST my-index/_close
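# The index must be closed before the codec can be changed; the force merge
# below rewrites existing segments so they are stored with the new codec.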
PUT my-index/_settings
{
    "codec": "best_compression"
}
  
POST my-index/_open  
POST my-index/_forcemerge?max_num_segments=1</code></pre><h2><strong>The results</strong></h2><p>The results confirmed our hypothesis: improved storage efficiency directly translated into a substantial boost in search performance with no accompanying increase in CPU utilization.</p><p>Applying <code>best_compression</code> reduced the index size by approximately 25%. While less than the reduction seen in repetitive log data, this 25% reduction effectively increased our page cache capacity by the same margin.</p><p>During the next load test (starting at 17:00), the traffic was even higher, peaking at 500 requests per minute per Elasticsearch node.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fc1902f396cb2115c0013155ad07f6eb87389c60-660x309.png" style="width: 100%;" /><p>Despite the higher load, the CPU utilization was lower than in the previous run. The elevated usage in the earlier test was likely due to the overhead of excessive page fault handling and disk I/O management.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/15699ef4c65b3f0a9f8a3e1bae8bb18f7b647025-819x352.png" style="width: 100%;" /><p>Crucially, page faults dropped significantly. Even at higher throughput, faults hovered around &lt;200k per minute, compared to &gt;300k in the baseline test.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f76ca967976d740af88a9359b66041701abb46fc-764x340.png" style="width: 100%;" /><p>Although the page fault results were still less than optimal, query service time was cut by about 50%, hovering below 30ms even under heavier load.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/750ec1c59b8eb5069aed4c066d856ecea82d5bca-620x311.png" style="width: 100%;" /><p></p><h2><strong>The conclusion: best_compression for search</strong></h2><p>For search use cases where data volume exceeds available physical memory, <code>best_compression</code> is a potent performance-tuning lever.</p><p>The conventional solution to cache misses is to scale out to increase RAM. However, by reducing the index footprint, we achieved the same goal: maximizing the document count in the page cache. Our next step is to explore <a href="https://www.elastic.co/blog/space-savings-a-lesser-known-benefit-of-index-sorting-in-elasticsearch"><strong>index sorting</strong></a> to further optimize storage and squeeze even more performance out of our existing resources.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Builder now GA: Ship context-driven agents in minutes]]></title>
            <description><![CDATA[Agent Builder is now GA. Learn how it allows you to quickly develop context-driven AI agents.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-elastic-ga</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Anish Mathur, Evan Castle]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6840eb7dbb884055ab0e965dcfd614fec54936af-2210x1440.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of Agent Builder in Elastic Cloud Serverless and in the upcoming 9.3 release. Agent Builder brings the power of Elasticsearch as a context engineering platform to quickly develop contextual, data-focused AI agents.</p><p>Agents are gaining traction driven by their potential to deliver efficiency gains and better customer experiences. But in practice, providing agents with the right context is difficult, especially when operating over messy, unstructured enterprise data. Developers must manage tools, prompts, state, reasoning logic, models, and crucially retrieve relevant context from business sources to deliver accurate results and actions. Elastic Agent Builder delivers these core components to develop secure, reliable, context-driven agents.</p><h2>Agent Builder core capabilities</h2><p>Agent Builder leverages Elastic’s long-term investments in search relevance and retrieval-augmented generation, and work to make Elasticsearch the best vector database to simplify the development of contextual, data-focused AI agents.</p><p>Agent Builder allows you to:</p><ul><li>Immediately start with a built-in conversational agent that can answer questions, perform analytics and drive investigations over any data in Elasticsearch.</li><li>Quickly go from complex unstructured data to a custom agent with configuration-based development experience.</li><li>Leverage best-in-class, hybrid search relevance through built-in ES|QL or custom tools to improve context quality and agent reliability.</li><li>Execute complex workflows (preview) as reusable tools to enrich data, update records, send messages, and more for rules-based automation.</li><li>Connect to data sources outside of Elasticsearch using workflows and MCP to correlate and combine context for agents.</li><li>Integrate with any agentic or application framework using built-in and custom tools exposed over MCP, and the ability to connect to external MCP (preview), support for A2A, and full API support.</li><li>Extend the capabilities of Agent Builder with integration to third-party solutions like LlamaIndex for complex document processing or Arcade.dev for secure, structured tool access.</li></ul><p>To further extend Agent Builder functionality, we are introducing Elastic Workflows, our new rules-based automation capabilities, now in technical preview. For organizational tasks, agents at times need certainty and reliability of rules-based actions, which are often necessary to implement specific business logic. Elastic Workflows provides agents with a simple, declarative way to orchestrate internal and external systems to take actions, gather and transform data and context. Workflows are fully composable, event-driven and flexible, and can be exposed as tools to an agent via MCP.</p><h2>Go from data to agent in minutes</h2><p>Developing agents can take weeks of upfront work to consolidate separate data stores, build manual pipelines, tune queries, and manage complex orchestration. 
Agent Builder reduces the time to develop agents by removing the need for separate data stores, vector databases, RAG pipelines, search layers, query translators, and tool orchestrators, allowing you to focus on agent logic and application delivery.</p><p>Agent Builder natively integrates Elasticsearch platform primitives to make agent development fast.</p><ul><li>Start with a built-in conversational agent that can immediately chat with and reason with your indexed data.</li><li>Integrate agents into applications, dashboards, or CI/CD systems with interactive access via Kibana, APIs, or MCP and A2A.</li><li>Build with default tooling to understand your data structure, select the appropriate index, generate optimized hybrid, semantic, and structured queries, and create configurable visualizations using ES|QL based on natural language prompts.</li></ul><p>To go deeper, try a complete <a href="https://www.elastic.co/search-labs/blog/ai-agent-builder-elasticsearch">hands-on walkthrough</a>.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b55b63eae40f72952967cc8f3ea4df4cd62d7d70-1080x608.gif" style="width: 100%;" /><h2>Build on Elasticsearch, a complete data platform for context engineering</h2><p>For AI agents, context quality is essential to provide effective reasoning and reduce the risks of hallucination. For many enterprise AI agents, the business data required to perform a task is the most crucial piece of context. As a massively scalable data store, vector database, and leader in relevance, Elasticsearch already offers many strong context-engineering primitives. Context engineering goes beyond simple retrieval-augmented generation by allowing you to tailor and scale how data is fetched, ranked, filtered, and presented to agents, helping reduce noise and ambiguity.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/419b9b6f13739e0a8983249d8ac31478e73dac89-1600x901.png" style="width: 100%;" /><p>Elasticsearch delivers a context engine that combines lexical search, vector search, and structured filtering for retrieval that materially <a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">improves LLM performance</a> by ensuring the model operates on relevant and precise context. This capability is supported by agentic retrieval, along with built-in tools and search logic that automatically select the right indexes and transform natural language into optimized queries for context.</p><p>With Agent Builder, you can ensure agents receive the most useful context first with controls for relevance and ranking, allowing you to fine-tune scoring, ranking, and filtering logic. Elasticsearch lets you control what matters, why it matters, and how it is prioritized, instead of relying on opaque retrieval behavior. This is all underpinned by Elasticsearch as a scalable data platform to store and scale all your data from text, vectors, metadata, logs, and more on one platform, making it easier to manage context for agents.</p><h2>Execute complex workflows as reusable tools</h2><p>While AI agents enable reasoning for complex tasks, much automation depends on reliably executing rules-based actions that enforce specific business logic. Elastic Workflows provides a simple, declarative way to orchestrate internal and external systems to take actions, gather context or data, and integrate them as part of agents. 
Defined in YAML, workflows are fully composable, allowing them to be as simple or as complex as the job requires. This gives agents an efficient way to take action across the Elasticsearch platform and solutions, as well as with third-party applications.</p><p>Integrating a workflow with Agent Builder can be done in three steps (prerequisite: enable workflows with details provided <a href="https://github.com/elastic/workflows">here</a>)</p><p>1. Create and save a new workflow using the simple YAML-based editor with built-in autocomplete and testing.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/308888bf3d2fa013f9391a55be6a6fbd458b6dac-1600x998.png" style="width: 100%;" /><p>2. Create a new tool in Agent Builder with type “Workflow” and provide a description to help the agent determine when to use the workflow tool.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c04810d30d226112c3610bd58e208607b213fc3d-1600x945.png" style="width: 100%;" /><p>3. Add the workflow tool to your custom agent.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/724cd4ac93c46efb0d339fd140e5caf138f8150f-1600x948.png" style="width: 100%;" /><p>4. That’s it! Now the agent can call the workflow from within a conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8dfdd726ab89e31c48b79372650ce33946713dca-1600x929.png" style="width: 100%;" /><h2>Your agent, your rules</h2><p>Agent Builder doesn’t lock you into a single development paradigm. Instead, it’s designed to enable open, flexible development approaches for agents with full control of data, relevance, models, interoperability, security and agent design.</p><p>Custom agent definitions let you choose exactly which tools an agent can access, embed custom system prompts, tailor the agent’s instructions, and define security boundaries. Agents remain model-agnostic, allowing you to flexibly configure a preferred LLM, both native and across the broader ecosystem, without being locked into a single provider.</p><p>Build extensible tools that encapsulate domain-specific logic (e.g., specific index filters, ES|QL joins, analytical pipelines), and constrain them for safe use in production. Full API support enables interoperability with other agentic frameworks, with native support for Model Context Protocol (MCP). A2A integration means you can expose your Elastic agents to other frameworks, services, and client apps, reusing the same data and context engineering logic across integrations.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e903ba24ffb3f40231e901f63bd494c89cb7757-1600x1004.png" style="width: 100%;" /><p>Agent Builder supports flexible, open development and is designed to integrate easily with popular agent frameworks and platforms. These integrations can be essential to delivering effective agents. As <strong>Sam Partee, Co-Founder at Arcade.dev</strong> describes,</p><p><em>“Agentic systems fail today because connecting AI to tools and data is complex. Elastic Agent Builder with Arcade.dev gives developers a structured, secure way to handle how agents retrieve context, reason, and act, taking agents from demo to production grade.&quot;</em></p><p>Agent Builder also leverages the extensibility of Elasticsearch for handling complex data. As <strong>Jerry Liu, CEO at LlamaIndex </strong>describes,</p><p><em>“Unlocking enterprise context from unstructured data sources is key to building effective agents. 
Elastic Agent Builder combined with LlamaIndex’s complex document processing strengthens the critical context layer, helping teams retrieve, process, and prepare data so agents can reason more accurately and deliver better outcomes.”</em></p><h2>What can you build?</h2><p>Agent Builder is already being used for a variety of use cases. Below are a few examples and reference architectures to get started with agents:</p><ul><li><strong>Automate infrastructure: </strong>In support scenarios, agents have been used to read, think, and chat, but to date, they cannot reach out and touch the infrastructure they may need to manage. Elastic’s engineering team built an agent for <a href="https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure">automated infrastructure management</a> as part of a hackathon. The agent actively investigates issues with application infrastructure and takes automated actions. It uses workflows to optimize configurations, respond to issues and scale resources, all based on an intelligent understanding of infrastructure logs.</li><li><strong>Security threat analysis: </strong>A security vulnerability agent was developed with Elastic Agent Builder, MCP, and Elasticsearch. It automates threat analysis by correlating internal security data with external threat intelligence. The agent performs semantic search over historical incidents and configurations, augments results with live internet data, and applies LLM reasoning to assess environmental relevance, prioritize risks, and produce actionable remediation. See the <a href="https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch">reference architecture</a><strong>.</strong></li><li><strong>Technical customer support: </strong>Agents can perform multiple support tasks, including case summarization, issue deduplication and creation, and deep technical investigation. Agent Builder enables this with multi-step, hybrid search to find only the most relevant related issues, solutions, and procedures, and formulate root cause hypotheses and remediation plans. Agent Builder can simplify the architecture of complex <a href="https://www.elastic.co/blog/generative-ai-customer-support-elastic-support-assistant">support systems</a> and accelerate time to delivery.</li><li><strong>Product and content discovery:</strong> Agent Builder simplifies the process of <a href="https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder">exposing complex product catalogs for conversational experiences</a>, while allowing organizations to maintain flexibility to include their own business logic and requirements.</li><li><strong>Build your own:</strong> Join the <a href="https://elasticsearch.devpost.com/">Agent Builder Hackathon,</a> running from January 22 to February 27, 2026. Work with the community to build context-driven, multi-step AI agents that combine search, workflows, tools, and reasoning to automate real-world tasks.*</li></ul><h2>Start building custom agents now</h2><p>Get started with an <a href="https://cloud.elastic.co/registration?onboarding_token=search&amp;pg=en-enterprise-search-page">Elastic Cloud Trial</a>, and check out the documentation <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">here</a>.
For existing customers, Agent Builder is available in Cloud Serverless and on the Enterprise Tier in Elastic Cloud Hosted and self-managed.</p><p>* <a href="https://elasticsearch.devpost.com/rules">Click here</a> for full terms, conditions, and eligibility requirements for the hackathon.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building voice agents with Elastic Agent Builder]]></title>
            <description><![CDATA[Exploring how voice agents work and how to build one using Elastic Agent Builder and LiveKit.]]></description>
            <link>https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/build-voice-agents-elastic-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Kenneth Kreindler]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/43ceabb9e2c0966261c188bd40e03178d5a91e5c-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>AI has been trapped in a glass box. You type commands, it responds with text, and that’s it. It’s useful but distant, like watching someone move behind a screen. This year, 2026, will be the year when business will shatter this glass and bring AI agents into products, where they really deliver value.</p><p>One of the ways the glass will be shattered is by the adoption of <em>voice agents</em>, which are AI agents that recognize human speech and synthesize computer-generated audio. With the rise of low-latency transcriptions, fast large language models (LLMs), and text-to-speech models that sound human, this has become possible.</p><p>Voice agents also need access to business data to become really valuable. In this blog, we’ll learn how voice agents work and build one for ElasticSport, a fictitious outdoor sports equipment shop, using <a href="https://livekit.io/">LiveKit</a> and <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>. Our voice agent will be context-aware and will work with our data.</p><h2>How it works</h2><p>There are two paradigms in the world of voice agents: The first uses speech-to-speech models, and the second uses a voice pipeline consisting of speech-to-text, LLM, and text-to-speech. Speech-to-speech models have their own benefits, but voice pipelines offer much more customization over the technologies used and how context is managed, along with control over the agent’s behavior. We’ll focus on the voice pipeline model.</p><h3>Key components</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b237501903f9c3a71fe1b7755c3990e40c5495c8-1600x653.png" style="width: 100%;" /><h4>Transcription (speech-to-text)</h4><p>Transcription is the voice pipeline entrypoint. The transcription component takes as input raw audio frames, transcribes speech into text, and outputs that text. The transcribed text is buffered until the system detects that the user&#x27;s speech has ended, at which point the LLM generation is kicked off. Various third-party providers offer low-latency transcriptions. When selecting one, consider latency and transcription accuracy, and ensure that they support streamed transcripts.</p><p></p><p>Examples of third-party APIs: <a href="https://www.assemblyai.com/">AssemblyAI</a>, <a href="https://deepgram.com/product/speech-to-text">Deepgram</a>, <a href="https://platform.openai.com/docs/guides/realtime-transcription">OpenAI</a>, <a href="https://elevenlabs.io/speech-to-text">ElevenLabs</a></p><h4>Turn detection</h4><p>Turn detection is the component of the pipeline that detects when the speaker has finished speaking and the generation should begin. One common way to do this is through a voice activity detection (VAD) model, such as <a href="https://github.com/snakers4/silero-vad">Silero VAD</a>. VAD uses audio energy levels to detect when audio contains speech and when speech has ended. However, VAD by itself cannot identify the difference between a pause and the end of speech. This is why it’s often combined with an end-of-utterance model that predicts whether the speaker has finished speaking, based on the interim transcript or raw audio.</p><p>Examples (Hugging Face): <a href="https://huggingface.co/livekit/turn-detector">livekit/turn-detector</a>, <a href="https://huggingface.co/pipecat-ai/smart-turn-v3">pipecat-ai/smart-turn-v3</a></p><h4>Agent</h4><p>The agent is the core of a voice pipeline. 
It’s responsible for understanding intent, gathering the right context, and formulating a reply in text format. <a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a>, with its built-in reasoning capabilities, tool library, and workflow integration, makes for an agent that can work on top of your data and interact with external services.</p><h4>LLM (text-to-text)</h4><p>When selecting an LLM for Elastic Agent Builder, there are two main characteristics to consider: LLM reasoning benchmarks and time to first token (TTFT).</p><p>Reasoning benchmarks indicate how well the LLM is able to generate correct responses. Benchmarks to consider include those that evaluate multiturn conversation adherence and general intelligence, such as MT-Bench and the Humanity&#x27;s Last Exam dataset, respectively.</p><p>TTFT benchmarks evaluate how fast the model produces its first output token. There are other types of latency benchmarks, but TTFT is particularly important for voice agents, as audio synthesis can begin as soon as the first token is received, resulting in lower latency between turns and a more natural-feeling conversation.</p><p>Usually, one needs to make a trade-off between these two characteristics because faster models often perform worse on reasoning benchmarks.</p><p>Examples (Hugging Face): <a href="https://huggingface.co/openai/gpt-oss-20b">openai/gpt-oss-20b</a>, <a href="https://huggingface.co/openai/gpt-oss-120b">openai/gpt-oss-120b</a></p><h4>Synthesis (text-to-speech)</h4><p>The final part of the pipeline is the text-to-speech model. This component is responsible for converting the text output from the LLM into audible speech. Similar to the LLM, latency is a characteristic to look out for when selecting a text-to-speech provider. Text-to-speech latency is measured by time to first byte (TTFB). That’s the time it takes for the first audio byte to be received. Lower TTFB also reduces turn latency.</p><p>Examples: <a href="https://elevenlabs.io/text-to-speech-api">ElevenLabs</a>, <a href="https://cartesia.ai/sonic">Cartesia</a>, <a href="https://www.rime.ai/">Rime</a></p><h4>Constructing the voice pipeline</h4><p>Elastic Agent Builder can be integrated into a voice pipeline at several different levels:</p><ol><li>Agent Builder tools only: speech-to-text → LLM (with Agent Builder tools) → text-to-speech</li><li>Agent Builder as an MCP: speech-to-text → LLM (with Agent Builder access via MCP) → text-to-speech</li><li>Agent Builder as the core: speech-to-text → Agent Builder → text-to-speech</li></ol><p>For this project, I chose the “Agent Builder as the core” approach. With this approach, the full functionality of Agent Builder and workflows can be used. The project uses LiveKit to orchestrate speech-to-text, turn detection, and text-to-speech, and it implements a custom LLM node that integrates directly with Agent Builder.</p><h2>Elastic support voice agent</h2><p>We’ll be building a custom support voice agent for a fictitious sports shop called ElasticSport. Customers will be able to call the help line, ask for product recommendations, find product details, check order statuses, and have order information sent to them via text.
To achieve this, we first need to configure a custom agent and create tools for executing Elasticsearch Query Language (ES|QL) queries and workflows.</p><a href="https://videos.elastic.co/watch/jpX6D21URZuh3PtXVV7Ks6">Watch video on Vidyard</a><h3>Configuring the agent</h3><h4>Prompt</h4><p>The prompt instructs the agent what personality it should take and how to respond. Importantly, there are a few voice-specific prompts that ensure the responses are synthesized into audio properly and misunderstandings are recovered from gracefully.</p><pre><code>You are a Sales Assistant at ElasticSport, an outdoor sport shop specialized in hiking and winter equipment. 

[Profile]
- name: Iva
- company: ElasticSport
- role: Sales Assistant
- language: en-GB
- description: ElasticSport virtual sales assistant

[Context]
- Ask clarifying questions to understand the context.
- Use available tools to answer the user's question.
- Use the knowledge base to retrieve general information

[Style]
- Be informative and comprehensive.
- Maintain a professional, friendly and polite tone.
- Mimic human behavior and speech patterns.
- Be concise. Do not over explain initially

[Response Guideline]
- Present dates in spelled-out month date format (e.g., January fifteenth, two thousand and twenty-four).
- Avoid the use of unpronounceable punctuation such as bullet points, tables, emojis.
- Respond in plain text, avoid any formatting.
- Spell out numbers as words for more natural-sounding speech.
- Respond in short and concise sentences. Responses should be 1 or 2 sentences long.

[ERROR RECOVERY]
### Misunderstanding Protocol
1. Acknowledge potential misunderstanding
2. Request specific clarification</code></pre><h4>Workflows</h4><p>We’ll add a small workflow to send an SMS through Twilio’s messaging API. The workflow will be exposed to the custom agent as a tool, resulting in a user experience where the agent can send the caller an SMS while on the call. This allows the caller to, for example, ask, “Can you send more details about <em>X</em> over text?”</p><pre><code>name: send sms
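# Exposed to the agent as a workflow tool: the agent fills in the two
# inputs below (message, phone_number) from the conversation.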
enabled: true
triggers:
  - type: manual
inputs:
  - name: message
    type: string
    description: The message to send to the phone number.

  - name: phone_number
    type: string
    description: The phone number to send the message to.

consts:
  TWILIO_ACCOUNT: "****"
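  # BASIC_AUTH is the raw "AccountSid:AuthToken" pair for Twilio; the HTTP
  # step below pipes it through base64_encode to form the Basic auth header.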
  BASIC_AUTH: "****"
  FROM_PHONE_NUMBER: "****"
steps:
  - name: http_step
    type: http
    with:
      url: https://api.twilio.com/2010-04-01/Accounts/{{consts.TWILIO_ACCOUNT}}/Messages.json
      method: POST
      headers:
        Content-Type: application/x-www-form-urlencoded
        Authorization: Basic {{consts.BASIC_AUTH | base64_encode}}
      body: From={{consts.FROM_PHONE_NUMBER}}&To={{inputs.phone_number}}&Body={{inputs.message}}
      timeout: 30s</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b057e71b0a7c50eb3da47cd4f95e77ec7b4c6126-1600x1245.png" style="width: 100%;" /><h4>ES|QL tools</h4><p>The following tools allow the agent to provide relevant responses that are grounded in real data. The example repo contains a setup script to initialize Kibana with product, order, and knowledge base datasets.</p><ul><li><strong>Product.search</strong></li></ul><p>The product dataset contains 65 fictitious products. This is an example document:</p><pre><code>{
      "sku": "ort3M7k",
      "name": "Ortovox Free Rider 26 Backpack",
      "price": 189,
      "currency": "USD",
      "image": "https://via.placeholder.com/150",
      "description": "The Ortovox Free Rider 26 is a technical freeride backpack with a dedicated safety compartment and diagonal ski carry system. Perfect for backcountry missions.\n\nKey Features:\n- 26L capacity\n- Diagonal ski carry system\n- Safety equipment compartment\n- Helmet holder\n- Hydration system compatible",
      "category": "Accessories",
      "subCategory": "Backpacks",
      "brand": "Ortovox",
      "sizes": ["One Size"],
      "colors": ["Black", "Blue", "Orange"],
      "materials": ["Nylon", "Polyester"]
    }</code></pre><p>The name and description fields are mapped as <code>semantic_text</code>, enabling the LLM to use semantic search via ES|QL to retrieve relevant products. The hybrid search query performs semantic matching across both fields, with a slightly higher weight applied to matches on the name field using a boost.</p><p>The query first retrieves the top 20 results ranked by their initial relevance score. These results are then reranked based on their description field using the <code>.rerank-v1-elasticsearch</code> inference model, and finally trimmed down to the top five most relevant products.</p><pre><code>type: ES|QL
toolId: products.search
description: Use this tool to search through the product catalogue by keywords.
query: |
    FROM products
        METADATA _score
      | WHERE
          MATCH(name, ?query, {"boost": 0.6}) OR
          MATCH(description, ?query, {"boost": 0.4})
      | SORT _score DESC
      | LIMIT 20
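      // Keep the 20 best hits by hybrid score, then let the reranker
      // reorder them on the description field before trimming to five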
      | RERANK ?query
            ON description
            WITH {"inference_id": ".rerank-v1-elasticsearch"}
      | LIMIT 5

parameters:
    query: space separated keywords to search for in catalogue</code></pre><ul><li><strong>Knowledgebase.search</strong></li></ul><p>The knowledgebase datasets contain documents of the following shape, where the title and content fields are stored as semantic text:</p><pre><code>{
        id: "8273645",
        createdAt: "2025-11-14",
        title: "International Orders",
        content: `International orders are processed through our international shipping partner. Below are the countries we ship to and average delivery times.
        Germany: 3-5 working days
        France: 3-5 working days
        Italy: 3-5 working days
        Spain: 3-5 working days
        United Kingdom: 3-5 working days
        United States: 3-5 working days
        Canada: 3-5 working days
        Australia: 3-5 working days
        New Zealand: 3-5 working days
        `
}</code></pre><p>The tool uses a query similar to that of the <code>products.search</code> tool:</p><pre><code>type: "ES|QL"
toolId: knowledgebase.search
description: Use this tool to search the knowledgebase.
query: |
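  // Same hybrid match + rerank pattern as products.search, applied to title and content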
  FROM knowledge_base
    METADATA _score
  | WHERE
      MATCH(title, ?query, {"boost": 0.6}) OR
      MATCH(content, ?query, {"boost": 0.4})
  | SORT _score DESC
  | LIMIT 20
  | RERANK ?query
      ON content
      WITH {"inference_id": ".rerank-v1-elasticsearch"}
  | LIMIT 5

parameters:
  query: space-separated keywords or a natural-language phrase to semantically search for in the knowledge base</code></pre><ul><li><strong>order.search</strong></li></ul><p>The final tool that we’ll add is the one used to retrieve orders by <code>order_id</code>:</p><pre><code>type: "ES|QL"
toolId: order.search
description: Use this tool to retrieve an order by its ID.
query: |
  FROM orders
    METADATA _score
  | WHERE order_id == ?order_id
  | SORT _score DESC
  | LIMIT 1

parameters:
  order_id: "the ID of the order"</code></pre><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d22bdd540a95b5a9c2bd5f308620835e8e6f7ecb-1600x1361.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d91709a50fb5391876b714885242d998b2b21027-1600x1443.png" style="width: 100%;" /><p>After configuring the agent and attaching these workflows and ES|QL tools to the agent, the agent can be tested inside of Kibana.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8691f41624247a6b1352d158c970031e1426ce5e-1600x1056.png" style="width: 100%;" /><p>Aside from building an ElasticSport support agent, the agent, workflows, and tools can be tailored to other use cases, such as a sales agent that qualifies leads, a servicing agent for home repairs, reservations for a restaurant, or an appointment-scheduling agent.</p><p></p><p>The final part is linking up the agent we just made with LiveKit, text-to-speech, and speech-to-text models. The repo linked at the end of this blog contains a custom Elastic Agent Builder LLM node that can be used with LiveKit. Just replace the <code>AGENT_ID</code> with your own, and link it with your Kibana instance.</p><h2>Getting started</h2><p>Check out the code, and try it for yourself <a href="https://github.com/KDKHD/elastic_agent_builder_livekit">here</a>. </p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent Builder, beyond the chatbox: Introducing Augmented Infrastructure]]></title>
            <description><![CDATA[Learn about Elastic Agent Builder with Augmented Infrastructure, an AI agent that enables augmented operations, augmented development, and augmented synthetics.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-augmented-infrastructure</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Alexander Wert, Bill Easton, Gil Raphaelli, Graham Hudgins, Norrie Taylor]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e08daf78f328e826f39d06329f6a5487f75d178d-1272x700.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p><strong>This is not talk. We&#x27;re doing it.</strong></p><p>We’ve all seen the rise of AI agents. They’re fantastic at summarizing text, writing code snippets, and answering questions based on documentation. But for those of us in DevOps and site reliability engineering (SRE), there has been a frustrating limitation. Most agents are trapped in the Call Center paradigm, meaning that they can read, think, and chat, but they cannot reach out and touch the infrastructure they’re supposed to be managing.</p><p>For our latest hackathon project, we decided to blow that limitation up.</p><p>We built <strong>Augmented Infrastructure</strong>: an infrastructure copilot that not only gives you advice but also creates, deploys, monitors, and fixes your live environment.</p><h2><strong>The problem: Copy, reformat, paste</strong></h2><p>Standard agents operate in a vacuum. If your app goes down and costs the company $5 million, a standard agent can read you the runbook on how to fix it. But <em>you</em> still have to do the work. You’re left to copy the code, reformat it for your environment, and paste it into your terminal.</p><p>We wanted an agent that understands the difference between <em>talking</em> about Kubernetes and <em>configuring</em> Kubernetes.</p><h2><strong>The engine: What is Elastic Agent Builder?</strong></h2><p>To build this, we didn&#x27;t start from scratch. We built it on top of <a href="https://www.elastic.co/elasticsearch/agent-builder"><strong>Elastic Agent Builder</strong></a>. For those unfamiliar, Elastic Agent Builder is a framework designed to rapidly develop agents, and it acts as the bridge between a large language model (LLM), which in our demo was Google Gemini, and private data stored in Elasticsearch.</p><p>Agent Builder can be used for conversational AI by grounding it in internal data, like documents or logs. But its most powerful feature is the ability to assign <strong>tools</strong>. These tools allow the LLM to step outside of the chat interface to perform specific tasks. We realized that if we pushed this feature to its limit, we could transform Agent Builder into an automation powerhouse.</p><h2><strong>Making it work: Building the first version</strong></h2><p>When we started on the project, we knew we wanted to make the agents able to change the outside world. We had an idea: What if we built some “runner” software (to run any command the agent could think of on the host)? And then: What if the runners, Elastic Agent Builder, and the user were in a three-way call?</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8dc8317c1301b8eb7b89438529e8d8d17411c95a-1024x559.png" style="width: 100%;" /><p>We started by building out a Python project, Augmented Infrastructure Runners, which was essentially a <code>while(true)</code> loop that queried the Elastic Agent Builder conversations API every second and checked for a special syntax we had created:</p><pre><code>{
	"tool_name": "my_tool",
       "tool_arguments": "\{stringified json arguments\}"
}</code></pre><p>We then updated the prompt to teach it about our new tool calling syntax. Bill is a maintainer of <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, the most popular framework for building Model Context Protocol (MCP) servers in Python. He set to work using the FastMCP client with this new runner software to mount MCP servers and make their tools available to the runner. When the runner saw this syntax, it would run the tool call and <code>POST</code> the results back to the conversation as if the user had sent them. This triggered the LLM to respond to the result, and off we went!</p><p>This was great, but it had two main problems:</p><ol><li>The agent would spew all of this JSON right into the conversation with the user.</li><li>The earliest point in time when messages were visible through the conversations API was when a conversation round was completed (that is, when the LLM replied).</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/7b3755aeae17722ff1bb9677712293e9195f96a0-1058x1034.png" style="width: 100%;" /><p>So we set out to figure out how to move this into the background.</p><p>We then switched to giving the agent a tool called <code>call_external_tool</code> with two arguments: the <code>tool_name</code> and the stringified JSON tool arguments. This external tool call would return nothing, but importantly, would be visible in the <code>GET</code> request to the conversations API. We then gave the runners permission to write documents directly to Elasticsearch, which the Elastic Agent Builder agent could retrieve as needed. The agent is always operating in response to a user message, so we need to kick-start the agent with a user message so that it will go look for results and continue processing. So we had the agents insert a small message into the chat to resume the conversation:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/61ff59a57c68ed5fad492d19c0580644113a507d-1600x1321.png" style="width: 100%;" /><p>So now we had external tool calls. However, because of the second problem mentioned above, we had to get rid of that final kick-start part. Otherwise, every external tool call required a full conversation round to retrieve the results!</p><h2><strong>Making it great: Introducing workflows</strong></h2><p>In addition to Elasticsearch Query Language (ES|QL) and index search tool calls, Agent Builder agents can call Elastic workflow-based tools. Elastic workflows provide a flexible, easy-to-manage way to execute arbitrary sequences of actions and logic. For our purposes, all we need the workflow to do is store an external tool request in Elasticsearch and return an ID for polling the results. This results in the following simple workflow definition:</p><pre><code>name: ai-tool-call
enabled: true
triggers:
  - type: manual
inputs:
  - name: runner_id
    type: string
  - name: tool_calls
    type: string

steps:
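  # Persist the tool-call request; runners poll this index for documents with status "unhandled"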
  - name: store_request
    type: elasticsearch.create
    with:
      index: distributed-tool-requests
      id: "{{inputs.runner_id}}_{{ execution.id }}"
      document:
        request_id: "{{ execution.id }}"
        runner_id: "{{inputs.runner_id}}"
        tool_call: "{{inputs.tool_calls}}"
        status: "unhandled"

  - name: output_result
    type: console
    with:
      message: "Called tool, with execution id: {{ execution.id }}. Use this ID to poll the results."</code></pre><p>With that, instead of relying on the tool call request being written into the conversation, the runners can just poll the Elasticsearch <code>distributed-tool-requests</code> index for new external tool requests and report the results back into another Elasticsearch index with the provided <code>execution.id</code>.</p><p>This eliminates the two main issues mentioned above:</p><ol><li>The conversation history isn’t cluttered with the payload for the external tool calls anymore.</li><li>As the runners are polling the Elasticsearch index instead of the conversation history, they aren’t blocked by the conversation round to be completed for the external tool requests to become visible.</li></ol><p>The second point has the great advantage that processing of the external tool calls starts within the agent’s thinking phase (rather than when the conversation round has been completed). This allows us to instruct the LLM in the system prompt to poll for the external tool results until the results are available and eliminates the need for the kick-start message. Overall, this has the nice effect that the conversation feels more natural: The LLM can process multiple external tool requests within a single conversation round (instead of requiring one conversation round per tool request) and, thus, can accomplish more complex user requests in one go.</p><h2><strong>Putting it all together</strong></h2><p>To bridge the gap between the LLM and the server rack, we developed a specific architecture using Agent Builder’s tool capabilities:</p><ol><li><strong>Augmented Infrastructure runners:</strong> We deployed lightweight runners inside the target environments (servers, Kubernetes clusters, cloud accounts). These runners are connected directly to Elastic, using secured endpoints and secrets only available to each of the runners.</li><li><strong>ES|QL retrieval:</strong> The copilot uses Elastic’s <strong>ES|QL</strong> to perform hybrid searches. It doesn&#x27;t just search for knowledge; it searches for <em>capabilities</em>. It queries the connected runners to see which tools are available (for example, <code>list_ec2_instances</code>, <code>install_helm_chart</code>).</li><li><strong>Workflow execution:</strong> Once the agent decides on a course of action, it creates a structured workflow.</li><li><strong>Feedback loop:</strong> The runners execute the command locally and report the results back into Elasticsearch. The copilot reads the result from the index and decides the next step.</li></ol><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/76be256da722c1965971fc506502768bd890f0c4-1290x1076.png" style="width: 100%;" /><h2><strong>The demo: From outage to observability</strong></h2><a href="https://videos.elastic.co/watch/bRgPJfJLeABjPMMhvy1sbB">Watch video on Vidyard</a><p>In the video, we showcased two distinct scenarios demonstrating the power of this architecture.</p><h3><strong>Scenario 1: DevOps rescue</strong></h3><p>We started with a user panicking about a $5-million outage caused by a blind spot in their Kubernetes cluster.</p><ul><li><strong>The request:</strong> &quot;How do I make sure this doesn&#x27;t happen again?&quot;</li><li><strong>The action:</strong> The agent didn&#x27;t just provide a tutorial. 
It identified the cluster, created the necessary namespaces, generated Kubernetes secrets, installed the OpenTelemetry Operator, and instantly provided a link to a live APM dashboard.</li><li><strong>The result:</strong> Full Kubernetes observability and application insights without the user writing a single line of YAML.</li></ul><h3><strong>Scenario 2: Security handoff</strong></h3><p>A fundamental rule of infrastructure security is that you cannot protect what you cannot see. While performing our DevOps rescue, the agent sees an opportunity to improve the security of the environment.</p><p>With an alert kicked off from a previous Elastic Observability–related investigation, we demonstrate how a security practitioner can chat directly with their infrastructure: first, to enumerate the assets and resources in their cloud environment; and second, to deploy the tools necessary to ensure that the environment is secured.</p><ul><li><strong>Discovery:</strong> The copilot enumerated AWS resources for the security practitioner and identified a critical gap: an Amazon Elastic Compute Cloud (EC2) instance and an Amazon Elastic Kubernetes Service (EKS) cluster with public endpoints missing endpoint protection.</li><li><strong>Remediation:</strong> With a simple approval, the copilot deployed <strong>Elastic Security</strong> <strong>extended detection and response (XDR) and cloud detection and response (CDR)</strong> to the vulnerable assets, securing the environment in real time.</li><li><strong>The result:</strong> Protection of deployed AWS assets and resources with complete runtime security.</li></ul><h2><strong>The future: Augmented everything</strong></h2><p>This project proves that Elastic Agent Builder can be the central brain for distributed operations. We aren&#x27;t limited to just infrastructure. Our runner technology can power:</p><ul><li><strong>Augmented synthetics:</strong> Diagnosing TLS errors across global runners.</li><li><strong>Augmented development:</strong> Creating pull requests and implementing CAPTCHAs on frontend services.</li><li><strong>Augmented operations:</strong> Automatically reconfiguring DNS resolvers during an outage.</li></ul><h2><strong>Try it yourself</strong></h2><p>We believe the future of AI isn&#x27;t just about chat support; it&#x27;s about <strong>Augmented Infrastructure</strong>. It’s about having a partner that can deploy, fix, observe, and protect alongside you.</p><p>Check out the code and try it for yourself with distributed runners (<a href="https://github.com/strawgate/augmented-infrastructure">GitHub</a>) plus Elastic Agent Builder on <a href="https://cloud.elastic.co/">Elastic Cloud Serverless</a> today!</p><ul><li>Create a serverless project on Elastic Cloud.</li><li>Deploy the code to a runner.</li><li>Set up the runner.</li><li>Configure your mcp.json.</li><li>Start the runner, which will create your agent and its tools automatically.</li><li>Chat with an agent that can reason, plan, and execute actions on your distributed runners!</li></ul><p><strong>The team: </strong><em>Alex, Bill, Gil, Graham, &amp; Norrie</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Monitoring LLM inference and Agent Builder with OpenRouter]]></title>
            <description><![CDATA[Learn how to monitor LLM usage, costs, and performance across Agent Builder and inference pipelines using OpenRouter's OpenTelemetry broadcast and Elastic APM.]]></description>
            <link>https://www.elastic.co/search-labs/blog/llm-monitoring-openrouter-agent-builder</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/llm-monitoring-openrouter-agent-builder</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d9744a1f78863313de34472dff4696ea3cf82930-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>New models are released every week that surpass previous ones in intelligence, speed, or cost. This makes vendor lock-in risky and managing multiple connectors, billing accounts, and APIs unnecessarily complex. Each model behaves differently in terms of token consumption, response latency, and compatibility with specific tool sets.</p><p>In this article, we&#x27;ll build an AI-enriched audio products catalog, connect it to Elastic Agent Builder, and use OpenRouter to access different models while monitoring their performance across the entire workflow, from data ingestion to agent interactions.</p><h2>Prerequisites</h2><ul><li>Elastic Cloud 9.2 or Elastic Cloud Serverless</li><li>Integrations server with APM enabled</li><li>OpenRouter account and <a href="https://openrouter.ai/docs/api/reference/authentication">API Key</a></li><li>Python 3.9+</li></ul><h2>What is OpenRouter?</h2><p><a href="https://openrouter.ai/">OpenRouter</a> is a platform that unifies access to over 500 models from multiple providers through a single account and API. Instead of managing separate accounts for OpenAI, Anthropic, Google, and others, you access all of them through OpenRouter.</p><p>OpenRouter handles load balancing across providers, automatically routing requests to the provider with the best latency and fewest errors. You can also manually select providers or configure fallback chains. OpenRouter is compatible with standard APIs, code assistants, integrated development environments (IDEs), and more.</p><p>One key feature is <a href="https://openrouter.ai/docs/guides/features/broadcast/overview">Broadcast</a>, which sends traces of your model usage to external observability systems. Since OpenRouter supports OpenTelemetry, we can monitor our complete pipeline plus any other OpenRouter usage costs in the Elastic Stack.</p><h2>Architecture overview</h2><p>We&#x27;ll use an audio products catalog for which we generate new fields with AI using an inference ingest pipeline and then create an agent that can answer questions based on the indexed product data.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/03f9811dcd883b3c3302079369e690fc4d09350f-1156x491.png" style="width: 100%;" /><p>When ingesting data, the ingest pipeline uses an OpenRouter <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put">inference endpoint</a> to generate new attribute fields based on the unstructured descriptions of the products, which triggers OpenRouter to send logs about that inference to Elasticsearch.</p><p>Similarly, when chatting with the <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Agent Builder</a> that uses this data, logs are also sent to Elasticsearch for visualization.</p><p>We’ll use separate OpenRouter API keys for Agent Builder and ingestion:</p><ul><li><code>OPENROUTER_API_KEY</code> for Agent Builder interactions</li><li><code>OPENROUTER_INGESTION_KEY</code> for the inference pipeline</li></ul><p>This allows us to differentiate traffic in monitoring dashboards and attribute costs to specific workflows.</p><h2>Setup</h2><p>First, we need to create an <a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/gen-ai-connectors">AI connector</a> for the agent to interact with the large language model (LLM) and an inference endpoint for the ingest pipeline to extract fields from descriptions. 
Both connect to OpenRouter using the same API (but can use different keys for monitoring separation).</p><h3><strong>Create the AI connector</strong></h3><p>The AI connector allows Agent Builder to communicate with LLMs. We configure it to use OpenRouter as the provider:</p><pre><code>import requests
import os

ELASTIC_URL = os.getenv("ELASTIC_URL")
KIBANA_URL = os.environ["KIBANA_URL"]
ELASTIC_API_KEY = os.environ["ELASTIC_API_KEY"]
OPENROUTER_API_KEY = os.environ["OPENROUTER_AGENT_KEY"]
OPENROUTER_INGESTION_KEY = os.environ.get("OPENROUTER_INGESTION_KEY", OPENROUTER_API_KEY)

# Create AI Connector for Agent Builder
connector_payload = {
    "name": "OpenRouter Agent Connector",
    "connector_type_id": ".gen-ai",
    "config": {
        "apiProvider": "Other",
        "apiUrl": "https://openrouter.ai/api/v1/chat/completions",
        "defaultModel": "openai/gpt-5.2",
        "enableNativeFunctionCalling": True
    },
    "secrets": {
        "apiKey": OPENROUTER_API_KEY
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/actions/connector",
    headers={
        "kbn-xsrf": "true",
        "Authorization": f"ApiKey {ELASTIC_API_KEY}",
        "Content-Type": "application/json"
    },
    json=connector_payload
)

connector = response.json()
print(f"Connector created: {connector['id']}")</code></pre><p>We use a reasoning-capable model, like <a href="https://openrouter.ai/openai/gpt-5.2">GPT-5.2</a>, for the agent since it needs to handle complex queries and tool orchestration.</p><h3><strong>Create the inference endpoint</strong></h3><p>The inference endpoint allows Elasticsearch to call LLMs during data processing:</p><pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=[ELASTIC_URL],
    api_key=ELASTIC_API_KEY,
    request_timeout=60  # Higher timeout for inference operations
)

# Create inference endpoint for ingestion
inference_config = {
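    # The "openai" service speaks the OpenAI-compatible API, so it can point at OpenRouter's endpoint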
    "service": "openai",
    "service_settings": {
        "model_id": "openai/gpt-4.1-mini",
        "api_key": OPENROUTER_INGESTION_KEY,
        "url": "https://openrouter.ai/api/v1/chat/completions"
    }
}

response = es.inference.put(
    inference_id="openrouter-inference-endpoint",
    task_type="completion",
    body=inference_config
)

print(f"Inference endpoint created: {response['inference_id']}")</code></pre><p>We use a fast, cheaper model, like <a href="https://openrouter.ai/openai/gpt-4.1-mini">GPT-4.1 Mini</a>, for bulk ingestion tasks that don&#x27;t require advanced reasoning capabilities.</p><h2>Data pipeline</h2><p>Let&#x27;s configure the <a href="https://www.elastic.co/docs/manage-data/ingest/transform-enrich/ingest-pipelines">ingest pipeline</a>. It will read from the product description field and extract structured categories that Agent Builder can use for filtering and aggregations.</p><p>For example, given this product description:</p><p>&quot;<em>Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.</em>&quot;</p><p>We can extract:</p><ul><li><strong>Category:</strong> Headphones</li><li><strong>Features:</strong> [&quot;wireless&quot;, &quot;noise_cancellation&quot;, &quot;long_battery&quot;]</li><li><strong>Use case:</strong> Travel</li></ul><p>The key is providing possible values to the LLM as an enum so it groups consistently. Otherwise, we might get variations like &quot;noise cancellation&quot;, &quot;ANC&quot;, and &quot;noise-canceling&quot; that are harder to aggregate.</p><pre><code># Define the extraction prompt
EXTRACTION_PROMPT = (
    "Extract audio product information from this description. "
    "Return raw JSON only, no markdown, no explanation. Fields: "
    "category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), "
    "features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), "
    "use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). "
    "Description: "
)

# Create the enrichment pipeline
pipeline_config = {
    "processors": [
        {
            "script": {
                "source": f"ctx.prompt = '{EXTRACTION_PROMPT}' + ctx.description"
            }
        },
        {
            "inference": {
                "model_id": "openrouter-inference-endpoint",
                "input_output": {
                    "input_field": "prompt",
                    "output_field": "ai_response"
                }
            }
        },
        {
            "json": {
                "field": "ai_response",
                "add_to_root": True  # Parses JSON and adds fields to document root
            }
        },
        {
            "remove": {
                "field": ["prompt", "ai_response"]
            }
        }
    ]
}

es.ingest.put_pipeline(
    id="product-enrichment-pipeline",
    body=pipeline_config
)

print("Pipeline created: product-enrichment-pipeline")</code></pre><p>After using OpenAI to extract a JSON with the new attributes, we use the <a href="https://www.elastic.co/search-labs/blog/elasticsearch-parse-json-field-ingest-pipeline">json processor</a> to spread them into new fields.</p><p>Now let&#x27;s index some sample audio products:</p><pre><code># Sample audio product data
products = [
    {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
    },
    {
        "name": "Portable Bluetooth Speaker",
        "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
        "price": 149.99
    },
    {
        "name": "Studio Condenser Microphone",
        "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
        "price": 199.99
    }
]

# Create index with mapping
es.indices.create(
    index="products-enriched",
    body={
        "mappings": {
            "properties": {
                "name": {"type": "text"},
                "description": {"type": "text"},
                "price": {"type": "float"},
                "category": {"type": "keyword"},
                "features": {"type": "keyword"},
                "use_case": {"type": "keyword"}
            }
        }
    },
    ignore=400  # Ignore if already exists
)

# Index products using the enrichment pipeline
for i, product in enumerate(products):
    es.index(
        index="products-enriched",
        id=i,
        body=product,
        pipeline="product-enrichment-pipeline"
    )
    print(f"Indexed: {product['name']}")

# Refresh to make documents searchable
es.indices.refresh(index="products-enriched")</code></pre><h2>Agent Builder</h2><p>Now we can create an Agent Builder agent to use this index and answer both text questions and analytical queries using the new fields we&#x27;ve created:</p><pre><code># Create Agent Builder agent
agent_payload = {
    "id": "audio-product-assistant",
    "name": "Audio Product Assistant",
    "description": "Answers questions about audio product catalog using semantic search and analytics",
    "labels": ["audio"],
    "avatar_color": "#BFDBFF",
    "avatar_symbol": "AU",
    "configuration": {
        "tools": [
            {
                "tool_ids": [
                    "platform.core.search",
                    "platform.core.list_indices",
                    "platform.core.get_index_mapping",
                    "platform.core.execute_esql"
                ]
            }
        ],
        "instructions": """You are an audio product assistant that helps users find and analyze audio equipment.

Use the products-enriched index for all queries. The extracted fields are:
- category: Headphones, Earbuds, Speakers, Microphones, or Accessories
- features: array of product features like wireless, noise_cancellation, long_battery
- use_case: Travel, Office, Home, Fitness, Gaming, or Studio

For analytical questions, use ESQL to aggregate data.
For product searches, use semantic search on the description field."""
    }
}

response = requests.post(
    f"{KIBANA_URL}/api/agent_builder/agents",
    headers={
        "kbn-xsrf": "true",
        "Authorization": f"ApiKey {ELASTIC_API_KEY}",
        "Content-Type": "application/json"
    },
    json=agent_payload
)

agent = response.json()
print(f"Agent created: {agent['id']}")</code></pre><p>For tools, we use search for semantic queries and Elasticsearch Query Language (ES|QL) for analytical queries:</p><p>Now you can chat with your agent and ask questions like:</p><ul><li>&quot;What headphones do we have for travel?&quot;</li><li>&quot;Show me products with noise cancellation under $200&quot;</li><li>&quot;What&#x27;s the average price by category?&quot;</li></ul><p>The agent uses the AI-enriched fields to provide better filtering and aggregations.</p><h2>Implementing OpenRouter Broadcast</h2><p>Now let&#x27;s set up inference monitoring. First, we need our OpenTelemetry endpoint URL. Navigate to the APM tutorial in Kibana:</p><pre><code>https://<your_kibana_url>/app/observabilityOnboarding/otel-apm/?category=application</code></pre><p>Collect the URL and authentication token from the OpenTelemetry tab:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0aa268bbee53da9059bbdd12b4ccb5ace0764b1d-1348x1292.jpg" style="width: 100%;" /><p><strong>Important:</strong> Your Kibana server needs to be reachable via the public internet so that it can receive data from OpenRouter.</p><p>In OpenRouter, go to <a href="https://openrouter.ai/settings/broadcast">Broadcast settings</a> and add a new destination for the &quot;OpenTelemetry Collector&quot;:<br/></p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/20fc11b904ced2f7c9b26da3a5f809ec2ad3e8c9-752x447.jpg" style="width: 100%;" /><p><strong>Important: </strong>Configure the endpoint with the <code>/v1/traces</code> path and authentication headers:</p><pre><code>Endpoint: https://xxxxx.ingest.us-east-2.aws.elastic-cloud.com:443/v1/traces

Headers: {"Authorization": "Bearer YOUR_APM_SECRET_TOKEN"}</code></pre><p>Press <strong>Test connection</strong>, and you should see a success message.</p><h2>Monitoring in Elastic</h2><p>After using OpenRouter models, you should start seeing documents in Kibana. The indexed documents are in the data stream <code>traces-generic.otel-default</code> with <code>service.name: &quot;openrouter&quot;</code> and include information about:</p><ul><li>Request and response details.</li><li>Token usage (prompt, completion, total).</li><li>Cost (in USD).</li><li>Latency (time to first token, total).</li><li>Model information.</li></ul><p>From now on, the activity of the inference pipeline and Agent Builder in relation to LLM usage will be recorded in OpenRouter and sent to Elastic.</p><h3><strong>Default APM dashboards</strong></h3><p>You can view the default dashboards in Kibana under Observability &gt; Applications &gt; Service Inventory &gt; openrouter:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/d5d79419b51de2c9b21c383e8f8326b81d47f2f0-1600x1039.png" style="width: 100%;" /><p>The service view shows:</p><ul><li><strong>Latency:</strong> Average response times across all calls.</li><li><strong>Throughput:</strong> Requests per minute.</li><li><strong>Failed transactions:</strong> Error rates.</li><li><strong>Transactions:</strong> Breakdown by operation type.</li></ul><h3><strong>Custom LLM monitoring dashboard</strong></h3><p>For more control over the information displayed, you can create a custom dashboard. We created one that differentiates ingestion from agent chat and measures relevant parameters, like token usage and cost, plus usage outside Elastic, like coding assistants via API key filters:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/753d942ea7721e2e24074e87d3e535a637f9e707-1600x1089.jpg" style="width: 100%;" /><p>The dashboard shows:</p><ul><li>Success rate by workflow type.</li><li>Token usage by model.</li><li>Cost breakdown by API key.</li><li>Latency trends over time.</li><li>Model comparison metrics.</li></ul><p>You can download the dashboard <a href="https://gist.github.com/llermaly/7ca1ae96613c3449291b4e5d16810ba1">here</a> and import it to your Kibana instance using <a href="https://www.elastic.co/docs/explore-analyze/find-and-organize/saved-objects#saved-objects-import-and-export">Saved Objects import</a>.</p><h2>Conclusion</h2><p>OpenRouter lets you move quickly and test multiple models and providers using the same API and billing account, making it convenient to compare models of different types—large parameter, small parameter, commercial, open source, and more.</p><p>Using OpenRouter Broadcast, we can effortlessly monitor the performance of these models during ingestion via ingest pipelines or chat via Agent Builder, plus combine it with OpenRouter usage for other purposes, like coding agents and apps.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Context engineering vs. prompt engineering]]></title>
            <description><![CDATA[Learn how context engineering and prompt engineering differ and why mastering both is essential for building production AI agents and RAG systems.]]></description>
            <link>https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Tomás Murúa]]></dc:creator>
            <pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3c91bde403096645057988e83cfe460d82fd0a6f-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the early days of web development, web design was a single discipline encompassing everything from visual aesthetics to user workflows. As the industry matured, it split into two different fields: user interface (UI) and user experience (UX). They remain interconnected, but each requires distinct expertise and tools.</p><p>We&#x27;re seeing a similar split in AI. Since ChatGPT&#x27;s launch in November 2022, the industry has focused on improving prompts to optimize large language model (LLM) interactions. As we build more sophisticated AI systems, particularly agents and Model Context Protocol (MCP) tools that need access to external knowledge, two distinct disciplines are emerging: prompt engineering and context engineering. While they work together, they address fundamentally different challenges.</p><p>Prompt engineering focuses on how you communicate with the model. Context engineering focuses on what information the model has access to when it generates responses.</p><h2>What is prompt engineering?</h2><p>A <em>prompt </em>is an input to a generative AI model used to guide its output. Prompts can consist of text, images, sound, or other media.</p><p><em>Prompt engineering</em> is the iterative process of refining how you communicate with the model to achieve better results. It focuses on the words, structure, and techniques you use within a single interaction.</p><p>Here are some examples of prompt engineering techniques:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f38222691de873b792629949afd2a6e61323d573-1600x1192.png" style="width: 100%;" /><p>Common techniques include <em>few-shot prompting</em> (providing examples), <em>chain-of-thought reasoning</em> (asking the model to show its work), and <em>role assignment</em> (giving the model a persona). These techniques address challenges like ambiguity, where a question can be interpreted multiple ways and the model must guess which interpretation the user intended.</p><p>A key challenge in prompt engineering is finding what Anthropic calls &quot;the right altitude&quot; for instructions. At one extreme, engineers hardcode complex, brittle logic in their prompts to anticipate every scenario. This creates fragility and maintenance overhead. At the other extreme, engineers provide vague guidance that fails to give the model concrete signals or that falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior, yet flexible enough to let the model apply good judgment.</p><p>Prompt engineering typically operates at the single-turn level, optimizing how you phrase one query to get the best response. This works well for straightforward interactions but reaches its limits when tasks require external knowledge, persistent state, or multistep reasoning.</p><p>For a deeper exploration of prompt engineering techniques, refer to <a href="https://arxiv.org/pdf/2406.06608">The Prompt Report: A Systematic Survey of Prompting Techniques</a>.</p><h2>What is context engineering?</h2><p><em>Context engineering</em> is the broader discipline of curating and maintaining the optimal set of tokens during LLM inference. 
While prompt engineering asks, &quot;How should I phrase this?&quot; context engineering asks, &quot;What information does the model need access to right now?&quot;</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f42d4980623088cfcf0c26c9acce8577bb5660ed-1600x1326.png" style="width: 100%;" /><p>As the <a href="https://github.com/humanlayer/12-factor-agents">12-Factor Agents framework</a> explains, LLMs are stateless functions that turn inputs into outputs. At any given point, your input to an LLM is essentially, &quot;Here&#x27;s what&#x27;s happened so far. What&#x27;s the next step?&quot; Every interaction becomes context:</p><ul><li>The prompt and instructions you give to the model.</li><li>Documents or external data you retrieve through retrieval-augmented generation (RAG).</li><li>Past state, tool calls, or other history.</li><li>Instructions about structured data output formats.</li></ul><p>The distinction matters because modern AI agents don&#x27;t operate in single turns. An agent running in a loop generates an ever-expanding universe of information: tool outputs, retrieved documents, conversation history, intermediate reasoning. Context engineering is the practice of deciding what from that universe makes it into the model&#x27;s limited working memory at any given moment.</p><p>For a comprehensive exploration of context engineering components and best practices, refer to <a href="https://www.elastic.co/search-labs/blog/context-engineering-overview#what-is-context-engineering?">What is context engineering?</a> and <a href="https://www.elastic.co/search-labs/blog/context-engineering-hybrid-search-evolution-agentic-ai">You Know, for Context</a>.</p><h2>Key differences: Prompt engineering vs. context engineering</h2>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Dimension</td><td style="border: 1px solid #ccc; padding: 4px;">Prompt engineering</td><td style="border: 1px solid #ccc; padding: 4px;">Context engineering</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Core question</td><td style="border: 1px solid #ccc; padding: 4px;">"How should I phrase this?"</td><td style="border: 1px solid #ccc; padding: 4px;">"What does the model need to know?"</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Scope</td><td style="border: 1px solid #ccc; padding: 4px;">Single query</td><td style="border: 1px solid #ccc; padding: 4px;">System-wide information flow</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Failure mode</td><td style="border: 1px solid #ccc; padding: 4px;">Ambiguity: poorly phrased instructions lead to misinterpretation</td><td style="border: 1px solid #ccc; padding: 4px;">Retrieval problems: wrong documents, stale information, or context overflow</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Tools</td><td style="border: 1px solid #ccc; padding: 4px;">Describes desired output</td><td style="border: 1px solid #ccc; padding: 4px;">Selects and sequences tools</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Debugging approach</td><td style="border: 1px solid #ccc; padding: 4px;">Linguistic precision: refine wording, add examples</td><td style="border: 1px solid #ccc; padding: 4px;">Data architecture: tune retrieval, prune context, sequence tools</td>
                    </tr>
              </tbody>
            </table>
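          <p>To make the split concrete, here is a minimal sketch of a single agent turn in Python. The names <code>SYSTEM_INSTRUCTIONS</code>, <code>history</code>, <code>retrieve</code>, and <code>call_llm</code> are hypothetical placeholders, not a real API: prompt engineering shapes the instruction text, while context engineering decides everything else that enters the window.</p><pre><code># Minimal sketch of one agent turn; all names below are hypothetical placeholders.
def next_step(user_message: str) -> str:
    context = [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},  # prompt engineering: the wording
        *history(),               # context engineering: which past turns and tool calls persist
        *retrieve(user_message),  # context engineering: which documents are worth the tokens
        {"role": "user", "content": user_message},
    ]
    # The model sees only what was assembled above: "Here is what has
    # happened so far. What is the next step?"
    return call_llm(context)</code></pre>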
          <h3><strong>Single-turn vs. multi-turn</strong></h3><p>Prompt engineering optimizes one interaction. Context engineering thinks in sequences: What did previous turns establish? What tool outputs carry forward? What should persist three steps from now? As tasks move from simple question-answering to multistep agent workflows, context engineering becomes the dominant challenge.</p><h3><strong>Context window management</strong></h3><p>Context engineering manages a finite resource with <a href="https://www.elastic.co/search-labs/blog/context-engineering-llm-evolution-agentic-ai">three failure modes</a>:</p><ul><li><strong>Too little information</strong> leads to hallucination or bad responses. When the LLM lacks enough context, it cannot establish the semantic grounding needed to generate accurate responses.</li><li><strong>Too much information </strong>causes context overflow. This overwhelms the LLM&#x27;s attention span and lowers relevance across the whole context window, causing the model to struggle to identify which parts matter most.</li><li><strong>Distracting or conflicting information</strong> confuses the model. Larger context windows increase the chances of conflicting or irrelevant information that distracts the LLM.</li></ul><p><strong>Key distinction: </strong>Prompt engineering takes the context window as given. Context engineering actively curates it.</p><h3><strong>Tool orchestration</strong></h3><p>Prompt engineering can request tool use and describe what tools should do. Context engineering decides which tools to make available, what information to pass them, and how their outputs flow back into context.</p><p>One of the most common failure modes is bloated tool sets with overlapping functionality. If a human engineer can&#x27;t definitively say which tool should be used in a given situation, an AI agent can&#x27;t be expected to do better. Context engineering applies a clear principle: Curate the minimal viable set of tools. Each tool should be self-contained, robust to error, and unambiguous in its purpose. Tools should also be token-efficient, returning only the information necessary rather than everything available.</p><h3><strong>Just-in-time context vs. pre-retrieval</strong></h3><p>Traditional RAG systems preprocess and retrieve all potentially relevant data up front, loading it into the prompt before inference. Context engineering increasingly favors just-in-time strategies, like <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic’s Agent Skills</a>, which agents discover and dynamically load into the context.</p><p>Rather than loading everything up front, agents maintain lightweight references (file paths, stored queries, document IDs) and dynamically load data at runtime using tools. This mirrors human cognition: We don&#x27;t memorize entire books but maintain systems like file folders and bookmarks to retrieve information on demand.</p><p>The trade-off is speed versus precision. Pre-retrieval is faster but risks context overflow. Just-in-time retrieval is slower but keeps the context window focused. 
The most effective agents often use a hybrid approach: retrieving essential baseline context up front while enabling further exploration as needed.</p><h2>Practical example: Book recommendation agent</h2><p>To demonstrate how prompt engineering and context engineering work together, we built a book recommendation agent using <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> with a <a href="https://www.kaggle.com/datasets/elvinrustam/books-dataset">dataset</a> of 103,063 books indexed in Elasticsearch.</p><p><strong>Setup:</strong></p><ul><li><strong>Index: </strong><code>books-dataset</code> with 103,063 documents</li><li><strong>Fields: </strong>Title, Authors, Description, Category, Publisher, Price, Published Date</li><li><strong>Tools: </strong>Agent Builder predetermined tools</li><li><strong>Model: </strong><a href="https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm">Elastic Managed LLM</a></li></ul><p><strong>Mappings:</strong></p><pre><code>{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "Authors": { "type": "text" },
      "Category": { "type": "text" },
      "Description": { "type": "text" },
      "Price Starting With ($)": { "type": "double" },
      "Published Date": { "type": "date", "format": "iso8601" },
      "Publisher": { "type": "text" },
      "Title": { "type": "text" }
    }
  }
 }</code></pre><p>We tested three scenarios to show different outcomes based on prompt quality and context management.</p><h3><strong>Scenario 1: Prompt engineering failure (ambiguity)</strong></h3><ul><li><strong>User prompt: </strong>“Recommend a good book”</li></ul><p>The agent searched for &quot;highly rated popular books&quot; and returned results about Labrador retrievers and a Paul Reiser comedy book, neither matching typical &quot;good book&quot; expectations.</p><ul><li><strong>Problem: </strong>The agent had to guess what “good” means without any filtering criteria. The LLM interpreted the request based on its assumption of what a “good” book is rather than on the user preferences.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/11166e428139d0bd552c145980c893ba65f3c882-879x730.png" style="width: 100%;" /><h3><strong>Scenario 2: Context engineering failure (too much information)</strong></h3><ul><li><strong>User prompt: </strong>“Retrieve all books from the database&quot;</li></ul><p>Elasticsearch Query Language (ES|QL) query generated:</p><pre><code>FROM books-dataset 
| LIMIT 100</code></pre><ul><li><strong>Context retrieved:</strong> 100 random books across all categories (cooking, history, and fiction all mixed together)</li><li><strong>Problem: </strong>Too much unfiltered information. The agent brought in excessive context that made finding a relevant book difficult, and the answer is incomplete.</li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/0b5fc4e5dd2fca32d6601190661b2383cd6ebb4d-937x1118.png" style="width: 100%;" /><h3><strong>Scenario 3: Both disciplines working together</strong></h3><ul><li><strong>User prompt: </strong>“I enjoy science and fantasy fiction like The Lord of the Rings or Foundation. Find books that match these preferences.”</li></ul><p>The agent executed targeted searches, retrieving relevant titles: <em>The Return of the King</em>, <em>Dune: House Corrino</em>, <em>Far Horizons</em> (a collection featuring <em>Foundation</em> and <em>Dune</em> universe stories).</p><ul><li><strong>Search query: </strong>&quot;science fiction and fantasy books similar to The Lord of the Rings or Foundation&quot;<br/></li></ul><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/19d2415c3f00374317f1116049d03656fcbc0024-944x1118.png" style="width: 100%;" /><h3><strong>Agent reasoning</strong></h3><p>The agent demonstrated context engineering through smart tool usage and focused retrieval:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b74faabcfcfef4b995d056b029637f1851db5132-845x532.jpg" style="width: 100%;" /><p>The agent used <code>platform.core.search</code> to query the <code>books-dataset</code> index with a targeted search: “science fiction fantasy books similar to The Lord of the Rings Foundation”. From 103,063 documents, it retrieved only the most relevant matches.</p><h3><strong>Why it worked</strong></h3><ul><li><strong>Prompt engineering: </strong>Clear genre specification and concrete examples (<em>Lord of the Rings</em>, <em>Foundation</em>) eliminated ambiguity.</li><li><strong>Context engineering: </strong>Focused retrieval bringing only relevant books, maintaining a manageable context window despite the dataset having 103,063 entries.</li></ul><p>The agent used the same tools in all three cases, but input quality determined how effectively those tools retrieved relevant context.</p><h2>Conclusion</h2><p>Prompt engineering and context engineering are distinct but complementary disciplines. What started as a general practice of prompting is splitting into specialized fields requiring different expertise, much like the UI/UX split in web development.</p><p>For straightforward question-answering, prompt engineering skills may be sufficient. But as systems grow more complex, adding retrieval, tools, and multistep reasoning, context engineering becomes the dominant challenge. Teams building production AI systems need both skill sets, and increasingly, they need practitioners who understand how the two disciplines interact.</p><p>To dive deeper into context engineering strategies for AI agents, including hybrid retrieval, semantic chunking, and agentic search patterns, see <a href="https://www.elastic.co/search-labs/blog/context-engineering-relevance-ai-agents-elasticsearch">The impact of relevance in context engineering for AI agents</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Faster ES|QL stats with Swiss-style hash tables]]></title>
            <description><![CDATA[How Swiss-inspired hashing and SIMD-friendly design deliver consistent, measurable speedups in Elasticsearch Query Language (ES|QL).]]></description>
            <link>https://www.elastic.co/search-labs/blog/esql-swiss-hash-stats</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/esql-swiss-hash-stats</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Chris Hegarty, Matthew Alp, Nik Everett]]></dc:creator>
            <pubDate>Mon, 19 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/21036e031070f14faccb2b53b22723de2750c391-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
<content:encoded><![CDATA[<p>We recently replaced key parts of Elasticsearch’s hash table implementation with a Swiss-style design and observed up to 2–3x faster build and iteration times on uniform, high-cardinality workloads. The result is lower latency, better throughput, and more predictable performance for Elasticsearch Query Language (ES|QL) stats and analytics operations.</p><h2>Why this matters</h2><p>Most typical analytical workflows eventually boil down to grouping data. Whether it’s computing average bytes per host, counting events per user, or aggregating metrics across dimensions, the core operation is the same — map keys to groups and update running aggregates.</p><p>At a small scale, almost any reasonable hash table works fine. At large scale (hundreds of millions of documents and millions of distinct groups), details start to matter. Load factors, probing strategy, memory layout, and cache behavior can make the difference between linear performance and a wall of cache misses.</p><p>Elasticsearch has supported these workloads for years, but we’re always looking for opportunities to modernize core algorithms. As such, we evaluated a newer approach inspired by Swiss tables and applied it to how ES|QL computes statistics.</p><h2>What are Swiss tables, really?</h2><p>Swiss tables are a family of modern hash tables popularized by Google’s SwissTable and later adopted in Abseil and other libraries.</p><p>Traditional hash tables spend a lot of time chasing pointers or loading keys just to discover that they don’t match. Swiss tables’ defining feature is a tiny cache-resident array, stored separately from the keys and values, called <em>control bytes</em>, that lets them reject most probes without touching key or value memory, dramatically reducing memory traffic.</p><p>Each control byte represents a single slot and, in our case, encodes two things: whether the slot is empty, and a short fingerprint derived from the hash. These control bytes are laid out contiguously in memory, typically in groups of 16, making them ideal for <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">single instruction, multiple data</a> (SIMD) processing.</p><p>Instead of probing one slot at a time, Swiss tables scan an entire control-byte block using vector instructions. In a single operation, the CPU compares the fingerprint of the incoming key against 16 slots and filters out empty entries. Only the few candidates that survive this fast path require loading and comparing the actual keys.</p><p>This design trades a small amount of extra metadata for much better cache locality and far fewer random loads. As the table grows and probe chains lengthen, those properties become increasingly valuable.</p><h2>SIMD at the center</h2><p>The real star of the show is SIMD.</p><p>Control bytes are not just compact; they’re also explicitly designed to be processed with vector instructions. A single SIMD compare can check 16 fingerprints at once, turning what would normally be a loop into a handful of wide operations. For example:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/bd418778f0c6144f8f5f18419f6220ac0c935c7a-903x407.png" style="width: 100%;" /><p>In practice, this means:</p><ul><li>Fewer branches.</li><li>Shorter probe chains.</li><li>Fewer loads from key and value memory.</li><li>Much better utilization of the CPU’s execution units.</li></ul><p>Most lookups never make it past the control-byte scan. When they do, the remaining work is focused and predictable. 
This is exactly the kind of workload that modern CPUs are good at.</p><h2>SIMD under the hood</h2><p>For readers who like to peek under the hood, here’s what happens when inserting a new key into the table. We use the Panama Vector API with 128-bit vectors, thus operating on 16 control bytes in parallel.</p><p>The following snippet shows the code generated on an Intel Rocket Lake with AVX-512. While the instructions reflect that environment, the design does not depend on AVX-512. The same high-level vector operations are emitted on other platforms using equivalent instructions (for example, AVX2, SSE, or NEON).</p><pre><code>; Load 16 control bytes from the control block
vmovdqu xmm0, XMMWORD PTR [r9+r10*1+0x10]

; Broadcast the 7-bit fingerprint of the new key across the vector
vpbroadcastb xmm1, r11d

; Compare all 16 control bytes to the new fingerprint
vpcmpeqb k7, xmm0, xmm1
kmovq rbx, k7

; Check if any matches were found
test rbx, rbx
jne <handle_match></code></pre><p>Each instruction has a clear role in the insertion process:</p><ul><li><code>vmovdqu</code>: Loads 16 consecutive control bytes into the 128-bit <code>xmm0</code> register.</li><li><code>vpbroadcastb</code>: Replicates the 7-bit fingerprint of the new key across all lanes of the <code>xmm1</code> register.</li><li><code>vpcmpeqb</code>: Compares each control byte against the broadcasted fingerprint, producing a mask of potential matches.</li><li><code>kmovq</code> + <code>test</code>: Moves the mask to a general-purpose register and quickly checks whether a match exists.</li></ul><p>Finally, we settled on probing groups of 16 control bytes at a time, as benchmarking showed that expanding to 32 or 64 bytes with wider registers provided no measurable performance benefit.</p><h2>Integration in ES|QL</h2><p>Adopting Swiss-style hashing in Elasticsearch was not a simple drop-in replacement. ES|QL has strong requirements around memory accounting, safety, and integration with the rest of the compute engine.</p><p>We integrated the new hash table tightly with Elasticsearch’s memory management, including the page recycler and circuit breaker accounting, ensuring that allocations remain visible and bounded. Elasticsearch&#x27;s aggregations are stored densely and indexed by a group ID, keeping the memory layout compact and fast for iteration, as well as enabling certain performance optimizations by allowing random access.</p><p>For variable-length byte keys, we cache the full hash alongside the group ID. This avoids recomputing expensive hash codes during probing and improves cache locality by keeping related metadata close together. During rehashing, we can rely on the cached hash and control bytes without inspecting the values themselves, keeping resizing costs low.</p><p>One important simplification in our implementation is that entries are never deleted. This removes the need for <em>tombstones</em> (markers to identify previously occupied slots) and allows empty slots to remain truly empty, which further improves probe behavior and keeps control-byte scans efficient.</p><p>The result is a design that fits naturally into Elasticsearch’s execution model while preserving the performance characteristics that make Swiss tables attractive.</p><h2>How does it perform?</h2><p>At small cardinalities, Swiss tables perform roughly on par with the existing implementation. This is expected: When tables are small, cache effects dominate less and there is little probing to optimize.</p><p>As cardinality increases, the picture changes quickly.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/24900afc47ab07b0e9933f6117b99d0f4613f794-962x599.png" style="width: 100%;" /><p>The heatmap above plots time improvement factors for different key sizes (8, 32, 64, and 128 bytes) across cardinalities from 1,000 up to 10,000,000 groups. As cardinality grows, the improvement factor steadily increases, reaching up to 2–3x for uniform distributions.</p><p>This trend is exactly what the design predicts. 
Higher cardinality leads to longer probe chains in traditional hash tables, while Swiss-style probing continues to resolve most lookups inside SIMD-friendly control-byte blocks.</p><h2>Cache behavior tells the story</h2><p>To better understand the speedups, we ran the same JMH <a href="https://github.com/elastic/elasticsearch/pull/139343/files#diff-d0e0cc91a7495bf36b2d44eacce95f5185d01879e5f6c38089ac7a89aad17da7"><code>benchmarks</code></a> under Linux <code>perf</code> and captured cache and TLB statistics.</p><p>Compared to the original implementation, the Swiss version issues about 60% fewer cache references overall. Last-level cache loads drop by more than 4x, and LLC load misses fall by over 6x. Since LLC misses often translate directly into main-memory accesses, this reduction alone explains a large portion of the end-to-end improvement.</p><p>Closer to the CPU, we see fewer L1 data cache misses and nearly 6x fewer data TLB misses, pointing to tighter spatial locality and more predictable memory access patterns.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/6e49b7609fba83e33692cb9834552b6ca7e42a83-998x499.png" style="width: 100%;" /><p>This is the practical payoff of SIMD-friendly control bytes. Instead of repeatedly loading keys and values from scattered memory locations, most probes are resolved by scanning a compact, cache-resident structure. Less memory touched means fewer misses, and fewer misses mean faster queries.</p><h2>Wrapping up</h2><p>By adopting a Swiss-style hash table design and leaning hard into SIMD-friendly probing, we achieved 2–3x speedups for high-cardinality ES|QL stats workloads, along with more stable and predictable performance.</p><p>This work highlights how modern CPU-aware data structures can unlock substantial gains, even for well-trodden problems like hash tables. There is more room to explore here, such as additional primitive-type specializations and applying the design to other high-cardinality paths like joins, all part of the broader, ongoing effort to modernize Elasticsearch internals.</p><p>If you’re interested in the details or want to follow the work, check out this <a href="https://github.com/elastic/elasticsearch/pull/139343">pull request</a> and <a href="https://github.com/elastic/elasticsearch/issues/138799">meta issue</a> tracking progress on GitHub.</p><p>Happy hashing!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Managing agentic memory with Elasticsearch]]></title>
            <description><![CDATA[Creating more context-aware and efficient agents by managing memories using Elasticsearch.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Someshwaran Mohankumar]]></dc:creator>
            <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/80907fd072e72d6ec902470b449c9f337957a0d7-1280x720.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In the emerging discipline of <strong>context engineering</strong>, giving AI agents the right information at the right time is crucial. One of the most important aspects of context engineering is managing an AI’s <strong>memory</strong>. Much like humans, AI systems rely on both a short-term memory and a long-term memory to recall information. If we want large language model (LLM) agents to carry on logical conversations, remember user preferences, or build on previous results or responses, we need to equip them with effective memory mechanisms.</p><p>After all, everything in the context influences the AI’s responses. <em>Garbage in, garbage out</em> holds true.</p><p>In this article, we’ll introduce what short-term and long-term memory mean for AI agents, specifically:</p><ul><li>The difference between short- and long-term memory.</li><li>How they relate to retrieval-augmented generation (RAG) techniques with vector databases, like Elasticsearch, and why careful memory management is necessary.</li><li>The risks of neglecting memory, including context overflow and context poisoning.</li><li>Best practices, like context pruning, summarizing, and retrieving only what’s relevant, to keep an agent’s memory both useful and safe.</li><li>Finally, we’ll touch on how memory can be shared and propagated in multi-agent systems to enable agents to collaborate without confusion using Elasticsearch.</li></ul><h2>Short-term versus long-term memory in AI agents</h2><p><strong><em>Short-term memory</em></strong> in an AI agent typically refers to the immediate conversational context or state—essentially, the current chat history or recent messages in the active session. This includes the user’s latest query and recent back-and-forth exchanges. It’s very similar to the information a person holds in mind during an ongoing conversation.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4fbcc6f68055b2bccefc4176297a4ca50056dc0d-764x498.png" style="width: 100%;" /><p>AI frameworks often maintain this transient memory as part of the agent’s state (for example, using a checkpointer to store the conversation state as covered by <a href="https://docs.langchain.com/oss/python/langgraph/persistence#checkpoints">this example from LangGraph</a>). Short-term memory is <strong><em>session-scoped</em></strong>; that is, it exists within a single conversation or task and is reset or cleared when that session ends, unless explicitly saved elsewhere. An example of session-bound short-term memory would be the <strong><a href="https://help.openai.com/en/articles/8914046-temporary-chat-faq">temporary chat</a></strong> available in ChatGPT.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/150bdf209cda5ed20b59cddf34e624ad1a8016aa-1100x577.png" style="width: 100%;" /><p><strong><em>Long-term memory</em></strong>, on the other hand, refers to information that persists <strong>across conversations or sessions</strong>. This is the knowledge an agent retains over time: facts it learned earlier, user preferences, or any data we’ve told it to remember permanently.</p><p>Long-term memory is usually implemented by storing and fetching it from an external source, such as a file or vector database that’s outside the immediate context window. Unlike short-term chat history, long-term memory isn’t automatically included in every prompt. 
Instead, depending on the scenario, the agent must <strong>recall</strong> or retrieve it by invoking the relevant tools. In practice, long-term memory might include a user’s profile info, prior answers or analyses the agent produced, or a knowledge base the agent can query.</p><p>For instance, if you have a travel-planner agent, the <em>short-term memory</em> would contain details of the current trip inquiry (dates, destination, budget) and any follow-up questions in that chat; whereas the <em>long-term memory</em> could store the user’s general travel preferences, past itineraries, and other facts shared in previous sessions. When the user returns later, the agent can pull from this long-term store (for example, the user loves beaches and mountains, has an average budget of INR 100,000, has a bucket list of places to visit, and prefers to experience history and culture rather than kid-friendly attractions) so that it doesn’t treat the user as a blank slate each time.</p><p>The short-term memory (chat history) provides immediate context and continuity, while long-term memory provides a broader context that the agent can draw upon when needed. Most advanced AI agent frameworks enable both: They keep track of recent dialogue to maintain context <em>and</em> offer mechanisms to look up or store information in a longer-term repository. Managing short-term memory ensures it stays within the context window, while managing long-term memory helps the agent ground its answers in prior interactions and personas.</p><h2>Memory and RAG in context engineering</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/46635aa11ceff89b8d6a26ac3e22da52407d82f3-1600x900.png" style="width: 100%;" /><p><strong><em>How do we give an AI agent a useful long-term memory in practice?</em></strong></p><p>One prominent approach for long-term memory is <strong><em>semantic memory</em></strong>, often implemented via <strong>retrieval-augmented generation (RAG)</strong>. This involves coupling the LLM with an external knowledge store or vector-enabled datastore, like Elasticsearch. When the LLM needs information beyond what’s in the prompt or its built-in training, it performs semantic retrieval against Elasticsearch and injects the most relevant results into the prompt as context. This way, the model’s effective context includes not only the recent conversation (short-term memory) but also pertinent long-term facts fetched on the fly. The LLM then grounds its answer on both its own reasoning and the retrieved information, effectively combining short-term memory and long-term memory to produce a more accurate, context-aware response.</p><p><strong>Elasticsearch </strong>can be used to implement long-term memory for AI agents. Here’s a high-level example of how context can be retrieved from Elasticsearch for long-term memory.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/41ccbc7b5171e8170ac300139a963c0708816ba6-1600x900.png" style="width: 100%;" /><p>This way, the agent “remembers” by searching for relevant data rather than by storing everything in its limited prompt, <strong>which introduces a different set of risks.</strong></p><p><strong>Using RAG with Elasticsearch or any vector store offers multiple benefits:</strong></p><p>First, it <strong>extends the knowledge</strong> of the model beyond its training cutoff. The agent can retrieve up-to-date information or domain-specific data that the LLM might not know. 
This is crucial for questions about recent events or specialized topics.</p><p>Second, retrieving context on demand helps reduce hallucinations, especially since LLMs aren’t trained on the proprietary or highly specialized data behind your niche use case, which makes hallucinations more likely. Instead of the LLM guessing or inventing new information, as it has been incentivized to do through evaluation, as highlighted in a recent OpenAI paper (<a href="https://arxiv.org/pdf/2509.04664">Why Language Models Hallucinate</a>), the model can be grounded by factual references from Elasticsearch. Naturally, preventing misinformation depends on the reliability of the data in the vector store and on retrieving the right data through sound relevance measures.</p><p>Third, RAG allows an agent to work with knowledge bases far larger than anything you could ever fit into a prompt. Instead of pushing entire documents, like long research papers or policy documents, into the context window and risking overload or <a href="https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch#context-poisoning">context poisoning</a> that derails the model’s reasoning with irrelevant information, RAG relies on <a href="https://www.elastic.co/search-labs/blog/chunking-strategies-elasticsearch">chunking</a>. Large documents are broken into smaller, semantically meaningful pieces, and the system retrieves only the few chunks most relevant to the query. This way, the model doesn’t need a million-token context to appear knowledgeable; it just needs access to the right chunks of a much larger corpus.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e6897356c9f0940e35a63d005e9cd20bc33e5dd7-1600x931.png" style="width: 100%;" /><p>It’s worth noting that as LLM context windows have grown (<a href="https://www.anthropic.com/news/1m-context">some models now support hundreds of thousands or even millions of tokens</a>), a debate arose about whether RAG is “dead.” Why not push all the data into the prompt? If you feel likewise, refer to this wonderful article by my colleagues, Jeffrey Rengifo and Eduard Martin, <a href="https://www.elastic.co/search-labs/blog/rag-vs-long-context-model-llm">Longer context ≠ better: Why RAG still matters</a>. Targeted retrieval avoids the “garbage in, garbage out” problem: The LLM stays focused on the few chunks that matter, rather than running through noise.</p><p>In any case, integrating Elasticsearch or any vector store into an AI agent architecture provides <strong>long-term memory</strong>. The agent stores knowledge externally and pulls it in as memory context when needed. This could be implemented as an <em>architecture</em>, where after each user query, the agent performs a search on Elasticsearch for relevant info and then appends the top results to the prompt before calling the LLM. The response might also be saved back into the long-term store if it contains useful new information (creating a feedback loop of learning). By using such retrieval-based memory, the agent remains informed and up to date, without having to cram everything it knows into every prompt, even if the context window supports <em>one million tokens</em>. This technique is a cornerstone of context engineering, combining the strengths of information retrieval and generative AI. 
(Refer to our <a href="https://github.com/someshwaranM/elastic-context-engineering-short-term-long-term-memory">supporting context engineering app</a>.)</p><pre><code># Initialize chat memory (Note: This is in-memory only, not persistent)
memory = MemorySaver()

# Create a LangGraph agent
langgraph_agent = create_react_agent(model=llm, tools=tools, checkpointer=memory)

...
...
# Only process and display checkpoints if verbose mode is enabled
if args.verbose:
    # List all checkpoints that match a given configuration
    checkpoints = memory.list({"configurable": {"thread_id": "1"}})
    # Process the checkpoints
    process_checkpoints(checkpoints)</code></pre>
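<p>For context, the checkpointer keys every checkpoint by a thread ID supplied at invocation time. A call shaped roughly like the following (a hypothetical snippet, not taken verbatim from the app) is what produces the checkpoints listed next:</p><pre><code># Hypothetical invocation: the thread_id ties all turns of one session together
config = {"configurable": {"thread_id": "1"}}
response = langgraph_agent.invoke(
    {"messages": [("user", "Hi I'm Som, how are you?")]},
    config=config,
)</code></pre>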
<p>Here’s how it stores <strong>checkpoints</strong>:</p><pre><code>Checkpoint:
Timestamp: 2025-12-30T09:19:41.691087+00:00
Checkpoint ID: 1f0e560a-c2fa-69ec-8001-14ee5373f9cf
User: Hi I'm Som, how are you? (Message ID: ad0a8415-5392-4a58-85ad-84154875bbf2)
Agent: Hi Som! I'm doing well, thank you! How about you? (Message ID: 
56d31efb-14e3-4148-806e-24a839799ece)
Agent:  (Message ID: lc_run--019b6e8e-553f-7b52-8796-a8b1fbb206a4-0)

Checkpoint:
Timestamp: 2025-12-30T09:19:40.350507+00:00
Checkpoint ID: 1f0e560a-b631-6a08-8000-7796d108109a
User: Hi I'm Som, how are you? (Message ID: ad0a8415-5392-4a58-85ad-84154875bbf2)
Agent: Hi Som! I'm doing well, thank you! How about you? (Message ID: 
56d31efb-14e3-4148-806e-24a839799ece)

Checkpoint:
Timestamp: 2025-12-30T09:19:40.349027+00:00
Checkpoint ID: 1f0e560a-b62e-6010-bfff-cbebe1d865f6</code></pre><p>For long-term memory, here&#x27;s how we perform semantic search on Elasticsearch to retrieve relevant previous conversations using vector embeddings, after summarizing and indexing the checkpoints into Elasticsearch.</p><pre><code># Enhanced Elasticsearch retrieval with rank_window and verbose display
def retrieve_from_elasticsearch(query: str, k: int = 5, rank_window: Optional[int] = None) -> tuple[List[Dict[str, Any]], str]:
    """
    Retrieve context from Elasticsearch with score-based ranking
    
    Args:
        query: Search query
        k: Number of results to return
        rank_window: Number of candidates to retrieve before ranking (default: args.rank_window)
        
    Returns:
        Tuple of (retrieved_documents, formatted_context_string)
    """
    if not es_client or not es_index_name:
        return [], "Elasticsearch is not available. Cannot search long-term memory."
    
    if rank_window is None:
        rank_window = args.rank_window
    
    try:
        # Check if index exists and has documents
        if not es_client.indices.exists(index=es_index_name):
            return [], "No previous conversations stored in long-term memory yet."
        
        # Get document count
        try:
            doc_count = es_client.count(index=es_index_name)["count"]
            if doc_count == 0:
                return [], "Long-term memory is empty. No previous conversations to search."
        except Exception as e:
            return [], f"Error checking memory: {str(e)}"
        
        # Generate embedding for the query
        try:
            query_embedding = embeddings.embed_query(query)
        except Exception as e:
            return [], f"Error generating embedding: {str(e)}"
        
        # Perform semantic search using kNN with rank_window
        try:
            search_body = {
                "knn": {
                    "field": "vector",
                    "query_vector": query_embedding,
                    "k": k,
                    "num_candidates": rank_window  # Retrieve more candidates, then rank top k
                },
                "_source": ["text", "content", "message_type", "timestamp", "thread_id"],
                "size": k
            }
            
            response = es_client.search(index=es_index_name, body=search_body)
            
            if not response.get("hits") or len(response["hits"]["hits"]) == 0:
                return [], "No relevant previous conversations found in long-term memory."
            
            # Extract documents with scores
            retrieved_docs = []
            for hit in response["hits"]["hits"]:
                source = hit["_source"]
                score = hit["_score"]
                retrieved_docs.append({
                    "content": source.get("content", source.get("text", "")),
                    "message_type": source.get("message_type", "unknown"),
                    "timestamp": source.get("timestamp", "unknown"),
                    "thread_id": source.get("thread_id", "unknown"),
                    "score": score
                })
            
            # Format context string
            context_parts = []
            for i, doc in enumerate(retrieved_docs, 1):
                context_parts.append(doc["content"])
            
            context_string = "\n\n".join(context_parts)
            
            # Verbose display
            if args.verbose:
                rich.print(f"\n[bold yellow]🔍 RETRIEVAL ANALYSIS[/bold yellow]")
                rich.print("="*80)
                rich.print(f"[blue]Query:[/blue] {query}")
                rich.print(f"[blue]Retrieved:[/blue] {len(retrieved_docs)} documents (from {rank_window} candidates)")
                rich.print(f"[blue]Total context length:[/blue] {len(context_string)} characters\n")
                
                for i, doc in enumerate(retrieved_docs, 1):
                    rich.print(f"[cyan]📄 Document {i} | Score: {doc['score']:.4f} | Type: {doc['message_type']}[/cyan]")
                    rich.print(f"[cyan]   Timestamp: {doc['timestamp']} | Thread: {doc['thread_id']}[/cyan]")
                    content_preview = doc['content'][:200] + "..." if len(doc['content']) > 200 else doc['content']
                    rich.print(f"[cyan]   Content: {content_preview}[/cyan]")
                    rich.print("-" * 80)
            
            return retrieved_docs, context_string
            
        except Exception as e:
            return [], f"Error searching memory: {str(e)}"
            
    except Exception as e:
        return [], f"Error accessing long-term memory: {str(e)}"</code></pre><p>Now that we’ve explored how short-term memory and long-term memory are indexed and fetched using LangGraph’s checkpoints in Elasticsearch, let’s take some time to understand why indexing and dumping the complete conversations can be risky.</p><h2>Risks of not managing context memory</h2><p>Since we’re talking so much about context engineering, along with short-term and long-term memory, let’s understand what happens if we don’t manage an agent’s memory and context well.</p><p>Unfortunately, many things can go wrong when an AI’s context grows extremely long or contains bad information. As context windows get larger, <strong>new failure modes</strong> emerge, like:</p><ul><li><strong>Context poisoning</strong></li><li><strong>Context distraction</strong></li><li><strong>Context confusion</strong></li><li><strong>Context clash</strong></li><li><strong>Context leakage and knowledge conflicts</strong></li><li><strong>Hallucinations and misinformation</strong></li></ul><p>Let’s break down these issues and other risks that arise from poor context management:</p><h3>Context poisoning</h3><p><em>Context poisoning</em> refers to when incorrect or harmful information ends up in the context and “poisons” the model’s subsequent outputs. A common example is a hallucination by the model that gets treated as fact and inserted into the conversation history. The model might then build on that error in later responses, compounding the mistake. In iterative agent loops, once false information makes it into the shared context (for example, in a summary of the agent’s working notes), it can be reinforced over and over. </p><p><a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">Researchers at DeepMind, in the release of the Gemini 2.5 report</a> (TL;DR, check <a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html">here</a>), observed this in a long-running <em>Pokémon</em>-playing agent: If the agent hallucinated a wrong game state and that got recorded into its <em>context </em>(its memory of goals), the agent would form <strong>nonsensical strategies</strong> around an impossible goal and get stuck. In other words, a poisoned memory can send the agent down the wrong path indefinitely.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3f2facf5aad67613ad557422e09ec23a66adc0ed-1600x1388.png" style="width: 100%;" /><p>Context poisoning can happen innocently (by mistake) or even maliciously, for instance, via prompt injection attacks where a user or third party sneaks in a hidden instruction or false fact that the agent then remembers and follows.</p><p><strong>Recommended countermeasures:</strong></p><p>Based on insights from <a href="https://www.wiz.io/academy/data-poisoning">Wiz</a>, <a href="https://zerlo.net/en/blog/what-is-llm-data-poisoning">Zerlo</a>, and <a href="https://www.anthropic.com/research/small-samples-poison">Anthropic</a>, countermeasures for context poisoning focus on preventing bad or misleading information from entering an LLM’s prompt, context window, or retrieval pipeline. 
Key steps include:</p><ul><li>Check the context constantly: Monitor the conversation or retrieved text for anything suspicious or harmful, not just the starting prompt.</li><li>Use trusted sources: Score or label documents based on credibility so the system prefers reliable information and ignores low-scored data.</li><li>Spot unusual data: Use tools that detect odd, out-of-place, or manipulated content, and remove it before the model uses it.</li><li>Filter inputs and outputs: Add guardrails so harmful or misleading text can’t easily enter the system or be repeated by the model.</li><li>Keep the model updated with clean data: Regularly refresh the system with verified information to counter any bad data that slipped through.</li><li>Human-in-the-loop: Have people review important outputs or compare them against known, trustworthy sources.</li></ul><p>Simple user habits also help: resetting long chats, sharing only relevant information, breaking complex tasks into smaller steps, and maintaining clean notes outside the model.</p><p>Together, these measures create a layered defense that protects LLMs from context poisoning and keeps outputs accurate and trustworthy.</p><p>Without countermeasures like these, an agent might remember instructions like <em>ignore previous guidelines</em> or trivial facts that an attacker inserted, leading to harmful outputs.</p><h3>Context distraction</h3><p><em>Context distraction</em> is when a context grows so long that the model overfocuses on the context, neglecting what it learned during training. In extreme cases, this resembles <a href="https://en.wikipedia.org/wiki/Catastrophic_interference"><em>catastrophic forgetting</em></a>; that is, the model effectively “forgets” its underlying knowledge and becomes overly attached to the information placed in front of it. Previous studies have shown that LLMs often lose focus when the prompt is extremely long.</p><p>The Gemini 2.5 agent, for example, supported a million-token window, but once its context grew beyond a certain point (on the order of 100,000 tokens in an experiment), it began to <strong>fixate on repeating its past actions</strong> instead of coming up with new solutions. In a sense, the agent became a prisoner of its extensive history. It kept looking at its long log of previous moves (the context) and mimicking them, rather than using its underlying training knowledge to devise fresh and novel strategies.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/e08e5b6d2e8ec7e3511d455985eed3d7fa6241e0-1352x636.png" style="width: 100%;" /><p>This is counterproductive. We want the model to use relevant context to help reasoning, not override its ability to think. Notably, even models with huge windows exhibit this <a href="https://research.trychroma.com/context-rot"><em>context rot</em></a>: Their performance degrades nonuniformly as more tokens are added. 
There appears to be an <em>attention budget</em>. Like humans with limited working memory, an LLM has a finite capacity to attend to tokens, and as that budget is stretched, its precision and focus drop.</p><p>As a mitigation, you can prevent context distraction through chunking, engineering in only the right information, regular context summarization, and evaluation and monitoring techniques that score the accuracy of responses.</p><p>These methods keep the model grounded in both relevant context and its underlying training, reducing the risk of distraction and improving overall reasoning quality.</p><h3>Context confusion</h3><p><em>Context confusion</em> is when superfluous content in the context is used by the model to generate a low-quality response. A prime example is giving an agent a large set of tools or API definitions that it might use. If many of those tools are unrelated to the current task, the model may still try to use them inappropriately, simply because they’re present in context. Experiments have found that providing <em>more</em> tools or documents can <em>hurt</em> performance if they’re not all needed. The agent starts making mistakes, like calling the wrong function or referencing irrelevant text. </p><p>In one case, a small <strong>Llama 3.1 8B</strong> model failed a task when given 46 tools to consider but succeeded when given only 19 tools. The extra tools created confusion, even though the context was within length limits. The underlying issue is that any information in the prompt will be <em>attended to</em> by the model. If it doesn’t know to ignore something, that something could influence its output in undesired ways. Irrelevant bits can “steal” some of the model’s attention and lead it astray (for instance, an irrelevant document might cause the agent to answer a different question than asked). Context confusion often manifests as the model producing a low-quality response that integrates unrelated context. Refer to the research paper: <a href="https://arxiv.org/pdf/2411.15399">Less is More: Optimizing Function Calling for LLM Execution on Edge Devices.</a></p><p>It reminds us that more context isn’t always better, especially if it’s not <strong>curated</strong> for relevance.</p><h3>Context clash</h3><p><em>Context clash</em> occurs when <strong>parts of the context contradict each other</strong>, causing internal inconsistencies that derail the model’s reasoning. A clash can happen if the agent accumulates multiple pieces of information that are in conflict. </p><p>For example, imagine an agent that fetched data from two sources: One says <em>Flight A departs at 5 PM</em>, and the other says <em>Flight A departs at 6 PM</em>. If both facts end up in the context, the poor model has no way to know which is correct; it may get confused or produce an incorrect or inconsistent answer.</p><p>Context clash also frequently occurs in multiturn conversations where the model’s <strong>earlier attempts</strong> at answering are still lingering in the context along with later refined information.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/500d7a80dc8db1923f9b5ca84728eed64fa296f7-1316x580.png" style="width: 100%;" /><p>A <a href="https://arxiv.org/pdf/2505.06120">research study</a> by Microsoft and Salesforce shows that if you break a complex query into multiple chatbot turns (adding details gradually), the final accuracy drops significantly, compared to giving all details in a single prompt. Why? 
Because the early turns contain partial or incorrect intermediate answers from the model, and those remain in the context. When the model later tries to answer with all info, its <em>memory</em> still includes those wrong attempts, which conflict with the corrected info and lead it off track. Essentially, the conversation’s context clashes with itself. The model may inadvertently use an outdated piece of context (from an earlier turn) that doesn’t apply after new info is added.</p><p>In agent systems, context clash is especially dangerous because an agent might combine outputs from different tools or subagents. If those outputs disagree, the aggregated context is inconsistent. The agent could then get stuck or produce nonsensical results trying to reconcile the contradictions. Preventing context clash involves ensuring the context is <strong>fresh and consistent</strong>, for instance, clearing or updating any outdated info and not mixing sources that haven’t been vetted for consistency.</p><h3>Context leakage and knowledge conflicts</h3><p>In systems where multiple agents or users share a memory store, there’s a risk of information bleeding over between contexts.</p><p>For example, if two separate users’ data embeddings reside in the same vector database without proper access control, an agent answering User A’s query might accidentally retrieve some of User B’s memory. This <strong><em>cross-context leak</em></strong> can expose private information or just create confusion in responses.</p><p>According to the <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/">OWASP Top 10 for LLM Applications</a>, multitenant vector databases must guard against such leakage:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/8f0ccf0b2f7bd6715c14aceee2deffb213d50bd9-1600x936.png" style="width: 100%;" /><p>According to <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/">LLM08:2025 Vector and Embedding Weaknesses</a>, one of the common risks is context leakage:</p><blockquote><em>In multi-tenant environments where multiple classes of users or applications share the same vector database, there&#x27;s a risk of context leakage between users or queries. Data federation knowledge conflict errors can occur when data from multiple sources contradict each other. This can also happen when an LLM can’t supersede old knowledge that it has learned while training, with the new data from Retrieval Augmentation.</em></blockquote><p>Another aspect is that an LLM might have trouble overriding its <strong>built-in knowledge</strong> with new info from memory. If the model was trained on some fact and the retrieved context says the opposite, the model can get confused about which to trust. Without proper design, the agent could mix up contexts or fail to update old knowledge with new evidence, leading to stale or incorrect answers.</p><h3><strong>Hallucinations and misinformation</strong></h3><p>While <em>hallucination </em>(the LLM making up plausible-sounding but false information) is a known problem even without long contexts, poor memory management can amplify it. </p><p>If the agent’s memory lacks a crucial fact, the model may just <strong>fill in the gap with a guess</strong>, and if that guess then enters the context (poisoning it), the error persists. 
</p><p>The OWASP LLM security report <a href="https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/"><strong>(LLM09:2025 Misinformation)</strong></a> highlights misinformation as a core vulnerability: LLMs can produce confident but fabricated answers, and users may overtrust them. An agent with a bad or outdated long-term memory might confidently cite something that was true last year but is false now, unless its memory is kept up to date. </p><p>Overreliance on the AI’s output (by either the user or the agent itself in a loop) can make this worse. If no one ever checks the info in memory, the agent can accumulate falsehoods. This is why RAG is often used to reduce hallucinations: By retrieving an authoritative source, the model doesn’t have to invent facts. But if your retrieval pulls in the wrong document (say, one that contains misinformation) or if an early hallucination isn’t pruned, the system may propagate that misinformation throughout its actions. </p><p>The bottom line: Failing to manage memory can lead to <strong>incorrect and misleading outputs</strong>, which can be damaging, especially if the stakes are high (for example, bad advice in a finance or medical domain). An agent needs mechanisms to verify or correct its memory content, not just unconditionally trust whatever is in the context.</p><p>In summary, giving an AI agent an infinitely long memory or dumping every possible thing into its context is <em>not</em> a recipe for success.</p><h2>Best practices for memory management in LLM applications</h2><p>To avoid the pitfalls above, developers and researchers devised a number of <strong>best practices for managing context and memory</strong> in AI systems. These practices aim to keep the AI’s working context lean, relevant, and up to date. Here are some of the key strategies, along with examples of how they help.</p><h3>RAG: Use targeted context</h3><p>Much of RAG has already been covered in the earlier section, so this serves as a concise set of practical reminders:</p><ul><li>Use targeted retrieval, not bulk loading: Retrieve only the most relevant chunks instead of pushing entire documents or full conversation histories into the prompt.</li><li>Treat RAG as just-in-time memory recall: Fetch context only when it’s needed, rather than carrying everything forward across turns.</li><li>Prefer relevance-aware retrieval strategies: Approaches like top-k semantic search, Reciprocal Rank Fusion, or tool loadout filtering help reduce noise and improve grounding.</li><li>Larger context windows don’t remove the need for RAG: Two highly relevant paragraphs are almost always more effective than 20 loosely related pages.</li></ul><p>That said, RAG isn’t about adding more context; it’s about adding the right context.</p><h3>Tool loadout</h3><p><em>Tool loadout</em> is about giving a model only the tools it actually needs for a task. The term comes from gaming: You pick a loadout that fits the situation. Too many tools slow you down; the wrong ones cause failure. LLMs behave the same way, according to the research paper <a href="https://arxiv.org/abs/2411.15399">Less is more</a>. Once you pass ~30 tools, descriptions start overlapping and the model gets confused. Past ~100 tools, failure is almost guaranteed. This isn’t a context window problem; it’s context confusion.</p><p>A simple and effective fix is <a href="https://arxiv.org/abs/2505.03275"><strong>RAG-MCP</strong></a>. 
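</p><p>Before unpacking it, here is a rough sketch of the pattern. This is our illustration rather than the paper’s reference code; the index name, field names, and the reuse of the <code>es_client</code> and <code>embeddings</code> objects from the earlier snippets are all assumptions:</p><pre><code># Sketch: RAG-MCP-style tool loadout. Index/field names are hypothetical;
# es_client and embeddings are assumed to be configured as shown earlier.
def select_tools(query: str, k: int = 5) -> list:
    """Return only the k tool descriptions most relevant to this query."""
    search_body = {
        "knn": {
            "field": "description_vector",   # embedding of each tool description
            "query_vector": embeddings.embed_query(query),
            "k": k,
            "num_candidates": 50,            # candidates considered before top-k
        },
        "_source": ["name", "description", "input_schema"],
        "size": k,
    }
    response = es_client.search(index="tool-descriptions", body=search_body)
    return [hit["_source"] for hit in response["hits"]["hits"]]

# Pass only this small, focused loadout to the agent instead of every tool.
relevant_tools = select_tools("What's the weather in Chennai tomorrow?")</code></pre><p>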
Instead of dumping every tool into the prompt, tool descriptions are stored in a vector database and only the most relevant ones are retrieved per request. In practice, this keeps the loadout small and focused, dramatically shortens prompts, and can improve tool selection accuracy by up to 3x.</p><p>Smaller models hit this wall even sooner. The research shows an 8B model failing with dozens of tools but succeeding once the loadout is trimmed. Dynamically selecting tools, sometimes by having an LLM first reason about what it thinks it needs, can boost performance by 44%, while also reducing power usage and latency. The takeaway is that most agents only need a few tools, but as your system grows, tool loadout and RAG-MCP become first-order design decisions.</p><h3>Context pruning: Limit the chat history length</h3><p>If a conversation goes on for many turns, the accumulated chat history can become too large to fit, leading to context overflow or becoming too distracting to the model. </p><p><em>Trimming</em> means programmatically removing or shortening less important parts of the dialogue as it grows. One simple form is to drop the oldest turns of the conversation when you hit a certain limit, keeping only the latest <em>N</em> messages. More sophisticated pruning might remove irrelevant digressions or previous instructions that are no longer needed. The goal is to <strong>keep the context window uncluttered</strong> by old news. </p><p>For example, if the agent solved a subproblem 10 turns ago and we have since moved on, we might delete that portion of the history from the context (assuming it won’t be needed further). Many chat-based implementations do this: They maintain a rolling window of recent messages. </p><p>Trimming can be as simple as “forgetting” the earliest parts of a conversation once they’ve been summarized or are deemed irrelevant. By doing so, we reduce the risk of context overflow errors and also reduce <a href="https://www.elastic.co/search-labs/blog/agentic-memory-management-elasticsearch#context-distraction"><strong>context distraction</strong></a>, so the model won’t see and get sidetracked by old or off-topic content. This approach is very similar to how humans might not remember every word from an hour-long talk but will retain the highlights. </p><p>If context pruning still feels abstract, as the author Drew Breunig highlights <a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html#tool-loadout:~:text=Provence%20is%20fast%2C%20accurate%2C%20simple%20to%20use%2C%20and%20relatively%20small%20%E2%80%93%20only%201.75%20GB.%20You%20can%20call%20it%20in%20a%20few%20lines%2C%20like%20so%3A">here</a>, the Provence model (`<a href="https://huggingface.co/naver/provence-reranker-debertav3-v1">naver/provence-reranker-debertav3-v1</a>`), a lightweight (1.75 GB), efficient, and accurate context pruner for question answering, can make a difference. It can trim large documents down to only the most relevant text for a given query. You can call it at specific intervals.</p><p>Here’s how we invoke the `provence-reranker` model in our code to prune the context:</p><pre><code># Context pruning with Provence
def prune_with_provence(query: str, context: str, threshold: Optional[float] = None) -> str:
    """
    Prune context using Provence reranker model
    
    Args:
        query: User's query/question
        context: Original context to prune
        threshold: Relevance threshold (0-1) for Provence reranker.
                   If None, uses args.pruning_threshold.
                   0.1 = conservative (recommended, no performance drop)
                   0.3-0.5 = moderate to aggressive pruning
    
    Returns:
        Pruned context with only relevant sentences
    """
    if provence_model is None:
        return context
    
    if threshold is None:
        threshold = args.pruning_threshold
    
    try:
        # Use Provence's process method
        provence_output = provence_model.process(
            question=query,
            context=context,
            threshold=threshold,
            always_select_title=False,
            enable_warnings=False
        )
        
        # Extract pruned context from output
        pruned_context = provence_output.get('pruned_context', context)
        reranking_score = provence_output.get('reranking_score', 0.0)
        
        # Log statistics
        original_length = len(context)
        pruned_length = len(pruned_context)
        reduction_pct = ((original_length - pruned_length) / original_length * 100) if original_length > 0 else 0
        
        if args.verbose:
            rich.print(f"[cyan]📊 Pruning stats: {pruned_length}/{original_length} chars ({reduction_pct:.1f}% reduction, threshold={threshold:.2f}, rerank_score={reranking_score:.3f})[/cyan]")
        
        return pruned_context if pruned_context else context
        
    except Exception as e:
        rich.print(f"[yellow]⚠️ Error in Provence pruning: {str(e)}[/yellow]")
        rich.print(f"[yellow]⚠️ Falling back to original context[/yellow]")
        return context</code></pre><p>We use the Provence reranker model (`naver/provence-reranker-debertav3-v1`) to score sentence relevance. Threshold-based filtering keeps sentences above the relevance threshold. Also, we introduce a fallback mechanism, where we fall back to the original context if pruning fails. Finally, statistics logging tracks reduction percentage in verbose mode.</p><h3>Context summarization: Condense older information instead of dropping it entirely</h3><p><em>Summarization</em> is a companion to trimming. When the history or knowledge base becomes too large, you can employ the LLM to generate a brief summary of the important points and use that summary in place of the full content going forward, as we did in our code above.</p><p>For example, if an AI assistant has had a 50-turn conversation, instead of sending all 50 turns to the model on turn 51 (which likely won’t fit), the system might take turns 1–40, have the model summarize them in a paragraph, and then only supply that summary plus the last 10 turns in the next prompt. This way, the model still knows what was discussed without needing every detail. Early chatbot users did this manually by asking, “Can you summarize what we’ve talked about so far?” and then continuing in a new session with the summary. Now it can be automated. Summarization not only saves context window space but can also reduce <strong>context confusion/distraction</strong> by stripping away extra detail and retaining just the salient facts.</p><p>Here’s how we use OpenAI models (you can use any LLM) to condense context while preserving all relevant information, eliminating redundancy and duplication.<br/></p><pre><code># Context summarization
def summarize_context(query: str, context: str) -> str:
    """
    Summarize context using LLM to reduce duplication and focus on relevant information
    
    Args:
        query: User's query/question
        context: Context to summarize
        
    Returns:
        Summarized context
    """
    try:
        summary_prompt = f"""You are an expert at summarizing conversation context.

Your task: Analyze the provided conversation context and produce a condensed summary that fully answers or supports the user's specific question.

The summary must:
1. Preserve every fact, detail, and information that directly relates to the question
2. Eliminate redundancy and duplicate information
3. Maintain chronological flow when relevant
4. Focus on information that helps answer: "{query}"

Context to summarize:
{context}

Provide a concise summary that preserves all relevant information:"""

        summary = llm.invoke(summary_prompt).content
        
        if args.verbose:
            original_length = len(context)
            summary_length = len(summary)
            reduction_pct = ((original_length - summary_length) / original_length * 100) if original_length > 0 else 0
            rich.print(f"[cyan]📝 Summarization stats: {summary_length}/{original_length} chars ({reduction_pct:.1f}% reduction)[/cyan]")
        
        return summary
        
    except Exception as e:
        rich.print(f"[yellow]⚠️ Error in context summarization: {str(e)}[/yellow]")
        rich.print(f"[yellow]⚠️ Falling back to original context[/yellow]")
        return context</code></pre><p>Importantly, when the context is summarized, the model is less likely to get overwhelmed by trivial details or past errors (assuming the summary is accurate). </p><p>However, summarization has to be done carefully. A bad summary might omit a crucial detail or even introduce an error. It’s essentially another prompt to the model (“summarize this”), so it can hallucinate or lose nuance. Best practice is to summarize incrementally and perhaps keep some canonical facts unsummarized.</p><p>Nonetheless, it has proven very useful. <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">In the Gemini agent scenario, </a>summarizing the context every ~100k tokens was a way to counteract the model’s tendency to repeat itself. The summary acts like a compressed memory of the conversation or data. As developers, we can implement this by having an agent periodically call a summarization function (maybe a smaller LLM or a dedicated routine) on the conversation history or a long document. The resulting summary replaces the original content in the prompt. This tactic is widely used to keep contexts within limits and distill the information.</p><h3>Context quarantine: Isolate contexts when possible</h3><p>This is more relevant in complex agent systems or multistep workflows. The idea of context segmentation is to split a big task into smaller, isolated tasks, each with its own context, so that you never accumulate one enormous context that contains everything. Each subagent or subtask works on a piece of the problem with a focused context, and then a higher-level agent (a supervisor or coordinator) integrates the results.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/f2de68c3339883d7658e633af3948f29f427e6cf-1600x900.png" style="width: 100%;" /><p><a href="https://www.anthropic.com/engineering/multi-agent-research-system">Anthropic’s research strategy uses multiple subagents</a>, each investigating a different aspect of a question, with their own context windows, and a lead agent that reads the distilled results from those subagents. This parallel, modular approach means that no single context window gets too bloated. It also reduces the chance of irrelevant information mixing: each thread stays on topic (no context confusion) and doesn’t carry unnecessary baggage when answering its specific subquestion. In a sense, it’s like running separate threads of thought that only share their outcomes, not their entire thought process.</p><p>In multi-agent systems, this approach is essential. If Agent A is handling task A and Agent B is handling task B, there’s no reason for either agent to consume the other’s full context unless it’s truly required. Instead, agents can exchange only the necessary information. For example, Agent A can pass a consolidated summary of its findings to Agent B via a supervisor agent, while each subagent maintains its own dedicated context thread. This setup doesn’t require human-in-the-loop intervention; it relies on a supervisory agent with the right tools enabled and minimal, controlled context sharing.</p><p>Nonetheless, designing your system so that agents or tools operate with minimal necessary context overlap can greatly enhance clarity and performance. Think of it as <strong>microservices for AI</strong>: each component deals with its own context, and you pass messages between them in a controlled way instead of sharing one monolithic context. These best practices are often used in combination. 
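</p><p>As a rough illustration of how they compose (reusing the <code>summarize_context</code>, <code>prune_with_provence</code>, and <code>retrieve_from_elasticsearch</code> helpers shown earlier; the rolling-window size is our assumption, not a recommendation):</p><pre><code># Illustrative combination of the practices above; not taken from the app.
MAX_HISTORY_TURNS = 10   # context pruning: keep only recent turns verbatim

def build_context(query: str, history: list) -> str:
    """Trim recent history, summarize the rest, and recall long-term memory."""
    recent = history[-MAX_HISTORY_TURNS:]
    older = history[:-MAX_HISTORY_TURNS]

    parts = []
    if older:
        # Context summarization: condense older turns instead of dropping them
        parts.append(summarize_context(query, "\n".join(older)))

    # RAG: recall only the relevant long-term memories from Elasticsearch
    docs, long_term = retrieve_from_elasticsearch(query, k=5)
    if docs:  # skip error/empty messages; only real memories are included
        # Optional second pass: prune retrieved text down to relevant sentences
        parts.append(prune_with_provence(query, long_term))

    parts.extend(recent)
    return "\n\n".join(parts)</code></pre><p>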
Also, this gives you the flexibility to trim trivial history, summarize important older messages or conversations, offload the detailed logs to Elasticsearch for long-term context, and use retrieval to bring back anything relevant when needed.</p><p>As mentioned <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents#:~:text=While%20some%20models,to%20the%20LLM">here</a>, the guiding principle is that context is a limited and precious resource. You want every token in the prompt to earn its keep, meaning it should contribute to the quality of the output. If something in memory is not pulling its weight (or worse, actively causing confusion), then it should be pruned, summarized, or kept out.</p><p>As developers, we can now program the context just like we program code, deciding what information to include, how to format it, and when to omit or update it. By following these practices, we can give LLM agents the much-needed context to perform tasks without falling victim to the failure modes described earlier. The result is agents that remember what they should, forget what they don’t need, and retrieve what they require just in time.</p><h2>Conclusion</h2><p>Memory isn’t something you add to an agent; it’s something you engineer. Short-term memory is the agent’s working scratch pad, and long-term memory is its durable knowledge store. RAG is the bridge between the two, turning a passive datastore, like Elasticsearch, into an active recall mechanism that can ground outputs and keep the agent current.</p><p>But memory is a double-edged sword. The moment you let context grow unchecked, you invite poisoning, distraction, confusion, and clashes, and in shared systems, even data leakage. That’s why the most important memory work isn’t “store more,” it’s “curate better”: Retrieve selectively, prune aggressively, summarize carefully, and avoid mixing unrelated contexts unless the task truly demands it.</p><p>In practice, good context engineering looks like good systems design: smaller, sufficient contexts, controlled interfaces between components, and a clear separation between raw state and the distilled state you actually want the model to see. Done right, you don’t end up with an agent that remembers everything; you end up with an agent that remembers the right things, at the right time, for the right reason.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Higher throughput and lower latency: Elastic Cloud Serverless on AWS gets a significant performance boost]]></title>
            <description><![CDATA[We've upgraded the AWS infrastructure for Elasticsearch Serverless to newer, faster hardware. Learn how this massive performance boost delivers faster queries, better scaling, and lower costs.]]></description>
            <link>https://www.elastic.co/search-labs/blog/elasticsearch-serverless-aws-performance-boost</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/elasticsearch-serverless-aws-performance-boost</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Pete Galeotti, Yuvraj Gupta, Rachel Forshee]]></dc:creator>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/3aa0b481994d2445ba979d3c79fff64c5ee6676a-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Elastic Cloud Serverless is already the definitive solution for developers who want to build efficient search and AI applications without the operational burden of managing infrastructure. Now, we&#x27;re taking the performance of your serverless projects to a whole new level.</p><p>We&#x27;ve completed a major infrastructure upgrade for all <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> projects running on AWS, migrating to newer, faster hardware. This change has been rolled out to every serverless project automatically. It delivers <strong>higher throughput and lower latency</strong> for Elasticsearch, Elastic Observability, and Elastic Security serverless projects on AWS.</p><h2><strong>Key performance benefits for developers</strong></h2><p>The new AWS hardware infrastructure underpins everything you do with Elastic Cloud Serverless, translating to tangible benefits for your applications&#x27; speed and responsiveness.</p><h3><strong>Reduced query latency… increased throughput</strong></h3><p>The improved hardware dramatically enhances the speed of compute resources, which means your search queries are processed faster than ever.</p><ul><li><strong>Search and vector search:</strong> Whether you&#x27;re running traditional full-text queries or using cutting-edge vector search for your <a href="https://www.elastic.co/generative-ai">generative AI and retrieval-augmented generation (RAG) applications</a>, you&#x27;ll see a marked decrease in latency. Internal benchmarking showed a 35% average decrease in search latency.</li><li><strong>Faster indexing:</strong> Data ingestion rates are optimized, allowing you to index massive data volumes and complex documents with increased throughput. This is crucial for applications that require near–real-time data visibility. Internal benchmarking showed a 26% average increase in indexing throughput.</li></ul><h3><strong>Consistent performance under load</strong></h3><p>Elastic Cloud Serverless is designed to autoscale dynamically in real time to meet demand, minimizing latency, regardless of your workload. With this hardware upgrade, that scaling is now more performant and responsive.</p><ul><li><strong>Handling spikes with ease:</strong> Whether you&#x27;re facing a sudden surge in user traffic or a massive batch data ingest, the new infrastructure ensures that your search and indexing resources scale up more efficiently to maintain consistently low latency.</li><li><strong>Optimized compute-storage decoupling:</strong> The serverless architecture separates compute and storage, which allows workloads to scale independently for optimal performance and cost efficiency. The faster hardware enhances the compute layer, maximizing the efficiency of this decoupled design.</li></ul><h2><strong>Under the hood: Internal benchmarking results</strong></h2><p>To quantify the impact of our AWS infrastructure upgrade, the Elastic engineering team conducted comprehensive internal benchmarking against a range of serverless workloads. 
These workloads provided empirical evidence of performance improvements that you can expect across your applications, regardless of your use case.</p><h3><strong>The benchmarking approach</strong></h3><p>We focused our testing on the key metrics that directly affect the developer experience and application responsiveness: response time (that is, latency) and throughput on search and on indexing operations.</p><ul><li><strong>Workloads tested:</strong> The tests included high-concurrency search operations typical of user-facing applications, complex vector search queries, and high-volume data ingestion/indexing for observability and security use cases. In particular, our testing methodology used <a href="https://github.com/elastic/rally-tracks/tree/master">publicly</a> <a href="https://github.com/elastic/rally-tracks/tree/master">available datasets for Rally</a>, Elastic’s benchmarking tool.<ul><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/wikipedia"><code>wikipedia</code></a>: A dataset derived from a snapshot of Wikipedia’s text contents, to measure general-purpose text search performance.</li><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/msmarco-passage-ranking"><code>MSMARCO-Passage-Ranking</code></a>: A dataset derived from Microsoft’s Machine Reading Comprehension (MS MARCO), to measure search performance on sparse vector fields.</li><li><a href="https://github.com/elastic/rally-tracks/tree/3bedd51/openai_vector"><code>OpenAI_Vector</code></a>: A dataset derived from BEIR’s NQ and enriched with embeddings generated by OpenAI’s <code>text-embedding-ada-002</code> model, to measure search performance on dense vector fields.</li></ul></li><li><strong>Measurement:</strong> We compared performance on the old and new infrastructure, measuring latency at the 99th percentile (P99) to capture the worst-case, tail-latency performance and operations per second. Each track was run five times for each hardware profile to ensure consistency in the results.</li><li><strong>The goal:</strong> Our aim was to validate the infrastructure&#x27;s ability to deliver consistently <strong>faster and more predictable performance</strong> across the board, even during periods of rapid autoscaling.</li></ul><h3><strong>Performance data summary</strong></h3><p>The results confirm significant gains in efficiency and speed. These gains translate directly into lower response times for your users and lower operational costs as a result of the ability to complete the same amount of work with fewer compute resources.</p><p>The following tables detail the quantitative improvements. Higher values are better for throughput; lower values are better for latency.</p><p><strong>Searching benchmark results:</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Benchmark</td><td style="border: 1px solid #ccc; padding: 4px;">Comparison</td><td style="border: 1px solid #ccc; padding: 4px;">Old infra</td><td style="border: 1px solid #ccc; padding: 4px;">New infra</td><td style="border: 1px solid #ccc; padding: 4px;">Differential</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">729</td><td style="border: 1px solid #ccc; padding: 4px;">1107</td><td style="border: 1px solid #ccc; padding: 4px;">+52%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">56</td><td style="border: 1px solid #ccc; padding: 4px;">35</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">31</td><td style="border: 1px solid #ccc; padding: 4px;">+40%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">108</td><td style="border: 1px solid #ccc; padding: 4px;">67</td><td style="border: 1px solid #ccc; padding: 4px;">-38%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">475</td><td style="border: 1px solid #ccc; padding: 4px;">624</td><td style="border: 1px solid #ccc; padding: 4px;">+31%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Search operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">35</td><td style="border: 1px solid #ccc; padding: 4px;">22</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
              </tbody>
            </table>
          <p><strong>Indexing benchmark results:</strong></p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">Benchmark</td><td style="border: 1px solid #ccc; padding: 4px;">Comparison</td><td style="border: 1px solid #ccc; padding: 4px;">Old infra</td><td style="border: 1px solid #ccc; padding: 4px;">New infra</td><td style="border: 1px solid #ccc; padding: 4px;">Differential</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">2845</td><td style="border: 1px solid #ccc; padding: 4px;">3220</td><td style="border: 1px solid #ccc; padding: 4px;">+13%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`wikipedia` (plain text)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">1769</td><td style="border: 1px solid #ccc; padding: 4px;">1120</td><td style="border: 1px solid #ccc; padding: 4px;">-37%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">7087</td><td style="border: 1px solid #ccc; padding: 4px;">8900</td><td style="border: 1px solid #ccc; padding: 4px;">+26%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`MSMARCO-Passage-Ranking` (sparse vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">824</td><td style="border: 1px solid #ccc; padding: 4px;">677</td><td style="border: 1px solid #ccc; padding: 4px;">-18%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation throughput (ops/s)</td><td style="border: 1px solid #ccc; padding: 4px;">2972</td><td style="border: 1px solid #ccc; padding: 4px;">3187</td><td style="border: 1px solid #ccc; padding: 4px;">+7%</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">`OpenAI_Vector` (dense vectors)</td><td style="border: 1px solid #ccc; padding: 4px;">Indexing operation latency (p99, ms)</td><td style="border: 1px solid #ccc; padding: 4px;">2946</td><td style="border: 1px solid #ccc; padding: 4px;">2944</td><td style="border: 1px solid #ccc; padding: 4px;">0%</td>
                    </tr>
              </tbody>
            </table>
          <h2><strong>The added bonus: Cost reduction</strong></h2><p>While our focus is on delivering low-latency performance, the efficiency of the new hardware also has a direct, positive impact on costs for Elasticsearch projects.</p><p><a href="https://www.elastic.co/pricing/serverless-search">Elasticsearch Serverless pricing</a> is usage-based, meaning that you only pay for the ingest and search resources you consume. Because the newer, faster hardware is more efficient, your workloads will often complete tasks using fewer resources, leading to an inherent cost reduction for most projects. You get a premium performance boost without the premium price tag—the definition of optimized efficiency.</p><h2><strong>What does this mean for you, the developer?</strong></h2><p>This infrastructure upgrade is entirely managed by Elastic, so you don&#x27;t need to lift a finger—no migrations and no configuration changes. The improvement is immediate and automatic across all your AWS-based serverless projects.</p><p>This upgrade empowers you to:</p><ul><li><strong>Build faster applications:</strong> Focus on feature velocity, knowing that your underlying search platform is delivering the speed your users demand.</li><li><strong>Innovate with confidence:</strong> Deploy new search, observability, and security features—including complex AI capabilities, like vector search and relevance ranking—with the assurance that the platform can handle the load at peak performance.</li><li><strong>Simplify your stack:</strong> Use a fully managed service that handles infrastructure management, capacity planning, and scaling, so you can focus on your code and data.<br/></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[jina-embeddings-v3 is now available on Elastic Inference Service]]></title>
            <description><![CDATA[Introducing jina-embeddings-v3 on Elastic Inference Service (EIS) and explaining how to get started.]]></description>
            <link>https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/jina-embeddings-v3-elastic-inference-service</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Sean Handley, Ranjana Devaji, Brendan Jugan]]></dc:creator>
            <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bb1c96c697d8d48b764bee487a73a6cae130d0d-720x420.jpg?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>Today, we are excited to introduce <a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/"><code>jina-embeddings-v3</code></a> on Elastic Inference Service (EIS), bringing fast multilingual dense retrieval directly to Elasticsearch. <code>jina-embeddings-v3</code> is the first Jina AI model available on EIS, with many more to come soon.</p><p>Jina AI, <a href="https://www.elastic.co/blog/elastic-jina-ai">which recently joined Elastic via acquisition</a>, is a leader in open-source multilingual and multimodal embeddings, rerankers, and small language models. Jina brings deep expertise in <a href="https://www.elastic.co/search-labs/blog/jina-models-elasticsearch-guide">search foundation models</a> that help developers build high-quality retrieval and RAG systems across text, images, code, and long multilingual content.</p><p><a href="https://www.elastic.co/blog/elastic-inference-service">Elastic Inference Service</a> (EIS) makes it easy for developers to add fast, high-quality, and reliable semantic retrieval for search, RAG, and agentic applications with predictable, consumption-based pricing. EIS offers fully managed, GPU-powered inference with ready-to-use models, no additional setup or hosting complexity required.</p><p><code>jina-embeddings-v3</code> supports high-quality multilingual retrieval with long-context reasoning and task-tuned modes for RAG and agents. It gives developers fast dense embeddings across a broad range of languages without operational overhead.</p><h2>Why jina-embeddings-v3?</h2><p><code>jina-embeddings-v3</code> is a text embedding model that supports 32 languages and a context of up to 8192 tokens, delivering high relevance at lower cost with GPU-powered inference through EIS.</p><h3><strong>Key capabilities</strong></h3><ul><li><strong>Multilinguality</strong>: Closes the language gap and aligns meaning across 32 languages, including Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.</li><li><strong>Parameter-efficiency</strong>: <a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/">Delivers high performance</a> with only 570M parameters, matching much larger LLM-based embedding models at lower cost.</li><li><strong>Dimensionality control</strong>: Defaults to 1024 dimensions, and Matryoshka representation support lets developers dial the embedding size all the way down to 32 dimensions, giving flexibility to balance accuracy, latency, and storage based on your needs.</li><li><strong>Task-specific optimization</strong>: Features task-specific Low-Rank Adaptation (LoRA) adapters, enabling it to generate high-quality embeddings for various tasks, including query-document retrieval, clustering, classification, and text matching.</li></ul><h2><strong>Get started</strong></h2><pre><code>POST _inference/text_embedding/.jina-embeddings-v3
{
  "input": ["Rocky Mountain National Park"],
  "input_type": "ingest"
}</code></pre><p>The response:</p><pre><code>{
  "text_embedding": [
    {
      "embedding": [
        -0.06678891,
        -0.0073341704,
        0.011903269,
        -0.041797,
      ...
      ]
    }
  ]
}</code></pre><h2>What’s next</h2><p>Alongside these new models, EIS continues to evolve to support more users and simplify semantic search across environments.</p><p><strong>Cloud Connect for EIS: </strong><a href="https://www.elastic.co/docs/deploy-manage/cloud-connect">Cloud Connect</a> for EIS will soon bring EIS to self-managed environments, reducing operational overhead and enabling hybrid architectures and scaling where it works best for you.</p><p><strong>semantic_text defaults to jina-embeddings-v3 on EIS: </strong><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> currently uses ELSER as the embeddings model behind the scenes, but will default to the <code>jina-embeddings-v3</code> endpoint on EIS in the near future. This change will provide built-in inference at ingestion time, making it easier to adopt multilingual search without additional configuration.</p><p><strong>More models: </strong>We’re expanding the EIS model catalog to meet the rising inference demands of our customers. In the coming months, we’ll introduce new models that support an even broader set of search and inference workloads. Hot on the heels of <code>jina-embeddings-v3</code>, the next models to follow are <code><a href="https://jina.ai/models/jina-reranker-v2-base-multilingual/">jina-reranker-v2-base-multilingual</a> </code>and <a href="https://jina.ai/news/jina-reranker-v3-0-6b-listwise-reranker-for-sota-multilingual-retrieval/"><code>jina-reranker-v3</code></a>. Both Jina AI models greatly improve precision through multilingual reranking for RAG and AI agents.</p><h2>Conclusion</h2><p>With <code>jina-embeddings-v3</code> on EIS, you can build multilingual, high-precision retrieval pipelines without managing models, GPUs, or infrastructure. You get fast dense retrieval and tight integration with Elasticsearch’s relevance stack, all in one platform.</p><p>Whether you are building global RAG systems, search, or agentic workflows that need reliable context, Elastic now gives you a high-performance model out-of-the-box, and the operational simplicity to move from prototype to production with confidence.</p><p>All Elastic Cloud trials have access to the Elastic Inference Service. <a href="https://www.elastic.co/cloud/serverless">Try it now </a>on Elastic Cloud Serverless and Elastic Cloud Hosted.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid search and multistage retrieval in ES|QL]]></title>
            <description><![CDATA[Explore the multistage retrieval capabilities of ES|QL, using FORK and FUSE commands to integrate hybrid search with semantic reranking and native LLM completions.]]></description>
            <link>https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/hybrid-search-multi-stage-retrieval-esql</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Ioana Tagirta, Aurélien Foucret, Carlos Delgado]]></dc:creator>
            <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fceada10b0c09d6a4a372f137bb3040e1ff41fbf-1600x895.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In Elasticsearch 9.2, we’ve introduced the ability to do dense vector search and hybrid search in Elasticsearch Query Language (ES|QL). This continues our investment in making ES|QL the best search language to solve modern search use cases.</p><h2>Multistage retrieval: The challenge of modern search</h2><p>Modern search has evolved beyond simple keyword matching. Today&#x27;s search applications need to understand intent, handle natural language, and combine multiple ranking signals to deliver the best results.</p><p>Retrieval of the most relevant results happens in multiple stages, with each stage gradually refining the result set. In the past, most use cases required only one or two stages of retrieval: an initial query to get results and an optional rescoring phase.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/fceada10b0c09d6a4a372f137bb3040e1ff41fbf-1600x895.png" style="width: 100%;" /><p>We start with an initial retrieval, where we cast a wide net to gather results that are relevant to our query. Since we need to sift through all the data, we should use techniques that return results fast, even when we index billions of documents.</p><p>We therefore employ trusted techniques, such as lexical search, which Elasticsearch has supported and optimized since the beginning, or vector search, where Elasticsearch excels in speed and accuracy.</p><p>Lexical search using BM25 is quite fast and best at exact term or phrase matching, while <a href="https://www.elastic.co/docs/solutions/search/vector">vector</a> or <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> is better suited for handling natural language queries. <a href="https://www.elastic.co/what-is/hybrid-search">Hybrid search</a> combines lexical and <a href="https://www.elastic.co/docs/solutions/search/vector">vector search</a> results to bring the best from both. The challenge that hybrid search solves is that vector and lexical search have completely different and incompatible scoring functions, which produce values in different intervals, following different distributions. A vector search score close to 1 can mean a very close match, but it doesn’t mean the same for lexical search. Hybrid search methods, such as <a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">reciprocal rank fusion</a> (RRF) and linear combination of scores, assign new scores that blend the original scores from lexical and vector search.</p><p>After hybrid search, we can employ techniques such as <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking">semantic reranking</a> and <a href="https://www.elastic.co/docs/solutions/search/ranking/learning-to-rank-ltr">Learning To Rank</a> (LTR), which use specialized machine learning models to rerank the results.</p><p>With our most relevant results, we can use large language models (LLMs) to further enrich our response or pass the most relevant results as context to LLMs in agentic workflows in tools such as <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a>.</p><p>ES|QL is able to handle all these stages of retrieval. By design, ES|QL is a piped language, where each command transforms the input and sends the output to the next command. Each stage of retrieval is represented by one or more consecutive ES|QL commands. 
In this article, we show how each stage is supported in ES|QL.</p><h2>Vector search</h2><p>In Elasticsearch 9.2, we introduced tech preview support for dense vector search in ES|QL. This is as simple as calling the <code>knn</code> function, which only requires a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/dense-vector"><code>dense_vector</code></a> field and a query vector:</p><pre><code>FROM books METADATA _score
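// approximate nearest-neighbor search on a dense_vector field; ?query_vector is a query parameter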
| WHERE KNN(description_vector, ?query_vector)
| SORT _score DESC
| LIMIT 100</code></pre><p>This query executes an approximate nearest neighbor search, retrieving 100 documents that are the most similar to the <code>query_vector</code>.</p><h2>Hybrid search: Reciprocal rank fusion</h2><p>In Elasticsearch 9.2, we introduced support for hybrid search using RRF and linear combination of results in ES|QL.</p><p>This allows combining vector search and lexical search results into a single result set.</p><p>To achieve this in ES|QL, we need to use the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> commands. <code>FORK</code> runs multiple branches of execution, and <code>FUSE</code> merges the results and assigns new relevance scores using RRF or linear combination.</p><p>In the following example, we use <code>FORK</code> to run two separate branches, where one is doing a lexical search using the <code>match</code> function, while the other is doing a vector search using the <code>knn</code> function. We then merge the results together using <code>FUSE</code>:</p><pre><code>FROM books METADATA _score, _id, _index
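// each branch retrieves and ranks its own top 100 before the results are fused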
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE // uses RRF by default
| SORT _score DESC</code></pre><p>Let&#x27;s decompose the query to better understand the execution model and first look at the output of the <code>FORK</code> command:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)</code></pre><p>The <code>FORK</code> command outputs the results from both branches and adds a <code>_fork</code> discriminator column:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.86</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.84</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.79</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">4.55</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">4.25</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">4.11</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">3.8</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">4.1</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>As you’ll notice, certain documents appear twice, which is why we then use <code>FUSE</code> to merge rows that represent the same documents and assign new relevance scores. <code>FUSE</code> is executed in two stages:</p><ul><li>For each row, <code>FUSE</code> assigns a new relevance score, depending on the hybrid search algorithm that is being used.</li><li>Rows that represent the same document are merged together, and a new score is computed.</li></ul><p>In our example, we’re using RRF. As a first step, <code>FUSE</code> assigns a new score to each row using the RRF formula:</p><pre><code>score(doc) = 1 / (rank_constant + rank(doc))</code></pre><p>Here, the <code>rank_constant</code> takes a default value of 60 and <code>rank(doc)</code> represents the position of the document in the result set.</p><p>In the first phase, our results become:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 1) = 0.01639</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 2) = 0.01613</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 3) = 0.01587</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 4) = 0.01563</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 5) = 0.01538</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 6) = 0.01515</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 1) = 0.01639</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 2) = 0.01613</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 3) = 0.01587</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 4) = 0.01563</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">1 / (60 + 5) = 0.01538</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>Then the rows are merged together and a new score is assigned. Since a <code>SORT _score DESC</code> follows the <code>FUSE</code> command, the final results are:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.01639 + 0.01639 = 0.03279</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.01613 + 0.01613 = 0.03226</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.01587 + 0.01563 = 0.0315</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.01538 + 0.01587 = 0.03125</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.01563 + 0.01538 = 0.03101</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.01515</td>
                    </tr>
              </tbody>
            </table>
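          <p>The <code>rank_constant</code> is tunable: smaller values amplify the advantage of top-ranked documents, while larger values flatten the differences between ranks. A minimal sketch, assuming the <code>rank_constant</code> option is accepted in the <code>WITH</code> map of <code>FUSE</code>, just like the <code>LINEAR</code> options shown in the next section:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE RRF WITH { "rank_constant": 20 } // smaller constant favors top-ranked hits
| SORT _score DESC</code></pre>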
          <h2>Hybrid search: Linear combination of scores</h2><p><a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion">Reciprocal rank fusion</a> is the simplest way to do hybrid search, but it isn’t the only hybrid search method that we support in ES|QL.</p><p>In the following example, we use <code>FUSE</code> to combine lexical and <a href="https://www.elastic.co/docs/solutions/search/semantic-search/semantic-search-semantic-text">semantic search</a> results using linear combination of scores:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
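// LINEAR weights and sums the per-branch scores: _score = 0.7 * fork1_score + 0.3 * fork2_score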
| FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 } }
| SORT _score DESC</code></pre><p>Let&#x27;s first decompose the query and take a look at the input of the <code>FUSE</code> command when we only run the <code>FORK</code> command.</p><p>Notice that we use the <code>match</code> function, which is able to not only query lexical fields, such as <code>text</code> or <code>keyword</code>, but also <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text"><code>semantic_text</code></a> fields.</p><p>The first <code>FORK</code> branch executes a semantic query by querying a <code>semantic_text</code> field, while the second one executes a lexical query:</p><pre><code>FROM books METADATA _score, _id, _index
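// fork1 runs a semantic query on a semantic_text field; fork2 runs a lexical query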
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)</code></pre><p>The output of the <code>FORK</code> command can contain rows with the same <code>_id</code> and <code>_index</code> values representing the same Elasticsearch document:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">_fork</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.88</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.86</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.84</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.78</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.79</td><td style="border: 1px solid #ccc; padding: 4px;">fork1</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">4.55</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">4.25</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">4.11</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">3.8</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">4.1</td><td style="border: 1px solid #ccc; padding: 4px;">fork2</td>
                    </tr>
              </tbody>
            </table>
          <p>In the next step, we use <code>FUSE</code> to merge rows that have the same <code>_id</code> and <code>_index</code> values, and assign new relevance scores.</p><p>The new score is a linear combination of the scores the row had in each <code>FORK</code> branch:</p><pre><code>_score = 0.7 * _score1 + 0.3 * _score2</code></pre><p>Here, <code>_score1</code> and <code>_score2</code> represent the score a document has in the first <code>FORK</code> branch and the second <code>FORK</code> branch, respectively.</p><p>Notice that we also apply custom weights, giving more weight to the semantic score over the lexical one, resulting in this set of documents:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.88 + 0.3 * 4.55 = 1.981</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.88 + 0.3 * 4.25 = 1.891</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.84 + 0.3 * 4.1 = 1.818</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.78 + 0.3 * 4.11 = 1.779</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.86 + 0.3 * 3.8 = 1.742</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.7 * 0.79 + 0.3 * 0 = 0.553</td>
                    </tr>
              </tbody>
            </table>
          <p>One challenge is that semantic and lexical scores can be incompatible for a direct linear combination, since they can follow completely different distributions. To mitigate this, we first need to normalize the scores using a score normalization method such as <code>minmax</code>. This ensures that the scores from each <code>FORK</code> branch are first normalized to values between 0 and 1, before the linear combination formula is applied.</p><p>To achieve this with <code>FUSE</code>, we need to specify the <code>normalizer</code> option:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE MATCH(semantic_description, ?query) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
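// "minmax" rescales the scores of each fork to [0, 1]: (score - min) / (max - min)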
| FUSE LINEAR WITH { "weights": { "fork1": 0.7, "fork2": 0.3 }, "normalizer": "minmax" }
| SORT _score DESC</code></pre><h2>Semantic reranking</h2><p>At this stage, after hybrid search, we should be left with the most relevant documents. We can now use semantic reranking to reorder the results using the <code>RERANK</code> command. By default, <code>RERANK</code> uses the latest Elastic <a href="https://www.elastic.co/docs/solutions/search/ranking/semantic-reranking">semantic reranking</a> machine learning model, so no additional configuration is needed:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
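// rescore the fused top 100 with the default Elastic semantic reranking model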
| RERANK ?query ON description
| SORT _score DESC</code></pre><p>We now have our best results, sorted by relevance.</p><p>One key feature that sets the <code>RERANK</code> command apart from other products that offer semantic reranking integrations is that it doesn’t require the input to represent a mapped field from an index. <code>RERANK</code> only expects an expression that evaluates to a string value, making it possible to do semantic reranking using multiple fields:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
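// the rerank input is a string expression, so multiple fields can be combined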
| RERANK ?query ON CONCAT(title, "\n", description) 
| SORT _score DESC</code></pre><h2>LLM completions</h2><p>Now we have a set of highly relevant, reranked results.</p><p>At this stage, you might simply decide to return the results to your application, or you might want to further enhance them using LLM completions.</p><p>If you’re using ES|QL as part of a retrieval-augmented generation (RAG) workflow, you can choose to call your favorite LLM directly from ES|QL.<br/>To achieve this, we’ve added a new <code>COMPLETION</code> command that takes a prompt, an inference ID designating which LLM to call, and a column identifier specifying where to output the LLM response.</p><p>In the following example, we’re using <code>COMPLETION</code> to add a new <code>summary</code> column that contains a summary of the <code>description</code> column:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC
| LIMIT 10
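// the LLM response for each row is written to the new "summary" column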
| COMPLETION CONCAT("Summarize the following:\n", description) WITH { "inference_id" : "my_inference_endpoint" } </code></pre><p>Each row now contains a summary:</p>
            <table style="border-collapse: collapse; margin: 20px 0;">
              <tbody>
                
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">_id</td><td style="border: 1px solid #ccc; padding: 4px;">title</td><td style="border: 1px solid #ccc; padding: 4px;">_score</td><td style="border: 1px solid #ccc; padding: 4px;">summary</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4001</td><td style="border: 1px solid #ccc; padding: 4px;">The Hobbit</td><td style="border: 1px solid #ccc; padding: 4px;">0.03279</td><td style="border: 1px solid #ccc; padding: 4px;">Bilbo helps dwarves reclaim Erebor from the dragon Smaug.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">3999</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship of the Ring</td><td style="border: 1px solid #ccc; padding: 4px;">0.03226</td><td style="border: 1px solid #ccc; padding: 4px;">Frodo begins the quest to destroy the One Ring.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4005</td><td style="border: 1px solid #ccc; padding: 4px;">The Two Towers</td><td style="border: 1px solid #ccc; padding: 4px;">0.0315</td><td style="border: 1px solid #ccc; padding: 4px;">The Fellowship splits; war comes to Rohan; Frodo nears Mordor.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4123</td><td style="border: 1px solid #ccc; padding: 4px;">The Silmarillion</td><td style="border: 1px solid #ccc; padding: 4px;">0.03125</td><td style="border: 1px solid #ccc; padding: 4px;">Ancient myths and history of Middle-earth's First Age.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4006</td><td style="border: 1px solid #ccc; padding: 4px;">The Return of the King</td><td style="border: 1px solid #ccc; padding: 4px;">0.03101</td><td style="border: 1px solid #ccc; padding: 4px;">Sauron is defeated and Aragorn is crowned King.</td>
                    </tr>
                    <tr>
                      <td style="border: 1px solid #ccc; padding: 4px;">4144</td><td style="border: 1px solid #ccc; padding: 4px;">The Children of Húrin</td><td style="border: 1px solid #ccc; padding: 4px;">0.01515</td><td style="border: 1px solid #ccc; padding: 4px;">The tragic tale of Túrin Turambar's cursed life.</td>
                    </tr>
              </tbody>
            </table>
          <p>In another use case, you may simply want to answer a question using the proprietary data that you have indexed in Elasticsearch. In this case, the best search results that we’ve computed in the previous stage can be used as context for the prompt:</p><pre><code>FROM books METADATA _score, _id, _index
| FORK (WHERE KNN(description_vector, ?query_vector) | SORT _score DESC | LIMIT 100)
       (WHERE MATCH(description, ?query) | SORT _score DESC | LIMIT 100)
| FUSE
| SORT _score DESC
| LIMIT 100
| RERANK ?query ON description
| SORT _score DESC
| LIMIT 10
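// STATS VALUES(...) collapses the top 10 rows into a single row with a multivalued "context" column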
| STATS context = VALUES(CONCAT(title, "\n", description))
| COMPLETION CONCAT("Answer the following question ", ?query, "based on:\n", context) WITH { "inference_id" : "my_inference_endpoint" }</code></pre><p>Since the <code>COMPLETION</code> command unlocks the ability to send any prompt to an LLM, the possibilities are endless. Although we’re only showing a few examples, the <code>COMPLETION</code> command can be used in a wide range of scenarios, from security analysts using it to assign scores depending on whether a log event can represent a malicious action or data scientists using it to analyze data, to cases where you just need to<a href="https://www.elastic.co/search-labs/blog/esql-completion-command-llm-fact-generator"> generate Chuck Norris facts based on your data</a>.</p><h2>This is only the beginning</h2><p>In the future, we’ll be expanding ES|QL to improve semantic reranking for long documents, better conditional execution of the ES|QL queries using multiple <code>FORK</code> commands, support sparse vector queries, removing close duplicate results to enhance result diversity, allowing full text search on runtime generated columns, and many other scenarios.</p><p>Additional tutorials and guides:</p><ul><li><a href="https://www.elastic.co/docs/solutions/search/esql-for-search">ES|QL for search</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-search-tutorial">ES|QL for search tutorial</a></li><li><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">Semantic_text field type</a></li><li><a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fork"><code>FORK</code></a> and <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/fuse"><code>FUSE</code></a> documentation</li><li>ES|QL search functions</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Implementing an agentic reference architecture with Elastic Agent Builder and MCP]]></title>
            <description><![CDATA[Explore an agentic reference architecture with Elastic Agent Builder, MCP, and semantic search to build a security agent for automated threat analysis.]]></description>
            <link>https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Jeffrey Rengifo]]></dc:creator>
            <pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/33a57e3f61f9095c99b6d1499175a6edb0d5dfc5-4720x2560.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In this article, we will present a reference architecture for using Elasticsearch with AI capabilities through the <a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a>, exposing an <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP server</a> to access Agent Builder tools and Elasticsearch data.</p><p>Model Context Protocol (<a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a>) is an open-source standard that enables applications and LLMs to communicate with external systems via <a href="https://modelcontextprotocol.io/specification/2025-06-18/server/tools">MCP tools</a> (programmatic capabilities), and <a href="https://docs.langchain.com/oss/python/langgraph/overview">LangGraph</a> (an extension of <a href="https://docs.langchain.com/oss/javascript/langchain/overview">LangChain</a>) provides the orchestration framework for these agentic workflows.</p><p>We’ll implement an application that can search both internal knowledge (data stored in Elasticsearch) and external sources (the internet) to identify potential and known vulnerabilities related to a specific tool. The application will gather the information and generate a detailed summary of the findings.</p><h2>Requirements</h2><ul><li>Elasticsearch 9.2</li><li>Python 3.1x</li><li><a href="https://platform.openai.com/api-keys">OpenAI API Key</a></li><li><a href="https://www.elastic.co/docs/deploy-manage/api-keys/elasticsearch-api-keys">Elasticsearch API Key</a></li><li><a href="https://serpapi.com/users/sign_up?plan=free">Serper API Key</a></li></ul><h2>Elastic Agent Builder</h2><p><a href="https://www.elastic.co/docs/solutions/search/elastic-agent-builder">Elastic Agent Builder</a> is a set of AI-powered capabilities for developing and integrating agents that can interact with your Elasticsearch data. It provides a built-in agent that can be used for natural language conversations with your data or instance, and it also supports tool creation, Elastic APIs, A2A, and MCP. In this article, we will focus on using the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP server</a> for external access to the Elastic Agent Builder tools.</p><p>To learn more about Agent Builder features, you can read <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">this article</a>.</p><h3>Agent Builder MCP feature</h3><p>The <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server">MCP server</a> is available in the Agent Builder and can be accessed at:</p><pre><code>{KIBANA_URL}/api/agent_builder/mcp
# Or if you are using a custom Kibana space:
{KIBANA_URL}/s/{SPACE_NAME}/api/agent_builder/mcp</code></pre><p>Agent Builder offers <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#built-in-tools">built-in tools</a>, and you can also create your own <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#custom-tools">custom tools</a>.</p><h2>Reference architecture</h2><p>To get a complete overview of the elements used by an agentic application in an end-to-end workflow, let’s look at the following diagram:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/c5bbba345340bfe5571b17d53b5896d4a3235eac-4720x2560.png" style="width: 100%;" /><p>Elasticsearch is at the center of this architecture, functioning as a vector store, providing the embeddings generation model, and also serving the MCP server that exposes the data via tools. To better explain the workflow, let’s look at the ingestion layer and the Agent Builder layer separately.</p><h3>Ingestion layer</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/97a0075ae637d64140ec7ff0d167297723675632-3000x1176.png" style="width: 100%;" /><p>Here, the first element is the data that will be stored in Elasticsearch. The data passes through an ingest pipeline, where it is processed by the Elasticsearch ELSER model to generate embeddings, and is then indexed.</p><h3>Elastic Agent Builder layer</h3><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9bb396bbd4c3baa3be26f9d9e386f4d5405132ab-2180x2560.png" style="width: 100%;" /><p>In this layer, the Agent Builder plays a central role by exposing the tools needed to interact with the Elasticsearch data. It manages the tools that operate over Elasticsearch indices and makes them available for consumption. <a href="https://docs.langchain.com/oss/python/langchain/overview">LangChain</a> then handles the orchestration via the MCP client.</p><p>This architecture lets Agent Builder act as one of many MCP servers available to the client, so it can be combined with other MCP servers. This way, the MCP client can ask cross-source questions and then combine the answers.</p><h2>Use case: Security vulnerability agent</h2><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/1b46b29a8cde4645ebaec1f747be4f6888dd8d39-1600x906.png" style="width: 100%;" /><p>The security vulnerability agent identifies potential risks based on a user’s question by combining three complementary layers:</p><p><strong>First</strong>, it performs a <a href="https://www.elastic.co/docs/solutions/search/semantic-search">semantic search</a> with embeddings over an internal knowledge base of past incidents, configurations, and known vulnerabilities to retrieve relevant historical evidence.</p><p><strong>Second</strong>, it searches the internet for newly published recommendations or threat intelligence that may not yet exist internally.</p><p><strong>Finally</strong>, an LLM correlates and prioritizes both internal and external findings, evaluates their relevance to the user’s specific environment, and produces a clear explanation along with potential mitigation steps.</p><h2>Developing the application</h2><p>The application’s code can be found in the attached <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">notebook</a>.</p><p>You can see the setup for the Python application below:</p><pre><code># imports and environment setup
import os
import json

import requests
from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers

# load environment variables
load_dotenv()

ELASTICSEARCH_ENDPOINT = os.getenv("ELASTICSEARCH_ENDPOINT")
ELASTICSEARCH_API_KEY = os.getenv("ELASTICSEARCH_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SERPER_API_KEY = os.getenv("SERPER_API_KEY")
KIBANA_URL = os.getenv("KIBANA_URL")

INDEX_NAME = "security-vulnerabilities"
KIBANA_HEADERS = {
    "kbn-xsrf": "true",
    "Content-Type": "application/json",
    "Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}",
} # Useful for Agent Builder API calls
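
# Agent Builder MCP endpoint (URL pattern shown above); used later by the MCP client.
# Assumption: default Kibana space — add /s/{SPACE_NAME} if you use a custom space.
MCP_ENDPOINT = f"{KIBANA_URL}/api/agent_builder/mcp"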


es_client = Elasticsearch(ELASTICSEARCH_ENDPOINT, api_key=ELASTICSEARCH_API_KEY)  # Elasticsearch client</code></pre><p>We need to access Agent Builder and create an agent specialized in security queries and a tool that performs semantic search. You need to have <a href="https://www.elastic.co/docs/solutions/search/agent-builder/get-started">Agent Builder <strong>enabled</strong></a> for the next step. Once it’s on, we’ll use the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/kibana-api#tools">tools API</a> to create a tool that will perform a semantic search.</p><pre><code>security_search_tool = {
    "id": "security-semantic-search",
    "type": "index_search",
    "description": "Search internal security documents including incident reports, pentests, internal CVEs, security guidelines, and architecture decisions. Uses semantic search powered by ELSER to find relevant security information even without exact keyword matches. Returns documents with severity assessment and affected systems.",
    "tags": ["security", "semantic", "vulnerabilities"],
    "configuration": {
        "pattern": INDEX_NAME,
    },
}

try:
    response = requests.post(
        f"{KIBANA_URL}/api/agent_builder/tools",
        headers=KIBANA_HEADERS,
        json=security_search_tool,
    )

    if response.status_code == 200:
        print("✅ Security semantic search tool created successfully")
    else:
        print(f"❌ Failed to create tool ({response.status_code}): {response.text}")
except Exception as e:
    print(f"❌ Error creating tool: {e}")</code></pre><p>Configure your tools following the <a href="https://www.elastic.co/docs/solutions/search/agent-builder/tools#best-practices">best practices</a> defined by Elastic for developing Tools. Once created, this tool will be ready to use in the Kibana UI.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/5e5b3282dea07987613d8e8d35c372ca68820e44-1600x381.png" style="width: 100%;" /><p>With the tool created, we can start writing the code for the ingestion workflow:</p><h3>Ingest pipeline</h3><p>To define the data structure, we need to have a <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/dataset.json">dataset</a> prepared for ingestion. Below is a sample document for this example:</p><pre><code>{
    "title": "Incident Report: Node.js Express 4.17 Prototype Pollution RCE",
    "content": "In March 2024, our production Node.js Express 4.17 API gateway experienced a critical prototype pollution vulnerability leading to remote code execution. The attack vector involved manipulating object prototypes through JSON payloads in POST requests. This affected all Express middleware processing user input. Immediate mitigation: upgrade to Express 4.18.2+, implement input validation, use Object.freeze() for critical objects. Related to CVE-2022-24999.",
    "doc_type": "incident_report",
    "severity": "critical",
    "affected_systems": [
      "api-gateway-prod",
      "api-gateway-staging"
    ],
    "date": "2024-03-15"
}</code></pre><p>For this type of document, we will use the following index mappings:</p><pre><code>index_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "copy_to": "semantic_field"},
            "content": {"type": "text", "copy_to": "semantic_field"},
            "doc_type": {"type": "keyword", "copy_to": "semantic_field"},
            "severity": {"type": "keyword", "copy_to": "semantic_field"},
            "affected_systems": {"type": "keyword", "copy_to": "semantic_field"},
            "date": {"type": "date"},
            "semantic_field": {"type": "semantic_text"},
        }
    }
}

if not es_client.indices.exists(index=INDEX_NAME):
    es_client.indices.create(index=INDEX_NAME, body=index_mapping)
    print(f"✅ Index '{INDEX_NAME}' created with semantic_text field for ELSER")
else:
    print(f"ℹ️  Index '{INDEX_NAME}' already exists, skipping creation")</code></pre><p>We are creating a <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text">semantic_text</a> field to perform semantic search using the information from the fields marked with the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/copy-to">copy_to</a> property.</p><p>With that mapping definition, we can ingest the data using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk">bulk API</a>.</p><pre><code>def build_bulk_actions(documents, index_name):
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


try:
    with open("dataset.json", "r") as f:
        security_documents = json.load(f)

    success, failed = helpers.bulk(
        es_client,
        build_bulk_actions(security_documents, INDEX_NAME),
        refresh=True,
    )
    print(f"📥 {success} documents indexed successfully")

except Exception as e:
    print(f"❌ Error during bulk indexing: {str(e)}")</code></pre><h3>LangChain MCP client</h3><p>Here we’re going to create an MCP client using LangChain to consume the Agent Builder tools and build a workflow with LangGraph to orchestrate the client execution. The first step is to <a href="https://www.elastic.co/docs/solutions/search/agent-builder/mcp-server#configuring-mcp-clients">connect to the MCP server</a>:</p><pre><code>client = MultiServerMCPClient(
    {
        "agent-builder": {
            "transport": "streamable_http",
            "url": MCP_ENDPOINT,
            "headers": {"Authorization": f"ApiKey {ELASTICSEARCH_API_KEY}"},
        }
    }
)

tools = await client.get_tools()

print(f"📋 MCP Tools available: {[t.name for t in tools]}") # ['platform_core_search',  ... 'security-semantic-search']</code></pre><p>Next, we create an agent that selects the appropriate tool based on the user input:</p><pre><code>reasoning = {"effort": "low"}

llm = ChatOpenAI(
    model="gpt-5.2-2025-12-11", reasoning=reasoning, openai_api_key=OPENAI_API_KEY
) # LLM client 

agent = create_agent(
    llm,
    tools=tools,
    system_prompt="""You are a cybersecurity expert specializing in infrastructure security.

        Your role is to:
        1. Analyze security queries from users
        2. Search internal security documents (incidents, pentests, CVEs, guidelines)
        3. Provide actionable security recommendations
        4. Assess vulnerability severity and impact

        When responding:
        - Always search internal documents first using the agent builder tools
        - Provide specific, technical, and actionable advice
        - Cite relevant internal incidents and documentation
        - Assess severity (critical, high, medium, low)
        - Recommend immediate mitigation steps

        Be concise but comprehensive. Focus on practical security guidance.""",
)</code></pre><p>We’ll use the GPT-5.2 model, which represents OpenAI’s state-of-the-art for agent management tasks. We configure it with low reasoning effort to achieve faster responses compared to the medium or high settings, while still delivering high-quality results by leveraging the full capabilities of the GPT-5 family. You can read more about GPT-5.2 <a href="https://openai.com/index/introducing-gpt-5-2/">here</a>.</p><p>Now that the initial setup is done, the next step is to define a workflow capable of making decisions, running tool calls, and summarizing results.</p><p>For this, we use LangGraph. We won’t cover LangGraph in depth here; <a href="https://www.elastic.co/search-labs/blog/ai-agent-workflow-finance-langgraph-elasticsearch">this article</a> provides a detailed overview of its functionality.</p><p>The following image shows a high-level view of the LangGraph application.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/729295115427ec981a594e873245fa541dd977aa-332x531.png" style="width: 100%;" /><p>We need to define the application state:</p><pre><code>from typing import TypedDict

class AgentState(TypedDict):
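    """Shared state passed between the LangGraph workflow nodes."""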
    query: str
    agent_builder_response: dict
    internet_results: list
    final_response: str
    needs_internet_search: bool</code></pre><p>To better understand how the workflow operates, here is a brief description of each function, followed by a sketch of how the nodes can be wired together. For full implementation details, refer to the accompanying <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">notebook</a>.</p><ul><li><strong>call_agent_builder_semantic_search:</strong> Queries internal documentation using the Agent Builder MCP server and also stores the retrieved messages in the state.</li><li><strong>decide_internet_search:</strong> Analyzes the internal results and determines whether an external search is required.</li><li><strong>perform_internet_search:</strong> Runs an external search using the <a href="https://serper.dev/">Serper</a> API when needed.</li><li><strong>generate_response:</strong> Correlates internal and external findings and produces a final, actionable cybersecurity analysis for the user.</li></ul>
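<p>The following is a minimal sketch of how these functions can be wired into a LangGraph graph. It assumes the four node functions above are defined as in the notebook and that <code>decide_internet_search</code> returns the name of the next node; treat it as an illustration rather than the notebook’s exact implementation:</p><pre><code>from langgraph.graph import StateGraph, START, END

# Build the graph over the shared AgentState
workflow = StateGraph(AgentState)
workflow.add_node("agent_builder_search", call_agent_builder_semantic_search)
workflow.add_node("internet_search", perform_internet_search)
workflow.add_node("generate_response", generate_response)

# Always start with the internal semantic search via Agent Builder
workflow.add_edge(START, "agent_builder_search")

# Route to the internet search only when internal results are insufficient
workflow.add_conditional_edges(
    "agent_builder_search",
    decide_internet_search,  # assumed to return the name of the next node
    {"internet_search": "internet_search", "generate_response": "generate_response"},
)
workflow.add_edge("internet_search", "generate_response")
workflow.add_edge("generate_response", END)

app = workflow.compile()  # later: await app.ainvoke({"query": query})</code></pre>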
<p>With the workflow defined, we can now send a query:</p><pre><code>query = "We are using Node.js with Express 4.17 for our API gateway. Are there known prototype pollution or remote code execution vulnerabilities?"</code></pre><p>In this example, we want to evaluate whether this specific version of Express is affected by known vulnerabilities.</p><h4>Research results</h4><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b18a31e42bcd8f4d86bb605f85d4ff77135b0855-1084x517.png" style="width: 100%;" /><p>See the complete response in <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/elasticsearch-reference-architecture-for-agentic-applications/notebook.ipynb">this file</a>.</p><p>This response clearly correlates internal and internet findings and provides actionable mitigation steps. It successfully highlights the severity of the vulnerability and offers a structured, security-oriented summary.</p><h3>Extensions and future enhancements</h3><p>This architecture is modular and allows us to extend its capabilities by replacing, improving, or adding components. We could add another agent, consumed by the same MCP client. We could also use an automated ingestion workflow with tools such as Logstash, Kafka, or <a href="https://www.elastic.co/docs/reference/search-connectors/self-managed-connectors">Elastic self-managed connectors</a>. Feel free to change the LLM, the MCP client framework, or the embeddings model, or to add more tools depending on your needs.</p><h2>Conclusion</h2><p>This reference architecture shows a practical way to combine Elasticsearch, the Agent Builder, and MCP to build an AI-driven application. Its structure keeps each part independent, which makes the system easy to implement, maintain, and extend.</p><p>You can start with a simple setup (like the security use case in this article) and scale it by adding new tools, data sources, or agents as your needs grow. Overall, it provides a straightforward path for building flexible and reliable agentic workflows on top of Elasticsearch.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Automating log parsing in Streams with ML]]></title>
            <description><![CDATA[Learn how a hybrid ML approach achieved 94% log parsing and 91% log partitioning accuracy through automation experiments with log format fingerprinting in Streams.]]></description>
            <link>https://www.elastic.co/search-labs/blog/log-parsing-partitioning-automation-experiments-streams</link>
            <guid isPermaLink="true">https://www.elastic.co/search-labs/blog/log-parsing-partitioning-automation-experiments-streams</guid>
            <category><![CDATA[elasticsearch-labs]]></category>
            <dc:creator><![CDATA[Nastia Havriushenko]]></dc:creator>
            <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
            <enclosure url="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/965c58f19742361160593c38fcaa8b2f4b0d6cc5-3838x2159.png?fm=webp&amp;w=1440&amp;h=754&amp;q=90&amp;fit=crop" length="0" type="image/webp"/>
            <content:encoded><![CDATA[<p>In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/9108ec303339dd091faa3c363c7cf5c228155f49-3840x2160.png" style="width: 100%;" /><p>Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models (LLMs), with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.</p><p>We are happy to announce that this feature is already available in <a href="http://elastic.co/elasticsearch/streams"><span style="text-decoration:underline">Streams</span></a>!</p><h2>Dataset description</h2><p>We chose the <strong><a href="https://github.com/logpai/loghub">Loghub</a></strong> collection of logs for PoC purposes. For our investigation, we selected representative samples from the following key areas:</p><ul><li>Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.</li><li>Server &amp; web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.</li><li>Operating systems: We included logs from Linux and Windows. These datasets represent the common, semi-structured system-level events that operations teams encounter daily.</li><li>Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.</li><li>Supercomputers: To test performance in high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.</li></ul><p>A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with a microservice architecture.</p><p>Log examples:</p><pre><code>[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed</code></pre><p>In addition, we created a Kubernetes cluster with a typical web application + database setup to mine extra logs in the most common domain.</p><p>Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.</p><h2>Few-shot log parsing with an LLM</h2><p>Our first set of experiments focused on a fundamental question: <strong>Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?</strong></p><p>We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and <a href="https://www.elastic.co/docs/explore-analyze/scripting/grok">Grok</a> formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.</p><h3>High confidence &amp; context awareness</h3><p>Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Besides simple pattern matching, the model showed a capacity for log understanding—it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).</p><h3>The &quot;Goldilocks&quot; dilemma of input samples</h3><p>Our experiments quickly surfaced a significant lack of robustness because of extreme <strong>sensitivity to the input sample</strong>. The model&#x27;s performance fluctuated wildly based on the specific log examples included in the prompt. We observed a log similarity problem where the log sample needs to include <em>just diverse enough</em> logs:</p><ul><li>Too homogeneous (overfitting): If the input logs are too similar, the LLM tends to <strong>overspecify</strong>. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. This results in brittle rules that cover a tiny ratio of logs and extract unusable fields.</li><li>Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, &quot;trash logs&quot; like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.</li></ul><h3>The context window constraint</h3><p>We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model&#x27;s output often deteriorated, becoming &quot;messy&quot; or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.</p><h3>The consistency &amp; standardization gap</h3><p>Even when the model successfully generated rules, we noted slight inconsistencies:</p><ul><li>Service naming variations: The model proposed different names for the same entity (e.g., labeling the source as &quot;Spark,&quot; &quot;Apache Spark,&quot; and &quot;Spark Log Analytics&quot; in different runs).</li><li>Field naming variations: Field names lacked standardization (e.g., <code>id</code> vs. <code>service.id</code> vs. <code>device.id</code>).
We normalized names using the standardized <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">Elastic field naming</a> conventions.</li><li>Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.</li></ul><h2>Log format fingerprint</h2><p>To address the challenge of log similarity, we introduce a high-performance heuristic: <strong>log format fingerprint (LFF)</strong>.</p><p>Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified &quot;fingerprint&quot; that allows us to group related logs.</p><p>The mapping logic is simple to ensure speed and consistency:</p><ol><li>Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.</li><li>Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.</li><li>Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.</li><li>Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.</li></ol><p>Let&#x27;s look at an example of how this mapping allows us to transform the logs.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/78fa2887486eb9417804354ee3bf2a4fdb0f6383-846x252.png" style="width: 100%;" /><p>As a result, we obtain the following log masks:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/ec439a3d3a25002498b97defcff733ea5ebc6b55-826x94.png" style="width: 100%;" /><p>Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (<code>0/0/0 0:0:0 a a.a:</code>) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.</p><p>The third log, however, produces a completely divergent fingerprint (<code>0-0-0...</code>). This allows us to algorithmically separate it from the first group <em>before</em> we ever invoke an LLM.</p>
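<p>For reference, here is a minimal Python sketch of the fingerprint function. It mirrors the chain of REPLACE calls in the ES|QL query shown in the next section and is an illustration, not the exact implementation used in Streams:</p><pre><code>import re

def log_format_fingerprint(message: str) -> str:
    """Map a raw log line to its structural fingerprint."""
    fp = re.sub(r"[ \t\n]+", " ", message)  # whitespace runs -> single space
    fp = re.sub(r"[A-Za-z]+", "a", fp)      # letter runs -> 'a'
    fp = re.sub(r"[0-9]+", "0", fp)         # digit runs -> '0'
    fp = re.sub(r"a( a)+", "a", fp)         # words separated by spaces -> 'a'
    return fp                               # punctuation is preserved

raw = "[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6"
print(log_format_fingerprint(raw))</code></pre>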
<h2>Bonus part: Instant implementation with ES|QL</h2><p>It’s as easy as passing this query in Discover:</p><pre><code>FROM loghub |
EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, "[ \t\n]+", " "), "[A-Za-z]+", "a"), "[0-9]+", "0"), "a( a)+", "a") |
STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources=VALUES(filename), example=TOP(raw_message, 3, "desc") BY SUBSTRING(pattern, 0, 15) |
SORT total_count DESC |
LIMIT 100</code></pre><p><strong>Query breakdown:</strong></p><p><strong>FROM</strong> loghub: Targets our index containing the raw log data.</p><p><strong>EVAL</strong> pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to &#x27;0&#x27;, text to &#x27;a&#x27;, etc.) and save the result in a “pattern” field.</p><p><strong>STATS</strong> [column1 =] expression1, … <strong>BY</strong> SUBSTRING(pattern, 0, 15): The clustering step. We group logs that share the first 15 characters of their pattern and create aggregated fields such as the total log count per group, the list of log data sources, the pattern prefix, and three log examples.</p><p><strong>SORT</strong> total_count DESC | <strong>LIMIT</strong> 100: Surfaces the top 100 most frequent log patterns.</p><p>The query results on LogHub are displayed below:</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b119498f124376c41d242a099bf9081fd6536be8-1600x394.png" style="width: 100%;" /><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/4dcfc0a5b7fa753497cc5def5ea3cd54449c0481-1600x719.png" style="width: 100%;" /><p>As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (&gt;90%) and achieved majority clustering in 13 out of 16 sources (&gt;60%)—all without requiring additional cleaning, preprocessing, or fine-tuning.</p><p>Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation">log pattern analysis</a>. It provides immediate insights into log relationships and effectively manages large log clusters.</p><ul><li>Versatility as a primitive</li></ul><p>Thanks to the <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">ES|QL</a> implementation, LFF serves both as a standalone tool for fast data diagnostics/visualisations and as a building block in log analysis pipelines for high-volume use cases.</p><ul><li>Flexibility</li></ul><p>LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.</p><ul><li>Deterministic stability</li></ul><p>Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.</p><ul><li>Performance and memory</li></ul><p>It requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.</p><h2>Combining log format fingerprint with an LLM</h2><p>To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.</p><p>The objective was to demonstrate that LFF acts as an effective compression layer.
We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.</p><h2>Execution pipeline</h2><p>We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/b3f46ea471760b406a32fc7d4bc74cc03faaced2-3840x1660.png" style="width: 100%;" /><p>1. Two-stage hierarchical clustering</p><ul><li>Subclasses (exact match): Logs are aggregated by identical fingerprints. Every log in one subclass shares the exact same format structure.</li><li>Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.</li><li>Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses when the first N characters of their format fingerprints match. This grouping strategy effectively gathers lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.</li></ul><p>2. Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.</p><ul><li>We select representative logs from <em>each</em> valid subclass within the broader metaclass.</li><li>To manage the edge case of too many subclasses, we apply random down-sampling to fit the target window size.</li></ul><p>3. Rule generation. Finally, we prompt the LLM to generate a regex parsing rule that fits all logs in the provided sample for each metaclass. For our PoC, we used the GPT-4o mini model. A condensed sketch of the clustering and sampling stages is shown below.</p>
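<p>The sketch below condenses the clustering and sampling stages into a few lines of Python, reusing the <code>log_format_fingerprint</code> function from earlier. The helper name and sampling details are ours for illustration; the thresholds follow the description above:</p><pre><code>import random
from collections import defaultdict

def build_llm_samples(logs, prefix_len=5, min_ratio=0.05, window=30):
    # Stage 1a: subclasses — group logs by their exact fingerprint
    subclasses = defaultdict(list)
    for line in logs:
        subclasses[log_format_fingerprint(line)].append(line)

    # Stage 1b: outlier cleaning — drop subclasses below 5% of total volume
    total = len(logs)
    subclasses = {fp: grp for fp, grp in subclasses.items()
                  if len(grp) / total >= min_ratio}

    # Stage 1c: metaclasses — group remaining subclasses by fingerprint prefix
    metaclasses = defaultdict(list)
    for fp, grp in subclasses.items():
        metaclasses[fp[:prefix_len]].append(grp)

    # Stage 2: stratified sampling — representatives from each subclass,
    # down-sampled to the target window size
    samples = {}
    for prefix, groups in metaclasses.items():
        picked = [line for grp in groups
                  for line in random.sample(grp, min(len(grp), window // len(groups) + 1))]
        samples[prefix] = picked[:window]
    return samples  # Stage 3 prompts the LLM once per metaclass sample</code></pre>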
<h2>Experimental results &amp; observations</h2><p>We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.</p><img src="https://cdn.sanity.io/images/me0ej585/search-labs-import-testing/49b2b6a1401dd1f33951da68e5a3fac37d0b5aaa-1600x1506.png" style="width: 100%;" /><p>The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. The diagonal alignment demonstrates the model&#x27;s high fidelity in source attribution, with minimal scattering.</p><h2>Performance benchmark insights</h2><ul><li><strong>Optimal baseline:</strong> a context window of <strong>30–40 log samples</strong> per category proved to be the &quot;sweet spot,&quot; consistently producing robust parsing with both Regex and Grok patterns.</li><li><strong>Input minimisation:</strong> we pushed the input size down to 10 logs per category for Regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.</li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>