New to Elasticsearch? Join our getting started with Elasticsearch webinar. You can also start a free cloud trial or try Elastic on your machine now.
Parts 1 through 7 of this series described a governed control plane for ecommerce search. A user types a query. The control plane classifies intent, enforces business constraints, resolves policy conflicts, and routes to the appropriate retrieval strategy, all before the product catalog is ever queried. The entire architecture assumes that the input is a search string typed by a human shopper.
This final post asks: What changes when the input comes from an AI agent instead?
The answer is that the architecture doesn't change, but the stakes do. Every property of the governed control plane that matters for human-authored queries matters more when the upstream decision-maker is a large language model (LLM). Determinism, auditability, conflict resolution, and constraint enforcement become critical guardrails rather than operational conveniences, because the system producing the input is probabilistic by nature.
The agentic search problem
The most common approach to AI-driven search is straightforward: Give the LLM the database schema, provide business rules in the prompt, and let the agent generate the query directly.
For an ecommerce chatbot, this means injecting the Elasticsearch index mapping, field types, category taxonomies, pricing logic, and business constraints into the agent's context window, and then asking the LLM to translate natural language into valid Elasticsearch Query DSL. The LLM becomes the query author.
This approach works in demos. It fails in production for four reasons.
Context bloat
An enterprise ecommerce index mapping is not a trivial document. Field definitions, nested objects, multi-field configurations, and analyzer settings can run to thousands of tokens before any business logic is added. On top of the mapping, the agent needs category taxonomies (which in enterprise ecommerce can contain tens of thousands of values), pricing rules, brand hierarchies, eligibility constraints, and campaign logic.
The result is a context window dominated by structural metadata rather than the user's actual intent. This increases latency, increases token cost, and degrades the LLM's ability to follow instructions as the context grows. This is a well-documented phenomenon, sometimes called context rot: As the prompt gets longer, the model's attention to any particular instruction weakens.
Probabilistic hallucination
LLMs generate queries based on patterns in their training data and the context provided. When asked to produce Elasticsearch Query DSL, the model can hallucinate field names that don't exist, construct syntactically invalid query clauses, misapply filter types to the wrong field types, or produce queries that are syntactically valid but semantically wrong, returning results that don't match the user's intent.
Google Cloud's BIRD benchmark for Text-to-SQL illustrates the ceiling of this approach. Google's state-of-the-art single-model result achieved between 70% and 80% accuracy, meaning that nearly one in four generated queries was incorrect. This is for SQL, which is far more standardized than Elasticsearch Query DSL. The error rate for LLM-generated Elasticsearch queries in a real production environment, with complex mappings and business-specific semantics, would likely be higher.
For a revenue-critical ecommerce system, a one in four query error rate isn’t a tuning problem to be solved iteratively. It’s an architectural limitation of the approach.
The security gap
When the LLM has access to the database schema and acts as the query author, the system is vulnerable to indirect prompt injection. A user interacting with an ecommerce chatbot can craft inputs designed to manipulate the agent into generating unintended queries.
This isn’t a theoretical risk. Prompt injection is one of the most actively researched attack surfaces in deployed LLM systems. The fundamental issue is that when the agent authors the query, there’s no structural boundary between user intent and query execution. The LLM is simultaneously interpreting the user's request and constructing the database operation. Any manipulation of the first directly affects the second.
High-cardinality scaling failure
Certain ecommerce fields have extreme cardinality. A product catalog might have 17,000 category values, thousands of brand names, and hundreds of attribute combinations. Standard agentic workflows require injecting these values into the context so the LLM can select the correct one when constructing a query.
This creates an impossible trade-off: Either inject all possible values (consuming enormous context and degrading performance), inject a subset (and accept that the agent cannot reference values outside that subset), or fall back to ungoverned search. This connects directly to the core problem from Part 1: If the LLM searches for “oranges” and Elasticsearch returns orange soda, the chat experience degrades in the same way a search experience does. The absence of governance means the system cannot enforce the shopper's intended resolution.

Retrieving relevant values dynamically based on the query is a known alternative, but it introduces an additional nondeterministic step where the retrieval itself can miss relevant values. Additionally, this adds latency and complexity to every query.
The architectural alternative: Decoupling intent from execution
The governed control plane described in Parts 1 through 7 offers a fundamentally different approach. Instead of the LLM authoring the final query, the LLM's role is reduced to a single, well-bounded task: extracting a search intent string from the user's natural language input.
The user says: "I'm looking for cheap brown shoes." The agent's job isn’t to generate an Elasticsearch query. It’s to extract and pass along the search intent, (in this case, something like "cheap brown shoes") to the control plane. The control plane then does what it has always done: percolates the intent string against stored policies, composes matching policies through cascading transformations, resolves conflicts deterministically, and produces a governed Elasticsearch query.
The LLM never sees the index mapping. It never knows about field types, category taxonomies, or pricing thresholds. It never constructs a query clause. It operates on the natural language side of an architectural boundary that we call the metadata air gap, a strict separation between the probabilistic component (the LLM) and the structured data layer (schema, policies, and query construction).

What the metadata air gap provides
- Schema blindness. The LLM has no access to the database schema and therefore cannot generate invalid queries, hallucinate field names, or be manipulated into exposing structural information. The schema exists only on the deterministic side of the air gap.
- Minimal context. Instead of thousands of tokens of mapping data, business rules, and category taxonomies, the LLM's prompt contains only a persona and intent extraction instructions. This dramatically reduces token cost, latency, and context rot.
- Deterministic execution. Every query that reaches Elasticsearch is constructed by the control plane using human-vetted policy templates, not generated probabilistically by an LLM. Syntactic validity is guaranteed. Semantic correctness is enforced by the same policy framework that Parts 1 through 6 described.
- Security by architecture. Prompt injection becomes structurally ineffective. Even if a user manipulates the agent into producing an unusual intent string, that string is percolated against stored policies. If no policy matches, no query is generated. The user cannot instruct the agent to construct a query because the agent doesn't construct queries. The control plane does, and the control plane is deterministic.
How the pieces connect
The following walkthrough shows how the governed control plane handles an agent-mediated query.
Step 1: The user speaks to the agent
A shopper interacting with an ecommerce chatbot says: "I'm looking for cheap chocolate, nothing with peanuts."
Step 2: The agent extracts intent
The LLM's role is intent extraction, not query generation. Given a minimal prompt that instructs it to identify the product intent, the agent produces a search intent string: "cheap chocolate without peanuts".
This is a lightweight classification task. The LLM doesn’t need the index mapping, category taxonomy, or pricing rules to perform it. It needs to understand natural language, which is exactly what LLMs are good at.
Step 3: The control plane governs the query
The intent string "cheap chocolate without peanuts" is passed to the control plane, which percolates it against the policy index. Three policies match:
- The "cheap" policy (extracts "cheap", applies a price filter based on the product category).
- The "chocolate" policy (constrains results to chocolate categories).
- The "without" negation policy (extracts the exclusion target and applies a
must_notfilter)
The control plane applies these policies through the same cascading transformation described in Part 3 and Part 4: priority ordering, per-field conflict resolution, consumed phrase tracking. If a “Christmas campaign” policy is also active, it composes with the product policies exactly as described in Part 3, the agent's involvement doesn't change the governance model at all.
Step 4: The governed query executes
The control plane produces a fully governed Elasticsearch query: a search for “chocolate”, constrained to the appropriate categories, with a price ceiling derived from the “cheap” policy, an exclusion filter for peanut-containing products, and any active campaign boosts applied. If the “chocolate” policy also includes economic optimization weights (Part 7), those are applied as well. Margin boosting is set to 3.0x because “chocolate” is a browsing query where the retailer benefits from promoting higher-margin products. If the shopper has purchase history (Part 6), personalization signals are layered on top. This query is syntactically valid by construction and semantically correct by policy design.
Step 5: Results return through the agent
The product results are returned to the agent, which presents them conversationally to the user. The agent's role in the return path is presentation: formatting results, answering follow-up questions, providing product details. The retrieval itself was governed, deterministic, and explainable.
What the agent is good at (and what it isn't)
This architecture leverages the LLM for what it does well and protects the system from what it does poorly.
LLMs excel at understanding natural language intent. "I'm looking for cheap chocolate, nothing with peanuts" is a natural language understanding task, parsing intent, identifying product references, recognizing negation. LLMs handle this reliably because it's a classification problem, not a generation problem. The output is a short intent string, not a complex structured query.
LLMs struggle with precise structured output under complex constraints. Generating valid Elasticsearch Query DSL requires exact field names, correct clause nesting, appropriate filter types for each field, and consistent application of business rules across thousands of edge cases. These are exactly the properties that a deterministic system enforces trivially and that a probabilistic system enforces unreliably.
The governed control plane puts each component where it belongs: the LLM on the natural language side, the deterministic policy engine on the query construction side, and an architectural boundary between them.
Governance constrains the blast radius
This is the same insight from Part 3, extended to the agentic context. In Part 3, we observed that governance makes semantic retrieval safer by narrowing the candidate set before retrieval begins. A semantic search over 500 products in a governed category is a fundamentally different proposition from a semantic search over 500,000 SKUs.
The same principle applies to agent-mediated queries. Without governance, an agent that misinterprets "cheap chocolate" could generate a query that searches the entire catalog with no price constraint, no category filter, and no exclusions. With governance, even if the agent produces an imperfect intent string, the control plane constrains the query to the policies that match. The worst case is that fewer policies fire, not that an unbounded query hits the product catalog.
Governance narrows the blast radius of probabilistic errors. This is true whether the probabilistic component is a semantic retrieval model or an LLM agent.
LLM-suggested policies: Expanding coverage
Part 2 introduced the idea that an LLM can suggest new policies that enter the same Author → Test → Promote pipeline as human-authored ones. In the agentic context, this becomes a powerful feedback loop.
An LLM can analyze query logs, identify patterns where the control plane has no matching policy (queries that fall through to unmodified retrieval), and suggest new policies to cover those gaps. A merchandiser reviews each suggestion, tests it, and promotes it if it produces the expected behavior. The governance model ensures that no LLM-suggested policy reaches production without human validation.
Over time, this creates a virtuous cycle: The control plane's policy coverage expands, the proportion of queries that require unmodified retrieval shrinks, and the system becomes progressively more governed, with every policy auditable, versioned, and individually reversible.
The broader pattern: Deterministic guardrails for probabilistic systems
The architecture described in this series, a deterministic control plane that sits between a probabilistic input source and a data retrieval system, isn’t specific to ecommerce search. The same pattern applies wherever an AI agent needs to interact with structured data.
An agent querying a SQL database faces the same challenges: context bloat from schema injection, hallucinated column names, prompt injection risks, and high-cardinality value selection. An agent interacting with a ticketing system like Jira, a customer relationship management (CRM) system like Salesforce, or a code repository like GitHub faces analogous problems. In every case, the core architectural question is the same: Should the LLM author the query, or should the LLM extract intent and pass it to a deterministic layer that authors the query?
The governed control plane provides a repeatable answer to that question. Policies are data. Intent extraction is the LLM's job. Query construction is the control plane's job. The metadata air gap keeps them separated. And the governance framework (priority ordering, conflict resolution, cascading transformations, auditability) ensures that the deterministic layer is operationally manageable as the number of policies grows.
Conclusion
The ecommerce search governance patterns described in this series (policies as data, the Author → Test → Promote workflow, cascading transformations, per-field conflict resolution, percolator-based reverse matching, and multi-tier fallback) were designed for a world where a merchandiser authors policies and a shopper types queries. But the architecture can enable much more than its initial use case.
When the input source is an AI agent rather than a human shopper, the governed control plane becomes the critical safety layer between a probabilistic system and a production data store. It provides the deterministic guarantees (syntactic validity, semantic correctness, auditability, and security) that enterprise systems require and that LLMs cannot provide on their own.
The deterministic control plane doesn’t replace the AI agent. It makes the AI agent safe to deploy.
Put governed ecommerce search into practice
The governed control plane architecture described in this series, from the policy-as-data paradigm to the percolator-based lookup to personalization, economic optimization, and the agentic air gap, was designed and built by Elastic Services Engineering. Every pattern described across this series comes from a working system built and validated against enterprise-scale product catalogs.
If your team is building AI-powered search experiences and needs deterministic guardrails for agent-mediated queries, or if you want to implement a governed, business-editable search architecture on Elasticsearch, Elastic Professional Services can accelerate your implementation. Contact Elastic Professional Services.
Join the discussion
Have questions about search governance, retrieval strategies, or ecommerce search architecture? Join the broader Elastic community conversation.




