Gauntlet: What happens when your agent's tools fight back
Elasticsearch Agent Builder Hackathon
.png)
With two days left before the hackathon deadline, I made the decision to step back and rethink my approach from scratch.
The original idea was called Rehearse: an agent that rehearses actions in a sandbox mocked by another agent before executing them in the real world. The concept was sound, but the flaw was obvious in hindsight. The environment can change between rehearsal and execution. Your agent rehearses sending an email, but by the time it actually runs, the inbox looks different. Simulation diverges from reality, and the whole thing falls apart.
But one class of problems doesn't have this issue: adversarial fuzz-testing. If your agent fails in simulation, it can fail in real life too. That's how Gauntlet was born — 48 hours before the deadline and reusing the same core insight (an agent that uses search to build memory and stay creative) pointed at a problem where stochasticity doesn't matter.
The problem with testing agents on the happy path
Most of us have heard of OpenClaw, the personal AI assistant that went viral. If you've followed the discourse around agentic AI assistants with broad tool access, you've seen the security concerns. Agents forget what they're not supposed to do or never knew in the first place. The reason is straightforward: We test the happy path. We check that the agent does what it should. We rarely check what happens when someone tries to make it do what it shouldn't.
Adversarial testing sandboxes exist, but they're painful to build. You design attack vectors manually. You seed adversarial data by hand. You configure test infrastructure for each scenario. It's slow, it doesn't scale, and it only finds the bugs you already thought of.
I wanted something different: a system where the environment itself is automatically adversarial and gets more creative over time.
The idea: Mock the sandbox with another agent
Instead of building a sandbox, Gauntlet uses a mocking agent that intercepts your primary agent's tool calls and finds creative ways to break it. When your agent calls search_emails, the mocking agent sees the result and decides whether to mutate it, injecting a prompt injection into an email body, returning subtly wrong data, or feeding false information to see if the primary agent catches it. The primary agent never knows it's in a simulation.
The interface is two decorators:
@function_tool
@gauntlet.query
def search_emails(folder: str = "inbox") -> str:
"""Search emails in the given folder."""
return json.dumps(fetch_emails(folder))There is @gauntlet.query for read operations and @gauntlet.mutation for writes. That's the entire integration surface. When the run finishes, evaluate() reviews what happened and stores confirmed bugs.
It’s simple to use, but there two hard problems that hide underneath.
The two problems that make this a search problem
First, the mocking agent needs to maintain a coherent model of the world throughout the conversation. If it told the primary agent that an email was from Alice, it can't later contradict that. A mutated email that's obviously fake teaches you nothing. Plausibility is the whole game.
Second, the mocking agent needs to find novel bugs. Rediscovering the same prompt injection pattern 50 times isn't useful. It needs to remember what it has already found and explore in new directions while staying grounded in what the tools actually do.
Both of these are search problems. And that's where Elasticsearch becomes the backbone of the system.
Two memory circuits
The mocking agent runs on two memory circuits, both living in Elasticsearch.
Short-term memory tracks everything within the current session: every tool call intercepted, the original result, what it was mutated to, and what the primary agent did in response. This is the coherence layer. The mocking agent can query its own recent decisions and stay internally consistent while still being adversarial. Balancing creativity with coherence was the hardest design problem in the entire project.
Long-term memory is where the creativity compounds. It stores confirmed bugs with embeddings for similarity search, full tool implementations so the agent can reason about failure modes, and historical results from past runs. When the mocking agent needs a new attack idea, it searches long-term memory for what's been tried before, finds gaps, and hypothesizes something new.
These feed into a closed cycle: hypothesize what bugs might exist, create circumstances to prove them, and store confirmed bugs back into the index. The inventory grows. The attacks get more creative. The gap between Gauntlet and manual sandbox setup widens over time.
Everything runs inside Elastic Agent Builder
The entire mocking agent is built inside Elastic Agent Builder — instructions, tool bindings, and multi-turn conversation state via the Amazon Bedrock Converse API; no external orchestration needed.
The tool I'm most proud of is generate-hypothesis. It's a single ES|QL statement that samples existing bugs, aggregates them with MV_CONCAT, and calls COMPLETION inline to propose a novel attack hypothesis. It handles sampling, aggregation, LLM reasoning, and result generation all in one query, never leaving the ES|QL pipeline. I went in expecting I'd need to shuttle data between Elasticsearch and an external script. I didn't.
ES|QL's COMPLETION function was the biggest surprise. Between COMPLETION, STATS, MV_CONCAT, and SAMPLE, I could build entire reasoning pipelines as single queries. Bug storage uses Kibana Workflows, and a programmatically created Kibana Dashboard gives real-time visibility into bug counts, severity breakdowns, and attack pattern heatmaps.
The Converse API solved another problem I'd been dreading. The mocking agent needs to remember what it's already told the primary agent within a single run. I assumed I'd have to fetch conversation histories from indices and reload them into the agent on every call. But it turns out that the Converse API handles multi-turn state natively. I didn't write any conversation management logic. Just keep calling converse, and it stays coherent.
What this actually buys you
Manual adversarial sandbox setup takes roughly an hour per scenario. With Gauntlet, the same process takes 2–10 minutes, and its long-term memory means each run is informed by every previous run. The more you use it, the more it learns about your agent’s weak points and the harder it tries to find new ones.
What's next?
Right now Gauntlet is a 1v1: one mocking agent versus one primary agent. But the problem is embarrassingly parallel. 20 attack sessions could run simultaneously on separate sessions without any architectural changes. Scaling is the obvious next step.
The more interesting open question is exploration versus exploitation in the long-term memory. The mocking agent needs to balance trying variations of known successful attacks (exploitation) against completely novel hypotheses (exploration). This is a well-studied problem in other domains, but applying it to adversarial agent testing feels unexplored. There might be something worth pursuing beyond this project entirely.
I also keep thinking about Rehearse. Gauntlet is a special case: fuzz-testing works because failure in simulation implies possible failure in reality. But there are other domains where the environment is stable enough between rehearsal and execution that the original Rehearse concept could work. I haven't found them yet, but I'm looking.
The takeaway
If you're building agents with access to real-world tools, test what happens when those tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time. That's what Gauntlet does.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.
Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.