Gauntlet: What happens when your agent's tools fight back

Elasticsearch Agent Builder Hackathon

May 13, 2026

With two days left before the hackathon deadline, I made the decision to step back and rethink my approach from scratch.

The original idea was called Rehearse: an agent that rehearses actions in a sandbox mocked by another agent before executing them in the real world. The concept was sound, but the flaw was obvious in hindsight. The environment can change between rehearsal and execution. Your agent rehearses sending an email, but by the time it actually runs, the inbox looks different. Simulation diverges from reality, and the whole thing falls apart.

But one class of problems doesn't have this issue: adversarial fuzz-testing. If your agent fails in simulation, it can fail in real life too. That's how Gauntlet was born — 48 hours before the deadline and reusing the same core insight (an agent that uses search to build memory and stay creative) pointed at a problem where stochasticity doesn't matter.

Gauntlet

Watch the Gauntlet demo

Test what happens when your agent's tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time.

Watch the demo here

The problem with testing agents on the happy path

Most of us have heard of OpenClaw, the personal AI assistant that went viral. If you've followed the discourse around agentic AI assistants with broad tool access, you've seen the security concerns. Agents forget what they're not supposed to do or never knew in the first place. The reason is straightforward: We test the happy path. We check that the agent does what it should. We rarely check what happens when someone tries to make it do what it shouldn't.

Adversarial testing sandboxes exist, but they're painful to build. You design attack vectors manually. You seed adversarial data by hand. You configure test infrastructure for each scenario. It's slow, it doesn't scale, and it only finds the bugs you already thought of.

I wanted something different: a system where the environment itself is automatically adversarial and gets more creative over time.

The idea: Mock the sandbox with another agent

Instead of building a sandbox, Gauntlet uses a mocking agent that intercepts your primary agent's tool calls and finds creative ways to break it. When your agent calls search_emails, the mocking agent sees the result and decides whether to mutate it, injecting a prompt injection into an email body, returning subtly wrong data, or feeding false information to see if the primary agent catches it. The primary agent never knows it's in a simulation.

The interface is two decorators:

@function_tool
@gauntlet.query
def search_emails(folder: str = "inbox") -> str:
    """Search emails in the given folder."""
    return json.dumps(fetch_emails(folder))

There is @gauntlet.query for read operations and @gauntlet.mutation for writes. That's the entire integration surface. When the run finishes, evaluate() reviews what happened and stores confirmed bugs.

It’s simple to use, but there two hard problems that hide underneath.

The two problems that make this a search problem

First, the mocking agent needs to maintain a coherent model of the world throughout the conversation. If it told the primary agent that an email was from Alice, it can't later contradict that. A mutated email that's obviously fake teaches you nothing. Plausibility is the whole game.

Second, the mocking agent needs to find novel bugs. Rediscovering the same prompt injection pattern 50 times isn't useful. It needs to remember what it has already found and explore in new directions while staying grounded in what the tools actually do.

Both of these are search problems. And that's where Elasticsearch becomes the backbone of the system.

Two memory circuits

The mocking agent runs on two memory circuits, both living in Elasticsearch.

Short-term memory tracks everything within the current session: every tool call intercepted, the original result, what it was mutated to, and what the primary agent did in response. This is the coherence layer. The mocking agent can query its own recent decisions and stay internally consistent while still being adversarial. Balancing creativity with coherence was the hardest design problem in the entire project.

Long-term memory is where the creativity compounds. It stores confirmed bugs with embeddings for similarity search, full tool implementations so the agent can reason about failure modes, and historical results from past runs. When the mocking agent needs a new attack idea, it searches long-term memory for what's been tried before, finds gaps, and hypothesizes something new.

These feed into a closed cycle: hypothesize what bugs might exist, create circumstances to prove them, and store confirmed bugs back into the index. The inventory grows. The attacks get more creative. The gap between Gauntlet and manual sandbox setup widens over time.

Everything runs inside Elastic Agent Builder

The entire mocking agent is built inside Elastic Agent Builder — instructions, tool bindings, and multi-turn conversation state via the Amazon Bedrock Converse API; no external orchestration needed.

The tool I'm most proud of is generate-hypothesis. It's a single ES|QL statement that samples existing bugs, aggregates them with MV_CONCAT, and calls COMPLETION inline to propose a novel attack hypothesis. It handles sampling, aggregation, LLM reasoning, and result generation all in one query, never leaving the ES|QL pipeline. I went in expecting I'd need to shuttle data between Elasticsearch and an external script. I didn't.

ES|QL's COMPLETION function was the biggest surprise. Between COMPLETION, STATS, MV_CONCAT, and SAMPLE, I could build entire reasoning pipelines as single queries. Bug storage uses Kibana Workflows, and a programmatically created Kibana Dashboard gives real-time visibility into bug counts, severity breakdowns, and attack pattern heatmaps.

The Converse API solved another problem I'd been dreading. The mocking agent needs to remember what it's already told the primary agent within a single run. I assumed I'd have to fetch conversation histories from indices and reload them into the agent on every call. But it turns out that the Converse API handles multi-turn state natively. I didn't write any conversation management logic. Just keep calling converse, and it stays coherent.

What this actually buys you

Manual adversarial sandbox setup takes roughly an hour per scenario. With Gauntlet, the same process takes 2–10 minutes, and its long-term memory means each run is informed by every previous run. The more you use it, the more it learns about your agent’s weak points and the harder it tries to find new ones.

What's next?

Right now Gauntlet is a 1v1: one mocking agent versus one primary agent. But the problem is embarrassingly parallel. 20 attack sessions could run simultaneously on separate sessions without any architectural changes. Scaling is the obvious next step.

The more interesting open question is exploration versus exploitation in the long-term memory. The mocking agent needs to balance trying variations of known successful attacks (exploitation) against completely novel hypotheses (exploration). This is a well-studied problem in other domains, but applying it to adversarial agent testing feels unexplored. There might be something worth pursuing beyond this project entirely.

I also keep thinking about Rehearse. Gauntlet is a special case: fuzz-testing works because failure in simulation implies possible failure in reality. But there are other domains where the environment is stable enough between rehearsal and execution that the original Rehearse concept could work. I haven't found them yet, but I'm looking.

The takeaway

If you're building agents with access to real-world tools, test what happens when those tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time. That's what Gauntlet does.

Kavish Sathia

Student, National University of Singapore

Kavish Sathia is a computer science student at NUS working on agentic systems.

GitHub · Demo · Website · LinkedIn

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

Context engineering

Vector database

Search powered applications

Logs

Threat protection

Workflows

Elasticsearch

Kibana (Discover, Dashboards)

Elastic Agent Builder

AutoOps

Piped query language

Jina AI search models

Elastic Cloud Serverless

Elastic Cloud Hosted

Self-managed Elasticsearch

Ecommerce search

Customer support search

Search-driven apps

Log analytics

Infrastructure monitoring

Digital experience monitoring

App performance monitoring

AIOps

LLM observability

Next-gen SIEM

Workflows for security

XDR and endpoint security

AI for security

10x your data's value

Cloud providers

Elastic AI Ecosystem

Search AI Partner Program

AV-Comparatives

Forrester Wave™ Leader

Gartner Magic Quadrant Leader

IDC MarketScape Leader

Search

Security

Observability

Get started

Demo gallery

Downloads

Integrations

Docs

Elasticsearch Labs

Elastic Security Labs

Elastic Observability Labs

Blog

Community

Events

Webinars

Discuss

Training

Support

Consulting