Troubleshooting your Agents and Amazon Bedrock AgentCore with Elastic Observability

Discover how to achieve end-to-end observability for Amazon Bedrock AgentCore: from tracking service health and token costs to debugging complex reasoning loops with distributed tracing.

Introduction

We're excited to introduce Elastic Observability’s Amazon Bedrock AgentCore integration, which allows users to observe Amazon Bedrock AgentCore and the agents' LLM interactions end-to-end. Agentic AI represents a fundamental shift in how we build applications. 

Unlike standard LLM chatbots that simply generate text, agents can reason, plan, and execute multi-step workflows to complete complex tasks autonomously. Often, these agents run on a platform such as Amazon Bedrock AgentCore, which helps developers build, deploy, and scale agents. Amazon Bedrock AgentCore is Amazon Bedrock's platform providing the secure, scalable, and modular infrastructure services (such as agent runtime, memory, and identity) that developers need to deploy and operate highly capable AI agents built with any framework or model.

Using a platform such as Amazon Bedrock AgentCore is easy, but troubleshooting an agent is far more complex than debugging a standard microservice. Key challenges include:

  • Non-Deterministic Behavior: Agents may choose different tools or reasoning paths for the same prompt, making it difficult to reproduce bugs.

  • "Black Box" Execution: When an agent fails or provides a hallucinated answer, it is often unclear if the issue lies in the LLM's reasoning, the context provided, or a failed tool execution.

  • Cost & Latency Blind Spots: A single user query can trigger recursive loops or expensive multi-step tool calls, leading to unexpected spikes in token usage and latency.

To effectively observe these systems, you need to correlate signals from two distinct layers:

  1. The Platform Layer (Amazon Bedrock AgentCore): You need to understand the overall health of the managed service. This includes high-level metrics like invocation counts, latency, throttling, and platform-level errors that affect all agents running in AgentCore.

  2. The Application Layer (Your Agentic Logic): You want to understand the granular "why" behind the behavior. This includes distributed traces, usually with OpenTelemetry, that visualize the full request lifecycle (e.g. waterfall view), identifying exactly which step in the reasoning chain failed or took too long.

Agentic AI Observability in Elastic provides a unified, end-to-end view of your agentic deployment by combining platform-level insights from Amazon Bedrock AgentCore, via the new Amazon Bedrock AgentCore integration, with deep application-level visibility from OpenTelemetry (OTel) traces, logs, and metrics from the agent. This unified view in Elastic allows you to observe, troubleshoot, and optimize your agentic applications end to end without switching tools. Additionally, Elastic provides Agent Builder, which allows you to create agents that analyze any of the data from Amazon Bedrock AgentCore and the agents running on it.

Agentic AI Observability in Elastic

As mentioned above, there are two main parts to end-to-end Agentic AI Observability in Elastic, along with AI-driven capabilities for analyzing the data they produce.

  • Amazon Bedrock AgentCore Platform Observability - Elastic provides comprehensive visibility into the high-level health of the AgentCore service by ingesting AWS vended logs and metrics across four critical components:

    • Runtime: Monitor core performance indicators such as agent errors, overall latency, throttle counts, and invocation rates for each endpoint.

    • Gateway: Gain specific insights into gateway and tool call performance, including invocations, error rates, and latency.

    • Memory: Track short-term and long-term memory operations, including event creation, retrieval, and listing, alongside performance analysis, errors, and latency metrics.

    • Identity: Audit security and access health with logs on successful and failed access attempts.

  • Agent Observability with APM, logs and metrics - To understand how your agent is behaving, Elastic ingests OTel-native traces, metrics and logs from your application running within AgentCore. This allows you to visualize the full execution path, including LLM reasoning steps and tool calls, in a detailed waterfall diagram. 
  • Agentic AI Analysis - All of the data from Amazon Bedrock AgentCore and the agents running on it can be analyzed with Elastic’s AI-driven capabilities. These include:

    • Elastic AgentCore SRE Agent built on Elastic Agent Builder - We don't just monitor agents; we provide you with one to assist your team. The AgentCore SRE Agent is a specialized assistant built with Elastic Agent Builder, with specialized knowledge of AgentCore applications observed in Elastic.

      • How it helps: You can ask specific questions about your AgentCore environment, such as how to interpret a complex error log or why a specific trace shows high latency.

      • Get the Agent: You can deploy this agent yourself from our GitHub repository.

    • Elastic Observability AI Assistant - Use natural language anywhere in Elastic’s UI to pinpoint issues, analyze something specific, or understand a problem using the LLM's knowledge base. SREs can also use it to interpret log messages, errors, and metric patterns, optimize code, write reports, identify and execute a runbook, or find a related GitHub issue.

    • Streams - AI-Driven Log Analysis - When you send AgentCore logs from your instrumented application into Elastic, you can parse and analyze them with Streams. Additionally, Streams finds Significant Events within your log stream, allowing you to focus immediately on what matters most.

    • Dashboards and ES|QL - Data is only useful if you can act on it. Elastic provides out-of-the-box (OOTB) assets to accelerate your mean time to resolution (MTTR), plus ES|QL to help you perform ad-hoc analysis on any signal (a short ES|QL sketch follows this list).

      • OOTB Dashboards: Pre-built visualizations based on AgentCore service signals. These dashboards provide an immediate, high-level overview of the usage, health, and performance of your AgentCore runtime, gateway, memory, and identity components.

      • OOTB Alert Templates: Pre-configured alerts for common agentic issues (e.g., high error rates, latency spikes, or unusual token consumption), allowing you to move from reactive to proactive troubleshooting immediately.
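
Since ES|QL can also be run programmatically, the following is a minimal sketch of ad-hoc analysis using the Python Elasticsearch client. The data stream pattern and time bucketing are illustrative assumptions; check the index patterns created by the Amazon Bedrock AgentCore integration in your deployment before running it.

    # Minimal ES|QL sketch; the data stream pattern below is an assumption.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        "https://<REPLACE_WITH_ELASTIC_ENDPOINT>.region.cloud.elastic.co:443",
        api_key="<REPLACE_WITH_YOUR_API_KEY>",
    )

    # Count AgentCore platform documents per hour over the last day.
    resp = es.esql.query(query="""
        FROM metrics-aws.bedrock_agentcore*   // assumed data stream pattern
        | WHERE @timestamp >= NOW() - 24 hours
        | STATS docs = COUNT(*) BY hour = BUCKET(@timestamp, 1 hour)
        | SORT hour
    """)

    for row in resp["values"]:
        print(row)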

Onboarding Amazon Bedrock AgentCore signals into Elastic

Amazon Bedrock AgentCore Integration

To get started with platform-level visibility, you need to enable the Amazon Bedrock AgentCore integration in Elastic. This integration automatically collects metrics and logs from your AgentCore runtime, gateway, memory, and identity components via Amazon CloudWatch.

Setup Steps:

  1. Prepare AWS Environment: Ensure your AgentCore agents are deployed and running, and that you have enabled logging on your AgentCore resources in the AWS console (a quick sanity-check sketch follows these steps).

  2. Add the Integration:

    • In Elastic (Kibana), navigate to Integrations.

    • Search for "Amazon Bedrock AgentCore". Select Add Amazon Bedrock AgentCore.

  3. Configure & Deploy:

    Configure Elastic's Amazon Bedrock AgentCore integration to collect CloudWatch metrics from your chosen AWS region at the specified collection interval. Logs will be added soon after the publication of this blog.
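
As a quick sanity check for step 1, the following sketch uses boto3 to confirm that AgentCore log groups exist in CloudWatch. The log group prefix is an assumption used for illustration; adjust it to match the log groups you see in the AWS console for your AgentCore resources.

    # List CloudWatch log groups for AgentCore; the prefix is an assumption.
    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix="/aws/bedrock-agentcore"):
        for group in page["logGroups"]:
            print(group["logGroupName"], group.get("storedBytes", 0))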

Onboard the Agent with OTel Instrumentation

The next step is observing the application logic itself. The beauty of Amazon Bedrock AgentCore is that the application runtime often comes pre-instrumented. You simply need to tell it where to send the telemetry data.

For this example, we will use the Travel Assistant from the Elastic Observability examples.

To instrument this agent, you do not need to modify the source code. Instead, when you invoke the agent using the agentcore CLI, you simply pass your Elastic connection details as environment variables. This redirects the OTel signals (traces, metrics, and logs) directly to the Elastic EDOT collector.

Example Invoke Command: Run the following command to launch the agent and start streaming telemetry to Elastic:

    agentcore launch \
    --env BEDROCK_MODEL_ID="us.anthropic.claude-3-5-sonnet-20240620-v1:0" \
    --env OTEL_EXPORTER_OTLP_ENDPOINT="https://<REPLACE_WITH_ELASTIC_ENDPOINT>.region.cloud.elastic.co:443" \
    --env OTEL_EXPORTER_OTLP_HEADERS="Authorization=ApiKey <REPLACE_WITH_YOUR_API_KEY>" \
    --env OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
    --env OTEL_METRICS_EXPORTER="otlp" \
    --env OTEL_TRACES_EXPORTER="otlp" \
    --env OTEL_LOGS_EXPORTER="otlp" \
    --env OTEL_RESOURCE_ATTRIBUTES="service.name=travel_assistant,service.version=1.0.0" \
    --env AGENT_OBSERVABILITY_ENABLED="true" \
    --env DISABLE_ADOT_OBSERVABILITY="true" \
    --env TAVILY_API_KEY="<REPLACE_WITH_YOUR_TAVILY_KEY>"

Key Configuration Parameters:

  • OTEL_EXPORTER_OTLP_ENDPOINT: Your Elastic OTLP endpoint (ensure port 443 is specified).

  • OTEL_EXPORTER_OTLP_HEADERS: The Authorization header containing your Elastic API Key.

  • DISABLE_ADOT_OBSERVABILITY=true: Ensures the native AgentCore signals are routed exclusively to your defined endpoint (Elastic) rather than the default AWS paths.
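
If you want to verify these OTLP settings before deploying with the agentcore CLI, the following is a minimal sketch using the OpenTelemetry Python SDK. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages are installed, reuses the same placeholder endpoint and API key, and simply emits one test span that should show up under the travel_assistant service in Elastic APM.

    # Minimal sketch: send one test span to Elastic over OTLP (http/protobuf).
    import os

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    # The exporter reads the same variables passed via --env to `agentcore launch`.
    os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT",
                          "https://<REPLACE_WITH_ELASTIC_ENDPOINT>.region.cloud.elastic.co:443")
    os.environ.setdefault("OTEL_EXPORTER_OTLP_HEADERS",
                          "Authorization=ApiKey <REPLACE_WITH_YOUR_API_KEY>")

    provider = TracerProvider(resource=Resource.create(
        {"service.name": "travel_assistant", "service.version": "1.0.0"}))
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    # Emit a single test span, then flush and shut down the provider.
    with trace.get_tracer("smoke-test").start_as_current_span("otlp-connectivity-check"):
        pass

    provider.shutdown()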

Analyzing Agentic Data in Elastic Observability

As we walk through the analysis features below, we will use the Travel Assistant agent we instrumented earlier, as well as any other apps you may be running on AgentCore. For the purposes of this example, we will use the Customer Support Assistant from the AWS Labs AgentCore samples as a second agent.

Out-of-the-Box (OOTB) Dashboards

Elastic populates a set of comprehensive dashboards based on Amazon Bedrock AgentCore service logs and metrics. These appear as a unified view with tabs, providing a "single pane of glass" into the operational health of your platform.

This view is divided into four key zones, each addressing a specific component of AgentCore: Runtime, Gateway, Memory, and Identity. Note that not all agentic applications use all four components. In our example, only the Customer Support Assistant uses all four components, whereas the Travel Assistant uses only Runtime.

Runtime Health


Visualize agent invocations, session metrics, error trends (system vs. user), and performance stats like latency and throttling, split per endpoint. This dashboard helps you answer questions like:

  • "How are my Travel Assistant agent and Customer Support agent performing in terms of overall traffic and latency, and are there any spikes in errors or throttling?"

Gateway Performance


Analyze invocations across Lambda and MCP (Model Context Protocol), with detailed breakdowns for tool vs. non-tool calls. The dashboard highlights throttling detection, target execution times, and separates system errors from user errors.

  • Question answered: "Are my external integrations (Lambda, MCP) performing efficiently, or are specific tool calls experiencing high latency, throttling, or system-level errors?"

Memory Operations


Track core operations like event creation, retrieval, and listing, alongside deep dives into long-term memory processing. This includes extraction and consolidation metrics broken down by strategy type, as well as specific monitoring for throttling and system vs. user errors.

  • Question answered: "Are failures in memory consolidation strategies or high retrieval latency preventing the agent from effectively recalling user context?"

Identity & Access


Monitor identity token fetch operations (workload, OAuth, API keys) and real-time authentication success/failure rates. The dashboard breaks down activity by provider and highlights throttling or capacity bottlenecks.

  • Question answered: "Are authentication failures or token fetch bottlenecks from specific providers preventing agents from accessing required resources?"

Out-of-the-Box (OOTB) Alert Templates

Observability isn't just about looking at dashboards; it's about knowing when to act. To move from reactive checking to proactive monitoring, Elastic provides OOTB Alert Rule Templates (starting with Elastic version 9.2.1).

These templates eliminate guesswork by pre-selecting the optimal metrics to monitor and applying sensible thresholds. This configuration focuses on high-fidelity alerts for genuine anomalies, helping you catch critical issues early while minimizing alert fatigue.

Suggested OOTB Alerts:

  • Agent Runtime System Errors: Detects server-side errors (500 Internal Server Error) during agent runtime invocations, indicating infrastructure or service issues with AWS Bedrock AgentCore.

  • Agent Runtime User Errors: Flags client-side errors (4xx) during agent runtime invocations, including validation failures (400), resource not found (404), access denied (403), and resource conflicts (409). This helps catch misconfigured permissions, invalid input, or missing resources early.

  • Agent Runtime High Latency: Triggers when the average latency for agent runtime invocations exceeds 10 seconds (10,000ms). Latency measures the time elapsed between receiving a request and sending the final response token.
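
As a rough worked example of the high-latency condition, the sketch below runs an ad-hoc ES|QL check with the Python Elasticsearch client. The data stream pattern and latency field name are assumptions for illustration; replace them with the actual fields shipped by the AgentCore integration.

    # Ad-hoc check mirroring the high-latency alert; field names are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        "https://<REPLACE_WITH_ELASTIC_ENDPOINT>.region.cloud.elastic.co:443",
        api_key="<REPLACE_WITH_YOUR_API_KEY>",
    )

    resp = es.esql.query(query="""
        FROM metrics-aws.bedrock_agentcore*                 // assumed data stream pattern
        | WHERE @timestamp >= NOW() - 1 hour
        | STATS avg_latency_ms = AVG(aws.bedrock_agentcore.runtime.latency.avg)  // assumed field
        | WHERE avg_latency_ms > 10000                      // same 10,000 ms threshold
    """)

    print(resp["values"])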

APM Tracing

While logs and metrics tell you that an issue exists, APM Tracing tells you exactly where and why it is happening. By ingesting the OpenTelemetry signals from your instrumented agent, Elastic generates a detailed distributed trace (e.g., a waterfall view) for every interaction. To get further details on LLM interactions, such as prompts, responses, and token usage, you can explore the APM logs.

This allows you to peer inside the "black box" of the agent's execution flow:

  • Visualize the Chain of Thought: See the full sequence of events, from the user's initial prompt to the final response, including all intermediate reasoning steps.

  • Pinpoint Tool Failures: Identify exactly which external tool (e.g., a Lambda function for flight booking or a knowledge base query) failed or timed out.

  • Analyze Latency Contributors: Distinguish between latency caused by the LLM's generation time versus latency caused by slow downstream API calls.

  • Debug with Context: Drill down into individual spans to see specific error messages, attributes, and metadata that explain why a particular step failed.
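
To make these drill-downs even richer, you can add your own spans and attributes around tool calls in the agent code. The following is a hedged sketch using the OpenTelemetry Python API; book_flight and the travel.* attributes are hypothetical and used only for illustration, while gen_ai.tool.name follows the OpenTelemetry GenAI semantic conventions.

    # Hypothetical tool wrapped in a custom span with extra attributes.
    from opentelemetry import trace

    tracer = trace.get_tracer("travel_assistant.tools")

    def book_flight(origin: str, destination: str) -> str:
        with tracer.start_as_current_span("tool.book_flight") as span:
            span.set_attribute("gen_ai.tool.name", "book_flight")
            span.set_attribute("travel.origin", origin)             # custom attribute
            span.set_attribute("travel.destination", destination)   # custom attribute
            try:
                # ... call the real booking API here ...
                return f"Booked {origin} -> {destination}"
            except Exception as exc:
                # Record the failure so the error and message appear on the span.
                span.record_exception(exc)
                span.set_status(trace.StatusCode.ERROR, str(exc))
                raise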

Conclusion

As organizations move from experimental chatbots to complex, autonomous agents in production, the need for robust observability has never been greater. Agentic applications introduce new layers of complexity—non-deterministic behaviors, multi-step reasoning loops, and cost implications—that standard monitoring tools simply cannot see.

Elastic Agentic AI Observability for Amazon Bedrock AgentCore bridges this gap. By unifying platform-level health metrics from AgentCore with deep, transaction-level distributed tracing from OpenTelemetry, Elastic gives SREs and developers the complete picture. Whether you are debugging a failed tool call, optimizing latency, or controlling token costs, you have the visibility needed to run agentic AI with confidence.

Complete Visibility: AgentCore + Amazon Bedrock: For the most comprehensive view, we recommend onboarding Elastic’s Amazon Bedrock integration alongside AgentCore. While the AgentCore integration focuses on the orchestration layer—monitoring agent errors, tool latency, and invocations—the Bedrock integration provides deep visibility into the underlying foundation models themselves. This includes tracking model-specific latency, token usage, full prompts and responses, and even Guardrails usage and effectiveness. By combining both, you ensure complete coverage from the high-level agent workflow down to the raw model inference.

Get Started Today

Ready to see your agents in action?
