Carly Richmond

AI agent observability and monitoring with OTel, OpenLit & Elastic

Learn how to monitor AI web agents to identify performance bottlenecks, token waste, and hallucinations using OpenTelemetry, OpenLit, and Elastic

AI agents don't fail like traditional apps. They hallucinate, loop, burn tokens, and make unpredictable tool calls that standard monitoring was never designed to capture. Traditional APM tools show HTTP status codes and latency, but they miss the AI-specific failures that matter: prompt injection attempts, evaluation score degradation, and tool-calling loops.

This guide explains the key considerations for full-stack monitoring of AI web agents, exploring both best practices and practical examples using OpenLit, OpenTelemetry, and Elastic. Specifically, we'll cover monitoring an example web travel planner located in this example repo.

Why is AI agent observability different?

The aim of traditional monitoring is to detect and alert on failures, performance issues, inefficiencies and resource bottlenecks. Monitoring AI agents still adheres to this common goal, but there are several differences that must be considered:

  • AI models are probabilistic, meaning that the same input can lead to different outputs. This makes it hard to define and monitor success based on a single correct answer.
  • AI systems can appear to function correctly on the surface while their outputs are suspect, incorrect, or biased, with no way to detect it immediately. Telemetry must therefore capture hidden activity, such as tool call executions, for SREs to scrutinize.
  • The dynamic and evolving nature of LLMs means that their behavior can change dramatically between updates and versions due to changes in data, embeddings, or prompts. This makes monitoring, and pre-production evaluation when upgrading, vitally important for performance continuity.
  • Models are black boxes, so it's often difficult to understand why an AI made a particular decision. This makes troubleshooting harder than in systems with clear, explicit logic.
  • Beyond traditional metrics, AI output must be monitored for issues like hallucinations (generating false information), toxicity, and bias, which can damage user trust and lead to reputational harm.
  • The performance of an AI system can vary greatly depending on context, including user interaction. Capturing user prompts alongside telemetry helps establish a complete picture of system performance.
  • From a security perspective, AI agents can be vulnerable to adversarial attacks including data poisoning and obfuscation. Monitoring for unusual behavioral patterns and prompts is crucial to detect and mitigate these threats.

For these reasons the metrics and tracing that SREs capture and investigate will differ.

AI agent monitoring in practice

Let's apply these concepts by instrumenting an actual AI agent and capturing telemetry. Here we shall be using the TypeScript SDK of OpenLit, an open-source library that generates OpenTelemetry signals from LLM interactions in JavaScript applications. Specifically, we shall instrument a simple web travel planner agent, available here, which uses LLMs to generate travel recommendations based on user prompts and information from various tools. OpenLit works well for this type of project due to its TypeScript SDK and its built-in capabilities for capturing LLM interactions and tool calls, and for generating evaluation and guardrail metrics.

The architecture diagram shows the key components:

The concepts and best practices discussed in this article can be applied to any AI agent regardless of the specific monitoring tools used. Many vendors have AI monitoring capabilities. Alternative open source technologies are also available for agentic monitoring, including LangSmith, OpenLLMetry, or indeed manual instrumentation using OpenTelemetry SDKs and the AI semantic conventions.

Prerequisites

This project requires that the following prerequisites are met:

  • Active Elastic cluster (Cloud, Serverless or self-managed)
  • OpenAI, Azure OpenAI, or compatible LLM provider API key
  • Node.js 18+ with npm or yarn
  • OTLP-compatible endpoint (Elastic Managed OTLP endpoint or OTel collector)

The following environment variables should be set:

  • OTEL_ENDPOINT: Your Elastic OTLP endpoint or OTel collector URL
  • OPENAI_API_KEY: API key for the evaluation/guardrail LLM
  • OPENAI_ENDPOINT: Optional custom base URL for OpenAI-compatible providers
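For example, assuming a Unix-like shell, these might be exported as follows (all values below are placeholders for your own endpoints and keys):

```shell
# Placeholder values - substitute your own endpoint and key
export OTEL_ENDPOINT="https://my-deployment.ingest.us-east-1.aws.elastic.cloud:443"
export OPENAI_API_KEY="sk-..."
# Optional: only needed for OpenAI-compatible providers such as Azure OpenAI
export OPENAI_ENDPOINT="https://my-resource.openai.azure.com"
```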

Basic instrumentation

Often DevOps engineers and SREs start with automatic instrumentation to obtain basic telemetry. This is possible with the OpenLit Python SDK; with TypeScript, however, we have to add our configuration manually to the AI entrypoint (here api/chat/route.ts).

First we install the dependency using our favourite package manager:

npm install openlit

Then we add the OpenLit configuration to our entrypoint:

import openlit from "openlit";

// Other imports omitted for brevity 

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

openlit.init({
  applicationName: "ai-travel-agent", // akin to OTEL resource name
  environment: "development",
  otlpEndpoint: process.env.OTEL_ENDPOINT, // OTLP compatible endpoint (Elastic ingest or OTel collector)
  disableBatch: true, // batching disabled for demo purposes - not recommended for production use
});

// Post request handler
export async function POST(req: Request) {
   // AI logic omitted for brevity - see full code in repo
}

This instrumentation will automatically generate OpenTelemetry traces for all LLM interactions, including tool calls, and send them to the specified OTLP endpoint. Note that for production rather than demo usage, environment should be set to production and batching should not be disabled to ensure optimal network usage and protect the OTel backend.
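A production-oriented variant of the configuration might look like the following sketch; the application name and endpoint are the placeholders from the example above, and disableBatch is simply left at its default so spans are exported in batches rather than one network call per span:

```typescript
import openlit from "openlit";

openlit.init({
  applicationName: "ai-travel-agent",
  environment: "production", // reflect the real deployment environment
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  // disableBatch omitted: batching stays enabled, reducing network overhead
  // and load on the OTel backend
});
```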

Let's discuss the key telemetry signals that are generated in subsequent sections.

Inputs

The first rule of debugging AI agents is simple: if you don't capture the prompt, you can't reproduce the problem. Unlike traditional applications where inputs are predictable request parameters, AI agents consume free-form user prompts that can trigger wildly different behaviors based on subtle phrasing changes. OpenLit automatically captures system prompts and all user messages as structured attributes on your traces, giving you the exact input that caused your agent to hallucinate, loop, or fail.

The full conversation needs to be available to SREs to understand the context of failures and performance issues and identify patterns in the inputs that may be causing issues. However these inputs are also useful for improving agent behavior, and can be used as testing messages to evaluate model performance and test enhancements once sanitized for identifiable attributes such as PII.

Beyond prompts, we still need comprehensive logging, specifically capturing full stack traces emitted by our applications. This is crucial for diagnosing issues that may arise from the underlying infrastructure or codebase, rather than the AI model itself. For example, the below error sent to Elastic shows a simple fetch error. We must not forget that traditional errors can still occur in AI applications, and capturing them is essential.

Tracing

Traces are essential for understanding the flow of requests through your AI agent, especially when it comes to tool calls. Generally traces are a hierarchy of spans, which themselves are a single, timed unit representing a specific operation, such as a database query or an HTTP handler. In AI systems they also represent tool calls made by the LLM, along with the API calls and data retrieval steps performed within the tool execution.

Visualizing tool calling patterns is important in validating pre-production systems as well as monitoring production systems for several reasons:

  1. It helps us evaluate the tool calling capabilities of different models. LLMs make the choice of which tools to use based on the user prompt, system instructions and the tool metadata (such as name and description). By visualizing the tool calling patterns we can understand whether the model is correctly interpreting the tool metadata and making appropriate calls based on the prompt.
  2. It allows us to identify inefficient or erroneous tool calling patterns. For example, a pattern of repeated calls to the same tool with similar inputs may indicate that the model is stuck in a loop or not effectively utilizing the tools. Likewise, if a single tool is called where we would expect multiple tools, it may indicate that the model is not correctly recognizing that the prompt or system instructions require those tools.
  3. Commonly occurring tool-calling patterns can also be identified to optimize the available tools. For example, if the location and weather tools are frequently called together, it may make sense to combine them into a single tool that provides both pieces of information in one call.
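To make point 2 concrete, here is a minimal sketch of loop detection over the tool calls recorded in a trace. The ToolCall shape, key format, and threshold are illustrative assumptions, not OpenLit types:

```typescript
// Hypothetical shape for a tool call extracted from trace spans
interface ToolCall {
  name: string;
  args: string; // serialized tool arguments
}

// Flag tool/argument combinations repeated at or above a threshold within
// a single trace - a common symptom of an agent stuck in a loop.
function findRepeatedToolCalls(calls: ToolCall[], threshold = 3): string[] {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = `${call.name}:${call.args}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count >= threshold)
    .map(([key]) => key);
}
```

In practice this kind of check could run as an alerting query over exported spans rather than in application code.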

With the above configuration, we can see the traces for each tool call, as illustrated in the below example:

Metrics

While tracing is essential for understanding the flow of requests and tool calls, metrics are crucial for monitoring the overall health and performance of your system, agentic or not. When considering metrics many think solely of cost and total token usage. While both are important, they are not the only metrics that matter.

Through the example above OpenLit automatically generates key metrics that can be used to evaluate agent performance, such as request latency, error rates, cost and token usage, which can be visualized in Elastic to identify trends and anomalies. Token usage specifically can be split by input, output and reasoning token counts, helping us identify optimization opportunities at key stages in the generation cycle. For example, an increase in input token counts may indicate a significant increase in context length that can be optimized via prompt and context engineering techniques.
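For instance, the input and output token counts surfaced by these metrics can be turned into a rough cost estimate. The per-1K-token prices below are placeholder assumptions, not real provider rates:

```typescript
// Hypothetical token usage captured from a single LLM request
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices per 1K tokens - substitute your provider's actual rates
const PRICE_PER_1K = { input: 0.0025, output: 0.01 };

function estimateCostUSD(usage: TokenUsage): number {
  return (
    (usage.inputTokens / 1000) * PRICE_PER_1K.input +
    (usage.outputTokens / 1000) * PRICE_PER_1K.output
  );
}
```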

It's also important to make sure that metrics capture traditional performance metrics such as CPU, memory and request counts to caches, traditional databases and vector databases. This helps us identify whether performance issues are being caused by the AI model itself or by underlying infrastructure problems. Alerting for spikes in key measures such as large token usage increases or request volumes would also be considered best practice.

Evaluation

AI evaluation refers to the process of assessing the performance of AI models and the quality of the responses they generate. This involves monitoring various metrics and signals to ensure that the AI system is functioning as intended, providing accurate outputs, and not exhibiting undesirable behaviors such as hallucinations, toxic behavior or bias. While evaluation is considered as a pre-production activity to test and validate an agentic system, it's also important to continue monitoring these signals in production to identify issues over time.

There are several different evaluation methodologies that we can use. OpenLit makes use of AI as a Judge. This involves using an LLM to evaluate the quality of the output generated by another LLM based on a set of criteria. An example of traditional evaluation is depicted below:

When considering evaluation from a monitoring viewpoint, it's important to identify hallucinations, bias, toxicity and potential injection issues in production. Hallucinations, bias and toxic responses expose us to reputational risk and loss of user trust. Out of the box, OpenLit identifies the following issues, calculates a score and provides an explanation of the issue:

  • Hallucinations: The LLM generates false or misleading information based on the provided context and its own knowledge
  • Bias: A generated response contains bias or statements negatively impacting protected groups and characteristics, including but not limited to gender, ethnicity, socioeconomic status, or religion
  • Toxicity: The LLM returns harmful or offensive content that is threatening, harassing, or dismissive

These issues can be identified using the below code:

import openlit from "openlit";

// Other imports omitted for brevity

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

// Tools and Azure configuration omitted for brevity

openlit.init({
  applicationName: "ai-travel-agent",
  environment: "development",
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  disableBatch: true,
});

// Choose one of the following approaches:
// Option 1: enable all available evaluations
const evalsAll = openlit.evals.All({
  provider: "openai",
  collectMetrics: true, // Ensures evaluations are exported to Elastic
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Option 2: enable specific evaluations with custom configuration
const evalsHallucination = openlit.evals.Hallucination({
  provider: "openai",
  collectMetrics: true,
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Post request handler
export async function POST(req: Request) {
  const { messages, id } = await req.json();

  try {
    const convertedMessages = await convertToModelMessages(messages);
    const prompt = `You are a helpful assistant that returns travel itineraries...`;

    const result = streamText({
      model: azure("gpt-4o"),
      system: prompt,
      messages: convertedMessages,
      stopWhen: stepCountIs(2),
      tools,
      experimental_telemetry: { isEnabled: true },
      onFinish: async ({ text, steps }) => {
        // Concatenate tool results and content as full evaluation context
        const toolResults = steps.flatMap((step) => {
          return step.content
            .filter((content) => content.type == "tool-result")
            .map((c) => {
              return JSON.stringify(c.output);
            });
        });

        // Measure evaluation
        const evalResults = await evalsAll.measure({
          prompt: prompt,
          contexts: convertedMessages
            .map((m) => {
              return m.content.toString();
            })
            .concat(toolResults),
          text: text,
        });
        console.log("Evals results:", evalResults);
      },
    });

    // Return data stream to allow the useChat hook to handle the results as they are streamed through for a better user experience
    return result.toUIMessageStreamResponse();
  } catch (e) {
    console.error(e);
    return new NextResponse(
      "Unable to generate a plan. Please try again later!"
    );
  }
}

By using the collectMetrics option, the evaluation results are automatically exported as metrics to Elastic, allowing us to monitor the quality of our AI agent's outputs over time and identify trends or issues that may arise in production. The evaluation results can also be used to trigger alerts or automated responses if certain thresholds or SLOs are breached, such as a high evaluation score sustained for several minutes, an increasing number of hallucinations detected over time, or a toxic result being flagged.

The advantage of using LLMs to evaluate results is that they identify issues and inaccuracies quickly compared to manual quality checks. However, this methodology does have limitations, specifically:

  1. Increased cost and latency due to the additional requests to an LLM to evaluate results. This can be mitigated by using a smaller, cheaper model for evaluation, by only evaluating a sample of responses, or using cached responses to reduce the number of LLM calls for similar questions.
  2. LLM evaluations are prone to biases. Specifically, Zheng et al. note in their 2023 paper that LLM evaluations are subject to:
  • Positional bias, where an LLM prefers responses where the answer is located in a specific position in the response, and may miss correct answers located elsewhere in the reply.
  • Self-enhancement bias, where LLMs show preference for responses they have generated compared to other models. This can be a consideration if you wish to use cheaper, or self-hosted models for evaluation.
  • Verbosity bias, where they prefer more expansive responses over succinct replies.
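The sampling mitigation from point 1 can be sketched as follows; the 10% rate and the injectable random source are illustrative assumptions:

```typescript
// Evaluate only a fraction of responses to bound evaluation cost and latency.
// The random source is injectable so the decision can be tested deterministically.
function shouldEvaluate(
  sampleRate: number,
  random: () => number = Math.random
): boolean {
  return random() < sampleRate;
}

// Example: gate the eval call on roughly 10% of responses, e.g.
// if (shouldEvaluate(0.1)) { await evalsAll.measure({ ... }); }
```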

Guardrail monitoring

In addition to assessing the quality of responses, we must also monitor for dangerous or irrelevant responses that could be harmful to users. The quality of in-built protections within models is patchy and model-dependent. Several research papers, including Anthropic's 2025 agentic misalignment paper, show that in some cases models can resort to malicious behaviors, bypassing company policies and moral expectations.

Guardrail detection in monitoring tools allows us to identify risky responses generated by AI agents, such as harmful content, inappropriate interactions, or injection attempts aiming to compromise systems or elicit confidential information. Using OpenLit as our example, we are able to monitor for breaches of the following guardrail types:

  • Prompt Injection: Detection of malicious injection attempts, impersonation, and other jailbreaking techniques
  • Sensitive Topics: Detection of content on controversial, sensitive, or illegal topics such as politics, religion, adult content, substance abuse, or violence
  • Restricted Topics: Detection of content that violates company policies or ethical guidelines, or covers topics that the tool should avoid, such as giving financial or legal advice

These shields can be set up using OpenLit as per the below code:

import openlit from "openlit";

// Other imports omitted for brevity

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

// Tools and Azure configuration omitted for brevity

openlit.init({
  applicationName: "ai-travel-agent",
  environment: "development",
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  disableBatch: true,
});

// Choose one of the following approaches:
// Option 1: enable all available guardrails
const guardsAll = openlit.guard.All({
  provider: "openai",
  collectMetrics: true,
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT,
  validTopics: ["travel", "culture"],
  invalidTopics: ["finance", "software engineering"],
});

// Option 2: enable specific guardrail types (for example, prompt injection detection)
const guardsPromptInjection = openlit.guard.PromptInjection({
  provider: "openai",
  collectMetrics: true, // Ensures guardrail breaches are exported to Elastic
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Post request handler
export async function POST(req: Request) {
  const { messages, id } = await req.json();

  try {
    const convertedMessages = await convertToModelMessages(messages);
    const prompt = `You are a helpful assistant that returns travel itineraries...`;

    const result = streamText({
      model: azure("gpt-4o"),
      system: prompt,
      messages: convertedMessages,
      stopWhen: stepCountIs(2),
      tools,
      experimental_telemetry: { isEnabled: true },
      onFinish: async ({ text, steps }) => {
        const guardrailResult = await guardsAll.detect(text);
        console.log("Guardrail results:", guardrailResult);
      },
    });

    // Return data stream to allow the useChat hook to handle the results as they are streamed through for a better user experience
    return result.toUIMessageStreamResponse();
  } catch (e) {
    console.error(e);
    return new NextResponse(
      "Unable to generate a plan. Please try again later!"
    );
  }
}

In the event of a guardrail breach, metrics containing detail of the breach are sent to Elastic, similar to the following:

Of course we can leverage dashboards to visualize trends of guardrail breaches, including metrics such as volumes by category, as shown in the below example (with the corresponding NDJSON available here):

We can also act on these breaches by notifying relevant teams through the available alerting tools. Alerts should be triggered based on the severity and classification of the detected issue; for example, mentions of violence, illegal themes, or injection attacks may trigger immediately, while minor inaccuracies may not. Guardrail breaches can also trigger automated responses, such as blocking the response from being sent to the user, or warning the user that their request has been flagged for review, triggering a human-in-the-loop response from the relevant teams.
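Such automated handling could be sketched as follows; the GuardrailResult shape, the 0.5 threshold, and the category names are assumptions for illustration, not OpenLit's actual output schema:

```typescript
// Hypothetical guardrail detection result
interface GuardrailResult {
  classification: string;
  score: number; // 0 (safe) to 1 (confident breach)
}

// Decide how to handle a response based on the detected category and score.
// Severe categories block outright; other confident breaches warn the user.
function decideAction(result: GuardrailResult): "block" | "warn" | "allow" {
  const immediateCategories = ["prompt_injection", "violence", "illegal"];
  if (immediateCategories.includes(result.classification) && result.score >= 0.5) {
    return "block";
  }
  if (result.score >= 0.5) {
    return "warn";
  }
  return "allow";
}
```

A "block" outcome might return a canned refusal to the user and page the on-call team, while "warn" could attach a review flag for a human-in-the-loop workflow.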

Conclusion

AI agents are becoming more autonomous, more powerful, and more unpredictable. For this reason, it's important to introduce monitoring telemetry as early as possible, both in the development process and in organizational culture. This article has covered how monitoring AI agents is different and how to do it using OpenLit to generate OpenTelemetry signals to send to Elastic. Check out the code here and start monitoring your AI agents in production.
