Elastic Observability Labs - Articles by Daniela Tzvetkova

LLM Observability for Google Cloud’s Vertex AI platform - understand performance, cost and reliability

Wed, 09 Apr 2025 00:00:00 GMT

As organizations increasingly adopt large language models (LLMs) for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Google Cloud’s Vertex AI.

New Elastic Observability LLM integration with Google Cloud’s Vertex AI platform

We are thrilled to announce the general availability of monitoring LLMs hosted in Google Cloud through the Elastic integration with Vertex AI. This integration enables users to experience enhanced LLM Observability by providing deep insights into the usage, cost and operational performance of models on Vertex AI, including latency, errors, token usage, frequency of model invocations, and resources utilized by models. By leveraging this data, organizations can optimize resource usage, identify and resolve performance bottlenecks, and enhance model efficiency and accuracy.

Observability needs for AI-powered applications using the Vertex AI platform

Leveraging AI models creates unique needs around the observability and monitoring of AI-powered applications. Some of the challenges that come with using LLMs are related to the high cost to call the LLMs, the quality and safety of LLM responses, and the performance, reliability and availability of the LLMs.

Lack of visibility into LLM observability data can make it harder for SREs and DevOps teams to ensure their AI-powered applications meet their service level objectives for reliability, performance, cost and quality of the AI-generated content and have enough telemetry data to troubleshoot related issues. Thus, robust LLM observability and detection of anomalies in the performance of models hosted on Google Cloud’s Vertex AI platform in real time is critical for the success of AI-powered applications.

Depending on the needs of their LLM applications, customers can make use of a growing list of models hosted on the Vertex AI platform such as Gemini 2.0 Pro, Gemini 2.0 Flash, and Imagen for image generation. Each model excels in specific areas and generates content in one or more modalities, including language, audio, vision, and code. No two models are the same; each has its own performance characteristics. It is therefore important that service operators can track the performance, behaviour and cost of each model individually.

Unlocking Insights with Vertex AI Metrics

The Elastic integration with Google Cloud’s Vertex AI platform collects a wide range of metrics from models hosted on Vertex AI, enabling users to monitor, analyze, and optimize their AI deployments effectively.

Once the integration is set up, you can review all the metrics in the Vertex AI dashboard.

These metrics can be categorized into the following groups:

1. Prediction Metrics

Prediction metrics provide critical insights into model usage, performance bottlenecks, and reliability. These metrics help ensure smooth operations, optimize response times, and maintain robust, accurate predictions.

Prediction Count by Endpoint: Measures the total number of predictions across different endpoints.
Prediction Latency: Provides insights into the time taken to generate predictions, allowing users to identify bottlenecks in performance.
Prediction Errors: Monitors the count of failed predictions across endpoints.
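
As a rough illustration of how these three prediction metrics combine, the sketch below derives an error rate and a 95th-percentile latency per endpoint. The endpoint names, sample values, and helper functions are hypothetical and are not part of the integration; it only shows the arithmetic a dashboard panel would perform.

```python
from statistics import quantiles

# Hypothetical per-endpoint samples; the real integration stores these as metric documents.
SAMPLES = {
    "endpoint-a": {"predictions": 1200, "errors": 18, "latencies_ms": [120, 135, 150, 180, 240, 900]},
    "endpoint-b": {"predictions": 300, "errors": 0, "latencies_ms": [80, 85, 90, 95, 110, 130]},
}


def error_rate(predictions: int, errors: int) -> float:
    """Fraction of failed predictions; 0.0 when there was no traffic."""
    return errors / predictions if predictions else 0.0


def p95(latencies_ms) -> float:
    """95th-percentile latency (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(latencies_ms, n=100, method="inclusive")[94]


def summarize(samples: dict) -> dict:
    """Per-endpoint error rate and tail latency."""
    return {
        name: {
            "error_rate": error_rate(s["predictions"], s["errors"]),
            "p95_ms": p95(s["latencies_ms"]),
        }
        for name, s in samples.items()
    }


for endpoint, stats in summarize(SAMPLES).items():
    print(endpoint, stats)
```

Looking at tail latency next to the error rate helps separate an endpoint that is slow for a few requests from one that is failing outright.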

2. Model Performance Metrics

Model performance metrics provide crucial insights into deployment efficiency and responsiveness. These metrics help optimize model performance and ensure reliable operations.

Model Usage: Tracks the usage distribution among different model deployments.
Token Usage: Tracks the number of tokens consumed by each model deployment, which is critical for understanding model efficiency.

Invocation Rates: Tracks the frequency of invocations made by each model deployment.
Model Invocation Latency: Measures the time taken to invoke a model, helping in diagnosing performance issues.

3. Resource Utilization Metrics

Resource utilization metrics are vital for monitoring resource efficiency and workload performance. They help optimize infrastructure, prevent bottlenecks, and ensure smooth operation of AI deployments.

CPU Utilization: Monitors CPU usage to ensure optimal resource allocation for AI workloads.
Memory Usage: Tracks the memory consumed across all model deployments.
Network Usage: Measures bytes sent and received, providing insights into data transfer during model interactions.

4. Overview Metrics

These metrics give an overview of the models deployed in Google Cloud’s Vertex AI platform. They are essential for tracking overall performance, optimizing efficiency, and identifying potential issues across deployments.

Total Invocations: The overall count of prediction invocations across all models and endpoints, providing a comprehensive view of activity.
Total Tokens: The total number of tokens processed across all model interactions, offering insights into resource utilization and efficiency.
Total Errors: The total count of errors encountered across all models and endpoints, helping identify reliability issues.

All metrics can be filtered by region, offering localized insights for better analysis.
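
To make the overview totals and the region filter concrete, here is a minimal sketch that sums invocations, tokens, and errors across records, optionally restricted to one region. The record layout, field names, model names, and values are assumptions for illustration only and do not reflect the integration's actual schema.

```python
# Hypothetical metric records; field names are illustrative, not the integration's schema.
RECORDS = [
    {"region": "us-central1", "model": "model-a", "invocations": 500, "tokens": 120_000, "errors": 4},
    {"region": "us-central1", "model": "model-b", "invocations": 200, "tokens": 30_000, "errors": 1},
    {"region": "europe-west4", "model": "model-a", "invocations": 100, "tokens": 25_000, "errors": 0},
]


def overview(records, region=None):
    """Sum the overview metrics, optionally restricted to a single region."""
    totals = {"invocations": 0, "tokens": 0, "errors": 0}
    for record in records:
        if region is not None and record["region"] != region:
            continue  # skip records outside the requested region
        for key in totals:
            totals[key] += record[key]
    return totals


print(overview(RECORDS))                  # totals across all regions
print(overview(RECORDS, "us-central1"))   # totals for one region only
```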

Note: The Elastic integration with Vertex AI provides comprehensive visibility into both deployment models: provisioned throughput, where capacity is pre-allocated, and pay-as-you-go, where resources are consumed on demand.

Conclusion

This integration with Vertex AI represents a significant step forward in enhancing LLM Observability for users of Google Cloud’s Vertex AI platform. By unlocking a wealth of actionable data, organizations can assess the health, performance and cost of LLMs and troubleshoot operational issues, ensuring scalability and accuracy in AI-driven applications.

Now that you know how the Vertex AI integration enhances LLM Observability, it’s your turn to try it out. Spin up an Elastic Cloud deployment and start monitoring your LLM applications hosted on Google Cloud’s Vertex AI platform.

Troubleshooting your Agents and Amazon Bedrock AgentCore with Elastic Observability

Mon, 01 Dec 2025 00:00:00 GMT

Introduction

We're excited to introduce Elastic Observability’s Amazon Bedrock AgentCore integration, which allows users to observe Amazon Bedrock AgentCore and the agents' LLM interactions end-to-end. Agentic AI represents a fundamental shift in how we build applications.

Unlike standard LLM chatbots that simply generate text, agents can reason, plan, and execute multi-step workflows to complete complex tasks autonomously. Many times these agents are running on a platform such as Amazon Bedrock AgentCore, which helps developers build, deploy and scale agents. Amazon Bedrock AgentCore is Amazon Bedrock's platform providing the secure, scalable, and modular infrastructure services (like agent runtime, memory, and identity) necessary for developers to deploy and operate highly capable AI agents built with any framework or model.

Using a platform such as Amazon Bedrock AgentCore is easy, but troubleshooting an agent is far more complex than debugging a standard microservice. Key challenges include:

Non-Deterministic Behavior: Agents may choose different tools or reasoning paths for the same prompt, making it difficult to reproduce bugs.
"Black Box" Execution: When an agent fails or provides a hallucinated answer, it is often unclear if the issue lies in the LLM's reasoning, the context provided, or a failed tool execution.
Cost & Latency Blind Spots: A single user query can trigger recursive loops or expensive multi-step tool calls, leading to unexpected spikes in token usage and latency.

To effectively observe these systems, you need to correlate signals from two distinct layers:

The Platform Layer (Amazon Bedrock AgentCore): You need to understand the overall health of the managed service. This includes high-level metrics like invocation counts, latency, throttling, and platform-level errors that affect all agents running in AgentCore.
The Application Layer (Your Agentic Logic): You want to understand the granular "why" behind the behavior. This includes distributed traces, usually with OpenTelemetry, that visualize the full request lifecycle (e.g. waterfall view), identifying exactly which step in the reasoning chain failed or took too long.

Agentic AI Observability in Elastic provides a unified, end-to-end view of your agentic deployment by combining platform-level insights from Amazon Bedrock AgentCore, through the new Amazon Bedrock AgentCore integration, with deep application-level visibility from OpenTelemetry (OTel) traces, logs and metrics from the agent. This unified view in Elastic allows you to observe, troubleshoot, and optimize your agentic applications from end to end without switching tools. Additionally, Elastic provides Agent Builder, which allows you to create agents to analyze any of the data from Amazon Bedrock AgentCore and the agents running on it.

Agentic AI Observability in Elastic

As mentioned above there are two main parts to end-to-end Agentic AI Observability in Elastic.

Amazon Bedrock AgentCore Platform Observability - using platform logs and metrics, Elastic provides comprehensive visibility into the high-level health of the AgentCore service by ingesting AWS vended logs and metrics across four critical components:
- Runtime: Monitor core performance indicators such as agent errors, overall latency, throttle counts, and invocation rates, for each endpoint.
- Gateway: Gain specific insights into gateway and tool call performance, including invocations, error rates, and latency.
- Memory: Track short-term and long-term memory operations, including event creation, retrieval, and listing, alongside performance analysis, errors, and latency metrics.
- Identity: Audit security and access health with logs on successful and failed access attempts.

Agent Observability with APM, logs and metrics - To understand how your agent is behaving, Elastic ingests OTel-native traces, metrics and logs from your application running within AgentCore. This allows you to visualize the full execution path, including LLM reasoning steps and tool calls, in a detailed waterfall diagram.
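
To give a feel for what a waterfall view encodes, the sketch below renders a set of spans as indented text lines ordered by start time. It is a simplified stand-in for the APM UI, not how Elastic draws traces, and the `Span` class, span names, and timings are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int


# Hypothetical spans for one agent invocation; names and timings are made up.
SPANS = [
    Span("1", None, "invoke_agent", 0, 4200),
    Span("2", "1", "llm.reasoning", 10, 1500),
    Span("3", "1", "tool.search_flights", 1520, 2100),
    Span("4", "1", "llm.final_answer", 3630, 560),
]


def render_waterfall(spans):
    """Return one text line per span, indented by depth and ordered by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} [start={span.start_ms}ms, took={span.duration_ms}ms]")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root spans
    return lines


for line in render_waterfall(SPANS):
    print(line)
```

Each child span sits one level below its parent, so a slow tool call or reasoning step stands out by its duration relative to the root span.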

Agentic AI Analysis - All of the data from Amazon Bedrock AgentCore and the agent running on it can be analyzed with Elastic’s AI-driven capabilities. These include:

Elastic AgentCore SRE Agent built on Elastic Agent Builder - We don't just monitor agents; we provide you with one to assist your team. The AgentCore SRE Agent is a specialized assistant built using Elastic Agent Builder. It possesses specialized knowledge of AgentCore applications observed in Elastic.
- How it helps: You can ask specific questions regarding your AgentCore environment, such as how to interpret a complex error log or why a specific trace shows latency.
- Get the Agent: You can deploy this agent yourself from our GitHub repository.
Elastic Observability AI Assistant - Use natural language anywhere in Elastic’s UI to help you pinpoint issues, analyze something specific, or just learn what the problem is through the LLM and its knowledge base. Additionally, SREs can interpret log messages, errors, and metrics patterns, optimize code, write reports, and even identify and execute a runbook or find a related GitHub issue.
Streams - AI-Driven Log Analysis - When you send AgentCore logs from your instrumented application into Elastic, you can parse and analyze them. Additionally, Streams finds Significant Events within your log stream allowing you to focus immediately on what matters most.
Dashboards and ES|QL - Data is only useful if you can act on it. Elastic provides out-of-the-box (OOTB) assets to reduce your mean time to resolution (MTTR), and ES|QL to help you perform ad-hoc analysis on any signal.
- OOTB Dashboards: Pre-built visualizations based on AgentCore service signals. These dashboards provide an immediate, high-level overview of the usage, health, and performance of your AgentCore runtime, gateway, memory, and identity components.
- OOTB Alert Templates: Pre-configured alerts for common agentic issues (e.g., high error rates, latency spikes, or unusual token consumption), allowing you to move from reactive to proactive troubleshooting immediately.

Onboarding Amazon Bedrock AgentCore signals into Elastic

Amazon Bedrock AgentCore Integration

To get started with platform-level visibility, you need to enable the Amazon Bedrock AgentCore integration in Elastic. This integration automatically collects metrics and logs from your AgentCore runtime, gateway, memory, and identity components via Amazon CloudWatch.

Setup Steps:

Prepare AWS Environment: Ensure your AgentCore agents are deployed and running and that you have enabled logging on your AgentCore resources in the AWS console.
Add the Integration:
- In Elastic (Kibana), navigate to Integrations.
- Search for "Amazon Bedrock AgentCore". Select Add Amazon Bedrock AgentCore.
Configure & Deploy:

Configure Elastic's Amazon Bedrock AgentCore integration to collect CloudWatch metrics from your chosen AWS region at the specified collection interval. Logs will be added soon after the publication of this blog.

Onboard the Agent with OTel Instrumentation

The next step is observing the application logic itself. The beauty of Amazon Bedrock AgentCore is that the application runtime often comes pre-instrumented. You simply need to tell it where to send the telemetry data.

For this example, we will use the Travel Assistant from the Elastic Observability examples.

To instrument this agent, you do not need to modify the source code. Instead, when you invoke the agent using the agentcore CLI, you simply pass your Elastic connection details as environment variables. This redirects the OTel signals (traces, metrics, and logs) directly to the Elastic EDOT collector.

Example Invoke Command: Run the following command to launch the agent and start streaming telemetry to Elastic:

    agentcore launch \
    --env BEDROCK_MODEL_ID="us.anthropic.claude-3-5-sonnet-20240620-v1:0" \
    --env OTEL_EXPORTER_OTLP_ENDPOINT="https://.region.cloud.elastic.co:443" \
    --env OTEL_EXPORTER_OTLP_HEADERS="Authorization=ApiKey " \
    --env OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" \
    --env OTEL_METRICS_EXPORTER="otlp" \
    --env OTEL_TRACES_EXPORTER="otlp" \
    --env OTEL_LOGS_EXPORTER="otlp" \
    --env OTEL_RESOURCE_ATTRIBUTES="service.name=travel_assistant,service.version=1.0.0" \
    --env AGENT_OBSERVABILITY_ENABLED="true" \
    --env DISABLE_ADOT_OBSERVABILITY="true" \
    --env TAVILY_API_KEY=""

Key Configuration Parameters:

OTEL_EXPORTER_OTLP_ENDPOINT: Your Elastic OTLP endpoint (ensure port 443 is specified).
OTEL_EXPORTER_OTLP_HEADERS: The Authorization header containing your Elastic API Key.
DISABLE_ADOT_OBSERVABILITY=true: This ensures the native AgentCore signals are routed exclusively to your defined endpoint (Elastic) rather than default AWS paths.
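
If you launch agents from a script rather than by hand, the same settings can be assembled programmatically. The sketch below builds the `agentcore launch --env ...` argument list shown above and rejects an endpoint that does not specify port 443. The helper names, the `example.invalid` endpoint, and the `MY_API_KEY` value are placeholders of ours, and only the variables relevant to telemetry are included.

```python
import shlex
from urllib.parse import urlparse


def otel_env(endpoint: str, api_key: str, service_name: str, service_version: str = "1.0.0") -> dict:
    """Build the OTel-related environment variables used in the launch command above."""
    if urlparse(endpoint).port != 443:
        raise ValueError("the OTLP endpoint must specify port 443")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=ApiKey {api_key}",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        "OTEL_METRICS_EXPORTER": "otlp",
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_LOGS_EXPORTER": "otlp",
        "OTEL_RESOURCE_ATTRIBUTES": f"service.name={service_name},service.version={service_version}",
        "AGENT_OBSERVABILITY_ENABLED": "true",
        "DISABLE_ADOT_OBSERVABILITY": "true",
    }


def launch_command(env: dict) -> list:
    """Turn the variables into an argument list: agentcore launch --env K=V ..."""
    args = ["agentcore", "launch"]
    for key, value in env.items():
        args += ["--env", f"{key}={value}"]
    return args


# Placeholder endpoint and key; substitute your own Elastic connection details.
command = launch_command(otel_env("https://example.invalid:443", "MY_API_KEY", "travel_assistant"))
print(shlex.join(command))
```

Returning a list rather than a single string means the command can be passed to `subprocess.run` without shell quoting issues.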

Analyzing Agentic Data in Elastic Observability

As we walk through the analysis features below, we will use the Travel Assistant agent which we instrumented earlier as well as any other apps you may be running on AgentCore. For the purposes of this example, as a second agent, we will use the Customer Support Assistant from the AWS Labs AgentCore samples.

Out-of-the-Box (OOTB) Dashboards

Elastic populates a set of comprehensive dashboards based on Amazon Bedrock AgentCore service logs and metrics. These appear as a unified view with tabs, providing a "single pane of glass" into the operational health of your platform.

This view is divided into four key zones, each addressing a specific component of AgentCore: Runtime, Gateway, Memory, and Identity. Note that not all agentic applications use all four components. In our example, only the Customer Support Assistant uses all four, whereas the Travel Assistant uses only Runtime.

Runtime Health

Visualize agent invocations, session metrics, error trends (system vs. user), and performance stats like latency and throttling, split per endpoint. This dashboard helps you answer questions like:

"How are my Travel Assistant agent and Customer Support agent performing in terms of overall traffic and latency, and are there any spikes in errors or throttling?"

Gateway Performance

Analyze invocations across Lambda and MCP (Model Context Protocol), with detailed breakdowns for tool vs. non-tool calls. The dashboard highlights throttling detection, target execution times, and separates system errors from user errors.

Question answered: "Are my external integrations (Lambda, MCP) performing efficiently, or are specific tool calls experiencing high latency, throttling, or system-level errors?"

Memory Operations

Track core operations like event creation, retrieval, and listing, alongside deep dives into long-term memory processing. This includes extraction and consolidation metrics broken down by strategy type, as well as specific monitoring for throttling and system vs. user errors.

Question answered: "Are failures in memory consolidation strategies or high retrieval latency preventing the agent from effectively recalling user context?"

Identity & Access

Monitor identity token fetch operations (workload, OAuth, API keys) and real-time authentication success/failure rates. The dashboard breaks down activity by provider and highlights throttling or capacity bottlenecks.

Question answered: "Are authentication failures or token fetch bottlenecks from specific providers preventing agents from accessing required resources?"

Out-of-the-Box (OOTB) Alert Templates

Observability isn't just about looking at dashboards; it's about knowing when to act. To move from reactive checking to proactive monitoring, Elastic provides OOTB Alert Rule Templates (starting with Elastic version 9.2.1).

These templates eliminate guesswork by pre-selecting the optimal metrics to monitor and applying sensible thresholds. This configuration focuses on high-fidelity alerts for genuine anomalies, helping you catch critical issues early while minimizing alert fatigue.

Suggested OOTB Alerts:

Agent Runtime System Errors: Detects server-side errors (500 Internal Server Error) during agent runtime invocations, indicating infrastructure or service issues with Amazon Bedrock AgentCore.
Agent Runtime User Errors: Flags client-side errors (4xx) during agent runtime invocations, including validation failures (400), resource not found (404), access denied (403), and resource conflicts (409). This helps catch misconfigured permissions, invalid input, or missing resources early.
Agent Runtime High Latency: Triggers when the average latency for agent runtime invocations exceeds 10 seconds (10,000ms). Latency measures the time elapsed between receiving a request and sending the final response token.
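
The conditions behind these three alerts can be sketched in a few lines. This is only an illustration of the logic, not the rule templates themselves: the function names are ours, and treating every 5xx status as a system error is an assumption (the description above names 500 specifically).

```python
from statistics import mean

LATENCY_THRESHOLD_MS = 10_000  # 10 seconds, as in the high-latency alert description


def classify_status(status_code: int) -> str:
    """Map an HTTP status code onto the alert categories described above."""
    if 500 <= status_code <= 599:  # assumption: any 5xx counts as a system error
        return "system_error"
    if 400 <= status_code <= 499:  # e.g. 400, 403, 404, 409
        return "user_error"
    return "ok"


def high_latency(latencies_ms) -> bool:
    """True when the average latency exceeds the threshold; False for an empty window."""
    return bool(latencies_ms) and mean(latencies_ms) > LATENCY_THRESHOLD_MS


print(classify_status(403))
print(high_latency([9_000, 12_000]))
```

Because the latency alert uses an average, a single slow invocation in a busy window may not trigger it, while a sustained slowdown will.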

APM Tracing

While logs and metrics tell you that an issue exists, APM Tracing tells you exactly where and why it is happening. By ingesting the OpenTelemetry signals from your instrumented agent, Elastic generates a detailed distributed trace (e.g. waterfall view) for every interaction. To get further details on LLM information such as prompts, responses, token usage, etc, you can explore the APM logs.

This allows you to peer inside the "black box" of the agent's execution flow:

Visualize the Chain of Thought: See the full sequence of events, from the user's initial prompt to the final response, including all intermediate reasoning steps.
Pinpoint Tool Failures: Identify exactly which external tool (e.g., a Lambda function for flight booking or a knowledge base query) failed or timed out.
Analyze Latency Contributors: Distinguish between latency caused by the LLM's generation time versus latency caused by slow downstream API calls.
Debug with Context: Drill down into individual spans to see specific error messages, attributes, and metadata that explain why a particular step failed.
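
As a small example of analyzing latency contributors, the sketch below groups span durations into LLM time and tool time and reports which dominates. The `llm.` and `tool.` name prefixes are a hypothetical naming convention chosen for this example; real span names depend on how your agent is instrumented.

```python
# Hypothetical (span name, duration in ms) pairs taken from one trace.
SPAN_DURATIONS = [
    ("llm.reasoning", 1500),
    ("tool.search_flights", 2100),
    ("llm.final_answer", 560),
]


def latency_by_category(span_durations):
    """Sum durations per category, using the span-name prefix as the category."""
    totals = {"llm": 0, "tool": 0, "other": 0}
    for name, duration_ms in span_durations:
        prefix = name.split(".", 1)[0]
        category = prefix if prefix in ("llm", "tool") else "other"
        totals[category] += duration_ms
    return totals


def dominant_contributor(span_durations):
    """Return the category that accounts for the most time."""
    totals = latency_by_category(span_durations)
    return max(totals, key=totals.get)


print(latency_by_category(SPAN_DURATIONS))
print(dominant_contributor(SPAN_DURATIONS))
```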

Conclusion

As organizations move from experimental chatbots to complex, autonomous agents in production, the need for robust observability has never been greater. Agentic applications introduce new layers of complexity—non-deterministic behaviors, multi-step reasoning loops, and cost implications—that standard monitoring tools simply cannot see.

Elastic Agentic AI Observability for Amazon Bedrock AgentCore bridges this gap. By unifying platform-level health metrics from AgentCore with deep, transaction-level distributed tracing from OpenTelemetry, Elastic gives SREs and developers the complete picture. Whether you are debugging a failed tool call, optimizing latency, or controlling token costs, you have the visibility needed to run agentic AI with confidence.

Complete Visibility: AgentCore + Amazon Bedrock: For the most comprehensive view, we recommend onboarding Elastic’s Amazon Bedrock integration alongside AgentCore. While the AgentCore integration focuses on the orchestration layer—monitoring agent errors, tool latency, and invocations—the Bedrock integration provides deep visibility into the underlying foundation models themselves. This includes tracking model-specific latency, token usage, full prompts and responses, and even Guardrails usage and effectiveness. By combining both, you ensure complete coverage from the high-level agent workflow down to the raw model inference.

Read more: Monitor Amazon Bedrock with Elastic
Read more: Amazon Bedrock Guardrails Observability

Get Started Today

Ready to see your agents in action?

Try it out: Log in to Elastic Cloud and add the Amazon Bedrock AgentCore integration, or use Elastic from Amazon Marketplace.
Explore the Code: Check out our GitHub repository for the Travel Assistant which you saw in this blog, as well as the AgentCore SRE Agent.
Learn More: Read the full documentation on setting up the integration for Agentic AI Observability for Amazon Bedrock AgentCore.

LLM observability with Elastic: Taming the LLM with Guardrails for Amazon Bedrock

Sun, 02 Mar 2025 00:00:00 GMT

In a previous blog we showed you how to set up observability for your models hosted on Amazon Bedrock using Elastic’s integration. You can now effortlessly enable observability for your Amazon Bedrock guardrails using the enhanced Elastic Amazon Bedrock integration. If you previously onboarded the Amazon Bedrock integration, just upgrade it and you will automatically get all guardrails-related updates. The enhanced integration provides a single-pane-of-glass dashboard with two panels: one focusing on overall Bedrock visualizations and a separate one dedicated to Guardrails. You can now ingest and visualize metrics and logs specific to Guardrails, such as guardrail invocation count, invocation latency, text unit utilization, guardrail policy types associated with interventions, and many more.

In this blog we will show you how to set up observability for Amazon Bedrock Guardrails, how you can make use of the enhanced dashboards and what key signals to alert on for an effective observability coverage of your Bedrock guardrails.

Prerequisites

To follow along with this blog, please make sure you have:

An account on Elastic Cloud and a deployed stack in AWS (see instructions here). Ensure you are using version 8.16.2 or higher. Alternatively, you can use Elastic Cloud Serverless, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.
An AWS account with permissions to pull the necessary data from AWS. See details in our documentation.

Steps to create a guardrail for Amazon Bedrock

Before you set up observability for the guardrails, ensure that you have configured guardrails for your model. Follow the steps below to create an Amazon Bedrock Guardrail:

Access the Amazon Bedrock Console
- Sign in to the AWS Management Console with appropriate permissions and navigate to the Amazon Bedrock console.
Navigate to Guardrails
- From the left-hand menu, select Guardrails.
Create a New Guardrail
- Select Create guardrail.
- Provide a descriptive name, an optional brief description, and specify a message to display when the guardrail blocks the user prompt.
  - Example: Sorry, I am not configured to answer such questions. Kindly ask a different question.
Configure Guardrail Policies
- Content Filters: Adjust settings to block harmful content and prompt attacks.
- Denied Topics: Specify topics to block.
- Word Filters: Define specific words or phrases to block.
- Sensitive Information Filters: Set up filters to detect and remove sensitive information.
- Contextual Grounding:
  - Configure the Grounding Threshold to set the minimum confidence level for factual accuracy.
  - Set the Relevance Threshold to ensure responses align with user queries.
Review and Create
- Review your settings and select Create to finalize the guardrail.
Create a Guardrail Version
- In the Version section, select Create.
- Optionally add a description, then select Create Version.

After creating a version of your guardrail, it's important to note down the Guardrail ID and the Guardrail Version Name. These identifiers are essential when integrating the guardrail into your application, as you'll need to specify them during guardrail invocation.

Example code to integrate with Amazon Bedrock guardrails

Calling Amazon Bedrock from your Python application through the Converse API enables advanced language model interactions with customisable safety measures. By configuring guardrails, you can ensure that the model adheres to predefined policies, preventing it from generating inappropriate or sensitive content.

The following code demonstrates how to integrate Amazon Bedrock with guardrails to enforce contextual grounding in AI-generated responses. It sets up a Bedrock runtime client using AWS credentials, defines a reference grounding statement, and uses the Converse API to process user queries with contextual constraints. The converse_with_guardrails function sends a user query alongside a predefined grounding reference, ensuring that responses align with the provided knowledge source.

Setting Up Environment Variables

Before running the script, configure the required AWS credentials and guardrail settings as environment variables. These variables allow the script to authenticate with Amazon Bedrock and apply the necessary guardrails for safe and controlled AI interactions.

Create a .env file in the same directory as your script and add:

AWS_ACCESS_KEY="your-access-key"
AWS_SECRET_KEY="your-secret-key"
AWS_REGION="your-aws-region"
GUARDRAIL_ID="your-guardrail-id"
GUARDRAIL_VERSION="your-guardrail-version"
CHAT_MODEL="your-model-id"
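
Before calling Bedrock, it can help to confirm that every setting the script reads is actually present. The check below is a minimal sketch using only the standard library; the `missing_settings` helper is ours, and the list includes CHAT_MODEL because the script reads it with os.getenv when it builds the Converse request.

```python
import os
from typing import Mapping

# Names the script reads with os.getenv.
REQUIRED_SETTINGS = (
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "AWS_REGION",
    "GUARDRAIL_ID",
    "GUARDRAIL_VERSION",
    "CHAT_MODEL",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list:
    """Return the required names that are unset or empty in the given environment."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running this after the .env file has been loaded surfaces a missing value as a clear message instead of an API error later on.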

Create a Python script and run

Create a Python script using the code below and execute it to interact with the Amazon Bedrock Guardrails you set up.

import json
import os

import boto3
from botocore.exceptions import ClientError
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Function to check for hallucinations using contextual grounding
def check_hallucination(response):
   # Scores and thresholds stay None when the trace reports no contextual grounding filters
   relevance = grounding = None
   relevance_threshold = grounding_threshold = None

   # A failed converse call returns None, so fall back to an empty response
   output_assessments = (response or {}).get("trace", {}).get("guardrail", {}).get("outputAssessments", {})

   # Iterate over all assessments
   for key, assessments in output_assessments.items():
       for assessment in assessments:
           contextual_policy = assessment.get("contextualGroundingPolicy", {})

           for filter_result in contextual_policy.get("filters", []):
               filter_type = filter_result.get("type")
               if filter_type == "RELEVANCE":
                   relevance = filter_result.get("score", 0)
                   relevance_threshold = filter_result.get("threshold", 0)
               elif filter_type == "GROUNDING":
                   grounding = filter_result.get("score", 0)
                   grounding_threshold = filter_result.get("threshold", 0)

           # Compare only the scores that were actually reported
           below_relevance = relevance is not None and relevance < relevance_threshold
           below_grounding = grounding is not None and grounding < grounding_threshold
           if below_relevance or below_grounding:
               return True, relevance, grounding, relevance_threshold, grounding_threshold  # Hallucination detected

   return False, relevance, grounding, relevance_threshold, grounding_threshold  # No hallucination detected

def converse_with_guardrails(bedrock_client, messages, grounding_reference):
   message = [
       {
           "role": "user",
           "content": [
               {
                   "guardContent": {
                       "text": {
                           "text": grounding_reference,
                           "qualifiers": ["grounding_source"],
                       }
                   }
               },
               {
                   "guardContent": {
                       "text": {
                           "text": messages,
                           "qualifiers": ["query"],
                       }
                   }
               },
           ],
       }
   ]
   converse_config = {
       "modelId": os.getenv('CHAT_MODEL'),
       "messages": message,
       "guardrailConfig": {
           "guardrailIdentifier": os.getenv("GUARDRAIL_ID"),
           "guardrailVersion": os.getenv("GUARDRAIL_VERSION"),
           "trace": "enabled"
       },
       "inferenceConfig": {
           "temperature": 0.5       
       },
   }
   try:
       response = bedrock_client.converse(**converse_config)
       return response
   except ClientError as e:
       error_message = e.response['Error']['Message']
       print(f"An error occurred: {error_message}")
       print("Converse config:")
       print(json.dumps(converse_config, indent=2))
       return None
  
def pretty_print_response(response, is_hallucination, relevance, relevance_threshold, grounding, grounding_threshold):
   # Render a score with two decimals, or "N/A" when the trace did not report it
   def fmt(value):
       return f"{value:.2f}" if value is not None else "N/A"

   print("\n" + "="*60)
   print(" Guardrail Assessment")
   print("="*60)
   # Extract response message safely; response is None when the converse call failed
   content = (response or {}).get("output", {}).get("message", {}).get("content") or [{}]
   response_text = content[0].get("text", "N/A")
   print("\n **Model Response:**")
   print(f"   {response_text}")
   print("\n **Guardrail Assessment:**")
   print(f"   Is Hallucination : {is_hallucination}")
   print("\n **Contextual Grounding Policy Scores:**")
   print(f"   - Relevance Score : {fmt(relevance)} (Threshold: {fmt(relevance_threshold)})")
   print(f"   - Grounding Score : {fmt(grounding)} (Threshold: {fmt(grounding_threshold)})")
   print("\n" + "="*60 + "\n")
  
def main():
   bs = boto3.Session(
       aws_access_key_id=os.getenv('AWS_ACCESS_KEY'),
       aws_secret_access_key=os.getenv('AWS_SECRET_KEY'),
       region_name=os.getenv('AWS_REGION')
   )

   # Initialize Bedrock client
   bedrock_client = bs.client("bedrock-runtime")

   # Grounding reference
   grounding_reference = "The Wright brothers made the first powered aircraft flight on December 17, 1903."

   # User query
   user_query = "Who were the first to fly an airplane?"
  
   # Get model response
   response = converse_with_guardrails(bedrock_client, user_query, grounding_reference)

   # Check for hallucinations
   is_hallucination, relevance, grounding, relevance_threshold, grounding_threshold = check_hallucination(response)

   # Print the results
   pretty_print_response(response, is_hallucination, relevance, relevance_threshold, grounding, grounding_threshold)


if __name__ == "__main__":
   main()

Identifying Hallucinations with Contextual Grounding

The contextual grounding feature proved effective in identifying potential hallucinations by comparing model responses against reference information. Relevance and grounding scores provided quantitative measures to assess the accuracy of model outputs.

The output of the Python script below demonstrates how the Grounding Score helps detect hallucinations:

============================================================
 Guardrail Assessment
============================================================

 **Model Response:**
   Sorry, I am not configured to answer such questions. Kindly ask a different question.

 **Guardrail Assessment:**
   Is Hallucination : True

 **Contextual Grounding Policy Scores:**
   - Relevance Score : 1.00 (Threshold: 0.99)
   - Grounding Score : 0.03 (Threshold: 0.99)

============================================================

Here, the Grounding Score of 0.03 is significantly lower than the configured threshold of 0.99, indicating that the response lacks factual accuracy. Since the score falls below the threshold, the system flags the response as a hallucination, highlighting the need to monitor guardrail outputs to ensure AI safety.

Configuring Amazon Bedrock Guardrails Metrics & Logs Collection

Elastic makes it easy to collect both logs and metrics from Amazon Bedrock Guardrails using the Amazon Bedrock integration. By default, Elastic provides a curated set of logs and metrics, but you can customize the configuration based on your needs. The integration supports Amazon S3 and Amazon CloudWatch Logs for log collection, along with metrics collection from your chosen AWS region at a specified interval.

Follow these steps to enable the collection of metrics and logs:

Navigate to Amazon Bedrock Settings - In the AWS Console, go to Amazon Bedrock and open the Settings section.
Choose Logging Destination - Select whether to send logs to Amazon S3 or Amazon CloudWatch Logs.
Provide Required Details
- If using Amazon S3, logs can be collected from objects referenced in S3 notification events (read from an SQS queue) or by direct polling from an S3 bucket.
- If using CloudWatch Logs, create a CloudWatch log group and note its ARN, as this is required for configuring both Amazon Bedrock and the Elastic Amazon Bedrock integration.

Configure Elastic's Amazon Bedrock integration - In Elastic, set up the Amazon Bedrock integration, ensuring the logging destination matches the one configured in Amazon Bedrock. Logs from your selected source and metrics from your AWS region will be collected automatically.

Accept Defaults or Customize Settings - Elastic provides a default configuration for logs and metrics collection. You can accept these defaults or adjust settings such as collection intervals to better fit your needs.

Understanding the pre-configured dashboard for Amazon Bedrock Guardrails

You can access the Amazon Bedrock Guardrails dashboard using either of the following methods:

Navigate to the Dashboard Menu - Select the Dashboard menu option in Elastic and search for [Amazon Bedrock] Guardrails to open the dashboard.
Navigate to the Integrations Menu - Open the Integrations menu in Elastic, select Amazon Bedrock, go to the Assets tab, and choose [Amazon Bedrock] Guardrails from the dashboard assets.

The Amazon Bedrock Guardrails dashboard in the Elastic integration provides insights into guardrail performance, tracking total invocations, API latency, text unit usage, and intervention rates. It analyzes policy-based interventions, highlighting trends, text consumption, and frequently triggered policies. The dashboard also showcases instances where guardrails modified or blocked responses and offers a detailed breakdown of invocations by policy and content source.

Guardrail invocation overview

This dashboard section provides a comprehensive summary of key metrics related to guardrail performance and usage:

Total guardrails API invocations: Displays the overall count of times guardrails were invoked.
Average Guardrails API invocation latency: Shows the average response time for guardrail API calls, offering insights into system performance.
Total text unit utilization: Indicates the volume of text processed during guardrail invocations. For text unit pricing, refer to the Amazon Bedrock pricing page.
Invocations - with and without guardrail interventions: A pie chart representation showing the distribution of LLM invocations based on guardrail activity. It displays the count of invocations where no guardrail interventions occurred, those where guardrails intervened and detected policy violations, and those where guardrails intervened but found no violations.

These metrics help users evaluate guardrail effectiveness, track intervention patterns, and optimize configurations to ensure policy enforcement while maintaining system performance.

Guardrail policy types for interventions

This section provides a comprehensive view of guardrail policy interventions and their impact:

Interventions by Policy Type: Bar charts display the number of interventions applied to user inputs and model outputs, categorized by policy type (e.g., Contextual Grounding Policy, Word Policy, Content Policy, Sensitive Information Policy, Topic Policy).
Text Unit Utilization by Policy Type: Panels highlight the text units consumed by various policy interventions, separately for user inputs and model outputs.
Policy Usage Trends: A word cloud visualization reveals the most frequently applied policy types, offering insights into intervention patterns.

By analyzing intervention counts, text unit usage, and policy trends, users can identify frequently triggered policies, optimize guardrail settings, and ensure LLM interactions align with compliance and safety requirements.

Prompt and response where guardrails intervened

This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding guardrail response. The text panel presents the prompt alongside the model's response after applying guardrail interventions. These interventions occur when input evaluation or model responses violate configured policies, leading to blocked or masked outputs.

The section also includes additional details to enhance visibility into how guardrails operate. It indicates whether a violation was detected, along with the violation type (e.g., GROUNDING, RELEVANCE) and the action taken (BLOCKED, NONE). For contextual grounding, the dashboard also shows the filter threshold, which defines the minimum confidence level required for a response to be considered valid, and the confidence score, which reflects how well the response aligns with the expected criteria.

By analyzing violations, actions taken, and confidence scores, users can adjust guardrail thresholds to balance blocking unsafe responses and allowing valid ones, ensuring optimal accuracy and compliance. This process is particularly crucial for detecting and mitigating hallucinations—instances where models generate information not grounded in source data. Implementing contextual grounding checks enables the identification of such ungrounded or irrelevant content, enhancing the reliability of applications like retrieval-augmented generation (RAG).
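
One way to reason about threshold tuning is to replay recorded confidence scores against a few candidate thresholds and see how many responses each one would block. The sketch below is illustrative only: the `scores` list and the `blocked_fraction` helper are hypothetical and not part of the Amazon Bedrock or Elastic APIs. It assumes, as in the example output earlier, that a response is blocked when its score falls below the threshold.

```python
def blocked_fraction(scores, threshold):
    """Return the fraction of responses whose score is below the threshold."""
    if not scores:
        return 0.0
    blocked = sum(1 for score in scores if score < threshold)
    return blocked / len(scores)


# Hypothetical grounding scores, e.g. copied from the dashboard panel
scores = [0.03, 0.45, 0.72, 0.88, 0.95, 0.99]

for threshold in (0.5, 0.75, 0.99):
    share = blocked_fraction(scores, threshold)
    print(f"threshold={threshold:.2f} -> blocked {share:.0%}")
```

A higher threshold blocks more responses, so comparing these fractions against the share of responses you consider valid gives a rough starting point for tuning.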

Guardrail invocation by guardrail policy

This section shows the number of Guardrails API invocations, the overall latency, and the total text units, broken down by guardrail policy (identified by guardrail ARN) and policy version.

Guardrail invocation by content source (Input & Output)

This section provides a detailed overview of critical metrics related to guardrail performance and usage. It includes the total number of guardrail invocations, the count of intervention invocations where policies were applied, the volume of text units consumed during these interventions for both user inputs and model outputs and the average guardrail API invocation latency.

These insights help users understand how guardrails operate across different policies and content sources. By analyzing invocation counts, latency, and text unit consumption, users can assess policy effectiveness, track intervention patterns, and optimize configurations. Evaluating how guardrails interact with user inputs and model outputs ensures consistent enforcement, helping refine thresholds and improve compliance strategies.

Configure SLOs and Alerts

To create an SLO for monitoring contextual grounding accuracy, define a custom query SLI where good events are model responses that meet contextual grounding criteria, ensuring factual accuracy and alignment with the provided reference.

A suitable query for tracking good events is:

gen_ai.prompt : "*qualifiers[\\\"grounding_source\\\"]*" and 
(gen_ai.compliance.violation_detected : false or 
not gen_ai.compliance.violation_detected : *)

The total query, which matches all interactions that include a contextual grounding check, is:

gen_ai.prompt : "*qualifiers[\\\"grounding_source\\\"]*"

Set an SLO target of 99.5%, ensuring that the vast majority of responses remain factually grounded. This helps detect hallucinations and misaligned outputs in real-time. By continuously monitoring contextual grounding accuracy, you can proactively address inconsistencies, retrain models, or refine RAG pipelines before inaccuracies impact end users.
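
To make the 99.5% target concrete, the arithmetic behind the SLI and its error budget can be sketched as follows. Elastic computes these values for you once the SLO is defined; the function names and event counts here are hypothetical and only illustrate the calculation.

```python
def sli(good_events, total_events):
    """Service level indicator: share of good events among all events."""
    if total_events == 0:
        return 1.0  # no traffic means nothing violated the objective
    return good_events / total_events


def error_budget_remaining(good_events, total_events, target=0.995):
    """Fraction of the error budget left; negative once the SLO is breached."""
    allowed_bad = (1 - target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


# Hypothetical counts returned by the good and total queries
total, good = 10_000, 9_970
print(f"SLI: {sli(good, total):.2%}")
print(f"Error budget remaining: {error_budget_remaining(good, total):.0%}")
```

With a 99.5% target, 10,000 grounded interactions allow roughly 50 ungrounded responses before the objective is missed.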

Elastic's alerting capabilities enable proactive monitoring of key performance metrics. For instance, by setting up an alert on the average aws_bedrock.guardrails.invocation_latency with a 500ms threshold, you can promptly identify and address performance bottlenecks, ensuring that policy enforcement remains efficient without causing unexpected delays.
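
The alert condition itself is a simple comparison of a windowed average against the threshold. The following sketch mirrors that logic outside Elastic, for example in a test harness; `should_alert` and the sample latencies are hypothetical, and the real rule is configured in Elastic's alerting UI rather than in code.

```python
from statistics import mean

LATENCY_THRESHOLD_MS = 500  # threshold used in the alert example above


def should_alert(latencies_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return True when the window's average latency exceeds the threshold."""
    if not latencies_ms:
        return False  # an empty window carries no evidence of a problem
    return mean(latencies_ms) > threshold_ms


# Hypothetical guardrail invocation latencies for one evaluation window
window = [300, 400, 900]
print(should_alert(window))
```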

Conclusion

The Elastic Amazon Bedrock integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Amazon Bedrock, including Guardrails. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.

If you haven’t already done so, read our previous blog on what you can do with the Amazon Bedrock integration, set up guardrails for your Bedrock models, and enable the Bedrock integration to start observing your Bedrock models and guardrails today!

LLM Observability with Elastic’s Azure AI Foundry Integration

Fri, 25 Jul 2025 00:00:00 GMT

Introduction

As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM Observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Azure AI Foundry, while minimizing downtime and keeping costs in check.

Elastic is expanding support for LLM Observability with Elastic Observability's new Azure AI Foundry integration, now available as a tech preview on Elastic Cloud. The integration provides comprehensive visibility into the performance and usage of foundational models, such as GPT-4, Mistral, Llama, and thousands of others from leading AI companies and from Azure, available through Azure AI Foundry. It simplifies the collection of metrics and logs, making it easier to gain actionable insights and effectively manage your models. It is simple to set up and comes with pre-built, out-of-the-box dashboards. With real-time insights, SREs can now monitor, optimize, and troubleshoot LLM applications that use Azure AI Foundry.

This blog will walk through the features available to SREs, such as monitoring invocations, errors, and latency information across various models, along with the usage and performance of LLM requests. Additionally, the blog will show how easy it is to set up and what insights you can gain from Elastic for LLM Observability.

Prerequisites

To get started with the Azure AI Foundry integration, you will need:

An account on Elastic Cloud and a deployed stack in Azure (see instructions here). Ensure you are using version 9.0.0 or higher.
An Azure account with permissions to pull the necessary data from Azure and Azure AI Foundry. See details in our documentation.

Configuring Azure AI Foundry Integration

To collect logs and metrics from Azure AI Foundry, configure Azure metrics and logs as described in the following links:

Configure to receive Azure Metrics - This integration collects Azure AI Foundry metrics from the service. Ensure you have the client ID, subscription ID, and tenant ID from Azure AI Foundry to collect metrics.
Configure to receive Azure Logs - Ensure that you configure an Azure event hub so that Elastic can ingest logs. You will need the event hub information to configure the logs section of the Azure AI Foundry integration.

Maximize Visibility with Out-of-the-box dashboards

Azure AI Foundry integration offers rich out-of-the-box visibility into the performance and usage information of models in Azure AI Foundry, including text and image models. There are several dashboards currently available. More will be coming as the integration goes to GA.

Azure AI Foundry Overview - Provides a summarized view of the invocations, errors, and latency information across various models.
Azure AI Foundry Billing - Provides total costs and daily usage costs from Azure Cognitive Services.
Azure AI Foundry Advanced Monitoring - Focuses on logs generated by the Azure AI Foundry service when connected through the API Management service. It provides request rate, error rate, model usage, latency, LLM prompt input, and response completion.

Each dashboard provides specific insights important to SREs. Here is a quick overview of some of these insights:

Model Usage and Token Trends – Visualize token consumption and completion counts by model, endpoint, and time window.
Latency Metrics – Monitor average and percentile latency per prompt, per endpoint, and correlate with prompt types or user IDs.
Cost Estimation – Estimate API usage cost based on token consumption and model pricing.
Prompt/Completion Logging – View prompt-response pairs for debugging and quality monitoring.
Content Filtering and Guardrails – See which prompts or completions are being filtered, and why.

You can drill into specific users or sessions, slice by model type or region, and export reports for usage reviews or compliance.
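
The cost estimation mentioned above boils down to multiplying token counts by per-token prices. Here is a minimal sketch of that calculation. The model names and prices are placeholders invented for illustration, not actual Azure AI Foundry pricing, and `estimate_cost` is a hypothetical helper rather than part of the integration.

```python
# Hypothetical per-1,000-token prices in USD; real prices vary by model and region.
PRICE_PER_1K_TOKENS = {
    "example-model-a": {"input": 0.0025, "output": 0.01},
    "example-model-b": {"input": 0.0005, "output": 0.0015},
}


def estimate_cost(model, input_tokens, output_tokens, prices=PRICE_PER_1K_TOKENS):
    """Estimate the cost of a request from token counts and per-1K prices."""
    price = prices[model]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


# Example: 2,000 prompt tokens and 1,000 completion tokens on one model
print(f"${estimate_cost('example-model-a', 2000, 1000):.4f}")
```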

Try it out today

The Azure AI Foundry integration is currently available in Elastic Cloud (both serverless and hosted options). Sign up for a 7-day trial on Elastic Cloud directly or through Azure Marketplace. Alternatively, you can deploy a cluster on our Elasticsearch Service, download the Elasticsearch stack, or run Elastic from Azure Marketplace. Then spin up the new technical preview of the Azure AI Foundry integration, open the curated dashboards in Kibana, and start monitoring your Azure AI Foundry service!

Optimizing Spend and Content Moderation on Azure OpenAI with Elastic

Tue, 13 May 2025 00:00:00 GMT

In a previous blog we showed you how to set up observability for your models hosted on Azure OpenAI using Elastic’s integration. We’ve expanded the integration to also include Azure OpenAI content filtering, and cost analysis for Azure OpenAI. If you previously onboarded the Azure OpenAI integration, just upgrade it and you will automatically get all new features we discuss in this blog. The enhanced integration now provides multiple dashboards including a general Azure OpenAI Overview, Azure Provisioned Throughput Unit dashboard, Azure Content filtering, and a dashboard for Azure OpenAI billing.

In this blog we will cover how to use Azure OpenAI content filtering and how to track Azure OpenAI usage costs. Let’s first review what these two capabilities enable you to do:

Azure OpenAI Content Filtering: Enhancing AI Safety

Content filtering for Azure OpenAI plays a critical role in addressing AI safety challenges by helping to mitigate the risks associated with harmful or inappropriate content generated by AI models. By implementing robust content filtering mechanisms, organizations can proactively identify and filter out potentially harmful content, such as hate speech, misinformation, or violent imagery, before it is disseminated to users. This helps prevent the spread of harmful content and reduces the potential negative impact on individuals and communities.

Monitoring Azure OpenAI content filtering is essential for staying proactive in addressing emerging content moderation challenges. By closely monitoring the system, businesses can quickly detect any new types of harmful content or patterns of misuse that may arise. This enables organizations to stay ahead of potential content moderation issues and take timely action to protect their users and uphold their brand reputation.

Tracking Azure OpenAI Usage Costs

Monitoring Azure OpenAI model usage costs is crucial for managing budget and resource allocation effectively. By keeping track of usage costs, organizations can optimize their operations to avoid unnecessary expenses and ensure that they are getting the best value from their investment in AI technologies. Additionally, it helps in forecasting future expenses and aids in scaling resources according to the demand without compromising performance or incurring excessive costs. Effective monitoring also allows for transparency and accountability, enabling better decision-making in terms of AI deployment and utilization within Azure environments.

As we walk through this blog, we will provide you with prerequisites to set up and use the pre-configured dashboards for both of these capabilities, which are part of the Azure OpenAI integration.

Prerequisites

To follow along in this blog, you will need to:

Set up and install the Azure billing integration to monitor the usage costs. Once the integration is installed, you can track the usage in the enhanced Azure OpenAI Billing dashboard.
Additionally, make sure you have enabled the Azure API Management service to access the Azure OpenAI models.

How to Use Azure API Management with Azure OpenAI:

Provision an Azure OpenAI resource: Create an Azure OpenAI resource and select a model for your application.
Create an API Management instance: Establish an Azure API Management instance to manage the Azure OpenAI APIs.
Import the Azure OpenAI API: Import the Azure OpenAI API into your API Management instance using its OpenAPI specification.
Configure Policies: Implement policies in API Management to manage request authentication, rate limiting, traffic shaping, and more.

Steps to create a content filter for Azure OpenAI

Before you set up observability for content filtering, ensure that you have configured Azure content filtering for your model. Follow the steps below to create an Azure OpenAI content filter:

Access the Azure OpenAI service console:
- Sign in to the Azure Console with the appropriate permissions and navigate to the Azure OpenAI service console.
Navigate to Safety + security:
- From the left-hand menu, select Safety + security.
Create a New Content filter:
- Select Create content filter.
- Configure various content filter policies including the following
  - Set input filter: Content will be annotated by category and blocked according to the threshold you set for prompts.
  - Set output filter: Content will be annotated by category and blocked according to the threshold you set for response output.
  - Blocklists: Define specific words or phrases to block.
  - Deployments: Apply filters to model deployments.
Review and Create:
- Review your settings and select Create to finalize the content filter configurations.

Customers can also configure content filters and create custom safety policies that are tailored to their use case requirements. The configurability feature allows customers to adjust the settings, separately for prompts and completions, to filter content for each content category at different severity levels.

Content filter types

Content filtering categories:
- Hate, sexual, violence, and self-harm.
- Other optional classification models aimed at detecting jailbreak risk and known content for text and code.
Severity levels within each content filter category:
- Low, medium, and high.
- Content detected at the 'safe' severity level is labeled in annotations but isn't subject to filtering and isn't configurable.

Understanding the pre-configured dashboard for Azure OpenAI Content Filtering

Now that you have set up the filter, you can see what is being filtered in Elastic through the Azure OpenAI content filtering dashboard.

Navigate to the Dashboard Menu – Select the Dashboard menu option in Elastic and search for [Azure OpenAI] Content Filtering Overview to open the dashboard.
Navigate to the Integrations Menu – Open the Integrations menu in Elastic, select Azure OpenAI, go to the Assets tab, and choose [Azure OpenAI] Content Filtering Overview from the dashboard assets.

The Azure OpenAI Content Filtering Overview dashboard in the Elastic integration provides insights into blocked requests, API latency, and error rates. This dashboard also provides a detailed breakdown of the content being filtered by the content filtering policy.

Content Filter overview

When the content filtering system detects harmful content, you receive either an error on the API call if the prompt was deemed inappropriate, or the finish_reason on the response will be content_filter to signify that some of the completion was filtered.

This can be summarized as:

Prompt filters: Prompt content that is classified at a filtered category and severity level returns an HTTP 400 error.
Non-streaming completion: When the content is filtered, non-streaming completion calls won't return any content. In rare cases with longer responses, a partial result can be returned. In these cases, the finish_reason is updated.
Streaming completion: For streaming completion calls, segments are returned to the user as they're completed. The service continues streaming until it reaches a stop token or the length limit, or until it detects content that is classified at a filtered category and severity level.
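
The behaviors summarized above can be turned into a small classifier for test scripts or log post-processing. This is a sketch under stated assumptions: `classify_filter_outcome` is a hypothetical helper, and the `error_code` value `"content_filter"` is assumed to be what the error body carries when a prompt is blocked. Only the HTTP 400 status and the `content_filter` finish_reason come from the description above.

```python
def classify_filter_outcome(status_code, error_code=None, finish_reason=None):
    """Map an API call result to a content-filtering outcome.

    status_code:   HTTP status of the call.
    error_code:    error code from the error body, if any (assumed field).
    finish_reason: finish_reason of the returned choice, if any.
    """
    if status_code == 400 and error_code == "content_filter":
        # The prompt itself was classified at a filtered category and severity
        return "prompt_blocked"
    if status_code == 200 and finish_reason == "content_filter":
        # Some or all of the completion was filtered
        return "completion_filtered"
    if status_code == 200:
        return "not_filtered"
    return "other_error"


print(classify_filter_outcome(400, error_code="content_filter"))
print(classify_filter_outcome(200, finish_reason="content_filter"))
print(classify_filter_outcome(200, finish_reason="stop"))
```

Checking the error code as well as the status keeps ordinary bad requests from being counted as blocked prompts.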

Prompt and response where content has been blocked

This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding completion response. The panel below gives a view on the responses after applying content filtering policy for prompts and completions.

You can use the following code snippet to start integrating your current prompt and settings into your application to test the content filter:

chat_prompt = [
   {
       "role": "user",
       "content": "How to kill a mocking bird?"
   }
]

After running the code, you can see that the content is filtered under the violence category with a severity level of medium.

Content filtered by content source (Input & Output)

The content filtering system helps monitor and moderate different categories of content based on severity levels. The categories typically include things like adult content, offensive language, hate speech, violence, and more. The severity levels indicate the degree of sensitivity or potential harm associated with the content. This panel helps the user to effectively monitor and filter out inappropriate or harmful content to maintain a safe environment.

These metrics can be categorized into the following groups:

Blocked requests by category: Provides insights into the total blocked requests by category.
Severity distribution by categories: Monitors blocked requests by category and severity. Severity is one of low, medium, or high.
Content filtered categories: Provides insights into the content filtered categories over time.
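
The groupings in these panels amount to counting blocked requests by category and by category/severity pair. The sketch below reproduces that aggregation on a handful of made-up records; the field names (`category`, `severity`, `source`) are illustrative and do not reflect the integration's actual field mapping.

```python
from collections import Counter

# Hypothetical blocked-request records; field names are illustrative only
blocked_requests = [
    {"category": "violence", "severity": "medium", "source": "prompt"},
    {"category": "violence", "severity": "high", "source": "completion"},
    {"category": "hate", "severity": "low", "source": "prompt"},
    {"category": "violence", "severity": "medium", "source": "prompt"},
]


def blocked_by_category(records):
    """Count blocked requests per content filter category."""
    return Counter(record["category"] for record in records)


def severity_by_category(records):
    """Count blocked requests per (category, severity) pair."""
    return Counter((record["category"], record["severity"]) for record in records)


print(blocked_by_category(blocked_requests))
print(severity_by_category(blocked_requests))
```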

Reviewing the Azure OpenAI Billing dashboard

You can now look at what you are spending on Azure OpenAI.

Here is what you see on this dashboard:

Total costs: This measures the total usage cost across all the model deployments.
Overall Usage by model: This tracks the total usage costs broken down by model.
Daily usage: Monitors usage costs on a daily basis.
Daily usage costs by model: Monitors daily usage costs broken down by model deployments.

Conclusion

The Azure OpenAI integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Azure OpenAI along with content filtered responses. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.

Deploy a cluster on our Elasticsearch Service or download the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!

End to end LLM observability with Elastic: seeing into the opaque world of generative AI applications

Wed, 02 Apr 2025 00:00:00 GMT

In the ever-evolving landscape of artificial intelligence, Large Language Models (LLMs) stand as beacons of innovation, offering unprecedented capabilities across industries. From generating human-like text and translating languages to providing personalized customer interactions, the possibilities with LLMs are vast and increasingly indispensable. Enterprises are deploying these models for everything, from automating customer support systems to enhancing creative writing processes. Imagine a virtual assistant not only answering questions but also drafting business proposals or a customer service bot that understands and responds with empathy—all powered by LLMs. However, with great power comes the need for great oversight.

Despite the transformative potential, LLMs introduce complex challenges that necessitate a new level of observability as LLMs are notoriously opaque. Enter LLM observability: a crucial component in the lifecycle management of LLMs. This aspect becomes vital for Site Reliability Engineers (SREs) and other key stakeholders tasked with ensuring seamless, error-free operations, controlling costs, and minimizing the risks associated with the unpredictable nature of LLM-generated responses. SREs need insights into performance metrics, error frequencies, latency issues, the cost implications of running these sophisticated models, and the prompt and response exchange with the model. Traditional monitoring tools fall short in this high-stakes environment; what’s needed is a nuanced approach to address the unique observability demands that LLMs introduce.

Elastic's LLM Observability Capabilities Address These Challenges

With Elastic’s end-to-end LLM observability you can cover a wide range of use cases. To achieve this, you can onboard two types of integrations: API-based logs and metrics, and APM instrumentation. Depending on your use case, you can choose to use one or both of them.

High level overview: via API-based logs and metrics. Monitoring LLM services from providers by ingesting a curated set of service metrics and logs like latency, invocation frequency, tokens, errors, and prompts and responses. Each LLM integration comes with out-of-the-box dashboards.
Troubleshooting applications: via APM instrumentation. Fully OTel-native tracing and auto-instrumentation for LLM-based applications through Elastic Distributions of OpenTelemetry (EDOT). Additionally, you can use third party libraries (Langtrace, OpenLit, OpenLLMetry) together with Elastic to extend the coverage to additional LLM-related technologies.

High level overview: LLM Observability for Leading Providers

Elastic offers tailored API-based integrations for four major LLM hosting providers:

Azure OpenAI
OpenAI
Amazon Bedrock
Google Vertex AI

These integrations bring a curated set of logs and metrics collection tailored to each provider. What this means for SREs is straightforward access to pre-configured dashboards that highlight the prompts and responses, usage patterns, performance metrics, and cost details across different models and providers.

For instance, SREs who want to identify which LLM generates the most errors, or to compare models in terms of latency, cost, or usage frequency, can leverage these integrations. Imagine having the capability to instantly visualize which LLM is slowing down processes or incurring high costs, thus enabling data-driven decisions to optimize operations.

Troubleshooting applications: Tracing and Auto-Instrumentation of OpenAI, Amazon Bedrock and Google Vertex AI models

Elastic supports OTLP tracing capabilities in EDOT for applications using OpenAI models and models hosted on Amazon Bedrock and Google Vertex AI. In addition, Elastic also supports LLM tracing from third party libraries (Langtrace, OpenLIT, OpenLLMetry).

Tracing offers a comprehensive map of an application's request flow, pinpointing granular details about each call within the system. For each transaction and span of a request, tracing shows critical information such as specific models utilized, request duration, errors encountered, tokens used per request, and the prompts and responses exchanged with the LLM.

Tracing helps SREs troubleshoot performance issues with applications developed in languages like Python, Node.js and Java. If an SRE needs to investigate latency or error issues, LLM tracing provides a zoomed-in view into the request lifecycle and allows for profound insights into whether a delay is application-specific, model-specific or systemic across deployments.
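
To illustrate how span data helps separate model-specific from application-wide latency, the following sketch groups span durations by model. It works on plain dictionaries rather than a real tracing SDK. The attribute name `gen_ai.request.model` follows the OpenTelemetry GenAI semantic conventions and is assumed to be what your instrumentation emits, while `duration_ms`, the model names, and the sample spans are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_duration_by_model(spans):
    """Average span duration in milliseconds, grouped by requested model."""
    durations = defaultdict(list)
    for span in spans:
        model = span["attributes"].get("gen_ai.request.model", "unknown")
        durations[model].append(span["duration_ms"])
    return {model: mean(values) for model, values in durations.items()}


# Hypothetical exported spans for LLM calls made by one application
spans = [
    {"attributes": {"gen_ai.request.model": "model-a"}, "duration_ms": 420},
    {"attributes": {"gen_ai.request.model": "model-a"}, "duration_ms": 380},
    {"attributes": {"gen_ai.request.model": "model-b"}, "duration_ms": 1900},
    {"attributes": {}, "duration_ms": 50},
]

for model, avg in mean_duration_by_model(spans).items():
    print(f"{model}: {avg:.0f} ms")
```

If one model stands out while the others stay flat, the delay is likely model-specific; if every model is slow, the application or the deployment is a more likely cause.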

Use Cases: Bringing Elastic's Observability Features to Life

Let’s explore some practical scenarios where Elastic’s observability tools shine:

1. Understanding LLM Performance and Reliability

An SRE team looking to optimize a customer support system powered by Azure OpenAI can utilize Elastic’s Azure OpenAI integration to quickly ascertain which model variants incur higher latency or error rates. This enhances decision-making regarding model deployment or even switching providers based on performance metrics.

Similarly, SREs can use the integrations for Google Vertex AI, Amazon Bedrock, and OpenAI in parallel for other applications using models hosted on these providers.

2. Troubleshooting OpenAI-Powered Applications

Consider an enterprise utilizing an OpenAI model for real-time user interactions. Encountering unexplained delays, an SRE can use OpenAI tracing to dissect the transaction pathway, identifying if one specific API call or model invocation is the bottleneck. The SRE can also check the out-of-the-box OpenAI integration dashboard to verify if the latency is only affecting this application or all model invocations across the organization.

An engineer troubleshooting the LLM-based application can also review the prompt and response exchanged with the LLM during this request to rule out any performance impact from the input.

3. Addressing Cost and Usage Concerns

SREs need to be acutely aware of which LLM configurations are less cost-effective than required. Elastic’s integration dashboards, pre-configured to display model usage patterns, help mitigate unnecessary spending effectively. You can find out-of-the-box dashboards for Azure OpenAI, OpenAI, Amazon Bedrock, and Google Vertex AI models. These dashboards show key cost and usage information such as total invocations and tokens, as well as time series breakdowns by model and endpoint. In addition, some integrations show more advanced usage information such as provisioned throughput units (PTU) as well as billing cost.

4. Understanding LLM Compliance

With the Elastic Amazon Bedrock integration for Guardrails, and Azure OpenAI integration for content filtering, SREs can swiftly address security concerns, like verifying if certain user interactions prompt policy violations. Elastic's observability logs clarify whether guardrails rightly blocked potentially harmful responses, bolstering compliance assurance.

Conclusion

As LLMs continue to revolutionize the capabilities of modern applications, the role of observability becomes increasingly important. Elastic’s comprehensive observability framework empowers enterprises to harness the full potential of LLMs while maintaining robust operational insight and control. The integrations with prominent LLM hosting providers and advanced tracing for OpenAI, Amazon Bedrock and Google Vertex AI models equip SREs with the tools they need to navigate the complex landscape of LLM-driven applications, ensuring they remain safe, reliable, efficient, and cost-effective.

In this new era of AI, balancing innovation with observability isn't just beneficial—it's essential. Whether optimizing performance, troubleshooting intricacies, or managing costs and compliance, Elastic stands at the forefront, ensuring your LLM journey is as seamless as it is groundbreaking.

LLM observability: track usage and manage costs with Elastic's OpenAI integration

Tue, 11 Mar 2025 00:00:00 GMT

In an era where AI-driven applications are becoming ubiquitous, understanding and managing the usage of language models is crucial. OpenAI has been at the forefront of developing advanced language models that power a multitude of applications, from chatbots to code generation. However, as applications grow in complexity and scale, observing crucial metrics that ensure optimal performance and cost-effectiveness becomes essential. Specific needs arise in areas such as performance and reliability monitoring, and cost management, which are pivotal for maximizing the potential of language models.

As organizations adopt OpenAI's diverse AI models, including language models like GPT-4o and GPT-3.5 Turbo, image models like DALL·E, and audio models like Whisper, comprehensive usage monitoring is crucial to track and optimize performance, reliability, usage and cost of each model.

Elastic's new OpenAI integration offers a solution to the challenges faced by developers and businesses using these models. It is designed to provide a unified view of your OpenAI usage across all model types.

Key benefits of the OpenAI integration

OpenAI's usage-based pricing model applies across all these services, making it essential to track consumption and identify which models are being used to control costs and optimize deployments. The new OpenAI integration by Elastic utilizes the OpenAI Usage API to track consumption and identify specific models being used. It offers an out-of-the-box experience with pre-built dashboards, simplifying the process of monitoring your usage patterns.

Continue reading to learn about what you will get with the integration. We'll also show you the setup process, how to leverage the pre-built dashboards, and what insights you can gain from Elastic for LLM Observability.

Setting up the OpenAI Integration

Prerequisites

To follow along with this blog, you will need:

An Elastic Cloud account (version 8.16.3 or higher). Alternatively, you can use Elastic Cloud Serverless, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.
An OpenAI account with an Admin API key.
Applications that use the OpenAI APIs.

Generating sample OpenAI usage data

If you're new to OpenAI and eager to try this integration, you can quickly set it up and populate your dashboards with sample data. You'll just need to generate some usage by interacting with the OpenAI API. If you don't have an OpenAI API key, you can create one here. For more information on authentication, refer to the OpenAI documentation.

The OpenAI documentation provides detailed examples for each of their API endpoints. Here are direct links to the relevant sections for generating sample usage data:

Language models (completions): Use the Chat Completions API to generate text. See the examples here.
Audio models (text-to-speech): Generate audio from text using the Speech API. See the examples here.
Audio models (speech-to-text): Transcribe audio to text using the Transcriptions API. See the examples here.
Embeddings: Generate vector representations of text using the Embeddings API. See the examples here.
Image models: Create images from text prompts using the Image Generation API. See the examples here.
Moderation: Check content with the Moderation API. See the examples here.

There are more endpoints that you can explore to generate sample usage data.

After running these examples (using your API key), remember that the OpenAI Usage API has a delay. It may take some time (usually a few minutes) for the usage data to appear in your dashboard.
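As a minimal sketch of what generating usage looks like, the Python snippet below builds a single Chat Completions request using only the standard library. The helper name `build_chat_request` is hypothetical, and the model name is just an example taken from the models mentioned earlier; substitute whichever model your account can access. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Public Chat Completions endpoint of the OpenAI API.
CHAT_COMPLETIONS_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(api_key, prompt, model="gpt-4o"):
    """Build (but do not send) a Chat Completions request.

    `build_chat_request` is a hypothetical helper for illustration only.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        CHAT_COMPLETIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, which consumes tokens and is billed to your account. Read the API key from an environment variable rather than hard-coding it.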

Configuration

To connect the OpenAI integration to your OpenAI account, you'll need your OpenAI Admin API key. The integration will use this key to periodically retrieve usage data from the OpenAI Usage API.

The integration supports eight distinct data streams, corresponding to different categories of OpenAI API usage:

Audio speeches (text-to-speech)
Audio transcriptions (speech-to-text)
Code interpreter sessions
Completions (language models)
Embeddings
Images
Moderations
Vector stores

By default, all data streams are enabled. However, you can disable any data streams that are not relevant to your usage. All enabled data streams are visualized in a single, comprehensive dashboard, providing a unified view of your usage.

For advanced users, the integration offers additional configuration options, including setting the bucket width and initial interval. These options are documented in detail in the official integration documentation.
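To make these two options more concrete, here is a rough Python sketch of how a look-back period and a bucket width could translate into a Usage API query. It is an illustration under assumptions, not the integration's actual implementation: the endpoint path and the `start_time`, `end_time` and `bucket_width` parameters reflect our reading of the OpenAI Usage API, and `build_usage_url` is a hypothetical helper.

```python
import time
from typing import Optional
from urllib.parse import urlencode

# Assumed Usage API endpoint for the completions category.
USAGE_URL = "https://api.openai.com/v1/organization/usage/completions"

# Bucket widths we assume the Usage API accepts.
ALLOWED_BUCKET_WIDTHS = {"1m", "1h", "1d"}


def build_usage_url(initial_interval_hours: int,
                    bucket_width: str = "1d",
                    now: Optional[float] = None) -> str:
    """Return a query URL covering the last `initial_interval_hours` hours."""
    if bucket_width not in ALLOWED_BUCKET_WIDTHS:
        raise ValueError(f"unsupported bucket width: {bucket_width}")
    end = int(time.time() if now is None else now)
    start = end - initial_interval_hours * 3600
    query = urlencode(
        {"start_time": start, "end_time": end, "bucket_width": bucket_width}
    )
    return f"{USAGE_URL}?{query}"
```

A narrower bucket width gives finer-grained time series at the cost of more buckets per request, while a longer initial interval backfills more history on the first run.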

Maximize visibility with the out-of-the-box dashboard

You can access the OpenAI dashboard in two ways:

Navigate to the Dashboards menu in the left side panel and search for "OpenAI". In the search results select [Metrics OpenAI] OpenAI Usage Overview to open the dashboard.
Alternatively, open the Integrations menu under the Management section in Elastic, select OpenAI, go to the Assets tab, and choose [Metrics OpenAI] OpenAI Usage Overview from the dashboard assets.

Understanding the pre-configured dashboard for OpenAI

The pre-built dashboard provides a structured view of OpenAI's API consumption, displaying key metrics such as token usage, API call distribution, and model-wise invocation counts. It highlights top-performing projects, users, and API keys, along with breakdowns of image generation, audio transcription, and text-to-speech usage. By analyzing these insights, users can track usage patterns and optimize AI-driven applications.

OpenAI usage metrics overview

This dashboard section shows key usage metrics from OpenAI, including invocation rates, token usage, and the top-performing models. It also highlights the total number of invocations and tokens and the invocation count by object type. Understanding these insights can help users optimize model usage, reduce costs, and enhance efficiency when integrating AI models into their applications.

Top performing Project, User, and API Key IDs

Here, you can analyze the top Project IDs, User IDs, and API Key IDs based on invocation counts. This data provides valuable insights to help organizations track usage patterns across different projects and applications.

Token metrics

In this dashboard section you can see token usage trends across various models. This can help you analyze trends across input types (e.g., audio, embeddings, moderations), output types (e.g., audio), and input cached tokens. This information can help developers fine-tune their prompts and optimize token consumption.

Image generation metrics

AI-generated images are becoming increasingly popular across industries. This section provides an overview of image generation metrics, including invocation rates by model and the most common output dimensions. These insights help assess invocation costs and analyze image generation usage.

Audio transcription metrics

OpenAI's AI-powered transcription services make speech-to-text conversion easier than ever. This section tracks audio transcription metrics, including invocation rates and total transcribed seconds per model. Understanding these trends can help businesses optimize costs when building audio transcription-based applications.

Audio speech metrics

OpenAI's text-to-speech (TTS) models deliver realistic voice synthesis for applications such as accessibility tools and virtual assistants. This section explores TTS invocation rates and the number of characters synthesized per model, offering insights into the adoption of AI-driven voice synthesis.

Creating Alerts and SLOs to monitor OpenAI

As with every other Elastic integration, all the logs and metrics information is fully available to leverage in every capability in Elastic Observability, including SLOs, alerting, custom dashboards, in-depth logs exploration, etc.

To proactively manage your OpenAI token usage and avoid unexpected costs, create a custom threshold rule in Observability Alerts.

Example: Target the relevant data stream, and configure the rule to sum the related tokens field (along with other token-related fields, if applicable). Set a threshold representing your desired usage limit, and the alert will notify you if this limit is exceeded within a specified timeframe, such as daily or hourly.
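The logic of such a rule can be sketched in a few lines of Python. This only illustrates the evaluation (sum the token fields over a time window, then compare with a limit); it is not how Elastic implements alerting. The field names `input_tokens` and `output_tokens` and the sample documents are hypothetical stand-ins for the fields in the data stream you target.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical token field names; use the fields of your data stream.
TOKEN_FIELDS = ("input_tokens", "output_tokens")


def tokens_in_window(docs, now, window):
    """Sum the token fields of documents with a timestamp in (now - window, now]."""
    cutoff = now - window
    total = 0
    for doc in docs:
        if cutoff < doc["timestamp"] <= now:
            total += sum(doc.get(field, 0) for field in TOKEN_FIELDS)
    return total


def threshold_exceeded(docs, now, window, limit):
    """Mimic a threshold rule: fire when the windowed sum is above the limit."""
    return tokens_in_window(docs, now, window) > limit


# Made-up sample documents, for illustration only.
now = datetime(2025, 3, 11, 12, 0, tzinfo=timezone.utc)
sample_docs = [
    {"timestamp": now - timedelta(minutes=10), "input_tokens": 300, "output_tokens": 150},
    {"timestamp": now - timedelta(minutes=50), "input_tokens": 100},
    {"timestamp": now - timedelta(hours=3), "input_tokens": 9000, "output_tokens": 9000},
]
hourly_alert = threshold_exceeded(sample_docs, now, timedelta(hours=1), limit=500)
```

With these sample documents, only the first two fall inside the one-hour window, so the rule compares 550 tokens against the limit of 500.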

When an alert condition is met, the Alert Details view linked in the alert notification provides detailed insights into the violation, such as when it started, its current status, and any history of similar violations, enabling proactive issue resolution and improving system resilience.

Example: To create an SLO that monitors model distribution in OpenAI, start by defining a custom metric SLI definition, adding good events where openai.base.model contains gpt-3.5* and total events encompassing all OpenAI requests, grouped by openai.base.project_id and openai.base.user_id. Then, set an appropriate SLO target such as 80% and monitor this over a 7-day rolling window to identify projects and users that may be overusing more expensive models.

You can now track the distribution of requests across different OpenAI models by project and user. This example demonstrates how Elastic's OpenAI integration helps you optimize costs. By monitoring the percentage of requests handled by cost-efficient GPT-3.5 models — the SLI — against the 80% target (part of the SLO), you can quickly identify which specific projects or users are driving up costs through excessive usage of models like GPT-4-turbo, GPT-4o, etc. This visibility enables targeted optimization strategies, ensuring your AI initiatives remain cost-effective while still leveraging advanced capabilities.
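The arithmetic behind this SLI can be sketched as follows. The snippet is a simplified illustration, not Elastic's SLO engine: the dictionary keys `model`, `project_id` and `user_id` stand in for `openai.base.model`, `openai.base.project_id` and `openai.base.user_id`, each dictionary represents one request, and the 7-day rolling window is left out for brevity.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def model_mix_sli(requests, good_pattern="gpt-3.5*"):
    """Return good/total request ratios keyed by (project_id, user_id)."""
    good = defaultdict(int)
    total = defaultdict(int)
    for request in requests:
        key = (request["project_id"], request["user_id"])
        total[key] += 1
        # A "good" event is a request served by a model matching the pattern.
        if fnmatchcase(request["model"], good_pattern):
            good[key] += 1
    return {key: good[key] / total[key] for key in total}


def below_target(sli_by_group, target=0.8):
    """List the groups whose SLI falls short of the SLO target."""
    return sorted(key for key, sli in sli_by_group.items() if sli < target)
```

A group that sends four of five requests to a GPT-3.5 model sits exactly at the 80% target and is not flagged, while a group splitting requests evenly between GPT-3.5 and a more expensive model is.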

Conclusion, next steps and further reading

You now know how Elastic's OpenAI integration provides an essential tool for anyone relying on OpenAI's models to power their applications. By offering a comprehensive and customizable dashboard, this integration empowers SREs and developers to effectively monitor performance, manage costs, and optimize their AI systems. Now, it's your turn to set up this integration following the instructions in this blog and start monitoring your OpenAI usage! We'd love to hear how you get on and always welcome ideas for enhancements.

To learn how to set up Application Performance Monitoring (APM) tracing of OpenAI-powered applications, read this blog. For further reading and more LLM observability use cases, explore Elastic's observability lab blogs here.

Transforming Industries and the Critical Role of LLM Observability: How to use Elastic's LLM integrations in real-world scenarios

Thu, 08 May 2025 00:00:00 GMT

In today's tech-centric world, Large Language Models (LLMs) are transforming sectors from finance and healthcare to research. LLMs are starting to underpin products and services across the spectrum. Take for example recent advanced coding developments in Google's Gemini 2.5 which enable it to use its reasoning capabilities to create a video game by producing the executable code from a short prompt. Or new ways to interact with Amazon's Alexa - for example, you could send a picture of a live music schedule, and have Alexa add the details to your calendar. And let's not forget Microsoft's personalization of Copilot which remembers what you talk about, so it learns your likes and dislikes and details about your life; the name of your dog, that tricky project at work, what keeps you motivated to stick to your new workout routine.

Despite the widespread utility of LLMs, deploying these sophisticated tools in real-world scenarios poses distinct challenges, especially in managing their complex behaviors. For users such as Site Reliability Engineers (SREs), DevOps teams, and AI/ML engineers, ensuring reliability, performance, and compliance of these models introduces an additional layer of complexity. This is where the concept of LLM Observability becomes essential. It offers crucial insights into the performance of these models, ensuring that these advanced AI systems operate both effectively and ethically.

Why LLM Observability Matters and How Elastic Makes It Easy

LLMs are not just another piece of software; they are sophisticated systems capable of human-like capabilities such as text generation, comprehension, and even coding. But with great power comes greater need for oversight. The opaque nature of these models can obscure how decisions are made and content generated. This makes it even more critical to implement robust observability to monitor and troubleshoot issues such as hallucinations, inappropriate content, cost overruns, errors and performance degradation. By monitoring these models closely, we can safeguard against unexpected outcomes and maintain user trust.

Real-World Scenarios

Let's explore real-world scenarios where companies leverage LLM-powered applications to enhance productivity and user experience, and how Elastic's LLM observability solutions monitor critical aspects of these models.

1. Generative AI for Customer Support

Companies are increasingly leveraging LLMs and generative AI to enhance customer support, using platforms like Google Vertex AI for hosting these models efficiently. With the introduction of advanced AI models such as Google's Gemini, which is integrated into Vertex AI, businesses can deploy sophisticated chatbots that manage customer inquiries, from basic questions to complex issues, in real time. These AI systems understand and respond with natural language, offering instant support for issues such as product troubleshooting or managing orders, thus reducing wait times. They also learn from each interaction to improve accuracy continuously. This boosts customer satisfaction and allows human agents to focus on complex tasks, enhancing overall efficiency. AI tools can further empower customer care agents with real-time analytics, sentiment detection, and conversation summarization.

To support use cases like the AI-powered customer support described above, Elastic recently launched LLM observability integrations including support for LLMs hosted on GCP Vertex AI. Customers who wish to monitor foundation models such as Gemini and Imagen hosted on Google Vertex AI can benefit from Elastic’s Vertex AI integration to get a deeper understanding of model behavior and performance, and ensure that the AI-driven tools are not only effective but also reliable. Customers get an out-of-the-box experience ingesting a curated set of metrics from Vertex AI as well as a pre-configured dashboard.

By continuously tracking these metrics, customers can proactively manage their AI resources, optimize operations, and ultimately enhance the overall customer experience.

Let's look at some of the metrics you get from the Google Vertex AI integration which are helpful in the context of using generative AI for customer support.

Prediction Latency: Measures the time taken to complete predictions, critical for real-time customer interactions.
Error Rate: Tracks errors in predictions, which is vital for maintaining the accuracy and reliability of AI-driven customer support.
Prediction Count: Counts the number of predictions made, helping assess the scale of AI usage in customer interactions.
Model Usage: Tracks how frequently the AI models are accessed by both virtual assistants and customer support tools.
Total Invocations: Measures the total number of times the AI services are used, providing insights into user engagement and dependency on these tools.
CPU and Memory Utilization: By observing CPU and memory usage, users can optimize resource allocation, ensuring that the AI tools are running efficiently without overloading the system.
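To show what the first three measures express, the sketch below derives a prediction count, an error rate and a 95th-percentile latency from raw prediction records. In practice the integration ingests these metrics already aggregated by Vertex AI; the record fields `latency_ms` and `error` are hypothetical and exist only for this illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(predictions):
    """Aggregate raw prediction records into count, error rate and p95 latency."""
    latencies = [p["latency_ms"] for p in predictions]
    errors = sum(1 for p in predictions if p["error"])
    return {
        "prediction_count": len(predictions),
        "error_rate": errors / len(predictions),
        "p95_latency_ms": percentile(latencies, 95),
    }
```

A percentile is usually more telling than an average for real-time customer interactions, because a small share of slow predictions can degrade the experience without moving the mean much.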

To learn more about how Elastic's Google Vertex AI integration can augment your LLM observability, have a quick read of this blog.

2. Transforming Healthcare with Generative AI

The healthcare industry is embracing generative AI to enhance patient interactions and streamline operational workflows. By leveraging platforms like Amazon Bedrock, healthcare organizations deploy advanced large language models (LLMs) to power tools that convert doctor-patient conversations into structured medical notes, reducing administrative overhead and allowing clinicians to prioritize diagnosis and treatment. These AI-driven solutions provide real-time insights, enabling informed decision-making and improving patient outcomes. Additionally, patient-facing applications powered by LLMs offer secure access to health records, empowering individuals to manage their care proactively.

Robust observability is essential to maintain the reliability and performance of these generative AI applications in healthcare. Elastic’s Amazon Bedrock integration equips providers with tools to monitor LLM behavior, capturing critical metrics like invocation latency, error rates, token usage and guardrail invocation. Pre-configured dashboards provide visibility into prompt and completion text, enabling teams to verify the accuracy of AI-generated outputs, such as medical notes, and detect issues like hallucinations.

Additionally, customers who configure Guardrails for Amazon Bedrock to filter harmful content like hate speech, personal insults, and other inappropriate topics, can use the Bedrock Integration to observe the prompts and responses that caused the guardrail to filter them out. This helps application developers take proactive actions to maintain a safe and positive user experience.

Some of the logs and metrics that can be helpful for customers using LLMs hosted on Amazon Bedrock are the following:

Invocation Details: The integration records invocation latency, count, and throttles. These metrics are critical for ensuring that generative AI models respond quickly and accurately to patient queries or appointment scheduling tasks, maintaining a seamless user experience.
Error Rates: Tracking error rates ensures that AI tools, such as patient query assistants or appointment systems, consistently deliver accurate and reliable results. By identifying and addressing issues early, healthcare providers can maintain trust in AI systems and prevent disruptions in critical patient interactions.
Token Usage: In healthcare, tracking token usage helps identify resource-intensive queries, such as detailed patient record summaries or complex symptom analyses, ensuring efficient model operation. By monitoring token usage, healthcare providers can optimize costs for AI-powered tools while maintaining scalability to handle growing patient interactions.
Prompt and Completion Text: Capturing prompt and completion text allows healthcare providers to analyze how AI models respond to specific patient queries or administrative tasks, ensuring meaningful and contextually accurate interactions. This insight helps refine prompts to improve the AI's understanding and ensures that generated responses, such as appointment details or treatment explanations, meet the quality standards expected in healthcare.
Prompt and response where guardrails intervened: Being able to track requests and responses that were deemed inappropriate by guardrails helps healthcare providers monitor what information patients are asking for. With this information users can make continuous adjustments to the LLMs to ensure appropriate responses, balancing flexibility and rich communication on the one hand, and on the other, privacy protection, hallucination prevention, and harmful content filtering.

Amazon Bedrock Guardrails OOTB dashboard

To learn about the Amazon Bedrock Integration, read this blog. To dive deeper into how the integration can help with observability of Guardrails for Amazon Bedrock, take a look at this blog.

3. Enhancing Telco Efficiency with GenAI

The telecommunication industry can leverage services like Azure OpenAI to transform customer interactions, optimize operations, and enhance service delivery. By integrating advanced generative AI models, telcos can offer highly personalized and responsive customer experiences across multiple channels. AI-powered virtual assistants streamline customer support by automating routine queries and providing accurate, context-aware responses, reducing the workload on human agents and enabling them to focus on complex issues while improving efficiency and satisfaction. Additionally, AI-driven insights help telcos understand customer preferences, anticipate needs, and deliver tailored offerings that boost customer loyalty. Operationally, LLM services such as Azure OpenAI enhance internal processes by enabling smarter knowledge management and faster access to critical information.

Elastic's LLM observability integrations like the Azure OpenAI integration can provide visibility into AI performance and costs, empowering telecom providers to make data-driven decisions and enhance customer engagement. It can help optimize resource allocation by analyzing call patterns, predicting service demands, and identifying trends, enabling telcos to scale their AI operations efficiently while maintaining high service quality.

Some of the key metrics and logs from Azure OpenAI that can provide insights are:

Error Counts: It provides critical insights into failed requests and incomplete transactions, enabling telecom providers to proactively identify and resolve issues in AI-powered applications.
Prompt Input and Completion Text: This captures the input queries provided to AI systems and the corresponding AI-generated outputs. These fields allow telecom providers to analyze customer queries, monitor response quality, and refine AI training datasets to improve relevance and accuracy.
Response Latency: It measures the time taken by AI models to generate responses, ensuring that virtual assistants and automated systems deliver quick and efficient replies to customer queries.
Token Usage: It tracks the number of input and output tokens processed by the AI model, offering insights into resource consumption and cost efficiency. This data helps telecom providers monitor AI usage patterns, optimize configurations, and scale resources effectively.
Content Filter Results: In Azure OpenAI, this plays a crucial role in handling sensitive inputs provided by customers, ensuring compliance, safety, and responsible AI usage. This feature identifies and flags potentially inappropriate or harmful queries and responses in real time, enabling telecom providers to address sensitive topics with care and accuracy.

The Azure OpenAI content filtering OOTB dashboard

You can learn more about Elastic's Azure OpenAI integration from these two blogs - Part 1 and Part 2.

4. OpenAI Integration for Generative AI Applications

As AI-powered solutions become integral to modern workflows, OpenAI's sophisticated models, including language models like GPT-4o and GPT-3.5 Turbo, image generation models like DALL·E, and audio processing models like Whisper, drive innovation across applications such as virtual assistants, content creation, and speech-to-text systems. With growing complexity and scale, ensuring these models perform reliably, remain cost-efficient, and adhere to ethical guidelines is paramount. Elastic's OpenAI integration provides a robust solution, offering deep visibility into model behaviour to support seamless and responsible AI deployments.

By tapping into the OpenAI Usage API, Elastic's integration delivers actionable insights through intuitive, pre-configured dashboards, enabling Site Reliability Engineers (SREs) and DevOps teams to monitor performance and optimize resource usage across OpenAI's diverse model portfolio. This unified observability approach empowers organizations to track critical metrics, identify inefficiencies, and maintain high-quality AI-driven experiences. The following key metrics from Elastic's OpenAI integration help organizations achieve effective oversight:

Request Latency: Measures the time taken for OpenAI models to process requests, ensuring responsive performance for real-time applications like chatbots or transcription services.
Invocation Rates: Tracks the frequency of API calls across models, providing insights into usage patterns and helping identify high-demand workloads.
Token Usage: Monitors input and output tokens (e.g., prompt, completion, cached tokens) to optimize costs and fine-tune prompts for efficient resource consumption.
Error Counts: Captures failed requests or incomplete transactions, enabling proactive issue resolution to maintain application reliability.
Image Generation Metrics: Tracks invocation rates and output dimensions for models like DALL·E, helping assess costs and usage trends in image-based applications.
Audio Transcription Metrics: Monitors invocation rates and transcribed seconds for audio models like Whisper, supporting cost optimization in speech-to-text workflows.

To learn more about Elastic's OpenAI integration, read this blog.

Actionable LLM Observability

Elastic's LLM observability integrations empower users to take proactive control of their AI operations through actionable insights and real-time alerts. For instance, by setting a predefined threshold for token count, Elastic can trigger automated alerts when usage exceeds this limit, notifying Site Reliability Engineers (SREs) or DevOps teams via email, Slack, or other preferred channels. This ensures prompt awareness of potential cost overruns or resource-intensive queries, enabling teams to adjust model configurations or scale resources swiftly to maintain operational efficiency.

In the example below, the rule is set to alert the user if token_count crosses a threshold of 500.

The alert is triggered when the token count exceeds the threshold as seen below

Another example is tracking invocation spikes, such as when the number of predictions or API calls surpasses a defined Service Level Objective (SLO). For example, if a model hosted on Amazon Bedrock experiences a sudden surge in invocations due to increased customer interactions, Elastic can alert teams to investigate potential anomalies or scale infrastructure accordingly. These proactive measures help maintain the reliability and cost-effectiveness of LLM-powered applications.
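One simple way to reason about an invocation spike is to compare the latest interval with a baseline built from earlier intervals. The Python sketch below uses that approach purely as an illustration; the multiplier, the minimum history length and the function name `is_spike` are assumptions, and Elastic's own rules and anomaly detection are configured differently.

```python
from statistics import mean


def is_spike(counts, factor=3.0, min_history=3):
    """Return True when the latest count exceeds `factor` times the mean of earlier counts.

    `counts` holds invocation counts per interval, oldest first.
    """
    if len(counts) <= min_history:
        # Too little history to form a meaningful baseline.
        return False
    *history, latest = counts
    baseline = mean(history)
    return latest > factor * baseline
```

For example, with hourly counts of 100, 110, 90 and 100 followed by 450, the baseline is 100 and the last hour is flagged, whereas a final value of 250 would not be.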

By providing pre-configured dashboards and customizable alerts, Elastic ensures that organizations can respond to critical events in real time, keeping their AI systems aligned with cost and performance goals as well as standards for content safety and reliability.

Conclusion

LLMs are transforming industries, but their complexity requires effective oversight and observability to ensure their reliability and safe use. Elastic's LLM observability integrations provide a comprehensive solution, empowering businesses to monitor performance, manage resources, and address challenges like hallucinations and content safety. As LLMs become increasingly integral to various sectors, robust observability tools like those offered by Elastic ensure that these AI-driven innovations remain dependable, cost-effective, and aligned with ethical and safety standards.