Elastic Observability Labs - Articles by Muthukumar Paramasivam

Elastic SQL inputs: A generic solution for database metrics observability

Mon, 11 Sep 2023 00:00:00 GMT

Elastic® SQL inputs (the Metricbeat module and the input package) allow users to execute SQL queries against many supported databases in a flexible way and ingest the resulting metrics into Elasticsearch®. This blog dives into the functionality of generic SQL and provides various use cases for advanced users who want to ingest custom metrics into Elastic® for database observability. The blog also introduces the new fetch_from_all_databases capability, released in 8.10.

Why “Generic SQL”?

Elastic already has metricbeat and integration packages targeted for specific databases. One example is metricbeat for MySQL — and the corresponding integration package. These beats modules and integrations are customized for a specific database, and the metrics are extracted using pre-defined queries from the specific database. The queries used in these integrations and the corresponding metrics are not available for modification.

The Generic SQL inputs (Metricbeat module or input package), by contrast, can be used to scrape metrics from any supported database using the user's own SQL queries. The queries are provided by the user depending on the specific metrics to be extracted. This enables a much more powerful mechanism for metrics ingestion: users choose a specific driver and provide the relevant SQL queries, and the results get mapped to one or more Elasticsearch documents using a structured mapping process (the table/variables formats explained later).

Generic SQL inputs can be used in conjunction with the existing integration packages, which already extract specific database metrics, to extract additional custom metrics dynamically, making this input very powerful. In this blog, Generic SQL input and Generic SQL are used interchangeably.

Functionality details

This section covers some of the features that would help with the metrics extraction. We provide a brief description of the response format configuration. Then we dive into the merge_results functionality, which is used to combine results from multiple SQL queries into a single document.

The next key functionality users may be interested in is to collect metrics from all the custom databases, which is now possible with the fetch_from_all_databases feature.

Now let's dive into the specific functionalities:

Different drivers supported

Generic SQL can fetch metrics from different databases. The current version supports the following drivers: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server (MSSQL).

Response format

The response format in generic SQL determines whether the query results are shaped in table or in variables format. Here’s an overview of both formats and their syntax.

Syntax: response_format: table {{or}} variables

Response format table
This mode generates a single event for each row. The table format places no restriction on the number of columns in the response.

Example:

driver: "mssql"
sql_queries:
 - query: "SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
   response_format: table

This query returns a response similar to this:

"sql":{
      "metrics":{
         "counter_name":"User Connections ",
         "cntr_value":7
      },
      "driver":"mssql"
}

The response generated above adds the counter_name as a key in the document.

Response format variables
The variables format supports key:value pairs. This format expects the query to return only two columns.

Example:

driver: "mssql"
sql_queries:
 - query: "SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
   response_format: variables

The variables format takes the value of the first column in the query above as the key:

"sql":{
      "metrics":{
         "user connections ":7
      },
      "driver":"mssql"
}

In the above response, you can see the value of counter_name is used to generate the key in variable format.
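The difference between the two formats can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not the Metricbeat implementation: the function names are hypothetical, and lowercasing the key in the variables format is an assumption taken from the sample output.

```python
from typing import Any, Dict, List, Tuple


def table_events(columns: List[str], rows: List[Tuple]) -> List[Dict[str, Any]]:
    """Table format: one event per row, with column names as keys."""
    return [dict(zip(columns, row)) for row in rows]


def variables_event(rows: List[Tuple]) -> Dict[str, Any]:
    """Variables format: each two-column row becomes one key:value pair."""
    event: Dict[str, Any] = {}
    for row in rows:
        if len(row) != 2:
            raise ValueError("variables format expects exactly two columns")
        key, value = row
        # Assumption: keys are lowercased, mirroring the sample output above.
        event[str(key).lower()] = value
    return event


columns = ["counter_name", "cntr_value"]
rows = [("User Connections ", 7)]

print(table_events(columns, rows))
print(variables_event(rows))
```

In table format the column name stays the key, whereas in variables format the value of the first column becomes the key.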

Response optimization: merge_results

We now support merging multiple query responses into a single event. By enabling merge_results, users can significantly reduce the storage space taken by the metrics ingested into Elasticsearch. Instead of generating multiple documents, a single merged document is generated wherever applicable: metrics of a similar kind, generated from multiple queries, are combined into a single event.

Syntax: merge_results: true {{or}} false

The example below shows how the data is loaded into Elasticsearch when merge_results is disabled.

Example:

In this example, we are using two different queries to fetch metrics from the performance counter.

merge_results: false
driver: "mssql"
sql_queries:
  - query: "SELECT cntr_value As 'user_connections' FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
    response_format: table
  - query: "SELECT cntr_value As 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name like '%Buffer Manager%'"
    response_format: table

As you can see, the response for the above example generates a single document for each query.

The resulting document from the first query:

"sql":{
      "metrics":{
         "user_connections":7
      },
      "driver":"mssql"
}

And resulting document from the second query:

"sql":{
      "metrics":{
         "buffer_cache_hit_ratio":87
      },
      "driver":"mssql"
}

When we enable the merge_results flag in the configuration, both of the above metrics are combined and loaded in a single document.

You can see the merged document in the example below:

"sql":{
      "metrics":{
         "user_connections":7,
         "buffer_cache_hit_ratio":87
      },
      "driver":"mssql"
}

However, table queries can be merged only if each of them produces a single row. There is no such restriction on merging variables queries.
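The merge rule can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the module's actual code: the function name is hypothetical, and letting a later query overwrite an earlier key with the same name is a choice made only for this sketch.

```python
from typing import Any, Dict, List, Tuple

QueryResult = Tuple[str, List[Dict[str, Any]]]  # (response_format, events)


def merge_query_results(results: List[QueryResult]) -> Dict[str, Any]:
    """Combine the events of several queries into one metrics document."""
    merged: Dict[str, Any] = {}
    for response_format, events in results:
        # A table query can only be merged when it returns a single row.
        if response_format == "table" and len(events) > 1:
            raise ValueError("table query returned more than one row")
        for event in events:
            # Sketch-only choice: later keys overwrite earlier ones.
            merged.update(event)
    return merged


merged = merge_query_results([
    ("table", [{"user_connections": 7}]),
    ("table", [{"buffer_cache_hit_ratio": 87}]),
])
print(merged)
```

A table query that returns several rows has no single place to go in the merged document, which is why the sketch rejects it.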

Introducing a new capability: fetch_from_all_databases

This is a new functionality to fetch all the database metrics automatically from the system and user databases of the Microsoft SQL Server, by enabling the fetch_from_all_databases flag.

Keep an eye out for the 8.10 release version where you can start using the fetch all database feature. Prior to the 8.10 version, users had to provide the database names manually to fetch metrics from custom/user databases.

Syntax: fetch_from_all_databases: true {{or}} false

Below is a sample query with the fetch_from_all_databases flag disabled:

fetch_from_all_databases: false
driver: "mssql"
sql_queries:
  - query: "SELECT @@servername AS server_name, @@servicename AS instance_name, name As 'database_name', database_id FROM sys.databases WHERE name='master';"

The above query fetches metrics only for the provided database name. Here the input database is master, so the metrics are fetched only for the master.

Below is a sample query with the fetch_from_all_databases flag enabled:

fetch_from_all_databases: true
driver: "mssql"
sql_queries:
  - query: "SELECT @@servername AS server_name, @@servicename AS instance_name, DB_NAME() AS 'database_name', DB_ID() AS database_id;"
    response_format: table

The above query fetches metrics from all available databases. This is useful when the user wants to get data from all the databases.

Please note: currently this feature is supported only for Microsoft SQL Server and will be used by MS SQL integration internally, to support extracting metrics for all user DBs by default.
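Because the sample query uses DB_NAME() and DB_ID(), it has to run once in the context of each database. One way to picture that is the Python sketch below. It is an assumption about the mechanism, made for illustration only, and the function name is hypothetical; it is not a description of Metricbeat's internals.

```python
from typing import List, Tuple


def per_database_statements(databases: List[str], query: str) -> List[Tuple[str, str]]:
    """Return (database, statement) pairs that run `query` inside each database."""
    statements = []
    for name in databases:
        # Escape a closing bracket so the bracket-quoted identifier stays valid.
        quoted = name.replace("]", "]]")
        statements.append((name, f"USE [{quoted}]; {query}"))
    return statements


query = "SELECT DB_NAME() AS 'database_name', DB_ID() AS database_id;"
for database, statement in per_database_statements(["master", "tempdb"], query):
    print(database, "->", statement)
```

Each statement returns one row for its own database, which matches the table format producing one event per row.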

Using generic SQL: Metricbeat

The generic SQL metricbeat module provides flexibility to execute queries against different database drivers. The metricbeat input is available as GA for any production usage. Here, you can find more information on configuring the generic SQL for different drivers with various examples.

Using generic SQL: Input package

The input package provides a flexible solution to advanced users for customizing their ingestion experience in Elastic. Generic SQL is now also available as a SQL input package. The input package is currently available for early users as a beta release. Let's walk through how users can use generic SQL via the input package.

Configurations of generic SQL input package:

The configuration options for the generic SQL input package are as below:

Driver: This is the SQL database for which you want to use the package. In this case, we will take mysql as an example.
Hosts: Here the user enters the connection string to connect to the database. It varies depending on which database/driver is being used. Refer here for examples.
SQL Queries: Here the user writes the SQL queries they want to run and specifies the response_format for each.
Data set: The user specifies a data set name to which the response fields get mapped.
Merge results: This is an advanced setting, used to merge queries into a single event.

Metrics extensibility with customized SQL queries

Let's say a user is using the MySQL integration, which provides a fixed set of metrics. Their requirement now extends to retrieving more metrics from the MySQL database by running new customized SQL queries.

This can be achieved by adding an instance of SQL input package, writing the customized queries and specifying a new data set name as shown in the screenshot below.

This way users can get any metrics by executing corresponding queries. The resultant metrics of the query will be indexed to the new data set, sql_second_dataset.

When there are multiple queries, users can club them into a single event by enabling the Merge Results toggle.

Customizing user experience

Users can customize their data by writing their own ingest pipelines and providing their customized mappings. Users can also build their own bespoke dashboards.

As we can see above, the SQL input package provides the flexibility to get new metrics by running new queries, which is not supported in the default MySQL integration (there, the user gets metrics from a predetermined set of queries).

The SQL input package also supports multiple drivers: mssql, postgresql and oracle. So a single input package can be used to cater to all these databases.

Note: The fetch_from_all_databases feature is not supported in the SQL input package yet.

Try it out!

Now that you know about various use cases and features of generic SQL, get started with Elastic Cloud and try using the SQL input package for your SQL database to get a customized experience and metrics. If you are looking for newer metrics for some of our existing SQL-based integrations (like Microsoft SQL Server, Oracle, and more), go ahead and give the SQL input package a whirl.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

LLM Observability for Google Cloud’s Vertex AI platform - understand performance, cost and reliability

Wed, 09 Apr 2025 00:00:00 GMT

As organizations increasingly adopt large language models (LLMs) for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Google Cloud’s Vertex AI.

New Elastic Observability LLM integration with Google Cloud’s Vertex AI platform

We are thrilled to announce general availability of monitoring LLMs hosted in Google Cloud through the Elastic integration with Vertex AI. This integration enables users to experience enhanced LLM Observability by providing deep insights into the usage, cost and operational performance of models on Vertex AI, including latency, errors, token usage, frequency of model invocations as well as resources utilized by models. By leveraging this data, organizations can optimize resource usage, identify and resolve performance bottlenecks, and enhance the model efficiency and accuracy.

Observability needs for AI-powered applications using the Vertex AI platform

Leveraging AI models creates unique needs around the observability and monitoring of AI-powered applications. Some of the challenges that come with using LLMs are related to the high cost to call the LLMs, the quality and safety of LLM responses, and the performance, reliability and availability of the LLMs.

Lack of visibility into LLM observability data can make it harder for SREs and DevOps teams to ensure their AI-powered applications meet their service level objectives for reliability, performance, cost and quality of the AI-generated content and have enough telemetry data to troubleshoot related issues. Thus, robust LLM observability and detection of anomalies in the performance of models hosted on Google Cloud’s Vertex AI platform in real time is critical for the success of AI-powered applications.

Depending on the needs of their LLM applications, customers can make use of a growing list of models hosted on the Vertex AI platform such as Gemini 2.0 Pro, Gemini 2.0 Flash, and Imagen for image generation. Each model excels in specific areas and generates content in some modalities including Language, Audio, Vision, Code, etc. No two models are the same; each model has specific performance characteristics. So, it is important that service operators are able to track the individual performance, behaviour and cost of each model.

Unlocking Insights with Vertex AI Metrics

The Elastic integration with Google Cloud’s Vertex AI platform collects a wide range of metrics from models hosted on Vertex AI, enabling users to monitor, analyze, and optimize their AI deployments effectively.

Once you install the integration, you can review all the metrics in the Vertex AI dashboard.

These metrics can be categorized into the following groups:

1. Prediction Metrics

Prediction metrics provide critical insights into model usage, performance bottlenecks, and reliability. These metrics help ensure smooth operations, optimize response times, and maintain robust, accurate predictions.

Prediction Count by Endpoint: Measures the total number of predictions across different endpoints.
Prediction Latency: Provides insights into the time taken to generate predictions, allowing users to identify bottlenecks in performance.
Prediction Errors: Monitors the count of failed predictions across endpoints.

2. Model Performance Metrics

Model performance metrics provide crucial insights into deployment efficiency and responsiveness. These metrics help optimize model performance and ensure reliable operations.

Model Usage: Tracks the usage distribution among different model deployments.
Token Usage: Tracks the number of tokens consumed by each model deployment, which is critical for understanding model efficiency.

Invocation Rates: Tracks the frequency of invocations made by each model deployment.
Model Invocation Latency: Measures the time taken to invoke a model, helping in diagnosing performance issues.

3. Resource Utilization Metrics

Resource utilization metrics are vital for monitoring resource efficiency and workload performance. They help optimize infrastructure, prevent bottlenecks, and ensure smooth operation of AI deployments.

CPU Utilization: Monitors CPU usage to ensure optimal resource allocation for AI workloads.
Memory Usage: Tracks the memory consumed across all model deployments.
Network Usage: Measures bytes sent and received, providing insights into data transfer during model interactions.

4. Overview Metrics

These metrics give an overview of the models deployed in Google Cloud’s Vertex AI platform. They are essential for tracking overall performance, optimizing efficiency, and identifying potential issues across deployments.

Total Invocations: The overall count of prediction invocations across all models and endpoints, providing a comprehensive view of activity.
Total Tokens: The total number of tokens processed across all model interactions, offering insights into resource utilization and efficiency.
Total Errors: The total count of errors encountered across all models and endpoints, helping identify reliability issues.

All metrics can be filtered by region, offering localized insights for better analysis.
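As a rough illustration of how the overview totals relate to the per-endpoint metrics, the Python sketch below sums invocations, tokens, and errors, with an optional region filter. The record layout and field names are hypothetical and are not the integration's actual field names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EndpointSample:
    # Hypothetical record shape, used only for this sketch.
    region: str
    endpoint: str
    invocations: int
    tokens: int
    errors: int


def overview(samples: List[EndpointSample], region: Optional[str] = None) -> Dict[str, int]:
    """Sum the overview metrics, optionally restricted to one region."""
    selected = [s for s in samples if region is None or s.region == region]
    return {
        "total_invocations": sum(s.invocations for s in selected),
        "total_tokens": sum(s.tokens for s in selected),
        "total_errors": sum(s.errors for s in selected),
    }


samples = [
    EndpointSample("us-central1", "endpoint-a", 120, 45000, 3),
    EndpointSample("us-central1", "endpoint-b", 80, 30000, 1),
    EndpointSample("europe-west4", "endpoint-c", 50, 20000, 0),
]
print(overview(samples))
print(overview(samples, region="us-central1"))
```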

Note: The Elastic integration with Vertex AI provides comprehensive visibility into both deployment models: provisioned throughput, where capacity is pre-allocated, and pay-as-you-go, where resources are consumed on demand.

Conclusion

This integration with Vertex AI represents a significant step forward in enhancing LLM Observability for users of Google Cloud’s Vertex AI platform. By unlocking a wealth of actionable data, organizations can assess the health, performance, and cost of LLMs and troubleshoot operational issues, ensuring scalability and accuracy in AI-driven applications.

Now that you know how the Vertex AI integration enhances LLM Observability, it’s your turn to try it out. Spin up an Elastic Cloud deployment and start monitoring your LLM applications hosted on Google Cloud’s Vertex AI platform.

LLM Observability with Elastic’s Azure AI Foundry Integration

Fri, 25 Jul 2025 00:00:00 GMT

Introduction

As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM Observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Azure AI Foundry, while minimizing downtime and keeping costs in check.

Elastic is expanding support for LLM Observability with Elastic Observability's new Azure AI Foundry integration, now available as a tech preview on Elastic Cloud. This integration provides you with comprehensive visibility into the performance and usage of foundational models, such as GPT-4, Mistral, Llama, and thousands of others from leading AI companies and from Azure, available through Azure AI Foundry. It offers an out-of-the-box experience by simplifying the collection of metrics and logs, making it easier to gain actionable insights and effectively manage your models. The integration is simple to set up and comes with pre-built dashboards. With real-time insights, SREs can now monitor, optimize, and troubleshoot LLM applications that are using Azure AI Foundry.

This blog will walk through the features available to SREs, such as monitoring invocations, errors, and latency information across various models, along with the usage and performance of LLM requests. Additionally, the blog will show how easy it is to set up and what insights you can gain from Elastic for LLM Observability.

Prerequisites

To get started with the Azure AI Foundry integration, you will need:

An account on Elastic Cloud and a deployed stack in Azure (see instructions here). Ensure you are using version 9.0.0 or higher.
An Azure account with permissions to pull the necessary data from Azure and Azure AI Foundry. See details in our documentation.

Configuring Azure AI Foundry Integration

To collect logs and metrics from Azure AI Foundry, ensure you properly configure Azure logs and metrics using the following links:

Configure to receive Azure Metrics - This integration specifically collects Azure AI Foundry metrics, which come from the service. Ensure you have the client ID, subscription ID, and tenant ID from Azure AI Foundry to collect metrics.
Configure to receive Azure Logs - Ensure that you configure an Azure event hub so that Elastic can ingest logs. You will need the Azure event hub information to configure the logs section of the Azure AI Foundry integration.

Maximize Visibility with Out-of-the-box dashboards

Azure AI Foundry integration offers rich out-of-the-box visibility into the performance and usage information of models in Azure AI Foundry, including text and image models. There are several dashboards currently available. More will be coming as the integration goes to GA.

Azure AI Foundry Overview dashboard - provides a summarized view of the invocations, errors, and latency information across various models.
Azure AI Foundry Billing dashboard - provides total costs and daily usage costs from Azure cognitive services.
Azure AI Foundry Advanced Monitoring dashboard - focuses on logs generated by the Azure AI Foundry service when connected through the API Management Service. It provides request rate, error rate, model usage, latency, LLM prompt input, and response completion.

Each dashboard provides specific insights important to SREs. Here is a quick overview of some of these insights:

Model Usage and Token Trends – Visualize token consumption and completion counts by model, endpoint, and time window.
Latency Metrics – Monitor average and percentile latency per prompt, per endpoint, and correlate with prompt types or user IDs.
Cost Estimation – Estimate API usage cost based on token consumption and model pricing.
Prompt/Completion Logging – View prompt-response pairs for debugging and quality monitoring.
Content Filtering and Guardrails – See which prompts or completions are being filtered, and why.

You can drill into specific users or sessions, slice by model type or region, and export reports for usage reviews or compliance.
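The cost estimation idea in the list above (token consumption multiplied by model pricing) comes down to simple arithmetic. The Python sketch below assumes prices are quoted per 1,000 tokens, separately for prompt and completion tokens. The prices used are placeholders for illustration, not real Azure rates, and the function name is hypothetical.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    price_per_1k_prompt: float,
    price_per_1k_completion: float,
) -> float:
    """Estimate usage cost from token counts and per-1,000-token prices."""
    prompt_cost = (prompt_tokens / 1000) * price_per_1k_prompt
    completion_cost = (completion_tokens / 1000) * price_per_1k_completion
    return prompt_cost + completion_cost


# Placeholder prices, chosen only to make the arithmetic easy to follow.
cost = estimate_cost(
    prompt_tokens=12_000,
    completion_tokens=3_000,
    price_per_1k_prompt=0.01,
    price_per_1k_completion=0.03,
)
print(f"estimated cost: {cost:.2f}")
```

Here the prompt side contributes 12 x 0.01 and the completion side 3 x 0.03, so the estimate is about 0.21 in whatever currency the prices are quoted in.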

Try it out today

The Azure AI Foundry Integration is currently available in Elastic Cloud (both serverless and hosted options). Sign up for a 7-day trial on Elastic Cloud directly or through Azure Marketplace. Alternatively, you can deploy a cluster on our Elasticsearch Service, download the Elastic Stack, or run Elastic from Azure Marketplace. Then spin up the new technical preview of the Azure AI Foundry integration, open the curated dashboards in Kibana, and start monitoring your Azure AI Foundry service!

Optimizing Spend and Content Moderation on Azure OpenAI with Elastic

Tue, 13 May 2025 00:00:00 GMT

In a previous blog we showed you how to set up observability for your models hosted on Azure OpenAI using Elastic’s integration. We’ve expanded the integration to also include Azure OpenAI content filtering, and cost analysis for Azure OpenAI. If you previously onboarded the Azure OpenAI integration, just upgrade it and you will automatically get all new features we discuss in this blog. The enhanced integration now provides multiple dashboards including a general Azure OpenAI Overview, Azure Provisioned Throughput Unit dashboard, Azure Content filtering, and a dashboard for Azure OpenAI billing.

In this blog we will cover how to use Azure OpenAI Content Filtering and how to track Azure OpenAI usage costs. Let’s first review what these two capabilities from Azure OpenAI enable you to do:

Azure OpenAI Content Filtering: Enhancing AI Safety

Content filtering for Azure OpenAI plays a critical role in addressing AI safety challenges by helping to mitigate the risks associated with harmful or inappropriate content generated by AI models. By implementing robust content filtering mechanisms, organizations can proactively identify and filter out potentially harmful content, such as hate speech, misinformation, or violent imagery, before it is disseminated to users. This helps prevent the spread of harmful content and reduces the potential negative impact on individuals and communities.

Monitoring Azure OpenAI content filtering is essential for staying proactive in addressing emerging content moderation challenges. By closely monitoring the system, businesses can quickly detect any new types of harmful content or patterns of misuse that may arise. This enables organizations to stay ahead of potential content moderation issues and take timely action to protect their users and uphold their brand reputation.

Tracking Azure OpenAI Usage Costs

Monitoring Azure OpenAI model usage costs is crucial for managing budget and resource allocation effectively. By keeping track of usage costs, organizations can optimize their operations to avoid unnecessary expenses and ensure that they are getting the best value from their investment in AI technologies. Additionally, it helps in forecasting future expenses and aids in scaling resources according to the demand without compromising performance or incurring excessive costs. Effective monitoring also allows for transparency and accountability, enabling better decision-making in terms of AI deployment and utilization within Azure environments.

As we walk through this blog, we will provide you with prerequisites to set up and use the pre-configured dashboards for both of these capabilities, which are part of the Azure OpenAI integration.

Prerequisites

To follow along in this blog, you will have to:

Set up and install the Azure billing integration to monitor the usage costs. Once the integration is installed, you can track the usage in the enhanced Azure OpenAI Billing dashboard.
Additionally, make sure you have enabled the Azure API Management service to access the Azure OpenAI models.

How to Use Azure API Management with Azure OpenAI:

Provision an Azure OpenAI resource: Create an Azure OpenAI resource and select a model for your application.
Create an API Management instance: Establish an Azure API Management instance to manage the Azure OpenAI APIs.
Import the Azure OpenAI API: Import the Azure OpenAI API into your API Management instance using its OpenAPI specification.
Configure Policies: Implement policies in API Management to manage request authentication, rate limiting, traffic shaping, and more.

Steps to create a content filter for Azure OpenAI

Before you set up observability for content filtering, ensure that you have configured Azure content filtering for your model. Follow the steps below to create an Azure OpenAI content filter:

Access the Azure OpenAI service console:
- Sign in to the Azure Console with the appropriate permissions and navigate to the Azure OpenAI service console.
Navigate to Safety + security:
- From the left-hand menu, select Safety + security.
Create a New Content filter:
- Select Create content filter.
- Configure various content filter policies including the following
  - Set input filter: Content will be annotated by category and blocked according to the threshold you set for prompts.
  - Set output filter: Content will be annotated by category and blocked according to the threshold you set for response output.
  - Blocklists: Define specific words or phrases to block.
  - Deployments: Apply filters to model deployments.
Review and Create:
- Review your settings and select Create to finalize the content filter configurations.

Customers can also configure content filters and create custom safety policies that are tailored to their use case requirements. The configurability feature allows customers to adjust the settings, separately for prompts and completions, to filter content for each content category at different severity levels.

Content filter types

The content filtering categories:
- Hate, sexual, violence, and self-harm.
- Other optional classification models aimed at detecting jailbreak risk and known content for text and code.
Severity levels within each content filter category:
- Low, medium, and high.
- Content detected at the 'safe' severity level is labeled in annotations but isn't subject to filtering and isn't configurable.
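The interaction between detected severity and the configured threshold can be sketched as below. The sketch assumes that a threshold blocks content at that severity or higher, and it follows the note above that 'safe' content is never filtered. The function name is hypothetical.

```python
SEVERITY_ORDER = ["safe", "low", "medium", "high"]
CONFIGURABLE_THRESHOLDS = ("low", "medium", "high")


def is_filtered(detected: str, threshold: str) -> bool:
    """Return True when content at `detected` severity is blocked by `threshold`."""
    if detected not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {detected}")
    if threshold not in CONFIGURABLE_THRESHOLDS:
        raise ValueError(f"unknown threshold: {threshold}")
    if detected == "safe":
        # 'safe' content is annotated but never filtered.
        return False
    # Assumption: a threshold blocks that severity and everything above it.
    return SEVERITY_ORDER.index(detected) >= SEVERITY_ORDER.index(threshold)


for detected in SEVERITY_ORDER:
    print(detected, is_filtered(detected, threshold="medium"))
```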

Understanding the pre-configured dashboard for Azure OpenAI Content Filtering

Now that you have set up the filter, you can see what is being filtered in Elastic through the Azure OpenAI content filtering dashboard.

Navigate to the Dashboard Menu – Select the Dashboard menu option in Elastic and search for [Azure OpenAI] Content Filtering Overview to open the dashboard.
Navigate to the Integrations Menu – Open the Integrations menu in Elastic, select Azure OpenAI, go to the Assets tab, and choose [Azure OpenAI] Content Filtering Overview from the dashboard assets.

The Azure OpenAI Content Filtering Overview dashboard in the Elastic integration provides insights into blocked requests, API latency, and error rates. This dashboard also provides a detailed breakdown of the content being filtered by the content filtering policy.

Content Filter overview

When the content filtering system detects harmful content, you receive either an error on the API call if the prompt was deemed inappropriate, or the finish_reason on the response will be content_filter to signify that some of the completion was filtered.

This can be summarized as:

Prompt filters: Prompt content that is classified at a filtered category and severity level returns an HTTP 400 error.
Non-streaming completion: When the content is filtered, non-streaming completions calls won't return any content. In rare cases with longer responses, a partial result can be returned. In these cases, the finish_reason is updated.
Streaming completion: For streaming completions calls, segments are returned back to the user as they're completed. The service continues streaming until either reaching a stop token, length, or when content that is classified at a filtered category and severity level is detected.
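On the client side, the cases above can be told apart with a small helper such as the Python sketch below. It works only on the HTTP status code and the finish_reason value described above. The function name and return labels are hypothetical, and a real client should also inspect the error body, because an HTTP 400 can have causes other than content filtering.

```python
from typing import Optional


def classify_outcome(status_code: int, finish_reason: Optional[str] = None) -> str:
    """Classify a completion call according to the filtering behaviour above."""
    if status_code == 400:
        # Simplification: treats every 400 as a filtered prompt.
        return "prompt_filtered"
    if finish_reason == "content_filter":
        # Some or all of the completion was withheld by the filter.
        return "completion_filtered"
    return "not_filtered"


print(classify_outcome(400))
print(classify_outcome(200, finish_reason="content_filter"))
print(classify_outcome(200, finish_reason="stop"))
```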

Prompt and response where content has been blocked

This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding completion response. The panel below gives a view on the responses after applying content filtering policy for prompts and completions.

You can use the following code snippet to start integrating your current prompt and settings into your application to test the content filter:

chat_prompt = [
   {
       "role": "user",
       "content": "How to kill a mocking bird?"
   }
]

After running the code, you will find the content filtered under the violence category at the medium severity level.

Content filtered by content source (Input & Output)

The content filtering system helps monitor and moderate different categories of content based on severity levels. The categories typically include things like adult content, offensive language, hate speech, violence, and more. The severity levels indicate the degree of sensitivity or potential harm associated with the content. This panel helps the user to effectively monitor and filter out inappropriate or harmful content to maintain a safe environment.

These metrics can be categorized into the following groups:

Blocked requests by category: Provides insights into the total blocked requests by category.
Severity distribution by categories: Monitors the blocked requests by category and severity. The severity is one of low, medium, or high.
Content filtered categories: Provides insights into the content filtered categories over time.

Reviewing the Azure OpenAI Billing dashboard

You can now look at what you are spending on Azure OpenAI.

Here is what you see on this dashboard:

Total costs: This measures the total usage cost across all the model deployments.
Overall Usage by model: This tracks the total usage costs broken down by model.
Daily usage: Monitors usage costs on a daily basis.
Daily usage costs by model: Monitors daily usage costs broken down by model deployments.
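
The breakdowns above amount to grouping cost records by day and model. The sketch below illustrates that grouping under stated assumptions: the records, model names, and field names are hypothetical and do not reflect the billing integration's actual schema.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical cost records; values and field names are illustrative only.
cost_records = [
    {"date": "2024-08-01", "model": "gpt-4", "cost": Decimal("1.20")},
    {"date": "2024-08-01", "model": "gpt-35-turbo", "cost": Decimal("0.30")},
    {"date": "2024-08-01", "model": "gpt-4", "cost": Decimal("0.80")},
    {"date": "2024-08-02", "model": "gpt-4", "cost": Decimal("2.00")},
]

# Daily usage costs by model: sum costs per (date, model) pair.
daily_by_model = defaultdict(Decimal)
for record in cost_records:
    daily_by_model[(record["date"], record["model"])] += record["cost"]

# Total costs across all model deployments.
total_cost = sum(daily_by_model.values(), Decimal("0"))

print(dict(daily_by_model))
print(total_cost)
```

`Decimal` is used here instead of floats so that summed currency values compare exactly.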

Conclusion

The Azure OpenAI integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Azure OpenAI along with content filtered responses. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.

Deploy a cluster on our Elasticsearch Service or download the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!

LLM Observability with Elastic: Azure OpenAI Part 2

Fri, 23 Aug 2024 00:00:00 GMT

We recently announced GA of the Azure OpenAI integration. You can find details in our previous blog LLM Observability: Azure OpenAI.

Since then, we have added further capabilities to the Azure OpenAI GA package, which now offers prompt and response monitoring, PTU deployment performance tracking, and billing insights. Read on to learn more!

Advanced Logging and Monitoring

The initial GA release of the integration focused mainly on the native logs, tracking the telemetry of the service through cognitive services logging. This version of the Azure OpenAI integration allows you to process the advanced logs, which give a more holistic view of OpenAI resource usage.

To achieve this, you have to set up API Management services in Azure. The API Management service is a centralized place where you can register all OpenAI service endpoints and manage them end-to-end. Enable the API Management services and configure the Azure event hub to stream the logs.

To learn more about setting up the API Management service to access Azure OpenAI, please refer to the Azure documentation.

By using advanced logging, you can collect the following log data:

Request input text
Response output text
Content filter results
Usage Information
- Input prompt tokens
- Output completion tokens
- Total tokens
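
As an illustration of the usage information listed above, the sketch below reads the token counts from a completion response body. The sample body is hand-written (an assumption, not a captured log), and `token_usage` is a hypothetical helper; the field names in the gateway logs collected by the integration may differ.

```python
import json

# Hand-written sample of a chat completion response body (illustrative only).
SAMPLE_RESPONSE_BODY = """
{
  "choices": [
    {"message": {"role": "assistant", "content": "Hello!"}, "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}
}
"""


def token_usage(response_body: str) -> dict:
    """Extract input, output, and total token counts from a response body."""
    usage = json.loads(response_body).get("usage", {})
    return {
        "input_prompt_tokens": usage.get("prompt_tokens", 0),
        "output_completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }


usage = token_usage(SAMPLE_RESPONSE_BODY)
print(usage)
```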

The Azure OpenAI integration now collects the API Management Gateway logs. When a user's question passes through API Management, the service logs both the question and the response from the GPT models.

Here’s what a sample log looks like:

Content filtered results

Azure OpenAI’s content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. With Azure OpenAI model deployments, you can use the default content filter or create your own content filter.

The integration now collects the content filtered result logs. In this example, let's create a custom filter in the Azure OpenAI Studio that generates an error log.

By leveraging the Azure Content Filters, you can create your own custom lists of terms or phrases to block or flag.

The document ingested in Elastic would look like this. The screenshot provides insights into the content-filtered request.

PTU Deployment Monitoring

Provisioned throughput units (PTU) are units of model processing capacity that you can reserve and deploy for processing prompts and generating completions.

The curated dashboard for PTU Deployment gives comprehensive visibility into metrics such as request latency, active token usage, PTU utilization, and fine-tuning activities, offering a quick snapshot of your deployment's health and performance.

Here are the essential PTU metrics captured by default:

Time to Response: Time taken for the first response to appear after a user sends a prompt.
Active Tokens: Use this metric to understand your TPS or TPM based utilization for PTUs and compare to the benchmarks for target TPS or TPM scenarios.
Provisioned-managed Utilization V2: Provides insights into utilization percentages, helping prevent overuse and ensuring efficient resource allocation.
Prompt Token Cache Match Rate: The prompt token cache hit ratio expressed as a percentage.

Using Billing for cost

Using the curated overview dashboard, you can now monitor the actual usage cost for your AI applications. You are one step away from processing the billing information.

You need to install and configure the Azure billing metrics integration. Once the installation is complete, the usage cost for the cognitive services is visualized in the Azure OpenAI overview dashboard.

Try it out today

Deploy a cluster on our Elasticsearch Service or download the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!