---
title: Large language model performance matrix
description: This page summarizes internal test results comparing large language models (LLMs) across Elastic AI Assistant for Observability and Search use cases...
url: https://www.elastic.co/docs/solutions/observability/ai/llm-performance-matrix
products:
  - Elastic Observability
applies_to:
  - Elastic Cloud Serverless: Generally available
  - Elastic Stack: Generally available since 9.2
---

# Large language model performance matrix
This page summarizes internal test results comparing large language models (LLMs) across Elastic AI Assistant for Observability and Search use cases. To learn more about these use cases, refer to [AI Assistant](https://www.elastic.co/docs/solutions/observability/ai/observability-ai-assistant).
<important>
  **Rating legend:**

  - **Excellent:** Highly accurate and reliable for the use case.
  - **Great:** Strong performance with minor limitations.
  - **Good:** Possibly adequate for many use cases but with noticeable tradeoffs.
  - **Poor:** Significant issues; not recommended for production for the use case.

  Recommended models are those rated **Excellent** or **Great** for the particular use case.
</important>


## Proprietary models

Models from third-party LLM providers.

| Provider       | Model                 | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **ES\|QL generation** | **Execute connector** | **Knowledge retrieval** |
|----------------|-----------------------|---------------------|-------------------|-------------------------|-----------------------------|------------------------------|----------------------|-----------------------|-------------------------|
| Amazon Bedrock | **Claude Sonnet 3.5** | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Good                  | Excellent               |
| Amazon Bedrock | **Claude Sonnet 3.7** | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Great                 | Excellent               |
| Amazon Bedrock | **Claude Sonnet 4**   | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Great                 | Excellent               |
| Amazon Bedrock | **Claude Sonnet 4.5** | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Good                  | Excellent               |
| Google Gemini  | **Gemini 2.0 Flash**  | Excellent           | Good              | Excellent               | Excellent                   | Excellent                    | Good                 | Good                  | Excellent               |
| Google Gemini  | **Gemini 2.5 Flash**  | Excellent           | Good              | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| Google Gemini  | **Gemini 2.5 Pro**    | Excellent           | Great             | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| OpenAI         | **GPT-4.1**           | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| OpenAI         | **GPT-4.1 Mini**      | Excellent           | Great             | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| OpenAI         | **GPT-5**             | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Good                  | Excellent               |
| OpenAI         | **GPT-5.2**           | Excellent           | Great             | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |


## Open-source models

<applies-to>
  - Elastic Cloud Serverless: Preview
  - Elastic Stack: Preview since 9.2
</applies-to>

Models you can [deploy and manage yourself](https://www.elastic.co/docs/explore-analyze/ai-features/llm-guides/connect-to-lmstudio-observability).

| Provider        | Model                                   | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **ES\|QL generation** | **Execute connector** | **Knowledge retrieval** |
|-----------------|-----------------------------------------|---------------------|-------------------|-------------------------|-----------------------------|------------------------------|----------------------|-----------------------|-------------------------|
| DeepSeek        | **DeepSeek-V3.1**                       | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Great                | Great                 | Excellent               |
| Google DeepMind | **Gemma-3-27b-it**                      | Excellent           | Good              | Great                   | Great                       | Excellent                    | Good                 | Great                 | Excellent               |
| OpenAI          | **gpt-oss-20b**                         | Poor                | Poor              | Great                   | Poor                        | Good                         | Poor                 | Good                  | Good                    |
| OpenAI          | **gpt-oss-120b**                        | Excellent           | Poor              | Great                   | Great                       | Excellent                    | Good                 | Good                  | Excellent               |
| Meta            | **Llama-3.3-70B-Instruct**              | Excellent           | Good              | Great                   | Excellent                   | Excellent                    | Good                 | Good                  | Excellent               |
| Meta            | **Llama-4-Maverick-17B-128E-Instruct**  | Great               | Good              | Great                   | Excellent                   | Excellent                    | Good                 | Good                  | Great                   |
| Mistral         | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent           | Poor              | Great                   | Great                       | Excellent                    | Good                 | Good                  | Excellent               |
| Alibaba Cloud   | **Qwen2.5-72b-Instruct**                | Excellent           | Good              | Great                   | Excellent                   | Excellent                    | Good                 | Good                  | Excellent               |

<note>
  `Llama-3.3-70B-Instruct` and `Qwen2.5-72b-Instruct` were tested with simulated function calling.
</note>
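
The legend's rule (a model is recommended for a use case when it is rated **Excellent** or **Great**) can be applied programmatically. The following is a minimal Python sketch, not part of any Elastic tooling: the `RATINGS` dictionary and the `recommended_models` helper are hypothetical names introduced for illustration, and the data is a hand-copied subset of the open-source table above covering only two columns.

```python
# Ratings that count as "recommended" according to the legend on this page.
RECOMMENDED = {"Excellent", "Great"}

# Hand-copied subset of the open-source table (two use cases only).
# This structure is an assumption for illustration; extend it as needed.
RATINGS = {
    "DeepSeek-V3.1": {"ES|QL generation": "Great", "Execute connector": "Great"},
    "Gemma-3-27b-it": {"ES|QL generation": "Good", "Execute connector": "Great"},
    "gpt-oss-20b": {"ES|QL generation": "Poor", "Execute connector": "Good"},
    "gpt-oss-120b": {"ES|QL generation": "Good", "Execute connector": "Good"},
}


def recommended_models(ratings, use_case):
    """Return the models rated Excellent or Great for the given use case, sorted by name."""
    return sorted(
        model
        for model, by_use_case in ratings.items()
        if by_use_case.get(use_case) in RECOMMENDED
    )


print(recommended_models(RATINGS, "ES|QL generation"))
print(recommended_models(RATINGS, "Execute connector"))
```

With this subset, only `DeepSeek-V3.1` qualifies for ES|QL generation, while both `DeepSeek-V3.1` and `Gemma-3-27b-it` qualify for the execute connector use case. Because recommendations are per use case, a model can be a reasonable choice for one task and a poor one for another.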


## Evaluate your own model

You can run the Elastic AI Assistant for Observability and Search evaluation framework against any model to benchmark a custom or self-hosted model on the use cases in this matrix. Refer to the [evaluation framework README](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/README.md) for setup and usage details.

For consistency, all ratings in this matrix were generated using `Gemini 2.5 Pro` as the judge model (specified through the `--evaluateWith` flag). Use the same judge model when evaluating your own model so that your results are comparable with the ratings on this page.