Elastic Observability Labs - Articles by Vinay Chandrasekhar

Elastic's metrics analytics gets 5x faster

Wed, 28 Jan 2026 00:00:00 GMT

In our previous blog in this series, we explored the fundamentals of analyzing metrics using the Elasticsearch Query Language (ES|QL) and the interactive power of Discover. Building on that foundation, we are excited to announce a suite of powerful enhancements to Time Series Data Streams (Elastic’s TSDB) and ES|QL designed to provide even more comprehensive and blazingly fast metrics analytics capabilities!

These latest updates, available in v9.3 and in Serverless, introduce significant performance gains, sophisticated time series functions, and native OpenTelemetry exponential histogram support that directly benefit SREs and Observability practitioners.

Query Performance and Storage Optimizations

Speed is paramount when diagnosing incidents. Compared to prior releases, we have achieved a 5x+ improvement in query latency when wildcarding or filtering by dimensions. Additionally, storage efficiency for OpenTelemetry metrics data has improved by approximately 2x, significantly reducing the infrastructure footprint required to retain high-volume observability data. If you’re hungry to learn more about what architectural updates are driving these optimizations, stay tuned… Tech blogs are on their way!

Expanded Time Series Analytics in ES|QL

The ES|QL TS source command, which targets time series indices and enables time series aggregation functions, has been significantly enhanced to support complex analytics capabilities.

We have expanded the library of time series functions to include essential tools for identifying anomalies and trends.

PERCENTILE_OVER_TIME, STDDEV_OVER_TIME, VARIANCE_OVER_TIME: Calculate the percentile, standard deviation, or variance of a field over time, which is critical for understanding distribution and variability in service latency or resource usage.

Example: Seeing the worst-case latency in 5-minute intervals.

TS metrics*  | STATS MAX(PERCENTILE_OVER_TIME(kafka.consumer.fetch_latency_avg, 99))
  BY TBUCKET(5m)

DERIV: This function calculates the derivative of a numeric field over time using linear regression, useful for analyzing the rate of change in system metrics.

Example: trending gauge values over time.

TS metrics*  | STATS AVG(DERIV(container.memory.available))
  BY TBUCKET(1 hour)
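
To make the linear-regression idea behind a derivative concrete, here is a minimal Python sketch of a least-squares slope over (timestamp, value) samples. It illustrates the general technique only; the function name `least_squares_slope` and the sample data are hypothetical, and this is not a description of how Elasticsearch computes DERIV internally.

```python
def least_squares_slope(samples):
    """Return the least-squares slope (value units per second) of (t_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        return None  # a slope needs at least two points
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope = covariance(t, v) / variance(t)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return None  # all samples share one timestamp
    return cov / var


# Hypothetical gauge: available memory sampled every 60 s, shrinking by 1000 bytes per sample
samples = [(0, 10_000), (60, 9_000), (120, 8_000), (180, 7_000)]
slope = least_squares_slope(samples)
print(slope)  # negative: the gauge is trending down
```

A negative slope indicates that the gauge is shrinking over the bucket, which is the kind of trend the query above averages across series.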

CLAMP: To handle noisy data or outliers, this function limits sample values to a specified lower and upper bound.

Example: handling saturation metrics (like CPU or Memory utilization) where spikes or measurement errors can occasionally report values over 100%, making the rest of the data look like a flat line at the bottom of the chart.

TS metrics*  | STATS AVG(CLAMP(k8s.pod.memory.node.utilization, 0, 100))
  BY k8s.pod.name
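
As a small illustration of why clamping matters before averaging, the Python sketch below applies a lower and upper bound to each sample. The helper name `clamp` and the readings are hypothetical and chosen only to show the effect of a single bad sample.

```python
def clamp(value, lower, upper):
    """Limit value to the closed range [lower, upper]."""
    return max(lower, min(value, upper))


# Hypothetical utilization readings in percent; 250.0 is a measurement error
readings = [42.0, 45.0, 250.0, 44.0]

raw_avg = sum(readings) / len(readings)
clamped_avg = sum(clamp(r, 0, 100) for r in readings) / len(readings)
print(raw_avg, clamped_avg)
```

With these numbers, one out-of-range sample pulls the raw average far above every legitimate reading, while the clamped average stays within the valid 0 to 100 range.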

TRANGE: This new filter function allows you to filter data for a specific time range using the @timestamp attribute, simplifying query syntax for time-bound investigations.

Example: Filtering and showing metrics for the last 4 hours.

TS metrics*  | WHERE TRANGE(4h) | STATS AVG(host.cpu.pct)
  BY TBUCKET(5m)

Window functions: To smooth results over specific periods, ES|QL now supports window functions. Most time series aggregation functions now accept an optional second argument that specifies a sliding time window. For example, you can calculate a rate over a 10-minute sliding window while bucketing results by minute.

Example: Calculating the average rate of requests per host for every minute, using values over a sliding window of 5 minutes.

TS metrics*  | STATS AVG(RATE(app.frontend.requests, 5m))
  BY TBUCKET(1m)

Accepted window values are currently limited to multiples of the time bucket interval in the BY clause. Windows that are smaller than the time bucket interval, or larger but not a multiple of it, will be supported in future releases.
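
The relationship between the bucket interval and the sliding window can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function `windowed_rates` is hypothetical, the rate is a plain (last - first) / elapsed calculation with no counter-reset handling, and the exact window alignment used by ES|QL may differ.

```python
def windowed_rates(samples, bucket_s, window_s):
    """For each bucket, compute a simple rate from the samples in the window ending at the bucket's end.

    samples: list of (t_seconds, counter_value), sorted by time, assuming no counter resets.
    Returns {bucket_start: rate_per_second}.
    """
    if window_s % bucket_s != 0:
        raise ValueError("window must be a multiple of the bucket interval")
    rates = {}
    last_t = samples[-1][0]
    bucket_start = (samples[0][0] // bucket_s) * bucket_s
    while bucket_start <= last_t:
        window_end = bucket_start + bucket_s
        window_start = window_end - window_s
        in_window = [(t, v) for t, v in samples if window_start <= t < window_end]
        if len(in_window) >= 2:
            (t0, v0), (t1, v1) = in_window[0], in_window[-1]
            rates[bucket_start] = (v1 - v0) / (t1 - t0)
        bucket_start += bucket_s
    return rates


def counter_at(t):
    """Hypothetical counter: 1 request/s for the first 300 s, then 3 requests/s."""
    return t if t <= 300 else 300 + 3 * (t - 300)


samples = [(t, counter_at(t)) for t in range(0, 600, 30)]
narrow = windowed_rates(samples, bucket_s=60, window_s=60)   # window equals the bucket
wide = windowed_rates(samples, bucket_s=60, window_s=300)    # 5-minute sliding window
print(narrow[300], wide[300])
```

In the bucket starting at 300 s, the narrow window jumps straight to the new rate, while the 5-minute window blends in the earlier, slower samples and reports a smoothed value.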

Native OpenTelemetry Exponential Histograms

Elastic now provides native support for OpenTelemetry exponential histograms, enabling efficient ingest, querying, and downsampling of high-fidelity distribution data.

We have introduced a new exponential_histogram field type designed to capture distributions with fixed, exponentially spaced bucket boundaries. Because these fields are primarily intended for aggregations, the histogram is stored as compact doc values and is not indexed, optimizing storage efficiency. These fields are fully supported in ES|QL aggregation functions such as PERCENTILE, AVG, MIN, MAX, and SUM.

You can index documents with exponential histograms automatically through our OTLP endpoint or manually. For example, let’s create an index with an exponential histogram field and a keyword field:

PUT my-index-000001
{
  "settings": {
    "index": {
      "mode": "time_series",
      "routing_path": ["http.path"],
      "time_series": {
        "start_time": "2026-01-21T00:00:00Z",
        "end_time": "2026-01-25T00:00:00Z"
     }
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "http.path": {
        "type": "keyword",
        "time_series_dimension": true
      },
      "responseTime": {
        "type": "exponential_histogram",
        "time_series_metric": "histogram"
      }
    }
  }
}

Index a document with a full exponential histogram payload:

POST my-index-000001/_doc
{
  "@timestamp": "2026-01-22T21:25:00.000Z",
  "http.path": "/foo",
  "responseTime": {
    "scale":3,
    "sum":73.2,
    "min":3.12,
    "max":7.02,
    "positive": {
      "indices":[13,14,15,16,17,18,19,20,21,22],
      "counts":[1,1,2,2,1,2,1,3,1,1]
    }
  }
}

POST my-index-000001/_doc
{
  "@timestamp": "2026-01-22T21:26:00.000Z",
  "http.path": "/bar",
  "responseTime": {
    "scale":3,
    "sum":45.86,
    "min":2.15,
    "max":5.1,
    "positive": {
      "indices":[8,9,10,11,12,13,14,15,16,17,18],
      "counts":[1,1,1,1,1,1,1,2,1,1,2]
    }
  }
}
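
To help interpret the scale, indices, and counts in these payloads, the Python sketch below decodes bucket boundaries using the OpenTelemetry base-2 exponential layout, where base = 2^(2^-scale) and the positive bucket at index i covers (base^i, base^(i+1)]. The helper `bucket_bounds` is a hypothetical name, and the values are copied from the first document above; this is an illustration of the data model, not of how Elasticsearch stores the field.

```python
def bucket_bounds(scale, index):
    """Return the (lower, upper] bounds of a positive bucket in the base-2 exponential layout."""
    base = 2 ** (2 ** -scale)
    return base ** index, base ** (index + 1)


# Values copied from the first example document
doc = {
    "scale": 3,
    "sum": 73.2,
    "min": 3.12,
    "max": 7.02,
    "indices": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
    "counts": [1, 1, 2, 2, 1, 2, 1, 3, 1, 1],
}

total = sum(doc["counts"])      # number of recorded observations
mean = doc["sum"] / total       # the average is derived from sum and count

first_lo, first_hi = bucket_bounds(doc["scale"], doc["indices"][0])
last_lo, last_hi = bucket_bounds(doc["scale"], doc["indices"][-1])
print(total, round(mean, 2))
print(round(first_lo, 3), round(first_hi, 3), round(last_lo, 3), round(last_hi, 3))
```

The recorded min falls inside the first populated bucket and the recorded max inside the last one, which is a quick sanity check when building payloads by hand.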

And finally, query the time series index using ES|QL and the TS source command:

TS my-index-000001  | STATS MIN(responseTime), MAX(responseTime),
        AVG(responseTime), MEDIAN(responseTime),
        PERCENTILE(responseTime, 90)
  BY http.path

Enhanced Downsampling

Downsampling is essential for long-term data retention. We have introduced a new "last value" downsampling mode. This method exchanges accuracy for storage efficiency and performance by keeping only the last sample value, providing a lightweight alternative to calculating aggregate metrics.

You can configure a time series data stream for last value downsampling the same way as regular downsampling, with one addition: set the downsampling_method to last_value. For example, by using a data stream lifecycle:

PUT _data_stream/my-data-stream/_lifecycle
{
  "data_retention": "7d",
  "downsampling_method": "last_value",
  "downsampling": [
     {
       "after": "1m",
       "fixed_interval": "10m"
      },
      {
        "after": "1d",
        "fixed_interval": "1h"
      }
   ]
}
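
Conceptually, last value downsampling keeps one sample per fixed interval instead of computing aggregates. The Python sketch below shows that idea on a handful of hypothetical samples; the function name `downsample_last_value` is illustrative and the code is not Elasticsearch's implementation.

```python
def downsample_last_value(samples, interval_s):
    """Keep only the last sample value in each fixed interval.

    samples: iterable of (t_seconds, value).
    Returns a sorted list of (interval_start, last_value).
    """
    last = {}
    for t, v in sorted(samples):
        last[t - t % interval_s] = v  # later samples overwrite earlier ones
    return sorted(last.items())


# Hypothetical gauge samples, downsampled to 10-minute (600 s) intervals
samples = [(0, 1.0), (200, 4.0), (590, 2.0), (600, 7.0), (1100, 3.0)]
result = downsample_last_value(samples, 600)
print(result)
```

Five raw samples collapse to two, one per interval, at the cost of discarding the intermediate values (for example, the peak of 7.0 in the second interval).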

In Conclusion

These enhancements mark a significant step forward in Elastic's metrics analytics capabilities, delivering 5x+ faster queries, approximately 2x better storage efficiency for OpenTelemetry metrics, and specialized functions like DERIV, CLAMP, and PERCENTILE_OVER_TIME. With native support for OpenTelemetry exponential histograms and expanded downsampling options, SREs can now perform richer, more cost-effective analysis on their observability data. This release empowers teams to detect anomalies faster and manage long-term metrics retention with greater efficiency.

We welcome you to try the new features today!

LLM Observability: Azure OpenAI

Mon, 24 Jun 2024 00:00:00 GMT

We are excited to announce the general availability of the Azure OpenAI Integration that provides comprehensive Observability into the performance and usage of the Azure OpenAI Service! Also see Part 2 of this blog.

While we have offered visibility into LLM environments for a while now, the addition of our Azure OpenAI integration enables richer out-of-the-box visibility into the performance and usage of your Azure OpenAI based applications, further enhancing LLM Observability.

The Azure OpenAI integration leverages Elastic Agent’s Azure integration capabilities to collect both logs (using Azure EventHub) and metrics (using Azure Monitor) to provide deep visibility on the usage of the Azure OpenAI Service.

The integration includes an out-of-the-box dashboard that summarizes the most relevant aspects of the service usage, including request and error rates, token usage and chat completion latency.

Creating Alerts and SLOs to monitor Azure OpenAI

As with every other Elastic integration, all the logs and metrics information is fully available to leverage in every capability in Elastic Observability, including SLOs, alerting, custom dashboards, in-depth logs exploration, etc.

To create an alert to monitor token usage, for example, start with the Custom Threshold rule on the Azure OpenAI datastream and set an aggregation condition to track and report violations of token usage past a certain threshold.

When a violation occurs, the Alert Details view linked in the alert notification for that alert provides rich context surrounding the violation, such as when the violation started, its current status, and any previous history of such violations, enabling quick triaging, investigation and root cause analysis.

Similarly, to create an SLO to monitor error rates in Azure OpenAI calls, start with the custom query SLI definition, defining good events as any response with a result signature below 400 and total events as all responses. Then, set an appropriate SLO target such as 99% and monitor your Azure OpenAI error rate SLO over a period of 7, 30, or 90 days to track degradation and take action before it becomes a pervasive problem.
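
The arithmetic behind such an SLO can be sketched in a few lines of Python. The function `slo_status` and the event counts are hypothetical; the sketch assumes the common definition of an SLI as good events divided by total events, with the error budget being the share of events allowed to be bad under the target.

```python
def slo_status(good_events, total_events, target):
    """Return (sli, fraction_of_error_budget_remaining) for a ratio-based SLO."""
    sli = good_events / total_events
    allowed_bad = (1 - target) * total_events   # error budget, in events
    actual_bad = total_events - good_events
    budget_remaining = 1 - actual_bad / allowed_bad
    return sli, budget_remaining


# Hypothetical 7-day window: 100,000 calls, 600 of them failed, 99% target
sli, remaining = slo_status(good_events=99_400, total_events=100_000, target=0.99)
print(round(sli, 4), round(remaining, 2))
```

Here the SLI is above the 99% target, but more than half of the error budget is already consumed, which is the kind of early signal an SLO is meant to surface.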

Please refer to the User Guide to learn more and to get started!

Explore and Analyze Metrics with Ease in Elastic Observability

Thu, 23 Oct 2025 00:00:00 GMT

Metrics are critical in identifying the “what”

As a core pillar of Observability, metrics offer a highly structured, quantitative view of system performance and health. They provide a crucial symptomatic perspective—revealing what is happening, such as high application latency, increasing service errors, or spiking container CPU utilization, which is essential for initiating alerting and triaging efforts. This capability for effective monitoring, alerting, and triaging is paramount to ensuring robust service delivery and achieving successful business outcomes.

Elastic Observability provides a comprehensive, end-to-end experience for metrics data. Elastic ensures that metrics data can be collected from numerous sources, enriched as needed and shipped to the Elastic Stack. Elastic efficiently stores this time series data, including high-cardinality metrics, utilizing the TSDS index mode (Time Series Data Stream), introduced in prior versions and used across Elastic time series integrations. This foundation ensures comprehensive observability through out-of-the-box dashboards, alerts, SLOs, and streamlined data management.

Elastic Observability 9.2 provides enhancements to metrics exploration and analysis through powerful query language extensions and expanded UI capabilities. These enhancements focus on making analysis on TSDS data via counter rates and common aggregations over time easier and faster than ever before.

The main metrics enhancements center on these key features, offered as Tech Preview:

Metrics analytics with TSDS and ES|QL
Interactive metrics exploration in Discover
OTLP endpoint for metrics

Metrics analytics with TSDS and ES|QL

The introduction of the new TS source command in ES|QL (Elasticsearch Query Language) on TSDS metrics dramatically simplifies time series analysis.

The TS command is specifically designed to target only time series indices, differentiating it from the general FROM command. Its core power lies in enabling a dedicated suite of time series aggregation functions within the STATS command.

This mechanism utilizes a dual aggregation paradigm, which is standard for time series querying. These queries involve two aggregation functions:

Inner (Time Series) function: Applied implicitly per time series, often over bucketed time intervals.
Outer (Regular) function: Used to aggregate the results of the inner function across groups. For instance, if you use STATS SUM(RATE(search_requests)) BY TBUCKET(1 hour), host, the RATE() function is the inner function applied per time series in hourly buckets, and SUM() is the outer function, summing these rates for each host and hourly bucket.

If an ES|QL query using the TS command is missing an inner (time series) aggregation function, LAST_OVER_TIME() is implicitly assumed and used. For example, TS metrics | STATS AVG(memory_usage) is equivalent to TS metrics | STATS AVG(LAST_OVER_TIME(memory_usage)).
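
The two-step evaluation can be sketched in Python as follows. This is a conceptual illustration only: the function `dual_aggregate`, the series identifiers, and the data points are hypothetical, and the sketch mirrors something like AVG(LAST_OVER_TIME(memory_usage)) grouped by hourly bucket and host.

```python
from collections import defaultdict


def dual_aggregate(points, bucket_s, inner, outer):
    """points: iterable of (series_id, host, t_seconds, value).

    Step 1 (inner): reduce each time series to one value per time bucket.
    Step 2 (outer): combine those per-series values for each (bucket, host) group.
    """
    per_series = defaultdict(list)
    for series_id, host, t, value in sorted(points, key=lambda p: p[2]):
        per_series[(t - t % bucket_s, host, series_id)].append(value)

    per_group = defaultdict(list)
    for (bucket, host, _series_id), values in per_series.items():
        per_group[(bucket, host)].append(inner(values))

    return {group: outer(vals) for group, vals in per_group.items()}


def last_over_time(values):
    """Inner function: latest value of one series within a bucket (values are time-ordered)."""
    return values[-1]


def avg(values):
    """Outer function: average across series in a group."""
    return sum(values) / len(values)


points = [
    ("pod-a", "host-1", 0, 10.0), ("pod-a", "host-1", 1800, 30.0),
    ("pod-b", "host-1", 900, 50.0),
    ("pod-c", "host-2", 600, 20.0),
]
result = dual_aggregate(points, 3600, last_over_time, avg)
print(result)
```

Note that the result for host-1 averages one value per series (30.0 and 50.0), not all three raw samples, which is the practical difference the inner function makes.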

Key time series aggregation functions available in ES|QL via `TS` command

These functions allow for powerful analysis on time-series data:


Function	Description	Example Use Case
`RATE()` / `IRATE()`	Calculates the per-second average rate of increase of a counter (`RATE`), accounting for non-monotonic breaks like counter resets, making it the most appropriate function for counters, or the per-second rate of increase between the last two data points (`IRATE`), ignoring all but the last two points for high responsiveness.	Calculating request per second (RPS) or throughput.
`AVG_OVER_TIME()`	Calculates the average of a numeric field over the defined time range.	Determining average resource usage over an hour.
`SUM_OVER_TIME()`	Calculates the sum of a field over the time range.	Total errors over a specific time window.
`MAX_OVER_TIME()` / `MIN_OVER_TIME()`	Calculates the maximum or minimum value of a field over time.	Identifying peak resource consumption.
`DELTA()` / `IDELTA()`	Calculates the absolute change of a gauge field over a time window (`DELTA`) or specifically between the last two data points (`IDELTA`), making `IDELTA` more responsive to recent changes.	Tracking changes in system gauge metrics (e.g., buffer size).
`INCREASE()`	Calculates the absolute increase of a counter over the time range.	Analyzing immediate rate changes in fast-moving counters.
`FIRST_OVER_TIME()` / `LAST_OVER_TIME()`	Calculates the earliest or latest recorded value of a field, determined by the `@timestamp` field.	Inspecting initial and final metric states within a bucket.
`ABSENT_OVER_TIME()` / `PRESENT_OVER_TIME()`	Calculates the absence or presence of a field in the result over the time range.	Identifying monitoring coverage gaps.
`COUNT_OVER_TIME()` / `COUNT_DISTINCT_OVER_TIME()`	Calculates the total count or the count of distinct values of a field over time.	Measuring frequency or cardinality changes.

These functions, available with the TS command, allow SREs and Ops teams to easily perform rate calculations and other common aggregations, enabling efficient metrics analysis as a routine part of observability workflows. And it’s much faster, too! Internal performance testing has revealed that TS commands outperform other ways of querying metrics data by an order of magnitude or more, and consistently!
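
To illustrate what "accounting for counter resets" means in the RATE and IRATE row of the table above, here is a minimal Python sketch. It assumes the commonly used convention that any drop in a counter is treated as a restart from zero; the function names are hypothetical and the exact algorithm ES|QL uses (for example, any extrapolation at bucket edges) may differ.

```python
def rate_with_resets(samples):
    """Per-second average rate of increase of a counter, treating any drop as a reset to zero.

    samples: list of (t_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase
        increase += (curr - prev) if curr >= prev else curr
    duration = samples[-1][0] - samples[0][0]
    return increase / duration if duration > 0 else None


def irate(samples):
    """Rate between the last two data points only."""
    return rate_with_resets(samples[-2:])


# Hypothetical counter that resets between t=60 and t=120
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
avg_rate = rate_with_resets(samples)
instant_rate = irate(samples)
print(avg_rate, instant_rate)
```

A naive (last - first) / elapsed calculation on the same data would yield a negative rate, which is why reset-aware functions are the appropriate choice for counters.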

Interactive metrics exploration in Discover

The 9.2 release introduces the capability to explore and analyze metrics directly and interactively within the Discover interface. In addition to exploring and analyzing logs and raw events, Discover now provides a dedicated environment for metrics exploration:

Easy start: Begin exploration simply by querying metrics ingested via TS metrics-*.
Grid view and pre-applied aggregations: This command displays all metrics in a grid format at a glance, immediately applying the appropriate aggregations based on the metric type, such as rate versus avg.
Search and group-by: Quickly search for specific metrics by name. Also easily group and analyze metrics by dimensions (labels) and specific values. This allows narrowing down to metrics and dimensions of choice for targeted analysis.
Quick access to details: Furthermore, the interface provides access to crucial details, including query and response details, the underlying ES|QL commands, the metric field type, and applicable dimensions, for each metric.
Easy tweaking and dashboarding: The system automatically populates ES|QL queries, aiding in making easy tweaks, slicing, and dicing the data. Once analyzed, metrics and resulting analyses can be added to new or existing dashboards with ease.

OTLP endpoint for metrics

We are also introducing a native OpenTelemetry Protocol (OTLP) endpoint specifically for metrics ingest directly into Elasticsearch. The endpoint especially benefits self-managed customers, and will be integrated into our Elastic Cloud Managed OTLP Endpoint for Elastic-managed offerings. The native endpoint and related updates improve ingest performance and scalability of OTel metrics, providing up to 60% higher throughput via _otlp, and up to 25% higher throughput when using classic _bulk methods.

In Conclusion

By merging the power of ES|QL's new time series aggregations with the familiar interactive experience of Discover, Elastic 9.2 enables a potent set of metrics analytics tools. The tools significantly boost the exploration and analysis phase of any observability workflow. And we’re just getting started on unleashing the full power of metrics in Elastic Observability!

We welcome you to try the new features today!

Also learn more about how we provide metrics analytics for AWS, Azure, GCP, Kubernetes, and LLMs on Observability Labs.
