<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Machine Learning</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 16 Apr 2026 16:08:47 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Machine Learning</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Using Anomaly Detection in Elastic Cloud to Identify Fraud]]></title>
            <link>https://www.elastic.co/observability-labs/blog/anomaly-detection-to-identify-fraud</link>
            <guid isPermaLink="false">anomaly-detection-to-identify-fraud</guid>
            <pubDate>Thu, 30 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of using Elastic Cloud’s anomaly detection to analyze example credit card transactions to detect potential fraud.]]></description>
            <content:encoded><![CDATA[<p><strong>Fraud detection is one of the most pressing challenges facing the financial services industry today.</strong> With the rise of digital payments, app-based banking, and online financial services, the volume and sophistication of fraudulent activity have grown significantly. In recent years, high-profile incidents like the <a href="https://www.justice.gov/usao-nj/pr/eighteen-people-charged-international-200-million-credit-card-fraud-scam">$200 million credit card fraud scheme</a> uncovered by the U.S. Department of Justice, which involved the creation of thousands of fake identities, have highlighted just how advanced fraud operations have become. These threats pose serious risks to financial institutions and their customers, making real-time fraud prevention an absolute necessity.</p>
<p>Elastic Cloud provides a powerful solution to meet these challenges. Its scalable, high-performance platform enables organizations to ingest and analyze all data types efficiently (from transactional data to customers’ personal information to claims data), delivering actionable insights that empower fraud prevention teams to detect anomalies and stop fraud before it occurs. From identifying unusual spending patterns to uncovering hidden threats, Elastic Cloud offers the speed and flexibility needed to safeguard assets in an increasingly digital economy.</p>
<p>In this blog, we’ll walk you through how Elastic Cloud can be used to identify fraud within credit card transactions—a key area of focus due to the high volume of data and the significant potential for fraudulent activity.</p>
<p>We’ll use a <code>Node.js</code> code example to generate an example set of credit card transactions. The generated transactions include a data anomaly similar to an anomaly that might occur as a result of fraudulent activity known as “Card Testing”, which is when a malicious actor tests to see if stolen credit card data can be used to make fraudulent transactions. We’ll then import the credit card transactions into an Elastic Cloud index and use Elastic Observability’s Anomaly Detection feature to analyze the transactions to detect potential signs of “Card Testing”.</p>
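<p>To make the anomaly concrete, here is a minimal sketch of how such a dataset could be produced: mostly normal traffic, plus a burst of small charges from a single IP using many card numbers. The field names (<code>IPAddress</code>, <code>card</code>, <code>amount</code>) and the shape of the burst are assumptions for illustration; the actual <code>generate-transactions.js</code> in the repository differs in detail.</p>
<pre><code class="language-javascript">function generateTransactions(total, burstSize) {
  const docs = [];
  // mostly normal traffic: many distinct client IPs, varied amounts
  for (let i = 0; total > i; i++) {
    docs.push({
      '@timestamp': new Date(Date.now() - i * 60000).toISOString(),
      IPAddress: '10.0.' + (i % 50) + '.1',
      card: '4242-****-' + (1000 + (i % 200)),
      amount: Math.round(Math.random() * 20000) / 100,
    });
  }
  // the "Card Testing" anomaly: one IP, many cards, tiny charges
  for (let j = 0; burstSize > j; j++) {
    docs.push({
      '@timestamp': new Date().toISOString(),
      IPAddress: '203.0.113.7',
      card: '4242-****-' + (5000 + j),
      amount: 1.0,
    });
  }
  return docs;
}
</code></pre>
<p>The burst is exactly the pattern the population analysis later in this post surfaces: one entity behaving very differently from the rest of the population.</p>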
<h2>Performing fraud detection with Elastic Cloud</h2>
<h3>Generate example credit card transactions</h3>
<p>Begin the process by using a terminal on your local computer to run a <a href="https://github.com/elastic/observability-examples/tree/main/anomaly-detection">Node.js code example</a> that will generate some example credit card transaction data.</p>
<p>Within your terminal window, run the following <strong>git clone</strong> command to clone the Github repository containing the Node.js code example:</p>
<pre><code>git clone https://github.com/elastic/observability-examples
</code></pre>
<p>Run the following <strong>cd</strong> command to change directory to the code example folder:</p>
<pre><code>cd observability-examples/anomaly-detection
</code></pre>
<p>Run the following npm install command to install the code example’s dependencies:</p>
<pre><code>npm install
</code></pre>
<p>Enter the following <strong>node</strong> command to run the code example, which will generate an NDJSON file named <code>transactions.ndjson</code> containing 1,000 example credit card transactions:</p>
<pre><code>node generate-transactions.js 
</code></pre>
<p>Now that we've got some credit card transaction data, we can import the transactions into Elastic Cloud to analyze the data.</p>
<h3>Import transactions data into an Elastic Cloud index</h3>
<p>We’ll start the import process in <a href="https://cloud.elastic.co/">Elastic Cloud</a>. Create an Elastic Serverless project in which we can import and analyze the transaction data. Click <strong>Create project</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-project.png" alt="Create Elastic serverless project" /></p>
<p>Click <strong>Next</strong> in the <strong>Elastic for Observability</strong> project type tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project.png" alt="Create Elastic Observability serverless project" /></p>
<p>Click <strong>Create project</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project-confirm.png" alt="Create Elastic Observability serverless project confirm" /></p>
<p>Click <strong>Continue</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project-continue.png" alt="Create Elastic Observability serverless project continue" /></p>
<p>Select the <strong>Application</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-application-data-import.png" alt="Select application data import" /></p>
<p>Enter the text “Upload” into the search box.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-search-for-upload-option.png" alt="Data import search for upload option" /></p>
<p>Select the <strong>Upload a file</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-upload-tile.png" alt="Data import select upload tile" /></p>
<p>Click <strong>Select or drag and drop a file.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-upload-file-selector.png" alt="Data import select upload file selector" /></p>
<p>Select the <strong>transactions.ndjson</strong> file on your local computer that was created from running the Node.js code example in a previous step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file.png" alt="Data import select local file" /></p>
<p>Click <strong>Import</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file-import.png" alt="Data import select local file import" /></p>
<p>Enter an <strong>Index</strong> <strong>name</strong> and click <strong>Import</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file-import-enter-index.png" alt="Data import select local file import enter index" /></p>
<p>You’ll see a confirmation when the import process completes and the new index is successfully created.</p>
<h3>Use Anomaly Detection to analyze credit card transactions</h3>
<p>Anomaly Detection is a powerful tool that can analyze your data to find unusual patterns that would otherwise be difficult, if not impossible, to manually uncover. Now that we've got transaction data loaded into an index, let's use anomaly detection to analyze it. Click <strong>Machine learning</strong> in the navigation menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning.png" alt="Select-machine-learning" /></p>
<p>Select <strong>Anomaly Detection Jobs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-anomaly-detection-jobs.png" alt="Select machine learning anomaly detection jobs" /></p>
<p>Click <strong>Create anomaly detection job</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-create-anomaly-detection-job.png" alt="Select machine learning create anomaly detection job" /></p>
<p>Select the Index containing the imported transactions as the data source of the anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-index-for-anomaly-detection-job.png" alt="Select machine learning index for anomaly-detection-job" /></p>
<p>As mentioned above, one form of credit card fraud is called “Card Testing” where a malicious actor tests a batch of credit cards to determine if they are still valid.</p>
<p>We can analyze the transaction data in our index to detect fraudulent “Card Testing” by using the anomaly detection <strong>Population</strong> wizard. Select the <strong>Population</strong> wizard tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-for-anomaly-detection-job.png" alt="Select population wizard for anomaly detection job" /></p>
<p>Click <strong>Use full data</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-use-full-data-anomaly-detection-job.png" alt="Select population wizard use full data anomaly detection job" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-use-full-data-anomaly-detection-job-next.png" alt="Select population wizard use full data anomaly detection job next" /></p>
<p>Click the <strong>Population field</strong> selector and select <strong>IPAddress</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population.png" alt="Configure anomaly detection job population" /></p>
<p>Click the <strong>Add metric</strong> option.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-select-count.png" alt="Configure anomaly detection job population select count" /></p>
<p>Select <strong>Count(Event rate)</strong> as the metric to be added.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-select-add-metric.png" alt="Configure anomaly detection job population select add metric" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-create-next.png" alt="Configure anomaly detection job population create next" /></p>
<p>Enter a <strong>Job ID</strong> and click <strong>Next.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-enter-job-id-next.png" alt="Configure anomaly detection job population enter job id next" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-confirm-create-next.png" alt="Configure anomaly detection job population confirm create next" /></p>
<p>Click <strong>Create job</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-create-job.png" alt="Anomaly detection job create job" /></p>
<p>Once the job completes, click <strong>View results</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-view-results.png" alt="Anomaly detection job view results" /></p>
<p>You should see that an anomaly has been detected: a specific IP address has been identified performing an exceedingly high number of transactions with multiple credit cards in a single day.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-anomaly-detected.png" alt="Anomaly detection job anomaly detected" /></p>
<p>You can click the red highlighted segments in the timeline to see more details to assist you with evaluating possible remediation actions to implement.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-anomaly-detected-details.png" alt="Anomaly detection job anomaly detected details" /></p>
<p>In just a few steps, we were able to create a machine learning job that grouped all the transactions by the IP address that sent them and identified slices of time where one IP sent an unusually large number of requests compared to other IPs. Our fraudster!</p>
<h2>Take the next step in fraud prevention</h2>
<p>Fraud detection is an ongoing battle for organizations across industries, and the stakes are higher than ever. As digital payments, insurance claims, and online banking continue to dominate, the need for robust, real-time solutions to detect and prevent fraud is critical. In this blog, we demonstrated how Elastic Cloud empowers organizations to address this challenge effectively.</p>
<p>By using Elastic Cloud’s powerful capabilities, we ingested and analyzed a dataset of credit card transactions to detect potential fraudulent activity, such as “Card Testing.” From ingesting data into an Elastic index to leveraging machine learning-powered anomaly detection, this step-by-step process highlighted how Elastic Cloud can uncover hidden patterns and provide actionable insights to fraud prevention teams.</p>
<p>This example is just the beginning of what Elastic Cloud can do. Its scalable architecture, flexible tools, and powerful analytics make it an invaluable asset for any organization looking to protect their customers and assets from fraud. Whether it's detecting unusual spending patterns, identifying compromised accounts, or monitoring large-scale operations, Elastic Cloud provides the speed, precision, and efficiency financial services organizations need to stay one step ahead of fraudsters.</p>
<p>As fraud continues to evolve, so must the tools we use to combat it. Elastic Cloud gives you the power to meet these challenges head-on, enabling your institution to provide a safer, more secure experience for your customers.</p>
<p>Ready to explore more? View a <a href="https://elastic.navattic.com/fraud-detection">guided tour</a> of all the steps in this blog post or create an <a href="https://cloud.elastic.co/projects">Elastic Serverless Observability project</a> and start analyzing your data for anomalies today.</p>
<p><strong>Related resources:</strong></p>
<ul>
<li><strong>Overview:</strong> <a href="https://www.elastic.co/accelerate-fraud-detection-and-prevention-with-elastic">Accelerate fraud detection and prevention with Elastic</a></li>
<li><strong>Blog:</strong> <a href="https://www.elastic.co/blog/elastic-ai-fraud-detection-financial-services">AI-powered fraud detection: Protecting financial services with Elastic</a></li>
<li><strong>Blog:</strong> <a href="https://www.elastic.co/blog/financial-services-fraud-generative-ai-attack-surface">Fraud in financial services: Leaning on generative AI to protect a rapidly expanding attack surface</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-to-identify-fraud.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How Streams Generates a Log Pipeline in Seconds]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-streams-ai-pipeline-generation</link>
            <guid isPermaLink="false">elastic-streams-ai-pipeline-generation</guid>
            <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Streams generates a complete, tested log processing pipeline from a single click. Here's the two-stage mechanism behind it: deterministic fingerprinting, a reasoning agent that iterates against real data, and hard validation thresholds that enforce quality before you see the result.]]></description>
            <content:encoded><![CDATA[<p>Just click the Suggest pipeline button in Kibana's Processing tab and within a few seconds you're looking at a complete pipeline (Grok pattern, date normalization, type conversions) with a preview of how your actual log documents parse through it.</p>
<p>The alternative is doing this by hand: writing a Grok pattern, testing it, fixing the edge cases, realizing the field names don't match ECS, renaming them, and adding a date processor. And all of that is the work for just a single service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/architecture-overview@2x.png" alt="Two-stage pipeline generation architecture" /></p>
<h2>The three jobs every log pipeline has</h2>
<p>Every log processing pipeline does the same three things: extract fields from raw log messages, normalize them to a consistent schema, and clean up whatever you don't need. Most teams build and maintain these by hand, which gets harder as log formats change and you discover that the person who wrote the Grok pattern has moved teams, and nothing about the pipeline is documented except the pattern itself.</p>
<p>Every new service now means doing it again from scratch, with a different format, different edge cases, and eventually a different person maintaining a pattern they didn't write.</p>
<p>For the initial pipeline, Streams handles all three jobs automatically and validates the result before anything touches your production data.</p>
<h2>What happens when you click &quot;Suggest pipeline&quot;</h2>
<p>Open the Processing tab for a stream in Kibana. Click the button. Within seconds, the panel populates with a proposed pipeline (typically a parsing step, date normalization, type conversions, and field cleanup) along with a live preview showing what your most recent documents look like after the pipeline runs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/pipeline.gif" alt="Streams pipeline generation in Kibana" /></p>
<p>In this view you can see the exact fields that will be extracted, their types, and how many of your sample documents parsed successfully. If a field name is off, you can edit it inline; if a step is adding noise, just remove it. And if the parse rate needs work, you can easily adjust and re-run generation. Nothing is written to the stream until you explicitly confirm. For now, keeping a human in the loop on these changes is an important safeguard; as systems like this mature, it may no longer be necessary.</p>
<p>Let's walk through the steps in more detail.</p>
<h2>Stage 1: Log grouping and pattern extraction</h2>
<p>The first stage of our process doesn't involve a reasoning model. It's actually deterministic: the same input always produces the same output, with no variance from a model. It also scopes down what Stage 2 has to figure out.</p>
<p>Before any extraction runs, Streams clusters the messages by log-format fingerprint. The algorithm is deliberately simple: digits map to <code>0</code>, letters map to <code>a</code>, and punctuation is preserved as-is. Two messages that produce the same fingerprint land in the same group.</p>
<pre><code># two entries from the same nginx stream
2026-03-30 14:22:31 192.168.1.100 - james &quot;GET /api/v1/health&quot; 200
2026-03-30 08:01:05 10.0.0.5      - alice &quot;GET /api/v2/status&quot; 404

# fingerprint
0-0-0 0:0:0 0.0.0.0 - a     &quot;a /a/a0/a&quot; 0
0-0-0 0:0:0 0.0.0.0 - a     &quot;a /a/a0/a&quot; 0
</code></pre>
<p>A stream with mixed log formats produces multiple groups, one per distinct format in the batch. This is a simple but remarkably effective way to cluster similar logs together, and it makes every subsequent step more reliable.</p>
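<p>The fingerprinting idea can be sketched in a few lines. Note one assumption on our part: since the two example messages above differ in spacing yet share a fingerprint, runs of whitespace must also be normalized, so this sketch collapses them to a single space.</p>
<pre><code class="language-javascript">function fingerprint(line) {
  return line
    .replace(/\d+/g, '0')        // runs of digits collapse to a single 0
    .replace(/[A-Za-z]+/g, 'a')  // runs of letters collapse to a single a
    .replace(/\s+/g, ' ');       // whitespace normalized (our assumption); punctuation stays
}
</code></pre>
<p>Two nginx lines with different IPs, users, and paths reduce to the same string, so they land in the same group.</p>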
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/stage1-parallel-extraction@2x.png" alt="Stage 1: log grouping and per-group pattern extraction" /></p>
<p>Both Grok and Dissect run on the same input, though they work differently. Grok runs per group, as it supports multiple patterns and handles each distinct format independently. Dissect uses a single pattern, so it targets only the largest group in the batch.</p>
<p>For each candidate, a heuristic algorithm analyzes the messages and identifies field boundaries: what's fixed text and what varies. It generates a pattern with positional placeholder names. An LLM then reviews the extracted field positions against a sample of up to 10 messages and renames the placeholders to human-readable, schema-compliant names.</p>
<pre><code># grok heuristic output (positional placeholders)
%{IPV4:field_0} - %{USER:field_1} \[%{HTTPDATE:field_2}\] &quot;%{WORD:field_3} %{URIPATHPARAM:field_4}...&quot;

# after LLM field naming (ECS-aligned)
%{IPV4:source.ip} - %{USER:user.name} \[%{HTTPDATE:@timestamp}\] &quot;%{WORD:http.request.method} %{URIPATHPARAM:url.path}...&quot;

# dissect heuristic output (positional placeholders)
%{field_0} - %{field_1} [%{field_2}] &quot;%{field_3} %{field_4} %{?field_5}&quot; %{field_6} %{field_7}

# after LLM field naming (ECS-aligned)
%{source.ip} - %{user.name} [%{@timestamp}] &quot;%{http.request.method} %{url.path} %{?http_version}&quot; %{http.response.status_code} %{http.response.body.bytes}
</code></pre>
<p>The resulting processor is simulated against your submitted documents to measure its parse rate. Grok is more expressive, with typed fields, named captures, and multiple sub-patterns; the trade-off is that it's slower. Dissect, on the other hand, is faster but limited to fixed-position splits. Simple log formats tend to parse cleanly with dissect; complex ones need grok.</p>
<p>The candidate with the higher parse rate becomes that group's parsing processor. This runs for every group in the batch. Stage 1 hands Stage 2 one parsing processor per group found.</p>
<p>For a batch of nginx access logs, the extraction produces two candidates for the one format group present:</p>
<pre><code># input (sampled from 300 submitted documents)
192.168.1.100 - james [30/Mar/2026:14:22:31 +0000] &quot;GET /api/v1/health HTTP/1.1&quot; 200 1234

# grok candidate → parse rate 94% (282/300)
%{IPV4:source.ip} - %{USER:user.name} \[%{HTTPDATE:@timestamp}\] &quot;%{WORD:http.request.method} %{URIPATHPARAM:url.path} HTTP/%{NUMBER:http.version}&quot; %{NUMBER:http.response.status_code:int} %{NUMBER:http.response.body.bytes:int}

# dissect candidate → parse rate 71% (213/300)
%{source.ip} - %{user.name} [%{@timestamp}] &quot;%{http.request.method} %{url.path} %{?http_version}&quot; %{http.response.status_code} %{http.response.body.bytes}

# winner: grok
</code></pre>
<p>Grok wins here because <code>%{HTTPDATE}</code> handles the bracketed timestamp format; Dissect tries to split on fixed positions and fails on the surrounding brackets. Both run in parallel; comparing their results adds negligible time, since this initial simulation only runs on a sample of documents.</p>
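<p>The selection logic amounts to "simulate every candidate, keep the best parse rate." A hypothetical sketch, with a plain regular expression standing in for a compiled grok or dissect pattern (the real implementation simulates the processor through the ingest pipeline, not in application code):</p>
<pre><code class="language-javascript">// Fraction of sampled documents a candidate pattern successfully matches.
function parseRate(pattern, docs) {
  const hits = docs.filter((doc) => pattern.test(doc)).length;
  return hits / docs.length;
}

// Simulate every candidate and keep the one with the highest parse rate.
function pickWinner(candidates, docs) {
  return candidates
    .map((c) => ({ name: c.name, pattern: c.pattern, rate: parseRate(c.pattern, docs) }))
    .sort((x, y) => y.rate - x.rate)[0];
}
</code></pre>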
<h2>Stage 2: The reasoning agent</h2>
<p>Stage 1 produces a parsing processor; Stage 2 turns it into a complete, validated pipeline.</p>
<p>This stage uses a reasoning agent that runs a loop with two tools, for up to six iterations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/stage2-agent-loop@2x.png" alt="Stage 2 reasoning agent loop with hard validation thresholds" /></p>
<p>The loop:</p>
<ol>
<li>The agent takes the Stage 1 parsing processor and proposes additional steps: date normalization, type conversions, field cleanup, and PII masking for fields it identifies as sensitive.</li>
<li>It runs the complete proposed pipeline against your original documents (the raw data, not pre-processed) and returns validation results.</li>
<li>If the simulation fails, the agent reads the error messages and adjusts. The failures are very specific, and we make good use of the LLM's ability to understand them: which processor failed, on what percentage of documents, with what error type. When the parse rate drops below 80%, the tool returns:</li>
</ol>
<pre><code>Parse rate is too low: 67.00% (minimum required: 80%). The pipeline is not
extracting fields from enough documents. Review the processors and ensure
they handle the document structure correctly.

Processor &quot;grok[0]&quot; has a failure rate of 33.00% (maximum allowed: 20%).
This processor is failing on too many documents.
</code></pre>
<p>The agent now reads the processor name, the failure rate, and the threshold, then adjusts the pattern on the next iteration. It can't commit until the errors resolve.</p>
<ol start="4">
<li>This repeats until the pipeline passes, then commits and sends for user approval in the UI.</li>
</ol>
<p>To ensure quality, we enforce two hard thresholds at the tool level rather than leaving them to the agent's judgment:</p>
<ul>
<li>If fewer than 80% of documents parse successfully, the simulation returns an error. The agent must fix this before proceeding.</li>
<li>If any individual processor fails on more than 20% of documents, the simulation is invalid.</li>
</ul>
<p>Validation is also embedded in the tool: the model sees an error message and must resolve it before proceeding. It can't commit a pipeline that fails these checks.</p>
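<p>A minimal sketch of those tool-level gates. The thresholds (80% overall parse rate, 20% per-processor failure) come from the post; the data shapes and function name are illustrative:</p>
<pre><code class="language-javascript">const MIN_PARSE_RATE = 0.8;        // 80% of documents must parse
const MAX_PROCESSOR_FAILURE = 0.2; // no processor may fail on more than 20%

// Returns tool-level errors; an empty array means the agent may commit.
function validateSimulation(sim) {
  const errors = [];
  if (MIN_PARSE_RATE > sim.parseRate) {
    errors.push('Parse rate is too low: ' + (sim.parseRate * 100).toFixed(2) +
      '% (minimum required: 80%).');
  }
  for (const proc of sim.processors) {
    if (proc.failureRate > MAX_PROCESSOR_FAILURE) {
      errors.push('Processor "' + proc.name + '" has a failure rate of ' +
        (proc.failureRate * 100).toFixed(2) + '% (maximum allowed: 20%).');
    }
  }
  return errors;
}
</code></pre>
<p>Because the errors are returned by the tool itself, the agent cannot talk its way past them; it has to change the pipeline until the list comes back empty.</p>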
<p>Under the hood we're steering the agent in a specific direction. The system prompt here includes: &quot;Simplify first. Remove problematic processors rather than adding workarounds. A pipeline that handles 95% of documents perfectly is better than one that attempts 100% but fails unpredictably.&quot;</p>
<p>If your data is already well-structured (proper <code>@timestamp</code>, correct field types, no raw text that needs parsing), the agent detects this and commits an empty pipeline. It doesn't add processors for the sake of it.</p>
<h2>The output is Streamlang</h2>
<p>The agent writes Streamlang DSL, Elastic's processing language for streams, which compiles to ingest pipelines behind the scenes.</p>
<p>The field schema, the processor types, the step format: all expressed in Streamlang. Here's what the user-approved pipeline looks like for the nginx example above, targeting an ECS stream:</p>
<pre><code class="language-yaml">steps:
  - action: grok
    from: message
    patterns:
      - &quot;%{IPV4:source.ip} - %{USER:user.name} \\[%{HTTPDATE:@timestamp}\\] \&quot;%{WORD:http.request.method} %{URIPATHPARAM:url.path} HTTP/%{NUMBER:http.version}\&quot; %{NUMBER:http.response.status_code:int} %{NUMBER:http.response.body.bytes:int}&quot;
  - action: date
    from: &quot;@timestamp&quot;
    formats:
      - &quot;dd/MMM/yyyy:HH:mm:ss Z&quot;
  - action: convert
    from: http.response.status_code
    type: integer
  - action: remove
    from: message
</code></pre>
<h2>Two schemas, one generator</h2>
<p>Not everyone lands logs in the same shape, and Elastic needs to support a variety of formats. Teams running OpenTelemetry collectors want their data in OTel-native fields. Teams on Elastic's traditional stack expect ECS. Both are valid, and forcing everyone onto one schema would mean asking half our users to restructure their pipelines before they can even get started.</p>
<p>So Streams supports both, and the generator handles both. We automatically detect whether to use OTel or ECS, mostly by checking whether the stream name contains <code>otel</code>, since that's what the current naming in our stack defaults to.</p>
<p>The pipeline looks different for each because the canonical field names differ:</p>
<table>
<thead>
<tr>
<th></th>
<th>OTel</th>
<th>ECS</th>
</tr>
</thead>
<tbody>
<tr>
<td>Log body</td>
<td><code>body.text</code></td>
<td><code>message</code></td>
</tr>
<tr>
<td>Log level</td>
<td><code>severity_text</code></td>
<td><code>log.level</code></td>
</tr>
<tr>
<td>Service name</td>
<td><code>resource.attributes.service.name</code></td>
<td><code>service.name</code></td>
</tr>
<tr>
<td>Host name</td>
<td><code>resource.attributes.host.name</code></td>
<td><code>host.name</code></td>
</tr>
</tbody>
</table>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/otel-vs-ecs@2x.png" alt="OTel vs ECS stream pipeline comparison with alias layer" /></p>
<p>An OTel stream gets a grok processor that reads from <code>body.text</code>:</p>
<pre><code class="language-json">{ &quot;action&quot;: &quot;grok&quot;, &quot;from&quot;: &quot;body.text&quot;, &quot;patterns&quot;: [&quot;...&quot;] }
</code></pre>
<p>An ECS stream reads from <code>message</code>:</p>
<pre><code class="language-json">{ &quot;action&quot;: &quot;grok&quot;, &quot;from&quot;: &quot;message&quot;, &quot;patterns&quot;: [&quot;...&quot;] }
</code></pre>
<p>OTel streams alias the ECS field names to their OTel equivalents. <code>log.level</code> is an alias for <code>severity_text</code>. <code>message</code> is an alias for <code>body.text</code>. A query written for ECS works on an OTel stream without changes, since the alias layer handles the translation.</p>
<pre><code class="language-json">{
  &quot;message&quot;:    { &quot;path&quot;: &quot;body.text&quot;,     &quot;type&quot;: &quot;alias&quot; },
  &quot;log.level&quot;:  { &quot;path&quot;: &quot;severity_text&quot;, &quot;type&quot;: &quot;alias&quot; }
}
</code></pre>
<p>The agent is aware of which side of this it's on. It doesn't add a rename step for <code>severity_text</code> → <code>log.level</code> on an OTel stream because the alias already provides that mapping. On an ECS stream, it adds the normalization explicitly.</p>
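<p>How an ECS-style query reads an OTel document through the alias layer can be sketched like this. This is a simplified stand-in: Elasticsearch resolves alias fields at the mapping level, not in application code.</p>
<pre><code class="language-javascript">// Alias table as shown above; resolution itself is illustrative.
const ALIASES = { 'message': 'body.text', 'log.level': 'severity_text' };

function resolveField(doc, field) {
  const path = ALIASES[field] || field;
  // walk the dotted path, e.g. 'body.text' reads doc.body.text
  return path.split('.').reduce((obj, part) => (obj ? obj[part] : undefined), doc);
}
</code></pre>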
<h2>Schema normalization</h2>
<p>Field extraction is the most important and obvious part, but the extracted fields also need to align across services.</p>
<p>If two services both log HTTP requests but call the status code field differently (<code>response_status</code> in one, <code>http_code</code> in another), a query for <code>http.response.status_code: 5*</code> returns nothing for either of them. Schema normalization maps both to the standard name:</p>
<pre><code># before: extracted field names from two different services
{ &quot;response_status&quot;: 500 }    # service A
{ &quot;http_code&quot;: 500 }           # service B

# after: ECS normalization
{ &quot;http.response.status_code&quot;: 500 }
</code></pre>
<p>Now every service uses <code>http.response.status_code</code>, and the query works across all of them.</p>
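<p>A minimal sketch of this renaming in Python, using the hypothetical field names from the example above:</p>

```python
# Map service-specific field names onto the shared ECS name.
ECS_RENAMES = {
    "response_status": "http.response.status_code",  # service A
    "http_code": "http.response.status_code",        # service B
}

def normalize(doc: dict) -> dict:
    """Rename known non-standard fields; leave everything else alone."""
    return {ECS_RENAMES.get(field, field): value for field, value in doc.items()}
```

<p>Calling <code>normalize({"response_status": 500})</code> yields <code>{"http.response.status_code": 500}</code>, while fields that already carry standard names pass through unchanged.</p>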
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/schema-normalization@2x.png" alt="Schema normalization: two services with different field names mapped to a single ECS field" /></p>
<p>During simulation, the agent checks ECS and OTel metadata for every field it generates. Fields that already have standard names are left alone. Fields that map to a known ECS field get renamed. The simulation metrics surface this explicitly: each field in the results includes its ECS or OTel type indicator, so you can see at a glance what's been normalized.</p>
<h2>The bar the agent must clear</h2>
<p>The system prompt sets explicit acceptance criteria for a user-approved pipeline:</p>
<ul>
<li>99% of documents must have a valid <code>@timestamp</code></li>
<li>All fields must have the correct types for the target schema</li>
<li>The overall failure rate must be below 0.5%</li>
</ul>
<p>If the agent can't satisfy all of these within six iterations, the generation fails.</p>
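<p>The thresholds above can be sketched as a simple check over simulated output. This is an illustration, not the actual implementation: the document shape, including the <code>_failed</code> marker, is a hypothetical stand-in for the real simulation results, and the field-type check is omitted for brevity.</p>

```python
def pipeline_accepted(docs: list[dict]) -> bool:
    """Apply the timestamp and failure-rate acceptance thresholds."""
    total = len(docs)
    if total == 0:
        return False
    with_timestamp = sum(1 for d in docs if d.get("@timestamp"))
    failures = sum(1 for d in docs if d.get("_failed"))
    # 99% must have a valid @timestamp; failure rate must stay below 0.5%
    return with_timestamp / total >= 0.99 and failures / total < 0.005
```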
<h2>To summarize</h2>
<p>Pipeline generation takes seconds where the manual process takes hours. The time savings come from automating the validation loop you'd otherwise run by hand: write a pattern, test it against real documents, read the failures, adjust, and try again. The agent does this in up to six cycles against the last documents your stream actually received.</p>
<h2>What's coming next in Streams and processing</h2>
<p>The most user-facing change in progress is the refinement loop. Right now, if the suggestion is close but not exactly right, you edit steps manually and that's it. The next version lets you adjust the proposed pipeline and send it back through the agent with your changes as context, so it builds from where you left off rather than starting from scratch.</p>
<p>Two other things are in progress: generation going async (currently it blocks the UI for a few seconds; soon it runs in the background), and support for streams that already have a pipeline. For now, it only handles streams without existing processing steps.</p>
<p>The same capabilities are also being exposed as callable tools in the Streams agent builder and as APIs for third-party agent frameworks. An agent can run a full pipeline generation as part of a broader onboarding workflow, without the UI.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/og-image@2x.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[ML and AI Ops Observability with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ml-ai-ops-observability-opentelemetry-elastic</link>
            <guid isPermaLink="false">ml-ai-ops-observability-opentelemetry-elastic</guid>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to instrument ML and AI pipelines with OpenTelemetry and Elastic to correlate traces, logs, and metrics from notebooks to production inference services.]]></description>
            <content:encoded><![CDATA[<p>While isolated execution logs might work for local experiments, they are no longer enough for the new era of complex, production-ready Machine Learning (ML) pipelines and Artificial Intelligence (AI) agents. Modern ML and AI systems present three unique challenges:</p>
<ul>
<li><strong>Distributed components</strong>: A single request might hit an API gateway, retrieve data from a feature store, evaluate a predictive model in a Python inference service, query a vector database, and call an external LLM.</li>
<li><strong>Non-determinism</strong>: AI agents make autonomous decisions and tool calls. If an agent fails, you need a full trace to understand its reasoning loop and what external tools it tried to invoke.</li>
<li><strong>Context dependence</strong>: You don't just care <em>that</em> an error happened; you need to know <em>what model version</em> was running, <em>what hyperparameters</em> were used, <em>what the input data looked like</em>, and <em>what commit</em> introduced the change. Many of these attributes are custom to your app, so you need an Observability environment flexible enough to create new parameters on the fly and use them to find and fix issues.</li>
</ul>
<p>On top of that, with the increased use of AI agents to generate code and make autonomous decisions, Observability becomes key to understanding what is working and what is not. It creates a critical feedback loop to quickly fix problems. More than ever, ML and AI applications need to adopt the best practices of mature software engineering systems to succeed.</p>
<p>This guide shows how to use OpenTelemetry and Elastic to correlate traces, logs, and metrics to track runs, compare model behavior, and trace requests across Python and Go services with one shared context.</p>
<h2>Problem context: why AI systems are harder to debug</h2>
<p>Traditional services already have distributed failure modes, but ML and AI systems add more moving parts:</p>
<ul>
<li>notebook experiments and ad hoc jobs</li>
<li>batch training and evaluation pipelines</li>
<li>online inference services</li>
<li>external API calls, including LLM providers</li>
<li>changing model versions and hyperparameters</li>
</ul>
<p>When one prediction path gets slower or starts failing, plain isolated logs do not answer enough questions. You need to correlate:</p>
<ul>
<li><strong>what ran</strong> (run ID, model version, parameters)</li>
<li><strong>where time was spent</strong> (pipeline stage latencies)</li>
<li><strong>what was the result</strong> (model stats, predictions, API calls, compare with other runs)</li>
<li><strong>what changed</strong> (code, data, dependencies)</li>
</ul>
<p>In a future blog post, we'll show you how to set up automatic RCA and remediations with <a href="https://github.com/elastic/workflows/">Elastic Workflows</a> and our AI integrations. But as a first step, ML and AI pipelines need a robust Observability framework, which is very easy to set up with OpenTelemetry and Elastic.</p>
<h2>Solution overview</h2>
<p>OpenTelemetry gives you a standard way to emit traces, metrics, and logs. Elastic provides full OpenTelemetry ingestion, giving you a single place to store and query that telemetry. Kibana's UI is fully integrated with OpenTelemetry, allowing you to explore your services, service dependencies, service latencies, spans, and metrics out-of-the-box.</p>
<p>You can start with two deployment options:</p>
<ul>
<li><strong>Cloud</strong>: send OpenTelemetry data directly to Elastic Cloud Managed OTLP Endpoint (<a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP docs</a>), without the overhead of managing collectors</li>
<li><strong>Local</strong>: run Elastic and the EDOT Collector with <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-the-elastic-distribution-of-opentelemetry-edot-collector">start-local</a>; the EDOT Collector will automatically listen for OTLP data on <code>localhost:4317</code></li>
</ul>
<p>Both options let you keep your application code unchanged for the initial implementation.</p>
<h2>Step 1: zero-code baseline for Python services</h2>
<p>Start by installing the Elastic Distribution of OpenTelemetry Python (<a href="https://github.com/elastic/elastic-otel-python">EDOT Python</a>) package and running your script with the <code>opentelemetry-instrument</code> wrapper. Without modifying your application code, your Python services begin emitting standard telemetry right away: any logs exported via <code>logging</code>, alongside metrics and traces for auto-instrumented libraries. This data can be routed directly to Elastic's managed OTLP endpoint or a local EDOT collector.</p>
<pre><code class="language-bash">pip install elastic-opentelemetry
edot-bootstrap --action=install
</code></pre>
<p>Export the OpenTelemetry environment variables, then run <code>opentelemetry-instrument</code> on your script to enable auto-instrumentation.</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;motlp-endpoint&gt;&quot; # No need when using start-local with EDOT
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;key&gt;&quot; # No need when using start-local with EDOT
export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment=prod,service.version=1.0.0&quot; # Set the environment and version for your app
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000 # Choose the interval for your application metrics

opentelemetry-instrument --service_name=&lt;pipeline-name&gt; python3 &lt;your_python_script&gt;.py # Set your chosen name for your service
</code></pre>
<p>With this baseline, you can quickly get:</p>
<ul>
<li>Centralized logs with trace context: any logs exported via <code>logging</code> become searchable in Elastic and Kibana, with full-text search across your logs</li>
<li>Alerting on log errors</li>
<li>Process and system metrics: system and process metrics from the execution are automatically exported to Elastic, so you can visualize them and analyze memory usage (leaks, OOM errors), CPU utilization (bottlenecks, spikes), thread counts, disk I/O bottlenecks, or network I/O saturation</li>
<li>Alerting on metrics</li>
<li>Spans for auto-instrumented libraries</li>
<li>Service latency baselines and error trends</li>
<li>Manual or anomaly detection alerting on error rates, latencies, or throughput</li>
<li>Correlation of logs, metrics, and traces in a single shared context to quickly find the root cause of issues, using OpenTelemetry for instrumentation and Elastic for analysis</li>
</ul>
<p>Once ingested, Kibana immediately populates out-of-the-box dashboards. You can explore full-text searchable logs, monitor system and process metrics, investigate auto-instrumented trace waterfalls, map out your ML dependencies with service maps, and easily set up alerts for latency spikes, memory or CPU usage or log errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-logs.png" alt="Logs in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-log-errors.png" alt="Log errors in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-alerts-on-log-errors.png" alt="Alerts on log errors" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-metrics.png" alt="System and process metrics" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-auto-instrumented-traces.png" alt="Auto instrumented traces" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-service-map.png" alt="Service map in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-alerts-on-latencies.png" alt="Alerts on latencies" /></p>
<p>For LLM-specific observability, OpenTelemetry provides official <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">Semantic Conventions for Generative AI</a> to standardize how you track token usage, model names, and prompts. These semantic conventions are still in development and not stable yet. Some instrumentations for the most used libraries in this space are being developed as part of the <a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai">OpenTelemetry Python Contrib repository</a>.
Alternatively, you can implement these conventions manually in your custom spans. LLM-related OpenTelemetry logs, metrics, and traces sent to Elastic stay in context and are automatically correlated with the rest of your application or stack of applications.</p>
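<p>As a sketch of what these conventions capture, the attributes attached to an LLM span might look like the following. The attribute names follow the current draft of the GenAI semantic conventions and may still change; the values are invented for illustration:</p>

```python
# Hypothetical LLM call metadata, keyed by GenAI semantic convention names.
llm_span_attributes = {
    "gen_ai.operation.name": "chat",     # kind of GenAI operation performed
    "gen_ai.request.model": "gpt-4o",    # model requested by the caller
    "gen_ai.usage.input_tokens": 412,    # prompt tokens consumed
    "gen_ai.usage.output_tokens": 128,   # completion tokens produced
}

# Total token usage for cost dashboards or alerting
total_tokens = (llm_span_attributes["gen_ai.usage.input_tokens"]
                + llm_span_attributes["gen_ai.usage.output_tokens"])
```

<p>Setting these as span attributes (for example via <code>span.set_attribute</code>) makes token usage and model versions queryable alongside the rest of your trace data.</p>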
<h2>Step 2: add ML-specific context with custom spans and log fields</h2>
<p>Auto-instrumentation is a starting point. For ML and AI Ops, add explicit spans around business stages and attach run metadata. Elastic's schema flexibility and dynamic mappings make it a perfect fit for custom attributes or metrics that are exclusive to your pipelines or specific experiments. There is no need to know what the data will look like before writing it: you can create new parameters on the fly, Elastic maps them automatically, and you can track them instantly.</p>
<p>Add custom fields and metric-like values as structured log fields so you can chart and alert on them later:</p>
<pre><code class="language-python">logger.info(&quot;training metrics&quot;, extra={
    &quot;ml.run_id&quot;: run_id,
    &quot;ml.training_accuracy&quot;: train_accuracy,
    &quot;ml.validation_accuracy&quot;: val_accuracy,
    &quot;ml.drift_detected&quot;: drift_detected,
})
</code></pre>
<p>Because Elastic handles dynamic mapping, any custom metrics or attributes you log, like model ids, training accuracy or drift detection, are instantly indexed and available to search in Discover or visualize via Dashboards.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-log-attributes.png" alt="Custom log attributes" /></p>
<p>This makes dashboards and rules practical:</p>
<ul>
<li>alert when <code>ml.validation_accuracy &lt; 0.8</code></li>
<li>alert when <code>ml.drift_detected == true</code></li>
<li>compare stage latency by <code>ml.model_version</code></li>
</ul>
<p>You can use these custom attributes to build targeted visualizations, and trigger alerts when ML-specific metrics like validation accuracy drop below a critical threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-charts-from-custom-log-attributes.png" alt="Charts from custom log attributes" /></p>
<p>Adding custom spans lets you break the specific stages of your ML pipeline, such as data loading and model training, into their own measurable execution blocks, and analyze average latency or error rates for each stage.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-spans.png" alt="Custom spans in code" /></p>
<pre><code class="language-python">from opentelemetry import trace

tracer = trace.get_tracer(&quot;ml.pipeline&quot;)

with tracer.start_as_current_span(&quot;load_data&quot;) as span:
    span.set_attribute(&quot;ml.run_id&quot;, run_id)
    span.set_attribute(&quot;ml.dataset&quot;, dataset_source)
    load_data()

with tracer.start_as_current_span(&quot;train_model&quot;) as span:
    span.set_attribute(&quot;ml.model_version&quot;, model_version)
    span.set_attribute(&quot;ml.learning_rate&quot;, learning_rate)
    train_model()
</code></pre>
<p>Custom spans appear in the APM UI alongside your traces, so you can explore their latency, their contribution to total execution time, stack traces, and error rates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-spans-ui-in-elastic.png" alt="Custom spans UI in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-analysing-spans.png" alt="Analysing spans" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-latency-and-avg-latency-of-spans.png" alt="Latency and avg latency of spans" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-alerts-on-custom-log-metrics.png" alt="Alerts on custom log metrics" /></p>
<h2>Step 3: trace across Python and Go in production</h2>
<p>Real inference paths often cross service boundaries. In a production environment, for example, a user request might pass through a Go-based API before hitting your Python ML inference service. OpenTelemetry ensures tracing context is preserved seamlessly across these boundaries.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-service-map-with-multiple-services.png" alt="Service map with multiple services" /></p>
<p>In our example, we have a simple Go HTTP service that acts as the entry point and demonstrates OpenTelemetry instrumentation in Go. This REST API service stores and retrieves ML predictions by querying Elasticsearch based on data IDs from the source dataset. All of its endpoints are natively instrumented with OTel spans.</p>
<p>The full request lifecycle looks like this:</p>
<ol>
<li>The Go API receives the client request.</li>
<li>It searches Elasticsearch for an existing prediction or calls the Python model service to run inference.</li>
<li>The Python service loads features, runs the model, and returns predictions.</li>
</ol>
<p>When both services use OpenTelemetry, trace context is propagated automatically through headers. In Elastic, you can inspect one end-to-end trace and locate latency or errors by service and span.</p>
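<p>By default, OpenTelemetry propagates this context using the W3C Trace Context <code>traceparent</code> header, which both the Go and Python instrumentation read and write. A quick look at its structure (the IDs below are the example values from the W3C specification):</p>

```python
# traceparent format: version-traceid-spanid-flags
traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, span_id, flags = traceparent.split("-")
# trace_id is a 16-byte trace ID, hex-encoded (32 chars), shared by
# every span in the distributed trace; span_id identifies the caller's span.
```

<p>Because every hop carries the same <code>trace_id</code>, Elastic can stitch the Go and Python spans into one end-to-end trace.</p>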
<p>The resulting distributed trace in Elastic pieces the entire journey together. You can see the exact breakdown of time spent in the Go API versus the Python model, and correlate logs from both services in a single unified view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-multiple-services.png" alt="Multiple services request flow" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-spans-per-service.png" alt="Spans per service" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-go-service-logs.png" alt="Go service logs" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-go-traces-in-discover.png" alt="Go traces in discover" /></p>
<h2>Validation checklist</h2>
<p>After instrumentation, validate with a short runbook:</p>
<ol>
<li>Confirm logs, metrics, and traces arrive for each service.</li>
<li>Verify your custom attributes (e.g. <code>run_id</code>, <code>model_version</code>, <code>llm_ground_truth_score</code>) are present in traces and logs.</li>
<li>Compare p95 latency per stage (<code>load_data</code>, <code>train_model</code>, <code>predict</code>).</li>
<li>Trigger a controlled failure and confirm error traces include stack context.</li>
<li>Test one rule for errors, one rule for latency spikes, and one rule for model-quality fields. Set up a connector and attach it to each rule to notify you in Slack or email, or to trigger an auto-remediation workflow.</li>
</ol>
<h2>Conclusion and next steps</h2>
<p>OpenTelemetry gives ML and AI teams a unified telemetry layer, while Elastic makes that data instantly queryable and actionable across your entire lifecycle—from notebook experiments to production inference. By starting with zero-code instrumentation and incrementally adding ML-specific attributes and cross-language tracing, your team can easily adopt the Observability best practices of mature software engineering systems and succeed in the new era of complex AI operations.</p>
<p>Try this setup in <a href="https://cloud.elastic.co/registration">Elastic Cloud</a>, and use <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP</a> for a managed ingest path. If you want a local sandbox first, start with <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-the-elastic-distribution-of-opentelemetry-edot-collector">Elastic start-local + EDOT Collector</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitor your Python data pipelines with OTEL]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-your-python-data-pipelines-with-otel</link>
            <guid isPermaLink="false">monitor-your-python-data-pipelines-with-otel</guid>
            <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure OTEL for your data pipelines, detect any anomalies, analyze performance, and set up corresponding alerts with Elastic.]]></description>
<content:encoded><![CDATA[<p>This article delves into how to implement observability practices, particularly using <a href="https://opentelemetry.io/">OpenTelemetry (OTEL)</a> in Python, to enhance the monitoring and quality control of data pipelines with Elastic. While the examples in this article focus on ETL (Extract, Transform, Load) processes, whose accuracy and reliability are crucial for Business Intelligence (BI), the strategies and tools discussed apply equally to Python processes used for Machine Learning (ML) models or other data processing tasks.</p>
<h2>Introduction</h2>
<p>Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.</p>
<p>In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into <a href="https://cloud.google.com/bigquery">Google BigQuery (BQ)</a>. This processed data then feeds into <a href="https://www.getdbt.com">DBT (Data Build Tool)</a> models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our DBT pipelines with Elastic see <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>. In this article we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.</p>
<p>The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language, as long as a corresponding agent supports OTEL instrumentation.</p>
<h2>Motivation</h2>
<p>Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:</p>
<ol>
<li>Data Quality Control:</li>
</ol>
<ul>
<li>Detecting anomalies in the data, such as unexpected drops in record counts.</li>
<li>Verifying that data transformations are applied correctly and consistently.</li>
<li>Ensuring the integrity and accuracy of the data loaded into the data warehouse.</li>
</ul>
<ol start="2">
<li>Performance Monitoring:</li>
</ol>
<ul>
<li>Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.</li>
<li>Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.</li>
</ul>
<ol start="3">
<li>Real-time Alerting:</li>
</ol>
<ul>
<li>Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.</li>
<li>Identifying the root cause of such incidents.</li>
<li>Proactively addressing incidents to minimize downtime and impact on business operations.</li>
</ul>
<p>Issues such as failed ETL jobs can even point to larger infrastructure problems or data quality issues in the source data.</p>
<h2>Steps for Instrumentation</h2>
<p>Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.</p>
<h3>Step 1: Import Required Libraries</h3>
<p>We first need to install the following libraries.</p>
<pre><code class="language-sh">pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]
</code></pre>
<p>You can also add them to your project's <code>requirements.txt</code> file and install them with <code>pip install -r requirements.txt</code>.</p>
<h4>Explanation of Dependencies</h4>
<ol>
<li>
<p><strong>elastic-opentelemetry</strong>: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:</p>
<ul>
<li>
<p><strong>opentelemetry-distro</strong>: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.</p>
</li>
<li>
<p><strong>opentelemetry-exporter-otlp</strong>: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.</p>
</li>
<li>
<p><strong>opentelemetry-instrumentation-system-metrics</strong>: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.</p>
</li>
</ul>
</li>
<li>
<p><strong>google-cloud-bigquery[opentelemetry]</strong>: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.</p>
</li>
</ol>
<h3>Step 2: Export OTEL Variables</h3>
<p>Set the necessary OpenTelemetry (OTEL) variables using the configuration shown in the APM OTel setup in Elastic.</p>
<p>Go to APM -&gt; Services -&gt; Add data (top left corner).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-1.png" alt="1 - Get OTEL variables step 1" /></p>
<p>In this section you will find the steps to configure various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-2.png" alt="2 - Get OTEL variables step 2" /></p>
<p><strong>Find OTLP Endpoint</strong>:</p>
<ul>
<li>Look for the section related to OpenTelemetry or OTLP configuration.</li>
<li>The <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like <code>https://&lt;your-apm-server&gt;/otlp</code>.</li>
</ul>
<p><strong>Obtain OTLP Headers</strong>:</p>
<ul>
<li>In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.</li>
<li>Copy the necessary headers provided by the interface. They might look like <code>Authorization: Bearer &lt;your-token&gt;</code>.</li>
</ul>
<p>Note: Notice you need to replace the whitespace between <code>Bearer</code> and your token with <code>%20</code> in the <code>OTEL_EXPORTER_OTLP_HEADERS</code> variable when using Python.</p>
<p>Alternatively you can use a different approach for authentication using API keys (see <a href="https://github.com/elastic/elastic-otel-python?tab=readme-ov-file#authentication">instructions</a>). If you are using our <a href="https://www.elastic.co/docs/current/serverless/general/what-is-serverless-elastic">serverless offering</a> you will need to use this approach instead.</p>
<p><strong>Set up the variables</strong>:</p>
<ul>
<li>Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command <code>source env.sh</code>.</li>
</ul>
<p>Below is a script to set these variables:</p>
<pre><code class="language-sh">#!/bin/bash
echo &quot;--- :otel: Setting OTEL variables&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server/otlp:443'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER=&quot;otlp,console&quot;
</code></pre>
<p>With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.</p>
<h4>Explanation of Variables</h4>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong>: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace <code>placeholder</code> with your actual OTLP endpoint.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong>: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace <code>placeholder</code> with your actual OTLP headers.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED</strong>: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOG_CORRELATION</strong>: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.</p>
</li>
<li>
<p><strong>OTEL_METRIC_EXPORT_INTERVAL</strong>: This variable specifies the metric export interval in milliseconds, in this case 5s.</p>
</li>
<li>
<p><strong>OTEL_LOGS_EXPORTER</strong>: This variable specifies the exporter to use for logs. Setting it to &quot;otlp&quot; means that logs will be exported using the OTLP protocol. Adding &quot;console&quot; exports logs to the console as well as to the OTLP endpoint. In our case, for better visibility on the infra side, we chose to export to the console too.</p>
</li>
<li>
<p><strong>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</strong>: This variable must be set when using the Elastic distribution, as it defaults to false.</p>
</li>
</ul>
<p>Note: <strong>OTEL_METRICS_EXPORTER</strong> and <strong>OTEL_TRACES_EXPORTER</strong>: These variables specify the exporters to use for metrics and traces. Both default to &quot;otlp&quot;, which means that metrics and traces will be exported using the OTLP protocol.</p>
<h3>Running Python ETLs</h3>
<p>We run Python ETLs with the following command:</p>
<pre><code class="language-sh">OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=x-ETL,service.version=1.0,deployment.environment=production&quot; &amp;&amp; opentelemetry-instrument python3 X_ETL.py 
</code></pre>
<h4>Explanation of the Command</h4>
<ul>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong>: This variable specifies additional resource attributes, such as <a href="https://www.elastic.co/guide/en/observability/current/apm.html">service name</a>, service version, and deployment environment, that will be included in all telemetry data. You can customize these values per your needs, and use a different service name for each script. Note that the variable is prepended to the command (without <code>&amp;&amp;</code>) so that it applies to the <code>opentelemetry-instrument</code> process.</p>
</li>
<li>
<p><strong>opentelemetry-instrument</strong>: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.</p>
</li>
<li>
<p><strong>python3 X_ETL.py</strong>: This runs the specified Python script (<code>X_ETL.py</code>).</p>
</li>
</ul>
<h3>Tracing</h3>
<p>We export the traces via the default OTLP protocol.</p>
<p>Tracing is a key aspect of monitoring and understanding the performance of applications. <a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-spans.html">Spans</a> form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.</p>
<p>Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.</p>
<p>With the default instrumentation, the whole Python script would be a single span. In our case, we have decided to manually add a span for each phase of the Python process, so that we can measure their latency, throughput, error rate, and other signals individually. This is how we define spans manually:</p>
<pre><code class="language-python">from opentelemetry import trace

if __name__ == &quot;__main__&quot;:

    tracer = trace.get_tracer(&quot;main&quot;)
    with tracer.start_as_current_span(&quot;initialization&quot;) as span:
        # Init code
        …
    with tracer.start_as_current_span(&quot;search&quot;) as span:
        # Step 1 - Search code
        …
    with tracer.start_as_current_span(&quot;transform&quot;) as span:
        # Step 2 - Transform code
        …
    with tracer.start_as_current_span(&quot;load&quot;) as span:
        # Step 3 - Load code
        …
</code></pre>
<p>You can explore traces in the APM interface as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/Traces-APM-Observability-Elastic.png" alt="3 - APM Traces view" /></p>
<h3>Metrics</h3>
<p>We export metrics via the default OTLP protocol as well, such as CPU usage and memory. No extra code needs to be added in the script itself.</p>
<p>Note: Remember to set <code>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</code> to true.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-metrics-apm-view.png" alt="4 - APM Metrics view" /></p>
<h3>Logging</h3>
<p>We export logs via the default OTLP protocol as well.</p>
<p>For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:</p>
<pre><code class="language-python">        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # &quot;slot_time_ms&quot;: job_details.slot_ms,
            &quot;job_id&quot;: job_details.job_id,
            &quot;job_type&quot;: job_details.job_type,
            &quot;state&quot;: job_details.state,
            &quot;path&quot;: job_details.path,
            &quot;job_created&quot;: job_details.created.isoformat(),
            &quot;job_ended&quot;: job_details.ended.isoformat(),
            &quot;execution_time_ms&quot;: (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            &quot;bytes_processed&quot;: job_details.output_bytes,
            &quot;rows_affected&quot;: job_details.output_rows,
            &quot;destination_table&quot;: job_details.destination.table_id,
            &quot;event&quot;: &quot;BigQuery Load Job&quot;, # Custom event type
            &quot;status&quot;: &quot;success&quot;, # Status of the step (success/error)
            &quot;category&quot;: category # ETL category tag 
        }

        logging.info(&quot;BigQuery load operation successful&quot;, extra=bq_fields)
</code></pre>
<p>This code shows how to extract BigQuery job statistics, among them execution time, bytes processed, rows affected, and destination table. You can also attach other metadata, as we do with a custom event type, status, and category.</p>
<p>Any call to logging at or above the set threshold (in this case INFO, via <code>logging.getLogger().setLevel(logging.INFO)</code>) will create a log that is exported to Elastic. This means that Python scripts that already use <code>logging</code> need no changes to export logs to Elastic.</p>
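<p>A minimal sketch of this behavior, independent of OpenTelemetry itself: once the root logger level is set to INFO, DEBUG calls are dropped, while INFO calls, together with their <code>extra</code> fields, reach every configured handler, which is where the exporter picks them up. The field values below are illustrative.</p>
<pre><code class="language-python">import logging

records = []

class ListHandler(logging.Handler):
    # Stand-in for the exporter's log handler: collects emitted records.
    def emit(self, record):
        records.append(record)

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(ListHandler())

logging.debug(&quot;below the threshold, dropped&quot;)
logging.info(&quot;BigQuery load operation successful&quot;,
             extra={&quot;status&quot;: &quot;success&quot;, &quot;category&quot;: &quot;sales&quot;})

# The extra dict becomes attributes on the record (labels in Elastic).
assert records[0].status == &quot;success&quot;
</code></pre>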
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-logs-apm-view.png" alt="5 - APM Logs view" /></p>
<p>For each of the log messages, you can go into the details view (click on the <code>…</code> when you hover over the log line and go into <code>View details</code>) to examine the metadata attached to the log message. You can also explore the logs in <a href="https://www.elastic.co/guide/en/kibana/8.14/discover.html">Discover</a>.</p>
<h4>Explanation of Logging Modification</h4>
<ul>
<li>
<p><strong>logging.info</strong>: This logs an informational message. The message &quot;BigQuery load operation successful&quot; is logged.</p>
</li>
<li>
<p><strong>extra=bq_fields</strong>: This adds additional context to the log entry using the <code>bq_fields</code> dictionary. This context can include details making the log entries more informative and easier to analyze. This data will be later used to set up alerts and data anomaly detection jobs.</p>
</li>
</ul>
<h2>Monitoring in Elastic's APM</h2>
<p>As shown, we can examine traces, metrics, and logs in the APM interface. To make the most out of this data, we use nearly the whole suite of features in Elastic Observability, alongside Elastic's machine learning capabilities.</p>
<h3>Rules and Alerts</h3>
<p>We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.</p>
<p>The <a href="https://www.elastic.co/guide/en/kibana/current/apm-alerts.html#apm-create-error-alert"><code>error count threshold</code> rule</a> is used to create a trigger when the number of errors in a service exceeds a defined threshold.</p>
<p>To create the rule go to Alerts and Insights -&gt; Rules -&gt; Create Rule -&gt; Error count threshold, set the error count threshold, the service or environment you want to monitor (you can also set an error grouping key across services), how often to run the check, and choose a connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/error-count-threshold.png" alt="6 - ETL Status Error Rule" /></p>
<p>Next, we create a rule of type <code>custom threshold</code> on a given ETL logs <a href="https://www.elastic.co/guide/en/kibana/current/data-views.html">data view</a> (create one for your index) filtering on &quot;labels.status: error&quot; to get all the logs with status error from any of the steps of the ETL which have failed. The rule condition is set to document count &gt; 0. In our case, in the last section of the rule config, we also set up Slack <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">alerts</a> every time the rule is activated. You can pick from a long list of <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">connectors</a> Elastic supports.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/etl-fail-status-rule.png" alt="7 - ETL Status Error Rule" /></p>
<p>Then we can set up alerts for failures. We add status to the logs metadata as shown in the code sample below for each of the steps in the ETLs. It then becomes available in ES via <code>labels.status</code>.</p>
<pre><code class="language-python">logging.info(
            &quot;Elasticsearch search operation successful&quot;,
            extra={
                &quot;event&quot;: &quot;Elasticsearch Search&quot;,
                &quot;status&quot;: &quot;success&quot;,
                &quot;category&quot;: category,
                &quot;index&quot;: index,
            },
        )
</code></pre>
<h3>More Rules</h3>
<p>We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -&gt; Alerts and rules -&gt; Custom threshold rule -&gt; Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency.png" alt="8 - APM Custom Threshold - Latency" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency_2.png" alt="9 - APM Custom Threshold - Config" /></p>
<p>Alternatively, for finer-grained control, you can go with Alerts and rules -&gt; Anomaly rule, set up an anomaly job, and pick a threshold severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_anomaly_rule_config.png" alt="10 - APM Anomaly Rule - Config" /></p>
<h3>Anomaly detection job</h3>
<p>In this example, we set up an <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">anomaly detection job</a> on the number of documents before the transform, using the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">single metric job</a> wizard to detect any anomalies in the incoming data source.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/single-metrics.png" alt="11 - Single Metrics" /></p>
<p>In the last step, you can create alerting similarly to before: set a severity-level threshold to receive alerts whenever an anomaly is detected. Every anomaly is assigned an anomaly score, which maps to a severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-1.png" alt="12 - Anomaly detection Alerting - Severity" /></p>
<p>Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-connectors.png" alt="13 - Anomaly detection Alerting - Connectors" /></p>
<p>You can go to your custom dashboard by going to Add Panel -&gt; ML -&gt; Anomaly Swim Lane -&gt; Pick your job.</p>
<p>We also add a job for the number of documents after the transform, and a multi-metric job on <code>execution_time_ms</code>, <code>bytes_processed</code>, and <code>rows_affected</code>, as was done in <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>.</p>
<h2>Custom Dashboard</h2>
<p>Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on <code>labels.event</code> (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/custom_dashboard.png" alt="14 - Custom Dashboard" /></p>
<h2>Conclusion</h2>
<p>Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:</p>
<ul>
<li>Export of existing logs (no need to add custom logging) alongside their execution context</li>
<li>Monitor the runtime behavior of our models</li>
<li>Track data quality issues</li>
<li>Identify and troubleshoot real-time incidents</li>
<li>Optimize performance bottlenecks and resource usage</li>
<li>Identify dependencies on other services and their latency</li>
<li>Optimize data transformation processes</li>
<li>Set up alerts on latency, data quality issues, transaction error rates, or CPU usage</li>
</ul>
<p>With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to more robust and accurate BI systems and reporting.</p>
<p>In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/main_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[The next evolution of observability: unifying data with OpenTelemetry and generative AI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai</link>
            <guid isPermaLink="false">the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai</guid>
            <pubDate>Wed, 11 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Generative AI and machine learning are revolutionizing observability, but siloed data hinders their true potential. This article explores how to break down data silos by unifying logs, metrics, and traces with OpenTelemetry, unlocking the full power of GenAI for natural language investigations, automated root cause analysis, and proactive issue resolution.]]></description>
            <content:encoded><![CDATA[<p>The Observability industry today stands at a critical juncture. While our applications generate more telemetry data than ever before, this wealth of information typically exists in siloed tools, separate systems for logs, metrics, and traces. Meanwhile, Generative AI is hurtling toward us like an asteroid about to make a tremendous impact on our industry.</p>
<p>As SREs, we've grown accustomed to jumping between dashboards, log aggregators, and trace visualizers when troubleshooting issues. But what if there was a better way? What if AI could analyze all your observability data holistically, answering complex questions in natural language, and identifying root causes automatically?</p>
<p>This is the next evolution of observability. But to harness this power, we need to rethink how we collect, store, and analyze our telemetry data.</p>
<h2>The problem: siloed data limits AI effectiveness</h2>
<p>Traditional observability setups separate data into distinct types:</p>
<ul>
<li>Metrics: Numeric measurements over time (CPU, memory, request rates)</li>
<li>Logs: Detailed event records with timestamps and context</li>
<li>Traces: Request journeys through distributed systems</li>
<li>Profiles: Code-level execution patterns showing resource consumption and performance bottlenecks at the function/line level</li>
</ul>
<p>This separation made sense historically due to the way the industry evolved. Different data types have traditionally had different cardinality, structure, access patterns and volume characteristics. However, this approach creates significant challenges for AI-powered analysis:</p>
<pre><code class="language-text">Metrics (Prometheus) → &quot;CPU spiked at 09:17:00&quot;
Logs (ELK) → &quot;Exception in checkout service at 09:17:32&quot; 
Traces (Jaeger) → &quot;Slow DB queries in order-service at 09:17:28&quot;
Profiles (Pyroscope) → &quot;calculate_discount() is taking 75% of CPU time&quot;
</code></pre>
<p>When these data sources live in separate systems, AI tools must either:</p>
<ol>
<li>Work with an incomplete picture (seeing only metrics but not the related logs)</li>
<li>Rely on complex, brittle integrations that often introduce timing skew</li>
<li>Force developers to manually correlate information across tools</li>
</ol>
<p>Imagine asking an AI, &quot;Why did checkout latency spike at 09:17?&quot; To answer comprehensively, it needs access to logs (to see the stack trace), traces (to understand the service path), and metrics (to identify resource strain). With siloed tools, the AI either sees only fragments of the story or requires complex ETL jobs that are slower than the incident itself.</p>
<h2>Why traditional machine learning (ML) falls short</h2>
<p>Traditional machine learning for observability typically focuses on anomaly detection within a single data dimension. It can tell you when metrics deviate from normal patterns, but struggles to provide context or root cause.</p>
<p>ML models trained on metrics alone might flag a latency spike, but can't connect it to a recent deployment (found in logs) or identify that it only affects requests to a specific database endpoint (found in traces). They behave like humans with extreme tunnel vision, seeing only a fraction of the relevant information and only the information that a specific vendor has given you an opinionated view into.</p>
<p>This limitation becomes particularly problematic in modern microservice architectures where problems frequently cascade across services. Without a unified view, traditional ML can detect symptoms but struggles to identify the underlying cause.</p>
<h2>The solution: unified data with enriched logs</h2>
<p>The solution is conceptually simple but transformative: unify metrics, logs, and traces into a single data store, ideally with enriched logs that contain all signals about a request in a single JSON document. We're about to see a merging of signals.</p>
<p>Think of traditional logs as simple text lines:</p>
<pre><code class="language-text">[2025-05-19 09:17:32] ERROR OrderService - Failed to process checkout for user 12345
</code></pre>
<p>Now imagine an enriched log that contains not just the error message, but also:</p>
<ul>
<li>The complete distributed trace context</li>
<li>Related metrics at that moment</li>
<li>System environment details</li>
<li>Business context (user ID, cart value, etc.)</li>
</ul>
<p>This approach creates a holistic view where every signal about the same event sits side-by-side, perfect for AI analysis.</p>
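<p>As a sketch, a single enriched log document might look like the following; the field names and values are illustrative rather than a fixed schema:</p>
<pre><code class="language-python">import json

# Hypothetical enriched log: message, trace context, point-in-time metrics,
# and business context combined in one document.
enriched = {
    &quot;@timestamp&quot;: &quot;2025-05-19T09:17:32Z&quot;,
    &quot;log.level&quot;: &quot;ERROR&quot;,
    &quot;message&quot;: &quot;Failed to process checkout for user 12345&quot;,
    &quot;trace.id&quot;: &quot;4bf92f3577b34da6a3ce929d0e0e4736&quot;,
    &quot;span.id&quot;: &quot;00f067aa0ba902b7&quot;,
    &quot;metrics&quot;: {&quot;system.cpu.load&quot;: 0.87},
    &quot;user.id&quot;: &quot;12345&quot;,
    &quot;cart.value&quot;: 149.99,
}
doc = json.dumps(enriched)
</code></pre>
<p>Because every signal lives in the same document, a question such as &quot;why did this checkout fail?&quot; can be answered from one record instead of three separate systems.</p>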
<h2>How generative AI changes things</h2>
<p>Generative AI differs fundamentally from traditional ML in its ability to:</p>
<ol>
<li>Process unstructured data: Understanding free-form log messages and error text</li>
<li>Maintain context: Connecting related events across time and services</li>
<li>Answer natural language queries: Translating human questions into complex data analysis</li>
<li>Generate explanations: Providing reasoning alongside conclusions</li>
<li>Surface hidden patterns: Discovering correlations and anomalies in log data that would be impractical to find through manual analysis or traditional querying</li>
</ol>
<p>With access to unified observability data, GenAI can analyze complete system behavior patterns and correlate across previously disconnected signals.</p>
<p>For example, when asked &quot;Why is our checkout service slow?&quot; a GenAI model with access to unified data can:</p>
<ul>
<li>Analyze unified enriched logs to identify which specific operations are slow and to find errors or warnings in those components</li>
<li>Check attached metrics to understand resource utilization</li>
<li>Correlate all these signals with deployment events or configuration changes</li>
<li>Present a coherent explanation in natural language with supporting graphs and visualizations</li>
</ul>
<h2>Implementing unified observability with OpenTelemetry</h2>
<p>OpenTelemetry provides the perfect foundation for unified observability with its consistent schema across metrics, logs, and traces. Here's how to implement enriched logs in a Java application:</p>
<pre><code class="language-java">import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OrderProcessor {
    private static final Logger logger = LoggerFactory.getLogger(OrderProcessor.class);
    private final Tracer tracer;
    private final DoubleHistogram cpuUsageHistogram;
    private final OperatingSystemMXBean osBean;

    public OrderProcessor(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer(&quot;order-processor&quot;);
        Meter meter = openTelemetry.getMeter(&quot;order-processor&quot;);
        this.cpuUsageHistogram = meter.histogramBuilder(&quot;system.cpu.load&quot;)
                                      .setDescription(&quot;System CPU load&quot;)
                                      .setUnit(&quot;1&quot;)
                                      .build();
        this.osBean = ManagementFactory.getOperatingSystemMXBean();
    }

    public void processOrder(String orderId, double amount, String userId) {
        Span span = tracer.spanBuilder(&quot;processOrder&quot;).startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Add attributes to the span
            span.setAttribute(&quot;order.id&quot;, orderId);
            span.setAttribute(&quot;order.amount&quot;, amount);
            span.setAttribute(&quot;user.id&quot;, userId);
            // Populate MDC for structured logging
            MDC.put(&quot;trace_id&quot;, span.getSpanContext().getTraceId());
            MDC.put(&quot;span_id&quot;, span.getSpanContext().getSpanId());
            MDC.put(&quot;order_id&quot;, orderId);
            MDC.put(&quot;order_amount&quot;, String.valueOf(amount));
            MDC.put(&quot;user_id&quot;, userId);
            // Record CPU usage metric associated with the current trace context
            double cpuLoad = osBean.getSystemLoadAverage();
            if (cpuLoad &gt;= 0) {
                cpuUsageHistogram.record(cpuLoad);
                MDC.put(&quot;cpu_load&quot;, String.valueOf(cpuLoad));
            }
            // Log a structured message
            logger.info(&quot;Processing order&quot;);
            // Simulate business logic
            // ...
            span.setAttribute(&quot;order.status&quot;, &quot;completed&quot;);
            logger.info(&quot;Order processed successfully&quot;);
        } catch (Exception e) {
            span.recordException(e);
            span.setAttribute(&quot;order.status&quot;, &quot;failed&quot;);
            logger.error(&quot;Order processing failed&quot;, e);
        } finally {
            MDC.clear();
            span.end();
        }
    }
}
</code></pre>
<p>This code demonstrates how to:</p>
<ol>
<li>Create a span for the operation</li>
<li>Add business attributes</li>
<li>Add current CPU usage</li>
<li>Link everything with consistent IDs</li>
<li>Record exceptions and outcomes in the backend system</li>
</ol>
<p>When configured with an appropriate exporter, this creates enriched logs that contain both application events and their complete context.</p>
<h2>Powerful queries across previously separate data</h2>
<p>With data that has not yet been enriched, there is still hope. First, with GenAI-powered ingestion it is possible to extract key fields, such as session IDs, that help correlate data. This enriches your logs with the structure they need to behave like other signals. Below we can see Elastic's Auto Import mechanism, which automatically generates ingest pipelines and pulls unstructured information from logs into a structured format suited for analytics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image4.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image2.png" alt="" /></p>
<p>Once you have this data in the same data store, you can perform powerful join queries that were previously impossible. For example, finding slow database queries that affected specific API endpoints:</p>
<pre><code class="language-sql">FROM logs-nginx.access-default 
| LOOKUP JOIN .ds-logs-mysql.slowlog-default-2025.05.01-000002 ON request_id 
| KEEP request_id, mysql.slowlog.query, url.query 
| WHERE mysql.slowlog.query IS NOT NULL
</code></pre>
<p>This query joins web server logs with database slow query logs, allowing you to directly correlate user-facing performance with database operations.</p>
<p>For GenAI interfaces, these complex queries can be generated automatically from natural language questions:</p>
<p>&quot;Show me all checkout failures that coincided with slow database queries&quot;</p>
<p>The AI translates this into appropriate queries across your unified data store, correlating application errors with database performance.</p>
<h2>Real-world applications and use cases</h2>
<h3>Natural language investigation</h3>
<p>Imagine asking your observability system:</p>
<p>&quot;Why did checkout latency spike at 09:17 yesterday?&quot;</p>
<p>A GenAI-powered system with unified data could respond:</p>
<p>&quot;Checkout latency increased by 230% at 09:17:32 following deployment v2.4.1 at 09:15. The root cause appears to be increased MySQL query times in the inventory-service. Specifically, queries to the 'product_availability' table are taking an average of 2300ms compared to the normal 95ms. This coincides with a CPU spike on database host db-03 and 24 'Lock wait timeout' errors in the inventory service logs.&quot;</p>
<p>Here's an example of Claude Desktop connected to <a href="https://github.com/elastic/mcp-server-elasticsearch">Elastic's MCP (Model Context Protocol) Server</a> which demonstrates how powerful natural language investigations can be. Here we ask Claude &quot;analyze my web traffic patterns&quot; and as you can see it has correctly identified that this is in our demo environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image3.png" alt="" /></p>
<h3>Unknown problem detection</h3>
<p>GenAI can identify subtle patterns by correlating signals that would be missed in siloed systems. For example, it might notice that a specific customer ID appears in error logs only when a particular network path is taken through your microservices—indicating a data corruption issue affecting only certain user flows.</p>
<h3>Predictive maintenance</h3>
<p>By analyzing the unified historical patterns leading up to previous incidents, GenAI can identify emerging problems before they cause outages:</p>
<p>&quot;Warning: Current load pattern on authentication-service combined with increasing error rates in user-profile-service matches 87% of the signature that preceded the April 3rd outage. Recommend scaling user-profile-service pods immediately.&quot;</p>
<h2>The future: agentic AI for observability</h2>
<p>The next frontier is agentic AI, systems that not only analyze but take action automatically.</p>
<p>These AI agents could:</p>
<ol>
<li>Continuously monitor all observability signals</li>
<li>Autonomously investigate anomalies</li>
<li>Implement fixes for known patterns</li>
<li>Learn from the effectiveness of previous interventions</li>
</ol>
<p>For example, an observability agent might:</p>
<ul>
<li>Detect increased error rates in a service</li>
<li>Analyze logs and traces to identify a memory leak</li>
<li>Correlate with recent code changes</li>
<li>Increase the memory limit temporarily</li>
<li>Create a detailed ticket with the root cause analysis</li>
<li>Monitor the fix effectiveness</li>
</ul>
<p>This is about creating systems that understand your application's behavior patterns deeply enough to maintain them proactively. See how this works in Elastic Observability: in the screenshot below, at the end of the RCA we send an email summary, but this could trigger any action.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image1.png" alt="" /></p>
<h2>Business outcomes</h2>
<p>Unifying observability data for GenAI analysis delivers concrete benefits:</p>
<ul>
<li>Faster resolution times: Problems that previously required hours of manual correlation can be diagnosed in seconds</li>
<li>Fewer escalations: Junior engineers can leverage AI to investigate complex issues before involving specialists</li>
<li>Improved system reliability: Earlier detection and resolution of emerging issues</li>
<li>Better developer experience: Less time spent context-switching between tools</li>
<li>Enhanced capacity planning: More accurate prediction of resource needs</li>
</ul>
<h2>Implementation steps</h2>
<p>Ready to start your observability transformation? Here's a practical roadmap:</p>
<ol>
<li>Adopt OpenTelemetry: Standardize on OpenTelemetry for all telemetry data collection and use it to generate enriched logs.</li>
<li>Choose a unified storage solution: Select a platform that can efficiently store and query metrics, logs, traces and enriched logs together</li>
<li>Enrich your telemetry: Update application instrumentation to include relevant context</li>
<li>Create correlation IDs: Ensure every request has identifiers</li>
<li>Implement semantic conventions: Follow consistent naming patterns across your telemetry data</li>
<li>Start with focused use cases: Begin with high-value scenarios like checkout flows or critical APIs</li>
<li>Leverage GenAI tools: Integrate tools that can analyze your unified data and respond to natural language queries</li>
</ol>
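<p>As an illustration of the correlation-ID step above, a hypothetical request handler can mint one ID per request and attach it to every log line emitted while handling that request, for example with Python's <code>LoggerAdapter</code>:</p>
<pre><code class="language-python">import logging
import uuid

logger = logging.getLogger(&quot;checkout&quot;)  # hypothetical service logger

def handle_request():
    # One ID per request; every log line carries it, enabling later joins
    # across logs, traces, and metrics that share the same identifier.
    request_id = str(uuid.uuid4())
    log = logging.LoggerAdapter(logger, {&quot;request_id&quot;: request_id})
    log.info(&quot;request started&quot;)
    # ... business logic ...
    log.info(&quot;request finished&quot;)
    return request_id
</code></pre>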
<p>Remember, AI can only be as smart as the data you feed it. The quality and completeness of your telemetry data will determine the effectiveness of your AI-powered observability.</p>
<h2>Generative AI: an evolutionary catalyst for observability</h2>
<p>The unification of observability data for GenAI analysis represents an evolutionary leap forward comparable to the transition from Internet 1.0 to 2.0. Early adopters will gain a significant competitive advantage through faster problem resolution, improved system reliability, and more efficient operations. GenAI is a huge step toward increasing observability maturity and moving your team to a more proactive stance.</p>
<p>Think of traditional observability as a doctor trying to diagnose a patient while only able to see their heart rate. Unified observability with GenAI is like giving that doctor a complete health picture, vital signs, lab results, medical history, and genetic data all accessible through natural conversation.</p>
<p>As SREs, we stand at the threshold of a new era in system observability. The asteroid of GenAI isn't a threat to be feared, it's an opportunity to evolve our practices and tools to build more reliable, understandable systems. The question isn't whether this transformation will happen, but who will lead it.</p>
<p>Will you?</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/title.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>