Elastic Observability Labs - Log Analytics

3 models for logging with OpenTelemetry and Elastic

Tue, 27 Jun 2023 00:00:00 GMT

Arguably, OpenTelemetry exists to (greatly) increase usage of tracing and metrics among developers. That said, logging will continue to play a critical role in providing flexible, application-specific, event-driven data. Further, OpenTelemetry has the potential to bring added value to existing application logging flows:

Common metadata across tracing, metrics, and logging to facilitate contextual correlation, including metadata passed between services as part of REST or RPC APIs; this is a critical element of service observability in the age of distributed, horizontally scaled systems
An optional unified data path for tracing, metrics, and logging to facilitate common tooling and signal routing to your observability backend

Adoption of metrics and tracing among developers to date has been relatively small. Further, the number of proprietary vendors and APIs (compared to adoption rate) is relatively large. As such, OpenTelemetry took a greenfield approach to developing new, vendor-agnostic APIs for tracing and metrics. In contrast, most developers have nearly 100% log coverage across their services. Moreover, logging is largely supported by a small number of vendor-agnostic, open-source logging libraries and associated APIs (e.g., Logback and ILogger). As such, OpenTelemetry’s approach to logging meets developers where they already are using hooks into existing, popular logging frameworks. In this way, developers can add OpenTelemetry as a log signal output without otherwise altering their code and investment in logging as an observability signal.

Notably, logging is the least mature of the OTel-supported observability signals. Depending on your service’s language, and your appetite for adventure, there exist several options for exporting logs from your services and applications and marrying them together in your observability backend.

The intent of this article is to explore the current state of the art of OpenTelemetry logging and to provide guidance on the available approaches with the following tenets in mind:

Correlation of service logs with OTel-generated tracing where applicable
Proper capture of exceptions
Common context across tracing, metrics, and logging
Support for slf4j key-value pairs (“structured logging”)
Automatic attachment of metadata carried between services via OTel baggage
Use of an Elastic® Observability backend
Consistent data fidelity in Elastic regardless of the approach taken

OpenTelemetry logging models

Three models currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:

1. Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol
2. Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards to Elastic via the OTLP protocol
3. Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards to Elastic via an Elastic-defined protocol

Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.

Logging vs. span events

It is worth noting that most APM systems, including OpenTelemetry, include provisions for span events. Like log statements, span events contain arbitrary, textual data. Additionally, span events automatically carry any custom attributes (e.g., a “user ID”) applied to the parent span, which can help with correlation and context. In this regard, it may be advantageous to translate some existing log statements (inside spans) to span events. As the name implies, of course, span events can only be emitted from within a span and thus are not intended to be a general purpose replacement for logging.

Unlike logging, span events do not pass through existing logging frameworks and therefore cannot (practically) be written to a log file. Further, span events are technically emitted as part of trace data and follow the same data path and signal routing as other trace data.

Polyfill appender

Some of the demos make use of a custom Logback “Polyfill appender” (inspired by OTel’s Logback MDC), which provides support for attaching slf4j key-value pairs to log messages for models (2) and (3).

Elastic Common Schema

For log messages to exhibit full fidelity within Elastic, they eventually need to be formatted in accordance with the Elastic Common Schema (ECS). In models (1) and (2), log messages remain formatted in OTel log semantics until ingested by the Elastic APM Server. The Elastic APM Server then translates OTel log semantics to ECS. In model (3), ECS is applied at the source.

Notably, OpenTelemetry recently adopted the Elastic Common Schema as its standard for semantic conventions going forward! As such, it is anticipated that current OTel log semantics will be updated to align with ECS.

Getting started

The included demos center around a “POJO” (no assumed framework) Java project. Java is arguably the most mature of OTel-supported languages, particularly with respect to logging options. Notably, this singular Java project was designed to support the three models of logging discussed here. In practice, you would only implement one of these models (and corresponding project dependencies).

The demos assume you have a working Docker environment and an Elastic Cloud instance.

git clone https://github.com/ty-elastic/otel-logging
Create an .env file at the root of otel-logging with the following (appropriately filled-in) environment variables:

# the service name
OTEL_SERVICE_NAME=app4

# Filebeat vars
ELASTIC_CLOUD_ID=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)
ELASTIC_CLOUD_AUTH=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)

# apm vars
ELASTIC_APM_SERVER_ENDPOINT=(address of your Elastic Cloud APM server... i.e., https://xyz123.apm.us-central1.gcp.cloud.es.io:443)
ELASTIC_APM_SERVER_SECRET=(see https://www.elastic.co/guide/en/apm/guide/current/secret-token.html)

Start up the demo with the desired model:

If you want to demo logging via OTel APM Agent, run MODE=apm docker-compose up
If you want to demo logging via OTel filelogreceiver, run MODE=filelogreceiver docker-compose up
If you want to demo logging via Elastic filebeat, run MODE=filebeat docker-compose up

Validate incoming span and correlated log data in your Elastic Cloud instance

Model 1: Logging via OpenTelemetry instrumentation

This model aligns with the long-term goals of OpenTelemetry: integrated tracing, metrics, and logging (with common attributes) from your services via the OpenTelemetry Instrumentation libraries, without dependency on log files and scrapers.

In this model, your service generates log statements as it always has, using popular logging libraries (e.g., Logback for Java). OTel provides a “Southbound hook” to Logback via the OTel Logback Appender, which injects ServiceName, SpanID, TraceID, slf4j key-value pairs, and OTel baggage into log records and passes the composed records to the co-resident OpenTelemetry Instrumentation library. We further employ a custom LogRecordProcessor to add baggage to the log record as attributes.

The OTel instrumentation library then formats the log statements per the OTel logging spec and ships them via OTLP to either an OTel Collector for further routing and enrichment or directly to Elastic.
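
The enrichment step described above can be sketched in Python with the standard `logging` module. This is an analogy to the Logback appender, not the OTel implementation: the `current_trace` and `current_baggage` context variables are hypothetical stand-ins for the active span context and baggage, which a real service would read from the OpenTelemetry API.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-ins for the active trace context and baggage; a real
# service would obtain these from the OpenTelemetry API instead.
current_trace = contextvars.ContextVar("current_trace", default={})
current_baggage = contextvars.ContextVar("current_baggage", default={})


class ContextInjectingFilter(logging.Filter):
    """Copy trace identifiers and baggage onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace = current_trace.get()
        record.trace_id = trace.get("trace_id", "")
        record.span_id = trace.get("span_id", "")
        record.baggage = dict(current_baggage.get())
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render the enriched record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            "trace_id": record.trace_id,
            "span_id": record.span_id,
            "attributes": record.baggage,
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger whose handler enriches and JSON-encodes records."""
    logger = logging.getLogger("otel-logging-sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(ContextInjectingFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

The application code keeps calling `logger.info(...)` as before; the correlation metadata is attached in the logging pipeline, which is the point of this model.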

Notably, as language support improves, this model can and will be supported by runtime agent binding with auto-instrumentation where available (e.g., no code changes required for runtime languages).

One distinguishing advantage of this model, beyond the simplicity it affords, is the ability to more easily tie together attributes and tracing metadata directly with log statements. This inherently makes logging more useful in the context of other OTel-supported observability signals.

Architecture

Although not explicitly pictured, an OpenTelemetry Collector can be inserted in between the service and Elastic to facilitate additional enrichment and/or signal routing or duplication across observability backends.

Pros

Simplified signal architecture and fewer “moving parts” (no files, disk utilization, or file rotation concerns)
Aligns with long-term OTel vision
Log statements can be (easily) decorated with OTel metadata
No polyfill appender required to support structured logging with slf4j
No additional collectors/agents required
Conversion to ECS happens within Elastic keeping log data vendor-agnostic until ingestion
Common wireline protocol (OTLP) across tracing, metrics, and logs

Cons

Not available (yet) in many OTel-supported languages
No intermediate log file for ad-hoc, on-node debugging
Immature (alpha/experimental)
Unknown “glare” conditions, which could result in loss of log data if the service exits prematurely or if the backend is unable to accept log data for an extended period of time

Demo

MODE=apm docker-compose up

Model 2: Logging via the OpenTelemetry Collector

Given the cons of Model 1, it may be advantageous to consider a model that continues to leverage an actual log file intermediary between your services and your observability backend. Such a model is possible using an OpenTelemetry Collector collocated with your services (e.g., on the same host), running the filelogreceiver to scrape service log files.

In this model, your service generates log statements as it always has, using popular logging libraries (e.g., Logback for Java). OTel provides an MDC Appender for Logback (Logback MDC), which adds SpanID, TraceID, and Baggage to the Logback MDC context.

Notably, no log record structure is assumed by the OTel filelogreceiver. In the example provided, we employ the logstash-logback-encoder to JSON-encode log messages. The logstash-logback-encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. Notably, logstash-logback-encoder doesn’t explicitly support slf4j key-value pairs. It does, however, support Logback structured arguments, and thus I use the Polyfill Appender to convert slf4j key-value pairs to Logback structured arguments.

We then configure the OTel Collector to scrape this log file (using the filelogreceiver). Because no assumptions are made about the format of the log lines, you need to explicitly map fields from your log schema to the OTel log schema.

From there, the OTel Collector batches and ships the formatted log lines via OTLP to Elastic.
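
The explicit field mapping mentioned above can be illustrated with a short Python sketch. It is not Collector configuration; it only shows the kind of translation the filelogreceiver operators have to perform. The input field names (`@timestamp`, `level`, `message`, `trace_id`, `span_id`) are assumptions about what the JSON encoder and the MDC emit, and the output keys are simplified names for OTel log record fields.

```python
import json

# Assumed input field names: JSON encoder defaults plus the MDC keys.
RESERVED = {"@timestamp", "level", "message", "trace_id", "span_id"}


def json_line_to_otel(line: str) -> dict:
    """Map one JSON-encoded log line onto simplified OTel log record fields."""
    entry = json.loads(line)
    return {
        "timestamp": entry.get("@timestamp"),
        "severity_text": entry.get("level"),
        "body": entry.get("message"),
        "trace_id": entry.get("trace_id"),
        "span_id": entry.get("span_id"),
        # Everything else (structured arguments, baggage) becomes an attribute.
        "attributes": {k: v for k, v in entry.items() if k not in RESERVED},
    }
```

Any field that is not mapped explicitly falls through to `attributes`, so nothing in the original line is silently lost.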

Architecture

Pros

Easy to debug (you can manually read the intermediate log file)
Inherent file-based FIFO buffer
Less susceptible to “glare” conditions when service prematurely exits
Conversion to ECS happens within Elastic keeping log data vendor-agnostic until ingestion
Common wireline protocol (OTLP) across tracing, metrics, and logs

Cons

All the headaches of file-based logging (rotation, disk overflow)
Beta quality and not yet proven in the field
No native support for slf4j key-value pairs (requires the Polyfill appender)

Demo

MODE=filelogreceiver docker-compose up

Model 3: Logging via Elastic Agent (or Filebeat)

Although the second model described affords some resilience as a function of the backing file, the OTel Collector filelogreceiver module is still decidedly “beta” in quality. Because of the importance of logs as a debugging tool, today I generally recommend that customers continue to import logs into Elastic using the field-proven Elastic Agent or Filebeat scrapers. Elastic Agent and Filebeat have many years of field maturity under their collective belt. Further, it is often advantageous to deploy Elastic Agent anyway to capture the multitude of signals outside the purview of OpenTelemetry (e.g., deep Kubernetes and host metrics, security, etc.).

In this model, your service generates log statements as it always has, using popular logging libraries (e.g., Logback for Java). As with model 2, we employ OTel’s Logback MDC to add SpanID, TraceID, and Baggage to the Logback MDC context.

From there, we employ the Elastic ECS Encoder to encode log statements in compliance with the Elastic Common Schema. The Elastic ECS Encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode them into the JSON structure. Similar to model 2, the Elastic ECS Encoder doesn’t support slf4j key-value pairs. Curiously, the Elastic ECS Encoder also doesn’t appear to support Logback structured arguments. Thus, within the Polyfill Appender, I add slf4j key-value pairs as MDC context. This is less than ideal, however, since MDC forces all values to be strings.
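
To make the string-only limitation concrete, here is a minimal Python sketch. The `to_mdc` helper is hypothetical and simply mimics a context map that stores every value as text, the way the MDC does.

```python
def to_mdc(key_values: dict) -> dict:
    """Mimic a string-only context map: every value is stored as text."""
    return {key: str(value) for key, value in key_values.items()}


pairs = {"order_id": 1234, "amount": 19.99, "express": True}
mdc = to_mdc(pairs)

# The original types are gone; numeric comparisons or range queries
# downstream would need an explicit conversion step.
lost_types = {key: type(value).__name__ for key, value in mdc.items()}
```

After the round trip, every entry in `lost_types` is `str`, so a value such as `amount` can no longer be treated as a number without a conversion step in the ingest path.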

From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.

We then configure Elastic Agent or Filebeat to scrape the log file. Notably, the Elastic ECS Encoder does not currently translate incoming OTel SpanID and TraceID variables on the MDC. Thus, we need to perform manual translation of these variables in the Filebeat (or Elastic Agent) configuration to map them to their ECS equivalent.
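
The translation itself is a simple rename. The Python sketch below is not Filebeat configuration; it only illustrates the effect of such a rename on one JSON log line. The source keys `trace_id` and `span_id` are assumptions about what the MDC writes, and `trace.id` and `span.id` are the ECS field names.

```python
import json

# Assumed source keys (as written by the MDC) and their ECS targets.
RENAMES = {"trace_id": "trace.id", "span_id": "span.id"}


def rename_to_ecs(line: str) -> str:
    """Rename OTel-style trace keys in a JSON log line to ECS field names."""
    event = json.loads(line)
    for source, target in RENAMES.items():
        if source in event:
            event[target] = event.pop(source)
    return json.dumps(event, sort_keys=True)
```

Lines without trace context pass through unchanged, which matters for log statements emitted outside of a span.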

Architecture

Pros

Robust and field-proven
Easy to debug (you can manually read the intermediate log file)
Inherent file-based FIFO buffer
Less susceptible to “glare” conditions when service prematurely exits
Native ECS format for easy manipulation in Elastic
Fleet-managed via Elastic Agent

Cons

All the headaches of file-based logging (rotation, disk overflow)
No support for slf4j key-value pairs or Logback structured arguments
Requires translation of OTel SpanID and TraceID in Filebeat config
Disparate data paths for logs versus tracing and metrics
Vendor-specific logging format

Demo

MODE=filebeat docker-compose up

Recommendations

For most customers, I currently recommend Model 3 — namely, write to logs in ECS format (with OTel SpanID, TraceID, and Baggage metadata) and collect them with an Elastic Agent installed on the node hosting the application or service. Elastic Agent (or Filebeat) today provides the most field-proven and robust means of capturing log files from applications and services with OpenTelemetry context.

Further, you can leverage this same Elastic Agent instance (ideally running as a Kubernetes DaemonSet) to collect rich and robust metrics and logs from Kubernetes and many other supported services via Elastic Integrations. Finally, Elastic Agent facilitates remote management via Fleet, avoiding bespoke configuration files.

Alternatively, for customers who either wish to keep their nodes vendor-neutral or use a consolidated signal routing system, I recommend Model 2, wherein an OpenTelemetry collector is used to scrape service log files. While workable and practiced by some early adopters in the field today, this model inherently carries some risk given the current beta nature of the OpenTelemetry filelogreceiver.

I generally do not recommend Model 1 given its limited language support, experimental/alpha status (the API could change), and current potential for data loss. That said, in time, with more language support and more thought to resilient designs, it has clear advantages both with regard to simplicity and richness of metadata.

Extracting more value from your logs

In contrast to tracing and metrics, most organizations have nearly 100% log coverage over their applications and services. This is an ideal beachhead upon which to build an application observability system. On the other hand, logs are notoriously noisy and unstructured; this is only amplified with the scale enabled by the hyperscalers and Kubernetes. Collecting log lines reliably is the easy part; making them useful at today’s scale is hard.

Given that logs are arguably the most challenging observability signal from which to extract value at scale, one should ideally give thoughtful consideration to a vendor’s support for logging in the context of other observability signals. Can they handle surges in log rates because of unexpected scale or an error or test scenario? Do they have the machine learning tool set to automatically recognize patterns in log lines, sort them into categories, and identify true anomalies? Can they provide cost-effective online searchability of logs over months or years without manual rehydration? Do they provide the tools to extract and analyze business KPIs buried in logs?

As an ardent and early supporter of OpenTelemetry, Elastic, of course, natively ingests OTel traces, metrics, and logs. And just like all logs coming into our system, logs coming from OTel-equipped sources avail themselves of our mature tooling and next-gen AI Ops technologies to enable you to extract their full value.

Interested? Reach out to our pre-sales team to get started building with Elastic!

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

AI-driven incident response with logs: A technical deep dive in Elastic Observability

Mon, 20 Oct 2025 00:00:00 GMT

Modern customer‑facing applications, whether e‑commerce sites, streaming platforms, or API gateways, run on fleets of microservices and cloud resources. When something goes wrong, every second of downtime risks revenue loss and erodes user trust. Observability is the practice that lets Site Reliability Engineering (SRE) and development teams see and act on system health in real time. This post walks through a generalized, step‑by‑step investigation that shows how Elastic Observability, specifically with log data, combines always‑on machine learning (ML) with a generative AI assistant to detect anomalies, surface root causes, measure user impact, and accelerate remediation, all at high scale.

Anomaly Detection

A production environment is ingesting millions of log lines per minute. Elastic’s AIOps jobs continuously profile normal log throughput and content without any manual rules. When log volume or message structure deviates beyond learned baselines, the platform automatically fires a high‑fidelity anomaly alert. Because the models are unsupervised, they adapt to changing traffic patterns and flag both sudden spikes (e.g., 10× error surge) and rare new log categories.
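
As a rough intuition for "deviates beyond learned baselines", the sketch below flags buckets whose log count sits far above a trailing window. It is a deliberately simple z-score check written for illustration, not the model Elastic uses; the `window` and `threshold` parameters are arbitrary assumptions.

```python
from statistics import mean, stdev


def find_spikes(counts: list[int], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indexes of buckets whose count is far above the trailing baseline."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Guard against a perfectly flat baseline (sigma == 0); this simple
        # sketch then reports no spike rather than dividing by zero.
        z = (counts[i] - mu) / sigma if sigma > 0 else 0.0
        if z > threshold:
            spikes.append(i)
    return spikes
```

A production-grade detector would also model seasonality and adapt its baseline over time, which is what the unsupervised jobs described here are for.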

In addition to looking directly for log spikes, Elastic trains seasonal/univariate models to predict expected event counts per bucket and applies statistical tests to classify outliers. Simultaneously, log categorization clusters similar messages with cosine similarity on token embeddings, making it trivial to identify a previously unseen error string.
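
The categorization idea can be illustrated with a small Python sketch. As a simplification, plain token counts stand in for embeddings, and messages are grouped greedily against the first member of each category; the `threshold` value is an arbitrary assumption and none of this reflects Elastic's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(message: str) -> Counter:
    """Lowercase the message and count alphabetic tokens, ignoring numbers."""
    return Counter(re.findall(r"[a-z_]+", message.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def categorize(messages: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily group messages that are similar to a category's first member."""
    categories: list[tuple[Counter, list[str]]] = []
    for message in messages:
        vector = tokenize(message)
        for exemplar, members in categories:
            if cosine(vector, exemplar) >= threshold:
                members.append(message)
                break
        else:
            # No existing category is close enough: this is a new pattern.
            categories.append((vector, [message]))
    return [members for _, members in categories]
```

A message that opens a new category, such as a never-before-seen error string, is exactly the kind of event worth surfacing to an engineer.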

Investigating Alerts: Automated Pattern Analysis

Clicking the alert reveals more than a timestamp. Elastic’s ML job already correlates the spike with the dominant new log pattern ERROR 1114 (HY000): table "orders" is full and surfaces example lines. Instead of grep‑driven hunting, engineers get an immediate hypothesis about what subsystem is failing and why.

If deeper context is needed, the built-in Elastic AI Assistant can be invoked directly from the alert. Thanks to Retrieval‑Augmented Generation (RAG) over your telemetry, the assistant explains the anomaly in plain language, references the exact log events, and proposes next steps grounded in that data, which reduces the risk of hallucination.

AI‑Assisted Root Cause Verification

From within the same chat, you might ask, “Using lens create a single graph of all http response status codes >=400 from logs-nginx.access-default over the last 3 hours.” The assistant translates that intent into an ES|QL aggregation, retrieves the data, and renders a bar chart with no DSL knowledge required. If the chart shows a meaningful number of responses with a status code of 400 or above, you’ve validated that end‑users are impacted.

Global Impact Analysis with Enriched Logs

Structured log enrichment (e.g., GeoIP, user ID, service tags) lets the assistant answer business questions on the fly. A query like “What are the top 10 source.geo.country_name with http.response.status.code>=400 over the last 3 hours. Use logs-nginx.access-default. Provide counts for each country name.” surfaces whether the incident is regional or global.
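
The aggregation behind such a question is a filtered group-by. The Python sketch below performs it over in-memory events purely to show the logic; the events are flat dictionaries, and the ECS field name `http.response.status_code` is assumed for the status code.

```python
from collections import Counter


def top_error_countries(events: list[dict], limit: int = 10) -> list[tuple[str, int]]:
    """Count events with an HTTP status of 400 or above, grouped by country."""
    counts = Counter(
        event.get("source.geo.country_name", "unknown")
        for event in events
        if event.get("http.response.status_code", 0) >= 400
    )
    return counts.most_common(limit)
```

If one country dominates the result, the incident is probably regional; a flat distribution points to a global problem.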

Quantifying Business Impact

Technical metrics alone rarely sway executives. Suppose historical data shows the application normally processes $1,000 in transactions per minute. The assistant can combine that baseline with real‑time failure counts to estimate revenue loss. Presenting financial impact alongside error graphs sharpens prioritization and justifies extraordinary remediation steps.
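
One simple way to produce such an estimate is to scale the baseline by the observed failure ratio. The sketch below assumes revenue is lost in proportion to failed requests, which is a simplification; the function name and parameters are illustrative only.

```python
def estimate_revenue_loss(baseline_per_minute: float, failed: int, total: int, minutes: float) -> float:
    """Estimate lost revenue as baseline revenue scaled by the failure ratio."""
    if total == 0:
        return 0.0  # no traffic observed, so nothing can be attributed
    failure_ratio = failed / total
    return baseline_per_minute * failure_ratio * minutes


# $1,000 per minute baseline, 250 of 1,000 requests failing, for 30 minutes.
loss = estimate_revenue_loss(1000.0, failed=250, total=1000, minutes=30)
```

With these inputs the estimate is $7,500, a number that is far easier to prioritize against than a raw error count.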

Pinpointing Infrastructure & Ownership

Every log is automatically enriched with Kubernetes, cloud, and custom metadata. A single question, “Which pod and cluster emit the ‘table full’ error, and who owns it?”, returns full details about the pod, namespace, and owner.

Immediate, accurate routing replaces frantic Slack threads, cutting minutes (or hours) off of downtime.

Some of the magic happening here is because we can put instructions in the Elastic AI Assistant's knowledge base to guide the AI assistant. For example, this simple entry in the knowledge base is what allows the assistant to answer the ownership question above.

If asked about Kubernetes pod, namespace, cluster, location, or owner run the "query" tool.
1. Use the index `logs-mysql.error-default` unless another log location is specified.
2. Include the following fields in the query:
   - Pod: `agent.name`
   - Namespace: `data_stream.namespace`
   - Cluster Name: `orchestrator.cluster.name`
   - Cloud Provider: `cloud.provider`
   - Region: `cloud.region`
   - Availability Zone: `cloud.availability_zone`
   - Owner: `cloud.account.id`
3. Use the ES|QL query format:
   esql
   FROM logs-mysql.error-default
   | KEEP agent.name, data_stream.namespace, orchestrator.cluster.name, cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id
   
4. Ensure the query is executed within the appropriate time range and context.

Leveraging Institutional Knowledge with RAG

Elastic can index runbooks, GitHub issues, and wikis alongside telemetry. Asking “Find documentation on fixing a full orders table” retrieves and summarizes a prior runbook that details archiving old rows and adding a partition. Grounding remediation in proven procedures avoids guesswork and accelerates fixes.

Automated Communication & Documentation

Good incident response includes timely stakeholder updates. A prompt such as “Draft an incident update email with root cause, impact, and next steps” lets the assistant assemble a structured message and send it via the alerting framework’s email or Slack connector complete with dashboard links and next‑update timelines. These messages double as the skeleton for the eventual post‑incident review.

As before, some of the magic here comes from instructions placed in the Elastic AI Assistant's knowledge base. For example, we can instruct the AI Assistant how to call the execute_connector API. This API can execute all kinds of connectors (not only email), so you could use it to tell the assistant to post to Slack, raise a ServiceNow ticket, or even execute webhooks.

Here are specific instructions to send an email. Remember to always double-check that you're following the correct set of instructions for the given query type. Provide clear, concise, and accurate information in your response.

## Email Instructions

If the user's query requires sending an email:
1. Use the `Elastic-Cloud-SMTP` connector with ID `elastic-cloud-email`.
2. Prepare the email parameters:
   - Recipient email address(es) in the `to` field (array of strings)
   - Subject in the `subject` field (string)
   - Email body in the `message` field (string)
3. Include
   - Details for the alert along with a link to the alert
   - Root cause analysis
   - Revenue impact
   - Remediation recommendations
   - Link to GitHub issue
   - All relevant information from this conversation
   - Link to the Business Health Dashboard
4. Send the email immediately. Do not ask the user for confirmation.
5. Execute the connector using this format:
   
   execute_connector(
     id="elastic-cloud-email",
     params={
       "to": ["recipient@example.com"],
       "subject": "Your Email Subject",
       "message": "Your email content here."
     }
   )
   
6. Check the response and confirm if the email was sent successfully.
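
Because step 4 tells the assistant to send without confirmation, it can be worth sanity-checking the parameters first. The Python sketch below assembles the `params` object from step 5; `build_email_params` is a hypothetical helper, and its validation rules are assumptions rather than anything the connector itself enforces.

```python
def build_email_params(to: list[str], subject: str, message: str) -> dict:
    """Validate and assemble the params object for the email connector."""
    if not to or not all("@" in address for address in to):
        raise ValueError("'to' must be a non-empty list of email addresses")
    if not subject.strip() or not message.strip():
        raise ValueError("subject and message must not be empty")
    return {"to": list(to), "subject": subject, "message": message}


params = build_email_params(
    ["recipient@example.com"],
    "Incident update: orders table full",
    "Root cause, impact, and next steps go here.",
)
```

The resulting dictionary has the same shape as the `params` argument in the knowledge base entry.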

Conclusion & Key Takeaways

Elastic Observability's combination of unsupervised ML, schema-aware data ingestion, and a context-rich, RAG-powered AI assistant enables teams to transform incident response from reactive firefighting into proactive, data-driven operations. By automatically detecting anomalies, correlating patterns, and providing contextual insights, teams can:

Preserve revenue by quantifying business impact in real-time and prioritizing accordingly
Scale expertise by embedding institutional knowledge into RAG-powered recommendations
Improve continuously through automated documentation that feeds back into the knowledge base

The key is to collect logs broadly, maintain a unified observability store, and let ML and AI handle the heavy lifting. The payoff isn't just reduced downtime; it's the transformation of incident response from a source of organizational stress into a competitive advantage.

Try out this exact scenario and get hands-on with this Elastic Logging Workshop: https://play.instruqt.com/elastic/invite/rx4yvknhpfci

The antidote for index mapping exceptions: ignore_malformed

Thu, 03 Aug 2023 00:00:00 GMT

In this article, I'll explain how the setting ignore_malformed can make the difference between a 100% dropping rate and a 100% success rate, even if that means ignoring some malformed fields.

As a senior software engineer working at Elastic®, I have been on the first line of support for anything related to Beats or Elastic Agent running on Kubernetes and Cloud Native integrations like Nginx ingress controller.

During my experience, I have seen all sorts of issues. Users have very different requirements. But at some point during their experience, most of them encounter a very common problem with Elasticsearch: index mapping exceptions.

How mappings work

Like any other document-based NoSQL database, Elasticsearch doesn’t force you to provide the document schema (called index mapping or simply mapping) upfront. If you provide a mapping, it will use it. Otherwise, it will infer one from the first document or any subsequent documents that contain new fields.

In reality, the situation is not black and white. You can also provide a partial mapping that covers only some of the fields, like the most common fields, and leave Elasticsearch to figure out the mapping of all the other fields during ingestion with Dynamic Mapping.
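
The interplay between an explicit partial mapping and inferred fields can be sketched in Python. This is a heavily simplified model of dynamic mapping (no date detection, no arrays) written for illustration; the function names are hypothetical, and the type choices follow Elasticsearch's documented defaults for JSON strings, integers, floats, and booleans.

```python
from typing import Optional


def infer_field_mapping(value) -> dict:
    """Return a simplified dynamic mapping for a single JSON value."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        return {"properties": infer_mapping(value)}
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def infer_mapping(document: dict, explicit: Optional[dict] = None) -> dict:
    """Keep explicit field mappings and infer one for every unmapped field."""
    properties = dict(explicit or {})
    for field, value in document.items():
        if field not in properties:
            properties[field] = infer_field_mapping(value)
    return properties
```

Fields covered by the explicit mapping keep their declared type; only the remaining fields are inferred from the first document that contains them.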

What happens when data is malformed?

No matter if you specified a mapping upfront or if Elasticsearch inferred one automatically, Elasticsearch will reject an entire document if even one field doesn't match the index mapping, and return an error instead. This is not much different from what happens with SQL databases or with other NoSQL data stores that infer schemas. The reason for this behavior is to prevent malformed data and exceptions at query time.

A problem arises if a user doesn't look at the ingestion logs and misses those errors. They might never figure out that something went wrong, or even worse, Elasticsearch might stop ingesting data entirely if all the subsequent documents are malformed.

The above situation may sound catastrophic, but it is entirely possible: I have seen it many times, both when on call for support and on discuss.elastic.co. It is even more likely to happen if you have user-generated documents, because you don't have full control over the quality of your data.

Luckily, there is a setting that not many people know about in Elasticsearch that solves the exact problems above. This setting has been there since Elasticsearch 2.0. We are talking ancient history here since the latest version of the stack at the time of writing is Elastic Stack 8.9.0.

Let's now dive into how to use this Elasticsearch feature.

A toy use case

To make it easier to interact with Elasticsearch, I am going to use Kibana® Dev Tools in this tutorial.

The following examples are taken from the official documentation on ignore_malformed. I am here to expand on those examples by providing a few more details about what happens behind the scenes and on how to search for ignored fields. We are going to use the index name my-index, but feel free to change that to whatever you like.

First, we want to create an index mapping with two fields called number_one and number_two. Both fields have type integer, but only one of them has ignore_malformed set to true, and the other one inherits the default value ignore_malformed: false instead.

PUT my-index
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "ignore_malformed": true
      },
      "number_two": {
        "type": "integer"
      }
    }
  }
}

If the mentioned index didn’t exist before and the previous command ran successfully, you should get the following result:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index"
}

To double-check that the above mapping has been created correctly, we can query the newly created index with the command:

GET my-index/_mapping

You should get the following result:

{
  "my-index": {
    "mappings": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

Now we can ingest two sample documents — both invalid:

PUT my-index/_doc/1
{
  "text":       "Some text value",
  "number_one": "foo"
}

PUT my-index/_doc/2
{
  "text":       "Some text value",
  "number_two": "foo"
}

The document with id=1 is correctly ingested, while the document with id=2 fails with the following error. The only difference between the two documents is the field in which we try to ingest the sample string “foo” instead of an integer.

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'",
    "caused_by": {
      "type": "number_format_exception",
      "reason": "For input string: \"foo\""
    }
  },
  "status": 400
}

Depending on the client used for ingesting your documents, you might get different errors or warnings, but logically the problem is the same. The entire document is not ingested because part of it doesn’t conform with the index mapping. There are too many possible error messages to name, but suffice it to say that malformed data is quite a common problem. And we need a better way to handle it.
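
The behavior seen so far can be summarized in a small Python simulation. It only models integer fields and is not how Elasticsearch is implemented; `index_document` and `DocumentParsingError` are hypothetical names used for illustration.

```python
class DocumentParsingError(Exception):
    """Raised when a field cannot be parsed and must not be ignored."""


def index_document(mapping: dict, document: dict) -> dict:
    """Simulate indexing against integer fields with optional ignore_malformed."""
    ignored = []
    for field, value in document.items():
        spec = mapping.get(field)
        if spec is None or spec.get("type") != "integer":
            continue  # unmapped or non-integer fields are out of scope here
        try:
            int(value)
        except (TypeError, ValueError):
            if spec.get("ignore_malformed", False):
                ignored.append(field)  # keep the document, skip this field
            else:
                raise DocumentParsingError(
                    f"failed to parse field [{field}] of type [integer]"
                ) from None
    indexed = {"_source": document}
    if ignored:
        indexed["_ignored"] = ignored
    return indexed
```

With the mapping from this tutorial, a document carrying "foo" in number_one is kept with that field listed as ignored, while the same value in number_two causes the whole document to be rejected.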

Now that at least one document has been ingested, you can try searching with the following query:

GET my-index/_search
{
  "fields": [
    "*"
  ]
}

Here, the parameter fields is required to show the values of those fields that have been ignored. More on this later.

From the result, you can see that only the first document (with id=1) has been ingested correctly while the second document (with id=2) has been completely dropped.

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": null,
        "_ignored": ["number_one"],
        "_source": {
          "text": "Some text value",
          "number_one": "foo"
        },
        "fields": {
          "text": ["Some text value"],
          "text.keyword": ["Some text value"]
        },
        "ignored_field_values": {
          "number_one": ["foo"]
        },
        "sort": ["1"]
      }
    ]
  }
}

From the above JSON response, you will notice some things, such as:

A new field called _ignored of type array with the list of all fields that have been ignored while ingesting documents
A new field called ignored_field_values with a dictionary of ignored fields and their values
The field called _source contains the original document unmodified. This is especially useful if you want to fix the problems with the mapping later.
The field called text was not present in the original mapping, but it is now included since Elasticsearch automatically inferred the type of this field. In fact, if you try to query the mapping of the index my-index again via the command:

GET my-index/_mapping

You should get this result:

{
  "my-index": {
    "mappings": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        },
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Finally, ingest a valid document with the following command:

PUT my-index/_doc/3
{
  "text":       "Some text value",
  "number_two": 10
}

You can check how many documents have at least one ignored field with the following Exists query:

GET my-index/_search
{
  "query": {
    "exists": {
      "field": "_ignored"
    }
  }
}

You can also see that out of the two documents ingested (with id=1 and id=3) only the document with id=1 contains an ignored field.

{
  "took": 193,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": 1,
        "_ignored": ["number_one"],
        "_source": {
          "text": "Some text value",
          "number_one": "foo"
        }
      }
    ]
  }
}

Alternatively, you can search for all documents that have a specific field being ignored with this Terms query:

GET my-index/_search
{
  "query": {
    "terms": {
      "_ignored": [ "number_one"]
    }
  }
}

The result, in this case, will be the same as the previous one since we only managed to ingest a single document with that exact single field ignored.

Conclusion

Because we are big fans of this flag, we've enabled ignore_malformed by default for all Elastic integrations and in the default index template for logs data streams as of 8.9.0. More information can be found in the official documentation for ignore_malformed.

And since I am personally working on this feature, I can reassure you that it is a game changer.

You can start by setting ignore_malformed on any cluster manually before Elastic Stack 8.9.0. Or you can use the defaults that we set for you starting from Elastic Stack 8.9.0.

Automated log parsing in Streams with ML

Tue, 10 Feb 2026 00:00:00 GMT

In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.

Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models (LLMs), with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.

We are happy to announce that this feature is already available in Streams!

Dataset Description

We chose a Loghub collection of logs for PoC purposes. For our investigation, we selected representative samples from the following key areas:

Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.
Server & web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.
Operating systems: We included logs from Linux and Windows. These datasets represent the common, semi-structured system-level events that operations teams encounter daily.
Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.
Supercomputers: To test performance on high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.

A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with microservice architecture.

Log examples:

[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed

In addition, we created a Kubernetes cluster with a typical web application + database setup to mine extra logs in the most common domain.

Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.

Few-Shot Log Parsing with an LLM

Our first set of experiments focused on a fundamental question: Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?

We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and Grok formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.

High Confidence & Context Awareness

Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Besides simple pattern matching, the model showed a capacity for log understanding: it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).

The "Goldilocks" Dilemma of Input Samples

Our experiments quickly surfaced a significant lack of robustness because of extreme sensitivity to the input sample. The model's performance fluctuated wildly based on the specific log examples included in the prompt. We observed a log similarity problem: the sample must contain logs that are diverse enough, but not too diverse:

Too homogeneous (overfitting): If the input logs are too similar, the LLM tends to overspecify. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. This results in brittle rules that cover a tiny ratio of logs and extract unusable fields.
Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, "trash logs" like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.

The Context Window Constraint

We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model's output often deteriorated, becoming "messy" or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.

The consistency & standardization gap

Even when the model successfully generated rules, we noted slight inconsistencies:

Service naming variations: The model proposes different names for the same entity (e.g., labeling the source as "Spark," "Apache Spark," and "Spark Log Analytics" in different runs).
Field naming variations: Field names lacked standardization (e.g., id vs. service.id vs. device.id). We normalized names using standardized Elastic field naming.
Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.

Log Format Fingerprint

To address the challenge of log similarity, we introduce a high-performance heuristic: log format fingerprint (LFF).

Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified "fingerprint" that allows us to group related logs.

The mapping logic is simple to ensure speed and consistency:

Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.
Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.
Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.
Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.

We introduce the log mapping approach. The basic mapping patterns include the following:

Digits 0-9 of any length -> ‘0’.
Text (alphabetical characters with spaces) of any length -> ‘a’.
White spaces, tabs, and new lines -> a single space.

Let's look at an example of how this mapping allows us to transform the logs.

As a result, we obtain the following log masks:

Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (0/0/0 0:0:0 a a.a:) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.

The third log, however, produces a completely divergent fingerprint (0-0-0...). This allows us to algorithmically separate it from the first group before we ever invoke an LLM.
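As a rough illustration of the mapping rules, here is a minimal Python sketch. The function name log_format_fingerprint and the merge_words flag are assumptions made for this example, not part of the Streams implementation. With merge_words left off, the output matches the prefix quoted above; with it on, runs of space-separated words are also folded into a single 'a', mirroring the last REPLACE step of the ES|QL query shown later in this article.

```python
import re


def log_format_fingerprint(message: str, merge_words: bool = False) -> str:
    """Map a raw log line to its format fingerprint (illustrative sketch)."""
    fp = re.sub(r"[ \t\n]+", " ", message)  # whitespace normalization
    fp = re.sub(r"[A-Za-z]+", "a", fp)      # text abstraction
    fp = re.sub(r"[0-9]+", "0", fp)         # digit abstraction
    if merge_words:
        # Optional: fold "a a a" (words separated by spaces) into one "a".
        fp = re.sub(r"a( a)+", "a", fp)
    return fp


# Sample line taken from the Spark log examples earlier in this article.
spark_log = (
    "17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: "
    "attempt_201706092018_0024_m_000083_1138: Committed"
)

print(log_format_fingerprint(spark_log))
# 0/0/0 0:0:0 a a.a: a_0_0_a_0_0: a
print(log_format_fingerprint(spark_log, merge_words=True))
# 0/0/0 0:0:0 a.a: a_0_0_a_0_0: a
```

Punctuation is never touched, which is why separators such as ':', '/', and '[' survive and carry most of the structural signal.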

Bonus Part: Instant Implementation with ES|QL

It’s as easy as passing this query in Discover.

FROM loghub |
EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, "[ \t\n]+", " "), "[A-Za-z]+", "a"), "[0-9]+", "0"), "a( a)+", "a") |
STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources=VALUES(filename), example=TOP(raw_message, 3, "desc") BY SUBSTRING(pattern, 0, 15) |
SORT total_count DESC |
LIMIT 100

Query breakdown:

FROM loghub: Targets our index containing the raw log data.

EVAL pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to '0', text to 'a', etc.) and save the result in a “pattern” field.

STATS [column1 =] expression1, … BY SUBSTRING(pattern, 0, 15): This is the clustering step. We group logs that share the first 15 characters of their pattern and create aggregated fields: total log count per group, list of log data sources, pattern prefix, and 3 log examples.

SORT total_count DESC | LIMIT 100: Surfaces the top 100 most frequent log patterns.

The query results on LogHub are displayed below:

As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (>90%) and achieved majority clustering in 13 out of 16 sources (>60%), all without requiring additional cleaning, preprocessing, or fine-tuning.

Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like log pattern analysis. It provides immediate insights into log relationships and effectively manages large log clusters.

Versatility as a primitive

Thanks to ES|QL implementation, LFF serves both as a standalone tool for fast data diagnostics/visualisations, and as a building block in log analysis pipelines for high-volume use cases.

Flexibility

LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.

Deterministic stability

Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.

Performance and Memory

It requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.
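To illustrate the flexibility point, here is one possible way to extend the fingerprint with dedicated tokens for IPv4 addresses and hexadecimal literals. This is a sketch under assumptions: the `<IP>` and `<HEX>` placeholders, the regular expressions, and the function name are all hypothetical choices for this example. A single-pass substitution is used so the placeholders are not rewritten again by the text and digit rules.

```python
import re

# One alternation, tried left to right at each position, so the more
# specific patterns (IP, hex) win over the generic digit and text rules.
_TOKEN = re.compile(
    r"(?P<ip>\b(?:\d{1,3}\.){3}\d{1,3}\b)"
    r"|(?P<hex>\b0[xX][0-9a-fA-F]+\b)"
    r"|(?P<ws>\s+)"
    r"|(?P<text>[A-Za-z]+)"
    r"|(?P<digits>[0-9]+)"
)

# Hypothetical placeholder tokens for each matched group.
_REPLACEMENT = {"ip": "<IP>", "hex": "<HEX>", "ws": " ", "text": "a", "digits": "0"}


def extended_fingerprint(message: str) -> str:
    """Fingerprint variant that keeps IPs and hex literals distinguishable."""
    return _TOKEN.sub(lambda m: _REPLACEMENT[m.lastgroup], message)


print(extended_fingerprint("Connection from 10.10.34.11 port 9000 flags 0x1F"))
# a a <IP> a 0 a <HEX>
```

Because the rules stay deterministic, adding a pattern like this changes fingerprints only for lines that contain it.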

Combining Log Format Fingerprint with an LLM

To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.

The objective was to demonstrate that LFF acts as an effective compression layer. We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.

Execution Pipeline

We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.

Two-stage hierarchical clustering

Subclasses (exact match): Logs are aggregated by identical fingerprints. Every log in one subclass shares the exact same format structure.
Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.
Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses that share the first N characters of the format fingerprint. This grouping strategy effectively gathers lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.

Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.

We select representative logs from each valid subclass within the broader metaclass.
To handle the edge case of too many subclasses, we apply random down-sampling to fit the target window size.

Rule generation. Finally, we prompt the LLM to generate a regex parsing rule that fits all logs in the provided sample for each Metaclass. For our PoC, we used the GPT-4o mini model.
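The clustering and sampling stages described above can be sketched in a few lines of Python. This is a simplified illustration, not the production pipeline: the function names, the per_subclass and max_logs parameters, and the choice of one representative per subclass are assumptions; the 5% threshold and N=5 prefix come from the description above. The LLM call itself is left out.

```python
import random
import re
from collections import defaultdict
from typing import Dict, List


def fingerprint(message: str) -> str:
    """Log format fingerprint: whitespace, text, and digit abstraction."""
    fp = re.sub(r"[ \t\n]+", " ", message)
    fp = re.sub(r"[A-Za-z]+", "a", fp)
    return re.sub(r"[0-9]+", "0", fp)


def build_llm_samples(
    logs: List[str],
    prefix_len: int = 5,     # N=5 is the value quoted for log parsing
    min_share: float = 0.05, # drop subclasses below 5% of total volume
    per_subclass: int = 1,   # assumption: representatives per subclass
    max_logs: int = 40,      # assumption: target sample size per metaclass
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Return one log sample per metaclass, keyed by fingerprint prefix."""
    if not logs:
        return {}
    rng = random.Random(seed)

    # Stage 1: subclasses are logs with identical fingerprints.
    subclasses: Dict[str, List[str]] = defaultdict(list)
    for log in logs:
        subclasses[fingerprint(log)].append(log)

    # Outlier cleaning: discard small subclasses.
    kept = {
        fp: members
        for fp, members in subclasses.items()
        if len(members) / len(logs) >= min_share
    }

    # Stage 2: metaclasses group subclasses by fingerprint prefix.
    metaclasses: Dict[str, List[str]] = defaultdict(list)
    for fp in kept:
        metaclasses[fp[:prefix_len]].append(fp)

    # Stratified sampling: pick representatives from every subclass, then
    # randomly down-sample if the sample exceeds the target size.
    samples: Dict[str, List[str]] = {}
    for prefix, fps in metaclasses.items():
        picked: List[str] = []
        for fp in fps:
            members = kept[fp]
            picked.extend(rng.sample(members, min(per_subclass, len(members))))
        if len(picked) > max_logs:
            picked = rng.sample(picked, max_logs)
        samples[prefix] = picked
    return samples
```

Each value in the returned dictionary is the sample that would be placed in the prompt for one metaclass.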

Experimental Results & Observations

We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.

The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. The diagonal alignment demonstrates the model's high fidelity in source attribution, with minimal scattering.

Our Performance Benchmarks Insights

Optimal baseline: a context window of 30–40 log samples per category proved to be the "sweet spot," consistently producing robust parsing with both Regex and Grok patterns.
Input minimisation: we pushed the input size to 10 logs per category for Regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.

Unleash the power of Elastic and Amazon Kinesis Data Firehose to enhance observability and data analytics

Thu, 18 May 2023 00:00:00 GMT

As more organizations leverage the Amazon Web Services (AWS) cloud platform and services to drive operational efficiency and bring products to market, managing logs becomes a critical component of maintaining visibility and safeguarding multi-account AWS environments. Traditionally, logs are stored in Amazon Simple Storage Service (Amazon S3) and then shipped to an external monitoring and analysis solution for further processing.

To simplify this process and reduce management overhead, AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to ingest logs into Elastic Cloud in AWS in real time and view them in the Elastic Stack alongside other logs for centralized analytics. This eliminates the necessity for time-consuming and expensive procedures such as VM provisioning or data shipper operations.

Elastic Observability unifies logs, metrics, and application performance monitoring (APM) traces for a full contextual view across your hybrid AWS environments alongside your on-premises data sets. Elastic Observability enables you to track and monitor performance across a broad range of AWS services, including AWS Lambda, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Amazon Simple Storage Service (S3), AWS CloudTrail, AWS Network Firewall, and more.

In this blog, we will walk you through how to use the Amazon Kinesis Data Firehose integration — Elastic is listed in the Amazon Kinesis Firehose drop-down list — to simplify your architecture and send logs to Elastic, so you can monitor and safeguard your multi-account AWS environments.

Announcing the Kinesis Firehose method

Elastic currently provides both agent-based and serverless mechanisms, and we are pleased to announce the addition of the Kinesis Firehose method. This new method enables customers to directly ingest logs from AWS into Elastic, supplementing our existing options.

Elastic Agent pulls metrics and logs from CloudWatch and S3, where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route 53), and ingests them into Elastic Cloud.
Elastic’s Serverless Forwarder (runs on AWS Lambda and is available in the AWS Serverless Application Repository) sends logs from Kinesis Data Streams, Amazon S3, and Amazon CloudWatch log groups into Elastic. To learn more about this topic, please see this blog post.
Amazon Kinesis Firehose directly ingests logs from AWS into Elastic (specifically, if you are running the Elastic Cloud on AWS).

In this blog, we will cover the last option since we have recently released the Amazon Kinesis Data Firehose integration. Specifically, we'll review:

A general overview of the Amazon Kinesis Data Firehose integration and how it works with AWS
Step-by-step instructions to set up the Amazon Kinesis Data Firehose integration on AWS and on Elastic Cloud

By the end of this blog, you'll be equipped with the knowledge and tools to simplify your AWS log management with Elastic Observability and Amazon Kinesis Data Firehose.

Prerequisites and configurations

If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.

You will need an account on Elastic Cloud and a deployed stack on AWS. Instructions for deploying a stack on AWS can be found here. This is necessary for AWS Firehose Log ingestion.
You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our documentation.
Finally, be sure to turn on VPC Flow Logs for the VPC where your application is deployed and send them to AWS Firehose.

Elastic’s Amazon Kinesis Data Firehose integration

Elastic has collaborated with AWS to offer a seamless integration of Amazon Kinesis Data Firehose with Elastic, enabling direct ingestion of data from Amazon Kinesis Data Firehose into Elastic without the need for Agents or Beats. All you need to do is configure the Amazon Kinesis Data Firehose delivery stream to send its data to Elastic's endpoint. In this configuration, we will demonstrate how to ingest VPC Flow logs and Firewall logs into Elastic. You can follow a similar process to ingest other logs from your AWS environment into Elastic.

There are three distinct configurations available for ingesting VPC Flow and Network Firewall logs into Elastic: sending logs through CloudWatch, through S3, or through Kinesis Firehose. Each has its own setup. With CloudWatch and S3 you can store and forward, whereas with Kinesis Firehose logs are ingested immediately. In this blog post, we focus on the new configuration, which sends VPC Flow logs and Network Firewall logs directly to Elastic.

We will guide you through the configuration of the easiest setup, which involves directly sending VPC Flow logs and Network Firewall logs to Amazon Kinesis Data Firehose and then into Elastic Cloud.

Note: This setup is only compatible with Elastic Cloud on AWS and cannot be used with self-managed, on-premises, or other cloud provider Elastic deployments.

Setting it all up

To begin setting up the integration between Amazon Kinesis Data Firehose and Elastic, let's go through the necessary steps.

Step 0: Get an account on Elastic Cloud

Create an account on Elastic Cloud by following the instructions provided to get started on Elastic Cloud.

Step 1: Deploy Elastic on AWS

You can deploy Elastic on AWS via two different approaches: through the UI or through Terraform. We’ll start first with the UI option.

After logging into Elastic Cloud, create a deployment on Elastic. It's crucial to make sure that the deployment is on Elastic Cloud on AWS since the Amazon Kinesis Data Firehose connects to a specific endpoint that must be on AWS.

After your deployment is created, copy the Elasticsearch HTTP endpoint; you will need it when configuring the Amazon Kinesis Data Firehose destination. Here's an example of what the endpoint should look like:

https://elastic-O11y-log.es.us-east-1.aws.found.io

Alternative approach using Terraform

An alternative approach to deploying Elastic Cloud on AWS is by using Terraform. It's also an effective way to automate and streamline the deployment process.

To begin, simply create a Terraform configuration file that outlines the necessary infrastructure. This file should include resources for your Elastic Cloud deployment and any required IAM roles and policies. By using this approach, you can simplify the deployment process and ensure consistency across environments.

One easy way to create your Elastic Cloud deployment with Terraform is to use this GitHub repo. This resource lets you specify the region, version, and deployment template for your Elastic Cloud deployment, as well as any additional settings you require.

Step 2: To turn on Elastic's AWS integrations, navigate to the Elastic Integration section in your deployment

To install AWS assets in your deployment's Elastic Integration section, follow these steps:

Log in to your Elastic Cloud deployment and open Kibana.
To get started, go to the management section of Kibana and click on "Integrations."
Navigate to the AWS integration and click on the "Install AWS Assets" button in the settings. This step is important as it installs the necessary assets such as dashboards and ingest pipelines to enable data ingestion from AWS services into Elastic.

Step 3: Set up the Amazon Kinesis Data Firehose delivery stream on the AWS Console

You can set up the Kinesis Data Firehose delivery stream via two different approaches: through the AWS Management Console or through Terraform. We’ll start first with the console option.

To set up the Kinesis Data Firehose delivery stream on AWS, follow these steps:

Go to the AWS Management Console and select Amazon Kinesis Data Firehose.
Click on Create delivery stream.
Choose a delivery stream name and select Direct PUT or other sources as the source.

Choose Elastic as the destination.
In the Elastic destination section, enter the Elastic endpoint URL that you copied from your Elastic Cloud deployment.

Choose the content encoding and retry duration as shown above.
Enter the appropriate parameter values for your AWS log type. For example, for VPC Flow logs, you would need to specify the es_datastream_name parameter with the value logs-aws.vpcflow-default.
Configure an Amazon S3 bucket as the backup destination for the delivery stream (for failed data only or for all data), and configure any required tags for the delivery stream.
Review the settings and click on Create delivery stream.

In the example above, we are using the es_datastream_name parameter to pull in VPC Flow logs through the logs-aws.vpcflow-default datastream. Depending on your use case, this parameter can be configured with one of the following types of logs:

logs-aws.cloudfront_logs-default (AWS CloudFront logs)
logs-aws.ec2_logs-default (EC2 logs in AWS CloudWatch)
logs-aws.elb_logs-default (Amazon Elastic Load Balancing logs)
logs-aws.firewall_logs-default (AWS Network Firewall logs)
logs-aws.route53_public_logs-default (Amazon Route 53 public DNS queries logs)
logs-aws.route53_resolver_logs-default (Amazon Route 53 DNS queries & responses logs)
logs-aws.s3access-default (Amazon S3 server access log)
logs-aws.vpcflow-default (AWS VPC flow logs)
logs-aws.waf-default (AWS WAF Logs)
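For readers who script this step, the sketch below shows how the es_datastream_name parameter might be carried in a delivery stream definition. It only builds a Python dictionary shaped like the Firehose CreateDeliveryStream request as we understand it and sends nothing to AWS. The function name, the endpoint name "Elastic", the GZIP default, and the buffering values are assumptions for illustration; check them against the AWS API reference before passing the result to a client such as boto3.

```python
from typing import Any, Dict

# Data stream names listed above.
SUPPORTED_DATASTREAMS = {
    "logs-aws.cloudfront_logs-default",
    "logs-aws.ec2_logs-default",
    "logs-aws.elb_logs-default",
    "logs-aws.firewall_logs-default",
    "logs-aws.route53_public_logs-default",
    "logs-aws.route53_resolver_logs-default",
    "logs-aws.s3access-default",
    "logs-aws.vpcflow-default",
    "logs-aws.waf-default",
}


def build_delivery_stream_request(
    stream_name: str,
    endpoint_url: str,
    api_key: str,
    datastream: str,
    role_arn: str,
    backup_bucket_arn: str,
    content_encoding: str = "GZIP",  # assumption: match your console choice
) -> Dict[str, Any]:
    """Build (but do not send) a request body for an Elastic HTTP endpoint."""
    if datastream not in SUPPORTED_DATASTREAMS:
        raise ValueError(f"unsupported es_datastream_name: {datastream}")
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",
        "HttpEndpointDestinationConfiguration": {
            "EndpointConfiguration": {
                "Url": endpoint_url,
                "Name": "Elastic",  # hypothetical display name
                "AccessKey": api_key,
            },
            "RequestConfiguration": {
                "ContentEncoding": content_encoding,
                # Console "parameters" travel as common attributes.
                "CommonAttributes": [
                    {
                        "AttributeName": "es_datastream_name",
                        "AttributeValue": datastream,
                    }
                ],
            },
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
            "RoleARN": role_arn,
            "S3BackupMode": "FailedDataOnly",
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": backup_bucket_arn},
        },
    }
```

Validating the data stream name up front catches typos such as a stray space in logs-aws.vpcflow-default before the stream is created.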

Alternative approach using Terraform

Using the "aws_kinesis_firehose_delivery_stream" resource in Terraform is another way to create a Kinesis Firehose delivery stream, allowing you to specify the delivery stream name, data source, and destination - in this case, an Elasticsearch HTTP endpoint. To authenticate, you'll need to provide the endpoint URL and an API key. Leveraging this Terraform resource is a fantastic way to automate and streamline your deployment process, resulting in greater consistency and efficiency.

Here's example code that shows you how to create a Kinesis Firehose delivery stream with Terraform that sends data to an Elasticsearch HTTP endpoint:

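resource "aws_kinesis_firehose_delivery_stream" "Elasticcloud_stream" {
  name        = "terraform-kinesis-firehose-ElasticCloud-stream"
  destination = "http_endpoint"

  # S3 bucket that receives backup data. This top-level placement follows
  # AWS provider 4.x; newer provider versions may expect this block inside
  # http_endpoint_configuration.
  s3_configuration {
    role_arn           = aws_iam_role.firehose.arn
    bucket_arn         = aws_s3_bucket.bucket.arn
    buffer_size        = 5
    buffer_interval    = 300
    compression_format = "GZIP"
  }

  http_endpoint_configuration {
    url        = "https://cloud.elastic.co/" # placeholder: your Elasticsearch endpoint
    name       = "ElasticCloudEndpoint"
    access_key = "ElasticApi-key"            # placeholder: your Elastic API key

    # Buffer up to 5 MB or 300 seconds before delivering to the endpoint.
    buffering_size     = 5
    buffering_interval = 300

    role_arn       = "arn:Elastic_role"      # placeholder: IAM role ARN for Firehose
    s3_backup_mode = "FailedDataOnly"
  }
}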

Step 4: Configure VPC Flow Logs to send to Amazon Kinesis Data Firehose

To complete the setup, you'll need to configure VPC Flow logs in the VPC where your application is deployed and send them to the Amazon Kinesis Data Firehose delivery stream you set up in Step 3.

Enabling VPC flow logs in AWS is a straightforward process. Here are the steps to enable VPC flow logs in your AWS account:

Select the VPC for which you want to enable flow logs.
In the VPC dashboard, click on "Flow Logs" under the "Logs" section.
Click on the "Create Flow Log" button to create a new flow log.
In the "Create Flow Log" wizard, provide the following information:

Choose the target for your flow logs: In this case, Amazon Kinesis Data Firehose in the same AWS account.

Provide a name for your flow log.
Choose the VPC and the network interface(s) for which you want to enable flow logs.
Choose the flow log format: either AWS default or Custom format.

Configure the IAM role for the flow logs. If you have an existing IAM role, select it. Otherwise, create a new IAM role that grants the necessary permissions for the flow logs.
Review the flow log configuration and click "Create."

Create the VPC Flow log.

Step 5: After a few minutes, check if flows are coming into Elastic

To confirm that the VPC Flow logs are ingesting into Elastic, you can check the logs in Kibana. You can do this by searching for the index in the Kibana Discover tab and filtering the results by the appropriate index and time range. If VPC Flow logs are flowing in, you should see a list of documents representing the VPC Flow logs.

Step 6: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] VPC Flow Log Overview dashboard

Finally, there is an Elastic out-of-the-box (OOTB) VPC Flow logs dashboard that displays the top IP addresses that are hitting your VPC, their geographic location, time series of the flows, and a summary of VPC flow log rejects within the selected time frame. This dashboard can provide valuable insights into your network traffic and potential security threats.

Note: For additional VPC flow log analysis capabilities, please refer to this blog.

Step 7: Configure AWS Network Firewall Logs to send to Kinesis Firehose

To create a Kinesis Data Firehose delivery stream for AWS Network Firewall logs, first log in to the AWS Management Console, navigate to the Kinesis service, select "Data Firehose", and follow the step-by-step instructions as shown in Step 3. Specify the Elasticsearch endpoint and API key, add a parameter (es_datastream_name=logs-aws.firewall_logs-default), and create the delivery stream.

Second, to set up a Network Firewall rule group to send logs to the Kinesis Firehose, go to the Network Firewall section of the console, create a rule group, add a rule to allow traffic to the Kinesis endpoint, and attach the rule group to your Network Firewall configuration. Finally, test the configuration by sending traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.

Kindly follow the instructions below to set up a firewall rule and logging.

Set up a Network Firewall rule group to send logs to Amazon Kinesis Data Firehose:

Go to the AWS Management Console and select Network Firewall.
Click on "Rule groups" in the left menu and then click "Create rule group."
Choose "Stateless" or "Stateful" depending on your requirements, and give your rule group a name. Click "Create rule group."
Add a rule to the rule group to allow traffic to the Kinesis Firehose endpoint. For example, if you are using the us-east-1 region, you would add a rule like this:

{
  "RuleDefinition": {
    "Actions": [
      {
        "Type": "AWS::KinesisFirehose::DeliveryStream",
        "Options": {
          "DeliveryStreamArn": "arn:aws:firehose:us-east-1:12387389012:deliverystream/my-delivery-stream"
        }
      }
    ],
    "MatchAttributes": {
      "Destination": {
        "Addresses": ["api.firehose.us-east-1.amazonaws.com"]
      },
      "Protocol": {
        "Numeric": 6,
        "Type": "TCP"
      },
      "PortRanges": [
        {
          "From": 443,
          "To": 443
        }
      ]
    }
  },
  "RuleOptions": {
    "CustomTCPStarter": {
      "Enabled": true,
      "PortNumber": 443
    }
  }
}

Save the rule group.

Attach the rule group to your Network Firewall configuration:

Go to the AWS Management Console and select Network Firewall.
Click on "Firewall configurations" in the left menu and select the configuration you want to attach the rule group to.
Scroll down to "Associations" and click "Edit."
Select the rule group you created in Step 2 and click "Save."

Test the configuration:

Send traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.

Step 8: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] Firewall Log dashboard

Wrapping up

We’re excited to bring you this latest integration for AWS Cloud and Kinesis Data Firehose into production. The ability to consolidate logs and metrics to gain visibility across your cloud and on-premises environment is crucial for today’s distributed environments and applications.

From EC2, CloudWatch, and Lambda to ECS and SAR, Elastic Integrations allow you to quickly and easily get started with ingesting your telemetry data for monitoring, analytics, and observability. Elastic is constantly delivering frictionless customer experiences, allowing anytime, anywhere access to all of your telemetry data. This streamlined, native integration with AWS is the latest example of our commitment.

Start a free trial today

You can begin with a 7-day free trial of Elastic Cloud within the AWS Marketplace to start monitoring and improving your users' experience today!

AWS VPC Flow log analysis with GenAI in Elastic

Fri, 07 Jun 2024 00:00:00 GMT

Elastic Observability provides a full observability solution by supporting metrics, traces, and logs for applications and infrastructure. In AWS deployments, VPC flow logs are critical for managing performance, network visibility, security, compliance, and the overall operation of your AWS environment. Several examples of what they show:

Where traffic is coming from and going to, both into and out of the deployment and within it. This helps identify unusual or unauthorized communications.
Traffic volumes, where spikes or drops could indicate service issues in production or an increase in customer traffic.
Latency and performance bottlenecks. With VPC Flow logs, you can look at latency for a flow (inbound and outbound) and understand patterns.
Accepted and rejected traffic, which helps determine where potential security threats and misconfigurations lie.

AWS VPC Flow logs are a great example of the value of logs. Logging is an important part of observability, which we often think of mainly in terms of metrics and tracing. The volume of logs that an application and its underlying infrastructure produce can be daunting, and VPC Flow logs are no exception. However, they also provide a significant amount of insight.

Before we proceed, it is important to understand what Elastic provides in managing AWS and VPC Flow logs:

A full set of integrations to manage VPC Flows and the entire end-to-end deployment on AWS.
Elastic has a simple-to-use AWS Firehose integration.
Elastic’s tools such as Discover, spike analysis, and anomaly detection help provide you with better insights and analysis.
And a set of simple Out-of-the-box dashboards

In today’s blog, we’ll cover how Elastic’s other features make analysis and root cause analysis (RCA) of VPC flow logs even easier. Specifically, we will focus on managing the number of rejects, as this helps ensure there aren’t any unauthorized or unusual activities:

Set up an easy-to-use SLO (newly released) to detect when things are potentially degrading
Create an ML job to analyze different fields of the VPC Flow log
Use our newly released RAG-based AI Assistant to help analyze the logs without needing to know Elastic’s query language or even how to build graphs in Elastic
Use ES|QL to help understand and analyze latency patterns

In subsequent blogs, we will use the AI Assistant and ES|QL to show how to get other insights beyond just REJECT/ACCEPT from VPC Flow logs.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here).
Follow the steps in the following blog to install AWS’s three-tier app as instructed in git, and bring in the AWS VPC Flow logs.
Ensure you have an ML node configured in your Elastic stack
To use the AI Assistant you will need a trial or upgrade to Platinum.

SLO with VPC Flow Logs

Elastic’s SLO capability is based directly on the Google SRE Handbook. All the definitions and semantics are utilized as described in Google’s SRE handbook. Hence users can perform the following on SLOs in Elastic:

Define an SLO on Logs not just metrics - Users can use KQL (log-based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric.
Define SLO, SLI, Error budget and burn rates. Users can also use occurrence versus time slice-based budgeting.
Manage, with dashboards, all the SLOs in a singular location.
Trigger alerts from the defined SLO, whether the SLI is off, the burn rate is used up, or the error rate is X.

Setting up an SLO for VPC is easy. You simply create a query you want to trigger off. In our case, we look for all the good events where aws.vpcflow.action=ACCEPT and we define the target at 85%.

As the following example shows, over the last 7 days, we have exceeded our budget by 43%. Additionally, we have not complied for the last 7 days.
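To make the budget arithmetic concrete, here is a minimal Python sketch of how an occurrence-based SLI and error budget consumption can be computed. The function names and the event counts are hypothetical, chosen only for illustration; this is not Elastic's internal calculation.

```python
def sli(good_events: int, total_events: int) -> float:
    """Fraction of events that count as good (here: ACCEPT flows)."""
    if total_events == 0:
        return 1.0  # assumption: no traffic is treated as fully compliant
    return good_events / total_events


def error_budget_consumed(good_events: int, total_events: int, target: float) -> float:
    """Share of the error budget used; values above 1.0 mean it is exceeded."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_bad = 1.0 - target  # e.g. 0.15 for an 85% target
    actual_bad = 1.0 - sli(good_events, total_events)
    return actual_bad / allowed_bad


# Hypothetical counts, not taken from the demo environment.
accepted, total = 785_000, 1_000_000
print(f"SLI: {sli(accepted, total):.1%}")
print(f"Budget consumed: {error_budget_consumed(accepted, total, 0.85):.0%}")
```

With these made-up counts, 21.5% of flows are rejected against an allowance of 15%, so roughly 143% of the budget is consumed, which is one way a budget can end up exceeded by 43%.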

Analyzing the SLO with AI Assistant

Now that we see that there is an issue with the VPC Flows, we immediately work with the AI Assistant to start analyzing the SLO. Because it's a chat interface we simply open the AI Assistant and work through some simple analysis: (See Animated GIF for a demo below)

AI Assistant analysis:

what were the top 3 source.address that had aws.vpcflow.action=REJECT over the last 7 days, which is causing this SLO issue? - We wanted to simply see what could be causing the loss in error budget. Were there any particular source.addresses causing a heavy reject rate.
- The answer: A table with the highest count = 42670 and source.address = 79.110.62.185
- There is one singular source.address that is causing the loss in SLO.
What is the largest number of aws.vpcflow.action=REJECT in a 30 min time frame for the last 3 days where the source.address=79.110.62.185? - After understanding that a specific source.address is causing the loss in SLO, we want to understand the averages.
- The answer: "The largest number of aws.vpcflow.action=REJECT in a 30-minute time frame for the last 3 days where the source.address is 79.110.62.185 is 229. This occurred on 2024-06-01T04:00:00.000Z."
- It means there must be a low REJECT rate but fairly consistent vs spiky over the last 7 days.
for the logs with source.address="79.110.62.185" was there any country code of source.geo.country_iso_code field present. If yes what is the value - Given the last question showed a low REJECT rate, it only means that this was fairly consistent vs spiky over the last 7 days.
- The answer: Yes, there is a country code present in the source.geo.country_iso_code field for logs with source.address="79.110.62.185". The value is BG (Bulgaria).
Is there a specific destination.address where source.address=79.110.62.185 is getting a aws.vpcflow.action=REJECT. Give me both the destination.address and the number of REJECTs for that destination.address?
- The answer: destination.address of 10.0.0.27 is giving a reject number of 53433 in this time frame.
Graph the number of REJECT vs ACCEPT for source.address="79.110.62.185" over the last 7 days. The graph is on a daily basis in a singular graph - We asked this question to see what the comparison is between ACCEPT and REJECT.
- The answer: See the animated GIF to see that the generated graph is fairly stable
Were there any source.address that had a spike, high reject rate in a 30min period over the 30 days? - We wanted to see if there was any other spike
- The answer: Yes, there was a source.address that had a spike in high reject rates in a 30-minute period over the last 30 days. source.address: 185.244.212.67, Reject Count: 8975, Time Period: 2024-05-22T03:00:00.000Z
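For readers curious what the first question boils down to, here is a minimal Python sketch of the same aggregation over a handful of made-up records. The flat dotted keys and the sample addresses (from the documentation-reserved 203.0.113.0/24 range) are assumptions for illustration; this is not the query the AI Assistant generates.

```python
from collections import Counter


def top_rejecting_sources(flows, n=3):
    """Return the n source addresses with the most REJECT records."""
    counts = Counter(
        flow["source.address"]
        for flow in flows
        if flow.get("aws.vpcflow.action") == "REJECT"
    )
    return counts.most_common(n)


# Hypothetical sample records.
SAMPLE_FLOWS = [
    {"source.address": "203.0.113.10", "aws.vpcflow.action": "REJECT"},
    {"source.address": "203.0.113.10", "aws.vpcflow.action": "REJECT"},
    {"source.address": "203.0.113.20", "aws.vpcflow.action": "ACCEPT"},
    {"source.address": "203.0.113.30", "aws.vpcflow.action": "REJECT"},
]

print(top_rejecting_sources(SAMPLE_FLOWS))
```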

Watch the flow

Potential issue:

The server handling requests from source 79.110.62.185 is potentially having an issue.

Again using logs, we essentially asked the AI Assistant to give the ENI IDs where the internal IP address was 10.0.0.27.

From our AWS console, we know that this is the webserver. After further analysis in Elastic and discussion with the developers, we realized that a recently installed new version was causing a problem with connections.

Locating anomalies with ML

While using the AI Assistant is great for analyzing information, another important aspect of VPC flow management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies.

VPC Flow logs come with a large amount of information. The full set of fields is listed in AWS docs. We will use a specific subset to help detect anomalies.

We set up anomaly detection for aws.vpcflow.action=REJECT, which requires us to use multi-metric anomaly detection in Elastic.

The config we used utilizes:

Detectors:

destination.address
destination.port

Influencers:

source.address
aws.vpcflow.action
destination.geo.region_iso_code

The way we set this up will help us understand if there is a large spike in REJECT/ACCEPT against destination.address values from a specific source.address and/or destination.geo.region_iso_code location.
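As a rough illustration, the configuration above could be expressed as an anomaly detection job body along these lines. This is a sketch under stated assumptions: the article does not say which detector function or bucket span was used, so `distinct_count` and `15m` are guesses, and the REJECT filter would normally live in the datafeed query, which is not shown here.

```python
import json


def build_vpc_job_config(bucket_span: str = "15m") -> dict:
    """Build a job body using the detectors and influencers listed above."""
    return {
        "description": "VPC Flow anomalies (illustrative sketch)",
        "analysis_config": {
            "bucket_span": bucket_span,  # assumed value
            "detectors": [
                # Assumed function; the article names only the fields.
                {"function": "distinct_count", "field_name": "destination.address"},
                {"function": "distinct_count", "field_name": "destination.port"},
            ],
            "influencers": [
                "source.address",
                "aws.vpcflow.action",
                "destination.geo.region_iso_code",
            ],
        },
        "data_description": {"time_field": "@timestamp"},
    }


print(json.dumps(build_vpc_job_config(), indent=2))
```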

Once run, the job reveals something interesting:

Notice that source.address 185.244.212.67 has had a high REJECT rate in the last 30 days.

Notice where we found this before? In the AI Assistant!

While we can run the AI Assistant and find this sort of anomaly, the ML job can be set up to run continuously and alert us on such spikes. This will help us understand if there are any issues with the webserver like we found above or even potential security attacks.

Conclusion:

You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze VPC Flows without needing to know query syntax, where the data is, or even the fields. Additionally, you’ve seen how we can alert you to a potential issue or degradation in service with an SLO. Check out our other blogs on AWS VPC Flow analysis in Elastic:

A full set of integrations to manage VPC Flows and the entire end-to-end deployment on AWS.
Elastic has a simple-to-use AWS Firehose integration.
Elastic’s tools such as Discover, spike analysis, and anomaly detection help provide you with better insights and analysis.
And a set of simple Out-of-the-box dashboards

Try it out

Existing Elastic Cloud customers can access many of these features directly from the Elastic Cloud console. Not taking advantage of Elastic on the cloud? Start a free trial.

All of this is also possible in your environment. Learn how to get started today.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

Best Practices for Log Management: Leveraging Logs for Faster Problem Resolution

Wed, 11 Sep 2024 00:00:00 GMT

In today's rapid software development landscape, efficient log management is crucial for maintaining system reliability and performance. With expanding and complex infrastructure and application components, the responsibilities of operations and development teams are ever-growing and multifaceted. This blog post outlines best practices for effective log management, addressing the challenges of growing data volumes, complex infrastructures, and the need for quick problem resolution.

Understanding Logs and Their Importance

Logs are records of events occurring within your infrastructure, typically including a timestamp, a message detailing the event, and metadata identifying the source. They are invaluable for diagnosing issues, providing early warnings, and speeding up problem resolution. Logs are often the primary signal that developers enable, offering significant detail for debugging, performance analysis, security, and compliance management.

The Logging Journey

The logging journey involves three basic steps: collection and ingestion, processing and enrichment, and analysis and rationalization. Let's explore each step in detail, covering some of the best practices for each section.

1. Log Collection and Ingestion

Collect Everything Relevant and Actionable

The first step is to collect all logs into a central location. This involves identifying all your applications and systems and collecting their logs. Comprehensive data collection ensures no critical information is missed, providing a complete picture of your system's behavior. In the event of an incident, having all logs in one place can significantly reduce the time to resolution. It's generally better to collect more data than you need, as you can always filter out irrelevant information later and delete logs that are no longer needed sooner.

Leverage Integrations

Elastic provides over 300 integrations that simplify data onboarding. These integrations not only collect data but also come with dashboards, saved searches, and pipelines to parse the data. Utilizing these integrations can significantly reduce manual effort and ensure data consistency.

Consider Ingestion Capacity and Costs

An important aspect of log collection is ensuring you have sufficient ingestion capacity at a manageable cost. When assessing solutions, be cautious about those that charge significantly more for high cardinality data, as this can lead to unexpectedly high costs in observability solutions. We'll talk more about cost effective log management later in this post.

Use Kafka for Large Projects

For larger organizations, implementing Kafka can improve log data management. Kafka acts as a buffer, making the system more reliable and easier to manage. It allows different teams to send data to a centralized location, which can then be ingested into Elastic.

2. Processing and Enrichment

Adopt Elastic Common Schema (ECS)

One key aspect of log collection is to normalize as much as possible across all of your applications and infrastructure. Having a common semantic schema is crucial. Elastic contributed Elastic Common Schema (ECS) to OpenTelemetry (OTel), helping accelerate the adoption of OTel-based observability and security. This move towards a more normalized way to define and ingest logs (including metrics and traces) is beneficial for the industry.

Using ECS helps standardize field names and data structures, making data analysis and correlation easier. This common schema ensures your data is organized predictably, facilitating more efficient querying and reporting. Learn more about ECS here.
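To illustrate what normalizing to a common schema looks like in practice, here is a small Python sketch that renames an application's ad hoc keys to ECS field names. The source keys (`ts`, `msg`, `hostname`, and so on) are hypothetical; the target names are standard ECS fields, and unknown keys are placed under ECS `labels` as strings.

```python
# Hypothetical application keys mapped to ECS field names.
ECS_FIELD_MAP = {
    "ts": "@timestamp",
    "msg": "message",
    "level": "log.level",
    "hostname": "host.name",
    "client_ip": "source.ip",
}


def to_ecs(event: dict) -> dict:
    """Rename known keys to ECS names; keep unknown keys under labels."""
    doc = {}
    for key, value in event.items():
        target = ECS_FIELD_MAP.get(key)
        if target is None:
            # ECS labels hold custom key/value pairs stored as keywords.
            doc.setdefault("labels", {})[key] = str(value)
        else:
            doc[target] = value
    return doc


print(to_ecs({"ts": "2024-09-11T00:00:00Z", "msg": "disk full", "level": "error", "shard": 3}))
```

In a real deployment this mapping would more likely live in an integration or ingest pipeline than in application code.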

Optimize Mappings for High Volume Data

For high cardinality fields or those rarely used, consider optimizing or removing them from the index. This can improve performance by reducing the amount of data that needs to be indexed and searched. Our documentation has sections to tune your setup for disk usage, search speed and indexing speed.

Managing Structured vs. Unstructured Logs

Structured logs are generally preferable as they offer more value and are easier to work with. They have a predefined format and fields, simplifying information extraction and analysis. For custom logs without pre-built integrations, you may need to define your own parsing rules.

For unstructured logs, full-text search capabilities can help mitigate limitations. By indexing logs, full-text search allows users to search for specific keywords or phrases efficiently, even within large volumes of unstructured data. This is one of the main differentiators of Elastic's observability solution. You can simply search for any keyword or phrase and get results in real-time, without needing to write complex regular expressions or parsing rules at query time.

Schema-on-Read vs. Schema-on-Write

There are two main approaches to processing log data:

Schema-on-read: Some observability dashboarding capabilities can perform runtime transformations to extract fields from non-parsed sources on the fly. This is helpful when dealing with legacy systems or custom applications that may not log data in a standardized format. However, runtime parsing can be time-consuming and resource-intensive, especially for large volumes of data.
Schema-on-write: This approach offers better performance and more control over the data. The schema is defined upfront, and the data is structured and validated at the time of writing. This allows for faster processing and analysis of the data, which is beneficial for enrichment.
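The trade-off between the two approaches can be sketched in a few lines of Python. The log layout, regular expression, and field choices below are assumptions for illustration: schema-on-write parses each line once at ingest, while schema-on-read keeps the raw text and repeats the parsing work on every query.

```python
import re

# Hypothetical access-log layout: "<ip> <method> <path> <status> <bytes>"
LINE_RE = re.compile(
    r"(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<bytes>\d+)"
)


def parse(line):
    """Extract fields from one raw line; return None if it does not match."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        return None
    return {
        "source.ip": m["ip"],
        "http.request.method": m["method"],
        "url.path": m["path"],
        "http.response.status_code": int(m["status"]),
        "http.response.bytes": int(m["bytes"]),
    }


RAW_LINES = [
    "203.0.113.7 GET /index.html 200 5120",
    "203.0.113.9 POST /login 401 312",
    "free-form line that matches no pattern",
]

# Schema-on-write: parse once at ingest time and store structured documents.
INDEXED_DOCS = [doc for doc in map(parse, RAW_LINES) if doc is not None]


def count_status_on_write(status):
    """Query the already-structured documents."""
    return sum(1 for d in INDEXED_DOCS if d["http.response.status_code"] == status)


def count_status_on_read(status):
    """Schema-on-read: re-parse every raw line each time a query runs."""
    parsed = (parse(line) for line in RAW_LINES)
    return sum(1 for d in parsed if d and d["http.response.status_code"] == status)
```

Both functions return the same answer; the difference is where the parsing cost is paid.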

3. Analysis and Rationalization

Full-Text Search

Elastic's full-text search capabilities, powered by Elasticsearch, allow you to quickly find relevant logs. The Kibana Query Language (KQL) enhances search efficiency, enabling you to filter and drill down into the data to identify issues rapidly.

Here are a few examples of KQL queries:

// Filter documents where a field exists
http.request.method: *

// Filter documents that match a specific value
http.request.method: GET

// Search all fields for a specific value
Hello

// Filter documents where a text field contains specific terms
http.request.body.content: "null pointer"

// Filter documents within a range
http.response.bytes < 10000

// Combine range queries
http.response.bytes > 10000 and http.response.bytes <= 20000

// Use wildcards to match patterns
http.response.status_code: 4*

// Negate a query
not http.request.method: GET

// Combine multiple queries with AND/OR
http.request.method: GET and http.response.status_code: 400

Machine Learning Integration

Machine learning can automate the detection of anomalies and patterns within your log data. Elastic offers features like log rate analysis that automatically identify deviations from normal behavior. By leveraging machine learning, you can proactively address potential issues before they escalate.

It is recommended that organizations utilize a diverse arsenal of machine learning algorithms and techniques to effectively uncover unknown-unknowns in log files. Unsupervised machine learning algorithms should be employed for anomaly detection on real-time data, with rate-controlled alerting based on severity.

By automatically identifying influencers, users can gain valuable context for automated root cause analysis (RCA). Log pattern analysis brings categorization to unstructured logs, while log rate analysis and change point detection help identify the root causes of spikes in log data.

Take a look at the documentation to get started with machine learning in Elastic.

Dashboarding and Alerting

Building dashboards and setting up alerting helps you monitor your logs in real-time. Dashboards provide a visual representation of your logs, making it easier to identify patterns and anomalies. Alerting can notify you when specific events occur, allowing you to take action quickly.

Cost-Effective Log Management

Use Data Tiers

Implementing index lifecycle management to move data across hot, warm, cold, and frozen tiers can significantly reduce storage costs. This approach ensures that only the most frequently accessed data is stored on expensive, high-performance storage, while older data is moved to more cost-effective storage solutions.

Our documentation explains how to set up Index Lifecycle Management.
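As a rough sketch, a tiered lifecycle policy body might be assembled like this. The phase ages, rollover thresholds, and the snapshot repository name `my-repo` are placeholders chosen for illustration, not recommendations; the resulting dictionary has the shape of a request body for the ILM policy API.

```python
import json


def build_ilm_policy(warm_after="7d", cold_after="30d",
                     frozen_after="90d", delete_after="365d",
                     snapshot_repository="my-repo"):
    """Return an ILM policy body moving data through hot, warm, cold, frozen, delete."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_age": "7d", "max_primary_shard_size": "50gb"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "cold": {
                    "min_age": cold_after,
                    "actions": {"set_priority": {"priority": 0}},
                },
                "frozen": {
                    "min_age": frozen_after,
                    # Hypothetical repository name; it must already exist.
                    "actions": {
                        "searchable_snapshot": {"snapshot_repository": snapshot_repository}
                    },
                },
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }


print(json.dumps(build_ilm_policy(), indent=2))
```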

Compression and Index Sorting

Applying best compression settings and using index sorting can further reduce the data footprint. Optimizing the way data is stored on disk can lead to substantial savings in storage costs and improve retrieval performance. As of 8.15, Elasticsearch provides an indexing mode called "logsdb". This is a highly optimized way of storing log data. This new way of indexing data uses 2.5 times less disk space than the default mode. You can read more about it here. This mode automatically applies the best combination of settings for compression, index sorting, and other optimizations that weren't accessible to users before.

Snapshot Lifecycle Management (SLM)

SLM allows you to back up your data and delete it from the main cluster, freeing up resources. If needed, data can be restored quickly for analysis, ensuring that you maintain the ability to investigate historical events without incurring high storage costs.

Learn more about SLM in the documentation.

Dealing with Large Amounts of Log Data

Managing large volumes of log data can be challenging. Here are some strategies to optimize log management:

Develop a logs deletion policy. Evaluate what data to collect and when to delete it.
Consider discarding DEBUG logs or even INFO logs earlier, and delete dev and staging environment logs sooner.
Aggregate short windows of identical log lines, which is especially useful for TCP security event logging.
For applications and code you control, consider moving some logs into traces to reduce log volume while maintaining detailed information.
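The aggregation strategy above can be sketched as follows: identical messages that arrive within a short window are collapsed into one record with a count. The window length, the tuple input format, and the output field names are assumptions for illustration.

```python
def aggregate_identical(events, window_seconds=10):
    """Collapse repeated messages seen within window_seconds of the first one.

    events: iterable of (timestamp_seconds, message), sorted by timestamp.
    """
    aggregated = []
    open_window = {}  # message -> index of its current record in aggregated
    for ts, message in events:
        idx = open_window.get(message)
        if idx is not None and ts - aggregated[idx]["first_seen"] < window_seconds:
            aggregated[idx]["count"] += 1
            aggregated[idx]["last_seen"] = ts
        else:
            # Start a new window for this message.
            open_window[message] = len(aggregated)
            aggregated.append(
                {"first_seen": ts, "last_seen": ts, "message": message, "count": 1}
            )
    return aggregated


events = [(0, "conn refused"), (1, "conn refused"), (2, "timeout"),
          (9, "conn refused"), (10, "conn refused")]
for record in aggregate_identical(events):
    print(record)
```

Keeping `first_seen`, `last_seen`, and `count` preserves the timing and volume information while storing far fewer documents.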

Centralized vs. Decentralized Log Storage

Data locality is an important consideration when managing log data. The costs of ingressing and egressing large amounts of log data can be prohibitively high, especially when dealing with cloud providers.

In the absence of regional redundancy requirements, your organization may not need to send all log data to a central location. Consider keeping log data local to the datacenter where it was generated to reduce ingress and egress costs.

Cross-cluster search functionality enables users to search across multiple logging clusters simultaneously, reducing the amount of data that needs to be transferred over the network.

Cross-cluster replication is useful for maintaining business continuity in the event of a disaster, ensuring data availability even during an outage in one datacenter.

Monitoring and Performance

Monitor Your Log Management System

Using a dedicated monitoring cluster can help you track the performance of your Elastic deployment. Stack monitoring provides metrics on search and indexing activity, helping you identify and resolve performance bottlenecks.

Adjust Bulk Size and Refresh Interval

Optimizing these settings can balance performance and resource usage. Increasing bulk size and refresh interval can improve indexing efficiency, especially for high-throughput environments.
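To show what "bulk size" means on the client side, here is a minimal Python sketch that groups documents into newline-delimited bodies for the `_bulk` API. The index name `logs-myapp-default` and the batch size are hypothetical; sending the bodies over HTTP is left out. The refresh interval, by contrast, is an index setting (`index.refresh_interval`) and is not part of the request body.

```python
import json


def bulk_bodies(docs, index, batch_size=1000):
    """Yield NDJSON request bodies containing up to batch_size documents each."""
    lines = []
    for doc in docs:
        # Each document needs an action line followed by its source line.
        lines.append(json.dumps({"create": {"_index": index}}))
        lines.append(json.dumps(doc))
        if len(lines) == 2 * batch_size:
            yield "\n".join(lines) + "\n"  # bulk bodies must end with a newline
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"


docs = ({"message": f"event {i}"} for i in range(5))
for body in bulk_bodies(docs, "logs-myapp-default", batch_size=2):
    print(body)
```

Larger batches mean fewer requests, but each request is bigger, so the right size depends on document size and cluster capacity.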

Logging Best Practices

Adjust Log Levels

Ensure that log levels are appropriately set for all applications. Customize log formats to facilitate easier ingestion and analysis. Properly configured log levels can reduce noise and make it easier to identify critical issues.

Use Modern Logging Frameworks

Implement logging frameworks that support structured logging. Adding metadata to logs enhances their usefulness for analysis. Structured logging formats, such as JSON, allow logs to be easily parsed and queried, improving the efficiency of log analysis. If you fully control the application and are already using structured logging, consider using Elastic's version of these libraries, which can automatically parse logs into ECS fields.
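As an illustration of structured logging, the following sketch uses only Python's standard `logging` module to emit one JSON object per log line with ECS-style field names. The `JsonFormatter` class, the `make_logger` helper, and the `labels` convention are hypothetical names for this example; a dedicated ECS logging library would handle more fields and edge cases.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with ECS-style keys."""

    def format(self, record):
        doc = {
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "log.level": record.levelname.lower(),
            "log.logger": record.name,
            "message": record.getMessage(),
        }
        labels = getattr(record, "labels", None)  # passed via extra={"labels": {...}}
        if labels:
            doc["labels"] = labels
        return json.dumps(doc)


def make_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
log = make_logger("checkout", buffer)
log.info("payment failed", extra={"labels": {"order_id": "42"}})
print(buffer.getvalue())
```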

Leverage APM and Metrics

For custom-built applications, Application Performance Monitoring (APM) provides deeper insights into application performance, complementing traditional logging. APM tracks transactions across services, helping you understand dependencies and identify performance bottlenecks.

Consider collecting metrics alongside logs. Metrics can provide insights into your system's performance, such as CPU usage, memory usage, and network traffic. If you're already collecting logs from your systems, adding metrics collection is usually a quick process.

Traces can provide even deeper insights into specific transactions or request paths, especially in cloud-native environments. They offer more contextual information and excel at tracking dependencies across services. However, implementing tracing is only possible for applications you own, and not all developers have fully embraced it yet.

A combined logging and tracing strategy is recommended, where traces provide coverage for newer instrumented apps, and logging supports legacy applications and systems you don't own the source code for.

Conclusion

Effective log management is essential for maintaining system reliability and performance in today's complex software environments. By following these best practices, you can optimize your log management process, reduce costs, and improve problem resolution times.

Key takeaways include:

Ensure comprehensive log collection with a focus on normalization and common schemas.
Use appropriate processing and enrichment techniques, balancing between structured and unstructured logs.
Leverage full-text search and machine learning for efficient log analysis.
Implement cost-effective storage strategies and smart data retention policies.
Enhance your logging strategy with APM, metrics, and traces for a complete observability solution.

Continuously evaluate and adjust your strategies to keep pace with the growing volume and complexity of log data, and you'll be well-equipped to ensure the reliability, performance, and security of your applications and infrastructure.

Check out our other blogs:

Ready to get started? Use Elastic Observability on Elastic Cloud — the hosted Elasticsearch service that includes all of the latest features.

Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch

Mon, 19 Aug 2024 00:00:00 GMT

Introduction:

Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the Elastic Kubernetes Audit Log Integration.

In this blog we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:

AWS Custom Logs integration (which we will utilize in this blog)
AWS Firehose to send logs from CloudWatch to Elastic
AWS General integration which supports many AWS sources

In part 1 of this two-part series, we will focus on properly ingesting Kubernetes Audit, and part 2 will focus on investigation, analytics, and alerting.

Kubernetes auditing documentation describes the need for auditing in order to get answers to the questions below:

What happened?
When did it happen?
Who initiated it?
What resource did it occur on?
Where was it observed?
From where was it initiated (Source IP)?
Where was it going (Destination IP)?

Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements.

We are giving special importance to audit logs in Kubernetes because audit logs are not enabled by default. Audit logs can take up a large amount of memory and storage. So, usually, it’s a balance between retaining/investigating audit logs against giving up resources budgeted otherwise for workloads to be hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, after being turned on, these logs are orchestrated to write to the cloud provider’s logging service. This is true for most cloud providers because the Kubernetes control plane is managed by the cloud providers. It makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.

Kubernetes audit logs can be quite verbose by default. Hence, it becomes important to selectively choose how much logging needs to be done so that all the audit requirements are met for the organization. This is done in the audit policy file. The audit policy file is submitted against the kube-apiserver. It is not necessary that all flavors of cloud-provider-hosted Kubernetes clusters allow you to play with the kube-apiserver directly. For example, AWS EKS allows for this logging to be done only by the control plane.

In this blog we will be using Elastic Kubernetes Service (Amazon EKS) on AWS with the Kubernetes Audit Logs that are automatically shipped to AWS CloudWatch.

A sample audit log for a secret by the name “empty-secret” created by an admin user on EKS is logged on AWS CloudWatch in the following format:

Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour?

Now that we have established that the Kubernetes audit logs are being logged in CloudWatch, let’s discuss how to get the logs ingested into Elasticsearch. Elasticsearch has an integration to consume logs written on CloudWatch. Using this integration with its defaults brings in the JSON from CloudWatch as is, i.e., the real audit log JSON is nested inside the wrapper CloudWatch JSON. When bringing logs to Elasticsearch, it is important that we use the Elastic Common Schema (ECS) to get the best search and analytics performance. This means that there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.

Elasticsearch has a Kubernetes integration using Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the ECS designed for parsing the Kubernetes audit logs already implemented in the Kubernetes integration to work on the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.

What we’re going to do is:

Read the Kubernetes audit logs from the cloud provider’s logging module, in our case AWS CloudWatch, since this is where the logs reside. We will use Elastic Agent and the Elasticsearch AWS Custom Logs integration to read logs from CloudWatch. Note: there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.
Create two simple ingest pipelines (we do this for best practices of isolation and composability)
The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline
The second custom pipeline will associate the JSON message field with the correct field expected by the Elasticsearch Kubernetes Audit managed pipeline (aka the Integration) and then reroute the message to the correct data stream, kubernetes.audit_logs-default, which in turn applies all the proper mapping and ingest pipelines for the incoming message
The overall flow will be

1. Create an AWS CloudWatch integration:

a. Populate the AWS access key and secret pair values

b. In the logs section, populate the log ARN, Tags and Preserve the original event if you want to, and then Save this integration and exit from the page

2. Next, we will configure the custom ingest pipeline

We are doing this because we want to override what the generic managed pipeline does. We will retrieve the custom component name by searching for the managed pipeline created as an asset when we install the AWS CloudWatch integration. In this case, we will be adding the custom ingest pipeline logs-aws_logs.generic@custom.

From the Dev Tools console, run the requests below. Here, we are extracting the message field from the CloudWatch JSON and putting the value in a field called kubernetes.audit. Then, we are rerouting this message to the default Kubernetes audit dataset (and its ECS mappings) that comes with the Kubernetes integration.

PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    "processors": [
      {
        "pipeline": {
          "if": "ctx.message.contains('audit.k8s.io')",
          "name": "logs-aws-process-k8s-audit"
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  "processors": [
    {
      "json": {
        "field": "message",
        "target_field": "kubernetes.audit"
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "reroute": {
        "dataset": "kubernetes.audit_logs",
        "namespace": "default"
      }
    }
  ]
}

Let’s understand this further:

When we create a Kubernetes integration, we get a managed index template called logs-kubernetes.audit_logs that writes to the pipeline called logs-kubernetes.audit_logs-1.62.2 by default
If we look into the pipeline logs-kubernetes.audit_logs-1.62.2, we see that all the processor logic works against the field kubernetes.audit. This is why the json processor in the above code snippet creates a field called kubernetes.audit before dropping the original message field and rerouting. Rerouting is directed to the kubernetes.audit_logs dataset that backs the logs-kubernetes.audit_logs-1.62.2 pipeline (the dataset name is derived from the pipeline naming convention, which has the format logs-<dataset>-<version>)
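To make the flow of the two custom pipelines concrete, here is a minimal Python sketch that mimics what they do to a single CloudWatch document. It is an illustration only, not how Elasticsearch executes ingest pipelines; the function name process_cloudwatch_event and the sample document are hypothetical.

```python
import json

# Marker the first pipeline looks for in the raw CloudWatch message.
AUDIT_MARKER = "audit.k8s.io"

# Data stream the reroute processor targets: logs-<dataset>-<namespace>.
TARGET_DATA_STREAM = "logs-kubernetes.audit_logs-default"


def process_cloudwatch_event(doc):
    """Mimic both custom pipelines on one document (a dict).

    Returns (document, target_data_stream). The target is None when the
    document is not a Kubernetes audit message and is left untouched.
    """
    message = doc.get("message")
    # First pipeline: only hand off messages that mention the audit API group.
    if not isinstance(message, str) or AUDIT_MARKER not in message:
        return doc, None

    out = dict(doc)
    # json processor: parse the message into kubernetes.audit.
    kubernetes = dict(out.get("kubernetes", {}))
    kubernetes["audit"] = json.loads(message)
    out["kubernetes"] = kubernetes
    # remove processor: drop the original message field.
    del out["message"]
    # reroute processor: send the document to the Kubernetes audit data stream.
    return out, TARGET_DATA_STREAM


# Hypothetical sample event, shaped like a CloudWatch-wrapped audit record.
sample = {"message": json.dumps({"apiVersion": "audit.k8s.io/v1", "verb": "create"})}
parsed, target = process_cloudwatch_event(sample)
print(target, parsed["kubernetes"]["audit"]["verb"])
```

Documents that do not contain the marker pass through unchanged, which is why the first pipeline can safely sit on the generic CloudWatch data stream.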

3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed

a. We will use Elastic Agent and enroll it using Fleet and the integration policy we created in Step 1. There are a number of ways to deploy Elastic Agent; for this exercise we will deploy using Docker, which is quick and easy.

% docker run --env FLEET_ENROLL=1 --env FLEET_URL=<> --env FLEET_ENROLLMENT_TOKEN=<>  --rm docker.elastic.co/beats/elastic-agent:8.19.11

b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer which provides an ability to see Kubernetes Audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!

4. Let's do a quick recap of what we did

We configured the CloudWatch integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream, where they pick up all the out-of-the-box (OOTB) mappings and parsing that come with the Kubernetes Audit Logs integration.

In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.

Customize your data ingestion with Elastic input packages

Tue, 26 Sep 2023 00:00:00 GMT

Elastic® has enabled the collection, transformation, and analysis of data flowing between the external data sources and Elastic Observability Solution through integrations. Integration packages achieve this by encapsulating several components, including agent configuration, inputs for data collection, and assets like ingest pipelines, data streams, index templates, and visualizations. The breadth of these assets supported in the Elastic Stack increases day by day.

This blog dives into how input packages provide an extremely generic and flexible solution to the advanced users for customizing their ingestion experience in Elastic.

What are input packages?

An Elastic Package is an artifact that contains a collection of assets that extend the Elastic Stack, providing new capabilities to accomplish a specific task like integration with an external data source. The first use of Elastic packages is integration packages, which provide an end-to-end experience — from configuring Elastic Agent, to collecting signals from the data source, to ingesting them correctly and using the data once ingested.

However, advanced users may need to customize data collection, either because an integration does not exist for a specific data source, or even if it does, they want to collect additional signals or in a different way. Input packages are another type of Elastic package that provides the capability to configure Elastic Agent to use the provided inputs in a custom way.

Let’s look at an example

Say hello to Julia, who works as an engineer at Ascio Innovation firm. She is currently working with Oracle Weblogic server and wants to get a set of metrics for monitoring it. She goes ahead and installs Elastic Oracle Weblogic Integration, which uses Jolokia in the backend to fetch the metrics.

Now, her team wants to advance in the monitoring and has the following requirements:

We should be able to extract metrics other than the default ones, which are not supported by the default Oracle Weblogic Integration.
We want to have our own bespoke pipelines, visualizations, and experience.
We should be able to identify the metrics coming in from two different instances of Weblogic Servers by having data mapped to separate indices.

All the above requirements can be met by using the Jolokia input package to get a customized experience. Let's see how.

Julia can add the configuration of Jolokia input package as below, fulfilling the first requirement.

The configuration takes the hostname, the JMX Mappings for the fields you want to fetch for the JVM application, and the data set name to which the response fields will be mapped.

Julia can customize her data by writing her own ingest pipelines and providing her customized mappings. Also, she can then build her own bespoke dashboards, hence meeting her second requirement.

Let’s say now Julia wants to use another instance of Oracle Weblogic and get a different set of metrics.

This can be achieved by adding another instance of Jolokia input package and specifying a new data set name as shown in the screenshot below. The resultant metrics will be mapped to a different index/data set hence fulfilling her third requirement. This will help Julia to differentiate metrics coming in from two different instances of Oracle Weblogic.

The resultant metrics of the query will be indexed to the new data set, jolokia_second_dataset in the below example.

As we can see above, the Jolokia input package provides the flexibility to get new metrics by specifying different JMX Mappings, which are not supported in the default Oracle Weblogic integration (the user gets metrics from a predetermined set of JMX Mappings).

The Jolokia Input package also can be used for monitoring any Java-based application, which pushes its metrics through JMX. So a single input package can be used to collect metrics from multiple Java applications/services.

Elastic input packages

Elastic has started supporting input packages from the 8.8.0 release. Some of the input packages are now available in beta and will mature gradually:

SQL input package: The SQL input package allows you to execute queries against any SQL database and store the results in Elasticsearch®.
Prometheus input package: This input package can collect metrics from Prometheus Exporters (Collectors). It can be used by any service exporting its metrics to a Prometheus endpoint.
Jolokia input package: This input package collects metrics from Jolokia agents running on a target JMX server or dedicated proxy server. It can be used for monitoring any Java-based application, which pushes its metrics through JMX.
Statsd input package: The statsd input package spawns a UDP server and listens for metrics in StatsD compatible format. This input can be used to collect metrics from services that send data over the StatsD protocol.
GCP Metrics input package: The GCP Metrics input package can collect custom metrics for any GCP service.

Try it out!

Now that you know more about input packages, try building your own customized integration for your service through input packages, and get started with an Elastic Cloud free trial.

We would love to hear from you about your experience with input packages on the Elastic Discuss forum or in the Elastic Integrations repository.

Elastic Observability: Streams Data Quality and Failure Store Insights

Tue, 18 Nov 2025 00:00:00 GMT

When working with observability and logging data, not all documents make it into Elasticsearch in pristine condition. Some may be dropped due to processing failures in ingest pipelines or mapping errors, while others may be partially ingested with ignored fields if a field's value is incompatible with the defined mappings. These issues can impact downstream analysis and dashboards. Streams data quality makes it easier than ever to monitor the health of your ingested data, identify potential issues, and take corrective action right from the UI. With data quality, you can now see exactly how well your Stream is performing and quickly understand whether your data has a Good, Degraded, or Poor quality.

What's in data quality

At-a-glance summary

The summary card shows:

Degraded documents - Documents that contain the _ignored field - see this for more info.
Failed documents - Documents that were rejected at ingestion due to mapping conflicts or pipeline failures.

The overall quality score (Good, Degraded, Poor) is automatically calculated based on the percentage of degraded and failed documents.

Trends over time

The tab includes a time-series chart so you can track how degraded and failed documents are accumulating over time. Use the date picker to zoom into a specific range and understand when problems are spiking.

Quality issues table

A detailed table lists the types of issues affecting your stream. For each issue, you can:

See which fields are causing problems.
Review counts of affected documents.
Filter by issues that have not been solved yet (Current issues only).
Open a flyout to dive deeper into the cause of the issue and learn how to fix it.

Monitoring degraded documents

A degraded document is one that contains the _ignored field, which means one or more of its fields were ignored during indexing. One of the reasons could be that their values didn’t match the expected mappings. While the rest of the document is still indexed, a high number of degraded documents can affect query results, dashboards, and overall observability accuracy.

To help keep these issues under control, the Data quality tab provides visibility into the percentage of degraded documents in your stream.

Set up a rule to stay ahead of issues

You can use the Create rule button above the Degraded docs chart to define an alert that notifies you when the percentage of degraded documents crosses a certain threshold. This makes it easy to proactively monitor for mapping mismatches and ensure your data continues to meet quality expectations.

For more information on how to configure this rule, see Degraded docs rule conditions.

Handling failed documents with the failure store

Failure store is a special index that captures documents rejected during ingestion. Instead of losing this data, the failure store retains it in a dedicated ::failures index, allowing you to inspect the problematic documents, understand what went wrong, and fix the underlying issues.

In the Data quality tab, failed documents are only visible if your stream has a failure store enabled. To check failure store documents, you need at least the read_failure_store privilege. If the failure store is not enabled, you’ll see an “Enable failure store” link that opens a modal to configure it and set the retention period. To enable the failure store, you need the manage_failure_store privilege over the specific data stream. For further information about failure store security, refer to Searching failures.

Once enabled, you can edit the failure store configuration or disable it at any time using the Edit button above the failed docs chart.

The failure store can also be configured in the Streams Retention tab - see this article for more information.

Technical implementation

Under the hood, the Data quality tab builds on the existing Dataset quality plugin - the same one that powers the Dataset quality page in Stack Management. However, instead of working in the context of datasets following the Data stream naming scheme, it’s now tailored specifically for streams.

To determine the quality of a stream, the UI sends three ES|QL queries to the server:

All documents (including failures):

 FROM myStream, myStream::failures | STATS doc_count = COUNT(*)

Failed documents only:

 FROM myStream::failures | STATS failed_doc_count = COUNT(*)

Degraded documents:

FROM myStream METADATA _ignored | WHERE _ignored IS NOT NULL | STATS degraded_doc_count = COUNT(*)

The results of these queries are then used to calculate the percentages of failed and degraded documents. The overall data quality is determined using simple thresholds:

Good: Both percentages are 0%
Degraded: Any percentage is greater than 0% but less than 3%
Poor: Any percentage is above 3%
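The threshold logic above can be sketched in a few lines of Python. This is a rough illustration, not the plugin's actual code. Two assumptions are mine and not stated in the text: both percentages are computed against the total document count from the first query, and a value of exactly 3% is treated as Poor. The function name classify_quality is hypothetical.

```python
def classify_quality(doc_count, failed_doc_count, degraded_doc_count, poor_threshold=3.0):
    """Return 'Good', 'Degraded', or 'Poor' from the three query results.

    Assumption: percentages are relative to doc_count (all documents,
    including failures). Assumption: exactly poor_threshold counts as Poor.
    """
    if doc_count <= 0:
        # Assumption: an empty stream has no known quality problems.
        return "Good"
    failed_pct = 100.0 * failed_doc_count / doc_count
    degraded_pct = 100.0 * degraded_doc_count / doc_count
    worst = max(failed_pct, degraded_pct)
    if worst == 0:
        return "Good"
    if worst < poor_threshold:
        return "Degraded"
    return "Poor"


print(classify_quality(doc_count=1000, failed_doc_count=10, degraded_doc_count=0))
```

Because the worse of the two percentages decides the result, a stream with no failed documents can still be rated Poor if enough documents are degraded.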

For managing the failure store, Streams uses the Update data stream options API with the failure_store parameter to configure and update the failure store settings, including enabling the store and setting the retention period.
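As a rough sketch of what such a request could look like, the Python helper below only builds the HTTP method, path, and JSON body; it does not send anything. The path and the enabled and lifecycle.data_retention field names reflect my understanding of the data stream options API and should be checked against the Elasticsearch documentation for your version. The helper name failure_store_request and the stream name are hypothetical.

```python
def failure_store_request(stream_name, enabled=True, retention=None):
    """Build (method, path, body) for updating a stream's failure store.

    Assumption: the body shape {"failure_store": {"enabled": ...,
    "lifecycle": {"data_retention": ...}}} matches your Elasticsearch version.
    """
    failure_store = {"enabled": enabled}
    if retention is not None:
        # Retention is expressed as a time value string such as "30d".
        failure_store["lifecycle"] = {"data_retention": retention}
    path = "/_data_stream/{}/_options".format(stream_name)
    return "PUT", path, {"failure_store": failure_store}


method, path, body = failure_store_request("myStream", retention="30d")
print(method, path, body)
```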

Why you’ll love this

The new Data quality tab gives you:

Visibility into ingestion problems without digging into logs
A clear breakdown of degraded vs. failed documents
Insights into which fields are ignored and why
Tools to capture and troubleshoot failed docs with the failure store

By surfacing data quality issues directly in the Streams UI, we’re making it easier to keep your data flowing reliably and to ensure your analytics are built on a strong foundation.

Try it out today

The data quality feature is available in Elastic Observability on Serverless, and coming soon for self-managed and Elastic Cloud users.

Sign up for an Elastic trial at cloud.elastic.co, and trial Elastic's Serverless offering which will allow you to play with all of the Streams functionality.

For more information on Streams:

Read about Reimagining streams

Look at the Streams website

Read the Streams documentation

Deploying Elastic Agent with Confluent Cloud's Elasticsearch Connector

Wed, 22 Jan 2025 00:00:00 GMT

Elastic and Confluent are key technology partners and we're pleased to announce new investments in that partnership. Built by the original creators of Apache Kafka®, Confluent's data streaming platform is a key component of many Enterprise ingest architectures, and it ensures that customers can guarantee delivery of critical Observability and Security data into their Elasticsearch clusters. Together, we've been working on key improvements to how our products fit together. With Elastic Agent's new Kafka output and Confluent's newly improved Elasticsearch Sink Connectors, it's never been easier to collect data from the edge, stream it through Kafka, and deliver it into an Elasticsearch cluster.

In this blog, we examine a simple way to integrate Elastic Agent with Confluent Cloud's Kafka offering to reduce the operational burden of ingesting business-critical data.

Benefits of Elastic Agent and Confluent Cloud

When combined, Elastic Agent and Confluent Cloud's updated Elasticsearch Sink connector provide a myriad of advantages for organizations of all sizes. This combined solution offers flexibility in handling any type of data ingest workload in an efficient and resilient manner.

Fully Managed

When combined, Elastic Cloud Serverless and Confluent Cloud provide users with a fully managed service. This makes it effortless to deploy and ingest nearly unlimited data volumes without having to worry about nodes, clusters, or scaling.

Full Elastic Integrations Support

Sending data through Kafka is fully supported with any of the 300+ Elastic Integrations. In this blog post, we outline how to set up the connection between the two platforms. This ensures you can benefit from our investments in built-in alerts, SLOs, AI Assistants, and more.

Decoupled Architecture

Kafka acts as a resilient buffer between data sources (such as Elastic Agent and Logstash) and Elasticsearch, decoupling data producers from consumers. This can significantly reduce total cost of ownership by enabling you to size your Elasticsearch cluster based on typical data ingest volume, not maximum ingest volume. It also ensures system resilience during spikes in data volume.

Ultimate control over your data

With our new Output per Integration capability, customers can now send different data to different destinations using the same agent. Customers can easily send security logs directly to Confluent Cloud/Kafka, which can provide delivery guarantees, while sending less critical application logs and system metrics directly to Elasticsearch.

Deploying the reference architecture

In the following sections, we will walk you through one of the ways Confluent Kafka can be integrated with Elastic Agent and Elasticsearch using Confluent Cloud's Elasticsearch Sink Connector. As with any streaming and data collection technology, there are many ways a pipeline can be configured depending on the particular use case. This blog post will focus on a simple architecture that can be used as a starting point for more complex deployments.

Some of the highlights of this architecture are:

Dynamic Kafka topic selection at Elastic Agents
Elasticsearch Sink Connectors for fully managed transfer from Confluent Kafka to Elasticsearch
Processing data leveraging Elastic's 300+ Integrations

Prerequisites

Before getting started ensure you have a Kafka cluster deployed in Confluent Cloud, an Elasticsearch cluster or project deployed in Elastic Cloud, and an installed and enrolled Elastic Agent.

Configure Confluent Cloud Kafka Cluster for Elastic Agent

Navigate to the Kafka cluster in Confluent Cloud, and select Cluster Settings. Locate and note the Bootstrap Server address; we will need this value later when we create the Kafka Output in Fleet.

Navigate to Topics in the left-hand navigation menu and create two topics:

A topic named logs
A topic named metrics

Next, navigate to API Keys in the left-hand navigation menu:

Click + Add API Key
Select the Service Account API key type
Provide a meaningful name for this API Key
Grant the key write permission to the metrics and logs topics
Create the key

Note the provided Key and Secret; we will need them later when we configure the Kafka Output in Fleet.

Configure Elasticsearch and Elastic Agent

In this section, we will configure the Elastic Agent to send data to Confluent Cloud's Kafka cluster and we will configure Elasticsearch so it can receive data from the Confluent Cloud Elasticsearch Sink Connector.

Configure Elastic Agent to send data to Confluent Cloud

Elastic Fleet simplifies sending data to Kafka and Confluent Cloud. With Elastic Agent, a Kafka "output" can be easily attached to all data coming from an agent or it can be applied only to data coming from a specific data source.

Find Fleet in the left-hand navigation, click the Settings tab. On the Settings tab, find the Outputs section and click Add Output.

Perform the following steps to configure the new Kafka output:

Provide a Name for the output
Set the Type to Kafka
Populate the Hosts field with the Bootstrap Server address we noted earlier
Under Authentication, populate the Username with the API Key and the Password with the Secret we noted earlier
Under Topics, select Dynamic Topic and set Topic from field to data_stream.type
Click Save and apply settings

Next, we will navigate to the Agent Policies tab in Fleet and click to edit the Agent Policy that we want to attach the Kafka output to. With the Agent Policy open, click the Settings tab and change Output for integrations and Output for agent monitoring to the Kafka output we just created.

Selecting an Output per Elastic Integration: To set the Kafka output to be used for specific data sources, see the integration-level outputs documentation.

A note about Topic Selection: The data_stream.type field is a reserved field which Elastic Agent automatically sets to logs if the data we're sending is a log and metrics if the data we're sending is a metric. Enabling Dynamic Topic selection using data_stream.type will cause Elastic Agent to automatically route metrics to a metrics topic and logs to a logs topic. For information on topic selection, see the Kafka Output's Topics settings documentation.
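The effect of dynamic topic selection can be pictured with a small Python sketch. It is a simplified illustration of the idea, not Elastic Agent's implementation; the function name select_topic and the sample events are hypothetical.

```python
def select_topic(event, topic_field="data_stream.type"):
    """Resolve the Kafka topic for an event from a dotted field path.

    Returns None when the field is missing, which a real output would have
    to handle (for example with a fallback topic).
    """
    value = event
    for part in topic_field.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


# Hypothetical events as Elastic Agent might emit them.
log_event = {"data_stream": {"type": "logs", "dataset": "system.syslog"}}
metric_event = {"data_stream": {"type": "metrics", "dataset": "system.network"}}

print(select_topic(log_event), select_topic(metric_event))
```

With the two topics created earlier, every event therefore lands in either the logs topic or the metrics topic without any per-integration configuration.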

Configuring a publishing endpoint in Elasticsearch

Next, we will set up two publishing endpoints (data streams) for the Confluent Cloud Sink Connector to use when publishing documents to Elasticsearch:

We will create a data stream logs-kafka.reroute-default for handling logs
We will create a data stream metrics-kafka.reroute-default for handling metrics

If we were to leave the data in those data streams as-is, the data would be available but we would find the data is unparsed and lacking vital enrichment. So we will also create two index templates and two ingest pipelines to make sure the data is processed by our Elastic Integrations.

Creating the Elasticsearch Index Templates and Ingest Pipelines

The following steps use Dev Tools in Kibana, but all of these steps can be completed via the REST API or using the relevant user interfaces in Stack Management.

First, we will create the Index Template and Ingest Pipeline for handling logs:

PUT _index_template/logs-kafka.reroute
{
  "template": {
    "settings": {
      "index.default_pipeline": "logs-kafka.reroute"
    }
  },
  "index_patterns": [
    "logs-kafka.reroute-default"
  ],
  "data_stream": {}
}

PUT _ingest/pipeline/logs-kafka.reroute
{
  "processors": [
    {
      "reroute": {
        "dataset": [
          "{{data_stream.dataset}}"
        ],
        "namespace": [
          "{{data_stream.namespace}}"
        ]
      }
    }
  ]
}

Next, we will create the Index Template and Ingest Pipeline for handling metrics:

PUT _index_template/metrics-kafka.reroute
{
  "template": {
    "settings": {
      "index.default_pipeline": "metrics-kafka.reroute"
    }
  },
  "index_patterns": [
    "metrics-kafka.reroute-default"
  ],
  "data_stream": {}
}

PUT _ingest/pipeline/metrics-kafka.reroute
{
  "processors": [
    {
      "reroute": {
        "dataset": [
          "{{data_stream.dataset}}"
        ],
        "namespace": [
          "{{data_stream.namespace}}"
        ]
      }
    }
  ]
}

A note about rerouting: For a practical example of how this works, a document related to a Linux Network Metric would first land in metrics-kafka.reroute-default, and this Ingest Pipeline would inspect the document and find data_stream.dataset set to system.network and data_stream.namespace set to default. It would use these values to reroute the document from metrics-kafka.reroute-default to metrics-system.network-default, where it would be processed by the system integration.
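The rerouting decision described in this note can be sketched in Python as follows. This is a simplified model of the reroute processor's naming logic under the assumption that data streams are named type-dataset-namespace; it leaves out the value sanitization the real processor performs. The function name reroute_target is hypothetical.

```python
def reroute_target(doc, current_stream):
    """Compute the destination data stream for a document.

    The type is kept from the current stream; dataset and namespace come
    from the document's data_stream fields, falling back to the current
    stream's values when a field is absent.
    """
    stream_type, current_dataset, current_namespace = current_stream.split("-", 2)
    data_stream = doc.get("data_stream", {})
    dataset = data_stream.get("dataset", current_dataset)
    namespace = data_stream.get("namespace", current_namespace)
    return "{}-{}-{}".format(stream_type, dataset, namespace)


# Hypothetical Linux network metric document.
network_doc = {"data_stream": {"dataset": "system.network", "namespace": "default"}}
print(reroute_target(network_doc, "metrics-kafka.reroute-default"))
```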

Configure the Confluent Cloud Elasticsearch Sink Connector

Now it's time to configure the Confluent Cloud Elasticsearch Sink Connector. We will perform the following steps twice and create two separate connectors, one connector for logs and one connector for metrics. Where the required settings differ, we will highlight the correct values.

Navigate to your Kafka cluster in Confluent Cloud and select Connectors from the left-hand navigation menu. On the Connectors page, select Elasticsearch Service Sink from a catalog of connectors available.

Confluent Cloud presents a simplified workflow for the user to configure a connector. Here we will walk through each step of the process:

Step 1: Topic Selection

First, we will select the topic that the connector will consume data from based on which connector we are deploying:

When deploying the Elasticsearch Sink Connector for logs, select the logs topic.
When deploying the Elasticsearch Sink Connector for metrics, select the metrics topic.

Step 2: Kafka Credentials

Choose KAFKA_API_KEY as the cluster authentication mode. Provide the API Key and Secret noted earlier when we gathered the required Confluent Cloud Cluster information.

Step 3: Authentication

Provide the Elasticsearch Endpoint address of our Elasticsearch cluster as the Connection URI. The Connection user and Connection password are the authentication information for the account in Elasticsearch that will be used by the Elasticsearch Sink Connector to write data to Elasticsearch.

Step 4: Configuration

In this step we will keep the Input Kafka record value format set to JSON. Next, expand Advanced Configuration.

We will set Data Stream Dataset to kafka.reroute
We will set Data Stream Type based on the connector we are deploying:
- When deploying the Elasticsearch Sink Connector for logs, we will set Data Stream Type to logs
- When deploying the Elasticsearch Sink Connector for metrics, we will set Data Stream Type to metrics
The correct values for other settings will depend on the specific environment.

Step 5: Sizing

In this step, notice that Confluent Cloud provides a recommended minimum number of tasks for our deployment. Following the recommendation here is a good starting place for most deployments.

Step 6: Review and Launch

Review the Connector configuration and Connector pricing sections and if everything looks good, it's time to click continue and launch the connector! The connector may report as provisioning but will soon start consuming data from the Kafka topic and writing it to the Elasticsearch cluster.

You can now navigate to Discover in Kibana and find your logs flowing into Elasticsearch! Also check out the real time metrics that Confluent Cloud provides for your new Elasticsearch Sink Connector deployments.

If you have only deployed the first logs sink connector, you can now repeat the steps above to deploy the second metrics sink connector.

Enjoy your fully managed data ingest architecture

If you followed the steps above, congratulations. You have successfully:

Configured Elastic Agent to send logs and metrics to dedicated topics in Kafka
Created publishing endpoints (data streams) in Elasticsearch dedicated to handling data from the Elasticsearch Sink Connector
Configured managed Elasticsearch Sink connectors to consume data from multiple topics and publish that data to Elasticsearch

Next you should enable additional integrations, deploy more Elastic Agents, explore your data in Kibana, and enjoy the benefits of a fully managed data ingest architecture with Elastic Serverless and Confluent Cloud!

The DNA of DATA Increasing Efficiency with the Elastic Common Schema

Wed, 25 Sep 2024 00:00:00 GMT

The Elastic Common Schema is a fantastic way to simplify and unify a search experience. By aligning disparate data sources into a common language, users have a lower bar to overcome with interpreting events of interest, resolving incidents or hunting for unknown threats. However, there are underlying infrastructure reasons to justify adopting the Elastic Common Schema.

In this blog you will learn about the quantifiable operational benefits of ECS, how to leverage ECS with any data ingest tool, and the pitfalls to avoid. The data source leveraged in this blog is a 3.3GB Nginx log file obtained from Kaggle. The representation of this dataset is divided into three categories: raw, self, and ECS; with raw having zero normalization, self being a demonstration of commonly implemented mistakes observed from my 5+ years of experience working with various users, and finally ECS with the optimal approach of data hygiene.

This hygiene is achieved through the parsing, enrichment, and mapping of data ingested; akin to the sequencing of DNA in order to express genetic traits. Through the understanding of the data's structure, and assigning the correct mapping, a more thorough expression may be represented, stored and searched upon.

If you would like to learn more about ECS, the dataset used in this blog, or available Elastic integrations, please be sure to check out these related links:

Dataset Validation

Before we begin, let us review how many documents exist and what we're required to ingest. We have 10,365,152 documents/events from our Nginx log file:

With 10,365,152 documents in our targeted end-state:

Dataset Ingestion: Raw & Self

To achieve the raw and self ingestion techniques, this example leverages Logstash for simplicity. The raw data ingest uses a simple file input with no additional modifications or index templates.


    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/raw/access.log"
        ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-raw"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

For the self ingest, a custom Logstash pipeline with a simple Grok filter was created with no index template applied:

    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/self/access.log"
        ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
      grok {
        match => { "message" => "%{IP:clientip} - (?:%{NOTSPACE:requestClient}|-) \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:requestMethod} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" (?:-|%{NUMBER:response}) (?:-|%{NUMBER:bytes_in}) (-|%{QS:bytes_out}) %{QS:user_agent}" }
      }
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-self"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

Dataset Ingestion: ECS

Elastic includes many integrations that contain everything you need to ensure that your data is ingested as efficiently as possible.

For our use case of Nginx, we'll be using the associated integration's assets only.

The installed assets are more than just dashboards: there are ingest pipelines which not only normalize but also enrich the data, while component templates map the fields to their correct types. All we have to do is make sure that, as the data comes in, it traverses the ingest pipeline and uses these supplied mappings.

Create your index template, and select the supplied component templates provided from your integration.

Think of the component templates like building blocks to an index template. These allow for the reuse of core settings, ensuring standardization is adopted across your data.

For our ingestion method, we merely point to the index name that we specified during the index template creation (in this case, nginx-ecs), and Elastic will handle the rest!

    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/ecs/access.log"
        #ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-ecs"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

Data Fidelity Comparison

Let's compare how many fields are available to search across the three indices, as well as the quality of the data. Our raw index has just 15 fields to search upon, with most being duplicates for aggregation purposes.

However, from a Discover perspective, we are limited to 6 fields!

Our self-parsed index has 37 available fields; however, these too are duplicated and not ideal for efficient searching.

From a Discover perspective, here we have almost 3x as many fields to choose from, yet without the correct mapping, the ease with which this data may be searched is less than ideal. A great example of this is attempting to calculate the average bytes_in on a text field.

Finally, with our ECS index, we have 71 fields available to us! Notice that courtesy of the ingest pipeline, we have enriched fields of geographic information as well as event categorization fields.

Now what about Discover? There were 51 fields directly available to us for searching purposes:

Using the raw index's 6 Discover fields as our baseline, our self-parsed index has 283% as many fields to search upon, whereas our ECS index has 850% as many (51 fields versus 6)!

Storage Utilization Comparison

Surely with all these fields in our ECS index the size would be exponentially larger than the self normalized index, let alone the raw index? The results may surprise you.

Accounting for the replica of our 3.3GB data set, we can see that normalizing and mapping the data has a significant impact on the amount of storage required.

Conclusion

While there is an increase in the amount of storage required for any dataset that is enriched, Elastic provides easy solutions to maximize the fidelity of the data to be searched while simultaneously ensuring operational storage efficiency; that is the power of the Elastic Common Schema.

Let's review how we were able to maximize search while minimizing storage:

Installing integration assets for our dataset that we are going to ingest.

Customizing the index template to leverage the included components to ensure mapping and parsing are aligned to the Elastic Common Schema.

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I've outlined above to get the most value and visibility out of your data.

Accelerate log analytics in Elastic Observability with Automatic Import powered by Search AI

Wed, 04 Sep 2024 00:00:00 GMT

Elastic is accelerating the adoption of AI-driven log analytics by automating the ingestion of custom logs, which is increasingly important as the deployment of GenAI-based applications grows. These custom data sources must be ingested, parsed, and indexed effortlessly, enabling broader visibility and more straightforward root cause analysis (RCA) without requiring effort from Site Reliability Engineers (SREs). Achieving visibility across an enterprise IT environment is inherently challenging for SREs due to constant growth and change, such as new applications, added systems, and infrastructure migrations to the cloud. Until now, the onboarding of custom data has been costly and complex for SREs. With Automatic Import, SREs can concentrate on deploying, optimizing, and improving applications.

Automatic Import uses generative AI to automate the development of custom data integrations, reducing the time required from several days to less than 10 minutes and significantly lowering the learning curve for onboarding data. Powered by the Elastic Search AI Platform, it provides model-agnostic access to leverage large language models (LLMs) and grounds answers in proprietary data through retrieval augmented generation (RAG). This capability is further enhanced by Elastic's expertise in enabling observability teams to utilize any type of data and the flexibility of its Search AI Lake. Arriving at a crucial time when organizations face an explosion of applications and telemetry data, such as logs, Automatic Import streamlines the initial stages of data migration by simplifying data collection and normalization. It also addresses the challenges of building custom connectors, which can otherwise delay deployments, issue analysis, and impact customer experiences.

Enhancing AI Powered Observability with Automatic Import

Automatic Import builds on Elastic Observability’s AI-driven log analytics innovations, such as anomaly detection, log rate and pattern analysis, and Elastic AI Assistant, and further automates and simplifies SREs’ workflows. Automatic Import applies generative AI to automate the creation of custom data integrations, allowing SREs to focus on logs and other telemetry data. While Elastic provides over 400 prebuilt data integrations, Automatic Import allows SREs to extend integrations to fit their workflows and expand visibility into production environments.

In conjunction with Automatic Import, Elastic is introducing Elastic Express Migration, a commercial incentive program designed to overcome migration inertia from existing deployments and contracts, providing a faster adoption path for new customers.

Automatic Import leverages Elastic Common Schema (ECS) with public LLMs to process and analyze data in ECS format, which is also part of OpenTelemetry. Once the data is in, SREs can leverage Elastic’s RAG-based AI Assistant to solve root cause analysis (RCA) challenges in dynamic, complex environments.

Configuring and using Automatic Import

Automatic Import is available to everyone with an Enterprise license. Here is how it works:

The user configures connectivity to an LLM and uploads sample data
Automatic Import then extrapolates what to expect from the data source. These log samples are paired with LLM prompts that have been honed by Elastic engineers to reliably produce conformant Elasticsearch ingest pipelines.
Automatic Import then iteratively builds, tests, and tweaks a custom ingest pipeline until it meets Elastic integration requirements.

Automatic Import powered by the Elastic Search AI Platform

Within minutes, a validated custom integration is created that accurately maps raw data into ECS and custom fields, populates contextual information (such as related.* fields), and categorizes events.

Automatic Import currently supports Anthropic models via Elastic’s connector for Amazon Bedrock, and additional LLMs will be introduced soon. It supports JSON and NDJSON-based log formats currently.

Automatic Import workflow

SREs constantly have to manage new tools and components that developers add to applications. Neo4j, for example, is a database that doesn’t have an integration in Elastic. The following steps walk you through how to create an integration for Neo4j with Automatic Import:

Start by navigating to Integrations -> Create new integration.

Provide a name and description for the new data source.

Next, fill in other details and provide some sample data, anonymized as you see fit.

Click “Analyze logs” to submit integration details, sample logs, and expert-written instructions from Elastic to the specified LLM, which builds the integration package using generative AI. Automatic Import then fine-tunes the integration in an automated feedback loop until it is validated to meet Elastic requirements.

Review what Automatic Import presents as recommended mappings to ECS fields and custom fields. You can easily adjust these settings if necessary.

After finalizing the integration, add it to Elastic Agent or view it in Kibana. It is now available alongside your other integrations and follows the same workflows as prebuilt integrations.

Upon deployment, you can begin analyzing newly ingested data immediately. Start by looking at the new Logs Explorer in Elastic Observability.

Accelerate log analytics with Automatic Import

Automatic Import lowers the time required to build and test custom data integrations from days to minutes, accelerating the switch to AI-driven log analytics. Elastic Observability pairs the unique power of Automatic Import with Elastic’s deep library of prebuilt data integrations, enabling wider visibility and fast data onboarding, along with AI-based features, such as the Elastic AI Assistant to accelerate RCA and reduce operational overhead.

Interested in our Express Migration program to level up to Elastic? Contact Elastic to learn more.

Connecting the Dots: ES|QL Joins for Richer Observability Insights

Thu, 29 May 2025 00:00:00 GMT

You might have seen our recent announcement about the arrival of SQL-style joins in Elasticsearch with ES|QL's LOOKUP JOIN command (now in Tech Preview!). While that post covered the basics, let's take a closer look at this in the context of Observability. How can this new join capability specifically help engineers and SREs make sense of their logs, metrics, and traces and make Elasticsearch more storage efficient by not denormalizing as much data?

Note: Before we jump into the details, it’s important to mention again that this type of functionality today relies on a special lookup index. It is not (yet) possible to JOIN any arbitrary index.

Observability isn't just about collecting data; it's about understanding it. Often, the raw telemetry data – a log line, a metric point, a trace span – lacks the full context needed for quick diagnosis or impact assessment. We need to correlate data, enrich it with business or infrastructure context, and ask more advanced questions.

Historically, achieving this in Elasticsearch involved techniques like denormalizing data at ingest time (using ingest pipelines with enrich processors, for example) or performing joins client-side.

By adding the necessary context (like host details or user attributes) as data flowed in, each document arrived fully ready for queries and analytics without extra processing later on. This approach worked well in many cases and still does, particularly when the reference data changes slowly or when the enriched fields are critical for nearly every search.

However, as environments become more dynamic and diverse, the need to frequently update reference data (or avoid storing repetitive fields in every document) highlighted some of the trade-offs.

With the introduction of ES|QL LOOKUP JOIN in Elasticsearch 8.18 and 9.0, you now have an additional, more flexible option for situations where real-time lookups and minimal duplication are desired. Both methods—ingest-time enrichment and on-the-fly LOOKUP JOIN—complement each other and remain valid, depending on use case needs around update frequency, query performance, and storage considerations.

Why Lookup Joins for Observability

Lookup joins keep things flexible. You can decide on the fly if you’d like to look up additional information to assist you in your investigation.

Here are some examples:

Deployment Information: Which version of the code is generating these errors?
Infrastructure Mapping: Which Kubernetes cluster or cloud region is experiencing high latency? What hardware does it use?
Business Context: Are critical customers being affected by this slowdown?
Team Ownership: Which team owns the service throwing these exceptions?

Keeping this kind of information perfectly denormalized onto every single log line or metric point can be challenging and inefficient. Lookup datasets – like lists of deployments, server inventories, customer tiers, or service ownership mappings – often change independently of the telemetry data itself.

LOOKUP JOIN is ideal here because:

Lookup Indices are Writable: Update your deployment list, CMDB export, or on-call rotation in the lookup index, and your next ES|QL query immediately uses the fresh data. No need to re-run complex enrich policies or re-index data.
Flexibility: You decide at query time which context to join. Maybe today you care about deployment versions, tomorrow about cloud regions.
Simpler Setup: As the original post highlighted, there are no enrich policies to manage. Just create an index with index.mode: lookup and load your data - up to 2 billion documents per lookup index.

Observability Use Cases & Examples with ES|QL

Let’s now look at a few examples to see how Lookup Joins can help.

Enriching Error Logs with Deployment Context

Let's say you're seeing a spike in errors for your opbeans-ruby service. You have logs flowing into a data stream, but they only contain the service name. The documents don’t have any information about the deployment activity itself.

FROM logs-*
  | WHERE log.level == "error"
  | WHERE service.name == "opbeans-ruby"

You need to know if a recent deployment is contributing to these errors. To do this, we can maintain a deployments_info_lkp index (set with index.mode: lookup) that maps service names to their deployment times. This index could be updated from our CI/CD pipeline automatically any time a deployment happens.

PUT /deployments_info_lkp
{
  "settings": {
    "index.mode": "lookup"
  },
  "mappings": {
    "properties": {
      "service": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "version": {
            "type": "keyword"
          }
        }
      },
      "deployment_time": {
        "type": "date"
      }
    }
  }
}
# Bulk index the deployment documents
POST /_bulk
{ "index" : { "_index" : "deployments_info_lkp" } }
{ "service.name": "opbeans-ruby", "service.version": "1.0", "deployment_time": "2025-05-22T06:00:00Z" }
{ "index" : { "_index" : "deployments_info_lkp" } }
{ "service.name": "opbeans-go", "service.version": "1.1.0", "deployment_time": "2025-05-22T06:00:00Z" }

Using this information you can now write a query that joins these two sources.

ES|QL Query:

FROM logs-* 
  | WHERE log.level == "error"
  | WHERE service.name == "opbeans-ruby"
  | LOOKUP JOIN deployments_info_lkp ON service.name

This alone is a good step towards troubleshooting the problem. You now have the deployment_time column available for each of your error documents. The last remaining step now is to use this for further filtering.

Any of the data we managed to join from the lookup index can be handled as any other data we’d usually have available in the ES|QL query. This means that we can filter on it, and check if we had a recent deployment.

FROM logs-*
  | WHERE log.level == "error"
  | WHERE service.name == "opbeans-ruby"
  | LOOKUP JOIN deployments_info_lkp ON service.name 
  | KEEP message, service.name, service.version, deployment_time 
  | WHERE deployment_time > NOW() - 2h
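
To build intuition for what the join and the follow-up filter do, here is a small in-memory sketch in Python. It is not how Elasticsearch implements LOOKUP JOIN; it only mimics the left-join behavior (unmatched rows are kept with empty lookup columns) on hypothetical sample rows, with a fixed timestamp standing in for NOW().

```python
from datetime import datetime, timedelta, timezone


def lookup_join(rows, lookup_rows, key):
    """Left-join rows against lookup_rows on key, similar in spirit to LOOKUP JOIN."""
    index = {}
    for lookup in lookup_rows:
        index.setdefault(lookup[key], []).append(lookup)
    lookup_columns = {c for lookup in lookup_rows for c in lookup if c != key}
    joined = []
    for row in rows:
        matches = index.get(row.get(key))
        if not matches:
            # No match: keep the row and fill the lookup columns with None.
            joined.append({**row, **{c: None for c in lookup_columns}})
            continue
        for match in matches:
            joined.append({**row, **match})
    return joined


# Fixed stand-in for NOW() so the example is repeatable.
now = datetime(2025, 5, 22, 7, 0, tzinfo=timezone.utc)

# Hypothetical error logs and deployment records.
error_logs = [
    {"message": "checkout failed", "service.name": "opbeans-ruby"},
    {"message": "timeout", "service.name": "opbeans-python"},
]
deployments = [
    {"service.name": "opbeans-ruby", "service.version": "1.0",
     "deployment_time": datetime(2025, 5, 22, 6, 0, tzinfo=timezone.utc)},
    {"service.name": "opbeans-go", "service.version": "1.1.0",
     "deployment_time": datetime(2025, 5, 22, 6, 0, tzinfo=timezone.utc)},
]

joined = lookup_join(error_logs, deployments, "service.name")

# Equivalent of: WHERE deployment_time > NOW() - 2h
recent = [row for row in joined
          if row["deployment_time"] is not None
          and row["deployment_time"] > now - timedelta(hours=2)]
```

In this sketch the opbeans-python row survives the join with empty deployment columns and is only dropped by the time filter, which is why the filter has to come after the join.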

Saving disk space using JOIN

Denormalizing data by including contextual information like host OS or cloud provider details directly in every log event is convenient for querying but can increase storage consumption, especially with high-volume data streams. Instead of storing this often-redundant information repeatedly, we can leverage joins to retrieve it on demand, potentially saving valuable disk space. While compression often handles repetitive data well, removing these fields entirely can still yield noticeable storage savings.

In this example we’ll use a dataset of 1,000,000 Kubernetes container logs using the default mapping of the Kubernetes integration, with logsdb index mode enabled. The starting size for this index is 35.5mb.

GET _cat/indices/k8s-logs-default?h=index,pri.store.size
### 
k8s-logs-default       35.5mb

Using the disk usage API, we observed that fields like host.os and cloud.* contribute roughly 5% to the total index size on disk (35.5mb). These fields can be useful in some cases, but information like the os.name is rarely queried.

// Example host.os structure
"os": {
  "codename": "Plow", "family": "redhat", "kernel": "6.6.56+",
  "name": "Red Hat Enterprise Linux", "platform": "rhel", "type": "linux", "version": "9.5 (Plow)"
}

// Example cloud structure
"cloud": {
  "account": { "id": "elastic-observability" },
  "availability_zone": "us-central1-c",
  "instance": { "id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905" },
  "machine": { "type": "e2-standard-4" },
  "project": { "id": "elastic-observability" },
  "provider": "gcp", "region": "us-central1", "service": { "name": "GCE" }
}

Instead of storing this information with every document, let's drop it in an ingest pipeline.

PUT _ingest/pipeline/drop-host-os-cloud
{
  "processors": [
    { "remove": { "field": "host.os", "ignore_missing": true } },
    { "set": { "field": "tmp1", "value": "{{cloud.instance.id}}", "ignore_empty_value": true } }, // Temporarily store the ID
    { "remove": { "field": "cloud", "ignore_missing": true } },       // Remove the entire cloud object
    { "set": { "field": "cloud.instance.id", "value": "{{tmp1}}", "ignore_empty_value": true } }, // Restore just the cloud instance ID
    { "remove": { "field": "tmp1", "ignore_missing": true } }         // Clean up temporary field
  ]
}
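
If it helps to see the effect of the pipeline on a single document, the following Python sketch applies the same idea to a plain dict. The function name and the trimmed sample document are hypothetical; the real work happens in the ingest pipeline, not in client code.

```python
import copy


def drop_host_os_and_cloud(doc):
    """Mimic the pipeline on a plain dict: drop host.os and keep only cloud.instance.id."""
    out = copy.deepcopy(doc)
    out.get("host", {}).pop("os", None)
    instance_id = out.get("cloud", {}).get("instance", {}).get("id")
    out.pop("cloud", None)
    if instance_id is not None:
        # Restore just the join key we need later.
        out["cloud"] = {"instance": {"id": instance_id}}
    return out


# Hypothetical, heavily trimmed log document.
log_doc = {
    "message": "container started",
    "host": {"name": "node-1", "os": {"name": "Red Hat Enterprise Linux", "type": "linux"}},
    "cloud": {
        "provider": "gcp",
        "region": "us-central1",
        "instance": {"id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905"},
    },
}
slim_doc = drop_host_os_and_cloud(log_doc)
```

Only cloud.instance.id remains from the cloud object, and it is the key used to look the removed metadata up again at query time.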

Reindexing (and force merging to one segment) now shows the following size, resulting in approximately 5% less space.

GET _cat/indices/k8s-logs-*?h=index,pri.store.size
### 
k8s-logs-default             35.5mb
k8s-logs-drop-cloud-os       33.7mb
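
As a quick sanity check of the "approximately 5%" figure, the arithmetic on the two sizes reported in this example (35.5mb before, 33.7mb after) can be sketched as follows:

```python
def percent_saved(before_mb, after_mb):
    """Relative storage reduction, as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100


saving = percent_saved(35.5, 33.7)
print(f"{saving:.1f}% smaller")  # prints "5.1% smaller"
```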

Now, to regain access to the removed host.os and cloud.* information during analysis without storing it in every log document, we can create a lookup index. This index will store the full host and cloud metadata, keyed by the cloud.instance.id that we preserved in our logs. This instance_metadata_lkp index will be significantly smaller than the space saved across millions or billions of log lines, as it only needs one document per unique instance.

# Create the lookup index for instance metadata
PUT /instance_metadata_lkp
{
  "settings": {
    "index.mode": "lookup"
  },
  "mappings": {
    "properties": {
      "cloud": {
        "properties": {
          "instance": {
            "properties": {
              "id": { "type": "keyword" }    # The join key we kept in the logs
            }
          },
          "region": { "type": "keyword" }    # Mapped so it can be filtered on after the join
        }
      },
      "host": {
        "properties": {
          "os": {
            "properties": {
              "name": { "type": "keyword" }  # Mapped so it can be filtered on after the join
            }
          }
        }
      }
    }
  }
}

# Bulk index sample instance metadata (keyed by cloud.instance.id)
# This data might come from your cloud provider API or CMDB
POST /_bulk
{ "index" : { "_index" : "instance_metadata_lkp", "_id": "5799032384800802653" } }
{ "cloud.instance.id": "5799032384800802653", "host.os": { "codename": "Plow", "family": "redhat", "kernel": "6.6.56+", "name": "Red Hat Enterprise Linux", "platform": "rhel", "type": "linux", "version": "9.5 (Plow)" }, "cloud": { "account": { "id": "elastic-observability" }, "availability_zone": "us-central1-c", "instance": { "id": "5799032384800802653", "name": "gke-edge-oblt-edge-oblt-pool-46262cd0-w905" }, "machine": { "type": "e2-standard-4" }, "project": { "id": "elastic-observability" }, "provider": "gcp", "region": "us-central1", "service": { "name": "GCE" } } }

With this setup, when you need the full host or cloud context for your logs, you can simply use LOOKUP JOIN in your ES|QL query and continue filtering on the data from the lookup index:

FROM logs-* 
  | LOOKUP JOIN instance_metadata_lkp ON cloud.instance.id 
  | WHERE cloud.region == "us-central1"

This approach allows us to query the full context when needed (e.g., filtering logs by host.os.name or cloud.region) while significantly reducing the storage footprint of the high-volume log indices by avoiding redundant data denormalization.

It should be noted that low-cardinality metadata fields generally compress well, and a large part of the storage savings in this case comes from the “text” mapping of the host.os.name and cloud.instance.name fields. Make sure to use the disk usage API to evaluate whether this approach would be worth it in your specific use case.

Getting Started with Lookups for Observability

Creating the necessary lookup indices is straightforward. As detailed in our initial blog post, you can use Kibana's Index Management UI, the Create Index API, or the File Upload utility – the key is setting "index.mode": "lookup" in the index settings.

For Observability, consider automating the population of these lookup indices:

Export data periodically from your CMDB, CRM, or HR systems.
Have your CI/CD pipeline update the deployments_info_lkp index upon successful deployment.
Use tools like Logstash with an elasticsearch output configured to write to your lookup index.
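
As one possible shape for the CI/CD option above, the following Python sketch builds the NDJSON body that a pipeline step could POST to the _bulk API. The function name is hypothetical, and using the service name as the document _id is an assumption made here so that each new deployment overwrites the previous row for that service; sending the request (authentication, HTTP client) is left out.

```python
import json


def build_bulk_body(index_name, deployments):
    """Build an NDJSON body for the _bulk API: one action line plus one document per deployment."""
    lines = []
    for deployment in deployments:
        # Assumption: the service name is used as _id, so the lookup index keeps
        # one row per service and a new deployment replaces the previous one.
        action = {"index": {"_index": index_name, "_id": deployment["service.name"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(deployment))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("deployments_info_lkp", [
    {"service.name": "opbeans-ruby", "service.version": "1.0",
     "deployment_time": "2025-05-22T06:00:00Z"},
])
```

Because lookup indices are writable, the next ES|QL query that joins against the index sees the updated row without any reindexing.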

A Note on Performance and Alternatives

While incredibly powerful, joins aren't free. Each LOOKUP JOIN adds processing overhead to your query. For contextual data that is very static (e.g., the cloud region a host permanently resides in) and needed in almost every query against that data, the traditional approach of enriching at ingest time might still be slightly more performant for those specific queries, trading upfront processing and storage for query speed.

However, for the dynamic, flexible, and targeted enrichment scenarios common in Observability – like mapping to ever-changing deployments, user segments, or team structures – LOOKUP JOIN offers a compelling, efficient, and easier-to-manage solution.

Conclusion

ES|QL's LOOKUP JOIN makes it easy to correlate and enrich your logs, metrics, and traces with up-to-date external information at query time, so you can move faster from detecting problems to understanding their scope, impact, and root cause.

This feature is currently in Technical Preview in Elasticsearch 8.18 and Serverless, available now on Elastic Cloud. We encourage you to try it out with your own Observability data and share your feedback using the "Submit feedback" button in the ES|QL editor in Discover. We're excited to see how you use it to connect the dots in your systems!

Introducing Streams for Observability: Your first stop for investigations

Mon, 27 Oct 2025 00:00:00 GMT

We're excited to introduce Streams, a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset to immediately identify not only the root cause, but also the why behind the root cause to enable instant resolution.

SREs today identify the "what" with metrics and the "where" with traces, which are important for troubleshooting. However, it's often the "why" that's needed for faster and more accurate incident resolution. The crucial “why” is buried in your logs, but the massive volume and unstructured nature of logs in modern microservice environments have made them difficult to use effectively. This has forced teams into a difficult position: either spend countless hours building and maintaining complex data pipelines to tame the chaos, or drop valuable log data to control costs and risk critical visibility gaps. As a result, when an incident occurs, SREs waste precious time manually hunting for clues and reverse-engineering data instead of quickly resolving the issue.

Streams, from ingest to answers with logs

Streams directly addresses this challenge by using AI to transform the chaos of raw logs into your clearest path to a solution, enabling logs to be the primary signal for investigations. It processes raw logs at scale ingested from any source and in any format (structured and unstructured), then partitions, parses, and helps manage retention and data quality. Streams reduces the need for SREs to constantly normalize data, manage custom schemas, or sift through endless noise. Streams also surfaces Significant Events, like major errors and anomalies, enabling you to be proactive in your investigations. SREs can now focus on resolving issues faster than ever by spending less time on data management and hunting through the noise.

Let's see Streams in action. In the demo below, watch an SRE tackle an issue with a critical trading application in production. In minutes, Streams processes the raw logs, pinpoints a Java out-of-memory error, and the AI Assistant guides the SRE straight to the root cause, turning hours of manual work into a quick fix.

Let's walk through some of the key Streams capabilities highlighted in the video:

AI-based partitioning - simplifies ingest by allowing SREs to send all logs to a single endpoint, without worrying about agents or integrations. Our AI automatically determines that logs are coming from two different systems, Hadoop and Spark. As more data comes through, it continues to learn and identify additional components, making segmentation effortless.

AI-based parsing - eliminates the manual effort of building and managing log processing pipelines. In the demo, Streams automatically detects logs from Spark and generates a GROK rule that perfectly parses 100% of the fields.

Identifying Significant Events - Cuts through the noise so you can focus immediately on key issues. Streams analyzes the parsed Spark logs and pinpoints the Java out-of-memory errors and exceptions. This provides SREs with a clear, actionable starting point for their investigations instead of forcing them to hunt through raw data.

AI Assistant - The AI Assistant provides instant root cause analysis, turning hours of work into immediate answers. After Streams identifies the Java OOM error, an SRE can analyze logs in Discover with the AI Assistant. Within moments, it determines the root cause is that Spark lacks sufficient memory for the datasets being processed, delivering a precise answer to guide remediation.

One item that isn't in the video is how easy Streams makes log ingestion. In the example above, we used the OTel Collector and merely configured it with processor, exporter, and service statements in the values.yaml file for the OTel Collector's Helm chart:

processors:
  transform/logs-streams:
    log_statements:
      - context: resource
        statements:
          - set(attributes["elasticsearch.index"], "logs")
exporters:
  debug:
  otlp/ingest:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch, transform/logs-streams]
      exporters: [otlp/ingest, debug]

With Streams, you can use any log forwarder: the OTel Collector (as in the example above), Fluentd, Fluent Bit, and so on. This makes ingesting simple and ensures you aren't locked into any specific log forwarder for Elastic.

As you've seen in this example, Streams helps SREs focus on finding the “why”, without the manual, error-prone work of making logs usable. What used to happen in hours can now be accomplished in minutes.

Streams: Key Features and availability

While the previous example shows how easy and fast it is to get to RCA with partitioning, parsing, Significant Events, and the AI Assistant, Streams has more capabilities, which are highlighted in the following diagram:

All of these capabilities are available in two primary modes: Streams for data already indexed in Elasticsearch, and Logs Streams for ingesting raw logs directly. Both modes support AI-driven partitioning and parsing, the identification of Significant Events, and essential tools for managing data quality, retention, and cost-efficient storage.

Streams (GA in 9.2)

Provides foundational capabilities that reduce pipeline management for SREs. Streams works with logs from existing agents and integrations as well as raw, unstructured logs coming through Logs Streams. Key capabilities include:

Streams Processing: simulate and refine log parsing using AI-powered Parsing or a point-and-click UI. Compare before-and-after states and modify schemas to simplify log processing.
Streams Retention Management: define time-based or advanced ILM policies directly in the UI, gain visibility into ingestion volume, and manage data in the failure store.
Streams Data Quality: detect and fix ingestion failures via a failure store that captures and exposes failed documents for inspection.

Logs Streams (Tech Preview)

Enables SREs to ingest any log, in any format, directly into Elasticsearch, without the need for agents or integrations. Key capabilities include:

Direct Ingestion with any log forwarder into Elasticsearch: send raw logs directly into the /logs index using any mechanism, such as the logs_index parameter in an OpenTelemetry collector.
AI-Driven Partitioning: automatically or manually segment a single log stream into distinct parts (e.g., by service or component) using contextual AI-based suggestions.

Significant Events (Tech Preview)

Significant Events is available in both Streams and Logs Streams, and surfaces errors and anomalies that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other signals of change. These events act as actionable markers, giving SREs early warning and an investigative starting point before a service impact occurs.

What does this mean for SREs in practice?

With Elastic Streams, SREs no longer need to spend time data wrangling before they can be investigators. Logs are the primary investigation signal because Streams provides SREs with the ability to:

Log everything in any format, and don't worry about pipelines - Stop wasting time building and maintaining complex ingestion pipelines. Send logs in any format, structured or unstructured, from any source directly to a single Elastic endpoint, without needing specific agents. Use OTel collectors or any other data shipper to send logs to Elastic. Streams' AI-driven processing parses and structures your log data, making it immediately “ready for investigation”. This means you can adapt to new log formats on the fly without the need to maintain brittle configurations. Streams ensures you always have the data you need, the moment you need it.

Don't just collect logs, get answers from them - Streams analyzes your data to surface “Significant Events,” proactively identifying critical errors, anomalies, and performance bottlenecks like out-of-memory exceptions. Instead of manually sifting through terabytes of data, you get a clear, prioritized starting point for your investigation. This allows you to go from symptom to solution in minutes, fixing issues before they impact users.

Achieve Complete Visibility at a Lower Cost: Get comprehensive visibility across all your services without the expected expense. By intelligently structuring data and surfacing only the most critical events, Streams reduces operational complexity and dramatically cuts down root cause analysis time. This efficiency allows you to store all relevant log data cost-effectively, ensuring you never have to sacrifice crucial visibility to meet a budget. Get clearer answers faster and lower your total cost of ownership.

Conclusion

Elastic Streams revolutionizes observability by transforming logs from a noisy and expensive data source into a primary investigation signal. Through AI-powered capabilities like automatic partitioning, parsing, retention management, and the surfacing of Significant Events, Streams empowers SREs to move beyond data management and directly pinpoint the root cause of issues. By reducing operational complexity, lowering storage costs, and providing complete visibility, Streams ensures that logs, enriched by AI, become the fastest path to resolution by answering the critical question of “why” for observability.

Sign up for an Elastic trial at cloud.elastic.co and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.

Additionally, check out:

Read about Reimagining streams

Look at the Streams website

Read the Streams documentation

Elastic SQL inputs: A generic solution for database metrics observability

Mon, 11 Sep 2023 00:00:00 GMT

Elastic® SQL inputs (Metricbeat module and input package) allow users to execute SQL queries against many supported databases in a flexible way and ingest the resulting metrics into Elasticsearch®. This blog dives into the functionality of generic SQL and provides various use cases for advanced users to ingest custom metrics into Elastic® for database observability. The blog also introduces the new fetch_from_all_databases capability, released in 8.10.

Why “Generic SQL”?

Elastic already has Metricbeat modules and integration packages targeted at specific databases. One example is the Metricbeat module for MySQL and the corresponding integration package. These Beats modules and integrations are customized for a specific database, and the metrics are extracted using predefined queries from that database. The queries used in these integrations and the corresponding metrics cannot be modified.

In contrast, the Generic SQL inputs (Metricbeat module or input package) can be used to scrape metrics from any supported database using the user's own SQL queries. The user provides the queries for the specific metrics to be extracted. This enables a much more powerful mechanism for metrics ingestion: users choose a driver and provide the relevant SQL queries, and the results are mapped to one or more Elasticsearch documents using a structured mapping process (the table and variables formats explained later).

Generic SQL inputs can be used in conjunction with the existing integration packages, which already extract specific database metrics, to extract additional custom metrics dynamically, making this input very powerful. In this blog, Generic SQL input and Generic SQL are used interchangeably.

Functionality details

This section covers some of the features that would help with the metrics extraction. We provide a brief description of the response format configuration. Then we dive into the merge_results functionality, which is used to combine results from multiple SQL queries into a single document.

The next key functionality users may be interested in is to collect metrics from all the custom databases, which is now possible with the fetch_from_all_databases feature.

Now let's dive into the specific functionalities:

Different drivers supported

Generic SQL can fetch metrics from different databases. The current version can fetch metrics using the following drivers: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server (MSSQL).

Response format

The response format in generic SQL determines whether query results are returned in table or variables format. Here’s an overview of the two formats and their syntax.

Syntax: response_format: table (or variables)

Response format table
This mode generates a single event for each row. The table format places no restriction on the number of columns in the response.

Example:

driver: "mssql"
sql_queries:
 - query: "SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
   response_format: table

This query returns a response similar to this:

"sql":{
      "metrics":{
         "counter_name":"User Connections ",
         "cntr_value":7
      },
      "driver":"mssql"
}

The response generated above adds the counter_name as a key in the document.

Response format variables
The variables format supports key:value pairs. It expects the query to return exactly two columns.

Example:

driver: "mssql"
sql_queries:
 - query: "SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
   response_format: variables

The variables format takes the value of the first column in the query above as the key:

"sql":{
      "metrics":{
         "user connections ":7
      },
      "driver":"mssql"
}

In the above response, you can see the value of counter_name is used to generate the key in variable format.
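
To make the difference between the two formats concrete, here is a minimal Python sketch of the mapping. It is an illustration only, not Metricbeat's implementation: the function names `to_table_events` and `to_variables_event` are hypothetical, and the lowercasing of keys in the variables format is an assumption inferred from the sample response above.

```python
from typing import Any

Row = dict[str, Any]  # one result row: column name -> value


def to_table_events(rows: list[Row]) -> list[dict[str, Any]]:
    """Table format: one event per row, every column becomes a field."""
    return [dict(row) for row in rows]


def to_variables_event(rows: list[Row]) -> dict[str, Any]:
    """Variables format: each two-column row becomes one key:value pair."""
    event: dict[str, Any] = {}
    for row in rows:
        if len(row) != 2:
            raise ValueError("variables format expects exactly two columns")
        key, value = row.values()  # first column is the key, second the value
        event[str(key).lower()] = value  # assumption: keys are lowercased
    return event


rows = [{"counter_name": "User Connections", "cntr_value": 7}]
table_events = to_table_events(rows)
variables_event = to_variables_event(rows)
```

With the single sample row, the table format keeps both column names as fields, while the variables format collapses the row into one field named after the counter.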

Response optimization: merge_results

We now support merging multiple query responses into a single event. By enabling merge_results, users can significantly reduce the storage space used by the metrics ingested into Elasticsearch. This mode compacts the generated documents: instead of generating multiple documents, a single merged document is generated wherever applicable. Metrics of a similar kind, generated from multiple queries, are combined into a single event.

Syntax: merge_results: true (or false)

The example below shows how the data is loaded into Elasticsearch when merge_results is disabled.

Example:

In this example, we are using two different queries to fetch metrics from the performance counter.

merge_results: false
driver: "mssql"
sql_queries:
  - query: "SELECT cntr_value As 'user_connections' FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
    response_format: table
  - query: "SELECT cntr_value As 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name like '%Buffer Manager%'"
    response_format: table

As you can see, the response for the above example generates a single document for each query.

The resulting document from the first query:

"sql":{
      "metrics":{
         "user_connections":7
      },
      "driver":"mssql"
}

And resulting document from the second query:

"sql":{
      "metrics":{
         "buffer_cache_hit_ratio":87
      },
      "driver":"mssql"
}

When we enable the merge_results flag in the query, both the above metrics are combined together and the data gets loaded in a single document.

You can see the merged document in the below example:

"sql":{
      "metrics":{
         "user_connections":7,
         "buffer_cache_hit_ratio":87
      },
      "driver":"mssql"
}

However, table queries can be merged only if each of them produces a single row. There is no such restriction on merging variables queries.
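
The merge rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the module's actual code: `merge_results` here is a hypothetical helper that receives the rows of each table query and the already-flattened result of each variables query.

```python
from typing import Any


def merge_results(
    table_results: list[list[dict[str, Any]]],
    variables_results: list[dict[str, Any]],
) -> dict[str, Any]:
    """Merge per-query results into one event.

    Each entry of table_results holds the rows of one table query; it can
    only be merged when that query returned exactly one row.
    """
    merged: dict[str, Any] = {}
    for rows in table_results:
        if len(rows) != 1:
            raise ValueError("table queries can be merged only if they return a single row")
        merged.update(rows[0])
    for event in variables_results:
        merged.update(event)  # variables results merge without restriction
    return merged


merged = merge_results(
    table_results=[[{"user_connections": 7}], [{"buffer_cache_hit_ratio": 87}]],
    variables_results=[],
)
```

The two single-row table results from the example above end up as two fields of one event; a table query returning several rows would be rejected in this sketch.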

Introducing a new capability: fetch_from_all_databases

This is a new functionality to fetch all the database metrics automatically from the system and user databases of the Microsoft SQL Server, by enabling the fetch_from_all_databases flag.

Keep an eye out for the 8.10 release version where you can start using the fetch all database feature. Prior to the 8.10 version, users had to provide the database names manually to fetch metrics from custom/user databases.

Syntax: fetch_from_all_databases: true (or false)

Below is the sample query with fetch all databases flag as disabled:

fetch_from_all_databases: false
driver: "mssql"
sql_queries:
  - query: "SELECT @@servername AS server_name, @@servicename AS instance_name, name As 'database_name', database_id FROM sys.databases WHERE name='master';"

The above query fetches metrics only for the provided database name. Here the input database is master, so the metrics are fetched only for the master.

Below is the sample query with the fetch all databases flag as enabled:

fetch_from_all_databases: true
driver: "mssql"
sql_queries:
  - query: "SELECT @@servername AS server_name, @@servicename AS instance_name, DB_NAME() AS 'database_name', DB_ID() AS database_id;"
    response_format: table

The above query fetches metrics from all available databases. This is useful when the user wants to get data from all the databases.
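
Conceptually, fetching from all databases means running the same query once per database. The Python sketch below shows one way this could be expressed. It assumes the per-database context is selected by prefixing each query with a T-SQL `USE` statement; that approach, the helper name `expand_query_per_database`, and the database list are illustrative assumptions, not a description of the integration's internals.

```python
def expand_query_per_database(query: str, databases: list[str]) -> dict[str, str]:
    """Return one query per database, each switching context with USE first."""
    expanded = {}
    for name in databases:
        escaped = name.replace("]", "]]")  # escape closing brackets in T-SQL identifiers
        expanded[name] = f"USE [{escaped}]; {query}"
    return expanded


queries = expand_query_per_database(
    "SELECT DB_NAME() AS database_name, DB_ID() AS database_id;",
    ["master", "tempdb", "sales"],
)
```

Because the query uses DB_NAME() and DB_ID() rather than a hard-coded name, each expanded query reports on whichever database it runs in.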

Please note: currently this feature is supported only for Microsoft SQL Server and will be used by MS SQL integration internally, to support extracting metrics for all user DBs by default.

Using generic SQL: Metricbeat

The generic SQL metricbeat module provides flexibility to execute queries against different database drivers. The metricbeat input is available as GA for any production usage. Here, you can find more information on configuring the generic SQL for different drivers with various examples.

Using generic SQL: Input package

The input package provides a flexible solution for advanced users to customize their ingestion experience in Elastic. Generic SQL is now also available as an SQL input package. The input package is currently available for early users as a beta release. Let's walk through how users can use generic SQL via the input package.

Configurations of generic SQL input package:

The configuration options for the generic SQL input package are as below:

Driver: This is the SQL database for which you want to use the package. In this case, we will take mysql as an example.
Hosts: Here the user enters the connection string to connect to the database. It varies depending on which database/driver is being used. Refer here for examples.
SQL Queries: Here the user writes the SQL queries they want to run and specifies the response_format for each.
Data set: The user specifies a data set name to which the response fields get mapped.
Merge results: This is an advanced setting, used to merge queries into a single event.

Metrics extensibility with customized SQL queries

Let's say a user is using the MySQL integration, which provides a fixed set of metrics. Their requirement now extends to retrieving more metrics from the MySQL database by running new customized SQL queries.

This can be achieved by adding an instance of the SQL input package, writing the customized queries, and specifying a new data set name.

This way users can get any metrics by executing corresponding queries. The resultant metrics of the query will be indexed to the new data set, sql_second_dataset.

When there are multiple queries, users can club them into a single event by enabling the Merge Results toggle.

Customizing user experience

Users can customize their data by writing their own ingest pipelines and providing their customized mappings. Users can also build their own bespoke dashboards.

As we can see above, the SQL input package provides the flexibility to get new metrics by running new queries that are not supported in the default MySQL integration (where the user gets metrics from a predetermined set of queries).

The SQL input package also supports multiple drivers: mssql, postgresql and oracle. So a single input package can be used to cater to all these databases.

Note: The fetch_from_all_databases feature is not supported in the SQL input package yet.

Try it out!

Now that you know about the various use cases and features of generic SQL, get started with Elastic Cloud and try using the SQL input package with your SQL database to get a customized experience and metrics. If you are looking for newer metrics for some of our existing SQL-based integrations, such as Microsoft SQL Server, Oracle, and more, go ahead and give the SQL input package a whirl.

Smarter Alerting Arrives with Faster Triage, Clearer Groupings, and Actionable Guidance

Thu, 04 Sep 2025 00:00:00 GMT

In the 9.1 release, we've made significant upgrades to alerting to help SREs and operators cut through the noise, understand what's happening faster, and take meaningful action with less guesswork.

Here's what's new:

Improved Related Alert Grouping with Relevance Scoring & Reasoning

We've enhanced our related alert detection to go beyond surface-level correlations. Alerts are now grouped based on a relevance score that reflects the strength of their relationship across dimensions like:

Shared entities or resources (e.g. same host, pod, or service)
Temporal proximity (alerts firing within a suspiciously short window)
Signal similarity (e.g. spikes in logs, metrics, and traces that point to the same failure mode)

More importantly, we now show the why. You'll see why an alert is grouped, whether it's sharing the same Kubernetes pod, has similar log patterns, or was triggered by the same upstream anomaly. This gives users confidence in the grouping logic and accelerates root cause analysis.
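
To illustrate the idea of a relevance score that comes with its reasoning, here is a small Python sketch. Everything in it is hypothetical: the `Alert` shape, the 300-second window, and the 0.6/0.4 weights are made up for the example and say nothing about how Elastic computes relevance. It covers only two of the dimensions listed above (shared entities and temporal proximity).

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    started_at: float  # seconds since epoch
    entities: set[str] = field(default_factory=set)  # e.g. {"host:web-1", "pod:cart-7"}


def relevance(a: Alert, b: Alert, window_s: float = 300.0) -> tuple[float, list[str]]:
    """Return a score in [0, 1] plus human-readable reasons for the grouping."""
    reasons: list[str] = []
    shared = a.entities & b.entities
    union = a.entities | b.entities
    entity_score = len(shared) / len(union) if union else 0.0
    if shared:
        reasons.append("shared entities: " + ", ".join(sorted(shared)))
    gap = abs(a.started_at - b.started_at)
    time_score = max(0.0, 1.0 - gap / window_s)
    if time_score > 0:
        reasons.append(f"fired {gap:.0f}s apart")
    # Hypothetical weights; a real system would tune or learn these.
    return 0.6 * entity_score + 0.4 * time_score, reasons


a = Alert("High CPU", 1000.0, {"host:web-1", "pod:cart-7"})
b = Alert("Error rate", 1060.0, {"pod:cart-7"})
score, reasons = relevance(a, b)
```

Returning the reasons next to the score is the point of the sketch: the grouping decision and its explanation are produced together, so the explanation cannot drift from the logic.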

Link Dashboards to Alert Rules and Get Smart Suggestions

You can now link dashboards directly to your alert rules, giving responders an instant visual lens into the metrics or logs that matter most for that alert. No more scrambling to remember which dashboard to check — just click and go.

And we've made this smarter too: Elastic will now suggest relevant dashboards based on the alert's source, rule logic, or monitored entities, helping users land on the right view without needing to configure anything upfront.

Investigation Guides Embedded Into Alerts

Every alert can now be configured with an investigation guide, a set of pre-configured, context-aware instructions or next steps tailored to the alert. Think of it as a playbook that's embedded right where and when you need it.

Use it to:

Document your team's runbooks and standard triage steps or link to existing runbooks
Guide junior engineers or on-call responders through unfamiliar territory
Automate the first few steps of root cause analysis

Why This Matters

These changes are all about reducing mean time to detect (MTTD) and mean time to resolve (MTTR). By:

Grouping alerts more intelligently (and transparently)
Giving you the dashboards you need, when you need them
Embedding action-oriented guides in every alert

We're bringing you closer to a truly streamlined incident response workflow: no swivel-chairing, no guesswork, just clarity.

Additionally, look at some of our other articles on Elastic Observability Labs related to analysis:

Streams Processing: Stop Fighting with Grok. Parse Your Logs in Streams.

Thu, 11 Dec 2025 00:00:00 GMT

With Streams, Elastic's new AI capability in 9.2, we make parsing your logs so simple that it's no longer a concern. Logs are generally messy: they contain lots of fields, some understood, some unknown. You have to constantly keep up with their semantics and write pattern matches to parse them properly. In some cases, even fields you know have unexpected values or semantics. For instance, the timestamp is the ingest time, not the event time. Or you can't filter by log.level or user.id because they're buried inside the message field. As a result, your dashboards are flat and not useful.

Fixing this used to mean leaving Kibana, learning Grok syntax, manually editing ingest pipeline JSON or a complicated Logstash config, and hoping you didn't break parsing for everything else.

We built Streams to fix this, and much more. It's your one place for data processing, built right into Kibana, that lets you build, test, and deploy parsing logic on live data in seconds. It turns a high-risk backend task into a fast, predictable, interactive UI workflow. You can use AI to generate Grok rules automatically from a sample of logs, or build them easily with the UI. Let's walk through an example.

A Quick Walkthrough

Let's fix a common "unstructured" log right now.

Start in Discover. You find a log that isn't structured. The @timestamp is wrong, and fields like log.level aren't being extracted, so your histograms are just a single-color bar.

Inspect the log. Open the document flyout (the "Inspect a single log event" view). You'll see a button: "Parse content in Streams" (or "Edit processing in Streams"). Click it.

Go to Processing. This takes you directly to the Streams processing tab, pre-loaded with sample documents from that data stream. Click "Create your first step."

Generate a Pattern. The processor defaults to Grok. You don't have to write any. Just click the "Generate Pattern" button. Streams analyzes 100 sample documents from your stream and suggests a Grok pattern for you. By default, this uses the Elastic Managed LLM, but you can configure your own.

Accept and Simulate. Click "Accept." Instantly, the UI runs a simulation across all 100 sample documents. You can make changes to the pattern or adjust field names, and the simulation re-runs with every keystroke.

When you're happy, you save it. Your new logs will now be parsed correctly.
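
To give a feel for what a generated pattern does, the Python sketch below mimics a Grok pattern with a regular expression. The pattern `%{TIMESTAMP_ISO8601:@timestamp} %{LOGLEVEL:log.level} %{GREEDYDATA:message}` and the sample log line are made-up examples, and the regex is only a rough approximation of those Grok definitions, not what Streams would generate for your data.

```python
import re
from typing import Optional

# Rough regex equivalent of the hypothetical Grok pattern
#   %{TIMESTAMP_ISO8601:@timestamp} %{LOGLEVEL:log.level} %{GREEDYDATA:message}
# Python group names cannot contain "@" or ".", so they are mapped afterwards.
LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?)"
    r"\s+(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)"
    r"\s+(?P<message>.*)"
)
FIELD_NAMES = {"timestamp": "@timestamp", "level": "log.level", "message": "message"}


def parse(line: str) -> Optional[dict[str, str]]:
    """Return extracted fields, or None when the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    return {FIELD_NAMES[k]: v for k, v in match.groupdict().items()}


doc = parse("2025-12-11T09:15:02Z ERROR payment failed for order 42")
```

Once the event time and log.level are separate fields, they can be filtered and aggregated on, which is what turns the single-color histogram into a useful one.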

Powerful Features for Messy, Real-World Logs

That's the simple case. But real-world data is rarely that clean. Here are the features built to handle the complexity.

The Interactive Grok UI

When you use the Grok processor, the UI gives you a visual indication of what your pattern is extracting. You can see which parts of the message field are being mapped to which new field names. This immediate feedback means you're not just guessing. Autocompletion of Grok patterns and instant pattern validation are also included.

The Diff Viewer

How do you know what exactly changed? Expand any row in the simulation table. You'll get a diff view showing precisely which fields were added, removed, or modified for that specific document. No more guesswork.

End to End Simulation and Detecting Failures

This is the most critical part. Streams doesn't just simulate the processor; it simulates the entire indexing process. If you try to map a non-timestamp string (like the message field) directly to the @timestamp field, the simulation will show a failure. It detects the mapping conflict before you save it and before it can create a data-mapping conflict in your cluster. This safety net is what lets you move fast.
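
The value of simulating the whole indexing step, not just the processor, can be shown with a toy model in Python. This sketch is not how Streams is implemented: the `simulate` function and its single rule (the @timestamp field must parse as an ISO 8601 date) stand in for a real mapping check.

```python
from datetime import datetime
from typing import Any, Callable

Doc = dict[str, Any]


def simulate(samples: list[Doc], processor: Callable[[Doc], Doc]) -> dict[str, Any]:
    """Run processor on copies of the samples and check the result against a
    toy 'mapping' in which @timestamp must be an ISO 8601 date."""
    failures: list[str] = []
    for i, sample in enumerate(samples):
        result = processor(dict(sample))
        try:
            datetime.fromisoformat(str(result.get("@timestamp")))
        except ValueError:
            failures.append(f"doc {i}: @timestamp is not a date: {result.get('@timestamp')!r}")
    return {"total": len(samples), "failed": len(failures), "failures": failures}


def bad_processor(doc: Doc) -> Doc:
    doc["@timestamp"] = doc["message"]  # maps free text onto a date field
    return doc


report = simulate(
    [{"message": "disk almost full"}, {"message": "2025-12-11T09:15:02"}],
    bad_processor,
)
```

The processor itself runs without error on both documents; only the check against the target field type reveals that the first one would fail at index time.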

Conditional Processing

What if one data stream contains a large variety of logs? You can't use one Grok pattern for all.

Streams has conditional processing built for this. The UI lets you build "if-then" logic. The UI shows you exactly what percentage of your sample documents are skipped or processed by your conditions. Right now, the UI supports up to 3 levels of nesting, and we plan to add a YAML mode in the future for more complex logic.
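
As a rough illustration of the "if-then" idea and of the processed-versus-skipped percentage, consider this Python sketch. The helper `apply_conditionally`, the sample documents, and the field names are invented for the example.

```python
from typing import Any, Callable

Doc = dict[str, Any]


def apply_conditionally(
    samples: list[Doc],
    condition: Callable[[Doc], bool],
    step: Callable[[Doc], Doc],
) -> tuple[list[Doc], float]:
    """Apply step only where condition holds; return results and % processed."""
    results, processed = [], 0
    for sample in samples:
        if condition(sample):
            results.append(step(dict(sample)))
            processed += 1
        else:
            results.append(dict(sample))  # skipped documents pass through unchanged
    percent = 100.0 * processed / len(samples) if samples else 0.0
    return results, percent


samples = [
    {"service.name": "nginx", "message": "GET /cart 200"},
    {"service.name": "billing", "message": "invoice created"},
    {"service.name": "nginx", "message": "GET /pay 500"},
    {"service.name": "auth", "message": "login ok"},
]
results, percent = apply_conditionally(
    samples,
    condition=lambda d: d["service.name"] == "nginx",
    step=lambda d: {**d, "http.response.status_code": int(d["message"].split()[-1])},
)
```

Here half of the sample documents match the condition, so a parsing step written for nginx logs never touches the billing or auth logs.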

Changing Your Test Data (Document Samples)

A random 100-document sample isn't always helpful, especially in a massive, mixed stream from Kubernetes or a central message broker.

You can change the document sample to test your changes on a more specific set of logs. You can either provide documents manually (copy-paste) or, more powerfully, specify a KQL query, for example service.name : "data_processing", to fetch 100 matching documents to be used in the simulation. Now you can build and test a processor on the exact logs you care about.

How Processing Works Under the Hood

There’s no magic. In simple terms, it's a UI that makes our existing best practices more accessible. As of version 9.2, Streams runs exclusively on Elasticsearch ingest pipelines. (We plan to offer more than that, so stay tuned.)

When you save your changes, Streams appends processing steps by:

Locating the most specific @custom ingest pipeline for your data stream.
Adding a single pipeline processor to it.
This processor calls a new, dedicated pipeline named @stream.processing, which contains the Grok, conditional, and other logic you built in the UI.

You can even see this for yourself by going to the Advanced tab in your Stream and clicking the pipeline name.
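
The wiring described in the steps above can be sketched as a small Python function that works on a pipeline definition as a plain dictionary. The `pipeline` processor is a standard Elasticsearch ingest processor; the helper name `attach_processing_pipeline` and the pipeline names used here are illustrative assumptions, and the sketch does not call Elasticsearch.

```python
from typing import Any


def attach_processing_pipeline(custom_pipeline: dict[str, Any], target: str) -> dict[str, Any]:
    """Return a copy of custom_pipeline with one `pipeline` processor calling
    target appended, unless such a processor is already present."""
    processors = list(custom_pipeline.get("processors", []))
    already_linked = any(p.get("pipeline", {}).get("name") == target for p in processors)
    if not already_linked:
        processors.append({"pipeline": {"name": target}})
    return {**custom_pipeline, "processors": processors}


# Illustrative names only; real pipeline names depend on your data stream.
custom = {"processors": [{"set": {"field": "team", "value": "payments"}}]}
updated = attach_processing_pipeline(custom, "logs-myapp-default@stream.processing")
```

Keeping the link to a single processor means existing @custom logic stays untouched, and the parsing steps built in the UI live in their own pipeline.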

Processing in OTel, Elastic Agent, Logstash, or Streams? What to Use?

This is a fair question. You have lots of ways to parse data.

Best: Structured logging at the source. If you control the app writing the logs, make it log JSON in the format of your choice. This will always remain the best way to do logging, but it's not always possible.
Good, but not all the time: Elastic Agent + integrations. If there is an existing integration for collecting and parsing your data, Streams won't do it any better. Use it!
Good for tech-savvy users: OTel at the edge. Use OTel (with OTTL) to set yourself up for the future.
The easy catch-all: Streams. Especially when using an integration that primarily just ships the data into Elastic, Streams can add a lot of value. The Kubernetes Logs integration is a good example: an integration is used, but most logs aren't parsed automatically because they may come from a wide variety of pods.

Think of Streams as your universal "catch-all" for everything that arrives unstructured. It's perfect for data from sources you don't control, for legacy systems, or for when you just need to fix a parsing error right now without a full application redeploy.

A quick note on schemas: Streams can handle both ECS (Elastic Common Schema) and OTel (OpenTelemetry) data. By default, it assumes your target schema is ECS. However, Streams will automatically detect and adapt to the OTel schema if your Stream's name contains the word “otel”, or if you're using the special Logs Stream (currently in tech preview). You get the same visual parsing workflow regardless of the schema.

All processing changes can also be made using a Kibana API. Note that the API is still in tech preview while we mature some of the functionality.

Summary

Parsing logs shouldn't be a tedious, high-stakes, backend-only task. Streams moves the entire workflow from a complex, error-prone approach to an interactive UI right where you already are. You can now build, test, and deploy parsing logic with instant, safe feedback. This means you can stop fighting your logs and finally start using them. The next time you see a messy log, don't ignore it. Click "Parse in Streams" and fix it in 60 seconds.

Check out more log analytics articles in Elastic Observability Labs.

Try out Elastic. Sign up for a trial at Elastic Cloud.

Reconciliation in Elastic Streams: A Robust Architecture Deep Dive

Tue, 04 Nov 2025 00:00:00 GMT

Streams is a new, unified approach to data management in the Elastic Stack. It wraps a set of existing Elasticsearch building blocks—data streams, index templates, ingest pipelines, retention policies—into a single, coherent primitive: the Stream. Instead of configuring these parts individually and in the right order, users can now rely on Streams to orchestrate them safely and automatically. With a unified UI in Kibana and a simplified API, Streams reduces cognitive load, lowers the risk of misconfiguration, and supports more flexible workflows like late binding—where users can ingest data first and decide how to process and route it later.

But behind that clean user experience lies a fast-moving, evolving codebase. In this post, we’ll explore how we rethought its architecture to keep up with product demands—while laying the groundwork for future flexibility and scale.

Rapid experimentation often leads to messy code—but before shipping to customers, we have to ask: If this succeeds, can we continue evolving it? That question puts code health front and center. To move fast in the long term, we need a foundation that supports iteration.

When I joined the Streams team about six months ago, the project was moving fast through uncharted territory amid high uncertainty. This combination of speed and uncertainty created the perfect conditions for, well, spaghetti code—crafted by some of our most senior engineers, doing their best with a recipe missing a few ingredients.

The code was pragmatic and effective: it did exactly what it needed to do. But it was becoming increasingly difficult to understand and extend. Related logic was scattered across many files, with little separation of concerns, making it difficult to safely identify where and how to introduce changes. And the project still had a long road ahead.

Recently, we undertook a refactor of the underlying architecture—not just to bring greater clarity and structure to the codebase, but to establish clear phases that make it easier to debug and evolve. Our primary goal was to build a foundation that would let us continue moving quickly and confidently. As a secondary goal, we aimed to enable new capabilities like bulk updates, dry runs, and system diagnostics.

In this post, we’ll briefly explore the challenges that prompted a new approach, share the architectural patterns that inspired us, explain how the new design works under the hood, and highlight what it enables for the future.

The Challenges We Faced

Streams aims to be a declarative model for data management. Users describe how data should flow: where it should go, what processing should happen along the way, and which mappings should apply. Behind the scenes, each API request results in one or more Elasticsearch resources being changed.

Before the refactor, the underlying code was increasingly difficult to reason about. There was no clear lifecycle that each request followed. Data was loaded only when it happened to be needed, validation was scattered across different functions, and cascading changes—like child streams reacting to parent updates—were applied recursively and implicitly. Elasticsearch requests could happen at any point during a request.

This led to several key challenges:

No clear place for validation
Without a single, centralized validation step, engineers weren’t sure where to add new checks—or whether existing ones would even run reliably. Some validations happened early, others late.
No clear picture of the overall system state
Because there was no way to manage the system state as a whole, it was hard to reason about or validate that state. We couldn’t easily check whether a change was valid in the context of all other existing streams or dependencies.
Unpredictable side effects
Since Elasticsearch operations could occur at different points in the flow, failures were harder to handle or roll back. We didn’t have a clear “commit point” where the changes were executed.
Tangled stream logic
Logic for different types of streams was mixed together in shared code paths, often guarded by conditionals. This made it hard to isolate behavior, test individual types, or add new ones without risking unintended consequences.

These challenges made it clear: we needed a more structured foundation, one capable of supporting both the current complexity and future growth.

What We Needed to Move Forward

To move faster yet with confidence, we needed a foundation that could evolve gracefully, make behavior easier to reason about, and reduce the likelihood of unexpected side effects.

We aligned around a few key goals:

A clear request lifecycle
Each request should move through clear, well-defined phases: loading the current state, applying changes, validating the resulting state, determining the Elasticsearch actions, and executing the actions. This structure would help engineers understand where things happen—and why.
A unified state model
We wanted a clear model of desired vs. current state—a single place to reason about the outcome of a change. This would enable safer validation, more efficient updates, and easier debugging by allowing us to compute the difference between the two states.
A single commit point
All Elasticsearch changes should happen in one place, after everything’s validated and we know exactly what needs to change. This would reduce side effects, make failures easier to manage, and unlock support for dry runs.
Isolated stream logic
We needed clearer separation between stream types so each could be developed and tested in isolation. This would simplify adding new types, reduce unintended side effects, and clarify whether changes belong to a stream type or the state management layer.
Bulk operations and system introspection
Finally, we wanted to support features like bulk updates, dry runs, and health diagnostics—capabilities that were difficult or impossible with the old design. A more explicit and inspectable model of system state would make this possible.

These goals became our north star as we explored new architectural patterns to get there, with a strong focus on comparing the current state with the desired state.

Where We Drew Inspiration From

Our new design drew inspiration from two well-known open source projects: Kubernetes and React. Though very different, both share a central concept: reconciliation.

Reconciliation means comparing two states, calculating their differences, and taking the necessary actions to move the system from its current state to its desired state.

In Kubernetes, you declare the desired state of your resources, and the controller continuously works to align the cluster with that state.
In React, each component defines how it should render, and the virtual DOM updates the real DOM efficiently to match that.
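
In its simplest form, reconciliation is a diff between two states followed by a list of actions. The Python sketch below is a generic illustration of that idea, not code from Streams; the `reconcile` function, the action names, and the sample states are all hypothetical.

```python
from typing import Any


def reconcile(current: dict[str, Any], desired: dict[str, Any]) -> list[tuple[str, str]]:
    """Compare two states keyed by resource name and list the actions needed
    to move from current to desired."""
    actions: list[tuple[str, str]] = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


current = {"logs": {"retention": "30d"}, "logs.nginx": {"retention": "7d"}}
desired = {"logs": {"retention": "90d"}, "logs.auth": {"retention": "7d"}}
actions = reconcile(current, desired)
```

Nothing is changed while the diff is computed; the function only describes what would have to happen, which is what makes the result easy to inspect and validate.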

We were also inspired by the Plan/Execute pattern which aims to separate decision making from execution. This sounded like what we needed in order to perform all validations before committing to any actions—ensuring we could reason about and inspect the system's intent ahead of time.

These concepts resonated with what we needed. They made clear that we required two key pieces:

A model representing system state, responsible for comparing states and driving the overall workflow (like the Kubernetes controller loop).
A representation of individual streams that make up that state, handling the specific logic for each stream type (like React components).

Each Stream is defined and stored in Elasticsearch. We recognized a disconnect between data management and state changes in our existing code, so we designed each stream to manage both. This fits naturally with the Active Record pattern, where a class encapsulates both domain logic and persistence.

To make the system easier to extend and the state model’s interface simpler, we implemented an abstract Active Record class using the Template Method pattern, clearly defining the interface new stream types must follow.
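
As a rough picture of how an abstract Active Record class and the Template Method pattern fit together, here is a short Python sketch. The real code lives in Kibana and is not shown in this post, so every name below (`StreamActiveRecord`, `ChildStream`, the action strings, the dotted-name parent rule) is a hypothetical stand-in.

```python
from abc import ABC, abstractmethod


class StreamActiveRecord(ABC):
    """Hypothetical base class: fixes the workflow, subclasses fill in the steps."""

    def __init__(self, name: str, definition: dict):
        self.name = name
        self.definition = definition

    def validate(self, desired_names: set[str]) -> list[str]:
        # Template method: shared checks first, then the type-specific hook.
        errors = [] if self.name else ["stream name must not be empty"]
        return errors + self._do_validate(desired_names)

    @abstractmethod
    def _do_validate(self, desired_names: set[str]) -> list[str]: ...

    @abstractmethod
    def determine_actions(self) -> list[str]: ...


class ChildStream(StreamActiveRecord):
    """Illustrative stream type whose parent is encoded in its dotted name."""

    def _do_validate(self, desired_names: set[str]) -> list[str]:
        parent = self.name.rsplit(".", 1)[0]
        return [] if parent in desired_names else [f"parent {parent!r} does not exist"]

    def determine_actions(self) -> list[str]:
        return [f"upsert_index_template:{self.name}", f"upsert_ingest_pipeline:{self.name}"]


stream = ChildStream("logs.nginx", {"routing": []})
errors = stream.validate({"logs", "logs.nginx"})
```

The state model only ever calls `validate` and `determine_actions`, so adding a new stream type means writing a new subclass rather than touching shared code paths.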

We did have some concerns that adopting these more advanced patterns—like reconciliation, the Active Record, and Template Method—might make it harder for new or less experienced engineers to get up to speed. While the code would be cleaner and more straightforward for those familiar with the patterns, we worried it could create a barrier for juniors or newcomers unfamiliar with these concepts.

In practice, however, we found the opposite: the code became easier to follow because the patterns provided a clear, consistent structure. More importantly, the architectural choices helped keep the focus on the domain itself, rather than on complex implementation details, making it more approachable for the whole team. The patterns are there, but the code doesn't talk about them; it talks about the domain.

How We Structured the System

When a request hits one of our API endpoints in Kibana, the handler performs basic request validation, then passes the request to the Streams Client. The client’s job is to translate the request into one or more Change objects. Each Change represents the creation, modification, or deletion of a Stream.

These Change objects are then passed to a central class we introduced called State, which plays two key roles:

It holds the set of Stream instances that make up the current version of the system.
It orchestrates the pipeline that applies changes and transitions from one state to another.

Let’s walk through the key phases the State class manages when applying a change.

Loading the Starting State

First, the State class loads the current system state by reading the stored Stream definitions from Elasticsearch. This becomes our reference point for all subsequent comparisons—used during validation, diffing, and action planning.

Applying Changes

We begin by cloning the starting state. Each Stream is responsible for cloning itself. Then we process each incoming Change:

The change is presented to all Streams in the current state (creating a new one if needed).
Each Stream can react by updating itself and optionally emitting cascading changes—additional changes that ripple through related Streams.
Cascading changes are processed in a loop until no more are generated (or until we hit a safety threshold).

We then move to the next requested Change.
If any requested or cascading Change cannot be applied safely, the system aborts the entire request to prevent partial updates.
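The change-application loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Kibana implementation: the Change and Stream classes, the parent/child cascade rule, and the MAX_CASCADES threshold are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass

MAX_CASCADES = 100  # hypothetical safety threshold


@dataclass(frozen=True)
class Change:
    kind: str    # "upsert" or "delete"
    target: str  # name of the stream the change is aimed at


class Stream:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.deleted = False

    def clone(self):
        copy = Stream(self.name, self.children)
        copy.deleted = self.deleted
        return copy

    def apply(self, change):
        """Update this stream if the change targets it; return cascading changes."""
        if change.target != self.name:
            return []
        if change.kind == "delete":
            self.deleted = True
            # Assumed rule: deleting a parent ripples down to its children.
            return [Change("delete", child) for child in self.children]
        self.deleted = False
        return []


def apply_changes(starting_state, requested):
    # Work on a clone so the starting state stays intact for later diffing.
    desired = {name: stream.clone() for name, stream in starting_state.items()}
    for change in requested:
        queue = deque([change])
        processed = 0
        while queue:
            processed += 1
            if processed > MAX_CASCADES:
                raise RuntimeError("too many cascading changes; aborting request")
            current = queue.popleft()
            if current.kind == "upsert" and current.target not in desired:
                desired[current.target] = Stream(current.target)
            # Present the change to every stream and collect what it emits.
            for stream in list(desired.values()):
                queue.extend(stream.apply(current))
    return desired


start = {"logs": Stream("logs", ["logs.nginx"]), "logs.nginx": Stream("logs.nginx")}
desired = apply_changes(start, [Change("delete", "logs")])
```

Because every change is applied to the clone, an exception raised partway through leaves the starting state untouched, which is what makes aborting the whole request safe.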

Validating the Desired State

Once we’ve applied all Changes and cascading effects, we run validations to ensure the resulting configuration is safe and consistent.

Each Stream is asked to validate itself in the context of the full desired state and the original starting state. This allows for both localized checks (within a Stream) and broader coordination (between related Streams). If any validation fails, we abort the request.

Determining Actions

Next, each Stream is asked to determine what Elasticsearch actions are needed to move from the starting state to the desired state. This is the first point where the system needs to consider which Elasticsearch resources back an individual Stream.

If the request is a dry run, we stop here and return a summary of what would happen. If it’s meant to be executed, we move to the next phase.
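As a rough sketch of why dry runs fall out of this design, consider a simplified version in which states are plain dictionaries and actions are computed centrally. In the real system each Stream determines its own actions; the function names and action tuples below are assumptions for illustration only.

```python
def determine_actions(starting, desired):
    """Compare two {stream_name: definition} maps and list the required actions."""
    actions = []
    for name, definition in desired.items():
        if name not in starting:
            actions.append(("create", name))
        elif starting[name] != definition:
            actions.append(("update", name))
    for name in starting:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_request(starting, desired, dry_run, execute):
    actions = determine_actions(starting, desired)
    if dry_run:
        # Nothing has touched the backend yet, so this is a pure preview.
        return {"dry_run": True, "actions": actions}
    execute(actions)
    return {"dry_run": False, "actions": actions}


starting = {"logs": {"retention": "30d"}, "logs.old": {"retention": "7d"}}
desired = {"logs": {"retention": "90d"}, "logs.nginx": {"retention": "30d"}}

executed = []
preview = apply_request(starting, desired, dry_run=True, execute=executed.extend)
```

The preview lists exactly the actions a real run would execute, because both paths share the same side-effect-free planning step.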

Planning and Execution

The list of Elasticsearch actions is handed off to a dedicated class called ExecutionPlan. This class handles:

Resolving cross-stream dependencies that individual Streams cannot address alone.
Organizing the actions into the correct order to ensure safe application (e.g. to avoid data loss when routing rules change).
Maximizing parallelism wherever possible within those ordering constraints.

If the plan executes successfully, we return a success response from the API.
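One way to picture the ordering and parallelism concerns is a topological sort that groups actions into batches. The sketch below uses Python's standard graphlib module; the action names and their dependencies are hypothetical, and the real ExecutionPlan class may work quite differently.

```python
from graphlib import TopologicalSorter


def plan_batches(dependencies):
    """Group actions into batches. Actions inside one batch have no ordering
    constraint between them, so they could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical actions: each key must wait for the actions in its set.
dependencies = {
    "upsert_ingest_pipeline": set(),
    "upsert_component_template": set(),
    "upsert_index_template": {"upsert_component_template", "upsert_ingest_pipeline"},
    "update_routing_rules": {"upsert_index_template"},
}
batches = plan_batches(dependencies)
```

Each batch only starts once the previous one has finished, which keeps the ordering guarantees while still allowing independent actions to run side by side.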

Handling Failures

If the plan fails during execution, the State class attempts a rollback: it computes a new plan that should return the system to its starting state (by going from the desired state to the starting state instead) and tries to execute it.

If the rollback also fails, we have a fallback mechanism: a “reset” operation that re-applies the known-good state stored in Elasticsearch, skipping diffing entirely.
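The failure-handling flow can be summarized in a few lines. This is a sketch under stated assumptions: plan, execute, and reset are hypothetical callables supplied by the caller, and the string return values exist only to make the outcome visible.

```python
def attempt_changes(starting, desired, plan, execute, reset):
    """Run the forward plan; on failure roll back, and reset as a last resort.

    plan(from_state, to_state) returns a list of actions, execute(actions)
    applies them, and reset(state) re-applies a full state without diffing.
    """
    try:
        execute(plan(starting, desired))
        return "applied"
    except Exception:
        try:
            # Same planner, opposite direction: desired -> starting.
            execute(plan(desired, starting))
            return "rolled_back"
        except Exception:
            reset(starting)
            return "reset"


calls = []


def flaky_execute(actions):
    """Fails on the first call only, to simulate a forward plan that breaks."""
    calls.append(actions)
    if len(calls) == 1:
        raise RuntimeError("simulated failure")


outcome = attempt_changes(
    "A", "B",
    plan=lambda src, dst: [f"{src}->{dst}"],
    execute=flaky_execute,
    reset=lambda state: None,
)
```

The key point is that the rollback reuses the same planning logic with the two states swapped, so no separate undo code is needed.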

A Closer Look at the Stream Active Record Classes

All Streams in the State are subclasses of an abstract class called StreamActiveRecord. This class is responsible for:

Tracking the change status of the Stream
Routing change application, validation, and action determination to specialized template method hooks implemented by its concrete subclasses based on the change status.

These hooks are as follows:

Apply upsert / Apply deletion
Validate upsert / Validate deletion
Determine actions for creation / change / deletion
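To make the Template Method structure concrete, here is a compact sketch in Python. The real class lives in Kibana and is written in TypeScript; the hook names, the dictionary-shaped change, and the ExampleStream subclass below are assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod


class StreamActiveRecord(ABC):
    """Tracks change status and routes each phase to the matching hook."""

    def __init__(self, name, exists_in_starting_state=False):
        self.name = name
        self.exists_in_starting_state = exists_in_starting_state
        self.change_status = "unchanged"  # becomes "upserted" or "deleted"

    def apply_change(self, change):
        # Template method: shared bookkeeping here, specifics in the hooks.
        if change["target"] != self.name:
            return []
        if change["type"] == "delete":
            self.change_status = "deleted"
            return self.do_apply_deletion(change)
        self.change_status = "upserted"
        return self.do_apply_upsert(change)

    def validate(self):
        if self.change_status == "deleted":
            return self.do_validate_deletion()
        if self.change_status == "upserted":
            return self.do_validate_upsert()
        return []  # unchanged streams have nothing to validate in this sketch

    def determine_actions(self):
        if self.change_status == "unchanged":
            return []
        if self.change_status == "deleted":
            return self.do_actions_for_deletion()
        if self.exists_in_starting_state:
            return self.do_actions_for_change()
        return self.do_actions_for_creation()

    @abstractmethod
    def do_apply_upsert(self, change): ...
    @abstractmethod
    def do_apply_deletion(self, change): ...
    @abstractmethod
    def do_validate_upsert(self): ...
    @abstractmethod
    def do_validate_deletion(self): ...
    @abstractmethod
    def do_actions_for_creation(self): ...
    @abstractmethod
    def do_actions_for_change(self): ...
    @abstractmethod
    def do_actions_for_deletion(self): ...


class ExampleStream(StreamActiveRecord):
    """Hypothetical stream type that only needs to fill in the hooks."""

    def do_apply_upsert(self, change):
        self.definition = change["definition"]
        return []  # no cascading changes

    def do_apply_deletion(self, change):
        return []

    def do_validate_upsert(self):
        return [] if self.definition else ["definition must not be empty"]

    def do_validate_deletion(self):
        return []

    def do_actions_for_creation(self):
        return [("create_data_stream", self.name)]

    def do_actions_for_change(self):
        return [("update_data_stream", self.name)]

    def do_actions_for_deletion(self):
        return [("delete_data_stream", self.name)]


stream = ExampleStream("logs.nginx")
stream.apply_change({"type": "upsert", "target": "logs.nginx", "definition": {"routing": []}})
errors = stream.validate()
actions = stream.determine_actions()
```

The subclass never inspects the change status itself; the base class decides which hook applies, which is what keeps new stream types small.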

With this architecture in place, we’ve created a clear, phased, and declarative flow from input to action—one that’s modular, testable, and resilient to failure. It cleanly separates generic stream lifecycle logic (like change tracking and orchestration) from stream-specific behaviors (such as what “upsert” means for a given Stream type), enabling a highly extensible system. This structure allows us to isolate side effects, validate with confidence, and reason more clearly about system-wide behavior—all while supporting dry runs and bulk operations.

Now that we’ve covered how it works, let’s explore what this unlocks—the capabilities, safety guarantees, and new workflows this design makes possible.

What This Unlocks

The reconciliation-based design we landed on isn’t just easier to reason about; it directly addresses many of the core limitations we faced in the earlier version of the system.

Bulk operations and dry runs, by design

One of our key goals was to support bulk configuration changes across many Streams in a single request. The previous codebase made this difficult because the side effects were interleaved with decision-making logic, making it risky to apply multiple changes at once.

Now, bulk changes are the default. The State class handles any number of changes, tracks cascading effects automatically, and validates the end result as a whole. Whether you're updating one Stream or fifty, the pipeline handles it consistently.

Dry runs were another desired feature. Because actions are now computed in a side-effect-free step—before anything is sent to Elasticsearch—we can generate a full preview of what would happen. This includes both which Streams would change and what specific Elasticsearch operations would be performed. That visibility helps users and developers make confident, informed decisions.

Easier debugging, better diagnostics

In the old system, debugging required reconstructing the execution context and piecing together side effects. Now, every phase of the pipeline is explicit and testable in isolation, so you can trace a problem by following the phases one at a time.

Because validation and Elasticsearch actions are now tied directly to the Stream definition and lifecycle, any inconsistencies or errors are easier to trace to their source.

Validated planning before execution

Because we now validate and plan before making any changes, the risk of leaving the system in an inconsistent or partially-updated state has been greatly reduced. All actions are determined in advance, and only executed once we’re confident the entire set of changes is valid and coherent.

And if something does go wrong during execution, we can lean on the fact that both the starting and desired states are fully modeled in memory. This allows us to generate a rollback plan automatically, and when that’s not possible, fall back to a complete reset from the stored state. In short: safety is now built in, not bolted on.

Extensible by default

Adding a new type of Stream used to mean editing logic scattered across multiple files. Now, it’s a focused, well-defined task. You subclass StreamActiveRecord and implement the handful of lifecycle hooks.

That’s it. The orchestration, tracking, and dependency handling are already wired up. That also means it’s easier to onboard new developers or experiment with new Stream types without fear of breaking unrelated parts of the system.

Easier to test

Because each Stream is now encapsulated and has clear, isolated responsibilities, testing is much simpler. You can test individual Stream classes by simulating specific inputs and asserting the resulting cascading changes, validation results, or Elasticsearch actions. There's no need to spin up a full end-to-end environment just to test a single validation.

What’s Next

At Elastic, we live by our Source Code, which states “Progress, SIMPLE Perfection”—a reminder to favor steady, incremental improvement over chasing perfection.

This new system is a solid foundation—but it’s only the beginning. Our focus so far has been on clarity, safety, and extensibility, and while we’ve addressed some long-standing pain points, there’s still plenty of room to evolve.

Continuous improvement ahead

We intentionally shipped this work with a sharp scope and have already identified several enhancements that we will be adding in the coming weeks:

Introduce a locking layer
To safely handle concurrent updates, we plan to introduce a locking mechanism that prevents race conditions during parallel modifications.
Expose bulk and dry-run features via our APIs
The State class already supports them—now it’s time to make those capabilities available to users.
Improve debugging output
Now that state transitions are modeled explicitly, we can expose clearer diagnostics to help both users and developers reason about changes.
Avoid redundant Elasticsearch requests
Currently we make multiple redundant requests during validation. Introducing a lightweight in-memory cache would let us avoid reloading the same resource more than once.
Improve access controls
Currently, we rely on Elasticsearch to enforce access control. Because a single change can touch many different resources, it’s difficult to determine up front which privileges are required. We plan to extend our action definitions with privilege metadata, enabling us to validate the full set of required permissions before executing any actions. This will let us detect and report missing privileges early—before the plan runs.
Add APM instrumentation
With the system structured in distinct, well-defined phases, we’re now in a great position to add performance instrumentation. This will help us identify bottlenecks and improve responsiveness over time.

Revisiting responsibilities

As our orchestration becomes more robust, we’re also re-evaluating where it should live. Large-scale bulk operations, for example, might eventually be better handled closer to Elasticsearch itself, where we can benefit from greater atomicity and tighter performance guarantees. That kind of deep integration would have been premature earlier on—when we were still figuring out the right abstractions and phases for the system. But now that the design has stabilized, we’re in a much better position to start that conversation.

Built to evolve

We designed this system with adaptability in mind. Whether improvements come in the form of internal refactors, better developer experience, or deeper collaboration with Elasticsearch, we’re in a strong position to keep evolving. The architecture is modular by design—and that gives us both the stability to rely on and the flexibility to grow.

Wrapping Up

Building robust, maintainable systems is never just about code — it’s about aligning architecture with the evolving needs and direction of the product. Our journey refactoring Streams reaffirmed that a thoughtful, phased approach not only improves technical clarity but also empowers teams to move faster and innovate more confidently.

If you’re working on complex systems facing similar challenges—whether tangled logic, unpredictable side effects, or the need for extensibility—you’re not alone. We hope our story offers some useful insights and inspiration as you shape your own path forward.

We welcome feedback and collaboration from the community—whether it’s in the form of questions, ideas, or code.

To learn more about Streams, explore:

Read about Reimagining streams

Look at the Streams website

Read the Streams documentation

Check out the pull request on GitHub to dive into the code or join the conversation.

Future-proof your logs with ecs@mappings template

Mon, 23 Sep 2024 00:00:00 GMT

As the Elasticsearch ecosystem evolves, so do the tools and methodologies designed to streamline data management. One advancement that will significantly benefit our community is the ecs@mappings component template.

ECS (Elastic Common Schema) is a standardized data model for logs and metrics. It defines a set of common field names and data types that help ensure consistency and compatibility.

ecs@mappings is a component template that offers an Elastic-maintained definition of ECS mappings. Each Elasticsearch release contains an always up-to-date definition of all ECS fields.

Elastic Common Schema and OpenTelemetry

Elastic will preserve our users' investment in Elastic Common Schema by donating ECS to OpenTelemetry. Elastic participates and collaborates with the OTel community to merge ECS and OpenTelemetry's Semantic Conventions over time.

The Evolution of ECS Mappings

Historically, users and integration developers have defined ECS (Elastic Common Schema) mappings manually within individual index templates and packages, each meticulously listing its fields. Although straightforward, this approach proved time-consuming and challenging to maintain.

To tackle this challenge, integration developers moved towards two primary methodologies:

Referencing ECS mappings
Importing ECS mappings directly

These methods were steps in the right direction but introduced challenges of their own, such as the maintenance cost of keeping the ECS mappings up to date with Elasticsearch changes.

Enter ecs@mappings

The ecs@mappings component template supports all the field definitions in ECS, leveraging naming conventions and a set of dynamic templates.

Elastic started shipping the ecs@mappings component template with Elasticsearch v8.9.0, including it in the logs-*-* index template.

With Elasticsearch v8.13.0, Elastic now includes ecs@mappings in the index templates of all the Elastic Agent integrations.

This move was a breakthrough because:

Centralized and official: With ecs@mappings, we now have an official definition of ECS mappings.
Out-of-the-box functionality: ECS mappings are readily available, reducing the need for additional imports or references.
Simplified maintenance: The need to manually keep up with ECS changes has diminished since the template from Elasticsearch itself remains up-to-date.

Enhanced Consistency and Reliability

With ecs@mappings, ECS mappings become the single source of truth. This unified approach means fewer discrepancies and higher consistency in data streams across integrations.

How Community Users Benefit

Community users stand to gain manifold from the adoption of ecs@mappings. Here are the key advantages:

Reduced configuration hassles: Whether you are an advanced user or just getting started, the simplified setup means fewer configuration steps and fewer opportunities for errors.
Improved data integrity: Since ecs@mappings ensures that field definitions are accurate and up-to-date, data integrity is maintained effortlessly.
Better performance: With less overhead in maintaining and referencing ECS fields, your Elasticsearch operations run more smoothly.
Enhanced documentation and discoverability: As we standardize ECS mappings, the documentation can be centralized, making it easier for users to discover and understand ECS fields.

Let's explore how the ecs@mappings component template helps users achieve these benefits.

Reduced configuration hassles

Modern Elasticsearch versions come with out-of-the-box full ECS field support (see the “requirements” section later for specific versions).

For example, the Custom AWS Logs integration installed on a supported Elasticsearch cluster already includes the ecs@mappings component template in its index template:

GET _index_template/logs-aws_logs.generic
{
  "index_templates": [
    {
      "name": "logs-aws_logs.generic",
      ...,
        "composed_of": [
          "logs@settings",
          "logs-aws_logs.generic@package",
          "logs-aws_logs.generic@custom",
          "ecs@mappings",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
    ...

There is no need to import or define any ECS field.

Improved data integrity

The ecs@mappings component template supports all the existing ECS fields. If you use any ECS field in your document, it will accurately have the expected type.

To ensure that ecs@mappings is always up to date with the ECS repository, we set up a daily automated test to ensure that the component template supports all fields.

Better performance

Compact definitions

The ECS field definition is exceptionally compact; at the time of this writing, it is 228 lines long and supports all ECS fields. To learn more, see the ecs@mappings component template source code.

It relies on naming conventions and uses dynamic templates to achieve this compactness.
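To see how a handful of naming-convention rules can cover many fields, here is a small Python simulation. It is only an approximation: the three patterns are an illustrative subset rather than the shipped template, real dynamic templates also consider the detected JSON type, and the fallback branch simply mirrors the keyword and long mappings shown for clusters without ecs@mappings.

```python
from fnmatch import fnmatchcase

# Illustrative (path pattern, mapping type) pairs, NOT the shipped template.
RULES = [
    ("*command_line", "wildcard"),
    ("*_score*", "float"),
    ("*.ip", "ip"),
]


def resolve_type(field_path, json_value):
    """Return the mapping type a field would get in this simplified model."""
    for pattern, mapping_type in RULES:
        if fnmatchcase(field_path, pattern):
            return mapping_type
    # Simplified fallback when no naming rule applies.
    if isinstance(json_value, bool):
        return "boolean"
    if isinstance(json_value, int):
        return "long"
    if isinstance(json_value, float):
        return "float"
    return "keyword"


command_line_type = resolve_type("command_line", "ls -ltr")
custom_score_type = resolve_type("custom_score", 42)
source_ip_type = resolve_type("source.ip", "10.0.0.1")
custom_count_type = resolve_type("custom_count", 42)
```

A single pattern such as `*command_line` covers every field whose path ends that way, which is why the template can stay short while the schema keeps growing.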

Lazy mapping

Elasticsearch only adds existing document fields to the mapping, thanks to dynamic templates. The lazy mapping keeps memory overhead at a minimum, improving cluster performance and making field suggestions more relevant.

Enhanced documentation and discoverability

All Elastic Agent integrations are migrating to the ecs@mappings component template. These integrations no longer need to add and maintain ECS field mappings and can reference the official ECS Field Reference or the ECS source code in the Git repository: https://github.com/elastic/ecs/.

Getting started

Requirements

To leverage the ecs@mappings component template, ensure the following stack version:

8.9.0: if your data stream uses the logs index template or you define your index template.
8.13.0: if your data stream uses the index template of an Elastic Agent integration.

Example

We will use the Custom AWS Logs integration to show you how ecs@mappings can handle mapping for any out-of-the-box ECS field.

Imagine you want to ingest the following log event using the Custom AWS Logs integration:

{
  "@timestamp": "2024-06-11T13:16:00+02:00", 
  "command_line": "ls -ltr",
  "custom_score": 42
}

Dev Tools

Kibana offers an excellent tool for experimenting with Elasticsearch APIs: the Dev Tools console. With Dev Tools, users can run API requests quickly and without much friction.

To open the Dev Tools:

Open Kibana
Select Management > Dev Tools > Console

Elasticsearch version < 8.13

On Elasticsearch versions before 8.13, the Custom AWS Logs integration has the following index template:

GET _index_template/logs-aws_logs.generic
{
  "index_templates": [
    {
      "name": "logs-aws_logs.generic",
      "index_template": {
        "index_patterns": [
          "logs-aws_logs.generic-*"
        ],
        "template": {
          "settings": {},
          "mappings": {
            "_meta": {
              "package": {
                "name": "aws_logs"
              },
              "managed_by": "fleet",
              "managed": true
            }
          }
        },
        "composed_of": [
          "logs-aws_logs.generic@package",
          "logs-aws_logs.generic@custom",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
        "priority": 200,
        "_meta": {
          "package": {
            "name": "aws_logs"
          },
          "managed_by": "fleet",
          "managed": true
        },
        "data_stream": {
          "hidden": false,
          "allow_custom_routing": false
        }
      }
    }
  ]
}

As you can see, it does not include the ecs@mappings component template.

If we try to index the test document:

POST logs-aws_logs.generic-default/_doc
{
  "@timestamp": "2024-06-11T13:16:00+02:00", 
  "command_line": "ls -ltr",
  "custom_score": 42
}

The data stream will have the following mappings:

GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "command_line": {
        "full_name": "command_line",
        "mapping": {
          "command_line": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "custom_score": {
        "full_name": "custom_score",
        "mapping": {
          "custom_score": {
            "type": "long"
          }
        }
      }
    }
  }
}

These mappings do not align with ECS, so users and developers had to maintain them.

Elasticsearch version >= 8.13

On Elasticsearch versions 8.13 and newer, the Custom AWS Logs integration has the following index template:

GET _index_template/logs-aws_logs.generic
{
  "index_templates": [
    {
      "name": "logs-aws_logs.generic",
      "index_template": {
        "index_patterns": [
          "logs-aws_logs.generic-*"
        ],
        "template": {
          "settings": {},
          "mappings": {
            "_meta": {
              "package": {
                "name": "aws_logs"
              },
              "managed_by": "fleet",
              "managed": true
            }
          }
        },
        "composed_of": [
          "logs@settings",
          "logs-aws_logs.generic@package",
          "logs-aws_logs.generic@custom",
          "ecs@mappings",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
        "priority": 200,
        "_meta": {
          "package": {
            "name": "aws_logs"
          },
          "managed_by": "fleet",
          "managed": true
        },
        "data_stream": {
          "hidden": false,
          "allow_custom_routing": false
        },
        "ignore_missing_component_templates": [
          "logs-aws_logs.generic@custom"
        ]
      }
    }
  ]
}

The index template for logs-aws_logs.generic now includes the ecs@mappings component template.

If we try to index the test document:

POST logs-aws_logs.generic-default/_doc
{
  "@timestamp": "2024-06-11T13:16:00+02:00", 
  "command_line": "ls -ltr",
  "custom_score": 42
}

The data stream will have the following mappings:

GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "command_line": {
        "full_name": "command_line",
        "mapping": {
          "command_line": {
            "type": "wildcard",
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            }
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "custom_score": {
        "full_name": "custom_score",
        "mapping": {
          "custom_score": {
            "type": "float"
          }
        }
      }
    }
  }
}

In Elasticsearch 8.13, fields like command_line and custom_score get their definition from ECS out-of-the-box.

These mappings align with ECS, so users and developers do not have to maintain them. The same applies to all the hundreds of field definitions in the Elastic Common Schema. You can achieve this by including a 200-liner component template in your data stream.

Caveats

Some aspects of how the ecs@mappings component template deals with data types are worth mentioning.

ECS types are not enforced

The ecs@mappings component template does not contain mappings for ECS fields where dynamic mapping already uses the correct field type. Therefore, if you send a field value with a compatible but wrong type, Elasticsearch will not coerce the value.

For example, if you send the following document with a faas.coldstart field (defined as boolean in ECS):

{
  "faas.coldstart": "true"
}

Elasticsearch will map faas.coldstart as a keyword and not a boolean. Therefore, you need to make sure that the values you ingest to Elasticsearch use the right JSON field types, according to how they’re defined in ECS.
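One way to guard against this is to normalize values before they are sent to Elasticsearch. The sketch below is a hypothetical pre-ingest helper in Python: the EXPECTED_TYPES table is a hand-written two-field subset for the example, not something ecs@mappings provides.

```python
# Hypothetical subset of ECS fields and the JSON types ECS expects for them.
EXPECTED_TYPES = {
    "faas.coldstart": bool,
    "http.response.status_code": int,
}


def normalize(doc):
    """Return a copy of doc with string values converted to the expected JSON type."""
    fixed = dict(doc)
    for field, expected in EXPECTED_TYPES.items():
        value = fixed.get(field)
        if value is None or isinstance(value, expected):
            continue
        if expected is bool and isinstance(value, str) and value.lower() in ("true", "false"):
            fixed[field] = value.lower() == "true"
        elif expected is int and isinstance(value, str) and value.isdigit():
            fixed[field] = int(value)
    return fixed


normalized = normalize({"faas.coldstart": "true", "http.response.status_code": "404"})
```

Values that cannot be converted are left untouched, so the helper never drops data; it only fixes the cases where the intended type is unambiguous.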

This is the tradeoff for having a compact and efficient ecs@mappings component template. It also allows for better compatibility when dealing with a mix of ECS and custom fields because documents won’t be rejected if the types are not consistent with the ones defined in ECS.

Conclusion

The introduction of ecs@mappings marks a significant improvement in managing ECS mappings within Elasticsearch. By centralizing and streamlining these definitions, we can ensure higher consistency, reduced maintenance, and better overall performance.

Whether you're an integration developer or a community user, moving to ecs@mappings represents a step towards more efficient and reliable Elasticsearch operations. As we continue incorporating feedback and evolving our tools, your journey with Elasticsearch will only get smoother and more rewarding.

Join the Conversation

Do you have questions or feedback about ecs@mappings? Post on our community discussion forum or Slack instance and share your experiences. Your input is invaluable in helping us fine-tune these advancements for the entire community.

Happy mapping!

Getting more from your logs with OpenTelemetry

Thu, 11 Sep 2025 00:00:00 GMT

Most people today still use their logging tools the way we have for decades: as a simple search lake, essentially grepping for logs from a centralized platform. There’s nothing wrong with this, and a centralized logging platform delivers a lot of value. But how do you start to evolve beyond this basic log-and-search use case? Where can you start to be more effective with your incident investigations? In this blog, we start from where most of our customers are today and give you some practical tips on how to move beyond this simple logging use case.

Ingestion

Let's start at the beginning: ingestion. Many of you are using older tools for ingestion today. If you want to be more forward-thinking here, it’s time to introduce you to OpenTelemetry. OpenTelemetry was once not very mature or capable for logging, but things have changed significantly. Elastic has been working particularly hard to improve the log capabilities in OpenTelemetry. So let's explore how to bring logs into Elastic via the OpenTelemetry Collector.

If you want to follow along, create a host to run the log generator and the OpenTelemetry Collector.

Follow the instructions here to get the log generator running:

https://github.com/davidgeorgehope/log-generator-bin/

To get the OpenTelemetry collector up and running in Elastic Serverless, you can click on Add Data from the bottom left, then 'host' and finally 'opentelemetry'

Follow the instructions but don’t start the collector just yet.

Our host here is running a 3 tier application with an Nginx frontend, backend and connected to a MySQL database. So let's start by bringing the logs into Elastic.

First, we’ll install the Elastic Distribution of OpenTelemetry (EDOT) Collector, but before starting it, we will make a small change to the OpenTelemetry configuration file to expand the directories it searches for logs. Edit otel.yml using vi or your favorite editor:

vi otel.yml

Instead of just /var/log/*.log, we will use /var/log/**/*.log to bring in all our log files, including those in subdirectories.

receivers:
  # Receiver for platform specific log files
  filelog/platformlogs:
    include: [ /var/log/**/*.log ]
    retry_on_failure:
      enabled: true
    start_at: end
    storage: file_storage

Start the OTel Collector:

sudo ./otelcol --config otel.yml

We can now see the logs arriving in Discover.

One thing that is immediately noticeable is that, without changing anything, we automatically get a bunch of useful additional information, such as the OS name and CPU information.

The OpenTelemetry Collector has started to enrich our logs out of the box, making them more useful for additional processing, though we could do significantly better!

To start with, we want to give our logs some structure. Let's edit the otel.yml file and add some OTTL to extract key data from our NGINX logs.

  transform/parse_nginx:
    trace_statements: []
    metric_statements: []
    log_statements:
      - context: log
        conditions:
          - 'attributes["log.file.name"] != nil and IsMatch(attributes["log.file.name"], "access.log")'
        statements:
          # Group names other than client_ip, user, and status are assumed; adjust them to your schema.
          - merge_maps(attributes, ExtractPatterns(body, "^(?P<client_ip>\\S+)"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "^\\S+ - (?P<user>\\S+)"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\\[(?P<timestamp>[^\\]]+)\\]"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\"(?P<method>\\S+) "), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\"\\S+ (?P<path>\\S+)\\?"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "req_id=(?P<req_id>[^ ]+)"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\" (?P<status>\\d+) "), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\" \\d+ (?P<bytes>\\d+)"), "upsert")
.....

   logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx,resourcedetection]
      exporters: [elasticsearch/otel]

Now we restart the OTel Collector with this new configuration:

sudo ./otelcol --config otel.yml

We now have structured logs!
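To get a feel for what the extraction statements do, the same idea can be tried offline with Python's re module, which also uses the (?P<name>...) syntax for named capture groups. The sample log line is made up, and apart from client_ip, user, and status, the group names are assumptions chosen for this sketch.

```python
import re

# One pattern per extracted attribute, each with a single named capture group.
PATTERNS = [
    r"^(?P<client_ip>\S+)",
    r"^\S+ - (?P<user>\S+)",
    r"\[(?P<timestamp>[^\]]+)\]",
    r'"(?P<method>\S+) ',
    r'"\S+ (?P<path>\S+)\?',
    r"req_id=(?P<req_id>[^ ]+)",
    r'" (?P<status>\d+) ',
    r'" \d+ (?P<bytes>\d+)',
]


def parse_access_log(body):
    """Merge the named groups of every matching pattern into one attribute map."""
    attributes = {}
    for pattern in PATTERNS:
        match = re.search(pattern, body)
        if match:
            attributes.update(match.groupdict())  # comparable to an "upsert" merge
    return attributes


# Hypothetical access log line in the shape the patterns expect.
sample = (
    '203.0.113.7 - alice [11/Sep/2025:10:00:00 +0000] '
    '"GET /checkout?item=1 HTTP/1.1" 500 1234 req_id=abc123'
)
parsed = parse_access_log(sample)
```

Patterns that do not match simply contribute nothing, so a line without a query string would still be parsed, only without a path attribute.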

Store and Optimize

To make sure all this additional structured data doesn't blow out your budget, there are a few things you can do to maximize storage efficiency.

For example, you can use filter processors in the OTel Collector to drop irrelevant log records and control the volume leaving the collector.

processors:
  filter/drop_logs_without_user_attributes:
    logs:
      log_record:
        - 'attributes["user"] == nil'
  filter/drop_200_logs:
    logs:
      log_record:
        - 'attributes["status"] == "200"'

service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, filter/drop_logs_without_user_attributes, filter/drop_200_logs, resourcedetection]
      exporters: [elasticsearch/otel]

The filter processor helps reduce noise, for example by dropping debug logs or logs from a noisy service. These are great ways to keep a lid on your observability spend.
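It is easy to read filter conditions backwards: a record is dropped when a condition matches, not kept. The short Python sketch below mimics the two filters above on a few made-up records; it models the semantics only and is not how the collector is implemented.

```python
def should_drop(record, conditions):
    """A record is dropped as soon as any drop condition matches."""
    return any(condition(record["attributes"]) for condition in conditions)


# Python equivalents of the two filter conditions in the configuration above.
DROP_CONDITIONS = [
    lambda attrs: attrs.get("user") is None,     # attributes["user"] == nil
    lambda attrs: attrs.get("status") == "200",  # attributes["status"] == "200"
]

# Hypothetical parsed log records.
records = [
    {"attributes": {"user": "alice", "status": "500"}},
    {"attributes": {"user": "bob", "status": "200"}},
    {"attributes": {"status": "404"}},
]
kept = [record for record in records if not should_drop(record, DROP_CONDITIONS)]
```

Only the first record survives: the second is a successful request and the third has no user attribute.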

Additionally, for your most critical flows and logs, where you don’t want to drop any data, Elastic has you covered. In version 9.x of Elastic, LogsDB is switched on by default.

With LogsDB, Elastic has reduced the storage footprint of log data in Elasticsearch by up to 65%, allowing you to store more observability and security data without exceeding your budget, while keeping all data accessible and searchable.

LogsDB achieves this by leveraging advanced compression techniques like ZSTD, delta encoding, and run-length encoding. It also reconstructs the _source field on demand (synthetic _source), saving about 40% more storage by not retaining the original JSON document. Synthetic _source represents the introduction of columnar storage within Elasticsearch.
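To build intuition for why delta and run-length encoding suit log data, here is a toy Python example on evenly spaced timestamps. It illustrates the general techniques only and says nothing about how Elasticsearch implements them internally; the timestamp values are invented.

```python
def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


# Hypothetical timestamps from a service that logs every 10 time units.
timestamps = [1000, 1010, 1020, 1030, 1040]
deltas = delta_encode(timestamps)
runs = run_length_encode(deltas)
```

Five large numbers shrink to two small pairs, and the effect grows with the length of the sequence; sorted, repetitive log fields tend to behave this way.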

Analytics

So we have our data in Elastic. It’s structured, and it conforms to the idea of a wide-event log: it has lots of good context, such as user IDs and request IDs, and the data is captured at the start of a request. Next, we’re going to look at the analytics part. First, let's look at the number of errors for each user in our application.

FROM logs-generic.otel-default
| WHERE log.file.name == "access.log"
| WHERE attributes.status >= "400"
| STATS error_count = COUNT(*) BY attributes.user
| SORT error_count DESC

It’s now easy to save this query and put it on a dashboard: just click the save button.
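If you want to sanity-check what the error query computes, the same aggregation is easy to reproduce on a few in-memory records. The Python sketch below uses made-up records with flattened field names. The status attribute was extracted by a regular expression, so it is a string, and the comparison against "400" is a string comparison, which still orders three-digit HTTP codes correctly.

```python
from collections import Counter


def error_count_by_user(records):
    """Count access-log records with status >= "400", grouped by user."""
    counts = Counter()
    for record in records:
        if record.get("log.file.name") != "access.log":
            continue
        # String comparison: fine for three-digit status codes.
        if record.get("attributes.status", "") >= "400":
            counts[record.get("attributes.user")] += 1
    return counts.most_common()  # highest error count first


# Hypothetical records.
records = [
    {"log.file.name": "access.log", "attributes.user": "alice", "attributes.status": "500"},
    {"log.file.name": "access.log", "attributes.user": "alice", "attributes.status": "404"},
    {"log.file.name": "access.log", "attributes.user": "bob", "attributes.status": "200"},
    {"log.file.name": "access.log", "attributes.user": "bob", "attributes.status": "503"},
    {"log.file.name": "error.log", "attributes.user": "carol", "attributes.status": "500"},
]
result = error_count_by_user(records)
```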

Next, let's put something together to show the global impact. First, we will update our collector config to enrich our log data with geolocation.

Update the OTTL configuration by adding the final set statement shown below:

   log_statements:
      - context: log
        conditions:
          - 'attributes["log.file.name"] != nil and IsMatch(attributes["log.file.name"], "access.log")'
        statements:
          # Group names other than client_ip, user, and status are assumed; adjust them to your schema.
          - merge_maps(attributes, ExtractPatterns(body, "^(?P<client_ip>\\S+)"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "^\\S+ - (?P<user>\\S+)"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\\[(?P<timestamp>[^\\]]+)\\]"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\"(?P<method>\\S+) "), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\"\\S+ (?P<path>\\S+)\\?"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "req_id=(?P<req_id>[^ ]+)"), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\" (?P<status>\\d+) "), "upsert")
          - merge_maps(attributes, ExtractPatterns(body, "\" \\d+ (?P<bytes>\\d+)"), "upsert")
          - set(attributes["source.address"], attributes["client_ip"]) where attributes["client_ip"] != nil

Next, add a new processor (you will need to download the GeoIP database from MaxMind):

geoip:
  context: record
  source:
    from: attributes
  providers:
    maxmind:
      database_path: /opt/geoip/GeoLite2-City.mmdb

Then add it to the logs pipeline, after the transform/parse_nginx processor:

service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, geoip, resourcedetection]
      exporters: [elasticsearch/otel]

Start the OTel Collector:

sudo ./otelcol --config otel.yml

Once the data starts flowing we can add a map visualization:

Add a layer:

Use ES|QL

Use the following ES|QL

And this should give you a map showing the locations of all your NGINX server requests!

As you can see, analytics is a breeze with your new OTel data collection pipeline.

Conclusion: Beyond log aggregation to operational intelligence

The journey from basic log aggregation to structured, enriched observability represents more than a technical upgrade; it’s a shift in how organizations approach system understanding and incident response. By adopting OpenTelemetry for ingestion, implementing intelligent filtering to manage costs, and leveraging LogsDB’s storage optimizations, you’re not just modernizing your ELK stack; you’re building the foundation for proactive system management.

The structured logs, geographic enrichment, and analytical capabilities demonstrated here transform raw log data into actionable intelligence with ES|QL. Instead of reactive grepping through logs during incidents, you now have the infrastructure to identify patterns, track user journeys, and correlate issues across your entire stack before they become critical problems.

But here’s the key question: Are you prepared to act on these insights? Having rich, structured data is only valuable if your organization can shift from a reactive "find and fix" mentality to a proactive "predict and prevent" approach. The real evolution isn’t in your logging stack; it’s in your operational culture.

Get started with this today in Elastic Serverless

How to remove PII from your Elastic data in 3 easy steps

Tue, 20 Jun 2023 00:00:00 GMT

Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. Whether you’re in ecommerce, banking, healthcare, or other fields where data is sensitive, PII may inadvertently be captured and stored. Structured logs make it easy to quickly identify, remove, and protect sensitive data fields, but what about unstructured messages? Or perhaps call center transcriptions?

Elasticsearch, with its long experience in machine learning, provides various options to bring in custom models, such as large language models (LLMs), and provides its own models. These models will help implement PII redaction.

If you would like to learn more about natural language processing, machine learning, and Elastic, please be sure to check out these related articles:

In this blog, we will show you how to set up PII redaction through the use of Elasticsearch’s ability to load a trained model within machine learning and the flexibility of Elastic’s ingest pipelines.

Specifically, we’ll walk through setting up a named entity recognition (NER) model for person and location identification, as well as deploying the redact processor for custom data identification and removal. All of this will then be combined with an ingest pipeline where we can use Elastic machine learning and data transformations capabilities to remove sensitive information from your data.

Loading the trained model

Before we begin, we must load our NER model into our Elasticsearch cluster. This may be easily accomplished with Docker and the Elastic Eland client. From a command line, let’s install the Eland client via git:

git clone https://github.com/elastic/eland.git

Navigate into the recently downloaded client:

cd eland/

Now let’s build the client:

docker build -t elastic/eland .

From here, you’re ready to deploy the trained model to an Elastic machine learning node! Be sure to replace username, password, es-cluster-hostname, and esport with your own values.

If you’re using the Elastic Cloud or have signed certificates, simply run this command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --hub-model-id dslim/bert-base-NER --task-type ner --start

If you’re using self-signed certificates, run this command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start

From here you’ll witness the Eland client in action downloading the trained model from HuggingFace and automatically deploying it into your cluster!

Synchronize your newly loaded trained model by clicking the blue “Synchronize your jobs and trained models” hyperlink in the Machine Learning Overview UI.

Now click the Synchronize button.

That’s it! Congratulations, you just loaded your first trained model into Elastic!

Create the redact processor and ingest pipeline

From DevTools, let’s configure the redact processor along with our inference processor to take advantage of Elastic’s trained model we just loaded. This will create an ingest pipeline named “redact” that we can then use to remove sensitive data from any field we wish. In this example, I’ll be focusing on the “message” field. Note: at the time of this writing, the redact processor is experimental and must be created via DevTools.

PUT _ingest/pipeline/redact
{
  "processors": [
    {
      "set": {
        "field": "redacted",
        "value": "{{{message}}}"
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "String msg = ctx['message'];\r\n                for (item in ctx['ml']['inference']['entities']) {\r\n                msg = msg.replace(item['entity'], '<' + item['class_name'] + '>')\r\n                }\r\n                ctx['redacted']=msg"
      }
    },
    {
      "redact": {
        "field": "redacted",
        "patterns": [
          "%{EMAILADDRESS:EMAIL}",
          "%{IP:IP_ADDRESS}",
          "%{CREDIT_CARD:CREDIT_CARD}",
          "%{SSN:SSN}",
          "%{PHONE:PHONE}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": "\\d{4}[ -]\\d{4}[ -]\\d{4}[ -]\\d{4}",
          "SSN": "\\d{3}-\\d{2}-\\d{4}",
          "PHONE": "\\d{3}-\\d{3}-\\d{4}"
        }
      }
    },
    {
      "remove": {
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "pii_script-redact"
      }
    }
  ]
}

OK, but what does each processor really do? Let’s walk through each processor in detail here:

The SET processor creates the field “redacted,” which is copied over from the message field and used later on in the pipeline.
The INFERENCE processor calls the NER model we loaded to be used on the message field for identifying names, locations, and organizations.
The SCRIPT processor then replaces each detected entity in the message text with a placeholder built from its class name (for example, <PER> for a person) and writes the result to the redacted field.
Our REDACT processor uses Grok patterns to identify any custom set of data we wish to remove from the redacted field (which was copied over from the message field).
The REMOVE processor deletes the extraneous ml.* fields from being indexed; note we’ll add “message” to this processor once we validate data is being redacted properly.
The ON_FAILURE / SET processor captures any errors just in case we have them.
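
To see how the SCRIPT and REDACT steps combine, here is a small Python sketch of the same two passes over a message. It is an approximation under stated assumptions, not what Elasticsearch runs: the entities list stands in for the NER model's output, the CREDIT_CARD, SSN, and PHONE expressions are the ones from pattern_definitions above, and the EMAIL and IP_ADDRESS expressions are simplified stand-ins for the built-in Grok patterns.

```python
import re

# Regexes taken from the pipeline's pattern_definitions
CUSTOM_PATTERNS = {
    "CREDIT_CARD": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN": r"\d{3}-\d{2}-\d{4}",
    "PHONE": r"\d{3}-\d{3}-\d{4}",
}

# Simplified stand-ins for the built-in Grok EMAILADDRESS and IP patterns (assumption)
BUILTIN_APPROXIMATIONS = {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "IP_ADDRESS": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}


def redact(message, entities):
    """Mask NER entities first, then anything matching the regex patterns."""
    redacted = message
    # SCRIPT step: swap each detected entity for <CLASS_NAME>
    for item in entities:
        redacted = redacted.replace(item["entity"], "<" + item["class_name"] + ">")
    # REDACT step: swap each pattern match for <PATTERN_NAME>
    for name, pattern in {**BUILTIN_APPROXIMATIONS, **CUSTOM_PATTERNS}.items():
        redacted = re.sub(pattern, "<" + name + ">", redacted)
    return redacted


message = (
    "John Smith's email is jsmith123@email.com, his phone number is 412-189-9043, "
    "his SSN is 942-00-1243, his card is 1324-8374-0978-2819 and his IP is 192.168.1.2"
)
# Hypothetical model output for the message above
entities = [{"entity": "John Smith", "class_name": "PER"}]

print(redact(message, entities))
```

Running the entity pass before the regex pass mirrors the processor order in the pipeline, so the regexes operate on text where names and locations are already masked.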

Slice your PII

Now that your ingest pipeline with all the necessary steps has been configured, let’s start testing how well we can remove sensitive data from documents. Navigate over to Stack Management, select Ingest Pipelines and search for “redact”, and then click on the result.

Click on the Manage button, and then click Edit.

Here we are going to test our pipeline by adding some documents. Below is a sample you can copy and paste to make sure everything is working correctly.

{
  "_source":
    {
      "message": "John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2"
    }
}

Simply press the Run the pipeline button, and you will then see the following output:

What’s next?

After you’ve added this ingest pipeline to a data set you’re indexing and validated that it is meeting expectations, you can add the message field to be removed so that no PII data is indexed. Simply update your REMOVE processor to include the message field and simulate again to only see the redacted field.

Conclusion

With this step-by-step approach, you are now ready and able to detect and redact any sensitive data throughout your indices.

Here’s a quick recap of what we covered:

Loading a pre-trained named entity recognition model into an Elastic cluster
Configuring the Redact processor, along with the inference processor, to use the trained model during data ingestion
Testing sample data and modifying the ingest pipeline to safely remove personally identifiable information

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.

In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

Gaining new perspectives beyond logging: An introduction to application performance monitoring

Tue, 30 May 2023 00:00:00 GMT

Prioritize customer experience with APM and tracing

Enterprise software development and operations has become an interesting space. We have some incredibly powerful tools at our disposal, yet as an industry, we have failed to adopt many of the tools that can make our lives easier. One such underutilized tool is application performance monitoring (APM) and tracing, despite the fact that OpenTelemetry has made it possible to adopt with low friction.

Logging, however, is ubiquitous. Every software application has logs of some kind, and the default workflow for troubleshooting (even today) is to go from exceptions experienced by customers and systems to the logs and start from there to find a solution.

There are various challenges with this, one of the main ones being that logs often do not give enough information to solve the problem. Many services today return ambiguous 500 errors with little or nothing to go on. What if there isn’t an error or log file at all or the problem is that the system is very slow? Logging alone cannot help solve these problems. This leaves users with half broken systems and poor user experiences. We’ve all been on the wrong side of this, and it can be incredibly frustrating.

The question I find myself asking is why does the customer experience often come second to errors? If the customer experience is a top priority, then a strategy should be in place to adopt tracing and APM and make this as important as logging. Users should stop going to logs by default and thinking primarily in logs, as many are doing today. This will also come with some required changes to mental models.

What’s the path to get there? That’s exactly what we will explore in this blog post. We will start by talking about supporting organizational changes, and then we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.

Cultivating a new monitoring mindset: How to drive APM and tracing adoption

To get teams to shift their troubleshooting mindset, what organizational changes need to be made?

Initially, businesses should consider strategic priorities and goals that need to be shared broadly among the teams. One thing that can help drive this in a very large organization is to consider an entire product team devoted to Observability or a CoE (Center of Excellence) with its own roadmap and priorities.

This team (either virtual or permanent) should start with the customer in mind and work backward, starting with key questions like: What do I need to collect? What do I need to observe? How do I act? Once team members understand the answers to these questions, they can start to think about the technology decisions needed to drive those outcomes.

From a tracing and APM perspective, the areas of greatest concern are the customer experience, service level objectives, and service level outcomes. From here, organizations can start to implement programs of work to continuously improve and share knowledge across teams. This will help to align teams around a common framework with shared goals.

In the next few sections, we will go through a four-step journey to help you maximize your success with APM and tracing. It covers the following key steps on the way to successful APM adoption:

Ingest: What choices do you have to make to get tracing activated and start ingesting trace data into your observability tools?
Integrate: How does tracing integrate with logs to enable full end-to-end observability, and what else beyond simple tracing can you utilize to get even better resolution on your data?
Analytics and AIOps: Improve the customer experience and reduce the noise through machine learning.
Scale and total cost of ownership: Roll out enterprise-wide tracing and adopt strategies to deal with data volume.

1. Ingest

Ingesting data for APM purposes generally involves “instrumenting” the application. In this section, we will explore methods for instrumenting applications, talk a little bit about sampling, and finally wrap up with a note on using common schemas for data representation.

Getting started with instrumentation

What options do we have for ingesting APM and trace data? There are many, and we will discuss them to help guide you, but first let’s take a step back. APM has a deep history: in the very first implementations of APM, people were concerned mainly with timing methods, as shown below:

Usually you had a configuration file to specify which methods you wanted to time, and the APM implementation would instrument the specified code with method timings.
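
As a rough illustration of what those early method timings amounted to, here is a minimal Python sketch that wraps a function and records how long each call takes. It is not taken from any APM product; the names timed, timings, and lookup_order are hypothetical, and a real agent would apply the wrapping for you based on configuration rather than a decorator in your code.

```python
import functools
import time
from collections import defaultdict

# Collected durations in seconds, keyed by function name
timings = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the call raises
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def lookup_order(order_id):
    time.sleep(0.01)  # stand-in for real work
    return {"id": order_id}


lookup_order(42)
lookup_order(43)

for name, samples in timings.items():
    average_ms = sum(samples) / len(samples) * 1000
    print(f"{name}: calls={len(samples)} avg={average_ms:.1f} ms")
```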

From here things started to evolve, and one of the first additions to APM was to add in tracing.

For Java, it’s fairly trivial to implement a system to do this by using what’s known as a Java agent. You just specify the -javaagent command-line argument, and the agent code gets access to the dynamic compilation routines within Java so it can modify the code before it is compiled into machine code, allowing you to “wrap” specific methods with timing or tracing routines. So, auto instrumenting Java was one of the first things that the original APM vendors did.

OpenTelemetry has agents like this, and most observability vendors that offer APM solutions have their own proprietary ways of doing this, often with more advanced and differing features from the open source tooling.

Things have moved on since then, and Node.JS and Python are now popular.

As a result, ways of auto instrumenting these language runtimes have appeared, which mostly work by injecting the libraries into the code before starting them up. OpenTelemetry has a way of doing this on Kubernetes with an Operator and sidecar here, which supports Python, Node.JS, Java, and DotNet.

The other alternative is to start adding APM and tracing API calls into your own code, which is not dissimilar to adding logging functionality. You may even wish to create an abstraction in your code to deal with this cross-cutting concern, although this is less of a problem now that there are open standards with which you can implement this.

You can see an example of how to add OpenTelemetry spans and attributes to your code for manual instrumentation below and here.

from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


# Service name is required for most backends
resource = Resource(attributes={
    SERVICE_NAME: "your-service-name"
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers="Authorization=Bearer%20"+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))

provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()

# Initialize the Flask app
app = Flask(__name__)

@app.route("/completion")
@tracer.start_as_current_span("do_work")
def completion():
    # Attach a custom attribute to the span started by the decorator
    span = trace.get_current_span()
    if span:
        span.set_attribute("completion_count", 1)
    # Flask views must return a response
    return "ok"

By implementing APM in this way, you could even eliminate the need to do any logging by storing all your required logging information within span attributes, exceptions, and metrics. The downside is that you can only do this with code that you own, so you will not be able to remove all logs this way.

Sampling

Many people don’t realize that APM is an expensive process. It adds a lot of CPU cycles and memory to your applications, and although there is a lot of value to be had, there are certainly trade-offs to be made.

Should you sample everything 100% and eat the cost? Or should you think about an intelligent trade-off with fewer samples or even tail-based sampling, which many products commonly support? Here, we will talk about the two most common sampling techniques — head-based sampling and tail-based sampling — to help you decide.

Head-based sampling
In this approach, sampling decisions are made at the beginning of a trace, typically at the entry point of a service or application. A fixed rate of traces is sampled, and this decision propagates through all the services involved in a distributed trace.

With head-based sampling, you can control the rate using a configuration, allowing you to control the percentage of requests that are sampled and reported to the APM server. For instance, a sampling rate of 0.5 means that only 50% of requests are sampled and sent to the server. This is useful for reducing the amount of collected data while still maintaining a representative sample of your application's performance.

Tail-based sampling
Unlike head-based sampling, tail-based sampling makes sampling decisions after the entire trace has been completed. This allows for more intelligent sampling decisions based on the actual trace data, such as only reporting traces with errors or traces that exceed a certain latency threshold.

We recommend tail-based sampling because it has the highest likelihood of reducing the noise and helping you focus on the most important issues. It also helps keep costs down on the data store side. A downside of tail-based sampling, however, is that it results in more data being generated from APM agents. This could use more CPU and memory on your application.
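
The difference between the two approaches is easiest to see side by side. The sketch below is a simplified illustration, not the behavior of any particular agent or collector: head_sample decides from the trace ID alone (so every service reaches the same answer), while tail_sample waits for the finished spans and keeps traces with errors or high latency. The span dictionary shape, the 1000 ms threshold, and the function names are assumptions made for the example.

```python
import random


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide up front, using only the trace ID and a fixed rate."""
    # Compare the lower 64 bits of the hex trace ID against a rate-derived bound,
    # so the decision is deterministic for a given trace.
    bound = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def tail_sample(spans, latency_threshold_ms=1000.0, baseline_rate=0.0, rng=random.random) -> bool:
    """Tail-based: decide after the trace has completed, using its contents."""
    if any(span["error"] for span in spans):
        return True  # always keep traces that contain an error
    if max(span["duration_ms"] for span in spans) > latency_threshold_ms:
        return True  # always keep slow traces
    # Optionally keep a small random share of healthy traces as a baseline
    return rng() < baseline_rate


healthy = [{"duration_ms": 120.0, "error": False}, {"duration_ms": 35.0, "error": False}]
failed = [{"duration_ms": 80.0, "error": True}]
slow = [{"duration_ms": 2400.0, "error": False}]

print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=0.5))
print(tail_sample(healthy), tail_sample(failed), tail_sample(slow))
```

Note that tail_sample needs every span of the trace before it can decide, which is exactly why this approach means more data leaves your application before anything is dropped.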

OpenTelemetry Semantic Conventions and Elastic Common Schema

OpenTelemetry prescribes Semantic Conventions, or Semantic Attributes, to establish uniform names for various operations and data types. Adhering to these conventions fosters standardization across codebases, libraries, and platforms, ultimately streamlining the monitoring process.

Creating OpenTelemetry spans for tracing is flexible, allowing implementers to annotate them with operation-specific attributes. These spans represent particular operations within and between systems, often involving widely recognized protocols like HTTP or database calls. To effectively represent and analyze a span in monitoring systems, supplementary information is necessary, contingent upon the protocol and operation type.

Unifying attribution methods across different languages is essential for operators to easily correlate and cross-analyze telemetry from polyglot microservices without needing to grasp language-specific nuances.

Elastic's recent contribution of the Elastic Common Schema to OpenTelemetry enhances Semantic Conventions to encompass logs and security.

Abiding by a shared schema yields considerable benefits, enabling operators to rapidly identify intricate interactions and correlate logs, metrics, and traces, thereby expediting root cause analysis and reducing time spent searching for logs and pinpointing specific time frames.

We advocate for adhering to established schemas such as ECS when defining trace, metrics, and log data in your applications, particularly when developing new code. This practice will conserve time and effort when addressing issues.

2. Integrate

Integrations are very important for APM. How well your solution can integrate with other tools and technologies such as cloud, as well as its ability to integrate logs and metrics into your tracing data, is critical to fully understand the customer experience. In addition, most APM vendors have adjacent solutions for synthetic monitoring and profiling to gain deeper perspectives to supercharge your APM. We will explore these topics in the following section.

APM + logs = superpowers!

Because APM agents can instrument code, they can also instrument code that is being used for logging. This way, you can capture log lines directly within APM. This is normally simple to enable.

With this enabled, you will also get automated injection of useful fields like these:

service.name, service.version, service.environment
trace.id, transaction.id, error.id

This means log messages will be automatically correlated with transactions as shown below, making it far easier to reduce mean time to resolution (MTTR) and find the needle in the haystack:

If this is available to you, we highly recommend turning it on.
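
For a feel of what that injection does, here is a hedged sketch using only Python's standard logging module. An APM agent does this for you automatically; the names TraceContextFilter, JsonFormatter, build_logger, and current_trace_id are hypothetical and exist only to show the mechanism of stamping each log line with the active trace ID and service name.

```python
import contextvars
import io
import json
import logging

# Holds the ID of the trace currently being processed (set by instrumentation)
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceContextFilter(logging.Filter):
    """Copy correlation fields onto every log record."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def filter(self, record):
        record.service_name = self.service_name
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON document per log line with correlation fields."""

    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "log.level": record.levelname,
            "service.name": record.service_name,
            "trace.id": record.trace_id,
        })


def build_logger(stream, service_name):
    logger = logging.getLogger("correlation-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter(service_name))
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = build_logger(stream, "checkout")
current_trace_id.set("abc123")  # normally set when a request starts
logger.info("payment failed")
print(stream.getvalue().strip())
```

Because every line carries trace.id, the backend can join a log message to the exact transaction that produced it, without anyone searching by timestamp.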

Deploying APM inside Kubernetes

It is common for people to want to deploy APM inside a Kubernetes environment, and tracing is critical for monitoring applications in cloud-native environments. There are three different ways you can tackle this.

1. Auto instrumentation using sidecars
With Kubernetes, it is possible to use an init container and something that will modify Kubernetes manifests on the fly to auto instrument your applications.

The init container simply copies the required library or JAR file at startup to a location the main application container can read, typically a shared volume. Then, you can use Kustomize to add the required command line arguments to bootstrap your agents.

If you are not familiar with it, Kustomize adds, removes, or modifies Kubernetes manifests on the fly. It is even built into the Kubernetes CLI: simply run kubectl apply -k against the directory containing your kustomization.

OpenTelemetry has an operator that does all this for you automatically (without the need for Kustomize) for Java, DotNet, Python, and Node.JS, and many vendors also have their own operator or helm charts that can achieve the same result.

2. Baking APM into containers or code
A second option for deploying APM in Kubernetes, and indeed in any containerized environment, is to bake the APM agents and configuration into a Dockerfile.

Have a look at an example here using the OpenTelemetry Java Agent:

# Use the official OpenJDK image as the base image
FROM openjdk:11-jre-slim

# Set up environment variables
ENV APP_HOME /app
ENV OTEL_VERSION 1.7.0-alpha
ENV OTEL_JAVAAGENT_URL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_VERSION}/opentelemetry-javaagent-${OTEL_VERSION}-all.jar

# Create the application directory
RUN mkdir $APP_HOME
WORKDIR $APP_HOME

# Download the OpenTelemetry Java agent
ADD ${OTEL_JAVAAGENT_URL} /otel-javaagent.jar

# Add your Java application JAR file
COPY your-java-app.jar $APP_HOME/your-java-app.jar

# Expose the application port (e.g. 8080)
EXPOSE 8080

# Configure the OpenTelemetry Java agent and run the application
CMD java -javaagent:/otel-javaagent.jar \
      -Dotel.resource.attributes=service.name=your-service-name \
      -Dotel.exporter.otlp.endpoint=your-otlp-endpoint:4317 \
      -Dotel.exporter.otlp.insecure=true \
      -jar your-java-app.jar

3. Tracing using a service mesh (Envoy/Istio)
The final option you have here is if you are using a service mesh. A service mesh is a dedicated infrastructure layer for handling service-to-service communication in a microservices architecture. It provides a transparent, scalable, and efficient way to manage and control the communication between services, enabling developers to focus on building application features without worrying about inter-service communication complexities.

The great thing about this is that we can activate tracing within the proxy and therefore get visibility into requests between services. We don’t have to change any code or even run APM agents for this; we simply turn on the OpenTelemetry collector that exists within the proxy — therefore this is likely the lowest overhead solution. Learn more about this option.

Synthetics and Universal Profiling

Most APM vendors have add-ons to the primary APM use cases. Typically we see synthetics and continuous profiling being added to APM solutions. APM can integrate with both, and there is some good value in bringing these technologies together to give even more insights into issues.

Synthetics
Synthetic monitoring is a method used to measure the performance, availability, and reliability of web applications, websites, and APIs by simulating user interactions and traffic. It involves creating scripts or automated tests that mimic real user behavior, such as navigating through pages, filling out forms, or clicking buttons, and then running these tests periodically from different locations and devices.

This gives Development and Operations teams the ability to spot problems far earlier than they might otherwise, catching issues before real users do in many cases.

Synthetics can be integrated with APM by injecting an APM agent into the website when the script runs, so even if you didn’t put end user monitoring into your website initially, it can be added at run time. This usually happens without any input from the user. From there, a tracing ID for each request can be passed down through the various layers of the system, allowing teams to follow the request all the way from the synthetics script to the lowest levels of the application stack, such as the database.

Universal profiling
“Profiling” is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of your application are consuming the most resources. “Continuous profiling” is a more powerful version of profiling that adds the dimension of time. By understanding your system’s resources over time, you can then locate, debug, and fix issues related to performance.

Universal profiling is a further extension of this, which allows you to capture profile information about all of the code running in your system all the time. Using a technology like eBPF can allow you to see all the function calls in your systems, including into things like the Kubernetes runtime. Doing this gives you the ability to finally see unknown unknowns — things you didn’t know were problems. This is very different from APM, which is really about tracking individual traces and requests and the overall customer experience. Universal profiling is about overcoming those issues you didn’t even know existed and even answering the question “What is my most expensive line of code?”

Universal profiling can be linked into APM, showing you profiles that occurred during a specific customer issue, for example, or by linking profiles directly to traces by looking at the global state that exists at the thread level. These technologies can work wonders when used together.

Typically, profiles are viewed as “flame graphs” shown below. The boxes represent the amount of “on-cpu” time spent executing a particular function.

3. Analytics and AIOps

The interesting thing about APM is it opens up a whole new world of analytics versus just logs. All of a sudden, you have access to the information flows from inside applications.

This allows you to easily capture things like the amount of money a specific customer is currently spending on your most critical ecommerce store, or look at failed trades in a brokerage app to see how much revenue those failures are costing. You can even then apply machine learning algorithms to project future spend or look at anomalies occurring in this data, giving you a new window into how your business runs.

In this section, we will look at ways to do this and how to get the most out of this new world, as well as how to apply AIOps practices to this new data. We will also discuss getting SLIs and SLOs set up for APM data.

Getting business data into your traces

There are generally two ways of getting business data into your traces. You can modify code and add in Span attributes, an example of which is available here and shown below. Or you can write an extension or a plugin, which has the benefit of avoiding code changes. OpenTelemetry supports adding extensions in its auto-instrumentation agents. Most other APM vendors usually have something similar.

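import json
from functools import wraps

from opentelemetry import trace

# counters and calculate_cost() are assumed to be defined elsewhere in the application
def count_completion_requests_and_tokens(func):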
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)

        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)

        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute("completion_count", counters['completion_count'])
            span.set_attribute("token_count", token_count)
            span.set_attribute("prompt_tokens", prompt_tokens)
            span.set_attribute("completion_tokens", completion_tokens)
            span.set_attribute("model", response.model)
            span.set_attribute("cost", cost)
            span.set_attribute("response", strResponse)
        return response
    return wrapper

Using business data for fun and profit

Once you have the business data in your traces, you can start to have some fun with it. Take a look at the example below for a financial services fraud team. Here we are tracking transactions — average transaction value for our larger business customers. Crucially, we can see if there are any unusual transactions.

A lot of this is powered by machine learning, which can classify transactions or do anomaly detection. Once you start capturing the data, it is possible to do a lot of useful things like this, and with a flexible platform, integrating machine learning models into this process becomes a breeze.

SLIs and SLOs

Service level indicators (SLIs) and service level objectives (SLOs) serve as critical components for maintaining and enhancing application performance. SLIs, which represent key performance metrics such as latency, error rate, and throughput, help quantify an application's performance, while SLOs establish target performance levels to meet user expectations.

By selecting relevant SLIs and setting achievable SLOs, organizations can better monitor their application's performance using APM tools. Continually evaluating and adjusting SLIs and SLOs in response to changes in application requirements, user expectations, or the competitive landscape ensures that the application remains competitive and delivers an exceptional user experience.

In order to define and track SLIs and SLOs, APM becomes a critical perspective that is needed for understanding the user experience. Once APM is implemented, we recommend that organizations perform the following steps.

Define SLOs and the SLIs required to track them.
Define SLO budgets and how they are calculated. Reflect the business's perspective and set realistic targets.
Define SLIs to be measured from a user experience perspective.
Define alerting and paging rules: page only on customer-facing SLO degradations, record symptomatic alerts, and notify on critical symptomatic alerts.
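
As a rough illustration of how an SLO budget can be calculated, the sketch below derives an error budget from a request-based availability SLO. The function name, the 99.9% target, and the request counts are hypothetical values chosen for this example, not figures from a real system.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return the allowed failures and the share of the budget already used.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }


# Hypothetical 30-day window: 1,000,000 requests, 250 of them failed.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(budget)
```

A paging rule could then fire when budget_consumed crosses a threshold agreed with the business, while lower thresholds only record or notify.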

Synthetic monitoring and end user monitoring (EUM) can also provide more of the data needed to understand latency, throughput, and error rate from the user’s perspective, which is where good business-focused metrics and data come from.

4. Scale and total cost of ownership

With increased perspectives, customers often run into scalability and total cost of ownership issues. All this new data can be overwhelming. Luckily, there are various techniques you can use to deal with this. Tracing itself can actually help with volume challenges because you can decompose unstructured logs and combine them with traces, which leads to additional efficiency. You can also use different sampling methods to deal with scale challenges. Both techniques were mentioned previously.
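
To make the sampling idea concrete, here is a minimal sketch of ratio-based head sampling, where the keep-or-drop decision is derived deterministically from the trace ID so that every service reaches the same decision for the same trace. The function name and the 10% ratio are assumptions for illustration. The approach is similar in spirit to trace-ID ratio samplers found in tracing SDKs, but it is not taken from any specific one.

```python
import random

MAX_64 = 1 << 64


def should_sample(trace_id, ratio):
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return (trace_id & (MAX_64 - 1)) < int(ratio * MAX_64)


# Simulate 10,000 random 128-bit trace IDs and keep roughly 10% of them.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partially sampled traces.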

In addition to this, for large enterprise scale, we can use streaming pipelines like Kafka or Pulsar to manage the data volumes. This has an additional benefit that you get for free: if you take down the systems consuming the data or they face outages, it is less likely you will lose data.

With this configuration in place, your “Observability pipeline” architecture would look like this:

This completely decouples your sources of data from your chosen observability solution, which will future-proof your observability stack, enable you to reach massive scale, and make you less reliant on specific vendor code for collection of data.

Another thing we recommend is being intelligent about instrumentation. This has two benefits: you will get some CPU cycles back in the instrumented application, and your backend data collection systems will have less data to process. If you know, for example, that you have no interest in tracking calls to a specific endpoint, you can exclude the classes and methods that serve it from instrumentation.

And finally, data tiering is a transformative approach for managing data storage that can significantly reduce the total cost of ownership (TCO) for businesses. Primarily, it allows organizations to store data across different types of storage mediums based on their accessibility needs and the value of the data. For instance, frequently accessed, high-value data can be stored in expensive, high-speed storage, while less frequently accessed, lower-value data can be stored in cheaper, slower storage.

This approach, often incorporated in cloud storage solutions, enables cost optimization by ensuring that businesses only pay for the storage they need at any given time. Furthermore, it provides the flexibility to scale up or down based on demand, eliminating the need for large capital expenditures on storage infrastructure. This scalability also reduces the need for costly over-provisioning to handle potential future demand.

Conclusion

In today's highly competitive and fast-paced software development landscape, simply relying on logging is no longer sufficient to ensure top-notch customer experiences. By adopting APM and distributed tracing, organizations can gain deeper insights into their systems, proactively detect and resolve issues, and maintain a robust user experience.

In this blog, we have explored the journey of moving from a logging-only approach to a comprehensive observability strategy that integrates logs, traces, and APM. We discussed the importance of cultivating a new monitoring mindset that prioritizes customer experience, and the necessary organizational changes required to drive APM and tracing adoption. We also delved into the various stages of the journey, including data ingestion, integration, analytics, and scaling.

By understanding and implementing these concepts, organizations can optimize their monitoring efforts, reduce MTTR, and keep their customers satisfied. Ultimately, prioritizing customer experience through APM and tracing can lead to a more successful and resilient enterprise in today's challenging environment.

Learn more about APM at Elastic.

Dynamic workload discovery on Kubernetes now supported with EDOT Collector

Tue, 01 Apr 2025 00:00:00 GMT

At Elastic, Kubernetes is one of the most significant observability use cases we focus on. We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.

OpenTelemetry recently published a blog on how to do Autodiscovery based on Kubernetes Pods' annotations with the OpenTelemetry Collector.

In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector, which is already available with the Elastic Distribution of OpenTelemetry (EDOT) Collector.

In addition to this feature, at Elastic, we heavily invest in making OpenTelemetry the best, standardized ingest solution for Observability. You might already have seen us focusing on:

Semantic Conventions standardization
significant log collection improvements
various other topics around instrumentation
profiling

Let's walk you through a hands-on journey using the EDOT Collector covering various use cases you might encounter in the real world, highlighting the capabilities of this powerful feature.

Configuring EDOT Collector

The Collector’s configuration is not our main focus here: by the nature of this feature it is minimal, letting workloads define how they should be monitored.

To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:

receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

You can include the above in the EDOT Collector’s configuration, specifically in the receivers section.

Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver configuration block is removed and its preset is disabled (i.e., set to false) to avoid log duplication.

Make sure that the receiver creator is properly added in the pipelines for logs (in addition to removing the filelog receiver completely) and metrics respectively.

Ensure that k8s_observer is enabled as part of the extensions:

extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]

Last but not least, ensure the log files' volume is mounted properly:

volumeMounts:
 - name: varlogpods
   mountPath: /var/log/pods
   readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods

Once the configuration is ready, follow the Kubernetes quickstart guides on how to deploy the EDOT Collector. Make sure to replace the values.yaml file linked in the quickstart guide with the file that includes the modifications described above.

Collecting Metrics from Moving Targets Based on Their Annotations

In this example, we have a Deployment with a Pod spec that consists of two different containers. One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide different hints for each of these target containers.

The annotation-based discovery feature supports this, allowing us to specify metrics annotations per exposed container port.

Here is how the complete spec file looks:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: "true"
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: "20s"
          timeout: "10s"

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: "true"
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: "http://`endpoint`/nginx_status"
          collection_interval: "30s"
          timeout: "20s"
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP

When this workload is deployed, the Collector will automatically discover it and identify the specific annotations. After this, two different receivers will be started, each one responsible for each of the target containers.
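
To show how the port-scoped annotations map to two separate receiver configurations, the following sketch groups the metrics hints by port. It only illustrates the naming scheme used in the spec above. The function name is hypothetical, and this is not how the Collector's receiver creator is implemented.

```python
from collections import defaultdict

PREFIX = "io.opentelemetry.discovery.metrics."


def group_metric_hints(annotations):
    """Group port-scoped metrics hints as {port: {setting: value}}."""
    hints = defaultdict(dict)
    for key, value in annotations.items():
        if not key.startswith(PREFIX):
            continue  # not a port-scoped metrics hint
        port, _, setting = key[len(PREFIX):].partition("/")
        hints[port][setting] = value
    return dict(hints)


# A subset of the annotations from the Deployment above.
annotations = {
    "io.opentelemetry.discovery.metrics.6379/enabled": "true",
    "io.opentelemetry.discovery.metrics.6379/scraper": "redis",
    "io.opentelemetry.discovery.metrics.80/enabled": "true",
    "io.opentelemetry.discovery.metrics.80/scraper": "nginx",
}

for port, settings in group_metric_hints(annotations).items():
    print(port, settings)
```

Each port ends up with its own group of settings, which corresponds to one receiver per target container.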

Collecting Logs from Multiple Target Containers

The annotation-based discovery feature also supports log collection based on the provided annotations. In the example below, we again have a Deployment with a Pod consisting of two different containers, where we want to apply different log collection configurations. We can specify annotations that are scoped to individual container names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: "true"
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: "true"
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo "otel logs from busybox at $(date +%H:%M:%S)" && sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo "otel logs from lazybox at $(date +%H:%M:%S)" && sleep 25s; done

The above configuration enables two different filelog receiver instances, each applying a unique parsing configuration. This is handy when we know how to parse specific technology logs, such as Apache server access logs.

Combining Both Metrics and Logs Collection

In our third example, we illustrate how to define both metrics and log annotations on the same workload. This allows us to collect both signals from the discovered workload. Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing. We can target annotations to the port and container levels to collect metrics from the Redis server using the Redis receiver, and logs from the BusyBox using the filelog receiver. Here’s how:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: "true"
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: "20s"
          timeout: "10s"

        io.opentelemetry.discovery.logs.busybox/enabled: "true"
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo "otel logs at $(date +%H:%M:%S)" && sleep 15s; done

Explore and analyse data coming from dynamic targets in Elastic

Once the target Pods are discovered and the Collector has started collecting telemetry data from them, we can explore this data in Elastic. In Discover, we can search for Redis and NGINX metrics as well as logs collected from the BusyBox container. Here is how it looks:

Summary

The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature — one we played a major role in developing.

For this, we leveraged our years of experience with similar features already supported in Metricbeat, Filebeat, and Elastic-Agent. This makes us extremely happy and confident, as it closes the feature gap between Elastic's specific monitoring agents and the OpenTelemetry Collector — making it even better.

Interested in learning more? Visit the documentation and give it a try by following our EDOT quickstart guide.

Kibana: How to create impactful visualisations with magic formulas ? (part 1)

Mon, 09 Sep 2024 00:00:00 GMT

Introduction

In the previous blog post, Designing Intuitive Kibana Dashboards as a non-designer, we highlighted the importance of creating intuitive dashboards. It demonstrated how simple changes (grouping themes, changing chart types, and more) can make a difference in understanding your data. When delivering courses like Data Analysis with Kibana or Elastic Observability Engineer, we emphasize this blog post and how these changes help bring essential information to the surface. I like a complementary approach to reach this goal: using two colors to separate the highest data values from the common ones.

To illustrate this idea, we will use the Sample flight data dataset. Now, let’s compare two visualizations ranking the top 10 destination countries per total number of flights. Which visualization has a higher impact?

If you chose the second one, you may be wondering how this was done with the Kibana Lens editor. While preparing for the certification last year, I found a way to achieve this result. The secret is using two different layers and some magic formulas. This post will explain how math in Lens formulas helps create two data-color visualizations.

We will start with the first example that emphasizes only the highest value of the dataset we are focusing on. The second example describes how to highlight other high values (as shown in the illustration above).

[Note: the tips explained in this blog post can be applied from v7.15 onward]

Only the highest value

To understand how math helps to separate high values from common ones, let’s start with this first example: emphasizing only the highest value.

We start with a bar horizontal chart:

We need to identify the highest value of the scope we are currently examining. We will use one of the overall_* functions: overall_max(), a pipeline function (equivalent to a pipeline aggregation in Query DSL).

In our example, we group the flights by destination country. This means we count the number of flights for each DestCountry (= 1 bucket). The overall_max() function will select which bucket has the highest value.

The math trick here is to divide the number of flights per bucket by the maximum value found among all buckets. Only one bucket will return 1: the bucket matching the max value found by overall_max(). All the other buckets will return a value < 1 and > 0. We use floor() to ensure any 0.xxx values are rounded to 0.

Now, we can multiply it by count() and we have our formula for the first layer!

Layer 1: count()*floor(count()/overall_max(count()))

From here, in the Lens editor, we duplicate the layer and adjust the formula of the second layer, which contains the rest of the data. We need to prepend another count() followed by the minus operator to the formula. This is the other trick. In this layer, we just need to ensure the highest value is not represented, which happens only once: when count() equals overall_max(count()), so that their ratio is 1.

Layer 2: count() - count()*floor(count()/overall_max(count()))
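
To check the arithmetic of the two layers outside of Kibana, here is a small Python sketch that applies the same formulas to a handful of buckets. The country codes and flight counts are made-up values for illustration, and the function name is hypothetical.

```python
import math


def split_highest(counts):
    """Apply the Layer 1 and Layer 2 formulas to a {bucket: count} mapping."""
    overall_max = max(counts.values())
    # Layer 1: count()*floor(count()/overall_max(count()))
    layer1 = {k: c * math.floor(c / overall_max) for k, c in counts.items()}
    # Layer 2: count() - count()*floor(count()/overall_max(count()))
    layer2 = {k: c - layer1[k] for k, c in counts.items()}
    return layer1, layer2


# Hypothetical flight counts per destination country.
flights = {"IT": 2371, "US": 1987, "CN": 1096, "CA": 944}
layer1, layer2 = split_highest(flights)
print(layer1)
print(layer2)
```

For every bucket, exactly one of the two layers carries the count and the other is 0, which is why stacking the layers reproduces the original bars in two colors. If several buckets tie for the maximum, all of them land in Layer 1.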

To achieve a nice merge of these two layers, we need to do the following adjustments in both:

select bar horizontal stacked
Vertical axis: change “Rank by” to Custom and ensure Rank function is “Count”

Here is the final setup of the two layers:

Layer 1: count()*floor(count()/overall_max(count()))

Layer 2: count() - count()*floor(count()/overall_max(count()))

This visualization also works well for time series data where you need to quickly highlight which time period (12h in the example below) had the highest number of flights:

Above the surface

Building on what we have done earlier, we can extend the approach to get other high values above the surface. Let’s see which formula we used to create the visualization in the introduction:

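For this visualization, we used a property of the round() function: it rounds the ratio count()/overall_max(count()) to 1 once the ratio reaches 0.5, and to 0 below that. As a result, Layer 1 brings in only the values greater than 50% of the highest value.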

Let's duplicate our first visualization and swap out the floor() function with round().

Layer 1: count()*round(count()/overall_max(count()))

Layer 2: count() - count()*round(count()/overall_max(count()))
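
The effect of swapping floor() for round() can be verified with a short Python sketch. It assumes that the formula's round() rounds halves up, which is emulated here with floor(x + 0.5) because Python's built-in round() rounds halves to the nearest even number. The bucket names and counts are invented for the example.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with .5 going up (assumed formula behavior)."""
    return math.floor(x + 0.5)


def split_above_half_of_max(counts):
    """Layer 1 keeps buckets whose count is at least half of the maximum."""
    overall_max = max(counts.values())
    layer1 = {k: c * round_half_up(c / overall_max) for k, c in counts.items()}
    layer2 = {k: c - layer1[k] for k, c in counts.items()}
    return layer1, layer2


# Hypothetical bucket counts.
counts = {"A": 100, "B": 60, "C": 49, "D": 10}
layer1, layer2 = split_above_half_of_max(counts)
print(layer1)
print(layer2)
```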

It was an easy fix.
What if we want to extend the first layer further by adding more high values?
For instance, we would like all the values above the average.

To do this, we use overall_average() as a new reference value instead of the overall_max() reference to separate the eligible values in Layer 1.

As we are comparing against the average value among all the buckets, the division might return values greater than 1.

Here, the clamp() function nicely solves this issue.

According to the formula reference, clamp() "limits the value from a minimum to maximum". Combining clamp() and floor() ensures that there are only two possible output values: either the minimum value ( 0 ) or the maximum value ( 1 ) given as parameters.

Applied to our flights dataset, it highlights the country destinations that have more flights than the average:

Layer 1: count()*clamp(floor(count()/overall_average(count())),0,1)

Layer 2: count() - count()*clamp(floor(count()/overall_average(count())),0,1)

It also opens up options for using other dynamic references. For instance, we could place all the values greater than 60% of the highest above the surface ( > 0.6*overall_max(count())). We can tune our formula as follows:


count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)
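
The same pattern works for any reference value, which the sketch below demonstrates in Python for both the overall average and 60% of the maximum. The clamp() helper mirrors the behavior quoted from the formula reference. The function names and sample counts are assumptions made for this example.

```python
import math


def clamp(value, minimum, maximum):
    """Limit value to the range [minimum, maximum]."""
    return max(minimum, min(value, maximum))


def split_by_reference(counts, reference):
    """Layer 1 keeps buckets whose count reaches the reference value."""
    layer1 = {
        k: c * clamp(math.floor(c / reference), 0, 1) for k, c in counts.items()
    }
    layer2 = {k: c - layer1[k] for k, c in counts.items()}
    return layer1, layer2


# Hypothetical bucket counts; the average is 60 and the maximum is 100.
counts = {"A": 100, "B": 70, "C": 59, "D": 11}
average = sum(counts.values()) / len(counts)

above_average, _ = split_by_reference(counts, average)
above_60_percent, _ = split_by_reference(counts, 0.6 * max(counts.values()))
print(above_average)
print(above_60_percent)
```

Without clamp(), bucket A would be multiplied by floor(100/60) = 1 here, but a bucket at twice the reference or more would be multiplied by 2 or more, which is the issue that clamp() prevents.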

Conclusion

In the first part, we have seen the main tips allowing us to create a two-color histogram:

Two layers: one for the highest value and one for the remaining values
Visualization type: bar horizontal/vertical stacked
To separate the data, we use a formula that returns 1 only for the highest value and 0 otherwise

Then in the second part, we have seen how we can extend this principle to embrace more high values above the surface. This approach can be summarized as follows:

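Start with layer 1 focusing on the high values: count()*[selector], where [selector] is a formula that returns 1 for the values to highlight and 0 otherwise
Duplicate the layer and adjust the formula: count() - count()*[selector]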

Finally, we provide 4 generic formulas that are ready to use to spice up your dashboards:


1. Only the highest
Layer 1	`count()*floor(count()/overall_max(count()))`
Layer 2	`count() - count()*floor(count()/overall_max(count()))`


2.1. Above the surface : high values (above 50% of the max value)
Layer 1	`count()*round(count()/overall_max(count()))`
Layer 2	`count() - count()*round(count()/overall_max(count()))`


2.2. Above the surface : all values above the overall average
Layer 1	`count()*clamp(floor(count()/overall_average(count())),0,1)`
Layer 2	`count() - count()*clamp(floor(count()/overall_average(count())),0,1)`


2.3. Above the surface : all the values greater than 60% of the highest
Layer 1	`count()*clamp(floor(count()/(0.6*overall_max(count()))),0,1)`
Layer 2	`count() - count()*clamp(floor(count()/(0.6*overall_max(count()))),0,1)`

Try these examples out for yourself by signing up for a free trial of Elastic Cloud or downloading the self-managed version of the Elastic Stack for free. If you have additional questions about getting started, head on over to the Kibana forum or check out the Kibana documentation guide.
In the next blog post, we will see how the new function ifelse() (introduced in version 8.6) will greatly simplify the creation of visualizations with more advanced formulas.

References:

Designing intuitive Kibana dashboards as a non-designer
Kibana: Lens editor - use formula to perform math
Discovering the clamp() function in this discussion (Thanks Marco!)

Serverless log analytics powered by Elasticsearch, in a new low priced tier

Thu, 07 Aug 2025 00:00:00 GMT

We're thrilled to introduce Elastic Observability Logs Essentials (Logs Essentials), a new tier in Elastic Cloud Serverless (SaaS). Built on the same robust stateless architecture as Elastic Observability Complete, it’s designed for Site Reliability Engineers (SREs) and developers seeking powerful, efficient, and economical log analytics, without the overhead of managing the Elastic Stack. As the leader in log management, Elasticsearch powers this new tier with unmatched search and analytics.

Logs Essentials is ideal for teams that want Elastic’s speed and scale without paying for premium features or managing the Elastic Stack. With Elastic Cloud Serverless, there’s no infrastructure to manage, and pricing is simple and predictable, making it easy to get started, stay supported, and focus on solving problems faster.

Unmatched value for log analytics

Logs Essentials empowers SREs and developers with analytics capabilities designed to help them quickly pinpoint the root cause of issues.

Accelerate root cause analysis with fast, precise log search using filters, pattern matching, and event identification in seconds.
Gain deep contextual insights through ES|QL, Elastic’s powerful piped query language that supports structured exploration and joins across indices.
Detect issues proactively by setting alerts for error spikes or unusual log volumes, enabling timely incident response.
Visualize and monitor operational health with rich dashboards built in Kibana, giving teams a clear and actionable view of system behavior.

Once on Logs Essentials, if you need SLOs, AI/ML, AI Assistant, or other advanced features to analyze logs, or if you want to expand to traces and metrics, you should upgrade to Observability Complete.

SaaS making it simple

With Logs Essentials, SREs don’t have to worry about managing the Elastic Stack. Elastic Cloud Serverless automatically scales to meet demand without impacting performance, all while keeping costs low, so there is no operational overhead of managing a deployment and no need to be an Elastic Stack expert. SREs get the following benefits:

No infrastructure to manage or scale: Elastic Cloud Serverless transitions from traditional stateful deployments to a fully stateless, autoscaling architecture, offloading storage to cloud-native object stores and orchestrating compute through Kubernetes. SRE teams can now focus solely on logs and insights, not capacity planning or cluster sizing.

High reliability, resilience, and automation built-in: Elastic Cloud Serverless features multi-region deployments, automated control-plane and data-plane upgrades, automatic configuration updates, canary deployments, and capacity pool management to ensure always-on observability.

These capabilities deliver what SREs need: a hassle-free, scale-as-you-go, high-availability logging solution that empowers SREs to focus entirely on operational insights, not infrastructure.

Affordable log analytics

Logs Essentials offers a cost-effective and predictable path to log analytics. Elastic Cloud Serverless employs advanced autoscaling controllers that adjust compute and storage dynamically, enabling a flexible pricing model that charges based on real usage (ingest and retention). SREs can “sign up and use,” without upfront provisioning or surprise costs.

Instead of paying for idle capacity or managing infrastructure costs, users are billed based on ingest and retention, eliminating the guesswork and overprovisioning common in traditional observability solutions. SREs can simply sign up and start analyzing logs. No infrastructure to manage, no surprise costs, just transparent, cost-effective pricing for what they use.

Logs Essentials in action

Let’s walk through how a Site Reliability Engineer (SRE) would use it in a real-world scenario. Customers are unable to complete transactions on an ecommerce site and the root cause isn’t clear. The issue could be in the front end, the back end, the database, or even the load balancer. Fortunately, logs are being collected from multiple components including NGINX, MySQL, and the application itself. With Elastic Observability Logs Essentials, an SRE can quickly dive into these logs to investigate the issue, starting with high-level symptoms and drilling down across services using powerful search, correlation via ES|QL, and visualization tools like dashboards.

The investigation continues as the SRE walks through several steps using ES|QL, search, and dashboards.

There is an alert indicating a logs spike, which is triggered by a significant number of MySQL errors indicating that a database table “orders” is full. We also use ES|QL to understand how many errors have been seen in the last three hours.

Next, the SRE tries to understand the impact on customers and potential revenue by looking at how many HTTP issues are occurring and which region is seeing them most. With a significant number of responses with status codes >= 400 and the US as the main region seeing the issue, this is revenue impacting.

Next, the SRE looks at whether infrastructure is being impacted by finding the related Kubernetes cluster and pod. With this the SRE can further investigate whether the MySQL pod or the Kubernetes node is having CPU or memory utilization issues.

SREs can also create visualizations and dashboards easily through Observability Logs Essentials’ ES|QL, Discover, alerting, and dashboard capabilities.

Get started with Observability Logs Essentials

By combining the trusted capabilities of Elasticsearch with the flexibility and scalability of Elastic Cloud Serverless, Logs Essentials delivers a streamlined, cost-effective solution that helps teams resolve incidents faster and with greater clarity. Whether you're troubleshooting critical outages, monitoring service health, or building dashboards for proactive insight, Logs Essentials gives you the tools you need (search, ES|QL, alerting, and visualization) in a package that’s simple to adopt and scale.

In order to get started, first register on Elastic Cloud and start a trial.

Convert Logstash pipelines to OpenTelemetry Collector Pipelines

Fri, 25 Oct 2024 00:00:00 GMT

Introduction

Elastic's observability strategy is increasingly aligned with OpenTelemetry. With the recent launch of Elastic Distributions of OpenTelemetry, we’re expanding our offering to make it easier to use OpenTelemetry. The Elastic Agent now offers an "otel" mode, enabling it to run a custom distribution of the OpenTelemetry Collector, seamlessly enhancing your observability onboarding and experience with Elastic.

This post is designed to assist users familiar with Logstash transitioning to OpenTelemetry by demonstrating how to convert some standard Logstash pipelines into corresponding OpenTelemetry Collector configurations.

What is OpenTelemetry Collector and why should I care?

OpenTelemetry is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to re-instrument their observability when switching platforms.

By embracing OpenTelemetry, you have access to these benefits:

Unified Observability: By using the OpenTelemetry Collector, you can collect and manage logs, metrics, and traces from a single tool, providing holistic observability into your system's performance and behavior. This simplifies monitoring and debugging in complex, distributed environments like microservices.
Flexibility and Scalability: Whether you're running a small service or a large distributed system, the OpenTelemetry Collector can be scaled to handle the amount of data generated, offering the flexibility to deploy as an agent (running alongside applications) or as a gateway (a centralized hub).
Open Standards: Since OpenTelemetry is an open-source project under the Cloud Native Computing Foundation (CNCF), it ensures that you're working with widely accepted standards, contributing to the long-term sustainability and compatibility of your observability stack.
Simplified Telemetry Pipelines: The ability to build pipelines using receivers, processors, and exporters simplifies telemetry management by centralizing data flows and minimizing the need for multiple agents.

In the next sections, we will explain how OTel Collector and Logstash pipelines are structured and how the components of each relate to one another.

OTEL Collector Configuration

An OpenTelemetry Collector Configuration has different sections:

Receivers: Collect data from different sources.
Processors: Transform the data collected by receivers.
Exporters: Send data to one or more backends or other collectors.
Connectors: Link two pipelines together.
Service: Defines which components are active.
- Pipelines: Combine the defined receivers, processors, exporters, and connectors to process the data.
- Extensions: Optional components that expand the capabilities of the Collector to accomplish tasks not directly involved with processing telemetry data (e.g., health monitoring).
- Telemetry: Where you configure observability for the Collector itself (e.g., logging and monitoring).

We can visualize it schematically as follows:

For an in-depth introduction to the components, refer to the official documentation: Configuration | OpenTelemetry.

Logstash pipeline definition

A Logstash pipeline is composed of three main components:

Input Plugins: Allow us to read data from different sources.
Filter Plugins: Allow us to transform and filter the data.
Output Plugins: Allow us to send the data to its destination.

Logstash also has a special input and a special output that allow pipeline-to-pipeline communication. We can consider this a similar concept to an OpenTelemetry connector.

Logstash pipeline compared to Otel Collector components

We can schematize how Logstash Pipeline and OTEL Collector pipeline components can relate to each other as follows:

Enough theory! Let us dive into some examples.

Convert a Logstash Pipeline into OpenTelemetry Collector Pipeline

Example 1: Parse and transform log line

Let's consider the below line:

2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404

We will apply the following steps:

Read the line from the file /tmp/demo-line.log.
Define the output to be an Elasticsearch datastream logs-access_log-default.
Extract the @timestamp, user.name, client.ip, client.port, url.path and http.status.code.
Drop log messages related to the SYSTEM user.
Parse the date timestamp with the relevant date format and store it in @timestamp.
Add a description field http.status.code_description based on the descriptions of known status codes.
Send data to Elasticsearch.
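
Before looking at either tool's configuration, it can help to see the steps above as plain code. The following is a minimal Python sketch, not part of either pipeline: the function name process_line is hypothetical, the regular expression is a simplified stand-in for the grok patterns (its IP and path matching is looser than grok's IP and URIPATH), and the result uses flat dotted keys rather than nested ECS objects.

```python
import re
from datetime import datetime, timezone

# Simplified regex stand-in for the grok expression (an assumption, not the
# exact grok semantics).
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}): "
    r"user (?P<user>\w+) accessed from "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+) "
    r"path (?P<path>\S+) with error (?P<code>\d+)$"
)

# Known status codes and their descriptions (step 6).
STATUS_DESCRIPTIONS = {
    "200": "OK",
    "403": "Permission denied",
    "404": "Not Found",
    "500": "Server Error",
}


def process_line(line):
    """Return a flat event dict, or None when the event is dropped."""
    event = {"message": line}
    match = LINE_RE.match(line)
    if match is None:
        # Comparable to a parse failure: keep the raw line, skip other steps.
        return event
    if match["user"] == "SYSTEM":
        return None  # step 4: drop events from the SYSTEM user
    # Step 5: parse the date and treat it as UTC.
    parsed = datetime.strptime(match["date"], "%Y-%m-%dT%H:%M:%S")
    event["@timestamp"] = parsed.replace(tzinfo=timezone.utc).isoformat()
    event["user.name"] = match["user"]
    event["client.ip"] = match["ip"]
    event["client.port"] = int(match["port"])
    event["url.path"] = match["path"]
    event["http.status.code"] = match["code"]
    event["http.status.code_description"] = STATUS_DESCRIPTIONS.get(
        match["code"], "Unknown error"
    )
    return event


sample = (
    "2024-09-20T08:33:27: user frank accessed from "
    "89.66.167.22:10592 path /blog with error 404"
)
result = process_line(sample)
print(result)
```

Each branch in the function corresponds to one numbered step, which makes it easier to check that both configurations below cover the same logic.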

Logstash pipeline

input {
    file {
        path => "/tmp/demo-line.log" #[1]
        start_position => "beginning"
        add_field => { #[2]
            "[data_stream][type]" => "logs"
            "[data_stream][dataset]" => "access_log"
            "[data_stream][namespace]" => "default"
        }
    }
}

filter {
    grok { #[3]
        match => {
            "message" => "%{TIMESTAMP_ISO8601:[date]}: user %{WORD:[user][name]} accessed from %{IP:[client][ip]}:%{NUMBER:[client][port]:int} path %{URIPATH:[url][path]} with error %{NUMBER:[http][status][code]}"
        }
    }
    if "_grokparsefailure" not in [tags] {
        if [user][name] == "SYSTEM" { #[4]
            drop {}
        }
        date { #[5]
            match => ["[date]", "ISO8601"]
            target => "[@timestamp]"
            timezone => "UTC"
            remove_field => [ "date" ]
        }
        translate { #[6]
            source => "[http][status][code]"
            target => "[http][status][code_description]"
            dictionary => {
                "200" => "OK"
                "403" => "Permission denied"
                "404" => "Not Found"
                "500" => "Server Error"
            }
            fallback => "Unknown error"
        }
    }
}

output {
    elasticsearch { #[7]
        hosts => "elasticsearch-endpoint:443"
        api_key => "${ES_API_KEY}"
    }
}

OpenTelemetry Collector configuration

receivers:
  filelog: #[1]
    start_at: beginning
    include:
      - /tmp/demo-line.log
    include_file_name: false
    include_file_path: true
    storage: file_storage 
    operators:
    # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes["data_stream.type"]
      value: "logs"
    - type: add #[2]
      field: attributes["data_stream.dataset"]
      value: "access_log_otel" 
    - type: add #[2]
      field: attributes["data_stream.namespace"]
      value: "default"

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
      resource_attributes:
        os.type:
          enabled: false

  transform/grok: #[3]
    log_statements:
      - context: log
        statements:
        - 'merge_maps(attributes, ExtractGrokPatterns(attributes["event.original"], "%{TIMESTAMP_ISO8601:date}: user %{WORD:user.name} accessed from %{IP:client.ip}:%{NUMBER:client.port:int} path %{URIPATH:url.path} with error %{NUMBER:http.status.code}", true), "insert")'

  filter/exclude_system_user:  #[4]
    error_mode: ignore
    logs:
      log_record:
        - attributes["user.name"] == "SYSTEM"

  transform/parse_date: #[5]
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes["date"], "%Y-%m-%dT%H:%M:%S"))
          - delete_key(attributes, "date")
        conditions:
          - attributes["date"] != nil

  transform/translate_status_code:  #[6]
    log_statements:
      - context: log
        conditions:
        - attributes["http.status.code"] != nil
        statements:
        - set(attributes["http.status.code_description"], "OK")                where attributes["http.status.code"] == "200"
        - set(attributes["http.status.code_description"], "Permission Denied") where attributes["http.status.code"] == "403"
        - set(attributes["http.status.code_description"], "Not Found")         where attributes["http.status.code"] == "404"
        - set(attributes["http.status.code_description"], "Server Error")      where attributes["http.status.code"] == "500"
        - set(attributes["http.status.code_description"], "Unknown Error")     where attributes["http.status.code_description"] == nil

exporters:
  elasticsearch: #[7]
    endpoints: ["elasticsearch-endpoint:443"]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - resourcedetection/system
        - transform/grok
        - filter/exclude_system_user
        - transform/parse_date
        - transform/translate_status_code
      exporters:
        - elasticsearch

Both configurations generate a document like the following in Elasticsearch:

{
    "@timestamp": "2024-09-20T08:33:27.000Z",
    "client": {
        "ip": "89.66.167.22",
        "port": 10592
    },
    "data_stream": {
        "dataset": "access_log",
        "namespace": "default",
        "type": "logs"
    },
    "event": {
        "original": "2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404"
    },
    "host": {
        "hostname": "my-laptop",
        "name": "my-laptop"
    },
    "http": {
        "status": {
            "code": "404",
            "code_description": "Not Found"
        }
    },
    "log": {
        "file": {
            "path": "/tmp/demo-line.log"
        }
    },
    "message": "2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404",
    "url": {
        "path": "/blog"
    },
    "user": {
        "name": "frank"
    }
}

Example 2: Parse and transform a NDJSON-formatted log file

Let's consider the following JSON line:

{"log_level":"INFO","message":"User login successful","service":"auth-service","timestamp":"2024-10-11 12:34:56.123 +0100","user":{"id":"A1230","name":"john_doe"}}

We will apply the following steps:

Read a line from the file /tmp/demo.ndjson.
Define the output to be an Elasticsearch datastream logs-json-default.
Parse the JSON and assign relevant keys and values.
Parse the date.
Override the message field.
Rename fields to follow ECS conventions.
Send data to Elasticsearch.
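
As with the first example, the steps above can be sketched in a few lines of Python before turning to the actual configurations. This is only an illustration under stated assumptions: process_json_line is a hypothetical helper, the result uses flat dotted keys for the renamed fields, and the timestamp keeps its original UTC offset rather than being normalized.

```python
import json
from datetime import datetime


def process_json_line(line):
    """Parse one NDJSON line and reshape it along the steps listed above."""
    event = {"event.original": line}
    if not line.startswith("{"):
        event["message"] = line  # not JSON: keep the raw line as the message
        return event
    doc = json.loads(line)
    # Parse the timestamp, keeping its UTC offset (%f accepts 3-digit millis).
    raw_ts = doc.pop("timestamp", None)
    if raw_ts is not None:
        parsed = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S.%f %z")
        event["@timestamp"] = parsed.isoformat()
    # Override the message field with the value found inside the JSON.
    event["message"] = doc.pop("message", line)
    # Rename fields to their ECS names.
    if "service" in doc:
        event["service.name"] = doc.pop("service")
    if "log_level" in doc:
        event["log.level"] = doc.pop("log_level")
    event.update(doc)  # keep remaining keys, e.g. the nested user object
    return event


sample = (
    '{"log_level":"INFO","message":"User login successful",'
    '"service":"auth-service","timestamp":"2024-10-11 12:34:56.123 +0100",'
    '"user":{"id":"A1230","name":"john_doe"}}'
)
result = process_json_line(sample)
print(json.dumps(result, indent=2))
```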

Logstash pipeline

input {
    file {
        path => "/tmp/demo.ndjson" #[1]
        start_position => "beginning"
        add_field => { #[2]
            "[data_stream][type]" => "logs"
            "[data_stream][dataset]" => "json"
            "[data_stream][namespace]" => "default"
        }
    }
}

filter {
  if [message] =~ /^\{.*/ {
    json { #[3] & #[5]
        source => "message"
    }
  }
  date { #[4]
    match => ["[timestamp]", "yyyy-MM-dd HH:mm:ss.SSS Z"]
    remove_field => "[timestamp]"
  }
  mutate {
    rename => { #[6]
      "service" => "[service][name]"
      "log_level" => "[log][level]"
    }
  }
}


output {
    elasticsearch { # [7]
        hosts => "elasticsearch-endpoint:443"
        api_key => "${ES_API_KEY}"
    }
}

OpenTelemetry Collector configuration

receivers:
  filelog/json: # [1]
    include: 
      - /tmp/demo.ndjson
    retry_on_failure:
      enabled: true
    start_at: beginning
    storage: file_storage 
    operators:
     # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes["data_stream.type"]
      value: "logs"      
    - type: add #[2]
      field: attributes["data_stream.dataset"]
      value: "otel"
    - type: add #[2]
      field: attributes["data_stream.namespace"]
      value: "default"     


extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
      resource_attributes:
        os.type:
          enabled: false

  transform/json_parse:  #[3]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), "upsert")
        conditions: 
          - IsMatch(body, "^\\{")
      

  transform/parse_date:  #[4]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes["timestamp"], "%Y-%m-%d %H:%M:%S.%L %z"))
          - delete_key(attributes, "timestamp")
        conditions: 
          - attributes["timestamp"] != nil

  transform/override_message_field: # [5]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(body, attributes["message"])
          - delete_key(attributes, "message")

  transform/set_log_severity: # [6]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes["log_level"])          

  attributes/rename_attributes: #[6]
    actions:
      - key: service.name
        from_attribute: service
        action: insert
      - key: service
        action: delete
      - key: log_level
        action: delete

exporters:
  elasticsearch: #[7]
    endpoints: ["elasticsearch-endpoint:443"]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs/json:
      receivers: 
        - filelog/json
      processors:
        - resourcedetection/system    
        - transform/json_parse
        - transform/parse_date        
        - transform/override_message_field
        - transform/set_log_severity
        - attributes/rename_attributes
      exporters: 
        - elasticsearch

These will generate the following document in Elasticsearch

{
    "@timestamp": "2024-10-11T12:34:56.123000000Z",
    "data_stream": {
        "dataset": "otel",
        "namespace": "default",
        "type": "logs"
    },
    "event": {
        "original": "{\"log_level\":\"INFO\",\"message\":\"User login successful\",\"service\":\"auth-service\",\"timestamp\":\"2024-10-11 12:34:56.123 +0100\",\"user\":{\"id\":\"A1230\",\"name\":\"john_doe\"}}"
    },
    "host": {
        "hostname": "my-laptop",
        "name": "my-laptop"
    },
    "log": {
        "file": {
            "name": "demo.ndjson"
        },
        "level": "INFO"
    },
    "message": "User login successful",
    "service": {
        "name": "auth-service"
    },
    "user": {
        "id": "A1230",
        "name": "john_doe"
    }
}

Conclusion

In this post, we showed examples of how to convert a typical Logstash pipeline into an OpenTelemetry Collector pipeline for logs. While OpenTelemetry provides powerful tools for collecting and exporting logs, if your pipeline relies on complex transformations or scripting, Logstash remains a superior choice. This is because Logstash offers a broader range of built-in features and a more flexible approach to handling advanced data manipulation tasks.

What's Next?

Now that you've seen basic (but realistic) examples of converting a Logstash pipeline to OpenTelemetry, it's your turn to dive deeper. Depending on your needs, you can explore further and find more detailed resources in the following repositories:

OpenTelemetry Collector: Learn about the core OpenTelemetry components, from receivers to exporters.
OpenTelemetry Collector Contrib: Find community-contributed components for a wider range of integrations and features.
Elastic's opentelemetry-collector-components: Dive into Elastic's extensions for the OpenTelemetry Collector, offering more tailored features for Elastic Stack users.

If you encounter specific challenges or need to handle more advanced use cases, these repositories will be an excellent resource for discovering additional components or integrations that can enhance your pipeline. All these repositories have a similar structure with folders named receiver, processor, exporter, connector, which should be familiar after reading this blog. Whether you are migrating a simple Logstash pipeline or tackling more complex data transformations, these tools and communities will provide the support you need for a successful OpenTelemetry implementation.

Migrating 1 billion log lines from OpenSearch to Elasticsearch

Wed, 11 Oct 2023 00:00:00 GMT

What are the current options to migrate from OpenSearch to Elasticsearch®?

OpenSearch is a fork of Elasticsearch 7.10 that has since diverged considerably, resulting in a different set of features and also different performance, as this benchmark shows (hint: it’s currently much slower than Elasticsearch).

Given the differences between the two solutions, restoring a snapshot from OpenSearch is not possible, nor is reindex-from-remote, so our only option is to use an intermediary that reads from OpenSearch and writes to Elasticsearch.

This blog will show you how easy it is to migrate from OpenSearch to Elasticsearch for better performance and less disk usage!

1 billion log lines

We are going to use part of the data set we used for the benchmark, which takes about half a terabyte on disk, including replicas, and spans over a week (January 1–7, 2023).

We have in total 1,009,165,775 documents that take 453.5GB of space in OpenSearch, including the replicas. That’s about 241.2 bytes per document on the primary shards. This is going to be important later when we enable a couple of optimizations in Elasticsearch that will bring this total size way down without sacrificing performance!
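
As a quick sanity check of the per-document figure, here is a small Python calculation. It assumes that the reported sizes use binary units (1GB = 1024³ bytes) and that exactly half of the total store size belongs to primary shards, since every index has one replica.

```python
DOC_COUNT = 1_009_165_775
TOTAL_GB_WITH_REPLICAS = 453.5  # primaries plus one replica of each shard
BYTES_PER_GB = 1024 ** 3        # assumption: binary units

# Half of the store size is primary data because there is one replica.
primary_bytes = TOTAL_GB_WITH_REPLICAS / 2 * BYTES_PER_GB
bytes_per_doc = primary_bytes / DOC_COUNT

print(f"{bytes_per_doc:.1f} bytes per document on primary shards")
```

The result lands at a little over 241 bytes per document, so the per-document size is measured in bytes, not kilobytes.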

This billion log line data set is spread over nine indices that are part of a datastream we are calling logs-myapplication-prod. We have primary shards of about 25GB in size, according to the best practices for optimal shard sizing. A GET _cat/indices shows us the indices we are dealing with:

index                              docs.count pri rep pri.store.size store.size
.ds-logs-myapplication-prod-000049  102519334   1   1         22.1gb     44.2gb
.ds-logs-myapplication-prod-000048  114273539   1   1         26.1gb     52.3gb
.ds-logs-myapplication-prod-000044  111093596   1   1         25.4gb     50.8gb
.ds-logs-myapplication-prod-000043  113821016   1   1         25.7gb     51.5gb
.ds-logs-myapplication-prod-000042  113859174   1   1         24.8gb     49.7gb
.ds-logs-myapplication-prod-000041  112400019   1   1         25.7gb     51.4gb
.ds-logs-myapplication-prod-000040  113362823   1   1         25.9gb     51.9gb
.ds-logs-myapplication-prod-000038  110994116   1   1         25.3gb     50.7gb
.ds-logs-myapplication-prod-000037  116842158   1   1         25.4gb     50.8gb

Both OpenSearch and Elasticsearch clusters have the same configuration: 3 nodes with 64GB RAM and 12 CPU cores. Just like in the benchmark, the clusters are running in Kubernetes.

Moving data from A to B

Typically, moving data from one Elasticsearch cluster to another is as easy as a snapshot and restore if the clusters run compatible versions, or a reindex from remote if you need real-time synchronization and minimized downtime. These methods do not apply when migrating data from OpenSearch to Elasticsearch because the projects have significantly diverged since the 7.10 fork. However, there is one method that will work: scrolling.

Scrolling

Scrolling involves using an external tool, such as Logstash®, to read data from the source cluster and write it to the destination cluster. This method provides a high degree of customization, allowing us to transform the data during the migration process if needed. Here are a few advantages of using Logstash:

Easy parallelization: It’s really easy to write concurrent jobs that can read from different “slices” of the indices, essentially maximizing our throughput.
Queuing: Logstash automatically queues documents before sending.
Automatic retries: In the event of a failure or an error during data transmission, Logstash will automatically attempt to resend the data; moreover, it will stop querying the source cluster as often, until the connection is re-established, all without manual intervention.

Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left, similar to how a “cursor” works in relational databases.

A scrolled search takes a snapshot in time by freezing the segments that make the index up until the time the request is made, preventing those segments from merging. As a result, the scroll doesn’t see any changes that are made to the index after the initial search request has been made.

Migration strategies

Reading from A and writing to B can be slow without optimization because it involves paginating through the results and transferring each batch over the network to Logstash, which assembles the documents into another batch and then transfers those batches over the network again to Elasticsearch, where the documents are indexed. So when it comes to such large data sets, we must be very efficient and extract every bit of performance where we can.

Let’s start with the facts: what do we know about the data we need to transfer? We have nine indices in the datastream, each with about 100 million documents. Let’s test with just one of the indices and measure the indexing rate to see how long it takes to migrate. The indexing rate can be seen by activating the monitoring functionality in Elastic® and then navigating to the index you want to inspect.

Scrolling in the deep
The simplest approach for transferring the log lines over would be to make Elasticsearch scroll over the entire data set and check it later when it finishes. Here we will introduce our first two variables: PAGE_SIZE and BATCH_SIZE. The former is how many records we are going to bring from the source every time we query it, and the latter is how many documents are going to be assembled together by Logstash and written to the destination index.

With such a large data set, the scroll slows down as the pagination gets deeper. The indexing rate starts at 6,000 docs/second and steadily drops to 700 docs/second. Without any optimization, it would take us 19 days (!) to migrate the 1 billion documents. We can do better than that!
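
To put these rates in perspective, the following Python sketch converts an indexing rate into a migration time for the whole data set. It assumes a constant rate for each estimate, which is a simplification since the observed rate keeps changing during the scroll; migration_days is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400
TOTAL_DOCS = 1_009_165_775


def migration_days(total_docs, docs_per_second):
    """Days needed to move total_docs at a constant indexing rate."""
    return total_docs / docs_per_second / SECONDS_PER_DAY


for rate in (6_000, 700):
    print(f"{rate:>5} docs/s -> {migration_days(TOTAL_DOCS, rate):.1f} days")

# Average rate implied by a 19-day migration of the full data set.
implied_rate = TOTAL_DOCS / (19 * SECONDS_PER_DAY)
print(f"19 days implies about {implied_rate:.0f} docs/s on average")
```

A 19-day estimate corresponds to an average just above 600 docs/second, which is consistent with a rate that continues to degrade as the scroll goes deeper.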

Slice me nice
We can optimize scrolling by using an approach called Sliced scroll, where we split the index in different slices to consume them independently.

Here we will introduce our last two variables: SLICES and WORKERS. The number of slices cannot be too small as the performance decreases drastically over time, and it can’t be too big as the overhead of maintaining the scrolls would counter the benefits of a smaller search.
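
To make the idea of slices more concrete, here is a minimal Python sketch that builds the initial search body for each slice of a sliced scroll. It only constructs and prints the request bodies and sends nothing over the network. The helper name sliced_scroll_bodies and the match_all query are illustrative assumptions; in this migration, Logstash issues the equivalent requests for us.

```python
import json


def sliced_scroll_bodies(slices, page_size):
    """Build one initial search body per slice for a sliced scroll."""
    if slices < 2:
        raise ValueError("a sliced scroll needs at least 2 slices")
    return [
        {
            # Each slice is identified by its id out of `max` slices.
            "slice": {"id": slice_id, "max": slices},
            "size": page_size,
            "query": {"match_all": {}},
        }
        for slice_id in range(slices)
    ]


bodies = sliced_scroll_bodies(slices=3, page_size=500)
for body in bodies:
    # Each body would be sent as: POST /<index>/_search?scroll=5m
    print(json.dumps(body))
```

Every slice gets its own scroll context and can be consumed independently, which is what allows several readers to work on the same index in parallel.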

Let’s start by migrating a single index (out of the nine we have) with different parameters to see what combination gives us the highest throughput.


SLICES	PAGE_SIZE	WORKERS	BATCH_SIZE	Average Indexing Rate
3	500	3	500	13,319 docs/sec
3	1,000	3	1,000	13,048 docs/sec
4	250	4	250	10,199 docs/sec
4	500	4	500	12,692 docs/sec
4	1,000	4	1,000	10,900 docs/sec
5	500	5	500	12,647 docs/sec
5	1,000	5	1,000	10,334 docs/sec
5	2,000	5	2,000	10,405 docs/sec
10	250	10	250	14,083 docs/sec
10	250	4	1,000	12,014 docs/sec
10	500	4	1,000	10,956 docs/sec

It looks like we have a good set of candidates for maximizing the throughput for a single index, between 12K and 14K documents per second. That doesn't mean we have reached our ceiling. Even though search operations are single-threaded and every slice will trigger sequential search operations to read data, that does not prevent us from reading several indices in parallel.
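
The measurements from the table can also be ranked with a few lines of Python. The sketch below picks the fastest combination and estimates how long the full data set would take if the indices were migrated one after another at that rate. This assumes the rate is sustained for the whole run, which is optimistic.

```python
# (SLICES, PAGE_SIZE, WORKERS, BATCH_SIZE, average docs/sec) from the table
RESULTS = [
    (3, 500, 3, 500, 13_319),
    (3, 1_000, 3, 1_000, 13_048),
    (4, 250, 4, 250, 10_199),
    (4, 500, 4, 500, 12_692),
    (4, 1_000, 4, 1_000, 10_900),
    (5, 500, 5, 500, 12_647),
    (5, 1_000, 5, 1_000, 10_334),
    (5, 2_000, 5, 2_000, 10_405),
    (10, 250, 10, 250, 14_083),
    (10, 250, 4, 1_000, 12_014),
    (10, 500, 4, 1_000, 10_956),
]
TOTAL_DOCS = 1_009_165_775

# Pick the row with the highest average indexing rate.
best = max(RESULTS, key=lambda row: row[4])
hours_sequential = TOTAL_DOCS / best[4] / 3600

print(f"best combination: {best[:4]} at {best[4]} docs/sec")
print(f"sequential migration estimate: {hours_sequential:.1f} hours")
```

Even without reading indices in parallel, the best single-index rate brings the estimate down from weeks to under a day.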

By default, the maximum number of open scrolls is 500 — this limit can be updated with the search.max_open_scroll_context cluster setting, but the default value is enough for this particular migration.

Let’s migrate

Preparing our destination indices

We are going to create a datastream called logs-myapplication-reindex to write the data to, but before indexing any data, let’s ensure our index template and index lifecycle management configurations are properly set up. An index template acts as a blueprint for creating new indices, allowing you to define various settings that should be applied consistently across your indices.

Index lifecycle management policy
Index lifecycle management (ILM) is equally vital, as it automates the management of indices throughout their lifecycle. With ILM, you can define policies that determine how long data should be retained, when it should be rolled over into new indices, and when old indices should be deleted or archived. Our policy is really straightforward:

PUT _ilm/policy/logs-myapplication-lifecycle-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "25gb"
          }
        }
      },
      "warm": {
        "min_age": "0d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}

Index template (and saving 23% in disk space)
Since we are here, we’re going to go ahead and enable Synthetic Source, a clever feature that lets us discard the original JSON document instead of storing it, while still reconstructing it when needed from the stored fields.

For our example, enabling Synthetic Source resulted in a remarkable 23.4% improvement in storage efficiency, reducing the size required to store a single document from 241.2 bytes in OpenSearch to just 185 bytes in Elasticsearch.

Our full index template is therefore:

PUT _index_template/logs-myapplication-reindex
{
  "index_patterns": [
    "logs-myapplication-reindex"
  ],
  "priority": 500,
  "data_stream": {},
  "template": {
    "settings": {
      "index": {
        "lifecycle.name": "logs-myapplication-lifecycle-policy",
        "codec": "best_compression",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "query": {
          "default_field": [
            "message"
          ]
        }
      }
    },
    "mappings": {
      "_source": {
        "mode": "synthetic"
      },
      "_data_stream_timestamp": {
        "enabled": true
      },
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent": {
          "properties": {
            "ephemeral_id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "name": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "aws": {
          "properties": {
            "cloudwatch": {
              "properties": {
                "ingestion_time": {
                  "type": "keyword",
                  "ignore_above": 1024
                },
                "log_group": {
                  "type": "keyword",
                  "ignore_above": 1024
                },
                "log_stream": {
                  "type": "keyword",
                  "ignore_above": 1024
                }
              }
            }
          }
        },
        "cloud": {
          "properties": {
            "region": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "data_stream": {
          "properties": {
            "dataset": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "namespace": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "event": {
          "properties": {
            "dataset": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "ingested": {
              "type": "date"
            }
          }
        },
        "host": {
          "type": "object"
        },
        "input": {
          "properties": {
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "log": {
          "properties": {
            "file": {
              "properties": {
                "path": {
                  "type": "keyword",
                  "ignore_above": 1024
                }
              }
            }
          }
        },
        "message": {
          "type": "match_only_text"
        },
        "meta": {
          "properties": {
            "file": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "metrics": {
          "properties": {
            "size": {
              "type": "long"
            },
            "tmin": {
              "type": "long"
            }
          }
        },
        "process": {
          "properties": {
            "name": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "tags": {
          "type": "keyword",
          "ignore_above": 1024
        }
      }
    }
  }
}

Building a custom Logstash image

We are going to use a containerized Logstash for this migration because both clusters are sitting on a Kubernetes infrastructure, so it's easier to just spin up a Pod that will communicate to both clusters.

Since OpenSearch is not an official Logstash input, we must build a custom Logstash image that contains the logstash-input-opensearch plugin. Let’s use the base image from docker.elastic.co/logstash/logstash:9.3.0 and just install the plugin:

FROM docker.elastic.co/logstash/logstash:9.3.0

USER logstash
WORKDIR /usr/share/logstash
RUN bin/logstash-plugin install logstash-input-opensearch

Writing a Logstash pipeline

Now we have our Logstash Docker image, and we need to write a pipeline that will read from OpenSearch and write to Elasticsearch.

The input

input {
    opensearch {
        hosts => ["os-cluster:9200"]
        ssl => true
        ca_file => "/etc/logstash/certificates/opensearch-ca.crt"
        user => "${OPENSEARCH_USERNAME}"
        password => "${OPENSEARCH_PASSWORD}"
        index => "${SOURCE_INDEX_NAME}"
        slices => "${SOURCE_SLICES}"
        size => "${SOURCE_PAGE_SIZE}"
        scroll => "5m"
        docinfo => true
        docinfo_target => "[@metadata][doc]"
    }
}

Let’s break down the most important input parameters. Most of the values are supplied through environment variables here:

hosts: Specifies the host and port of the OpenSearch cluster. In this case, it’s connecting to “os-cluster” on port 9200.
index: Specifies the index in the OpenSearch cluster from which to retrieve logs. In this case, it’s “logs-myapplication-prod” which is a datastream that contains the actual indices (e.g., .ds-logs-myapplication-prod-000049).
size: Specifies the maximum number of logs to retrieve in each request.
scroll: Defines how long a search context will be kept open on the OpenSearch server. In this case, it’s set to “5m,” which means each request must be answered and a new “page” asked within five minutes.
docinfo and docinfo_target: These settings control whether document metadata should be included in the Logstash output and where it should be stored. In this case, document metadata is being stored in the [@metadata][doc] field — this is important because the document’s _id will be used as the destination id as well.

The ssl and ca_file settings are highly recommended if you are migrating between clusters that sit in different infrastructures (for example, separate cloud providers). You don’t need to specify a ca_file if your TLS certificates are signed by a public authority, which is likely the case if you are using a SaaS and your endpoint is reachable over the internet. In that case, ssl => true alone would suffice. In our case, all our TLS certificates are self-signed, so we must also provide the Certificate Authority (CA) certificate.

The (optional) filter
We could use this to drop or alter the documents to be written to Elasticsearch if we wanted, but we are not going to, as we want to migrate the documents as is. We are only removing extra metadata fields that Logstash includes in all documents, such as "@version" and "host". We are also removing the original "data_stream" as it contains the source data stream name, which might not be the same in the destination.

filter {
    mutate {
        remove_field => ["@version", "host", "data_stream"]
    }
}

The output
The output is really simple: we name our data stream logs-myapplication-reindex and reuse the document id of the original documents in document_id, to ensure there are no duplicate documents. In Elasticsearch, data stream names follow the convention type-dataset-namespace, so our logs-myapplication-reindex data stream has “logs” as type, “myapplication” as dataset, and “reindex” as namespace.

elasticsearch {
    hosts => "${ELASTICSEARCH_HOST}"

    user => "${ELASTICSEARCH_USERNAME}"
    password => "${ELASTICSEARCH_PASSWORD}"

    document_id => "%{[@metadata][doc][_id]}"

    data_stream => "true"
    data_stream_type => "logs"
    data_stream_dataset => "myapplication"
    data_stream_namespace => "reindex"
}
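
To make the naming convention concrete, here is a minimal Python sketch of how the three data_stream_* settings combine into the destination name. The helper data_stream_name is hypothetical and only illustrates the convention; Logstash performs this composition itself.

```python
def data_stream_name(ds_type: str, dataset: str, namespace: str) -> str:
    """Compose a data stream name as type-dataset-namespace."""
    # Elastic's naming scheme does not allow "-" inside dataset or namespace,
    # otherwise the three parts could not be told apart in the final name.
    for part in (dataset, namespace):
        if "-" in part:
            raise ValueError(f"'-' is not allowed in {part!r}")
    return f"{ds_type}-{dataset}-{namespace}"


name = data_stream_name("logs", "myapplication", "reindex")
print(name)  # logs-myapplication-reindex
```

Changing only the namespace is a convenient way to keep migrated data apart from data that is ingested natively into the same dataset.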

Deploying Logstash

We have a few options to deploy Logstash: it can be deployed locally from the command line, as a systemd service, via docker, or on Kubernetes.

Since both of our clusters are deployed in a Kubernetes environment, we are going to deploy Logstash as a Pod referencing our Docker image created earlier. Let’s put our pipeline inside a ConfigMap along with some configuration files (pipelines.yml and logstash.yml).

In the below configuration, we have SOURCE_INDEX_NAME, SOURCE_SLICES, SOURCE_PAGE_SIZE, LOGSTASH_WORKERS, and LOGSTASH_BATCH_SIZE conveniently exposed as environment variables so you just need to fill them out.

apiVersion: v1
kind: Pod
metadata:
  name: logstash-1
spec:
  containers:
    - name: logstash
      image: ugosan/logstash-opensearch-input:8.10.0
      imagePullPolicy: Always
      env:
        - name: SOURCE_INDEX_NAME
          value: ".ds-logs-benchmark-dev-000037"
        - name: SOURCE_SLICES
          value: "10"
        - name: SOURCE_PAGE_SIZE
          value: "500"
        - name: LOGSTASH_WORKERS
          value: "4"
        - name: LOGSTASH_BATCH_SIZE
          value: "1000"
        - name: OPENSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: username
        - name: OPENSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: password
        - name: ELASTICSEARCH_USERNAME
          value: "elastic"
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: es-cluster-es-elastic-user
              key: elastic
      resources:
        limits:
          memory: "4Gi"
          cpu: "2500m"
        requests:
          memory: "1Gi"
          cpu: "300m"
      volumeMounts:
        - name: config-volume
          mountPath: /usr/share/logstash/config
        - name: etc
          mountPath: /etc/logstash
          readOnly: true
  volumes:
    - name: config-volume
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipelines.yml
                  path: pipelines.yml
                - key: logstash.yml
                  path: logstash.yml
    - name: etc
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipeline.conf
                  path: pipelines/pipeline.conf
          - secret:
              name: os-cluster-http-cert
              items:
                - key: ca.crt
                  path: certificates/opensearch-ca.crt
          - secret:
              name: es-cluster-es-http-ca-internal
              items:
                - key: tls.crt
                  path: certificates/elasticsearch-ca.crt
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
data:
  pipelines.yml: |
    - pipeline.id: reindex-os-es
      path.config: "/etc/logstash/pipelines/pipeline.conf"
      pipeline.batch.size: ${LOGSTASH_BATCH_SIZE}
      pipeline.workers: ${LOGSTASH_WORKERS}
  logstash.yml: |
    log.level: info
    pipeline.unsafe_shutdown: true
    pipeline.ordered: false
  pipeline.conf: |
    input {
        opensearch {
          hosts => ["os-cluster:9200"]
          ssl => true
          ca_file => "/etc/logstash/certificates/opensearch-ca.crt"
          user => "${OPENSEARCH_USERNAME}"
          password => "${OPENSEARCH_PASSWORD}"
          index => "${SOURCE_INDEX_NAME}"
          slices => "${SOURCE_SLICES}"
          size => "${SOURCE_PAGE_SIZE}"
          scroll => "5m"
          docinfo => true
          docinfo_target => "[@metadata][doc]"
        }
    }

    filter {
        mutate {
            remove_field => ["@version", "host", "data_stream"]
        }
    }

    output {
        elasticsearch {
            hosts => "https://es-cluster-es-http:9200"
            ssl => true
            ssl_certificate_authorities => ["/etc/logstash/certificates/elasticsearch-ca.crt"]
            ssl_verification_mode => "full"

            user => "${ELASTICSEARCH_USERNAME}"
            password => "${ELASTICSEARCH_PASSWORD}"

            document_id => "%{[@metadata][doc][_id]}"

            data_stream => "true"
            data_stream_type => "logs"
            data_stream_dataset => "myapplication"
            data_stream_namespace => "reindex"
        }
    }

That’s it.

After a couple of hours, we successfully migrated 1 billion documents from OpenSearch to Elasticsearch and even saved roughly 20% to 23% on disk storage (see the table below)! Now that we have the logs in Elasticsearch, how about extracting actual business value from them? Logs contain so much valuable information: we can not only do all sorts of interesting things with AIOps, like automatically categorizing those logs, but also extract business metrics and detect anomalies on them. Give it a try.


OpenSearch index	docs	size	Elasticsearch index	docs	size	Diff.
.ds-logs-myapplication-prod-000037	116842158	27285520870	logs-myapplication-reindex-000037	116842158	21998435329	21.46%
.ds-logs-myapplication-prod-000038	110994116	27263291740	logs-myapplication-reindex-000038	110994116	21540011082	23.45%
.ds-logs-myapplication-prod-000040	113362823	27872438186	logs-myapplication-reindex-000040	113362823	22234641932	22.50%
.ds-logs-myapplication-prod-000041	112400019	27618801653	logs-myapplication-reindex-000041	112400019	22059453868	22.38%
.ds-logs-myapplication-prod-000042	113859174	26686723701	logs-myapplication-reindex-000042	113859174	21093766108	23.41%
.ds-logs-myapplication-prod-000043	113821016	27657006598	logs-myapplication-reindex-000043	113821016	22059454752	22.52%
.ds-logs-myapplication-prod-000044	111093596	27281936915	logs-myapplication-reindex-000044	111093596	21559513422	23.43%
.ds-logs-myapplication-prod-000048	114273539	28111420495	logs-myapplication-reindex-000048	114273539	22264398939	23.21%
.ds-logs-myapplication-prod-000049	102519334	23731274338	logs-myapplication-reindex-000049	102519334	19307250001	20.56%
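
The source does not say how the Diff. column is calculated. The numbers are consistent with a difference taken relative to the mean of the two sizes, which is an inference on our part. The sketch below reproduces the first row under that assumption and also shows the reduction relative to the OpenSearch size; both function names are hypothetical.

```python
def symmetric_diff_pct(source_size: int, dest_size: int) -> float:
    """Difference relative to the mean of the two sizes, in percent."""
    mean = (source_size + dest_size) / 2
    return (source_size - dest_size) / mean * 100


def saving_pct(source_size: int, dest_size: int) -> float:
    """Reduction relative to the source size, in percent."""
    return (source_size - dest_size) / source_size * 100


# First row of the table (index 000037).
opensearch_size = 27285520870
elasticsearch_size = 21998435329

print(round(symmetric_diff_pct(opensearch_size, elasticsearch_size), 2))  # 21.46
print(round(saving_pct(opensearch_size, elasticsearch_size), 2))  # 19.38
```

Measured against the original OpenSearch index size, the first row corresponds to a reduction of about 19.4%, so it is worth stating which baseline a percentage refers to when comparing storage footprints.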


AIOps with Elastic Observability: Modern AIOps & Log Intelligence

Wed, 26 Nov 2025 00:00:00 GMT

AIOps Blog Refresher: Unlocking Intelligence from Your Logs with Elastic

Elastic has been leading the charge with AIOps, especially in the recent 9.2 update of Elastic Observability with Streams. The conversation around AIOps has shifted dramatically as we move through the year. DevOps and SRE teams aren't asking whether they need AIOps; they're asking how to leverage it more effectively to stay ahead of exponentially growing complexity.

The current challenge of AIOps is that modern cloud-native environments generate massive volumes of telemetry data, orders of magnitude more than past environments did. But here's what many teams overlook: logs are the richest source of operational intelligence you have. Logs can tell you exactly what happened and why, while metrics only tell you something is wrong, and traces only tell you where. The problem is that most organizations are drowning in logs. Microservices (such as user authentication or inventory services), serverless functions, and Kubernetes generate millions of log entries daily. Without AI and machine learning, finding meaningful patterns in this data takes too much time and energy.

Log Intelligence Improvement: What's New in 2025

Historically in observability, unlocking your log intelligence required lengthy manual effort: not only parsing through logs, but also structuring them. Elastic Observability has drastically changed how teams extract value from logs. Observability is not just simple signal analysis; modern tools need to support proactive, log-driven investigations. At Elastic, the answer is Streams.

Streams, a new release from Elastic, is a collection of AI-driven tools that identify significant events in parsed raw logs by enriching logs with meaningful fields. With Streams, SREs can maximize the value of their data, their logs, and their systems. With system reliability as the goal, Streams helps to reduce pipeline management overhead and accelerates observability analysis. And it takes nearly no time to set up!

Here is how Streams powers the Elastic Observability capabilities available now.

Advanced Log Rate Analysis

Log rate analysis can go far beyond only detecting spikes. Elastic's machine learning automatically identifies when log volumes deviate from expected baselines, then contextualizes these changes within your broader system performance. When your application suddenly generates more error logs, Elastic’s AIOps doesn't just alert you, it also determines whether it's a critical issue requiring immediate attention or just a temporary anomaly.

This matters to your analysis because not all log spikes are equal. A 10x increase in DEBUG logs might indicate verbose logging accidentally enabled in production. A 2x increase in ERROR logs could signal a cascading failure. Log rate analysis distinguishes between these scenarios automatically, giving your team the context needed to respond appropriately.
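
As a rough illustration of why the log level matters when judging a spike, the sketch below compares per-level counts in the current window with a baseline and applies a different threshold to each level. It is a simplified stand-in, not Elastic's log rate analysis; the thresholds, counts, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical thresholds: the rate increase at which each level deserves attention.
SPIKE_THRESHOLDS = {"DEBUG": 20.0, "INFO": 5.0, "WARN": 3.0, "ERROR": 1.5}


def rate_changes(baseline: Counter, current: Counter) -> dict:
    """Return the current/baseline ratio per log level (baseline floored at 1)."""
    return {
        level: current[level] / max(baseline[level], 1)
        for level in current
    }


def flag_spikes(baseline: Counter, current: Counter) -> list:
    """List the levels whose rate increase meets or exceeds their threshold."""
    changes = rate_changes(baseline, current)
    return sorted(
        level
        for level, ratio in changes.items()
        if ratio >= SPIKE_THRESHOLDS.get(level, 2.0)
    )


baseline = Counter({"DEBUG": 1000, "INFO": 5000, "ERROR": 40})
current = Counter({"DEBUG": 10000, "INFO": 5200, "ERROR": 80})

print(flag_spikes(baseline, current))  # ['ERROR']
```

With these made-up numbers, the 10x jump in DEBUG volume stays below its threshold while the 2x jump in ERROR volume is flagged, mirroring the distinction drawn above.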

Intelligent Log Categorization with Streams

This is where AIOps shines with log data. Streams uses machine learning algorithms in order to automatically classify and group similar log patterns, dramatically reducing noise. Instead of manually parsing millions of entries, the system identifies common structures, groups related events, and surfaces the categories that matter most.

Logs are unstructured by nature, making them difficult to analyze at scale. Streams corrals chaotic log streams into organized, queryable patterns. Instantly, you can see that 80% of your errors fall into three categories, helping you prioritize where to focus remediation efforts. This approach helps you reduce noise and accelerate analysis, allowing teams to act on insights faster.
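
To give a feel for what grouping similar log patterns means, here is a deliberately simple Python sketch that masks variable-looking tokens and counts the resulting templates. It is a plain regular-expression stand-in, not the machine learning that Streams uses, and the sample log lines are invented.

```python
import re
from collections import Counter

# Replace variable-looking tokens so that lines with the same shape collapse
# into one template. Real categorization models are far more sophisticated.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "{hex}"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "{num}"),
]


def template(line: str) -> str:
    """Return the line with long hex strings and numbers masked out."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def categorize(lines: list) -> Counter:
    """Count how many lines fall into each template."""
    return Counter(template(line) for line in lines)


logs = [
    "connection timeout after 3000 ms to db-1",
    "connection timeout after 4500 ms to db-2",
    "user 1842 logged in",
    "connection timeout after 3100 ms to db-1",
]

for pattern, count in categorize(logs).most_common():
    print(count, pattern)
```

Even this crude masking collapses the three timeout lines into a single category, which is the kind of reduction that makes a small number of dominant error patterns visible.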

Multi-Dimensional Anomaly Detection

Anomaly detection now simultaneously examines relationships between logs, metrics, and traces. A slight increase in response time might not trigger an alert by itself, but when correlated with unusual log patterns and memory consumption changes, the system recognizes it as an early warning sign.

Logs contain a wealth of contextual information that metrics and traces can't capture: stack traces, user IDs, transaction details, error messages, and more. By correlating log anomalies with other signals, you get the full picture of what's happening in your system. This holistic view enables teams to catch issues earlier and to understand their full impact across the stack.

Enhanced Root Cause Analysis Powered by Significant Events

When an issue occurs, Elastic's Streams accelerates root cause analysis by using AI-assisted log parsing to surface “Significant events.” Significant event queries can be defined by AI or manually, depending on whether you know which logs you are looking for. Then, Elastic’s AIOps traces the problem through your entire stack using these events, as well as enriched log data combined with distributed tracing. The system can correlate failed transactions with specific log entries, deployment events, and infrastructure changes. This helps you understand not just what broke, but why and when.

Streams makes the analysis of your logs quick and automatic by going across your entire distributed system within seconds, grabbing relevant log entries such as stack traces, state information, error messages, and more. What used to require hours of manual investigation and deduction now happens automatically, freeing you and your team from tedious detective work and enabling faster resolution.

Logs in Action: Real-World Impact

Let's look at how these capabilities work together in practice. Imagine your payment processing service is experiencing intermittent failures - only 0.5% of transactions, but enough to concern your team. Traditional monitoring shows everything is mostly okay, but customers are still complaining.

Without Streams, an SRE might initially run some broad queries, manually sift through thousands of logs, struggle to connect all the dots, and ultimately not understand the correlation between the errors and recent system changes.

With Elastic Streams and AIOps, many of these potential problems are instantly mitigated:

Streams automatically parses the payment service logs, adding connection timeouts to a new category of significant events
Log rate analysis with Streams reveals that this significant event category has been slowly growing over the past month, from a small number of timeouts to a much larger one
Elastic’s built-in anomaly detection correlates these significant events with deployment data and identifies that they started appearing after a recent load balancer configuration change
Root cause analysis pinpoints the exact database connection pool setting that is too restrictive for peak load by tracing affected transactions through previously enriched logs

What usually takes 4-8 hours of manual log analysis is resolved in minutes, with Elastic automatically highlighting the relevant log entries that tell the complete story. This is the power of AIOps and Streams as applied to log intelligence.

The Power of Unified Log Intelligence

What sets Elastic apart is treating logs as a priority in your observability strategy. Elastic provides comprehensive log ingestion that centralizes petabytes of logs from across your infrastructure with flexible parsing and enrichment. The platform uses purpose-built machine learning models that understand log patterns, not generic algorithms retrofitted for log analysis.

Logs don't exist in isolation, which is why Elastic correlates log data with metrics, traces, and business events to provide complete context. And because log volumes can be massive, Elastic's tiered storage approach means you can retain years of logs for compliance and historical analysis without breaking the budget.

Why Logs Matter More Than Ever

Logs have become the cornerstone of effective AIOps for three critical reasons.

First off, logs capture what metrics can't. A metric tells you the CPU is at 80%, but a log tells you which process is consuming resources and why. This level of detail is essential for understanding not just that something is wrong, but what specifically is causing the problem.

Second, logs provide business context. Error messages contain user IDs, transaction details, and business logic failures that help you understand customer impact. When you're troubleshooting an issue, knowing which customers are affected and what they were trying to do is invaluable for prioritizing your response.

Third, logs enable true root cause analysis. Stack traces, error messages, and application state captured in logs are essential for understanding the why behind every incident. Without this information, teams are left guessing at root causes rather than definitively identifying and fixing them.

The teams winning with AIOps in 2025 aren't just monitoring metrics; they're extracting intelligence from their logs at scale, turning operational data into actionable insights.

Transform Your Log Strategy Today

Every hour your team spends manually searching through logs is an hour they're not spending on innovation. Every incident that could have been prevented through intelligent log analysis represents both technical debt and business risk.

Elastic Observability provides the foundation you need to unlock the intelligence hidden in your logs. With automatic categorization, anomaly detection, and ML-powered analysis, you can start seeing value immediately. Check out this recent article to get started with Elastic Streams and Observability today!

The observability gap: Why your monitoring strategy isn't ready for what's coming next

Mon, 25 Aug 2025 00:00:00 GMT

Anyone who’s been to London knows the Tube announcement to “Mind the gap,” but what about the gap that’s developing in our monitoring and observability strategies? I’ve been through this toil before, and have run a distributed system that was humming along perfectly. My alerts were manageable, my dashboards made sense, and when things broke, I could usually track down the issue in a reasonable amount of time.

Fast forward 3-5 years and things have changed: we added Kubernetes, embraced microservices, and these days you might even have sprinkled in some AI-powered features. Suddenly, you're drowning in telemetry data, your alert fatigue is real, and correlating issues across your distributed architecture feels stressful.

You're experiencing what I call the "observability gap", where system complexity rockets ahead while our monitoring maturity crawls behind. Today, we're going to explore why this gap exists, what's driving it wider, and most importantly, how to close it using modern observability practices.

The complexity rocket ship has left the station

Let's be honest about what we're dealing with. The scale and complexity of our infrastructure isn't growing linearly, it's exponential. We've gone from monolithic applications running on physical servers to container orchestration platforms managing hundreds of microservices, with AI algorithms now starting to make scaling decisions autonomously.

This trajectory shows no signs of slowing down. With AI-assisted coding accelerating development cycles and intelligent orchestration systems like Kubernetes evolving toward predictive scaling, we're looking at infrastructure that's not just complex, but dynamically complex.

Meanwhile, our observability tooling? It's stuck in the past, designed for a world where you knew exactly how many servers you had and could manually correlate logs with metrics by cross-referencing timestamps.

The telemetry data explosion (and why sampling isn't the answer)

One of the first things teams notice as they scale is their observability bill climbing faster than their infrastructure costs. The knee-jerk reaction is often to start sampling data: downsample metrics, head-sample traces, deduplicate logs. While these techniques have their place, they're fundamentally at odds with where we're heading.

Here's the thing: ML and AI systems thrive on rich, contextual data. When you sample away the "noise," you're often discarding the very signals that could help you understand system behavior patterns or predict failures. Instead of asking "how can we collect less data?", the better question is "how can we store and process all this data cost-effectively?"

Modern storage architectures, particularly those leveraging object storage and advanced compression techniques like Zstandard, can achieve remarkable cost-to-value ratios. The secret is organizing related data together and moving it to cheaper storage tiers quickly. This approach lets you have your cake and eat it too: full-fidelity data retention without breaking the bank.

Of course, there is a balance to strike, and not all your applications are equal. As a first step, look at your most critical flows and applications and ensure that they have the richest telemetry. Do not take a sledgehammer approach and sample all your data just to reduce bills when a scalpel is best.

OpenTelemetry (OTel): the foundation everything else builds on

If I had to pick the single most transformative change in observability during my career, it would be OpenTelemetry. Not because it's flashy or revolutionary in concept, but because it solves fundamental problems that have plagued us for years.

Before OTel, instrumenting applications meant vendor lock-in. Want to switch from vendor A to vendor B? Good luck re-instrumenting your entire codebase. Want to send the same telemetry to multiple backends? Hope you enjoy maintaining multiple agent configurations.

OpenTelemetry changes things completely. Here are the three main reasons why.

Vendor Neutrality: Your instrumentation code becomes portable. The same OTel SDK can send data to any compliant backend.

OpenTelemetry Semantic Conventions: All your telemetry (logs, metrics, traces, profiles, wide-events) shares common metadata like service names, resource attributes, and trace context.

Auto-Instrumentation: For most popular languages and frameworks, you get rich telemetry with zero code changes.

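OTel also makes manual instrumentation incredibly valuable with minimal effort. Adding a single line like this (with baggage and context imported from the opentelemetry package)

context.attach(baggage.set_baggage("customer.id", "alice123"))

in your authentication service means that the customer ID travels with the request context through every downstream service call. In the Python API, set_baggage returns a new context rather than modifying the current one, which is why the result is attached. Once baggage entries are copied onto spans and log records (for example, by a baggage span processor), you can search all your telemetry data by customer ID across your entire distributed system.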

The trajectory is clear: within a few years, OTel will be as ubiquitous and invisible as Kubernetes is becoming today. Runtimes will include it by default, cloud providers will offer OTel collectors at the edge, and frameworks will come pre-instrumented.

Correlation: the secret sauce that makes everything click

You get an alert about high latency. You check your metrics dashboard: yep, the 95th percentile is spiking. You switch to your tracing system and you can see some slow requests. You hop over to your logging system and there are some error messages around the same time. Now comes the fun part: figuring out which logs correspond to which traces and whether they're related to the metric that alerted you.

This context-switching nightmare is exactly what proper correlation eliminates. When your telemetry data shares common identifiers (for example, trace IDs in logs, consistent service names, synchronized timestamps, or even customer IDs), you can seamlessly pivot between different signal types without losing context.

But correlation goes beyond just technical convenience. When you can search all your logs by customer.id and immediately see the traces and metrics for that customer's journey through your system, you transform how you approach support and debugging. When you can filter your entire observability stack by deployment version and instantly understand the impact of a release, you change how you think about deployments.
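
As a small illustration of what "trace IDs in logs" looks like in practice, the sketch below stamps every log record with the ID of the active trace using only the Python standard library. The context variable current_trace_id is a stand-in for the trace context an OpenTelemetry SDK would supply, and the logger name and trace ID value are made up.

```python
import contextvars
import io
import logging

# Stand-in for the active trace context; an OpenTelemetry SDK would supply this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Pretend a request with this trace ID is currently being handled.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.error("card authorization failed")

print(stream.getvalue(), end="")
```

Because the identifier is a field on the record rather than text buried in the message, a backend can pivot from this log line to the matching trace with an exact-match query.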

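Metrics? Yes, even metrics can be correlated by using OpenTelemetry exemplars. For example, in Python you would turn on exemplars along these lines. The class and parameter names below are illustrative and differ between SDK versions; setting the specification's OTEL_METRICS_EXEMPLAR_FILTER environment variable to trace_based is the portable way to select the same behavior.

# Set up metrics with exemplars enabled (illustrative names, see note above)
exemplar_filter = ExemplarFilter(trace_based=True)
exemplar_reservoir = ExemplarReservoir(
    exemplar_filter=exemplar_filter,
    max_exemplars=5,
)

This attaches the trace and span IDs of the trace that is active when a measurement is recorded, so some of your metric data points link directly to traces.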

Then again, why correlate at all?

So you may be thinking: this is great, and I can see it being a useful strategy. It is especially useful when you have metrics, logs, and traces in separate systems. Pretty soon, however, you realize that correlating is a lot of effort when you could just combine all this data in a single data structure and avoid the need to correlate at all. The observability industry agrees and has recently been espousing the benefits of a new signal type called wide-events.

Wide-events are really just structured logs. The idea is to put metric data, trace data, and log data all into the same wide data structure, which can make analysis much easier. Think about it: with a single data structure, you can very quickly run queries and aggregations without having to join any data, which can get pretty expensive.
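
To show what querying without joins can look like, here is a minimal Python sketch that computes an error rate per deployment version in a single pass over a handful of wide events. The events, field values, and the error_rate_by helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, carrying trace-like,
# metric-like, and log-like fields side by side.
events = [
    {"trace_id": "a1", "deployment.version": "1.4.0", "duration_ms": 120, "status.code": "OK"},
    {"trace_id": "a2", "deployment.version": "1.5.0", "duration_ms": 950, "status.code": "ERROR"},
    {"trace_id": "a3", "deployment.version": "1.5.0", "duration_ms": 870, "status.code": "ERROR"},
    {"trace_id": "a4", "deployment.version": "1.5.0", "duration_ms": 110, "status.code": "OK"},
]


def error_rate_by(events: list, field: str) -> dict:
    """Error rate grouped by any field, computed in one pass with no joins."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["status.code"] == "ERROR":
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


print(error_rate_by(events, "deployment.version"))
```

The same function can group by any other field on the event, because everything needed for the answer already lives in each record.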

Additionally you are increasing the information density per log record which is particularly great for AI applications. AI gets a context-rich dataset to do analysis on with minimal latency, a single record with enough descriptive capability to quickly find the root cause of your issue without having to dig around in other data stores and try to figure out whatever schema those data stores are using.

LLMs especially LOVE context and if you can give them all the context they need without having them try to find it, your investigation time will significantly reduce.

This isn't just about making SRE life easier (though it does that). It's about creating the rich, interconnected dataset that AI and ML systems need to understand your infrastructure's behavior patterns.

AI-driven investigations

Observability tools today have become pretty good at solving the alert fatigue and dashboarding problems; things have gotten quite mature there. Alert correlation and other techniques drastically reduce the noise in these domains, not to mention a focus on alerting on SLOs instead of purely technical metrics. Life has gotten better over the past few years for SREs here.

Alerts are one piece of the puzzle, but the latest AI techniques using LLMs and agentic AI can unlock time savings in a different spot: during investigations. Think about it: investigations are typically what drags on when you have an outage, and the cognitive overload while the pressure is on is very real and pretty stressful for SREs.

The good news is that once we get our data in good shape, with correlation, enrichment, and wide-events, and store it in full fidelity, we have the tools to help us drive faster investigations.

LLMs can take all that rich data and do some very powerful analysis that can cut down your investigation time. Let's walk through an example.

Imagine we have the following basic log. We only have a limited amount of data for an LLM to reason about. All it can tell is that a database failed.

Let's see what this looks like when we use a wide-event. Notice that already we can see some significant benefits. Firstly, we only had to visit the log from a single node, the node that serviced the request. We didn’t have to dig into downstream logs. This already makes life easier for the LLM: it doesn't have to figure out how to correlate multiple log lines, traces, and metrics, though we do still have correlation IDs if we desperately need to look in downstream systems.

Next we have all this additional rich data that an LLM can use to reason about what happened. LLMs work best with context and if you can feed them as much context as possible they will work more effectively to reduce your investigation time.

Field	How an LLM uses it
`trace_id`, `parent_span_id`	Thread every hop together without parsing free-text
`status.code`, `error.*`	Precise failure class; no NLP guess-work
`db.*`	Root-cause surface ("postgres isn't provisioned")
`user.id`, `cloud.region`	Instant blast-radius queries
`deployment.version`	Correlation with new releases
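
To make the contrast tangible, the sketch below sets a terse log line next to a wide event that carries the field families listed in the table. Every value, including the error type and version number, is invented for illustration and does not come from the original example.

```python
import json

# A terse log line: little for a person or an LLM to reason about.
basic_log = "2025-08-25T10:15:02Z ERROR database call failed"

# The same failure as a wide event. Every value here is made up.
wide_event = {
    "@timestamp": "2025-08-25T10:15:02Z",
    "message": "database call failed: connection pool exhausted",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "parent_span_id": "00f067aa0ba902b7",
    "status.code": "ERROR",
    "error.type": "PoolTimeoutError",
    "db.system": "postgresql",
    "db.operation": "SELECT",
    "user.id": "alice123",
    "cloud.region": "eu-west-2",
    "deployment.version": "1.5.0",
}


def context_fields(event: dict) -> list:
    """Names of the structured fields available besides timestamp and message."""
    return sorted(k for k in event if k not in ("@timestamp", "message"))


print(basic_log)
print(json.dumps(wide_event, indent=2))
print(context_fields(wide_event))
```

The wide event keeps the free-text message and adds nine structured fields around it, so both exact-match filters and language-based reasoning have something to work with.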

Notice that we didn’t get rid of the unstructured error message, this is still useful context! LLMs are great at processing unstructured text so this textual description helps it understand the problem even further.

Large language models shine when they’re handed complete, context-rich evidence, exactly what wide-event logging supplies. Invest once in richer logs, and every downstream AI workflow (summaries, anomaly detection, natural-language queries) becomes simpler, cheaper, and far more reliable.

Building toward the future

As I look ahead, three trends seem inevitable:

OpenTelemetry semantic conventions power wide-events: Using OTel semantic conventions to create wide-events will become as standard as logging is today. Cloud providers, runtimes, and frameworks will use them by default.
Making sense of logs with LLMs: Both improving the richness of your data and having LLMs automatically improve the richness of your existing logs will become essential for shortening investigation times.
AI will be essential: As system complexity outpaces human cognitive ability to understand it, AI assistance will become necessary for maintaining reasonable investigation times.

The organizations that start building toward this future now (adopting OpenTelemetry, investing in richer observability, and beginning to experiment with AI-assisted debugging) will have a significant advantage as these trends accelerate.

Your next steps

If you're dealing with the observability gap in your own environment, here's where I'd start:

Evaluate your logs: Do your logs have the richness of data you need to shorten investigation times? Can LLMs help provide additional context?
Start experimenting with OpenTelemetry: Even if you can't migrate everything immediately, instrumenting new services with OTel and using semantic conventions to produce wide-events gives you experience with the technology and starts building your enriched dataset.
Add high-value context: Customer IDs, session IDs, deployment versions. Even small amounts of contextual metadata can dramatically improve your debugging capabilities.
Think beyond storage costs: Instead of sampling data away, investigate modern storage architectures that let you keep everything at a reasonable cost for your most critical services.

The complexity rocket ship has left the station, and it's not slowing down. The question isn't whether your observability strategy needs to evolve; it's whether you'll evolve it proactively or reactively. I know which approach leads to better sleep at night.

Monitor dbt pipelines with Elastic Observability

Fri, 26 Jul 2024 00:00:00 GMT

In the Data Analytics team within the Observability organization in Elastic, we use dbt (dbt™, data build tool) to execute our SQL data transformation pipelines. dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code. In particular, we use dbt core, the open-source project, where you can develop from the command line and run your dbt project.

Our data transformation pipelines run daily and process the data that feed our internal dashboards, reports, analyses, and Machine Learning (ML) models.

There have been incidents in the past when the pipelines failed, the source tables contained wrong data, or we introduced a change into our SQL code that caused data quality issues, and we only realized it once a weekly report showed an anomalous number of records. That’s why we built a monitoring system that proactively alerts us about these types of incidents as soon as they happen and helps us understand their root cause with visualizations and analyses, saving us several hours or days of manual investigation.

We have leveraged our own Observability Solution to help solve this challenge, monitoring the entire lifecycle of our dbt implementation. This setup enables us to track the behavior of our models and conduct data quality testing on the final tables. We export dbt process logs from run jobs and tests into Elasticsearch and utilize Kibana to create dashboards, set up alerts, and configure Machine Learning jobs to monitor and assess issues.

The following diagram shows our complete architecture. In a follow-up article, we’ll also cover how we observe our Python data processing and ML model processes using OTel and Elastic. Stay tuned.

Why monitor dbt pipelines with Elastic?

With every invocation, dbt generates and saves one or more JSON files called artifacts containing log data on the invocation results. dbt run and dbt test invocation logs are stored in the file run_results.json, as per the dbt documentation:

This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many run_results.json can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.
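
As a rough sketch of how such an artifact can be turned into documents for indexing, the Python below flattens the results array into one record per node. The embedded JSON is a trimmed, made-up stand-in for a real run_results.json (real files carry many more fields), and the summarize helper and the output field names are hypothetical rather than the pipeline described in this article.

```python
import json

# Trimmed, made-up stand-in for a run_results.json artifact.
SAMPLE = """
{
  "metadata": {"generated_at": "2024-07-26T06:00:00Z"},
  "results": [
    {"unique_id": "model.analytics.table_a", "status": "success", "execution_time": 12.5},
    {"unique_id": "model.analytics.table_b", "status": "error", "execution_time": 3.1},
    {"unique_id": "test.analytics.unique_user_id", "status": "fail", "execution_time": 0.8}
  ]
}
"""


def summarize(run_results: dict) -> list:
    """Flatten each executed node into one document suitable for indexing."""
    generated_at = run_results["metadata"]["generated_at"]
    return [
        {
            "@timestamp": generated_at,
            "node": result["unique_id"],
            "status": result["status"],
            "execution_time_s": result["execution_time"],
        }
        for result in run_results["results"]
    ]


docs = summarize(json.loads(SAMPLE))
# Models report "success" and tests report "pass" when nothing went wrong.
failed = [d["node"] for d in docs if d["status"] not in ("success", "pass")]
print(failed)
```

Keeping one document per node and invocation makes it straightforward to chart execution time per model or to alert on any status other than success or pass.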

Monitoring dbt run invocation logs can help solve several issues, including tracking and alerting about table volumes, detecting excessive slot time from resource-intensive models, identifying cost spikes due to slot time or volume, and pinpointing slow execution times that may indicate scheduling issues. This system was crucial when we merged a PR with a change in our code that had an issue, producing a sudden drop in the number of daily rows in upstream Table A. By ingesting the dbt run logs into Elastic, our anomaly detection job quickly identified anomalies in the daily row counts for Table A and its downstream tables, B, C, and D. The Data Analytics team received an alert notification about the issue, allowing us to promptly troubleshoot, fix and backfill the tables before it affected the weekly dashboards and downstream ML models.

Monitoring dbt test invocation logs can also address several issues, such as identifying duplicates in tables, detecting unnoticed alterations in allowed values for specific fields through validation of all enum fields, and resolving various other data processing and quality concerns. With dashboards and alerts on data quality tests, we proactively identify issues like duplicate keys, unexpected category values, and increased nulls, ensuring data integrity. In our team, we had an issue where a change in one of our raw lookup tables produced duplicated rows in our user table, doubling the number of users reported. By ingesting the dbt test logs into Elastic, our rules detected that some duplicate tests had failed. The team received an alert notification about the issue, allowing us to troubleshoot it right away by finding the upstream table that was the root cause. These duplicates meant that downstream tables had to process 2x the amount of data, creating a spike in the bytes processed and slot time. The anomaly detection and alerts on the dbt run logs also helped us spot these spikes for individual tables and allowed us to quantify the impact on our billing.

Processing our dbt logs with Elastic and Kibana allows us to obtain real-time insights, helps us quickly troubleshoot potential issues, and keeps our data transformation processes running smoothly. We set up anomaly detection jobs and alerts in Kibana to monitor the number of rows processed by dbt, the slot time, and the results of the tests. This lets us catch real-time incidents, and by promptly identifying and fixing these issues, Elastic makes our data pipeline more resilient and our models more cost-effective, helping us stay on top of cost spikes or data quality issues.

We can also correlate this information with other events ingested into Elastic. For example, using the Elastic GitHub connector, we can correlate data quality test failures or other anomalies with code changes to find the commit or PR that caused the issues. By ingesting application logs into Elastic, we can also use APM to analyze whether these pipeline issues have affected the latency, throughput, or error rates of downstream applications. By ingesting billing, revenue, or web traffic data, we could also see the impact on business metrics.

How to export dbt invocation logs to Elasticsearch

We use the Python Elasticsearch client to send the dbt invocation logs to Elastic after we run our dbt run and dbt test processes daily in production. The setup just requires you to install the Elasticsearch Python client and obtain your Elastic Cloud ID (go to https://cloud.elastic.co/deployments/, select your deployment and find the Cloud ID) and Elastic Cloud API Key (following this guide).

This Python helper function will index the results from your run_results.json file to the specified index. You just need to export the following variables to the environment:

RESULTS_FILE: path to your run_results.json file
DBT_RUN_LOGS_INDEX: the name you want to give to dbt run logs index in Elastic, e.g. dbt_run_logs
DBT_TEST_LOGS_INDEX: the name you want to give to the dbt test logs index in Elastic, e.g. dbt_test_logs
ES_CLUSTER_CLOUD_ID
ES_CLUSTER_API_KEY

Then call the function log_dbt_es from your Python code, or save this code as a Python script and run it after executing your dbt run or dbt test commands:

from elasticsearch import Elasticsearch, helpers
import os
import sys
import json

def log_dbt_es():
   RESULTS_FILE = os.environ["RESULTS_FILE"]
   DBT_RUN_LOGS_INDEX = os.environ["DBT_RUN_LOGS_INDEX"]
   DBT_TEST_LOGS_INDEX = os.environ["DBT_TEST_LOGS_INDEX"]
   es_cluster_cloud_id = os.environ["ES_CLUSTER_CLOUD_ID"]
   es_cluster_api_key = os.environ["ES_CLUSTER_API_KEY"]


   es_client = Elasticsearch(
       cloud_id=es_cluster_cloud_id,
       api_key=es_cluster_api_key,
       request_timeout=120,
   )


   if not os.path.exists(RESULTS_FILE):
       print(f"ERROR: {RESULTS_FILE} No dbt run results found.")
       sys.exit(1)


   with open(RESULTS_FILE, "r") as json_file:
       results = json.load(json_file)
       timestamp = results["metadata"]["generated_at"]
       metadata = results["metadata"]
       elapsed_time = results["elapsed_time"]
       args = results["args"]
       docs = []
       for result in results["results"]:
           if result["unique_id"].split(".")[0] == "test":
               result["_index"] = DBT_TEST_LOGS_INDEX
           else:
               result["_index"] = DBT_RUN_LOGS_INDEX
           result["@timestamp"] = timestamp
           result["metadata"] = metadata
           result["elapsed_time"] = elapsed_time
           result["args"] = args
           docs.append(result)
       _ = helpers.bulk(es_client, docs)
   return "Done"

# Call the function
log_dbt_es()

If you want to add/remove any other fields from run_results.json, you can modify the above function to do it.
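
For instance, one way to trim each result before indexing is sketched below. KEEP_FIELDS and slim_result are hypothetical names that are not part of the function above, and the field list is only an example.

```python
# Hypothetical allow-list; adjust it to the fields you actually need.
KEEP_FIELDS = ("unique_id", "status", "execution_time", "adapter_response")


def slim_result(result, keep=KEEP_FIELDS):
    """Return a copy of one run_results.json result with only the chosen keys."""
    return {key: result[key] for key in keep if key in result}


# Made-up result entry, for illustration only.
example = {
    "unique_id": "model.proj.users",
    "status": "success",
    "execution_time": 1.2,
    "thread_id": "Thread-1",
    "timing": [],
}
print(slim_result(example))
```

Inside log_dbt_es, you could call slim_result(result) at the start of the loop and then add the _index and @timestamp keys to the returned dictionary.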

Once the results are indexed, you can use Kibana to create Data Views for both indexes and start exploring them in Discover.

Go to Discover, click on the data view selector on the top left and “Create a data view”.

Now you can create a data view with your preferred name. Do this for both dbt run (DBT_RUN_LOGS_INDEX in your code) and dbt test (DBT_TEST_LOGS_INDEX in your code) indices:

Going back to Discover, you’ll be able to select the Data Views and explore the data.

dbt run alerts, dashboards and ML jobs

The invocation of dbt run executes compiled SQL model files against the current database. dbt run invocation logs contain the following fields:

unique_id: Unique model identifier
execution_time: Total time spent executing this model run

The logs also contain the following metrics about the job execution from the adapter:

adapter_response.bytes_processed
adapter_response.bytes_billed
adapter_response.slot_ms
adapter_response.rows_affected

We have used Kibana to set up Anomaly Detection jobs on the above-mentioned metrics. You can configure a multi-metric job split by unique_id to be alerted when the sum of rows affected, slot time consumed, or bytes billed is anomalous per table. You can track one job per metric. If you have built a dashboard of the metrics per table, you can use this shortcut to create the Anomaly Detection job directly from the visualization. After the jobs are created and are running on incoming data, you can view the jobs and add them to a dashboard using the three dots button in the anomaly timeline:

We have used the ML job to set up alerts that send us emails/slack messages when anomalies are detected. Alerts can be created directly from the Jobs (Machine Learning > Anomaly Detection Jobs) page, by clicking on the three dots at the end of the ML job row:

We also use Kibana dashboards to visualize the anomaly detection job results and related metrics per table, to identify which tables consume most of our resources, to have visibility on their temporal evolution, and to measure aggregated metrics that can help us understand month over month changes.
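
If you prefer to define the multi-metric job described above in code rather than in the Kibana wizard, the sketch below builds a job body with one sum detector per metric, split by unique_id. The function name, the one-day bucket span, and the choice of metrics are assumptions for illustration. Depending on your index mapping, the partition field may need to be a keyword sub-field.

```python
def build_dbt_run_job(metrics=("rows_affected", "slot_ms", "bytes_billed")):
    """Build an anomaly detection job body with one sum detector per metric."""
    detectors = [
        {
            "function": "sum",
            "field_name": f"adapter_response.{metric}",
            # Split the analysis per model/table. Depending on your mapping this
            # may need to be a keyword sub-field such as "unique_id.keyword".
            "partition_field_name": "unique_id",
        }
        for metric in metrics
    ]
    return {
        "description": "dbt run metrics per model (illustrative)",
        "analysis_config": {
            "bucket_span": "1d",  # assumption: dbt runs once a day
            "detectors": detectors,
            "influencers": ["unique_id"],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job_body = build_dbt_run_job()
print(len(job_body["analysis_config"]["detectors"]))
```

A body like this could be sent to the create anomaly detection jobs API (ml.put_job in the Python client), followed by a datafeed that reads from the dbt run index. Check the exact call against the client version you use.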

dbt test alerts and dashboards

You may already be familiar with tests in dbt, but if you’re not, dbt data tests are assertions you make about your models. Using the command dbt test, dbt will tell you if each test in your project passes or fails. Here is an example of how to set them up. In our team, we use out-of-the-box dbt tests (unique, not_null, accepted_values, and relationships) and the packages dbt_utils and dbt_expectations for some extra tests. When the command dbt test is run, it generates logs that are stored in run_results.json.

dbt test logs contain the following fields:

unique_id: Unique test identifier, tests contain the “test” prefix in their unique identifier
status: result of the test, pass or fail
execution_time: Total time spent executing this test
failures: number of failing records returned by the test (0 if the test passes)
message: If the test fails, reason why it failed

The logs also contain the metrics about the job execution from the adapter.

We have set up alerts on document count (see guide) that will send us an email / slack message when there are any failed tests. The rule for the alerts is set up on the dbt test Data View that we have created before, the query filtering on status:fail to obtain the logs for the tests that have failed, and the rule condition is document count bigger than 0. Whenever there is a failure in any test in production, we get an alert with links to the alert details and dashboards to be able to troubleshoot them:

We have also built a dashboard to visualize the tests run, tests failed, and their execution time and slot time to have a historical view of the test run:
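
The alert rule described above is configured in the Kibana UI, but the same check can be sketched in Python to verify the data before creating the rule. The function names and the one-day window are assumptions. The query simply mirrors the status:fail filter and the "document count bigger than 0" condition.

```python
def failed_tests_query(window="now-1d"):
    """Query body matching dbt test documents with status "fail" in a time window."""
    return {
        "bool": {
            "filter": [
                {"match": {"status": "fail"}},
                {"range": {"@timestamp": {"gte": window}}},
            ]
        }
    }


def should_alert(doc_count):
    """Mirror the rule condition: fire when the document count is bigger than 0."""
    return doc_count > 0


print(failed_tests_query())
print(should_alert(0), should_alert(3))
```

With the client created in log_dbt_es, something like es_client.count(index=DBT_TEST_LOGS_INDEX, query=failed_tests_query())["count"] would give the number to pass to should_alert, assuming a recent version of the Python client.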

Finding Root Causes with the AI Assistant

The most effective way for us to analyze these multiple sources of information is using the AI Assistant to help us troubleshoot the incidents. In our case, we got an alert about a test failure, and we used the AI Assistant to give us context on what happened. Then we asked if there were any downstream consequences, and the AI Assistant interpreted the results of the Anomaly Detection job, which indicated a spike in slot time for one of our downstream tables and the increase of the slot time vs. the baseline. Then, we asked for the root cause, and the AI Assistant was able to find and provide us a link to a PR from our Github changelog that matched the start of the incident and was the most probable cause.

Conclusion

As a Data Analytics team, we are responsible for guaranteeing that the tables, charts, models, reports, and dashboards we provide to stakeholders are accurate and contain the right sources of information. As teams grow, the number of models we own becomes larger and more interconnected, and it isn’t easy to guarantee that everything is running smoothly and providing accurate results. Having a monitoring system that proactively alerts us on cost spikes, anomalies in row counts, or data quality test failures is like having a trusted companion that will alert you in advance if something goes wrong and help you get to the root cause of the issue.

dbt invocation logs are a crucial source of information about the status of our data pipelines, and Elastic is the perfect tool to extract the maximum potential out of them. Use this blog post as a starting point for utilizing your dbt logs to help your team achieve greater reliability and peace of mind, allowing them to focus on more strategic tasks rather than worrying about potential data issues.

Monitor your Python data pipelines with OTEL

Thu, 08 Aug 2024 00:00:00 GMT

This article delves into how to implement observability practices, particularly using OpenTelemetry (OTEL) in Python, to enhance the monitoring and quality control of data pipelines using Elastic. While the examples presented in the article focus primarily on ETL (Extract, Transform, Load) processes, where accurate and reliable data pipelines are crucial for Business Intelligence (BI), the strategies and tools discussed are equally applicable to Python processes used for Machine Learning (ML) models or other data processing tasks.

Introduction

Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.

In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into Google BigQuery (BQ). This processed data then feeds into DBT (Data Build Tool) models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our DBT pipelines with Elastic see Monitor your DBT pipelines with Elastic Observability. In this article we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.

The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language used as long as there exists a corresponding agent that supports OTEL instrumentation.

Motivation

Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:

Data Quality Control:

Detecting anomalies in the data, such as unexpected drops in record counts.
Verifying that data transformations are applied correctly and consistently.
Ensuring the integrity and accuracy of the data loaded into the data warehouse.

Performance Monitoring:

Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.
Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.

Real-time Alerting:

Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.
Identifying the root cause of such incidents.
Proactively addressing incidents to minimize downtime and impact on business operations.

Issues such as failed ETL jobs can even point to larger infrastructure problems or to data quality issues in the data source.

Steps for Instrumentation

Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.

Step 1: Import Required Libraries

We first need to install the following libraries.

pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]

You can also add them to your project's requirements.txt file and install them with pip install -r requirements.txt.

Explanation of Dependencies

elastic-opentelemetry: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:
- opentelemetry-distro: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.
- opentelemetry-exporter-otlp: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.
- opentelemetry-instrumentation-system-metrics: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.
google-cloud-bigquery[opentelemetry]: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.

Step 2: Export OTEL Variables

Set the necessary OpenTelemetry (OTEL) variables by getting the configuration from APM OTEL from Elastic.

Go to APM -> Services -> Add data (top left corner).

In this section you will find the steps to configure various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.

Find OTLP Endpoint:

Look for the section related to OpenTelemetry or OTLP configuration.
The OTEL_EXPORTER_OTLP_ENDPOINT is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like https:///otlp.

Obtain OTLP Headers:

In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.
Copy the necessary headers provided by the interface. They might look like Authorization: Bearer .

Note: When using Python, you need to replace the whitespace between Bearer and your token with %20 in the OTEL_EXPORTER_OTLP_HEADERS variable.
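
To avoid encoding the header by hand, a small sketch like the following can produce the value. The helper name otlp_auth_header is hypothetical, and your-token is a placeholder, not a real credential.

```python
from urllib.parse import quote


def otlp_auth_header(token, scheme="Bearer"):
    """Build the OTEL_EXPORTER_OTLP_HEADERS value with the space percent-encoded."""
    # quote() turns the space between the scheme and the token into %20.
    return "Authorization=" + quote(f"{scheme} {token}", safe="")


print(otlp_auth_header("your-token"))  # Authorization=Bearer%20your-token
```

The printed value is what goes into the OTEL_EXPORTER_OTLP_HEADERS variable.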

Alternatively you can use a different approach for authentication using API keys (see instructions). If you are using our serverless offering you will need to use this approach instead.

Set up the variables:

Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command source env.sh.

Below is a script to set these variables:

#!/bin/bash
echo "--- :otel: Setting OTEL variables"
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server/otlp:443'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER="otlp,console"

With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.

Explanation of Variables

OTEL_EXPORTER_OTLP_ENDPOINT: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace placeholder with your actual OTLP endpoint.
OTEL_EXPORTER_OTLP_HEADERS: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace placeholder with your actual OTLP headers.
OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.
OTEL_PYTHON_LOG_CORRELATION: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.
OTEL_METRIC_EXPORT_INTERVAL: This variable specifies the metric export interval in milliseconds, in this case 5s.
OTEL_LOGS_EXPORTER: This variable specifies the exporter to use for logs. Setting it to "otlp" means that logs will be exported using the OTLP protocol. Adding "console" specifies that logs should be exported to both the OTLP endpoint and the console. In our case, we chose to export to the console as well for better visibility on the infra side.
ELASTIC_OTEL_SYSTEM_METRICS_ENABLED: This variable must be set to true to collect system metrics with the Elastic distribution, because it is set to false by default.

Note: OTEL_METRICS_EXPORTER and OTEL_TRACES_EXPORTER: These variables specify the exporter to use for metrics and traces. They are set to "otlp" by default, which means that metrics and traces will be exported using the OTLP protocol.

Running Python ETLs

We run Python ETLs with the following command:

OTEL_RESOURCE_ATTRIBUTES="service.name=x-ETL,service.version=1.0,deployment.environment=production" opentelemetry-instrument python3 X_ETL.py

Explanation of the Command

OTEL_RESOURCE_ATTRIBUTES: This variable specifies additional resource attributes, such as service name, service version, and deployment environment, that will be included in all telemetry data. You can customize these values to fit your needs and use a different service name for each script.
opentelemetry-instrument: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.
python3 X_ETL.py: This runs the specified Python script (X_ETL.py).
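
If you launch several ETL scripts from a Python scheduler, the OTEL_RESOURCE_ATTRIBUTES value can be built per script rather than typed by hand. The sketch below only formats and parses the comma-separated key=value string used in the command above. The helper names are hypothetical.

```python
def format_resource_attributes(attributes):
    """Join key/value pairs into the comma-separated key=value form."""
    return ",".join(f"{key}={value}" for key, value in attributes.items())


def parse_resource_attributes(raw):
    """Split the comma-separated form back into a dictionary."""
    pairs = (item.split("=", 1) for item in raw.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}


attrs = {
    "service.name": "x-ETL",
    "service.version": "1.0",
    "deployment.environment": "production",
}
raw = format_resource_attributes(attrs)
print(raw)
```

The resulting string can then be passed to the child process through its environment, for example with the env argument of subprocess.run.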

Tracing

We export the traces via the default OTLP protocol.

Tracing is a key aspect of monitoring and understanding the performance of applications. Spans form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.

Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.

With the default instrumentation, the whole Python script would be a single span. In our case we have decided to manually add specific spans per the different phases of the Python process, to be able to measure their latency, throughput, error rate, etc individually. This is how we define spans manually:

from opentelemetry import trace

if __name__ == "__main__":

    tracer = trace.get_tracer("main")
    with tracer.start_as_current_span("initialization") as span:
        # Init code
        ...
    with tracer.start_as_current_span("search") as span:
        # Step 1 - Search code
        ...
    with tracer.start_as_current_span("transform") as span:
        # Step 2 - Transform code
        ...
    with tracer.start_as_current_span("load") as span:
        # Step 3 - Load code
        ...

You can explore traces in the APM interface as shown below.

Metrics

We also export metrics, such as CPU usage and memory, via the default OTLP protocol. No extra code needs to be added to the script itself.

Note: Remember to set ELASTIC_OTEL_SYSTEM_METRICS_ENABLED to true.

Logging

We export logs via the default OTLP protocol as well.

For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:

        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # "slot_time_ms": job_details.slot_ms,
            "job_id": job_details.job_id,
            "job_type": job_details.job_type,
            "state": job_details.state,
            "path": job_details.path,
            "job_created": job_details.created.isoformat(),
            "job_ended": job_details.ended.isoformat(),
            "execution_time_ms": (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            "bytes_processed": job_details.output_bytes,
            "rows_affected": job_details.output_rows,
            "destination_table": job_details.destination.table_id,
            "event": "BigQuery Load Job", # Custom event type
            "status": "success", # Status of the step (success/error)
            "category": category # ETL category tag 
        }

        logging.info("BigQuery load operation successful", extra=bq_fields)

This code shows how to extract BQ job stats, including execution time, bytes processed, rows affected, and the destination table. You can also add other metadata, as we do, such as a custom event type, status, and category.

Any call to logging at or above the configured level (in this case INFO, set with logging.getLogger().setLevel(logging.INFO)) will create a log that is exported to Elastic. This means that Python scripts that already use logging need no changes to export logs to Elastic.

For each of the log messages, you can go into the details view (click on the … when you hover over the log line and go into View details) to examine the metadata attached to the log message. You can also explore the logs in Discover.

Explanation of Logging Modification

logging.info: This logs an informational message. The message "BigQuery load operation successful" is logged.
extra=bq_fields: This adds additional context to the log entry using the bq_fields dictionary. This context can include details making the log entries more informative and easier to analyze. This data will be later used to set up alerts and data anomaly detection jobs.
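
To see what extra actually does, the standalone sketch below uses only the standard library: every key in the dictionary becomes an attribute on the LogRecord, which is where a logging handler can pick it up. The CaptureHandler class and the etl_demo logger name are hypothetical and exist only to inspect the record. Note that keys in extra must not clash with built-in LogRecord attributes such as message or name.

```python
import logging


class CaptureHandler(logging.Handler):
    """Collect emitted records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("etl_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = CaptureHandler()
logger.addHandler(handler)

# Made-up values, for illustration only.
bq_fields = {"event": "BigQuery Load Job", "status": "success", "rows_affected": 42}
logger.info("BigQuery load operation successful", extra=bq_fields)
logger.debug("below the INFO threshold, so never emitted")

record = handler.records[0]
print(record.getMessage(), record.status, record.rows_affected)
```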

Monitoring in Elastic's APM

As shown, we can examine traces, metrics, and logs in the APM interface. To make the most of this data, we use nearly the whole suite of features in Elastic Observability alongside Elastic Analytics' ML capabilities.

Rules and Alerts

We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.

The error count threshold rule is used to create a trigger when the number of errors in a service exceeds a defined threshold.

To create the rule go to Alerts and Insights -> Rules -> Create Rule -> Error count threshold, set the error count threshold, the service or environment you want to monitor (you can also set an error grouping key across services), how often to run the check, and choose a connector.

Next, we create a rule of type custom threshold on a given ETL logs data view (create one for your index) filtering on "labels.status: error" to get all the logs with status error from any of the steps of the ETL which have failed. The rule condition is set to document count > 0. In our case, in the last section of the rule config, we also set up Slack alerts every time the rule is activated. You can pick from a long list of connectors Elastic supports.

Then we can set up alerts for failures. We add status to the logs metadata as shown in the code sample below for each of the steps in the ETLs. It then becomes available in ES via labels.status.

logging.info(
    "Elasticsearch search operation successful",
    extra={
        "event": "Elasticsearch Search",
        "status": "success",
        "category": category,
        "index": index,
    },
)
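
One possible way to make sure every step reports either success or error is to wrap it in a small helper, sketched below. The run_step function and the etl_steps logger name are hypothetical and not part of our scripts. The status and event keys match the fields used in the rule above.

```python
import logging

logger = logging.getLogger("etl_steps")  # hypothetical logger name


def run_step(event, func, **fields):
    """Run one ETL step and log its outcome with a status field."""
    try:
        result = func()
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("%s failed", event,
                         extra={"event": event, "status": "error", **fields})
        raise
    logger.info("%s successful", event,
                extra={"event": event, "status": "success", **fields})
    return result


# Stand-in for a real search call, for illustration only.
hits = run_step("Elasticsearch Search", lambda: [1, 2, 3], index="my-index")
print(len(hits))
```

Because the exception is re-raised after logging, the script still fails as before, and the error log with status error is what the custom threshold rule matches.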

More Rules

We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -> Alerts and rules -> Custom threshold rule -> Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.

Alternatively, for finer-grained control, you can go with Alerts and rules -> Anomaly rule, set up an anomaly job, and pick a threshold severity level.

Anomaly detection job

In this example, we set up an Anomaly Detection job on the number of documents before the transform, using the Single metric job (https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs), to detect any anomalies in the incoming data source.

In the last step, you can create an alert, similarly to what we did before, to be notified whenever an anomaly is detected, by setting a severity level threshold. Every anomaly is assigned an anomaly score, which determines its severity level.
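
The relationship between anomaly score and severity level can be sketched as a simple mapping. The thresholds below are an assumption based on the default severity bands shown in Kibana (warning, minor, major, critical), so verify them against the version you run.

```python
def severity_level(score):
    """Map an anomaly score (0-100) to a severity label (assumed default bands)."""
    if not 0 <= score <= 100:
        raise ValueError("anomaly scores range from 0 to 100")
    if score >= 75:
        return "critical"
    if score >= 50:
        return "major"
    if score >= 25:
        return "minor"
    return "warning"


for score in (10, 30, 60, 80):
    print(score, severity_level(score))
```

Choosing a higher severity threshold for the alert means fewer, more significant notifications.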

Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.

You can add the job results to your custom dashboard by going to Add Panel -> ML -> Anomaly Swim Lane -> Pick your job.

Similarly, we add jobs for the number of documents after the transform, and a Multi-Metric one on execution_time_ms, bytes_processed, and rows_affected, as was done in Monitor your DBT pipelines with Elastic Observability.

Custom Dashboard

Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on labels.event (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.

Conclusion

Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:

Export logs (no need to add custom logging) alongside their execution context
Monitor the runtime behavior of our models
Track data quality issues
Identify and troubleshoot real-time incidents
Optimize performance bottlenecks and resource usage
Identify dependencies on other services and their latency
Optimize data transformation processes
Set up alerts on latency, data quality issues, error rates of transactions, or CPU usage

With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to a more robust and accurate BI system and reporting.

In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.

NGINX log analytics with GenAI in Elastic

Fri, 05 Jul 2024 00:00:00 GMT

Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. NGINX, which is widely used for web serving, load balancing, HTTP caching, and reverse proxying, is key to many applications and outputs a large volume of logs. NGINX’s access logs, which detail all requests made to the NGINX server, and error logs, which record server-related issues and problems, are key to managing and analyzing NGINX issues and to understanding what is happening in your application.

In managing NGINX, Elastic provides several capabilities:

Easy ingest, parsing, and out-of-the-box dashboards. Check out the simple how-to in our docs. Based on logs, these dashboards show several items over time: response codes, errors, top pages, data volume, browsers used, active connections, drop rates, and much more.
Out-of-the-box ML-based anomaly detection jobs for your NGINX logs. These jobs help pinpoint anomalies against request rates, IP address request rates, URL access, status codes, and visitor rate anomalies.
ES|QL which helps work through logs and build out charts during analysis.
Elastic’s GenAI Assistant provides a simple natural language interface that helps analyze all the logs and can pull out issues from ML jobs and even create dashboards. The Elastic AI Assistant also automatically uses ES|QL.
NGINX SLOs - Finally Elastic provides the ability to define and monitor SLOs for your NGINX logs. While most SLOs are metrics-based, Elastic allows you to create logs-based SLOs. We detailed this in a previous blog.

NGINX logs are another example of why logs are great. Logging is an important part of Observability, even though we generally think first of metrics and tracing. However, the amount of logs an application and the underlying infrastructure output can be daunting, and NGINX is usually the starting point for most analyses.

In today’s blog, we’ll cover how the out-of-the-box ML-based anomaly detection jobs can help RCA, and how Elastic’s GenAI Assistant helps easily work through logs to pinpoint issues in minutes.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here).
Bring up an NGINX server on a host, or run an application with NGINX as a front end and drive traffic.
Install the NGINX integration and assets and review the dashboards as noted in the docs.
Ensure you have an ML node configured in your Elastic stack.
To use the AI Assistant, you will need a trial or an upgrade to Platinum.

In our scenario, we use data from 3 months from our Elastic environment to help highlight the features. Hence you might need to run your application with traffic for a specific time frame to follow along.

Analyzing the issues with AI Assistant

As detailed in a previous blog, you can get alerted on issues via SLO monitoring against NGINX logs. Let’s assume you have an SLO based on status codes as we outlined in the previous blog. You can immediately analyze the issue via the AI Assistant. Because it's a chat interface we simply open the AI Assistant and work through some simple analysis: (See Animated GIF for a demo)

AI Assistant analysis:

Using lens graph all http response status codes < 400 and >= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer - We wanted to simply understand the number of requests resulting in status code >= 400 and graph the results. We see that 15% of the requests were not successful, hence an SLO alert being triggered.
Which ip address (field source.address) has the highest number of http.response.status.code >= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer - We were curious if there was a specific IP address not having successful requests. 72.57.0.53, with a count of 25,227 occurrences, is fairly high but does not account for all of the failed requests.
What country (source.geo.country_iso_code) is source.address=72.57.0.53 coming from. Use filebeat-nginx-elasticco-anon-2017. - Again we were curious if this came from a specific country. And the IP address 72.57.0.53 is coming from the country with the ISO code IN, which corresponds to India. Nothing out of the ordinary.
Did source.address=72.57.0.53 have any (http.response.status.code < 400) from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer - Oddly, the IP address in question had only 4,000+ successful responses. This means it's not malicious and points to something else.
What are the different status codes (http.response.status.code>=400), from source.address=72.57.0.53. Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code - We were curious whether we would see any 502s. There were none, and most of the failures were 404s.
What are the different status codes (http.response.status.code>=400). Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code - Regardless of a specific address, which status code >= 400 occurs most often? This also points to 404.
What does a high 404 count from a specific IP address mean from NGINX logs? - By asking this question, we wanted to understand the potential causes of this from our application. From the answers, we can rule out security probing and web scraping, as we validated that the specific address 72.57.0.53 has a low count of non-successful requests. It also rules out user error. Hence, this potentially points to broken links or missing resources.
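The questions above boil down to a handful of counts and ratios. As a rough illustration of what they compute (not of how the AI Assistant or ES|QL works internally), here is a minimal Python sketch. The sample records and IP addresses are invented for the example; only the field names come from the prompts above. Status codes are kept as strings and cast before comparing, because the field is not mapped as an integer.

```python
from collections import Counter

# Hypothetical sample records; the field names mirror the prompts above.
# Status codes are strings because the field is not mapped as an integer.
SAMPLE_LOGS = [
    {"source.address": "192.0.2.10", "http.response.status.code": "200"},
    {"source.address": "192.0.2.10", "http.response.status.code": "404"},
    {"source.address": "192.0.2.10", "http.response.status.code": "404"},
    {"source.address": "198.51.100.7", "http.response.status.code": "200"},
    {"source.address": "198.51.100.7", "http.response.status.code": "500"},
    {"source.address": "203.0.113.5", "http.response.status.code": "301"},
]


def is_failure(record):
    """Treat any status code >= 400 as unsuccessful, casting the string first."""
    return int(record["http.response.status.code"]) >= 400


def failure_ratio(records):
    """Share of requests whose status code is >= 400."""
    if not records:
        return 0.0
    return sum(is_failure(r) for r in records) / len(records)


def failures_by_address(records):
    """Count unsuccessful requests per source address."""
    return Counter(r["source.address"] for r in records if is_failure(r))


def failure_codes(records, address=None):
    """Count each status code >= 400, optionally for one source address."""
    return Counter(
        r["http.response.status.code"]
        for r in records
        if is_failure(r) and (address is None or r["source.address"] == address)
    )


if __name__ == "__main__":
    print(f"failure ratio: {failure_ratio(SAMPLE_LOGS):.0%}")
    print("top address:", failures_by_address(SAMPLE_LOGS).most_common(1))
    print("codes overall:", failure_codes(SAMPLE_LOGS).most_common())
```

Each function corresponds to one of the questions: the overall failure ratio, the address with the most failures, and the breakdown of failing status codes with or without an address filter.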

Watch the flow:

Potential issue:

It seems that we potentially have an issue with the backend serving specific answers or having issues with resources (database or broken links). This is causing the higher-than-normal number of non-successful status codes (>= 400).

Key highlights from AI Assistant:

As you watch this video, you will notice a few things:

We analyzed millions of logs in a matter of minutes using a set of simple natural language queries.
We didn’t need to know any special query language. The AI Assistant used Elastic’s ES|QL but can also use KQL.
The AI Assistant easily builds out graphs.
The AI Assistant accesses and uses internal information stored in Elastic’s indices, unlike a simple “Google-fu” based AI assistant. This is enabled through RAG, and the AI Assistant can also bring up known issues in GitHub, runbooks, and other useful internal information.

Check out the following blog on how the AI Assistant uses RAG to retrieve internal information, specifically using GitHub and runbooks.

Locating anomalies with ML

While using the AI Assistant is great for analyzing information, another important aspect of NGINX log management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies. When using NGINX, there are several out-of-the-box anomaly detection jobs. These work specifically on NGINX access logs.

low_request_rate_nginx - Detect low request rates
source_ip_request_rate_nginx - Detect unusual source IPs - high request rates
source_ip_url_count_nginx - Detect unusual source IPs - high distinct count of URLs
status_code_rate_nginx - Detect unusual status code rates
visitor_rate_nginx - Detect unusual visitor rates

Since these jobs are available right out of the box, let’s look at the status_code_rate_nginx job, which is related to our previous analysis.

With a few simple clicks, we immediately get an analysis showing a specific IP address, 72.57.0.53, with a higher-than-normal number of non-successful requests. Notably, this is the same address we found using the AI Assistant.

We can take this further with conversations with the AI Assistant, look at the logs, and/or even look at the other ML anomaly jobs.

Conclusion:

You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze NGINX logs without the need to know query syntax, where the data is, or even what the fields are. Additionally, you’ve seen how Elastic can alert you to a potential issue or degradation in service through SLOs.

Check out other resources on NGINX logs:

Out-of-the-box anomaly detection jobs for NGINX

Using the NGINX integration to ingest and analyze NGINX Logs

NGINX Logs based SLOs in Elastic

Using GitHub issues, runbooks, and other internal information for RCAs with Elastic’s RAG based AI Assistant

Try it out

Existing Elastic Cloud customers can access many of these features directly from the Elastic Cloud console. Not taking advantage of Elastic on the cloud? Start a free trial.

All of this is also possible in your environment. Learn how to get started today.

Root cause analysis with logs: Elastic Observability's AIOps Labs

Thu, 27 Apr 2023 00:00:00 GMT

In the previous blog in our root cause analysis with logs series, we explored how to analyze logs in Elastic Observability with Elastic’s anomaly detection and log categorization capabilities. Elastic’s platform enables you to get started on machine learning (ML) quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.

Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To get you started, there are several key features built into Elastic Observability to aid in analysis, bypassing the need to run specific ML models. These features help minimize the time and analysis of logs.

Let’s review the set of machine learning-based observability features in Elastic:

Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.

Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action more quickly.

High-latency or erroneous transactions: Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. Read APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions for an overview of this capability.

AIOps Labs: AIOps Labs provides two main capabilities using advanced statistical methods:

Log spike detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.

As we showed in the last blog, using machine learning-based features helps minimize the extremely tedious and time-consuming process of analyzing data using traditional methods, such as alerting and simple pattern matching (visual or simple searching, etc.). Trying to find the needle in the haystack requires the use of some level of artificial intelligence due to the increasing amounts of telemetry data (logs, metrics, and traces) being collected across ever-growing applications.

In this blog post, we’ll cover two capabilities found in Elastic’s AIOps Labs: log spike detector and log pattern analysis. We’ll use the same data from the previous blog and analyze it using these two capabilities.

We will cover log spike detector and log pattern analysis against the popular Hipster Shop app developed by Google, and modified recently by OpenTelemetry.

Overviews of high-latency capabilities can be found here, and an overview of AIOps labs can be found here.

Below, we will examine a scenario where we use the log spike detector and log pattern analysis to help identify a root cause of an issue in Hipster Shop.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.
Utilize a version of the popular Hipster Shop demo application. It was originally written by Google to showcase Kubernetes, and a multitude of variants are available, such as the OpenTelemetry Demo App. The Elastic version is found here.
Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: Independence with OTel in Elastic and Observability and Security with OTel in Elastic. Additionally, review the OTel documentation in Elastic.
Look through an overview of Elastic Observability APM capabilities.
Look through our anomaly detection documentation for logs and log categorization documentation.

Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:

In our example, we’ve introduced issues to help walk you through the root cause analysis features. You might have a different set of issues depending on how you load the application and/or introduce specific feature flags.

As part of the walk-through, we’ll assume we are DevOps or SRE managing this application in production.

Root cause analysis

While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer-related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.

How do you, as a DevOps engineer or SRE, investigate this? We will walk through two avenues in Elastic to investigate the issue:

Log spike analysis
Log pattern analysis

While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.

Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.

In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. Rather than jump into anomaly detection (see previous blog), let’s look at some of the potential issues by reviewing the service details in APM.

What we see for the productCatalogService is that there are latency issues, failed transactions, a large number of errors, and a dependency on PostgreSQL. When we look at the errors in more detail and drill down, we see they are all coming from pq, which is a PostgreSQL driver for Go.

As we drill further, we still can’t tell why the productCatalogService is not able to pull information from the PostgreSQL database.

We see that there is a spike in errors, so let's see if we can glean further insight using one of our two options:

Log rate spikes
Log pattern analysis

Log rate spikes

Let’s start with the log rate spikes detector capability from Elastic’s AIOps Labs section of Elastic’s machine learning capabilities. We also pre-select analyzing the spike against a baseline history.

The log rate spikes detector has looked at all the logs from the spike and compared them to the baseline, and it's seeing higher-than-normal counts in specific log messages. From a visual inspection, we see that PostgreSQL log messages are high. We further filter this with postgres.

We immediately notice that this issue is potentially caused by pgbench, a popular PostgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes a heavy load on the database host, likely causing higher latency issues on the site.

While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.
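The idea of comparing a spike window against a baseline can be illustrated with a small Python sketch. This is a deliberately simplified toy, not the statistical method that AIOps Labs uses: it only ranks field values whose share of the log volume grew between the two windows. The function name, the `min_ratio` parameter, the `service.name` field, and the sample documents are all assumptions made for the example.

```python
from collections import Counter


def spike_contributors(baseline, spike, field, min_ratio=2.0):
    """Rank values of `field` whose share of the logs grew in the spike window.

    Returns (value, baseline_share, spike_share) tuples, largest spike share first.
    """
    base_counts = Counter(doc.get(field) for doc in baseline)
    spike_counts = Counter(doc.get(field) for doc in spike)
    results = []
    for value, count in spike_counts.items():
        spike_share = count / len(spike)
        base_share = base_counts.get(value, 0) / len(baseline) if baseline else 0.0
        # A value is interesting if it is new or its share grew by min_ratio.
        if base_share == 0.0 or spike_share / base_share >= min_ratio:
            results.append((value, base_share, spike_share))
    return sorted(results, key=lambda item: item[2], reverse=True)


# Hypothetical documents: postgres goes from 20% to 70% of the log volume.
BASELINE = [{"service.name": "frontend"}] * 8 + [{"service.name": "postgres"}] * 2
SPIKE = [{"service.name": "frontend"}] * 3 + [{"service.name": "postgres"}] * 7

if __name__ == "__main__":
    for value, before, after in spike_contributors(BASELINE, SPIKE, "service.name"):
        print(f"{value}: {before:.0%} of baseline logs, {after:.0%} of spike logs")
```

In this invented data set, only the postgres value is reported, which matches the kind of lead the walkthrough above follows before filtering on postgres.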

Log pattern analysis

Instead of log rate spikes, let’s use log pattern analysis to investigate the spike in errors we saw in productCatalogService. In AIOps Labs, we simply select Log Pattern Analysis, use Logs data, filter the results with postgres (since we know it's related to PostgreSQL), and look at information from the message field of the logs we are processing. We see the following:

Almost immediately we see the biggest pattern it finds is a log message where pgbench is updating the database. We can further directly drill into this log message from log pattern analysis into Discover and review the details and further analyze the messages.

As we mentioned in the previous section, while it may or may not be the root cause, it quickly gives us a place to start and a potential root cause. A developer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.
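To give a feel for what grouping log messages into patterns means, here is a minimal Python sketch. It is an assumption-laden toy rather than Elastic's categorization algorithm: it simply masks tokens that look variable (numbers, dotted numbers, hex values) and counts the resulting templates. The sample messages are invented, loosely modeled on the kind of pgbench statements discussed above.

```python
import re
from collections import Counter

# Tokens that look variable: hex values, integers, and dotted numbers such as IPs.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_pattern(message):
    """Replace variable-looking tokens with '*' to obtain a message template."""
    return _VARIABLE.sub("*", message)


def top_patterns(messages, limit=3):
    """Return the most common templates with their counts."""
    return Counter(to_pattern(m) for m in messages).most_common(limit)


# Hypothetical log messages for illustration only.
MESSAGES = [
    "UPDATE pgbench_accounts SET abalance = abalance + 42 WHERE aid = 1001",
    "UPDATE pgbench_accounts SET abalance = abalance + 7 WHERE aid = 2002",
    "connection received: host=10.0.0.5 port=5432",
]

if __name__ == "__main__":
    for pattern, count in top_patterns(MESSAGES):
        print(count, pattern)
```

The two UPDATE statements collapse into a single template, which is why one dominant pattern can stand out even when every raw message is different.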

Conclusion

Between the first blog and this one, we’ve shown how Elastic Observability can help you further identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned in this blog.

Elastic Observability has numerous capabilities to help you reduce your time to find the root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities (found in AIOps Labs in Elastic) in this blog:
1. Log rate spikes detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
2. Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.
You learned how easy and simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand machine learning (which helps drive these features) or having to do any lengthy setups.

Ready to get started? Register for Elastic Cloud and try out the features and capabilities outlined above.

Elastic and Elasticsearch are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries.

Monitoring service performance: An overview of SLA calculation for Elastic Observability

Mon, 24 Apr 2023 00:00:00 GMT

Elastic Stack provides many valuable insights for different users. Developers are interested in low-level metrics and debugging information. SREs are interested in seeing everything at once and identifying where the root cause is. Managers want reports that tell them how good service performance is and if the service level agreement (SLA) is met. In this post, we’ll focus on the service perspective and provide an overview of calculating an SLA.

Since version 8.8, we have built-in functionality to calculate SLOs. Check out our guide!

Foundations of calculating an SLA

There are many ways to calculate and measure an SLA. The most important part is the definition of the SLA, and as a consultant, I’ve seen many different definitions. Some examples include:

Count of HTTP 2xx must be above 98% of all HTTP status
Response time of successful HTTP 2xx requests must be below x milliseconds
Synthetic monitor must be up at least 99%
95% of all batch transactions from the billing service need to complete within 4 seconds
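Two of these definitions can be sketched in a few lines of Python to show what is actually being measured. This is only an illustration under stated assumptions: the function names are made up, the inputs are plain lists rather than Elasticsearch queries, and whether a target is met with "above" or "at least" depends on how your SLA is worded.

```python
def success_rate(status_codes):
    """Fraction of responses whose HTTP status code is in the 2xx range."""
    if not status_codes:
        return 0.0
    return sum(200 <= code < 300 for code in status_codes) / len(status_codes)


def share_within(durations_s, limit_s):
    """Fraction of transactions that completed within limit_s seconds."""
    if not durations_s:
        return 0.0
    return sum(d <= limit_s for d in durations_s) / len(durations_s)


if __name__ == "__main__":
    # Hypothetical measurements for illustration only.
    codes = [200] * 99 + [500]
    batch_durations = [1.0] * 19 + [5.0]

    print("2xx share above 98%:", success_rate(codes) > 0.98)
    print("95% of batches within 4 s:", share_within(batch_durations, 4.0) >= 0.95)
```

Writing the definition down this explicitly helps settle questions such as which requests count toward the denominator before any dashboard or alert is built.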

Depending on the origin of the data, calculating the SLA can be easier or more difficult. For uptime (Synthetic Monitoring), we automatically provide SLA values and offer out-of-the-box alerts, so you can simply define an alert for when availability falls below 98% over the last hour.

I personally recommend using Elastic Synthetic Monitoring whenever possible to monitor service performance. Running HTTP requests and verifying the answers from the service, or doing fully fledged browser monitors and clicking through the website as a real user does, ensures a better understanding of the health of your service.

Sometimes this is impossible because you want to calculate the uptime of a specific Windows Service that does not offer any TCP port or HTTP interaction. Here the caveat applies that just because the service is running, it does not necessarily imply that the service is working fine.

Transforms to the rescue

We have identified our important service. In our case, it is the Steam Client Helper. There are two ways to solve this.

Lens formula

You can use Lens and formula (for a deep dive into formulas, check out this blog). Use the Search bar to filter down the data you want. Then use the formula option in Lens. We divide the count of records with Running as the state by the overall count of records. This is a nice solution when there is a need to calculate quickly and on the fly.

count(kql='windows.service.state: "Running" ')/count()

Using the formula posted above as the bar chart's vertical axis calculates the uptime percentage. We use an annotation to mark why there is a dip and why this service was below the threshold. The annotation is set to reboot, which indicates a reboot happening, and thus, the service was down for a moment. Lastly, we add a reference line and set this to our defined threshold at 98%. This ensures that a quick look at the visualization allows our eyes to gauge if we are above or below the threshold.
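For readers who want to sanity-check the numbers outside of Lens, the same ratio and threshold comparison can be written as a small Python sketch. It assumes one collected state per minute and treats every state other than Running as down; the function name and the sample data are hypothetical.

```python
THRESHOLD = 0.98  # the 98% reference line used in the visualization


def uptime_ratio(states):
    """Equivalent of count(kql='windows.service.state: "Running"') / count()."""
    if not states:
        return None  # no documents in the bucket, so there is no ratio to report
    return sum(state == "Running" for state in states) / len(states)


if __name__ == "__main__":
    # One hour of hypothetical samples, collected once per minute.
    one_miss = ["Running"] * 59 + ["Stopped"]
    two_misses = ["Running"] * 58 + ["Stopped"] * 2

    for label, states in (("one miss", one_miss), ("two misses", two_misses)):
        ratio = uptime_ratio(states)
        print(f"{label}: {ratio:.2%}, meets threshold: {ratio >= THRESHOLD}")
```

With 60 samples per hour, a single non-Running sample still clears a 98% threshold while two do not, which is useful to keep in mind when reading hourly buckets.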

Transform

What if you are not interested in just one service, but need multiple services for your SLA? That is where Transforms can help. A second issue is that the calculated data is only available inside the Lens visualization, so we cannot create any alerts on it.

Go to Transforms and create a pivot transform.

Add the following filter to narrow it to only the service data set: data_stream.dataset: "windows.service". If you are interested in a specific service, you can always add it to the search bar, for example to find out whether a specific remote management service is up across your entire fleet!
Select date_histogram(@timestamp) and set it to your chosen unit. By default, the Elastic Agent only collects service states every 60 seconds. I am going with 1 hour.
Select agent.name and windows.service.name as well.

Now we need to define an aggregation type. We will use a value_count of windows.service.state. That just counts how many records have this value.

Rename the value_count to total_count.
Add value_count for windows.service.state a second time and use the pencil icon to edit it to terms, which aggregates for running.

This opens up a sub-aggregation. Once again, select value_count(windows.service.state) and rename it to values.
Now, the preview shows us the count of records with any states and the count of running.

Here comes the tricky part. We need to write some custom aggregations to calculate the percentage of uptime. Click on the copy icon next to the edit JSON config.
In a new tab, go to Dev Tools. Paste what you have in the clipboard.
Press the play button or use the keyboard shortcut ctrl+enter/cmd+enter and run it. This will create a preview of what the data looks like. It should give you the same information as in the table preview.
Now, we need to calculate the percentage of uptime, which means adding a bucket script that divides running.values by total_count, just like we did in the Lens visualization. If you named the columns differently or use more than a single value, you will need to adapt accordingly.

"availability": {
  "bucket_script": {
    "buckets_path": {
      "up": "running>values",
      "total": "total_count"
    },
    "script": "params.up/params.total"
  }
}
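To make the logic of the pivot and the bucket script easier to follow, here is a minimal Python sketch that performs the same grouping and division on a few in-memory documents. It is an offline illustration only, not how the transform runs in Elasticsearch. The field names come from the walkthrough; the function name, the flat document shape, and the sample values are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def pivot_availability(docs):
    """Group by hour, agent.name and windows.service.name, then divide the
    count of Running states by the count of all states in each group."""
    buckets = defaultdict(lambda: {"total_count": 0, "running": 0})
    for doc in docs:
        ts = datetime.fromisoformat(doc["@timestamp"])
        hour = ts.replace(minute=0, second=0, microsecond=0)
        key = (hour.isoformat(), doc["agent.name"], doc["windows.service.name"])
        buckets[key]["total_count"] += 1
        if doc["windows.service.state"] == "Running":
            buckets[key]["running"] += 1
    # Same arithmetic as the bucket script: params.up / params.total
    return {
        key: {**counts, "availability": counts["running"] / counts["total_count"]}
        for key, counts in buckets.items()
    }


# Hypothetical documents: four samples in one hour, three of them Running.
DOCS = [
    {"@timestamp": f"2021-12-07T19:{minute:02d}:00+00:00",
     "agent.name": "host-a",
     "windows.service.name": "InstallService",
     "windows.service.state": state}
    for minute, state in ((0, "Running"), (1, "Running"), (2, "Stopped"), (3, "Running"))
]

if __name__ == "__main__":
    for key, values in pivot_availability(DOCS).items():
        print(key, values)
```

Each output entry corresponds to one row of the transform preview: a group key plus total_count, the running count, and the derived availability.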

This is the entire transform for me:

POST _transform/_preview
{
  "source": {
    "index": [
      "metrics-*"
    ]
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      "agent.name": {
        "terms": {
          "field": "agent.name"
        }
      },
      "windows.service.name": {
        "terms": {
          "field": "windows.service.name"
        }
      }
    },
    "aggregations": {
      "total_count": {
        "value_count": {
          "field": "windows.service.state"
        }
      },
      "running": {
        "filter": {
          "term": {
            "windows.service.state": "Running"
          }
        },
        "aggs": {
          "values": {
            "value_count": {
              "field": "windows.service.state"
            }
          }
        }
      },
      "availability": {
        "bucket_script": {
          "buckets_path": {
            "up": "running>values",
            "total": "total_count"
          },
          "script": "params.up/params.total"
        }
      }
    }
  }
}

The preview in Dev Tools should work and be complete. Otherwise, you must debug any errors. Most of the time, the culprit is the bucket script and its path to the values. For example, you might have named the aggregation up instead of running. This is what the preview looks like for me.

{
  "running": {
    "values": 1
  },
  "agent": {
    "name": "AnnalenasMac"
  },
  "@timestamp": "2021-12-07T19:00:00.000Z",
  "total_count": 1,
  "availability": 1,
  "windows": {
    "service": {
      "name": "InstallService"
    }
  }
},

Now we only need to paste the bucket script into the transform creation UI after selecting Edit JSON. It looks like this:

Give your transform a name, set the destination index, and run it continuously. When selecting continuous mode, make sure not to use @timestamp as the date field. Instead, opt for event.ingested. Our documentation explains this in detail.

Click Next, then Create and start. This can take a bit, so don’t worry.

To summarize, we have now created a pivot transform using a bucket script aggregation to calculate the running time of a service as a percentage. There is a caveat: by default, Elastic Agent only collects the service state every 60 seconds, so a service can be up exactly when its state is collected and down a few seconds later. If this matters and no other monitoring option, such as Elastic Synthetics, is possible, you might want to reduce the collection interval on the Agent side to get the service state every 30 or 45 seconds. Depending on how important your thresholds are, you can create multiple policies with different collection intervals. A very important server might collect the service state every 10 seconds because you need as much granularity and assurance of the metric’s correctness as possible. For normal workstations, where you just want to know if your remote access solution is up the majority of the time, you might not mind having a single metric every 60 seconds.
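The sampling caveat is easy to demonstrate with a small simulation. The Python sketch below is a simplified model, not a description of how Elastic Agent schedules its collection: it assumes the state is sampled at fixed, evenly spaced instants, that outages are given as non-overlapping (start, end) ranges in seconds, and that the window and interval are positive. All names and numbers are hypothetical.

```python
def measured_availability(outages, interval_s, window_s=3600, offset_s=0):
    """Availability as seen by polling the service every interval_s seconds.

    outages is a list of (start_s, end_s) downtime ranges within the window.
    """
    samples = range(offset_s, window_s, interval_s)
    up = sum(
        not any(start <= t < end for start, end in outages) for t in samples
    )
    return up / len(samples)


def true_availability(outages, window_s=3600):
    """Availability based on the actual downtime (non-overlapping outages)."""
    down = sum(end - start for start, end in outages)
    return 1 - down / window_s


if __name__ == "__main__":
    # A 50-second outage that falls between two 60-second samples.
    outages = [(65, 115)]
    print(f"true:           {true_availability(outages):.4f}")
    print(f"sampled at 60s: {measured_availability(outages, 60):.4f}")
    print(f"sampled at 10s: {measured_availability(outages, 10):.4f}")
```

In this example the 60-second collection never observes the outage and reports full availability, while the 10-second collection catches it, which is the trade-off to weigh when choosing collection intervals per policy.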

After you have created the transform, one additional benefit is that the data is stored in an index, like any other data in Elasticsearch. When you just do the visualization, the metric is calculated for that visualization only and is not available anywhere else. Since this is now data, you can create a threshold alert that notifies your favorite connector (Slack, Teams, ServiceNow, email, and many more).

Visualizing the transformed data

The transform created a data view called windows-service. The first thing we want to do is change the format of the availability field to a percentage. This automatically tells Lens to format the field as a percentage, so you don’t need to select the format manually or do any calculations. Furthermore, in Discover, instead of seeing 0.5 you see 50%. Isn’t that cool? This is also possible for durations, like event.duration if you have it as nanoseconds! No more calculating on the fly and wondering whether you need to divide by 1,000 or 1,000,000.

We get this view by using a simple Lens visualization with a timestamp on the vertical axis with a minimum interval of 1 day and an average of availability. Don’t worry: the other data will be populated once the transform finishes. We can add a reference line using the value 0.98 because our target is 98% uptime of the service.

Summary

This blog post covered the steps needed to calculate the SLA for a specific data set in Elastic Observability, as well as how to visualize it. Using this calculation method opens the door to a lot of interesting use cases. For example, you can change the bucket script and start calculating the number of sales and the average basket size. Interested in learning more about Elastic Synthetics? Read our documentation or check out our free Synthetic Monitoring Quick Start training.

Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator

Tue, 16 Jan 2024 00:00:00 GMT

This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (ECS) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.

Why use OpenShift Logging Operator?

A lot of enterprise customers use OpenShift as their orchestration solution. The advantages of this approach are:

It is developed and supported by Red Hat
It can automatically update the OpenShift cluster along with the operating system to make sure that they are and remain compatible
It can speed up development life cycles with features like source-to-image
It offers enhanced security

In our consulting experience, this latter aspect creates challenges and friction with OpenShift administrators when we try to install an Elastic Agent to collect the logs of the pods. Indeed, Elastic Agent requires the files of the host to be mounted in the pod, and it also needs to be run in privileged mode. (Read more about the permissions required by Elastic Agent in the official Elasticsearch® Documentation.) While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.

Which logs are we going to collect?

In OpenShift Container Platform, we distinguish three broad categories of logs: audit, application, and infrastructure logs:

Audit logs describe the activities of users, administrators, and other components that affected the system.
Application logs are composed of the container logs of the pods running in non-reserved namespaces.
Infrastructure logs are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.
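The distinction between application and infrastructure container logs comes down to the namespace a pod runs in. As a minimal sketch of that rule in Python, using only the reserved prefixes named above (openshift*, kube*, and default), the check could look like this. The function name is hypothetical, and audit logs and journald messages are not covered because they are not classified by namespace.

```python
def log_category(namespace):
    """Return 'infrastructure' for reserved namespaces, else 'application'."""
    reserved = namespace == "default" or namespace.startswith(("openshift", "kube"))
    return "infrastructure" if reserved else "application"


if __name__ == "__main__":
    # Hypothetical namespaces for illustration only.
    for ns in ("openshift-monitoring", "kube-system", "default", "my-shop"):
        print(f"{ns}: {log_category(ns)}")
```

Treat this as a reading aid: the authoritative classification is whatever the OpenShift Logging Operator applies in your cluster version.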

In the following, we will consider only audit and application logs for the sake of simplicity. In this post, we will describe how to format audit and application logs in the format expected by the Kubernetes integration to get the most out of Elastic Observability.

Getting started

To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.

Inside Elasticsearch

We first install the Kubernetes integration assets. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs.

To format the logs received from the ClusterLogForwarder in ECS format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to Exported fields | Logging | OpenShift Container Platform 4.14. To get a list of exported fields of the Kubernetes integration, you can refer to Kubernetes fields | Filebeat Reference [8.11] | Elastic and Logs app fields | Elastic Observability [8.11]. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.
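Before reading the pipeline, it may help to see two of its normalization steps in isolation. The Python sketch below is only an illustration of the intent, not a replacement for the ingest pipeline: it replaces dots in annotation keys with underscores and splits a container ID of the form runtime://id into its two parts. The function names and sample values are assumptions, and if two keys collide after replacement, the later one wins.

```python
def normalize_annotations(annotations):
    """Replace dots in annotation keys with underscores."""
    return {key.replace(".", "_"): value for key, value in annotations.items()}


def split_container_id(raw_id):
    """Split '<runtime>://<id>' into (runtime, id); return (None, raw_id) otherwise."""
    runtime, separator, container_id = raw_id.partition("://")
    return (runtime, container_id) if separator else (None, raw_id)


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    print(normalize_annotations({"openshift.io/scc": "restricted"}))
    print(split_container_id("cri-o://abc123"))
```

Dots matter because Elasticsearch interprets them as object paths, so an annotation key containing dots would otherwise be expanded into nested fields.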

PUT _ingest/pipeline/openshift-2-ecs
{
  "processors": [
    {
      "rename": {
        "field": "kubernetes.pod_id",
        "target_field": "kubernetes.pod.uid",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.pod_ip",
        "target_field": "kubernetes.pod.ip",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.pod_name",
        "target_field": "kubernetes.pod.name",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.namespace_name",
        "target_field": "kubernetes.namespace",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.namespace_id",
        "target_field": "kubernetes.namespace_uid",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_id",
        "target_field": "container.id",
        "ignore_missing": true
      }
    },
    {
      "dissect": {
        "field": "container.id",
        "pattern": "%{container.runtime}://%{container.id}",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_image",
        "target_field": "container.image.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.container.image",
        "copy_from": "container.image.name",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "copy_from": "kubernetes.container_name",
        "field": "container.name",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_name",
        "target_field": "kubernetes.container.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.node.name",
        "copy_from": "hostname",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "level",
        "target_field": "log.level",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "file",
        "target_field": "log.file.path",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "copy_from": "openshift.cluster_id",
        "field": "orchestrator.cluster.name",
        "ignore_failure": true
      }
    },
    {
      "dissect": {
        "field": "kubernetes.pod_owner",
        "pattern": "%{_tmp.parent_type}/%{_tmp.parent_name}",
        "ignore_missing": true
      }
    },
    {
      "lowercase": {
        "field": "_tmp.parent_type",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.pod.{{_tmp.parent_type}}.name",
        "value": "{{_tmp.parent_name}}",
        "if": "ctx?._tmp?.parent_type != null",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": [
          "_tmp",
          "kubernetes.pod_owner"
        ],
        "ignore_missing": true
      }
    },
    {
      "script": {
        "description": "Normalize kubernetes annotations",
        "if": "ctx?.kubernetes?.annotations != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(".") >= 0) {
            def sanitizedKey = k.replace(".", "_");
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        """
      }
    },
    {
      "script": {
        "description": "Normalize kubernetes namespace_labels",
        "if": "ctx?.kubernetes?.namespace_labels != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(".") >= 0) {
            def sanitizedKey = k.replace(".", "_");
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        """
      }
    },
    {
      "script": {
        "description": "Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version",
        "if": "ctx?.kubernetes?.labels != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith("app_kubernetes_io_component_")) {
            def sanitizedKey = k.replace("app_kubernetes_io_component_", "app_kubernetes_io_component/");
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        """
      }
    }
    ]
}
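
To make the normalization steps concrete, the following Python sketch mirrors, outside Elasticsearch, what the pod_owner dissect/lowercase steps and the key-sanitizing scripts above do to a single document. It is an illustration only and not part of the pipeline; the function names and the sample values are hypothetical.

```python
def sanitize_keys(mapping):
    """Return a copy of mapping with dots in keys replaced by underscores."""
    return {key.replace(".", "_"): value for key, value in mapping.items()}


def split_pod_owner(pod_owner):
    """Split 'Kind/name' into a lowercased kind and a name."""
    kind, sep, name = pod_owner.partition("/")
    if not sep:
        return None  # no '/' separator, so the 'Kind/name' pattern does not apply
    return kind.lower(), name


# Hypothetical sample values, shaped like the fields discussed above.
annotations = {"openshift.io/scc": "restricted-v2", "plain": "value"}
print(sanitize_keys(annotations))

owner = split_pod_owner("ReplicaSet/frontend-5d8f7c")
if owner is not None:
    kind, name = owner
    print(f"kubernetes.pod.{kind}.name = {name}")
```

Running a few representative documents through a sketch like this before loading the pipeline can help confirm which field names to expect on the Elastic side.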

Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:

PUT _ingest/pipeline/openshift-audit-2-ecs
{
  "processors": [
    {
      "script": {
        "source": """
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 && !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=["audit":audit];
        """,
        "description": "Move all remaining top-level fields under the 'kubernetes.audit' object"
      }
    },
    {
      "set": {
        "copy_from": "openshift.cluster_id",
        "field": "orchestrator.cluster.name",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "field": "kubernetes.node.name",
        "copy_from": "hostname",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    },
    {
      "script": {
        "if": "ctx?.kubernetes?.audit?.annotations != null",
        "source": """
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(".") >= 0) {
              def sanitizedKey = k.replace(".", "_");
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          """,
        "description": "Normalize kubernetes audit annotations field as expected by the Integration"
      }
    }
  ]
}

The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.
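
As a rough illustration of that first script processor, here is a minimal Python sketch of the same nesting logic applied to a plain dictionary. The function name and the sample audit event are hypothetical; only the list of fields kept at the root is taken from the pipeline above.

```python
# Fields the script leaves at the document root (copied from the pipeline above).
KEEP_AT_ROOT = {"@timestamp", "data_stream", "openshift", "event", "hostname"}


def nest_audit_fields(doc):
    """Move every other top-level field under kubernetes.audit."""
    audit = {}
    remaining = {}
    for key, value in doc.items():
        # Keys starting with '_' and the reserved fields stay where they are.
        if key.startswith("_") or key in KEEP_AT_ROOT:
            remaining[key] = value
        else:
            audit[key] = value
    remaining["kubernetes"] = {"audit": audit}
    return remaining


# Hypothetical audit event, for illustration only.
event = {
    "@timestamp": "2023-12-01T10:00:00Z",
    "hostname": "node-1",
    "verb": "get",
    "auditID": "abc-123",
}
print(nest_audit_fields(event))
```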

We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11].

The OpenShift Cluster Log Forwarder writes the data to the indices app-write and audit-write by default. It is possible to change this behavior, but it still adds the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog Simplifying log data management: Harness the power of flexible routing with Elastic and our documentation Reroute processor | Elasticsearch Guide [8.11] | Elastic.

In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:

PUT _ingest/pipeline/app-write-reroute-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "openshift-2-ecs",
        "description": "Format the Openshift data in ECS"
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "value": "kubernetes.container_logs"
      }
    },
    {
      "reroute": {
        "destination": "logs-kubernetes.container_logs-openshift"
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "openshift-audit-2-ecs",
        "description": "Format the Openshift data in ECS"
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "value": "kubernetes.audit_logs"
      }
    },
    {
      "reroute": {
        "destination": "logs-kubernetes.audit_logs-openshift"
      }
    }
  ]
}

Note that because app-write and audit-write do not follow the data stream naming convention, we must set the destination field explicitly in the reroute processor. The reroute processor also fills in the data_stream fields for us; Elastic Agent does this step automatically at the source.
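
The naming convention in question is type-dataset-namespace. The short Python sketch below shows why the default index names cannot be split that way while the explicit destinations can. It is a simplified check that assumes exactly three dash-separated parts; the function name is hypothetical and this is not how Elasticsearch implements the processor.

```python
def parse_data_stream_name(index_name):
    """Split '<type>-<dataset>-<namespace>' into its parts, or return None."""
    parts = index_name.split("-")
    if len(parts) != 3 or not all(parts):
        return None
    data_type, dataset, namespace = parts
    return {"type": data_type, "dataset": dataset, "namespace": namespace}


for name in ("app-write", "audit-write", "logs-kubernetes.container_logs-openshift"):
    print(name, "->", parse_data_stream_name(name))
```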

Next, we create the indices and assign them the default pipelines we created, so the logs are rerouted according to our needs.

PUT app-write
{
  "settings": {
    "index.default_pipeline": "app-write-reroute-pipeline"
  }
}


PUT audit-write
{
  "settings": {
    "index.default_pipeline": "audit-write-reroute-pipeline"
  }
}

Basically, what we did can be summarized in this picture:

Let us take the container logs. When the operator attempts to write to the app-write index, it invokes the default pipeline “app-write-reroute-pipeline”, which formats the logs into ECS format and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which invokes the logs-kubernetes.container_logs@custom pipeline if it exists. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another data set and namespace using the elastic.co/dataset and elastic.co/namespace annotations, as described in the Kubernetes integration documentation, which in turn can lead to the execution of another integration pipeline.

Create a user for sending the logs

We are going to use basic authentication because, at the time of writing, it is the only supported authentication method for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write to and read from the app-write and audit-write indices (required by the OpenShift agent), plus auto_configure access to logs-*-* to allow custom Kubernetes rerouting:

PUT _security/role/YOURROLE
{
    "cluster": [
      "monitor"
    ],
    "indices": [
      {
        "names": [
          "logs-*-*"
        ],
        "privileges": [
          "auto_configure",
          "create_doc"
        ],
        "allow_restricted_indices": false
      },
      {
        "names": [
          "app-write",
          "audit-write"
        ],
        "privileges": [
          "create_doc",
          "read"
        ],
        "allow_restricted_indices": false
      }
    ],
    "applications": [],
    "run_as": [],
    "metadata": {},
    "transient_metadata": {
      "enabled": true
    }
}



PUT _security/user/YOUR_USERNAME
{
  "password": "YOUR_PASSWORD",
  "roles": ["YOURROLE"]
}
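
For readers less familiar with basic authentication, the following Python sketch shows how a username and password pair is turned into the Authorization header that a client sends with each request. It is a generic illustration of the HTTP scheme, not OpenShift or Elastic code; the function name is hypothetical and the credentials are the same placeholders used above.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholders only; substitute the user created above.
print(basic_auth_header("YOUR_USERNAME", "YOUR_PASSWORD"))
```

Because the credentials are only encoded, not encrypted, they should always travel over HTTPS.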

On OpenShift

On the OpenShift Cluster, we need to follow the official documentation of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.

We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}

The Cluster Log Forwarder is the resource responsible for defining a daemon set that will forward the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials for the user we created previously, in the same namespace where the ClusterLogForwarder will be deployed:

apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD

Finally, we create the ClusterLogForwarder resource:

kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: "https://YOUR_ELASTICSEARCH_URL:443"
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch

Note that we explicitly set the Elasticsearch version to 8; otherwise the ClusterLogForwarder will send the _type field, which is not compatible with Elasticsearch 8. Note also that we collect only application and audit logs.

Result

Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, such as the lack of host and cloud metadata, which does not seem to be collected (at least not without additional configuration). We can view the Kubernetes container logs in the logs explorer:

In this post, we described how you can use the OpenShift Logging Operator to collect container logs and audit logs. We still recommend using Elastic Agent to collect all your logs: it offers the best user experience, with no need to maintain pipelines or transform the logs into ECS format yourself. Additionally, Elastic Agent uses API keys for authentication and collects metadata such as cloud information, which lets you do more in the long run.

Learn more about log monitoring with the Elastic Stack.

Monitor your C++ Applications with Elastic APM

Tue, 11 Feb 2025 00:00:00 GMT

Introduction

One of the main challenges that developers, SREs, and DevOps professionals face is the absence of an extensive tool that provides them with visibility into their application stack. Many of the APM solutions on the market do provide methods to monitor applications built on popular languages and frameworks (e.g., .NET, Java, Python) but fall short when it comes to C++ applications.

Luckily, Elastic has been one of the leading solutions in the observability space and a contributor to the OpenTelemetry project. Elastic’s unique position and its extensive observability capabilities allow end users to monitor applications built with object-oriented programming languages and frameworks in a variety of ways.

In this blog we will explore using Elastic APM to investigate C++ traces with the OpenTelemetry client. We provide a comprehensive guide on how to implement the OpenTelemetry client for C++ applications and connect it to Elastic APM. While OTel has its own libraries, and this blog reviews how to use the OTel C++ library, Elastic also has its own Elastic Distributions of OpenTelemetry, which were developed to provide commercial support and whose changes are regularly contributed upstream.

Here are some resources to help get you started:

Step by Step Guide

Prerequisites

Environment

Choosing an environment matters because operating system support for the OpenTelemetry C++ client is limited. We experimented with several operating systems, and these are our suggestions:

Ubuntu 22.04
Debian 11 Bullseye

For this guide we focus on Ubuntu 22.04:

- Machine: 2 vCPU, 4 GB is sufficient.
- Image: Ubuntu 22.04 LTS (x86_64).
- Disk: ~30 GB is enough.

Implementation method

We experimented with multiple methods and found that the most suitable approach is to use a package manager. After extensive testing, it became clear that getting the otel-cpp client running can be quite challenging. Building it with tools such as CMake or Bazel is a viable option, but when we tested those methods we spent most of our time and effort fixing OS compatibility and dependency issues rather than sending data to our APM. Hence we decided to move to a different method.

The main issues that we kept running into during testing were:

Compatibility of packages.
Availability of packages.
Dependencies of libraries and packages.

In this guide we will use vcpkg since it allows us to bring in all the dependencies required to run the OpenTelemetry C++ client.

Installing required OS tools

Update package lists

    sudo apt-get update

Install build essentials, cmake, git, and sqlite dev library

    sudo apt-get install -y build-essential cmake git curl zip unzip sqlite3 libsqlite3-dev

sqlite3 and libsqlite3-dev allow us to build/run SQLite queries in our C++ code.

Set Up vcpkg

vcpkg is the C++ package manager that we’ll use to install opentelemetry-cpp client.

    # Clone vcpkg
    cd ~
    git clone https://github.com/microsoft/vcpkg.git

    # Bootstrap
    cd ~/vcpkg
    ./bootstrap-vcpkg.sh

Install OpenTelemetry C++ with OTLP gRPC

In this guide we focus on trace export to Elastic. At the time of writing, vcpkg’s opentelemetry-cpp version 1.18.0 fully supports traces but has limited direct metrics exporting.

Install the package

    cd ~/vcpkg
    ./vcpkg install opentelemetry-cpp[otlp-grpc]:x64-linux

Note

Sometimes when installing opentelemetry-cpp on Linux, not all the required packages are installed. If you run into that case, a workaround is to run the install again with the --allow-unsupported flag:

    ./vcpkg install opentelemetry-cpp[*]:x64-linux --allow-unsupported

Verify

    ./vcpkg list | grep opentelemetry-cpp

The output should be something like this:

opentelemetry-cpp:x64-linux 1.18.0

Create the C++ Project with Database Spans

We’ll build a sample in ~/otel-app that:

Uses SQLite to do basic CREATE/INSERT/SELECT queries. This is helpful to showcase capturing transactions for apps that use databases on Elastic APM.
Generates random traces to showcase how they are captured on Elastic APM.

This app is going to generate random queries where some will contain database transactions and some are just application traces. Each query is contained in a child span, so they appear in APM as separate database transactions.

Below is the structure of our project:

    otel-app/
    ├── main.cpp
    └── CMakeLists.txt

Create App Project

    cd ~
    mkdir otel-app
    cd otel-app

Inside this project we will create two files

main.cpp
CMakeLists.txt

Keep in mind that main.cpp is where you configure the OTel exporters that send data to the Elastic cluster. In your own tech stack, this would be your application's source code.

Sample application code

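    // main.cpp
    // Below we include the required libraries that we will be using to ship
    // traces to Elastic APM
    #include <opentelemetry/exporters/otlp/otlp_grpc_exporter.h>
    #include <opentelemetry/sdk/trace/simple_processor.h>
    #include <opentelemetry/sdk/trace/tracer_provider.h>
    #include <opentelemetry/trace/provider.h>

    #include <sqlite3.h>
    #include <chrono>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <thread>
    #include <cstdlib>  // for rand(), srand()
    #include <ctime>    // for time()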

    // Namespace aliases
    namespace trace_api = opentelemetry::trace;
    namespace sdktrace  = opentelemetry::sdk::trace;
    namespace otlp      = opentelemetry::exporter::otlp;

    // Below we are using a helper function to run SQLITE statement inside 
    // child span
    bool ExecuteSql(sqlite3 *db, const std::string &sql,
                    trace_api::Tracer &tracer,
                    const std::string &span_name)
    {
      // Starting the child span
      auto db_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(db_span);

        // Here we mark Database attributes for clarity in APM
        db_span->SetAttribute("db.system", "sqlite");
        db_span->SetAttribute("db.statement", sql);

        char *errMsg = nullptr;
        int rc = sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &errMsg);
        if (rc != SQLITE_OK)
        {
          db_span->AddEvent("SQLite error: " + std::string(errMsg ? errMsg : "unknown"));
          sqlite3_free(errMsg);
          db_span->End();
          return false;
        }
        db_span->AddEvent("Query OK");
      }
      db_span->End();
      return true;
    }

    /**
     * DoNonDbWork - Simulate some other operation
     */
    void DoNonDbWork(trace_api::Tracer &tracer, const std::string &span_name)
    {
      auto child_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(child_span);
        // Just sleep or do some "fake" work
        std::cout << "[TRACE] Doing non-DB work for " << span_name << "...\n";
        std::this_thread::sleep_for(std::chrono::milliseconds(200 + rand() % 300));
        child_span->AddEvent("Finished non-DB work");
      }
      child_span->End();
    }

    int main()
    {
      // Seed random generator for example
      srand(static_cast<unsigned>(time(nullptr)));

      // 1) Create OTLP exporter for traces
      otlp::OtlpGrpcExporterOptions opts;
      auto exporter = std::make_unique<otlp::OtlpGrpcExporter>(opts);

      // 2) Simple Span Processor
      auto processor = std::make_unique<sdktrace::SimpleSpanProcessor>(std::move(exporter));

      // 3) Tracer Provider
      auto sdk_tracer_provider = std::make_shared<sdktrace::TracerProvider>(std::move(processor));
      auto tracer = sdk_tracer_provider->GetTracer("my-cpp-multi-app");

      // Prepare an in-memory SQLite DB (for random DB usage)
      sqlite3 *db = nullptr;
      int rc = sqlite3_open(":memory:", &db);
      if (rc == SQLITE_OK)
      {
        // Create a table so we can do inserts/reads
        ExecuteSql(db, "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, info TEXT);",
                   *tracer.get(), "db_create_table");
      }

      // Create the following loop to generate multiple transactions
      int num_transactions = 5;  // Change this variable to the desired number of transactions
      for (int i = 1; i <= num_transactions; i++)
      {
        // Each iteration is a top-level transaction
        std::string transaction_name = "transaction_" + std::to_string(i);
        auto parent_span = tracer->StartSpan(transaction_name);
        {
          auto scope = tracer->WithActiveSpan(parent_span);

          std::cout << "\n=== Starting " << transaction_name << " ===\n";

          // Randomly select whether a transaction will interact with the database or not.
          bool doDb = (rand() % 2 == 0); // 50% chance

          if (doDb && db)
          {
            // Insert random data
            std::string insert_sql = "INSERT INTO items (info) VALUES ('Item " + std::to_string(i) + "');";
            ExecuteSql(db, insert_sql, *tracer.get(), "db_insert_item");

            // Select from DB
            ExecuteSql(db, "SELECT * FROM items;", *tracer.get(), "db_select_items");
          }
          else
          {
            // Do some random non-DB tasks
            DoNonDbWork(*tracer.get(), "non_db_task_1");
            DoNonDbWork(*tracer.get(), "non_db_task_2");
          }

          // Sleep a little to simulate transaction time
          std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
        parent_span->End();
      }

      // Close DB
      sqlite3_close(db);

      // Extra sleep to ensure final flush
      std::cout << "\n[INFO] Sleeping 5 seconds to allow flush...\n";
      std::this_thread::sleep_for(std::chrono::seconds(5));
      std::cout << "[INFO] Exiting.\n";
      return 0;
    }

What does the code do?

We create 5 top-level “transaction_i” spans.

For each transaction, we randomly choose to do DB or non-DB work

- If DB: Insert a row, then select. Each is a child span.

- If non-DB: We do two “fake tasks” (child spans).

Once we finish, we close the database connection and wait 5 seconds for data flush.
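
The control flow described above can be sketched without any tracing library. The Python snippet below is only a simulation of which child span names each transaction would contain; it does not call OpenTelemetry, the function name is hypothetical, and it leaves out the one-off db_create_table span that the C++ program creates before the loop.

```python
import random


def plan_transactions(count, rng):
    """Return (transaction_name, child_span_names) pairs, one per loop iteration."""
    plans = []
    for i in range(1, count + 1):
        if rng.random() < 0.5:  # roughly a 50% chance of taking the DB path
            children = ["db_insert_item", "db_select_items"]
        else:
            children = ["non_db_task_1", "non_db_task_2"]
        plans.append((f"transaction_{i}", children))
    return plans


for name, children in plan_transactions(5, random.Random()):
    print(name, "->", ", ".join(children))
```

Each run prints a different mix, which is the same variety you should expect to see across transactions in the APM UI.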

Sample instruction file

CMakeLists.txt : This file contains instructions describing the source files and targets.

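    cmake_minimum_required(VERSION 3.10)

    # Point CMake at the vcpkg toolchain; this must be set before project()
    # (or passed on the command line with -DCMAKE_TOOLCHAIN_FILE) to take effect
    set(CMAKE_TOOLCHAIN_FILE "PATH-TO/vcpkg.cmake" CACHE STRING "Vcpkg toolchain file")

    project(OtelApp VERSION 1.0)

    # main.cpp uses std::make_unique, which requires C++14 or later
    set(CMAKE_CXX_STANDARD 14)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)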

    find_package(opentelemetry-cpp CONFIG REQUIRED)

    add_executable(otel_app main.cpp)

    # Below we are linking the OTLP gRPC exporter, trace library, and sqlite3
    target_link_libraries(otel_app PRIVATE
        opentelemetry-cpp::otlp_grpc_exporter
        opentelemetry-cpp::trace
        sqlite3
    )

Declare Environment Variables

Here we are going to export our Elastic Cloud endpoints as environment variables.

You can get that information by doing the following:

Log in to your Elastic Cloud account
Go into your deployment
On the left-hand side, click on the hamburger menu and scroll down to “Integrations”
In the search bar on the Integrations page, type “APM”
Click on the APM integration
Scroll down and click on the OpenTelemetry option on the far left side

You should see values similar to the screenshot below. Once you copy the values to export, click on launch APM.

As you copy the required values, go ahead and export them.

    export OTEL_EXPORTER_OTLP_ENDPOINT="APM-ENDPOINT"
    export OTEL_EXPORTER_OTLP_HEADERS="KEY"
    export OTEL_RESOURCE_ATTRIBUTES="service.name=my-app,service.version=1.0.0,deployment.environment=dev"

Note that the OTEL_EXPORTER_OTLP_HEADERS value provided by Elastic usually starts with “Authorization=Bearer”. Make sure you change the uppercase “A” in “Authorization” to a lowercase “a”, because the OTLP gRPC exporter expects header keys to be lowercase.
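
If you prefer not to edit the value by hand, a small helper can lowercase the header keys before you export the variable. The Python sketch below assumes the value is a comma-separated list of key=value pairs; the function name and the token shown are hypothetical.

```python
def lowercase_header_keys(raw_headers):
    """Lowercase the key of each comma-separated key=value pair, leaving values untouched."""
    pairs = []
    for item in raw_headers.split(","):
        # Split on the first '=' only, so values containing '=' stay intact.
        key, sep, value = item.partition("=")
        pairs.append(f"{key.strip().lower()}{sep}{value}")
    return ",".join(pairs)


# "abc123" is a made-up token used only for illustration.
print(lowercase_header_keys("Authorization=Bearer abc123"))
```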

Build and Run

Once we create the two files we then move to building the application.

    cd ~/otel-app
    mkdir -p build
    cd build

    cmake -DCMAKE_TOOLCHAIN_FILE=$HOME/vcpkg/scripts/buildsystems/vcpkg.cmake \
          -DCMAKE_PREFIX_PATH=$HOME/vcpkg/installed/x64-linux/share \
          ..
    make

Once make is successful, run the application:

    ./otel_app

You should see the program execute with console output similar to this:

    Console outcome:
    === Starting transaction_1 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_2 ===

    === Starting transaction_3 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_4 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_5 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    [INFO] Sleeping 5 seconds to allow flush...
    [INFO] Exiting.

Once the script executes you should be able to observe those traces on Elastic APM similar to the screenshots below.

Observe in Elastic APM

Go to Elastic Cloud, open your deployment, and navigate to Observability > APM.

Look for the app name in the service list (as defined by OTEL_RESOURCE_ATTRIBUTES).

Inside that service’s Traces tab, you’ll find multiple transactions like “transaction_1”, “transaction_2”, etc.

Expanding each transaction shows child spans:

- Possibly db_insert_item and db_select_items if random DB path was taken.

- Otherwise, non_db_task_1 and non_db_task_2.

You can see how some transactions do DB calls, some do not, each with different spans. This variety demonstrates how your real application might produce multiple different “routes” or “operations.”
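
If you are unsure which name to look for in the service list, you can read it back from the OTEL_RESOURCE_ATTRIBUTES value exported earlier. The following Python sketch parses that comma-separated string; it is a simplified parser written for this example (the function name is hypothetical), not the one used by the OpenTelemetry SDK.

```python
def parse_resource_attributes(raw):
    """Parse 'key=value,key=value' into a dictionary, skipping malformed items."""
    attributes = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        if sep:
            attributes[key.strip()] = value.strip()
    return attributes


attrs = parse_resource_attributes(
    "service.name=my-app,service.version=1.0.0,deployment.environment=dev"
)
print(attrs["service.name"])  # prints "my-app"
```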

Service Map

If everything runs correctly, you should be able to view your services and see service maps for your application.

Services

My Elastic App

App Transactions

Dependencies

Logs

Navigate to your logs window/Discover to see the incoming application logs

Patterns

Log pattern analysis helps you to find patterns in unstructured log messages and makes it easier to examine your data.

Final Recap

Here is a quick summary of what we did:

Provisioned an Ubuntu 22.04 machine.
Installed build tools, the SQLite dev libraries, and vcpkg.
Installed the opentelemetry-cpp client via vcpkg.
Created a minimal C++ project that executes app traces and captures database operations.
Linked the sqlite3 library in CMakeLists.txt.
Exported the Elastic OTLP endpoint & token as environment variables (with a lowercase authorization=Bearer key!).
Ran the application and observed DB interactions and app traces in Elastic APM.
Observed application logs and patterns on Elastic logs and Discover.

FAQ & Common Issues

Getting “Could not find package configuration file provided by opentelemetry-cpp”?

Make sure you pass -DCMAKE_TOOLCHAIN_FILE=... and -DCMAKE_PREFIX_PATH=... to cmake, or embed them in CMakeLists.txt.

Crash: “validate_metadata: INTERNAL:Illegal header key”?

Use all-lowercase keys in OTEL_EXPORTER_OTLP_HEADERS, e.g. authorization=Bearer followed by your token.

Missing otlp_grpc_metrics_exporter.h?

Your vcpkg version of opentelemetry-cpp (1.18.0) lacks a direct metrics exporter for OTLP. For metrics, either upgrade the library or consider an OpenTelemetry Collector approach.

No data in Elastic APM?

Double-check your endpoint URL, Bearer token, firewall rules, or service name in the APM

Additional Resources:

Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel

Wed, 01 Nov 2023 00:00:00 GMT

As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.

As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.

In Elastic 8.11, a technical preview is now available of Elastic’s new piped query language, ES|QL (Elasticsearch Query Language), which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.

Advantages of ES|QL for SREs

SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:

Improved operational efficiency: By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.
Enhanced analysis with insights: ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.
Reduced mean time to resolution: ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.

ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.

In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:

ES|QL integrated with the Elastic AI Assistant, which uses a public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.
SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.
Actionable alerts can be easily created from a single ES|QL query, enhancing operations.

I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.

You can also check out our Elastic Observability ES|QL Demo, which walks through ES|QL functionality for Observability.

ES|QL with AI Assistant

As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.

Using Elastic AI Assistant, you can easily ask for analysis, and in particular, we check on what the overall latency is across the application services.

My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS

The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:

load generator
front-end proxy
frontendservice
checkoutservice

With a simple natural language query in the AI Assistant, it generated a single ES|QL query that helped list out the latencies across the services.

Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through Elastic APM failure correlation, it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.

ES|QL insightful and contextual analysis in Discover

Knowing that the application is running on Kubernetes, we investigate if there are issues in Kubernetes. In particular, we want to see if there are any services having issues.

We use the following query in ES|QL in Elastic Discover:

from metrics-*
| where kubernetes.container.status.last_terminated_reason != "" and kubernetes.namespace == "default"
| stats reason_count = count(kubernetes.container.status.last_terminated_reason) by kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| where reason_count > 0

ES|QL helps analyze 1,000s/10,000s of metric events from Kubernetes and highlights two services that are restarting due to OOMKilled.

The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.

We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.

The ES|QL query shows that the average memory usage for both services is fairly high.

We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.

Actionable alerts with ES|QL

Because we suspect this issue might recur, we create an alert from the ES|QL query we just ran to track any service that exceeds 50% memory utilization.

We modify the last query to find any service with high memory usage:

FROM metrics*
| WHERE @timestamp >= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name
| WHERE avg_memory_usage > .5

With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We simply connect this to PagerDuty, but we can choose from multiple connectors such as ServiceNow, Opsgenie, and email.

With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.
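To make the alert condition concrete, here is a minimal Python sketch of what the query's STATS ... BY and WHERE steps compute. The sample rows and their values are invented for illustration; only the field names come from the query above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric samples: field names mirror the ES|QL query, values are invented.
samples = [
    {"kubernetes.deployment.name": "emailservice", "kubernetes.pod.memory.usage.limit.pct": 0.82},
    {"kubernetes.deployment.name": "emailservice", "kubernetes.pod.memory.usage.limit.pct": 0.74},
    {"kubernetes.deployment.name": "cartservice", "kubernetes.pod.memory.usage.limit.pct": 0.21},
    {"kubernetes.deployment.name": "cartservice", "kubernetes.pod.memory.usage.limit.pct": 0.35},
]


def deployments_over_threshold(rows, threshold=0.5):
    """Average memory usage per deployment and keep only averages above the threshold."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["kubernetes.deployment.name"]].append(
            row["kubernetes.pod.memory.usage.limit.pct"]
        )
    # Equivalent of: STATS avg_memory_usage = AVG(...) BY kubernetes.deployment.name
    averages = {name: mean(values) for name, values in grouped.items()}
    # Equivalent of: WHERE avg_memory_usage > .5
    return {name: avg for name, avg in averages.items() if avg > threshold}


breaching = deployments_over_threshold(samples)
print(breaching)
```

Any deployment returned here corresponds to a row returned by the ES|QL query, which is what triggers the alert.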

Make the most of your data with ES|QL

In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:

ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.
SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.
Actionable alerts can be easily created from a single ES|QL query, enhancing operations.

Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try it today at https://ela.st/free-trial, now in technical preview.

Elastic Observability Tour

The power of effective log management

Transforming Observability with the AI Assistant

ES|QL announcement blog

Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1

Wed, 25 Sep 2024 00:00:00 GMT

Introduction:

The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.

In Part 1 of this blog, we will cover the following:

Review the techniques and tools we have available to manage PII in our logs
Understand the roles of NLP / NER in PII detection
Build a composable processing pipeline to detect and assess PII
Sample logs and run them through the NER Model
Assess the results of the NER Model

In Part 2 of this blog, we will cover the following:

Redact PII using NER and the redact processor
Apply field-level security to control access to the un-redacted data
Enhance the dashboards and alerts
Production considerations and scaling
How to run these processes on incoming or historical data

Here is the overall flow we will construct over the 2 blogs:

All code for this exercise can be found at: https://github.com/bvader/elastic-pii.

Tools and Techniques

There are four general capabilities that we will use for this exercise.

Named Entity Recognition (NER) Detection
Pattern Matching Detection
Log Sampling
Ingest Pipelines as Composable Processing

Named Entity Recognition (NER) Detection

NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:

Person: Names of individuals, including celebrities, politicians, and historical figures.
Organization: Names of companies, institutions, and organizations.
Location: Geographic locations, including cities, countries, and landmarks.
Event: Names of events, including conferences, meetings, and festivals.

For our PII use case, we will choose the base BERT NER model bert-base-NER, which can be downloaded from Hugging Face and loaded into Elasticsearch as a trained model.

Important Note: NER / NLP models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full log volume through the NER model. We will discuss the performance and scaling of the NER model in Part 2 of the blog.

Pattern Matching Detection

In addition to using NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch redact processor is built for this use case.

Log Sampling

Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.

Ingest Pipelines as Composable Processing

We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.

Building the Processing Flow

Logs Sampling + Composable Ingest Pipelines

The first thing we will do is set up a sampler to sample our logs. This ingest pipeline takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows a sampling rate as low as ~0.01%, and marks the sampled logs with sample.sampled: true. Further processing on the logs will be driven by the value of sample.sampled. The sample.sample_rate can be set here or "passed in" from the orchestration pipeline.

Run the following commands from Kibana -> Dev Tools.

The code can be found here for the following three sections of code.

logs-sampler pipeline code

# logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  "processors": [
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "if": "ctx.sample.sample_rate == null",
        "field": "sample.sample_rate",
        "value": 10000
      }
    },
    {
      "set": {
        "description": "Determine if keeping unsampled docs",
        "if": "ctx.sample.keep_unsampled == null",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "set": {
        "field": "sample.sampled",
        "value": false
      }
    },
    {
      "script": {
        "source": """ Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); """,
        "params": {
          "max": 10000
        }
      }
    },
    {
      "set": {
        "if": "ctx.sample.random < ctx.sample.sample_rate",
        "field": "sample.sampled",
        "value": true
      }
    },
    {
      "drop": {
        "description": "Drop unsampled document if applicable",
        "if": "ctx.sample.keep_unsampled == false && ctx.sample.sampled == false"
      }
    }
  ]
}
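To see how the 0 to 10000 scale maps to a sampling percentage, the sampling decision can be simulated outside Elasticsearch. This is a minimal Python sketch, not the pipeline itself. It assumes a strict "less than" comparison, so a rate of 0 samples nothing and a rate of 10000 samples everything. The function names and the fixed seed are illustrative choices.

```python
import random

MAX_RATE = 10000  # same scale as the pipeline: 10000 means "sample everything"


def is_sampled(sample_rate, rng):
    """Draw an integer in [0, MAX_RATE) and sample the document if it falls below the rate."""
    return rng.randrange(MAX_RATE) < sample_rate


def sampled_fraction(sample_rate, n_docs=100_000, seed=42):
    """Estimate the fraction of documents that would be marked sample.sampled: true."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    hits = sum(is_sampled(sample_rate, rng) for _ in range(n_docs))
    return hits / n_docs


for rate in (0, 1, 1000, 10000):
    print(rate, sampled_fraction(rate))
```

A rate of 1000 lands close to 10%, and a rate of 1 corresponds to roughly 0.01%, the finest granularity the scale allows.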

Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the logs@custom ingest pipeline that will be automatically called using the logs data stream framework for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.

Next, we will create the process-pii pipeline. This is the core processing pipeline where we will orchestrate the PII processing component pipelines. In this first step, we will simply apply the sampling logic. Note that we are setting the sampling rate to 1000, which is equivalent to 10% of the logs.

process-pii pipeline code

# Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    }
  ]
}

Finally, we create the logs@custom pipeline, which will simply call our process-pii pipeline based on the correct data_stream.dataset.

logs@custom pipeline code

# logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "set": {
        "field": "pipelinetoplevel",
        "value": "logs@custom"
      }
    },
    {
      "set": {
        "field": "pipelinetoplevelinfo",
        "value": "{{{data_stream.dataset}}}"
      }
    },
    {
      "pipeline": {
        "description" : "Call the process-pii pipeline on the correct dataset",
        "if": "ctx?.data_stream?.dataset == 'pii'",
        "name": "process-pii"
      }
    }
  ]
}
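Before loading any data, one way to check the wiring is the ingest simulate API (POST _ingest/pipeline/<name>/_simulate), which runs sample documents through a pipeline without indexing them. The Python sketch below only builds and prints a request body that could be pasted into Dev Tools. The helper name and the log lines are hypothetical.

```python
import json


def build_simulate_request(messages, dataset="pii"):
    """Build a body for the ingest simulate API with one document per log message."""
    docs = [
        {"_source": {"message": message, "data_stream": {"dataset": dataset}}}
        for message in messages
    ]
    return {"docs": docs}


# Hypothetical log lines, used only to exercise the pipelines.
body = build_simulate_request(["hypothetical log line 1", "hypothetical log line 2"])

print("POST _ingest/pipeline/logs@custom/_simulate")
print(json.dumps(body, indent=2))
```

Each document in the response should carry the sample.* fields set by the sampler, and documents whose dataset is not pii should pass through without them.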

Now, let's test to see the sampling at work.

Load the data as described in the Data Loading Appendix. Let's use the sample data first; we will talk about how to test with your incoming or historical logs at the end of this blog.

If you look at Observability -> Logs -> Logs Explorer with the KQL filter data_stream.dataset : pii and Breakdown by sample.sampled, you should see that approximately 10% of the logs are sampled.

At this point we have a composable ingest pipeline that is "sampling" logs. As a bonus, you can use this logs sampler for any other use cases you have as well.

Loading, Configuration, and Execution of the NER Pipeline

Loading the NER Model

You will need a Machine Learning node to run the NER model on. In this exercise, we are using Elastic Cloud Hosted Deployment on AWS with the CPU Optimized (ARM) architecture. The NER inference will run on a Machine Learning AWS c5d node. There will be GPU options in the future, but today, we will stick with CPU architecture.

This exercise will use a single c5d node with 8 GB RAM and 4.2 vCPU, up to 8.4 vCPU.

Please refer to the official documentation on how to import an NLP-trained model into Elasticsearch for complete instructions on uploading, configuring, and deploying the model.

The quickest way to get the model is using the Eland Docker method.

The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.

docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

Deploy and Start the NER Model

In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.

To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available here. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.

To deploy and start the NER model, we will use the Start trained model deployment API.

We will configure the following:

4 Allocations to allow for more parallel ingestion
1 Thread per Allocation
0 Bytes Cache, as we expect a low cache hit rate
8192 Queue

# Start the model with 4 Allocations x 1 Thread, no cache, and 8192 queue
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&number_of_allocations=4&threads_per_allocation=1&queue_capacity=8192

You should get a response that looks something like this.

{
  "assignment": {
    "task_parameters": {
      "model_id": "dslim__bert-base-ner",
      "deployment_id": "dslim__bert-base-ner",
      "model_bytes": 430974836,
      "threads_per_allocation": 1,
      "number_of_allocations": 4,
      "queue_capacity": 8192,
      "cache_size": "0",
      "priority": "normal",
      "per_deployment_memory_bytes": 430914596,
      "per_allocation_memory_bytes": 629366952
    },
...
    "assignment_state": "started",
    "start_time": "2024-09-23T21:39:18.476066615Z",
    "max_assigned_allocations": 4
  }
}

The NER model has been deployed and started and is ready to be used.

The following ingest pipeline implements the NER model via the inference processor.

There is a significant amount of code here, but only two items are of interest right now. The rest of the code is conditional logic to drive some additional specific behavior that we will look at more closely in the future.

The inference processor calls the NER model, which we loaded previously, by ID and passes it the text to be analyzed. In this case, that is the message field, which is mapped to the text_field the NER model expects.
The script processor loops through the message field and uses the data generated by the NER model to replace the identified PII with redacted placeholders. This looks more complex than it really is, as it simply loops through the array of ML predictions and replaces them in the message string with constants, and stores the results in a new field redact.message. We will look at this a little closer in the following steps.
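The replacement loop in the script processor can be sketched in a few lines of Python. This mirrors the Painless logic only as an illustration: the entity dictionaries use the same keys as ml.inference.entities (entity, class_name, class_probability), while the function name and the shortened sample message are hypothetical.

```python
def redact_entities(message, entities, skip_entity="NONE", minimum_score=0.0):
    """Replace each NER-detected entity with a placeholder, as the script processor does."""
    redacted = message
    for item in entities:
        if item["class_name"] != skip_entity and item["class_probability"] >= minimum_score:
            placeholder = "<REDACTNER-" + item["class_name"] + "_NER>"
            # Like the Painless String.replace, this replaces every occurrence of the text.
            redacted = redacted.replace(item["entity"], placeholder)
    return redacted


message = "Payment successful for order #4594 (user: Paul Buck). Ordered from: Costco"
entities = [
    {"entity": "Paul Buck", "class_name": "PER", "class_probability": 0.999},
    {"entity": "Costco", "class_name": "ORG", "class_probability": 0.595},
]

print(redact_entities(message, entities))
print(redact_entities(message, entities, minimum_score=0.6))
```

With a minimum score of 0, both entities are replaced. Raising it to 0.6 leaves the low-confidence ORG match untouched, which is how redact.ner.minimum_score controls the behavior.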

The code can be found here for the following three sections of code.

The NER PII Pipeline

logs-ner-pii-processor pipeline code

# NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  "processors": [
    {
      "set": {
        "description": "Set to true to actually redact, false will run processors but leave original",
        "field": "redact.enable",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set to true to keep ml results for debugging",
        "field": "redact.ner.keep_result",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set to PER, LOC, ORG to skip, or NONE to not drop any replacement",
        "field": "redact.ner.skip_entity",
        "value": "NONE"
      }
    },
    {
      "set": {
        "description": "Minimum class_probability (0 to 1) an entity needs in order to be redacted",
        "field": "redact.ner.minimum_score",
        "value": 0
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message == null",
        "field": "redact.message",
        "copy_from": "message"
      }
    },
    {
      "set": {
        "field": "redact.ner.successful",
        "value": true
      }
    },
    {
      "set": {
        "field": "redact.ner.found",
        "value": false
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        },
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_NER_FAILED"
            }
          },
          {
            "set": {
              "field": "redact.ner.successful",
              "value": false
            }
          }
        ]
      }
    },
    {
      "script": {
        "if": "ctx.failure != 'REDACT_NER_FAILED'",
        "lang": "painless",
        "source": """String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
            if ((item['class_name'] != ctx.redact.ner.skip_entity) &&
                (item['class_probability'] >= ctx.redact.ner.minimum_score)) {
              msg = msg.replace(item['entity'], '<' +
                'REDACTNER-' + item['class_name'] + '_NER>');
            }
          }
          ctx.redact.message = msg;""",
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_REPLACEMENT_SCRIPT_FAILED",
              "override": false
            }
          },
          {
            "set": {
              "field": "redact.ner.successful",
              "value": false
            }
          }
        ]
      }
    },
    
    {
      "set": {
        "if": "ctx?.ml?.inference?.entities.size() > 0", 
        "field": "redact.ner.found",
        "value": true,
        "ignore_failure": true
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.pii?.found == null",
        "field": "redact.pii.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.ner?.found == true",
        "field": "redact.pii.found",
        "value": true
      }
    },
    {
      "remove": {
        "if": "ctx.redact.ner.keep_result != true",
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "GENERAL_FAILURE",
        "override": false
      }
    }
  ]
}

The updated PII Processor Pipeline, which now calls the NER Pipeline

process-pii pipeline code

# Updated Process PII pipeline that now calls the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    }
  ]
}
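The condition on the last pipeline processor decides which documents reach the NER model. As a minimal sketch, the same boolean logic in Python (the function name is illustrative) makes the three outcomes explicit.

```python
def should_run_pii_processing(sample_enabled, sampled):
    """Mirror the pipeline condition: run on everything when sampling is off,
    otherwise only on documents the sampler marked as sampled."""
    return (not sample_enabled) or (sample_enabled and sampled)


for enabled in (False, True):
    for sampled in (False, True):
        print(f"sample.enabled={enabled} sample.sampled={sampled} -> "
              f"{should_run_pii_processing(enabled, sampled)}")
```

Only the combination of sampling enabled and a document not sampled skips the NER pipeline, which is what keeps the CPU-intensive inference limited to the sampled share of logs.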

Now reload the data as described in Reloading the logs.

Results

Let's take a look at the results with the NER processing in place. In the Logs Explorer KQL query bar, execute the following query: data_stream.dataset : pii and ml.inference.entities.class_name : ("PER" and "LOC" and "ORG")

Logs Explorer should look something like this. Open the top message to see the details.

NER Model Results

Let's take a closer look at what these fields mean.

Field: ml.inference.entities.class_name
Sample Value: [PER, PER, LOC, ORG, ORG]
Description: An array of the named entity classes that the NER model has identified.

Field: ml.inference.entities.class_probability
Sample Value: [0.999, 0.972, 0.896, 0.506, 0.595]
Description: The class_probability is a value between 0 and 1 that indicates how likely it is that a given data point belongs to a certain class. The higher the number, the higher the probability that the data point belongs to the named class. This is important because in the next blog we can decide on a threshold to use for alerting and redaction. You can see that in this example the model identified a LOC as an ORG; we can find and filter out such cases by setting a threshold.

Field: ml.inference.entities.entity
Sample Value: [Paul Buck, Steven Glens, South Amyborough, ME, Costco]
Description: The array of entities identified that align positionally with the class_name and class_probability.

Field: ml.inference.predicted_value
Sample Value: [2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&Steven+Glens), [South Amyborough](LOC&South+Amyborough), [ME](ORG&ME) 93580, Ordered from: [Costco](ORG&Costco)
Description: The predicted value of the model.
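Because the three arrays align positionally, they can be zipped together to inspect each detection with its class and score. The Python sketch below uses the sample values shown above. The 0.6 threshold is a hypothetical choice for illustration, not a recommendation from this blog.

```python
# Sample values taken from the field descriptions above.
class_names = ["PER", "PER", "LOC", "ORG", "ORG"]
class_probabilities = [0.999, 0.972, 0.896, 0.506, 0.595]
entities = ["Paul Buck", "Steven Glens", "South Amyborough", "ME", "Costco"]


def split_by_threshold(names, probabilities, values, threshold):
    """Pair up the positional arrays and separate confident from uncertain detections."""
    confident, uncertain = [], []
    for name, probability, value in zip(names, probabilities, values):
        target = confident if probability >= threshold else uncertain
        target.append((value, name, probability))
    return confident, uncertain


confident, uncertain = split_by_threshold(class_names, class_probabilities, entities, 0.6)
print("confident:", confident)
print("uncertain:", uncertain)
```

With this threshold, the misclassified "ME" (a location tagged as ORG with a score of 0.506) falls into the uncertain group, illustrating how a score cutoff can separate weak detections from strong ones.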

PII Assessment Dashboard

Let's take a quick look at a dashboard built to assess the PII data.

To load the dashboard, go to Kibana -> Stack Management -> Saved Objects and import the pii-dashboard-part-1.ndjson file that can be found here:

https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson

More complete instructions on Kibana Saved Objects can be found here.

After loading the dashboard, navigate to it and select the right time range, and you should see something like the image below. It shows metrics such as sample rate, percent of logs with NER, NER score trends, etc. We will examine the assessment and actions in Part 2 of this blog.

Summary and Next Steps

In this first part of the blog, we have accomplished the following.

Reviewed the techniques and tools we have available for PII detection and assessment
Reviewed NLP / NER role in PII detection and assessment
Built the necessary composable ingest pipelines to sample logs and run them through the NER Model
Reviewed the NER results and are ready to move to the second blog

In the upcoming Part 2 of this blog, we will cover the following:

Redact PII using NER and redact processor
Apply field-level security to control access to the un-redacted data
Enhance the dashboards and alerts
Production considerations and scaling
How to run these processes on incoming or historical data

Data Loading Appendix

Code

The data loading code can be found here:

https://github.com/bvader/elastic-pii

$ git clone https://github.com/bvader/elastic-pii.git

Creating and Loading the Sample Data Set

$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker

Run the log generator

$ python generate_random_logs.py

If you do not change any parameters, this will create 10000 random logs in a file named pii.log with a mix of logs that contain and do not contain PII.

Edit load_logs.py and set the following

# The Elastic User 
ELASTIC_USER = "elastic"

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "askdjfhasldfkjhasdf"

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = "deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ="

Then run the following command.

$ python load_logs.py

Reloading the logs

Note: To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise, and the logs will be reloaded (actually loaded again). The new logs will not collide with previous runs, as there will be a unique run.id for each run, which is displayed at the end of the loading process.

$ python load_logs.py

Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2

Tue, 22 Oct 2024 00:00:00 GMT

Introduction:

In Part 1 of this blog, we covered the following:

Review the techniques and tools we have available to manage PII in our logs
Understand the roles of NLP / NER in PII detection
Build a composable processing pipeline to detect and assess PII
Sample logs and run them through the NER Model
Assess the results of the NER Model

In Part 2 of this blog, we will cover the following:

Apply the redact regex pattern processor and assess the results
Create Alerts using ES|QL
Apply field-level security to control access to the un-redacted data
Production considerations and scaling
How to run these processes on incoming or historical data

Reminder of the overall flow we will construct over the 2 blogs:

All code for this exercise can be found at: https://github.com/bvader/elastic-pii.

Part 1 Prerequisites

This blog picks up where Part 1 of this blog left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.

Loaded and configured NER Model
Installed all the composable ingest pipelines from Part 1 of the blog
Installed dashboard

You can access the complete solution for Blog 1 here. Don't forget to load the dashboard, found here.

Applying the Redact Processor

Next, we will apply the redact processor. The redact processor is a simple regex-based processor that takes a list of regex patterns, looks for them in a field, and replaces them with literals when found. The redact processor is reasonably performant and can run at scale. We will discuss this in detail in the production scaling section at the end.

Elasticsearch comes packaged with a number of useful predefined patterns that can be conveniently referenced by the redact processor. If one does not suit your needs, create a new pattern with a custom definition. The Redact processor replaces every occurrence of a match. If there are multiple matches, they will all be replaced with the pattern name.

In the code below, we leverage some of the predefined patterns and construct several custom patterns.

        "patterns": [
          "%{EMAILADDRESS:EMAIL_REGEX}",      << Predefined
          "%{IP:IP_ADDRESS_REGEX}",           << Predefined
          "%{CREDIT_CARD:CREDIT_CARD_REGEX}", << Custom
          "%{SSN:SSN_REGEX}",                 << Custom
          "%{PHONE:PHONE_REGEX}"              << Custom
        ]

We also replaced the PII with easily identifiable patterns we can use for assessment.

In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many "secrets" patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.
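To get a feel for what the custom patterns match, they can be tried out with Python's re module before putting them in a pipeline. This is only an approximation: the redact processor uses Grok patterns inside Elasticsearch, not Python's regex engine, and the sketch applies the patterns one after another. The sample text is invented, and the `<REDACTPROC-` prefix is an assumption chosen to match the REDACTPROC marker used for assessment.

```python
import re

# Custom pattern definitions copied from the redact processor configuration.
PATTERNS = {
    "CREDIT_CARD_REGEX": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN_REGEX": r"\d{3}-\d{2}-\d{4}",
    "PHONE_REGEX": r"(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
}


def redact(text, prefix="<REDACTPROC-", suffix=">"):
    """Replace every match of each pattern with a named placeholder."""
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, prefix + name + suffix, text)
    return text


# Invented sample text containing one value per custom pattern.
sample = "card 1234-5678-9012-3456, ssn 123-45-6789, phone 726-632-0527"
redacted = redact(sample)
print(redacted)
```

Note that the PHONE pattern begins with optional separators, including `\s?`, so a match can swallow the space in front of a phone number. Testing patterns against representative log lines like this helps catch such surprises early.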

The code can be found here for the following two sections of code.

redact processor pipeline code

# Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  "processors": [
    {
      "set": {
        "field": "redact.proc.successful",
        "value": true
      }
    },
    {
      "set": {
        "field": "redact.proc.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message == null",
        "field": "redact.message",
        "copy_from": "message"
      }
    },
    {
      "redact": {
        "field": "redact.message",
        "prefix": "<REDACTPROC-",
        "suffix": ">",
        "patterns": [
          "%{EMAILADDRESS:EMAIL_REGEX}",
          "%{IP:IP_ADDRESS_REGEX}",
          "%{CREDIT_CARD:CREDIT_CARD_REGEX}",
          "%{SSN:SSN_REGEX}",
          "%{PHONE:PHONE_REGEX}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": """\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}""",
          "SSN": """\d{3}-\d{2}-\d{4}""",
          "PHONE": """(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"""
        },
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_PROCESSOR_FAILED",
              "override": false
            }
          },
          {
            "set": {
              "field": "redact.proc.successful",
              "value": false
            }
          }
        ]
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message.contains('REDACTPROC')",
        "field": "redact.proc.found",
        "value": true
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.pii?.found == null",
        "field": "redact.pii.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.proc?.found == true",
        "field": "redact.pii.found",
        "value": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "GENERAL_FAILURE",
        "override": false
      }
    }
  ]
}

And now, we will add the logs-pii-redact-processor pipeline to the overall process-pii pipeline

process-pii pipeline code

# Updated Process PII pipeline that now calls the NER and Redact Processor pipelines
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-pii-redact-processor"
      }
    }
  ]
}

Reload the data as described in the Reloading the logs. If you have not generated the logs the first time, follow the instructions in the Data Loading Appendix

Go to Discover, enter sample.sampled : true and redact.message: REDACTPROC into the KQL bar, and add redact.message to the table. You should see something like this.

If you have not already loaded the dashboard from Part 1 of this blog, load it now. It can be found here; import it using Kibana -> Stack Management -> Saved Objects -> Import.

It should look something like this now. Note that the REGEX portions of the dashboard are now active.

Checkpoint

At this point, we have the following capabilities:

Ability to sample incoming logs and apply this PII redaction
Detect and Assess PII with the NER/NLP and Pattern Matching
Assess the amount, type and quality of the PII detections

This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.

Clean up the working and unredacted data
Update the Dashboard to work with the cleaned-up data
Apply Role Based Access Control to protect the raw unredacted data
Create Alerts
Production and Scaling Considerations
How to run these processes on incoming or historical data

Applying to Production Systems

Cleanup working data and update the dashboard

And now we will add the cleanup code to the overall process-pii pipeline.

In short, we set a flag redact.enable: true that directs the pipeline to move the unredacted message field to raw.message and then move the redacted message field redact.message to the message field. We will "protect" raw.message in the following section.

NOTE: Of course you can change this behavior if you want to completely delete the unredacted data. In this exercise we will keep it and protect it.

In addition we set redact.cleanup: true to clean up the NLP working data.

These fields allow a lot of control over what data you decide to keep and analyze.

The code for the following two sections can be found here.


# Updated process-pii pipeline that now calls the NER and redact processor pipelines and cleans up the working data
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set sampling rate: 0 = none, 10000 = all; allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true &&  ctx.sample.sampled == true)",
        "name": "logs-pii-redact-processor"
      }
    },
    {
      "set": {
        "description": "Set to true to actually redact, false will run processors but leave original",
        "field": "redact.enable",
        "value": true
      }
    },
    {
      "rename": {
        "if": "ctx?.redact?.pii?.found == true && ctx?.redact?.enable == true",
        "field": "message",
        "target_field": "raw.message"
      }
    },
    {
      "rename": {
        "if": "ctx?.redact?.pii?.found == true && ctx?.redact?.enable == true",
        "field": "redact.message",
        "target_field": "message"
      }
    },
    {
      "set": {
        "description": "Set to true to clean up working data",
        "field": "redact.cleanup",
        "value": true
      }
    },
    {
      "remove": {
        "if": "ctx?.redact?.cleanup == true",
        "field": [
          "ml"
        ],
        "ignore_failure": true
      }
    }
  ]
}
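
To make the effect of the rename and remove steps above concrete, here is a minimal Python sketch that mimics them on a plain dictionary. It only illustrates the intended document shape and is not how Elasticsearch executes ingest pipelines; the function name apply_cleanup, the sample document, and the <REDACTED> placeholder are hypothetical.

```python
import copy


def apply_cleanup(doc, redact_enable=True, redact_cleanup=True):
    """Mimic the rename/remove steps of the process-pii pipeline on a dict.

    Illustration only: Elasticsearch runs these steps as ingest processors.
    """
    out = copy.deepcopy(doc)
    redact = out.setdefault("redact", {})
    redact["enable"] = redact_enable
    redact["cleanup"] = redact_cleanup

    pii_found = redact.get("pii", {}).get("found") is True
    if pii_found and redact_enable:
        # rename: message -> raw.message
        out.setdefault("raw", {})["message"] = out.pop("message")
        # rename: redact.message -> message
        out["message"] = redact.pop("message")

    if redact_cleanup:
        # remove the NLP working data under "ml"; a missing field is ignored
        out.pop("ml", None)
    return out


# Hypothetical sample document, shaped like the fields used in this blog.
sample = {
    "message": "Payment failed for jane.doe@example.com",
    "redact": {
        "message": "Payment failed for <REDACTED>",
        "pii": {"found": True},
    },
    "ml": {"inference": {"note": "NER working data"}},
}

cleaned = apply_cleanup(sample)
print(cleaned["message"])         # redacted text
print(cleaned["raw"]["message"])  # original text, to be protected with RBAC
```

Documents where redact.pii.found is not true keep their original message field, which matches the conditionals on the two rename processors.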

Reload the data as described in Reloading the logs.

Go to Discover, enter sample.sampled : true and redact.pii.found: true into the KQL bar, and add the following fields to the table:

message,raw.message,redact.ner.found,redact.proc.found,redact.pii.found

You should see something like this

We have everything we need to move forward with protecting the PII and Alerting on it.

Load up the new dashboard that works on the cleaned-up data

To load the dashboard, go to Kibana -> Stack Management -> Saved Objects and import the pii-dashboard-part-2.ndjson file that can be found here.

The new dashboard should look like this. Note: It uses different fields under the covers since we have cleaned up the underlying data.

You should see something like this

Apply Role Based Access Control to protect the raw unredacted data

Elasticsearch natively supports role-based access control, including field- and document-level access control, which dramatically reduces the operational and maintenance complexity required to secure our application.

We will create a Role that does not allow access to the raw.message field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the message field, but will not be able to access the protected raw.message field.

NOTE: Since we only sampled 10% of the data in this exercise, the non-sampled message fields are not moved to raw.message, so they are still viewable. Still, this shows the capability you can apply in a production system.

The code for the following section can be found here.


# Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 "cluster": [],
 "indices": [
   {
     "names": [
       "logs-*"
     ],
     "privileges": [
       "read",
       "view_index_metadata"
     ],
     "field_security": {
       "grant": [
         "*"
       ],
       "except": [
         "raw.message"
       ]
     },
     "allow_restricted_indices": false
   }
 ],
 "applications": [
   {
     "application": "kibana-.kibana",
     "privileges": [
       "all"
     ],
     "resources": [
       "*"
     ]
   }
 ],
 "run_as": [],
 "metadata": {},
 "transient_metadata": {
   "enabled": true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 "password" : "mypassword",
 "roles" : [ "protect-pii" ],
 "full_name" : "Stephen Brown"
}
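
As a rough mental model of the field_security block in the role above, the following Python sketch filters a flattened document with the same grant and except patterns. This is a simplified illustration with a hypothetical helper name (visible_fields); the real enforcement happens inside Elasticsearch and covers more cases than this sketch.

```python
from fnmatch import fnmatchcase


def visible_fields(flat_doc, grant, except_):
    """Return only the fields a role may read, given grant/except patterns.

    Simplified illustration on flattened field names; not Elasticsearch code.
    """
    def allowed(name):
        granted = any(fnmatchcase(name, pattern) for pattern in grant)
        excluded = any(fnmatchcase(name, pattern) for pattern in except_)
        return granted and not excluded

    return {name: value for name, value in flat_doc.items() if allowed(name)}


# Hypothetical flattened document after the cleanup step.
doc = {
    "message": "Payment failed for <REDACTED>",
    "raw.message": "Payment failed for jane.doe@example.com",
    "redact.pii.found": True,
}

# Same patterns as the protect-pii role: grant everything except raw.message.
seen_by_stephen = visible_fields(doc, grant=["*"], except_=["raw.message"])
print(sorted(seen_by_stephen))
```

With these patterns the user sees message and redact.pii.found, while raw.message is filtered out.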

Now log in from a separate window as the new user stephen, who has the protect-pii role. Go to Discover, put redact.pii.found : true in the KQL bar, and add the message field to the table. Also, notice that raw.message is not available.

You should see something like this

Create an Alert when PII Detected

Now that the pipelines are processing the data, creating an alert when PII is detected is easy. If needed, review Alerting in Kibana in detail.

NOTE: Reload the data if needed to have recent data.

First, we will create a simple ES|QL query in Discover.

The code can be found here.

FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count > 0

When you run this you should see something like this.

Now click the Alerts menu and select Create search threshold rule. We will create an alert to notify us when PII is found.

Select a time field: @timestamp
Set the time window: 5 minutes

Assuming you loaded the data recently, when you run Test you should see something like:

pii_count : 343 Alerts generated query matched

Add an action when the alert is Active.

For each alert: On status changes Run when: Query matched

Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}

Add an Action for when the Alert is Recovered.

For each alert: On status changes Run when: Recovered

Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}

When everything is set up, it should look like this. Then click Save.

You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.

Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989

And then if you wait you will get a Recovered alert that looks like this.

Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989

Production Scaling

NER Scaling

As we mentioned in Part 1 of this blog, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.

Please review the setup and configuration of the NER model from Part 1 of the blog.

We chose the base BERT NER model bert-base-NER for our PII case.

The metrics below are related to the model and configuration from Part 1 of the blog.

4 Allocations to allow for more parallel ingestion
1 Thread per Allocation
0 Bytes Cache, as we expect a low cache hit rate. Note: If there are many repeated logs, cache can help, but with timestamps and other variations, cache will not help and can even slow down the process.
8192 Queue

GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           "node": {
              "0m4tq7tMRC2H5p5eeZoQig": {
.....
                "attributes": {
                  "xpack.installed": "true",
                  "region": "us-west-1",
                  "ml.allocated_processors": "5", << HERE 
.....
            },
            "inference_count": 5040,
            "average_inference_time_ms": 138.44285714285715, << HERE 
            "average_inference_time_ms_excluding_cache_hits": 138.44285714285715,
            "inference_cache_hit_count": 0,
.....
            "threads_per_allocation": 1,
            "number_of_allocations": 4,  <<< HERE
            "peak_throughput_per_minute": 1550,
            "throughput_last_minute": 1373,
            "average_inference_time_ms_last_minute": 137.55280407865988,
            "inference_cache_hit_count_last_minute": 0
          }
        ]
      }
    }

There are 3 key pieces of information above:

"ml.allocated_processors": "5" The number of physical cores / processors available.
"number_of_allocations": 4 The number of allocations, which is a maximum of 1 per physical core. Note: we could have used 5 allocations, but we only allocated 4 for this exercise.
"average_inference_time_ms": 138.44285714285715 The average inference time per document.

The throughput math for Inferences per Minute (IPM) per allocation (1 allocation per physical core) is pretty straightforward, since an inference uses a single core and a single thread.

Then the Inferences per Min per Allocation is simply:

IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435

Which then lines up with the total Inferences per Minute:

Total IPM = 435 IPM / allocation * 4 Allocations = ~1740

Suppose we want to do 10,000 IPM. How many allocations (cores) would we need?

Allocations = 10,000 IPM / 435 IPM per allocation = 23 allocations (cores, rounded up)

Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.

IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled

Then

Number of allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)

Want it faster? It turns out there is a more lightweight NER model, distilbert-NER, that is faster, but the tradeoff is a little less accuracy.

Running the logs through this model results in an inference time nearly twice as fast!

"average_inference_time_ms": 66.0263959390863

Here is some quick math: IPM per allocation = 60,000 ms (in a minute) / 66 ms per inference = 909

Suppose we want to do 25,000 IPM. How many allocations (cores) would we need?

Allocations = 25,000 IPM / 909 IPM per allocation = 28 allocations (cores, rounded up)

Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.
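
The sizing arithmetic above can be captured in a few lines of Python. This is a minimal sketch, assuming one allocation per core and one thread per allocation as in this exercise; the function names are hypothetical, and it uses the rounded 138 ms bert-base-NER inference time from the text.

```python
import math

MS_PER_MINUTE = 60_000


def ipm_per_allocation(avg_inference_ms):
    """Inferences per minute that one allocation (one core, one thread) can sustain."""
    return MS_PER_MINUTE / avg_inference_ms


def sampled_ipm(events_per_second, sample_fraction):
    """Inferences per minute produced by sampling an incoming log stream."""
    return events_per_second * 60 * sample_fraction


def allocations_needed(target_ipm, avg_inference_ms):
    """Allocations (cores) required to reach a target IPM, rounded up."""
    return math.ceil(target_ipm * avg_inference_ms / MS_PER_MINUTE)


# Rounded bert-base-NER average inference time used in the text above.
BERT_BASE_MS = 138

print(round(ipm_per_allocation(BERT_BASE_MS)))                    # 435
print(allocations_needed(10_000, BERT_BASE_MS))                   # 23
print(allocations_needed(sampled_ipm(5000, 0.01), BERT_BASE_MS))  # 7
```

Swap in the average_inference_time_ms reported by your own model stats, and your own EPS and sampling rate, to estimate the allocations you would need.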

Redact Processor Scaling

In short, the redact processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.

Assessing incoming logs

If you want to test on incoming log data in a data stream, all you need to do is change the conditional in the logs@custom pipeline to apply the process-pii pipeline to the dataset you want. You can use any conditional that fits your needs.

Note: Just make sure that you have accounted for the proper scaling of the NER and redact processors, as described above in Production Scaling.

    {
      "pipeline": {
        "description" : "Call the process_pii pipeline on the correct dataset",
        "if": "ctx?.data_stream?.dataset == 'pii'", <<< HERE
        "name": "process-pii"
      }
    }
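
The dataset tested in that conditional is the middle segment of the data stream name, which follows the type-dataset-namespace naming scheme (for example, logs-pii-default). As a small sketch, the hypothetical helper below derives the conditional string from a data stream name; it only formats text and does not call Elasticsearch.

```python
def dataset_condition(data_stream_name):
    """Build the logs@custom conditional for a type-dataset-namespace name.

    Hypothetical helper for illustration; it only formats a string.
    """
    parts = data_stream_name.split("-")
    if len(parts) != 3:
        raise ValueError(
            f"expected type-dataset-namespace, got {data_stream_name!r}"
        )
    _stream_type, dataset, _namespace = parts
    return f"ctx?.data_stream?.dataset == '{dataset}'"


print(dataset_condition("logs-pii-default"))
# ctx?.data_stream?.dataset == 'pii'
```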

So if, for example, your logs are coming into logs-mycustomapp-default, you would just change the conditional to

        "if": "ctx?.data_stream?.dataset == 'mycustomapp'",

Assessing historical data

If you have a historical (already ingested) data stream or index, you can run the assessment over it using the _reindex API.

Note: Just make sure that you have accounted for the proper scaling of the NER and redact processors, as described above in Production Scaling.

There are a couple of extra steps. The code can be found here.

First, we can set the parameters to keep ONLY the sampled data, as there is no reason to make a copy of all the unsampled data. The process-pii pipeline has a setting, sample.keep_unsampled, which we can set to false so that only the sampled data is kept.

    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": false <<< SET TO false
      }
    },

Second, we will create a pipeline that reroutes the data to the correct data stream so that it runs through all the PII assessment/detection pipelines. It also sets the correct dataset and namespace.

DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  "processors": [
    {
      "set": {
        "field": "data_stream.dataset",
        "value": "pii"
      }
    },
    {
      "set": {
        "field": "data_stream.namespace",
        "value": "default"
      }
    },
    {
      "reroute" : 
      {
        "dataset" : "{{data_stream.dataset}}",
        "namespace": "{{data_stream.namespace}}"
      }
    }
  ]
}

Finally, we can run a _reindex to select the data we want to test/assess. It is recommended to review the _reindex documentation before trying this. First, select the source data stream you want to assess; in this example, it is the logs-generic-default logs data stream. Note: I also added a range filter to select a specific time range. There is a bit of a "trick" that we need to use, since we are rerouting the data to the data stream logs-pii-default: we set "index": "logs-tmp-default" in the _reindex, and the correct data stream is then set in the pipeline. We must do this because reroute is a noop if it is called from/to the same data stream.

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "logs-generic-default",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now-1h/h",
                "lt": "now"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "op_type": "create",
    "index": "logs-tmp-default",
    "pipeline": "sendtopii"
  }
}
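
If you prefer to generate this request from a script, for example to assess several source data streams or time ranges, the body can be built programmatically. The sketch below only constructs and prints the JSON body shown above; the function name build_reindex_body and its defaults are hypothetical, and sending the request (via Dev Tools or a client of your choice) is left out.

```python
import json


def build_reindex_body(source_index, gte, lt,
                       tmp_index="logs-tmp-default",
                       target_stream="logs-pii-default",
                       pipeline="sendtopii"):
    """Build a _reindex request body like the one shown above (illustrative)."""
    if tmp_index == target_stream:
        # The article notes reroute is a noop from/to the same data stream,
        # so the destination must be a different, temporary name.
        raise ValueError("tmp_index must differ from the rerouted target stream")
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"@timestamp": {"gte": gte, "lt": lt}}}
                    ]
                }
            },
        },
        "dest": {
            "op_type": "create",
            "index": tmp_index,
            "pipeline": pipeline,
        },
    }


body = build_reindex_body("logs-generic-default", "now-1h/h", "now")
print(json.dumps(body, indent=2))
```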

Summary

At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.

The end state solution can be found here.

In Part 1 of this blog, we accomplished the following.

Reviewed the techniques and tools we have available for PII detection and assessment
Reviewed NLP / NER role in PII detection and assessment
Built the necessary composable ingest pipelines to sample logs and run them through the NER Model
Reviewed the NER results and are ready to move to the second blog

In Part 2 of this blog, we covered the following:

Redact PII using NER and redact processor
Apply field-level security to control access to the un-redacted data
Enhance the dashboards and alerts
Production considerations and scaling
How to run these processes on incoming or historical data

So get to work and reduce risk in your logs!

Data Loading Appendix

Code

The data loading code can be found here:

https://github.com/bvader/elastic-pii

$ git clone https://github.com/bvader/elastic-pii.git

Creating and Loading the Sample Data Set

$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker

Run the log generator

$ python generate_random_logs.py

If you do not change any parameters, this will create 10000 random logs in a file named pii.log with a mix of logs that contain and do not contain PII.

Edit load_logs.py and set the following

# The Elastic User 
ELASTIC_USER = "elastic"

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "askdjfhasldfkjhasdf"

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = "deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ="

Then run the following command.

$ python load_logs.py

Reloading the logs

$ python load_logs.py

Pruning incoming log volumes with Elastic

Fri, 23 Jun 2023 00:00:00 GMT

filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/log/*.log

filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_event:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              url.path: /profile

filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_fields:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              http.response.status_code: 200
      fields: ["event.message"]
      ignore_missing: false

input {
  file {
    id => "my-logging-app"
    path => [ "/var/tmp/other.log", "/var/log/*.log" ]
  }
}
filter {
  if [url][scheme] == "http" and [url][path] == "/profile" {
    drop {
      percentage => 80
    }
  }
}
output {
  elasticsearch {
        hosts => "https://my-elasticsearch:9200"
        data_stream => "true"
    }
}

# Input configuration omitted
filter {
  if [url][scheme] == "http" and [http][response][status_code] == 200 {
    drop {
      percentage => 80
    }
    mutate {
      remove_field => [ "[event][message]" ]
    }
  }
}
# Output configuration omitted

PUT _ingest/pipeline/my-logging-app-pipeline
{
  "description": "Event and field dropping for my-logging-app",
  "processors": [
    {
      "drop": {
        "description" : "Drop event",
        "if": "ctx?.url?.scheme == 'http' && ctx?.url?.path == '/profile'",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "description" : "Drop field",
        "field" : "event.message",
        "if": "ctx?.url?.scheme == 'http' && ctx?.http?.response?.status_code == 200",
        "ignore_failure": false
      }
    }
  ]
}

PUT _ingest/pipeline/my-logging-app-pipeline
{
  "description": "Event and field dropping for my-logging-app with failures",
  "processors": [
    {
      "drop": {
        "description" : "Drop event",
        "if": "ctx?.url?.scheme == 'http' && ctx?.url?.path == '/profile'",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "description" : "Drop field",
        "field" : "event.message",
        "if": "ctx?.url?.scheme == 'http' && ctx?.http?.response?.status_code == 200",
        "ignore_failure": false
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Set 'ingest.failure.message'",
        "field": "ingest.failure.message",
        "value": "Ingestion issue"
      }
    }
  ]
}

receivers:
  filelog:
    include: [/var/tmp/other.log, /var/log/*.log]
processors:
  filter/denylist:
    error_mode: ignore
    logs:
      log_record:
        - 'attributes["url.scheme"] == "info"'
        - 'attributes["url.path"] == "/profile"'
        - 'attributes["http.response.status_code"] == 200'
  attributes/errors:
    actions:
      - key: error.message
        action: delete
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
exporters:
  # Exporters configuration omitted
service:
  pipelines:
    # Pipelines configuration omitted

Root cause analysis with logs: Elastic Observability's anomaly detection and log categorization

Tue, 07 Feb 2023 00:00:00 GMT

With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time consuming given the tremendous amounts of data being generated. Traditional methods of alerting and simple pattern matching (visual or simple searching etc) are not sufficient for IT Operations teams and SREs. It’s like trying to find a needle in a haystack.

In this blog post, we’ll cover some of Elastic’s artificial intelligence for IT operations (AIOps) and machine learning (ML) capabilities for root cause analysis.

Elastic’s machine learning will help you investigate performance issues by providing anomaly detection and pinpointing potential root causes through time series analysis and log outlier detection. These capabilities will help you reduce time in finding that “needle” in the haystack.

Elastic’s platform enables you to get started on machine learning quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.

Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To help get you started, there are several key features built into Elastic Observability to aid in analysis, helping bypass the need to run specific ML models. These features help minimize the time and analysis for logs.

Let’s review some of these built-in ML features:

High-latency or erroneous transactions: Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. An overview of this capability is published here: APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions.

AIOps Labs: AIOps Labs provides two main capabilities using advanced statistical methods:

Log spike detector helps identify reasons for increases in log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.

In this blog, we will cover anomaly detection and log categorization against the popular “Hipster Shop app” developed by Google, and modified recently by OpenTelemetry.

Overviews of high-latency capabilities can be found here, and an overview of AIOps labs can be found here.

In this blog, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.
Utilize a version of the ever-so-popular Hipster Shop demo application. It was originally written by Google to showcase Kubernetes, and a multitude of variants are available, such as the OpenTelemetry Demo App. The Elastic version is found here.
Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: Independence with OTel in Elastic and Observability and security with OTel in Elastic. Additionally, review the OTel documentation in Elastic.
Look through an overview of Elastic Observability APM capabilities.
Look through our Anomaly detection documentation for logs and log categorization documentation.

In our example, we’ve introduced issues to help walk you through the root cause analysis features: anomaly detection and log categorization. You might have a different set of anomalies and log categorization depending on how you load the application and/or introduce specific issues.

As part of the walk-through, we’ll assume we are a DevOps or SRE managing this application in production.

Root cause analysis

While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.

How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:

Anomaly detection
Log categorization

Machine learning for anomaly detection

Elastic will detect anomalies based on historical patterns and identify a probability of these issues.

Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.

In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. More information on anomaly detection results can be found here. We can also dive deeper into the anomaly and analyze the details.

What you will see for the productCatalogService is that there is a severe spike in average transaction latency time, which is the anomaly that was detected in the service map. Elastic’s machine learning has identified a specific metric anomaly (shown in the single metric view). It’s likely that customers are potentially responding to the slowness of the site and that the company is losing potential transactions.

One step to take next is to review all the other potential anomalies that we saw in the service map in a larger picture. Use an anomaly explorer to view all the anomalies that have been identified.

Elastic is identifying numerous services with anomalies. productCatalogService has the highest score, and a good number of others (frontend, checkoutService, advertService, and more) also have high scores. However, this analysis is looking at just one metric.

Elastic can help detect anomalies across all types of data, such as Kubernetes data, metrics, and traces. If we analyze across all these types (via individual jobs we’ve created in Elastic machine learning), we will see a more comprehensive view as to what is potentially causing this latency issue.

Once all the potential jobs are selected and we’ve sorted by service.name, we can see that productCatalogService is still showing a high anomaly influencer score.

In addition to the chart giving us a visual of the anomalies, we can review all the potential anomalies. As you will notice, Elastic has also categorized these anomalies (see category examples column). As we scroll through the results, we notice a potential postgreSQL issue from the categorization, which also has a high 94 score. Machine learning has identified a “rare mlcategory,” meaning that it has rarely occurred, hence pointing to a potential cause of the issue customers are seeing.

We also notice that this issue is potentially caused by pgbench, a popular postgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes heavy load on the database host, likely causing the higher latency issues on the site.

While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.

Machine learning for log categorization

Elastic Observability’s service map has detected an anomaly, and in this part of the walk-through, we take a different approach by investigating the service details from the service map versus initially exploring the anomaly. When we explore the service details for productCatalogService, we see the following:

The service details are identifying several things:

There is abnormally high latency compared to the expected bounds of the service. We see that recently there was higher-than-normal latency (upward of 1s) compared to the average of 275ms.
There is also a high failure rate for the same time frame as the high latency (lower left chart, “Failed transaction rate”).
Additionally, we can see the transactions and one in particular /ListProduct has an abnormally high latency, in addition to a high failure rate.
We see productCatalogService has a dependency on postgreSQL.
We also see errors all related to postgreSQL.

We have an option to dig through the logs and analyze in Elastic or we can use a capability to identify the logs more easily.

If we go to Categories under Logs in Elastic Observability and search for postgresql.log to help identify postgresql logs that could be causing this error, we see that Elastic’s machine learning has automatically categorized the postgresql logs.

We notice two additional items:

There is a high-count category (message count of 23,797 with a high anomaly score of 70) related to pgbench (which is odd to see in production). Hence we search further for all pgbench-related logs in Categories.
We see an odd issue regarding terminating the connection (with a low count).

While investigating the second error, which is severe, we can see logs from Categories before and after the error.

This troubleshooting shows PostgreSQL having a FATAL error, the database shutting down prior to the error, and all connections terminating. Given the two immediate issues we identified, we have an idea that someone was running pgbench and this potentially overloaded the database, causing the latency issue that customers are seeing.

The next steps here could be to investigate anomaly detection and/or work with the developers to review the code and identify pgbench as part of the deployed configuration.

Conclusion

I hope you’ve gained an appreciation for how Elastic Observability can help you identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned:

Elastic Observability has numerous capabilities to help you reduce your time to find root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities in this blog:
1. Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.
2. Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped based on their messages and formats so that you can take action quicker.
You learned how easy it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand machine learning (which helps drive these features) or do any lengthy setup. Ready to get started? Register for Elastic Cloud and try out the features and capabilities I’ve outlined above.

Live logs and prosper: fixing a fundamental flaw in observability

Mon, 27 Oct 2025 00:00:00 GMT

SREs are often overwhelmed by dashboards and alerts that show what and where things are broken, but fail to reveal why. This industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in information-rich logs, but their massive volume and unstructured nature have led the industry to throw them aside or treat them like second-class citizens. As a result, SREs are forced to turn every investigation into a high-stress, time-consuming hunt for clues. We can solve this problem with logs, but unlocking their potential requires us to reimagine how we work with them and improve the overall investigation journey.

Observability, the broken promise

To see why the current model fails, let’s look at the all-too-familiar challenge every SRE dreads: knowing a problem exists but needing to spend valuable time just trying to find where to even start the investigation.

Imagine you get a Slack message from the support team: "a few high-value customers are reporting their payments are failing." You have no shortage of alerts, but most are just flagging symptoms. You don’t know where to start. You decide to check the logs to see if there is anything obvious, starting with the systems that have the high CPU alert.

You spend a few minutes searching and grep-ing through terabytes of logs for affected customer IDs, trying to piece together the problem. Nothing. You worry that you aren’t getting all the logs to reveal the problem, so you turn on more logging in the application. Now you’re knee-deep in data, desperately trying to find patterns, errors, or other "hints" that will give you a clue as to the why.

Finally, one of the broader log queries hits on an error code associated with an impacted customer ID. This is the first real clue. You pivot your search to this new error code and after an hour of digging, you finally uncover the error message. You've finally found the why, but it was a stressful, manual hunt that took far too much time and impacted dozens more customers.

This incident perfectly illustrates the broken promise of modern observability: The complete failure of the investigation process. Investigations are a manual, reactive process that SREs are forced into every day. At Elastic, we believe metrics, traces, and logs are all essential, but their roles, and the workflow between them, must be fundamentally re-imagined for effective investigations.

Observability is about having the clearest understanding possible of the what, where, and why. Metrics are essential for understanding the what. They are the heartbeat of your system, powering the dashboards and alerts that tell you when a threshold has been breached, like high CPU utilization or error rates. But they are aggregates; they show the symptom, rarely the root cause. Traces are good at identifying the where. They map the journey of a request through a distributed system, pinpointing the specific microservice or function where latency spikes or an error originates. Yet, their effectiveness hinges on complete and consistent code instrumentation, a constant dependency on development teams that can leave you with critical visibility gaps. Logs tell you the why. They contain all the rich, contextual, and unfiltered truth of an event. If we can more proactively and efficiently extract information from logs, we can greatly improve our overall understanding of our environments.

Challenges of logs in modern environments

While logs are in the standard toolbox, they have been neglected. SREs using today’s solutions deal with several major problems:

First, due to their unstructured nature, it’s very difficult to parse and manage logs so that they’re useful. As a result, many SRE teams spend a lot of time building and maintaining complex pipelines to help manage this process.

Second, logs can get expensive at high volume, which leads teams to drop them on the floor to control costs, throwing away valuable information in the process. Consequently, when an incident occurs, you waste precious time hunting for the right logs, and manually correlating across services.

Finally, nobody has built a log solution that proactively works to find the important signals in logs and to surface those critical whys to you when you need them. As a result, log-based investigations are too painful and slow.

Why are we here? As applications became more complex, log volume became unmanageable. Instead of solving this with automation, the industry took a shortcut: it gave up on getting the most out of logs and prioritized more manageable but less informative signals.

This decision is the origin of the broken, reactive model. It forced observability into a manual loop of 'observing' alerts, rather than building automation that could help us truly understand our systems to improve how we root cause and resolve issues. This has transformed SREs from investigators into full-time data wranglers, wrestling with Grok patterns and fragile ETL scripts instead of solving outages.

Introducing Streams to rethink how you use logs for investigations

Streams is an agentic AI solution that simplifies working with logs to help SRE teams rapidly understand the why behind an issue for faster resolution. The combination of Elasticsearch and AI is turning manual management of noisy logs into automated workflows that identify patterns, context, and meaning, marking a fundamental shift in observability.

Log everything in any format

By applying the Elasticsearch platform for context engineering to bring together retrieval and AI-driven parsing to keep up with schema changes, we are reimagining the entire log pipeline.

Streams ingests raw logs from all your sources to a single destination. It then uses AI to partition incoming logs into their logical components and parses them to extract relevant fields for an SRE to validate, approve, or modify. Imagine a world where you simply point your logs to a single endpoint, and everything just works. That means less wrestling with Grok patterns, configuring processors, and hunting for the right plugin, all of which significantly reduces complexity. Streams is a big step towards realizing that vision.

As a result, SREs are freed from managing complex ingestion pipelines, allowing them to spend less time on data wrangling and more time preventing service disruptions.

Solve incidents faster with Significant Events

Significant Events, a capability within Streams, uses AI to automatically surface major errors and anomalies, enabling you to be proactive in your investigations. So, instead of just combing through endless noise, you can focus on the events that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other significant signals of change. These events act as actionable markers, giving SREs early warning and clear focus to begin an investigation before service impact.

With this new foundation, logs will become your primary signal for investigation. The panicked, manual search for a needle in a digital haystack is about to be over. Significant Events acts like a smart metal detector that sifts through the chaos and only beeps when it finds issues, helping you to easily ignore all that hay and find the "needle" faster.

Now imagine the same scenario we started with. Instead of you starting a frantic, time-consuming grep through terabytes of logs, Streams has already done the heavy lifting. Its AI-driven analysis has detected a new, anomalous pattern that began before your support team even knew about it and automatically surfaced it as a significant event. Rather than you hunting for a clue, the clue finds you.

With a single click, you have the why: a Java out-of-memory error in a specific service component. This is your starting point. You find the root cause in under two minutes and begin remediation. The customer impact is stopped, the dev team gets the specific error, and the problem is contained before it can escalate. In this case, metrics and traces were unhelpful in finding the why. The answer was waiting in the logs all along.

This ideal outcome is possible because you can both afford to keep every log and instantly find the signal within them. Elastic's cost-efficient architecture with powerful compression, searchable snapshots, and data tiering makes full retention a reality. From there, Streams automatically surfaces the significant event, ensuring that the answer is never lost in the noise.

Elastic is the only company that provides an AI-driven log-first approach to elevate your observability signals and make it dramatically faster and easier to get to why. This is built on our decades of leadership in search, relevance, and powerful analytics that provides the foundation for understanding logs at a deep, semantic level.

The vision for Streams

The partitioning, parsing, and Significant Events you see today are just the starting point. The next step in our vision is to use Significant Events to automatically generate critical SRE artifacts. Imagine Streams creating intelligent alerts, on-the-fly investigation dashboards, and even data-driven SLOs based only on the events that actually impact service health. From there, the goal is to use AI to drive automated Root Cause Analysis (RCA) directly from log patterns and generate remediation runbooks, turning a multi-hour hunt into an instant resolution recommendation.

Once this AI-driven log foundation is in place, our vision for Streams expands to become a unified intelligence layer that operates across all your telemetry data. It’s not just about making each signal better in isolation, but about understanding the context and relationships between them to solve complex problems.

For metrics, Streams won’t just alert you to a single metric spike but will detect a correlated anomaly across multiple, seemingly unrelated metrics (e.g., p99 latency for a specific service, a rise in garbage collection time, and transaction success rate).

Similarly, for traces, it identifies when a new, unexpected service call (e.g., a new database or an external API) appears in a critical transaction path after a deployment, or when a specific span is suddenly responsible for a majority of errors across all traces, even if the overall error rate hasn't breached a threshold.

The goal is not to have separate streams for logs, metrics, and traces, but to weave them into a single narrative that automatically correlates all three signals. Ultimately, Streams is about fundamentally changing the goal from a human-led data-gathering exercise to proactive, AI-driven resolution.

For more on Streams:

Read the Streams launch blog

Look at the Streams website

Build better Service Level Objectives (SLOs) from logs and metrics

Fri, 23 Feb 2024 00:00:00 GMT

In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.

Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.

Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.

To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This feature lets you set measurable performance targets for services, such as availability, latency, traffic, errors, and saturation, or define your own. Key components include:

Defining and monitoring SLIs (Service Level Indicators)
Monitoring error budgets indicating permissible performance shortfalls
Alerting on burn rates showing error budget consumption

Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.

Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.

In this blog, we will outline the following:

What are SLOs? A Google SRE perspective
Several scenarios of defining and managing SLOs

Service Level Objective overview

Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in Google's SRE Handbook. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:

Service Level Indicators (SLIs): These are carefully selected metrics, such as uptime, latency, throughput, error rates, or other important metrics, that represent aspects of the service that matter from an operations or business perspective. Hence, an SLI is a measure of the service level provided (latency, uptime, etc.), and it is defined as a ratio of good over total events, with a range between 0% and 100%.
Service Level Objective (SLO): An SLO is the target value for a service level, measured as a percentage by an SLI. At or above the target, the service is compliant. As an example, if we use service availability as the SLI and set the target for successful responses at 99.9%, then any time failed responses exceed 0.1%, the SLO is out of compliance.
Error budget: This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO, that is, the quantity of errors that is tolerated.
Burn rate: This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.
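
The relationship between these terms can be sketched in a few lines of Python. This is a minimal illustration of the standard definitions (error budget as 100% minus the SLO, burn rate as the observed error rate divided by the error budget); the function names and the numbers are illustrative assumptions, not taken from Elastic or from the example scenarios in this blog.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events (or time) allowed to be bad: 100% minus the SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    return observed_error_rate / error_budget(slo_target)


def days_until_exhausted(rate: float, window_days: float) -> float:
    """At a constant burn rate, the budget for the window lasts this long."""
    return window_days / rate


# Illustrative numbers only.
target = 0.999            # 99.9% SLO
observed_errors = 0.005   # 0.5% of requests currently failing

rate = burn_rate(observed_errors, target)
print(f"error budget: {error_budget(target):.3%}")
print(f"burn rate: {rate:.1f}x")
print(f"30-day budget lasts {days_until_exhausted(rate, 30):.1f} days")
```

A burn rate above 1 means the budget will run out before the end of the SLO window if the error rate stays the same, which is why burn rate is a useful alerting signal.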

Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to Google's SRE Handbook.

One main thing to remember is that SLO monitoring is not incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.

In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.

Elastic®’s SLO capability is based directly on the Google SRE Handbook. All the definitions and semantics are used as described in Google’s SRE Handbook. Hence users can perform the following on SLOs in Elastic:

Define an SLO on an SLI such as KQL (log based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.
Utilize occurrence versus timeslice based budgeting. Occurrences uses the number of good events over the number of total events to compute the SLO. Timeslices break the overall time window into smaller slices of a defined duration and use the number of good slices over the total slices to compute the SLO. Timeslice targets are more accurate and useful when calculating things like a service’s SLO when trying to meet agreed upon customer targets.
Manage all the SLOs in a singular location.
Trigger alerts from the defined SLO, whether the SLI is off, burn rate is used up, or the error rate is X.
Create unique service level dashboards with SLO information for a more comprehensive view of the service.
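
To make the difference between the two budgeting methods concrete, here is a small Python sketch that computes both over the same made-up data. It assumes every slice contains at least one event and uses a hypothetical per-slice target of 95%; how empty slices are treated in a real system is not covered here.

```python
# Each tuple is one time slice: (good_events, total_events). Made-up data
# in which the third slice is a low-traffic period with many failures.
slices = [(1000, 1000), (1000, 1000), (10, 20), (1000, 1000)]


def occurrences_sli(slices):
    """Good events over total events across the whole window."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total


def timeslices_sli(slices, slice_target):
    """Good slices over total slices.

    A slice counts as good when its own good/total ratio meets slice_target.
    Assumes every slice has at least one event.
    """
    good_slices = sum(1 for g, t in slices if g / t >= slice_target)
    return good_slices / len(slices)


print(f"occurrences: {occurrences_sli(slices):.2%}")
print(f"timeslices:  {timeslices_sli(slices, slice_target=0.95):.2%}")
```

With occurrences, the bad low-traffic period is diluted by the busy periods around it. With timeslices, it counts as one full bad slice out of four, which is closer to how a customer would experience that period.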

SREs need to be able to manage business metrics.

SLOs based on logs: NGINX availability

Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.

Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a multi-tier app that has a web server layer (nginx), a processing layer, and a database layer.

Let’s say that your processing layer is managing a significant number of requests. You want to ensure that the service is up properly. The best way is to ensure that all http.response.status_code values are less than 500. Any value below 500 indicates that the service is up, and any errors (like 404) are user or client errors rather than server errors.

If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.

Additionally, the number of messages with http.response.status_code > 500 is minimal, about 17K.

Rather than creating an alert, we can create an SLO with this query:

We chose to use occurrences as the budgeting method to keep things simple.
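
As a rough check of what an occurrences-based SLO reports for these logs, the arithmetic can be sketched in Python using the approximate counts quoted above. The 99.5% target is a hypothetical value chosen for illustration; the target actually used is not stated here.

```python
# Approximate counts quoted above for the seven-day window.
total_events = 2_000_000   # "close to 2M" log messages
bad_events = 17_000        # roughly 17K server-error responses

# Hypothetical target, assumed for illustration only.
slo_target = 0.995

sli = (total_events - bad_events) / total_events
allowed_bad = total_events * (1 - slo_target)
budget_consumed = bad_events / allowed_bad

print(f"SLI: {sli:.2%}")
print(f"allowed bad events: {allowed_bad:,.0f}")
print(f"error budget consumed: {budget_consumed:.0%}")
```

Under this assumed target, fewer than 1% of requests failing is already enough to consume more than the whole error budget.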

Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, and error budget, and any specific alerts against the SLO.

Not only do we get information about the violation, but we also get:

Historical SLI (7 days)
Error budget burn down
Good vs. bad events (24 hours)

We can see how we’ve easily burned through our error budget.

Hence something must be going on with nginx. To investigate, all we need to do is utilize the AI Assistant, and use its natural language interface to ask questions to help analyze the situation.

Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.

As we can see, the number of 502s is minimal compared to the number of overall messages, but it is affecting our SLO.

However, it seems like Nginx is having an issue. In order to reduce the issue, we also ask the AI Assistant how to work on this error. Specifically, we ask if there is an internal runbook the SRE team has created.

AI Assistant gets a runbook the team has added to its knowledge base. I can now analyze and try to resolve or reduce the issue with nginx.

While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:

99% of requests occur under 200ms
99% of log messages are not errors

Application SLOs: OpenTelemetry demo cartservice

A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the OpenTelemetry demo.

This demo has feature flags to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.

Elastic supports OpenTelemetry by taking OTLP directly with no need for an Elastic specific agent. You can send in OpenTelemetry data directly from the application (through OTel libraries) and through the collector.

We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.

We can see that the cartservice’s availability is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.

As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.

We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.

Conclusion

SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:

SLOs can be based on logs. In Elastic, you can use KQL to essentially find and filter on specific logs and log fields to monitor and trigger SLOs.
AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.
APM Service based SLOs are easy to create and manage with integration to Elastic APM. We also use OTel telemetry to help monitor SLOs.

For more information on SLOs in Elastic, check out the Elastic documentation.

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.

Simplifying log data management: Harness the power of flexible routing with Elastic

Tue, 13 Jun 2023 00:00:00 GMT

In Elasticsearch 8.8, we’re introducing the reroute processor in technical preview that makes it possible to send documents, such as logs, to different data streams, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the data stream naming scheme. While optimized for data streams, the reroute processor also works with classic indices. This blog post contains examples on how to use the reroute processor that you can try on your own by executing the snippets in the Kibana dev tools.

Elastic Observability offers a wide range of integrations that help you to monitor your applications and infrastructure. These integrations are added as policies to Elastic agents, which help ingest telemetry into Elastic Observability. Several examples of these integrations include the ability to ingest logs from systems that send a stream of logs from different applications, such as Amazon Kinesis Data Firehose, Kubernetes container logs, and syslog. One challenge is that these multiplexed log streams are sending data to the same Elasticsearch data stream, such as logs-syslog-default. This makes it difficult to create parsing rules in ingest pipelines and dashboards for specific technologies, such as the ones from the Nginx and Apache integrations. That’s because in Elasticsearch, in combination with the data stream naming scheme, the processing and the schema are both encapsulated in a data stream.

The reroute processor helps you tease apart data from a generic data stream and send it to a more specific one. You may use that mechanism to send logs to a data stream that is set up by the Nginx integration, for example, so that the logs are parsed with that integration and you can use the integration’s prebuilt dashboards or create custom ones with the fields, such as the url, the status code, and the response time that the Nginx pipeline has parsed out of the Nginx log message. You can also split out/separate regular Nginx logs and errors with the reroute processor, providing further separation ability and categorization of logs.

Example use case

To use the reroute processor, first:

Ensure you are on Elasticsearch 8.8
Ensure you have permissions to manage indices and data streams
If you don’t already have an account on Elastic Cloud, sign up for one

Next, you’ll need to set up a data stream and create a custom Elasticsearch ingest pipeline that is called as the default pipeline. Below we go through this step by step for the “mydata” data set that we’ll simulate ingesting container logs into. We start with a basic example and extend it from there.

The following steps should be run in the Elastic console, which is found at Management -> Dev tools -> Console. First, we need an ingest pipeline and a template for the data stream:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
      }
    }
  ]
}

This creates an ingest pipeline with an empty reroute processor. To make use of it, we need an index template:

PUT _index_template/logs-mydata
{
  "index_patterns": [
    "logs-mydata-*"
  ],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.default_pipeline": "logs-mydata"
    },
    "mappings": {
      "properties": {
        "container.name": {
          "type": "keyword"
        }
      }
    }
  }
}

The above template is applied to all data that is shipped to logs-mydata-*. We have mapped container.name as a keyword, as this is the field we will be using for routing later on. Now, we send a document to the data stream and it will be ingested into logs-mydata-default:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo"
  }
}

We can check that it was ingested with the command below, which will show 1 result.

GET logs-mydata-default/_search

Even with an empty reroute processor, this setup already allows us to route documents. As soon as the reroute processor is specified, it will look for the data_stream.dataset and data_stream.namespace fields by default and will send documents to the corresponding data stream, according to the data stream naming scheme logs-<dataset>-<namespace>. Let’s try this out:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-03-30T12:27:23+00:00",
  "container": {
    "name": "foo"
  },
  "data_stream": {
    "dataset": "myotherdata"
  }
}
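
The default behavior described above can be modeled with a short Python sketch. This is a toy model under stated assumptions, not the processor's implementation: the helper names are hypothetical, the type and namespace are assumed to contain no hyphens, and no sanitization of values is attempted.

```python
def split_data_stream_name(name: str) -> tuple[str, str, str]:
    """Split '<type>-<dataset>-<namespace>' into its three parts.

    Assumes the type and namespace contain no hyphens.
    """
    type_, rest = name.split("-", 1)
    dataset, namespace = rest.rsplit("-", 1)
    return type_, dataset, namespace


def default_target(current_stream: str, doc: dict) -> str:
    """Toy model of where an empty reroute processor would send a document."""
    type_, dataset, namespace = split_data_stream_name(current_stream)
    overrides = doc.get("data_stream", {})
    # Fields in the document win; otherwise keep the parts of the current name.
    dataset = overrides.get("dataset", dataset)
    namespace = overrides.get("namespace", namespace)
    return f"{type_}-{dataset}-{namespace}"


doc = {"container": {"name": "foo"}, "data_stream": {"dataset": "myotherdata"}}
print(default_target("logs-mydata-default", doc))
```

A document without any data_stream fields keeps its original destination in this model, which matches the first document we indexed.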

Running GET logs-myotherdata-default/_search shows that this document ended up in the logs-myotherdata-default data stream. But instead of using default rules, we want to create our own rules for the field container.name. If container.name is foo, we want to send the document to logs-foo-default. For this we modify our routing pipeline:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
        "tag": "foo",
        "if" : "ctx.container?.name == 'foo'",
        "dataset": "foo"
      }
    }
  ]
}

Let's test this with a document:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo"
  }
}

While it would be possible to specify a routing rule for each container name, you can also route by the value of a field in the document:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
        "tag": "mydata",
        "dataset": [
          "{{container.name}}",
          "mydata"
        ]
      }
    }
  ]
}

In this example, we are using a field reference as a routing rule. If the container.name field exists in the document, the document is routed to a data stream named after its value; otherwise it falls back to mydata. This can be tested with:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo1"
  }
}

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo2"
  }
}

This creates the data streams logs-foo1-default and logs-foo2-default.
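
The fallback logic of the dataset list can be sketched in Python as follows. This is a simplified model with hypothetical helper names, not the processor's code: it tries each entry in order, resolves {{field}} references by walking nested objects, and returns the first entry that yields a value.

```python
import re
from typing import Optional

# Matches a whole-string field reference such as "{{container.name}}".
FIELD_REFERENCE = re.compile(r"^\{\{\s*(.+?)\s*\}\}$")


def lookup_nested(doc: dict, path: str):
    """Follow a dotted path through nested objects; None if any step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def resolve_dataset(doc: dict, candidates: list[str]) -> Optional[str]:
    """Return the first candidate that yields a value (toy model, no sanitization)."""
    for candidate in candidates:
        match = FIELD_REFERENCE.match(candidate)
        if match is None:
            return candidate  # static value such as "mydata"
        value = lookup_nested(doc, match.group(1))
        if value is not None:
            return str(value)
    return None


rules = ["{{container.name}}", "mydata"]
print(resolve_dataset({"container": {"name": "foo1"}}, rules))
print(resolve_dataset({"message": "no container field"}, rules))
```

Because the sketch only follows nested objects, a document that stores the value under a literal dotted key would fall through to the static fallback.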

NOTE: There is currently a limitation in the processor that requires the fields specified in a {{field.reference}} to be in a nested object notation. A dotted field name does not currently work. Also, you’ll get errors when the document contains dotted field names for any data_stream.* field. This limitation will be fixed in 8.8.2 and 8.9.0.

API keys

When using the reroute processor, it is important that the API keys specified have permissions for the source and target indices. For example, if a pattern is used for routing from logs-mydata-default, the API key must have write permissions for logs-*-* as data could end up in any of these indices (see example further down).
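
One way to reason about this requirement is to check every possible target name against the patterns an API key grants. The Python sketch below uses the standard library's fnmatch as a stand-in for index pattern matching; that is an approximation (fnmatch also gives special meaning to ? and [...]), and the helper name is hypothetical.

```python
from fnmatch import fnmatchcase


def key_can_write(target: str, granted_patterns: list[str]) -> bool:
    """True if any granted index pattern covers the target data stream name."""
    return any(fnmatchcase(target, pattern) for pattern in granted_patterns)


# Targets that field-based routing from logs-mydata-default could produce.
targets = ["logs-foo1-default", "logs-foo2-default", "logs-mydata-default"]

narrow_key = ["logs-mydata-*"]   # only covers the source data stream
broad_key = ["logs-*-*"]         # covers any rerouted logs data stream

for name in targets:
    print(name, key_can_write(name, narrow_key), key_can_write(name, broad_key))
```

With the narrow pattern, documents rerouted to logs-foo1-default or logs-foo2-default would not be covered, which is why the broader logs-*-* pattern is needed when routing by field value.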

We’re currently working on extending the API key permissions for our integrations so that they allow for routing by default if you’re running a Fleet-managed Elastic Agent.

If you’re using a standalone Elastic Agent, or any other shipper, you can use this as a template to create your API key:

POST /_security/api_key
{
  "name": "ingest_logs",
  "role_descriptors": {
    "ingest_logs": {
      "cluster": [
        "monitor"
      ],
      "indices": [
        {
          "names": [
            "logs-*-*"
          ],
          "privileges": [
            "auto_configure",
            "create_doc"
          ]
        }
      ]
    }
  }
}

Future plans

In Elasticsearch 8.8, the reroute processor was released in technical preview. The plan is to adopt this in our data sink integrations like syslog, k8s, and others. Elastic will provide default routing rules that just work out of the box, but it will also be possible for users to add their own rules. If you are using our integrations, follow this guide on how to add a custom ingest pipeline.

Try it out!

This blog post has shown some sample use cases for document based routing. Try it out on your data by adjusting the commands for index templates and ingest pipelines to your own data, and get started with Elastic Cloud through a 7-day free trial. Let us know via this feedback form how you’re planning to use the reroute processor and whether you have suggestions for improvement.

How Streams in Elastic Observability Simplifies Retention Management

Thu, 30 Oct 2025 00:00:00 GMT

Managing retention in Elasticsearch can get complicated fast. Between Data stream lifecycle (DSL), Index lifecycle management (ILM), templates, and individual index settings, keeping policies consistent across data streams often takes more effort than it should.

Streams changes that. It introduces a clear, unified way to manage how long your data lives, whether you’re using DSL or ILM. You can visualize ingestion, understand where data sits across tiers, and adjust retention with confidence, applying updates to a single stream without worrying about unintended changes elsewhere, all from a single view.

Walkthrough: Exploring the Retention Tab

Retention management lives in the Retention tab of each stream. This is your control panel for understanding how much data you’re storing, how quickly it’s growing, and how your lifecycle policies are applied. It’s also where you can monitor and configure the Failure store, which tracks and retains documents that failed to be ingested.

Metrics at a glance

At the top of the view, you’ll find an overview of key metrics:

Storage size: the total data volume currently held by the stream.
Ingestion averages: from the selected time range, Streams extrapolates both daily and monthly averages to give you a sense of long-term trends.

This combination of near-real-time and projected values helps you quickly spot when ingestion is ramping up and whether your retention policy aligns with it.
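To make the extrapolation concrete, here is a minimal sketch of how daily and monthly averages could be projected from the volume ingested in a selected time range. The function name and the 30-day month are assumptions for illustration; the exact calculation Streams performs may differ.

```python
SECONDS_PER_DAY = 86_400


def extrapolate_ingestion(bytes_in_range, range_seconds, days_per_month=30):
    """Project the volume seen in a time range to daily and monthly averages."""
    if range_seconds <= 0:
        raise ValueError("range_seconds must be positive")
    daily = bytes_in_range * SECONDS_PER_DAY / range_seconds
    monthly = daily * days_per_month
    return daily, monthly


# 6 GiB ingested during a 6-hour window.
daily, monthly = extrapolate_ingestion(6 * 1024**3, 6 * 3600)
print(f"daily: {daily / 1024**3:.1f} GiB, monthly: {monthly / 1024**3:.1f} GiB")
```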

Ingestion over time

Below the metrics, a graph shows ingestion volume over time. This information is approximated based on the number of documents over time, multiplied by the average document size in the backing index.
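The approximation described above amounts to a simple multiplication per time bucket. The following sketch shows the arithmetic with made-up numbers; the function and parameter names are hypothetical.

```python
def estimate_ingestion_series(doc_counts, index_size_bytes, index_doc_count):
    """Estimate bytes ingested per bucket as doc count times average doc size."""
    if index_doc_count == 0:
        return [0.0 for _ in doc_counts]
    average_doc_size = index_size_bytes / index_doc_count
    return [count * average_doc_size for count in doc_counts]


# Three time buckets, for a backing index holding 2,000 documents in 1,000,000 bytes.
print(estimate_ingestion_series([100, 250, 0], 1_000_000, 2_000))
```

Because the average is taken over the whole backing index, buckets containing unusually large or small documents will be over- or under-estimated.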

Visualizing lifecycle phases

When an ILM policy is effective, the retention view becomes more visual. Streams displays a phase breakdown (hot, warm, cold, frozen) showing the data volume stored in each phase. This gives you a clear sense of how your data is distributed across the storage tiers and whether your lifecycle is doing what you expect.

Failure store

A failure store is a secondary set of indices inside a data stream, dedicated to storing documents that failed to be ingested. Within the Retention tab, you can toggle the Failure store on or off, and configure its own retention period. We’ll cover Failure store and Data quality in more detail in this article.

Updating Retention

Beyond visualizing your retention, Streams makes it easy to change how it’s managed.

Switching between DSL and ILM

You can freely switch a stream between DSL and ILM management, or update a DSL retention with just a few clicks. Streams takes care of updating the lifecycle settings at the data stream level, ensuring consistent retention across all existing backing indices, not just new ones.

Whether you prefer the simplicity of DSL or the fine-grained tiering of ILM, you can move between the two seamlessly.

Clicking “Edit data retention” opens a modal that allows you to update the stream’s configuration. From there you can update the ILM policy or set a custom retention period via DSL.

You can set a custom period, or pick an Indefinite retention for your data.

You can also update a stream’s lifecycle via the Upsert stream or the Update ingest stream settings Kibana APIs.

Inherit or defer: different strategies for different stream types

Classic streams

For classic streams, you can default to the existing index template’s retention. Retention isn’t managed by Streams in this case; it follows the lifecycle configuration defined in the template just as it normally would.

This option is useful if you’re onboarding existing data streams and want to keep their lifecycle behavior intact while still benefiting from Streams’ visibility and monitoring features.

Wired streams

Wired streams live in a tree structure, and that hierarchy allows an inheritance model.

A child stream can inherit the lifecycle of its nearest ancestor that has a concrete policy (ILM or DSL). This keeps your configuration lean and consistent since you can set a single lifecycle at a higher level in the tree and let Streams automatically apply it to all relevant descendants.

If that ancestor’s lifecycle is later updated, Streams cascades the change down to all children that inherit it, so everything stays in sync.

In the figure below, we set a different retention for logs.prod and logs.staging environments. The child partitions of these environments automatically inherit the configuration.
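The inheritance rule can be sketched as a walk up the stream name. This is an illustration only, assuming child stream names extend their parent's name with a dot-separated segment. The stream logs.prod.payments and the retention values are hypothetical, and the real resolution happens inside Streams.

```python
def effective_lifecycle(stream, lifecycles):
    """Return (owner, lifecycle) from the nearest ancestor with a concrete policy.

    `lifecycles` maps a stream name to its own lifecycle; streams that
    inherit are simply absent from the mapping (or mapped to None).
    """
    name = stream
    while True:
        policy = lifecycles.get(name)
        if policy is not None:
            return name, policy
        if "." not in name:
            return None, None  # reached the root without finding a policy
        name = name.rsplit(".", 1)[0]  # move up to the parent stream


lifecycles = {
    "logs": {"dsl": {"data_retention": "30d"}},
    "logs.prod": {"dsl": {"data_retention": "90d"}},
    "logs.staging": {"dsl": {"data_retention": "7d"}},
}

print(effective_lifecycle("logs.prod.payments", lifecycles))
print(effective_lifecycle("logs.staging", lifecycles))
```

Updating the entry for logs.prod changes what every inheriting descendant resolves to, which mirrors the cascading behavior described above.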

How it works under the hood

When you apply or update a lifecycle, Streams calls Elasticsearch’s /_data_stream/_settings. This is a new API we’ve added in 8.19 / 9.1 for this purpose.

This API is key to keeping retention consistent:

It applies the lifecycle directly at the data stream level, overriding any configuration from cluster settings or index templates.
It propagates the retention update to all existing backing indices, not just new ones, so retention remains uniform across your historical and future data.

By centralizing lifecycle management at the data stream level and applying a consistent configuration across the backing indices, we remove the ambiguity that used to exist between template-level and index-level configurations. You always know which retention policy is actually in effect, and you can see it directly in the UI.

Wrapping Up

With Streams, retention management becomes clear and consistent. You can visualize ingestion, switch between DSL and ILM, or inherit policies across streams, all without diving into templates or manual index settings.

By unifying retention into a single view, Streams turns lifecycle management into something simple, predictable, and transparent.

Sign up for an Elastic trial at cloud.elastic.co and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.

Additionally, check out:

Read about Reimagining streams

Look at the Streams website

Read the Streams documentation

Smarter log analytics in Elastic Observability

Mon, 10 Jun 2024 00:00:00 GMT

Discover a smarter way to handle your logs with Kibana's latest features! Our new Data Source selector makes it effortless to zero in on the logs you need, whether they're from System Logs or Application Logs, by selecting your integrations or data views. Plus, with the introduction of Smart Fields, your log analysis is now more intuitive and insightful. Get ready to simplify your workflow and uncover deeper insights with these game-changing updates. Dive in and see how easy log exploration can be!

Find the logs you’re looking for

Focus on logs from specific integrations or data views

We've added the Data Source selector, a handy new feature for viewing specific logs. Now, you can easily filter your logs based on your integrations, like System Logs, Nginx, or Elastic APM, or switch between different data views, like logs or metrics. This new selector is all about making your data easier to find and helping you focus on what matters most in your analysis.

Dive into your logs

Analyze logs with Smart Fields in Kibana

Logs in Kibana have undergone a significant transformation, particularly in the way log data is presented. The once-basic table view has evolved with the introduction of Smart Fields, providing users with a more insightful and dynamic log analysis experience.

Resource Smart Field - centralizing log source information

The resource column further elevates the Logs Explorer page by providing users with a single column for exploring the resource that created the log event. This column groups various resource-indicating fields together, streamlining the investigation process. Currently, the following ECS fields are grouped under this single column and we recommend including them in your logs:

We know this does not cover all use cases, and we would like your feedback on other fields that you use or that are important to you, to help us provide a tailored and user-centric log analysis experience.

Content Smart Field - a deeper dive into log data

The content column revolutionizes log analysis by seamlessly rendering log.level and message fields. Notably, it automatically handles fallbacks, ensuring a smooth transition when the actual message field is not available. This enhancement simplifies the log exploration process, offering users a more comprehensive understanding of their data.
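The fallback idea can be sketched as an ordered list of candidate fields. The specific fallback fields used here (error.message and event.original) and the bracketed output format are assumptions for illustration; the text above only states that a fallback happens when message is unavailable.

```python
# Assumption: candidate fields in priority order; only "message" comes from the text above.
CONTENT_FIELDS = ("message", "error.message", "event.original")


def get_field(doc, dotted_path):
    """Read a field stored either as a flat dotted key or as nested objects."""
    if dotted_path in doc:
        return doc[dotted_path]
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def render_content(doc):
    """Combine log.level with the first available content field."""
    level = get_field(doc, "log.level")
    text = ""
    for field in CONTENT_FIELDS:
        value = get_field(doc, field)
        if isinstance(value, str) and value:
            text = value
            break
    return f"[{level}] {text}" if level else text


print(render_content({"log": {"level": "error"}, "message": "disk full"}))
print(render_content({"log.level": "warn", "error": {"message": "upstream timeout"}}))
```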

Actions column - unleashing additional columns

As part of our commitment to empowering users, we are introducing the actions column, adding a layer of functionality to the document table. This column includes two powerful actions:

Degraded document indicator: This indicator provides insights about the quality of your data by indicating that fields were ignored when the document was indexed and ended up in the _ignored property of the document. To help analyze what caused the document to degrade, we suggest reading this blog - The antidote for index mapping exceptions: ignore_malformed.
Stacktrace indicator: This indicator informs users of the presence of stack traces in the document. This makes it easy to navigate through log documents and know if they have additional information.
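If you want to reproduce these two checks outside of Kibana, a rough sketch against a raw search hit might look like the following. The _ignored metadata field is described above. The choice of error.stack_trace as the field that signals a stack trace is an assumption, as is the row_indicators helper name.

```python
# Assumption: fields whose presence signals a stack trace in the document.
STACK_TRACE_FIELDS = ("error.stack_trace",)


def get_nested(doc, dotted_path):
    """Walk a nested dict along a dotted path; return None if any part is missing."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def row_indicators(hit):
    """Derive the degraded and stacktrace flags for one search hit."""
    source = hit.get("_source", {})
    return {
        "degraded": bool(hit.get("_ignored")),
        "stacktrace": any(get_nested(source, field) for field in STACK_TRACE_FIELDS),
    }


hit = {
    "_source": {"message": "request failed", "error": {"stack_trace": "Traceback ..."}},
    "_ignored": ["http.request.body.content"],
}
print(row_indicators(hit))
```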

Investigate individual logs by expanding log details

Now, when you click the expand icon in the actions column, it opens up the Log details flyout for any log entry. This new feature gives you a detailed overview of the entry right at your fingertips. Inside the flyout, the Overview tab is neatly organized into four sections—Content breakdown, Service & Infrastructure, Cloud, and Others—each offering a snapshot of the most crucial information. Plus, you'll find the same handy controls you're used to in the main table, like filtering in or out, adding or removing columns, and copying data, making it easier than ever to manage your logs directly from the flyout.

The Observability AI Assistant is fully integrated into this view providing contextual insights about the log event and helping to find similar messages.

Experience a streamlined approach to log exploration

These enhancements simplify the process of finding and focusing on specific logs and offer more intuitive and insightful data presentation. Dive into your logs with these tools and streamline your workflow, uncovering deeper insights with ease. Try it now and transform your log analysis!

Easily analyze AWS VPC Flow Logs with Elastic Observability

Mon, 23 Jan 2023 00:00:00 GMT

Elastic Observability provides a full-stack observability solution, by supporting metrics, traces, and logs for applications and infrastructure. In a previous blog, I showed you an AWS monitoring infrastructure running a three-tier application. Specifically we reviewed metrics ingest and analysis on Elastic Observability for EC2, VPC, ELB, and RDS. In this blog, we will cover how to ingest logs from AWS, and more specifically, we will review how to get VPC Flow Logs into Elastic and what you can do with this data.

Logging is an important part of observability, even though we generally think first of metrics and tracing. However, the volume of logs that an application or its underlying infrastructure outputs can be daunting.

With Elastic Observability, there are three main mechanisms to ingest logs:

The new Elastic Agent pulls metrics and logs from CloudWatch and S3, where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53, etc.). We reviewed Elastic Agent metrics configuration for EC2, RDS (Aurora), ELB, and NAT metrics in this blog.
Using Elastic’s Serverless Forwarder (runs on Lambda and available in AWS SAR) to send logs from Firehose, S3, CloudWatch, and other AWS services into Elastic.
Beta feature (contact your Elastic account team): Using AWS Firehose to directly insert logs from AWS into Elastic — specifically if you are running the Elastic stack on AWS infrastructure.

In this blog we will provide an overview of the second option, Elastic’s serverless forwarder collecting VPC Flow Logs from an application deployed on EC2 instances. Here’s what we'll cover:

A walk-through on how to analyze VPC Flow Log info with Elastic’s Discover, dashboard, and ML analysis.
A detailed step-by-step overview and setup of the Elastic serverless forwarder on AWS as a pipeline for VPC Flow Logs into Elastic Cloud.

Elastic’s serverless forwarder on AWS Lambda

AWS users can quickly ingest logs stored in Amazon S3, CloudWatch, or Kinesis with the Elastic serverless forwarder, an AWS Lambda application, and view them in the Elastic Stack alongside other logs and metrics for centralized analytics. Once the Elastic serverless forwarder is configured and deployed from the AWS Serverless Application Repository (SAR), logs will be ingested and available in Elastic for analysis. See the following links for further configuration guidance:

In our configuration we will ingest VPC Flow Logs into Elastic for the three-tier app deployed in the previous blog.

The Elastic serverless forwarder supports three different configurations. Logs can be directly ingested from:

Amazon CloudWatch: Elastic serverless forwarder can pull VPC Flow Logs directly from an Amazon CloudWatch log group, which is a commonly used endpoint to store VPC Flow Logs in AWS.
Amazon Kinesis: Elastic serverless forwarder can pull VPC Flow Logs directly from Kinesis, which is another location to publish VPC Flow Logs.
Amazon S3: Elastic serverless forwarder can pull VPC Flow Logs from Amazon S3 via SQS event notifications, which is a common endpoint to publish VPC Flow Logs in AWS.

In the second half of this blog, we will review a common configuration: sending VPC Flow Logs to Amazon S3 and from there into Elastic Cloud.

But first let's review how to analyze VPC Flow Logs on Elastic.

Analyzing VPC Flow Logs in Elastic

Now that you have VPC Flow Logs in Elastic Cloud, how can you analyze them?

There are several analyses you can perform on the VPC Flow Log data:

Use Elastic’s Analytics Discover capabilities to manually analyze the data.
Use Elastic Observability’s anomaly feature to identify anomalies in the logs.
Use an out-of-the-box (OOTB) dashboard to further analyze data.

Using Elastic Discover

In Elastic analytics, you can search and filter your data, get information about the structure of the fields, and display your findings in a visualization. You can also customize and save your searches and place them on a dashboard. With Discover, you can:

View logs in bulk, within specific time frames
Look at individual details of each entry (document)
Filter for specific values
Analyze fields
Create and save searches
Build visualizations

For a complete understanding of Discover and all of Elastic’s analytics capabilities, look at Elastic documentation.

For VPC Flow Logs, the important things to understand are:

How many logs were accepted/rejected
Where potential security violations occur (for example, source IPs from outside the VPC)
What port is generally being queried

I’ve filtered the logs on the following:

Amazon S3: bshettisartest
VPC Flow Log action: REJECT
VPC Network Interface: Webserver 1

We want to see what IP addresses are trying to hit our web servers.

From that, we want to understand which IP addresses we are getting the most REJECTs from, so we simply find the source.ip field. Then, we can quickly get a breakdown that shows 185.242.53.156 is the most rejected source IP over the 3+ hours since we turned on VPC Flow Logs.

Additionally, I can see a visualization by selecting the “Visualize” button. We get the following, which we can add to a dashboard:

In addition to IP addresses, we want to also see what port is being hit on our web servers.
We select the destination port field, and the quick pop-up shows us a list of ports being targeted. We can see that port 23 is being targeted (this port is generally used for telnet), as are port 445 (used for Microsoft Active Directory) and port 443 (used for HTTPS). We also see these are all REJECT.
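The same breakdown can be reproduced on raw flow log lines as a sanity check. The sketch below parses records in the AWS default (version 2) format and counts REJECT actions by source address and destination port. The sample records are made up for illustration, using documentation IP ranges, and the helper names are hypothetical.

```python
from collections import Counter

# Field order of the AWS default (version 2) VPC Flow Log record.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_flow_log(line):
    """Split one space-separated record into a dict; return None if malformed."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def top_rejects(lines, n=3):
    """Count REJECT records by source address and by destination port."""
    by_source, by_port = Counter(), Counter()
    for line in lines:
        record = parse_flow_log(line)
        if record is None or record["action"] != "REJECT":
            continue
        by_source[record["srcaddr"]] += 1
        by_port[record["dstport"]] += 1
    return by_source.most_common(n), by_port.most_common(n)


# Made-up sample records, not taken from the deployment in this blog.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 203.0.113.10 10.0.1.5 51000 23 6 1 40 1670000000 1670000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.10 10.0.1.5 51001 445 6 1 40 1670000000 1670000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 198.51.100.7 10.0.1.5 40000 23 6 1 40 1670000000 1670000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.2.9 10.0.1.5 44000 443 6 10 5200 1670000000 1670000060 ACCEPT OK",
]

sources, ports = top_rejects(SAMPLE)
print(sources)
print(ports)
```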

Anomaly detection in Elastic Observability logs

In addition to Discover, Elastic Observability provides the ability to detect anomalies in logs. In Elastic Observability -> logs -> anomalies you can turn on machine learning for:

Log rate: automatically detects anomalous log entry rates
Categorization: automatically categorizes log messages

For our VPC Flow Log, we turned both on. And when we look at what has been detected for anomalous log entry rates, we see:

Elastic immediately detected a spike in logs when we turned on VPC Flow Logs for our application. The rate change was detected because we had already been ingesting VPC Flow Logs from another application for a couple of days before adding the application in this blog.
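To build intuition for why an existing baseline matters, here is a deliberately simple rolling z-score check over per-interval log counts. It is a toy illustration, not the algorithm Elastic machine learning uses; the window size, threshold, and sample counts are all assumptions.

```python
from statistics import mean, pstdev


def rate_spikes(counts, baseline=5, threshold=3.0):
    """Return indexes whose count is far above the preceding baseline window."""
    spikes = []
    for i in range(baseline, len(counts)):
        window = counts[i - baseline:i]
        mu = mean(window)
        sigma = max(pstdev(window), 1.0)  # floor avoids division by ~0 on flat data
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Steady traffic from one application, then a second application is added.
counts = [100, 102, 98, 101, 99, 100, 400, 410]
print(rate_spikes(counts))
```

Only the first interval after the jump is flagged, because the jump itself then becomes part of the baseline. Without the earlier steady counts there would be nothing to compare against.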

We can further drill down into this anomaly with machine learning and analyze further.

There is more machine learning analysis you can utilize with your logs — check out Elastic machine learning documentation.

Since we know that a spike exists, we can also use Elastic AIOps Labs Explain Log Rate Spikes capability in Machine Learning. Additionally, we’ve grouped them to see what is causing some of the spikes.

As we can see, a specific network interface is sending more VPC Flow Logs than others. We can drill down into this further in Discover.

VPC Flow Log dashboard on Elastic Observability

Finally, Elastic also provides an OOTB dashboard showing the top IP addresses hitting your VPC, where they are coming from geographically, the time series of the flows, and a summary of VPC Flow Log rejects within the time frame.

This is a baseline dashboard that can be enhanced with visualizations you find in Discover, as we reviewed in option 1 (Using Elastic’s Analytics Discover capabilities) above.

Setting it all up

Let’s walk through the details of configuring the Elastic serverless forwarder and Elastic Observability to ingest data.

Prerequisites and config

If you plan on following these steps, here are some of the components and details we used to set up this demonstration:

Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.
Ensure you have an AWS account with permissions to pull the necessary data from AWS. Specifically, ensure you can configure the agent to pull data from AWS as needed. Please look at the documentation for details.
We used AWS’s three-tier app and installed it as instructed in GitHub. (See blog on ingesting metrics from the AWS services supporting this app.)
Configure and install Elastic’s Serverless Forwarder.
Ensure you turn on VPC Flow Logs for the VPC where the application is deployed and send logs to Amazon S3.

Step 0: Get an account on Elastic Cloud

Follow the instructions to get started on Elastic Cloud.

Step 1: Deploy Elastic on AWS

Once logged in to Elastic Cloud, create a deployment on AWS. It’s important to ensure that the deployment is on AWS. The Elastic serverless forwarder connects specifically to an endpoint that needs to be on AWS.

Once your deployment is created, make sure you copy the Elasticsearch endpoint.

The endpoint should be an AWS endpoint, such as:

https://aws-logs.es.us-east-1.aws.found.io

Step 2: Turn on Elastic’s AWS Integrations on AWS

In your deployment’s Elastic Integration section, go to the AWS integration and select Install AWS assets.

Step 3: Deploy your application

Follow the instructions listed out in AWS’s Three-Tier app and instructions in the workshop link on GitHub. The workshop is listed here.

Once you’ve installed the app, get credentials from AWS. These will be needed for Elastic’s AWS integration.

There are several options for credentials:

Use access keys directly
Use temporary security credentials
Use a shared credentials file
Use an IAM role Amazon Resource Name (ARN)

View more details on specifics around necessary credentials and permissions.

Step 4: Send VPC Flow Logs to Amazon S3 and set up Amazon SQS

In the VPC for the application deployed in Step 3, you will need to configure VPC Flow Logs and point them to an Amazon S3 bucket. Specifically, you will want to keep the log record format as the AWS default format.

Create the VPC Flow log.

Step 5: Set up Elastic Serverless Forwarder on AWS

Follow instructions listed in Elastic’s documentation and refer to the previous blog providing an overview. The important bits during the configuration in Lambda’s application repository are to ensure you:

Specify the S3 Bucket in ElasticServerlessForwarderS3Buckets where the VPC Flow Logs are being sent. The value is the ARN of the S3 Bucket you created in Step 4.
Specify the configuration file path in ElasticServerlessForwarderS3ConfigFile. The value is the S3 url in the format "s3://bucket-name/config-file-name" pointing to the configuration file (sarconfig.yaml).
Specify the S3 SQS Notifications queue used as the trigger of the Lambda function in ElasticServerlessForwarderS3SQSEvents. The value is the ARN of the SQS Queue you set up in Step 4.
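Because these three values use different formats (a bucket ARN, an s3:// URL, and a queue ARN), it is easy to mix them up. The sketch below is a hypothetical pre-deployment check. The regular expressions are simplified approximations of the AWS formats, it assumes a single value per parameter, and the bucket, account, and queue names are made up.

```python
import re

# Simplified, approximate patterns (assumptions, not the official AWS grammar).
S3_ARN = re.compile(r"^arn:aws[\w-]*:s3:::[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9]$")
S3_URL = re.compile(r"^s3://[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9]/.+$")
SQS_ARN = re.compile(r"^arn:aws[\w-]*:sqs:[a-z0-9-]+:\d{12}:[\w-]+(\.fifo)?$")

CHECKS = {
    "ElasticServerlessForwarderS3Buckets": S3_ARN,
    "ElasticServerlessForwarderS3ConfigFile": S3_URL,
    "ElasticServerlessForwarderS3SQSEvents": SQS_ARN,
}


def check_forwarder_parameters(params):
    """Return the names of parameters that are missing or look malformed."""
    return [
        name
        for name, pattern in CHECKS.items()
        if not pattern.match(params.get(name, ""))
    ]


params = {
    "ElasticServerlessForwarderS3Buckets": "arn:aws:s3:::my-vpc-flow-logs",
    "ElasticServerlessForwarderS3ConfigFile": "s3://my-vpc-flow-logs/sarconfig.yaml",
    "ElasticServerlessForwarderS3SQSEvents": "arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue",
}
print(check_forwarder_parameters(params))
```

An empty list means every value at least has the expected shape; it does not confirm that the resources exist or that permissions are correct.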

Once AWS CloudFormation finishes setting up the Elastic serverless forwarder, you should see two AWS Lambda functions:

To check whether logs are coming in, go to the function with “ApplicationElasticServer” in the name, open the monitoring view, and look at the logs. You should see the logs being pulled from S3.

Step 6: Check and ensure you have logs in Elastic

Now that steps 1–5 are complete, you can go to Elastic’s Discover capability and you should see VPC Flow Logs coming in. In the image below, we’ve filtered by Amazon S3 bucket bshettisartest.

Conclusion: Elastic Observability easily integrates with VPC Flow Logs for analytics, alerting, and insights

I hope you’ve gotten an appreciation for how Elastic Observability can help you manage AWS VPC Flow Logs. Here’s a quick recap of what you learned:

A walk-through of how Elastic Observability provides enhanced analysis for VPC Flow Logs:
- Using Elastic’s Analytics Discover capabilities to manually analyze the data
- Leveraging Elastic Observability’s anomaly features to:
  - Identify anomalies in the VPC flow logs
  - Detect anomalous log entry rates
  - Automatically categorize log messages
- Using an OOTB dashboard to further analyze data
A more detailed walk-through of how to set up the Elastic Serverless Forwarder

Start your own 7-day free trial by signing up via AWS Marketplace and quickly spin up a deployment in minutes on any of the Elastic Cloud regions on AWS around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.